Chapter 2: Boot and Hardware Discovery
Boot chain, device discovery, ACPI/DT, multi-architecture support, hardware memory safety
2.1 Boot and Installation
2.1.1 Overview
UmkaOS uses a phased boot architecture. The current implementation boots via the
Multiboot1 protocol through GRUB or QEMU's -kernel flag — sufficient for
development, testing, and early hardware bring-up. The production target is
UEFI stub boot with Linux boot protocol compatibility, enabling drop-in
package installation alongside existing Linux kernels.
The boot code lives in umka-kernel/src/boot/ (assembly entry, Multiboot parser)
and umka-kernel/src/arch/*/boot.rs (per-architecture boot routines). The
initialization sequence is in umka-kernel/src/main.rs.
2.1.2 Current Implementation: Multiboot Boot
2.1.2.1 Boot Protocols
The kernel ELF contains dual Multiboot headers — both Multiboot1 and Multiboot2 are present in the binary, allowing either protocol at the bootloader's choice:
- Multiboot1 (magic 0x1BADB002): Fully implemented. Used by QEMU (-kernel flag) and GRUB (multiboot command). The parser in boot/multiboot1.rs extracts the memory map, command line, and bootloader name.
- Multiboot2 (magic 0xE85250D6): Header present in the ELF but no parser implemented. The magic is recognized in umka_main() but the info structure is not parsed. Planned for Phase 2.
The linker script (linker-x86_64.ld) places headers in dedicated sections:
.multiboot1 (4-byte aligned, first 8 KB) and .multiboot2 (8-byte aligned,
first 32 KB), ensuring bootloaders find them. The kernel loads at physical address
0x100000 (1 MB), the standard Multiboot load address.
Build and boot methods:
# Development: QEMU with -kernel (Multiboot1, no ISO needed)
qemu-system-x86_64 -kernel target/x86_64-unknown-none/release/umka-kernel -serial stdio
# Testing: GRUB ISO boot (Multiboot1 via grub.cfg `multiboot` command)
make iso && qemu-system-x86_64 -cdrom target/umka-kernel.iso -serial stdio
Non-x86 architectures use different boot protocols:
- Device Tree Blob (DTB): Used by AArch64, ARMv7, RISC-V 64, PPC32, and PPC64LE. The firmware or QEMU passes a pointer to a flattened device tree (FDT) in a register at entry (x0 on AArch64, r2 on ARMv7, a1 on RISC-V, r3 on PPC32 and PPC64LE). The DTB describes the machine's physical memory layout, interrupt controllers, timers, and peripheral addresses. The format is big-endian with magic 0xD00DFEED. See Section 2.1.2.9 for the parsing specification.
- OpenSBI (RISC-V only): The Supervisor Binary Interface firmware runs in M-mode and provides SBI ecalls for timer, IPI, console, and system reset services to S-mode code. QEMU's built-in OpenSBI occupies physical addresses 0x80000000–0x801FFFFF. At entry, OpenSBI passes a0 = hart_id (hardware thread identifier) and a1 = DTB address. The kernel must not overwrite the OpenSBI region.
- OpenFirmware / SLOF (PPC64LE): On POWER systems, SLOF (Slimline Open Firmware) or OPAL (OpenPOWER Abstraction Layer) firmware initializes hardware and passes a DTB pointer in r3. QEMU's pseries machine uses SLOF; bare-metal POWER8/9/10 uses OPAL (skiboot). At entry: r3 = DTB address, r4 = 0 (reserved). The kernel runs in hypervisor or supervisor mode.
- U-Boot / OpenFirmware (PPC32): Embedded PowerPC boards typically use U-Boot, which passes a DTB pointer in r3. QEMU's ppce500 machine uses U-Boot or direct kernel boot. At entry: r3 = DTB address, r4 = kernel_start, r5 = 0 (reserved).
2.1.2.2 x86-64 Entry Sequence
The boot assembly (boot/entry.asm, NASM syntax) handles the transition from
32-bit protected mode to 64-bit long mode:
1. GRUB/QEMU loads ELF at 1 MB, jumps to _start in 32-bit protected mode
- eax = Multiboot1 magic (0x2BADB002)
- ebx = pointer to Multiboot info structure
2. _start (32-bit):
a. Save magic: eax → esi (preserved across BSS clear and CPUID check)
b. Set temporary stack at 0x80000 (below kernel)
c. Clear BSS (rep stosd from __bss_start to __bss_end — clobbers edi, ecx, eax)
d. Build identity-map page tables for first 1 GB:
PML4[0] → boot_pdpt | PRESENT | WRITABLE
PDPT[0] → boot_pd | PRESENT | WRITABLE
PD[0..511] → 512 × 2 MB pages (flags: PRESENT | WRITABLE | PAGE_SIZE)
e. Save info ptr: ebx → ebp (preserve across CPUID — ebx is clobbered by CPUID)
f. Verify long mode: CPUID leaf 0x80000001 bit 29
(displays "NO64" on VGA buffer and halts if not available)
g. Restore info ptr: ebp → ebx
h. Enable PAE (CR4 bit 5)
i. Enable Long Mode (IA32_EFER MSR bit 8)
j. Enable Paging (CR0 bit 31)
k. Load temporary 64-bit GDT (null + code + data descriptors)
l. Far jump to _start64 (selector 0x08 = 64-bit code segment)
3. _start64 (64-bit):
a. Load 64-bit data segments (selector 0x10)
b. Set kernel stack (boot_stack_top, 16 KB in .bss)
c. Clear RFLAGS
d. Map to 64-bit calling convention: edi = esi (magic), esi = ebx (info ptr)
e. Call umka_main(multiboot_magic=rdi, multiboot_info_ptr=rsi)
Page tables and boot stack are allocated in .bss (zeroed by step 2c):
boot_pml4 (4 KB), boot_pdpt (4 KB), boot_pd (4 KB), boot stack (16 KB).
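The magic-value dispatch at the top of umka_main() can be sketched in pure Rust. The constants are the architectural bootloader magics from the Multiboot specifications; the enum and function names here are illustrative, not the kernel's actual API:

```rust
/// Values the bootloader leaves in eax at entry (distinct from the header
/// magics embedded in the ELF). Per the Multiboot1 and Multiboot2 specs.
const MULTIBOOT1_BOOTLOADER_MAGIC: u32 = 0x2BAD_B002;
const MULTIBOOT2_BOOTLOADER_MAGIC: u32 = 0x36D7_6289;

#[derive(Debug, PartialEq)]
enum BootProtocol {
    Multiboot1,
    Multiboot2, // recognized but info structure not parsed yet (Phase 2)
    Unknown(u32),
}

/// Illustrative sketch of the protocol detection performed in umka_main().
fn detect_boot_protocol(magic: u32) -> BootProtocol {
    match magic {
        MULTIBOOT1_BOOTLOADER_MAGIC => BootProtocol::Multiboot1,
        MULTIBOOT2_BOOTLOADER_MAGIC => BootProtocol::Multiboot2,
        other => BootProtocol::Unknown(other),
    }
}
```

On the Unknown path the kernel can still boot with a conservative default memory map, mirroring the degraded-operation policy used elsewhere in this chapter.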
2.1.2.3 Kernel Initialization Phases (x86-64)
umka_main() detects the boot protocol from the magic value, then runs an
ordered initialization sequence. Each phase depends on the previous:
Phase 1: GDT + TSS
Load a proper GDT with TSS. Configure IST1 with a dedicated
16 KB stack for double-fault handling.
Phase 2: IDT + PIC
Install exception handlers (0-31) and IRQ handlers (32-47).
Remap the 8259 PIC: IRQ0 → vector 32, IRQ8 → vector 40.
Phase 3: Physical Memory Manager
Parse Multiboot1 memory map (see Section 2.1.2.11).
Initialize bitmap allocator: mark available regions free,
reserve first 1 MB (BIOS/legacy) and kernel image.
Phase 4: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB total).
Initialize free-list allocator → enables alloc::Vec, Box, String.
This initial heap size is a bootstrap minimum; the allocator expands
dynamically once memory discovery completes (Section 4.1).
Phase 5: Virtual Memory
Verify identity mapping (virt_to_phys on mapped addresses).
Test new page mappings: allocate frame, map at 0x40000000,
write/read volatile, unmap, free frame.
Phase 6: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 7: IPC / MPK Detection
Query CPUID for PKU support. Test domain alloc/free.
Phase 8: Enable Interrupts
Enable IRQs, verify timer ticks are incrementing.
Phase 9: Scheduler
Initialize round-robin scheduler. Spawn two test threads
(thread-A, thread-B). Run cooperative yield loop, then
enable preemptive scheduling via timer tick callback.
Phase 10: SYSCALL/SYSRET
Configure STAR/LSTAR/SFMASK MSRs. Register three syscall
handlers: write(1), getpid(39), exit_group(231).
Test with inline SYSCALL instruction from kernel mode.
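The STAR packing in Phase 10 is easy to get wrong, so here is a hedged sketch of the value computation. The selector arguments are assumptions that must match the Phase 1 GDT layout, and star_value is an illustrative name:

```rust
/// IA32_STAR layout: bits 47:32 = kernel CS selector loaded by SYSCALL
/// (SS is loaded as CS + 8); bits 63:48 = base selector used by SYSRET,
/// which loads CS = base + 16 and SS = base + 8 when returning to 64-bit
/// user mode. LSTAR holds the 64-bit handler entry point; SFMASK selects
/// which RFLAGS bits (typically IF) are cleared on entry.
fn star_value(kernel_cs: u16, sysret_base: u16) -> u64 {
    ((sysret_base as u64) << 48) | ((kernel_cs as u64) << 32)
}
```

With a kernel CS of 0x08 and a SYSRET base of 0x10 (assumed values), star_value(0x08, 0x10) yields 0x0010_0008_0000_0000.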
2.1.2.4 Secondary CPU Bringup (x86-64 SMP)
After Phase 10 completes on the BSP (Boot Strap Processor), secondary CPUs (Application Processors, APs) are brought online.
AP Stack Allocation:
The BSP allocates each AP's initial kernel stack from the boot allocator before sending the INIT-SIPI wakeup. This ensures the stack is ready before the AP needs it.
- Stack size: 16 KB per AP (same as the BSP initial stack). Allocated from the per-NUMA-node boot allocator, preferring memory local to the AP's NUMA node.
- SP communication: The BSP stores the stack top address in a per-CPU startup mailbox, defined as:
/// Per-CPU startup data written by BSP before AP wakeup, read by AP during
/// very early boot (before the AP has its own stack pointer set up).
/// Must reside in a physically-mapped region accessible without paging (or
/// with the identity-mapped early page tables already in place).
#[repr(C, align(64))]
pub struct ApStartupMailbox {
/// Initial kernel stack top (SP value to load). Written by BSP before
/// sending the wakeup IPI; read by the AP entry stub in assembly.
pub stack_top: u64,
/// Physical address of the AP's per-CPU data area.
pub percpu_base: u64,
/// APIC ID — AP verifies this matches its own LAPIC_ID before proceeding.
pub cpu_id: u32,
/// BSP sets to MAILBOX_READY (0xAB1E1234) when all fields above are valid.
/// AP spins on this field (with a short architectural pause) until ready.
pub status: AtomicU32,
pub _pad: [u8; 32],
}
pub const MAILBOX_READY: u32 = 0xAB1E_1234;
/// Array of mailboxes, one per possible CPU slot. Allocated from the boot
/// allocator during Phase 11 once the CPU count is known.
pub static AP_STARTUP_MAILBOXES: Once<&'static mut [ApStartupMailbox]> = Once::new();
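The mailbox handshake hinges on publish ordering: the BSP must make the data fields visible before the status flag flips. A minimal sketch, restating the struct from this section with its doc comments elided; publish_mailbox is an illustrative helper, not the kernel's actual API:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

pub const MAILBOX_READY: u32 = 0xAB1E_1234;

#[repr(C, align(64))]
pub struct ApStartupMailbox {
    pub stack_top: u64,
    pub percpu_base: u64,
    pub cpu_id: u32,
    pub status: AtomicU32,
    pub _pad: [u8; 32],
}

/// BSP side: write the data fields first, then publish with a Release
/// store so the AP's Acquire load of `status` guarantees the fields are
/// visible before it dereferences them.
pub fn publish_mailbox(mb: &mut ApStartupMailbox, stack_top: u64, percpu_base: u64, cpu_id: u32) {
    mb.stack_top = stack_top;
    mb.percpu_base = percpu_base;
    mb.cpu_id = cpu_id;
    mb.status.store(MAILBOX_READY, Ordering::Release);
}
```

The matching AP side spins on `status.load(Ordering::Acquire)` until it reads MAILBOX_READY, inserting a pause hint in the loop body.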
- AP entry stub: The AP's 16-bit → 64-bit trampoline (in assembly) reads stack_top from its mailbox slot, using the LAPIC ID as the array index, loads SP, then jumps to Rust ap_entry().
- Stack allocation failure: If the boot allocator returns OOM for an AP's stack, the BSP marks that CPU permanently offline in the topology, does NOT send the wakeup IPI, and logs "CPU {lapic_id}: stack allocation failed, CPU disabled". Boot continues with the remaining CPUs.
Fan-out tree bringup:
A sequential per-CPU timeout of 1 second × N CPUs does not scale: 128 CPUs would require up to 127 seconds in the worst case. UmkaOS uses a binary fan-out tree to bound bringup time to O(log₂ N) phases regardless of CPU count.
The tree assignment is defined by index (not by LAPIC ID):
CPU i (tree index) wakes CPUs 2i+1 and 2i+2 (if they exist in the topology).
Phase 0: BSP (index 0) wakes index 1 and index 2
Phase 1: index 1 wakes 3, 4; index 2 wakes 5, 6
Phase 2: each of 3–6 wakes two children
...
Phase k = ⌈log₂(N)⌉ − 1: leaf CPUs (no children)
For 128 CPUs: 7 phases × ~50 ms per phase ≈ 350 ms worst case, versus up to 127 seconds with sequential 1-second timeouts.
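The tree arithmetic above reduces to two small helpers; the function names are illustrative, and fanout_phases computes the ceil(log2 N) bound used in the 128-CPU example:

```rust
/// Children of tree index `i` in the binary fan-out, clamped to the
/// number of CPUs actually present in the topology.
fn fanout_children(i: usize, num_cpus: usize) -> (Option<usize>, Option<usize>) {
    let child = |c: usize| if c < num_cpus { Some(c) } else { None };
    (child(2 * i + 1), child(2 * i + 2))
}

/// Upper bound on wakeup phases: ceil(log2(num_cpus)), 0 for a single CPU.
/// (n - 1).ilog2() + 1 == ceil(log2(n)) for n > 1.
fn fanout_phases(num_cpus: usize) -> u32 {
    if num_cpus <= 1 { 0 } else { (num_cpus - 1).ilog2() + 1 }
}
```

fanout_phases(128) returns 7, matching the seven-phase worst case quoted above.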
The BSP sets up a shared SmpBringupState structure before waking the first AP:
/// Shared state for coordinating fan-out tree AP bringup.
/// Placed in physically-mapped memory accessible to all CPUs before the VMM
/// is fully operational.
///
/// `online_mask` and `pending_mask` are `CpuMask` instances (Section 8.1,
/// 08-security.md), allocated from the boot-time bump allocator after the CPU
/// count is discovered from ACPI MADT or DTB. They scale to the actual number
/// of CPUs found on the system — no hardcoded limit.
///
/// Allocation: `CpuMask::alloc(num_possible_cpus, boot_alloc)` is called once
/// during Phase 11 (CPU enumeration), before any AP is woken. The mask storage
/// is never reallocated — `num_possible_cpus` is a boot-time-fixed value.
#[repr(C, align(64))]
pub struct SmpBringupState {
/// Bitmask of CPUs (by tree index) that have completed initialization.
/// One bit per possible CPU. Sized at boot to `num_possible_cpus` bits.
/// Each AP atomically sets its own bit via `CpuMask::set_atomic`.
pub online_mask: CpuMask,
/// Bitmask of CPUs currently being brought up (wakeup IPI sent, init
/// not yet complete). Used to detect stalled APs at deadline.
pub pending_mask: CpuMask,
/// Total number of possible CPUs discovered from firmware (MADT/DTB).
pub num_possible_cpus: usize,
/// Count of CPUs that have completed init, used for tree coordination.
/// An AP atomically increments this after setting its bit in `online_mask`.
pub online_count: AtomicUsize,
/// Global deadline (monotonic ns) by which all APs must come online.
/// Set by BSP to `now_ns() + 30_000_000_000` (30 seconds) before Phase 11.
pub deadline_ns: u64,
}
Protocol:
1. BSP initializes SmpBringupState, sets deadline_ns = now_ns() + 30s.
2. BSP prepares the mailbox for AP at tree index 1 (stack, percpu_base, APIC ID),
sets mailbox[1].status = MAILBOX_READY, then sends INIT-SIPI to that AP.
3. Each AP, after completing its own init (Phase 14 below), atomically sets its
bit in online_mask and increments online_count. It then reads its tree
index i and sends wakeup IPIs to children at indices 2i+1 and 2i+2
(if those CPUs exist and their deadline_ns has not passed). The AP then
enters the scheduler idle loop.
4. The BSP (Phase 15) polls online_count and deadline_ns. When
online_count reaches the expected total or deadline_ns is exceeded,
bringup ends. Any CPU whose bit is not set in online_mask by deadline
is marked offline and excluded from the kernel CPU mask.
Phase 11: AP Detection
Query ACPI MADT (Multiple APIC Description Table) or MP Table
for CPU count and LAPIC IDs. Assign sequential tree indices
(0 = BSP, 1..N-1 = APs in MADT order). Allocate PerCpu<T>
slots and AP_STARTUP_MAILBOXES for each detected CPU.
Initialize SmpBringupState; set deadline_ns.
Phase 12: AP Trampoline Setup
a. Allocate a 4 KB page below 1 MB (in low memory, identity-mapped)
for the AP trampoline code. This is required because APs start
in real mode (16-bit) with paging disabled.
b. Copy trampoline code (16-bit → 32-bit → 64-bit transition) to
the low-memory page. The trampoline:
- Starts in 16-bit real mode at physical address 0xNN00
- Enables protected mode (32-bit)
- Loads a temporary GDT (same layout as BSP's)
- Enables long mode (64-bit)
- Loads CR3 with the kernel's page tables
- Reads stack_top from ApStartupMailbox[own_lapic_id_index]
- Loads SP from stack_top
- Jumps to ap_entry() in high memory
c. The trampoline uses the ApStartupMailbox array (Section 2.1.2.4
above) for per-AP stack and percpu_base communication.
Phase 13: First AP Wakeup (BSP → tree root)
BSP allocates stack for AP at tree index 1, fills mailbox[1],
sets mailbox[1].status = MAILBOX_READY.
BSP sends INIT IPI to AP 1's LAPIC (assert level).
BSP waits 10 ms (Intel SDM recommendation).
BSP sends STARTUP IPI (SIPI) with trampoline vector.
BSP waits 200 μs; sends second SIPI (required by older silicon).
The fan-out tree propagates from here — each AP wakes its children
after completing its own init.
Phase 14: AP Initialization (per AP, in ap_entry())
Each AP runs this sequence independently after its mailbox is ready:
a. Load proper GDT and TSS (per-CPU TSS required for IST stacks)
b. Load IDT (same as BSP)
c. Enable interrupts
d. Initialize per-CPU scheduler runqueue
e. Calibrate LAPIC timer (delay calibration loop)
f. Atomically set own bit in SmpBringupState.online_mask;
increment online_count
g. Read own tree index i; wake children at 2i+1, 2i+2:
- Allocate stack for each child (from boot allocator)
- Fill child's ApStartupMailbox; set status = MAILBOX_READY
- Send INIT + SIPI + SIPI to child's LAPIC
h. Enter scheduler idle loop (hlt + monitoring for work)
Phase 15: SMP Online
BSP polls SmpBringupState.online_count and deadline_ns.
Loop exits when online_count == expected_ap_count OR
monotonic_now() >= deadline_ns (global 30-second timeout).
Any AP whose bit is not set in online_mask at exit is marked
permanently offline and removed from the kernel CPU mask.
System is now fully multi-CPU. Scheduler load-balances
across all online CPUs.
Per-CPU data initialization:
Each AP needs its own per-CPU data structures initialized:
- PerCpu<T> slots for scheduler runqueue, current task pointer, etc.
- GDT with per-CPU TSS (TSS must be unique per CPU for IST stacks)
- LAPIC timer calibration (varies per CPU due to manufacturing differences)
- IRQ affinity: By default, all IRQs target BSP; distribute to other CPUs
via IOAPIC redirection table or LAPIC logical destination mode.
ACPI MADT parsing (x86-64):
MADT (Multiple APIC Description Table):
- Located via RSDP → RSDT/XSDT → MADT signature "APIC"
- Provides: Local APIC address, CPU LAPIC IDs, IOAPIC addresses
- CPU entries: LAPIC ID, flags (enabled/disabled)
- Override entries: IRQ source overrides, NMI sources
The BSP's LAPIC ID is read from LAPIC_ID register (MMIO at 0xFEE00020).
All other entries in MADT are APs.
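The MADT entry walk can be sketched as follows, assuming the caller has already located the table and skipped the 44-byte fixed header. The offsets follow the ACPI Processor Local APIC structure; Vec and the function name are illustrative stand-ins for the kernel's boot-allocator-backed collection:

```rust
/// Walk MADT interrupt-controller structures and collect the APIC IDs of
/// enabled CPUs. `entries` starts just past the 44-byte fixed header.
/// Sketch only: the real parser also records IOAPICs and IRQ overrides.
fn collect_enabled_lapics(entries: &[u8]) -> Vec<u8> {
    let mut lapic_ids = Vec::new();
    let mut off = 0;
    while off + 2 <= entries.len() {
        let etype = entries[off];
        let len = entries[off + 1] as usize;
        if len < 2 || off + len > entries.len() {
            break; // malformed entry: stop, caller falls back to safe defaults
        }
        // Type 0 = Processor Local APIC:
        // [type:u8, length:u8 (= 8), acpi_uid:u8, apic_id:u8, flags:u32le]
        if etype == 0 && len >= 8 {
            let apic_id = entries[off + 3];
            let flags = u32::from_le_bytes(entries[off + 4..off + 8].try_into().unwrap());
            if flags & 1 != 0 {
                lapic_ids.push(apic_id); // flags bit 0 = Enabled
            }
        }
        off += len; // length field advances past unknown entry types too
    }
    lapic_ids
}
```

Skipping unknown entry types by their length field is what keeps the walk forward-compatible with newer MADT revisions.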
Failure handling:
If an AP fails to come online before the global deadline:
- BSP logs the failure: "CPU {lapic_id} (tree index {i}): did not signal
online before deadline, marking offline"
- BSP marks the CPU slot offline; its children in the fan-out tree are
also marked offline (they will never receive their wakeup IPI)
- Boot continues with the available CPUs
- Do NOT panic — reduced-CPU operation is valid
Hot-plug support (future): The ACPI namespace may indicate CPU hot-plug capability. The mailbox mechanism is reused for hot-plug: writing to the ACPI CPU hot-plug register triggers the same INIT/SIPI sequence for the newly added CPU, which inserts itself into the online_mask and online_count atomically.
2.1.2.5 ACPI Table Parsing and AML Interpreter Scope
UmkaOS uses ACPI tables for hardware discovery on x86-64 (and ARM SBSA/server platforms). The ACPI subsystem has two distinct components:
- Static table parsing (Phase 1, boot-time): The kernel parses binary ACPI tables (MADT, MCFG, HPET, DMAR/IVRS, SRAT, SLIT, PPTT, FADT) to discover hardware topology. This is a straightforward binary structure walk — no interpreter needed. Static table parsing is required for boot.
- AML interpreter (Phase 2, post-boot): ACPI methods (DSDT/SSDT bytecode) require an AML interpreter to execute _STA, _CRS, _PRS, _PSx, _Sx, _OSC, _DSM, and power/thermal methods. UmkaOS implements a reduced AML interpreter covering:
  - Required for boot: _STA (device status), _CRS (current resources), _PRS (possible resources), _OSC (OS capabilities handshake), _INI (device init).
  - Required for power management: _PS0–_PS3 (power state transitions), _S3/_S4/_S5 (sleep states), _TMP/_PSV/_CRT (thermal).
  - Required for PCI/PCIe: _BBN (base bus number), _SEG (segment group), _PRT (PCI routing table).
  - Deferred: _DSM (device-specific methods) for vendor extensions — implemented per-driver as needed.
AML opcode coverage: The method names above describe which methods to execute, not which AML opcodes the interpreter must support. Real-world DSDT tables (Dell, HP, Lenovo, etc.) use a substantial subset of the AML opcode space within these method bodies. The AML interpreter must support at minimum:
- Control flow: If/Else, While, Return, Break
- Data manipulation: Store, Add, Subtract, And, Or, ShiftLeft/Right, Increment, Decrement, Not, FindSetLeftBit/RightBit
- Object creation: CreateDWordField, CreateWordField, CreateByteField, CreateBitField, CreateQWordField
- Composite types: Buffer, Package, DerefOf, Index, SizeOf, ObjectType
- Method invocation: MethodCall (nested), Arg0–Arg6, Local0–Local7
- Synchronization: Acquire, Release, Mutex
- Namespace: Scope, Device, Name, Alias, Notify
- Field access: OpRegion, Field, IndexField, BankField (SystemMemory, SystemIO, PCI Config, Embedded Controller)
This covers ~80% of AML opcodes by frequency of occurrence (measured against
a corpus of 47 production ACPI tables from x86 servers and laptops). Rare
opcodes — object reference manipulation, external references, some buffer
field operations — are deferred to Phase 2. Systems requiring only common
opcodes will boot correctly. Systems hitting unimplemented opcodes produce a
clear diagnostic: ACPI: unsupported AML opcode 0xXX at <table>+<offset>,
skipping method <name>.
Extended opcodes (LoadTable, Unload, Timer, ToBCD) are deferred to Phase 3.
Error handling for malformed ACPI tables: If a static table fails checksum or has invalid structure, the kernel logs a diagnostic and falls back to safe defaults (e.g., assume 1 CPU, no IOAPIC, use legacy PIC). If the AML interpreter encounters an illegal opcode or infinite loop (method timeout: 5 seconds), it aborts the method, logs the failure, and marks the affected device as non-functional. The kernel never panics on ACPI errors — degraded operation is always preferred over a boot failure.
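The unsupported-opcode path can be sketched as follows. The opcode byte values are taken from the ACPI specification, but the coverage set shown is a small illustrative subset (not the interpreter's full table), and both helper names are hypothetical:

```rust
/// Quick membership test against a (partial, illustrative) set of
/// supported AML opcodes: Store/Add/Subtract, Increment/Decrement,
/// shifts, And/Or, and the If/Else/While/Return/Break control ops.
fn aml_opcode_supported(op: u8) -> bool {
    matches!(op,
        0x70 | 0x72 | 0x74            // Store, Add, Subtract
        | 0x75 | 0x76                 // Increment, Decrement
        | 0x79 | 0x7A | 0x7B | 0x7D   // ShiftLeft, ShiftRight, And, Or
        | 0xA0 | 0xA1 | 0xA2 | 0xA4 | 0xA5) // If, Else, While, Return, Break
}

/// Build the diagnostic described above (hex offset formatting is a
/// choice made here, not mandated by the text).
fn aml_unsupported_diag(op: u8, table: &str, offset: usize, method: &str) -> String {
    format!("ACPI: unsupported AML opcode 0x{op:02X} at {table}+{offset:#X}, skipping method {method}")
}
```

A miss aborts only the current method, consistent with the degraded-operation policy in the paragraph above.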
2.1.2.6 AArch64 Boot Sequence
QEMU's -M virt -cpu cortex-a72 -kernel loads the ELF at 0x40080000 and
enters at _start in EL1 (Exception Level 1) with the MMU off. Register x0
holds the DTB address provided by QEMU's built-in firmware.
Entry assembly (arch/aarch64/entry.S, GNU as syntax):
1. QEMU jumps to _start in EL1, MMU off
- x0 = DTB address (passed by QEMU firmware)
2. _start:
a. Save DTB pointer: mov x19, x0 (x19 is callee-saved)
b. Disable all exceptions: msr daifset, #0xf
(masks Debug, SError, IRQ, FIQ in DAIF register)
c. Enable FPU/NEON: write CPACR_EL1.FPEN bits [21:20] = 0b11
(without this, any NEON/FP instruction traps — Rust generates
NEON instructions by default for aarch64). This clobbers x0,
but the DTB pointer was saved to x19 in step (a).
d. Load stack pointer: adrp x1, _stack_top / add / mov sp, x1
(64 KB stack in .bss._stack, 16-byte aligned)
e. Clear BSS: zero memory from __bss_start to __bss_end
(str xzr loop, 8 bytes per iteration)
f. Prepare arguments: x0 = 0 (no multiboot), x1 = x19 (DTB address)
g. Branch: bl umka_main
h. Halt loop: wfe (wait-for-event) if umka_main returns
Stack (64 KB) is allocated in .bss._stack (16-byte aligned). The linker
script (linker-aarch64.ld) places .text._start first and provides
__bss_start / __bss_end symbols for BSS clearing.
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (VBAR_EL1)
Write vector table base to VBAR_EL1 (16 entries × 128 bytes,
2 KB aligned). Vectors cover: Synchronous, IRQ, FIQ, SError
at each of four exception origins (current EL SP0/SPx, lower
EL AArch64/AArch32).
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly, same
pattern as x86 entry.asm step 2d). Perform any additional
initialization that depends on zeroed static data.
Phase 3: DTB Parse
Parse the DTB (received in x0 at entry, forwarded as the
info pointer to umka_main; see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, interrupt controller base (GIC),
timer IRQ numbers, and UART base address.
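The first step of any DTB parse is validating the big-endian header. A minimal sketch, with field offsets per the devicetree specification and an illustrative function name:

```rust
const FDT_MAGIC: u32 = 0xD00D_FEED;

/// Check the FDT header at the start of `blob` and return `totalsize`
/// if the blob looks usable. All header fields are big-endian u32s:
/// magic at offset 0, totalsize at offset 4.
fn fdt_header_valid(blob: &[u8]) -> Option<u32> {
    let be32 = |off: usize| -> Option<u32> {
        Some(u32::from_be_bytes(blob.get(off..off + 4)?.try_into().ok()?))
    };
    if be32(0)? != FDT_MAGIC {
        return None;
    }
    let totalsize = be32(4)?;
    // The whole blob must be present before walking the memory
    // reservation and structure blocks.
    if (totalsize as usize) <= blob.len() { Some(totalsize) } else { None }
}
```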
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Mark available
regions free, reserve kernel image (__bss_end and below).
No legacy BIOS region to reserve (unlike x86).
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
This initial heap size is a bootstrap minimum; the allocator expands
dynamically once memory discovery completes (Section 4.1).
Phase 6: Virtual Memory (TTBR0_EL1)
Build identity-map page tables using 4 KB granule:
- TCR_EL1: T0SZ=16 (48-bit VA), TG0=0b00 (4 KB granule),
ORGN0/IRGN0 = write-back cacheable, SH0 = inner shareable
- 4-level tables: L0 (PGD) → L1 (PUD) → L2 (PMD) → L3 (PTE)
- Identity map all physical RAM
- Set TTBR0_EL1, isb, enable MMU via SCTLR_EL1.M bit
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: GIC Initialization (v2 or v3, detected at runtime)
Read GIC version and base addresses from DTB
(`compatible` = "arm,gic-400" for GICv2, "arm,gic-v3" for GICv3).
- GICv2 path:
GICD (Distributor): enable, configure IRQ priorities and
targets for all SPIs. Set priority mask.
GICC (CPU Interface): enable, set priority mask to 0xFF
(accept all priorities), set BPR (binary point).
- GICv3 path:
GICD (Distributor): enable, configure affinity routing (ARE=1),
set priorities for all SPIs.
GICR (Redistributor): per-CPU, configure SGI/PPI group and
priority. Enable redistributor.
ICC system registers: ICC_PMR_EL1 = 0xFF (accept all),
ICC_IGRPEN1_EL1 = 1 (enable group 1 interrupts).
Route timer IRQ (PPI 27 = virtual timer) to this CPU.
Phase 9: Generic Timer
Configure the ARM generic timer (virtual counter):
- Write timer period to CNTV_TVAL_EL0
- Enable timer: CNTV_CTL_EL0 = ENABLE (bit 0), clear IMASK
- Timer fires IRQ 27 (virtual timer PPI) → tick handler
Enable interrupts: msr daifclr, #0xf
Phase 10: SVC / Exception-Vector Syscall Setup
Configure the exception vector table to correctly dispatch system
calls arriving from EL0 via the SVC instruction.
Exception vector layout (VBAR_EL1, 16 entries × 128 bytes = 2 KB,
must be 2 KB-aligned):
Offset 0x000: Current EL with SP0 — Synchronous
Offset 0x080: Current EL with SP0 — IRQ
Offset 0x100: Current EL with SP0 — FIQ
Offset 0x180: Current EL with SP0 — SError
Offset 0x200: Current EL with SPx — Synchronous
Offset 0x280: Current EL with SPx — IRQ
Offset 0x300: Current EL with SPx — FIQ
Offset 0x380: Current EL with SPx — SError
Offset 0x400: Lower EL (AArch64) — Synchronous ← SVC lands here
Offset 0x480: Lower EL (AArch64) — IRQ
Offset 0x500: Lower EL (AArch64) — FIQ
Offset 0x580: Lower EL (AArch64) — SError
Offset 0x600: Lower EL (AArch32) — Synchronous
Offset 0x680: Lower EL (AArch32) — IRQ
Offset 0x700: Lower EL (AArch32) — FIQ
Offset 0x780: Lower EL (AArch32) — SError
SVC handler entry (Lower EL AArch64 Synchronous, offset 0x400):
1. Save all general-purpose registers and the ELR_EL1/SPSR_EL1
pair to the per-task kernel stack (or per-CPU trap frame).
2. Read ESR_EL1: check EC field (bits [31:26]) == 0x15 (SVC64
instruction). If EC != 0x15, dispatch to generic fault path.
3. Extract syscall number from X8 (Linux AArch64 ABI convention).
Arguments are in X0-X5. Return value is written to X0.
4. Invoke the syscall dispatch table (same table as all arches).
5. Restore registers and return via ERET (restores PC from
ELR_EL1 and PSTATE from SPSR_EL1).
Control register configuration (verified during this phase):
SCTLR_EL1: M=1 (MMU on), C=1 (data cache on), I=1 (icache on),
SA=1 (SP alignment check at EL1), SA0=1 (SP alignment at EL0).
HCR_EL2.TGE: must be 0 so that EL0 exceptions route to EL1, not
EL2. Verified here if the kernel is running under a hypervisor
that sets up HCR_EL2 before entering the guest kernel.
SPSR_EL1: set up on return so EL0 re-enters AArch64 state (M=0b0000).
Verification test (executed during boot):
Trigger SVC from EL1 to test the synchronous exception vector
(VBAR_EL1 + 0x200, "Current EL with SPx — Synchronous"). The
handler fires, reads ESR_EL1 to verify EC == 0x15 (SVC64), and
returns. This is a vector table self-test — not a user-mode
execution test. User-mode execution is not possible until the
scheduler is initialized in Phase 11.
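The ESR_EL1 decode performed by the SVC handler (step 2 of the handler entry above) reduces to a bit-field extraction. The bit positions are architectural; the helper names are illustrative:

```rust
/// EC value for an SVC instruction executed in AArch64 state.
const EC_SVC64: u64 = 0x15;

/// Extract the Exception Class field, ESR_EL1 bits [31:26].
fn esr_ec(esr: u64) -> u64 {
    (esr >> 26) & 0x3F
}

/// True if this synchronous exception is a 64-bit SVC (syscall entry);
/// anything else goes to the generic fault path.
fn is_svc64(esr: u64) -> bool {
    esr_ec(esr) == EC_SVC64
}
```

For example, an SVC #0 from AArch64 produces ESR_EL1 = 0x5600_0000 (EC = 0x15, IL = 1), which passes the check.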
Phase 11: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Secondary CPU Bringup (AArch64 via PSCI):
After Phase 11 completes on the primary CPU, secondary CPUs are brought online using PSCI (Power State Coordination Interface).
AP Stack Allocation (AArch64):
The primary CPU allocates each secondary's kernel stack from the boot allocator
before issuing the PSCI CPU_ON call. Stack size is 16 KB per AP, allocated
from the per-NUMA-node boot allocator, preferring memory local to the target
CPU's node. The stack top address and percpu base are written to the AP's
ApStartupMailbox slot (see Section 2.1.2.4 for the struct definition; the
same type is used on all architectures). The PSCI context_id parameter is
set to the physical address of the AP's mailbox so that the secondary entry
stub can locate its stack before the MMU is active.
If stack allocation fails (boot allocator OOM), the primary logs "CPU {mpidr}:
stack allocation failed, CPU disabled", does not issue CPU_ON, and marks
the CPU permanently offline. Boot continues with the remaining CPUs.
Phase 12: Secondary CPU Detection
Parse DTB /cpus node for all CPU entries:
- Each cpu@N node contains: reg = MPIDR affinity bits
- device_type = "cpu"
- enable-method = "psci" (indicates PSCI is used)
Assign sequential tree indices (0 = primary, 1..N-1 = secondaries
in DTB order). Allocate PerCpu<T> slots and AP_STARTUP_MAILBOXES.
Initialize SmpBringupState; set deadline_ns = now_ns() + 30s.
Phase 13: PSCI Method Detection
Check /psci node in DTB for PSCI method:
- method = "smc": Use SMC (Secure Monitor Call) for PSCI
- method = "hvc": Use HVC (Hypervisor Call) for PSCI
Verify PSCI version via PSCI_VERSION (function ID 0x84000000):
- Major version in bits 31:16, minor in 15:0
- Require PSCI 1.0+ for full feature support
Phase 14: Secondary CPU Startup (fan-out tree, PSCI CPU_ON)
Primary allocates stack for tree-index 1 (fills mailbox, sets
mailbox[1].status = MAILBOX_READY), then calls PSCI CPU_ON:
x0 = 0xC4000003 (CPU_ON function ID, AArch64 PSCI 0.2+)
x1 = target_mpidr (MPIDR affinity value from DTB for index 1)
x2 = secondary_entry_phys (physical address of entry stub)
x3 = mailbox_phys (physical address of ApStartupMailbox[1])
Issue via SMC or HVC depending on Phase 13 detection.
Return values:
0 (PSCI_SUCCESS): CPU starting
-2 (PSCI_INVALID_PARAMS): bad MPIDR or entry address
-4 (PSCI_ALREADY_ON): CPU was already running (treat as success)
other negative: firmware error; mark CPU offline
Each secondary, after completing Phase 15 init, atomically sets its
bit in SmpBringupState.online_mask (via CpuMask::set_atomic),
increments online_count, then reads its own tree index i and issues
CPU_ON for children at
indices 2i+1 and 2i+2 (allocating stacks and filling mailboxes
first), before entering the scheduler idle loop.
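The CPU_ON return-value handling from the table above can be captured as a small mapping. The error codes are architectural PSCI values; the enum is an illustrative kernel-side type:

```rust
#[derive(Debug, PartialEq)]
enum CpuOnOutcome {
    Starting,           // wait for the AP to signal via online_mask
    AlreadyOn,          // PSCI_ALREADY_ON: treat as success
    MarkOffline(i64),   // invalid params or firmware error
}

/// Interpret the value returned in x0 by PSCI CPU_ON (0xC4000003).
fn interpret_cpu_on(ret: i64) -> CpuOnOutcome {
    match ret {
        0 => CpuOnOutcome::Starting,        // PSCI_SUCCESS
        -4 => CpuOnOutcome::AlreadyOn,      // PSCI_ALREADY_ON
        e => CpuOnOutcome::MarkOffline(e),  // -2 = INVALID_PARAMS, etc.
    }
}
```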
Phase 15: Secondary CPU Entry (secondary_entry stub, per AP)
Each secondary CPU enters here in EL1 with MMU off.
x0 = context_id = physical address of own ApStartupMailbox.
a. Spin on mailbox.status until == MAILBOX_READY (pause loop)
b. Verify mailbox.cpu_id matches own MPIDR[31:0]
c. Enable FPU/NEON: write CPACR_EL1.FPEN = 0b11
d. Load kernel page tables: write TTBR0_EL1 with primary's root
table PPN; isb; enable MMU via SCTLR_EL1.M = 1; isb
e. Load stack pointer: ldr x1, [x0, #offsetof(stack_top)]; mov sp, x1
f. Load percpu_base: ldr x18, [x0, #offsetof(percpu_base)]
(x18 = CpuLocal register on AArch64 per Section 3.1.2 (03-concurrency.md))
g. Branch to Rust: bl secondary_init
In secondary_init():
1. Load VBAR_EL1 (exception vectors, same table as primary)
2. Initialize GIC CPU interface only (GICC or ICC system regs);
the primary already configured GICD for all CPUs during Phase 8
GICv2: GICC_PMR = 0xFF (unmask all); GICC_CTLR = 0x1 (enable)
GICv3: ICC_PMR_EL1 = 0xFF; ICC_IGRPEN1_EL1 = 1
3. Calibrate generic timer (read CNTFRQ_EL0; program CNTV_TVAL_EL0)
4. Enable interrupts: msr daifclr, #0xf
5. Initialize per-CPU scheduler runqueue
6. Atomically set own bit in SmpBringupState.online_mask;
increment online_count
7. Issue CPU_ON for own tree children (if any) as described above
8. Enter scheduler idle loop (wfe)
Phase 16: SMP Online
Primary polls SmpBringupState.online_count and deadline_ns.
Loop exits when online_count == expected_secondary_count OR
monotonic_now() >= deadline_ns (global 30-second timeout).
Any secondary whose bit is not set in online_mask at exit is
marked permanently offline and removed from the kernel CPU mask.
System is fully multi-CPU. GIC affinity routing distributes
interrupts across all online CPUs.
MPIDR affinity (AArch64): Each CPU has a unique MPIDR_EL1 value:
- Bits [7:0]: Affinity level 0 (core within cluster)
- Bits [15:8]: Affinity level 1 (cluster within socket)
- Bits [23:16]: Affinity level 2 (socket)
- Bits [39:32]: Affinity level 3 (extended, rare; multi-chip systems)
The DTB /cpus/cpu@N/reg property contains these affinity bits. PSCI_CPU_ON uses the full MPIDR value to identify the target CPU.
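Unpacking the affinity fields is a fixed bit-field extraction; a sketch with an illustrative helper type (the bit positions are architectural):

```rust
#[derive(Debug, PartialEq)]
struct MpidrAffinity { aff0: u8, aff1: u8, aff2: u8, aff3: u8 }

/// Split MPIDR_EL1 into its four affinity levels. Note Aff3 sits at
/// bits [39:32], not contiguous with the lower three fields.
fn mpidr_affinity(mpidr: u64) -> MpidrAffinity {
    MpidrAffinity {
        aff0: (mpidr & 0xFF) as u8,
        aff1: ((mpidr >> 8) & 0xFF) as u8,
        aff2: ((mpidr >> 16) & 0xFF) as u8,
        aff3: ((mpidr >> 32) & 0xFF) as u8,
    }
}
```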
CPU Hotplug — RISC-V via SBI HSM (Hart State Management):
Secondary harts on RISC-V are brought online through the SBI HSM extension
(Extension ID: 0x48534D = ASCII "HSM"), which provides a portable interface
independent of the underlying platform firmware:
- `sbi_hart_start(hartid, start_addr, opaque)` (FID 0): Bring an offline hart online. The hart begins execution at `start_addr` in S-mode with `a0 = hartid` and `a1 = opaque`. UmkaOS passes its SMP trampoline physical address as `start_addr` and a pointer to the per-hart data block as `opaque`.
- Trampoline requirements: `start_addr` must be a physical address. On implementations that limit the address to 32 bits, the trampoline must reside below the 4 GB boundary. The hart starts with the MMU disabled (`satp = 0`) and all CSRs at their reset values.
- UmkaOS RISC-V SMP trampoline (`arch/riscv64/trampoline.S`):
  - Load the per-hart data pointer from `a1` (opaque value set by the primary hart).
  - Configure `satp` with the kernel's root page-table PPN and MODE=Sv48. Execute `sfence.vma` to flush any stale TLB state.
  - Write UmkaOS's trap handler address to `stvec` (Direct mode, bits [1:0] = 0).
  - Write the per-hart kernel stack top address to `sscratch` (used by the trap entry stub to locate the kernel stack from U-mode).
  - Set `sstatus.SIE = 1` to enable supervisor interrupts.
  - Call `smp_secondary_init(hartid)` (C calling convention: `a0 = hartid`).
- `sbi_hart_stop()` (FID 1): Park the calling hart. The hart transitions to STOPPED state and may be restarted by the primary hart via `sbi_hart_start`. UmkaOS calls this during CPU offline (logical hot-remove).
- `sbi_hart_get_status(hartid)` (FID 2): Query the current state of a hart. Return values: 0 = STARTED, 1 = STOPPED, 2 = START_PENDING, 3 = STOP_PENDING, 4 = SUSPENDED, 5 = SUSPEND_PENDING, 6 = RESUME_PENDING. UmkaOS polls this after calling `sbi_hart_start` to confirm the hart is online within the timeout window (1 second).
- Hart discovery: Enumerate `/cpus` nodes from the Device Tree, recording each hart's `reg` property (hart ID). Cross-reference with SBI HSM status to filter out harts that are permanently disabled (STOPPED but not startable on this platform configuration).
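The HSM status codes listed above can be decoded with a small helper; a sketch (the enum and function names are illustrative, not from the UmkaOS source):

```rust
/// Hart states returned by sbi_hart_get_status (FID 2), per the SBI HSM
/// extension. Illustrative sketch — names are not from the UmkaOS source.
#[derive(Debug, PartialEq)]
pub enum HartState {
    Started,        // 0
    Stopped,        // 1
    StartPending,   // 2
    StopPending,    // 3
    Suspended,      // 4
    SuspendPending, // 5
    ResumePending,  // 6
}

/// SBI HSM extension ID: ASCII "HSM" packed into the low three bytes.
pub const SBI_EXT_HSM: u32 = 0x48534D;

/// Decode the raw value returned in a0. Negative values are SBI errors.
pub fn decode_hart_state(raw: isize) -> Option<HartState> {
    match raw {
        0 => Some(HartState::Started),
        1 => Some(HartState::Stopped),
        2 => Some(HartState::StartPending),
        3 => Some(HartState::StopPending),
        4 => Some(HartState::Suspended),
        5 => Some(HartState::SuspendPending),
        6 => Some(HartState::ResumePending),
        _ => None,
    }
}
```

The post-`sbi_hart_start` poll loop then simply retries until `decode_hart_state` yields `Started` or the 1-second timeout expires.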
CPU Hotplug — PPC32 / PPC64LE:
PowerPC platforms use firmware-specific mechanisms that vary by environment:
- Bare-metal POWER (OpenPOWER / OPAL): Secondary processors are held at a spin-table address specified by the Device Tree property `cpu-release-addr` (per `/cpus/cpu@N` node with `enable-method = "spin-table"`). The BSP writes the secondary entry-point physical address to `cpu-release-addr`, then executes `dcbf` (data cache block flush) and `sync` + `isync` memory barriers to ensure the secondary observes the write. The secondary breaks out of its spin loop, loads the entry address, and jumps to the kernel SMP trampoline.
- POWER LPARs under PowerVM: Use RTAS (Run-Time Abstraction Services): `rtas_call(RTAS_TOKEN_START_CPU, 3, 1, NULL, hwcpu_id, start_addr, r3_val)`. The RTAS call is issued via the `rtas` firmware interface discovered from the Device Tree `/rtas` node. UmkaOS records the RTAS token for `start-cpu` at boot during DTB parsing.
- KVM / QEMU pseries: Depending on the machine configuration, either RTAS or a Device Tree spin table is used. The DTB `enable-method` property on each CPU node identifies which mechanism applies.
- Secondary entry (PPC64LE): The secondary processor begins execution in kernel virtual mode on POWER9+ systems with the Radix MMU, or in real mode on POWER8 with HPT. The SMP trampoline configures the stack pointer (`r1`) and thread pointer (`r13`, points to per-CPU data), and calls `smp_secondary_init(cpu_id)`.
- Secondary entry (PPC32): The secondary begins in supervisor mode. The trampoline sets `r1` (stack pointer), enables the MMU via MSR[IR] and MSR[DR], and calls `smp_secondary_init(cpu_id)`.
2.1.2.7 ARMv7 Boot Sequence
QEMU's -M vexpress-a15 -kernel loads the ELF at 0x60010000 and enters
at _start in SVC (Supervisor) mode with the MMU off. Registers: r0 = 0,
r1 = machine type, r2 = DTB address.
Entry assembly (arch/armv7/entry.S, GNU as syntax):
1. QEMU jumps to _start in SVC mode, MMU off
- r0 = 0 (unused), r1 = machine type, r2 = DTB address
2. _start:
a. Disable IRQ and FIQ: cpsid if
(sets I and F bits in CPSR)
b. Set up IRQ mode stack: switch to IRQ mode (cps #0x12),
load 4 KB IRQ stack, switch back to SVC mode (cps #0x13)
c. Load SVC stack pointer: ldr sp, =_stack_top
(64 KB stack in .bss._stack, 16-byte aligned via .align 4)
d. Clear BSS: zero memory from __bss_start to __bss_end
(str r6 loop, 4 bytes per iteration)
e. Prepare 64-bit arguments (AAPCS: u64 passed as register pairs):
- r0:r1 = 0:0 (multiboot_magic, both halves)
- r2:r3 = dtb_addr:0 (multiboot_info, low:high)
f. Branch: bl umka_main
g. Halt loop: wfe if umka_main returns
Stack (64 KB) is in .bss._stack (16-byte aligned via .align 4, which on
ARM GAS means 2^4 = 16 bytes). The linker script (linker-armv7.ld) places
.text._start first at 0x60010000 (offset from the vexpress-a15 base
0x60000000 to leave room for the bootloader stub).
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (VBAR)
Write vector table base to VBAR via CP15 c12 register:
mcr p15, 0, <reg>, c12, c0, 0
Vector table: 8 entries (Reset, Undef, SVC, Prefetch Abort,
Data Abort, reserved, IRQ, FIQ) × 4-byte branch instructions.
Each vector branches to a full handler stub.
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly, same
pattern as x86 entry.asm step 2d). Perform any additional
initialization that depends on zeroed static data.
Phase 3: DTB Parse
Parse the DTB passed in r2 (see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, GIC base addresses, timer IRQ
numbers, and UART base. vexpress-a15 has well-known addresses
but DTB parsing keeps the code machine-independent.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). The vexpress-a15
machine provides up to 1 GB RAM starting at 0x60000000 (or
0x80000000 depending on configuration). Reserve kernel image.
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
This initial heap size is a bootstrap minimum; the allocator expands
dynamically once memory discovery completes (Section 4.1).
Phase 6: Virtual Memory (TTBR0, Short Descriptor)
Build identity-map using ARMv7 short-descriptor format:
- TTBR0: points to L1 table (4096 × 32-bit entries, 16 KB)
- L1 entries: section descriptors (1 MB pages) for identity map
Flags: AP=0b11 (full access), TEX/C/B for normal cacheable
- DACR: domain 0 = Client (0b01), all others = No Access
- Enable MMU: set SCTLR.M bit via mcr p15, 0, <reg>, c1, c0, 0
- 1 MB sections are sufficient for initial identity map;
L2 tables (256 × 4 KB pages) added later for fine-grained mapping
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: GIC Initialization
ARMv7 platforms typically use GICv2 (GICv3 supports ARMv7/AArch32
but is rare on ARMv7 SoCs; limited to 3 affinity levels in AArch32).
Read GICD/GICC bases from DTB (vexpress-a15 defaults:
GICD = 0x2C001000, GICC = 0x2C002000).
Configure distributor, CPU interface, route timer IRQ.
Phase 9: Timer
Configure SP804 dual timer or ARM generic timer (if available):
- SP804 (vexpress): program LOAD register, enable with
periodic mode + interrupt enable, IRQ via GIC SPI
- Generic timer (Cortex-A15): CNTVCT, CNTV_TVAL, CNTV_CTL
(same registers as AArch64, accessed via CP15 c14)
Enable interrupts: cpsie if
Phase 10: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Secondary CPU Bringup (ARMv7 via PSCI):
After Phase 10 completes on the primary CPU, secondary CPUs are brought online using PSCI (Power State Coordination Interface).
PSCI calling convention (ARMv7):
The kernel detects the PSCI version and calling mechanism at runtime from the
DTB /psci node compatible property:
- `"arm,psci-0.2"` or later: use PSCI 0.2 function IDs (preferred)
- `"arm,psci"`: use PSCI 0.1 function IDs (legacy fallback; function IDs are platform-specific and read from the DTB `cpu_on` property under `/psci`)
PSCI 0.2 function IDs for ARMv7 (32-bit callee convention):
CPU_ON = 0x84000003 (PSCI 0.2, 32-bit)
r0 = 0x84000003 (function ID)
r1 = target_cpu (MPIDR[31:0] of target AP)
r2 = entry_point (physical address of AP entry stub, must be 32-bit)
r3 = context_id (physical address of ApStartupMailbox for this AP)
Return values (in r0):
0 PSCI_SUCCESS: AP starting
-2 PSCI_INVALID_PARAMS: bad MPIDR or entry address
-4 PSCI_ALREADY_ON: AP was already running (treat as success)
other negative: firmware error; mark AP offline
Calling convention: use smc #0 if the DTB /psci node method = "smc";
use hvc #0 if method = "hvc". The method property is mandatory in valid
PSCI device trees; if it is absent (a malformed tree), default to smc.
AP Stack Allocation (ARMv7):
Stack allocation follows the same protocol as all architectures (see
Section 2.1.2.4): the primary allocates 16 KB per AP from the boot allocator
before issuing CPU_ON, fills the ApStartupMailbox, passes its physical
address as context_id, and marks the AP offline on allocation failure.
GIC initialization for ARMv7 APs:
The primary CPU configures the GIC Distributor (GICD) during Phase 8 for all CPUs. Each AP, on startup, initializes only its own GIC CPU Interface (GICC):
GICC_PMR = 0xFF // unmask all interrupt priorities
GICC_CTLR = 0x1 // enable CPU interface
APs do not touch the GICD — the primary owns the distributor. IRQs are
unmasked by clearing the CPSR.I and CPSR.F bits (cpsie if) after the
scheduler is initialized and the AP is ready to run tasks.
ARMv7 AP entry sequence:
The entry stub physical address passed to CPU_ON as r2 is the ARMv7 SMP
trampoline. The trampoline receives context_id (physical address of
ApStartupMailbox) in r3 from the PSCI firmware and follows this sequence:
1. AP wakes at physical entry point (address passed in CPU_ON r2).
r3 = physical address of own ApStartupMailbox (from PSCI context_id).
2. Disable IRQs and FIQs: cpsid if
(sets CPSR.I and CPSR.F; prevents spurious interrupts before stack is set)
3. Confirm SVC mode: mrs r0, cpsr; and r0, r0, #0x1F; cmp r0, #0x13
If not in SVC mode (0x13), switch: cps #0x13
4. Enable VFP/NEON if needed:
mrc p15, 0, r1, c1, c0, 2 // read CPACR
orr r1, r1, #(0xF << 20) // enable CP10 + CP11 full access
mcr p15, 0, r1, c1, c0, 2 // write CPACR
vmrs r1, fpexc // enable VFP: FPEXC.EN = 1
orr r1, r1, #(1 << 30)
vmsr fpexc, r1
5. Enable MMU with kernel page tables:
- Load TTBR0 with primary's L1 table physical address
mcr p15, 0, <ttbr0>, c2, c0, 0
- Set DACR domain 0 = Client (0b01):
ldr r1, =0x00000001
mcr p15, 0, r1, c3, c0, 0
- Enable MMU and caches (set SCTLR.M, .C, .I via CP15 c1 c0 0)
- isb
6. Spin on mailbox.status until == MAILBOX_READY (0xAB1E1234):
ldr r0, [r3, #offsetof(ApStartupMailbox, status)]
cmp r0, #0xAB1E1234
bne spin (with yield: yield instruction or nop)
7. Verify mailbox.cpu_id matches own MPIDR[23:0]:
mrc p15, 0, r1, c0, c0, 5 // read MPIDR
and r1, r1, #0x00FFFFFF // lower 24 affinity bits
ldr r0, [r3, #offsetof(ApStartupMailbox, cpu_id)]
cmp r0, r1
bne fault_halt // mismatch: configuration error
8. Load SP from stack_top:
ldr sp, [r3, #offsetof(ApStartupMailbox, stack_top)]
9. Load percpu_base (TPIDRPRW, the ARMv7 CpuLocal register per Section 3.1.2 (03-concurrency.md)):
ldr r4, [r3, #offsetof(ApStartupMailbox, percpu_base)]
mcr p15, 0, r4, c13, c0, 4 // write TPIDRPRW
10. Jump to Rust entry point:
bl ap_secondary_init // does not return
The ap_secondary_init() function (Rust) runs the following in order:
1. Load VBAR (exception vectors, same table as primary): mcr p15, 0, vbar, c12, c0, 0
2. Initialize GICC (CPU Interface): write GICC_PMR = 0xFF and GICC_CTLR = 0x1
3. Initialize per-CPU scheduler runqueue
4. Configure and enable the timer (generic timer or SP804 as appropriate)
5. Enable interrupts: cpsie if
6. Atomically set own bit in SmpBringupState.online_mask (via CpuMask::set_atomic);
increment online_count
7. Issue CPU_ON for own tree children (indices 2i+1, 2i+2) if they exist
and the global deadline_ns has not expired (allocate stacks, fill mailboxes,
call PSCI, same protocol as the primary for tree index 1)
8. Enter scheduler idle loop (wfe)
SMP bringup phases (ARMv7):
Phase 11: Secondary CPU Detection
Parse DTB /cpus node for CPU entries with enable-method = "psci".
Assign sequential tree indices (0 = primary). Allocate PerCpu<T>
slots and AP_STARTUP_MAILBOXES. Initialize SmpBringupState;
set deadline_ns = now_ns() + 30s.
Phase 12: PSCI Method and Version Detection
Read /psci node: detect method (smc/hvc) and compatible string
(psci-0.2 vs psci-0.1). For psci-0.1, read cpu_on property.
Phase 13: First AP Wakeup (primary → tree index 1)
Allocate stack for index 1; fill mailbox[1]; set MAILBOX_READY.
Call PSCI CPU_ON for index 1 as described above.
Fan-out tree propagates: each AP wakes its children after init.
Phase 14: SMP Online
Primary polls SmpBringupState.online_count and deadline_ns.
Loop exits when online_count == expected_secondary_count OR
monotonic_now() >= deadline_ns (global 30-second timeout).
Any AP whose bit is not set in online_mask at exit is marked
permanently offline. Boot continues with available CPUs.
System is fully multi-CPU once all online APs are in their
scheduler idle loops.
2.1.2.8 RISC-V 64 Boot Sequence
QEMU's -M virt -bios default -kernel runs OpenSBI in M-mode, which then
jumps to the kernel at 0x80200000 in S-mode (Supervisor mode). Registers:
a0 = hart_id, a1 = DTB address (on QEMU and systems following the Linux
boot convention — see note below).
Note on a1 and DTB discovery: The RISC-V SBI specification does NOT mandate that
`a1` contains the DTB physical address. This is a firmware convention established by QEMU and U-Boot, and is widely followed in practice, but real bare-metal boards may use different mechanisms. The boot code therefore validates `a1` before trusting it:
1. Check if `a1` is a valid DTB pointer: read the 4-byte magic at that address and verify it equals `0xD00DFEED` (big-endian FDT magic).
2. If `a1` is not a valid DTB: scan for a UEFI System Table (look for the `IBI SYST` signature in the EFI System Table header).
3. If UEFI is not found: use the SBI vendor extension to request the DTB address, or fall back to a compiled-in DTB for the target board.
The reference implementation uses option 1 with UEFI fallback for production hardware targets.
Entry assembly (arch/riscv64/entry.S, GNU as syntax):
1. OpenSBI jumps to _start in S-mode
- a0 = hart_id (hardware thread ID, usually 0 on single-core)
- a1 = DTB address (QEMU/U-Boot convention; validated at runtime — see note above)
2. _start:
a. Disable interrupts: csrci sstatus, 0x2
(clears SIE bit in supervisor status register)
b. Load stack pointer: la sp, _stack_top
(64 KB stack in .bss._stack, 16-byte aligned)
c. Clear BSS: zero memory from __bss_start to __bss_end
(sd zero loop, 8 bytes per iteration)
d. Arguments already in correct registers:
a0 = hart_id (passed as multiboot_magic parameter)
a1 = DTB address (passed as multiboot_info parameter)
e. Call: call umka_main (jal with ra)
f. Halt loop: wfi (wait-for-interrupt) if umka_main returns
Stack (64 KB) is in .bss._stack (16-byte aligned). The linker script
(linker-riscv64.ld) places .text._start first at 0x80200000, after the
OpenSBI firmware region (0x80000000–0x801FFFFF).
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (stvec)
Write trap handler address to stvec CSR in Direct mode
(stvec[1:0] = 0b00). All traps — exceptions, software
interrupts, external interrupts — vector to a single entry
point that reads scause to dispatch.
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly, same
pattern as x86 entry.asm step 2d). Perform any additional
initialization that depends on zeroed static data.
Phase 3: DTB Parse
Parse the DTB passed in a1 (see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, PLIC base address, CLINT address
(if present), and UART base. QEMU virt machine uses standard
addresses but DTB parsing keeps the code machine-independent.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Mark available
regions free. Reserve:
- OpenSBI firmware: 0x80000000–0x801FFFFF (2 MB)
- Kernel image: 0x80200000 to __kernel_end
Unlike x86, no legacy BIOS region to reserve.
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
This initial heap size is a bootstrap minimum; the allocator expands
dynamically once memory discovery completes (Section 4.1).
Phase 6: Virtual Memory (satp, Sv48)
Build identity-map using Sv48 (4-level, 48-bit VA):
- 4 levels: L3 (root) → L2 → L1 → L0, each 512 × 8-byte PTEs
- PTE format: [53:10] PPN, [7:0] flags (V, R, W, X, U, G, A, D)
- Identity map all physical RAM with RWX + Valid + Global
- Write root table PPN to satp: MODE=Sv48 (9), ASID=0, PPN
- Execute sfence.vma to flush TLB after satp write
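The final satp write above packs MODE, ASID, and the root-table PPN into a single 64-bit CSR value. A minimal sketch of that encoding (the helper name is illustrative, not from the UmkaOS source):

```rust
/// satp field layout (RV64): MODE [63:60], ASID [59:44], PPN [43:0].
/// MODE = 9 selects Sv48, as used in Phase 6.
const SATP_MODE_SV48: u64 = 9;

/// Build a satp value for an Sv48 root table at physical address `root_pa`
/// (must be 4 KB-aligned). Helper name is illustrative, not from the
/// UmkaOS source.
pub fn satp_sv48(root_pa: u64, asid: u16) -> u64 {
    assert_eq!(root_pa & 0xFFF, 0, "root table must be page-aligned");
    let ppn = root_pa >> 12; // physical page number of the root table
    (SATP_MODE_SV48 << 60) | ((asid as u64) << 44) | ppn
}
```

After writing this value to satp, the kernel must execute `sfence.vma` as noted above, since the satp write alone does not invalidate stale translations.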
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: PLIC Initialization
Read PLIC base address from DTB (QEMU virt default: 0x0C000000).
- Set priority threshold to 0 (accept all priorities)
- Enable relevant interrupt sources (UART, etc.)
- Set priority for each source
PLIC handles external interrupts only; timer and software
interrupts go through separate CSRs (sie.STIE, sie.SSIE).
Phase 9: SBI Timer
Use SBI ecall to program the timer:
- Read current time: csrr <rd>, time (or the rdtime pseudo-instruction)
- Set next deadline: sbi_set_timer(time + interval)
(SBI EID=0x54494D45 "TIME", FID=0)
- Enable timer interrupt: set sie.STIE (bit 5)
Timer fires supervisor timer interrupt (scause = 5) →
clear by calling sbi_set_timer with next deadline.
Enable interrupts: csrsi sstatus, 0x2
Phase 10: ecall / Trap-Vector Syscall Setup
Configure the trap vector and trap entry code to correctly dispatch
system calls arriving from U-mode via the ecall instruction.
stvec CSR configuration:
bits[1:0] = 0b00 (Direct mode): all traps — synchronous
exceptions, software interrupts, external interrupts — are
delivered to the single base address written to stvec. UmkaOS
uses Direct mode rather than Vectored mode (0b01) so that the
handler can perform a unified register-save before reading scause.
Trap entry sequence (all trap types, unified handler):
1. csrrw sp, sscratch, sp — swap user and kernel stack pointers.
sscratch holds the kernel stack top for this hart (set up in
Phase 1 entry assembly and refreshed on each U→S transition).
2. Save all general-purpose registers (x1-x31, or the full
RISC-V integer register file) to the per-hart trap frame at
the top of the kernel stack.
3. Read scause to determine the trap source. The top bit of scause
   distinguishes interrupts (bit set) from synchronous exceptions
   (bit clear); the codes below are the remaining low bits:
   - Exception code 8 (ecall from U-mode): syscall path.
   - Interrupt code 9 (supervisor external interrupt): PLIC claim/complete path.
   - Interrupt code 5 (supervisor timer interrupt): timer tick path.
   - Interrupt code 1 (supervisor software interrupt): IPI path.
   - Other synchronous exceptions: fault/signal path.
ecall handler (scause == 8):
Syscall number: a7 (per Linux RISC-V ABI, also known as the
SBI-compatible register assignment).
Arguments: a0–a5 (up to six arguments).
Return convention: a0 carries the return value (negative values
encode -errno on error); a1 carries a second return word for
certain multi-value returns (e.g., pipe(2) returns two file
descriptors in a0 and a1).
After handling, sepc is advanced by 4 (skip past the ecall
instruction, which is always 4 bytes) before SRET.
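The scause dispatch can be sketched as a pure function. The top bit of scause distinguishes interrupts from synchronous exceptions, which is why interrupt code 9 (external) and exception code 9 (ecall from S-mode) do not collide. Enum and function names are illustrative, not from the UmkaOS source:

```rust
/// Trap classification, per the unified-handler dispatch described above.
/// Illustrative sketch — names are not from the UmkaOS source.
#[derive(Debug, PartialEq)]
pub enum TrapKind {
    Syscall,       // ecall from U-mode (exception code 8)
    External,      // supervisor external interrupt: PLIC claim/complete
    TimerTick,     // supervisor timer interrupt
    Ipi,           // supervisor software interrupt
    Fault(u64),    // other synchronous exception
    OtherIrq(u64), // unexpected interrupt source
}

/// Classify an RV64 scause value. Bit 63 set = interrupt; low bits = code.
pub fn classify_scause(scause: u64) -> TrapKind {
    const INT: u64 = 1 << 63;
    if scause & INT != 0 {
        match scause & !INT {
            9 => TrapKind::External,
            5 => TrapKind::TimerTick,
            1 => TrapKind::Ipi,
            c => TrapKind::OtherIrq(c),
        }
    } else {
        match scause {
            8 => TrapKind::Syscall,
            c => TrapKind::Fault(c),
        }
    }
}
```

On the `Syscall` path the handler then reads a7 for the syscall number, a0-a5 for arguments, and advances sepc by 4 before SRET, as described above.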
Interrupt enable state:
sstatus.SIE (bit 1): supervisor interrupt enable, set to 1 after
trap entry saves state so nested interrupts are possible in
long-running handlers. Cleared on trap entry by hardware.
sstatus.SPIE (bit 5): previous SIE — saved and restored across
SRET to allow transparent interrupt-enable state on return.
sie.SEIE (bit 9): supervisor external interrupt enable (PLIC).
sie.SSIE (bit 1): supervisor software interrupt enable (IPI).
sie.STIE (bit 5): supervisor timer interrupt enable (already set
in Phase 9).
Verification test (executed during boot):
Issue ecall from S-mode (supervisor mode) to test the ecall
vector entry in stvec. The trap handler fires, reads scause to
verify cause == 9 (ecall from S-mode), and returns. This tests
trap vector setup — not user-mode. User-mode is not available
until the scheduler is initialized in Phase 11.
Phase 11: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
> **SMP bringup — RISC-V 64**: Secondary harts are brought online via
> the SBI HSM (Hart State Management) extension. The boot hart calls
> `sbi_hart_start(hartid, start_addr, opaque)` for each secondary.
> Each secondary enters at `start_addr` in S-mode, runs Phases 5-11
> (Sv48 page tables, PLIC, timer, ecall setup, per-CPU init), and joins
> the scheduler. The full SBI HSM calling sequence, per-hart stack
> allocation, and PLIC per-hart context initialization are specified in
> [Section 2.1.2.8 RISC-V 64 Boot Sequence](#2128-risc-v-64-boot-sequence).
2.1.2.9 Device Tree Blob Parsing
The Device Tree Blob (DTB) is the memory map and hardware description format shared by AArch64, ARMv7, RISC-V 64, PPC32, and PPC64LE. It serves the same role as the Multiboot1 info structure on x86 (Section 2.1.2.11), providing the kernel with memory layout and device addresses at boot.
DTB format (Flattened Device Tree / FDT):
Offset Field Size Description
0x00 magic u32 0xD00DFEED (big-endian)
0x04 totalsize u32 Total blob size in bytes
0x08 off_dt_struct u32 Offset to structure block
0x0C off_dt_strings u32 Offset to strings block
0x10 off_mem_rsvmap u32 Offset to memory reservation map
0x14 version u32 DTB version (17)
0x18 last_comp_ver u32 Last compatible version (16)
0x1C boot_cpuid_phys u32 Physical ID of boot CPU
0x20 size_dt_strings u32 Size of strings block
0x24 size_dt_struct u32 Size of structure block
All multi-byte fields are big-endian. The structure block contains a
flattened tree of nodes and properties encoded as tokens: FDT_BEGIN_NODE
(0x01), FDT_END_NODE (0x02), FDT_PROP (0x03), FDT_NOP (0x04),
FDT_END (0x09).
Minimal parser (umka-kernel/src/boot/dtb.rs):
The kernel implements a minimal, no-alloc DTB parser that walks the structure block once and extracts only what's needed for boot:
- Validate header: check magic (`0xD00DFEED`), version ≥ 16
- `/memory` nodes → collect `reg` property values as `MemoryRegion` array (base + size pairs), passed to `phys::init()`
- `/chosen` node → extract `bootargs` property (kernel command line)
- Interrupt controller → extract `reg` property from the node with the `interrupt-controller` property (GIC base for ARM, PLIC base for RISC-V)
- Timer → extract IRQ numbers from the `/timer` node `interrupts` property
- UART → extract `reg` property from the `/serial` or `stdout-path` device
The parser operates on raw byte slices with explicit big-endian reads and requires no heap allocation. Memory regions are collected into a fixed 64-entry array (matching the Multiboot1 parser's approach), since the parser runs during early boot before the heap allocator is available. Device tree nodes beyond this limit are parsed in a second pass after heap initialization.
Shared code: The DTB parser in umka-kernel/src/boot/dtb.rs is used by
all five non-x86 architectures. Each architecture's boot.rs calls
dtb::parse(dtb_addr) and passes the resulting memory regions to
phys::init().
2.1.2.10 Cross-Architecture Comparison
The following table summarizes which boot components are architecture-specific and which are shared across all six architectures:
| Phase | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE |
|---|---|---|---|---|---|---|
| Exception vectors | IDT (256 entries) | VBAR_EL1 (16 vectors) | VBAR CP15 (8 vectors) | stvec (Direct mode) | IVPR+IVORn | LPCR vector table |
| Memory map source | Multiboot1 info | DTB `/memory` | DTB `/memory` | DTB `/memory` | DTB `/memory` | DTB `/memory` |
| Page table format | 4-level PML4 (4 KB) | 4-level 4 KB granule | Short-desc 2-level (1 MB sections) | Sv48 4-level | 2-level (4 KB pages) | Radix tree (POWER9+) or HPT |
| IRQ controller | 8259 PIC (I/O ports) | GIC v2/v3 (MMIO, detected at runtime) | GICv2 (MMIO) | PLIC (MMIO) | OpenPIC (MMIO) | XIVE (MMIO) |
| Timer | PIT (I/O port 0x40) | Generic timer (system regs) | SP804 or generic timer | SBI ecall | Decrementer (DEC SPR) | Decrementer (DEC SPR) |
| Boot assembly | NASM (32→64 transition) | GNU as (EL1 entry) | GNU as (SVC entry) | GNU as (S-mode entry) | GNU as (supervisor entry) | GNU as (supervisor entry) |
| BSS clearing | entry.asm (rep stosd) | entry.S (str xzr loop) | entry.S (str r6 loop) | entry.S (sd zero loop) | entry.S (stw loop) | entry.S (std loop) |
| Phys allocator | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap |
| Heap allocator | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list |
| Capability system | shared | shared | shared | shared | shared | shared |
| Scheduler | shared | shared | shared | shared | shared | shared |
2.1.2.11 Multiboot1 Memory Map Parsing
boot/multiboot1.rs parses the Multiboot1 info structure (passed by GRUB/QEMU)
to extract the physical memory map:
- Read info structure flags to determine which fields are present
- If `FLAG_MEM` set: log basic memory sizes (lower/upper KB)
- If `FLAG_CMDLINE` set: log the kernel command line string
- If `FLAG_MMAP` set: iterate the memory map entries:
  - Each entry has: `base_addr` (u64), `length` (u64), `type` (u32)
  - Types: available (1), reserved (2), ACPI reclaimable (3), NVS (4), defective (5)
  - Unaligned reads used (`read_unaligned`) — Multiboot mmap entries may not be aligned
  - Collect up to 64 `MemoryRegion` structs, pass to `phys::init()`
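The type-field decoding above can be sketched as a small helper (the enum and function names are illustrative, not from the UmkaOS source):

```rust
/// Multiboot1 memory map entry types, per the list above.
/// Illustrative sketch — names are not from the UmkaOS source.
#[derive(Debug, PartialEq)]
pub enum MmapType {
    Available,       // 1
    Reserved,        // 2
    AcpiReclaimable, // 3
    Nvs,             // 4
    Defective,       // 5
    Unknown(u32),    // anything else: safest to treat as reserved
}

pub fn decode_mmap_type(t: u32) -> MmapType {
    match t {
        1 => MmapType::Available,
        2 => MmapType::Reserved,
        3 => MmapType::AcpiReclaimable,
        4 => MmapType::Nvs,
        5 => MmapType::Defective,
        other => MmapType::Unknown(other),
    }
}
```

Only `Available` regions are handed to `phys::init()` as free memory; everything else stays reserved.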
phys::init() processes the regions:
- Phase 1: Mark all available regions as free (page-aligned)
- Phase 2: Reserve first 1 MB (BIOS, VGA, legacy)
- Phase 3: Reserve kernel image (1 MB to __kernel_end)
2.1.2.12 Boot Allocator Design
The boot allocator (BootAlloc) is the physical-memory allocator used during
early boot, before the main buddy allocator (Section 4.1.1)
is initialized. Its design must satisfy two constraints in tension:
- It needs some memory before it can read the firmware memory map.
- It must not impose a hardcoded limit on total usable RAM.
These constraints are resolved with a two-phase design.
Phase 1 — Bootstrap (BSS pre-allocator)
Before the firmware memory map is parsed, a tiny fixed-size buffer resident in
.bss provides just enough memory to parse the firmware map and construct the
BootAlloc region table. This buffer is declared as a global static array:
/// Pre-allocator scratch buffer in .bss.
/// Used ONLY to construct the BootAlloc region table.
/// This is NOT a limit on usable memory — it is a staging area for parsing
/// the firmware map before BootAlloc is initialized.
static mut BOOTSTRAP_BUF: [u8; 64 * 1024] = [0u8; 64 * 1024];
static mut BOOTSTRAP_OFFSET: usize = 0;
This 64 KB BSS bootstrap buffer covers the worst-case cost of parsing firmware
memory map data structures. It is consumed once at boot and is never used again
after BootAlloc::init_from_* completes.
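The bootstrap bump allocation over this buffer can be sketched as follows. This is a simplified safe-Rust version working on a length/offset pair; the kernel version operates directly on the static BOOTSTRAP_BUF and BOOTSTRAP_OFFSET shown above, and the function name here is illustrative:

```rust
/// Bump-allocate `size` bytes with `align`-byte alignment from a scratch
/// buffer of length `buf_len`, advancing `offset`. Returns the start
/// offset of the allocation, or None if the buffer is exhausted.
/// Sketch only — the kernel works on BOOTSTRAP_BUF/BOOTSTRAP_OFFSET.
pub fn bootstrap_alloc(
    buf_len: usize,
    offset: &mut usize,
    size: usize,
    align: usize,
) -> Option<usize> {
    assert!(align.is_power_of_two());
    let start = (*offset + align - 1) & !(align - 1); // align up
    let end = start.checked_add(size)?;
    if end > buf_len {
        return None; // bootstrap buffer exhausted
    }
    *offset = end;
    Some(start)
}
```

Nothing allocated this way is ever freed: the buffer is consumed once during firmware-map parsing and then abandoned, which is why 64 KB suffices.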
Phase 2 — BootAlloc over all firmware-reported RAM
After the firmware map is parsed, BootAlloc is initialized with all
conventional memory regions reported by the firmware. It is a simple bump
allocator that walks regions in address order, moving to the next region when
the current one is exhausted:
/// One contiguous physical memory region reported by firmware.
pub struct MemRegion {
/// Base physical address (page-aligned).
pub base: PhysAddr,
/// Region size in bytes (page-aligned).
pub size: usize,
}
/// Pre-main-allocator memory manager.
///
/// Initialized from the firmware memory map; manages all conventional RAM
/// regions reported by firmware (UEFI MemoryMap, Multiboot1 mmap, or
/// Device Tree `/memory` nodes). All conventional memory is available for
/// allocation — there is no hardcoded cap on total usable memory.
///
/// Allocation strategy: bump allocator, advancing through `regions` in
/// address order. When `regions[current_region]` is exhausted, moves to
/// `regions[current_region + 1]`. Allocations are never freed — this
/// allocator is discarded once the buddy allocator takes over.
pub struct BootAlloc {
/// Firmware-reported memory regions, sorted by base address.
/// Populated by `init_from_multiboot1`, `init_from_uefi`, or `init_from_dtb`.
regions: [MemRegion; MAX_BOOT_REGIONS],
/// Number of valid entries in `regions`.
region_count: usize,
/// Index into `regions` for the current bump position.
current_region: usize,
/// Byte offset within `regions[current_region]` for the next allocation.
current_offset: usize,
}
/// Maximum number of distinct firmware memory map entries.
///
/// This caps the number of *separate address ranges*, not the total RAM size.
/// A 1 TB NUMA system may have 8-16 firmware-reported ranges; 64 covers all
/// realistic configurations (including heavily fragmented UEFI maps with many
/// reserved and reclaim regions alongside conventional memory ranges).
pub const MAX_BOOT_REGIONS: usize = 64;
Initialization entry points:
impl BootAlloc {
/// Initialize from a Multiboot1 mmap (x86-64).
/// Filters for type == 1 (available), page-aligns each region, skips
/// the first 1 MB (BIOS/legacy) and the kernel image.
pub fn init_from_multiboot1(mmap: &Multiboot1Mmap) -> Self;
/// Initialize from a UEFI memory map (future x86-64 UEFI path).
/// Filters for EfiConventionalMemory descriptor type.
pub fn init_from_uefi(map: &UefiMemoryMap) -> Self;
/// Initialize from Device Tree `/memory` nodes (all non-x86 architectures).
/// Uses the regions collected by the DTB parser in `boot/dtb.rs`.
pub fn init_from_dtb(regions: &[MemRegion]) -> Self;
}
Allocation:
impl BootAlloc {
/// Allocate `size` bytes with `align`-byte alignment from firmware RAM.
///
/// Advances the bump pointer through regions in address order until a
/// region has enough contiguous space to satisfy the request. Panics at
/// boot if no region can satisfy the request (indicates a firmware map
/// problem, not a normal condition — boot cannot continue anyway).
///
/// Returns a `PhysAddr` pointing to the allocated region.
pub fn alloc(&mut self, size: usize, align: usize) -> PhysAddr;
}
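A minimal sketch of the region-walking bump allocation behind this signature (simplified for illustration: `PhysAddr` reduced to `u64`, a `Vec` in place of the fixed `MAX_BOOT_REGIONS` array, and `None` in place of the boot-time panic; not the actual implementation):

```rust
/// Simplified stand-ins for the kernel types described above.
pub struct MemRegion { pub base: u64, pub size: usize }

pub struct BootAlloc {
    pub regions: Vec<MemRegion>, // fixed-size array in the kernel
    pub current_region: usize,
    pub current_offset: usize,
}

impl BootAlloc {
    /// Bump-allocate `size` bytes at `align` alignment, walking regions
    /// in address order. Sketch: returns None where the kernel panics.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<u64> {
        while self.current_region < self.regions.len() {
            let r = &self.regions[self.current_region];
            // Align the bump pointer up within the current region.
            let start = (r.base as usize + self.current_offset + align - 1)
                & !(align - 1);
            let off = start - r.base as usize;
            if off + size <= r.size {
                self.current_offset = off + size;
                return Some(start as u64);
            }
            // Region exhausted: advance to the next region in address order.
            self.current_region += 1;
            self.current_offset = 0;
        }
        None // no region can satisfy the request
    }
}
```

Note the cost of the bump design: an allocation that does not fit in the current region skips the region's tail permanently, which is acceptable here because BootAlloc lives only until `phys::init()` reclaims everything it did not hand out.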
Invariants:
- `regions` is sorted by `base` in ascending order after initialization.
- No region in `regions` overlaps the kernel image (`__kernel_start` to `__kernel_end`) or any reserved firmware region. These are subtracted out during initialization.
- `current_offset` is always a multiple of the requested alignment after each `alloc` call; the bump pointer is aligned up before each allocation.
- Once the buddy allocator is initialized (`phys::init()` completes), the `BootAlloc` instance is dropped and its memory is reclaimed.
Relationship to phys::init():
BootAlloc and phys::init() (the buddy allocator) both receive the same
firmware region list. BootAlloc uses it as a bump allocator for early boot
data structures. phys::init() builds a full buddy allocator over all
discovered RAM, then marks the pages consumed by BootAlloc as allocated
so they are not double-handed to userspace. The two-phase handoff is:
1. Firmware map parsed → BootAlloc::init_from_*(regions)
2. Early boot data structures allocated from BootAlloc
3. phys::init(regions) → buddy allocator built over all RAM
4. phys::mark_used(boot_alloc_consumed_pages) → reserve what BootAlloc used
5. BootAlloc dropped; all further allocation goes through buddy allocator
2.1.2.13 PPC32 Boot Sequence
PPC32 targets embedded PowerPC processors (e500, 440, etc.) using QEMU's ppce500
machine. The firmware (U-Boot or QEMU direct boot) passes a DTB pointer in r3.
Entry assembly (arch/ppc32/entry.S, GNU as syntax):
1. Firmware loads ELF and jumps to _start in supervisor mode
- r3 = DTB address
- r4 = kernel image start (optional)
- r5 = 0 (reserved)
2. _start:
a. Set up stack pointer (r1) from linker symbol
b. Clear BSS (.sbss + .bss)
c. Set up initial exception vectors (IVPR + IVORn)
d. Call umka_main(0, r3) [magic=0, info=DTB address]
The linker script (linker-ppc32.ld) places .text._start first at the kernel
load address. PPC32 uses big-endian byte order by default.
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (IVPR + IVORn)
Set IVPR to exception vector base address.
Initialize IVOR0-IVOR15 for each exception type:
- IVOR0 (Critical input), IVOR1 (Machine check)
- IVOR2 (Data storage), IVOR3 (Instruction storage)
- IVOR4 (External input), IVOR5 (Alignment)
- IVOR6 (Program), IVOR7 (Floating-point unavailable)
- IVOR8 (System call), IVOR9 (Auxiliary processor unavailable)
- IVOR10 (Decrementer), IVOR11 (Fixed interval timer)
- IVOR12 (Watchdog), IVOR13 (Data TLB)
- IVOR14 (Instruction TLB), IVOR15 (Debug)
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly).
Phase 3: DTB Parse
Parse the DTB passed in r3 (see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, OpenPIC base address, UART base.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Reserve kernel image.
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
Phase 6: Virtual Memory (2-level page tables)
Build identity-map using PPC32 2-level page table format:
- PGD (Page Directory): 1024 × 32-bit entries (4 KB)
- PTE (Page Table): 1024 × 32-bit entries per PGD entry (4 KB each)
- Use 4 KB pages with WIMG bits for cache policy
- Enable MMU via MSR[IR] and MSR[DR] bits
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: OpenPIC Initialization
Read OpenPIC base address from DTB.
- Configure interrupt vector base
- Set priority for each interrupt source
- Enable external interrupts via MSR[EE]
Phase 9: Decrementer Timer
Program the decrementer (DEC SPR) for periodic interrupts:
- Load initial value into DEC
- Decrementer exception is gated by MSR[EE] (already enabled in Phase 8)
Timer fires decrementer exception → reload DEC in handler.
Phase 10: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
> **SMP bringup — PPC32**: Secondary CPUs on embedded PPC (e500) are
> brought online via platform-specific firmware (U-Boot spin table or
> ePAPR boot protocol). The primary CPU writes the secondary entry
> point to a spin-table address, and the secondary polls until
> released. **Full specification deferred to Phase 3** — the spin-table
> protocol, per-CPU stack allocation, and OpenPIC per-CPU
> initialization will be detailed when PPC32 SMP is implemented.
2.1.2.14 PPC64LE Boot Sequence
PPC64LE targets IBM POWER processors (POWER8, POWER9, POWER10) in little-endian
mode. QEMU uses the pseries machine type with SLOF firmware, which passes a DTB
pointer in r3. Bare metal systems use OPAL (skiboot) firmware.
Entry assembly (arch/ppc64le/entry.S, GNU as syntax):
1. SLOF/OPAL loads ELF and jumps to _start in hypervisor or supervisor mode
- r3 = DTB address
- r4 = 0 (reserved)
- MSR: 64-bit mode (SF=1), little-endian (LE=1)
2. _start:
a. Set up TOC pointer (r2) from .TOC. symbol
b. Set up stack pointer (r1) from linker symbol
c. Clear BSS
d. Set up initial exception vectors
e. Call umka_main(0, r3) [magic=0, info=DTB address]
The linker script (linker-ppc64le.ld) places .text._start first at the kernel
load address. PPC64LE uses the ELFv2 ABI with little-endian byte order.
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (LPCR + HSPRG0/1)
Set HSPRG0 to per-CPU data pointer.
Configure LPCR[AIL] (Alternate Interrupt Location) to select the exception vector base.
Initialize system reset and machine check handlers.
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly).
Phase 3: DTB Parse
Parse the DTB passed in r3 (see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, XIVE base addresses, UART base.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Reserve kernel image.
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
Phase 6: Virtual Memory (Radix MMU on POWER9+, HPT on POWER8)
Detect MMU type from DTB or CPU features:
- POWER9+: Use Radix MMU (4-level page tables: PGD→PUD→PMD→PTE, 4 KB/64 KB/2 MB pages)
Configure LPCR[HR] = 1 for Radix mode.
Set up process table (PRTB) and page table root (PGD).
- POWER8: Use HPT (Hash Page Table). Base page size is 4 KB by default (64 KB when the kernel is built for 64 KB pages); 16 MB is available as a huge page size.
Configure LPCR[HR] = 0 for HPT mode.
Set up HPT base and size in SDR1.
Enable MMU via MSR[IR] and MSR[DR] bits.
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: XIVE Interrupt Controller
Read XIVE base addresses from DTB.
- Initialize Interrupt Controller (IC) registers
- Initialize Thread Interrupt Management (TIMA)
- Configure interrupt priorities and routing
- Enable external interrupts via MSR[EE]
Phase 9: Decrementer Timer
Program the decrementer (DEC SPR) for periodic interrupts:
- Load initial value into DEC (32-bit, wraps at 0)
- Decrementer exception is gated by MSR[EE] (already enabled in Phase 8)
Timer fires decrementer exception → reload DEC in handler.
Note: POWER9+ also provides HDEC (Hypervisor Decrementer), used when running as a hypervisor to time guest execution.
Phase 10: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
> **SMP bringup — PPC64LE**: Secondary CPUs on POWER systems are
> brought online via OPAL (OpenPOWER Abstraction Layer) on bare metal
> or RTAS (Run-Time Abstraction Services) under PowerVM. OPAL
> provides `opal_start_cpu(server_no, start_address)`. SLOF (QEMU)
> uses the device-tree `/cpus/cpu@N/ibm,ppc-interrupt-server#s`
> property and a spin-table release mechanism. **Full specification
> deferred to Phase 3** — the OPAL/RTAS calling convention, per-CPU
> stack allocation, and XIVE per-CPU initialization will be detailed
> when PPC64LE SMP is implemented.
2.1.2.15 Interrupt Controller Architecture: GIC (AArch64/ARMv7) and PLIC (RISC-V)
The x86-64 interrupt architecture (8259 PIC remapped through the IOAPIC, with per-CPU LAPIC) is described in Phase 2 of the x86-64 boot sequence. ARM and RISC-V use different interrupt controllers with distinct initialization models. This section specifies those controllers at the level of detail required to implement the UmkaOS Tier 0 interrupt initialization code.
AArch64 / ARMv7: GIC (Generic Interrupt Controller)
ARM platforms use the GIC family. The GIC version is detected at boot from the
Device Tree compatible string or from the ACPI MADT entry set: Type 0x0B
(GICC), Type 0x0C (GICD), and Type 0x0E (GICR). UmkaOS supports GICv2 and
GICv3/v4.
GICv2 (ARM Cortex-A9, A15, A17, and earlier server SoCs):
- GICD (Distributor): a single MMIO block shared by all CPUs. Controls SPI routing, enable/disable per-IRQ, and priority configuration.
- GICC (CPU Interface): a separate MMIO block, one per CPU, accessed at a fixed per-CPU stride. Provides IAR (Interrupt Acknowledge Register) and EOIR (End-of-Interrupt Register) for claim/complete cycles.
GICv3 / GICv4 (ARM Neoverse, Cortex-A55/A75 and later, all current server and mobile SoCs):
- GICD (Distributor): single shared MMIO block. On GICv3, affinity routing is enabled by setting GICD_CTLR.ARE_S=1 / ARE_NS=1. SPIs (IRQs 32-1019) are routed to CPUs via GICD_IROUTER[n] (64-bit affinity value matching MPIDR_EL1).
- GICR (Redistributor): one MMIO region per CPU, containing an LPI frame and an SGI/PPI frame (64 KB each, giving a 128 KB stride per CPU on GICv3; 256 KB on GICv4, which adds two VLPI frames). The GICR is discovered by walking a contiguous array of redistributor frames or from ACPI MADT.
- ICC system registers: On GICv3, the CPU interface is accessed entirely through system registers (no per-CPU MMIO). ICC_SRE_EL1.SRE=1 must be set first to enable system-register access; if running under a hypervisor, ICC_SRE_EL2.SRE=1 and ICC_SRE_EL2.Enable=1 must also be set.
IRQ taxonomy (all GIC versions):
| Range | Name | Description |
|---|---|---|
| 0–15 | SGI (Software Generated Interrupts) | Inter-processor interrupts. Written to GICD_SGIR (GICv2) or ICC_SGI1R_EL1 (GICv3). Delivered only to the targeted CPU(s). |
| 16–31 | PPI (Private Peripheral Interrupts) | Per-CPU, non-shared. Arch timer: PPI 26 = EL2 Physical Timer (CNTHP_IRQ), PPI 27 = EL1 Virtual Timer (CNTV_IRQ), PPI 29 = Secure Physical Timer, PPI 30 = Non-secure EL1 Physical Timer (CNTP_IRQ). |
| 32–1019 | SPI (Shared Peripheral Interrupts) | Platform devices: UART, PCIe, USB, storage controllers. Routed via GICD_ITARGETSR (GICv2) or GICD_IROUTER (GICv3). |
| 8192+ | LPI (Locality-specific Peripheral Interrupts, GICv3+) | MSI-based, used for PCIe MSI and MSI-X. Backed by an in-memory interrupt property table and pending table allocated by the kernel. |
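The taxonomy above maps directly to a range classifier; a minimal sketch (the enum and function names are illustrative, not from the UmkaOS source):

```rust
/// GIC interrupt classes, keyed by INTID range (all GIC versions;
/// LPIs exist only on GICv3+).
#[derive(Debug, PartialEq, Eq)]
pub enum IntKind {
    Sgi,      // 0-15: software-generated (IPIs)
    Ppi,      // 16-31: per-CPU private peripherals
    Spi,      // 32-1019: shared peripherals
    Special,  // 1020-1023: spurious / special INTIDs
    Reserved, // 1024-8191: reserved range
    Lpi,      // 8192+: message-based (GICv3+)
}

pub fn classify_intid(intid: u32) -> IntKind {
    match intid {
        0..=15 => IntKind::Sgi,
        16..=31 => IntKind::Ppi,
        32..=1019 => IntKind::Spi,
        1020..=1023 => IntKind::Special,
        1024..=8191 => IntKind::Reserved,
        _ => IntKind::Lpi,
    }
}
```

The 1020–1023 range (not shown in the table) is reserved for special INTIDs such as the spurious interrupt, 1023.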
GICv3 initialization sequence (per-system, once):
1. Read GIC base addresses from DTB or ACPI MADT.
2. Map GICD MMIO and GICR MMIO regions.
3. Disable GICD: write GICD_CTLR = 0. Wait for GICD_CTLR.RWP=0.
4. Enable affinity routing: GICD_CTLR = ARE_NS | EnableGrp1NS.
5. Configure SPI priorities: GICD_IPRIORITYR[n] for each SPI.
6. Configure SPI routing: GICD_IROUTER[n] = MPIDR affinity of target CPU
   (or set the IRM bit, bit 31, for any-affinity / 1-of-N routing).
7. Enable GICD: GICD_CTLR.EnableGrp1NS = 1.
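The affinity value written to GICD_IROUTER in step 6 can be derived from the target CPU's MPIDR_EL1 with a mask, since both registers place Aff0–Aff2 in bits [23:0] and Aff3 in bits [39:32]; a minimal sketch (helper names are illustrative):

```rust
/// Interrupt Routing Mode bit: when set, the SPI may be delivered to
/// any participating PE instead of a specific affinity (1-of-N).
pub const GICD_IROUTER_IRM: u64 = 1 << 31;

/// Build a GICD_IROUTER value targeting the CPU identified by `mpidr`
/// (the raw MPIDR_EL1 value). The MPIDR's MT/U/RES1 bits (bits [31:24])
/// and everything above Aff3 must be masked out.
pub fn irouter_for_mpidr(mpidr: u64) -> u64 {
    mpidr & 0x0000_00FF_00FF_FFFF
}
```

Writing `GICD_IROUTER_IRM` alone instead of an affinity value requests any-affinity routing.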
GICv3 per-CPU initialization sequence (executed on each CPU, including secondaries):
1. Locate this CPU's GICR frame (match GICR_TYPER.Affinity against MPIDR_EL1).
2. Wake redistributor: clear GICR_WAKER.ProcessorSleep, poll until
GICR_WAKER.ChildrenAsleep = 0.
3. Enable ICC system registers: write ICC_SRE_EL1 = SRE | DFB | DIB.
Execute ISB.
4. Set ICC_PMR_EL1 = 0xFF (accept all interrupt priorities).
5. Set ICC_BPR1_EL1 = 0 (no binary point split; all priority bits used).
6. Enable Group 1 interrupts: write ICC_IGRPEN1_EL1 = 1. Execute ISB.
7. Configure PPI priorities: GICR_IPRIORITYR[n] for timer PPI (PPI 27 = IRQ 27 = EL1 virtual timer CNTV_IRQ).
8. Enable timer PPI: GICR_ISENABLER0 |= (1 << 27).
Exception routing for interrupts (AArch64):
When an IRQ fires while the kernel runs at EL1 on SP_EL1, the CPU jumps to the IRQ vector at VBAR_EL1 + 0x280 (Current EL with SPx, IRQ); an IRQ taken from EL0 vectors to VBAR_EL1 + 0x480 (Lower EL using AArch64). The handler reads ICC_IAR1_EL1 to obtain the IRQ ID, dispatches to the registered handler, then writes ICC_EOIR1_EL1 with the same IRQ ID to signal completion. Priority drop (the ICC_EOIR1_EL1 write) and deactivation (an ICC_DIR_EL1 write) may be split by setting EOImode=1 in ICC_CTLR_EL1 for fine-grained priority management.
RISC-V: PLIC (Platform-Level Interrupt Controller)
The PLIC is the standard external interrupt controller for RISC-V supervisor-mode
software. It is discovered from the Device Tree node with compatible = "riscv,plic0"
or "sifive,plic-1.0.0", which provides the MMIO base address and the number
of interrupt sources.
PLIC memory map (all offsets are from the PLIC base address):
Offset 0x000000 + source*4: Source priority register (0=disabled, 1-7=priority level)
Offset 0x001000 + word*4: Interrupt pending bits (read-only, one bit per source)
Offset 0x002000 + ctx*0x80 + word*4: Interrupt enable bits (one bit per source, per context)
Offset 0x200000 + ctx*0x1000: Priority threshold register (0=accept all, 7=accept none)
Offset 0x200004 + ctx*0x1000: Claim/Complete register (read=claim highest-priority
pending IRQ; write=signal completion for that IRQ ID)
A context maps to (hart_id × modes_per_hart) + mode_index. On standard RISC-V implementations with M-mode and S-mode per hart: context = hart_id × 2 + mode, where mode 0 = M-mode, mode 1 = S-mode. UmkaOS uses S-mode contexts exclusively.
PLIC initialization sequence:
1. Discover PLIC base from DTB; map the MMIO region.
2. For each interrupt source (1 to max_source):
a. Set priority: PLIC[0x000000 + source*4] = desired_priority (1-7, or 0 to disable).
3. For each hart:
a. Compute S-mode context: ctx = hart_id * 2 + 1.
b. Set threshold to 0 (accept any non-zero priority):
PLIC[0x200000 + ctx*0x1000] = 0.
c. Enable desired sources:
PLIC[0x002000 + ctx*0x80 + (source/32)*4] |= (1 << (source % 32)).
4. Enable PLIC external interrupts in sie CSR: sie.SEIE = 1 (bit 9).
IRQ handling sequence (trap, scause = 9, External interrupt):
1. Read claim register: source_id = PLIC[0x200004 + ctx*0x1000].
A zero return means no interrupt is pending (spurious); ignore.
2. Dispatch to the registered handler for source_id.
3. Write completion: PLIC[0x200004 + ctx*0x1000] = source_id.
This deasserts the interrupt and allows new interrupts of equal or
lower priority to be delivered.
IPI delivery (RISC-V):
IPIs do not go through the PLIC. They use the SBI IPI extension
(Extension ID: 0x735049 = ASCII "sPI"). The primary hart calls
sbi_send_ipi(hart_mask, hart_mask_base) to set a software interrupt
pending on one or more target harts. On the receiving hart, the software
interrupt fires as a supervisor software interrupt (scause = 1, sie.SSIE = 1).
UmkaOS clears the IPI by writing sip.SSIP = 0 in the IPI handler and then
dispatches the pending IPI work item from the per-hart IPI queue.
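The `(hart_mask, hart_mask_base)` pair passed to `sbi_send_ipi` encodes targets as bit i = hart (base + i); a hypothetical helper to build that pair from a target list:

```rust
/// Build the (hart_mask, hart_mask_base) pair for sbi_send_ipi from a
/// set of target hart IDs. Bit i of hart_mask selects hart
/// (hart_mask_base + i). Returns None for an empty set, or when the
/// targets span more than 64 harts (which would require a second call).
pub fn sbi_hart_mask(harts: &[usize]) -> Option<(u64, usize)> {
    let base = *harts.iter().min()?;
    let mut mask = 0u64;
    for &h in harts {
        let bit = h - base;
        if bit >= 64 {
            return None; // window too wide for one SBI call
        }
        mask |= 1u64 << bit;
    }
    Some((mask, base))
}
```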
2.1.2.16 NUMA Topology Discovery
On x86-64 and ARM SBSA/server platforms, NUMA topology is provided by ACPI tables: SRAT (System Resource Affinity Table) maps memory ranges and APIC / MPIDR IDs to NUMA node numbers, while SLIT (System Locality Information Table) provides the distance matrix. UmkaOS parses SRAT and SLIT during static table parsing (Phase 1 of x86-64 initialization; see Section 2.1.2.5).
On platforms that boot with a Device Tree (AArch64 embedded, RISC-V, PPC32, PPC64LE), NUMA topology is encoded directly in the Device Tree. UmkaOS performs DT-based NUMA discovery as a post-DTB-parse step for all non-x86 architectures.
Device Tree NUMA encoding:
/cpus/cpu@N
numa-node-id = <0>; // NUMA node this CPU belongs to
/memory@40000000
device_type = "memory";
reg = <0x0 0x40000000 0x0 0x40000000>;
numa-node-id = <0>; // NUMA node this memory range belongs to
/memory@200000000
device_type = "memory";
reg = <0x2 0x00000000 0x2 0x00000000>;
numa-node-id = <1>; // Second NUMA node
/distance-map // Optional; absent on many embedded platforms
compatible = "numa-distance-map-v1";
distance-matrix =
<0 0 10>, // Node 0 → Node 0: local (normalized to 10)
<0 1 20>, // Node 0 → Node 1: remote
<1 0 20>, // Node 1 → Node 0: remote
<1 1 10>; // Node 1 → Node 1: local
UmkaOS DT-based NUMA discovery algorithm:
1. Walk all /cpus/cpu@N nodes. For each cpu node:
a. Read the reg property (MPIDR affinity / hart ID / PIR).
b. Read numa-node-id. If absent, assign to node 0.
c. Record: cpu_id → numa_node mapping.
2. Walk all /memory@... nodes. For each memory node:
a. Read reg (base, size) pairs.
b. Read numa-node-id. If absent, assign all memory to node 0.
c. Record: [base, base+size) → numa_node mapping (passed to phys::init).
3. If /distance-map node is present:
a. Parse distance-matrix property: triples of (from_node, to_node, distance).
b. Populate NumaDistanceMatrix[from][to] = distance.
c. Distances are normalized: local access = 10. Remote = proportionally higher.
If /distance-map is absent:
a. Assume symmetric topology: all local accesses cost 10, all remote
accesses cost 20 (single-hop assumption). This is conservative but safe.
4. Validate: ensure every CPU maps to a node that has at least some memory.
If a CPU's node has no memory (misconfigured DTB), log a warning and
migrate the CPU to the nearest node with memory (lowest distance score).
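Step 3 of the algorithm above can be sketched as a small matrix type with both population paths (a sketch using std `Vec` for brevity; the kernel's actual container types will differ):

```rust
/// NUMA distance matrix built from the /distance-map node, or from the
/// symmetric fallback (local = 10, remote = 20) when the node is absent.
pub struct NumaDistanceMatrix {
    n: usize,
    d: Vec<u8>, // row-major n x n
}

impl NumaDistanceMatrix {
    /// Fallback: single-hop symmetric topology.
    pub fn symmetric_fallback(num_nodes: usize) -> Self {
        let mut m = NumaDistanceMatrix {
            n: num_nodes,
            d: vec![20; num_nodes * num_nodes],
        };
        for i in 0..num_nodes {
            m.d[i * num_nodes + i] = 10; // local access
        }
        m
    }

    /// Populate from parsed (from_node, to_node, distance) triples.
    pub fn from_triples(num_nodes: usize, triples: &[(usize, usize, u8)]) -> Self {
        let mut m = Self::symmetric_fallback(num_nodes);
        for &(from, to, dist) in triples {
            m.d[from * num_nodes + to] = dist;
        }
        m
    }

    pub fn distance(&self, from: usize, to: usize) -> u8 {
        self.d[from * self.n + to]
    }
}
```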
Per-architecture specifics:
ARM server (AWS Graviton 3, Ampere Altra, Neoverse N2/V2 platforms): Prefer ACPI SRAT over Device Tree on SBSA-compliant platforms (ACPI is mandatory on SBSA). The SRAT Memory Affinity Structure and Processor Affinity Structure (Types 1 and 0) map MPIDR values and memory ranges to NUMA proximity domains. Distance values come from SLIT. On platforms that provide both ACPI and a Device Tree (Graviton 3 exposes both), ACPI takes precedence.
RISC-V: No ACPI on most RISC-V platforms. The distance-map DT node is rarely populated on current RISC-V hardware (SiFive HiFive Unmatched, StarFive VisionFive 2). UmkaOS applies the symmetric topology fallback (local=10, remote=20) on RISC-V when the distance-map node is absent. Future multi-socket RISC-V server designs (expected from Ventana, SiFive, Alibaba T-Head) will populate distance-map.
PPC64LE (POWER10):
IBM POWER systems encode NUMA topology using the proprietary
ibm,associativity and ibm,associativity-reference-points DT properties:
/cpus/cpu@0
ibm,associativity = <4 0 0 0 0>;
// Four levels of hierarchy: chip group / chip / core / thread.
// The reference-points property selects which levels to use for
// NUMA distance calculation.
/ibm,associativity-reference-points = <0x4 0x2>;
// Level index 4 (first element) = domain/chip-group boundary.
// Level index 2 (second element) = chip boundary.
// Distance between CPUs sharing the same value at each level:
// same at both levels = local (same chip) → distance 10
// same at first but different at second = 1 hop → distance 20
// different at first = multiple hops → distance 40
UmkaOS parses ibm,associativity-reference-points first to determine the number
of distance levels, then for each CPU and memory node reads ibm,associativity
to compute the NUMA node assignment and inter-node distance matrix.
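The distance rule in the example's comments (10 for a full match, doubling at each reference point once the arrays diverge) can be sketched as a hypothetical helper; the exact ibm,associativity semantics are IBM-specific and this is only an illustration of the rule stated above:

```rust
/// Compute a NUMA distance from two ibm,associativity arrays using the
/// ibm,associativity-reference-points level indices (coarsest first).
/// Element 0 of each array holds the level count; reference points are
/// indices into the array. Once the arrays diverge at one reference
/// point, all finer-grained points count as diverged too.
pub fn ppc_assoc_distance(a: &[u32], b: &[u32], ref_points: &[usize]) -> u32 {
    let mut dist = 10; // full match = local
    let mut diverged = false;
    for &rp in ref_points {
        if diverged || a.get(rp) != b.get(rp) {
            diverged = true;
            dist *= 2; // each unshared level doubles the distance
        }
    }
    dist
}
```

With `ref_points = [4, 2]` as in the example, a same-chip pair yields 10, a same-chip-group/different-chip pair yields 20, and a different-chip-group pair yields 40.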
2.1.2.17 Per-Architecture Extended State (FPU) Initialization
Each architecture requires explicit initialization to enable floating-point and SIMD registers before they can be used by kernel or user code. UmkaOS uses a lazy FP strategy on all architectures: extended state is not saved at every context switch, but only when the task has actually used FP/SIMD registers.
x86-64:
FPU/SSE/AVX/XSAVE initialization runs during early boot (before interrupts are enabled, after the physical memory manager is initialized):
1. Detect XSAVE: CPUID leaf 0x1, ECX bit 26 (XSAVE; bit 27, OSXSAVE, reflects
   CR4.OSXSAVE once the OS sets it). If absent, fall back to legacy FXSAVE
   (x87/SSE state only, 512 bytes).
2. Set CR0: CR0.EM = 0 (no FPU emulation), CR0.MP = 1 (monitor coprocessor).
3. Set CR4.OSFXSR = 1 (enable FXSAVE/FXRSTOR for SSE state).
Set CR4.OSXSAVE = 1 (enable XSAVE/XRSTOR for extended state).
4. Query CPUID leaf 0xD, sub-leaf 0 (EDX:EAX) to discover which extended
   state components the hardware supports:
   XCR0 bit 0 = x87 FPU, bit 1 = SSE, bit 2 = AVX, bits 5-7 = AVX-512,
   bit 9 = PKRU, bits 17-18 = AMX tile config/data.
5. Enable all supported components: write XCR0 with the bitmask of present
components (CPUID leaf 0xD, sub-leaf 0 provides the valid bit set).
6. Lazy context switch: set CR0.TS = 1 (task switched). First FP use from
any task triggers a #NM (Device Not Available) exception. The handler
loads the task's saved FP state and clears CR0.TS before returning.
On context switch out: if CR0.TS was clear (task used FP), save the
extended state via XSAVE[OPT/C] to the per-task XSAVE area.
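Steps 4–5 reduce to computing the XCR0 mask from the CPUID-reported valid-bit set; a hypothetical helper (the function name and the all-or-nothing AVX-512 policy are illustrative):

```rust
/// XCR0 component bits (XSAVE-managed state).
pub const XCR0_X87: u64 = 1 << 0;
pub const XCR0_SSE: u64 = 1 << 1;
pub const XCR0_AVX: u64 = 1 << 2;
pub const XCR0_AVX512: u64 = 0b111 << 5; // opmask, ZMM_Hi256, Hi16_ZMM
pub const XCR0_PKRU: u64 = 1 << 9;

/// Compute the XCR0 value to write, given the valid-bit set reported by
/// CPUID leaf 0xD sub-leaf 0 (EDX:EAX) and the components the kernel is
/// willing to manage. XCR0 bit 0 (x87) must architecturally be set; SSE
/// is kept alongside it. The three AVX-512 bits are enabled together or
/// not at all.
pub fn xcr0_to_enable(cpuid_supported: u64, kernel_managed: u64) -> u64 {
    let mut x = cpuid_supported & kernel_managed;
    x |= XCR0_X87 | XCR0_SSE; // required baseline
    if x & XCR0_AVX512 != XCR0_AVX512 {
        x &= !XCR0_AVX512; // all three AVX-512 bits or none
    }
    x
}
```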
AArch64:
1. NEON/FP enable: Write CPACR_EL1.FPEN = 0b11 (no trapping of FP/NEON
instructions at EL1 or EL0). Without this, any NEON/FP instruction
from EL0 or EL1 causes an Undefined Instruction exception.
(UmkaOS's entry.S already sets FPEN=0b11 for the boot CPU to allow
Rust-generated NEON instructions in the early kernel; secondary CPUs
set FPEN=0b11 in their secondary_entry stubs.)
2. Lazy context switch: use CPACR_EL1.FPEN = 0b00 (trap FP/NEON from
all ELs) to detect first use. On the resulting trap (ESR_EL1.EC=0x07,
FP/NEON access from AArch64), load the task's saved FP state and set
FPEN=0b11 before returning. On context switch out: if FPEN was 0b11
(task used FP), save Q0-Q31 + FPSR + FPCR to the per-task FP frame.
3. SVE (Scalable Vector Extension, ARMv8.2+): If CPUID reports SVE
(ID_AA64PFR0_EL1.SVE != 0), set ZCR_EL1.LEN to the desired vector
length in 128-bit granules minus 1 (0 = 128-bit, 1 = 256-bit, up to the
hardware maximum, probed by writing all-ones to ZCR_EL1.LEN and reading
back the effective value). Set CPACR_EL1.ZEN = 0b11 (no trapping of SVE
at EL1/EL0).
SVE state (Z registers, P registers, FFR) is saved/restored separately
from the NEON state, using the larger per-task SVE frame.
4. SME (Scalable Matrix Extension, ARMv9.2+): Enabled via CPACR_EL1.SMEN
and SMCR_EL1.LEN. SME streaming mode and ZA register file are saved as
part of the per-task SME frame on context switch.
RISC-V:
sstatus.FS field (bits [14:13]) controls FP state:
0b00 = Off: Any FP instruction causes an Illegal Instruction exception.
0b01 = Initial: FP registers accessible; initial (clean) state.
0b10 = Clean: FP registers accessible; not modified since last save.
0b11 = Dirty: FP registers accessible; modified since last save.
1. At boot (on each hart): set sstatus.FS = 0b01 (Initial). This enables
FP instructions without immediately requiring a context-switch save.
2. Lazy save: set sstatus.FS = 0b00 (Off) on context switch in for tasks
that have not used FP. First FP instruction traps (Illegal Instruction,
scause = 2). The handler sets sstatus.FS = 0b01 and returns; the FP
instruction re-executes. On context switch out: if sstatus.FS == 0b11
(Dirty), save all 32 FP registers (f0-f31) plus fcsr to the per-task
FP frame, then set sstatus.FS = 0b10 (Clean). This avoids saving FP
state for tasks that never use FP.
3. Vector extension (V): If sstatus.VS (bits [10:9]) is supported, manage
the V register file (v0-v31, vtype, vl, vlenb) identically to the FP
FS field. VS = 0b00 traps; set on first use; save on switch-out if Dirty.
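The FS field manipulation above is plain bit arithmetic; a minimal sketch of the accessors the context-switch path would use (names are illustrative):

```rust
/// sstatus.FS field (bits [14:13]) values.
pub const FS_SHIFT: u64 = 13;
pub const FS_MASK: u64 = 0b11 << FS_SHIFT;
pub const FS_OFF: u64 = 0b00;
pub const FS_INITIAL: u64 = 0b01;
pub const FS_CLEAN: u64 = 0b10;
pub const FS_DIRTY: u64 = 0b11;

/// Extract the FS field from a raw sstatus value.
pub fn sstatus_get_fs(sstatus: u64) -> u64 {
    (sstatus & FS_MASK) >> FS_SHIFT
}

/// Return sstatus with the FS field replaced.
pub fn sstatus_set_fs(sstatus: u64, fs: u64) -> u64 {
    (sstatus & !FS_MASK) | (fs << FS_SHIFT)
}

/// Switch-out decision: save FP state only if this task actually
/// dirtied the register file since the last save.
pub fn needs_fp_save(sstatus: u64) -> bool {
    sstatus_get_fs(sstatus) == FS_DIRTY
}
```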
PPC32 / PPC64LE:
The MSR (Machine State Register) contains separate enable bits for each
extended register file:
MSR.FP (bit 13): FPU enable. 0 = FP instructions cause FP Unavailable exception.
MSR.VEC (bit 25): AltiVec/VMX enable. 0 = VMX instructions cause VMX Unavailable.
MSR.VSX (bit 23): VSX enable (PPC64 only). 0 = VSX instructions cause VSX Unavailable.
1. At boot: clear MSR.FP, MSR.VEC, MSR.VSX (all zero after reset; verify).
2. Lazy enable: the FP/VMX/VSX Unavailable exception fires on first use.
The handler sets the corresponding MSR bit and returns. The instruction
re-executes.
3. On context switch out: if any of MSR.FP / MSR.VEC / MSR.VSX is set,
save the corresponding register file (32 FPRs + FPSCR, 32 VMX registers
+ VSCR/VRSAVE, 64 VSX registers) to the per-task frame, then clear the
MSR bit to re-arm the trap for the next task.
4. On context switch in: do NOT restore FP state until first use (the
Unavailable trap will do that). This means tasks that were FP-active
when they were switched out will take one Unavailable trap on their
next quantum — a single additional exception per task per scheduling
interval, which is acceptable given the benefit of skipping FP restore
for FP-idle tasks.
UmkaOS's unified lazy FP policy:
All six architectures implement the same semantic contract:
- Tasks that never issue a FP/SIMD instruction pay zero extended-state save or restore cost at every context switch.
- The first FP/SIMD instruction in a task's lifetime triggers one trap, which loads the task's initial (zero) FP state and marks the task as FP-active.
- Subsequent context switches for FP-active tasks check the architecture's dirty indicator (CR0.TS cleared / FS=Dirty / MSR.FP set) and save only when needed.
- The per-task FP frame is allocated at task creation (sized to the largest extended state the hardware can produce on that platform, as determined by XSAVE area size on x86, SVE vector length on AArch64, or fixed sizes on RISC-V/PPC) and freed at task exit.
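The semantic contract above can be expressed as an architecture-neutral state machine that each arch layer maps its dirty indicator onto; a sketch (enum and function names are illustrative, not from the UmkaOS source):

```rust
/// Architecture-neutral view of the unified lazy-FP contract. The arch
/// layer translates its dirty indicator (CR0.TS clear, FS=Dirty,
/// MSR.FP set, ...) into this enum at context-switch time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FpState {
    /// Task has never executed an FP/SIMD instruction: zero save cost.
    NeverUsed,
    /// Task is FP-active; registers match the saved per-task frame.
    Clean,
    /// Task is FP-active; registers modified since the last save.
    Dirty,
}

/// Switch-out: true if the extended state must be written to the
/// per-task frame.
pub fn save_on_switch_out(state: FpState) -> bool {
    state == FpState::Dirty
}

/// First-use trap: a NeverUsed task becomes FP-active with a zeroed
/// frame loaded; already-active tasks are unchanged.
pub fn on_first_use_trap(state: FpState) -> FpState {
    match state {
        FpState::NeverUsed => FpState::Clean,
        other => other,
    }
}
```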
2.1.3 Production Boot Target
The following subsections describe the target boot architecture for production deployments. None of this is implemented yet — it represents the design goal that the Multiboot implementation will evolve toward (see Section 2.1.7 for the migration path).
2.1.3.1 Goal: Drop-in Kernel Package
UmkaOS installs as a standard kernel package alongside the existing Linux kernel. The user can dual-boot between them using the GRUB menu.
# # Debian / Ubuntu
apt install umka
update-initramfs -c -k umka-1.0.0
update-grub
# # RHEL / Fedora
dnf install umka
dracut --force /boot/initramfs-umka-1.0.0.img umka-1.0.0
grub2-mkconfig -o /boot/grub2/grub.cfg
# # Arch Linux
pacman -S umka
mkinitcpio -p umka
# # Reboot, select "UmkaOS 1.0.0" from GRUB menu
# # Existing Linux kernel is always available as a fallback entry
2.1.3.2 Boot Requirements
- Image format: ELF kernel image with an embedded PE/COFF stub header, compatible
  with GRUB2 (loading as ELF), systemd-boot, and UEFI direct boot (loading as
  PE/COFF). Installed as /boot/vmlinuz-umka-VERSION (the "vmlinuz" name is a
  convention; the actual format is a PE/COFF-stubbed ELF, similar to Linux's
  bzImage with EFISTUB).
- Boot protocol: x86 Linux boot protocol (for BIOS legacy boot) and UEFI stub
  (for UEFI direct boot). Both are supported.
- Initramfs: Custom initramfs containing UmkaOS-native drivers for early boot
  (storage controller, root filesystem). Built using standard tools (dracut,
  mkinitcpio) with UmkaOS-specific hooks.
- /boot layout: Fully compatible with existing distribution tools:
  /boot/vmlinuz-umka-VERSION
  /boot/initramfs-umka-VERSION.img
  /boot/System.map-umka-VERSION (optional, for debugging)
- Kernel command line: Standard Linux cmdline parameters are parsed and honored
  (root=, console=, quiet, init=, rw/ro, etc.).
2.1.3.3 Target Boot Sequences
x86-64 (production):
1. UEFI firmware (PE/COFF stub) / BIOS bootloader loads kernel image
2. Boot stub (Rust/asm) sets up:
- Identity-mapped page tables
- GDT, IDT stubs
- Stack
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse boot parameters and ACPI tables
b. Initialize physical memory allocator (from e820/UEFI memory map)
c. Initialize virtual memory (kernel page tables, PCID)
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: APIC, timer, early console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init (typically systemd)
AArch64 (production):
1. UEFI firmware or QEMU -kernel loads the ELF, jumps to _start in EL1
2. Boot stub (assembly) sets up:
- Exception vectors (VBAR_EL1)
- Stack pointer
- MMU disabled (identity-mapped initially)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) passed in x0
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (TTBR0_EL1/TTBR1_EL1, ASID, TCR_EL1)
d. Initialize per-CPU data structures (MPIDR_EL1 affinity)
e. Initialize Tier 0 drivers: GIC (distributor + redistributor), generic timer, early console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
No microcode loading is performed — ARM firmware updates are handled by the platform firmware (UEFI capsule updates or vendor-specific mechanisms), not the kernel. This is architecturally correct: ARM's trust model places firmware updates in the Secure World (EL3/EL2), not in the Normal World OS.
ARMv7 (production):
1. QEMU vexpress-a15 loads the ELF, jumps to _start in SVC mode
2. Boot stub (assembly) sets up:
- Vector table (VBAR)
- Stack pointer
- Interrupts disabled (CPSR I+F bits)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) passed in r2
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (TTBR0, DACR for domain isolation)
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: GIC, SP804 timer, early UART console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
ARMv7 does not have microcode updates. CPU errata on ARMv7 are addressed through kernel code paths (alternative instruction sequences) selected at boot based on the MIDR (Main ID Register) value.
RISC-V 64 (production):
1. OpenSBI (M-mode firmware) initializes hardware, jumps to _start in S-mode
a0 = hart_id, a1 = DTB address
2. Boot stub (assembly) sets up:
- Trap vector (stvec)
- Stack pointer
- Interrupts disabled (sstatus.SIE = 0)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) from a1
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (satp CSR, Sv48 mode, ASID)
d. Initialize per-CPU data structures (per-hart)
e. Initialize Tier 0 drivers: PLIC, timer (via SBI ecall), early 16550 UART
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
RISC-V does not have microcode updates. CPU errata are handled by OpenSBI (M-mode)
or by kernel alternative code paths selected based on the mvendorid/marchid/
mimpid CSRs (exposed via SBI or DTB).
PPC32 (production):
1. U-Boot or QEMU loads ELF, jumps to _start in supervisor mode
r3 = DTB address
2. Boot stub (assembly) sets up:
- Stack pointer (r1)
- Exception vectors (IVPR base + IVOR offsets)
- Interrupts disabled (MSR EE=0)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) from r3
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (TLB1 entries for initial mapping, then software page table)
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: OpenPIC, decrementer timer, early UART console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
PPC32 does not have microcode updates. CPU errata are handled by kernel code paths selected at boot based on the PVR (Processor Version Register).
PPC64LE (production):
1. SLOF/OPAL firmware loads ELF, jumps to _start
r3 = DTB address, MSR: SF=1, LE=1
2. Boot stub (assembly) sets up:
- TOC pointer (r2) for position-independent data access
- Stack pointer (r1)
- Exception vectors (via LPCR and HSPRG0/1)
- Interrupts disabled (MSR EE=0)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) from r3
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (Radix MMU on POWER9+, HPT fallback on POWER8)
d. Initialize per-CPU data structures (PIR = Processor Identification Register)
e. Initialize Tier 0 drivers: XIVE interrupt controller, decrementer timer, early UART console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
PPC64LE does not have user-loadable microcode. POWER processor firmware updates are applied by the service processor (FSP or BMC) out-of-band, not by the OS kernel.
2.1.3.4 Initramfs Detection and Loading
UmkaOS supports three initramfs loading mechanisms, tried in priority order. The mechanism used depends on the boot path (BIOS/Multiboot, UEFI, or firmware with device tree). All three paths expose the same result to the kernel: a physical address and byte length for a contiguous initramfs image in RAM.
| Boot Path | Discovery Mechanism | Address Fields |
|---|---|---|
| x86 BIOS/Multiboot | boot_params.hdr.ramdisk_image (offset 0x218) | u32 phys addr + ramdisk_size at 0x21c |
| EFI stub (all arches) | LINUX_EFI_INITRD_MEDIA_GUID LoadFile2 protocol | GUID: 5568e427-68fc-4f3d-ac74-ca555231cc68 |
| Device Tree | /chosen node: linux,initrd-start + linux,initrd-end | u64 big-endian absolute physical addresses |
All three paths converge on the same kernel-internal representation:
/// Initramfs blob location discovered during early boot.
/// Populated by one of the three platform-specific loading paths before
/// the memory allocator is fully online. The physical range
/// [phys_start, phys_start + len) must lie within usable RAM.
pub struct InitramfsBlob {
/// Physical start address of the initramfs image.
pub phys_start: PhysAddr,
/// Byte length of the compressed CPIO archive.
pub len: usize,
}
Path 1 — x86 boot_params (highest priority on x86/x86-64)
The Multiboot loader or UEFI stub populates fields in boot_params (the "zero
page"). There are two distinct areas: the setup_header (header fields) and the
boot_params extension area (zero-page fields). The ramdisk fields span both:
/// Fields read from the x86 Linux boot protocol.
/// `ramdisk_image` and `ramdisk_size` live in the setup_header at fixed offsets
/// from the start of the real-mode kernel header (0x01f1 into boot_params).
/// `ext_ramdisk_image` and `ext_ramdisk_size` are separate extension fields
/// in the boot_params zero-page area, not in the header itself.
pub struct BootParamsRamdiskFields {
/// Low 32 bits of the initramfs physical base address.
/// Offset from boot_params base: 0x218 (within setup_header).
/// Boot protocol 2.00+ (kernel 1.3.73+).
pub ramdisk_image: u32,
/// Low 32 bits of the initramfs byte length.
/// Offset from boot_params base: 0x21c (within setup_header).
/// Boot protocol 2.00+ (kernel 1.3.73+).
pub ramdisk_size: u32,
/// High 32 bits of the initramfs physical base address.
/// Offset from boot_params base: 0x0c0 (zero-page extension area).
/// Added in boot protocol 2.12 (kernel 3.8) for loading above 4 GiB.
pub ext_ramdisk_image: u32,
/// High 32 bits of the initramfs byte length.
/// Offset from boot_params base: 0x0c4 (zero-page extension area).
/// Added in boot protocol 2.12 (kernel 3.8).
pub ext_ramdisk_size: u32,
}
If boot_params.hdr.ramdisk_image != 0, UmkaOS reads the initramfs from:
physical_addr = ((ext_ramdisk_image as u64) << 32) | (ramdisk_image as u64)
size_bytes = ((ext_ramdisk_size as u64) << 32) | (ramdisk_size as u64)
On systems without boot protocol 2.12 support (i.e., ext_ramdisk_image and
ext_ramdisk_size are zero-initialized), this reduces to the 32-bit address
and size directly from ramdisk_image and ramdisk_size.
Path 2 — EFI LoadFile2 / Initrd Media GUID Protocol (EFI systems, all architectures)
When booted via EFI (UEFI stub or EFI bootloader such as systemd-boot or GRUB2),
the bootloader may expose the initramfs through the LoadFile2 protocol registered
on the LINUX_EFI_INITRD_MEDIA_GUID vendor media device path. This mechanism was
introduced in Linux 5.8 and is also implemented in the UmkaOS EFI stub.
/// EFI GUID identifying the initrd media vendor device path.
/// The kernel's EFI stub locates a handle with this GUID registered on the
/// firmware's device path protocol, then calls LoadFile2 to obtain the initrd.
/// Defined in the Linux EFI stub (drivers/firmware/efi/libstub/efi-stub-helper.c).
pub const LINUX_EFI_INITRD_MEDIA_GUID: EfiGuid = EfiGuid {
data1: 0x5568_e427,
data2: 0x68fc,
data3: 0x4f3d,
data4: [0xac, 0x74, 0xca, 0x55, 0x52, 0x31, 0xcc, 0x68],
};
The loading sequence:
- Scan the EFI handle database for a handle that matches the LINUX_EFI_INITRD_MEDIA_GUID vendor media device path.
- If found, query the LoadFile2 protocol on that handle.
- Call LoadFile2.LoadFile() with BootPolicy = false to obtain the initrd size (the first call returns EFI_BUFFER_TOO_SMALL with the required size).
- Allocate pages below the hard limit, then call LoadFile2.LoadFile() again to transfer the data.
- The resulting (base, size) pair is stored in the EFI configuration table under LINUX_EFI_INITRD_MEDIA_GUID and consumed by the kernel after ExitBootServices().
Path 3 — Device Tree /chosen node (AArch64, ARMv7, RISC-V, PPC)
The firmware or bootloader populates the /chosen DT node with the initramfs
physical address range:
/ {
chosen {
/* linux,initrd-start and linux,initrd-end are big-endian cell values.
Cell width follows #address-cells of the root node (typically 2 on
64-bit platforms, giving 64-bit addresses across two 32-bit cells). */
linux,initrd-start = <0x0 0x82000000>; /* 64-bit: 0x0000000082000000 */
linux,initrd-end = <0x0 0x84000000>; /* exclusive end address */
};
};
/// DT /chosen property names for initramfs (standard Linux boot protocol).
/// Values are big-endian cells; cell width matches the root node's #address-cells.
/// Size of initramfs = initrd_end - initrd_start (initrd_end is exclusive).
pub const DT_INITRD_START_PROP: &str = "linux,initrd-start";
pub const DT_INITRD_END_PROP: &str = "linux,initrd-end";
UmkaOS reads these properties during early DT parsing (step 4a in the DT-based boot sequences). Cell values are big-endian regardless of platform word size; following the Linux implementation, the cell width is inferred from the property length (one cell for 32-bit values, two cells for 64-bit addresses).
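The property decoding can be sketched as a small pure function. This is an illustrative sketch, not the UmkaOS parser: the function names are invented here, and (as in Linux) the cell width is inferred from the property length.

```rust
/// Parse one /chosen initrd property value. The property length selects the
/// cell width: 4 bytes = one 32-bit cell, 8 bytes = two cells (64-bit address).
/// Cells are big-endian, as in all flattened device trees.
fn parse_initrd_prop(prop: &[u8]) -> Option<u64> {
    match prop.len() {
        4 => Some(u32::from_be_bytes(prop.try_into().ok()?) as u64),
        8 => Some(u64::from_be_bytes(prop.try_into().ok()?)),
        _ => None, // malformed property
    }
}

/// Derive (start, len) from linux,initrd-start / linux,initrd-end.
/// initrd-end is exclusive, so len = end - start.
fn initrd_range(start_prop: &[u8], end_prop: &[u8]) -> Option<(u64, u64)> {
    let start = parse_initrd_prop(start_prop)?;
    let end = parse_initrd_prop(end_prop)?;
    end.checked_sub(start).map(|len| (start, len))
}
```

Applied to the DTS example above, the two 8-byte properties decode to start 0x8200_0000 and length 0x0200_0000 (32 MiB).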
Priority and fallback:
if arch == x86 || arch == x86_64:
if boot_params.hdr.ramdisk_image != 0:
use Path 1
elif efi_boot && efi_load_initrd_dev_path() succeeds:
use Path 2
else:
no initramfs
elif efi_boot:
if efi_load_initrd_dev_path() succeeds:
use Path 2
else:
no initramfs
elif dt_boot:
if dt_property_exists("/chosen", "linux,initrd-start"):
use Path 3
else:
no initramfs
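The fallback pseudocode above transliterates directly into a selection function. The types below are illustrative (invented for this sketch); the real kernel derives these flags from the boot environment rather than a struct:

```rust
/// Which initramfs discovery path was selected (sketch types).
#[derive(Debug, PartialEq, Clone, Copy)]
enum InitrdSource { BootParams, EfiLoadFile2, DeviceTree, Absent }

/// Boot-environment facts that drive the selection (illustrative).
struct BootEnv {
    is_x86: bool,                // x86 or x86_64
    ramdisk_image_nonzero: bool, // boot_params.hdr.ramdisk_image != 0
    efi_boot: bool,
    efi_initrd_available: bool,  // efi_load_initrd_dev_path() succeeded
    dt_boot: bool,
    dt_has_initrd_start: bool,   // /chosen has linux,initrd-start
}

fn select_initrd_source(e: &BootEnv) -> InitrdSource {
    if e.is_x86 {
        if e.ramdisk_image_nonzero {
            InitrdSource::BootParams
        } else if e.efi_boot && e.efi_initrd_available {
            InitrdSource::EfiLoadFile2
        } else {
            InitrdSource::Absent
        }
    } else if e.efi_boot {
        if e.efi_initrd_available { InitrdSource::EfiLoadFile2 } else { InitrdSource::Absent }
    } else if e.dt_boot && e.dt_has_initrd_start {
        InitrdSource::DeviceTree
    } else {
        InitrdSource::Absent
    }
}
```

Note that on x86 the boot_params path wins even under EFI: a bootloader that fills in ramdisk_image has already done the loading work, so the LoadFile2 path is only a fallback.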
No initramfs is also valid — the kernel falls back to a minimal in-kernel rootfs
(tmpfs) and attempts to find /init from a built-in CPIO archive. If no built-in
CPIO is present and no initramfs was loaded, the kernel panics with a descriptive
message: "No initramfs found and no built-in rootfs — cannot locate /init".
Validation (after loading, regardless of path):
- Verify the initramfs starts with a valid cpio magic: 070701 (newc, no CRC), 070702 (newc, with CRC), or 070707 (odc/binary, legacy). Reject if absent.
- Verify size_bytes > 0 and that the physical range [physical_addr, physical_addr + size_bytes) lies entirely within available RAM (not in reserved regions or MMIO holes). Reject with a boot error if not.
- If IMA is active (Integrity Measurement Architecture, Section 8.3), measure the complete initramfs into PCR 10 before executing any init scripts. This matches the Linux IMA convention, whose default measurement PCR is 10.
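The magic check in the first step can be sketched as a pure function over the blob's leading bytes (the function name is illustrative; the ASCII magics and the old binary format's 16-bit octal magic 070707 = 0x71C7 are standard cpio values, shown here for a little-endian machine):

```rust
/// Check the leading bytes of a loaded initramfs for a recognized cpio magic.
fn cpio_magic_ok(blob: &[u8]) -> bool {
    // Old binary cpio stores the octal magic 070707 as a 16-bit integer
    // in native byte order; this sketch assumes little-endian.
    const BINARY_MAGIC: u16 = 0o070707; // == 0x71C7
    blob.starts_with(b"070701") // newc, no CRC
        || blob.starts_with(b"070702") // newc, with CRC
        || blob.starts_with(b"070707") // odc (portable ASCII), legacy
        || (blob.len() >= 2 && u16::from_le_bytes([blob[0], blob[1]]) == BINARY_MAGIC)
}
```

A failure here aborts the boot path that loaded the blob, since a bad magic usually means the bootloader handed over a truncated or miscompressed image.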
2.1.4 CPU Errata and Microcode
Modern CPUs ship with known errata — hardware bugs documented in vendor errata sheets. UmkaOS handles these systematically rather than scattering workarounds through the codebase.
Early microcode loading — CPU microcode is applied before most kernel initialization, matching the Linux early microcode loading model. The microcode blob is located by scanning the raw initramfs image in physical memory (NOT by mounting the filesystem — initramfs mount happens later at step 4h). Linux uses the same approach: the bootloader provides an uncompressed CPIO archive prepended to the initramfs; the kernel extracts the microcode by parsing the raw CPIO headers in memory at boot.
The microcode update runs between steps 4b (physical memory allocator init) and 4c (virtual memory init):
Boot step (between 4b and 4c): Early microcode update
1. Scan raw initramfs blob in physical memory for microcode CPIO archive
(/lib/firmware/intel-ucode/ or /lib/firmware/amd-ucode/ paths in CPIO)
2. Validate signature (vendor-signed, no user-modifiable microcode)
3. Apply via WRMSR to IA32_BIOS_UPDT_TRIG (Intel) or MSR_AMD64_PATCH_LOADER (AMD)
4. Re-read CPUID — microcode may change feature flags (critical: must happen
before step 4c which uses CPUID to configure page table features)
5. Log applied microcode revision to ring buffer
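Step 1's raw scan can be sketched as a walk over newc ("070701") cpio headers in the in-memory image. This is a dependency-free sketch under stated assumptions: the helper name is invented, only the uncompressed newc format is handled, and hardening against hostile headers is omitted; the field offsets (c_filesize at bytes 54..62, c_namesize at 94..102) and the 4-byte alignment rules are from the newc format.

```rust
/// Walk an uncompressed newc cpio archive in memory, returning the file data
/// of the first entry whose path starts with `prefix` (sketch only).
fn find_cpio_entry<'a>(mut img: &'a [u8], prefix: &str) -> Option<&'a [u8]> {
    // newc header fields are 8 ASCII hex chars each, after the 6-byte magic.
    fn hex8(b: &[u8]) -> Option<usize> {
        usize::from_str_radix(core::str::from_utf8(b).ok()?, 16).ok()
    }
    loop {
        if img.len() < 110 || !img.starts_with(b"070701") {
            return None;
        }
        let filesize = hex8(&img[54..62])?;  // c_filesize
        let namesize = hex8(&img[94..102])?; // c_namesize (includes NUL)
        if namesize == 0 {
            return None;
        }
        let name_end = 110 + namesize;
        let name = core::str::from_utf8(img.get(110..name_end - 1)?).ok()?;
        if name == "TRAILER!!!" {
            return None; // end-of-archive marker
        }
        let data_start = (name_end + 3) & !3; // file data is 4-byte aligned
        if name.starts_with(prefix) {
            return img.get(data_start..data_start + filesize);
        }
        // Next header is 4-byte aligned after the file data.
        img = img.get(((data_start + filesize + 3) & !3)..)?;
    }
}
```

A call like `find_cpio_entry(blob, "lib/firmware/intel-ucode/")` locates the vendor microcode entry without ever touching the filesystem layer, which is exactly why this can run before step 4c.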
Errata database — After microcode loading and CPUID enumeration, UmkaOS consults a per-CPU-model quirk table:
/// CPU errata entry — matches a specific CPU stepping to its required workarounds.
struct CpuErrata {
/// CPU identification (vendor, family, model, stepping range).
match_id: CpuMatch,
/// Human-readable errata identifier (e.g., "SKX003", "ZEN4-ERR-1234").
errata_id: &'static str,
/// Workaround function applied during boot.
workaround: fn() -> Result<()>,
/// Category for boot-parameter override.
category: ErrataCat,
}
enum ErrataCat {
/// MSR write to disable/enable a feature.
MsrTweak,
/// Alternative code path (e.g., retpoline instead of indirect branch).
CodePath,
/// Disable a CPU feature entirely.
FeatureDisable,
}
The quirk table is checked during boot (step 4d, after CPUID). Each matching entry's workaround function is called. Workarounds are logged to the ring buffer.
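The boot-time walk over the quirk table can be sketched as follows. The entry contents, the simplified CpuMatch (family/model only), and the stand-in workaround are all hypothetical; the real CpuErrata and ErrataCat types are shown above and match on the full vendor/family/model/stepping tuple:

```rust
/// Simplified CPU identity for this sketch (the real CpuMatch also carries
/// vendor and a stepping range).
#[derive(Clone, Copy, PartialEq)]
struct CpuMatch { family: u16, model: u16 }

struct Quirk {
    match_id: CpuMatch,
    errata_id: &'static str,
    workaround: fn() -> Result<(), &'static str>,
}

/// Stand-in workaround; a real one would e.g. perform an MSR write.
fn wa_noop() -> Result<(), &'static str> { Ok(()) }

/// Hypothetical table entry: Skylake-X is family 6, model 85.
static QUIRKS: &[Quirk] = &[
    Quirk { match_id: CpuMatch { family: 6, model: 85 }, errata_id: "SKX003", workaround: wa_noop },
];

/// Apply every matching workaround; returns the applied errata IDs
/// (the real kernel logs these to the ring buffer).
fn apply_errata(cpu: CpuMatch) -> Vec<&'static str> {
    let mut applied = Vec::new();
    for q in QUIRKS {
        if q.match_id == cpu && (q.workaround)().is_ok() {
            applied.push(q.errata_id);
        }
    }
    applied
}
```

Keeping the workarounds behind function pointers in one table is what prevents quirks from scattering through the codebase: a new erratum is one new table row.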
Spectre/Meltdown class mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Meltdown (v3) | KPTI (page table isolation) | Required for Tier 2 + userspace; NOT needed for Tier 1 (same ring, MPK isolation) |
| Spectre v1 | LFENCE barriers at bounds checks; Speculative Load Hardening (SLH) | Compiler-inserted SLH (-mllvm -x86-speculative-load-hardening); manual LFENCE in asm hot paths |
| Spectre v2 | Retpoline / IBRS / eIBRS | Retpoline (-C target-feature=+retpoline-indirect-branches) for indirect branches in kernel code; eIBRS preferred on supporting hardware |
| Spectre v4 (SSB) | SSBD (Spec. Store Bypass Disable) | Per-thread via IA32_SPEC_CTRL MSR; toggled on context switch for untrusted threads |
| MDS/TAA | Buffer clears (VERW) | On context switch to userspace; on VM entry/exit |
| SRBDS | Microcode + VERW | Handled by early microcode update |
| RFDS/GDS | Microcode + opt-in VERW | Same as MDS path |
Mitigation boot parameters:
umka.mitigate=auto # Default: apply mitigations based on detected CPU (recommended)
umka.mitigate=on # Force all mitigations on, even if CPU claims to be fixed
umka.mitigate=off # Disable all mitigations (INSECURE — see below)
umka.mitigate.kpti=off # Disable specific mitigation class
umka.mitigate.retpoline=off # Disable specific mitigation class
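Parsing these parameters from the kernel command line can be sketched as a single pass over whitespace-separated tokens (the helper name and return shape are hypothetical; the parameter strings are the ones listed above):

```rust
/// Global mitigation policy selected by umka.mitigate=.
#[derive(Debug, PartialEq, Clone, Copy)]
enum MitigatePolicy { Auto, On, Off }

/// Parse the umka.mitigate family of boot parameters.
/// Returns the global policy plus any individually disabled classes
/// (e.g., "kpti", "retpoline").
fn parse_mitigate(cmdline: &str) -> (MitigatePolicy, Vec<&str>) {
    let mut policy = MitigatePolicy::Auto; // default
    let mut disabled = Vec::new();
    for tok in cmdline.split_whitespace() {
        match tok {
            "umka.mitigate=auto" => policy = MitigatePolicy::Auto,
            "umka.mitigate=on" => policy = MitigatePolicy::On,
            "umka.mitigate=off" => policy = MitigatePolicy::Off,
            t => {
                // umka.mitigate.<class>=off disables one mitigation class.
                if let Some(rest) = t.strip_prefix("umka.mitigate.") {
                    if let Some(class) = rest.strip_suffix("=off") {
                        disabled.push(class);
                    }
                }
            }
        }
    }
    (policy, disabled)
}
```

Later tokens win over earlier ones, matching the usual last-writer-wins convention for kernel command lines.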
Performance impact of mitigations:
The cumulative overhead of speculative execution mitigations is substantial — typically 5-30% depending on workload characteristics:
| Mitigation | Overhead | Worst-case workload |
|---|---|---|
| KPTI | ~5% syscall-heavy; ~100-200ns per user↔kernel transition | Database OLTP (millions of syscalls/sec) |
| Retpoline / eIBRS | ~2-10% | Indirect-branch-heavy code (virtual dispatch, interpreters) |
| SSBD | ~1-5% | Memory-intensive with store-to-load forwarding |
| MDS VERW | ~1-3% on context switch | Frequent user↔kernel transitions |
| Cumulative | 5-30% | Syscall-heavy + indirect-branch-heavy (databases, VMs) |
umka.mitigate=off is legitimate for:
- Air-gapped HPC clusters where all code is trusted and no untrusted workloads run.
- Benchmarking to isolate application performance from mitigation overhead.
- Single-tenant bare-metal where the threat model excludes local attackers.
- Nested within a trusted VM where the host hypervisor enforces mitigations at
the outer boundary (the guest's mitigations are redundant).
The kernel logs a prominent boot warning when mitigations are disabled:
umka: WARNING — speculative execution mitigations DISABLED (umka.mitigate=off).
This system is vulnerable to Spectre, Meltdown, MDS, and related attacks.
Do NOT use in multi-tenant or untrusted environments.
Interaction with umka.isolation=performance: When umka.isolation=performance is
set (promoting Tier 1 drivers to Tier 0, disabling CPU-side isolation), the admin has
already accepted a reduced security posture. Combining umka.isolation=performance with
umka.mitigate=off provides the maximum performance envelope — no isolation overhead,
no mitigation overhead — but should be limited to environments where all executing code
is fully trusted. The two settings are independent; either can be set alone.
Runtime reporting — Vulnerability status is exposed via Linux-compatible sysfs:
/sys/devices/system/cpu/vulnerabilities/meltdown: "Mitigation: PTI"
/sys/devices/system/cpu/vulnerabilities/spectre_v1: "Mitigation: usercopy/LFENCE"
/sys/devices/system/cpu/vulnerabilities/spectre_v2: "Mitigation: eIBRS"
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass: "Mitigation: SSBD"
/sys/devices/system/cpu/vulnerabilities/mds: "Mitigation: Clear buffers"
This ensures monitoring tools (spectre-meltdown-checker, lynis) work without modification.
2.1.5 Speculation Mitigations (All Architectures)
The x86-specific mitigation table in Section 2.1.4 covers only one architecture. Here is the complete per-architecture mitigation matrix:
AArch64 mitigations:
| Vulnerability | ARM Identifier | Mitigation | UmkaOS scope |
|---|---|---|---|
| Spectre v1 (bounds bypass) | — | CSDB barriers at bounds checks | Compiler-inserted CSDB barriers after conditional branches (ARM equivalent of x86 SLH; uses CSDB instruction, not LLVM's x86-specific -x86-speculative-load-hardening pass) |
| Spectre v2 (BTI) | CVE-2017-5715 | BTI (Branch Target Identification) | Hardware BTI (ARMv8.5+): enabled via SCTLR_EL1.BT1. Software: SMCCC ARCH_WORKAROUND_1 firmware call |
| Spectre-BHB | CVE-2022-23960 | BHB clearing sequence or firmware call | SMCCC ARCH_WORKAROUND_3 or BHB clearing loop on context switch |
| Meltdown (v3) | CVE-2017-5754 | KPTI (separate EL0/EL1 page tables) | Full KPTI required on Cortex-A75 (all revisions) per ARM security bulletins. NOT needed on cores reporting CSV3 across all revisions (Cortex-A76, Cortex-A78, Cortex-X1, Cortex-X2, Cortex-A710, Cortex-A715, and other Armv9/v8.x CSV3 cores) or on earlier in-order cores (A53, A55, etc.). Cortex-A510 (all revisions) and Cortex-A520 (prior to r0p2 only; r0p2+ is not affected) are classified as Variant 3 by ARM (errata 3117295 and 2966298 respectively), but the actual issue in both is a speculative unprivileged load whose workaround is a TLBI instruction before returning to EL0, not full KPTI page table splitting; UmkaOS applies this lightweight TLBI mitigation for A510 and A520 rather than the heavyweight page-table split used for Cortex-A75. |
| Spectre v4 (SSB) | CVE-2018-3639 | SSBS (Speculative Store Bypass Safe) | Hardware SSBS bit (ARMv8.5+): per-thread via PSTATE.SSBS. Software: SMCCC ARCH_WORKAROUND_2 |
| Straight-line speculation | — | SB instruction after branches | Compiler-inserted speculation barriers |
ARM firmware interface: Unlike x86 (which uses MSR writes), ARM mitigations are
often applied through SMCCC (SMC Calling Convention) firmware calls to EL3 Secure
Monitor code. The kernel calls ARCH_WORKAROUND_1/2/3 — the firmware applies the
actual mitigation. This is architecturally cleaner (firmware knows the exact CPU
revision) but adds ~100-200 cycles per SMCCC call.
ARMv7 mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | CSDB barriers at bounds checks | Same as AArch64 |
| Spectre v2 | Firmware workaround via SMCCC | ARCH_WORKAROUND_1 for affected Cortex-A cores |
| Meltdown | Not applicable | ARMv7 Cortex-A cores are not affected |
| Spectre v4 | Firmware workaround | ARCH_WORKAROUND_2 where supported |
RISC-V mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | FENCE instructions at bounds checks | Manual insertion in assembly; compiler support evolving |
| Spectre v2 | Vendor-specific | SiFive: FENCE.I after indirect branches. Other vendors: per-implementation |
| Meltdown | Not applicable | In-order RISC-V cores not affected; OoO cores (e.g., SiFive P670) may need KPTI |
| Spectre v4 | Vendor-specific | No standard RISC-V mitigation; per-vendor microarchitecture |
RISC-V status: Speculation mitigations on RISC-V are less mature than x86 or ARM.
The RISC-V CFI extensions Zicfiss (shadow stacks) and Zicfilp (landing pads) are
ratified as standalone ISA extensions, separate from the base privileged
specification. UmkaOS implements both when the hardware
reports support via the Zicfiss and Zicfilp ISA string entries. UmkaOS also applies
vendor-specific workarounds based on mvendorid/marchid from the device tree, similar
to the x86 errata database approach.
PowerPC mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | ori 31,31,0 (speculation barrier) | Inserted at bounds checks in assembly |
| Spectre v2 | Count Cache Flush + link stack flush | POWER8/9: bcctr flush sequence; POWER10: hardware mitigation |
| Meltdown | RFI flush (L1D cache flush) | POWER7+: flush on return from interrupt via rfid/hrfid |
| Spectre v4 | STF (Store Thread Forwarding) barrier | ori 31,31,0 barrier; POWER9+ firmware toggle |
PowerPC status: IBM POWER processors have well-documented mitigations managed via firmware (skiboot/OPAL) and kernel runtime patches. POWER10 includes hardware mitigations for most Spectre variants. PPC32 embedded cores (e500, 440) are generally in-order and not affected by speculative execution vulnerabilities. UmkaOS applies mitigations based on PVR (Processor Version Register) from the device tree.
Runtime reporting (all architectures) — The Linux-compatible sysfs interface
(/sys/devices/system/cpu/vulnerabilities/) is populated on all architectures with
architecture-appropriate mitigation status strings.
2.1.6 Dual-Boot Safety
- UmkaOS never modifies the existing Linux kernel installation.
- GRUB is configured with both kernels; the default can be set by the user.
- If UmkaOS fails to boot, the user selects the Linux kernel from GRUB.
- A "last known good" mechanism records successful boots and can auto-revert.
2.1.7 Boot Protocol Migration Path
The boot architecture evolves through four phases, each building on the previous:
Phase 1 — Multiboot1 (current). GRUB loads the ELF via multiboot command.
QEMU loads directly with -kernel. Memory map from Multiboot1 info structure.
Sufficient for all kernel development and QEMU-based testing.
Phase 2 — Multiboot2 full parser. Parse Multiboot2 tags to access richer boot
information: ACPI RSDP pointer, EFI memory map, framebuffer info, boot services
tag. This enables ACPI table parsing and EFI runtime services without changing the
bootloader. GRUB2 already supports the multiboot2 command.
Phase 3 — UEFI stub boot. Add a PE/COFF header stub to the kernel image (similar
to Linux EFISTUB). UEFI firmware requires PE/COFF executables, not ELF — the stub
header makes the kernel image a valid PE/COFF binary that UEFI can load directly.
The actual kernel code remains ELF internally; the PE/COFF header is a thin wrapper
(like Linux's header.S which embeds a PE/COFF header in the bzImage). The kernel
becomes directly bootable from UEFI firmware without GRUB — efibootmgr can register
it. Use EFI boot services for memory map and GOP framebuffer, then call
ExitBootServices() before entering the kernel proper. systemd-boot and other
UEFI-native boot managers work at this stage.
Phase 4 — Linux boot protocol. Implement the x86 Linux boot protocol
(struct boot_params at 0x10000). This makes the UmkaOS kernel loadable by any
Linux-compatible bootloader. Combined with a standard /boot layout and initramfs,
this enables the drop-in package installation described in Section 2.1.3.1. This is the
final production boot target.
2.1.8 Secure Boot and Measured Boot
Secure Boot and Measured Boot are kernel-level boot-phase concerns. They apply equally to servers (enterprise attestation, confidential computing), cloud instances (vTPM-based instance identity), and consumer devices (UEFI Secure Boot for firmware lockdown). Neither feature is consumer-specific.
2.1.8.1 UEFI Secure Boot
UEFI Secure Boot enforces a chain of trust starting in firmware: the UEFI db (allowed signature database) and dbx (revocation list) are stored in firmware NVRAM. Every executable in the boot path (bootloader, shim, kernel) must be signed by a key in the db.
Deployment models:
| Model | Chain | When used |
|---|---|---|
| Shim + GRUB | Microsoft UEFI CA → shim (signed by MS) → GRUB (signed by distro) → kernel (signed by distro) | Default for distros shipping via OEM |
| UEFI direct | Custom key enrolled in db → kernel PE/COFF (signed by UmkaOS key) | Self-managed servers, custom deployments |
| Unsigned (disabled) | No verification | Development hardware, QEMU |
UmkaOS requires Phase 3 (UEFI stub, Section 2.1.7) before Secure Boot can be supported.
The kernel image must be a valid PE/COFF binary for UEFI to verify its
signature before loading. The build system produces a signed image via
sbsign --key umka-signing.key --cert umka-signing.crt umka-kernel.efi.
Kernel module signing: Once the kernel is Secure Boot-booted, all kernel modules (Tier 1 drivers) must also be signed. Unsigned modules are rejected. The module signing key is separate from the UEFI boot key. The build system embeds the module signing public key in the kernel image; drivers are signed with the corresponding private key during the build.
UEFI Secure Boot state: The kernel reads the UEFI SecureBoot variable
from EFI runtime services at boot and records it in a read-only kernel
parameter. Userspace can query via /sys/firmware/efi/efivars/SecureBoot-*.
This affects policy decisions (e.g., CAP_SYS_MODULE behaviour).
2.1.8.1.1 Key Compromise Recovery
If the UmkaOS signing key is compromised, three coordinated actions are required: updating the UEFI revocation list (dbx), rotating the signing key, and migrating TPM-sealed secrets to a new PCR state. This subsection specifies each step precisely.
dbx Update Path
The UEFI forbidden signature database (dbx) is stored in EFI NVRAM and contains
hashes or certificate thumbprints of revoked images and keys. Updates are delivered
as signed UEFI authenticated variables:
- Delivery mechanism: a signed UEFI capsule image (EFI_FIRMWARE_IMAGE_PROTOCOL, GUID 6dcbd5ed-e82d-4c44-bda1-7194199ad92a) deposited either via the EFI_UPDATE_CAPSULE runtime service or as a file at /EFI/UpdateCapsule/<GUID>.bin on the EFI System Partition. The firmware processes the capsule before ExitBootServices() on the next boot.
- Authentication: the capsule is authenticated by the firmware using the Platform Key (PK) or Key Exchange Key (KEK) chain already enrolled in NVRAM. A dbx capsule signed by the KEK — and delivered through a signed distro update package — requires no additional user interaction.
- Early kernel verification: after the firmware has applied the new dbx and before entering the UmkaOS boot stub, UEFI re-verifies every image in the boot chain against the updated dbx. If the running kernel image's hash or signing certificate is now in the dbx, UEFI aborts the boot and presents an error to the user. The kernel never reaches umka_main() in this case; the revocation check fires before ExitBootServices().
- EFI event log: the firmware records the dbx update in the TCG EFI Platform Specification event log (EV_EFI_VARIABLE_AUTHORITY entry, PCR 7). The kernel reads this log during early initialization and forwards the entry to the IMA audit log, creating a durable, ordered record that dbx was updated.
Key Rotation Protocol
The UmkaOS signing key is an ML-DSA-65 + Ed25519 hybrid key pair. Key rotation proceeds through five steps:
1. Generate the new key pair. Create a new ML-DSA-65 + Ed25519 hybrid signing key pair in a hardware security module (HSM). The HSM never exports the private key material.
2. Enroll the new public key in db. Submit the new public key certificate to the UEFI db (allowed signature database) via a KEK-authenticated variable update — the same delivery mechanism as the dbx capsule described above. The update is deployed through the normal distro package-management pipeline (e.g., as a fwupd plugin or a signed distro package writing to /EFI/UpdateCapsule/). After the update applies, both the old key and the new key are accepted by UEFI.
3. Dual-signing period — minimum 30 days. Every kernel release during this period is signed with BOTH the old key and the new key. A dual-signed image satisfies any UEFI db that contains either key. This covers:
   - Existing systems that have not yet received the new db enrollment.
   - Systems that received the enrollment but whose db update failed to apply (e.g., NVRAM full, firmware bug).
   The 30-day window gives the db enrollment sufficient time to propagate to all deployed systems via normal OS update channels.
4. Revoke the old key. After the dual-signing period ends, add the old signing certificate's SHA-256 hash to dbx via a KEK-signed capsule update. From this point, images signed only with the old key are rejected. Dual-signed images (carrying the new key's signature as well) continue to boot.
5. Out-of-band recovery media. Prepare a USB recovery drive containing:
   - The new public key certificate in DER format.
   - A signed db update capsule that adds the new key.
   - Instructions for manual enrollment via the UEFI setup utility.
   This drive is used on systems that missed the automatic enrollment (e.g., systems offline during the update window, air-gapped systems).
PCR Extension for the New Key
Standard UEFI Secure Boot behavior extends PCR 7 with the hash of each certificate used to verify an image (EV_EFI_VARIABLE_AUTHORITY events). When a kernel signed with the new key boots for the first time, UEFI extends PCR 7 with the new signing certificate hash. The PCR 7 value changes, which breaks TPM-sealed secrets (such as disk encryption keys) that were sealed with a policy referencing the old PCR 7 value.
Migration path — applied before the old key is revoked (Step 4 above):
1. Compute the expected new PCR 7 value:

   PCR7_new = SHA256(PCR7_old || SHA256(new_signing_cert_der))

   This can be computed offline from the current PCR 7 value and the new certificate, without rebooting.
2. Re-seal secrets with a dual policy. Unseal each secret under the existing policy (PCR 7 = PCR7_old), then re-seal with a PolicyOR policy that accepts either the old or new PCR 7 value:

   PolicyOR(
       PolicyPCR(PCR7_old, pcr_selection = {PCR7}),
       PolicyPCR(PCR7_new, pcr_selection = {PCR7})
   )

   The re-sealed blob can be unsealed on systems booting with either key during the dual-signing period.
3. After key rotation completes. Once the old key is in dbx and all systems boot only with the new key, re-seal secrets one final time with a single policy referencing only PCR7_new. This drops the fallback to the old PCR 7 value, producing a tighter policy for the steady state.
The migration (Steps 2 and 3) is performed by a userspace tool
(umka-tpm-reseal) that runs as a systemd oneshot service during the transition
window. The service is activated by detecting a new db entry for the UmkaOS
signing key in the EFI event log.
2.1.8.2 Measured Boot (TPM PCR Chain)
Measured Boot extends a TPM Platform Configuration Register (PCR) with a cryptographic hash at each step of the boot chain. PCRs are append-only (extend = SHA256(current PCR value || new measurement)); they cannot be reset without rebooting. A remote attestation verifier can reconstruct the expected PCR values from the known firmware/bootloader/kernel and check that the running system matches.
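The extend operation is a chained hash. A dependency-free sketch, generic over the hash function so it stays self-contained — production code uses SHA-256 here, and the toy hash below is an explicitly non-cryptographic stand-in invented for this example:

```rust
/// One PCR extend: new = H(current || measurement). PCRs are append-only
/// because this chained hash is the only way to change them.
fn pcr_extend_value(
    current: [u8; 32],
    measurement: [u8; 32],
    hash: fn(&[u8]) -> [u8; 32],
) -> [u8; 32] {
    let mut buf = [0u8; 64];
    buf[..32].copy_from_slice(&current);
    buf[32..].copy_from_slice(&measurement);
    hash(&buf)
}

/// Stand-in hash so the sketch compiles without a crypto crate.
/// NOT cryptographic — a real implementation uses SHA-256.
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut state: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    let mut out = [0u8; 32];
    for (i, &b) in data.iter().enumerate() {
        state = (state ^ b as u64).wrapping_mul(0x100_0000_01b3); // FNV-1a prime
        out[i % 32] ^= (state >> 24) as u8;
    }
    out
}
```

Because each extend hashes the previous value, the final PCR depends on both the set and the order of measurements; a remote verifier therefore replays the event log entry by entry to reconstruct the expected value.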
Standard x86 PCR assignment (UEFI + Linux convention, which UmkaOS follows):
| PCR | What is measured |
|---|---|
| 0 | UEFI firmware code and configuration |
| 1 | UEFI firmware data (platform config) |
| 2 | Option ROM code |
| 3 | Option ROM data |
| 4 | Boot manager code (GRUB/shim) |
| 5 | Boot manager data + GPT partition table |
| 6 | Resume from hibernate |
| 7 | Secure Boot policy (db, dbx, PK, KEK state) |
| 8 | GRUB command line |
| 9 | Kernel image (bzImage/UmkaOS kernel PE/COFF) |
| 10 | initramfs |
| 11 | Kernel command line |
| 12 | UmkaOS: Tier 1 driver measurements (Section 8.3.1, 08-security.md) + IMA policy keys |
| 13–15 | Available for OS/application use |
The kernel extends PCR 9 with its own image hash during early boot (before
ExitBootServices() on UEFI paths, or via GRUB's tpm module on Multiboot
paths). PCR 10 is extended with the initramfs hash. PCR 11 is extended with
the kernel command line.
TPM interface: The kernel accesses the TPM via the TPM CRB (Command
Response Buffer, TPM 2.0 mandatory interface) or TPM TIS (legacy 1.2 / 2.0
FIFO interface). The driver is Tier 1, ACPI-probed (MSFT0101 or MSFT0200).
// umka-core/src/tpm/mod.rs
/// TPM 2.0 PCR Extend command.
/// Extends the given PCR with SHA-256(current || digest).
pub fn pcr_extend(pcr_index: u32, digest: &[u8; 32]) -> Result<(), TpmError>;
/// Read back the current value of a PCR.
pub fn pcr_read(pcr_index: u32) -> Result<[u8; 32], TpmError>;
/// Seal a secret to the current PCR state.
/// Returns a TPM2B_PUBLIC + TPM2B_PRIVATE blob.
/// The secret can only be unsealed if PCRs match the policy at seal time.
pub fn seal(pcr_policy: &PcrPolicy, secret: &[u8]) -> Result<SealedBlob, TpmError>;
/// Unseal a blob previously created by seal().
/// Fails if any PCR in the policy has changed since sealing.
pub fn unseal(blob: &SealedBlob) -> Result<Vec<u8>, TpmError>;
Disk encryption integration: seal() is the mechanism for TPM-bound disk
encryption keys (equivalent to Linux's systemd-cryptenroll / clevis TPM2 enrollment).
The disk encryption key is sealed to a PCR policy covering PCRs 0, 4, 7, 9,
11 (firmware + Secure Boot policy + kernel + cmdline). Any modification to the
boot chain (new kernel, changed cmdline, disabled Secure Boot) causes unseal
to fail, prompting for a recovery passphrase.
Confidential computing intersection: On confidential VM platforms (AMD SEV-SNP, Intel TDX, ARM CCA), the TPM is replaced by a virtual TPM whose root of trust is the hardware attestation report (VCEK certificate, TD quote, Realm Attestation Token). The PCR-based measured boot model is the same; the trust root is the hardware VM isolation guarantee rather than a physical TPM chip. Section 5.1 covers the distributed/confidential computing architecture.
2.1.8.3 Kernel Responsibilities Summary
| Responsibility | In kernel? | Notes |
|---|---|---|
| Kernel image signing | Build-time | sbsign in build system |
| Module signing verification | Yes | Enforced when Secure Boot active |
| PCR extension (kernel + cmdline) | Yes | Early boot, before driver init |
| TPM driver (CRB/TIS) | Yes | Tier 1, ACPI-probed |
| seal() / unseal() API | Yes | Exposed to userspace via ioctl |
| Key management policy | No | Userspace (systemd-cryptenroll, clevis) |
| Remote attestation protocol | No | Userspace (keylime, MAA agent) |
| Boot graphics, splash screen | No | Bootloader/compositor |
| Dual-boot chainloading | No | Bootloader (GRUB) |
2.1.9 UEFI Runtime Services
After ExitBootServices(), the UEFI Boot Services memory map is invalidated and
all Boot Services (memory allocation, protocol interfaces, etc.) are gone. However,
a distinct set of UEFI Runtime Services remains accessible for the life of the
running OS. These services operate on a virtual address mapping that the kernel
establishes during boot via SetVirtualAddressMap().
2.1.9.1 Virtual Address Mapping
Before calling ExitBootServices(), the kernel enumerates the UEFI memory map and
identifies all regions with the EFI_MEMORY_RUNTIME attribute. These regions are
mapped into a dedicated kernel virtual address range (EFI_RUNTIME_VA_BASE,
architecture-specific) using normal kernel page table entries. The mapping must
preserve the relative offsets between firmware-runtime regions exactly as the
firmware expects.
The kernel then calls SetVirtualAddressMap(map_size, descriptor_size,
descriptor_version, virtual_map) once, passing the updated descriptors with the
new virtual base addresses. After this call returns, all UEFI runtime service
pointers stored in the EFI System Table are updated to use the new virtual
addresses. The physical EFI System Table address is preserved separately so the
kernel can locate it after the mapping call.
/// Handle to UEFI runtime services, valid after ExitBootServices().
pub struct EfiRuntime {
/// Physical address of EFI System Table, preserved across ExitBootServices().
pub system_table_pa: PhysAddr,
/// Virtual address of EFI Runtime Services table, after SetVirtualAddressMap().
pub runtime_services: *const EfiRuntimeServices,
/// Whether runtime services are available (false if firmware is broken or
/// SetVirtualAddressMap() failed).
pub available: bool,
/// Serializes all EFI runtime calls. UEFI firmware is not reentrant.
pub lock: SpinLock<()>,
}
All accesses to EfiRuntime hold EfiRuntime::lock and execute with interrupts
disabled. UEFI firmware is documented as non-reentrant; concurrent calls from
different CPUs or from an IRQ handler preempting a runtime call both produce
undefined behavior.
2.1.9.2 NVRAM (EFI Variables)
EFI variables are named byte arrays stored in firmware NVRAM. They persist across reboots and are accessed by name (UTF-16 string) and vendor GUID. Variables have attribute flags controlling persistence and visibility:
- EFI_VARIABLE_NON_VOLATILE (bit 0): persists across power cycles.
- EFI_VARIABLE_BOOTSERVICE_ACCESS (bit 1): accessible during Boot Services.
- EFI_VARIABLE_RUNTIME_ACCESS (bit 2): accessible after ExitBootServices().
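The attribute bits translate directly into constants and a visibility check. The bit values follow the UEFI Variable Services definitions; the helper names are illustrative:

```rust
/// EFI variable attribute bits (UEFI spec, Variable Services).
const EFI_VARIABLE_NON_VOLATILE: u32 = 1 << 0;
const EFI_VARIABLE_BOOTSERVICE_ACCESS: u32 = 1 << 1;
const EFI_VARIABLE_RUNTIME_ACCESS: u32 = 1 << 2;

/// A variable is visible to the running OS only if it carries RUNTIME_ACCESS;
/// after ExitBootServices() the firmware hides boot-services-only variables.
fn runtime_visible(attrs: u32) -> bool {
    attrs & EFI_VARIABLE_RUNTIME_ACCESS != 0
}

/// Attribute set for a persistent variable the OS can read and write
/// (e.g., BootOrder).
const PERSISTENT_RUNTIME: u32 =
    EFI_VARIABLE_NON_VOLATILE | EFI_VARIABLE_BOOTSERVICE_ACCESS | EFI_VARIABLE_RUNTIME_ACCESS;
```

Note that the UEFI specification requires RUNTIME_ACCESS variables to also have BOOTSERVICE_ACCESS, which is why the combined constant sets all three bits.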
UmkaOS wraps the UEFI variable services with interrupt-disabled, locked calls:
/// Read a UEFI variable by name and GUID.
///
/// Returns the variable data on success, or an `EfiStatus` error code.
/// Common errors: `EFI_NOT_FOUND` (variable absent), `EFI_BUFFER_TOO_SMALL`
/// (internal — handled by the wrapper via a two-pass size query).
pub fn efi_get_variable(
name: &UcsStr,
guid: &EfiGuid,
) -> Result<(Vec<u8>, u32 /* attributes */), EfiStatus>;
/// Write or delete a UEFI variable.
///
/// Pass `data = &[]` with `attrs = 0` to delete an existing variable.
/// Authenticated variables (e.g., db, dbx) require a signed payload structure
/// in `data`; the firmware validates the signature before writing.
pub fn efi_set_variable(
name: &UcsStr,
guid: &EfiGuid,
attrs: u32,
data: &[u8],
) -> Result<(), EfiStatus>;
Uses by the kernel:
- Reading the `SecureBoot` variable (GUID `{8be4df61-...}`) to determine whether Secure Boot is active (see Section 2.1.8.1).
- Reading and writing the `BootOrder` and `Boot####` variables to manage UEFI boot entries (used by `umka-efibootmgr`, a userspace tool that delegates to the kernel via an ioctl).
- Delivering `db`/`dbx` updates as authenticated variable writes during the key compromise recovery process (see Section 2.1.8.1.1).
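The Secure Boot check reduces to interpreting the variable payload: per the UEFI specification, `SecureBoot` is a single byte where 1 means enforcement is active. A minimal sketch (the real code calls `efi_get_variable` and treats `EFI_NOT_FOUND` as "not active"; the function name here is illustrative):

```rust
/// Interpret the payload of the UEFI `SecureBoot` variable.
/// One byte: 1 = Secure Boot active, 0 = inactive.
/// An absent variable (EFI_NOT_FOUND) also means inactive.
fn secure_boot_active(payload: Option<&[u8]>) -> bool {
    match payload {
        Some([1, ..]) => true, // first byte is 1: enforcement on
        _ => false,            // 0, empty, malformed, or variable absent
    }
}
```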
NVRAM wear: EFI NVRAM has limited write endurance (typically 100,000 to 1,000,000 cycles depending on the flash technology). The kernel must not write EFI variables at high frequency. Policy variables, boot configuration, and security databases are the intended use; per-boot or per-minute writes are acceptable; per-second writes are not.
2.1.9.3 Time Services
UEFI provides GetTime(time, capabilities) and SetTime(time) for wall-clock time
access, and GetWakeupTime/SetWakeupTime for ACPI alarm-based resume.
UmkaOS calls EFI GetTime() exactly once: during early boot (between Phase 2 and Phase 3 of the x86-64 initialization sequence) to read the hardware RTC and initialize the kernel wall clock. All subsequent timekeeping uses hardware-direct paths:
- x86-64: HPET, TSC, LAPIC timer via direct MMIO and MSR reads.
- AArch64: ARM Generic Timer (`CNTPCT_EL0`, `CNTFRQ_EL0`) via system registers.
- RISC-V: `rdtime` pseudo-instruction, frequency from Device Tree.
- PPC32/PPC64LE: Timebase register (`mftb`) and decrementer SPR.
This avoids the serialization cost of EfiRuntime::lock on the timekeeping hot
path. EFI SetTime() is called when the user updates the wall clock (e.g., via
adjtimex(2) or settimeofday(2)) to propagate the change back to the hardware
RTC.
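The GetTime()-to-wall-clock conversion is plain calendar arithmetic: an `EFI_TIME` (year/month/day, hour/minute/second, UTC) becomes Unix seconds via a civil-date-to-days computation. A sketch using the standard algorithm (function names illustrative, not the UmkaOS API):

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (Howard Hinnant's civil-date algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y - era * 400;                    // year of era [0, 399]
    let mp = if m > 2 { m - 3 } else { m + 9 }; // March-based month [0, 11]
    let doy = (153 * mp + 2) / 5 + d - 1;       // day of year [0, 365]
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146097 + doe - 719468                 // shift epoch to 1970-01-01
}

/// Convert an RTC reading (UTC) to Unix seconds. Illustrative signature.
fn rtc_to_unix(y: i64, mo: i64, d: i64, h: i64, mi: i64, s: i64) -> i64 {
    days_from_civil(y, mo, d) * 86_400 + h * 3_600 + mi * 60 + s
}
```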
2.1.9.4 Reset and Shutdown
ResetSystem(type, status, data_size, data) is the UEFI-standard mechanism for
system reset and shutdown. The type field is one of:
| Type | Value | Semantics |
|---|---|---|
| `EfiResetCold` | 0 | Full hardware reset, re-runs POST. |
| `EfiResetWarm` | 1 | Warm reset without POST (where supported by platform). |
| `EfiResetShutdown` | 2 | Power off via ACPI S5 state. |
| `EfiResetPlatformSpecific` | 3 | Vendor-defined reset type identified by GUID in `data`. |
UmkaOS maps Linux reboot syscall commands to EFI reset types as follows:
| `reboot(2)` command | UEFI call |
|---|---|
| `LINUX_REBOOT_CMD_RESTART` | `EfiResetCold` |
| `LINUX_REBOOT_CMD_POWER_OFF` | `EfiResetShutdown` |
| `LINUX_REBOOT_CMD_HALT` | `EfiResetShutdown` (processor halt before calling) |
| `LINUX_REBOOT_CMD_RESTART2` (with command string) | `EfiResetPlatformSpecific` with distro-specific GUID |
Fallback path when EFI runtime is unavailable. If EfiRuntime::available is
false (non-UEFI boot, firmware bug, or SetVirtualAddressMap() failure), UmkaOS
falls back to ACPI-direct paths:
- Shutdown: write the ACPI sleep type for S5 (from the `\_S5` object in the DSDT) to the PM1a Control Register (`PM1a_CNT`), setting the `SLP_EN` bit.
- Reset: write `0x06` to I/O port `0xCF9` (the chipset reset control register, widely supported on x86 platforms), or use `ACPI_RESET_REG` if defined in the FADT.
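The choice between the EFI path and the ACPI-direct fallback can be sketched as a small dispatch. Names here are hypothetical; the real code issues the `ResetSystem()` call, the `PM1a_CNT` write, or the port `0xCF9` write at the end of each arm.

```rust
/// Which mechanism the kernel will use for a reset/shutdown request (sketch).
#[derive(Debug, PartialEq)]
enum ResetPath {
    EfiResetSystem, // EfiRuntime available: use ResetSystem()
    AcpiS5,         // shutdown fallback: S5 sleep type + SLP_EN into PM1a_CNT
    Port0xCF9,      // reset fallback: write 0x06 to I/O port 0xCF9
}

/// `want_shutdown`: true for power-off, false for reboot.
/// `efi_available`: mirrors EfiRuntime::available.
fn pick_reset_path(want_shutdown: bool, efi_available: bool) -> ResetPath {
    if efi_available {
        ResetPath::EfiResetSystem
    } else if want_shutdown {
        ResetPath::AcpiS5
    } else {
        ResetPath::Port0xCF9
    }
}
```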
2.2 First-Class Architectures
UmkaOS targets six architectures as first-class citizens. All six receive equal design consideration, CI testing, and performance optimization.
| Architecture | Status | Isolation mechanism | Notes |
|---|---|---|---|
| x86-64 | Primary dev target | Intel MPK (WRPKRU) | Most mature, widest hardware |
| aarch64 | First-class, day one | POE (ARMv8.9+) / page-table fallback | ARM servers, Apple Silicon (VM) |
| armv7 | First-class, day one | DACR memory domains | Embedded, IoT, Raspberry Pi |
| riscv64 | First-class, day one | Page-table based | Emerging server/embedded platform |
| ppc32 | First-class, day one | Segment registers / page-table based | Embedded PowerPC, AmigaOne, networking appliances |
| ppc64le | First-class, day one | HPT / Radix MMU / page-table based | POWER servers, IBM POWER8/9/10, Raptor Talos II |
2.2.1 Architecture-Specific Code
Architecture-specific code is isolated under arch/ and umka-core/src/arch/:
- Boot code: Rust and assembly, per-architecture
- Syscall entry/exit: Assembly stubs
- Context switch: Assembly (register save/restore)
- Interrupt dispatch: Assembly stubs into Rust handlers
- vDSO: Per-architecture user-accessible pages (see Section 2.2.1.1)
- MPK / isolation primitives: Abstracted behind a common `IsolationDomain` trait
2.2.1.1 vDSO (Virtual Dynamic Shared Object)
The vDSO is a small ELF shared library that the kernel maps into every user process's address space at process creation. It provides fast userspace implementations of a small set of syscalls that can be answered without entering the kernel — specifically time-related syscalls — by reading kernel-maintained data from a shared page (the VVAR page).
Why the vDSO matters for performance: clock_gettime(CLOCK_MONOTONIC) is
called millions of times per second in high-performance workloads (databases, gRPC,
event loops). A kernel entry costs 100-300 ns on x86-64 with KPTI. The vDSO path
costs ~5-20 ns — a 10-30x speedup. UmkaOS implements the Linux-compatible vDSO ABI
so that existing glibc, musl, and uclibc-ng builds use the fast path automatically,
with no changes to userspace.
Virtual address layout per process (x86-64 example, above stack, ASLR-randomized):
high address
[vdso ELF] 1-4 pages, PROT_READ|PROT_EXEC — contains function code
[vvar page] 1 page (4 KB), PROT_READ — kernel writes, userspace reads
low address
The VVAR page is mapped immediately below the vDSO ELF. Its address is derived by
the vDSO code using a fixed negative offset from the vDSO load address (computed
by the linker script). The kernel communicates the VVAR page address to userspace
via the ELF auxiliary vector (AT_SYSINFO_EHDR points to the vDSO ELF base).
VVAR Page Layout:
/// Kernel-maintained data page shared with userspace for vDSO fast paths.
/// The kernel writes this page using a seqlock protocol; the vDSO reads it
/// without kernel entry.
///
/// This page is mapped read-only into every user process. The kernel maps it
/// read-write in kernel virtual address space only.
///
/// The layout is fixed ABI: the vDSO ELF references fields at fixed offsets.
/// Adding new fields must not change existing field offsets.
#[repr(C, align(4096))]
pub struct VvarPage {
/// Seqlock sequence counter.
/// Invariant: odd = kernel write in progress; even = data is stable.
/// The vDSO reads this before and after reading data fields; if the
/// value changes or is odd, it retries from the beginning.
pub seq: AtomicU32,
pub _pad_seq: u32,
/// CLOCK_REALTIME: seconds since Unix epoch (TAI - leap seconds).
pub clock_realtime_sec: u64,
/// CLOCK_REALTIME: nanoseconds within the current second (0..999_999_999).
pub clock_realtime_nsec: u32,
pub _pad_rt: u32,
/// CLOCK_MONOTONIC: nanoseconds since kernel boot (never steps backward).
pub clock_monotonic_ns: u64,
/// TSC-to-nanoseconds conversion multiplier.
/// Formula: ns_delta = (tsc_delta * tsc_to_ns_mul) >> tsc_to_ns_shift
/// Valid only when the hardware TSC is stable (invariant TSC required).
/// Zero means TSC is not usable; fall back to a syscall.
pub tsc_to_ns_mul: u32,
/// TSC-to-nanoseconds conversion shift (see tsc_to_ns_mul).
pub tsc_to_ns_shift: u32,
/// TSC value at the time of the last VVAR update.
pub tsc_base: u64,
/// Timezone offset in minutes west of UTC (matches `struct timezone.tz_minuteswest`).
pub tz_minuteswest: i32,
/// DST correction type (matches `struct timezone.tz_dsttime`; always 0 in practice).
pub tz_dsttime: i32,
/// Architecture-specific counter base for non-TSC paths.
/// x86-64: unused (TSC used directly).
/// AArch64: CNTVCT_EL0 value at last update.
/// RISC-V: `rdtime` value at last update.
pub arch_counter_base: u64,
/// Architecture-specific counter frequency (Hz).
/// x86-64: TSC frequency.
/// AArch64: CNTFRQ_EL0.
/// RISC-V: timer-frequency from Device Tree.
pub arch_counter_freq_hz: u64,
/// Per-CPU snapshot of the current CPU index (for __vdso_getcpu).
/// Updated on each scheduler tick. Not cycle-precise; approximate is acceptable.
pub cpu_id: u32,
/// NUMA node of the current CPU (for __vdso_getcpu).
pub numa_node: u32,
pub _pad: [u8; 4016], // Explicit padding: 80 bytes of fields above + 4016 = 4096 exactly.
// (Do not rely on implicit tail padding from align(4096) — explicit is safer as fields grow.)
// Compile-time guard: const _: () = assert!(core::mem::size_of::<VvarPage>() == 4096);
}
Exported Symbols (Linux-compatible ABI):
The vDSO ELF exports the following symbols with STV_DEFAULT visibility. These
match the Linux x86-64 vDSO symbol names exactly so that glibc and other libc
implementations find them without modification:
| Symbol | Signature | Supported clocks |
|---|---|---|
| `__vdso_clock_gettime` | `(clockid_t clk_id, struct timespec *tp) -> int` | CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME_COARSE, CLOCK_MONOTONIC_COARSE |
| `__vdso_gettimeofday` | `(struct timeval *tv, struct timezone *tz) -> int` | All (derives from clock_realtime) |
| `__vdso_time` | `(time_t *tloc) -> time_t` | Derives from `clock_realtime_sec` |
| `__vdso_clock_getres` | `(clockid_t clk_id, struct timespec *res) -> int` | Returns resolution for supported clocks |
| `__vdso_getcpu` | `(unsigned *cpu, unsigned *node) -> int` | Returns `VvarPage::cpu_id` and `numa_node` |
On AArch64, the equivalent symbols use the same names but read CNTVCT_EL0
(virtual counter) instead of RDTSC. On RISC-V, rdtime is used. The VVAR
arch_counter_base and arch_counter_freq_hz fields supply the base and
frequency needed for the conversion.
For clock IDs that the vDSO does not handle (e.g., CLOCK_PROCESS_CPUTIME_ID,
CLOCK_THREAD_CPUTIME_ID, CLOCK_BOOTTIME), the vDSO falls back to a real
syscall via the syscall instruction (x86-64) or SVC / ecall (AArch64,
RISC-V).
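The dispatch can be sketched as a guard that routes unsupported clock IDs to the syscall path. The clock ID constants below are the standard Linux uapi values (part of the stable Linux ABI); the function name is illustrative.

```rust
// Linux uapi clock IDs (stable ABI values).
const CLOCK_REALTIME: u32 = 0;
const CLOCK_MONOTONIC: u32 = 1;
const CLOCK_PROCESS_CPUTIME_ID: u32 = 2;
const CLOCK_THREAD_CPUTIME_ID: u32 = 3;
const CLOCK_MONOTONIC_RAW: u32 = 4;
const CLOCK_REALTIME_COARSE: u32 = 5;
const CLOCK_MONOTONIC_COARSE: u32 = 6;
const CLOCK_BOOTTIME: u32 = 7;

/// Can the vDSO answer this clock from the VVAR page alone?
/// Anything else falls back to a real syscall.
fn vdso_fast_path(clk_id: u32) -> bool {
    matches!(
        clk_id,
        CLOCK_REALTIME
            | CLOCK_MONOTONIC
            | CLOCK_MONOTONIC_RAW
            | CLOCK_REALTIME_COARSE
            | CLOCK_MONOTONIC_COARSE
    )
}
```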
Seqlock Update Protocol:
Kernel (called on each timer tick or TSC calibration update):
1. VvarPage::seq.fetch_add(1, Release) // seq becomes odd: write in progress
2. write clock_realtime_sec, clock_realtime_nsec, clock_monotonic_ns
3. write tsc_base, tsc_to_ns_mul, tsc_to_ns_shift (if TSC calibration changed)
4. write arch_counter_base (architecture-specific counter snapshot)
5. write cpu_id, numa_node (approximate; read from CpuLocal::cpu_id)
6. VvarPage::seq.fetch_add(1, Release) // seq becomes even: write complete
vDSO userspace (pseudocode for __vdso_clock_gettime):
loop:
seq1 = load(VvarPage::seq, Acquire)
if seq1 & 1 != 0: continue // write in progress, retry
tsc_now = RDTSC (or arch counter)
ns = clock_monotonic_ns + ((tsc_now - tsc_base) * tsc_to_ns_mul) >> tsc_to_ns_shift
seq2 = load(VvarPage::seq, Acquire)
if seq2 != seq1: continue // update raced, retry
return ns
The retry loop is expected to execute zero times in practice: timer tick updates are infrequent (1–10 ms intervals) and short (< 1 μs). The loop exists only for correctness on the rare overlap.
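The protocol can be exercised end-to-end in plain Rust. This is a host-side sketch with only the sequence counter and one data field, to show the odd/even retry discipline; the data field is modeled as an atomic to keep the sketch data-race-free, whereas the real VVAR fields are plain memory written under the seqlock.

```rust
use std::sync::atomic::{
    AtomicU32, AtomicU64,
    Ordering::{Acquire, Release},
};

/// Reduced VVAR sketch: sequence counter plus one data field.
struct VvarSketch {
    seq: AtomicU32,
    clock_monotonic_ns: AtomicU64,
}

impl VvarSketch {
    /// Kernel side: bracket the data write with two increments.
    fn publish(&self, ns: u64) {
        self.seq.fetch_add(1, Release); // odd: write in progress
        self.clock_monotonic_ns.store(ns, Release);
        self.seq.fetch_add(1, Release); // even: write complete
    }

    /// vDSO side: retry until a consistent even-sequence snapshot is read.
    fn read(&self) -> u64 {
        loop {
            let seq1 = self.seq.load(Acquire);
            if seq1 & 1 != 0 {
                continue; // write in progress, retry
            }
            let ns = self.clock_monotonic_ns.load(Acquire);
            if self.seq.load(Acquire) == seq1 {
                return ns; // no update raced with us
            }
        }
    }
}
```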
ELF Build Requirements:
The vDSO ELF is built as a position-independent shared library with no dynamic linker dependencies:
- Compiled with `-fPIC -fno-plt -nostdlib -Wl,-shared`.
- No external symbol references (self-contained; no libc, no PLT stubs).
- Linked with a custom linker script that produces exactly two `PT_LOAD` segments (one RX for code, one R for read-only data) plus `PT_DYNAMIC` and `PT_GNU_EH_FRAME`.
- Stripped of debugging sections; `.eh_frame` retained for unwinding (stack traces in userspace debuggers work through the vDSO).
- The vDSO ELF is embedded in the kernel image as a byte array in `.rodata`. At process creation (exec), the kernel copies it into a freshly allocated page and maps it with `PROT_READ | PROT_EXEC`.
Architecture-Specific Notes:
| Architecture | Counter instruction | Notes |
|---|---|---|
| x86-64 | `RDTSC` | Requires invariant TSC (CPUID leaf 0x80000007 bit 8). Non-invariant TSC (laptops with deep C-states on pre-Nehalem) falls back to syscall. |
| AArch64 | `MRS x0, CNTVCT_EL0` | Virtual counter. `CNTFRQ_EL0` gives the frequency. Always available on ARMv8+ in EL0. |
| ARMv7 | `MRC p15, 0, r0, c14, c3, 2` (CNTVCT) | Available on Cortex-A7/A15 with generic timer. Falls back to syscall if not available. |
| RISC-V 64 | `rdtime` pseudo-instruction | Frequency from Device Tree `/cpus/timebase-frequency`. |
| PPC32 | `mftb` (Timebase lower) | PPC32 vDSO is limited; gettimeofday uses syscall fallback on embedded targets without an invariant timebase. |
| PPC64LE | `mftb` | Timebase register, frequency from Device Tree or OPAL. |
Per-architecture vDSO placement in arch/:
umka-kernel/src/arch/x86_64/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/aarch64/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/armv7/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/riscv64/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/ppc32/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/ppc64le/vdso/ — vdso.S, vdso.ld, vvar.rs
The VvarPage struct definition is shared (umka-kernel/src/vvar.rs); only the
counter-reading instructions in vdso.S and the linker load addresses in vdso.ld
differ per architecture.
Per-architecture hardware abstraction equivalents:
| Concept | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE |
|---|---|---|---|---|---|---|
| Privilege separation | GDT (ring 0/3 segments) | Exception levels (EL0/EL1) | Processor modes (USR/SVC) | Privilege levels (U/S) | MSR PR bit (user/supervisor) | MSR PR bit (user/supervisor) |
| Exception dispatch | IDT (256 gate descriptors) | Exception vector table (VBAR_EL1, 16 entries × 4 vectors) | Vector table (VBAR, 8 entries) | Trap vector (stvec, single entry + scause dispatch) | Exception vector table (IVPR + IVORn) | System Reset + Machine Check vectors (LPCR) |
| Interrupt controller | APIC (LAPIC + IOAPIC) | GIC v2/v3 (distributor + redistributor/CPU interface, detected at runtime) | GIC (distributor + CPU interface) | PLIC (+ CLINT for timer/IPI) | OpenPIC / MPIC | XICS / XIVE (POWER8/9/10) |
| Timer | APIC timer / HPET / TSC | Generic Timer (CNTPCT_EL0) | Generic Timer (CNTPCT) | SBI timer ecall / mtime | Decrementer (DEC SPR) | Decrementer (DEC SPR) / HDEC |
| Syscall mechanism | SYSCALL/SYSRET (MSRs) | SVC instruction (EL0→EL1) | SVC instruction (USR→SVC) | ecall instruction (U→S) | sc instruction (system call) | sc instruction / scv (POWER9+) |
| Page table format | 4-level (PML4→PDPT→PD→PT) | 4-level (L0→L1→L2→L3) | 2-level (L1→L2, 1MB sections) | 4-level Sv48 | 2-level (PGD→PTE, 4 KB pages) | Radix tree (POWER9+) or HPT (hashed page table) |
| Fast isolation | MPK (WRPKRU) | POE (POR_EL0) / MTE | DACR (16 domains) | Page-table based | Segment registers (16 segments) | Radix partition table / HPT LPAR |
| TLB ID | PCID (12-bit, CR3) | ASID (8/16-bit, TTBR) | ASID (8-bit, CONTEXTIDR) | ASID (9-16 bit, satp) | PID (8-bit, via PID SPR) | PID/LPID (Radix: 20-bit PID, LPIDR) |
Everything else -- scheduling, memory management, capability system, driver model, syscall compatibility -- is architecture-independent Rust code.
2.2.2 No 32-bit Compatibility Modes on 64-bit Kernels
UmkaOS does not support running 32-bit binaries on 64-bit kernels:
- No ia32 compatibility mode on x86-64
- No AArch32 compatibility mode on AArch64
- No RV32 compatibility mode on RV64
ARMv7 (32-bit ARM) is supported as a native first-class architecture — it runs a native 32-bit kernel, not a compatibility layer on a 64-bit kernel. This follows the principle that 32-bit support, where needed, is added as a separate target rather than as a compatibility layer that doubles the syscall surface.
2.2.3 64-bit Atomics on 32-bit Architectures
UmkaOS uses AtomicU64 in several core data structures (PTY ring buffers, MCE logs,
lock-free IPC). On 32-bit architectures where native 64-bit atomics have limited
support, the following strategies apply:
| Architecture | Native 64-bit Atomic | Strategy |
|---|---|---|
| ARMv7 (Cortex-A) | `LDREXD`/`STREXD` (available on all ARMv7-A cores with LPAE) | Native hardware atomics. The `armv7a-none-eabi` target supports `AtomicU64` via doubleword exclusive load/store. Non-LPAE cores (Cortex-M, ARMv6) are not first-class targets. |
| PPC32 | No native 64-bit atomics | Software emulation via interrupt-disabling (wrteei 0/1) around read-modify-write sequences. Implemented in umka-kernel/src/arch/ppc32/atomics.rs. The custom target JSON sets max-atomic-width: 64 so LLVM generates calls to __atomic_* runtime functions provided by the kernel. |
Both strategies are safe in a single-core or SMP-with-coherence context. The interrupt-disabling approach on PPC32 is correct because UmkaOS's 32-bit targets are single-core embedded systems; SMP PPC uses 64-bit PPC64LE.
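The PPC32 emulation pattern — disable interrupts, perform the plain 64-bit read-modify-write, re-enable — can be sketched generically. In this host-side sketch a plain closure stands in for the `wrteei 0` / `wrteei 1` bracket, so only the logic is exercised; the exclusivity argument holds only because the real target is single-core.

```rust
use std::cell::Cell;

/// Stand-in for the interrupt-disable bracket. Real code: save MSR.EE,
/// `wrteei 0`, run the closure, restore MSR.EE via `wrteei 1`.
fn with_irqs_disabled<R>(f: impl FnOnce() -> R) -> R {
    f()
}

/// Host-side sketch of the emulated 64-bit atomic backing the kernel's
/// `__atomic_fetch_add_8` on PPC32.
struct EmulatedAtomicU64 {
    value: Cell<u64>, // plain storage; exclusivity comes from the IRQ bracket
}

impl EmulatedAtomicU64 {
    fn new(v: u64) -> Self {
        Self { value: Cell::new(v) }
    }

    /// Read-modify-write executed with interrupts off; on a single-core
    /// target nothing can interleave with it.
    fn fetch_add(&self, delta: u64) -> u64 {
        with_irqs_disabled(|| {
            let old = self.value.get();
            self.value.set(old.wrapping_add(delta));
            old
        })
    }
}
```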
2.2.4 Advanced Feature Architecture Parity
Chapters 16–18 define advanced features that rely on architecture-specific hardware
mechanisms. The following matrix summarizes support status across all six first-class
architectures. Where hardware is unavailable, UmkaOS either provides a software fallback
(reduced performance) or marks the feature as not supported on that architecture. The
kernel's #[cfg(target_feature)] mechanism ensures unsupported paths compile to no-ops
with zero overhead.
| Feature | Mechanism | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE |
|---|---|---|---|---|---|---|---|
| Fast driver isolation | MPK/POE/DACR/page-table | WRPKRU (native) | POE (ARMv8.9+, POR_EL0) / page-table fallback | DACR 16 domains | Page-table based | Segment registers (16 segments) / page-table fallback | Radix partition table / HPT LPAR |
| Memory tagging | MTE/LAM | Intel LAM (pointer tagging only) | MTE (full, ARMv8.5+) | Not available | Not available | Not available | Not available |
| Hardware power metering | RAPL/SCMI/SBI | RAPL (native) | SCMI power domain | SCMI (limited) | SBI PMU (basic) / software estimation | Not available (software only) | OPAL/OCC power sensors (POWER8/9/10) |
| Confidential computing | SEV-SNP/TDX/CCA/CoVE | SEV-SNP + TDX (native) | ARM CCA (emerging) | Not available | RISC-V CoVE (draft) | Not available | Ultravisor Protected Execution Facility (POWER9+) |
| Cache partitioning | CAT/MPAM | Intel CAT + MBA (native) | ARM MPAM (ARMv8.4+) | Not available | Not available (software only) | Not available | Not available (software only) |
| Hardware preemption (GPU) | Device-dependent | Yes (vendor support) | Yes (Mali, Adreno) | Limited | Emerging | Not available | Limited (Nvidia via PCIe) |
| CXL memory pooling | CXL 2.0/3.0 | Native (PCIe 5.0+) | Emerging (ARMv9 + CXL) | Not available | Not available | Not available | OpenCAPI / CXL (POWER10+) |
| In-kernel inference | ISA extensions | AMX (matrix), AVX-512 | SME (matrix), SVE (vector) | NEON (vector) | V extension (vector) | AltiVec/SPE (limited) | VSX (vector-scalar, POWER7+) |
Reading the table: "Native" means hardware support is available and UmkaOS uses it directly. "Fallback" means UmkaOS implements the feature using a slower mechanism (typically page-table manipulation). "Not available" means neither hardware nor a practical software fallback exists — the feature is compile-time disabled on that architecture. "Emerging" or "draft" means the hardware specification exists but is not yet widely deployed; UmkaOS includes provisional support gated behind a feature flag.
2.3 Hardware Memory Safety
2.3.1 ARM MTE (Memory Tagging Extension)
ARM MTE is architecturally defined in ARMv8.5-A and first implemented in ARMv9 silicon. MTE availability depends on both the core IP implementing the extension AND the SoC vendor enabling tag storage in the memory subsystem:
- Core IP with MTE: ARM Neoverse V2, Neoverse V3 (all cores based on these designs implement the MTE extension at the microarchitectural level).
- Mobile SoCs with MTE enabled: Google Pixel 8/9 (Tensor G3/G4, Cortex-X3/X4), MediaTek Dimensity 9300+ devices.
- Datacenter SoC with MTE enabled: AmpereOne (the first datacenter SoC to fully enable MTE at the platform level, including tag storage in DRAM).
- Cloud SoCs with MTE logic but NOT enabled: AWS Graviton 4 (Neoverse V2) and Google Axion (Neoverse V2) include MTE logic in the cores but their memory subsystems do not support tag storage — MTE is not usable on these platforms despite the core IP implementing it.
- No MTE: Ampere Altra (Neoverse N1, ARMv8.2 — predates MTE entirely).
Every 16-byte memory granule carries a 4-bit tag. Pointer top bits carry a matching tag. Hardware compares the two on every access; a mismatch faults. This catches use-after-free and buffer overflow in hardware, at near-zero runtime cost.
Important limitation: MTE is probabilistic, not complete. 4-bit tags = 16
possible values. Adjacent slab objects may receive the same tag by random chance
(probability 1/16 = 6.25%). Single-violation detection rate: ~93.75%. This is
acceptable for defense-in-depth — Rust's ownership model is the primary safety
mechanism; MTE is an additional hardware layer that catches what Rust cannot
(C driver bugs in Tier 1, unsafe blocks, compiler bugs). MTE is NOT a
substitute for memory-safe code.
Tag Storage Requirement:
ARM MTE stores tags in storage managed by the memory controller: 4 bits per 16-byte
granule. Relative to DRAM capacity, this means tag storage is sized at 3.125% of DRAM
(4 bits / 128 bits = 1/32). High-performance implementations (Neoverse V2/V3,
AmpereOne) typically use dedicated Tag RAM; other implementations may use reserved
DRAM regions managed transparently by the memory controller. In all cases, the
storage is invisible to software and managed automatically by the hardware.
On SoCs without MTE support, the tagging code is compiled out
(#[cfg(target_feature = "mte")]) — zero overhead, zero memory cost.
MTE is only available on ARM; x86 systems are entirely unaffected.
TEE interaction: MTE tags are stored in separate physical tag RAM. For
TEE-encrypted pages, tag RAM may also be encrypted. Confidential pages are
allocated untagged (tag = 0); MTE checking is disabled for pages owned by a
ConfidentialContext (see Section 8.6.3). Hardware encryption already prevents
unauthorized access — MTE is redundant for confidential memory.
Section 4.1.7 already mentions MTE and Intel LAM. This section details the architectural integration.
2.3.2 Design: Tag-Aware Memory Allocator
// umka-core/src/mem/tagging.rs
/// Memory tagging policy (system-wide, configurable at boot).
#[repr(u32)]
pub enum TaggingPolicy {
/// No tagging. Standard allocation. Zero overhead.
/// Used on hardware without MTE, or for maximum performance.
Disabled = 0,
/// Synchronous tagging: fault immediately on tag mismatch.
/// Catches all tag violations. ~128 extra cycles per page allocation.
/// Recommended for development and high-security production.
Synchronous = 1,
/// Asynchronous tagging: record violations in a register, check lazily.
/// Lower overhead (~10 cycles per allocation), but violations reported
/// with delay. Good for production with logging.
Asynchronous = 2,
}
/// Tag operations for the memory allocator.
pub trait MemoryTagger {
/// Assign a random tag to a newly allocated region.
/// Called by: slab allocator (per-object), buddy allocator (per-page).
fn tag_allocation(&self, addr: *mut u8, size: usize) -> TaggedPtr;
/// Clear tags on freed memory (set to a "freed" tag value).
/// Any subsequent access with the old tag will fault.
fn tag_deallocation(&self, addr: *mut u8, size: usize);
/// Set tags for a DMA buffer region (tag = 0, untagged).
/// DMA engines don't understand tags — buffers must be untagged.
fn untag_dma_region(&self, addr: *mut u8, size: usize);
}
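The tagged-pointer arithmetic behind `tag_allocation` can be illustrated with plain bit manipulation: MTE carries the 4-bit logical tag in virtual address bits 59:56 (within the TBI top byte). A sketch using `u64` addresses rather than raw pointers so it runs anywhere; the function names are illustrative, not the `MemoryTagger` API.

```rust
const TAG_SHIFT: u32 = 56;
const TAG_MASK: u64 = 0xF << TAG_SHIFT; // MTE logical address tag: bits 59:56

/// Insert a 4-bit tag into a pointer's tag field.
fn set_tag(addr: u64, tag: u8) -> u64 {
    (addr & !TAG_MASK) | ((tag as u64 & 0xF) << TAG_SHIFT)
}

/// Extract the 4-bit tag from a tagged pointer.
fn get_tag(addr: u64) -> u8 {
    ((addr >> TAG_SHIFT) & 0xF) as u8
}

/// Strip the tag (e.g., before handing the address to a DMA engine,
/// mirroring `untag_dma_region`).
fn untag(addr: u64) -> u64 {
    addr & !TAG_MASK
}
```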
2.3.3 Integration Points
Slab allocator (Section 4.1.2):
Object allocation:
1. Allocate object from slab (existing path).
2. Assign random 4-bit tag to the object's 16-byte granules.
3. Return tagged pointer (tag in top bits).
Object deallocation:
1. Return object to slab (existing path).
2. Set the object's granules to a "freed" tag (e.g., 0xF).
3. Any subsequent access with the old tag faults immediately.
Benefit: use-after-free in kernel (or in Tier 1 C drivers) is caught
by hardware. The fault is caught by domain isolation and triggers driver crash recovery.
Page allocator (Section 4.1.1):
Page allocation: tag all granules in the page with a fresh tag.
Page deallocation: tag all granules with "freed" tag.
Granule counts: 4KB page = 256 granules (4096 / 16); 64KB page = 4096 granules (65536 / 16).
Cost (4KB page): 256 STG instructions per alloc/dealloc (or 128 ST2G/STZ2G,
each tagging two 16-byte granules).
At ~0.5 cycles per STG on A510+ cores: ~128 cycles with STG (64 cycles with ST2G). Page alloc is ~300+ cycles.
Overhead (4KB): ~43% with STG (128 tag cycles / ~300 base cycles); ~21% with ST2G.
Cost (64KB page): 4096 STG instructions (or 2048 ST2G/STZ2G instructions).
At ~0.5 cycles per STG: ~2048 cycles with STG (~1024 cycles with ST2G/STZ2G). Page alloc is ~300+ cycles.
Overhead (64KB): ~683% with individual STG (2048 tag cycles / ~300 base cycles);
~341% with ST2G/STZ2G (1024 tag cycles / ~300 base cycles). Prefer STZ2G for bulk
tagging as it zeros and tags in one pass. The 4KB case is the common slab/page
allocation path. 64KB huge-page allocation is rarely hot and the high overhead is
acceptable.
Note: this only affects ARM. On x86 without MTE, zero overhead.
On ARM without MTE enabled, zero overhead (policy = Disabled).
KABI boundary:
When kernel passes a buffer to a Tier 1 driver:
Buffer is tagged. Driver receives tagged pointer.
If driver overflows the buffer: tag mismatch, hardware fault.
Domain isolation catches the fault, driver is crash-recovered.
This provides hardware-enforced bounds checking for C drivers,
even though the kernel is written in Rust (which checks bounds in software).
DMA buffers:
DMA engines cannot process tagged memory.
DMA buffers are allocated untagged (tag = 0).
IOMMU validates DMA addresses regardless.
fork() / CoW:
Before CoW break: child shares parent's page (same tags, read-only).
On CoW break (child or parent writes):
1. Allocate new page, copy data.
2. Assign FRESH RANDOM tags to the new page's granules.
3. Do NOT copy the old page's tags.
Rationale: if both pages kept the same tags, a stale pointer from
one process could access the other's now-separate page without
a tag fault (same tag, different physical page). Fresh tags ensure
that cross-process stale pointers are detected by MTE.
2.3.4 Intel LAM (Linear Address Masking)
Intel LAM allows using top bits of 64-bit pointers for metadata without them being treated as part of the address. This is less powerful than MTE (no hardware tag checking), but useful for:
- Pointer authentication (storing metadata in unused address bits)
- Memory safety tooling (KASAN-like in-kernel detection)
- Capability tagging (embedding capability metadata in pointers)
LAM modes:
LAM_U48: bits 62:48 available for metadata (15 bits, user pointers only).
LAM_U57: bits 62:57 available for metadata (6 bits, 5-level paging mode).
Controlled via CR3 flags: CR3.LAM_U48 or CR3.LAM_U57.
No runtime cost: address masking is performed by hardware in the MMU pipeline.
Comparison with MTE:
MTE (ARM): 4-bit tag per 16-byte granule. Hardware CHECKS on every access.
Detects use-after-free, buffer overflow at runtime. ~128 cycles per
page allocation for tag setup. Zero-cost access checks (pipelined).
LAM (x86): 6-15 metadata bits per pointer. NO hardware checking — metadata is
simply ignored by the MMU. Software must perform its own checks.
Zero overhead. Useful for tooling metadata, not for runtime safety.
Result: MTE provides stronger guarantees (hardware-enforced); LAM provides
more flexible metadata embedding. UmkaOS uses both where available.
Integration: the memory allocator stores metadata in LAM bits. Debug builds use these bits for KASAN-equivalent checking. Release builds can optionally use them for capability hints.
Security caveat: Intel LAM has been disabled in the Linux kernel since v6.12 due to the SLAM attack (Spectre-based exploitation of LAM metadata bits without LASS protection). UmkaOS does not enable LAM unless LASS (Linear Address Space Separation) is also available on the CPU. On CPUs without LASS, the upper address bits described above are not used for metadata; KASAN-equivalent checking uses shadow memory instead. When both LAM and LASS are present, LAM is enabled with the protections described above.
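The LAM_U48 metadata embedding described above is plain masking in software: 15 metadata bits live in pointer bits 62:48, with bit 63 left alone for the canonical sign. A sketch with `u64` addresses (illustrative names, not the allocator API):

```rust
const META_SHIFT: u32 = 48;
const META_BITS: u64 = 0x7FFF;                  // 15 bits: 62:48 (LAM_U48)
const META_MASK: u64 = META_BITS << META_SHIFT;

/// Embed metadata in a user pointer's LAM_U48 bits.
fn embed(addr: u64, meta: u16) -> u64 {
    (addr & !META_MASK) | (((meta as u64) & META_BITS) << META_SHIFT)
}

/// Recover the metadata from a tagged pointer.
fn metadata(ptr: u64) -> u16 {
    ((ptr >> META_SHIFT) & META_BITS) as u16
}

/// The address the MMU effectively translates under LAM_U48: metadata bits
/// masked out. (Hardware sign-extends from bit 47; canonical user addresses
/// have zeros there, so masking suffices for this sketch.)
fn canonical(ptr: u64) -> u64 {
    ptr & !META_MASK
}
```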
2.3.5 AArch64 Pointer Authentication (PAC)
AArch64 provides Pointer Authentication Codes (PAC, ARMv8.3+) as a complementary mechanism to MTE. PAC signs pointers with a cryptographic MAC using a per-process key, detecting pointer forgery and corruption:
PAC in UmkaOS:
- Return address signing: PACIASP/AUTIASP in function prologue/epilogue.
Compiler-inserted via -mbranch-protection=pac-ret+leaf.
- Detects ROP (Return-Oriented Programming) attacks: corrupted return
addresses fail authentication and trap.
- Cost: ~1 cycle per PAC/AUT instruction (pipelined). Zero memory overhead.
- Available on: Apple M1+, AWS Graviton 3+, Cortex-A710+.
UmkaOS enables PAC for all kernel code on capable hardware. This is orthogonal
to MTE (MTE detects memory safety bugs; PAC detects control-flow hijacking).
2.3.6 CHERI (Future)
ARM Morello (CHERI prototype) demonstrates hardware-capability pointers with bounds checking. CHERI pointers are 128-bit: address (64) + bounds (32) + permissions (16) + flags (16). Every pointer carries its own bounds and permission information. Hardware checks on every dereference.
UmkaOS's capability system (Section 8.1.1) is a software capability model. CHERI provides a hardware capability model. When CHERI hardware is available:
Software capabilities (current):
Kernel maintains capability table. Validated on syscall.
Overhead: ~5-10 cycles per capability check (bitmask test).
CHERI hardware capabilities (future):
Pointer IS the capability. Hardware validates on every access.
Overhead: 0 cycles (pipelined with memory access).
UmkaOS's capability tokens become hardware CHERI capabilities.
The translation is natural: both use unforgeable tokens with
bounded permissions and delegation rules.
Design for CHERI readiness: the capability system should NOT assume that capabilities are always validated in software. The validation path should be abstractable so that CHERI hardware validation can replace software validation.
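The abstraction called for above can be sketched as a trait with the software validator as the current implementation. All names here are hypothetical; the real capability model is defined in Section 8.1.1.

```rust
/// Abstraction over capability validation so a CHERI backend can replace
/// the software path later. Illustrative, not the UmkaOS API.
trait CapabilityValidator {
    /// Check that `token` grants all `required` permission bits.
    fn validate(&self, token: u64, required: u64) -> bool;
}

/// Current software model: the token carries a permission bitmask and
/// validation is a bitmask test (the ~5-10 cycle check described above).
struct SoftwareValidator;

impl CapabilityValidator for SoftwareValidator {
    fn validate(&self, token: u64, required: u64) -> bool {
        token & required == required
    }
}

// A future CheriValidator would implement the same trait but compile to
// (almost) nothing: hardware checks bounds and permissions on every
// dereference, so the software check becomes a no-op.
```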
CHERI Morello Status:
ARM Morello evaluation boards shipped in 2022 (based on Neoverse N1 + CHERI extensions). As of 2026, production CHERI hardware is not available. The CHERI readiness design in Section 2.3.6 prepares for future hardware without depending on it. When production CHERI SoCs ship, the capability validation abstraction layer enables a transition from software to hardware capability checks.
2.3.7 Performance Impact
MTE on ARM (when enabled): ~128 cycles per page allocation (~40% of allocator hot path). Memory access checks are hardware-pipelined: zero overhead. Linux pays the same cost when MTE is enabled.
MTE disabled (default on x86, optional on ARM): zero overhead. No code runs.
Intel LAM: zero runtime overhead (address masking is free in hardware).
CHERI (future): zero overhead (hardware-pipelined capability checks).
2.3.8 Hardware Fault Handler Constraints
Hardware fault handlers (machine check exceptions, bus errors, SError, NMI, system error interrupts) operate in extremely constrained contexts where normal kernel operations are forbidden. Violating these constraints causes deadlock, system hang, or recursive faults.
2.3.8.1 Fault Handler Categories
Hardware fault handlers fall into three categories with progressively stricter constraints:
| Category | Examples | Context | Permitted Operations |
|---|---|---|---|
| Maskable interrupts | Timer tick, device IRQ | IRQ context, interrupts disabled | Try-lock, lock-free writes, deferred work |
| Synchronous faults | Page fault, alignment fault, breakpoint | Fault context, preemptible | Blocking locks (with care), allocation (with care) |
| Non-maskable faults | Machine Check (MCE), NMI, SError, Bus Error, System Reset | NMI context, all interrupts blocked | Lock-free only, per-CPU buffers, no locks |
The critical distinction: maskable interrupts can be delayed by disabling interrupts, but non-maskable faults fire regardless of interrupt state. Code holding a spinlock cannot prevent an MCE or NMI from occurring.
2.3.8.2 Non-Maskable Fault Handler Requirements
Non-maskable fault handlers (MCE, NMI, SError, Bus Error, System Reset vectors) MUST follow these rules:
1. No blocking operations. The handler MUST NOT:
- Acquire a spinlock with blocking semantics (lock() / spin_lock())
- Acquire a mutex, rwlock, or semaphore
- Allocate memory (kmalloc, vmalloc, page allocation)
- Sleep or yield (schedule(), wait(), condvar)
- Perform I/O that may block (disk, network)
- Call any function that may transitively do the above
Rationale: The fault may have interrupted code already holding locks. If the handler blocks waiting for the same lock, deadlock occurs immediately.
2. Try-lock only, with fallback. If the handler needs a lock, it MUST use
try-lock (try_lock() / spin_trylock()) and handle failure:
match lock.try_lock() {
    Some(guard) => {
        // Critical section; the guard releases the lock when dropped.
        drop(guard);
    }
    None => {
        // Fallback: cannot acquire lock.
        // Options: log to per-CPU buffer and continue, force reboot, degrade gracefully.
    }
}
3. Per-CPU buffers for logging. NMI/MCE handlers MUST NOT write to shared ring buffers (MPSC, printk). Instead, use a pre-allocated per-CPU buffer:
Data types used by the MCE log:
/// Severity classification of a machine-check event.
#[repr(u32)]
enum MceSeverity {
Corrected = 0, // Hardware corrected; no data loss
Recoverable = 1, // Software-recoverable with page offlining
Fatal = 2, // Unrecoverable; system must reboot
}
/// One entry in the per-CPU MCE ring log.
/// Padded to 64 bytes (one cache line) so that array elements never span cache line
/// boundaries. This prevents false sharing when a remote monitoring thread reads
/// the log while the NMI handler writes it.
///
/// Torn-read detection uses a seqcount-style generation counter (`gen`):
/// the writer sets `gen` to an odd value before writing fields, then to the
/// next even value after writing. A reader that observes an odd `gen` or a
/// changed `gen` between its two reads has caught a torn write and must retry.
#[repr(C, align(64))]
#[derive(Copy, Clone)]
struct MceLogEntry {
gen: u32, // Generation counter (odd = write in progress, even = stable)
_pad_gen: [u8; 4],
timestamp_tsc: u64, // TSC at time of MCE
bank: u8, // MCE bank number
_pad0: [u8; 7],
status: u64, // MCi_STATUS MSR value
address: u64, // MCi_ADDR MSR value (if valid)
misc: u64, // MCi_MISC MSR value (if valid)
severity: MceSeverity, // 4 bytes (repr(u32))
_pad1: [u8; 12], // Pad to 64 bytes total
}
// Total size: 4 + 4 + 8 + 1 + 7 + 8 + 8 + 8 + 4 + 12 = 64 bytes. One cache line each.
impl MceLogEntry {
const EMPTY: Self = Self {
gen: 0, _pad_gen: [0; 4],
timestamp_tsc: 0, bank: 0, _pad0: [0; 7],
status: 0, address: 0, misc: 0,
severity: MceSeverity::Corrected, _pad1: [0; 12],
};
}
/// Per-CPU MCE log with head counter and ring buffer.
struct MceLog {
head: AtomicU32, // Monotonically increasing write index
entries: [MceLogEntry; 64], // Ring buffer (indexed by head % 64)
}
impl MceLog {
const fn new() -> Self {
Self { head: AtomicU32::new(0), entries: [MceLogEntry::EMPTY; 64] }
}
}
// Allocated at boot, one per CPU, never freed.
static MCE_LOG: PerCpu<MceLog> = PerCpu::new(MceLog::new());
// In MCE handler (NMI context):
fn mce_handler(ctx: &MceContext) {
    let log = MCE_LOG.this_cpu();
    // Per-CPU: exactly one producer (this CPU's NMI handler), no concurrent writers.
    // load(Relaxed) is safe because only this CPU writes head.
    let count = log.head.load(Relaxed);
    let idx = count as usize % 64;
    // Seqcount write (see MceLogEntry doc): bump gen to an odd value, fill the
    // payload fields, then bump gen to the next even value. The compiler fences
    // keep the stores in that order. (Interior mutability elided for clarity;
    // write_payload copies the fault context into every field except gen.)
    let entry = &mut log.entries[idx];
    entry.gen = entry.gen.wrapping_add(1); // odd: write in progress
    compiler_fence(Release);
    entry.write_payload(ctx);
    compiler_fence(Release);
    entry.gen = entry.gen.wrapping_add(1); // next even value: stable
    // ORDERING: Release store on head publishes the entry. Any thread that
    // subsequently reads head with Acquire will observe the entry write.
    log.head.store(count.wrapping_add(1), Release);
    // Handler returns; main kernel drains log later
}
// Drain path (thread context, outside NMI):
fn drain_mce_log(log: &MceLog) {
    // Use swap instead of load+store(0) to atomically capture AND reset head.
    // This prevents losing entries from an MCE that fires between load and store.
    let count = log.head.swap(0, AcqRel);
    // AcqRel: Acquire ensures prior entry writes are visible; Release publishes
    // the reset (head=0) so a concurrent MCE handler sees the new base.
    // At most 64 entries survive in the ring; anything older was overwritten.
    // Iterate oldest-first, from index (count - n) up to (count - 1), so entries
    // are processed in arrival order.
    let n = core::cmp::min(count, 64);
    // head uses wrapping arithmetic; always use wrapping_sub()/wrapping_add()
    // when computing the distance between two head values across u32 wrap-around.
    let start = count.wrapping_sub(n);
    for i in 0..n {
        let idx = start.wrapping_add(i) as usize % 64;
        // Seqcount read: snapshot gen, copy the entry, re-check gen.
        // (Real code uses volatile/atomic reads here so the compiler cannot
        // collapse the two gen loads into one.)
        let gen_before = log.entries[idx].gen;
        let entry = log.entries[idx]; // 64-byte copy
        let gen_after = log.entries[idx].gen;
        if gen_before % 2 == 1 || gen_before != gen_after {
            continue; // torn write: skip; the MCE bank registers still hold it
        }
        // ... process entry ...
    }
}
Race window: A narrow race exists between `head.swap(0)` and the drain loop. An MCE arriving after the swap writes to `entries[0]` while the drain may be reading entries at the same index (via modular arithmetic when the ring was full). Mitigation: each entry carries a seqcount-style generation counter (`gen`). The drain reads `gen` before and after reading the entry fields: if `gen_before` is odd (write in progress) or `gen_after != gen_before` (torn write), the drain skips that entry and logs a warning. The skipped MCE is not lost — the hardware MCE bank registers retain the error until explicitly cleared, so the next drain cycle will re-read it.
The main kernel drains these buffers after returning from the exception, outside NMI context.
4. No locks at all for NMI. NMI handlers specifically MUST NOT use any locks, even try-lock. The NMI can nest inside an MCE handler that already holds the lock, causing deadlock. NMI handlers use only:
- Per-CPU variables (no sharing)
- Lock-free atomic operations (atomic read/write, compare-and-swap)
- Pre-mapped memory (no page faults possible)
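To make that permitted operation set concrete, here is a minimal sketch (names are illustrative, not UmkaOS's actual API) of the kind of update an NMI handler may perform: a lock-free compare-and-swap loop on a pre-mapped static, with no locks and no allocation.

```rust
use core::sync::atomic::{AtomicU64, Ordering};

// NMI-safe maximum tracker: records the worst-case handler latency seen so
// far. Static storage is pre-mapped, so touching it can never page-fault.
static MAX_NMI_TSC: AtomicU64 = AtomicU64::new(0);

fn record_nmi_latency(cycles: u64) {
    let mut cur = MAX_NMI_TSC.load(Ordering::Relaxed);
    while cycles > cur {
        // CAS loop: if another CPU raced in a value first, re-check against
        // the value it published and retry only if ours is still larger.
        match MAX_NMI_TSC.compare_exchange_weak(
            cur, cycles, Ordering::Relaxed, Ordering::Relaxed,
        ) {
            Ok(_) => break,
            Err(seen) => cur = seen,
        }
    }
}
```

The loop is wait-free in the common case (no contention) and never blocks: a failed CAS retries with fresh state instead of spinning on a lock another context may hold.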
5. Pre-allocated resources. All memory, buffers, and stacks used by NMI/MCE handlers MUST be allocated at boot time. Allocation during handler execution is forbidden. On x86-64, MCE handlers run on a dedicated IST (Interrupt Stack Table) stack, pre-allocated and never paged.
2.3.8.3 Deferred Recovery Actions
Any recovery action that might block MUST be deferred to a workqueue or tasklet:
MCE handler (NMI context):
1. Capture fault context to per-CPU buffer (lock-free)
2. Assess severity: recoverable vs. fatal
3. If recoverable:
a. Log to per-CPU buffer
b. Set flag: NEEDS_RECOVERY = true
c. Return from exception
4. If fatal:
a. Log to per-CPU buffer
b. Trigger immediate reboot (no locking)
Workqueue (thread context, after NMI returns):
1. Check NEEDS_RECOVERY flag
2. If set:
a. Drain per-CPU MCE log to kernel log (may block)
b. Initiate memory offlining (may block)
c. Notify userspace via netlink (may block)
d. Clear NEEDS_RECOVERY flag
The workqueue runs in normal thread context where blocking operations are safe. The NMI handler does the minimum work needed to capture state and flag the need for recovery.
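The flag handoff between the two contexts can be sketched as follows — a minimal illustration assuming a single global flag (the names are hypothetical; a real kernel would pair this with the per-CPU log drain described above):

```rust
use core::sync::atomic::{AtomicBool, Ordering};

// Set in NMI context, consumed in thread context. Pre-allocated static:
// no locks, no allocation, no page faults.
static NEEDS_RECOVERY: AtomicBool = AtomicBool::new(false);

// NMI/MCE context: the only recovery work done here is raising the flag.
fn flag_recovery() {
    // Release pairs with the worker's Acquire swap, so the per-CPU log
    // entries written before this store are visible to the worker.
    NEEDS_RECOVERY.store(true, Ordering::Release);
}

// Workqueue thread context: blocking operations are safe here.
fn recovery_worker() {
    // swap clears the flag atomically; a request that arrives mid-drain
    // simply sets it again and is picked up on the next pass.
    if NEEDS_RECOVERY.swap(false, Ordering::Acquire) {
        // Drain per-CPU MCE logs, offline pages, notify userspace —
        // all of which may block, which is fine in this context.
    }
}
```

Using `swap` rather than `load` + `store(false)` closes the window where a recovery request raised between the two operations would be silently cleared.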
2.3.8.4 Architecture-Specific Fault Types
| Architecture | Non-Maskable Fault Types | Vector / Entry Point |
|---|---|---|
| x86-64 | Machine Check Exception (#MC), NMI | IDT vector 18 (MCE), vector 2 (NMI) |
| AArch64 | SError Interrupt, Physical IRQ (FIQ) | VBAR_EL1 offset 0x380 (SError, Current EL with SPx) |
| ARMv7 | Data Abort (imprecise), FIQ | VBAR offset 0x1C (FIQ), 0x10 (Data Abort) |
| RISC-V 64 | NMI (platform-specific) | Platform-defined; often traps to mtvec in M-mode |
| PPC32 | Machine Check, Critical Interrupt | IVOR[1] (MCE), IVOR[0] (Critical) |
| PPC64LE | Machine Check, System Reset | HSRR0/HSRR1 vectors, LPCR-defined |
All handlers for these vectors MUST follow the non-maskable fault handler requirements in Section 2.3.8.2.
2.3.8.5 Recursive Fault Prevention
Hardware fault handlers MUST prevent recursive faults:
1. Guard pages. Handler stacks have guard pages (unmapped) at both ends. Stack overflow causes an immediate fault rather than corrupting adjacent memory.
2. Handler re-entry detection. Each handler checks a per-CPU flag on entry:
fn mce_handler(ctx: &MceContext) {
let nesting = MCE_NESTING.this_cpu().fetch_add(1, Relaxed);
if nesting > 0 {
// Already in MCE handler — recursive fault.
// Cannot log (might fault again), cannot recover.
// Immediate halt to prevent infinite recursion.
arch::halt_loop();
}
// ... normal handler logic ...
//
// Use fetch_sub (not store(0)) to avoid a race window:
// store(0) + iret leaves a gap where a second MCE sees the counter
// at zero while the first handler is still returning. fetch_sub(1)
// atomically decrements; a concurrent MCE that increments to 2 will
// see nesting > 0 and halt, regardless of timing.
MCE_NESTING.this_cpu().fetch_sub(1, Release);
}
3. Pre-pinned code. Handler code and data pages are pinned in memory (never paged out). A page fault during NMI/MCE handling would cause a double fault.
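The guard-page placement from rule 1 reduces to simple address arithmetic. A sketch with illustrative constants (not UmkaOS's actual page count or stack sizes):

```rust
const PAGE_SIZE: usize = 4096;
const STACK_PAGES: usize = 4; // illustrative; real IST stacks may differ

/// Layout: [guard page][usable stack][guard page]. Both guards stay unmapped,
/// so stack overflow or underflow faults immediately instead of silently
/// corrupting the neighboring allocation.
fn ist_stack_layout(region_base: usize) -> (usize, usize) {
    let stack_bottom = region_base + PAGE_SIZE;             // above the low guard
    let stack_top = stack_bottom + STACK_PAGES * PAGE_SIZE; // high guard starts here
    (stack_bottom, stack_top)
}
```

The total reservation is `(STACK_PAGES + 2)` pages per handler stack; the two guard pages cost address space but no physical memory, since they are never mapped.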