Chapter 2: Boot and Hardware Discovery
Boot chain, device discovery, ACPI/DT, multi-architecture support, hardware memory safety
2.1 Boot and Installation
2.1.1 Overview
UmkaOS uses a phased boot architecture. The current implementation boots via the
Multiboot1 protocol through GRUB or QEMU's -kernel flag — sufficient for
development, testing, and early hardware bring-up. The production target is
UEFI stub boot with Linux boot protocol compatibility, enabling drop-in
package installation alongside existing Linux kernels.
The boot code lives in umka-kernel/src/boot/ (assembly entry, Multiboot parser)
and umka-kernel/src/arch/*/boot.rs (per-architecture boot routines). The
initialization sequence is in umka-kernel/src/main.rs.
2.1.2 Current Implementation: Multiboot Boot
2.1.2.1 Boot Protocols
The kernel ELF contains dual Multiboot headers — both Multiboot1 and Multiboot2 are present in the binary, allowing either protocol at the bootloader's choice:
- Multiboot1 (magic 0x1BADB002): Fully implemented. Used by QEMU (-kernel flag) and GRUB (multiboot command). The parser in boot/multiboot1.rs extracts the memory map, command line, and bootloader name.
- Multiboot2 (magic 0xE85250D6): Header present in the ELF but no parser implemented. The magic is recognized in umka_main() but the info structure is not parsed. Planned for Phase 2.
The linker script (linker-x86_64.ld) places headers in dedicated sections:
.multiboot1 (4-byte aligned, first 8 KB) and .multiboot2 (8-byte aligned,
first 32 KB), ensuring bootloaders find them. The kernel loads at physical address
0x100000 (1 MB), the standard Multiboot load address.
Build and boot methods:
# Development: QEMU with -kernel (Multiboot1, no ISO needed)
qemu-system-x86_64 -kernel target/x86_64-unknown-none/release/umka-kernel -serial stdio
# Testing: GRUB ISO boot (Multiboot1 via grub.cfg `multiboot` command)
make iso && qemu-system-x86_64 -cdrom target/umka-kernel.iso -serial stdio
Non-x86 architectures use different boot protocols:
- Device Tree Blob (DTB): Used by AArch64, ARMv7, RISC-V 64, PPC32, and PPC64LE. The firmware or QEMU passes a pointer to a flattened device tree (FDT) in a register at entry (x0 on AArch64, r2 on ARMv7, a1 on RISC-V, r3 on PPC32 and PPC64LE). The DTB describes the machine's physical memory layout, interrupt controllers, timers, and peripheral addresses. The format is big-endian with magic 0xD00DFEED. See Section 2.1.2.9 for the parsing specification.
- OpenSBI (RISC-V only): The Supervisor Binary Interface firmware runs in M-mode and provides SBI ecalls for timer, IPI, console, and system reset services to S-mode code. QEMU's built-in OpenSBI occupies physical addresses 0x80000000–0x801FFFFF. At entry, OpenSBI passes a0 = hart_id (hardware thread identifier) and a1 = DTB address. The kernel must not overwrite the OpenSBI region.
- OpenFirmware / SLOF (PPC64LE): On POWER systems, SLOF (Slimline Open Firmware) or OPAL (OpenPOWER Abstraction Layer) firmware initializes hardware and passes a DTB pointer in r3. QEMU's pseries machine uses SLOF; bare-metal POWER8/9/10 uses OPAL (skiboot). At entry: r3 = DTB address, r4 = 0 (reserved). The kernel runs in hypervisor or supervisor mode.
- U-Boot / OpenFirmware (PPC32): Embedded PowerPC boards typically use U-Boot, which passes a DTB pointer in r3. QEMU's ppce500 machine uses U-Boot or direct kernel boot. At entry: r3 = DTB address, r4 = kernel_start, r5 = 0 (reserved).
2.1.2.2 x86-64 Entry Sequence
The boot assembly (boot/entry.asm, NASM syntax) handles the transition from
32-bit protected mode to 64-bit long mode:
1. GRUB/QEMU loads ELF at 1 MB, jumps to _start in 32-bit protected mode
- eax = Multiboot1 magic (0x2BADB002)
- ebx = pointer to Multiboot info structure
2. _start (32-bit):
a. Save magic: eax → esi (preserved across BSS clear and CPUID check)
b. Set temporary stack at 0x80000 (below kernel)
c. Clear BSS (rep stosd from __bss_start to __bss_end — clobbers edi, ecx, eax)
d. Build identity-map page tables for first 1 GB:
PML4[0] → boot_pdpt | PRESENT | WRITABLE
PDPT[0] → boot_pd | PRESENT | WRITABLE
PD[0..511] → 512 × 2 MB pages (flags: PRESENT | WRITABLE | PAGE_SIZE)
e. Save info ptr: ebx → ebp (preserve across CPUID — ebx is clobbered by CPUID)
f. Verify long mode: CPUID leaf 0x80000001 bit 29
(displays "NO64" on VGA buffer and halts if not available)
g. Restore info ptr: ebp → ebx
h. Enable PAE (CR4 bit 5)
i. Enable Long Mode (IA32_EFER MSR bit 8)
j. Enable Paging (CR0 bit 31)
k. Load temporary 64-bit GDT (null + code + data descriptors)
l. Far jump to _start64 (selector 0x08 = 64-bit code segment)
3. _start64 (64-bit):
a. Load 64-bit data segments (selector 0x10)
b. Set kernel stack (boot_stack_top, 16 KB in .bss)
c. Clear RFLAGS
d. Map to 64-bit calling convention: edi = esi (magic), esi = ebx (info ptr)
e. Call umka_main(multiboot_magic=rdi, multiboot_info_ptr=rsi)
Page tables and boot stack are allocated in .bss (zeroed by step 2c):
boot_pml4 (4 KB), boot_pdpt (4 KB), boot_pd (4 KB), boot stack (16 KB).
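The magic-value dispatch at the top of umka_main() can be sketched in pure Rust. The constants are the architectural bootloader magics from the Multiboot specifications; the enum and function names here are illustrative, not the kernel's actual API:

```rust
/// Values the bootloader leaves in eax at entry (distinct from the header
/// magics embedded in the ELF). Per the Multiboot1 and Multiboot2 specs.
const MULTIBOOT1_BOOTLOADER_MAGIC: u32 = 0x2BAD_B002;
const MULTIBOOT2_BOOTLOADER_MAGIC: u32 = 0x36D7_6289;

#[derive(Debug, PartialEq)]
enum BootProtocol {
    Multiboot1,
    Multiboot2, // recognized but info structure not parsed yet (Phase 2)
    Unknown(u32),
}

/// Illustrative sketch of the protocol detection performed in umka_main().
fn detect_boot_protocol(magic: u32) -> BootProtocol {
    match magic {
        MULTIBOOT1_BOOTLOADER_MAGIC => BootProtocol::Multiboot1,
        MULTIBOOT2_BOOTLOADER_MAGIC => BootProtocol::Multiboot2,
        other => BootProtocol::Unknown(other),
    }
}
```

On the Unknown path the kernel can still boot with a conservative default memory map, mirroring the degraded-operation policy used elsewhere in this chapter.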
2.1.2.3 Kernel Initialization Phases (x86-64)
umka_main() detects the boot protocol from the magic value, then runs an
ordered initialization sequence. Each phase depends on the previous:
Phase 1: GDT + TSS
Load a proper GDT with TSS. Configure IST1 with a dedicated
16 KB stack for double-fault handling.
Phase 2: IDT + PIC
Install exception handlers (0-31) and IRQ handlers (32-47).
Remap the 8259 PIC: IRQ0 → vector 32, IRQ8 → vector 40.
Phase 3: Physical Memory Manager
Parse Multiboot1 memory map (see Section 2.1.2.11).
Initialize bitmap allocator: mark available regions free,
reserve first 1 MB (BIOS/legacy) and kernel image.
Phase 4: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB total).
Initialize free-list allocator → enables alloc::Vec, Box, String.
This initial heap size is a bootstrap minimum; the allocator expands
dynamically once memory discovery completes (Section 4.1).
Phase 5: Virtual Memory
Verify identity mapping (virt_to_phys on mapped addresses).
Test new page mappings: allocate frame, map at 0x40000000,
write/read volatile, unmap, free frame.
Phase 6: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 7: IPC / MPK Detection
Query CPUID for PKU support. Test domain alloc/free.
Phase 8: Enable Interrupts
Enable IRQs, verify timer ticks are incrementing.
Phase 9: Scheduler
Initialize round-robin scheduler. Spawn two test threads
(thread-A, thread-B). Run cooperative yield loop, then
enable preemptive scheduling via timer tick callback.
Phase 10: SYSCALL/SYSRET
Configure STAR/LSTAR/SFMASK MSRs. Register three syscall
handlers: write(1), getpid(39), exit_group(231).
Test with inline SYSCALL instruction from kernel mode.
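The STAR packing in Phase 10 is easy to get wrong, so here is a hedged sketch of the value computation. The selector arguments are assumptions that must match the Phase 1 GDT layout, and star_value is an illustrative name:

```rust
/// IA32_STAR layout: bits 47:32 = kernel CS selector loaded by SYSCALL
/// (SS is loaded as CS + 8); bits 63:48 = base selector used by SYSRET,
/// which loads CS = base + 16 and SS = base + 8 when returning to 64-bit
/// user mode. LSTAR holds the 64-bit handler entry point; SFMASK selects
/// which RFLAGS bits (typically IF) are cleared on entry.
fn star_value(kernel_cs: u16, sysret_base: u16) -> u64 {
    ((sysret_base as u64) << 48) | ((kernel_cs as u64) << 32)
}
```

With a kernel CS of 0x08 and a SYSRET base of 0x10 (assumed values), star_value(0x08, 0x10) yields 0x0010_0008_0000_0000.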
2.1.2.4 Secondary CPU Bringup (x86-64 SMP)
After Phase 10 completes on the BSP (Boot Strap Processor), secondary CPUs (Application Processors, APs) are brought online.
AP Stack Allocation:
The BSP allocates each AP's initial kernel stack from the boot allocator before sending the INIT-SIPI wakeup. This ensures the stack is ready before the AP needs it.
- Stack size: 16 KB per AP (same as the BSP initial stack). Allocated from the per-NUMA-node boot allocator, preferring memory local to the AP's NUMA node.
- SP communication: The BSP stores the stack top address in a per-CPU startup mailbox, defined as:
/// Per-CPU startup data written by BSP before AP wakeup, read by AP during
/// very early boot (before the AP has its own stack pointer set up).
/// Must reside in a physically-mapped region accessible without paging (or
/// with the identity-mapped early page tables already in place).
#[repr(C, align(64))]
pub struct ApStartupMailbox {
/// Initial kernel stack top (SP value to load). Written by BSP before
/// sending the wakeup IPI; read by the AP entry stub in assembly.
pub stack_top: u64,
/// Physical address of the AP's per-CPU data area.
pub percpu_base: u64,
/// APIC ID — AP verifies this matches its own LAPIC_ID before proceeding.
pub cpu_id: u32,
/// BSP sets to MAILBOX_READY (0xAB1E1234) when all fields above are valid.
/// AP spins on this field (with a short architectural pause) until ready.
pub status: AtomicU32,
pub _pad: [u8; 32],
}
pub const MAILBOX_READY: u32 = 0xAB1E_1234;
/// Array of mailboxes, one per possible CPU slot. Allocated from the boot
/// allocator during Phase 11 once the CPU count is known.
pub static AP_STARTUP_MAILBOXES: Once<&'static mut [ApStartupMailbox]> = Once::new();
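The mailbox handshake hinges on publish ordering: the BSP must make the data fields visible before the status flag flips. A minimal sketch, restating the struct from this section with its doc comments elided; publish_mailbox is an illustrative helper, not the kernel's actual API:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

pub const MAILBOX_READY: u32 = 0xAB1E_1234;

#[repr(C, align(64))]
pub struct ApStartupMailbox {
    pub stack_top: u64,
    pub percpu_base: u64,
    pub cpu_id: u32,
    pub status: AtomicU32,
    pub _pad: [u8; 32],
}

/// BSP side: write the data fields first, then publish with a Release
/// store so the AP's Acquire load of `status` guarantees the fields are
/// visible before it dereferences them.
pub fn publish_mailbox(mb: &mut ApStartupMailbox, stack_top: u64, percpu_base: u64, cpu_id: u32) {
    mb.stack_top = stack_top;
    mb.percpu_base = percpu_base;
    mb.cpu_id = cpu_id;
    mb.status.store(MAILBOX_READY, Ordering::Release);
}
```

The matching AP side spins on `status.load(Ordering::Acquire)` until it reads MAILBOX_READY, inserting a pause hint in the loop body.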
- AP entry stub: The AP's 16-bit → 64-bit trampoline (in assembly) reads stack_top from its mailbox slot, using the LAPIC ID as the array index, loads SP, then jumps to Rust ap_entry().
- Stack allocation failure: If the boot allocator returns OOM for an AP's stack, the BSP marks that CPU permanently offline in the topology, does NOT send the wakeup IPI, and logs "CPU {lapic_id}: stack allocation failed, CPU disabled". Boot continues with the remaining CPUs.
Fan-out tree bringup:
A sequential per-CPU timeout of 1 second × N CPUs does not scale: 128 CPUs would require up to 127 seconds in the worst case. UmkaOS uses a binary fan-out tree to bound bringup time to O(log₂ N) phases regardless of CPU count.
The tree assignment is defined by index (not by LAPIC ID):
CPU i (tree index) wakes CPUs 2i+1 and 2i+2 (if they exist in the topology).
Phase 0: BSP (index 0) wakes index 1 and index 2
Phase 1: index 1 wakes 3, 4; index 2 wakes 5, 6
Phase 2: each of 3–6 wakes two children
...
Phase k = ⌈log₂(N)⌉ − 1: leaf CPUs (no children)
For 128 CPUs: 7 phases × ~50 ms per phase ≈ 350 ms worst case, versus up to 127 seconds with sequential 1-second timeouts.
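The tree arithmetic above reduces to two small helpers; the function names are illustrative, and fanout_phases computes the ceil(log2 N) bound used in the 128-CPU example:

```rust
/// Children of tree index `i` in the binary fan-out, clamped to the
/// number of CPUs actually present in the topology.
fn fanout_children(i: usize, num_cpus: usize) -> (Option<usize>, Option<usize>) {
    let child = |c: usize| if c < num_cpus { Some(c) } else { None };
    (child(2 * i + 1), child(2 * i + 2))
}

/// Upper bound on wakeup phases: ceil(log2(num_cpus)), 0 for a single CPU.
/// (n - 1).ilog2() + 1 == ceil(log2(n)) for n > 1.
fn fanout_phases(num_cpus: usize) -> u32 {
    if num_cpus <= 1 { 0 } else { (num_cpus - 1).ilog2() + 1 }
}
```

fanout_phases(128) returns 7, matching the seven-phase worst case quoted above.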
The BSP sets up a shared SmpBringupState structure before waking the first AP:
/// Shared state for coordinating fan-out tree AP bringup.
/// Placed in physically-mapped memory accessible to all CPUs before the VMM
/// is fully operational.
///
/// `online_mask` and `pending_mask` are `CpuMask` instances (Section 8.1,
/// 08-security.md), allocated from the boot-time bump allocator after the CPU
/// count is discovered from ACPI MADT or DTB. They scale to the actual number
/// of CPUs found on the system — no hardcoded limit.
///
/// Allocation: `CpuMask::alloc(num_possible_cpus, boot_alloc)` is called once
/// during Phase 11 (CPU enumeration), before any AP is woken. The mask storage
/// is never reallocated — `num_possible_cpus` is a boot-time-fixed value.
#[repr(C, align(64))]
pub struct SmpBringupState {
/// Bitmask of CPUs (by tree index) that have completed initialization.
/// One bit per possible CPU. Sized at boot to `num_possible_cpus` bits.
/// Each AP atomically sets its own bit via `CpuMask::set_atomic`.
pub online_mask: CpuMask,
/// Bitmask of CPUs currently being brought up (wakeup IPI sent, init
/// not yet complete). Used to detect stalled APs at deadline.
pub pending_mask: CpuMask,
/// Total number of possible CPUs discovered from firmware (MADT/DTB).
pub num_possible_cpus: usize,
/// Count of CPUs that have completed init, used for tree coordination.
/// An AP atomically increments this after setting its bit in `online_mask`.
pub online_count: AtomicUsize,
/// Global deadline (monotonic ns) by which all APs must come online.
/// Set by BSP to `now_ns() + 30_000_000_000` (30 seconds) before Phase 11.
pub deadline_ns: u64,
}
Protocol:
1. BSP initializes SmpBringupState, sets deadline_ns = now_ns() + 30s.
2. BSP prepares the mailbox for AP at tree index 1 (stack, percpu_base, APIC ID),
sets mailbox[1].status = MAILBOX_READY, then sends INIT-SIPI to that AP.
3. Each AP, after completing its own init (Phase 14 below), atomically sets its
bit in online_mask and increments online_count. It then reads its tree
index i and sends wakeup IPIs to children at indices 2i+1 and 2i+2
(if those CPUs exist and their deadline_ns has not passed). The AP then
enters the scheduler idle loop.
4. The BSP (Phase 15) polls online_count and deadline_ns. When
online_count reaches the expected total or deadline_ns is exceeded,
bringup ends. Any CPU whose bit is not set in online_mask by deadline
is marked offline and excluded from the kernel CPU mask.
Phase 11: AP Detection
Query ACPI MADT (Multiple APIC Description Table) or MP Table
for CPU count and LAPIC IDs. Assign sequential tree indices
(0 = BSP, 1..N-1 = APs in MADT order). Allocate PerCpu<T>
slots and AP_STARTUP_MAILBOXES for each detected CPU.
Initialize SmpBringupState; set deadline_ns.
Phase 12: AP Trampoline Setup
a. Allocate a 4 KB page below 1 MB (in low memory, identity-mapped)
for the AP trampoline code. This is required because APs start
in real mode (16-bit) with paging disabled.
b. Copy trampoline code (16-bit → 32-bit → 64-bit transition) to
the low-memory page. The trampoline:
- Starts in 16-bit real mode at physical address 0xNN00
- Enables protected mode (32-bit)
- Loads a temporary GDT (same layout as BSP's)
- Enables long mode (64-bit)
- Loads CR3 with the kernel's page tables
- Reads stack_top from ApStartupMailbox[own_lapic_id_index]
- Loads SP from stack_top
- Jumps to ap_entry() in high memory
c. The trampoline uses the ApStartupMailbox array (Section 2.1.2.4
above) for per-AP stack and percpu_base communication.
Phase 13: First AP Wakeup (BSP → tree root)
BSP allocates stack for AP at tree index 1, fills mailbox[1],
sets mailbox[1].status = MAILBOX_READY.
BSP sends INIT IPI to AP 1's LAPIC (assert level).
BSP waits 10 ms (Intel SDM recommendation).
BSP sends STARTUP IPI (SIPI) with trampoline vector.
BSP waits 200 μs; sends second SIPI (required by older silicon).
The fan-out tree propagates from here — each AP wakes its children
after completing its own init.
Phase 14: AP Initialization (per AP, in ap_entry())
Each AP runs this sequence independently after its mailbox is ready:
a. Load proper GDT and TSS (per-CPU TSS required for IST stacks)
b. Load IDT (same as BSP)
c. Enable interrupts
d. Initialize per-CPU scheduler runqueue
e. Calibrate LAPIC timer (delay calibration loop)
f. Atomically set own bit in SmpBringupState.online_mask;
increment online_count
g. Read own tree index i; wake children at 2i+1, 2i+2:
- Allocate stack for each child (from boot allocator)
- Fill child's ApStartupMailbox; set status = MAILBOX_READY
- Send INIT + SIPI + SIPI to child's LAPIC
h. Enter scheduler idle loop (hlt + monitoring for work)
Phase 15: SMP Online
BSP polls SmpBringupState.online_count and deadline_ns.
Loop exits when online_count == expected_ap_count OR
monotonic_now() >= deadline_ns (global 30-second timeout).
Any AP whose bit is not set in online_mask at exit is marked
permanently offline and removed from the kernel CPU mask.
System is now fully multi-CPU. Scheduler load-balances
across all online CPUs.
Per-CPU data initialization:
Each AP needs its own per-CPU data structures initialized:
- PerCpu<T> slots for scheduler runqueue, current task pointer, etc.
- GDT with per-CPU TSS (TSS must be unique per CPU for IST stacks)
- LAPIC timer calibration (varies per CPU due to manufacturing differences)
- IRQ affinity: By default, all IRQs target BSP; distribute to other CPUs
via IOAPIC redirection table or LAPIC logical destination mode.
ACPI MADT parsing (x86-64):
MADT (Multiple APIC Description Table):
- Located via RSDP → RSDT/XSDT → MADT signature "APIC"
- Provides: Local APIC address, CPU LAPIC IDs, IOAPIC addresses
- CPU entries: LAPIC ID, flags (enabled/disabled)
- Override entries: IRQ source overrides, NMI sources
The BSP's LAPIC ID is read from LAPIC_ID register (MMIO at 0xFEE00020).
All other entries in MADT are APs.
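The MADT entry walk can be sketched as follows, assuming the caller has already located the table and skipped the 44-byte fixed header. The offsets follow the ACPI Processor Local APIC structure; Vec and the function name are illustrative stand-ins for the kernel's boot-allocator-backed collection:

```rust
/// Walk MADT interrupt-controller structures and collect the APIC IDs of
/// enabled CPUs. `entries` starts just past the 44-byte fixed header.
/// Sketch only: the real parser also records IOAPICs and IRQ overrides.
fn collect_enabled_lapics(entries: &[u8]) -> Vec<u8> {
    let mut lapic_ids = Vec::new();
    let mut off = 0;
    while off + 2 <= entries.len() {
        let etype = entries[off];
        let len = entries[off + 1] as usize;
        if len < 2 || off + len > entries.len() {
            break; // malformed entry: stop, caller falls back to safe defaults
        }
        // Type 0 = Processor Local APIC:
        // [type:u8, length:u8 (= 8), acpi_uid:u8, apic_id:u8, flags:u32le]
        if etype == 0 && len >= 8 {
            let apic_id = entries[off + 3];
            let flags = u32::from_le_bytes(entries[off + 4..off + 8].try_into().unwrap());
            if flags & 1 != 0 {
                lapic_ids.push(apic_id); // flags bit 0 = Enabled
            }
        }
        off += len; // length field advances past unknown entry types too
    }
    lapic_ids
}
```

Skipping unknown entry types by their length field is what keeps the walk forward-compatible with newer MADT revisions.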
Failure handling:
If an AP fails to come online before the global deadline:
- BSP logs the failure: "CPU {lapic_id} (tree index {i}): did not signal
online before deadline, marking offline"
- BSP marks the CPU slot offline; its children in the fan-out tree are
also marked offline (they will never receive their wakeup IPI)
- Boot continues with the available CPUs
- Do NOT panic — reduced-CPU operation is valid
Hot-plug support (future): The ACPI namespace may indicate CPU hot-plug capability. The mailbox mechanism is reused for hot-plug: writing to the ACPI CPU hot-plug register triggers the same INIT/SIPI sequence for the newly added CPU, which inserts itself into the online_mask and online_count atomically.
2.1.2.5 ACPI Table Parsing and AML Interpreter Scope
UmkaOS uses ACPI tables for hardware discovery on x86-64 (and ARM SBSA/server platforms). The ACPI subsystem has two distinct components:
- Static table parsing (Phase 1, boot-time): The kernel parses binary ACPI tables (MADT, MCFG, HPET, DMAR/IVRS, SRAT, SLIT, PPTT, FADT) to discover hardware topology. This is a straightforward binary structure walk — no interpreter needed. Static table parsing is required for boot.
- AML interpreter (Phase 2, post-boot): ACPI methods (DSDT/SSDT bytecode) require an AML interpreter to execute _STA, _CRS, _PRS, _PSx, _Sx, _OSC, _DSM, and power/thermal methods. UmkaOS implements a reduced AML interpreter covering:
  - Required for boot: _STA (device status), _CRS (current resources), _PRS (possible resources), _OSC (OS capabilities handshake), _INI (device init).
  - Required for power management: _PS0–_PS3 (power state transitions), _S3/_S4/_S5 (sleep states), _TMP/_PSV/_CRT (thermal).
  - Required for PCI/PCIe: _BBN (base bus number), _SEG (segment group), _PRT (PCI routing table).
  - Deferred: _DSM (device-specific methods) for vendor extensions — implemented per-driver as needed.
AML opcode coverage: The method names above describe which methods to execute, not which AML opcodes the interpreter must support. Real-world DSDT tables (Dell, HP, Lenovo, etc.) use a substantial subset of the AML opcode space within these method bodies. The AML interpreter must support at minimum:
- Control flow: If/Else, While, Return, Break
- Data manipulation: Store, Add, Subtract, And, Or, ShiftLeft/Right, Increment, Decrement, Not, FindSetLeftBit/RightBit
- Object creation: CreateDWordField, CreateWordField, CreateByteField, CreateBitField, CreateQWordField
- Composite types: Buffer, Package, DerefOf, Index, SizeOf, ObjectType
- Method invocation: MethodCall (nested), Arg0–Arg6, Local0–Local7
- Synchronization: Acquire, Release, Mutex
- Namespace: Scope, Device, Name, Alias, Notify
- Field access: OpRegion, Field, IndexField, BankField (SystemMemory, SystemIO, PCI Config, Embedded Controller)
This covers ~80% of AML opcodes by frequency of occurrence (measured against
a corpus of 47 production ACPI tables from x86 servers and laptops). Rare
opcodes — object reference manipulation, external references, some buffer
field operations — are deferred to Phase 2. Systems requiring only common
opcodes will boot correctly. Systems hitting unimplemented opcodes produce a
clear diagnostic: ACPI: unsupported AML opcode 0xXX at <table>+<offset>,
skipping method <name>.
Extended opcodes (LoadTable, Unload, Timer, ToBCD) are deferred to Phase 3.
Error handling for malformed ACPI tables: If a static table fails checksum or has invalid structure, the kernel logs a diagnostic and falls back to safe defaults (e.g., assume 1 CPU, no IOAPIC, use legacy PIC). If the AML interpreter encounters an illegal opcode or infinite loop (method timeout: 5 seconds), it aborts the method, logs the failure, and marks the affected device as non-functional. The kernel never panics on ACPI errors — degraded operation is always preferred over a boot failure.
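The unsupported-opcode path can be sketched as follows. The opcode byte values are taken from the ACPI specification, but the coverage set shown is a small illustrative subset (not the interpreter's full table), and both helper names are hypothetical:

```rust
/// Quick membership test against a (partial, illustrative) set of
/// supported AML opcodes: Store/Add/Subtract, Increment/Decrement,
/// shifts, And/Or, and the If/Else/While/Return/Break control ops.
fn aml_opcode_supported(op: u8) -> bool {
    matches!(op,
        0x70 | 0x72 | 0x74            // Store, Add, Subtract
        | 0x75 | 0x76                 // Increment, Decrement
        | 0x79 | 0x7A | 0x7B | 0x7D   // ShiftLeft, ShiftRight, And, Or
        | 0xA0 | 0xA1 | 0xA2 | 0xA4 | 0xA5) // If, Else, While, Return, Break
}

/// Build the diagnostic described above (hex offset formatting is a
/// choice made here, not mandated by the text).
fn aml_unsupported_diag(op: u8, table: &str, offset: usize, method: &str) -> String {
    format!("ACPI: unsupported AML opcode 0x{op:02X} at {table}+{offset:#X}, skipping method {method}")
}
```

A miss aborts only the current method, consistent with the degraded-operation policy in the paragraph above.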
2.1.2.6 AArch64 Boot Sequence
QEMU's -M virt -cpu cortex-a72 -kernel loads the ELF at 0x40080000 and
enters at _start in EL1 (Exception Level 1) with the MMU off. Register x0
holds the DTB address provided by QEMU's built-in firmware.
Entry assembly (arch/aarch64/entry.S, GNU as syntax):
1. QEMU jumps to _start in EL1, MMU off
- x0 = DTB address (passed by QEMU firmware)
2. _start:
a. Save DTB pointer: mov x19, x0 (x19 is callee-saved)
b. Disable all exceptions: msr daifset, #0xf
(masks Debug, SError, IRQ, FIQ in DAIF register)
c. Enable FPU/NEON: write CPACR_EL1.FPEN bits [21:20] = 0b11
(without this, any NEON/FP instruction traps — Rust generates
NEON instructions by default for aarch64). This clobbers x0,
but the DTB pointer was saved to x19 in step (a).
d. Load stack pointer: adrp x1, _stack_top / add / mov sp, x1
(64 KB stack in .bss._stack, 16-byte aligned)
e. Clear BSS: zero memory from __bss_start to __bss_end
(str xzr loop, 8 bytes per iteration)
f. Prepare arguments: x0 = 0 (no multiboot), x1 = x19 (DTB address)
g. Branch: bl umka_main
h. Halt loop: wfe (wait-for-event) if umka_main returns
Stack (64 KB) is allocated in .bss._stack (16-byte aligned). The linker
script (linker-aarch64.ld) places .text._start first and provides
__bss_start / __bss_end symbols for BSS clearing.
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (VBAR_EL1)
Write vector table base to VBAR_EL1 (16 entries × 128 bytes,
2 KB aligned). Vectors cover: Synchronous, IRQ, FIQ, SError
at each of four exception origins (current EL SP0/SPx, lower
EL AArch64/AArch32).
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly, same
pattern as x86 entry.asm step 2d). Perform any additional
initialization that depends on zeroed static data.
Phase 3: DTB Parse
Parse the DTB (received in x0 at entry, forwarded as the
info pointer to umka_main; see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, interrupt controller base (GIC),
timer IRQ numbers, and UART base address.
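The first step of any DTB parse is validating the big-endian header. A minimal sketch, with field offsets per the devicetree specification and an illustrative function name:

```rust
const FDT_MAGIC: u32 = 0xD00D_FEED;

/// Check the FDT header at the start of `blob` and return `totalsize`
/// if the blob looks usable. All header fields are big-endian u32s:
/// magic at offset 0, totalsize at offset 4.
fn fdt_header_valid(blob: &[u8]) -> Option<u32> {
    let be32 = |off: usize| -> Option<u32> {
        Some(u32::from_be_bytes(blob.get(off..off + 4)?.try_into().ok()?))
    };
    if be32(0)? != FDT_MAGIC {
        return None;
    }
    let totalsize = be32(4)?;
    // The whole blob must be present before walking the memory
    // reservation and structure blocks.
    if (totalsize as usize) <= blob.len() { Some(totalsize) } else { None }
}
```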
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Mark available
regions free, reserve kernel image (__bss_end and below).
No legacy BIOS region to reserve (unlike x86).
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
This initial heap size is a bootstrap minimum; the allocator expands
dynamically once memory discovery completes (Section 4.1).
Phase 6: Virtual Memory (TTBR0_EL1)
Build identity-map page tables using 4 KB granule:
- TCR_EL1: T0SZ=16 (48-bit VA), TG0=0b00 (4 KB granule),
ORGN0/IRGN0 = write-back cacheable, SH0 = inner shareable
- 4-level tables: L0 (PGD) → L1 (PUD) → L2 (PMD) → L3 (PTE)
- Identity map all physical RAM
- Set TTBR0_EL1, isb, enable MMU via SCTLR_EL1.M bit
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: GIC Initialization (v2 or v3, detected at runtime)
Read GIC version and base addresses from DTB
(`compatible` = "arm,gic-400" for GICv2, "arm,gic-v3" for GICv3).
- GICv2 path:
GICD (Distributor): enable, configure IRQ priorities and
targets for all SPIs. Set priority mask.
GICC (CPU Interface): enable, set priority mask to 0xFF
(accept all priorities), set BPR (binary point).
- GICv3 path:
GICD (Distributor): enable, configure affinity routing (ARE=1),
set priorities for all SPIs.
GICR (Redistributor): per-CPU, configure SGI/PPI group and
priority. Enable redistributor.
ICC system registers: ICC_PMR_EL1 = 0xFF (accept all),
ICC_IGRPEN1_EL1 = 1 (enable group 1 interrupts).
Route timer IRQ (PPI 27 = virtual timer) to this CPU.
Phase 9: Generic Timer
Configure the ARM generic timer (virtual counter):
- Write timer period to CNTV_TVAL_EL0
- Enable timer: CNTV_CTL_EL0 = ENABLE (bit 0), clear IMASK
- Timer fires IRQ 27 (virtual timer PPI) → tick handler
Enable interrupts: msr daifclr, #0xf
Phase 10: SVC / Exception-Vector Syscall Setup
Configure the exception vector table to correctly dispatch system
calls arriving from EL0 via the SVC instruction.
Exception vector layout (VBAR_EL1, 16 entries × 128 bytes = 2 KB,
must be 2 KB-aligned):
Offset 0x000: Current EL with SP0 — Synchronous
Offset 0x080: Current EL with SP0 — IRQ
Offset 0x100: Current EL with SP0 — FIQ
Offset 0x180: Current EL with SP0 — SError
Offset 0x200: Current EL with SPx — Synchronous
Offset 0x280: Current EL with SPx — IRQ
Offset 0x300: Current EL with SPx — FIQ
Offset 0x380: Current EL with SPx — SError
Offset 0x400: Lower EL (AArch64) — Synchronous ← SVC lands here
Offset 0x480: Lower EL (AArch64) — IRQ
Offset 0x500: Lower EL (AArch64) — FIQ
Offset 0x580: Lower EL (AArch64) — SError
Offset 0x600: Lower EL (AArch32) — Synchronous
Offset 0x680: Lower EL (AArch32) — IRQ
Offset 0x700: Lower EL (AArch32) — FIQ
Offset 0x780: Lower EL (AArch32) — SError
SVC handler entry (Lower EL AArch64 Synchronous, offset 0x400):
1. Save all general-purpose registers and the ELR_EL1/SPSR_EL1
pair to the per-task kernel stack (or per-CPU trap frame).
2. Read ESR_EL1: check EC field (bits [31:26]) == 0x15 (SVC64
instruction). If EC != 0x15, dispatch to generic fault path.
3. Extract syscall number from X8 (Linux AArch64 ABI convention).
Arguments are in X0-X5. Return value is written to X0.
4. Invoke the syscall dispatch table (same table as all arches).
5. Restore registers and return via ERET (restores PC from
ELR_EL1 and PSTATE from SPSR_EL1).
Control register configuration (verified during this phase):
SCTLR_EL1: M=1 (MMU on), C=1 (data cache on), I=1 (icache on),
SA=1 (SP alignment check at EL1), SA0=1 (SP alignment at EL0).
HCR_EL2.TGE: must be 0 so that EL0 exceptions route to EL1, not
EL2. Verified here if the kernel is running under a hypervisor
that sets up HCR_EL2 before entering the guest kernel.
SPSR_EL1: set up on return so EL0 re-enters AArch64 state (M=0b0000).
Verification test (executed during boot):
Trigger SVC from EL1 to test the synchronous exception vector
(VBAR_EL1 + 0x200, "Current EL with SPx — Synchronous"). The
handler fires, reads ESR_EL1 to verify EC == 0x15 (SVC64), and
returns. This is a vector table self-test — not a user-mode
execution test. User-mode execution is not possible until the
scheduler is initialized in Phase 11.
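The ESR_EL1 decode performed by the SVC handler (step 2 of the handler entry above) reduces to a bit-field extraction. The bit positions are architectural; the helper names are illustrative:

```rust
/// EC value for an SVC instruction executed in AArch64 state.
const EC_SVC64: u64 = 0x15;

/// Extract the Exception Class field, ESR_EL1 bits [31:26].
fn esr_ec(esr: u64) -> u64 {
    (esr >> 26) & 0x3F
}

/// True if this synchronous exception is a 64-bit SVC (syscall entry);
/// anything else goes to the generic fault path.
fn is_svc64(esr: u64) -> bool {
    esr_ec(esr) == EC_SVC64
}
```

For example, an SVC #0 from AArch64 produces ESR_EL1 = 0x5600_0000 (EC = 0x15, IL = 1), which passes the check.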
Phase 11: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Secondary CPU Bringup (AArch64 via PSCI):
After Phase 11 completes on the primary CPU, secondary CPUs are brought online using PSCI (Power State Coordination Interface).
AP Stack Allocation (AArch64):
The primary CPU allocates each secondary's kernel stack from the boot allocator
before issuing the PSCI CPU_ON call. Stack size is 16 KB per AP, allocated
from the per-NUMA-node boot allocator, preferring memory local to the target
CPU's node. The stack top address and percpu base are written to the AP's
ApStartupMailbox slot (see Section 2.1.2.4 for the struct definition; the
same type is used on all architectures). The PSCI context_id parameter is
set to the physical address of the AP's mailbox so that the secondary entry
stub can locate its stack before the MMU is active.
If stack allocation fails (boot allocator OOM), the primary logs "CPU {mpidr}:
stack allocation failed, CPU disabled", does not issue CPU_ON, and marks
the CPU permanently offline. Boot continues with the remaining CPUs.
Phase 12: Secondary CPU Detection
Parse DTB /cpus node for all CPU entries:
- Each cpu@N node contains: reg = MPIDR affinity bits
- device_type = "cpu"
- enable-method = "psci" (indicates PSCI is used)
Assign sequential tree indices (0 = primary, 1..N-1 = secondaries
in DTB order). Allocate PerCpu<T> slots and AP_STARTUP_MAILBOXES.
Initialize SmpBringupState; set deadline_ns = now_ns() + 30s.
Phase 13: PSCI Method Detection
Check /psci node in DTB for PSCI method:
- method = "smc": Use SMC (Secure Monitor Call) for PSCI
- method = "hvc": Use HVC (Hypervisor Call) for PSCI
Verify PSCI version via PSCI_VERSION (function ID 0x84000000):
- Major version in bits 31:16, minor in 15:0
- Require PSCI 1.0+ for full feature support
Phase 14: Secondary CPU Startup (fan-out tree, PSCI CPU_ON)
Primary allocates stack for tree-index 1 (fills mailbox, sets
mailbox[1].status = MAILBOX_READY), then calls PSCI CPU_ON:
x0 = 0xC4000003 (CPU_ON function ID, AArch64 PSCI 0.2+)
x1 = target_mpidr (MPIDR affinity value from DTB for index 1)
x2 = secondary_entry_phys (physical address of entry stub)
x3 = mailbox_phys (physical address of ApStartupMailbox[1])
Issue via SMC or HVC depending on Phase 13 detection.
Return values:
0 (PSCI_SUCCESS): CPU starting
-2 (PSCI_INVALID_PARAMS): bad MPIDR or entry address
-4 (PSCI_ALREADY_ON): CPU was already running (treat as success)
other negative: firmware error; mark CPU offline
Each secondary, after completing Phase 15 init, atomically sets its
bit in SmpBringupState.online_mask (via CpuMask::set_atomic),
increments online_count, then reads its own tree index i and issues
CPU_ON for children at
indices 2i+1 and 2i+2 (allocating stacks and filling mailboxes
first), before entering the scheduler idle loop.
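The CPU_ON return-value handling from the table above can be captured as a small mapping. The error codes are architectural PSCI values; the enum is an illustrative kernel-side type:

```rust
#[derive(Debug, PartialEq)]
enum CpuOnOutcome {
    Starting,           // wait for the AP to signal via online_mask
    AlreadyOn,          // PSCI_ALREADY_ON: treat as success
    MarkOffline(i64),   // invalid params or firmware error
}

/// Interpret the value returned in x0 by PSCI CPU_ON (0xC4000003).
fn interpret_cpu_on(ret: i64) -> CpuOnOutcome {
    match ret {
        0 => CpuOnOutcome::Starting,        // PSCI_SUCCESS
        -4 => CpuOnOutcome::AlreadyOn,      // PSCI_ALREADY_ON
        e => CpuOnOutcome::MarkOffline(e),  // -2 = INVALID_PARAMS, etc.
    }
}
```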
Phase 15: Secondary CPU Entry (secondary_entry stub, per AP)
Each secondary CPU enters here in EL1 with MMU off.
x0 = context_id = physical address of own ApStartupMailbox.
a. Spin on mailbox.status until == MAILBOX_READY (pause loop)
b. Verify mailbox.cpu_id matches own MPIDR[31:0]
c. Enable FPU/NEON: write CPACR_EL1.FPEN = 0b11
d. Load kernel page tables: write TTBR0_EL1 with primary's root
table PPN; isb; enable MMU via SCTLR_EL1.M = 1; isb
e. Load stack pointer: ldr x1, [x0, #offsetof(stack_top)]; mov sp, x1
f. Load percpu_base: ldr x18, [x0, #offsetof(percpu_base)]
(x18 = CpuLocal register on AArch64 per Section 3.1.2 (03-concurrency.md))
g. Branch to Rust: bl secondary_init
In secondary_init():
1. Load VBAR_EL1 (exception vectors, same table as primary)
2. Initialize GIC CPU interface only (GICC or ICC system regs);
the primary already configured GICD for all CPUs during Phase 8
GICv2: GICC_PMR = 0xFF (unmask all); GICC_CTLR = 0x1 (enable)
GICv3: ICC_PMR_EL1 = 0xFF; ICC_IGRPEN1_EL1 = 1
3. Calibrate generic timer (read CNTFRQ_EL0; program CNTV_TVAL_EL0)
4. Enable interrupts: msr daifclr, #0xf
5. Initialize per-CPU scheduler runqueue
6. Atomically set own bit in SmpBringupState.online_mask;
increment online_count
7. Issue CPU_ON for own tree children (if any) as described above
8. Enter scheduler idle loop (wfe)
Phase 16: SMP Online
Primary polls SmpBringupState.online_count and deadline_ns.
Loop exits when online_count == expected_secondary_count OR
monotonic_now() >= deadline_ns (global 30-second timeout).
Any secondary whose bit is not set in online_mask at exit is
marked permanently offline and removed from the kernel CPU mask.
System is fully multi-CPU. GIC affinity routing distributes
interrupts across all online CPUs.
MPIDR affinity (AArch64): Each CPU has a unique MPIDR_EL1 value:
- Bits [7:0]: Affinity level 0 (core within cluster)
- Bits [15:8]: Affinity level 1 (cluster within socket)
- Bits [23:16]: Affinity level 2 (socket)
- Bits [39:32]: Affinity level 3 (extended, rare; multi-chip systems)
The DTB /cpus/cpu@N/reg property contains these affinity bits. PSCI_CPU_ON uses the full MPIDR value to identify the target CPU.
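Unpacking the affinity fields is a fixed bit-field extraction; a sketch with an illustrative helper type (the bit positions are architectural):

```rust
#[derive(Debug, PartialEq)]
struct MpidrAffinity { aff0: u8, aff1: u8, aff2: u8, aff3: u8 }

/// Split MPIDR_EL1 into its four affinity levels. Note Aff3 sits at
/// bits [39:32], not contiguous with the lower three fields.
fn mpidr_affinity(mpidr: u64) -> MpidrAffinity {
    MpidrAffinity {
        aff0: (mpidr & 0xFF) as u8,
        aff1: ((mpidr >> 8) & 0xFF) as u8,
        aff2: ((mpidr >> 16) & 0xFF) as u8,
        aff3: ((mpidr >> 32) & 0xFF) as u8,
    }
}
```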
CPU Hotplug — RISC-V via SBI HSM (Hart State Management):
Secondary harts on RISC-V are brought online through the SBI HSM extension
(Extension ID: 0x48534D = ASCII "HSM"), which provides a portable interface
independent of the underlying platform firmware:
- `sbi_hart_start(hartid, start_addr, opaque)` (FID 0): Bring an offline hart online. The hart begins execution at `start_addr` in S-mode with `a0 = hartid` and `a1 = opaque`. UmkaOS passes its SMP trampoline physical address as `start_addr` and a pointer to the per-hart data block as `opaque`.
- Trampoline requirements: `start_addr` must be a physical address. On implementations that limit the address to 32 bits, the trampoline must reside below the 4 GB boundary. The hart starts with the MMU disabled (`satp = 0`) and all CSRs at their reset values.
- UmkaOS RISC-V SMP trampoline (`arch/riscv64/trampoline.S`):
  - Load the per-hart data pointer from `a1` (opaque value set by the primary hart).
  - Configure `satp` with the kernel's root page-table PPN and MODE=Sv48. Execute `sfence.vma` to flush any stale TLB state.
  - Write UmkaOS's trap handler address to `stvec` (Direct mode, bits [1:0] = 0).
  - Write the per-hart kernel stack top address to `sscratch` (used by the trap entry stub to locate the kernel stack from U-mode).
  - Set `sstatus.SIE = 1` to enable supervisor interrupts.
  - Call `smp_secondary_init(hartid)` (C calling convention: `a0 = hartid`).
- `sbi_hart_stop()` (FID 1): Park the calling hart. The hart transitions to STOPPED state and may be restarted by the primary hart via `sbi_hart_start`. UmkaOS calls this during CPU offline (logical hot-remove).
- `sbi_hart_get_status(hartid)` (FID 2): Query the current state of a hart. Return values: 0 = STARTED, 1 = STOPPED, 2 = START_PENDING, 3 = STOP_PENDING, 4 = SUSPENDED, 5 = SUSPEND_PENDING, 6 = RESUME_PENDING. UmkaOS polls this after calling `sbi_hart_start` to confirm the hart is online within the timeout window (1 second).
- Hart discovery: Enumerate `/cpus` nodes from the Device Tree, recording each hart's `reg` property (hart ID). Cross-reference with SBI HSM status to filter out harts that are permanently disabled (STOPPED but not startable on this platform configuration).
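The HSM status codes listed above can be decoded with a small helper; a sketch (the enum and function names are illustrative, not from the UmkaOS source):

```rust
/// Hart states returned by sbi_hart_get_status (FID 2), per the SBI HSM
/// extension. Illustrative sketch — names are not from the UmkaOS source.
#[derive(Debug, PartialEq)]
pub enum HartState {
    Started,        // 0
    Stopped,        // 1
    StartPending,   // 2
    StopPending,    // 3
    Suspended,      // 4
    SuspendPending, // 5
    ResumePending,  // 6
}

/// SBI HSM extension ID: ASCII "HSM" packed into the low three bytes.
pub const SBI_EXT_HSM: u32 = 0x48534D;

/// Decode the raw value returned in a0. Negative values are SBI errors.
pub fn decode_hart_state(raw: isize) -> Option<HartState> {
    match raw {
        0 => Some(HartState::Started),
        1 => Some(HartState::Stopped),
        2 => Some(HartState::StartPending),
        3 => Some(HartState::StopPending),
        4 => Some(HartState::Suspended),
        5 => Some(HartState::SuspendPending),
        6 => Some(HartState::ResumePending),
        _ => None,
    }
}
```

The post-`sbi_hart_start` poll loop then simply retries until `decode_hart_state` yields `Started` or the 1-second timeout expires.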
CPU Hotplug — PPC32 / PPC64LE:
PowerPC platforms use firmware-specific mechanisms that vary by environment:
- Bare-metal POWER (OpenPOWER / OPAL): Secondary processors are held at a spin-table address specified by the Device Tree property `cpu-release-addr` (per `/cpus/cpu@N` node with `enable-method = "spin-table"`). The BSP writes the secondary entry-point physical address to `cpu-release-addr`, then executes `dcbf` (data cache block flush) and `sync` + `isync` memory barriers to ensure the secondary observes the write. The secondary breaks out of its spin loop, loads the entry address, and jumps to the kernel SMP trampoline.
- POWER LPARs under PowerVM: Use RTAS (Run-Time Abstraction Services): `rtas_call(RTAS_TOKEN_START_CPU, 3, 1, NULL, hwcpu_id, start_addr, r3_val)`. The RTAS call is issued via the `rtas` firmware interface discovered from the Device Tree `/rtas` node. UmkaOS records the RTAS token for `start-cpu` at boot during DTB parsing.
- KVM / QEMU pseries: Depending on the machine configuration, either RTAS or a Device Tree spin table is used. The DTB `enable-method` property on each CPU node identifies which mechanism applies.
- Secondary entry (PPC64LE): The secondary processor begins execution in kernel virtual mode on POWER9+ systems with the Radix MMU, or in real mode on POWER8 with HPT. The SMP trampoline configures the stack pointer (`r1`) and thread pointer (`r13`, points to per-CPU data), and calls `smp_secondary_init(cpu_id)`.
- Secondary entry (PPC32): The secondary begins in supervisor mode. The trampoline sets `r1` (stack pointer), enables the MMU via MSR[IR] and MSR[DR], and calls `smp_secondary_init(cpu_id)`.
2.1.2.7 ARMv7 Boot Sequence
QEMU's -M vexpress-a15 -kernel loads the ELF at 0x60010000 and enters
at _start in SVC (Supervisor) mode with the MMU off. Registers: r0 = 0,
r1 = machine type, r2 = DTB address.
Entry assembly (arch/armv7/entry.S, GNU as syntax):
1. QEMU jumps to _start in SVC mode, MMU off
- r0 = 0 (unused), r1 = machine type, r2 = DTB address
2. _start:
a. Disable IRQ and FIQ: cpsid if
(sets I and F bits in CPSR)
b. Set up IRQ mode stack: switch to IRQ mode (cps #0x12),
load 4 KB IRQ stack, switch back to SVC mode (cps #0x13)
c. Load SVC stack pointer: ldr sp, =_stack_top
(64 KB stack in .bss._stack, 16-byte aligned via .align 4)
d. Clear BSS: zero memory from __bss_start to __bss_end
(str r6 loop, 4 bytes per iteration)
e. Prepare 64-bit arguments (AAPCS: u64 passed as register pairs):
- r0:r1 = 0:0 (multiboot_magic, both halves)
- r2:r3 = dtb_addr:0 (multiboot_info, low:high)
f. Branch: bl umka_main
g. Halt loop: wfe if umka_main returns
Stack (64 KB) is in .bss._stack (16-byte aligned via .align 4, which on
ARM GAS means 2^4 = 16 bytes). The linker script (linker-armv7.ld) places
.text._start first at 0x60010000 (offset from the vexpress-a15 base
0x60000000 to leave room for the bootloader stub).
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (VBAR)
Write vector table base to VBAR via CP15 c12 register:
mcr p15, 0, <reg>, c12, c0, 0
Vector table: 8 entries (Reset, Undef, SVC, Prefetch Abort,
Data Abort, reserved, IRQ, FIQ) × 4-byte branch instructions.
Each vector branches to a full handler stub.
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly, same
pattern as x86 entry.asm step 2d). Perform any additional
initialization that depends on zeroed static data.
Phase 3: DTB Parse
Parse the DTB passed in r2 (see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, GIC base addresses, timer IRQ
numbers, and UART base. vexpress-a15 has well-known addresses
but DTB parsing keeps the code machine-independent.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). The vexpress-a15
machine provides up to 1 GB RAM starting at 0x60000000 (or
0x80000000 depending on configuration). Reserve kernel image.
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
This initial heap size is a bootstrap minimum; the allocator expands
dynamically once memory discovery completes (Section 4.1).
Phase 6: Virtual Memory (TTBR0, Short Descriptor)
Build identity-map using ARMv7 short-descriptor format:
- TTBR0: points to L1 table (4096 × 32-bit entries, 16 KB)
- L1 entries: section descriptors (1 MB pages) for identity map
Flags: AP=0b11 (full access), TEX/C/B for normal cacheable
- DACR: domain 0 = Client (0b01), all others = No Access
- Enable MMU: set SCTLR.M bit via mcr p15, 0, <reg>, c1, c0, 0
- 1 MB sections are sufficient for initial identity map;
L2 tables (256 × 4 KB pages) added later for fine-grained mapping
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: GIC Initialization
ARMv7 platforms typically use GICv2 (GICv3 supports ARMv7/AArch32
but is rare on ARMv7 SoCs; limited to 3 affinity levels in AArch32).
Read GICD/GICC bases from DTB (vexpress-a15 defaults:
GICD = 0x2C001000, GICC = 0x2C002000).
Configure distributor, CPU interface, route timer IRQ.
Phase 9: Timer
Configure SP804 dual timer or ARM generic timer (if available):
- SP804 (vexpress): program LOAD register, enable with
periodic mode + interrupt enable, IRQ via GIC SPI
- Generic timer (Cortex-A15): CNTVCT, CNTV_TVAL, CNTV_CTL
(same registers as AArch64, accessed via CP15 c14)
Enable interrupts: cpsie if
Phase 10: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Secondary CPU Bringup (ARMv7 via PSCI):
After Phase 10 completes on the primary CPU, secondary CPUs are brought online using PSCI (Power State Coordination Interface).
PSCI calling convention (ARMv7):
The kernel detects the PSCI version and calling mechanism at runtime from the
DTB /psci node compatible property:
- `"arm,psci-0.2"` or later: use PSCI 0.2 function IDs (preferred)
- `"arm,psci"`: use PSCI 0.1 function IDs (legacy fallback; function IDs are platform-specific and read from the DTB `cpu_on` property under `/psci`)
PSCI 0.2 function IDs for ARMv7 (32-bit callee convention):
CPU_ON = 0x84000003 (PSCI 0.2, 32-bit)
r0 = 0x84000003 (function ID)
r1 = target_cpu (MPIDR[31:0] of target AP)
r2 = entry_point (physical address of AP entry stub, must be 32-bit)
r3 = context_id (physical address of ApStartupMailbox for this AP)
Return values (in r0):
0 PSCI_SUCCESS: AP starting
-2 PSCI_INVALID_PARAMS: bad MPIDR or entry address
-4 PSCI_ALREADY_ON: AP was already running (treat as success)
other negative: firmware error; mark AP offline
Calling convention: use smc #0 if the DTB /psci node method = "smc";
use hvc #0 if method = "hvc". The method property is mandatory in valid
PSCI device trees; if it is absent (a malformed tree), default to smc.
AP Stack Allocation (ARMv7):
Stack allocation follows the same protocol as all architectures (see
Section 2.1.2.4): the primary allocates 16 KB per AP from the boot allocator
before issuing CPU_ON, fills the ApStartupMailbox, passes its physical
address as context_id, and marks the AP offline on allocation failure.
GIC initialization for ARMv7 APs:
The primary CPU configures the GIC Distributor (GICD) during Phase 8 for all CPUs. Each AP, on startup, initializes only its own GIC CPU Interface (GICC):
GICC_PMR = 0xFF // unmask all interrupt priorities
GICC_CTLR = 0x1 // enable CPU interface
APs do not touch the GICD — the primary owns the distributor. IRQs are
unmasked by clearing the CPSR.I and CPSR.F bits (cpsie if) after the
scheduler is initialized and the AP is ready to run tasks.
ARMv7 AP entry sequence:
The entry stub physical address passed to CPU_ON as r2 is the ARMv7 SMP
trampoline. The trampoline receives context_id (physical address of
ApStartupMailbox) in r3 from the PSCI firmware and follows this sequence:
1. AP wakes at physical entry point (address passed in CPU_ON r2).
r3 = physical address of own ApStartupMailbox (from PSCI context_id).
2. Disable IRQs and FIQs: cpsid if
(sets CPSR.I and CPSR.F; prevents spurious interrupts before stack is set)
3. Confirm SVC mode: mrs r0, cpsr; and r0, r0, #0x1F; cmp r0, #0x13
If not in SVC mode (0x13), switch: cps #0x13
4. Enable VFP/NEON if needed:
mrc p15, 0, r1, c1, c0, 2 // read CPACR
orr r1, r1, #(0xF << 20) // enable CP10 + CP11 full access
mcr p15, 0, r1, c1, c0, 2 // write CPACR
vmrs r1, fpexc // enable VFP: FPEXC.EN = 1
orr r1, r1, #(1 << 30)
vmsr fpexc, r1
5. Enable MMU with kernel page tables:
- Load TTBR0 with primary's L1 table physical address
mcr p15, 0, <ttbr0>, c2, c0, 0
- Set DACR domain 0 = Client (0b01):
ldr r1, =0x00000001
mcr p15, 0, r1, c3, c0, 0
- Enable MMU and caches (set SCTLR.M, .C, .I via CP15 c1 c0 0)
- isb
6. Spin on mailbox.status until == MAILBOX_READY (0xAB1E1234):
ldr r0, [r3, #offsetof(ApStartupMailbox, status)]
cmp r0, #0xAB1E1234
bne spin (with yield: yield instruction or nop)
7. Verify mailbox.cpu_id matches own MPIDR[23:0]:
mrc p15, 0, r1, c0, c0, 5 // read MPIDR
and r1, r1, #0x00FFFFFF // lower 24 affinity bits
ldr r0, [r3, #offsetof(ApStartupMailbox, cpu_id)]
cmp r0, r1
bne fault_halt // mismatch: configuration error
8. Load SP from stack_top:
ldr sp, [r3, #offsetof(ApStartupMailbox, stack_top)]
9. Load percpu_base (TPIDRPRW, the ARMv7 CpuLocal register per Section 3.1.2 (03-concurrency.md)):
ldr r4, [r3, #offsetof(ApStartupMailbox, percpu_base)]
mcr p15, 0, r4, c13, c0, 4 // write TPIDRPRW
10. Jump to Rust entry point:
bl ap_secondary_init // does not return
The ap_secondary_init() function (Rust) runs the following in order:
1. Load VBAR (exception vectors, same table as primary): mcr p15, 0, vbar, c12, c0, 0
2. Initialize GICC (CPU Interface): write GICC_PMR = 0xFF and GICC_CTLR = 0x1
3. Initialize per-CPU scheduler runqueue
4. Configure and enable the timer (generic timer or SP804 as appropriate)
5. Enable interrupts: cpsie if
6. Atomically set own bit in SmpBringupState.online_mask (via CpuMask::set_atomic);
increment online_count
7. Issue CPU_ON for own tree children (indices 2i+1, 2i+2) if they exist
and the global deadline_ns has not expired (allocate stacks, fill mailboxes,
call PSCI, same protocol as the primary for tree index 1)
8. Enter scheduler idle loop (wfe)
SMP bringup phases (ARMv7):
Phase 11: Secondary CPU Detection
Parse DTB /cpus node for CPU entries with enable-method = "psci".
Assign sequential tree indices (0 = primary). Allocate PerCpu<T>
slots and AP_STARTUP_MAILBOXES. Initialize SmpBringupState;
set deadline_ns = now_ns() + 30s.
Phase 12: PSCI Method and Version Detection
Read /psci node: detect method (smc/hvc) and compatible string
(psci-0.2 vs psci-0.1). For psci-0.1, read cpu_on property.
Phase 13: First AP Wakeup (primary → tree index 1)
Allocate stack for index 1; fill mailbox[1]; set MAILBOX_READY.
Call PSCI CPU_ON for index 1 as described above.
Fan-out tree propagates: each AP wakes its children after init.
Phase 14: SMP Online
Primary polls SmpBringupState.online_count and deadline_ns.
Loop exits when online_count == expected_secondary_count OR
monotonic_now() >= deadline_ns (global 30-second timeout).
Any AP whose bit is not set in online_mask at exit is marked
permanently offline. Boot continues with available CPUs.
System is fully multi-CPU once all online APs are in their
scheduler idle loops.
2.1.2.8 RISC-V 64 Boot Sequence
QEMU's -M virt -bios default -kernel runs OpenSBI in M-mode, which then
jumps to the kernel at 0x80200000 in S-mode (Supervisor mode). Registers:
a0 = hart_id, a1 = DTB address (on QEMU and systems following the Linux
boot convention — see note below).
Note on a1 and DTB discovery: The RISC-V SBI specification does NOT mandate that
`a1` contains the DTB physical address. This is a firmware convention established by QEMU and U-Boot, and is widely followed in practice, but real bare-metal boards may use different mechanisms. The boot code therefore validates `a1` before trusting it:
1. Check if `a1` is a valid DTB pointer: read the 4-byte magic at that address and verify it equals `0xD00DFEED` (big-endian FDT magic).
2. If `a1` is not a valid DTB: scan for a UEFI System Table (look for the `IBI SYST` signature in the EFI System Table header).
3. If UEFI is not found: use the SBI vendor extension to request the DTB address, or fall back to a compiled-in DTB for the target board.
The reference implementation uses option 1 with UEFI fallback for production hardware targets.
Entry assembly (arch/riscv64/entry.S, GNU as syntax):
1. OpenSBI jumps to _start in S-mode
- a0 = hart_id (hardware thread ID, usually 0 on single-core)
- a1 = DTB address (QEMU/U-Boot convention; validated at runtime — see note above)
2. _start:
a. Disable interrupts: csrci sstatus, 0x2
(clears SIE bit in supervisor status register)
b. Load stack pointer: la sp, _stack_top
(64 KB stack in .bss._stack, 16-byte aligned)
c. Clear BSS: zero memory from __bss_start to __bss_end
(sd zero loop, 8 bytes per iteration)
d. Arguments already in correct registers:
a0 = hart_id (passed as multiboot_magic parameter)
a1 = DTB address (passed as multiboot_info parameter)
e. Call: call umka_main (jal with ra)
f. Halt loop: wfi (wait-for-interrupt) if umka_main returns
Stack (64 KB) is in .bss._stack (16-byte aligned). The linker script
(linker-riscv64.ld) places .text._start first at 0x80200000, after the
OpenSBI firmware region (0x80000000–0x801FFFFF).
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (stvec)
Write trap handler address to stvec CSR in Direct mode
(stvec[1:0] = 0b00). All traps — exceptions, software
interrupts, external interrupts — vector to a single entry
point that reads scause to dispatch.
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly, same
pattern as x86 entry.asm step 2d). Perform any additional
initialization that depends on zeroed static data.
Phase 3: DTB Parse
Parse the DTB passed in a1 (see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, PLIC base address, CLINT address
(if present), and UART base. QEMU virt machine uses standard
addresses but DTB parsing keeps the code machine-independent.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Mark available
regions free. Reserve:
- OpenSBI firmware: 0x80000000–0x801FFFFF (2 MB)
- Kernel image: 0x80200000 to __kernel_end
Unlike x86, no legacy BIOS region to reserve.
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
This initial heap size is a bootstrap minimum; the allocator expands
dynamically once memory discovery completes (Section 4.1).
Phase 6: Virtual Memory (satp, Sv48)
Build identity-map using Sv48 (4-level, 48-bit VA):
- 4 levels: L3 (root) → L2 → L1 → L0, each 512 × 8-byte PTEs
- PTE format: [53:10] PPN, [7:0] flags (V, R, W, X, U, G, A, D)
- Identity map all physical RAM with RWX + Valid + Global
- Write root table PPN to satp: MODE=Sv48 (9), ASID=0, PPN
- Execute sfence.vma to flush TLB after satp write
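The final satp write above packs MODE, ASID, and the root-table PPN into a single 64-bit CSR value. A minimal sketch of that encoding (the helper name is illustrative, not from the UmkaOS source):

```rust
/// satp field layout (RV64): MODE [63:60], ASID [59:44], PPN [43:0].
/// MODE = 9 selects Sv48, as used in Phase 6.
const SATP_MODE_SV48: u64 = 9;

/// Build a satp value for an Sv48 root table at physical address `root_pa`
/// (must be 4 KB-aligned). Helper name is illustrative, not from the
/// UmkaOS source.
pub fn satp_sv48(root_pa: u64, asid: u16) -> u64 {
    assert_eq!(root_pa & 0xFFF, 0, "root table must be page-aligned");
    let ppn = root_pa >> 12; // physical page number of the root table
    (SATP_MODE_SV48 << 60) | ((asid as u64) << 44) | ppn
}
```

After writing this value to satp, the kernel must execute `sfence.vma` as noted above, since the satp write alone does not invalidate stale translations.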
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: PLIC Initialization
Read PLIC base address from DTB (QEMU virt default: 0x0C000000).
- Set priority threshold to 0 (accept all priorities)
- Enable relevant interrupt sources (UART, etc.)
- Set priority for each source
PLIC handles external interrupts only; timer and software
interrupts go through separate CSRs (sie.STIE, sie.SSIE).
Phase 9: SBI Timer
Use SBI ecall to program the timer:
- Read current time: csrr <rd>, time (or the rdtime pseudo-instruction)
- Set next deadline: sbi_set_timer(time + interval)
(SBI EID=0x54494D45 "TIME", FID=0)
- Enable timer interrupt: set sie.STIE (bit 5)
Timer fires supervisor timer interrupt (scause = 5) →
clear by calling sbi_set_timer with next deadline.
Enable interrupts: csrsi sstatus, 0x2
Phase 10: ecall / Trap-Vector Syscall Setup
Configure the trap vector and trap entry code to correctly dispatch
system calls arriving from U-mode via the ecall instruction.
stvec CSR configuration:
bits[1:0] = 0b00 (Direct mode): all traps — synchronous
exceptions, software interrupts, external interrupts — are
delivered to the single base address written to stvec. UmkaOS
uses Direct mode rather than Vectored mode (0b01) so that the
handler can perform a unified register-save before reading scause.
Trap entry sequence (all trap types, unified handler):
1. csrrw sp, sscratch, sp — swap user and kernel stack pointers.
sscratch holds the kernel stack top for this hart (set up in
Phase 1 entry assembly and refreshed on each U→S transition).
2. Save all general-purpose registers (x1-x31, or the full
RISC-V integer register file) to the per-hart trap frame at
the top of the kernel stack.
3. Read scause to determine the trap source. The top bit of scause
   distinguishes interrupts (bit set) from synchronous exceptions
   (bit clear); the codes below are the remaining low bits:
   - Exception code 8 (ecall from U-mode): syscall path.
   - Interrupt code 9 (supervisor external interrupt): PLIC claim/complete path.
   - Interrupt code 5 (supervisor timer interrupt): timer tick path.
   - Interrupt code 1 (supervisor software interrupt): IPI path.
   - Other synchronous exceptions: fault/signal path.
ecall handler (scause == 8):
Syscall number: a7 (per Linux RISC-V ABI, also known as the
SBI-compatible register assignment).
Arguments: a0–a5 (up to six arguments).
Return convention: a0 carries the return value (negative values
encode -errno on error); a1 carries a second return word for
certain multi-value returns (e.g., pipe(2) returns two file
descriptors in a0 and a1).
After handling, sepc is advanced by 4 (skip past the ecall
instruction, which is always 4 bytes) before SRET.
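The scause dispatch can be sketched as a pure function. The top bit of scause distinguishes interrupts from synchronous exceptions, which is why interrupt code 9 (external) and exception code 9 (ecall from S-mode) do not collide. Enum and function names are illustrative, not from the UmkaOS source:

```rust
/// Trap classification, per the unified-handler dispatch described above.
/// Illustrative sketch — names are not from the UmkaOS source.
#[derive(Debug, PartialEq)]
pub enum TrapKind {
    Syscall,       // ecall from U-mode (exception code 8)
    External,      // supervisor external interrupt: PLIC claim/complete
    TimerTick,     // supervisor timer interrupt
    Ipi,           // supervisor software interrupt
    Fault(u64),    // other synchronous exception
    OtherIrq(u64), // unexpected interrupt source
}

/// Classify an RV64 scause value. Bit 63 set = interrupt; low bits = code.
pub fn classify_scause(scause: u64) -> TrapKind {
    const INT: u64 = 1 << 63;
    if scause & INT != 0 {
        match scause & !INT {
            9 => TrapKind::External,
            5 => TrapKind::TimerTick,
            1 => TrapKind::Ipi,
            c => TrapKind::OtherIrq(c),
        }
    } else {
        match scause {
            8 => TrapKind::Syscall,
            c => TrapKind::Fault(c),
        }
    }
}
```

On the `Syscall` path the handler then reads a7 for the syscall number, a0-a5 for arguments, and advances sepc by 4 before SRET, as described above.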
Interrupt enable state:
sstatus.SIE (bit 1): supervisor interrupt enable, set to 1 after
trap entry saves state so nested interrupts are possible in
long-running handlers. Cleared on trap entry by hardware.
sstatus.SPIE (bit 5): previous SIE — saved and restored across
SRET to allow transparent interrupt-enable state on return.
sie.SEIE (bit 9): supervisor external interrupt enable (PLIC).
sie.SSIE (bit 1): supervisor software interrupt enable (IPI).
sie.STIE (bit 5): supervisor timer interrupt enable (already set
in Phase 9).
Verification test (executed during boot):
Issue ecall from S-mode (supervisor mode) to test the ecall
vector entry in stvec. The trap handler fires, reads scause to
verify cause == 9 (ecall from S-mode), and returns. This tests
trap vector setup — not user-mode. User-mode is not available
until the scheduler is initialized in Phase 11.
Phase 11: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
> **SMP bringup — RISC-V 64**: Secondary harts are brought online via
> the SBI HSM (Hart State Management) extension. The boot hart calls
> `sbi_hart_start(hartid, start_addr, opaque)` for each secondary.
> Each secondary enters at `start_addr` in S-mode, runs Phases 5-11
> (Sv48 page tables, PLIC, timer, ecall setup, per-CPU init), and joins
> the scheduler. The full SBI HSM calling sequence, per-hart stack
> allocation, and PLIC per-hart context initialization are specified in
> [Section 2.1.2.8 RISC-V 64 Boot Sequence](#2128-risc-v-64-boot-sequence).
2.1.2.9 Device Tree Blob Parsing
The Device Tree Blob (DTB) is the memory map and hardware description format shared by AArch64, ARMv7, RISC-V 64, PPC32, and PPC64LE. It serves the same role as the Multiboot1 info structure on x86 (Section 2.1.2.11), providing the kernel with memory layout and device addresses at boot.
DTB format (Flattened Device Tree / FDT):
Offset Field Size Description
0x00 magic u32 0xD00DFEED (big-endian)
0x04 totalsize u32 Total blob size in bytes
0x08 off_dt_struct u32 Offset to structure block
0x0C off_dt_strings u32 Offset to strings block
0x10 off_mem_rsvmap u32 Offset to memory reservation map
0x14 version u32 DTB version (17)
0x18 last_comp_ver u32 Last compatible version (16)
0x1C boot_cpuid_phys u32 Physical ID of boot CPU
0x20 size_dt_strings u32 Size of strings block
0x24 size_dt_struct u32 Size of structure block
All multi-byte fields are big-endian. The structure block contains a
flattened tree of nodes and properties encoded as tokens: FDT_BEGIN_NODE
(0x01), FDT_END_NODE (0x02), FDT_PROP (0x03), FDT_NOP (0x04),
FDT_END (0x09).
Minimal parser (umka-kernel/src/boot/dtb.rs):
The kernel implements a minimal, no-alloc DTB parser that walks the structure block once and extracts only what's needed for boot:
- Validate header: check magic (`0xD00DFEED`), version ≥ 16
- `/memory` nodes → collect `reg` property values as `MemoryRegion` array (base + size pairs), passed to `phys::init()`
- `/chosen` node → extract `bootargs` property (kernel command line)
- Interrupt controller → extract `reg` property from the node with the `interrupt-controller` property (GIC base for ARM, PLIC base for RISC-V)
- Timer → extract IRQ numbers from the `/timer` node `interrupts` property
- UART → extract `reg` property from the `/serial` or `stdout-path` device
The parser operates on raw byte slices with explicit big-endian reads and requires no heap allocation. Memory regions are collected into a fixed 64-entry array (matching the Multiboot1 parser's approach), since the parser runs during early boot before the heap allocator is available. Device tree nodes beyond this limit are parsed in a second pass after heap initialization.
Shared code: The DTB parser in umka-kernel/src/boot/dtb.rs is used by
all five non-x86 architectures. Each architecture's boot.rs calls
dtb::parse(dtb_addr) and passes the resulting memory regions to
phys::init().
2.1.2.10 Cross-Architecture Comparison
The following table summarizes which boot components are architecture-specific and which are shared across all six architectures:
| Phase | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE |
|---|---|---|---|---|---|---|
| Exception vectors | IDT (256 entries) | VBAR_EL1 (16 vectors) | VBAR CP15 (8 vectors) | stvec (Direct mode) | IVPR+IVORn | LPCR vector table |
| Memory map source | Multiboot1 info | DTB `/memory` | DTB `/memory` | DTB `/memory` | DTB `/memory` | DTB `/memory` |
| Page table format | 4-level PML4 (4 KB) | 4-level 4 KB granule | Short-desc 2-level (1 MB sections) | Sv48 4-level | 2-level (4 KB pages) | Radix tree (POWER9+) or HPT |
| IRQ controller | 8259 PIC (I/O ports) | GIC v2/v3 (MMIO, detected at runtime) | GICv2 (MMIO) | PLIC (MMIO) | OpenPIC (MMIO) | XIVE (MMIO) |
| Timer | PIT (I/O port 0x40) | Generic timer (system regs) | SP804 or generic timer | SBI ecall | Decrementer (DEC SPR) | Decrementer (DEC SPR) |
| Boot assembly | NASM (32→64 transition) | GNU as (EL1 entry) | GNU as (SVC entry) | GNU as (S-mode entry) | GNU as (supervisor entry) | GNU as (supervisor entry) |
| BSS clearing | entry.asm (rep stosd) | entry.S (str xzr loop) | entry.S (str r6 loop) | entry.S (sd zero loop) | entry.S (stw loop) | entry.S (std loop) |
| Phys allocator | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap |
| Heap allocator | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list |
| Capability system | shared | shared | shared | shared | shared | shared |
| Scheduler | shared | shared | shared | shared | shared | shared |
2.1.2.11 Multiboot1 Memory Map Parsing
boot/multiboot1.rs parses the Multiboot1 info structure (passed by GRUB/QEMU)
to extract the physical memory map:
- Read info structure flags to determine which fields are present
- If `FLAG_MEM` set: log basic memory sizes (lower/upper KB)
- If `FLAG_CMDLINE` set: log the kernel command line string
- If `FLAG_MMAP` set: iterate the memory map entries:
  - Each entry has: `base_addr` (u64), `length` (u64), `type` (u32)
  - Types: available (1), reserved (2), ACPI reclaimable (3), NVS (4), defective (5)
  - Unaligned reads used (`read_unaligned`) — Multiboot mmap entries may not be aligned
  - Collect up to 64 `MemoryRegion` structs, pass to `phys::init()`
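The type-field decoding above can be sketched as a small helper (the enum and function names are illustrative, not from the UmkaOS source):

```rust
/// Multiboot1 memory map entry types, per the list above.
/// Illustrative sketch — names are not from the UmkaOS source.
#[derive(Debug, PartialEq)]
pub enum MmapType {
    Available,       // 1
    Reserved,        // 2
    AcpiReclaimable, // 3
    Nvs,             // 4
    Defective,       // 5
    Unknown(u32),    // anything else: safest to treat as reserved
}

pub fn decode_mmap_type(t: u32) -> MmapType {
    match t {
        1 => MmapType::Available,
        2 => MmapType::Reserved,
        3 => MmapType::AcpiReclaimable,
        4 => MmapType::Nvs,
        5 => MmapType::Defective,
        other => MmapType::Unknown(other),
    }
}
```

Only `Available` regions are handed to `phys::init()` as free memory; everything else stays reserved.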
phys::init() processes the regions:
- Phase 1: Mark all available regions as free (page-aligned)
- Phase 2: Reserve first 1 MB (BIOS, VGA, legacy)
- Phase 3: Reserve kernel image (1 MB to __kernel_end)
2.1.2.12 Boot Allocator Design
The boot allocator (BootAlloc) is the physical-memory allocator used during
early boot, before the main buddy allocator (Section 4.1.1)
is initialized. Its design must satisfy two constraints in tension:
- It needs some memory before it can read the firmware memory map.
- It must not impose a hardcoded limit on total usable RAM.
These constraints are resolved with a two-phase design.
Phase 1 — Bootstrap (BSS pre-allocator)
Before the firmware memory map is parsed, a tiny fixed-size buffer resident in
.bss provides just enough memory to parse the firmware map and construct the
BootAlloc region table. This buffer is declared as a global static array:
/// Pre-allocator scratch buffer in .bss.
/// Used ONLY to construct the BootAlloc region table.
/// This is NOT a limit on usable memory — it is a staging area for parsing
/// the firmware map before BootAlloc is initialized.
static mut BOOTSTRAP_BUF: [u8; 64 * 1024] = [0u8; 64 * 1024];
static mut BOOTSTRAP_OFFSET: usize = 0;
This 64 KB BSS bootstrap buffer covers the worst-case cost of parsing firmware
memory map data structures. It is consumed once at boot and is never used again
after BootAlloc::init_from_* completes.
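The bootstrap bump allocation over this buffer can be sketched as follows. This is a simplified safe-Rust version working on a length/offset pair; the kernel version operates directly on the static BOOTSTRAP_BUF and BOOTSTRAP_OFFSET shown above, and the function name here is illustrative:

```rust
/// Bump-allocate `size` bytes with `align`-byte alignment from a scratch
/// buffer of length `buf_len`, advancing `offset`. Returns the start
/// offset of the allocation, or None if the buffer is exhausted.
/// Sketch only — the kernel works on BOOTSTRAP_BUF/BOOTSTRAP_OFFSET.
pub fn bootstrap_alloc(
    buf_len: usize,
    offset: &mut usize,
    size: usize,
    align: usize,
) -> Option<usize> {
    assert!(align.is_power_of_two());
    let start = (*offset + align - 1) & !(align - 1); // align up
    let end = start.checked_add(size)?;
    if end > buf_len {
        return None; // bootstrap buffer exhausted
    }
    *offset = end;
    Some(start)
}
```

Nothing allocated this way is ever freed: the buffer is consumed once during firmware-map parsing and then abandoned, which is why 64 KB suffices.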
Phase 2 — BootAlloc over all firmware-reported RAM
After the firmware map is parsed, BootAlloc is initialized with all
conventional memory regions reported by the firmware. It is a simple bump
allocator that walks regions in address order, moving to the next region when
the current one is exhausted:
/// One contiguous physical memory region reported by firmware.
pub struct MemRegion {
/// Base physical address (page-aligned).
pub base: PhysAddr,
/// Region size in bytes (page-aligned).
pub size: usize,
}
/// Pre-main-allocator memory manager.
///
/// Initialized from the firmware memory map; manages all conventional RAM
/// regions reported by firmware (UEFI MemoryMap, Multiboot1 mmap, or
/// Device Tree `/memory` nodes). All conventional memory is available for
/// allocation — there is no hardcoded cap on total usable memory.
///
/// Allocation strategy: bump allocator, advancing through `regions` in
/// address order. When `regions[current_region]` is exhausted, moves to
/// `regions[current_region + 1]`. Allocations are never freed — this
/// allocator is discarded once the buddy allocator takes over.
pub struct BootAlloc {
/// Firmware-reported memory regions, sorted by base address.
/// Populated by `init_from_multiboot1`, `init_from_uefi`, or `init_from_dtb`.
regions: [MemRegion; MAX_BOOT_REGIONS],
/// Number of valid entries in `regions`.
region_count: usize,
/// Index into `regions` for the current bump position.
current_region: usize,
/// Byte offset within `regions[current_region]` for the next allocation.
current_offset: usize,
}
/// Maximum number of distinct firmware memory map entries.
///
/// This caps the number of *separate address ranges*, not the total RAM size.
/// A 1 TB NUMA system may have 8-16 firmware-reported ranges; 64 covers all
/// realistic configurations (including heavily fragmented UEFI maps with many
/// reserved and reclaim regions alongside conventional memory ranges).
pub const MAX_BOOT_REGIONS: usize = 64;
Initialization entry points:
impl BootAlloc {
/// Initialize from a Multiboot1 mmap (x86-64).
/// Filters for type == 1 (available), page-aligns each region, skips
/// the first 1 MB (BIOS/legacy) and the kernel image.
pub fn init_from_multiboot1(mmap: &Multiboot1Mmap) -> Self;
/// Initialize from a UEFI memory map (future x86-64 UEFI path).
/// Filters for EfiConventionalMemory descriptor type.
pub fn init_from_uefi(map: &UefiMemoryMap) -> Self;
/// Initialize from Device Tree `/memory` nodes (all non-x86 architectures).
/// Uses the regions collected by the DTB parser in `boot/dtb.rs`.
pub fn init_from_dtb(regions: &[MemRegion]) -> Self;
}
Allocation:
impl BootAlloc {
/// Allocate `size` bytes with `align`-byte alignment from firmware RAM.
///
/// Advances the bump pointer through regions in address order until a
/// region has enough contiguous space to satisfy the request. Panics at
/// boot if no region can satisfy the request (indicates a firmware map
/// problem, not a normal condition — boot cannot continue anyway).
///
/// Returns a `PhysAddr` pointing to the allocated region.
pub fn alloc(&mut self, size: usize, align: usize) -> PhysAddr;
}
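A minimal sketch of the region-walking bump allocation behind this signature (simplified for illustration: `PhysAddr` reduced to `u64`, a `Vec` in place of the fixed `MAX_BOOT_REGIONS` array, and `None` in place of the boot-time panic; not the actual implementation):

```rust
/// Simplified stand-ins for the kernel types described above.
pub struct MemRegion { pub base: u64, pub size: usize }

pub struct BootAlloc {
    pub regions: Vec<MemRegion>, // fixed-size array in the kernel
    pub current_region: usize,
    pub current_offset: usize,
}

impl BootAlloc {
    /// Bump-allocate `size` bytes at `align` alignment, walking regions
    /// in address order. Sketch: returns None where the kernel panics.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<u64> {
        while self.current_region < self.regions.len() {
            let r = &self.regions[self.current_region];
            // Align the bump pointer up within the current region.
            let start = (r.base as usize + self.current_offset + align - 1)
                & !(align - 1);
            let off = start - r.base as usize;
            if off + size <= r.size {
                self.current_offset = off + size;
                return Some(start as u64);
            }
            // Region exhausted: advance to the next region in address order.
            self.current_region += 1;
            self.current_offset = 0;
        }
        None // no region can satisfy the request
    }
}
```

Note the cost of the bump design: an allocation that does not fit in the current region skips the region's tail permanently, which is acceptable here because BootAlloc lives only until `phys::init()` reclaims everything it did not hand out.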
Invariants:
- `regions` is sorted by `base` in ascending order after initialization.
- No region in `regions` overlaps the kernel image (`__kernel_start` to `__kernel_end`) or any reserved firmware region. These are subtracted out during initialization.
- `current_offset` is always a multiple of the requested alignment after each `alloc` call; the bump pointer is aligned up before each allocation.
- Once the buddy allocator is initialized (`phys::init()` completes), the `BootAlloc` instance is dropped and its memory is reclaimed.
Relationship to phys::init():
BootAlloc and phys::init() (the buddy allocator) both receive the same
firmware region list. BootAlloc uses it as a bump allocator for early boot
data structures. phys::init() builds a full buddy allocator over all
discovered RAM, then marks the pages consumed by BootAlloc as allocated
so they are not double-handed to userspace. The two-phase handoff is:
1. Firmware map parsed → BootAlloc::init_from_*(regions)
2. Early boot data structures allocated from BootAlloc
3. phys::init(regions) → buddy allocator built over all RAM
4. phys::mark_used(boot_alloc_consumed_pages) → reserve what BootAlloc used
5. BootAlloc dropped; all further allocation goes through buddy allocator
2.1.2.13 PPC32 Boot Sequence
PPC32 targets embedded PowerPC processors (e500, 440, etc.) using QEMU's ppce500
machine. The firmware (U-Boot or QEMU direct boot) passes a DTB pointer in r3.
Entry assembly (arch/ppc32/entry.S, GNU as syntax):
1. Firmware loads ELF and jumps to _start in supervisor mode
- r3 = DTB address
- r4 = kernel image start (optional)
- r5 = 0 (reserved)
2. _start:
a. Set up stack pointer (r1) from linker symbol
b. Clear BSS (.sbss + .bss)
c. Set up initial exception vectors (IVPR + IVORn)
d. Call umka_main(0, r3) [magic=0, info=DTB address]
The linker script (linker-ppc32.ld) places .text._start first at the kernel
load address. PPC32 uses big-endian byte order by default.
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (IVPR + IVORn)
Set IVPR to exception vector base address.
Initialize IVOR0-IVOR15 for each exception type:
- IVOR0 (Critical input), IVOR1 (Machine check)
- IVOR2 (Data storage), IVOR3 (Instruction storage)
- IVOR4 (External input), IVOR5 (Alignment)
- IVOR6 (Program), IVOR7 (Floating-point unavailable)
- IVOR8 (System call), IVOR9 (Auxiliary processor unavailable)
- IVOR10 (Decrementer), IVOR11 (Fixed interval timer)
- IVOR12 (Watchdog), IVOR13 (Data TLB)
- IVOR14 (Instruction TLB), IVOR15 (Debug)
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly).
Phase 3: DTB Parse
Parse the DTB passed in r3 (see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, OpenPIC base address, UART base.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Reserve kernel image.
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
Phase 6: Virtual Memory (2-level page tables)
Build identity-map using PPC32 2-level page table format:
- PGD (Page Directory): 1024 × 32-bit entries (4 KB)
- PTE (Page Table): 1024 × 32-bit entries per PGD entry (4 KB each)
- Use 4 KB pages with WIMG bits for cache policy
- Enable MMU via MSR[IR] and MSR[DR] bits
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: OpenPIC Initialization
Read OpenPIC base address from DTB.
- Configure interrupt vector base
- Set priority for each interrupt source
- Enable external interrupts via MSR[EE]
Phase 9: Decrementer Timer
Program the decrementer (DEC SPR) for periodic interrupts:
- Load initial value into DEC
- Decrementer exception is gated by MSR[EE] (already enabled in Phase 8)
Timer fires decrementer exception → reload DEC in handler.
Phase 10: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
> **SMP bringup — PPC32**: Secondary CPUs on embedded PPC (e500) are
> brought online via platform-specific firmware (U-Boot spin table or
> ePAPR boot protocol). The primary CPU writes the secondary entry
> point to a spin-table address, and the secondary polls until
> released. **Full specification deferred to Phase 3** — the spin-table
> protocol, per-CPU stack allocation, and OpenPIC per-CPU
> initialization will be detailed when PPC32 SMP is implemented.
2.1.2.14 PPC64LE Boot Sequence
PPC64LE targets IBM POWER processors (POWER8, POWER9, POWER10) in little-endian
mode. QEMU uses the pseries machine type with SLOF firmware, which passes a DTB
pointer in r3. Bare metal systems use OPAL (skiboot) firmware.
Entry assembly (arch/ppc64le/entry.S, GNU as syntax):
1. SLOF/OPAL loads ELF and jumps to _start in hypervisor or supervisor mode
- r3 = DTB address
- r4 = 0 (reserved)
- MSR: 64-bit mode (SF=1), little-endian (LE=1)
2. _start:
a. Set up TOC pointer (r2) from .TOC. symbol
b. Set up stack pointer (r1) from linker symbol
c. Clear BSS
d. Set up initial exception vectors
e. Call umka_main(0, r3) [magic=0, info=DTB address]
The linker script (linker-ppc64le.ld) places .text._start first at the kernel
load address. PPC64LE uses the ELFv2 ABI with little-endian byte order.
Initialization phases (in umka_main(), sequential):
Phase 1: Exception Vectors (LPCR + HSPRG0/1)
Set HSPRG0 to per-CPU data pointer.
Configure LPCR[AIL] (Alternate Interrupt Location) to select the exception vector base.
Initialize system reset and machine check handlers.
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly).
Phase 3: DTB Parse
Parse the DTB passed in r3 (see Section 2.1.2.9). Extract /memory
regions, /chosen bootargs, XIVE base addresses, UART base.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Reserve kernel image.
Phase 5: Kernel Heap
Allocate 256 contiguous 4 KB frames (1 MB). Initialize
free-list allocator → enables alloc::Vec, Box, String.
Phase 6: Virtual Memory (Radix MMU on POWER9+, HPT on POWER8)
Detect MMU type from DTB or CPU features:
- POWER9+: Use Radix MMU (4-level page tables: PGD→PUD→PMD→PTE, 4 KB/64 KB/2 MB pages)
Configure LPCR[HR] = 1 for Radix mode.
Set up process table (PRTB) and page table root (PGD).
- POWER8: Use HPT (Hash Page Table). Base page size is 4 KB by default (64 KB when the kernel is built for 64 KB pages); 16 MB is available as a huge page size.
Configure LPCR[HR] = 0 for HPT mode.
Set up HPT base and size in SDR1.
Enable MMU via MSR[IR] and MSR[DR] bits.
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: XIVE Interrupt Controller
Read XIVE base addresses from DTB.
- Initialize Interrupt Controller (IC) registers
- Initialize Thread Interrupt Management (TIMA)
- Configure interrupt priorities and routing
- Enable external interrupts via MSR[EE]
Phase 9: Decrementer Timer
Program the decrementer (DEC SPR) for periodic interrupts:
- Load initial value into DEC (32-bit, wraps at 0)
- Decrementer exception is gated by MSR[EE] (already enabled in Phase 8)
Timer fires decrementer exception → reload DEC in handler.
Note: POWER9+ also provides HDEC (Hypervisor Decrementer), used when running as a hypervisor to time guest execution.
Phase 10: Scheduler
Initialize round-robin scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
> **SMP bringup — PPC64LE**: Secondary CPUs on POWER systems are
> brought online via OPAL (OpenPOWER Abstraction Layer) on bare metal
> or RTAS (Run-Time Abstraction Services) under PowerVM. OPAL
> provides `opal_start_cpu(server_no, start_address)`. SLOF (QEMU)
> uses the device-tree `/cpus/cpu@N/ibm,ppc-interrupt-server#s`
> property and a spin-table release mechanism. **Full specification
> deferred to Phase 3** — the OPAL/RTAS calling convention, per-CPU
> stack allocation, and XIVE per-CPU initialization will be detailed
> when PPC64LE SMP is implemented.
2.1.2.15 Interrupt Controller Architecture: GIC (AArch64/ARMv7) and PLIC (RISC-V)
The x86-64 interrupt architecture (8259 PIC remapped through the IOAPIC, with per-CPU LAPIC) is described in Phase 2 of the x86-64 boot sequence. ARM and RISC-V use different interrupt controllers with distinct initialization models. This section specifies those controllers at the level of detail required to implement the UmkaOS Tier 0 interrupt initialization code.
AArch64 / ARMv7: GIC (Generic Interrupt Controller)
ARM platforms use the GIC family. The GIC version is detected at boot from the
Device Tree compatible string or from the ACPI MADT entry set: Type 0x0B
(GICC), Type 0x0C (GICD), and Type 0x0E (GICR). UmkaOS supports GICv2 and
GICv3/v4.
GICv2 (ARM Cortex-A9, A15, A17, and earlier server SoCs):
- GICD (Distributor): a single MMIO block shared by all CPUs. Controls SPI routing, enable/disable per-IRQ, and priority configuration.
- GICC (CPU Interface): a separate MMIO block, one per CPU, accessed at a fixed per-CPU stride. Provides IAR (Interrupt Acknowledge Register) and EOIR (End-of-Interrupt Register) for claim/complete cycles.
GICv3 / GICv4 (ARM Neoverse, Cortex-A55/A75 and later, all current server and mobile SoCs):
- GICD (Distributor): single shared MMIO block. On GICv3, affinity routing is enabled by setting GICD_CTLR.ARE_S=1 / ARE_NS=1. SPIs (IRQs 32-1019) are routed to CPUs via GICD_IROUTER[n] (64-bit affinity value matching MPIDR_EL1).
- GICR (Redistributor): one MMIO region per CPU, containing an LPI frame and an SGI/PPI frame (64 KB each, giving a 128 KB stride per CPU on GICv3; 256 KB on GICv4, which adds two VLPI frames). The GICR is discovered by walking a contiguous array of redistributor frames or from ACPI MADT.
- ICC system registers: On GICv3, the CPU interface is accessed entirely through system registers (no per-CPU MMIO). ICC_SRE_EL1.SRE=1 must be set first to enable system-register access; if running under a hypervisor, ICC_SRE_EL2.SRE=1 and ICC_SRE_EL2.Enable=1 must also be set.
IRQ taxonomy (all GIC versions):
| Range | Name | Description |
|---|---|---|
| 0–15 | SGI (Software Generated Interrupts) | Inter-processor interrupts. Written to GICD_SGIR (GICv2) or ICC_SGI1R_EL1 (GICv3). Delivered only to the targeted CPU(s). |
| 16–31 | PPI (Private Peripheral Interrupts) | Per-CPU, non-shared. Arch timer: PPI 26 = EL2 Physical Timer (CNTHP_IRQ), PPI 27 = EL1 Virtual Timer (CNTV_IRQ), PPI 29 = Secure Physical Timer, PPI 30 = Non-secure EL1 Physical Timer (CNTP_IRQ). |
| 32–1019 | SPI (Shared Peripheral Interrupts) | Platform devices: UART, PCIe, USB, storage controllers. Routed via GICD_ITARGETSR (GICv2) or GICD_IROUTER (GICv3). |
| 8192+ | LPI (Locality-specific Peripheral Interrupts, GICv3+) | MSI-based, used for PCIe MSI and MSI-X. Backed by an in-memory interrupt property table and pending table allocated by the kernel. |
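The taxonomy above maps directly to a range classifier; a minimal sketch (the enum and function names are illustrative, not from the UmkaOS source):

```rust
/// GIC interrupt classes, keyed by INTID range (all GIC versions;
/// LPIs exist only on GICv3+).
#[derive(Debug, PartialEq, Eq)]
pub enum IntKind {
    Sgi,      // 0-15: software-generated (IPIs)
    Ppi,      // 16-31: per-CPU private peripherals
    Spi,      // 32-1019: shared peripherals
    Special,  // 1020-1023: spurious / special INTIDs
    Reserved, // 1024-8191: reserved range
    Lpi,      // 8192+: message-based (GICv3+)
}

pub fn classify_intid(intid: u32) -> IntKind {
    match intid {
        0..=15 => IntKind::Sgi,
        16..=31 => IntKind::Ppi,
        32..=1019 => IntKind::Spi,
        1020..=1023 => IntKind::Special,
        1024..=8191 => IntKind::Reserved,
        _ => IntKind::Lpi,
    }
}
```

The 1020–1023 range (not shown in the table) is reserved for special INTIDs such as the spurious interrupt, 1023.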
GICv3 initialization sequence (per-system, once):
1. Read GIC base addresses from DTB or ACPI MADT.
2. Map GICD MMIO and GICR MMIO regions.
3. Disable GICD: write GICD_CTLR = 0. Wait for GICD_CTLR.RWP=0.
4. Enable affinity routing: GICD_CTLR = ARE_NS | EnableGrp1NS.
5. Configure SPI priorities: GICD_IPRIORITYR[n] for each SPI.
6. Configure SPI routing: GICD_IROUTER[n] = MPIDR affinity of target CPU
   (or set the IRM bit, bit 31, for any-affinity / 1-of-N routing).
7. Enable GICD: GICD_CTLR.EnableGrp1NS = 1.
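The affinity value written to GICD_IROUTER in step 6 can be derived from the target CPU's MPIDR_EL1 with a mask, since both registers place Aff0–Aff2 in bits [23:0] and Aff3 in bits [39:32]; a minimal sketch (helper names are illustrative):

```rust
/// Interrupt Routing Mode bit: when set, the SPI may be delivered to
/// any participating PE instead of a specific affinity (1-of-N).
pub const GICD_IROUTER_IRM: u64 = 1 << 31;

/// Build a GICD_IROUTER value targeting the CPU identified by `mpidr`
/// (the raw MPIDR_EL1 value). The MPIDR's MT/U/RES1 bits (bits [31:24])
/// and everything above Aff3 must be masked out.
pub fn irouter_for_mpidr(mpidr: u64) -> u64 {
    mpidr & 0x0000_00FF_00FF_FFFF
}
```

Writing `GICD_IROUTER_IRM` alone instead of an affinity value requests any-affinity routing.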
GICv3 per-CPU initialization sequence (executed on each CPU, including secondaries):
1. Locate this CPU's GICR frame (match GICR_TYPER.Affinity against MPIDR_EL1).
2. Wake redistributor: clear GICR_WAKER.ProcessorSleep, poll until
GICR_WAKER.ChildrenAsleep = 0.
3. Enable ICC system registers: write ICC_SRE_EL1 = SRE | DFB | DIB.
Execute ISB.
4. Set ICC_PMR_EL1 = 0xFF (accept all interrupt priorities).
5. Set ICC_BPR1_EL1 = 0 (no binary point split; all priority bits used).
6. Enable Group 1 interrupts: write ICC_IGRPEN1_EL1 = 1. Execute ISB.
7. Configure PPI priorities: GICR_IPRIORITYR[n] for timer PPI (PPI 27 = IRQ 27 = EL1 virtual timer CNTV_IRQ).
8. Enable timer PPI: GICR_ISENABLER0 |= (1 << 27).
Exception routing for interrupts (AArch64):
When an IRQ fires while the kernel runs at EL1 on SP_EL1, the CPU jumps to the IRQ vector at VBAR_EL1 + 0x280 (Current EL with SPx, IRQ); an IRQ taken from EL0 vectors to VBAR_EL1 + 0x480 (Lower EL using AArch64). The handler reads ICC_IAR1_EL1 to obtain the IRQ ID, dispatches to the registered handler, then writes ICC_EOIR1_EL1 with the same IRQ ID to signal completion. Priority drop (the ICC_EOIR1_EL1 write) and deactivation (an ICC_DIR_EL1 write) may be split by setting EOImode=1 in ICC_CTLR_EL1 for fine-grained priority management.
RISC-V: PLIC (Platform-Level Interrupt Controller)
The PLIC is the standard external interrupt controller for RISC-V supervisor-mode
software. It is discovered from the Device Tree node with compatible = "riscv,plic0"
or "sifive,plic-1.0.0", which provides the MMIO base address and the number
of interrupt sources.
PLIC memory map (all offsets are from the PLIC base address):
Offset 0x000000 + source*4: Source priority register (0=disabled, 1-7=priority level)
Offset 0x001000 + word*4: Interrupt pending bits (read-only, one bit per source)
Offset 0x002000 + ctx*0x80 + word*4: Interrupt enable bits (one bit per source, per context)
Offset 0x200000 + ctx*0x1000: Priority threshold register (0=accept all, 7=accept none)
Offset 0x200004 + ctx*0x1000: Claim/Complete register (read=claim highest-priority
pending IRQ; write=signal completion for that IRQ ID)
A context maps to (hart_id × modes_per_hart) + mode_index. On standard RISC-V implementations with M-mode and S-mode per hart: context = hart_id × 2 + mode, where mode 0 = M-mode, mode 1 = S-mode. UmkaOS uses S-mode contexts exclusively.
PLIC initialization sequence:
1. Discover PLIC base from DTB; map the MMIO region.
2. For each interrupt source (1 to max_source):
a. Set priority: PLIC[0x000000 + source*4] = desired_priority (1-7, or 0 to disable).
3. For each hart:
a. Compute S-mode context: ctx = hart_id * 2 + 1.
b. Set threshold to 0 (accept any non-zero priority):
PLIC[0x200000 + ctx*0x1000] = 0.
c. Enable desired sources:
PLIC[0x002000 + ctx*0x80 + (source/32)*4] |= (1 << (source % 32)).
4. Enable PLIC external interrupts in sie CSR: sie.SEIE = 1 (bit 9).
IRQ handling sequence (trap, scause = 9, External interrupt):
1. Read claim register: source_id = PLIC[0x200004 + ctx*0x1000].
A zero return means no interrupt is pending (spurious); ignore.
2. Dispatch to the registered handler for source_id.
3. Write completion: PLIC[0x200004 + ctx*0x1000] = source_id.
This deasserts the interrupt and allows new interrupts of equal or
lower priority to be delivered.
IPI delivery (RISC-V):
IPIs do not go through the PLIC. They use the SBI IPI extension
(Extension ID: 0x735049 = ASCII "sPI"). The primary hart calls
sbi_send_ipi(hart_mask, hart_mask_base) to set a software interrupt
pending on one or more target harts. On the receiving hart, the software
interrupt fires as a supervisor software interrupt (scause = 1, sie.SSIE = 1).
UmkaOS clears the IPI by writing sip.SSIP = 0 in the IPI handler and then
dispatches the pending IPI work item from the per-hart IPI queue.
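The `(hart_mask, hart_mask_base)` pair passed to `sbi_send_ipi` encodes targets as bit i = hart (base + i); a hypothetical helper to build that pair from a target list:

```rust
/// Build the (hart_mask, hart_mask_base) pair for sbi_send_ipi from a
/// set of target hart IDs. Bit i of hart_mask selects hart
/// (hart_mask_base + i). Returns None for an empty set, or when the
/// targets span more than 64 harts (which would require a second call).
pub fn sbi_hart_mask(harts: &[usize]) -> Option<(u64, usize)> {
    let base = *harts.iter().min()?;
    let mut mask = 0u64;
    for &h in harts {
        let bit = h - base;
        if bit >= 64 {
            return None; // window too wide for one SBI call
        }
        mask |= 1u64 << bit;
    }
    Some((mask, base))
}
```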
2.1.2.16 NUMA Topology Discovery
On x86-64 and ARM SBSA/server platforms, NUMA topology is provided by ACPI tables: SRAT (System Resource Affinity Table) maps memory ranges and APIC / MPIDR IDs to NUMA node numbers, while SLIT (System Locality Information Table) provides the distance matrix. UmkaOS parses SRAT and SLIT during static table parsing (Phase 1 of x86-64 initialization; see Section 2.1.2.5).
On platforms that boot with a Device Tree (AArch64 embedded, RISC-V, PPC32, PPC64LE), NUMA topology is encoded directly in the Device Tree. UmkaOS performs DT-based NUMA discovery as a post-DTB-parse step for all non-x86 architectures.
Device Tree NUMA encoding:
/cpus/cpu@N
numa-node-id = <0>; // NUMA node this CPU belongs to
/memory@40000000
device_type = "memory";
reg = <0x0 0x40000000 0x0 0x40000000>;
numa-node-id = <0>; // NUMA node this memory range belongs to
/memory@200000000
device_type = "memory";
reg = <0x2 0x00000000 0x2 0x00000000>;
numa-node-id = <1>; // Second NUMA node
/distance-map // Optional; absent on many embedded platforms
compatible = "numa-distance-map-v1";
distance-matrix =
<0 0 10>, // Node 0 → Node 0: local (normalized to 10)
<0 1 20>, // Node 0 → Node 1: remote
<1 0 20>, // Node 1 → Node 0: remote
<1 1 10>; // Node 1 → Node 1: local
UmkaOS DT-based NUMA discovery algorithm:
1. Walk all /cpus/cpu@N nodes. For each cpu node:
a. Read the reg property (MPIDR affinity / hart ID / PIR).
b. Read numa-node-id. If absent, assign to node 0.
c. Record: cpu_id → numa_node mapping.
2. Walk all /memory@... nodes. For each memory node:
a. Read reg (base, size) pairs.
b. Read numa-node-id. If absent, assign all memory to node 0.
c. Record: [base, base+size) → numa_node mapping (passed to phys::init).
3. If /distance-map node is present:
a. Parse distance-matrix property: triples of (from_node, to_node, distance).
b. Populate NumaDistanceMatrix[from][to] = distance.
c. Distances are normalized: local access = 10. Remote = proportionally higher.
If /distance-map is absent:
a. Assume symmetric topology: all local accesses cost 10, all remote
accesses cost 20 (single-hop assumption). This is conservative but safe.
4. Validate: ensure every CPU maps to a node that has at least some memory.
If a CPU's node has no memory (misconfigured DTB), log a warning and
migrate the CPU to the nearest node with memory (lowest distance score).
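Step 3 of the algorithm above can be sketched as a small matrix type with both population paths (a sketch using std `Vec` for brevity; the kernel's actual container types will differ):

```rust
/// NUMA distance matrix built from the /distance-map node, or from the
/// symmetric fallback (local = 10, remote = 20) when the node is absent.
pub struct NumaDistanceMatrix {
    n: usize,
    d: Vec<u8>, // row-major n x n
}

impl NumaDistanceMatrix {
    /// Fallback: single-hop symmetric topology.
    pub fn symmetric_fallback(num_nodes: usize) -> Self {
        let mut m = NumaDistanceMatrix {
            n: num_nodes,
            d: vec![20; num_nodes * num_nodes],
        };
        for i in 0..num_nodes {
            m.d[i * num_nodes + i] = 10; // local access
        }
        m
    }

    /// Populate from parsed (from_node, to_node, distance) triples.
    pub fn from_triples(num_nodes: usize, triples: &[(usize, usize, u8)]) -> Self {
        let mut m = Self::symmetric_fallback(num_nodes);
        for &(from, to, dist) in triples {
            m.d[from * num_nodes + to] = dist;
        }
        m
    }

    pub fn distance(&self, from: usize, to: usize) -> u8 {
        self.d[from * self.n + to]
    }
}
```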
Per-architecture specifics:
ARM server (AWS Graviton 3, Ampere Altra, Neoverse N2/V2 platforms): Prefer ACPI SRAT over Device Tree on SBSA-compliant platforms (ACPI is mandatory on SBSA). The SRAT Memory Affinity Structure and Processor Affinity Structure (Types 1 and 0) map MPIDR values and memory ranges to NUMA proximity domains. Distance values come from SLIT. On platforms that provide both ACPI and a Device Tree (Graviton 3 exposes both), ACPI takes precedence.
RISC-V: No ACPI on most RISC-V platforms. The distance-map DT node is rarely populated on current RISC-V hardware (SiFive HiFive Unmatched, StarFive VisionFive 2). UmkaOS applies the symmetric topology fallback (local=10, remote=20) on RISC-V when the distance-map node is absent. Future multi-socket RISC-V server designs (expected from Ventana, SiFive, Alibaba T-Head) will populate distance-map.
PPC64LE (POWER10):
IBM POWER systems encode NUMA topology using the proprietary
ibm,associativity and ibm,associativity-reference-points DT properties:
/cpus/cpu@0
ibm,associativity = <4 0 0 0 0>;
// Four levels of hierarchy: chip group / chip / core / thread.
// The reference-points property selects which levels to use for
// NUMA distance calculation.
/ibm,associativity-reference-points = <0x4 0x2>;
// Level index 4 (first element) = domain/chip-group boundary.
// Level index 2 (second element) = chip boundary.
// Distance between CPUs sharing the same value at each level:
// same at both levels = local (same chip) → distance 10
// same at first but different at second = 1 hop → distance 20
// different at first = multiple hops → distance 40
UmkaOS parses ibm,associativity-reference-points first to determine the number
of distance levels, then for each CPU and memory node reads ibm,associativity
to compute the NUMA node assignment and inter-node distance matrix.
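The distance rule in the example's comments (10 for a full match, doubling at each reference point once the arrays diverge) can be sketched as a hypothetical helper; the exact ibm,associativity semantics are IBM-specific and this is only an illustration of the rule stated above:

```rust
/// Compute a NUMA distance from two ibm,associativity arrays using the
/// ibm,associativity-reference-points level indices (coarsest first).
/// Element 0 of each array holds the level count; reference points are
/// indices into the array. Once the arrays diverge at one reference
/// point, all finer-grained points count as diverged too.
pub fn ppc_assoc_distance(a: &[u32], b: &[u32], ref_points: &[usize]) -> u32 {
    let mut dist = 10; // full match = local
    let mut diverged = false;
    for &rp in ref_points {
        if diverged || a.get(rp) != b.get(rp) {
            diverged = true;
            dist *= 2; // each unshared level doubles the distance
        }
    }
    dist
}
```

With `ref_points = [4, 2]` as in the example, a same-chip pair yields 10, a same-chip-group/different-chip pair yields 20, and a different-chip-group pair yields 40.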
2.1.2.17 Per-Architecture Extended State (FPU) Initialization
Each architecture requires explicit initialization to enable floating-point and SIMD registers before they can be used by kernel or user code. UmkaOS uses a lazy FP strategy on all architectures: extended state is not saved at every context switch, but only when the task has actually used FP/SIMD registers.
x86-64:
FPU/SSE/AVX/XSAVE initialization runs during early boot (before interrupts are enabled, after the physical memory manager is initialized):
1. Detect XSAVE: CPUID leaf 0x1, ECX bit 26 (XSAVE; bit 27, OSXSAVE, reflects
   CR4.OSXSAVE once the OS sets it). If absent, fall back to legacy FXSAVE
   (x87/SSE state only, 512 bytes).
2. Set CR0: CR0.EM = 0 (no FPU emulation), CR0.MP = 1 (monitor coprocessor).
3. Set CR4.OSFXSR = 1 (enable FXSAVE/FXRSTOR for SSE state).
Set CR4.OSXSAVE = 1 (enable XSAVE/XRSTOR for extended state).
4. Query CPUID leaf 0xD, sub-leaf 0 (EDX:EAX) to discover which extended
   state components the hardware supports:
   XCR0 bit 0 = x87 FPU, bit 1 = SSE, bit 2 = AVX, bits 5-7 = AVX-512,
   bit 9 = PKRU, bits 17-18 = AMX tile config/data.
5. Enable all supported components: write XCR0 with the bitmask of present
components (CPUID leaf 0xD, sub-leaf 0 provides the valid bit set).
6. Lazy context switch: set CR0.TS = 1 (task switched). First FP use from
any task triggers a #NM (Device Not Available) exception. The handler
loads the task's saved FP state and clears CR0.TS before returning.
On context switch out: if CR0.TS was clear (task used FP), save the
extended state via XSAVE[OPT/C] to the per-task XSAVE area.
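Steps 4–5 reduce to computing the XCR0 mask from the CPUID-reported valid-bit set; a hypothetical helper (the function name and the all-or-nothing AVX-512 policy are illustrative):

```rust
/// XCR0 component bits (XSAVE-managed state).
pub const XCR0_X87: u64 = 1 << 0;
pub const XCR0_SSE: u64 = 1 << 1;
pub const XCR0_AVX: u64 = 1 << 2;
pub const XCR0_AVX512: u64 = 0b111 << 5; // opmask, ZMM_Hi256, Hi16_ZMM
pub const XCR0_PKRU: u64 = 1 << 9;

/// Compute the XCR0 value to write, given the valid-bit set reported by
/// CPUID leaf 0xD sub-leaf 0 (EDX:EAX) and the components the kernel is
/// willing to manage. XCR0 bit 0 (x87) must architecturally be set; SSE
/// is kept alongside it. The three AVX-512 bits are enabled together or
/// not at all.
pub fn xcr0_to_enable(cpuid_supported: u64, kernel_managed: u64) -> u64 {
    let mut x = cpuid_supported & kernel_managed;
    x |= XCR0_X87 | XCR0_SSE; // required baseline
    if x & XCR0_AVX512 != XCR0_AVX512 {
        x &= !XCR0_AVX512; // all three AVX-512 bits or none
    }
    x
}
```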
AArch64:
1. NEON/FP enable: Write CPACR_EL1.FPEN = 0b11 (no trapping of FP/NEON
instructions at EL1 or EL0). Without this, any NEON/FP instruction
from EL0 or EL1 causes an Undefined Instruction exception.
(UmkaOS's entry.S already sets FPEN=0b11 for the boot CPU to allow
Rust-generated NEON instructions in the early kernel; secondary CPUs
set FPEN=0b11 in their secondary_entry stubs.)
2. Lazy context switch: use CPACR_EL1.FPEN = 0b00 (trap FP/NEON from
all ELs) to detect first use. On the resulting trap (ESR_EL1.EC=0x07,
FP/NEON access from AArch64), load the task's saved FP state and set
FPEN=0b11 before returning. On context switch out: if FPEN was 0b11
(task used FP), save Q0-Q31 + FPSR + FPCR to the per-task FP frame.
3. SVE (Scalable Vector Extension, ARMv8.2+): If CPUID reports SVE
(ID_AA64PFR0_EL1.SVE != 0), set ZCR_EL1.LEN to the desired vector
length in 128-bit granules minus 1 (0 = 128-bit, 1 = 256-bit, up to the
hardware maximum, probed by writing all-ones to ZCR_EL1.LEN and reading
back the effective value). Set CPACR_EL1.ZEN = 0b11 (no trapping of SVE
at EL1/EL0).
SVE state (Z registers, P registers, FFR) is saved/restored separately
from the NEON state, using the larger per-task SVE frame.
4. SME (Scalable Matrix Extension, ARMv9.2+): Enabled via CPACR_EL1.SMEN
and SMCR_EL1.LEN. SME streaming mode and ZA register file are saved as
part of the per-task SME frame on context switch.
RISC-V:
sstatus.FS field (bits [14:13]) controls FP state:
0b00 = Off: Any FP instruction causes an Illegal Instruction exception.
0b01 = Initial: FP registers accessible; initial (clean) state.
0b10 = Clean: FP registers accessible; not modified since last save.
0b11 = Dirty: FP registers accessible; modified since last save.
1. At boot (on each hart): set sstatus.FS = 0b01 (Initial). This enables
FP instructions without immediately requiring a context-switch save.
2. Lazy save: set sstatus.FS = 0b00 (Off) on context switch in for tasks
that have not used FP. First FP instruction traps (Illegal Instruction,
scause = 2). The handler sets sstatus.FS = 0b01 and returns; the FP
instruction re-executes. On context switch out: if sstatus.FS == 0b11
(Dirty), save all 32 FP registers (f0-f31) plus fcsr to the per-task
FP frame, then set sstatus.FS = 0b10 (Clean). This avoids saving FP
state for tasks that never use FP.
3. Vector extension (V): If sstatus.VS (bits [10:9]) is supported, manage
the V register file (v0-v31, vtype, vl, vlenb) identically to the FP
FS field. VS = 0b00 traps; set on first use; save on switch-out if Dirty.
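The FS field manipulation above is plain bit arithmetic; a minimal sketch of the accessors the context-switch path would use (names are illustrative):

```rust
/// sstatus.FS field (bits [14:13]) values.
pub const FS_SHIFT: u64 = 13;
pub const FS_MASK: u64 = 0b11 << FS_SHIFT;
pub const FS_OFF: u64 = 0b00;
pub const FS_INITIAL: u64 = 0b01;
pub const FS_CLEAN: u64 = 0b10;
pub const FS_DIRTY: u64 = 0b11;

/// Extract the FS field from a raw sstatus value.
pub fn sstatus_get_fs(sstatus: u64) -> u64 {
    (sstatus & FS_MASK) >> FS_SHIFT
}

/// Return sstatus with the FS field replaced.
pub fn sstatus_set_fs(sstatus: u64, fs: u64) -> u64 {
    (sstatus & !FS_MASK) | (fs << FS_SHIFT)
}

/// Switch-out decision: save FP state only if this task actually
/// dirtied the register file since the last save.
pub fn needs_fp_save(sstatus: u64) -> bool {
    sstatus_get_fs(sstatus) == FS_DIRTY
}
```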
PPC32 / PPC64LE:
The MSR (Machine State Register) contains separate enable bits for each
extended register file:
MSR.FP (bit 13): FPU enable. 0 = FP instructions cause FP Unavailable exception.
MSR.VEC (bit 25): AltiVec/VMX enable. 0 = VMX instructions cause VMX Unavailable.
MSR.VSX (bit 23): VSX enable (PPC64 only). 0 = VSX instructions cause VSX Unavailable.
1. At boot: clear MSR.FP, MSR.VEC, MSR.VSX (all zero after reset; verify).
2. Lazy enable: the FP/VMX/VSX Unavailable exception fires on first use.
The handler sets the corresponding MSR bit and returns. The instruction
re-executes.
3. On context switch out: if any of MSR.FP / MSR.VEC / MSR.VSX is set,
save the corresponding register file (32 FPRs + FPSCR, 32 VMX registers
+ VSCR/VRSAVE, 64 VSX registers) to the per-task frame, then clear the
MSR bit to re-arm the trap for the next task.
4. On context switch in: do NOT restore FP state until first use (the
Unavailable trap will do that). This means tasks that were FP-active
when they were switched out will take one Unavailable trap on their
next quantum — a single additional exception per task per scheduling
interval, which is acceptable given the benefit of skipping FP restore
for FP-idle tasks.
UmkaOS's unified lazy FP policy:
All six architectures implement the same semantic contract:
- Tasks that never issue a FP/SIMD instruction pay zero extended-state save or restore cost at every context switch.
- The first FP/SIMD instruction in a task's lifetime triggers one trap, which loads the task's initial (zero) FP state and marks the task as FP-active.
- Subsequent context switches for FP-active tasks check the architecture's dirty indicator (CR0.TS cleared / FS=Dirty / MSR.FP set) and save only when needed.
- The per-task FP frame is allocated at task creation (sized to the largest extended state the hardware can produce on that platform, as determined by XSAVE area size on x86, SVE vector length on AArch64, or fixed sizes on RISC-V/PPC) and freed at task exit.
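The semantic contract above can be expressed as an architecture-neutral state machine that each arch layer maps its dirty indicator onto; a sketch (enum and function names are illustrative, not from the UmkaOS source):

```rust
/// Architecture-neutral view of the unified lazy-FP contract. The arch
/// layer translates its dirty indicator (CR0.TS clear, FS=Dirty,
/// MSR.FP set, ...) into this enum at context-switch time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FpState {
    /// Task has never executed an FP/SIMD instruction: zero save cost.
    NeverUsed,
    /// Task is FP-active; registers match the saved per-task frame.
    Clean,
    /// Task is FP-active; registers modified since the last save.
    Dirty,
}

/// Switch-out: true if the extended state must be written to the
/// per-task frame.
pub fn save_on_switch_out(state: FpState) -> bool {
    state == FpState::Dirty
}

/// First-use trap: a NeverUsed task becomes FP-active with a zeroed
/// frame loaded; already-active tasks are unchanged.
pub fn on_first_use_trap(state: FpState) -> FpState {
    match state {
        FpState::NeverUsed => FpState::Clean,
        other => other,
    }
}
```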
2.1.3 Production Boot Target
The following subsections describe the target boot architecture for production deployments. None of this is implemented yet — it represents the design goal that the Multiboot implementation will evolve toward (see Section 2.1.7 for the migration path).
2.1.3.1 Goal: Drop-in Kernel Package
UmkaOS installs as a standard kernel package alongside the existing Linux kernel. The user can dual-boot between them using the GRUB menu.
# # Debian / Ubuntu
apt install umka
update-initramfs -c -k umka-1.0.0
update-grub
# # RHEL / Fedora
dnf install umka
dracut --force /boot/initramfs-umka-1.0.0.img umka-1.0.0
grub2-mkconfig -o /boot/grub2/grub.cfg
# # Arch Linux
pacman -S umka
mkinitcpio -p umka
# # Reboot, select "UmkaOS 1.0.0" from GRUB menu
# # Existing Linux kernel is always available as a fallback entry
2.1.3.2 Boot Requirements
- Image format: ELF kernel image with an embedded PE/COFF stub header, compatible
  with GRUB2 (loading as ELF), systemd-boot, and UEFI direct boot (loading as
  PE/COFF). Installed as /boot/vmlinuz-umka-VERSION (the "vmlinuz" name is a
  convention; the actual format is a PE/COFF-stubbed ELF, similar to Linux's
  bzImage with EFISTUB).
- Boot protocol: x86 Linux boot protocol (for BIOS legacy boot) and UEFI stub
  (for UEFI direct boot). Both are supported.
- Initramfs: Custom initramfs containing UmkaOS-native drivers for early boot
  (storage controller, root filesystem). Built using standard tools (dracut,
  mkinitcpio) with UmkaOS-specific hooks.
- /boot layout: Fully compatible with existing distribution tools:
  /boot/vmlinuz-umka-VERSION
  /boot/initramfs-umka-VERSION.img
  /boot/System.map-umka-VERSION (optional, for debugging)
- Kernel command line: Standard Linux cmdline parameters are parsed and honored
  (root=, console=, quiet, init=, rw/ro, etc.).
2.1.3.3 Target Boot Sequences
x86-64 (production):
1. UEFI firmware (PE/COFF stub) / BIOS bootloader loads kernel image
2. Boot stub (Rust/asm) sets up:
- Identity-mapped page tables
- GDT, IDT stubs
- Stack
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse boot parameters and ACPI tables
b. Initialize physical memory allocator (from e820/UEFI memory map)
c. Initialize virtual memory (kernel page tables, PCID)
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: APIC, timer, early console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init (typically systemd)
AArch64 (production):
1. UEFI firmware or QEMU -kernel loads the ELF, jumps to _start in EL1
2. Boot stub (assembly) sets up:
- Exception vectors (VBAR_EL1)
- Stack pointer
- MMU disabled (identity-mapped initially)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) passed in x0
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (TTBR0_EL1/TTBR1_EL1, ASID, TCR_EL1)
d. Initialize per-CPU data structures (MPIDR_EL1 affinity)
e. Initialize Tier 0 drivers: GIC (distributor + redistributor), generic timer, early console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
No microcode loading is performed — ARM firmware updates are handled by the platform firmware (UEFI capsule updates or vendor-specific mechanisms), not the kernel. This is architecturally correct: ARM's trust model places firmware updates in the Secure World (EL3/EL2), not in the Normal World OS.
ARMv7 (production):
1. QEMU vexpress-a15 loads the ELF, jumps to _start in SVC mode
2. Boot stub (assembly) sets up:
- Vector table (VBAR)
- Stack pointer
- Interrupts disabled (CPSR I+F bits)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) passed in r2
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (TTBR0, DACR for domain isolation)
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: GIC, SP804 timer, early UART console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
ARMv7 does not have microcode updates. CPU errata on ARMv7 are addressed through kernel code paths (alternative instruction sequences) selected at boot based on the MIDR (Main ID Register) value.
RISC-V 64 (production):
1. OpenSBI (M-mode firmware) initializes hardware, jumps to _start in S-mode
a0 = hart_id, a1 = DTB address
2. Boot stub (assembly) sets up:
- Trap vector (stvec)
- Stack pointer
- Interrupts disabled (sstatus.SIE = 0)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) from a1
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (satp CSR, Sv48 mode, ASID)
d. Initialize per-CPU data structures (per-hart)
e. Initialize Tier 0 drivers: PLIC, timer (via SBI ecall), early 16550 UART
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
RISC-V does not have microcode updates. CPU errata are handled by OpenSBI (M-mode)
or by kernel alternative code paths selected based on the mvendorid/marchid/
mimpid CSRs (exposed via SBI or DTB).
PPC32 (production):
1. U-Boot or QEMU loads ELF, jumps to _start in supervisor mode
r3 = DTB address
2. Boot stub (assembly) sets up:
- Stack pointer (r1)
- Exception vectors (IVPR base + IVOR offsets)
- Interrupts disabled (MSR EE=0)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) from r3
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (TLB1 entries for initial mapping, then software page table)
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: OpenPIC, decrementer timer, early UART console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
PPC32 does not have microcode updates. CPU errata are handled by kernel code paths selected at boot based on the PVR (Processor Version Register).
PPC64LE (production):
1. SLOF/OPAL firmware loads ELF, jumps to _start
r3 = DTB address, MSR: SF=1, LE=1
2. Boot stub (assembly) sets up:
- TOC pointer (r2) for position-independent data access
- Stack pointer (r1)
- Exception vectors (via LPCR and HSPRG0/1)
- Interrupts disabled (MSR EE=0)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) from r3
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (Radix MMU on POWER9+, HPT fallback on POWER8)
d. Initialize per-CPU data structures (PIR = Processor Identification Register)
e. Initialize Tier 0 drivers: XIVE interrupt controller, decrementer timer, early UART console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
PPC64LE does not have user-loadable microcode. POWER processor firmware updates are applied by the service processor (FSP or BMC) out-of-band, not by the OS kernel.
2.1.3.4 Initramfs Detection and Loading
UmkaOS supports three initramfs loading mechanisms, tried in priority order. The mechanism used depends on the boot path (BIOS/Multiboot, UEFI, or firmware with device tree). All three paths expose the same result to the kernel: a physical address and byte length for a contiguous initramfs image in RAM.
| Boot Path | Discovery Mechanism | Address Fields |
|---|---|---|
| x86 BIOS/Multiboot | boot_params.hdr.ramdisk_image (offset 0x218) | u32 phys addr + ramdisk_size at 0x21c |
| EFI stub (all arches) | LINUX_EFI_INITRD_MEDIA_GUID LoadFile2 protocol | GUID: 5568e427-68fc-4f3d-ac74-ca555231cc68 |
| Device Tree | /chosen node: linux,initrd-start + linux,initrd-end | u64 big-endian absolute physical addresses |
All three paths converge on the same kernel-internal representation:
/// Initramfs blob location discovered during early boot.
/// Populated by one of the three platform-specific loading paths before
/// the memory allocator is fully online. The physical range
/// [phys_start, phys_start + len) must lie within usable RAM.
pub struct InitramfsBlob {
/// Physical start address of the initramfs image.
pub phys_start: PhysAddr,
/// Byte length of the compressed CPIO archive.
pub len: usize,
}
Path 1 — x86 boot_params (highest priority on x86/x86-64)
The Multiboot loader or UEFI stub populates fields in boot_params (the "zero
page"). There are two distinct areas: the setup_header (header fields) and the
boot_params extension area (zero-page fields). The ramdisk fields span both:
/// Fields read from the x86 Linux boot protocol.
/// `ramdisk_image` and `ramdisk_size` live in the setup_header at fixed offsets
/// from the start of the real-mode kernel header (0x01f1 into boot_params).
/// `ext_ramdisk_image` and `ext_ramdisk_size` are separate extension fields
/// in the boot_params zero-page area, not in the header itself.
pub struct BootParamsRamdiskFields {
/// Low 32 bits of the initramfs physical base address.
/// Offset from boot_params base: 0x218 (within setup_header).
/// Boot protocol 2.00+ (kernel 1.3.73+).
pub ramdisk_image: u32,
/// Low 32 bits of the initramfs byte length.
/// Offset from boot_params base: 0x21c (within setup_header).
/// Boot protocol 2.00+ (kernel 1.3.73+).
pub ramdisk_size: u32,
/// High 32 bits of the initramfs physical base address.
/// Offset from boot_params base: 0x0c0 (zero-page extension area).
/// Added in boot protocol 2.12 (kernel 3.8) for loading above 4 GiB.
pub ext_ramdisk_image: u32,
/// High 32 bits of the initramfs byte length.
/// Offset from boot_params base: 0x0c4 (zero-page extension area).
/// Added in boot protocol 2.12 (kernel 3.8).
pub ext_ramdisk_size: u32,
}
If boot_params.hdr.ramdisk_image != 0, UmkaOS reads the initramfs from:
physical_addr = ((ext_ramdisk_image as u64) << 32) | (ramdisk_image as u64)
size_bytes = ((ext_ramdisk_size as u64) << 32) | (ramdisk_size as u64)
On systems without boot protocol 2.12 support (i.e., ext_ramdisk_image and
ext_ramdisk_size are zero-initialized), this reduces to the 32-bit address
and size directly from ramdisk_image and ramdisk_size.
Path 2 — EFI LoadFile2 / Initrd Media GUID Protocol (EFI systems, all architectures)
When booted via EFI (UEFI stub or EFI bootloader such as systemd-boot or GRUB2),
the bootloader may expose the initramfs through the LoadFile2 protocol registered
on the LINUX_EFI_INITRD_MEDIA_GUID vendor media device path. This mechanism was
introduced in Linux 5.8 and is also implemented in the UmkaOS EFI stub.
/// EFI GUID identifying the initrd media vendor device path.
/// The kernel's EFI stub locates a handle with this GUID registered on the
/// firmware's device path protocol, then calls LoadFile2 to obtain the initrd.
/// Defined in the Linux EFI stub (drivers/firmware/efi/libstub/efi-stub-helper.c).
pub const LINUX_EFI_INITRD_MEDIA_GUID: EfiGuid = EfiGuid {
data1: 0x5568_e427,
data2: 0x68fc,
data3: 0x4f3d,
data4: [0xac, 0x74, 0xca, 0x55, 0x52, 0x31, 0xcc, 0x68],
};
The loading sequence:
- Scan the EFI handle database for a handle that matches the LINUX_EFI_INITRD_MEDIA_GUID vendor media device path.
- If found, query the LoadFile2 protocol on that handle.
- Call LoadFile2.LoadFile() with BootPolicy = false to obtain the initrd size (the first call returns EFI_BUFFER_TOO_SMALL with the required size).
- Allocate pages below the hard limit, then call LoadFile2.LoadFile() again to transfer the data.
- The resulting (base, size) pair is stored in the EFI configuration table under LINUX_EFI_INITRD_MEDIA_GUID and consumed by the kernel after ExitBootServices().
Path 3 — Device Tree /chosen node (AArch64, ARMv7, RISC-V, PPC)
The firmware or bootloader populates the /chosen DT node with the initramfs
physical address range:
/ {
chosen {
/* linux,initrd-start and linux,initrd-end are big-endian cell values.
Cell width follows #address-cells of the root node (typically 2 on
64-bit platforms, giving 64-bit addresses across two 32-bit cells). */
linux,initrd-start = <0x0 0x82000000>; /* 64-bit: 0x0000000082000000 */
linux,initrd-end = <0x0 0x84000000>; /* exclusive end address */
};
};
/// DT /chosen property names for initramfs (standard Linux boot protocol).
/// Values are big-endian cells; cell width matches the root node's #address-cells.
/// Size of initramfs = initrd_end - initrd_start (initrd_end is exclusive).
pub const DT_INITRD_START_PROP: &str = "linux,initrd-start";
pub const DT_INITRD_END_PROP: &str = "linux,initrd-end";
UmkaOS reads these properties during early DT parsing (step 4a in the DT-based boot sequences). Cell values are big-endian regardless of platform word size; following the Linux implementation, the cell width is inferred from the property length (one cell for 32-bit values, two cells for 64-bit addresses).
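The property decoding can be sketched as a small pure function. This is an illustrative sketch, not the UmkaOS parser: the function names are invented here, and (as in Linux) the cell width is inferred from the property length.

```rust
/// Parse one /chosen initrd property value. The property length selects the
/// cell width: 4 bytes = one 32-bit cell, 8 bytes = two cells (64-bit address).
/// Cells are big-endian, as in all flattened device trees.
fn parse_initrd_prop(prop: &[u8]) -> Option<u64> {
    match prop.len() {
        4 => Some(u32::from_be_bytes(prop.try_into().ok()?) as u64),
        8 => Some(u64::from_be_bytes(prop.try_into().ok()?)),
        _ => None, // malformed property
    }
}

/// Derive (start, len) from linux,initrd-start / linux,initrd-end.
/// initrd-end is exclusive, so len = end - start.
fn initrd_range(start_prop: &[u8], end_prop: &[u8]) -> Option<(u64, u64)> {
    let start = parse_initrd_prop(start_prop)?;
    let end = parse_initrd_prop(end_prop)?;
    end.checked_sub(start).map(|len| (start, len))
}
```

Applied to the DTS example above, the two 8-byte properties decode to start 0x8200_0000 and length 0x0200_0000 (32 MiB).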
Priority and fallback:
if arch == x86 || arch == x86_64:
if boot_params.hdr.ramdisk_image != 0:
use Path 1
elif efi_boot && efi_load_initrd_dev_path() succeeds:
use Path 2
else:
no initramfs
elif efi_boot:
if efi_load_initrd_dev_path() succeeds:
use Path 2
else:
no initramfs
elif dt_boot:
if dt_property_exists("/chosen", "linux,initrd-start"):
use Path 3
else:
no initramfs
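The fallback pseudocode above transliterates directly into a selection function. The types below are illustrative (invented for this sketch); the real kernel derives these flags from the boot environment rather than a struct:

```rust
/// Which initramfs discovery path was selected (sketch types).
#[derive(Debug, PartialEq, Clone, Copy)]
enum InitrdSource { BootParams, EfiLoadFile2, DeviceTree, Absent }

/// Boot-environment facts that drive the selection (illustrative).
struct BootEnv {
    is_x86: bool,                // x86 or x86_64
    ramdisk_image_nonzero: bool, // boot_params.hdr.ramdisk_image != 0
    efi_boot: bool,
    efi_initrd_available: bool,  // efi_load_initrd_dev_path() succeeded
    dt_boot: bool,
    dt_has_initrd_start: bool,   // /chosen has linux,initrd-start
}

fn select_initrd_source(e: &BootEnv) -> InitrdSource {
    if e.is_x86 {
        if e.ramdisk_image_nonzero {
            InitrdSource::BootParams
        } else if e.efi_boot && e.efi_initrd_available {
            InitrdSource::EfiLoadFile2
        } else {
            InitrdSource::Absent
        }
    } else if e.efi_boot {
        if e.efi_initrd_available { InitrdSource::EfiLoadFile2 } else { InitrdSource::Absent }
    } else if e.dt_boot && e.dt_has_initrd_start {
        InitrdSource::DeviceTree
    } else {
        InitrdSource::Absent
    }
}
```

Note that on x86 the boot_params path wins even under EFI: a bootloader that fills in ramdisk_image has already done the loading work, so the LoadFile2 path is only a fallback.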
No initramfs is also valid — the kernel falls back to a minimal in-kernel rootfs
(tmpfs) and attempts to find /init from a built-in CPIO archive. If no built-in
CPIO is present and no initramfs was loaded, the kernel panics with a descriptive
message: "No initramfs found and no built-in rootfs — cannot locate /init".
Validation (after loading, regardless of path):
- Verify the initramfs starts with a valid cpio magic: 070701 (newc, no CRC), 070702 (newc, with CRC), or 070707 (odc/binary, legacy). Reject if absent.
- Verify size_bytes > 0 and that the physical range [physical_addr, physical_addr + size_bytes) lies entirely within available RAM (not in reserved regions or MMIO holes). Reject with a boot error if not.
- If IMA is active (Integrity Measurement Architecture, Section 8.3), measure the complete initramfs into PCR 10 before executing any init scripts. This matches the Linux IMA convention, whose default measurement PCR is 10.
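The magic check in the first step can be sketched as a pure function over the blob's leading bytes (the function name is illustrative; the ASCII magics and the old binary format's 16-bit octal magic 070707 = 0x71C7 are standard cpio values, shown here for a little-endian machine):

```rust
/// Check the leading bytes of a loaded initramfs for a recognized cpio magic.
fn cpio_magic_ok(blob: &[u8]) -> bool {
    // Old binary cpio stores the octal magic 070707 as a 16-bit integer
    // in native byte order; this sketch assumes little-endian.
    const BINARY_MAGIC: u16 = 0o070707; // == 0x71C7
    blob.starts_with(b"070701") // newc, no CRC
        || blob.starts_with(b"070702") // newc, with CRC
        || blob.starts_with(b"070707") // odc (portable ASCII), legacy
        || (blob.len() >= 2 && u16::from_le_bytes([blob[0], blob[1]]) == BINARY_MAGIC)
}
```

A failure here aborts the boot path that loaded the blob, since a bad magic usually means the bootloader handed over a truncated or miscompressed image.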
2.1.4 CPU Errata and Microcode
Modern CPUs ship with known errata — hardware bugs documented in vendor errata sheets. UmkaOS handles these systematically rather than scattering workarounds through the codebase.
Early microcode loading — CPU microcode is applied before most kernel initialization, matching the Linux early microcode loading model. The microcode blob is located by scanning the raw initramfs image in physical memory (NOT by mounting the filesystem — initramfs mount happens later at step 4h). Linux uses the same approach: the bootloader provides an uncompressed CPIO archive prepended to the initramfs; the kernel extracts the microcode by parsing the raw CPIO headers in memory at boot.
The microcode update runs between steps 4b (physical memory allocator init) and 4c (virtual memory init):
Boot step (between 4b and 4c): Early microcode update
1. Scan raw initramfs blob in physical memory for microcode CPIO archive
(/lib/firmware/intel-ucode/ or /lib/firmware/amd-ucode/ paths in CPIO)
2. Validate signature (vendor-signed, no user-modifiable microcode)
3. Apply via WRMSR to IA32_BIOS_UPDT_TRIG (Intel) or MSR_AMD64_PATCH_LOADER (AMD)
4. Re-read CPUID — microcode may change feature flags (critical: must happen
before step 4c which uses CPUID to configure page table features)
5. Log applied microcode revision to ring buffer
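Step 1's raw scan can be sketched as a walk over newc ("070701") cpio headers in the in-memory image. This is a dependency-free sketch under stated assumptions: the helper name is invented, only the uncompressed newc format is handled, and hardening against hostile headers is omitted; the field offsets (c_filesize at bytes 54..62, c_namesize at 94..102) and the 4-byte alignment rules are from the newc format.

```rust
/// Walk an uncompressed newc cpio archive in memory, returning the file data
/// of the first entry whose path starts with `prefix` (sketch only).
fn find_cpio_entry<'a>(mut img: &'a [u8], prefix: &str) -> Option<&'a [u8]> {
    // newc header fields are 8 ASCII hex chars each, after the 6-byte magic.
    fn hex8(b: &[u8]) -> Option<usize> {
        usize::from_str_radix(core::str::from_utf8(b).ok()?, 16).ok()
    }
    loop {
        if img.len() < 110 || !img.starts_with(b"070701") {
            return None;
        }
        let filesize = hex8(&img[54..62])?;  // c_filesize
        let namesize = hex8(&img[94..102])?; // c_namesize (includes NUL)
        if namesize == 0 {
            return None;
        }
        let name_end = 110 + namesize;
        let name = core::str::from_utf8(img.get(110..name_end - 1)?).ok()?;
        if name == "TRAILER!!!" {
            return None; // end-of-archive marker
        }
        let data_start = (name_end + 3) & !3; // file data is 4-byte aligned
        if name.starts_with(prefix) {
            return img.get(data_start..data_start + filesize);
        }
        // Next header is 4-byte aligned after the file data.
        img = img.get(((data_start + filesize + 3) & !3)..)?;
    }
}
```

A call like `find_cpio_entry(blob, "lib/firmware/intel-ucode/")` locates the vendor microcode entry without ever touching the filesystem layer, which is exactly why this can run before step 4c.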
Errata database — After microcode loading and CPUID enumeration, UmkaOS consults a per-CPU-model quirk table:
/// CPU errata entry — matches a specific CPU stepping to its required workarounds.
struct CpuErrata {
/// CPU identification (vendor, family, model, stepping range).
match_id: CpuMatch,
/// Human-readable errata identifier (e.g., "SKX003", "ZEN4-ERR-1234").
errata_id: &'static str,
/// Workaround function applied during boot.
workaround: fn() -> Result<()>,
/// Category for boot-parameter override.
category: ErrataCat,
}
enum ErrataCat {
/// MSR write to disable/enable a feature.
MsrTweak,
/// Alternative code path (e.g., retpoline instead of indirect branch).
CodePath,
/// Disable a CPU feature entirely.
FeatureDisable,
}
The quirk table is checked during boot (step 4d, after CPUID). Each matching entry's workaround function is called. Workarounds are logged to the ring buffer.
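The boot-time walk over the quirk table can be sketched as follows. The entry contents, the simplified CpuMatch (family/model only), and the stand-in workaround are all hypothetical; the real CpuErrata and ErrataCat types are shown above and match on the full vendor/family/model/stepping tuple:

```rust
/// Simplified CPU identity for this sketch (the real CpuMatch also carries
/// vendor and a stepping range).
#[derive(Clone, Copy, PartialEq)]
struct CpuMatch { family: u16, model: u16 }

struct Quirk {
    match_id: CpuMatch,
    errata_id: &'static str,
    workaround: fn() -> Result<(), &'static str>,
}

/// Stand-in workaround; a real one would e.g. perform an MSR write.
fn wa_noop() -> Result<(), &'static str> { Ok(()) }

/// Hypothetical table entry: Skylake-X is family 6, model 85.
static QUIRKS: &[Quirk] = &[
    Quirk { match_id: CpuMatch { family: 6, model: 85 }, errata_id: "SKX003", workaround: wa_noop },
];

/// Apply every matching workaround; returns the applied errata IDs
/// (the real kernel logs these to the ring buffer).
fn apply_errata(cpu: CpuMatch) -> Vec<&'static str> {
    let mut applied = Vec::new();
    for q in QUIRKS {
        if q.match_id == cpu && (q.workaround)().is_ok() {
            applied.push(q.errata_id);
        }
    }
    applied
}
```

Keeping the workarounds behind function pointers in one table is what prevents quirks from scattering through the codebase: a new erratum is one new table row.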
Spectre/Meltdown class mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Meltdown (v3) | KPTI (page table isolation) | Required for Tier 2 + userspace; NOT needed for Tier 1 (same ring, MPK isolation) |
| Spectre v1 | LFENCE barriers at bounds checks; Speculative Load Hardening (SLH) | Compiler-inserted SLH (-mllvm -x86-speculative-load-hardening); manual LFENCE in asm hot paths |
| Spectre v2 | Retpoline / IBRS / eIBRS | Retpoline (-C target-feature=+retpoline-indirect-branches) for indirect branches in kernel code; eIBRS preferred on supporting hardware |
| Spectre v4 (SSB) | SSBD (Spec. Store Bypass Disable) | Per-thread via IA32_SPEC_CTRL MSR; toggled on context switch for untrusted threads |
| MDS/TAA | Buffer clears (VERW) | On context switch to userspace; on VM entry/exit |
| SRBDS | Microcode + VERW | Handled by early microcode update |
| RFDS/GDS | Microcode + opt-in VERW | Same as MDS path |
Mitigation boot parameters:
umka.mitigate=auto # Default: apply mitigations based on detected CPU (recommended)
umka.mitigate=on # Force all mitigations on, even if CPU claims to be fixed
umka.mitigate=off # Disable all mitigations (INSECURE — see below)
umka.mitigate.kpti=off # Disable specific mitigation class
umka.mitigate.retpoline=off # Disable specific mitigation class
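Parsing these parameters from the kernel command line can be sketched as a single pass over whitespace-separated tokens (the helper name and return shape are hypothetical; the parameter strings are the ones listed above):

```rust
/// Global mitigation policy selected by umka.mitigate=.
#[derive(Debug, PartialEq, Clone, Copy)]
enum MitigatePolicy { Auto, On, Off }

/// Parse the umka.mitigate family of boot parameters.
/// Returns the global policy plus any individually disabled classes
/// (e.g., "kpti", "retpoline").
fn parse_mitigate(cmdline: &str) -> (MitigatePolicy, Vec<&str>) {
    let mut policy = MitigatePolicy::Auto; // default
    let mut disabled = Vec::new();
    for tok in cmdline.split_whitespace() {
        match tok {
            "umka.mitigate=auto" => policy = MitigatePolicy::Auto,
            "umka.mitigate=on" => policy = MitigatePolicy::On,
            "umka.mitigate=off" => policy = MitigatePolicy::Off,
            t => {
                // umka.mitigate.<class>=off disables one mitigation class.
                if let Some(rest) = t.strip_prefix("umka.mitigate.") {
                    if let Some(class) = rest.strip_suffix("=off") {
                        disabled.push(class);
                    }
                }
            }
        }
    }
    (policy, disabled)
}
```

Later tokens win over earlier ones, matching the usual last-writer-wins convention for kernel command lines.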
Performance impact of mitigations:
The cumulative overhead of speculative execution mitigations is substantial — typically 5-30% depending on workload characteristics:
| Mitigation | Overhead | Worst-case workload |
|---|---|---|
| KPTI | ~5% syscall-heavy; ~100-200ns per user↔kernel transition | Database OLTP (millions of syscalls/sec) |
| Retpoline / eIBRS | ~2-10% | Indirect-branch-heavy code (virtual dispatch, interpreters) |
| SSBD | ~1-5% | Memory-intensive with store-to-load forwarding |
| MDS VERW | ~1-3% on context switch | Frequent user↔kernel transitions |
| Cumulative | 5-30% | Syscall-heavy + indirect-branch-heavy (databases, VMs) |
umka.mitigate=off is legitimate for:
- Air-gapped HPC clusters where all code is trusted and no untrusted workloads run.
- Benchmarking to isolate application performance from mitigation overhead.
- Single-tenant bare-metal where the threat model excludes local attackers.
- Nested within a trusted VM where the host hypervisor enforces mitigations at
the outer boundary (the guest's mitigations are redundant).
The kernel logs a prominent boot warning when mitigations are disabled:
umka: WARNING — speculative execution mitigations DISABLED (umka.mitigate=off).
This system is vulnerable to Spectre, Meltdown, MDS, and related attacks.
Do NOT use in multi-tenant or untrusted environments.
Interaction with umka.isolation=performance: When umka.isolation=performance is
set (promoting Tier 1 drivers to Tier 0, disabling CPU-side isolation), the admin has
already accepted a reduced security posture. Combining umka.isolation=performance with
umka.mitigate=off provides the maximum performance envelope — no isolation overhead,
no mitigation overhead — but should be limited to environments where all executing code
is fully trusted. The two settings are independent; either can be set alone.
Runtime reporting — Vulnerability status is exposed via Linux-compatible sysfs:
/sys/devices/system/cpu/vulnerabilities/meltdown: "Mitigation: PTI"
/sys/devices/system/cpu/vulnerabilities/spectre_v1: "Mitigation: usercopy/LFENCE"
/sys/devices/system/cpu/vulnerabilities/spectre_v2: "Mitigation: eIBRS"
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass: "Mitigation: SSBD"
/sys/devices/system/cpu/vulnerabilities/mds: "Mitigation: Clear buffers"
This ensures monitoring tools (spectre-meltdown-checker, lynis) work without modification.
2.1.5 Speculation Mitigations (All Architectures)
The x86-specific mitigation table in Section 2.1.4 covers only one architecture. Here is the complete per-architecture mitigation matrix:
AArch64 mitigations:
| Vulnerability | ARM Identifier | Mitigation | UmkaOS scope |
|---|---|---|---|
| Spectre v1 (bounds bypass) | — | CSDB barriers at bounds checks | Compiler-inserted CSDB barriers after conditional branches (ARM equivalent of x86 SLH; uses CSDB instruction, not LLVM's x86-specific -x86-speculative-load-hardening pass) |
| Spectre v2 (BTI) | CVE-2017-5715 | BTI (Branch Target Identification) | Hardware BTI (ARMv8.5+): enabled via SCTLR_EL1.BT1. Software: SMCCC ARCH_WORKAROUND_1 firmware call |
| Spectre-BHB | CVE-2022-23960 | BHB clearing sequence or firmware call | SMCCC ARCH_WORKAROUND_3 or BHB clearing loop on context switch |
| Meltdown (v3) | CVE-2017-5754 | KPTI (separate EL0/EL1 page tables) | Full KPTI required on Cortex-A75 (all revisions) per ARM security bulletins. NOT needed on cores reporting CSV3 across all revisions (Cortex-A76, Cortex-A78, Cortex-X1, Cortex-X2, Cortex-A710, Cortex-A715, and other Armv9/v8.x CSV3 cores) or on earlier in-order cores (A53, A55, etc.). Cortex-A510 (all revisions) and Cortex-A520 (prior to r0p2 only; r0p2+ is not affected) are classified as Variant 3 by ARM (errata 3117295 and 2966298 respectively), but the actual issue in both is a speculative unprivileged load whose workaround is a TLBI instruction before returning to EL0, not full KPTI page table splitting; UmkaOS applies this lightweight TLBI mitigation for A510 and A520 rather than the heavyweight page-table split used for Cortex-A75. |
| Spectre v4 (SSB) | CVE-2018-3639 | SSBS (Speculative Store Bypass Safe) | Hardware SSBS bit (ARMv8.5+): per-thread via PSTATE.SSBS. Software: SMCCC ARCH_WORKAROUND_2 |
| Straight-line speculation | — | SB instruction after branches | Compiler-inserted speculation barriers |
ARM firmware interface: Unlike x86 (which uses MSR writes), ARM mitigations are
often applied through SMCCC (SMC Calling Convention) firmware calls to EL3 Secure
Monitor code. The kernel calls ARCH_WORKAROUND_1/2/3 — the firmware applies the
actual mitigation. This is architecturally cleaner (firmware knows the exact CPU
revision) but adds ~100-200 cycles per SMCCC call.
ARMv7 mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | CSDB barriers at bounds checks | Same as AArch64 |
| Spectre v2 | Firmware workaround via SMCCC | ARCH_WORKAROUND_1 for affected Cortex-A cores |
| Meltdown | Not applicable | ARMv7 Cortex-A cores are not affected |
| Spectre v4 | Firmware workaround | ARCH_WORKAROUND_2 where supported |
RISC-V mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | FENCE instructions at bounds checks | Manual insertion in assembly; compiler support evolving |
| Spectre v2 | Vendor-specific | SiFive: FENCE.I after indirect branches. Other vendors: per-implementation |
| Meltdown | Not applicable | In-order RISC-V cores not affected; OoO cores (e.g., SiFive P670) may need KPTI |
| Spectre v4 | Vendor-specific | No standard RISC-V mitigation; per-vendor microarchitecture |
RISC-V status: Speculation mitigations on RISC-V are less mature than x86 or ARM.
The RISC-V CFI extensions Zicfiss (shadow stacks) and Zicfilp (landing pads) are
ratified as standalone ISA extensions, separate from the base privileged
specification. UmkaOS implements both when the hardware
reports support via the Zicfiss and Zicfilp ISA string entries. UmkaOS also applies
vendor-specific workarounds based on mvendorid/marchid from the device tree, similar
to the x86 errata database approach.
PowerPC mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | ori 31,31,0 (speculation barrier) | Inserted at bounds checks in assembly |
| Spectre v2 | Count Cache Flush + link stack flush | POWER8/9: bcctr flush sequence; POWER10: hardware mitigation |
| Meltdown | RFI flush (L1D cache flush) | POWER7+: flush on return from interrupt via rfid/hrfid |
| Spectre v4 | STF (Store Thread Forwarding) barrier | ori 31,31,0 barrier; POWER9+ firmware toggle |
PowerPC status: IBM POWER processors have well-documented mitigations managed via firmware (skiboot/OPAL) and kernel runtime patches. POWER10 includes hardware mitigations for most Spectre variants. PPC32 embedded cores (e500, 440) are generally in-order and not affected by speculative execution vulnerabilities. UmkaOS applies mitigations based on PVR (Processor Version Register) from the device tree.
Runtime reporting (all architectures) — The Linux-compatible sysfs interface
(/sys/devices/system/cpu/vulnerabilities/) is populated on all architectures with
architecture-appropriate mitigation status strings.
2.1.6 Dual-Boot Safety
- UmkaOS never modifies the existing Linux kernel installation.
- GRUB is configured with both kernels; the default can be set by the user.
- If UmkaOS fails to boot, the user selects the Linux kernel from GRUB.
- A "last known good" mechanism records successful boots and can auto-revert.
2.1.7 Boot Protocol Migration Path
The boot architecture evolves through four phases, each building on the previous:
Phase 1 — Multiboot1 (current). GRUB loads the ELF via multiboot command.
QEMU loads directly with -kernel. Memory map from Multiboot1 info structure.
Sufficient for all kernel development and QEMU-based testing.
Phase 2 — Multiboot2 full parser. Parse Multiboot2 tags to access richer boot
information: ACPI RSDP pointer, EFI memory map, framebuffer info, boot services
tag. This enables ACPI table parsing and EFI runtime services without changing the
bootloader. GRUB2 already supports the multiboot2 command.
Phase 3 — UEFI stub boot. Add a PE/COFF header stub to the kernel image (similar
to Linux EFISTUB). UEFI firmware requires PE/COFF executables, not ELF — the stub
header makes the kernel image a valid PE/COFF binary that UEFI can load directly.
The actual kernel code remains ELF internally; the PE/COFF header is a thin wrapper
(like Linux's header.S which embeds a PE/COFF header in the bzImage). The kernel
becomes directly bootable from UEFI firmware without GRUB — efibootmgr can register
it. Use EFI boot services for memory map and GOP framebuffer, then call
ExitBootServices() before entering the kernel proper. systemd-boot and other
UEFI-native boot managers work at this stage.
Phase 4 — Linux boot protocol. Implement the x86 Linux boot protocol
(struct boot_params at 0x10000). This makes the UmkaOS kernel loadable by any
Linux-compatible bootloader. Combined with a standard /boot layout and initramfs,
this enables the drop-in package installation described in Section 2.1.3.1. This is the
final production boot target.
2.1.8 Secure Boot and Measured Boot
Secure Boot and Measured Boot are kernel-level boot-phase concerns. They apply equally to servers (enterprise attestation, confidential computing), cloud instances (vTPM-based instance identity), and consumer devices (UEFI Secure Boot for firmware lockdown). Neither feature is consumer-specific.
2.1.8.1 UEFI Secure Boot
UEFI Secure Boot enforces a chain of trust starting in firmware: the UEFI db (allowed signature database) and dbx (revocation list) are stored in firmware NVRAM. Every executable in the boot path (bootloader, shim, kernel) must be signed by a key in the db.
Deployment models:
| Model | Chain | When used |
|---|---|---|
| Shim + GRUB | Microsoft UEFI CA → shim (signed by MS) → GRUB (signed by distro) → kernel (signed by distro) | Default for distros shipping via OEM |
| UEFI direct | Custom key enrolled in db → kernel PE/COFF (signed by UmkaOS key) | Self-managed servers, custom deployments |
| Unsigned (disabled) | No verification | Development hardware, QEMU |
UmkaOS requires Phase 3 (UEFI stub, Section 2.1.7) before Secure Boot can be supported.
The kernel image must be a valid PE/COFF binary for UEFI to verify its
signature before loading. The build system produces a signed image via
sbsign --key umka-signing.key --cert umka-signing.crt umka-kernel.efi.
Kernel module signing: Once the kernel is Secure Boot-booted, all kernel modules (Tier 1 drivers) must also be signed. Unsigned modules are rejected. The module signing key is separate from the UEFI boot key. The build system embeds the module signing public key in the kernel image; drivers are signed with the corresponding private key during the build.
UEFI Secure Boot state: The kernel reads the UEFI SecureBoot variable
from EFI runtime services at boot and records it in a read-only kernel
parameter. Userspace can query via /sys/firmware/efi/efivars/SecureBoot-*.
This affects policy decisions (e.g., CAP_SYS_MODULE behaviour).
2.1.8.1.1 Key Compromise Recovery
If the UmkaOS signing key is compromised, three coordinated actions are required: updating the UEFI revocation list (dbx), rotating the signing key, and migrating TPM-sealed secrets to a new PCR state. This subsection specifies each step precisely.
dbx Update Path
The UEFI forbidden signature database (dbx) is stored in EFI NVRAM and contains
hashes or certificate thumbprints of revoked images and keys. Updates are delivered
as signed UEFI authenticated variables:
- Delivery mechanism: a signed UEFI capsule image (EFI_FIRMWARE_IMAGE_PROTOCOL, GUID 6dcbd5ed-e82d-4c44-bda1-7194199ad92a) deposited either via the EFI_UPDATE_CAPSULE runtime service or as a file at /EFI/UpdateCapsule/<GUID>.bin on the EFI System Partition. The firmware processes the capsule before ExitBootServices() on the next boot.
- Authentication: the capsule is authenticated by the firmware using the Platform Key (PK) or Key Exchange Key (KEK) chain already enrolled in NVRAM. A dbx capsule signed by the KEK — and delivered through a signed distro update package — requires no additional user interaction.
- Early kernel verification: after the firmware has applied the new dbx and before entering the UmkaOS boot stub, UEFI re-verifies every image in the boot chain against the updated dbx. If the running kernel image's hash or signing certificate is now in the dbx, UEFI aborts the boot and presents an error to the user. The kernel never reaches umka_main() in this case; the revocation check fires before ExitBootServices().
- EFI event log: the firmware records the dbx update in the TCG EFI Platform Specification event log (EV_EFI_VARIABLE_AUTHORITY entry, PCR 7). The kernel reads this log during early initialization and forwards the entry to the IMA audit log, creating a durable, ordered record that dbx was updated.
Key Rotation Protocol
The UmkaOS signing key is an ML-DSA-65 + Ed25519 hybrid key pair. Key rotation proceeds through five steps:
1. Generate the new key pair. Create a new ML-DSA-65 + Ed25519 hybrid signing key pair in a hardware security module (HSM). The HSM never exports the private key material.
2. Enroll the new public key in db. Submit the new public key certificate to the UEFI db (allowed signature database) via a KEK-authenticated variable update — the same delivery mechanism as the dbx capsule described above. The update is deployed through the normal distro package-management pipeline (e.g., as a fwupd plugin or a signed distro package writing to /EFI/UpdateCapsule/). After the update applies, both the old key and the new key are accepted by UEFI.
3. Dual-signing period — minimum 30 days. Every kernel release during this period is signed with BOTH the old key and the new key. A dual-signed image satisfies any UEFI db that contains either key. This covers:
   - Existing systems that have not yet received the new db enrollment.
   - Systems that received the enrollment but whose db update failed to apply (e.g., NVRAM full, firmware bug).
   The 30-day window gives the db enrollment sufficient time to propagate to all deployed systems via normal OS update channels.
4. Revoke the old key. After the dual-signing period ends, add the old signing certificate's SHA-256 hash to dbx via a KEK-signed capsule update. From this point, images signed only with the old key are rejected. Dual-signed images (carrying the new key's signature as well) continue to boot.
5. Out-of-band recovery media. Prepare a USB recovery drive containing:
   - The new public key certificate in DER format.
   - A signed db update capsule that adds the new key.
   - Instructions for manual enrollment via the UEFI setup utility.
   This drive is used on systems that missed the automatic enrollment (e.g., systems offline during the update window, air-gapped systems).
PCR Extension for the New Key
Standard UEFI Secure Boot behavior extends PCR 7 with the hash of each certificate used to verify an image (EV_EFI_VARIABLE_AUTHORITY events). When a kernel signed with the new key boots for the first time, UEFI extends PCR 7 with the new signing certificate hash. The PCR 7 value changes, which breaks TPM-sealed secrets (such as disk encryption keys) that were sealed with a policy referencing the old PCR 7 value.
Migration path — applied before the old key is revoked (Step 4 above):
1. Compute the expected new PCR 7 value:

   PCR7_new = SHA256(PCR7_old || SHA256(new_signing_cert_der))

   This can be computed offline from the current PCR 7 value and the new certificate, without rebooting.
2. Re-seal secrets with a dual policy. Unseal each secret under the existing policy (PCR 7 = PCR7_old), then re-seal with a PolicyOR policy that accepts either the old or new PCR 7 value:

   PolicyOR(
       PolicyPCR(PCR7_old, pcr_selection = {PCR7}),
       PolicyPCR(PCR7_new, pcr_selection = {PCR7})
   )

   The re-sealed blob can be unsealed on systems booting with either key during the dual-signing period.
3. After key rotation completes. Once the old key is in dbx and all systems boot only with the new key, re-seal secrets one final time with a single policy referencing only PCR7_new. This drops the fallback to the old PCR 7 value, producing a tighter policy for the steady state.
The migration (Steps 2 and 3) is performed by a userspace tool
(umka-tpm-reseal) that runs as a systemd oneshot service during the transition
window. The service is activated by detecting a new db entry for the UmkaOS
signing key in the EFI event log.
2.1.8.2 Measured Boot (TPM PCR Chain)
Measured Boot extends a TPM Platform Configuration Register (PCR) with a cryptographic hash at each step of the boot chain. PCRs are append-only (extend = SHA256(current PCR value || new measurement)); they cannot be reset without rebooting. A remote attestation verifier can reconstruct the expected PCR values from the known firmware/bootloader/kernel and check that the running system matches.
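The extend operation is a chained hash. A dependency-free sketch, generic over the hash function so it stays self-contained — production code uses SHA-256 here, and the toy hash below is an explicitly non-cryptographic stand-in invented for this example:

```rust
/// One PCR extend: new = H(current || measurement). PCRs are append-only
/// because this chained hash is the only way to change them.
fn pcr_extend_value(
    current: [u8; 32],
    measurement: [u8; 32],
    hash: fn(&[u8]) -> [u8; 32],
) -> [u8; 32] {
    let mut buf = [0u8; 64];
    buf[..32].copy_from_slice(&current);
    buf[32..].copy_from_slice(&measurement);
    hash(&buf)
}

/// Stand-in hash so the sketch compiles without a crypto crate.
/// NOT cryptographic — a real implementation uses SHA-256.
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut state: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    let mut out = [0u8; 32];
    for (i, &b) in data.iter().enumerate() {
        state = (state ^ b as u64).wrapping_mul(0x100_0000_01b3); // FNV-1a prime
        out[i % 32] ^= (state >> 24) as u8;
    }
    out
}
```

Because each extend hashes the previous value, the final PCR depends on both the set and the order of measurements; a remote verifier therefore replays the event log entry by entry to reconstruct the expected value.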
Standard x86 PCR assignment (UEFI + Linux convention, which UmkaOS follows):
| PCR | What is measured |
|---|---|
| 0 | UEFI firmware code and configuration |
| 1 | UEFI firmware data (platform config) |
| 2 | Option ROM code |
| 3 | Option ROM data |
| 4 | Boot manager code (GRUB/shim) |
| 5 | Boot manager data + GPT partition table |
| 6 | Resume from hibernate |
| 7 | Secure Boot policy (db, dbx, PK, KEK state) |
| 8 | GRUB command line |
| 9 | Kernel image (bzImage/UmkaOS kernel PE/COFF) |
| 10 | initramfs |
| 11 | Kernel command line |
| 12 | UmkaOS: Tier 1 driver measurements (Section 8.3.1, 08-security.md) + IMA policy keys |
| 13–15 | Available for OS/application use |
The kernel extends PCR 9 with its own image hash during early boot (before
ExitBootServices() on UEFI paths, or via GRUB's tpm module on Multiboot
paths). PCR 10 is extended with the initramfs hash. PCR 11 is extended with
the kernel command line.
TPM interface: The kernel accesses the TPM via the TPM CRB (Command
Response Buffer, TPM 2.0 mandatory interface) or TPM TIS (legacy 1.2 / 2.0
FIFO interface). The driver is Tier 1, ACPI-probed (MSFT0101 or MSFT0200).
// umka-core/src/tpm/mod.rs
/// TPM 2.0 PCR Extend command.
/// Extends the given PCR with SHA-256(current || digest).
pub fn pcr_extend(pcr_index: u32, digest: &[u8; 32]) -> Result<(), TpmError>;
/// Read back the current value of a PCR.
pub fn pcr_read(pcr_index: u32) -> Result<[u8; 32], TpmError>;
/// Seal a secret to the current PCR state.
/// Returns a TPM2B_PUBLIC + TPM2B_PRIVATE blob.
/// The secret can only be unsealed if PCRs match the policy at seal time.
pub fn seal(pcr_policy: &PcrPolicy, secret: &[u8]) -> Result<SealedBlob, TpmError>;
/// Unseal a blob previously created by seal().
/// Fails if any PCR in the policy has changed since sealing.
pub fn unseal(blob: &SealedBlob) -> Result<Vec<u8>, TpmError>;
Disk encryption integration: seal() is the mechanism for TPM-bound disk
encryption keys (equivalent to Linux's systemd-cryptenroll / clevis TPM2 enrollment).
The disk encryption key is sealed to a PCR policy covering PCRs 0, 4, 7, 9,
11 (firmware + Secure Boot policy + kernel + cmdline). Any modification to the
boot chain (new kernel, changed cmdline, disabled Secure Boot) causes unseal
to fail, prompting for a recovery passphrase.
Confidential computing intersection: On confidential VM platforms (AMD SEV-SNP, Intel TDX, ARM CCA), the TPM is replaced by a virtual TPM whose root of trust is the hardware attestation report (VCEK certificate, TD quote, Realm Attestation Token). The PCR-based measured boot model is the same; the trust root is the hardware VM isolation guarantee rather than a physical TPM chip. Section 5.1 covers the distributed/confidential computing architecture.
2.1.8.3 Kernel Responsibilities Summary
| Responsibility | In kernel? | Notes |
|---|---|---|
| Kernel image signing | Build-time | sbsign in build system |
| Module signing verification | Yes | Enforced when Secure Boot active |
| PCR extension (kernel + cmdline) | Yes | Early boot, before driver init |
| TPM driver (CRB/TIS) | Yes | Tier 1, ACPI-probed |
| seal() / unseal() API | Yes | Exposed to userspace via ioctl |
| Key management policy | No | Userspace (systemd-cryptenroll, clevis) |
| Remote attestation protocol | No | Userspace (keylime, MAA agent) |
| Boot graphics, splash screen | No | Bootloader/compositor |
| Dual-boot chainloading | No | Bootloader (GRUB) |
2.1.9 UEFI Runtime Services
After ExitBootServices(), the UEFI Boot Services memory map is invalidated and
all Boot Services (memory allocation, protocol interfaces, etc.) are gone. However,
a distinct set of UEFI Runtime Services remains accessible for the life of the
running OS. These services operate on a virtual address mapping that the kernel
establishes during boot via SetVirtualAddressMap().
2.1.9.1 Virtual Address Mapping
Before calling ExitBootServices(), the kernel enumerates the UEFI memory map and
identifies all regions with the EFI_MEMORY_RUNTIME attribute. These regions are
mapped into a dedicated kernel virtual address range (EFI_RUNTIME_VA_BASE,
architecture-specific) using normal kernel page table entries. The mapping must
preserve the relative offsets between firmware-runtime regions exactly as the
firmware expects.
The kernel then calls SetVirtualAddressMap(map_size, descriptor_size,
descriptor_version, virtual_map) once, passing the updated descriptors with the
new virtual base addresses. After this call returns, all UEFI runtime service
pointers stored in the EFI System Table are updated to use the new virtual
addresses. The physical EFI System Table address is preserved separately so the
kernel can locate it after the mapping call.
/// Handle to UEFI runtime services, valid after ExitBootServices().
pub struct EfiRuntime {
/// Physical address of EFI System Table, preserved across ExitBootServices().
pub system_table_pa: PhysAddr,
/// Virtual address of EFI Runtime Services table, after SetVirtualAddressMap().
pub runtime_services: *const EfiRuntimeServices,
/// Whether runtime services are available (false if firmware is broken or
/// SetVirtualAddressMap() failed).
pub available: bool,
/// Serializes all EFI runtime calls. UEFI firmware is not reentrant.
pub lock: SpinLock<()>,
}
All accesses to EfiRuntime hold EfiRuntime::lock and execute with interrupts
disabled. UEFI firmware is documented as non-reentrant; concurrent calls from
different CPUs or from an IRQ handler preempting a runtime call both produce
undefined behavior.
2.1.9.2 NVRAM (EFI Variables)
EFI variables are named byte arrays stored in firmware NVRAM. They persist across reboots and are accessed by name (UTF-16 string) and vendor GUID. Variables have attribute flags controlling persistence and visibility:
- EFI_VARIABLE_NON_VOLATILE (bit 0): persists across power cycles.
- EFI_VARIABLE_BOOTSERVICE_ACCESS (bit 1): accessible during Boot Services.
- EFI_VARIABLE_RUNTIME_ACCESS (bit 2): accessible after ExitBootServices().
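The attribute bits translate directly into constants and a visibility check. The bit values follow the UEFI Variable Services definitions; the helper names are illustrative:

```rust
/// EFI variable attribute bits (UEFI spec, Variable Services).
const EFI_VARIABLE_NON_VOLATILE: u32 = 1 << 0;
const EFI_VARIABLE_BOOTSERVICE_ACCESS: u32 = 1 << 1;
const EFI_VARIABLE_RUNTIME_ACCESS: u32 = 1 << 2;

/// A variable is visible to the running OS only if it carries RUNTIME_ACCESS;
/// after ExitBootServices() the firmware hides boot-services-only variables.
fn runtime_visible(attrs: u32) -> bool {
    attrs & EFI_VARIABLE_RUNTIME_ACCESS != 0
}

/// Attribute set for a persistent variable the OS can read and write
/// (e.g., BootOrder).
const PERSISTENT_RUNTIME: u32 =
    EFI_VARIABLE_NON_VOLATILE | EFI_VARIABLE_BOOTSERVICE_ACCESS | EFI_VARIABLE_RUNTIME_ACCESS;
```

Note that the UEFI specification requires RUNTIME_ACCESS variables to also have BOOTSERVICE_ACCESS, which is why the combined constant sets all three bits.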
UmkaOS wraps the UEFI variable services with interrupt-disabled, locked calls:
/// Read a UEFI variable by name and GUID.
///
/// Returns the variable data on success, or an `EfiStatus` error code.
/// Common errors: `EFI_NOT_FOUND` (variable absent), `EFI_BUFFER_TOO_SMALL`
/// (internal — handled by the wrapper via a two-pass size query).
pub fn efi_get_variable(
name: &UcsStr,
guid: &EfiGuid,
) -> Result<(Vec<u8>, u32 /* attributes */), EfiStatus>;
/// Write or delete a UEFI variable.
///
/// Pass `data = &[]` with `attrs = 0` to delete an existing variable.
/// Authenticated variables (e.g., db, dbx) require a signed payload structure
/// in `data`; the firmware validates the signature before writing.
pub fn efi_set_variable(
name: &UcsStr,
guid: &EfiGuid,
attrs: u32,
data: &[u8],
) -> Result<(), EfiStatus>;
Uses by the kernel:
- Reading the `SecureBoot` variable (GUID `{8be4df61-...}`) to determine whether Secure Boot is active (see Section 2.1.8.1).
- Reading and writing the `BootOrder` and `Boot####` variables to manage UEFI boot entries (used by `umka-efibootmgr`, a userspace tool that delegates to the kernel via an ioctl).
- Delivering `db`/`dbx` updates as authenticated variable writes during the key compromise recovery process (see Section 2.1.8.1.1).
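The Secure Boot check reduces to interpreting the variable payload: per the UEFI specification, `SecureBoot` is a single byte where 1 means enforcement is active. A minimal sketch (the real code calls `efi_get_variable` and treats `EFI_NOT_FOUND` as "not active"; the function name here is illustrative):

```rust
/// Interpret the payload of the UEFI `SecureBoot` variable.
/// One byte: 1 = Secure Boot active, 0 = inactive.
/// An absent variable (EFI_NOT_FOUND) also means inactive.
fn secure_boot_active(payload: Option<&[u8]>) -> bool {
    match payload {
        Some([1, ..]) => true, // first byte is 1: enforcement on
        _ => false,            // 0, empty, malformed, or variable absent
    }
}
```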
NVRAM wear: EFI NVRAM has limited write endurance (typically 100,000 to 1,000,000 cycles depending on the flash technology). The kernel must not write EFI variables at high frequency. Policy variables, boot configuration, and security databases are the intended use; per-boot or per-minute writes are acceptable; per-second writes are not.
2.1.9.3 Time Services
UEFI provides GetTime(time, capabilities) and SetTime(time) for wall-clock time
access, and GetWakeupTime/SetWakeupTime for ACPI alarm-based resume.
UmkaOS calls EFI GetTime() exactly once: during early boot (between Phase 2 and Phase 3 of the x86-64 initialization sequence) to read the hardware RTC and initialize the kernel wall clock. All subsequent timekeeping uses hardware-direct paths:
- x86-64: HPET, TSC, LAPIC timer via direct MMIO and MSR reads.
- AArch64: ARM Generic Timer (`CNTPCT_EL0`, `CNTFRQ_EL0`) via system registers.
- RISC-V: `rdtime` pseudo-instruction, frequency from Device Tree.
- PPC32/PPC64LE: Timebase register (`mftb`) and decrementer SPR.
This avoids the serialization cost of EfiRuntime::lock on the timekeeping hot
path. EFI SetTime() is called when the user updates the wall clock (e.g., via
adjtimex(2) or settimeofday(2)) to propagate the change back to the hardware
RTC.
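The GetTime()-to-wall-clock conversion is plain calendar arithmetic: an `EFI_TIME` (year/month/day, hour/minute/second, UTC) becomes Unix seconds via a civil-date-to-days computation. A sketch using the standard algorithm (function names illustrative, not the UmkaOS API):

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (Howard Hinnant's civil-date algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y - era * 400;                    // year of era [0, 399]
    let mp = if m > 2 { m - 3 } else { m + 9 }; // March-based month [0, 11]
    let doy = (153 * mp + 2) / 5 + d - 1;       // day of year [0, 365]
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146097 + doe - 719468                 // shift epoch to 1970-01-01
}

/// Convert an RTC reading (UTC) to Unix seconds. Illustrative signature.
fn rtc_to_unix(y: i64, mo: i64, d: i64, h: i64, mi: i64, s: i64) -> i64 {
    days_from_civil(y, mo, d) * 86_400 + h * 3_600 + mi * 60 + s
}
```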
2.1.9.4 Reset and Shutdown
ResetSystem(type, status, data_size, data) is the UEFI-standard mechanism for
system reset and shutdown. The type field is one of:
| Type | Value | Semantics |
|---|---|---|
| `EfiResetCold` | 0 | Full hardware reset, re-runs POST. |
| `EfiResetWarm` | 1 | Warm reset without POST (where supported by platform). |
| `EfiResetShutdown` | 2 | Power off via ACPI S5 state. |
| `EfiResetPlatformSpecific` | 3 | Vendor-defined reset type identified by GUID in `data`. |
UmkaOS maps Linux reboot syscall commands to EFI reset types as follows:
| `reboot(2)` command | UEFI call |
|---|---|
| `LINUX_REBOOT_CMD_RESTART` | `EfiResetCold` |
| `LINUX_REBOOT_CMD_POWER_OFF` | `EfiResetShutdown` |
| `LINUX_REBOOT_CMD_HALT` | `EfiResetShutdown` (processor halt before calling) |
| `LINUX_REBOOT_CMD_RESTART2` (with command string) | `EfiResetPlatformSpecific` with distro-specific GUID |
Fallback path when EFI runtime is unavailable. If EfiRuntime::available is
false (non-UEFI boot, firmware bug, or SetVirtualAddressMap() failure), UmkaOS
falls back to ACPI-direct paths:
- Shutdown: write the ACPI sleep type for S5 (from the `\_S5` object in the DSDT) to the PM1a Control Register (`PM1a_CNT`), setting the `SLP_EN` bit.
- Reset: write `0x06` to I/O port `0xCF9` (the chipset reset control register, widely supported on x86 platforms), or use `ACPI_RESET_REG` if defined in the FADT.
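The choice between the EFI path and the ACPI-direct fallback can be sketched as a small dispatch. Names here are hypothetical; the real code issues the `ResetSystem()` call, the `PM1a_CNT` write, or the port `0xCF9` write at the end of each arm.

```rust
/// Which mechanism the kernel will use for a reset/shutdown request (sketch).
#[derive(Debug, PartialEq)]
enum ResetPath {
    EfiResetSystem, // EfiRuntime available: use ResetSystem()
    AcpiS5,         // shutdown fallback: S5 sleep type + SLP_EN into PM1a_CNT
    Port0xCF9,      // reset fallback: write 0x06 to I/O port 0xCF9
}

/// `want_shutdown`: true for power-off, false for reboot.
/// `efi_available`: mirrors EfiRuntime::available.
fn pick_reset_path(want_shutdown: bool, efi_available: bool) -> ResetPath {
    if efi_available {
        ResetPath::EfiResetSystem
    } else if want_shutdown {
        ResetPath::AcpiS5
    } else {
        ResetPath::Port0xCF9
    }
}
```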
2.2 First-Class Architectures
UmkaOS targets six architectures as first-class citizens. All six receive equal design consideration, CI testing, and performance optimization.
| Architecture | Status | Isolation mechanism | Notes |
|---|---|---|---|
| x86-64 | Primary dev target | Intel MPK (WRPKRU) | Most mature, widest hardware |
| aarch64 | First-class, day one | POE (ARMv8.9+) / page-table fallback | ARM servers, Apple Silicon (VM) |
| armv7 | First-class, day one | DACR memory domains | Embedded, IoT, Raspberry Pi |
| riscv64 | First-class, day one | Page-table based | Emerging server/embedded platform |
| ppc32 | First-class, day one | Segment registers / page-table based | Embedded PowerPC, AmigaOne, networking appliances |
| ppc64le | First-class, day one | HPT / Radix MMU / page-table based | POWER servers, IBM POWER8/9/10, Raptor Talos II |
2.2.1 Architecture-Specific Code
Architecture-specific code is isolated under arch/ and umka-core/src/arch/:
- Boot code: Rust and assembly, per-architecture
- Syscall entry/exit: Assembly stubs
- Context switch: Assembly (register save/restore)
- Interrupt dispatch: Assembly stubs into Rust handlers
- vDSO: Per-architecture user-accessible pages (see Section 2.2.1.1)
- MPK / isolation primitives: Abstracted behind a common `IsolationDomain` trait
2.2.1.1 vDSO (Virtual Dynamic Shared Object)
The vDSO is a small ELF shared library that the kernel maps into every user process's address space at process creation. It provides fast userspace implementations of a small set of syscalls that can be answered without entering the kernel — specifically time-related syscalls — by reading kernel-maintained data from a shared page (the VVAR page).
Why the vDSO matters for performance: clock_gettime(CLOCK_MONOTONIC) is
called millions of times per second in high-performance workloads (databases, gRPC,
event loops). A kernel entry costs 100-300 ns on x86-64 with KPTI. The vDSO path
costs ~5-20 ns — a 10-30x speedup. UmkaOS implements the Linux-compatible vDSO ABI
so that existing glibc, musl, and uclibc-ng builds use the fast path automatically,
with no changes to userspace.
Virtual address layout per process (x86-64 example, above stack, ASLR-randomized):
high address
[vdso ELF] 1-4 pages, PROT_READ|PROT_EXEC — contains function code
[vvar page] 1 page (4 KB), PROT_READ — kernel writes, userspace reads
low address
The VVAR page is mapped immediately below the vDSO ELF. Its address is derived by
the vDSO code using a fixed negative offset from the vDSO load address (computed
by the linker script). The kernel communicates the VVAR page address to userspace
via the ELF auxiliary vector (AT_SYSINFO_EHDR points to the vDSO ELF base).
VVAR Page Layout:
/// Kernel-maintained data page shared with userspace for vDSO fast paths.
/// The kernel writes this page using a seqlock protocol; the vDSO reads it
/// without kernel entry.
///
/// This page is mapped read-only into every user process. The kernel maps it
/// read-write in kernel virtual address space only.
///
/// The layout is fixed ABI: the vDSO ELF references fields at fixed offsets.
/// Adding new fields must not change existing field offsets.
#[repr(C, align(4096))]
pub struct VvarPage {
/// Seqlock sequence counter.
/// Invariant: odd = kernel write in progress; even = data is stable.
/// The vDSO reads this before and after reading data fields; if the
/// value changes or is odd, it retries from the beginning.
pub seq: AtomicU32,
pub _pad_seq: u32,
/// CLOCK_REALTIME: seconds since Unix epoch (TAI - leap seconds).
pub clock_realtime_sec: u64,
/// CLOCK_REALTIME: nanoseconds within the current second (0..999_999_999).
pub clock_realtime_nsec: u32,
pub _pad_rt: u32,
/// CLOCK_MONOTONIC: nanoseconds since kernel boot (never steps backward).
pub clock_monotonic_ns: u64,
/// TSC-to-nanoseconds conversion multiplier.
/// Formula: ns_delta = (tsc_delta * tsc_to_ns_mul) >> tsc_to_ns_shift
/// Valid only when the hardware TSC is stable (invariant TSC required).
/// Zero means TSC is not usable; fall back to a syscall.
pub tsc_to_ns_mul: u32,
/// TSC-to-nanoseconds conversion shift (see tsc_to_ns_mul).
pub tsc_to_ns_shift: u32,
/// TSC value at the time of the last VVAR update.
pub tsc_base: u64,
/// Timezone offset in minutes west of UTC (matches `struct timezone.tz_minuteswest`).
pub tz_minuteswest: i32,
/// DST correction type (matches `struct timezone.tz_dsttime`; always 0 in practice).
pub tz_dsttime: i32,
/// Architecture-specific counter base for non-TSC paths.
/// x86-64: unused (TSC used directly).
/// AArch64: CNTVCT_EL0 value at last update.
/// RISC-V: `rdtime` value at last update.
pub arch_counter_base: u64,
/// Architecture-specific counter frequency (Hz).
/// x86-64: TSC frequency.
/// AArch64: CNTFRQ_EL0.
/// RISC-V: timer-frequency from Device Tree.
pub arch_counter_freq_hz: u64,
/// Per-CPU snapshot of the current CPU index (for __vdso_getcpu).
/// Updated on each scheduler tick. Not cycle-precise; approximate is acceptable.
pub cpu_id: u32,
/// NUMA node of the current CPU (for __vdso_getcpu).
pub numa_node: u32,
pub _pad: [u8; 4016], // Explicit padding: 80 bytes of fields above + 4016 = 4096 exactly.
// (Do not rely on implicit tail padding from align(4096) — explicit is safer as fields grow.)
// Compile-time guard: const _: () = assert!(core::mem::size_of::<VvarPage>() == 4096);
}
Exported Symbols (Linux-compatible ABI):
The vDSO ELF exports the following symbols with STV_DEFAULT visibility. These
match the Linux x86-64 vDSO symbol names exactly so that glibc and other libc
implementations find them without modification:
| Symbol | Signature | Supported clocks |
|---|---|---|
| `__vdso_clock_gettime` | `(clockid_t clk_id, struct timespec *tp) -> int` | CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME_COARSE, CLOCK_MONOTONIC_COARSE |
| `__vdso_gettimeofday` | `(struct timeval *tv, struct timezone *tz) -> int` | All (derives from clock_realtime) |
| `__vdso_time` | `(time_t *tloc) -> time_t` | Derives from `clock_realtime_sec` |
| `__vdso_clock_getres` | `(clockid_t clk_id, struct timespec *res) -> int` | Returns resolution for supported clocks |
| `__vdso_getcpu` | `(unsigned *cpu, unsigned *node) -> int` | Returns `VvarPage::cpu_id` and `numa_node` |
On AArch64, the equivalent symbols use the same names but read CNTVCT_EL0
(virtual counter) instead of RDTSC. On RISC-V, rdtime is used. The VVAR
arch_counter_base and arch_counter_freq_hz fields supply the base and
frequency needed for the conversion.
For clock IDs that the vDSO does not handle (e.g., CLOCK_PROCESS_CPUTIME_ID,
CLOCK_THREAD_CPUTIME_ID, CLOCK_BOOTTIME), the vDSO falls back to a real
syscall via the syscall instruction (x86-64) or SVC / ecall (AArch64,
RISC-V).
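The dispatch can be sketched as a guard that routes unsupported clock IDs to the syscall path. The clock ID constants below are the standard Linux uapi values (part of the stable Linux ABI); the function name is illustrative.

```rust
// Linux uapi clock IDs (stable ABI values).
const CLOCK_REALTIME: u32 = 0;
const CLOCK_MONOTONIC: u32 = 1;
const CLOCK_PROCESS_CPUTIME_ID: u32 = 2;
const CLOCK_THREAD_CPUTIME_ID: u32 = 3;
const CLOCK_MONOTONIC_RAW: u32 = 4;
const CLOCK_REALTIME_COARSE: u32 = 5;
const CLOCK_MONOTONIC_COARSE: u32 = 6;
const CLOCK_BOOTTIME: u32 = 7;

/// Can the vDSO answer this clock from the VVAR page alone?
/// Anything else falls back to a real syscall.
fn vdso_fast_path(clk_id: u32) -> bool {
    matches!(
        clk_id,
        CLOCK_REALTIME
            | CLOCK_MONOTONIC
            | CLOCK_MONOTONIC_RAW
            | CLOCK_REALTIME_COARSE
            | CLOCK_MONOTONIC_COARSE
    )
}
```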
Seqlock Update Protocol:
Kernel (called on each timer tick or TSC calibration update):
1. VvarPage::seq.fetch_add(1, Release) // seq becomes odd: write in progress
2. write clock_realtime_sec, clock_realtime_nsec, clock_monotonic_ns
3. write tsc_base, tsc_to_ns_mul, tsc_to_ns_shift (if TSC calibration changed)
4. write arch_counter_base (architecture-specific counter snapshot)
5. write cpu_id, numa_node (approximate; read from CpuLocal::cpu_id)
6. VvarPage::seq.fetch_add(1, Release) // seq becomes even: write complete
vDSO userspace (pseudocode for __vdso_clock_gettime):
loop:
seq1 = load(VvarPage::seq, Acquire)
if seq1 & 1 != 0: continue // write in progress, retry
tsc_now = RDTSC (or arch counter)
ns = clock_monotonic_ns + ((tsc_now - tsc_base) * tsc_to_ns_mul) >> tsc_to_ns_shift
seq2 = load(VvarPage::seq, Acquire)
if seq2 != seq1: continue // update raced, retry
return ns
The retry loop is expected to execute zero times in practice: timer tick updates are infrequent (1–10 ms intervals) and short (< 1 μs). The loop exists only for correctness on the rare overlap.
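The protocol can be exercised end-to-end in plain Rust. This is a host-side sketch with only the sequence counter and one data field, to show the odd/even retry discipline; the data field is modeled as an atomic to keep the sketch data-race-free, whereas the real VVAR fields are plain memory written under the seqlock.

```rust
use std::sync::atomic::{
    AtomicU32, AtomicU64,
    Ordering::{Acquire, Release},
};

/// Reduced VVAR sketch: sequence counter plus one data field.
struct VvarSketch {
    seq: AtomicU32,
    clock_monotonic_ns: AtomicU64,
}

impl VvarSketch {
    /// Kernel side: bracket the data write with two increments.
    fn publish(&self, ns: u64) {
        self.seq.fetch_add(1, Release); // odd: write in progress
        self.clock_monotonic_ns.store(ns, Release);
        self.seq.fetch_add(1, Release); // even: write complete
    }

    /// vDSO side: retry until a consistent even-sequence snapshot is read.
    fn read(&self) -> u64 {
        loop {
            let seq1 = self.seq.load(Acquire);
            if seq1 & 1 != 0 {
                continue; // write in progress, retry
            }
            let ns = self.clock_monotonic_ns.load(Acquire);
            if self.seq.load(Acquire) == seq1 {
                return ns; // no update raced with us
            }
        }
    }
}
```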
ELF Build Requirements:
The vDSO ELF is built as a position-independent shared library with no dynamic linker dependencies:
- Compiled with `-fPIC -fno-plt -nostdlib -Wl,-shared`.
- No external symbol references (self-contained; no libc, no PLT stubs).
- Linked with a custom linker script that produces exactly two `PT_LOAD` segments (one RX for code, one R for read-only data) plus `PT_DYNAMIC` and `PT_GNU_EH_FRAME`.
- Stripped of debugging sections; `.eh_frame` retained for unwinding (stack traces in userspace debuggers work through the vDSO).
- The vDSO ELF is embedded in the kernel image as a byte array in `.rodata`. At process creation (exec), the kernel copies it into a freshly allocated page and maps it with `PROT_READ | PROT_EXEC`.
Architecture-Specific Notes:
| Architecture | Counter instruction | Notes |
|---|---|---|
| x86-64 | `RDTSC` | Requires invariant TSC (CPUID leaf 0x80000007 bit 8). Non-invariant TSC (laptops with deep C-states on pre-Nehalem) falls back to syscall. |
| AArch64 | `MRS x0, CNTVCT_EL0` | Virtual counter. `CNTFRQ_EL0` gives the frequency. Always available on ARMv8+ in EL0. |
| ARMv7 | `MRC p15, 0, r0, c14, c3, 2` (CNTVCT) | Available on Cortex-A7/A15 with generic timer. Falls back to syscall if not available. |
| RISC-V 64 | `rdtime` pseudo-instruction | Frequency from Device Tree `/cpus/timebase-frequency`. |
| PPC32 | `mftb` (Timebase lower) | PPC32 vDSO is limited; gettimeofday uses syscall fallback on embedded targets without an invariant timebase. |
| PPC64LE | `mftb` | Timebase register, frequency from Device Tree or OPAL. |
Per-architecture vDSO placement in arch/:
umka-kernel/src/arch/x86_64/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/aarch64/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/armv7/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/riscv64/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/ppc32/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/ppc64le/vdso/ — vdso.S, vdso.ld, vvar.rs
The VvarPage struct definition is shared (umka-kernel/src/vvar.rs); only the
counter-reading instructions in vdso.S and the linker load addresses in vdso.ld
differ per architecture.
Per-architecture hardware abstraction equivalents:
| Concept | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE |
|---|---|---|---|---|---|---|
| Privilege separation | GDT (ring 0/3 segments) | Exception levels (EL0/EL1) | Processor modes (USR/SVC) | Privilege levels (U/S) | MSR PR bit (user/supervisor) | MSR PR bit (user/supervisor) |
| Exception dispatch | IDT (256 gate descriptors) | Exception vector table (VBAR_EL1, 16 entries × 4 vectors) | Vector table (VBAR, 8 entries) | Trap vector (stvec, single entry + scause dispatch) | Exception vector table (IVPR + IVORn) | System Reset + Machine Check vectors (LPCR) |
| Interrupt controller | APIC (LAPIC + IOAPIC) | GIC v2/v3 (distributor + redistributor/CPU interface, detected at runtime) | GIC (distributor + CPU interface) | PLIC (+ CLINT for timer/IPI) | OpenPIC / MPIC | XICS / XIVE (POWER8/9/10) |
| Timer | APIC timer / HPET / TSC | Generic Timer (CNTPCT_EL0) | Generic Timer (CNTPCT) | SBI timer ecall / mtime | Decrementer (DEC SPR) | Decrementer (DEC SPR) / HDEC |
| Syscall mechanism | SYSCALL/SYSRET (MSRs) | SVC instruction (EL0→EL1) | SVC instruction (USR→SVC) | ecall instruction (U→S) | sc instruction (system call) | sc instruction / scv (POWER9+) |
| Page table format | 4-level (PML4→PDPT→PD→PT) | 4-level (L0→L1→L2→L3) | 2-level (L1→L2, 1MB sections) | 4-level Sv48 | 2-level (PGD→PTE, 4 KB pages) | Radix tree (POWER9+) or HPT (hashed page table) |
| Fast isolation | MPK (WRPKRU) | POE (POR_EL0) / MTE | DACR (16 domains) | Page-table based | Segment registers (16 segments) | Radix partition table / HPT LPAR |
| TLB ID | PCID (12-bit, CR3) | ASID (8/16-bit, TTBR) | ASID (8-bit, CONTEXTIDR) | ASID (9-16 bit, satp) | PID (8-bit, via PID SPR) | PID/LPID (Radix: 20-bit PID, LPIDR) |
Everything else -- scheduling, memory management, capability system, driver model, syscall compatibility -- is architecture-independent Rust code.
2.2.2 No 32-bit Compatibility Modes on 64-bit Kernels
UmkaOS does not support running 32-bit binaries on 64-bit kernels:
- No ia32 compatibility mode on x86-64
- No AArch32 compatibility mode on AArch64
- No RV32 compatibility mode on RV64
ARMv7 (32-bit ARM) is supported as a native first-class architecture — it runs a native 32-bit kernel, not a compatibility layer on a 64-bit kernel. This follows the principle that 32-bit support, where needed, is added as a separate target rather than as a compatibility layer that doubles the syscall surface.
2.2.3 64-bit Atomics on 32-bit Architectures
UmkaOS uses AtomicU64 in several core data structures (PTY ring buffers, MCE logs,
lock-free IPC). On 32-bit architectures where native 64-bit atomics have limited
support, the following strategies apply:
| Architecture | Native 64-bit Atomic | Strategy |
|---|---|---|
| ARMv7 (Cortex-A) | `LDREXD`/`STREXD` (available on all ARMv7-A cores with LPAE) | Native hardware atomics. The `armv7a-none-eabi` target supports `AtomicU64` via doubleword exclusive load/store. Non-LPAE cores (Cortex-M, ARMv6) are not first-class targets. |
| PPC32 | No native 64-bit atomics | Software emulation via interrupt-disabling (wrteei 0/1) around read-modify-write sequences. Implemented in umka-kernel/src/arch/ppc32/atomics.rs. The custom target JSON sets max-atomic-width: 64 so LLVM generates calls to __atomic_* runtime functions provided by the kernel. |
Both strategies are safe in a single-core or SMP-with-coherence context. The interrupt-disabling approach on PPC32 is correct because UmkaOS's 32-bit targets are single-core embedded systems; SMP PPC uses 64-bit PPC64LE.
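The PPC32 emulation pattern — disable interrupts, perform the plain 64-bit read-modify-write, re-enable — can be sketched generically. In this host-side sketch a plain closure stands in for the `wrteei 0` / `wrteei 1` bracket, so only the logic is exercised; the exclusivity argument holds only because the real target is single-core.

```rust
use std::cell::Cell;

/// Stand-in for the interrupt-disable bracket. Real code: save MSR.EE,
/// `wrteei 0`, run the closure, restore MSR.EE via `wrteei 1`.
fn with_irqs_disabled<R>(f: impl FnOnce() -> R) -> R {
    f()
}

/// Host-side sketch of the emulated 64-bit atomic backing the kernel's
/// `__atomic_fetch_add_8` on PPC32.
struct EmulatedAtomicU64 {
    value: Cell<u64>, // plain storage; exclusivity comes from the IRQ bracket
}

impl EmulatedAtomicU64 {
    fn new(v: u64) -> Self {
        Self { value: Cell::new(v) }
    }

    /// Read-modify-write executed with interrupts off; on a single-core
    /// target nothing can interleave with it.
    fn fetch_add(&self, delta: u64) -> u64 {
        with_irqs_disabled(|| {
            let old = self.value.get();
            self.value.set(old.wrapping_add(delta));
            old
        })
    }
}
```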
2.2.4 Advanced Feature Architecture Parity
Chapters 16–18 define advanced features that rely on architecture-specific hardware
mechanisms. The following matrix summarizes support status across all six first-class
architectures. Where hardware is unavailable, UmkaOS either provides a software fallback
(reduced performance) or marks the feature as not supported on that architecture. The
kernel's #[cfg(target_feature)] mechanism ensures unsupported paths compile to no-ops
with zero overhead.
| Feature | Mechanism | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE |
|---|---|---|---|---|---|---|---|
| Fast driver isolation | MPK/POE/DACR/page-table | WRPKRU (native) | POE (ARMv8.9+, POR_EL0) / page-table fallback | DACR 16 domains | Page-table based | Segment registers (16 segments) / page-table fallback | Radix partition table / HPT LPAR |
| Memory tagging | MTE/LAM | Intel LAM (pointer tagging only) | MTE (full, ARMv8.5+) | Not available | Not available | Not available | Not available |
| Hardware power metering | RAPL/SCMI/SBI | RAPL (native) | SCMI power domain | SCMI (limited) | SBI PMU (basic) / software estimation | Not available (software only) | OPAL/OCC power sensors (POWER8/9/10) |
| Confidential computing | SEV-SNP/TDX/CCA/CoVE | SEV-SNP + TDX (native) | ARM CCA (emerging) | Not available | RISC-V CoVE (draft) | Not available | Ultravisor Protected Execution Facility (POWER9+) |
| Cache partitioning | CAT/MPAM | Intel CAT + MBA (native) | ARM MPAM (ARMv8.4+) | Not available | Not available (software only) | Not available | Not available (software only) |
| Hardware preemption (GPU) | Device-dependent | Yes (vendor support) | Yes (Mali, Adreno) | Limited | Emerging | Not available | Limited (Nvidia via PCIe) |
| CXL memory pooling | CXL 2.0/3.0 | Native (PCIe 5.0+) | Emerging (ARMv9 + CXL) | Not available | Not available | Not available | OpenCAPI / CXL (POWER10+) |
| In-kernel inference | ISA extensions | AMX (matrix), AVX-512 | SME (matrix), SVE (vector) | NEON (vector) | V extension (vector) | AltiVec/SPE (limited) | VSX (vector-scalar, POWER7+) |
Reading the table: "Native" means hardware support is available and UmkaOS uses it directly. "Fallback" means UmkaOS implements the feature using a slower mechanism (typically page-table manipulation). "Not available" means neither hardware nor a practical software fallback exists — the feature is compile-time disabled on that architecture. "Emerging" or "draft" means the hardware specification exists but is not yet widely deployed; UmkaOS includes provisional support gated behind a feature flag.
2.3 Hardware Memory Safety
2.3.1 ARM MTE (Memory Tagging Extension)
ARM MTE is architecturally defined in ARMv8.5-A and first implemented in ARMv9 silicon. MTE availability depends on both the core IP implementing the extension AND the SoC vendor enabling tag storage in the memory subsystem:
- Core IP with MTE: ARM Neoverse V2, Neoverse V3 (all cores based on these designs implement the MTE extension at the microarchitectural level).
- Mobile SoCs with MTE enabled: Google Pixel 8/9 (Tensor G3/G4, Cortex-X3/X4), MediaTek Dimensity 9300+ devices.
- Datacenter SoC with MTE enabled: AmpereOne (the first datacenter SoC to fully enable MTE at the platform level, including tag storage in DRAM).
- Cloud SoCs with MTE logic but NOT enabled: AWS Graviton 4 (Neoverse V2) and Google Axion (Neoverse V2) include MTE logic in the cores but their memory subsystems do not support tag storage — MTE is not usable on these platforms despite the core IP implementing it.
- No MTE: Ampere Altra (Neoverse N1, ARMv8.2 — predates MTE entirely).
Every 16-byte memory granule carries a 4-bit tag. Pointer top bits carry a matching tag. Hardware compares the two on every access; a mismatch faults. This catches use-after-free and buffer overflow in hardware, at near-zero runtime cost.
Important limitation: MTE is probabilistic, not complete. 4-bit tags = 16
possible values. Adjacent slab objects may receive the same tag by random chance
(probability 1/16 = 6.25%). Single-violation detection rate: ~93.75%. This is
acceptable for defense-in-depth — Rust's ownership model is the primary safety
mechanism; MTE is an additional hardware layer that catches what Rust cannot
(C driver bugs in Tier 1, unsafe blocks, compiler bugs). MTE is NOT a
substitute for memory-safe code.
Tag Storage Requirement:
ARM MTE stores tags in storage managed by the memory controller: 4 bits per 16-byte
granule. Relative to DRAM capacity, this means tag storage is sized at 3.125% of DRAM
(4 bits / 128 bits = 1/32). High-performance implementations (Neoverse V2/V3,
AmpereOne) typically use dedicated Tag RAM; other implementations may use reserved
DRAM regions managed transparently by the memory controller. In all cases, the
storage is invisible to software and managed automatically by the hardware.
On SoCs without MTE support, the tagging code is compiled out
(#[cfg(target_feature = "mte")]) — zero overhead, zero memory cost.
MTE is only available on ARM; x86 systems are entirely unaffected.
TEE interaction: MTE tags are stored in separate physical tag RAM. For
TEE-encrypted pages, tag RAM may also be encrypted. Confidential pages are
allocated untagged (tag = 0); MTE checking is disabled for pages owned by a
ConfidentialContext (see Section 8.6.3). Hardware encryption already prevents
unauthorized access — MTE is redundant for confidential memory.
Section 4.1.7 already mentions MTE and Intel LAM. This section details the architectural integration.
2.3.2 Design: Tag-Aware Memory Allocator
// umka-core/src/mem/tagging.rs
/// Memory tagging policy (system-wide, configurable at boot).
#[repr(u32)]
pub enum TaggingPolicy {
/// No tagging. Standard allocation. Zero overhead.
/// Used on hardware without MTE, or for maximum performance.
Disabled = 0,
/// Synchronous tagging: fault immediately on tag mismatch.
/// Catches all tag violations. ~128 extra cycles per page allocation.
/// Recommended for development and high-security production.
Synchronous = 1,
/// Asynchronous tagging: record violations in a register, check lazily.
/// Lower overhead (~10 cycles per allocation), but violations reported
/// with delay. Good for production with logging.
Asynchronous = 2,
}
/// Tag operations for the memory allocator.
pub trait MemoryTagger {
/// Assign a random tag to a newly allocated region.
/// Called by: slab allocator (per-object), buddy allocator (per-page).
fn tag_allocation(&self, addr: *mut u8, size: usize) -> TaggedPtr;
/// Clear tags on freed memory (set to a "freed" tag value).
/// Any subsequent access with the old tag will fault.
fn tag_deallocation(&self, addr: *mut u8, size: usize);
/// Set tags for a DMA buffer region (tag = 0, untagged).
/// DMA engines don't understand tags — buffers must be untagged.
fn untag_dma_region(&self, addr: *mut u8, size: usize);
}
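The tagged-pointer arithmetic behind `tag_allocation` can be illustrated with plain bit manipulation: MTE carries the 4-bit logical tag in virtual address bits 59:56 (within the TBI top byte). A sketch using `u64` addresses rather than raw pointers so it runs anywhere; the function names are illustrative, not the `MemoryTagger` API.

```rust
const TAG_SHIFT: u32 = 56;
const TAG_MASK: u64 = 0xF << TAG_SHIFT; // MTE logical address tag: bits 59:56

/// Insert a 4-bit tag into a pointer's tag field.
fn set_tag(addr: u64, tag: u8) -> u64 {
    (addr & !TAG_MASK) | ((tag as u64 & 0xF) << TAG_SHIFT)
}

/// Extract the 4-bit tag from a tagged pointer.
fn get_tag(addr: u64) -> u8 {
    ((addr >> TAG_SHIFT) & 0xF) as u8
}

/// Strip the tag (e.g., before handing the address to a DMA engine,
/// mirroring `untag_dma_region`).
fn untag(addr: u64) -> u64 {
    addr & !TAG_MASK
}
```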
2.3.3 Integration Points
Slab allocator (Section 4.1.2):
Object allocation:
1. Allocate object from slab (existing path).
2. Assign random 4-bit tag to the object's 16-byte granules.
3. Return tagged pointer (tag in top bits).
Object deallocation:
1. Return object to slab (existing path).
2. Set the object's granules to a "freed" tag (e.g., 0xF).
3. Any subsequent access with the old tag faults immediately.
Benefit: use-after-free in kernel (or in Tier 1 C drivers) is caught
by hardware. The fault is caught by domain isolation and triggers driver crash recovery.
Page allocator (Section 4.1.1):
Page allocation: tag all granules in the page with a fresh tag.
Page deallocation: tag all granules with "freed" tag.
Granule counts: 4KB page = 256 granules (4096 / 16); 64KB page = 4096 granules (65536 / 16).
Cost (4KB page): 256 STG instructions per alloc/dealloc (or 128 ST2G/STZ2G,
each tagging two 16-byte granules).
At ~0.5 cycles per STG on A510+ cores: ~128 cycles with STG (64 cycles with ST2G). Page alloc is ~300+ cycles.
Overhead (4KB): ~43% with STG (128 tag cycles / ~300 base cycles); ~21% with ST2G.
Cost (64KB page): 4096 STG instructions (or 2048 ST2G/STZ2G instructions).
At ~0.5 cycles per STG: ~2048 cycles with STG (~1024 cycles with ST2G/STZ2G). Page alloc is ~300+ cycles.
Overhead (64KB): ~683% with individual STG (2048 tag cycles / ~300 base cycles);
~341% with ST2G/STZ2G (1024 tag cycles / ~300 base cycles). Prefer STZ2G for bulk
tagging as it zeros and tags in one pass. The 4KB case is the common slab/page
allocation path. 64KB huge-page allocation is rarely hot and the high overhead is
acceptable.
Note: this only affects ARM. On x86 without MTE, zero overhead.
On ARM without MTE enabled, zero overhead (policy = Disabled).
KABI boundary:
When kernel passes a buffer to a Tier 1 driver:
Buffer is tagged. Driver receives tagged pointer.
If driver overflows the buffer: tag mismatch, hardware fault.
Domain isolation catches the fault, driver is crash-recovered.
This provides hardware-enforced bounds checking for C drivers,
even though the kernel is written in Rust (which checks bounds in software).
DMA buffers:
DMA engines cannot process tagged memory.
DMA buffers are allocated untagged (tag = 0).
IOMMU validates DMA addresses regardless.
fork() / CoW:
Before CoW break: child shares parent's page (same tags, read-only).
On CoW break (child or parent writes):
1. Allocate new page, copy data.
2. Assign FRESH RANDOM tags to the new page's granules.
3. Do NOT copy the old page's tags.
Rationale: if both pages kept the same tags, a stale pointer from
one process could access the other's now-separate page without
a tag fault (same tag, different physical page). Fresh tags ensure
that cross-process stale pointers are detected by MTE.
2.3.4 Intel LAM (Linear Address Masking)
Intel LAM allows using top bits of 64-bit pointers for metadata without them being treated as part of the address. This is less powerful than MTE (no hardware tag checking), but useful for:
- Pointer authentication (storing metadata in unused address bits)
- Memory safety tooling (KASAN-like in-kernel detection)
- Capability tagging (embedding capability metadata in pointers)
LAM modes:
LAM_U48: bits 62:48 available for metadata (15 bits, user pointers only).
LAM_U57: bits 62:57 available for metadata (6 bits, 5-level paging mode).
Controlled via CR3 flags: CR3.LAM_U48 or CR3.LAM_U57.
No runtime cost: address masking is performed by hardware in the MMU pipeline.
Comparison with MTE:
MTE (ARM): 4-bit tag per 16-byte granule. Hardware CHECKS on every access.
Detects use-after-free, buffer overflow at runtime. ~128 cycles per
page allocation for tag setup. Zero-cost access checks (pipelined).
LAM (x86): 6-15 metadata bits per pointer. NO hardware checking — metadata is
simply ignored by the MMU. Software must perform its own checks.
Zero overhead. Useful for tooling metadata, not for runtime safety.
Result: MTE provides stronger guarantees (hardware-enforced); LAM provides
more flexible metadata embedding. UmkaOS uses both where available.
Integration: the memory allocator stores metadata in LAM bits. Debug builds use these bits for KASAN-equivalent checking. Release builds can optionally use them for capability hints.
Security caveat: Intel LAM has been disabled in the Linux kernel since v6.12 due to the SLAM attack (Spectre-based exploitation of LAM metadata bits without LASS protection). UmkaOS does not enable LAM unless LASS (Linear Address Space Separation) is also available on the CPU. On CPUs without LASS, the upper address bits described above are not used for metadata; KASAN-equivalent checking uses shadow memory instead. When both LAM and LASS are present, LAM is enabled with the protections described above.
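The LAM_U48 metadata embedding described above is plain masking in software: 15 metadata bits live in pointer bits 62:48, with bit 63 left alone for the canonical sign. A sketch with `u64` addresses (illustrative names, not the allocator API):

```rust
const META_SHIFT: u32 = 48;
const META_BITS: u64 = 0x7FFF;                  // 15 bits: 62:48 (LAM_U48)
const META_MASK: u64 = META_BITS << META_SHIFT;

/// Embed metadata in a user pointer's LAM_U48 bits.
fn embed(addr: u64, meta: u16) -> u64 {
    (addr & !META_MASK) | (((meta as u64) & META_BITS) << META_SHIFT)
}

/// Recover the metadata from a tagged pointer.
fn metadata(ptr: u64) -> u16 {
    ((ptr >> META_SHIFT) & META_BITS) as u16
}

/// The address the MMU effectively translates under LAM_U48: metadata bits
/// masked out. (Hardware sign-extends from bit 47; canonical user addresses
/// have zeros there, so masking suffices for this sketch.)
fn canonical(ptr: u64) -> u64 {
    ptr & !META_MASK
}
```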
2.3.5 AArch64 Pointer Authentication (PAC)
AArch64 provides Pointer Authentication Codes (PAC, ARMv8.3+) as a complementary mechanism to MTE. PAC signs pointers with a cryptographic MAC using a per-process key, detecting pointer forgery and corruption:
PAC in UmkaOS:
- Return address signing: PACIASP/AUTIASP in function prologue/epilogue.
Compiler-inserted via -mbranch-protection=pac-ret+leaf.
- Detects ROP (Return-Oriented Programming) attacks: corrupted return
addresses fail authentication and trap.
- Cost: ~1 cycle per PAC/AUT instruction (pipelined). Zero memory overhead.
- Available on: Apple M1+, AWS Graviton 3+, Cortex-A710+.
UmkaOS enables PAC for all kernel code on capable hardware. This is orthogonal
to MTE (MTE detects memory safety bugs; PAC detects control-flow hijacking).
2.3.6 CHERI (Future)
ARM Morello (CHERI prototype) demonstrates hardware-capability pointers with bounds checking. CHERI pointers are 128-bit: address (64) + bounds (32) + permissions (16) + flags (16). Every pointer carries its own bounds and permission information. Hardware checks on every dereference.
UmkaOS's capability system (Section 8.1.1) is a software capability model. CHERI provides a hardware capability model. When CHERI hardware is available:
Software capabilities (current):
Kernel maintains capability table. Validated on syscall.
Overhead: ~5-10 cycles per capability check (bitmask test).
CHERI hardware capabilities (future):
Pointer IS the capability. Hardware validates on every access.
Overhead: 0 cycles (pipelined with memory access).
UmkaOS's capability tokens become hardware CHERI capabilities.
The translation is natural: both use unforgeable tokens with
bounded permissions and delegation rules.
Design for CHERI readiness: the capability system should NOT assume that capabilities are always validated in software. The validation path should be abstractable so that CHERI hardware validation can replace software validation.
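The abstraction called for above can be sketched as a trait with the software validator as the current implementation. All names here are hypothetical; the real capability model is defined in Section 8.1.1.

```rust
/// Abstraction over capability validation so a CHERI backend can replace
/// the software path later. Illustrative, not the UmkaOS API.
trait CapabilityValidator {
    /// Check that `token` grants all `required` permission bits.
    fn validate(&self, token: u64, required: u64) -> bool;
}

/// Current software model: the token carries a permission bitmask and
/// validation is a bitmask test (the ~5-10 cycle check described above).
struct SoftwareValidator;

impl CapabilityValidator for SoftwareValidator {
    fn validate(&self, token: u64, required: u64) -> bool {
        token & required == required
    }
}

// A future CheriValidator would implement the same trait but compile to
// (almost) nothing: hardware checks bounds and permissions on every
// dereference, so the software check becomes a no-op.
```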
CHERI Morello Status:
ARM Morello evaluation boards shipped in 2022 (based on Neoverse N1 + CHERI extensions). As of 2026, production CHERI hardware is not available. The CHERI readiness design in Section 2.3.6 prepares for future hardware without depending on it. When production CHERI SoCs ship, the capability validation abstraction layer enables a transition from software to hardware capability checks.
2.3.7 Performance Impact
MTE on ARM (when enabled): ~128 cycles per page allocation (~40% of allocator hot path). Memory access checks are hardware-pipelined: zero overhead. Linux pays the same cost when MTE is enabled.
MTE disabled (default on x86, optional on ARM): zero overhead. No code runs.
Intel LAM: zero runtime overhead (address masking is free in hardware).
CHERI (future): zero overhead (hardware-pipelined capability checks).
2.3.8 Hardware Fault Handler Constraints
Hardware fault handlers (machine check exceptions, bus errors, SError, NMI, system error interrupts) operate in extremely constrained contexts where normal kernel operations are forbidden. Violating these constraints causes deadlock, system hang, or recursive faults.
2.3.8.1 Fault Handler Categories
Hardware fault handlers fall into three categories with progressively stricter constraints:
| Category | Examples | Context | Permitted Operations |
|---|---|---|---|
| Maskable interrupts | Timer tick, device IRQ | IRQ context, interrupts disabled | Try-lock, lock-free writes, deferred work |
| Synchronous faults | Page fault, alignment fault, breakpoint | Fault context, preemptible | Blocking locks (with care), allocation (with care) |
| Non-maskable faults | Machine Check (MCE), NMI, SError, Bus Error, System Reset | NMI context, all interrupts blocked | Lock-free only, per-CPU buffers, no locks |
The critical distinction: maskable interrupts can be delayed by disabling interrupts, but non-maskable faults fire regardless of interrupt state. Code holding a spinlock cannot prevent an MCE or NMI from occurring.
2.3.8.2 Non-Maskable Fault Handler Requirements
Non-maskable fault handlers (MCE, NMI, SError, Bus Error, System Reset vectors) MUST follow these rules:
1. No blocking operations. The handler MUST NOT:
- Acquire a spinlock with blocking semantics (lock() / spin_lock())
- Acquire a mutex, rwlock, or semaphore
- Allocate memory (kmalloc, vmalloc, page allocation)
- Sleep or yield (schedule(), wait(), condvar)
- Perform I/O that may block (disk, network)
- Call any function that may transitively do the above
Rationale: The fault may have interrupted code already holding locks. If the handler blocks waiting for the same lock, deadlock occurs immediately.
2. Try-lock only, with fallback. If the handler needs a lock, it MUST use
try-lock (try_lock() / spin_trylock()) and handle failure:
match lock.try_lock() {
    Some(guard) => {
        // Critical section; the guard releases the lock when dropped.
        drop(guard);
    }
    None => {
        // Fallback: cannot acquire lock.
        // Options: log to per-CPU buffer and continue, force reboot, degrade gracefully.
    }
}
3. Per-CPU buffers for logging. NMI/MCE handlers MUST NOT write to shared ring buffers (MPSC, printk). Instead, use a pre-allocated per-CPU buffer:
Data types used by the MCE log:
/// Severity classification of a machine-check event.
#[repr(u32)]
enum MceSeverity {
Corrected = 0, // Hardware corrected; no data loss
Recoverable = 1, // Software-recoverable with page offlining
Fatal = 2, // Unrecoverable; system must reboot
}
/// One entry in the per-CPU MCE ring log.
/// Padded to 64 bytes (one cache line) so that array elements never span cache line
/// boundaries. This prevents false sharing when a remote monitoring thread reads
/// the log while the NMI handler writes it.
///
/// Torn-read detection uses a seqcount-style generation counter (`gen`):
/// the writer sets `gen` to an odd value before writing fields, then to the
/// next even value after writing. A reader that observes an odd `gen` or a
/// changed `gen` between its two reads has caught a torn write and must retry.
#[repr(C, align(64))]
#[derive(Copy, Clone)]
struct MceLogEntry {
gen: u32, // Generation counter (odd = write in progress, even = stable)
_pad_gen: [u8; 4],
timestamp_tsc: u64, // TSC at time of MCE
bank: u8, // MCE bank number
_pad0: [u8; 7],
status: u64, // MCi_STATUS MSR value
address: u64, // MCi_ADDR MSR value (if valid)
misc: u64, // MCi_MISC MSR value (if valid)
severity: MceSeverity, // 4 bytes (repr(u32))
_pad1: [u8; 12], // Pad to 64 bytes total
}
// Total size: 4 + 4 + 8 + 1 + 7 + 8 + 8 + 8 + 4 + 12 = 64 bytes. One cache line each.
impl MceLogEntry {
const EMPTY: Self = Self {
gen: 0, _pad_gen: [0; 4],
timestamp_tsc: 0, bank: 0, _pad0: [0; 7],
status: 0, address: 0, misc: 0,
severity: MceSeverity::Corrected, _pad1: [0; 12],
};
}
/// Per-CPU MCE log with head counter and ring buffer.
struct MceLog {
head: AtomicU32, // Monotonically increasing write index
entries: [MceLogEntry; 64], // Ring buffer (indexed by head % 64)
}
impl MceLog {
const fn new() -> Self {
Self { head: AtomicU32::new(0), entries: [MceLogEntry::EMPTY; 64] }
}
}
// Allocated at boot, one per CPU, never freed.
static MCE_LOG: PerCpu<MceLog> = PerCpu::new(MceLog::new());
// In MCE handler (NMI context):
fn mce_handler(ctx: &MceContext) {
    let log = MCE_LOG.this_cpu();
    // Per-CPU: exactly one producer (this CPU's NMI handler), no concurrent writers.
    // load(Relaxed) is safe because only this CPU writes head.
    let count = log.head.load(Relaxed);
    let idx = count as usize % 64;
    // Seqcount write (see MceLogEntry doc): bump gen to an odd value, fill the
    // payload fields, then bump gen to the next even value. The compiler fences
    // keep the stores in that order. (Interior mutability elided for clarity;
    // write_payload copies the fault context into every field except gen.)
    let entry = &mut log.entries[idx];
    entry.gen = entry.gen.wrapping_add(1); // odd: write in progress
    compiler_fence(Release);
    entry.write_payload(ctx);
    compiler_fence(Release);
    entry.gen = entry.gen.wrapping_add(1); // next even value: stable
    // ORDERING: Release store on head publishes the entry. Any thread that
    // subsequently reads head with Acquire will observe the entry write.
    log.head.store(count.wrapping_add(1), Release);
    // Handler returns; main kernel drains log later
}
// Drain path (thread context, outside NMI):
fn drain_mce_log(log: &MceLog) {
    // Use swap instead of load+store(0) to atomically capture AND reset head.
    // This prevents losing entries from an MCE that fires between load and store.
    let count = log.head.swap(0, AcqRel);
    // AcqRel: Acquire ensures prior entry writes are visible; Release publishes
    // the reset (head=0) so a concurrent MCE handler sees the new base.
    // At most 64 entries survive in the ring; anything older was overwritten.
    // Iterate oldest-first, from index (count - n) up to (count - 1), so entries
    // are processed in arrival order.
    let n = core::cmp::min(count, 64);
    // head uses wrapping arithmetic; always use wrapping_sub()/wrapping_add()
    // when computing the distance between two head values across u32 wrap-around.
    let start = count.wrapping_sub(n);
    for i in 0..n {
        let idx = start.wrapping_add(i) as usize % 64;
        // Seqcount read: snapshot gen, copy the entry, re-check gen.
        // (Real code uses volatile/atomic reads here so the compiler cannot
        // collapse the two gen loads into one.)
        let gen_before = log.entries[idx].gen;
        let entry = log.entries[idx]; // 64-byte copy
        let gen_after = log.entries[idx].gen;
        if gen_before % 2 == 1 || gen_before != gen_after {
            continue; // torn write: skip; the MCE bank registers still hold it
        }
        // ... process entry ...
    }
}
Race window: A narrow race exists between `head.swap(0)` and the drain loop. An MCE arriving after the swap writes to `entries[0]` while the drain may be reading entries at the same index (via modular arithmetic when the ring was full). Mitigation: each entry carries a seqcount-style generation counter (`gen`). The drain reads `gen` before and after reading the entry fields: if `gen_before` is odd (write in progress) or `gen_after != gen_before` (torn write), the drain skips that entry and logs a warning. The skipped MCE is not lost — the hardware MCE bank registers retain the error until explicitly cleared, so the next drain cycle will re-read it.
The main kernel drains these buffers after returning from the exception, outside NMI context.
4. No locks at all for NMI. NMI handlers specifically MUST NOT use any locks, even try-lock. The NMI can nest inside an MCE handler that already holds the lock, causing deadlock. NMI handlers use only:
- Per-CPU variables (no sharing)
- Lock-free atomic operations (atomic read/write, compare-and-swap)
- Pre-mapped memory (no page faults possible)
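To make that permitted operation set concrete, here is a minimal sketch (names are illustrative, not UmkaOS's actual API) of the kind of update an NMI handler may perform: a lock-free compare-and-swap loop on a pre-mapped static, with no locks and no allocation.

```rust
use core::sync::atomic::{AtomicU64, Ordering};

// NMI-safe maximum tracker: records the worst-case handler latency seen so
// far. Static storage is pre-mapped, so touching it can never page-fault.
static MAX_NMI_TSC: AtomicU64 = AtomicU64::new(0);

fn record_nmi_latency(cycles: u64) {
    let mut cur = MAX_NMI_TSC.load(Ordering::Relaxed);
    while cycles > cur {
        // CAS loop: if another CPU raced in a value first, re-check against
        // the value it published and retry only if ours is still larger.
        match MAX_NMI_TSC.compare_exchange_weak(
            cur, cycles, Ordering::Relaxed, Ordering::Relaxed,
        ) {
            Ok(_) => break,
            Err(seen) => cur = seen,
        }
    }
}
```

The loop is wait-free in the common case (no contention) and never blocks: a failed CAS retries with fresh state instead of spinning on a lock another context may hold.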
5. Pre-allocated resources. All memory, buffers, and stacks used by NMI/MCE handlers MUST be allocated at boot time. Allocation during handler execution is forbidden. On x86-64, MCE handlers run on a dedicated IST (Interrupt Stack Table) stack, pre-allocated and never paged.
2.3.8.3 Deferred Recovery Actions
Any recovery action that might block MUST be deferred to a workqueue or tasklet:
MCE handler (NMI context):
1. Capture fault context to per-CPU buffer (lock-free)
2. Assess severity: recoverable vs. fatal
3. If recoverable:
a. Log to per-CPU buffer
b. Set flag: NEEDS_RECOVERY = true
c. Return from exception
4. If fatal:
a. Log to per-CPU buffer
b. Trigger immediate reboot (no locking)
Workqueue (thread context, after NMI returns):
1. Check NEEDS_RECOVERY flag
2. If set:
a. Drain per-CPU MCE log to kernel log (may block)
b. Initiate memory offlining (may block)
c. Notify userspace via netlink (may block)
d. Clear NEEDS_RECOVERY flag
The workqueue runs in normal thread context where blocking operations are safe. The NMI handler does the minimum work needed to capture state and flag the need for recovery.
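The flag handoff between the two contexts can be sketched as follows — a minimal illustration assuming a single global flag (the names are hypothetical; a real kernel would pair this with the per-CPU log drain described above):

```rust
use core::sync::atomic::{AtomicBool, Ordering};

// Set in NMI context, consumed in thread context. Pre-allocated static:
// no locks, no allocation, no page faults.
static NEEDS_RECOVERY: AtomicBool = AtomicBool::new(false);

// NMI/MCE context: the only recovery work done here is raising the flag.
fn flag_recovery() {
    // Release pairs with the worker's Acquire swap, so the per-CPU log
    // entries written before this store are visible to the worker.
    NEEDS_RECOVERY.store(true, Ordering::Release);
}

// Workqueue thread context: blocking operations are safe here.
fn recovery_worker() {
    // swap clears the flag atomically; a request that arrives mid-drain
    // simply sets it again and is picked up on the next pass.
    if NEEDS_RECOVERY.swap(false, Ordering::Acquire) {
        // Drain per-CPU MCE logs, offline pages, notify userspace —
        // all of which may block, which is fine in this context.
    }
}
```

Using `swap` rather than `load` + `store(false)` closes the window where a recovery request raised between the two operations would be silently cleared.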
2.3.8.4 Architecture-Specific Fault Types
| Architecture | Non-Maskable Fault Types | Vector / Entry Point |
|---|---|---|
| x86-64 | Machine Check Exception (#MC), NMI | IDT vector 18 (MCE), vector 2 (NMI) |
| AArch64 | SError Interrupt, Physical IRQ (FIQ) | VBAR_EL1 offset 0x380 (SError, Current EL with SPx) |
| ARMv7 | Data Abort (imprecise), FIQ | VBAR offset 0x1C (FIQ), 0x10 (Data Abort) |
| RISC-V 64 | NMI (platform-specific) | Platform-defined; often traps to mtvec in M-mode |
| PPC32 | Machine Check, Critical Interrupt | IVOR[1] (MCE), IVOR[0] (Critical) |
| PPC64LE | Machine Check, System Reset | HSRR0/HSRR1 vectors, LPCR-defined |
All handlers for these vectors MUST follow the non-maskable fault handler requirements in Section 2.3.8.2.
2.3.8.5 Recursive Fault Prevention
Hardware fault handlers MUST prevent recursive faults:
1. Guard pages. Handler stacks have guard pages (unmapped) at both ends. Stack overflow causes an immediate fault rather than corrupting adjacent memory.
2. Handler re-entry detection. Each handler checks a per-CPU flag on entry:
fn mce_handler(ctx: &MceContext) {
let nesting = MCE_NESTING.this_cpu().fetch_add(1, Relaxed);
if nesting > 0 {
// Already in MCE handler — recursive fault.
// Cannot log (might fault again), cannot recover.
// Immediate halt to prevent infinite recursion.
arch::halt_loop();
}
// ... normal handler logic ...
//
// Use fetch_sub (not store(0)) to avoid a race window:
// store(0) + iret leaves a gap where a second MCE sees the counter
// at zero while the first handler is still returning. fetch_sub(1)
// atomically decrements; a concurrent MCE that increments to 2 will
// see nesting > 0 and halt, regardless of timing.
MCE_NESTING.this_cpu().fetch_sub(1, Release);
}
3. Pre-pinned code. Handler code and data pages are pinned in memory (never paged out). A page fault during NMI/MCE handling would cause a double fault.
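The guard-page placement from rule 1 reduces to simple address arithmetic. A sketch with illustrative constants (not UmkaOS's actual page count or stack sizes):

```rust
const PAGE_SIZE: usize = 4096;
const STACK_PAGES: usize = 4; // illustrative; real IST stacks may differ

/// Layout: [guard page][usable stack][guard page]. Both guards stay unmapped,
/// so stack overflow or underflow faults immediately instead of silently
/// corrupting the neighboring allocation.
fn ist_stack_layout(region_base: usize) -> (usize, usize) {
    let stack_bottom = region_base + PAGE_SIZE;             // above the low guard
    let stack_top = stack_bottom + STACK_PAGES * PAGE_SIZE; // high guard starts here
    (stack_bottom, stack_top)
}
```

The total reservation is `(STACK_PAGES + 2)` pages per handler stack; the two guard pages cost address space but no physical memory, since they are never mapped.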