Chapter 2: Boot and Hardware Discovery¶
Boot chain, device discovery, ACPI/DT, multi-architecture support, hardware memory safety
UmkaOS targets eight architectures (x86-64, AArch64, ARMv7, RISC-V 64, PPC32, PPC64LE, s390x, LoongArch64) using platform-native boot protocols: Multiboot1/2 on x86-64, DTB-based boot on ARM, RISC-V, PPC, and LoongArch64, and IPL/SCLP on s390x. Hardware is discovered at runtime via ACPI or Device Tree — no hardcoded limits or compile-time assumptions about memory size, CPU count, or device topology.
2.1 Boot Overview and Protocols¶
UmkaOS uses a phased boot architecture. The current implementation boots via the
Multiboot1 protocol through GRUB or QEMU's -kernel flag — sufficient for
development, testing, and early hardware bring-up. The production target is
UEFI stub boot with Linux boot protocol compatibility, enabling drop-in
package installation alongside existing Linux kernels.
The boot code lives in umka-kernel/src/boot/ (assembly entry, Multiboot parser)
and umka-kernel/src/arch/*/boot.rs (per-architecture boot routines). The
initialization sequence is in umka-kernel/src/main.rs.
Current Implementation: Multiboot Boot¶
2.1.1 Boot Protocols¶
The kernel ELF contains dual Multiboot headers — both Multiboot1 and Multiboot2 are present in the binary, allowing either protocol at the bootloader's choice:
- Multiboot1 (magic `0x1BADB002`): Fully implemented. Used by QEMU (`-kernel` flag) and GRUB (`multiboot` command). Parser in `boot/multiboot1.rs` extracts the memory map, command line, and bootloader name. The header checksum is computed as `checksum = -(magic + flags)` (i.e., `0u32.wrapping_sub(magic.wrapping_add(flags))`), ensuring `magic + flags + checksum == 0` when summed as unsigned 32-bit integers.
- Multiboot2 (magic `0xE85250D6`): Header present in the ELF but no parser implemented. The magic is recognized in `umka_main()` but the info structure is not parsed. Planned for Phase 2 (Section 24.2).
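The checksum rule for the Multiboot1 header can be verified in plain Rust. This is a standalone sketch of the arithmetic only (the flags value below is an illustrative example, not the kernel's actual header):

```rust
const MULTIBOOT1_MAGIC: u32 = 0x1BAD_B002;

/// checksum = -(magic + flags) in wrapping u32 arithmetic, so that
/// magic + flags + checksum == 0 (mod 2^32).
fn multiboot1_checksum(flags: u32) -> u32 {
    0u32.wrapping_sub(MULTIBOOT1_MAGIC.wrapping_add(flags))
}

fn main() {
    let flags = 0x0000_0003; // illustrative: page-aligned modules + memory map
    let checksum = multiboot1_checksum(flags);
    // The defining property: the three fields sum to zero as u32.
    assert_eq!(
        MULTIBOOT1_MAGIC.wrapping_add(flags).wrapping_add(checksum),
        0
    );
    println!("checksum = {checksum:#010x}");
}
```

A bootloader scanning for the header performs exactly this sum check, which is why a wrong checksum makes the ELF silently unbootable via Multiboot1.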
The linker script (linker-x86_64.ld) places headers in dedicated sections:
.multiboot1 (4-byte aligned, first 8 KB) and .multiboot2 (8-byte aligned,
first 32 KB), ensuring bootloaders find them. The kernel loads at physical address
0x100000 (1 MB), the standard Multiboot load address.
Build and boot methods:
# Development: QEMU with -kernel (Multiboot1, no ISO needed)
qemu-system-x86_64 -kernel target/x86_64-unknown-none/release/umka-kernel -serial stdio
# Testing: GRUB ISO boot (Multiboot1 via grub.cfg `multiboot` command)
make iso && qemu-system-x86_64 -cdrom target/umka-kernel.iso -serial stdio
Non-x86 architectures use different boot protocols:
- Device Tree Blob (DTB): Used by AArch64, ARMv7, RISC-V 64, PPC32, and PPC64LE. The firmware or QEMU passes a pointer to a flattened device tree (FDT) in a register at entry (`x0` on AArch64, `r2` on ARMv7, `a1` on RISC-V, `r3` on PPC32, `r3` on PPC64LE). The DTB describes the machine's physical memory layout, interrupt controllers, timers, and peripheral addresses. The format is big-endian with magic `0xD00DFEED`. See Section 2.8 for the parsing specification.
- OpenSBI (RISC-V only): The Supervisor Binary Interface firmware runs in M-mode and provides SBI ecalls for timer, IPI, console, and system reset services to S-mode code. QEMU's built-in OpenSBI occupies physical addresses `0x80000000`–`0x801FFFFF`. At entry, OpenSBI passes `a0 = hart_id` (hardware thread identifier) and `a1 = DTB address`. The kernel must not overwrite the OpenSBI region.
- OpenFirmware / SLOF (PPC64LE): On POWER systems, SLOF (Slimline Open Firmware) or OPAL (OpenPOWER Abstraction Layer) firmware initializes hardware and passes a DTB pointer in `r3`. QEMU's `pseries` machine uses SLOF; bare metal POWER8/9/10 uses OPAL (skiboot). At entry: `r3 = DTB address`, `r4 = 0` (reserved). The kernel runs in hypervisor or supervisor mode.
- U-Boot / OpenFirmware (PPC32): Embedded PowerPC boards typically use U-Boot, which passes a DTB pointer in `r3`. QEMU's `ppce500` machine uses U-Boot or direct kernel boot. At entry: `r3 = DTB address`, `r4 = 0` (reserved), `r5 = 0` (reserved), `r6 = EPAPR_MAGIC` (`0x45504150`), `r7 = sizeof(initial TLB1 mapping)`.
- IPL / SCLP (s390x): IBM Z systems boot via Initial Program Load (IPL). The channel subsystem reads the boot image from a DASD or FCP device into lowcore (physical address 0x0). The PSW at address 0x0 contains the entry point. SCLP (Service Call Logical Processor) provides the early console and hardware configuration discovery (memory extents, CPU topology). QEMU's `s390-ccw-virtio` machine uses CCW-based IPL. No DTB; topology is discovered via SCLP and STSI (Store System Information) instructions. See Section 2.12 for the full boot specification.
- DTB / UEFI (LoongArch64): Loongson systems boot via UEFI firmware or direct kernel load. QEMU's `virt` machine passes a DTB pointer in register `$a1`. The kernel is loaded at physical address `0x200000` (2 MB) in the DMW1 cached window (VA `0x9000000000200000`). DMW1 is cached (CA=1, MAT=1) per the LoongArch convention (matching Linux `CSR_DMW1_INIT`). No trampoline is needed — the kernel entry point is already in the cached window. See Section 2.13 for the full boot specification.
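The one check every DTB-based entry path shares is validating the big-endian FDT magic before touching any other header field. A minimal standalone sketch (the helper name is illustrative):

```rust
/// Returns true if the blob starts with the FDT magic 0xD00DFEED,
/// which is stored big-endian regardless of host endianness.
fn fdt_magic_ok(blob: &[u8]) -> bool {
    blob.len() >= 4
        && u32::from_be_bytes([blob[0], blob[1], blob[2], blob[3]]) == 0xD00D_FEED
}

fn main() {
    // First four bytes of a valid flattened device tree.
    assert!(fdt_magic_ok(&[0xD0, 0x0D, 0xFE, 0xED]));
    // Byte-swapped (little-endian) order must be rejected: a kernel that
    // reads the header with from_le_bytes would fail this check on real DTBs.
    assert!(!fdt_magic_ok(&[0xED, 0xFE, 0x0D, 0xD0]));
    // Truncated blobs are rejected before any field access.
    assert!(!fdt_magic_ok(&[0xD0]));
}
```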
2.2 x86-64 Boot Entry and Initialization¶
2.2.1 x86-64 Entry Sequence¶
The boot assembly (boot/entry.asm, NASM syntax) handles the transition from
32-bit protected mode to 64-bit long mode:
1. GRUB/QEMU loads ELF at 1 MB, jumps to _start in 32-bit protected mode
- eax = Multiboot1 magic (0x2BADB002)
- ebx = pointer to Multiboot info structure
2. _start (32-bit):
a. Save magic: eax → esi (preserved across BSS clear and CPUID check)
b. Set temporary stack at 0x80000 (below kernel)
c. Clear BSS (rep stosd from __bss_start to __bss_end — clobbers edi, ecx, eax)
d. Build identity-map page tables for first 1 GB:
PML4[0] → boot_pdpt | PRESENT | WRITABLE
PDPT[0] → boot_pd | PRESENT | WRITABLE
PD[0..511] → 512 × 2 MB pages (flags: PRESENT | WRITABLE | PAGE_SIZE)
e. Save info ptr: ebx → ebp (preserve across CPUID — ebx is clobbered by CPUID)
f. Verify long mode: CPUID leaf 0x80000001 bit 29
(displays "NO64" on VGA buffer and halts if not available)
g. Restore info ptr: ebp → ebx
h. Enable PAE (CR4 bit 5)
h2. Load CR3 with address of boot_pml4 (the PML4 root page table built
in step 2d). CR3 must be loaded before enabling paging in step 2j.
i. Enable Long Mode (IA32_EFER MSR bit 8)
j. Enable Paging (CR0 bit 31)
k. Load temporary 64-bit GDT (null + code + data descriptors)
l. Far jump to _start64 (selector 0x08 = 64-bit code segment)
3. _start64 (64-bit):
a. Load 64-bit data segments (selector 0x10)
b. Set kernel stack (boot_stack_top, 16 KB in .bss)
c. Clear RFLAGS (`pushq $2; popfq` — sets RFLAGS = 0x2, preserving
the architecturally-required bit 1 = 1 while clearing all flags)
d. Map to 64-bit calling convention: edi = esi (magic), esi = ebx (info ptr)
e. Call umka_main(multiboot_magic=rdi, multiboot_info_ptr=rsi)
Page tables and boot stack are allocated in .bss (zeroed by step 2c):
boot_pml4 (4 KB), boot_pdpt (4 KB), boot_pd (4 KB), boot stack (16 KB).
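Step 2d's page-directory fill can be sketched in plain Rust. This is an illustration of the entry arithmetic only (the actual code is NASM in boot/entry.asm; constant names here are illustrative):

```rust
// x86-64 page-table entry flag bits (architecturally defined).
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const PAGE_SIZE_2M: u64 = 1 << 7; // PS bit in a PD entry selects a 2 MB page

/// Build the 512-entry boot PD: entry i maps the i-th 2 MB physical frame,
/// covering 512 * 2 MB = 1 GB identity-mapped from physical 0.
fn build_boot_pd() -> [u64; 512] {
    let mut pd = [0u64; 512];
    for (i, entry) in pd.iter_mut().enumerate() {
        *entry = (i as u64) * 0x20_0000 | PRESENT | WRITABLE | PAGE_SIZE_2M;
    }
    pd
}

fn main() {
    let pd = build_boot_pd();
    assert_eq!(pd[0], PRESENT | WRITABLE | PAGE_SIZE_2M); // maps phys 0..2 MB
    assert_eq!(pd[1] & !0xFFF, 0x20_0000);                // maps 2 MB..4 MB
    assert_eq!(pd[511] & !0xFFF, 511 * 0x20_0000);        // ends the 1 GB window
}
```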
2.2.2 Kernel Initialization Phases (x86-64)¶
umka_main() detects the boot protocol from the magic value, then runs an
ordered initialization sequence. Each phase depends on the previous.
Canonical Phase Mapping:
The following table maps x86-64 local phase numbers to the canonical architecture-neutral init ordering defined in Section 2.3.
| Canonical Phase | Description | Local Implementation |
|---|---|---|
| 0.1 | arch_early_init | Entry assembly (steps 1–3) + Phase 1 (GDT+TSS) + Phase 2 (IDT+PIC) |
| 0.14 | Early serial | Phase 0.14: COM1 UART init (I/O 0x3F8, 115200 8N1) |
| 0.14a | Microcode load | Phase 0.14a: CPU microcode from early initramfs |
| 0.15 | early_log_init | Phase 0.15: early_log_init() — BSS ring buffer checkpoint |
| 0.2 | identity_map | Entry assembly step 2d (identity-map page tables, 1 GB) |
| 0.3 | parse_firmware_memmap | Phase 3: Multiboot1 E820 memory map parse |
| 0.4 | boot_alloc_init | Phase 3 (bitmap allocator init from E820) |
| 0.5 | reserve_regions | Phase 3 (reserve first 1 MB + kernel image) |
| 0.55 | ACPI table discovery | Phase 0.55: RSDP/RSDT/XSDT scan |
| 0.6 | numa_discover_topology | Phase 3a: ACPI SRAT/SLIT parse |
| 0.7 | cpulocal_bsp_init | Phase 3b: wrmsr(IA32_GS_BASE, &CpuLocalBlock) |
| 0.8a | evolvable_verify | Phase 3c: Evolvable signature verification (MMU already active from entry assembly step 2d) |
| 0.8b | evolvable_map_and_init | Phase 3c (cont.): Evolvable virtual mapping at EVOLVABLE_VIRT_BASE + VTABLE_SLOTS[] population |
| 1.1 | buddy allocator | Phase 4: hand_off_to_buddy() |
| 1.2 | slab allocator | Phase 4a: slab_init() |
| 2.1 | IRQ domain | Phase 8a: IrqDomain setup |
| 2.2 | capability system | Phase 6: CapSpace init |
| 2.3 | scheduler | Phase 9: EEVDF scheduler init |
| 2.35 | syscall MSR config | Phase 10: STAR/LSTAR/SFMASK MSRs |
| 2.7 | workqueue infra | Phase 9a: workqueue_init_early() |
| 2.8 | RCU | Phase 9a2: rcu_init() |
| 2.9 | LSM framework | Phase 9a3: lsm_init() |
| 3.5 | PID allocator | Phase 9b: pid_alloc_init() |
| 5.2 | VFS init | Phase 10a: dentry/inode/mount init (unordered with 4.6, see Section 2.3) |
| 4.6 | network stack | Phase 10b: net_init() (unordered with 5.2, see Section 2.3) |
Phase 0.14: COM1 Serial Init (arch-specific, x86-64 only)
Initialize COM1 UART at I/O port 0x3F8 for early serial output.
This must precede Phase 0.15 (early_log_init) so that early_log
messages can be mirrored to the serial console.
Configuration:
- Disable interrupts: outb(0x3F8+1, 0x00) (IER = 0)
- Set DLAB: outb(0x3F8+3, 0x80) (LCR DLAB bit)
- Divisor for 115200 baud: outb(0x3F8+0, 0x01) (DLL = 1)
outb(0x3F8+1, 0x00) (DLM = 0)
- Line control 8N1: outb(0x3F8+3, 0x03) (LCR = 8 data, no parity, 1 stop)
- Enable FIFO: outb(0x3F8+2, 0xC7) (FCR = enable, clear, 14-byte threshold)
- Modem control: outb(0x3F8+4, 0x0B) (MCR = DTR + RTS + OUT2)
After this point, serial_putc() is available for diagnostics.
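The seven register writes above can be expressed as a sequence against an injected port-write function. This sketch records the writes through a mock instead of real `outb` (the mock and the function shape are assumptions for illustration):

```rust
const COM1: u16 = 0x3F8;

/// Phase 0.14 init sequence, written against any outb-shaped sink so it can
/// be exercised without port I/O.
fn com1_init(outb: &mut dyn FnMut(u16, u8)) {
    outb(COM1 + 1, 0x00); // IER = 0: disable UART interrupts
    outb(COM1 + 3, 0x80); // LCR: set DLAB to expose the divisor latch
    outb(COM1 + 0, 0x01); // DLL = 1 -> 115200 baud
    outb(COM1 + 1, 0x00); // DLM = 0
    outb(COM1 + 3, 0x03); // LCR = 8N1, DLAB cleared
    outb(COM1 + 2, 0xC7); // FCR: enable + clear FIFOs, 14-byte threshold
    outb(COM1 + 4, 0x0B); // MCR: DTR + RTS + OUT2
}

fn main() {
    let mut writes: Vec<(u16, u8)> = Vec::new();
    com1_init(&mut |port, val| writes.push((port, val)));
    assert_eq!(writes.len(), 7);
    assert_eq!(writes[0], (COM1 + 1, 0x00)); // interrupts off first
    assert_eq!(writes.last(), Some(&(COM1 + 4, 0x0B))); // modem control last
}
```

The ordering matters: DLAB must be set before the divisor bytes land on offsets 0 and 1, and cleared again before those offsets revert to data/IER.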
Phase 0.14a: CPU Microcode Load (arch-specific, x86-64 only)
Load CPU microcode update from the early initramfs CPIO
(path: kernel/x86/microcode/AuthenticAMD.bin or
GenuineIntel.bin) or from a built-in blob linked into the
kernel image. Microcode must be applied before Phase 3
(CPUID-dependent decisions) because microcode updates can
change CPUID feature flags, errata workarounds, and MSR
behavior. Application sequence:
1. Read CPUID vendor string (leaf 0) to select Intel or AMD path.
2. Locate microcode blob (early CPIO scan or built-in).
3. Intel: wrmsr(IA32_BIOS_UPDT_TRIG, blob_phys_addr).
AMD: wrmsr(MSR_AMD64_PATCH_LOADER, blob_phys_addr).
4. Verify update: re-read CPUID / microcode revision MSR.
If no microcode blob is available, continue without update.
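Step 1's vendor selection hinges on how CPUID leaf 0 encodes the vendor string: the 12 bytes arrive in EBX, EDX, ECX (in that register order), little-endian within each register. A standalone sketch of the decode (register values below are the well-known encodings):

```rust
/// Reassemble the CPUID leaf-0 vendor string from EBX, EDX, ECX.
fn vendor_string(ebx: u32, edx: u32, ecx: u32) -> String {
    let mut bytes = Vec::with_capacity(12);
    for reg in [ebx, edx, ecx] {
        bytes.extend_from_slice(&reg.to_le_bytes());
    }
    String::from_utf8(bytes).unwrap_or_default()
}

fn main() {
    // "GenuineIntel": EBX = "Genu", EDX = "ineI", ECX = "ntel"
    assert_eq!(vendor_string(0x756E_6547, 0x4965_6E69, 0x6C65_746E), "GenuineIntel");
    // "AuthenticAMD": EBX = "Auth", EDX = "enti", ECX = "cAMD"
    assert_eq!(vendor_string(0x6874_7541, 0x6974_6E65, 0x444D_4163), "AuthenticAMD");
}
```

Note the EBX/EDX/ECX order; a natural EBX/ECX/EDX read produces the scrambled string "GenuntelineI" and selects no vendor path.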
Phase 0.15: Early Log Ring Init (canonical Phase 0.15)
Call early_log_init() — a sequencing checkpoint confirming the
BSS-resident EarlyLogRing ([Section 2.3](#boot-init-cross-arch--early-boot-log-ring))
is accessible. After this point, early_log() emits messages to
both the ring buffer and the serial console (Phase 0.14).
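The essential property of a BSS-resident early log ring is that it needs no allocator: a fixed buffer and a wrapping cursor. A minimal sketch (the 64-byte size and field names are assumptions, not the EarlyLogRing layout from Section 2.3):

```rust
/// Fixed-capacity, allocation-free log ring; old bytes are overwritten
/// once the cursor wraps.
struct LogRing {
    buf: [u8; 64],
    head: usize, // next write position
    wrapped: bool,
}

impl LogRing {
    const fn new() -> Self {
        LogRing { buf: [0; 64], head: 0, wrapped: false }
    }

    fn write(&mut self, msg: &[u8]) {
        for &b in msg {
            self.buf[self.head] = b;
            self.head = (self.head + 1) % self.buf.len();
            if self.head == 0 {
                self.wrapped = true;
            }
        }
    }
}

fn main() {
    let mut ring = LogRing::new();
    ring.write(b"phase 0.15 ok");
    assert_eq!(&ring.buf[..13], b"phase 0.15 ok");
    assert!(!ring.wrapped);
    ring.write(&[b'.'; 64]); // more than the remaining space
    assert!(ring.wrapped);   // cursor has passed the end at least once
}
```

Because a `const fn new()` can initialize the ring at compile time, the structure lives in BSS and is usable the moment BSS is cleared, which is exactly the checkpoint Phase 0.15 confirms.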
Phase 1: GDT + TSS
Load a proper GDT with TSS. Configure IST1 with a dedicated
16 KB stack for double-fault handling.
Phase 2: IDT + PIC
Install exception handlers (vectors 0–31) and IRQ handlers
(vectors 32–47). Remap the 8259 PIC: IRQ0 → vector 32,
IRQ8 → vector 40.
**Note**: Exception vectors 0–31 MUST be installed before any
operation that could fault. The buddy allocator (Phase 4)
could trigger a page-fault (#PF, vector 14) on corrupted
memory, and the stack probe code could trigger a double-fault
(#DF, vector 8). IDT installation for critical vectors
0–31 corresponds to canonical Phase 0.1 (arch_early_init):
the entry assembly + GDT + IDT stub collectively form the
earliest exception-safe state. If the IDT is deferred past
Phase 2, a fault before that point causes a triple-fault
(immediate CPU reset with no diagnostic output).
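The PIC remap above is a uniform offset: after remapping, legacy IRQ n lands on IDT vector 32 + n for both the master (IRQ0 to 7) and slave (IRQ8 to 15) chips. A trivial sketch of the mapping (helper name illustrative):

```rust
/// Map a legacy 8259 IRQ line to its remapped IDT vector.
fn irq_to_vector(irq: u8) -> u8 {
    assert!(irq < 16, "legacy 8259 PIC handles IRQ0-15 only");
    32 + irq
}

fn main() {
    assert_eq!(irq_to_vector(0), 32);  // PIT timer, as wired in Phase 2
    assert_eq!(irq_to_vector(8), 40);  // first slave-PIC line
    assert_eq!(irq_to_vector(15), 47); // last vector in the remapped range
}
```

Without the remap, IRQ0 to 7 collide with CPU exception vectors 8 to 15 (double fault, #GP, #PF), which is why the remap is part of the same phase that installs the exception handlers.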
Phase 0.55: ACPI Table Discovery (arch-specific, x86-64 only)
Scan for the RSDP (Root System Description Pointer):
1. Search the EBDA (Extended BIOS Data Area) — the 1 KB region
starting at the segment address stored at physical 0x040E.
2. Search the region 0x000E_0000–0x000F_FFFF (legacy BIOS ROM area).
3. On UEFI systems: RSDP address is provided in the EFI System Table
(obtained via Multiboot2 EFI tag or directly from UEFI boot services).
Validate RSDP checksum (v1: 20-byte sum; v2: 36-byte extended checksum).
Parse RSDT (32-bit pointers, RSDP v1) or XSDT (64-bit pointers,
RSDP v2+). Build an index of ACPI table signatures and physical
addresses for use by subsequent phases (SRAT for Phase 3a, MADT
for Phase 11, MCFG for PCI enumeration, DMAR for IOMMU, HPET for
timekeeping). Required before NUMA discovery (Phase 3a).
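The RSDP v1 validation rule is that the first 20 bytes sum to zero modulo 256. A standalone sketch of the check (the fake table built in `main` is illustrative):

```rust
/// RSDP v1 checksum: the 20 header bytes must wrap-sum to 0.
fn rsdp_v1_checksum_ok(rsdp: &[u8; 20]) -> bool {
    rsdp.iter().fold(0u8, |acc, &b| acc.wrapping_add(b)) == 0
}

fn main() {
    // Build a fake RSDP: "RSD PTR " signature, then patch the checksum byte
    // (offset 8) so the 20-byte sum wraps to zero.
    let mut rsdp = [0u8; 20];
    rsdp[..8].copy_from_slice(b"RSD PTR ");
    let sum: u8 = rsdp.iter().fold(0u8, |a, &b| a.wrapping_add(b));
    rsdp[8] = 0u8.wrapping_sub(sum);
    assert!(rsdp_v1_checksum_ok(&rsdp));

    rsdp[9] ^= 1; // flip one bit elsewhere in the structure
    assert!(!rsdp_v1_checksum_ok(&rsdp));
}
```

The v2 extended checksum works the same way over 36 bytes; both must pass before the RSDT/XSDT pointers are trusted.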
Phase 3: Physical Memory Manager
Parse Multiboot1 memory map (see [Section 2.9](#boot-memory-management)).
Initialize bitmap allocator: mark available regions free,
reserve first 1 MB (BIOS/legacy) and kernel image.
Phase 3a: NUMA Topology Discovery (canonical Phase 0.6)
Parse ACPI SRAT (Static Resource Affinity Table) and SLIT
(System Locality Information Table). Build NumaTopology:
node_ranges[], distance[][] matrix, node_cpus[] masks.
On non-NUMA systems: single node 0 covering all memory.
Required before slab_init() because slab caches are NUMA-aware
(one partial list per node).
Cross-ref: [Section 4.11](04-memory.md#numa-topology-and-policy).
Phase 3b: CpuLocal BSP Init (canonical Phase 0.7)
Initialize CpuLocalBlock for the BSP and set the arch-specific
fast-access register:
wrmsr(IA32_GS_BASE, &CPU_LOCAL_BLOCKS[0])
On x86-64, the `gs` segment base register points to the per-CPU
CpuLocalBlock. All subsequent per-CPU data access (current task,
preempt count, IRQ stack pointer, slab magazines) goes through
`gs`-relative addressing. See [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--initialization-sequence)
for the full init sequence and memory ordering requirements.
Phase 3c: Evolvable Activation (canonical Phases 0.8a + 0.8b)
Activate the Evolvable code image. On x86-64, MMU is already
active from entry assembly step 2d, so Phases 0.8a (verify) and
0.8b (map+init) run as a single local phase. This is the
transition from Nucleus-only execution to the full kernel with
replaceable policy modules.
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
Sequence:
1. (Phase 0.8a) Nucleus standalone LMS verifier (~2 KB) checks
the Evolvable image signature embedded in the kernel binary.
On failure, panic with "Evolvable signature verification failed".
2. Map Evolvable .text (RX) and .rodata (RO) pages at
EVOLVABLE_VIRT_BASE (0xFFFF_FFFF_A000_0000 on x86-64).
Allocate fresh RW pages for .data+.bss via BootAlloc, map
at Evolvable's linked RW VAs.
3. Call evolvable_init() which populates all VTABLE_SLOTS[]
([Section 13.18](13-device-classes.md#live-kernel-evolution)) entries with Evolvable function
pointers. After this point, Evolvable code is callable via
vtable dispatch.
**Invariant**: During Phases 0.1–3b, Nucleus code MUST NOT
dispatch through any VTABLE_SLOTS[] or replaceable policy vtable.
Phase 4: Kernel Heap (canonical Phase 1.1)
Initialize the buddy allocator with all available physical
memory discovered from the Multiboot memory map (Phase 3).
The buddy allocator manages power-of-two blocks (order 0–10,
4 KB–4 MB) and provides the foundation for all subsequent
allocations. See [Section 4.2](04-memory.md#physical-memory-allocator).
Phase 4a: Slab Allocator (canonical Phase 1.2)
Initialize slab caches on top of the buddy allocator (one
partial list per NUMA node). After this point, Box::new,
Arc::new, and typed allocations are available through the
slab fast path.
Cross-ref: [Section 4.3](04-memory.md#slab-allocator).
Phase 5: Virtual Memory
Verify identity mapping (virt_to_phys on mapped addresses).
Test new page mappings: allocate frame, map at 0x40000000,
write/read volatile, unmap, free frame.
Phase 6: Capability System (canonical Phase 2.2)
Create CapSpace, test create/check/attenuate operations.
Phase 7: IPC / MPK Detection
Query CPUID for PKU support. Test domain alloc/free.
Phase 8: Enable Interrupts (timer IRQ only)
Three-step interrupt enable protocol:
1. All IOAPIC redirection entries except the timer vector
(vector 32) remain masked. The timer uses a direct IDT
handler installed in Phase 2 — no IrqDomain lookup required.
2. Execute STI to enable interrupts.
3. Verify timer ticks are incrementing (read LAPIC timer count
or the CpuLocal tick counter across a short spin).
**Postcondition**: only the timer IRQ (vector 32) is unmasked.
All other IOAPIC redirection entries are masked. Spurious
interrupts to unmasked vectors are harmless (the IDT handler
logs and returns). No IrqDomain lookup is needed for the timer
because it uses a direct IDT handler.
Phase 8a: IrqDomain Setup (canonical Phase 2.1)
Create root IrqDomain for the IOAPIC interrupt controller.
Register IrqChip implementations. Map hardware IRQ lines
to virtual IRQ numbers. After IrqDomain setup, drivers
unmask their individual IRQs during probe (Phase 9+).
Cross-ref: [Section 3.12](03-concurrency.md#irq-chip-and-irqdomain-hierarchy).
Phase 9: Scheduler (canonical Phase 2.3)
Initialize EEVDF scheduler. Spawn two test threads
(thread-A, thread-B). Run cooperative yield loop, then
enable preemptive scheduling via timer tick callback.
Phase 9a: Workqueue Framework (canonical Phase 2.7)
Initialize named kernel worker thread pools (standard 10 per
[Section 3.11](03-concurrency.md#workqueue-deferred-work)). Required before any subsystem
can defer work.
Phase 9a2: RCU Init (canonical Phase 2.8)
Initialize RCU infrastructure (grace period tracking, callback
queues, per-CPU RCU state). Required before any RCU-protected
data structure is used. Cross-ref: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths).
Phase 9a3: LSM Framework Init (canonical Phase 2.9)
Initialize the Linux Security Module framework. Create empty
LSM registry, register compiled-in LSMs (if any), finalize
static-key NOP patching for hook sites. Must complete before
the first LSM-mediated access check (device registry at Phase
4.2+). Cross-ref: [Section 9.8](09-security.md#linux-security-module-framework).
Phase 9b: PID Allocator (canonical Phase 3.5)
Initialize PID ID allocator (Idr<Arc<Task>>). Required before
fork()/clone() can assign PIDs.
Cross-ref: [Section 8.1](08-process.md#process-and-task-management).
Phase 10: SYSCALL/SYSRET (canonical Phase 2.35)
Configure STAR/LSTAR/SFMASK MSRs. Register three syscall
handlers: write(1), getpid(39), exit_group(231).
Test with inline SYSCALL instruction from kernel mode.
This is placed after the scheduler (Phase 9) but before
VFS init, mapping to approximately canonical Phase 2.35.
Phase 10a: VFS Initialization (canonical Phase 5.2)
Initialize dentry cache, inode cache, mount table. Mount
initramfs or early rootfs.
Cross-ref: [Section 14.1](14-vfs.md#virtual-filesystem-layer).
Phase 10b: Network Stack Initialization (canonical Phase 4.6)
Initialize umka-net Tier 1 domain. Set up loopback interface,
routing table infrastructure, socket layer.
Cross-ref: [Section 16.31](16-networking.md#network-service-provider).
Boot Phase to Subsystem Init Mapping:
| Phase | Subsystem | Canonical Phase | Canonical Spec |
|---|---|---|---|
| 0.14 | COM1 Serial Init | (arch-specific) | Section 2.2 (this section) |
| 0.14a | CPU Microcode Load | (arch-specific) | Section 2.2 (this section) |
| 0.15 | Early Log Ring | 0.15 | Section 2.3 |
| 1 | GDT + TSS | 0.1 | Section 2.2 (this section) |
| 2 | IDT + PIC | 0.1 | Section 2.2 (this section) |
| 0.55 | ACPI Table Discovery | (arch-specific) | Section 2.2 (this section) |
| 3 | Physical Memory Manager | 0.3–0.5 | Section 4.2 |
| 3a | NUMA Topology Discovery | 0.6 | Section 4.11 |
| 3b | CpuLocal BSP Init | 0.7 | Section 3.2 |
| 3c | Evolvable Activation | 0.8 | Section 2.21 |
| 4 | Kernel Heap (buddy) | 1.1 | Section 4.2 |
| 4a | Slab Allocator | 1.2 | Section 4.3 |
| 5 | Virtual Memory | (arch-specific, VMM verification) | Section 4.8 |
| 6 | Capability System | 2.2 | Section 9.1 |
| 7 | IPC / MPK Detection | — | Section 11.2 |
| 8 | Enable Interrupts | — | Section 2.2 (this section) |
| 8a | IrqDomain Setup | 2.1 | Section 3.12 |
| 9 | Scheduler | 2.3 | Section 7.1 |
| 9a | Workqueue Framework | 2.7 | Section 3.11 |
| 9a2 | RCU Init | 2.8 | Section 3.1 |
| 9a3 | LSM Framework Init | 2.9 | Section 9.8 |
| 9b | PID Allocator | 3.5 | Section 8.1 |
| 10 | SYSCALL/SYSRET | 2.35 | Section 19.1 |
| 10a | VFS Initialization | 5.2 | Section 14.1 |
| 10b | Network Stack Init | 4.6 | Section 16.31 |
| 11–15 | SMP Bringup | 3.1–3.3 | Section 2.3 |
Note: Local Phase 5 is an architecture-specific VMM verification step with no canonical counterpart. The identity map is created in the entry assembly (canonical Phase 0.2); local Phase 5 verifies VMM post-buddy.
2.3 Boot Init Reference and SMP Bringup¶
2.3.1 Kernel Init Phase Reference (Cross-Architecture)¶
The x86-64 phases above are implementation-specific (GDT, PIC, etc.). This table defines the canonical, architecture-neutral init ordering that all architectures must follow; individual subsystem sections provide the detailed specifications.
Canonical entry point (all 8 architectures):
/// The Rust entry point called from per-architecture assembly after
/// register/stack setup. The signature is uniform across all architectures:
/// 64-bit arches pass arguments naturally; 32-bit arches (ARMv7, PPC32)
/// use register pairs per their ABI (AAPCS r0:r1/r2:r3, PPC EABI r3:r4/r5:r6).
/// umka_main immediately constructs a typed BootInfo struct as its first action,
/// providing type safety from the Rust boundary onward.
pub extern "C" fn umka_main(multiboot_magic: u64, multiboot_info_ptr: u64)
| Phase | Step | Subsystem | Function | Prerequisite | Source Section |
|---|---|---|---|---|---|
| 0.1 | Firmware handoff | Boot | arch_early_init() | — | §2.1 |
| 0.1a | CPU feature detection | Boot | cpu_features_detect() → cpu_features_freeze() — arch-specific: x86-64 CPUID, AArch64 ID registers, ARMv7 MIDR/ID_MMFR/ID_ISAR, RISC-V misa/extensions, PPC32 PVR, PPC64LE PVR, s390x STFLE/STIDP, LoongArch64 CPUCFG. Must run before CPUID-dependent decisions (microcode, page table format, isolation mechanism selection). Populates CpuFeatureSet; cpu_features_freeze() seals the feature set and enables static-key patching. | 0.1 | Section 2.16 |
| 0.14 | Early serial init | Boot | arch_serial_init() — arch-specific: x86-64 COM1 (0x3F8), AArch64 PL011, ARMv7 PL011, RISC-V 16550/SBI, PPC32 NS16550 via CCSR, PPC64LE UART, s390x SCLP, LoongArch64 NS16550. Must be available before Phase 0.15 for diagnostic output. | 0.1 | §2.2 |
| 0.15 | Early log ring | Observability | early_log_init() | 0.14 | Section 20.1 |
| 0.2 | Identity mapping / MMU enable | VMM | setup_identity_map() — on architectures where the buddy allocator needs page tables (x86-64, AArch64/ARMv7 for cached access), this runs before Phase 1.1. On architectures with firmware-provided identity access (LoongArch64 DMW, s390x prefix mapping), this may be deferred. See per-arch boot sections for the concrete ordering. Must complete before Phase 0.8b. | 0.1 | §4.6 |
| 0.3 | Memory map parse | Boot | parse_firmware_memmap() — arch-specific: x86-64 E820/UEFI, AArch64/ARMv7/RISC-V/PPC32/PPC64LE DTB /memory, s390x SCLP Read SCP Info (init_from_sclp()), LoongArch64 DTB or UEFI memory map. | 0.1 | §2.1 |
| 0.4 | Boot allocator | Memory | boot_alloc_init() | 0.3 | §4.1 |
| 0.5 | Reserve regions | Memory | reserve_kernel_initramfs_acpi() | 0.4 | §4.1 |
| 0.55 | Firmware table discovery | Boot | firmware_table_discover() — arch-specific: x86-64 RSDP/RSDT/XSDT scan (EBDA + BIOS ROM), AArch64 SBSA ACPI from UEFI system table, s390x SCLP-provided tables, LoongArch64 ACPI from UEFI. No-op on DTB-only platforms (ARMv7, RISC-V, PPC32) where firmware tables come from the DTB itself. Prerequisite for NUMA discovery (Phase 0.6). | 0.5 | Section 2.4 |
| 0.6 | NUMA discovery | Memory | numa_discover_topology() — arch-specific: x86-64 ACPI SRAT/SLIT, AArch64/ARMv7 DTB+PPTT, RISC-V DTB, PPC32/PPC64LE DTB+ibm,associativity, s390x SCLP+STSI, LoongArch64 ACPI SRAT or DTB. | 0.3 | §4.9 (fallback: single node 0 if no SRAT/DT) |
| 0.7 | CpuLocal BSP | Concurrency | cpulocal_bsp_init() — arch-specific register: x86-64 wrmsr(IA32_GS_BASE), AArch64 msr TPIDR_EL1, ARMv7 mcr TPIDRPRW, RISC-V mv tp, <addr> (tp holds CpuLocal base; sscratch set to 0 for kernel-mode indicator), PPC32 mtspr SPRG3, PPC64LE r13, s390x PREFIX/lowcore, LoongArch64 move $r21, <addr> ($r21/$u0 is the kernel per-CPU register; KSave3 CSR stores a copy for trap entry). | 0.4 | §3.1.2.1 |
| 0.8a | Evolvable verify | Boot | evolvable_verify() — signature verification of the Evolvable image. Operates on physical addresses only (no MMU required). ML-DSA-65 + Ed25519 hybrid signature check against embedded public key. On s390x, uses LMS (Leighton-Micali Signatures) for stateless verification without DAT. | 0.7 | Section 2.21 |
| 0.8b | Evolvable map + init | Boot | evolvable_map_and_init() — map Evolvable code at EVOLVABLE_VIRT_BASE (kernel virtual address), populate VTABLE_SLOTS[], run Evolvable init callbacks. Requires MMU enabled (Phase 0.2 complete). | 0.2, 0.8a | Section 2.21 |
Nucleus/Evolvable invariant during early boot: During Phases 0.1–0.8a, Nucleus code
MUST NOT dispatch through any VTABLE_SLOTS[] or replaceable policy vtable.
Phase 0.8a verifies the Evolvable image signature using physical addresses (no MMU
required). Phase 0.8b maps Evolvable at virtual addresses and populates
VTABLE_SLOTS[] — this requires MMU to be enabled (Phase 0.2 complete). All code
paths during Phases 0.1–0.8a use only static, compile-time-resolved Nucleus
functions. The first call through an Evolvable vtable is permissible only after
Phase 0.8b completes and the Evolvable verification/initialization protocol has
succeeded.
Phase 0.2 ordering constraint: On x86-64, Phase 0.2 (identity mapping) runs
in entry assembly before umka_main() — both 0.8a and 0.8b execute with MMU
already active. On all other architectures, Phase 0.2 runs during umka_main(),
typically between Phase 0.8a and 0.8b (after signature verification but before
virtual mapping). The per-arch boot sequences document the exact local ordering.
s390x further splits Phase 0.2 within its DAT setup sequence. The invariant is:
0.8a may precede 0.2; 0.8b must follow 0.2.
BootstrapContext compile-time enforcement: Phase 0.8b Evolvable bootstrap takes
&BootstrapContext (not &KernelContext). BootstrapContext provides
register_vtable() and init_subsystem() but has NO cap_create() or
cap_delegate() methods — the type system prevents creating user-visible
capabilities during bootstrap (CapTable does not exist until Phase 2.2). After
Phase 2.2, the kernel constructs KernelContext (which wraps BootstrapContext
+ CapTable reference) and passes it to subsequent init phases. This is zero-cost:
BootstrapContext and KernelContext are different struct types with different
method sets, but share the same underlying state pointers.
/// Available during Phase 0.8b–2.1 (before CapTable exists).
pub struct BootstrapContext {
pub vtable_slots: &'static VtableSlotArray,
pub boot_alloc: &'static BootAlloc,
}
impl BootstrapContext {
pub fn register_vtable(&self, slot: VtableSlotId, ptr: *const ()) { /* ... */ }
pub fn init_subsystem(&self, name: &str) -> Result<(), InitError> { /* ... */ }
// NOTE: no cap_create, cap_delegate, cap_revoke — type system enforced.
}
/// Available after Phase 2.2 (CapTable initialized).
pub struct KernelContext {
pub bootstrap: BootstrapContext,
pub cap_table: &'static CapTable,
}
impl KernelContext {
pub fn cap_create(&self, /* ... */) -> CapHandle { /* ... */ }
pub fn cap_delegate(&self, /* ... */) -> Result<CapHandle, CapError> { /* ... */ }
}
| 1.1 | Buddy allocator | Memory | mem_init() → hand_off_to_buddy() | 0.4, 0.6 | §4.2 |
| 1.2 | Slab allocator | Memory | slab_init() | 1.1 | §4.3 |
| 1.2.2 | SWIOTLB | Memory | swiotlb_init() — always allocate bounce buffer pool from low memory (64 MB default, adjustable via swiotlb= kernel command line). Allocated early before IOMMU discovery; released post-probe at Phase 5.41 if all devices have IOMMU coverage. | 1.1 | Section 4.14 |
| 1.3 | Per-CPU magazines BSP | Memory | slab_init_cpu_magazines(0) | 1.2 | §3.1.2.1 |
| 1.3.1 | Crypto builtin registration | Security | crypto_register_builtin_algs() — registers software implementations of SHA-256, AES, Ed25519 as builtin algorithms. These are available before the full crypto API (crypto_init() at Phase 3.6) for early signature verification (e.g., Evolvable image verification by the Nucleus standalone verifier does NOT use this — it uses its own ~2KB LMS verifier. This registration enables module signature checks that may run before Phase 3.6). | 1.2 (slab) | Section 10.1 |
| 1.4 | Page cache | Memory | page_cache_init() | 1.2 | §4.4 |
| 1.5 | Clock framework | Boot | clock_tree_init() | 1.2 | Section 2.24 |
| 2.1 | IRQ domain | Concurrency | irq_domain_init() | 1.2 | §3.1.12 |
| 2.2 | Capability system | Security | cap_table_init() | 1.2 | §9.1 |
| 2.3 | Scheduler | Scheduling | sched_init() | 2.1 | §7.1 |
| 2.35 | Syscall entry setup | Boot | arch_syscall_init() — arch-specific syscall entry mechanism configuration. Must complete before any userspace entry. x86-64: STAR/LSTAR/SFMASK MSRs (this phase). AArch64: VBAR_EL1 (satisfied by Phase 0.1). ARMv7: exception vectors (satisfied by Phase 0.1). RISC-V: stvec (satisfied by Phase 0.1). PPC32/PPC64LE: sc handler (satisfied by Phase 0.1). s390x: SVC old/new PSW (satisfied by Phase 0.1). LoongArch64: CSR.EENTRY (satisfied by Phase 0.1). Arches that configure the syscall mechanism in Phase 0.1 satisfy this constraint transitively; early completion is not a violation. | 2.1 | §19.1 |
| 2.4 | Idle thread | Scheduling | idle_thread_create() | 2.3 | §7.1 |
| 2.5 | Timekeeping | Scheduling | timekeeping_init() | 2.1 | §7.5 |
| 2.6 | Timer wheel | Scheduling | timer_wheel_init() | 2.5 | §7.5 |
| 2.7 | Workqueue infra | Concurrency | workqueue_init_early() | 2.3 | §3.1.11 |
| 2.8 | RCU | Concurrency | rcu_init() | 2.7 | §3.1.6 |
| 2.9 | LSM framework | Security | lsm_init() | 1.2, 2.2 | Section 9.8 |
| 3.1 | AP trampoline | Boot | setup_ap_trampoline() | 2.3 | §2.1 |
| 3.2 | AP bringup | Boot | boot_secondary_cpus() | 3.1 | §3.1.2.1 |
| 3.3 | Per-CPU magazines APs | Memory | Each AP calls slab_init_cpu_magazines(n) for itself | 3.2 | §3.1.2.1 |
| 3.5 | PID allocator | Process | pid_alloc_init() | 1.2 | Section 8.1 |
| 3.6 | Crypto API | Security | crypto_init() | 1.2 | Section 10.1 |
| 3.7 | KABI keyring | Security | kabi_keyring_init() | 3.6 | Section 10.2 |
| 3.8 | FMA/tracing | Observability | fma_init() | 2.7, 2.8, 1.2 | Section 20.1 |
| 3.9 | Tracepoint registration | Observability | tracepoint_init() | 3.8 | Section 20.2 |
| 4.1 | IOMMU | Drivers | iommu_init() | 1.2 | §11.4 |
| 4.2 | Device registry | Drivers | device_registry_init() | 1.2, 2.2 | §11.4.3 |
| 4.25 | IRQ controller | Drivers | irq_chip_init() | 2.1 | §3.1.12 |
| 4.3 | KABI runtime | Drivers | kabi_runtime_init() | 4.2, 3.7 | §12.1 |
| 4.4a | Bus enumeration | Drivers | pci_enumerate() / dt_probe() — Internal ordering: (1) PCI/platform bus scan (enumerate devices), (2) device-driver matching (match table lookup), (3) driver probe calls (ordered by bus topology — parent before child). Creates named workqueues pm-async, fw-loader, dma-fence, hotplug, mod-loader before first probe (see Section 3.11). See Section 11.4 for the detailed probe sequence. | 2.7, 4.1, 4.3, 4.25 | §11.5 |
| 4.4b | Clock tree DT population | Boot | clock_tree_populate_dt() | 4.4a | Section 2.24 |
| 4.4c | Regulator framework | Drivers | regulator_init() — register regulator providers from DT/ACPI, resolve supply chains, apply boot constraints. Consumers (device drivers) reference regulators during Phase 5.x probe via regulator_get(). | 4.4b | Section 13.27 |
| 4.5 | Block layer | Storage | block_init() | 1.2, 2.7 | Section 15.2 |
| 4.6 | Network stack | Networking | net_init() | 1.2, 2.7, 4.2 | Section 16.2 |
| 4.7 | eBPF subsystem | SysAPI | ebpf_init() — initializes verifier, JIT compiler, map infrastructure, and bpffs pseudo-filesystem. Required before any BPF program can be loaded (XDP programs may be attached during NIC driver probe at Phase 5.1+). | 1.2, 4.6 | Section 19.2 |
| 4.8 | Tier M peer detection | Drivers | tier_m_peer_detect() — bus-agnostic: for each device discovered in Phase 4.4a, check bus-specific magic (PCIe BAR0 magic 0x554D4B41, s390x SENSE ID CU type 0x554D, virtio device ID VIRTIO_ID_UMKA_PEER, USB interface class + vendor magic). Tier M devices: allocate ring pair (BAR2 / QDIO queue / virtqueue / USB bulk), map into isolation domain, run peer join handshake (JoinRequest/JoinAccept via ring pair), receive CapAdvertise, create PeerServiceProxy entries in KabiServiceRegistry. Non-Tier-M devices: proceed to KABI driver probe at Phase 5.x. | 4.4a, 4.1, 4.2, 4.3 | Section 5.11 |
| 5.1 | Tier 0 drivers | Drivers | load_tier0_drivers() | 4.4b | §11.2 |
| 5.2 | VFS init | VFS | vfs_init() → mount devtmpfs | 1.4, 4.2 | §14.1 |
| 5.25 | Filesystem type registration | VFS | register_filesystem_types() | 5.2 | Section 14.1 |
| 5.3 | Tier 1 drivers | Drivers | load_tier1_drivers() | 5.1, 5.2 | §11.2 |
| 5.35 | TPM init | Security | tpm_init() | 5.3, 3.6, 4.4a | Section 9.4 |
| 5.36 | EVM init | Security | evm_init() | 5.35, 2.9, 3.6 | Section 9.5 |
| 5.4 | Storage probe | Storage | storage_probe() — NVMe/SCSI/AHCI | 5.3, 4.5 | Section 15.2 |
| 5.41 | SWIOTLB release check | Memory | swiotlb_release_if_unused() — iterate all probed devices; if every device has IOMMU coverage, release the 64 MB bounce buffer pool back to the buddy allocator. Must run after all device probing (Phase 4.4a + 5.1) and storage probe (5.4) complete so the device registry is fully populated, including late-probed storage controllers. | 5.1, 4.4a, 5.3, 5.4 | Section 4.14 |
| 5.45 | Device-mapper init | Storage | dm_init() | 5.4 | Section 15.2 |
| 5.5 | Rootfs mount | VFS | mount_rootfs() | 5.4, 5.25 | §14.1 |
| 5.6 | Initramfs release | Memory | free_initramfs() — release initramfs pages back to the buddy allocator. Runs AFTER rootfs is mounted (Phase 5.5) and all initramfs-resident modules have been loaded. Modules loaded from initramfs that are not mbs_exclude have their binaries in the MBS (Section 11.9) and do not need the initramfs pages. Modules with mbs_exclude=true that were only in initramfs (not on rootfs) CANNOT be reloaded after this phase — this is acceptable because mbs_exclude should only be set on non-critical media drivers that are also present on the rootfs. | 5.5, 4.8 | Section 4.2 |
| 6.1 | NFS client | Storage | nfs_init() | 5.2, 4.6 | Section 15.11 |
| 6.2 | FUSE init | VFS | fuse_init() | 5.2, 4.3 | Section 14.11 |
| 6.3 | DLM init | Storage | dlm_init() | 4.6 | Section 15.15 |
| 7.0a | On-host peer transport | Distributed | tier_m_transport_activate() — for each Tier M peer with EXTERNAL_NETWORK or RDMA_CAPABLE: ServiceBind for transport service, create ClusterTransport binding using the NIC's ring pair. On RDMA-capable NICs, the ring pair carries native RDMA WQEs. On Ethernet-only NICs, TcpPeerTransport uses TxPacket/RxPacket. This makes remote peers reachable WITHOUT loading a KABI NIC driver. | 4.8, 4.6 | Section 5.10 |
| 7.1 | RDMA transport | Distributed | rdma_transport_init() — if a Tier M NIC was activated at Phase 7.0a, this phase uses the existing transport binding. Otherwise, loads a KABI NIC driver (mlx5, ixgbe, etc.) and builds the transport from the driver's netdev. | 5.3, 4.6, 7.0a | Section 5.4 |
| 7.2 | Cluster join | Distributed | cluster_join() | 7.1 | Section 5.1 |
| 7.3 | DSM init | Distributed | dsm_init() | 7.2 | Section 6.1 |
| 8.1 | Init process creation | Process | create_init_process() — create PID 1 (init) with root credentials: uid=0, gid=0, cap_effective=CAP_FULL_SET, cap_permitted=CAP_FULL_SET, cap_bounding=CAP_FULL_SET, cap_inheritable=0, cap_ambient=0, user_ns=init_user_ns. FsInfo initialized to rootfs mount root dentry. This is the ONLY task that receives CAP_FULL_SET without delegation — all subsequent tasks inherit or receive capabilities through fork()/exec() semantics (Section 9.9). The CapTable (Section 9.1) is empty at this point; object capabilities are created as PID 1 opens files, binds devices, and delegates to child processes. | 5.5, 2.2, 2.3 | Section 8.1 |
| 8.2 | Exec init binary | Process | exec_init_binary() — PID 1 calls execve("/sbin/init") (or the init= kernel command line override). Standard exec path: load ELF, apply file capabilities from the security.capability xattr (if present), set up user stack, jump to entry point. After this, userspace is running. | 8.1, 5.5 | Section 8.1 |
Note: Phases 4.6 (net_init) and 5.2 (vfs_init) have no mutual dependency and may execute in either order. The canonical numbering reflects logical grouping (networking in Part 4, filesystems in Part 5), not a strict ordering requirement. Architectures may reorder these phases.
Workqueue pre-activation guard: ModuleLoaderQueue has an activated: AtomicBool
flag (initially false, set true at Phase 4.4a start). If queue_work() is called
before activation (e.g., by an eager bus scan routine), it returns EAGAIN. The caller
retries on the next probe cycle. This prevents work submission before the workqueue's
backing thread pool is ready.
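A minimal sketch of this guard follows. The `activated` flag and the EAGAIN-then-retry semantics come from the text above; the struct layout, error type, and helper names are illustrative (and std atomics stand in for the kernel's own), not the actual kernel types:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Error returned when work is submitted before the queue is activated.
/// (Stand-in for the kernel's EAGAIN.)
#[derive(Debug, PartialEq)]
pub enum QueueError {
    EAgain,
}

pub struct ModuleLoaderQueue {
    /// false until Phase 4.4a start; submissions before that fail fast.
    activated: AtomicBool,
    // ... backing thread pool, pending work list, etc. (elided)
}

impl ModuleLoaderQueue {
    pub const fn new() -> Self {
        Self { activated: AtomicBool::new(false) }
    }

    /// Called once at Phase 4.4a start, after the thread pool exists.
    pub fn activate(&self) {
        self.activated.store(true, Ordering::Release);
    }

    /// Reject submissions until activation; the caller retries on the
    /// next probe cycle rather than blocking.
    pub fn queue_work(&self, _work: fn()) -> Result<(), QueueError> {
        if !self.activated.load(Ordering::Acquire) {
            return Err(QueueError::EAgain);
        }
        // ... enqueue onto the backing thread pool (elided)
        Ok(())
    }
}
```

The Release/Acquire pair ensures that a caller observing `activated == true` also observes the fully constructed thread pool.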
Cluster boot ordering: RDMA transport (Phase 7.1) is initialized before cluster
join (Phase 7.2) to break the circular dependency: DSM→peer protocol→RDMA→cluster.
RDMA transport operates independently of DSM and cluster membership. If no cluster is
discovered within 10 seconds (CLUSTER_JOIN_TIMEOUT_MS = 10_000), the node boots in
standalone mode. Standalone→cluster transition is supported post-boot via
cluster_join_deferred(). DSM is deferred until cluster join completes — DSM regions
require peer communication.
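The standalone fallback decision can be sketched as follows. `CLUSTER_JOIN_TIMEOUT_MS` is the constant named above; the injected clock and discovery callbacks are hypothetical, used here so the policy is testable in isolation (the kernel would read the monotonic clock and poll the RDMA transport):

```rust
pub const CLUSTER_JOIN_TIMEOUT_MS: u64 = 10_000;

#[derive(Debug, PartialEq)]
pub enum BootMode {
    Clustered,
    Standalone,
}

/// Poll for peer discovery until the deadline; fall back to standalone.
/// A standalone node can still transition to clustered mode post-boot
/// via cluster_join_deferred(), as described above.
pub fn cluster_join_or_standalone(
    start_ms: u64,
    mut now_ms: impl FnMut() -> u64,
    mut peer_discovered: impl FnMut() -> bool,
) -> BootMode {
    while now_ms() - start_ms < CLUSTER_JOIN_TIMEOUT_MS {
        if peer_discovered() {
            return BootMode::Clustered;
        }
        // kernel: brief sleep / yield between discovery probes (elided)
    }
    BootMode::Standalone
}
```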
Dependency DAG summary (critical path highlighted):
Phases 0.x: firmware → early_log → identity_map → memmap → boot_alloc → NUMA → CpuLocal → Evolvable_verify → (MMU) → Evolvable_map
↓
Phases 1.x: buddy → slab → page_cache, clock, PID
↓
Phases 2.x: IRQ_domain, cap_sys → sched → idle, workqueue → RCU → LSM
↓
Phases 3.x: PID_alloc; crypto → keyring; FMA → tracepoints
↓ AP bringup
Phases 4.x: IOMMU → dev_registry → KABI_runtime → bus_enum; block_init, net_init
↓
Phases 5.x: Tier0 → VFS → fs_types → Tier1 → TPM → EVM → storage_probe → dm_init → rootfs
↓
Phases 6.x: NFS, FUSE, DLM (post-root)
The critical boot path (from power-on to rootfs mount) is:
firmware → memmap → buddy → slab → sched → bus_enum → Tier0 → VFS → Tier1 → storage_probe → rootfs
Phase 6 subsystems (NFS, FUSE, DLM) are post-rootfs and initialized on demand or by init scripts. They are included in the table for completeness but do not block boot.
Relationship to per-arch phases: The per-architecture phase numbering (e.g., x86-64 Phases 1-10b, AArch64 Phases 1-16) predates this canonical table and uses different granularity. The canonical table is the authoritative ordering; the per-arch phases are concrete instantiations. For example, x86-64 Phase 3 (Physical Memory Manager) maps to canonical phases 0.3-0.6 + 1.1; x86-64 Phase 9 (Scheduler) maps to canonical phase 2.3. Per-arch boot sequences may include arch-specific phases not in the canonical table (e.g., interrupt enable timing, isolation mechanism detection, IPC subsystem init) — these are implementation details that each architecture handles at the appropriate point in its boot sequence. Notably, interrupt enabling is inherently arch-specific (tied to interrupt controller init and exception vector configuration, which varies across GIC, PLIC, APIC, XIVE, etc.) and is therefore NOT a canonical phase — each per-arch boot file documents when interrupts are enabled relative to its local phases. Every per-arch boot file includes its own canonical phase mapping table for traceability.
2.3.1.1 Early Boot Log Ring¶
The FMA/tracing subsystem (Section 20.1) requires the slab allocator, workqueues, and per-CPU structures — none of which exist during early boot (Phases 0.x). To capture errors, panics, and diagnostic messages emitted before those subsystems are online, Phase 0.15 initializes a static BSS ring buffer:
/// Boot-phase log ring buffer. Resides in .bss — zero-initialized, no allocator
/// needed. All boot code (Phases 0.x through 2.x) writes diagnostics here via
/// `early_log()`. After FMA init (Phase 3.8), the ring is replayed into the
/// live tracing subsystem, then the ring is decommissioned (its memory can be
/// reclaimed or left as a fallback for late-boot panics).
///
/// The ring is a simple single-producer (BSP-only during Phases 0.x–1.x) byte
/// ring with a monotonic write cursor. No locking is needed until AP bringup
/// (Phase 3); after that, `write_pos` is updated with `AtomicUsize` CAS.
///
/// Size: 64 KB — sufficient for ~1000 typical log lines. If the ring wraps,
/// oldest entries are silently overwritten (best-effort during early boot).
const EARLY_LOG_RING_SIZE: usize = 64 * 1024;
// kernel-internal, not KABI — BSP/AP logging ring, never crosses an ABI boundary.
#[repr(C, align(64))]
pub struct EarlyLogRing {
/// Ring buffer storage.
pub buf: [u8; EARLY_LOG_RING_SIZE],
/// Next write position (byte offset, wraps modulo EARLY_LOG_RING_SIZE).
/// Atomic after AP bringup; plain usize before that (BSP is sole writer).
pub write_pos: AtomicUsize,
/// Number of bytes written since boot (monotonic, does not wrap).
/// Used to detect overwrite: if total_written > EARLY_LOG_RING_SIZE,
/// the oldest (total_written - EARLY_LOG_RING_SIZE) bytes are lost.
pub total_written: AtomicU64,
}
/// Global early log ring. Placed in .bss (zero-initialized at load time).
pub static EARLY_LOG_RING: EarlyLogRing = EarlyLogRing {
buf: [0u8; EARLY_LOG_RING_SIZE],
write_pos: AtomicUsize::new(0),
total_written: AtomicU64::new(0),
};
/// Write a log message to the early ring buffer.
/// Safe to call from any boot phase after Phase 0.15 (which simply verifies
/// the ring is accessible — a no-op for BSS, but the phase exists as a
/// sequencing checkpoint). Before Phase 3 (AP bringup), this is single-
/// threaded (BSP only). After Phase 3, uses atomic CAS on write_pos.
///
/// **Entry format**: Each log entry is framed with a 4-byte length header:
/// `[len: u32 LE][message: [u8; len]]`. Maximum message length: 4096 bytes
/// (`MAX_EARLY_LOG_MSG`). The framing enables clean replay: entries with
/// corrupted length headers (garbled by concurrent ring wrap) are detected
/// and skipped during `early_log_replay()`.
///
/// **Multi-writer protocol** (active after AP bringup):
///
/// **Important**: This is a best-effort boot log ring with known data races
/// on ring wrap, NOT a lock-free MPSC ring with correctness guarantees.
/// Between step 1 (CAS reserve) and step 3 (memcpy), a concurrent writer
/// can wrap around and overwrite the first writer's reserved-but-not-yet-written
/// region. The CAS on `write_pos` ensures non-overlapping reservations only if
/// all writers complete their memcpy before the ring wraps. If a writer is
/// preempted between CAS and memcpy (possible after timer interrupts are
/// enabled at Phase 8), a later writer may wrap and overwrite the stale
/// reservation. Garbled entries are detected and discarded during replay.
///
/// 1. CAS loop on `write_pos` to reserve `4 + msg.len()` bytes (Acquire ordering on success).
/// 2. Write `(msg.len() as u32).to_le_bytes()` at `buf[old_pos..old_pos+4]`.
/// 3. `memcpy` message bytes into `buf[old_pos+4..old_pos+4+msg.len()]`
/// (modular wrap around EARLY_LOG_RING_SIZE).
/// 4. `total_written.fetch_add(4 + msg.len(), Relaxed)`.
///
/// On replay, validate each entry: `len <= MAX_EARLY_LOG_MSG` and the next
/// entry's offset is consistent. Entries where the length field was partially
/// overwritten by a concurrent wrapper produce invalid `len` values and are
/// skipped. Interleaved output between concurrent writers may garble individual
/// entries on ring wrap; the framing header allows detecting and discarding
/// such entries.
pub fn early_log(msg: &[u8]);
/// Replay all early log entries into the FMA tracing subsystem.
/// Called once after FMA init completes. Entries are emitted as
/// `TraceEvent::BootLog` with monotonic sequence numbers derived from
/// ring position. After replay, sets a flag that redirects future
/// `early_log()` calls directly to the live tracing path.
pub fn early_log_replay();
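To make the framing concrete, here is a single-threaded sketch of the entry format and the replay validation rule. Sizes are miniature for illustration (the real ring is 64 KB with MAX_EARLY_LOG_MSG = 4096), the helper names are hypothetical, and the cursor reservation shown as plain arithmetic here is the CAS loop described above in the multi-writer protocol:

```rust
const RING_SIZE: usize = 64; // 64 KB in the kernel; tiny here
const MAX_MSG: usize = 16;   // MAX_EARLY_LOG_MSG analogue

/// Append one framed entry — [len: u32 LE][message] — wrapping modulo
/// RING_SIZE. Returns the new write cursor. In the kernel the cursor
/// reservation is a CAS loop on `write_pos`; single-threaded for clarity.
fn ring_write(buf: &mut [u8; RING_SIZE], pos: usize, msg: &[u8]) -> usize {
    assert!(msg.len() <= MAX_MSG);
    let mut p = pos;
    for &b in (msg.len() as u32).to_le_bytes().iter().chain(msg) {
        buf[p % RING_SIZE] = b;
        p += 1;
    }
    p
}

/// Replay: walk entries from offset 0 up to `end` bytes written, stopping
/// at any entry whose length header is implausible (> MAX_MSG) — the
/// detection rule early_log_replay() uses for frames garbled by a
/// concurrent ring wrap. Valid only while total_written <= RING_SIZE
/// (no overwrite has occurred).
fn ring_replay(buf: &[u8; RING_SIZE], end: usize) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut p = 0usize;
    while p + 4 <= end {
        let len = u32::from_le_bytes([
            buf[p % RING_SIZE],
            buf[(p + 1) % RING_SIZE],
            buf[(p + 2) % RING_SIZE],
            buf[(p + 3) % RING_SIZE],
        ]) as usize;
        if len > MAX_MSG {
            break; // corrupted header: discard the rest
        }
        p += 4;
        out.push((0..len).map(|i| buf[(p + i) % RING_SIZE]).collect());
        p += len;
    }
    out
}
```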
Ordering invariants: Each phase's prerequisite column defines a strict partial order. Phases at the same level (e.g., 2.1 and 2.2) have no ordering dependency on each other and could theoretically execute concurrently, but UmkaOS runs them sequentially on the BSP for simplicity and debuggability. The only true parallelism is in Phase 3 (AP bringup), where multiple APs initialize concurrently.
Critical ordering constraint — LSM before first capability check:
Phase 2.9 (lsm_init()) must complete before any subsystem that performs
LSM-mediated access checks. This means LSM init must precede the device
registry (4.2), VFS init (5.2), and all driver probes (5.1, 5.3). The
prerequisites are the slab allocator (1.2, for LSM blob slab caches) and
the capability system (2.2, since LSM hooks reference SystemCaps and
TaskCredential). LSM does not require the scheduler (2.3), so it is
placed between RCU (2.8) and AP trampoline (3.1). If no lsm= parameter
is provided and no compiled-in LSM defaults are set, lsm_init() still
runs (creating the empty registry and setting any_lsm_active = false),
so that static-key NOP patching is finalized before the first hook site
is reachable.
2.3.2 Secondary CPU Bringup (x86-64 SMP)¶
After Phase 10b completes on the BSP (Boot Strap Processor), secondary CPUs (Application Processors, APs) are brought online.
AP Stack Allocation:
The BSP allocates each AP's initial kernel stack from the boot allocator before sending the INIT-SIPI wakeup. This ensures the stack is ready before the AP needs it.
- Stack size: 16 KB per AP (same as the BSP initial stack). Allocated from the per-NUMA-node boot allocator, preferring memory local to the AP's NUMA node.
- SP communication: The BSP stores the stack top address in a per-CPU startup mailbox, defined as:
/// Per-CPU startup data written by BSP before AP wakeup, read by AP during
/// very early boot (before the AP has its own stack pointer set up).
/// Must reside in a physically-mapped region accessible without paging (or
/// with the identity-mapped early page tables already in place).
#[repr(C, align(64))]
pub struct ApStartupMailbox {
/// Initial kernel stack top (SP value to load). Written by BSP before
/// sending the wakeup IPI; read by the AP entry stub in assembly.
pub stack_top: u64,
/// Physical address of the AP's per-CPU data area.
pub percpu_base: u64,
/// CPU identity value — AP verifies this matches its own hardware ID.
/// Per-architecture semantics: x86-64: LAPIC ID (32-bit x2APIC).
/// AArch64: packed Aff3:Aff2:Aff1:Aff0 from MPIDR_EL1 (4×8 = 32 bits).
/// RISC-V: hart_id (unsigned long, practically <32 bits on all current HW).
/// PPC32/PPC64LE: PIR (processor ID register, <32 bits).
/// s390x: CPU address (16 bits). LoongArch64: CPUID (32 bits).
/// u32 is sufficient for all 8 architectures.
pub cpu_id: u32,
/// BSP sets to MAILBOX_READY (0xAB1E1234) when all fields above are valid.
/// AP spins on this field (with a short architectural pause) until ready.
pub status: AtomicU32,
pub _pad: [u8; 32],
}
// Explicit fields total 56 bytes; align(64) adds 8 bytes of implicit tail
// padding to fill the cache line (sizeof(ApStartupMailbox) == 64).
const_assert!(size_of::<ApStartupMailbox>() == 64);
pub const MAILBOX_READY: u32 = 0xAB1E_1234;
/// Array of mailboxes, one per possible CPU slot. Allocated from the boot
/// allocator during Phase 11 once the CPU count is known.
pub static AP_STARTUP_MAILBOXES: OnceLock<&'static mut [ApStartupMailbox]> = OnceLock::new();
- AP entry stub: The AP's 16-bit → 64-bit trampoline (in assembly) reads stack_top from its mailbox slot, using the LAPIC ID as the array index, loads SP, then jumps to Rust ap_entry().
- Stack allocation failure: If the boot allocator returns OOM for an AP's stack, the BSP marks that CPU permanently offline in the topology, does NOT send the wakeup IPI, and logs "CPU {lapic_id}: stack allocation failed, CPU disabled". Boot continues with the remaining CPUs.
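The publish/consume handshake on the mailbox can be sketched in safe Rust. The struct mirrors ApStartupMailbox as defined below; bsp_fill and ap_read_stack are illustrative names — in the kernel the AP side is the assembly trampoline, not Rust:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub const MAILBOX_READY: u32 = 0xAB1E_1234;

#[repr(C, align(64))]
pub struct ApStartupMailbox {
    pub stack_top: u64,
    pub percpu_base: u64,
    pub cpu_id: u32,
    pub status: AtomicU32,
    pub _pad: [u8; 32],
}

/// BSP side: fill the fields, then publish. The Release store on `status`
/// orders all prior field writes before any AP can observe MAILBOX_READY.
pub fn bsp_fill(mb: &mut ApStartupMailbox, stack_top: u64, percpu_base: u64, cpu_id: u32) {
    mb.stack_top = stack_top;
    mb.percpu_base = percpu_base;
    mb.cpu_id = cpu_id;
    mb.status.store(MAILBOX_READY, Ordering::Release);
}

/// AP side: spin until the mailbox is published, then return the stack top.
/// In the trampoline the spin body uses the architectural pause instruction.
pub fn ap_read_stack(mb: &ApStartupMailbox) -> u64 {
    while mb.status.load(Ordering::Acquire) != MAILBOX_READY {
        std::hint::spin_loop();
    }
    mb.stack_top
}
```

The Acquire load pairing with the Release store is what makes the non-atomic `stack_top`/`percpu_base`/`cpu_id` reads safe on the AP side.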
Fan-out tree bringup:
A sequential per-CPU timeout of 1 second × N CPUs does not scale: 128 CPUs would require up to 127 seconds in the worst case. UmkaOS uses a binary fan-out tree to bound bringup time to O(log₂ N) phases regardless of CPU count.
The tree assignment is defined by index (not by LAPIC ID):
CPU i (tree index) wakes CPUs 2i+1 and 2i+2 (if they exist in the topology).
Phase 0: BSP (index 0) wakes index 1 and index 2
Phase 1: index 1 wakes 3, 4; index 2 wakes 5, 6
Phase 2: each of 3–6 wakes two children
...
Phase k = ⌈log₂(N)⌉ − 1: leaf CPUs (no children)
For 128 CPUs: 7 phases × ~50 ms per phase ≈ 350 ms worst case, versus up to 127 seconds with sequential 1-second timeouts.
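The tree arithmetic is small enough to show directly. fanout_children and fanout_phases are illustrative helpers, not kernel functions; fanout_phases computes the depth of the deepest tree index (floor(log₂ N)), which yields the 7 phases for 128 CPUs quoted above:

```rust
/// Children of tree index `i` in the binary fan-out, clipped to the
/// number of CPUs actually present in the topology.
pub fn fanout_children(i: usize, num_cpus: usize) -> Vec<usize> {
    [2 * i + 1, 2 * i + 2]
        .into_iter()
        .filter(|&c| c < num_cpus)
        .collect()
}

/// Number of wakeup phases for `n` CPUs: the depth of the deepest tree
/// index (n - 1), i.e. floor(log2(n)). The BSP itself is phase-free.
pub fn fanout_phases(n: usize) -> u32 {
    if n <= 1 { 0 } else { (n as u64).ilog2() }
}
```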
The BSP sets up a shared SmpBringupState structure before waking the first AP:
/// Shared state for coordinating fan-out tree AP bringup.
/// Placed in physically-mapped memory accessible to all CPUs before the VMM
/// is fully operational.
///
/// `online_mask` and `pending_mask` are `CpuMask` instances (Section 9.1,
/// 08-security.md), allocated from the boot-time bump allocator after the CPU
/// count is discovered from ACPI MADT or DTB. They scale to the actual number
/// of CPUs found on the system — no hardcoded limit.
///
/// Allocation: `CpuMask::alloc(num_possible_cpus, boot_alloc)` is called once
/// during Phase 11 (CPU enumeration), before any AP is woken. The mask storage
/// is never reallocated — `num_possible_cpus` is a boot-time-fixed value.
// repr(C) ensures field order is preserved — prevents false sharing between
// online_count (hammered by APs) and num_possible_cpus (read-only after init).
// No const_assert because CpuMask is dynamically sized (depends on CPU count).
// kernel-internal, not KABI
#[repr(C, align(64))]
pub struct SmpBringupState {
/// Bitmask of CPUs (by tree index) that have completed initialization.
/// One bit per possible CPU. Sized at boot to `num_possible_cpus` bits.
/// Each AP atomically sets its own bit via `CpuMask::set_atomic`.
pub online_mask: CpuMask,
/// Bitmask of CPUs currently being brought up (wakeup IPI sent, init
/// not yet complete). Used to detect stalled APs at deadline.
pub pending_mask: CpuMask,
/// Total number of possible CPUs discovered from firmware (MADT/DTB).
pub num_possible_cpus: usize,
/// Count of CPUs that have completed init, used for tree coordination.
/// An AP atomically increments this after setting its bit in `online_mask`.
pub online_count: AtomicUsize,
/// Global deadline (monotonic ns) by which all APs must come online.
/// Set by BSP to `now_ns() + 30_000_000_000` (30 seconds) before Phase 11.
pub deadline_ns: u64,
}
Protocol:
1. BSP initializes SmpBringupState, sets deadline_ns = now_ns() + 30s.
2. BSP prepares mailboxes for both tree children: for each child_idx in
[2*0+1, 2*0+2] = [1, 2] (if child exists in topology): allocate stack,
fill mailbox (stack, percpu_base, APIC ID), set mailbox[child_idx].status =
MAILBOX_READY, send INIT-SIPI to that AP.
3. Each AP, after completing its own init (Phase 14 below), atomically sets its
bit in online_mask and increments online_count. It then reads its tree
index i and sends wakeup IPIs to children at indices 2i+1 and 2i+2
(if those CPUs exist and their deadline_ns has not passed). The AP then
enters the scheduler idle loop.
4. The BSP (Phase 15) polls online_count and deadline_ns. When
online_count reaches the expected total or deadline_ns is exceeded,
bringup ends. Any CPU whose bit is not set in online_mask by deadline
is marked offline and excluded from the kernel CPU mask.
Phase 11: AP Detection
    Query ACPI MADT (Multiple APIC Description Table) or MP Table
for CPU count and LAPIC IDs. Assign sequential tree indices
(0 = BSP, 1..N-1 = APs in MADT order). Allocate PerCpu<T>
slots and AP_STARTUP_MAILBOXES for each detected CPU.
Initialize SmpBringupState; set deadline_ns.
Phase 12: AP Trampoline Setup
a. Allocate a 4 KB page below 1 MB (in low memory, identity-mapped)
for the AP trampoline code. This is required because APs start
in real mode (16-bit) with paging disabled.
b. Copy trampoline code (16-bit → 32-bit → 64-bit transition) to
the low-memory page. The trampoline:
- Starts in 16-bit real mode at physical address 0xNN00
- Enables protected mode (32-bit)
- Loads a temporary GDT (same layout as BSP's)
- Enables long mode (64-bit)
- Loads CR3 with the kernel's page tables
- Reads stack_top from ApStartupMailbox[own_lapic_id_index]
- Loads SP from stack_top
- Jumps to ap_entry() in high memory
c. The trampoline uses the ApStartupMailbox array (defined
above) for per-AP stack and percpu_base communication.
Phase 13: First AP Wakeup (BSP → tree root)
BSP allocates stack for AP at tree index 1, fills mailbox[1],
sets mailbox[1].status = MAILBOX_READY.
BSP sends INIT IPI to AP 1's LAPIC (assert level).
BSP waits 10 ms (Intel SDM recommendation).
BSP sends STARTUP IPI (SIPI) with trampoline vector.
BSP waits 200 μs; sends second SIPI (required by older silicon).
The fan-out tree propagates from here — each AP wakes its children
after completing its own init.
Phase 14: AP Initialization (per AP, in ap_entry())
Each AP runs this sequence independently after its mailbox is ready:
a. Load proper GDT and TSS (per-CPU TSS required for IST stacks)
b. Load IDT (same as BSP)
c. Initialize per-CPU interrupt controller (LAPIC on x86-64,
GIC redistributor on AArch64, PLIC hart context on RISC-V,
OpenPIC per-CPU on PPC)
d. Initialize per-CPU scheduler runqueue
**Note**: Runqueue init MUST precede enabling interrupts.
A timer interrupt on the AP requires a valid runqueue for
the scheduler tick handler (scheduler_tick() reads
current_rq()). If interrupts are enabled first, the timer
fires into a CPU with no runqueue — undefined behavior.
d2. Initialize per-CPU slab magazines (slab_init_cpu_magazines(n))
d3. Enable interrupts
e. Calibrate LAPIC timer (delay calibration loop)
f. Atomically set own bit in SmpBringupState.online_mask;
increment online_count
g. Read own tree index i; wake children at 2i+1, 2i+2:
- Allocate stack for each child (from boot allocator)
- Fill child's ApStartupMailbox; set status = MAILBOX_READY
- Send INIT + SIPI + SIPI to child's LAPIC
h. Enter scheduler idle loop (hlt + monitoring for work)
Phase 15: SMP Online
BSP polls SmpBringupState.online_count and deadline_ns.
Loop exits when online_count == expected_ap_count OR
monotonic_now() >= deadline_ns (global 30-second timeout).
Any AP whose bit is not set in online_mask at exit is marked
permanently offline and removed from the kernel CPU mask.
System is now fully multi-CPU. Scheduler load-balances
across all online CPUs.
Per-CPU data initialization:
Each AP needs its own per-CPU data structures initialized:
- PerCpu<T> slots for scheduler runqueue, current task pointer, etc.
- GDT with per-CPU TSS (TSS must be unique per CPU for IST stacks)
- LAPIC timer calibration (varies per CPU due to manufacturing differences)
- IRQ affinity: By default, all IRQs target BSP; distribute to other CPUs
via IOAPIC redirection table or LAPIC logical destination mode.
ACPI MADT parsing (x86-64):
MADT (Multiple APIC Description Table):
- Located via RSDP → RSDT/XSDT → MADT signature "APIC"
- Provides: Local APIC address, CPU LAPIC IDs, IOAPIC addresses
- CPU entries: LAPIC ID, flags (enabled/disabled)
- Override entries: IRQ source overrides, NMI sources
The BSP's LAPIC ID is read from LAPIC_ID register (MMIO at 0xFEE00020).
All other entries in MADT are APs.
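A sketch of the subtable walk, assuming the standard MADT layout (36-byte ACPI header, 4-byte local APIC address, 4-byte flags, then variable-length subtables of [type, length, payload]); madt_local_apics and LocalApic are illustrative names, not the kernel's API:

```rust
/// One Processor Local APIC entry (MADT subtable type 0).
#[derive(Debug, PartialEq)]
pub struct LocalApic {
    pub acpi_id: u8,
    pub apic_id: u8,
    pub enabled: bool,
}

/// Walk MADT subtables (the byte region after the 44-byte fixed part:
/// 36-byte ACPI header + LAPIC address + flags) and collect type-0
/// (Processor Local APIC) entries. Malformed lengths terminate the walk
/// rather than panicking — matching the "degrade, never panic" policy.
pub fn madt_local_apics(subtables: &[u8]) -> Vec<LocalApic> {
    let mut out = Vec::new();
    let mut off = 0usize;
    while off + 2 <= subtables.len() {
        let (ty, len) = (subtables[off], subtables[off + 1] as usize);
        if len < 2 || off + len > subtables.len() {
            break; // malformed entry: stop, use what we have
        }
        if ty == 0 && len >= 8 {
            let flags = u32::from_le_bytes(
                subtables[off + 4..off + 8].try_into().unwrap(),
            );
            out.push(LocalApic {
                acpi_id: subtables[off + 2],
                apic_id: subtables[off + 3],
                enabled: flags & 1 != 0, // bit 0: Processor Enabled
            });
        }
        off += len; // skip IOAPIC (type 1), overrides (type 2), NMIs, etc.
    }
    out
}
```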
Failure handling:
If an AP fails to come online before the global deadline:
- BSP logs the failure: "CPU {lapic_id} (tree index {i}): did not signal
online before deadline, marking offline"
- BSP marks the CPU slot offline; its children in the fan-out tree are
also marked offline (they will never receive their wakeup IPI)
- Boot continues with the available CPUs
- Do NOT panic — reduced-CPU operation is valid
Hot-plug support (future): The ACPI namespace may indicate CPU hot-plug capability. The mailbox mechanism is reused for hot-plug: writing to the ACPI CPU hot-plug register triggers the same INIT/SIPI sequence for the newly added CPU, which inserts itself into the online_mask and online_count atomically.
2.4 ACPI Table Parsing¶
UmkaOS uses ACPI tables for hardware discovery on x86-64 (and ARM SBSA/server platforms). The ACPI subsystem has two distinct components:
-
Static table parsing (Phase 1, boot-time): The kernel parses binary ACPI tables (MADT, MCFG, HPET, DMAR/IVRS, SRAT, SLIT, PPTT, FADT) to discover hardware topology. This is a straightforward binary structure walk — no interpreter needed. Static table parsing is required for boot.
-
AML interpreter (Phase 2, post-boot): ACPI Methods (DSDT/SSDT bytecode) require an AML interpreter to execute
_STA,_CRS,_PRS,_PSx,_Sx,_OSC,_DSM, and power/thermal methods. UmkaOS implements a reduced AML interpreter covering: - Required for boot:
_STA(device status),_CRS(current resources),_PRS(possible resources),_OSC(OS capabilities handshake),_INI(device init). - Required for power management:
_PS0–_PS3(power state transitions),_S3/_S4/_S5(sleep states),_TMP/_PSV/_CRT(thermal). - Required for PCI/PCIe:
_BBN(base bus number),_SEG(segment group),_PRT(PCI routing table). - Deferred:
_DSM(device-specific methods) for vendor extensions — implemented per-driver as needed.
AML opcode coverage: The method names above describe which methods to
execute, not which AML opcodes the interpreter must support. Real-world DSDT
tables (Dell, HP, Lenovo, etc.) use a substantial subset of the AML opcode
space within these method bodies. The AML interpreter must support at minimum:
- Control flow: If/Else, While, Return, Break
- Data manipulation: Store, Add, Subtract, And, Or, ShiftLeft/Right,
Increment, Decrement, Not, FindSetLeftBit/RightBit
- Object creation: CreateDWordField, CreateWordField, CreateByteField,
CreateBitField, CreateQWordField
- Composite types: Buffer, Package, DerefOf, Index, SizeOf, ObjectType
- Method invocation: MethodCall (nested), Arg0-6, Local0-7
- Type conversion: ToInteger (0x99), ToString (0x9C), ToBuffer (0x96)
- String/buffer manipulation: Concatenate (0x73), ConcatenateResTemplate
(0x84), Mid (0x9E). ConcatenateResTemplate is heavily used in _CRS
methods on HP/Dell/Lenovo BIOS to construct resource template buffers
dynamically — without it, such devices silently lose their IRQ/MMIO/DMA
resource descriptors and cannot be probed.
- Conditional references: CondRefOf (0x12 extended), Match (0x89).
CondRefOf is used in _STA methods to check optional feature existence;
Match is used in _OSI string matching.
- Synchronisation: Acquire, Release, Mutex
- Namespace: Scope, Device, Name, Alias, Notify
- Field access: OpRegion, Field, IndexField, BankField (System Memory,
SystemIO, PCI Config, Embedded Controller)
Any AML opcode not in the above categories produces the diagnostic
ACPI: unsupported AML opcode 0xXX at <table>+<offset>, skipping method <name>
and the containing method is skipped. Systems using only opcodes in the above
categories will boot correctly. Extended opcodes (LoadTable 0x5B1F,
Unload 0x5B20, Timer 0x5B33, ToBCD 0x5B29, FromBCD 0x5B28, DerefOf 0x83,
RefOf 0x71, Index 0x88 for buffer fields) are deferred to Phase 2.
Remaining rare opcodes (CopyObject 0x9D, ObjectType 0x8E, external
reference resolution) are deferred to Phase 3.
Error handling for malformed ACPI tables: If a static table fails checksum or has invalid structure, the kernel logs a diagnostic and falls back to safe defaults (e.g., assume 1 CPU, no IOAPIC, use legacy PIC). If the AML interpreter encounters an illegal opcode or infinite loop (method timeout: 5 seconds), it aborts the method, logs the failure, and marks the affected device as non-functional. The kernel never panics on ACPI errors — degraded operation is always preferred over a boot failure.
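The static-table checksum rule is simple: every byte of the table, including the checksum field itself, must sum to zero modulo 256. A sketch (acpi_checksum_ok is an illustrative name):

```rust
/// ACPI table checksum: all bytes of the table (header + payload,
/// including the checksum byte) must sum to 0 modulo 256.
/// Returns true if the table is intact.
pub fn acpi_checksum_ok(table: &[u8]) -> bool {
    table.iter().fold(0u8, |acc, &b| acc.wrapping_add(b)) == 0
}
```

A table that fails this check is logged and skipped, triggering the safe-default fallbacks described above.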
2.5 AArch64 Boot Sequence¶
QEMU's -M virt -cpu cortex-a72 -kernel loads the ELF at 0x40080000 and
enters at _start with the MMU off. The exception level at entry depends on
the environment: QEMU -M virt without -enable-kvm enters at EL1; real
hardware and some hypervisors enter at EL2. Register x0 holds the DTB
address provided by QEMU's built-in firmware.
Entry assembly (arch/aarch64/entry.S, GNU as syntax):
1. QEMU/firmware jumps to _start, MMU off.
- x0 = DTB address (passed by QEMU firmware or bootloader)
- Exception level: EL2 or EL1 depending on platform.
2. _start:
a. Save DTB pointer: mov x19, x0 (x19 is callee-saved)
b. EL2-to-EL1 transition (if entered at EL2):
Read CurrentEL register (MRS x1, CurrentEL; LSR x1, x1, #2).
If bits [3:2] == 0b10 (EL2):
- Configure HCR_EL2:
RW = 1 (bit 31): EL1 executes in AArch64 state
TGE = 0 (bit 27): route EL0 exceptions to EL1, not EL2
All other bits = 0 (no stage-2 translation, no traps)
Write: MOV x1, #(1 << 31); MSR HCR_EL2, x1
- Initialize SCTLR_EL1 to a known-safe state (MMU off, RES1 bits set):
MOV x1, #INIT_SCTLR_EL1_MMU_OFF; MSR SCTLR_EL1, x1
where INIT_SCTLR_EL1_MMU_OFF = LSMAOE (bit 29) | nTLSMD (bit 28)
| EIS (bit 22) | TSCXT (bit 20) | EOS (bit 11). These bits are
RES1 or architecturally required on ARMv8.5+ (Cortex-A78, A510,
X1-X4, Neoverse V1-V3). MMU, caches, and alignment checks remain
off. Writing 0 would clear RES1 bits, causing IMPLEMENTATION DEFINED
behavior on newer cores. Matches Linux INIT_SCTLR_EL1_MMU_OFF.
Phases 6/6a configure the final SCTLR_EL1 value.
- Set up return state:
MSR SPSR_EL2, #0x3C5 (M=0b0101 = EL1h, D|A|I|F bits [9:6] all set = DAIF masked)
ADR x1, el1_entry
MSR ELR_EL2, x1
ERET
el1_entry:
If bits [3:2] == 0b01 (already EL1): fall through, no transition.
If bits [3:2] == 0b11 (EL3): unsupported — branch to halt loop.
c. Disable all exceptions: msr daifset, #0xf
(masks Debug, SError, IRQ, FIQ in DAIF register)
d. Enable FPU/NEON: write CPACR_EL1.FPEN bits [21:20] = 0b11
(without this, any NEON/FP instruction traps — Rust generates
NEON instructions by default for aarch64). This clobbers x0,
but the DTB pointer was saved to x19 in step (a).
e. Load stack pointer: adrp x1, _stack_top / add / mov sp, x1
(64 KB stack in .bss._stack, 16-byte aligned)
f. Clear BSS: zero memory from __bss_start to __bss_end
(str xzr loop, 8 bytes per iteration)
g. Prepare arguments: x0 = 0 (no multiboot), x1 = x19 (DTB address)
h. Branch: bl umka_main
i. Halt loop: wfe (wait-for-event) if umka_main returns
Stack (64 KB) is allocated in .bss._stack (16-byte aligned). The linker
script (linker-aarch64.ld) places .text._start first and provides
__bss_start / __bss_end symbols for BSS clearing.
Initialization phases (in umka_main(), sequential):
Canonical Phase Mapping:
| Canonical Phase | Description | Local Implementation |
|---|---|---|
| 0.1 | arch_early_init | Entry assembly (steps 1–2) + Phase 1 (VBAR_EL1) |
| 0.15 | early_log_init | Phase 1a: early_log_init() — BSS ring buffer checkpoint |
| 0.3 | parse_firmware_memmap | Phase 3: DTB /memory parse |
| 0.4 | boot_alloc_init | Phase 4: phys::init() from DTB regions |
| 0.5 | reserve_regions | Phase 4: reserve kernel image |
| 0.6 | numa_discover_topology | Phase 4a: DTB /memory nodes + PPTT (if present) |
| 0.7 | cpulocal_bsp_init | Phase 4b: msr TPIDR_EL1, &CpuLocalBlock |
| 0.8a | evolvable_verify | Phase 4c: Evolvable signature verification (physical addresses, no MMU) |
| 0.2 | identity_map | Phase 4d: TTBR0_EL1 page tables + MMU enable |
| 0.8b | evolvable_map_and_init | Phase 4e: Evolvable virtual mapping at EVOLVABLE_VIRT_BASE + VTABLE_SLOTS[] population |
| 1.1 | buddy allocator | Phase 5: buddy init |
| 1.2 | slab allocator | Phase 5a: slab_init() |
| 2.1 | IRQ domain | Phase 8: GIC init + IrqDomain setup |
| 2.2 | capability system | Phase 7: CapSpace init |
| 2.3 | scheduler | Phase 11: scheduler init |
| 2.7 | workqueue infra | Phase 11a: workqueue_init_early() |
| 2.8 | RCU | Phase 11b: rcu_init() |
| 2.9 | LSM framework | Phase 11c: lsm_init() |
| 3.1–3.3 | SMP bringup | Phases 12–16: PSCI CPU_ON fan-out |
Phase 1: Exception Vectors (VBAR_EL1)
Write vector table base to VBAR_EL1 (16 entries × 128 bytes,
2 KB aligned). Vectors cover: Synchronous, IRQ, FIQ, SError
at each of four exception origins (current EL SP0/SPx, lower
EL AArch64/AArch32).
Phase 1a: Early Log Ring Init (canonical Phase 0.15)
Call early_log_init() — sequencing checkpoint confirming the
BSS-resident EarlyLogRing is accessible. After this point,
early_log() emits messages to both the ring buffer and the
PL011 UART. See [Section 2.3](#boot-init-cross-arch--early-boot-log-ring).
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly, same
pattern as x86 entry.asm step 2d). Perform any additional
initialization that depends on zeroed static data.
Phase 3: DTB Parse (Nucleus minimal parser)
Parse the DTB (received in x0 at entry, forwarded as the
info pointer to umka_main; see [Section 2.8](#device-tree-and-platform-discovery)). Extract /memory
regions, /chosen bootargs, interrupt controller base (GIC),
timer IRQ numbers, and UART base address.
**Parser placement**: The minimal no-alloc DTB parser
(`umka-kernel/src/boot/dtb.rs`) is Nucleus code — it must
execute during Phase 0.3 (canonical memory map parse) before
the heap, slab, or Evolvable are available. It uses a
fixed-size 64-entry `MemoryRegion` array on the stack. The
full-featured DTB parser (ACPI/DTB parser in Evolvable,
see [Section 2.21](#kernel-image-structure--evolvable-boot-monolith-first-loadable-swappable))
handles second-pass parsing after heap initialization for
device nodes beyond the 64-entry boot limit.
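The fixed-capacity region list can be sketched as follows — a host-testable sketch in which the type and method names are illustrative (the authoritative definitions live in `umka-kernel/src/boot/dtb.rs`):

```rust
/// Sketch of the no-alloc region list used by the Nucleus DTB parser.
/// Capacity is fixed at 64 entries; regions past the limit are left for
/// the Evolvable second-pass parser after heap init.
#[derive(Clone, Copy, Default, Debug)]
pub struct MemoryRegion {
    pub base: u64,
    pub len: u64,
}

pub struct BootRegions {
    regions: [MemoryRegion; 64],
    count: usize,
}

impl BootRegions {
    pub fn new() -> Self {
        Self { regions: [MemoryRegion::default(); 64], count: 0 }
    }

    /// Returns false once the 64-entry boot limit is hit.
    pub fn push(&mut self, r: MemoryRegion) -> bool {
        if self.count == 64 {
            return false;
        }
        self.regions[self.count] = r;
        self.count += 1;
        true
    }

    pub fn len(&self) -> usize {
        self.count
    }
}
```

Because the array lives on the stack and never allocates, it is safe to use during canonical Phase 0.3, before any allocator exists.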
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Mark available
regions free, reserve kernel image (__bss_end and below).
No legacy BIOS region to reserve (unlike x86).
Phase 4a: NUMA Topology Discovery (canonical Phase 0.6)
Parse DTB /memory nodes for memory region topology. If ACPI
is available (UEFI boot with ACPI tables), parse SRAT/SLIT.
If the PPTT (Processor Properties Topology Table) is present,
extract cache hierarchy and socket/cluster grouping. On
non-NUMA systems (most QEMU -M virt configurations): single
node 0 covering all memory. Required before slab_init()
because slab caches are NUMA-aware (one partial list per node).
Cross-ref: [Section 4.11](04-memory.md#numa-topology-and-policy).
Phase 4b: CpuLocal BSP Init (canonical Phase 0.7)
Initialize CpuLocalBlock for the BSP and set the arch-specific
fast-access register:
msr TPIDR_EL1, &CPU_LOCAL_BLOCKS[0]
On AArch64, TPIDR_EL1 is the per-CPU data pointer register
(accessible only at EL1+). All subsequent per-CPU data access
(current task, preempt count, IRQ stack, slab magazines) goes
through TPIDR_EL1-relative addressing.
See [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--initialization-sequence).
Phase 4c: Evolvable Signature Verification (canonical Phase 0.8a)
Verify the Evolvable image signature using physical addresses
only (no MMU required). Nucleus LMS verifier (~2 KB) checks
ML-DSA-65 + Ed25519 hybrid signature against embedded public
key. On failure, panic.
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
**Invariant**: Phases 0.1–4c MUST NOT dispatch through
VTABLE_SLOTS[] or any replaceable policy vtable.
Phase 4d: Virtual Memory — MMU Enable (canonical Phase 0.2)
See Phase 6 below for the full TTBR0/TTBR1 page table setup
and MMU enable sequence. This is the canonical Phase 0.2
identity mapping step, placed here to satisfy the constraint
that MMU must be enabled before Evolvable virtual mapping
(Phase 4e). The buddy allocator (Phase 5) is not yet available
at this point — page table pages are allocated from BootAlloc.
Phase 4e: Evolvable Virtual Mapping (canonical Phase 0.8b)
Map Evolvable .text (RX) and .rodata (RO) pages at
EVOLVABLE_VIRT_BASE (0xFFFF_0000_4000_0000 on AArch64).
Allocate fresh RW pages for .data+.bss via BootAlloc.
Call evolvable_init() to populate VTABLE_SLOTS[].
**Requires MMU enabled** (Phase 4d complete).
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
After Phase 4e, Evolvable vtable dispatch is permitted.
Phase 5: Kernel Heap (buddy allocator)
Initialize the buddy allocator with all available physical
memory discovered from the DTB memory regions (Phase 4).
The buddy allocator manages power-of-two blocks (order 0–10,
4 KB–4 MB). See [Section 4.2](04-memory.md#physical-memory-allocator).
Phase 5a: Slab Allocator (canonical Phase 1.2)
Initialize slab caches on top of the buddy allocator (one
partial list per NUMA node). After this point, Box::new,
Arc::new, and typed allocations are available.
Cross-ref: [Section 4.3](04-memory.md#slab-allocator).
**Identity map timing note**: MMU is enabled at Phase 4d
(canonical Phase 0.2) using page table pages allocated from
BootAlloc. The buddy allocator (Phase 5) and slab (Phase 5a)
then operate with MMU active. Phase 6 below documents the
full TTBR0/TTBR1 page table construction and MMU enable
sequence that runs at Phase 4d.
Phase 6: Virtual Memory Detail (TTBR0_EL1 + TTBR1_EL1, runs at Phase 4d)
Allocate page table pages from BootAlloc (Phase 4).
Build two sets of mappings:
**TCR_EL1 configuration** (controls both TTBR0 and TTBR1):
- T0SZ = 16: 48-bit VA for TTBR0 region (0x0000_0000_0000_0000
to 0x0000_FFFF_FFFF_FFFF)
- T1SZ = 16: 48-bit VA for TTBR1 region (0xFFFF_0000_0000_0000
to 0xFFFF_FFFF_FFFF_FFFF)
- TG0 = 0b00: 4 KB granule for TTBR0
- TG1 = 0b10: 4 KB granule for TTBR1 (note: TG1 encoding
differs from TG0 — 0b10 means 4 KB for TG1)
- ORGN0/IRGN0 = 0b01: write-back write-allocate cacheable (TTBR0)
- ORGN1/IRGN1 = 0b01: write-back write-allocate cacheable (TTBR1)
- SH0 = 0b11: inner shareable (TTBR0)
- SH1 = 0b11: inner shareable (TTBR1)
- IPS: set to match physical address size from
ID_AA64MMFR0_EL1.PARange (runtime-discovered, not hardcoded)
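The field list above can be expressed as a value builder — a host-testable sketch in which the bit positions follow the ARMv8-A TCR_EL1 layout and `ips` stands in for the runtime-probed PARange-derived value:

```rust
/// Build the TCR_EL1 value from the fields listed above.
/// `ips` is the 3-bit IPS field derived at runtime from
/// ID_AA64MMFR0_EL1.PARange (never hardcoded).
pub fn tcr_el1_value(ips: u64) -> u64 {
    const T0SZ: u64 = 16;            // bits [5:0]   — 48-bit VA, TTBR0
    const IRGN0: u64 = 0b01 << 8;    // inner write-back write-allocate
    const ORGN0: u64 = 0b01 << 10;   // outer write-back write-allocate
    const SH0: u64 = 0b11 << 12;     // inner shareable
    const TG0: u64 = 0b00 << 14;     // 4 KB granule (TTBR0 encoding)
    const T1SZ: u64 = 16 << 16;      // bits [21:16] — 48-bit VA, TTBR1
    const IRGN1: u64 = 0b01 << 24;
    const ORGN1: u64 = 0b01 << 26;
    const SH1: u64 = 0b11 << 28;
    const TG1: u64 = 0b10 << 30;     // 4 KB granule — TG1 encoding differs from TG0

    T0SZ | IRGN0 | ORGN0 | SH0 | TG0
        | T1SZ | IRGN1 | ORGN1 | SH1 | TG1
        | ((ips & 0b111) << 32)      // IPS, bits [34:32]
}
```

The asymmetric TG0/TG1 encodings are the classic trap here: writing `0b00` into TG1 selects a different (reserved-behavior) granule, which is why the builder spells both out.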
**TTBR0_EL1 — identity map** (user/lower half):
- 4-level tables: L0 (PGD) → L1 (PUD) → L2 (PMD) → L3 (PTE)
- Identity map all physical RAM (VA == PA)
- Used during early boot and for user-space processes
**TTBR1_EL1 — kernel higher-half map**:
- Separate 4-level page table hierarchy
- Map kernel .text (RX), .rodata (RO), .data+.bss (RW)
at their linked virtual addresses in the 0xFFFF_... range
- Map Evolvable at EVOLVABLE_VIRT_BASE (0xFFFF_0000_4000_0000,
see [Section 2.21](#kernel-image-structure--per-architecture-evolvable-virtual-address-base))
- Direct-map (physmap) region for physical memory access
**Cache maintenance before MMU enable**:
1. Clean+invalidate all page table pages to Point of Coherence:
For each page table page:
dc civac, <page_addr> // clean+invalidate each cache line
(iterate in 64-byte steps across each 4 KB page)
2. dsb sy // ensure all cleans complete
3. isb // synchronize context
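The line-by-line iteration in step 1 works out as follows — a sketch assuming the 64-byte cache line size used by the loop above (the helper name is illustrative):

```rust
/// Addresses that would be passed to `dc civac` when cleaning one
/// page-table page to the Point of Coherence, one cache line per step.
pub fn civac_line_addrs(page: u64, page_size: u64, line: u64) -> impl Iterator<Item = u64> {
    (0..page_size / line).map(move |i| page + i * line)
}
```

For a 4 KB page with 64-byte lines this yields 64 `dc civac` operations, after which the single `dsb sy` in step 2 orders all of them at once.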
**MMU enable sequence**:
1. Write MAIR_EL1 (Memory Attribute Indirection Register):
Attr0 = 0xFF (normal, write-back cacheable inner+outer)
Attr1 = 0x04 (device-nGnRE, for MMIO)
Attr2 = 0x44 (normal, non-cacheable)
Attr3 = 0x00 (device-nGnRnE, strongly ordered)
2. Write TCR_EL1 (as configured above)
3. Write TTBR0_EL1 = identity map L0 table physical address
4. Write TTBR1_EL1 = kernel map L0 table physical address
5. isb
6. Set SCTLR_EL1: M=1 (MMU), C=1 (data cache), I=1 (icache),
SA=1 (SP alignment EL1), SA0=1 (SP alignment EL0),
nTWE=0, nTWI=0, EOS=1 (if FEAT_ExS), SPAN=1 (if FEAT_PAN2)
7. isb
8. tlbi vmalle1 // invalidate all TLB entries
9. dsb sy
10. isb
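The MAIR_EL1 write in step 1 packs the four attribute bytes like this (a sketch of the value computation only — the actual register write is the privileged `msr` above):

```rust
/// MAIR_EL1 holds eight 8-bit attribute encodings, Attr0 in bits [7:0],
/// Attr1 in [15:8], and so on. Page table entries select one via their
/// 3-bit AttrIndx field.
pub fn mair_el1_value() -> u64 {
    const ATTR0_NORMAL_WB: u64 = 0xFF;     // normal, write-back inner+outer
    const ATTR1_DEVICE_NGNRE: u64 = 0x04;  // device-nGnRE, for MMIO
    const ATTR2_NORMAL_NC: u64 = 0x44;     // normal, non-cacheable
    const ATTR3_DEVICE_NGNRNE: u64 = 0x00; // device-nGnRnE, strongly ordered

    ATTR0_NORMAL_WB
        | (ATTR1_DEVICE_NGNRE << 8)
        | (ATTR2_NORMAL_NC << 16)
        | (ATTR3_DEVICE_NGNRNE << 24)
}
```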
Phase 6a: MTE System Register Activation (conditional)
Probe ID_AA64PFR1_EL1.MTE field (bits [11:8]):
- 0b0000: MTE not supported — skip this phase entirely
- 0b0001: MTE1 (instructions only, no tag checking) — skip
- 0b0010: MTE2 (full tag checking) — activate below
- 0b0011: MTE3 (asymmetric, FEAT_MTE3) — activate below
If MTE >= 2, configure the following system registers:
- TCR_EL1.TCMA0 = 0 (do NOT disable tag checking on EL0
unchecked accesses — default safe). TCMA1 = 0 (same for EL1).
- SCTLR_EL1.ATA = 1 (enable allocation tag access at EL1)
- SCTLR_EL1.ATA0 = 1 (enable allocation tag access at EL0)
- SCTLR_EL1.TCF0 = 0b01 (synchronous tag check faults for
EL0 — always synchronous for user-space determinism)
- SCTLR_EL1.TCF = 0b01 (synchronous for EL1 during
development/debug). Production kernels may set TCF = 0b10
(asynchronous) via boot parameter `mte=async` to reduce
performance impact while still detecting violations.
- GCR_EL1: set Exclude field to 0x0000 (all 16 tags available
for random tag generation via IRG instruction). Rrnd = 0
(implementation-defined LFSR).
- RGSR_EL1: seed the tag generation LFSR with a value derived
from CNTPCT_EL0 (physical counter, provides per-boot entropy).
- TFSR_EL1: clear any pending async tag check fault status.
See [Section 10.4](10-security-extensions.md#arm-memory-tagging-extension) for the full MTE
integration with the memory allocator and user-space API.
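The probe-and-decide step at the top of this phase can be sketched as follows (field position per the text above; the enum and helper names are illustrative):

```rust
/// ID_AA64PFR1_EL1.MTE occupies bits [11:8].
#[derive(Debug, PartialEq)]
pub enum MteLevel {
    None, // 0b0000 — not supported
    Mte1, // 0b0001 — instructions only, no tag checking
    Mte2, // 0b0010 — full tag checking
    Mte3, // 0b0011 or later — asymmetric (FEAT_MTE3) or newer
}

pub fn probe_mte(id_aa64pfr1: u64) -> MteLevel {
    match (id_aa64pfr1 >> 8) & 0xF {
        0b0000 => MteLevel::None,
        0b0001 => MteLevel::Mte1,
        0b0010 => MteLevel::Mte2,
        _ => MteLevel::Mte3,
    }
}

/// System register configuration only proceeds at MTE2 or above.
pub fn mte_activates(level: &MteLevel) -> bool {
    matches!(level, MteLevel::Mte2 | MteLevel::Mte3)
}
```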
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
**Dependency note**: cap_table_init() (canonical Phase 2.2)
requires the slab allocator (canonical Phase 1.2), which in
turn requires the buddy allocator (canonical Phase 1.1 /
local Phase 5). The local ordering (Phase 5 → Phase 5a →
Phase 7) satisfies this dependency chain. Phase 7 corresponds
to canonical Phase 2.2.
Phase 8: GIC Initialization (v2 or v3, detected at runtime)
Read GIC version and base addresses from DTB
(`compatible` = "arm,gic-400" for GICv2, "arm,gic-v3" for GICv3).
- GICv2 path:
GICD (Distributor): enable, configure IRQ priorities and
targets for all SPIs. Set priority mask.
GICC (CPU Interface): enable, set priority mask to 0xFF
(accept all priorities), set BPR (binary point).
Interrupt group assignment: set GICD_IGROUPRn bit = 1 for
all SPIs and PPIs (Group 1 Non-secure). Group 0 is reserved
for secure firmware interrupts (if any).
- GICv3 path:
GICD (Distributor): enable, configure affinity routing (ARE=1),
set priorities for all SPIs. Set GICD_IGROUPRn = all 1s and
GICD_IGRPMODRn = all 0s for all SPIs (Group 1 Non-secure).
GICR (Redistributor) wakeup sequence:
1. Clear GICR_WAKER.ProcessorSleep (bit 1) by writing 0
2. Poll GICR_WAKER.ChildrenAsleep (bit 2) until it reads 0
(timeout: 1 ms with yield loop; if not cleared, log
warning and continue — redistributor may be stuck)
3. Configure SGI/PPI: set GICR_IGROUPRn = all 1s,
GICR_IGRPMODRn = all 0s (Group 1 Non-secure for PPIs/SGIs)
4. Set priorities for PPIs (GICR_IPRIORITYRn)
5. Enable desired PPIs (GICR_ISENABLERn) — specifically
INTID 27 / PPI 11 (virtual timer), INTID 30 / PPI 14 (non-secure EL1 physical timer, if needed)
ICC system registers:
ICC_SRE_EL1.SRE = 1 (enable system register access, must be
set before any other ICC_ register access)
ICC_PMR_EL1 = 0xF0 (accept normal priorities 0x00-0xEF;
reserve 0xF0-0xFF for pseudo-NMI — perf sampling, hard lockup
detection, SDEI. Matches Linux DEFAULT_PMR_VALUE.)
ICC_BPR1_EL1 = 0 (no priority grouping, all bits for preemption)
ICC_IGRPEN1_EL1 = 1 (enable Group 1 interrupt signaling)
Route timer IRQ (INTID 27 / PPI 11 = virtual timer) to this CPU.
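The GICR_WAKER handshake in the redistributor wakeup sequence (steps 1–2 above) reduces to two bit operations — a sketch with the bit positions as listed:

```rust
/// GICR_WAKER bits used by the wakeup sequence.
pub const WAKER_PROCESSOR_SLEEP: u32 = 1 << 1; // ProcessorSleep
pub const WAKER_CHILDREN_ASLEEP: u32 = 1 << 2; // ChildrenAsleep (read-only)

/// Step 1: compute the write-back value that clears ProcessorSleep.
pub fn waker_clear_sleep(waker: u32) -> u32 {
    waker & !WAKER_PROCESSOR_SLEEP
}

/// Step 2: poll predicate — the redistributor is awake once
/// ChildrenAsleep reads 0.
pub fn redistributor_awake(waker: u32) -> bool {
    waker & WAKER_CHILDREN_ASLEEP == 0
}
```

In the real path the poll loop wraps `redistributor_awake()` with the 1 ms timeout described above, logging a warning and continuing if the bit never clears.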
Phase 9: Generic Timer
Configure the ARM generic timer (virtual counter):
- Write timer period to CNTV_TVAL_EL0
- Enable timer: CNTV_CTL_EL0 = ENABLE (bit 0), clear IMASK
- Timer fires IRQ 27 (virtual timer PPI) → tick handler
Enable interrupts: msr daifclr, #0xf
Phase 10: SVC / Exception-Vector Syscall Setup
Configure the exception vector table to correctly dispatch system
calls arriving from EL0 via the SVC instruction.
Exception vector layout (VBAR_EL1, 16 entries × 128 bytes = 2 KB,
must be 2 KB-aligned):
Offset 0x000: Current EL with SP0 — Synchronous
Offset 0x080: Current EL with SP0 — IRQ
Offset 0x100: Current EL with SP0 — FIQ
Offset 0x180: Current EL with SP0 — SError
Offset 0x200: Current EL with SPx — Synchronous
Offset 0x280: Current EL with SPx — IRQ
Offset 0x300: Current EL with SPx — FIQ
Offset 0x380: Current EL with SPx — SError
Offset 0x400: Lower EL (AArch64) — Synchronous ← SVC lands here
Offset 0x480: Lower EL (AArch64) — IRQ
Offset 0x500: Lower EL (AArch64) — FIQ
Offset 0x580: Lower EL (AArch64) — SError
Offset 0x600: Lower EL (AArch32) — Synchronous
Offset 0x680: Lower EL (AArch32) — IRQ
Offset 0x700: Lower EL (AArch32) — FIQ
Offset 0x780: Lower EL (AArch32) — SError
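The offset arithmetic behind this table is regular — each exception origin owns four consecutive 128-byte slots (a sketch; the enums are illustrative):

```rust
/// Rows of the vector table above (in order of increasing offset).
pub enum Origin {
    CurrentElSp0,   // 0x000 block
    CurrentElSpx,   // 0x200 block
    LowerElAarch64, // 0x400 block
    LowerElAarch32, // 0x600 block
}

/// Columns within each origin's block.
pub enum Kind {
    Synchronous,
    Irq,
    Fiq,
    SError,
}

/// offset = origin × 0x200 + kind × 0x80
pub fn vector_offset(origin: Origin, kind: Kind) -> usize {
    (origin as usize) * 0x200 + (kind as usize) * 0x80
}
```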
SVC handler entry (Lower EL AArch64 Synchronous, offset 0x400):
1. Save all general-purpose registers and the ELR_EL1/SPSR_EL1
pair to the per-task kernel stack (or per-CPU trap frame).
2. Read ESR_EL1: check EC field (bits [31:26]) == 0x15 (SVC64
instruction). If EC != 0x15, dispatch to generic fault path.
3. Extract syscall number from X8 (Linux AArch64 ABI convention).
Arguments are in X0-X5. Return value is written to X0.
4. Invoke the syscall dispatch table (same table as all arches).
5. Restore registers and return via ERET (restores PC from
ELR_EL1 and PSTATE from SPSR_EL1).
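The EC check in step 2 is a six-bit field extraction (sketch):

```rust
/// ESR_EL1.EC occupies bits [31:26]; 0x15 identifies an SVC executed
/// in AArch64 state.
pub const EC_SVC64: u32 = 0x15;

pub fn esr_ec(esr: u32) -> u32 {
    esr >> 26
}

/// True if the synchronous exception was a 64-bit SVC; anything else
/// goes to the generic fault path.
pub fn is_svc64(esr: u32) -> bool {
    esr_ec(esr) == EC_SVC64
}
```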
Control register configuration (verified during this phase):
SCTLR_EL1: M=1 (MMU on), C=1 (data cache on), I=1 (icache on),
SA=1 (SP alignment check at EL1), SA0=1 (SP alignment at EL0).
HCR_EL2.TGE: must be 0 so that EL0 exceptions route to EL1, not
EL2. Verified here if the kernel is running under a hypervisor
that sets up HCR_EL2 before entering the guest kernel.
SPSR_EL1: set up on return so EL0 re-enters AArch64 state (M=0b0000).
Verification test (executed during boot):
Trigger SVC from EL1 to test the synchronous exception vector
(VBAR_EL1 + 0x200, "Current EL with SPx — Synchronous"). The
handler fires, reads ESR_EL1 to verify EC == 0x15 (SVC64), and
returns. This is a vector table self-test — not a user-mode
execution test. User-mode execution is not possible until the
scheduler is initialized in Phase 11.
Phase 11: Scheduler (canonical Phase 2.3)
Initialize EEVDF scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Phase 11a: Workqueue Framework (canonical Phase 2.7)
Initialize named kernel worker thread pools.
Cross-ref: [Section 3.11](03-concurrency.md#workqueue-deferred-work).
Phase 11b: RCU Init (canonical Phase 2.8)
Initialize RCU infrastructure (grace period tracking,
callback queues, per-CPU state).
Cross-ref: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths).
Phase 11c: LSM Framework Init (canonical Phase 2.9)
Initialize LSM framework and register compiled-in LSMs.
Cross-ref: [Section 9.8](09-security.md#linux-security-module-framework).
Secondary CPU Bringup (AArch64 via PSCI):
After Phases 11–11c complete on the primary CPU, secondary CPUs are brought online using PSCI (Power State Coordination Interface).
AP Stack Allocation (AArch64):
The primary CPU allocates each secondary's kernel stack from the boot allocator
before issuing the PSCI CPU_ON call. Stack size is 16 KB per AP, allocated
from the per-NUMA-node boot allocator, preferring memory local to the target
CPU's node. The stack top address and percpu base are written to the AP's
ApStartupMailbox slot (see Section 2.3 for the struct definition; the
same type is used on all architectures). The PSCI context_id parameter is
set to the physical address of the AP's mailbox so that the secondary entry
stub can locate its stack before the MMU is active.
If stack allocation fails (boot allocator OOM), the primary logs "CPU {mpidr}:
stack allocation failed, CPU disabled", does not issue CPU_ON, and marks
the CPU permanently offline. Boot continues with the remaining CPUs.
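A host-testable sketch of the mailbox layout described above — the field names here are illustrative, since the authoritative cross-architecture definition is the struct in Section 2.3:

```rust
use core::sync::atomic::AtomicU32;

/// One cache line (64 bytes at align(64)), so the primary's single
/// `dc civac` per mailbox in Phase 14 covers the whole struct.
#[repr(C, align(64))]
pub struct ApStartupMailbox {
    pub status: AtomicU32, // set to MAILBOX_READY once filled
    pub cpu_id: u32,       // target MPIDR[31:0], verified by the AP
    pub stack_top: u64,    // top of the 16 KB node-local AP stack
    pub percpu_base: u64,  // loaded into TPIDR_EL1 by the AP stub
    pub ttbr0: u64,        // identity-map root, read before MMU enable
}
```

The PSCI context_id carries the physical address of this struct, which is why every field must be readable with the MMU off.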
Phase 12: Secondary CPU Detection
Parse DTB /cpus node for all CPU entries:
- Each cpu@N node contains: reg = MPIDR affinity bits
- device_type = "cpu"
- enable-method = "psci" (indicates PSCI is used)
Assign sequential tree indices (0 = primary, 1..N-1 = secondaries
in DTB order). Allocate PerCpu<T> slots and AP_STARTUP_MAILBOXES.
Initialize SmpBringupState; set deadline_ns = now_ns() + 30s.
Phase 13: PSCI Method Detection
Check /psci node in DTB for PSCI method:
- method = "smc": Use SMC (Secure Monitor Call) for PSCI
- method = "hvc": Use HVC (Hypervisor Call) for PSCI
Verify PSCI version via PSCI_VERSION (function ID 0x84000000):
- Major version in bits 31:16, minor in 15:0
- Require PSCI 1.0+ for full feature support
Phase 14: Secondary CPU Startup (fan-out tree, PSCI CPU_ON)
Primary allocates stack for tree-index 1 (fills mailbox, sets
mailbox[1].status = MAILBOX_READY).
**Cache maintenance before CPU_ON** (critical for correctness):
The primary must ensure that the mailbox contents and page
table pages are visible to the secondary CPU, which starts
with cold caches. Before issuing each CPU_ON call:
1. dc civac on every cache line of the ApStartupMailbox
(64 bytes = 1 cache line at align(64))
2. dc civac on the page table root page (the secondary
will load TTBR0_EL1 from the mailbox before enabling MMU)
3. dsb sy // ensure all cache operations complete
4. isb // synchronize instruction stream
This cleans the mailbox and page tables to the Point of
Coherence, guaranteeing that non-coherent secondary CPUs
(which have not yet joined the coherency domain) observe the
correct data when they perform uncached reads.
Then calls PSCI CPU_ON:
x0 = 0xC4000003 (CPU_ON function ID, AArch64 PSCI 0.2+)
x1 = target_mpidr (MPIDR affinity value from DTB for index 1)
x2 = secondary_entry_phys (physical address of entry stub)
x3 = mailbox_phys (physical address of ApStartupMailbox[1])
Issue via SMC or HVC depending on Phase 13 detection.
Return values:
0 (PSCI_SUCCESS): CPU starting
-2 (PSCI_INVALID_PARAMS): bad MPIDR or entry address
-4 (PSCI_ALREADY_ON): CPU was already running (treat as success)
other negative: firmware error; mark CPU offline
Each secondary, after completing Phase 15 init, atomically sets its
bit in SmpBringupState.online_mask (via CpuMask::set_atomic) and
increments online_count. It then reads its own tree index i and
issues CPU_ON for its children at indices 2i+1 and 2i+2 (allocating
stacks and filling mailboxes first), before entering the scheduler
idle loop.
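The child-index computation for the fan-out tree can be sketched with a hypothetical helper (`cpu_count` bounds the tree):

```rust
/// In the binary fan-out tree, the CPU at tree index i starts the CPUs
/// at indices 2i+1 and 2i+2, skipping any index past the CPU count.
pub fn fanout_children(i: usize, cpu_count: usize) -> Vec<usize> {
    [2 * i + 1, 2 * i + 2]
        .into_iter()
        .filter(|&c| c < cpu_count)
        .collect()
}
```

The tree shape means bringup latency grows logarithmically in CPU count: the primary starts index 1, index 1 starts 3 and 4 in parallel with the primary starting 2, and so on.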
Phase 15: Secondary CPU Entry (secondary_entry stub, per AP)
Each secondary CPU enters here in EL1 with MMU off.
x0 = context_id = physical address of own ApStartupMailbox.
a. EL2-to-EL1 transition (same check as BSP step 2b):
Read CurrentEL. If EL2: configure HCR_EL2, SCTLR_EL1,
SPSR_EL2, ELR_EL2, then ERET to EL1. If already EL1:
fall through. (Some firmware enters secondaries at EL2
even when the primary was dropped to EL1.)
b. Cache and TLB invalidation before MMU enable:
ic iallu // invalidate entire instruction cache
tlbi vmalle1 // invalidate all TLB entries for EL1
dsb sy // wait for invalidation to complete
isb // synchronize instruction stream
(The secondary's caches may contain stale data from a
previous boot or firmware context. Invalidation ensures
the first instruction fetch and page table walk after MMU
enable use fresh data from memory.)
c. Spin on mailbox.status until == MAILBOX_READY (pause loop)
d. Verify mailbox.cpu_id matches own MPIDR[31:0]
e. Enable FPU/NEON: write CPACR_EL1.FPEN = 0b11
f. Load kernel page tables and enable MMU:
- Write MAIR_EL1 (same value as primary's Phase 6)
- Write TCR_EL1 (same value as primary's Phase 6)
- Write TTBR0_EL1 with primary's identity map root table PPN
- Write TTBR1_EL1 with primary's kernel map root table PPN
- isb
- Set SCTLR_EL1 (same value as primary's Phase 6: M=1, C=1,
I=1, SA=1, SA0=1)
- isb
- tlbi vmalle1; dsb sy; isb // flush TLB after MMU enable
g. Load stack pointer: ldr x1, [x0, #offsetof(stack_top)]; mov sp, x1
h. Load percpu_base into TPIDR_EL1:
`ldr x1, [x0, #offsetof(percpu_base)]`
`msr TPIDR_EL1, x1`
TPIDR_EL1 is the CpuLocal register on AArch64
(see [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path)). x18 is NOT
used — the platform register is reserved for the OS on AArch64 but
UmkaOS uses TPIDR_EL1 per the Ch 3 specification.
i. Branch to Rust: bl secondary_init
In secondary_init():
1. Load VBAR_EL1 (exception vectors, same table as primary)
2. Initialize GIC CPU interface (the primary already configured
the GICD for all CPUs during Phase 8):
GICv2: GICC_PMR = 0xFF (unmask all); GICC_CTLR = 0x1 (enable)
GICv3:
a. Wake this CPU's redistributor: clear GICR_WAKER.ProcessorSleep,
poll GICR_WAKER.ChildrenAsleep until clear (same sequence as
primary Phase 8)
b. Configure SGI/PPI groups (GICR_IGROUPRn, GICR_IGRPMODRn)
c. Enable INTID 27 / PPI 11 (virtual timer) in GICR_ISENABLERn
d. ICC_SRE_EL1.SRE = 1
e. ICC_PMR_EL1 = 0xF0; ICC_BPR1_EL1 = 0
f. ICC_IGRPEN1_EL1 = 1
3. Calibrate generic timer (read CNTFRQ_EL0; program CNTV_TVAL_EL0)
4. If MTE supported (check CpuFeatureTable): configure SCTLR_EL1.TCF,
GCR_EL1, RGSR_EL1 with per-CPU LFSR seed (same register values as
primary Phase 6a, but RGSR_EL1 seed derived from CNTPCT_EL0 XOR
MPIDR_EL1 for per-CPU uniqueness)
5. Enable interrupts: msr daifclr, #0xf
6. Initialize per-CPU scheduler runqueue
7. Atomically set own bit in SmpBringupState.online_mask;
increment online_count
8. Issue CPU_ON for own tree children (if any) as described above
(with cache maintenance on mailbox + page tables before each
CPU_ON call, same protocol as Phase 14)
9. Enter scheduler idle loop (wfe)
Phase 16: SMP Online
Primary polls SmpBringupState.online_count and deadline_ns.
Loop exits when online_count == expected_secondary_count OR
monotonic_now() >= deadline_ns (global 30-second timeout).
Any secondary whose bit is not set in online_mask at exit is
marked permanently offline and removed from the kernel CPU mask.
System is fully multi-CPU. GIC affinity routing distributes
interrupts across all online CPUs.
PSCI Function IDs (AArch64):
| Function | ID (SMC64) | Arguments | Return |
|---|---|---|---|
| PSCI_VERSION | 0x84000000 | — | Version (31:16 = major, 15:0 = minor) |
| CPU_ON | 0xC4000003 | x1=target_mpidr, x2=entry_phys, x3=context_id | 0=success, negative=error |
| CPU_OFF | 0x84000002 | — | Does not return on success; -3 (DENIED) on failure |
| PSCI_FEATURES | 0x8400000A | x1=function_id to query | 0=supported, negative=not supported or error |
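Decoding the PSCI_VERSION return value is a straightforward field split (sketch):

```rust
/// PSCI_VERSION return: major in bits [31:16], minor in bits [15:0].
pub fn psci_version(ret: u32) -> (u16, u16) {
    ((ret >> 16) as u16, (ret & 0xFFFF) as u16)
}

/// Phase 13 requires PSCI 1.0+ for the full feature set.
pub fn psci_version_ok(ret: u32) -> bool {
    psci_version(ret).0 >= 1
}
```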
CPU_OFF (0x84000002): Powers down the calling CPU. Used for:
- Parking CPUs that fail initialization during AP bringup (Phase 15)
- CPU hotplug offline (runtime hot-remove)
- System suspend (each non-boot CPU calls CPU_OFF before the last CPU enters SYSTEM_SUSPEND)
The calling CPU must have migrated all interrupts and timer callbacks to another online CPU before calling CPU_OFF. After CPU_OFF, the CPU can only be restarted via CPU_ON from another CPU.
PSCI_FEATURES (0x8400000A): Queries whether a specific PSCI function
is supported by the firmware. UmkaOS calls PSCI_FEATURES during Phase 13
for CPU_ON, CPU_OFF, SYSTEM_SUSPEND, and SYSTEM_RESET to discover the
firmware's capability set. If CPU_ON is not supported, the kernel falls
back to spin-table bringup (if the DTB specifies enable-method =
"spin-table" on individual CPU nodes) or boots single-CPU.
MPIDR affinity (AArch64): Each CPU has a unique MPIDR_EL1 value:
- Bits [7:0]: Affinity level 0 (core within cluster)
- Bits [15:8]: Affinity level 1 (cluster within socket)
- Bits [23:16]: Affinity level 2 (socket)
- Bits [39:32]: Affinity level 3 (extended, rare; multi-chip systems)
The DTB /cpus/cpu@N/reg property contains these affinity bits. PSCI_CPU_ON uses the full MPIDR value to identify the target CPU.
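Extracting the affinity fields is a matter of byte selection — note the gap between Aff2 at bits [23:16] and Aff3 at bits [39:32] (sketch):

```rust
/// Split an MPIDR value into its four affinity levels
/// [Aff0, Aff1, Aff2, Aff3], per the layout above.
pub fn mpidr_affinity(mpidr: u64) -> [u8; 4] {
    [
        (mpidr & 0xFF) as u8,         // Aff0: core within cluster
        ((mpidr >> 8) & 0xFF) as u8,  // Aff1: cluster within socket
        ((mpidr >> 16) & 0xFF) as u8, // Aff2: socket
        ((mpidr >> 32) & 0xFF) as u8, // Aff3: bits [39:32], not [31:24]
    ]
}
```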
2.5.1.1 CPU Hotplug — Other Architectures¶
RISC-V and PPC SMP bringup are documented in their respective architecture boot sequence files:
- RISC-V 64: SBI HSM extension (Hart State Management). See Section 2.7.
- PPC32 / PPC64LE: OPAL spin-table, RTAS start-cpu, or DTB enable-method. See Section 2.10 and Section 2.11.
2.6 ARMv7 Boot Sequence¶
QEMU's -M vexpress-a15 -kernel loads the ELF at 0x80010000 and enters
at _start in SVC (Supervisor) mode with the MMU off. Registers: r0 = 0,
r1 = machine type, r2 = DTB address.
Entry assembly (arch/armv7/entry.S, GNU as syntax):
1. QEMU jumps to _start in SVC mode, MMU off
- r0 = 0 (unused), r1 = machine type, r2 = DTB address
2. _start:
a. Disable IRQ and FIQ: cpsid if
(sets I and F bits in CPSR)
b. Set ACTLR.SMP bit for cache coherency (Cortex-A15):
mrc p15, 0, r3, c1, c0, 1 // read ACTLR
orr r3, r3, #(1 << 6) // set SMP bit (bit 6)
mcr p15, 0, r3, c1, c0, 1 // write ACTLR
isb
(ACTLR.SMP must be set before enabling data caches. Without it,
the CPU does not participate in cache coherency — other CPUs'
snoops are not honored, and data cache operations do not maintain
coherence. On Cortex-A15, this bit also enables TLB and BTB
maintenance broadcast. Must be set on both BSP and all APs.)
c. Enable VFP/NEON (required before any Rust code — the compiler
generates floating-point and NEON instructions on ARMv7):
mrc p15, 0, r3, c1, c0, 2 // read CPACR
orr r3, r3, #(0xF << 20) // CP10 + CP11 = full access (bits [23:20])
mcr p15, 0, r3, c1, c0, 2 // write CPACR
isb // ensure CPACR change takes effect
vmrs r3, fpexc // read FPEXC
orr r3, r3, #(1 << 30) // set FPEXC.EN (bit 30)
vmsr fpexc, r3 // write FPEXC — VFP/NEON now enabled
d. Set up IRQ mode stack: switch to IRQ mode (cps #0x12),
load 4 KB IRQ stack, switch back to SVC mode (cps #0x13)
e. Load SVC stack pointer: ldr sp, =_stack_top
(64 KB stack in .bss._stack, 16-byte aligned via .align 4)
f. Clear BSS: zero memory from __bss_start to __bss_end
(str r6 loop, 4 bytes per iteration)
g. Prepare 64-bit arguments (AAPCS: u64 passed as register pairs):
- r0:r1 = 0:0 (multiboot_magic, both halves)
- r2:r3 = dtb_addr:0 (multiboot_info, low:high)
h. Branch: bl umka_main
i. Halt loop: wfe if umka_main returns
Stack (64 KB) is in .bss._stack (16-byte aligned via .align 4, which on
ARM GAS means 2^4 = 16 bytes). The linker script (linker-armv7.ld) places
.text._start first at 0x80010000 (offset from the vexpress-a15 base
0x80000000 to leave room for the bootloader stub).
Initialization phases (in umka_main(), sequential):
Canonical Phase Mapping:
| Canonical Phase | Description | Local Implementation |
|---|---|---|
| 0.1 | arch_early_init | Entry assembly (steps 1–2) + Phase 0 (PL011) + Phase 1 (VBAR) |
| 0.15 | early_log_init | Phase 1a: early_log_init() |
| 0.3 | parse_firmware_memmap | Phase 3: DTB /memory parse |
| 0.4 | boot_alloc_init | Phase 4: phys::init() from DTB regions |
| 0.5 | reserve_regions | Phase 4: reserve kernel image |
| 0.6 | numa_discover_topology | Phase 4a: DTB /memory nodes (single node on ARMv7) |
| 0.7 | cpulocal_bsp_init | Phase 4b: mcr p15, 0, &CpuLocalBlock, c13, c0, 4 (TPIDRPRW) |
| 0.8a | evolvable_verify | Phase 4c: Evolvable signature verification (physical addresses, no MMU) |
| 0.2 | identity_map | Phase 4d: TTBR0+TTBR1 short-descriptor MMU enable |
| 0.8b | evolvable_map_and_init | Phase 4e: Evolvable virtual mapping at EVOLVABLE_VIRT_BASE + VTABLE_SLOTS[] population |
| 1.1 | buddy allocator | Phase 5: buddy init |
| 1.2 | slab allocator | Phase 5a: slab_init() |
| 2.1 | IRQ domain | Phase 8: GIC init + IrqDomain setup |
| 2.2 | capability system | Phase 7: CapSpace init |
| 2.3 | scheduler | Phase 10: scheduler init |
| 2.7 | workqueue infra | Phase 10a: workqueue_init_early() |
| 2.8 | RCU | Phase 10b: rcu_init() |
| 2.9 | LSM framework | Phase 10c: lsm_init() |
| 3.1–3.3 | SMP bringup | Phases 11–14: PSCI CPU_ON fan-out |
Phase 0: PL011 UART Initialization (early console)
The PL011 UART at 0x1C090000 (vexpress-a15 motherboard UART0)
must be initialized before the early log ring (canonical Phase
0.15) to enable diagnostic output. Sequence:
1. Write UARTCR = 0 (disable UART — required before changing
baud rate registers)
2. Write UARTIBRD (Integer Baud Rate Divisor) and UARTFBRD
(Fractional Baud Rate Divisor) for 115200 baud:
Assuming 24 MHz reference clock (vexpress-a15 default):
IBRD = 24_000_000 / (16 × 115200) = 13
FBRD = round(0.020833... × 64) = 1
3. Write UARTLCR_H = 0x70:
WLEN = 0b11 (8 data bits), FEN = 1 (enable FIFOs),
STP2 = 0 (1 stop bit), PEN = 0 (no parity) → 8N1
4. Write UARTCR = 0x301:
UARTEN = 1 (bit 0), TXE = 1 (bit 8), RXE = 1 (bit 9)
After this, `serial::puts()` can write to UARTDR (offset 0x00)
by polling UARTFR.TXFF (bit 5) to avoid overflow.
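The divisor arithmetic in step 2 generalizes to any clock/baud pair. One common way to compute both registers at once is to round 64 × clk/(16 × baud) and split it — a sketch:

```rust
/// PL011 baud divisor: BAUDDIV = clk / (16 × baud). IBRD is the integer
/// part; FBRD is the fractional part scaled by 64 and rounded. Computing
/// round(64 × clk / (16 × baud)) and splitting handles the rounding
/// carry into IBRD as well.
pub fn pl011_divisors(clk_hz: u32, baud: u32) -> (u32, u32) {
    let div64 = (4 * clk_hz + baud / 2) / baud; // 64·clk/(16·baud), rounded
    (div64 >> 6, div64 & 0x3F)                  // (IBRD, FBRD)
}
```

For the vexpress-a15 default 24 MHz reference clock at 115200 baud this reproduces the IBRD = 13, FBRD = 1 pair derived above.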
Phase 1: Exception Vectors (VBAR)
Write vector table base to VBAR via CP15 c12 register:
mcr p15, 0, <reg>, c12, c0, 0
Vector table: 8 entries (Reset, Undef, SVC, Prefetch Abort,
Data Abort, reserved, IRQ, FIQ) × 4-byte branch instructions.
Each vector branches to a full handler stub.
Phase 1a: Early Log Ring Init (canonical Phase 0.15)
Call early_log_init() — sequencing checkpoint confirming the
BSS-resident EarlyLogRing is accessible. After this point,
early_log() emits messages to both the ring buffer and the
PL011 UART (Phase 0). See [Section 2.3](#boot-init-cross-arch--early-boot-log-ring).
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly, same
pattern as x86 entry.asm step 2d). Perform any additional
initialization that depends on zeroed static data.
Phase 3: DTB Parse (Nucleus minimal parser)
Parse the DTB passed in r2 (see [Section 2.8](#device-tree-and-platform-discovery)). Extract /memory
regions, /chosen bootargs, GIC base addresses, timer IRQ
numbers, and UART base. vexpress-a15 has well-known addresses
but DTB parsing keeps the code machine-independent.
**Parser placement**: Same as AArch64 — the minimal no-alloc
DTB parser is Nucleus code that runs before any allocator.
The full-featured Evolvable DTB parser handles second-pass
device enumeration after heap initialization.
**PSCI method detection**: Also extract the `/psci` node's
`method` property ("hvc" or "smc") during this pass. The
vexpress-a15 QEMU machine defaults to `method = "hvc"` (not
"smc"). Store the detected method for use in Phase 12.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). The vexpress-a15
machine provides up to 1 GB RAM starting at 0x80000000.
Reserve kernel image.
Phase 4a: NUMA Topology Discovery (canonical Phase 0.6)
Parse DTB /memory nodes. ARMv7 platforms are virtually always
single-node (UMA). Set single node 0 covering all memory.
Cross-ref: [Section 4.11](04-memory.md#numa-topology-and-policy).
Phase 4b: CpuLocal BSP Init (canonical Phase 0.7)
Initialize CpuLocalBlock for the BSP and set the arch-specific
fast-access register:
mcr p15, 0, &CPU_LOCAL_BLOCKS[0], c13, c0, 4 (TPIDRPRW)
TPIDRPRW (Thread ID Register, Privileged R/W) is the ARMv7
per-CPU data pointer, accessible only in privileged modes.
See [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--initialization-sequence).
Phase 4c: Evolvable Signature Verification (canonical Phase 0.8a)
Verify the Evolvable image signature using physical addresses
only (no MMU required). Nucleus LMS verifier (~2 KB) checks
ML-DSA-65 + Ed25519 hybrid signature against embedded public
key. On failure, panic.
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
**Invariant**: Phases 0.1–4c MUST NOT dispatch through
VTABLE_SLOTS[] or any replaceable policy vtable.
Phase 4d: Virtual Memory — MMU Enable (canonical Phase 0.2)
See Phase 6 below for the full short-descriptor page table
setup and MMU enable sequence. This is the canonical Phase 0.2
identity mapping step, placed here to satisfy the constraint
that MMU must be enabled before Evolvable virtual mapping
(Phase 4e). Page table pages are allocated from BootAlloc.
Phase 4e: Evolvable Virtual Mapping (canonical Phase 0.8b)
Map Evolvable .text (RX) and .rodata (RO) pages at
EVOLVABLE_VIRT_BASE (0xC040_0000 on ARMv7).
Allocate fresh RW pages for .data+.bss via BootAlloc.
Call evolvable_init() to populate VTABLE_SLOTS[].
**Requires MMU enabled** (Phase 4d complete).
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
After Phase 4e, Evolvable vtable dispatch is permitted.
Phase 5: Kernel Heap (buddy allocator)
Initialize the buddy allocator with all available physical
memory discovered from the DTB memory regions (Phase 4).
The buddy allocator manages power-of-two blocks (order 0–10,
4 KB–4 MB). See [Section 4.2](04-memory.md#physical-memory-allocator).
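The order-to-size arithmetic can be sketched in Rust — an illustrative helper, not code from the UmkaOS source:

```rust
/// Block size for a given buddy order (order 0 = 4 KB base page).
/// Illustrative sketch of the order arithmetic described above.
const BASE_PAGE: usize = 4096;

fn order_size(order: u32) -> usize {
    assert!(order <= 10, "buddy allocator manages orders 0-10");
    BASE_PAGE << order // 4 KB doubled `order` times
}
```

Order 10 yields 4096 << 10 = 4 MB, matching the stated upper bound.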
Phase 5a: Slab Allocator (canonical Phase 1.2)
Initialize slab caches on top of the buddy allocator.
After this point, Box::new, Arc::new, and typed allocations
are available. Cross-ref: [Section 4.3](04-memory.md#slab-allocator).
**Identity map timing note**: MMU is enabled at Phase 4d
(canonical Phase 0.2) using page table pages allocated from
BootAlloc. The buddy allocator (Phase 5) and slab (Phase 5a)
then operate with MMU active.
Phase 6: Virtual Memory Detail (TTBR0 + TTBR1, Short Descriptor, runs at Phase 4d)
Allocate L1 page table (16 KB, 16 KB-aligned) from BootAlloc.
Build mappings using ARMv7 short-descriptor
(32-bit) format.
**SCTLR explicit initialization** (before any MMU operations):
Read current SCTLR via `mrc p15, 0, r0, c1, c0, 0`, then
write a known-safe value:
M = 0 (MMU off — will be enabled last)
A = 0 (alignment faults disabled during setup)
C = 0 (data cache off — enabled with MMU)
I = 0 (instruction cache off — enabled with MMU)
Z = 1 (branch prediction enable)
V = 0 (low exception vectors, use VBAR)
TRE = 1 (TEX remap enable — required for Cortex-A15)
AFE = 1 (Access Flag Enable — AP[0] becomes access flag)
TE = 0 (ARM mode exception handlers, not Thumb)
Write via `mcr p15, 0, r0, c1, c0, 0; isb`.
**PRRR/NMRR configuration** (TEX remap registers):
Cortex-A15 uses TEX remapping (SCTLR.TRE=1). The PRRR (Primary
Region Remap Register, c10 c2 0) and NMRR (Normal Memory Remap
Register, c10 c2 1) define how TEX[0]/C/B page table bits map
to actual memory types:
PRRR = 0xFF0C_09A8 (NOS[7:0] = 0xFF; low bits pack the fields below):
TR0 = 0b00 (Strongly-Ordered for TEX[0]:C:B = 000)
TR1 = 0b10 (Normal for TEX[0]:C:B = 001 — outer+inner WB-WA)
TR2 = 0b10 (Normal for TEX[0]:C:B = 010 — outer+inner WT)
TR3 = 0b10 (Normal for TEX[0]:C:B = 011 — outer+inner WB-WA)
TR4 = 0b01 (Device for TEX[0]:C:B = 100)
TR5 = 0b10 (Normal for TEX[0]:C:B = 101 — non-cacheable)
TR6 = 0b00 (reserved)
TR7 = 0b00 (reserved)
NS1/NS0 = 1 for normal memory types (inner-shareable implied)
DS0/DS1 = 0 (no remapping for shareable attribute)
NMRR = 0x00EC_00EC (both halves pack the fields below):
IR0 = 0b00 (non-cacheable for region type 0)
IR1 = 0b11 (WB-WA inner for region type 1)
IR2 = 0b10 (WT inner for region type 2)
IR3 = 0b11 (WB-WA inner for region type 3)
OR0-OR3 = same as IR0-IR3 (symmetric inner/outer cacheability)
Mapping summary with TEX remap enabled:
| TEX[0]:C:B | PRRR type | NMRR inner | NMRR outer | Use |
|------------|-----------|------------|------------|-----|
| 000 | SO | — | — | Strongly-ordered MMIO |
| 001 | Normal | WB-WA | WB-WA | Normal RAM (default) |
| 010 | Normal | WT | WT | Write-through regions |
| 011 | Normal | WB-WA | WB-WA | Normal RAM (alias) |
| 100 | Device | — | — | Device-nGnRE (UART, GIC) |
| 101 | Normal | NC | NC | Non-cacheable DMA buffers |
Write PRRR and NMRR before enabling MMU:
mcr p15, 0, <prrr_val>, c10, c2, 0
mcr p15, 0, <nmrr_val>, c10, c2, 1
**TTBCR configuration for the TTBR0/TTBR1 split**:
EVOLVABLE_VIRT_BASE = 0xC0400000 resides in the upper address
space. The TTBCR.N field sets the TTBR0/TTBR1 boundary: a VA is
translated via TTBR0 only when its top N bits are all zero, so
TTBR0 covers the low 2^(32-N) bytes:
N = 2: TTBR0 covers VA [0x0000_0000, 0x4000_0000) — low 1 GB (user)
TTBR1 covers VA [0x4000_0000, 0xFFFF_FFFF] — upper 3 GB,
including the identity-mapped RAM (0x8000_0000+) and the
kernel region at 0xC000_0000
TTBCR register (c2 c0 2):
N = 2 (bits [2:0])
PD0 = 0 (translation table walk for TTBR0 enabled)
PD1 = 0 (translation table walk for TTBR1 enabled)
Write: mcr p15, 0, <ttbcr_val>, c2, c0, 2
**TTBR0 — user space** (low 1 GB):
- L1 table: with N=2 the TTBR0 table shrinks to 2^(14-N) bytes =
4 KB (1024 entries × 4 bytes, 4 KB-aligned)
- Holds no mappings at boot; a per-process table is installed
once user address spaces exist
Write TTBR0: mcr p15, 0, <l1_phys>, c2, c0, 0
**TTBR1 — kernel + identity map** (upper 3 GB):
- L1 table: TTBR1 always uses the full 16 KB, 16 KB-aligned
table allocated above (4096 entries indexed by VA[31:20];
entries below the N=2 boundary are simply never used)
- Section descriptors (1 MB) identity-map all physical RAM
(0x8000_0000+ on vexpress-a15) with AP=0b11 (full access),
TEX[0]:C:B = 001 (normal WB-WA cacheable), S=1 (shareable),
Domain=0
- Map kernel .text (section descriptors, AP=RO via the APX bit
where available, else full access), .rodata, .data+.bss
- Map Evolvable at EVOLVABLE_VIRT_BASE (0xC0400000)
Write TTBR1: mcr p15, 0, <l1_kernel_phys>, c2, c0, 1
**DACR initial configuration**:
- Domain 0 = Client (0b01): all accesses checked against page
table permissions. Used for all boot-time mappings.
- Domains 1-3 = No Access (0b00): reserved for user, I/O,
vectors (configured when the respective mappings are created).
- Domains 4-15 = No Access (0b00): reserved for Tier 1 driver
isolation (see Phase 6b below for the transition plan).
Write: ldr r1, =0x00000001; mcr p15, 0, r1, c3, c0, 0
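The DACR packing (2 bits per domain, 16 domains in one 32-bit register) can be sketched as a small Rust helper — a hypothetical illustration, not the kernel's actual code:

```rust
/// ARMv7 DACR access types: 2 bits per domain, 16 domains.
/// Illustrative sketch; names are not from the UmkaOS source.
#[derive(Clone, Copy)]
enum DomainAccess {
    NoAccess = 0b00, // any access generates a domain fault
    Client   = 0b01, // accesses checked against page-table AP bits
    Manager  = 0b11, // all permission checks bypassed
}

/// Compose a DACR value; domains not listed stay NoAccess (0b00).
fn dacr_value(domains: &[(u32, DomainAccess)]) -> u32 {
    domains.iter().fold(0, |acc, &(d, a)| {
        assert!(d < 16, "ARMv7 DACR has 16 domains");
        acc | ((a as u32) << (d * 2))
    })
}
```

With domain 0 as Client and all others No Access, this reproduces the boot-time value 0x00000001 written above.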
**Cache maintenance and MMU enable sequence**:
1. Invalidate all TLBs:
mcr p15, 0, r0, c8, c7, 0 // TLBIALL
2. Invalidate instruction cache:
mcr p15, 0, r0, c7, c5, 0 // ICIALLU
3. Invalidate branch predictor:
mcr p15, 0, r0, c7, c5, 6 // BPIALL
4. dsb; isb
5. Enable MMU + caches (final SCTLR write):
Read SCTLR, set: M=1 (MMU), C=1 (data cache), I=1 (icache),
Z=1 (branch prediction — already set above)
mcr p15, 0, <sctlr_val>, c1, c0, 0
6. isb (ensure pipeline sees new SCTLR)
**L2 page table creation timing** (Phase 6, deferred):
The initial identity map uses 1 MB section descriptors in L1.
L2 page tables (256 × 4-byte entries = 1 KB each, providing
4 KB page granularity) are created after the slab allocator
is available (canonical Phase 1.2 / local Phase post-5).
L2 tables are needed for:
- Fine-grained permission mapping (.text RX, .rodata RO,
.data RW, guard pages)
- User-space process page tables (4 KB pages)
- MMIO regions smaller than 1 MB
L2 installation: replace the L1 section descriptor with a
coarse page table descriptor (bits [1:0] = 0b01), pointing
to the L2 table's physical address.
Phase 6b: DACR Transition Plan (Tier 1 driver isolation)
Initially (Phase 6), all mappings use domain 0 (Client).
When the driver framework initializes (canonical Phase 4.2+),
ARMv7 Tier 1 drivers are assigned to domains 4-15:
- Each Tier 1 driver gets a dedicated DACR domain
- Domain switch: write DACR to set the target domain to
Client (0b01) and the previous domain to No Access (0b00),
then ISB (~30-40 cycles). See [Section 11.3](11-drivers.md#driver-isolation-tiers).
- ARMv7 supports 16 domains (4 bits each in the 32-bit DACR
register). 4 domains are reserved for system use:
Domain 0 = Kernel (text, data, stacks)
Domain 1 = User (userspace page tables)
Domain 2 = I/O (ioremap'd device MMIO)
Domain 3 = Vectors (exception vector page at 0xFFFF0000)
Domains 4-15 provide up to 12 concurrent Tier 1 driver domains.
This matches Linux's domain allocation (arch/arm/include/asm/domain.h:
DOMAIN_KERNEL=0, DOMAIN_USER=1, DOMAIN_IO=2, DOMAIN_VECTORS=3).
- Manager mode (0b11): bypasses all permission checks. Used
only during driver reload to access the faulting domain's
pages for state recovery. Never in steady state.
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
**Dependency note**: Same as AArch64 — cap_table_init()
(canonical Phase 2.2) requires slab (canonical Phase 1.2)
which requires buddy (canonical Phase 1.1 / local Phase 5).
Local Phase 7 = canonical Phase 2.2.
Phase 8: GIC Initialization
ARMv7 platforms typically use GICv2 (GICv3 supports ARMv7/AArch32
but is rare on ARMv7 SoCs; limited to 3 affinity levels in AArch32).
Read GICD/GICC bases from DTB (vexpress-a15 defaults:
GICD = 0x2C001000, GICC = 0x2C002000).
Configure distributor, CPU interface, route timer IRQ.
Phase 9: Timer
Configure SP804 dual timer or ARM generic timer (if available):
- SP804 (vexpress): program LOAD register, enable with
periodic mode + interrupt enable, IRQ via GIC SPI
- Generic timer (Cortex-A15): CNTVCT, CNTV_TVAL, CNTV_CTL
(same registers as AArch64, accessed via CP15 c14)
Enable interrupts: cpsie if
Phase 10: Scheduler (canonical Phase 2.3)
Initialize EEVDF scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Phase 10a: Workqueue Framework (canonical Phase 2.7)
Initialize named kernel worker thread pools.
Cross-ref: [Section 3.11](03-concurrency.md#workqueue-deferred-work).
Phase 10b: RCU Init (canonical Phase 2.8)
Initialize RCU infrastructure.
Cross-ref: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths).
Phase 10c: LSM Framework Init (canonical Phase 2.9)
Initialize LSM framework and register compiled-in LSMs.
Cross-ref: [Section 9.8](09-security.md#linux-security-module-framework).
Secondary CPU Bringup (ARMv7 via PSCI):
After Phase 10 completes on the primary CPU, secondary CPUs are brought online using PSCI (Power State Coordination Interface).
PSCI calling convention (ARMv7):
The kernel detects the PSCI version and calling mechanism at runtime from the
DTB /psci node compatible property:
- "arm,psci-0.2" or later: use PSCI 0.2 function IDs (preferred)
- "arm,psci": use PSCI 0.1 function IDs (legacy fallback; function IDs are platform-specific and read from the DTB cpu_on property under /psci)
PSCI 0.2 function IDs for ARMv7 (32-bit callee convention):
CPU_ON = 0x84000003 (PSCI 0.2, 32-bit)
r0 = 0x84000003 (function ID)
r1 = target_cpu (MPIDR[31:0] of target AP)
r2 = entry_point (physical address of AP entry stub, must be 32-bit)
r3 = context_id (physical address of ApStartupMailbox for this AP)
Return values (in r0):
0 PSCI_SUCCESS: AP starting
-2 PSCI_INVALID_PARAMS: bad MPIDR or entry address
-4 PSCI_ALREADY_ON: AP was already running (treat as success)
other negative: firmware error; mark AP offline
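The return-value policy above (ALREADY_ON treated as success, other negatives as failure) can be captured in a small Rust sketch — illustrative names, not the kernel's code:

```rust
/// PSCI 0.2 CPU_ON return codes (subset relevant to bringup).
const PSCI_SUCCESS: i32 = 0;
const PSCI_INVALID_PARAMS: i32 = -2;
const PSCI_ALREADY_ON: i32 = -4;

/// Does this CPU_ON return value mean the AP will come up?
/// ALREADY_ON is treated as success per the policy above;
/// every other negative code marks the AP offline.
fn cpu_on_succeeded(r0: i32) -> bool {
    matches!(r0, PSCI_SUCCESS | PSCI_ALREADY_ON)
}
```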
Calling convention: use smc #0 if the DTB /psci node method = "smc";
use hvc #0 if method = "hvc". The method property is mandatory in valid
PSCI device trees. If absent, default to hvc for the vexpress-a15 machine
(QEMU's vexpress-a15 DTB specifies method = "hvc") and smc for all
other platforms. The DTB /psci node is the authoritative source — always
read the method property during Phase 3 DTB parse rather than assuming
a compile-time default.
AP Stack Allocation (ARMv7):
Stack allocation follows the same protocol as all architectures (see
Section 2.3): the primary allocates 16 KB per AP from the boot allocator
before issuing CPU_ON, fills the ApStartupMailbox, passes its physical
address as context_id, and marks the AP offline on allocation failure.
GIC initialization for ARMv7 APs:
The primary CPU configures the GIC Distributor (GICD) during Phase 8 for all CPUs. Each AP, on startup, initializes only its own GIC CPU Interface (GICC):
APs do not touch the GICD — the primary owns the distributor. IRQs are unmasked by clearing the CPSR.I and CPSR.F bits (cpsie if) after the
scheduler is initialized and the AP is ready to run tasks.
ARMv7 AP entry sequence:
The entry stub physical address passed to CPU_ON as r2 is the ARMv7 SMP
trampoline. The AP entry point receives context_id (physical address of
ApStartupMailbox) in r0 per PSCI DEN0022 §5.1.3 (AArch32 calling convention
delivers context_id to the callee in r0, not r3 — r3 is the caller-side argument
to CPU_ON) and follows this sequence:
1. AP wakes at physical entry point (address passed in CPU_ON r2).
r0 = physical address of own ApStartupMailbox (from PSCI context_id).
2. Save mailbox pointer: mov r5, r0
(r5 is callee-saved; preserves mailbox address across steps 3-4)
3. Disable IRQs and FIQs: cpsid if
(sets CPSR.I and CPSR.F; prevents spurious interrupts before stack is set)
4. Confirm SVC mode: mrs r0, cpsr; and r0, r0, #0x1F; cmp r0, #0x13
If not in SVC mode (0x13), switch: cps #0x13
4a. Set ACTLR.SMP bit for cache coherency (same as BSP step 2b):
mrc p15, 0, r1, c1, c0, 1 // read ACTLR
orr r1, r1, #(1 << 6) // set SMP bit (bit 6)
mcr p15, 0, r1, c1, c0, 1 // write ACTLR
isb
(Must be set before enabling caches on this AP. Without it, this
CPU's data cache is not coherent with the BSP and other APs.)
5. Enable VFP/NEON if needed:
mrc p15, 0, r1, c1, c0, 2 // read CPACR
orr r1, r1, #(0xF << 20) // enable CP10 + CP11 full access
mcr p15, 0, r1, c1, c0, 2 // write CPACR
vmrs r1, fpexc // enable VFP: FPEXC.EN = 1
orr r1, r1, #(1 << 30)
vmsr fpexc, r1
6. Enable MMU with kernel page tables (must match primary's Phase 6):
- Write PRRR and NMRR (same values as primary Phase 6):
mcr p15, 0, <prrr_val>, c10, c2, 0
mcr p15, 0, <nmrr_val>, c10, c2, 1
- Write TTBCR (N=2, same TTBR0/TTBR1 split as the primary):
mcr p15, 0, <ttbcr_val>, c2, c0, 2
- Load TTBR0 and TTBR1 with the physical addresses of the same
L1 tables the primary installed in Phase 6:
mcr p15, 0, <ttbr0_phys>, c2, c0, 0
mcr p15, 0, <ttbr1_phys>, c2, c0, 1
- Set DACR domain 0 = Client (0b01):
ldr r1, =0x00000001
mcr p15, 0, r1, c3, c0, 0
- Invalidate TLBs and caches:
mcr p15, 0, r0, c8, c7, 0 // TLBIALL
mcr p15, 0, r0, c7, c5, 0 // ICIALLU
mcr p15, 0, r0, c7, c5, 6 // BPIALL
dsb; isb
- Write SCTLR (same value as primary Phase 6): M=1, C=1, I=1,
Z=1, TRE=1, AFE=1, V=0
mcr p15, 0, <sctlr_val>, c1, c0, 0
- isb
7. Spin on mailbox.status until == MAILBOX_READY (0xAB1E1234):
ldr r1, =0xAB1E1234 // constant is not an encodable ARM immediate
spin: ldr r0, [r5, #offsetof(ApStartupMailbox, status)]
cmp r0, r1
bne spin (with yield: yield instruction or nop)
8. Verify mailbox.cpu_id matches own MPIDR[23:0]:
mrc p15, 0, r1, c0, c0, 5 // read MPIDR
bic r1, r1, #0xFF000000 // keep lower 24 affinity bits
// (0x00FFFFFF is not an encodable AND immediate)
ldr r0, [r5, #offsetof(ApStartupMailbox, cpu_id)]
cmp r0, r1
bne fault_halt // mismatch: configuration error
9. Load SP from stack_top:
ldr sp, [r5, #offsetof(ApStartupMailbox, stack_top)]
10. Load percpu_base (TPIDRPRW, the ARMv7 CpuLocal register per Section 3.1.2 (03-concurrency.md)):
ldr r4, [r5, #offsetof(ApStartupMailbox, percpu_base)]
mcr p15, 0, r4, c13, c0, 4 // write TPIDRPRW
11. Jump to Rust entry point:
bl ap_secondary_init // does not return
The ap_secondary_init() function (Rust) runs the following in order:
1. Load VBAR (exception vectors, same table as primary): mcr p15, 0, vbar, c12, c0, 0
2. Initialize GICC (CPU Interface): write GICC_PMR = 0xFF and GICC_CTLR = 0x1
3. Initialize per-CPU scheduler runqueue
4. Configure and enable the timer (generic timer or SP804 as appropriate)
5. Enable interrupts: cpsie if
6. Atomically set own bit in SmpBringupState.online_mask (via CpuMask::set_atomic);
increment online_count
7. Issue CPU_ON for own tree children (indices 2i+1, 2i+2) if they exist
and the global deadline_ns has not expired (allocate stacks, fill mailboxes,
call PSCI, same protocol as the primary for tree index 1)
8. Enter scheduler idle loop (wfe)
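The fan-out rule from step 7 — children at tree indices 2i+1 and 2i+2, skipped when they exceed the detected CPU count — can be sketched as:

```rust
/// Children of tree index `i` in the binary CPU_ON fan-out tree.
/// Illustrative sketch (std Vec for testability; the kernel
/// would use fixed indices, not an allocation).
fn fanout_children(i: usize, cpu_count: usize) -> Vec<usize> {
    [2 * i + 1, 2 * i + 2]
        .into_iter()
        .filter(|&c| c < cpu_count) // clip to detected CPUs
        .collect()
}
```

The primary (index 0) wakes index 1, which wakes 3 and 4, and so on; leaves have no children and simply enter the idle loop.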
SMP bringup phases (ARMv7):
Phase 11: Secondary CPU Detection
Parse DTB /cpus node for CPU entries with enable-method = "psci".
Assign sequential tree indices (0 = primary). Allocate PerCpu<T>
slots and AP_STARTUP_MAILBOXES. Initialize SmpBringupState;
set deadline_ns = now_ns() + 30s.
Phase 12: PSCI Method and Version Detection
Read /psci node: detect method (smc/hvc) and compatible string
(psci-0.2 vs psci-0.1). For psci-0.1, read cpu_on property.
Phase 13: First AP Wakeup (primary → tree index 1)
Allocate stack for index 1; fill mailbox[1]; set MAILBOX_READY.
Call PSCI CPU_ON for index 1 as described above.
Fan-out tree propagates: each AP wakes its children after init.
Phase 14: SMP Online
Primary polls SmpBringupState.online_count and deadline_ns.
Loop exits when online_count == expected_secondary_count OR
monotonic_now() >= deadline_ns (global 30-second timeout).
Any AP whose bit is not set in online_mask at exit is marked
permanently offline. Boot continues with available CPUs.
System is fully multi-CPU once all online APs are in their
scheduler idle loops.
2.7 RISC-V 64 Boot Sequence¶
QEMU's -M virt -bios default -kernel runs OpenSBI in M-mode, which then
jumps to the kernel at 0x80200000 in S-mode (Supervisor mode). Registers:
a0 = hart_id, a1 = DTB address (on QEMU and systems following the Linux
boot convention — see note below).
Note on a1 and DTB discovery: The RISC-V SBI specification does NOT mandate that
`a1` contains the DTB physical address. This is a firmware convention established
by QEMU and U-Boot, and is widely followed in practice, but real bare-metal boards
may use different mechanisms. The boot code therefore validates `a1` before
trusting it:
1. Check if `a1` is a valid DTB pointer: read the 4-byte magic at that address
   and verify it equals `0xD00DFEED` (big-endian FDT magic).
2. If `a1` is not a valid DTB: scan for a UEFI System Table (look for the
   `IBI SYST` signature in the EFI System Table header).
3. If UEFI is not found: use the SBI vendor extension to request the DTB address,
   or fall back to a compiled-in DTB for the target board.
The reference implementation uses option 1 with UEFI fallback for production
hardware targets.
Note on OpenSBI boot modes: OpenSBI can be compiled as FW_JUMP (fixed next-stage
entry point) or FW_DYNAMIC (entry point discovered at runtime from a
`struct fw_dynamic_info` passed by the previous-stage firmware). QEMU's
`-bios default` uses FW_DYNAMIC — QEMU itself builds and embeds an OpenSBI
FW_DYNAMIC image. U-Boot SPL also passes `fw_dynamic_info` when chainloading
OpenSBI. FW_DYNAMIC is the modern recommended approach: OpenSBI auto-detects the
next-stage entry point and runtime options from the `fw_dynamic_info` structure
(version, next_addr, next_mode, options, boot_hart). UmkaOS is compatible with
both FW_JUMP (the kernel is loaded at the fixed address OpenSBI expects) and
FW_DYNAMIC (the firmware passes the kernel entry address dynamically). The choice
of OpenSBI build mode is transparent to the kernel — both result in the same
S-mode entry with `a0 = hartid`, `a1 = DTB address`.
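Step 1 of the a1 validation — checking the big-endian FDT magic — can be sketched over a byte slice (the kernel would read physical memory instead; this is an illustrative helper):

```rust
/// FDT header magic, stored big-endian at offset 0 of the blob.
const FDT_MAGIC: u32 = 0xD00D_FEED;

/// Does this buffer plausibly hold a flattened device tree?
/// Sketch of validation step 1; a full check would also
/// sanity-check totalsize and the header version fields.
fn looks_like_dtb(blob: &[u8]) -> bool {
    blob.len() >= 4
        && u32::from_be_bytes([blob[0], blob[1], blob[2], blob[3]]) == FDT_MAGIC
}
```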
Entry assembly (arch/riscv64/entry.S, GNU as syntax):
1. OpenSBI jumps to _start in S-mode
- a0 = hart_id (hardware thread ID, usually 0 on single-core)
- a1 = DTB address (QEMU/U-Boot convention; validated at runtime — see note above)
2. _start:
a. Disable interrupts: csrci sstatus, 0x2
(clears SIE bit in supervisor status register)
b. Load stack pointer: la sp, _stack_top
(64 KB stack in .bss._stack, 16-byte aligned)
c. Initialize sscratch for trap entry:
csrw sscratch, 0
(sscratch = 0 means "currently in kernel mode". OpenSBI sets mscratch
for M-mode but does NOT initialize sscratch — S-mode must do this.
The trap entry prologue uses `csrrw tp, sscratch, tp`: if sscratch
was 0, we came from kernel mode (re-entrant trap); if nonzero, we
came from user mode and sscratch now holds user tp.)
d. Clear BSS: zero memory from __bss_start to __bss_end
(sd zero loop, 8 bytes per iteration)
e. Arguments already in correct registers:
a0 = hart_id (passed as multiboot_magic parameter)
a1 = DTB address (passed as multiboot_info parameter)
f. Call: call umka_main (jal with ra)
g. Halt loop: wfi (wait-for-interrupt) if umka_main returns
Stack (64 KB) is in .bss._stack (16-byte aligned). The linker script
(linker-riscv64.ld) places .text._start first at 0x80200000, after the
OpenSBI firmware region (0x80000000–0x801FFFFF).
Initialization phases (in umka_main(), sequential):
Canonical Phase Mapping:
| Canonical Phase | Description | Local Implementation |
|---|---|---|
| 0.1 | arch_early_init | Entry assembly (steps 1–2) + Phase 1 (stvec) |
| 0.15 | early_log_init | Phase 0.15: SBI DBCN fallback + early_log_init() |
| 0.3 | parse_firmware_memmap | Phase 3: DTB /memory parse |
| 0.4 | boot_alloc_init | Phase 4: phys::init() from DTB regions |
| 0.5 | reserve_regions | Phase 4: reserve OpenSBI + kernel image |
| 0.6 | numa_discover_topology | Phase 4a: DTB /memory nodes |
| 0.7 | cpulocal_bsp_init | Phase 4b: mv tp, &CpuLocalBlock; csrw sscratch, 0 (Linux RISC-V convention) |
| 0.8a | evolvable_verify | Phase 4c: Evolvable signature verification (physical addresses, no MMU) |
| 0.2 | identity_map | Phase 4d: satp Sv48 page tables + MMU enable |
| 0.8b | evolvable_map_and_init | Phase 4e: Evolvable virtual mapping at EVOLVABLE_VIRT_BASE + VTABLE_SLOTS[] population |
| 1.1 | buddy allocator | Phase 5: buddy init |
| 1.2 | slab allocator | Phase 5a: slab_init() |
| 2.1 | IRQ domain | Phase 8: PLIC init + IrqDomain setup |
| 2.2 | capability system | Phase 7: CapSpace init |
| 2.3 | scheduler | Phase 11: scheduler init |
| 2.7 | workqueue infra | Phase 11a: workqueue_init_early() |
| 2.8 | RCU | Phase 11b: rcu_init() |
| 2.9 | LSM framework | Phase 11c: lsm_init() |
| 3.1–3.3 | SMP bringup | SBI HSM sbi_hart_start() fan-out |
Phase 0.15: Early Serial / SBI Console Fallback
Before DTB is parsed (Phase 3), no UART base address is known.
During Phases 0.15–2, use the SBI Debug Console extension as
a fallback for early boot output:
SBI DBCN (Debug Console): EID = 0x4442434E ("DBCN")
FID 0 (sbi_debug_console_write): write bytes to console.
a0 = num_bytes, a1 = base_addr_lo, a2 = base_addr_hi.
The buffer must be in physically-mapped memory (no VA
translation — SBI operates in M-mode with MMU off).
FID 1 (sbi_debug_console_read): read bytes (optional).
FID 2 (sbi_debug_console_write_byte): write single byte.
a0 = byte value. Simpler, no buffer address needed.
If DBCN is not available (older OpenSBI < 1.3), fall back to
the legacy SBI putchar (EID=0x01, FID=0, a0=char). Legacy
putchar is deprecated but universally supported.
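The DBCN-vs-legacy selection can be modeled as a small Rust sketch (probing DBCN availability via sbi_probe_extension, and the ecall itself, are outside this illustration):

```rust
/// SBI extension/function IDs for single-byte console output,
/// per the fallback policy above.
const EID_DBCN: u32 = 0x4442_434E; // "DBCN"
const FID_DBCN_WRITE_BYTE: u32 = 2; // sbi_debug_console_write_byte
const EID_LEGACY_PUTCHAR: u32 = 0x01; // deprecated legacy putchar
const FID_LEGACY_PUTCHAR: u32 = 0;

/// Pick the (EID, FID) pair for an early console byte write.
fn early_putchar_call(dbcn_available: bool) -> (u32, u32) {
    if dbcn_available {
        (EID_DBCN, FID_DBCN_WRITE_BYTE)
    } else {
        (EID_LEGACY_PUTCHAR, FID_LEGACY_PUTCHAR)
    }
}
```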
After DTB parse (Phase 3), switch to direct 16550 MMIO UART:
UART base: read from DTB /soc/serial@... or /soc/uart@...
node (QEMU virt default: 0x10000000). Compatible string:
"ns16550a" or "ns16550".
Init sequence: write LCR=0x03 (8N1), FCR=0x01 (enable FIFO),
IER=0x00 (no interrupts — polled mode during boot),
MCR=0x00. Divisor not set — QEMU 16550 ignores baud rate.
On real hardware, set DLM:DLL for 115200 baud using the
clock frequency from the DTB clock-frequency property.
After UART init, all subsequent early_log() output goes to
both the early log ring ([Section 2.3](#boot-init-cross-arch--early-boot-log-ring))
and the 16550 UART. SBI console calls are no longer used.
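The divisor arithmetic for real hardware (divisor = clock / (16 × baud)) can be sketched as follows; the 1.8432 MHz clock in the test is a common example value, not one read from a DTB:

```rust
/// 16550 divisor latch value written to DLM:DLL.
/// The clock frequency comes from the DTB clock-frequency
/// property at runtime; this helper is illustrative.
fn uart16550_divisor(clock_hz: u32, baud: u32) -> u16 {
    (clock_hz / (16 * baud)) as u16
}
```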
Phase 1: Exception Vectors (stvec)
Write trap handler address to stvec CSR in Direct mode
(stvec[1:0] = 0b00). All traps — exceptions, software
interrupts, external interrupts — vector to a single entry
point that reads scause to dispatch.
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly, same
pattern as x86 entry.asm step 2d). Perform any additional
initialization that depends on zeroed static data.
Phase 3: DTB Parse
Parse the DTB passed in a1 (see [Section 2.8](#device-tree-and-platform-discovery)). Extract /memory
regions, /chosen bootargs, PLIC base address, CLINT address
(if present), and UART base. QEMU virt machine uses standard
addresses but DTB parsing keeps the code machine-independent.
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Mark available
regions free. Reserve:
- OpenSBI firmware: 0x80000000–0x801FFFFF (2 MB)
- Kernel image: 0x80200000 to __kernel_end
Unlike x86, no legacy BIOS region to reserve.
Phase 4a: NUMA Topology Discovery (canonical Phase 0.6)
Parse DTB /memory nodes for topology. RISC-V NUMA support
is limited — most QEMU virt configurations are single-node.
On NUMA-capable hardware, DTB /memory nodes with
`numa-node-id` properties provide the topology. Set single
node 0 if no NUMA information is found.
Cross-ref: [Section 4.11](04-memory.md#numa-topology-and-policy).
Phase 4b: CpuLocal BSP Init (canonical Phase 0.7)
Initialize CpuLocalBlock for the BSP and set the arch-specific
fast-access register:
mv tp, &CPU_LOCAL_BLOCKS[0]
csrw sscratch, 0 // Re-confirm: already set in entry assembly
// (step 2c), but Phase 0.7 is the canonical initialization point.
**Register convention** (matches Linux RISC-V): `tp` (x4) holds the
per-CPU base in kernel mode; `sscratch` holds user `tp` (U-mode) or
0 (S-mode). On trap entry: `csrrw tp, sscratch, tp` swaps the two.
If sscratch was 0, the trap came from kernel mode (re-entrant); if
nonzero, it came from user mode and sscratch now holds user tp.
See [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--initialization-sequence).
Phase 4c: Evolvable Signature Verification (canonical Phase 0.8a)
Verify the Evolvable image signature using physical addresses
only (no MMU required). Nucleus LMS verifier (~2 KB) checks
ML-DSA-65 + Ed25519 hybrid signature against embedded public
key. On failure, panic.
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
**Invariant**: Phases 0.1–4c MUST NOT dispatch through
VTABLE_SLOTS[] or any replaceable policy vtable.
Phase 4d: Virtual Memory — MMU Enable (canonical Phase 0.2)
See Phase 6 below for the full Sv48 page table setup and
satp configuration. Page table pages are allocated from
BootAlloc.
Phase 4e: Evolvable Virtual Mapping (canonical Phase 0.8b)
Map Evolvable .text (RX) and .rodata (RO) pages at
EVOLVABLE_VIRT_BASE (0xFFFF_FFC0_8040_0000 on RISC-V 64).
Allocate fresh RW pages for .data+.bss via BootAlloc.
Call evolvable_init() to populate VTABLE_SLOTS[].
**Requires MMU enabled** (Phase 4d complete).
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
After Phase 4e, Evolvable vtable dispatch is permitted.
Phase 5: Kernel Heap
Initialize the buddy allocator with all available physical
memory discovered from the DTB memory regions (Phase 4).
The buddy allocator manages power-of-two blocks (order 0–10,
4 KB–4 MB). See [Section 4.2](04-memory.md#physical-memory-allocator).
Phase 5a: Slab Allocator (canonical Phase 1.2)
Initialize slab caches on top of the buddy allocator.
After this point, Box::new, Arc::new, and typed allocations
are available. Cross-ref: [Section 4.3](04-memory.md#slab-allocator).
Phase 6: Virtual Memory Detail (satp, Sv48, runs at Phase 4d)
Build kernel page tables and higher-half mapping using Sv48
(4-level, 48-bit virtual address, 56-bit physical address):
Page table structure:
- 4 levels: L3 (root) → L2 → L1 → L0
- Each level: 512 entries × 8 bytes = 4 KB page table page
- VA decomposition (Sv48):
[63:48] sign-extension (must equal bit 47)
[47:39] L3 index (9 bits, 512 entries)
[38:30] L2 index (9 bits)
[29:21] L1 index (9 bits)
[20:12] L0 index (9 bits)
[11:0] page offset (12 bits, 4 KB)
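The decomposition above can be expressed as a small Rust helper (illustrative, not the kernel's actual code):

```rust
/// Split an Sv48 VA into [L3, L2, L1, L0] table indices,
/// 9 bits each, per the decomposition above. The sign-
/// extension bits [63:48] are ignored here.
fn sv48_indices(va: u64) -> [usize; 4] {
    [
        ((va >> 39) & 0x1FF) as usize, // L3 (root)
        ((va >> 30) & 0x1FF) as usize, // L2
        ((va >> 21) & 0x1FF) as usize, // L1
        ((va >> 12) & 0x1FF) as usize, // L0
    ]
}
```

For the kernel .text VA 0xFFFF_FFC0_8020_0000 this yields indices [511, 258, 1, 0].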
PTE format (64 bits):
[63] N — NAPOT (Svnapot extension, 0 if unsupported)
[62:61] PBMT — Page-Based Memory Type (Svpbmt, 0 if unsupported)
[60:54] Reserved (must be 0)
[53:37] PPN[3] — Physical Page Number bits [55:39]
[36:28] PPN[2] — Physical Page Number bits [38:30]
[27:19] PPN[1] — Physical Page Number bits [29:21]
[18:10] PPN[0] — Physical Page Number bits [20:12]
[9] RSW bit 1 — Reserved for Software.
UmkaOS: reserved for future use.
[8] RSW bit 0 — Reserved for Software.
UmkaOS: COW (copy-on-write) marker. Set on
fork() to mark pages as copy-on-write. Cleared
when the page is copied on a store page fault.
[7] D — Dirty (page has been written)
[6] A — Accessed (page has been read or written)
[5] G — Global (not flushed on ASID change)
[4] U — User (accessible from U-mode)
[3] X — Execute
[2] W — Write
[1] R — Read
[0] V — Valid
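A leaf PTE for this layout can be composed with a sketch like the following; the flag names mirror the bit list above but are not the kernel's actual constants:

```rust
/// Sv48 leaf PTE flag bits, per the layout above.
const PTE_V: u64 = 1 << 0; // Valid
const PTE_R: u64 = 1 << 1; // Read
const PTE_W: u64 = 1 << 2; // Write
const PTE_X: u64 = 1 << 3; // Execute
const PTE_U: u64 = 1 << 4; // User
const PTE_G: u64 = 1 << 5; // Global
const PTE_A: u64 = 1 << 6; // Accessed
const PTE_D: u64 = 1 << 7; // Dirty

/// Build a leaf PTE: the 44-bit PPN occupies bits [53:10].
fn make_pte(pa: u64, flags: u64) -> u64 {
    ((pa >> 12) << 10) | flags
}
```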
Megapages (large page mappings):
- L2 entry with R|W|X bits set (any nonzero combination):
1 GiB megapage. PPN[1] and PPN[0] must be 0 (aligned).
- L1 entry with R|W|X bits set:
2 MiB megapage. PPN[0] must be 0 (aligned).
- Used for identity-mapping large physical RAM regions
during early boot. Kernel .text uses 2 MiB pages.
Early boot A/D bit policy:
- Pre-set A=1 and D=1 on all kernel page table entries
during boot to avoid software A/D fault handling. The
RISC-V specification allows implementations where A/D
bits are managed by software (no hardware A/D update).
On such implementations, a page fault occurs on the
first access/write to a page with A=0 or D=0. Pre-
setting these bits eliminates this during early boot
when the page fault handler may not be fully operational.
- After the full VMM is online (post-Phase 6), the page
fault handler manages A/D bits normally for user pages.
User page A/D bit management:
- RISC-V Sv48 implementations fall into two categories:
(a) Hardware-managed A/D: the hardware atomically sets A on
first access and D on first store (like x86-64/AArch64).
(b) Software-managed A/D: the hardware traps on the first
access to a PTE with A=0 (instruction, load, or store/AMO
page fault — cause 12, 13, or 15, matching the access type)
or the first store to a PTE with D=0 (store/AMO page fault,
cause=15). The kernel fault handler must set the bit and retry.
- Detection: at boot, attempt an access to a page with A=0.
If no trap occurs, hardware A/D is available. Cache result
in `CpuFeatureTable::riscv_hw_ad_bits`.
- Software A/D path (hot — page fault handler):
1. Atomically OR the bits into the PTE via `amoor.d`
(avoids lost updates from concurrent faults on other
harts): set A (bit 6), and D (bit 7) if the fault was
a store.
2. `sfence.vma` for the faulting VA to update the TLB.
3. Return from trap — instruction retries successfully.
- This is transparent to the VMM layer: the arch page fault
handler ([Section 4.8](04-memory.md#virtual-memory-manager)) calls
`arch::current::mm::handle_ad_fault()` before checking for
actual permission violations.
Kernel VA layout (Sv48 higher-half):
- Higher-half base: 0xFFFF_FFC0_0000_0000
(the Sv48 canonical upper half begins at
0xFFFF_8000_0000_0000; UmkaOS uses only its top 256 GB,
0xFFFF_FFC0_0000_0000 – 0xFFFF_FFFF_FFFF_FFFF, which is
exactly the Sv39 upper half, keeping the kernel VA layout
portable to Sv39-only hardware)
- Kernel direct-map: 0xFFFF_FFC0_8000_0000
Maps physical 0x8000_0000+ (DRAM start on QEMU virt).
Kernel .text at 0xFFFF_FFC0_8020_0000 (phys 0x80200000).
- EVOLVABLE_VIRT_BASE: 0xFFFF_FFC0_8040_0000
4 MB above kernel base, 2 MB aligned. See
[Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
Setup sequence:
1. Allocate root L3 page table (4 KB, zero-initialized).
2. Build identity map: map physical RAM 1:1 using 1 GiB or
2 MiB megapages where alignment permits, 4 KB pages for
the remainder. Set V+R+W+X+G+A+D on all entries.
3. Build higher-half map: map kernel at
0xFFFF_FFC0_8020_0000 → physical 0x80200000. Kernel .text
is R+X (no W), .rodata is R (no W/X), .data/.bss is R+W.
4. Write root table PPN to satp:
satp = (MODE << 60) | (ASID << 44) | (root_ppn)
MODE = 9 (Sv48), ASID = 0 (boot), root_ppn = root >> 12.
5. Execute sfence.vma (with rs1=x0, rs2=x0 — flush all
entries, all ASIDs) to synchronize TLB with new satp.
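Step 4's satp composition can be checked with a one-line helper; the root-table address in the test is an arbitrary example, not a fixed kernel address:

```rust
/// satp encoding: MODE in bits [63:60], ASID in [59:44],
/// root-table PPN in [43:0]. Illustrative helper.
const SATP_MODE_SV48: u64 = 9;

fn satp_value(mode: u64, asid: u64, root_pa: u64) -> u64 {
    (mode << 60) | (asid << 44) | (root_pa >> 12)
}
```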
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: PLIC Initialization
Read PLIC base address from DTB (typically found in the
/soc/plic@... or /soc/interrupt-controller@... node with
compatible = "sifive,plic-1.0.0" or "riscv,plic0").
QEMU virt default: 0x0C000000.
PLIC register map:
- Priority registers: base + 4 × source_id
(4 bytes each, value 0–7; 0 = never interrupt,
1 = lowest, 7 = highest). Source IDs start at 1
(source 0 is reserved/no-interrupt).
On QEMU virt: source 10 = UART (virtio starts at 1).
- Pending bits: base + 0x001000
(1 bit per source, read-only. Bit N = source N pending.
Organized as 32-bit words: word at offset 0x1000 covers
sources 0–31, word at 0x1004 covers sources 32–63, etc.)
- Enable bits: base + 0x002000 + 0x80 × context
(1 bit per source per context. Bit N = source N enabled
for this context. Same 32-bit word layout as pending.)
- Priority threshold: base + 0x200000 + 0x1000 × context
(4 bytes. Interrupts with priority ≤ threshold are masked.
Set to 0 to accept all priorities > 0.)
- Claim/complete: base + 0x200004 + 0x1000 × context
(4 bytes. Read = claim highest-priority pending interrupt
for this context, returns source ID (0 if none pending).
Write source ID = complete interrupt processing for that
source, allowing it to pend again.)
Context calculation:
- Each hart has two contexts: M-mode and S-mode.
- context = 2 × hartid + 1 (for S-mode on standard PLICs).
- context = 2 × hartid + 0 (for M-mode, used by OpenSBI).
- UmkaOS only configures S-mode contexts. M-mode contexts
are managed by OpenSBI.
- On some non-standard PLICs (e.g., T-Head C906), the
context numbering may differ. UmkaOS reads the
interrupts-extended property from the DTB PLIC node to
determine the context-to-hart mapping.
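The register-offset arithmetic above can be collected into a small Rust sketch (base 0x0C00_0000 is the QEMU virt default from the text; helper names are illustrative):

```rust
/// PLIC register offsets from the MMIO base, per the map above.
const PLIC_BASE: usize = 0x0C00_0000; // QEMU virt default

/// S-mode context for a hart on standard two-context PLICs.
fn s_mode_context(hartid: usize) -> usize {
    2 * hartid + 1
}

/// Priority register for a source (4 bytes each, source 0 reserved).
fn priority_reg(source: usize) -> usize {
    PLIC_BASE + 4 * source
}

/// 32-bit enable word containing `source`'s bit, for a context.
fn enable_word(context: usize, source: usize) -> usize {
    PLIC_BASE + 0x2000 + 0x80 * context + 4 * (source / 32)
}

/// Claim/complete register for a context.
fn claim_complete_reg(context: usize) -> usize {
    PLIC_BASE + 0x200004 + 0x1000 * context
}
```

For hart 0's S-mode context (context 1) and the UART source (10 on QEMU virt), these formulas give the addresses the BSP initialization sequence below writes.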
BSP initialization sequence:
1. Set priority for each used source to > 0 (e.g., 1).
2. Set S-mode priority threshold to 0 for hart 0.
3. Enable UART source in the S-mode context enable bits.
4. Enable supervisor external interrupts: set sie.SEIE (bit 9),
   e.g., li t0, 1 << 9 then csrs sie, t0. (The immediate form
   csrsi encodes only 5-bit immediates, so bit 9 requires a register.)
PLIC handles external interrupts only; timer interrupts go
through SBI (Phase 9), and software interrupts (IPI) are
sent via SBI sbi_send_ipi() (EID=0x735049 "sPI", FID=0)
and delivered through sie.SSIE (bit 1).
Phase 9: SBI Timer
Use SBI ecall to program the timer:
- Read current time: csrr time (or rdtime pseudo-instruction)
- Set next deadline: sbi_set_timer(time + interval)
(SBI EID=0x54494D45 "TIME", FID=0)
- Enable timer interrupt: set sie.STIE (bit 5)
Timer fires supervisor timer interrupt (scause = 5) →
clear by calling sbi_set_timer with next deadline.
Enable interrupts: csrsi sstatus, 0x2
**SBI timer error handling:** If `sbi_set_timer()` returns
`SBI_ERR_FAILED` (-1) or `SBI_ERR_NOT_SUPPORTED` (-2), log a
kernel warning and fall back to polling `rdtime` in a tight loop
(degraded mode, no timer interrupt). If `SBI_ERR_INVALID_PARAM`
(-3) is returned (deadline in the past), immediately re-read
`rdtime` and retry with `rdtime + interval`. After 3 consecutive
failures, panic with "SBI timer permanently unavailable".
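The fallback policy above can be sketched as a plain function; the `SbiError` enum and closure-injected `set_timer`/`rdtime` are illustrative (the real kernel issues the SBI call via `ecall`):

```rust
/// Illustrative subset of SBI error codes (spec values -1, -2, -3).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SbiError {
    Failed,       // SBI_ERR_FAILED (-1)
    NotSupported, // SBI_ERR_NOT_SUPPORTED (-2)
    InvalidParam, // SBI_ERR_INVALID_PARAM (-3)
}

/// Arm the timer per the policy above: retry a stale deadline with a
/// fresh rdtime, give up (caller panics) after 3 consecutive hard failures.
fn arm_timer(
    mut set_timer: impl FnMut(u64) -> Result<(), SbiError>,
    mut rdtime: impl FnMut() -> u64,
    interval: u64,
) -> Result<(), ()> {
    let mut failures = 0;
    let mut deadline = rdtime() + interval;
    loop {
        match set_timer(deadline) {
            Ok(()) => return Ok(()),
            Err(SbiError::InvalidParam) => {
                // Deadline already in the past: re-read time and retry.
                deadline = rdtime() + interval;
            }
            Err(_) => {
                // FAILED / NOT_SUPPORTED: count toward the panic limit.
                failures += 1;
                if failures >= 3 {
                    return Err(()); // "SBI timer permanently unavailable"
                }
            }
        }
    }
}
```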
Phase 10: ecall / Trap-Vector Syscall Setup
Configure the trap vector and trap entry code to correctly dispatch
system calls arriving from U-mode via the ecall instruction.
stvec CSR configuration:
bits[1:0] = 0b00 (Direct mode): all traps — synchronous
exceptions, software interrupts, external interrupts — are
delivered to the single base address written to stvec. UmkaOS
uses Direct mode rather than Vectored mode (0b01) so that the
handler can perform a unified register-save before reading scause.
Trap entry sequence (all trap types, unified handler):
1. csrrw tp, sscratch, tp — swap tp with sscratch. If sscratch
was 0, we came from kernel mode (tp already holds per-CPU base);
if nonzero, we came from user mode (sscratch now holds user tp,
tp now holds per-CPU base). Load kernel stack pointer from the
per-hart CpuLocalBlock: `ld sp, CPULOCAL_STACK_TOP_OFFSET(tp)`.
2. Save all general-purpose registers (x1-x31, or the full
RISC-V integer register file) to the per-hart trap frame at
the top of the kernel stack.
3. Read scause to determine the trap source. The MSB (bit 63
on RV64) distinguishes interrupts from synchronous exceptions:
Synchronous exceptions (scause MSB = 0):
- scause = 8 (ecall from U-mode): syscall path.
- scause = 12 (instruction page fault): fault path.
- scause = 13 (load page fault): fault path.
- scause = 15 (store/AMO page fault): fault path.
- Other synchronous exceptions: fault/signal path.
Interrupts (scause MSB = 1, code = scause & ~(1<<63)):
- code = 1 (supervisor software interrupt): IPI path.
- code = 5 (supervisor timer interrupt): timer tick path.
- code = 9 (supervisor external interrupt): PLIC claim/complete.
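The scause decode above can be sketched as a match over the MSB and code (the `Trap` enum and `decode_scause` name are illustrative, not the kernel's types):

```rust
/// Illustrative trap classification mirroring the dispatch table above.
#[derive(Debug, PartialEq, Eq)]
enum Trap {
    Syscall,          // ecall from U-mode (scause = 8)
    PageFault(u64),   // instruction/load/store page fault (12, 13, 15)
    Ipi,              // supervisor software interrupt (code 1)
    TimerTick,        // supervisor timer interrupt (code 5)
    External,         // supervisor external interrupt, PLIC (code 9)
    Other(u64, bool), // (code, is_interrupt) — fault/signal path
}

fn decode_scause(scause: u64) -> Trap {
    let is_interrupt = scause >> 63 != 0;    // MSB on RV64
    let code = scause & !(1u64 << 63);
    match (is_interrupt, code) {
        (false, 8) => Trap::Syscall,
        (false, 12) | (false, 13) | (false, 15) => Trap::PageFault(code),
        (true, 1) => Trap::Ipi,
        (true, 5) => Trap::TimerTick,
        (true, 9) => Trap::External,
        _ => Trap::Other(code, is_interrupt),
    }
}
```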
ecall handler (scause == 8):
Syscall number: a7 (per Linux RISC-V ABI, also known as the
SBI-compatible register assignment).
Arguments: a0–a5 (up to six arguments).
Return convention: a0 carries the return value (negative values
encode -errno on error); a1 carries a second return word for
certain multi-value returns (e.g., pipe(2) returns two file
descriptors in a0 and a1).
After handling, sepc is advanced by 4 (skip past the ecall
instruction, which is always 4 bytes) before SRET.
Interrupt enable state:
sstatus.SIE (bit 1): supervisor interrupt enable, set to 1 after
trap entry saves state so nested interrupts are possible in
long-running handlers. Cleared on trap entry by hardware.
sstatus.SPIE (bit 5): previous SIE — saved and restored across
SRET to allow transparent interrupt-enable state on return.
sie.SEIE (bit 9): supervisor external interrupt enable (PLIC).
sie.SSIE (bit 1): supervisor software interrupt enable (IPI).
sie.STIE (bit 5): supervisor timer interrupt enable (already set
in Phase 9).
Verification test (executed during boot):
Issue an illegal instruction from S-mode to test the stvec
trap handler. Use a known 4-byte reserved encoding: with the C
extension enabled, the `unimp` pseudo-instruction may assemble
to the 2-byte `c.unimp` encoding, in which case sepc must be
advanced by 2 instead. The handler fires, reads scause to verify
cause == 2 (Illegal instruction), advances sepc past the faulting
instruction, and returns. This tests trap vector setup — not user-mode.
Note: ecall from S-mode (scause=9) is NOT suitable for this
test because OpenSBI does not delegate it (medeleg bit 9 is
unset). S-mode ecall traps to M-mode (OpenSBI), not stvec.
Illegal instruction exceptions (scause=2) ARE delegated by
default (medeleg bit 2 is set by OpenSBI).
User-mode is not available until the scheduler is initialized
in Phase 11.
Phase 11: Scheduler (canonical Phase 2.3)
Initialize EEVDF scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Phase 11a: Workqueue Framework (canonical Phase 2.7)
Initialize named kernel worker thread pools.
Cross-ref: [Section 3.11](03-concurrency.md#workqueue-deferred-work).
Phase 11b: RCU Init (canonical Phase 2.8)
Initialize RCU infrastructure (grace period tracking,
callback queues, per-CPU state).
Cross-ref: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths).
Phase 11c: LSM Framework Init (canonical Phase 2.9)
Initialize LSM framework and register compiled-in LSMs.
Cross-ref: [Section 9.8](09-security.md#linux-security-module-framework).
2.7.1.1 SMP Bringup — RISC-V 64¶
Secondary harts are brought online via the SBI HSM (Hart State Management)
extension (Extension ID: 0x48534D = ASCII "HSM"). The boot hart calls
sbi_hart_start(hartid, start_addr, opaque) for each secondary. SBI HSM
functions used by UmkaOS:
- `sbi_hart_start(hartid, start_addr, opaque)` (FID 0): Bring an offline hart online. The hart begins execution at `start_addr` (physical address) in S-mode with `a0 = hartid` and `a1 = opaque`.
- `sbi_hart_stop()` (FID 1): Park the calling hart (CPU offline).
- `sbi_hart_get_status(hartid)` (FID 2): Query hart state (0=STARTED, 1=STOPPED, 2=START_PENDING, 3=STOP_PENDING, 4=SUSPENDED, 5=SUSPEND_PENDING, 6=RESUME_PENDING). UmkaOS polls this after `sbi_hart_start` with a 1-second timeout.
- Hart discovery: Enumerate `/cpus` DTB nodes, cross-reference with SBI HSM status to filter permanently disabled harts.
Secondary hart entry requirements:
Each secondary hart starts at start_addr in S-mode with:
- MMU off (satp = 0, no address translation)
- All S-mode CSRs at reset defaults (stvec = 0, sscratch = 0,
sstatus.SIE = 0, sie = 0)
- a0 = hartid, a1 = opaque (per-hart data pointer from BSP)
Secondary hart initialization sequence (arch/riscv64/trampoline.S):
- Load per-hart data pointer from `a1` (contains stack top address, per-hart CpuLocal base, and pre-built page table root PPN).
- Set `satp` = (MODE=Sv48 << 60) | (ASID=0 << 44) | kernel_root_ppn. Execute `sfence.vma` to flush stale TLB state. The AP uses the BSP's page tables — it does NOT build new page tables. The BSP has already built the kernel's higher-half map and identity map during Phase 6.
- Write UmkaOS trap handler address to `stvec` (Direct mode, bits[1:0]=0).
- Load per-hart CpuLocal base into `tp`: `mv tp, <percpu_base from opaque>`. Set `sscratch` to 0 (kernel mode indicator, matching BSP convention).
- Enable supervisor interrupts: `csrs sstatus, (1 << 1)` (set SIE).
- Initialize per-hart PLIC context:
  - Set priority threshold to 0 for context = 2 × hartid + 1 (S-mode).
  - Enable relevant interrupt sources in the per-hart enable register.
- Program per-hart SBI timer: `sbi_set_timer(rdtime + interval)`. Enable `sie.STIE` (bit 5).
- Call `smp_secondary_init(hartid)`, which initializes per-hart CpuLocal, slab per-CPU magazines, and joins the scheduler.
APs must NOT re-run: DTB parse (Phase 3), phys::init() (Phase 4), or buddy_init() (Phase 5). These are BSP-only, run-once operations. The AP uses the BSP's already-initialized physical memory allocator, buddy allocator, and page tables.
SMP fan-out scaling:
For small hart counts (up to 16 harts), sequential sbi_hart_start calls
are adequate — total bringup time is bounded by 16 × 1 second timeout =
16 seconds worst case (typical: 16 × ~5 ms = ~80 ms). For larger systems
(RISC-V server SoCs with 32+ harts), the binary fan-out tree pattern is
applied: hart i (by tree index) starts harts 2i+1 and 2i+2 via SBI HSM.
Each hart calls sbi_hart_start for its two children immediately after
completing its own initialization. This bounds total bringup to
O(log₂ N) phases. The same SmpBringupState structure used by x86-64
(see Section 2.3) coordinates
the fan-out, with CpuMask indexed by tree index rather than LAPIC ID.
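The fan-out arithmetic can be sketched as follows (illustrative helper names; the real coordination lives in SmpBringupState):

```rust
/// Children of tree index i in the binary fan-out: 2i+1 and 2i+2,
/// clipped to the total hart count n.
fn children(i: usize, n: usize) -> impl Iterator<Item = usize> {
    [2 * i + 1, 2 * i + 2].into_iter().filter(move |&c| c < n)
}

/// Number of fan-out phases for n harts: the depth of the binary
/// tree, floor(log2(n)) + 1.
fn phases(n: usize) -> u32 {
    usize::BITS - n.leading_zeros()
}
```

Each hart, after finishing its own init, calls `sbi_hart_start` for every index yielded by `children`; 16 harts complete in 5 phases, 64 harts in 7.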
2.8 Device Tree and Platform Discovery¶
2.8.1 Device Tree Blob Parsing¶
The Device Tree Blob (DTB) is the memory map and hardware description format shared by AArch64, ARMv7, RISC-V 64, PPC32, PPC64LE, and LoongArch64. It serves the same role as the Multiboot1 info structure on x86 (Section 2.9), providing the kernel with memory layout and device addresses at boot.
DTB format (Flattened Device Tree / FDT):
Offset Field Size Description
0x00 magic u32 0xD00DFEED (big-endian)
0x04 totalsize u32 Total blob size in bytes
0x08 off_dt_struct u32 Offset to structure block
0x0C off_dt_strings u32 Offset to strings block
0x10 off_mem_rsvmap u32 Offset to memory reservation map
0x14 version u32 DTB version (17)
0x18 last_comp_ver u32 Last compatible version (16)
0x1C boot_cpuid_phys u32 Physical ID of boot CPU
0x20 size_dt_strings u32 Size of strings block
0x24 size_dt_struct u32 Size of structure block
All multi-byte fields are big-endian. The structure block contains a
flattened tree of nodes and properties encoded as tokens: FDT_BEGIN_NODE
(0x01), FDT_END_NODE (0x02), FDT_PROP (0x03), FDT_NOP (0x04),
FDT_END (0x09).
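A minimal header check against the field table above might look like this (illustrative names; the real parser is in umka-kernel/src/boot/dtb.rs):

```rust
/// Read a big-endian u32 at `off`, bounds-checked.
fn be32(buf: &[u8], off: usize) -> Option<u32> {
    buf.get(off..off + 4)
        .map(|b| u32::from_be_bytes(b.try_into().unwrap()))
}

/// Validate the FDT header; on success return (totalsize, off_dt_struct).
fn validate_fdt_header(blob: &[u8]) -> Option<(u32, u32)> {
    if be32(blob, 0x00)? != 0xD00D_FEED {
        return None; // bad magic
    }
    if be32(blob, 0x14)? < 16 {
        return None; // version must be >= 16
    }
    Some((be32(blob, 0x04)?, be32(blob, 0x08)?))
}
```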
Minimal parser (umka-kernel/src/boot/dtb.rs):
The kernel implements a minimal, no-alloc DTB parser that walks the structure block once and extracts only what's needed for boot:
- Validate header: check magic (`0xD00DFEED`), version ≥ 16
- `/memory` nodes → collect `reg` property values as `MemoryRegion` array (base + size pairs), passed to `phys::init()`
- `/chosen` node → extract `bootargs` property (kernel command line)
- Interrupt controller → extract `reg` property from the node with the `interrupt-controller` property (GIC base for ARM, PLIC base for RISC-V)
- Timer → extract IRQ numbers from the `/timer` node `interrupts` property
- UART → extract `reg` property from `/serial` or the `stdout-path` device
The parser operates on raw byte slices with explicit big-endian reads
(e.g., u32::from_be_bytes(buf[offset..offset+4].try_into().unwrap())) and
requires no heap allocation. Memory regions are collected into a fixed
64-entry array, matching the Multiboot1 parser's approach, since the parser
runs during early boot before the heap allocator is available. Device tree
nodes beyond this limit are parsed in a second pass after heap
initialization.
Shared code: The DTB parser in umka-kernel/src/boot/dtb.rs is used by
all six DTB-based non-x86 architectures (AArch64, ARMv7, RISC-V 64, PPC32,
PPC64LE, LoongArch64). s390x uses SCLP for memory discovery instead of DTB
(Section 2.12). Each architecture's boot.rs calls
dtb::parse(dtb_addr) and passes the resulting memory regions to
phys::init().
2.8.2 Cross-Architecture Comparison¶
The following table summarizes which boot components are architecture-specific and which are shared across all eight architectures:
| Phase | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE | s390x | LoongArch64 |
|---|---|---|---|---|---|---|---|---|
| Exception vectors | IDT (256 entries) | VBAR_EL1 (16 vectors) | VBAR CP15 (8 vectors) | stvec (Direct mode) | IVPR+IVORn | LPCR vector table | PSW-swap lowcore pairs | CSR.EENTRY + CSR.TLBRENTRY |
| Memory map source | Multiboot1 info | DTB `/memory` | DTB `/memory` | DTB `/memory` | DTB `/memory` | DTB `/memory` | SCLP Read SCP Info | DTB `/memory` |
| Page table format | 4-level PML4 (4 KB) | 4-level 4 KB granule | Short-desc 2-level (1 MB sections) | Sv48 4-level | 2-level (4 KB pages) | Radix tree (POWER9+) or HPT | DAT 3-5 level (4 KB) | 4-level (4 KB, software/HW PTW) |
| IRQ controller | 8259 PIC (I/O ports) | GIC v2/v3 (MMIO, detected at runtime) | GICv2 (MMIO) | PLIC (MMIO) | OpenPIC (MMIO) | XIVE (MMIO) | None (PSW-swap + ISC masking) | EIOINTC (IOCSR) |
| Timer | PIT (I/O port 0x40) | Generic timer (system regs) | SP804 or generic timer | SBI ecall | Decrementer (DEC SPR) | Decrementer (DEC SPR) | CPU Timer (SPT) | Stable Counter (CSR.TCFG) |
| Boot assembly | NASM (32→64 transition) | GNU as (EL1 entry) | GNU as (SVC entry) | GNU as (S-mode entry) | GNU as (supervisor entry) | GNU as (supervisor entry) | GNU as (64-bit PSW entry) | GNU as (PLV0 entry) |
| BSS clearing | entry.asm (rep stosd) | entry.S (str xzr loop) | entry.S (str r6 loop) | entry.S (sd zero loop) | entry.S (stw loop) | entry.S (std loop) | entry.S (xc 256-byte blocks) | entry.S (st.d zero loop) |
| Phys allocator | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap |
| Heap allocator | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list |
| Capability system | shared | shared | shared | shared | shared | shared | shared | shared |
| Scheduler | shared | shared | shared | shared | shared | shared | shared | shared |
2.9 Boot Memory Management¶
2.9.1 Multiboot1 Memory Map Parsing¶
boot/multiboot1.rs parses the Multiboot1 info structure (passed by GRUB/QEMU)
to extract the physical memory map:
- Read info structure flags to determine which fields are present
- If `FLAG_MEM` set: log basic memory sizes (lower/upper KB)
- If `FLAG_CMDLINE` set: log the kernel command line string
- If `FLAG_MMAP` set: iterate the memory map entries:
  - Each entry has: `base_addr` (u64), `length` (u64), `type` (u32)
  - Types: available (1), reserved (2), ACPI reclaimable (3), NVS (4), defective (5)
  - Unaligned reads used (`read_unaligned`) — Multiboot mmap entries may not be aligned
- Collect up to 64 `MemoryRegion` structs, pass to `phys::init()`
phys::init() processes the regions:
- Phase 1: Mark all available regions as free (page-aligned)
- Phase 2: Reserve first 1 MB (BIOS, VGA, legacy)
- Phase 3: Reserve kernel image (1 MB to __kernel_end)
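Phases 2–3 amount to clipping free regions against reserved ranges; a sketch (the `clip` helper is illustrative, not the kernel's API):

```rust
/// Clip a free [start, end) region against a reserved [start, end) range,
/// yielding the surviving free sub-regions (0, 1, or 2 of them).
fn clip(free: (u64, u64), rsv: (u64, u64)) -> Vec<(u64, u64)> {
    let (fs, fe) = free;
    let (rs, re) = rsv;
    let mut out = Vec::new();
    if rs > fs {
        out.push((fs, rs.min(fe))); // part below the reservation
    }
    if re < fe {
        out.push((re.max(fs), fe)); // part above the reservation
    }
    out
}
```

Reserving the first 1 MB from a free region starting at 0 leaves only the tail above `0x10_0000`; a reservation that covers the whole region leaves nothing.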
2.9.2 Boot Allocator Design¶
The boot allocator (BootAlloc) is the physical-memory allocator used during
early boot, before the main buddy allocator (Section 4.2)
is initialized. Its design must satisfy two constraints in tension:
- It needs some memory before it can read the firmware memory map.
- It must not impose a hardcoded limit on total usable RAM.
These constraints are resolved with a two-phase design.
Phase 1 — Bootstrap (BSS pre-allocator)
Before the firmware memory map is parsed, a tiny fixed-size buffer resident in
.bss provides just enough memory to parse the firmware map and construct the
BootAlloc region table. This buffer is declared as a global static array:
/// Pre-allocator scratch buffer in .bss.
/// Used ONLY to construct the BootAlloc region table.
/// This is NOT a limit on usable memory — it is a staging area for parsing
/// the firmware map before BootAlloc is initialized.
static mut BOOTSTRAP_BUF: [u8; 64 * 1024] = [0u8; 64 * 1024];
static mut BOOTSTRAP_OFFSET: usize = 0;
This 64 KB BSS bootstrap buffer covers the worst-case cost of parsing firmware
memory map data structures. It is consumed once at boot and is never used again
after BootAlloc::init_from_* completes.
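The bootstrap pre-allocator is a bump allocator over that buffer; the core of it is align-up arithmetic (a sketch — the real code advances BOOTSTRAP_OFFSET in place):

```rust
/// Bump-allocate `size` bytes at `align` alignment from a scratch buffer
/// of capacity `cap`, advancing `offset`. Returns the allocation's offset
/// into the buffer, or None if the scratch area is exhausted.
/// `align` must be a power of two.
fn bump_alloc(offset: &mut usize, cap: usize, size: usize, align: usize) -> Option<usize> {
    let start = (*offset + align - 1) & !(align - 1); // round up to `align`
    let end = start.checked_add(size)?;
    if end > cap {
        return None; // 64 KB scratch exhausted — boot bug, not OOM
    }
    *offset = end;
    Some(start)
}
```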
Phase 2 — BootAlloc over all firmware-reported RAM
After the firmware map is parsed, BootAlloc is initialized with all
conventional memory regions reported by the firmware. The BootAlloc struct
is defined canonically in Section 4.1.
This section describes the boot-protocol-specific initialization entry points
that populate the allocator's memory ranges from firmware tables.
Initialization entry points:
impl BootAlloc {
/// Initialize from a Multiboot1 mmap (x86-64).
/// Filters for type == 1 (available), page-aligns each region, skips
/// the first 1 MB (BIOS/legacy) and the kernel image.
pub fn init_from_multiboot1(mmap: &Multiboot1Mmap) -> Self;
/// Initialize from a UEFI memory map (future x86-64 UEFI path).
/// Filters for EfiConventionalMemory descriptor type.
pub fn init_from_uefi(map: &UefiMemoryMap) -> Self;
/// Initialize from Device Tree `/memory` nodes (all non-x86 architectures).
/// Uses the regions collected by the DTB parser in `boot/dtb.rs`.
pub fn init_from_dtb(regions: &[MemRegion]) -> Self;
}
The alloc(), reserve(), and hand_off_to_buddy() methods, invariants,
and initialization sequence are specified in
Section 4.1.
2.9.2.1 UEFI Memory Map Types (UEFI 2.10)¶
The UefiMemoryMap wrapper and supporting types define the full UEFI 2.10
memory descriptor interface. These types are used by BootAlloc::init_from_uefi()
on x86-64 UEFI boot paths and on AArch64 systems that boot via UEFI (server
platforms, some embedded boards with Tianocore/EDK2).
/// Wrapper over the UEFI memory map obtained from `GetMemoryMap()`.
/// The map is a packed array of `EfiMemoryDescriptor` entries, but the
/// stride between entries is `descriptor_size` (not necessarily
/// `size_of::<EfiMemoryDescriptor>()`), because the firmware may
/// append vendor-specific data after the standard fields.
pub struct UefiMemoryMap {
/// Pointer to the first descriptor in the firmware-provided buffer.
base: *const u8,
/// Total size of the memory map buffer in bytes.
map_size: usize,
/// Stride between adjacent descriptors in bytes. Always >=
/// `size_of::<EfiMemoryDescriptor>()` (40 bytes). Firmware may
/// return a larger value if it appends vendor-specific fields.
descriptor_size: usize,
/// Descriptor version (must be 1 for UEFI 2.10).
descriptor_version: u32,
}
impl UefiMemoryMap {
/// Number of descriptors in the map.
pub fn len(&self) -> usize {
self.map_size / self.descriptor_size
}
/// Iterator over `EfiMemoryDescriptor` entries. Each entry is at
/// offset `i * descriptor_size` from `base`. Only the first 40
/// bytes (the standard fields) are interpreted; any trailing
/// vendor data is ignored.
pub fn iter(&self) -> UefiMemoryMapIter<'_>;
}
/// A single UEFI memory descriptor (UEFI 2.10 Table 7.6).
/// Matches the C struct layout exactly — `#[repr(C)]` with the 4-byte
/// padding hole after `type_` that the UEFI spec mandates (the firmware
/// writes 8-byte-aligned `phys_start`).
#[repr(C)]
pub struct EfiMemoryDescriptor {
/// Memory region type (see `EfiMemoryType` enum).
pub type_: u32,
    /// Explicit padding for 8-byte alignment of `phys_start` (mirrors the
    /// hole a C compiler inserts after the 32-bit `Type` field).
    pub _pad: u32,
/// Physical start address (page-aligned, 4KB granularity).
pub phys_start: u64,
/// Virtual address mapping (set by `SetVirtualAddressMap()`; zero
/// at `GetMemoryMap()` time — UmkaOS does not use UEFI virtual mode).
pub virt_start: u64,
/// Number of 4KB pages in this region.
pub num_pages: u64,
/// Cacheability, protection, and runtime attributes (see
/// `EfiMemoryAttribute` bitflags).
pub attribute: u64,
}
const_assert!(size_of::<EfiMemoryDescriptor>() == 40);
/// UEFI 2.10 memory type codes (Table 7.5).
/// Values 0-15 are defined by the specification. OEM and OS ranges are
/// reserved above `0x7000_0000`.
#[repr(u32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EfiMemoryType {
/// Not usable.
ReservedMemoryType = 0,
/// UEFI application (loader) code — reclaimable after ExitBootServices.
LoaderCode = 1,
/// UEFI application (loader) data — reclaimable after ExitBootServices.
LoaderData = 2,
/// Boot Services code — reclaimable after ExitBootServices.
BootServicesCode = 3,
/// Boot Services data — reclaimable after ExitBootServices.
BootServicesData = 4,
/// Runtime Services code — must be preserved (UEFI runtime calls).
RuntimeServicesCode = 5,
/// Runtime Services data — must be preserved (UEFI runtime variables).
RuntimeServicesData = 6,
/// **Conventional (free) memory** — available for OS use. This is
/// the type `BootAlloc::init_from_uefi()` collects.
ConventionalMemory = 7,
/// Memory with uncorrectable errors — must not be used.
UnusableMemory = 8,
/// ACPI tables — reclaimable after the OS copies the tables.
AcpiReclaimMemory = 9,
/// ACPI firmware NVS — must be preserved for ACPI runtime.
AcpiMemoryNvs = 10,
/// Memory-mapped I/O — device MMIO ranges.
MemoryMappedIo = 11,
/// Memory-mapped I/O port space (x86-specific legacy).
MemoryMappedIoPortSpace = 12,
/// Processor-specific code (Itanium PAL; unused on x86/ARM).
PalCode = 13,
/// Persistent memory (NVDIMM) — usable as conventional memory but
/// has persistence semantics. UmkaOS maps these as DAX regions
/// ([Section 15.16](15-storage.md#persistent-memory)).
PersistentMemory = 14,
/// Unaccepted memory (UEFI 2.10, TDX/SEV-SNP confidential VMs).
/// Must be accepted via `AcceptMemory()` before use. UmkaOS
/// accepts lazily on first access (Phase 3+).
UnacceptedMemoryType = 15,
/// Sentinel — not a valid type. Values above this (up to
/// `0x6FFF_FFFF`) are reserved by the UEFI spec.
MaxMemoryType = 16,
}
bitflags::bitflags! {
/// UEFI 2.10 memory attribute flags (Table 7.6).
/// Describe cacheability, protection, and runtime properties.
pub struct EfiMemoryAttribute: u64 {
// --- Cacheability attributes ---
/// Uncacheable (UC). Device memory — no caching permitted.
const UC = 0x0000_0000_0000_0001;
/// Write-Combining (WC). Suitable for framebuffers.
const WC = 0x0000_0000_0000_0002;
/// Write-Through (WT). Reads may be cached; writes go through.
const WT = 0x0000_0000_0000_0004;
/// Write-Back (WB). Normal cacheable memory.
const WB = 0x0000_0000_0000_0008;
/// Uncacheable, exported (UCE). Supports fetch-and-add
/// semaphore mechanism (legacy Itanium; rarely used).
const UCE = 0x0000_0000_0000_0010;
// --- Physical memory protection attributes ---
/// Write-Protected (WP). Writes cause a fault.
const WP = 0x0000_0000_0000_1000;
/// Read-Protected (RP). Reads cause a fault.
const RP = 0x0000_0000_0000_2000;
/// Execute-Protected (XP). Instruction fetch causes a fault.
const XP = 0x0000_0000_0000_4000;
// --- Additional attributes ---
/// Non-Volatile (NV). Persistent memory (NVDIMM).
const NV = 0x0000_0000_0000_8000;
/// More Reliable. Memory has higher reliability than normal
/// (e.g., ECC with additional redundancy, mirrored DIMM).
const MORE_RELIABLE = 0x0000_0000_0001_0000;
/// Read-Only (RO). Memory is read-only (firmware code regions).
const RO = 0x0000_0000_0002_0000;
/// Specific Purpose (SP). Memory designated for a specific
/// use (e.g., CXL device memory, HBM).
const SP = 0x0000_0000_0004_0000;
/// CPU Crypto. Memory supports CPU hardware encryption
/// (e.g., AMD SEV, Intel TME/MKTME).
const CPU_CRYPTO = 0x0000_0000_0008_0000;
/// Hot-Pluggable. Memory may be hot-removed at runtime.
/// (Bit 20; formerly EFI_MEMORY_ISA_VALID in UEFI <2.6.)
const HOT_PLUGGABLE = 0x0000_0000_0010_0000;
// --- Runtime attribute (bit 63) ---
/// Runtime. Memory region requires runtime mapping via
/// `SetVirtualAddressMap()`. UmkaOS does not use UEFI
/// virtual mode — runtime regions are identity-mapped and
/// protected via page table permissions.
const RUNTIME = 0x8000_0000_0000_0000;
}
}
BootAlloc::init_from_uefi() iterates the map, collecting regions where
type_ == EfiMemoryType::ConventionalMemory (free RAM). Regions of type
LoaderCode, LoaderData, BootServicesCode, and BootServicesData are
also reclaimable after ExitBootServices() completes — these are added in
a second pass after the kernel has relocated its own data out of loader
memory. AcpiReclaimMemory regions are reserved until the ACPI subsystem
has copied the tables (Section 2.4), then
reclaimed. All other types are marked reserved in the boot allocator's
free map.
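The stride-aware scan that init_from_uefi() performs can be sketched at the byte-buffer level (a simplified version; field offsets follow the EfiMemoryDescriptor layout above, little-endian as on x86-64 UEFI):

```rust
/// EfiMemoryType::ConventionalMemory discriminant (UEFI 2.10 Table 7.5).
const CONVENTIONAL_MEMORY: u32 = 7;

/// Sum the pages of all ConventionalMemory descriptors. Descriptors are
/// read at multiples of `descriptor_size` — never size_of::<Desc>() —
/// because firmware may append vendor data after the 40 standard bytes.
fn free_pages(map: &[u8], descriptor_size: usize) -> u64 {
    let mut total = 0;
    for desc in map.chunks_exact(descriptor_size) {
        let type_ = u32::from_le_bytes(desc[0..4].try_into().unwrap());
        let num_pages = u64::from_le_bytes(desc[24..32].try_into().unwrap());
        if type_ == CONVENTIONAL_MEMORY {
            total += num_pages;
        }
    }
    total
}
```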
Handoff to buddy allocator:
After early boot data structures are allocated, BootAlloc hands off all
remaining free ranges to the buddy allocator via hand_off_to_buddy().
The full handoff sequence is specified in
Section 4.1 (steps 1-9 of the
initialization sequence). After hand_off_to_buddy() completes, the
BootAlloc is marked as retired and further alloc() calls panic.
2.10 PPC32 Boot Sequence¶
PPC32 targets NXP/Freescale e500 embedded PowerPC processors using QEMU's ppce500
machine. The e500 implements the Power ISA Book E architecture, which differs
fundamentally from the Book S (server) architecture used by POWER processors:
Book E has no hardware page table walker. Instead, TLB entries are managed entirely
by software via explicit tlbwe instructions and miss-handler exception vectors
(IVOR13/IVOR14). The firmware (U-Boot or QEMU direct boot) passes a DTB pointer
in r3.
Entry assembly (arch/ppc32/entry.S, GNU as syntax):
1. Firmware loads ELF and jumps to _start in supervisor mode
- r3 = DTB address (physical)
- r4 = 0 (reserved; QEMU ppce500 sets r4=0)
- r5 = 0 (reserved)
- r6 = EPAPR_MAGIC (0x45504150) — ePAPR boot convention identifier
- r7 = sizeof(initial TLB1 mapping) — size in bytes of the firmware-provided
TLB1 entry covering the kernel image (typically 64 MB or 256 MB)
2. _start:
a. Set up stack pointer (r1) from linker symbol
b. Clear BSS (.sbss + .bss)
c. Set up initial exception vectors (IVPR + IVORn)
d. Call umka_main(0, r3) [magic=0, info=DTB address]
QEMU-provided initial TLB1 entry: When QEMU boots the ppce500 machine with
-kernel, the firmware stub creates a single TLB1 entry (entry 0) before jumping
to _start. This entry identity-maps the kernel load region:
| MAS Register | Value | Meaning |
|---|---|---|
| MAS0 | `0x10000000` | TLBSEL=1 (TLB1), ESEL=0 (entry 0) |
| MAS1 | `0xC0000n00` | V=1, IPROT=1, TID=0 (global), TS=0 (IS=0 space), TSIZE=n (size from r7; typically 0x09=256MB or 0x07=64MB) |
| MAS2 | `kernel_phys \| 0x00` | EPN=kernel physical base, W=0, I=0, M=0, G=0, E=0 (cacheable, big-endian) |
| MAS3 | `kernel_phys \| 0x3F` | RPN=kernel physical base, UX=1, SX=1, UW=1, SW=1, UR=1, SR=1 |
The TSIZE field in MAS1 encodes power-of-4 page sizes: 0x01=4KB, 0x03=64KB,
0x05=1MB, 0x07=16MB, 0x09=256MB. The value of r7 at entry tells the kernel how
large this initial mapping is. The kernel must verify that the DTB physical address
falls within this initial TLB1 entry; if not, a second TLB1 entry must be created
for DTB access before parsing can begin.
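The TSIZE decoding (4^TSIZE KB) can be expressed directly:

```rust
/// Decode an e500 MAS1.TSIZE field into a mapping size in bytes.
/// Book E encodes power-of-4 page sizes: size = 4^tsize KB,
/// i.e. 1024 << (2 * tsize) bytes.
fn tsize_bytes(tsize: u32) -> u64 {
    1024u64 << (2 * tsize)
}
```

So TSIZE values 0x01, 0x03, 0x05, 0x07, 0x09 decode to 4 KB, 64 KB, 1 MB, 16 MB, and 256 MB respectively, matching the table above.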
The linker script (linker-ppc32.ld) places .text._start first at the kernel
load address. PPC32 uses big-endian byte order by default.
DTB byte order: The Device Tree Blob (DTB) format is defined as big-endian
by the Devicetree Specification (Section 5.2: "All data in the devicetree blob
is in big-endian format"). On PPC32, which is natively big-endian, no byte
swapping is needed when parsing DTB fields — u32 and u64 values from the
DTB can be read directly. The DTB parser must still use explicit big-endian
accessors (e.g., u32::from_be_bytes()) for correctness on all architectures,
but on PPC32 these are identity operations and compile to no additional
instructions.
Initialization phases (in umka_main(), sequential):
Canonical Phase Mapping:
| Canonical Phase | Description | Local Implementation |
|---|---|---|
| 0.1 | arch_early_init | Entry assembly (steps 1–2) + Phase 1 (IVPR+IVORn) + Phase 1a (errata) |
| 0.15 | early_log_init | Phase 1b: early_log_init() |
| 0.3 | parse_firmware_memmap | Phase 3: DTB /memory parse |
| 0.4 | boot_alloc_init | Phase 4: phys::init() from DTB regions |
| 0.5 | reserve_regions | Phase 4: reserve kernel image + DTB |
| 0.6 | numa_discover_topology | Phase 4a: DTB /memory nodes + ibm,associativity (if present) |
| 0.7 | cpulocal_bsp_init | Phase 4b: mtspr SPRG3, &CpuLocalBlock |
| 0.8a | evolvable_verify | Phase 4c: Evolvable signature verification (physical addresses, no MMU) |
| 0.2 | identity_map | Phase 4d: Book E TLB1 pinned entries |
| 0.8b | evolvable_map_and_init | Phase 4e: Evolvable virtual mapping at EVOLVABLE_VIRT_BASE + VTABLE_SLOTS[] population |
| 1.1 | buddy allocator | Phase 5: buddy init |
| 1.2 | slab allocator | Phase 5a: slab_init() |
| 2.1 | IRQ domain | Phase 8: OpenPIC init + IrqDomain setup |
| 2.2 | capability system | Phase 7: CapSpace init |
| 2.3 | scheduler | Phase 10: scheduler init |
| 2.7 | workqueue infra | Phase 10a: workqueue_init_early() |
| 2.8 | RCU | Phase 10b: rcu_init() |
| 2.9 | LSM framework | Phase 10c: lsm_init() |
| 3.1–3.3 | SMP bringup | Spin-table / ePAPR protocol (Phase 3 deferred) |
Phase 1: Exception Vectors (IVPR + IVORn)
Set IVPR to exception vector base address (aligned to 64 KB boundary
per Book E specification — IVPR[32:47] provides the upper 16 bits,
IVORn[48:59] provides the offset within the vector page).
Initialize IVOR0-IVOR15 for each exception type:
- IVOR0 (Critical input), IVOR1 (Machine check)
- IVOR2 (Data storage), IVOR3 (Instruction storage)
- IVOR4 (External input), IVOR5 (Alignment)
- IVOR6 (Program), IVOR7 (Floating-point unavailable)
- IVOR8 (System call), IVOR9 (Auxiliary processor unavailable)
- IVOR10 (Decrementer), IVOR11 (Fixed interval timer)
- IVOR12 (Watchdog), IVOR13 (Data TLB miss)
- IVOR14 (Instruction TLB miss), IVOR15 (Debug)
Note: entry.S sets up minimal exception vectors (critical input,
machine check, and a catch-all fault handler) before calling
umka_main(). Phase 1 replaces these with the full set of handlers.
The early vectors exist only to produce diagnostic output if the
kernel faults during the assembly preamble.
Phase 1a: e500 Errata Application
Read PVR (Processor Version Register) to identify the exact e500
core revision. Apply hardware errata workarounds that must be active
before any further initialization. See
[Section 2.18](#cpu-errata-and-mitigations) for the full errata table.
Required workarounds for e500 at early boot:
- E500_HANG (A-005125): Set SPR976 bits [40:41] = 0b10 to prevent
CCB arbiter deadlock on guarded load/PCI write collision. Must be
applied before any PCI/PCIe MMIO access. Affected: MPC8548 and
similar e500v1/v2 SoCs.
```
mfspr r3, 976 # Read SPR976
oris r3, r3, 0x0080 # Set bit 40 (big-endian bit numbering)
mtspr 976, r3 # Write back
isync
```
- E500_BTB (A-004466): Flush Branch Target Buffer by setting BBFI
bit in BUCSR (SPR 1013). This clears stale BTB entries left by
firmware. Also applied on every context switch (Spectre v2
mitigation on Book E). See [Section 2.18](#cpu-errata-and-mitigations).
```
mfspr r3, 1013 # Read BUCSR
oris r3, r3, 0x0200 # Set BBFI (bit 6, big-endian)
mtspr 1013, r3 # Write — initiates BTB flush
isync # Wait for completion
```
- E500_TLB1_FI (CPU-A001): Do NOT use TLB1 flash-invalidate
(MMUCSR0[TLB1_FI]) — it races with tlbivax on multi-core parts.
Instead, invalidate TLB1 entries individually via tlbivax when
remapping. This constraint affects Phase 6 TLB1 management.
Phase 1b: Early Log Ring Init (canonical Phase 0.15)
Call early_log_init() — sequencing checkpoint confirming the
BSS-resident EarlyLogRing is accessible. After this point,
early_log() emits messages to both the ring buffer and the
UART. See [Section 2.3](#boot-init-cross-arch--early-boot-log-ring).
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly).
Phase 3: DTB Parse
Parse the DTB passed in r3 (see [Section 2.8](#device-tree-and-platform-discovery)).
Extract from DTB:
- /memory regions (base address, size)
- /chosen/bootargs (kernel command line)
- /soc/open-pic compatible node → OpenPIC MMIO base address
- /soc/serial compatible node → UART MMIO base address (within CCSR)
- /cpus/timebase-frequency → decrementer tick rate (see Phase 9)
- /cpus/cpu@N enable-method → SMP bringup protocol
- /cpus/cpu@N/cpu-release-addr → spin-table release address (if present)
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Reserve kernel image,
DTB region, and any firmware-reserved regions from /reserved-memory.
Phase 4a: NUMA Topology Discovery (canonical Phase 0.6)
Parse DTB /memory nodes. On PPC32 embedded platforms with
`ibm,associativity` properties (rare on e500), build NUMA
distance matrix. Most PPC32 targets are single-node — set
node 0 covering all memory if no NUMA information found.
Cross-ref: [Section 4.11](04-memory.md#numa-topology-and-policy).
Phase 4b: CpuLocal BSP Init (canonical Phase 0.7)
Initialize CpuLocalBlock for the BSP and set the arch-specific
fast-access register:
mtspr SPRG3, &CPU_LOCAL_BLOCKS[0]
SPRG3 (SPR 259) is the PPC32 per-CPU data pointer. It is
readable from user mode on some implementations (for vDSO),
so the kernel uses only SPRG3 for CpuLocal (not SPRG0-2
which are reserved for exception handler scratch).
See [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--initialization-sequence).
Phase 4c: Evolvable Signature Verification (canonical Phase 0.8a)
Verify the Evolvable image signature using physical addresses
only (no MMU required). Nucleus LMS verifier (~2 KB) checks
ML-DSA-65 + Ed25519 hybrid signature against embedded public
key. On failure, panic.
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
**Invariant**: Phases 0.1–4c MUST NOT dispatch through
VTABLE_SLOTS[] or any replaceable policy vtable.
Phase 4d: Virtual Memory — TLB Setup (canonical Phase 0.2)
See Phase 6 below for the full Book E TLB architecture,
TLB1 pinned entry setup, and address space configuration.
Page table structures are allocated from BootAlloc.
Phase 4e: Evolvable Virtual Mapping (canonical Phase 0.8b)
Map Evolvable .text (RX) and .rodata (RO) pages at
EVOLVABLE_VIRT_BASE (0x0040_0000 on PPC32).
**Safety**: this low address is userspace-inaccessible
because PPC32 Book E uses TS-bit isolation: the kernel
runs in TS=0 address space, userspace runs in TS=1.
VA 0x0040_0000 in TS=0 is a separate mapping from
VA 0x0040_0000 in TS=1.
Allocate fresh RW pages for .data+.bss via BootAlloc.
Call evolvable_init() to populate VTABLE_SLOTS[].
**Requires TLB/MMU setup** (Phase 4d complete).
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
After Phase 4e, Evolvable vtable dispatch is permitted.
Phase 5: Kernel Heap
Initialize the buddy allocator with all available physical
memory discovered from the DTB memory regions (Phase 4).
The buddy allocator manages power-of-two blocks (order 0–10,
4 KB–4 MB). See [Section 4.2](04-memory.md#physical-memory-allocator).
Phase 5a: Slab Allocator (canonical Phase 1.2)
Initialize slab caches on top of the buddy allocator.
After this point, Box::new, Arc::new, and typed allocations
are available. Cross-ref: [Section 4.3](04-memory.md#slab-allocator).
Phase 6: Virtual Memory Detail — Book E TLB Architecture (runs at Phase 4d)
The e500 uses a software-managed TLB with two arrays and NO
hardware page table walker. This is fundamentally different from
Book S (POWER) and most other architectures. All TLB entries are
created explicitly by software via MAS register writes + tlbwe.
=== TLB Architecture Overview ===
TLB1: Fully-associative, 16 entries, variable page size (4 KB–256 MB).
Used for large, pinned mappings: kernel text/data, MMIO regions.
Entries persist until explicitly invalidated via tlbivax or
overwritten via tlbwe. Each entry can map a different page size.
Supports IPROT (Invalidate Protect) bit to prevent accidental
invalidation by tlbivax broadcast.
TLB0: 4-way set-associative, 256 entries (64 sets × 4 ways), fixed
4 KB page size. Used for fine-grained mappings: userspace pages,
dynamically-mapped kernel pages. Entries are filled by the
software TLB miss handler (IVOR13/IVOR14). Hardware selects
the replacement victim within a set (round-robin or LRU
depending on e500 revision).
=== MAS Register Encodings ===
TLB entries are programmed via five Machine Address Setup (MAS)
registers. The kernel writes MAS0-MAS3 and MAS7, then executes
`tlbwe` (TLB Write Entry) to commit.
MAS0 — TLB Select and Entry Select:
| Bits | Field | Description |
|---------|---------|-------------|
| 0 | — | Reserved |
| 1:3 | TLBSEL | TLB array select: 0b001=TLB1, 0b000=TLB0 |
| 4:15 | ESEL | Entry select (TLB1: 0-15; TLB0: way within set) |
| 16:27 | — | Reserved |
| 28:29 | NV | Next victim hint (TLB0 only, hardware-updated) |
| 30:31 | — | Reserved |
MAS1 — Descriptor Context and Configuration:
| Bits | Field | Description |
|---------|---------|-------------|
| 0 | V | Valid: 1=entry is valid, 0=entry is invalid |
| 1 | IPROT | Invalidate protect: 1=protected from tlbivax |
| 2:7 | — | Reserved |
| 8:15 | TID | Translation ID (address space / PID context, 0=global) |
| 16:19 | — | Reserved |
| 20 | TS | Translation Space: 0=IS=0/DS=0, 1=IS=1/DS=1 |
| 21:23 | — | Reserved |
| 24:27 | TSIZE | Page size: 0x01=4KB, 0x03=64KB, 0x05=1MB, 0x07=16MB, 0x09=256MB |
| 28:31 | — | Reserved |
MAS2 — Effective Page Number and Attributes:
| Bits | Field | Description |
|---------|---------|-------------|
| 0:19 | EPN | Effective Page Number (virtual address [0:19]) |
| 20:25 | — | Reserved |
| 26 | VLE | Variable-Length Encoding: 0=standard Book E |
| 27 | W | Write-through |
| 28 | I | Cache-inhibited (must be 1 for MMIO) |
| 29 | M | Memory coherence required (for SMP) |
| 30 | G | Guarded (prevents speculative access; must be 1 for MMIO) |
| 31 | E | Endianness: 0=big-endian, 1=little-endian |
MAS3 — Real Page Number and Access Permissions:
| Bits | Field | Description |
|---------|---------|-------------|
| 0:19 | RPN | Real Page Number (physical address [0:19]) |
| 20:25 | — | Reserved |
| 26 | U0 | User-defined attribute 0 |
| 27 | U1 | User-defined attribute 1 |
| 28 | U2 | User-defined attribute 2 |
| 29 | U3 | User-defined attribute 3 |
| 30 | UX | User execute |
| 31 | SX | Supervisor execute |
Note: e500v2 extends MAS3 with additional permission bits at
positions that overlap U0-U3 in the original e500v1 layout. On
e500v2 (QEMU ppce500 default), the full permission layout is:
| Bits (e500v2) | Field | Description |
|---|---|---|
| 0:19 | RPN | Real Page Number |
| 20:21 | — | Reserved |
| 22 | UX | User execute |
| 23 | SX | Supervisor execute |
| 24 | UW | User write |
| 25 | SW | Supervisor write |
| 26 | UR | User read |
| 27 | SR | Supervisor read |
| 28:31 | — | Reserved / user-defined |
MAS7 — Real Page Number Extension (upper bits):
| Bits | Field | Description |
|---------|---------|-------------|
| 0:27 | — | Reserved |
| 28:31 | RPN_HI | Physical address bits [32:35] (for >4GB phys on e500v2) |
On QEMU ppce500 with ≤4 GB RAM, MAS7 is always 0.
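To make the encodings concrete, the sketch below builds the MAS0–MAS3 words for the CCSR MMIO entry programmed in Phase 6a (TLB1 entry 2: 1 MB, cache-inhibited + guarded, SR+SW), using the big-endian bit positions tabulated above and the e500v2 MAS3 permission layout. The helper names are illustrative, not the kernel's actual API:

```rust
// PPC register tables use big-endian bit numbering: bit 0 is the MSB.
const fn be_field(lsb: u32, val: u32) -> u32 {
    val << (31 - lsb)
}

// MAS0: TLBSEL in bits 1:3, ESEL in bits 4:15.
fn mas0(tlbsel: u32, esel: u32) -> u32 {
    be_field(3, tlbsel) | be_field(15, esel)
}

// MAS1: V (bit 0), IPROT (bit 1), TID (bits 8:15), TS (bit 20), TSIZE (bits 24:27).
fn mas1(iprot: bool, tid: u32, ts: u32, tsize: u32) -> u32 {
    be_field(0, 1) | be_field(1, iprot as u32) | be_field(15, tid)
        | be_field(20, ts) | be_field(27, tsize)
}

// MAS2: EPN = VA[0:19] (top 20 bits), W/I/M/G in bits 27:30.
fn mas2(va: u32, w: bool, i: bool, m: bool, g: bool) -> u32 {
    (va & 0xFFFF_F000) | be_field(27, w as u32) | be_field(28, i as u32)
        | be_field(29, m as u32) | be_field(30, g as u32)
}

// MAS3 (e500v2): RPN = PA[0:19], UX/SX/UW/SW/UR/SR in bits 22:27.
const SW: u32 = 0b000100; // supervisor write
const SR: u32 = 0b000001; // supervisor read
fn mas3(pa: u32, perms: u32) -> u32 {
    (pa & 0xFFFF_F000) | be_field(27, perms & 0x3F)
}

fn main() {
    // TLB1 entry 2 from the Phase 6a table: CCSR at 0xFE000000, TSIZE=0x05 (1 MB),
    // IPROT=1, WIMG = I+G, supervisor read/write.
    assert_eq!(mas0(1, 2), 0x1002_0000);
    assert_eq!(mas1(true, 0, 0, 0x05), 0xC000_0050);
    assert_eq!(mas2(0xFE00_0000, false, true, false, true), 0xFE00_000A);
    assert_eq!(mas3(0xFE00_0000, SW | SR), 0xFE00_0050);
}
```

These are exactly the five values the `mtspr MAS*` sequence in Phase 6a would load before `tlbwe` (MAS7 = 0 here, since CCSR sits below 4 GB).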
=== Phase 6a: TLB1 Boot Identity Map ===
At this point, only the QEMU-provided TLB1 entry 0 is active
(identity-mapping the kernel load region). Phase 6a establishes
the remaining pinned TLB1 entries needed for kernel operation.
TLB1 entry layout after Phase 6a:
| Entry | Mapping | TSIZE | EPN/RPN | WIMG | Permissions | IPROT |
|---|---|---|---|---|---|---|
| 0 | Kernel text+data | 0x09 (256MB) or per r7 | kernel_phys | 0b0000 (cacheable) | SX+SR+SW | 1 |
| 1 | DTB blob | 0x05 (1MB) or 0x03 (64KB) | dtb_phys (page-aligned) | 0b0000 (cacheable) | SR | 1 |
| 2 | CCSR MMIO region | 0x05 (1MB) | ccsr_base (from DTB /soc ranges) | 0b1010 (I+G) | SR+SW | 1 |
Entry 0 is inherited from firmware (see table above). If the DTB
physical address falls within the entry 0 mapping (common case for
QEMU, which places DTB adjacent to the kernel), entry 1 is not needed
and remains available.
Entry 2 maps the CCSR (Configuration, Control, and Status Registers)
block which contains the UART, OpenPIC, and other SoC peripherals.
The CCSR base address is obtained from the DTB `/soc` node `ranges`
property. On QEMU ppce500, CCSR is at `0xFE000000` (default) with
size 1 MB. WIMG bits must be I=1 (cache-inhibited) and G=1 (guarded)
for all MMIO regions — failure to set these bits causes speculative
reads to MMIO space, which can hang or corrupt device state.
Programming sequence for each TLB1 entry:
```
mtspr MAS0, r_mas0 # TLBSEL=1, ESEL=entry_number
mtspr MAS1, r_mas1 # V=1, IPROT=1, TID=0, TS=0, TSIZE=size
mtspr MAS2, r_mas2 # EPN=virt_base | WIMG bits
mtspr MAS3, r_mas3 # RPN=phys_base | permission bits
mtspr MAS7, r_mas7 # RPN upper bits (0 for <4GB)
tlbwe # Commit entry to TLB1
isync # Synchronize instruction stream
```
=== Phase 6b: Software Page Table for TLB0 Refill ===
After the buddy allocator is available (Phase 5), allocate the
software page table used for TLB0 miss handling. This is a
two-level radix structure conceptually similar to RISC-V Sv32
(both are software-walked), but with Book E-specific PTE format:
Level 1 — PGD (Page Global Directory):
- 1024 entries × 4 bytes = 4 KB, aligned to 4 KB
- Index: VA[31:22] (upper 10 bits of effective address)
- Each entry: physical address of a PTE page, or 0 (not present)
- PGD entry format: PTE_page_phys[31:12] | flags[11:0]
- Bit 0: V (valid — PTE page allocated)
- Bits 1-11: reserved (must be 0)
Level 2 — PTE (Page Table Entry):
- 1024 entries × 4 bytes = 4 KB per PTE page, aligned to 4 KB
- Index: VA[21:12] (next 10 bits)
- Page offset: VA[11:0] (12 bits → 4 KB pages)
- PTE format — encodes fields that map directly to MAS2/MAS3:
| Bits | Field | MAS register | Description |
|---|---|---|---|
| 0:19 | RPN | MAS3[0:19] | Real Page Number (physical address [0:19]) |
| 20 | — | — | Reserved |
| 21 | V | — | Valid (PTE present) |
| 22 | UX | MAS3 UX | User execute |
| 23 | SX | MAS3 SX | Supervisor execute |
| 24 | UW | MAS3 UW | User write |
| 25 | SW | MAS3 SW | Supervisor write |
| 26 | UR | MAS3 UR | User read |
| 27 | SR | MAS3 SR | Supervisor read |
| 28 | W | MAS2 W | Write-through |
| 29 | I | MAS2 I | Cache-inhibited |
| 30 | M | MAS2 M | Memory coherence |
| 31 | G | MAS2 G | Guarded |
This PTE layout is chosen so the TLB miss handler can extract
MAS2 and MAS3 fields with simple mask-and-shift operations,
minimizing handler latency. The handler runs with interrupts
disabled (MSR[CE]=0, MSR[EE]=0) and must complete in minimal
cycles.
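The mask-and-shift claim can be checked directly: under the PTE layout above, the permission bits (22:27) already sit at their MAS3 positions, and the WIMG nibble (PTE bits 28:31) needs only a one-bit left shift to land in MAS2 bits 27:30. A hypothetical sketch of the refill math (indices and masks taken from the layouts above, not the kernel's actual handler):

```rust
// Big-endian bit numbering: bit 0 = MSB of the 32-bit word.
const PTE_V: u32 = 1 << 10; // PTE bit 21 (BE) = value bit 10: valid

fn pgd_index(va: u32) -> usize { (va >> 22) as usize }           // VA[31:22]
fn pte_index(va: u32) -> usize { ((va >> 12) & 0x3FF) as usize } // VA[21:12]

/// Derive the MAS2/MAS3 words for a TLB0 refill from a valid PTE.
fn pte_to_mas(va: u32, pte: u32) -> (u32, u32) {
    // MAS2: EPN from the faulting VA; WIMG sits in PTE bits 28:31 and lands
    // in MAS2 bits 27:30, i.e. one left shift of the low nibble.
    let mas2 = (va & 0xFFFF_F000) | ((pte & 0xF) << 1);
    // MAS3: RPN (bits 0:19) and permissions (bits 22:27) copy straight
    // across, because the PTE reuses the MAS3 bit positions for both.
    let mas3 = pte & 0xFFFF_F3F0;
    (mas2, mas3)
}

fn main() {
    let va = 0xD000_0000u32;
    assert_eq!(pgd_index(va), 0x340);
    assert_eq!(pte_index(va), 0);
    // PTE: RPN=0xFE000, valid, SW+SR permissions (0x050), I+G caching (0x5).
    let pte = 0xFE00_0000 | PTE_V | 0x050 | 0x5;
    let (mas2, mas3) = pte_to_mas(va, pte);
    assert_eq!(mas2, 0xD000_000A); // EPN | I | G
    assert_eq!(mas3, 0xFE00_0050); // RPN | SW | SR
}
```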
Initial population: After allocating the PGD, the kernel populates
PTE entries for all kernel memory (text, data, heap, stacks) that
is not already covered by TLB1 pinned entries. Kernel pages use
TID=0 (global), TS=0.
=== Phase 6c: Install TLB Miss Handlers ===
Install software TLB refill handlers at IVOR13 (Data TLB Miss)
and IVOR14 (Instruction TLB Miss). These replace the early
fault handlers from Phase 1.
Data TLB Miss handler (IVOR13) pseudocode:
```
1. Save r0, r1 (to SPRG4/SPRG5 scratch registers)
2. Read DEAR (Data Exception Address Register) → faulting VA
3. Read MAS1 (hardware sets TID and TS from current context)
4. Compute PGD index: pgd_idx = VA[31:22]
5. Load PGD entry: pgd_entry = pgd_base[pgd_idx]
6. If pgd_entry.V == 0 → jump to page fault handler (data storage
exception — this is a genuine fault, not a TLB refill)
7. Compute PTE index: pte_idx = VA[21:12]
8. Load PTE: pte = pte_page[pte_idx]
9. If pte.V == 0 → jump to page fault handler
10. Build MAS0: TLBSEL=0 (TLB0), ESEL from NV hint
11. Build MAS2: EPN = VA[31:12], WIMG from PTE[28:31]
12. Build MAS3: RPN = PTE[0:19], permissions from PTE[22:27]
13. Set MAS7 = 0 (or from extended PTE if >4GB phys)
14. mtspr MAS0/MAS1/MAS2/MAS3/MAS7
15. tlbwe
16. Restore r0, r1 from SPRG4/SPRG5
17. rfi (Return From Interrupt)
```
Instruction TLB Miss handler (IVOR14) is identical except:
- Step 2: Read SRR0 (instruction address) instead of DEAR
- Step 6/9: Jump to instruction storage exception on fault
The PGD base physical address is stored in a dedicated SPR
(SPRG6) for single-load access in the miss handler. On context
switch, SPRG6 is updated to point to the new process's PGD.
=== Phase 6d: MSR[IS]/MSR[DS] Transition ===
Book E uses MSR[IS] (Instruction Space) and MSR[DS] (Data Space)
to select which translation space is active. When IS=0/DS=0, TLB
entries with TS=0 are used; when IS=1/DS=1, entries with TS=1 are
used. (Note: Book S uses different names — MSR[IR] and MSR[DR] —
for the equivalent bits. UmkaOS uses Book E nomenclature for PPC32.)
The kernel runs with IS=0, DS=0 (TS=0 space) throughout. All
kernel TLB1 entries and kernel page table entries use TS=0.
Userspace processes use TS=1, differentiated by TID (PID context).
This allows kernel and user mappings to coexist in TLB0 without
conflict.
Translation is enabled by setting MSR[IS] and MSR[DS] via rfi:
```
mfmsr r3
ori r3, r3, MSR_IS | MSR_DS # Enable translation for both I and D
mtsrr1 r3 # New MSR value for rfi
lis r3, phase6_translated@ha
ori r3, r3, phase6_translated@l
mtsrr0 r3 # Return address after rfi
rfi # Atomically load SRR0→PC, SRR1→MSR
phase6_translated:
# Now running with address translation enabled
```
At this point, TLB1 entry 0 provides the identity map so the
kernel continues executing at the same physical/virtual address.
The transition is invisible to subsequent code.
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: OpenPIC Initialization
The OpenPIC controller on e500 SoCs (FSL MPIC variant) is memory-
mapped within the CCSR region. The base address is discovered from
the DTB `/soc/open-pic` (or compatible `fsl,mpic`) node. On QEMU
ppce500, OpenPIC is at CCSR_BASE + 0x40000 (default: 0xFE040000).
OpenPIC register map (offsets from OpenPIC base):
| Offset | Register | Description |
|---|---|---|
| 0x1000 | GCR (Global Config) | Reset, mixed mode, pass-through |
| 0x1020 | SVIR (Spurious Vector) | Spurious interrupt vector number |
| 0x1080 | VPR0 (Vector/Priority 0) | First external interrupt source |
| 0x1090 | DR0 (Destination 0) | CPU destination mask for source 0 |
| 0x10A0 | VPR1 | Second source vector/priority |
| 0x10B0 | DR1 | Second source destination |
| ... | VPRn at 0x1080 + n×0x20 | Repeats for each interrupt source |
| ... | DRn at 0x1090 + n×0x20 | Repeats for each interrupt source |
| 0x20080 | CTPR0 (Current Task Priority, CPU0) | Per-CPU task priority register |
| 0x200A0 | IACK0 (Interrupt Ack, CPU0) | Per-CPU: read to claim interrupt |
| 0x200B0 | EOI0 (End of Interrupt, CPU0) | Per-CPU: write 0 to signal EOI |
| 0x21080 | CTPR1 (CPU1) | Per-CPU registers at +0x1000 per CPU |
Per-CPU register blocks are at OpenPIC_base + 0x20080 + (cpu_id × 0x1000).
Initialization sequence:
1. **Reset**: Write GCR[RESET]=1, poll until clear (controller ready).
2. **Spurious vector**: Write SVIR = 0xFF (or chosen spurious ID).
3. **Disable all sources**: For each source n (0 to max_irq-1),
write VPRn[MSK]=1 (bit 31) to mask the source.
4. **Configure sources**: For each interrupt source to be enabled:
- Write VPRn: priority (bits 16:19, 0=disabled, 15=max),
vector number (bits 0:7), polarity and sense in bits 22:23.
- Write DRn: CPU destination bitmask (bit 0 = CPU0, etc.).
5. **Set current task priority**: Write CTPR0 = 0 (accept all
priority levels). Higher CTPR values mask lower-priority
interrupts.
6. **Enable external interrupts**: Set MSR[EE] = 1 via:
```
mfmsr r3
ori r3, r3, MSR_EE # MSR_EE = 0x8000
mtmsr r3
isync
```
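The VPR encoding in step 4 can be sketched as a small builder. Note that the OpenPIC map uses conventional bit numbering (bit 0 = LSB), unlike the big-endian MAS tables; polarity/sense (bits 22:23) are omitted for brevity, and the function name is illustrative:

```rust
/// Build an OpenPIC VPRn value from the fields described in steps 3-4 above.
fn vpr(masked: bool, priority: u32, vector: u8) -> u32 {
    ((masked as u32) << 31)        // MSK, bit 31
        | ((priority & 0xF) << 16) // priority, bits 16:19 (0=disabled, 15=max)
        | vector as u32            // vector number, bits 0:7
}

fn main() {
    assert_eq!(vpr(true, 0, 0), 0x8000_0000);     // step 3: mask the source
    assert_eq!(vpr(false, 4, 0x20), 0x0004_0020); // enabled, priority 4, vector 0x20
}
```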
Interrupt dispatch flow (after setup):
1. External interrupt → IVOR4 handler
2. Read IACK0 → returns vector number of highest-priority pending IRQ
3. If vector == spurious (0xFF), return immediately (no real interrupt)
4. Dispatch to registered handler via irqdomain lookup
(see [Section 3.12](03-concurrency.md#irq-chip-and-irqdomain-hierarchy))
5. Write EOI0 = 0 to signal end-of-interrupt
6. rfi to return from exception
Phase 9: Decrementer Timer
The decrementer (DEC SPR 22) is a 32-bit down-counter that fires
a Decrementer exception (IVOR10) when it transitions from 0 to
-1 (underflow, detected as sign-bit transition). The exception
is gated by MSR[EE] (already enabled in Phase 8).
**Timebase frequency discovery**: The decrementer counts at the
timebase frequency, which is discovered from the DTB:
- Primary source: `/cpus/timebase-frequency` property (u32, Hz)
- Fallback: `/cpus/cpu@0/timebase-frequency`
- QEMU ppce500 default: typically 33333333 Hz (33.3 MHz) or as
configured by `-machine ppce500,timebase-frequency=N`
The kernel reads this value in Phase 3 (DTB parse) and stores it
in a global `TIMEBASE_FREQ_HZ: u32`. The tick interval is
computed as:
```
dec_reload = timebase_freq_hz / CONFIG_HZ // e.g., 33333333 / 1000 = 33333
```
Programming sequence:
1. Read timebase frequency from stored DTB value
2. Compute reload value for desired tick rate (CONFIG_HZ)
3. Write DEC = reload value via `mtspr DEC, r3`
4. In the decrementer handler (IVOR10):
a. Save volatile registers
b. Acknowledge by reloading DEC with the reload value
(writing DEC clears the pending exception)
c. Call scheduler tick / timekeeping update
d. Restore registers and `rfi`
Note: The timebase register (TBL/TBU, read as SPR 268/269) counts UP
at the same frequency and is used for wall-clock timekeeping. DEC
counts DOWN from the loaded value. Both derive from the same
oscillator. On Book E, SPR 268/269 are read-only (readable even from
user mode); writes go through the separate supervisor-privileged
write SPRs TBWL/TBWU (SPR 284/285).
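The reload arithmetic is worth pinning down, because integer division truncates and the remainder accumulates as tick-rate error. A small sketch (function names are illustrative; values match the example above):

```rust
/// Decrementer reload value for a CONFIG_HZ tick rate.
fn dec_reload(timebase_hz: u32, config_hz: u32) -> u32 {
    timebase_hz / config_hz
}

/// Long-term tick-rate error from the truncated remainder, in parts per million.
fn drift_ppm(timebase_hz: u32, config_hz: u32) -> u32 {
    // reload * config_hz falls short of timebase_hz by the remainder,
    // i.e. this many timebase ticks are lost per second if uncorrected.
    let lost = (timebase_hz % config_hz) as u64;
    (lost * 1_000_000 / timebase_hz as u64) as u32
}

fn main() {
    assert_eq!(dec_reload(33_333_333, 1000), 33_333);
    // 333 leftover timebase ticks/second: the naive tick runs ~10 ppm slow,
    // so timekeeping should use TBL/TBU rather than counting DEC ticks.
    assert_eq!(drift_ppm(33_333_333, 1000), 9);
}
```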
Phase 10: Scheduler (canonical Phase 2.3)
Initialize EEVDF scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Phase 10a: Workqueue Framework (canonical Phase 2.7)
Initialize named kernel worker thread pools.
Cross-ref: [Section 3.11](03-concurrency.md#workqueue-deferred-work).
Phase 10b: RCU Init (canonical Phase 2.8)
Initialize RCU infrastructure.
Cross-ref: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths).
Phase 10c: LSM Framework Init (canonical Phase 2.9)
Initialize LSM framework and register compiled-in LSMs.
Cross-ref: [Section 9.8](09-security.md#linux-security-module-framework).
PCI/PCIe little-endian access on big-endian PPC32: PCI configuration space and BAR-mapped MMIO regions use little-endian byte order per the PCI specification, while PPC32 is natively big-endian. All PCI register accesses must use byte-swapping load/store intrinsics:
| Operation | Instruction | Rust wrapper | Description |
|---|---|---|---|
| 32-bit LE read | `lwbrx` | `in_le32(addr)` | Load word byte-reversed |
| 16-bit LE read | `lhbrx` | `in_le16(addr)` | Load halfword byte-reversed |
| 32-bit LE write | `stwbrx` | `out_le32(addr, val)` | Store word byte-reversed |
| 16-bit LE write | `sthbrx` | `out_le16(addr, val)` | Store halfword byte-reversed |
| 8-bit read/write | `lbz`/`stb` | `in_8(addr)` / `out_8(addr, val)` | No swap needed |
These are implemented as unsafe inline functions in arch::ppc32::io with safety
invariants requiring that the address is within a valid MMIO mapping (TLB1 entry with
I=1, G=1). All PCI BAR and config-space accesses go through these wrappers — raw
pointer reads/writes to PCI MMIO are forbidden. For DMA buffers written by PCI devices
(which produce little-endian data in host memory), the driver must use explicit
u32::from_le() / u32::to_le() conversions.
Note: Some FSL SoCs provide an "outbound address translation" window that performs
hardware byte-swap for PCI. QEMU ppce500 does not emulate this; software byte-swap
via lwbrx/stwbrx is the reliable portable path.
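A portable sketch of the wrapper contract (the real `arch::ppc32::io` implementations use `lwbrx`/`stwbrx` inline assembly; this byte-wise version expresses the same semantics on any host and is illustration only):

```rust
use core::ptr;

/// Read a little-endian u32 from MMIO byte-by-byte, host-endianness-independent.
/// Safety: `addr` must point at 4 readable bytes of a valid mapping.
unsafe fn in_le32(addr: *const u8) -> u32 {
    let mut b = [0u8; 4];
    for i in 0..4 {
        b[i] = unsafe { ptr::read_volatile(addr.add(i)) };
    }
    u32::from_le_bytes(b)
}

/// Write a u32 to MMIO in little-endian byte order.
/// Safety: `addr` must point at 4 writable bytes of a valid mapping.
unsafe fn out_le32(addr: *mut u8, val: u32) {
    for (i, byte) in val.to_le_bytes().iter().enumerate() {
        unsafe { ptr::write_volatile(addr.add(i), *byte) };
    }
}

fn main() {
    let mut mmio = [0u8; 4]; // stand-in for a BAR-mapped device register
    unsafe { out_le32(mmio.as_mut_ptr(), 0x1234_5678) };
    assert_eq!(mmio, [0x78, 0x56, 0x34, 0x12]); // LE byte order in memory
    assert_eq!(unsafe { in_le32(mmio.as_ptr()) }, 0x1234_5678);
}
```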
SMP bringup — PPC32 (ePAPR spin-table protocol):
Secondary CPUs on e500 multi-core SoCs (e.g., MPC8572, P2020, P4080) are brought
online via the ePAPR spin-table protocol. This is embedded PowerPC-specific and
distinct from the RTAS start-cpu mechanism used on server POWER systems.
The DTB describes each CPU with enable-method = "spin-table" and a
cpu-release-addr property pointing to a 64-bit spin-table entry in memory:
```
/cpus/cpu@1 {
    compatible = "fsl,e500v2";
    enable-method = "spin-table";
    cpu-release-addr = <0x0 0x00BFFD00>;  // 64-bit physical address
};
```
Spin-table entry format (8 bytes, naturally aligned):
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 bytes | `addr_hi` | Entry point physical address [63:32] (0 on 32-bit) |
| 4 | 4 bytes | `addr_lo` | Entry point physical address [31:0], or 1 = "not released" |
Protocol sequence:
1. Firmware initialization: At power-on, firmware starts all cores. Secondary
   cores execute a tight spin loop polling their spin-table entry, spinning
   while `addr_lo` == 1.
2. BSP releases secondary: The BSP (CPU 0) writes the secondary entry point to
   the spin-table, then ensures visibility to the spinning core (flush/sync so
   the store reaches memory before the secondary's next poll).
3. Secondary entry: The secondary CPU jumps to the entry point with:
   - `r3` = 0 (or DTB address, platform-dependent)
   - MSR: supervisor mode, translation off (IS=0, DS=0)
4. The secondary must set up its own stack, exception vectors (IVPR/IVORn),
   per-CPU data pointer, and TLB1 entries before enabling interrupts.
5. Secondary initialization sequence:
   a. Set stack pointer from per-CPU stack allocation (BSP pre-allocates one
      16 KB stack per secondary CPU)
   b. Apply e500 errata (same as Phase 1a — each core needs its own SPR writes)
   c. Set IVPR + IVORn (each core has independent exception vectors)
   d. Copy kernel TLB1 entries (entries 0 and 2 at minimum) — each core's TLB
      is independent; firmware only maps the spin-loop region
   e. Install TLB miss handlers (IVOR13/IVOR14) and set SPRG6 = kernel PGD
   f. Initialize per-CPU OpenPIC registers (CTPRn, IACKn, EOIn)
   g. Enable MSR[EE] for interrupts
   h. Initialize per-CPU scheduler runqueue
   i. Signal "online" to BSP via atomic flag, enter idle loop
On QEMU ppce500 with -smp N, QEMU creates spin-table entries for CPUs 1 through
N-1 at addresses specified in the auto-generated DTB. The cpu-release-addr values
are within the firmware-reserved memory region and are already covered by the
firmware's TLB1 setup on each secondary core.
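The release handshake can be modeled with ordinary atomics. A sketch under stated assumptions: one simulated spin-table entry, a std thread standing in for the secondary core, and a release store in place of the hardware flush/sync; `ADDR_HI`/`ADDR_LO` and both function names are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

// One spin-table entry (real entries live in firmware-reserved RAM
// and are polled with translation off).
static ADDR_HI: AtomicU32 = AtomicU32::new(0);
static ADDR_LO: AtomicU32 = AtomicU32::new(1); // 1 = "not released"

/// Secondary core: spin until addr_lo leaves the "not released" value.
fn secondary_spin() -> u64 {
    loop {
        let lo = ADDR_LO.load(Ordering::Acquire);
        if lo != 1 {
            // addr_lo was written last with release ordering, so addr_hi
            // (and the entry-point code) are already visible here.
            return ((ADDR_HI.load(Ordering::Relaxed) as u64) << 32) | lo as u64;
        }
        std::hint::spin_loop();
    }
}

/// BSP: publish the entry point, writing addr_lo last; the release store
/// plays the role of the cache-flush + sync pair on real hardware.
fn bsp_release(entry: u64) {
    ADDR_HI.store((entry >> 32) as u32, Ordering::Relaxed);
    ADDR_LO.store(entry as u32, Ordering::Release);
}

fn main() {
    let secondary = thread::spawn(secondary_spin);
    bsp_release(0x0010_0000);
    assert_eq!(secondary.join().unwrap(), 0x0010_0000);
}
```

Writing `addr_lo` last matters: it doubles as both the entry address and the release flag, so the entry must never have its low word equal to 1.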
Full SMP specification (per-CPU stack sizing, OpenPIC affinity routing,
cross-core TLB invalidation via tlbivax + IPI, and NUMA-unaware scheduling
for embedded PPC) is deferred to Phase 3 — the protocol above is sufficient
for SMP bringup; the remaining details will be specified when PPC32 SMP is
implemented.
2.11 PPC64LE Boot Sequence¶
PPC64LE targets IBM POWER processors (POWER8, POWER9, POWER10) in little-endian mode. Two distinct platform types exist:
- pseries: Hypervisor-managed virtual machine (QEMU `-M pseries`, PowerVM LPARs). Firmware is SLOF (Slimline Open Firmware) or PowerVM PHYP. The kernel runs as a supervisor (MSR[HV]=0) — hypervisor-privileged resources (LPCR, HSPRG0/1, HDEC) are inaccessible. Firmware APIs are accessed through RTAS (Run-Time Abstraction Services) with a big-endian calling convention.
- powernv: Bare-metal (QEMU `-M powernv`, real hardware with skiboot). The kernel runs as a hypervisor (MSR[HV]=1) with full hardware access. Firmware APIs are accessed through OPAL (Open Power Abstraction Layer) with a C calling convention (little-endian on POWER9+).
Both platforms pass a DTB pointer in r3 at kernel entry. The kernel detects the
platform type in Phase 1a and branches to platform-specific code paths throughout
initialization.
2.11.1.1 ELFv2 ABI and TOC¶
PPC64LE uses the ELFv2 ABI, which requires r2 to point at the TOC (Table of
Contents) base — a data area holding addresses for global variables and function
descriptors. The .TOC. symbol is defined by the linker at the start of the
.got section (or a linker-defined location in the linker script). Position-
independent data access on PPC64 works through r2-relative offsets: the compiler
generates ld rN, offset(r2) to load global addresses.
The linker script (linker-ppc64le.ld) must:
1. Define the .TOC. symbol at the appropriate position (typically .got + 0x8000
per ELFv2 convention, to maximize the signed 16-bit offset range of ±32 KB).
2. Place .got, .toc, and .plt sections contiguously.
3. Ensure the _start entry point can compute r2 from a PC-relative calculation
or absolute symbol load before any function call or global access.
2.11.1.2 Per-CPU Data Pointer (r13)¶
PPC64 ELFv2 ABI reserves r13 as the thread-local storage base pointer. In the
kernel context, UmkaOS uses r13 as the per-CPU data pointer — the PPC64
equivalent of x86-64 GS base, AArch64 TPIDR_EL1, and RISC-V tp (in kernel
mode). The entry assembly must set r13 = &CpuLocalBlock for the BSP before any
code that accesses per-CPU data. Each AP receives its CpuLocalBlock address from
the firmware-provided mailbox and sets r13 immediately upon entry.
r13 is callee-saved (non-volatile) in the ELFv2 ABI, so once set during boot
it persists across all function calls. The kernel never modifies r13 except
during CPU hotplug initialization.
2.11.1.3 Firmware API: RTAS vs OPAL¶
PPC64LE interacts with firmware at runtime for operations that cannot be performed by the kernel alone (power management, error logging, hardware configuration). The two firmware interfaces are mutually exclusive — a system is either RTAS-based (pseries) or OPAL-based (powernv), never both.
RTAS (Run-Time Abstraction Services) — pseries only:
- Discovered from the DTB `/rtas` node, which provides `rtas-base` (physical address of the RTAS blob) and `rtas-entry` (entry point offset within the blob).
- Token-based dispatch: Each RTAS function has a numeric token read from the DTB `/rtas` node properties (e.g., `start-cpu` = token `0x28` under one firmware; `stop-self` = whatever token the DTB `"stop-self"` property holds). Token values are NOT fixed — they are assigned by the firmware and vary between SLOF versions and hypervisors. The kernel must parse the DTB to build a token-to-function map at boot.
- Big-endian calling convention: Regardless of kernel endianness, the RTAS args buffer is always big-endian (RTAS was defined in the 32-bit big-endian OpenFirmware era), so the kernel must byte-swap every field when building and parsing the args buffer on LE systems. The call sequence is:
  1. Allocate `RtasArgs` in memory below 4 GB (RTAS runs in 32-bit real mode).
  2. Fill fields in big-endian byte order.
  3. Disable interrupts (RTAS is not reentrant — a global lock is required on SMP).
  4. Switch to big-endian mode: `mtmsrd` with MSR[LE]=0 (or use a trampoline).
  5. Branch to `rtas-entry` with `r3` = physical address of `RtasArgs`.
  6. On return, restore LE mode and parse results.
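The big-endian fixup can be made mechanical. The sketch below assumes the canonical PAPR args-buffer layout (token, nargs, nret, then the argument words); the struct and function names are illustrative, not the kernel's actual types:

```rust
#[repr(C)]
struct RtasArgs {
    token: u32,      // all fields stored big-endian
    nargs: u32,
    nret: u32,
    args: [u32; 16], // input words first, then space for outputs
}

/// Build an args buffer with every field in big-endian byte order.
fn build_rtas_args(token: u32, inputs: &[u32], nret: u32) -> RtasArgs {
    let mut args = [0u32; 16];
    for (i, v) in inputs.iter().enumerate() {
        args[i] = v.to_be(); // byte-swap on LE hosts, no-op on BE
    }
    RtasArgs {
        token: token.to_be(),
        nargs: (inputs.len() as u32).to_be(),
        nret: nret.to_be(),
        args,
    }
}

fn main() {
    let a = build_rtas_args(0x28, &[3], 1);
    // Stored representation is BE regardless of host endianness.
    assert_eq!(a.token, 0x28u32.to_be());
    assert_eq!(u32::from_be(a.nargs), 1);
    assert_eq!(u32::from_be(a.nret), 1);
    assert_eq!(u32::from_be(a.args[0]), 3);
}
```

On return, the kernel reads results the same way: `u32::from_be()` on each output word.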
OPAL (Open Power Abstraction Layer) — powernv only:
- Discovered from the DTB `/ibm,opal` node, which provides `opal-base-address` (physical address of the OPAL firmware image) and `opal-entry-address` (entry point for OPAL calls).
- Function-number dispatch: OPAL functions are identified by a fixed numeric token passed in `r0` (e.g., `OPAL_START_CPU` = 41, `OPAL_CEC_POWER_DOWN` = 5, `OPAL_POLL_EVENTS` = 10). Token numbers are defined by the skiboot/OPAL specification and are stable across versions.
- Standard C calling convention: Arguments in `r3`-`r9` per the PPC64 ELFv2 ABI. On POWER9+, OPAL is little-endian (same as the kernel). On POWER8, OPAL is big-endian — the kernel must switch to BE mode before calling OPAL, same as for RTAS calls. The call sequence (POWER9+ LE OPAL) is:
  1. Load the function token into `r0`.
  2. Load arguments into `r3`-`r9`.
  3. Save non-volatile registers (OPAL may clobber them on some calls).
  4. Disable interrupts (OPAL calls are not preemptible).
  5. Branch to `opal-entry-address`.
  6. On return, `r3` = return code (`OPAL_SUCCESS` = 0, `OPAL_BUSY` = -1, etc.).
| Property | RTAS (pseries) | OPAL (powernv) |
|---|---|---|
| DTB node | `/rtas` | `/ibm,opal` |
| Dispatch | Token from DTB property | Fixed function number in `r0` |
| Endianness | Always big-endian | LE on POWER9+, BE on POWER8 |
| Calling convention | Custom args buffer at `r3` | Standard C ABI (`r3`-`r9`) |
| Reentrancy | Not reentrant (global lock) | Most calls not reentrant |
| Memory constraint | Args buffer below 4 GB | No constraint (64-bit addresses) |
| Interrupt state | Must be disabled | Must be disabled |
Entry assembly (arch/ppc64le/entry.S, GNU as syntax):
1. SLOF/OPAL loads ELF and jumps to _start in hypervisor or supervisor mode
- r3 = DTB address
- r4 = 0 (reserved)
- MSR: 64-bit mode (SF=1), little-endian (LE=1)
2. _start:
a. Compute TOC pointer: load r2 from .TOC. symbol
(addis r2, r12, .TOC.-_start@ha ; addi r2, r2, .TOC.-_start@l)
r12 holds the entry address per ELFv2 global entry convention.
b. Set up stack pointer (r1) from linker symbol (_stack_top)
c. Set r13 = &CpuLocalBlock (BSP per-CPU data pointer)
(lis r13, bsp_cpulocal@highest ; ori ... ; sldi ... pattern
to load a 64-bit immediate, or use TOC-relative: ld r13, bsp_cpulocal@toc(r2))
d. Clear BSS (dcbz loop or std loop from __bss_start to __bss_end)
e. Set up initial exception vectors (copy to 0x0 or use LPCR to relocate)
f. Call umka_main(0, r3) [magic=0, info=DTB address]
The linker script (linker-ppc64le.ld) places .text._start first at the kernel
load address. PPC64LE uses the ELFv2 ABI with little-endian byte order.
Initialization phases (in umka_main(), sequential):
Canonical Phase Mapping:
| Canonical Phase | Description | Local Implementation |
|---|---|---|
| 0.1 | arch_early_init | Entry assembly (steps 1–2) + Phase 1 (vectors) + Phase 1a (platform detect) |
| 0.15 | early_log_init | Phase 1b: early_log_init() |
| 0.3 | parse_firmware_memmap | Phase 3: DTB /memory parse |
| 0.4 | boot_alloc_init | Phase 4: phys::init() from DTB regions |
| 0.5 | reserve_regions | Phase 4: reserve kernel image |
| 0.6 | numa_discover_topology | Phase 4a: DTB /memory nodes + ibm,associativity |
| 0.7 | cpulocal_bsp_init | Entry assembly step 2c: r13 = &CpuLocalBlock (ELFv2 ABI) |
| 0.8a | evolvable_verify | Phase 4c: Evolvable signature verification (physical addresses, no MMU) |
| 0.2 | identity_map | Phase 4d: Radix MMU (POWER9+) or HPT (POWER8) |
| 0.8b | evolvable_map_and_init | Phase 4e: Evolvable virtual mapping at EVOLVABLE_VIRT_BASE + VTABLE_SLOTS[] population |
| 1.1 | buddy allocator | Phase 5: buddy init |
| 1.2 | slab allocator | Phase 5a: slab_init() |
| 2.1 | IRQ domain | Phase 8: XIVE/XICS init + IrqDomain setup |
| 2.2 | capability system | Phase 7: CapSpace init |
| 2.3 | scheduler | Phase 10: scheduler init |
| 2.7 | workqueue infra | Phase 10a: workqueue_init_early() |
| 2.8 | RCU | Phase 10b: rcu_init() |
| 2.9 | LSM framework | Phase 10c: lsm_init() |
| 3.1–3.3 | SMP bringup | RTAS start-cpu / OPAL opal_start_cpu (Phase 3 deferred) |
Phase 1: Exception Vectors (LPCR + HSPRG0/1)
Configure exception vector base and initial handlers.
- **powernv**: Set HSPRG0 to per-CPU data pointer via mtspr.
Configure LPCR for exception vector base via direct mtspr.
- **pseries**: HSPRG0/1 are hypervisor-privileged and inaccessible.
Use SPRG0/SPRG1 (supervisor-accessible) for per-CPU data.
Exception vectors are configured by the hypervisor at partition
creation; the guest installs handlers at the prescribed locations.
Set LPCR[ILE] = 1 (Interrupt Little-Endian): ensures exceptions
deliver control in little-endian mode. Without this, the first
exception on a LE kernel will execute the handler in big-endian
mode, causing immediate crash.
- **powernv**: Direct mtspr(LPCR, ...) with ILE bit set.
- **pseries**: H_SET_MODE hcall (0x31C) with the LE-interrupts
  resource selector and value = 1, requesting little-endian
  exception delivery from the hypervisor.
Initialize system reset and machine check handlers.
Phase 1a: Platform Detection
Detect pseries vs powernv by reading MSR[HV] (Hypervisor bit):
- MSR[HV] = 0 → pseries guest (runs under hypervisor, LPCR is
hypervisor-privileged — writes cause Program exception)
- MSR[HV] = 1 → powernv bare-metal (direct hardware access)
Also detect POWER generation from PVR (Processor Version Register):
- PVR[0:15] = 0x004B → POWER8E (Express variant)
- PVR[0:15] = 0x004C → POWER8NVL (NVLink variant)
- PVR[0:15] = 0x004D → POWER8
- PVR[0:15] = 0x004E → POWER9
- PVR[0:15] = 0x0080 → POWER10
This detection governs all subsequent privilege-sensitive operations
(LPCR writes, interrupt controller choice, SMP bringup mechanism,
firmware API selection).
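The PVR decode is a straight match on the upper halfword. A small illustrative helper (the function name and return type are not the kernel's actual code):

```rust
/// Map PVR[0:15] (the upper 16 bits of the Processor Version Register)
/// to a POWER generation, per the table above.
fn power_generation(pvr: u32) -> &'static str {
    match pvr >> 16 {
        0x004B => "POWER8E",
        0x004C => "POWER8NVL",
        0x004D => "POWER8",
        0x004E => "POWER9",
        0x0080 => "POWER10",
        _ => "unknown",
    }
}

fn main() {
    // Low halfword is the revision and does not affect the decode.
    assert_eq!(power_generation(0x004E_1202), "POWER9");
    assert_eq!(power_generation(0x0080_0100), "POWER10");
    assert_eq!(power_generation(0x003F_0000), "unknown");
}
```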
Parse DTB to locate firmware interface:
- If `/rtas` node present → store rtas_base, rtas_entry, and build
RTAS token map from DTB properties.
- If `/ibm,opal` node present → store opal_base, opal_entry.
Exactly one of these nodes will be present (never both, never neither).
Phase 1b: Early Log Ring Init (canonical Phase 0.15)
Call early_log_init() — sequencing checkpoint confirming the
BSS-resident EarlyLogRing is accessible. After this point,
early_log() emits messages to both the ring buffer and the
UART. See [Section 2.3](#boot-init-cross-arch--early-boot-log-ring).
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly).
Phase 3: DTB Parse
Parse the DTB passed in r3 (see [Section 2.8](#device-tree-and-platform-discovery)).
Extract:
- `/memory` regions (physical memory map)
- `/chosen` bootargs
- Interrupt controller: `/interrupt-controller` node with
`compatible` = `"ibm,opal-xive"` or `"ibm,opal-xive-pe"`
(powernv XIVE), or `"ibm,ppc-xicp"` (pseries XICS).
**XIVE vs XICS detection**: The DTB `compatible` string
determines the interrupt controller backend:
- `"ibm,opal-xive"` or `"ibm,opal-xive-pe"` → XIVE
(POWER9+, native). XIVE provides hardware-managed
interrupt queues and end-of-interrupt signalling.
- `"ibm,ppc-xicp"` + `"ibm,ppc-xics"` → XICS (POWER8,
or POWER9+ in XICS compatibility mode). XICS uses
programmed I/O to the ICP (Interrupt Control Presentation)
and ICS (Interrupt Control Source) units.
- pseries: the hypervisor may expose XIVE-on-XICS via
`"ibm,power-xive"` in `/interrupt-controller`. Use
H_INT_* hcalls for XIVE-exploiter mode if available;
fall back to XICS `ibm,set-xive` RTAS calls if the
`"ibm,power-xive"` node is absent.
The selected backend is stored in a global `IrqBackend`
enum (XIVE or XICS) used by all subsequent IRQ domain and
IPI setup code.
- UART base address (VTY for pseries, 16550 for powernv)
- `/cpus/cpu@N` nodes with `ibm,ppc-interrupt-server#s` for
CPU thread topology and SMP enumeration
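The compatible-string dispatch described above can be sketched as follows. Names are illustrative; a real implementation would read the strings out of the flattened DTB rather than take a slice:

```rust
/// Interrupt controller backend selected from the DTB `compatible`
/// strings (sketch; enum/function names are ours).
#[derive(Debug, PartialEq, Clone, Copy)]
enum IrqBackend {
    Xive,
    Xics,
}

fn select_irq_backend(compatible: &[&str]) -> Option<IrqBackend> {
    let has = |s: &str| compatible.iter().any(|c| *c == s);
    // XIVE is checked first: a pseries hypervisor exposing XIVE
    // may also advertise XICS compatibility for fallback.
    if has("ibm,opal-xive") || has("ibm,opal-xive-pe") || has("ibm,power-xive") {
        Some(IrqBackend::Xive)
    } else if has("ibm,ppc-xicp") || has("ibm,ppc-xics") {
        Some(IrqBackend::Xics)
    } else {
        None
    }
}
```

Checking XIVE before XICS encodes the preference order: exploit the native controller when offered, fall back to XICS otherwise.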
Phase 4: Physical Memory Manager
Pass DTB memory regions to phys::init(). Reserve kernel image.
Phase 4a: NUMA Topology Discovery (canonical Phase 0.6)
Parse DTB /memory nodes with `ibm,associativity` properties
to build the NUMA distance matrix. POWER systems commonly
have multi-node NUMA topologies (one node per processor
module). The `ibm,associativity-reference-points` property
in the root node defines which levels of the associativity
tuple correspond to NUMA domains.
Cross-ref: [Section 4.11](04-memory.md#numa-topology-and-policy).
Phase 4b: CpuLocal BSP Init (canonical Phase 0.7)
The CpuLocal pointer is set in entry assembly (step 2c):
r13 = &CPU_LOCAL_BLOCKS[0]
per the ELFv2 ABI convention (r13 = thread pointer). This
is set before umka_main() is called, so CpuLocal access is
already available at Phase 1. This phase verifies that r13
points to a valid CpuLocalBlock and initializes the remaining
fields (preempt_count, current_task, etc.).
See [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--initialization-sequence).
Phase 4c: Evolvable Signature Verification (canonical Phase 0.8a)
Verify the Evolvable image signature using physical addresses
only (no MMU required). Nucleus LMS verifier (~2 KB) checks
ML-DSA-65 + Ed25519 hybrid signature against embedded public
key. On failure, panic.
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
**Invariant**: Phases 0.1–4c MUST NOT dispatch through
VTABLE_SLOTS[] or any replaceable policy vtable.
Phase 4d: Virtual Memory — MMU Enable (canonical Phase 0.2)
See Phase 6 below for the full Radix/HPT page table setup
and MMU enable sequence. Page table structures are allocated
from BootAlloc.
Phase 4e: Evolvable Virtual Mapping (canonical Phase 0.8b)
Map Evolvable .text (RX) and .rodata (RO) pages at
EVOLVABLE_VIRT_BASE (0x2040_0000 on PPC64LE).
**Safety**: this low address is userspace-inaccessible
because PPC64LE Radix MMU uses PID-based isolation:
the kernel runs with PID=0 (separate page tables from
user PID>0). VA 0x2040_0000 in PID=0 context is
unreachable from userspace.
Allocate fresh RW pages for .data+.bss via BootAlloc.
Call evolvable_init() to populate VTABLE_SLOTS[].
**Requires MMU enabled** (Phase 4d complete).
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
After Phase 4e, Evolvable vtable dispatch is permitted.
Phase 5: Kernel Heap
Initialize the buddy allocator with all available physical
memory discovered from the DTB memory regions (Phase 4).
The buddy allocator manages power-of-two blocks (order 0-10,
4 KB-4 MB). See [Section 4.2](04-memory.md#physical-memory-allocator).
Phase 5a: Slab Allocator (canonical Phase 1.2)
Initialize slab caches on top of the buddy allocator.
After this point, Box::new, Arc::new, and typed allocations
are available. Cross-ref: [Section 4.3](04-memory.md#slab-allocator).
Phase 6: Virtual Memory Detail (Radix MMU on POWER9+, HPT on POWER8, runs at Phase 4d)
Detect MMU type from DTB or CPU features (PVR from Phase 1a):
- POWER9+: Use Radix MMU (4-level page tables: PGD->PUD->PMD->PTE,
page sizes: 4 KB, 64 KB, 2 MB, 1 GB)
- **powernv**: Direct mtspr(LPCR, LPCR | HR) to enable Radix.
Set up PTCR (Partition Table Control Register) pointing to the
partition table entry, which contains the process table base.
- **pseries**: Use CAS (Client Architecture Support) negotiation
during early boot (via the `ibm,client-architecture-support`
RTAS call or the H_CAS hcall) to declare Radix MMU support.
The hypervisor acknowledges by updating the DTB
`/chosen/ibm,architecture-vec-5` property. If the hypervisor
does not grant Radix, fall back to HPT.
Once Radix is negotiated, the hypervisor manages LPCR[HR] and
the partition table; the guest only manages its own page tables.
Set up process table (PRTB):
- The process table is indexed by LPID:PID (Logical Partition
ID : Process ID). Each entry contains the page table root
(PGD base) and the process table entry flags.
- On powernv, write PTCR to point to the process table.
- On pseries, the hypervisor manages PTCR; the guest registers
process table entries via H_REGISTER_PROC_TBL hcall.
PID allocation begins AFTER the process table is initialized:
- PID 0 = kernel identity mapping (always present).
- PID assignment for userspace processes starts from 1.
- The PIDR (Process Identification Register) SPR is written
on context switch to select the active process table entry.
- PID isolation: each PID has its own page table root,
providing hardware-enforced address space isolation.
- POWER8: Use HPT (Hash Page Table, base page size 4 KB default
or 64 KB; 16 MB and 16 GB are huge page sizes)
- **powernv**: Direct mtspr(LPCR, LPCR & ~HR) + SDR1 setup.
SDR1 encodes the HPT base address and size (power-of-two).
- **pseries**: HPT is typically pre-configured by the hypervisor
via H_ENTER hcall for HTAB management. The guest uses
H_ENTER, H_REMOVE, H_BULK_REMOVE hcalls to manage PTEs.
Enable MMU via MSR[IR] and MSR[DR] bits (Instruction Relocate,
Data Relocate).
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: Interrupt Controller (XIVE or XICS)
The interrupt controller depends on both the POWER generation and
the platform type.
**powernv (POWER9+): XIVE**
XIVE (eXternal Interrupt Virtualization Engine) is the native
interrupt controller on POWER9 and POWER10 bare-metal systems.
XIVE provides hardware virtualization of interrupt delivery with
per-CPU event queues (no software routing needed).
Initialization:
1. Read XIVE MMIO base addresses from DTB `/interrupt-controller`
node (IC BAR, TM BAR, VC BAR).
2. Map IC (Interrupt Controller) MMIO region.
All XIVE MMIO registers are little-endian on POWER9+ -- no
byte-swapping is required when the kernel runs in LE mode.
(On POWER8, XIVE does not exist; XICS is used instead.)
3. Initialize IC: configure CQ_RST_CTL (reset control), then
set CQ_CFG_PB_GEN (PowerBus generation matching the chip).
4. Configure XIVE sources: each interrupt source has an ESB
(Event State Buffer) page. Write ESB to arm/disarm sources.
5. Initialize per-CPU TIMA (Thread Interrupt Management Area):
map TIMA MMIO for each hardware thread. TIMA provides the
per-thread CAM (Current Activity Monitor) line that matches
incoming interrupts to the target thread.
6. Configure EQ (Event Queue) per CPU: allocate a power-of-two
sized queue in memory, register it with the IC. Interrupts
are delivered by hardware writing the interrupt number into
the EQ; the CPU polls or is notified via a special escalation
interrupt.
7. Set interrupt priorities (8 priority levels, 0 = highest).
8. Enable external interrupts via MSR[EE] = 1.
**powernv (POWER8): XICS**
POWER8 bare-metal uses XICS (eXternal Interrupt Controller
Specification) via OPAL firmware calls:
1. Call `opal_get_xive(irq)` / `opal_set_xive(irq, server, priority)`
to query and configure interrupt routing through OPAL.
2. Each CPU has an ICP (Interrupt Controller Presentation) layer
accessed via MMIO. Read ICP_XIRR to acknowledge; write ICP_EOI.
3. Enable MSR[EE] = 1.
**pseries: XICS (default) with optional XIVE negotiation**
pseries guests default to XICS (emulated by the hypervisor),
regardless of the underlying hardware generation. XIVE is only
available if explicitly negotiated via CAS.
*XICS on pseries* (default path, always available):
1. The hypervisor presents an XICS-compatible interrupt controller
to the guest via the DTB (`compatible` = `"ibm,ppc-xicp"`
for the ICP, `"ibm,ppc-xics"` for the ICS).
2. Interrupt management uses hcalls:
- `H_XIRR` (hcall 0x74): read XIRR (External Interrupt
Request Register) -- acknowledges the highest-priority
pending interrupt. Returns a 32-bit value: bits [0:7] =
CPPR (current priority), bits [8:31] = XISR (interrupt
source number). XISR = 0 means spurious.
- `H_EOI` (hcall 0x64): signal End-of-Interrupt for a given
XIRR value.
- `H_IPI` (hcall 0x6C): send an inter-processor interrupt
to a target CPU server number with a given priority.
- `H_CPPR` (hcall 0x68): set the Current Processor Priority
Register (masks interrupts below this priority).
- `H_IPOLL` (hcall 0x70): poll for pending interrupts
without acknowledging.
3. Configure interrupt routing: the hypervisor manages XICS
routing; the guest uses `H_XIRR`/`H_EOI` for the claim/
complete cycle.
4. Enable MSR[EE] = 1.
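The XIRR claim value splits mechanically into its two fields. A minimal decode helper (name ours; recall that PPC bit numbering is MSB=0, so XIRR[0:7] is the high byte):

```rust
/// Split a 32-bit XIRR value into (CPPR, XISR).
/// XIRR[0:7] (PPC numbering) = CPPR, XIRR[8:31] = XISR.
fn decode_xirr(xirr: u32) -> (u8, u32) {
    let cppr = (xirr >> 24) as u8;   // current priority at claim time
    let xisr = xirr & 0x00FF_FFFF;   // interrupt source number; 0 = spurious
    (cppr, xisr)
}
```

The claim/complete cycle is then: read via `H_XIRR`, dispatch on the XISR (ignoring 0 as spurious), and pass the original 32-bit XIRR value back through `H_EOI`.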
*XIVE on pseries* (negotiated, POWER9+ hypervisors only):
CAS (Client Architecture Support) negotiation can request XIVE
exploitation mode. The kernel includes the XIVE option vector
in the `ibm,client-architecture-support` call during early DTB
processing (Phase 3). If the hypervisor grants XIVE, the DTB
is updated with XIVE-compatible interrupt controller nodes, and
subsequent interrupt management uses XIVE hcalls:
- `H_INT_GET_SOURCE_INFO` (hcall 0x3A8): get ESB page address.
- `H_INT_SET_SOURCE_CONFIG` (hcall 0x3AC): configure source
routing (target CPU, priority, EQ).
- `H_INT_GET_QUEUE_INFO` (hcall 0x3B4): get EQ page address.
- `H_INT_SET_QUEUE_CONFIG` (hcall 0x3B8): configure EQ.
- `H_INT_ESB` (hcall 0x3C8): manipulate ESB (arm/disarm).
- `H_INT_RESET` (hcall 0x3D0): reset all interrupt state.
If CAS does not grant XIVE, the kernel falls back to XICS
(the default path above). This is the expected case on POWER8
hypervisors and older KVM versions.
| Platform | POWER8 | POWER9+ |
|----------|--------|---------|
| powernv | XICS (via OPAL) | XIVE (native MMIO) |
| pseries | XICS (hcalls) | XICS default; XIVE if CAS grants |
Phase 9: Decrementer Timer
Program the decrementer (DEC SPR) for periodic interrupts:
- Load initial value into DEC (32-bit on POWER8, 56-bit on
POWER9+ with large decrementer support)
- Decrementer exception is gated by MSR[EE] (already enabled
in Phase 8)
Timer fires decrementer exception (vector 0x900) -> reload DEC
in handler.
The DEC register counts down at the timebase frequency (typically
512 MHz on POWER9). When it transitions from 0 to negative (MSB
set), a decrementer exception is generated if MSR[EE]=1.
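The reload value written back in the 0x900 handler is simply timebase ticks per scheduling tick. A sketch (function name ours; the timebase frequency is normally read from the DTB `timebase-frequency` property rather than hardcoded):

```rust
/// Ticks to load into DEC for a periodic timer at `tick_hz`.
/// `timebase_hz` comes from the DTB `timebase-frequency` property.
fn dec_reload(timebase_hz: u64, tick_hz: u64) -> u64 {
    timebase_hz / tick_hz
}
```

At the typical 512 MHz timebase, a 100 Hz tick means reloading DEC with 5,120,000 on every decrementer exception.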
**HDEC (Hypervisor Decrementer)** -- powernv only:
HDEC is a separate decrementer available only when MSR[HV]=1.
It generates a hypervisor decrementer exception (vector 0x980)
which is distinct from the guest decrementer exception (0x900).
On powernv, UmkaOS can use either DEC or HDEC:
- DEC: standard timer for scheduling ticks (used by default).
- HDEC: reserved for hypervisor-level timekeeping if UmkaOS
is hosting KVM guests. When not running KVM, HDEC is unused.
On pseries, HDEC is managed by the hypervisor and invisible to
the guest -- the guest only sees DEC.
**Large Decrementer** (POWER9+):
If the DTB property `/chosen/ibm,large-decrementer` is present
or FSCR[LDECRM] is set, the DEC register extends to 56 bits
(max count ~1.4 x 10^8 seconds at 512 MHz), eliminating the
need for frequent reloads. The kernel should enable large
decrementer mode by setting LPCR[LD] = 1:
- **powernv**: Direct mtspr(LPCR, LPCR | LD).
- **pseries**: H_SET_MODE hcall to request large decrementer.
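As a sanity check of the ~1.4 × 10^8 seconds figure above: the maximum large-decrementer duration follows directly from 2^56 ticks at the timebase frequency (helper name ours):

```rust
/// Maximum duration representable by the 56-bit large decrementer,
/// in whole seconds, at the given timebase frequency.
fn large_dec_max_secs(timebase_hz: u64) -> u64 {
    (1u64 << 56) / timebase_hz // 2^56 ticks counted down at timebase rate
}
```

At 512 MHz this works out to 140,737,488 seconds (about 4.5 years), which is why reloads become a non-issue in large decrementer mode.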
Phase 10: Scheduler (canonical Phase 2.3)
Initialize EEVDF scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Phase 10a: Workqueue Framework (canonical Phase 2.7)
Initialize named kernel worker thread pools.
Cross-ref: [Section 3.11](03-concurrency.md#workqueue-deferred-work).
Phase 10b: RCU Init (canonical Phase 2.8)
Initialize RCU infrastructure.
Cross-ref: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths).
Phase 10c: LSM Framework Init (canonical Phase 2.9)
Initialize LSM framework and register compiled-in LSMs.
Cross-ref: [Section 9.8](09-security.md#linux-security-module-framework).
> **SMP bringup -- PPC64LE**: Secondary CPUs on POWER systems are
> brought online via platform-specific firmware mechanisms:
>
> - **pseries (SLOF, PowerVM)**: RTAS `start-cpu` call.
> The `start-cpu` RTAS token is read from the DTB `/rtas` node
> property `"start-cpu"` (commonly `0x28` on SLOF, but the value
> is NOT guaranteed -- always read from DTB). Arguments (big-endian):
> - `args[0]` = server number (CPU `ibm,ppc-interrupt-server#s[0]`
> from the DTB `/cpus/cpu@N` node)
> - `args[1]` = start address (physical address of AP entry point)
> - `args[2]` = r3 value (typically pointer to per-CPU init data)
> All secondary CPUs start in "RTAS stopped" state. The BSP issues
> `start-cpu` for each AP. Each AP enters at the start address in
> big-endian mode (RTAS starts CPUs in BE mode on LE kernels), so
> the AP entry trampoline must immediately switch to LE mode.
> MSR[LE] cannot be written with mtmsrd, so the trampoline opens
> with a byte-swapped instruction sequence (the FIXUP_ENDIAN idiom)
> that executes correctly while in BE mode and ends in an `rfid`
> with SRR1[LE]=1:
> ```
> ap_entry_trampoline:
>     FIXUP_ENDIAN            // byte-swapped: mfmsr, set LE, rfid
>     // Now running little-endian
>     ld   r2, toc_base       // Set up TOC
>     ld   r13, percpu(r3)    // Set r13 = &CpuLocalBlock for this AP
>     b    ap_init
> ```
>
> - **powernv (OPAL/skiboot)**: `OPAL_START_CPU` (token 41).
> Arguments: `r3` = server number, `r4` = start address.
> OPAL starts the CPU at the specified address. On POWER9+, the
> AP enters in LE mode (matching the OPAL firmware endianness),
> so no endianness trampoline is needed. The AP sets `r2` (TOC)
> and `r13` (CpuLocalBlock) from a per-CPU data block whose
> address is encoded in the entry point or a well-known location.
>
> - **Per-AP initialization sequence** (both platforms):
> 1. Set r2 = TOC base, r13 = &CpuLocalBlock for this CPU.
> 2. Set up stack pointer (r1) from per-CPU stack allocation.
> 3. Configure exception vectors (same as BSP Phase 1, but using
> this CPU's SPRG0 for per-CPU pointer on pseries).
> 4. Set LPCR[ILE]=1 (powernv: mtspr; pseries: H_SET_MODE).
> 5. Initialize per-CPU interrupt controller state:
> - XIVE: configure TIMA for this thread, allocate EQ.
> - XICS: set CPPR via H_CPPR (pseries) or ICP MMIO (powernv).
> 6. Enable MSR[EE] (external interrupts).
> 7. Signal BSP that AP is online (write to per-CPU ready flag).
> 8. Enter scheduler idle loop.
>
> Note: spin-table is NOT used on QEMU pseries despite having
> `/cpus/cpu@N/ibm,ppc-interrupt-server#s` properties. Spin-table is
> an embedded PPC / PowerNV early-boot mechanism, not used by either
> SLOF or modern skiboot.
>
> **Full specification deferred to Phase 3** -- the per-CPU stack
> allocation sizing and NUMA-aware placement will be detailed when
> PPC64LE SMP is implemented.
2.12 s390x Boot Sequence¶
s390x targets IBM z/Architecture processors (z13, z14, z15, z16) in 64-bit mode.
QEMU uses the s390-ccw-virtio machine type, which performs IPL (Initial Program
Load) directly into the kernel ELF. There is no DTB or ACPI — hardware topology is
discovered via STSI (Store System Information) and SCLP (Service Call Logical
Processor). The console is SCLP line-mode or virtio-console.
Why s390x: s390x is the architecture wrecking ball — it stress-tests every abstraction boundary. Channel I/O (not PCI/MMIO), PSW-swap interrupts (not vector tables), SIGP for IPI (not APIC/GIC), floating interrupts (not routed), separate user/kernel address spaces (not shared page tables), and SCLP for firmware (not DT/ACPI). If the architecture abstraction survives s390x, it is genuinely generic.
Unique architectural properties tested by s390x:
| Dimension | s390x | All other 7 arches |
|---|---|---|
| Device model | Channel I/O (CCW programs, subchannel addressing) | PCI and/or DT-based MMIO |
| Firmware interface | SCLP / DIAG | DT, ACPI, SBI, OPAL, UEFI |
| Virtualization | SIE (Start Interpretive Execution) | VMX, VHE, or none |
| Endianness (64-bit) | Big-endian | Little-endian (all five 64-bit targets) |
| Interrupt model | PSW-swap to fixed lowcore addresses | Vector table (IDT, VBAR, stvec, IVPR, EIOINTC) |
| IPI mechanism | SIGP instruction | APIC, GIC, PLIC, XIVE, EIOINTC |
| User/kernel memory | Separate address spaces (Primary/Home ASCE) | Shared address space with permission bits |
| In-CPU crypto | CPACF (instruction-stream AES/SHA) | Separate accelerator or ISA extension |
| In-CPU compression | DFLTCC (hardware zlib) | Software only |
Target triple: s390x-unknown-linux-gnu
QEMU invocation:
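(The concrete command line is build-specific; the following is an illustrative development invocation. Only the machine type and `-kernel` usage come from this section — the kernel path and the memory/CPU flags are assumptions.)

```sh
# Illustrative only: kernel path and -m/-smp values are assumed.
qemu-system-s390x \
    -machine s390-ccw-virtio \
    -kernel target/s390x-unknown-linux-gnu/release/umka-kernel \
    -m 1G -smp 2 -nographic
```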
IPL modes:
s390x supports two distinct boot paths:
- **QEMU `-kernel` (development boot)**: The `s390-ccw-virtio`
firmware performs direct IPL from the ELF file. It loads the ELF
segments into memory at their specified addresses, constructs an
initial PSW pointing to `_start`, and transfers control. This is
the primary development and testing boot path. No CCW channel
programs are involved — the firmware handles ELF loading internally.
- **CCW-IPL (real hardware / QEMU with virtio-blk boot)**: On real
z/Architecture hardware, IPL is performed from a DASD (disk) or
SCSI device via channel I/O. The hardware reads the IPL record
(first 24 bytes of the boot device), which contains a CCW (Channel
Command Word) chain that loads a bootstrap loader. The bootstrap
loader (typically `zipl` on Linux) reads the kernel image via
further CCW I/O and constructs the initial PSW. In QEMU, this path
is exercised when booting from a virtio-blk device (`-drive
file=disk.img`) without the `-kernel` flag. UmkaOS provides a
`zipl`-compatible boot record for production deployments on
z/Architecture systems.
Both paths converge at _start with the same register and PSW state.
Entry assembly (arch/s390x/entry.S, GNU as syntax):
1. IPL (Initial Program Load) loads the ELF image into memory.
QEMU s390-ccw-virtio firmware performs IPL from the -kernel ELF
directly. The firmware loads the ELF at its specified load address,
sets up the initial PSW (Program Status Word) to point to _start,
and transitions to 64-bit mode.
- No DTB, no ACPI, no Multiboot — s390x uses STSI/SCLP for discovery.
- Registers: R15 = initial stack (firmware-provided, overwritten in step 2b), R2 = 0.
2. _start:
a. Set addressing mode: 64-bit (PSW bit 31 = 1, bit 32 = 1)
b. Load stack pointer: larl r15, _stack_top
(64 KB stack in .bss._stack, 8-byte aligned)
c. Clear BSS: xc loop from __bss_start to __bss_end
(256-byte blocks using XC instruction)
d. Set up lowcore (first 8 KB of real memory):
- Store new PSW for each interrupt class at fixed lowcore offsets:
External (0x1B0), SVC (0x1C0), Program (0x1D0),
Machine Check (0x1E0), I/O (0x1F0), Restart (0x1A0).
- Each new PSW points to the corresponding trap handler.
- Hardware saves old PSW to the corresponding lowcore offset
(0x120-0x170) on interrupt.
e. Set prefix register: SPX to per-CPU lowcore page
(physical page 0 for BSP, separate page per AP).
f. Call umka_main(0, 0) [magic=0, info=0 — no DTB/MB info]
The linker script (linker-s390x.ld) places .text._start at the kernel load
address (0x10000 on QEMU s390-ccw-virtio). The lowcore (addresses 0x000–0x1FFF)
is reserved by the architecture for interrupt PSW pairs and is NOT part of the
kernel image — it is initialized at runtime in step 2d.
Initialization phases (in umka_main(), sequential):
Canonical Phase Mapping:
| Canonical Phase | Description | Local Implementation |
|---|---|---|
| 0.1 | arch_early_init | Entry assembly (steps 1–2) + Phase 1 (lowcore PSW pairs) |
| 0.15 | early_log_init | Phase 2: SCLP console init + early_log_init() |
| 0.3 | parse_firmware_memmap | Phase 4: SCLP Read SCP Info (init_from_sclp) |
| 0.4 | boot_alloc_init | Phase 4: phys::init() from SCLP memory bitmap |
| 0.5 | reserve_regions | Phase 4: reserve lowcore + kernel image |
| 0.6 | numa_discover_topology | Phase 4a: STSI 15.1.x topology |
| 0.7 | cpulocal_bsp_init | Phase 4b: CpuLocal via prefix-register-relative addressing |
| 0.8a | evolvable_verify | Phase 4c: Evolvable LMS signature verification (physical addresses, no DAT) |
| 0.2 | identity_map | Phase 4d: DAT setup — Region-Third ASCE, CR13, PTLB (via BootAlloc; requires 16 KB aligned allocations) |
| 0.8b | evolvable_map_and_init | Phase 4e: Evolvable virtual mapping at EVOLVABLE_VIRT_BASE + VTABLE_SLOTS[] population |
| 1.1 | buddy allocator | Phase 5: buddy init |
| 1.2 | slab allocator | Phase 5a: slab_init() |
| 2.1 | IRQ domain | Phase 8: PSW-swap interrupt subsystem + IrqDomain |
| 2.2 | capability system | Phase 7: CapSpace init |
| 2.3 | scheduler | Phase 11: scheduler init |
| 2.7 | workqueue infra | Phase 11a: workqueue_init_early() |
| 2.8 | RCU | Phase 11b: rcu_init() |
| 2.9 | LSM framework | Phase 11c: lsm_init() |
| 3.1–3.3 | SMP bringup | SIGP SET_PREFIX + RESTART fan-out |
| 4.4a | bus enumeration | Phase 8a: STSCH subchannel enumeration |
Phase 1: Lowcore Verification
Verify PSW pairs are correctly installed in lowcore.
Test with a supervisor-call (SVC) instruction to confirm the
SVC new-PSW dispatches to the correct handler.
Phase 2: SCLP Console Init
Initialize the SCLP (Service Call Logical Processor) console
for early boot output. SCLP is accessed via the SERVC (Service
Call) instruction:
- Send SCLP Write Event Data (command 0x00760005) for line-mode
console output.
- On virtio-console systems (QEMU), detect virtio device via
subchannel scan and use virtio protocol instead.
This is the s390x equivalent of COM1/PL011/NS16550 early serial.
Phase 3: Facility Detection (replaces CPUID/ID registers)
Execute STFLE (Store Facility List Extended) to populate a
2048-bit facility mask. Key facility bits:
- 2: z/Architecture active (always set on z13+)
- 17: MSA (Message Security Assist — base crypto)
- 76/77: MSA3/MSA4 (AES-128/256, SHA-256/512)
- 156: eToken (Spectre v2 mitigation — eliminates expolines)
- 129: vector facility (128-bit SIMD)
- 135: vector enhancements 1
- 146: MSA8 (AES-GCM AEAD)
Execute STIDP (Store CPU ID) to read machine type and CPU
serial number. Machine type identifies the CPU generation:
z13=0x2964, z14=0x3906, z15=0x8561, z16=0x3931.
Populate CpuFeatureSet from facility bits and machine type.
See [Section 2.16](#extended-state-and-cpu-features) for the full
s390x detection function specification.
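Facility-bit testing against the STFLE result uses big-endian bit numbering: bit 0 is the most significant bit of the first doubleword. A sketch of a `test_facility`-style helper (signature ours, modeled on the conventional approach):

```rust
/// Test facility bit `nr` in the STFLE doubleword array.
/// STFLE numbers bits big-endian: bit 0 = MSB of facilities[0].
fn test_facility(facilities: &[u64], nr: usize) -> bool {
    let dw = nr / 64;              // which doubleword
    let bit = 63 - (nr % 64);      // MSB-first bit position within it
    dw < facilities.len() && (facilities[dw] >> bit) & 1 == 1
}
```

The bounds check makes queries for facilities beyond the stored list return `false`, matching the architectural behavior for unimplemented facility numbers.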
Phase 4: Physical Memory Manager
Discover memory layout via SCLP Read SCP Info. This is the
s390x-specific implementation of parse_firmware_memmap()
(canonical Phase 0.3 in [Section 2.3](#boot-init-cross-arch--kernel-init-phase-reference-cross-architecture)).
SCLP Read SCP Info protocol:
1. Allocate a 4 KB SCLP response buffer (page-aligned, in
the first 2 GB of physical memory — SCLP DMA limitation).
2. Issue SERVC instruction with:
- R1 = command code 0x00020001 (Read SCP Info)
- R2 = physical address of the response buffer
3. Wait for SCLP interrupt (External interrupt subclass).
Check the response header for completion (response code
0x0010 = normal completion, 0x0020 = buffer too small).
4. Parse the SCLP response buffer:
- `rnmax`: maximum memory increment number (u16, offset
depending on SCLP version).
- `rnsize`: memory increment size in MB (u8). Typical
values: 128 MB (LPAR), 256 MB (z/VM).
- `rnsize2`: extended increment size (u32, if rnsize=0).
- Memory bitmap: `rnmax` bits, one per increment. Bit N=1
means increment N (physical address N × rnsize × 1 MB)
is assigned (usable). Bit N=0 means standby or absent.
The init_from_sclp() function iterates the bitmap and
populates BootAlloc.ranges[] with contiguous usable memory
ranges. Each range: base = increment_number × increment_size,
length = count_of_consecutive_assigned_increments × increment_size.
Reserve:
- Lowcore pages (one per CPU, 8 KB each)
- Kernel image (load address to __kernel_end)
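The bitmap walk performed by init_from_sclp() can be sketched as follows. This is a standalone model using a `bool` slice for the increment bitmap; names and types are illustrative:

```rust
/// Fold an SCLP increment bitmap into contiguous (base, length)
/// ranges of usable memory, in bytes. Sketch of init_from_sclp().
fn ranges_from_bitmap(assigned: &[bool], increment_bytes: u64) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < assigned.len() {
        if assigned[i] {
            let start = i;
            // Extend the run across consecutive assigned increments.
            while i < assigned.len() && assigned[i] {
                i += 1;
            }
            out.push((
                start as u64 * increment_bytes,          // base
                (i - start) as u64 * increment_bytes,    // length
            ));
        } else {
            i += 1; // standby or absent increment: skip
        }
    }
    out
}
```

With a 128 MB increment size, a bitmap of `assigned, assigned, standby, assigned` yields two ranges: 256 MB at base 0 and 128 MB at base 384 MB.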
Phase 4a: NUMA Topology Discovery
s390x LPAR topology is discovered via STSI (Store System
Information) instruction:
- STSI function code 15 (Topology Information):
STSI 15.1.2 returns the CPU topology: nesting levels
(drawer → book → socket → core), core type (IFL, CP,
etc.), and dedicated/shared status. The response contains
a TLE (Topology List Entry) hierarchy.
- Memory-to-node affinity: SCLP provides the physical
memory increment assignment. Each increment can be
associated with a proximity domain inferred from the
CPU topology. On z15+ with NUMA support, SCLP Read SCP
Info extended returns explicit NUMA identifiers.
- For z13/z14 without NUMA exposure: treat all memory as
node 0 (single NUMA domain). The fallback is identical
to the canonical Phase 0.6 behavior when no SRAT/DT
is present.
Populate the NUMA topology map. See
[Section 4.2](04-memory.md#physical-memory-allocator--numa-topology-integration)
for integration with the physical memory allocator.
Phase 4b: CpuLocal BSP Init (canonical Phase 0.7)
Initialize CpuLocalBlock for the BSP. On s390x, the per-CPU
data pointer is accessed via a fixed offset relative to the
prefix register (lowcore). The prefix register (set via SPX)
remaps physical page 0 to the per-CPU lowcore page, so a
load from a fixed lowcore address (e.g., offset 0x340 in the
architecture-reserved area) yields the per-CPU CpuLocal
pointer. This is set during BSP init and for each AP after
SET_PREFIX.
See [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--initialization-sequence).
Phase 4c: Evolvable Verification (canonical Phase 0.8a)
Verify the Evolvable image signature using the Nucleus LMS
verifier (~2 KB). This runs in real-address mode (no DAT) —
the Evolvable image is accessed at its physical load address
within the kernel ELF. On s390x without DAT, `__core1_start`
linker symbol resolves to the physical address of the embedded
Evolvable image because the kernel is loaded at its link-time
physical base and DAT=0 means virtual == physical.
On signature failure, panic.
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
**Invariant**: Phases 0.1–4b MUST NOT dispatch through
VTABLE_SLOTS[] or any replaceable policy vtable.
Phase 4d: DAT Setup (canonical Phase 0.2 — s390x-specific ordering)
Enable Dynamic Address Translation (DAT) so that virtual
addresses are available for Evolvable mapping (Phase 4e).
On most architectures, canonical Phase 0.2 (identity_map)
runs before memory discovery. On s390x, DAT setup requires
page table allocations from BootAlloc (which needs Phase 4/4a
memory discovery), so it is deferred here.
**BootAlloc 16 KB alignment requirement**: s390x Region-Third
and Segment Tables are 16 KB (2048 x 8-byte entries), unlike
the 4 KB page tables on other architectures. BootAlloc MUST
support 16 KB aligned allocations for these tables. The
`boot_alloc_aligned(size, align)` interface accepts an
alignment parameter; s390x DAT setup passes `align = 16384`
for Region-Third and Segment Table allocations, and
`align = 2048` for Page Table allocations (256 x 8 = 2 KB).
See Phase 6 below for the full DAT setup procedure.
Phase 4e: Evolvable Mapping and Init (canonical Phase 0.8b)
Map the verified Evolvable image at its virtual address now
that DAT is enabled:
1. Map Evolvable .text (read-execute) and .rodata (read-only)
pages at EVOLVABLE_VIRT_BASE (0x0000_0200_0000_0000 on
s390x).
2. Allocate fresh RW pages for .data+.bss via BootAlloc.
3. Call evolvable_init() to populate VTABLE_SLOTS[].
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
Phase 5: Kernel Heap
Initialize the buddy allocator with all available physical
memory from Phase 4. See [Section 4.2](04-memory.md#physical-memory-allocator).
Phase 5a: Slab Allocator (canonical Phase 1.2)
Initialize slab caches on top of the buddy allocator.
After this point, Box::new, Arc::new, and typed allocations
are available. Cross-ref: [Section 4.3](04-memory.md#slab-allocator).
Phase 6: Virtual Memory (DAT — Dynamic Address Translation)
**Note**: The actual DAT enable sequence runs in Phase 4d
above (before Evolvable mapping). This section documents
the full DAT setup procedure referenced by Phase 4d.
s390x uses DAT with ASCE (Address Space Control Element):
Table hierarchy (Region-Third, 3-level, 4 TB VA space):
- Region-Third Table: 2048 × 8-byte entries = 16 KB.
Each entry points to a Segment Table. Covers 4 TB
(2048 × 2 GB per segment table).
- Segment Table: 2048 × 8-byte entries = 16 KB.
Each entry points to a Page Table. Covers 2 GB
(2048 × 1 MB per page table).
- Page Table: 256 × 8-byte entries = 2 KB.
Each entry maps a 4 KB page frame.
Note on table sizes: unlike x86-64/ARM/RISC-V where page
tables are 4 KB (512 × 8), s390x Region/Segment tables
are 16 KB (2048 × 8) and Page Tables are 2 KB (256 × 8).
Allocations must respect these sizes (16 KB aligned for
Region/Segment, 2 KB aligned for Page Tables).
ASCE format (64-bit value loaded into control registers;
bits numbered with 0 = least significant):
[63:12] Table Origin — physical address of the top-level
table (must be 4 KB aligned, low 12 bits zero)
[11:9] Reserved
[8] Private Space bit (P)
[7] Storage-Alteration-Event bit (S)
[6] Space-Switch-Event bit (X)
[5] Real Space Control (R) — if 1, no translation (real
addresses). Must be 0 for virtual address spaces.
[4] Reserved
[3:2] Designation Type (DT):
11 = Region-First Table
10 = Region-Second Table
01 = Region-Third Table
00 = Segment Table
[1:0] Table Length (TL) — number of 4 KB units in the
top-level table minus 1 (TL = 3 for a full 16 KB table)
UmkaOS uses DT=01 (Region-Third) initially, expanding to
DT=10 (Region-Second) or DT=11 (Region-First) if the
address space must grow beyond 4 TB.
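A minimal constructor for the Region-Third Home Space ASCE, with field positions as defined by the z/Architecture Principles of Operation (DT in bits 3:2, TL in bits 1:0; the helper name is ours):

```rust
/// Build a Region-Third ASCE for a full 16 KB top-level table.
fn region_third_asce(table_origin: u64) -> u64 {
    assert_eq!(table_origin & 0xFFF, 0, "top-level table must be 4 KB aligned");
    const DT_REGION_THIRD: u64 = 0b01 << 2; // designation type, bits 3:2
    const TL_FULL: u64 = 0b11;              // 16 KB table = four 4 KB units, minus 1
    table_origin | DT_REGION_THIRD | TL_FULL
}
```

The resulting value is what Phase 4d loads into CR13 (Home Space) before enabling DAT via LPSWE.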
Control register assignment:
- CR1 = Primary Space ASCE (user address space).
Used when PSW AS = 00 (Primary).
- CR7 = Secondary Space ASCE (inter-space references).
Used by MVCS/MVCP for cross-space copies.
- CR13 = Home Space ASCE (kernel address space).
Used when PSW AS = 11 (Home).
User code runs in Primary space (CR1), kernel code runs
in Home space (CR13). This is fundamentally different from
all other architectures where user and kernel share a single
address space with permission bits — s390x has separate
translation tables per space. MVCOS (Move with Optional
Specifications) copies between spaces.
Kernel VA layout:
- Lowcore: 0x0–0x1FFF (per-CPU, 8 KB). Each CPU has its
own lowcore page set via the prefix register (SPX). The
prefix register remaps physical page 0 to the per-CPU
lowcore page and vice versa, so address 0x0 always refers
to the current CPU's lowcore regardless of which physical
page backs it.
- Kernel .text: starts at 0x10000 (64 KB, after lowcore
and reserved areas).
- Direct-map region: identity-maps all physical memory
starting at physical address 0. The kernel accesses all
physical memory through this identity map.
Identity map setup sequence:
1. Allocate Region-Third Table (16 KB, zero-initialized).
2. For each 2 GB of physical memory:
a. Allocate a Segment Table (16 KB).
b. For each 1 MB segment in that 2 GB:
Allocate a Page Table (2 KB).
Fill 256 entries with 4 KB page frames.
c. (Optimization: use 1 MB large pages via Segment Table
entries with the STE-format Large Page bit, avoiding
per-page Page Table allocation. Requires facility 78.)
3. Construct the Home Space ASCE:
ASCE = table_origin | DT=01 (Region-Third) | TL=3.
4. Load CR13 with the Home Space ASCE.
5. Enable DAT: set PSW bit 5 (T bit) = 1.
This is done by loading a new PSW with DAT enabled
via LPSWE instruction (Load PSW Extended).
6. Execute PTLB (Purge Translation Lookaside Buffer)
after ASCE load and DAT enable.
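The ASCE construction in step 3 is pure bit manipulation. A minimal Rust sketch, assuming the standard z/Architecture ASCE field placement (table origin in the high bits, DT in bits 60-61 and TL in bits 62-63, i.e. the low four bits of the 64-bit value); `make_asce` is an illustrative helper name:

```rust
// Designation-Type (DT) value for a Region-Third ASCE (bits 60-61 in
// z/Architecture big-endian bit numbering, i.e. value bits 3-2).
const DT_REGION_THIRD: u64 = 0b10;

/// Build an ASCE from a 4 KB-aligned table origin.
/// `tl` is the table-length field (bits 62-63): the number of 4 KB
/// blocks in the table minus one, so a full 16 KB region table has tl = 3.
fn make_asce(table_origin: u64, dt: u64, tl: u64) -> u64 {
    debug_assert_eq!(table_origin & 0xFFF, 0, "origin must be 4 KB aligned");
    table_origin | (dt << 2) | tl
}
```

Loading the result into CR13 and enabling DAT (steps 4-5) are privileged operations outside this sketch.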
EVOLVABLE_VIRT_BASE: 0x0000_0200_0000_0000 (2 TB).
Placed well above the direct-mapped physical memory region
in the kernel's 4 TB VA space. See
[Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
Phase 6a: Evolvable Header Byte-Swap Note (s390x-specific)
The EvolvableImageHeader ([Section 2.21](#kernel-image-structure)) uses
little-endian wire format (Le16/Le32/Le64 types). s390x is the
only big-endian 64-bit architecture in UmkaOS. All header field
reads during Phases 4c/4e (Evolvable verification and mapping)
use `.to_ne()` conversion which byte-swaps on big-endian and is
a no-op on little-endian. This is automatic — the Le types
handle it. No special code path is needed; this note documents
the endianness consideration.
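The mechanism can be sketched in a few lines; this is an illustrative reconstruction of the Le-type idea, not the exact UmkaOS definition:

```rust
/// Little-endian wire type in the spirit of the Le16/Le32/Le64 family.
#[repr(transparent)]
#[derive(Clone, Copy)]
struct Le32([u8; 4]);

impl Le32 {
    /// Convert wire bytes to a native-endian value. from_le_bytes
    /// byte-swaps on big-endian targets (s390x) and compiles to a
    /// plain load on little-endian targets.
    fn to_ne(self) -> u32 {
        u32::from_le_bytes(self.0)
    }
}
```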
Phase 7: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 8: Interrupt Subsystem (no interrupt controller — s390x model)
s390x does NOT have an APIC, GIC, PLIC, or any external
interrupt controller. Interrupts are managed through:
- PSW-swap model: on interrupt, hardware saves current PSW
to Old PSW lowcore offset (0x120-0x170), loads New PSW from
the corresponding offset (0x1A0-0x1F0).
- Six interrupt classes, each with its own PSW pair:
External (signals, timers, clock comparator),
SVC (system calls),
Program (faults, traps),
Machine Check (hardware errors),
I/O (channel I/O completion),
Restart (SIGP restart).
- I/O interrupts float to any CPU with the appropriate ISC
(Interrupt Sub-Class, CR6 bits) enabled. Affinity is
managed by masking ISC bits per-CPU, NOT by routing
interrupts to specific CPUs.
- External interrupts (clock comparator, CPU timer, SIGP
external call/emergency signal) are delivered to the
target CPU identified by the SIGP instruction.
Enable external and I/O interrupts by setting PSW bits
(External=bit 7, I/O=bit 6) to 1.
Phase 8a: Subchannel Enumeration
Channel I/O devices are discovered by iterating subchannel
IDs using the STSCH (Store Subchannel) instruction. This is
the s390x equivalent of PCI bus enumeration (canonical Phase
4.4a in [Section 2.3](#boot-init-cross-arch--kernel-init-phase-reference-cross-architecture)).
Subchannel enumeration:
1. Iterate subchannel numbers from 0x0000 to 0xFFFF (the
   architectural maximum). STSCH reads the Subchannel
   Information Block (SCHIB) for each ID.
2. Check the SCHIB validity: SCHIB.PMCW.V (valid bit).
If V=0, the subchannel does not exist — skip.
3. For valid subchannels, record:
- Subchannel type: I/O (0), CHSC (1), or Message (2).
- Device number (from SCHIB.PMCW.DEV).
- Channel path IDs (SCHIB.PMCW.CHPID[0..7]).
4. Enable the subchannel: MSCH (Modify Subchannel) with
SCHIB.PMCW.E = 1 (enabled). Without enabling, no I/O
operations can be started on the subchannel.
5. For each enabled I/O subchannel, issue SenseID (CCW
command 0xE4) to discover the device type (CU type,
device type, model). This identifies the device
(e.g., 3390 = DASD, 3174 = terminal controller,
virtio = QEMU virtio device).
This must run after DAT setup (Phase 6) so the SCHIB
structures (which reside in kernel memory) are accessible
through translated addresses. On QEMU, virtio devices
appear as subchannels with device type matching virtio
device IDs.
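The operand passed to STSCH in step 1 is a 32-bit subsystem-identification word, not the bare subchannel number. A sketch of its construction, assuming the standard format (16-bit subchannel number in the low half, with the architecturally required "one" bit at position 16); the enumeration loop then iterates this over 0x0000..=0xFFFF:

```rust
/// Subsystem-identification word for STSCH/MSCH: the 16-bit subchannel
/// number plus the required "one" bit above it.
fn subsystem_id(schid: u16) -> u32 {
    0x0001_0000 | schid as u32
}
```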
Phase 9: CPU Timer
s390x has two timer mechanisms:
- Clock Comparator: 64-bit TOD value. When the TOD clock
reaches the comparator value, an External interrupt fires.
Set via SCKC (Set Clock Comparator) instruction.
- CPU Timer: 64-bit signed decrementer. Counts down at
TOD clock rate. Fires External interrupt when it goes
negative. Set via SPT (Set CPU Timer) instruction.
UmkaOS uses the CPU Timer for scheduler ticks (equivalent
to APIC timer / decrementer on other architectures).
TOD clock frequency is always 2^12 = 4096 ticks per
microsecond (architectural constant, not discoverable).
Phase 10: SVC (Supervisor Call) Syscall Setup
s390x syscall entry differs from all other architectures:
- SVC instruction with 8-bit immediate (0-255): syscall
number is in the instruction itself.
- SVC 0 with syscall number in R1: for syscall numbers > 255.
- Hardware saves old PSW to lowcore 0x140, loads new PSW
from lowcore 0x1C0.
- Arguments: R2-R7 (up to 6 arguments).
- Return value: R2 (negative = -errno).
The SVC handler must check both the SVC immediate field
(in the instruction at the old PSW address) and R1 to
determine the syscall number.
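The number-selection rule reduces to a small pure function; `svc_syscall_number` is an illustrative helper name:

```rust
/// Resolve the syscall number per the SVC convention: a non-zero 8-bit
/// immediate is the number itself; SVC 0 means the number is in R1
/// (used for syscall numbers > 255).
fn svc_syscall_number(svc_immediate: u8, r1: u64) -> u64 {
    if svc_immediate != 0 {
        svc_immediate as u64
    } else {
        r1
    }
}
```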
Phase 11: Scheduler (canonical Phase 2.3)
Initialize EEVDF scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via CPU timer tick callback.
Phase 11a: Workqueue Framework (canonical Phase 2.7)
Initialize named kernel worker thread pools.
Cross-ref: [Section 3.11](03-concurrency.md#workqueue-deferred-work).
Phase 11b: RCU Init (canonical Phase 2.8)
Initialize RCU infrastructure (grace period tracking,
callback queues, per-CPU state).
Cross-ref: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths).
Phase 11c: LSM Framework Init (canonical Phase 2.9)
Initialize LSM framework and register compiled-in LSMs.
Cross-ref: [Section 9.8](09-security.md#linux-security-module-framework).
SMP bringup — s390x via SIGP:
Secondary CPUs are brought online via the SIGP (Signal Processor) instruction. SIGP takes three operands: the target CPU address (logical CPU number), the order code (operation to perform), and an optional parameter.
SIGP order codes used during bringup:
| Order | Code | Function |
|---|---|---|
| SENSE | 0x01 | Query CPU status. Returns status bits in R1. |
| START | 0x04 | Start the CPU (deprecated, use RESTART). |
| STOP | 0x05 | Stop the CPU. CPU enters stopped state. |
| RESTART | 0x06 | Restart the CPU. Loads Restart new-PSW from lowcore 0x1A0. |
| SET_PREFIX | 0x0D | Set the prefix register of the target CPU. Parameter = new prefix value (physical address of per-CPU lowcore page, must be 8 KB aligned). |
| SENSE_RUNNING | 0x15 | Check if CPU is in running state (faster than SENSE). |
SIGP condition codes (CC):
| CC | Meaning | Action |
|---|---|---|
| 0 | Order accepted | Success. |
| 1 | Status stored | Target CPU has pending conditions; status stored in R1. Retry after handling. |
| 2 | Busy | Target CPU is processing another SIGP. Retry with short delay. |
| 3 | Not operational | Target CPU does not exist or is permanently unavailable. Do not retry. |
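The CC table maps directly to a retry policy; the enum and names here are an illustrative sketch, not the UmkaOS API:

```rust
#[derive(Debug, PartialEq)]
enum SigpAction {
    Done,             // CC=0: order accepted
    RetryAfterStatus, // CC=1: handle stored status, then retry
    RetryShortDelay,  // CC=2: target busy with another SIGP
    GiveUp,           // CC=3: CPU not operational, do not retry
}

/// Map a SIGP condition code to the action from the table above.
fn sigp_cc_action(cc: u8) -> SigpAction {
    match cc {
        0 => SigpAction::Done,
        1 => SigpAction::RetryAfterStatus,
        2 => SigpAction::RetryShortDelay,
        _ => SigpAction::GiveUp,
    }
}
```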
BSP bringup sequence (for each AP):
1. Allocate a per-AP lowcore page (8 KB, 8 KB-aligned) from the boot allocator. On allocation failure, mark the CPU permanently offline and log "CPU {cpu_addr}: lowcore allocation failed, CPU disabled".
2. Initialize the AP's lowcore page:
   - Write new PSW pairs for all six interrupt classes at the same offsets as the BSP lowcore (External 0x1B0, SVC 0x1C0, Program 0x1D0, Machine Check 0x1E0, I/O 0x1F0, Restart 0x1A0).
   - Write the AP entry code address into the Restart new-PSW at offset 0x1A0. The PSW must have: DAT=0 (real mode initially), 64-bit mode (bit 31=1, bit 32=1), problem state=0 (supervisor).
3. Allocate a per-AP kernel stack (16 KB) and store the stack top address in a known lowcore offset or a shared per-CPU data structure.
4. Issue `SIGP cpu_addr, SET_PREFIX, prefix_page`: set the target CPU's prefix register to point to the newly allocated lowcore page. CC=0 confirms success. On CC=3 (not operational), skip this CPU.
5. Issue `SIGP cpu_addr, RESTART`: the target CPU loads its Restart new-PSW (from offset 0x1A0 of its own lowcore — now the page set in step 4). Execution begins at the AP entry code address embedded in the PSW.
6. Verify the CPU started: poll via `SIGP cpu_addr, SENSE_RUNNING` (or `SIGP cpu_addr, SENSE` and check status bits). Wait up to 1 second per CPU. If the CPU does not enter running state within the timeout, mark it permanently offline.
Secondary CPU initialization sequence (AP entry code):
Each AP begins execution at the Restart new-PSW address:
- Load stack pointer from per-CPU data.
- Run facility detection (Phase 3): execute STFLE to populate per-CPU facility mask. This may differ between CPUs in mixed hardware configurations.
- Load kernel page tables (Phase 6): load CR13 with the BSP's Home Space ASCE. Enable DAT (set PSW bit 5=1 via LPSWE). Execute PTLB. The AP uses the BSP's page tables — it does NOT build new tables.
- Configure per-CPU timers (Phase 9): set CPU Timer via SPT for scheduler tick.
- Configure ISC (Interrupt Sub-Class) masking in CR6 for I/O interrupt affinity.
- Initialize per-CPU CpuLocal, slab magazines, and join the scheduler.
APs must NOT re-run: SCLP memory discovery (Phase 4), buddy_init() (Phase 5). These are BSP-only, run-once operations.
vDSO on s390x:
The s390x vDSO uses the STCK (Store Clock) or STCKE (Store Clock Extended)
instruction for clock_gettime. As on the other architectures, the counter
instruction is unprivileged: STCK is available in problem state (user mode)
on all z/Architecture systems. The vDSO reads STCK, subtracts the TOD epoch
(1900-01-01) to derive Unix time, and applies the VVAR page's calibration data.
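The epoch subtraction can be sketched as follows. The offset constant is the number of seconds between 1900-01-01 and 1970-01-01 expressed in TOD ticks; `tod_to_unix_ns` is an illustrative helper and the VVAR calibration step is omitted:

```rust
/// Seconds from the TOD epoch (1900-01-01) to the Unix epoch
/// (1970-01-01): 70 years including 17 leap days.
const TOD_UNIX_EPOCH_SECS: u64 = 2_208_988_800;

/// The same offset in TOD ticks (4096 ticks per microsecond).
const TOD_UNIX_EPOCH_TICKS: u64 = TOD_UNIX_EPOCH_SECS * 1_000_000 * 4096;

/// Convert a raw STCK value to Unix nanoseconds at microsecond
/// precision. tod >> 12 yields microseconds because the TOD clock
/// runs at 2^12 ticks per microsecond.
fn tod_to_unix_ns(tod: u64) -> u64 {
    ((tod - TOD_UNIX_EPOCH_TICKS) >> 12) * 1000
}
```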
| Architecture | Counter instruction | Notes |
|---|---|---|
| s390x | STCK / STCKE | TOD clock, epoch 1900-01-01. Always available in problem state. Frequency = 2^12 ticks/µs (architectural). |
Per-architecture hardware abstraction equivalents:
| Concept | s390x |
|---|---|
| Privilege separation | PSW problem-state bit (bit 15): 0=supervisor, 1=problem (user) |
| Exception dispatch | PSW-swap: old PSW saved to lowcore, new PSW loaded from lowcore. Six interrupt classes. |
| Interrupt controller | None (architectural). ISC masking in CR6 for I/O. SIGP for external. |
| Timer | CPU Timer (SPT/STPT) + Clock Comparator (SCKC/STCKC). TOD clock = 4096 ticks/µs. |
| Syscall mechanism | SVC instruction (8-bit immediate or R1 for >255). |
| Page table format | DAT: 3-5 level (Region-First through Page), 4 KB pages, 1 MB large pages. |
| Fast isolation | Storage Keys (4-bit per page, too coarse). Tier 1 → Tier 0. |
| TLB ID | ASCE (Address Space Control Element) in CR1/CR7/CR13. No explicit ASID — ASCE change implies TLB context. |
2.13 LoongArch64 Boot Sequence¶
LoongArch64 targets Loongson processors (3A5000, 3A6000) in 64-bit mode. QEMU uses
the virt machine type with EDK2 UEFI firmware. The firmware passes control to the
kernel via a UEFI-style entry or a direct -kernel load with DTB. The console is
a NS16550-compatible UART.
Why LoongArch64: LoongArch64 is the control test — a clean, modern, well-documented 64-bit RISC architecture. Adding it should be straightforward if the abstraction layer is correctly generic. If it isn't easy, something is over-fitted to the existing seven. LoongArch also has genuinely unique properties: a hybrid hardware/software TLB model, CSR-based system registers (not memory-mapped, not MSR-style), and a stable ISA manual.
Unique architectural properties tested by LoongArch64:
| Dimension | LoongArch64 | Closest comparison |
|---|---|---|
| TLB management | Hybrid: software TLBFILL handler (3A5000) + optional hardware PTW (3A6000) | MIPS-heritage, unlike pure hardware (x86/ARM) or pure software (old RISC-V) |
| System registers | CSR instructions (CSRRD/CSRWR/CSRXCHG), 14-bit index space | Not MSR (x86), not coprocessor (ARM CP15), not memory-mapped |
| Interrupt controller | EIOINTC (Extended I/O Interrupt Controller) | Unique — not APIC, GIC, PLIC, XIVE, or channel I/O |
| ISA design | Clean RISC, ratified ISA manual, no legacy baggage | Similar to RISC-V in cleanliness but with a different instruction encoding |
| China ecosystem | Growing Linux mainline support (since 5.19), GCC/LLVM/Rust backends | Strategic platform for non-x86/ARM market |
Target triple: loongarch64-unknown-linux-gnu
QEMU invocation:
qemu-system-loongarch64 -M virt -m 512M -nographic \
-kernel umka-kernel.elf -dtb loongarch64-virt.dtb
Note: QEMU's LoongArch virt machine can provide a DTB or UEFI boot. For Phase 1,
UmkaOS uses DTB-based boot (simpler, consistent with RISC-V and ARM patterns). UEFI
boot is Phase 3.
Entry assembly (arch/loongarch64/entry.S, GNU as syntax):
1. Firmware loads the kernel ELF and jumps to _start.
Register convention depends on boot method:
- **QEMU -kernel with DTB (development boot)**:
a0 = boot CPU ID, a1 = DTB address (passed via -dtb flag)
- **UEFI boot (production, Linux/LoongArch standard)**:
a0 = argc (efi_boot flag), a1 = argv (cmdline pointer),
a2 = envp (struct boot_params / systemtable pointer)
UmkaOS Phase 1 uses DTB-based boot (simpler, consistent with ARM/RISC-V
patterns). Production UEFI boot is Phase 3. The entry code detects the
boot method by checking if a1 points to a valid DTB (magic 0xD00DFEED)
or a UEFI system table.
   Initial CPU state at entry:
   - PLV0 (Privilege Level 0 — kernel mode)
   - MMU off (direct address translation via CSR.DMW — Direct Mapping Window)
2. _start:
a. Enable FPU: CSRWR CSR.EUEN, set bit 0 (FPE = 1).
LoongArch requires CSR.EUEN.FPE = 1 to execute any floating-point
instruction. Without this, any FP instruction (which Rust may
generate for any code, including integer-only code via auto-
vectorization or ABI requirements) causes an FPE exception. This
must be done before calling any Rust code.
b. Set up Direct Mapping Window (CSR.DMW0/DMW1):
DMW0: 0x8000000000000000 → physical 0, uncached (CA=0, MAT=0)
DMW1: 0x9000000000000000 → physical 0, cached (CA=1, MAT=1)
This provides identity-mapped access to all physical memory
before page tables are set up (similar to x86 identity map).
b2. The kernel ELF is linked at 0x9000_0000_0020_0000 (DMW1, cached).
After DMW setup in step 2b, the kernel is already executing in the
cached DMW1 window — no trampoline is needed. Linux uses the same
approach: the kernel runs in DMW1 (0x9000..., cached) from entry
(verified: `arch/loongarch/kernel/head.S` in torvalds/linux master).
All subsequent boot code runs via cached DMW1.
c. Load stack pointer: la.abs $sp, _stack_top
(64 KB stack in .bss._stack, 16-byte aligned)
d. Clear BSS: st.d zero loop from __bss_start to __bss_end
e. Set up exception vectors:
- Write trap handler address to CSR.EENTRY (Exception Entry Base)
- Write TLB refill handler to CSR.TLBRENTRY
- Configure CSR.ECFG (Exception Configuration): enable desired
interrupt lines (HWI0-HWI7, TI=timer, IPI).
f. Arguments already in correct registers:
a0 = boot CPU ID (passed as multiboot_magic parameter)
a1 = DTB address (passed as multiboot_info parameter)
g. Call: bl umka_main
h. Halt loop: idle 0 (wait-for-interrupt) if umka_main returns
The linker script (linker-loongarch64.ld) places .text._start first at the
kernel load address (0x9000000000200000 on QEMU virt — in the cached DMW1
window). Because the entry point already lies in the cached window (DMW1,
CA=1, MAT=1), no trampoline or linker script change is needed: all code
runs at full cached speed from the first instruction.
Initialization phases (in umka_main(), sequential):
Canonical Phase Mapping:
| Canonical Phase | Description | Local Implementation |
|---|---|---|
| 0.1 | arch_early_init | Entry assembly (steps 1–2) + Phase 1 (CSR.EENTRY) |
| 0.1a | cpu_features_detect | Phase 4: CPUCFG word 0-19 read |
| 0.14 | early_serial_init | Phase 0.14: NS16550 UART init (fallback address) |
| 0.15 | early_log_init | Phase 0.15: early_log_init() after UART |
| 0.3 | parse_firmware_memmap | Phase 3: DTB /memory parse |
| 0.4 | boot_alloc_init | Phase 5: phys::init() from DTB regions |
| 0.5 | reserve_regions | Phase 5: reserve kernel image + DTB |
| 0.6 | numa_discover_topology | Phase 5a: DTB /memory nodes |
| 0.7 | cpulocal_bsp_init | Phase 5b: CpuLocal via percpu base register |
| 0.8a | evolvable_verify | Phase 5c: Evolvable signature verification (physical addresses via DMW, no page tables required) |
| 0.2 | identity_map | Phase 5d: TLB + page tables, DA->PG transition |
| 0.8b | evolvable_map_and_init | Phase 5e: Evolvable virtual mapping at EVOLVABLE_VIRT_BASE + VTABLE_SLOTS[] population |
| 1.1 | buddy allocator | Phase 6: buddy init |
| 1.2 | slab allocator | Phase 6a: slab_init() |
| 2.1 | IRQ domain | Phase 9: EIOINTC + PCH-PIC + IrqDomain setup |
| 2.2 | capability system | Phase 8: CapSpace init |
| 2.3 | scheduler | Phase 12: scheduler init |
| 2.7 | workqueue infra | Phase 12a: workqueue_init_early() |
| 2.8 | RCU | Phase 12b: rcu_init() |
| 2.9 | LSM framework | Phase 12c: lsm_init() |
| 3.1–3.3 | SMP bringup | IOCSR mailbox + IPI fan-out |
Phase 0.14: Early UART Init (NS16550)
Initialize NS16550-compatible UART for early boot output
before DTB parsing is complete. The UART base address depends
on the platform:
- QEMU virt: address from DTB (typically a standard offset
within the MMIO region; exact address is machine-dependent).
- Loongson SoC (3A5000/3A6000): 0x1FE001E0 (fixed MMIO).
For the very first output before DTB is parsed, use the
Loongson fixed address as a fallback if QEMU detection fails
(read CPUCFG word 0 PRID to detect platform).
Init sequence:
1. Write LCR = 0x80 (DLAB=1, access divisor registers)
2. Write DLL = clock_freq / (16 × 115200), DLM = high byte
(QEMU ignores baud rate, but real hardware requires this.
Loongson SoC UART clock is typically 100 MHz.)
3. Write LCR = 0x03 (DLAB=0, 8 data bits, no parity, 1 stop)
4. Write FCR = 0x07 (enable + reset TX/RX FIFOs)
5. Write IER = 0x00 (no interrupts — polled mode during boot)
6. Write MCR = 0x03 (DTR + RTS asserted)
After DTB parse (Phase 3), re-read the actual UART base from
the DTB /soc/serial@... node and reconfigure if the fallback
address was different.
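The divisor computation in step 2 can be sketched as follows; for the Loongson SoC's 100 MHz UART clock at 115200 baud it yields 54 (integer division; the resulting ~0.5% rate error is within NS16550 tolerance):

```rust
/// NS16550 divisor latch bytes for a given UART input clock and baud
/// rate: divisor = clock / (16 * baud), split into DLL (low) and DLM
/// (high) as programmed in step 2.
fn uart_divisor(clock_hz: u32, baud: u32) -> (u8, u8) {
    let div = clock_hz / (16 * baud);
    ((div & 0xFF) as u8, (div >> 8) as u8) // (DLL, DLM)
}
```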
Phase 1: Exception Vectors (CSR.EENTRY / CSR.TLBRENTRY)
Write exception handler base to CSR.EENTRY. LoongArch
dispatches exceptions by adding (ecode × spacing) to EENTRY,
where spacing is configured via CSR.ECFG.VS (0=no vectoring,
all to EENTRY; non-zero=vectored). UmkaOS uses VS=0 (single
entry point, dispatch by reading CSR.ESTAT.Ecode).
Write TLB refill handler to CSR.TLBRENTRY (separate entry
from EENTRY — TLB refill has its own dedicated vector).
Phase 2: BSS Verification
Verify BSS is zeroed (entry.S clears BSS in assembly).
Phase 3: DTB Parse
Parse the DTB passed in a1 (see [Section 2.8](#device-tree-and-platform-discovery)).
Extract /memory regions, /chosen bootargs, EIOINTC base
address, UART base, and CPU topology. QEMU LoongArch virt
uses standard DTB layout.
Phase 4: CPUCFG Feature Detection
Read CPUCFG (CPU Configuration) words 0-19 via the CPUCFG
instruction:
- Word 0: PRID (Processor ID) — company[31:16], series[15:8],
revision[7:0]. 0x14C0xx = 3A5000, 0x14D0xx = 3A6000.
- Word 1: Architecture features (ARCH bits)
- Word 2: ISA features (FP, LSX, LASX, COMPLEX, CRYPTO,
LBT, LAMO, LAM_BH, LAMCAS)
- Word 3: Cache info (I-cache, D-cache line size)
- Word 4: TLB info (STLB entries, MTLB entries)
Note: hardware PTW support is at word 0x2 bit 24 (CPUCFG2_PTW),
NOT word 4. Word 4 contains TLB geometry (entry counts).
- Word 16-19: Performance info
Populate CpuFeatureSet from CPUCFG. See
[Section 2.16](#extended-state-and-cpu-features) for the full
LoongArch detection function specification.
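The PRID split in word 0 can be sketched directly from the field layout above; the example value in the test is illustrative, chosen only to exercise the bit positions:

```rust
/// Split CPUCFG word 0 (PRID) into (company, series, revision) per the
/// layout above: company[31:16], series[15:8], revision[7:0].
fn prid_fields(word0: u32) -> (u16, u8, u8) {
    (
        (word0 >> 16) as u16,        // company
        ((word0 >> 8) & 0xFF) as u8, // series
        (word0 & 0xFF) as u8,        // revision
    )
}
```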
Phase 5: Physical Memory Manager
Pass DTB memory regions to phys::init(). Reserve:
- Kernel image: load address to __kernel_end
- DTB: preserve the DTB blob
Unlike x86, no legacy BIOS regions.
Phase 5a: NUMA Topology Discovery (canonical Phase 0.6)
Parse DTB /memory nodes for topology. LoongArch NUMA support
is available on multi-socket Loongson systems. DTB /memory
nodes with `numa-node-id` properties provide the topology.
On single-socket configurations (QEMU virt), set single
node 0 if no NUMA information is found.
Cross-ref: [Section 4.11](04-memory.md#numa-topology-and-policy).
Phase 5b: CpuLocal BSP Init (canonical Phase 0.7)
Initialize CpuLocalBlock for the BSP. LoongArch uses a
dedicated per-CPU base register $r21 ($u0, reserved by the
kernel ABI — distinct from $tp/$r2 which is the userspace
TLS thread pointer) to hold the CpuLocal pointer. CSR
PERCPU_BASE (KSave3) stores a copy for trap entry
save/restore. The $r21 value is set per-CPU during init
and on AP bringup.
See [Section 3.2](03-concurrency.md#cpulocal-register-based-per-cpu-fast-path--initialization-sequence).
Phase 5c: Evolvable Signature Verification (canonical Phase 0.8a)
Verify the Evolvable image signature using physical addresses
accessible via DMW0/DMW1 (no page tables required). Nucleus
LMS verifier (~2 KB) checks ML-DSA-65 + Ed25519 hybrid
signature against embedded public key. On failure, panic.
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
**Invariant**: Phases 0.1–5c MUST NOT dispatch through
VTABLE_SLOTS[] or any replaceable policy vtable.
Phase 5d: Virtual Memory — DA->PG Transition (canonical Phase 0.2)
See Phase 7 below for the full TLB + page table setup and
the DA->PG (Direct Address to Page-table) transition. Page
table structures are allocated from BootAlloc. After this
phase, per-page permissions (NX, NR, W) are enforced.
Phase 5e: Evolvable Virtual Mapping (canonical Phase 0.8b)
Map Evolvable at EVOLVABLE_VIRT_BASE
(0x9000_0000_4000_0000 on LoongArch64). During Phase 5c,
Evolvable was accessible via DMW1 (cached, CA=1, MAT=1)
without page-level permissions. After Phase 5d, page tables
enforce proper permissions: .text (RX), .rodata (RO),
.data+.bss (RW). Allocate fresh RW pages for .data+.bss
via BootAlloc.
Call evolvable_init() to populate VTABLE_SLOTS[].
**Requires page table MMU** (Phase 5d complete).
See [Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
After Phase 5e, Evolvable vtable dispatch is permitted.
Phase 6: Kernel Heap
Initialize the buddy allocator with all available physical
memory from Phase 5. See [Section 4.2](04-memory.md#physical-memory-allocator).
Phase 6a: Slab Allocator (canonical Phase 1.2)
Initialize slab caches on top of the buddy allocator.
After this point, Box::new, Arc::new, and typed allocations
are available. Cross-ref: [Section 4.3](04-memory.md#slab-allocator).
Phase 7: Virtual Memory Detail (TLB + optional Page Table Walker, runs at Phase 5d)
LoongArch uses a TLB-based virtual memory model:
- Software TLB refill (3A5000): CSR.TLBRENTRY handler reads
the page table entry from the page table root (CSR.PGDH
for kernel, CSR.PGDL for user) and executes TLBFILL to
install it. Four-level page table: PGD→PUD→PMD→PTE.
TLB refill handler pseudocode (CSR.TLBRENTRY entry point):
```
tlb_refill:
csrrd t0, CSR_TLBRBADV // faulting virtual address
csrrd t1, CSR_PGD // auto-selects PGDH or PGDL
// based on bit 47 of BADV
// PGD index: VA[47:39] (9 bits)
srli.d t2, t0, 39
andi t2, t2, 0x1FF
slli.d t2, t2, 3 // * 8 (pointer size)
ldx.d t1, t1, t2 // PUD base
// PUD index: VA[38:30] (9 bits)
srli.d t2, t0, 30
andi t2, t2, 0x1FF
slli.d t2, t2, 3
ldx.d t1, t1, t2 // PMD base
// PMD index: VA[29:21] (9 bits)
srli.d t2, t0, 21
andi t2, t2, 0x1FF
slli.d t2, t2, 3
ldx.d t1, t1, t2 // PTE base
// PTE index: VA[20:12] (9 bits), load even/odd pair
srli.d t2, t0, 12
andi t2, t2, 0x1FE // clear bit 0 for pair
slli.d t2, t2, 3
ldx.d t3, t1, t2 // even PTE (TLBELO0)
addi.d t2, t2, 8
ldx.d t4, t1, t2 // odd PTE (TLBELO1)
csrwr t3, CSR_TLBRELO0
csrwr t4, CSR_TLBRELO1
tlbfill // install in TLB
ertn // return from exception
```
- Hardware PTW (3A6000+): If CPUCFG word 0x2 bit 24 (HPTW,
aka CPUCFG2_PTW) is set, configure CSR.PWCL/PWCH (Page Walk Controller) with
the page table layout parameters (directory widths, shifts).
Hardware walks the page table on TLB miss without trapping.
- Page sizes: 4 KB (default), 16 KB, 64 KB (CSR.STLBPS).
- Identity-map all physical RAM via TLBFILL or hardware PTW.
- After page table setup, switch from DMW to mapped addressing
for kernel virtual addresses.
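The index arithmetic used by the refill handler can be mirrored in the Rust page-table construction code. A sketch assuming the same 48-bit VA split (9-bit indices at shifts 39/30/21/12 for PGD/PUD/PMD/PTE, 4 KB pages); `pt_indices` is an illustrative helper name:

```rust
/// Extract (pgd, pud, pmd, pte) table indices from a 48-bit virtual
/// address, matching the shift/mask sequence in the refill handler.
fn pt_indices(va: u64) -> (usize, usize, usize, usize) {
    let idx = |shift: u32| ((va >> shift) & 0x1FF) as usize;
    (idx(39), idx(30), idx(21), idx(12))
}
```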
Page table entry format (64-bit PTE):
[63:48] Reserved (must be 0)
[47:12] PPN — Physical Page Number (36 bits, 48-bit phys addr)
[11] RPLV — Restricted PLV (if set, the page is accessible only
at exactly the entry's PLV, not at any PLV ≤ the entry's PLV)
[10] NX — No Execute (1 = execute prohibited)
[9] NR — No Read (1 = read prohibited; LoongArch supports
execute-only pages via NR=1, NX=0)
[8] W — Writable (1 = write permitted)
[7] P — Physical/Present (1 = page present in memory.
Software-defined; hardware ignores if V=0)
[6] G — Global (not flushed on ASID change)
[5:4] MAT — Memory Access Type:
0 = SUC (Strongly-ordered UnCached, for MMIO)
1 = CC (Coherent Cached, for normal memory)
2 = WUC (Weakly-ordered UnCached)
3 = Reserved
Kernel memory: MAT=1 (CC).
MMIO regions: MAT=0 (SUC).
[3:2] PLV — Privilege Level (2 bits):
0 = PLV0 (kernel only)
3 = PLV3 (user accessible)
[1] D — Dirty (page has been written; set by hardware or
software depending on TLB refill implementation)
[0] V — Valid (1 = entry is valid, TLB miss if 0)
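The field layout above translates directly into bit constants; `kernel_data_pte` is an illustrative helper that builds a cached, global, kernel-only, non-executable RW mapping:

```rust
// Bit positions from the PTE format above.
const PTE_V: u64 = 1 << 0;       // Valid
const PTE_D: u64 = 1 << 1;       // Dirty
const PTE_G: u64 = 1 << 6;       // Global
const PTE_P: u64 = 1 << 7;       // Present (software-defined)
const PTE_W: u64 = 1 << 8;       // Writable
const PTE_NX: u64 = 1 << 10;     // No Execute
const PTE_MAT_CC: u64 = 1 << 4;  // MAT = 1: Coherent Cached

/// Kernel RW data PTE: PLV field left 0 (PLV0, kernel only),
/// PPN taken from bits [47:12] of the physical address.
fn kernel_data_pte(pa: u64) -> u64 {
    (pa & 0x0000_FFFF_FFFF_F000)
        | PTE_NX | PTE_W | PTE_P | PTE_G | PTE_MAT_CC | PTE_D | PTE_V
}
```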
CSR.CRMD DA/PG mode transition:
During early boot, the CPU operates in Direct Address (DA)
mode: CSR.CRMD bits DA=1, PG=0. Physical addresses are
accessed directly through DMW0/DMW1 windows without TLB
translation. After page tables are built, the kernel must
transition to Paged (PG) mode:
1. Build page tables (PGD→PUD→PMD→PTE) for kernel VA space.
2. Load page table roots:
CSR.PGDH = kernel page table root physical address
(for addresses with bit 47 = 1, kernel half).
CSR.PGDL = user page table root physical address
(for addresses with bit 47 = 0, user half).
During boot, set to 0 (no user mappings yet).
3. If hardware PTW is available (CPUCFG2_PTW):
Configure CSR.PWCL (Page Walk Controller Low):
Dir1_Base = 12, Dir1_Width = 9 (PTE level)
Dir2_Base = 21, Dir2_Width = 9 (PMD level)
Configure CSR.PWCH (Page Walk Controller High):
Dir3_Base = 30, Dir3_Width = 9 (PUD level)
Dir4_Base = 39, Dir4_Width = 9 (PGD level)
PTEWidth = 3 (8-byte PTEs, encoded as log2(8))
4. Transition: modify CSR.CRMD in a single CSRWR instruction:
Clear DA bit (bit 3), set PG bit (bit 4).
The instruction itself must execute from a DMW-mapped
address that remains valid after the transition (DMW
windows remain active in PG mode).
5. Execute INVTLB 0, $r0, $r0 (invalidate all TLB entries)
to ensure no stale DA-mode translations persist.
EVOLVABLE_VIRT_BASE: 0x9000_0000_4000_0000
(1 GB into DMW1 cached window, 2 MB aligned). See
[Section 2.21](#kernel-image-structure--phase-08-evolvable-boot-loading-protocol).
Phase 8: Capability System
Create CapSpace, test create/check/attenuate operations.
Phase 9: PCH-PIC and EIOINTC Initialization
On QEMU LoongArch virt (and Loongson SoC boards), external
device interrupts pass through two controllers in series:
PCH-PIC → EIOINTC → CPU.
**PCH-PIC (Platform Controller Hub PIC):**
The PCH-PIC is an I/O interrupt controller that collects
device interrupts (UART, virtio, etc.) and routes them to
the EIOINTC. PCH-PIC base address is read from the DTB
(compatible = "loongson,pch-pic-1.0").
- Enable interrupt sources: write enable bits for each
device interrupt line.
- Set routing: each PCH-PIC interrupt maps to an EIOINTC
vector (typically 1:1 for the first 64 interrupts).
- Set edge/level trigger mode per source.
**EIOINTC (Extended I/O Interrupt Controller):**
EIOINTC is accessed via IOCSR (I/O Control and Status
Register) instructions (IOCSRRD/IOCSRWR), NOT via MMIO:
- EIOINTC enable: IOCSR address 0x0420, set bit 48 to
enable the EIOINTC. Without this bit set, no external
interrupts are delivered to any CPU.
- Configure interrupt routing: each of 256 vectors can be
directed to a specific CPU via routing registers at
IOCSR 0x0800 + (vector / 4) × 4 (4 vectors per 32-bit
word, 8 bits per vector specifying target CPU mask).
- Set priority for each interrupt source.
- Enable desired interrupt lines in EIOINTC enable register
(IOCSR 0x0600 + (vector / 32) × 4, 1 bit per vector).
- Also initialize LIOINTC (Legacy I/O Interrupt Controller)
for UART and other legacy devices if present in DTB.
Enable interrupts: set CSR.CRMD.IE = 1 (global interrupt
enable in Current Mode register, bit 2).
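The routing and enable register arithmetic above reduces to two small helpers (illustrative names; addresses as given for the IOCSR layout above):

```rust
/// EIOINTC routing entry for a vector: IOCSR word address (base 0x0800,
/// four 8-bit route entries per 32-bit word) and the bit shift of this
/// vector's entry within that word.
fn eiointc_route_reg(vector: u32) -> (u32, u32) {
    (0x0800 + (vector / 4) * 4, (vector % 4) * 8)
}

/// EIOINTC enable bit for a vector: IOCSR word address (base 0x0600,
/// one bit per vector) and the bit position within that word.
fn eiointc_enable_bit(vector: u32) -> (u32, u32) {
    (0x0600 + (vector / 32) * 4, vector % 32)
}
```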
Phase 10: Stable Counter Timer
LoongArch has a dedicated Stable Counter for timekeeping:
- Read counter frequency from CSR.TCFG or DTB
/cpus/timebase-frequency.
- Program CSR.TCFG (Timer Configuration): set InitVal
(initial countdown value), Periodic bit (1=auto-reload),
and En bit (1=enable).
- Timer fires a TI (Timer Interrupt) exception when the
counter reaches zero. If Periodic=1, it reloads InitVal
automatically.
- Current counter value readable from CSR.TVAL.
This is equivalent to APIC timer / decrementer / SBI timer
on other architectures.
Phase 11: Syscall Setup (SYSCALL instruction)
LoongArch uses the SYSCALL instruction for user → kernel
transitions:
- SYSCALL triggers an exception with Ecode=0xB (Syscall).
- Hardware saves PC to CSR.ERA (Exception Return Address),
saves PLV/IE to CSR.PRMD (Pre-exception Mode).
- Syscall number: a7 (consistent with RISC-V and AArch64
Linux ABI convention).
- Arguments: a0-a5 (up to 6 arguments).
- Return: a0 (negative = -errno).
- Return to user: ERTN instruction (restores PLV, IE, and
jumps to ERA).
Verification test: issue SYSCALL from PLV0 to test vector
entry. Handler reads CSR.ESTAT.Ecode, verifies it equals
0xB, and returns.
Phase 12: Scheduler (canonical Phase 2.3)
Initialize EEVDF scheduler. Spawn test threads.
Run cooperative yield loop, then enable preemptive
scheduling via timer tick callback.
Phase 12a: Workqueue Framework (canonical Phase 2.7)
Initialize named kernel worker thread pools.
Cross-ref: [Section 3.11](03-concurrency.md#workqueue-deferred-work).
Phase 12b: RCU Init (canonical Phase 2.8)
Initialize RCU infrastructure (grace period tracking,
callback queues, per-CPU state).
Cross-ref: [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths).
Phase 12c: LSM Framework Init (canonical Phase 2.9)
Initialize LSM framework and register compiled-in LSMs.
Cross-ref: [Section 9.8](09-security.md#linux-security-module-framework).
SMP bringup — LoongArch64:
Secondary CPUs are brought online via IPI through the IOCSR mailbox mechanism.
BSP bringup sequence (for each AP):
1. Allocate a per-AP kernel stack (16 KB) from the per-NUMA-node boot allocator. On allocation failure, mark the CPU permanently offline and log "CPU {cpu_id}: stack allocation failed, CPU disabled".
2. Write the AP entry address to the per-CPU IOCSR mailbox register: IOCSR address 0x1020 + 8 × cpu_id (64-bit mailbox, each CPU has its own). The entry address must be a physical address accessible in DA mode (before the AP has its MMU configured).
3. Send an IPI to the target CPU: IOCSRWR to the IPI send register (IOCSR 0x1040, write the target CPU bit to trigger the doorbell).
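The mailbox addressing above can be sketched as follows (illustrative constant and helper names):

```rust
/// Per-CPU IOCSR mailbox: base 0x1020, one 64-bit register per CPU.
const IOCSR_MAILBOX_BASE: u64 = 0x1020;
/// IPI send (doorbell) register.
const IOCSR_IPI_SEND: u64 = 0x1040;

/// IOCSR address of a CPU's mailbox register.
fn mailbox_addr(cpu_id: u64) -> u64 {
    IOCSR_MAILBOX_BASE + 8 * cpu_id
}
```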
Secondary CPU initialization sequence:
Each AP wakes from its firmware spin loop, reads the mailbox entry address,
and jumps to the SMP trampoline (arch/loongarch64/trampoline.S):
- Enable FPU: CSRWR CSR.EUEN, set FPE = 1 (bit 0). Required before calling any Rust code.
- Set up DMW0/DMW1 (same values as BSP — cached/uncached windows).
- Load per-CPU stack pointer from the per-CPU data block.
- Run CPUCFG feature detection (Phase 4) to populate per-CPU feature set.
- Load kernel page tables (Phase 7): write CSR.PGDH with the BSP's kernel page table root. If hardware PTW is available, write CSR.PWCL/PWCH (same values as BSP). Transition CSR.CRMD from DA→PG mode (clear DA, set PG). Execute INVTLB 0, $r0, $r0. The AP uses the BSP's page tables — it does NOT build new ones.
- Configure per-CPU EIOINTC routing (Phase 9): update routing registers to direct interrupts assigned to this CPU. Each AP configures its own vector routing entries.
- Program per-CPU Stable Counter timer (Phase 10): write CSR.TCFG with InitVal, Periodic=1, En=1.
- Set up exception vectors: write CSR.EENTRY and CSR.TLBRENTRY (same handler addresses as BSP).
- Configure syscall entry (Phase 11): same as BSP (shared handler code).
- Initialize per-CPU CpuLocal, slab magazines, and join the scheduler.
APs must NOT re-run: DTB parse (Phase 3), phys::init() (Phase 5), buddy_init() (Phase 6). These are BSP-only, run-once operations.
vDSO on LoongArch64:
The LoongArch vDSO uses the RDTIME.D instruction for clock_gettime. RDTIME.D
reads the Stable Counter and the counter ID (which CPU timer is being read) in a
single instruction. It is available in user mode (PLV3). The VVAR page provides
the counter frequency and calibration data.
| Architecture | Counter instruction | Notes |
|---|---|---|
| LoongArch64 | RDTIME.D | Stable Counter. Frequency from CPUCFG or DTB. Available at PLV3 (user). |
Per-architecture hardware abstraction equivalents:
| Concept | LoongArch64 |
|---|---|
| Privilege separation | PLV (Privilege Level): PLV0=kernel, PLV3=user. CSR.CRMD.PLV field. |
| Exception dispatch | CSR.EENTRY base + (Ecode × VS). TLB refill has dedicated CSR.TLBRENTRY. |
| Interrupt controller | EIOINTC (256 vectors, per-CPU routing via IOCSR). PCH-PIC for device interrupts. LIOINTC for legacy. |
| Timer | Stable Counter (CSR.TCFG, auto-reload). RDTIME.D for reading. |
| Syscall mechanism | SYSCALL instruction (PLV3→PLV0). ERTN to return. |
| Page table format | 4-level (PGD→PUD→PMD→PTE), 4 KB/16 KB/64 KB pages. Software or hardware PTW. |
| Fast isolation | None. Tier 1 → Tier 0. |
| TLB ID | ASID (10-bit, CSR.ASID). INVTLB instruction for TLB invalidation (local only). |
2.14 Interrupt Controller Architecture¶
The x86-64 interrupt architecture (8259 PIC remapped through the IOAPIC, with per-CPU LAPIC) is described in Phase 2 of the x86-64 boot sequence. ARM and RISC-V use different interrupt controllers with distinct initialization models. This section specifies those controllers at the level of detail required to implement the UmkaOS Tier 0 interrupt initialization code.
AArch64 / ARMv7: GIC (Generic Interrupt Controller)
ARM platforms use the GIC family. The GIC version is detected at boot from the
Device Tree compatible string or from an ACPI MADT Type 11 (GICC), Type 12
(GICD), and Type 14 (GICR) entry set. UmkaOS supports GICv2 and GICv3/v4.
GICv2 (ARM Cortex-A9, A15, A17, and earlier server SoCs):
- GICD (Distributor): a single MMIO block shared by all CPUs. Controls SPI routing, enable/disable per-IRQ, and priority configuration.
- GICC (CPU Interface): a separate MMIO block, one per CPU, accessed at a fixed per-CPU stride. Provides IAR (Interrupt Acknowledge Register) and EOIR (End-of-Interrupt Register) for claim/complete cycles.
GICv3 / GICv4 (ARM Neoverse, Cortex-A55/A75 and later, all current server and mobile SoCs):
- GICD (Distributor): single shared MMIO block. On GICv3, affinity routing is enabled by setting GICD_CTLR.ARE_S=1 / ARE_NS=1. SPIs (IRQs 32-1019) are routed to CPUs via GICD_IROUTER[n] (64-bit affinity value matching MPIDR_EL1).
- GICR (Redistributor): one MMIO region per CPU, containing an LPI and SGI/PPI frame. The GICR is discovered by walking a contiguous array of redistributor frames (two 64 KB frames, 128 KB per CPU on GICv3; 256 KB on GICv4, which adds VLPI frames) or from ACPI MADT.
- ICC system registers: On GICv3, the CPU interface is accessed entirely through system registers (no per-CPU MMIO). ICC_SRE_EL1.SRE=1 must be set first to enable system-register access; if running under a hypervisor, ICC_SRE_EL2.SRE=1 and ICC_SRE_EL2.Enable=1 must also be set.
IRQ taxonomy (all GIC versions):
| Range | Name | Description |
|---|---|---|
| 0–15 | SGI (Software Generated Interrupts) | Inter-processor interrupts. Written to GICD_SGIR (GICv2) or ICC_SGI1R_EL1 (GICv3). Delivered only to the targeted CPU(s). |
| 16–31 | PPI (Private Peripheral Interrupts) | Per-CPU, non-shared. PPIs are numbered 0-15 (PPI N = INTID N+16). Arch timer INTIDs: INTID 27 / PPI 11 = EL1 Virtual Timer (CNTV_IRQ), INTID 26 / PPI 10 = EL2 Physical Timer (CNTHP_IRQ), INTID 29 / PPI 13 = Secure EL1 Physical Timer (CNTP_IRQ secure), INTID 30 / PPI 14 = Non-secure EL1 Physical Timer (CNTP_IRQ). |
| 32–1019 | SPI (Shared Peripheral Interrupts) | Platform devices: UART, PCIe, USB, storage controllers. Routed via GICD_ITARGETSR (GICv2) or GICD_IROUTER (GICv3). |
| 8192+ | LPI (Locality-specific Peripheral Interrupts, GICv3+) | MSI-based, used for PCIe MSI and MSI-X. Backed by an in-memory interrupt property table and pending table allocated by the kernel. |
GICv3 initialization sequence (per-system, once):
1. Read GIC base addresses from DTB or ACPI MADT.
2. Map GICD MMIO and GICR MMIO regions.
3. Disable GICD: write GICD_CTLR = 0. Wait for GICD_CTLR.RWP=0.
4. Enable affinity routing: GICD_CTLR = ARE_NS. Wait for GICD_CTLR.RWP=0.
(ARE_NS must be set while EnableGrp1NS is still 0; changing ARE_NS with
EnableGrp1NS=1 is UNPREDICTABLE per ARM IHI 0069.)
ARE_S (secure affinity routing) is configured by EL3 firmware (TF-A);
UmkaOS at EL1 handles only the non-secure side.
5. Configure SPI priorities: GICD_IPRIORITYR[n] for each SPI.
6. Configure SPI routing: GICD_IROUTER[n] = MPIDR affinity of target CPU
   (Aff0-Aff2 in bits [23:0], Aff3 in bits [39:32]), or set IRM (bit 31)
   for 1-of-N routing to any participating CPU.
7. Enable GICD: GICD_CTLR |= EnableGrp1NS.
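The GICD_IROUTER packing in step 6 can be sketched as a pure function over the CPU's MPIDR_EL1 value (register field positions per the GICv3 architecture; the helper names are illustrative):

```rust
/// Pack an MPIDR_EL1 value into GICD_IROUTER<n> format.
/// IROUTER layout: Aff0 [7:0], Aff1 [15:8], Aff2 [23:16], Aff3 [39:32],
/// IRM [31] (0 = route to the specific affinity given).
pub fn irouter_targeted(mpidr: u64) -> u64 {
    // Keep only the four affinity fields; MPIDR's non-affinity bits
    // (e.g. the RES1 bit 31) must not leak into IROUTER.
    mpidr & 0x0000_00FF_00FF_FFFF
}

/// IRM = 1: 1-of-N routing to any participating CPU (affinity ignored).
pub fn irouter_any() -> u64 {
    1 << 31
}
```

The value is written as a single 64-bit MMIO store to GICD_IROUTER base + 8 × (INTID − 32).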
GICv3 per-CPU initialization sequence (executed on each CPU, including secondaries):
1. Locate this CPU's GICR frame (match GICR_TYPER.Affinity against MPIDR_EL1).
2. Wake redistributor: clear GICR_WAKER.ProcessorSleep. Execute DSB SY to
ensure the MMIO write is observable by the redistributor before polling.
Execute ISB to synchronize the instruction stream. Poll until
GICR_WAKER.ChildrenAsleep = 0 (per ARM IHI 0069F §9.1: the DSB ensures
the write reaches the redistributor; ISB ensures subsequent reads observe
the updated state).
**Note**: Linux's `gic_enable_redist()` omits the explicit DSB/ISB here,
relying on Device memory ordering properties of MMIO accesses. UmkaOS adds
them as an intentional conservative divergence per the ARM architecture spec.
The barriers are harmless (one-time init path) and guard against
implementations where MMIO ordering is weaker than expected.
3. Enable ICC system registers: write ICC_SRE_EL1 = SRE | DFB | DIB.
Execute ISB.
4. Set ICC_PMR_EL1 = 0xF0 (unmask priorities 0x00-0xEF; matches Linux's
   DEFAULT_PMR_VALUE in irq-gic-v3.c). Pseudo-NMI sources (perf sampling,
   hard lockup detection, SDEI) are assigned numerically lower, i.e.
   higher, priorities than normal IRQs, so that temporarily raising PMR
   masks normal interrupts while leaving pseudo-NMIs deliverable.
5. Set ICC_BPR1_EL1 = 0 (no binary point split; all priority bits used).
6. Enable Group 1 interrupts: write ICC_IGRPEN1_EL1 = 1. Execute ISB.
7. Configure PPI priorities: GICR_IPRIORITYR[n] for timer PPI (INTID 27 / PPI 11 = EL1 virtual timer CNTV_IRQ).
8. Enable timer PPI: GICR_ISENABLER0 |= (1 << 27). (Bit 27 in ISENABLER0 corresponds to INTID 27.)
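Steps 7-8 resolve to fixed offsets within the GICR SGI/PPI frame. A sketch, assuming the GICv3 frame layout (GICR_ISENABLER0 at frame offset 0x0100, GICR_IPRIORITYR starting at 0x0400 with one byte per INTID; these offsets come from the GIC architecture specification, not from the text above):

```rust
/// Byte offset, within the GICR SGI/PPI frame, of the priority byte
/// for an SGI/PPI INTID (0-31): GICR_IPRIORITYR base 0x0400,
/// one byte per interrupt.
pub fn gicr_prio_offset(intid: u32) -> u32 {
    assert!(intid < 32, "SGI/PPI INTIDs only");
    0x0400 + intid
}

/// Bit to OR into GICR_ISENABLER0 (frame offset 0x0100) to enable
/// an SGI/PPI; bit N corresponds to INTID N.
pub fn gicr_isenabler0_bit(intid: u32) -> u32 {
    assert!(intid < 32, "SGI/PPI INTIDs only");
    1 << intid
}
```

For the EL1 virtual timer (INTID 27), step 8's write is therefore `ISENABLER0 |= 1 << 27` and its priority byte lives at frame offset 0x41B.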
Exception routing for interrupts (AArch64):
When an IRQ fires from EL0 or EL1 with SPx, the CPU jumps to the IRQ vector at VBAR_EL1 + 0x280 (Current EL with SPx, IRQ). The handler reads ICC_IAR1_EL1 to obtain the IRQ ID, dispatches to the registered handler, then writes ICC_EOIR1_EL1 with the same IRQ ID to signal completion. Priority drop (ICC_EOIR1_EL1 write) and deactivation (ICC_DIR_EL1 write) may be split when EOImode=1 is set in ICC_CTLR_EL1 for fine-grained priority management.
RISC-V: PLIC (Platform-Level Interrupt Controller)
The PLIC is the standard external interrupt controller for RISC-V supervisor-mode
software. It is discovered from the Device Tree node with compatible = "riscv,plic0"
or "sifive,plic-1.0.0", which provides the MMIO base address and the number
of interrupt sources.
PLIC memory map (all offsets are from the PLIC base address):
Offset 0x000000 + source*4: Source priority register (0=disabled, 1-7=priority level)
Offset 0x001000 + word*4: Interrupt pending bits (read-only, one bit per source)
Offset 0x002000 + ctx*0x80 + word*4: Interrupt enable bits (one bit per source, per context)
Offset 0x200000 + ctx*0x1000: Priority threshold register (0=accept all, 7=accept none)
Offset 0x200004 + ctx*0x1000: Claim/Complete register (read=claim highest-priority
pending IRQ; write=signal completion for that IRQ ID)
A context is a PLIC-internal index that maps to a specific (hart, privilege mode)
pair. The mapping is NOT fixed by the PLIC specification — it is described by the
Device Tree's interrupts-extended property on the PLIC node. The common convention
context = hart_id × 2 + mode (mode 0 = M-mode, mode 1 = S-mode) holds for QEMU
virt but is NOT reliable on real hardware, especially with the H extension (which
adds HS-mode and VS-mode contexts) or on platforms where M-mode contexts are not
exposed via the PLIC (OpenSBI handles M-mode interrupts internally).
UmkaOS uses S-mode contexts exclusively and discovers them from the Device Tree.
PLIC initialization sequence:
1. Discover PLIC base from DTB; map the MMIO region.
2. For each interrupt source (1 to max_source):
a. Set priority: PLIC[0x000000 + source*4] = desired_priority (1-7, or 0 to disable).
3. For each hart, discover its S-mode context via the DT:
a. Parse the PLIC DT node's `interrupts-extended` property. This property
contains a flat list of (cpu_intc_phandle, irq_type) pairs, one per context.
b. For each (cpu_intc, irq_type) pair at context index `ctx`:
If irq_type == 9 (S-mode external interrupt) and the cpu_intc belongs
to this hart (resolved via riscv,cpu-intc → parent hart DT node):
This `ctx` is the S-mode context for this hart. Record the mapping.
c. Set threshold to 0 (accept any non-zero priority):
PLIC[0x200000 + ctx*0x1000] = 0.
d. Enable desired sources:
PLIC[0x002000 + ctx*0x80 + (source/32)*4] |= (1 << (source % 32)).
Note: the formula `hart_id * 2 + 1` is a QEMU virt simplification; production
code MUST use DT-discovered context indices. Linux (drivers/irqchip/irq-sifive-plic.c)
uses the same DT-based discovery via plic_parse_context_parent().
4. Enable PLIC external interrupts in sie CSR: sie.SEIE = 1 (bit 9).
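The memory map above yields a small set of pure offset helpers; the MMIO accesses themselves would wrap these in volatile reads/writes (helper names are illustrative):

```rust
/// Offsets relative to the PLIC base, per the memory map above.
pub fn plic_priority(source: u32) -> usize {
    source as usize * 4
}
pub fn plic_enable_word(ctx: usize, source: u32) -> usize {
    0x002000 + ctx * 0x80 + (source as usize / 32) * 4
}
pub fn plic_enable_bit(source: u32) -> u32 {
    1 << (source % 32)
}
pub fn plic_threshold(ctx: usize) -> usize {
    0x200000 + ctx * 0x1000
}
/// Read = claim the highest-priority pending IRQ; write = complete it.
pub fn plic_claim_complete(ctx: usize) -> usize {
    0x200004 + ctx * 0x1000
}
```

Note that `ctx` here is the DT-discovered S-mode context index, never `hart_id * 2 + 1`.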
IRQ handling sequence (trap, scause = 9, External interrupt):
1. Look up the DT-discovered S-mode context for this hart: ctx = plic_ctx[hart_id].
2. Read claim register: source_id = PLIC[0x200004 + ctx*0x1000].
A zero return means no interrupt is pending (spurious); ignore.
3. Dispatch to the registered handler for source_id.
4. Write completion: PLIC[0x200004 + ctx*0x1000] = source_id.
This deasserts the interrupt and allows new interrupts of equal or
lower priority to be delivered.
IPI delivery (RISC-V):
IPIs do not go through the PLIC. They use the SBI IPI extension
(Extension ID: 0x735049 = ASCII "sPI"). The primary hart calls
sbi_send_ipi(hart_mask, hart_mask_base) to set a software interrupt
pending on one or more target harts. On the receiving hart, the software
interrupt fires as a supervisor software interrupt (scause = 1, sie.SSIE = 1).
UmkaOS clears the IPI by writing sip.SSIP = 0 in the IPI handler and then
dispatches the pending IPI work item from the per-hart IPI queue.
2.15 NUMA Topology Discovery¶
On x86-64 and ARM SBSA/server platforms, NUMA topology is provided by ACPI tables: SRAT (System Resource Affinity Table) maps memory ranges and APIC / MPIDR IDs to NUMA node numbers, while SLIT (System Locality Information Table) provides the distance matrix. UmkaOS parses SRAT and SLIT during static table parsing (Phase 1 of x86-64 initialization; see Section 2.4).
On platforms that boot with a Device Tree (AArch64 embedded, RISC-V, PPC32, PPC64LE), NUMA topology is encoded directly in the Device Tree. UmkaOS performs DT-based NUMA discovery as a post-DTB-parse step for all non-x86 architectures.
Device Tree NUMA encoding:
/cpus/cpu@N
numa-node-id = <0>; // NUMA node this CPU belongs to
/memory@40000000
device_type = "memory";
reg = <0x0 0x40000000 0x0 0x40000000>;
numa-node-id = <0>; // NUMA node this memory range belongs to
/memory@200000000
device_type = "memory";
reg = <0x2 0x00000000 0x2 0x00000000>;
numa-node-id = <1>; // Second NUMA node
/distance-map // Optional; absent on many embedded platforms
compatible = "numa-distance-map-v1";
distance-matrix =
<0 0 10>, // Node 0 → Node 0: local (normalized to 10)
<0 1 20>, // Node 0 → Node 1: remote
<1 0 20>, // Node 1 → Node 0: remote
<1 1 10>; // Node 1 → Node 1: local
UmkaOS DT-based NUMA discovery algorithm:
1. Walk all /cpus/cpu@N nodes. For each cpu node:
a. Read the reg property (MPIDR affinity / hart ID / PIR).
b. Read numa-node-id. If absent, assign to node 0.
c. Record: cpu_id → numa_node mapping.
2. Walk all /memory@... nodes. For each memory node:
a. Read reg (base, size) pairs.
b. Read numa-node-id. If absent, assign all memory to node 0.
c. Record: [base, base+size) → numa_node mapping (passed to phys::init).
3. If /distance-map node is present:
a. Parse distance-matrix property: triples of (from_node, to_node, distance).
b. Populate NumaDistanceMatrix[from][to] = distance.
c. Distances are normalized: local access = 10. Remote = proportionally higher.
d. Validate the distance matrix:
i. Self-distance: distance[N][N] must equal LOCAL_DISTANCE (10) for all
nodes N. If not, log a firmware warning and force to 10.
ii. Symmetry: distance[A][B] must equal distance[B][A] for all node pairs.
If not, log a firmware warning and set both to the maximum of the two
values (conservative: assume the worse latency applies in both directions).
iii. Unreachable sentinel: distance == 255 (REMOTE_DISTANCE_UNREACHABLE)
means no direct path exists between nodes; the node pair must not be
used for migration or memory allocation fallback.
iv. Range cap: all non-sentinel distances must be in [10, 254]. Values
above 254 (other than 255) are clamped to 254 with a firmware warning.
Values below 10 are invalid (a remote access cannot be cheaper than
local) and are clamped to 10 with a firmware warning.
If /distance-map is absent:
a. Assume symmetric topology: all local accesses cost 10, all remote
accesses cost 20 (single-hop assumption). This is conservative but safe.
4. Validate: ensure every CPU maps to a node that has at least some memory.
If a CPU's node has no memory (misconfigured DTB), log a warning and
migrate the CPU to the nearest node with memory (lowest distance score).
Per-architecture specifics:
ARM server (AWS Graviton 3, Ampere Altra, Neoverse N2/V2 platforms): Prefer ACPI SRAT over Device Tree on SBSA-compliant platforms (ACPI is mandatory on SBSA). The SRAT Memory Affinity Structure and Processor Affinity Structure (Types 1 and 0) map MPIDR values and memory ranges to NUMA proximity domains. Distance values come from SLIT. On platforms that provide both ACPI and a Device Tree (Graviton 3 exposes both), ACPI takes precedence.
RISC-V: No ACPI on most RISC-V platforms. The distance-map DT node is rarely populated on current RISC-V hardware (SiFive HiFive Unmatched, StarFive VisionFive 2). UmkaOS applies the symmetric topology fallback (local=10, remote=20) on RISC-V when the distance-map node is absent. Future multi-socket RISC-V server designs (expected from Ventana, SiFive, Alibaba T-Head) will populate distance-map.
PPC64LE (POWER10):
IBM POWER systems encode NUMA topology using the proprietary
ibm,associativity and ibm,associativity-reference-points DT properties:
/cpus/cpu@0
ibm,associativity = <4 0 0 0 0>;
// Four levels of hierarchy: chip group / chip / core / thread.
// The reference-points property selects which levels to use for
// NUMA distance calculation.
/ibm,associativity-reference-points = <0x4 0x2>;
// Level index 4 (first element) = domain/chip-group boundary.
// Level index 2 (second element) = chip boundary.
// Distance between CPUs sharing the same value at each level:
// same at both levels = local (same chip) → distance 10
// same at first but different at second = 1 hop → distance 20
// different at first = multiple hops → distance 40
UmkaOS parses ibm,associativity-reference-points first to determine the number
of distance levels, then for each CPU and memory node reads ibm,associativity
to compute the NUMA node assignment and inter-node distance matrix.
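A sketch of the distance computation, assuming the doubling-per-unmatched-reference-level interpretation shown in the DT comments above (both matched → 10, one mismatch → 20, both mismatched → 40; the helper name and exact scaling are illustrative):

```rust
/// Compute a NUMA distance from two ibm,associativity arrays given the
/// reference-point level indices. Cell 0 of each array is the entry
/// count; the reference points index into the array directly, matching
/// the DT encoding above. Distance doubles per unmatched level.
pub fn assoc_distance(a: &[u32], b: &[u32], ref_points: &[u32]) -> u32 {
    let mut dist = 10; // local baseline
    for &lvl in ref_points {
        // Option comparison: out-of-range levels count as a mismatch
        // only if exactly one array is short.
        if a.get(lvl as usize) != b.get(lvl as usize) {
            dist *= 2;
        }
    }
    dist
}
```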
2.15.1.1 TEE Capability Discovery¶
After NUMA topology is established, UmkaOS populates the NumaNodeTeeInfo field
in each BuddyAllocator (Section 4.2) to
record whether a node's memory controller supports hardware encryption. This is
consumed by the tiering engine and physical allocator to prevent confidential pages
from being migrated to non-TEE-capable nodes
(Section 9.7).
Per-architecture discovery:
AMD (SEV-SNP):
SEV capability is detected via CPUID Fn8000_001F[EAX]: bit 1 = SEV enabled,
bit 4 = SEV-SNP supported. The maximum number of encrypted ASIDs (key IDs) is
read from CPUID Fn8000_001F[ECX]. On AMD platforms, all socket-attached DRAM
nodes share the same memory encryption engine — each populated NUMA node backed
by socket DRAM is marked tee_capable = true with max_key_ids set from ECX.
CXL-attached memory nodes on AMD platforms are marked tee_capable = false
unless the CXL device firmware explicitly advertises encryption support via
the CXL DVSEC (Designated Vendor-Specific Extended Capability) security field.
Intel (TDX / MKTME):
MKTME capability is detected via CPUID.(EAX=7,ECX=0):ECX[13] (TME) and the
IA32_TME_CAPABILITY MSR (MSR 0x981), which reports the number of available
MKTME key IDs in bits 50:36. The IA32_TME_ACTIVATE MSR (MSR 0x982) confirms
that TME/MKTME is enabled and reports the active key ID count. Each NUMA node
backed by socket DRAM is marked tee_capable = true with max_key_ids from the
MSR key count. CXL-attached nodes are marked tee_capable = false unless the
device reports MKTME key routing support via its CXL compliance structure.
ARM (CCA / Granule Protection):
CCA capability is detected via ID_AA64PFR0_EL1 bits 55:52 (RME field, value
0b0001 = RME supported). The Granule Protection Check (GPC) enforces memory
world partitioning per-granule. On CCA-capable platforms, all DRAM nodes managed
by the GPC-aware memory controller are marked tee_capable = true. The
max_key_ids field is set to the number of realm IDs supported by the RMM
(Realm Management Monitor), queried via the RMI version interface at boot.
CXL-attached nodes are marked tee_capable = false unless the CXL host bridge
is within the GPC-protected address space.
RISC-V, PPC32, PPC64LE, s390x, LoongArch64:
No hardware TEE memory encryption mechanism equivalent to SEV-SNP/TDX/CCA exists
on these architectures. All NUMA nodes are marked tee_capable = false with
max_key_ids = 0. If future ISA extensions add memory encryption (e.g., RISC-V
CoVE), this discovery path will be extended.
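The x86 probes above reduce to fixed bit extractions from raw CPUID/MSR values; a sketch with field positions as cited above (helper names are illustrative):

```rust
/// Decode AMD SEV capabilities from CPUID Fn8000_001F:
/// EAX bit 1 = SEV enabled, EAX bit 4 = SEV-SNP supported,
/// ECX = maximum number of encrypted ASIDs (key IDs).
pub fn sev_caps(eax: u32, ecx: u32) -> (bool, bool, u32) {
    let sev = eax & (1 << 1) != 0;
    let snp = eax & (1 << 4) != 0;
    (sev, snp, ecx)
}

/// Extract the MKTME key-ID count from a raw IA32_TME_CAPABILITY
/// (MSR 0x981) value: bits 50:36, a 15-bit field.
pub fn mktme_max_keys(tme_capability: u64) -> u64 {
    (tme_capability >> 36) & 0x7FFF
}
```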
2.16 Extended State and CPU Features¶
2.16.1 Per-Architecture Extended State (FPU) Initialization¶
Each architecture requires explicit initialization to enable floating-point and SIMD registers before they can be used by kernel or user code. UmkaOS uses a lazy FP strategy on all architectures: extended state is not saved at every context switch, but only when the task has actually used FP/SIMD registers.
x86-64:
FPU/SSE/AVX/XSAVE initialization runs during early boot (before interrupts are enabled, after the physical memory manager is initialized):
1. Detect XSAVE: CPUID leaf 0x1, ECX bit 26 (XSAVE; the OSXSAVE flag,
   ECX bit 27, only becomes set after the CR4 enable in step 3). If
   absent, fall back
to legacy FXSAVE (SSE2 state only, 512 bytes).
2. Set CR0: CR0.EM = 0 (no FPU emulation), CR0.MP = 1 (monitor coprocessor).
3. Set CR4.OSFXSR = 1 (enable FXSAVE/FXRSTOR for SSE state).
Set CR4.OSXSAVE = 1 (enable XSAVE/XRSTOR for extended state).
4. Query XCR0 to discover which extended state components are present:
XCR0 bit 0 = x87 FPU, bit 1 = SSE, bit 2 = AVX, bit 5-7 = AVX-512,
bit 9 = PKRU, bit 17-18 = AMX tile config/data.
5. Enable all supported components: write XCR0 with the bitmask of present
components (CPUID leaf 0xD, sub-leaf 0 provides the valid bit set).
6. Lazy context switch: set CR0.TS = 1 (task switched). First FP use from
any task triggers a #NM (Device Not Available) exception. The handler
loads the task's saved FP state and clears CR0.TS before returning.
On context switch out: if CR0.TS was clear (task used FP), save the
extended state via XSAVE[OPT/C] to the per-task XSAVE area.
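Steps 4-5 amount to computing a sanitized XCR0 value: intersect the components the kernel wants with those CPUID leaf 0xD reports valid, and force the architecturally mandatory bits. A sketch (component bit positions per the XSAVE architecture; the helper is illustrative):

```rust
const XCR0_X87: u64 = 1 << 0; // always required to be 1
const XCR0_SSE: u64 = 1 << 1;
const XCR0_AVX: u64 = 1 << 2;

/// Compute the XCR0 value to program from the desired component set
/// and the valid-bit mask from CPUID leaf 0xD, sub-leaf 0 (EDX:EAX).
pub fn xcr0_value(wanted: u64, cpuid_d_valid: u64) -> u64 {
    let mut v = (wanted & cpuid_d_valid) | XCR0_X87; // XCR0[0] must be 1
    if v & XCR0_AVX != 0 {
        v |= XCR0_SSE; // AVX state requires SSE state to be enabled
    }
    v
}
```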
AArch64:
1. NEON/FP enable: Write CPACR_EL1.FPEN = 0b11 (no trapping of FP/NEON
instructions at EL1 or EL0). Without this, any NEON/FP instruction
from EL0 or EL1 causes an Undefined Instruction exception.
(UmkaOS's entry.S already sets FPEN=0b11 for the boot CPU to allow
Rust-generated NEON instructions in the early kernel; secondary CPUs
set FPEN=0b11 in their secondary_entry stubs.)
2. Lazy context switch: use CPACR_EL1.FPEN = 0b00 (trap FP/NEON from
all ELs) to detect first use. On the resulting trap (ESR_EL1.EC=0x07,
FP/NEON access from AArch64), load the task's saved FP state and set
FPEN=0b11 before returning. On context switch out: if FPEN was 0b11
(task used FP), save Q0-Q31 + FPSR + FPCR to the per-task FP frame.
3. SVE (Scalable Vector Extension, ARMv8.2+): If CPUID reports SVE
   (ID_AA64PFR0_EL1.SVE != 0), set ZCR_EL1.LEN to the desired vector
   length in 128-bit granules minus 1 (0 = 128-bit, 1 = 256-bit, up to
   the implementation maximum, probed by writing the largest LEN value
   and reading back the effective vector length with RDVL).
   Set CPACR_EL1.ZEN = 0b11 (no SVE trapping at EL1/EL0); if the kernel
   entered at EL2, CPTR_EL2 must also be configured not to trap SVE.
SVE state (Z registers, P registers, FFR) is saved/restored separately
from the NEON state, using the larger per-task SVE frame.
4. SME (Scalable Matrix Extension, ARMv9.2+): Enabled via CPACR_EL1.SMEN
and SMCR_EL1.LEN. SME streaming mode and ZA register file are saved as
part of the per-task SME frame on context switch.
RISC-V:
sstatus.FS field (bits [14:13]) controls FP state:
0b00 = Off: Any FP instruction causes an Illegal Instruction exception.
0b01 = Initial: FP registers accessible; initial (clean) state.
0b10 = Clean: FP registers accessible; not modified since last save.
0b11 = Dirty: FP registers accessible; modified since last save.
1. At boot (on each hart): set sstatus.FS = 0b01 (Initial). This enables
FP instructions without immediately requiring a context-switch save.
2. Lazy save: set sstatus.FS = 0b00 (Off) on context switch in for tasks
that have not used FP. First FP instruction traps (Illegal Instruction,
scause = 2). The handler sets sstatus.FS = 0b01 and returns; the FP
instruction re-executes. On context switch out: if sstatus.FS == 0b11
(Dirty), save all 32 FP registers (f0-f31) plus fcsr to the per-task
FP frame, then set sstatus.FS = 0b10 (Clean). This avoids saving FP
state for tasks that never use FP.
3. Vector extension (V): If sstatus.VS (bits [10:9]) is supported, manage
the V register file (v0-v31, vtype, vl, vlenb) identically to the FP
FS field. VS = 0b00 traps; set on first use; save on switch-out if Dirty.
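The sstatus.FS bookkeeping above can be sketched as pure bit-field helpers (field position per the RISC-V privileged specification; names are illustrative):

```rust
/// sstatus.FS field, bits [14:13].
const FS_SHIFT: u64 = 13;
const FS_MASK: u64 = 0b11 << FS_SHIFT;

pub const FS_OFF: u64 = 0b00;     // FP instructions trap
pub const FS_INITIAL: u64 = 0b01; // accessible, clean initial state
pub const FS_CLEAN: u64 = 0b10;   // accessible, unmodified since save
pub const FS_DIRTY: u64 = 0b11;   // accessible, modified since save

pub fn fs_of(sstatus: u64) -> u64 {
    (sstatus & FS_MASK) >> FS_SHIFT
}

pub fn with_fs(sstatus: u64, fs: u64) -> u64 {
    (sstatus & !FS_MASK) | (fs << FS_SHIFT)
}

/// Context-switch-out decision: save the FP register file only when
/// the hardware marked it Dirty.
pub fn must_save_fp(sstatus: u64) -> bool {
    fs_of(sstatus) == FS_DIRTY
}
```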
PPC32 / PPC64LE:
The MSR (Machine State Register) contains separate enable bits for each
extended register file:
MSR.FP (bit 13): FPU enable. 0 = FP instructions cause FP Unavailable exception.
MSR.VEC (bit 25): AltiVec/VMX enable. 0 = VMX instructions cause VMX Unavailable.
MSR.VSX (bit 23): VSX enable (PPC64 only). 0 = VSX instructions cause VSX Unavailable.
1. At boot: clear MSR.FP, MSR.VEC, MSR.VSX (all zero after reset; verify).
2. Lazy enable: the FP/VMX/VSX Unavailable exception fires on first use.
The handler sets the corresponding MSR bit and returns. The instruction
re-executes.
3. On context switch out: if any of MSR.FP / MSR.VEC / MSR.VSX is set,
save the corresponding register file (32 FPRs + FPSCR, 32 VMX registers
+ VSCR/VRSAVE, 64 VSX registers) to the per-task frame, then clear the
MSR bit to re-arm the trap for the next task.
4. On context switch in: do NOT restore FP state until first use (the
Unavailable trap will do that). This means tasks that were FP-active
when they were switched out will take one Unavailable trap on their
next quantum — a single additional exception per task per scheduling
interval, which is acceptable given the benefit of skipping FP restore
for FP-idle tasks.
s390x:
s390x has no hardware trap-on-first-use mechanism (no equivalent of
x86 CR0.TS or ARM CPACR trapping). Instead, software usage flags
track whether a task has used FP/vector registers. The `__switch_to`
function is called on every context switch, but the actual register
save/restore within it IS conditional — it checks per-task usage flags
(`ufpu_flags`, `kfpu_flags`) and skips register groups that were not used:
1. At boot (on each CPU): verify vector facility via STFLE bit 129.
If present, enable vector instructions via CR0 bit 17.
2. Context switch: the save/restore functions check per-task usage flags
(AFP/VX). If the task has never used FP, save is skipped entirely.
If the task has used FP, save is performed (there is no hardware
"dirty" flag to distinguish "used but not modified" from "used and
modified" — once a task uses FP, all subsequent context switches
save/restore those registers). STFPC/LFPC saves/restores the FP
control register; VST/VL handles vector registers (16 x 128-bit when
the vector facility is present, facility bit 129). This is slightly
less efficient than hardware lazy FP (tasks that used FP once but
never again still pay save/restore cost) but architecturally correct
given the absence of a hardware trap mechanism.
LoongArch64:
LoongArch uses CSR.EUEN (Extended Unit Enable) bits for lazy FP:
EUEN.FPE (bit 0): FP enable. 0 = FP instructions cause FPD exception.
EUEN.SXE (bit 1): 128-bit SIMD (LSX) enable. 0 = LSX causes LSXD.
EUEN.ASXE (bit 2): 256-bit SIMD (LASX) enable. 0 = LASX causes LASXD.
1. At boot: clear EUEN.FPE, EUEN.SXE, EUEN.ASXE.
2. Lazy enable: FPD/LSXD/LASXD exception fires on first use.
The handler sets the corresponding EUEN bit and returns.
3. On context switch out: if any EUEN bit is set, save the
corresponding register file (32 FPRs + FCSR, 32 × 128-bit LSX
registers, or 32 × 256-bit LASX registers), then clear EUEN bits.
4. LoongArch FP/SIMD state sizes: FP = 256 bytes, LSX = 512 bytes,
LASX = 1024 bytes.
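The FPD/LSXD/LASXD dispatch in step 2, together with the state sizes from step 4, can be sketched as a single lookup. The Ecode values (FPD = 0xF, LSXD = 0x10, LASXD = 0x11) are taken from the LoongArch exception-code assignments and are an assumption not stated in the text above:

```rust
/// CSR.EUEN enable bits, per the table above.
pub const EUEN_FPE: u64 = 1 << 0;  // FP
pub const EUEN_SXE: u64 = 1 << 1;  // LSX (128-bit SIMD)
pub const EUEN_ASXE: u64 = 1 << 2; // LASX (256-bit SIMD)

/// Map a "unit disabled" exception Ecode to the EUEN bit the handler
/// must set and the size in bytes of the per-task save area.
pub fn euen_for_ecode(ecode: u64) -> Option<(u64, usize)> {
    match ecode {
        0xF  => Some((EUEN_FPE, 256)),   // FPD: 32 FPRs + FCSR
        0x10 => Some((EUEN_SXE, 512)),   // LSXD: 32 × 128-bit
        0x11 => Some((EUEN_ASXE, 1024)), // LASXD: 32 × 256-bit
        _    => None,                    // not a lazy-FP exception
    }
}
```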
UmkaOS's unified lazy FP policy:
All eight architectures implement the same semantic contract (s390x substitutes software usage flags for hardware trapping, but the external contract is identical):
- Tasks that never issue a FP/SIMD instruction pay zero extended-state save or restore cost at any context switch.
- The first FP/SIMD instruction in a task's lifetime triggers one trap, which loads the task's initial (zero) FP state and marks the task as FP-active.
- Subsequent context switches for FP-active tasks check the architecture's dirty indicator (CR0.TS cleared / FS=Dirty / MSR.FP set / EUEN bits set) and save only when needed.
- The per-task FP frame is allocated at task creation (sized to the largest extended state the hardware can produce on that platform, as determined by XSAVE area size on x86, SVE vector length on AArch64, or fixed sizes on RISC-V/PPC/s390x/LoongArch) and freed at task exit.
2.16.2 CPU Feature Registry¶
After per-architecture feature detection (the per-architecture procedures of Section 2.16) completes on each
CPU, the kernel consolidates all discovered capabilities into a single, global
CpuFeatureTable. Every subsystem — scheduler, crypto, compression, checksum,
algorithm dispatch — queries this table rather than re-reading architecture-specific
registers at runtime.
Design contract: the CpuFeatureTable is write-once. It is populated during
boot, then frozen permanently before any Tier 1 driver or non-boot kthread begins
execution. After freezing, all reads are lock-free: the table page is marked
read-only by cpu_features_freeze(), and Rust's type system prevents mutation
through the &'static CpuFeatureTable reference vended to callers.
CPU hotplug feature handling: When a CPU is onlined after boot (physical hotplug
or echo 1 > /sys/devices/system/cpu/cpuN/online), the kernel runs
arch::current::cpu::detect_features(cpu_id) on the new CPU and compares its
feature set against the frozen universal mask. If the new CPU lacks any feature
in universal, the hotplug is rejected with EINVAL (the kernel cannot downgrade
algorithm dispatch decisions already made). If the new CPU has additional features
beyond universal, those extras are recorded in the per-CPU entry but do NOT
update universal (subsystems already using the universal mask remain safe).
The per-CPU entry is stored at the pre-allocated slot for cpu_id (slots up to
NR_CPUS are allocated at boot). XFD state (x86-64 SPR+) is re-initialized on
the hotplugged CPU per the XFD_HOTPLUG errata workaround.
Boot phase ordering: feature detection runs in three sub-phases.
Sub-phase 1 — BSP early init (before ACPI/DT parsing):
arch::current::cpu::detect_features(cpu_id=0) → fills entry[0]
Sub-phase 2 — each AP's first kernel instruction (during SMP bringup):
arch::current::cpu::detect_features(cpu_id=N) → fills entry[N]
Sub-phase 3 — immediately before Tier 1 driver init:
cpu_features_freeze():
1. ANDs all per-CPU entries → computes `universal` intersection.
2. Marks the CpuFeatureTable page read-only (arch MMU call).
3. Logs the universal capability set at KERN_INFO.
4. Asserts that at least one generic fallback exists for every
registered AlgoDispatch (§3.10). Panics if not.
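The AND-for-capabilities fold in `cpu_features_freeze()` step 1, combined with the OR-for-errata rule stated in Section 2.16.2.1, can be sketched over a simplified per-CPU entry (the real `CpuFeatureSet` has typed sub-structs, but the fold is the same):

```rust
/// Simplified per-CPU entry: one capability word, one errata word.
#[derive(Clone, Copy)]
pub struct Entry {
    pub caps: u64,
    pub errata: u64,
}

/// Fold per-CPU entries into the frozen `universal` view:
/// capabilities intersect (a feature counts only if every CPU has it),
/// errata union (a workaround stays active if any CPU needs it,
/// because kthreads can migrate to that CPU).
pub fn universal(entries: &[Entry]) -> Entry {
    entries.iter().fold(
        Entry { caps: u64::MAX, errata: 0 },
        |acc, e| Entry {
            caps: acc.caps & e.caps,
            errata: acc.errata | e.errata,
        },
    )
}
```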
2.16.2.1 Per-CPU Feature Set¶
/// Capability description of one logical CPU.
///
/// One entry per logical CPU (hardware thread), indexed by CPU ID.
/// Populated at boot by `arch::current::cpu::detect_features(cpu_id)`.
/// Immutable after `cpu_features_freeze()`.
///
/// Cache-line aligned: parallel reads from distinct CPUs never share a line.
// kernel-internal, not KABI — CpuFeatureSet sub-structs (CryptoCaps, AtomicCaps,
// IsolationCaps, VirtCaps) have per-arch-dependent content. Size verified at build.
#[repr(C, align(64))]
pub struct CpuFeatureSet {
/// Width in bytes of the widest SIMD register available in kernel mode.
///
/// | Architecture | Possible values | Condition |
/// |--------------|------------------------|----------------------------------|
/// | x86-64 | 16 / 32 / 64 | SSE2 / AVX2 / AVX-512F |
/// | AArch64 | 16; 32–256 (×16) | NEON baseline; SVE (VLEN/8) |
/// | ARMv7 | 0 / 8 / 16 | no FP / VFPv3-D16 / NEON |
/// | RISC-V | 0; 4–256 (power-of-2) | no V; RVV (VLEN/8 bytes) |
/// | PPC32 | 0 / 16 | no AltiVec / AltiVec/VMX |
/// | PPC64LE | 16 | VMX/VSX (64 × 128-bit registers) |
/// | s390x | 0 / 16 | no VX / Vector Facility (32 × 128-bit registers) |
/// | LoongArch64 | 0 / 16 / 32 | no SIMD / LSX (128-bit) / LASX (256-bit) |
///
/// Kernel SIMD use (§3.10.2) checks this value before selecting an
/// implementation. Value 0 means no SIMD unit is accessible to kernel mode
/// on this CPU; only scalar implementations may be used.
pub simd_width_bytes: u16,
/// Cryptographic hardware acceleration.
pub crypto: CryptoCaps,
/// Atomics and memory-model extensions beyond the baseline ISA.
pub atomics: AtomicCaps,
/// Fast intra-process isolation extensions (for the driver tier model).
pub isolation: IsolationCaps,
/// Virtualisation hosting support.
pub virt: VirtCaps,
/// Firmware/platform capabilities available to this CPU.
/// Populated during boot by probing firmware interfaces (SMCCC on ARM,
/// OPAL/RTAS on PPC, STFLE on s390x, CPUCFG on LoongArch).
/// Used by errata workarounds that require firmware cooperation.
pub firmware: FirmwareCaps,
/// Microcode revision loaded on this CPU. Populated after early microcode
/// loading (x86: IA32_UCODE_REV MSR; AMD: MSR 0x8B). Zero on architectures
/// without loadable microcode (ARM, RISC-V, s390x, LoongArch — though s390x
/// has MCL (Microcode Level) reported via STSI, stored here if available).
/// Used by the errata database to gate workarounds: if microcode_revision
/// >= minimum safe version, the erratum's software workaround is skipped.
pub microcode_revision: u64,
/// Architecture-specific raw capability bits, opaque to cross-platform code.
/// Consumed by arch-specific modules (scheduler, crypto, drivers) that need
/// fine-grained feature sub-variants not covered by the typed fields above.
/// 256 bits: sufficient for all current and near-future ISA extensions.
pub arch_raw: [u64; 4],
/// CPU errata and microarchitectural workaround flags.
/// Each bit indicates a known hardware bug that requires a kernel code
/// path alternative. Populated during `detect_features()` by matching
/// the CPU's model/stepping (CPUID family/model/stepping on x86,
/// MIDR_EL1 on AArch64, mvendorid/marchid/mimpid on RISC-V, PVR on PPC).
///
/// The universal errata set is computed as the **union** (OR), not the
/// intersection (AND): if ANY CPU in the system has a bug, the
/// workaround must be active system-wide (because kthreads can migrate
/// to that CPU). This is the opposite of the capability fields above,
/// which use AND (a feature must be present on ALL CPUs).
pub errata: ErrataCaps,
/// Microarchitectural tuning hints. Read-only after freeze.
/// These are NOT errata (bugs) — they are performance characteristics
/// that vary by CPU model and inform algorithm/parameter selection.
pub microarch: MicroarchHints,
}
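The two aggregation rules can be sketched as fold operations over the per-CPU records. This is a minimal illustration using plain `u64` masks rather than the full `CpuCaps`/`ErrataCaps` types; the function names are illustrative, not the kernel's actual API.

```rust
/// Universal capability set: a feature is usable system-wide only if
/// EVERY CPU reports it, so fold with AND starting from all-ones.
fn universal_caps(per_cpu: &[u64]) -> u64 {
    per_cpu.iter().fold(!0u64, |acc, &c| acc & c)
}

/// Universal errata set: a workaround is needed system-wide if ANY CPU
/// reports the bug, so fold with OR starting from zero.
fn universal_errata(per_cpu: &[u64]) -> u64 {
    per_cpu.iter().fold(0u64, |acc, &e| acc | e)
}

fn main() {
    // big.LITTLE-style system: CPU 0 has feature bit 0 and bug bit 1;
    // CPU 1 has feature bits 0 and 2 and no bugs.
    let caps = [0b001u64, 0b101];
    let errs = [0b010u64, 0b000];
    assert_eq!(universal_caps(&caps), 0b001);   // bit 2 dropped: absent on CPU 0
    assert_eq!(universal_errata(&errs), 0b010); // bug kept: present on CPU 0
}
```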
2.16.2.2 Capability Bitflag Types¶
bitflags! {
/// Cryptographic acceleration: what the hardware can accelerate independently.
/// Each bit is independently meaningful; set bits do not imply other bits.
/// Combinations (e.g., AES_GCM requires both AES_BLOCK and CLMUL) are
/// checked by the caller, not encoded here.
pub struct CryptoCaps: u32 {
/// AES block cipher (hardware encrypt/decrypt of a single 128-bit block).
///
/// x86-64: AES-NI — CPUID leaf 1 ECX[25] (Sandy Bridge+, Atom Silvermont+)
/// AArch64: FEAT_AES — ID_AA64ISAR0_EL1[7:4] ≥ 1 (Cortex-A53+, all Cortex-A7x)
/// ARMv7: ID_ISAR5 AES field ≥ 1 (Cortex-A32, A53 with crypto extension)
/// RISC-V: Zkne + Zknd (ratified in RISC-V ISA v20220112; Zkn group)
/// PPC64LE: vcipher/vncipher (POWER8+, AltiVec AES; VMX ISA)
/// PPC32: Not available.
const AES_BLOCK = 1 << 0;
/// Carry-less multiply for GHASH (GCM authentication tag computation).
/// Combined with AES_BLOCK, enables full hardware AES-GCM.
///
/// x86-64: PCLMULQDQ — CPUID leaf 1 ECX[1] (Westmere+, Atom Silvermont+)
/// VPCLMULQDQ (vectorised): CPUID leaf 7 ECX[10], requires AVX-512
/// AArch64: FEAT_PMULL — ID_AA64ISAR0_EL1 AES field [7:4] ≥ 2 (superset of FEAT_AES)
/// PMULL/PMULL2 instructions operate on 64-bit and 128-bit polynomials
/// RISC-V: Zbkc — CLMUL/CLMULH (ratified; part of the Zkn group); Zbc adds CLMULR
/// PPC64LE: vpmsumd (POWER8+; polynomial multiply-sum doubleword)
/// PPC32: Not available.
const CLMUL = 1 << 1;
/// SHA-256 hardware acceleration (round function + message schedule).
///
/// x86-64: SHA-NI — CPUID leaf 7 EBX[29] (Goldmont+, Zen+)
/// SHA256RNDS2, SHA256MSG1, SHA256MSG2
/// AArch64: FEAT_SHA256 — ID_AA64ISAR0_EL1[15:12] ≥ 1
/// SHA256H, SHA256H2, SHA256SU0, SHA256SU1
/// ARMv7: ID_ISAR5 SHA2 field ≥ 1 (Cortex-A32 crypto extension)
/// RISC-V: Zknh — SHA256SUM0, SHA256SUM1, SHA256SIG0, SHA256SIG1
/// PPC: Not available as dedicated instructions.
const SHA2_256 = 1 << 2;
/// SHA-512 hardware acceleration.
///
/// x86-64: SHA-NI covers only SHA-1/SHA-256; there are no SHA-512
/// instructions on the CPUs this kernel targets (software path used).
/// Set this bit only on AArch64 and RISC-V.
/// AArch64: FEAT_SHA512 — ID_AA64ISAR0_EL1 SHA2 field [15:12] ≥ 2
/// SHA512H, SHA512H2, SHA512SU0, SHA512SU1
/// RISC-V: Zknh — SHA512SUM0R, SHA512SUM1R, SHA512SIG0L/H, SHA512SIG1L/H
/// Others: Not available.
const SHA2_512 = 1 << 3;
/// SHA-3 / Keccak hardware acceleration.
///
/// AArch64: FEAT_SHA3 — ID_AA64ISAR0_EL1[35:32] ≥ 1
/// EOR3, RAX1, XAR, BCAX
/// Others: Not available.
const SHA3 = 1 << 4;
/// SM3 and SM4 (Chinese national standard) hardware acceleration.
///
/// AArch64: FEAT_SM3 — ID_AA64ISAR0_EL1[39:36] ≥ 1 (SM3H, SM3PARTW1/2/TT)
/// FEAT_SM4 — ID_AA64ISAR0_EL1[43:40] ≥ 1 (SM4E, SM4EKEY)
/// Others: Not available.
const SM3_SM4 = 1 << 5;
/// CRC32C hardware instruction (single-instruction CRC32C of 1/2/4/8 bytes).
///
/// x86-64: SSE4.2 — CPUID leaf 1 ECX[20]; CRC32 instruction family
/// AArch64: FEAT_CRC32 — ID_AA64ISAR0_EL1[19:16] ≥ 1
/// CRC32CB, CRC32CH, CRC32CW, CRC32CX
/// ARMv7: ID_ISAR5 CRC32 field ≥ 1 (Cortex-A32 with CRC option)
/// RISC-V: Zbkc (CLMUL-based polynomial equivalent) or Zbc
/// PPC64LE: vpmsumd-based CRC32C (POWER8+; ~4 instructions, not single-insn)
/// Set the bit; the implementation is equivalent in throughput.
const CRC32C = 1 << 6;
/// Hardware true random number generator (non-deterministic entropy source).
/// The implementation guarantees cryptographic-quality randomness per
/// the relevant ISA specification; the kernel uses it to seed its CSPRNG.
///
/// x86-64: RDRAND — CPUID leaf 1 ECX[30] (Ivy Bridge+, Zen+)
/// RDSEED — CPUID leaf 7 EBX[18] (Broadwell+, Zen+); prefer RDSEED
/// AArch64: FEAT_RNG — ID_AA64ISAR0_EL1[63:60] ≥ 1 (RNDR, RNDRRS registers)
/// RISC-V: Zkr — `seed` CSR (ratified; entropy source with WAIT/ES16/BIST)
/// PPC64LE: DARN (Deliver A Random Number) — POWER ISA 3.0 (POWER9+)
/// PPC32: Not available.
/// ARMv7: Not available in hardware; software CSPRNG only.
const HW_RNG = 1 << 7;
/// Vectorised AES: multiple AES blocks per instruction.
/// Requires AES_BLOCK. Provides ≥4× throughput improvement for bulk crypto.
///
/// x86-64: VAES — CPUID leaf 7 ECX[9]; 256-bit forms require AVX,
/// 512-bit forms require AVX-512F (or AVX10). VAESENC/VAESENCLAST
/// operate on 4 blocks (YMM) or 8 blocks (ZMM) per instruction.
/// Others: Not separately available; use AES_BLOCK + simd_width_bytes
/// (wider SIMD → more blocks per iteration in software pipelining).
const VAES = 1 << 8;
/// Poly1305 / GHASH acceleration via vector-polynomial multiply.
/// Enables ChaCha20-Poly1305 in near-hardware speed on supported platforms.
///
/// AArch64: FEAT_PMULL covers GHASH; Poly1305 reduction uses PMULL too.
/// Already covered by CLMUL bit; no separate bit needed.
/// x86-64: VPCLMULQDQ covers both GCM and Poly1305 field arithmetic.
/// Already covered by CLMUL bit.
/// This constant is reserved for future ISAs that add dedicated Poly1305 insns.
const POLY1305_HW = 1 << 9;
}
}
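Because each `CryptoCaps` bit is independently meaningful, combination checks live in the caller, as the AES-GCM example in the docs notes. A minimal sketch of such a caller-side selection, with local bit constants mirroring the definitions above and an illustrative (not actual) enum of implementation choices:

```rust
// Bit positions mirror CryptoCaps::AES_BLOCK and CryptoCaps::CLMUL.
const AES_BLOCK: u32 = 1 << 0;
const CLMUL: u32 = 1 << 1;

#[derive(Debug, PartialEq)]
enum AesGcmImpl {
    HwFull,       // hardware AES block cipher + hardware GHASH
    HwAesSwGhash, // hardware AES, table-based software GHASH
    Soft,         // constant-time software fallback
}

/// Full hardware AES-GCM needs BOTH AES_BLOCK and CLMUL set.
fn select_aes_gcm(crypto: u32) -> AesGcmImpl {
    match (crypto & AES_BLOCK != 0, crypto & CLMUL != 0) {
        (true, true) => AesGcmImpl::HwFull,
        (true, false) => AesGcmImpl::HwAesSwGhash,
        _ => AesGcmImpl::Soft,
    }
}

fn main() {
    assert_eq!(select_aes_gcm(AES_BLOCK | CLMUL), AesGcmImpl::HwFull);
    assert_eq!(select_aes_gcm(AES_BLOCK), AesGcmImpl::HwAesSwGhash);
    assert_eq!(select_aes_gcm(0), AesGcmImpl::Soft);
}
```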
bitflags! {
/// Atomic operation extensions beyond each architecture's baseline.
/// Baseline (always available per supported architecture) is NOT listed:
/// 64-bit CAS is baseline on x86-64, AArch64 (LL/SC), RISC-V (A), PPC64.
pub struct AtomicCaps: u16 {
/// 128-bit atomic compare-and-swap (double-word CAS).
///
/// x86-64: CMPXCHG16B — CPUID leaf 1 ECX[13]. Present on all x86-64 since
/// 2006; treat as always-set after BSP init confirms it.
/// AArch64: FEAT_LSE2 — ID_AA64MMFR2_EL1[35:32] ≥ 1 (Armv8.4+)
/// Single-copy-atomic 128-bit loads (LDP) and stores (STP)
/// RISC-V: Not yet standardised as true atomic 128-bit CAS (64-bit LR/SC pairs
/// can emulate it, but not atomically visible to all harts). Bit=0.
/// PPC64: lqarx/stqcx. — POWER8+ (POWER ISA 2.07); quad-word LL/SC
const CAS128 = 1 << 0;
/// ARM Large System Extensions: single-instruction atomics without LL/SC loop.
/// Eliminates retry loops for CAS, swap, and fetch-add on AArch64.
///
/// AArch64: FEAT_LSE — ID_AA64ISAR0_EL1[23:20] ≥ 2 (Armv8.1+)
/// CAS, CASP, SWP, LDADD, LDCLR, LDSET, LDEOR and their variants
/// Others: Not applicable (set only on AArch64).
const ARM_LSE = 1 << 1;
/// RISC-V Zabha: byte and halfword atomic instructions.
/// Enables lock-free per-byte atomic operations (useful for flag arrays).
///
/// RISC-V: Zabha extension (ratified 2024); AMOADD.B, AMOOR.B, etc.
const RV_ZABHA = 1 << 2;
}
}
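A sketch of how a consumer might gate a 16-byte lock-free structure on `CAS128`, falling back to a seqlock-style scheme when the bit is absent (as on RISC-V). The bit position mirrors `AtomicCaps::CAS128`; the strategy names are illustrative:

```rust
// Mirrors AtomicCaps::CAS128.
const CAS128: u16 = 1 << 0;

/// Choose how to update a 128-bit descriptor slot atomically.
fn pick_slot_update_strategy(atomics: u16) -> &'static str {
    if atomics & CAS128 != 0 {
        "dwcas"   // single double-word CAS (CMPXCHG16B / CASP / lqarx+stqcx.)
    } else {
        "seqlock" // writer lock + sequence counter; readers retry on odd seq
    }
}

fn main() {
    assert_eq!(pick_slot_update_strategy(CAS128), "dwcas");
    assert_eq!(pick_slot_update_strategy(0), "seqlock"); // e.g. RISC-V
}
```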
bitflags! {
/// Fast intra-process isolation extensions used by the driver tier model.
/// These enable Tier 1 (Ring 0, hardware-domain-isolated) driver operation.
/// The scheduler and isolation subsystem query these bits to decide whether
/// a driver can be elevated to Tier 1 or must fall back to Tier 2.
pub struct IsolationCaps: u16 {
/// x86-64 Memory Protection Keys — WRPKRU/RDPKRU (~23 cycles).
/// CPUID leaf 7 ECX[3] (MPK for user pages, Skylake+).
/// Extended: PKS (supervisor pages) — CPUID leaf 7 ECX[31], Ice Lake+.
const MPK = 1 << 0;
/// AArch64 Permission Overlay Extension — MSR POR_EL0 (~40–80 cycles).
/// FEAT_S1POE (Permission Overlay Extension) — ID_AA64MMFR3_EL1 fields (Armv8.9/Armv9.4+).
/// Note: FEAT_S1PIE (Permission Indirection Extension) is a separate, unrelated feature.
const POE = 1 << 1;
/// ARMv7 Domain Access Control Register — MCR p15 DACR + ISB (~30–40 cycles).
/// Always present on ARMv7-A; set unconditionally during ARMv7 boot.
const DACR = 1 << 2;
/// RISC-V Physical Memory Protection (PMP) — no fast PKRU equivalent.
/// Context-switch cost ~200–500 cycles (full page-table switch).
/// Tier 1 on RISC-V degrades to Tier 0 until a RISC-V fast-isolation ISA
/// extension is ratified. Flag is always 0 on RISC-V; included for
/// completeness so cross-platform code can query uniformly.
const RV_PMP_ONLY = 1 << 3;
/// PPC32 segment registers — mtsr/mfsr (~10–30 cycles).
/// Always present on PPC32; set unconditionally.
const PPC32_SEGS = 1 << 4;
/// PPC64LE Radix PID — mtspr PIDR (~30–60 cycles, POWER9+).
/// Enables per-process page-table root switching without full TLB flush.
const PPC64_RADIX_PID = 1 << 5;
/// ARM Memory Tagging Extension — available for heap safety.
/// AArch64: FEAT_MTE2 — ID_AA64PFR1_EL1[11:8] ≥ 2 (preferred; async tagging)
/// FEAT_MTE — ID_AA64PFR1_EL1[11:8] ≥ 1 (synchronous only)
const MTE = 1 << 6;
/// x86-64 Control-flow Enforcement Technology: Shadow Stack.
/// CPUID leaf 7 ECX[7] (CET-SS, Tiger Lake+).
/// Enables hardware-enforced return-address integrity for kernel code.
const CET_SS = 1 << 7;
/// AArch64 Pointer Authentication.
/// FEAT_PAUTH — ID_AA64ISAR1_EL1[11:4] ≥ 1; FEAT_PAUTH2 preferred (≥ 3).
/// Used for kernel return-address signing (PACIASP/AUTIASP).
const PAUTH = 1 << 8;
/// s390x Storage Keys — per-page 4-bit key + fetch/store protection.
/// Always present on z/Architecture. NOT equivalent to MPK/POE: keys
/// protect individual pages, not memory domains, and key changes require
/// privileged SSK instruction + IPTE. Too coarse and too expensive for
/// Tier 1 isolation. Tier 1 is unavailable on s390x; drivers use Tier 0 or Tier 2.
/// Included for completeness so cross-platform code can query uniformly.
const S390X_STORAGE_KEYS = 1 << 9;
/// LoongArch has no in-process memory protection domain mechanism
/// equivalent to x86 MPK or ARM POE. Tier 1 is unavailable on
/// LoongArch; drivers use Tier 0 or Tier 2 (IOMMU + process
/// isolation) depending on licensing and admin policy.
/// Flag is always 0 on LoongArch; included for uniform querying.
const LOONGARCH_NONE = 1 << 10;
/// x86-64 Indirect Branch Tracking (CET-IBT).
/// CPUID leaf 7 EDX[20] (IBT, Tiger Lake+, Zen 3+).
/// All indirect branch targets must begin with ENDBR64.
/// Used by KABI driver loader to validate driver entry points.
const CET_IBT = 1 << 11;
}
}
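The tier decision described in the `IsolationCaps` docs can be sketched as follows. Bit positions mirror the definitions above; the policy function, the `DriverTier` enum, and the `iommu_present` parameter are simplifications for illustration, not the kernel's actual isolation API:

```rust
// Bits mirror IsolationCaps; only the fast in-process primitives qualify
// a driver for Tier 1 (Ring 0, hardware-domain-isolated).
const MPK: u16 = 1 << 0;             // x86-64 protection keys
const POE: u16 = 1 << 1;             // AArch64 permission overlays
const DACR: u16 = 1 << 2;            // ARMv7 domains
const PPC32_SEGS: u16 = 1 << 4;      // PPC32 segment registers
const PPC64_RADIX_PID: u16 = 1 << 5; // POWER9+ Radix PID switching

#[derive(Debug, PartialEq)]
enum DriverTier { Tier0, Tier1, Tier2 }

fn max_driver_tier(isolation: u16, iommu_present: bool) -> DriverTier {
    let fast = MPK | POE | DACR | PPC32_SEGS | PPC64_RADIX_PID;
    if isolation & fast != 0 {
        DriverTier::Tier1 // fast domain switch available in-kernel
    } else if iommu_present {
        DriverTier::Tier2 // IOMMU + process isolation fallback
    } else {
        DriverTier::Tier0 // trusted in-kernel only
    }
}

fn main() {
    assert_eq!(max_driver_tier(MPK, true), DriverTier::Tier1);
    assert_eq!(max_driver_tier(0, true), DriverTier::Tier2);  // e.g. s390x
    assert_eq!(max_driver_tier(0, false), DriverTier::Tier0); // no fallback
}
```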
bitflags! {
/// Virtualisation hosting capabilities.
pub struct VirtCaps: u8 {
/// Can host virtual machines (full hypervisor mode).
///
/// x86-64: VMX — CPUID leaf 1 ECX[5] (Intel VT-x, Pentium 4 Prescott+)
/// SVM — CPUID leaf 0x80000001 ECX[2] (AMD-V, Athlon 64+)
/// AArch64: EL2 — ID_AA64PFR0_EL1[11:8] ≥ 1; FEAT_VHE — ID_AA64MMFR1_EL1[11:8] ≥ 1
/// RISC-V: H extension — misa[H] bit (ratified 2021, privileged spec v1.12)
/// PPC64: Hypervisor mode always present on server-class POWER cores.
/// PPC32: BookE hypervisor extensions (e500mc).
const HOST_VIRT = 1 << 0;
/// Nested virtualisation (hypervisor running inside a hypervisor).
///
/// AArch64: FEAT_NV2 — ID_AA64MMFR2_EL1[27:24] >= 2 (Armv8.4-A+)
/// x86-64: VMCS shadowing — IA32_VMX_PROCBASED_CTLS2 bit 14 (Haswell+)
/// RISC-V: Two-level H extension nesting (experimental; not yet ratified)
const NESTED_VIRT = 1 << 1;
/// Interrupt virtualisation (hardware GIC/APIC virtualisation).
///
/// AArch64: GICv3+ virtual CPU interface (ICH_* list registers)
/// x86-64: APICv (Intel VMX posted interrupts, Haswell+)
/// AVIC (AMD advanced virtual interrupt controller, Zen+)
const VIRT_IRQ = 1 << 2;
}
}
2.16.2.3 Firmware and Platform Capabilities¶
bitflags! {
/// Firmware interfaces and platform capabilities available at runtime.
/// Populated during boot by probing each interface. Errata workarounds
/// that require firmware cooperation (e.g., ARM SMCCC mitigations) check
/// these bits before attempting the firmware call path; if the firmware
/// lacks support, a software-only fallback is used (or the erratum is
/// logged as unmitigatable).
pub struct FirmwareCaps: u32 {
// ── ARM SMCCC (Secure Monitor Call Calling Convention) ────────
/// ARCH_WORKAROUND_1 — Spectre v2 mitigation via firmware.
/// Probed via SMCCC_ARCH_FEATURES (function ID 0x8000_0001) with
/// arg 0x8000_8000 (the ARCH_WORKAROUND_1 function ID).
const SMCCC_WORKAROUND_1 = 1 << 0;
/// ARCH_WORKAROUND_2 — Spectre v4 (SSBS) mitigation via firmware.
const SMCCC_WORKAROUND_2 = 1 << 1;
/// ARCH_WORKAROUND_3 — Spectre-BHB mitigation via firmware.
const SMCCC_WORKAROUND_3 = 1 << 2;
/// PSCI (Power State Coordination Interface) — CPU on/off/suspend.
/// Required for SMP bringup on ARM/ARMv7.
const PSCI = 1 << 3;
/// SMCCC SOC_ID — SoC identification for platform quirk matching.
const SMCCC_SOC_ID = 1 << 4;
// ── PowerPC firmware interfaces ──────────────────────────────
/// OPAL (Open Power Abstraction Layer) — bare-metal PowerNV.
/// Provides OPAL calls for hardware management.
const OPAL = 1 << 8;
/// RTAS (Run-Time Abstraction Services) — pseries/KVM.
/// Provides rtas_call() for device management, error logging.
const RTAS = 1 << 9;
/// Hardware count cache flush on POWER9 DD2.3+ (vs software flush).
const PPC_HW_COUNT_FLUSH = 1 << 10;
/// Hardware STF barrier on POWER9+ (vs software nop sequence).
const PPC_HW_STF_BARRIER = 1 << 11;
// ── s390x firmware interfaces ────────────────────────────────
/// SCLP (Service Call Logical Processor) — early console, memory
/// discovery, event handling. The only early I/O path on s390x.
const SCLP = 1 << 16;
/// DIAG (Diagnose) instructions — z/VM guest services.
const DIAG = 1 << 17;
/// SIE (Start Interpretive Execution) — s390x KVM hosting.
const SIE = 1 << 18;
// ── x86 firmware interfaces ──────────────────────────────────
/// UEFI Runtime Services available after ExitBootServices().
const UEFI_RUNTIME = 1 << 24;
// ── LoongArch firmware interfaces ────────────────────────────
/// BPI (Boot and Peripheral Interface) — LoongArch boot services.
const LOONGARCH_BPI = 1 << 28;
}
}
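The probe-then-fallback pattern described in the `FirmwareCaps` docs can be sketched as a three-way decision: prefer the firmware call path, degrade to a software-only sequence, or log the erratum as unmitigatable. The strategy function and `Mitigation` enum are illustrative names:

```rust
// Mirrors FirmwareCaps::SMCCC_WORKAROUND_1.
const SMCCC_WORKAROUND_1: u32 = 1 << 0;

#[derive(Debug, PartialEq)]
enum Mitigation { FirmwareCall, SoftwareOnly, Unmitigated }

/// Pick the Spectre v2 mitigation path for an affected ARM core.
fn spectre_v2_strategy(firmware: u32, sw_sequence_known: bool) -> Mitigation {
    if firmware & SMCCC_WORKAROUND_1 != 0 {
        Mitigation::FirmwareCall  // firmware invalidates branch predictor
    } else if sw_sequence_known {
        Mitigation::SoftwareOnly  // per-core software BP invalidation
    } else {
        Mitigation::Unmitigated   // logged at boot as unmitigatable
    }
}

fn main() {
    assert_eq!(spectre_v2_strategy(SMCCC_WORKAROUND_1, true), Mitigation::FirmwareCall);
    assert_eq!(spectre_v2_strategy(0, true), Mitigation::SoftwareOnly);
    assert_eq!(spectre_v2_strategy(0, false), Mitigation::Unmitigated);
}
```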
2.16.2.4 Errata and Workaround Flags¶
/// Known CPU errata requiring kernel workarounds.
///
/// **Aggregation rule**: the universal errata set is the UNION (OR) of
/// all per-CPU errata flags, not the intersection. If ANY CPU has the
/// bug, the workaround must be active system-wide because kthreads
/// migrate. This is the opposite of capability flags (AND).
///
/// Errata are split into two layers:
/// - `SpecMitigations`: cross-architecture speculation vulnerability classes.
/// Used by arch-independent code (syscall dispatch, eBPF verifier) to ask
/// "does this system need Spectre v1 mitigation?" without caring which arch.
/// - Per-architecture bitfields: detailed errata specific to one ISA. Only the
/// field for the running architecture is populated; others are zero. Field
/// widths are sized per architecture (128 bits for x86-64 and AArch64,
/// narrower for the rest), with headroom for newly discovered errata.
///
/// Per-CPU errata queries are supported: `this_cpu_has_errata()` checks the
/// calling CPU's errata flags, enabling per-core workarounds on big.LITTLE
/// and hybrid systems without penalizing unaffected cores in hot paths.
pub struct ErrataCaps {
/// Cross-architecture speculation vulnerability flags.
pub spec: SpecMitigations,
/// Architecture-specific errata. Only the field matching the compile-time
/// target architecture is populated; all others are zero.
pub x86: X86Errata,
pub aarch64: Aarch64Errata,
pub armv7: Armv7Errata,
pub riscv: RiscvErrata,
pub ppc: PpcErrata,
pub s390x: S390xErrata,
pub loongarch: LoongArchErrata,
}
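The two query granularities mentioned above, system-wide union versus per-CPU, can be sketched as below. The structure and function names are simplified illustrations of the `this_cpu_has_errata()` idea, not the kernel's actual accessors:

```rust
/// Simplified per-CPU errata record (one u64 word instead of ErrataCaps).
struct CpuErrata { bits: u64 }

/// System-wide view: the OR-union over all CPUs (slow paths, boot policy).
fn system_union(cpus: &[CpuErrata]) -> u64 {
    cpus.iter().fold(0, |acc, c| acc | c.bits)
}

/// Per-CPU view: lets unaffected cores skip a workaround in hot paths
/// on big.LITTLE / hybrid systems.
fn this_cpu_has_errata(cpus: &[CpuErrata], cpu_id: usize, bit: u64) -> bool {
    cpus[cpu_id].bits & bit != 0
}

fn main() {
    const BUG: u64 = 1 << 3;
    // Only the LITTLE core (cpu 0) is affected.
    let cpus = [CpuErrata { bits: BUG }, CpuErrata { bits: 0 }];
    assert!(system_union(&cpus) & BUG != 0);     // workaround registered system-wide
    assert!(this_cpu_has_errata(&cpus, 0, BUG)); // affected core takes slow path
    assert!(!this_cpu_has_errata(&cpus, 1, BUG)); // big core skips it
}
```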
bitflags! {
/// Cross-architecture speculation vulnerability classes.
/// Set if the CPU is affected regardless of architecture.
pub struct SpecMitigations: u32 {
/// Spectre v1 (bounds check bypass). All OoO processors.
/// Mitigation: array_index_nospec(), LFENCE/CSDB/FENCE.
const SPECTRE_V1 = 1 << 0;
/// Spectre v2 (branch target injection). All OoO processors.
/// Mitigation: retpoline/eIBRS/AutoIBRS/CSV2/expolines.
const SPECTRE_V2 = 1 << 1;
/// Meltdown (rogue data cache load). Intel pre-Ice Lake, A75, POWER7-9.
/// Mitigation: KPTI / RFI flush / lightweight TLBI (A510/A520).
const MELTDOWN = 1 << 2;
/// Spectre v4 (speculative store bypass). Cross-architecture.
/// Mitigation: SSBD MSR / SSBS / prctl per-thread opt-in.
const SPECTRE_V4_SSB = 1 << 3;
/// Spectre-BHB / BHI (branch history buffer/injection).
/// Intel eIBRS CPUs + ARM Cortex-A57+.
/// Mitigation: BHI_DIS_S / software BHB clearing / CLEARBHB / SMCCC.
const SPECTRE_BHB = 1 << 4;
}
}
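For `SPECTRE_V1`, the mitigation named above is `array_index_nospec()`. A portable Rust approximation of the idea: compute a mask arithmetically so the clamp compiles to branchless code with nothing for the CPU to mispredict. The real kernel helper additionally uses per-architecture barriers (LFENCE/CSDB/FENCE); this sketch shows only the masking logic:

```rust
/// Clamp `index` to 0 when it is out of bounds, without a conditional
/// branch: the comparison result (0 or 1) is negated into an all-zeros
/// or all-ones mask and ANDed with the index.
fn array_index_nospec(index: usize, size: usize) -> usize {
    let mask = ((index < size) as usize).wrapping_neg();
    index & mask
}

fn main() {
    assert_eq!(array_index_nospec(3, 8), 3); // in bounds: unchanged
    assert_eq!(array_index_nospec(9, 8), 0); // out of bounds: clamped to 0
    assert_eq!(array_index_nospec(8, 8), 0); // boundary is also out of bounds
}
```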
bitflags! {
/// x86-64 specific errata. 128 bits.
pub struct X86Errata: u128 {
// ── Speculative execution (non-Spectre class) ────────────────
/// MDS (Microarchitectural Data Sampling). VERW at kernel→user.
const MDS = 1 << 0;
/// L1TF (L1 Terminal Fault). PTE inversion + VMX cache flush.
const L1TF = 1 << 1;
/// TAA (TSX Async Abort). TSX disable or VERW.
const TAA = 1 << 2;
/// SRBDS. RDRAND/RDSEED serialisation.
const SRBDS = 1 << 3;
/// MMIO Stale Data. VERW at VMX transitions.
const MMIO_STALE = 1 << 4;
/// GDS / Downfall. Microcode fix or gather disable.
const GDS = 1 << 5;
/// RFDS (Register File Data Sampling). VERW on E-cores.
const RFDS = 1 << 6;
/// BHI (Branch History Injection on eIBRS). BHI_DIS_S or sw clear.
const BHI = 1 << 7;
/// SRSO / Inception (AMD Zen 1-4). Safe-ret thunks.
const SRSO = 1 << 8;
/// ITS / Training Solo (Intel Skylake-X through Tiger Lake).
/// Indirect branch alignment + IPU 2025.1+ microcode.
const ITS = 1 << 9;
/// BPRC / Branch Privilege Injection (Intel 7th gen+).
/// Microcode update; fallback to retpoline if below fix version.
const BPRC = 1 << 10;
/// Reptar (CVE-2023-23583). Redundant REX prefix on FSRM CPUs.
/// Microcode check at boot; refuse to boot if below fix version.
const REPTAR = 1 << 11;
/// AMD IBPB does not flush RSB (Zen 1-3). 32-entry RSB fill after IBPB.
const IBPB_NO_RSB = 1 << 12;
// ── AMD-specific ─────────────────────────────────────────────
/// AMD Zen 4 erratum #1485. Spurious #UD with STIBP disabled.
/// Unconditional MSR fix at boot (correctness, not mitigation).
const ZEN4_SPURIOUS_UD = 1 << 16;
/// AMD Zen 5 RDSEED 16/32-bit returns zero (CVE-2025-62626).
/// Use 64-bit RDSEED only, or fall back to RDRAND.
const ZEN5_RDSEED = 1 << 17;
/// AMD Zen 2 RDRAND returns 0xFFFFFFFF. Disable RDRAND/RDSEED.
const ZEN2_RDRAND = 1 << 18;
/// AMD Erratum 793 (Family 16h). LOCK + WC memory hang.
const AMD_LOCK_WC_HANG = 1 << 19;
/// AMD AVIC IPI missed wakeups (Zen 1/Zen 2, erratum #1235).
const AVIC_IPI_ZEN12 = 1 << 20;
/// AMD SVM nested AVIC validation bypass (CVE-2021-3653).
const SVM_AVIC_BYPASS = 1 << 21;
/// AMD SEV-SNP cache coherency on page conversion (CVE-2024-36331).
const SEV_SNP_CACHE = 1 << 22;
/// AMD microcode signature verification weakness (CVE-2024-56161).
const AMD_UCODE_SIG = 1 << 23;
// ── Intel-specific ───────────────────────────────────────────
/// AEPIC Leak (CVE-2022-21233). xAPIC MMIO stale data.
/// Force x2APIC mode on affected CPUs.
const AEPIC_LEAK = 1 << 32;
/// MWAIT wakeup failure (Apollo Lake, ICX, Lunar Lake).
/// IPI fallback for idle wakeup.
const MWAIT_BROKEN = 1 << 33;
/// TSC deadline timer broken without microcode (Haswell-Kaby Lake).
/// Fallback to LAPIC one-shot.
const TSC_DEADLINE_BROKEN = 1 << 34;
/// TSC stops in deep C-states (pre-Nehalem, some AMD).
const TSC_C3STOP = 1 << 35;
/// AMD LAPIC timer stops in C1E (Erratum 400).
const LAPIC_C1E = 1 << 36;
/// HPET stops in PC10 (Coffee Lake, Ice Lake, Bay Trail).
const HPET_PC10 = 1 << 37;
/// Skylake MOVNTDQA passes MFENCE/LOCK (SKL079/SKL155).
const SKL_NT_ORDERING = 1 << 38;
/// AVX-512 frequency throttling (Skylake-SP through Ice Lake).
const AVX512_LICENSE = 1 << 39;
/// XFD per-CPU cache desync on CPU hotplug (SPR+, CVE-2024-35801).
const XFD_HOTPLUG = 1 << 40;
/// SPR TILEDATA corruption after faulting XRSTOR (SPR4).
const SPR_TILEDATA = 1 << 41;
/// KVM AMX XFD host/guest value confusion (SPR+).
const AMX_XFD_KVM = 1 << 42;
/// LAM without LASS enables SLAM transient execution attack.
const LAM_SLAM = 1 << 43;
/// Bay Trail/Cherry Trail C6 freeze.
const BAYTRAIL_CSTATE = 1 << 44;
/// VMX preemption timer value-1 failure (SPR).
const VMX_PREEMPT_SPR = 1 << 45;
/// Shadow VMCS preemption corruption.
const SHADOW_VMCS_PREEMPT = 1 << 46;
/// Intel hybrid: P-core vs E-core ISA/perf asymmetry.
const HYBRID_ASYMMETRIC = 1 << 47;
/// PCID INVLPG fails to flush Global entries (Alder/Raptor Lake).
const PCID_INVLPG_GLOBAL = 1 << 48;
// ── Architectural (all x86) ──────────────────────────────────
/// NMI IRET re-enables NMI prematurely. Triple-save mechanism.
const NMI_IRET_REENTER = 1 << 56;
/// Split lock / bus lock DoS. Detection on Ice Lake+.
const SPLIT_LOCK = 1 << 57;
/// SMI disturbance during TSC calibration.
const SMI_TSC_CALIBRATION = 1 << 58;
/// FRED late architecture change — feature-gated until validated.
const FRED_UNSTABLE = 1 << 59;
// ── CET/security ─────────────────────────────────────────────
/// CET-IBT ENDBR validation needed for KABI driver loading.
const CET_IBT_COMPAT = 1 << 60;
/// CR4 security bits must be pinned after boot.
const CR4_PIN = 1 << 61;
/// UMIP emulation needed for legacy apps (Wine, DOSEMU2).
const UMIP_EMULATE = 1 << 62;
// ── SGX/TDX ─────────────────────────────────────────────────
/// TDX Heckler attack: malicious hypervisor injects interrupts
/// into TDX guests to influence kernel execution paths.
/// All host-provided values (MMIO, port I/O, MSRs, CPUIDs) must
/// pass validation; interrupt injection filtered against expected vectors.
const TDX_HECKLER = 1 << 64;
/// INTEL-SA-00837: unauthorized error injection in SGX/TDX
/// enables privilege escalation. Boot-time microcode version
/// validation required; disable SGX/TDX below minimum safe version.
const SGX_TDX_ERROR_INJ = 1 << 65;
/// AMX tile uninitialized state prevents deep C-states (C6+).
/// Idle path must INIT tile state before requesting deep C-states.
const AMX_TILE_CSTATE = 1 << 66;
}
}
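Several of the x86 entries above (BPRC, REPTAR, GDS) are gated on the `microcode_revision` field: the software workaround is applied only while the loaded microcode is older than the fixing revision. A sketch of that gate, with an illustrative table entry (the revision numbers are made up, not real erratum data):

```rust
/// One row of a hypothetical errata database.
struct ErratumEntry {
    /// Minimum microcode revision that fixes this erratum; 0 = no ucode fix.
    fixed_in_ucode: u64,
}

/// The software workaround is needed only when the model is affected AND
/// the loaded microcode predates the fix (or no microcode fix exists).
fn needs_sw_workaround(e: &ErratumEntry, model_affected: bool, ucode_rev: u64) -> bool {
    model_affected && (e.fixed_in_ucode == 0 || ucode_rev < e.fixed_in_ucode)
}

fn main() {
    let e = ErratumEntry { fixed_in_ucode: 0x100 };
    assert!(needs_sw_workaround(&e, true, 0x0ff));   // old microcode: keep workaround
    assert!(!needs_sw_workaround(&e, true, 0x100));  // at/above fix: skip it
    assert!(!needs_sw_workaround(&e, false, 0x0ff)); // model not affected at all
    let unfixable = ErratumEntry { fixed_in_ucode: 0 };
    assert!(needs_sw_workaround(&unfixable, true, u64::MAX)); // always mitigate
}
```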
bitflags! {
/// AArch64 specific errata. 128 bits.
pub struct Aarch64Errata: u128 {
// ── Cortex-A53 ───────────────────────────────────────────────
/// 843419: ADRP incorrect address. Linker workaround required.
const A53_843419 = 1 << 0;
/// 819472/826319/827319/824069: cache maintenance insufficient.
/// Upgrade dc cvac/cvau → dc civac via alternatives.
const A53_CACHE_MAINT = 1 << 1;
/// 845719: AArch32 virtual address aliasing (stale data).
const A53_845719 = 1 << 2;
// ── Cortex-A55 ───────────────────────────────────────────────
/// 1024718: broken hardware DBM (FEAT_HAFDBS). Disable DBM.
const A55_1024718 = 1 << 3;
/// 1530923: speculative AT during guest switch (VHE).
const A55_1530923 = 1 << 4;
/// 2441007: TLBI race on break-before-make. Repeat TLBI+DSB.
const A55_2441007 = 1 << 5;
// ── Cortex-A57 ───────────────────────────────────────────────
/// 832075: exclusive + device load deadlock. Load-acquire for MMIO.
const A57_832075 = 1 << 6;
/// 834220: false Stage-2 fault. KVM must check Stage-1 first.
const A57_834220 = 1 << 7;
// ── Cortex-A57/A72 shared ────────────────────────────────────
/// 1742098: AES ELR corruption in AArch32. Hide AES hwcap.
const A57_A72_AES_ELR = 1 << 8;
/// 1319367/1319537: speculative AT TLB corruption.
const A57_A72_SPEC_AT = 1 << 9;
// ── Cortex-A76 ───────────────────────────────────────────────
/// 1165522: speculative AT TLB corruption (VHE). Mandate VHE.
const A76_1165522 = 1 << 10;
/// 1286807: TLBI ordering violation. Repeat TLBI+DSB.
const A76_1286807 = 1 << 11;
/// 1463225: software step blocks interrupts.
const A76_1463225 = 1 << 12;
// ── Cortex-A77 ───────────────────────────────────────────────
/// 1508412: NC/device load + store-exclusive deadlock.
/// DMB SY around PAR_EL1 + firmware cooperation required.
const A77_1508412 = 1 << 13;
// ── Cortex-A510/A520 ─────────────────────────────────────────
/// 2051678: broken dirty bit ordering on early A510.
const A510_2051678 = 1 << 14;
/// 2077057: SPSR_EL2 corruption on PAC trap.
const A510_2077057 = 1 << 15;
/// 2658417: BF16/VMMLA incorrect results. Hide BF16 hwcap.
const A510_2658417 = 1 << 16;
/// 3117295: speculative unprivileged load leak (A510).
/// TLBI before return to EL0.
const A510_3117295 = 1 << 17;
/// 2966298: speculative unprivileged load leak (A520 <r0p2).
const A520_2966298 = 1 << 18;
// ── Cortex-A715 ──────────────────────────────────────────────
/// 2645198: ESR_ELx/FAR_ELx corruption on exec→non-exec change.
/// Break-before-make mandatory for permission transitions.
const A715_2645198 = 1 << 19;
// ── Neoverse ─────────────────────────────────────────────────
/// N1 1542419: stale instruction execution (DIC). Hide DIC bit.
const N1_1542419 = 1 << 20;
// ── Cross-core errata ────────────────────────────────────────
/// 3194386 family: MSR SSBS not self-synchronizing.
/// Nearly ALL modern ARM cores (A76 through X4, N1-N3, V1-V3).
/// Insert SB/ISB after every SSBS modification.
const SSBS_3194386 = 1 << 24;
/// LSE atomics double page fault on per-CPU paths.
/// Use LL/SC fallback for initial per-CPU access.
const LSE_DOUBLE_FAULT = 1 << 25;
/// LSE STADD/STCLR/STSET far-execute performance catastrophe.
/// Use LDADD/LDCLR/LDSET with XZR destination instead.
const LSE_STADD_PERF = 1 << 26;
/// SVE context switch race condition (software design issue).
const SVE_CTX_RACE = 1 << 27;
/// SME-only systems: signal frame space not allocated.
const SME_SIGNAL_FRAME = 1 << 28;
/// SVE signal context restore corruption with SME.
const SVE_SIGNAL_RESTORE = 1 << 29;
/// TikTag: speculative MTE tag leak. MTE is defense-in-depth only.
const TIKTAG_MTE = 1 << 30;
// ── GIC errata ───────────────────────────────────────────────
/// GIC-700 2941627: SPI deactivation race on affinity migration.
const GIC700_2941627 = 1 << 32;
/// GIC-700 2195890: LPI delivery stall. Periodic INVLPIR heartbeat.
const GIC700_2195890 = 1 << 33;
// ── Vendor-specific (Cavium/Marvell) ─────────────────────────
/// ThunderX 23144: cross-NUMA ITS SYNC hang.
const THUNDER_23144 = 1 << 40;
/// ThunderX 38539: GICD_TYPER2 access abort.
const THUNDER_38539 = 1 << 41;
/// ThunderX 23154: ICC_IAR1_EL1 not synchronized.
const THUNDER_23154 = 1 << 42;
/// ThunderX 27456: broadcast TLBI corrupts icache.
const THUNDER_27456 = 1 << 43;
/// ThunderX2 219: PRFM after TTBR change. Trap guest TTBR writes.
const THUNDERX2_219 = 1 << 44;
/// ThunderX 30115: KVM disables host GIC Group 1 interrupts.
const THUNDERX_30115 = 1 << 45;
// ── Vendor-specific (other) ──────────────────────────────────
/// AmpereOne AC04_CPU_23: HCR_EL2 update corruption.
const AMPERE_AC04_23 = 1 << 48;
/// AmpereOne AC03_CPU_38: unadvertised HAFDBS bugs.
const AMPERE_AC03_38 = 1 << 49;
/// NVIDIA Carmel CNP TLB invalidation semantic difference.
const CARMEL_CNP = 1 << 50;
/// Fujitsu A64FX E010001: spurious undefined fault.
const A64FX_E010001 = 1 << 51;
// ── Timer/counter errata ─────────────────────────────────────
/// A73 858921: counter read non-atomic across bit-32 boundary.
const A73_858921 = 1 << 56;
/// Freescale A008585: unstable counter reads.
const FREESCALE_A008585 = 1 << 57;
/// CNTFRQ misprogrammed by firmware on some platforms.
const CNTFRQ_MISPROG = 1 << 58;
/// Allwinner A64 timer instability.
const ALLWINNER_TIMER = 1 << 59;
// ── Platform errata ──────────────────────────────────────────
/// Rockchip RK3588 GIC-600 shareability broken.
const RK3588_GIC = 1 << 60;
/// APM X-Gene non-ECAM compliant PCIe.
const XGENE_ECAM = 1 << 61;
/// big.LITTLE SGI RSS mismatch in heterogeneous systems.
const SGI_RSS_MISMATCH = 1 << 62;
}
}
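AArch64 errata like those above are matched against MIDR_EL1 during `detect_features()`. A sketch of the field extraction and revision-window test; the field layout follows the architectural MIDR_EL1 encoding, while the part number and revision window used in the example are illustrative:

```rust
// MIDR_EL1 layout: [31:24] implementer, [23:20] variant, [19:16] architecture,
// [15:4] part number, [3:0] revision.
fn midr_implementer(midr: u32) -> u32 { (midr >> 24) & 0xff }
fn midr_partnum(midr: u32) -> u32 { (midr >> 4) & 0xfff }

/// Combined variant:revision as a single ordinal (rXpY → 0xXY),
/// so revision windows compare with ordinary integer ranges.
fn midr_rev(midr: u32) -> u32 {
    (((midr >> 20) & 0xf) << 4) | (midr & 0xf)
}

/// Does this MIDR fall inside an erratum's affected revision window?
fn midr_in_range(midr: u32, implementer: u32, part: u32, rev_lo: u32, rev_hi: u32) -> bool {
    midr_implementer(midr) == implementer
        && midr_partnum(midr) == part
        && (rev_lo..=rev_hi).contains(&midr_rev(midr))
}

fn main() {
    const ARM: u32 = 0x41;        // implementer code for Arm Ltd.
    const CORTEX_A53: u32 = 0xd03;
    // Construct a Cortex-A53 r0p4 MIDR: variant 0, architecture 0xf, revision 4.
    let midr = (ARM << 24) | (0xf << 16) | (CORTEX_A53 << 4) | 4;
    assert!(midr_in_range(midr, ARM, CORTEX_A53, 0x00, 0x04));  // affected: r0p0–r0p4
    assert!(!midr_in_range(midr, ARM, CORTEX_A53, 0x10, 0x14)); // window r1p0–r1p4 only
}
```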
bitflags! {
/// ARMv7 (32-bit ARM) specific errata.
pub struct Armv7Errata: u32 {
/// Cortex-A15 798181: TLBI+DSB shootdown failure.
/// Requires IPI+DSB workaround for SMP TLB shootdown.
const A15_798181 = 1 << 0;
/// Cortex-A9 742230: DMB broken (faulty store ordering).
/// Set diagnostic register bit[12] before any SMP operation.
const A9_742230 = 1 << 1;
/// Cortex-A15: Spectre v2. BPIALL at context switch.
const A15_SPECTRE_V2 = 1 << 2;
/// GICv2 SGI source CPU encoding in GICC_IAR bits [12:10].
/// Must preserve full IAR value for SGI acknowledgment.
const GICV2_SGI_SOURCE = 1 << 3;
/// Most ARMv7 SoCs lack DMA coherency. Explicit cache maintenance.
const DMA_NON_COHERENT = 1 << 4;
/// No LDAR/STLR: explicit DMB for all Acquire/Release semantics.
const NO_LOAD_ACQUIRE = 1 << 5;
}
}
bitflags! {
/// RISC-V specific errata.
pub struct RiscvErrata: u64 {
// ── T-Head XuanTie C9xx ──────────────────────────────────────
/// GhostWrite (C910/C920): XTheadVector MMU bypass.
/// MUST disable XTheadVector unconditionally — unprivileged
/// physical memory write primitive.
const THEAD_GHOSTWRITE = 1 << 0;
/// Non-standard PTE bit encoding (MAEE). Alternative PTE layout.
const THEAD_MAE = 1 << 1;
/// Non-standard cache management operations (CMO).
/// Pluggable cache op backends required.
const THEAD_CMO = 1 << 2;
/// Store merge buffer delay (WRITE_ONCE). fence w,o after stores.
const THEAD_WRITE_ONCE = 1 << 3;
/// C906 halt-and-catch-fire via XTheadMemIdx.
const THEAD_C906_HALT = 1 << 4;
/// C908 vector instruction permanent halt.
const THEAD_C908_VHALT = 1 << 5;
/// C910 imprecise load access faults (wrong stval).
const THEAD_IMPRECISE_FAULT = 1 << 6;
/// C9xx marchid=0: cannot distinguish C906/C910/C920.
const THEAD_ZERO_MARCHID = 1 << 7;
// ── SiFive ───────────────────────────────────────────────────
/// CIP-453: stval sign extension missing on U54/U74.
const SIFIVE_CIP453 = 1 << 8;
/// Non-coherent DMA on U74/U54. Explicit cache flush required.
const SIFIVE_NON_COHERENT = 1 << 9;
// ── Vector extension ─────────────────────────────────────────
/// RVV state corruption in rt_sigreturn (CVE-2024-35873).
const RVV_SIGRETURN = 1 << 16;
/// RVV 0.7.1 (XTheadVector) vs 1.0 incompatibility.
/// Never advertise standard V on cores with XTheadVector only.
const RVV_071_INCOMPAT = 1 << 17;
// ── Interrupt controller ─────────────────────────────────────
/// APLIC MSI mode: level-sensitive interrupt loss.
/// Re-assertion check mandatory after ISR completion.
const APLIC_LEVEL_MSI = 1 << 24;
// ── Platform ─────────────────────────────────────────────────
/// Svadu A-bit updates may be speculative. Use as hint only.
const SVADU_SPECULATIVE = 1 << 28;
/// Misaligned access slow (trap to M-mode). Probe at boot.
const MISALIGNED_SLOW = 1 << 29;
/// PMP boundary partial store visibility.
const PMP_PARTIAL_STORE = 1 << 30;
/// rdcycle/rdinstret side channels. Disable user access.
const COUNTER_SIDE_CHANNEL = 1 << 31;
// ── Vendor-specific ─────────────────────────────────────────
/// Andes AX45MP: non-standard CMO (predates Zicbom).
const ANDES_CMO = 1 << 32;
/// Smepmp extension absent: PMP cannot restrict M-mode.
const NO_SMEPMP = 1 << 33;
/// SiFive CIP-1200: address-specific sfence.vma unreliable.
/// Fall back to full sfence.vma.
const SIFIVE_CIP1200 = 1 << 34;
/// Ztso extension absent: RVWMO base, fence.tso decodes as
/// full fence rw,rw. Alternatives framework can relax fences
/// when Ztso is confirmed present.
const NO_ZTSO = 1 << 35;
}
}
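The `NO_ZTSO` note above describes fence relaxation via the alternatives framework: under RVWMO the kernel must emit a real release fence, while a confirmed-Ztso core gets release semantics from plain stores. A decision-logic sketch (the instruction strings are illustrative; real patching happens in the alternatives machinery, not via strings):

```rust
// Mirrors RiscvErrata::NO_ZTSO.
const NO_ZTSO: u64 = 1 << 35;

/// Instruction sequence for a store-release, chosen from the errata word.
fn store_release_sequence(errata: u64) -> &'static str {
    if errata & NO_ZTSO != 0 {
        "fence rw,w; sd" // RVWMO baseline: explicit release fence before store
    } else {
        "sd"             // Ztso confirmed: plain store is already ordered
    }
}

fn main() {
    assert_eq!(store_release_sequence(NO_ZTSO), "fence rw,w; sd");
    assert_eq!(store_release_sequence(0), "sd");
}
```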
bitflags! {
/// PowerPC specific errata.
pub struct PpcErrata: u64 {
/// POWER8/9 count cache speculation (Spectre v2 equivalent).
const COUNT_CACHE = 1 << 0;
/// POWER7-9 STF (store-to-load forwarding) barrier needed.
const STF_BARRIER = 1 << 1;
/// POWER7-9 Meltdown: RFI flush (L1D cache flush) at syscall exit.
const RFI_FLUSH = 1 << 2;
/// POWER8+ eieio does NOT order cacheable stores. Use lwsync.
const EIEIO_CACHEABLE = 1 << 3;
/// ldarx/stdcx. reservation granule false sharing (cache line).
const RESERVATION_GRANULE = 1 << 4;
/// PPC32 Book E software TLB management (no hardware walker).
const BOOK_E_SW_TLB = 1 << 5;
/// Some PPC32 Book E SoCs lack DMA coherency.
const DMA_NON_COHERENT = 1 << 6;
/// Timebase frequency from DT only (not hardware-discoverable).
const TB_FREQ_FROM_DT = 1 << 7;
/// PPC32 TBU/TBL 32-bit read rollover.
const TB_32BIT_READ = 1 << 8;
/// POWER9 L1D entry/uaccess flush (CVE-2020-4788).
const POWER9_L1D_ENTRY = 1 << 9;
/// POWER8/9 SLB multi-hit machine check (HPT mode only).
const SLB_MULTIHIT = 1 << 10;
/// POWER8 Transactional Memory hardware bugs. TM disabled by default.
const TM_BUGS = 1 << 11;
/// e500v1/v2 mbar MO=1 errata (CPU-3). Use mbar MO=0 or msync.
const E500_MBAR = 1 << 12;
/// e500 system hang A-005125. Set SPR976[40:41]=0b10 at early boot.
const E500_HANG = 1 << 13;
/// e500 BTB phantom branch (A-004466). Flush BUCSR[BBFI] on ctx switch.
const E500_BTB = 1 << 14;
/// e500 TLB1 flash-invalidate errata (CPU-A001). Use per-entry tlbivax.
const E500_TLB1_FLASH = 1 << 15;
}
}
bitflags! {
/// s390x (z/Architecture) specific errata and architectural constraints.
/// Many of these are not "errata" in the hardware-bug sense but are
/// architectural properties that require specific kernel code paths
/// fundamentally different from all other architectures.
pub struct S390xErrata: u64 {
/// I-cache not snooped by stores. IPI + bcr serialization
/// mandatory after any code patching (BPF JIT, static keys).
const ICACHE_INCOHERENT = 1 << 0;
/// Spectre v2: expolines required on z12-z13 (no eToken).
/// z14+ with facility bit 82 (eToken) eliminates expolines.
const EXPOLINES = 1 << 1;
/// 64-bit TOD clock wraps ~2043 (epoch January 1, 1900).
const TOD_2043_WRAP = 1 << 2;
/// SVC dual encoding: 0-255 direct, higher via SVC 0 + r1.
const SVC_DUAL_ENCODING = 1 << 3;
/// No MMIO. All I/O via channel I/O (CCW) or PCI special
/// instructions (PCILG/PCISTG/PCISTB).
const NO_MMIO = 1 << 4;
/// I/O interrupts float to any CPU with ISC enabled.
/// IRQ affinity via ISC masking, not IOAPIC/GIC routing.
const FLOATING_INTERRUPTS = 1 << 5;
/// Dynamic ASCE page table level switching (3→4→5 levels).
const DAT_ASCE_DYNAMIC = 1 << 6;
/// PSW-swap interrupt model (no interrupt vector table).
const PSW_SWAP_INTERRUPTS = 1 << 7;
/// User/kernel in different address spaces (Primary/Home).
/// MVCOS for copy_from_user; Secondary mode for futex CAS.
const SEPARATE_ADDRESS_SPACES = 1 << 8;
/// SIGP replaces APIC IPIs entirely.
const SIGP_IPI = 1 << 9;
}
}
bitflags! {
/// LoongArch64 specific errata and architectural constraints.
pub struct LoongArchErrata: u32 {
/// Software TLB management on 3A5000 (hardware PTW optional on 3A6000).
const SW_TLB_3A5000 = 1 << 0;
/// I-cache not coherent with data stores. IBAR after code patching.
const IBAR_REQUIRED = 1 << 1;
/// No broadcast TLB invalidation. INVTLB + IPI to each CPU.
const INVTLB_IPI = 1 << 2;
/// EIOINTC interrupt controller (not GIC/APIC compatible).
const EIOINTC_ONLY = 1 << 3;
/// No MPK/POE equivalent. Tier 1 unavailable.
const NO_FAST_ISOLATION = 1 << 4;
}
}
2.16.2.5 Microarchitectural Tuning Hints¶
/// Non-errata CPU characteristics that affect performance tuning.
/// These are hints, not bugs — the kernel uses them to select optimal
/// parameters, not to work around incorrect behavior.
///
/// Populated during `detect_features()` from CPUID/ID registers and
/// model-specific tables. Frozen with the rest of CpuFeatureSet.
pub struct MicroarchHints {
/// L1 data cache line size in bytes (32, 64, or 128).
/// Used by DomainRingBuffer alignment, slab object padding,
/// per-CPU struct alignment.
pub cacheline_bytes: u8,
/// Number of hardware prefetch streams the CPU can track.
/// Affects page zeroing strategy (sequential vs strided).
/// 0 = unknown (use conservative default of 4).
pub hw_prefetch_streams: u8,
/// Preferred kernel memcpy strategy for large copies (>4 KB).
/// Derived from CPU model and CPUID feature flags.
/// See `AlgoDispatch<KERNEL_MEMCPY>` for the algorithm selection.
pub memcpy_strategy: MemcpyStrategy,
/// Page table depth supported (affects VA bits and walk cost).
/// x86-64: 4 (48-bit VA) or 5 (57-bit VA, LA57).
/// AArch64: 3 (39-bit) or 4 (48-bit) based on TCR config.
/// RISC-V: 3 (Sv39), 4 (Sv48), or 5 (Sv57).
/// PPC64LE: 4 (Radix, 52-bit VA).
/// s390x: 3 (Region-Third, 4TB) through 5 (Region-First, 16EB).
/// LoongArch: user-configurable page size + fixed 4-level walk.
pub page_table_levels: u8,
/// Optimal C-state for idle loop. Derived from ACPI _CST or
/// CPUID MWAIT leaf (x86). Deeper C-states save power but have
/// higher wake-up latency. The scheduler uses this for the
/// platform power manager ([Section 7.4](07-scheduling.md#platform-power-management)).
pub deepest_efficient_cstate: u8,
/// Maximum safe C-state. Some CPU models freeze in deep C-states
/// (Bay Trail/Cherry Trail in C6, HPET stops in PC10). This is
/// the deepest C-state the idle driver may request on this CPU.
/// Derived from per-model C-state blacklist tables.
/// 0 = no restriction (use deepest_efficient_cstate).
pub max_safe_cstate: u8,
/// CPU has efficient unaligned memory access (no penalty for
/// misaligned loads/stores). True on x86-64 (all modern),
/// AArch64 (most Cortex-A), false on some RISC-V and ARMv7.
/// Affects DMA buffer alignment requirements and struct packing.
pub efficient_unaligned: bool,
/// CPU core type in a heterogeneous/hybrid system.
/// Determines scheduler placement heuristics and per-core
/// domain switch cost estimates.
pub core_type: CoreType,
/// Relative IPC (instructions per cycle) capacity of this core,
/// normalized so that the highest-IPC core in the system = 1024.
/// Derived from CPUID hybrid info (x86), MIDR implementer/part
/// (AArch64 big.LITTLE), or 1024 on homogeneous systems.
/// The scheduler uses this for heterogeneity-aware placement:
/// latency-sensitive tasks prefer high-IPC cores.
pub ipc_capacity: u16,
/// Tier 1 domain switch cost in cycles for this CPU.
/// x86 P-core: ~35 (WRPKRU), x86 E-core: ~89 (WRPKRU),
/// AArch64 POE: ~20-30 (POR_EL0 + ISB).
/// Architectures without fast isolation (RISC-V, s390x, LoongArch64)
/// report the page-table-switch cost instead (typically 200-500 cycles),
/// since Tier 1 is unavailable on those architectures and Tier 2 domain
/// transitions require a full address-space switch.
/// The scheduler uses this to avoid placing isolation-heavy
/// workloads on cores with expensive domain switches.
pub domain_switch_cycles: u16,
/// TSC / counter is reliable as a clocksource on this CPU.
/// False on pre-Nehalem Intel (TSC stops in C3+), AMD multi-socket
/// (cross-socket drift), Bay Trail (HPET/TSC interaction).
/// When false, the timekeeping subsystem falls back to a
/// platform clocksource (ACPI PM timer, ARM arch timer, etc.).
pub tsc_reliable: bool,
/// LL/SC or CAS reservation granule in bytes (PPC: typically 64-128).
/// Atomic variables in hot-path structures must be separated by at
/// least this many bytes to avoid false sharing of reservations.
/// 0 on architectures without reservation-based atomics (x86, s390x).
pub reservation_granule_bytes: u8,
/// AVX-512 frequency throttling behavior on this CPU.
/// Skylake-SP through Ice Lake: heavy AVX-512 triggers license-level
/// downclocking (~85% or ~70% turbo, persists ~670us).
/// The scheduler uses this to isolate AVX-512 workloads and
/// ensure VZEROUPPER after kernel AVX use on pre-Ice-Lake.
pub avx512_throttle: Avx512Throttle,
/// Padding to align struct to 8 bytes.
_pad: [u8; 1],
}
/// CPU core type in heterogeneous/hybrid systems.
#[repr(u8)]
pub enum CoreType {
/// Homogeneous system or single core type. All cores are equal.
Standard = 0,
/// High-performance core (Intel P-core, ARM big core like A76/A78/X1).
/// Higher IPC, higher power, larger caches.
PerformanceCore = 1,
/// High-efficiency core (Intel E-core, ARM LITTLE core like A55/A510).
/// Lower IPC, lower power, smaller caches.
EfficiencyCore = 2,
}
/// AVX-512 frequency throttling profile.
#[repr(u8)]
pub enum Avx512Throttle {
/// No throttling (Ice Lake+, AMD, or no AVX-512).
None = 0,
/// License Level 1: ~85% turbo on any 512-bit instruction.
/// Skylake-SP with light AVX-512 use.
Level1 = 1,
/// License Level 2: ~70% turbo on heavy 512-bit use.
/// Skylake-SP/Cascade Lake with sustained AVX-512.
Level2 = 2,
/// AVX-512 not available on this CPU.
NotAvailable = 3,
}
/// Preferred memcpy implementation strategy.
#[repr(u8)]
pub enum MemcpyStrategy {
/// rep movsb with ERMS+FSRM (x86 Ice Lake+). Optimal for all sizes.
RepMovsb = 0,
/// SIMD (AVX2/AVX-512/NEON/SVE). Optimal for large aligned copies.
Simd = 1,
/// Scalar word-sized loads/stores. Fallback for platforms without
/// ERMS or SIMD (some RISC-V, PPC32).
Scalar = 2,
}
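The priority order implied by the variant comments (ERMS+FSRM first, then SIMD, then scalar fallback) can be expressed as a pure function of the detected features. A host-runnable sketch; `select_memcpy_strategy` and its inputs are hypothetical stand-ins for the real CPUID/ID-register-derived flags in `detect_features()`:

```rust
/// Illustrative sketch of memcpy strategy selection. The enum mirrors
/// MemcpyStrategy above; the selection inputs are simplified stand-ins.
#[derive(Debug, PartialEq)]
pub enum MemcpyStrategy {
    RepMovsb, // ERMS + FSRM present: rep movsb is optimal at all sizes
    Simd,     // vector copies for large aligned buffers
    Scalar,   // word-at-a-time fallback
}

pub fn select_memcpy_strategy(has_erms_fsrm: bool, simd_width_bytes: u16) -> MemcpyStrategy {
    if has_erms_fsrm {
        MemcpyStrategy::RepMovsb
    } else if simd_width_bytes >= 16 {
        MemcpyStrategy::Simd
    } else {
        MemcpyStrategy::Scalar
    }
}

fn main() {
    // Ice-Lake-like x86: ERMS+FSRM -> rep movsb.
    assert_eq!(select_memcpy_strategy(true, 64), MemcpyStrategy::RepMovsb);
    // NEON-only AArch64 core -> SIMD path.
    assert_eq!(select_memcpy_strategy(false, 16), MemcpyStrategy::Simd);
    // PPC32 without AltiVec -> scalar fallback.
    assert_eq!(select_memcpy_strategy(false, 0), MemcpyStrategy::Scalar);
}
```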
2.16.2.6 Code Alternatives: Instruction-Level Patching¶
AlgoDispatch handles algorithm selection (which function to call).
A complementary mechanism handles instruction-level alternatives:
replacing individual instructions or short instruction sequences based
on CPU features or errata, without replacing an entire function.
This covers cases where the same algorithm needs a different instruction (not a different function) on different CPU generations:
- `LFENCE` after indirect branches (Spectre v2 retpoline vs eIBRS)
- `WRMSRNS` (non-serializing MSR write) replacing `WRMSR` on Ice Lake+
- `SERIALIZE` instruction replacing `CPUID` for pipeline serialization
- `NOP` replacing errata workaround instructions on unaffected CPUs
- `CLEARBHB` on AArch64 (Spectre-BHB mitigation, only on affected cores)
- `DC ZVA` vs `STP xzr` for page zeroing (AArch64, depends on ZVA block size)
Mechanism: code_alternative! macro
/// Declare an instruction-level alternative at a code site.
///
/// At boot (phase 9, during `alt_patch_all()`), the binary is patched
/// in-place: the default instruction sequence is overwritten with the
/// best alternative whose requirements are satisfied. After patching,
/// the instruction cache is flushed (per-arch), and the code runs
/// at native speed — zero runtime dispatch overhead.
///
/// # Parameters
/// - `default`: the instruction sequence used on CPUs without any
/// matching alternative (must be the widest/safest variant).
/// - `alt`: one or more `(condition, replacement)` pairs. The first
/// matching condition wins (priority order, same as AlgoDispatch).
///
/// # How it works
/// The macro emits the default instruction sequence inline, plus a
/// relocation entry in the `__alt_instructions` linker section. The
/// `alt_patch_all()` function iterates these entries, checks each
/// condition against `CpuFeatureTable.universal` (for capabilities) or
/// `CpuFeatureTable.errata_union` (for errata), and patches the code.
///
/// The patching window is atomic: interrupts are disabled on the BSP
/// during `alt_patch_all()`, and APs have not started yet. After
/// patching, the modified pages are made executable and I-cache is
/// invalidated per architecture.
///
/// Example: serialize pipeline
///
/// ```rust
/// code_alternative! {
/// default: "cpuid" // serialize on old CPUs (clobbers EAX-EDX)
/// alt: (arch_raw::SERIALIZE, // CPUID leaf 7 ECX[14], Alder Lake+
/// "serialize") // SERIALIZE instruction (no clobbers)
/// alt: (arch_raw::LFENCE_RDTSC, // available since Pentium Pro
/// "lfence") // weaker but sufficient for many uses
/// }
/// ```
///
/// Example: Spectre v2 mitigation
///
/// ```rust
/// code_alternative! {
/// default: "retpoline_thunk" // indirect branch via retpoline
/// alt: (!errata::SPECTRE_V2, // CPU not affected → no mitigation
/// "jmp *rax") // direct indirect branch
/// alt: (arch_raw::IBRS_ENHANCED,// eIBRS hardware mitigation
/// "jmp *rax") // direct indirect branch (eIBRS active)
/// }
/// ```
macro_rules! code_alternative { ... }
Patching scope: code_alternative! is used exclusively in
arch/*/ modules — generic kernel code never contains arch-specific
instruction alternatives. The alternatives are low-level optimizations
within the architecture abstraction layer.
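To make the patching model concrete, here is a host-runnable sketch of the loop that `alt_patch_all()` performs over `__alt_instructions` entries. `AltEntry` and `patch_alternatives` are illustrative names, a plain byte buffer stands in for the mapped code pages, and the per-arch I-cache flush is elided:

```rust
/// Illustrative entry shape: site offset, default-sequence length,
/// required feature mask, and replacement bytes (shorter replacements
/// are NOP-padded so the site length never changes).
pub struct AltEntry {
    pub offset: usize,
    pub len: usize,
    pub required: u64,
    pub replacement: Vec<u8>,
}

/// Patch every site whose required features are all present.
/// Unmatched sites keep their default instruction sequence.
pub fn patch_alternatives(text: &mut [u8], entries: &[AltEntry], features: u64) {
    for e in entries {
        if e.required & features == e.required && e.replacement.len() <= e.len {
            let site = &mut text[e.offset..e.offset + e.len];
            site[..e.replacement.len()].copy_from_slice(&e.replacement);
            for b in &mut site[e.replacement.len()..] {
                *b = 0x90; // single-byte NOP on x86
            }
        }
    }
    // Real code would flush the I-cache here (per-arch).
}

fn main() {
    let mut text = vec![0xCC; 4]; // stand-in for .text bytes
    let entries = [AltEntry { offset: 1, len: 2, required: 0b1, replacement: vec![0xAA] }];
    patch_alternatives(&mut text, &entries, 0b1);
    // Replacement byte written, tail NOP-padded, rest untouched.
    assert_eq!(text, vec![0xCC, 0xAA, 0x90, 0xCC]);
}
```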
Relationship to AlgoDispatch:
| Mechanism | Granularity | When | Use case |
|---|---|---|---|
| `AlgoDispatch` | Entire function | Boot phase 9 | Different algorithm implementations (SHA-NI vs generic SHA) |
| `code_alternative!` | 1-16 instructions | Boot phase 9 | Same algorithm, different instruction on different CPU model |
| `arch::current::` | Entire module | Compile time (per target triple) | Different architecture (x86 vs ARM vs RISC-V) |
All three are complementary. arch::current:: selects the architecture.
Within an architecture, AlgoDispatch selects the function. Within a
function, code_alternative! selects the instruction. The result is a
kernel binary that adapts to the specific CPU it runs on at boot — no
per-host recompilation needed, no runtime branches in hot paths.
Per-CPU-type alternatives for big.LITTLE and hybrid systems:
On heterogeneous systems (ARM big.LITTLE, Intel hybrid P/E-core), different
CPU types within the same SoC may require different instruction-level
workarounds. For example, on a Cortex-A55 + Cortex-A76 DynamIQ cluster:
- A55 needs A55_2441007 (repeat TLBI+DSB) but A76 does not.
- A76 needs A76_1286807 (repeat TLBI+DSB) but A55 has its own variant.
- Both need SSBS_3194386 (SB after SSBS modification).
Global code_alternative! patching cannot handle this because a single
code site cannot have two different patches for two different CPU types
running simultaneously.
Solution: For per-CPU-type errata in hot paths, code_alternative!
patches in the safe superset (the workaround that is correct on all
cores), and a runtime per-CPU check is used only when the safe superset
would impose unacceptable overhead on unaffected cores. The runtime check
reads this_cpu_features().errata (a single cache-line-aligned load) and
branches. This costs ~1-3 cycles on the fast path (branch predictor learns
quickly) vs potentially hundreds of cycles for unnecessary workarounds.
// Example: TLBI path on big.LITTLE
code_alternative! {
// Default: repeat TLBI+DSB (safe on all AArch64 cores)
default: "tlbi_repeat_dsb"
// If NO CPU needs repeat-TLBI, use single TLBI+DSB
alt: (!errata::A55_2441007 & !errata::A76_1286807,
"tlbi_single_dsb")
}
// For per-CPU hot-path optimization (optional):
// if this_cpu_has_errata(Aarch64Errata::A55_2441007) {
// tlbi_repeat_dsb(); // A55-specific path
// } else {
// tlbi_single_dsb(); // A76 or unaffected core
// }
This approach ensures correctness (safe superset is always applied by default) while allowing per-CPU optimization where the performance difference justifies a runtime branch.
2.16.2.7 Errata Aggregation in cpu_features_freeze()¶
/// Extended freeze protocol (additions to sub-phase 3).
///
/// NOTE: cpu_features_freeze() lives in Evolvable (swappable), not Nucleus.
/// Only the CpuFeatureTable data storage and alt_patch_apply() primitive
/// are in Nucleus. All orchestration logic is replaceable.
///
/// cpu_features_freeze():
/// 1. ANDs all per-CPU capabilities → universal intersection (existing).
/// 2. ORs all per-CPU errata → errata_union (NEW).
/// This ORs each sub-field independently:
/// errata_union.spec |= entry[i].errata.spec
/// errata_union.x86 |= entry[i].errata.x86 (only on x86)
/// errata_union.aarch64 |= entry[i].errata.aarch64 (only on AArch64)
/// ... etc. Only the compile-time active architecture field is non-zero.
/// If ANY CPU has the bug, the workaround is system-wide.
/// 3. Validates microarch consistency: all CPUs must agree on
/// cacheline_bytes and page_table_levels. If not, log a warning
/// and use the most conservative value.
/// 3a. Computes heterogeneity summary: if any two CPUs have different
/// core_type values, the system is heterogeneous. Records
/// min/max ipc_capacity and min/max domain_switch_cycles for
/// scheduler topology initialization.
/// 4. Marks the CpuFeatureTable page read-only.
/// 5. Calls alt_patch_all() to apply code alternatives.
/// For errata-gated alternatives, checks errata_union (system-wide).
/// For capability-gated alternatives, checks universal (intersection).
/// 6. Calls algo_dispatch_init_all() to select algorithm variants.
///
/// After this point, the kernel binary is patched for this specific
/// hardware. No further code modification occurs EXCEPT during live
/// evolution, where the evolution framework:
/// a. Temporarily remaps CpuFeatureTable page as writable (Nucleus).
/// b. Calls the new Evolvable's detect/aggregate functions to update
/// errata bits if detection logic has changed.
/// c. Calls alt_patch_all() on the newly loaded module's
/// __alt_instructions section (patching only the new module).
/// d. Re-freezes the CpuFeatureTable page (Nucleus).
/// e. Calls algo_dispatch_init_all() for the new module's
/// AlgoDispatch statics.
/// This ensures new errata workarounds and instruction alternatives
/// take effect without reboot.
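The AND/OR aggregation in steps 1-2 collapses to a pair of folds. A sketch with the per-architecture sub-fields flattened into single `u64` masks for illustration (`CpuEntry` and `aggregate` are hypothetical names; the real code ORs/ANDs each sub-field separately):

```rust
/// One detected CPU: capability bits and errata bits, flattened to u64
/// for this sketch.
#[derive(Clone, Copy)]
pub struct CpuEntry {
    pub caps: u64,
    pub errata: u64,
}

/// Capabilities are intersected (a feature is "universal" only if every
/// CPU has it); errata are unioned (one affected CPU makes the
/// workaround system-wide).
pub fn aggregate(entries: &[CpuEntry]) -> (u64, u64) {
    let universal = entries.iter().fold(u64::MAX, |acc, e| acc & e.caps);
    let errata_union = entries.iter().fold(0u64, |acc, e| acc | e.errata);
    (universal, errata_union)
}

fn main() {
    let cpus = [
        CpuEntry { caps: 0b1101, errata: 0b0010 }, // big core
        CpuEntry { caps: 0b1001, errata: 0b0100 }, // LITTLE core
    ];
    let (universal, errata_union) = aggregate(&cpus);
    assert_eq!(universal, 0b1001);    // only features present on BOTH cores
    assert_eq!(errata_union, 0b0110); // workarounds needed by EITHER core
}
```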
2.16.2.8 Global Feature Table and Query API¶
/// The system-wide CPU feature table. One entry per logical CPU.
///
/// Access pattern: exclusively read after boot. Write path (during boot) is
/// serialised by the boot sequence (single-threaded until AP bring-up, then
/// each AP writes only its own entry). After `cpu_features_freeze()`, the
/// underlying page is read-only and no writes occur.
pub struct CpuFeatureTable {
/// Per-CPU feature sets, indexed by CPU ID (same index as `CpuCapacity`
/// in the scheduler topology table). Allocated from early-boot memory
/// (not slab — slab is not yet available at detection time).
entries: &'static [CpuFeatureSet],
/// Feature intersection across all CPUs: a feature bit in `universal` is set
/// iff every CPU's `CpuFeatureSet` has that bit set. Computed once during
/// `cpu_features_freeze()`. Used by `AlgoDispatch` selection (§3.10.1)
/// to guarantee the chosen implementation runs correctly on any CPU,
/// regardless of migration.
universal: CpuFeatureSet,
}
/// Global instance. Written during boot, read-only after `cpu_features_freeze()`.
pub static CPU_FEATURES: OnceCell<CpuFeatureTable> = OnceCell::new();
Public query API (all functions are #[inline(always)], zero overhead):
/// Feature set of the current CPU (the CPU executing this instruction).
/// Valid after boot phase 6 for any online CPU; valid from phase 2 for the BSP.
pub fn this_cpu_features() -> &'static CpuFeatureSet
/// Feature set of a specific CPU by ID. Panics if cpu_id ≥ cpu_count().
pub fn cpu_features(cpu_id: u32) -> &'static CpuFeatureSet
/// True if every CPU in the system has all bits in `caps` set.
/// Use this for `AlgoDispatch` selection: an implementation chosen via
/// `all_cpus_have_crypto(caps)` can run on any CPU after migration.
pub fn all_cpus_have_crypto(caps: CryptoCaps) -> bool
pub fn all_cpus_have_atomics(caps: AtomicCaps) -> bool
pub fn all_cpus_have_isolation(caps: IsolationCaps) -> bool
/// Minimum SIMD register width (bytes) across all online CPUs.
/// An algorithm requiring SIMD of this width or less can be dispatched globally.
pub fn min_simd_width_bytes() -> u16
/// True if ANY CPU in the system has the specified errata.
/// Errata use union (OR) semantics: if any CPU is affected, the
/// workaround must be active system-wide (kthreads migrate).
pub fn any_cpu_has_errata(errata: ErrataCaps) -> bool
/// Microarchitectural hints from the universal set (most conservative
/// values across all CPUs).
pub fn microarch_hints() -> &'static MicroarchHints
/// True if the current CPU has all bits in `caps` set.
/// Only use for per-CPU decisions (pinned kthreads, interrupt handlers on a
/// specific CPU). For global dispatch, use `all_cpus_have_*` instead.
pub fn this_cpu_has_crypto(caps: CryptoCaps) -> bool
pub fn this_cpu_simd_width() -> u16
/// True if the current CPU has ALL specified errata bits set.
/// Use for per-CPU hot-path workaround branching on heterogeneous systems
/// (big.LITTLE, Intel hybrid) where different cores have different errata.
/// The ErrataCaps argument should specify a single architecture's field;
/// the function checks only the compile-time active arch field.
///
/// Example (AArch64 big.LITTLE TLB path):
/// if this_cpu_has_errata(ErrataCaps { aarch64: Aarch64Errata::A55_2441007, ..Default::default() }) {
/// tlbi_repeat_dsb();
/// }
pub fn this_cpu_has_errata(errata: ErrataCaps) -> bool
/// True if a specific CPU (by ID) has ALL specified errata bits set.
/// Panics if cpu_id ≥ cpu_count(). Primarily used during boot to inspect
/// AP feature sets before freeze, or by diagnostic/sysfs reporting.
pub fn cpu_has_errata(cpu_id: u32, errata: ErrataCaps) -> bool
/// Returns the firmware capability bitflags from the universal set.
/// Used to check whether firmware-mediated mitigations are available
/// before attempting SMCCC/OPAL/SCLP calls.
pub fn firmware_caps() -> FirmwareCaps
/// Returns the microcode revision of the current CPU.
/// Used to check whether a specific microcode update has been applied
/// (e.g., for microcode-gated errata workarounds).
pub fn this_cpu_microcode_rev() -> u64
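The contrasting semantics of the two query families reduce to two bit tests: capability queries are subset tests against the frozen intersection, errata queries are non-empty-intersection tests against the union. A standalone sketch over plain masks (the real API operates on the typed bitflags structs):

```rust
/// Capability query: true iff every bit in `caps` survives the all-CPU
/// intersection, i.e. every CPU has every requested feature.
pub fn all_cpus_have(universal: u64, caps: u64) -> bool {
    universal & caps == caps
}

/// Errata query: true iff ANY requested erratum is present in the
/// all-CPU union, i.e. at least one CPU needs the workaround.
pub fn any_cpu_has_errata(errata_union: u64, errata: u64) -> bool {
    errata_union & errata != 0
}

fn main() {
    let universal = 0b1001;
    let errata_union = 0b0110;
    assert!(all_cpus_have(universal, 0b1001));
    assert!(!all_cpus_have(universal, 0b1101)); // bit 2 missing on some CPU
    assert!(any_cpu_has_errata(errata_union, 0b0010));
    assert!(!any_cpu_has_errata(errata_union, 0b1000));
}
```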
2.16.2.9 Architecture-Specific Detection Functions¶
Each architecture provides:
/// Detect and return the feature set for the calling CPU.
/// Called exactly once per CPU during boot (BSP in phase 2, APs in phase 6).
/// Must not allocate; must not take locks; must not fail.
/// All CPUID/system-register reads happen here; results are stored in the
/// returned struct and never re-read at runtime.
pub fn detect_features() -> CpuFeatureSet // in arch::current::cpu
Implementation notes per architecture:
- x86-64: Executes CPUID leaves 1 (ECX/EDX), 7 sub-leaf 0 (EBX/ECX/EDX), 7 sub-leaf 1, 0xD sub-leaf 0, 0x80000001 (ECX), 0x8000001F (ECX for SME/SEV). Reads `XCR0` (via `XGETBV`) to determine which extended state components the OS has enabled; SIMD width derived from enabled YMM (32 bytes) or ZMM (64 bytes) state. AES-NI check: leaf 1 ECX[25]. SHA-NI check: leaf 7 EBX[29].
- AArch64: Reads `ID_AA64ISAR0_EL1`, `ID_AA64ISAR1_EL1`, `ID_AA64ISAR2_EL1`, `ID_AA64PFR0_EL1`, `ID_AA64PFR1_EL1`, `ID_AA64MMFR3_EL1`. If FEAT_SVE is present (PFR0[35:32] ≥ 1), reads `ZCR_EL1.LEN` to determine the SVE vector length in bytes: `vl = (ZCR_EL1.LEN + 1) × 16`. SIMD width = max(16, vl). If FEAT_SME is present (PFR1[27:24] ≥ 1), reads `SMCR_EL1.LEN` for the Scalable Matrix Extension vector length (used by the accelerator subsystem only; not reflected in `simd_width_bytes`, which covers streaming-SVE-compatible width).
- ARMv7: Reads `MVFR0`, `MVFR1` (VFP/NEON capability registers via MRC p10). Reads `ID_ISAR5` for crypto extension presence. SIMD width = 16 if NEON (MVFR1[19:16] ≥ 1), else 8 if VFPv3-D16 (MVFR0[3:0] ≥ 1), else 0.
- RISC-V: Reads the per-hart `riscv,isa` string from the device tree (passed by OpenSBI in the device-tree blob). Parses all extension letters and multi-letter extensions (Zk, Zb, Za*). If the V (vector) extension is present, reads the `vlenb` CSR (`vsetvli t0, x0, e8, m1, ta, ma; csrr t0, vlenb`) to determine the vector register length in bytes; `simd_width_bytes = vlenb`. Called once per hart as each hart enters kernel mode.
Misaligned access probing (RISC-V only): RISC-V does not mandate efficient
unaligned memory access — it is implementation-defined whether misaligned loads/stores
are handled in hardware (fast, 1-2 cycles) or trapped to an M-mode exception handler
(slow, 100-1000+ cycles). The variation across implementations is dramatic:
- SiFive U74: hardware misaligned access, ~1-2 cycle penalty
- T-Head C910: hardware misaligned access, ~1-2 cycle penalty
- T-Head C906: trap-based misaligned handling, ~100+ cycles
- QEMU: emulated, no meaningful penalty
UmkaOS probes misaligned access performance at boot by timing a misaligned load
against an aligned load (100 iterations, take median). If the misaligned cost
exceeds 10× the aligned cost, `MicroarchHints.efficient_unaligned` is set to false
and the kernel selects alignment-safe implementations for memcpy/memset/string
operations. The probe result is also reported to userspace via the riscv_hwprobe()
syscall (RISCV_HWPROBE_KEY_CPUPERF_0 field, RISCV_HWPROBE_MISALIGNED_FAST /
RISCV_HWPROBE_MISALIGNED_SLOW / RISCV_HWPROBE_MISALIGNED_EMULATED flags) for
glibc's ifunc resolver to select optimized routines.
- PPC32: Reads PVR (`mfspr r0, 287`) for processor version. Reads the DT `ibm,vmx` property (1 = AltiVec/VMX present). SIMD width = 16 if AltiVec, else 0. No crypto hardware on any supported PPC32 variant; all `CryptoCaps` bits are 0.
- PPC64LE: PVR + DT `ibm,vmx` (VMX/AltiVec baseline, 16 bytes), `ibm,vsx` (VSX vector-scalar, 16 bytes per register; VSX has 64 × 128-bit registers). DARN instruction available on POWER9+ (PVR major version ≥ 0x004E). vcipher/vncipher AES available on POWER8+ (PVR major version ≥ 0x004D). SIMD width = 16 if VMX or VSX, else 0.
- s390x: Reads STFLE (Store Facility List Extended) to populate a 2048-bit facility mask. Key facility bits: 17 (MSA, message-security-assist base), 76/77 (MSA3/MSA4: AES-128/256, SHA-256/512), 146 (MSA8: AES-GCM AEAD), 129 (vector facility, 128-bit SIMD), 134 (vector-packed-decimal), 82 (eToken, the Spectre v2 mitigation that eliminates expolines on z14+), 135 (vector-enhancements-1: BCD, string), 148 (vector-enhancements-2: IEEE 128-bit float). Machine type from STIDP (Store CPU ID) identifies the CPU generation (z13=2964, z14=3906, z15=8561, z16=3931). SIMD width = 16 if facility 129, else 0. Crypto caps from MSA facility bits. `FirmwareCaps::SCLP` always set (s390x requires SCLP for console, clock sync, topology). `FirmwareCaps::SIE` set if facility 44 (configuration-z/architecture-mode). Errata populated from machine type: z12-z13 get `EXPOLINES` (no eToken); all get `ICACHE_INCOHERENT`, `SVC_DUAL_ENCODING`, `NO_MMIO`, `FLOATING_INTERRUPTS`, `PSW_SWAP_INTERRUPTS`, `SEPARATE_ADDRESS_SPACES`, `SIGP_IPI` (these are architectural, not errata per se).
- LoongArch64: Reads CPUCFG (CPU configuration) words 0-19. CPUCFG word 0 contains PRID (Processor ID): bits [31:16] = company (0x14 = Loongson), bits [15:8] = series (0xC0 = 3A5000, 0xD0 = 3A6000). CPUCFG word 2 provides ISA feature bits: LAMO (atomic memory operations), LAM_BH (byte/half atomics), LAMCAS (compare-and-swap), FP/FP_SP/FP_DP (floating-point), LSX (128-bit SIMD), LASX (256-bit SIMD), COMPLEX (complex operations), CRYPTO (AES/SHA). SIMD width = 32 if LASX, 16 if LSX, else 0. Crypto caps from CPUCFG word 2 CRYPTO bit. `FirmwareCaps::LOONGARCH_BPI` set (Boot and Pre-boot Interface always present). Errata from PRID: 3A5000 gets `SW_TLB_3A5000`; all get `IBAR_REQUIRED`, `INVTLB_IPI`, `EIOINTC_ONLY`, `NO_FAST_ISOLATION` (architectural constraints).
2.17 Production Boot Target¶
The following subsections describe the target boot architecture for production deployments. None of this is implemented yet — it represents the design goal that the Multiboot implementation will evolve toward (see Section 2.18 for the migration path).
2.17.1 Goal: Drop-in Kernel Package¶
UmkaOS installs as a standard kernel package alongside the existing Linux kernel. The user can dual-boot between them using the GRUB menu.
# Debian / Ubuntu
apt install umka
update-initramfs -c -k umka-1.0.0
update-grub
# RHEL / Fedora
dnf install umka
dracut --force /boot/initramfs-umka-1.0.0.img umka-1.0.0
grub2-mkconfig -o /boot/grub2/grub.cfg
# Arch Linux
pacman -S umka
mkinitcpio -p umka
# Reboot, select "UmkaOS 1.0.0" from GRUB menu
# Existing Linux kernel is always available as a fallback entry
2.17.2 Boot Requirements¶
- Image format: ELF kernel image with an embedded PE/COFF stub header, compatible with GRUB2 (loading as ELF), systemd-boot, and UEFI direct boot (loading as PE/COFF). Installed as `/boot/vmlinuz-umka-VERSION` (the "vmlinuz" name is a convention; the actual format is a PE/COFF-stubbed ELF, similar to Linux's bzImage with EFISTUB).
- Boot protocol: x86 Linux boot protocol (for BIOS legacy boot) and UEFI stub (for UEFI direct boot). Both are supported.
- Initramfs: Custom initramfs containing UmkaOS-native drivers for early boot (storage controller, root filesystem). Built using standard tools (dracut, mkinitcpio) with UmkaOS-specific hooks.
- `/boot` layout: Fully compatible with existing distribution tools:
  - `/boot/vmlinuz-umka-VERSION`
  - `/boot/initramfs-umka-VERSION.img`
  - `/boot/System.map-umka-VERSION` (optional, for debugging)
- Kernel command line: Standard Linux cmdline parameters are parsed and honored (`root=`, `console=`, `quiet`, `init=`, `rw`/`ro`, etc.).
2.17.3 Target Boot Sequences¶
These are high-level summaries for package installation and boot target configuration. For the authoritative phase-by-phase boot specification, see the arch-specific boot sequence files (e.g., Section 2.2, Section 2.5, Section 2.7, etc.) and the canonical cross-architecture boot table in Section 2.3.
x86-64 (production):
1. UEFI firmware (PE/COFF stub) / BIOS bootloader loads kernel image
2. Boot stub (Rust/asm) sets up:
- Identity-mapped page tables
- GDT, IDT stubs
- Stack
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse boot parameters and ACPI tables
b. Initialize physical memory allocator (from e820/UEFI memory map)
c. Initialize virtual memory (kernel page tables, PCID)
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: APIC, timer, early console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init (typically systemd)
AArch64 (production):
1. UEFI firmware or QEMU -kernel loads the ELF, jumps to _start in EL1
2. Boot stub (assembly) sets up:
- Exception vectors (VBAR_EL1)
- Stack pointer
- MMU disabled (identity-mapped initially)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) passed in x0
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (TTBR0_EL1/TTBR1_EL1, ASID, TCR_EL1)
d. Initialize per-CPU data structures (MPIDR_EL1 affinity)
e. Initialize Tier 0 drivers: GIC (distributor + redistributor), generic timer, early console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
No microcode loading is performed — ARM firmware updates are handled by the platform firmware (UEFI capsule updates or vendor-specific mechanisms), not the kernel. This is architecturally correct: ARM's trust model places firmware updates in the Secure World (EL3/EL2), not in the Normal World OS.
ARMv7 (production):
1. QEMU vexpress-a15 loads the ELF, jumps to _start in SVC mode
2. Boot stub (assembly) sets up:
- Vector table (VBAR)
- Stack pointer
- Interrupts disabled (CPSR I+F bits)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) passed in r2
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (TTBR0, DACR for domain isolation)
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: GIC, SP804 timer, early UART console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
ARMv7 does not have microcode updates. CPU errata on ARMv7 are addressed through kernel code paths (alternative instruction sequences) selected at boot based on the MIDR (Main ID Register) value.
RISC-V 64 (production):
1. OpenSBI (M-mode firmware) initializes hardware, jumps to _start in S-mode
a0 = hart_id, a1 = DTB address
2. Boot stub (assembly) sets up:
- Trap vector (stvec)
- Stack pointer
- Interrupts disabled (sstatus.SIE = 0)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) from a1
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (satp CSR, Sv48 mode, ASID)
d. Initialize per-CPU data structures (per-hart)
e. Initialize Tier 0 drivers: PLIC, timer (via SBI ecall), early 16550 UART
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
RISC-V does not have microcode updates. CPU errata are handled by OpenSBI (M-mode)
or by kernel alternative code paths selected based on the mvendorid/marchid/
mimpid CSRs (exposed via SBI or DTB).
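A sketch of how such an alternative code path could be keyed off `mvendorid` (the vendor IDs and the erratum mapping are illustrative only; actual assignments must be checked against the JEDEC vendor ID list and per-core erratum documentation):

```rust
/// Bit position mirrors RiscvErrata::SIFIVE_CIP1200 in Section 2.16.2.4.
const SIFIVE_CIP1200: u64 = 1 << 34;

/// Hypothetical mapping from RISC-V machine-level ID CSRs (as exposed
/// via SBI or the DTB) to an errata mask. A real table would also
/// consult marchid/mimpid to narrow down the exact core revision
/// instead of applying the erratum to the whole vendor, as done here.
pub fn errata_from_ids(mvendorid: u64, _marchid: u64, _mimpid: u64) -> u64 {
    match mvendorid {
        0x489 => SIFIVE_CIP1200, // SiFive (illustrative: applied vendor-wide)
        _ => 0,                  // unknown vendor: no vendor-specific errata assumed
    }
}

fn main() {
    assert_eq!(errata_from_ids(0x489, 0, 0), SIFIVE_CIP1200);
    assert_eq!(errata_from_ids(0x5b7, 0, 0), 0); // e.g. a non-SiFive core
}
```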
PPC32 (production):
1. U-Boot or QEMU loads ELF, jumps to _start in supervisor mode
r3 = DTB address
2. Boot stub (assembly) sets up:
- Stack pointer (r1)
- Exception vectors (IVPR base + IVOR offsets)
- Interrupts disabled (MSR EE=0)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) from r3
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (TLB1 entries for initial mapping, then software page table)
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: OpenPIC, decrementer timer, early UART console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
PPC32 does not have microcode updates. CPU errata are handled by kernel code paths selected at boot based on the PVR (Processor Version Register).
PPC64LE (production):
1. SLOF/OPAL firmware loads ELF, jumps to _start
r3 = DTB address, MSR: SF=1, LE=1
2. Boot stub (assembly) sets up:
- TOC pointer (r2) for position-independent data access
- Stack pointer (r1)
- Exception vectors (via LPCR and HSPRG0/1)
- Interrupts disabled (MSR EE=0)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Parse device tree blob (DTB) from r3
b. Initialize physical memory allocator (from DTB /memory nodes)
c. Initialize virtual memory (Radix MMU on POWER9+, HPT fallback on POWER8)
d. Initialize per-CPU data structures (PIR = Processor Identification Register)
e. Initialize Tier 0 drivers: XIVE interrupt controller, decrementer timer, early UART console
f. Initialize capability system
g. Initialize scheduler
h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
PPC64LE does not have user-loadable microcode. POWER processor firmware updates are applied by the service processor (FSP or BMC) out-of-band, not by the OS kernel.
s390x (production):
1. IPL (Initial Program Load) loads the kernel from DASD/SCSI via channel I/O
or QEMU -kernel direct boot. Initial PSW points to _start.
No DTB, no ACPI — hardware discovery via STSI/SCLP.
2. Boot stub (assembly) sets up:
- Lowcore PSW pairs for all 6 interrupt classes
- Stack pointer (R15)
- 64-bit addressing mode (PSW bits 31,32 = 1,1)
- Prefix register (per-CPU lowcore page)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. Initialize SCLP console for early output
b. Execute STFLE for facility detection (CPU features)
c. Discover memory layout via SCLP Read SCP Info
d. Initialize physical memory allocator
e. Initialize virtual memory (DAT with Region-Third ASCE)
f. Initialize per-CPU data structures (lowcore-based)
g. Initialize Tier 0 drivers: CPU timer, subchannel I/O
h. Initialize capability system
i. Initialize scheduler
j. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
s390x does not have user-loadable microcode. z/Architecture firmware updates are applied by the Support Element (SE) or Hardware Management Console (HMC) during a scheduled IPL or concurrent firmware update, not by the OS kernel.
LoongArch64 (production):
1. UEFI firmware (EDK2) or QEMU -kernel loads the ELF, jumps to _start
in PLV0 (kernel mode). Register convention varies by boot method:
- UEFI: a0=efi_boot (bool, 1=EFI), a1=cmdline_ptr (pointer to command
line string), a2=system_table_ptr (pointer to EFI system table).
Matches Linux arch/loongarch/kernel/head.S entry convention.
- DTB boot: a0=CPU_ID, a1=DTB address
2. Boot stub (assembly) sets up:
- Direct Mapping Windows (CSR.DMW0/DMW1) for identity-mapped access
- Exception vectors (CSR.EENTRY, CSR.TLBRENTRY)
- Stack pointer
- FPU enable (CSR.EUEN.FPE = 1)
3. Jump to Rust entry point (umka_core::main)
4. UmkaOS Core initialization:
a. CPUCFG feature detection (PRID, ISA features, TLB geometry)
b. Parse DTB or UEFI memory map for physical memory regions
c. Initialize physical memory allocator
d. Initialize virtual memory (4-level page tables, software or hardware PTW)
e. Initialize per-CPU data structures
f. Initialize Tier 0 drivers: EIOINTC, stable counter timer, early NS16550 UART
g. Initialize capability system
h. Initialize scheduler
i. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init
LoongArch64 does not have user-loadable microcode. Loongson processor firmware updates are applied by the UEFI firmware update mechanism (capsule updates).
2.17.4 Initramfs Detection and Loading¶
UmkaOS supports three initramfs loading mechanisms, tried in priority order. The mechanism used depends on the boot path (BIOS/Multiboot, UEFI, or firmware with device tree). All three paths expose the same result to the kernel: a physical address and byte length for a contiguous initramfs image in RAM.
| Boot Path | Discovery Mechanism | Address Fields |
|---|---|---|
| x86 BIOS/Multiboot | boot_params.hdr.ramdisk_image (offset 0x218) | u32 phys addr + ramdisk_size at 0x21c |
| EFI stub (all arches) | LINUX_EFI_INITRD_MEDIA_GUID LoadFile2 protocol | GUID: 5568e427-68fc-4f3d-ac74-ca555231cc68 |
| Device Tree | /chosen node: linux,initrd-start + linux,initrd-end | u64 big-endian absolute physical addresses |
All three paths converge on the same kernel-internal representation:
/// Initramfs blob location discovered during early boot.
/// Populated by one of the three platform-specific loading paths before
/// the memory allocator is fully online. The physical range
/// [phys_start, phys_start + len) must lie within usable RAM.
pub struct InitramfsBlob {
/// Physical start address of the initramfs image.
pub phys_start: PhysAddr,
/// Byte length of the compressed CPIO archive.
pub len: usize,
}
Path 1 — x86 boot_params (highest priority on x86/x86-64)
The Multiboot loader or UEFI stub populates fields in boot_params (the "zero
page"). There are two distinct areas: the setup_header (header fields) and the
boot_params extension area (zero-page fields). The ramdisk fields span both:
/// Fields read from the x86 Linux boot protocol.
/// NOTE: This is a logical grouping, NOT a memory-mapped overlay. The four fields
/// live at different non-contiguous offsets within boot_params. Read each field
/// individually from its documented offset, then assemble into this struct.
/// Do NOT cast `boot_params + offset` to `*const BootParamsRamdiskFields`.
///
/// `ramdisk_image` and `ramdisk_size` live in the setup_header at fixed offsets
/// from the start of the real-mode kernel header (0x01f1 into boot_params).
/// `ext_ramdisk_image` and `ext_ramdisk_size` are separate extension fields
/// in the boot_params zero-page area, not in the header itself.
pub struct BootParamsRamdiskFields {
/// Low 32 bits of the initramfs physical base address.
/// Offset from boot_params base: 0x218 (within setup_header).
/// Boot protocol 2.00+ (kernel 1.3.73+).
pub ramdisk_image: u32,
/// Low 32 bits of the initramfs byte length.
/// Offset from boot_params base: 0x21c (within setup_header).
/// Boot protocol 2.00+ (kernel 1.3.73+).
pub ramdisk_size: u32,
/// High 32 bits of the initramfs physical base address.
/// Offset from boot_params base: 0x0c0 (zero-page extension area).
/// Added in boot protocol 2.12 (kernel 3.8) for loading above 4 GiB.
pub ext_ramdisk_image: u32,
/// High 32 bits of the initramfs byte length.
/// Offset from boot_params base: 0x0c4 (zero-page extension area).
/// Added in boot protocol 2.12 (kernel 3.8).
pub ext_ramdisk_size: u32,
}
If boot_params.hdr.ramdisk_image != 0, UmkaOS reads the initramfs from:
physical_addr = ((ext_ramdisk_image as u64) << 32) | (ramdisk_image as u64)
size_bytes = ((ext_ramdisk_size as u64) << 32) | (ramdisk_size as u64)
On systems without boot protocol 2.12 support (i.e., ext_ramdisk_image and
ext_ramdisk_size are zero-initialized), this reduces to the 32-bit address
and size directly from ramdisk_image and ramdisk_size.
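The address assembly above can be sketched as a small helper. Names mirror the struct in the text; the function itself is illustrative, not the in-tree implementation:

```rust
/// Assemble the 64-bit initramfs address and size from the split boot_params
/// fields. Pre-2.12 loaders leave the extension fields zeroed, so the
/// expression degrades gracefully to the plain 32-bit values.
fn assemble_ramdisk(
    ramdisk_image: u32,
    ramdisk_size: u32,
    ext_ramdisk_image: u32,
    ext_ramdisk_size: u32,
) -> Option<(u64, u64)> {
    if ramdisk_image == 0 && ext_ramdisk_image == 0 {
        return None; // no initramfs provided via boot_params
    }
    let addr = ((ext_ramdisk_image as u64) << 32) | ramdisk_image as u64;
    let size = ((ext_ramdisk_size as u64) << 32) | ramdisk_size as u64;
    Some((addr, size))
}

fn main() {
    // Pre-2.12 loader: extension fields zero — plain 32-bit address and size.
    assert_eq!(assemble_ramdisk(0x0200_0000, 0x0040_0000, 0, 0),
               Some((0x0200_0000, 0x0040_0000)));
    // Protocol 2.12+ loader placing the image above 4 GiB.
    assert_eq!(assemble_ramdisk(0x2000_0000, 0, 0x1, 0x1),
               Some((0x1_2000_0000, 0x1_0000_0000)));
    assert_eq!(assemble_ramdisk(0, 0, 0, 0), None);
}
```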
Path 2 — EFI LoadFile2 / Initrd Media GUID Protocol (EFI systems, all architectures)
When booted via EFI (UEFI stub or EFI bootloader such as systemd-boot or GRUB2),
the bootloader may expose the initramfs through the LoadFile2 protocol registered
on the LINUX_EFI_INITRD_MEDIA_GUID vendor media device path. This mechanism was
introduced in Linux 5.8 and is also implemented in the UmkaOS EFI stub.
/// EFI GUID identifying the initrd media vendor device path.
/// The kernel's EFI stub locates a handle with this GUID registered on the
/// firmware's device path protocol, then calls LoadFile2 to obtain the initrd.
/// Defined in the Linux EFI stub (drivers/firmware/efi/libstub/efi-stub-helper.c).
pub const LINUX_EFI_INITRD_MEDIA_GUID: EfiGuid = EfiGuid {
data1: 0x5568_e427,
data2: 0x68fc,
data3: 0x4f3d,
data4: [0xac, 0x74, 0xca, 0x55, 0x52, 0x31, 0xcc, 0x68],
};
The loading sequence:
1. Scan the EFI handle database for a handle that matches the LINUX_EFI_INITRD_MEDIA_GUID vendor media device path.
2. If found, query the LoadFile2 protocol on that handle.
3. Call LoadFile2.LoadFile() with BootPolicy = false to obtain the initrd size (the first call returns EFI_BUFFER_TOO_SMALL with the required size).
4. Allocate pages below the architecture's initrd address limit, then call LoadFile2.LoadFile() again to transfer the data.
5. The resulting (base, size) pair is stored in the EFI configuration table under LINUX_EFI_INITRD_MEDIA_GUID and consumed by the kernel after ExitBootServices().
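The two-call size-query pattern in the sequence above can be modeled against a mock LoadFile2 implementation. The trait and status enum here are simplified stand-ins for the UEFI protocol types, not real firmware bindings:

```rust
/// Minimal model of the EFI_LOAD_FILE2 two-call pattern: first call with no
/// buffer returns BufferTooSmall plus the required size; second call fills
/// the allocated buffer. Illustrative types only.
#[derive(Debug, PartialEq)]
enum EfiStatus { Success, BufferTooSmall }

trait LoadFile2 {
    fn load_file(&self, buf: Option<&mut [u8]>, size: &mut usize) -> EfiStatus;
}

fn load_initrd(p: &dyn LoadFile2) -> Option<Vec<u8>> {
    let mut size = 0usize;
    // First call: query the size.
    if p.load_file(None, &mut size) != EfiStatus::BufferTooSmall {
        return None;
    }
    // Stands in for AllocatePages below the address limit.
    let mut buf = vec![0u8; size];
    // Second call: transfer the data.
    (p.load_file(Some(&mut buf), &mut size) == EfiStatus::Success).then_some(buf)
}

/// Mock provider that serves a fixed in-memory initrd image.
struct MockInitrd(&'static [u8]);
impl LoadFile2 for MockInitrd {
    fn load_file(&self, buf: Option<&mut [u8]>, size: &mut usize) -> EfiStatus {
        match buf {
            None => { *size = self.0.len(); EfiStatus::BufferTooSmall }
            Some(b) => { b.copy_from_slice(self.0); EfiStatus::Success }
        }
    }
}

fn main() {
    let initrd = load_initrd(&MockInitrd(b"070701...")).unwrap();
    assert_eq!(&initrd[..6], b"070701");
}
```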
Path 3 — Device Tree /chosen node (AArch64, ARMv7, RISC-V, PPC)
The firmware or bootloader populates the /chosen DT node with the initramfs
physical address range:
/ {
chosen {
/* linux,initrd-start and linux,initrd-end are big-endian cell values.
Cell width follows #address-cells of the root node (typically 2 on
64-bit platforms, giving 64-bit addresses across two 32-bit cells). */
linux,initrd-start = <0x0 0x82000000>; /* 64-bit: 0x0000000082000000 */
linux,initrd-end = <0x0 0x84000000>; /* exclusive end address */
};
};
/// DT /chosen property names for initramfs (standard Linux boot protocol).
/// Values are big-endian cells; cell width matches the root node's #address-cells.
/// Size of initramfs = initrd_end - initrd_start (initrd_end is exclusive).
pub const DT_INITRD_START_PROP: &str = "linux,initrd-start";
pub const DT_INITRD_END_PROP: &str = "linux,initrd-end";
UmkaOS reads these properties during early DT parsing (step 4a in the DT-based boot sequences). Both are treated as 64-bit big-endian values regardless of platform word size, matching the Linux implementation.
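Reading those big-endian cells from a raw property value can be sketched as follows; the helper is illustrative, and the real parser operates on the flattened DTB structure:

```rust
/// Read an initrd address from a /chosen property value. With
/// #address-cells = 2 the value spans two big-endian u32 cells (high word
/// first, i.e. one big-endian u64); with #address-cells = 1 it is a single
/// big-endian cell. Illustrative helper over a raw property byte slice.
fn read_be_u64_cells(prop: &[u8]) -> Option<u64> {
    match prop.len() {
        8 => Some(u64::from_be_bytes(prop.try_into().ok()?)),
        4 => Some(u32::from_be_bytes(prop.try_into().ok()?) as u64),
        _ => None,
    }
}

fn main() {
    // linux,initrd-start = <0x0 0x82000000>; from the example above.
    let start = read_be_u64_cells(&[0, 0, 0, 0, 0x82, 0, 0, 0]).unwrap();
    let end = read_be_u64_cells(&[0, 0, 0, 0, 0x84, 0, 0, 0]).unwrap();
    assert_eq!(start, 0x8200_0000);
    // Size = initrd_end - initrd_start (end is exclusive).
    assert_eq!(end - start, 0x0200_0000);
}
```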
Priority and fallback:
if arch == x86 || arch == x86_64:
if boot_params.hdr.ramdisk_image != 0:
use Path 1
elif efi_boot && efi_load_initrd_dev_path() succeeds:
use Path 2
else:
no initramfs
elif efi_boot:
if efi_load_initrd_dev_path() succeeds:
use Path 2
elif dt_available && dt_property_exists("/chosen", "linux,initrd-start"):
use Path 3 // DTB fallback for EFI platforms (LoongArch64 QEMU, some AArch64)
else:
no initramfs
elif dt_boot:
if dt_property_exists("/chosen", "linux,initrd-start"):
use Path 3
else:
no initramfs
When multiple mechanisms provide initramfs data (e.g., both EFI LoadFile2 and DTB
/chosen properties on LoongArch64), only the highest-priority path (topmost match
in the cascade) is used; lower-priority paths are not consulted or validated.
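The cascade can be expressed as a pure selection function over the boot state. Types and field names here are illustrative (the kernel consults live boot structures), and the final branch collapses the dt_boot case:

```rust
/// Which initramfs discovery path won, mirroring the priority cascade above.
#[derive(Debug, PartialEq)]
enum InitrdPath { BootParams, EfiLoadFile2, DeviceTree, None }

#[derive(PartialEq)]
enum Arch { X86_64, Other }

struct BootState {
    arch: Arch,
    ramdisk_image_nonzero: bool, // boot_params.hdr.ramdisk_image != 0
    efi_boot: bool,
    efi_loadfile2_ok: bool,      // efi_load_initrd_dev_path() succeeded
    dt_available: bool,
    dt_has_initrd_start: bool,   // /chosen linux,initrd-start present
}

fn select_initrd_path(s: &BootState) -> InitrdPath {
    if s.arch == Arch::X86_64 {
        if s.ramdisk_image_nonzero { return InitrdPath::BootParams; }
        if s.efi_boot && s.efi_loadfile2_ok { return InitrdPath::EfiLoadFile2; }
        return InitrdPath::None;
    }
    if s.efi_boot {
        if s.efi_loadfile2_ok { return InitrdPath::EfiLoadFile2; }
        // DTB fallback for EFI platforms (LoongArch64 QEMU, some AArch64).
        if s.dt_available && s.dt_has_initrd_start { return InitrdPath::DeviceTree; }
        return InitrdPath::None;
    }
    if s.dt_has_initrd_start { InitrdPath::DeviceTree } else { InitrdPath::None }
}

fn main() {
    // LoongArch64 QEMU case: EFI boot, no LoadFile2 — the DTB fallback wins.
    let s = BootState {
        arch: Arch::Other, ramdisk_image_nonzero: false,
        efi_boot: true, efi_loadfile2_ok: false,
        dt_available: true, dt_has_initrd_start: true,
    };
    assert_eq!(select_initrd_path(&s), InitrdPath::DeviceTree);
}
```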
No initramfs is also valid — the kernel falls back to a minimal in-kernel rootfs
(tmpfs) and attempts to find /init from a built-in CPIO archive. If no built-in
CPIO is present and no initramfs was loaded, the kernel panics with a descriptive
message: "No initramfs found and no built-in rootfs — cannot locate /init".
Validation (after loading, regardless of path):
- Verify the initramfs starts with a valid cpio magic: 070701 (newc, no CRC) or 070702 (newc, with CRC). Reject if absent. (The binary CPIO format 0707 and old ASCII format 070707 are not supported — only newc format is used for Linux initramfs images, matching Linux init/initramfs.c.)
- Verify size_bytes > 0 and that the physical range [physical_addr, physical_addr + size_bytes) lies entirely within available RAM (not in reserved regions or MMIO holes). Reject with a boot error if not.
- If IMA is active (Integrity Measurement Architecture, Section 9.5), measure the complete initramfs into PCR 10 before executing any init scripts. This matches the Linux IMA policy for initramfs measurement.
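The first two checks can be sketched as follows. This is illustrative: usable_ram stands in for the allocator's region list, and the real check also consults reserved regions and MMIO holes:

```rust
/// Validate a loaded initramfs blob: newc cpio magic plus containment of the
/// physical range within a usable-RAM region. Sketch only.
fn validate_initramfs(
    blob: &[u8],
    phys_addr: u64,
    usable_ram: &[(u64, u64)], // (base, len) pairs of usable RAM
) -> Result<(), &'static str> {
    if blob.is_empty() {
        return Err("empty initramfs");
    }
    // Only the newc ASCII formats are accepted (070701 / 070702).
    if !(blob.starts_with(b"070701") || blob.starts_with(b"070702")) {
        return Err("initramfs is not a newc cpio archive");
    }
    let end = phys_addr.checked_add(blob.len() as u64).ok_or("range overflow")?;
    // The whole [phys_addr, end) range must sit inside one usable region.
    let contained = usable_ram
        .iter()
        .any(|&(base, len)| phys_addr >= base && end <= base + len);
    if contained { Ok(()) } else { Err("initramfs outside usable RAM") }
}

fn main() {
    let ram = [(0x8000_0000u64, 0x4000_0000u64)]; // 1 GiB of RAM at 2 GiB
    assert!(validate_initramfs(b"070701rest-of-archive", 0x8200_0000, &ram).is_ok());
    assert!(validate_initramfs(b"070707...", 0x8200_0000, &ram).is_err()); // old ASCII format rejected
    assert!(validate_initramfs(b"070701...", 0xF000_0000, &ram).is_err()); // outside RAM
}
```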
2.18 CPU Errata and Speculation Mitigations¶
Modern CPUs ship with known errata — hardware bugs documented in vendor errata sheets. UmkaOS handles these systematically rather than scattering workarounds through the codebase.
Early microcode loading — CPU microcode is applied before most kernel initialization, matching the Linux early microcode loading model. The microcode blob is located by scanning the raw initramfs image in physical memory (NOT by mounting the filesystem — initramfs mount happens later). Linux uses the same approach: the bootloader provides an uncompressed CPIO archive prepended to the initramfs; the kernel extracts the microcode by parsing the raw CPIO headers in memory at boot.
The microcode update runs at Phase 0.14a (after early serial init, before CPUID-dependent decisions). See the canonical x86-64 phase table in Section 2.2 for the precise ordering.
Phase 0.14a: Early microcode update
1. Scan raw initramfs blob in physical memory for microcode CPIO archive
(/lib/firmware/intel-ucode/ or /lib/firmware/amd-ucode/ paths in CPIO)
2. Validate signature (CPU hardware performs validation internally —
the WRMSR to the update MSR triggers the CPU's built-in signature
verification; the kernel cannot bypass or customize this check)
3. Apply via WRMSR to IA32_BIOS_UPDT_TRIG (Intel) or MSR_AMD64_PATCH_LOADER (AMD)
4. Re-read CPUID — microcode may change feature flags (critical: must happen
before CPUID-dependent decisions such as page table format selection)
5. Log applied microcode revision to ring buffer
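The raw CPIO scan in step 1 reduces to a small newc header walk: no filesystem, just pointer arithmetic over the blob. A self-contained sketch — the file path is hypothetical, and real code additionally bounds-checks every field against the blob length:

```rust
/// Parse one 8-hex-char newc header field.
fn hex8(b: &[u8]) -> Option<u32> {
    u32::from_str_radix(std::str::from_utf8(b).ok()?, 16).ok()
}
fn align4(n: usize) -> usize { (n + 3) & !3 }

/// Walk a newc archive in memory and return the data of `path`, or None.
/// Header layout: "070701" magic + 13 fields of 8 ASCII hex chars (110 bytes).
fn cpio_find<'a>(archive: &'a [u8], path: &str) -> Option<&'a [u8]> {
    let mut off = 0usize;
    while off + 110 <= archive.len() && &archive[off..off + 6] == b"070701" {
        let filesize = hex8(&archive[off + 54..off + 62])? as usize; // c_filesize
        let namesize = hex8(&archive[off + 94..off + 102])? as usize; // c_namesize
        let name = &archive[off + 110..off + 110 + namesize - 1]; // strip trailing NUL
        if name == b"TRAILER!!!" {
            return None;
        }
        let data_off = off + align4(110 + namesize); // name padded to 4 bytes
        if name == path.as_bytes() {
            return Some(&archive[data_off..data_off + filesize]);
        }
        off = align4(data_off + filesize); // data also padded to 4 bytes
    }
    None
}

/// Build a minimal single-file archive for demonstration (hypothetical path).
fn demo_archive() -> Vec<u8> {
    let name = b"lib/firmware/demo-ucode.bin\0";
    let mut a = Vec::new();
    a.extend_from_slice(b"070701");
    for i in 0..13 {
        // All fields zero except c_filesize (index 6) and c_namesize (index 11).
        let v: u32 = match i { 6 => 4, 11 => name.len() as u32, _ => 0 };
        a.extend_from_slice(format!("{:08X}", v).as_bytes());
    }
    a.extend_from_slice(name);
    while a.len() % 4 != 0 { a.push(0); }
    a.extend_from_slice(&[0xDE, 0xAD, 0xBE, 0xEF]); // the "microcode" payload
    a
}

fn main() {
    let a = demo_archive();
    assert_eq!(cpio_find(&a, "lib/firmware/demo-ucode.bin"),
               Some(&[0xDE, 0xAD, 0xBE, 0xEF][..]));
    assert_eq!(cpio_find(&a, "missing"), None);
}
```

Because the walk needs only the blob's base address and length, it runs safely before the allocator or any filesystem is online — exactly the constraint Phase 0.14a operates under.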
Errata database — After microcode loading and CPUID enumeration, UmkaOS consults a per-CPU-model quirk table:
/// CPU errata entry — matches a specific CPU stepping to its required workarounds.
///
/// The errata database is a static `&[CpuErrata]` table, compiled into the kernel.
/// During boot (step 4d), after CPUID/MIDR/STFLE enumeration and microcode loading,
/// the kernel iterates this table. For each entry whose `match_id` matches the
/// current CPU and whose `microcode_min` threshold is satisfied (or absent), the
/// `workaround` function is called and the `errata_bits` are OR'd into the per-CPU
/// `ErrataCaps` structure defined in [Section 2.16](#extended-state-and-cpu-features).
struct CpuErrata {
/// CPU identification (vendor, family, model, stepping range).
match_id: CpuMatch,
/// Human-readable errata identifier (e.g., "SKX003", "ZEN4-ERR-1234").
errata_id: &'static str,
/// Workaround function applied during boot.
workaround: fn() -> Result<()>,
/// Classification bitmask for boot-parameter override.
class: ErrataClass,
/// Errata bits to OR into the per-CPU ErrataCaps when this entry matches.
/// This connects the workaround entry to the typed bitflags in
/// [Section 2.16](#extended-state-and-cpu-features) that code_alternative! and
/// runtime checks query.
errata_bits: ErrataCaps,
/// Minimum microcode revision required for this workaround to be effective.
/// If `Some(rev)` and the CPU's microcode revision (from `CpuFeatureSet.microcode_revision`)
/// is below `rev`, the workaround is skipped and a warning is logged advising
/// the admin to update microcode. If `None`, the workaround is unconditional.
microcode_min: Option<u64>,
}
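The table walk described in the doc comment can be sketched with the types pared down to scalars — ErrataCaps as a bare u32 and CpuMatch as a predicate — where the kernel uses the full structs from this section:

```rust
/// Reduced-form errata entry for illustration: a match predicate stands in
/// for CpuMatch and a plain u32 stands in for the ErrataCaps bitflags.
struct CpuErrata {
    errata_id: &'static str,
    matches: fn(&Cpu) -> bool,
    workaround: fn() -> Result<(), ()>,
    errata_bits: u32,
    microcode_min: Option<u64>,
}

struct Cpu { family: u8, model: u8, microcode_revision: u64 }

/// Iterate the table: for each matching entry with satisfied (or absent)
/// microcode threshold, run the workaround and OR its bits into the caps.
fn apply_errata(table: &[CpuErrata], cpu: &Cpu) -> (u32, Vec<&'static str>) {
    let (mut caps, mut applied) = (0u32, Vec::new());
    for e in table {
        if !(e.matches)(cpu) { continue; }
        if let Some(min) = e.microcode_min {
            if cpu.microcode_revision < min {
                continue; // workaround ineffective below this revision: skip and warn
            }
        }
        if (e.workaround)().is_ok() {
            caps |= e.errata_bits; // OR into the per-CPU ErrataCaps
            applied.push(e.errata_id);
        }
    }
    (caps, applied)
}

const BIT_DEMO: u32 = 1 << 0;

/// One-entry demo table (hypothetical errata id and match).
fn demo_table() -> Vec<CpuErrata> {
    vec![CpuErrata {
        errata_id: "DEMO-1",
        matches: |c| c.family == 6 && c.model == 85,
        workaround: || Ok(()),
        errata_bits: BIT_DEMO,
        microcode_min: Some(0x200),
    }]
}

fn main() {
    let old = Cpu { family: 6, model: 85, microcode_revision: 0x100 };
    let new = Cpu { family: 6, model: 85, microcode_revision: 0x200 };
    assert_eq!(apply_errata(&demo_table(), &old).0, 0); // gated on microcode
    assert_eq!(apply_errata(&demo_table(), &new).1, vec!["DEMO-1"]);
}
```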
Errata classification — Each errata entry is tagged with one or more ErrataClass
bits. These serve two purposes: (1) boot parameters can disable entire classes
(umka.mitigate.barrier=off), and (2) the kernel can report grouped statistics
(e.g., "3 TLB workarounds active, 2 feature disables active").
bitflags! {
/// Classification of errata workarounds. A single errata entry may
/// combine multiple classes (e.g., a TLB workaround that also requires
/// an IPI is `TLBI_WORKAROUND | IPI_REQUIRED`).
pub struct ErrataClass: u32 {
/// MSR/system-register write to disable or enable a feature.
const MSR_TWEAK = 1 << 0;
/// Alternative code path via code_alternative! or AlgoDispatch.
const CODE_PATH = 1 << 1;
/// Disable a CPU feature entirely (clear CPUID/HWCAP bit).
const FEATURE_DISABLE = 1 << 2;
/// TLB invalidation workaround (repeat TLBI, extra DSB/ISB).
const TLBI_WORKAROUND = 1 << 3;
/// Memory barrier insertion (LFENCE, DMB, FENCE, DBAR).
const BARRIER_INSERTION = 1 << 4;
/// Instruction upgrade (dc cvac → dc civac, eieio → lwsync).
const INSTRUCTION_UPGRADE = 1 << 5;
/// Hide a feature from userspace HWCAP/CPUID reporting.
const HWCAP_HIDE = 1 << 6;
/// Timer or counter workaround (re-read, frequency override).
const TIMER_WORKAROUND = 1 << 7;
/// Context switch hook (save/restore extra state, flush buffers).
const CONTEXT_SWITCH = 1 << 8;
/// Idle path modification (limit C-state, add barriers).
const IDLE_PATH = 1 << 9;
/// Requires IPI broadcast (e.g., remote TLB shootdown on broken TLBI).
const IPI_REQUIRED = 1 << 10;
/// KVM-specific hook (guest entry/exit, VMCS field workaround).
const KVM_HOOK = 1 << 11;
/// Workaround is effective only with minimum microcode version.
const MICROCODE_GATE = 1 << 12;
/// Cache operation override (coherency workaround, DMA path change).
const CACHE_OP_OVERRIDE = 1 << 13;
/// PTE bit layout workaround (non-standard encodings, dirty-bit races).
const PTE_BIT_LAYOUT = 1 << 14;
/// Driver demotion: force a device to lower isolation tier.
const TIER_DEMOTION = 1 << 15;
}
}
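Purpose (1) — disabling whole classes from the command line — can be sketched as follows. The name-to-bit table is illustrative and shows only a few classes; the umka.mitigate. prefix is taken from the text above:

```rust
/// Map a boot-parameter class name to its ErrataClass bit.
/// Illustrative subset; bit values mirror the bitflags definition above.
fn class_bit(name: &str) -> Option<u32> {
    Some(match name {
        "msr" => 1 << 0,      // MSR_TWEAK
        "codepath" => 1 << 1, // CODE_PATH
        "barrier" => 1 << 4,  // BARRIER_INSERTION
        "timer" => 1 << 7,    // TIMER_WORKAROUND
        _ => return None,
    })
}

/// Accumulate a disabled-class mask from umka.mitigate.<class>=off parameters.
fn disabled_classes(cmdline: &str) -> u32 {
    cmdline
        .split_whitespace()
        .filter_map(|p| p.strip_prefix("umka.mitigate."))
        .filter_map(|p| p.strip_suffix("=off"))
        .filter_map(class_bit)
        .fold(0, |acc, b| acc | b)
}

fn main() {
    let mask = disabled_classes(
        "root=/dev/vda1 umka.mitigate.barrier=off umka.mitigate.timer=off");
    assert_eq!(mask, (1 << 4) | (1 << 7));
    // An errata entry is suppressed when any of its class bits is disabled.
    let entry_class = (1 << 4) | (1 << 10); // BARRIER_INSERTION | IPI_REQUIRED
    assert!(entry_class & mask != 0);
}
```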
CPU matching — CpuMatch supports all eight architectures. Each variant
encodes the minimum identification needed to match a CPU model/stepping:
enum CpuMatch {
/// x86 (Intel/AMD): vendor + family + model + stepping range.
X86 {
vendor: X86Vendor,
family: u8,
model: u8,
/// Inclusive stepping range. `(0, 0xFF)` matches all steppings.
stepping_range: (u8, u8),
},
/// AArch64: MIDR_EL1 fields.
Aarch64 {
/// Implementer code (0x41 = ARM, 0x42 = Broadcom, 0x43 = Cavium,
/// 0x48 = HiSilicon, 0x4E = NVIDIA, 0x50 = APM, 0x51 = Qualcomm,
/// 0x61 = Apple, 0xC0 = Ampere).
implementer: u8,
/// Primary part number (e.g., 0xD03 = Cortex-A53, 0xD0B = Cortex-A76).
part_num: u16,
/// Inclusive revision range (MIDR variant:revision). `(0, 0xFF)` = all.
revision_range: (u8, u8),
},
/// ARMv7: MIDR fields (same encoding as AArch64 but 32-bit register).
Armv7 {
implementer: u8,
part_num: u16,
revision_range: (u8, u8),
},
/// RISC-V: vendor + architecture ID from DT or CSR.
Riscv {
/// mvendorid CSR value (0x489 = SiFive, 0x5B7 = T-Head).
mvendorid: u64,
/// marchid CSR value (0 = unknown — use mvendorid-only matching).
marchid: u64,
/// mimpid range for stepping-level matching. `(0, u64::MAX)` = all.
mimpid_range: (u64, u64),
},
/// PPC32: Processor Version Register (PVR) fields.
Ppc32 {
/// PVR[31:16] (e.g., 0x8020 = e500v2, 0x0040 = 440EP).
pvr_version: u16,
/// PVR[15:0] revision range. `(0, 0xFFFF)` = all.
pvr_revision_range: (u16, u16),
},
/// PPC64LE: Processor Version Register fields.
Ppc64 {
/// PVR[31:16] (e.g., 0x004D = POWER8, 0x004E = POWER9, 0x0080 = POWER10).
pvr_version: u16,
pvr_revision_range: (u16, u16),
},
/// s390x: Machine type + facility bits.
S390x {
/// Machine type from STIDP (e.g., 0x2964 = z13, 0x3906 = z14,
/// 0x8561 = z15, 0x3931 = z16). 0 = match all machine types.
machine_type: u16,
/// Required facility bits that must be present (or absent) for the
/// errata to apply. Encoded as (bit_index, must_be_set).
/// Empty slice = match all facility configurations.
facility_check: &'static [(u16, bool)],
},
/// LoongArch64: PRID (Processor ID from CPUCFG word 0).
LoongArch {
/// PRID[31:16] company ID (0x14 = Loongson).
company_id: u16,
/// PRID[15:8] series (0xC0 = 3A5000, 0xD0 = 3A6000).
series: u8,
/// PRID[7:0] revision range. `(0, 0xFF)` = all.
revision_range: (u8, u8),
},
/// Match all CPUs (architecture-wide errata, e.g., NMI IRET on all x86).
AllX86,
AllAarch64,
AllArmv7,
AllRiscv,
AllPpc,
AllS390x,
AllLoongArch,
}
#[repr(u8)]
enum X86Vendor {
Intel = 0,
Amd = 1,
/// Hygon (AMD-derived, used in China).
Hygon = 2,
/// Zhaoxin (VIA-derived, used in China).
Zhaoxin = 3,
}
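Stepping-range matching for the X86 variant can be sketched in isolation. The enum is redeclared here in reduced form for a runnable example; the kernel's matcher covers every CpuMatch variant:

```rust
/// Reduced X86 vendor enum and match entry for illustration.
#[derive(PartialEq)]
enum X86Vendor { Intel, Amd }

struct X86Match {
    vendor: X86Vendor,
    family: u8,
    model: u8,
    /// Inclusive stepping range; (0, 0xFF) matches all steppings.
    stepping_range: (u8, u8),
}

/// Does this match entry apply to the running CPU's identification?
fn matches(m: &X86Match, vendor: &X86Vendor, family: u8, model: u8, stepping: u8) -> bool {
    m.vendor == *vendor
        && m.family == family
        && m.model == model
        && (m.stepping_range.0..=m.stepping_range.1).contains(&stepping)
}

fn main() {
    // "SKX-TSX-ABORT": Skylake-X, family 6, model 85, stepping < 5,
    // encoded as the inclusive range (0, 4).
    let skx = X86Match {
        vendor: X86Vendor::Intel, family: 6, model: 85, stepping_range: (0, 4),
    };
    assert!(matches(&skx, &X86Vendor::Intel, 6, 85, 4));  // affected stepping
    assert!(!matches(&skx, &X86Vendor::Intel, 6, 85, 5)); // fixed stepping
    assert!(!matches(&skx, &X86Vendor::Amd, 6, 85, 4));   // wrong vendor
}
```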
The quirk table is checked during boot (step 4d, after CPUID). Each matching entry's workaround function is called. Workarounds are logged to the ring buffer.
Errata entries (production table — all entries required for correctness):
errata_id |
CPU | Class | Workaround |
|---|---|---|---|
"ZEN2-RDRAND" |
AMD Zen 2 (family 0x17, model 0x71) | FEATURE_DISABLE |
RDRAND returns 0xFFFFFFFF on some steppings. Disable RDRAND/RDSEED CPUID bits; fall back to DRBG. |
"SKX-TSX-ABORT" |
Intel Skylake-X (family 6, model 85, stepping <5) | FEATURE_DISABLE |
TSX (RTM/HLE) causes unpredictable aborts. Clear RTM and HLE CPUID bits. |
"ADL-PCID" |
Intel Alder Lake / Raptor Lake (family 6, model 0x97/0xBF) | FEATURE_DISABLE |
INVLPG fails to flush Global TLB entries when PCID enabled. Disable PCID; use full CR3 reload for TLB invalidation. |
"SPR-TILEDATA" |
Intel Sapphire Rapids (family 6, model 0x8F, stepping 4) | CONTEXT_SWITCH |
XRSTOR fault leaves partial AMX tile state. Re-execute faulted XRSTOR before any XSAVE of tile data. |
"SPR-VMX-TIMER" |
Intel Sapphire Rapids (family 6, model 0x8F) | KVM_HOOK |
VMX preemption timer misfires with value 1. Clamp to max(2, requested_value). |
"AEPIC-X2APIC" |
Intel Ice Lake / Alder Lake (family 6, models 0x6A/0x97) | MSR_TWEAK |
AEPIC Leak (CVE-2022-21233): xAPIC MMIO leaks stale data. Force x2APIC mode on affected CPUs. |
"MWAIT-IPI" |
Intel Apollo Lake / ICX / Lunar Lake | IDLE_PATH \| IPI_REQUIRED |
MWAIT fails to wake on interrupt. Fall back to IPI-based idle wakeup. |
"ZEN5-RDSEED" |
AMD Zen 5 (family 0x1A) | CODE_PATH |
16-bit and 32-bit RDSEED returns zero. Use 64-bit RDSEED only. |
"ZEN4-1485" |
AMD Zen 4 (family 0x19, model 0x61/0x71) | MSR_TWEAK |
Spurious #UD with STIBP disabled. Unconditional MSR fix at boot. |
"XFD-HOTPLUG" |
Intel SPR+ (family 6, model ≥0x8F) | CONTEXT_SWITCH |
XFD per-CPU cache desyncs on CPU hotplug. Invalidate XFD cache on hotplug events. |
"BAYTRAIL-C6" |
Intel Bay Trail / Cherry Trail (family 6, model 0x37/0x4C) | IDLE_PATH |
C6 freeze: CPU fails to wake. Block C6 and deeper idle states. |
"SPLIT-LOCK" |
Intel Ice Lake+ (family 6, model ≥0x7E) | MSR_TWEAK |
Split lock detection: enable IA32_CORE_CAPABILITIES[5] on capable CPUs, send SIGBUS to offending userspace processes. |
"TDX-HECKLER" |
Intel SPR+ (TDX-capable) | KVM_HOOK |
Malicious hypervisor injects interrupts into TDX guests to influence execution paths, causing information disclosure (CVE-2024-26922, "Heckler" attack). Mitigation: (a) Validate all host-provided values: MMIO reads return sanitised defaults if outside expected range, port I/O results are bounds-checked, MSR reads are compared against a whitelist of safe values, CPUID leaves are cached at boot and reused. (b) Interrupt injection filtering: maintain a per-vCPU bitmap of expected interrupt vectors; unexpected vectors are logged and suppressed (delivered as NMI instead of the injected vector). (c) #VE (virtualisation exception) handler must never branch on untrusted host data — all decisions use guest-internal state only. |
"SGX-TDX-ERRINJ" |
Intel Xeon (certain SGX/TDX-capable) | MICROCODE_GATE |
INTEL-SA-00837: unauthorized error injection in SGX/TDX enables privilege escalation. Boot-time microcode version validation required before enabling SGX/TDX; disable if below minimum safe version. |
"AMX-TILE-CSTATE" |
Intel SPR (family 6, model 0x8F) | IDLE_PATH |
Uninitialized AMX tile register state prevents deep C-states (C6+). Idle path must ensure AMX tile state is in INIT configuration before requesting deep C-states. |
"A76-1286807" |
ARM Cortex-A76 (MIDR 0x41_D0B0, rev<r4p0) |
CODE_PATH \| TLBI_WORKAROUND |
Speculative AT instruction may corrupt TLB. Repeat TLBI+DSB in invalidation sequence. |
"A55-HAFDBS" |
ARM Cortex-A55 (MIDR 0x41_D05x) |
FEATURE_DISABLE |
Broken hardware DBM (FEAT_HAFDBS). Disable DBM unconditionally. |
"A510-DIRTYBUG" |
ARM Cortex-A510 (MIDR 0x41_D46x, early steppings) |
FEATURE_DISABLE |
Broken dirty bit ordering. Disable hardware dirty bit management. |
"A510-BF16" |
ARM Cortex-A510 (MIDR 0x41_D46x, early steppings) |
HWCAP_HIDE |
BF16/VMMLA incorrect results with shared NEON. Hide BF16 from hwcap. |
"A715-BBM" |
ARM Cortex-A715 (MIDR 0x41_D4Dx) |
CODE_PATH |
ESR/FAR corruption on exec→non-exec PTE transitions. Break-before-make mandatory for all permission changes. |
"A57-MMIO" |
ARM Cortex-A57 (MIDR 0x41_D07x) |
CODE_PATH |
Exclusive + device load deadlock (erratum 832075). Promote MMIO loads to load-acquire. |
"SSBS-3194386" |
AArch64 (A76 through X4, N1-N3, V1-V3) | BARRIER_INSERTION |
SSBS not self-synchronizing. Insert SB/ISB after every PSTATE.SSBS modification. |
"LSE-STADD" |
AArch64 (all cores with LSE atomics) | CODE_PATH |
STADD/STCLR/STSET have catastrophic interconnect performance. Use LDADD/LDCLR/LDSET with XZR destination instead. |
"THEAD-GHOSTWRITE" |
RISC-V T-Head C910/C920 | FEATURE_DISABLE \| TIER_DEMOTION |
XTheadVector bypasses MMU — unprivileged physical write. Disable XTheadVector unconditionally. |
"THEAD-MERGEBUF" |
RISC-V T-Head C910/C920 | BARRIER_INSERTION |
Store merge buffer delays stores indefinitely. Insert fence w,o after volatile stores used for inter-core communication. |
"THEAD-MARCHID0" |
RISC-V T-Head C9xx | CODE_PATH |
marchid=0, cannot distinguish C906/C910/C920. Use CSR probing for disambiguation. |
"APLIC-LEVEL-MSI" |
RISC-V (APLIC in MSI mode) | CODE_PATH |
Level-sensitive interrupt loss in MSI mode. Re-assertion check mandatory after ISR completion (permanent interrupt loss otherwise). |
"RVV-071-COMPAT" |
RISC-V T-Head C906/C910 (XTheadVector 0.7.1) | HWCAP_HIDE |
RVV 0.7.1 and 1.0 are mutually exclusive. Never advertise standard V extension on cores with XTheadVector only. |
"THEAD-C906-HALT" |
RISC-V T-Head C906 (Sophgo CV1800B, Allwinner D1) | FEATURE_DISABLE |
XTheadMemIdx load-with-increment (dest=base register, reserved encoding) + CSR read targeting same register permanently halts the core. No exception raised. Unprivileged DoS — consider disabling XTheadMemIdx entirely. |
"THEAD-C908-VHALT" |
RISC-V T-Head C908 | FEATURE_DISABLE |
Certain vector instruction sequences permanently halt the core (found by RISCVuzz fuzzing). Consider disabling vector extension on C908 in high-security environments. |
"THEAD-IMPRECISE" |
RISC-V T-Head C910/C920 (TH1520, SG2042) | CODE_PATH |
Load bus errors are imprecise: sepc/mepc points to instruction after faulting load; stval/mtval contains physical address instead of virtual. Fault handler must tolerate off-by-one sepc and physical-address stval. |
"SIFIVE-CIP453" |
RISC-V SiFive U54/U74 (FU540, FU740, JH7110) | CODE_PATH |
stval/mtval not sign-extended on instruction page/access faults — bit 38 not propagated. Instruction fault handler must manually sign-extend stval before address comparison. |
"RVV-SIGRETURN" |
RISC-V (all cores with V extension) | CODE_PATH |
CVE-2024-35873: during rt_sigreturn(), vector state marked dirty. If vectorized copy_from_user() then restores registers from sigcontext, dirty live state is saved back first, corrupting it. Signal return path must never use vector-accelerated copies for vstate restore. |
"SVADU-SPECULATIVE" |
RISC-V (cores with Svadu extension) | CODE_PATH |
Svadu hardware A-bit updates may be speculative. D-bit updates must be precise. Page replacement algorithms must NOT rely on A-bit exactness — treat as hint only. Detect menvcfg.ADUE at boot; all harts must use the same PTE-update scheme (Svade software OR Svadu hardware). |
"PMP-PARTIAL" |
RISC-V (implementations that decompose misaligned accesses) | CODE_PATH |
Misaligned store straddling a PMP boundary can partially complete — portion passing PMP check becomes visible, leaving memory inconsistent. PMP boundaries must be aligned to at least maximum access width. Never allow untrusted code to perform misaligned stores near PMP boundaries. |
"RISCV-COUNTERS" |
RISC-V (T-Head C906/C908/C910, likely others) | MSR_TWEAK |
Unprivileged rdcycle/rdinstret enables Cache+Time and CycleDrift side-channel attacks that leak data across privilege boundaries, even on in-order cores. Disable user-space access to cycle and instret counters via scounteren CSR at boot; provide vDSO-based timing interface instead. |
"RISCV-SIFIVE-TLB" |
RISC-V SiFive FU740 (CIP-1200) | TLBI_WORKAROUND |
Address-specific sfence.vma addr fails to reliably flush specified address. Fall back to full sfence.vma (no address argument) via alternatives framework. Negates targeted flush performance benefit. |
"POWER8-CCF" |
PPC64 POWER8/9 | CODE_PATH \| CONTEXT_SWITCH |
Count cache flush sequence (bcctr flush) on context switch for Spectre v2. |
"POWER-EIEIO" |
PPC64 POWER8+ | INSTRUCTION_UPGRADE |
eieio does NOT order cacheable stores — orders cacheable and non-cacheable stores separately but NOT across the boundary. All smp_wmb() uses lwsync instead; driver I/O barriers crossing cacheable↔MMIO must use sync. |
"POWER9-L1D-ENTRY" |
PPC64 POWER9 | CODE_PATH \| CONTEXT_SWITCH |
CVE-2020-4788: POWER9 speculatively operates on L1D data after access from a less-privileged mode. Requires L1D cache flush on both kernel entry (entry_flush) and kernel exit (rfi_flush), plus L1D flush around user memory accesses (uaccess_flush). POWER10 hardware mitigations reduce this requirement. Boot parameters: no_entry_flush, no_uaccess_flush. |
"POWER9-SLB-MULTIHIT" |
PPC64 POWER8/9 (HPT mode) | CODE_PATH |
Software bug can cause two SLB entries with the same ESID, producing bitwise OR of matching VSIDs. Machine check handler must detect SLB multi-hit (DSISR bits), flush and reload the SLB. On POWER8, multi-hit always sets both multi-hit AND parity error bits. Not applicable in radix mode (no SLB). |
"POWER8-TM" |
PPC64 POWER8 (all steppings) | FEATURE_DISABLE |
Multiple Transactional Memory hardware bugs: TM state corruption during signal delivery, race conditions between treclaim and context switch, missing MSR_VEC/MSR_VSX in restore_math. POWER9 emulates TM for guests (significant performance penalty). TM disabled by default; KVM must handle facility unavailable interrupts if exposed to guests. |
"E500-MBAR" |
PPC32 e500v1/v2 (MPC8540, MPC8548) | INSTRUCTION_UPGRADE | CPU-3: mbar MO=1 fails to order cache-inhibited guarded loads after cache-inhibited stores. All MMIO barriers on e500 must use mbar MO=0 (full mbar) or msync. No fix planned by NXP. |
| "E500-HANG" | PPC32 e500 (MPC8548 and similar) | MSR_TWEAK | A-005125: When the core initiates a guarded load to PCI/PCIe/sRIO near the time a PCI device writes to cacheable coherent memory hitting modified L1 data, the CCB bus arbiter enters an invalid state, causing a system hang. Workaround: set SPR976[40:41] = 0b10 at early boot. No fix planned by NXP. |
| "E500-NO-LWSYNC" | PPC32 e500v1/v2 (MPC8540, MPC8548, MPC8572) | INSTRUCTION_UPGRADE | e500v1/v2 cores do not implement the lwsync instruction; executing it causes an Illegal Instruction trap. Mitigation: code_alternative! patches all lwsync sites to sync (full barrier) at boot when the errata flag is set. Performance impact: sync is heavier than lwsync (~10-20 cycles vs ~5-8 cycles), but e500v1/v2 are single-issue in-order cores where the difference is minimal. e500mc and later implement lwsync correctly. |
| "E500-BTB" | PPC32 e500v2/e500mc/e5500 | CODE_PATH \| CONTEXT_SWITCH | A-004466/A-004465: Branch Target Buffer (BTB) phantom branches — stale BTB entries from a previous address space redirect the instruction stream to an incorrect target. Flush the BTB by writing the BBFI bit in the BUCSR register on every context switch. Also serves as the Spectre v2 mitigation on Book E. |
| "E500-TLB1FI" | PPC32 e500v1/v2 | TLBI_WORKAROUND | CPU-A001: Flash-invalidation of TLB1 via mtspr MMUCSR0[TLB1_FI] fails if it coincides with a tlbivax to TLB1 on any processor. Use individual per-entry tlbivax instructions instead of flash-invalidate. |
| "S390X-EXPOLINES" | s390x z12/z13 (no facility 82) | CODE_PATH | Expolines for indirect branches (Spectre v2). Eliminated by eToken on z14+. |
| "LA-SW-TLB" | LoongArch 3A5000 (series 0xC0) | CODE_PATH \| TLBI_WORKAROUND \| IPI_REQUIRED | Software TLB refill handler + IPI-based TLB shootdown (no hardware broadcast). |
| "RISCV-ANDES-CMO" | RISC-V Andes AX45MP (Renesas RZ/Five) | CODE_PATH \| CACHE_OP_OVERRIDE | Predates Zicbom. Uses a SiFive-compatible L2 cache controller for DMA coherency. Registered as a vendor-specific cache backend alongside T-Head CMO. |
| "RISCV-SMEPMP" | RISC-V (cores without Smepmp extension) | FEATURE_DISABLE | Without Smepmp, PMP cannot deny M-mode access to any region, so firmware memory is not protected from M-mode code. Require Smepmp for security-critical deployments. |
| "RISCV-NO-ZTSO" | RISC-V (cores without Ztso extension) | CODE_PATH | RISC-V base ISA is RVWMO (weak memory ordering). The fence.tso instruction is a hint — on cores without Ztso it decodes as a full fence rw,rw. The UmkaOS memory model emits explicit fence rw,rw on all RISC-V by default; if Ztso is detected at boot (misa or DT probing), the alternatives framework may relax selected fences to fence.tso for a ~5-15% throughput gain on fence-heavy paths. Never assume TSO on RISC-V without runtime Ztso confirmation. |
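The E500-NO-LWSYNC patching step lends itself to a small illustration. Below is a host-runnable sketch of the boot-time instruction upgrade, assuming a `code_alternative!`-style list of patch sites; the function name and the slice-based model of kernel text are illustrative (real kernel code writes through a writable alias of the text mapping and flushes the icache afterwards). The opcodes are the architectural PowerPC encodings.

```rust
/// Architectural PowerPC encodings: lwsync = 0x7C2004AC, sync = 0x7C0004AC.
const LWSYNC: u32 = 0x7C20_04AC;
const SYNC: u32 = 0x7C00_04AC;

/// Patch every recorded lwsync site to a full sync when the errata flag
/// is set (E500-NO-LWSYNC). `text` stands in for the kernel text segment;
/// `sites` are word indices recorded by the code_alternative! macro.
fn apply_e500_no_lwsync(text: &mut [u32], sites: &[usize], errata_active: bool) {
    if !errata_active {
        return;
    }
    for &idx in sites {
        debug_assert_eq!(text[idx], LWSYNC, "alternative site must hold lwsync");
        text[idx] = SYNC;
    }
}
```

On e500mc and later (which implement lwsync) the flag stays clear and the sites are left untouched, so capable cores keep the cheaper barrier.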
Device and platform quirk table — Some errata are not CPU-core bugs but platform-level issues: IOMMU quirks, interrupt controller bugs, PCIe host bridge non-compliance, or SoC-specific DMA coherency problems. These are matched by device identity (PCI vendor/device ID, ACPI HID, or DT compatible string) rather than CPU identity.
```rust
/// Platform/device quirk entry — matches a device to its required workarounds.
/// Checked during device probe (not at boot like CpuErrata), because the
/// quirked devices may not be enumerated until PCI/ACPI/DT scanning.
struct DeviceQuirk {
    /// Device identification.
    match_id: DeviceMatch,
    /// Human-readable quirk identifier.
    quirk_id: &'static str,
    /// Workaround applied during device probe.
    workaround: fn(&mut DeviceContext) -> Result<()>,
    /// Classification for reporting and override.
    class: ErrataClass,
}
```
```rust
/// Device matching criteria for platform quirks.
///
/// Real-world platforms have an effectively unbounded number of ACPI/BIOS bugs.
/// This enum must support all matching strategies Linux uses:
/// - PCI vendor/device ID (per-device hardware bugs)
/// - ACPI HID (per-device ACPI namespace bugs)
/// - Device Tree compatible (per-SoC bugs)
/// - IOMMU vendor/model (IOMMU silicon errata)
/// - DMI/SMBIOS strings (per-motherboard/BIOS firmware bugs — the largest
///   category by volume, covering broken ACPI tables, incorrect feature
///   advertisement, wake/sleep failures, non-functional hardware, etc.)
/// - ACPI OEM ID (ACPI table-level bugs matching by OEM signature fields)
///
/// The DMI and ACPI OEM variants are checked during early boot (after ACPI/DMI
/// table parsing, before device probe) because they affect system-wide behavior,
/// not individual devices.
enum DeviceMatch {
    /// PCI device: vendor + device + optional subvendor/subdevice.
    Pci {
        vendor: u16,
        device: u16,
        /// 0 = match any subvendor.
        subvendor: u16,
        /// 0 = match any subdevice.
        subdevice: u16,
        /// PCI revision range. `(0, 0xFF)` = all revisions.
        revision_range: (u8, u8),
    },
    /// ACPI device (HID string, e.g., "QCOM0610" for Qualcomm SMMU).
    Acpi {
        hid: &'static str,
    },
    /// Device Tree compatible string (e.g., "arm,gic-v3", "riscv,aplic").
    DeviceTree {
        compatible: &'static str,
    },
    /// Match by IOMMU model (for IOMMU-specific workarounds).
    Iommu {
        vendor: IommuVendor,
        model: u16,
    },
    /// Match by DMI/SMBIOS strings (motherboard + BIOS identification).
    ///
    /// This is the largest category of platform quirks in any shipping OS.
    /// Linux's `drivers/acpi/blacklist.c`, `arch/x86/kernel/quirks.c`, and
    /// hundreds of driver-specific `dmi_system_id` tables collectively contain
    /// 500+ DMI-matched quirks for broken ACPI tables, non-functional hardware
    /// features, incorrect feature advertisement, sleep/wake failures, etc.
    ///
    /// Each field is a substring match (case-insensitive). Empty string = match any.
    /// All non-empty fields must match (AND logic).
    Dmi {
        /// SMBIOS Type 0 (BIOS Information): Vendor string.
        /// e.g., "American Megatrends" / "Dell" / "HP" / "Lenovo" / "Phoenix".
        bios_vendor: &'static str,
        /// SMBIOS Type 0: BIOS Version string.
        /// e.g., "2.30" — often used to pin a quirk to specific BIOS revisions.
        bios_version: &'static str,
        /// SMBIOS Type 1 (System Information): Product Name.
        /// e.g., "ProLiant DL380 Gen10" / "ThinkPad T14s" / "MacBookPro18,1".
        product_name: &'static str,
        /// SMBIOS Type 2 (Baseboard): Board Name.
        /// e.g., "X570 AORUS MASTER" / "ROG STRIX Z690-A".
        board_name: &'static str,
    },
    /// Match by ACPI table OEM fields (for broken ACPI tables from specific
    /// BIOS/firmware vendors). Matches the OEM ID and OEM Table ID fields
    /// present in every ACPI table header.
    ///
    /// Use when the bug is in a specific ACPI table (DMAR, MADT, SRAT, etc.)
    /// from a specific firmware vendor, not tied to the motherboard model.
    AcpiOem {
        /// ACPI table signature to check (e.g., "DMAR", "MADT", "SRAT", "FACP").
        table_sig: &'static [u8; 4],
        /// OEM ID (6 bytes, space-padded). Empty = match any.
        oem_id: &'static str,
        /// OEM Table ID (8 bytes, space-padded). Empty = match any.
        oem_table_id: &'static str,
        /// OEM Revision range. `(0, u32::MAX)` = match all.
        oem_revision_range: (u32, u32),
    },
}

#[repr(u8)]
enum IommuVendor {
    IntelVtd = 0,
    AmdVi = 1,
    ArmSmmu = 2,
    ArmSmmuV3 = 3,
    RiscvIommu = 4,
}
```
DMI/SMBIOS data source: DMI strings are parsed from the SMBIOS entry point table
located via EFI System Table or by scanning the 0xF0000-0xFFFFF physical range (BIOS
legacy). Parsing runs during boot step 4d (after physical memory init, before device
probe). The parsed strings are stored in a DmiInfo struct (immutable after boot) and
made available to all quirk checks.
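The DMI matching rules stated above (case-insensitive substring per field, empty string matches anything, AND across non-empty fields) reduce to a few lines. This is a host-runnable sketch; the `DmiInfo` layout and function names are illustrative, not the final kernel types.

```rust
/// Parsed SMBIOS strings, immutable after boot (illustrative layout).
struct DmiInfo {
    bios_vendor: String,
    bios_version: String,
    product_name: String,
    board_name: String,
}

/// One field of a DeviceMatch::Dmi pattern: empty = match any,
/// otherwise a case-insensitive substring match.
fn field_matches(pattern: &str, actual: &str) -> bool {
    pattern.is_empty()
        || actual.to_ascii_lowercase().contains(&pattern.to_ascii_lowercase())
}

/// All non-empty pattern fields must match (AND logic).
fn dmi_matches(
    info: &DmiInfo,
    bios_vendor: &str,
    bios_version: &str,
    product_name: &str,
    board_name: &str,
) -> bool {
    field_matches(bios_vendor, &info.bios_vendor)
        && field_matches(bios_version, &info.bios_version)
        && field_matches(product_name, &info.product_name)
        && field_matches(board_name, &info.board_name)
}
```

Substring (rather than exact) matching is what lets a single entry like "ProLiant DL380" cover every generation of a chassis family.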
ACPI OEM data source: Every ACPI table header contains OemId (6 bytes) and
OemTableId (8 bytes). These are available as soon as the ACPI table root (RSDP/XSDT)
is parsed in boot step 4e. ACPI OEM quirk checks run before ACPI interpretation
(AcpiOsExecute callbacks) to ensure workarounds are in place before broken AML code
executes.
Quirk volume and maintenance: In any production OS running on diverse hardware,
the platform quirk table grows continuously. Linux adds 20-50 new DMI quirks per
kernel release cycle. UmkaOS adopts the same model: the quirk table is compiled into
the kernel from a machine-readable source (TOML or equivalent), and new entries are
added as platforms are tested. For Linux-compatible behavior, UmkaOS also implements
ACPI _OSI("Linux") and _OSI("Windows 2022") response handling to steer BIOS AML
code paths (many ACPI bugs only manifest when the OS identifies as Linux rather than
Windows, because vendors test primarily on Windows).
Example platform quirk entries:
| quirk_id | Match | Class | Workaround |
|---|---|---|---|
| "IVRS-UNITY-MAP" | Iommu { AmdVi } | CACHE_OP_OVERRIDE | Honor IVRS unity mappings; do not remap. |
| "ECAM-XGENE" | DeviceTree { "apm,xgene-pcie" } | CODE_PATH | Non-ECAM-compliant config space access. Use indirect register access. |
| "GIC700-SPI-RACE" | DeviceTree { "arm,gic-v3" } (rev < r1p6) | BARRIER_INSERTION | SPI deactivation race on affinity migration. Deactivate→barrier→reactivate. |
| "RK3588-GIC600" | DeviceTree { "rockchip,rk3588" } | MSR_TWEAK | GIC shareability broken. Force non-shareable GIC memory mapping. |
| "HP-DL380-DMAR" | Dmi { product_name: "ProLiant DL380" } | FEATURE_DISABLE | Broken DMAR table. Disable VT-d and fall back to SWIOTLB. |
| "LENOVO-HPET" | Dmi { bios_vendor: "Lenovo", board_name: "..." } | TIMER_WORKAROUND | HPET advertised but non-functional. Force TSC clocksource. |
| "ASUS-ASPM" | Dmi { board_name: "ROG STRIX Z690" } | MSR_TWEAK | PCIe ASPM causes NVMe timeouts. Disable ASPM on affected root ports. |
| "DELL-DMAR-OEM" | AcpiOem { "DMAR", "DELL ", "", (0, u32::MAX) } | CODE_PATH | DMAR table includes phantom RMRR entries. Ignore RMRR for specific devices. |
| "AMI-MADT-LINT" | AcpiOem { "APIC", "AMI ", "", (0, 1) } | CODE_PATH | MADT LINT0/LINT1 entries wrong. Override from hardcoded defaults. |
2.18.1 ARM SoC and RISC-V Vendor Diversity (Non-x86 Platform Quirks)¶
The x86 world has ACPI/BIOS quirks tracked by DMI strings; the ARM and RISC-V ecosystems have an equally large — and arguably more fragmented — set of platform-level issues. The difference is that non-x86 platforms use Device Tree (DT) as the primary hardware description and quirk-matching mechanism, and vendor diversity spans hundreds of SoC vendors rather than a handful of BIOS vendors.
ARM SoC vendor zoo — The ARM ecosystem includes hundreds of SoC vendors, each with custom peripheral implementations around licensed ARM CPU cores:
| Vendor Family | Examples | Quirk Volume |
|---|---|---|
| Server | Ampere (Altra, AmpereOne), Marvell (ThunderX2/X3), Fujitsu (A64FX), HiSilicon (Kunpeng) | Moderate — fewer models, better errata docs |
| Mobile | Qualcomm (Snapdragon), MediaTek (Dimensity), Samsung (Exynos), Google (Tensor) | Very high — hundreds of SoC variants, many with custom interconnects |
| Embedded/SBC | Rockchip (RK3588/3566/3399), Allwinner (H616/T527), Amlogic (S905/A311D), NXP (i.MX8/9), TI (AM62/AM64), Broadcom (BCM2711/2712) | Extremely high — vast board diversity, minimal vendor errata disclosure |
| Custom silicon | Apple (M1-M4), NVIDIA (Grace), Qualcomm (Oryon) | Low volume but deep quirks — non-standard peripherals |
Each vendor ships SoC-specific quirks in areas that x86 platforms handle uniformly:
- PCIe host bridges: Nearly every SoC vendor implements a non-standard PCIe root
complex (DesignWare, Synopsys, Cadence IP with vendor-specific configuration
registers). The compatible string (e.g., "rockchip,rk3588-pcie",
"qcom,pcie-sm8550") selects the correct initialization sequence.
- DMA coherency: Many ARM SoCs have non-coherent DMA by default (no hardware
cache snooping between CPU and device). The DT property dma-coherent declares
coherent DMA; its absence triggers explicit cache maintenance (dma_sync_* calls).
Some SoCs have partially-coherent DMA (coherent for certain bus masters but not
others) — the DT dma-ranges property and per-device dma-coherent control this.
- Clock trees: Each SoC has a unique clock topology (PLLs, dividers, muxes).
Clock initialization order and parent assignment come from the DT clocks and
assigned-clocks properties. A wrong clock parent can underclock or overclock a
peripheral, or hang the SoC.
- Pin muxing: GPIOs, UART, SPI, I2C share physical pins. The DT pinctrl-*
properties configure pin function selection. SoC-specific pinctrl drivers interpret
vendor-defined register layouts.
- Interrupt controllers: While GICv2/v3 is standard for the CPU-facing interrupt
path, many SoCs have additional vendor-specific interrupt combiners, wakeup
controllers, and GPIO-to-GIC bridges. Each requires its own IrqDomain chaining.
- Power domains: SoC power islands (GPU, display, ISP, etc.) have vendor-specific
power management register sequences. The power-domains DT property links devices
to their power domain controller driver.
Device Tree as the quirk database — Unlike x86 where the hardware self-describes
via ACPI and quirks are corrections to that description, ARM/RISC-V systems use DT
as the primary hardware description. The DT compatible string serves as both the
device identifier AND the quirk key:
```rust
/// DeviceMatch::DeviceTree matching for SoC-level quirks.
///
/// The compatible property is a list of strings from most-specific to least-specific.
/// Example: compatible = "rockchip,rk3588-pcie", "snps,dw-pcie";
///
/// UmkaOS matches the MOST SPECIFIC string first (SoC-specific quirk),
/// falling back to generic IP block matches (e.g., "snps,dw-pcie" for
/// all DesignWare PCIe implementations).
///
/// Board-level matching: the root node's compatible identifies the board:
///     compatible = "pine64,star64", "starfive,jh7110";
/// This enables board-specific quirks (broken regulators, miswired GPIOs, etc.)
/// analogous to x86 DMI board_name matching.
```
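A runnable sketch of the most-specific-first rule, with the quirk table reduced to a flat list of compatible-string keys (names and the flat-list model are illustrative):

```rust
/// Walk the device's compatible list in order (most specific first);
/// the first entry that has a quirk keyed on exactly that string wins.
/// Returns the index of the matching quirk, if any.
fn match_compatible(compatibles: &[&str], quirk_keys: &[&str]) -> Option<usize> {
    for compat in compatibles {
        if let Some(pos) = quirk_keys.iter().position(|k| k == compat) {
            return Some(pos);
        }
    }
    None
}
```

Because iteration follows the device's own compatible ordering, an RK3588-specific PCIe quirk is preferred over a generic DesignWare one even if the generic entry appears first in the quirk table.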
DT overlays for firmware bugs — DT blobs provided by firmware can themselves be
wrong (missing devices, incorrect register addresses, wrong interrupt routing). Linux
handles this via DT fixups applied at boot — UmkaOS uses the same mechanism:
1. Early boot parses the firmware-provided DT blob
2. A platform fixup table (keyed by root node compatible) applies corrections
3. Corrected DT is used for all subsequent device enumeration
This is the DT equivalent of ACPI _OSI() response steering and MADT/DMAR override
quirks in the x86 world.
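The three steps above can be sketched as a table-driven pass over the parsed blob. The `DtBlob` type, the fixup signature, and the Star64 example correction are illustrative stand-ins, not the kernel's actual DT representation:

```rust
/// Minimal stand-in for a parsed firmware DT blob.
struct DtBlob {
    root_compatible: String,
    nodes: Vec<String>, // stand-in for the parsed node tree
}

type Fixup = fn(&mut DtBlob);

/// Step 2: apply every fixup whose key matches the root node compatible
/// (board identity). Returns how many fixups were applied.
fn apply_dt_fixups(dt: &mut DtBlob, table: &[(&str, Fixup)]) -> usize {
    let mut applied = 0;
    for (key, fixup) in table {
        if dt.root_compatible == *key {
            fixup(dt);
            applied += 1;
        }
    }
    applied
}

/// Example correction (hypothetical): firmware DT omits a UART node.
fn star64_add_missing_uart(dt: &mut DtBlob) {
    dt.nodes.push("serial@10000000".into());
}
```

Step 3 then hands the corrected blob to device enumeration; nothing downstream ever sees the broken firmware version.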
RISC-V vendor fragmentation — RISC-V's fragmentation is worse than ARM's because the ISA is open and vendors can add arbitrary custom extensions:
| Vendor | SoC | Notable Quirks |
|---|---|---|
| T-Head (Alibaba) | C906/C910/C920 | marchid=0 (can't distinguish cores), XTheadVector (non-standard V 0.7.1), store merge buffer, MAEE (non-standard PTE bits), GhostWrite |
| SiFive | U74 (JH7110), P670 | Custom cache management instructions (sifive,ccache), non-standard L2 cache controller |
| Canaan | K230 (K510) | Custom AI accelerator registers, limited RISC-V extension support |
| Andes | AX45MP | Custom performance counters, AndeStar V5 extensions |
| SpacemiT | X60 (Key Stone K1) | Custom multimedia extensions, early RISC-V V 1.0 silicon |
RISC-V identification challenges:
- marchid=0 disambiguation: T-Head C906, C910, and C920 all report marchid=0
from the device tree. UmkaOS must use CSR probing (attempt to read vendor-specific
CSRs like th.mxstatus and catch the illegal instruction trap) to determine the
actual core. The probing sequence is documented in
Section 2.16.
- Vendor extension detection: RISC-V ISA extensions are declared in the DT
riscv,isa string (e.g., rv64imafdc_zba_zbb_xtheadvector). Non-standard
extensions (prefixed x or sx) are vendor-specific and require per-vendor
handling. The SBI (Supervisor Binary Interface) sbi_probe_extension() call
supplements DT for runtime extension detection.
- Non-standard PTE layouts: T-Head MAEE (Memory Attribute Extension Enhancement)
uses a different PTE bit layout than standard RISC-V Svpbmt. The DT compatible
string for the CPU node selects the correct PTE encoding. This means page table code
must support multiple PTE formats, selected at boot — not hardcoded to the standard
layout.
- Interrupt controllers: RISC-V has three distinct interrupt controllers:
- PLIC (Platform Level Interrupt Controller) — original spec, no MSI support
- APLIC (Advanced PLIC) — modern spec, supports both wired and MSI delivery modes
- CLIC (Core-Local Interrupt Controller) — preemptive vectored interrupts,
vendor-specific until ratification
Each requires its own IrqDomain implementation, detected from the DT compatible
string ("riscv,plic0", "riscv,aplic", "riscv,clic").
- DMA coherency: Most RISC-V SoCs have non-coherent DMA. The DT dma-coherent
property is per-device. Cache management for non-coherent DMA is architecture-specific
(T-Head uses custom th.dcache.cva CSR instructions; SiFive uses MMIO-mapped cache
controller registers; standard RISC-V Zicbom uses CBO.CLEAN/CBO.FLUSH/CBO.INVAL
instructions). UmkaOS abstracts this behind arch::current::mm::cache_flush_range()
with per-vendor implementations selected at boot.
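Vendor extension detection from the DT riscv,isa string (described in the list above) reduces to simple string parsing — the base ISA comes first, underscore-separated multi-letter extensions follow, and vendor extensions carry the `x` (or supervisor-level `sx`) prefix:

```rust
/// Extract vendor-specific extensions from a riscv,isa string,
/// e.g. "rv64imafdc_zba_zbb_xtheadvector" → ["xtheadvector"].
fn vendor_extensions(isa: &str) -> Vec<&str> {
    isa.split('_')
        .skip(1) // first chunk is the base ISA plus single-letter extensions
        .filter(|ext| ext.starts_with('x') || ext.starts_with("sx"))
        .collect()
}
```

In practice this DT-derived list is cross-checked against SBI `sbi_probe_extension()` results, since firmware-provided ISA strings are themselves occasionally wrong.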
Platform quirk example entries (ARM/RISC-V):
| quirk_id | Match | Class | Workaround |
|---|---|---|---|
| "RK3588-PCIE-ASPM" | DeviceTree { "rockchip,rk3588-pcie" } | FEATURE_DISABLE | ASPM L1 causes link training failure. Disable ASPM on the RK3588 PCIe controller. |
| "IMX8MQ-VPU-POWER" | DeviceTree { "fsl,imx8mq-vpu" } | CODE_PATH | VPU power domain has an undocumented 10ms settle time. Insert delay after power-on. |
| "BCM2711-PCIE-MSI" | DeviceTree { "brcm,bcm2711-pcie" } | CODE_PATH | Non-standard MSI controller. Use vendor-specific MSI doorbell address. |
| "QCOM-SMMU-BYPASS" | DeviceTree { "qcom,sm8550-smmu-500" } | MSR_TWEAK | S2CR bypass bit inverted in firmware. Override at SMMU init. |
| "JH7110-PCIE-DMA" | DeviceTree { "starfive,jh7110-pcie" } | BARRIER_INSERTION | Non-coherent DMA behind PCIe. Force SWIOTLB bounce buffering. |
| "THEAD-C910-CACHE" | DeviceTree { "thead,c910" } | CODE_PATH | Custom cache management CSRs. Use th.dcache.cva/th.dcache.iva instead of Zicbom instructions. |
Quirk volume comparison across ecosystems:
| Ecosystem | Primary Quirk Mechanism | Approximate Quirk Count (mature OS) | Growth Rate |
|---|---|---|---|
| x86 ACPI/BIOS | DMI strings + ACPI OEM ID | ~500-700 (Linux 6.x) | ~20-50 per release |
| ARM DT | DT compatible strings | ~300-500 (Linux 6.x) | ~30-60 per release (growing faster) |
| RISC-V DT | DT compatible + CSR probing | ~30-50 (Linux 6.x) | ~10-20 per release (accelerating) |
UmkaOS's DeviceQuirk table must grow to match Linux's accumulated quirk database as
hardware support expands. The DeviceMatch::DeviceTree variant already covers the ARM and
RISC-V matching mechanism; the key insight is that the DT compatible string hierarchy
(board → SoC → IP block) provides the same multi-level matching that x86 achieves through
the combination of DMI strings (board), ACPI OEM ID (firmware), and PCI ID (device).
Microcode minimum version table — Some errata workarounds are only effective (or only necessary) at specific microcode revision levels. The kernel must check the current microcode revision before applying certain workarounds:
```rust
/// Static table mapping CPU models to minimum safe microcode revisions.
/// If a CPU's microcode is below the listed version, the kernel logs a
/// warning and may need to apply a more aggressive (higher-overhead)
/// software workaround instead of relying on the microcode fix.
///
/// Checked during boot step 4d after early microcode loading.
/// Also checked at runtime if late microcode loading is supported.
static MICROCODE_MIN_VERSIONS: &[MicrocodeMinVersion] = &[
    // Intel: SRBDS mitigation requires microcode update
    MicrocodeMinVersion {
        match_id: CpuMatch::X86 {
            vendor: X86Vendor::Intel,
            family: 6, model: 0x8E, // Kaby Lake / Coffee Lake
            stepping_range: (0x09, 0x0D),
        },
        min_revision: 0xF4,
        affected_feature: "SRBDS mitigation (MD_CLEAR + VERW)",
    },
    // Intel: GDS (Gather Data Sampling) microcode mitigation
    MicrocodeMinVersion {
        match_id: CpuMatch::X86 {
            vendor: X86Vendor::Intel,
            family: 6, model: 0x55, // Skylake-X / Cascade Lake
            stepping_range: (0x03, 0x07),
        },
        min_revision: 0x02007106,
        affected_feature: "GDS mitigation (VERW clears fill buffers)",
    },
    // AMD: Zen 2 RDRAND fix
    MicrocodeMinVersion {
        match_id: CpuMatch::X86 {
            vendor: X86Vendor::Amd,
            family: 0x17, model: 0x71,
            stepping_range: (0, 0xFF),
        },
        min_revision: 0x08301055,
        affected_feature: "RDRAND reliability",
    },
    // Intel: TSC deadline timer errata (Haswell through Kaby Lake)
    MicrocodeMinVersion {
        match_id: CpuMatch::X86 {
            vendor: X86Vendor::Intel,
            family: 6, model: 0x3C, // Haswell
            stepping_range: (0, 0xFF),
        },
        min_revision: 0x28,
        affected_feature: "TSC deadline timer (errata HSD136/BDM85/SKL088)",
    },
    MicrocodeMinVersion {
        match_id: CpuMatch::X86 {
            vendor: X86Vendor::Intel,
            family: 6, model: 0x3D, // Broadwell
            stepping_range: (0, 0xFF),
        },
        min_revision: 0x2E,
        affected_feature: "TSC deadline timer",
    },
    MicrocodeMinVersion {
        match_id: CpuMatch::X86 {
            vendor: X86Vendor::Intel,
            family: 6, model: 0x4E, // Skylake client
            stepping_range: (0, 0xFF),
        },
        min_revision: 0xE2,
        affected_feature: "TSC deadline timer (SKL088)",
    },
    MicrocodeMinVersion {
        match_id: CpuMatch::X86 {
            vendor: X86Vendor::Intel,
            family: 6, model: 0x5E, // Skylake desktop
            stepping_range: (0, 0xFF),
        },
        min_revision: 0xE2,
        affected_feature: "TSC deadline timer (SKL088)",
    },
    // Intel: Reptar (CVE-2023-23583) — refuse to boot without fix
    MicrocodeMinVersion {
        match_id: CpuMatch::X86 {
            vendor: X86Vendor::Intel,
            family: 6, model: 0x8F, // Sapphire Rapids
            stepping_range: (0, 0xFF),
        },
        min_revision: 0x2B000461,
        affected_feature: "Reptar (CVE-2023-23583) — CRITICAL: refuse SGX/TDX without update",
    },
    // AMD: SEV-SNP microcode signature verification (CVE-2024-56161)
    MicrocodeMinVersion {
        match_id: CpuMatch::X86 {
            vendor: X86Vendor::Amd,
            family: 0x19, model: 0x01,
            stepping_range: (0, 0xFF),
        },
        min_revision: 0x0A0011D1,
        affected_feature: "SEV-SNP signature verification (CVE-2024-56161) — refuse SNP without update",
    },
];

struct MicrocodeMinVersion {
    match_id: CpuMatch,
    min_revision: u64,
    affected_feature: &'static str,
}
```
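The boot-step-4d check itself is a filter over this table. A host-runnable sketch, with matching reduced to (family, model, stepping) and a simplified `McMin` stand-in for `MicrocodeMinVersion`:

```rust
/// Simplified stand-in for MicrocodeMinVersion (x86 match fields inlined).
struct McMin {
    family: u8,
    model: u8,
    stepping_range: (u8, u8),
    min_revision: u64,
    affected_feature: &'static str,
}

/// Return the affected features whose minimum microcode revision is not met
/// by the running CPU — each of these triggers the warning log and a
/// software-only fallback.
fn microcode_below_min<'a>(
    table: &'a [McMin],
    family: u8,
    model: u8,
    stepping: u8,
    revision: u64,
) -> Vec<&'a str> {
    table
        .iter()
        .filter(|e| {
            e.family == family
                && e.model == model
                && (e.stepping_range.0..=e.stepping_range.1).contains(&stepping)
                && revision < e.min_revision
        })
        .map(|e| e.affected_feature)
        .collect()
}
```

Entries marked "refuse without update" (Reptar, SEV-SNP) would additionally gate the corresponding feature off rather than merely logging.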
IA32_ARCH_CAPABILITIES check (x86-64, boot step 4d):
After CPUID enumeration, the kernel reads IA32_ARCH_CAPABILITIES (MSR 0x10A) on
every x86-64 CPU. This MSR reports which speculative execution vulnerabilities the CPU
is immune to. Key bits used by UmkaOS:
| Bit | Name | Meaning |
|---|---|---|
| 0 | RDCL_NO | Not vulnerable to Meltdown / Rogue Data Cache Load. KPTI not needed for userspace isolation. |
| 1 | IBRS_ALL | IBRS provides full Spectre v2 protection (eIBRS). Retpoline not needed. |
| 2 | RSBA | RSB Alternate: returns may use alternate predictors after RSB underflow. RSB fill required even with eIBRS. |
| 4 | SSB_NO | Not vulnerable to Spectre v4 (Speculative Store Bypass). SSBD not needed. |
| 5 | MDS_NO | Not vulnerable to MDS. VERW at transitions not needed. |
| 6 | IF_PSCHANGE_MC_NO | No machine check on page size change (L1TF PTE inversion not needed). |
| 7 | TSX_CTRL | TSX can be disabled via MSR (for TAA mitigation). |
| 8 | TAA_NO | Not vulnerable to TAA. |
| 14 | FBSDP_NO | Fill Buffer Stale Data Propagation immune. |
| 15 | PSDP_NO | Primary Stale Data Propagation immune. |
| 20 | BHI_NO | Not vulnerable to Branch History Injection. BHB clearing not needed. |
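Turning the MSR value into a mitigation plan is pure bit logic. A sketch over a subset of the bits (the `MitigationPlan` struct and its field names are illustrative; in the kernel, `value` comes from reading MSR 0x10A, guarded by a CPUID check that the MSR exists):

```rust
// IA32_ARCH_CAPABILITIES bit positions (subset; see the table above).
const RDCL_NO: u64 = 1 << 0;
const IBRS_ALL: u64 = 1 << 1;
const MDS_NO: u64 = 1 << 5;
const TAA_NO: u64 = 1 << 8;
const BHI_NO: u64 = 1 << 20;

/// Illustrative summary of which software mitigations stay enabled.
struct MitigationPlan {
    kpti: bool,
    retpoline: bool,
    verw_on_exit: bool,
    bhb_clear: bool,
}

fn plan_from_arch_caps(value: u64) -> MitigationPlan {
    MitigationPlan {
        kpti: value & RDCL_NO == 0,       // Meltdown-vulnerable → split page tables
        retpoline: value & IBRS_ALL == 0, // no eIBRS → retpoline for indirect branches
        verw_on_exit: (value & MDS_NO == 0) || (value & TAA_NO == 0),
        bhb_clear: value & BHI_NO == 0,
    }
}
```

A zero MSR value (or an absent MSR, treated as zero) yields the maximally conservative plan, which is the correct default for legacy CPUs.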
Meltdown-PK (transient execution) check: On CPUs without RDCL_NO (Skylake-SP
through Coffee Lake — all MPK-capable), a Meltdown-style speculative execution attack
can transiently read memory in a foreign PKU domain before the protection-key fault is
architecturally resolved. This means Tier 1 isolation on vulnerable CPUs provides
integrity (crash detection) but not confidentiality against a sophisticated
attacker with cache-timing capabilities. UmkaOS logs at boot:
```text
umka: CPU0 vulnerable to Meltdown-PK (IA32_ARCH_CAPABILITIES[RDCL_NO]=0).
Tier 1 PKU isolation provides integrity but not confidentiality against
transient execution attacks. Upgrade to Ice Lake+ for full Tier 1 security.
```
This is informational — Tier 1 isolation remains active on affected CPUs because integrity protection (crash detection on domain boundary violations) still has value.
When a CPU's microcode revision is below the minimum for a matched entry, the kernel logs:
```text
umka: CPU0 microcode 0x00F0 below minimum 0x00F4 for "SRBDS mitigation (MD_CLEAR + VERW)".
Software-only mitigation active. Update microcode for optimal performance.
```
Spectre/Meltdown class mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Meltdown (v3) | KPTI (page table isolation) | Required for Tier 2 + userspace; NOT needed for Tier 1 (same ring, MPK isolation) |
| Spectre v1 | LFENCE barriers at bounds checks; Speculative Load Hardening (SLH) | Compiler-inserted SLH (-mllvm -x86-speculative-load-hardening); manual LFENCE in asm hot paths |
| Spectre v2 | Retpoline / IBRS / eIBRS | Retpoline (-C target-feature=+retpoline-indirect-branches) for indirect branches in kernel code; eIBRS preferred on supporting hardware |
| Spectre v4 (SSB) | SSBD (Spec. Store Bypass Disable) | Per-thread via IA32_SPEC_CTRL MSR; toggled on context switch for untrusted threads |
| MDS/TAA | Buffer clears (VERW) | On context switch to userspace; on VM entry/exit |
| SRBDS | Microcode + VERW | Handled by early microcode update |
| RFDS/GDS | Microcode + opt-in VERW | Same as MDS path |
Mitigation boot parameters:
```text
umka.mitigate=auto          # Default: apply mitigations based on detected CPU (recommended)
umka.mitigate=on            # Force all mitigations on, even if CPU claims to be fixed
umka.mitigate=off           # Disable all mitigations (INSECURE — see below)
umka.mitigate.kpti=off      # Disable specific mitigation class
umka.mitigate.retpoline=off # Disable specific mitigation class
umka.errata.tlbi=off        # Disable ErrataClass::TLBI_WORKAROUND entries
umka.errata.barrier=off     # Disable ErrataClass::BARRIER_INSERTION entries
umka.errata.timer=off       # Disable ErrataClass::TIMER_WORKAROUND entries
umka.errata.idle=off        # Disable ErrataClass::IDLE_PATH entries
```
Boot parameter names map to ErrataClass bits via a static lookup table.
The umka.mitigate.* parameters control speculation mitigations specifically;
the umka.errata.* parameters control non-speculation errata workarounds.
Both families support =off (disable) and =on (force enable even on
unaffected CPUs — useful for testing).
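The parameter-name-to-ErrataClass lookup described above can be sketched as a static table plus a fold over the command line (the bit values and parameter list here are illustrative):

```rust
// ErrataClass bit values — illustrative, not the kernel's actual encoding.
const TLBI_WORKAROUND: u32 = 1 << 0;
const BARRIER_INSERTION: u32 = 1 << 1;
const TIMER_WORKAROUND: u32 = 1 << 2;
const IDLE_PATH: u32 = 1 << 3;

/// Static lookup table: boot parameter name → ErrataClass bit.
static ERRATA_PARAMS: &[(&str, u32)] = &[
    ("umka.errata.tlbi", TLBI_WORKAROUND),
    ("umka.errata.barrier", BARRIER_INSERTION),
    ("umka.errata.timer", TIMER_WORKAROUND),
    ("umka.errata.idle", IDLE_PATH),
];

/// Fold `name=off` / `name=on` fragments of the command line into a mask of
/// disabled errata classes; unknown parameters and values are ignored.
fn disabled_classes(cmdline: &str) -> u32 {
    let mut mask = 0u32;
    for arg in cmdline.split_whitespace() {
        if let Some((name, val)) = arg.split_once('=') {
            if let Some(&(_, bit)) = ERRATA_PARAMS.iter().find(|&&(n, _)| n == name) {
                match val {
                    "off" => mask |= bit,
                    "on" => mask &= !bit, // force-enable: clear any earlier "off"
                    _ => {}
                }
            }
        }
    }
    mask
}
```

The `umka.mitigate.*` family would use a parallel table keyed to mitigation identifiers rather than ErrataClass bits.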
2.18.1.1 Performance Impact of Mitigations¶
The cumulative overhead of speculative execution mitigations is substantial — typically 5-30% depending on workload characteristics:
| Mitigation | Overhead | Worst-case workload |
|---|---|---|
| KPTI | ~5% syscall-heavy; ~100-200ns per user↔kernel transition | Database OLTP (millions of syscalls/sec) |
| Retpoline / eIBRS | ~2-10% | Indirect-branch-heavy code (virtual dispatch, interpreters) |
| SSBD | ~1-5% | Memory-intensive with store-to-load forwarding |
| MDS VERW | ~1-3% on context switch | Frequent user↔kernel transitions |
| Cumulative | 5-30% | Syscall-heavy + indirect-branch-heavy (databases, VMs) |
umka.mitigate=off is legitimate for:
- Air-gapped HPC clusters where all code is trusted and no untrusted workloads run.
- Benchmarking to isolate application performance from mitigation overhead.
- Single-tenant bare-metal where the threat model excludes local attackers.
- Nested within a trusted VM where the host hypervisor enforces mitigations at
the outer boundary (the guest's mitigations are redundant).
The kernel logs a prominent boot warning when mitigations are disabled:
```text
umka: WARNING — speculative execution mitigations DISABLED (umka.mitigate=off).
This system is vulnerable to Spectre, Meltdown, MDS, and related attacks.
Do NOT use in multi-tenant or untrusted environments.
```
Interaction with umka.isolation=performance: When umka.isolation=performance is
set (promoting Tier 1 drivers to Tier 0, disabling CPU-side isolation), the admin has
already accepted a reduced security posture. Combining umka.isolation=performance with
umka.mitigate=off provides the maximum performance envelope — no isolation overhead,
no mitigation overhead — but should be limited to environments where all executing code
is fully trusted. The two settings are independent; either can be set alone.
Runtime reporting — Vulnerability status is exposed via Linux-compatible sysfs:
```text
/sys/devices/system/cpu/vulnerabilities/meltdown:          "Mitigation: PTI"
/sys/devices/system/cpu/vulnerabilities/spectre_v1:        "Mitigation: usercopy/LFENCE"
/sys/devices/system/cpu/vulnerabilities/spectre_v2:        "Mitigation: eIBRS"
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass: "Mitigation: SSBD"
/sys/devices/system/cpu/vulnerabilities/mds:               "Mitigation: Clear buffers"
```
This ensures monitoring tools (spectre-meltdown-checker, lynis) work without modification.
2.18.2 Speculation Mitigations (All Architectures)¶
The x86-specific mitigation table in Section 2.18 covers only one architecture. Here is the complete per-architecture mitigation matrix:
AArch64 mitigations:
| Vulnerability | ARM Identifier | Mitigation | UmkaOS scope |
|---|---|---|---|
| Spectre v1 (bounds bypass) | — | CSDB barriers at bounds checks | Compiler-inserted CSDB barriers after conditional branches (ARM equivalent of x86 SLH; uses CSDB instruction, not LLVM's x86-specific -x86-speculative-load-hardening pass) |
| Spectre v2 (branch target injection) | CVE-2017-5715 | Branch predictor hardening | Hardware: cores reporting CSV2 (ID_AA64PFR0_EL1.CSV2) are unaffected. Software: SMCCC ARCH_WORKAROUND_1 firmware call on affected cores. (ARMv8.5 BTI — Branch Target Identification, enabled via SCTLR_EL1.BT1 — is a separate control-flow integrity feature, not a Spectre v2 mitigation.) |
| Spectre-BHB | CVE-2022-23960 | BHB clearing sequence or firmware call | SMCCC ARCH_WORKAROUND_3 or BHB clearing loop on context switch |
| Meltdown (v3) | CVE-2017-5754 | KPTI (separate EL0/EL1 page tables) | Full KPTI required on Cortex-A75 (all revisions) per ARM security bulletins. NOT needed on cores with CSV3=Yes across all revisions — Cortex-A76, A78, X1, X2, A710, A715, and other Armv9/v8.x cores — or on earlier in-order cores (A53, A55, etc.). Cortex-A510 (all revisions) and Cortex-A520 (prior to r0p2 only; r0p2+ is unaffected) are classified as Variant 3 by ARM, but the actual errata (3117295 and 2966298) describe a speculative unprivileged load whose workaround is a TLBI instruction before returning to EL0, not a full page table split; UmkaOS applies this lightweight TLBI mitigation on A510/A520 instead of the heavyweight KPTI used for Cortex-A75. |
| Spectre v4 (SSB) | CVE-2018-3639 | SSBS (Speculative Store Bypass Safe) | Hardware SSBS bit (ARMv8.5+): per-thread via PSTATE.SSBS. Software: SMCCC ARCH_WORKAROUND_2 |
| Straight-line speculation | — | SB instruction after branches | Compiler-inserted speculation barriers |
ARM firmware interface: Unlike x86 (which uses MSR writes), ARM mitigations are
often applied through SMCCC (SMC Calling Convention) firmware calls to EL3 Secure
Monitor code. The kernel calls ARCH_WORKAROUND_1/2/3 — the firmware applies the
actual mitigation. This is architecturally cleaner (firmware knows the exact CPU
revision) but adds ~100-200 cycles per SMCCC call.
ARMv7 mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | CSDB barriers at bounds checks | Same as AArch64 |
| Spectre v2 | Firmware workaround via SMCCC | ARCH_WORKAROUND_1 for affected Cortex-A cores |
| Meltdown | Not applicable | ARMv7 Cortex-A cores are not affected |
| Spectre v4 | Firmware workaround | ARCH_WORKAROUND_2 where supported |
RISC-V mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | FENCE instructions at bounds checks | Manual insertion in assembly; compiler support evolving |
| Spectre v2 | Vendor-specific | SiFive: FENCE.I after indirect branches. Other vendors: per-implementation |
| Meltdown | KPTI (OoO cores only) | In-order RISC-V cores are not affected; OoO cores (e.g., SiFive P670) may need KPTI |
| Spectre v4 | Vendor-specific | No standard RISC-V mitigation; per-vendor microarchitecture |
RISC-V status: Speculation mitigations on RISC-V are less mature than x86 or ARM.
The RISC-V CFI extensions Zicfiss (shadow stacks) and Zicfilp (landing pads) are ratified
as standalone extensions (not part of the base privileged specification, but separate
ratified ISA extensions). UmkaOS implements both when the hardware
reports support via the Zicfiss and Zicfilp ISA string entries. UmkaOS also applies
vendor-specific workarounds based on mvendorid/marchid from the device tree, similar
to the x86 errata database approach.
PowerPC mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | ori 31,31,0 (speculation barrier) | Inserted at bounds checks in assembly |
| Spectre v2 | Count Cache Flush + link stack flush | POWER8/9: bcctr flush sequence; POWER10: hardware mitigation |
| Meltdown | RFI flush (L1D cache flush) | POWER7+: flush on return from interrupt via rfid/hrfid |
| Spectre v4 | STF (Store Thread Forwarding) barrier | ori 31,31,0 barrier; POWER9+ firmware toggle |
PowerPC status: IBM POWER processors have well-documented mitigations managed via firmware (skiboot/OPAL) and kernel runtime patches. POWER10 includes hardware mitigations for most Spectre variants.
Additional POWER9-specific requirements:
- Entry flush (CVE-2020-4788): L1D cache flush on kernel entry from userspace,
in addition to the RFI exit flush.
- Uaccess flush: L1D flush around copy_from_user/copy_to_user on POWER9.
- STF barrier per generation: POWER9 uses eieio; POWER8 uses the hwsync + ori 31,31,0
sequence; POWER7 uses a displacement flush. The barrier type must be selected per PVR.
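The per-generation STF barrier selection reduces to a PVR dispatch. A sketch, treating the PVR version-field values (POWER7 0x003F, POWER8 0x004B/0x004C/0x004D, POWER9 0x004E) and the enum names as assumptions drawn from public IBM documentation:

```rust
/// STF barrier flavor selected at boot from the PVR version field.
#[derive(Debug, PartialEq)]
enum StfBarrier {
    Eieio,                     // POWER9
    SyncOriSequence,           // POWER8: hwsync; ori 31,31,0
    FallbackDisplacementFlush, // POWER7
    None,                      // unknown / mitigated in hardware
}

/// Map the upper 16 bits of the PVR (the version field) to a barrier type.
fn stf_barrier_for_pvr(pvr_version: u16) -> StfBarrier {
    match pvr_version {
        0x004E => StfBarrier::Eieio,
        0x004B | 0x004C | 0x004D => StfBarrier::SyncOriSequence,
        0x003F => StfBarrier::FallbackDisplacementFlush,
        _ => StfBarrier::None,
    }
}
```

POWER10 and later fall through to `None` here because the hardware mitigation makes a software STF barrier unnecessary.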
PPC32 embedded cores (e500, 440) are generally in-order and not affected by speculative
execution vulnerabilities. However, e500 has:
- Spectre v1: isync; sync as speculation barrier (not ori 31,31,0 — firmware
does not enable it on Book E).
- Spectre v2: BTB flush via BUCSR[BBFI] on context switch (also fixes the
phantom branch erratum A-004466).
UmkaOS applies mitigations based on PVR (Processor Version Register) from the device tree.
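A minimal sketch of per-PVR barrier selection, using PVR family values that match Linux's PVR_POWER* constants; the values and names here are illustrative, not the UmkaOS implementation:

```rust
/// STF (store forwarding) barrier flavor, selected per processor generation
/// as described in the list above.
#[derive(Debug, PartialEq)]
enum StfBarrier {
    Eieio,             // POWER9
    SyncOriOri,        // POWER8: hwsync; ori 31,31,0
    DisplacementFlush, // POWER7
    None,              // unknown or unaffected core
}

/// The PVR's top halfword identifies the core family. The constants below
/// follow Linux's PVR_POWER* values but are shown here for illustration.
fn stf_barrier_for(pvr: u32) -> StfBarrier {
    match pvr >> 16 {
        0x004E => StfBarrier::Eieio,                        // POWER9
        0x004B | 0x004C | 0x004D => StfBarrier::SyncOriOri, // POWER8 variants
        0x003F => StfBarrier::DisplacementFlush,            // POWER7
        _ => StfBarrier::None,
    }
}

fn main() {
    assert_eq!(stf_barrier_for(0x004E_1202), StfBarrier::Eieio);
    assert_eq!(stf_barrier_for(0x004D_0100), StfBarrier::SyncOriOri);
    println!("barrier selection ok");
}
```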
s390x mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | LFENCE equivalent (bcr 14,0 serialization) | Inserted at bounds checks in assembly |
| Spectre v2 | Expolines (z12-z13) / eToken (z14+) | z12-z13: all indirect branches compiled through expoline thunks (exrl + execute-relative-long to trampoline). z14+ with facility bit 82 (eToken): expolines eliminated, hardware mitigation active. |
| Meltdown | Not applicable | z/Architecture does not have Meltdown — user/kernel in separate address spaces (Primary/Home ASCE) with hardware-enforced DAT separation |
| Spectre v4 | Not applicable | z/Architecture store forwarding is architecturally ordered |
| Side channel (PxSe) | PSW-swap + address space isolation | z/Architecture's PSW-swap interrupt model inherently separates contexts. MVCOS instruction for user↔kernel copies avoids shared TLB entries. |
s390x architectural constraints (not errata, but require dedicated code paths):
| Constraint | Impact |
|---|---|
| No MMIO | All I/O via channel programs (CCW) or PCI special instructions (PCILG/PCISTG/PCISTB). The entire driver I/O path differs from all other architectures. |
| PSW-swap interrupts | No interrupt vector table. Interrupt entry saves old PSW to fixed low-memory address, loads new PSW from adjacent address. Each interrupt class (External, I/O, Machine Check, Program, SVC, Restart) has its own PSW pair. |
| Floating I/O interrupts | I/O interrupts are not routed to a specific CPU. Any CPU with the appropriate ISC (Interrupt Sub-Class) enabled in its CR6 may receive the interrupt. IRQ affinity is managed via ISC masking, not IOAPIC/GIC-style routing. |
| SIGP for IPI | Inter-processor signaling uses the SIGP instruction (not a memory-mapped interrupt controller). SIGP orders: SENSE, EXTERNAL_CALL, EMERGENCY_SIGNAL, STOP, RESTART, SET_PREFIX, STORE_STATUS. |
| Separate address spaces | User code runs in Primary Address Space, kernel in Home Address Space. MVCOS (Move with Optional Specifications) copies between address spaces. SAC (Set Address Space Control) switches. PT (Program Transfer) for fast syscall return. |
| I-cache not snooped | Stores to code do not automatically invalidate corresponding I-cache entries. After any code patching (BPF JIT, static keys, alternatives), a serializing instruction (bcr 14,0) must be executed on ALL CPUs via SIGP EXTERNAL_CALL + handler. |
| TOD clock epoch 1900 | 64-bit TOD clock (bit 51 = 1 µs) overflows circa 2043. UmkaOS timestamps internally use extended TOD (128-bit STCKE) or convert to Unix epoch immediately. |
| SVC dual encoding | SVC 0-255: immediate in instruction. SVC 256+: SVC 0 with service code in R1. The syscall dispatch path must handle both encodings. |
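The epoch conversion in the TOD row can be sketched directly, assuming the usual convention that microseconds since 1900 equal tod >> 12 (bit 51 = 1 µs); this is arithmetic only, not the kernel's 128-bit STCKE path:

```rust
/// Microseconds between the TOD epoch (1900-01-01) and the Unix epoch
/// (1970-01-01): 70 years including 17 leap days = 2_208_988_800 seconds.
const TOD_TO_UNIX_US: i64 = 2_208_988_800 * 1_000_000;

/// Convert a 64-bit s390x TOD value (bit 51 = 1 µs, so µs = tod >> 12)
/// to microseconds since the Unix epoch. Sketch only; the kernel converts
/// immediately on read, as the section above requires.
fn tod_to_unix_us(tod: u64) -> i64 {
    ((tod >> 12) as i64) - TOD_TO_UNIX_US
}

fn main() {
    // TOD value representing exactly the Unix epoch:
    let epoch_tod = (2_208_988_800u64 * 1_000_000) << 12;
    assert_eq!(tod_to_unix_us(epoch_tod), 0);
    // One second later:
    assert_eq!(tod_to_unix_us(epoch_tod + (1_000_000u64 << 12)), 1_000_000);
    println!("tod conversion ok");
}
```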
LoongArch64 mitigations:
| Vulnerability | Mitigation | UmkaOS scope |
|---|---|---|
| Spectre v1 | DBAR (Data Barrier) after bounds checks | DBAR 0 instruction at speculative bounds check sites |
| Spectre v2 | Not applicable (in-order 3A5000) / Vendor-specific (OoO 3A6000) | 3A5000: in-order pipeline, not vulnerable. 3A6000: OoO, apply vendor-recommended indirect branch restrictions if/when published. |
| Meltdown | Not applicable | LoongArch hardware-enforced privilege levels prevent cross-level data leakage |
| Spectre v4 | Not applicable | No published store-forwarding bypass on LoongArch |
LoongArch64 architectural constraints:
| Constraint | Impact |
|---|---|
| Software TLB (3A5000) | 3A5000 uses software TLB refill via TLBFILL exception handler. 3A6000 adds optional hardware page table walker (PTW). UmkaOS implements the software TLB refill handler and probes for PTW at boot via CPUCFG. |
| I-cache incoherent | Like s390x, stores to instruction memory do not invalidate I-cache. IBAR 0 (Instruction Barrier) must be executed after any code patching. On remote CPUs, an IPI + IBAR is required. |
| No broadcast INVTLB | INVTLB only invalidates the local CPU's TLB. TLB shootdown requires explicit IPI to each target CPU followed by INVTLB on each. This differs from x86 (which has INVLPG but typically uses IPI anyway) and AArch64 (which has broadcast TLBI). |
| EIOINTC interrupt controller | Extended I/O Interrupt Controller — not compatible with APIC, GIC, PLIC, or any other standard interrupt controller. Requires a dedicated IRQ domain implementation. Supports up to 256 interrupt vectors with 1:1 CPU routing. |
| No fast isolation (Tier 1) | LoongArch has no MPK/POE/DACR equivalent. Tier 1 is unavailable; drivers choose Tier 0 (in-kernel, fully trusted) or Tier 2 (user-mode + IOMMU) based on licensing and admin policy. |
| CSR-based system registers | All system configuration via CSR (Control and Status Register) instructions (CSRRD/CSRWR/CSRXCHG), not memory-mapped registers. CSR space is 14-bit indexed (0x0-0x3FFF). |
Runtime reporting (all architectures) — The Linux-compatible sysfs interface
(/sys/devices/system/cpu/vulnerabilities/) is populated on all architectures with
architecture-appropriate mitigation status strings.
2.18.3 Dual-Boot Safety¶
- UmkaOS never modifies the existing Linux kernel installation.
- GRUB is configured with both kernels; the default can be set by the user.
- If UmkaOS fails to boot, the user selects the Linux kernel from GRUB.
- A "last known good" mechanism records successful boots and can auto-revert.
2.18.4 Boot Protocol Migration Path¶
The boot architecture evolves through four phases, each building on the previous:
Phase 1 — Multiboot1 (current). GRUB loads the ELF via multiboot command.
QEMU loads directly with -kernel. Memory map from Multiboot1 info structure.
Sufficient for all kernel development and QEMU-based testing.
Phase 2 — Multiboot2 full parser. Parse Multiboot2 tags to access richer boot
information: ACPI RSDP pointer, EFI memory map, framebuffer info, boot services
tag. This enables ACPI table parsing and EFI runtime services without changing the
bootloader. GRUB2 already supports the multiboot2 command.
Phase 3 — UEFI stub boot. Add a PE/COFF header stub to the kernel image (similar
to Linux EFISTUB). UEFI firmware requires PE/COFF executables, not ELF — the stub
header makes the kernel image a valid PE/COFF binary that UEFI can load directly.
The actual kernel code remains ELF internally; the PE/COFF header is a thin wrapper
(like Linux's header.S which embeds a PE/COFF header in the bzImage). The kernel
becomes directly bootable from UEFI firmware without GRUB — efibootmgr can register
it. Use EFI boot services for memory map and GOP framebuffer, then call
ExitBootServices() before entering the kernel proper. systemd-boot and other
UEFI-native boot managers work at this stage.
Phase 4 — Linux boot protocol. Implement the x86 Linux boot protocol
(struct boot_params at 0x10000). This makes the UmkaOS kernel loadable by any
Linux-compatible bootloader. Combined with a standard /boot layout and initramfs,
this enables the drop-in package installation described in Section 2.17. This is the
final production boot target.
2.19 Secure Boot and Measured Boot¶
Secure Boot and Measured Boot are kernel-level boot-phase concerns. They apply equally to servers (enterprise attestation, confidential computing), cloud instances (vTPM-based instance identity), and consumer devices (UEFI Secure Boot for firmware lockdown). Neither feature is consumer-specific.
2.19.1 UEFI Secure Boot¶
UEFI Secure Boot enforces a chain of trust starting in firmware: the UEFI db (allowed signature database) and dbx (revocation list) are stored in firmware NVRAM. Every executable in the boot path (bootloader, shim, kernel) must be signed by a key in the db.
Deployment models:
| Model | Chain | When used |
|---|---|---|
| Shim + GRUB | Microsoft UEFI CA → shim (signed by MS) → GRUB (signed by distro) → kernel (signed by distro) | Default for distros shipping via OEM |
| UEFI direct | Custom key enrolled in db → kernel PE/COFF (signed by UmkaOS key) | Self-managed servers, custom deployments |
| Unsigned (disabled) | No verification | Development hardware, QEMU |
UmkaOS requires Phase 3 (UEFI stub, Section 2.18) before Secure Boot can be supported.
The kernel image must be a valid PE/COFF binary for UEFI to verify its
signature before loading. The build system produces a signed image via
sbsign --key umka-signing.key --cert umka-signing.crt umka-kernel.efi.
Kernel module signing: Once the kernel is Secure Boot-booted, all kernel modules (Tier 1 drivers) must also be signed. Unsigned modules are rejected. The module signing key is separate from the UEFI boot key. The build system embeds the module signing public key in the kernel image; drivers are signed with the corresponding private key during the build.
UEFI Secure Boot state: The kernel reads the UEFI SecureBoot variable
from EFI runtime services at boot and records it in a read-only kernel
parameter. Userspace can query via /sys/firmware/efi/efivars/SecureBoot-*.
This affects policy decisions (e.g., CAP_SYS_MODULE behaviour).
2.19.1.1 Key Compromise Recovery¶
If the UmkaOS signing key is compromised, three coordinated actions are required: updating the UEFI revocation list (dbx), rotating the signing key, and migrating TPM-sealed secrets to a new PCR state. This subsection specifies each step precisely.
dbx Update Path
The UEFI Signature Database Forbidden (dbx) is stored in EFI NVRAM and contains
hashes or certificate thumbprints of revoked images and keys. Updates are delivered
as signed UEFI authenticated variables:
- Delivery mechanism: a signed UEFI capsule image (EFI_FIRMWARE_IMAGE_PROTOCOL, GUID 6dcbd5ed-e82d-4c44-bda1-7194199ad92a) deposited either via the EFI_UPDATE_CAPSULE runtime service or as a file at /EFI/UpdateCapsule/<GUID>.bin on the EFI System Partition. The firmware processes the capsule before ExitBootServices() on the next boot.
- Authentication: the capsule is authenticated by the firmware using the Platform Key (PK) or Key Exchange Key (KEK) chain already enrolled in NVRAM. A dbx capsule signed by the KEK — and delivered through a signed distro update package — requires no additional user interaction.
- Early kernel verification: after the firmware has applied the new dbx and before entering the UmkaOS boot stub, UEFI re-verifies every image in the boot chain against the updated dbx. If the running kernel image's hash or signing certificate is now in the dbx, UEFI aborts the boot and presents an error to the user. The kernel never reaches umka_main() in this case — the revocation check fires before ExitBootServices().
- EFI event log: the firmware records the dbx update in the TCG EFI Platform Specification event log (EV_EFI_VARIABLE_AUTHORITY entry, PCR 7). The kernel reads this log during early initialization and forwards the entry to the IMA audit log, creating a durable, ordered record that dbx was updated.
Key Rotation Protocol
The UmkaOS signing key is an ML-DSA-65 + Ed25519 hybrid key pair. Key rotation proceeds through five steps:
1. Generate the new key pair. Create a new ML-DSA-65 + Ed25519 hybrid signing key pair in a hardware security module (HSM). The HSM never exports the private key material.
2. Enroll the new public key in db. Submit the new public key certificate to the UEFI db (allowed signature database) via a KEK-authenticated variable update — the same delivery mechanism as the dbx capsule described above. The update is deployed through the normal distro package management pipeline (e.g., as a fwupd plugin or a signed distro package writing to /EFI/UpdateCapsule/). After the update applies, both the old key and the new key are accepted by UEFI.
3. Dual-signing period — minimum 30 days. Every kernel release during this period is signed with BOTH the old key and the new key. A dual-signed image satisfies any UEFI db that contains either key. The 30-day window gives sufficient time for the db enrollment to propagate to all deployed systems via normal OS update channels. This covers:
   - Existing systems that have not yet received the new db enrollment.
   - Systems that received the enrollment but whose db update failed to apply (e.g., NVRAM full, firmware bug).
4. Revoke the old key. After the dual-signing period ends, add the old signing certificate's SHA-256 hash to dbx via a KEK-signed capsule update. From this point, images signed only with the old key are rejected. Dual-signed images (carrying the new key's signature as well) continue to boot.
5. Out-of-band recovery media. Prepare a USB recovery drive containing:
   - The new public key certificate in DER format.
   - A signed db update capsule that adds the new key.
   - Instructions for manual enrollment via the UEFI setup utility.
   This drive is used on systems that missed the automatic enrollment (e.g., systems offline during the update window, air-gapped systems).
PCR Extension for the New Key
Standard UEFI Secure Boot behavior extends PCR 7 with the hash of each certificate used to verify an image (EV_EFI_VARIABLE_AUTHORITY events). When a kernel signed with the new key boots for the first time, UEFI extends PCR 7 with the new signing certificate hash. The PCR 7 value changes, which breaks TPM-sealed secrets (such as disk encryption keys) that were sealed with a policy referencing the old PCR 7 value.
Migration path — applied before the old key is revoked (Step 4 above):
1. Compute the expected new PCR 7 value. Using the extend rule (extend = SHA256(current || measurement), Section 2.19.2), the new value is PCR7_new = SHA256(PCR7_old || SHA256(new_cert)). This can be computed offline from the current PCR 7 value and the new certificate, without rebooting.
2. Re-seal secrets with a dual policy. Unseal each secret under the existing policy (PCR 7 = PCR7_old), then re-seal with a PolicyOR policy that accepts either the old or new PCR 7 value: PolicyOR(PolicyPCR(PCR7_old), PolicyPCR(PCR7_new)). The re-sealed blob can be unsealed on systems booting with either key during the dual-signing period.
3. After key rotation completes. Once the old key is in dbx and all systems boot only with the new key, re-seal secrets one final time with a single policy referencing only PCR7_new. This drops the fallback to the old PCR 7 value, producing a tighter policy for the steady state.
The migration (Steps 2 and 3) is performed by a userspace tool
(umka-tpm-reseal) that runs as a systemd oneshot service during the transition
window. The service is activated by detecting a new db entry for the UmkaOS
signing key in the EFI event log.
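As a toy model of the dual-policy window: the sketch below uses std's DefaultHasher as a stand-in for SHA-256 and a two-entry accept list as a stand-in for a TPM2 PolicyOR session. None of this is the real TPM code path; it only demonstrates the extend chaining and the either-value unseal window:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Stand-in for SHA-256. NOT cryptographic; illustration only.
fn h(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    hasher.finish()
}

// PCR extend: new = H(current || measurement).
fn extend(pcr: u64, measurement: u64) -> u64 {
    let mut buf = Vec::with_capacity(16);
    buf.extend_from_slice(&pcr.to_be_bytes());
    buf.extend_from_slice(&measurement.to_be_bytes());
    h(&buf)
}

/// Models PolicyOR(PolicyPCR(old), PolicyPCR(new)): unsealing succeeds
/// if the current PCR 7 matches either accepted value.
struct DualPolicy {
    accepted: [u64; 2],
}

impl DualPolicy {
    fn unseal_allowed(&self, current_pcr7: u64) -> bool {
        self.accepted.contains(&current_pcr7)
    }
}

fn main() {
    // PCR 7 after measuring the old vs. new signing certificate.
    let pcr7_old = extend(0, h(b"old-signing-cert"));
    let pcr7_new = extend(0, h(b"new-signing-cert")); // computed offline
    let policy = DualPolicy { accepted: [pcr7_old, pcr7_new] };
    assert!(policy.unseal_allowed(pcr7_old));
    assert!(policy.unseal_allowed(pcr7_new));
    assert!(!policy.unseal_allowed(extend(0, h(b"tampered-cert"))));
    println!("dual-policy window ok");
}
```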
2.19.2 Measured Boot (TPM PCR Chain)¶
Measured Boot extends a TPM Platform Configuration Register (PCR) with a cryptographic hash at each step of the boot chain. PCRs are append-only (extend = SHA256(current PCR value || new measurement)); they cannot be reset without rebooting. A remote attestation verifier can reconstruct the expected PCR values from the known firmware/bootloader/kernel and check that the running system matches.
Standard x86 PCR assignment (UEFI + Linux convention, which UmkaOS follows):
| PCR | What is measured |
|---|---|
| 0 | UEFI firmware code and configuration |
| 1 | UEFI firmware data (platform config) |
| 2 | Option ROM code |
| 3 | Option ROM data |
| 4 | Boot manager code (GRUB/shim) + kernel image on UEFI (firmware measures PE/COFF as EV_EFI_BOOT_SERVICES_APPLICATION) |
| 5 | Boot manager data + GPT partition table |
| 6 | Resume from hibernate |
| 7 | Secure Boot policy (db, dbx, PK, KEK state) |
| 8 | GRUB command line |
| 9 | Kernel files: initrd + load options (UEFI EFI stub), or all loaded files including kernel (GRUB tpm module on Multiboot paths) |
| 10 | initramfs |
| 11 | Kernel command line (systemd-stub also extends PCR 11) |
| 12 | UmkaOS: Tier 1 driver measurements (Section 9.4) + IMA policy keys |
| 13–15 | Available for OS/application use |
The kernel does NOT self-measure (self-measurement is architecturally unsound
since a compromised kernel would produce the expected hash). On UEFI paths,
firmware measures the kernel image into PCR 4 as an EV_EFI_BOOT_SERVICES_APPLICATION
event when loading the PE/COFF image; the kernel's EFI stub then measures
initrd and load options into PCR 9. On Multiboot paths, GRUB's tpm module
measures all loaded files (including the kernel) into PCR 9. PCR 10 is extended
with the initramfs hash. PCR 11 is extended with the kernel command line.
TPM interface: The kernel accesses the TPM via the TPM CRB (Command
Response Buffer, TPM 2.0 mandatory interface) or TPM TIS (legacy 1.2 / 2.0
FIFO interface). The driver is Tier 1, ACPI-probed (MSFT0101 or MSFT0200).
// umka-core/src/tpm/mod.rs
/// TPM 2.0 PCR Extend command.
/// Extends the given PCR with SHA-256(current || digest).
pub fn pcr_extend(pcr_index: u32, digest: &[u8; 32]) -> Result<(), TpmError>;
/// Read back the current value of a PCR.
pub fn pcr_read(pcr_index: u32) -> Result<[u8; 32], TpmError>;
/// Seal a secret to the current PCR state.
/// Returns a TPM2B_PUBLIC + TPM2B_PRIVATE blob.
/// The secret can only be unsealed if PCRs match the policy at seal time.
pub fn seal(pcr_policy: &PcrPolicy, secret: &[u8]) -> Result<SealedBlob, TpmError>;
/// Unseal a blob previously created by seal().
/// Fails if any PCR in the policy has changed since sealing.
/// # Security
/// The returned buffer is wrapped in `Zeroizing` which zeroes on drop,
/// preventing key material from lingering in memory. The 256-byte limit
/// is per TPM 2.0 Part 2, Section 12.3 (`TPM2B_SENSITIVE_DATA`,
/// max size = `MAX_SYM_DATA` = 256 bytes).
pub fn unseal(blob: &SealedBlob) -> Result<Zeroizing<ArrayVec<u8, 256>>, TpmError>;
Disk encryption integration: seal() is the mechanism for TPM-bound disk
encryption keys (equivalent to Linux's tpm2-totp / systemd-cryptenroll).
The disk encryption key is sealed to a PCR policy covering PCRs 0, 4, 7, 9,
11 (firmware + Secure Boot policy + kernel + cmdline). Any modification to the
boot chain (new kernel, changed cmdline, disabled Secure Boot) causes unseal
to fail, prompting for a recovery passphrase.
Confidential computing intersection: On confidential VM platforms (AMD SEV-SNP, Intel TDX, ARM CCA), the TPM is replaced by a virtual TPM whose root of trust is the hardware attestation report (VCEK certificate, TD quote, Realm Attestation Token). The PCR-based measured boot model is the same; the trust root is the hardware VM isolation guarantee rather than a physical TPM chip. Section 5.1 covers the distributed/confidential computing architecture.
2.19.3 Kernel Responsibilities Summary¶
| Responsibility | In kernel? | Notes |
|---|---|---|
| Kernel image signing | Build-time | sbsign in build system |
| Module signing verification | Yes | Enforced when Secure Boot active |
| PCR extension (kernel + cmdline) | Yes | Early boot, before driver init |
| TPM driver (CRB/TIS) | Yes | Tier 1, ACPI-probed |
| seal() / unseal() API | Yes | Exposed to userspace via ioctl |
| Key management policy | No | Userspace (systemd-cryptenroll, clevis) |
| Remote attestation protocol | No | Userspace (keylime, MAA agent) |
| Boot graphics, splash screen | No | Bootloader/compositor |
| Dual-boot chainloading | No | Bootloader (GRUB) |
2.20 UEFI Runtime Services¶
After ExitBootServices(), the UEFI Boot Services memory map is invalidated and
all Boot Services (memory allocation, protocol interfaces, etc.) are gone. However,
a distinct set of UEFI Runtime Services remains accessible for the life of the
running OS. These services operate on a virtual address mapping that the kernel
establishes during boot via SetVirtualAddressMap().
2.20.1 Virtual Address Mapping¶
Before calling ExitBootServices(), the kernel enumerates the UEFI memory map and
identifies all regions with the EFI_MEMORY_RUNTIME attribute. These regions are
mapped into a dedicated kernel virtual address range (EFI_RUNTIME_VA_BASE,
architecture-specific) using normal kernel page table entries. The mapping must
preserve the relative offsets between firmware-runtime regions exactly as the
firmware expects.
The kernel then calls SetVirtualAddressMap(map_size, descriptor_size,
descriptor_version, virtual_map) once, passing the updated descriptors with the
new virtual base addresses. After this call returns, all UEFI runtime service
pointers stored in the EFI System Table are updated to use the new virtual
addresses. The physical EFI System Table address is preserved separately so the
kernel can locate it after the mapping call.
/// Handle to UEFI runtime services, valid after ExitBootServices().
pub struct EfiRuntime {
/// Physical address of EFI System Table, preserved across ExitBootServices().
pub system_table_pa: PhysAddr,
/// Virtual address of EFI Runtime Services table, after SetVirtualAddressMap().
pub runtime_services: *const EfiRuntimeServices,
/// Whether runtime services are available (false if firmware is broken or
/// SetVirtualAddressMap() failed).
pub available: bool,
/// Serializes all EFI runtime calls. UEFI firmware is not reentrant.
/// Lock level: 270 (EFI_RUNTIME_LOCK) — leaf lock, never nested. EFI
/// runtime calls are cold-path only (variable reads, time set, reboot).
/// No other lock is acquired while this lock is held. IRQs disabled.
/// Level 270 is above VTABLE_LOCK (260) and avoids the level 250
/// collision with DEV_REG_LOCK.
pub lock: Lock<(), 270>,
}
All accesses to EfiRuntime hold EfiRuntime::lock and execute with interrupts
disabled. UEFI firmware is documented as non-reentrant; concurrent calls from
different CPUs or from an IRQ handler preempting a runtime call both produce
undefined behavior.
2.20.2 NVRAM (EFI Variables)¶
EFI variables are named byte arrays stored in firmware NVRAM. They persist across reboots and are accessed by name (UTF-16 string) and vendor GUID. Variables have attribute flags controlling persistence and visibility:
- EFI_VARIABLE_NON_VOLATILE (bit 0): persists across power cycles.
- EFI_VARIABLE_BOOTSERVICE_ACCESS (bit 1): accessible during Boot Services.
- EFI_VARIABLE_RUNTIME_ACCESS (bit 2): accessible after ExitBootServices().
UmkaOS wraps the UEFI variable services with interrupt-disabled, locked calls:
/// Read a UEFI variable by name and GUID.
///
/// Returns variable data bounded by firmware's MaxVariableStorageSize
/// (queried via QueryVariableInfo()), typically 32-64 KB. Cold-path function:
/// called only during boot or admin operations.
///
/// # Safety Note
/// This function MUST NOT be called from error recovery paths (OOM, MCE,
/// panic handlers) because it heap-allocates (Vec<u8>). The SecureBoot
/// state is cached at boot in a kernel-internal flag; error recovery code
/// reads the cached value, not the EFI variable.
///
/// Returns the variable data on success, or an `EfiStatus` error code.
/// Common errors: `EFI_NOT_FOUND` (variable absent), `EFI_BUFFER_TOO_SMALL`
/// (internal — handled by the wrapper via a two-pass size query).
pub fn efi_get_variable(
name: &UcsStr,
guid: &EfiGuid,
) -> Result<(Vec<u8>, u32 /* attributes */), EfiStatus>;
/// Write or delete a UEFI variable.
///
/// Pass `data = &[]` with `attrs = 0` to delete an existing variable.
/// Authenticated variables (e.g., db, dbx) require a signed payload structure
/// in `data`; the firmware validates the signature before writing.
pub fn efi_set_variable(
name: &UcsStr,
guid: &EfiGuid,
attrs: u32,
data: &[u8],
) -> Result<(), EfiStatus>;
Uses by the kernel:
- Reading the SecureBoot variable (GUID {8be4df61-...}) to determine whether Secure Boot is active (see Section 2.19).
- Reading and writing the BootOrder and Boot#### variables to manage UEFI boot entries (used by umka-efibootmgr, a userspace tool that delegates to the kernel via an ioctl).
- Delivering db/dbx updates as authenticated variable writes during the key compromise recovery process (see Section 2.19).
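The delete convention in efi_set_variable (empty data with zero attributes) can be modeled with an in-memory store. This behavioral sketch ignores NVRAM, authenticated-variable signature checking, and attribute enforcement:

```rust
use std::collections::HashMap;

/// In-memory model of the EFI variable store, keyed by (name, vendor GUID).
/// Behavioral sketch only: real variables live in firmware NVRAM.
struct VarStore {
    vars: HashMap<(String, String), (Vec<u8>, u32)>, // (data, attributes)
}

impl VarStore {
    fn new() -> Self {
        Self { vars: HashMap::new() }
    }

    /// Mirrors efi_set_variable: empty data with attrs == 0 deletes.
    fn set(&mut self, name: &str, guid: &str, attrs: u32, data: &[u8]) {
        let key = (name.to_string(), guid.to_string());
        if data.is_empty() && attrs == 0 {
            self.vars.remove(&key);
        } else {
            self.vars.insert(key, (data.to_vec(), attrs));
        }
    }

    /// Mirrors efi_get_variable: None models EFI_NOT_FOUND.
    fn get(&self, name: &str, guid: &str) -> Option<&(Vec<u8>, u32)> {
        self.vars.get(&(name.to_string(), guid.to_string()))
    }
}

fn main() {
    let mut store = VarStore::new();
    store.set("SecureBoot", "8be4df61", 0x6, &[1]);
    assert_eq!(store.get("SecureBoot", "8be4df61").unwrap().0, vec![1]);
    store.set("SecureBoot", "8be4df61", 0, &[]); // delete
    assert!(store.get("SecureBoot", "8be4df61").is_none());
    println!("variable store model ok");
}
```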
NVRAM wear: EFI NVRAM has limited write endurance (typically 100,000 to 1,000,000 cycles depending on the flash technology). The kernel must not write EFI variables at high frequency. Policy variables, boot configuration, and security databases are the intended use; per-boot or per-minute writes are acceptable; per-second writes are not.
2.20.3 Time Services¶
UEFI provides GetTime(time, capabilities) and SetTime(time) for wall-clock time
access, and GetWakeupTime/SetWakeupTime for ACPI alarm-based resume.
UmkaOS uses EFI time services exactly once: during early boot (between Phase 2 and Phase 3 of the x86-64 initialization sequence) to read the hardware RTC and initialize the kernel wall clock. All subsequent timekeeping uses hardware-direct paths:
- x86-64: HPET, TSC, LAPIC timer via direct MMIO and MSR reads.
- AArch64: ARM Generic Timer (CNTPCT_EL0, CNTFRQ_EL0) via system registers.
- RISC-V: rdtime pseudo-instruction, frequency from Device Tree.
- PPC32/PPC64LE: Timebase register (mftb) and decrementer SPR.
This avoids the serialization cost of EfiRuntime::lock on the timekeeping hot
path. EFI SetTime() is called when the user updates the wall clock (e.g., via
adjtimex(2) or settimeofday(2)) to propagate the change back to the hardware
RTC.
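The one-shot RTC read reduces to a calendar-to-Unix-seconds conversion. The sketch below uses the standard days-from-civil algorithm; the parameter order loosely follows EFI_TIME fields and is illustrative:

```rust
/// Days since 1970-01-01 for a Gregorian calendar date
/// (Howard Hinnant's days_from_civil algorithm).
fn days_from_civil(y: i64, m: u32, d: u32) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400; // year of era, [0, 399]
    let mp = (if m > 2 { m - 3 } else { m + 9 }) as i64; // March-based month
    let doy = (153 * mp + 2) / 5 + d as i64 - 1; // day of year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146097 + doe - 719468
}

/// Convert a broken-down UTC time, as read once from EFI GetTime(),
/// to seconds since the Unix epoch.
fn rtc_to_unix(y: i64, mo: u32, d: u32, h: u64, mi: u64, s: u64) -> i64 {
    days_from_civil(y, mo, d) * 86_400 + (h * 3600 + mi * 60 + s) as i64
}

fn main() {
    assert_eq!(rtc_to_unix(1970, 1, 1, 0, 0, 0), 0);
    assert_eq!(rtc_to_unix(2000, 1, 1, 0, 0, 0), 946_684_800);
    println!("rtc conversion ok");
}
```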
2.20.4 Reset and Shutdown¶
ResetSystem(type, status, data_size, data) is the UEFI-standard mechanism for
system reset and shutdown. The type field is one of:
| Type | Value | Semantics |
|---|---|---|
| EfiResetCold | 0 | Full hardware reset, re-runs POST. |
| EfiResetWarm | 1 | Warm reset without POST (where supported by platform). |
| EfiResetShutdown | 2 | Power off via ACPI S5 state. |
| EfiResetPlatformSpecific | 3 | Vendor-defined reset type identified by GUID in data. |
UmkaOS maps Linux reboot syscall commands to EFI reset types as follows:
| reboot(2) command | UEFI call |
|---|---|
| LINUX_REBOOT_CMD_RESTART | EfiResetCold |
| LINUX_REBOOT_CMD_POWER_OFF | EfiResetShutdown |
| LINUX_REBOOT_CMD_HALT | EfiResetShutdown (processor halt before calling) |
| LINUX_REBOOT_CMD_RESTART2 (with command string) | EfiResetPlatformSpecific with distro-specific GUID |
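The mapping above can be written as a single match. The LINUX_REBOOT_CMD_* magic values are the standard ones from <linux/reboot.h>; the enum and function are illustrative:

```rust
/// EFI reset types accepted by ResetSystem().
#[derive(Debug, PartialEq)]
enum EfiResetType {
    Cold,
    Shutdown,
    PlatformSpecific,
}

// Linux reboot(2) command magic values, as defined in <linux/reboot.h>.
const LINUX_REBOOT_CMD_RESTART: u32 = 0x0123_4567;
const LINUX_REBOOT_CMD_HALT: u32 = 0xCDEF_0123;
const LINUX_REBOOT_CMD_POWER_OFF: u32 = 0x4321_FEDC;
const LINUX_REBOOT_CMD_RESTART2: u32 = 0xA1B2_C3D4;

/// Map a reboot(2) command to an EFI reset type per the table above.
/// RESTART2 carries a command string forwarded via EfiResetPlatformSpecific.
fn efi_reset_for(cmd: u32) -> Option<EfiResetType> {
    match cmd {
        LINUX_REBOOT_CMD_RESTART => Some(EfiResetType::Cold),
        // HALT halts the processor before calling, but maps to Shutdown.
        LINUX_REBOOT_CMD_POWER_OFF | LINUX_REBOOT_CMD_HALT => Some(EfiResetType::Shutdown),
        LINUX_REBOOT_CMD_RESTART2 => Some(EfiResetType::PlatformSpecific),
        _ => None,
    }
}

fn main() {
    assert_eq!(efi_reset_for(LINUX_REBOOT_CMD_RESTART), Some(EfiResetType::Cold));
    assert_eq!(efi_reset_for(LINUX_REBOOT_CMD_HALT), Some(EfiResetType::Shutdown));
    assert_eq!(efi_reset_for(0), None);
    println!("reboot mapping ok");
}
```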
Fallback path when EFI runtime is unavailable. If EfiRuntime::available is
false (non-UEFI boot, firmware bug, or SetVirtualAddressMap() failure), UmkaOS
falls back to ACPI-direct paths:
- Shutdown: write the ACPI sleep type for S5 (from the \_S5 object in the DSDT) to the PM1a Control Register (PM1a_CNT), setting the SLP_EN bit.
- Reset: write 0x06 to I/O port 0xCF9 (the Reset Control Register, widely supported on x86 platforms), or use ACPI_RESET_REG if defined in the FADT.
2.21 Kernel Image Structure and Loading Model¶
UmkaOS organizes its executable code into four loading tiers with strict boundaries. The design goal is a minimal verified nucleus (Nucleus) that can load, verify, and replace everything else — including the bulk of the kernel itself — without reboot over a 50-year operational lifetime.
2.21.1 Nucleus: Verified Nucleus (Non-Replaceable)¶
Nucleus is the only code that cannot be live-replaced. It is the bootstrap trust root: if Nucleus has a bug, the only fix is a reboot. Therefore Nucleus is (a) minimized to the absolute essential, (b) formally verified (Section 24.4), and (c) frozen after Phase 2 exit.
Nucleus is statically linked into the kernel ELF image. It is always resident in memory. It executes entirely in Ring 0, Tier 0 — no isolation boundary applies to Nucleus code.
Nucleus contains exactly these components:
| Component | Purpose | Approx Size | Verified? |
|---|---|---|---|
| Boot entry (per-arch asm) | Hardware trap from firmware/bootloader → Rust entry. GDT/IDT stubs, stack setup, identity-mapped page tables. | ~2 KB per arch | Yes |
| Physical memory data | PageArray (vmemmap), BuddyFreeList, PcpPagePool — the physical memory map and free-page bookkeeping. Data structures only, not allocation policy. | ~4 KB code | Yes |
| Page reclaim data | ZoneLru, CgroupLru, LruGeneration, ShadowEntry, LruDrainBuffer — the LRU generation lists, per-cgroup LRU state, shadow entries for refault distance, and per-CPU drain buffers. Struct layout and manipulation code are non-replaceable; values change continuously every reclaim cycle. See Section 4.4. | ~3 KB code | Yes |
| Page table hardware ops | arch::current::mm::install_pte(), read_pte(), PTE encoding/decoding. ~10 arch-specific instructions per architecture. | ~1 KB per arch | Yes |
| Capability system (data) | CapTable (XArray of CapEntry), cap_lookup(), cap_generation_valid(), cap_has_rights(), CapOperationGuard atomic ops. Data structures only, not policy. | ~2-3 KB | Yes |
| KABI dispatch trampoline | vtable[method_id](args) — indirect call through versioned vtable. | ~0.5 KB | Yes |
| Syscall entry (Layer 1) | Per-arch trap handler: save registers, sign-extend nr, call Layer 2 via atomic fn pointer. | ~1 KB per arch | Yes |
| Evolution primitive | Stop-the-world IPI, page remap/copy, vtable pointer atomic swap, CPU release. The irreducible mechanism that installs a pre-verified, pre-loaded image. No ELF parsing, no signature verification, no symbol resolution. | ~2-3 KB | Yes |
| DomainRingBuffer header | Ring buffer data layout (producer/consumer cache lines, head/tail/published atomics). The wire protocol between isolation domains. | ~2 KB | Yes |
| AlgoDispatch data | AlgoDispatch<F> struct (function pointer cell) and get() method. The data structure that holds the selected function pointer. ~20 lines of code. | ~0.5 KB | Yes |
| CpuFeatureTable data | CpuFeatureSet struct layout, CpuFeatureTable storage, CpuFeatureSet per-CPU array. The frozen data that all subsystems read. | ~1 KB | Yes |
| alt_patch_apply() primitive | Writes bytes to a code address and flushes I-cache. ~10 arch-specific instructions. The tool that applies patches; not the logic that decides what to patch. | ~0.5 KB per arch | Yes |
| Early serial (per-arch) | Minimal UART/COM1 output for pre-console diagnostics. | ~0.5 KB per arch | Yes (trivial) |
Total Nucleus: ~25-35 KB across all architectures (comparable to seL4's verified kernel at ~10K SLOC, but providing more services). This is the code that must be correct — formal verification is the sole defence.
Nucleus does NOT contain (these are in Evolvable or loadable modules):
- Evolution orchestration (ELF loader, ML-DSA-65 verification, Phase A/A'/B/C
sequencing, PendingOpsPerCpu management, symbol resolution, state export/import
coordination, DATA_FORMAT_EPOCH) — replaceable Evolvable component (~12-13 KB)
- Scheduler (EEVDF, RT, DL) — replaceable policy
- Memory allocation policy (PhysAllocPolicy) — replaceable
- VMM policy (VmmPolicy) — replaceable
- Capability policy (CapPolicy) — replaceable (same data/policy split pattern)
- Slab allocator — replaceable
- VFS, networking, block layer — KABI services
- Interrupt controller drivers (APIC, GIC, PLIC) — Tier 0 but in Evolvable
- Timer drivers — Tier 0 but in Evolvable
- ACPI/DTB parsers — Evolvable
- Any device driver — Tier 1/2
CPU adaptation is almost entirely in Evolvable (swappable):
The CPU-dependent adaptation system follows the same data/policy split as the memory allocator:
| Component | Where | Why |
|---|---|---|
| CpuFeatureSet struct, CpuFeatureTable storage | Nucleus (data) | Fixed-size data structure. Fields use u64/bitflags with headroom for new bits. |
| AlgoDispatch<F> struct, get() | Nucleus (data) | A function pointer cell. Nucleus code may call through it. |
| alt_patch_apply() | Nucleus (primitive) | Writes bytes to code + flushes I-cache. ~10 instructions per arch. |
| detect_features() per arch | Evolvable (policy) | CPUID/ID-register reading, model/stepping → errata matching. Swappable: new errata discovery requires updating detection tables. |
| Errata matching tables | Evolvable (policy) | Which CPU model has which bug. Must be updatable as new errata are disclosed over 50 years. |
| cpu_features_freeze() | Evolvable (orchestration) | Computes universal intersection, errata union, calls alt_patch_all(). Swappable. |
| alt_patch_all() | Evolvable (orchestration) | Iterates __alt_instructions linker section, evaluates conditions, calls alt_patch_apply(). Swappable. |
| algo_dispatch_init_all() | Evolvable (orchestration) | Iterates AlgoDispatch statics, selects candidates. Swappable. |
| code_alternative! site definitions | Each module | Alternative entries live in the module that contains the code site. Swappable with the module. |
| AlgoDispatch candidates | Each module | Candidate function pointers live in the module that declares them. Swappable. |
Why this split matters for 50-year uptime:
When a new CPU erratum is discovered in year 17:
1. Live-evolve Evolvable with updated detect_features() that sets a new
ErrataCaps bit (the u64 field has ~48 unused bits).
2. During Phase B (stop-the-world), the evolution framework temporarily
remaps the CpuFeatureTable page as writable. The new detection code
re-reads CPU registers and updates the errata union.
3. The new Evolvable's alt_patch_all() re-scans all loaded modules'
__alt_instructions sections and applies patches for the new erratum.
4. CpuFeatureTable page re-frozen as read-only. CPUs released.
When a new CPU instruction becomes available in year 25:
1. Live-evolve the module containing the relevant code_alternative! site
with a new alternative entry for the new instruction.
2. The evolution orchestration calls alt_patch_apply() (Nucleus primitive) to
patch the new module's code during Phase B.
When ErrataCaps exhausts all 64 bits (extremely unlikely — Linux has ~60
errata across 25 years of x86):
1. Use the Data Format Evolution Framework (Pattern 1: Extension Array) —
add ErrataCapsExt parallel to the existing struct.
2. Or use Pattern 2 (Shadow-and-Migrate) to grow CpuFeatureSet's errata
field to [u64; 2].
The only scenario requiring reboot: a bug in Nucleus's alt_patch_apply()
primitive (10 instructions) or AlgoDispatch<F>::get() (5 instructions).
Both are formally verified.
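The "function pointer cell" row in the table above can be made concrete with a minimal single-slot sketch. This is an illustration only: the kernel's real `AlgoDispatch<F>` is generic over the function type, and the checksum candidates below (`sum_generic`, `sum_unrolled`) are invented stand-ins, not kernel code. Here `std` atomics substitute for the kernel's own.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

type ChecksumFn = fn(&[u8]) -> u32;

/// Single-slot sketch of the AlgoDispatch pattern: the cell is plain data
/// (Nucleus side); which candidate is installed is policy (Evolvable side),
/// decided once at boot phase 9 or again during live evolution.
pub struct AlgoDispatch {
    /// Function pointer stored as one atomic machine word.
    slot: AtomicUsize,
}

impl AlgoDispatch {
    pub fn new(initial: ChecksumFn) -> Self {
        AlgoDispatch { slot: AtomicUsize::new(initial as usize) }
    }

    /// Read path: one atomic load plus an indirect call.
    pub fn get(&self) -> ChecksumFn {
        let raw = self.slot.load(Ordering::Acquire);
        // SAFETY: only valid ChecksumFn pointers are ever stored.
        unsafe { std::mem::transmute::<usize, ChecksumFn>(raw) }
    }

    /// Install path: run after CPU feature detection selects a candidate.
    pub fn install(&self, candidate: ChecksumFn) {
        self.slot.store(candidate as usize, Ordering::Release);
    }
}

/// Baseline candidate, always valid.
fn sum_generic(data: &[u8]) -> u32 {
    data.iter().map(|&b| b as u32).sum()
}

/// Stand-in for a faster candidate selected when a feature bit is set.
fn sum_unrolled(data: &[u8]) -> u32 {
    data.chunks(4).flat_map(|c| c.iter()).map(|&b| b as u32).sum()
}
```

Swapping the installed candidate never changes the cell's layout, which is why the cell can live in Nucleus while the selection logic remains replaceable.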
2.21.2 Evolvable: Boot Monolith (First Loadable, Swappable)¶
Evolvable is the first payload activated by Nucleus at boot. It contains
everything needed to bring the system to a functional state (mount root
filesystem, start init). The Evolvable build artifact is a standard ELF
(produced by the Rust toolchain). The Evolvable embedded format is a flat
binary with a self-describing EvolvableImageHeader, extracted from the ELF by
the build pipeline and embedded in the kernel ELF as a linker section. At
boot, Nucleus handles only the flat binary — no ELF parsing needed (ELF
parsing is in Evolvable's evolution orchestration, used for post-boot live
replacement). At live evolution time, new Evolvable images arrive as signed ELF
files, parsed by Evolvable's own ELF loader.
Evolvable is live-replaceable as a whole (via the evolution primitive in
Nucleus + evolution orchestration in Evolvable) or subsystem-by-subsystem (each
Evolvable component implements EvolvableComponent). In practice, subsystem-level
replacement is the norm; whole-Evolvable replacement is reserved for major kernel
upgrades.
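To illustrate subsystem-level replacement, here is a hypothetical sketch of the `EvolvableComponent` contract. The real trait is defined elsewhere in this book; the method names and the toy counter components below are assumptions made for this example only.

```rust
/// Hypothetical per-subsystem replacement trait (method names assumed).
/// During Phase B the old instance exports its state; the replacement
/// imports it before CPUs are released.
pub trait EvolvableComponent {
    /// Stable name used to match old and new component instances.
    fn component_name(&self) -> &'static str;
    /// Serialize live state during the stop-the-world window.
    fn export_state(&self) -> Vec<u8>;
    /// Rebuild state in the replacement instance.
    fn import_state(&mut self, state: &[u8]) -> Result<(), &'static str>;
}

/// Toy components demonstrating a swap that preserves state.
struct CounterV1 { count: u64 }
struct CounterV2 { count: u64, swaps: u32 }

impl EvolvableComponent for CounterV1 {
    fn component_name(&self) -> &'static str { "counter" }
    fn export_state(&self) -> Vec<u8> { self.count.to_le_bytes().to_vec() }
    fn import_state(&mut self, state: &[u8]) -> Result<(), &'static str> {
        let bytes: [u8; 8] = state.try_into().map_err(|_| "bad state")?;
        self.count = u64::from_le_bytes(bytes);
        Ok(())
    }
}

impl EvolvableComponent for CounterV2 {
    fn component_name(&self) -> &'static str { "counter" }
    fn export_state(&self) -> Vec<u8> { self.count.to_le_bytes().to_vec() }
    fn import_state(&mut self, state: &[u8]) -> Result<(), &'static str> {
        let bytes: [u8; 8] = state.try_into().map_err(|_| "bad state")?;
        self.count = u64::from_le_bytes(bytes);
        self.swaps += 1;
        Ok(())
    }
}
```

The export/import pair is what lets a subsystem be replaced without losing its live state, independently of the rest of Evolvable.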
Evolvable contains:
| Category | Components | Replaceability |
|---|---|---|
| Evolution orchestration | ELF loader, ML-DSA-65 signature verification, Phase A/A'/B/C sequencing, `PendingOpsPerCpu` management, symbol resolution, state export/import coordination, `DATA_FORMAT_EPOCH`, version compatibility checks, chained migration. The Nucleus evolution primitive provides the atomic swap; orchestration handles everything before and after. | Replaceable (self-evolving: new orchestration verified by old orchestration, then installed by Nucleus primitive) |
| Tier 0 platform drivers | APIC/GIC/PLIC/OpenPIC (interrupt controller), HPET/generic timer/decrementer (timer), PIT/RTC (legacy clock) | Individually replaceable via `EvolvableComponent` |
| CPU feature detection (per-arch) | CPUID (x86), ID registers (AArch64/ARMv7), CSR probing (RISC-V), PVR (PPC) — fills `CpuFeatureTable` entries | Replaceable (detection code only; the table is Nucleus data) |
| Scheduler | EEVDF + RT + DL scheduler, CFS bandwidth, CPU hotplug, core provisioning, load balancer | Replaceable policy module (Section 19.9) |
| Memory policy | `PhysAllocPolicy` (default buddy policy), `PageReclaimPolicy` (default MGLRU eviction/aging), `VmmPolicy` (default page fault/THP/TLB policy), OOM killer, kswapd, kcompactd | Each independently replaceable (stateless trait) |
| Capability policy | `CapPolicy` (default permission/delegation/revocation/inheritance policy). Same data/policy split as memory subsystem. | Replaceable (stateless trait, `MonotonicVerifier` gate). See Section 9.1 |
| Slab allocator | `SlabCache` management, magazine depot, kslab_gc thread | Replaceable |
| IPC dispatch | `DomainRingBuffer` dispatch logic, routing, capability passing, flow control | Replaceable (data layout is Nucleus; dispatch logic is Evolvable) |
| Workqueue framework | `BoundedMpmcRing`, named kthread pools, work item scheduling | Replaceable |
| ACPI/DTB parser | ACPI table parsing (MADT, SRAT, SLIT, MCFG, HPET, FADT), DTB parsing | Replaceable |
| SysAPI layer (umka-sysapi) | Bidirectional syscall dispatch table, Linux ABI translation, UmkaOS native syscall handlers | Replaceable (Layer 2 in syscall architecture) |
| VFS | Virtual filesystem layer, dentry cache, inode cache, mount tree, page cache, overlayfs | Replaceable KABI service (Section 13.18) |
| Block layer | Request queues, I/O scheduler (`IoSchedOps`), dm-* targets, partitioning | Replaceable KABI service |
| Networking stack | TCP/IP/UDP, socket layer, routing FIB, netfilter/nftables, qdisc (`QdiscOps`) | Replaceable KABI service |
| Cgroup/namespace | Cgroup hierarchy, cgroup controllers (cpu, memory, io, pids), namespace implementation | Replaceable KABI service |
| Security modules | LSM framework, default SELinux/AppArmor stubs, IMA/EVM | Replaceable |
| Crypto core | Algorithm registry, `TfmBase`, key derivation — but NOT the algorithm implementations (those are feature-variant modules) | Replaceable |
| FMA / observability | Fault management, tracepoints, perf subsystem | Replaceable |
| AlgoDispatch candidates | All `algo_dispatch!` candidate function pointers (crypto, checksum, compression, memops) — initialized at boot phase 9 | Replaceable (new candidates loaded via evolution) |
Evolvable loading sequence (integrated into the existing boot phases):
- Boot Phase 0.1-0.7: Nucleus executes (entry asm, memory map parse, BootAlloc init, NUMA discovery, CpuLocal BSP).
- Boot Phase 0.8a: Nucleus verifies Evolvable signature (physical addresses, no MMU required).
- Boot Phase 0.2: Identity map / MMU enable (on non-x86 arches, runs between 0.8a and 0.8b; on x86-64, already active from entry asm).
- Boot Phase 0.8b: Nucleus maps Evolvable at EVOLVABLE_VIRT_BASE, populates vtable slots, calls core1_entry(). Requires MMU.
- Boot Phase 1.1+: Evolvable code runs: buddy handoff, slab init, IRQ domain, capability system, scheduler, workqueue, RCU.
- Boot Phase 3.1-3.3: SMP bringup (AP trampoline, secondary CPUs).
- Boot Phase 4.1-5.4: IOMMU, device registry, KABI, bus enumeration, Tier 0/1 drivers, VFS, root mount.
2.21.2.1 Evolvable Image Format¶
The Evolvable flat binary has a self-describing header at offset 0, inspired by the ARM64 kernel Image header (64 bytes at a fixed offset with magic and geometry fields) but extended for Nucleus/Evolvable structured handoff. The header is designed for zero-parsing activation: all offsets are pre-computed by the build system, all sizes are explicit, and all fields are fixed-width. Nucleus reads exactly 128 bytes and validates 3 fields — no loops, no variable-length parsing, no string comparisons.
/// Evolvable flat binary image header. Placed at byte 0 of the embedded image.
/// All offsets are relative to byte 0 of this header (image start).
/// All multi-byte fields are little-endian, using `Le16`/`Le32`/`Le64` wrapper
/// types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) to enforce
/// byte-order correctness on all eight architectures. On the two big-endian
/// platforms (PPC32, s390x), `Le32::to_ne()` performs a byte-swap; on the
/// six little-endian platforms, it is a no-op.
///
/// Total size: 128 bytes (2 cache lines). Fixed across all architectures.
///
/// The build system (`core1-pack`) generates this header by reading metadata
/// from the Evolvable ELF (section addresses, sizes, entry point, vtable symbol
/// addresses) and writing the flat binary: header + .text + .rodata + .data.
///
/// **Versioning**: `header_version` enables forward-compatible evolution.
/// Nucleus rejects images with `header_version > NUCLEUS_MAX_HEADER_VERSION`.
/// New fields are added to the reserved area; old Nucleus ignores them. This
/// means the header format itself can evolve without requiring a Nucleus
/// update (reboot) — as long as new fields are additive and unused fields
/// remain zero.
// kernel-internal, not KABI
#[repr(C, align(8))]
pub struct EvolvableImageHeader {
// --- Identification (8 bytes) ---
/// Magic bytes: b"UKC1" (UmKa Evolvable 1). Nucleus rejects any image where
/// the first 4 bytes do not match. This catches corrupted or
/// misidentified blobs immediately. Byte array — no endianness conversion.
pub magic: [u8; 4],
/// Header format version. Currently 1. Nucleus rejects images with
/// `header_version > NUCLEUS_MAX_HEADER_VERSION` (currently 1).
/// Increment only when a field's semantics change incompatibly.
/// Adding new fields to _reserved does NOT require a version bump
/// (old Nucleus ignores them because it reads at fixed offsets).
pub header_version: Le16,
/// Image flags (bitfield, stored little-endian).
/// Bit 0: 0 = position-dependent (linked at fixed EVOLVABLE_VIRT_BASE),
/// 1 = reserved (position-independent, not yet supported).
/// Bits 1-15: reserved, must be zero.
pub flags: Le16,
// --- Image geometry (48 bytes) ---
/// Total image size in bytes: from byte 0 (header start) through end
/// of .data (NOT including the appended LMS signature). Nucleus validates
/// this against the linker symbol distance minus signature size.
pub image_size: Le64,
/// Virtual address base that this image was linked at. Nucleus maps the
/// image sections at VAs starting from this base. Must match the
/// per-architecture EVOLVABLE_VIRT_BASE constant.
pub virt_base: Le64,
/// .text section: executable code, mapped RX.
/// Offset from image start. Size in bytes. 4KB-aligned by the build
/// system (padded if necessary).
pub text_offset: Le32,
pub text_size: Le32,
/// .rodata section: read-only data, mapped R (no execute).
/// Offset from image start. Size in bytes. 4KB-aligned.
pub rodata_offset: Le32,
pub rodata_size: Le32,
/// .data section: read-write initialized data. Nucleus copies these bytes
/// into freshly allocated pages (from BootAlloc) and maps them RW.
/// The embedded copy in the kernel ELF is never written — it serves as
/// the pristine initial values, reusable if Evolvable needs re-activation
/// (rollback).
/// Offset from image start. Size in bytes. 4KB-aligned.
pub data_offset: Le32,
pub data_size: Le32,
/// .bss size in bytes. Not present in the image (virtual-only). Nucleus
/// allocates and zeroes (data_size + bss_size) bytes total for the RW
/// region. The .bss occupies the portion after .data in the allocated
/// pages. 4KB-aligned.
pub bss_size: Le32,
pub _pad0: Le32,
// --- Entry point and vtable directory (16 bytes) ---
/// Offset of core1_entry() from image start. Nucleus computes the entry
/// VA as: virt_base.to_ne() + entry_offset.to_ne(). The entry function
/// signature is:
/// extern "C" fn core1_entry(handoff: *const BootHandoff) -> !
pub entry_offset: Le32,
/// Offset of the VtableDirectoryEntry[] array from image start.
/// The directory resides in .rodata (read-only, part of the signed
/// image). Nucleus iterates it to populate dispatch slots.
pub vtable_dir_offset: Le32,
/// Number of entries in the vtable directory. Nucleus rejects images
/// where vtable_dir_count > MAX_VTABLE_SLOTS (currently 8).
pub vtable_dir_count: Le16,
pub _pad1: [u8; 6],
// --- Signature metadata (24 bytes) ---
/// Offset from image start to the LMS signature data. Normally
/// equals `image_size` (signature is appended immediately after
/// .data). Zero means no signature — Nucleus panics in enforce mode.
pub sig_offset: Le32,
/// Size of the appended LMS signature in bytes.
/// For LMS-SHAKE256-N32-W4-H15: ~2,700 bytes.
pub sig_size: Le32,
/// Signature algorithm identifier:
/// 0x0000 = no signature (development only; rejected in production)
/// 0x0001 = LMS-SHAKE256-N32-W4-H15 (default, NIST SP 800-208)
/// 0x0002 = HSS/LMS-SHAKE256 two-level (for key trees > 32K signatures)
/// Nucleus rejects unknown algorithm IDs.
pub sig_algo: Le16,
pub _pad2: [u8; 14],
// --- Reserved (32 bytes) ---
/// Must be zero. New header fields are added here without a version bump.
pub _reserved: [u8; 32],
}
/// Size: 128 bytes exactly. Compile-time assertion.
const _: () = assert!(core::mem::size_of::<EvolvableImageHeader>() == 128);
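For reference, a minimal sketch of the `Le32` wrapper the header relies on (the canonical `Le16`/`Le32`/`Le64` definitions live in Section 6.1). Bytes are stored in wire order, so `to_ne()` is a no-op on the six little-endian targets and a byte-swap on PPC32/s390x.

```rust
/// Minimal sketch of the little-endian wire-format wrapper. The inner
/// array holds the value in little-endian byte order regardless of host
/// endianness, which is what makes the header layout identical on all
/// eight architectures.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct Le32(pub [u8; 4]);

impl Le32 {
    /// Build from a native-endian value (what core1-pack does when
    /// writing the header).
    pub const fn from_ne(v: u32) -> Self {
        Le32(v.to_le_bytes())
    }

    /// Convert to native endianness (what Nucleus does when reading).
    pub const fn to_ne(self) -> u32 {
        u32::from_le_bytes(self.0)
    }
}
```

Because the type is `repr(transparent)` over `[u8; 4]`, it has alignment 1 and imposes no padding of its own inside `repr(C)` structs.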
Image binary layout (produced by core1-pack):
Offset 0x0000: EvolvableImageHeader (128 bytes)
Offset T: .text section (code, T = text_offset, mapped RX)
Offset R: .rodata section (R = rodata_offset, mapped R)
Contains the VtableDirectoryEntry[] array.
Offset D: .data section (D = data_offset, initial values for RW data)
Copied by Nucleus into fresh pages; original never modified.
--- .bss is NOT in the image — Nucleus allocates and zeroes it.
Offset S: LMS signature (S = sig_offset = image_size, sig_size bytes)
Appended AFTER the signed content. Covers bytes [0, image_size).
Sections are ordered .text → .rodata → .data with 4KB alignment between
each. The build system pads each section to a 4KB boundary so that page
protections can be set per-section without sharing a page between RX and R
regions. The LMS signature is appended after .data — it is NOT page-aligned
(it is never mapped; Nucleus reads it from the embedded section and discards it
after verification).
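The 4 KB padding rule is the usual power-of-two round-up; a sketch of the `round_up_4k` helper that the Phase 0.8 loader also uses when sizing the `.data + .bss` region:

```rust
const PAGE_SIZE: usize = 4096;

/// Round `n` up to the next 4 KB boundary (identity when already aligned).
/// The mask trick is valid because PAGE_SIZE is a power of two.
fn round_up_4k(n: usize) -> usize {
    (n + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}
```

Applied per-section by the build system, this guarantees no two sections ever share a page, so RX/R/RW protections can be set independently.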
2.21.2.2 Evolvable Embedding in the Kernel ELF¶
Evolvable is embedded in the kernel ELF as a read-only linker section, following
the same pattern Linux uses for its built-in initramfs (.init.ramfs section
with __initramfs_start / __initramfs_size linker symbols).
Per-architecture assembly file (umka-kernel/src/arch/*/core1_embed.S):
// Embeds the Evolvable flat binary into the kernel ELF.
// The binary is produced by the build pipeline: core1.elf → objcopy → core1-pack.
.section .core1_image, "a" // "a" = allocatable (loaded by bootloader)
.balign 4096 // page-aligned for clean page protections
.global __core1_start
__core1_start:
.incbin "core1.bin" // path resolved by build.rs
.global __core1_end
__core1_end:
.balign 4096 // pad to page boundary
Linker script additions (all eight linker-*.ld files):
/* Evolvable flat binary image. Placed in its own PT_LOAD segment so the
bootloader loads it as a contiguous region. Marked read-only — Nucleus
never writes to the embedded copy (it copies .data to fresh RW pages). */
.core1_image ALIGN(4096) : {
__core1_start = .;
KEEP(*(.core1_image))
__core1_end = .;
} :core1_seg
/* In PHDRS: */
PHDRS {
text PT_LOAD FLAGS(5); /* R-X: kernel .text */
rodata PT_LOAD FLAGS(4); /* R--: kernel .rodata */
data PT_LOAD FLAGS(6); /* RW-: kernel .data + .bss */
core1_seg PT_LOAD FLAGS(4); /* R--: Evolvable embedded image */
}
Placing .core1_image in its own PT_LOAD segment ensures the bootloader maps
it as a separate, contiguous, read-only region. This prevents the bootloader
from merging it with the kernel's RW data segment (which would allow accidental
writes to the embedded Evolvable image).
Nucleus references (in umka-core/src/evolution/boot_load.rs):
extern "C" {
/// Linker-provided symbols bounding the embedded Evolvable image
/// (including the appended LMS signature).
static __core1_start: u8;
static __core1_end: u8;
}
/// LMS public key for Evolvable image verification. Baked into Nucleus at build
/// time. This key is NOT tied to a specific Evolvable version — any Evolvable
/// signed by the corresponding private key will pass verification.
/// The private key is held exclusively by the build system (signing server).
///
/// Format: LMS public key per RFC 8554 §5.3 (type || otstype || I || T[1]).
/// For LMS-SHAKE256-N32-W4-H15: 56 bytes exactly.
///
/// Key rotation: replacing this key requires a Nucleus rebuild (reboot).
/// With H=15, the key can sign 32,768 Evolvable releases. With H=20,
/// 1,048,576 releases. At one release per day, H=15 lasts ~90 years,
/// H=20 lasts ~2,870 years. Key exhaustion is not a practical concern.
static EVOLVABLE_LMS_PUBLIC_KEY: [u8; 56] =
*include_bytes!(concat!(env!("OUT_DIR"), "/core1-lms.pub"));
/// Maximum header version this Nucleus can parse.
const NUCLEUS_MAX_HEADER_VERSION: u16 = 1;
/// Maximum vtable directory entries.
const MAX_VTABLE_SLOTS: usize = 8;
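`VtableDirectoryEntry` itself is defined elsewhere; a hypothetical layout consistent with the loader's `slot_id`/`vaddr` field accesses in Phase 0.8 (the padding field is an assumption):

```rust
/// Hypothetical vtable directory entry layout. The `slot_id` and `vaddr`
/// names come from the Phase 0.8 loader code; the explicit padding is an
/// assumption made so the struct has a stable 16-byte repr(C) layout.
/// The entry array lives in Evolvable's .rodata, inside the signed image.
#[repr(C)]
pub struct VtableDirectoryEntry {
    /// Index into Nucleus's VTABLE_SLOTS; must be < MAX_VTABLE_SLOTS.
    pub slot_id: u16,
    pub _pad: [u8; 6],
    /// Evolvable link-time virtual address of the exported vtable.
    pub vaddr: u64,
}
```

A fixed 16-byte stride lets Nucleus index the directory with plain pointer arithmetic, in keeping with the header's zero-parsing design.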
2.21.2.3 Per-Architecture Evolvable Virtual Address Layout¶
Evolvable is linked at a fixed VA base per architecture. The kernel linker script
reserves this VA range. Nucleus's page table setup (Phase 0.2) does NOT map it —
the embedded image bytes are in Nucleus's .core1_image segment at Nucleus's own
VAs. During Phase 0.8, Nucleus creates new page table mappings that alias Evolvable's
.text and .rodata physical pages to Evolvable's linked VAs, and maps freshly
allocated RW pages for Evolvable's .data+.bss at Evolvable's expected RW VAs.
After Phase 0.8, code at EVOLVABLE_VIRT_BASE + entry_offset is executable, and
Evolvable's global variables are at their expected link-time addresses.
/// Per-architecture Evolvable virtual address base. Evolvable's linker script
/// uses this as its ORIGIN. Nucleus maps Evolvable sections starting here.
///
/// Design constraints:
/// - Must not overlap kernel .text, .rodata, .data, .bss, or BootAlloc VA.
/// - Must not overlap the direct-map (physmap) region.
/// - Must be 2 MB aligned (enables huge page mapping for .text).
/// - Must be userspace-inaccessible on all eight architectures.
/// Per-architecture isolation mechanisms:
///
/// | Architecture | EVOLVABLE_VIRT_BASE | Isolation Mechanism |
/// |---------------|----------------------------|--------------------------------------------------------------|
/// | x86-64 | 0xFFFF_FFFF_A000_0000 | Higher-half VA (SMAP/SMEP + PML4 NX) |
/// | AArch64 | 0xFFFF_0000_4000_0000 | Higher-half VA (TTBR1 kernel space, PAN/PXN) |
/// | ARMv7 | 0xC040_0000 | Higher-half VA (3G/1G split, PXN) |
/// | RISC-V 64 | 0xFFFF_FFC0_8040_0000 | Higher-half VA (Sv48 kernel range) |
/// | PPC32 | 0x0040_0000 | TS-bit isolation (kernel TS=0, user TS=1 — separate spaces) |
/// | PPC64LE | 0x2040_0000 | Radix PID isolation (PID=0 kernel, PID>0 user) |
/// | s390x | 0x0000_0200_0000_0000 | DAT ASCE isolation (kernel Region-Third ASCE) |
/// | LoongArch64 | 0x9000_0000_4000_0000 | DMW1 cached window (PLV0 only, inaccessible from PLV3 userspace) |
///
/// The gap between kernel_end and EVOLVABLE_VIRT_BASE accommodates:
/// - vmalloc region (grows upward from kernel_end)
/// - Future Evolvable growth (new subsystems increase image size)
///
/// These values are compile-time constants shared between the kernel linker
/// script and the Evolvable linker script via a common header
/// (umka-common/src/layout.rs).
#[cfg(target_arch = "x86_64")]
pub const EVOLVABLE_VIRT_BASE: u64 = 0xFFFF_FFFF_A000_0000;
// 512 MB below kernel .text (0xFFFF_FFFF_8100_0000).
// Leaves 512 MB gap for vmalloc. Fits in the [-2 GB, 0] kernel window.
#[cfg(target_arch = "aarch64")]
pub const EVOLVABLE_VIRT_BASE: u64 = 0xFFFF_0000_4000_0000;
// 1 GB into the kernel VA half (TTBR1 region).
// Kernel .text is at 0xFFFF_0000_4008_0000 (QEMU virt default).
// Evolvable is placed 128 KB below kernel .text with a 2 MB alignment.
#[cfg(target_arch = "arm")]
pub const EVOLVABLE_VIRT_BASE: u64 = 0xC040_0000;
// 4 MB above kernel base (0xC000_0000 with 3G/1G split).
// ARM32 kernel VA is limited to 1 GB; Evolvable fits in the first 16 MB.
#[cfg(target_arch = "riscv64")]
pub const EVOLVABLE_VIRT_BASE: u64 = 0xFFFF_FFC0_8040_0000;
// Sv48 kernel VA. 4 MB above kernel base (0xFFFF_FFC0_8020_0000).
#[cfg(target_arch = "powerpc")]
pub const EVOLVABLE_VIRT_BASE: u64 = 0x0040_0000;
// 4 MB above kernel base (0x0010_0000).
#[cfg(target_arch = "powerpc64")]
pub const EVOLVABLE_VIRT_BASE: u64 = 0x2040_0000;
// 4 MB above kernel base (0x2000_0000).
#[cfg(target_arch = "s390x")]
pub const EVOLVABLE_VIRT_BASE: u64 = 0x0000_0200_0000_0000;
// In the kernel virtual address space above identity-mapped lowcore.
// s390x uses DAT with Region-Third ASCE initially (4 TB VA space).
// Kernel .text is at 0x10000; Evolvable is placed at 2 TB (well above
// the direct-mapped physical memory region).
#[cfg(target_arch = "loongarch64")]
pub const EVOLVABLE_VIRT_BASE: u64 = 0x9000_0000_4000_0000;
// In the DMW1 direct-mapped cached window (0x9000_0000_0000_0000, CA=1, MAT=1).
// 1 GB offset from window base. Kernel .text starts at
// 0x9000_0000_0020_0000 (2 MB into DMW1); Evolvable is placed at
// 1 GB with a 2 MB alignment gap.
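The 2 MB alignment constraint on every base can be checked mechanically. The values below are copied from the constants above; the check itself is illustrative.

```rust
/// 2 MB huge-page size (x86-64 PMD, AArch64 L2 block, RISC-V megapage).
const HUGE_PAGE_SIZE: u64 = 2 * 1024 * 1024;

/// All eight EVOLVABLE_VIRT_BASE values from this section.
const EVOLVABLE_BASES: [(&str, u64); 8] = [
    ("x86-64", 0xFFFF_FFFF_A000_0000),
    ("aarch64", 0xFFFF_0000_4000_0000),
    ("armv7", 0xC040_0000),
    ("riscv64", 0xFFFF_FFC0_8040_0000),
    ("ppc32", 0x0040_0000),
    ("ppc64le", 0x2040_0000),
    ("s390x", 0x0000_0200_0000_0000),
    ("loongarch64", 0x9000_0000_4000_0000),
];

/// True when every base can back a 2 MB huge-page mapping for .text.
fn all_huge_aligned() -> bool {
    EVOLVABLE_BASES.iter().all(|&(_, base)| base % HUGE_PAGE_SIZE == 0)
}
```

In the real tree this belongs next to the constants in umka-common/src/layout.rs, where it can run as a compile-time or unit-test check.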
Evolvable linker script (umka-core1/linker-<arch>.ld):
/* Evolvable is linked as a standalone binary at EVOLVABLE_VIRT_BASE.
objcopy -O binary extracts the flat binary for embedding. */
ENTRY(core1_entry)
SECTIONS {
. = EVOLVABLE_VIRT_BASE;
.text ALIGN(4096) : { *(.text .text.*) }
.rodata ALIGN(4096) : { *(.rodata .rodata.*) }
.data ALIGN(4096) : { *(.data .data.*) }
.bss ALIGN(4096) (NOLOAD) : {
__core1_bss_start = .;
*(.bss .bss.*)
*(COMMON)
__core1_bss_end = .;
}
/* Vtable directory: a static array in .rodata populated by the
#[link_section = ".rodata.vtable_dir"] attribute on the
VTABLE_DIRECTORY constant in Evolvable source code. */
/DISCARD/ : { *(.comment) *(.note*) *(.eh_frame*) }
}
2.21.2.4 Phase 0.8: Evolvable Boot Loading Protocol¶
This is the complete algorithm executed by Nucleus to activate Evolvable. It runs in two sub-phases:
- Phase 0.8a (evolvable_verify): Signature verification. Operates on physical addresses only — no MMU required. May precede Phase 0.2 (identity mapping / MMU enable) on architectures where the MMU is set up after CpuLocal (AArch64, ARMv7, RISC-V, PPC32, PPC64LE, LoongArch64). On x86-64, MMU is already active from entry assembly.
- Phase 0.8b (evolvable_map_and_init): Virtual mapping and initialization.
Requires MMU enabled (Phase 0.2 complete). Maps Evolvable at
EVOLVABLE_VIRT_BASE, populates VTABLE_SLOTS[], calls core1_entry().
Runs on the BSP with interrupts disabled, after BootAlloc is initialized (Phase 0.4) and CpuLocal BSP is set up (Phase 0.7). No heap, no slab, no scheduler — only BootAlloc page allocation and Nucleus's page table hardware ops are available.
Preconditions (guaranteed by Phases 0.1-0.7 for 0.8a; 0.1-0.7+0.2 for 0.8b):
- BootAlloc is initialized and has free physical pages.
- For Phase 0.8b: identity map / MMU is active (VA == PA for low memory).
- Nucleus's page table hardware ops (arch::current::mm) are functional (for 0.8b).
- __core1_start and __core1_end symbols resolve to the embedded image
boundaries (including appended LMS signature) in Nucleus's VA space.
- EVOLVABLE_LMS_PUBLIC_KEY contains the LMS verification key.
- Serial console is available for panic messages.
/// Phase 0.8a + 0.8b: Load and activate Evolvable.
///
/// This function is the last thing Nucleus executes autonomously. After
/// calling core1_entry(), control never returns to Nucleus — Evolvable owns the
/// boot sequence from this point forward. Nucleus's code remains mapped
/// (it provides the syscall entry, KABI trampoline, and evolution primitive)
/// but Nucleus never initiates execution again until a live evolution request
/// arrives via the evolution orchestration.
///
/// # Panics
///
/// Panics (halts all CPUs, prints to serial) on any of:
/// - LMS signature verification failure (Evolvable tampered or wrong key)
/// - Bad magic or unsupported header version
/// - Image geometry inconsistency (sections overflow image_size)
/// - virt_base mismatch with EVOLVABLE_VIRT_BASE
/// - BootAlloc unable to allocate RW pages for Evolvable's .data+.bss
/// - Page table mapping failure
///
/// There is no recovery from any of these — they indicate a build system
/// bug, storage corruption, or hardware fault. Panic is the only correct
/// response.
pub fn core0_activate_core1(boot_alloc: &mut BootAlloc) -> ! {
let image_start = unsafe { &__core1_start as *const u8 };
let image_end = unsafe { &__core1_end as *const u8 };
let total_len = image_end as usize - image_start as usize;
// --- Step 1: Parse header minimally to find signature location ---
assert!(total_len >= core::mem::size_of::<EvolvableImageHeader>(),
"Evolvable image too small for header");
// SAFETY: image_start is 4KB-aligned (linker section alignment),
// and EvolvableImageHeader is align(8). 4096 > 8, so alignment is satisfied.
let header = unsafe { &*(image_start as *const EvolvableImageHeader) };
assert!(header.magic == *b"UKC1",
"Evolvable bad magic: expected UKC1");
assert!(header.header_version.to_ne() <= NUCLEUS_MAX_HEADER_VERSION,
"Evolvable header version {} > max {}",
header.header_version.to_ne(), NUCLEUS_MAX_HEADER_VERSION);
// --- Step 2: LMS signature verification ---
// The signed content is [0, image_size). The signature is appended
// at [sig_offset, sig_offset + sig_size). Nucleus verifies the
// signature over the image content using the baked-in public key.
//
// LMS verification (NIST SP 800-208) is purely hash-based:
// 1. SHAKE256-hash the signed content → message digest
// 2. Complete Winternitz one-time signature chains (~530 hash calls
// for W=4) → candidate OTS public key
// 3. Walk Merkle authentication path (15 hash calls for H=15) →
// candidate root
// 4. Compare candidate root against public key's T[1]
//
// Nucleus already has Keccak-f[1600] for SHA3-256. SHAKE256 uses the
// same permutation — only the padding byte changes (0x06 → 0x1F).
// The LMS verifier adds ~1-3 KB of code beyond the existing Keccak.
//
// Total verification time: ~3-8 ms for a 2-4 MB image (dominated by
// SHAKE256 hashing the image content; the ~530 Winternitz chain hash
// calls add ~0.5 ms). No SIMD — AlgoDispatch is not yet initialized.
assert!(header.sig_algo.to_ne() != 0,
"Evolvable image has no signature (sig_algo=0)");
assert!(header.sig_algo.to_ne() == 0x0001,
"Evolvable unknown signature algorithm: {:#06x} (only 0x0001 LMS is implemented)",
header.sig_algo.to_ne());
assert!(header.sig_offset.to_ne() as usize + header.sig_size.to_ne() as usize
<= total_len,
"Evolvable signature overflows embedded section");
let signed_content = unsafe {
core::slice::from_raw_parts(image_start, header.image_size.to_ne() as usize)
};
let signature = unsafe {
core::slice::from_raw_parts(
image_start.add(header.sig_offset.to_ne() as usize),
header.sig_size.to_ne() as usize,
)
};
let valid = lms_verify_shake256(
&EVOLVABLE_LMS_PUBLIC_KEY,
signed_content,
signature,
);
if !valid {
panic!("Evolvable LMS signature verification failed");
}
// --- Step 3: Full header validation (post-signature) ---
// Now that the image is authenticated, validate geometry.
let image_len = header.image_size.to_ne() as usize;
assert!(header.virt_base.to_ne() == EVOLVABLE_VIRT_BASE,
"Evolvable virt_base mismatch: image={:#x}, expected={:#x}",
header.virt_base.to_ne(), EVOLVABLE_VIRT_BASE);
let flags = header.flags.to_ne();
assert!(flags & !0x0001 == 0,
"Evolvable unknown flags: {:#06x}", flags);
assert!(flags & 0x0001 == 0,
"Evolvable position-independent mode not yet supported");
// Validate section geometry: each section must fit within image_size
// and be 4KB-aligned. (Non-overlap follows from the build system's
// fixed .text → .rodata → .data ordering, which is covered by the
// signature verified above.)
validate_section("text", header.text_offset.to_ne(), header.text_size.to_ne(), image_len);
validate_section("rodata", header.rodata_offset.to_ne(), header.rodata_size.to_ne(), image_len);
validate_section("data", header.data_offset.to_ne(), header.data_size.to_ne(), image_len);
assert!(header.text_offset.to_ne() as usize
>= core::mem::size_of::<EvolvableImageHeader>(),
"Evolvable .text overlaps header");
// --- Step 4: Allocate RW pages for .data + .bss ---
// The embedded image contains .data initial values (read-only). Nucleus
// allocates fresh physical pages, copies .data into them, and zeroes
// the .bss portion. This gives Evolvable its own writable data region
// while preserving the pristine embedded copy for potential rollback.
let rw_size = round_up_4k(header.data_size.to_ne() as usize
+ header.bss_size.to_ne() as usize);
let rw_pages = boot_alloc.alloc_pages_zeroed(rw_size / PAGE_SIZE)
.expect("Evolvable: BootAlloc cannot allocate RW pages");
// Copy .data initial values from embedded image to allocated pages.
let data_src = unsafe {
image_start.add(header.data_offset.to_ne() as usize)
};
unsafe {
core::ptr::copy_nonoverlapping(
data_src,
rw_pages.as_mut_ptr(),
header.data_size.to_ne() as usize,
);
}
// .bss is already zeroed (alloc_pages_zeroed).
// --- Step 5: Create page table mappings for Evolvable ---
// Map Evolvable sections at their linked VAs. The physical backing for
// .text and .rodata is the embedded image itself (in Nucleus's
// .core1_image segment). The physical backing for .data+.bss is the
// freshly allocated pages.
//
// Protection flags follow the principle of least privilege:
// .text → RX (execute, no write)
// .rodata → R (read only, no execute, no write)
// .data → RW (read-write, no execute)
// .bss → RW (read-write, no execute)
//
// On architectures with 2 MB huge pages (x86-64 PMD, AArch64 L2 block,
// RISC-V megapage), sections ≥ 2 MB that are 2 MB-aligned use huge
// pages. Smaller sections use 4 KB pages. The build system's 4 KB
// alignment guarantee ensures no two sections share a page.
let text_phys = virt_to_phys(
image_start as usize + header.text_offset.to_ne() as usize);
let rodata_phys = virt_to_phys(
image_start as usize + header.rodata_offset.to_ne() as usize);
let data_phys = rw_pages.phys_addr();
let vb = header.virt_base.to_ne();
// Map .text (RX)
map_region(
boot_alloc,
vb + header.text_offset.to_ne() as u64,
text_phys,
header.text_size.to_ne() as u64,
PageFlags::READ | PageFlags::EXECUTE,
);
// Map .rodata (R)
map_region(
boot_alloc,
vb + header.rodata_offset.to_ne() as u64,
rodata_phys,
header.rodata_size.to_ne() as u64,
PageFlags::READ,
);
// Map .data + .bss (RW)
map_region(
boot_alloc,
vb + header.data_offset.to_ne() as u64,
data_phys,
rw_size as u64,
PageFlags::READ | PageFlags::WRITE,
);
// Flush TLB for the entire Evolvable VA range (architecturally required
// after adding new page table entries).
arch::current::mm::flush_tlb_range(
vb,
vb + header.data_offset.to_ne() as u64 + rw_size as u64,
);
// --- Step 6: Read vtable directory, populate dispatch slots ---
// The vtable directory is in Evolvable's .rodata (signed, read-only).
// Each entry maps a SlotId to a virtual address in Evolvable's address
// space. Nucleus writes these addresses into the global dispatch slots.
assert!(header.vtable_dir_count.to_ne() as usize <= MAX_VTABLE_SLOTS,
"Evolvable vtable directory too large: {}",
header.vtable_dir_count.to_ne());
let dir_ptr = unsafe {
image_start.add(header.vtable_dir_offset.to_ne() as usize)
as *const VtableDirectoryEntry
};
for i in 0..header.vtable_dir_count.to_ne() as usize {
let entry = unsafe { &*dir_ptr.add(i) };
assert!((entry.slot_id as usize) < MAX_VTABLE_SLOTS,
"Evolvable vtable slot_id {} out of range", entry.slot_id);
// The VA in the directory is an Evolvable link-time address.
// It's now valid because we mapped Evolvable at virt_base.
VTABLE_SLOTS[entry.slot_id as usize]
.store(entry.vaddr as *mut (), Ordering::Release);
}
// --- Step 7: Build BootHandoff and call Evolvable entry ---
let handoff = BootHandoff {
boot_alloc: boot_alloc as *mut BootAlloc,
memory_map: FIRMWARE_MEMMAP.as_ptr(),
memory_map_count: FIRMWARE_MEMMAP_COUNT,
acpi_rsdp: ACPI_RSDP_ADDR,
dtb_addr: DTB_ADDR,
cmdline: KERNEL_CMDLINE.as_ptr(),
cmdline_len: KERNEL_CMDLINE_LEN,
bsp_cpu_local: arch::current::cpu::bsp_cpulocal_ptr(),
cpu_feature_table: &raw mut CPU_FEATURE_TABLE,
vtable_slots: &raw mut VTABLE_SLOTS,
core1_image_phys: virt_to_phys(image_start as usize),
core1_rw_phys: data_phys,
core1_rw_size: rw_size as u64,
initramfs_start: INITRAMFS_PHYS_START,
initramfs_size: INITRAMFS_SIZE,
arch: arch::current::boot::arch_boot_data(),
};
let entry_va = vb + header.entry_offset.to_ne() as u64;
let entry_fn: extern "C" fn(*const BootHandoff) -> ! =
unsafe { core::mem::transmute(entry_va) };
// Point of no return. Evolvable takes ownership of the boot sequence.
entry_fn(&handoff);
}
/// Validate a section descriptor within the image.
fn validate_section(name: &str, offset: u32, size: u32, image_size: usize) {
assert!(offset as usize % PAGE_SIZE == 0,
"Evolvable .{} offset not page-aligned: {:#x}", name, offset);
let end = offset as usize + size as usize;
assert!(end <= image_size,
"Evolvable .{} overflows image: offset={:#x} size={:#x} image={}",
name, offset, size, image_size);
}
/// Map a contiguous VA range to a contiguous PA range using Nucleus's
/// page table hardware ops. Allocates intermediate page table pages from
/// BootAlloc as needed. Uses 2 MB huge pages where possible (range is
/// 2 MB aligned and ≥ 2 MB), otherwise 4 KB pages.
fn map_region(
alloc: &mut BootAlloc,
virt_start: u64,
phys_start: PhysAddr,
size: u64,
flags: PageFlags,
) {
let mut offset = 0u64;
while offset < size {
let va = virt_start + offset;
let pa = PhysAddr(phys_start.0 + offset);
let remaining = size - offset;
// Attempt 2 MB huge page if aligned and large enough.
if va % HUGE_PAGE_SIZE == 0
&& pa.0 % HUGE_PAGE_SIZE == 0
&& remaining >= HUGE_PAGE_SIZE
{
arch::current::mm::map_huge_page(alloc, va, pa, flags);
offset += HUGE_PAGE_SIZE;
} else {
arch::current::mm::map_page(alloc, va, pa, flags);
offset += PAGE_SIZE as u64;
}
}
}
2.21.2.5 BootHandoff: Nucleus → Evolvable Data Transfer¶
The BootHandoff struct is the sole interface between Nucleus and Evolvable at boot.
It transfers all state that Evolvable needs to continue the boot sequence. Designed
for zero ambiguity — every field has a single interpretation and no optional
semantics beyond the documented firmware-dependent addresses.
/// Data transferred from Nucleus to Evolvable at Phase 0.8.
/// Placed on Nucleus's boot stack (BSP). Evolvable's entry function copies what
/// it needs before switching to its own stack. All pointers are virtual
/// addresses in the current (kernel higher-half) mapping.
///
/// This struct is part of the Nucleus/Evolvable ABI. Fields may be added at the
/// end; existing fields are never removed or reordered.
// kernel-internal, not KABI
#[repr(C)]
pub struct BootHandoff {
// --- Memory ---
/// Pointer to the BootAlloc instance. Evolvable takes ownership: calls
/// hand_off_to_buddy() to transfer remaining free pages into the
/// buddy allocator during Phase 1.1.
pub boot_alloc: *mut BootAlloc,
/// Firmware memory map (array of MemoryMapEntry). Parsed from
/// Multiboot1 E820 (x86) or DTB /memory (non-x86) during Phase 0.3.
pub memory_map: *const MemoryMapEntry,
pub memory_map_count: u32,
pub _pad0: u32,
// --- Firmware tables ---
/// ACPI RSDP physical address. Non-zero on x86-64 and AArch64 UEFI
/// systems. Zero on DTB-only systems (ARMv7, RISC-V, PPC).
pub acpi_rsdp: PhysAddr,
/// DTB (Device Tree Blob) physical address. Non-zero on AArch64,
/// ARMv7, RISC-V, PPC32, PPC64LE. Zero on x86-64.
pub dtb_addr: PhysAddr,
// --- Command line ---
/// Kernel command line (null-terminated UTF-8 string). Points into
/// either the Multiboot1 info structure (x86) or a BootAlloc'd copy
/// of the DTB /chosen/bootargs property.
pub cmdline: *const u8,
pub cmdline_len: u32,
pub _pad1: u32,
// --- Per-CPU ---
/// BSP's CpuLocalBlock pointer. Evolvable uses this immediately (it
/// runs on the BSP). AP CpuLocalBlocks are allocated during Phase 3.
pub bsp_cpu_local: *mut CpuLocalBlock,
/// CpuFeatureTable storage. Empty — Evolvable fills it during CPU
/// feature detection (Phase 1.3 equivalent). The storage is in
/// Nucleus's BSS (non-replaceable data).
pub cpu_feature_table: *mut CpuFeatureTable,
// --- Nucleus dispatch infrastructure ---
/// Global vtable slot array. Nucleus has populated slots from the
/// Evolvable vtable directory (Step 5). Evolvable reads these to find its
/// own entry points and may register additional entries during init.
pub vtable_slots: *mut [AtomicPtr<()>; MAX_VTABLE_SLOTS],
// --- Evolvable image metadata (for rollback / evolution) ---
/// Physical address of the embedded Evolvable image (.text + .rodata).
/// Retained for rollback: if Evolvable crashes during early init, a
/// watchdog could theoretically re-activate the pristine image.
/// (In practice, early-boot crashes are fatal — this field is for
/// completeness and future evolution framework use.)
pub core1_image_phys: PhysAddr,
/// Physical address and size of the allocated RW pages (.data+.bss).
pub core1_rw_phys: PhysAddr,
pub core1_rw_size: u64,
// --- Initramfs ---
/// Physical address and size of the initramfs CPIO archive. Contains
/// device drivers, on-demand KABI modules, and firmware blobs.
/// Loaded by the bootloader (Multiboot module, DTB /chosen/initrd).
/// Zero if no initramfs is present.
pub initramfs_start: PhysAddr,
pub initramfs_size: u64,
// --- Architecture-specific ---
/// Per-architecture boot data. Contains arch-specific values that
/// don't fit the cross-platform fields above.
pub arch: ArchBootData,
}
// Boot ABI struct — contains raw pointers and ArchBootData, so size is
// platform-dependent. Verified per-architecture at build time.
/// Architecture-specific boot data, appended to BootHandoff.
/// Each architecture defines its own set of fields via cfg attributes;
/// the active variant is implicit (known at compile time from the target triple).
// kernel-internal, not KABI
#[repr(C)]
pub struct ArchBootData {
#[cfg(target_arch = "x86_64")]
/// Multiboot1 info structure physical address.
pub multiboot_info: PhysAddr,
#[cfg(any(target_arch = "aarch64", target_arch = "arm"))]
/// Machine ID (ARMv7) or zero (AArch64). Passed in r1/x1 by
/// the bootloader. Used by some embedded board files.
pub machine_id: u32,
#[cfg(target_arch = "riscv64")]
/// Hart ID of the BSP. Passed in a0 by OpenSBI.
pub boot_hart_id: u64,
#[cfg(any(target_arch = "powerpc", target_arch = "powerpc64"))]
/// Effective address of the Open Firmware / OPAL entry point.
/// PPC32: Open Firmware client interface. PPC64LE: OPAL.
pub firmware_entry: u64,
#[cfg(target_arch = "s390x")]
/// IPL parameter block passed by z/VM or LPAR.
pub ipl_parms: *const u8,
#[cfg(target_arch = "s390x")]
/// Subchannel ID of the IPL device.
pub ipl_schid: u32,
#[cfg(target_arch = "loongarch64")]
/// Device tree blob pointer (from firmware).
pub dtb_ptr: *const u8,
#[cfg(target_arch = "loongarch64")]
/// Boot parameters from BIOS/UEFI.
pub boot_params: *const u8,
}
2.21.2.6 Evolvable Entry Point Contract¶
/// Evolvable entry function. Called exactly once by Nucleus at Phase 0.8.
/// This is the first instruction of Evolvable code that executes.
///
/// # Contract
///
/// **Preconditions** (guaranteed by Nucleus):
/// - BSP is running, interrupts disabled, preemption disabled.
/// - Evolvable .text is mapped RX, .rodata is mapped R, .data+.bss is
/// mapped RW at the linked VAs (EVOLVABLE_VIRT_BASE + offsets).
/// - BootAlloc is operational (pointed to by handoff.boot_alloc).
/// - Nucleus data structures (PageArray, CapTable, CpuFeatureTable,
/// DomainRingBuffer, AlgoDispatch, VTABLE_SLOTS) are initialized
/// but empty — Evolvable populates them.
/// - handoff pointer is valid for the duration of this function
/// (it's on Nucleus's boot stack).
/// - The only allocation mechanism available is BootAlloc (bump
/// allocator). No buddy, no slab, no heap.
///
/// **Postconditions** (Evolvable's obligation — this function never returns):
/// - Buddy allocator initialized (Phase 1.1: hand_off_to_buddy()).
/// - Slab allocator initialized (Phase 1.2).
/// - All Evolvable subsystems initialized per the canonical phase table
/// ([Section 2.3](#boot-init-cross-arch)).
/// - Init process (PID 1) started.
/// - Function diverges (-> !): enters the idle loop or scheduler.
///
/// # ABI
///
/// extern "C" with a single pointer argument. The pointer is passed in
/// the architecture's first integer argument register:
/// x86-64: rdi    AArch64: x0      ARMv7: r0
/// RISC-V: a0     PPC32: r3        PPC64LE: r3
/// s390x: r2      LoongArch64: a0
///
/// This matches the C ABI on all eight architectures — no special calling
/// convention.
#[no_mangle]
pub extern "C" fn core1_entry(handoff: *const BootHandoff) -> ! {
let handoff = unsafe { &*handoff };
// 1. Copy BootHandoff fields to Evolvable's own static storage
// (handoff is on Nucleus's stack, which we'll abandon).
save_boot_params(handoff);
// 2. Switch to Evolvable's own BSP stack (allocated from .bss).
// After this point, Nucleus's boot stack is no longer referenced.
switch_to_core1_stack();
// 3. Execute canonical init phases 1.1 → 5.4.
phase_1_memory_init(); // buddy, slab, page cache
phase_2_core_subsystems(); // IRQ, caps, scheduler, workqueue, RCU
phase_3_smp_bringup(); // AP trampoline, secondary CPUs
phase_4_device_init(); // IOMMU, device registry, KABI, buses
phase_5_rootfs(); // Tier 0/1 drivers, VFS, root mount
// 4. Start init process, enter scheduler.
start_init_and_idle();
}
2.21.2.7 Nucleus/Evolvable Dispatch Slots (VtableSlot ABI)¶
The vtable directory in the Evolvable image registers a small number of entry
points that Nucleus dispatches through. These are the ONLY paths where Nucleus
code calls into Evolvable code. All other Evolvable entry points (policy traits,
KABI services, etc.) are internal to Evolvable and use their own AtomicPtr
statics.
/// Vtable directory entry. Array of these in Evolvable's .rodata, referenced
/// by EvolvableImageHeader.vtable_dir_offset/count.
#[repr(C)]
pub struct VtableDirectoryEntry {
/// Slot ID (frozen ABI — append only, never reorder or remove).
pub slot_id: u16,
pub _pad: [u8; 6],
/// Virtual address of the entry point or vtable struct in Evolvable's
/// address space. Nucleus writes this into VTABLE_SLOTS[slot_id].
pub vaddr: u64,
}
const _: () = assert!(core::mem::size_of::<VtableDirectoryEntry>() == 16);
/// Frozen slot IDs — the Nucleus/Evolvable dispatch ABI.
///
/// These are the ONLY entry points that Nucleus ever calls into Evolvable.
/// Adding new slots is a backward-compatible change (old Nucleus ignores
/// unknown slot_ids; new Nucleus treats missing slots as null).
/// Removing or changing a slot's semantics requires a Nucleus update (reboot).
pub mod slot_id {
/// Syscall Layer 2 dispatch function.
/// Nucleus's syscall entry (Layer 1) calls this for every syscall:
/// extern "C" fn(nr: u64, args: &SyscallArgs) -> SyscallResult
/// Written into the SYS_DISPATCH_FN atomic function pointer that
/// the Layer 1 asm trampoline reads.
pub const SYSCALL_DISPATCH: u16 = 0;
/// Evolution orchestration entry.
/// Nucleus's evolution primitive calls this to initiate a replacement:
/// extern "C" fn(request: &EvolutionRequest) -> EvolutionResult
/// Used when an external trigger (sysfs write, IPC) requests a
/// live component replacement.
pub const EVOLUTION_ORCHESTRATION: u16 = 1;
/// Evolvable panic handler.
/// If Nucleus detects a fault in Evolvable code (e.g., a page fault at an
/// Evolvable VA), it calls this handler to attempt graceful shutdown:
/// extern "C" fn(fault: &FaultInfo) -> !
/// If this slot is null, Nucleus falls back to a kernel panic.
pub const PANIC_HANDLER: u16 = 2;
/// Timer tick handler.
/// Nucleus's timer interrupt (Tier 0 platform driver calls through
/// this slot) invokes Evolvable's scheduler tick:
/// extern "C" fn(cpu: CpuId, now_ns: u64)
pub const TIMER_TICK: u16 = 3;
/// IRQ dispatch handler.
/// Nucleus's IRQ entry dispatches non-timer, non-IPI interrupts
/// to Evolvable's IRQ domain framework:
/// extern "C" fn(irq_nr: u32)
pub const IRQ_DISPATCH: u16 = 4;
}
/// Global vtable slot array. Nucleus data (non-replaceable storage).
/// Slots are populated by Nucleus during Phase 0.8 from the Evolvable vtable
/// directory, and may be updated by Evolvable during init or live evolution.
///
/// Nucleus code reads these with Acquire ordering before every dispatch.
/// Null means "slot not populated" — Nucleus falls back to a default
/// (panic for mandatory slots, no-op for optional slots).
pub static VTABLE_SLOTS: [AtomicPtr<()>; MAX_VTABLE_SLOTS] =
[const { AtomicPtr::new(core::ptr::null_mut()) }; MAX_VTABLE_SLOTS];
pub const MAX_VTABLE_SLOTS: usize = 8;
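The Acquire-load-then-call pattern Nucleus uses at every dispatch site can be sketched as follows. This is a host-side simplification — `dispatch_timer_tick` and its boolean return are illustrative, not the actual Nucleus code:

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

const MAX_VTABLE_SLOTS: usize = 8;
const TIMER_TICK: usize = 3; // slot_id::TIMER_TICK

// Mirror of the global slot array: null = "slot not populated".
static VTABLE_SLOTS: [AtomicPtr<()>; MAX_VTABLE_SLOTS] =
    [const { AtomicPtr::new(core::ptr::null_mut()) }; MAX_VTABLE_SLOTS];

/// Acquire-load the slot, fall back to a no-op if unpopulated.
/// Returns true if the Evolvable handler was actually invoked.
fn dispatch_timer_tick(cpu: u32, now_ns: u64) -> bool {
    let p = VTABLE_SLOTS[TIMER_TICK].load(Ordering::Acquire);
    if p.is_null() {
        return false; // optional slot: no-op fallback
    }
    // Safety: the slot holds an `extern "C" fn(u32, u64)` written by the
    // loader (Step 6) with Release ordering.
    let f: extern "C" fn(u32, u64) = unsafe { core::mem::transmute(p) };
    f(cpu, now_ns);
    true
}
```

The Release store in Step 6 paired with this Acquire load is what guarantees Nucleus never calls into Evolvable code before the mapping and directory scan are visible.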
2.21.2.8 Build Pipeline¶
The Evolvable image is produced by a multi-step build pipeline that transforms a standard Rust ELF into the embedded flat binary format:
┌────────────────────────────────────────────────────────────────┐
│ Step 1: cargo build -p umka-core1 --target <triple> │
│ → target/<triple>/release/umka-core1 (ELF) │
├────────────────────────────────────────────────────────────────┤
│ Step 2: objcopy -O binary umka-core1 core1.flat │
│ Extracts loadable segments as a flat binary. │
│ Internal addresses are the link-time VAs. │
├────────────────────────────────────────────────────────────────┤
│ Step 3: core1-pack core1.flat umka-core1 → core1.unsigned │
│ Reads the Evolvable ELF to extract: │
│ - Section offsets/sizes (.text, .rodata, .data) │
│ - .bss size (from ELF section headers) │
│ - Entry point address (ELF e_entry) │
│ - Vtable directory address (VTABLE_DIRECTORY symbol) │
│ Prepends an EvolvableImageHeader. Output: header + flat. │
├────────────────────────────────────────────────────────────────┤
│ Step 4: core1-sign core1.unsigned --key lms.priv → core1.bin │
│ Signs the image with LMS-SHAKE256-N32-W4-H15. │
│ 1. SHAKE256 hash of core1.unsigned → message digest │
│ 2. LMS sign with the build system's stateful key │
│ 3. Appends LMS signature (~2.7 KB) after .data │
│ 4. Patches header: sig_offset, sig_size, sig_algo │
│ Output: signed Evolvable image (header + sections + sig). │
│ Also outputs: core1-lms.pub (56-byte public key, if │
│ not already generated — reused across Evolvable releases). │
├────────────────────────────────────────────────────────────────┤
│ Step 5: mkumka core0.elf core1.bin → vmlinuz-umka-VERSION │
│ 1. Embeds core1.bin into Nucleus's .core1_image section │
│ via objcopy --add-section (or .incbin in build.rs) │
│ 2. Embeds core1-lms.pub into Nucleus's .rodata via │
│ include_bytes!() (only on Nucleus rebuild) │
│ 3. Output: assembled kernel ELF ready for boot │
│ The separate core1.bin is ALSO installed alongside │
│ the assembled ELF for live evolution at runtime. │
└────────────────────────────────────────────────────────────────┘
core1-pack is a host-side tool (~200 lines of Rust) that runs during the
build. It reads the Evolvable ELF with the object crate (no custom ELF parser),
computes section offsets relative to the flat binary start, and writes the
header. It validates that all sections are 4KB-aligned and non-overlapping.
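That validation pass can be sketched as below — a hypothetical helper; the real tool reads the (offset, size) pairs out of the ELF via the object crate:

```rust
/// Check that sections (given as (offset, size) pairs in file order) are
/// page-aligned and non-overlapping within the flat binary.
fn validate_layout(sections: &[(u32, u32)], page: u32) -> Result<(), String> {
    let mut prev_end = 0u32;
    for &(off, size) in sections {
        if off % page != 0 {
            return Err(format!("offset {off:#x} not page-aligned"));
        }
        if off < prev_end {
            return Err(format!("section at {off:#x} overlaps previous"));
        }
        // Reject sizes that would wrap past the 32-bit offset space.
        prev_end = off
            .checked_add(size)
            .ok_or_else(|| format!("section at {off:#x}: size overflow"))?;
    }
    Ok(())
}
```

Failing here aborts the build, so a malformed image is never signed in Step 4.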
core1-sign is the LMS signing tool (~300 lines of Rust). It holds or
accesses the LMS private key state (the leaf index counter). The private key
MUST be stored on the build/signing server only — never on target devices.
The tool increments the one-time-key index after each signing. With H=15,
the key supports 32,768 signatures; with H=20, 1,048,576. Key state must be
backed up (losing state = duplicate one-time key use = security failure).
LMS key generation is a one-time operation per key lifecycle.
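The capacity numbers follow directly from the Merkle tree height (2^H leaves), and the statefulness requirement reduces to a monotonic leaf-index counter. A toy model (not the actual tool's code):

```rust
/// Minimal model of the stateful leaf-index counter core1-sign maintains.
/// Reusing a leaf index is a security failure, so signing past 2^h
/// leaves must hard-fail.
struct LmsKeyState {
    h: u32,        // Merkle tree height (15 or 20 in the text)
    next_leaf: u64, // persisted, backed-up counter
}

impl LmsKeyState {
    /// Total one-time signatures this key supports: 2^h.
    fn capacity(&self) -> u64 {
        1u64 << self.h
    }

    /// Reserve the next one-time key index, or fail if exhausted.
    fn take_leaf(&mut self) -> Option<u64> {
        if self.next_leaf >= self.capacity() {
            return None; // key exhausted — rotation required
        }
        let i = self.next_leaf;
        self.next_leaf += 1; // incremented after each signing
        Some(i)
    }
}
```

At one Evolvable release per day, an H=15 key lasts roughly 90 years — exhaustion, not compromise, is the expected rotation trigger.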
mkumka is the install-time assembly tool (analogous to mkinitrd for
Linux initramfs). It takes a pre-built Nucleus ELF and a signed Evolvable binary,
embeds Evolvable into the ELF's .core1_image section, and produces the final
bootable kernel image. The LMS public key is baked into Nucleus at Nucleus's
build time (not at mkumka time) — mkumka does NOT modify Nucleus's
.rodata. This means Nucleus is truly immutable: the same Nucleus binary works
with any Evolvable signed by the matching LMS key.
Cross-compilation: The build system produces one core1.bin per target
architecture. The Evolvable linker script uses the correct EVOLVABLE_VIRT_BASE for
each architecture. The core1-pack, core1-sign, and mkumka tools are
host tools (run on the build machine or target's package manager, not the
kernel itself).
2.21.2.9 Security Analysis¶
The Nucleus→Evolvable loading protocol provides the following security properties:
- Integrity + Authenticity: The entire Evolvable image (header + code + data) is verified against an LMS signature before any Evolvable code executes. A single bit flip causes a boot panic. LMS (NIST SP 800-208) provides post-quantum security based solely on hash function preimage resistance — no lattice assumptions, no elliptic curve discrete log. The public key is baked into Nucleus's `.rodata` (part of the signed kernel ELF), inheriting the bootloader's Secure Boot chain. Unlike a baked hash, the public key is NOT tied to a specific Evolvable version — any Evolvable signed by the build system's private key will pass verification. This decouples Nucleus releases from Evolvable releases entirely.
- No code injection: Nucleus never writes to Evolvable's .text or .rodata mappings. .text is mapped RX (no write). .rodata is mapped R (no write, no execute). .data+.bss is mapped RW but not executable (NX/XN bit set). W^X is enforced from the first mapping.
- No TOCTOU: The signature is verified over the entire embedded section in a single pass, then Nucleus immediately maps and uses the same bytes. There is no window where the image could be modified between verification and use (the embedded section is in Nucleus's read-only segment, and no DMA is active at this boot stage).
- Minimal TCB: Nucleus's loading code is ~400 lines of straight-line Rust with no loops beyond the Winternitz chain completion (bounded by the W parameter), the Merkle path walk (bounded by the H parameter), and the page mapping iteration (bounded by image size). The cryptographic code is `lms_verify_shake256()` — Keccak-f[1600] permutation (shared with SHA3-256) + Winternitz chain + Merkle path. Pure functions with no state, no allocation, and no external dependencies. This is a tractable formal verification target.
- Version-independent Nucleus: The LMS public key can verify any Evolvable signed by the corresponding private key. Nucleus never needs rebuilding for Evolvable updates. The only scenarios requiring a Nucleus rebuild: (a) LMS key rotation (after 32K+ signatures), (b) Nucleus code bug, (c) vtable ABI change. All are expected to be extremely rare events.
2.21.2.10 Performance Analysis¶
| Operation | Time (est.) | Notes |
|---|---|---|
| LMS verification of 3 MB image | 4-10 ms | SHAKE256 hash (~3-8 ms, generic Keccak, no SIMD) + ~530 Winternitz chain hashes + 15 Merkle path hashes (~0.5-1 ms). |
| Header validation | < 1 μs | 15 integer comparisons |
| BootAlloc page allocation (.data+.bss) | < 100 μs | Bump allocator: pointer increment |
| memcpy .data (256 KB typical) | 50-200 μs | Cache-cold, generic memcpy |
| memset .bss (128 KB typical) | 25-100 μs | Zeroing fresh pages |
| Page table mapping (3 sections, ~800 pages) | 100-500 μs | One PTE write per 4 KB page, or one PMD write per 2 MB huge page |
| TLB flush | < 10 μs | Full TLB flush on BSP (no APs running yet) |
| Vtable directory scan (≤ 8 entries) | < 1 μs | 8 AtomicPtr stores |
| Total Phase 0.8 | ~5-12 ms | Dominated by LMS verification (SHAKE256 hash of image content) |
For comparison, Linux spends ~50-200 ms on kernel decompression alone (for compressed kernels). UmkaOS Evolvable is uncompressed (embedded as a flat binary in a read-only segment), so there is no decompression step. The ~5-12 ms total is negligible relative to total boot time (~500 ms to userspace). LMS adds ~1-2 ms over a raw hash due to the Winternitz chain completion and Merkle path verification, but provides signature-based authenticity (not just integrity) and decouples Nucleus from specific Evolvable versions.
Why a monolith, not micromodules at boot? Three reasons:
1. Dependency ordering: subsystems have complex init ordering (memory before slab, slab before VFS, VFS before block, etc.). A single Evolvable image with a known init sequence eliminates dependency resolution at boot — the most fragile point in the kernel's lifetime.
2. Verification efficiency: one LMS signature verification (~5-10 ms) for the entire Evolvable, vs. N verifications for N separate modules.
3. Cold-start performance: loading one contiguous image from storage/initramfs is faster than seeking N separate files.
After boot, individual subsystems within Evolvable can be replaced independently
via live evolution using per-subsystem signed ELFs. For whole-Evolvable replacement,
the kernel monolith itself is the canonical source — the evolution loader extracts
the .core1_image section from a newer monolith in /boot/, validates the
EvolvableImageHeader signature, and performs the full dependency-ordered batch swap.
There is no separate whole-Evolvable ELF file; the monolith serves as the single
artifact for both boot and live evolution
(Section 13.18).
2.21.3 On-Demand KABI Services (Loaded When First Requested)¶
Some kernel subsystems are not needed on every system and are not included
in Evolvable. They are packaged as separate signed modules in the initramfs
module store or the on-disk /lib/modules/$(uname -r)/ directory. They
are loaded automatically when a KABI consumer first requests their service
via request_service()
(Section 12.7).
On-demand subsystem modules:
| Module | Service ID | Trigger | Why Not in Evolvable |
|---|---|---|---|
| `umka-kvm` | `kvm_hypervisor` | First `open("/dev/kvm")` or KVM ioctl | Virtualization not needed on most systems |
| `umka-drm` | `drm_subsystem` | First GPU device probe or DRM ioctl | Headless servers don't need display |
| `umka-audio` | `audio_subsystem` | First audio device probe (HDA/USB-Audio) | Servers and embedded often have no audio |
| `umka-bluetooth` | `bt_hci_service` | First Bluetooth HCI device probe | Not universal hardware |
| `umka-wifi` | `wifi_nl80211` | First WiFi NIC device probe | Servers/wired-only systems |
| `umka-nfs` | `nfs_client` | First `mount -t nfs` or NFS-root in cmdline | Not all systems use NFS |
| `umka-distributed` | `cluster_membership` | First peer discovery or `/proc/umka/cluster/join` | Single-node systems don't need clustering |
| `umka-dsm` | `dsm_coherence` | First `DSM_MAP` syscall or cluster DSM init | Specialized distributed workloads |
| `umka-mlpolicy` | `ml_policy_engine` | First ML policy activation via cgroup | Only for ML-guided kernel tuning |
| `umka-infiniband` | `ib_core_service` | First InfiniBand/RoCE device probe | HPC/datacenter only |
| `umka-sctp` | `sctp_transport` | First `socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP)` | Telecom workloads only |
| `umka-mptcp` | `mptcp_transport` | First `socket()` with `IPPROTO_MPTCP` or sysctl enable | Not default transport |
| `umka-xfs` | `xfs_filesystem` | First `mount -t xfs` | Not all systems use XFS |
| `umka-btrfs` | `btrfs_filesystem` | First `mount -t btrfs` | Not all systems use btrfs |
| `umka-zfs` | `zfs_filesystem` | First `mount -t zfs` or ZFS pool import | Optional filesystem |
| `umka-dm` | `dm_target_service` | First `dmsetup` or LVM activation | Bare-metal without LVM |
Loading protocol: The KabiProviderIndex (built at boot from the
initramfs module store) maps each ServiceId to the module that provides
it. When request_service() finds no registered provider, it looks up the
index and schedules the module for loading via schedule_module_load().
The requesting driver's probe is deferred (ProbeError::Deferred) and
automatically retried when the module finishes loading and registers its
service. See Section 12.7
for the full protocol.
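A toy model of that lookup-or-defer decision (the structures and string-typed IDs here are illustrative only; `KabiProviderIndex`, `request_service()`, and `ProbeError::Deferred` follow the text):

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum ProbeError {
    Deferred,   // module load scheduled; probe retried later
    NoProvider, // no module in the index provides this service
}

/// Maps each ServiceId to the module that provides it (built at boot
/// from the initramfs module store).
struct KabiProviderIndex {
    by_service: HashMap<&'static str, &'static str>,
}

struct ServiceRegistry {
    providers: HashMap<&'static str, usize>, // ServiceId → provider handle
    index: KabiProviderIndex,
    load_queue: Vec<&'static str>, // stand-in for schedule_module_load()
}

impl ServiceRegistry {
    fn request_service(&mut self, id: &'static str) -> Result<usize, ProbeError> {
        // Fast path: a provider has already registered this service.
        if let Some(&h) = self.providers.get(id) {
            return Ok(h);
        }
        // Slow path: consult the index and defer the caller's probe.
        match self.index.by_service.get(id) {
            Some(module) => {
                self.load_queue.push(module);
                Err(ProbeError::Deferred)
            }
            None => Err(ProbeError::NoProvider),
        }
    }
}
```

Once the scheduled module loads and registers its service, the deferred probe retries and takes the fast path.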
On-demand modules are live-replaceable by the same evolution framework as Evolvable subsystems. Once loaded, they are indistinguishable from Evolvable components in terms of replaceability.
2.21.4 Device Drivers (Tier 1 / Tier 2)¶
Device drivers are always separate modules, never part of Nucleus or Evolvable. They are loaded from the initramfs (during boot phase 11-15) or from the root filesystem (after boot) by the device registry's probe mechanism (Section 11.4).
| Driver Category | Tier | Loading | Examples |
|---|---|---|---|
| Storage (boot-critical) | Tier 1 | Initramfs, phase 5 | NVMe, AHCI/SATA, virtio-blk |
| Network (boot-critical) | Tier 1 | Initramfs or root FS | Intel e1000e/ixgbe, Mellanox mlx5, virtio-net |
| GPU / Display | Tier 1 | Root FS (after umka-drm loads) | i915, amdgpu, nouveau |
| USB host | Tier 1 | Initramfs | xHCI, EHCI |
| USB class | Tier 2 | Root FS, on device plug | USB-HID, USB-Audio, USB-Storage |
| Input | Tier 2 | Root FS | evdev, PS/2 keyboard/mouse |
| Audio | Tier 1 (default); optional Tier 2 demotion at ≥ 10 ms periods | Root FS (after umka-audio loads) | HDA, USB-Audio |
| Bluetooth | Tier 2 | Root FS (after umka-bluetooth loads) | btusb, hci_uart |
| WiFi | Tier 1 (data) + Tier 2 (control) | Root FS (after umka-wifi loads) | iwlwifi, ath11k |
| Printer/Scanner | Tier 2 | Root FS | USB-Printer class |
| Camera | Tier 2 | Root FS | UVC, platform ISP |
| Third-party/vendor | Tier 2 | Root FS, signed | Vendor-specific |
All drivers implement the KABI vtable interface (Section 12.1) and are ML-DSA signed. Driver tier changes are dynamic (Section 13.18).
2.21.5 Distribution Model¶
UmkaOS ships the kernel as two artifacts per release, both in the same package:
- Assembled kernel ELF (`/boot/vmlinuz-umka-VERSION`): Nucleus + embedded Evolvable, ready for cold boot. Produced by `mkumka` at install time (or by the build system for official releases).
- Standalone Evolvable image (`/boot/core1-umka-VERSION.bin`): The same LMS-signed Evolvable binary that is embedded in the ELF, also installed separately. Used by the running kernel's evolution orchestration for live upgrades.
Kernel update workflow:
Package manager installs umka-kernel-5.2.0:
/boot/vmlinuz-umka-5.2.0 ← assembled ELF (Nucleus v1 + Evolvable v5.2)
/boot/core1-umka-5.2.0.bin ← standalone Evolvable v5.2 (LMS-signed)
Case 1 — System is running (live evolution):
1. Update agent reads /boot/core1-umka-5.2.0.bin
2. Evolvable evolution orchestration verifies ML-DSA-65 signature
3. Orchestration extracts changed components, evolves each via
the standard Phase A/A'/B/C flow
4. Running kernel is now at v5.2. No reboot.
Case 2 — System reboots (cold boot):
1. GRUB loads /boot/vmlinuz-umka-5.2.0
2. Nucleus verifies embedded Evolvable via LMS signature (Phase 0.8)
3. Boots directly into v5.2. No evolution replay needed.
Why ship both? The assembled ELF ensures cold boots always start from the latest installed version — no evolution replay journal, no regression after power failure. The standalone Evolvable enables live upgrades without reboot. Shipping both costs ~3 MB of disk (one extra copy of Evolvable) — a negligible overhead for the operational benefit.
Nucleus independence: Because Nucleus uses an LMS public key (not a
version-specific hash), the same Nucleus binary works with any Evolvable release
signed by the matching key. An Evolvable-only update does NOT require rebuilding
or touching Nucleus. The package manager simply:
1. Installs the new Evolvable standalone binary.
2. Runs mkumka to produce a new assembled ELF (embedding the new Evolvable
into the existing Nucleus).
3. The running system live-evolves to the new Evolvable from the standalone
binary.
When does Nucleus change? Three scenarios only:
- LMS key rotation (after 32K+ Evolvable releases — decades at one release/day).
- Nucleus code bug (requires formal re-verification of the fix).
- Vtable slot ABI change (adding new slots is backward-compatible; changing slot semantics is not).
In practice, Nucleus may never change after initial verification. This is the design intent: a mathematically verified, immutable nucleus.
2.21.5.1 Trust Chain Summary¶
Two signature schemes, each optimal for its role:
Build system (signing server):
LMS private key (stateful) ──signs──→ Evolvable images (boot-time verification)
ML-DSA-65 private key ──signs──→ Component updates (runtime verification)
Nucleus (verified nucleus, 25-35 KB):
LMS public key (56 bytes, baked in .rodata)
lms_verify_shake256() (~1-3 KB code, reuses Keccak-f[1600])
→ Verifies Evolvable at boot (Phase 0.8). Stateless. Post-quantum.
Evolvable (replaceable, 2-4 MB):
ML-DSA-65 verifier (~5 KB code, in evolution orchestration)
Trust anchor chain (dual-key, rotatable without reboot)
→ Verifies all runtime evolution payloads. Richer features.
Boot trust chain:
UEFI Secure Boot → signed kernel ELF → Nucleus LMS → Evolvable image
→ Evolvable orchestration → ML-DSA-65 → all subsequent evolution
Why two schemes? LMS verification is minimal — just hash calls + Merkle tree walk, reusing Nucleus's existing Keccak permutation. This keeps Nucleus's cryptographic code under 3 KB, within the formal verification budget. ML-DSA-65 is richer (stateless signing, smaller signatures at higher security levels, standard key rotation) but requires ~5 KB of verifier code — fine for Evolvable, too large for the verified nucleus. The boundary is clean: LMS at the immutable root, ML-DSA-65 in the replaceable layer.
2.21.6 CPU-Dependent Adaptation¶
UmkaOS adapts to the exact CPU it runs on at boot — extracting maximum performance without per-host recompilation. Three complementary mechanisms handle different granularities of CPU dependence:
1. arch::current:: — Architecture selection (compile time)
The target triple (x86-64, AArch64, etc.) selects the entire architecture
module at compile time. This is a pub use $arch as current alias — zero
runtime cost. Every UmkaOS release ships one kernel binary per architecture.
2. code_alternative! — Instruction-level patching (boot phase 9)
Individual instructions within arch code are patched in-place at boot based
on the CpuFeatureTable. This covers:
- CPU errata workarounds: Spectre mitigations (retpoline → eIBRS), KPTI (Meltdown), VERW (MDS), CLEARBHB (AArch64 Spectre-BHB), DSB before TLBI (Cortex-A76 erratum).
- New instruction adoption: `SERIALIZE` replacing `CPUID` for serialization (Alder Lake+), `WRMSRNS` replacing `WRMSR` (non-serializing MSR write), `LKGS` for faster syscall entry.
- Instruction selection: `DC ZVA` vs `STP xzr` for page zeroing (AArch64, depends on ZVA block size), `rep movsb` vs explicit SIMD loop.
After patching, the binary is native for this CPU — zero runtime dispatch.
See Section 2.16 for
the full code_alternative! specification.
3. AlgoDispatch — Algorithm selection (boot phase 9)
Selects the best function-level implementation from a priority-ordered candidate list. See Section 3.10 for the mechanism and the complete feature-dependent subsystem catalog.
4. MicroarchHints — Parameter tuning (boot phase 9)
Non-instruction-level differences: cache line size, prefetch stream count,
page table depth, power state tables, memcpy strategy. These are read-only
parameters that subsystems query via microarch_hints() to tune their
behavior. See Section 2.16.
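The query pattern can be sketched as follows — the field names and the `microarch_hints()` stand-in are hypothetical; the actual layout is specified in Section 2.16:

```rust
/// Hypothetical read-only parameter block, filled once during boot phase 9
/// from CPU identification and then never written again.
#[derive(Clone, Copy)]
struct MicroarchHints {
    cache_line_size: usize,  // e.g. 64 on most x86, 128 on some POWER cores
    prefetch_streams: usize, // hardware prefetcher stream count
}

/// Stand-in for the real accessor over boot-time-detected values.
fn microarch_hints() -> MicroarchHints {
    MicroarchHints { cache_line_size: 64, prefetch_streams: 4 }
}

/// Example consumer: a copy loop derives its software-prefetch distance
/// from the hints instead of hardcoding a cache-line size.
fn prefetch_distance_bytes() -> usize {
    let h = microarch_hints();
    h.cache_line_size * h.prefetch_streams * 2
}
```

Because the hints are plain data rather than patched instructions, they compose with `code_alternative!` and AlgoDispatch without adding any dispatch cost of their own.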
Two packaging models for feature variants:
Model A — Inline candidates (default): All variants are compiled into the
same module (Evolvable or an on-demand module). The compiler includes all
candidate function bodies; AlgoDispatch selects one at boot; the others are
dead code in memory (a few KB each — acceptable for a kernel image). All
code_alternative! sites are also inline — the default instruction and all
alternatives are present in the binary, and patching overwrites the default.
Evolvable ELF contains:
sha256_sha_ni() ← selected on x86 with SHA-NI
sha256_avx2_4way() ← selected on x86 with AVX2 but no SHA-NI
sha256_neon_4way() ← selected on AArch64
sha256_generic() ← fallback on any arch
AlgoDispatch<SHA256> → points to one of the above after boot phase 9
code_alternative! sites patched:
retpoline → eIBRS (if !SPECTRE_V2 or eIBRS available)
CPUID → SERIALIZE (if Alder Lake+)
DC ZVA hint adjusted for actual ZVA block size
This is the default because it is simple, has zero loading overhead, and the dead-code cost is negligible (~100 KB total across all algorithms for all architectures — less than 0.01% of a typical kernel image).
Model B — Split variant modules (optional, for large implementations):
For implementations where the variant code is large (>64 KB per variant) or
requires arch-specific assembly files, each variant is a separate module.
AlgoDispatch still selects at boot, but the selection triggers demand-loading
of the chosen variant module:
Evolvable ELF contains:
sha256_generic() ← always present (fallback)
AlgoDispatch<SHA256> → initially points to generic
Initramfs module store:
umka-crypto-sha256-shani.uko ← loaded on x86 + SHA-NI
umka-crypto-sha256-avx2.uko ← loaded on x86 + AVX2
umka-crypto-sha256-ce.uko ← loaded on AArch64 + SHA2-CE
Boot phase 9:
algo_dispatch_init_all() checks CpuFeatureTable:
if SHA-NI present → schedule_module_load("umka-crypto-sha256-shani")
→ module registers sha256_sha_ni candidate
→ AlgoDispatch<SHA256> updated to sha256_sha_ni
else if AVX2 → load avx2 module, update dispatch
else → keep generic (no module load)
Model B is used only when explicitly justified by code size. The decision is per-algorithm, not per-subsystem. Most algorithms use Model A.
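The Model B upgrade path differs from Model A in one way: the dispatch starts at the always-present generic fallback and is only upgraded once the demand-loaded variant module's init() runs. A minimal sketch (function and module names are illustrative stand-ins; the real module loader is described in Section 12.7):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

type Sha256Fn = fn(&[u8]) -> [u8; 32];

fn sha256_generic(_d: &[u8]) -> [u8; 32] { [0; 32] } // always linked into the Evolvable
fn sha256_sha_ni(_d: &[u8]) -> [u8; 32] { [1; 32] }  // would live in the .uko variant module

struct Dispatch {
    f: AtomicUsize,
}

impl Dispatch {
    fn new(fallback: Sha256Fn) -> Self {
        Dispatch { f: AtomicUsize::new(fallback as usize) }
    }

    // Called from the variant module's init() once it is demand-loaded.
    fn register(&self, better: Sha256Fn) {
        self.f.store(better as usize, Ordering::Release);
    }

    fn call(&self, d: &[u8]) -> [u8; 32] {
        let f: Sha256Fn = unsafe { std::mem::transmute(self.f.load(Ordering::Acquire)) };
        f(d)
    }
}

// Boot phase 9 (sketch): the feature check decides whether a module load is even
// scheduled. Until the module's init() runs, callers transparently get the fallback.
fn algo_dispatch_init(dispatch: &Dispatch, has_sha_ni: bool) {
    if has_sha_ni {
        // schedule_module_load("umka-crypto-sha256-shani") would demand-load the
        // module here; its init() then calls dispatch.register(sha256_sha_ni).
        dispatch.register(sha256_sha_ni);
    }
    // else: keep generic; no module load at all.
}
```

Because callers only ever see the atomic pointer, the window between boot and module init is safe: requests during that window simply run the generic implementation.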
Complete CPU-Feature-Dependent Subsystem Catalog:
The following table enumerates every kernel subsystem that benefits from CPU-specific hardware acceleration. For each, it lists the available variants, their hardware requirements, and the dispatch mechanism.
See Section 3.10
for the full catalog with AlgoDispatch declarations.
| Subsystem | Algorithm | Variants (priority order) | Dispatch |
|---|---|---|---|
| Crypto: block cipher | AES-GCM | VAES+VPCLMUL (x86 AVX-512) → AES-NI+CLMUL (x86) → CE+PMULL (AArch64) → Zkne+Zkg (RISC-V) → vcipher (PPC64) → generic | AlgoDispatch |
| Crypto: block cipher | AES-XTS | AES-NI (x86) → CE (AArch64) → Zkne (RISC-V) → generic | AlgoDispatch |
| Crypto: block cipher | ChaCha20-Poly1305 | AVX-512 (x86) → AVX2 (x86) → NEON (AArch64) → generic | AlgoDispatch |
| Crypto: hash | SHA-256 | SHA-NI (x86) → AVX2 4-way (x86) → SHA2-CE (AArch64) → NEON 4-way (AArch64) → Zknh (RISC-V) → generic | AlgoDispatch |
| Crypto: hash | SHA-512 | AVX2 (x86) → SHA512-CE (AArch64) → generic | AlgoDispatch |
| Crypto: hash | SHA-3 / SHAKE | AVX2 Keccak-4x (x86) → CE (AArch64) → generic | AlgoDispatch |
| Crypto: hash | SM3 | AVX2+AES-NI (x86) → SM3-CE (AArch64) → Zksh (RISC-V) → generic | AlgoDispatch |
| Crypto: cipher | SM4 | AES-NI (x86, using AES affine transform) → SM4-CE (AArch64) → Zksed (RISC-V) → generic | AlgoDispatch |
| Crypto: PQC | ML-KEM-768 | AVX2 (x86, NTT vectorized) → NEON (AArch64) → generic | AlgoDispatch |
| Crypto: PQC | ML-DSA-65 | AVX2 (x86) → NEON (AArch64) → generic | AlgoDispatch |
| Crypto: MAC | GHASH | PCLMULQDQ (x86) → PMULL (AArch64) → Zkg (RISC-V) → generic | AlgoDispatch |
| Crypto: MAC | Poly1305 | AVX2 (x86) → NEON (AArch64) → generic | AlgoDispatch |
| Checksum | CRC32C | SSE4.2 crc32 (x86) → CRC32 instr (AArch64) → Zbc (RISC-V) → generic | AlgoDispatch |
| Checksum | xxHash64 | AVX2 (x86) → NEON (AArch64) → generic | AlgoDispatch |
| Checksum | Adler32 | SSSE3 (x86) → NEON (AArch64) → generic | AlgoDispatch |
| Compression | zstd compress | AVX2 (x86, match finder) → NEON (AArch64) → generic | AlgoDispatch |
| Compression | zstd decompress | BMI2 (x86, bit extraction) → generic | AlgoDispatch |
| Compression | LZ4 compress/decompress | SSE2/AVX2 (x86, sequence matching) → NEON (AArch64) → generic | AlgoDispatch |
| Compression | zlib/deflate | SSE4.2+PCLMULQDQ (x86, CRC+match) → CRC32+PMULL (AArch64) → generic | AlgoDispatch |
| Memory ops | memcpy (kernel) | ERMS+FSRM rep movsb (x86) → AVX2 (x86, no FSRM) → NEON ldp/stp (AArch64) → generic | AlgoDispatch |
| Memory ops | memset / page zeroing | ERMS rep stosb (x86) → AVX2 (x86) → NEON stp xzr (AArch64) → DC ZVA (AArch64) → generic | AlgoDispatch |
| Memory ops | memcmp | SSE4.2 PCMPISTRI (x86) → NEON (AArch64) → generic | AlgoDispatch |
| RAID parity | XOR (RAID5) | AVX-512 (x86) → AVX2 (x86) → NEON (AArch64) → SVE (AArch64) → RVV (RISC-V) → generic | AlgoDispatch |
| RAID parity | P+Q (RAID6) | AVX-512 (x86) → AVX2 (x86) → NEON (AArch64) → generic | AlgoDispatch |
| Networking | TCP/UDP/IP checksum | AVX2 (x86) → SSE2 (x86) → NEON (AArch64) → generic | AlgoDispatch |
| Networking | RSS Toeplitz hash | PCLMULQDQ (x86) → PMULL (AArch64) → generic | AlgoDispatch |
| Isolation | Domain switch | WRPKRU (x86 MPK) → POR_EL0 (AArch64 POE) → DACR (ARMv7) → mtsr (PPC32) → page table (RISC-V/fallback) | arch::current::isolation (compile-time, not AlgoDispatch) |
Isolation dispatch is NOT AlgoDispatch. Isolation mechanisms are
selected at compile time (per target triple) via arch::current::isolation,
not at boot time. A kernel compiled for x86-64 always uses MPK; one compiled
for AArch64 uses POE if available (runtime feature check during Tier 1 init),
falling back to page-table isolation. This is because isolation affects the
entire driver model, not a single algorithm callsite.
AlgoDispatch for on-demand modules: When an on-demand subsystem module
loads after boot (e.g., umka-kvm, umka-zfs), its internal algo_dispatch!
statics are initialized during the module's init() function, using the
same CpuFeatureTable (already frozen). The dispatch result is identical
to what it would have been at boot — the universal intersection doesn't
change after cpu_features_freeze().
2.21.7 Loading Architecture Summary¶
┌─────────────────────────────────────────────────────────────────────┐
│ FIRMWARE / BOOTLOADER │
│ UEFI / GRUB / OpenSBI / U-Boot │
│ Loads kernel ELF, passes memory map + DTB/ACPI │
└────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ NUCLEUS: Verified Nucleus (~25-35 KB) │
│ ─────────────────────────────────────── │
│ Boot entry (asm) │ PageArray/Buddy data │ PTE hw ops │ Capability │
│ KABI trampoline │ Syscall entry │ Evolution primitive │ DomainRing │
│ AlgoDispatch infra │ CpuFeatureTable │ Early serial │
│ │
│ VERIFIED. NON-REPLACEABLE. FROZEN. │
│ │
│ Phase 4a: Load + verify Evolvable ─────────────────────────────────┐ │
└─────────────────────────────────────────────────────────────────┤───┘
│
▼ │
┌─────────────────────────────────────────────────────────────────────┐
│ EVOLVABLE: Boot Monolith (swappable, ~2-4 MB) │
│ ───────────────────────────────────────── │
│ Evolution orch. │ Tier 0 platform │ Scheduler │ Memory policy │ │
│ Slab │ IPC dispatch │ Workqueue │ ACPI/DTB │ SysAPI layer │ VFS │
│ Block │ Networking │ Cgroup/NS │ Security │ Crypto │ FMA │ Algo │
│ │
│ SWAPPABLE (whole or per-subsystem via EvolvableComponent) │
│ │
│ Phases 5-15: Init subsystems → mount initramfs → load drivers ─┐ │
└─────────────────────────────────────────────────────────────────┤───┘
│
┌────────────────┬───────────────┬───────────────────┘
▼ ▼ ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────────────────┐
│ DEVICE DRIVERS │ │ ON-DEMAND KABI │ │ CPU-FEATURE VARIANT MODULES│
│ │ │ SERVICES │ │ (Model B only) │
│ Tier 1 / Tier 2│ │ │ │ │
│ NVMe, AHCI, │ │ umka-kvm │ │ umka-crypto-aes-avx512 │
│ ixgbe, mlx5, │ │ umka-drm │ │ umka-crypto-sha-shani │
│ i915, xHCI, │ │ umka-audio │ │ umka-raid-avx2 │
│ virtio-*, │ │ umka-bluetooth │ │ (loaded by AlgoDispatch │
│ evdev, HDA, │ │ umka-wifi │ │ at boot phase 9 based on │
│ usb-hid, ... │ │ umka-nfs │ │ CpuFeatureTable) │
│ │ │ umka-xfs │ │ │
│ From initramfs │ │ umka-zfs │ │ From initramfs module store│
│ or root FS │ │ umka-dsm │ │ │
│ │ │ umka-mlpolicy │ │ │
│ SWAPPABLE │ │ │ │ SWAPPABLE │
│ (Tier promote/ │ │ SWAPPABLE │ │ (AlgoDispatch re-init on │
│ demote, crash │ │ (loaded on │ │ evolution with new │
│ recovery) │ │ first request)│ │ candidates) │
└────────────────┘ └────────────────┘ └────────────────────────────┘
2.21.8 Cross-references¶
- Three-tier protection model: Section 11.1
- Live kernel evolution: Section 13.18
- Data format evolution: Section 13.18
- KABI module system: Section 12.1
- Demand loading: Section 12.7
- AlgoDispatch and SIMD: Section 3.10
- CPU feature detection: Section 2.16
- Trust anchor (ML-DSA signing): Section 9.3
- Formal verification: Section 24.4
2.22 First-Class Architectures¶
UmkaOS targets eight architectures as first-class citizens. All eight receive equal design consideration, CI testing, and performance optimization.
| Architecture | Status | Isolation mechanism | Notes |
|---|---|---|---|
| x86-64 | Primary dev target | Intel MPK (WRPKRU) | Most mature, widest hardware |
| aarch64 | First-class, day one | POE (ARMv8.9+) / page-table fallback | ARM servers, Apple Silicon (VM) |
| armv7 | First-class, day one | DACR memory domains | Embedded, IoT, Raspberry Pi |
| riscv64 | First-class, day one | Page-table based | Emerging server/embedded platform |
| ppc32 | First-class, day one | Segment registers / page-table based | Embedded PowerPC, AmigaOne, networking appliances |
| ppc64le | First-class, day one | HPT / Radix MMU / page-table based | POWER servers, IBM POWER8/9/10, Raptor Talos II |
| s390x | First-class, day one | Storage Keys (Tier 1 unavailable) | IBM z systems, channel I/O, big-endian |
| loongarch64 | First-class, day one | Page-table based (Tier 1 unavailable) | Loongson 3A5000/6000, China ecosystem |
Table coverage: s390x and LoongArch64 are present in all major per-architecture tables (performance budget, memory model, syscall register mapping, interrupt controller, clock framework, boot sequence). Some smaller specialized tables (SIMD dispatch, signal frame layout) are being progressively updated.
2.22.1 Architecture-Specific Code¶
Architecture-specific code is isolated under arch/ and umka-core/src/arch/:
- Boot code: Rust and assembly, per-architecture
- Syscall entry/exit: Assembly stubs
- Context switch: Assembly (register save/restore)
- Interrupt dispatch: Assembly stubs into Rust handlers
- vDSO: Per-architecture user-accessible pages (see Section 2.22)
- MPK / isolation primitives: Abstracted behind a common IsolationDomain trait
2.22.1.1 vDSO (Virtual Dynamic Shared Object)¶
The vDSO is a small ELF shared library that the kernel maps into every user process's address space at process creation. It provides fast userspace implementations of a small set of syscalls that can be answered without entering the kernel — specifically time-related syscalls — by reading kernel-maintained data from a shared page (the VVAR page).
Why the vDSO matters for performance: clock_gettime(CLOCK_MONOTONIC) is
called millions of times per second in high-performance workloads (databases, gRPC,
event loops). A kernel entry costs 100-300 ns on x86-64 with KPTI. The vDSO path
costs ~5-20 ns — a 10-30x speedup. UmkaOS implements the Linux-compatible vDSO ABI
so that existing glibc, musl, and uclibc-ng builds use the fast path automatically,
with no changes to userspace.
Virtual address layout per process (x86-64 example, above stack, ASLR-randomized):
high address
[vdso ELF] 1-4 pages, PROT_READ|PROT_EXEC — contains function code
[vvar page] 1 page (4 KB), PROT_READ — kernel writes, userspace reads
low address
The VVAR page is mapped immediately below the vDSO ELF. Its address is derived by
the vDSO code using a fixed negative offset from the vDSO load address (computed
by the linker script). The kernel communicates the VVAR page address to userspace
via the ELF auxiliary vector (AT_SYSINFO_EHDR points to the vDSO ELF base).
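The auxv lookup that userspace performs can be sketched as a walk over (a_type, a_val) pairs. This is an illustrative helper over an in-memory slice, not the libc API; glibc performs the equivalent walk over the startup stack, and the same pairs are visible in /proc/self/auxv on Linux:

```rust
/// Sketch: locating the vDSO base in the auxiliary vector. The auxv is an
/// array of (a_type, a_val) u64 pairs terminated by AT_NULL.
const AT_NULL: u64 = 0;
const AT_SYSINFO_EHDR: u64 = 33; // Linux-compatible type value: vDSO ELF base address

fn lookup_auxv(auxv: &[(u64, u64)], wanted: u64) -> Option<u64> {
    auxv.iter()
        .take_while(|&&(a_type, _)| a_type != AT_NULL) // stop at the terminator
        .find(|&&(a_type, _)| a_type == wanted)
        .map(|&(_, a_val)| a_val)
}
```

From the AT_SYSINFO_EHDR value, the vDSO code derives the VVAR page address by subtracting the fixed offset baked in by the linker script.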
VVAR Page Layout:
/// Kernel-maintained data page shared with userspace for vDSO fast paths.
/// The kernel writes this page using a seqlock protocol; the vDSO reads it
/// without kernel entry.
///
/// This page is mapped read-only into every user process. The kernel maps it
/// read-write in kernel virtual address space only.
///
/// The layout is fixed ABI: the vDSO ELF references fields at fixed offsets.
/// Adding new fields must not change existing field offsets.
// kernel-internal, not KABI
#[repr(C, align(4096))]
pub struct VvarPage {
/// Seqlock sequence counter.
/// Invariant: odd = kernel write in progress; even = data is stable.
/// The vDSO reads this before and after reading data fields; if the
/// value changes or is odd, it retries from the beginning.
pub seq: AtomicU32,
pub _pad_seq: u32,
/// CLOCK_REALTIME: seconds since Unix epoch (TAI - leap seconds).
pub clock_realtime_sec: u64,
/// CLOCK_REALTIME: nanoseconds within the current second (0..999_999_999).
pub clock_realtime_nsec: u32,
pub _pad_rt: u32,
/// CLOCK_MONOTONIC: nanoseconds since kernel boot (never steps backward).
/// **Design divergence from Linux**: Linux splits CLOCK_MONOTONIC into
/// `sec: u64` + `nsec: u32`. UmkaOS stores BOTH: the single `u64`
/// nanosecond counter (for 64-bit arch vDSO fast path) AND the sec+nsec
/// split (for 32-bit arch vDSO which lacks hardware u64 division).
/// u64 nanoseconds overflows after ~584 years — exceeds the 50-year uptime
/// target by 11x. The kernel writes all three fields atomically under the
/// seqlock. 64-bit vDSO uses `clock_monotonic_ns` directly (single load,
/// no division). 32-bit vDSO (ARMv7, PPC32) uses `clock_monotonic_sec` +
/// `clock_monotonic_nsec` (two loads, no division — avoids the ~100-200
/// cycle `__aeabi_uldivmod` software division on ARMv7).
pub clock_monotonic_ns: u64,
/// CLOCK_MONOTONIC: seconds since boot (for 32-bit arch vDSO).
pub clock_monotonic_sec: u64,
/// CLOCK_MONOTONIC: nanoseconds within current second (0..999_999_999).
pub clock_monotonic_nsec: u32,
pub _pad_mono: u32,
/// TSC-to-nanoseconds conversion multiplier.
/// Formula: ns_delta = (tsc_delta * tsc_to_ns_mul) >> tsc_to_ns_shift
/// Valid only when the hardware TSC is stable (invariant TSC required).
/// Zero means TSC is not usable; fall back to a syscall.
pub tsc_to_ns_mul: u32,
/// TSC-to-nanoseconds conversion shift (see tsc_to_ns_mul).
pub tsc_to_ns_shift: u32,
/// TSC value at the time of the last VVAR update.
pub tsc_base: u64,
    /// Timezone offset in minutes west of UTC (matches `struct timezone.tz_minuteswest`).
pub tz_minuteswest: i32,
/// DST correction type (matches `struct timezone.tz_dsttime`; always 0 in practice).
pub tz_dsttime: i32,
/// Architecture-specific counter base for non-TSC paths.
/// x86-64: unused (TSC used directly).
/// AArch64: CNTVCT_EL0 value at last update.
/// RISC-V: `rdtime` value at last update.
pub arch_counter_base: u64,
/// Architecture-specific counter frequency (Hz).
/// x86-64: TSC frequency.
/// AArch64: CNTFRQ_EL0.
/// RISC-V: timer-frequency from Device Tree.
pub arch_counter_freq_hz: u64,
/// CLOCK_TAI offset: TAI = REALTIME + tai_offset.
/// tai_offset is the current number of leap seconds (typically 37 as of 2024).
/// Updated when the kernel receives a leap second notification (NTP/PTP).
/// The vDSO computes CLOCK_TAI as:
/// tai_sec = clock_realtime_sec + clock_tai_offset_sec
/// tai_nsec = clock_realtime_nsec (sub-second identical to REALTIME)
pub clock_tai_offset_sec: i64,
/// CLOCK_BOOTTIME: seconds since boot including suspend time.
/// Difference from CLOCK_MONOTONIC: CLOCK_BOOTTIME advances during
/// system suspend; CLOCK_MONOTONIC does not.
pub clock_boottime_sec: u64,
/// CLOCK_BOOTTIME: nanoseconds within the current second (0..999_999_999).
pub clock_boottime_nsec: u32,
pub _pad_bt: u32,
/// CLOCK_MONOTONIC_RAW: nanoseconds since boot at the nominal hardware
/// rate, immune to NTP frequency adjustments.
/// Difference from CLOCK_MONOTONIC: CLOCK_MONOTONIC uses an NTP-adjusted
/// multiplier; CLOCK_MONOTONIC_RAW uses the nominal clocksource multiplier.
/// Same dual-representation as CLOCK_MONOTONIC: the single u64 nanosecond
/// counter for 64-bit vDSO, plus sec+nsec for 32-bit vDSO.
pub clock_monotonic_raw_ns: u64,
/// CLOCK_MONOTONIC_RAW: seconds since boot (for 32-bit arch vDSO).
pub clock_monotonic_raw_sec: u64,
/// CLOCK_MONOTONIC_RAW: nanoseconds within current second (0..999_999_999).
pub clock_monotonic_raw_nsec: u32,
pub _pad_raw: u32,
/// Nominal (non-NTP-adjusted) clocksource multiplier for CLOCK_MONOTONIC_RAW.
/// Formula: raw_ns_delta = (counter_delta * raw_mult) >> raw_shift
/// Often identical to tsc_to_ns_mul but diverges when NTP applies
/// frequency correction.
pub raw_mult: u32,
/// Nominal clocksource shift for CLOCK_MONOTONIC_RAW (see raw_mult).
pub raw_shift: u32,
/// Bitmask of clocks that must use syscall fallback instead of the vDSO
/// fast path. Bit N set means clock ID N is disabled in the vDSO (the
/// kernel has not populated its VVAR fields, or the clocksource does not
/// support vDSO-level precision). The vDSO checks
/// `clock_disabled_mask.load(Relaxed) & (1 << clock_id) != 0` before
/// using the fast path; on match it falls back to `syscall(clock_gettime)`.
///
/// u32 provides 32 clock IDs. Linux defines 16 clock IDs after 20+ years;
/// u32 provides ample headroom. Upgrade to AtomicU64 if >32 ever needed
/// (isolated change: one field, one vDSO check).
pub clock_disabled_mask: AtomicU32,
pub _pad: [u8; 3948], // Explicit padding: 148 bytes of fields above + 3948 = 4096 exactly.
// Field accounting: seq(4)+pad_seq(4)+rt_sec(8)+rt_nsec(4)+pad_rt(4)+mono_ns(8)+mono_sec(8)
// +mono_nsec(4)+pad_mono(4)+tsc_mul(4)+tsc_shift(4)+tsc_base(8)+tz_min(4)+tz_dst(4)
// +arch_base(8)+arch_freq(8)+tai_offset(8)+bt_sec(8)+bt_nsec(4)+pad_bt(4)
// +raw_ns(8)+raw_sec(8)+raw_nsec(4)+pad_raw(4)+raw_mul(4)+raw_shift(4)+clk_mask(4) = 148 bytes.
// (Do not rely on implicit tail padding from align(4096) — explicit is safer as fields grow.)
//
// CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE use the same base
// fields (clock_realtime_sec/nsec, clock_monotonic_sec/nsec) without adding
// the (counter_delta * mult) >> shift interpolation. No separate fields needed.
}
const_assert!(size_of::<VvarPage>() == 4096);
// Per-field offset assertions: self-documenting for vDSO C/asm implementers.
// Pattern required for all kernel/userspace shared structs (see Decision 3 of
// spec review debate 2026-04-02).
const_assert!(core::mem::offset_of!(VvarPage, seq) == 0);
const_assert!(core::mem::offset_of!(VvarPage, _pad_seq) == 4);
const_assert!(core::mem::offset_of!(VvarPage, clock_realtime_sec) == 8);
const_assert!(core::mem::offset_of!(VvarPage, clock_realtime_nsec) == 16);
const_assert!(core::mem::offset_of!(VvarPage, _pad_rt) == 20);
const_assert!(core::mem::offset_of!(VvarPage, clock_monotonic_ns) == 24);
const_assert!(core::mem::offset_of!(VvarPage, clock_monotonic_sec) == 32);
const_assert!(core::mem::offset_of!(VvarPage, clock_monotonic_nsec) == 40);
const_assert!(core::mem::offset_of!(VvarPage, _pad_mono) == 44);
const_assert!(core::mem::offset_of!(VvarPage, tsc_to_ns_mul) == 48);
const_assert!(core::mem::offset_of!(VvarPage, tsc_to_ns_shift) == 52);
const_assert!(core::mem::offset_of!(VvarPage, tsc_base) == 56);
const_assert!(core::mem::offset_of!(VvarPage, tz_minuteswest) == 64);
const_assert!(core::mem::offset_of!(VvarPage, tz_dsttime) == 68);
const_assert!(core::mem::offset_of!(VvarPage, arch_counter_base) == 72);
const_assert!(core::mem::offset_of!(VvarPage, arch_counter_freq_hz) == 80);
const_assert!(core::mem::offset_of!(VvarPage, clock_tai_offset_sec) == 88);
const_assert!(core::mem::offset_of!(VvarPage, clock_boottime_sec) == 96);
const_assert!(core::mem::offset_of!(VvarPage, clock_boottime_nsec) == 104);
const_assert!(core::mem::offset_of!(VvarPage, _pad_bt) == 108);
const_assert!(core::mem::offset_of!(VvarPage, clock_monotonic_raw_ns) == 112);
const_assert!(core::mem::offset_of!(VvarPage, clock_monotonic_raw_sec) == 120);
const_assert!(core::mem::offset_of!(VvarPage, clock_monotonic_raw_nsec) == 128);
const_assert!(core::mem::offset_of!(VvarPage, _pad_raw) == 132);
const_assert!(core::mem::offset_of!(VvarPage, raw_mult) == 136);
const_assert!(core::mem::offset_of!(VvarPage, raw_shift) == 140);
const_assert!(core::mem::offset_of!(VvarPage, clock_disabled_mask) == 144);
const_assert!(core::mem::offset_of!(VvarPage, _pad) == 148);
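The dual CLOCK_MONOTONIC representation exists for the read-path asymmetry described in the field comments above. A minimal sketch of the two paths, with field values passed as plain arguments rather than read through the seqlock:

```rust
/// 64-bit vDSO path: a single clock_monotonic_ns load; splitting into a
/// timespec uses the native 64-bit divide, which is cheap on 64-bit cores.
fn mono_timespec_64(clock_monotonic_ns: u64) -> (u64, u32) {
    (
        clock_monotonic_ns / 1_000_000_000,
        (clock_monotonic_ns % 1_000_000_000) as u32,
    )
}

/// 32-bit vDSO path (ARMv7, PPC32): two loads and zero u64 division, avoiding
/// the ~100-200 cycle __aeabi_uldivmod software routine on ARMv7.
fn mono_timespec_32(clock_monotonic_sec: u64, clock_monotonic_nsec: u32) -> (u64, u32) {
    (clock_monotonic_sec, clock_monotonic_nsec)
}
```

Both paths yield the same timespec because the kernel writes all three monotonic fields atomically under the seqlock.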
C ABI compatibility: AtomicU32 and AtomicU64 have the same size and alignment
as u32 and u64 respectively (#[repr(C)] guarantees deterministic layout). The
vDSO (compiled as C/asm) accesses these fields as volatile loads with explicit
architecture-specific memory barriers (smp_rmb() on ARM/RISC-V, implicit on x86
TSO). The Rust atomic types in this struct definition are the kernel-side API; the
vDSO does not use Rust atomics. The const_assert! size check and per-field
offset_of! assertions above provide machine-verified C-ABI documentation for
vDSO implementers.
Exported Symbols (Linux-compatible ABI):
The vDSO ELF exports the following symbols with STV_DEFAULT visibility. These
match the Linux x86-64 vDSO symbol names exactly so that glibc and other libc
implementations find them without modification:
| Symbol | Signature | Supported clocks |
|---|---|---|
| __vdso_clock_gettime | (clockid_t clk_id, struct timespec *tp) -> int | CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_BOOTTIME, CLOCK_TAI, CLOCK_REALTIME_COARSE, CLOCK_MONOTONIC_COARSE |
| __vdso_gettimeofday | (struct timeval *tv, struct timezone *tz) -> int | All (derives from clock_realtime) |
| __vdso_time | (time_t *tloc) -> time_t | Derives from clock_realtime_sec |
| __vdso_clock_getres | (clockid_t clk_id, struct timespec *res) -> int | Returns resolution for supported clocks |
| __vdso_getcpu | (unsigned *cpu, unsigned *node) -> int | Returns VcpuPage::cpu_id and numa_node (see below) |
On AArch64, the equivalent symbols use the same names but read CNTVCT_EL0
(virtual counter) instead of RDTSC. On RISC-V, rdtime is used. The VVAR
arch_counter_base and arch_counter_freq_hz fields supply the base and
frequency needed for the conversion.
CLOCK_TAI implementation: CLOCK_TAI (International Atomic Time) is computed
as clock_realtime_sec + clock_tai_offset_sec with clock_realtime_nsec as the
sub-second component. The clock_tai_offset_sec field is typically 37 (as of 2024)
and changes only on leap second events. This matches Linux's vDSO CLOCK_TAI support
(verified: include/vdso/datapage.h VDSO_HRES includes BIT(CLOCK_TAI)).
Applications using CLOCK_TAI (PTP, financial trading, TSN networking) get the fast
vDSO path instead of falling back to a slow syscall.
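The TAI derivation above is a single add on the seconds field. A sketch with the VVAR fields passed as arguments (values in the test are illustrative):

```rust
/// Sketch of the vDSO CLOCK_TAI computation: seconds shift by the leap-second
/// offset; the sub-second component is identical to CLOCK_REALTIME.
fn vdso_clock_tai(realtime_sec: u64, realtime_nsec: u32, tai_offset_sec: i64) -> (u64, u32) {
    ((realtime_sec as i64 + tai_offset_sec) as u64, realtime_nsec)
}
```

Because the offset only changes on leap second events, the field is effectively constant between NTP/PTP notifications and the fast path never needs a syscall.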
__vdso_getcpu implementation: getcpu() reads CPU identity from the
per-CPU VcpuPage (see below), NOT from the global VvarPage. Per-arch fast paths:
| Architecture | getcpu() mechanism | Latency |
|---|---|---|
| x86-64 | RDPID instruction (never touches memory) | ~1 cycle |
| AArch64 | MRS x0, TPIDRRO_EL0 (kernel writes CPU ID at context switch) | ~1 cycle |
| ARMv7 | MRC p15, 0, r0, c13, c0, 3 (TPIDRURO) | ~1 cycle |
| RISC-V/PPC32/s390x/LoongArch64 | Read from VcpuPage at fixed offset (page hot in TLB from getrandom()) | ~3 cycles |
For clock IDs that the vDSO does not handle (e.g., CLOCK_PROCESS_CPUTIME_ID,
CLOCK_THREAD_CPUTIME_ID), the vDSO falls back to a real syscall via the
syscall instruction (x86-64) or SVC / ecall (AArch64, RISC-V).
Coarse clock implementation: CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE
use the same base fields as CLOCK_REALTIME and CLOCK_MONOTONIC respectively,
but the vDSO implementation skips the TSC/counter interpolation step. The base
fields are updated at tick frequency (~1-10 ms), which provides coarse resolution.
No dedicated fields needed: coarse = base value without counter delta computation.
Seqlock Update Protocol:
Kernel (called on each timer tick or TSC calibration update — single writer):
1. let s = VvarPage::seq.load(Relaxed); // read current value (single writer, no RMW needed)
VvarPage::seq.store(s + 1, Relaxed); fence(Release); // seq becomes odd: write in progress; the release fence keeps the data writes below ordered after the odd store
2. write clock_realtime_sec, clock_realtime_nsec, clock_monotonic_ns
3. write tsc_base, tsc_to_ns_mul, tsc_to_ns_shift (if TSC calibration changed)
4. write arch_counter_base (architecture-specific counter snapshot)
5. write clock_tai_offset_sec (if leap second changed)
6. let s = VvarPage::seq.load(Relaxed);
VvarPage::seq.store(s + 1, Release); // seq becomes even: write complete
// Note: plain store+Release instead of fetch_add (LOCK XADD on x86) because
// the timer tick is the sole writer — no atomic RMW is needed.
vDSO userspace (pseudocode for __vdso_clock_gettime):
loop:
seq1 = load(VvarPage::seq, Acquire)
if seq1 & 1 != 0: continue // write in progress, retry
tsc_now = RDTSC (or arch counter)
ns = clock_monotonic_ns + (((tsc_now - tsc_base) * tsc_to_ns_mul) >> tsc_to_ns_shift)
seq2 = load(VvarPage::seq, Acquire)
if seq2 != seq1: continue // update raced, retry
return ns
The retry loop is expected to execute zero times in practice: timer tick updates are infrequent (1–10 ms intervals) and short (< 1 μs). The loop exists only for correctness on the rare overlap.
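The protocol above, written as compilable Rust. This is an in-process sketch: the real writer runs in kernel context and the reader in the vDSO, and the data set is reduced to a single field:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicU32, Ordering};

/// Minimal in-process sketch of the VVAR seqlock (single writer, many readers).
struct Vvar {
    seq: AtomicU32,
    mono_ns: UnsafeCell<u64>,
}
// Safe to share across threads: readers tolerate torn data via the retry loop.
unsafe impl Sync for Vvar {}

fn kernel_update(v: &Vvar, ns: u64) {
    let s = v.seq.load(Ordering::Relaxed); // single writer: no atomic RMW needed
    v.seq.store(s + 1, Ordering::Relaxed); // odd: write in progress
    fence(Ordering::Release); // keep the odd store ahead of the data writes
    unsafe { *v.mono_ns.get() = ns };
    v.seq.store(s + 2, Ordering::Release); // even: write complete
}

fn vdso_read(v: &Vvar) -> u64 {
    loop {
        let s1 = v.seq.load(Ordering::Acquire);
        if s1 & 1 != 0 {
            continue; // write in progress, retry
        }
        let ns = unsafe { *v.mono_ns.get() };
        fence(Ordering::Acquire); // data reads complete before the re-check
        if v.seq.load(Ordering::Relaxed) == s1 {
            return ns; // no update raced with us
        }
    }
}
```

The relaxed-store-plus-fence shape on the writer side mirrors common userspace seqlock implementations (e.g. crossbeam's); on x86's TSO model the fences compile to nothing, while on ARM and RISC-V they emit the barriers the vDSO pseudocode relies on.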
ELF Build Requirements:
The vDSO ELF is built as a position-independent shared library with no dynamic linker dependencies:
- Compiled with -fPIC -fno-plt -nostdlib -Wl,-shared.
- No external symbol references (self-contained; no libc, no PLT stubs).
- Linked with a custom linker script that produces exactly two PT_LOAD segments (one RX for code, one R for read-only data) plus PT_DYNAMIC and PT_GNU_EH_FRAME.
- Stripped of debugging sections; .eh_frame retained for unwinding (stack traces in userspace debuggers work through the vDSO).
- The vDSO ELF is embedded in the kernel image as a byte array in .rodata. At process creation (exec), the kernel copies it into a freshly allocated page and maps it with PROT_READ | PROT_EXEC.
Architecture-Specific Notes:
| Architecture | Counter instruction | Notes |
|---|---|---|
| x86-64 | RDTSC | Requires invariant TSC (CPUID leaf 0x80000007 bit 8). Non-invariant TSC (laptops with deep C-states on pre-Nehalem) falls back to syscall. |
| AArch64 | MRS x0, CNTVCT_EL0 | Virtual counter. CNTFRQ_EL0 gives the frequency. Always available on ARMv8+ in EL0. |
| ARMv7 | MRC p15, 0, r0, c14, c3, 2 (CNTVCT) | Available on Cortex-A7/A15 with generic timer. Falls back to syscall if not available. |
| RISC-V 64 | rdtime pseudo-instruction | Frequency from Device Tree /cpus/timebase-frequency. |
| PPC32 | mftb (Timebase lower) | PPC32 vDSO is limited; gettimeofday uses syscall fallback on embedded targets without an invariant timebase. |
| PPC64LE | mftb | Timebase register, frequency from Device Tree or OPAL. |
| s390x | STCK / STCKE | TOD clock, epoch 1900-01-01. Always available in problem state. Frequency = 2^12 ticks/µs (architectural constant). |
| LoongArch64 | RDTIME.D | Stable Counter. Frequency from CPUCFG or DTB. Available at PLV3 (user mode). |
Per-architecture vDSO placement in arch/:
umka-kernel/src/arch/x86_64/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/aarch64/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/armv7/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/riscv64/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/ppc32/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/ppc64le/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/s390x/vdso/ — vdso.S, vdso.ld, vvar.rs
umka-kernel/src/arch/loongarch64/vdso/ — vdso.S, vdso.ld, vvar.rs
The VvarPage struct definition is shared (umka-kernel/src/vvar.rs); only the
counter-reading instructions in vdso.S and the linker load addresses in vdso.ld
differ per architecture.
Per-architecture hardware abstraction equivalents:
| Concept | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE | s390x | LoongArch64 |
|---|---|---|---|---|---|---|---|---|
| Privilege separation | GDT (ring 0/3 segments) | Exception levels (EL0/EL1) | Processor modes (USR/SVC) | Privilege levels (U/S) | MSR PR bit (user/supervisor) | MSR PR bit (user/supervisor) | PSW problem-state bit (0=supervisor, 1=user) | PLV (PLV0=kernel, PLV3=user) |
| Exception dispatch | IDT (256 gate descriptors) | Exception vector table (VBAR_EL1, 16 entries × 4 vectors) | Vector table (VBAR, 8 entries) | Trap vector (stvec, single entry + scause dispatch) | Exception vector table (IVPR + IVORn) | System Reset + Machine Check vectors (LPCR) | PSW-swap: old PSW saved to lowcore, new PSW loaded. Six interrupt classes. | CSR.EENTRY base + (Ecode × VS). TLB refill has dedicated CSR.TLBRENTRY. |
| Interrupt controller | APIC (LAPIC + IOAPIC) | GIC v2/v3 (distributor + redistributor/CPU interface, detected at runtime) | GIC (distributor + CPU interface) | PLIC (+ CLINT for timer/IPI) | OpenPIC / MPIC | XICS / XIVE (POWER8/9/10) | None (architectural). ISC masking in CR6 for I/O. SIGP for external. | EIOINTC (256 vectors, per-CPU routing via IOCSR). LIOINTC for legacy. |
| Timer | APIC timer / HPET / TSC | Generic Timer (CNTPCT_EL0) | Generic Timer (CNTPCT) | SBI timer ecall / mtime | Decrementer (DEC SPR) | Decrementer (DEC SPR) / HDEC | CPU Timer (SPT/STPT) + Clock Comparator (SCKC/STCKC). TOD = 4096 ticks/µs. | Stable Counter (CSR.TCFG, auto-reload). RDTIME.D for reading. |
| Syscall mechanism | SYSCALL/SYSRET (MSRs) | SVC instruction (EL0→EL1) | SVC instruction (USR→SVC) | ecall instruction (U→S) | sc instruction (system call) | sc instruction / scv (POWER9+) | SVC instruction (8-bit immediate or R1 for >255) | SYSCALL instruction (PLV3→PLV0). ERTN to return. |
| Page table format | 4-level (PML4→PDPT→PD→PT) | 4-level (L0→L1→L2→L3) | 2-level (L1→L2, 1MB sections) | 4-level Sv48 | 2-level (PGD→PTE, 4 KB pages) | Radix tree (POWER9+) or HPT (hashed page table) | DAT: 3-5 level (Region-First through Page), 4 KB pages, 1 MB large pages. | 4-level (PGD→PUD→PMD→PTE), 4 KB/16 KB/64 KB pages. Software or hardware PTW. |
| Fast isolation | MPK (WRPKRU) | POE (POR_EL0) / MTE | DACR (16 domains) | Page-table based | Segment registers (16 segments) | Radix partition table / HPT LPAR | Storage Keys (4-bit per page, too coarse). Tier 1 → Tier 0. | None. Tier 1 → Tier 0. |
| TLB ID | PCID (12-bit, CR3) | ASID (8/16-bit, TTBR) | ASID (8-bit, CONTEXTIDR) | ASID (9-16 bit, satp) | PID (8-bit, via PID SPR) | PID/LPID (Radix: 20-bit PID, LPIDR) | ASCE (no explicit ASID — address space change implies TLB context) | ASID (10-bit, CSR.ASID). INVTLB for invalidation. |
Everything else — scheduling, memory management, capability system, driver model, syscall compatibility — is architecture-independent Rust code.
2.22.1.2 Extended VVAR Pages (UmkaOS-Specific)¶
Linux's vDSO exposes only timekeeping and CPU identity data because Linux has no structured kernel state that can be safely projected to userspace in compact, read-only form. UmkaOS's capability model and cgroup architecture enable additional VVAR-style pages that eliminate syscalls in high-frequency userspace hot paths.
All extended VVAR pages follow the same security model as the timekeeping VVAR:
the kernel maps them PROT_READ into the process address space, writes them
from kernel context under a seqlock protocol, and userspace reads without
entering the kernel. The pages contain no pointers to kernel memory — only
values and bitmasks.
Extended VVAR page addresses are communicated via the UmkaOS-specific
auxiliary vector entries (negative a_type values in the auxv, same
convention as UmkaOS native syscalls — Section 19.8).
Unmodified Linux applications ignore negative auxv types (glibc's getauxval()
treats them as unknown). UmkaOS-aware applications access them via libumka.
2.22.1.2.1 Capability Summary Page (per-process)¶
The capability summary page provides a fast, read-only snapshot of the process's current capability state. This enables userspace to perform "do I have permission X?" checks in ~5 ns (VVAR read) instead of ~200 ns (syscall round-trip), which matters for applications that make access-control decisions millions of times per second (capability-aware middleware, database query planners, container runtimes).
Why Linux cannot do this: Linux's security model is a heterogeneous
collection of UID checks, POSIX capability bits, LSM labels, seccomp filters,
and namespace membership — there is no compact, authoritative representation
of "what can this process do?" UmkaOS's PermissionBits bitflags and
SystemCaps bitfield provide exactly this.
/// Per-process capability summary page. Mapped into each process at execve()
/// and on-demand for UmkaOS-native applications.
///
/// The kernel updates this page on capability grant, revoke, restrict, and
/// delegate operations. Updates use the same seqlock protocol as the
/// timekeeping VVAR.
///
/// Security invariant: this page is advisory. It enables fast NEGATIVE checks
/// ("I definitely don't have this permission → skip the syscall") but MUST NOT
/// be used as a positive authorization. The kernel re-validates every capability
/// on every syscall — the summary page is a performance optimization, not a
/// security boundary.
// kernel-internal, not KABI
#[repr(C, align(4096))]
pub struct CapSummaryPage {
/// Seqlock sequence counter (same protocol as VvarPage::seq).
pub seq: AtomicU32,
pub _pad_seq: u32,
/// Monotonic generation counter. Incremented on every capability table
/// modification (grant, revoke, restrict, delegate, exec transform).
/// Userspace can cache a permission check result and invalidate when
/// this counter advances.
pub cap_generation: u64,
/// Union of SystemCaps across all capabilities held by this process.
/// If a bit is NOT set here, the process definitely does not hold that
/// system capability — no syscall needed. If a bit IS set, the process
/// MIGHT hold it (must confirm via syscall or CAP_QUERY native call).
pub syscaps_union: u64,
/// Union of PermissionBits across all capabilities held by this process.
/// Same semantics: clear bit = definitely no; set bit = maybe yes.
pub permbits_union: u32,
pub _pad_permbits: u32,
/// Number of active (non-revoked) capabilities in the process's CapTable.
pub active_cap_count: u32,
/// Number of delegated capabilities (capabilities this process has
/// delegated to other processes, still revocable).
pub delegated_cap_count: u32,
/// Bloom filter over active CapId values. Enables fast "do I hold a
/// capability for object X?" check without syscall. 256-bit (32-byte)
/// bloom filter with 2 hash functions (k=2). False positive rate ~1%
/// at 15 active caps, ~10% at 50 caps, ~29% at 100 caps.
///
/// Usage: hash(object_id) into 2 bit positions; if both set, MAYBE
/// (confirm via CAP_QUERY). If either clear, DEFINITELY NOT.
pub cap_bloom: [u64; 4],
/// AT_SECURE equivalent: non-zero if this process has elevated privileges
/// (capability grants applied at exec, or uid != euid). Mirrors the
/// auxv AT_SECURE value but is live-updated if credentials change
/// post-exec (e.g., via setuid emulation through capability grants).
pub secure_mode: u32,
pub _pad_tail: [u8; 4020],
// seq(4) + pad(4) + cap_generation(8) + syscaps_union(8) + permbits_union(4) + pad(4)
// + active_cap_count(4) + delegated_cap_count(4) + cap_bloom(32) + secure_mode(4)
// = 4+4+8+8+4+4+4+4+32+4 = 76 bytes
// Padding: 4096 - 76 = 4020 bytes
}
const _: () = assert!(core::mem::size_of::<CapSummaryPage>() == 4096);
Kernel update points (CapSummaryPage is written under seqlock):
| Event | Fields updated |
|---|---|
| cap_grant() / CAP_DERIVE | cap_generation++, syscaps_union \|= new, permbits_union \|= new, active_cap_count++, bloom insert |
| cap_revoke() / CAP_REVOKE | cap_generation++, recompute syscaps_union and permbits_union (OR over remaining caps), active_cap_count--, bloom rebuild |
| cap_restrict() / CAP_RESTRICT | cap_generation++, recompute unions (restriction may clear bits) |
| cap_delegate() / CAP_DELEGATE | cap_generation++, delegated_cap_count++ |
| execve() credential transform | Full rebuild: new unions, new bloom, new counts, secure_mode update |
Bloom filter rebuild on revoke: Revocation cannot simply clear bloom bits
(other capabilities may hash to the same positions). The kernel rebuilds the
bloom filter by re-hashing all remaining active_cap_count entries. For
typical process cap counts (< 100), this takes < 1 μs.
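The k=2 insert and membership check over the 256-bit filter can be sketched as follows (the hash mixers here are illustrative, not the kernel's actual hash functions):

```rust
/// Derive two bit positions in [0, 256) from an object ID.
/// Multiplicative mixers stand in for the kernel's real hash functions.
fn bloom_positions(object_id: u64) -> (u32, u32) {
    let h1 = object_id.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    let h2 = object_id.wrapping_mul(0xC2B2_AE3D_27D4_EB4F) ^ (object_id >> 32);
    ((h1 >> 56) as u32, (h2 >> 56) as u32)
}

/// Set both bits for an object (kernel side, on cap_grant).
fn bloom_insert(filter: &mut [u64; 4], object_id: u64) {
    let (a, b) = bloom_positions(object_id);
    filter[(a / 64) as usize] |= 1u64 << (a % 64);
    filter[(b / 64) as usize] |= 1u64 << (b % 64);
}

/// false => DEFINITELY no capability for this object (skip the syscall).
/// true  => MAYBE (confirm via CAP_QUERY).
fn bloom_maybe_contains(filter: &[u64; 4], object_id: u64) -> bool {
    let (a, b) = bloom_positions(object_id);
    filter[(a / 64) as usize] & (1u64 << (a % 64)) != 0
        && filter[(b / 64) as usize] & (1u64 << (b % 64)) != 0
}
```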
Userspace fast-path example (via libumka):
// Check if this process can bind to privileged ports before attempting.
// Avoids a guaranteed-EPERM syscall in the common "unprivileged" case.
#include <umka/vdso.h>
struct umka_cap_summary *cs = umka_get_cap_summary(); // mmap'd at exec
if (!(cs->syscaps_union & UMKA_SYSCAP_NET_BIND_SERVICE)) {
// Definitely don't have it — skip the bind() call or use port > 1024.
return use_unprivileged_port();
}
// Might have it — proceed with bind(), kernel will confirm.
return bind(sockfd, addr, addrlen);
Auxiliary vector entry: AT_UMKA_CAPSUMMARY (a_type = -0x0100, matching
the capability syscall range). Value = base address of the CapSummaryPage
mapping. Zero if the kernel does not support extended VVAR (forward compat).
2.22.1.2.2 Cgroup Resource Gauge Page (per-cgroup)¶
The cgroup resource gauge exposes real-time resource consumption counters to
all processes within a cgroup hierarchy, eliminating the need to read
/sys/fs/cgroup/.../memory.current, cpu.stat, etc. via VFS (which involves
path resolution, permission checks, seq_file formatting, and copy_to_user —
~5-10 μs per read). The gauge page provides the same data in ~5-20 ns.
Why this matters: JVM GCs, Go runtime memory managers, database buffer pools, and container orchestrators poll cgroup resource counters at high frequency (100-1000 Hz) to make adaptive decisions (trigger GC, shed load, throttle allocations). At 1000 Hz, VFS reads consume ~5-10 ms/sec of CPU per container — pure overhead that the gauge page eliminates.
/// Per-cgroup resource gauge page. One page per cgroup that has at least one
/// process with an active gauge mapping.
///
/// The kernel updates this page on the scheduler tick (CPU fields), on page
/// allocation/free (memory fields), and on I/O completion (I/O fields).
/// All updates use the seqlock protocol.
///
/// Mapping lifetime: created on first demand (when a process in the cgroup
/// calls umka_map_cgroup_gauge() or reads AT_UMKA_CGAUGE). Destroyed when
/// the cgroup is removed (all processes exited + no external references).
// kernel-internal, not KABI
#[repr(C, align(4096))]
pub struct CgroupGaugePage {
/// Seqlock sequence counter.
pub seq: AtomicU32,
pub _pad_seq: u32,
/// Cgroup ID (matches the cgroupfs inode number). Allows userspace to
/// verify it is reading the correct cgroup's data.
pub cgroup_id: u64,
// --- Memory controller gauges ---
/// Current memory usage in bytes (equivalent to memory.current).
/// Includes anonymous pages, file cache pages charged to this cgroup,
/// and kernel memory (slab, page tables) charged to this cgroup.
pub memory_current: u64,
/// memory.max (the configured limit). 0 = no limit (root cgroup default).
pub memory_max: u64,
/// memory.high (the throttle threshold). 0 = no high limit.
pub memory_high: u64,
/// memory.swap.current: swap usage in bytes.
pub swap_current: u64,
/// memory.swap.max: swap limit in bytes. 0 = no swap limit.
pub swap_max: u64,
/// Number of page faults since cgroup creation (major + minor).
pub pgfault_total: u64,
/// Number of major page faults (required I/O).
pub pgmajfault_total: u64,
/// Memory pressure: PSI (Pressure Stall Information) averages.
/// Encoded as a fixed-point percentage with four decimal places
/// (stored value = percentage × 10_000).
/// e.g., 50_000 = 5.0000% = some tasks stalled 5% of recent time.
pub memory_pressure_some_avg10: u32,
pub memory_pressure_some_avg60: u32,
pub memory_pressure_some_avg300: u32,
pub _pad_pressure: u32,
// --- CPU controller gauges ---
/// Total CPU time consumed by this cgroup in nanoseconds (user + system).
/// Equivalent to cpu.stat: usage_usec × 1000.
pub cpu_usage_ns: u64,
/// User-mode CPU time in nanoseconds.
pub cpu_user_ns: u64,
/// Kernel-mode CPU time in nanoseconds.
pub cpu_system_ns: u64,
/// Number of periods where this cgroup was throttled (CBS bandwidth).
pub cpu_nr_throttled: u64,
/// Total time spent throttled in nanoseconds.
pub cpu_throttled_ns: u64,
/// CPU pressure: PSI averages (same encoding as memory pressure).
pub cpu_pressure_some_avg10: u32,
pub cpu_pressure_some_avg60: u32,
pub cpu_pressure_some_avg300: u32,
pub _pad_cpu: u32,
// --- I/O controller gauges ---
/// Total bytes read by this cgroup (across all devices).
pub io_read_bytes: u64,
/// Total bytes written by this cgroup (across all devices).
pub io_write_bytes: u64,
/// Total I/O operations (read + write) completed.
pub io_ops_total: u64,
/// I/O pressure: PSI averages (same encoding).
pub io_pressure_some_avg10: u32,
pub io_pressure_some_avg60: u32,
pub io_pressure_some_avg300: u32,
pub _pad_io: u32,
// --- PIDs controller ---
/// Current number of tasks (threads) in this cgroup.
pub pids_current: u32,
/// pids.max limit. 0 = no limit.
pub pids_max: u32,
pub _reserved: [u8; 3904],
// Total fields: 192 bytes (see precise accounting below)
// Reserved: 4096 - 192 = 3904 bytes
}
// Precise field accounting:
// seq(4) + pad(4) + cgroup_id(8) = 16
// memory_current(8) + memory_max(8) + memory_high(8) = 24
// swap_current(8) + swap_max(8) = 16
// pgfault_total(8) + pgmajfault_total(8) = 16
// mem_psi: 4+4+4+4 = 16
// cpu_usage_ns(8)+cpu_user_ns(8)+cpu_system_ns(8) = 24
// cpu_nr_throttled(8)+cpu_throttled_ns(8) = 16
// cpu_psi: 4+4+4+4 = 16
// io_read(8)+io_write(8)+io_ops(8) = 24
// io_psi: 4+4+4+4 = 16
// pids_current(4)+pids_max(4) = 8
// Total fields: 16+24+16+16+16+24+16+16+24+16+8 = 192
// Reserved: 4096 - 192 = 3904 bytes
const _: () = assert!(core::mem::size_of::<CgroupGaugePage>() == 4096);
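The PSI fixed-point encoding above decodes with a one-line conversion; the load-shedding threshold in this sketch is illustrative, not part of the spec:

```rust
/// Decode a PSI gauge field (stored value = percentage × 10_000)
/// into a floating-point percentage.
fn psi_to_percent(raw: u32) -> f64 {
    raw as f64 / 10_000.0
}

/// Illustrative policy: shed load when "some" memory pressure over the
/// last 10 s exceeds 10% (threshold chosen for the example only).
fn should_shed_load(memory_pressure_some_avg10: u32) -> bool {
    psi_to_percent(memory_pressure_some_avg10) > 10.0
}
```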
Kernel update frequency:
| Field group | Update trigger | Cost |
|---|---|---|
| Memory counters | On page charge/uncharge (piggybacks on existing atomic updates) | ~0 ns additional (atomic store to gauge page under seqlock) |
| CPU counters | Scheduler tick (1-10 ms interval) | ~10 ns per cgroup in the tick path |
| I/O counters | I/O completion callback | ~5 ns per I/O (atomic add already exists) |
| PSI averages | Dedicated PSI update timer (2 sec interval, matching Linux) | ~50 ns per cgroup per update |
| PIDs counter | Task creation / exit | ~5 ns (atomic inc/dec) |
Mapping mechanism: A process calls umka_map_cgroup_gauge(cgroup_fd) (or
the automatic path via AT_UMKA_CGAUGE for the process's own cgroup). The
kernel returns an mmap'd read-only page. Multiple processes in the same
cgroup share the same physical page (one page per cgroup, not per process).
If the cgroup is destroyed while a mapping exists, the page is zeroed and the
cgroup_id field is set to 0 (userspace detects this as "gauge invalid").
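A runtime's gauge-driven decision might look like this sketch (seqlock read omitted for brevity; the within-10%-of-memory.high policy is illustrative):

```rust
/// Decide whether to trigger GC from gauge-page fields, passed in
/// plainly here. A real reader uses the seqlock protocol and checks
/// cgroup_id against the cgroup it expects to be reading.
fn should_trigger_gc(cgroup_id: u64, memory_current: u64, memory_high: u64) -> bool {
    if cgroup_id == 0 {
        return false; // gauge invalidated (cgroup destroyed) — fall back to VFS reads
    }
    if memory_high == 0 {
        return false; // no high limit configured
    }
    // Illustrative policy: GC when usage reaches 90% of memory.high.
    memory_current * 10 >= memory_high * 9
}
```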
Auxiliary vector entry: AT_UMKA_CGAUGE (a_type = -0x0101). Value = base
address of the CgroupGaugePage for the process's initial cgroup (at exec time).
Zero if the process is in the root cgroup with no gauge configured.
2.22.1.2.3 Scheduler Hint Page (per-task)¶
The scheduler hint page exposes the current task's scheduling state to userspace. This enables cooperative userspace schedulers (Go runtime, Erlang BEAM, Java virtual threads, Rust tokio) to make informed decisions about when to yield, how many work items to batch, and whether to spin-wait vs park.
/// Per-task scheduler hint page. Mapped on demand via
/// umka_map_sched_hint(). All fields are advisory — the kernel may
/// update them at any time, and userspace must not depend on exact values
/// for correctness (only for performance optimization).
///
/// Unlike the per-process CapSummaryPage, this page is per-TASK (thread).
/// Each task that requests it gets its own page (mapped into the task's
/// address space, read-only). The kernel writes it during context switch
/// and scheduler tick.
// kernel-internal, not KABI
#[repr(C, align(4096))]
pub struct SchedHintPage {
/// Seqlock sequence counter.
pub seq: AtomicU32,
pub _pad_seq: u32,
/// Current scheduling policy: SCHED_NORMAL (0), SCHED_FIFO (1),
/// SCHED_RR (2), SCHED_BATCH (3), SCHED_IDLE (5), SCHED_DEADLINE (6).
/// Matches the Linux SCHED_* constants.
pub sched_policy: u32,
/// Nice value (-20..19) for SCHED_NORMAL/SCHED_BATCH, or RT priority
/// (1..99) for SCHED_FIFO/SCHED_RR.
pub priority: i32,
/// Approximate remaining timeslice in nanoseconds. Updated on scheduler
/// tick. Zero means the task has been preempted or is about to be.
/// For SCHED_DEADLINE: remaining runtime budget in the current period.
pub slice_remaining_ns: u64,
/// Non-zero if the task's CBS bandwidth group is currently throttled
/// (cpu.max exceeded). A Go/Erlang runtime seeing this can stop
/// spawning new goroutines/processes until throttling clears.
pub cbs_throttled: u32,
/// Number of voluntary context switches (sched_yield, futex wait, etc.)
/// since task creation.
pub nvcsw: u32,
/// Number of involuntary context switches (preemptions). A high rate
/// indicates the task is compute-bound and should yield more
/// cooperatively.
pub nivcsw: u32,
/// CPU this task last ran on. Updated on context-switch-in.
pub last_cpu: u32,
/// NUMA node of last_cpu.
pub last_numa_node: u32,
/// Non-zero if the task is running on a performance core (big.LITTLE /
/// hybrid architectures). Zero on homogeneous systems.
pub on_performance_core: u32,
/// Current CPU frequency in kHz (approximate — read from cpufreq at
/// scheduler tick). Zero if cpufreq is not active.
pub cpu_freq_khz: u32,
pub _pad_tail: u32,
pub _reserved: [u8; 4040],
// seq(4)+pad(4) + policy(4)+priority(4) + slice(8)
// + throttled(4)+nvcsw(4)+nivcsw(4) + last_cpu(4)+last_numa(4)+perf_core(4)
// + freq(4)+pad(4)
// = 8+8+8+12+12+8 = 56 bytes
// Reserved: 4096 - 56 = 4040 bytes
}
const _: () = assert!(core::mem::size_of::<SchedHintPage>() == 4096);
Kernel update points:
| Event | Fields updated | Cost |
|---|---|---|
| Context switch in | last_cpu, last_numa_node, on_performance_core, slice_remaining_ns | ~15 ns (5 stores under seqlock) |
| Scheduler tick | slice_remaining_ns, cpu_freq_khz, cbs_throttled | ~10 ns |
| sched_setscheduler() | sched_policy, priority | ~5 ns |
| CBS throttle/unthrottle | cbs_throttled | ~5 ns |
Mapping mechanism: umka_map_sched_hint() returns a read-only mapping
of the calling task's SchedHintPage. The page is allocated on first request
from a dedicated slab cache. Tasks that never request it pay zero cost.
Auxiliary vector entry: AT_UMKA_SCHEDHINT (a_type = -0x0102). Value = 0
at exec time (the page is per-task, not per-process; it is mapped on demand
via umka_map_sched_hint(), not automatically at exec). The auxv entry exists
to advertise feature availability — non-zero means the kernel supports it.
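A cooperative scheduler's spin-vs-park decision over these fields might be sketched as follows (the 2× safety margin is illustrative, not from the spec):

```rust
/// Decide whether to spin-wait or park, given SchedHintPage fields
/// (passed in plainly here; a real reader uses the seqlock protocol).
fn should_spin_wait(slice_remaining_ns: u64, cbs_throttled: u32, expected_wait_ns: u64) -> bool {
    if cbs_throttled != 0 {
        return false; // bandwidth-throttled: spinning burns budget for nothing
    }
    // Spin only if the expected wait fits comfortably inside the
    // remaining timeslice; otherwise park and let the kernel run
    // something else.
    expected_wait_ns * 2 <= slice_remaining_ns
}
```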
2.22.1.2.4 Per-CPU vDSO Data Page (VcpuPage) — Getrandom + getcpu + CPU Hints¶
UmkaOS consolidates per-CPU userspace data into a single page: CSPRNG state
for the getrandom() vDSO fast path (matching Linux 6.11+ __vdso_getrandom
for binary compatibility with glibc 2.40+), CPU identity for getcpu(), and
CPU performance hints for DPDK/ScyllaDB-class workloads.
Why one page instead of separate pages: The former VdsoRngPage used 64 of 4096 bytes. The getcpu data needs 16 bytes. Allocating separate 4096-byte pages for each wastes physical memory per CPU. Consolidating into a single VcpuPage costs zero additional pages and keeps all per-CPU userspace data TLB-hot.
/// Per-CPU vDSO data page. Consolidates CSPRNG state, CPU identity, and
/// CPU performance hints into a single per-CPU page.
///
/// The page is mapped PROT_READ|PROT_WRITE into every process (each CPU
/// gets its own page; processes on the same CPU share the same physical
/// page). The write permission is necessary because the vDSO must advance
/// the CSPRNG consumption cursor atomically.
///
/// **Layout zones** (all at fixed offsets, stable ABI):
/// offset 0-63: RNG state (ChaCha20 key/nonce/counter + generation)
/// offset 64-79: CPU identity (cpu_id, numa_node, cpu_freq_khz, core_type)
/// offset 80-4095: reserved for future per-CPU userspace data
///
/// **Security model**: The ChaCha20 key is directly readable by any process
/// mapped to this CPU's VcpuPage. A process can predict output for other
/// processes on the same CPU until the next reseed (every context switch or
/// every 60 seconds, whichever comes first). Cross-CPU prediction is prevented
/// (each CPU has its own key). Same-CPU process isolation is NOT a design goal
/// for the vDSO CSPRNG fast path. Applications requiring cross-process CSPRNG
/// isolation should use the `getrandom()` syscall fallback (which uses a
/// per-task entropy extraction path that does NOT share state) or the
/// per-thread model (Phase 3 — see `vgetrandom_alloc(2)`).
///
/// **CPU identity security**: Userspace CAN write to cpu_id/numa_node (the page
/// is PROT_WRITE for the RNG cursor CAS), but the kernel overwrites them on
/// every context switch — a write is harmless because it is immediately corrected.
/// Same security model as the existing RNG state.
///
/// **Phase 3 enhancement**: Per-thread CSPRNG state via `MAP_DROPPABLE` pages
/// allocated through `vgetrandom_alloc(2)`, matching Linux 6.11+ behavior.
/// Each thread gets its own CSPRNG state — no cross-process prediction.
/// The per-CPU model remains as a fallback for threads that have not called
/// `vgetrandom_alloc()`.
// kernel-internal, not KABI
#[repr(C, align(4096))]
pub struct VcpuPage {
// --- Zone 0: RNG state (offset 0-63, 64 bytes) ---
/// Generation counter. Incremented by the kernel on every reseed.
/// The vDSO checks this to detect reseeds and re-derive state.
pub generation: AtomicU64,
/// ChaCha20 state: 256-bit key + 96-bit nonce + 32-bit counter.
/// The vDSO runs ChaCha20 rounds locally to produce output, then
/// atomically advances the counter. The kernel reseeds (overwrites
/// key + nonce) every 60 seconds or after 1 MiB of output.
pub chacha_key: [u8; 32],
pub chacha_nonce: [u8; 12],
pub chacha_counter: AtomicU32,
/// Number of bytes generated from the current seed. The kernel
/// monitors this and reseeds when it exceeds the threshold.
pub bytes_generated: AtomicU64,
// --- Zone 1: CPU identity (offset 64-79, 16 bytes) ---
/// Current CPU index. Updated by the kernel on every context-switch-in.
/// Used by `__vdso_getcpu()` on architectures without a dedicated
/// register-based fast path (RISC-V, PPC32, s390x, LoongArch64).
/// On x86-64 (RDPID) and AArch64 (TPIDRRO_EL0) the vDSO reads the
/// register directly and never touches this field.
pub cpu_id: u32,
/// NUMA node of the current CPU. Updated alongside cpu_id.
pub numa_node: u32,
/// Current CPU frequency in kHz (approximate — read from cpufreq at
/// context switch). Zero if cpufreq is not active. Useful for DPDK/
/// ScyllaDB adaptive batching (larger batches at lower frequencies).
pub cpu_freq_khz: u32,
/// Non-zero if the task is running on a performance core (big.LITTLE /
/// Intel hybrid architectures). Zero on homogeneous systems.
/// 0 = efficiency core or homogeneous, 1 = performance core.
pub on_performance_core: u32,
// --- Reserved (offset 80-4095) ---
pub _reserved: [u8; 4016],
// Zone 0: generation(8) + chacha_key(32) + chacha_nonce(12) + chacha_counter(4)
// + bytes_generated(8) = 64 bytes
// Zone 1: cpu_id(4) + numa_node(4) + cpu_freq_khz(4) + on_performance_core(4) = 16 bytes
// Total fields: 64 + 16 = 80 bytes. Reserved: 4096 - 80 = 4016
}
const _: () = assert!(core::mem::size_of::<VcpuPage>() == 4096);
Kernel update points for VcpuPage:
| Event | Fields updated | Cost |
|---|---|---|
| Context switch in | cpu_id, numa_node, on_performance_core, cpu_freq_khz |
~10 ns (4 stores, no seqlock needed — single-writer per CPU) |
| CSPRNG reseed (every 60s or 1 MiB) | generation, chacha_key, chacha_nonce, chacha_counter |
~30 ns |
| cpufreq transition | cpu_freq_khz |
~5 ns |
vDSO symbol: __vdso_getrandom — matches the Linux 6.11+ ABI exactly.
Glibc 2.40+ and musl (when updated) automatically use this symbol for
getrandom() and getentropy() calls. No UmkaOS-specific changes needed
in userspace.
Fallback: If the vDSO RNG page is exhausted or if GRND_RANDOM is
requested (blocking entropy pool), the vDSO falls back to a real getrandom()
syscall.
2.22.1.2.5 UTS Identity Page (per-UTS-namespace)¶
The UTS identity page provides uname() data without a syscall. Every call to
gethostname() or uname() in Linux is a real syscall — glibc does not cache
the result. High-throughput logging frameworks stamp hostname on every log line;
at 100K logs/sec this is 100K pointless kernel entries for data that changes
at most once (at boot or container start).
Why this works: struct utsname is ~390 bytes of nearly static data.
sysname, release, version, machine are constant for kernel lifetime.
nodename changes only on sethostname() (rare). domainname changes only
on setdomainname() (nearly never). The seqlock will essentially never
contend.
Container support: Each UTS namespace has its own hostname. The UTS page
is per-UTS-namespace — all processes in the same namespace share the same
physical page. On setns() / unshare(CLONE_NEWUTS), the process's UTS page
mapping is updated to point to the new namespace's page.
/// Per-UTS-namespace identity page. Mapped PROT_READ into every process.
/// The kernel updates this page under seqlock on sethostname() / setdomainname().
///
/// The struct utsname fields use the Linux ABI sizes:
/// __NEW_UTS_LEN (64) + NUL = 65 bytes per field, 6 fields = 390 bytes.
// kernel-internal, not KABI
#[repr(C, align(4096))]
pub struct UtsVvarPage {
/// Seqlock sequence counter. Updated only on sethostname() / setdomainname()
/// (essentially never contended — one write per container lifetime).
pub seq: AtomicU32,
pub _pad_seq: u32,
/// Pre-formatted struct utsname, ready for memcpy into user buffer.
/// Matches the Linux ABI layout exactly so that uname() can be
/// implemented as: seqlock read + memcpy(buf, &page.utsname, 390).
///
/// Fields (each 65 bytes, NUL-terminated):
/// sysname: "UmkaOS" (constant)
/// nodename: hostname (set by sethostname())
/// release: kernel version string (e.g., "1.0.0")
/// version: build info (e.g., "#1 SMP PREEMPT 2026-03-12")
/// machine: architecture string (e.g., "x86_64", "aarch64")
/// domainname: NIS domain (set by setdomainname(), usually "(none)")
pub utsname: LinuxUtsname,
pub _reserved: [u8; 3698],
// utsname: 65 × 6 = 390 bytes. seq(4)+pad(4) = 8. Total fields = 398.
// Reserved: 4096 - 398 = 3698 bytes.
}
/// Matches Linux's struct utsname layout exactly (6 × 65 bytes).
#[repr(C)]
pub struct LinuxUtsname {
pub sysname: [u8; 65],
pub nodename: [u8; 65],
pub release: [u8; 65],
pub version: [u8; 65],
pub machine: [u8; 65],
pub domainname: [u8; 65],
}
const _: () = assert!(core::mem::size_of::<UtsVvarPage>() == 4096);
const _: () = assert!(core::mem::size_of::<LinuxUtsname>() == 390);
// Per-field offset assertions (same pattern as VvarPage — required for all
// kernel/userspace shared structs per Decision 3 of spec review debate 2026-04-02).
const_assert!(core::mem::offset_of!(UtsVvarPage, seq) == 0);
const_assert!(core::mem::offset_of!(UtsVvarPage, _pad_seq) == 4);
const_assert!(core::mem::offset_of!(UtsVvarPage, utsname) == 8);
const_assert!(core::mem::offset_of!(UtsVvarPage, _reserved) == 398);
const_assert!(core::mem::offset_of!(LinuxUtsname, sysname) == 0);
const_assert!(core::mem::offset_of!(LinuxUtsname, nodename) == 65);
const_assert!(core::mem::offset_of!(LinuxUtsname, release) == 130);
const_assert!(core::mem::offset_of!(LinuxUtsname, version) == 195);
const_assert!(core::mem::offset_of!(LinuxUtsname, machine) == 260);
const_assert!(core::mem::offset_of!(LinuxUtsname, domainname) == 325);
vDSO symbol: __vdso_uname — new symbol (not present in Linux's vDSO).
Glibc can be trivially patched to check for this symbol (same pattern as
__vdso_clock_gettime): if present, call it; if absent, fall back to the
uname() syscall. The vDSO implementation is a seqlock read + memcpy (~10 ns
vs ~200 ns for the syscall path). Musl and other libc implementations can
adopt the same pattern.
vDSO __vdso_uname implementation:
__vdso_uname(struct utsname *buf):
loop:
seq1 = load(UtsVvarPage::seq, Acquire)
if seq1 & 1 != 0: continue // write in progress
memcpy(buf, &UtsVvarPage::utsname, 390)
seq2 = load(UtsVvarPage::seq, Acquire)
if seq2 != seq1: continue // raced with sethostname()
return 0
Transparent acceleration: Unlike the UmkaOS-specific pages (CapSummary,
CgroupGauge, SchedHint), the UTS page uses a standard vDSO symbol. Once libc
implementations add __vdso_uname support, unmodified Linux binaries
recompiled against the updated libc (or dynamically linked against it) get
the fast path automatically — no source changes, no libumka dependency.
Auxiliary vector entry: The UTS page address is not exposed via auxv — the vDSO ELF locates it using a linker-time constant offset from the vDSO base (same mechanism as the timekeeping VVAR page). The UTS page is mapped adjacent to the existing VVAR page, extending the VVAR region by one page:
high address
[vdso ELF] 1-4 pages, PROT_READ|PROT_EXEC
[vvar page] 1 page, PROT_READ — timekeeping data
[uts page] 1 page, PROT_READ — utsname data
low address
2.22.1.2.6 Process Identity Page (per-process)¶
Several commonly called functions read static per-process identity that never
changes after exec (or changes extremely rarely via setid() calls):
getpid(), getppid(), getuid(), getgid(), geteuid(), getegid().
Glibc caches getpid() internally, but getppid() and the get*id() calls are
real syscalls every time.
These fields are already available in the auxiliary vector (AT_UID, AT_EUID,
AT_GID, AT_EGID) but only at exec time — they become stale if credentials
change post-exec. A live VVAR-style page solves this.
/// Per-process identity page. Mapped PROT_READ into every process at exec.
/// Updated under seqlock on setuid/setgid/setgroups and on fork (ppid).
// kernel-internal, not KABI
#[repr(C, align(4096))]
pub struct ProcIdentityPage {
/// Seqlock sequence counter.
pub seq: AtomicU32,
pub _pad_seq: u32,
/// Process ID (stable after fork, changes only across exec in
/// thread-group-leader-replacement edge case).
pub pid: u32,
/// Parent process ID (changes on parent exit → reparent to init).
pub ppid: u32,
/// Thread group ID (== pid for the main thread).
pub tgid: u32,
/// Session ID.
pub sid: u32,
/// Process group ID.
pub pgid: u32,
pub _pad_ids: u32,
/// Real/effective/saved user and group IDs.
pub uid: u32,
pub euid: u32,
pub suid: u32,
pub gid: u32,
pub egid: u32,
pub sgid: u32,
/// Number of supplementary groups.
pub ngroups: u32,
/// Supplementary group list (max 65536 on Linux; we store up to 256
/// inline, which covers >99.9% of real-world processes). Processes
/// with >256 groups fall back to the getgroups() syscall.
pub groups: [u32; 256],
pub _reserved: [u8; 3012],
// seq(4)+pad(4) = 8
// pid(4)+ppid(4)+tgid(4)+sid(4)+pgid(4)+pad(4) = 24
// uid(4)+euid(4)+suid(4)+gid(4)+egid(4)+sgid(4) = 24
// ngroups(4) + groups(256*4=1024) = 1028
// Total = 8 + 24 + 24 + 1028 = 1084
// Reserved: 4096 - 1084 = 3012
}
const _: () = assert!(core::mem::size_of::<ProcIdentityPage>() == 4096);
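A getgroups() fast path consistent with the inline-256 limit might look like the following sketch (seqlock read omitted for brevity):

```rust
/// Copy the supplementary group list out of the ProcIdentityPage.
/// Returns None when the inline array cannot hold the full list — the
/// caller must then fall back to the real getgroups() syscall.
fn getgroups_fast(ngroups: u32, groups: &[u32; 256]) -> Option<Vec<u32>> {
    if ngroups as usize > groups.len() {
        return None; // >256 supplementary groups: syscall fallback
    }
    Some(groups[..ngroups as usize].to_vec())
}
```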
vDSO symbols (Linux-compatible function signatures):
| Symbol | Equivalent syscall | Implementation |
|---|---|---|
| __vdso_getpid | getpid() | Return page->pid (seqlock read) |
| __vdso_getppid | getppid() | Return page->ppid |
| __vdso_getuid | getuid() | Return page->uid |
| __vdso_geteuid | geteuid() | Return page->euid |
| __vdso_getgid | getgid() | Return page->gid |
| __vdso_getegid | getegid() | Return page->egid |
These are all new vDSO symbols (Linux does not have them). Like __vdso_uname,
libc implementations can trivially add support: check for the symbol at startup,
use it if present, fall back to syscall if absent. Since glibc already caches
getpid(), the main win is for getppid(), getuid(), geteuid(), etc.
Kernel update points:
| Event | Fields updated |
|---|---|
| fork() / clone() | New page allocated for child: pid, ppid, tgid, copy parent's creds |
| setuid() / seteuid() / setreuid() / setresuid() | uid, euid, suid |
| setgid() / setegid() / setregid() / setresgid() | gid, egid, sgid |
| setgroups() | ngroups, groups[] |
| setsid() | sid |
| setpgid() | pgid |
| Parent exit (reparent to init) | ppid |
Auxiliary vector entry: AT_UMKA_PROCID (a_type = -0x0103). Value = base
address of the ProcIdentityPage. Zero on kernels that don't support it.
2.22.1.2.7 Summary of Extended VVAR Pages¶
| Page | Scope | Mapped at | Auxv entry | Key use case |
|---|---|---|---|---|
| VvarPage | Global (single writer) | exec (automatic) | AT_SYSINFO_EHDR (33) | clock_gettime (all clocks including CLOCK_TAI) |
| UtsVvarPage | Per-UTS-namespace | exec (automatic) | (vDSO-internal offset) | uname, gethostname |
| ProcIdentityPage | Per-process | exec (automatic) | AT_UMKA_PROCID (-0x0103) | getpid, getppid, get*id |
| CapSummaryPage | Per-process | exec (automatic) | AT_UMKA_CAPSUMMARY (-0x0100) | Fast negative permission check |
| CgroupGaugePage | Per-cgroup | On demand or exec | AT_UMKA_CGAUGE (-0x0101) | GC / adaptive runtime resource checks |
| SchedHintPage | Per-task | On demand | AT_UMKA_SCHEDHINT (-0x0102) | Cooperative userspace schedulers |
| VcpuPage | Per-CPU | exec (automatic) | (internal to vDSO) | getrandom() fast path, getcpu(), CPU performance hints |
2.22.2 No 32-bit Compatibility Modes on 64-bit Kernels¶
UmkaOS does not support running 32-bit binaries on 64-bit kernels:
- No ia32 compatibility mode on x86-64
- No AArch32 compatibility mode on AArch64
- No RV32 compatibility mode on RV64
- No ESA/390 (31-bit) compatibility mode on s390x
- No 32-bit mode on LoongArch64
ARMv7 (32-bit ARM) is supported as a native first-class architecture — it runs a native 32-bit kernel, not a compatibility layer on a 64-bit kernel. This follows the principle that 32-bit support, where needed, is added as a separate target rather than as a compatibility layer that doubles the syscall surface.
2.22.3 64-bit Atomics on 32-bit Architectures¶
UmkaOS uses AtomicU64 in several core data structures (PTY ring buffers, MCE logs,
lock-free IPC). On 32-bit architectures where native 64-bit atomics have limited
support, the following strategies apply:
| Architecture | Native 64-bit Atomic | Strategy |
|---|---|---|
| ARMv7 (Cortex-A) | LDREXD/STREXD (available on all ARMv7-A cores with LPAE) | Native hardware atomics. The armv7a-none-eabi target supports AtomicU64 via doubleword exclusive load/store. Non-LPAE cores (Cortex-M, ARMv6) are not first-class targets. |
| PPC32 | No native 64-bit atomics | Software emulation via interrupt-disabling (wrteei 0/1) around read-modify-write sequences. Implemented in umka-kernel/src/arch/ppc32/atomics.rs. The custom target JSON sets max-atomic-width: 64 so LLVM generates calls to __atomic_* runtime functions provided by the kernel. |
ARMv7's exclusive load/store pairs are SMP-safe under hardware cache coherence. The interrupt-disabling approach on PPC32 is single-core only; it is correct because UmkaOS's PPC32 targets are single-core embedded systems, and SMP PowerPC deployments use 64-bit PPC64LE.
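The shape of the PPC32 emulation, as a hedged sketch: `irq_save`/`irq_restore` are hypothetical stand-ins for the wrteei 0 / wrteei 1 pair, and the whole approach is valid only on a single core.

```rust
/// Illustrative 64-bit compare-and-swap made atomic by disabling
/// interrupts around the read-modify-write. NOT safe on SMP — correct
/// only because UmkaOS's PPC32 targets are single-core.
fn emulated_cas_u64<S, R>(cell: &mut u64, expected: u64, new: u64,
                          irq_save: S, irq_restore: R) -> bool
where
    S: FnOnce() -> bool, // wrteei 0; returns previous MSR[EE] state
    R: FnOnce(bool),     // wrteei 1 iff interrupts were enabled before
{
    let was_enabled = irq_save();
    // The read-modify-write below cannot be interrupted.
    let success = if *cell == expected {
        *cell = new;
        true
    } else {
        false
    };
    irq_restore(was_enabled);
    success
}
```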
All six 64-bit architectures (x86-64, AArch64, RISC-V 64, PPC64LE, s390x,
LoongArch64) have native 64-bit atomics. s390x uses CSG (the 64-bit Compare
and Swap instruction). LoongArch64 uses AMSWAP.D, AMCAS.D (atomic memory
operations on doublewords). No special handling is needed.
2.22.4 Advanced Feature Architecture Parity¶
Chapters 16–18 define advanced features that rely on architecture-specific hardware
mechanisms. The following matrix summarizes support status across all eight first-class
architectures. Where hardware is unavailable, UmkaOS either provides a software fallback
(reduced performance) or marks the feature as not supported on that architecture. The
kernel's #[cfg(target_feature)] mechanism ensures unsupported paths compile to no-ops
with zero overhead.
| Feature | Mechanism | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE | s390x | LoongArch64 |
|---|---|---|---|---|---|---|---|---|---|
| Fast driver isolation | MPK/POE/DACR/page-table | WRPKRU (native) | POE (ARMv8.9+, POR_EL0) / page-table fallback | DACR 16 domains | Page-table based | Segment registers (16 segments) / page-table fallback | Radix partition table / HPT LPAR | Storage Keys (Tier 1 unavailable) | Not available (Tier 1 unavailable) |
| Memory tagging | MTE/LAM | Intel LAM (pointer tagging only) | MTE (full, ARMv8.5+) | Not available | Not available | Not available | Not available | Not available | Not available |
| Hardware power metering | RAPL/SCMI/SBI | RAPL (native) | SCMI power domain | SCMI (limited) | SBI PMU (basic) / software estimation | Not available (software only) | OPAL/OCC power sensors (POWER8/9/10) | SCLP energy management (z14+) | Software estimation |
| Confidential computing | SEV-SNP/TDX/CCA/CoVE | SEV-SNP + TDX (native) | ARM CCA (emerging) | Not available | RISC-V CoVE (draft) | Not available | Ultravisor Protected Execution Facility (POWER9+) | Ultravisor Secure Execution (z15+) | Not available |
| Cache partitioning | CAT/MPAM | Intel CAT + MBA (native) | ARM MPAM (ARMv8.4+) | Not available | Not available (software only) | Not available | Not available (software only) | Not available | Not available |
| Hardware preemption (GPU) | Device-dependent | Yes (vendor support) | Yes (Mali, Adreno) | Limited | Emerging | Not available | Limited (Nvidia via PCIe) | Not available (no GPU) | Limited |
| CXL memory pooling | CXL 2.0/3.0 | Native (PCIe 5.0+) | Emerging (ARMv9 + CXL) | Not available | Not available | Not available | OpenCAPI / CXL (POWER10+) | Not available (no PCIe) | Not available |
| In-kernel inference | ISA extensions | AMX (matrix), AVX-512 | SME (matrix), SVE (vector) | NEON (vector) | V extension (vector) | AltiVec/SPE (limited) | VSX (vector-scalar, POWER7+) | CPACF (crypto) + NNPA (z16, neural network) | LSX/LASX (128/256-bit SIMD) |
Reading the table: "Native" means hardware support is available and UmkaOS uses it directly. "Fallback" means UmkaOS implements the feature using a slower mechanism (typically page-table manipulation). "Not available" means neither hardware nor a practical software fallback exists — the feature is compile-time disabled on that architecture. "Emerging" or "draft" means the hardware specification exists but is not yet widely deployed; UmkaOS includes provisional support gated behind a feature flag.
2.22.4.1 Fallback Acceptance Criteria¶
For each feature in the parity matrix above, the following table specifies the performance threshold, test requirements, and user notification for each support level. These criteria define the pass/fail gate for each architecture in CI.
Per-support-level requirements:
| Support Level | Performance Threshold | Functional Tests | CI Gate |
|---|---|---|---|
| Native | ≤ 5% overhead vs Linux (macro benchmarks) | Full feature test suite must pass | Tier 1 (every commit) |
| Fallback | ≤ 10% overhead vs Linux on affected workloads | Same test suite as native — identical functional behavior, relaxed timing assertions | Tier 2 (every PR) |
| Not available | 0% overhead (feature compile-time disabled) | Feature tests are #[cfg]-skipped. No false positives in feature detection. Boot must succeed without the feature. | Tier 1 (every commit) |
| Emerging | No performance gate (experimental) | Feature-flag-gated tests pass when enabled. No regression when disabled. | Tier 3 (nightly) |
Per-feature acceptance criteria:
| Feature | Fallback Mechanism | Fallback Overhead Budget | Acceptance Test |
|---|---|---|---|
| Fast driver isolation | Page-table+ASID (AArch64 no-POE, PPC); Tier 0 promotion (RISC-V, s390x, LoongArch64) | ≤ 10% on page-table fallback. 0% on Tier 0 promotion (no isolation, no overhead). | fio 4K random IOPS: fallback overhead < 10%. Tier 0 promotion: overhead = 0%, promoted-Tier-1 driver crash = kernel panic (documented tradeoff; Tier 2 drivers on these architectures retain full crash recovery via Ring 3 + IOMMU). |
| Memory tagging | None — feature disabled where hardware absent | 0% (no fallback, feature off) | Boot succeeds. No MTE-related dmesg errors. Sanitizer tests (ASan/MSan) substitute for MTE on non-MTE platforms. |
| Hardware power metering | Software estimation (cycle counters + thermal model) | ≤ 15% accuracy loss vs hardware metering | Power budget enforcement test: budget not exceeded by > 2 tick intervals. Estimation drift < 15% over 60s vs reference (hardware platform). |
| Confidential computing | None — feature disabled where hardware absent | 0% (no fallback) | Boot succeeds. CC-dependent features (TEE-to-TEE DSM) unavailable. Attestation APIs return ENOTSUP. |
| Cache partitioning | Software throttling (scheduler-based bandwidth limiting) | ≤ 20% less effective than hardware CAT/MPAM | Noisy-neighbor test: co-located workload P99 latency < 2x vs hardware partitioning. |
| In-kernel inference | Scalar fallback (no SIMD) | ≤ 3x inference latency vs native SIMD | Inference completes within cycle watchdog budget (Section 23.1). Model output identical regardless of SIMD path. |
| CXL memory pooling | None — feature disabled | 0% | Boot succeeds. NUMA topology reflects only local memory. |
Tier 1 isolation — per-architecture decision summary:
| Architecture | Mechanism | Decision | Rationale |
|---|---|---|---|
| x86-64 | MPK (WRPKRU) | Native — always available on Skylake+ | ~23 cycles. Well within 5% budget. |
| AArch64 (POE) | POR_EL0 | Native — on ARMv8.9+/ARMv9.4+ | ~40-80 cycles. Within 5% budget. |
| AArch64 (no POE) | Page-table + ASID switch | Fallback — accepted | ~150-300 cycles. Within 10% budget with coalescing. See Section 11.2. |
| ARMv7 | DACR + ISB | Native — architectural in ARMv7-A | ~30-40 cycles (MCR p15 + required ISB barrier per ARM ARM B3.7.2). Within 5% budget. |
| RISC-V 64 | None | Not available — Tier 1 runs as Tier 0 | No hardware mechanism exists. Documented tradeoff: no fault isolation for Tier 1 drivers. Tier 2 (Ring 3 + IOMMU) available for untrusted drivers. |
| PPC32 | Segment registers + isync | Native | ~20-40 cycles (mtsr + required isync barrier per Power ISA §5.4.4.2). Within 5% budget. |
| PPC64LE | Radix PID (POWER9+) | Native | ~30-60 cycles. Within 5% budget. |
| s390x | Storage Keys (4-bit per page) | Not available — Tier 1 runs as Tier 0 | Keys are page-granularity, ISK/SSK require privilege. Too coarse for sub-page domain isolation. Tier 2 available via channel I/O subchannel protection. |
| LoongArch64 | None | Not available — Tier 1 runs as Tier 0 | No hardware mechanism exists. Same as RISC-V. Tier 2 available via IOMMU. |
2.22.4.2 User-Visible Degradation Notification¶
The kernel reports feature availability at boot via dmesg and at runtime via
sysfs. Users and orchestrators (Kubernetes node labels, cloud metadata) can
query feature status programmatically.
Boot-time dmesg output (one line per feature):
umka: isolation: MPK native (12 driver domains, ~23 cycles/switch)
umka: isolation: page-table fallback (ASID, ~150-300 cycles/switch, coalescing enabled)
umka: isolation: disabled (no hardware support, Tier 1 unavailable — drivers use Tier 0 or Tier 2)
umka: memory-tagging: MTE2 native (async mode, 16-byte granule)
umka: memory-tagging: not available
umka: power-metering: RAPL native (package + DRAM domains)
umka: power-metering: software estimation (cycle-counter model, ~15% accuracy)
umka: confidential-computing: SEV-SNP available (SME + SEV + SEV-ES + SEV-SNP)
umka: confidential-computing: not available
umka: cache-partitioning: CAT native (L3, 12 classes of service)
umka: cache-partitioning: software throttling (scheduler-based)
umka: inference-simd: AVX-512 native
umka: inference-simd: NEON native
umka: inference-simd: scalar fallback
Runtime sysfs interface:
/sys/kernel/umka/features/
├── isolation # "native:mpk", "fallback:page-table", "disabled"
├── memory_tagging # "native:mte2", "not_available"
├── power_metering # "native:rapl", "fallback:software"
├── confidential # "native:sev-snp", "not_available"
├── cache_partitioning # "native:cat", "fallback:software"
└── inference_simd # "native:avx512", "native:sve", "fallback:scalar"
Each file returns a single line: <level>:<mechanism> where level is one of
native, fallback, disabled, not_available. Orchestrators parse this to
make scheduling decisions (e.g., Kubernetes: node.umka.io/isolation=native:mpk).
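A consumer of the sysfs interface splits each line on the first colon. A minimal sketch of such a parser (a hypothetical helper, not part of the kernel tree) — note that `disabled` and `not_available` carry no mechanism suffix:

```rust
/// Parse a `/sys/kernel/umka/features/*` line of the form `<level>:<mechanism>`
/// into its two components. Lines without a colon (e.g. "not_available")
/// yield `None` for the mechanism.
pub fn parse_feature_status(line: &str) -> (&str, Option<&str>) {
    let line = line.trim_end();
    match line.split_once(':') {
        Some((level, mech)) => (level, Some(mech)),
        None => (line, None),
    }
}
```

An orchestrator would map the returned level directly onto a node label, e.g. `("native", Some("mpk"))` becomes `node.umka.io/isolation=native:mpk`.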
2.22.5 QEMU vs Real-Silicon Divergences¶
All eight architectures are tested via QEMU from Phase 1 onward, but QEMU TCG is a functional emulator, not a cycle-accurate simulator. Several categories of hardware behavior — memory ordering, TLB shootdown latency, DMA cache coherence, IOMMU advanced features, interrupt controller timing, and power management — are simplified or absent in QEMU. Code that passes all QEMU tests may still contain latent bugs that surface only on real hardware.
The complete per-architecture divergence analysis, including specific QEMU behaviors, real-silicon counterparts, testing impact, and mitigations, is documented in Section 24.2.
Key implications for architecture-specific development:
- AArch64, RISC-V, PPC, LoongArch: Lock-free and barrier-sensitive code cannot be validated by QEMU alone. QEMU's sequential execution masks weak-ordering bugs that corrupt data on real hardware. AArch64 real-hardware testing (RPi 5 / Apple M1) begins in Phase 2 to catch these early.
- s390x: Channel I/O subsystem (CCW chains, QDIO, subchannel multiplexing) is partially emulated. Functional testing of virtio-ccw transport is reliable; FICON and multi-subchannel scheduling require z/VM or LPAR hardware.
- LoongArch64: EIOINTC multi-node interrupt routing in QEMU is functional for single-socket configurations but untested for the 3C5000 8-node topology.
- All architectures: DMA cache coherence, PCIe power state transitions, and CPU power management (C-states, DVFS) are not modeled by QEMU. These paths require real-hardware validation.
2.23 Hardware Memory Safety¶
2.23.1 ARM MTE (Memory Tagging Extension)¶
ARM MTE is architecturally defined in ARMv8.5-A and first implemented in ARMv9 silicon. MTE availability depends on both the core IP implementing the extension AND the SoC vendor enabling tag storage in the memory subsystem:
- Core IP with MTE: ARM Neoverse V2, Neoverse V3 (all cores based on these designs implement the MTE extension at the microarchitectural level).
- Mobile SoCs with MTE enabled: Google Pixel 8/9 (Tensor G3/G4, Cortex-X3/X4), MediaTek Dimensity 9300+ devices.
- Datacenter SoC with MTE enabled: AmpereOne (the first datacenter SoC to fully enable MTE at the platform level, including tag storage in DRAM).
- Cloud SoCs with MTE logic but NOT enabled: AWS Graviton 4 (Neoverse V2) and Google Axion (Neoverse V2) include MTE logic in the cores but their memory subsystems do not support tag storage — MTE is not usable on these platforms despite the core IP implementing it.
- No MTE: Ampere Altra (Neoverse N1, ARMv8.2 — predates MTE entirely).
Every 16-byte memory granule carries a 4-bit tag. Pointer top bits carry a tag. Hardware compares them on every access. Mismatch = fault. This catches use-after-free and buffer overflow in hardware, at near-zero runtime cost.
Important limitation: MTE is probabilistic, not complete. 4-bit tags = 16
possible values. Adjacent slab objects may receive the same tag by random chance
(probability 1/16 = 6.25%). Single-violation detection rate: ~93.75%. This is
acceptable for defense-in-depth — Rust's ownership model is the primary safety
mechanism; MTE is an additional hardware layer that catches what Rust cannot
(C driver bugs in Tier 1, unsafe blocks, compiler bugs). MTE is NOT a
substitute for memory-safe code.
Tag Storage Requirement:
ARM MTE stores tags in storage managed by the memory controller: 4 bits per 16-byte
granule. Relative to DRAM capacity, this means tag storage is sized at 3.125% of DRAM
(4 bits / 128 bits = 1/32). High-performance implementations (Neoverse V2/V3,
AmpereOne) typically use dedicated Tag RAM; other implementations may use reserved
DRAM regions managed transparently by the memory controller. In all cases, the
storage is invisible to software and managed automatically by the hardware.
The supported MTE level (MTE2 vs MTE3) is detected at boot via
ID_AA64PFR1_EL1.MTE (bits [11:8]: 0b0010 = MTE2, 0b0011 = MTE3) and
stored in the CpuFeatureTable (Section 2.16).
SCTLR_EL1.TCF0/TCF1 fields are configured accordingly during boot.
On SoCs without MTE support, the tagging code is compiled out
(#[cfg(target_feature = "mte")]) — zero overhead, zero memory cost.
MTE is only available on ARM; x86 systems are entirely unaffected.
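The boot-time decode of the MTE field can be sketched as straightforward bit extraction. This sketch takes a raw register value as a `u64` so it runs anywhere; the real detection path would read the register with `mrs` first. The `0b0001` encoding (instruction-only MTE, EL0) is included for completeness even though UmkaOS only enables tagging at MTE2 and above:

```rust
/// MTE support level reported by ID_AA64PFR1_EL1 bits [11:8].
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MteLevel {
    None, // 0b0000: no MTE
    Mte,  // 0b0001: instruction-only MTE (EL0)
    Mte2, // 0b0010: full MTE
    Mte3, // 0b0011: MTE + asymmetric tag check faulting
}

/// Decode the MTE field from a raw ID_AA64PFR1_EL1 value.
pub fn decode_mte_level(id_aa64pfr1_el1: u64) -> MteLevel {
    match (id_aa64pfr1_el1 >> 8) & 0xF {
        0b0000 => MteLevel::None,
        0b0001 => MteLevel::Mte,
        0b0010 => MteLevel::Mte2,
        _ => MteLevel::Mte3, // 0b0011; higher reserved values imply at least MTE3
    }
}
```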
TEE interaction: MTE tags are stored in separate physical tag RAM. For
TEE-encrypted pages, tag RAM may also be encrypted. Confidential pages are
allocated untagged (tag = 0); MTE checking is disabled for pages owned by a
ConfidentialContext (see Section 9.7). Hardware encryption already prevents
unauthorized access — MTE is redundant for confidential memory.
Section 4.10 already mentions MTE and Intel LAM. This section details the architectural integration.
2.23.2 Design: Tag-Aware Memory Allocator¶
// umka-core/src/mem/tagging.rs
/// Memory tagging policy (system-wide, configurable at boot).
#[repr(u32)]
pub enum TaggingPolicy {
/// No tagging. Standard allocation. Zero overhead.
/// Used on hardware without MTE, or for maximum performance.
Disabled = 0,
/// Synchronous tagging: fault immediately on tag mismatch.
/// Catches all tag violations. ~128 extra cycles per page allocation.
/// Recommended for development and high-security production.
Synchronous = 1,
/// Asynchronous tagging: record violations in a register, check lazily.
/// Lower overhead (~10 cycles per allocation), but violations reported
/// with delay. Good for production with logging.
Asynchronous = 2,
}
/// Tag operations for the memory allocator.
pub trait MemoryTagger {
/// Assign a random tag to a newly allocated region.
/// Called by: slab allocator (per-object), buddy allocator (per-page).
fn tag_allocation(&self, addr: *mut u8, size: usize) -> TaggedPtr;
/// Clear tags on freed memory (set to a "freed" tag value).
/// Any subsequent access with the old tag will fault.
fn tag_deallocation(&self, addr: *mut u8, size: usize);
/// Set tags for a DMA buffer region (tag = 0, untagged).
/// DMA engines don't understand tags — buffers must be untagged.
fn untag_dma_region(&self, addr: *mut u8, size: usize);
}
2.23.3 Integration Points¶
Slab allocator (Section 4.3):
Object allocation:
1. Allocate object from slab (existing path).
2. Assign random 4-bit tag to the object's 16-byte granules.
3. Return tagged pointer (tag in top bits).
Object deallocation:
1. Return object to slab (existing path).
2. Set the object's granules to a "freed" tag (e.g., 0xF).
3. Any subsequent access with the old tag faults immediately.
Benefit: use-after-free in kernel (or in Tier 1 C drivers) is caught
by hardware. The fault is caught by domain isolation and triggers driver crash recovery.
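The tagged-pointer packing in allocation step 3 is pure bit manipulation: MTE places the logical tag in address bits [59:56] of the pointer. A minimal sketch (the real path would additionally issue STG to set the matching allocation tag in tag storage):

```rust
/// Pack a 4-bit MTE logical tag into a pointer's top byte (bits [59:56]).
pub fn tag_ptr(addr: u64, tag: u8) -> u64 {
    debug_assert!(tag < 16, "MTE tags are 4 bits");
    (addr & !(0xF_u64 << 56)) | ((tag as u64) << 56)
}

/// Extract the logical tag back out of a tagged pointer.
pub fn extract_tag(ptr: u64) -> u8 {
    ((ptr >> 56) & 0xF) as u8
}
```

On a tag-checked access the hardware compares `extract_tag(ptr)` against the allocation tag of the granule; the functions above only model the pointer side of that comparison.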
Page allocator (Section 4.2):
Page allocation: tag all granules in the page with a fresh tag.
Page deallocation: tag all granules with "freed" tag.
Granule counts: 4KB page = 256 granules (4096 / 16); 64KB page = 4096 granules (65536 / 16).
Cost (4KB page): 256 STG instructions per alloc/dealloc (or 128 ST2G/STZ2G,
each tagging two 16-byte granules).
At ~0.5 cycles per STG on A510+ cores: ~128 cycles with STG (64 cycles with ST2G). Page alloc is ~300+ cycles.
Overhead (4KB): ~43% with STG (128 tag cycles / ~300 base cycles); ~21% with ST2G.
Cost (64KB page): 4096 STG instructions (or 2048 ST2G/STZ2G instructions).
At ~0.5 cycles per STG: ~2048 cycles with STG (~1024 cycles with ST2G/STZ2G). Page alloc is ~300+ cycles.
Overhead (64KB): ~683% with individual STG (2048 tag cycles / ~300 base cycles);
~341% with ST2G/STZ2G (1024 tag cycles / ~300 base cycles). Prefer STZ2G for bulk
tagging as it zeros and tags in one pass. The 4KB case is the common slab/page
allocation path. 64KB huge-page allocation is rarely hot and the high overhead is
acceptable.
Note: this only affects ARM. On x86 without MTE, zero overhead.
On ARM without MTE enabled, zero overhead (policy = Disabled).
KABI boundary:
When kernel passes a buffer to a Tier 1 driver:
Buffer is tagged. Driver receives tagged pointer.
If driver overflows the buffer: tag mismatch, hardware fault.
Domain isolation catches the fault, driver is crash-recovered.
This provides hardware-enforced bounds checking for C drivers,
even though the kernel is written in Rust (which checks bounds in software).
DMA buffers:
DMA engines cannot process tagged memory.
DMA buffers are allocated untagged (tag = 0).
IOMMU validates DMA addresses regardless.
fork() / CoW:
Before CoW break: child shares parent's page (same tags, read-only).
On CoW break (child or parent writes):
1. Allocate new page, copy data.
2. Assign FRESH RANDOM tags to the new page's granules.
3. Do NOT copy the old page's tags.
Rationale: if both pages kept the same tags, a stale pointer from
one process could access the other's now-separate page without
a tag fault (same tag, different physical page). Fresh tags ensure
that cross-process stale pointers are detected by MTE.
2.23.4 Intel LAM (Linear Address Masking)¶
Intel LAM allows using top bits of 64-bit pointers for metadata without them being treated as part of the address. This is less powerful than MTE (no hardware tag checking), but useful for:
- Pointer authentication (storing metadata in unused address bits)
- Memory safety tooling (KASAN-like in-kernel detection)
- Capability tagging (embedding capability metadata in pointers)
LAM modes:
LAM_U48: bits 62:48 available for metadata (15 bits, user pointers only).
LAM_U57: bits 62:57 available for metadata (6 bits, 5-level paging mode).
Controlled via CR3 flags: CR3.LAM_U48 or CR3.LAM_U57.
No runtime cost: address masking is performed by hardware in the MMU pipeline.
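The LAM_U48 layout can be illustrated with a metadata embed/strip pair. This is a software sketch with hypothetical helper names: under hardware LAM the strip is free (the MMU ignores bits 62:48), and the functions below only show where the bits live and what software checking would have to do:

```rust
const META_SHIFT: u32 = 48;
const META_BITS: u64 = 0x7FFF; // 15 metadata bits, pointer bits [62:48]

/// Embed 15 bits of metadata into a canonical user pointer (LAM_U48 layout).
pub fn embed_meta(ptr: u64, meta: u16) -> u64 {
    debug_assert!(ptr >> META_SHIFT == 0, "expected a canonical user pointer");
    ptr | (((meta as u64) & META_BITS) << META_SHIFT)
}

/// Split a tagged pointer back into (address, metadata). With LAM enabled,
/// hardware performs the address part of this implicitly on dereference.
pub fn split_meta(tagged: u64) -> (u64, u16) {
    let meta = ((tagged >> META_SHIFT) & META_BITS) as u16;
    let addr = tagged & ((1u64 << META_SHIFT) - 1);
    (addr, meta)
}
```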
Comparison with MTE:
MTE (ARM): 4-bit tag per 16-byte granule. Hardware CHECKS on every access.
Detects use-after-free, buffer overflow at runtime. ~128 cycles per
page allocation for tag setup. Zero-cost access checks (pipelined).
LAM (x86): 6-15 metadata bits per pointer. NO hardware checking — metadata is
simply ignored by the MMU. Software must perform its own checks.
Zero overhead. Useful for tooling metadata, not for runtime safety.
Result: MTE provides stronger guarantees (hardware-enforced); LAM provides
more flexible metadata embedding. UmkaOS uses both where available.
Integration: the memory allocator stores metadata in LAM bits. Debug builds use these bits for KASAN-equivalent checking. Release builds can optionally use them for capability hints.
Security caveat: Intel LAM has been disabled in the Linux kernel since v6.12 due to the SLAM attack (Spectre-based exploitation of LAM metadata bits without LASS protection). UmkaOS does not enable LAM unless LASS (Linear Address Space Separation) is also available on the CPU. On CPUs without LASS, the upper address bits described above are not used for metadata; KASAN-equivalent checking uses shadow memory instead. When both LAM and LASS are present, LAM is enabled with the protections described above.
2.23.5 AArch64 Pointer Authentication (PAC)¶
AArch64 provides Pointer Authentication Codes (PAC, ARMv8.3+) as a complementary mechanism to MTE. PAC signs pointers with a cryptographic MAC using a per-process key, detecting pointer forgery and corruption:
PAC in UmkaOS:
- Return address signing: PACIASP/AUTIASP in function prologue/epilogue.
Compiler-inserted via -mbranch-protection=pac-ret+leaf.
- Detects ROP (Return-Oriented Programming) attacks: corrupted return
addresses fail authentication and trap.
- Cost: ~1 cycle per PAC/AUT instruction (pipelined). Zero memory overhead.
- Available on: Apple M1+, AWS Graviton 3+, Cortex-A710+.
UmkaOS enables PAC for all kernel code on capable hardware. This is orthogonal
to MTE (MTE detects memory safety bugs; PAC detects control-flow hijacking).
2.23.6 CHERI (Future)¶
ARM Morello (CHERI prototype) demonstrates hardware-capability pointers with bounds checking. CHERI pointers are 128-bit: address (64) + bounds (32) + permissions (16) + flags (16). Every pointer carries its own bounds and permission information. Hardware checks on every dereference.
UmkaOS's capability system (Section 9.1) is a software capability model. CHERI provides a hardware capability model. When CHERI hardware is available:
Software capabilities (current):
Kernel maintains capability table. Validated on syscall.
Overhead: ~5-10 cycles per capability check (bitmask test).
CHERI hardware capabilities (future):
Pointer IS the capability. Hardware validates on every access.
Overhead: 0 cycles (pipelined with memory access).
UmkaOS's capability tokens become hardware CHERI capabilities.
The translation is natural: both use unforgeable tokens with
bounded permissions and delegation rules.
Design for CHERI readiness: the capability system should NOT assume that capabilities are always validated in software. The validation path should be abstractable so that CHERI hardware validation can replace software validation.
CHERI Morello Status:
ARM Morello evaluation boards shipped in 2022 (based on Neoverse N1 + CHERI extensions). As of 2026, production CHERI hardware is not available. The CHERI readiness design in Section 2.23 prepares for future hardware without depending on it. When production CHERI SoCs ship, the capability validation abstraction layer enables a transition from software to hardware capability checks.
2.23.7 Performance Impact¶
MTE on ARM (when enabled): ~128 cycles per page allocation (~40% of allocator hot path). Memory access checks are hardware-pipelined: zero overhead. Linux pays the same cost when MTE is enabled.
MTE disabled (default on x86, optional on ARM): zero overhead. No code runs.
Intel LAM: zero runtime overhead (address masking is free in hardware).
CHERI (future): zero overhead (hardware-pipelined capability checks).
2.23.8 Hardware Fault Handler Constraints¶
Hardware fault handlers (machine check exceptions, bus errors, SError, NMI, system error interrupts) operate in extremely constrained contexts where normal kernel operations are forbidden. Violating these constraints causes deadlock, system hang, or recursive faults.
2.23.8.1 Fault Handler Categories¶
Hardware fault handlers fall into three categories with progressively stricter constraints:
| Category | Examples | Context | Permitted Operations |
|---|---|---|---|
| Maskable interrupts | Timer tick, device IRQ | IRQ context, interrupts disabled | Try-lock, lock-free writes, deferred work |
| Synchronous faults | Page fault, alignment fault, breakpoint | Fault context, preemptible | Blocking locks (with care), allocation (with care) |
| Non-maskable faults | Machine Check (MCE), NMI, SError, Bus Error, System Reset | NMI context, all interrupts blocked | Lock-free only, per-CPU buffers, no locks |
The critical distinction: maskable interrupts can be delayed by disabling interrupts, but non-maskable faults fire regardless of interrupt state. Code holding a spinlock cannot prevent an MCE or NMI from occurring.
2.23.8.2 Non-Maskable Fault Handler Requirements¶
Non-maskable fault handlers (MCE, NMI, SError, Bus Error, System Reset vectors) MUST follow these rules:
1. No blocking operations. The handler MUST NOT:
- Acquire a spinlock with blocking semantics (lock() / spin_lock())
- Acquire a mutex, rwlock, or semaphore
- Allocate memory (kmalloc, vmalloc, page allocation)
- Sleep or yield (schedule(), wait(), condvar)
- Perform I/O that may block (disk, network)
- Call any function that may transitively do the above
Rationale: The fault may have interrupted code already holding locks. If the handler blocks waiting for the same lock, deadlock occurs immediately.
2. Try-lock only, with fallback. If the handler needs a lock, it MUST use
try-lock (try_lock() / spin_trylock()) and handle failure:
if lock.try_lock() {
// critical section
lock.unlock();
} else {
// Fallback: cannot acquire lock
// Options: log to per-CPU buffer and continue, force reboot, degrade gracefully
}
3. Per-CPU buffers for logging. NMI/MCE handlers MUST NOT write to shared ring buffers (MPSC, printk). Instead, use a pre-allocated per-CPU buffer:
Data types used by the MCE log:
/// Severity classification of a machine-check event.
#[repr(u32)]
enum MceSeverity {
Corrected = 0, // Hardware corrected; no data loss
Recoverable = 1, // Software-recoverable with page offlining
Fatal = 2, // Unrecoverable; system must reboot
}
/// One entry in the per-CPU MCE ring log.
/// Padded to 64 bytes (one cache line) so that array elements never span cache line
/// boundaries. This prevents false sharing when a remote monitoring thread reads
/// the log while the NMI handler writes it.
///
/// Torn-read detection uses a seqcount-style generation counter (`gen`):
/// the writer sets `gen` to an odd value before writing fields, then to the
/// next even value after writing. A reader that observes an odd `gen` or a
/// changed `gen` between its two reads has caught a torn write and must retry.
/// MCE log slot: seqcount generation counter + data payload in one cache line.
///
/// The generation counter (`gen`) is stored separately from the data payload
/// to avoid conflicts between the seqcount protocol and Rust's type system.
/// `AtomicU32` is `!Copy`, so the payload struct is `Copy` (for ring buffer
/// reads) while `gen` is accessed through the `MceLogSlot` wrapper.
#[repr(C, align(64))]
struct MceLogSlot {
gen: AtomicU32, // Generation counter (odd = write in progress, even = stable)
_pad_gen: [u8; 4],
data: MceLogData, // 48 bytes of payload
}
/// MCE data payload. Copy-able for reader-side snapshot.
#[repr(C)]
#[derive(Copy, Clone)]
struct MceLogData {
timestamp_tsc: u64, // TSC at time of MCE
bank: u8, // MCE bank number
_pad0: [u8; 7],
status: u64, // MCi_STATUS MSR value
address: u64, // MCi_ADDR MSR value (if valid)
misc: u64, // MCi_MISC MSR value (if valid)
severity: MceSeverity, // 4 bytes (repr(u32))
_pad1: [u8; 4], // Pad data to 48 bytes (align severity to slot boundary)
}
// Total slot size: 4 (gen) + 4 (_pad_gen) + 48 (data) = 56 bytes, padded to 64 by align(64).
const_assert!(core::mem::size_of::<MceLogData>() == 48);
const_assert!(core::mem::size_of::<MceLogSlot>() == 64);
impl MceLogSlot {
const EMPTY: Self = Self {
gen: AtomicU32::new(0), _pad_gen: [0; 4],
data: MceLogData {
timestamp_tsc: 0, bank: 0, _pad0: [0; 7],
status: 0, address: 0, misc: 0,
severity: MceSeverity::Corrected, _pad1: [0; 4],
},
};
}
impl MceLogData {
/// Construct payload from an `MceContext` snapshot.
/// Called in NMI context — no allocation, no locks.
fn from_ctx(ctx: &MceContext) -> Self {
Self {
timestamp_tsc: arch::current::cpu::read_timestamp(),
bank: ctx.bank,
_pad0: [0; 7],
status: ctx.status,
address: ctx.address,
misc: ctx.misc,
severity: ctx.severity,
_pad1: [0; 4],
}
}
}
/// Machine Check Exception context. Populated by the arch-specific MCE vector
/// handler before calling the generic `mce_handler()`. Each architecture reads
/// its error reporting registers into this common struct:
/// - x86-64: IA32_MCi_STATUS, IA32_MCi_ADDR, IA32_MCi_MISC MSRs
/// - AArch64: ERXSTATUS_EL1, ERXADDR_EL1, ERXMISC0_EL1 (RAS extension)
/// - ARMv7: DFSR/DFAR (imprecise data abort), ADFSR (auxiliary fault status)
/// - RISC-V: platform-specific (SiFive CECC/UECC, no standard yet)
/// - PPC32: MCSR, MCAR, MCSRR0/MCSRR1
/// - PPC64LE: SRR0/SRR1, DSISR, DAR
pub struct MceContext {
/// MCE bank number (0-based). x86: IA32_MCi_*, ARM: ERR<n>*.
pub bank: u8,
/// Error status register value.
/// x86: IA32_MCi_STATUS, ARM: ERXSTATUS_EL1, PPC: MCSR.
pub status: u64,
/// Faulting physical address (if `addr_valid` is true).
/// x86: IA32_MCi_ADDR, ARM: ERXADDR_EL1, PPC: MCAR/DAR.
pub address: u64,
/// Miscellaneous error information.
/// x86: IA32_MCi_MISC, ARM: ERXMISC0_EL1, PPC: 0 (unused).
pub misc: u64,
/// Severity classification (after firmware-first filtering on platforms
/// that support it — ACPI APEI/GHES on x86, SDEI on AArch64).
pub severity: MceSeverity,
/// Whether the error has been corrected by hardware (CE).
/// false = uncorrected error (UCE), may require page offlining or panic.
pub corrected: bool,
/// Whether `address` contains a valid faulting physical address.
/// Some error types (e.g., bus timeout) do not report an address.
pub addr_valid: bool,
}
/// Per-CPU MCE log with head counter and ring buffer.
struct MceLog {
head: AtomicU64, // Monotonically increasing write index
slots: [MceLogSlot; 64], // Ring buffer (indexed by head % 64)
}
impl MceLog {
const fn new() -> Self {
Self { head: AtomicU64::new(0), slots: [MceLogSlot::EMPTY; 64] }
}
}
// Allocated at boot, one per CPU, never freed.
static MCE_LOG: PerCpu<MceLog> = PerCpu::new(MceLog::new());
// Per-CPU re-entry guard for recursive MCE detection. Incremented on handler
// entry, decremented on exit. If > 0 on entry, a recursive MCE has occurred
// and the handler must halt immediately to prevent infinite recursion.
static MCE_NESTING: PerCpu<AtomicU32> = PerCpu::new(AtomicU32::new(0));
// In MCE handler (NMI context):
fn mce_handler(ctx: &MceContext) {
let log = MCE_LOG.this_cpu();
// Per-CPU: exactly one producer (this CPU's NMI handler), no concurrent writers.
// load(Relaxed) is safe because only this CPU writes head.
let count = log.head.load(Relaxed);
let idx = count as usize % 64;
let slot = &log.slots[idx];
// Seqcount protocol: mark slot as "write in progress" (odd gen).
// A reader that observes an odd gen or a changed gen must retry.
// gen is AtomicU32 to avoid data races with concurrent readers on
// the drain path. Release/Acquire ordering ensures visibility on all
// architectures (DMB on ARM, FENCE on RISC-V, lwsync on PPC, BCR on s390x).
let prev_gen = slot.gen.load(Relaxed);
slot.gen.store(prev_gen.wrapping_add(1), Release); // odd = write in progress
// Write data payload only — gen is NOT inside MceLogData, so the write
// cannot overwrite the generation counter we just set to odd.
// SAFETY: Single writer (per-CPU NMI), gen protects concurrent readers.
unsafe { core::ptr::write_volatile(&slot.data as *const _ as *mut MceLogData,
MceLogData::from_ctx(ctx)); }
slot.gen.store(prev_gen.wrapping_add(2), Release); // even = write complete
// ORDERING: Release store on head publishes the entry. Any thread that
// subsequently reads head with Acquire will observe the entry write.
log.head.store(count + 1, Release);
// Handler returns; main kernel drains log later
}
// Drain path (runs on a per-CPU workqueue thread, preemption disabled during
// the swap(0)+iterate critical section — this ensures no interruption between
// claiming entries and processing them). Called periodically by the MCE
// poller workqueue or on-demand after a burst of correctable errors.
// Scheduling context: TASK context, sleepable BEFORE the swap, non-preemptible
// DURING the swap+iterate window (local_irq_save around the critical section
// is NOT needed because NMI handlers use the seqcount protocol to avoid
// tearing; disabling preemption suffices to prevent this CPU from being
// rescheduled mid-drain).
fn drain_mce_log(log: &MceLog) {
// Use swap instead of load+store(0) to atomically capture AND reset head.
// This prevents losing entries from an MCE that fires between load and store.
let count = log.head.swap(0, AcqRel);
// AcqRel: Acquire ensures prior entry writes are visible; Release publishes
// the reset (head=0) so a concurrent MCE handler sees the new base.
    // Iterate the ring oldest-first, from (count - n) up to (count - 1):
    // reading newest-first would lose chronological ordering.
let n = core::cmp::min(count, 64); // At most 64 slots in the ring
// Head counter uses wrapping arithmetic to handle u64::MAX wrap-around.
// Always use wrapping_sub() when computing the distance between two head values.
// (Note: `start_idx` is a head index, not a per-slot generation counter.)
let start_idx = count.wrapping_sub(n);
for i in 0..n {
let i = start_idx.wrapping_add(i);
let slot = &log.slots[i as usize % 64];
// Seqcount read: check gen before and after copying data.
loop {
let gen1 = slot.gen.load(Acquire);
if gen1 & 1 != 0 { core::hint::spin_loop(); continue; } // odd = write in progress
let data: MceLogData = unsafe { core::ptr::read_volatile(&slot.data) };
let gen2 = slot.gen.load(Acquire);
if gen1 == gen2 { /* data is consistent — process it */ break; }
// gen changed — torn read, retry
}
}
}
Race window: A narrow race exists between `head.swap(0)` and the drain loop. An MCE arriving after the swap writes to `slots[0]` while the drain may be reading an entry at the same index (via modular arithmetic when the ring was full). Mitigation: each entry carries a seqcount-style generation counter (`gen`). The drain reads `gen` before and after reading the entry fields: if `gen_before` is odd (write in progress) or `gen_after != gen_before` (torn write), the drain skips that entry and logs a warning. The skipped MCE is not lost — the hardware MCE bank registers retain the error until explicitly cleared (software must write zero to MCi_STATUS via `wrmsr`; hardware does NOT auto-clear corrected errors), so the next drain cycle will re-read it from the hardware banks.
Double-read scenario: When a collision occurs (the NMI writes slot N while the drain is reading slot N), the seqcount causes the drain to spin until the NMI finishes, then read the NEW entry (which overwrote the OLD entry at that slot). The OLD entry's data is lost from the ring but persists in the hardware banks for the next poll cycle. The NEW entry is read during this drain AND will be picked up again on the next drain cycle (potential double-counting). Deduplication: drain cycles compare `MceLogData` fields (bank, status, addr, misc) against the previous drain's output to suppress duplicate reports. A simple ring buffer of the last 64 drained (bank, status) pairs suffices for dedup.
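The dedup described above can be sketched as a small fixed-size ring of recently drained keys. The struct and method names are hypothetical; the sketch widens the key to the full (bank, status, addr, misc) tuple so distinct errors on the same bank are not conflated:

```rust
/// Ring of the last 64 drained MCE keys, used to suppress duplicate reports
/// caused by the double-read scenario.
pub struct MceDedup {
    seen: [(u8, u64, u64, u64); 64],
    next: usize, // next slot to overwrite
    len: usize,  // number of valid entries (saturates at 64)
}

impl MceDedup {
    pub const fn new() -> Self {
        Self { seen: [(0, 0, 0, 0); 64], next: 0, len: 0 }
    }

    /// Returns true if this event is new (report it), false if it duplicates
    /// an event seen in a recent drain cycle (suppress it).
    pub fn check_and_record(&mut self, bank: u8, status: u64, addr: u64, misc: u64) -> bool {
        let key = (bank, status, addr, misc);
        if self.seen[..self.len].contains(&key) {
            return false; // duplicate from an earlier drain — suppress
        }
        self.seen[self.next] = key;      // overwrite oldest slot when full
        self.next = (self.next + 1) % 64;
        if self.len < 64 { self.len += 1; }
        true
    }
}
```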
The main kernel drains these buffers after returning from the exception, outside NMI context.
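The dedup ring suggested above (the last 64 drained (bank, status) pairs) can be sketched as a plain Rust type. This is a hypothetical user-space illustration; `DedupRing` and `check_and_record` are not kernel names, and the real structure would live in per-CPU drain state:

```rust
/// Sketch of the drain-side dedup ring from Section 2.23.8.2: remembers the
/// last 64 drained (bank, status) pairs and suppresses repeats caused by the
/// re-read-from-hardware-banks path. Hypothetical helper, not the kernel API.
struct DedupRing {
    entries: [(u8, u64); 64], // (bank, MCi_STATUS) pairs
    len: usize,               // number of valid entries
    next: usize,              // next slot to overwrite (ring head)
}

impl DedupRing {
    const fn new() -> Self {
        DedupRing { entries: [(0, 0); 64], len: 0, next: 0 }
    }

    /// Returns true if this (bank, status) pair was already drained recently.
    /// Otherwise records it and returns false (caller should report the MCE).
    fn check_and_record(&mut self, bank: u8, status: u64) -> bool {
        if self.entries[..self.len].contains(&(bank, status)) {
            return true; // duplicate: suppress the report
        }
        self.entries[self.next] = (bank, status);
        self.next = (self.next + 1) % 64;
        self.len = (self.len + 1).min(64);
        false
    }
}
```

Because the ring only has to suppress re-reports across adjacent drain cycles, fixed capacity with oldest-first overwrite is sufficient; a stale pair falling out of the ring merely costs one duplicate log line.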
4. No locks at all for NMI. NMI handlers specifically MUST NOT use any locks, even try-lock. The NMI can nest inside an MCE handler that already holds the lock, causing deadlock. NMI handlers use only:
- Per-CPU variables (no sharing)
- Lock-free atomic operations (atomic read/write, compare-and-swap)
- Pre-mapped memory (no page faults possible)
5. Pre-allocated resources. All memory, buffers, and stacks used by NMI/MCE handlers MUST be allocated at boot time. Allocation during handler execution is forbidden. On x86-64, MCE handlers run on a dedicated IST (Interrupt Stack Table) stack, pre-allocated and never paged.
2.23.8.3 Deferred Recovery Actions¶
Any recovery action that might block MUST be deferred to a workqueue or tasklet:
MCE handler (NMI context):
1. Capture fault context to per-CPU buffer (lock-free)
2. Assess severity: recoverable vs. fatal
3. If recoverable:
a. Log to per-CPU buffer
b. Set flag: NEEDS_RECOVERY = true
c. Return from exception
4. If fatal:
a. Log to per-CPU buffer
b. Trigger immediate reboot (no locking)
Workqueue (thread context, after NMI returns):
1. Check NEEDS_RECOVERY flag
2. If set:
a. Drain per-CPU MCE log to kernel log (may block)
b. Initiate memory offlining (may block)
c. Notify userspace via netlink (may block)
d. Clear NEEDS_RECOVERY flag
The workqueue runs in normal thread context where blocking operations are safe. The NMI handler does the minimum work needed to capture state and flag the need for recovery.
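The flag handoff between the two contexts can be sketched with a single `AtomicBool` (the kernel's flag is per-CPU; a global stands in here). One deliberate variant from the pseudocode above: the workqueue clears the flag with `swap` at the check rather than after draining, so a request arriving mid-drain is never lost; it simply triggers one more drain on the next poll. Names are illustrative:

```rust
use std::sync::atomic::{AtomicBool, Ordering::{AcqRel, Release}};

/// Sketch of the NEEDS_RECOVERY handoff in 2.23.8.3. In the kernel this flag
/// is per-CPU; a single global AtomicBool stands in here.
static NEEDS_RECOVERY: AtomicBool = AtomicBool::new(false);

/// NMI side (step 3b): lock-free, no allocation, no blocking.
fn mce_flag_recovery() {
    NEEDS_RECOVERY.store(true, Release);
}

/// Workqueue side (steps 1-2): returns true if recovery work should run.
fn workqueue_poll() -> bool {
    // swap(false) atomically reads-and-clears in one step; the Acquire half
    // of AcqRel pairs with the NMI's Release store, making the per-CPU log
    // writes visible to the drainer before it starts draining.
    NEEDS_RECOVERY.swap(false, AcqRel)
}
```

Compared with a separate load-then-store, the single `swap` removes the window in which an MCE could set the flag between the workqueue's check and its clear.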
2.23.8.4 Architecture-Specific Fault Types¶
| Architecture | Non-Maskable Fault Types | Vector / Entry Point |
|---|---|---|
| x86-64 | Machine Check Exception (#MC), NMI | IDT vector 18 (MCE), vector 2 (NMI) |
| AArch64 | SError Interrupt, Physical IRQ (FIQ) | VBAR_EL1 offset 0x380 (SError, Current EL with SPx) |
| ARMv7 | Data Abort (imprecise), FIQ | VBAR offset 0x1C (FIQ), 0x10 (Data Abort) |
| RISC-V 64 | NMI (platform-specific) | Platform-defined; often traps to mtvec in M-mode |
| PPC32 | Machine Check, Critical Interrupt | IVOR[1] (MCE), IVOR[0] (Critical) |
| PPC64LE | Machine Check, System Reset | HSRR0/HSRR1 vectors, LPCR-defined |
| s390x | Machine Check (MCK), Malfunction Alert | PSW swap at fixed locations (flc + 0xE0 for MCK); SIGP for malfunction alert |
| LoongArch64 | Machine Error Exception | CSR.EENTRY + CSR.ECFG exception vector; NMI via CSR.TLBRENTRY path |
All handlers for these vectors MUST follow the non-maskable fault handler requirements in Section 2.23.
2.23.8.5 Recursive Fault Prevention¶
Hardware fault handlers MUST prevent recursive faults:
1. Guard pages. Handler stacks have guard pages (unmapped) at both ends. Stack overflow causes an immediate fault rather than corrupting adjacent memory.
2. Handler re-entry detection. Each handler checks a per-CPU flag on entry:
fn mce_handler(ctx: &MceContext) {
let nesting = MCE_NESTING.this_cpu().fetch_add(1, Relaxed);
if nesting > 0 {
// Already in MCE handler — recursive fault.
// Cannot log (might fault again), cannot recover.
// Immediate halt to prevent infinite recursion.
arch::halt_loop();
}
// ... normal handler logic ...
//
// Use fetch_sub (not store(false)) to avoid a race window:
// store(false) + iret leaves a gap where a second MCE sees the flag
// clear while the first handler is still returning. fetch_sub(1)
// atomically decrements; a concurrent MCE that increments to 2 will
// see nesting > 0 and halt, regardless of timing.
MCE_NESTING.this_cpu().fetch_sub(1, Release);
}
3. Pre-pinned code. Handler code and data pages are pinned in memory (never paged out). A page fault during NMI/MCE handling would cause a double fault.
2.24 Clock Framework¶
Device I/O subsystems — SPI, I2C, UART, PCIe reference clock, USB PHY, camera sensor, audio codec — require configurable clocks: gates (on/off), dividers (frequency reduction), multiplexers (source selection), and PLLs (frequency synthesis). Without a formal clock framework, each driver hard-codes controller register offsets, misses clock-gating power savings, and cannot express clock dependencies between devices.
Cross-references: device probe sequence (Section 11.4), regulator framework (Section 13.27), per-device runtime PM (Section 7.5), SPI bus (Section 13.20).
2.24.1 Design: Typed Clock Tree vs Linux CCF¶
Linux Common Clock Framework (CCF): Clock nodes are generic structs with a
clk_ops function-pointer table and untyped void * driver data. Enable/disable
reference counting uses two separate locks (the prepare_lock mutex and the
enable_lock spinlock), leading to a complex two-phase enable protocol.
UmkaOS clock framework:
- Typed nodes: `ClkKind` enum with concrete variants (`Fixed`, `Gate`, `Divider`, `Mux`, `Pll`) — no generic ops table, no void pointers, no accidental wrong-ops-table installation.
- RAII consumers: `ClkHandle` is an Arc-wrapped consumer following the RAII pattern used throughout UmkaOS (Section 3.5). Dropping it decrements the enable refcount; if the count reaches 0, the hardware gate is closed automatically. Clock enable leaks are impossible (forgetting to call `clk_disable` is not possible — the clock disables when the last handle is dropped).
- Atomic enable refcount + per-node MMIO lock: `enable_count: AtomicU32` tracks the consumer refcount without a spinlock. The hardware gate register is written only on 0→1 (enable) and 1→0 (disable) transitions, serialized by the per-node MMIO lock (implicit in the gate-write path). The `children` SpinLock protects only the children ArrayVec (registration-time mutation). No two-phase protocol. Lock ordering: the children SpinLock is always acquired root-to-leaf; the MMIO lock is per-node (never held across nodes).
- Frozen tree: the clock tree is populated during boot and frozen before drivers probe. After Phase 4.4b, the normal `clk_provider_register()` path is closed. Hot-pluggable clock controllers loaded after freeze use the `clk_provider_register_late()` path (see Late Clock Provider Registration below). Drivers call `clk_get`/`enable`/`set_rate` — these work on both boot-registered and late-registered clocks. Registration order is: fixed oscillators first, then PLLs, then dividers/muxes, then gates — ensuring parent nodes exist before children reference them. Cycles in clock dependencies are detected during registration (parent walk) and cause a boot panic. Clock orphans (registered nodes with no consumer) are harmless and remain gated.
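The 0→1 / 1→0 gate-write discipline described above can be sketched in user-space Rust, with a `Mutex<bool>` standing in for the per-node MMIO lock plus gate register. This illustrates the refcount transitions under stated assumptions; it is not the framework's code:

```rust
use std::sync::atomic::{AtomicU32, Ordering::AcqRel};
use std::sync::Mutex;

/// Sketch of the "atomic refcount + per-node MMIO lock" enable path.
/// The MMIO write is modeled as flipping a bool under the per-node lock;
/// real code writes the gate bit via MmioReg32. Names are illustrative.
struct GateNode {
    enable_count: AtomicU32,
    mmio_lock: Mutex<bool>, // stands in for the gate register (true = ungated)
}

impl GateNode {
    fn enable(&self) {
        // fetch_add returns the PREVIOUS count: only the 0→1 transition
        // touches hardware; the MMIO lock serializes that gate write against
        // a concurrent disable's 1→0 write.
        if self.enable_count.fetch_add(1, AcqRel) == 0 {
            *self.mmio_lock.lock().unwrap() = true; // open the gate
        }
    }

    fn disable(&self) {
        if self.enable_count.fetch_sub(1, AcqRel) == 1 {
            *self.mmio_lock.lock().unwrap() = false; // close the gate
        }
    }

    fn is_ungated(&self) -> bool {
        *self.mmio_lock.lock().unwrap()
    }
}
```

Note that the refcount itself never takes the lock; only the two boundary transitions do, which is the whole point of the single-phase design.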
2.24.2 Core Types¶
/// A consumer's handle to a clock node.
///
/// Holding a ClkHandle keeps the clock's enable refcount elevated.
/// On drop: if `enabled` is true, decrements enable_count; if count reaches 0,
/// the hardware gate register is written to disable the clock automatically.
pub struct ClkHandle {
node: Arc<ClkNode>,
enabled: bool, // Whether this handle has called enable()
}
/// A node in the clock tree.
pub struct ClkNode {
pub name: &'static str,
pub kind: ClkKind,
pub parent: Option<Weak<ClkNode>>,
pub children: SpinLock<ArrayVec<Arc<ClkNode>, CLK_MAX_CHILDREN>>,
/// Count of enabled consumers + enabled children.
/// 0 = clock is gated. Incremented by each ClkHandle::enable().
pub enable_count: AtomicU32,
/// Current output frequency in Hz. 0 = gated (no output).
pub rate_hz: AtomicU64,
}
/// Maximum child clock nodes per parent.
pub const CLK_MAX_CHILDREN: usize = 32;
// SAFETY: ClkNode is Send+Sync because:
// 1. Raw pointers in ClkKind variants (enable_reg, div_reg via MmioReg32) are
// MMIO register virtual addresses valid for the system lifetime (ioremap'd
// at boot, never freed). MmioReg32 itself is explicitly Send+Sync.
// 2. The `children` SpinLock guards the children ArrayVec (add/remove during
// registration and late registration only). MMIO register programming
// (enable/disable gate writes, set_rate divider/PLL writes) is serialized
// by a SEPARATE per-node lock implicit in the enable/set_rate paths — NOT
// by the children SpinLock. The enable/disable path uses AtomicU32
// enable_count for refcounting (no spinlock needed for the refcount itself)
// and acquires the MMIO lock only when the count transitions 0→1 or 1→0
// (gate register write). The set_rate path acquires the MMIO lock on the
// target node for register programming.
// **Lock ordering**: children SpinLock is acquired in root-to-leaf order
// (parent before child) during rate propagation. The MMIO lock is per-node
// and never held while walking the tree. enable() uses only AtomicU32
// enable_count (no tree walk under lock). No deadlock between enable() and
// set_rate() is possible because they use different synchronization primitives.
// 3. Read-only fields (name, kind variant, parent) are immutable after registration.
// 4. ClkNode is stored in Arc<ClkNode> in the global CLK_TREE — Arc requires
// T: Send + Sync.
unsafe impl Send for ClkNode {}
unsafe impl Sync for ClkNode {}
/// Typed wrapper for a volatile MMIO register pointer. Constructable only via
/// `ioremap()`. The wrapped pointer is a virtual address that is valid for the
/// lifetime of the system (ioremap'd at boot, never freed).
///
/// `MmioReg32` is explicitly `Send + Sync`:
/// - The underlying VA is stable (the ioremap mapping is permanent).
/// - All writes are serialized by the per-node MMIO lock in `ClkNode`
///   (not the `children` SpinLock).
/// - Reads are volatile and produce no data races at the hardware level.
#[repr(transparent)]
pub struct MmioReg32(*mut u32);
// SAFETY: MMIO register virtual addresses are valid for the system lifetime
// (ioremap'd at boot, never freed). All writes are serialized by the
// per-ClkNode MMIO lock; reads are volatile and do not race at the Rust
// abstract machine level.
unsafe impl Send for MmioReg32 {}
unsafe impl Sync for MmioReg32 {}
/// Clock node type. Determines which hardware operations are performed.
pub enum ClkKind {
/// Fixed-frequency oscillator (crystal, ceramic resonator, RC oscillator).
/// Rate never changes; no hardware registers to write.
Fixed {
rate_hz: u64,
},
/// Clock gate: enable or disable the clock output.
/// Rate is inherited from parent unchanged.
Gate {
/// MMIO register for the gate control bit (virtual address via ioremap).
enable_reg: MmioReg32,
/// Bit position within `enable_reg`.
enable_bit: u8,
/// Write this value (0 or 1) to enable the gate.
/// Set to 0 for active-low gates (write 0 to enable, 1 to disable).
enable_val: u8,
},
/// Integer divider: output_rate = parent_rate / divisor.
Divider {
/// MMIO register for the divider field (virtual address via ioremap).
div_reg: MmioReg32,
div_shift: u8,
div_width: u8, // Number of bits in the divisor field
/// Optional lookup table for non-linear divisors.
/// If Some, the divisor register value indexes into this table.
div_table: Option<&'static [u32]>,
},
/// Clock multiplexer: selects one parent clock as the output.
/// Changing the mux selection changes the output rate to the selected parent's rate.
Mux {
/// MMIO register for the mux selector (virtual address via ioremap).
mux_reg: MmioReg32,
mux_shift: u8,
mux_width: u8,
parents: ArrayVec<Weak<ClkNode>, CLK_MAX_CHILDREN>,
},
/// Phase-Locked Loop: synthesizes a high-frequency clock from a reference.
/// Output frequency = (parent_rate * M) / (N * output_div) for integer PLLs,
/// with fractional support depending on the SoC's PLL hardware.
/// The `set_rate` algorithm finds the (M, N, output_div) triple that produces
/// the closest achievable rate to the target, within the PLL's VCO frequency
/// range (typically 600 MHz - 3.2 GHz; SoC-specific).
Pll {
/// SoC-specific PLL register layout. The arch module provides this struct.
pll_regs: PllRegs,
/// Reference (input) clock frequency in Hz.
ref_rate_hz: u64,
},
}
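As an illustration of the per-kind rate rules (Fixed: constant; Gate: pass-through; Divider: parent rate divided by a linear or table-indexed divisor), here is a register-free sketch. `KindRate` and `recalc_rate` are hypothetical mirrors of `ClkKind`, taking an already-read register field instead of an `MmioReg32`:

```rust
/// Simplified, register-free mirror of ClkKind for illustrating the rate rules.
enum KindRate {
    Fixed { rate_hz: u64 },
    Gate, // passes the parent rate through unchanged
    Divider { div_field: u32, div_table: Option<&'static [u32]> },
}

/// Compute a node's output rate from its parent's rate.
fn recalc_rate(kind: &KindRate, parent_hz: u64) -> u64 {
    match kind {
        KindRate::Fixed { rate_hz } => *rate_hz,
        KindRate::Gate => parent_hz,
        KindRate::Divider { div_field, div_table } => {
            // With a table, the register field indexes the divisor list
            // (non-linear divisors); otherwise the field IS the divisor.
            let div = match div_table {
                Some(table) => table[*div_field as usize] as u64,
                None => *div_field as u64,
            };
            if div == 0 { 0 } else { parent_hz / div }
        }
    }
}
```

Mux and Pll are omitted: a mux's rate is simply its selected parent's rate, and the PLL M/N/output_div search is SoC-dependent.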
/// PLL register block descriptor. Platform-specific MMIO layout.
/// Each SoC's clock driver constructs this from device tree or ACPI tables
/// during clock tree initialization. The generic clock framework uses these
/// fields to read/write PLL parameters without SoC-specific code paths.
pub struct PllRegs {
/// MMIO base address of the PLL register block.
pub base: PhysAddr,
/// M divider (feedback divider) register field.
/// PLL output = (parent_rate * M) / (N * output_div).
pub m_div: RegField,
/// N divider (reference divider) register field.
pub n_div: RegField,
/// Output divider register field.
pub output_div: RegField,
/// Lock status bit. PLL is stable when this bit reads 1.
/// After writing M/N/output_div, the clock framework polls this bit
/// with a timeout (default 1 ms, SoC-overridable) before declaring
/// the PLL locked.
pub lock_bit: RegField,
/// Enable bit. Write 1 to power on the PLL, 0 to power off.
pub enable_bit: RegField,
}
/// Describes a single bit-field within an MMIO register.
/// Used by PllRegs and other hardware descriptor structs.
pub struct RegField {
/// Byte offset from the parent struct's base address.
pub offset: u16,
/// Bit position within the register (0 = LSB).
pub shift: u8,
/// Field width in bits (e.g., 10 for a 10-bit divider).
pub width: u8,
}
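The arithmetic implied by `RegField` (a mask of `width` bits applied at `shift`) can be sketched as extract/insert helpers. `Field`, `extract`, and `insert` are hypothetical names; the framework's actual accessors are internal to the clock core:

```rust
/// RegField mirror showing the (shift, width) field arithmetic:
/// mask = (1 << width) - 1, positioned at `shift` within the register.
/// Hypothetical helpers, not the kernel's accessor API.
struct Field {
    shift: u8,
    width: u8,
}

impl Field {
    fn mask(&self) -> u32 {
        // Computed in u64 so width == 32 does not overflow the shift.
        ((1u64 << self.width) - 1) as u32
    }

    /// Read the field value out of a raw register word.
    fn extract(&self, reg: u32) -> u32 {
        (reg >> self.shift) & self.mask()
    }

    /// Return `reg` with the field replaced by `val` (other bits preserved).
    fn insert(&self, reg: u32, val: u32) -> u32 {
        let m = self.mask() << self.shift;
        (reg & !m) | ((val << self.shift) & m)
    }
}
```

A PLL rate change is then three `insert` calls (M, N, output_div) followed by polling `extract` on the lock bit until it reads 1 or the timeout expires.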
2.24.3 Consumer API¶
Drivers use the consumer API exclusively. Clock provider registration is internal to the clock framework and is not exposed to driver code.
KABI access: Isolated drivers (Tier 1 and Tier 2) do not have direct access to
DeviceNode — they hold opaque DeviceHandle tokens. The clock consumer API is
exposed to drivers through KernelServicesVTable clock methods (clk_get,
clk_enable, clk_disable, clk_get_rate, clk_set_rate, clk_release) defined
in Section 11.6. The kernel translates
DeviceHandle → DeviceNode internally before calling the functions below.
/// Look up a clock by name and return a consumer handle.
///
/// `name` is the clock's name as declared in the device tree `clock-names` property
/// (e.g., `"bus"`, `"ref"`, `"if"`) or as registered by ACPI/platform clock code.
///
/// The handle keeps the clock node alive (Arc reference) but does not enable it.
/// Call `handle.enable()` before using the clock.
pub fn clk_get(device: &DeviceNode, name: &str) -> Result<ClkHandle, KernelError>;
impl ClkHandle {
    /// Enable this clock. Increments enable_count on this node and all parent
    /// nodes. The walk ascends from this node to the root; gate writes are
    /// applied parent-first so the signal chain is valid before this node ungates.
///
/// Idempotent: calling enable() on an already-enabled handle is a no-op.
/// Thread-safe: may be called concurrently from multiple threads.
pub fn enable(&mut self) -> Result<(), KernelError>;
/// Disable this clock. Decrements enable_count on this node.
/// If enable_count reaches 0, the hardware gate is closed.
/// Does NOT disable parent nodes (parent may have other enabled consumers).
///
/// Idempotent: calling disable() on an already-disabled handle is a no-op.
pub fn disable(&mut self);
/// Get the current output frequency of this clock in Hz.
/// Returns 0 if the clock is gated.
pub fn get_rate(&self) -> u64;
/// Request a rate change on this clock.
///
/// The framework walks up the clock tree from this node, adjusting dividers
/// and PLLs to achieve the nearest achievable rate. Returns the actually-set
/// rate, which may differ from `rate_hz` if the hardware cannot achieve it exactly.
///
/// If the rate change requires reprogramming a PLL, the clock output may be
/// briefly gated during the transition (typically < 1 µs during PLL lock).
/// Drivers that cannot tolerate a clock glitch must use a mux node to switch
/// to a safe fallback clock before calling `set_rate` on the PLL.
///
/// Returns `Err(KernelError::DeviceError)` if no achievable rate exists within
/// the parent clock's capabilities.
pub fn set_rate(&self, rate_hz: u64) -> Result<u64, KernelError>;
/// Like `set_rate`, but fails if the exact requested rate cannot be achieved.
/// Use when the driver requires a precise frequency (e.g., audio sample rate).
pub fn set_rate_exclusive(&self, rate_hz: u64) -> Result<(), KernelError>;
}
impl Drop for ClkHandle {
fn drop(&mut self) {
if self.enabled { self.disable(); }
}
}
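The RAII contract can be demonstrated with a self-contained mock: dropping the last enabled handle drives `enable_count` back to zero (where the real framework would close the hardware gate). `Node` and `Handle` here are std-only stand-ins, not the kernel types:

```rust
use std::sync::atomic::{AtomicU32, Ordering::SeqCst};
use std::sync::Arc;

/// Mock demonstrating the ClkHandle RAII contract: dropping the last enabled
/// handle brings enable_count to 0. Mock types, not the kernel's.
struct Node {
    enable_count: AtomicU32,
}

struct Handle {
    node: Arc<Node>,
    enabled: bool,
}

impl Handle {
    fn enable(&mut self) {
        if !self.enabled {
            // Idempotent, like ClkHandle::enable(): only the first call counts.
            self.node.enable_count.fetch_add(1, SeqCst);
            self.enabled = true;
        }
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        // Mirrors the Drop impl above: only enabled handles decrement.
        if self.enabled {
            self.node.enable_count.fetch_sub(1, SeqCst);
        }
    }
}
```

Because the decrement lives in `Drop`, "forgot to disable" bugs become scope bugs, which the borrow checker and code review catch far more reliably than a missing function call.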
2.24.4 Clock Tree Population¶
The clock framework initializes in two stages that correspond to different canonical boot phases (Section 2.3):
- Phase 1.5, `clock_tree_init()`: Initialize the `CLK_TREE` OnceCell infrastructure and fixed-frequency clocks that are known at compile time (e.g., XTAL on x86-64). No DT parsing — DT nodes are not available yet.
- Phase 4.4b (after bus enumeration in Phase 4.4a): Populate the clock tree from device tree nodes. This is the step described below.
The clock tree is populated during Phase 4.4b (immediately after bus enumeration in Phase 4.4a), before any device drivers probe. After population, the tree is frozen:
Phase 4.4b: Clock tree construction (DT/ACPI enumeration)
Step 1 — Fixed oscillators:
Parse device tree nodes with `compatible = "fixed-clock"`.
Create ClkNode::Fixed for each. These are the roots of the clock tree.
Step 2 — Clock providers (tree order, parents before children):
Parse all DT nodes with `#clock-cells` property (clock providers).
For each node: create ClkNode with the appropriate ClkKind, link to parent
clock node via `clocks` DT property.
The DT traversal order guarantees parents are processed before children.
Step 3 — Freeze and index:
CLK_TREE.set() is called; no further nodes can be added.
Build the by_name HashMap for O(1) consumer lookup.
Step 4 — Boot clocks:
Any clocks referenced by the `assigned-clocks` DT property have their rates
set according to `assigned-clock-rates`. This configures the UART console
clock, etc., before drivers probe.
/// Global clock tree. Cold-path only: populated once at boot, then frozen
/// via `OnceCell`. The `by_name` HashMap is acceptable because it is never
/// accessed on any hot path — drivers resolve clocks at probe time and hold
/// `ClkHandle` references thereafter (see §3.1.13 collection usage policy).
pub struct ClkTree {
roots: ArrayVec<Arc<ClkNode>, 16>,
by_name: HashMap<&'static str, Weak<ClkNode>>,
}
pub static CLK_TREE: OnceCell<ClkTree> = OnceCell::new();
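A sketch of the freeze-and-lookup behavior, using std's `OnceLock` as an analogue of the kernel `OnceCell`: `set()` freezes the tree, and `by_name` holds `Weak` references so the index never keeps a node alive by itself. Names are illustrative:

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, Weak};

/// Minimal stand-in for ClkNode: only the name matters for lookup.
struct Node {
    name: &'static str,
}

/// Stand-in for ClkTree: roots own the nodes; by_name is a weak index.
#[allow(dead_code)]
struct Tree {
    roots: Vec<Arc<Node>>,
    by_name: HashMap<&'static str, Weak<Node>>,
}

/// OnceLock models the kernel OnceCell: one successful set(), then read-only.
static TREE: OnceLock<Tree> = OnceLock::new();

/// Step 3 of Phase 4.4b: build the index, then freeze. A second call is a
/// no-op, mirroring "no further nodes can be added".
fn freeze(roots: Vec<Arc<Node>>) {
    let by_name = roots.iter().map(|n| (n.name, Arc::downgrade(n))).collect();
    let _ = TREE.set(Tree { roots, by_name });
}

/// O(1) consumer lookup against the frozen tree (cold path only).
fn lookup(name: &str) -> Option<Arc<Node>> {
    TREE.get()?.by_name.get(name)?.upgrade()
}
```

The `Weak` entries mean a node dropped from `roots` (never the case after freeze, but relevant for the late-registration path) simply fails to `upgrade()` rather than dangling.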
2.24.5 Deferred Probe Integration¶
The clock tree is frozen at the end of Phase 4.4b (before driver probe). Drivers
that call `clk_get()` during their `init()` for a clock that has not yet been
registered by its platform clock provider will receive an error
(`Err(KernelError::ResourceNotAvailable)`). The probe framework translates this
to `ProbeResult::Deferred` (see Section 11.4), and the device registry re-queues
the driver for a later probe attempt after the clock provider has been
successfully probed.
Error translation: `clk_get()` returns `Err(KernelError::ResourceNotAvailable)`. The device-registry probe infrastructure translates this to `ProbeResult::Deferred` (see Section 11.4). Drivers should not match on the specific error variant — they should propagate the `Err` and let the probe framework handle deferral.
This situation occurs when a peripheral driver probes before the platform's clock controller driver (e.g., on DT platforms where device enumeration order is not guaranteed to follow dependency order). The deferred probe mechanism ensures all clock-dependent drivers eventually succeed without requiring explicit enumeration ordering in the device tree.
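The division of labor (the driver propagates the `Err`; the registry performs the translation) can be sketched with minimal mock types. The real `KernelError` and `ProbeResult` have more variants, and `clk_get_mock` stands in for `clk_get`:

```rust
/// Minimal mock of the deferral translation described above. Enum names
/// follow the text; the registry and clk_get stand-ins are illustrative.
#[derive(Debug, PartialEq)]
enum KernelError {
    ResourceNotAvailable,
}

#[derive(Debug, PartialEq)]
enum ProbeResult {
    Ok,
    Deferred,
}

/// Stand-in for clk_get(): fails when the named clock is not yet registered.
fn clk_get_mock(registered: &[&str], name: &str) -> Result<(), KernelError> {
    if registered.contains(&name) {
        Ok(())
    } else {
        Err(KernelError::ResourceNotAvailable)
    }
}

/// A driver probe just propagates the Err with `?` — it does NOT match on
/// the specific error variant.
fn probe(registered: &[&str]) -> Result<(), KernelError> {
    clk_get_mock(registered, "bus")?;
    Ok(())
}

/// The device registry owns the translation to a probe outcome.
fn registry_translate(r: Result<(), KernelError>) -> ProbeResult {
    match r {
        Ok(()) => ProbeResult::Ok,
        // Missing clock → re-queue the driver for a later probe attempt.
        Err(KernelError::ResourceNotAvailable) => ProbeResult::Deferred,
    }
}
```

Once the clock provider probes successfully and registers "bus", the re-queued driver's next probe attempt succeeds with no driver-side changes.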
Clock provider deferred probe: Clock providers whose parent clock node has not
yet been registered during Step 2 of Phase 4.4b are placed on a deferred provider
list. After the initial DT traversal completes but before the tree is frozen (Step 3),
the framework re-processes deferred providers in dependency order until no further
progress is made. If a provider still cannot resolve its parent after this retry pass,
it is logged at WARN level and omitted from the frozen tree (its consumers will
receive EPROBE_DEFER at driver probe time).
PLL programming timing constraint: PLLs must not be reprogrammed before the
timekeeping subsystem is initialized (Section 2.3). Reprogramming a PLL
before timekeeping is stable can cause clock source instability during early boot (the
timer calibration loop depends on a stable reference). The set_rate consumer API
returns Err(KernelError::NotReady) if called before the timekeeping invariant is
established, and returns Err(KernelError::DeviceError) if the clock tree is frozen
and the rate change would require structural tree modifications.
/// Global guard: set to `true` by `timekeeping_init()` (Phase 2.5) once the
/// reference clock source is stable and calibrated. `ClkHandle::set_rate()`
/// checks this before allowing PLL reprogramming — returns
/// `Err(KernelError::NotReady)` if `false`.
pub static TIMEKEEPING_READY: AtomicBool = AtomicBool::new(false);
/// Global guard: set to `true` at the end of Phase 4.4b after the clock tree
/// DT population is complete. Once frozen, no new clock providers can be
/// registered (`clk_provider_register()` returns `Err(EBUSY)`). Consumers
/// may still call `set_rate()` on existing clocks — the freeze only prevents
/// structural tree mutations (new nodes), not rate changes on existing nodes.
pub static CLK_TREE_FROZEN: AtomicBool = AtomicBool::new(false);
Late clock provider registration: All clock providers must be Tier 0 or
platform drivers probed during Phase 4.4b (before freeze). Tier 1 clock
controller drivers loaded after freeze MUST register via
clk_provider_register_late(), which extends the frozen tree under a separate
seqlock without allowing structural mutations (no reparenting). Late-registered
clocks are available to consumers immediately after registration. This path is
intended for hot-pluggable clock controllers (e.g., USB clock generators, PCIe
add-in cards with their own PLL) that cannot be enumerated during boot.
/// Register a clock provider after the tree has been frozen.
///
/// Unlike `clk_provider_register()` (which fails with EBUSY after freeze),
/// this function appends new leaf clock nodes under a seqlock guard. The
/// new nodes are immediately visible to `clk_get()` consumers. Structural
/// mutations (reparenting existing nodes, changing the tree topology) are
/// NOT permitted — the new nodes must be leaves with an already-registered
/// parent, or standalone roots.
///
/// Returns `Err(KernelError::InvalidArgument)` if the specified parent
/// does not exist in the frozen tree.
pub fn clk_provider_register_late(
provider: ClkProviderDescriptor,
) -> Result<(), KernelError>;
2.24.6 Architecture-Specific Notes¶
x86-64: Most x86-64 peripherals use fixed reference clocks:
- PCIe reference clock: 100 MHz fixed (spread-spectrum optional via BIOS)
- USB UTMI clock: 60 MHz fixed (generated by USB PHY)
- SATA reference: 100 MHz fixed
- UART (LPC/eSPI): 1.8432 MHz or 48 MHz with configurable divisor register
x86-64 SoC platforms (Intel Elkhart Lake embedded, UP Board) have clock controllers described in ACPI or hardcoded in platform drivers. These register `ClkKind::Gate` and `ClkKind::Divider` nodes for peripherals (I2C, SPI, UART); `ClkKind::Fixed` nodes are registered for crystal sources.
AArch64: Rich SoC clock trees are standard. Platforms (Rockchip RK3588, Qualcomm SM8550, Broadcom BCM2712, NXP i.MX 95) have 100-400 clock nodes in their device tree clock controllers. All nodes are described in DT and populated by the generic DT-based clock registration path. The RK3588's CRU (Clock and Reset Unit) alone provides ~400 clocks.
ARMv7: Similar to AArch64 but simpler topologies (50-150 nodes). STM32, TI OMAP, Freescale i.MX all use DT-described clock trees.
RISC-V: Varies significantly by SoC:
- SiFive FU740: clock tree described in DT; 20-30 nodes
- StarFive JH7100/JH7110: dedicated clock controller with 50+ outputs
- Simple RISC-V development boards: often a single fixed PLL, one or two dividers
PPC32: System PLL (sourced from crystal) → Core PLL → local bus divider → peripheral clocks. Described in DT. PPC32 SoC clock trees are simpler than modern ARM SoCs (10-30 nodes).
PPC64LE: IBM POWER systems expose clock information via OPAL (OpenPOWER Abstraction Layer). Clock control is typically not directly programmable by the OS; the hypervisor or OPAL manages frequency scaling. The clock framework on PPC64LE primarily provides the abstraction for querying clock rates, not for programming them.
s390x:
s390x has no clock tree — the core timekeeping frequency is an architectural constant,
not a discoverable or configurable parameter. No ClkKind/ClkHandle infrastructure is
needed for the core timer subsystem.
- TOD clock (Time Of Day): 4096 ticks/µs (2^12), epoch 1900-01-01. This is a hardware-defined constant across all s390x implementations. UmkaOS uses `STCKE` exclusively (Store Clock Extended, 128-bit). The 64-bit `STCK` format wraps around the year 2043 — inside the 50-year uptime target window (boot 2026, run to 2076) — so it cannot be relied on. `STCKE` provides a 128-bit representation that wraps in approximately 36,000 AD. Storage: the kernel reads `STCKE` into a 16-byte buffer. The high 64 bits (bits 0-63 in IBM bit numbering) contain the TOD epoch value, sufficient for seconds-resolution timekeeping until ~2143. The full 128 bits are used only when sub-microsecond precision from the extended bits (bits 64-111) is needed. Atomic access: since s390x does not provide 128-bit atomics, the VvarPage uses a seqcount protocol: the kernel writes the high 64 bits + low 64 bits as two adjacent stores under the VvarPage seqlock. The vDSO reads both halves and verifies the seqlock before use. For kernel-internal timekeeping (sched_clock), only the high 64 bits are used — a single 64-bit atomic load suffices.
- CPU Timer: 64-bit signed decrementer, per-CPU. Fires an External interrupt when the value becomes negative. Set via `SPT` (Set CPU Timer), read via `STPT` (Store CPU Timer). UmkaOS uses the CPU Timer for scheduler ticks (equivalent to the APIC timer on x86-64 or the decrementer on PPC).
- Clock Comparator: 64-bit TOD value. Fires an External interrupt when the TOD clock reaches the comparator value. Set via `SCKC` (Set Clock Comparator), read via `STCKC` (Store Clock Comparator). Used for absolute-time wakeups.
No peripheral clock gating or frequency control is exposed to the OS — all I/O is channel-based and managed by the channel subsystem firmware.
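The VvarPage seqcount protocol for the two-halves TOD value can be sketched as follows. Field and method names are illustrative, and the single-threaded test below exercises only the stable-read path, not the concurrent-writer case the protocol exists for:

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering::{Acquire, Release, SeqCst}};

/// Sketch of the s390x VvarPage seqcount protocol for the 128-bit STCKE
/// value: the writer stores the two 64-bit halves between seq increments
/// (odd = write in progress); the reader retries until it observes a
/// stable, even sequence around its copy.
struct TodPage {
    seq: AtomicU32,
    hi: AtomicU64, // TOD epoch value, bits 0-63 (IBM numbering)
    lo: AtomicU64, // extended bits 64-127
}

impl TodPage {
    /// Kernel side: publish a new 128-bit value as two 64-bit stores.
    fn write(&self, hi: u64, lo: u64) {
        self.seq.fetch_add(1, SeqCst); // now odd: write in progress
        self.hi.store(hi, Release);
        self.lo.store(lo, Release);
        self.seq.fetch_add(1, SeqCst); // even again: stable
    }

    /// vDSO side: copy both halves, then verify no write raced the copy.
    fn read(&self) -> (u64, u64) {
        loop {
            let s1 = self.seq.load(Acquire);
            if s1 & 1 != 0 {
                continue; // writer active; retry
            }
            let hi = self.hi.load(Acquire);
            let lo = self.lo.load(Acquire);
            if self.seq.load(Acquire) == s1 {
                return (hi, lo); // sequence unchanged: copy is consistent
            }
        }
    }
}
```

For sched_clock only the `hi` half is read, so that path degenerates to a single 64-bit atomic load with no sequence check, as noted above.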
LoongArch64: LoongArch64 uses a Stable Counter for core timekeeping and a standard DT-described clock tree for peripheral clocks.
- Stable Counter: Configured via `CSR.TCFG` (Timer Configuration). The `InitVal` field sets the initial countdown value, the `Periodic` bit (1 = auto-reload on expiry), and the `En` bit (1 = enable). When the counter reaches zero, it fires a TI (Timer Interrupt) exception. If `Periodic` = 1, the counter auto-reloads `InitVal` without software intervention. The current value is read from `CSR.TVAL`.
- Frequency discovery: The counter frequency is obtained from a `CPUCFG` word or from the device tree `/cpus/timebase-frequency` property.
- User-mode time access: `RDTIME.D` is available at PLV3 (user privilege level), enabling a `clock_gettime` vDSO without syscall overhead.
- Peripheral clocks: Standard clock tree via the Loongson 7A1000/7A2000 bridge chipset. Bridge clock controllers are described in DT and populated by the generic DT-based clock registration path.
2.24.7 Linux External ABI¶
The following debugfs and sysfs interfaces are provided for tooling compatibility:
/sys/kernel/debug/clk/
├── <clock_name>/
│ ├── clk_rate : current rate in Hz (read-only)
│ ├── clk_enable_count : current enable count (read-only)
│ ├── clk_flags : clock flags (read-only)
│ └── clk_parent_name : name of parent clock (read-only)
└── clk_summary : one-line summary per clock (name, rate, enabled, consumers)
Clock names in these files match the DT clock-names consumer references, enabling
existing Linux clock debugging tools (clk-summary, cpupower) to work without changes.
No userspace API for rate changes is provided. Clock rate configuration is a privileged
kernel-driver operation; there is no /dev/clk or similar. Rate changes by userspace
go through the CPU frequency governor (cpufreq, Section 7.2.2) or via driver-specific
sysfs attributes (e.g., SPI bus frequency via /sys/bus/spi/devices/.../speed_hz).