Chapter 24: Roadmap and Verification¶
Driver ecosystem, implementation phases, verification strategy, technical risks, formal verification, appendices
Implementation phases, verification strategy, and project planning. Phase 1 (microkernel core) through Phase 5 (production hardening) define the delivery order. Formal verification targets are identified for safety-critical subsystems. Open questions and technical risks are tracked here as the canonical reference for development decisions.
24.1 Driver Ecosystem Strategy¶
24.1.1 The Challenge¶
Driver coverage is the single largest adoption blocker for any new kernel. Linux has thousands of drivers covering decades of hardware. UmkaOS cannot replicate this overnight.
24.1.2 Agentic Driver Rewrite Project¶
The key insight: all open-source Linux driver source code is available. The hardware programming logic (register sequences, DMA setup, interrupt handling) is identical regardless of kernel API. Only the kernel-facing API surface changes.
AI-assisted translation pipeline:
Input: Linux driver C source code (GPL, ~500-5000 LOC typical)
|
v
Step 1: Parse Linux kernel API calls (kmalloc, dma_alloc_coherent,
request_irq, pci_read_config_*, etc.)
|
v
Step 2: Map to KABI equivalents (KernelServicesVTable methods)
|
v
Step 3: Translate C to Rust, preserving hardware-specific logic exactly
|
v
Step 4: Generate KABI driver entry point and vtable exchange
|
v
Output: Native Rust KABI driver
Human review: Verify hardware-specific sequences are preserved
Testing: Against real hardware + QEMU virtual devices
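The Step 2/3 output of this pipeline can be sketched as follows. All names here are illustrative assumptions (the mock `KernelServicesVTable` field and `driver_probe` are not the real generated surface, which comes from the .kabi IDL per Section 12.5); the sketch shows one Linux `dma_alloc_coherent()` call rendered as a KABI vtable call, with the hardware-facing constant preserved verbatim:

```rust
/// Mock of the kernel-provided services vtable a KABI driver receives at load.
/// The field name mirrors the Linux API it replaces; the signature is invented
/// for this sketch.
struct KernelServicesVTable {
    /// Hypothetical analogue of Linux dma_alloc_coherent(): returns
    /// (cpu_vaddr, device_dma_addr) for `len` bytes, or None on failure.
    dma_alloc_coherent: fn(len: usize) -> Option<(usize, u64)>,
}

/// Translated driver init. Original Linux C:
///   ring = dma_alloc_coherent(dev, RING_BYTES, &ring_dma, GFP_KERNEL);
fn driver_probe(ks: &KernelServicesVTable) -> Result<u64, &'static str> {
    const RING_BYTES: usize = 4096; // hardware ring size — preserved exactly
    let (_vaddr, dma) = (ks.dma_alloc_coherent)(RING_BYTES)
        .ok_or("ring allocation failed")?;
    Ok(dma) // DMA address the device registers will be programmed with
}

fn main() {
    // Host-side mock standing in for the kernel's implementation.
    let ks = KernelServicesVTable {
        dma_alloc_coherent: |len| Some((0x1000, 0x8000_0000 + len as u64)),
    };
    println!("ring dma addr = {:#x}", driver_probe(&ks).unwrap());
}
```

The hardware logic (sizes, register sequences) stays untouched; only the allocation call changes API surface, which is what the human-review step checks.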
24.1.3 Prioritized Driver List¶
These drivers cover approximately 95% of real hardware in server and desktop environments:
Priority 1 -- Cloud/VM (covers 100% of cloud deployments):
1. VirtIO block (virtio-blk)
2. VirtIO network (virtio-net)
3. VirtIO GPU (virtio-gpu)
4. VirtIO console (virtio-console)
Priority 2 -- Storage (covers 99% of bare-metal storage):
5. NVMe (universal modern SSD interface)
6. AHCI/SATA (legacy HDDs and older SSDs)
Priority 3 -- Networking (covers 90% of server NICs):
7. Intel e1000/e1000e (universal VM and consumer NIC)
8. Intel igb/ixgbe/ice (server 1G/10G/25G/100G)
9. Realtek r8169 (consumer Ethernet)
10. Mellanox mlx5 (high-performance datacenter)
Priority 4 -- Human Interface (covers desktop usability):
11. USB XHCI host controller (all modern USB)
12. USB EHCI host controller (USB 2.0 legacy)
13. USB HID (keyboard, mouse)
14. USB mass storage
15. Intel HDA audio
16. i915 (Intel integrated graphics, modesetting)
17. amdgpu (AMD graphics, modesetting)
18. UVC (USB Video Class) camera driver — Phase 4-5 implementation. Architecture fully specified in Section 13.16 (CameraDevice trait, ISP pipeline model, V4L2 compat, privacy enforcement). Printing is out of kernel scope (CUPS/IPP are pure userspace; Section 13.17).
Priority 5 -- Platform (covers system management):
19. ACPI subsystem
20. PCI/PCIe enumeration and configuration
21. IOMMU (Intel VT-d, AMD-Vi)
24.1.4 Nvidia / Proprietary Driver Strategy¶
For Nvidia (the most critical proprietary driver):
- Nvidia's driver already has a clean internal abstraction layer between their proprietary GPU core and the "kernel interface layer" (nvidia.ko)
- UmkaOS provides a KABI-native implementation of this kernel interface layer
- Nvidia's proprietary compute core links against our KABI implementation
- This is more sustainable than binary .ko compatibility: the interface layer is small, well-defined, and stable
Tier assignment: The Nvidia proprietary GPU compute core runs as a Tier 2 driver (Ring 3 process, IOMMU-isolated). It cannot access kernel memory or hardware registers directly. This is the correct security placement for closed-source proprietary code — a crash or exploit in the Nvidia blob cannot compromise the kernel or other processes.
- GPU command submission goes through the Tier 2 KABI ring protocol (Section 12.6).
- DMA buffers are mapped via the IOMMU through the Tier 2 DMA API (Section 4.14).
- For display output (modesetting), the Nvidia driver uses the DRM/KMS KABI interface (Section 21.5). Display operations are not latency-critical; Tier 2 crossing overhead is acceptable.
Signing and verification: The Nvidia proprietary blob is signed by Nvidia using ML-DSA-65. The UmkaOS module loader verifies this signature against an Nvidia vendor certificate embedded in the UmkaOS kernel image at distribution build time (not at runtime). Certificate chain: Nvidia Root CA → Nvidia Driver Signing Cert → module. The Root CA certificate is pinned — it cannot be rotated without a kernel update, preventing supply chain substitution attacks. Unsigned or improperly signed blobs are rejected at load time with ENOEXEC.
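The load-time check above can be sketched with hypothetical types. `Cert`, `verify_module`, and the `sig_ok` closure (standing in for ML-DSA-65 signature verification) are all illustrative, not the real loader API; only the control flow — pinned root, two-link chain, ENOEXEC on any failure — follows the text:

```rust
const ENOEXEC: i32 = 8;

// Hypothetical certificate record; the real loader parses full certificates.
struct Cert { subject: &'static str, issuer: &'static str }

/// Verify module → signing cert → pinned root. `sig_ok(signer, signee)` stands
/// in for one ML-DSA-65 verification.
fn verify_module(
    pinned_root: &Cert,
    chain: &[Cert], // [signing cert], expected to be issued by the pinned root
    sig_ok: impl Fn(&str, &str) -> bool,
) -> Result<(), i32> {
    let signer = chain.first().ok_or(ENOEXEC)?;
    // Chain must terminate at the pinned root — no runtime rotation.
    if signer.issuer != pinned_root.subject { return Err(ENOEXEC); }
    if !sig_ok(pinned_root.subject, signer.subject) { return Err(ENOEXEC); }
    if !sig_ok(signer.subject, "module") { return Err(ENOEXEC); }
    Ok(())
}

fn main() {
    let root = Cert { subject: "Nvidia Root CA", issuer: "Nvidia Root CA" };
    let chain = [Cert { subject: "Nvidia Driver Signing Cert", issuer: "Nvidia Root CA" }];
    assert!(verify_module(&root, &chain, |_, _| true).is_ok());
    // A chain rooted elsewhere is rejected regardless of signature validity.
    let rogue = [Cert { subject: "Other Cert", issuer: "Other Root" }];
    assert_eq!(verify_module(&root, &rogue, |_, _| true), Err(ENOEXEC));
}
```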
KABI interface layer: The in-kernel nvidia_kabi_shim (Tier 0) implements the
Linux nvidia.ko kernel interface surface that Nvidia's compute core links against,
translating calls to UmkaOS KABI operations. This shim is open-source (maintained by
the UmkaOS project); Nvidia's proprietary compute core remains closed-source and
requires no kernel patches.
Security boundary: The Tier 2 Nvidia process runs in its own user namespace with no capabilities. It communicates with the kernel exclusively via:
1. GPU command ring (KABI vtable, memory-mapped shared ring).
2. DMA API (buffer map/unmap; IOMMU-mediated; no arbitrary PA access).
3. DRM/KMS KABI (display; Tier 0 shim mediates all modesetting operations).
No raw MMIO access. No /dev/mem. No kernel symbol exports to the blob.
24.1.5 Community Incentive¶
The clean KABI SDK makes driver development significantly easier than Linux:
- No need to track unstable internal APIs
- Rust safety eliminates entire classes of bugs
- Binary compatibility across kernel versions eliminates recompilation burden
- Clear, documented interfaces reduce the learning curve
This lower barrier to entry is expected to attract contributors and vendors over time.
24.1.6 Standalone UmkaOS Peer Protocol Specification¶
The peer protocol is a single protocol used by all multikernel communication in UmkaOS — Tier M peers on a single host, distributed kernel nodes across hosts, and firmware shims on smart peripherals all speak the same protocol. The difference between these deployment modes is transport (Layer 0) and whether DSM coherence (Layer 3) is enabled — not the protocol itself.
| Aspect | Detail |
|---|---|
| Scope | Layer 1 wire protocol + Layer 0 transport binding appendices |
| Size | ~40-50 pages standalone document |
| Wire spec source | Section 5.1 |
| Target audience | Firmware engineers, SmartNIC/DPU teams, FPGA developers, embedded/IoT |
| Firmware shim effort | ~10-18K lines of C on existing RTOS, excluding cryptographic primitives already present in the firmware stack (Layers 0-1 only; a reference implementation will be published with measured line counts) |
| Timeline | Draft alongside Phase 3 Tier M demo |
Protocol stack:
Layer 3: DSM page coherence (optional — only for CPU-class peers doing shared memory)
Layer 2: Service messages (service-specific KABI vtables over ring buffers)
Layer 1: PEER PROTOCOL (membership, capabilities, crash recovery, ring transport)
Layer 0: Transport binding (PCIe BAR/MSI, RDMA verbs, CXL.mem, Ethernet+TCP)
Layer 1 is identical across all deployment modes. A SAS controller shim on PCIe, a Tier M GPU peer on CXL, and a DSM node over RDMA all implement the same Layer 1. DSM nodes additionally implement Layer 3 (page coherence) on top. Layer 0 is pluggable — the peer protocol sees "send message to ring, receive doorbell interrupt."
The standalone spec must be published separately from the 24-chapter kernel architecture so that implementers do not need to read the full kernel design.
Standalone document contents:
- Ring buffer layout — entry format, producer/consumer indices, doorbell coalescing
- Message types (~10: CLUSTER_JOIN, CLUSTER_LEAVE, CAP_ADVERTISE, CAP_WITHDRAW, SERVICE_REQUEST, SERVICE_RESPONSE, HEALTH_REPORT, CRASH_NOTIFY, FLR_REQUEST, PING)
- Capability negotiation — 3-way handshake: HELLO → CAP_ADVERTISE → ACK
- Crash and recovery — IOMMU lockout, bus master disable, FLR, rejoin sequence
- Transport binding appendices:
- Appendix A: PCIe — BAR mapping, MSI/MSI-X vector assignment, P2P DMA
- Appendix B: RDMA — RoCEv2 queue pairs, RDMA_SEND/RDMA_WRITE mapping
- Appendix C: CXL — CXL.mem shared region, CXL.cache coherence interaction
- Appendix D: USB — bulk transfer endpoints, interrupt polling fallback
- Appendix E: Ethernet+TCP — software transport for development/demo/IoT
The protocol is transport-agnostic. Any fabric that carries ring buffer messages and delivers doorbells (interrupt or polling) is a valid Layer 0. This means ANY device with a processor — from a SAS HBA with an ARM Cortex-R to an STM32 microcontroller on USB to a datacenter DPU on PCIe — can be an UmkaOS peer.
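The Layer 1 message set and ring transport above can be sketched as a wire discriminant plus a cache-line-sized ring entry. The u16 discriminant values and the `RingEntry` layout are illustrative assumptions; the authoritative encoding is the Section 5.1 wire spec:

```rust
// The ~10 Layer 1 message types listed above, as a wire discriminant.
// Numeric values are invented for this sketch.
#[allow(dead_code)]
#[repr(u16)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum MsgType {
    ClusterJoin = 1,
    ClusterLeave = 2,
    CapAdvertise = 3,
    CapWithdraw = 4,
    ServiceRequest = 5,
    ServiceResponse = 6,
    HealthReport = 7,
    CrashNotify = 8,
    FlrRequest = 9,
    Ping = 10,
}

/// Hypothetical ring entry: fixed header plus inline payload, padded so one
/// entry occupies exactly one 64-byte cache line.
#[allow(dead_code)]
#[repr(C)]
struct RingEntry {
    msg_type: u16,
    flags: u16,
    seq: u32,          // per-ring sequence number
    payload_len: u32,
    payload: [u8; 52], // 12-byte header + 52 = 64 bytes
}

fn main() {
    // Layout sanity check — the property a firmware shim in C would rely on.
    assert_eq!(std::mem::size_of::<RingEntry>(), 64);
    println!("PING discriminant = {}", MsgType::Ping as u16);
}
```

Because Layer 1 is identical across deployment modes, a firmware shim only needs this struct layout plus the doorbell contract of its Layer 0 binding.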
24.1.6.1 Tier M Adoption Roadmap¶
Tier M viability is first demonstrated via UmkaOS-controlled endpoints that require zero third-party vendor cooperation. Vendor firmware adoption is market-dependent and deferred to later phases.
| Phase | Tier M Milestone | Vendor Dependency |
|---|---|---|
| Phase 3 | Kintex-7 FPGA PCIe endpoint — peer protocol in firmware, simultaneously advertising network (Ethernet service) and storage (ephemeral block device backed by onboard DDR). Validates multi-service peers, CLUSTER_JOIN, capability negotiation, crash recovery, and service re-advertisement. Both sides fully controlled. | None |
| Phase 3 | UmkaOS-to-UmkaOS cluster peers over PCIe P2P | None (both sides controlled) |
| Phase 4 | DPU integration (BlueField, Pensando via DOCA SDK) | Cooperative (DOCA is public SDK) |
| Phase 4 | RDMA transport binding (RoCEv2 queue pairs) | None (standard verbs API) |
| Phase 5+ | SAS HBA firmware shim (Broadcom/Marvell) | Requires vendor cooperation |
| Phase 5+ | GPU firmware shim (NVIDIA, AMD) | Requires vendor NDA/engineering access |
| Phase 5+ | NIC firmware shim (Intel, Mellanox) | Requires vendor cooperation |
| Phase 5+ | NVMe SSD firmware shim | Requires vendor cooperation |
The Kintex-7 FPGA is the cleanest Tier M demonstration: the device has never heard of UmkaOS — it just speaks the peer protocol spec. The dual-service capability (network + storage from a single device) proves that Tier M peers are not constrained to one device class, which is the key architectural distinction from traditional driver models.
24.2 Implementation Phases¶
This section covers the implementation timeline for all features. The first part (Phases 1-5+) defines core kernel milestones. The Enhancement Feature Phasing and Future-Proof Feature Phasing tables below map additional features onto these same phases.
24.2.1.1.1 Subsystem Completeness Rule¶
Every subsystem touched in a phase is implemented completely per its architecture spec, or it is not started. There are no "basic" or "stub" versions that get extended in a later phase. The subsystem is the unit of completeness, not the demo.
Rationale: partial implementations create compounding technical debt — the Phase N scaffolding blocks proper Phase N+1 implementation, nobody can tell what works vs what's half-done, and the analysis cost of untangling partial state exceeds the cost of implementing correctly the first time. This is production engineering, not prototyping.
Concretely: if Phase 2 needs signals for busybox, we implement the full signal
subsystem (all 64 signals, sigaction, sigaltstack, per Section 8.5) —
not "4 signals now, 60 later." If Phase 2 needs procfs for ps, we implement
procfs completely (per Section 19.1) — not a stub.
Phase N+1 adds new subsystems. It never extends Phase N subsystems.
Exception: syscall dispatch table. The syscall table is an enumerated set of independent entries, not a monolithic subsystem. Adding syscall #300 doesn't change syscall #1.
Exception: Kernel Crypto API algorithm table. Like the syscall table, the algorithm
registry is an enumerated set of independent entries. Each algorithm is a separate
CryptoAlg registration that does not affect other algorithms. Phase 2 implements the
framework completely (template instantiation, algorithm lookup, crypto_alloc_*() API,
software fallback dispatch) plus essential algorithms (SHA-256, AES, CRC32c, ChaCha20).
Phase 3 registers the full algorithm table and hardware acceleration backends. The
FRAMEWORK is complete in Phase 2; the CATALOG grows in Phase 3.
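The framework-versus-catalog split can be sketched as data: one lookup mechanism, independently registered algorithms. `CryptoRegistry` and its methods are illustrative stand-ins for the crypto_alloc_*() machinery, not the real API:

```rust
use std::collections::HashMap;

// Illustrative algorithm descriptor; the real CryptoAlg carries operations.
struct CryptoAlg { name: &'static str, hw_accel: bool }

#[derive(Default)]
struct CryptoRegistry { algs: HashMap<&'static str, CryptoAlg> }

impl CryptoRegistry {
    /// One registration per algorithm; registering sm4 cannot affect sha256.
    fn register(&mut self, alg: CryptoAlg) { self.algs.insert(alg.name, alg); }
    /// crypto_alloc_*()-style lookup against whatever catalog is registered.
    fn lookup(&self, name: &str) -> Option<&CryptoAlg> { self.algs.get(name) }
}

fn main() {
    let mut reg = CryptoRegistry::default();
    // Phase 2 essential catalog, software-only:
    for name in ["sha256", "aes", "crc32c", "chacha20"] {
        reg.register(CryptoAlg { name, hw_accel: false });
    }
    assert!(reg.lookup("sha256").is_some());
    // A Phase 3 catalog entry is simply absent, not half-implemented:
    assert!(reg.lookup("sm4").is_none());
}
```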
However, syscalls form functional clusters that must ship together —
fork() without wait() leaks zombies, socket() without bind()/listen()/accept()
is useless, sigaction() without sigprocmask() breaks signal masking. The
completeness unit for syscalls is the cluster, not the individual entry.
Syscall cluster rule: a cluster is complete when all real programs that use any syscall in the cluster can use all of them correctly. Clusters are defined by functional dependency, not by arbitrary grouping:
| Cluster | Syscalls (representative, not exhaustive) | Ships in |
|---|---|---|
| Process lifecycle | fork, clone, execve, wait4, exit, exit_group, getpid, getppid, setpgid, setsid | Phase 2 |
| Signals | rt_sigaction, rt_sigprocmask, rt_sigreturn, rt_sigsuspend, kill, tgkill, sigaltstack (all 64 signals) | Phase 2 |
| File I/O | open, openat, close, read, write, lseek, dup, dup2, dup3, fcntl, fstat, fstatat, readlink | Phase 2 |
| Memory | mmap, munmap, mprotect, brk, madvise, mremap | Phase 2 |
| Directory | getdents64, mkdir, rmdir, chdir, getcwd, unlink, rename, chmod, chown, link, symlink, access | Phase 2 |
| Pipe/poll | pipe2, poll, ppoll, epoll_create1, epoll_ctl, epoll_wait, select | Phase 2 |
| Mount/FS | mount, umount2, statfs, fstatfs, sync, fsync | Phase 2 |
| Time | clock_gettime, gettimeofday, nanosleep, clock_nanosleep | Phase 2 |
| Identity | getuid, geteuid, getgid, getegid, setuid, setgid, setresuid, setresgid | Phase 2 |
| Resource | getrlimit, setrlimit, prlimit64, getrusage, uname, sysinfo | Phase 2 |
| Networking | socket, bind, listen, accept4, connect, send, recv, sendmsg, recvmsg, shutdown, setsockopt, getsockopt | Phase 3 |
| Namespaces | clone(CLONE_NEW*), setns, unshare | Phase 3 |
| io_uring | io_uring_setup, io_uring_enter, io_uring_register | Phase 3 |
| eBPF | bpf() (all subcommands) | Phase 3 |
| Advanced process | ptrace, prctl, seccomp | Phase 3 |
Phase 2 clusters total ~120+ syscalls (not "60" or "basic"). Phase 3 adds ~200+ more. Each cluster is fully tested before its phase exits.
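The cluster rule can be expressed as a checkable gate: a cluster passes only when every member syscall is implemented. The gate function is illustrative; the cluster contents below are the representative signal list from the table:

```rust
use std::collections::HashSet;

/// Phase-exit gate for one cluster: all-or-nothing, per the cluster rule.
fn cluster_complete(cluster: &[&str], implemented: &HashSet<&str>) -> bool {
    cluster.iter().all(|s| implemented.contains(s))
}

fn main() {
    let signals = ["rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
                   "rt_sigsuspend", "kill", "tgkill", "sigaltstack"];
    let mut implemented: HashSet<&str> = signals.iter().copied().collect();
    assert!(cluster_complete(&signals, &implemented));

    // Dropping one member fails the whole cluster — "6 of 7 signals syscalls"
    // does not ship.
    implemented.remove("sigaltstack");
    assert!(!cluster_complete(&signals, &implemented));
}
```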
24.2.1.1.2 Multi-Architecture Development Mandate¶
All phases develop and test on all 8 architectures from day one, not just x86-64. This is not a porting exercise deferred to Phase 5 — it is a code quality discipline that enforces proper separation between arch-specific and arch-independent code. If generic code only compiles and runs on x86-64, it is not actually generic, and fixing the hidden assumptions later requires rewriting core subsystems.
The rule: Every subsystem that ships in a phase must compile, boot, and pass its test suite on all 8 architectures in QEMU before the phase exits. x86-64 is the primary development and real-hardware target for Phases 1-4. AArch64 is the secondary real-hardware target from Phase 2 onward — Raspberry Pi 5 (BCM2712, Cortex-A76, no POE) and optionally Apple M1 (Icestorm/Firestorm, no POE but MTE available) provide non-emulatable validation of weak memory ordering, DMA cache coherency, GIC/SMMU, and DT-based PCIe. Other architectures are tested via QEMU CI. Phase 5a elevates remaining architectures to production quality on real hardware.
High-risk areas where x86-specific assumptions silently pollute generic code (identified by architecture risk analysis):
| Risk Area | x86 Behavior | Non-x86 Behavior | Consequence of x86-Only Testing |
|---|---|---|---|
| Memory ordering | TSO (strong, hides bugs) | Weak model (ARM, RISC-V) | Lock-free code silently corrupts data on ARM/RISC-V |
| DMA cache coherency | Always coherent (no-op sync) | Requires explicit cache flush (ARM SoC, RISC-V) | DMA data corruption on ARM without CCI |
| IOMMU | VT-d (single implementation) | SMMU v3 (ARM), RISC-V IOMMU, PPC IOMMU | Tier 2 drivers fail to probe on non-x86 |
| PCIe config access | ECAM via ACPI MCFG | Device tree ranges property (ARM, RISC-V) | Device discovery fails on non-x86 |
| Isolation | MPK (WRPKRU, unprivileged) | POE/page-table (ARM), DACR (ARMv7), segments (PPC), none (RISC-V) | Tier 1 code assumes MPK exists everywhere |
Mandatory abstractions (must be defined before Phase 2 drivers):
- IommuDomain trait: Generic IOMMU operations (map, unmap, flush, fault handler). Per-arch implementations: VT-d (x86), SMMU v3 (ARM), RISC-V IOMMU, PPC IOMMU. Without this, all IOMMU code is implicitly VT-d-specific.
- PcieConfigAccessor trait: Generic PCIe configuration space access. Per-arch implementations: ECAM via MCFG (x86), DT-based (ARM, RISC-V, PPC). Without this, device enumeration only works on x86.
- NumaDiscovery trait: Generic NUMA topology discovery. Per-arch implementations: ACPI SRAT/SLIT (x86), device tree numa-node-id (ARM, RISC-V, PPC). Without this, NUMA-aware allocation silently falls back to node 0 on non-x86.
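A host-side sketch of the first abstraction, with illustrative signatures (the bullet names only map, unmap, flush, and a fault handler, which is omitted here); `MockDomain` stands in for one per-arch backend such as VT-d or SMMU v3:

```rust
use std::collections::BTreeMap;

/// Generic IOMMU operations. Per-arch backends (VT-d, SMMU v3, RISC-V IOMMU,
/// PPC IOMMU) implement this so generic driver code is never VT-d-specific.
/// Signatures are illustrative, not the final trait.
trait IommuDomain {
    /// Map `len` bytes of physical memory at `pa` to device address `iova`.
    fn map(&mut self, iova: u64, pa: u64, len: usize) -> Result<(), &'static str>;
    fn unmap(&mut self, iova: u64, len: usize) -> Result<(), &'static str>;
    /// Invalidate the device-side IOTLB for this domain.
    fn flush(&mut self);
}

/// Mock backend used on the host for testing generic code against the trait.
struct MockDomain { mappings: BTreeMap<u64, (u64, usize)> }

impl IommuDomain for MockDomain {
    fn map(&mut self, iova: u64, pa: u64, len: usize) -> Result<(), &'static str> {
        if self.mappings.contains_key(&iova) { return Err("iova already mapped"); }
        self.mappings.insert(iova, (pa, len));
        Ok(())
    }
    fn unmap(&mut self, iova: u64, _len: usize) -> Result<(), &'static str> {
        self.mappings.remove(&iova).map(|_| ()).ok_or("no such mapping")
    }
    fn flush(&mut self) { /* IOTLB invalidate — no-op in the mock */ }
}

fn main() {
    let mut d = MockDomain { mappings: BTreeMap::new() };
    d.map(0x1000, 0x8000_0000, 4096).unwrap();
    assert!(d.map(0x1000, 0x9000_0000, 4096).is_err()); // double-map rejected
    d.flush();
    d.unmap(0x1000, 4096).unwrap();
}
```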
Per-phase multi-arch requirements are listed in each phase below.
24.2.1.1.3 QEMU Fidelity Matrix¶
QEMU is the primary CI vehicle for non-x86 architectures through Phase 4. Its TCG (Tiny Code Generator) mode is a functional emulator, not a cycle-accurate simulator. Several categories of real-silicon behavior are simplified or absent in QEMU, meaning bugs in these areas will not surface until real-hardware testing (Phase 5a, or earlier for x86-64 and AArch64 which have real-hardware targets in Phases 2-4). This matrix documents known divergences so that developers and reviewers treat QEMU-passing tests with appropriate skepticism in these areas.
See also Section 2.22 for per-architecture hardware capabilities and isolation mechanism availability.
| Category | Architecture(s) | QEMU Behavior | Real Silicon Behavior | Impact on Testing | Mitigation |
|---|---|---|---|---|---|
| Memory ordering | AArch64, ARMv7, RISC-V, PPC32, PPC64LE, LoongArch64 | TCG executes guest instructions sequentially on the host thread; store-buffer forwarding and reordering are not modeled. Effectively TSO or stronger. | Weak memory models (ARM: weakly-ordered with DMB/DSB barriers; RISC-V: RVWMO; PPC: very weak with lwsync/hwsync; LoongArch: weakly-ordered with DBAR). Out-of-order retirement, store buffers, and speculative loads produce reorderings that TCG cannot reproduce. | Lock-free algorithms, RCU, and seqlock code may pass all QEMU tests yet corrupt data on real hardware. This is the highest-risk divergence. | (1) x86-64 real-hardware testing catches TSO-compatible bugs. (2) AArch64 real-hardware testing (RPi 5 / Apple M1 from Phase 2) catches weak-ordering bugs. (3) All lock-free code is reviewed against the formal memory model per Section 3.1. (4) LKMM-style litmus test suite run under herd7 for all barrier-sensitive paths. |
| TLB shootdown latency | All | TLB invalidation (INVLPG, TLBI, sfence.vma) completes instantly — QEMU has no TLB structure to invalidate. IPI delivery for cross-CPU shootdowns is synchronous within the same QEMU process. | TLB shootdown requires IPI to remote cores, each of which must drain its pipeline, invalidate TLB entries, and acknowledge. Cost: 1-50 us depending on core count and NUMA topology. Batching and lazy invalidation strategies have measurable impact. | TLB-intensive paths (munmap, mprotect, fork COW, context switch ASID flush) will appear ~100x faster in QEMU. Performance regressions from naive shootdown strategies are invisible. | (1) Real-hardware x86-64 and AArch64 benchmarks gate Phase 3/4 exit. (2) Performance budget numbers are derived from real-silicon measurements, not QEMU. (3) TLB shootdown batching is designed per spec regardless of QEMU results. |
| DMA cache coherence | AArch64, ARMv7, RISC-V, PPC32, LoongArch64 | QEMU memory is always coherent — DMA reads/writes see the same view as CPU with no cache maintenance. dma_sync_* operations are no-ops. | Non-coherent SoCs (many ARM, all RISC-V without Svpbmt, LoongArch) require explicit cache clean/invalidate around DMA transfers. Missing a dma_sync_for_device() causes stale cache lines to overwrite DMA-written data. | DMA driver bugs (missing cache sync) are completely invisible in QEMU. Corruption manifests only on real non-coherent hardware. | (1) AArch64 real-hardware testing (RPi 5 is non-coherent for some peripherals) catches missing syncs from Phase 2. (2) DMA API (StreamingDmaMap, CoherentDmaBuf per Section 4.14) enforces sync at the type level — the API makes it hard to forget. (3) Static analysis flag for raw MMIO writes to DMA-mapped regions without intervening sync. |
| IOMMU fidelity | AArch64 (SMMUv3), RISC-V (IOMMU), PPC (IOMMU) | QEMU emulates basic IOMMU page-table walks and DMA remapping. Fault injection, ATS (Address Translation Services), PRI (Page Request Interface), and nested/stage-2 translation are partially implemented or absent. SMMU v3 HTTU (Hardware Table Update) is not modeled. | Full IOMMU implementations support ATS/PRI for device-side TLB, HTTU for dirty-bit tracking, nested translation for VM passthrough, and hardware-walked page tables with configurable granularity. Fault reporting is asynchronous via event queues. | Tier 2 driver isolation testing in QEMU validates basic DMA remapping but not ATS/PRI flows, nested translation, or IOMMU fault recovery paths. | (1) x86-64 VT-d is the best-emulated IOMMU in QEMU — Tier 2 regression tests run there. (2) AArch64 SMMU v3 testing on real hardware from Phase 3. (3) IOMMU fault injection test suite exercises error paths via synthetic fault generation independent of QEMU fidelity. |
| Interrupt controller timing | s390x (Adapter Interrupts), LoongArch64 (EIOINTC) | QEMU delivers interrupts synchronously at instruction boundaries. s390x adapter interrupt coalescing is simplified. LoongArch EIOINTC routing between nodes in multi-socket configurations is functional but untested for edge cases. | Real interrupt controllers have delivery latency (10-100 ns), coalescing windows, priority arbitration delays, and routing-table update propagation time. s390x QDIO adapter interrupts have specific timing contracts with channel programs. | Interrupt storm handling, coalescing tuning, and multi-socket routing bugs are invisible in QEMU. EIOINTC cross-node routing on real LoongArch multi-socket (3C5000 8-node) is untested. | (1) x86-64 APIC and AArch64 GICv3 are well-emulated and have real-hardware validation. (2) s390x and LoongArch64 require Phase 5a real-hardware validation for interrupt timing. (3) Interrupt coalescing parameters are configurable and default to conservative values. |
| PCIe configuration space | All non-x86 | QEMU PCIe config reads/writes complete in zero simulated time. Extended capabilities (AER, ACS, L1 PM Substates, SR-IOV) are partially emulated. Config retry status (CRS) is not modeled. | Real PCIe config access requires ECAM MMIO or type 1 configuration cycles with completion timeouts (10-100 us for CRS). Extended capability registers may have hardware-enforced write masks and side effects. Power management state transitions (D0/D3hot/D3cold) have real latency. | Driver probe timing, CRS retry logic, and power state transition handling are untested in QEMU. SR-IOV VF BAR sizing timing is simplified. | (1) x86-64 and AArch64 real-hardware testing validates PCIe probe paths. (2) PCIe config access uses the PcieConfigAccessor trait with per-arch timeout handling designed in regardless of QEMU behavior. (3) CRS retry is implemented per PCIe Base Spec with configurable timeout. |
| s390x Channel I/O | s390x | QEMU emulates basic CCW (Channel Command Word) chains, SSCH/TSCH/HSCH instructions, and virtio-ccw transport. Subchannel multiplexing, QDIO data queues, and FICON channel path failover are simplified. Concurrent I/O on multiple subchannels may not reflect real arbitration. | Real s390x channel subsystem supports 65,536 subchannels, hardware-managed I/O queuing, channel path failover (CHPID/SNID), and QDIO with hardware-assisted buffer management (SBAL/SBALE). I/O interrupts have specific priority and masking semantics tied to the PSW I/O mask bit. | Multi-subchannel I/O scheduling, channel path failover, and QDIO performance optimization cannot be validated in QEMU. Basic virtio-ccw transport and single-subchannel I/O are testable. | (1) s390x is Phase 5a for real-hardware production. (2) virtio-ccw transport (the primary QEMU I/O path) is well-emulated and sufficient for functional testing. (3) FICON/QDIO drivers require z/VM or LPAR testing. |
| Power management | All | QEMU does not model CPU power states (C-states, P-states), voltage/frequency scaling, or thermal throttling. MONITOR/MWAIT (x86), WFI/WFE (ARM) are treated as no-ops or simple halts. | Real CPUs have multi-level C-states with entry/exit latencies (1-1000 us), P-state transition delays, thermal throttling that reduces effective frequency, and DVFS governors that interact with the scheduler. | Runtime PM, cpufreq governor behavior, and thermal management are completely untested in QEMU. The scheduler's energy-aware scheduling path cannot be validated. | (1) x86-64 and AArch64 real-hardware testing validates power management from Phase 3 (power-aware scheduling is a Phase 3 subsystem). (2) Runtime PM state machine per Section 7.5 is designed and tested for correctness independently of actual power savings. |
Interpreting the matrix: QEMU testing validates functional correctness (correct register values, proper sequencing, ABI compatibility) but not performance characteristics, hardware timing, or weak-ordering behavior. Every "QEMU passes" result for a non-x86 architecture carries an implicit caveat for the categories above. The mitigation column documents how UmkaOS compensates — primarily through early real-hardware testing (x86-64 Phase 1+, AArch64 Phase 2+), type-level API enforcement, and formal memory-model verification.
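The type-level sync enforcement cited in the DMA mitigation column can be sketched as a typestate pair: only the device-owned state exposes the DMA address, so forgetting the cache sync before kicking the device is a compile error rather than silent corruption. The type names echo StreamingDmaMap's ownership states, but this state machine is an illustration, not the real Section 4.14 API:

```rust
// CPU owns the buffer: safe to read/write from the CPU, but the DMA address
// is deliberately not exposed — the device must not see it yet.
struct CpuOwned { dma_addr: u64 }

// Device owns the buffer: the only state from which the device can be
// programmed with the address.
struct DeviceOwned { dma_addr: u64 }

impl CpuOwned {
    /// On non-coherent arches, the cache clean (writeback) runs here.
    /// Consuming `self` makes "use after handoff" unrepresentable.
    fn sync_for_device(self) -> DeviceOwned {
        // arch-specific dma_sync_for_device() would execute here
        DeviceOwned { dma_addr: self.dma_addr }
    }
}

impl DeviceOwned {
    fn dma_addr(&self) -> u64 { self.dma_addr }
    /// Cache invalidate before the CPU reads device-written data.
    fn sync_for_cpu(self) -> CpuOwned {
        // arch-specific dma_sync_for_cpu() would execute here
        CpuOwned { dma_addr: self.dma_addr }
    }
}

fn main() {
    let buf = CpuOwned { dma_addr: 0x8000_0000 };
    // buf.dma_addr() would not compile — CpuOwned has no such method.
    let buf = buf.sync_for_device();
    assert_eq!(buf.dma_addr(), 0x8000_0000); // safe to hand to the device
    let _buf = buf.sync_for_cpu();           // and back, before CPU reads
}
```

Because the sync is a state transition rather than a side-effecting call, the "missing dma_sync" bug class that QEMU cannot surface is caught at build time on every architecture.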
24.2.2 Phase 1: Foundations¶
Goal: Boot to a hello-world program.
See Section 25.3 for detailed agentic workflow steps within this roadmap phase. Phase 1.1 and 1.2 (formerly separate x86-only and multi-arch phases) are merged — the spec now contains per-arch tables for every subsystem, making multi-arch support a compile-time configuration rather than a separate design phase.
Subsystems implemented (each complete per its architecture spec):
Boot and hardware discovery:
- Multi-arch boot chain: Multiboot1/2 (x86-64), DTB (AArch64/ARMv7/RISC-V/PPC), SBI (RISC-V), SLOF (PPC64LE) — per Section 2.1
- ACPI table parsing (x86-64): MADT, MCFG, DMAR, FADT — per Section 2.22. Device Tree parsing for all other architectures — per Section 2.8
- Clock framework: full per Section 2.24 — PLL, divider, gate, mux clock types with runtime rate discovery
- Hardware RNG detection: RDRAND (x86), RNDR (AArch64), Zkr (RISC-V) — seeds CSPRNG per Section 2.16
- CPU feature detection: per-arch capability discovery per Section 2.16 — isolation mechanism probing (MPK, POE, DACR, segment registers, Radix PID)
Concurrency primitives (foundation for all later subsystems):
- Spinlocks, Mutexes, RwLocks: full per Section 3.1 — including lock ordering enforcement, lockdep debug mode
- RCU: full non-preemptible model per Section 3.4 — grace period detection, rcu_read_lock/unlock, synchronize_rcu, call_rcu
- CpuLocal and PerCpu infrastructure: full per Section 3.1 — register-based CpuLocal (GS/TPIDR_EL1/tp per arch), PerCpu data areas
- IRQ chip and irqdomain hierarchy: full per Section 3.12 — IrqChip trait, IrqDomain, IrqTable, per-arch root domain (APIC/GIC/PLIC/OpenPIC)
- Workqueues: full per Section 3.11 — BoundedMpmcRing, named thread pools with backpressure, system-wide + per-CPU + unbound pools
Memory:
- Physical memory allocator: full per Section 4.1
(buddy allocator, NUMA-aware, zone-based, boot allocator → runtime transition)
- Slab allocator: full per Section 4.3 — per-CPU
magazines, size classes, NUMA-aware, GFP flags. Every kmalloc equivalent.
Scheduling and time:
- EEVDF scheduler: full core per Section 7.1
(virtual deadline, lag tracking, preemption, two-tree eligible/timeline design).
RT scheduling classes (SCHED_FIFO, SCHED_RR), deadline scheduling class
(SCHED_DEADLINE), CBS bandwidth enforcement, and EAS energy-aware scheduling
are separate subsystems added in Phase 3. Phase 1's EEVDF subsystem is
complete per its spec.
- Timekeeping: full per Section 7.8 — clock sources,
clockevents, timer wheel, hrtimers, vDSO for clock_gettime()
Security foundation:
- UmkaOS capability system: full per Section 9.1 — CapSpace, CapEntry, ObjectRegistry, capability creation/revocation/lookup
- PQC crypto abstraction: algorithm enum, variable-length signature fields per Section 9.6 — design-in only, no functional implementation yet
Isolation and drivers:
- Isolation domain infrastructure: full Tier 0/1/2 framework per Section 11.2 (MPK setup, domain allocation, IOMMU init on all architectures). No drivers use Tier 1 yet — infrastructure only.
- Tier 0 drivers: APIC/GIC/PLIC (per arch), timer, serial console — complete
- Device registry skeleton: DeviceRegistry struct with RwLock per Section 11.4 — populated in Phase 2
KABI and syscall infrastructure:
- KABI compiler: umka-kabi-gen per Section 12.5 — complete
- Syscall dispatch table: architecture per Section 19.1,
populated with execve + write + exit_group initially. Table structure is final;
later phases add entries, never change the dispatch mechanism.
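The "entries are independent" property of the dispatch table can be sketched as a fixed-size handler array; adding entry #300 in a later phase cannot perturb entry #1. The three populated syscall numbers are the x86-64 Linux ABI values named above; the table size, handler type, and stub bodies are illustrative:

```rust
/// Handler signature: six raw syscall arguments in, one return value out.
type SyscallFn = fn(args: &[u64; 6]) -> i64;

const NR_SYSCALLS: usize = 512; // illustrative table size
const ENOSYS: i64 = -38;        // Linux convention: negative errno

// Stub handlers standing in for the real Phase 1 implementations.
fn sys_write(_a: &[u64; 6]) -> i64 { 0 }
fn sys_execve(_a: &[u64; 6]) -> i64 { 0 }
fn sys_exit_group(_a: &[u64; 6]) -> i64 { 0 }
fn sys_ni(_a: &[u64; 6]) -> i64 { ENOSYS } // unpopulated entry

fn build_table() -> [SyscallFn; NR_SYSCALLS] {
    let mut t: [SyscallFn; NR_SYSCALLS] = [sys_ni; NR_SYSCALLS];
    t[1] = sys_write;        // x86-64: write = 1
    t[59] = sys_execve;      // x86-64: execve = 59
    t[231] = sys_exit_group; // x86-64: exit_group = 231
    t
}

/// The dispatch mechanism itself — final in Phase 1, never changed later.
fn dispatch(t: &[SyscallFn; NR_SYSCALLS], nr: usize, args: &[u64; 6]) -> i64 {
    t.get(nr).map_or(ENOSYS, |f| f(args))
}

fn main() {
    let t = build_table();
    assert_eq!(dispatch(&t, 1, &[0; 6]), 0);        // populated entry
    assert_eq!(dispatch(&t, 300, &[0; 6]), ENOSYS); // later-phase entry
}
```

Later phases only overwrite `sys_ni` slots with real handlers; the dispatch path is untouched, which is exactly why this table is exempt from the subsystem completeness rule.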
Build and test:
- Build system + CI/CD: Cargo workspace, linker scripts, QEMU boot tests — complete
- Formal verification readiness: spec annotations, design contracts per Section 24.4 — design-in only
Multi-arch (Phase 1): Hello-world runs on all 8 architectures in QEMU. Per-arch isolation mechanism probed and reported (MPK on x86, DACR on ARMv7, segments on PPC32, Radix PID on PPC64LE, page-table fallback on AArch64, "Tier 1 unavailable" on RISC-V, "Tier 1 unavailable" on s390x, "Tier 1 unavailable" on LoongArch64). IrqChip/IrqDomain hierarchy validated on all arches (APIC, GIC, PLIC, OpenPIC). Concurrency primitives stress-tested on ARM64 and RISC-V QEMU (weak memory model targets) — not just x86.
Exit criteria: A statically linked 'Hello, world!' ELF binary runs on UmkaOS in QEMU
(all 8 architectures). The KABI compiler parses a .kabi IDL file and generates Rust/C
stubs that compile. Lock ordering violations detected by lockdep. RCU grace periods
complete correctly under stress. Isolation domain probe succeeds on all arches.
24.2.3 Phase 2: Self-Hosting Shell + Tier 1 Fault Recovery¶
Goal: Run a busybox shell with basic utilities. Demonstrate Tier 1 driver crash recovery.
See Section 25.3 (Phase 2.1: Essential Drivers), Section 25.3 (Phase 2.2: Linux Compatibility Layer), and Section 25.3 (Phase 2.3: Networking Stack) for detailed agentic workflow steps within this roadmap phase.
New subsystems added (each complete per its architecture spec):
Memory (extended):
- Virtual memory manager: full per Section 4.15 —
mmap, brk, munmap, page fault handler, COW, demand paging, mprotect, madvise
- Page cache: full per Section 4.4 — readahead,
writeback, per-inode page tree. All VFS read/write goes through page cache.
- DMA subsystem: full per Section 4.14 — DmaDevice trait,
CoherentDmaBuf, StreamingDmaMap, SWIOTLB fallback, per-arch cache coherency
VFS and pseudo-filesystems:
- VFS layer: full per Section 14.1 — mount table, path resolution,
file descriptor table, dentry cache, inode cache, superblock operations
- tmpfs: full per Section 19.1 — size limits, POSIX semantics
- devtmpfs: full — /dev/null, /dev/zero, /dev/random, /dev/urandom, /dev/console,
auto-populated from device registry
- devpts: full — PTY allocation filesystem, required for terminal emulation
- initramfs (cpio): full — extraction, switchroot
- procfs: full per Section 19.1 — all standard entries
(/proc/[pid]/status, /proc/meminfo, /proc/cpuinfo, /proc/self, etc.)
- sysfs: full per Section 19.1 — device/driver/class hierarchy
- Pipes and FIFOs: full per Section 14.17 — O_NONBLOCK,
PIPE_BUF atomicity, splice between pipes
- File locking: full per Section 14.14 — flock(), POSIX fcntl()
locks, lock conflict detection
Process and credentials:
- Process management: full per Section 8.1 —
fork/clone/execve/wait/exit/exit_group, process groups, sessions,
resource limits (getrlimit/setrlimit/prlimit64)
- Signal handling: full per Section 8.5 —
all 64 signals, sigaction, sigaltstack, sigprocmask, signal queuing, RT signals,
per-arch signal frame layouts (all 8 architectures)
- Credential model and Linux capabilities: full per Section 9.9 —
uid/gid, supplementary groups, setuid/setgid/setresuid/setresgid,
capget/capset, capability bounding set, securebits. Required for busybox su,
login, and any setuid binary.
Crypto and entropy:
- Kernel Crypto API: basic algorithms per Section 10.1 —
SHA-256, AES, CRC32c (ext4 checksums), ChaCha20 (CSPRNG). Full algorithm table
deferred to Phase 3; Phase 2 implements the framework + essential algorithms.
- CSPRNG and getrandom(): full — hardware RNG seeding (Phase 1), ChaCha20-based CSPRNG,
getrandom() syscall with blocking/non-blocking modes. Required by glibc init, ASLR,
stack canaries.
Synchronization and special file descriptors:
- futex: full per Section 19.4 — FUTEX_WAIT, FUTEX_WAKE,
FUTEX_WAIT_BITSET, FUTEX_PI (priority inheritance), FUTEX_REQUEUE.
Required by glibc/pthreads (every multithreaded program depends on futex).
- eventfd: full per Section 19.10 — semaphore mode,
non-blocking, EFD_CLOEXEC. Required by systemd sd-event loop.
- signalfd: full per Section 19.10 — synchronous
signal delivery via file descriptor. Required by systemd PID 1.
- timerfd: full per Section 19.10 — CLOCK_MONOTONIC,
CLOCK_REALTIME, TFD_TIMER_ABSTIME. Required by systemd timer management.
- ioctl framework: full — dispatch table, per-subsystem ioctl handlers (terminal, block device)
Device infrastructure:
- PCIe enumeration and configuration: full per Section 11.4 —
BAR discovery, MSI-X vector allocation, configuration space access. Required before any
PCIe device driver (VirtIO-blk is PCI). Includes PcieConfigAccessor trait with
per-arch implementations: ECAM via ACPI MCFG (x86-64), DT-based ranges (AArch64,
ARMv7, RISC-V, PPC).
- IOMMU abstraction: IommuDomain trait with per-arch implementations — VT-d (x86-64),
SMMU v3 (AArch64), DACR-based (ARMv7), PPC IOMMU, stub (RISC-V). Generic operations:
map, unmap, flush, fault handler. All Tier 1/2 driver code uses the trait, never
arch-specific IOMMU registers directly.
- NUMA discovery abstraction: NumaDiscovery trait — ACPI SRAT/SLIT (x86-64), device
tree numa-node-id (AArch64, RISC-V, PPC). Single-node fallback when no topology
information is available.
- Device registry and bus management: full per Section 11.4 —
device/driver matching, probe sequencing, sysfs population
- IPC architecture: full per Section 11.8 — Tier 1 driver
communication with core via domain ring buffers
- Zero-copy I/O path: full per Section 11.7 — block driver
fast path, ring buffer entry reuse
Block storage:
- Block I/O layer: full per Section 15.1 — bio submission, completion, merge, I/O scheduler
- VirtIO-blk driver (Tier 1): full per Section 15.5 — VirtIO-blk Tier 1 KABI driver. Virtqueue layout (descriptor table, available ring, used ring), feature negotiation (VIRTIO_BLK_F_SEG_MAX, F_SIZE_MAX, F_BLK_SIZE, F_FLUSH, F_TOPOLOGY, F_MQ, F_DISCARD), I/O request format (VirtioBlkReq: type + ioprio + sector + data + status), flush semantics, crash recovery path.
KABI and recovery:
- KABI runtime: full — service registry, module loader per Section 12.7
- Crash recovery: full per Section 11.9 — fault detection, IOMMU revoke, FLR reset, driver reload, I/O resume
Observability (design-in):
- Unified Object Namespace: infrastructure per Section 20.5 —
umkafs mount point, basic object registration. Full population in Phase 3-4.
- Stable Tracepoints: framework per Section 20.2 —
tracepoint macros, ring buffer, basic trace_pipe interface. Needed for debugging.
Syscall clusters (Phase 2 adds ~120+ syscalls to the dispatch table):
| Cluster | Syscalls (representative, not exhaustive) |
|---|---|
| Process lifecycle | fork, clone, execve, wait4, exit, exit_group, getpid, getppid, setpgid, setsid |
| Signals | rt_sigaction, rt_sigprocmask, rt_sigreturn, rt_sigsuspend, kill, tgkill, sigaltstack |
| File I/O | open, openat, close, read, write, lseek, dup, dup2, dup3, fcntl, fstat, fstatat, readlink |
| Memory | mmap, munmap, mprotect, brk, madvise, mremap |
| Directory | getdents64, mkdir, rmdir, chdir, getcwd, unlink, rename, chmod, chown, link, symlink, access |
| Pipe/poll | pipe2, poll, ppoll, epoll_create1, epoll_ctl, epoll_wait, select |
| Mount/FS | mount, umount2, statfs, fstatfs, sync, fsync |
| Time | clock_gettime, gettimeofday, nanosleep, clock_nanosleep |
| Identity | getuid, geteuid, getgid, getegid, setuid, setgid, setresuid, setresgid, capget, capset |
| Resource | getrlimit, setrlimit, prlimit64, getrusage, uname, sysinfo, getrandom |
| Sync objects | futex, eventfd2, signalfd4, timerfd_create, timerfd_settime, timerfd_gettime |
| File locking | flock, fcntl(F_SETLK/F_GETLK) |
Multi-arch (Phase 2): Busybox shell boots on all 8 architectures in QEMU. VirtIO-blk driver (Tier 1) loads and serves I/O on all arches — validates IommuDomain trait (map/unmap paths only; fault handler validation requires real IOMMU, Phase 4), PcieConfigAccessor, and DMA cache coherency paths on ARM64 and RISC-V (not just x86). DMA streaming mappings stress-tested on ARM64 QEMU (non-coherent path exercised). RISC-V and PPC IOMMU testing uses software-emulated IOMMU paths (SWIOTLB fallback) in QEMU. Full hardware IOMMU validation requires real hardware, targeted for Phase 4. Signal frame layout tested on all arches (per-arch signal delivery is a common divergence point).
Exit criteria: Busybox shell boots, ls, cat, echo, ps, mount, uname work
(on all 8 architectures in QEMU; x86-64 is primary). A multithreaded C program
(pthreads) runs correctly (futex works). VirtIO-blk driver survives injected fault with
I/O resumption demonstrated end-to-end. All implemented subsystems pass their full test
suites — no "works for the demo" exceptions.
24.2.4 Phase 3: Real Workloads + Tier M Peer Demo¶
Goal: Boot systemd, run Docker containers. Demonstrate Tier M PCIe peer.
See Section 25.3 (Phase 3.1: Storage Stack) and Section 25.3 (Phase 3.2: Advanced Features) for detailed agentic workflow steps within this roadmap phase.
What systemd needs: AF_UNIX sockets (D-Bus), inotify (file monitoring), signalfd/eventfd/timerfd (already Phase 2), cgroups v2 (resource control), namespaces (service isolation), sysctl (/proc/sys/*), netlink (udev device events), pidfd (race-free process management), seccomp-bpf (sandboxing), credentials/capabilities (already Phase 2).
What Docker needs: overlayfs (container image layers), namespaces (mount, PID, net, user, UTS, IPC, cgroup, time), cgroups (resource limits), seccomp-bpf (syscall filtering). For the Phase 3 demo, Docker uses `--network=host` (shares the host network namespace). Full Docker networking (bridge, veth, NAT, conntrack, nftables) ships in Phase 4 alongside Kubernetes.
New subsystems added (each complete per its architecture spec):
Storage:
- NVMe driver (Tier 1): full per Section 15.19 + KABI spec —
admin queue, I/O queues, interrupt coalescing, crash recovery
- ext4 filesystem: full per Section 15.6 — read-write,
journaling (JBD2), fsck, extent trees, delayed allocation, inline data
- I/O scheduling and priority: full per Section 15.1 —
cgroup io controller weight, BFQ-style proportional share. Required for fio benchmarks.
Network stack:
- TCP/IP and UDP: full per Section 16.1 — socket API,
congestion control (Cubic, BBR), NAPI, NetBuf zero-copy, TCP state machine, UDP, ICMP
- AF_UNIX sockets: full per Section 16.3 — stream, datagram,
SCM_RIGHTS (fd passing), SCM_CREDENTIALS, abstract namespace. Critical: D-Bus
(systemd's IPC) is built on AF_UNIX. systemd cannot start PID 1 without it.
- Loopback interface: full — lo device, 127.0.0.1/::1 routing. Required for
localhost services, iperf3 loopback tests.
- ARP and NDP: full per Section 16.1 — L2/L3 address resolution for
IPv4 (ARP) and IPv6 (NDP). Required for any Ethernet communication.
- NIC drivers (VirtIO-net, e1000): full per KABI spec, Tier 1
USB subsystem (required for real hardware demo — keyboard, storage, serial):
- USB XHCI host controller driver (Tier 1): full per Section 13.12 — XHCI ring-based command/transfer/event architecture, USB 2.0/3.0 device enumeration, hub support, MSI-X interrupts, crash recovery via KABI reload
- USB HID class driver: full per Section 13.12 — keyboard, mouse, basic gamepad. Input events routed to an evdev-compatible interface (minimal evdev path for Phase 3; full input subsystem in Phase 4). Critical: the real hardware demo requires a USB keyboard for interactive use.
- USB mass storage class driver: full per Section 13.12 — bulk-only transport (BOT), SCSI command translation, USB flash drives. Enables loading Docker images and test data from USB storage on real hardware.
- USB CDC ACM (serial) class driver: full — USB-to-serial adapters. Common for development boards, serial consoles on headless servers, and debug access.
Network stack (continued):
- Netlink compat layer: full per Section 19.5 — NETLINK_ROUTE
(ip command, systemd-networkd), NETLINK_KOBJECT_UEVENT (udev device events),
NETLINK_NETFILTER (basic conntrack query). Required for systemd device management.
- Routing table: full per Section 16.6 —
FIB lookup, policy routing, default gateway
VFS extensions:
- overlayfs: full per Section 14.8 — upper/lower/work
directories, whiteout handling, metacopy, redirect_dir. Critical: Docker's primary
storage driver for container image layers.
- inotify: full per Section 14.13 — watch descriptors,
event coalescing, per-user limits, IN_MODIFY/IN_CREATE/IN_DELETE/IN_MOVED_*,
/proc/sys/fs/inotify/* sysctls. Critical: systemd uses inotify extensively for
monitoring /run, /etc, unit files.
- fanotify: full per Section 14.13 — permission events,
FAN_CLASS_CONTENT/FAN_CLASS_NOTIF, pre-allocated event ring. Shipped with inotify
per the Subsystem Completeness Rule (both are Section 14.13).
Containers and isolation:
- Namespaces: all 8 types per Section 19.1 — mount, PID,
net, user, UTS, IPC, cgroup, time. Includes clone(CLONE_NEW*), setns, unshare,
/proc/[pid]/ns/* inodes.
- Cgroups v2: full per Section 19.1 — cpu, memory, io, pids
controllers, cgroupfs pseudo-filesystem, delegation, v1 compat shim. Required for both
systemd (resource management) and Docker (container limits).
- POSIX IPC: full per Section 17.3 — SysV shared memory,
semaphores, message queues, POSIX mqueues. IPC namespace isolates these per container.
Scheduling (extended):
- RT and deadline scheduling: full per Section 7.1 — SCHED_FIFO, SCHED_RR, SCHED_DEADLINE (CBS), bandwidth throttling, priority inheritance
- EAS (Energy-Aware Scheduling): full per Section 7.2 — capacity-aware placement, energy model, big.LITTLE/Intel hybrid support
- CPU bandwidth guarantees: full per Section 7.6 — CFS bandwidth, RT bandwidth. Required by the cgroups cpu controller.
- Power budgeting: full per Section 7.7 — RAPL/SCMI reading, per-cgroup power budgets in watts, multi-domain enforcement
Security (extended):
- seccomp-bpf: full per Section 10.3 — filter
installation, syscall interception, SECCOMP_RET_* actions, SECCOMP_IOCTL_NOTIF_*.
Required by Docker/runc for container syscall filtering.
- Kernel Crypto API (full): remaining algorithms per Section 10.1 —
full algorithm table, template instantiation, hardware acceleration registration
- Verified boot: framework and Ed25519 verifier per Section 9.3 —
secure boot chain, kernel image signature verification (Ed25519; hybrid
Ed25519+ML-DSA-65 added in Phase 4 when PQC algorithms ship), IMA measurement
list. The boot verification framework is algorithm-agnostic; Phase 4 adds
ML-DSA-65 and SLH-DSA as additional BOOT_VERIFY_TABLE entries.
Compat and special interfaces:
- io_uring: full per Section 19.3 — SQ/CQ rings,
all Phase 3 opcodes (read/write/fsync/poll/accept/connect/send/recv), fixed files/buffers
- eBPF: full verifier + JIT (x86-64, AArch64) + core map types per
Section 19.2 — hash/array/ringbuf maps,
bpf() syscall (all subcommands), program attachment points
- TTY/PTY: full per Section 21.1 — line discipline,
job control, TIOCGWINSZ/TIOCSWINSZ, session leader, controlling terminal
- pidfd: full per Section 19.10 — pidfd_open,
pidfd_send_signal, pidfd_getfd, CLONE_PIDFD. Used by systemd 250+ for race-free
process management.
- Sysctl / kernel parameter store: full per Section 20.9 —
/proc/sys/* entries, sysctl() syscall, namespace-aware parameters. Required by
systemd (sysctl.conf), Docker (network tuning).
- sendfile/splice/tee: full per Section 14.17 —
zero-copy file-to-socket, pipe-to-pipe, file-to-pipe transfers
Memory (extended):
- Memory compression tier: full per Section 4.12 — zswap/zram, compression algorithms (LZ4, ZSTD), writeback to swap
Observability (extended):
- Unified Object Namespace: full population per Section 20.5 — all kernel objects registered and accessible via umkafs
- Fault Management Architecture: basic per Section 20.1 — health telemetry for NVMe/NIC drivers, rule-based diagnosis for crash recovery events
Distributed (Tier M):
- Tier M peer transport: full per Section 11.1 — Kintex-7 FPGA PCIe endpoint as the primary Tier M validation device. The FPGA implements the peer protocol in firmware and simultaneously advertises both network (Ethernet service) and storage (ephemeral block device backed by onboard DDR) capabilities — demonstrating multi-service Tier M peers. Protocol exercised: CLUSTER_JOIN, capability advertisement for both services, workload delegation, crash recovery with service re-advertisement
- Peer protocol wire specification: full per Section 5.1 — JoinRequest/JoinAccept, heartbeat, ClusterMessageHeader, session key derivation. Only the PCIe P2P transport is exercised; the RDMA transport is deferred to Phases 4-5.
Syscall clusters (Phase 3 adds ~200+ syscalls):
| Cluster | Syscalls (representative, not exhaustive) |
|---|---|
| Networking | socket, bind, listen, accept4, connect, send, recv, sendmsg, recvmsg, shutdown, setsockopt, getsockopt, sendfile |
| Unix sockets | socket(AF_UNIX), socketpair, sendmsg(SCM_RIGHTS), recvmsg(SCM_CREDENTIALS) |
| Namespaces | clone(CLONE_NEW*), setns, unshare, pidfd_open, pidfd_send_signal |
| io_uring | io_uring_setup, io_uring_enter, io_uring_register |
| eBPF | bpf() (all subcommands) |
| Advanced process | ptrace, prctl, seccomp, waitid, clone3 |
| inotify/fanotify | inotify_init1, inotify_add_watch, inotify_rm_watch, fanotify_init, fanotify_mark |
| IPC | shmget, shmat, shmdt, shmctl, semget, semop, semctl, msgget, msgsnd, msgrcv, msgctl, mq_open, mq_send, mq_receive |
Multi-arch (Phase 3): systemd boot tested on x86-64 (primary) and AArch64 (both QEMU and real hardware — RPi 5 and/or Apple M1). AArch64 real hardware validates weak memory model paths, DMA non-coherent paths, GIC interrupt routing, SMMU IOMMU, and DT-based PCIe that QEMU cannot faithfully emulate. eBPF JIT validated on both x86-64 and AArch64. All 8 architectures pass the Phase 2 busybox test suite plus Phase 3 networking (TCP loopback, AF_UNIX). EAS capacity model validated on AArch64 QEMU with big.LITTLE CPU topology. USB XHCI tested on x86-64 and AArch64 real hardware; VirtIO-input used on other arches in QEMU (same evdev path).
Exit criteria: Ubuntu minimal boots with systemd (PID 1 → multi-user target →
login prompt) on x86-64. Docker runs hello-world container (pre-loaded image,
--network=host). iperf3 TCP loopback benchmark completes. fio NVMe random
read/write benchmark completes with I/O scheduling. USB keyboard works on real
hardware (x86-64 and AArch64). Tier M peer device demonstrates capability negotiation
and crash recovery over PCIe. AArch64 passes systemd boot on both QEMU and real
hardware (RPi 5). eBPF: XDP basic operations (XDP_DROP, XDP_PASS, XDP_TX,
XDP_REDIRECT) and tc classifier attachment pass at a 95% rate. The full Cilium
connectivity test suite is deferred to Phase 4 (requires conntrack, IPVS, overlay networking).
All subsystems pass their full test suites on all architectures.
24.2.4.1.1 Cgroup v2 Detection Surface (Required for Docker/runc v2 Mode)¶
The following procfs/sysfs entries must return correct v2-format data in Phase 3. Without these, Docker/runc may fall back to cgroup v1, which is deferred to Phase 4.
| Path | Required Content | Verified By |
|---|---|---|
| `/sys/fs/cgroup/cgroup.controllers` | Space-separated list of available controllers (`cpu io memory pids`) | `runc spec --rootless` |
| `/sys/fs/cgroup/cgroup.subtree_control` | Space-separated list of enabled controllers | `docker info` |
| `/proc/self/cgroup` | `0::/path` format (unified v2 hierarchy, hierarchy ID 0) | `cat /proc/self/cgroup` in container |
| `/proc/cgroups` | Empty or absent (v1 controllers not enumerated) | runc v2 detection logic |
| `/sys/fs/cgroup` mount | `cgroup2` filesystem type | `findmnt -t cgroup2` |
Integration test: `docker run --rm hello-world` must succeed with the cgroup v2 driver
(verified via `docker info`, which must report Cgroup Driver `cgroupfs` or `systemd`).
24.2.4.2 First Public Demo¶
The first public demonstration is a single unified demo at Phase 3 exit, showing three pillars in sequence:
1. Boot unmodified Ubuntu minimal with systemd (credibility anchor): UmkaOS boots → systemd PID 1 → multi-user target → login prompt → USB keyboard works. Then: `/bin/sh` → `ls`, `cat`, `ps`, `uname`, `docker run hello-world`. QEMU first, then real hardware (same kernel binary, USB keyboard + NVMe + NIC).
2. Tier 1 driver fault recovery (operational shock): run `fio` against NVMe → inject driver fault → Linux comparison: panic. UmkaOS: brief stall, driver reloads, I/O resumes. "The problem Unix never solved, fixed."
3. PCIe peer device (architecture shock): Kintex-7 FPGA endpoint auto-detected → capability registry shows both network and storage services → traffic delegated to FPGA NIC, ephemeral block device mounted → kill FPGA (reset) → IOMMU lockout → FPGA reboots → CLUSTER_JOIN → services re-advertised → I/O resumes. The FPGA runs only the peer protocol firmware (no UmkaOS kernel on the peripheral) — demonstrating that any device speaking the spec can be a multi-service Tier M peer.
This is one demo, not three. Every subsystem it touches is complete and final.
24.2.5 Phase 4: Production Ready¶
Goal: Drop-in replacement for server and cloud workloads. Full Kubernetes, KVM virtualization, real hardware boot, LTP conformance.
See Section 25.3 (Phase 4.1: Consumer Hardware) for detailed agentic workflow steps within this roadmap phase.
What Kubernetes needs (beyond Phase 3 Docker): IPVS (kube-proxy default backend), full connection tracking (conntrack), nftables rule engine, veth pairs (pod networking), software bridge (CNI), VLAN (overlay networking), VXLAN/Geneve (CNI plugins like Calico/Flannel), AF_VSOCK (VM-based pods via Kata Containers).
What real hardware boot needs (Phase 3): AHCI/SATA driver (legacy disks), real NIC drivers (Intel e1000e/i210, Mellanox mlx5), IPMI (server management), RTC (hardware clock), watchdog (server reliability), NVMEM (MAC addresses, calibration data), I2C/SMBus (sensor access, IPMI/BMC communication, EDID for displays). Phase 3 already provides USB XHCI + HID + mass storage + serial for keyboard/storage/debug.
New subsystems added (each complete per its architecture spec):
Virtualization:
- KVM hypervisor: full per Section 18.1 —
/dev/kvm, VMX/EPT (x86-64), VHE (AArch64), QEMU/Firecracker support, virtio-mmio
passthrough, vcpu scheduling integration with EEVDF
- VFIO and iommufd: full per Section 18.5 —
device passthrough to VMs, VFIO groups, iommufd descriptors, PCI device assignment,
interrupt remapping. Required for GPU passthrough, SR-IOV, Firecracker device model.
- AF_VSOCK: full per Section 16.24 — host-guest socket
communication, virtio-vsock transport, SOCK_STREAM/SOCK_DGRAM. Required for
Kata Containers and Firecracker guest agents.
- Suspend and resume: full per Section 18.4 —
S3 (suspend-to-RAM), S4 (hibernate), device state save/restore sequencing, PM notifier
chains. Server use: IPMI-triggered suspend, UPS-coordinated hibernate.
Network (extended):
- Netfilter/nftables: full per Section 16.18 —
connection tracking (conntrack), NAT/masquerade, nftables rule engine, iptables compat
layer. Enables Docker bridge networking, Kubernetes service routing.
- Virtual network devices: veth pairs, software bridge, VLAN, macvlan per
Section 16.16. Enables full Docker/K8s
pod networking.
- Network overlay and tunneling: VXLAN, Geneve per
Section 16.16. Required by Kubernetes CNI
plugins (Calico, Flannel, Cilium). GRE/IP-in-IP for legacy tunnels.
- Traffic control and queue disciplines: full per
Section 16.21 — qdisc
framework, HTB (hierarchical token bucket), PFIFO, RED, netem. Required for K8s
bandwidth limits (kubernetes.io/ingress-bandwidth annotation), network QoS.
- IPsec and XFRM: full per Section 16.22 —
transform database, SA/SP lookup, ESP/AH, IKEv2 key management integration. Required
for site-to-site VPN, K8s encrypted pod networking (Calico IPsec mode).
- IPVS: full per Section 16.30 — connection-based load
balancing, NAT/DR/TUN modes, persistence, health checking. Default backend for
kube-proxy in IPVS mode — required for production Kubernetes.
- Network service provider: full per Section 16.31 —
capability service for network operations over peer protocol
Storage (extended):
- dm/LVM: full per Section 15.2 — dm-linear, dm-crypt,
dm-thin, dm-snapshot. Required for many real-world storage configurations (Ubuntu/Fedora
default to LVM root).
- AHCI/SATA driver (Tier 1): full per Section 15.4 —
AHCI controller and SATA disk driver. AHCI link power management, hot-plug, NCQ,
error recovery. AhciPort struct (command list, received FIS, port registers), FIS types
(Register H2D, DMA Setup, PIO Setup, Data, BIST, Set Device Bits), command slot
management (32-slot command header array), NCQ support (tag mapping), ATAPI passthrough
for optical drives. Tier 1 KABI driver, Phase 3.
- Block storage networking: full per Section 15.13 —
iSCSI initiator (RFC 7143), NVMe/TCP (NVMe-oF), iSER (iSCSI over RDMA). Enterprise
SAN connectivity. Required for cloud instances with remote block storage.
- NFS client (full): full per Section 15.14 — NFSv4.1/4.2,
SunRPC, RPCSEC_GSS (Kerberos), delegation, pNFS layouts, state recovery, lease renewal,
and multi-server failover. Phases 1-3 use local disk boot only (initramfs → VirtIO-blk
or NVMe root); NFS root mount is not a Phase 2-3 gate. Phase 4 implements the complete
NFS client from scratch — no partial "NFS root only" stub exists in earlier phases.
- Disk quotas: full per Section 14.15 — user/group/project
quotas, grace periods, quota files, enforcement at block allocation. Required for
multi-user servers.
- Persistent memory: full per Section 15.16 — DAX
(direct access, bypasses page cache), MAP_SYNC, CLWB fencing, PMEM block device,
filesystem DAX (fsdax). For NVDIMM and CXL memory-class devices.
- ZFS integration: full per Section 15.10 — KABI bridge
to OpenZFS, avoiding GPL/CDDL license conflict. Pool import/export, scrub, send/recv.
Distributed (extended):
- DLM: full per Section 15.15 — RDMA-native lock acquisition (atomic CAS), lease-based extension, per-resource recovery, batch operations, deadlock detection (5s timeout). Required for clustered filesystems and multi-node coordination.
- RDMA transport (Mode B): full per Section 5.4 — RoCEv2 queue pairs, RDMA Send/Recv for messages, RDMA Write for bulk data, RDMA atomic CAS for one-sided locking. Enables high-performance multi-node clusters (2-3 µs uncontested lock vs 10-100 µs over TCP).
- Multi-node cluster membership: full per Section 5.2 — Raft-based quorum, node join/leave/eviction, split-brain protection, leader election. Extends the Phase 3 two-node PCIe model to N-node network clusters.
- SmartNIC/DPU offload: full per Section 5.11 — offload criteria evaluation, DPU discovery via CapAdvertise, automatic service migration (network stack → DPU), fallback on DPU crash. For Nvidia BlueField, AMD Pensando, Intel IPU.
- Affinity-based service placement: full per Section 5.12 — ServiceAffinity rules, three-pass placement algorithm, hysteresis to prevent flapping. Used by SmartNIC offload and multi-node workload placement.
- Topology reasoning engine: full per Section 5.2 — TopologyQuery API, constraint solver, cached results with generation tags. Foundation for placement decisions.
Security (extended):
- LSM framework: full per Section 9.8 — SELinux policy
engine, AppArmor profiles, hook dispatch, LsmBlob per-object storage, stacking support.
Required for Fedora (SELinux mandatory) and Ubuntu (AppArmor default).
- TPM runtime services: full per Section 9.4 — TPM 2.0
command transport, PCR extend/read/quote, seal/unseal, attestation, HMAC sessions.
Required for measured boot, systemd-cryptenroll, remote attestation.
- Kernel key retention service: full per
Section 10.2 — keyrings (session,
process, user, persistent), key types (user, logon, asymmetric, encrypted, trusted),
key lifecycle, garbage collection, user-namespace-aware keyrings.
- Confidential computing (host): full per Section 9.7 —
SEV-SNP (AMD), TDX (Intel), CCA (ARM) VM management. Secure page table management,
attestation flow, migration restrictions. Requires KVM (this phase).
- PQC algorithm implementations: full per Section 9.6 —
ML-KEM-768/1024 (key encapsulation), ML-DSA-65 (signatures), hybrid X25519+ML-KEM mode.
Phase 1 provided abstractions; Phase 4 implements the algorithms for driver signing
and secure boot.
- EVM (Extended Verification Module): full per
Section 9.5 — HKDF-SHA3-256 key derivation, protected xattr
HMAC, IMA interaction, evm_mode boot parameter.
Scheduling (extended):
- Intent-based resource management: full per
Section 7.10 — intent cgroup knobs
(cpu.intent, memory.intent, io.intent), PD optimizer (SCHED_IDLE background
thread), workload classification, auto-tuning feedback loop.
- Core provisioning and workload partitioning: full per
Section 7.11 — LL/CG/Backfill
core classes, cpu.provision_count cgroup knob, gang scheduling (MCP mode), OS noise
elimination on CG cores (<1 µs/sec), backfill preemption (10 µs max). For HPC, latency-
sensitive workloads, and DPDK-style poll-mode applications.
Observability (extended):
- Fault Management Architecture (full): full per
Section 20.1 — health telemetry
for all driver families (not just NVMe/NIC), rule-based diagnosis, automated repair
actions, fault escalation chains. Phase 3 provided basic NVMe/NIC health; Phase 4
covers all Tier 1/2 drivers.
- perf_events / PMU: full per Section 20.8 —
perf_event_open() syscall, hardware PMU counters (cycles, instructions, cache misses,
branch mispredictions), software events, sampling, perf tool support, BPF program
attachment to perf events.
- EDAC: full per Section 20.6 — memory ECC error reporting,
per-DIMM error counters, correctable/uncorrectable classification, MCE integration,
CMCI (Corrected Machine Check Interrupt). Required for server reliability monitoring.
- pstore: full per Section 20.7 — ramoops (RAM-backed
persistent storage), NVRAM logging, console/ftrace/pmsg frontends, coredump capture
to persistent storage. Critical for post-crash debugging on real hardware.
- Debugging and process inspection: full per
Section 20.4 — ptrace()
(ATTACH, PEEK/POKE, SINGLESTEP, GETREGSET/SETREGSET, SEIZE), core dumps, /proc/[pid]/mem,
/proc/[pid]/maps, gdbserver support. Required for strace, gdb, and many LTP tests.
VFS (extended):
- autofs: full per Section 14.10 — mount trigger protocol,
userspace daemon communication, direct/indirect/offset mounts, expiry. Used by
systemd .automount units and NFS automounting.
- configfs: full per Section 14.12 — kernel object configuration
filesystem, show/store attribute callbacks, groups, default groups, drop. Used for
runtime configuration of USB gadgets, target iSCSI, DLM, and VFIO mdev.
- binfmt_misc: full per Section 14.9 — arbitrary binary format
registration via magic/extension matching, interpreter invocation. Required for QEMU
user-mode emulation (multi-arch containers), Java, Wine.
- NFS server (nfsd): full per Section 15.12 — NFSv4.1/4.2
export table, RPC dispatch, state management, lease recovery. For NAS/file server
deployments.
Device frameworks (infrastructure for real hardware drivers):
- I2C/SMBus bus framework: full per Section 13.13 —
I2C adapter/client model, SMBus protocol, userspace /dev/i2c-* access, device tree
binding. Required for: IPMI/BMC communication, EDID (display identification), sensor
chips (hwmon), touchpads (I2C-HID), EEPROMs.
- SPI bus framework: full per Section 13.20 —
SPI master/slave, DMA support, chip select, clock mode. Required for NOR flash, some
sensors, embedded peripherals.
- USB subsystem (extended): Phase 3 provided XHCI + HID + mass storage + serial. Phase 4
adds USB audio class driver (via ALSA framework), USB video class (UVC), and remaining
class drivers per Section 13.12.
- IPMI: full per Section 13.23 — IPMI 2.0 command
transport (KCS, BT, SSIF), sensor data records, system event log, watchdog integration.
Standard on all servers; required for out-of-band management.
- Hardware watchdog: full per Section 13.19 —
WatchdogDevice trait, timeout management, pre-timeout actions (NMI, SCI), panic-on-
timeout policy. Required for server reliability (systemd-watchdog, keepalived).
- RTC subsystem: full per Section 13.28 — RtcDevice
trait, full Linux ioctl table, alarms, Y2K38-safe u64 timestamps. Required for
hardware clock sync, hwclock, and systemd-timesyncd fallback.
- NVMEM: full per Section 13.25 — non-volatile memory
framework for MAC addresses, calibration data, OTP fuses, serial numbers. Required
for real NICs (MAC address), real SoCs (calibration).
- UIO: full per Section 13.24 — userspace I/O framework,
device mmap, interrupt delivery to userspace. Required for DPDK (non-VFIO mode),
legacy industrial I/O devices.
User I/O frameworks (infrastructure — drivers come in Phase 5):
- DRM/KMS core: full per Section 21.5 — DRM device model, KMS modesetting API (CRTC, encoder, connector, plane), GEM buffer management, atomic modesetting, framebuffer console. Phase 4 provides the framework; Phase 5 adds GPU-specific drivers (i915, amdgpu).
- ALSA core: full per Section 21.4 — PCM playback/capture, mixer controls, ALSA ioctl interface, jack detection, sample rate conversion. Phase 4 provides the framework; Phase 5 adds codec drivers (Intel HDA, USB Audio).
- Input subsystem (evdev): full per Section 21.3 — input event device model, EV_KEY/EV_REL/EV_ABS events, force feedback, multitouch protocol. Phase 4 provides the framework; Phase 5 adds touchpad/tablet drivers.
Compat (extended):
- Safe kernel extensibility: full per Section 19.9 — policy vtable traits, module lifecycle, domain-isolated extensibility points (scheduler class, congestion control, LSM). Enables third-party kernel modules with crash containment.
- Live kernel evolution: full per Section 13.18 — Theseus-inspired state export/import, atomic component swap, post-swap watchdog with a 5-second timer, HMAC integrity tags on serialized state. Includes KABI service live replacement (Section 13.18) with incremental state export, multikernel rolling deployment, and a driver tier promotion protocol. Post-evolution behavioral health monitoring (Section 13.18): a configurable soak period (60-300s) compares FMA health metrics against the pre-evolution baseline, alerting on sustained degradation (forward-only, no automatic rollback). Enables zero-downtime kernel updates for long-running server workloads and fast agentic development cycles (Section 25.17).
Quality and packaging:
- LTP conformance: Linux Test Project suite passing (>95% of applicable tests). Non-
applicable tests: those requiring kernel features explicitly deferred to Phase 5
(e.g., GPU-specific ioctls, WiFi nl80211, nested KVM). Each exclusion documented
with rationale.
- Agentic driver rewrite: top-20 Linux driver families ported to KABI via AI-assisted
translation. Families prioritized by server/cloud frequency: virtio-, e1000e/i210,
mlx5, nvme, ahci, xhci, i2c-, hwmon, ipmi, rtc, watchdog, dm-*, raid, iscsi,
nvme-tcp, bridge, veth, tun/tap, vhost, vfio.
- Crash recovery testing: full Tier 1/2 fault injection across all Phase 4 driver
families. Fault types: MMIO read/write errors, DMA completion timeout, interrupt
storm, device reset failure, partial initialization crash. Recovery SLA: Tier 1
reload <150ms, Tier 2 restart <10ms.
- Performance tuning: reach within 5% of Linux on target benchmarks —
nginx (HTTP throughput), fio (storage IOPS), iperf3 (network bandwidth),
sysbench (CPU/memory/mutex), pgbench (database), redis-benchmark (in-memory).
- Package: .deb (Ubuntu 24.04+) and .rpm (Fedora 40+) packages. Installable via
apt/dnf, GRUB menu entry auto-configured, dual-boot with Linux supported.
Syscall clusters (Phase 4 adds ~80-100 syscalls):
| Cluster | Syscalls (representative, not exhaustive) |
|---|---|
| KVM | ioctl(KVM_CREATE_VM, KVM_CREATE_VCPU, KVM_RUN, KVM_SET_USER_MEMORY_REGION, KVM_GET/SET_REGS) |
| VFIO | ioctl(VFIO_GET_API_VERSION, VFIO_GROUP_SET_CONTAINER, VFIO_DEVICE_GET_INFO, VFIO_DEVICE_SET_IRQS) |
| Netfilter | setsockopt(IP_TABLES), nfnetlink socket family, conntrack via netlink |
| Quota | quotactl, quotactl_fd |
| Key management | add_key, request_key, keyctl |
| Perf | perf_event_open, ioctl(PERF_EVENT_IOC_*) |
| ptrace | ptrace(ATTACH, PEEK, POKE, GETREGSET, SETREGSET, SEIZE, INTERRUPT) |
| Misc | personality, kcmp, membarrier, rseq, close_range, openat2, statx, copy_file_range |
Multi-arch (Phase 4): All Phase 4 subsystems compile and pass unit tests on all 8 architectures. KVM validated on x86-64 (VMX/EPT) and AArch64 QEMU (VHE). Netfilter/conntrack stress-tested on AArch64 (weak memory model paths in conntrack hash tables). IOMMU domain operations validated on ARM SMMU v3 in QEMU. eBPF JIT produces correct code on x86-64, AArch64, and RISC-V. LTP run on both x86-64 (real hardware) and AArch64 (QEMU) — pass rate may differ but regressions are investigated.
Exit criteria: UmkaOS boots unmodified Ubuntu 24.04 and Fedora 40 on real x86-64 hardware (not just QEMU). Runs Docker + Kubernetes single-node with full bridge networking (veth + bridge + NAT + IPVS). KVM boots a guest VM with device passthrough (VFIO). LTP passes >95% of applicable tests on x86-64 and >90% on AArch64 QEMU. Performance within 5% of Linux on all target benchmarks. All Tier 1/2 drivers survive fault injection with recovery demonstrated end-to-end.
24.2.6 Phase 5: Ecosystem and Platform Maturity¶
Goal: Broad adoption — multi-architecture production support, consumer hardware, advanced distributed computing, HPC acceleration, vendor partnerships.
See Section 25.3 (Phase 5.1: Windows Emulation Acceleration) for detailed agentic workflow steps within this roadmap phase.
Phase 5 is organized into sub-phases. Sub-phases are parallel workstreams, not sequential gates — teams can work on 5b (consumer hardware) and 5c (distributed) concurrently. Each sub-phase has its own exit criteria.
Spec depth note: Phase 5 items are specified at full architectural depth in their respective chapters — data structures, interfaces, and algorithms are defined for design completeness. Implementation is deferred to after Phase 4 exit. Sections in other chapters that define Phase 5 data structures carry an explicit deferral note (e.g., "Phase 5 — data structures defined here for design completeness; implementation deferred"). This ensures agents and reviewers do not mistake Phase 5 specifications for current implementation targets.
24.2.6.1 Phase 5a: Multi-Architecture Production¶
Goal: All 8 architectures reach production quality with full Tier 1 driver isolation.
- AArch64: full Tier 1 isolation via POE (ARMv9.4-A+) or page-table fallback per Section 11.2. Production-quality GIC, timer, SMMU v3 drivers.
- RISC-V 64: Tier 1 runs as Tier 0 (in-kernel) until ISA adds fast isolation primitives per Section 11.2. Full PLIC, SBI, Sv48 support. Tier 2 (Ring 3 + IOMMU) available for untrusted drivers.
- PPC32: full Tier 1 isolation via segment registers per Section 11.2. Embedded PowerPC support (Freescale/NXP e500/e6500).
- PPC64LE: full Tier 1 isolation via Radix PID on POWER9+ per Section 11.2. IBM POWER server support with XIVE interrupts, OPAL firmware interface.
- ARMv7: full Tier 1 isolation via DACR. Embedded ARM support (Cortex-A7/A15/A17).
- s390x: Tier 1 runs as Tier 0 (Storage Keys too coarse for fast domain isolation) per Section 11.2. Full PSW-swap interrupt subsystem, SCLP console, Channel I/O (CCW/QDIO), virtio-ccw transport, SIGP SMP. z/VM and LPAR support.
- LoongArch64: Tier 1 runs as Tier 0 (no hardware isolation mechanism) per Section 11.2. Full EIOINTC interrupt controller, Stable Counter timer, hybrid TLB (software refill 3A5000 / hardware PTW 3A6000), PCIe IOMMU.
Exit criteria: All 8 architectures boot on QEMU and pass the full Phase 4 LTP suite with no regressions. Tier 1 driver isolation exercised on each architecture (POE on AArch64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE; RISC-V, s390x, and LoongArch64 confirmed Tier 1 unavailable, Tier 0/Tier 2 placement validated). AArch64 real hardware (RPi 5, optionally Apple M1) passes the full Phase 4 LTP suite alongside x86-64.
24.2.6.2 Phase 5b: Consumer Hardware¶
Consumer hardware support enables UmkaOS as a desktop/laptop OS. This sub-phase provides the kernel-side infrastructure; userspace (desktop environments, package managers) is out of scope.
Wireless and connectivity:
- WiFi (nl80211): full per Section 13.15 — nl80211 cfg80211 interface, WPA3/SAE,
  802.11ax (WiFi 6), scan/connect/roam. Drivers: Intel iwlwifi, Realtek rtw89,
  Qualcomm ath11k, Mediatek mt76, Broadcom brcmfmac.
- Bluetooth: full per Section 13.14 — HCI transport (USB, UART), L2CAP, RFCOMM, HID
  (input devices), A2DP (audio routing to ALSA), LE (Low Energy). Drivers: Intel,
  Realtek, Qualcomm, Broadcom.
- rfkill: full per Section 13.21 — RF kill switch framework, per-device
  enable/disable, sysfs interface, input event integration.
Audio and display:
- Audio drivers: Intel HDA (codec driver via ALSA framework from Phase 4), USB Audio
  Class, SoundWire per Section 13.26. PipeWire/PulseAudio integration (userspace, no
  kernel changes beyond ALSA).
- Graphics drivers: Intel i915 modesetting (DRM/KMS driver using Phase 4 framework),
  AMD amdgpu modesetting, VESA/EFI framebuffer fallback. Phase 5b provides modesetting
  only; 3D acceleration in Phase 5e.
- Multi-monitor: DRM/KMS atomic modesetting with hotplug detection, EDID parsing (via
  I2C framework from Phase 4), DisplayPort MST.
Input devices:
- Touchpad: I2C-HID driver (using I2C + evdev frameworks from Phase 4), PS/2
  Synaptics, multitouch gestures via evdev MT protocol.
- Keyboard: USB HID (Phase 3), PS/2 AT keyboard, multimedia keys via evdev.
Platform management:
- Suspend/resume (consumer): S3 (suspend-to-RAM), S0ix (Modern Standby) per Section
  18.4. Device state save/restore for all consumer drivers. Wake-on-LAN, wake-on-USB,
  lid switch handling.
- Power profiles: performance/balanced/battery-saver modes via power budgeting
  framework (Section 7.7). Per-app power attribution via cgroup energy accounting.
- Regulator framework: full per Section 13.27 — voltage voting model,
  RegulatorConsumer RAII, SoC PMIC support.
- MTD: full per Section 13.22 — raw flash access, bad block management, partition
  tables. Required for embedded boot media, SPI NOR flash.
Connectivity (extended):
- Thunderbolt 3/4 and USB4: device tunneling, PCIe-over-Thunderbolt, security levels.
  Thunderbolt/USB4 requires a future Thunderbolt framework section (security
  authorization levels, PCIe tunneling, DisplayPort Alt Mode, daisy-chain topology).
  Full spec deferred to Phase 5b. Spec: Phase 4 — KABI driver using USB4 tunneling
  protocol. Architecture in Section 13.12.
- eMMC and SD card: MMC framework, SDHCI driver. eMMC/SD requires a future MMC
  framework (CMD class support, UHS-I/II timing, eMMC 5.1 HS400 mode, partition
  management). Spec deferred to Phase 5. Spec: Phase 4 — KABI driver using SD/MMC
  protocol. Architecture in Section 13.1.
Desktop / laptop performance targets:
| Metric | Target |
|---|---|
| Kernel boot (bootloader → login screen) | < 5 seconds |
| Resume from S3 suspend | < 2 seconds |
| Resume from S4 hibernate | < 10 seconds |
| Idle power (WiFi on, display on) | Match or exceed Ubuntu 24.04 |
| Video playback (1080p H.264) | Hardware decode; CPU < 5% |
Validation: Side-by-side battery life comparison with Ubuntu 24.04 on identical hardware (Speedometer + video stream benchmark). 100+ beta testers running UmkaOS as daily driver for 30-day soak; collect crash dumps, performance traces, battery stats.
Exit criteria: UmkaOS boots on 3+ common Intel/AMD laptops (ThinkPad, XPS, Framework) with WiFi, Bluetooth, touchpad, audio, and display working. S3 suspend/resume cycles without regression. Battery life within 10% of Ubuntu 24.04.
24.2.6.3 Phase 5c: Advanced Distributed and HPC¶
Goal: Multi-node production clusters, DSM coherence, HPC acceleration.
Distributed shared memory (DSM):
- DSM: full per Section 6.2 — MOESI-like page coherence protocol, wire format,
  home-node management, subscriber-controlled caching with DLM integration, vector
  clock causal consistency, anti-entropy for relaxed mode. Application-visible DSM
  with syscall interface and distributed futex.
- Clustered filesystems: full per Section 15.14 — GFS2 and OCFS2 support via DLM
  (Phase 4), journal-per-node, fencing integration. For high-availability shared
  storage (SAN, iSCSI).
HPC and acceleration:
- RDMA userspace verbs: full per Section 22.7 — libibverbs compat, ibverbs uAPI, queue
  pair management, memory registration, RDMA CM. Required for MPI (OpenMPI, MVAPICH2),
  NCCL (distributed ML training), UCX.
- GPU compute acceleration: full per Section 22.1 — AccelBase framework, GPU memory
  management (TTM/GEM), compute queue submission, shader dispatch. For OpenCL, CUDA
  (via KABI shim), ROCm (via KABI shim).
- Unified compute topology: full per Section 22.8 — multi-dimensional capacity
  profiles, cross-device energy model, advisory placement overlay. Enables
  heterogeneous scheduling across CPU, GPU, DPU, FPGA resources.
- Accelerator P2P DMA: full per Section 22.4 — GPU-to-GPU direct memory access,
  NVLink/xGMI interop, NUMA-aware accelerator memory, CXL fabric integration.
Inference and ML policy:
- In-kernel inference engine: full per
Section 22.6 — ONNX model loading,
tensor operations, NPU binding, accelerator-aware dispatch. Provides the inference
substrate for ML-driven kernel policies (closed-loop tuning, anomaly detection).
- ML policy framework: full per
Section 23.1 — closed-loop kernel
intelligence, policy cascade (heuristic → model → optimizer), observation channels,
PolicyService rate limiter, per-cgroup parameter overrides. Enables ML-driven scheduling,
memory management, and I/O optimization.
- Unified cgroup compute.weight: per
Section 22.8 — optional knob providing
orchestration layer over existing per-domain (CPU, GPU, RDMA) scheduling knobs.
Advanced networking:
- SCTP: full per Section 16.23 — multi-stream, multi-homing, message boundaries.
  Required for telecom signaling (SIGTRAN), some HPC messaging.
- Bonding/teaming: link aggregation (802.3ad LACP), active-backup, balance-rr.
  Required for server NIC redundancy. Link aggregation requires a future bonding
  section with LACP 802.3ad state machine, bonding mode enum (round-robin,
  active-backup, XOR, broadcast, 802.3ad, TLB, ALB), and netlink interface for bond
  management. Spec: Phase 4 — virtual NIC combining multiple physical NICs.
  Architecture in Section 16.1.
- XDP (eXpress Data Path): full per eBPF framework (Phase 3) — XDP_PASS, XDP_DROP,
  XDP_TX, XDP_REDIRECT at driver level. For line-rate packet processing, DDoS
  mitigation.
Peer kernel nodes:
- ClusterTransport unification: full per Section 22.8 — all ClusterTransport
  implementations (PCIe BAR, RDMA, CXL, USB, TCP, NVLink, HiperSockets)
  production-quality with full peer protocol conformance testing.
- Peer kernel nodes: full per Section 22.8 — devices with serious compute (DPUs, GPUs
  with dozens of ARM cores) run full UmkaOS instances. Vendor-driven adoption;
  architecture ready from Phase 3 peer protocol.
- Computational storage: full per Section 15.17 — NVMe Computational Programs command
  set, CSD as AccelDeviceClass, in-storage compute for database/analytics pushdown.
Exit criteria: 3+ node cluster runs with RDMA transport and DLM. DSM coherence demonstrated with a distributed database workload. GPU compute job runs via AccelBase. MPI hello-world completes over RDMA.
24.2.6.4 Phase 5d: Ecosystem Maturity¶
Goal: Vendor partnerships, distribution certification, community ecosystem.
- Vendor KABI drivers: Nvidia GPU driver (signed ML-DSA-65, Tier 2 isolated per Section 24.1), AMD GPU driver, Intel GPU driver. Each vendor ships a single binary driver for all UmkaOS versions via stable KABI ABI.
- Distribution certification: RHEL, Ubuntu, SUSE official support. Kernel package in
  distribution repositories. grubby/update-grub integration.
- Community driver SDK: comprehensive documentation, example drivers (null block,
  loopback NIC, stub GPU), mentorship program. SDK dual-licensed Apache-2.0 OR MIT.
- Nested virtualization: KVM-on-KVM per Section 18.1. Required for CI/CD (GitHub Actions, GitLab runners) and cloud providers.
- Live kernel upgrade: stop all Tier 1/2 drivers → swap core binary → restart drivers. Zero-downtime kernel updates for long-running server fleets. Uses live kernel evolution framework from Phase 4.
Exit criteria: At least one vendor (Nvidia, AMD, or Intel) ships a signed KABI GPU driver. UmkaOS kernel package accepted into at least one major distribution repository (RHEL, Ubuntu, or SUSE). Live kernel upgrade demonstrated end-to-end with zero downtime on a running workload. Nested KVM boots a guest VM.
24.2.6.5 Phase 5e: Gaming and Creative¶
Goal: Support gaming and content creation workloads.
- Vulkan drivers: Mesa RADV (AMD), Intel ANV via DRM/KMS framework (Phase 4). Full 3D acceleration, Vulkan 1.3+ conformance.
- Steam + Proton: Proton/Wine game compatibility layer. Requires WEA (Section 19.6) for Windows syscall translation.
- Windows Emulation Acceleration (WEA): full per Section 19.6 — WINE integration, PE loader hooks, Windows syscall fast-path translation, D3D-to-Vulkan shader pipeline offload.
- GPU video encode/decode: VA-API and V4L2 stateless codec framework. Hardware acceleration for H.264, H.265, VP9, AV1.
Exit criteria: Steam launches, Proton runs top-100 Steam Deck verified games at native-or-better performance vs Linux. Video playback uses hardware decode (CPU <5% for 1080p H.264).
Phase 5 overall exit criteria: All sub-phase (5a-5e) exit criteria met. No regressions in Phase 1-4 test suites (LTP, KVM, Docker/K8s, fault injection). All live evolution primitives validated end-to-end (hot-swap, attestation, crash recovery and watchdog reload). Kernel self-update demonstrated on a production workload without downtime.
24.2.7 Adoption Story: From Drivers to Distributed¶
The phases above define engineering milestones. The adoption story — how UmkaOS delivers value to users and vendors at each stage — maps onto them:
Stage 1: "Better Linux" (Phases 2-3) — UmkaOS boots with Tier 0/1/2 KABI drivers, runs Docker and systemd. Value proposition: crash-recoverable drivers + stable binary ABI. No vendor cooperation required. No firmware changes. Every Linux-supported device works via ported KABI drivers. This is what gets users.
Stage 2: "Self-Describing Devices" (Phase 3-4) — Firmware teams add an 8-12K line C shim to their existing RTOS (see Section 24.1). The device becomes self-describing, crash-recoverable without host involvement, and the vendor can stop maintaining per-OS host-side drivers. The incentive is reduced maintenance burden — a cost saving, not a favor. The existing Tier 1 KABI driver continues working alongside the shim; the vendor can test and cut over at their own pace.
Stage 3: "Peer Kernels" (Phase 5+) — Devices with serious compute (DPUs, GPUs, smart NICs with dozens of ARM cores) run full UmkaOS instances. The peer protocol is already proven at the shim level. DSM coherence is enabled for workloads that benefit (HPC, distributed databases). This is the long-term vision but it gates nothing — every previous stage delivers standalone value.
Each stage builds on proven infrastructure from the previous stage. No stage requires speculative industry cooperation. The friction for Stage 2 is genuinely low: "add a small protocol library to firmware you already ship, and you can stop maintaining host-side drivers for every OS."
24.2.8 Licensing Summary¶
| Component | IP Source | Risk |
|---|---|---|
| Confidential computing (TEE) | Hardware vendor specs (AMD SEV, Intel TDX, ARM CCA), all public | None |
| Post-quantum crypto | NIST standards (FIPS 203, 204, 205), public domain algorithms | None |
| Power budgeting | RAPL (Intel public spec), SCMI (ARM public spec), original design | None |
| Hardware memory safety | ARM MTE (public ISA), Intel LAM (public ISA) | None |
| Formal verification | Verus (MIT license), RustBelt (academic, published) | None |
| Safe extensibility | Original design (extends existing KABI vtable model) | None |
| Live kernel evolution | Theseus OS concepts (academic, published, Rice University) | None |
| Intent-based management | Original design, optimization theory (academic) | None |
| Real-time guarantees | PREEMPT_RT concepts (GPLv2, Linux mainlined), CBS (academic) | Medium — see note below |
| SmartNIC/DPU offload | Original design (extends existing peer model + capability service providers) | None |
| Persistent memory | DAX/PMEM specifications (SNIA, public), Linux interfaces (facts) | None |
| Computational storage | NVMe Computational Programs Command Set and Subsystem Local Memory Command Set (public, NVMe consortium, January 2024) | None |
| Unified compute model | Original design (extends existing AccelBase + EAS models) | None |
All components are either original design, based on published academic research, based on public hardware specifications, or based on NIST/industry standards. No vendor-proprietary APIs or patented algorithms.
PREEMPT_RT derivative risk: PREEMPT_RT is GPLv2 and was merged into Linux mainline (v6.12). Any UmkaOS real-time code derived from PREEMPT_RT implementation (as opposed to the general concepts of preemptible kernels, threaded interrupts, and priority inheritance) could carry GPLv2 obligations that conflict with OKLF's additional permissions. UmkaOS's RT implementation MUST be a clean-room design based on published academic literature (priority inheritance protocols: Sha, Rajkumar, Lehoczky 1990; CBS: Abeni and Buttazzo 1998; LITMUS-RT: Brandenburg 2011) and public OS design textbooks, not derived from Linux PREEMPT_RT source code. Code review must verify no Linux-derived lock conversion patterns, interrupt threading structures, or RT-specific scheduler modifications are copied.
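The clean-room baseline for RT is the published protocol itself. As a reference point, a minimal sketch of the priority-inheritance rule from the cited literature (Sha, Rajkumar, Lehoczky 1990); all type and function names here are hypothetical, not UmkaOS API:

```rust
// Illustrative only: the classic priority-inheritance rule, not UmkaOS code.
// A lock holder temporarily runs at the highest priority of any waiter,
// bounding priority inversion to the length of the critical section.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Prio(u8); // higher number = higher priority

struct Task {
    base_prio: Prio,
    boosted_prio: Option<Prio>, // set while holding a contended lock
}

impl Task {
    fn effective_prio(&self) -> Prio {
        self.boosted_prio.map_or(self.base_prio, |b| b.max(self.base_prio))
    }
}

/// Apply priority inheritance: boost `holder` to the maximum effective
/// priority among `waiters`, if that exceeds its own.
fn inherit_priority(holder: &mut Task, waiters: &[Task]) {
    if let Some(max_waiter) = waiters.iter().map(|t| t.effective_prio()).max() {
        if max_waiter > holder.effective_prio() {
            holder.boosted_prio = Some(max_waiter);
        }
    }
}

/// On lock release, the boost is dropped and the task returns to base priority.
fn release_lock(holder: &mut Task) {
    holder.boosted_prio = None;
}

fn main() {
    let mut low = Task { base_prio: Prio(1), boosted_prio: None };
    let waiters = [Task { base_prio: Prio(9), boosted_prio: None }];
    inherit_priority(&mut low, &waiters);
    assert_eq!(low.effective_prio(), Prio(9)); // boosted while holding the lock
    release_lock(&mut low);
    assert_eq!(low.effective_prio(), Prio(1)); // back to base after release
}
```

This sketch captures only the basic inheritance rule; transitive inheritance chains and the priority-ceiling variant are covered in the cited papers.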
24.2.9 Performance Impact Summary¶
Every feature in this document was evaluated against the constraint: "Does this make UmkaOS measurably slower than Linux on the same workload?"
| Feature | Hot-Path Impact vs Linux | Justification |
|---|---|---|
| Confidential computing | 0% (same hardware, same cost) | Hardware AES engine, identical to Linux |
| Post-quantum crypto | 0% (cold-path only) | Boot/driver-load only. ML-DSA-44 verify comparable to Ed25519; ML-DSA-65 verify ~100-200 µs (cold-path only, not on hot paths) |
| Power budgeting | 0.015% (MSR reads at tick) | 600ns per 4ms tick. Invisible in any benchmark. Per-task EAS overhead: see Section 24.4 |
| Hardware memory safety | 0% vs Linux when enabled | Same MTE instructions, same hardware cost. Tag RAM overhead: 3.125% of DRAM (ARM MTE only) |
| Formal verification | 0.000% (compile-time) | Not in the binary |
| Safe extensibility | 0% (same as Linux sched_class) | Function pointer dispatch, same mechanism |
| Live kernel evolution | 0.000% (rare event only) | ~10μs during replacement, months between events |
| Intent-based management | ~0.00005% (background only) | 3μs per second background optimization |
| Real-time guarantees | 0% to 5% (configurable) | Same cost as Linux PREEMPT_RT when enabled. 0% = PREEMPT_NONE/VOLUNTARY, ~1% = PREEMPT_FULL, 2-5% = PREEMPT_FULL with RT scheduling classes active |
| SmartNIC/DPU offload | Negative (faster) | Moves work OFF host CPU |
| Persistent memory | Negative (faster) | DAX eliminates page cache copies |
| Computational storage | Negative (faster) | CSD reduces data movement |
| Unified compute model | ~0.00005% (background only) | ~4μs/sec/cgroup advisory. Submission hot path unchanged |
Target: match or exceed Linux performance for all common workloads. Most features are invisible at steady state, and several actually improve performance. Known exceptions are conscious trade-offs documented in their respective sections: RT scheduling adds 0-5% overhead for RT-class tasks (same cost as Linux PREEMPT_RT); capability checks add ~5-10 cycles per privileged operation (~0.1%, fully pipelined bitmask test); untrusted policy module isolation adds ~46 cycles per domain crossing (eliminated once the module graduates to the Core isolation domain).
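To illustrate why the capability check stays near ~0.1%, a minimal sketch of a generation-checked bitmask test; the struct layout and names are hypothetical, not the actual CapTable format from Section 9.1:

```rust
// Sketch of the hot-path capability check cited above: one bounds test,
// one generation compare, one bitmask AND. Layout is illustrative only.

const CAP_READ: u32 = 1 << 0;
const CAP_WRITE: u32 = 1 << 1;
const CAP_MMAP: u32 = 1 << 2;

#[derive(Clone, Copy)]
struct CapEntry {
    generation: u32, // bumped on revocation; stale handles fail the compare
    perms: u32,      // permission bitmask
}

#[derive(Clone, Copy)]
struct CapHandle {
    index: usize,
    generation: u32,
}

/// Hot-path check: a handful of cycles, all branch-predictable, no locks.
fn cap_check(table: &[CapEntry], h: CapHandle, required: u32) -> bool {
    match table.get(h.index) {
        Some(e) => e.generation == h.generation && (e.perms & required) == required,
        None => false,
    }
}

fn main() {
    let table = [CapEntry { generation: 7, perms: CAP_READ | CAP_WRITE }];
    let h = CapHandle { index: 0, generation: 7 };
    assert!(cap_check(&table, h, CAP_READ));
    assert!(cap_check(&table, h, CAP_READ | CAP_WRITE));
    assert!(!cap_check(&table, h, CAP_MMAP)); // missing permission bit
    let stale = CapHandle { index: 0, generation: 6 };
    assert!(!cap_check(&table, stale, CAP_READ)); // revoked generation
}
```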
24.3 Verification Strategy¶
24.3.1 Testing Layers¶
| Layer | Tool / Method | What it verifies |
|---|---|---|
| Unit tests | `cargo test` (in QEMU or host mock) | Individual subsystem correctness |
| Integration tests | Custom test harness in QEMU | Cross-subsystem interactions |
| Syscall conformance | Linux Test Project (LTP) | Syscall behavior matches Linux (see below) |
| Application testing | Boot Ubuntu minimal, Alpine | Real-world application compatibility |
| Container testing | Docker hello-world, nginx, redis | Container runtime compatibility |
| Kubernetes testing | k3s single-node | Orchestration platform compatibility |
| ABI regression | `kabi-compat-check` in CI | No breaking changes to KABI |
| Crash recovery | Fault injection framework | Tier 1/2 drivers recover correctly |
| Performance regression | Automated benchmarks vs Linux baseline | No unacceptable performance regression |
| Fuzzing | syzkaller (adapted for UmkaOS; requires KCOV-equivalent coverage, UmkaOS syscall descriptions, MTE/KASAN-equivalent sanitizer). KCOV specification deferred to Phase 4 (Ch 20 Observability — requires tracepoint integration and per-task coverage ring buffers). Phases 2-3: syzkaller runs in description-guided random mode (no coverage feedback). Phase 4+: syzkaller with KCOV coverage-guided mutation. | Syscall fuzzing for crash/hang detection |
| Static analysis | `cargo clippy`, custom lints | Code quality, unsafe usage review |
24.3.2 LTP as Agentic Compatibility Substrate¶
The Linux Test Project (~5,000+ test cases) is not merely a validation gate — it is the primary development substrate for Linux syscall compatibility work. For agentic development, LTP transforms the largest single task in UmkaOS (implementing ~400 syscalls with correct edge-case behavior) from an open-ended research problem into a structured, test-driven implementation task.
Role in agentic workflow: - Each LTP test encodes a concrete behavioral contract (input → expected output) that the implementing agent can read, implement against, and verify — without human involvement or ambiguous documentation. - Tests are organized by syscall family, providing natural agent work-unit decomposition. - Edge cases encoded in LTP tests represent decades of Linux bug reports and regression fixes — knowledge the agent gets for free. - Cross-architecture execution validates that syscall behavior is identical on all 8 architectures (catches wrong struct padding, wrong register conventions, wrong signal frame layouts).
See Section 25.17 for the full agentic LTP workflow and complementary test suites (syzkaller, xfstests, kselftest).
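As an illustration of the "behavioral contract" shape described above, a table-driven conformance sketch. The validated function is a mock, not a real syscall implementation; the errno behavior encoded (EBADF for bad fds, EINVAL for `oldfd == newfd` and unknown flags) follows the documented dup3(2) contract:

```rust
// Hypothetical sketch: each case pins an input to the errno Linux returns,
// so an implementing agent can code against the table and verify mechanically.

const EINVAL: i32 = 22;
const EBADF: i32 = 9;

/// Mock of a dup3-like argument validator, mirroring documented Linux behavior.
fn sys_dup3_validate(oldfd: i64, newfd: i64, flags: u32) -> Result<(), i32> {
    const O_CLOEXEC: u32 = 0o2000000;
    if oldfd < 0 || newfd < 0 {
        return Err(EBADF);
    }
    if oldfd == newfd {
        return Err(EINVAL); // dup3(2): oldfd == newfd is an error
    }
    if flags & !O_CLOEXEC != 0 {
        return Err(EINVAL); // only O_CLOEXEC is a valid flag
    }
    Ok(())
}

fn main() {
    // (oldfd, newfd, flags) -> expected result, LTP-style
    let contract: &[((i64, i64, u32), Result<(), i32>)] = &[
        ((3, 4, 0), Ok(())),
        ((3, 3, 0), Err(EINVAL)),
        ((-1, 4, 0), Err(EBADF)),
        ((3, 4, 0xFFFF_FFFF), Err(EINVAL)),
    ];
    for (input, expected) in contract {
        assert_eq!(sys_dup3_validate(input.0, input.1, input.2), *expected);
    }
}
```

Each real LTP case additionally exercises the side effects (the returned fd, the CLOEXEC flag state), which is what makes the suite a substrate rather than a smoke test.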
24.3.3 Key Benchmarks¶
These benchmarks must match Linux within 5% (measured on identical hardware, same kernel configuration, same workload parameters):
| Benchmark | What it tests | Target delta |
|---|---|---|
| `fio` randread 4K QD32 | Block I/O fast path (IOPS) | < 2% |
| `fio` randwrite 4K QD32 | Block I/O write path (IOPS) | < 2% |
| `fio` sequential read 1M | Block I/O throughput (GB/s) | < 1% |
| `iperf3` TCP throughput | Network stack throughput | < 5% |
| `iperf3` TCP latency (RR) | Network stack latency | < 5% |
| nginx small-file HTTP (`wrk`) | Combined network + filesystem | < 5% |
| `redis-benchmark` | In-memory key-value (network + mem) | < 3% |
| `sysbench` OLTP read-write | Database workload (IO + CPU + sched) | < 5% |
| `hackbench` (groups=100) | Scheduler + IPC throughput | < 3% |
| `lmbench` lat_ctx | Context switch latency | < 1% |
| Kernel compile (`make -jN`) | Combined CPU + IO + scheduling | < 5% |
| `stress-ng` mixed | Overall system stress | < 5% |
Note: Target delta values are MAXIMUM ALLOWED overhead (failure thresholds). The design target is negative overhead (faster than Linux on the same hardware despite Tier 1 isolation). Any positive overhead within these thresholds must include root-cause analysis and a remediation plan documenting which UmkaOS optimization (CpuLocal registers, ring batching, lock-free structures, etc.) compensates for the measured cost. See Section 23.1 for the closed-loop optimization framework that drives toward negative overhead.
24.3.4 Crash Recovery Testing¶
Crash recovery is validated by a dedicated fault injection framework.
24.3.4.1.1 Activation¶
Fault injection is available in debug builds only (`cfg(umka_fault_inject)`).
It is never compiled into release builds. Two activation mechanisms:
- Kernel boot parameter: `umka.fault_inject=<target>[,<fault>]`. Example:
  `umka.fault_inject=nvme0,domain_violation` injects a domain access violation into
  the nvme0 driver on first I/O. The kernel logs the injection at `KERN_DEBUG` level
  and proceeds with the fault.
- Runtime sysctl (debug builds, init namespace only):
  `umka/debug/fault_inject/<driver_name>/<fault_type>` — write `1` to trigger once,
  write `N` to trigger on the N-th matching code path, write `0` to cancel.
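The boot-parameter value follows a simple `<target>[,<fault>]` grammar. A minimal parsing sketch, illustrative only and not the actual kernel cmdline parser:

```rust
// Illustrative parser for `umka.fault_inject=<target>[,<fault>]` values.
// Rejects empty targets or fault types rather than guessing defaults.

/// Split a `<target>[,<fault>]` value into (target, optional fault type).
fn parse_fault_spec(value: &str) -> Option<(&str, Option<&str>)> {
    if value.is_empty() {
        return None;
    }
    match value.split_once(',') {
        Some((target, fault)) if !target.is_empty() && !fault.is_empty() => {
            Some((target, Some(fault)))
        }
        Some(_) => None, // reject an empty target or fault component
        None => Some((value, None)),
    }
}

fn main() {
    assert_eq!(
        parse_fault_spec("nvme0,domain_violation"),
        Some(("nvme0", Some("domain_violation")))
    );
    assert_eq!(parse_fault_spec("nvme0"), Some(("nvme0", None)));
    assert_eq!(parse_fault_spec(""), None);
    assert_eq!(parse_fault_spec("nvme0,"), None);
}
```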
24.3.4.1.2 Fault injection points in driver code¶
Driver code marks injectable points with the umka_fault_inject! macro (compiled
out in release builds):
/// Injects fault `fault_type` at this callsite if fault injection is active for
/// this driver and fault type. No-op in release builds.
///
/// In debug builds: if umka.fault_inject matches this driver + fault_type,
/// executes the fault action (e.g., corrupts a pointer, calls panic!, returns Err).
#[cfg(umka_fault_inject)]
macro_rules! umka_fault_inject {
($driver:expr, $fault_type:expr, $action:expr) => {
if crate::fault_inject::should_inject($driver, $fault_type) {
$action
}
};
}
#[cfg(not(umka_fault_inject))]
macro_rules! umka_fault_inject {
($driver:expr, $fault_type:expr, $action:expr) => {};
}
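One possible shape for the `should_inject` registry the macro calls into, sketched as host-side Rust under the sysctl semantics above (trigger on the N-th matching code path); the names and data structure are illustrative, not the in-kernel implementation:

```rust
// Host-side sketch: a global registry of armed (driver, fault_type) pairs
// with a countdown. `should_inject` fires exactly once, on the N-th hit,
// then disarms itself.

use std::collections::HashMap;
use std::sync::Mutex;

static ARMS: Mutex<Option<HashMap<(String, String), u32>>> = Mutex::new(None);

/// Arm a fault: fire on the n-th matching call (n = 1 means "next call").
fn arm_fault(driver: &str, fault: &str, n: u32) {
    let mut g = ARMS.lock().unwrap();
    g.get_or_insert_with(HashMap::new)
        .insert((driver.to_string(), fault.to_string()), n);
}

/// Called at each injection point; returns true exactly when the
/// countdown for this (driver, fault) pair reaches zero.
fn should_inject(driver: &str, fault: &str) -> bool {
    let mut g = ARMS.lock().unwrap();
    let Some(map) = g.as_mut() else { return false };
    let key = (driver.to_string(), fault.to_string());
    match map.get_mut(&key) {
        Some(n) => {
            *n -= 1;
            if *n == 0 {
                map.remove(&key); // disarm after firing
                true
            } else {
                false
            }
        }
        None => false,
    }
}

fn main() {
    arm_fault("nvme0", "domain_violation", 2); // fire on the 2nd hit
    assert!(!should_inject("nvme0", "domain_violation")); // 1st hit: no
    assert!(should_inject("nvme0", "domain_violation"));  // 2nd hit: yes
    assert!(!should_inject("nvme0", "domain_violation")); // disarmed after firing
    assert!(!should_inject("eth0", "domain_violation"));  // different driver
}
```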
24.3.4.1.3 Fault scenarios tested¶
- Domain isolation violation: Inject `umka_fault_inject!(driver, FaultType::DomainWrite,
  /* write to wrong PKEY */)` — verifies MPK/DACR/POE catches the fault and reloads
  the driver without kernel panic.
- Null pointer dereference: Inject null dereference in Tier 1 driver handler —
  verifies fault containment and recovery within 50–150 ms.
- Infinite loop: Inject `loop {}` in a driver kthread — verifies the per-driver
  watchdog timer (`DRIVER_WATCHDOG_TIMEOUT_MS = 5000`) fires and kills the driver.
- DMA to wrong address: Inject out-of-bounds DMA descriptor — verifies IOMMU fault is
  caught, driver is torn down, no kernel memory corruption.
- Tier 2 process crash: Inject `abort()` in Tier 2 driver process — verifies umka-core
  supervisor restarts within 10 ms.
- Repeated crashes: Inject crash on every restart — verifies auto-demotion policy
  engages after `DRIVER_MAX_RESTART_ATTEMPTS = 3`.
- I/O in flight during crash: Inject crash mid-I/O — verifies all in-flight requests
  complete with `-EIO` and no request objects leak.
Each test verifies: (1) the system does not panic, (2) the driver recovers within the target time, (3) applications see errors but can retry, and (4) no memory is leaked.
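The repeated-crash scenario exercises an auto-demotion counter. A minimal sketch of that policy, assuming hypothetical supervisor type names (only `DRIVER_MAX_RESTART_ATTEMPTS = 3` is from the spec):

```rust
// Illustrative auto-demotion policy: after DRIVER_MAX_RESTART_ATTEMPTS
// consecutive crashes the supervisor stops restarting in place and demotes
// the driver. Names are hypothetical, not the actual supervisor API.

const DRIVER_MAX_RESTART_ATTEMPTS: u32 = 3;

#[derive(Debug, PartialEq)]
enum SupervisorAction {
    Restart, // reload in the same tier
    Demote,  // stop restarting; move to a more isolated placement
}

struct DriverRecord {
    consecutive_crashes: u32,
}

impl DriverRecord {
    fn new() -> Self {
        Self { consecutive_crashes: 0 }
    }

    /// Called by the supervisor on each crash report.
    fn on_crash(&mut self) -> SupervisorAction {
        self.consecutive_crashes += 1;
        if self.consecutive_crashes >= DRIVER_MAX_RESTART_ATTEMPTS {
            SupervisorAction::Demote
        } else {
            SupervisorAction::Restart
        }
    }

    /// A healthy interval resets the counter, so sporadic crashes far
    /// apart do not accumulate toward demotion.
    fn on_healthy(&mut self) {
        self.consecutive_crashes = 0;
    }
}

fn main() {
    let mut d = DriverRecord::new();
    assert_eq!(d.on_crash(), SupervisorAction::Restart); // 1st crash
    assert_eq!(d.on_crash(), SupervisorAction::Restart); // 2nd crash
    assert_eq!(d.on_crash(), SupervisorAction::Demote);  // 3rd: demote
    d.on_healthy();
    assert_eq!(d.on_crash(), SupervisorAction::Restart); // counter was reset
}
```

Whether a healthy interval resets the counter is an assumption of this sketch; the fault injection suite pins down the actual policy.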
24.3.5 CI Pipeline¶
Every commit triggers:
1. cargo build for all 8 architectures (x86_64, aarch64, armv7, riscv64, ppc32, ppc64le, s390x, loongarch64)
2. cargo test (host-side unit tests)
3. QEMU boot test per architecture (boot + shutdown)
4. kabi-compat-check (no ABI breaks)
5. cargo clippy (lint pass)
6. cargo fmt --check (formatting)
Every merge to main additionally triggers:
7. LTP syscall conformance suite
8. Docker container boot test
9. Performance benchmark suite (vs stored Linux baseline)
10. Crash recovery fault injection suite
24.4 Formal Verification Readiness¶
24.4.1 The Opportunity¶
Formal verification of kernel code crossed the practical threshold:
2009: seL4 — 200,000 lines of proof for 10,000 lines of C. Heroic effort.
2018: RustBelt — Formal soundness proof for Rust's ownership model.
2022-2025: Verus (Carnegie Mellon University, VMware Research, Microsoft Research,
ETH Zurich, and others) — Automated verification for Rust.
Write Rust code + specifications → tool PROVES correctness.
Not testing. Not fuzzing. Mathematical machine-checked proof.
Verus can verify Rust code of realistic complexity: concurrent data structures, state machines, protocols, invariant maintenance. UmkaOS is written in Rust. The verification infrastructure exists.
24.4.2 What To Verify¶
Not everything needs verification. Focus on security-critical invariants and concurrency-sensitive code where bugs have catastrophic consequences.
Priority 1 — Non-Replaceable Core (highest verification priority):
These components include both non-replaceable data structures and their replaceable policy dispatch layers (Section 13.18). A bug in any non-replaceable component requires a full reboot to fix. Policy dispatch is verified to ensure correct routing and monotonic security — a swap must never loosen permissions. All must be verified before Phase 2 exit:
| Component | Invariant to Prove | Section |
|---|---|---|
| Physical memory allocator (data) | No page allocated twice. No double-free. Buddy merge preserves free-list consistency. PageArray vmemmap mapping correct. PcpPagePool never loses pages. | Section 4.2 |
| Physical memory allocator (policy) | PhysAllocPolicy dispatch reaches intended function. Policy replacement preserves free-list invariants (no pages lost during swap). |
Section 4.2 |
| Page reclaim (data) | No page on two generation lists simultaneously. Shadow entries correctly encode eviction generation. Generation counter monotonicity. Per-CPU drain buffers never lose pages. | Section 4.4 |
| Page reclaim (policy) | PageReclaimPolicy dispatch correct. Policy replacement does not lose pages from LRU lists or corrupt generation state. |
Section 4.4 |
| Page table management (hardware ops) | No page mapped twice without sharing. Freed pages never accessible via stale PTE. PTE encoding/decoding correct per architecture. | Section 4.8 |
| Page table management (policy) | VmmPolicy dispatch correct. Policy replacement does not leave stale TLB entries. |
Section 4.8 |
| Capability system (data) | Capabilities cannot be forged. cap_lookup() returns correct entry. Generation check is correct. Permission AND is correct. CapOperationGuard never loses decrements. |
Section 9.1 |
| Capability system (policy) | CapPolicy dispatch reaches intended function. MonotonicVerifier correctly rejects policies that loosen security. Policy replacement preserves CapTable invariants (no capabilities lost during swap). |
Section 9.1 |
| Evolution primitive (Nucleus) | INV-1 (atomic visibility), INV-6 (PendingOpsPerCpu transfer integrity). ~2-3 KB of straight-line code within the ~18-20 KB Nucleus — the most tractable Verus target. See Section 13.18. | Section 13.18 |
| LMS boot verifier (Nucleus) | lms_verify_shake256() correctly implements LMS verification per NIST SP 800-208. Winternitz chain completion is correct. Merkle path walk reaches the root. SHAKE256 domain separation is correct (padding byte 0x1F). ~1-3 KB code, reuses Keccak-f[1600]. | Section 2.21 |
| Evolution orchestration (Evolvable) | INV-2 through INV-5, INV-7, LIV-1, LIV-2. These are enforced by replaceable orchestration — bugs are live-fixable. Verified for defense-in-depth but NOT a deployment gate. | Section 13.18 |
| Data format evolution | INV-DF1 through INV-DF5. No partial reads during migration, no lost writes, epoch monotonicity, wire protocol backward compat, extension array isolation. | Section 13.18 |
| KABI vtable dispatch | Vtable calls never escape the driver's isolation domain. Version checks are correct. | Section 12.1 |
Priority 2 — Security and Correctness Critical (verified before Phase 3 exit):
These components are live-replaceable but handle security-sensitive or concurrency-critical operations where bugs have catastrophic consequences (data loss, privilege escalation, deadlock):
| Component | Invariant to Prove | Section |
|---|---|---|
| IPC ring buffer | Producer-consumer protocol never loses messages, never delivers duplicates, never deadlocks. | Section 11.7 |
| CBS bandwidth server | Bandwidth guarantees are met. No starvation. | Section 7.6 |
| DSM coherence protocol | Multiple-reader / single-writer consistency maintained. No lost writes. | Section 6.2 |
| Distributed capabilities | Signature verification is correct. Revocation propagation is complete. | Section 5.7 |
| Power budget enforcement | Budgets are never exceeded by more than one tick interval. | Section 7.7 |
24.4.3 Design for Verifiability¶
Verification readiness is a design property, not a tool. Code must be structured so that specifications can be written and verified:
// Example: capability lookup with verification-ready specification.
// Verus-style annotations (compile-time only, erased from binary).

/// Lookup a capability by handle.
///
/// SPECIFICATION (verified by Verus):
///   requires: handle is valid for calling process
///   ensures:  returned capability matches the one in the capability table
///   ensures:  returned capability's generation <= object's current generation
///   ensures:  returned capability's permissions are a subset of the
///             delegator's permissions (no escalation)
pub fn cap_lookup(
    table: &CapabilityTable,
    process: ProcessId,
    handle: CapHandle,
) -> Result<Capability, CapError> {
    // Implementation elided here; it must satisfy the specification above.
    // Verus proves conformance at compile time. No runtime overhead.
    todo!()
}
Design rules for verifiability:
- Explicit state: No hidden mutable global state. All state is in named structures with explicit ownership. (Rust already enforces this.)
- Small critical sections: Break complex operations into small, individually verifiable steps. Each step has a pre-condition and post-condition.
- Interface contracts: Every public function in security-critical modules has a documented specification (pre/post conditions, invariants). Verus verifies these.
- Algebraic data types for states: Use enums with exhaustive matching instead of integer flags. The type checker ensures all states are handled.
- Monotonic counters: Generation counters, version numbers — use types that enforce monotonicity (can only increase, never decrease).
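The monotonic-counter rule can be enforced directly by the type system. A minimal sketch, assuming nothing beyond standard Rust (the Generation type is illustrative, not the kernel's actual type):

```rust
/// A generation counter that can only move forward. The field is private,
/// so the only way to change it is through bump() — the type system
/// guarantees monotonicity without any runtime check.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Generation(u64);

impl Generation {
    pub const fn new() -> Self {
        Generation(0)
    }

    /// The sole mutating operation: advance by one and return the new
    /// value. There is no setter, so a Generation can never decrease.
    pub fn bump(&mut self) -> Generation {
        self.0 += 1;
        *self
    }

    pub fn value(&self) -> u64 {
        self.0
    }
}

fn main() {
    let mut generation = Generation::new();
    let g1 = generation.bump();
    let g2 = generation.bump();
    assert!(g2 > g1); // monotonicity holds by construction
}
```

Because the invariant is structural, a Verus proof over code using such a type only needs to reason about comparisons, not about every site that might mutate the counter.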
24.4.4 Verification Tooling¶
Primary tool: Verus (Carnegie Mellon University, VMware Research, Microsoft Research, and others). Automated verification for Rust. Specification-driven proofs of functional correctness and memory safety properties.
Alternative tools (fallback if Verus hits scale limits):
- Kani (Amazon): Bounded model checking for Rust. Explores all execution paths up to a configurable bound. Excellent for concurrent code and finding edge cases. Complementary to Verus — Kani finds bugs, Verus proves absence of bugs.
- Prusti (ETH Zurich): Automated verification for Rust. Different proof strategy than Verus (separation logic vs SMT). Useful as a cross-check.
CI integration strategy:
- Every commit: debug_assert! invariant checks + lightweight type-level assertions.
Compile-time only. Seconds. Catches regressions in verified invariants.
- Every PR: Kani bounded model checks on critical modules (~5-10 min).
Catches concurrency bugs and edge cases.
- Nightly: Full Verus specification proofs (~30-60 min for verified modules).
Mathematical proof of correctness. Any proof failure blocks the next release.
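The per-commit debug_assert! layer, as a minimal sketch (free_page and the Vec-based free-list representation are illustrative stand-ins for the real allocator structures):

```rust
/// Illustrative per-commit invariant check: a free-list insert asserts
/// "no double-free" in debug builds. debug_assert! is compiled out of
/// release binaries, so the check costs nothing in production — it only
/// guards CI builds against regressions in verified invariants.
fn free_page(free_list: &mut Vec<u64>, pfn: u64) {
    debug_assert!(
        !free_list.contains(&pfn),
        "invariant violated: page {pfn} is already on the free list"
    );
    free_list.push(pfn);
}

fn main() {
    let mut free_list = Vec::new();
    free_page(&mut free_list, 42);
    free_page(&mut free_list, 43);
    assert_eq!(free_list.len(), 2);
}
```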
Scope of verification — what is OUT of scope: Cross-component interactions (e.g., DSM coherence protocol interacting with hardware isolation boundaries simultaneously) are beyond current tool capabilities. Individual components are verified against their specifications; the composition is validated by integration testing and fuzzing. This is an honest limitation — complete whole-system verification remains a research problem.
Unsafe Code Verification Strategy:
Rust's unsafe blocks are the primary verification target — they are where memory safety invariants must be manually upheld. The strategy:
- Verus for ownership and invariant proofs: verify that unsafe code upholds the safety contract documented in its // SAFETY: comment. Verus can reason about pointer validity, aliasing, and lifetime guarantees.
- Kani for model-checking unsafe code paths: bounded model checking explores all possible inputs to unsafe functions up to a configurable bound, catching edge cases that specifications might miss.
- Wrap unsafe in safe abstractions: every unsafe block is encapsulated in a safe function with a verified specification. Callers never touch unsafe directly. The safe wrapper's specification becomes the verification boundary.
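The safe-wrapper rule, as a minimal sketch (FixedBuffer is illustrative; real Tier 1 drivers wrap MMIO and DMA pointers the same way):

```rust
use core::marker::PhantomData;

/// Illustrative safe wrapper around an unsafe raw-pointer read. The safe
/// API enforces the bounds check; the SAFETY comment states the invariant
/// the wrapper upholds, and that comment is what Verus would verify.
pub struct FixedBuffer<'a> {
    data: *const u8,
    len: usize,
    _lt: PhantomData<&'a [u8]>, // ties the pointer to the source slice's lifetime
}

impl<'a> FixedBuffer<'a> {
    pub fn from_slice(s: &'a [u8]) -> FixedBuffer<'a> {
        FixedBuffer { data: s.as_ptr(), len: s.len(), _lt: PhantomData }
    }

    /// Safe accessor: callers can never read out of bounds and never
    /// touch unsafe directly. This function's contract is the
    /// verification boundary.
    pub fn get(&self, idx: usize) -> Option<u8> {
        if idx >= self.len {
            return None;
        }
        // SAFETY: idx < self.len, and `data` points to `len` readable
        // bytes for lifetime 'a (enforced by the PhantomData marker).
        Some(unsafe { *self.data.add(idx) })
    }
}

fn main() {
    let buf = [10u8, 20, 30];
    let fb = FixedBuffer::from_slice(&buf);
    assert_eq!(fb.get(1), Some(20));
    assert_eq!(fb.get(3), None); // out-of-bounds rejected by the safe wrapper
}
```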
Verification Complexity by Component:
Based on published Verus effort data and component characteristics:
| Priority | Component | Relative Complexity | Rationale |
|---|---|---|---|
| P1 | Capability system — data (Section 9.1) | Low | Small state machine: XArray lookup + integer compare + AND. Clear invariants. |
| P1 | Capability system — policy (Section 9.1) | Low | Dispatch table + MonotonicVerifier swap protocol; small code surface |
| P1 | KABI vtable dispatch (Section 12.1) | Low | Index lookup + bounds check, small code surface |
| P1 | Physical memory allocator — data (Section 4.2) | Medium | Buddy algorithm well-studied; main difficulty is proving no double-alloc |
| P1 | Physical memory allocator — policy (Section 4.2) | Low | Dispatch table + swap protocol; small code surface |
| P1 | Page reclaim — data (Section 4.4) | Medium | Generational LRU: prove no page on two lists, generation monotonicity, shadow entry correctness |
| P1 | Page reclaim — policy (Section 4.4) | Low | Dispatch table + swap protocol; small code surface |
| P1 | Page table management — hardware ops (Section 4.8) | High | Many edge cases, arch-specific |
| P1 | Page table management — policy (Section 4.8) | Low | Dispatch + TLB flush correctness during swap |
| P1 | Evolution primitive — Nucleus (Section 13.18) | Low | ~2-3 KB straight-line code (within the ~18-20 KB total Nucleus); INV-1 (IPI + atomic swap) and INV-6 (ring transfer). No loops beyond bounded page remap. The remaining ~15-17 KB of Nucleus (data structures, page table ops, capability lookup, KABI dispatch) is also P1 but with higher complexity. |
| P1 | LMS boot verifier — Nucleus (Section 2.21) | Low | ~1-3 KB; Winternitz chains (bounded loop W×p iterations) + Merkle path (bounded loop H iterations) + SHAKE256 (reuses verified Keccak). No allocation, no state. |
| P1.5 | Evolution orchestration — Evolvable (Section 13.18) | Medium | INV-2-5, INV-7, LIV-1-2. Live-fixable — verification is defense-in-depth, not a deployment gate. |
| P2 | IPC ring buffer (Section 11.7) | Medium | Single producer-consumer per ring, bounded. Cross-domain shared memory adds concerns: torn reads on non-cache-line-aligned entries, memory ordering across 8 architectures, overflow detection with potentially non-coherent Tier 2 memory. io_uring-style design (cache-line-aligned, power-of-two, acquire/release) is well-studied but the cross-domain privilege boundary adds verification surface beyond a simple in-process SPSC ring. |
| P2 | CBS bandwidth server (Section 7.6) | Medium | Well-studied algorithm |
| P2 | DSM coherence (Section 6.2) | High | Distributed protocol, concurrent access |
Recommended verification order (within Priority 1): capability data → capability policy → KABI vtable dispatch → evolution primitive → LMS boot verifier → memory allocator data → memory allocator policy → page reclaim data → page reclaim policy → page table hardware ops → page table policy. The evolution primitive and LMS verifier are both low-complexity straight-line code and should be verified early. Evolution orchestration (P1.5) is verified for defense-in-depth after all P1 targets but before P2 — it is live-fixable, so verification is not a deployment gate. All P1 components must be verified before Phase 2 exit — verification is the sole defense against defects in non-replaceable code, and the sole guarantee that policy dispatch never misroutes or loosens security.
24.4.5 Performance Impact¶
Zero. Verification happens entirely at compile time: Verus specifications are erased from the binary, and the verified code is identical to the unverified code at runtime.
The only cost is developer time writing specifications. But this pays for itself by eliminating bugs that would otherwise require debugging, CVE patches, and emergency releases.
24.5 Technical Risks¶
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| MPK provides only 16 domains | Medium | Certain | Group related drivers by fault domain (all block share domain, all net share domain). 12 driver-available domains on x86 (4 keys reserved for infrastructure: PKEY 0=core, 1=shared descriptors, 14=shared DMA, 15=guard; per Section 11.2). AArch64 POE has 7 usable indices (1-7), of which 3 are available for Tier 1 driver domains (indices 3-5; indices 1-2 reserved for umka-core, 6 for userspace, 7 for temporary/debug; per Section 24.5). See "MPK Domain Grouping" below for degraded isolation analysis. |
| eBPF verifier complexity | High | High | Verifier subsystem is ~30K SLOC in Linux (counting kernel/bpf/verifier.c at ~23K SLOC as of v6.12, plus btf.c, log.c, range-tracking helpers, and test infrastructure — the ~30K figure covers the full verification subsystem, not verifier.c alone). Start with subset of program types, expand incrementally. UmkaOS implements a clean-room Rust verifier and JIT (GPL avoidance); the eBPF bytecode format and helper API are compatible with Linux but the implementation is original. |
| KVM deeply integrated with Linux MM | High | High | Design memory manager with KVM hooks from the start (Phase 1 architecture). Dedicate a team to KVM from Phase 4. |
| Driver coverage gap blocks adoption | Critical | High | Cloud-first strategy (VirtIO covers 100% of VMs). Prioritize top-20 drivers. Agentic rewrite pipeline for open-source drivers. |
| Subtle syscall compatibility bugs | High | High | LTP conformance suite, real-world application testing, syzkaller fuzzing. Build a comprehensive test matrix of applications. |
| Spectre/Meltdown mitigations + domain isolation | Medium | Medium | KPTI not needed for Tier 1 (same Ring 0). Tier 2 needs standard KPTI. Retpoline/IBRS for indirect branches. Test on affected hardware. |
| IOMMU not available on all hardware | Medium | Medium | IOMMU required for Tier 1 DMA fencing. Systems without IOMMU fall back to trusted mode (reduced isolation, logged warning). |
| ARM64 lacks direct MPK equivalent | Medium | Certain | Use POE (FEAT_S1POE, 7 usable indices of which 3 are for Tier 1 drivers, optional from ARMv8.9+) or page-table fallback. Adaptive isolation policy (Section 11.2) allows per-driver tier pinning or promotion to Tier 0 on pre-POE hardware. |
| No fast isolation on pre-2020 x86 | Medium | Certain | Adaptive isolation policy: isolation=performance promotes Tier 1 to Tier 0 (Linux-equivalent speed, no memory isolation). IOMMU DMA fencing still active. |
| Rust ecosystem maturity for OS dev | Low | Medium | Established patterns from Redox, Linux rust-for-linux, Hubris. Use #![no_std] and custom allocator. Unsafe blocks at hardware boundaries are expected and audited. |
| Performance target too ambitious | Medium | Medium | 5% target is for macro benchmarks. Micro-benchmarks may show higher overhead on specific paths. Batch amortization and careful profiling. |
| Community adoption / contributor pipeline | Medium | Medium | Clean SDK, good documentation, lower barrier than Linux driver development. Cloud-first focus builds credibility before desktop push. |
| Regulatory / certification barriers | Low | Low | Work with distributions early. Open-source everything except vendor proprietary blobs. |
| LZ4/Zstd kernel implementation correctness | Medium | Medium | Fuzzing, comparison with reference implementation. Use no_std BSD-licensed implementations with comprehensive test vectors. |
| Object namespace overhead on hot paths | Low | Low | Lazy registration for high-frequency objects (fds, sockets, VMAs). Eagerly registered objects only (~2000 baseline = ~384 KB). |
| Shared-domain silent corruption | Medium | Low | Inherent MPK/POE/DACR limitation with finite domains. Rust memory safety is primary defense within shared domains. Operators can promote critical drivers to solo domains or Tier 2. See Section 24.5 below. |
| CBS scheduling fairness under edge cases | Medium | Medium | Formal analysis against CBS paper (Abeni 1998), stress testing with adversarial workloads, comparison with Linux cpu.max behavior. |
24.5.1 Risks from Advanced Features (Chapters 16-18)¶
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| TEE hardware fragmentation (SEV-SNP vs TDX vs CCA) | High | Certain | Abstract behind ConfidentialContext trait (Section 9.7). Implement one backend at a time. SEV-SNP first (largest cloud deployment), TDX second, CCA third. |
| PQC algorithm instability (NIST may revise) | Medium | Medium | Algorithm-agile abstraction (Section 9.6). Algorithms behind enum dispatch; swapping ML-KEM for a successor is a library update, not a kernel redesign. |
| PQC signature sizes impact IPC latency | Low | Certain | ML-DSA-65 signatures are 3,309 bytes (per NIST FIPS 204, Table 2). Cold-path only (capability minting, not every IPC call). SignatureData::Heap variant avoids ring buffer bloat (Section 9.6). |
| RT + domain isolation interaction causes priority inversion | High | Medium | Domain switch (WRPKRU on x86) is ~23 cycles (no lock needed). Domain switching is O(1) — no contention path. If priority inheritance needed for domain-shared buffers, use PI futexes (Section 8.4). |
| Formal verification scope creep | Medium | Medium | Verify only security-critical paths: capability table, IPC ring, page table mapping (Section 24.4). Accept that ~80% of kernel code is tested, not verified. |
| DPU vendor lock-in (proprietary firmware) | Medium | High | DPUs are Tier M peers using the standard peer protocol (Section 5.11). Vendors implement a firmware shim (~10-18K lines of C, excluding crypto primitives already in firmware; a reference implementation will be published with measured counts), not a full OS port. Host-side code is generic umka-peer-transport, not vendor-specific. |
| PMEM/CXL hardware not yet widely deployed | Low | High | Design is hardware-agnostic (Section 15.16). All PMEM code compiles out when hardware is absent. CXL 3.0 adoption expected 2025-2027; architecture ready, implementation deferred. |
| Unified compute model adds scheduling overhead | Medium | Low | Advisory overlay only — existing schedulers unchanged (Section 22.8). Topology queries are O(1) reads from cached ComputeCapacityProfile. No hot-path cost. |
| Live kernel evolution causes state corruption | Critical | Low | Post-swap watchdog with 5-second timer (Section 13.18). On crash, the system attempts to re-extract state from the failing component; if extraction fails, the system panics rather than reverting to stale state, preventing silent data corruption. State serialization uses versioned HMAC integrity tags. |
| Intent optimizer makes poor decisions | Low | Medium | Intent system is purely advisory (Section 7.10). Clamping prevents invalid resource configs. Worst case: system falls back to static defaults (no intent optimization). |
24.5.2 Risk Response Priority¶
- Driver coverage (Critical): Addressed by cloud-first strategy + agentic rewrite
- Syscall compatibility (High): Addressed by LTP + application test matrix
- eBPF complexity (High): Addressed by incremental implementation
- KVM integration (High): Addressed by early architectural planning
- TEE fragmentation (High): Addressed by trait-based abstraction
- RT + domain isolation interaction (High): Addressed by O(1) domain switching design
- Domain limit (Medium): Addressed by driver grouping policy
- Live evolution safety (Critical but low likelihood): Addressed by watchdog + state HMAC integrity checks
24.5.3 Domain Grouping: Degraded Isolation Analysis¶
When more than 12 Tier 1 drivers are loaded simultaneously, some drivers must share an isolation domain (protection key). This is an inherent limitation of Intel's 16-key PKU design (16 keys minus PKEY 0 for umka-core, minus PKEY 1 for shared descriptors, minus PKEY 14 for shared DMA, minus PKEY 15 as guard = 12 usable). Grouping has concrete consequences for fault isolation:
What grouping preserves:
- IOMMU isolation: each driver retains its own IOMMU domain regardless of domain grouping. DMA fencing is unaffected — a crashing NVMe driver cannot DMA into a NIC driver's buffers, even if they share an isolation domain.
- Capability isolation: each driver has its own capability set. Sharing an isolation domain does not grant access to another driver's capabilities.
- Crash detection: fault injection and page-fault trapping still identify the crashing driver (via instruction pointer, not isolation domain).
What grouping degrades:
- Memory read/write isolation between grouped drivers. If drivers A and B share isolation domain 5, a buffer overrun in A can corrupt B's data structures. The crash is still contained (it cannot escape to umka-core or other domains), but it may take down both A and B.
- The blast radius of a crash expands from one driver to one domain group. In practice, this means a faulty NVMe driver could take down the AHCI driver if both are in the "block" group.
Grouping policy — drivers are grouped by fault domain affinity (drivers that interact heavily and would likely cascade-fail anyway):
| Isolation Domain | Group | Typical Members |
|---|---|---|
| 2 | Block storage | NVMe, AHCI/SATA, virtio-blk, iSCSI, NVMe-oF |
| 3 | Network (data) | Intel NIC, Mellanox NIC, virtio-net |
| 4 | Network (stack) | TCP/IP, UDP, RDMA core |
| 5 | Filesystem | ext4, XFS, btrfs |
| 6 | Display | DRM/KMS, GPU compute |
| 7 | KVM | VMX/SVM, vhost-net, vhost-scsi |
| 8 | USB | xHCI, USB hub, USB mass storage |
| 9-13 | Individual | High-value drivers that warrant solo domains |
AArch64 domain budget: POE provides 7 usable indices (1-7; index 0 is reserved for umka-core as the default PTE value). Of the 7 usable indices, 4 are reserved for infrastructure (index 1 for shared read-only, index 2 for shared DMA, index 6 for userspace, index 7 for temporary/debug), leaving only 3 indices for Tier 1 driver domains (indices 3-5; vs. 12 on x86). AArch64 therefore requires much more aggressive grouping.
Note for reviewers: ARM FEAT_S1POE uses a 3-bit POIndex field in page table entries, providing 8 index values (0-7). This is a hardware constraint, not a design choice. Index 0 is the default PTE value (per ARM architecture), leaving 7 configurable indices. Do not suggest "use 4 bits for 16 indices" — the POIndex field width is fixed by the ISA.
The grouping table above is designed for x86's 12-domain budget. On AArch64, the kernel applies a reduced grouping scheme:
- Domain 0: umka-core (default PTE value)
- Domain 1: Shared read-only
- Domain 2: Shared DMA buffer pool
- Domain 3: VFS + block I/O (merged — these are tightly coupled)
- Domain 4: Network stack
- Domain 5: All remaining Tier 1 drivers (single shared domain)
- Domain 6: Userspace (EL0 default)
- Domain 7: Temporary / debug
This reduces isolation granularity for Tier 1 drivers on AArch64 (all share one domain) but preserves the critical umka-core/driver/userspace boundaries. The architecture-specific grouping is selected at boot based on arch::current::isolation::domain_count().
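The boot-time selection might look like the following sketch. Everything here except the domain_count() query named in the text is an assumption for illustration (the GroupingScheme enum, the thresholds):

```rust
/// Illustrative grouping-scheme selection based on how many driver
/// isolation domains the architecture reports at boot.
#[derive(Debug, PartialEq)]
enum GroupingScheme {
    /// x86 MPK (12 driver domains), ARMv7 DACR, PPC32: per-group isolation.
    Full,
    /// AArch64 POE (3 driver domains): merge block I/O + VFS, network,
    /// and all remaining Tier 1 drivers into shared domains.
    Reduced,
    /// No hardware keys at all: page-table fallback or Tier 0 promotion.
    None,
}

/// `driver_domains` stands in for arch::current::isolation::domain_count()
/// minus the infrastructure reservations; the 1..=4 threshold is a guess.
fn select_grouping(driver_domains: usize) -> GroupingScheme {
    match driver_domains {
        0 => GroupingScheme::None,
        1..=4 => GroupingScheme::Reduced,
        _ => GroupingScheme::Full,
    }
}

fn main() {
    assert_eq!(select_grouping(12), GroupingScheme::Full);   // x86 MPK budget
    assert_eq!(select_grouping(3), GroupingScheme::Reduced); // AArch64 POE budget
    assert_eq!(select_grouping(0), GroupingScheme::None);
}
```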
Typical server scenario — a cloud server runs NVMe + NIC + TCP + KVM + virtio = 5 drivers. On x86 (12 driver domains), these fit in 5 domains with no grouping needed; the 12-domain limit only triggers on heavily-configured systems (desktop with GPU + audio + USB + Bluetooth + WiFi + NVMe + SATA + NIC + ...). On AArch64 with POE (3 driver domains), even this typical 5-driver configuration requires grouping -- the reduced scheme above merges block I/O, networking, and remaining drivers into 3 shared domains. Architectures with more domains (ARMv7 DACR: 12, PPC32 segments: 12) behave more like x86.
Monitoring — when grouping occurs, UmkaOS logs a warning:
umka: isolation domain 1 shared by nvme, ahci (reduced isolation: crash in either affects both)
This allows administrators to make informed decisions about which drivers to load as Tier 2 (full process isolation, unlimited domains) if they require stronger isolation than domain grouping provides.
24.5.3.1 Domain Grouping Security Properties¶
Drivers sharing an isolation domain have the same memory-access fault containment as monolithic kernel drivers. A buffer overrun in one driver can silently corrupt any other driver in the same domain without triggering a hardware exception. The hardware isolation boundary exists at the domain edge, not between drivers within a domain.
| Property | Solo Domain (1 driver per domain) | Shared Domain (N drivers per domain) |
|---|---|---|
| Hardware fault containment | Full — any cross-domain access triggers immediate hardware exception | None within domain — only the domain boundary is hardware-enforced |
| Crash detection latency | Immediate (first errant memory access faults) | Delayed — corruption may produce wrong results before any detectable fault |
| Blast radius | Single driver | All drivers in the domain group |
| Primary defense | Hardware isolation + Rust memory safety | Rust memory safety only (hardware isolation protects the domain boundary, not interior) |
| IOMMU DMA fencing | Per-driver (unaffected by domain grouping) | Per-driver (unaffected by domain grouping) |
| Capability isolation | Per-driver (unaffected by domain grouping) | Per-driver (unaffected by domain grouping) |
Mitigation: Rust memory safety is the primary defense within shared domains. Safe
Rust prevents buffer overruns, use-after-free, and data races at compile time — the
class of bugs that would exploit co-tenancy. Hardware isolation is defense-in-depth for
the domain boundary (protecting UmkaOS Core and other domain groups), not within it.
unsafe blocks in Tier 1 drivers are the residual attack surface for intra-domain
corruption and must be minimized and audited.
For crash recovery implications of shared-domain corruption, see Section 11.9.
24.5.3.2 Shared-Domain Silent Corruption¶
Risk: When multiple Tier 1 drivers share an isolation domain (normal on AArch64 POE with 3 driver domains, and on x86 when >12 Tier 1 drivers are loaded), a bug in one driver can silently corrupt another driver's memory without triggering a hardware fault. The corrupted driver may produce wrong results (silent data corruption) before eventually crashing.
| Attribute | Value |
|---|---|
| Impact | Medium (contained to one domain group; Core and other domains unaffected) |
| Likelihood | Low (Rust memory safety prevents the dominant bug classes; residual risk from unsafe blocks) |
| Detection | Delayed — no hardware exception until corruption crosses a domain boundary or triggers an unrelated fault |
| Mitigation | (1) Rust memory safety eliminates buffer overruns, UAF, and data races in safe code. (2) Minimize unsafe in Tier 1 drivers. (3) Administrators can promote high-value drivers to solo domains or Tier 2. (4) Watchdog and ring buffer integrity checks (Section 11.9) provide software-level fault detection. |
This is an inherent limitation of the MPK/POE/DACR model with finite hardware domains. It is not a design flaw — it is a conscious tradeoff between isolation granularity and domain budget. The tradeoff is documented here so operators can make informed tier placement decisions.
24.6 Appendices¶
Reference material, comparison tables, and open questions.
24.7 Licensing Model: Open Kernel License Framework (OKLF) v1.3¶
UmkaOS uses the Open Kernel License Framework (OKLF) v1.3 (see OKLF-v1.3.md
for the full legal text). Key elements:
Base license: GPLv2-only with additional permissions (Sections 2-5 of OKLF). All kernel code — umka-core, umka-kernel, umka-sysapi, umka-net, umka-vfs, umka-block, umka-kvm, tools, and boot code — is GPLv2. This ensures:
- All kernel modifications must be open-sourced
- Proprietary forks are impossible
- Same legal framework the Linux ecosystem understands
Approved Linking License Registry (ALLR): A curated, append-only list of open-source
licenses approved for use with kernel code. Tiers 1-2 are GPL-compatible and may be used
in Tier 0 drivers (statically linked into the kernel) or Tier 1 drivers (domain-isolated,
communicating via KABI IPC — no linking occurs). Tier 3 licenses are GPL-incompatible and
may NOT be statically linked with the kernel; Tier 3 code runs as Tier 1
(domain-isolated, KABI IPC) or Tier 2 (process-isolated) drivers — never Tier 0 (static
linking creates a derivative work). KABI IPC provides the license boundary at both tiers:
no shared symbols, no function calls across the license boundary, one resolved symbol
(__kabi_driver_entry):
- Tier 1 (weak copyleft, GPL-compatible): MPL-2.0, LGPL-2.1, EPL-2.0 (with Secondary License designation; see note below)
- Tier 2 (permissive): MIT, BSD-2, BSD-3, Apache-2.0, ISC, Zlib
- Tier 3 (incompatible — process isolation required, no linking): CDDL-1.0, CDDL-1.1, LGPL-3.0, EUPL-1.2 (see note below)
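As a sketch, the tier rules above could be encoded in the module loader roughly like this (all names are hypothetical; the real ALLR is a governed registry, not a hardcoded enum):

```rust
/// Illustrative mapping from ALLR license tier to the driver tiers a
/// module may occupy.
#[derive(Debug, Clone, Copy)]
enum AllrTier {
    WeakCopyleft, // ALLR Tier 1: MPL-2.0, LGPL-2.1, EPL-2.0 (w/ designation)
    Permissive,   // ALLR Tier 2: MIT, BSD, Apache-2.0, ISC, Zlib
    Incompatible, // ALLR Tier 3: CDDL, LGPL-3.0, EUPL-1.2
}

/// Driver tiers: 0 = statically linked, 1 = domain-isolated (KABI IPC),
/// 2 = process-isolated.
fn permitted_driver_tiers(tier: AllrTier) -> &'static [u8] {
    match tier {
        // GPL-compatible code may link statically or run isolated.
        AllrTier::WeakCopyleft | AllrTier::Permissive => &[0, 1, 2],
        // GPL-incompatible code crosses only the KABI IPC boundary;
        // never Tier 0, since static linking creates a derivative work.
        AllrTier::Incompatible => &[1, 2],
    }
}

fn main() {
    assert!(!permitted_driver_tiers(AllrTier::Incompatible).contains(&0));
    assert!(permitted_driver_tiers(AllrTier::Permissive).contains(&0));
}
```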
LGPL-3.0 incompatibility with GPLv2-only: LGPL-3.0 is incompatible with GPLv2-only code per the FSF compatibility matrix. LGPL-3.0 is defined as GPLv3 plus additional permissions (LGPL-3.0 Section 1.1: "This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License"). Since GPLv3 is incompatible with GPLv2-only (see GPLv3 exclusion note below), LGPL-3.0 inherits that incompatibility. LGPL-3.0 code must NOT be linked into the UmkaOS kernel. LGPL-3.0 code communicates with the kernel via KABI IPC only (Tier 3, process isolation required). Note that LGPL-2.1 IS compatible with GPLv2 and remains in Tier 1.
EUPL-1.2 classification (Tier 3): EUPL-1.2 is a strong copyleft license that the FSF classifies as GPL-incompatible. While EUPL Article 5 provides a compatibility list (including GPLv2, GPLv3, LGPL, AGPL, MPL-2.0, EPL-1.0, CeCILL) that allows EUPL-licensed code to be relicensed under those licenses when combined with code under those licenses, the FSF's position is that EUPL-1.2's copyleft is "comparable to the GPL's, and incompatible with it" by itself. UmkaOS places EUPL-1.2 in Tier 3 (no linking with kernel code) — same treatment as CDDL and GPLv3-only. EUPL-1.2 drivers may run at Tier 1 (domain-isolated, KABI IPC boundary) or Tier 2 (process-isolated), but never Tier 0 (static linking creates a derivative work). EUPL-1.2 code that has been explicitly relicensed to GPLv2 via Article 5 by its copyright holder may then be treated as GPLv2 code and used in Tier 0/1.
EPL-2.0 GPL compatibility: EPL-2.0 is GPL-compatible only when the distributor explicitly designates GPL as a Secondary License per EPL-2.0 Section 3.2. Without this designation, EPL-2.0 is GPL-incompatible. UmkaOS requires EPL-2.0 dependencies to carry the Secondary License designation; undesignated EPL-2.0 code is treated as Tier 3 (process isolation required, no linking with kernel code). ALLR Tier 1 inclusion applies only to EPL-2.0 code that explicitly carries the Secondary License designation for GPLv2. Enforcement: the KABI module loader checks for the Secondary License designation in the module's license metadata at load time. EPL-2.0 modules without the designation are rejected for Tier 0/1 loading and must run as Tier 2 process-isolated drivers. Additionally, EPL-2.0's patent grant (Section 2.2) requires contributors to grant a patent license for their contributions; UmkaOS cannot enforce this at a technical level, so EPL-2.0 code in Tier 1 carries an implicit assumption that upstream contributors have complied with Section 2.2. Code review should verify the Secondary License designation is present in the upstream project's license header, not just claimed in module metadata.
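The load-time gate described above might look like this sketch (function and type names are hypothetical; the real loader reads the module's license metadata):

```rust
#[derive(Debug, PartialEq)]
enum LoadDecision {
    AllowTier01,  // may load as Tier 0/1
    RequireTier2, // process isolation only
}

/// Illustrative EPL-2.0 gate: Tier 0/1 loading requires the Secondary
/// License designation (EPL-2.0 Section 3.2) in the module metadata.
fn gate_epl_module(license: &str, has_gpl2_secondary: bool) -> LoadDecision {
    match (license, has_gpl2_secondary) {
        ("EPL-2.0", true) => LoadDecision::AllowTier01,
        ("EPL-2.0", false) => LoadDecision::RequireTier2, // treated as Tier 3
        _ => LoadDecision::AllowTier01, // other licenses gated elsewhere
    }
}

fn main() {
    assert_eq!(gate_epl_module("EPL-2.0", false), LoadDecision::RequireTier2);
    assert_eq!(gate_epl_module("EPL-2.0", true), LoadDecision::AllowTier01);
}
```

As the text notes, this metadata check is necessary but not sufficient: review must confirm the designation exists in the upstream project's license header, not merely in the module's claimed metadata.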
GPLv3 exclusion from ALLR: GPLv3 is deliberately excluded from the ALLR. UmkaOS's kernel is licensed GPLv2-only (not "GPLv2 or later"). GPLv3 is incompatible with GPLv2-only code per the FSF: GPLv3's additional requirements (anti-tivoization in GPLv3 §6, patent retaliation in GPLv3 §11) constitute "further restrictions" that GPLv2 §7 prohibits. Code licensed GPLv3-only cannot be linked into a GPLv2-only kernel. Code licensed "GPLv2 or later" CAN be used (under its GPLv2 grant), but code licensed GPLv3-only cannot. Adding GPLv3 to the ALLR would create a false impression that GPLv3-only code may be linked with the kernel. If GPLv3-only code must be used, it must run as a Tier 1 or Tier 2 driver (same as CDDL), communicating via KABI IPC with no linking.
CDDL and GPL incompatibility: CDDL is GPL-incompatible per the FSF. CDDL-licensed code may run as Tier 1 or Tier 2 drivers — KABI provides the license boundary at both tiers. Despite CDDL appearing in the ALLR, no linking occurs between CDDL code and GPL kernel code. CDDL drivers communicate exclusively via KABI IPC (ring buffer message passing, vtable dispatch, one resolved symbol __kabi_driver_entry) — no shared symbols, no function calls across the license boundary. This provides more isolation than Linux's EXPORT_SYMBOL_GPL boundary (where modules ARE linked into the kernel). Statically-linked (Tier 0) CDDL code is NOT permitted, as static linking creates a derivative work. The KABI boundary ensures CDDL and GPL code never form a single "work" in the copyright sense.
New licenses added via governance process (60-day review, supermajority LGB vote). Licenses are never removed (append-only for legal certainty).
Proprietary kernel-space code explicitly prohibited (OKLF Section 4.2(c)): Any code that loads into kernel address space and accesses internal kernel symbols is a derivative work and must comply with GPLv2 or an ALLR-listed license. This removes Linux's 30-year "gray area" about proprietary kernel modules.
Proprietary user-space drivers explicitly permitted (OKLF Section 4.2(b)): Code interacting with the kernel exclusively through the stable userspace interface (syscalls, /proc, /sys, VFIO, UIO, FUSE, eBPF) is not a derivative work. This maps directly to our Tier 2 driver model — hardware vendors who cannot open-source their drivers may use user-space driver frameworks with full isolation.
Anti-tivoization stance (OKLF Section 12.1): OKLF encourages but does not mandate installation information disclosure. The OKLF adds only additional permissions to GPLv2 (permitted by the copyright holder's inherent right to grant additional permissions beyond the license terms, a well-established practice — see GCC Runtime Library Exception, Qt commercial exception, Classpath exception), never additional restrictions. Anti-tivoization protection is achieved indirectly: the KABI stability guarantee means users can always replace a Tier 1/2 driver binary without modifying the kernel, making hardware lockdown of individual drivers less effective.
Firmware exception (OKLF Section 4.3): Binary firmware that runs on separate
processors (GPU microcode, Wi-Fi firmware, SSD firmware) is outside the license scope.
Distributed separately in firmware/. Code running on the main CPU is NOT firmware.
Legal risk acknowledgment — OKLF is a novel license framework built on GPLv2. While it is designed to be GPLv2-compatible (the "additional permissions" model derives from the copyright holder's inherent right to grant permissions beyond the license terms, a practice well established by the GCC Runtime Library Exception, the Qt commercial exception, and the Classpath exception), it has not been tested in court and should not be relied upon without independent legal review. Key risks:
1. The ALLR mechanism may be viewed by some lawyers as an untested extension of the "linking exception" concept — FSF/SFLC review is recommended before v1.0 final.
2. OKLF provides weaker anti-tivoization protection than GPLv3 — an accepted tradeoff for GPLv2 compatibility, since OKLF cannot mandate installation information disclosure without violating GPLv2's "no further restrictions" clause.
3. Ecosystem adoption depends on corporate legal teams accepting OKLF as GPLv2-compatible — even if legally sound, unfamiliarity may slow adoption.
4. OKLF's scope (ALLR registry, driver tier classification, firmware exception) goes beyond typical additional permissions — a court could find that some provisions constitute "further restrictions" rather than "additional permissions," which GPLv2 Section 6 prohibits. Careful drafting mitigates this risk, but it cannot be eliminated without judicial precedent.
UmkaOS should seek early legal review from SFLC or equivalent, and provide a "plain GPLv2" fallback for organizations that cannot accept OKLF's additional terms.
KABI Driver SDK: The umka-driver-sdk crate (ABI type definitions, vtable layouts, ring buffer protocol, DMA types) is dual-licensed Apache-2.0 OR MIT. This is the interface contract — drivers of any ALLR-listed license can link against these types without friction.
How this maps to our driver tiers:
| Tier | Location | License requirement | OKLF section |
|---|---|---|---|
| Tier 0 (boot-critical) | In-kernel, static | GPLv2 or ALLR | 4.1 (in-tree) |
| Tier 1 (domain-isolated) | Ring 0, loaded | GPLv2 or ALLR | 4.2 (out-of-tree open-source) |
| Tier 2 (user-space) | Ring 3, process | Any (incl. proprietary) | 4.2(b) (userspace interface) |
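The tier/license matrix above, combined with the Tier 0 CDDL exclusion, can be expressed as a load-time gate. The license list below is a stand-in for the ALLR, and all names are illustrative:

```rust
// Hypothetical module-load license gate implementing the tier table.
// The hardcoded license strings stand in for the real ALLR registry.
#[derive(Clone, Copy)]
enum Tier {
    Tier0, // in-kernel, statically linked
    Tier1, // ring 0, domain-isolated, loaded via KABI
    Tier2, // ring 3, separate process
}

fn load_permitted(license: &str, tier: Tier) -> bool {
    let allr_open = matches!(
        license,
        "GPL-2.0" | "MIT" | "BSD-3-Clause" | "Apache-2.0"
    );
    match tier {
        Tier::Tier0 => allr_open, // static link: CDDL excluded (derivative work)
        Tier::Tier1 => allr_open || license == "CDDL-1.0", // KABI IPC boundary
        Tier::Tier2 => true, // userspace interface: any license, incl. proprietary
    }
}

fn main() {
    assert!(!load_permitted("CDDL-1.0", Tier::Tier0)); // static linking forbidden
    assert!(load_permitted("CDDL-1.0", Tier::Tier1));  // e.g. ZFS as Tier 1
    assert!(load_permitted("Proprietary", Tier::Tier2)); // e.g. vendor driver via VFIO
    println!("license gate ok");
}
```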
Three ABI stability tiers (extending OKLF Section 11.2):
| Interface | Stable? | Policy |
|---|---|---|
| Internal kernel APIs | No | May change between any two releases |
| KABI (driver ABI) | Yes | Versioned, append-only, binary-stable |
| Userspace ABI (syscalls) | Yes | Never broken without extended deprecation |
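The "append-only" rule in the KABI row above means a newer vtable only ever adds entries at the end, so a driver built against an older version still sees a valid prefix. A small sketch under assumed (not real) type names:

```rust
// Sketch of append-only KABI versioning: v2 extends v1 strictly by
// appending, and new entry points are gated on the advertised version.
#[repr(C)]
struct BlockOpsV2 {
    abi_version: u32,
    read: fn(lba: u64) -> i32,
    write: fn(lba: u64) -> i32,
    // --- appended in v2; v1 callers never look past `write` ---
    flush: fn() -> i32,
}

/// Callers check `abi_version` before touching appended entries.
fn flush_if_supported(ops: &BlockOpsV2) -> Option<i32> {
    if ops.abi_version >= 2 { Some((ops.flush)()) } else { None }
}

fn main() {
    let ops = BlockOpsV2 {
        abi_version: 2,
        read: |_| 0,
        write: |_| 0,
        flush: || 0,
    };
    assert_eq!(flush_if_supported(&ops), Some(0));
    println!("flush gated ok");
}
```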
| Concern | How addressed |
|---|---|
| Prevent proprietary kernel forks | GPLv2 copyleft |
| Allow ZFS (CDDL) | CDDL in ALLR Tier 3 — ZFS runs as a Tier 1 driver (KABI IPC provides license boundary, no linking occurs) |
| Allow Nvidia GPU (proprietary) | Tier 2 user-space driver via VFIO |
| Allow BSD/MIT drivers | BSD/MIT in ALLR — full kernel-space access |
| Force kernel improvements to be open | GPLv2 copyleft on all kernel crates |
| Module enforcement | Kernel refuses non-compliant modules by default |
| Clear legal boundaries | OKLF explicit text, not legal gray area |
24.8 Project Structure¶
Note: This appendix describes the target project structure at full implementation. The current codebase (see CLAUDE.md "Project Structure") contains the foundational crates (`umka-kernel`, `umka-core`, `umka-driver-sdk`, `umka-sysapi`, `umka-net`, `umka-vfs`, `umka-block`, `umka-kvm`). Additional crates listed below (e.g., `umka-accel`, `umka-cluster`, `drivers/`) will be added as their corresponding architecture sections are implemented.
umka-kernel/
Cargo.toml # Workspace root (all crates)
ARCHITECTURE.md # This document
umka-core/ # Microkernel core
Cargo.toml
src/
main.rs # Boot entry point (calls arch-specific init)
cap/ # Capability system
mod.rs # Capability types, tables, operations
revocation.rs # Generation-based revocation
mem/ # Memory management
phys.rs # Physical page allocator (buddy)
vmm.rs # Virtual memory manager (maple tree, VMAs)
page_cache.rs # Page cache (RCU radix tree)
slab.rs # Slab allocator for kernel objects
pcid.rs # PCID/ASID management
huge.rs # Huge page (THP + explicit) support
sched/ # Scheduler
mod.rs # Scheduler core, class dispatch
eevdf.rs # EEVDF fair scheduler
rt.rs # RT FIFO/RR scheduler
deadline.rs # Deadline (EDF/CBS) scheduler
balance.rs # NUMA-aware load balancer
ipc/ # IPC and isolation
mpk.rs # MPK domain management, WRPKRU helpers
ring.rs # Shared-memory ring buffers
tier2_ipc.rs # Cross-address-space IPC for Tier 2
arch/ # Architecture-specific Rust code
mod.rs # Architecture trait definitions
x86_64/ # x86-64 implementation
mod.rs
gdt.rs # GDT setup
idt.rs # IDT and interrupt dispatch
apic.rs # Local APIC driver (Tier 0)
timer.rs # HPET/TSC/APIC timer (Tier 0)
mpk.rs # MPK hardware interface
vmx.rs # VMX support for KVM
aarch64/ # ARM64 implementation (phase 2+)
mod.rs
armv7/ # ARMv7 implementation (phase 2+)
mod.rs
riscv64/ # RISC-V 64 implementation (phase 2+)
mod.rs
ppc32/ # PPC32 implementation (phase 2+)
mod.rs
ppc64le/ # PPC64LE implementation (phase 2+)
mod.rs
s390x/ # s390x implementation (phase 2+)
mod.rs
loongarch64/ # LoongArch64 implementation (phase 2+)
mod.rs
umka-sysapi/ # Linux syscall interface + SysAPI shims
Cargo.toml
src/
syscall/ # ~450 syscall dispatch table
mod.rs # SyscallHandler enum, dispatch table
process.rs # fork, clone, execve, exit, wait
file.rs # open, read, write, close, ioctl
memory.rs # mmap, brk, mprotect, madvise
network.rs # socket, bind, listen, accept, connect
time.rs # clock_gettime, nanosleep, timer_*
misc.rs # getpid, getuid, uname, sysinfo
proc/ # /proc filesystem emulation
mod.rs
meminfo.rs # /proc/meminfo
cpuinfo.rs # /proc/cpuinfo
pid.rs # /proc/[pid]/* (maps, status, fd, etc.)
sys.rs # /proc/sys/* (sysctl interface)
sys/ # /sys filesystem emulation
mod.rs
devices.rs # /sys/devices/ device tree
class.rs # /sys/class/ device classes
bus.rs # /sys/bus/ bus enumeration
dev/ # /dev filesystem emulation
mod.rs
devtmpfs.rs # devtmpfs-compatible device nodes
signal/ # Signal handling
mod.rs
delivery.rs # Signal delivery to user space
handlers.rs # Default handlers, core dump
namespace/ # Linux namespace implementation
mod.rs
mnt.rs # Mount namespace
pid.rs # PID namespace
net.rs # Network namespace
user.rs # User namespace
ipc.rs # IPC namespace
uts.rs # UTS namespace
cgroup.rs # Cgroup namespace
time.rs # Time namespace
cgroup/ # Cgroup v1/v2
mod.rs
v2.rs # Unified hierarchy (primary)
v1_compat.rs # Legacy hierarchy (compatibility)
controllers/ # cpu, memory, io, pids, etc.
io_uring/ # io_uring subsystem
mod.rs
ring.rs # SQ/CQ ring management
sqpoll.rs # SQPOLL kernel thread
ops.rs # Operation dispatch
lsm/ # Linux Security Modules
mod.rs
hooks.rs # Hook framework
selinux.rs # SELinux policy engine
apparmor.rs # AppArmor profile engine
seccomp.rs # seccomp-bpf filter
ebpf/ # eBPF subsystem
mod.rs
vm.rs # eBPF virtual machine
verifier.rs # Static verifier
jit/ # JIT compilers
x86_64.rs
aarch64.rs
armv7.rs
riscv64.rs
ppc32.rs
ppc64le.rs
s390x.rs
loongarch64.rs
maps.rs # Map types (hash, array, ringbuf, etc.)
helpers.rs # eBPF helper functions
programs.rs # Program types (XDP, tc, kprobe, etc.)
umka-net/ # Network stack (runs as Tier 1)
Cargo.toml
src/
tcp/ # TCP/IP implementation
udp/ # UDP implementation
ip/ # IP layer (v4 + v6)
arp.rs # ARP
icmp.rs # ICMP
netfilter/ # nftables + iptables compatibility
mod.rs
nft.rs # nftables engine
conntrack.rs # Connection tracking
nat.rs # NAT (SNAT, DNAT, masquerade)
xdp/ # XDP fast path
socket.rs # Socket abstraction
tunnel/ # Tunnel protocol modules (Ch16: network-overlay-and-tunneling)
mod.rs # TunnelDevice trait
vxlan.rs # VXLAN encap/decap
geneve.rs # Geneve encap/decap
gre.rs # GRE/GRE6
ipip.rs # IPIP/SIT
wireguard.rs # WireGuard — Phase 4. VPN tunnel implemented as
# umka-net module, not separate crate. Architecture
# in Section 15 (15-networking.md). Requires Noise IK
# handshake protocol (Curve25519, ChaCha20-Poly1305,
# BLAKE2s), allowed-IPs routing table (longest-prefix
# match per peer), peer management (persistent
# keepalive, roaming endpoint update), timer-based rekey.
bridge/ # Software L2 switch (Ch16: network-overlay-and-tunneling)
mod.rs # Bridge device, FDB, STP
vlan.rs # 802.1Q VLAN filtering
veth.rs # Virtual ethernet pairs
macvlan.rs # macvlan/ipvlan devices
vrf.rs # Virtual Routing and Forwarding
umka-vfs/ # Virtual filesystem layer (Tier 1)
Cargo.toml
src/
mod.rs # VFS dispatch, mount table
ext4/ # ext4 filesystem
xfs/ # XFS filesystem
btrfs/ # btrfs filesystem
tmpfs/ # tmpfs (in-memory)
overlayfs/ # OverlayFS (for containers)
dcache.rs # Directory entry cache
umka-block/ # Block I/O layer (Tier 1)
Cargo.toml
src/
mod.rs # Block device abstraction
scheduler.rs # I/O schedulers (mq-deadline, none, bfq)
partition.rs # Partition table parsing (GPT, MBR)
dm/ # Device-mapper framework (Ch15: block-io-and-volume-management)
mod.rs # DM core: target dispatch, table management
linear.rs # dm-linear
striped.rs # dm-striped
mirror.rs # dm-mirror
crypt.rs # dm-crypt (AES-XTS)
verity.rs # dm-verity
snapshot.rs # dm-snapshot (COW)
thin.rs # dm-thin-pool
md.rs # MD RAID (0/1/5/6/10) superblock compat
# MD RAID — Phase 4. Architecture in
# Ch15: block-io-and-volume-management.
# Requires superblock compat (v0.90 at end-of-device,
# v1.0/1.2 at start), RAID personality trait
# (start_reshape, sync_request, make_request,
# check_reshape), resync/recovery state machine
# (idle → resync → active, bitmap-guided incremental sync).
lvm.rs # LVM2 metadata reader
recovery.rs # Recovery-aware volume state machine
iscsi/ # iSCSI block storage (Ch15: block-storage-networking)
mod.rs # iSCSI common: PDU parsing, session state
initiator.rs # iSCSI initiator (RFC 7143)
target.rs # iSCSI target (LIO-compatible config)
iser.rs # iSER — RDMA transport for iSCSI
chap.rs # CHAP authentication
multipath.rs # dm-multipath integration
nvmeof/ # NVMe over Fabrics (Ch15: block-storage-networking)
mod.rs # NVMe-oF common: capsule parsing, queue pairs
host.rs # NVMe-oF initiator (host) — connect, I/O
target.rs # NVMe-oF target (subsystem) — nvmetcli compat
tcp.rs # NVMe/TCP transport (TP 8000)
rdma.rs # NVMe/RDMA transport (TP 8001)
discovery.rs # Discovery controller client/server
ana.rs # ANA multipath — asymmetric namespace access
umka-kvm/ # KVM hypervisor (Tier 1)
Cargo.toml
src/
mod.rs # /dev/kvm interface
vmx.rs # Intel VMX
svm.rs # AMD SVM
mmu.rs # Nested page tables (EPT/NPT)
tee/ # Confidential VM support (Ch9: confidential-computing)
sev.rs # AMD SEV-SNP guest/host
tdx.rs # Intel TDX guest/host
cca.rs # ARM CCA realm management
umka-accel/ # AI/ML accelerator subsystem (Ch22: unified-accelerator-framework)
Cargo.toml
src/
mod.rs # AccelBase trait, device registration
scheduler.rs # CBS-based accelerator scheduler
hmm.rs # Heterogeneous memory management
p2p.rs # Peer-to-peer DMA (PCIe, NVLink, CXL)
inference.rs # In-kernel inference engine
rdma.rs # RDMA and collective ops
umka-cluster/ # Distributed kernel (Ch5: distributed-kernel-architecture)
Cargo.toml
src/
mod.rs # Cluster topology, node discovery
transport.rs # ClusterTransport trait + RdmaInfra, per-peer bindings
ipc.rs # Distributed IPC proxy
dsm.rs # Distributed shared memory
dlm.rs # Distributed Lock Manager (Ch15: distributed-lock-manager)
global_pool.rs # Global memory pool
scheduler.rs # Cluster-wide scheduling
caps.rs # Network-portable capabilities
umka-driver-sdk/ # Stable driver SDK
Cargo.toml
interfaces/ # .kabi IDL definitions
block_device.kabi # Block device interface
net_device.kabi # Network device interface
gpu_device.kabi # GPU device interface
input_device.kabi # Input device interface
usb_device.kabi # USB device interface
char_device.kabi # Character device interface
pci_device.kabi # PCI device interface
platform_device.kabi # Platform device interface
src/
lib.rs # SDK entry point, driver registration
abi.rs # Generated stable ABI types
dma.rs # DMA buffer management
mmio.rs # MMIO access helpers (volatile read/write)
irq.rs # Interrupt handling
ring.rs # Ring buffer helpers for driver use
manifest.rs # Driver manifest parsing
drivers/ # In-tree drivers
tier0/ # Boot-critical (statically linked)
apic/ # Local APIC + I/O APIC
timer/ # PIT / HPET / TSC
serial/ # Early serial console
vga/ # Early VGA text console
tier1/ # Performance-critical (domain-isolated)
nvme/ # NVMe SSD driver
virtio_blk/ # VirtIO block device
virtio_net/ # VirtIO network device
virtio_gpu/ # VirtIO GPU
virtio_console/ # VirtIO console
e1000/ # Intel e1000 NIC
igb/ # Intel igb NIC
ahci/ # AHCI/SATA controller
ext4/ # ext4 driver component
tier2/ # Isolated (user-space process)
usb_xhci/ # USB XHCI host controller
usb_hid/ # USB HID (keyboard, mouse)
usb_storage/ # USB mass storage
hda_audio/ # Intel HDA audio
input/ # Input subsystem (evdev)
tools/
kabi-compiler/ # .kabi IDL -> Rust/C code generator
Cargo.toml
src/
main.rs
parser.rs # IDL parser
codegen_rust.rs # Rust binding generator
codegen_c.rs # C binding generator
kabi-compat-check/ # ABI compatibility CI checker
Cargo.toml
src/
main.rs # Diffs old vs new .kabi, rejects breaks
umka-initramfs/ # Initramfs builder tool
Cargo.toml
src/
main.rs # Packs drivers + early userspace
arch/ # Architecture-specific C/asm
x86_64/
boot/ # UEFI/BIOS boot stub (C + asm)
header.S # Linux boot protocol header
main.c # Early C boot code
efi_stub.c # UEFI stub
asm/
entry.S # Syscall entry/exit
switch.S # Context switch
irq_stubs.S # Interrupt stub table
vdso/
vdso.lds # vDSO linker script
clock_gettime.c # clock_gettime implementation
getcpu.c # getcpu implementation
aarch64/
boot/ # ARM64 boot stub
asm/ # ARM64 assembly
vdso/ # ARM64 vDSO
riscv64/
boot/ # RISC-V boot stub
asm/ # RISC-V assembly
vdso/ # RISC-V vDSO
ppc32/
boot/ # PPC32 boot stub
asm/ # PPC32 assembly
vdso/ # PPC32 vDSO
ppc64le/
boot/ # PPC64LE boot stub
asm/ # PPC64LE assembly
vdso/ # PPC64LE vDSO
s390x/
boot/ # s390x boot stub (IPL)
asm/ # s390x assembly
vdso/ # s390x vDSO
loongarch64/
boot/ # LoongArch64 boot stub
asm/ # LoongArch64 assembly
vdso/ # LoongArch64 vDSO
tests/
abi_compat/ # Old driver binaries for compat regression
syscall/ # Linux syscall conformance (LTP-based)
driver/ # Driver integration tests
bench/ # Performance regression benchmarks
crash_recovery/ # Fault injection + recovery verification
24.9 What UmkaOS Provides That Linux Cannot¶
| Feature | Linux | UmkaOS |
|---|---|---|
| Driver crash recovery | Kernel oops or panic depending on fault type. Many driver bugs produce oops (system continues with degraded functionality) rather than panic. Recovery requires at minimum driver module reload; severe faults cause panic and full reboot (30-60s). | Reload driver in ~50-150ms (Tier 1) or ~10ms (Tier 2). On RISC-V/s390x/LoongArch64, Tier 1 is unavailable — drivers run as Tier 0 (crash = panic, same as Linux) or Tier 2 (full crash recovery), depending on licensing, driver preference, and sysadmin decision. Tier 2 crash recovery is available on all architectures. |
| Stable driver ABI | None (recompile every update) | Versioned, append-only, binary-stable KABI |
| Driver isolation | None (shared address space) | Domain isolation + IOMMU (Tier 1), full process (Tier 2) |
| Capability-based security | Bolt-on (POSIX caps are coarse) | Foundational architecture |
| Lock ordering enforcement | Runtime lockdep (debug only) | Compile-time lock ordering via Rust const generics: Lock<T, LEVEL> where LEVEL: u32 encodes the lock level in the type signature (e.g., Lock<Rq, 100>), preventing out-of-order acquisition at compile time. See Section 3.5. |
| io_uring security | Bypasses syscall monitoring | Per-instance operation whitelist |
| Hot driver upgrade | Fragile (unstable ABI) | Clean stop/start with stable KABI |
| Memory safety | C everywhere | Rust with minimal unsafe at hardware boundaries |
| Many-core scalability | Known bottlenecks (RTNL for networking — partially mitigated in Linux 6.x with per-netns locking but still a single global mutex as of mainline; inode_lock for VFS; cgroup_mutex for cgroups) | No global locks, per-CPU/per-NUMA everywhere |
| Proactive fault management | Ad-hoc (mcelog, rasdaemon) | Unified FMA with diagnosis engine (Section 20.1) |
| Memory compression | zswap/zram (separate, config-heavy) | Integrated NUMA-aware CompressPool tier (Section 4.12) |
| CPU bandwidth guarantee | No floor mechanism | CBS-backed cpu.guarantee (Section 7.6) |
| Stable observability ABI | Tracepoints are unstable | Versioned, documented stable tracepoints (Section 20.2) |
| Verified boot chain | Fragmented (UEFI SB + IMA + dm-verity) | Unified chain from firmware to drivers (Section 9.3) |
| Kernel object introspection | Per-subsystem (/proc, /sys, scattered) | Unified object namespace via umkafs (Section 20.5) |
| Driver state preservation | Lost on crash — cold restart | Checkpointed state buffer, warm restart (Section 11.9) |
| Core panic diagnostics | kexec + kdump (complex setup) | In-place crash dump to reserved memory (Section 11.9) |
| Context switch XSAVE cost | Eager XSAVE with XSAVEOPT/XSAVES optimizations (skips unmodified components, but still saves full state for context switches involving SIMD). UmkaOS's lazy approach avoids save/restore entirely for non-SIMD threads. | Lazy XSAVE — zero cost for non-SIMD threads (Section 7.3) |
| CPU errata management | Scattered #ifdef, ad-hoc | Structured quirk table + boot-param controls (Section 2.18) |
| Volume layer + driver crash | Device marked failed, RAID resync | Recovery-aware: pause I/O, resume clean (Section 15.2) |
| VM guest driver crash | VM reboot required | Driver recovers in-place, hypervisor unaware (Section 18.1) |
| Block storage networking | Separate stacks (open-iscsi, nvme-cli, no unified recovery) | Unified iSCSI + NVMe-oF with RDMA upgrade and crash recovery (Section 15.13) |
| Clustered FS + driver crash | Node fenced, ejected from cluster | Driver recovers in-place, node stays in cluster (Section 15.14) |
| Distributed locking | TCP-based DLM (~10-100 μs/op depending on lock locality; local locks <1 μs), global recovery quiesce on any node failure | RDMA-native DLM (~2-3 μs uncontested, ~5-10 μs contested), per-resource recovery, lease-based extension, batch ops (Section 15.15) |
| TPM key management | Userspace daemon (tpm2-abrmd) | Kernel-native resource manager + capability integration (Section 9.4) |
| Runtime integrity | IMA bolted onto VFS, optional | Integrated with capability system and driver loading (Section 9.5) |
| Display stack crash | X/Wayland session lost | DMA-BUF survives driver reload, compositor stalls ~100ms-5s (full recovery window; Section 22.7) |
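The compile-time lock ordering row above (`Lock<T, LEVEL>`, Section 3.5) can be sketched on stable Rust with const generics. This is an illustrative reconstruction, not the kernel's actual `Lock` type; it uses the monomorphization-time const-assert trick to reject out-of-order acquisition at build time:

```rust
use std::sync::{Mutex, MutexGuard};

/// A mutex tagged with a compile-time lock level (sketch only).
pub struct Lock<T, const LEVEL: u32> {
    inner: Mutex<T>,
}

/// Zero-sized token recording the highest lock level currently held.
pub struct Ordered<const LEVEL: u32>;

struct AssertGt<const A: u32, const B: u32>;
impl<const A: u32, const B: u32> AssertGt<A, B> {
    // Evaluated when instantiated: A <= B turns into a build failure.
    const OK: () = assert!(A > B, "out-of-order lock acquisition");
}

impl<T, const LEVEL: u32> Lock<T, LEVEL> {
    pub fn new(v: T) -> Self {
        Lock { inner: Mutex::new(v) }
    }

    /// Acquire while a level-HELD lock is held; legal only if LEVEL > HELD.
    pub fn lock_after<const HELD: u32>(
        &self,
        _held: &Ordered<HELD>,
    ) -> (MutexGuard<'_, T>, Ordered<LEVEL>) {
        let () = AssertGt::<LEVEL, HELD>::OK; // compile-time order check
        (self.inner.lock().unwrap(), Ordered::<LEVEL>)
    }
}

fn main() {
    let rq: Lock<u32, 100> = Lock::new(0); // e.g. Lock<Rq, 100>
    let vm: Lock<u32, 200> = Lock::new(0);
    let root = Ordered::<0>; // nothing held yet
    let (mut g1, held) = rq.lock_after(&root); // 100 > 0: ok
    let (mut g2, _h2) = vm.lock_after(&held);  // 200 > 100: ok
    *g1 += 1;
    *g2 += 1;
    // Reversing the order fails to compile:
    // let (_, h) = vm.lock_after(&root);
    // let _ = rq.lock_after(&h); // error: out-of-order lock acquisition
    println!("ok: {} {}", *g1, *g2);
}
```

Unlike lockdep, which reports violations only on debug kernels at runtime, the illegal ordering here never produces a binary at all.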
24.10 Cross-Feature Integration Map¶
24.10.1 Feature Interaction Matrix¶
These features are not independent — they reinforce each other:
| Feature A | Direction | Feature B | Rationale |
|---|---|---|---|
| Formal verification (Section 24.4) | --> | Confidential computing (Section 9.7) | Proves capability system correct; CC relies on correct capability enforcement |
| Safe extensibility (Section 19.9) | <-> | Live evolution (Section 13.18) | Policy modules are hot-swappable; evolution uses the same mechanism |
| Intent-based management (Section 7.10) | <-> | In-kernel inference (Section 22.6) | Intent optimizer uses learned models; models optimize for declared intents |
| EAS / heterogeneous CPU (Section 7.2) | <-> | Power budgeting (Section 7.7) | EAS picks energy-optimal core; power budget enforces watt cap |
| Power budgeting (Section 7.7) | <-> | Intent-based management (Section 7.10) | Power budget is a constraint; intents include efficiency preference |
| Hardware memory safety (Section 2.23) | --> | Tier 1 driver isolation (Section 11.3) | MTE catches C driver bugs; domain isolation catches the resulting faults |
| Confidential computing (Section 9.7) | --> | Distributed kernel (Section 5.1) | TEE-to-TEE RDMA; DSM coherence for encrypted pages |
| Post-quantum crypto (Section 9.6) | --> | Distributed capabilities (Section 5.7) | PQC signatures on capabilities; network-portable across cluster |
| SmartNIC/DPU (Section 5.11) | <-> | Distributed kernel (Section 5.1) | DPU = peer node (full or shim); same peer protocol + capability services |
| Persistent memory (Section 15.16) | <-> | Memory tiers (Section 22.4) | Persistent memory = another tier; managed by same PageLocationTracker |
| Computational storage (Section 15.17) | <-> | Accelerator framework (Section 22.1) | CSD = storage accelerator; same AccelBase vtable |
| Unified compute (Section 22.8) | <-> | EAS / heterogeneous CPU (Section 7.2) | Multi-dim capacity extends scalar; CPU capacity is a special case |
| Unified compute (Section 22.8) | <-> | Accelerator scheduler (Section 22.2) | Cross-device topology + energy data; accel scheduler consumes advisory |
| Unified compute (Section 22.8) | <-> | Power budgeting (Section 7.7) | Workload profile drives throttle; informed cross-device power decisions |
| Unified compute (Section 22.8) | <-> | Intent-based management (Section 7.10) | compute.weight feeds intent optimizer; optimizer adjusts per-domain knobs |
| Unified compute (Section 22.8) | <-> | Distributed kernel (Section 5.1) | Peer kernel nodes via ClusterTransport; accelerator = close compute node |
| Unified compute (Section 22.8) | <-> | SmartNIC/DPU offload (Section 5.11) | Same convergence: device to peer node; ClusterTransport unifies all transports |
| Distributed Lock Manager (Section 15.15) | <-> | RDMA transport (Section 5.4) | DLM uses RDMA CAS/Send for locks; transport provides kernel RDMA API |
| Distributed Lock Manager (Section 15.15) | <-> | Cluster membership (Section 5.2) | DLM receives join/leave/dead events; single heartbeat source for both |
| Distributed Lock Manager (Section 15.15) | <-> | Clustered filesystems (Section 15.14) | GFS2/OCFS2 use DLM for coordination; DLM lock modes map to FS operations |
| Distributed Lock Manager (Section 15.15) | <-> | Driver recovery (Section 11.9) | DLM in umka-core survives driver crash; no lock recovery needed on Tier 1 reload |
Bootstrap Circular Dependency:
The intent optimizer (Section 7.10) uses in-kernel inference models (Section 22.6), but those models may not be loaded at early boot. Resolution: the intent optimizer degrades gracefully to static defaults when models are unavailable. At boot:
1. The intent optimizer starts with hardcoded heuristics (e.g., "latency target → raise cpu.weight by 20%").
2. When the inference engine loads models (typically within seconds of boot), the optimizer transitions to learned optimization.
3. The transition is seamless — no reconfiguration needed.
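A hedged sketch of this graceful-degradation bootstrap. All names (`WeightModel`, `IntentOptimizer`, `StaticHeuristic`) and the specific weight values are illustrative assumptions, not the real API:

```rust
/// Illustrative model interface; the real inference engine (Section 22.6)
/// is assumed to expose something trait-shaped like this.
trait WeightModel {
    fn cpu_weight(&self, latency_sensitive: bool) -> u32;
}

/// Hardcoded boot-time heuristics, available before any model loads.
struct StaticHeuristic;
impl WeightModel for StaticHeuristic {
    fn cpu_weight(&self, latency_sensitive: bool) -> u32 {
        // "latency target -> raise cpu.weight by 20%"
        if latency_sensitive { 120 } else { 100 }
    }
}

struct IntentOptimizer {
    model: Box<dyn WeightModel>,
}

impl IntentOptimizer {
    fn boot() -> Self {
        Self { model: Box::new(StaticHeuristic) } // step 1: static defaults
    }
    fn on_models_loaded(&mut self, m: Box<dyn WeightModel>) {
        self.model = m; // steps 2-3: seamless swap, no reconfiguration
    }
    fn weight(&self, latency_sensitive: bool) -> u32 {
        self.model.cpu_weight(latency_sensitive)
    }
}

fn main() {
    let mut opt = IntentOptimizer::boot();
    assert_eq!(opt.weight(true), 120); // heuristic path at early boot
    struct Learned; // stand-in for a loaded inference model
    impl WeightModel for Learned {
        fn cpu_weight(&self, l: bool) -> u32 { if l { 135 } else { 100 } }
    }
    opt.on_models_loaded(Box::new(Learned)); // inference engine ready
    println!("post-load weight = {}", opt.weight(true));
}
```

Callers never observe the transition: the same `weight()` query works before, during, and after model load.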
24.10.2 Implementation Dependency Graph¶
Foundation (no dependencies):
- Formal verification readiness (Section 24.4) -- design methodology
- Post-quantum crypto abstraction (Section 9.6) -- data structure sizing
- Locking primitives (Section 3.5) -- lock design

Early integration:
- Hardware memory safety (Section 2.23) -- needs memory allocator
- Power budgeting (Section 7.7) -- needs scheduler
- Safe extensibility (Section 19.9) -- needs KABI vtable mechanism

Mid integration:
- Confidential computing (Section 9.7) -- needs memory manager, IOMMU
- Intent-based management (Section 7.10) -- needs inference engine, cgroups
- Live evolution (Section 13.18) -- needs extensibility mechanism

Late integration:
- SmartNIC/DPU offload (Section 5.11) -- needs peer protocol, capability service providers, device registry
- Persistent memory (Section 15.16) -- needs VFS, memory tiers
- Computational storage (Section 15.17) -- needs AccelBase framework
- Unified compute topology (Section 22.8) -- needs AccelBase, EAS (Section 7.2), power budgeting (Section 7.7)
- Peer kernel nodes (Section 22.8) -- needs unified compute + distributed kernel (Section 5.1)
24.10.3 Cross-Feature Integration Testing Specification¶
The dependency graph above defines 21 feature-pair interactions. Each pair requires a dedicated integration test that exercises the interaction path. This section is the canonical CI specification — it replaces the informal guidance previously in Section 24.11.
24.10.3.1 CI Tier Assignment¶
Tests slot into the 3-tier CI structure defined in Section 25.8:
| Tier | Trigger | Cross-feature tests | Rationale |
|---|---|---|---|
| Compile-time | Every commit | 1 pair (XF-21) | Verified by Verus proofs, not runtime tests |
| Tier 2 | Every PR | 6 safety-critical pairs (XF-01 – XF-05, XF-07) | Regression = crash, data loss, or security breach |
| Tier 3 | Nightly | 14 functional pairs (XF-06, XF-08 – XF-20) | Regression = performance or non-critical functionality |
24.10.3.2 Acceptance Criteria (all runtime tiers)¶
| Criterion | Threshold |
|---|---|
| Correctness | Zero failures across 1,000 iterations per pair |
| Sanitizer | Zero ASan / MSan / TSan findings |
| Latency regression | P99 < 5% vs. single-feature baseline |
| Resource leaks | Zero (memory, file descriptors, locks) after test completion |
| Branch coverage | ≥ 80% of the integration code path per pair |
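The acceptance criteria above can be read as a single pass/fail gate a CI harness applies per pair. The struct and field names below are illustrative, not an existing harness API:

```rust
// Hypothetical per-pair CI result, mirroring the acceptance table.
struct PairRun {
    failures: u32,           // out of 1,000 iterations
    sanitizer_findings: u32, // ASan / MSan / TSan
    p99_regression_pct: f64, // vs. single-feature baseline
    leaks: u32,              // memory, file descriptors, locks
    branch_coverage: f64,    // of the integration code path
}

/// All criteria must hold; any single violation fails the pair.
fn accept(r: &PairRun) -> bool {
    r.failures == 0
        && r.sanitizer_findings == 0
        && r.p99_regression_pct < 5.0
        && r.leaks == 0
        && r.branch_coverage >= 0.80
}

fn main() {
    let pass = PairRun {
        failures: 0,
        sanitizer_findings: 0,
        p99_regression_pct: 3.2,
        leaks: 0,
        branch_coverage: 0.87,
    };
    assert!(accept(&pass));
    let slow = PairRun { p99_regression_pct: 6.1, ..pass }; // latency regression
    assert!(!accept(&slow));
    println!("gate ok");
}
```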
24.10.3.3 Tier 2: Every PR (Safety-Critical Pairs)¶
These 6 pairs guard against crash, data loss, or security breach. A failure blocks merge.
| ID | Pair | Test Scenario | Failure Mode Prevented |
|---|---|---|---|
| XF-01 | Safe extensibility ↔ Live evolution | Load policy module A → hot-swap to B under sustained load (1K ops/s) → verify behavior changes correctly → forward-evolve to original module A → verify state preserved across A→B→A evolution cycle | Stale vtable pointer; lost operations during swap; state corruption across evolution cycle |
| XF-02 | HW memory safety → Tier 1 isolation | Inject OOB write in Tier 1 driver → verify MTE/MPK trap fires → verify isolation domain fault handler runs → verify driver reloads within 150 ms → verify no kernel state corruption | Undetected memory corruption escaping isolation domain |
| XF-03 | DLM ↔ Driver recovery | Acquire DLM lock via Tier 1 storage driver → crash driver (kill domain) → verify lock state preserved in umka-core → reload driver → verify lock accessible without re-acquire | Distributed deadlock from lock state lost on driver crash |
| XF-04 | DLM ↔ Clustered filesystems | Mount clustered FS on 2 QEMU nodes → concurrent file create + write from both → verify DLM serializes conflicting operations → fsck after test → zero corruption | DLM lock ordering violation → filesystem corruption |
| XF-05 | EAS ↔ Power budgeting | Set 10 W power cap → run mixed CPU-bound + I/O workload → verify EAS selects cores within budget → verify cap not exceeded by more than 1 scheduler tick interval | Power budget violation → thermal throttle or hardware damage |
| XF-07 | PQC → Distributed capabilities | Create capability → ML-DSA-65 sign → send to peer via RDMA → verify peer validates → revoke on origin → verify peer rejects within 2 heartbeat intervals | Forged or revoked capability accepted by peer node |
24.10.3.4 Tier 3: Nightly (Functional Pairs)¶
These 14 pairs test performance, optimization, and non-safety functionality.
Failures block merge to master but do not block PR merge to develop.
| ID | Pair | Test Scenario |
|---|---|---|
| XF-06 | Confidential computing → Distributed kernel | Establish DSM region between 2 QEMU nodes → write 4 KB page on node A inside SEV-SNP guest → verify host hypervisor process cannot read page content (QEMU limited: verify memory encryption APIs are called, not actual ciphertext) → read page on node B via DSM → verify coherence. Note: wire-level RDMA encryption is Phase 5+ (Section 9.7); this test verifies DSM coherence with CC-protected memory, not transport encryption. |
| XF-08 | Intent management ↔ In-kernel inference | Set latency-sensitive intent on cgroup → verify ML model adjusts scheduler weights within 5 ticks → remove intent → verify return to defaults within 5 ticks |
| XF-09 | Power budgeting ↔ Intent management | Set "efficiency" intent → verify power budget tightens (measurable watt reduction) → switch to "performance" → verify budget relaxes within 100 ms |
| XF-10 | SmartNIC/DPU ↔ Distributed kernel | DPU joins as Tier M peer → verify service binding + capability routing → simulate DPU crash → verify host fallback activates within 500 ms → DPU rejoins → verify service restored |
| XF-11 | Persistent memory ↔ Memory tiers | Allocate pages on pmem tier → generate hot access pattern → verify PageLocationTracker promotes to DRAM → cool access → verify demotion back to pmem |
| XF-12 | Computational storage ↔ Accelerator framework | Register CSD as AccelBase → submit SHA-256 compute task → verify CSD executes → verify result matches host-computed reference |
| XF-13 | Unified compute ↔ EAS | Register CPU + GPU + NPU with multi-dim capacity vectors → submit heterogeneous workload → verify EAS uses compute.weight for placement decisions |
| XF-14 | Unified compute ↔ Accelerator scheduler | Build cross-device topology (CPU + 2 accelerators) → submit batch of jobs → verify scheduler places on optimal device per energy advisory → verify no starvation |
| XF-15 | Unified compute ↔ Power budgeting | Set per-domain 5 W cap + aggregate 15 W cap → submit cross-device workload → verify throttle decisions respect both per-device and aggregate limits |
| XF-16 | Unified compute ↔ Intent management | Set compute.weight via intent API → verify optimizer adjusts per-domain scheduling knobs → verify convergence within 10 scheduler ticks |
| XF-17 | Unified compute ↔ Distributed kernel | Register remote accelerator as peer via ClusterTransport → submit remote compute job → verify completion callback → verify capability cleanup on disconnect |
| XF-18 | Unified compute ↔ SmartNIC/DPU | DPU advertises compute offload service → verify unified compute topology includes DPU node → submit offloadable work → verify routing to DPU |
| XF-19 | DLM ↔ RDMA transport | Acquire lock via RDMA CAS on remote node → verify lock state visible on both nodes → release via RDMA → verify release propagates within 1 ms |
| XF-20 | DLM ↔ Cluster membership | 3-node cluster → node B leaves → verify DLM redistributes B's master locks to A and C → node B rejoins → verify rebalance completes without orphaned locks |
24.10.3.5 Compile-Time (Every Commit)¶
| ID | Pair | Verification |
|---|---|---|
| XF-21 | Formal verification → Confidential computing | Verus proofs for capability system pass (cargo verus --verify). Correctness of capability enforcement (which CC relies on) is proven at compile time, not tested at runtime. Proof failure = build failure. |
24.10.3.6 Architecture-Specific Notes¶
Most cross-feature tests run on x86-64 (-cpu max) as the primary CI platform.
Exceptions:
| Test | Additional architectures | Reason |
|---|---|---|
| XF-02 (HW safety + isolation) | AArch64 (-M virt,mte=on -cpu neoverse-n2), ARMv7 (-M vexpress-a15, DACR) | Tests arch-specific trap + isolation mechanisms |
| XF-06 (CC + distributed) | x86-64 only | SEV-SNP/TDX emulation (limited: QEMU does not model encrypted memory controller; test verifies API calls, not actual encryption); AArch64 CCA not emulable in QEMU |
All other tests exercise kernel subsystem interactions independent of architecture. Nightly Tier 3 runs additionally include AArch64 and RISC-V 64 for cross-architecture confidence.
24.10.3.7 Fuzzing (Release Candidates)¶
For each release candidate, run syzkaller-style fuzzing on the 6 Tier 2 pairs for
24 hours. The fuzzer generates random sequences of:
- Policy module load/unload interleaved with live evolution swaps (XF-01)
- Concurrent DLM lock acquire/release with driver crash injection (XF-03, XF-04)
- Power budget changes during workload spikes (XF-05)
- DSM page access patterns with CC-protected memory (XF-06)
- Capability create/sign/revoke races across nodes (XF-07)
Zero findings required for release sign-off.
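A generator for the operation sequences above might be seeded as follows. This is a sketch, not the real fuzzer grammar: the op set mirrors the five bullets, the weights are uniform for illustration, and a deterministic LCG is used so a failing sequence can be replayed from its seed.

```rust
// Sketch of a syzkaller-style sequence generator for the Tier 2 pairs.
// Op names and the uniform distribution are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FuzzOp {
    PolicyLoad, PolicyUnload, EvolutionSwap,   // XF-01
    DlmAcquire, DlmRelease, DriverCrashInject, // XF-03, XF-04
    PowerBudgetSet(u32),                       // XF-05
    DsmPageTouch,                              // XF-06
    CapRevoke,                                 // XF-07
}

/// Deterministic LCG (Knuth's MMIX constants) so runs are reproducible.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

fn gen_sequence(seed: u64, len: usize) -> Vec<FuzzOp> {
    let mut rng = Lcg(seed);
    (0..len)
        .map(|_| match rng.next() % 9 {
            0 => FuzzOp::PolicyLoad,
            1 => FuzzOp::PolicyUnload,
            2 => FuzzOp::EvolutionSwap,
            3 => FuzzOp::DlmAcquire,
            4 => FuzzOp::DlmRelease,
            5 => FuzzOp::DriverCrashInject,
            6 => FuzzOp::PowerBudgetSet((rng.next() % 200) as u32),
            7 => FuzzOp::DsmPageTouch,
            _ => FuzzOp::CapRevoke,
        })
        .collect()
}

fn main() {
    // Same seed -> same sequence: a finding is replayable from its seed.
    assert_eq!(gen_sequence(42, 64), gen_sequence(42, 64));
    assert_eq!(gen_sequence(42, 64).len(), 64);
}
```

Reproducibility from the seed is what makes "zero findings for sign-off" auditable: any finding ships with the seed that reproduces it.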
24.11 Open Questions¶
The following cross-cutting items require further design work. Each is tracked as an open question with the affected sections and the specific decision to be made.
Mirrored in: Section 25.9 — update both when status changes.
24.11.1 Resolved Decisions (collapsed — full rationale in referenced sections)¶
These items were previously open but are now fully specified in the architecture.
| Decision | Resolution | Reference |
|---|---|---|
| WiFi: Tier 1 or Tier 2? | Tier 1 | Section 13.15 |
| BlueZ or clean-room Bluetooth? | BlueZ adapter (Tier 2 daemon) | Section 13.14 |
| Allow proprietary drivers? | Yes, via KABI binary compatibility | Section 24.1 |
| eBPF verifier: full or partial? | Full verifier, phased delivery (Phase 2–5) | Section 19.2 |
| io_uring + SEV-SNP buffer management? | Bounce buffer architecture, 16 MiB/ring | Section 19.3 (future work) |
| Live Evolution attestation chain? | Dedicated PCR[16]/PCR[23] + hash-chained event log + TPM2_PolicyAuthorize | Section 13.18, Section 9.3 |
| CXL 3.0 fabric management? | First-class memory tier (NumaNodeType::CxlMemory) | Section 5.9 |
| CXL 3.0 coherence vs. DSM? | Hybrid — CXL transport for intra-rack, RDMA DSM for inter-rack | Section 5.9 |
| Multi-architecture parity matrix? | 8-feature × 8-arch parity matrix defined | Section 2.22 |
| Secure boot: live evolution PCR? | PCR[16] (dev) / PCR[23] (prod) with LiveEvolutionEvent struct | See attestation chain above |
| Default filesystem? | No single default. ext4 (general), XFS (enterprise), ZFS (data integrity/servers). Btrfs for snapshot-centric workloads only. | Filesystem drivers spec |
| io_uring live evolution? | Task-owned state (Theseus-style); component is stateless processor; ~1-10μs swap | Section 19.3 |
| Cross-feature integration testing? | 21 real pairs (not "100+"), 7 PR-critical + 13 nightly + 1 compile-time. Full CI spec with test scenarios, acceptance criteria, fuzzing. | Section 24.10 |
| DPU io_uring submission offload? | Not a separate design question. DPUs in "dumb driver" mode use normal KABI vtable path (no SQ offload). DPUs in Tier M mode use ServiceMessage via DomainRingBuffer ring pairs — the peer protocol IS the transport. No direct DPU access to userspace SQ/CQ. | Section 5.11 |
| Multi-architecture fallback acceptance criteria? | Per-feature thresholds: native ≤5%, fallback ≤10%, not-available 0%. Per-feature acceptance tests. Sysfs /sys/kernel/umka/features/ + dmesg notification for degradation visibility. | Section 2.22 |
| Policy module measurement enforcement? | Tied to boot security posture: enforce (default when secure boot active, rejects unsigned), advisory (default otherwise, allows with warning + isolation), off (bare-metal debugging). Boot parameter umka.module_sig=. Immutable after boot. | Section 19.9 |
| GPU confidential computing? | Not a separate design decision. Both paths (bounce buffer and hardware crypto) are supported. Runtime detection via CcDeviceCapability, admin override via umka.cc_device_dma=auto\|bounce\|hwcrypto. Same pattern as isolation fallback. | Section 9.7 |
| Nested GPU passthrough? | Supported if hardware supports it. Three conditions: IOMMU nested translation, TEE firmware nested device assignment, overhead ≤ 3x. Returns -ENOTSUP otherwise. | Section 9.7 |
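The policy-module measurement decision above lends itself to a small sketch. The enum and helper names here are assumptions for illustration; what is taken from the table is the behavior itself: three values for umka.module_sig=, a default tied to the secure-boot posture, and immutability after boot.

```rust
// Hedged sketch of umka.module_sig= handling. Names are illustrative;
// the documented behavior: default follows secure-boot posture,
// and the chosen value is immutable after boot.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ModuleSigPolicy {
    Enforce,  // reject unsigned policy modules
    Advisory, // allow with warning + isolation
    Off,      // bare-metal debugging only
}

/// Resolve the boot-time policy once; callers treat the result as
/// immutable for the lifetime of the kernel.
fn module_sig_policy(cmdline_value: Option<&str>, secure_boot_active: bool) -> ModuleSigPolicy {
    match cmdline_value {
        Some("enforce") => ModuleSigPolicy::Enforce,
        Some("advisory") => ModuleSigPolicy::Advisory,
        Some("off") => ModuleSigPolicy::Off,
        // Absent or unrecognized: default is tied to boot security posture.
        _ if secure_boot_active => ModuleSigPolicy::Enforce,
        _ => ModuleSigPolicy::Advisory,
    }
}

fn main() {
    assert_eq!(module_sig_policy(None, true), ModuleSigPolicy::Enforce);
    assert_eq!(module_sig_policy(None, false), ModuleSigPolicy::Advisory);
    assert_eq!(module_sig_policy(Some("off"), true), ModuleSigPolicy::Off);
}
```

Resolving the policy exactly once at boot is what makes the "immutable after boot" guarantee trivial to uphold: nothing after early init ever writes it.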
24.11.2 Open Questions (genuinely unresolved)¶
24.11.2.1 OEM partnerships strategy¶
Not yet decided. Affects go-to-market for consumer hardware support (Phase 5b). Candidates: Framework, System76, Dell, HP.
This document is the canonical reference for UmkaOS development. All implementation decisions must be traceable to the architecture described here. Changes to this document require team review and approval.
24.12 KABI IDL Compiler Specification¶
The KABI IDL language and umka-kabi-gen tool are fully specified in Section 12.5. The roadmap deliverable is to implement umka-kabi-gen conforming to that specification. See Section 24.2 for the Phase 1 build milestone.
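To give a feel for what umka-kabi-gen emits, here is a purely illustrative guess at the output shape: a #[repr(C)] table of function pointers plus a version field checked at driver load. Every field name and the versioning scheme are assumptions; the authoritative KernelServicesVTable layout is defined in Section 12.5.

```rust
// Illustrative sketch of generated umka-kabi-gen output. Field names
// and the abi_version scheme are assumptions, not the Section 12.5 layout.
#[repr(C)]
pub struct KernelServicesVTable {
    pub abi_version: u32,
    pub alloc_pages: extern "C" fn(count: usize) -> *mut u8,
    pub free_pages: extern "C" fn(ptr: *mut u8, count: usize),
    pub register_irq: extern "C" fn(line: u32, handler: extern "C" fn(u32)) -> i32,
}

// Stand-in implementations so the sketch is self-contained.
extern "C" fn noop_irq(_line: u32) {}
extern "C" fn fake_alloc(_count: usize) -> *mut u8 {
    core::ptr::null_mut()
}
extern "C" fn fake_free(_ptr: *mut u8, _count: usize) {}
extern "C" fn fake_register(_line: u32, _handler: extern "C" fn(u32)) -> i32 {
    0
}

fn main() {
    let vt = KernelServicesVTable {
        abi_version: 1,
        alloc_pages: fake_alloc,
        free_pages: fake_free,
        register_irq: fake_register,
    };
    // A generated driver entry point would validate abi_version first,
    // then call kernel services only through the exchanged table.
    assert_eq!(vt.abi_version, 1);
    assert_eq!((vt.register_irq)(5, noop_irq), 0);
    assert!((vt.alloc_pages)(1).is_null());
}
```

The #[repr(C)] layout is what makes the table stable across compiler versions, which is the property the KABI binary-compatibility decision in Section 24.11.1 relies on.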