Chapter 24: Roadmap and Verification¶
Driver ecosystem, implementation phases, verification strategy, technical risks, formal verification, appendices
Implementation phases, verification strategy, and project planning. Phase 1 (microkernel core) through Phase 5 (production hardening) define the delivery order. Formal verification targets are identified for safety-critical subsystems. Open questions and technical risks are tracked here as the canonical reference for development decisions.
24.1 Driver Ecosystem Strategy¶
24.1.1 The Challenge¶
Driver coverage is the single largest adoption blocker for any new kernel. Linux has thousands of drivers covering decades of hardware. UmkaOS cannot replicate this overnight.
24.1.2 Agentic Driver Rewrite Project¶
The key insight: all open-source Linux driver source code is available. The hardware programming logic (register sequences, DMA setup, interrupt handling) is identical regardless of kernel API. Only the kernel-facing API surface changes.
AI-assisted translation pipeline:
Input: Linux driver C source code (GPL, ~500-5000 LOC typical)
|
v
Step 1: Parse Linux kernel API calls (kmalloc, dma_alloc_coherent,
request_irq, pci_read_config_*, etc.)
|
v
Step 2: Map to KABI equivalents (KernelServicesVTable methods)
|
v
Step 3: Translate C to Rust, preserving hardware-specific logic exactly
|
v
Step 4: Generate KABI driver entry point and vtable exchange
|
v
Output: Native Rust KABI driver
Human review: Verify hardware-specific sequences are preserved
Testing: Against real hardware + QEMU virtual devices
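The Step 2/3 output of this pipeline can be sketched as follows. All names here are illustrative assumptions (the mock `KernelServicesVTable` field and `driver_probe` are not the real generated surface, which comes from the .kabi IDL per Section 12.5); the sketch shows one Linux `dma_alloc_coherent()` call rendered as a KABI vtable call, with the hardware-facing constant preserved verbatim:

```rust
/// Mock of the kernel-provided services vtable a KABI driver receives at load.
/// The field name mirrors the Linux API it replaces; the signature is invented
/// for this sketch.
struct KernelServicesVTable {
    /// Hypothetical analogue of Linux dma_alloc_coherent(): returns
    /// (cpu_vaddr, device_dma_addr) for `len` bytes, or None on failure.
    dma_alloc_coherent: fn(len: usize) -> Option<(usize, u64)>,
}

/// Translated driver init. Original Linux C:
///   ring = dma_alloc_coherent(dev, RING_BYTES, &ring_dma, GFP_KERNEL);
fn driver_probe(ks: &KernelServicesVTable) -> Result<u64, &'static str> {
    const RING_BYTES: usize = 4096; // hardware ring size — preserved exactly
    let (_vaddr, dma) = (ks.dma_alloc_coherent)(RING_BYTES)
        .ok_or("ring allocation failed")?;
    Ok(dma) // DMA address the device registers will be programmed with
}

fn main() {
    // Host-side mock standing in for the kernel's implementation.
    let ks = KernelServicesVTable {
        dma_alloc_coherent: |len| Some((0x1000, 0x8000_0000 + len as u64)),
    };
    println!("ring dma addr = {:#x}", driver_probe(&ks).unwrap());
}
```

The hardware logic (sizes, register sequences) stays untouched; only the allocation call changes API surface, which is what the human-review step checks.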
24.1.3 Prioritized Driver List¶
These drivers cover approximately 95% of real hardware in server and desktop environments:
Priority 1 -- Cloud/VM (covers 100% of cloud deployments):
1. VirtIO block (virtio-blk)
2. VirtIO network (virtio-net)
3. VirtIO GPU (virtio-gpu)
4. VirtIO console (virtio-console)
Priority 2 -- Storage (covers 99% of bare-metal storage):
5. NVMe (universal modern SSD interface)
6. AHCI/SATA (legacy HDDs and older SSDs)
Priority 3 -- Networking (covers 90% of server NICs):
7. Intel e1000/e1000e (universal VM and consumer NIC)
8. Intel igb/ixgbe/ice (server 1G/10G/25G/100G)
9. Realtek r8169 (consumer Ethernet)
10. Mellanox mlx5 (high-performance datacenter)
Priority 4 -- Human Interface (covers desktop usability):
11. USB XHCI host controller (all modern USB)
12. USB EHCI host controller (USB 2.0 legacy)
13. USB HID (keyboard, mouse)
14. USB mass storage
15. Intel HDA audio
16. i915 (Intel integrated graphics, modesetting)
17. amdgpu (AMD graphics, modesetting)
18. UVC (USB Video Class) camera driver — Phase 4-5 implementation. Architecture fully specified in Section 13.16 (CameraDevice trait, ISP pipeline model, V4L2 compat, privacy enforcement). Printing is out of kernel scope (CUPS/IPP are pure userspace; Section 13.17).
Priority 5 -- Platform (covers system management):
19. ACPI subsystem
20. PCI/PCIe enumeration and configuration
21. IOMMU (Intel VT-d, AMD-Vi)
24.1.4 Nvidia / Proprietary Driver Strategy¶
For Nvidia (the most critical proprietary driver):
- Nvidia's driver already has a clean internal abstraction layer between their proprietary GPU core and the "kernel interface layer" (nvidia.ko)
- UmkaOS provides a KABI-native implementation of this kernel interface layer
- Nvidia's proprietary compute core links against our KABI implementation
- This is more sustainable than binary .ko compatibility: the interface layer is small, well-defined, and stable
Tier assignment: The Nvidia proprietary GPU compute core runs as a Tier 2 driver (Ring 3 process, IOMMU-isolated). It cannot access kernel memory or hardware registers directly. This is the correct security placement for closed-source proprietary code — a crash or exploit in the Nvidia blob cannot compromise the kernel or other processes.
- GPU command submission goes through the Tier 2 KABI ring protocol (Section 12.6).
- DMA buffers are mapped via the IOMMU through the Tier 2 DMA API (Section 4.14).
- For display output (modesetting), the Nvidia driver uses the DRM/KMS KABI interface (Section 21.5). Display operations are not latency-critical; Tier 2 crossing overhead is acceptable.
Signing and verification: The Nvidia proprietary blob is signed by Nvidia using ML-DSA-65. The UmkaOS module loader verifies this signature against an Nvidia vendor certificate embedded in the UmkaOS kernel image at distribution build time (not at runtime). Certificate chain: Nvidia Root CA → Nvidia Driver Signing Cert → module. The Root CA certificate is pinned — it cannot be rotated without a kernel update, preventing supply chain substitution attacks. Unsigned or improperly signed blobs are rejected at load time with ENOEXEC.
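The load-time check above can be sketched with hypothetical types. `Cert`, `verify_module`, and the `sig_ok` closure (standing in for ML-DSA-65 signature verification) are all illustrative, not the real loader API; only the control flow — pinned root, two-link chain, ENOEXEC on any failure — follows the text:

```rust
const ENOEXEC: i32 = 8;

// Hypothetical certificate record; the real loader parses full certificates.
struct Cert { subject: &'static str, issuer: &'static str }

/// Verify module → signing cert → pinned root. `sig_ok(signer, signee)` stands
/// in for one ML-DSA-65 verification.
fn verify_module(
    pinned_root: &Cert,
    chain: &[Cert], // [signing cert], expected to be issued by the pinned root
    sig_ok: impl Fn(&str, &str) -> bool,
) -> Result<(), i32> {
    let signer = chain.first().ok_or(ENOEXEC)?;
    // Chain must terminate at the pinned root — no runtime rotation.
    if signer.issuer != pinned_root.subject { return Err(ENOEXEC); }
    if !sig_ok(pinned_root.subject, signer.subject) { return Err(ENOEXEC); }
    if !sig_ok(signer.subject, "module") { return Err(ENOEXEC); }
    Ok(())
}

fn main() {
    let root = Cert { subject: "Nvidia Root CA", issuer: "Nvidia Root CA" };
    let chain = [Cert { subject: "Nvidia Driver Signing Cert", issuer: "Nvidia Root CA" }];
    assert!(verify_module(&root, &chain, |_, _| true).is_ok());
    // A chain rooted elsewhere is rejected regardless of signature validity.
    let rogue = [Cert { subject: "Other Cert", issuer: "Other Root" }];
    assert_eq!(verify_module(&root, &rogue, |_, _| true), Err(ENOEXEC));
}
```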
KABI interface layer: The in-kernel nvidia_kabi_shim (Tier 0) implements the
Linux nvidia.ko kernel interface surface that Nvidia's compute core links against,
translating calls to UmkaOS KABI operations. This shim is open-source (maintained by
the UmkaOS project); Nvidia's proprietary compute core remains closed-source and
requires no kernel patches.
Security boundary: The Tier 2 Nvidia process runs in its own user namespace with no capabilities. It communicates with the kernel exclusively via:
1. GPU command ring (KABI vtable, memory-mapped shared ring).
2. DMA API (buffer map/unmap; IOMMU-mediated; no arbitrary PA access).
3. DRM/KMS KABI (display; Tier 0 shim mediates all modesetting operations).
No raw MMIO access. No /dev/mem. No kernel symbol exports to the blob.
24.1.5 Community Incentive¶
The clean KABI SDK makes driver development significantly easier than Linux:
- No need to track unstable internal APIs
- Rust safety eliminates entire classes of bugs
- Binary compatibility across kernel versions eliminates recompilation burden
- Clear, documented interfaces reduce the learning curve
This lower barrier to entry is expected to attract contributors and vendors over time.
24.1.6 Standalone UmkaOS Peer Protocol Specification¶
The peer protocol is a single protocol used by all multikernel communication in UmkaOS — Tier M peers on a single host, distributed kernel nodes across hosts, and firmware shims on smart peripherals all speak the same protocol. The difference between these deployment modes is transport (Layer 0) and whether DSM coherence (Layer 3) is enabled — not the protocol itself.
| Aspect | Detail |
|---|---|
| Scope | Layer 1 wire protocol + Layer 0 transport binding appendices |
| Size | ~40-50 pages standalone document |
| Wire spec source | Section 5.1 |
| Target audience | Firmware engineers, SmartNIC/DPU teams, FPGA developers, embedded/IoT |
| Firmware shim effort | ~10-18K lines of C on existing RTOS, excluding cryptographic primitives already present in the firmware stack (Layers 0-1 only; a reference implementation will be published with measured line counts) |
| Timeline | Draft alongside Phase 3 Tier M demo |
Protocol stack:
Layer 3: DSM page coherence (optional — only for CPU-class peers doing shared memory)
Layer 2: Service messages (service-specific KABI vtables over ring buffers)
Layer 1: PEER PROTOCOL (membership, capabilities, crash recovery, ring transport)
Layer 0: Transport binding (PCIe BAR/MSI, RDMA verbs, CXL.mem, Ethernet+TCP)
Layer 1 is identical across all deployment modes. A SAS controller shim on PCIe, a Tier M GPU peer on CXL, and a DSM node over RDMA all implement the same Layer 1. DSM nodes additionally implement Layer 3 (page coherence) on top. Layer 0 is pluggable — the peer protocol sees "send message to ring, receive doorbell interrupt."
The standalone spec must be published separately from the 24-chapter kernel architecture so that implementers do not need to read the full kernel design.
Standalone document contents:
- Ring buffer layout — entry format, producer/consumer indices, doorbell coalescing
- Message types (~10: CLUSTER_JOIN, CLUSTER_LEAVE, CAP_ADVERTISE, CAP_WITHDRAW, SERVICE_REQUEST, SERVICE_RESPONSE, HEALTH_REPORT, CRASH_NOTIFY, FLR_REQUEST, PING)
- Capability negotiation — 3-way handshake: HELLO → CAP_ADVERTISE → ACK
- Crash and recovery — IOMMU lockout, bus master disable, FLR, rejoin sequence
- Transport binding appendices:
- Appendix A: PCIe — BAR mapping, MSI/MSI-X vector assignment, P2P DMA
- Appendix B: RDMA — RoCEv2 queue pairs, RDMA_SEND/RDMA_WRITE mapping
- Appendix C: CXL — CXL.mem shared region, CXL.cache coherence interaction
- Appendix D: USB — bulk transfer endpoints, interrupt polling fallback
- Appendix E: Ethernet+TCP — software transport for development/demo/IoT
The protocol is transport-agnostic. Any fabric that carries ring buffer messages and delivers doorbells (interrupt or polling) is a valid Layer 0. This means ANY device with a processor — from a SAS HBA with an ARM Cortex-R to an STM32 microcontroller on USB to a datacenter DPU on PCIe — can be an UmkaOS peer.
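The Layer 1 message set and ring transport above can be sketched as a wire discriminant plus a cache-line-sized ring entry. The u16 discriminant values and the `RingEntry` layout are illustrative assumptions; the authoritative encoding is the Section 5.1 wire spec:

```rust
// The ~10 Layer 1 message types listed above, as a wire discriminant.
// Numeric values are invented for this sketch.
#[allow(dead_code)]
#[repr(u16)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum MsgType {
    ClusterJoin = 1,
    ClusterLeave = 2,
    CapAdvertise = 3,
    CapWithdraw = 4,
    ServiceRequest = 5,
    ServiceResponse = 6,
    HealthReport = 7,
    CrashNotify = 8,
    FlrRequest = 9,
    Ping = 10,
}

/// Hypothetical ring entry: fixed header plus inline payload, padded so one
/// entry occupies exactly one 64-byte cache line.
#[allow(dead_code)]
#[repr(C)]
struct RingEntry {
    msg_type: u16,
    flags: u16,
    seq: u32,          // per-ring sequence number
    payload_len: u32,
    payload: [u8; 52], // 12-byte header + 52 = 64 bytes
}

fn main() {
    // Layout sanity check — the property a firmware shim in C would rely on.
    assert_eq!(std::mem::size_of::<RingEntry>(), 64);
    println!("PING discriminant = {}", MsgType::Ping as u16);
}
```

Because Layer 1 is identical across deployment modes, a firmware shim only needs this struct layout plus the doorbell contract of its Layer 0 binding.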
24.1.6.1 Tier M Adoption Roadmap¶
Tier M viability is first demonstrated via UmkaOS-controlled endpoints that require zero third-party vendor cooperation. Vendor firmware adoption is market-dependent and deferred to later phases.
| Phase | Tier M Milestone | Vendor Dependency |
|---|---|---|
| Phase 3 | Kintex-7 FPGA PCIe endpoint — peer protocol in firmware, simultaneously advertising network (Ethernet service) and storage (ephemeral block device backed by onboard DDR). Validates multi-service peers, CLUSTER_JOIN, capability negotiation, crash recovery, and service re-advertisement. Both sides fully controlled. | None |
| Phase 3 | UmkaOS-to-UmkaOS cluster peers over PCIe P2P | None (both sides controlled) |
| Phase 4 | DPU integration (BlueField, Pensando via DOCA SDK) | Cooperative (DOCA is public SDK) |
| Phase 4 | RDMA transport binding (RoCEv2 queue pairs) | None (standard verbs API) |
| Phase 5+ | SAS HBA firmware shim (Broadcom/Marvell) | Requires vendor cooperation |
| Phase 5+ | GPU firmware shim (NVIDIA, AMD) | Requires vendor NDA/engineering access |
| Phase 5+ | NIC firmware shim (Intel, Mellanox) | Requires vendor cooperation |
| Phase 5+ | NVMe SSD firmware shim | Requires vendor cooperation |
The Kintex-7 FPGA is the cleanest Tier M demonstration: the device has never heard of UmkaOS — it just speaks the peer protocol spec. The dual-service capability (network + storage from a single device) proves that Tier M peers are not constrained to one device class, which is the key architectural distinction from traditional driver models.
24.2 Implementation Phases¶
This section covers the implementation timeline for all features. The first part (Phases 1-5+) defines core kernel milestones. The Enhancement Feature Phasing and Future-Proof Feature Phasing tables below map additional features onto these same phases.
24.2.1.1.1 Subsystem Completeness Rule¶
Every subsystem touched in a phase is implemented completely per its architecture spec, or it is not started. There are no "basic" or "stub" versions that get extended in a later phase. The subsystem is the unit of completeness, not the demo.
Rationale: partial implementations create compounding technical debt — the Phase N scaffolding blocks proper Phase N+1 implementation, nobody can tell what works vs what's half-done, and the analysis cost of untangling partial state exceeds the cost of implementing correctly the first time. This is production engineering, not prototyping.
Concretely: if Phase 2 needs signals for busybox, we implement the full signal
subsystem (all 64 signals, sigaction, sigaltstack, per Section 8.5) —
not "4 signals now, 60 later." If Phase 2 needs procfs for ps, we implement
procfs completely (per Section 19.1) — not a stub.
Phase N+1 adds new subsystems. It never extends Phase N subsystems.
Exception: syscall dispatch table. The syscall table is an enumerated set of independent entries, not a monolithic subsystem. Adding syscall #300 doesn't change syscall #1.
Exception: Kernel Crypto API algorithm table. Like the syscall table, the algorithm
registry is an enumerated set of independent entries. Each algorithm is a separate
CryptoAlg registration that does not affect other algorithms. Phase 2 implements the
framework completely (template instantiation, algorithm lookup, crypto_alloc_*() API,
software fallback dispatch) plus essential algorithms (SHA-256, AES, CRC32c, ChaCha20).
Phase 3 registers the full algorithm table and hardware acceleration backends. The
FRAMEWORK is complete in Phase 2; the CATALOG grows in Phase 3.
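The framework-versus-catalog split can be sketched as data: one lookup mechanism, independently registered algorithms. `CryptoRegistry` and its methods are illustrative stand-ins for the crypto_alloc_*() machinery, not the real API:

```rust
use std::collections::HashMap;

// Illustrative algorithm descriptor; the real CryptoAlg carries operations.
struct CryptoAlg { name: &'static str, hw_accel: bool }

#[derive(Default)]
struct CryptoRegistry { algs: HashMap<&'static str, CryptoAlg> }

impl CryptoRegistry {
    /// One registration per algorithm; registering sm4 cannot affect sha256.
    fn register(&mut self, alg: CryptoAlg) { self.algs.insert(alg.name, alg); }
    /// crypto_alloc_*()-style lookup against whatever catalog is registered.
    fn lookup(&self, name: &str) -> Option<&CryptoAlg> { self.algs.get(name) }
}

fn main() {
    let mut reg = CryptoRegistry::default();
    // Phase 2 essential catalog, software-only:
    for name in ["sha256", "aes", "crc32c", "chacha20"] {
        reg.register(CryptoAlg { name, hw_accel: false });
    }
    assert!(reg.lookup("sha256").is_some());
    // A Phase 3 catalog entry is simply absent, not half-implemented:
    assert!(reg.lookup("sm4").is_none());
}
```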
However, syscalls form functional clusters that must ship together —
fork() without wait() leaks zombies, socket() without bind()/listen()/accept()
is useless, sigaction() without sigprocmask() breaks signal masking. The
completeness unit for syscalls is the cluster, not the individual entry.
Syscall cluster rule: a cluster is complete when all real programs that use any syscall in the cluster can use all of them correctly. Clusters are defined by functional dependency, not by arbitrary grouping:
| Cluster | Syscalls (representative, not exhaustive) | Ships in |
|---|---|---|
| Process lifecycle | fork, clone, execve, wait4, exit, exit_group, getpid, getppid, setpgid, setsid | Phase 2 |
| Signals | rt_sigaction, rt_sigprocmask, rt_sigreturn, rt_sigsuspend, kill, tgkill, sigaltstack (all 64 signals) | Phase 2 |
| File I/O | open, openat, close, read, write, lseek, dup, dup2, dup3, fcntl, fstat, fstatat, readlink | Phase 2 |
| Memory | mmap, munmap, mprotect, brk, madvise, mremap | Phase 2 |
| Directory | getdents64, mkdir, rmdir, chdir, getcwd, unlink, rename, chmod, chown, link, symlink, access | Phase 2 |
| Pipe/poll | pipe2, poll, ppoll, epoll_create1, epoll_ctl, epoll_wait, select | Phase 2 |
| Mount/FS | mount, umount2, statfs, fstatfs, sync, fsync | Phase 2 |
| Time | clock_gettime, gettimeofday, nanosleep, clock_nanosleep | Phase 2 |
| Identity | getuid, geteuid, getgid, getegid, setuid, setgid, setresuid, setresgid | Phase 2 |
| Resource | getrlimit, setrlimit, prlimit64, getrusage, uname, sysinfo | Phase 2 |
| Networking | socket, bind, listen, accept4, connect, send, recv, sendmsg, recvmsg, shutdown, setsockopt, getsockopt | Phase 3 |
| Namespaces | clone(CLONE_NEW*), setns, unshare | Phase 3 |
| io_uring | io_uring_setup, io_uring_enter, io_uring_register | Phase 3 |
| eBPF | bpf() (all subcommands) | Phase 3 |
| Advanced process | ptrace, prctl, seccomp | Phase 3 |
Phase 2 clusters total ~120+ syscalls (not "60" or "basic"). Phase 3 adds ~200+ more. Each cluster is fully tested before its phase exits.
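The cluster rule can be expressed as a checkable gate: a cluster passes only when every member syscall is implemented. The gate function is illustrative; the cluster contents below are the representative signal list from the table:

```rust
use std::collections::HashSet;

/// Phase-exit gate for one cluster: all-or-nothing, per the cluster rule.
fn cluster_complete(cluster: &[&str], implemented: &HashSet<&str>) -> bool {
    cluster.iter().all(|s| implemented.contains(s))
}

fn main() {
    let signals = ["rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
                   "rt_sigsuspend", "kill", "tgkill", "sigaltstack"];
    let mut implemented: HashSet<&str> = signals.iter().copied().collect();
    assert!(cluster_complete(&signals, &implemented));

    // Dropping one member fails the whole cluster — "6 of 7 signals syscalls"
    // does not ship.
    implemented.remove("sigaltstack");
    assert!(!cluster_complete(&signals, &implemented));
}
```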
24.2.1.1.2 Multi-Architecture Development Mandate¶
All phases develop and test on all 8 architectures from day one, not just x86-64. This is not a porting exercise deferred to Phase 5 — it is a code quality discipline that enforces proper separation between arch-specific and arch-independent code. If generic code only compiles and runs on x86-64, it is not actually generic, and fixing the hidden assumptions later requires rewriting core subsystems.
The rule: Every subsystem that ships in a phase must compile, boot, and pass its test suite on all 8 architectures in QEMU before the phase exits. x86-64 is the primary development and real-hardware target for Phases 1-4. AArch64 is the secondary real-hardware target from Phase 2 onward — Raspberry Pi 5 (BCM2712, Cortex-A76, no POE) and optionally Apple M1 (Icestorm/Firestorm, no POE but MTE available) provide non-emulatable validation of weak memory ordering, DMA cache coherency, GIC/SMMU, and DT-based PCIe. Other architectures are tested via QEMU CI. Phase 5a elevates remaining architectures to production quality on real hardware.
High-risk areas where x86-specific assumptions silently pollute generic code (identified by architecture risk analysis):
| Risk Area | x86 Behavior | Non-x86 Behavior | Consequence of x86-Only Testing |
|---|---|---|---|
| Memory ordering | TSO (strong, hides bugs) | Weak model (ARM, RISC-V) | Lock-free code silently corrupts data on ARM/RISC-V |
| DMA cache coherency | Always coherent (no-op sync) | Requires explicit cache flush (ARM SoC, RISC-V) | DMA data corruption on ARM without CCI |
| IOMMU | VT-d (single implementation) | SMMU v3 (ARM), RISC-V IOMMU, PPC IOMMU | Tier 2 drivers fail to probe on non-x86 |
| PCIe config access | ECAM via ACPI MCFG | Device tree ranges property (ARM, RISC-V) | Device discovery fails on non-x86 |
| Isolation | MPK (WRPKRU, unprivileged) | POE/page-table (ARM), DACR (ARMv7), segments (PPC), none (RISC-V) | Tier 1 code assumes MPK exists everywhere |
Mandatory abstractions (must be defined before Phase 2 drivers):
- IommuDomain trait: Generic IOMMU operations (map, unmap, flush, fault handler). Per-arch implementations: VT-d (x86), SMMU v3 (ARM), RISC-V IOMMU, PPC IOMMU. Without this, all IOMMU code is implicitly VT-d-specific.
- PcieConfigAccessor trait: Generic PCIe configuration space access. Per-arch implementations: ECAM via MCFG (x86), DT-based (ARM, RISC-V, PPC). Without this, device enumeration only works on x86.
- NumaDiscovery trait: Generic NUMA topology discovery. Per-arch implementations: ACPI SRAT/SLIT (x86), device tree numa-node-id (ARM, RISC-V, PPC). Without this, NUMA-aware allocation silently falls back to node 0 on non-x86.
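A host-side sketch of the first abstraction, with illustrative signatures (the bullet names only map, unmap, flush, and a fault handler, which is omitted here); `MockDomain` stands in for one per-arch backend such as VT-d or SMMU v3:

```rust
use std::collections::BTreeMap;

/// Generic IOMMU operations. Per-arch backends (VT-d, SMMU v3, RISC-V IOMMU,
/// PPC IOMMU) implement this so generic driver code is never VT-d-specific.
/// Signatures are illustrative, not the final trait.
trait IommuDomain {
    /// Map `len` bytes of physical memory at `pa` to device address `iova`.
    fn map(&mut self, iova: u64, pa: u64, len: usize) -> Result<(), &'static str>;
    fn unmap(&mut self, iova: u64, len: usize) -> Result<(), &'static str>;
    /// Invalidate the device-side IOTLB for this domain.
    fn flush(&mut self);
}

/// Mock backend used on the host for testing generic code against the trait.
struct MockDomain { mappings: BTreeMap<u64, (u64, usize)> }

impl IommuDomain for MockDomain {
    fn map(&mut self, iova: u64, pa: u64, len: usize) -> Result<(), &'static str> {
        if self.mappings.contains_key(&iova) { return Err("iova already mapped"); }
        self.mappings.insert(iova, (pa, len));
        Ok(())
    }
    fn unmap(&mut self, iova: u64, _len: usize) -> Result<(), &'static str> {
        self.mappings.remove(&iova).map(|_| ()).ok_or("no such mapping")
    }
    fn flush(&mut self) { /* IOTLB invalidate — no-op in the mock */ }
}

fn main() {
    let mut d = MockDomain { mappings: BTreeMap::new() };
    d.map(0x1000, 0x8000_0000, 4096).unwrap();
    assert!(d.map(0x1000, 0x9000_0000, 4096).is_err()); // double-map rejected
    d.flush();
    d.unmap(0x1000, 4096).unwrap();
}
```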
Per-phase multi-arch requirements are listed in each phase below.
24.2.1.1.3 QEMU Fidelity Matrix¶
QEMU is the primary CI vehicle for non-x86 architectures through Phase 4. Its TCG (Tiny Code Generator) mode is a functional emulator, not a cycle-accurate simulator. Several categories of real-silicon behavior are simplified or absent in QEMU, meaning bugs in these areas will not surface until real-hardware testing (Phase 5a, or earlier for x86-64 and AArch64 which have real-hardware targets in Phases 2-4). This matrix documents known divergences so that developers and reviewers treat QEMU-passing tests with appropriate skepticism in these areas.
See also Section 2.22 for per-architecture hardware capabilities and isolation mechanism availability.
| Category | Architecture(s) | QEMU Behavior | Real Silicon Behavior | Impact on Testing | Mitigation |
|---|---|---|---|---|---|
| Memory ordering | AArch64, ARMv7, RISC-V, PPC32, PPC64LE, LoongArch64 | TCG executes guest instructions sequentially on the host thread; store-buffer forwarding and reordering are not modeled. Effectively TSO or stronger. | Weak memory models (ARM: weakly-ordered with DMB/DSB barriers; RISC-V: RVWMO; PPC: very weak with lwsync/hwsync; LoongArch: weakly-ordered with DBAR). Out-of-order retirement, store buffers, and speculative loads produce reorderings that TCG cannot reproduce. | Lock-free algorithms, RCU, and seqlock code may pass all QEMU tests yet corrupt data on real hardware. This is the highest-risk divergence. | (1) x86-64 real-hardware testing catches TSO-compatible bugs. (2) AArch64 real-hardware testing (RPi 5 / Apple M1 from Phase 2) catches weak-ordering bugs. (3) All lock-free code is reviewed against the formal memory model per Section 3.1. (4) LKMM-style litmus test suite run under herd7 for all barrier-sensitive paths. |
| TLB shootdown latency | All | TLB invalidation (INVLPG, TLBI, sfence.vma) completes instantly — QEMU has no TLB structure to invalidate. IPI delivery for cross-CPU shootdowns is synchronous within the same QEMU process. | TLB shootdown requires IPI to remote cores, each of which must drain its pipeline, invalidate TLB entries, and acknowledge. Cost: 1-50 us depending on core count and NUMA topology. Batching and lazy invalidation strategies have measurable impact. | TLB-intensive paths (munmap, mprotect, fork COW, context switch ASID flush) will appear ~100x faster in QEMU. Performance regressions from naive shootdown strategies are invisible. | (1) Real-hardware x86-64 and AArch64 benchmarks gate Phase 3/4 exit. (2) Performance budget numbers are derived from real-silicon measurements, not QEMU. (3) TLB shootdown batching is designed per spec regardless of QEMU results. |
| DMA cache coherence | AArch64, ARMv7, RISC-V, PPC32, LoongArch64 | QEMU memory is always coherent — DMA reads/writes see the same view as CPU with no cache maintenance. dma_sync_* operations are no-ops. | Non-coherent SoCs (many ARM, all RISC-V without Svpbmt, LoongArch) require explicit cache clean/invalidate around DMA transfers. Missing a dma_sync_for_device() causes stale cache lines to overwrite DMA-written data. | DMA driver bugs (missing cache sync) are completely invisible in QEMU. Corruption manifests only on real non-coherent hardware. | (1) AArch64 real-hardware testing (RPi 5 is non-coherent for some peripherals) catches missing syncs from Phase 2. (2) DMA API (StreamingDmaMap, CoherentDmaBuf per Section 4.14) enforces sync at the type level — the API makes it hard to forget. (3) Static analysis flag for raw MMIO writes to DMA-mapped regions without intervening sync. |
| IOMMU fidelity | AArch64 (SMMUv3), RISC-V (IOMMU), PPC (IOMMU) | QEMU emulates basic IOMMU page-table walks and DMA remapping. Fault injection, ATS (Address Translation Services), PRI (Page Request Interface), and nested/stage-2 translation are partially implemented or absent. SMMU v3 HTTU (Hardware Table Update) is not modeled. | Full IOMMU implementations support ATS/PRI for device-side TLB, HTTU for dirty-bit tracking, nested translation for VM passthrough, and hardware-walked page tables with configurable granularity. Fault reporting is asynchronous via event queues. | Tier 2 driver isolation testing in QEMU validates basic DMA remapping but not ATS/PRI flows, nested translation, or IOMMU fault recovery paths. | (1) x86-64 VT-d is the best-emulated IOMMU in QEMU — Tier 2 regression tests run there. (2) AArch64 SMMU v3 testing on real hardware from Phase 3. (3) IOMMU fault injection test suite exercises error paths via synthetic fault generation independent of QEMU fidelity. |
| Interrupt controller timing | s390x (Adapter Interrupts), LoongArch64 (EIOINTC) | QEMU delivers interrupts synchronously at instruction boundaries. s390x adapter interrupt coalescing is simplified. LoongArch EIOINTC routing between nodes in multi-socket configurations is functional but untested for edge cases. | Real interrupt controllers have delivery latency (10-100 ns), coalescing windows, priority arbitration delays, and routing-table update propagation time. s390x QDIO adapter interrupts have specific timing contracts with channel programs. | Interrupt storm handling, coalescing tuning, and multi-socket routing bugs are invisible in QEMU. EIOINTC cross-node routing on real LoongArch multi-socket (3C5000 8-node) is untested. | (1) x86-64 APIC and AArch64 GICv3 are well-emulated and have real-hardware validation. (2) s390x and LoongArch64 require Phase 5a real-hardware validation for interrupt timing. (3) Interrupt coalescing parameters are configurable and default to conservative values. |
| PCIe configuration space | All non-x86 | QEMU PCIe config reads/writes complete in zero simulated time. Extended capabilities (AER, ACS, L1 PM Substates, SR-IOV) are partially emulated. Config retry status (CRS) is not modeled. | Real PCIe config access requires ECAM MMIO or type 1 configuration cycles with completion timeouts (10-100 us for CRS). Extended capability registers may have hardware-enforced write masks and side effects. Power management state transitions (D0/D3hot/D3cold) have real latency. | Driver probe timing, CRS retry logic, and power state transition handling are untested in QEMU. SR-IOV VF BAR sizing timing is simplified. | (1) x86-64 and AArch64 real-hardware testing validates PCIe probe paths. (2) PCIe config access uses the PcieConfigAccessor trait with per-arch timeout handling designed in regardless of QEMU behavior. (3) CRS retry is implemented per PCIe Base Spec with configurable timeout. |
| s390x Channel I/O | s390x | QEMU emulates basic CCW (Channel Command Word) chains, SSCH/TSCH/HSCH instructions, and virtio-ccw transport. Subchannel multiplexing, QDIO data queues, and FICON channel path failover are simplified. Concurrent I/O on multiple subchannels may not reflect real arbitration. | Real s390x channel subsystem supports 65,536 subchannels, hardware-managed I/O queuing, channel path failover (CHPID/SNID), and QDIO with hardware-assisted buffer management (SBAL/SBALE). I/O interrupts have specific priority and masking semantics tied to the PSW I/O mask bit. | Multi-subchannel I/O scheduling, channel path failover, and QDIO performance optimization cannot be validated in QEMU. Basic virtio-ccw transport and single-subchannel I/O are testable. | (1) s390x is Phase 5a for real-hardware production. (2) virtio-ccw transport (the primary QEMU I/O path) is well-emulated and sufficient for functional testing. (3) FICON/QDIO drivers require z/VM or LPAR testing. |
| Power management | All | QEMU does not model CPU power states (C-states, P-states), voltage/frequency scaling, or thermal throttling. MONITOR/MWAIT (x86), WFI/WFE (ARM) are treated as no-ops or simple halts. | Real CPUs have multi-level C-states with entry/exit latencies (1-1000 us), P-state transition delays, thermal throttling that reduces effective frequency, and DVFS governors that interact with the scheduler. | Runtime PM, cpufreq governor behavior, and thermal management are completely untested in QEMU. The scheduler's energy-aware scheduling path cannot be validated. | (1) x86-64 and AArch64 real-hardware testing validates power management from Phase 3 (power-aware scheduling is a Phase 3 subsystem). (2) Runtime PM state machine per Section 7.5 is designed and tested for correctness independently of actual power savings. |
Interpreting the matrix: QEMU testing validates functional correctness (correct register values, proper sequencing, ABI compatibility) but not performance characteristics, hardware timing, or weak-ordering behavior. Every "QEMU passes" result for a non-x86 architecture carries an implicit caveat for the categories above. The mitigation column documents how UmkaOS compensates — primarily through early real-hardware testing (x86-64 Phase 1+, AArch64 Phase 2+), type-level API enforcement, and formal memory-model verification.
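The type-level sync enforcement cited in the DMA mitigation column can be sketched as a typestate pair: only the device-owned state exposes the DMA address, so forgetting the cache sync before kicking the device is a compile error rather than silent corruption. The type names echo StreamingDmaMap's ownership states, but this state machine is an illustration, not the real Section 4.14 API:

```rust
// CPU owns the buffer: safe to read/write from the CPU, but the DMA address
// is deliberately not exposed — the device must not see it yet.
struct CpuOwned { dma_addr: u64 }

// Device owns the buffer: the only state from which the device can be
// programmed with the address.
struct DeviceOwned { dma_addr: u64 }

impl CpuOwned {
    /// On non-coherent arches, the cache clean (writeback) runs here.
    /// Consuming `self` makes "use after handoff" unrepresentable.
    fn sync_for_device(self) -> DeviceOwned {
        // arch-specific dma_sync_for_device() would execute here
        DeviceOwned { dma_addr: self.dma_addr }
    }
}

impl DeviceOwned {
    fn dma_addr(&self) -> u64 { self.dma_addr }
    /// Cache invalidate before the CPU reads device-written data.
    fn sync_for_cpu(self) -> CpuOwned {
        // arch-specific dma_sync_for_cpu() would execute here
        CpuOwned { dma_addr: self.dma_addr }
    }
}

fn main() {
    let buf = CpuOwned { dma_addr: 0x8000_0000 };
    // buf.dma_addr() would not compile — CpuOwned has no such method.
    let buf = buf.sync_for_device();
    assert_eq!(buf.dma_addr(), 0x8000_0000); // safe to hand to the device
    let _buf = buf.sync_for_cpu();           // and back, before CPU reads
}
```

Because the sync is a state transition rather than a side-effecting call, the "missing dma_sync" bug class that QEMU cannot surface is caught at build time on every architecture.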
24.2.2 Phase 1: Foundations¶
Goal: Boot to a hello-world program.
See Section 25.3 for detailed agentic workflow steps within this roadmap phase. Phase 1.1 and 1.2 (formerly separate x86-only and multi-arch phases) are merged — the spec now contains per-arch tables for every subsystem, making multi-arch support a compile-time configuration rather than a separate design phase.
Subsystems implemented (each complete per its architecture spec):
Boot and hardware discovery:
- Multi-arch boot chain: Multiboot1/2 (x86-64), DTB (AArch64/ARMv7/RISC-V/PPC), SBI (RISC-V), SLOF (PPC64LE) — per Section 2.1
- ACPI table parsing (x86-64): MADT, MCFG, DMAR, FADT — per Section 2.22. Device Tree parsing for all other architectures — per Section 2.8
- Clock framework: full per Section 2.24 — PLL, divider, gate, mux clock types with runtime rate discovery
- Hardware RNG detection: RDRAND (x86), RNDR (AArch64), Zkr (RISC-V) — seeds CSPRNG per Section 2.16
- CPU feature detection: per-arch capability discovery per Section 2.16 — isolation mechanism probing (MPK, POE, DACR, segment registers, Radix PID)
Concurrency primitives (foundation for all later subsystems):
- Spinlocks, Mutexes, RwLocks: full per Section 3.1 — including lock ordering enforcement, lockdep debug mode
- RCU: full non-preemptible model per Section 3.4 — grace period detection, rcu_read_lock/unlock, synchronize_rcu, call_rcu
- CpuLocal and PerCpu infrastructure: full per Section 3.1 — register-based CpuLocal (GS/TPIDR_EL1/tp per arch), PerCpu data areas
- IRQ chip and irqdomain hierarchy: full per Section 3.12 — IrqChip trait, IrqDomain, IrqTable, per-arch root domain (APIC/GIC/PLIC/OpenPIC)
- Workqueues: full per Section 3.11 — BoundedMpmcRing, named thread pools with backpressure, system-wide + per-CPU + unbound pools
Memory:
- Physical memory allocator: full per Section 4.1
(buddy allocator, NUMA-aware, zone-based, boot allocator → runtime transition)
- Slab allocator: full per Section 4.3 — per-CPU
magazines, size classes, NUMA-aware, GFP flags. Every kmalloc equivalent.
Scheduling and time:
- EEVDF scheduler: full core per Section 7.1
(virtual deadline, lag tracking, preemption, two-tree eligible/timeline design).
RT scheduling classes (SCHED_FIFO, SCHED_RR), deadline scheduling class
(SCHED_DEADLINE), CBS bandwidth enforcement, and EAS energy-aware scheduling
are separate subsystems added in Phase 3. Phase 1's EEVDF subsystem is
complete per its spec.
- Timekeeping: full per Section 7.8 — clock sources,
clockevents, timer wheel, hrtimers, vDSO for clock_gettime()
Security foundation:
- UmkaOS capability system: full per Section 9.1 — CapSpace, CapEntry, ObjectRegistry, capability creation/revocation/lookup
- PQC crypto abstraction: algorithm enum, variable-length signature fields per Section 9.6 — design-in only, no functional implementation yet
Isolation and drivers:
- Isolation domain infrastructure: full Tier 0/1/2 framework per Section 11.2 (MPK setup, domain allocation, IOMMU init on all architectures). No drivers use Tier 1 yet — infrastructure only.
- Tier 0 drivers: APIC/GIC/PLIC (per arch), timer, serial console — complete
- Device registry skeleton: DeviceRegistry struct with RwLock per Section 11.4 — populated in Phase 2
KABI and syscall infrastructure:
- KABI compiler: umka-kabi-gen per Section 12.5 — complete
- Syscall dispatch table: architecture per Section 19.1,
populated with execve + write + exit_group initially. Table structure is final;
later phases add entries, never change the dispatch mechanism.
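The "entries are independent" property of the dispatch table can be sketched as a fixed-size handler array; adding entry #300 in a later phase cannot perturb entry #1. The three populated syscall numbers are the x86-64 Linux ABI values named above; the table size, handler type, and stub bodies are illustrative:

```rust
/// Handler signature: six raw syscall arguments in, one return value out.
type SyscallFn = fn(args: &[u64; 6]) -> i64;

const NR_SYSCALLS: usize = 512; // illustrative table size
const ENOSYS: i64 = -38;        // Linux convention: negative errno

// Stub handlers standing in for the real Phase 1 implementations.
fn sys_write(_a: &[u64; 6]) -> i64 { 0 }
fn sys_execve(_a: &[u64; 6]) -> i64 { 0 }
fn sys_exit_group(_a: &[u64; 6]) -> i64 { 0 }
fn sys_ni(_a: &[u64; 6]) -> i64 { ENOSYS } // unpopulated entry

fn build_table() -> [SyscallFn; NR_SYSCALLS] {
    let mut t: [SyscallFn; NR_SYSCALLS] = [sys_ni; NR_SYSCALLS];
    t[1] = sys_write;        // x86-64: write = 1
    t[59] = sys_execve;      // x86-64: execve = 59
    t[231] = sys_exit_group; // x86-64: exit_group = 231
    t
}

/// The dispatch mechanism itself — final in Phase 1, never changed later.
fn dispatch(t: &[SyscallFn; NR_SYSCALLS], nr: usize, args: &[u64; 6]) -> i64 {
    t.get(nr).map_or(ENOSYS, |f| f(args))
}

fn main() {
    let t = build_table();
    assert_eq!(dispatch(&t, 1, &[0; 6]), 0);        // populated entry
    assert_eq!(dispatch(&t, 300, &[0; 6]), ENOSYS); // later-phase entry
}
```

Later phases only overwrite `sys_ni` slots with real handlers; the dispatch path is untouched, which is exactly why this table is exempt from the subsystem completeness rule.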
Build and test:
- Build system + CI/CD: Cargo workspace, linker scripts, QEMU boot tests — complete
- Formal verification readiness: spec annotations, design contracts per Section 24.4 — design-in only
Multi-arch (Phase 1): Hello-world runs on all 8 architectures in QEMU. Per-arch isolation mechanism probed and reported (MPK on x86, DACR on ARMv7, segments on PPC32, Radix PID on PPC64LE, page-table fallback on AArch64, "Tier 1 unavailable" on RISC-V, "Tier 1 unavailable" on s390x, "Tier 1 unavailable" on LoongArch64). IrqChip/IrqDomain hierarchy validated on all arches (APIC, GIC, PLIC, OpenPIC). Concurrency primitives stress-tested on ARM64 and RISC-V QEMU (weak memory model targets) — not just x86.
Exit criteria: A statically linked 'Hello, world!' ELF binary runs on UmkaOS in QEMU
(all 8 architectures). The KABI compiler parses a .kabi IDL file and generates Rust/C
stubs that compile. Lock ordering violations detected by lockdep. RCU grace periods
complete correctly under stress. Isolation domain probe succeeds on all arches.
24.2.3 Phase 2: Self-Hosting Shell + Tier 1 Fault Recovery¶
Goal: Run a busybox shell with basic utilities. Demonstrate Tier 1 driver crash recovery.
See Section 25.3 (Phase 2.1: Essential Drivers), Section 25.3 (Phase 2.2: Linux Compatibility Layer), and Section 25.3 (Phase 2.3: Networking Stack) for detailed agentic workflow steps within this roadmap phase.
New subsystems added (each complete per its architecture spec):
Memory (extended):
- Virtual memory manager: full per Section 4.15 —
mmap, brk, munmap, page fault handler, COW, demand paging, mprotect, madvise
- Page cache: full per Section 4.4 — readahead,
writeback, per-inode page tree. All VFS read/write goes through page cache.
- DMA subsystem: full per Section 4.14 — DmaDevice trait,
CoherentDmaBuf, StreamingDmaMap, SWIOTLB fallback, per-arch cache coherency
VFS and pseudo-filesystems:
- VFS layer: full per Section 14.1 — mount table, path resolution,
file descriptor table, dentry cache, inode cache, superblock operations
- tmpfs: full per Section 19.1 — size limits, POSIX semantics
- devtmpfs: full — /dev/null, /dev/zero, /dev/random, /dev/urandom, /dev/console,
auto-populated from device registry
- devpts: full — PTY allocation filesystem, required for terminal emulation
- initramfs (cpio): full — extraction, switchroot
- procfs: full per Section 19.1 — all standard entries
(/proc/[pid]/status, /proc/meminfo, /proc/cpuinfo, /proc/self, etc.)
- sysfs: full per Section 19.1 — device/driver/class hierarchy
- Pipes and FIFOs: full per Section 14.17 — O_NONBLOCK,
PIPE_BUF atomicity, splice between pipes
- File locking: full per Section 14.14 — flock(), POSIX fcntl()
locks, lock conflict detection
Process and credentials:
- Process management: full per Section 8.1 —
fork/clone/execve/wait/exit/exit_group, process groups, sessions,
resource limits (getrlimit/setrlimit/prlimit64)
- Signal handling: full per Section 8.5 —
all 64 signals, sigaction, sigaltstack, sigprocmask, signal queuing, RT signals,
per-arch signal frame layouts (all 8 architectures)
- Credential model and Linux capabilities: full per Section 9.9 —
uid/gid, supplementary groups, setuid/setgid/setresuid/setresgid,
capget/capset, capability bounding set, securebits. Required for busybox su,
login, and any setuid binary.
Crypto and entropy:
- Kernel Crypto API: basic algorithms per Section 10.1 —
SHA-256, AES, CRC32c (ext4 checksums), ChaCha20 (CSPRNG). Full algorithm table
deferred to Phase 3; Phase 2 implements the framework + essential algorithms.
- CSPRNG and getrandom(): full — hardware RNG seeding (Phase 1), ChaCha20-based CSPRNG,
getrandom() syscall with blocking/non-blocking modes. Required by glibc init, ASLR,
stack canaries.
Synchronization and special file descriptors:
- futex: full per Section 19.4 — FUTEX_WAIT, FUTEX_WAKE,
FUTEX_WAIT_BITSET, FUTEX_PI (priority inheritance), FUTEX_REQUEUE.
Required by glibc/pthreads (every multithreaded program depends on futex).
- eventfd: full per Section 19.10 — semaphore mode,
non-blocking, EFD_CLOEXEC. Required by systemd sd-event loop.
- signalfd: full per Section 19.10 — synchronous
signal delivery via file descriptor. Required by systemd PID 1.
- timerfd: full per Section 19.10 — CLOCK_MONOTONIC,
CLOCK_REALTIME, TFD_TIMER_ABSTIME. Required by systemd timer management.
- ioctl framework: full — dispatch table, per-subsystem ioctl handlers (terminal, block device)
Device infrastructure:
- PCIe enumeration and configuration: full per Section 11.4 —
BAR discovery, MSI-X vector allocation, configuration space access. Required before any
PCIe device driver (VirtIO-blk is PCI). Includes PcieConfigAccessor trait with
per-arch implementations: ECAM via ACPI MCFG (x86-64), DT-based ranges (AArch64,
ARMv7, RISC-V, PPC).
- IOMMU abstraction: IommuDomain trait with per-arch implementations — VT-d (x86-64),
SMMU v3 (AArch64), DACR-based (ARMv7), PPC IOMMU, stub (RISC-V). Generic operations:
map, unmap, flush, fault handler. All Tier 1/2 driver code uses the trait, never
arch-specific IOMMU registers directly.
- NUMA discovery abstraction: NumaDiscovery trait — ACPI SRAT/SLIT (x86-64), device
tree numa-node-id (AArch64, RISC-V, PPC). Single-node fallback when no topology
information is available.
- Device registry and bus management: full per Section 11.4 —
device/driver matching, probe sequencing, sysfs population
- IPC architecture: full per Section 11.8 — Tier 1 driver
communication with core via domain ring buffers
- Zero-copy I/O path: full per Section 11.7 — block driver
fast path, ring buffer entry reuse
Block storage:
- Block I/O layer: full per Section 15.1 — bio submission, completion, merge, I/O scheduler
- VirtIO-blk driver (Tier 1): full per Section 15.5 — VirtIO-blk Tier 1 KABI driver. Virtqueue layout (descriptor table, available ring, used ring), feature negotiation (VIRTIO_BLK_F_SEG_MAX, F_SIZE_MAX, F_BLK_SIZE, F_FLUSH, F_TOPOLOGY, F_MQ, F_DISCARD), I/O request format (VirtioBlkReq: type + ioprio + sector + data + status), flush semantics, crash recovery path.
KABI and recovery:
- KABI runtime: full — service registry, module loader per Section 12.7
- Crash recovery: full per Section 11.9 — fault detection, IOMMU revoke, FLR reset, driver reload, I/O resume
Observability (design-in):
- Unified Object Namespace: infrastructure per Section 20.5 —
umkafs mount point, basic object registration. Full population in Phase 3-4.
- Stable Tracepoints: framework per Section 20.2 —
tracepoint macros, ring buffer, basic trace_pipe interface. Needed for debugging.
Syscall clusters (Phase 2 adds ~120+ syscalls to the dispatch table):
| Cluster | Syscalls (representative, not exhaustive) |
|---|---|
| Process lifecycle | fork, clone, execve, wait4, exit, exit_group, getpid, getppid, setpgid, setsid |
| Signals | rt_sigaction, rt_sigprocmask, rt_sigreturn, rt_sigsuspend, kill, tgkill, sigaltstack |
| File I/O | open, openat, close, read, write, lseek, dup, dup2, dup3, fcntl, fstat, fstatat, readlink |
| Memory | mmap, munmap, mprotect, brk, madvise, mremap |
| Directory | getdents64, mkdir, rmdir, chdir, getcwd, unlink, rename, chmod, chown, link, symlink, access |
| Pipe/poll | pipe2, poll, ppoll, epoll_create1, epoll_ctl, epoll_wait, select |
| Mount/FS | mount, umount2, statfs, fstatfs, sync, fsync |
| Time | clock_gettime, gettimeofday, nanosleep, clock_nanosleep |
| Identity | getuid, geteuid, getgid, getegid, setuid, setgid, setresuid, setresgid, capget, capset |
| Resource | getrlimit, setrlimit, prlimit64, getrusage, uname, sysinfo, getrandom |
| Sync objects | futex, eventfd2, signalfd4, timerfd_create, timerfd_settime, timerfd_gettime |
| File locking | flock, fcntl(F_SETLK/F_GETLK) |
Multi-arch (Phase 2): Busybox shell boots on all 8 architectures in QEMU. VirtIO-blk driver (Tier 1) loads and serves I/O on all arches — validates IommuDomain trait (map/unmap paths only; fault handler validation requires real IOMMU, Phase 4), PcieConfigAccessor, and DMA cache coherency paths on ARM64 and RISC-V (not just x86). DMA streaming mappings stress-tested on ARM64 QEMU (non-coherent path exercised). RISC-V and PPC IOMMU testing uses software-emulated IOMMU paths (SWIOTLB fallback) in QEMU. Full hardware IOMMU validation requires real hardware, targeted for Phase 4. Signal frame layout tested on all arches (per-arch signal delivery is a common divergence point).
Exit criteria: Busybox shell boots, ls, cat, echo, ps, mount, uname work
(on all 8 architectures in QEMU; x86-64 is primary). A multithreaded C program
(pthreads) runs correctly (futex works). VirtIO-blk driver survives injected fault with
I/O resumption demonstrated end-to-end. All implemented subsystems pass their full test
suites — no "works for the demo" exceptions.
24.2.4 Phase 3: Real Workloads + Tier M Peer Demo¶
Goal: Boot systemd, run Docker containers. Demonstrate Tier M PCIe peer.
See Section 25.3 (Phase 3.1: Storage Stack) and Section 25.3 (Phase 3.2: Advanced Features) for detailed agentic workflow steps within this roadmap phase.
What systemd needs: AF_UNIX sockets (D-Bus), inotify (file monitoring), signalfd/eventfd/timerfd (already Phase 2), cgroups v2 (resource control), namespaces (service isolation), sysctl (/proc/sys/*), netlink (udev device events), pidfd (race-free process management), seccomp-bpf (sandboxing), credentials/capabilities (already Phase 2).
What Docker needs: overlayfs (container image layers), namespaces (mount, PID, net, user, UTS, IPC, cgroup, time), cgroups (resource limits), seccomp-bpf (syscall filtering). For the Phase 3 demo, Docker uses `--network=host` (shares the host network namespace). Full Docker networking (bridge, veth, NAT, conntrack, nftables) ships in Phase 4 alongside Kubernetes.
New subsystems added (each complete per its architecture spec):
Storage:
- NVMe driver (Tier 1): full per Section 15.19 + KABI spec —
admin queue, I/O queues, interrupt coalescing, crash recovery
- ext4 filesystem: full per Section 15.6 — read-write,
journaling (JBD2), fsck, extent trees, delayed allocation, inline data
- I/O scheduling and priority: full per Section 15.1 —
cgroup io controller weight, BFQ-style proportional share. Required for fio benchmarks.
Network stack:
- TCP/IP and UDP: full per Section 16.1 — socket API,
congestion control (Cubic, BBR), NAPI, NetBuf zero-copy, TCP state machine, UDP, ICMP
- AF_UNIX sockets: full per Section 16.3 — stream, datagram,
SCM_RIGHTS (fd passing), SCM_CREDENTIALS, abstract namespace. Critical: D-Bus
(systemd's IPC) is built on AF_UNIX. systemd cannot start PID 1 without it.
- Loopback interface: full — lo device, 127.0.0.1/::1 routing. Required for
localhost services, iperf3 loopback tests.
- ARP and NDP: full per Section 16.1 — L2/L3 address resolution for
IPv4 (ARP) and IPv6 (NDP). Required for any Ethernet communication.
- NIC drivers (VirtIO-net, e1000): full per KABI spec, Tier 1
USB subsystem (required for real hardware demo — keyboard, storage, serial):
- USB XHCI host controller driver (Tier 1): full per Section 13.12 — XHCI ring-based command/transfer/event architecture, USB 2.0/3.0 device enumeration, hub support, MSI-X interrupts, crash recovery via KABI reload
- USB HID class driver: full per Section 13.12 — keyboard, mouse, basic gamepad. Input events routed to an evdev-compatible interface (minimal evdev path for Phase 3; full input subsystem in Phase 4). Critical: the real hardware demo requires a USB keyboard for interactive use.
- USB mass storage class driver: full per Section 13.12 — bulk-only transport (BOT), SCSI command translation, USB flash drives. Enables loading Docker images and test data from USB storage on real hardware.
- USB CDC ACM (serial) class driver: full — USB-to-serial adapters. Common for development boards, serial consoles on headless servers, and debug access.
Network stack (continued):
- Netlink compat layer: full per Section 19.5 — NETLINK_ROUTE
(ip command, systemd-networkd), NETLINK_KOBJECT_UEVENT (udev device events),
NETLINK_NETFILTER (basic conntrack query). Required for systemd device management.
- Routing table: full per Section 16.6 —
FIB lookup, policy routing, default gateway
VFS extensions:
- overlayfs: full per Section 14.8 — upper/lower/work
directories, whiteout handling, metacopy, redirect_dir. Critical: Docker's primary
storage driver for container image layers.
- inotify: full per Section 14.13 — watch descriptors,
event coalescing, per-user limits, IN_MODIFY/IN_CREATE/IN_DELETE/IN_MOVED_*,
/proc/sys/fs/inotify/* sysctls. Critical: systemd uses inotify extensively for
monitoring /run, /etc, unit files.
- fanotify: full per Section 14.13 — permission events,
FAN_CLASS_CONTENT/FAN_CLASS_NOTIF, pre-allocated event ring. Shipped with inotify
per the Subsystem Completeness Rule (both are Section 14.13).
Containers and isolation:
- Namespaces: all 8 types per Section 19.1 — mount, PID,
net, user, UTS, IPC, cgroup, time. Includes clone(CLONE_NEW*), setns, unshare,
/proc/[pid]/ns/* inodes.
- Cgroups v2: full per Section 19.1 — cpu, memory, io, pids
controllers, cgroupfs pseudo-filesystem, delegation, v1 compat shim. Required for both
systemd (resource management) and Docker (container limits).
- POSIX IPC: full per Section 17.3 — SysV shared memory,
semaphores, message queues, POSIX mqueues. IPC namespace isolates these per container.
Scheduling (extended):
- RT and deadline scheduling: full per Section 7.1 — SCHED_FIFO, SCHED_RR, SCHED_DEADLINE (CBS), bandwidth throttling, priority inheritance
- EAS (Energy-Aware Scheduling): full per Section 7.2 — capacity-aware placement, energy model, big.LITTLE/Intel hybrid support
- CPU bandwidth guarantees: full per Section 7.6 — CFS bandwidth, RT bandwidth. Required by the cgroups cpu controller.
- Power budgeting: full per Section 7.7 — RAPL/SCMI reading, per-cgroup power budgets in watts, multi-domain enforcement
Security (extended):
- seccomp-bpf: full per Section 10.3 — filter
installation, syscall interception, SECCOMP_RET_* actions, SECCOMP_IOCTL_NOTIF_*.
Required by Docker/runc for container syscall filtering.
- Kernel Crypto API (full): remaining algorithms per Section 10.1 —
full algorithm table, template instantiation, hardware acceleration registration
- Verified boot: framework and Ed25519 verifier per Section 9.3 —
secure boot chain, kernel image signature verification (Ed25519; hybrid
Ed25519+ML-DSA-65 added in Phase 4 when PQC algorithms ship), IMA measurement
list. The boot verification framework is algorithm-agnostic; Phase 4 adds
ML-DSA-65 and SLH-DSA as additional BOOT_VERIFY_TABLE entries.
Compat and special interfaces:
- io_uring: full per Section 19.3 — SQ/CQ rings,
all Phase 3 opcodes (read/write/fsync/poll/accept/connect/send/recv), fixed files/buffers
- eBPF: full verifier + JIT (x86-64, AArch64) + core map types per
Section 19.2 — hash/array/ringbuf maps,
bpf() syscall (all subcommands), program attachment points
- TTY/PTY: full per Section 21.1 — line discipline,
job control, TIOCGWINSZ/TIOCSWINSZ, session leader, controlling terminal
- pidfd: full per Section 19.10 — pidfd_open,
pidfd_send_signal, pidfd_getfd, CLONE_PIDFD. Used by systemd 250+ for race-free
process management.
- Sysctl / kernel parameter store: full per Section 20.9 —
/proc/sys/* entries, sysctl() syscall, namespace-aware parameters. Required by
systemd (sysctl.conf), Docker (network tuning).
- sendfile/splice/tee: full per Section 14.17 —
zero-copy file-to-socket, pipe-to-pipe, file-to-pipe transfers
Memory (extended):
- Memory compression tier: full per Section 4.12 — zswap/zram, compression algorithms (LZ4, ZSTD), writeback to swap
Observability (extended):
- Unified Object Namespace: full population per Section 20.5 — all kernel objects registered and accessible via umkafs
- Fault Management Architecture: basic per Section 20.1 — health telemetry for NVMe/NIC drivers, rule-based diagnosis for crash recovery events
Distributed (Tier M):
- Tier M peer transport: full per Section 11.1 — Kintex-7 FPGA PCIe endpoint as the primary Tier M validation device. The FPGA implements the peer protocol in firmware and simultaneously advertises both network (Ethernet service) and storage (ephemeral block device backed by onboard DDR) capabilities — demonstrating multi-service Tier M peers. Protocol exercised: CLUSTER_JOIN, capability advertisement for both services, workload delegation, crash recovery with service re-advertisement
- Peer protocol wire specification: full per Section 5.1 — JoinRequest/JoinAccept, heartbeat, ClusterMessageHeader, session key derivation. Only the PCIe P2P transport is exercised; the RDMA transport is deferred to Phases 4-5.
Syscall clusters (Phase 3 adds ~200+ syscalls):
| Cluster | Syscalls (representative, not exhaustive) |
|---|---|
| Networking | socket, bind, listen, accept4, connect, send, recv, sendmsg, recvmsg, shutdown, setsockopt, getsockopt, sendfile |
| Unix sockets | socket(AF_UNIX), socketpair, sendmsg(SCM_RIGHTS), recvmsg(SCM_CREDENTIALS) |
| Namespaces | clone(CLONE_NEW*), setns, unshare, pidfd_open, pidfd_send_signal |
| io_uring | io_uring_setup, io_uring_enter, io_uring_register |
| eBPF | bpf() (all subcommands) |
| Advanced process | ptrace, prctl, seccomp, waitid, clone3 |
| inotify/fanotify | inotify_init1, inotify_add_watch, inotify_rm_watch, fanotify_init, fanotify_mark |
| IPC | shmget, shmat, shmdt, shmctl, semget, semop, semctl, msgget, msgsnd, msgrcv, msgctl, mq_open, mq_send, mq_receive |
Multi-arch (Phase 3): systemd boot tested on x86-64 (primary) and AArch64 (both QEMU and real hardware — RPi 5 and/or Apple M1). AArch64 real hardware validates weak memory model paths, DMA non-coherent paths, GIC interrupt routing, SMMU IOMMU, and DT-based PCIe that QEMU cannot faithfully emulate. eBPF JIT validated on both x86-64 and AArch64. All 8 architectures pass the Phase 2 busybox test suite plus Phase 3 networking (TCP loopback, AF_UNIX). EAS capacity model validated on AArch64 QEMU with big.LITTLE CPU topology. USB XHCI tested on x86-64 and AArch64 real hardware; VirtIO-input used on other arches in QEMU (same evdev path).
Exit criteria: Ubuntu minimal boots with systemd (PID 1 → multi-user target →
login prompt) on x86-64. Docker runs hello-world container (pre-loaded image,
--network=host). iperf3 TCP loopback benchmark completes. fio NVMe random
read/write benchmark completes with I/O scheduling. USB keyboard works on real
hardware (x86-64 and AArch64). Tier M peer device demonstrates capability negotiation
and crash recovery over PCIe. AArch64 passes systemd boot on both QEMU and real
hardware (RPi 5). eBPF: XDP basic operations (XDP_DROP, XDP_PASS, XDP_TX,
XDP_REDIRECT) and tc classifier attachment pass at a 95% rate. The full Cilium
connectivity test suite is deferred to Phase 4 (requires conntrack, IPVS, overlay networking).
All subsystems pass their full test suites on all architectures.
24.2.4.1.1 Cgroup v2 Detection Surface (Required for Docker/runc v2 Mode)¶
The following procfs/sysfs entries must return correct v2-format data in Phase 3. Without these, Docker/runc may fall back to cgroup v1, which is deferred to Phase 4.
| Path | Required Content | Verified By |
|---|---|---|
| `/sys/fs/cgroup/cgroup.controllers` | Space-separated list of available controllers (`cpu io memory pids`) | `runc spec --rootless` |
| `/sys/fs/cgroup/cgroup.subtree_control` | Space-separated list of enabled controllers | `docker info` |
| `/proc/self/cgroup` | `0::/path` format (unified v2 hierarchy, hierarchy ID 0) | `cat /proc/self/cgroup` in container |
| `/proc/cgroups` | Empty or absent (v1 controllers not enumerated) | runc v2 detection logic |
| `/sys/fs/cgroup` mount | `cgroup2` filesystem type | `findmnt -t cgroup2` |
Integration test: `docker run --rm hello-world` must succeed with the cgroup v2 driver
(verified via `docker info`, which must report Cgroup Driver `cgroupfs` or `systemd`).
24.2.4.2 First Public Demo¶
The first public demonstration is a single unified demo at Phase 3 exit, showing three pillars in sequence:
1. Boot unmodified Ubuntu minimal with systemd (credibility anchor): UmkaOS boots → systemd PID 1 → multi-user target → login prompt → USB keyboard works. Then: `/bin/sh` → `ls`, `cat`, `ps`, `uname`, `docker run hello-world`. QEMU first, then real hardware (same kernel binary, USB keyboard + NVMe + NIC).
2. Tier 1 driver fault recovery (operational shock): run `fio` against NVMe → inject driver fault → Linux comparison: panic. UmkaOS: brief stall, driver reloads, I/O resumes. "The problem Unix never solved, fixed."
3. PCIe peer device (architecture shock): Kintex-7 FPGA endpoint auto-detected → capability registry shows both network and storage services → traffic delegated to FPGA NIC, ephemeral block device mounted → kill FPGA (reset) → IOMMU lockout → FPGA reboots → CLUSTER_JOIN → services re-advertised → I/O resumes. The FPGA runs only the peer protocol firmware (no UmkaOS kernel on the peripheral) — demonstrating that any device speaking the spec can be a multi-service Tier M peer.
This is one demo, not three. Every subsystem it touches is complete and final.
24.2.5 Phase 4: Production Ready¶
Goal: Drop-in replacement for server and cloud workloads. Full Kubernetes, KVM virtualization, real hardware boot, LTP conformance.
See Section 25.3 (Phase 4.1: Consumer Hardware) for detailed agentic workflow steps within this roadmap phase.
What Kubernetes needs (beyond Phase 3 Docker): IPVS (kube-proxy default backend), full connection tracking (conntrack), nftables rule engine, veth pairs (pod networking), software bridge (CNI), VLAN (overlay networking), VXLAN/Geneve (CNI plugins like Calico/Flannel), AF_VSOCK (VM-based pods via Kata Containers).
What real hardware boot needs (Phase 3): AHCI/SATA driver (legacy disks), real NIC drivers (Intel e1000e/i210, Mellanox mlx5), IPMI (server management), RTC (hardware clock), watchdog (server reliability), NVMEM (MAC addresses, calibration data), I2C/SMBus (sensor access, IPMI/BMC communication, EDID for displays). Phase 3 already provides USB XHCI + HID + mass storage + serial for keyboard/storage/debug.
New subsystems added (each complete per its architecture spec):
Virtualization:
- KVM hypervisor: full per Section 18.1 —
/dev/kvm, VMX/EPT (x86-64), VHE (AArch64), QEMU/Firecracker support, virtio-mmio
passthrough, vcpu scheduling integration with EEVDF
- VFIO and iommufd: full per Section 18.5 —
device passthrough to VMs, VFIO groups, iommufd descriptors, PCI device assignment,
interrupt remapping. Required for GPU passthrough, SR-IOV, Firecracker device model.
- AF_VSOCK: full per Section 16.24 — host-guest socket
communication, virtio-vsock transport, SOCK_STREAM/SOCK_DGRAM. Required for
Kata Containers and Firecracker guest agents.
- Suspend and resume: full per Section 18.4 —
S3 (suspend-to-RAM), S4 (hibernate), device state save/restore sequencing, PM notifier
chains. Server use: IPMI-triggered suspend, UPS-coordinated hibernate.
Network (extended):
- Netfilter/nftables: full per Section 16.18 —
connection tracking (conntrack), NAT/masquerade, nftables rule engine, iptables compat
layer. Enables Docker bridge networking, Kubernetes service routing.
- Virtual network devices: veth pairs, software bridge, VLAN, macvlan per
Section 16.16. Enables full Docker/K8s
pod networking.
- Network overlay and tunneling: VXLAN, Geneve per
Section 16.16. Required by Kubernetes CNI
plugins (Calico, Flannel, Cilium). GRE/IP-in-IP for legacy tunnels.
- Traffic control and queue disciplines: full per
Section 16.21 — qdisc
framework, HTB (hierarchical token bucket), PFIFO, RED, netem. Required for K8s
bandwidth limits (kubernetes.io/ingress-bandwidth annotation), network QoS.
- IPsec and XFRM: full per Section 16.22 —
transform database, SA/SP lookup, ESP/AH, IKEv2 key management integration. Required
for site-to-site VPN, K8s encrypted pod networking (Calico IPsec mode).
- IPVS: full per Section 16.30 — connection-based load
balancing, NAT/DR/TUN modes, persistence, health checking. Default backend for
kube-proxy in IPVS mode — required for production Kubernetes.
- Network service provider: full per Section 16.31 —
capability service for network operations over peer protocol
Storage (extended):
- dm/LVM: full per Section 15.2 — dm-linear, dm-crypt,
dm-thin, dm-snapshot. Required for many real-world storage configurations (Ubuntu/Fedora
default to LVM root).
- AHCI/SATA driver (Tier 1): full per Section 15.4 —
AHCI controller and SATA disk driver. AHCI link power management, hot-plug, NCQ,
error recovery. AhciPort struct (command list, received FIS, port registers), FIS types
(Register H2D, DMA Setup, PIO Setup, Data, BIST, Set Device Bits), command slot
management (32-slot command header array), NCQ support (tag mapping), ATAPI passthrough
for optical drives. Tier 1 KABI driver, Phase 3.
- Block storage networking: full per Section 15.13 —
iSCSI initiator (RFC 7143), NVMe/TCP (NVMe-oF), iSER (iSCSI over RDMA). Enterprise
SAN connectivity. Required for cloud instances with remote block storage.
- NFS client (full): full per Section 15.14 — NFSv4.1/4.2,
SunRPC, RPCSEC_GSS (Kerberos), delegation, pNFS layouts, state recovery, lease renewal,
and multi-server failover. Phases 1-3 use local disk boot only (initramfs → VirtIO-blk
or NVMe root); NFS root mount is not a Phase 2-3 gate. Phase 4 implements the complete
NFS client from scratch — no partial "NFS root only" stub exists in earlier phases.
- Disk quotas: full per Section 14.15 — user/group/project
quotas, grace periods, quota files, enforcement at block allocation. Required for
multi-user servers.
- Persistent memory: full per Section 15.16 — DAX
(direct access, bypasses page cache), MAP_SYNC, CLWB fencing, PMEM block device,
filesystem DAX (fsdax). For NVDIMM and CXL memory-class devices.
- ZFS integration: full per Section 15.10 — KABI bridge
to OpenZFS, avoiding GPL/CDDL license conflict. Pool import/export, scrub, send/recv.
Distributed (extended):
- DLM: full per Section 15.15 — RDMA-native lock acquisition (atomic CAS), lease-based extension, per-resource recovery, batch operations, deadlock detection (5s timeout). Required for clustered filesystems and multi-node coordination.
- RDMA transport (Mode B): full per Section 5.4 — RoCEv2 queue pairs, RDMA Send/Recv for messages, RDMA Write for bulk data, RDMA atomic CAS for one-sided locking. Enables high-performance multi-node clusters (2-3 µs uncontested lock vs 10-100 µs over TCP).
- Multi-node cluster membership: full per Section 5.2 — Raft-based quorum, node join/leave/eviction, split-brain protection, leader election. Extends the Phase 3 two-node PCIe model to N-node network clusters.
- SmartNIC/DPU offload: full per Section 5.11 — offload criteria evaluation, DPU discovery via CapAdvertise, automatic service migration (network stack → DPU), fallback on DPU crash. For Nvidia BlueField, AMD Pensando, Intel IPU.
- Affinity-based service placement: full per Section 5.12 — ServiceAffinity rules, three-pass placement algorithm, hysteresis to prevent flapping. Used by SmartNIC offload and multi-node workload placement.
- Topology reasoning engine: full per Section 5.2 — TopologyQuery API, constraint solver, cached results with generation tags. Foundation for placement decisions.
Security (extended):
- LSM framework: full per Section 9.8 — SELinux policy
engine, AppArmor profiles, hook dispatch, LsmBlob per-object storage, stacking support.
Required for Fedora (SELinux mandatory) and Ubuntu (AppArmor default).
- TPM runtime services: full per Section 9.4 — TPM 2.0
command transport, PCR extend/read/quote, seal/unseal, attestation, HMAC sessions.
Required for measured boot, systemd-cryptenroll, remote attestation.
- Kernel key retention service: full per
Section 10.2 — keyrings (session,
process, user, persistent), key types (user, logon, asymmetric, encrypted, trusted),
key lifecycle, garbage collection, user-namespace-aware keyrings.
- Confidential computing (host): full per Section 9.7 —
SEV-SNP (AMD), TDX (Intel), CCA (ARM) VM management. Secure page table management,
attestation flow, migration restrictions. Requires KVM (this phase).
- PQC algorithm implementations: full per Section 9.6 —
ML-KEM-768/1024 (key encapsulation), ML-DSA-65 (signatures), hybrid X25519+ML-KEM mode.
Phase 1 provided abstractions; Phase 4 implements the algorithms for driver signing
and secure boot.
- EVM (Extended Verification Module): full per
Section 9.5 — HKDF-SHA3-256 key derivation, protected xattr
HMAC, IMA interaction, evm_mode boot parameter.
Scheduling (extended):
- Intent-based resource management: full per
Section 7.10 — intent cgroup knobs
(cpu.intent, memory.intent, io.intent), PD optimizer (SCHED_IDLE background
thread), workload classification, auto-tuning feedback loop.
- Core provisioning and workload partitioning: full per
Section 7.11 — LL/CG/Backfill
core classes, cpu.provision_count cgroup knob, gang scheduling (MCP mode), OS noise
elimination on CG cores (<1 µs/sec), backfill preemption (10 µs max). For HPC, latency-
sensitive workloads, and DPDK-style poll-mode applications.
Observability (extended):
- Fault Management Architecture (full): full per
Section 20.1 — health telemetry
for all driver families (not just NVMe/NIC), rule-based diagnosis, automated repair
actions, fault escalation chains. Phase 3 provided basic NVMe/NIC health; Phase 4
covers all Tier 1/2 drivers.
- perf_events / PMU: full per Section 20.8 —
perf_event_open() syscall, hardware PMU counters (cycles, instructions, cache misses,
branch mispredictions), software events, sampling, perf tool support, BPF program
attachment to perf events.
- EDAC: full per Section 20.6 — memory ECC error reporting,
per-DIMM error counters, correctable/uncorrectable classification, MCE integration,
CMCI (Corrected Machine Check Interrupt). Required for server reliability monitoring.
- pstore: full per Section 20.7 — ramoops (RAM-backed
persistent storage), NVRAM logging, console/ftrace/pmsg frontends, coredump capture
to persistent storage. Critical for post-crash debugging on real hardware.
- Debugging and process inspection: full per
Section 20.4 — ptrace()
(ATTACH, PEEK/POKE, SINGLESTEP, GETREGSET/SETREGSET, SEIZE), core dumps, /proc/[pid]/mem,
/proc/[pid]/maps, gdbserver support. Required for strace, gdb, and many LTP tests.
VFS (extended):
- autofs: full per Section 14.10 — mount trigger protocol,
userspace daemon communication, direct/indirect/offset mounts, expiry. Used by
systemd .automount units and NFS automounting.
- configfs: full per Section 14.12 — kernel object configuration
filesystem, show/store attribute callbacks, groups, default groups, drop. Used for
runtime configuration of USB gadgets, target iSCSI, DLM, and VFIO mdev.
- binfmt_misc: full per Section 14.9 — arbitrary binary format
registration via magic/extension matching, interpreter invocation. Required for QEMU
user-mode emulation (multi-arch containers), Java, Wine.
- NFS server (nfsd): full per Section 15.12 — NFSv4.1/4.2
export table, RPC dispatch, state management, lease recovery. For NAS/file server
deployments.
Device frameworks (infrastructure for real hardware drivers):
- I2C/SMBus bus framework: full per Section 13.13 —
I2C adapter/client model, SMBus protocol, userspace /dev/i2c-* access, device tree
binding. Required for: IPMI/BMC communication, EDID (display identification), sensor
chips (hwmon), touchpads (I2C-HID), EEPROMs.
- SPI bus framework: full per Section 13.20 —
SPI master/slave, DMA support, chip select, clock mode. Required for NOR flash, some
sensors, embedded peripherals.
- USB subsystem (extended): Phase 3 provided XHCI + HID + mass storage + serial. Phase 4
adds USB audio class driver (via ALSA framework), USB video class (UVC), and remaining
class drivers per Section 13.12.
- IPMI: full per Section 13.23 — IPMI 2.0 command
transport (KCS, BT, SSIF), sensor data records, system event log, watchdog integration.
Standard on all servers; required for out-of-band management.
- Hardware watchdog: full per Section 13.19 —
WatchdogDevice trait, timeout management, pre-timeout actions (NMI, SCI), panic-on-
timeout policy. Required for server reliability (systemd-watchdog, keepalived).
- RTC subsystem: full per Section 13.28 — RtcDevice
trait, full Linux ioctl table, alarms, Y2K38-safe u64 timestamps. Required for
hardware clock sync, hwclock, and systemd-timesyncd fallback.
- NVMEM: full per Section 13.25 — non-volatile memory
framework for MAC addresses, calibration data, OTP fuses, serial numbers. Required
for real NICs (MAC address), real SoCs (calibration).
- UIO: full per Section 13.24 — userspace I/O framework,
device mmap, interrupt delivery to userspace. Required for DPDK (non-VFIO mode),
legacy industrial I/O devices.
User I/O frameworks (infrastructure — drivers come in Phase 5):
- DRM/KMS core: full per Section 21.5 — DRM device model, KMS modesetting API (CRTC, encoder, connector, plane), GEM buffer management, atomic modesetting, framebuffer console. Phase 4 provides the framework; Phase 5 adds GPU-specific drivers (i915, amdgpu).
- ALSA core: full per Section 21.4 — PCM playback/capture, mixer controls, ALSA ioctl interface, jack detection, sample rate conversion. Phase 4 provides the framework; Phase 5 adds codec drivers (Intel HDA, USB Audio).
- Input subsystem (evdev): full per Section 21.3 — input event device model, EV_KEY/EV_REL/EV_ABS events, force feedback, multitouch protocol. Phase 4 provides the framework; Phase 5 adds touchpad/tablet drivers.
Compat (extended):
- Safe kernel extensibility: full per Section 19.9 — policy vtable traits, module lifecycle, domain-isolated extensibility points (scheduler class, congestion control, LSM). Enables third-party kernel modules with crash containment.
- Live kernel evolution: full per Section 13.18 — Theseus-inspired state export/import, atomic component swap, post-swap watchdog with a 5-second timer, HMAC integrity tags on serialized state. Includes KABI service live replacement (Section 13.18) with incremental state export, multikernel rolling deployment, and a driver tier promotion protocol. Post-evolution behavioral health monitoring (Section 13.18): a configurable soak period (60-300s) compares FMA health metrics against the pre-evolution baseline, alerting on sustained degradation (forward-only, no automatic rollback). Enables zero-downtime kernel updates for long-running server workloads and fast agentic development cycles (Section 25.17).
Quality and packaging:
- LTP conformance: Linux Test Project suite passing (>95% of applicable tests). Non-
applicable tests: those requiring kernel features explicitly deferred to Phase 5
(e.g., GPU-specific ioctls, WiFi nl80211, nested KVM). Each exclusion documented
with rationale.
- Agentic driver rewrite: top-20 Linux driver families ported to KABI via AI-assisted
translation. Families prioritized by server/cloud frequency: virtio-, e1000e/i210,
mlx5, nvme, ahci, xhci, i2c-, hwmon, ipmi, rtc, watchdog, dm-*, raid, iscsi,
nvme-tcp, bridge, veth, tun/tap, vhost, vfio.
- Crash recovery testing: full Tier 1/2 fault injection across all Phase 4 driver
families. Fault types: MMIO read/write errors, DMA completion timeout, interrupt
storm, device reset failure, partial initialization crash. Recovery SLA: Tier 1
reload <150ms, Tier 2 restart <10ms.
- Performance tuning: reach within 5% of Linux on target benchmarks —
nginx (HTTP throughput), fio (storage IOPS), iperf3 (network bandwidth),
sysbench (CPU/memory/mutex), pgbench (database), redis-benchmark (in-memory).
- Package: .deb (Ubuntu 24.04+) and .rpm (Fedora 40+) packages. Installable via
apt/dnf, GRUB menu entry auto-configured, dual-boot with Linux supported.
Syscall clusters (Phase 4 adds ~80-100 syscalls):
| Cluster | Syscalls (representative, not exhaustive) |
|---|---|
| KVM | ioctl(KVM_CREATE_VM, KVM_CREATE_VCPU, KVM_RUN, KVM_SET_USER_MEMORY_REGION, KVM_GET/SET_REGS) |
| VFIO | ioctl(VFIO_GET_API_VERSION, VFIO_GROUP_SET_CONTAINER, VFIO_DEVICE_GET_INFO, VFIO_DEVICE_SET_IRQS) |
| Netfilter | setsockopt(IP_TABLES), nfnetlink socket family, conntrack via netlink |
| Quota | quotactl, quotactl_fd |
| Key management | add_key, request_key, keyctl |
| Perf | perf_event_open, ioctl(PERF_EVENT_IOC_*) |
| ptrace | ptrace(ATTACH, PEEK, POKE, GETREGSET, SETREGSET, SEIZE, INTERRUPT) |
| Misc | personality, kcmp, membarrier, rseq, close_range, openat2, statx, copy_file_range |
Multi-arch (Phase 4): All Phase 4 subsystems compile and pass unit tests on all 8 architectures. KVM validated on x86-64 (VMX/EPT) and AArch64 QEMU (VHE). Netfilter/conntrack stress-tested on AArch64 (weak memory model paths in conntrack hash tables). IOMMU domain operations validated on ARM SMMU v3 in QEMU. eBPF JIT produces correct code on x86-64, AArch64, and RISC-V. LTP run on both x86-64 (real hardware) and AArch64 (QEMU) — pass rate may differ but regressions are investigated.
Exit criteria: UmkaOS boots unmodified Ubuntu 24.04 and Fedora 40 on real x86-64 hardware (not just QEMU). Runs Docker + Kubernetes single-node with full bridge networking (veth + bridge + NAT + IPVS). KVM boots a guest VM with device passthrough (VFIO). LTP passes >95% of applicable tests on x86-64 and >90% on AArch64 QEMU. Performance within 5% of Linux on all target benchmarks. All Tier 1/2 drivers survive fault injection with recovery demonstrated end-to-end.
24.2.6 Phase 5: Ecosystem and Platform Maturity¶
Goal: Broad adoption — multi-architecture production support, consumer hardware, advanced distributed computing, HPC acceleration, vendor partnerships.
See Section 25.3 (Phase 5.1: Windows Emulation Acceleration) for detailed agentic workflow steps within this roadmap phase.
Phase 5 is organized into sub-phases. Sub-phases are parallel workstreams, not sequential gates — teams can work on 5b (consumer hardware) and 5c (distributed) concurrently. Each sub-phase has its own exit criteria.
Spec depth note: Phase 5 items are specified at full architectural depth in their respective chapters — data structures, interfaces, and algorithms are defined for design completeness. Implementation is deferred to after Phase 4 exit. Sections in other chapters that define Phase 5 data structures carry an explicit deferral note (e.g., "Phase 5 — data structures defined here for design completeness; implementation deferred"). This ensures agents and reviewers do not mistake Phase 5 specifications for current implementation targets.
24.2.6.1 Phase 5a: Multi-Architecture Production¶
Goal: All 8 architectures reach production quality with full Tier 1 driver isolation.
- AArch64: full Tier 1 isolation via POE (ARMv9.4-A+) or page-table fallback per Section 11.2. Production-quality GIC, timer, SMMU v3 drivers.
- RISC-V 64: Tier 1 runs as Tier 0 (in-kernel) until ISA adds fast isolation primitives per Section 11.2. Full PLIC, SBI, Sv48 support. Tier 2 (Ring 3 + IOMMU) available for untrusted drivers.
- PPC32: full Tier 1 isolation via segment registers per Section 11.2. Embedded PowerPC support (Freescale/NXP e500/e6500).
- PPC64LE: full Tier 1 isolation via Radix PID on POWER9+ per Section 11.2. IBM POWER server support with XIVE interrupts, OPAL firmware interface.
- ARMv7: full Tier 1 isolation via DACR. Embedded ARM support (Cortex-A7/A15/A17).
- s390x: Tier 1 runs as Tier 0 (Storage Keys too coarse for fast domain isolation) per Section 11.2. Full PSW-swap interrupt subsystem, SCLP console, Channel I/O (CCW/QDIO), virtio-ccw transport, SIGP SMP. z/VM and LPAR support.
- LoongArch64: Tier 1 runs as Tier 0 (no hardware isolation mechanism) per Section 11.2. Full EIOINTC interrupt controller, Stable Counter timer, hybrid TLB (software refill 3A5000 / hardware PTW 3A6000), PCIe IOMMU.
Exit criteria: All 8 architectures boot on QEMU and pass the full Phase 4 LTP suite with no regressions. Tier 1 driver isolation exercised on each architecture (POE on AArch64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE; RISC-V, s390x, and LoongArch64 confirmed Tier 1 unavailable, Tier 0/Tier 2 placement validated). AArch64 real hardware (RPi 5, optionally Apple M1) passes the full Phase 4 LTP suite alongside x86-64.
24.2.6.2 Phase 5b: Consumer Hardware¶
Consumer hardware support enables UmkaOS as a desktop/laptop OS. This sub-phase provides the kernel-side infrastructure; userspace (desktop environments, package managers) is out of scope.
Wireless and connectivity:
- WiFi (nl80211): full per Section 13.15 — nl80211 cfg80211 interface, WPA3/SAE,
  802.11ax (WiFi 6), scan/connect/roam. Drivers: Intel iwlwifi, Realtek rtw89,
  Qualcomm ath11k, Mediatek mt76, Broadcom brcmfmac.
- Bluetooth: full per Section 13.14 — HCI transport (USB, UART), L2CAP, RFCOMM, HID
  (input devices), A2DP (audio routing to ALSA), LE (Low Energy). Drivers: Intel,
  Realtek, Qualcomm, Broadcom.
- rfkill: full per Section 13.21 — RF kill switch framework, per-device
  enable/disable, sysfs interface, input event integration.
Audio and display:
- Audio drivers: Intel HDA (codec driver via ALSA framework from Phase 4), USB Audio
  Class, SoundWire per Section 13.26. PipeWire/PulseAudio integration (userspace, no
  kernel changes beyond ALSA).
- Graphics drivers: Intel i915 modesetting (DRM/KMS driver using Phase 4 framework),
  AMD amdgpu modesetting, VESA/EFI framebuffer fallback. Phase 5b provides modesetting
  only; 3D acceleration in Phase 5e.
- Multi-monitor: DRM/KMS atomic modesetting with hotplug detection, EDID parsing (via
  I2C framework from Phase 4), DisplayPort MST.
Input devices:
- Touchpad: I2C-HID driver (using I2C + evdev frameworks from Phase 4), PS/2
  Synaptics, multitouch gestures via evdev MT protocol.
- Keyboard: USB HID (Phase 3), PS/2 AT keyboard, multimedia keys via evdev.
Platform management:
- Suspend/resume (consumer): S3 (suspend-to-RAM), S0ix (Modern Standby) per Section
  18.4. Device state save/restore for all consumer drivers. Wake-on-LAN, wake-on-USB,
  lid switch handling.
- Power profiles: performance/balanced/battery-saver modes via power budgeting
  framework (Section 7.7). Per-app power attribution via cgroup energy accounting.
- Regulator framework: full per Section 13.27 — voltage voting model,
  RegulatorConsumer RAII, SoC PMIC support.
- MTD: full per Section 13.22 — raw flash access, bad block management, partition
  tables. Required for embedded boot media, SPI NOR flash.
Connectivity (extended):
- Thunderbolt 3/4 and USB4: device tunneling, PCIe-over-Thunderbolt, security levels.
  Thunderbolt/USB4 requires a future Thunderbolt framework section (security
  authorization levels, PCIe tunneling, DisplayPort Alt Mode, daisy-chain topology).
  Full spec deferred to Phase 5b. Spec: Phase 4 — KABI driver using USB4 tunneling
  protocol. Architecture in Section 13.12.
- eMMC and SD card: MMC framework, SDHCI driver. eMMC/SD requires a future MMC
  framework (CMD class support, UHS-I/II timing, eMMC 5.1 HS400 mode, partition
  management). Spec deferred to Phase 5. Spec: Phase 4 — KABI driver using SD/MMC
  protocol. Architecture in Section 13.1.
Desktop / laptop performance targets:
| Metric | Target |
|---|---|
| Kernel boot (bootloader → login screen) | < 5 seconds |
| Resume from S3 suspend | < 2 seconds |
| Resume from S4 hibernate | < 10 seconds |
| Idle power (WiFi on, display on) | Match or exceed Ubuntu 24.04 |
| Video playback (1080p H.264) | Hardware decode; CPU < 5% |
Validation: Side-by-side battery life comparison with Ubuntu 24.04 on identical hardware (Speedometer + video stream benchmark). 100+ beta testers running UmkaOS as daily driver for 30-day soak; collect crash dumps, performance traces, battery stats.
Exit criteria: UmkaOS boots on 3+ common Intel/AMD laptops (ThinkPad, XPS, Framework) with WiFi, Bluetooth, touchpad, audio, and display working. S3 suspend/resume cycles without regression. Battery life within 10% of Ubuntu 24.04.
24.2.6.3 Phase 5c: Advanced Distributed and HPC¶
Goal: Multi-node production clusters, DSM coherence, HPC acceleration.
Distributed shared memory (DSM):
- DSM: full per Section 6.2 — MOESI-like page coherence protocol, wire format,
  home-node management, subscriber-controlled caching with DLM integration, vector
  clock causal consistency, anti-entropy for relaxed mode. Application-visible DSM
  with syscall interface and distributed futex.
- Clustered filesystems: full per Section 15.14 — GFS2 and OCFS2 support via DLM
  (Phase 4), journal-per-node, fencing integration. For high-availability shared
  storage (SAN, iSCSI).
HPC and acceleration:
- RDMA userspace verbs: full per Section 22.7 — libibverbs compat, ibverbs uAPI, queue
  pair management, memory registration, RDMA CM. Required for MPI (OpenMPI, MVAPICH2),
  NCCL (distributed ML training), UCX.
- GPU compute acceleration: full per Section 22.1 — AccelBase framework, GPU memory
  management (TTM/GEM), compute queue submission, shader dispatch. For OpenCL, CUDA
  (via KABI shim), ROCm (via KABI shim).
- Unified compute topology: full per Section 22.8 — multi-dimensional capacity
  profiles, cross-device energy model, advisory placement overlay. Enables
  heterogeneous scheduling across CPU, GPU, DPU, FPGA resources.
- Accelerator P2P DMA: full per Section 22.4 — GPU-to-GPU direct memory access,
  NVLink/xGMI interop, NUMA-aware accelerator memory, CXL fabric integration.
Inference and ML policy:
- In-kernel inference engine: full per
Section 22.6 — ONNX model loading,
tensor operations, NPU binding, accelerator-aware dispatch. Provides the inference
substrate for ML-driven kernel policies (closed-loop tuning, anomaly detection).
- ML policy framework: full per
Section 23.1 — closed-loop kernel
intelligence, policy cascade (heuristic → model → optimizer), observation channels,
PolicyService rate limiter, per-cgroup parameter overrides. Enables ML-driven scheduling,
memory management, and I/O optimization.
- Unified cgroup compute.weight: per
Section 22.8 — optional knob providing
orchestration layer over existing per-domain (CPU, GPU, RDMA) scheduling knobs.
Advanced networking:
- SCTP: full per Section 16.23 — multi-stream, multi-homing, message boundaries.
  Required for telecom signaling (SIGTRAN), some HPC messaging.
- Bonding/teaming: link aggregation (802.3ad LACP), active-backup, balance-rr.
  Required for server NIC redundancy. Link aggregation requires a future bonding
  section with LACP 802.3ad state machine, bonding mode enum (round-robin,
  active-backup, XOR, broadcast, 802.3ad, TLB, ALB), and netlink interface for bond
  management. Spec: Phase 4 — virtual NIC combining multiple physical NICs.
  Architecture in Section 16.1.
- XDP (eXpress Data Path): full per eBPF framework (Phase 3) — XDP_PASS, XDP_DROP,
  XDP_TX, XDP_REDIRECT at driver level. For line-rate packet processing, DDoS
  mitigation.
Peer kernel nodes:
- ClusterTransport unification: full per Section 22.8 — all ClusterTransport
  implementations (PCIe BAR, RDMA, CXL, USB, TCP, NVLink, HiperSockets)
  production-quality with full peer protocol conformance testing.
- Peer kernel nodes: full per Section 22.8 — devices with serious compute (DPUs, GPUs
  with dozens of ARM cores) run full UmkaOS instances. Vendor-driven adoption;
  architecture ready from Phase 3 peer protocol.
- Computational storage: full per Section 15.17 — NVMe Computational Programs command
  set, CSD as AccelDeviceClass, in-storage compute for database/analytics pushdown.
Exit criteria: 3+ node cluster runs with RDMA transport and DLM. DSM coherence demonstrated with a distributed database workload. GPU compute job runs via AccelBase. MPI hello-world completes over RDMA.
24.2.6.4 Phase 5d: Ecosystem Maturity¶
Goal: Vendor partnerships, distribution certification, community ecosystem.
- Vendor KABI drivers: Nvidia GPU driver (signed ML-DSA-65, Tier 2 isolated per Section 24.1), AMD GPU driver, Intel GPU driver. Each vendor ships a single binary driver for all UmkaOS versions via stable KABI ABI.
- Distribution certification: RHEL, Ubuntu, SUSE official support. Kernel package in
  distribution repositories. grubby/update-grub integration.
- Community driver SDK: comprehensive documentation, example drivers (null block,
  loopback NIC, stub GPU), mentorship program. SDK dual-licensed Apache-2.0 OR MIT.
- Nested virtualization: KVM-on-KVM per Section 18.1. Required for CI/CD (GitHub Actions, GitLab runners) and cloud providers.
- Live kernel upgrade: stop all Tier 1/2 drivers → swap core binary → restart drivers. Zero-downtime kernel updates for long-running server fleets. Uses live kernel evolution framework from Phase 4.
Exit criteria: At least one vendor (Nvidia, AMD, or Intel) ships a signed KABI GPU driver. UmkaOS kernel package accepted into at least one major distribution repository (RHEL, Ubuntu, or SUSE). Live kernel upgrade demonstrated end-to-end with zero downtime on a running workload. Nested KVM boots a guest VM.
24.2.6.5 Phase 5e: Gaming and Creative¶
Goal: Support gaming and content creation workloads.
- Vulkan drivers: Mesa RADV (AMD), Intel ANV via DRM/KMS framework (Phase 4). Full 3D acceleration, Vulkan 1.3+ conformance.
- Steam + Proton: Proton/Wine game compatibility layer. Requires WEA (Section 19.6) for Windows syscall translation.
- Windows Emulation Acceleration (WEA): full per Section 19.6 — WINE integration, PE loader hooks, Windows syscall fast-path translation, D3D-to-Vulkan shader pipeline offload.
- GPU video encode/decode: VA-API and V4L2 stateless codec framework. Hardware acceleration for H.264, H.265, VP9, AV1.
Exit criteria: Steam launches, Proton runs top-100 Steam Deck verified games at native-or-better performance vs Linux. Video playback uses hardware decode (CPU <5% for 1080p H.264).
Phase 5 overall exit criteria: All sub-phase (5a-5e) exit criteria met. No regressions in Phase 1-4 test suites (LTP, KVM, Docker/K8s, fault injection). All live evolution primitives validated end-to-end (hot-swap, attestation, crash recovery and watchdog reload). Kernel self-update demonstrated on a production workload without downtime.
24.2.7 Adoption Story: From Drivers to Distributed¶
The phases above define engineering milestones. The adoption story — how UmkaOS delivers value to users and vendors at each stage — maps onto them:
Stage 1: "Better Linux" (Phases 2-3) — UmkaOS boots with Tier 0/1/2 KABI drivers, runs Docker and systemd. Value proposition: crash-recoverable drivers + stable binary ABI. No vendor cooperation required. No firmware changes. Every Linux-supported device works via ported KABI drivers. This is what gets users.
Stage 2: "Self-Describing Devices" (Phase 3-4) — Firmware teams add an 8-12K line C shim to their existing RTOS (see Section 24.1). The device becomes self-describing, crash-recoverable without host involvement, and the vendor can stop maintaining per-OS host-side drivers. The incentive is reduced maintenance burden — a cost saving, not a favor. The existing Tier 1 KABI driver continues working alongside the shim; the vendor can test and cut over at their own pace.
Stage 3: "Peer Kernels" (Phase 5+) — Devices with serious compute (DPUs, GPUs, smart NICs with dozens of ARM cores) run full UmkaOS instances. The peer protocol is already proven at the shim level. DSM coherence is enabled for workloads that benefit (HPC, distributed databases). This is the long-term vision but it gates nothing — every previous stage delivers standalone value.
Each stage builds on proven infrastructure from the previous stage. No stage requires speculative industry cooperation. The friction for Stage 2 is genuinely low: "add a small protocol library to firmware you already ship, and you can stop maintaining host-side drivers for every OS."
24.2.8 Licensing Summary¶
| Component | IP Source | Risk |
|---|---|---|
| Confidential computing (TEE) | Hardware vendor specs (AMD SEV, Intel TDX, ARM CCA), all public | None |
| Post-quantum crypto | NIST standards (FIPS 203, 204, 205), public domain algorithms | None |
| Power budgeting | RAPL (Intel public spec), SCMI (ARM public spec), original design | None |
| Hardware memory safety | ARM MTE (public ISA), Intel LAM (public ISA) | None |
| Formal verification | Verus (MIT license), RustBelt (academic, published) | None |
| Safe extensibility | Original design (extends existing KABI vtable model) | None |
| Live kernel evolution | Theseus OS concepts (academic, published, Rice University) | None |
| Intent-based management | Original design, optimization theory (academic) | None |
| Real-time guarantees | PREEMPT_RT concepts (GPLv2, Linux mainlined), CBS (academic) | Medium — see note below |
| SmartNIC/DPU offload | Original design (extends existing peer model + capability service providers) | None |
| Persistent memory | DAX/PMEM specifications (SNIA, public), Linux interfaces (facts) | None |
| Computational storage | NVMe Computational Programs Command Set and Subsystem Local Memory Command Set (public, NVMe consortium, January 2024) | None |
| Unified compute model | Original design (extends existing AccelBase + EAS models) | None |
All components are either original design, based on published academic research, based on public hardware specifications, or based on NIST/industry standards. No vendor-proprietary APIs or patented algorithms.
PREEMPT_RT derivative risk: PREEMPT_RT is GPLv2 and was merged into Linux mainline (v6.12). Any UmkaOS real-time code derived from PREEMPT_RT implementation (as opposed to the general concepts of preemptible kernels, threaded interrupts, and priority inheritance) could carry GPLv2 obligations that conflict with OKLF's additional permissions. UmkaOS's RT implementation MUST be a clean-room design based on published academic literature (priority inheritance protocols: Sha, Rajkumar, Lehoczky 1990; CBS: Abeni and Buttazzo 1998; LITMUS-RT: Brandenburg 2011) and public OS design textbooks, not derived from Linux PREEMPT_RT source code. Code review must verify no Linux-derived lock conversion patterns, interrupt threading structures, or RT-specific scheduler modifications are copied.
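The clean-room baseline for RT is the published protocol itself. As a reference point, a minimal sketch of the priority-inheritance rule from the cited literature (Sha, Rajkumar, Lehoczky 1990); all type and function names here are hypothetical, not UmkaOS API:

```rust
// Illustrative only: the classic priority-inheritance rule, not UmkaOS code.
// A lock holder temporarily runs at the highest priority of any waiter,
// bounding priority inversion to the length of the critical section.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Prio(u8); // higher number = higher priority

struct Task {
    base_prio: Prio,
    boosted_prio: Option<Prio>, // set while holding a contended lock
}

impl Task {
    fn effective_prio(&self) -> Prio {
        self.boosted_prio.map_or(self.base_prio, |b| b.max(self.base_prio))
    }
}

/// Apply priority inheritance: boost `holder` to the maximum effective
/// priority among `waiters`, if that exceeds its own.
fn inherit_priority(holder: &mut Task, waiters: &[Task]) {
    if let Some(max_waiter) = waiters.iter().map(|t| t.effective_prio()).max() {
        if max_waiter > holder.effective_prio() {
            holder.boosted_prio = Some(max_waiter);
        }
    }
}

/// On lock release, the boost is dropped and the task returns to base priority.
fn release_lock(holder: &mut Task) {
    holder.boosted_prio = None;
}

fn main() {
    let mut low = Task { base_prio: Prio(1), boosted_prio: None };
    let waiters = [Task { base_prio: Prio(9), boosted_prio: None }];
    inherit_priority(&mut low, &waiters);
    assert_eq!(low.effective_prio(), Prio(9)); // boosted while holding the lock
    release_lock(&mut low);
    assert_eq!(low.effective_prio(), Prio(1)); // back to base after release
}
```

This sketch captures only the basic inheritance rule; transitive inheritance chains and the priority-ceiling variant are covered in the cited papers.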
24.2.9 Performance Impact Summary¶
Every feature in this document was evaluated against the constraint: "Does this make UmkaOS measurably slower than Linux on the same workload?"
| Feature | Hot-Path Impact vs Linux | Justification |
|---|---|---|
| Confidential computing | 0% (same hardware, same cost) | Hardware AES engine, identical to Linux |
| Post-quantum crypto | 0% (cold-path only) | Boot/driver-load only. ML-DSA-44 verify comparable to Ed25519; ML-DSA-65 verify ~100-200 µs (cold-path only, not on hot paths) |
| Power budgeting | 0.015% (MSR reads at tick) | 600ns per 4ms tick. Invisible in any benchmark. Per-task EAS overhead: see Section 24.4 |
| Hardware memory safety | 0% vs Linux when enabled | Same MTE instructions, same hardware cost. Tag RAM overhead: 3.125% of DRAM (ARM MTE only) |
| Formal verification | 0.000% (compile-time) | Not in the binary |
| Safe extensibility | 0% (same as Linux sched_class) | Function pointer dispatch, same mechanism |
| Live kernel evolution | 0.000% (rare event only) | ~10μs during replacement, months between events |
| Intent-based management | ~0.00005% (background only) | 3μs per second background optimization |
| Real-time guarantees | 0% to 5% (configurable) | Same cost as Linux PREEMPT_RT when enabled. 0% = PREEMPT_NONE/VOLUNTARY, ~1% = PREEMPT_FULL, 2-5% = PREEMPT_FULL with RT scheduling classes active |
| SmartNIC/DPU offload | Negative (faster) | Moves work OFF host CPU |
| Persistent memory | Negative (faster) | DAX eliminates page cache copies |
| Computational storage | Negative (faster) | CSD reduces data movement |
| Unified compute model | ~0.00005% (background only) | ~4μs/sec/cgroup advisory. Submission hot path unchanged |
Target: match or exceed Linux performance for all common workloads. Most features are invisible at steady state, and several actually improve performance. Known exceptions are conscious trade-offs documented in their respective sections: RT scheduling adds 0-5% overhead for RT-class tasks (same cost as Linux PREEMPT_RT); capability checks add ~5-10 cycles per privileged operation (~0.1%, fully pipelined bitmask test); untrusted policy module isolation adds ~46 cycles per domain crossing (eliminated once the module graduates to the Core isolation domain).
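To illustrate why the capability check stays near ~0.1%, a minimal sketch of a generation-checked bitmask test; the struct layout and names are hypothetical, not the actual CapTable format from Section 9.1:

```rust
// Sketch of the hot-path capability check cited above: one bounds test,
// one generation compare, one bitmask AND. Layout is illustrative only.

const CAP_READ: u32 = 1 << 0;
const CAP_WRITE: u32 = 1 << 1;
const CAP_MMAP: u32 = 1 << 2;

#[derive(Clone, Copy)]
struct CapEntry {
    generation: u32, // bumped on revocation; stale handles fail the compare
    perms: u32,      // permission bitmask
}

#[derive(Clone, Copy)]
struct CapHandle {
    index: usize,
    generation: u32,
}

/// Hot-path check: a handful of cycles, all branch-predictable, no locks.
fn cap_check(table: &[CapEntry], h: CapHandle, required: u32) -> bool {
    match table.get(h.index) {
        Some(e) => e.generation == h.generation && (e.perms & required) == required,
        None => false,
    }
}

fn main() {
    let table = [CapEntry { generation: 7, perms: CAP_READ | CAP_WRITE }];
    let h = CapHandle { index: 0, generation: 7 };
    assert!(cap_check(&table, h, CAP_READ));
    assert!(cap_check(&table, h, CAP_READ | CAP_WRITE));
    assert!(!cap_check(&table, h, CAP_MMAP)); // missing permission bit
    let stale = CapHandle { index: 0, generation: 6 };
    assert!(!cap_check(&table, stale, CAP_READ)); // revoked generation
}
```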
24.3 Verification Strategy¶
24.3.1 Testing Layers¶
| Layer | Tool / Method | What it verifies |
|---|---|---|
| Unit tests | `cargo test` (in QEMU or host mock) | Individual subsystem correctness |
| Integration tests | Custom test harness in QEMU | Cross-subsystem interactions |
| Syscall conformance | Linux Test Project (LTP) | Syscall behavior matches Linux (see below) |
| Application testing | Boot Ubuntu minimal, Alpine | Real-world application compatibility |
| Container testing | Docker hello-world, nginx, redis | Container runtime compatibility |
| Kubernetes testing | k3s single-node | Orchestration platform compatibility |
| ABI regression | `kabi-compat-check` in CI | No breaking changes to KABI |
| Crash recovery | Fault injection framework | Tier 1/2 drivers recover correctly |
| Performance regression | Automated benchmarks vs Linux baseline | No unacceptable performance regression |
| Fuzzing | syzkaller (adapted for UmkaOS; requires KCOV-equivalent coverage, UmkaOS syscall descriptions, MTE/KASAN-equivalent sanitizer). KCOV specification deferred to Phase 4 (Ch 20 Observability — requires tracepoint integration and per-task coverage ring buffers). Phases 2-3: syzkaller runs in description-guided random mode (no coverage feedback). Phase 4+: syzkaller with KCOV coverage-guided mutation. | Syscall fuzzing for crash/hang detection |
| Static analysis | `cargo clippy`, custom lints | Code quality, unsafe usage review |
24.3.2 LTP as Agentic Compatibility Substrate¶
The Linux Test Project (~5,000+ test cases) is not merely a validation gate — it is the primary development substrate for Linux syscall compatibility work. For agentic development, LTP transforms the largest single task in UmkaOS (implementing ~400 syscalls with correct edge-case behavior) from an open-ended research problem into a structured, test-driven implementation task.
Role in agentic workflow: - Each LTP test encodes a concrete behavioral contract (input → expected output) that the implementing agent can read, implement against, and verify — without human involvement or ambiguous documentation. - Tests are organized by syscall family, providing natural agent work-unit decomposition. - Edge cases encoded in LTP tests represent decades of Linux bug reports and regression fixes — knowledge the agent gets for free. - Cross-architecture execution validates that syscall behavior is identical on all 8 architectures (catches wrong struct padding, wrong register conventions, wrong signal frame layouts).
See Section 25.17 for the full agentic LTP workflow and complementary test suites (syzkaller, xfstests, kselftest).
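As an illustration of the "behavioral contract" shape described above, a table-driven conformance sketch. The validated function is a mock, not a real syscall implementation; the errno behavior encoded (EBADF for bad fds, EINVAL for `oldfd == newfd` and unknown flags) follows the documented dup3(2) contract:

```rust
// Hypothetical sketch: each case pins an input to the errno Linux returns,
// so an implementing agent can code against the table and verify mechanically.

const EINVAL: i32 = 22;
const EBADF: i32 = 9;

/// Mock of a dup3-like argument validator, mirroring documented Linux behavior.
fn sys_dup3_validate(oldfd: i64, newfd: i64, flags: u32) -> Result<(), i32> {
    const O_CLOEXEC: u32 = 0o2000000;
    if oldfd < 0 || newfd < 0 {
        return Err(EBADF);
    }
    if oldfd == newfd {
        return Err(EINVAL); // dup3(2): oldfd == newfd is an error
    }
    if flags & !O_CLOEXEC != 0 {
        return Err(EINVAL); // only O_CLOEXEC is a valid flag
    }
    Ok(())
}

fn main() {
    // (oldfd, newfd, flags) -> expected result, LTP-style
    let contract: &[((i64, i64, u32), Result<(), i32>)] = &[
        ((3, 4, 0), Ok(())),
        ((3, 3, 0), Err(EINVAL)),
        ((-1, 4, 0), Err(EBADF)),
        ((3, 4, 0xFFFF_FFFF), Err(EINVAL)),
    ];
    for (input, expected) in contract {
        assert_eq!(sys_dup3_validate(input.0, input.1, input.2), *expected);
    }
}
```

Each real LTP case additionally exercises the side effects (the returned fd, the CLOEXEC flag state), which is what makes the suite a substrate rather than a smoke test.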
24.3.3 Key Benchmarks¶
These benchmarks must match Linux within 5% (measured on identical hardware, same kernel configuration, same workload parameters):
| Benchmark | What it tests | Target delta |
|---|---|---|
| `fio` randread 4K QD32 | Block I/O fast path (IOPS) | < 2% |
| `fio` randwrite 4K QD32 | Block I/O write path (IOPS) | < 2% |
| `fio` sequential read 1M | Block I/O throughput (GB/s) | < 1% |
| `iperf3` TCP throughput | Network stack throughput | < 5% |
| `iperf3` TCP latency (RR) | Network stack latency | < 5% |
| nginx small-file HTTP (`wrk`) | Combined network + filesystem | < 5% |
| `redis-benchmark` | In-memory key-value (network + mem) | < 3% |
| `sysbench` OLTP read-write | Database workload (IO + CPU + sched) | < 5% |
| `hackbench` (groups=100) | Scheduler + IPC throughput | < 3% |
| `lmbench` lat_ctx | Context switch latency | < 1% |
| Kernel compile (`make -jN`) | Combined CPU + IO + scheduling | < 5% |
| `stress-ng` mixed | Overall system stress | < 5% |
Note: Target delta values are MAXIMUM ALLOWED overhead (failure thresholds). The design target is negative overhead (faster than Linux on the same hardware despite Tier 1 isolation). Any positive overhead within these thresholds must include root-cause analysis and a remediation plan documenting which UmkaOS optimization (CpuLocal registers, ring batching, lock-free structures, etc.) compensates for the measured cost. See Section 23.1 for the closed-loop optimization framework that drives toward negative overhead.
24.3.4 Crash Recovery Testing¶
Crash recovery is validated by a dedicated fault injection framework.
24.3.4.1.1 Activation¶
Fault injection is available in debug builds only (`cfg(umka_fault_inject)`).
It is never compiled into release builds. Two activation mechanisms:
- Kernel boot parameter: `umka.fault_inject=<target>[,<fault>]`. Example:
  `umka.fault_inject=nvme0,domain_violation` injects a domain access violation into
  the nvme0 driver on first I/O. The kernel logs the injection at `KERN_DEBUG` level
  and proceeds with the fault.
- Runtime sysctl (debug builds, init namespace only):
  `umka/debug/fault_inject/<driver_name>/<fault_type>` — write `1` to trigger once,
  write `N` to trigger on the N-th matching code path, write `0` to cancel.
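The boot-parameter value follows a simple `<target>[,<fault>]` grammar. A minimal parsing sketch, illustrative only and not the actual kernel cmdline parser:

```rust
// Illustrative parser for `umka.fault_inject=<target>[,<fault>]` values.
// Rejects empty targets or fault types rather than guessing defaults.

/// Split a `<target>[,<fault>]` value into (target, optional fault type).
fn parse_fault_spec(value: &str) -> Option<(&str, Option<&str>)> {
    if value.is_empty() {
        return None;
    }
    match value.split_once(',') {
        Some((target, fault)) if !target.is_empty() && !fault.is_empty() => {
            Some((target, Some(fault)))
        }
        Some(_) => None, // reject an empty target or fault component
        None => Some((value, None)),
    }
}

fn main() {
    assert_eq!(
        parse_fault_spec("nvme0,domain_violation"),
        Some(("nvme0", Some("domain_violation")))
    );
    assert_eq!(parse_fault_spec("nvme0"), Some(("nvme0", None)));
    assert_eq!(parse_fault_spec(""), None);
    assert_eq!(parse_fault_spec("nvme0,"), None);
}
```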
24.3.4.1.2 Fault injection points in driver code¶
Driver code marks injectable points with the umka_fault_inject! macro (compiled
out in release builds):
/// Injects fault `fault_type` at this callsite if fault injection is active for
/// this driver and fault type. No-op in release builds.
///
/// In debug builds: if umka.fault_inject matches this driver + fault_type,
/// executes the fault action (e.g., corrupts a pointer, calls panic!, returns Err).
#[cfg(umka_fault_inject)]
macro_rules! umka_fault_inject {
($driver:expr, $fault_type:expr, $action:expr) => {
if crate::fault_inject::should_inject($driver, $fault_type) {
$action
}
};
}
#[cfg(not(umka_fault_inject))]
macro_rules! umka_fault_inject {
($driver:expr, $fault_type:expr, $action:expr) => {};
}
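One possible shape for the `should_inject` registry the macro calls into, sketched as host-side Rust under the sysctl semantics above (trigger on the N-th matching code path); the names and data structure are illustrative, not the in-kernel implementation:

```rust
// Host-side sketch: a global registry of armed (driver, fault_type) pairs
// with a countdown. `should_inject` fires exactly once, on the N-th hit,
// then disarms itself.

use std::collections::HashMap;
use std::sync::Mutex;

static ARMS: Mutex<Option<HashMap<(String, String), u32>>> = Mutex::new(None);

/// Arm a fault: fire on the n-th matching call (n = 1 means "next call").
fn arm_fault(driver: &str, fault: &str, n: u32) {
    let mut g = ARMS.lock().unwrap();
    g.get_or_insert_with(HashMap::new)
        .insert((driver.to_string(), fault.to_string()), n);
}

/// Called at each injection point; returns true exactly when the
/// countdown for this (driver, fault) pair reaches zero.
fn should_inject(driver: &str, fault: &str) -> bool {
    let mut g = ARMS.lock().unwrap();
    let Some(map) = g.as_mut() else { return false };
    let key = (driver.to_string(), fault.to_string());
    match map.get_mut(&key) {
        Some(n) => {
            *n -= 1;
            if *n == 0 {
                map.remove(&key); // disarm after firing
                true
            } else {
                false
            }
        }
        None => false,
    }
}

fn main() {
    arm_fault("nvme0", "domain_violation", 2); // fire on the 2nd hit
    assert!(!should_inject("nvme0", "domain_violation")); // 1st hit: no
    assert!(should_inject("nvme0", "domain_violation"));  // 2nd hit: yes
    assert!(!should_inject("nvme0", "domain_violation")); // disarmed after firing
    assert!(!should_inject("eth0", "domain_violation"));  // different driver
}
```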
24.3.4.1.3 Fault scenarios tested¶
- Domain isolation violation: Inject `umka_fault_inject!(driver, FaultType::DomainWrite,
  /* write to wrong PKEY */)` — verifies MPK/DACR/POE catches the fault and reloads
  the driver without kernel panic.
- Null pointer dereference: Inject null dereference in Tier 1 driver handler —
  verifies fault containment and recovery within 50–150 ms.
- Infinite loop: Inject `loop {}` in a driver kthread — verifies the per-driver
  watchdog timer (`DRIVER_WATCHDOG_TIMEOUT_MS = 5000`) fires and kills the driver.
- DMA to wrong address: Inject out-of-bounds DMA descriptor — verifies IOMMU fault is
  caught, driver is torn down, no kernel memory corruption.
- Tier 2 process crash: Inject `abort()` in Tier 2 driver process — verifies umka-core
  supervisor restarts within 10 ms.
- Repeated crashes: Inject crash on every restart — verifies auto-demotion policy
  engages after `DRIVER_MAX_RESTART_ATTEMPTS = 3`.
- I/O in flight during crash: Inject crash mid-I/O — verifies all in-flight requests
  complete with `-EIO` and no request objects leak.
Each test verifies: (1) the system does not panic, (2) the driver recovers within the target time, (3) applications see errors but can retry, and (4) no memory is leaked.
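The repeated-crash scenario exercises an auto-demotion counter. A minimal sketch of that policy, assuming hypothetical supervisor type names (only `DRIVER_MAX_RESTART_ATTEMPTS = 3` is from the spec):

```rust
// Illustrative auto-demotion policy: after DRIVER_MAX_RESTART_ATTEMPTS
// consecutive crashes the supervisor stops restarting in place and demotes
// the driver. Names are hypothetical, not the actual supervisor API.

const DRIVER_MAX_RESTART_ATTEMPTS: u32 = 3;

#[derive(Debug, PartialEq)]
enum SupervisorAction {
    Restart, // reload in the same tier
    Demote,  // stop restarting; move to a more isolated placement
}

struct DriverRecord {
    consecutive_crashes: u32,
}

impl DriverRecord {
    fn new() -> Self {
        Self { consecutive_crashes: 0 }
    }

    /// Called by the supervisor on each crash report.
    fn on_crash(&mut self) -> SupervisorAction {
        self.consecutive_crashes += 1;
        if self.consecutive_crashes >= DRIVER_MAX_RESTART_ATTEMPTS {
            SupervisorAction::Demote
        } else {
            SupervisorAction::Restart
        }
    }

    /// A healthy interval resets the counter, so sporadic crashes far
    /// apart do not accumulate toward demotion.
    fn on_healthy(&mut self) {
        self.consecutive_crashes = 0;
    }
}

fn main() {
    let mut d = DriverRecord::new();
    assert_eq!(d.on_crash(), SupervisorAction::Restart); // 1st crash
    assert_eq!(d.on_crash(), SupervisorAction::Restart); // 2nd crash
    assert_eq!(d.on_crash(), SupervisorAction::Demote);  // 3rd: demote
    d.on_healthy();
    assert_eq!(d.on_crash(), SupervisorAction::Restart); // counter was reset
}
```

Whether a healthy interval resets the counter is an assumption of this sketch; the fault injection suite pins down the actual policy.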
24.3.5 CI Pipeline¶
Every commit triggers:
1. cargo build for all 8 architectures (x86_64, aarch64, armv7, riscv64, ppc32, ppc64le, s390x, loongarch64)
2. cargo test (host-side unit tests)
3. QEMU boot test per architecture (boot + shutdown)
4. kabi-compat-check (no ABI breaks)
5. cargo clippy (lint pass)
6. cargo fmt --check (formatting)
Every merge to main additionally triggers:
7. LTP syscall conformance suite
8. Docker container boot test
9. Performance benchmark suite (vs stored Linux baseline)
10. Crash recovery fault injection suite
24.4 Formal Verification Readiness¶
24.4.1 The Opportunity¶
Formal verification of kernel code crossed the practical threshold:
2009: seL4 — 200,000 lines of proof for 10,000 lines of C. Heroic effort.
2018: RustBelt — Formal soundness proof for Rust's ownership model.
2022-2025: Verus (Carnegie Mellon University, VMware Research, Microsoft Research,
ETH Zurich, and others) — Automated verification for Rust.
Write Rust code + specifications → tool PROVES correctness.
Not testing. Not fuzzing. Mathematical machine-checked proof.
Verus can verify Rust code of realistic complexity: concurrent data structures, state machines, protocols, invariant maintenance. UmkaOS is written in Rust. The verification infrastructure exists.
24.4.2 What To Verify¶
Not everything needs verification. Focus on security-critical invariants and concurrency-sensitive code where bugs have catastrophic consequences.
Priority 1 — Non-Replaceable Core (highest verification priority):
These components include both non-replaceable data structures and their replaceable policy dispatch layers (Section 13.18). A bug in any non-replaceable component requires a full reboot to fix. Policy dispatch is verified to ensure correct routing and monotonic security — a swap must never loosen permissions. All must be verified before Phase 2 exit:
| Component | Invariant to Prove | Section |
|---|---|---|
| Physical memory allocator (data) | No page allocated twice. No double-free. Buddy merge preserves free-list consistency. PageArray vmemmap mapping correct. PcpPagePool never loses pages. | Section 4.2 |
| Physical memory allocator (policy) | PhysAllocPolicy dispatch reaches intended function. Policy replacement preserves free-list invariants (no pages lost during swap). |
Section 4.2 |
| Page reclaim (data) | No page on two generation lists simultaneously. Shadow entries correctly encode eviction generation. Generation counter monotonicity. Per-CPU drain buffers never lose pages. | Section 4.4 |
| Page reclaim (policy) | PageReclaimPolicy dispatch correct. Policy replacement does not lose pages from LRU lists or corrupt generation state. |
Section 4.4 |
| Page table management (hardware ops) | No page mapped twice without sharing. Freed pages never accessible via stale PTE. PTE encoding/decoding correct per architecture. | Section 4.8 |
| Page table management (policy) | VmmPolicy dispatch correct. Policy replacement does not leave stale TLB entries. |
Section 4.8 |
| Capability system (data) | Capabilities cannot be forged. cap_lookup() returns correct entry. Generation check is correct. Permission AND is correct. CapOperationGuard never loses decrements. |
Section 9.1 |
| Capability system (policy) | CapPolicy dispatch reaches intended function. MonotonicVerifier correctly rejects policies that loosen security. Policy replacement preserves CapTable invariants (no capabilities lost during swap). |
Section 9.1 |
| Evolution primitive (Nucleus) | INV-1 (atomic visibility), INV-6 (PendingOpsPerCpu transfer integrity). ~2-3 KB of straight-line code within the ~18-20 KB Nucleus — the most tractable Verus target. See Section 13.18. | Section 13.18 |
| LMS boot verifier (Nucleus) | lms_verify_shake256() correctly implements LMS verification per NIST SP 800-208. Winternitz chain completion is correct. Merkle path walk reaches the root. SHAKE256 domain separation is correct (padding byte 0x1F). ~1-3 KB code, reuses Keccak-f[1600]. | Section 2.21 |
| Evolution orchestration (Evolvable) | INV-2 through INV-5, INV-7, LIV-1, LIV-2. These are enforced by replaceable orchestration — bugs are live-fixable. Verified for defense-in-depth but NOT a deployment gate. | Section 13.18 |
| Data format evolution | INV-DF1 through INV-DF5. No partial reads during migration, no lost writes, epoch monotonicity, wire protocol backward compat, extension array isolation. | Section 13.18 |
| KABI vtable dispatch | Vtable calls never escape the driver's isolation domain. Version checks are correct. | Section 12.1 |
Priority 2 — Security and Correctness Critical (verified before Phase 3 exit):
These components are live-replaceable but handle security-sensitive or concurrency-critical operations where bugs have catastrophic consequences (data loss, privilege escalation, deadlock):
| Component | Invariant to Prove | Section |
|---|---|---|
| IPC ring buffer | Producer-consumer protocol never loses messages, never delivers duplicates, never deadlocks. | Section 11.7 |
| CBS bandwidth server | Bandwidth guarantees are met. No starvation. | Section 7.6 |
| DSM coherence protocol | Multiple-reader / single-writer consistency maintained. No lost writes. | Section 6.2 |
| Distributed capabilities | Signature verification is correct. Revocation propagation is complete. | Section 5.7 |
| Power budget enforcement | Budgets are never exceeded by more than one tick interval. | Section 7.7 |
24.4.3 Design for Verifiability¶
Verification readiness is a design property, not a tool. Code must be structured so that specifications can be written and verified:
// Example: capability lookup with verification-ready specification.
// Verus-style annotations (compile-time only, erased from binary).

/// Lookup a capability by handle.
///
/// SPECIFICATION (verified by Verus):
///   requires: handle is valid for calling process
///   ensures:  returned capability matches the one in the capability table
///   ensures:  returned capability's generation <= object's current generation
///   ensures:  returned capability's permissions are a subset of the
///             delegator's permissions (no escalation)
pub fn cap_lookup(
    table: &CapabilityTable,
    process: ProcessId,
    handle: CapHandle,
) -> Result<Capability, CapError> {
    // Implementation elided here; it must satisfy the specification above.
    // Verus proves conformance at compile time. No runtime overhead.
    todo!()
}
Design rules for verifiability:
- Explicit state: No hidden mutable global state. All state is in named structures with explicit ownership. (Rust already enforces this.)
- Small critical sections: Break complex operations into small, individually verifiable steps. Each step has a pre-condition and post-condition.
- Interface contracts: Every public function in security-critical modules has a documented specification (pre/post conditions, invariants). Verus verifies these.
- Algebraic data types for states: Use enums with exhaustive matching instead of integer flags. The type checker ensures all states are handled.
- Monotonic counters: Generation counters, version numbers — use types that enforce monotonicity (can only increase, never decrease).
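The monotonic-counter rule can be enforced directly by the type system. A minimal sketch, assuming nothing beyond standard Rust (the Generation type is illustrative, not the kernel's actual type):

```rust
/// A generation counter that can only move forward. The field is private,
/// so the only way to change it is through bump() — the type system
/// guarantees monotonicity without any runtime check.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Generation(u64);

impl Generation {
    pub const fn new() -> Self {
        Generation(0)
    }

    /// The sole mutating operation: advance by one and return the new
    /// value. There is no setter, so a Generation can never decrease.
    pub fn bump(&mut self) -> Generation {
        self.0 += 1;
        *self
    }

    pub fn value(&self) -> u64 {
        self.0
    }
}

fn main() {
    let mut generation = Generation::new();
    let g1 = generation.bump();
    let g2 = generation.bump();
    assert!(g2 > g1); // monotonicity holds by construction
}
```

Because the invariant is structural, a Verus proof over code using such a type only needs to reason about comparisons, not about every site that might mutate the counter.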
24.4.4 Verification Tooling¶
Primary tool: Verus (Carnegie Mellon University, VMware Research, Microsoft Research, and others). Automated verification for Rust. Specification-driven proofs of functional correctness and memory safety properties.
Alternative tools (fallback if Verus hits scale limits):
- Kani (Amazon): Bounded model checking for Rust. Explores all execution paths up to a configurable bound. Excellent for concurrent code and finding edge cases. Complementary to Verus — Kani finds bugs, Verus proves absence of bugs.
- Prusti (ETH Zurich): Automated verification for Rust. Different proof strategy than Verus (separation logic vs SMT). Useful as a cross-check.
CI integration strategy:
- Every commit: debug_assert! invariant checks + lightweight type-level assertions.
Compile-time only. Seconds. Catches regressions in verified invariants.
- Every PR: Kani bounded model checks on critical modules (~5-10 min).
Catches concurrency bugs and edge cases.
- Nightly: Full Verus specification proofs (~30-60 min for verified modules).
Mathematical proof of correctness. Any proof failure blocks the next release.
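The per-commit debug_assert! layer, as a minimal sketch (free_page and the Vec-based free-list representation are illustrative stand-ins for the real allocator structures):

```rust
/// Illustrative per-commit invariant check: a free-list insert asserts
/// "no double-free" in debug builds. debug_assert! is compiled out of
/// release binaries, so the check costs nothing in production — it only
/// guards CI builds against regressions in verified invariants.
fn free_page(free_list: &mut Vec<u64>, pfn: u64) {
    debug_assert!(
        !free_list.contains(&pfn),
        "invariant violated: page {pfn} is already on the free list"
    );
    free_list.push(pfn);
}

fn main() {
    let mut free_list = Vec::new();
    free_page(&mut free_list, 42);
    free_page(&mut free_list, 43);
    assert_eq!(free_list.len(), 2);
}
```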
Scope of verification — what is OUT of scope: Cross-component interactions (e.g., DSM coherence protocol interacting with hardware isolation boundaries simultaneously) are beyond current tool capabilities. Individual components are verified against their specifications; the composition is validated by integration testing and fuzzing. This is an honest limitation — complete whole-system verification remains a research problem.
Unsafe Code Verification Strategy:
Rust's unsafe blocks are the primary verification target — they are where memory safety invariants must be manually upheld. The strategy:
- Verus for ownership and invariant proofs: verify that unsafe code upholds the safety contract documented in its // SAFETY: comment. Verus can reason about pointer validity, aliasing, and lifetime guarantees.
- Kani for model-checking unsafe code paths: bounded model checking explores all possible inputs to unsafe functions up to a configurable bound, catching edge cases that specifications might miss.
- Wrap unsafe in safe abstractions: every unsafe block is encapsulated in a safe function with a verified specification. Callers never touch unsafe directly. The safe wrapper's specification becomes the verification boundary.
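The safe-wrapper rule, as a minimal sketch (FixedBuffer is illustrative; real Tier 1 drivers wrap MMIO and DMA pointers the same way):

```rust
use core::marker::PhantomData;

/// Illustrative safe wrapper around an unsafe raw-pointer read. The safe
/// API enforces the bounds check; the SAFETY comment states the invariant
/// the wrapper upholds, and that comment is what Verus would verify.
pub struct FixedBuffer<'a> {
    data: *const u8,
    len: usize,
    _lt: PhantomData<&'a [u8]>, // ties the pointer to the source slice's lifetime
}

impl<'a> FixedBuffer<'a> {
    pub fn from_slice(s: &'a [u8]) -> FixedBuffer<'a> {
        FixedBuffer { data: s.as_ptr(), len: s.len(), _lt: PhantomData }
    }

    /// Safe accessor: callers can never read out of bounds and never
    /// touch unsafe directly. This function's contract is the
    /// verification boundary.
    pub fn get(&self, idx: usize) -> Option<u8> {
        if idx >= self.len {
            return None;
        }
        // SAFETY: idx < self.len, and `data` points to `len` readable
        // bytes for lifetime 'a (enforced by the PhantomData marker).
        Some(unsafe { *self.data.add(idx) })
    }
}

fn main() {
    let buf = [10u8, 20, 30];
    let fb = FixedBuffer::from_slice(&buf);
    assert_eq!(fb.get(1), Some(20));
    assert_eq!(fb.get(3), None); // out-of-bounds rejected by the safe wrapper
}
```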
Verification Complexity by Component:
Based on published Verus effort data and component characteristics:
| Priority | Component | Relative Complexity | Rationale |
|---|---|---|---|
| P1 | Capability system — data (Section 9.1) | Low | Small state machine: XArray lookup + integer compare + AND. Clear invariants. |
| P1 | Capability system — policy (Section 9.1) | Low | Dispatch table + MonotonicVerifier swap protocol; small code surface |
| P1 | KABI vtable dispatch (Section 12.1) | Low | Index lookup + bounds check, small code surface |
| P1 | Physical memory allocator — data (Section 4.2) | Medium | Buddy algorithm well-studied; main difficulty is proving no double-alloc |
| P1 | Physical memory allocator — policy (Section 4.2) | Low | Dispatch table + swap protocol; small code surface |
| P1 | Page reclaim — data (Section 4.4) | Medium | Generational LRU: prove no page on two lists, generation monotonicity, shadow entry correctness |
| P1 | Page reclaim — policy (Section 4.4) | Low | Dispatch table + swap protocol; small code surface |
| P1 | Page table management — hardware ops (Section 4.8) | High | Many edge cases, arch-specific |
| P1 | Page table management — policy (Section 4.8) | Low | Dispatch + TLB flush correctness during swap |
| P1 | Evolution primitive — Nucleus (Section 13.18) | Low | ~2-3 KB straight-line code (within the ~18-20 KB total Nucleus); INV-1 (IPI + atomic swap) and INV-6 (ring transfer). No loops beyond bounded page remap. The remaining ~15-17 KB of Nucleus (data structures, page table ops, capability lookup, KABI dispatch) is also P1 but with higher complexity. |
| P1 | LMS boot verifier — Nucleus (Section 2.21) | Low | ~1-3 KB; Winternitz chains (bounded loop W×p iterations) + Merkle path (bounded loop H iterations) + SHAKE256 (reuses verified Keccak). No allocation, no state. |
| P1.5 | Evolution orchestration — Evolvable (Section 13.18) | Medium | INV-2-5, INV-7, LIV-1-2. Live-fixable — verification is defense-in-depth, not a deployment gate. |
| P2 | IPC ring buffer (Section 11.7) | Medium | Single producer-consumer per ring, bounded. Cross-domain shared memory adds concerns: torn reads on non-cache-line-aligned entries, memory ordering across 8 architectures, overflow detection with potentially non-coherent Tier 2 memory. io_uring-style design (cache-line-aligned, power-of-two, acquire/release) is well-studied but the cross-domain privilege boundary adds verification surface beyond a simple in-process SPSC ring. |
| P2 | CBS bandwidth server (Section 7.6) | Medium | Well-studied algorithm |
| P2 | DSM coherence (Section 6.2) | High | Distributed protocol, concurrent access |
Recommended verification order (within Priority 1): capability data → capability policy → KABI vtable dispatch → evolution primitive → LMS boot verifier → memory allocator data → memory allocator policy → page reclaim data → page reclaim policy → page table hardware ops → page table policy. The evolution primitive and LMS verifier are both low-complexity straight-line code and should be verified early. Evolution orchestration (P1.5) is verified for defense-in-depth after all P1 targets but before P2 — it is live-fixable, so verification is not a deployment gate. All P1 components must be verified before Phase 2 exit — verification is the sole defense against defects in non-replaceable code, and the sole guarantee that policy dispatch never misroutes or loosens security.
24.4.5 Performance Impact¶
Zero. Verification happens entirely at compile time: Verus specifications are erased from the binary, and the verified code is identical to the unverified code at runtime.
The only cost is developer time writing specifications. But this pays for itself by eliminating bugs that would otherwise require debugging, CVE patches, and emergency releases.
24.5 Technical Risks¶
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| MPK provides only 16 domains | Medium | Certain | Group related drivers by fault domain (all block share domain, all net share domain). 12 driver-available domains on x86 (4 keys reserved for infrastructure: PKEY 0=core, 1=shared descriptors, 14=shared DMA, 15=guard; per Section 11.2). AArch64 POE has 7 usable indices (1-7), of which 3 are available for Tier 1 driver domains (indices 3-5; indices 1-2 reserved for umka-core, 6 for userspace, 7 for temporary/debug; per Section 24.5). See "MPK Domain Grouping" below for degraded isolation analysis. |
| eBPF verifier complexity | High | High | Verifier subsystem is ~30K SLOC in Linux (counting kernel/bpf/verifier.c at ~23K SLOC as of v6.12, plus btf.c, log.c, range-tracking helpers, and test infrastructure — the ~30K figure covers the full verification subsystem, not verifier.c alone). Start with subset of program types, expand incrementally. UmkaOS implements a clean-room Rust verifier and JIT (GPL avoidance); the eBPF bytecode format and helper API are compatible with Linux but the implementation is original. |
| KVM deeply integrated with Linux MM | High | High | Design memory manager with KVM hooks from the start (Phase 1 architecture). Dedicate a team to KVM from Phase 4. |
| Driver coverage gap blocks adoption | Critical | High | Cloud-first strategy (VirtIO covers 100% of VMs). Prioritize top-20 drivers. Agentic rewrite pipeline for open-source drivers. |
| Subtle syscall compatibility bugs | High | High | LTP conformance suite, real-world application testing, syzkaller fuzzing. Build a comprehensive test matrix of applications. |
| Spectre/Meltdown mitigations + domain isolation | Medium | Medium | KPTI not needed for Tier 1 (same Ring 0). Tier 2 needs standard KPTI. Retpoline/IBRS for indirect branches. Test on affected hardware. |
| IOMMU not available on all hardware | Medium | Medium | IOMMU required for Tier 1 DMA fencing. Systems without IOMMU fall back to trusted mode (reduced isolation, logged warning). |
| ARM64 lacks direct MPK equivalent | Medium | Certain | Use POE (FEAT_S1POE, 7 usable indices of which 3 are for Tier 1 drivers, optional from ARMv8.9+) or page-table fallback. Adaptive isolation policy (Section 11.2) allows per-driver tier pinning or promotion to Tier 0 on pre-POE hardware. |
| No fast isolation on pre-2020 x86 | Medium | Certain | Adaptive isolation policy: isolation=performance promotes Tier 1 to Tier 0 (Linux-equivalent speed, no memory isolation). IOMMU DMA fencing still active. |
| Rust ecosystem maturity for OS dev | Low | Medium | Established patterns from Redox, Linux rust-for-linux, Hubris. Use #![no_std] and custom allocator. Unsafe blocks at hardware boundaries are expected and audited. |
| Performance target too ambitious | Medium | Medium | 5% target is for macro benchmarks. Micro-benchmarks may show higher overhead on specific paths. Batch amortization and careful profiling. |
| Community adoption / contributor pipeline | Medium | Medium | Clean SDK, good documentation, lower barrier than Linux driver development. Cloud-first focus builds credibility before desktop push. |
| Regulatory / certification barriers | Low | Low | Work with distributions early. Open-source everything except vendor proprietary blobs. |
| LZ4/Zstd kernel implementation correctness | Medium | Medium | Fuzzing, comparison with reference implementation. Use no_std BSD-licensed implementations with comprehensive test vectors. |
| Object namespace overhead on hot paths | Low | Low | Lazy registration for high-frequency objects (fds, sockets, VMAs). Eagerly registered objects only (~2000 baseline = ~384 KB). |
| Shared-domain silent corruption | Medium | Low | Inherent MPK/POE/DACR limitation with finite domains. Rust memory safety is primary defense within shared domains. Operators can promote critical drivers to solo domains or Tier 2. See Section 24.5 below. |
| CBS scheduling fairness under edge cases | Medium | Medium | Formal analysis against CBS paper (Abeni 1998), stress testing with adversarial workloads, comparison with Linux cpu.max behavior. |
24.5.1 Risks from Advanced Features (Chapters 16-18)¶
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| TEE hardware fragmentation (SEV-SNP vs TDX vs CCA) | High | Certain | Abstract behind ConfidentialContext trait (Section 9.7). Implement one backend at a time. SEV-SNP first (largest cloud deployment), TDX second, CCA third. |
| PQC algorithm instability (NIST may revise) | Medium | Medium | Algorithm-agile abstraction (Section 9.6). Algorithms behind enum dispatch; swapping ML-KEM for a successor is a library update, not a kernel redesign. |
| PQC signature sizes impact IPC latency | Low | Certain | ML-DSA-65 signatures are 3,309 bytes (per NIST FIPS 204, Table 2). Cold-path only (capability minting, not every IPC call). SignatureData::Heap variant avoids ring buffer bloat (Section 9.6). |
| RT + domain isolation interaction causes priority inversion | High | Medium | Domain switch (WRPKRU on x86) is ~23 cycles (no lock needed). Domain switching is O(1) — no contention path. If priority inheritance needed for domain-shared buffers, use PI futexes (Section 8.4). |
| Formal verification scope creep | Medium | Medium | Verify only security-critical paths: capability table, IPC ring, page table mapping (Section 24.4). Accept that ~80% of kernel code is tested, not verified. |
| DPU vendor lock-in (proprietary firmware) | Medium | High | DPUs are Tier M peers using the standard peer protocol (Section 5.11). Vendors implement a firmware shim (~10-18K lines of C, excluding crypto primitives already in firmware; a reference implementation will be published with measured counts), not a full OS port. Host-side code is generic umka-peer-transport, not vendor-specific. |
| PMEM/CXL hardware not yet widely deployed | Low | High | Design is hardware-agnostic (Section 15.16). All PMEM code compiles out when hardware is absent. CXL 3.0 adoption expected 2025-2027; architecture ready, implementation deferred. |
| Unified compute model adds scheduling overhead | Medium | Low | Advisory overlay only — existing schedulers unchanged (Section 22.8). Topology queries are O(1) reads from cached ComputeCapacityProfile. No hot-path cost. |
| Live kernel evolution causes state corruption | Critical | Low | Post-swap watchdog with 5-second timer (Section 13.18). On crash, the system attempts to re-extract state from the failing component; if extraction fails, the system panics rather than reverting to stale state, preventing silent data corruption. State serialization uses versioned HMAC integrity tags. |
| Intent optimizer makes poor decisions | Low | Medium | Intent system is purely advisory (Section 7.10). Clamping prevents invalid resource configs. Worst case: system falls back to static defaults (no intent optimization). |
24.5.2 Risk Response Priority¶
- Driver coverage (Critical): Addressed by cloud-first strategy + agentic rewrite
- Syscall compatibility (High): Addressed by LTP + application test matrix
- eBPF complexity (High): Addressed by incremental implementation
- KVM integration (High): Addressed by early architectural planning
- TEE fragmentation (High): Addressed by trait-based abstraction
- RT + domain isolation interaction (High): Addressed by O(1) domain switching design
- Domain limit (Medium): Addressed by driver grouping policy
- Live evolution safety (Critical but low likelihood): Addressed by watchdog + state HMAC integrity checks
24.5.3 Domain Grouping: Degraded Isolation Analysis¶
When more than 12 Tier 1 drivers are loaded simultaneously, some drivers must share an isolation domain (protection key). This is an inherent limitation of Intel's 16-key PKU design (16 keys minus PKEY 0 for umka-core, minus PKEY 1 for shared descriptors, minus PKEY 14 for shared DMA, minus PKEY 15 as guard = 12 usable). Grouping has concrete consequences for fault isolation:
What grouping preserves:
- IOMMU isolation: each driver retains its own IOMMU domain regardless of domain grouping. DMA fencing is unaffected — a crashing NVMe driver cannot DMA into a NIC driver's buffers, even if they share an isolation domain.
- Capability isolation: each driver has its own capability set. Sharing an isolation domain does not grant access to another driver's capabilities.
- Crash detection: fault injection and page-fault trapping still identify the crashing driver (via instruction pointer, not isolation domain).
What grouping degrades:
- Memory read/write isolation between grouped drivers. If drivers A and B share isolation domain 5, a buffer overrun in A can corrupt B's data structures. The crash is still contained (it cannot escape to umka-core or other domains), but it may take down both A and B.
- The blast radius of a crash expands from one driver to one domain group. In practice, this means a faulty NVMe driver could take down the AHCI driver if both are in the "block" group.
Grouping policy — drivers are grouped by fault domain affinity (drivers that interact heavily and would likely cascade-fail anyway):
| Isolation Domain | Group | Typical Members |
|---|---|---|
| 2 | Block storage | NVMe, AHCI/SATA, virtio-blk, iSCSI, NVMe-oF |
| 3 | Network (data) | Intel NIC, Mellanox NIC, virtio-net |
| 4 | Network (stack) | TCP/IP, UDP, RDMA core |
| 5 | Filesystem | ext4, XFS, btrfs |
| 6 | Display | DRM/KMS, GPU compute |
| 7 | KVM | VMX/SVM, vhost-net, vhost-scsi |
| 8 | USB | xHCI, USB hub, USB mass storage |
| 9-13 | Individual | High-value drivers that warrant solo domains |
AArch64 domain budget: POE provides 7 usable indices (1-7; index 0 is reserved for umka-core as the default PTE value). Of the 7 usable indices, 4 are reserved for infrastructure (index 1 for shared read-only, index 2 for shared DMA, index 6 for userspace, index 7 for temporary/debug), leaving only 3 indices for Tier 1 driver domains (indices 3-5; vs. 12 on x86). AArch64 therefore requires much more aggressive grouping.
Note for reviewers: ARM FEAT_S1POE uses a 3-bit POIndex field in page table entries, providing 8 index values (0-7). This is a hardware constraint, not a design choice. Index 0 is the default PTE value (per ARM architecture), leaving 7 configurable indices. Do not suggest "use 4 bits for 16 indices" — the POIndex field width is fixed by the ISA.
The grouping table above is designed for x86's 12-domain budget. On AArch64, the kernel applies a reduced grouping scheme:
- Domain 0: umka-core (default PTE value)
- Domain 1: Shared read-only
- Domain 2: Shared DMA buffer pool
- Domain 3: VFS + block I/O (merged — these are tightly coupled)
- Domain 4: Network stack
- Domain 5: All remaining Tier 1 drivers (single shared domain)
- Domain 6: Userspace (EL0 default)
- Domain 7: Temporary / debug
This reduces isolation granularity for Tier 1 drivers on AArch64 (all share one domain) but preserves the critical umka-core/driver/userspace boundaries. The architecture-specific grouping is selected at boot based on arch::current::isolation::domain_count().
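The boot-time selection might look like the following sketch. Everything here except the domain_count() query named in the text is an assumption for illustration (the GroupingScheme enum, the thresholds):

```rust
/// Illustrative grouping-scheme selection based on how many driver
/// isolation domains the architecture reports at boot.
#[derive(Debug, PartialEq)]
enum GroupingScheme {
    /// x86 MPK (12 driver domains), ARMv7 DACR, PPC32: per-group isolation.
    Full,
    /// AArch64 POE (3 driver domains): merge block I/O + VFS, network,
    /// and all remaining Tier 1 drivers into shared domains.
    Reduced,
    /// No hardware keys at all: page-table fallback or Tier 0 promotion.
    None,
}

/// `driver_domains` stands in for arch::current::isolation::domain_count()
/// minus the infrastructure reservations; the 1..=4 threshold is a guess.
fn select_grouping(driver_domains: usize) -> GroupingScheme {
    match driver_domains {
        0 => GroupingScheme::None,
        1..=4 => GroupingScheme::Reduced,
        _ => GroupingScheme::Full,
    }
}

fn main() {
    assert_eq!(select_grouping(12), GroupingScheme::Full);   // x86 MPK budget
    assert_eq!(select_grouping(3), GroupingScheme::Reduced); // AArch64 POE budget
    assert_eq!(select_grouping(0), GroupingScheme::None);
}
```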
Typical server scenario — a cloud server runs NVMe + NIC + TCP + KVM + virtio = 5 drivers. On x86 (12 driver domains), these fit in 5 domains with no grouping needed; the 12-domain limit only triggers on heavily-configured systems (desktop with GPU + audio + USB + Bluetooth + WiFi + NVMe + SATA + NIC + ...). On AArch64 with POE (3 driver domains), even this typical 5-driver configuration requires grouping -- the reduced scheme above merges block I/O, networking, and remaining drivers into 3 shared domains. Architectures with more domains (ARMv7 DACR: 12, PPC32 segments: 12) behave more like x86.
Monitoring — when grouping occurs, UmkaOS logs a warning:
umka: isolation domain 1 shared by nvme, ahci (reduced isolation: crash in either affects both)
This allows administrators to make informed decisions about which drivers to load as Tier 2 (full process isolation, unlimited domains) if they require stronger isolation than domain grouping provides.
24.5.3.1 Domain Grouping Security Properties¶
Drivers sharing an isolation domain have the same memory-access fault containment as monolithic kernel drivers. A buffer overrun in one driver can silently corrupt any other driver in the same domain without triggering a hardware exception. The hardware isolation boundary exists at the domain edge, not between drivers within a domain.
| Property | Solo Domain (1 driver per domain) | Shared Domain (N drivers per domain) |
|---|---|---|
| Hardware fault containment | Full — any cross-domain access triggers immediate hardware exception | None within domain — only the domain boundary is hardware-enforced |
| Crash detection latency | Immediate (first errant memory access faults) | Delayed — corruption may produce wrong results before any detectable fault |
| Blast radius | Single driver | All drivers in the domain group |
| Primary defense | Hardware isolation + Rust memory safety | Rust memory safety only (hardware isolation protects the domain boundary, not interior) |
| IOMMU DMA fencing | Per-driver (unaffected by domain grouping) | Per-driver (unaffected by domain grouping) |
| Capability isolation | Per-driver (unaffected by domain grouping) | Per-driver (unaffected by domain grouping) |
Mitigation: Rust memory safety is the primary defense within shared domains. Safe
Rust prevents buffer overruns, use-after-free, and data races at compile time — the
class of bugs that would exploit co-tenancy. Hardware isolation is defense-in-depth for
the domain boundary (protecting UmkaOS Core and other domain groups), not within it.
unsafe blocks in Tier 1 drivers are the residual attack surface for intra-domain
corruption and must be minimized and audited.
For crash recovery implications of shared-domain corruption, see Section 11.9.
24.5.3.2 Shared-Domain Silent Corruption¶
Risk: When multiple Tier 1 drivers share an isolation domain (normal on AArch64 POE with 3 driver domains, and on x86 when >12 Tier 1 drivers are loaded), a bug in one driver can silently corrupt another driver's memory without triggering a hardware fault. The corrupted driver may produce wrong results (silent data corruption) before eventually crashing.
| Attribute | Value |
|---|---|
| Impact | Medium (contained to one domain group; Core and other domains unaffected) |
| Likelihood | Low (Rust memory safety prevents the dominant bug classes; residual risk from unsafe blocks) |
| Detection | Delayed — no hardware exception until corruption crosses a domain boundary or triggers an unrelated fault |
| Mitigation | (1) Rust memory safety eliminates buffer overruns, UAF, and data races in safe code. (2) Minimize unsafe in Tier 1 drivers. (3) Administrators can promote high-value drivers to solo domains or Tier 2. (4) Watchdog and ring buffer integrity checks (Section 11.9) provide software-level fault detection. |
This is an inherent limitation of the MPK/POE/DACR model with finite hardware domains. It is not a design flaw — it is a conscious tradeoff between isolation granularity and domain budget. The tradeoff is documented here so operators can make informed tier placement decisions.
24.6 Appendices¶
Reference material, comparison tables, and open questions.
24.7 Licensing Model: Open Kernel License Framework (OKLF) v1.3¶
UmkaOS uses the Open Kernel License Framework (OKLF) v1.3 (see OKLF-v1.3.md
for the full legal text). Key elements:
Base license: GPLv2-only with additional permissions (Sections 2-5 of OKLF). All kernel code — umka-core, umka-kernel, umka-sysapi, umka-net, umka-vfs, umka-block, umka-kvm, tools, and boot code — is GPLv2. This ensures:
- All kernel modifications must be open-sourced
- Proprietary forks are impossible
- Same legal framework the Linux ecosystem understands
Approved Linking License Registry (ALLR): A curated, append-only list of open-source
licenses approved for use with kernel code. Tiers 1-2 are GPL-compatible and may be used
in Tier 0 drivers (statically linked into the kernel) or Tier 1 drivers (domain-isolated,
communicating via KABI IPC — no linking occurs). Tier 3 licenses are GPL-incompatible and
may NOT be statically linked with the kernel; Tier 3 code runs as Tier 1
(domain-isolated, KABI IPC) or Tier 2 (process-isolated) drivers — never Tier 0 (static
linking creates a derivative work). KABI IPC provides the license boundary at both tiers:
no shared symbols, no function calls across the license boundary, one resolved symbol
(__kabi_driver_entry):
- Tier 1 (weak copyleft, GPL-compatible): MPL-2.0, LGPL-2.1, EPL-2.0 (with Secondary License designation; see note below)
- Tier 2 (permissive): MIT, BSD-2, BSD-3, Apache-2.0, ISC, Zlib
- Tier 3 (incompatible — process isolation required, no linking): CDDL-1.0, CDDL-1.1, LGPL-3.0, EUPL-1.2 (see note below)
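As a sketch, the tier rules above could be encoded in the module loader roughly like this (all names are hypothetical; the real ALLR is a governed registry, not a hardcoded enum):

```rust
/// Illustrative mapping from ALLR license tier to the driver tiers a
/// module may occupy.
#[derive(Debug, Clone, Copy)]
enum AllrTier {
    WeakCopyleft, // ALLR Tier 1: MPL-2.0, LGPL-2.1, EPL-2.0 (w/ designation)
    Permissive,   // ALLR Tier 2: MIT, BSD, Apache-2.0, ISC, Zlib
    Incompatible, // ALLR Tier 3: CDDL, LGPL-3.0, EUPL-1.2
}

/// Driver tiers: 0 = statically linked, 1 = domain-isolated (KABI IPC),
/// 2 = process-isolated.
fn permitted_driver_tiers(tier: AllrTier) -> &'static [u8] {
    match tier {
        // GPL-compatible code may link statically or run isolated.
        AllrTier::WeakCopyleft | AllrTier::Permissive => &[0, 1, 2],
        // GPL-incompatible code crosses only the KABI IPC boundary;
        // never Tier 0, since static linking creates a derivative work.
        AllrTier::Incompatible => &[1, 2],
    }
}

fn main() {
    assert!(!permitted_driver_tiers(AllrTier::Incompatible).contains(&0));
    assert!(permitted_driver_tiers(AllrTier::Permissive).contains(&0));
}
```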
LGPL-3.0 incompatibility with GPLv2-only: LGPL-3.0 is incompatible with GPLv2-only code per the FSF compatibility matrix. LGPL-3.0 is defined as GPLv3 plus additional permissions (LGPL-3.0 Section 1.1: "This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License"). Since GPLv3 is incompatible with GPLv2-only (see GPLv3 exclusion note below), LGPL-3.0 inherits that incompatibility. LGPL-3.0 code must NOT be linked into the UmkaOS kernel. LGPL-3.0 code communicates with the kernel via KABI IPC only (Tier 3, process isolation required). Note that LGPL-2.1 IS compatible with GPLv2 and remains in Tier 1.
EUPL-1.2 classification (Tier 3): EUPL-1.2 is a strong copyleft license that the FSF classifies as GPL-incompatible. While EUPL Article 5 provides a compatibility list (including GPLv2, GPLv3, LGPL, AGPL, MPL-2.0, EPL-1.0, CeCILL) that allows EUPL-licensed code to be relicensed under those licenses when combined with code under those licenses, the FSF's position is that EUPL-1.2's copyleft is "comparable to the GPL's, and incompatible with it" by itself. UmkaOS places EUPL-1.2 in Tier 3 (no linking with kernel code) — same treatment as CDDL and GPLv3-only. EUPL-1.2 drivers may run at Tier 1 (domain-isolated, KABI IPC boundary) or Tier 2 (process-isolated), but never Tier 0 (static linking creates a derivative work). EUPL-1.2 code that has been explicitly relicensed to GPLv2 via Article 5 by its copyright holder may then be treated as GPLv2 code and used in Tier 0/1.
EPL-2.0 GPL compatibility: EPL-2.0 is GPL-compatible only when the distributor explicitly designates GPL as a Secondary License per EPL-2.0 Section 3.2. Without this designation, EPL-2.0 is GPL-incompatible. UmkaOS requires EPL-2.0 dependencies to carry the Secondary License designation; undesignated EPL-2.0 code is treated as Tier 3 (process isolation required, no linking with kernel code). ALLR Tier 1 inclusion applies only to EPL-2.0 code that explicitly carries the Secondary License designation for GPLv2. Enforcement: the KABI module loader checks for the Secondary License designation in the module's license metadata at load time. EPL-2.0 modules without the designation are rejected for Tier 0/1 loading and must run as Tier 2 process-isolated drivers. Additionally, EPL-2.0's patent grant (Section 2.2) requires contributors to grant a patent license for their contributions; UmkaOS cannot enforce this at a technical level, so EPL-2.0 code in Tier 1 carries an implicit assumption that upstream contributors have complied with Section 2.2. Code review should verify the Secondary License designation is present in the upstream project's license header, not just claimed in module metadata.
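The load-time gate described above might look like this sketch (function and type names are hypothetical; the real loader reads the module's license metadata):

```rust
#[derive(Debug, PartialEq)]
enum LoadDecision {
    AllowTier01,  // may load as Tier 0/1
    RequireTier2, // process isolation only
}

/// Illustrative EPL-2.0 gate: Tier 0/1 loading requires the Secondary
/// License designation (EPL-2.0 Section 3.2) in the module metadata.
fn gate_epl_module(license: &str, has_gpl2_secondary: bool) -> LoadDecision {
    match (license, has_gpl2_secondary) {
        ("EPL-2.0", true) => LoadDecision::AllowTier01,
        ("EPL-2.0", false) => LoadDecision::RequireTier2, // treated as Tier 3
        _ => LoadDecision::AllowTier01, // other licenses gated elsewhere
    }
}

fn main() {
    assert_eq!(gate_epl_module("EPL-2.0", false), LoadDecision::RequireTier2);
    assert_eq!(gate_epl_module("EPL-2.0", true), LoadDecision::AllowTier01);
}
```

As the text notes, this metadata check is necessary but not sufficient: review must confirm the designation exists in the upstream project's license header, not merely in the module's claimed metadata.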
GPLv3 exclusion from ALLR: GPLv3 is deliberately excluded from the ALLR. UmkaOS's kernel is licensed GPLv2-only (not "GPLv2 or later"). GPLv3 is incompatible with GPLv2-only code per the FSF: GPLv3's additional requirements (anti-tivoization in GPLv3 §6, patent retaliation in GPLv3 §11) constitute "further restrictions" that GPLv2 §7 prohibits. Code licensed GPLv3-only cannot be linked into a GPLv2-only kernel. Code licensed "GPLv2 or later" CAN be used (under its GPLv2 grant), but code licensed GPLv3-only cannot. Adding GPLv3 to the ALLR would create a false impression that GPLv3-only code may be linked with the kernel. If GPLv3-only code must be used, it must run as a Tier 1 or Tier 2 driver (same as CDDL), communicating via KABI IPC with no linking.
CDDL and GPL incompatibility: CDDL is GPL-incompatible per the FSF. CDDL-licensed code may run as Tier 1 or Tier 2 drivers — KABI provides the license boundary at both tiers. Despite CDDL appearing in the ALLR, no linking occurs between CDDL code and GPL kernel code. CDDL drivers communicate exclusively via KABI IPC (ring buffer message passing, vtable dispatch, one resolved symbol __kabi_driver_entry) — no shared symbols, no function calls across the license boundary. This provides more isolation than Linux's EXPORT_SYMBOL_GPL boundary (where modules ARE linked into the kernel). Statically-linked (Tier 0) CDDL code is NOT permitted, as static linking creates a derivative work. The KABI boundary ensures CDDL and GPL code never form a single "work" in the copyright sense.
New licenses added via governance process (60-day review, supermajority LGB vote). Licenses are never removed (append-only for legal certainty).
Proprietary kernel-space code explicitly prohibited (OKLF Section 4.2(c)): Any code that loads into kernel address space and accesses internal kernel symbols is a derivative work and must comply with GPLv2 or an ALLR-listed license. This removes Linux's 30-year "gray area" about proprietary kernel modules.
Proprietary user-space drivers explicitly permitted (OKLF Section 4.2(b)): Code interacting with the kernel exclusively through the stable userspace interface (syscalls, /proc, /sys, VFIO, UIO, FUSE, eBPF) is not a derivative work. This maps directly to our Tier 2 driver model — hardware vendors who cannot open-source their drivers may use user-space driver frameworks with full isolation.
Anti-tivoization stance (OKLF Section 12.1): OKLF encourages but does not mandate installation information disclosure. The OKLF adds only additional permissions to GPLv2 (permitted by the copyright holder's inherent right to grant additional permissions beyond the license terms, a well-established practice — see GCC Runtime Library Exception, Qt commercial exception, Classpath exception), never additional restrictions. Anti-tivoization protection is achieved indirectly: the KABI stability guarantee means users can always replace a Tier 1/2 driver binary without modifying the kernel, making hardware lockdown of individual drivers less effective.
Firmware exception (OKLF Section 4.3): Binary firmware that runs on separate
processors (GPU microcode, Wi-Fi firmware, SSD firmware) is outside the license scope.
Distributed separately in firmware/. Code running on the main CPU is NOT firmware.
Legal risk acknowledgment — OKLF is a novel license framework built on GPLv2. While it is designed to be GPLv2-compatible (the "additional permissions" model derives from the copyright holder's inherent right to grant permissions beyond the license terms, a practice well established by the GCC Runtime Library Exception, the Qt commercial exception, and the Classpath exception), it has not been tested in court and should not be relied upon without independent legal review. Key risks:
1. The ALLR mechanism may be viewed by some lawyers as an untested extension of the "linking exception" concept — FSF/SFLC review is recommended before v1.0 final.
2. OKLF provides weaker anti-tivoization protection than GPLv3 — an accepted tradeoff for GPLv2 compatibility, since OKLF cannot mandate installation information disclosure without violating GPLv2's "no further restrictions" clause.
3. Ecosystem adoption depends on corporate legal teams accepting OKLF as GPLv2-compatible — even if legally sound, unfamiliarity may slow adoption.
4. OKLF's scope (ALLR registry, driver tier classification, firmware exception) goes beyond typical additional permissions — a court could find that some provisions constitute "further restrictions" rather than "additional permissions," which GPLv2 Section 6 prohibits. Careful drafting mitigates this risk, but it cannot be eliminated without judicial precedent.
UmkaOS should seek early legal review from SFLC or equivalent, and provide a "plain GPLv2" fallback for organizations that cannot accept OKLF's additional terms.
KABI Driver SDK: The umka-driver-sdk crate (ABI type definitions, vtable layouts, ring buffer protocol, DMA types) is dual-licensed Apache-2.0 OR MIT. This is the interface contract — drivers of any ALLR-listed license can link against these types without friction.
How this maps to our driver tiers:
| Tier | Location | License requirement | OKLF section |
|---|---|---|---|
| Tier 0 (boot-critical) | In-kernel, static | GPLv2 or ALLR | 4.1 (in-tree) |
| Tier 1 (domain-isolated) | Ring 0, loaded | GPLv2 or ALLR | 4.2 (out-of-tree open-source) |
| Tier 2 (user-space) | Ring 3, process | Any (incl. proprietary) | 4.2(b) (userspace interface) |
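The tier/license matrix above, combined with the Tier 0 CDDL exclusion, can be expressed as a load-time gate. The license list below is a stand-in for the ALLR, and all names are illustrative:

```rust
// Hypothetical module-load license gate implementing the tier table.
// The hardcoded license strings stand in for the real ALLR registry.
#[derive(Clone, Copy)]
enum Tier {
    Tier0, // in-kernel, statically linked
    Tier1, // ring 0, domain-isolated, loaded via KABI
    Tier2, // ring 3, separate process
}

fn load_permitted(license: &str, tier: Tier) -> bool {
    let allr_open = matches!(
        license,
        "GPL-2.0" | "MIT" | "BSD-3-Clause" | "Apache-2.0"
    );
    match tier {
        Tier::Tier0 => allr_open, // static link: CDDL excluded (derivative work)
        Tier::Tier1 => allr_open || license == "CDDL-1.0", // KABI IPC boundary
        Tier::Tier2 => true, // userspace interface: any license, incl. proprietary
    }
}

fn main() {
    assert!(!load_permitted("CDDL-1.0", Tier::Tier0)); // static linking forbidden
    assert!(load_permitted("CDDL-1.0", Tier::Tier1));  // e.g. ZFS as Tier 1
    assert!(load_permitted("Proprietary", Tier::Tier2)); // e.g. vendor driver via VFIO
    println!("license gate ok");
}
```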
Three ABI stability tiers (extending OKLF Section 11.2):
| Interface | Stable? | Policy |
|---|---|---|
| Internal kernel APIs | No | May change between any two releases |
| KABI (driver ABI) | Yes | Versioned, append-only, binary-stable |
| Userspace ABI (syscalls) | Yes | Never broken without extended deprecation |
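The "append-only" rule in the KABI row above means a newer vtable only ever adds entries at the end, so a driver built against an older version still sees a valid prefix. A small sketch under assumed (not real) type names:

```rust
// Sketch of append-only KABI versioning: v2 extends v1 strictly by
// appending, and new entry points are gated on the advertised version.
#[repr(C)]
struct BlockOpsV2 {
    abi_version: u32,
    read: fn(lba: u64) -> i32,
    write: fn(lba: u64) -> i32,
    // --- appended in v2; v1 callers never look past `write` ---
    flush: fn() -> i32,
}

/// Callers check `abi_version` before touching appended entries.
fn flush_if_supported(ops: &BlockOpsV2) -> Option<i32> {
    if ops.abi_version >= 2 { Some((ops.flush)()) } else { None }
}

fn main() {
    let ops = BlockOpsV2 {
        abi_version: 2,
        read: |_| 0,
        write: |_| 0,
        flush: || 0,
    };
    assert_eq!(flush_if_supported(&ops), Some(0));
    println!("flush gated ok");
}
```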
| Concern | How addressed |
|---|---|
| Prevent proprietary kernel forks | GPLv2 copyleft |
| Allow ZFS (CDDL) | CDDL in ALLR Tier 3 — ZFS runs as a Tier 1 driver (KABI IPC provides license boundary, no linking occurs) |
| Allow Nvidia GPU (proprietary) | Tier 2 user-space driver via VFIO |
| Allow BSD/MIT drivers | BSD/MIT in ALLR — full kernel-space access |
| Force kernel improvements to be open | GPLv2 copyleft on all kernel crates |
| Module enforcement | Kernel refuses non-compliant modules by default |
| Clear legal boundaries | OKLF explicit text, not legal gray area |
24.8 Project Structure¶
Note: This appendix describes the target project structure at full implementation. The current codebase (see CLAUDE.md "Project Structure") contains the foundational crates (`umka-kernel`, `umka-core`, `umka-driver-sdk`, `umka-sysapi`, `umka-net`, `umka-vfs`, `umka-block`, `umka-kvm`). Additional crates listed below (e.g., `umka-accel`, `umka-cluster`, `drivers/`) will be added as their corresponding architecture sections are implemented.
umka-kernel/
Cargo.toml # Workspace root (all crates)
ARCHITECTURE.md # This document
umka-core/ # Microkernel core
Cargo.toml
src/
main.rs # Boot entry point (calls arch-specific init)
cap/ # Capability system
mod.rs # Capability types, tables, operations
revocation.rs # Generation-based revocation
mem/ # Memory management
phys.rs # Physical page allocator (buddy)
vmm.rs # Virtual memory manager (maple tree, VMAs)
page_cache.rs # Page cache (RCU radix tree)
slab.rs # Slab allocator for kernel objects
pcid.rs # PCID/ASID management
huge.rs # Huge page (THP + explicit) support
sched/ # Scheduler
mod.rs # Scheduler core, class dispatch
eevdf.rs # EEVDF fair scheduler
rt.rs # RT FIFO/RR scheduler
deadline.rs # Deadline (EDF/CBS) scheduler
balance.rs # NUMA-aware load balancer
ipc/ # IPC and isolation
mpk.rs # MPK domain management, WRPKRU helpers
ring.rs # Shared-memory ring buffers
tier2_ipc.rs # Cross-address-space IPC for Tier 2
arch/ # Architecture-specific Rust code
mod.rs # Architecture trait definitions
x86_64/ # x86-64 implementation
mod.rs
gdt.rs # GDT setup
idt.rs # IDT and interrupt dispatch
apic.rs # Local APIC driver (Tier 0)
timer.rs # HPET/TSC/APIC timer (Tier 0)
mpk.rs # MPK hardware interface
vmx.rs # VMX support for KVM
aarch64/ # ARM64 implementation (phase 2+)
mod.rs
armv7/ # ARMv7 implementation (phase 2+)
mod.rs
riscv64/ # RISC-V 64 implementation (phase 2+)
mod.rs
ppc32/ # PPC32 implementation (phase 2+)
mod.rs
ppc64le/ # PPC64LE implementation (phase 2+)
mod.rs
s390x/ # s390x implementation (phase 2+)
mod.rs
loongarch64/ # LoongArch64 implementation (phase 2+)
mod.rs
umka-sysapi/ # Linux syscall interface + SysAPI shims
Cargo.toml
src/
syscall/ # ~450 syscall dispatch table
mod.rs # SyscallHandler enum, dispatch table
process.rs # fork, clone, execve, exit, wait
file.rs # open, read, write, close, ioctl
memory.rs # mmap, brk, mprotect, madvise
network.rs # socket, bind, listen, accept, connect
time.rs # clock_gettime, nanosleep, timer_*
misc.rs # getpid, getuid, uname, sysinfo
proc/ # /proc filesystem emulation
mod.rs
meminfo.rs # /proc/meminfo
cpuinfo.rs # /proc/cpuinfo
pid.rs # /proc/[pid]/* (maps, status, fd, etc.)
sys.rs # /proc/sys/* (sysctl interface)
sys/ # /sys filesystem emulation
mod.rs
devices.rs # /sys/devices/ device tree
class.rs # /sys/class/ device classes
bus.rs # /sys/bus/ bus enumeration
dev/ # /dev filesystem emulation
mod.rs
devtmpfs.rs # devtmpfs-compatible device nodes
signal/ # Signal handling
mod.rs
delivery.rs # Signal delivery to user space
handlers.rs # Default handlers, core dump
namespace/ # Linux namespace implementation
mod.rs
mnt.rs # Mount namespace
pid.rs # PID namespace
net.rs # Network namespace
user.rs # User namespace
ipc.rs # IPC namespace
uts.rs # UTS namespace
cgroup.rs # Cgroup namespace
time.rs # Time namespace
cgroup/ # Cgroup v1/v2
mod.rs
v2.rs # Unified hierarchy (primary)
v1_compat.rs # Legacy hierarchy (compatibility)
controllers/ # cpu, memory, io, pids, etc.
io_uring/ # io_uring subsystem
mod.rs
ring.rs # SQ/CQ ring management
sqpoll.rs # SQPOLL kernel thread
ops.rs # Operation dispatch
lsm/ # Linux Security Modules
mod.rs
hooks.rs # Hook framework
selinux.rs # SELinux policy engine
apparmor.rs # AppArmor profile engine
seccomp.rs # seccomp-bpf filter
ebpf/ # eBPF subsystem
mod.rs
vm.rs # eBPF virtual machine
verifier.rs # Static verifier
jit/ # JIT compilers
x86_64.rs
aarch64.rs
armv7.rs
riscv64.rs
ppc32.rs
ppc64le.rs
s390x.rs
loongarch64.rs
maps.rs # Map types (hash, array, ringbuf, etc.)
helpers.rs # eBPF helper functions
programs.rs # Program types (XDP, tc, kprobe, etc.)
umka-net/ # Network stack (runs as Tier 1)
Cargo.toml
src/
tcp/ # TCP/IP implementation
udp/ # UDP implementation
ip/ # IP layer (v4 + v6)
arp.rs # ARP
icmp.rs # ICMP
netfilter/ # nftables + iptables compatibility
mod.rs
nft.rs # nftables engine
conntrack.rs # Connection tracking
nat.rs # NAT (SNAT, DNAT, masquerade)
xdp/ # XDP fast path
socket.rs # Socket abstraction
tunnel/ # Tunnel protocol modules (Ch16: network-overlay-and-tunneling)
mod.rs # TunnelDevice trait
vxlan.rs # VXLAN encap/decap
geneve.rs # Geneve encap/decap
gre.rs # GRE/GRE6
ipip.rs # IPIP/SIT
wireguard.rs # WireGuard — Phase 4. VPN tunnel implemented as
# umka-net module, not separate crate. Architecture
# in Section 15 (15-networking.md). Requires Noise IK
# handshake protocol (Curve25519, ChaCha20-Poly1305,
# BLAKE2s), allowed-IPs routing table (longest-prefix
# match per peer), peer management (persistent
# keepalive, roaming endpoint update), timer-based rekey.
bridge/ # Software L2 switch (Ch16: network-overlay-and-tunneling)
mod.rs # Bridge device, FDB, STP
vlan.rs # 802.1Q VLAN filtering
veth.rs # Virtual ethernet pairs
macvlan.rs # macvlan/ipvlan devices
vrf.rs # Virtual Routing and Forwarding
umka-vfs/ # Virtual filesystem layer (Tier 1)
Cargo.toml
src/
mod.rs # VFS dispatch, mount table
ext4/ # ext4 filesystem
xfs/ # XFS filesystem
btrfs/ # btrfs filesystem
tmpfs/ # tmpfs (in-memory)
overlayfs/ # OverlayFS (for containers)
dcache.rs # Directory entry cache
umka-block/ # Block I/O layer (Tier 1)
Cargo.toml
src/
mod.rs # Block device abstraction
scheduler.rs # I/O schedulers (mq-deadline, none, bfq)
partition.rs # Partition table parsing (GPT, MBR)
dm/ # Device-mapper framework (Ch15: block-io-and-volume-management)
mod.rs # DM core: target dispatch, table management
linear.rs # dm-linear
striped.rs # dm-striped
mirror.rs # dm-mirror
crypt.rs # dm-crypt (AES-XTS)
verity.rs # dm-verity
snapshot.rs # dm-snapshot (COW)
thin.rs # dm-thin-pool
md.rs # MD RAID (0/1/5/6/10) superblock compat
# MD RAID — Phase 4. Architecture in
# Ch15: block-io-and-volume-management.
# Requires superblock compat (v0.90 at end-of-device,
# v1.0/1.2 at start), RAID personality trait
# (start_reshape, sync_request, make_request,
# check_reshape), resync/recovery state machine
# (idle → resync → active, bitmap-guided incremental sync).
lvm.rs # LVM2 metadata reader
recovery.rs # Recovery-aware volume state machine
iscsi/ # iSCSI block storage (Ch15: block-storage-networking)
mod.rs # iSCSI common: PDU parsing, session state
initiator.rs # iSCSI initiator (RFC 7143)
target.rs # iSCSI target (LIO-compatible config)
iser.rs # iSER — RDMA transport for iSCSI
chap.rs # CHAP authentication
multipath.rs # dm-multipath integration
nvmeof/ # NVMe over Fabrics (Ch15: block-storage-networking)
mod.rs # NVMe-oF common: capsule parsing, queue pairs
host.rs # NVMe-oF initiator (host) — connect, I/O
target.rs # NVMe-oF target (subsystem) — nvmetcli compat
tcp.rs # NVMe/TCP transport (TP 8000)
rdma.rs # NVMe/RDMA transport (TP 8001)
discovery.rs # Discovery controller client/server
ana.rs # ANA multipath — asymmetric namespace access
umka-kvm/ # KVM hypervisor (Tier 1)
Cargo.toml
src/
mod.rs # /dev/kvm interface
vmx.rs # Intel VMX
svm.rs # AMD SVM
mmu.rs # Nested page tables (EPT/NPT)
tee/ # Confidential VM support (Ch9: confidential-computing)
sev.rs # AMD SEV-SNP guest/host
tdx.rs # Intel TDX guest/host
cca.rs # ARM CCA realm management
umka-accel/ # AI/ML accelerator subsystem (Ch22: unified-accelerator-framework)
Cargo.toml
src/
mod.rs # AccelBase trait, device registration
scheduler.rs # CBS-based accelerator scheduler
hmm.rs # Heterogeneous memory management
p2p.rs # Peer-to-peer DMA (PCIe, NVLink, CXL)
inference.rs # In-kernel inference engine
rdma.rs # RDMA and collective ops
umka-cluster/ # Distributed kernel (Ch5: distributed-kernel-architecture)
Cargo.toml
src/
mod.rs # Cluster topology, node discovery
transport.rs # ClusterTransport trait + RdmaInfra, per-peer bindings
ipc.rs # Distributed IPC proxy
dsm.rs # Distributed shared memory
dlm.rs # Distributed Lock Manager (Ch15: distributed-lock-manager)
global_pool.rs # Global memory pool
scheduler.rs # Cluster-wide scheduling
caps.rs # Network-portable capabilities
umka-driver-sdk/ # Stable driver SDK
Cargo.toml
interfaces/ # .kabi IDL definitions
block_device.kabi # Block device interface
net_device.kabi # Network device interface
gpu_device.kabi # GPU device interface
input_device.kabi # Input device interface
usb_device.kabi # USB device interface
char_device.kabi # Character device interface
pci_device.kabi # PCI device interface
platform_device.kabi # Platform device interface
src/
lib.rs # SDK entry point, driver registration
abi.rs # Generated stable ABI types
dma.rs # DMA buffer management
mmio.rs # MMIO access helpers (volatile read/write)
irq.rs # Interrupt handling
ring.rs # Ring buffer helpers for driver use
manifest.rs # Driver manifest parsing
drivers/ # In-tree drivers
tier0/ # Boot-critical (statically linked)
apic/ # Local APIC + I/O APIC
timer/ # PIT / HPET / TSC
serial/ # Early serial console
vga/ # Early VGA text console
tier1/ # Performance-critical (domain-isolated)
nvme/ # NVMe SSD driver
virtio_blk/ # VirtIO block device
virtio_net/ # VirtIO network device
virtio_gpu/ # VirtIO GPU
virtio_console/ # VirtIO console
e1000/ # Intel e1000 NIC
igb/ # Intel igb NIC
ahci/ # AHCI/SATA controller
ext4/ # ext4 driver component
tier2/ # Isolated (user-space process)
usb_xhci/ # USB XHCI host controller
usb_hid/ # USB HID (keyboard, mouse)
usb_storage/ # USB mass storage
hda_audio/ # Intel HDA audio
input/ # Input subsystem (evdev)
tools/
kabi-compiler/ # .kabi IDL -> Rust/C code generator
Cargo.toml
src/
main.rs
parser.rs # IDL parser
codegen_rust.rs # Rust binding generator
codegen_c.rs # C binding generator
kabi-compat-check/ # ABI compatibility CI checker
Cargo.toml
src/
main.rs # Diffs old vs new .kabi, rejects breaks
umka-initramfs/ # Initramfs builder tool
Cargo.toml
src/
main.rs # Packs drivers + early userspace
arch/ # Architecture-specific C/asm
x86_64/
boot/ # UEFI/BIOS boot stub (C + asm)
header.S # Linux boot protocol header
main.c # Early C boot code
efi_stub.c # UEFI stub
asm/
entry.S # Syscall entry/exit
switch.S # Context switch
irq_stubs.S # Interrupt stub table
vdso/
vdso.lds # vDSO linker script
clock_gettime.c # clock_gettime implementation
getcpu.c # getcpu implementation
aarch64/
boot/ # ARM64 boot stub
asm/ # ARM64 assembly
vdso/ # ARM64 vDSO
riscv64/
boot/ # RISC-V boot stub
asm/ # RISC-V assembly
vdso/ # RISC-V vDSO
ppc32/
boot/ # PPC32 boot stub
asm/ # PPC32 assembly
vdso/ # PPC32 vDSO
ppc64le/
boot/ # PPC64LE boot stub
asm/ # PPC64LE assembly
vdso/ # PPC64LE vDSO
s390x/
boot/ # s390x boot stub (IPL)
asm/ # s390x assembly
vdso/ # s390x vDSO
loongarch64/
boot/ # LoongArch64 boot stub
asm/ # LoongArch64 assembly
vdso/ # LoongArch64 vDSO
tests/
abi_compat/ # Old driver binaries for compat regression
syscall/ # Linux syscall conformance (LTP-based)
driver/ # Driver integration tests
bench/ # Performance regression benchmarks
crash_recovery/ # Fault injection + recovery verification
24.9 What UmkaOS Provides That Linux Cannot¶
| Feature | Linux | UmkaOS |
|---|---|---|
| Driver crash recovery | Kernel oops or panic depending on fault type. Many driver bugs produce oops (system continues with degraded functionality) rather than panic. Recovery requires at minimum driver module reload; severe faults cause panic and full reboot (30-60s). | Reload driver in ~50-150ms (Tier 1) or ~10ms (Tier 2). On RISC-V/s390x/LoongArch64, Tier 1 is unavailable — drivers run as Tier 0 (crash = panic, same as Linux) or Tier 2 (full crash recovery), depending on licensing, driver preference, and sysadmin decision. Tier 2 crash recovery is available on all architectures. |
| Stable driver ABI | None (recompile every update) | Versioned, append-only, binary-stable KABI |
| Driver isolation | None (shared address space) | Domain isolation + IOMMU (Tier 1), full process (Tier 2) |
| Capability-based security | Bolt-on (POSIX caps are coarse) | Foundational architecture |
| Lock ordering enforcement | Runtime lockdep (debug only) | Compile-time lock ordering via Rust const generics: Lock<T, LEVEL> where LEVEL: u32 encodes the lock level in the type signature (e.g., Lock<Rq, 100>), preventing out-of-order acquisition at compile time. See Section 3.5. |
| io_uring security | Bypasses syscall monitoring | Per-instance operation whitelist |
| Hot driver upgrade | Fragile (unstable ABI) | Clean stop/start with stable KABI |
| Memory safety | C everywhere | Rust with minimal unsafe at hardware boundaries |
| Many-core scalability | Known bottlenecks (RTNL for networking — partially mitigated in Linux 6.x with per-netns locking but still a single global mutex as of mainline; inode_lock for VFS; cgroup_mutex for cgroups) | No global locks, per-CPU/per-NUMA everywhere |
| Proactive fault management | Ad-hoc (mcelog, rasdaemon) | Unified FMA with diagnosis engine (Section 20.1) |
| Memory compression | zswap/zram (separate, config-heavy) | Integrated NUMA-aware CompressPool tier (Section 4.12) |
| CPU bandwidth guarantee | No floor mechanism | CBS-backed cpu.guarantee (Section 7.6) |
| Stable observability ABI | Tracepoints are unstable | Versioned, documented stable tracepoints (Section 20.2) |
| Verified boot chain | Fragmented (UEFI SB + IMA + dm-verity) | Unified chain from firmware to drivers (Section 9.3) |
| Kernel object introspection | Per-subsystem (/proc, /sys, scattered) | Unified object namespace via umkafs (Section 20.5) |
| Driver state preservation | Lost on crash — cold restart | Checkpointed state buffer, warm restart (Section 11.9) |
| Core panic diagnostics | kexec + kdump (complex setup) | In-place crash dump to reserved memory (Section 11.9) |
| Context switch XSAVE cost | Eager XSAVE with XSAVEOPT/XSAVES optimizations (skips unmodified components, but still saves full state for context switches involving SIMD). UmkaOS's lazy approach avoids save/restore entirely for non-SIMD threads. | Lazy XSAVE — zero cost for non-SIMD threads (Section 7.3) |
| CPU errata management | Scattered #ifdef, ad-hoc | Structured quirk table + boot-param controls (Section 2.18) |
| Volume layer + driver crash | Device marked failed, RAID resync | Recovery-aware: pause I/O, resume clean (Section 15.2) |
| VM guest driver crash | VM reboot required | Driver recovers in-place, hypervisor unaware (Section 18.1) |
| Block storage networking | Separate stacks (open-iscsi, nvme-cli, no unified recovery) | Unified iSCSI + NVMe-oF with RDMA upgrade and crash recovery (Section 15.13) |
| Clustered FS + driver crash | Node fenced, ejected from cluster | Driver recovers in-place, node stays in cluster (Section 15.14) |
| Distributed locking | TCP-based DLM (~10-100 μs/op depending on lock locality; local locks <1 μs), global recovery quiesce on any node failure | RDMA-native DLM (~2-3 μs uncontested, ~5-10 μs contested), per-resource recovery, lease-based extension, batch ops (Section 15.15) |
| TPM key management | Userspace daemon (tpm2-abrmd) | Kernel-native resource manager + capability integration (Section 9.4) |
| Runtime integrity | IMA bolted onto VFS, optional | Integrated with capability system and driver loading (Section 9.5) |
| Display stack crash | X/Wayland session lost | DMA-BUF survives driver reload, compositor stalls ~100ms-5s (full recovery window; Section 22.7) |
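The compile-time lock ordering row above (`Lock<T, LEVEL>`, Section 3.5) can be sketched on stable Rust with const generics. This is an illustrative reconstruction, not the kernel's actual `Lock` type; it uses the monomorphization-time const-assert trick to reject out-of-order acquisition at build time:

```rust
use std::sync::{Mutex, MutexGuard};

/// A mutex tagged with a compile-time lock level (sketch only).
pub struct Lock<T, const LEVEL: u32> {
    inner: Mutex<T>,
}

/// Zero-sized token recording the highest lock level currently held.
pub struct Ordered<const LEVEL: u32>;

struct AssertGt<const A: u32, const B: u32>;
impl<const A: u32, const B: u32> AssertGt<A, B> {
    // Evaluated when instantiated: A <= B turns into a build failure.
    const OK: () = assert!(A > B, "out-of-order lock acquisition");
}

impl<T, const LEVEL: u32> Lock<T, LEVEL> {
    pub fn new(v: T) -> Self {
        Lock { inner: Mutex::new(v) }
    }

    /// Acquire while a level-HELD lock is held; legal only if LEVEL > HELD.
    pub fn lock_after<const HELD: u32>(
        &self,
        _held: &Ordered<HELD>,
    ) -> (MutexGuard<'_, T>, Ordered<LEVEL>) {
        let () = AssertGt::<LEVEL, HELD>::OK; // compile-time order check
        (self.inner.lock().unwrap(), Ordered::<LEVEL>)
    }
}

fn main() {
    let rq: Lock<u32, 100> = Lock::new(0); // e.g. Lock<Rq, 100>
    let vm: Lock<u32, 200> = Lock::new(0);
    let root = Ordered::<0>; // nothing held yet
    let (mut g1, held) = rq.lock_after(&root); // 100 > 0: ok
    let (mut g2, _h2) = vm.lock_after(&held);  // 200 > 100: ok
    *g1 += 1;
    *g2 += 1;
    // Reversing the order fails to compile:
    // let (_, h) = vm.lock_after(&root);
    // let _ = rq.lock_after(&h); // error: out-of-order lock acquisition
    println!("ok: {} {}", *g1, *g2);
}
```

Unlike lockdep, which reports violations only on debug kernels at runtime, the illegal ordering here never produces a binary at all.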
24.10 Cross-Feature Integration Map¶
24.10.1 Feature Interaction Matrix¶
These features are not independent — they reinforce each other:
| Feature A | Direction | Feature B | Rationale |
|---|---|---|---|
| Formal verification (Section 24.4) | --> | Confidential computing (Section 9.7) | Proves capability system correct; CC relies on correct capability enforcement |
| Safe extensibility (Section 19.9) | <-> | Live evolution (Section 13.18) | Policy modules are hot-swappable; evolution uses the same mechanism |
| Intent-based management (Section 7.10) | <-> | In-kernel inference (Section 22.6) | Intent optimizer uses learned models; models optimize for declared intents |
| EAS / heterogeneous CPU (Section 7.2) | <-> | Power budgeting (Section 7.7) | EAS picks energy-optimal core; power budget enforces watt cap |
| Power budgeting (Section 7.7) | <-> | Intent-based management (Section 7.10) | Power budget is a constraint; intents include efficiency preference |
| Hardware memory safety (Section 2.23) | --> | Tier 1 driver isolation (Section 11.3) | MTE catches C driver bugs; domain isolation catches the resulting faults |
| Confidential computing (Section 9.7) | --> | Distributed kernel (Section 5.1) | TEE-to-TEE RDMA; DSM coherence for encrypted pages |
| Post-quantum crypto (Section 9.6) | --> | Distributed capabilities (Section 5.7) | PQC signatures on capabilities; network-portable across cluster |
| SmartNIC/DPU (Section 5.11) | <-> | Distributed kernel (Section 5.1) | DPU = peer node (full or shim); same peer protocol + capability services |
| Persistent memory (Section 15.16) | <-> | Memory tiers (Section 22.4) | Persistent memory = another tier; managed by same PageLocationTracker |
| Computational storage (Section 15.17) | <-> | Accelerator framework (Section 22.1) | CSD = storage accelerator; same AccelBase vtable |
| Unified compute (Section 22.8) | <-> | EAS / heterogeneous CPU (Section 7.2) | Multi-dim capacity extends scalar; CPU capacity is a special case |
| Unified compute (Section 22.8) | <-> | Accelerator scheduler (Section 22.2) | Cross-device topology + energy data; accel scheduler consumes advisory |
| Unified compute (Section 22.8) | <-> | Power budgeting (Section 7.7) | Workload profile drives throttle; informed cross-device power decisions |
| Unified compute (Section 22.8) | <-> | Intent-based management (Section 7.10) | compute.weight feeds intent optimizer; optimizer adjusts per-domain knobs |
| Unified compute (Section 22.8) | <-> | Distributed kernel (Section 5.1) | Peer kernel nodes via ClusterTransport; accelerator = close compute node |
| Unified compute (Section 22.8) | <-> | SmartNIC/DPU offload (Section 5.11) | Same convergence: device to peer node; ClusterTransport unifies all transports |
| Distributed Lock Manager (Section 15.15) | <-> | RDMA transport (Section 5.4) | DLM uses RDMA CAS/Send for locks; transport provides kernel RDMA API |
| Distributed Lock Manager (Section 15.15) | <-> | Cluster membership (Section 5.2) | DLM receives join/leave/dead events; single heartbeat source for both |
| Distributed Lock Manager (Section 15.15) | <-> | Clustered filesystems (Section 15.14) | GFS2/OCFS2 use DLM for coordination; DLM lock modes map to FS operations |
| Distributed Lock Manager (Section 15.15) | <-> | Driver recovery (Section 11.9) | DLM in umka-core survives driver crash; no lock recovery needed on Tier 1 reload |
Bootstrap Circular Dependency:
The intent optimizer (Section 7.10) uses in-kernel inference models (Section 22.6), but those models may not be loaded at early boot. Resolution: the intent optimizer degrades gracefully to static defaults when models are unavailable. At boot:
1. The intent optimizer starts with hardcoded heuristics (e.g., "latency target → raise cpu.weight by 20%").
2. When the inference engine loads models (typically within seconds of boot), the optimizer transitions to learned optimization.
3. The transition is seamless — no reconfiguration needed.
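A hedged sketch of this graceful-degradation bootstrap. All names (`WeightModel`, `IntentOptimizer`, `StaticHeuristic`) and the specific weight values are illustrative assumptions, not the real API:

```rust
/// Illustrative model interface; the real inference engine (Section 22.6)
/// is assumed to expose something trait-shaped like this.
trait WeightModel {
    fn cpu_weight(&self, latency_sensitive: bool) -> u32;
}

/// Hardcoded boot-time heuristics, available before any model loads.
struct StaticHeuristic;
impl WeightModel for StaticHeuristic {
    fn cpu_weight(&self, latency_sensitive: bool) -> u32 {
        // "latency target -> raise cpu.weight by 20%"
        if latency_sensitive { 120 } else { 100 }
    }
}

struct IntentOptimizer {
    model: Box<dyn WeightModel>,
}

impl IntentOptimizer {
    fn boot() -> Self {
        Self { model: Box::new(StaticHeuristic) } // step 1: static defaults
    }
    fn on_models_loaded(&mut self, m: Box<dyn WeightModel>) {
        self.model = m; // steps 2-3: seamless swap, no reconfiguration
    }
    fn weight(&self, latency_sensitive: bool) -> u32 {
        self.model.cpu_weight(latency_sensitive)
    }
}

fn main() {
    let mut opt = IntentOptimizer::boot();
    assert_eq!(opt.weight(true), 120); // heuristic path at early boot
    struct Learned; // stand-in for a loaded inference model
    impl WeightModel for Learned {
        fn cpu_weight(&self, l: bool) -> u32 { if l { 135 } else { 100 } }
    }
    opt.on_models_loaded(Box::new(Learned)); // inference engine ready
    println!("post-load weight = {}", opt.weight(true));
}
```

Callers never observe the transition: the same `weight()` query works before, during, and after model load.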
24.10.2 Implementation Dependency Graph¶
Foundation (no dependencies):
- Formal verification readiness (Section 24.4) -- design methodology
- Post-quantum crypto abstraction (Section 9.6) -- data structure sizing
- Locking primitives (Section 3.5) -- lock design

Early integration:
- Hardware memory safety (Section 2.23) -- needs memory allocator
- Power budgeting (Section 7.7) -- needs scheduler
- Safe extensibility (Section 19.9) -- needs KABI vtable mechanism

Mid integration:
- Confidential computing (Section 9.7) -- needs memory manager, IOMMU
- Intent-based management (Section 7.10) -- needs inference engine, cgroups
- Live evolution (Section 13.18) -- needs extensibility mechanism

Late integration:
- SmartNIC/DPU offload (Section 5.11) -- needs peer protocol, capability service providers, device registry
- Persistent memory (Section 15.16) -- needs VFS, memory tiers
- Computational storage (Section 15.17) -- needs AccelBase framework
- Unified compute topology (Section 22.8) -- needs AccelBase, EAS (Section 7.2), power budgeting (Section 7.7)
- Peer kernel nodes (Section 22.8) -- needs unified compute + distributed kernel (Section 5.1)
24.10.3 Cross-Feature Integration Testing Specification¶
The dependency graph above defines 21 feature-pair interactions. Each pair requires a dedicated integration test that exercises the interaction path. This section is the canonical CI specification — it replaces the informal guidance previously in Section 24.11.
24.10.3.1 CI Tier Assignment¶
Tests slot into the 3-tier CI structure defined in Section 25.8:
| Tier | Trigger | Cross-feature tests | Rationale |
|---|---|---|---|
| Compile-time | Every commit | 1 pair (XF-21) | Verified by Verus proofs, not runtime tests |
| Tier 2 | Every PR | 6 safety-critical pairs (XF-01 – XF-05, XF-07) | Regression = crash, data loss, or security breach |
| Tier 3 | Nightly | 14 functional pairs (XF-06, XF-08 – XF-20) | Regression = performance or non-critical functionality |
24.10.3.2 Acceptance Criteria (all runtime tiers)¶
| Criterion | Threshold |
|---|---|
| Correctness | Zero failures across 1,000 iterations per pair |
| Sanitizer | Zero ASan / MSan / TSan findings |
| Latency regression | P99 < 5% vs. single-feature baseline |
| Resource leaks | Zero (memory, file descriptors, locks) after test completion |
| Branch coverage | ≥ 80% of the integration code path per pair |
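The acceptance criteria above can be read as a single pass/fail gate a CI harness applies per pair. The struct and field names below are illustrative, not an existing harness API:

```rust
// Hypothetical per-pair CI result, mirroring the acceptance table.
struct PairRun {
    failures: u32,           // out of 1,000 iterations
    sanitizer_findings: u32, // ASan / MSan / TSan
    p99_regression_pct: f64, // vs. single-feature baseline
    leaks: u32,              // memory, file descriptors, locks
    branch_coverage: f64,    // of the integration code path
}

/// All criteria must hold; any single violation fails the pair.
fn accept(r: &PairRun) -> bool {
    r.failures == 0
        && r.sanitizer_findings == 0
        && r.p99_regression_pct < 5.0
        && r.leaks == 0
        && r.branch_coverage >= 0.80
}

fn main() {
    let pass = PairRun {
        failures: 0,
        sanitizer_findings: 0,
        p99_regression_pct: 3.2,
        leaks: 0,
        branch_coverage: 0.87,
    };
    assert!(accept(&pass));
    let slow = PairRun { p99_regression_pct: 6.1, ..pass }; // latency regression
    assert!(!accept(&slow));
    println!("gate ok");
}
```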
24.10.3.3 Tier 2: Every PR (Safety-Critical Pairs)¶
These 6 pairs guard against crash, data loss, or security breach. A failure blocks merge.
| ID | Pair | Test Scenario | Failure Mode Prevented |
|---|---|---|---|
| XF-01 | Safe extensibility ↔ Live evolution | Load policy module A → hot-swap to B under sustained load (1K ops/s) → verify behavior changes correctly → forward-evolve to original module A → verify state preserved across A→B→A evolution cycle | Stale vtable pointer; lost operations during swap; state corruption across evolution cycle |
| XF-02 | HW memory safety → Tier 1 isolation | Inject OOB write in Tier 1 driver → verify MTE/MPK trap fires → verify isolation domain fault handler runs → verify driver reloads within 150 ms → verify no kernel state corruption | Undetected memory corruption escaping isolation domain |
| XF-03 | DLM ↔ Driver recovery | Acquire DLM lock via Tier 1 storage driver → crash driver (kill domain) → verify lock state preserved in umka-core → reload driver → verify lock accessible without re-acquire | Distributed deadlock from lock state lost on driver crash |
| XF-04 | DLM ↔ Clustered filesystems | Mount clustered FS on 2 QEMU nodes → concurrent file create + write from both → verify DLM serializes conflicting operations → fsck after test → zero corruption | DLM lock ordering violation → filesystem corruption |
| XF-05 | EAS ↔ Power budgeting | Set 10 W power cap → run mixed CPU-bound + I/O workload → verify EAS selects cores within budget → verify cap not exceeded by more than 1 scheduler tick interval | Power budget violation → thermal throttle or hardware damage |
| XF-07 | PQC → Distributed capabilities | Create capability → ML-DSA-65 sign → send to peer via RDMA → verify peer validates → revoke on origin → verify peer rejects within 2 heartbeat intervals | Forged or revoked capability accepted by peer node |
24.10.3.4 Tier 3: Nightly (Functional Pairs)¶
These 14 pairs test performance, optimization, and non-safety functionality.
Failures block merge to master but do not block PR merge to develop.
| ID | Pair | Test Scenario |
|---|---|---|
| XF-06 | Confidential computing → Distributed kernel | Establish DSM region between 2 QEMU nodes → write 4 KB page on node A inside SEV-SNP guest → verify host hypervisor process cannot read page content (QEMU limited: verify memory encryption APIs are called, not actual ciphertext) → read page on node B via DSM → verify coherence. Note: wire-level RDMA encryption is Phase 5+ (Section 9.7); this test verifies DSM coherence with CC-protected memory, not transport encryption. |
| XF-08 | Intent management ↔ In-kernel inference | Set latency-sensitive intent on cgroup → verify ML model adjusts scheduler weights within 5 ticks → remove intent → verify return to defaults within 5 ticks |
| XF-09 | Power budgeting ↔ Intent management | Set "efficiency" intent → verify power budget tightens (measurable watt reduction) → switch to "performance" → verify budget relaxes within 100 ms |
| XF-10 | SmartNIC/DPU ↔ Distributed kernel | DPU joins as Tier M peer → verify service binding + capability routing → simulate DPU crash → verify host fallback activates within 500 ms → DPU rejoins → verify service restored |
| XF-11 | Persistent memory ↔ Memory tiers | Allocate pages on pmem tier → generate hot access pattern → verify PageLocationTracker promotes to DRAM → cool access → verify demotion back to pmem |
| XF-12 | Computational storage ↔ Accelerator framework | Register CSD as AccelBase → submit SHA-256 compute task → verify CSD executes → verify result matches host-computed reference |
| XF-13 | Unified compute ↔ EAS | Register CPU + GPU + NPU with multi-dim capacity vectors → submit heterogeneous workload → verify EAS uses compute.weight for placement decisions |
| XF-14 | Unified compute ↔ Accelerator scheduler | Build cross-device topology (CPU + 2 accelerators) → submit batch of jobs → verify scheduler places on optimal device per energy advisory → verify no starvation |
| XF-15 | Unified compute ↔ Power budgeting | Set per-domain 5 W cap + aggregate 15 W cap → submit cross-device workload → verify throttle decisions respect both per-device and aggregate limits |
| XF-16 | Unified compute ↔ Intent management | Set compute.weight via intent API → verify optimizer adjusts per-domain scheduling knobs → verify convergence within 10 scheduler ticks |
| XF-17 | Unified compute ↔ Distributed kernel | Register remote accelerator as peer via ClusterTransport → submit remote compute job → verify completion callback → verify capability cleanup on disconnect |
| XF-18 | Unified compute ↔ SmartNIC/DPU | DPU advertises compute offload service → verify unified compute topology includes DPU node → submit offloadable work → verify routing to DPU |
| XF-19 | DLM ↔ RDMA transport | Acquire lock via RDMA CAS on remote node → verify lock state visible on both nodes → release via RDMA → verify release propagates within 1 ms |
| XF-20 | DLM ↔ Cluster membership | 3-node cluster → node B leaves → verify DLM redistributes B's master locks to A and C → node B rejoins → verify rebalance completes without orphaned locks |
24.10.3.5 Compile-Time (Every Commit)¶
| ID | Pair | Verification |
|---|---|---|
| XF-21 | Formal verification → Confidential computing | Verus proofs for capability system pass (cargo verus --verify). Correctness of capability enforcement (which CC relies on) is proven at compile time, not tested at runtime. Proof failure = build failure. |
24.10.3.6 Architecture-Specific Notes¶
Most cross-feature tests run on x86-64 (-cpu max) as the primary CI platform.
Exceptions:
| Test | Additional architectures | Reason |
|---|---|---|
| XF-02 (HW safety + isolation) | AArch64 (-M virt,mte=on -cpu neoverse-n2), ARMv7 (-M vexpress-a15, DACR) | Tests arch-specific trap + isolation mechanisms |
| XF-06 (CC + distributed) | x86-64 only | SEV-SNP/TDX emulation (limited: QEMU does not model encrypted memory controller; test verifies API calls, not actual encryption); AArch64 CCA not emulable in QEMU |
All other tests exercise kernel subsystem interactions independent of architecture. Nightly Tier 3 runs additionally include AArch64 and RISC-V 64 for cross-architecture confidence.
24.10.3.7 Fuzzing (Release Candidates)¶
For each release candidate, run syzkaller-style fuzzing on the 6 Tier 2 pairs for
24 hours. The fuzzer generates random sequences of:
- Policy module load/unload interleaved with live evolution swaps (XF-01)
- Concurrent DLM lock acquire/release with driver crash injection (XF-03, XF-04)
- Power budget changes during workload spikes (XF-05)
- DSM page access patterns with CC-protected memory (XF-06)
- Capability create/sign/revoke races across nodes (XF-07)
Zero findings required for release sign-off.
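A generator for the operation sequences above might be seeded as follows. This is a sketch, not the real fuzzer grammar: the op set mirrors the five bullets, the weights are uniform for illustration, and a deterministic LCG is used so a failing sequence can be replayed from its seed.

```rust
// Sketch of a syzkaller-style sequence generator for the Tier 2 pairs.
// Op names and the uniform distribution are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FuzzOp {
    PolicyLoad, PolicyUnload, EvolutionSwap,   // XF-01
    DlmAcquire, DlmRelease, DriverCrashInject, // XF-03, XF-04
    PowerBudgetSet(u32),                       // XF-05
    DsmPageTouch,                              // XF-06
    CapRevoke,                                 // XF-07
}

/// Deterministic LCG (Knuth's MMIX constants) so runs are reproducible.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

fn gen_sequence(seed: u64, len: usize) -> Vec<FuzzOp> {
    let mut rng = Lcg(seed);
    (0..len)
        .map(|_| match rng.next() % 9 {
            0 => FuzzOp::PolicyLoad,
            1 => FuzzOp::PolicyUnload,
            2 => FuzzOp::EvolutionSwap,
            3 => FuzzOp::DlmAcquire,
            4 => FuzzOp::DlmRelease,
            5 => FuzzOp::DriverCrashInject,
            6 => FuzzOp::PowerBudgetSet((rng.next() % 200) as u32),
            7 => FuzzOp::DsmPageTouch,
            _ => FuzzOp::CapRevoke,
        })
        .collect()
}

fn main() {
    // Same seed -> same sequence: a finding is replayable from its seed.
    assert_eq!(gen_sequence(42, 64), gen_sequence(42, 64));
    assert_eq!(gen_sequence(42, 64).len(), 64);
}
```

Reproducibility from the seed is what makes "zero findings for sign-off" auditable: any finding ships with the seed that reproduces it.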
24.11 Open Questions¶
The following cross-cutting items require further design work. Each is tracked as an open question with the affected sections and the specific decision to be made.
Mirrored in: Section 25.9 — update both when status changes.
24.11.1 Resolved Decisions (collapsed — full rationale in referenced sections)¶
These items were previously open but are now fully specified in the architecture.
| Decision | Resolution | Reference |
|---|---|---|
| WiFi: Tier 1 or Tier 2? | Tier 1 | Section 13.15 |
| BlueZ or clean-room Bluetooth? | BlueZ adapter (Tier 2 daemon) | Section 13.14 |
| Allow proprietary drivers? | Yes, via KABI binary compatibility | Section 24.1 |
| eBPF verifier: full or partial? | Full verifier, phased delivery (Phase 2–5) | Section 19.2 |
| io_uring + SEV-SNP buffer management? | Bounce buffer architecture, 16 MiB/ring | Section 19.3 (future work) |
| Live Evolution attestation chain? | Dedicated PCR[16]/PCR[23] + hash-chained event log + TPM2_PolicyAuthorize | Section 13.18, Section 9.3 |
| CXL 3.0 fabric management? | First-class memory tier (NumaNodeType::CxlMemory) | Section 5.9 |
| CXL 3.0 coherence vs. DSM? | Hybrid — CXL transport for intra-rack, RDMA DSM for inter-rack | Section 5.9 |
| Multi-architecture parity matrix? | 8-feature × 8-arch parity matrix defined | Section 2.22 |
| Secure boot: live evolution PCR? | PCR[16] (dev) / PCR[23] (prod) with LiveEvolutionEvent struct | See attestation chain above |
| Default filesystem? | No single default. ext4 (general), XFS (enterprise), ZFS (data integrity/servers). Btrfs for snapshot-centric workloads only. | Filesystem drivers spec |
| io_uring live evolution? | Task-owned state (Theseus-style); component is stateless processor; ~1-10μs swap | Section 19.3 |
| Cross-feature integration testing? | 21 real pairs (not "100+"), 7 PR-critical + 13 nightly + 1 compile-time. Full CI spec with test scenarios, acceptance criteria, fuzzing. | Section 24.10 |
| DPU io_uring submission offload? | Not a separate design question. DPUs in "dumb driver" mode use normal KABI vtable path (no SQ offload). DPUs in Tier M mode use ServiceMessage via DomainRingBuffer ring pairs — the peer protocol IS the transport. No direct DPU access to userspace SQ/CQ. | Section 5.11 |
| Multi-architecture fallback acceptance criteria? | Per-feature thresholds: native ≤5%, fallback ≤10%, not-available 0%. Per-feature acceptance tests. Sysfs /sys/kernel/umka/features/ + dmesg notification for degradation visibility. | Section 2.22 |
| Policy module measurement enforcement? | Tied to boot security posture: enforce (default when secure boot active, rejects unsigned), advisory (default otherwise, allows with warning + isolation), off (bare-metal debugging). Boot parameter umka.module_sig=. Immutable after boot. | Section 19.9 |
| GPU confidential computing? | Not a separate design decision. Both paths (bounce buffer and hardware crypto) are supported. Runtime detection via CcDeviceCapability, admin override via umka.cc_device_dma=auto\|bounce\|hwcrypto. Same pattern as isolation fallback. | Section 9.7 |
| Nested GPU passthrough? | Supported if hardware supports it. Three conditions: IOMMU nested translation, TEE firmware nested device assignment, overhead ≤ 3x. Returns -ENOTSUP otherwise. | Section 9.7 |
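The policy-module measurement decision above lends itself to a small sketch. The enum and helper names here are assumptions for illustration; what is taken from the table is the behavior itself: three values for umka.module_sig=, a default tied to the secure-boot posture, and immutability after boot.

```rust
// Hedged sketch of umka.module_sig= handling. Names are illustrative;
// the documented behavior: default follows secure-boot posture,
// and the chosen value is immutable after boot.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ModuleSigPolicy {
    Enforce,  // reject unsigned policy modules
    Advisory, // allow with warning + isolation
    Off,      // bare-metal debugging only
}

/// Resolve the boot-time policy once; callers treat the result as
/// immutable for the lifetime of the kernel.
fn module_sig_policy(cmdline_value: Option<&str>, secure_boot_active: bool) -> ModuleSigPolicy {
    match cmdline_value {
        Some("enforce") => ModuleSigPolicy::Enforce,
        Some("advisory") => ModuleSigPolicy::Advisory,
        Some("off") => ModuleSigPolicy::Off,
        // Absent or unrecognized: default is tied to boot security posture.
        _ if secure_boot_active => ModuleSigPolicy::Enforce,
        _ => ModuleSigPolicy::Advisory,
    }
}

fn main() {
    assert_eq!(module_sig_policy(None, true), ModuleSigPolicy::Enforce);
    assert_eq!(module_sig_policy(None, false), ModuleSigPolicy::Advisory);
    assert_eq!(module_sig_policy(Some("off"), true), ModuleSigPolicy::Off);
}
```

Resolving the policy exactly once at boot is what makes the "immutable after boot" guarantee trivial to uphold: nothing after early init ever writes it.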
24.11.2 Open Questions (genuinely unresolved)¶
24.11.2.1 OEM partnerships strategy¶
Not yet decided. Affects go-to-market for consumer hardware support (Phase 5b). Candidates: Framework, System76, Dell, HP.
This document is the canonical reference for UmkaOS development. All implementation decisions must be traceable to the architecture described here. Changes to this document require team review and approval.
24.12 KABI IDL Compiler Specification¶
The KABI IDL language and umka-kabi-gen tool are fully specified in Section 12.5. The roadmap deliverable is to implement umka-kabi-gen conforming to that specification. See Section 24.2 for the Phase 1 build milestone.
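To give a feel for what umka-kabi-gen emits, here is a purely illustrative guess at the output shape: a #[repr(C)] table of function pointers plus a version field checked at driver load. Every field name and the versioning scheme are assumptions; the authoritative KernelServicesVTable layout is defined in Section 12.5.

```rust
// Illustrative sketch of generated umka-kabi-gen output. Field names
// and the abi_version scheme are assumptions, not the Section 12.5 layout.
#[repr(C)]
pub struct KernelServicesVTable {
    pub abi_version: u32,
    pub alloc_pages: extern "C" fn(count: usize) -> *mut u8,
    pub free_pages: extern "C" fn(ptr: *mut u8, count: usize),
    pub register_irq: extern "C" fn(line: u32, handler: extern "C" fn(u32)) -> i32,
}

// Stand-in implementations so the sketch is self-contained.
extern "C" fn noop_irq(_line: u32) {}
extern "C" fn fake_alloc(_count: usize) -> *mut u8 {
    core::ptr::null_mut()
}
extern "C" fn fake_free(_ptr: *mut u8, _count: usize) {}
extern "C" fn fake_register(_line: u32, _handler: extern "C" fn(u32)) -> i32 {
    0
}

fn main() {
    let vt = KernelServicesVTable {
        abi_version: 1,
        alloc_pages: fake_alloc,
        free_pages: fake_free,
        register_irq: fake_register,
    };
    // A generated driver entry point would validate abi_version first,
    // then call kernel services only through the exchanged table.
    assert_eq!(vt.abi_version, 1);
    assert_eq!((vt.register_irq)(5, noop_irq), 0);
    assert!((vt.alloc_pages)(1).is_null());
}
```

The #[repr(C)] layout is what makes the table stable across compiler versions, which is the property the KABI binary-compatibility decision in Section 24.11.1 relies on.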