Chapter 23: Roadmap and Verification
Driver ecosystem, implementation phases, verification strategy, technical risks, formal verification, appendices
23.1 Driver Ecosystem Strategy
23.1.1 The Challenge
Driver coverage is the single largest adoption blocker for any new kernel. Linux has thousands of drivers covering decades of hardware. UmkaOS cannot replicate this overnight.
23.1.2 Agentic Driver Rewrite Project
The key insight: all open-source Linux driver source code is available. The hardware programming logic (register sequences, DMA setup, interrupt handling) is identical regardless of kernel API. Only the kernel-facing API surface changes.
AI-assisted translation pipeline:
```
Input: Linux driver C source code (GPL, ~500-5000 LOC typical)
        |
        v
Step 1: Parse Linux kernel API calls (kmalloc, dma_alloc_coherent,
        request_irq, pci_read_config_*, etc.)
        |
        v
Step 2: Map to KABI equivalents (KernelServicesVTable methods)
        |
        v
Step 3: Translate C to Rust, preserving hardware-specific logic exactly
        |
        v
Step 4: Generate KABI driver entry point and vtable exchange
        |
        v
Output: Native Rust KABI driver
```
Human review: Verify hardware-specific sequences are preserved
Testing: Against real hardware + QEMU virtual devices
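Step 2 of the pipeline can be sketched as a lookup from Linux API names to KABI vtable methods. This is a minimal illustration: the KABI method names below are assumptions, not the actual KernelServicesVTable surface.

```rust
/// Returns an assumed KernelServicesVTable method name for a Linux kernel API
/// call, or None if the call has no direct KABI equivalent and the translated
/// driver must be flagged for human review. Mapping entries are illustrative.
fn map_linux_call(linux_api: &str) -> Option<&'static str> {
    match linux_api {
        "kmalloc" => Some("alloc"),                           // general-purpose allocation
        "dma_alloc_coherent" => Some("dma_alloc"),            // coherent DMA buffer
        "request_irq" => Some("irq_register"),                // interrupt handler registration
        "pci_read_config_dword" => Some("pci_config_read32"), // PCI config space read
        _ => None, // no known equivalent: needs manual review
    }
}

fn main() {
    for call in ["kmalloc", "request_irq", "some_obscure_helper"] {
        match map_linux_call(call) {
            Some(kabi) => println!("{call} -> {kabi}"),
            None => println!("{call} -> NEEDS MANUAL REVIEW"),
        }
    }
}
```

In the real pipeline this table would be generated from the KABI IDL rather than hand-written, so that Step 2 stays in sync with the vtable definitions.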
23.1.3 Prioritized Driver List
These drivers cover approximately 95% of real hardware in server and desktop environments:
Priority 1 -- Cloud/VM (covers 100% of cloud deployments):
1. VirtIO block (virtio-blk)
2. VirtIO network (virtio-net)
3. VirtIO GPU (virtio-gpu)
4. VirtIO console (virtio-console)
Priority 2 -- Storage (covers 99% of bare-metal storage):
5. NVMe (universal modern SSD interface)
6. AHCI/SATA (legacy HDDs and older SSDs)
Priority 3 -- Networking (covers 90% of server NICs):
7. Intel e1000/e1000e (universal VM and consumer NIC)
8. Intel igb/ixgbe/ice (server 1G/10G/25G/100G)
9. Realtek r8169 (consumer Ethernet)
10. Mellanox mlx5 (high-performance datacenter)
Priority 4 -- Human Interface (covers desktop usability):
11. USB XHCI host controller (all modern USB)
12. USB EHCI host controller (USB 2.0 legacy)
13. USB HID (keyboard, mouse)
14. USB mass storage
15. Intel HDA audio
16. i915 (Intel integrated graphics, modesetting)
17. amdgpu (AMD graphics, modesetting)
18. UVC (USB Video Class) camera driver: Phase 3/4; V4L2 API compatibility required for webcams, screen capture, and video conferencing applications.

Printing (CUPS, USB printer class, IPP) is Phase 4+; it requires a USB printing class driver plus CUPS daemon compatibility in userspace (no kernel changes beyond the USB HID/class support already in Phase 1-2).
> **Reviewer note**: Camera (V4L2) and printing (CUPS) support are **intentionally
> deferred**. Their absence in Phase 1-2 is by design, not an oversight.
> Early phases target server/container workloads that do not require these
> peripherals. Flagging missing camera or printer support as a Phase 1-2 bug
> is a false positive.
Priority 5 -- Platform (covers system management):
19. ACPI subsystem
20. PCI/PCIe enumeration and configuration
21. IOMMU (Intel VT-d, AMD-Vi)
23.1.4 Nvidia / Proprietary Driver Strategy
For Nvidia (the most critical proprietary driver):
- Nvidia's driver already has a clean internal abstraction layer between their proprietary GPU core and the "kernel interface layer" (`nvidia.ko`)
- UmkaOS provides a KABI-native implementation of this kernel interface layer
- Nvidia's proprietary compute core links against our KABI implementation
- This is more sustainable than binary `.ko` compatibility: the interface layer is small, well-defined, and stable
23.1.5 Community Incentive
The clean KABI SDK makes driver development significantly easier than it is on Linux:
- No need to track unstable internal APIs
- Rust safety eliminates entire classes of bugs
- Binary compatibility across kernel versions eliminates the recompilation burden
- Clear, documented interfaces reduce the learning curve
This lower barrier to entry is expected to attract contributors and vendors over time.
23.2 Implementation Phases
This section covers the implementation timeline for all features. The first part (Phases 1-5+) defines core kernel milestones. The Enhancement Feature Phasing and Future-Proof Feature Phasing tables below map additional features onto these same phases.
23.2.1 Phase 1: Foundations
Goal: Boot to a hello-world program.
See Section 24.3.1 (Phase 1.1: Core Kernel) and Section 24.3.2 (Phase 1.2: Multi-arch) for detailed agentic workflow steps within this roadmap phase.
- UmkaOS Core: x86-64 boot (UEFI + BIOS), physical memory allocator, basic scheduler, IPC/isolation domain infrastructure
- LinuxCompat: minimal syscalls for `execve` + `write` + `exit_group`
- Tier 0 drivers: APIC, timer, serial console
- Build system: Cargo workspace with custom target spec, linker scripts
- CI/CD: QEMU-based boot tests on every commit
- KABI compiler: implement `umka-kabi-gen`. The KABI IDL language and the `umka-kabi-gen` tool are fully specified in Section 11.1.7 (11-kabi.md).
Exit criteria: A statically linked 'Hello, world!' ELF binary runs on UmkaOS in QEMU. The KABI compiler successfully parses a minimal .kabi IDL file and generates Rust/C stubs that compile without errors.
23.2.2 Phase 2: Self-Hosting Shell
Goal: Run a busybox shell with basic utilities.
See Section 24.3.3 (Phase 2.1: Essential Drivers), Section 24.3.4 (Phase 2.2: Linux Compatibility Layer), and Section 24.3.5 (Phase 2.3: Networking Stack) for detailed agentic workflow steps within this roadmap phase.
- VFS layer: mount table, path resolution, file descriptor table
- Filesystems: tmpfs, initramfs (cpio), procfs (basic), sysfs (stub)
- Block I/O layer + VirtIO-blk driver (Tier 1)
- Memory manager: `mmap`, `brk`, page fault handler, COW, demand paging
- Process management: `fork`/`clone`, `execve`, `wait`, `exit`
- Basic signal handling: `SIGCHLD`, `SIGKILL`, `SIGTERM`, `SIGSEGV`
- Pipe and simple I/O
Exit criteria: Busybox shell boots; `ls`, `cat`, `echo`, and `ps` work.
23.2.3 Phase 3: Real Workloads
Goal: Boot systemd, run Docker containers.
See Section 24.3.6 (Phase 3.1: Storage Stack) and Section 24.3.7 (Phase 3.2: Advanced Features) for detailed agentic workflow steps within this roadmap phase.
- Full syscall coverage: approximately 330+ commonly used syscalls (from a dispatch table covering ~450 total Linux syscall numbers, with uncommon/obsolete syscalls returning -ENOSYS)
- NVMe driver (Tier 1), ext4 filesystem (read-write)
- Network: VirtIO-net driver, e1000 driver, TCP/IP stack, socket API
- Namespaces: all 8 types
- Cgroups: v2 (primary) + v1 compat
- io_uring: full implementation
- eBPF: full verifier + JIT (x86-64) + all core map types + XDP/TC/kprobe/tracepoint/cgroup programs (see Section 23.9 eBPF verifier completeness for phase breakdown)
- seccomp-bpf: for container runtime compatibility
- Full signal handling: all 64 signals, `sigaction`, `sigaltstack`
- TTY/PTY subsystem: for terminal emulators
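The dispatch-table shape described in the syscall coverage item above (a table spanning all Linux syscall numbers, with uncommon entries returning -ENOSYS) can be sketched as follows. The handler bodies are placeholder assumptions; only the table-with-default-handler structure is the point.

```rust
// Minimal sketch (not the actual UmkaOS implementation) of a syscall dispatch
// table covering ~450 Linux syscall numbers, with unimplemented entries
// defaulting to -ENOSYS.

const ENOSYS: i64 = 38; // Linux errno 38: "Function not implemented"
const NR_SYSCALLS: usize = 450;

type SyscallFn = fn(args: &[u64; 6]) -> i64;

fn sys_ni(_args: &[u64; 6]) -> i64 {
    -ENOSYS // default handler for uncommon/obsolete syscalls
}

fn sys_getpid(_args: &[u64; 6]) -> i64 {
    1 // placeholder: a real implementation reads the current task's PID
}

fn build_table() -> Vec<SyscallFn> {
    // Every slot starts as the -ENOSYS stub; implemented syscalls overwrite it.
    let mut table: Vec<SyscallFn> = vec![sys_ni; NR_SYSCALLS];
    table[39] = sys_getpid; // __NR_getpid on x86-64
    table
}

fn dispatch(table: &[SyscallFn], nr: usize, args: &[u64; 6]) -> i64 {
    // Out-of-range syscall numbers also get -ENOSYS, matching Linux behavior.
    table.get(nr).map_or(-ENOSYS, |f| f(args))
}

fn main() {
    let table = build_table();
    let args = [0u64; 6];
    println!("getpid -> {}", dispatch(&table, 39, &args));  // prints 1
    println!("nr 448 -> {}", dispatch(&table, 448, &args)); // prints -38
}
```

The same default-stub pattern is how "~330+ implemented out of ~450 numbers" stays cheap: unimplemented entries cost one table slot, not a conditional on the hot path.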
Exit criteria: Ubuntu minimal boots with systemd, Docker runs the hello-world container, iperf3 and fio benchmarks complete.
23.2.4 Phase 4: Production Ready
Goal: Drop-in replacement for specific workloads.
See Section 24.3.8 (Phase 4.1: Consumer Hardware) for detailed agentic workflow steps within this roadmap phase.
- KVM hypervisor: `/dev/kvm`, VMX, EPT, QEMU/Firecracker support
- Netfilter/nftables: connection tracking, NAT, Docker networking
- LSM framework: SELinux policy engine, AppArmor profiles
- Agentic driver rewrite: top-20 driver families ported
- Performance tuning: reach within 5% of Linux on all target benchmarks
- Crash recovery: full Tier 1/2 fault injection testing
- Package: `.deb` and `.rpm` packages for Ubuntu 24.04+ and Fedora 40+
- LTP conformance: Linux Test Project suite passing (>95% of applicable tests)
Exit criteria: UmkaOS boots unmodified Ubuntu 24.04 and Fedora 40, runs Docker + Kubernetes single-node, passes LTP, within 5% of Linux on benchmarks.
23.2.5 Phase 5: Ecosystem
Goal: Broad adoption and platform maturity.
See Section 24.3.9 (Phase 5.1: Windows Emulation Acceleration) for detailed agentic workflow steps within this roadmap phase.
- ARM64 port: full Tier 1 isolation using architecture-appropriate mechanisms
- RISC-V 64 port: same
- PPC32 port: embedded PowerPC support with segment-register isolation
- PPC64LE port: IBM POWER server support with Radix MMU isolation
- Extended driver coverage: GPU acceleration (i915, amdgpu compute), WiFi, Bluetooth
- Vendor partnerships: Nvidia KABI driver, AMD KABI driver, Intel KABI driver
- Community driver development: SDK documentation, examples, mentorship
- Distribution certification: RHEL, Ubuntu, SUSE official support
- Nested virtualization: KVM-on-KVM
- Live kernel upgrade: stop all Tier 1/2 drivers, swap core, restart drivers
23.2.6 Enhancement Feature Phasing
The kernel-internal enhancements described in Sections 4.2, 7.1, 8.2, and 18.1–18.4 have different urgency levels relative to the phases above:
| Feature | Earliest Phase | Rationale |
|---|---|---|
| Unified Object Namespace (Section 19.4) | Phase 1-2 | Foundational — other features build on it |
| Stable Tracepoints (Section 19.2) | Phase 2 | Needed for debugging from the start |
| Memory Compression (Section 4.2) | Phase 3 | Requires mature memory manager |
| Verified Boot (Section 8.2) | Phase 3 | Requires bootable system to protect |
| CPU Bandwidth Guarantees (Section 6.3) | Phase 3-4 | Requires mature scheduler + cgroups |
| Fault Management (Section 19.1) | Phase 4 | Requires mature driver ecosystem reporting health |
The following table covers the implementation timeline for advanced features (Chapters 16-18). Phase numbers align with the core kernel phases defined above. "Design-In" items (Phase 1) require data structure reservations and trait definitions but no functional implementation. Higher-phase items depend on core infrastructure being available.
| Feature | Phase | Dependencies | Design-In Cost | Notes |
|---|---|---|---|---|
| PQC crypto abstraction (Section 8.5) | Phase 1 | None | Low | Variable-length signature fields, algorithm enum |
| Formal verification readiness (Section 23.10) | Phase 1 | None | Low | Spec annotations, design contracts |
| RT preemption model (Section 7.2) | Phase 1-2 | Scheduler | Medium | Lock design, interrupt threading |
| Hardware memory safety hooks (Section 2.3) | Phase 2 | Memory allocator | Low | Tag allocation/deallocation in slab/buddy |
| Power budgeting (Section 6.4) | Phase 3 | Scheduler, cgroups | Medium | RAPL/SCMI reading, power cgroup controller. Per-task EAS is in Section 6.1.5 |
| Safe kernel extensibility (Section 18.7) | Phase 3 | KABI, domain isolation | Medium | Policy vtable traits, module lifecycle |
| Confidential computing — guest (Section 8.6) | Phase 3 | Memory manager | Medium | Bounce buffers, shared/private pages |
| Confidential computing — host (Section 8.6) | Phase 4 | umka-kvm, IOMMU | Medium | SEV-SNP/TDX VM management |
| PQC algorithm implementations (Section 8.5) | Phase 3-4 | Crypto abstraction | Medium | ML-KEM, ML-DSA, hybrid mode |
| Live kernel evolution (Section 12.6) | Phase 4-5 | Extensibility | Medium | State export/import, atomic swap |
| Intent-based management (Section 6.7) | Phase 4-5 | Inference engine, cgroups | Medium | Optimization loop, intent cgroup knobs |
| SmartNIC/DPU offload (Section 5.2) | Phase 4-5 | Device registry, proxy drivers | Medium | Offload transport, DPU discovery |
| Persistent memory (Section 14.7) | Phase 4-5 | VFS, memory tiers | Medium | DAX, MAP_SYNC, CLWB fencing |
| Computational storage (Section 14.8) | Phase 5+ | AccelBase framework | Low | CSD as AccelDeviceClass |
| Unified compute topology (Section 21.6) | Phase 4-5 | AccelBase, EAS (Section 6.1.5), power budgeting (Section 6.4) | Low | Advisory overlay; multi-dim capacity profiles, cross-device energy |
| Unified cgroup compute.weight (Section 21.6) | Phase 5+ | Unified topology, intent optimizer (Section 6.7) | Low | Optional knob; orchestration layer over existing per-domain knobs |
| NodeTransport unification (Section 21.6.13) | Phase 5 | KernelTransport (Section 5.1), OffloadTransport (Section 5.2) | Medium | Merge RDMA + PCIe + NVLink + CXL into one transport abstraction |
| Peer kernel nodes (Section 21.6.13) | Phase 5+ | NodeTransport, distributed kernel (Section 5.1) | Low | Vendor-driven; architecture ready, adoption depends on industry |
23.2.7 Priority Rationale
Phase 1-2 (Design-In): PQC sizing, verification readiness, RT lock design. These cost almost nothing now but are impossible to retrofit. Design contracts and data structure sizes affect everything built on top.
Phase 3 (Real Workloads): Extensibility, power budgeting, confidential guest mode. These enable the kernel to run real workloads in modern environments (cloud, power- constrained datacenters).
Phase 4-5 (Competitive Advantage): Live evolution, intent-based management, DPU offload. These are features that Linux cannot provide due to architectural constraints. They differentiate UmkaOS in production environments.
23.2.8 Licensing Summary
| Component | IP Source | Risk |
|---|---|---|
| Confidential computing (TEE) | Hardware vendor specs (AMD SEV, Intel TDX, ARM CCA), all public | None |
| Post-quantum crypto | NIST standards (FIPS 203, 204, 205), public domain algorithms | None |
| Power budgeting | RAPL (Intel public spec), SCMI (ARM public spec), original design | None |
| Hardware memory safety | ARM MTE (public ISA), Intel LAM (public ISA) | None |
| Formal verification | Verus (MIT license), RustBelt (academic, published) | None |
| Safe extensibility | Original design (extends existing KABI vtable model) | None |
| Live kernel evolution | Theseus OS concepts (academic, published, Rice University) | None |
| Intent-based management | Original design, optimization theory (academic) | None |
| Real-time guarantees | PREEMPT_RT concepts (GPLv2, Linux mainlined), CBS (academic) | Medium — see note below |
| SmartNIC/DPU offload | Original design (extends existing KABI proxy model) | None |
| Persistent memory | DAX/PMEM specifications (SNIA, public), Linux interfaces (facts) | None |
| Computational storage | NVMe Computational Programs Command Set and Subsystem Local Memory Command Set (public, NVMe consortium, January 2024) | None |
| Unified compute model | Original design (extends existing AccelBase + EAS models) | None |
All components are either original design, based on published academic research, based on public hardware specifications, or based on NIST/industry standards. No vendor-proprietary APIs or patented algorithms.
PREEMPT_RT derivative risk: PREEMPT_RT is GPLv2 and was merged into Linux mainline (v6.12). Any UmkaOS real-time code derived from PREEMPT_RT implementation (as opposed to the general concepts of preemptible kernels, threaded interrupts, and priority inheritance) could carry GPLv2 obligations that conflict with OKLF's additional permissions. UmkaOS's RT implementation MUST be a clean-room design based on published academic literature (priority inheritance protocols: Sha, Rajkumar, Lehoczky 1990; CBS: Abeni and Buttazzo 1998; LITMUS-RT: Brandenburg 2011) and public OS design textbooks, not derived from Linux PREEMPT_RT source code. Code review must verify no Linux-derived lock conversion patterns, interrupt threading structures, or RT-specific scheduler modifications are copied.
23.2.9 Performance Impact Summary
Every feature in this document was evaluated against the constraint: "Does this make UmkaOS measurably slower than Linux on the same workload?"
| Feature | Hot-Path Impact vs Linux | Justification |
|---|---|---|
| Confidential computing | 0% (same hardware, same cost) | Hardware AES engine, identical to Linux |
| Post-quantum crypto | 0% (cold-path only) | Boot/driver-load only. ML-DSA-44 verify comparable to Ed25519; ML-DSA-65 verify ~100-200 µs (cold-path only, not on hot paths) |
| Power budgeting | 0.015% (MSR reads at tick) | 600ns per 4ms tick. Invisible in any benchmark. Per-task EAS overhead: see Section 6.1.5.12 |
| Hardware memory safety | 0% vs Linux when enabled | Same MTE instructions, same hardware cost. Tag RAM overhead: 3.125% of DRAM (ARM MTE only) |
| Formal verification | 0.000% (compile-time) | Not in the binary |
| Safe extensibility | 0% (same as Linux sched_class) | Function pointer dispatch, same mechanism |
| Live kernel evolution | 0.000% (rare event only) | ~10μs during replacement, months between events |
| Intent-based management | ~0.00005% (background only) | 3μs per second background optimization |
| Real-time guarantees | 0% to 5% (configurable) | Same cost as Linux PREEMPT_RT when enabled. 0% = Voluntary, ~1% = Full, 2-5% = Realtime |
| SmartNIC/DPU offload | Negative (faster) | Moves work OFF host CPU |
| Persistent memory | Negative (faster) | DAX eliminates page cache copies |
| Computational storage | Negative (faster) | CSD reduces data movement |
| Unified compute model | ~0.00005% (background only) | ~4μs/sec/cgroup advisory. Submission hot path unchanged |
Target: match or exceed Linux performance for all common workloads. Most features are invisible at steady state, and several actually improve performance. Known exceptions are conscious trade-offs documented in their respective sections: RT scheduling adds 0-5% overhead for RT-class tasks (same cost as Linux PREEMPT_RT); capability checks add ~5-10 cycles per privileged operation (~0.1%, fully pipelined bitmask test); untrusted policy module isolation adds ~46 cycles per domain crossing (eliminated once the module graduates to the Core isolation domain).
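The "fully pipelined bitmask test" cost model cited above for capability checks can be illustrated with a minimal sketch. The `CapBit` names and the `CapSet` layout are assumptions for illustration, not the real UmkaOS capability encoding.

```rust
// Illustrative model of a ~5-10 cycle capability check: one bit per
// capability in a 64-bit word, tested with a single AND + compare.

#[derive(Clone, Copy)]
#[repr(u64)]
enum CapBit {
    MapDeviceMemory = 0,
    RegisterIrq = 1,
    IssueDma = 2,
}

#[derive(Clone, Copy, Default)]
struct CapSet(u64); // one bit per capability

impl CapSet {
    /// Returns a new set with the given capability granted (cold path).
    fn grant(self, cap: CapBit) -> Self {
        CapSet(self.0 | (1u64 << cap as u64))
    }

    /// The hot-path check: a single AND + compare, trivially pipelined.
    #[inline(always)]
    fn has(self, cap: CapBit) -> bool {
        self.0 & (1u64 << cap as u64) != 0
    }
}

fn main() {
    let caps = CapSet::default().grant(CapBit::RegisterIrq);
    println!("irq: {}", caps.has(CapBit::RegisterIrq)); // prints true
    println!("dma: {}", caps.has(CapBit::IssueDma));    // prints false
}
```

Because the check is branch-free up to the final compare and touches one cache line, it stays in the quoted ~0.1% overhead band even on privileged-operation-heavy workloads.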
23.2.10 Consumer and Desktop Phases (Phase 5)
Phase 5 focuses on consumer hardware support, desktop integration, and application ecosystem compatibility. These phases begin after Phase 4 server/cloud stability.
23.2.11 Phase 5a: Essential Consumer Hardware
Goal: UmkaOS boots and runs on common Intel/AMD laptops with basic functionality.
Deliverables:
- WiFi drivers (Intel, Realtek)
- Bluetooth stack (HID, audio)
- Touchpad drivers (I2C-HID, PS/2)
- Audio (Intel HDA, USB Audio)
- Graphics (Intel i915 modesetting, basic AMD)
- S3 suspend/resume
23.2.12 Phase 5b: Consumer Power Management
Goal: Battery life competitive with existing Linux distributions.
Deliverables:
- Power profiles (performance, balanced, battery-saver)
- S0ix Modern Standby support
- Per-app power attribution kernel interfaces
23.2.13 Phase 5c: Desktop Integration
Goal: Polished desktop experience, ready for enthusiast adoption.
Deliverables:
- Wayland compositor support (DRM, input events)
- Multi-monitor support (hotplug)
- Desktop notifications (battery, network, USB events)
- Per-app sandboxing capability primitives
23.2.14 Phase 5d: Broader Hardware
Goal: Support popular consumer laptops (ThinkPad, XPS, etc.).
Deliverables:
- More WiFi chipsets (Qualcomm, MediaTek, Broadcom)
- AMD graphics (amdgpu modesetting)
- Thunderbolt 3/4 support
- USB4 support
- SATA, eMMC, SD card readers
23.2.15 Phase 5e: Gaming & Creative
Goal: Support gaming, content creation workloads.
Deliverables:
- Vulkan drivers (Mesa RADV for AMD, Intel ANV)
- Steam + Proton support
- GPU video encode/decode (hardware acceleration)
23.2.16 Desktop / Laptop Performance Targets
Performance targets for UmkaOS running on consumer-grade desktop/laptop hardware. These are acceptance criteria for Phase 5 completion, not kernel architectural constraints — specific numbers are deployment-profile goals.
| Metric | Target |
|---|---|
| Kernel boot (bootloader → login screen) | < 5 seconds |
| Resume from S3 suspend | < 2 seconds |
| Resume from S4 hibernate | < 10 seconds |
| Idle power (WiFi on, display on) | Match or exceed Ubuntu 24.04 |
| Video playback (1080p H.264) | Hardware decode; CPU < 5% |
23.2.16.1 Validation Methodology
Battery life:
- Side-by-side comparison with Windows 11 and Ubuntu 24.04 on the same hardware
- Standardised web-browsing benchmark (Speedometer + video stream)
- UmkaOS must match or exceed Ubuntu 24.04 battery life
Real-world validation:
- 100+ beta testers (developer community) running UmkaOS as daily driver
- 30-day soak; collect crash dumps, performance traces, battery statistics
23.3 Verification Strategy
23.3.1 Testing Layers
| Layer | Tool / Method | What it verifies |
|---|---|---|
| Unit tests | `cargo test` (in QEMU or host mock) | Individual subsystem correctness |
| Integration tests | Custom test harness in QEMU | Cross-subsystem interactions |
| Syscall conformance | Linux Test Project (LTP) | Syscall behavior matches Linux |
| Application testing | Boot Ubuntu minimal, Alpine | Real-world application compatibility |
| Container testing | Docker hello-world, nginx, redis | Container runtime compatibility |
| Kubernetes testing | k3s single-node | Orchestration platform compatibility |
| ABI regression | `kabi-compat-check` in CI | No breaking changes to KABI |
| Crash recovery | Fault injection framework | Tier 1/2 drivers recover correctly |
| Performance regression | Automated benchmarks vs Linux baseline | No unacceptable performance regression |
| Fuzzing | syzkaller (adapted for UmkaOS) | Syscall fuzzing for crash/hang detection |
| Static analysis | `cargo clippy`, custom lints | Code quality, unsafe usage review |
23.3.2 Key Benchmarks
These benchmarks must match Linux within 5% (measured on identical hardware, same kernel configuration, same workload parameters):
| Benchmark | What it tests | Target delta |
|---|---|---|
| `fio randread 4K QD32` | Block I/O fast path (IOPS) | < 2% |
| `fio randwrite 4K QD32` | Block I/O write path (IOPS) | < 2% |
| `fio sequential read 1M` | Block I/O throughput (GB/s) | < 1% |
| `iperf3` TCP throughput | Network stack throughput | < 5% |
| `iperf3` TCP latency (RR) | Network stack latency | < 5% |
| nginx small-file HTTP (`wrk`) | Combined network + filesystem | < 5% |
| `redis-benchmark` | In-memory key-value (network + mem) | < 3% |
| `sysbench` OLTP read-write | Database workload (IO + CPU + sched) | < 5% |
| `hackbench` (groups=100) | Scheduler + IPC throughput | < 3% |
| `lmbench lat_ctx` | Context switch latency | < 1% |
| Kernel compile (`make -jN`) | Combined CPU + IO + scheduling | < 5% |
| `stress-ng` mixed | Overall system stress | < 5% |
23.3.3 Crash Recovery Testing
Crash recovery is exercised by a dedicated fault injection framework.
Activation
Fault injection is available in debug builds only (`cfg(umka_fault_inject)`). It is never compiled into release builds. Two activation mechanisms:
- Kernel boot parameter: `umka.fault_inject=<target>[,<fault>]`. Example: `umka.fault_inject=nvme0,domain_violation` injects a domain access violation into the nvme0 driver on first I/O. The kernel logs the injection at `KERN_DEBUG` level and proceeds with the fault.
- Runtime sysctl (debug builds, init namespace only): `umka/debug/fault_inject/<driver_name>/<fault_type>`. Write `1` to trigger once, write `N` to trigger on the N-th matching code path, write `0` to cancel.
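The sysctl write semantics (1 = trigger once, N = trigger on the N-th matching code path, 0 = cancel) can be modeled as a small countdown. The type and method names here are illustrative assumptions, not the kernel's actual bookkeeping.

```rust
// Toy model of the per-(driver, fault_type) sysctl trigger state.

#[derive(Default)]
struct InjectState {
    remaining: u64, // 0 = disarmed; N = fire on the N-th matching hit
}

impl InjectState {
    /// Models a write to the sysctl file: 1 = next hit, N = N-th hit, 0 = cancel.
    fn write(&mut self, value: u64) {
        self.remaining = value;
    }

    /// Called at each matching injection point; returns true when the fault fires.
    fn should_inject(&mut self) -> bool {
        match self.remaining {
            0 => false,
            1 => {
                self.remaining = 0; // fire once, then disarm
                true
            }
            _ => {
                self.remaining -= 1; // count down toward the N-th hit
                false
            }
        }
    }
}

fn main() {
    let mut st = InjectState::default();
    st.write(3); // fire on the 3rd matching code path
    let hits: Vec<bool> = (0..4).map(|_| st.should_inject()).collect();
    println!("{hits:?}"); // prints [false, false, true, false]
}
```

Disarming after the fault fires keeps a single sysctl write from turning into a crash loop during recovery testing.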
Fault injection points in driver code
Driver code marks injectable points with the `umka_fault_inject!` macro (compiled out in release builds):
```rust
/// Injects fault `fault_type` at this callsite if fault injection is active for
/// this driver and fault type. No-op in release builds.
///
/// In debug builds: if umka.fault_inject matches this driver + fault_type,
/// executes the fault action (e.g., corrupts a pointer, calls panic!, returns Err).
#[cfg(umka_fault_inject)]
macro_rules! umka_fault_inject {
    ($driver:expr, $fault_type:expr, $action:expr) => {
        if crate::fault_inject::should_inject($driver, $fault_type) {
            $action
        }
    };
}

#[cfg(not(umka_fault_inject))]
macro_rules! umka_fault_inject {
    ($driver:expr, $fault_type:expr, $action:expr) => {};
}
```
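A self-contained sketch of how a driver path might use this macro, with a stub `fault_inject` module standing in for the real kernel plumbing. Everything except the macro shape itself (the stub predicate, `submit_io`, the fault string) is an illustrative assumption.

```rust
// Stub: the real implementation consults the boot parameter / sysctl state.
mod fault_inject {
    pub fn should_inject(driver: &str, fault: &str) -> bool {
        driver == "nvme0" && fault == "domain_violation"
    }
}

// Debug-build variant of the macro, inlined here so the example runs standalone.
macro_rules! umka_fault_inject {
    ($driver:expr, $fault_type:expr, $action:expr) => {
        if crate::fault_inject::should_inject($driver, $fault_type) {
            $action
        }
    };
}

/// A driver I/O path with one injectable point on entry.
fn submit_io(driver: &str) -> Result<(), &'static str> {
    umka_fault_inject!(driver, "domain_violation", return Err("injected fault"));
    Ok(()) // normal path
}

fn main() {
    println!("{:?}", submit_io("nvme0"));   // prints Err("injected fault")
    println!("{:?}", submit_io("virtio0")); // prints Ok(())
}
```

Because the fault action is an arbitrary expression, the same callsite can model early returns, panics, or pointer corruption without changing the driver's normal-path code.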
Fault scenarios tested
- Domain isolation violation: inject `umka_fault_inject!(driver, FaultType::DomainWrite, /* write to wrong PKEY */)`; verifies MPK/DACR/POE catches the fault and the driver is reloaded without kernel panic.
- Null pointer dereference: inject a null dereference in a Tier 1 driver handler; verifies fault containment and recovery within 50–150 ms.
- Infinite loop: inject `loop {}` in a driver kthread; verifies the per-driver watchdog timer (`DRIVER_WATCHDOG_TIMEOUT_MS = 5000`) fires and kills the driver.
- DMA to wrong address: inject an out-of-bounds DMA descriptor; verifies the IOMMU fault is caught, the driver is torn down, and no kernel memory is corrupted.
- Tier 2 process crash: inject `abort()` in a Tier 2 driver process; verifies the umka-core supervisor restarts it within 10 ms.
- Repeated crashes: inject a crash on every restart; verifies the auto-demotion policy engages after `DRIVER_MAX_RESTART_ATTEMPTS = 3`.
- I/O in flight during crash: inject a crash mid-I/O; verifies all in-flight requests complete with `-EIO` and no request objects leak.
Each test verifies: (1) the system does not panic, (2) the driver recovers within the target time, (3) applications see errors but can retry, and (4) no memory is leaked.
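The auto-demotion policy exercised by the repeated-crashes scenario can be sketched as a small state machine built around the documented `DRIVER_MAX_RESTART_ATTEMPTS = 3` constant. The `Supervisor` type and `Action` enum are illustrative assumptions.

```rust
const DRIVER_MAX_RESTART_ATTEMPTS: u32 = 3;

#[derive(Debug, PartialEq)]
enum Action {
    Restart, // reload the driver and retry
    Demote,  // stop restarting; e.g. mark failed and detach the device
}

#[derive(Default)]
struct Supervisor {
    restart_attempts: u32,
}

impl Supervisor {
    /// Decides what to do when the supervised driver crashes.
    fn on_crash(&mut self) -> Action {
        if self.restart_attempts < DRIVER_MAX_RESTART_ATTEMPTS {
            self.restart_attempts += 1;
            Action::Restart
        } else {
            Action::Demote
        }
    }
}

fn main() {
    let mut sup = Supervisor::default();
    for i in 1..=4 {
        // Crashes 1-3 get a restart; crash 4 triggers demotion.
        println!("crash {i}: {:?}", sup.on_crash());
    }
}
```

A real supervisor would also decay `restart_attempts` after a stable-uptime window, so a driver that crashes once a month is not eventually demoted.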
23.3.4 CI Pipeline
Every commit triggers:
1. cargo build --target x86_64-unknown-none
2. cargo test (host-side unit tests)
3. QEMU boot test (basic boot + shutdown)
4. kabi-compat-check (no ABI breaks)
5. cargo clippy (lint pass)
6. cargo fmt --check (formatting)
Every merge to main additionally triggers:
7. LTP syscall conformance suite
8. Docker container boot test
9. Performance benchmark suite (vs stored Linux baseline)
10. Crash recovery fault injection suite
23.4 Technical Risks
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| MPK provides only 16 domains | Medium | Certain | Group related drivers by fault domain (all block share domain, all net share domain). 12 driver-available domains on x86 (4 keys reserved for infrastructure: PKEY 0=core, 1=shared descriptors, 14=shared DMA, 15=guard; per Section 10.2/5). AArch64 POE has 7 usable indices (1-7), of which 3 are available for Tier 1 driver domains (indices 3-5; indices 1-2 reserved for umka-core, 6 for userspace, 7 for temporary/debug; per Section 23.4.3). See "MPK Domain Grouping" below for degraded isolation analysis. |
| eBPF verifier complexity | High | High | Verifier subsystem is ~30K SLOC in Linux (counting kernel/bpf/verifier.c at ~23K SLOC as of v6.12, plus btf.c, log.c, range-tracking helpers, and test infrastructure — the ~30K figure covers the full verification subsystem, not verifier.c alone). Start with subset of program types, expand incrementally. UmkaOS implements a clean-room Rust verifier and JIT (GPL avoidance); the eBPF bytecode format and helper API are compatible with Linux but the implementation is original. |
| KVM deeply integrated with Linux MM | High | High | Design memory manager with KVM hooks from the start (Phase 1 architecture). Dedicate a team to KVM from Phase 4. |
| Driver coverage gap blocks adoption | Critical | High | Cloud-first strategy (VirtIO covers 100% of VMs). Prioritize top-20 drivers. Agentic rewrite pipeline for open-source drivers. |
| Subtle syscall compatibility bugs | High | High | LTP conformance suite, real-world application testing, syzkaller fuzzing. Build a comprehensive test matrix of applications. |
| Spectre/Meltdown mitigations + domain isolation | Medium | Medium | KPTI not needed for Tier 1 (same Ring 0). Tier 2 needs standard KPTI. Retpoline/IBRS for indirect branches. Test on affected hardware. |
| IOMMU not available on all hardware | Medium | Medium | IOMMU required for Tier 1 DMA fencing. Systems without IOMMU fall back to trusted mode (reduced isolation, logged warning). |
| ARM64 lacks direct MPK equivalent | Medium | Certain | Use POE (FEAT_S1POE, 7 usable indices of which 3 are for Tier 1 drivers, optional from ARMv8.9+) or page-table fallback. Adaptive isolation policy (Section 10.2) allows per-driver tier pinning or promotion to Tier 0 on pre-POE hardware. |
| No fast isolation on pre-2020 x86 | Medium | Certain | Adaptive isolation policy: isolation=performance promotes Tier 1 to Tier 0 (Linux-equivalent speed, no memory isolation). IOMMU DMA fencing still active. |
| Rust ecosystem maturity for OS dev | Low | Medium | Established patterns from Redox, Linux rust-for-linux, Hubris. Use #![no_std] and custom allocator. Unsafe blocks at hardware boundaries are expected and audited. |
| Performance target too ambitious | Medium | Medium | 5% target is for macro benchmarks. Micro-benchmarks may show higher overhead on specific paths. Batch amortization and careful profiling. |
| Community adoption / contributor pipeline | Medium | Medium | Clean SDK, good documentation, lower barrier than Linux driver development. Cloud-first focus builds credibility before desktop push. |
| Regulatory / certification barriers | Low | Low | Work with distributions early. Open-source everything except vendor proprietary blobs. |
| LZ4/Zstd kernel implementation correctness | Medium | Medium | Fuzzing, comparison with reference implementation. Use no_std BSD-licensed implementations with comprehensive test vectors. |
| Object namespace overhead on hot paths | Low | Low | Lazy registration for high-frequency objects (fds, sockets, VMAs). Eagerly registered objects only (~2000 baseline = ~384 KB). |
| CBS scheduling fairness under edge cases | Medium | Medium | Formal analysis against CBS paper (Abeni 1998), stress testing with adversarial workloads, comparison with Linux cpu.max behavior. |
23.4.1 Risks from Advanced Features (Chapters 16-18)
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| TEE hardware fragmentation (SEV-SNP vs TDX vs CCA) | High | Certain | Abstract behind ConfidentialContext trait (Section 8.6.3). Implement one backend at a time. SEV-SNP first (largest cloud deployment), TDX second, CCA third. |
| PQC algorithm instability (NIST may revise) | Medium | Medium | Algorithm-agile abstraction (Section 8.5.2). Algorithms behind enum dispatch; swapping ML-KEM for a successor is a library update, not a kernel redesign. |
| PQC signature sizes impact IPC latency | Low | Certain | ML-DSA-65 signatures are 3,309 bytes (per NIST FIPS 204, Table 2). Cold-path only (capability minting, not every IPC call). SignatureData::Heap variant avoids ring buffer bloat (Section 8.5). |
| RT + domain isolation interaction causes priority inversion | High | Medium | Domain switch (WRPKRU on x86) is ~23 cycles (no lock needed). Domain switching is O(1) — no contention path. If priority inheritance needed for domain-shared buffers, use PI futexes (Section 7.2.3). |
| Formal verification scope creep | Medium | Medium | Verify only security-critical paths: capability table, IPC ring, page table mapping (Section 23.10). Accept that ~80% of kernel code is tested, not verified. |
| DPU vendor lock-in (proprietary firmware) | Medium | High | KABI vtable for OffloadTransport (Section 5.2). DPU-specific code is behind the same driver isolation as any Tier 1 device. Vendor-specific logic in driver, not kernel. |
| PMEM/CXL hardware not yet widely deployed | Low | High | Design is hardware-agnostic (Section 14.7). All PMEM code compiles out when hardware is absent. CXL 3.0 adoption expected 2025-2027; architecture ready, implementation deferred. |
| Unified compute model adds scheduling overhead | Medium | Low | Advisory overlay only — existing schedulers unchanged (Section 21.6). Topology queries are O(1) reads from cached ComputeCapacityProfile. No hot-path cost. |
| Live kernel evolution causes state corruption | Critical | Low | Post-swap watchdog with 5-second timer (Section 12.6). On crash, the system attempts to re-extract state from the failing component; if extraction fails, the system panics rather than reverting to stale state, preventing silent data corruption. State serialization uses versioned HMAC integrity tags. |
| Intent optimizer makes poor decisions | Low | Medium | Intent system is purely advisory (Section 6.7). Clamping prevents invalid resource configs. Worst case: system falls back to static defaults (no intent optimization). |
23.4.2 Risk Response Priority
- Driver coverage (Critical): Addressed by cloud-first strategy + agentic rewrite
- Syscall compatibility (High): Addressed by LTP + application test matrix
- eBPF complexity (High): Addressed by incremental implementation
- KVM integration (High): Addressed by early architectural planning
- TEE fragmentation (High): Addressed by trait-based abstraction
- RT + domain isolation interaction (High): Addressed by O(1) domain switching design
- Domain limit (Medium): Addressed by driver grouping policy
- Live evolution safety (Critical but low likelihood): Addressed by watchdog + state HMAC integrity checks
23.4.3 Domain Grouping: Degraded Isolation Analysis
When more than 12 Tier 1 drivers are loaded simultaneously, some drivers must share an isolation domain (protection key). This is an inherent limitation of Intel's 16-key PKU design (16 keys minus PKEY 0 for umka-core, minus PKEY 1 for shared descriptors, minus PKEY 14 for shared DMA, minus PKEY 15 as guard = 12 usable). Grouping has concrete consequences for fault isolation:
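The budget arithmetic above can be sketched directly. This is a minimal illustration, not kernel code; the key assignments (PKEY 0 = umka-core, PKEY 1 = shared descriptors, PKEY 14 = shared DMA, PKEY 15 = guard) are taken from this section, and the function names are hypothetical:

```rust
// x86 PKU domain budget: 16 keys minus 4 infrastructure keys = 12 driver domains.
const TOTAL_PKEYS: usize = 16;
const RESERVED_PKEYS: [usize; 4] = [0, 1, 14, 15]; // umka-core, descriptors, DMA, guard

/// Number of protection keys left for Tier 1 driver domains.
fn usable_driver_domains() -> usize {
    TOTAL_PKEYS - RESERVED_PKEYS.len()
}

/// True when the loaded driver count exceeds the budget and grouping kicks in.
fn grouping_required(loaded_tier1_drivers: usize) -> bool {
    loaded_tier1_drivers > usable_driver_domains()
}

fn main() {
    assert_eq!(usable_driver_domains(), 12);
    assert!(!grouping_required(5));  // typical cloud server: no grouping needed
    assert!(grouping_required(13));  // heavily-configured desktop: grouping required
    println!("usable Tier 1 domains on x86 PKU: {}", usable_driver_domains());
}
```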
What grouping preserves:
- IOMMU isolation: each driver retains its own IOMMU domain regardless of domain grouping. DMA fencing is unaffected — a crashing NVMe driver cannot DMA into a NIC driver's buffers, even if they share an isolation domain.
- Capability isolation: each driver has its own capability set. Sharing an isolation domain does not grant access to another driver's capabilities.
- Crash detection: fault injection and page-fault trapping still identify the crashing driver (via instruction pointer, not isolation domain).
What grouping degrades:
- Memory read/write isolation between grouped drivers. If drivers A and B share isolation domain 5, a buffer overrun in A can corrupt B's data structures. The crash is still contained (it cannot escape to umka-core or other domains), but it may take down both A and B.
- The blast radius of a crash expands from one driver to one domain group. In practice, this means a faulty NVMe driver could take down the AHCI driver if both are in the "block" group.
Grouping policy — drivers are grouped by fault domain affinity (drivers that interact heavily and would likely cascade-fail anyway):
| Isolation Domain | Group | Typical Members |
|---|---|---|
| 2 | Block storage | NVMe, AHCI/SATA, virtio-blk, iSCSI, NVMe-oF |
| 3 | Network (data) | Intel NIC, Mellanox NIC, virtio-net |
| 4 | Network (stack) | TCP/IP, UDP, RDMA core |
| 5 | Filesystem | ext4, XFS, btrfs |
| 6 | Display | DRM/KMS, GPU compute |
| 7 | KVM | VMX/SVM, vhost-net, vhost-scsi |
| 8 | USB | xHCI, USB hub, USB mass storage |
| 9-13 | Individual | High-value drivers that warrant solo domains |
AArch64 domain budget: POE provides 7 usable indices (1-7; index 0 is reserved for umka-core as the default PTE value). Of the 7 usable indices, 4 are reserved for infrastructure (index 1 for shared read-only, index 2 for shared DMA, index 6 for userspace, index 7 for temporary/debug), leaving only 3 indices for Tier 1 driver domains (indices 3-5; vs. 12 on x86). AArch64 therefore requires much more aggressive grouping.
ARM FEAT_S1POE uses a 3-bit POIndex field in page table entries, providing 8 index values (0-7). This is a hardware constraint, not a design choice: the POIndex field width is fixed by the ISA and cannot be widened. Index 0 is the default PTE value (per the ARM architecture), leaving 7 configurable indices. The grouping table above is designed for x86's 12-domain budget. On AArch64, the kernel applies a reduced grouping scheme:
- Domain 0: umka-core (default PTE value)
- Domain 1: Shared read-only
- Domain 2: Shared DMA buffer pool
- Domain 3: VFS + block I/O (merged — these are tightly coupled)
- Domain 4: Network stack
- Domain 5: All remaining Tier 1 drivers (single shared domain)
- Domain 6: Userspace (EL0 default)
- Domain 7: Temporary / debug

This reduces isolation granularity for Tier 1 drivers on AArch64 (all share one domain) but preserves the critical umka-core/driver/userspace boundaries. The architecture-specific grouping is selected at boot based on arch::current::isolation::domain_count().
Typical server scenario — a cloud server runs NVMe + NIC + TCP + KVM + virtio = 5 drivers. On x86 (12 driver domains), these fit in 5 domains with no grouping needed; the 12-domain limit only triggers on heavily-configured systems (desktop with GPU + audio + USB + Bluetooth + WiFi + NVMe + SATA + NIC + ...). On AArch64 with POE (3 driver domains), even this typical 5-driver configuration requires grouping -- the reduced scheme above merges block I/O, networking, and remaining drivers into 3 shared domains. Architectures with more domains (ARMv7 DACR: 15, PPC32 segments: 15) behave more like x86.
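The boot-time selection can be sketched as follows. This is a simplified model assuming domain_count() returns the total hardware index count; the enum, threshold, and function names are hypothetical stand-ins for the real arch::current::isolation interface:

```rust
#[derive(Debug, PartialEq)]
enum GroupingScheme {
    Full,    // >= 12 driver domains: use the grouping table above (x86 PKU, ARMv7 DACR)
    Reduced, // few driver domains: merged AArch64-style scheme above
}

// Hypothetical stand-in for arch::current::isolation::domain_count():
// total hardware domain indices on the boot architecture.
fn domain_count() -> usize {
    16 // pretend we booted on x86 with PKU
}

/// Pick the grouping scheme from the hardware index budget.
/// 15-16 indices (x86 PKU, ARMv7 DACR, PPC32 segments) keep the full table;
/// POE's 8 indices (3 usable driver domains) force the merged scheme.
fn select_grouping(total_indices: usize) -> GroupingScheme {
    if total_indices >= 15 {
        GroupingScheme::Full
    } else {
        GroupingScheme::Reduced
    }
}

fn main() {
    assert_eq!(select_grouping(domain_count()), GroupingScheme::Full); // x86 PKU
    assert_eq!(select_grouping(8), GroupingScheme::Reduced);           // FEAT_S1POE
    println!("selected: {:?}", select_grouping(domain_count()));
}
```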
Monitoring — when grouping occurs, UmkaOS logs a warning:
umka: isolation domain 1 shared by nvme, ahci (reduced isolation: crash in either affects both)
This allows administrators to make informed decisions about which drivers to load as Tier 2 (full process isolation, unlimited domains) if they require stronger isolation than domain grouping provides.
# Appendices
Reference material, comparison tables, and open questions.
23.5 Licensing Model: Open Kernel License Framework (OKLF) v1.3
UmkaOS uses the Open Kernel License Framework (OKLF) v1.3 (see OKLF-v1.3.md for the full legal text). Key elements:
Base license: GPLv2-only with additional permissions (Sections 2-5 of OKLF). All kernel code — umka-core, umka-kernel, umka-compat, umka-net, umka-vfs, umka-block, umka-kvm, tools, and boot code — is GPLv2. This ensures:
- All kernel modifications must be open-sourced
- Proprietary forks are impossible
- Same legal framework the Linux ecosystem understands
Approved Linking License Registry (ALLR): A curated, append-only list of open-source licenses approved for use with kernel code. Tiers 1-2 may link with kernel code directly (Tier 0/1 drivers). Tier 3 licenses are GPL-incompatible and may NOT link with the kernel; Tier 3 code runs only behind the KABI IPC boundary — as Tier 2 process-isolated drivers by default, or as Tier 1 drivers where noted (see the CDDL note below) — so no linking ever occurs:
- Tier 1 (weak copyleft, GPL-compatible): MPL-2.0, LGPL-2.1, EPL-2.0 (with Secondary License designation; see note below)
- Tier 2 (permissive): MIT, BSD-2, BSD-3, Apache-2.0, ISC, Zlib
- Tier 3 (incompatible — process isolation required, no linking): CDDL-1.0, CDDL-1.1, LGPL-3.0, EUPL-1.2 (see note below)
LGPL-3.0 incompatibility with GPLv2-only: LGPL-3.0 is incompatible with GPLv2-only code per the FSF compatibility matrix. LGPL-3.0 is defined as GPLv3 plus additional permissions (its opening paragraph: "This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License"). Since GPLv3 is incompatible with GPLv2-only (see GPLv3 exclusion note below), LGPL-3.0 inherits that incompatibility. LGPL-3.0 code must NOT be linked into the UmkaOS kernel. LGPL-3.0 code communicates with the kernel via KABI IPC only (Tier 3, process isolation required). Note that LGPL-2.1 IS compatible with GPLv2 and remains in Tier 1.
EUPL-1.2 classification (Tier 3): EUPL-1.2 is a strong copyleft license that the FSF classifies as GPL-incompatible. While EUPL Article 5 provides a compatibility list (including GPLv2, GPLv3, LGPL, AGPL, MPL-2.0, EPL-1.0, CeCILL) that allows EUPL-licensed code to be relicensed under those licenses when combined with code under those licenses, the FSF's position is that EUPL-1.2's copyleft is "comparable to the GPL's, and incompatible with it" by itself. UmkaOS places EUPL-1.2 in Tier 3 (process isolation required, no linking with kernel code) as the conservative default. EUPL-1.2 code that has been explicitly relicensed to GPLv2 via Article 5 by its copyright holder may then be treated as GPLv2 code and used in Tier 0/1. Without explicit relicensing, EUPL-1.2 code runs as a Tier 2 process-isolated driver communicating via KABI IPC only.
EPL-2.0 GPL compatibility: EPL-2.0 is GPL-compatible only when the distributor explicitly designates GPL as a Secondary License per EPL-2.0 Section 3.2. Without this designation, EPL-2.0 is GPL-incompatible. UmkaOS requires EPL-2.0 dependencies to carry the Secondary License designation; undesignated EPL-2.0 code is treated as Tier 3 (process isolation required, no linking with kernel code). ALLR Tier 1 inclusion applies only to EPL-2.0 code that explicitly carries the Secondary License designation for GPLv2. Enforcement: the KABI module loader checks for the Secondary License designation in the module's license metadata at load time. EPL-2.0 modules without the designation are rejected for Tier 0/1 loading and must run as Tier 2 process-isolated drivers. Additionally, EPL-2.0's patent grant (Section 2.2) requires contributors to grant a patent license for their contributions; UmkaOS cannot enforce this at a technical level, so EPL-2.0 code in Tier 1 carries an implicit assumption that upstream contributors have complied with Section 2.2. Code review should verify the Secondary License designation is present in the upstream project's license header, not just claimed in module metadata.
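The loader-side license gate described above can be sketched as follows. All type and function names here are hypothetical; the real KABI module loader reads this metadata from the driver manifest:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum AllrTier {
    Tier1, // weak copyleft, GPL-compatible: may link
    Tier2, // permissive: may link
    Tier3, // GPL-incompatible: KABI IPC only, no linking
}

struct LicenseMeta<'a> {
    spdx_id: &'a str,
    /// EPL-2.0 Section 3.2 Secondary License designation for GPLv2.
    secondary_license_gplv2: bool,
}

/// Classify a module's license against the ALLR tiers from this section.
fn classify(meta: &LicenseMeta) -> AllrTier {
    match meta.spdx_id {
        "MPL-2.0" | "LGPL-2.1" => AllrTier::Tier1,
        // EPL-2.0 is Tier 1 ONLY with the Secondary License designation.
        "EPL-2.0" if meta.secondary_license_gplv2 => AllrTier::Tier1,
        "MIT" | "BSD-2-Clause" | "BSD-3-Clause" | "Apache-2.0" | "ISC" | "Zlib" => AllrTier::Tier2,
        // CDDL, LGPL-3.0, EUPL-1.2, undesignated EPL-2.0: no linking.
        _ => AllrTier::Tier3,
    }
}

/// Tier 0/1 loading is refused for Tier 3 code.
fn may_load_in_kernel(meta: &LicenseMeta) -> bool {
    classify(meta) != AllrTier::Tier3
}

fn main() {
    let designated = LicenseMeta { spdx_id: "EPL-2.0", secondary_license_gplv2: true };
    let undesignated = LicenseMeta { spdx_id: "EPL-2.0", secondary_license_gplv2: false };
    assert_eq!(classify(&designated), AllrTier::Tier1);
    assert_eq!(classify(&undesignated), AllrTier::Tier3);
    assert!(!may_load_in_kernel(&undesignated)); // rejected for Tier 0/1 loading
}
```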
GPLv3 exclusion from ALLR: GPLv3 is deliberately excluded from the ALLR. UmkaOS's kernel is licensed GPLv2-only (not "GPLv2 or later"). GPLv3 is incompatible with GPLv2-only code per the FSF: GPLv3's additional requirements (the Installation Information / anti-tivoization terms of its Section 6, the patent terms of its Sections 10-11) constitute "further restrictions" that GPLv2 Section 6 prohibits. Code licensed GPLv3-only cannot be linked into a GPLv2-only kernel. Code licensed "GPLv2 or later" CAN be used (under its GPLv2 grant), but code licensed GPLv3-only cannot. Adding GPLv3 to the ALLR would create a false impression that GPLv3-only code may be linked with the kernel. If GPLv3-only code must be used, it must run as a Tier 1 or Tier 2 driver (same as CDDL), communicating via KABI IPC with no linking.
CDDL and GPL incompatibility: CDDL is GPL-incompatible per the FSF. CDDL-licensed code may run as Tier 1 or Tier 2 drivers — KABI provides the license boundary at both tiers. Despite CDDL appearing in the ALLR, no linking occurs between CDDL code and GPL kernel code. CDDL drivers communicate exclusively via KABI IPC (ring buffer message passing, vtable dispatch, one resolved symbol: __kabi_driver_entry) — no shared symbols, no function calls across the license boundary. This provides more isolation than Linux's EXPORT_SYMBOL_GPL boundary (where modules ARE linked into the kernel). Statically-linked (Tier 0) CDDL code is NOT permitted, as static linking creates a derivative work. The KABI boundary ensures CDDL and GPL code never form a single "work" in the copyright sense.
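The single-symbol boundary can be illustrated with a minimal sketch. The vtable layouts below are illustrative assumptions, not the real umka-driver-sdk types; only the __kabi_driver_entry symbol name comes from this section:

```rust
// Services the kernel offers the driver. Append-only: new entries are
// added at the end of the struct, never reordered or removed.
#[repr(C)]
pub struct KernelServicesVTable {
    pub abi_version: u32,
    pub log: extern "C" fn(msg: *const u8, len: usize),
}

// Entry points the driver hands back to the kernel.
#[repr(C)]
pub struct DriverVTable {
    pub abi_version: u32,
    pub start: extern "C" fn() -> i32,
    pub stop: extern "C" fn(),
}

/// The ONE symbol the loader resolves across the license boundary.
/// Everything else crosses via the exchanged vtables and IPC rings.
#[no_mangle]
pub extern "C" fn __kabi_driver_entry(
    _services: *const KernelServicesVTable,
) -> *const DriverVTable {
    extern "C" fn start() -> i32 { 0 } // bring up the device (stubbed here)
    extern "C" fn stop() {}            // quiesce the device (stubbed here)
    static VTABLE: DriverVTable = DriverVTable { abi_version: 1, start, stop };
    &VTABLE
}

fn main() {
    // What the loader does after resolving the entry symbol.
    let vt = __kabi_driver_entry(std::ptr::null());
    assert_eq!(unsafe { (*vt).abi_version }, 1);
    assert_eq!(unsafe { ((*vt).start)() }, 0);
    println!("driver vtable exchanged, abi v1");
}
```

The design point is that vtable dispatch through exchanged function pointers never creates shared symbols, so no static or dynamic linking in the copyright sense occurs.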
New licenses added via governance process (60-day review, supermajority LGB vote). Licenses are never removed (append-only for legal certainty).
Proprietary kernel-space code explicitly prohibited (OKLF Section 4.2(c)): Any code that loads into kernel address space and accesses internal kernel symbols is a derivative work and must comply with GPLv2 or an ALLR-listed license. This removes Linux's 30-year "gray area" about proprietary kernel modules.
Proprietary user-space drivers explicitly permitted (OKLF Section 4.2(b)): Code interacting with the kernel exclusively through the stable userspace interface (syscalls, /proc, /sys, VFIO, UIO, FUSE, eBPF) is not a derivative work. This maps directly to our Tier 2 driver model — hardware vendors who cannot open-source their drivers may use user-space driver frameworks with full isolation.
Anti-tivoization stance (OKLF Section 11.1): OKLF encourages but does not mandate installation information disclosure. The OKLF adds only additional permissions to GPLv2 (a copyright holder may always grant extra permissions on its own code), never additional restrictions. Anti-tivoization protection is achieved indirectly: the KABI stability guarantee means users can always replace a Tier 1/2 driver binary without modifying the kernel, making hardware lockdown of individual drivers less effective.
Firmware exception (OKLF Section 4.3): Binary firmware that runs on separate processors (GPU microcode, Wi-Fi firmware, SSD firmware) is outside the license scope. Distributed separately in firmware/. Code running on the main CPU is NOT firmware.
Legal risk acknowledgment — OKLF is a novel license framework built on GPLv2. While it is designed to be GPLv2-compatible (a copyright holder may always grant additional permissions on its own code), it has not been tested in court and constitutes a novel legal approach that should not be relied upon without independent legal review. Key risks:
1. The ALLR mechanism may be viewed by some lawyers as an untested extension of the "linking exception" concept — FSF/SFLC review is recommended before v1.0 final.
2. The OKLF provides weaker anti-tivoization protection than GPLv3, an accepted tradeoff for GPLv2 compatibility — OKLF cannot mandate installation information disclosure without violating GPLv2's "no further restrictions" clause (Section 6).
3. Ecosystem adoption depends on corporate legal teams accepting OKLF as GPLv2-compatible — even if legally sound, unfamiliarity may slow adoption.
4. The "additional permissions" model is well-established in principle (e.g., the GCC Runtime Library Exception, the Linux kernel's syscall-boundary note), but OKLF's scope (ALLR registry, driver tier classification, firmware exception) goes beyond typical additional permissions — a court could find that some OKLF provisions constitute "further restrictions," which GPLv2 Section 6 prohibits. This risk is mitigated by careful drafting but cannot be eliminated without judicial precedent.

UmkaOS should seek early legal review from SFLC or equivalent, and provide a "plain GPLv2" fallback for organizations that cannot accept OKLF's additional terms.
KABI Driver SDK: The umka-driver-sdk crate (ABI type definitions, vtable layouts, ring buffer protocol, DMA types) is dual-licensed Apache-2.0 OR MIT. This is the interface contract — drivers of any ALLR-listed license can link against these types without friction.
How this maps to our driver tiers:
| Tier | Location | License requirement | OKLF section |
|---|---|---|---|
| Tier 0 (boot-critical) | In-kernel, static | GPLv2 or ALLR | 4.1 (in-tree) |
| Tier 1 (domain-isolated) | Ring 0, loaded | GPLv2 or ALLR | 4.2 (out-of-tree open-source) |
| Tier 2 (user-space) | Ring 3, process | Any (incl. proprietary) | 4.2(b) (userspace interface) |
Three ABI stability tiers (extending OKLF Section 10.2):
| Interface | Stable? | Policy |
|---|---|---|
| Internal kernel APIs | No | May change between any two releases |
| KABI (driver ABI) | Yes | Versioned, append-only, binary-stable |
| Userspace ABI (syscalls) | Yes | Never broken without extended deprecation |
| Concern | How addressed |
|---|---|
| Prevent proprietary kernel forks | GPLv2 copyleft |
| Allow ZFS (CDDL) | CDDL in ALLR Tier 3 — ZFS runs as a Tier 1 driver (KABI IPC provides license boundary, no linking occurs) |
| Allow Nvidia GPU (proprietary) | Tier 2 user-space driver via VFIO |
| Allow BSD/MIT drivers | BSD/MIT in ALLR — full kernel-space access |
| Force kernel improvements to be open | GPLv2 copyleft on all kernel crates |
| Module enforcement | Kernel refuses non-compliant modules by default |
| Clear legal boundaries | OKLF explicit text, not legal gray area |
23.6 Project Structure
Note: This appendix describes the target project structure at full implementation. The current codebase (see CLAUDE.md "Project Structure") contains the foundational crates (umka-kernel, umka-core, umka-driver-sdk, umka-compat, umka-net, umka-vfs, umka-block, umka-kvm). Additional crates listed below (e.g., umka-accel, umka-cluster, drivers/) will be added as their corresponding architecture sections are implemented.
umka-kernel/
Cargo.toml # Workspace root (all crates)
ARCHITECTURE.md # This document
umka-core/ # Microkernel core
Cargo.toml
src/
main.rs # Boot entry point (calls arch-specific init)
cap/ # Capability system
mod.rs # Capability types, tables, operations
revocation.rs # Generation-based revocation
mem/ # Memory management
phys.rs # Physical page allocator (buddy)
vmm.rs # Virtual memory manager (maple tree, VMAs)
page_cache.rs # Page cache (RCU radix tree)
slab.rs # Slab allocator for kernel objects
pcid.rs # PCID/ASID management
huge.rs # Huge page (THP + explicit) support
sched/ # Scheduler
mod.rs # Scheduler core, class dispatch
cfs.rs # CFS/EEVDF fair scheduler
rt.rs # RT FIFO/RR scheduler
deadline.rs # Deadline (EDF/CBS) scheduler
balance.rs # NUMA-aware load balancer
ipc/ # IPC and isolation
mpk.rs # MPK domain management, WRPKRU helpers
ring.rs # Shared-memory ring buffers
tier2_ipc.rs # Cross-address-space IPC for Tier 2
arch/ # Architecture-specific Rust code
mod.rs # Architecture trait definitions
x86_64/ # x86-64 implementation
mod.rs
gdt.rs # GDT setup
idt.rs # IDT and interrupt dispatch
apic.rs # Local APIC driver (Tier 0)
timer.rs # HPET/TSC/APIC timer (Tier 0)
mpk.rs # MPK hardware interface
vmx.rs # VMX support for KVM
aarch64/ # ARM64 implementation (phase 2+)
mod.rs
armv7/ # ARMv7 implementation (phase 2+)
mod.rs
riscv64/ # RISC-V 64 implementation (phase 2+)
mod.rs
ppc32/ # PPC32 implementation (phase 2+)
mod.rs
ppc64le/ # PPC64LE implementation (phase 2+)
mod.rs
umka-compat/ # Linux syscall interface + compat shims
Cargo.toml
src/
syscall/ # ~450 syscall dispatch table
mod.rs # SyscallHandler enum, dispatch table
process.rs # fork, clone, execve, exit, wait
file.rs # open, read, write, close, ioctl
memory.rs # mmap, brk, mprotect, madvise
network.rs # socket, bind, listen, accept, connect
time.rs # clock_gettime, nanosleep, timer_*
misc.rs # getpid, getuid, uname, sysinfo
proc/ # /proc filesystem emulation
mod.rs
meminfo.rs # /proc/meminfo
cpuinfo.rs # /proc/cpuinfo
pid.rs # /proc/[pid]/* (maps, status, fd, etc.)
sys.rs # /proc/sys/* (sysctl interface)
sys/ # /sys filesystem emulation
mod.rs
devices.rs # /sys/devices/ device tree
class.rs # /sys/class/ device classes
bus.rs # /sys/bus/ bus enumeration
dev/ # /dev filesystem emulation
mod.rs
devtmpfs.rs # devtmpfs-compatible device nodes
signal/ # Signal handling
mod.rs
delivery.rs # Signal delivery to user space
handlers.rs # Default handlers, core dump
namespace/ # Linux namespace implementation
mod.rs
mnt.rs # Mount namespace
pid.rs # PID namespace
net.rs # Network namespace
user.rs # User namespace
ipc.rs # IPC namespace
uts.rs # UTS namespace
cgroup.rs # Cgroup namespace
time.rs # Time namespace
cgroup/ # Cgroup v1/v2
mod.rs
v2.rs # Unified hierarchy (primary)
v1_compat.rs # Legacy hierarchy (compatibility)
controllers/ # cpu, memory, io, pids, etc.
io_uring/ # io_uring subsystem
mod.rs
ring.rs # SQ/CQ ring management
sqpoll.rs # SQPOLL kernel thread
ops.rs # Operation dispatch
lsm/ # Linux Security Modules
mod.rs
hooks.rs # Hook framework
selinux.rs # SELinux policy engine
apparmor.rs # AppArmor profile engine
seccomp.rs # seccomp-bpf filter
ebpf/ # eBPF subsystem
mod.rs
vm.rs # eBPF virtual machine
verifier.rs # Static verifier
jit/ # JIT compilers
x86_64.rs
aarch64.rs
armv7.rs
riscv64.rs
ppc32.rs
ppc64le.rs
maps.rs # Map types (hash, array, ringbuf, etc.)
helpers.rs # eBPF helper functions
programs.rs # Program types (XDP, tc, kprobe, etc.)
umka-net/ # Network stack (runs as Tier 1)
Cargo.toml
src/
tcp/ # TCP/IP implementation
udp/ # UDP implementation
ip/ # IP layer (v4 + v6)
arp.rs # ARP
icmp.rs # ICMP
netfilter/ # nftables + iptables compatibility
mod.rs
nft.rs # nftables engine
conntrack.rs # Connection tracking
nat.rs # NAT (SNAT, DNAT, masquerade)
xdp/ # XDP fast path
socket.rs # Socket abstraction
tunnel/ # Tunnel protocol modules (Section 15.2)
mod.rs # TunnelDevice trait
vxlan.rs # VXLAN encap/decap
geneve.rs # Geneve encap/decap
gre.rs # GRE/GRE6
ipip.rs # IPIP/SIT
wireguard.rs # WireGuard VPN
bridge/ # Software L2 switch (Section 15.2)
mod.rs # Bridge device, FDB, STP
vlan.rs # 802.1Q VLAN filtering
veth.rs # Virtual ethernet pairs
macvlan.rs # macvlan/ipvlan devices
vrf.rs # Virtual Routing and Forwarding
umka-vfs/ # Virtual filesystem layer (Tier 1)
Cargo.toml
src/
mod.rs # VFS dispatch, mount table
ext4/ # ext4 filesystem
xfs/ # XFS filesystem
btrfs/ # btrfs filesystem
tmpfs/ # tmpfs (in-memory)
overlayfs/ # OverlayFS (for containers)
dcache.rs # Directory entry cache
umka-block/ # Block I/O layer (Tier 1)
Cargo.toml
src/
mod.rs # Block device abstraction
scheduler.rs # I/O schedulers (mq-deadline, none, bfq)
partition.rs # Partition table parsing (GPT, MBR)
dm/ # Device-mapper framework (Section 14.3)
mod.rs # DM core: target dispatch, table management
linear.rs # dm-linear
striped.rs # dm-striped
mirror.rs # dm-mirror
crypt.rs # dm-crypt (AES-XTS)
verity.rs # dm-verity
snapshot.rs # dm-snapshot (COW)
thin.rs # dm-thin-pool
md.rs # MD RAID (0/1/5/6/10) superblock compat
lvm.rs # LVM2 metadata reader
recovery.rs # Recovery-aware volume state machine
iscsi/ # iSCSI block storage (Section 14.4)
mod.rs # iSCSI common: PDU parsing, session state
initiator.rs # iSCSI initiator (RFC 7143)
target.rs # iSCSI target (LIO-compatible config)
iser.rs # iSER — RDMA transport for iSCSI
chap.rs # CHAP authentication
multipath.rs # dm-multipath integration
nvmeof/ # NVMe over Fabrics (Section 14.4)
mod.rs # NVMe-oF common: capsule parsing, queue pairs
host.rs # NVMe-oF initiator (host) — connect, I/O
target.rs # NVMe-oF target (subsystem) — nvmetcli compat
tcp.rs # NVMe/TCP transport (TP 8000)
rdma.rs # NVMe/RDMA transport (TP 8001)
discovery.rs # Discovery controller client/server
ana.rs # ANA multipath — asymmetric namespace access
umka-kvm/ # KVM hypervisor (Tier 1)
Cargo.toml
src/
mod.rs # /dev/kvm interface
vmx.rs # Intel VMX
svm.rs # AMD SVM
mmu.rs # Nested page tables (EPT/NPT)
tee/ # Confidential VM support (Section 8.6)
sev.rs # AMD SEV-SNP guest/host
tdx.rs # Intel TDX guest/host
cca.rs # ARM CCA realm management
umka-accel/ # AI/ML accelerator subsystem (Section 21.1)
Cargo.toml
src/
mod.rs # AccelBase trait, device registration
scheduler.rs # CBS-based accelerator scheduler
hmm.rs # Heterogeneous memory management
p2p.rs # Peer-to-peer DMA (PCIe, NVLink, CXL)
inference.rs # In-kernel inference engine
rdma.rs # RDMA and collective ops
umka-cluster/ # Distributed kernel (Section 5.1)
Cargo.toml
src/
mod.rs # Cluster topology, node discovery
transport.rs # KernelTransport (RDMA, CXL, TCP)
ipc.rs # Distributed IPC proxy
dsm.rs # Distributed shared memory
dlm.rs # Distributed Lock Manager (Section 14.6)
global_pool.rs # Global memory pool
scheduler.rs # Cluster-wide scheduling
caps.rs # Network-portable capabilities
umka-driver-sdk/ # Stable driver SDK
Cargo.toml
interfaces/ # .kabi IDL definitions
block_device.kabi # Block device interface
net_device.kabi # Network device interface
gpu_device.kabi # GPU device interface
input_device.kabi # Input device interface
usb_device.kabi # USB device interface
char_device.kabi # Character device interface
pci_device.kabi # PCI device interface
platform_device.kabi # Platform device interface
src/
lib.rs # SDK entry point, driver registration
abi.rs # Generated stable ABI types
dma.rs # DMA buffer management
mmio.rs # MMIO access helpers (volatile read/write)
irq.rs # Interrupt handling
ring.rs # Ring buffer helpers for driver use
manifest.rs # Driver manifest parsing
drivers/ # In-tree drivers
tier0/ # Boot-critical (statically linked)
apic/ # Local APIC + I/O APIC
timer/ # PIT / HPET / TSC
serial/ # Early serial console
vga/ # Early VGA text console
tier1/ # Performance-critical (domain-isolated)
nvme/ # NVMe SSD driver
virtio_blk/ # VirtIO block device
virtio_net/ # VirtIO network device
virtio_gpu/ # VirtIO GPU
virtio_console/ # VirtIO console
e1000/ # Intel e1000 NIC
igb/ # Intel igb NIC
ahci/ # AHCI/SATA controller
ext4/ # ext4 driver component
tier2/ # Isolated (user-space process)
usb_xhci/ # USB XHCI host controller
usb_hid/ # USB HID (keyboard, mouse)
usb_storage/ # USB mass storage
hda_audio/ # Intel HDA audio
input/ # Input subsystem (evdev)
tools/
kabi-compiler/ # .kabi IDL -> Rust/C code generator
Cargo.toml
src/
main.rs
parser.rs # IDL parser
codegen_rust.rs # Rust binding generator
codegen_c.rs # C binding generator
kabi-compat-check/ # ABI compatibility CI checker
Cargo.toml
src/
main.rs # Diffs old vs new .kabi, rejects breaks
umka-initramfs/ # Initramfs builder tool
Cargo.toml
src/
main.rs # Packs drivers + early userspace
arch/ # Architecture-specific C/asm
x86_64/
boot/ # UEFI/BIOS boot stub (C + asm)
header.S # Linux boot protocol header
main.c # Early C boot code
efi_stub.c # UEFI stub
asm/
entry.S # Syscall entry/exit
switch.S # Context switch
irq_stubs.S # Interrupt stub table
vdso/
vdso.lds # vDSO linker script
clock_gettime.c # clock_gettime implementation
getcpu.c # getcpu implementation
aarch64/
boot/ # ARM64 boot stub
asm/ # ARM64 assembly
vdso/ # ARM64 vDSO
riscv64/
boot/ # RISC-V boot stub
asm/ # RISC-V assembly
vdso/ # RISC-V vDSO
ppc32/
boot/ # PPC32 boot stub
asm/ # PPC32 assembly
vdso/ # PPC32 vDSO
ppc64le/
boot/ # PPC64LE boot stub
asm/ # PPC64LE assembly
vdso/ # PPC64LE vDSO
tests/
abi_compat/ # Old driver binaries for compat regression
syscall/ # Linux syscall conformance (LTP-based)
driver/ # Driver integration tests
bench/ # Performance regression benchmarks
crash_recovery/ # Fault injection + recovery verification
23.7 What UmkaOS Provides That Linux Cannot
| Feature | Linux | UmkaOS |
|---|---|---|
| Driver crash recovery | Kernel oops or panic depending on fault type. Many driver bugs produce oops (system continues with degraded functionality) rather than panic. Recovery requires at minimum driver module reload; severe faults cause panic and full reboot (30-60s). | Reload driver in ~50-150ms (Tier 1) or ~10ms (Tier 2) |
| Stable driver ABI | None (recompile every update) | Versioned, append-only, binary-stable KABI |
| Driver isolation | None (shared address space) | Domain isolation + IOMMU (Tier 1), full process (Tier 2) |
| Capability-based security | Bolt-on (POSIX caps are coarse) | Foundational architecture |
| Lock ordering enforcement | Runtime lockdep (debug only) | Compile-time via Rust type system: type-level lock ordering using phantom type parameters that encode lock level in the type signature (e.g., Lock<Level3>), preventing out-of-order acquisition at compile time. See umka-core lock design (Section 7.2). |
| io_uring security | Bypasses syscall monitoring | Per-instance operation whitelist |
| Hot driver upgrade | Fragile (unstable ABI) | Clean stop/start with stable KABI |
| Memory safety | C everywhere | Rust with minimal unsafe at hardware boundaries |
| Many-core scalability | Known bottlenecks (RTNL for networking, inode_lock for VFS, cgroup_mutex for cgroups) | No global locks, per-CPU/per-NUMA everywhere |
| Proactive fault management | Ad-hoc (mcelog, rasdaemon) | Unified FMA with diagnosis engine (Section 19.1) |
| Memory compression | zswap/zram (separate, config-heavy) | Integrated NUMA-aware zpool tier (Section 4.2) |
| CPU bandwidth guarantee | No floor mechanism | CBS-backed cpu.guarantee (Section 6.3) |
| Stable observability ABI | Tracepoints are unstable | Versioned, documented stable tracepoints (Section 19.2) |
| Verified boot chain | Fragmented (UEFI SB + IMA + dm-verity) | Unified chain from firmware to drivers (Section 8.2) |
| Kernel object introspection | Per-subsystem (/proc, /sys, scattered) | Unified object namespace via umkafs (Section 19.4) |
| Driver state preservation | Lost on crash — cold restart | Checkpointed state buffer, warm restart (Section 10.8) |
| Core panic diagnostics | kexec + kdump (complex setup) | In-place crash dump to reserved memory (Section 10.8) |
| Context switch XSAVE cost | Eager XSAVE with XSAVEOPT/XSAVES optimizations (skips unmodified components, but still saves full state for context switches involving SIMD). UmkaOS's lazy approach avoids save/restore entirely for non-SIMD threads. | Lazy XSAVE — zero cost for non-SIMD threads (Section 6.1.6) |
| CPU errata management | Scattered #ifdef, ad-hoc | Structured quirk table + boot-param controls (Section 2.1.4) |
| Volume layer + driver crash | Device marked failed, RAID resync | Recovery-aware: pause I/O, resume clean (Section 14.3) |
| VM guest driver crash | VM reboot required | Driver recovers in-place, hypervisor unaware (Section 17.1) |
| Block storage networking | Separate stacks (open-iscsi, nvme-cli, no unified recovery) | Unified iSCSI + NVMe-oF with RDMA upgrade and crash recovery (Section 14.4) |
| Clustered FS + driver crash | Node fenced, ejected from cluster | Driver recovers in-place, node stays in cluster (Section 14.5) |
| Distributed locking | TCP-based DLM (~10-100 μs/op depending on lock locality; local locks <1 μs), global recovery quiesce on any node failure | RDMA-native DLM (~2-3 μs uncontested, ~5-10 μs contested), per-resource recovery, lease-based extension, batch ops (Section 14.6) |
| TPM key management | Userspace daemon (tpm2-abrmd) | Kernel-native resource manager + capability integration (Section 8.3) |
| Runtime integrity | IMA bolted onto VFS, optional | Integrated with capability system and driver loading (Section 8.4) |
| Display stack crash | X/Wayland session lost | DMA-BUF survives driver reload, compositor stalls ~100ms-5s (full recovery window; Section 21.5.2.6) |
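The compile-time lock ordering row above can be illustrated with a small sketch: lock levels become phantom type parameters, and a guard at one level can only acquire locks at a strictly higher level. This is a simplified model of the Section 7.2 design; all names here are illustrative:

```rust
use std::marker::PhantomData;
use std::sync::{Mutex, MutexGuard};

// Lock levels, ordered L1 < L2 < L3.
struct L1; struct L2; struct L3;

// "Self is a strictly higher level than L" — the permitted acquisition edges.
trait Above<L> {}
impl Above<L1> for L2 {}
impl Above<L1> for L3 {}
impl Above<L2> for L3 {}

struct OrderedLock<L, T> { inner: Mutex<T>, _level: PhantomData<L> }
struct Held<'a, L, T> { guard: MutexGuard<'a, T>, _level: PhantomData<L> }

impl<L, T> OrderedLock<L, T> {
    fn new(v: T) -> Self { Self { inner: Mutex::new(v), _level: PhantomData } }
    /// Entry point when no lock is held yet.
    fn lock(&self) -> Held<'_, L, T> {
        Held { guard: self.inner.lock().unwrap(), _level: PhantomData }
    }
}

impl<'a, L, T> Held<'a, L, T> {
    /// Acquire a second lock only if its level is strictly above ours.
    fn lock_next<'b, M: Above<L>, U>(&self, next: &'b OrderedLock<M, U>) -> Held<'b, M, U> {
        next.lock()
    }
}

fn main() {
    let cap_table = OrderedLock::<L1, u32>::new(7);
    let page_map = OrderedLock::<L3, u32>::new(9);
    let g1 = cap_table.lock();
    let g3 = g1.lock_next(&page_map); // L1 -> L3: compiles
    // g3.lock_next(&cap_table);      // L3 -> L1: rejected at compile time
    assert_eq!(*g3.guard, 9);
    println!("acquired {} then {}", *g1.guard, *g3.guard);
}
```

Out-of-order acquisition fails type-checking because no Above impl exists for the reversed pair, so the deadlock-prone interleaving can never be written.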
23.8 Cross-Feature Integration Map
23.8.1 Cross-Feature Integration Map
These features are not independent — they reinforce each other:
Formal verification (Section 23.10) ──────► Confidential computing (Section 8.6)
Proves capability system correct Relies on correct capability enforcement
Safe extensibility (Section 18.7) ◄──────► Live evolution (Section 12.6)
Policy modules are hot-swappable Evolution uses the same mechanism
Intent-based management (Section 6.7) ◄──► In-kernel inference (Section 21.4)
Intent optimizer uses learned models Models optimize for declared intents
EAS / heterogeneous CPU (Section 6.1.5) ◄──► Power budgeting (Section 6.4)
EAS picks energy-optimal core Power budget enforces watt cap
| Feature A | | Feature B | Interaction |
|---|---|---|---|
| Power budgeting (Section 6.4) | ◄──► | Intent-based management (Section 6.7) | Power budget is a constraint; intents include an efficiency preference |
| Hardware memory safety (Section 2.3) | ──► | Tier 1 driver isolation (Section 10.4) | MTE catches C driver bugs; domain isolation catches the resulting faults |
| Confidential computing (Section 8.6) | ──► | Distributed kernel (Section 5.1) | TEE-to-TEE RDMA; DSM coherence for encrypted pages |
| Post-quantum crypto (Section 8.5) | ──► | Distributed capabilities (Section 5.1.10) | PQC signatures on capabilities; network-portable across the cluster |
| SmartNIC/DPU (Section 5.2) | ◄──► | Distributed kernel (Section 5.1) | DPU = close remote node; same proxy driver pattern |
| Persistent memory (Section 14.7) | ◄──► | Memory tiers (Section 21.2) | Persistent memory is another tier, managed by the same PageLocationTracker |
| Computational storage (Section 14.8) | ◄──► | Accelerator framework (Section 21.1) | CSD = storage accelerator; same AccelBase vtable |
| Unified compute (Section 21.6) | ◄──► | EAS / heterogeneous CPU (Section 6.1.5) | Multi-dimensional capacity extends scalar; CPU capacity is a special case |
| Unified compute (Section 21.6) | ◄──► | Accelerator scheduler (Section 21.1.2.4) | Cross-device topology + energy data; accel scheduler consumes the advisory |
| Unified compute (Section 21.6) | ◄──► | Power budgeting (Section 6.4) | Workload profile drives throttle; informed cross-device power decisions |
| Unified compute (Section 21.6) | ◄──► | Intent-based management (Section 6.7) | compute.weight feeds the intent optimizer; the optimizer adjusts per-domain knobs |
| Unified compute (Section 21.6) | ◄──► | Distributed kernel (Section 5.1) | Peer kernel nodes via NodeTransport; accelerator = close compute node |
| Unified compute (Section 21.6) | ◄──► | SmartNIC/DPU offload (Section 5.2) | Same convergence: device → peer node; NodeTransport unifies both transports |
| Distributed Lock Manager (Section 14.6) | ◄──► | RDMA transport (Section 5.1.4) | DLM uses RDMA CAS/Send for locks; the transport provides the kernel RDMA API |
| Distributed Lock Manager (Section 14.6) | ◄──► | Cluster membership (Section 5.1.12) | DLM receives join/leave/dead events; single heartbeat source for both |
| Distributed Lock Manager (Section 14.6) | ◄──► | Clustered filesystems (Section 14.5) | GFS2/OCFS2 use DLM for coordination; DLM lock modes map to FS operations |
| Distributed Lock Manager (Section 14.6) | ◄──► | Driver recovery (Section 10.8) | DLM in umka-core survives driver crashes; no lock recovery needed on Tier 1 reload |
Bootstrap Circular Dependency:
The intent optimizer (Section 6.7) uses in-kernel inference models (Section 21.4), but those models may not be loaded at early boot. Resolution: the intent optimizer degrades gracefully to static defaults when models are unavailable. At boot:
1. The intent optimizer starts with hardcoded heuristics (e.g., "latency target → raise cpu.weight by 20%").
2. When the inference engine loads its models (typically within seconds of boot), the optimizer transitions to learned optimization.
3. The transition is seamless — no reconfiguration needed.
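A minimal sketch of this fallback (all names are hypothetical stand-ins, not the real UmkaOS API; the adjustment factors are illustrative):

```rust
/// Hypothetical sketch of the boot-time fallback described above: consult
/// the inference engine once its models are loaded, otherwise apply the
/// hardcoded heuristic. Illustrative only.
struct InferenceEngine {
    models_loaded: bool,
}

impl InferenceEngine {
    /// Stand-in for a learned recommendation from a loaded model.
    fn recommend_cpu_weight(&self, current: u32) -> u32 {
        current + current / 4 // placeholder for model output
    }
}

/// New cpu.weight for a "latency target" intent.
fn optimize_latency_intent(engine: &InferenceEngine, cpu_weight: u32) -> u32 {
    if engine.models_loaded {
        // Learned optimization path, active once models are loaded.
        engine.recommend_cpu_weight(cpu_weight)
    } else {
        // Static default: raise cpu.weight by 20% (hardcoded heuristic).
        cpu_weight + cpu_weight / 5
    }
}
```

The caller never needs reconfiguration: the same entry point is used before and after the model load, which is what makes the transition seamless.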
23.8.2 D.2 Implementation Dependency Graph
Foundation (no dependencies):
├── Formal verification readiness (Section 23.10) — design methodology
├── Post-quantum crypto abstraction (Section 8.5) — data structure sizing
└── Real-time preemption model (Section 7.2) — lock design
Early integration:
├── Hardware memory safety (Section 2.3) — needs memory allocator
├── Power budgeting (Section 6.4) — needs scheduler
└── Safe extensibility (Section 18.7) — needs KABI vtable mechanism
Mid integration:
├── Confidential computing (Section 8.6) — needs memory manager, IOMMU
├── Intent-based management (Section 6.7) — needs inference engine, cgroups
└── Live evolution (Section 12.6) — needs extensibility mechanism
Late integration:
├── SmartNIC/DPU offload (Section 5.2) — needs proxy driver, device registry
├── Persistent memory (Section 14.7) — needs VFS, memory tiers
├── Computational storage (Section 14.8) — needs AccelBase framework
├── Unified compute topology (Section 21.6) — needs AccelBase, EAS (Section 6.1.5), power budgeting (Section 6.4)
└── Peer kernel nodes (Section 21.6.13) — needs unified compute + distributed kernel (Section 5.1)
23.9 Open Questions
The following cross-cutting items require further design work. Each is tracked as an open question with the affected sections and the specific decision to be made.
io_uring integration (affects Section 18.1.5, Section 8.6, Section 12.6, Section 5.2):
- Registered buffers in confidential computing: io_uring pre-registers DMA buffers
at setup time. When a VM runs under SEV-SNP, these buffers must be in shared
(unencrypted) memory. Decision needed: register-time enforcement vs. lazy conversion.
- State migration during live evolution: io_uring's SQ/CQ rings, registered files, and
registered buffers constitute persistent state. The live evolution framework (Section 12.6) needs
a StateSerializer for io_uring context. Decision needed: drain-and-recreate vs.
in-place serialization.
- DPU submission offload: DPUs can process io_uring submission queues directly, bypassing
host CPU for network and storage operations. Decision needed: how the DPU reads SQ
entries (shared memory mapping vs. DMA push) and how completions are posted back to CQ.
GPU virtualization (affects Section 21.1, Section 8.6):
- Confidential GPU VMs require that GPU VRAM is encrypted and attestable. SEV-SNP does not natively protect PCIe device memory. TDX Connect (Intel) and ARM CCA device assignment are emerging but not yet stable. Decision needed: software bounce buffer path (safe, slow) vs. hardware-assisted device encryption (fast, hardware-dependent).
- Nested virtualization with GPU passthrough: a confidential VM running a nested hypervisor that passes through a GPU adds three layers of IOMMU translation. Decision needed: whether to support this (performance may be prohibitive).
Testing strategy for cross-feature interactions (affects all Section 8.6-Section 23.2):
- Combinatorial explosion: 15 features yield 105 pairwise interactions. Exhaustive
testing is infeasible. Prioritized critical pairs:
1. RT + confidential computing (latency impact of memory encryption)
2. Power budgeting + intent optimization (conflicting objectives)
3. MTE + DSM page migration (tag preservation across RDMA transfer)
4. Live evolution + RT (component swap during hard-RT operation)
5. DPU offload + confidential computing (encrypted DPU-host channel)
- Test matrix and CI strategy:
- Each prioritized pair has a dedicated integration test suite in tests/compat/.
- CI runs the top 5 pairs on every PR. The remaining 100 pairs run nightly on
the develop branch. Failures block merge to master.
- Acceptance threshold per pair: P99 latency regression < 5%, zero correctness
failures on 10,000 test iterations, zero sanitizer findings (KASAN/KCSAN/KMSAN
equivalents via UmkaOS's compile-time and runtime checks).
- For confidential-computing pairs (CC + RT, CC + DSM, CC + GPU): additional
attestation correctness check — remote verifier must accept the measurement
after every feature combination is enabled.
- Fuzz-assisted testing: syzkaller-style syscall fuzzer runs the top 10 pairs
for 24 hours per release candidate, targeting namespace + cgroup + LSM
interactions (historically the highest-bug-density intersection).
- Test coverage gate: each pair's integration test suite must achieve ≥80%
branch coverage of the relevant subsystem code paths before being declared
"passing" (not just "not crashing").
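The pair arithmetic above is simple combinatorics: n features give n·(n−1)/2 unordered pairs, so 15 features give 105, and subtracting the 5 per-PR pairs leaves 100 for the nightly run. A one-line check:

```rust
/// Number of unordered feature pairs among `n` features: C(n, 2).
fn pairwise_interactions(n: u64) -> u64 {
    n * (n - 1) / 2
}
```

With 15 features this yields 105 pairs; running 5 on every PR leaves 100 for the nightly matrix, matching the figures in the text.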
Secure boot measurement chain (affects Section 8.2, Section 8.6, Section 18.7, Section 12.6):
- Live kernel evolution PCR and remote attestation protocol: RESOLVED — see the
"Live Evolution attestation chain" entry below for the full decision (dedicated
PCR[16]/PCR[23], hash-chained event log, TPM2_PolicyAuthorize for sealed secrets).
- Policy modules (Section 18.7) loaded at runtime must also be measured. Decision needed: whether
measurement is mandatory (blocks unsigned modules) or advisory (measure but allow).
CXL 3.0 fabric management (affects Section 5.1, Section 14.7, Section 21.6):
- CXL 3.0 introduces fabric-attached memory with hardware-managed coherence. Decision needed: how this integrates with the distributed kernel's software DSM protocol (Section 5.1.6). Options: CXL replaces DSM for intra-rack while DSM remains for inter-rack; or DSM degrades gracefully when CXL is available.
Multi-architecture parity for advanced features (affects Section 2.2, Section 8.6-Section 23.2):
- Many features in Section 8.6-Section 23.2 are specified in terms of architecture-specific mechanisms (WRPKRU, SEV-SNP, and RAPL on x86-64; MTE on ARM64). Equivalents exist on the other architectures for some but not all of them. Partially addressed: Section 2.2 now includes an "Advanced Feature Architecture Parity" matrix covering 8 key features across all six architectures. Remaining decision: per-feature acceptance criteria for "software fallback" vs. "not supported" (performance thresholds, testing requirements).
eBPF verifier completeness (affects Section 18.1.4): RESOLVED — Full verifier, phased delivery.
Decision: UmkaOS implements a full eBPF verifier equivalent to Linux's verifier.
Rationale:
- UmkaOS targets 100% Linux userspace compatibility. Tools in widespread production use (BCC, bpftrace, libbpf, Cilium, Falco) rely on full verifier semantics — type-safe map access, bounded loops, all helper prototypes. A partial verifier silently rejects valid programs, breaking these tools without any useful error message.
- Safety: a partial verifier is not a safe subset — it is an incomplete verifier. The historical CVEs (CVE-2021-31440, CVE-2021-3490, CVE-2022-23222) arose from incorrect bounds tracking and register-state pruning in the full verifier codebase, not from attempting full verification. The solution is a correct full verifier, not a simpler but still-incorrect one.
- The Linux verifier is a well-understood reference implementation whose semantics are fully documented via the BPF ISA specification and the kernel's internal type system. UmkaOS's clean-room Rust reimplementation targets semantic equivalence, not code equivalence, allowing a cleaner design that avoids the accumulated technical debt of the Linux C implementation.
Verifier capabilities (all required, no deferred items):
- Type safety: all register types are tracked through every instruction — scalars, pointers to map values, pointers to ctx fields, pointers to stack slots, packet data pointers. Type propagation through helper calls uses the full helper prototype table. Pointer arithmetic on typed pointers is tracked with offset bounds.
- Memory bounds checking: every load and store is proven in-bounds before JIT emission. For map value pointers: [0, map.value_size). For ctx pointers: a per-program-type access matrix (e.g., __sk_buff field access rules for TC programs). For stack slots: [-512, 0). For packet data pointers: [data, data_end) with an explicit data_end check before access.
- Termination: bounded loop analysis using the loop bound counter mechanism introduced in Linux 5.3. Maximum iterations per loop: 8,388,608 (8M) by default, matching Linux. Programs that cannot prove termination within the bound are rejected. Back-edge detection uses DFS on the CFG; a back edge without a proven decreasing bound variable is a hard rejection.
- Helper function verification: every bpf_call instruction is checked against the full helper prototype table. Argument types are checked (e.g., ARG_PTR_TO_MAP requires a loaded map fd; ARG_PTR_TO_MEM | MEM_RDONLY requires a readable stack slot or map value). Return value types are recorded for type propagation.
- JIT safety: programs that pass verification are JIT-compiled; programs that fail verification are rejected at BPF_PROG_LOAD time with a structured error report (verifier log, available via BPF_OBJ_GET_INFO_BY_FD). Unverified execution of eBPF bytecode is never permitted at any privilege level.
- Privileged/unprivileged split: without CAP_BPF (and CAP_PERFMON for tracing programs), program types are restricted to socket filters and cgroup skb programs. Pointer arithmetic on packet data is allowed; pointer arithmetic on map values and ctx fields is disallowed. This matches Linux's allow_ptr_leaks and bypass_spec_v1 verifier flags for unprivileged contexts.
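The per-pointer-kind bounds rules can be sketched as a single predicate. This is an illustration of the rules as stated, not the UmkaOS verifier (a real verifier tracks symbolic ranges per register rather than checking one concrete access):

```rust
/// Illustrative pointer kinds with their valid offset ranges, mirroring
/// the bounds rules in the text. Names and types are assumptions.
enum PtrKind {
    MapValue { value_size: i64 }, // valid offsets: [0, value_size)
    Stack,                        // valid offsets: [-512, 0)
    Packet { data_end_off: i64 }, // valid offsets: [0, data_end - data)
}

/// Is a `width`-byte access at offset `off` provably in bounds?
fn access_ok(ptr: &PtrKind, off: i64, width: i64) -> bool {
    match ptr {
        // Map value: entire access must fit within [0, value_size).
        PtrKind::MapValue { value_size } => off >= 0 && off + width <= *value_size,
        // Stack slot: entire access must fit within [-512, 0).
        PtrKind::Stack => off >= -512 && off + width <= 0,
        // Packet data: requires a prior data_end check; the proven
        // distance to data_end bounds the access.
        PtrKind::Packet { data_end_off } => off >= 0 && off + width <= *data_end_off,
    }
}
```

A load that fails this predicate would be rejected before JIT emission rather than guarded at runtime.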
Phase assignment:
- Phase 2 — eBPF bytecode interpreter + verifier for scalar types and simple map access. Program types: BPF_PROG_TYPE_SOCKET_FILTER and BPF_PROG_TYPE_CGROUP_SKB. Map types: BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_PERF_EVENT_ARRAY. No loops (loop-free control flow only). This subset is sufficient for seccomp-bpf and basic network filtering, satisfying the Phase 2 exit criteria (Docker hello-world).
- Phase 3 — full verifier: complete type system including all pointer kinds, bounded loop analysis, all core helper prototypes (>200 helpers). Program types: all types required for bpftrace, BCC, and Cilium CNI — XDP, TC, KPROBE/KRETPROBE, TRACEPOINT, PERF_EVENT, CGROUP_*, SK_*, FLOW_DISSECTOR. JIT backend for x86-64. Map types: all Linux-equivalent types including BPF_MAP_TYPE_RINGBUF, BPF_MAP_TYPE_SOCKHASH, BPF_MAP_TYPE_LPM_TRIE.

  Implementation risk note: the full eBPF verifier is among the most complex single components in the entire implementation plan. Linux's verifier + BTF implementation (verifier.c + btf.c) required years of iterative security hardening to reach production grade. Phase 3 delivers functional completeness (correct programs accepted, incorrect programs rejected); production-grade security hardening against adversarial programs is an ongoing concern through Phase 4-5. Operators running untrusted eBPF programs should use unprivileged BPF restrictions (CAP_BPF required) until the verifier has accumulated sufficient security review.
- Phase 4 — JIT backend for AArch64. Verifier additions: struct_ops program type (required for TCP congestion control via eBPF and sched_ext schedulers).
- Phase 5 — JIT backend for RISC-V 64. Full parity on all six supported architectures: programs compiled for x86-64 are re-verified and JIT-compiled on each architecture; verifier output is architecture-independent (the verifier itself is not JIT-backend-specific).
io_uring + SEV-SNP shared buffer management (affects Section 18.1.5, Section 8.6):
RESOLVED — see Section 18.1.5.1.
Resolution: bounce buffer architecture. SQE/CQE rings remain in encrypted guest memory
(kernel and userspace share the same encryption domain). DMA data payloads are bounced
through a pre-allocated C-bit-clear pool (default 16 MiB per ring, 64 MiB system-wide).
Plaintext bounce buffers are acceptable for block I/O (dm-crypt handles encryption above
io_uring); network buffers carrying secrets use an opt-in IORING_REGISTER_BUFFERS_ENCRYPTED
flag for per-buffer AES-GCM encryption (~1 us per 4 KiB). Performance impact: ~0.6-1.0 us
per I/O for the two extra memcpy operations, additive to SEV-SNP's 5-15% baseline.
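A toy sketch of the bounce-pool accounting described above (the type names, error type, and bump-style layout are assumptions; real staging would copy the payload into C-bit-clear memory and, for opted-in buffers, AES-GCM-seal it first):

```rust
/// Default C-bit-clear pool size per ring, per the resolution text.
const RING_POOL_BYTES: usize = 16 * 1024 * 1024; // 16 MiB

/// Hypothetical per-ring bounce pool, modeled as a bump allocator.
struct BouncePool {
    used: usize,
}

#[derive(Debug, PartialEq)]
enum BounceError {
    PoolExhausted,
}

impl BouncePool {
    fn new() -> Self {
        BouncePool { used: 0 }
    }

    /// Stage `len` payload bytes through the shared (unencrypted) pool,
    /// returning the offset of the staging region. `encrypt` models the
    /// opt-in IORING_REGISTER_BUFFERS_ENCRYPTED flag: when set, the
    /// payload would be AES-GCM-encrypted before the copy (~1 us / 4 KiB).
    fn stage(&mut self, len: usize, encrypt: bool) -> Result<usize, BounceError> {
        if self.used + len > RING_POOL_BYTES {
            return Err(BounceError::PoolExhausted);
        }
        let offset = self.used;
        self.used += len;
        // Real path: encrypt-if-requested, then memcpy into C-bit-clear
        // memory at `offset`; the second memcpy happens on completion.
        let _ = encrypt;
        Ok(offset)
    }
}
```

The two memcpy operations implied by `stage` and its completion-side counterpart are the source of the ~0.6-1.0 us per-I/O cost cited above.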
GPU VRAM encryption for confidential VMs (affects Section 21.1, Section 8.6):
- NVIDIA H100 supports CC (Confidential Computing) mode with hardware-encrypted VRAM and attestable GPU firmware. AMD MI300X has a less mature confidential computing ecosystem: MI300X provides memory encryption via AMD Infinity Fabric (SME/SEV-based whole-system encryption), but full VRAM encryption integration with SEV-SNP guest VMs — where individual guest VM memory is isolated from the hypervisor — should be verified against the current hardware silicon revision and ROCm driver availability before relying on it.
- Decision needed: a software fallback path where GPU computations operate on encrypted host memory via bounce buffers (10-100x slower due to PCIe round-trips and CPU-side encryption), or restricting confidential GPU workloads to inference-only (model weights are public; only input/output needs encryption) and encrypting only the host-to-GPU and GPU-to-host transfer buffers.
CXL 3.0 coherence domain interaction with DSM (affects Section 5.1.6, Section 5.1.13):
- CXL 3.0 Type 3 devices provide hardware-coherent shared memory between hosts via the CXL.mem protocol with back-invalidate support. This capability overlaps with the software DSM protocol defined in Section 5.1.6.
- Decision needed: when a CXL 3.0 fabric is available between two nodes, should the DSM protocol defer entirely to CXL hardware coherence (simpler, lower latency at ~200 ns vs. ~5 microseconds for software DSM, but limited to CXL-connected nodes within a single rack), or should DSM provide a unified abstraction that uses CXL as a fast transport underneath (more complex, but a uniform API across CXL and non-CXL nodes)? The hybrid approach adds a transport-selection layer to DSM that routes coherence traffic over CXL when available and falls back to RDMA otherwise.
Live Evolution attestation chain (affects Section 12.6, Section 8.2): RESOLVED — Deferred attestation via dedicated auxiliary PCR and structured event log.
Decision: Dedicated PCR with hash-chained event log; TPM2_PolicyAuthorize for sealed secrets.
Re-measuring the entire kernel image on each component swap is rejected: it requires the re-measurer to know the full composition of every other loaded component, which is not available to the kernel itself during a hot-swap (components may be loaded from different packages at different times). Extending a single "current kernel" PCR with the new image hash after each swap collapses all ordering information — a verifier cannot distinguish "component A then B" from "component B then A" — and makes rollback detection impossible.
Mechanism: Auxiliary PCR Hash Chain
A dedicated TPM PCR is reserved exclusively for live evolution measurements. It is never extended by the boot firmware, bootloader, or the initial kernel load (those measurements go into their standard PCRs per the TCG PC Client Platform Firmware Profile specification).
PCR assignment by phase:
- Phase 3 development: PCR[16]. PCR[16] is designated by the TCG specification as a
"debug" PCR that is resettable via TPM2_PCR_Reset while the platform is in debug
mode. This allows iterative testing of the attestation chain without requiring a reboot
to clear state after each test run.
- Production (Phase 4+): PCR[23]. PCR[23] is the standard "application-specific"
PCR reserved for OS and application use. It is not reset by firmware transitions and
is not extended by any standard boot component, making it clean for UmkaOS's exclusive
use. The transition from PCR[16] to PCR[23] is a compile-time constant
UMKA_LIVE_EVOLUTION_PCR that changes between development and production builds.
Extension protocol: Before activating a hot-swapped component, UmkaOS executes:
PCR[UMKA_LIVE_EVOLUTION_PCR] =
SHA-256(PCR_current || component_sha256 || component_metadata_hash)
where component_metadata_hash = SHA-256(component_name || component_version || load_timestamp_ns).
This creates a cryptographically ordered chain: each PCR value commits to all previously loaded components in their exact load order. Reordering components or omitting any component produces a different PCR value that no attestation policy will accept.
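The ordering property of the extend chain is easy to demonstrate as a fold. In this sketch, std::hash::DefaultHasher stands in for SHA-256 so the example stays dependency-free; only the chaining structure matters, not the hash function:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One PCR extension step: new = H(current || component_hash || metadata_hash).
/// DefaultHasher is a stand-in; the real chain uses SHA-256 over the raw
/// byte concatenation as shown in the formula above.
fn extend(pcr: u64, component_hash: u64, metadata_hash: u64) -> u64 {
    let mut h = DefaultHasher::new();
    pcr.hash(&mut h);
    component_hash.hash(&mut h);
    metadata_hash.hash(&mut h);
    h.finish()
}

/// Replay (component_hash, metadata_hash) events in sequence order from
/// the baseline value. A verifier recomputes this fold and compares the
/// result to the quoted PCR value.
fn replay(baseline: u64, events: &[(u64, u64)]) -> u64 {
    events.iter().fold(baseline, |pcr, &(c, m)| extend(pcr, c, m))
}
```

Because each step hashes the previous PCR value, reordering or omitting any event changes the final value, which is exactly the property the attestation policy relies on.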
Event log entry: Each extension appends a LiveEvolutionEvent record to the TPM
event log (the standard binary log at /sys/kernel/security/tpm0/binary_bios_measurements,
extended by the kernel via tpm_pcr_extend and the event log API):
/// One record appended to the TPM event log per hot-swapped component.
/// Written before PCR extension; if the extension fails, the record is
/// removed and activation is aborted.
#[repr(C)]
pub struct LiveEvolutionEvent {
/// Event type tag. UmkaOS uses type 0x00000085 (first unallocated vendor
/// range per the TCG spec); log parsers read event_type to skip records
/// they do not understand.
pub event_type: u32,
/// Monotonically increasing sequence number across all live-evolution events
/// since boot. Starts at 1; the baseline extension at boot is sequence 0.
pub sequence: u64,
/// SHA-256 of the component binary payload before signature stripping.
pub component_sha256: [u8; 32],
/// SHA-256 of the metadata fields below (for independent verification).
pub metadata_hash: [u8; 32],
/// Null-terminated UTF-8 component name, e.g. "umka-net" or "umka-nvme".
pub component_name: [u8; 64],
/// Null-terminated UTF-8 semantic version string, e.g. "2.1.0+build.4711".
pub component_version: [u8; 32],
/// Nanoseconds since boot (CLOCK_BOOTTIME) at activation time.
pub load_timestamp_ns: u64,
/// ML-DSA-44 signature over all preceding fields, signed by the
/// kernel's live-evolution signing key (provisioned at boot from the
/// UmkaOS signing certificate in the UEFI Secure Boot db).
pub signature: [u8; 2420],
}
The event log is the complete record of the system's live evolution history. Remote
attestation verifiers reconstruct the expected PCR value by replaying all
LiveEvolutionEvent records in sequence order and computing the same extend chain.
The signature on each record lets verifiers authenticate individual events without
trusting the log's storage integrity (the PCR value itself provides tamper detection
for the chain as a whole).
Baseline event at boot: During kernel initialization, before any component is
hot-swapped, UmkaOS extends UMKA_LIVE_EVOLUTION_PCR with a baseline record representing
the boot-time kernel composition:
component_sha256 = SHA-256(entire_kernel_image_at_boot)
sequence = 0
component_name = "umka-kernel-baseline"
component_version = <kernel version string>
This anchors the chain to the boot measurement. A system that has never applied any live patch has a PCR value equal to this single extension. The attestation policy for "unpatched kernel" requires exactly this single-extension value.
Remote attestation protocol for live-patched kernels:
1. The relying party requests a TPM quote covering the standard boot PCRs and UMKA_LIVE_EVOLUTION_PCR, plus the full binary event log.
2. The verifier confirms the boot PCRs match the expected boot-time measurements (kernel image hash, Secure Boot policy, GRUB measurement).
3. The verifier replays the LiveEvolutionEvent records in sequence order, verifying each record's ML-DSA-44 signature against the UmkaOS signing certificate.
4. The verifier recomputes the expected UMKA_LIVE_EVOLUTION_PCR value from the replay and checks it against the quoted PCR value.
5. The verifier applies its patch policy: each component_sha256 in the event log must appear in the verifier's approved-patch database. Unknown patch hashes cause attestation failure.
TPM-sealed secrets under live evolution:
Secrets sealed to specific PCR values (e.g., disk encryption keys sealed to the boot
PCR set) cannot be unsealed after live patches change UMKA_LIVE_EVOLUTION_PCR. This
is the correct security behavior: a modified kernel must re-prove its trustworthiness
before accessing secrets.
For systems that need to unseal secrets after applying approved patches, the sealing
policy uses TPM2_PolicyAuthorize rather than TPM2_PolicyPCR with fixed values.
The TPM2_PolicyAuthorize policy delegates the unsealing decision to the holder of an
authorized signing key (the UmkaOS attestation key, provisioned at enrollment time). The
UmkaOS attestation service, after verifying the event log and confirming all applied
patches are approved, signs a policy digest that authorizes unsealing. The kernel
presents this signed authorization to the TPM alongside the quote, and the TPM unseals
the secret without requiring the PCR values to match the original sealed-time values.
Existing systems that have secrets sealed under TPM2_PolicyPCR (without
TPM2_PolicyAuthorize) must re-seal their secrets during the first maintenance window
after the live evolution feature is enabled. The umka-attestd daemon handles this
migration automatically: it unseals secrets using the old policy (which still works
before the first patch is applied), re-seals them under the new
TPM2_PolicyAuthorize-based policy, and verifies the re-seal by immediately performing
a test unseal before committing the new sealed blob to persistent storage.
This document is the canonical reference for UmkaOS development. All implementation decisions must be traceable to the architecture described here. Changes to this document require team review and approval.
23.10 Formal Verification Readiness
23.10.1 The Opportunity
Formal verification of kernel code has crossed the practicality threshold:
2009: seL4 — 200,000 lines of proof for 10,000 lines of C. Heroic effort.
2018: RustBelt — Formal soundness proof for Rust's ownership model.
2022-2025: Verus (Carnegie Mellon University, VMware Research, Microsoft Research,
ETH Zurich, and others) — Automated verification for Rust.
Write Rust code + specifications → tool PROVES correctness.
Not testing. Not fuzzing. Mathematical machine-checked proof.
Verus can verify Rust code of realistic complexity: concurrent data structures, state machines, protocols, invariant maintenance. UmkaOS is written in Rust. The verification infrastructure exists.
23.10.2 What To Verify
Not everything needs verification. Focus on security-critical invariants and concurrency-sensitive code where bugs have catastrophic consequences:
| Component | Invariant to Prove | Section |
|---|---|---|
| Capability system | Capabilities cannot be forged. Revocation is complete. Permissions never escalate. | Section 8.1.1 |
| Page table management | No page mapped into two processes simultaneously without explicit sharing. Freed pages never accessible. | Section 4.1 |
| Memory allocator | No page allocated twice. No double-free. Buddy merging preserves free-list consistency. Allocation never returns memory outside tracked ranges. | Section 4.1 |
| KABI vtable dispatch | Vtable calls never escape the driver's isolation domain. Version checks are correct. | Section 11.1 |
| IPC ring buffer | Producer-consumer protocol never loses messages, never delivers duplicates, never deadlocks. | Section 10.6 |
| CBS bandwidth server | Bandwidth guarantees are met. No starvation. | Section 6.3.4 |
| DSM coherence protocol | Multiple-reader / single-writer consistency maintained. No lost writes. | Section 5.1.6 |
| Distributed capabilities | Signature verification is correct. Revocation propagation is complete. | Section 5.1.10 |
| Power budget enforcement | Budgets are never exceeded by more than one tick interval. | Section 6.4 |
23.10.3 Design for Verifiability
Verification readiness is a design property, not a tool. Code must be structured so that specifications can be written and verified:
// Example: capability lookup with verification-ready specification.
// Verus-style annotations (compile-time only, erased from binary).
/// Lookup a capability by handle.
///
/// SPECIFICATION (verified by Verus):
/// requires: handle is valid for calling process
/// ensures: returned capability matches the one in the capability table
/// ensures: returned capability's generation <= object's current generation
/// ensures: returned capability's permissions are a subset of the
/// delegator's permissions (no escalation)
pub fn cap_lookup(
table: &CapabilityTable,
process: ProcessId,
handle: CapHandle,
) -> Result<Capability, CapError> {
    // Implementation must satisfy the specification.
    // Verus proves this at compile time. No runtime overhead.
    unimplemented!() // body elided in this illustration
}
Design rules for verifiability:
- Explicit state: no hidden mutable global state. All state is in named structures with explicit ownership. (Rust already enforces this.)
- Small critical sections: break complex operations into small, individually verifiable steps. Each step has a pre-condition and a post-condition.
- Interface contracts: every public function in security-critical modules has a documented specification (pre/post conditions, invariants). Verus verifies these.
- Algebraic data types for states: use enums with exhaustive matching instead of integer flags. The type checker ensures all states are handled.
- Monotonic counters: generation counters and version numbers use types that enforce monotonicity (they can only increase, never decrease).
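The monotonic-counter rule can be enforced by construction with a newtype whose only mutation is an increment (Generation is an illustrative name, not the real UmkaOS type):

```rust
/// A counter that can only move forward. There is no setter and no
/// decrement, so rollback is ruled out by the type system rather than
/// by a runtime check.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Generation(u64);

impl Generation {
    pub const fn zero() -> Self {
        Generation(0)
    }

    /// The only mutation: advance to the next generation.
    pub fn bump(self) -> Self {
        Generation(self.0 + 1)
    }

    pub fn value(self) -> u64 {
        self.0
    }
}
```

Deriving Ord lets callers compare generations directly, so a stale-handle check becomes a plain `<` comparison with no escape hatch for setting the counter backwards.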
23.10.4 Verification Tooling
Primary tool: Verus (Carnegie Mellon University, VMware Research, Microsoft Research, and others). Automated verification for Rust. Specification-driven proofs of functional correctness and memory safety properties.
Alternative tools (fallback if Verus hits scale limits):
- Kani (Amazon): bounded model checking for Rust. Explores all execution paths up to a configurable bound. Excellent for concurrent code and finding edge cases. Complementary to Verus — Kani finds bugs, Verus proves the absence of bugs.
- Prusti (ETH Zurich): automated verification for Rust. Different proof strategy than Verus (separation logic vs. SMT). Useful as a cross-check.
CI integration strategy:
- Every commit: debug_assert! invariant checks + lightweight type-level assertions.
Compile-time only. Seconds. Catches regressions in verified invariants.
- Every PR: Kani bounded model checks on critical modules (~5-10 min).
Catches concurrency bugs and edge cases.
- Nightly: Full Verus specification proofs (~30-60 min for verified modules).
Mathematical proof of correctness. Any proof failure blocks the next release.
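The per-commit tier can be as cheap as invariant re-checks that compile away in release builds. A hypothetical example (the free-list invariant shown is illustrative, not UmkaOS code):

```rust
/// Toy free-list whose length must always equal the tracked count,
/// standing in for a verified invariant that debug CI re-checks cheaply.
struct FreeList {
    pages: Vec<u64>,
    count: usize,
}

impl FreeList {
    /// Re-check the invariant; compiled out entirely in release builds,
    /// so the per-commit cost is seconds, as the CI tier above requires.
    fn check_invariant(&self) {
        debug_assert!(self.pages.len() == self.count, "free-list count drift");
    }

    fn push(&mut self, page: u64) {
        self.pages.push(page);
        self.count += 1;
        self.check_invariant();
    }
}
```

The same invariant would also appear as a Verus `ensures` clause on the nightly tier; the debug_assert! is only the fast regression tripwire.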
Scope of verification — what is OUT of scope: Cross-component interactions (e.g., DSM coherence protocol interacting with hardware isolation boundaries simultaneously) are beyond current tool capabilities. Individual components are verified against their specifications; the composition is validated by integration testing and fuzzing. This is an honest limitation — complete whole-system verification remains a research problem.
Unsafe Code Verification Strategy:
Rust's unsafe blocks are the primary verification target — they are where memory safety invariants must be manually upheld. The strategy:
- Verus for ownership and invariant proofs: verify that unsafe code upholds the safety contract documented in its // SAFETY: comment. Verus can reason about pointer validity, aliasing, and lifetime guarantees.
- Kani for model-checking unsafe code paths: bounded model checking explores all possible inputs to unsafe functions up to a configurable bound, catching edge cases that specifications might miss.
- Wrap unsafe in safe abstractions: every unsafe block is encapsulated in a safe function with a verified specification. Callers never touch unsafe directly. The safe wrapper's specification becomes the verification boundary.
Verification Complexity by Component:
Based on published Verus effort data and component characteristics:
| Component | Relative Complexity | Rationale |
|---|---|---|
| Capability system (Section 8.1) | Low | Small state machine, clear invariants |
| IPC ring buffer (Section 10.6) | Low | Single producer-consumer, bounded |
| Page table management (Section 4.1) | High | Many edge cases, arch-specific |
| CBS bandwidth server (Section 6.3) | Medium | Well-studied algorithm |
| DSM coherence (Section 5.1.6) | High | Distributed protocol, concurrent access |
Page table management and DSM coherence are the hardest verification targets due to arch-specific code paths and distributed state. The capability system and IPC ring buffer are the easiest starting points for building verification expertise.
23.10.5 Performance Impact
Literally zero. Verification is compile-time. Verus specifications are erased from the binary. The verified code is identical to the unverified code at runtime.
The only cost is developer time writing specifications. But this pays for itself by eliminating bugs that would otherwise require debugging, CVE patches, and emergency releases.
23.11 KABI IDL Compiler Specification
The KABI IDL language and umka-kabi-gen tool are fully specified in
Section 11.1.7.
The roadmap deliverable is to implement umka-kabi-gen conforming to that
specification. See Section 23.2.1 for the
Phase 1 build milestone.