Chapter 23: Roadmap and Verification
Driver ecosystem, implementation phases, verification strategy, technical risks, formal verification, appendices
23.1 Driver Ecosystem Strategy
23.1.1 The Challenge
Driver coverage is the single largest adoption blocker for any new kernel. Linux has thousands of drivers covering decades of hardware. UmkaOS cannot replicate this overnight.
23.1.2 Agentic Driver Rewrite Project
The key insight: all open-source Linux driver source code is available. The hardware programming logic (register sequences, DMA setup, interrupt handling) is identical regardless of kernel API. Only the kernel-facing API surface changes.
AI-assisted translation pipeline:
```
Input: Linux driver C source code (GPL, ~500-5000 LOC typical)
        |
        v
Step 1: Parse Linux kernel API calls (kmalloc, dma_alloc_coherent,
        request_irq, pci_read_config_*, etc.)
        |
        v
Step 2: Map to KABI equivalents (KernelServicesVTable methods)
        |
        v
Step 3: Translate C to Rust, preserving hardware-specific logic exactly
        |
        v
Step 4: Generate KABI driver entry point and vtable exchange
        |
        v
Output: Native Rust KABI driver
```
Human review: Verify hardware-specific sequences are preserved
Testing: Against real hardware + QEMU virtual devices
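Step 2 of the pipeline can be sketched as a lookup from Linux API names to KABI vtable methods. This is a minimal illustration: the KABI method names below are assumptions, not the actual KernelServicesVTable surface.

```rust
/// Returns an assumed KernelServicesVTable method name for a Linux kernel API
/// call, or None if the call has no direct KABI equivalent and the translated
/// driver must be flagged for human review. Mapping entries are illustrative.
fn map_linux_call(linux_api: &str) -> Option<&'static str> {
    match linux_api {
        "kmalloc" => Some("alloc"),                           // general-purpose allocation
        "dma_alloc_coherent" => Some("dma_alloc"),            // coherent DMA buffer
        "request_irq" => Some("irq_register"),                // interrupt handler registration
        "pci_read_config_dword" => Some("pci_config_read32"), // PCI config space read
        _ => None, // no known equivalent: needs manual review
    }
}

fn main() {
    for call in ["kmalloc", "request_irq", "some_obscure_helper"] {
        match map_linux_call(call) {
            Some(kabi) => println!("{call} -> {kabi}"),
            None => println!("{call} -> NEEDS MANUAL REVIEW"),
        }
    }
}
```

In the real pipeline this table would be generated from the KABI IDL rather than hand-written, so that Step 2 stays in sync with the vtable definitions.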
23.1.3 Prioritized Driver List
These drivers cover approximately 95% of real hardware in server and desktop environments:
Priority 1 -- Cloud/VM (covers 100% of cloud deployments):
1. VirtIO block (virtio-blk)
2. VirtIO network (virtio-net)
3. VirtIO GPU (virtio-gpu)
4. VirtIO console (virtio-console)
Priority 2 -- Storage (covers 99% of bare-metal storage):
5. NVMe (universal modern SSD interface)
6. AHCI/SATA (legacy HDDs and older SSDs)
Priority 3 -- Networking (covers 90% of server NICs):
7. Intel e1000/e1000e (universal VM and consumer NIC)
8. Intel igb/ixgbe/ice (server 1G/10G/25G/100G)
9. Realtek r8169 (consumer Ethernet)
10. Mellanox mlx5 (high-performance datacenter)
Priority 4 -- Human Interface (covers desktop usability):
11. USB XHCI host controller (all modern USB)
12. USB EHCI host controller (USB 2.0 legacy)
13. USB HID (keyboard, mouse)
14. USB mass storage
15. Intel HDA audio
16. i915 (Intel integrated graphics, modesetting)
17. amdgpu (AMD graphics, modesetting)
18. UVC (USB Video Class) camera driver: Phase 3/4; V4L2 API compatibility required for webcams, screen capture, and video conferencing applications.

Printing (CUPS, USB printer class, IPP) is Phase 4+; it requires a USB printing class driver plus CUPS daemon compatibility in userspace (no kernel changes beyond the USB HID/class support already in Phase 1-2).
> **Reviewer note**: Camera (V4L2) and printing (CUPS) support are **intentionally
> deferred**. Their absence in Phase 1-2 is by design, not an oversight.
> Early phases target server/container workloads that do not require these
> peripherals. Flagging missing camera or printer support as a Phase 1-2 bug
> is a false positive.
Priority 5 -- Platform (covers system management):
19. ACPI subsystem
20. PCI/PCIe enumeration and configuration
21. IOMMU (Intel VT-d, AMD-Vi)
23.1.4 Nvidia / Proprietary Driver Strategy
For Nvidia (the most critical proprietary driver):
- Nvidia's driver already has a clean internal abstraction layer between their proprietary GPU core and the "kernel interface layer" (`nvidia.ko`)
- UmkaOS provides a KABI-native implementation of this kernel interface layer
- Nvidia's proprietary compute core links against our KABI implementation
- This is more sustainable than binary `.ko` compatibility: the interface layer is small, well-defined, and stable
23.1.5 Community Incentive
The clean KABI SDK makes driver development significantly easier than it is on Linux:
- No need to track unstable internal APIs
- Rust safety eliminates entire classes of bugs
- Binary compatibility across kernel versions eliminates the recompilation burden
- Clear, documented interfaces reduce the learning curve
This lower barrier to entry is expected to attract contributors and vendors over time.
23.2 Implementation Phases
This section covers the implementation timeline for all features. The first part (Phases 1-5+) defines core kernel milestones. The Enhancement Feature Phasing and Future-Proof Feature Phasing tables below map additional features onto these same phases.
23.2.1 Phase 1: Foundations
Goal: Boot to a hello-world program.
See Section 24.3.1 (Phase 1.1: Core Kernel) and Section 24.3.2 (Phase 1.2: Multi-arch) for detailed agentic workflow steps within this roadmap phase.
- UmkaOS Core: x86-64 boot (UEFI + BIOS), physical memory allocator, basic scheduler, IPC/isolation domain infrastructure
- LinuxCompat: minimal syscalls for `execve` + `write` + `exit_group`
- Tier 0 drivers: APIC, timer, serial console
- Build system: Cargo workspace with custom target spec, linker scripts
- CI/CD: QEMU-based boot tests on every commit
- KABI compiler: implement `umka-kabi-gen`. The KABI IDL language and the `umka-kabi-gen` tool are fully specified in Section 11.1.7 (11-kabi.md).
Exit criteria: A statically linked 'Hello, world!' ELF binary runs on UmkaOS in QEMU. The KABI compiler successfully parses a minimal .kabi IDL file and generates Rust/C stubs that compile without errors.
23.2.2 Phase 2: Self-Hosting Shell
Goal: Run a busybox shell with basic utilities.
See Section 24.3.3 (Phase 2.1: Essential Drivers), Section 24.3.4 (Phase 2.2: Linux Compatibility Layer), and Section 24.3.5 (Phase 2.3: Networking Stack) for detailed agentic workflow steps within this roadmap phase.
- VFS layer: mount table, path resolution, file descriptor table
- Filesystems: tmpfs, initramfs (cpio), procfs (basic), sysfs (stub)
- Block I/O layer + VirtIO-blk driver (Tier 1)
- Memory manager: `mmap`, `brk`, page fault handler, COW, demand paging
- Process management: `fork`/`clone`, `execve`, `wait`, `exit`
- Basic signal handling: `SIGCHLD`, `SIGKILL`, `SIGTERM`, `SIGSEGV`
- Pipe and simple I/O
Exit criteria: Busybox shell boots; `ls`, `cat`, `echo`, and `ps` work.
23.2.3 Phase 3: Real Workloads
Goal: Boot systemd, run Docker containers.
See Section 24.3.6 (Phase 3.1: Storage Stack) and Section 24.3.7 (Phase 3.2: Advanced Features) for detailed agentic workflow steps within this roadmap phase.
- Full syscall coverage: approximately 330+ commonly used syscalls (from a dispatch table covering ~450 total Linux syscall numbers, with uncommon/obsolete syscalls returning -ENOSYS)
- NVMe driver (Tier 1), ext4 filesystem (read-write)
- Network: VirtIO-net driver, e1000 driver, TCP/IP stack, socket API
- Namespaces: all 8 types
- Cgroups: v2 (primary) + v1 compat
- io_uring: full implementation
- eBPF: full verifier + JIT (x86-64) + all core map types + XDP/TC/kprobe/tracepoint/cgroup programs (see Section 23.9 eBPF verifier completeness for phase breakdown)
- seccomp-bpf: for container runtime compatibility
- Full signal handling: all 64 signals, `sigaction`, `sigaltstack`
- TTY/PTY subsystem: for terminal emulators
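The dispatch-table shape described in the syscall coverage item above (a table spanning all Linux syscall numbers, with uncommon entries returning -ENOSYS) can be sketched as follows. The handler bodies are placeholder assumptions; only the table-with-default-handler structure is the point.

```rust
// Minimal sketch (not the actual UmkaOS implementation) of a syscall dispatch
// table covering ~450 Linux syscall numbers, with unimplemented entries
// defaulting to -ENOSYS.

const ENOSYS: i64 = 38; // Linux errno 38: "Function not implemented"
const NR_SYSCALLS: usize = 450;

type SyscallFn = fn(args: &[u64; 6]) -> i64;

fn sys_ni(_args: &[u64; 6]) -> i64 {
    -ENOSYS // default handler for uncommon/obsolete syscalls
}

fn sys_getpid(_args: &[u64; 6]) -> i64 {
    1 // placeholder: a real implementation reads the current task's PID
}

fn build_table() -> Vec<SyscallFn> {
    // Every slot starts as the -ENOSYS stub; implemented syscalls overwrite it.
    let mut table: Vec<SyscallFn> = vec![sys_ni; NR_SYSCALLS];
    table[39] = sys_getpid; // __NR_getpid on x86-64
    table
}

fn dispatch(table: &[SyscallFn], nr: usize, args: &[u64; 6]) -> i64 {
    // Out-of-range syscall numbers also get -ENOSYS, matching Linux behavior.
    table.get(nr).map_or(-ENOSYS, |f| f(args))
}

fn main() {
    let table = build_table();
    let args = [0u64; 6];
    println!("getpid -> {}", dispatch(&table, 39, &args));  // prints 1
    println!("nr 448 -> {}", dispatch(&table, 448, &args)); // prints -38
}
```

The same default-stub pattern is how "~330+ implemented out of ~450 numbers" stays cheap: unimplemented entries cost one table slot, not a conditional on the hot path.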
Exit criteria: Ubuntu minimal boots with systemd, Docker runs the hello-world container, iperf3 and fio benchmarks complete.
23.2.4 Phase 4: Production Ready
Goal: Drop-in replacement for specific workloads.
See Section 24.3.8 (Phase 4.1: Consumer Hardware) for detailed agentic workflow steps within this roadmap phase.
- KVM hypervisor: `/dev/kvm`, VMX, EPT, QEMU/Firecracker support
- Netfilter/nftables: connection tracking, NAT, Docker networking
- LSM framework: SELinux policy engine, AppArmor profiles
- Agentic driver rewrite: top-20 driver families ported
- Performance tuning: reach within 5% of Linux on all target benchmarks
- Crash recovery: full Tier 1/2 fault injection testing
- Package: `.deb` and `.rpm` packages for Ubuntu 24.04+ and Fedora 40+
- LTP conformance: Linux Test Project suite passing (>95% of applicable tests)
Exit criteria: UmkaOS boots unmodified Ubuntu 24.04 and Fedora 40, runs Docker + Kubernetes single-node, passes LTP, within 5% of Linux on benchmarks.
23.2.5 Phase 5: Ecosystem
Goal: Broad adoption and platform maturity.
See Section 24.3.9 (Phase 5.1: Windows Emulation Acceleration) for detailed agentic workflow steps within this roadmap phase.
- ARM64 port: full Tier 1 isolation using architecture-appropriate mechanisms
- RISC-V 64 port: same
- PPC32 port: embedded PowerPC support with segment-register isolation
- PPC64LE port: IBM POWER server support with Radix MMU isolation
- Extended driver coverage: GPU acceleration (i915, amdgpu compute), WiFi, Bluetooth
- Vendor partnerships: Nvidia KABI driver, AMD KABI driver, Intel KABI driver
- Community driver development: SDK documentation, examples, mentorship
- Distribution certification: RHEL, Ubuntu, SUSE official support
- Nested virtualization: KVM-on-KVM
- Live kernel upgrade: stop all Tier 1/2 drivers, swap core, restart drivers
23.2.6 Enhancement Feature Phasing
The kernel-internal enhancements described in Sections 4.2, 7.1, 8.2, and 18.1–18.4 have different urgency levels relative to the phases above:
| Feature | Earliest Phase | Rationale |
|---|---|---|
| Unified Object Namespace (Section 19.4) | Phase 1-2 | Foundational — other features build on it |
| Stable Tracepoints (Section 19.2) | Phase 2 | Needed for debugging from the start |
| Memory Compression (Section 4.2) | Phase 3 | Requires mature memory manager |
| Verified Boot (Section 8.2) | Phase 3 | Requires bootable system to protect |
| CPU Bandwidth Guarantees (Section 6.3) | Phase 3-4 | Requires mature scheduler + cgroups |
| Fault Management (Section 19.1) | Phase 4 | Requires mature driver ecosystem reporting health |
The following table covers the implementation timeline for advanced features (Chapters 16-18). Phase numbers align with the core kernel phases defined above. "Design-In" items (Phase 1) require data structure reservations and trait definitions but no functional implementation. Higher-phase items depend on core infrastructure being available.
| Feature | Phase | Dependencies | Design-In Cost | Notes |
|---|---|---|---|---|
| PQC crypto abstraction (Section 8.5) | Phase 1 | None | Low | Variable-length signature fields, algorithm enum |
| Formal verification readiness (Section 23.10) | Phase 1 | None | Low | Spec annotations, design contracts |
| RT preemption model (Section 7.2) | Phase 1-2 | Scheduler | Medium | Lock design, interrupt threading |
| Hardware memory safety hooks (Section 2.3) | Phase 2 | Memory allocator | Low | Tag allocation/deallocation in slab/buddy |
| Power budgeting (Section 6.4) | Phase 3 | Scheduler, cgroups | Medium | RAPL/SCMI reading, power cgroup controller. Per-task EAS is in Section 6.1.5 |
| Safe kernel extensibility (Section 18.7) | Phase 3 | KABI, domain isolation | Medium | Policy vtable traits, module lifecycle |
| Confidential computing — guest (Section 8.6) | Phase 3 | Memory manager | Medium | Bounce buffers, shared/private pages |
| Confidential computing — host (Section 8.6) | Phase 4 | umka-kvm, IOMMU | Medium | SEV-SNP/TDX VM management |
| PQC algorithm implementations (Section 8.5) | Phase 3-4 | Crypto abstraction | Medium | ML-KEM, ML-DSA, hybrid mode |
| Live kernel evolution (Section 12.6) | Phase 4-5 | Extensibility | Medium | State export/import, atomic swap |
| Intent-based management (Section 6.7) | Phase 4-5 | Inference engine, cgroups | Medium | Optimization loop, intent cgroup knobs |
| SmartNIC/DPU offload (Section 5.2) | Phase 4-5 | Device registry, proxy drivers | Medium | Offload transport, DPU discovery |
| Persistent memory (Section 14.7) | Phase 4-5 | VFS, memory tiers | Medium | DAX, MAP_SYNC, CLWB fencing |
| Computational storage (Section 14.8) | Phase 5+ | AccelBase framework | Low | CSD as AccelDeviceClass |
| Unified compute topology (Section 21.6) | Phase 4-5 | AccelBase, EAS (Section 6.1.5), power budgeting (Section 6.4) | Low | Advisory overlay; multi-dim capacity profiles, cross-device energy |
| Unified cgroup compute.weight (Section 21.6) | Phase 5+ | Unified topology, intent optimizer (Section 6.7) | Low | Optional knob; orchestration layer over existing per-domain knobs |
| NodeTransport unification (Section 21.6.13) | Phase 5 | KernelTransport (Section 5.1), OffloadTransport (Section 5.2) | Medium | Merge RDMA + PCIe + NVLink + CXL into one transport abstraction |
| Peer kernel nodes (Section 21.6.13) | Phase 5+ | NodeTransport, distributed kernel (Section 5.1) | Low | Vendor-driven; architecture ready, adoption depends on industry |
23.2.7 Priority Rationale
Phase 1-2 (Design-In): PQC sizing, verification readiness, RT lock design. These cost almost nothing now but are impossible to retrofit. Design contracts and data structure sizes affect everything built on top.
Phase 3 (Real Workloads): Extensibility, power budgeting, confidential guest mode. These enable the kernel to run real workloads in modern environments (cloud, power- constrained datacenters).
Phase 4-5 (Competitive Advantage): Live evolution, intent-based management, DPU offload. These are features that Linux cannot provide due to architectural constraints. They differentiate UmkaOS in production environments.
23.2.8 Licensing Summary
| Component | IP Source | Risk |
|---|---|---|
| Confidential computing (TEE) | Hardware vendor specs (AMD SEV, Intel TDX, ARM CCA), all public | None |
| Post-quantum crypto | NIST standards (FIPS 203, 204, 205), public domain algorithms | None |
| Power budgeting | RAPL (Intel public spec), SCMI (ARM public spec), original design | None |
| Hardware memory safety | ARM MTE (public ISA), Intel LAM (public ISA) | None |
| Formal verification | Verus (MIT license), RustBelt (academic, published) | None |
| Safe extensibility | Original design (extends existing KABI vtable model) | None |
| Live kernel evolution | Theseus OS concepts (academic, published, Rice University) | None |
| Intent-based management | Original design, optimization theory (academic) | None |
| Real-time guarantees | PREEMPT_RT concepts (GPLv2, Linux mainlined), CBS (academic) | Medium — see note below |
| SmartNIC/DPU offload | Original design (extends existing KABI proxy model) | None |
| Persistent memory | DAX/PMEM specifications (SNIA, public), Linux interfaces (facts) | None |
| Computational storage | NVMe Computational Programs Command Set and Subsystem Local Memory Command Set (public, NVMe consortium, January 2024) | None |
| Unified compute model | Original design (extends existing AccelBase + EAS models) | None |
All components are either original design, based on published academic research, based on public hardware specifications, or based on NIST/industry standards. No vendor-proprietary APIs or patented algorithms.
PREEMPT_RT derivative risk: PREEMPT_RT is GPLv2 and was merged into Linux mainline (v6.12). Any UmkaOS real-time code derived from PREEMPT_RT implementation (as opposed to the general concepts of preemptible kernels, threaded interrupts, and priority inheritance) could carry GPLv2 obligations that conflict with OKLF's additional permissions. UmkaOS's RT implementation MUST be a clean-room design based on published academic literature (priority inheritance protocols: Sha, Rajkumar, Lehoczky 1990; CBS: Abeni and Buttazzo 1998; LITMUS-RT: Brandenburg 2011) and public OS design textbooks, not derived from Linux PREEMPT_RT source code. Code review must verify no Linux-derived lock conversion patterns, interrupt threading structures, or RT-specific scheduler modifications are copied.
23.2.9 Performance Impact Summary
Every feature in this document was evaluated against the constraint: "Does this make UmkaOS measurably slower than Linux on the same workload?"
| Feature | Hot-Path Impact vs Linux | Justification |
|---|---|---|
| Confidential computing | 0% (same hardware, same cost) | Hardware AES engine, identical to Linux |
| Post-quantum crypto | 0% (cold-path only) | Boot/driver-load only. ML-DSA-44 verify comparable to Ed25519; ML-DSA-65 verify ~100-200 µs (cold-path only, not on hot paths) |
| Power budgeting | 0.015% (MSR reads at tick) | 600ns per 4ms tick. Invisible in any benchmark. Per-task EAS overhead: see Section 6.1.5.12 |
| Hardware memory safety | 0% vs Linux when enabled | Same MTE instructions, same hardware cost. Tag RAM overhead: 3.125% of DRAM (ARM MTE only) |
| Formal verification | 0.000% (compile-time) | Not in the binary |
| Safe extensibility | 0% (same as Linux sched_class) | Function pointer dispatch, same mechanism |
| Live kernel evolution | 0.000% (rare event only) | ~10μs during replacement, months between events |
| Intent-based management | ~0.00005% (background only) | 3μs per second background optimization |
| Real-time guarantees | 0% to 5% (configurable) | Same cost as Linux PREEMPT_RT when enabled. 0% = Voluntary, ~1% = Full, 2-5% = Realtime |
| SmartNIC/DPU offload | Negative (faster) | Moves work OFF host CPU |
| Persistent memory | Negative (faster) | DAX eliminates page cache copies |
| Computational storage | Negative (faster) | CSD reduces data movement |
| Unified compute model | ~0.00005% (background only) | ~4μs/sec/cgroup advisory. Submission hot path unchanged |
Target: match or exceed Linux performance for all common workloads. Most features are invisible at steady state, and several actually improve performance. Known exceptions are conscious trade-offs documented in their respective sections: RT scheduling adds 0-5% overhead for RT-class tasks (same cost as Linux PREEMPT_RT); capability checks add ~5-10 cycles per privileged operation (~0.1%, fully pipelined bitmask test); untrusted policy module isolation adds ~46 cycles per domain crossing (eliminated once the module graduates to the Core isolation domain).
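The "fully pipelined bitmask test" cost model cited above for capability checks can be illustrated with a minimal sketch. The `CapBit` names and the `CapSet` layout are assumptions for illustration, not the real UmkaOS capability encoding.

```rust
// Illustrative model of a ~5-10 cycle capability check: one bit per
// capability in a 64-bit word, tested with a single AND + compare.

#[derive(Clone, Copy)]
#[repr(u64)]
enum CapBit {
    MapDeviceMemory = 0,
    RegisterIrq = 1,
    IssueDma = 2,
}

#[derive(Clone, Copy, Default)]
struct CapSet(u64); // one bit per capability

impl CapSet {
    /// Returns a new set with the given capability granted (cold path).
    fn grant(self, cap: CapBit) -> Self {
        CapSet(self.0 | (1u64 << cap as u64))
    }

    /// The hot-path check: a single AND + compare, trivially pipelined.
    #[inline(always)]
    fn has(self, cap: CapBit) -> bool {
        self.0 & (1u64 << cap as u64) != 0
    }
}

fn main() {
    let caps = CapSet::default().grant(CapBit::RegisterIrq);
    println!("irq: {}", caps.has(CapBit::RegisterIrq)); // prints true
    println!("dma: {}", caps.has(CapBit::IssueDma));    // prints false
}
```

Because the check is branch-free up to the final compare and touches one cache line, it stays in the quoted ~0.1% overhead band even on privileged-operation-heavy workloads.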
23.2.10 Consumer and Desktop Phases (Phase 5)
Phase 5 focuses on consumer hardware support, desktop integration, and application ecosystem compatibility. These phases begin after Phase 4 server/cloud stability.
23.2.11 Phase 5a: Essential Consumer Hardware
Goal: UmkaOS boots and runs on common Intel/AMD laptops with basic functionality.
Deliverables:
- WiFi drivers (Intel, Realtek)
- Bluetooth stack (HID, audio)
- Touchpad drivers (I2C-HID, PS/2)
- Audio (Intel HDA, USB Audio)
- Graphics (Intel i915 modesetting, basic AMD)
- S3 suspend/resume
23.2.12 Phase 5b: Consumer Power Management
Goal: Battery life competitive with existing Linux distributions.
Deliverables:
- Power profiles (performance, balanced, battery-saver)
- S0ix Modern Standby support
- Per-app power attribution kernel interfaces
23.2.13 Phase 5c: Desktop Integration
Goal: Polished desktop experience, ready for enthusiast adoption.
Deliverables:
- Wayland compositor support (DRM, input events)
- Multi-monitor support (hotplug)
- Desktop notifications (battery, network, USB events)
- Per-app sandboxing capability primitives
23.2.14 Phase 5d: Broader Hardware
Goal: Support popular consumer laptops (ThinkPad, XPS, etc.).
Deliverables:
- More WiFi chipsets (Qualcomm, MediaTek, Broadcom)
- AMD graphics (amdgpu modesetting)
- Thunderbolt 3/4 support
- USB4 support
- SATA, eMMC, SD card readers
23.2.15 Phase 5e: Gaming & Creative
Goal: Support gaming, content creation workloads.
Deliverables:
- Vulkan drivers (Mesa RADV for AMD, Intel ANV)
- Steam + Proton support
- GPU video encode/decode (hardware acceleration)
23.2.16 Desktop / Laptop Performance Targets
Performance targets for UmkaOS running on consumer-grade desktop/laptop hardware. These are acceptance criteria for Phase 5 completion, not kernel architectural constraints — specific numbers are deployment-profile goals.
| Metric | Target |
|---|---|
| Kernel boot (bootloader → login screen) | < 5 seconds |
| Resume from S3 suspend | < 2 seconds |
| Resume from S4 hibernate | < 10 seconds |
| Idle power (WiFi on, display on) | Match or exceed Ubuntu 24.04 |
| Video playback (1080p H.264) | Hardware decode; CPU < 5% |
23.2.16.1 Validation Methodology
Battery life:
- Side-by-side comparison with Windows 11 and Ubuntu 24.04 on the same hardware
- Standardised web-browsing benchmark (Speedometer + video stream)
- UmkaOS must match or exceed Ubuntu 24.04 battery life
Real-world validation:
- 100+ beta testers (developer community) running UmkaOS as daily driver
- 30-day soak; collect crash dumps, performance traces, battery statistics
23.3 Verification Strategy
23.3.1 Testing Layers
| Layer | Tool / Method | What it verifies |
|---|---|---|
| Unit tests | `cargo test` (in QEMU or host mock) | Individual subsystem correctness |
| Integration tests | Custom test harness in QEMU | Cross-subsystem interactions |
| Syscall conformance | Linux Test Project (LTP) | Syscall behavior matches Linux |
| Application testing | Boot Ubuntu minimal, Alpine | Real-world application compatibility |
| Container testing | Docker hello-world, nginx, redis | Container runtime compatibility |
| Kubernetes testing | k3s single-node | Orchestration platform compatibility |
| ABI regression | `kabi-compat-check` in CI | No breaking changes to KABI |
| Crash recovery | Fault injection framework | Tier 1/2 drivers recover correctly |
| Performance regression | Automated benchmarks vs Linux baseline | No unacceptable performance regression |
| Fuzzing | syzkaller (adapted for UmkaOS) | Syscall fuzzing for crash/hang detection |
| Static analysis | `cargo clippy`, custom lints | Code quality, unsafe usage review |
23.3.2 Key Benchmarks
These benchmarks must match Linux within 5% (measured on identical hardware, same kernel configuration, same workload parameters):
| Benchmark | What it tests | Target delta |
|---|---|---|
| `fio randread 4K QD32` | Block I/O fast path (IOPS) | < 2% |
| `fio randwrite 4K QD32` | Block I/O write path (IOPS) | < 2% |
| `fio sequential read 1M` | Block I/O throughput (GB/s) | < 1% |
| `iperf3` TCP throughput | Network stack throughput | < 5% |
| `iperf3` TCP latency (RR) | Network stack latency | < 5% |
| nginx small-file HTTP (`wrk`) | Combined network + filesystem | < 5% |
| `redis-benchmark` | In-memory key-value (network + mem) | < 3% |
| `sysbench` OLTP read-write | Database workload (IO + CPU + sched) | < 5% |
| `hackbench` (groups=100) | Scheduler + IPC throughput | < 3% |
| `lmbench lat_ctx` | Context switch latency | < 1% |
| Kernel compile (`make -jN`) | Combined CPU + IO + scheduling | < 5% |
| `stress-ng` mixed | Overall system stress | < 5% |
23.3.3 Crash Recovery Testing
Crash recovery is exercised by a dedicated fault injection framework.
Activation
Fault injection is available in debug builds only (`cfg(umka_fault_inject)`). It is never compiled into release builds. Two activation mechanisms:
- Kernel boot parameter: `umka.fault_inject=<target>[,<fault>]`. Example: `umka.fault_inject=nvme0,domain_violation` injects a domain access violation into the nvme0 driver on first I/O. The kernel logs the injection at `KERN_DEBUG` level and proceeds with the fault.
- Runtime sysctl (debug builds, init namespace only): `umka/debug/fault_inject/<driver_name>/<fault_type>`. Write `1` to trigger once, write `N` to trigger on the N-th matching code path, write `0` to cancel.
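The sysctl write semantics (1 = trigger once, N = trigger on the N-th matching code path, 0 = cancel) can be modeled as a small countdown. The type and method names here are illustrative assumptions, not the kernel's actual bookkeeping.

```rust
// Toy model of the per-(driver, fault_type) sysctl trigger state.

#[derive(Default)]
struct InjectState {
    remaining: u64, // 0 = disarmed; N = fire on the N-th matching hit
}

impl InjectState {
    /// Models a write to the sysctl file: 1 = next hit, N = N-th hit, 0 = cancel.
    fn write(&mut self, value: u64) {
        self.remaining = value;
    }

    /// Called at each matching injection point; returns true when the fault fires.
    fn should_inject(&mut self) -> bool {
        match self.remaining {
            0 => false,
            1 => {
                self.remaining = 0; // fire once, then disarm
                true
            }
            _ => {
                self.remaining -= 1; // count down toward the N-th hit
                false
            }
        }
    }
}

fn main() {
    let mut st = InjectState::default();
    st.write(3); // fire on the 3rd matching code path
    let hits: Vec<bool> = (0..4).map(|_| st.should_inject()).collect();
    println!("{hits:?}"); // prints [false, false, true, false]
}
```

Disarming after the fault fires keeps a single sysctl write from turning into a crash loop during recovery testing.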
Fault injection points in driver code
Driver code marks injectable points with the `umka_fault_inject!` macro (compiled out in release builds):
```rust
/// Injects fault `fault_type` at this callsite if fault injection is active for
/// this driver and fault type. No-op in release builds.
///
/// In debug builds: if umka.fault_inject matches this driver + fault_type,
/// executes the fault action (e.g., corrupts a pointer, calls panic!, returns Err).
#[cfg(umka_fault_inject)]
macro_rules! umka_fault_inject {
    ($driver:expr, $fault_type:expr, $action:expr) => {
        if crate::fault_inject::should_inject($driver, $fault_type) {
            $action
        }
    };
}

#[cfg(not(umka_fault_inject))]
macro_rules! umka_fault_inject {
    ($driver:expr, $fault_type:expr, $action:expr) => {};
}
```
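A self-contained sketch of how a driver path might use this macro, with a stub `fault_inject` module standing in for the real kernel plumbing. Everything except the macro shape itself (the stub predicate, `submit_io`, the fault string) is an illustrative assumption.

```rust
// Stub: the real implementation consults the boot parameter / sysctl state.
mod fault_inject {
    pub fn should_inject(driver: &str, fault: &str) -> bool {
        driver == "nvme0" && fault == "domain_violation"
    }
}

// Debug-build variant of the macro, inlined here so the example runs standalone.
macro_rules! umka_fault_inject {
    ($driver:expr, $fault_type:expr, $action:expr) => {
        if crate::fault_inject::should_inject($driver, $fault_type) {
            $action
        }
    };
}

/// A driver I/O path with one injectable point on entry.
fn submit_io(driver: &str) -> Result<(), &'static str> {
    umka_fault_inject!(driver, "domain_violation", return Err("injected fault"));
    Ok(()) // normal path
}

fn main() {
    println!("{:?}", submit_io("nvme0"));   // prints Err("injected fault")
    println!("{:?}", submit_io("virtio0")); // prints Ok(())
}
```

Because the fault action is an arbitrary expression, the same callsite can model early returns, panics, or pointer corruption without changing the driver's normal-path code.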
Fault scenarios tested
- Domain isolation violation: inject `umka_fault_inject!(driver, FaultType::DomainWrite, /* write to wrong PKEY */)`; verifies MPK/DACR/POE catches the fault and the driver is reloaded without kernel panic.
- Null pointer dereference: inject a null dereference in a Tier 1 driver handler; verifies fault containment and recovery within 50–150 ms.
- Infinite loop: inject `loop {}` in a driver kthread; verifies the per-driver watchdog timer (`DRIVER_WATCHDOG_TIMEOUT_MS = 5000`) fires and kills the driver.
- DMA to wrong address: inject an out-of-bounds DMA descriptor; verifies the IOMMU fault is caught, the driver is torn down, and no kernel memory is corrupted.
- Tier 2 process crash: inject `abort()` in a Tier 2 driver process; verifies the umka-core supervisor restarts it within 10 ms.
- Repeated crashes: inject a crash on every restart; verifies the auto-demotion policy engages after `DRIVER_MAX_RESTART_ATTEMPTS = 3`.
- I/O in flight during crash: inject a crash mid-I/O; verifies all in-flight requests complete with `-EIO` and no request objects leak.
Each test verifies: (1) the system does not panic, (2) the driver recovers within the target time, (3) applications see errors but can retry, and (4) no memory is leaked.
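The auto-demotion policy exercised by the repeated-crashes scenario can be sketched as a small state machine built around the documented `DRIVER_MAX_RESTART_ATTEMPTS = 3` constant. The `Supervisor` type and `Action` enum are illustrative assumptions.

```rust
const DRIVER_MAX_RESTART_ATTEMPTS: u32 = 3;

#[derive(Debug, PartialEq)]
enum Action {
    Restart, // reload the driver and retry
    Demote,  // stop restarting; e.g. mark failed and detach the device
}

#[derive(Default)]
struct Supervisor {
    restart_attempts: u32,
}

impl Supervisor {
    /// Decides what to do when the supervised driver crashes.
    fn on_crash(&mut self) -> Action {
        if self.restart_attempts < DRIVER_MAX_RESTART_ATTEMPTS {
            self.restart_attempts += 1;
            Action::Restart
        } else {
            Action::Demote
        }
    }
}

fn main() {
    let mut sup = Supervisor::default();
    for i in 1..=4 {
        // Crashes 1-3 get a restart; crash 4 triggers demotion.
        println!("crash {i}: {:?}", sup.on_crash());
    }
}
```

A real supervisor would also decay `restart_attempts` after a stable-uptime window, so a driver that crashes once a month is not eventually demoted.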
23.3.4 CI Pipeline
Every commit triggers:
1. cargo build --target x86_64-unknown-none
2. cargo test (host-side unit tests)
3. QEMU boot test (basic boot + shutdown)
4. kabi-compat-check (no ABI breaks)
5. cargo clippy (lint pass)
6. cargo fmt --check (formatting)
Every merge to main additionally triggers:
7. LTP syscall conformance suite
8. Docker container boot test
9. Performance benchmark suite (vs stored Linux baseline)
10. Crash recovery fault injection suite
23.4 Technical Risks
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| MPK provides only 16 domains | Medium | Certain | Group related drivers by fault domain (all block share domain, all net share domain). 12 driver-available domains on x86 (4 keys reserved for infrastructure: PKEY 0=core, 1=shared descriptors, 14=shared DMA, 15=guard; per Section 10.2/5). AArch64 POE has 7 usable indices (1-7), of which 3 are available for Tier 1 driver domains (indices 3-5; indices 1-2 reserved for umka-core, 6 for userspace, 7 for temporary/debug; per Section 23.4.3). See "MPK Domain Grouping" below for degraded isolation analysis. |
| eBPF verifier complexity | High | High | Verifier subsystem is ~30K SLOC in Linux (counting kernel/bpf/verifier.c at ~23K SLOC as of v6.12, plus btf.c, log.c, range-tracking helpers, and test infrastructure — the ~30K figure covers the full verification subsystem, not verifier.c alone). Start with subset of program types, expand incrementally. UmkaOS implements a clean-room Rust verifier and JIT (GPL avoidance); the eBPF bytecode format and helper API are compatible with Linux but the implementation is original. |
| KVM deeply integrated with Linux MM | High | High | Design memory manager with KVM hooks from the start (Phase 1 architecture). Dedicate a team to KVM from Phase 4. |
| Driver coverage gap blocks adoption | Critical | High | Cloud-first strategy (VirtIO covers 100% of VMs). Prioritize top-20 drivers. Agentic rewrite pipeline for open-source drivers. |
| Subtle syscall compatibility bugs | High | High | LTP conformance suite, real-world application testing, syzkaller fuzzing. Build a comprehensive test matrix of applications. |
| Spectre/Meltdown mitigations + domain isolation | Medium | Medium | KPTI not needed for Tier 1 (same Ring 0). Tier 2 needs standard KPTI. Retpoline/IBRS for indirect branches. Test on affected hardware. |
| IOMMU not available on all hardware | Medium | Medium | IOMMU required for Tier 1 DMA fencing. Systems without IOMMU fall back to trusted mode (reduced isolation, logged warning). |
| ARM64 lacks direct MPK equivalent | Medium | Certain | Use POE (FEAT_S1POE, 7 usable indices of which 3 are for Tier 1 drivers, optional from ARMv8.9+) or page-table fallback. Adaptive isolation policy (Section 10.2) allows per-driver tier pinning or promotion to Tier 0 on pre-POE hardware. |
| No fast isolation on pre-2020 x86 | Medium | Certain | Adaptive isolation policy: isolation=performance promotes Tier 1 to Tier 0 (Linux-equivalent speed, no memory isolation). IOMMU DMA fencing still active. |
| Rust ecosystem maturity for OS dev | Low | Medium | Established patterns from Redox, Linux rust-for-linux, Hubris. Use #![no_std] and custom allocator. Unsafe blocks at hardware boundaries are expected and audited. |
| Performance target too ambitious | Medium | Medium | 5% target is for macro benchmarks. Micro-benchmarks may show higher overhead on specific paths. Batch amortization and careful profiling. |
| Community adoption / contributor pipeline | Medium | Medium | Clean SDK, good documentation, lower barrier than Linux driver development. Cloud-first focus builds credibility before desktop push. |
| Regulatory / certification barriers | Low | Low | Work with distributions early. Open-source everything except vendor proprietary blobs. |
| LZ4/Zstd kernel implementation correctness | Medium | Medium | Fuzzing, comparison with reference implementation. Use no_std BSD-licensed implementations with comprehensive test vectors. |
| Object namespace overhead on hot paths | Low | Low | Lazy registration for high-frequency objects (fds, sockets, VMAs). Eagerly registered objects only (~2000 baseline = ~384 KB). |
| CBS scheduling fairness under edge cases | Medium | Medium | Formal analysis against CBS paper (Abeni 1998), stress testing with adversarial workloads, comparison with Linux cpu.max behavior. |
23.4.1 Risks from Advanced Features (Chapters 16-18)
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| TEE hardware fragmentation (SEV-SNP vs TDX vs CCA) | High | Certain | Abstract behind ConfidentialContext trait (Section 8.6.3). Implement one backend at a time. SEV-SNP first (largest cloud deployment), TDX second, CCA third. |
| PQC algorithm instability (NIST may revise) | Medium | Medium | Algorithm-agile abstraction (Section 8.5.2). Algorithms behind enum dispatch; swapping ML-KEM for a successor is a library update, not a kernel redesign. |
| PQC signature sizes impact IPC latency | Low | Certain | ML-DSA-65 signatures are 3,309 bytes (per NIST FIPS 204, Table 2). Cold-path only (capability minting, not every IPC call). SignatureData::Heap variant avoids ring buffer bloat (Section 8.5). |
| RT + domain isolation interaction causes priority inversion | High | Medium | Domain switch (WRPKRU on x86) is ~23 cycles (no lock needed). Domain switching is O(1) — no contention path. If priority inheritance needed for domain-shared buffers, use PI futexes (Section 7.2.3). |
| Formal verification scope creep | Medium | Medium | Verify only security-critical paths: capability table, IPC ring, page table mapping (Section 23.10). Accept that ~80% of kernel code is tested, not verified. |
| DPU vendor lock-in (proprietary firmware) | Medium | High | KABI vtable for OffloadTransport (Section 5.2). DPU-specific code is behind the same driver isolation as any Tier 1 device. Vendor-specific logic in driver, not kernel. |
| PMEM/CXL hardware not yet widely deployed | Low | High | Design is hardware-agnostic (Section 14.7). All PMEM code compiles out when hardware is absent. CXL 3.0 adoption expected 2025-2027; architecture ready, implementation deferred. |
| Unified compute model adds scheduling overhead | Medium | Low | Advisory overlay only — existing schedulers unchanged (Section 21.6). Topology queries are O(1) reads from cached ComputeCapacityProfile. No hot-path cost. |
| Live kernel evolution causes state corruption | Critical | Low | Post-swap watchdog with 5-second timer (Section 12.6). On crash, the system attempts to re-extract state from the failing component; if extraction fails, the system panics rather than reverting to stale state, preventing silent data corruption. State serialization uses versioned HMAC integrity tags. |
| Intent optimizer makes poor decisions | Low | Medium | Intent system is purely advisory (Section 6.7). Clamping prevents invalid resource configs. Worst case: system falls back to static defaults (no intent optimization). |
23.4.2 Risk Response Priority
- Driver coverage (Critical): Addressed by cloud-first strategy + agentic rewrite
- Syscall compatibility (High): Addressed by LTP + application test matrix
- eBPF complexity (High): Addressed by incremental implementation
- KVM integration (High): Addressed by early architectural planning
- TEE fragmentation (High): Addressed by trait-based abstraction
- RT + domain isolation interaction (High): Addressed by O(1) domain switching design
- Domain limit (Medium): Addressed by driver grouping policy
- Live evolution safety (Critical but low likelihood): Addressed by watchdog + state HMAC integrity checks
23.4.3 Domain Grouping: Degraded Isolation Analysis
When more than 12 Tier 1 drivers are loaded simultaneously, some drivers must share an isolation domain (protection key). This is an inherent limitation of Intel's 16-key PKU design (16 keys minus PKEY 0 for umka-core, minus PKEY 1 for shared descriptors, minus PKEY 14 for shared DMA, minus PKEY 15 as guard = 12 usable). Grouping has concrete consequences for fault isolation:
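The budget arithmetic above can be sketched directly. This is a minimal illustration, not kernel code; the key assignments (PKEY 0 = umka-core, PKEY 1 = shared descriptors, PKEY 14 = shared DMA, PKEY 15 = guard) are taken from this section, and the function names are hypothetical:

```rust
// x86 PKU domain budget: 16 keys minus 4 infrastructure keys = 12 driver domains.
const TOTAL_PKEYS: usize = 16;
const RESERVED_PKEYS: [usize; 4] = [0, 1, 14, 15]; // umka-core, descriptors, DMA, guard

/// Number of protection keys left for Tier 1 driver domains.
fn usable_driver_domains() -> usize {
    TOTAL_PKEYS - RESERVED_PKEYS.len()
}

/// True when the loaded driver count exceeds the budget and grouping kicks in.
fn grouping_required(loaded_tier1_drivers: usize) -> bool {
    loaded_tier1_drivers > usable_driver_domains()
}

fn main() {
    assert_eq!(usable_driver_domains(), 12);
    assert!(!grouping_required(5));  // typical cloud server: no grouping needed
    assert!(grouping_required(13));  // heavily-configured desktop: grouping required
    println!("usable Tier 1 domains on x86 PKU: {}", usable_driver_domains());
}
```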
What grouping preserves:
- IOMMU isolation: each driver retains its own IOMMU domain regardless of domain grouping. DMA fencing is unaffected — a crashing NVMe driver cannot DMA into a NIC driver's buffers, even if they share an isolation domain.
- Capability isolation: each driver has its own capability set. Sharing an isolation domain does not grant access to another driver's capabilities.
- Crash detection: fault injection and page-fault trapping still identify the crashing driver (via instruction pointer, not isolation domain).
What grouping degrades:
- Memory read/write isolation between grouped drivers. If drivers A and B share isolation domain 5, a buffer overrun in A can corrupt B's data structures. The crash is still contained (it cannot escape to umka-core or other domains), but it may take down both A and B.
- The blast radius of a crash expands from one driver to one domain group. In practice, this means a faulty NVMe driver could take down the AHCI driver if both are in the "block" group.
Grouping policy — drivers are grouped by fault domain affinity (drivers that interact heavily and would likely cascade-fail anyway):
| Isolation Domain | Group | Typical Members |
|---|---|---|
| 2 | Block storage | NVMe, AHCI/SATA, virtio-blk, iSCSI, NVMe-oF |
| 3 | Network (data) | Intel NIC, Mellanox NIC, virtio-net |
| 4 | Network (stack) | TCP/IP, UDP, RDMA core |
| 5 | Filesystem | ext4, XFS, btrfs |
| 6 | Display | DRM/KMS, GPU compute |
| 7 | KVM | VMX/SVM, vhost-net, vhost-scsi |
| 8 | USB | xHCI, USB hub, USB mass storage |
| 9-13 | Individual | High-value drivers that warrant solo domains |
AArch64 domain budget: POE provides 7 usable indices (1-7; index 0 is reserved for umka-core as the default PTE value). Of the 7 usable indices, 4 are reserved for infrastructure (index 1 for shared read-only, index 2 for shared DMA, index 6 for userspace, index 7 for temporary/debug), leaving only 3 indices for Tier 1 driver domains (indices 3-5; vs. 12 on x86). AArch64 therefore requires much more aggressive grouping.
ARM FEAT_S1POE uses a 3-bit POIndex field in page table entries, providing 8 index values (0-7). This is a hardware constraint, not a design choice: the POIndex field width is fixed by the ISA and cannot be widened. Index 0 is the default PTE value (per the ARM architecture), leaving 7 configurable indices. The grouping table above is designed for x86's 12-domain budget. On AArch64, the kernel applies a reduced grouping scheme:
- Domain 0: umka-core (default PTE value)
- Domain 1: Shared read-only
- Domain 2: Shared DMA buffer pool
- Domain 3: VFS + block I/O (merged — these are tightly coupled)
- Domain 4: Network stack
- Domain 5: All remaining Tier 1 drivers (single shared domain)
- Domain 6: Userspace (EL0 default)
- Domain 7: Temporary / debug

This reduces isolation granularity for Tier 1 drivers on AArch64 (all share one domain) but preserves the critical umka-core/driver/userspace boundaries. The architecture-specific grouping is selected at boot based on arch::current::isolation::domain_count().
Typical server scenario — a cloud server runs NVMe + NIC + TCP + KVM + virtio = 5 drivers. On x86 (12 driver domains), these fit in 5 domains with no grouping needed; the 12-domain limit only triggers on heavily-configured systems (desktop with GPU + audio + USB + Bluetooth + WiFi + NVMe + SATA + NIC + ...). On AArch64 with POE (3 driver domains), even this typical 5-driver configuration requires grouping -- the reduced scheme above merges block I/O, networking, and remaining drivers into 3 shared domains. Architectures with more domains (ARMv7 DACR: 15, PPC32 segments: 15) behave more like x86.
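The boot-time selection can be sketched as follows. This is a simplified model assuming domain_count() returns the total hardware index count; the enum, threshold, and function names are hypothetical stand-ins for the real arch::current::isolation interface:

```rust
#[derive(Debug, PartialEq)]
enum GroupingScheme {
    Full,    // >= 12 driver domains: use the grouping table above (x86 PKU, ARMv7 DACR)
    Reduced, // few driver domains: merged AArch64-style scheme above
}

// Hypothetical stand-in for arch::current::isolation::domain_count():
// total hardware domain indices on the boot architecture.
fn domain_count() -> usize {
    16 // pretend we booted on x86 with PKU
}

/// Pick the grouping scheme from the hardware index budget.
/// 15-16 indices (x86 PKU, ARMv7 DACR, PPC32 segments) keep the full table;
/// POE's 8 indices (3 usable driver domains) force the merged scheme.
fn select_grouping(total_indices: usize) -> GroupingScheme {
    if total_indices >= 15 {
        GroupingScheme::Full
    } else {
        GroupingScheme::Reduced
    }
}

fn main() {
    assert_eq!(select_grouping(domain_count()), GroupingScheme::Full); // x86 PKU
    assert_eq!(select_grouping(8), GroupingScheme::Reduced);           // FEAT_S1POE
    println!("selected: {:?}", select_grouping(domain_count()));
}
```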
Monitoring — when grouping occurs, UmkaOS logs a warning:
umka: isolation domain 1 shared by nvme, ahci (reduced isolation: crash in either affects both)
This allows administrators to make informed decisions about which drivers to load as Tier 2 (full process isolation, unlimited domains) if they require stronger isolation than domain grouping provides.
# Appendices
Reference material, comparison tables, and open questions.
23.5 Licensing Model: Open Kernel License Framework (OKLF) v1.3
UmkaOS uses the Open Kernel License Framework (OKLF) v1.3 (see OKLF-v1.3.md for the full legal text). Key elements:
Base license: GPLv2-only with additional permissions (Sections 2-5 of OKLF). All kernel code — umka-core, umka-kernel, umka-compat, umka-net, umka-vfs, umka-block, umka-kvm, tools, and boot code — is GPLv2. This ensures:
- All kernel modifications must be open-sourced
- Proprietary forks are impossible
- Same legal framework the Linux ecosystem understands
Approved Linking License Registry (ALLR): A curated, append-only list of open-source licenses approved for use with kernel code. Tiers 1-2 may link with kernel code directly (Tier 0/1 drivers). Tier 3 licenses are GPL-incompatible and may NOT link with the kernel; Tier 3 code runs only behind the KABI IPC boundary — as Tier 2 process-isolated drivers by default, or as Tier 1 drivers where noted (see the CDDL note below) — so no linking ever occurs:
- Tier 1 (weak copyleft, GPL-compatible): MPL-2.0, LGPL-2.1, EPL-2.0 (with Secondary License designation; see note below)
- Tier 2 (permissive): MIT, BSD-2, BSD-3, Apache-2.0, ISC, Zlib
- Tier 3 (incompatible — process isolation required, no linking): CDDL-1.0, CDDL-1.1, LGPL-3.0, EUPL-1.2 (see note below)
LGPL-3.0 incompatibility with GPLv2-only: LGPL-3.0 is incompatible with GPLv2-only code per the FSF compatibility matrix. LGPL-3.0 is defined as GPLv3 plus additional permissions (its opening paragraph: "This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License"). Since GPLv3 is incompatible with GPLv2-only (see GPLv3 exclusion note below), LGPL-3.0 inherits that incompatibility. LGPL-3.0 code must NOT be linked into the UmkaOS kernel. LGPL-3.0 code communicates with the kernel via KABI IPC only (Tier 3, process isolation required). Note that LGPL-2.1 IS compatible with GPLv2 and remains in Tier 1.
EUPL-1.2 classification (Tier 3): EUPL-1.2 is a strong copyleft license that the FSF classifies as GPL-incompatible. While EUPL Article 5 provides a compatibility list (including GPLv2, GPLv3, LGPL, AGPL, MPL-2.0, EPL-1.0, CeCILL) that allows EUPL-licensed code to be relicensed under those licenses when combined with code under those licenses, the FSF's position is that EUPL-1.2's copyleft is "comparable to the GPL's, and incompatible with it" by itself. UmkaOS places EUPL-1.2 in Tier 3 (process isolation required, no linking with kernel code) as the conservative default. EUPL-1.2 code that has been explicitly relicensed to GPLv2 via Article 5 by its copyright holder may then be treated as GPLv2 code and used in Tier 0/1. Without explicit relicensing, EUPL-1.2 code runs as a Tier 2 process-isolated driver communicating via KABI IPC only.
EPL-2.0 GPL compatibility: EPL-2.0 is GPL-compatible only when the distributor explicitly designates GPL as a Secondary License per EPL-2.0 Section 3.2. Without this designation, EPL-2.0 is GPL-incompatible. UmkaOS requires EPL-2.0 dependencies to carry the Secondary License designation; undesignated EPL-2.0 code is treated as Tier 3 (process isolation required, no linking with kernel code). ALLR Tier 1 inclusion applies only to EPL-2.0 code that explicitly carries the Secondary License designation for GPLv2. Enforcement: the KABI module loader checks for the Secondary License designation in the module's license metadata at load time. EPL-2.0 modules without the designation are rejected for Tier 0/1 loading and must run as Tier 2 process-isolated drivers. Additionally, EPL-2.0's patent grant (Section 2.2) requires contributors to grant a patent license for their contributions; UmkaOS cannot enforce this at a technical level, so EPL-2.0 code in Tier 1 carries an implicit assumption that upstream contributors have complied with Section 2.2. Code review should verify the Secondary License designation is present in the upstream project's license header, not just claimed in module metadata.
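The loader-side license gate described above can be sketched as follows. All type and function names here are hypothetical; the real KABI module loader reads this metadata from the driver manifest:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum AllrTier {
    Tier1, // weak copyleft, GPL-compatible: may link
    Tier2, // permissive: may link
    Tier3, // GPL-incompatible: KABI IPC only, no linking
}

struct LicenseMeta<'a> {
    spdx_id: &'a str,
    /// EPL-2.0 Section 3.2 Secondary License designation for GPLv2.
    secondary_license_gplv2: bool,
}

/// Classify a module's license against the ALLR tiers from this section.
fn classify(meta: &LicenseMeta) -> AllrTier {
    match meta.spdx_id {
        "MPL-2.0" | "LGPL-2.1" => AllrTier::Tier1,
        // EPL-2.0 is Tier 1 ONLY with the Secondary License designation.
        "EPL-2.0" if meta.secondary_license_gplv2 => AllrTier::Tier1,
        "MIT" | "BSD-2-Clause" | "BSD-3-Clause" | "Apache-2.0" | "ISC" | "Zlib" => AllrTier::Tier2,
        // CDDL, LGPL-3.0, EUPL-1.2, undesignated EPL-2.0: no linking.
        _ => AllrTier::Tier3,
    }
}

/// Tier 0/1 loading is refused for Tier 3 code.
fn may_load_in_kernel(meta: &LicenseMeta) -> bool {
    classify(meta) != AllrTier::Tier3
}

fn main() {
    let designated = LicenseMeta { spdx_id: "EPL-2.0", secondary_license_gplv2: true };
    let undesignated = LicenseMeta { spdx_id: "EPL-2.0", secondary_license_gplv2: false };
    assert_eq!(classify(&designated), AllrTier::Tier1);
    assert_eq!(classify(&undesignated), AllrTier::Tier3);
    assert!(!may_load_in_kernel(&undesignated)); // rejected for Tier 0/1 loading
}
```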
GPLv3 exclusion from ALLR: GPLv3 is deliberately excluded from the ALLR. UmkaOS's kernel is licensed GPLv2-only (not "GPLv2 or later"). GPLv3 is incompatible with GPLv2-only code per the FSF: GPLv3's additional requirements (the Installation Information / anti-tivoization terms of its Section 6, the patent terms of its Sections 10-11) constitute "further restrictions" that GPLv2 Section 6 prohibits. Code licensed GPLv3-only cannot be linked into a GPLv2-only kernel. Code licensed "GPLv2 or later" CAN be used (under its GPLv2 grant), but code licensed GPLv3-only cannot. Adding GPLv3 to the ALLR would create a false impression that GPLv3-only code may be linked with the kernel. If GPLv3-only code must be used, it must run as a Tier 1 or Tier 2 driver (same as CDDL), communicating via KABI IPC with no linking.
CDDL and GPL incompatibility: CDDL is GPL-incompatible per the FSF. CDDL-licensed code may run as Tier 1 or Tier 2 drivers — KABI provides the license boundary at both tiers. Despite CDDL appearing in the ALLR, no linking occurs between CDDL code and GPL kernel code. CDDL drivers communicate exclusively via KABI IPC (ring buffer message passing, vtable dispatch, one resolved symbol: __kabi_driver_entry) — no shared symbols, no function calls across the license boundary. This provides more isolation than Linux's EXPORT_SYMBOL_GPL boundary (where modules ARE linked into the kernel). Statically-linked (Tier 0) CDDL code is NOT permitted, as static linking creates a derivative work. The KABI boundary ensures CDDL and GPL code never form a single "work" in the copyright sense.
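The single-symbol boundary can be illustrated with a minimal sketch. The vtable layouts below are illustrative assumptions, not the real umka-driver-sdk types; only the __kabi_driver_entry symbol name comes from this section:

```rust
// Services the kernel offers the driver. Append-only: new entries are
// added at the end of the struct, never reordered or removed.
#[repr(C)]
pub struct KernelServicesVTable {
    pub abi_version: u32,
    pub log: extern "C" fn(msg: *const u8, len: usize),
}

// Entry points the driver hands back to the kernel.
#[repr(C)]
pub struct DriverVTable {
    pub abi_version: u32,
    pub start: extern "C" fn() -> i32,
    pub stop: extern "C" fn(),
}

/// The ONE symbol the loader resolves across the license boundary.
/// Everything else crosses via the exchanged vtables and IPC rings.
#[no_mangle]
pub extern "C" fn __kabi_driver_entry(
    _services: *const KernelServicesVTable,
) -> *const DriverVTable {
    extern "C" fn start() -> i32 { 0 } // bring up the device (stubbed here)
    extern "C" fn stop() {}            // quiesce the device (stubbed here)
    static VTABLE: DriverVTable = DriverVTable { abi_version: 1, start, stop };
    &VTABLE
}

fn main() {
    // What the loader does after resolving the entry symbol.
    let vt = __kabi_driver_entry(std::ptr::null());
    assert_eq!(unsafe { (*vt).abi_version }, 1);
    assert_eq!(unsafe { ((*vt).start)() }, 0);
    println!("driver vtable exchanged, abi v1");
}
```

The design point is that vtable dispatch through exchanged function pointers never creates shared symbols, so no static or dynamic linking in the copyright sense occurs.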
New licenses added via governance process (60-day review, supermajority LGB vote). Licenses are never removed (append-only for legal certainty).
Proprietary kernel-space code explicitly prohibited (OKLF Section 4.2(c)): Any code that loads into kernel address space and accesses internal kernel symbols is a derivative work and must comply with GPLv2 or an ALLR-listed license. This removes Linux's 30-year "gray area" about proprietary kernel modules.
Proprietary user-space drivers explicitly permitted (OKLF Section 4.2(b)): Code interacting with the kernel exclusively through the stable userspace interface (syscalls, /proc, /sys, VFIO, UIO, FUSE, eBPF) is not a derivative work. This maps directly to our Tier 2 driver model — hardware vendors who cannot open-source their drivers may use user-space driver frameworks with full isolation.
Anti-tivoization stance (OKLF Section 11.1): OKLF encourages but does not mandate installation information disclosure. The OKLF adds only additional permissions to GPLv2 (a copyright holder may always grant extra permissions on its own code), never additional restrictions. Anti-tivoization protection is achieved indirectly: the KABI stability guarantee means users can always replace a Tier 1/2 driver binary without modifying the kernel, making hardware lockdown of individual drivers less effective.
Firmware exception (OKLF Section 4.3): Binary firmware that runs on separate processors (GPU microcode, Wi-Fi firmware, SSD firmware) is outside the license scope. Distributed separately in firmware/. Code running on the main CPU is NOT firmware.
Legal risk acknowledgment — OKLF is a novel license framework built on GPLv2. While it is designed to be GPLv2-compatible (a copyright holder may always grant additional permissions on its own code), it has not been tested in court and constitutes a novel legal approach that should not be relied upon without independent legal review. Key risks:
1. The ALLR mechanism may be viewed by some lawyers as an untested extension of the "linking exception" concept — FSF/SFLC review is recommended before v1.0 final.
2. The OKLF provides weaker anti-tivoization protection than GPLv3, an accepted tradeoff for GPLv2 compatibility — OKLF cannot mandate installation information disclosure without violating GPLv2's "no further restrictions" clause (Section 6).
3. Ecosystem adoption depends on corporate legal teams accepting OKLF as GPLv2-compatible — even if legally sound, unfamiliarity may slow adoption.
4. The "additional permissions" model is well-established in principle (e.g., the GCC Runtime Library Exception, the Linux kernel's syscall-boundary note), but OKLF's scope (ALLR registry, driver tier classification, firmware exception) goes beyond typical additional permissions — a court could find that some OKLF provisions constitute "further restrictions," which GPLv2 Section 6 prohibits. This risk is mitigated by careful drafting but cannot be eliminated without judicial precedent.

UmkaOS should seek early legal review from SFLC or equivalent, and provide a "plain GPLv2" fallback for organizations that cannot accept OKLF's additional terms.
KABI Driver SDK: The umka-driver-sdk crate (ABI type definitions, vtable layouts, ring buffer protocol, DMA types) is dual-licensed Apache-2.0 OR MIT. This is the interface contract — drivers of any ALLR-listed license can link against these types without friction.
How this maps to our driver tiers:
| Tier | Location | License requirement | OKLF section |
|---|---|---|---|
| Tier 0 (boot-critical) | In-kernel, static | GPLv2 or ALLR | 4.1 (in-tree) |
| Tier 1 (domain-isolated) | Ring 0, loaded | GPLv2 or ALLR | 4.2 (out-of-tree open-source) |
| Tier 2 (user-space) | Ring 3, process | Any (incl. proprietary) | 4.2(b) (userspace interface) |
Three ABI stability tiers (extending OKLF Section 10.2):
| Interface | Stable? | Policy |
|---|---|---|
| Internal kernel APIs | No | May change between any two releases |
| KABI (driver ABI) | Yes | Versioned, append-only, binary-stable |
| Userspace ABI (syscalls) | Yes | Never broken without extended deprecation |
| Concern | How addressed |
|---|---|
| Prevent proprietary kernel forks | GPLv2 copyleft |
| Allow ZFS (CDDL) | CDDL in ALLR Tier 3 — ZFS runs as a Tier 1 driver (KABI IPC provides license boundary, no linking occurs) |
| Allow Nvidia GPU (proprietary) | Tier 2 user-space driver via VFIO |
| Allow BSD/MIT drivers | BSD/MIT in ALLR — full kernel-space access |
| Force kernel improvements to be open | GPLv2 copyleft on all kernel crates |
| Module enforcement | Kernel refuses non-compliant modules by default |
| Clear legal boundaries | OKLF explicit text, not legal gray area |
23.6 Project Structure
Note: This appendix describes the target project structure at full implementation. The current codebase (see CLAUDE.md "Project Structure") contains the foundational crates (umka-kernel, umka-core, umka-driver-sdk, umka-compat, umka-net, umka-vfs, umka-block, umka-kvm). Additional crates listed below (e.g., umka-accel, umka-cluster, drivers/) will be added as their corresponding architecture sections are implemented.
umka-kernel/
Cargo.toml # Workspace root (all crates)
ARCHITECTURE.md # This document
umka-core/ # Microkernel core
Cargo.toml
src/
main.rs # Boot entry point (calls arch-specific init)
cap/ # Capability system
mod.rs # Capability types, tables, operations
revocation.rs # Generation-based revocation
mem/ # Memory management
phys.rs # Physical page allocator (buddy)
vmm.rs # Virtual memory manager (maple tree, VMAs)
page_cache.rs # Page cache (RCU radix tree)
slab.rs # Slab allocator for kernel objects
pcid.rs # PCID/ASID management
huge.rs # Huge page (THP + explicit) support
sched/ # Scheduler
mod.rs # Scheduler core, class dispatch
cfs.rs # CFS/EEVDF fair scheduler
rt.rs # RT FIFO/RR scheduler
deadline.rs # Deadline (EDF/CBS) scheduler
balance.rs # NUMA-aware load balancer
ipc/ # IPC and isolation
mpk.rs # MPK domain management, WRPKRU helpers
ring.rs # Shared-memory ring buffers
tier2_ipc.rs # Cross-address-space IPC for Tier 2
arch/ # Architecture-specific Rust code
mod.rs # Architecture trait definitions
x86_64/ # x86-64 implementation
mod.rs
gdt.rs # GDT setup
idt.rs # IDT and interrupt dispatch
apic.rs # Local APIC driver (Tier 0)
timer.rs # HPET/TSC/APIC timer (Tier 0)
mpk.rs # MPK hardware interface
vmx.rs # VMX support for KVM
aarch64/ # ARM64 implementation (phase 2+)
mod.rs
armv7/ # ARMv7 implementation (phase 2+)
mod.rs
riscv64/ # RISC-V 64 implementation (phase 2+)
mod.rs
ppc32/ # PPC32 implementation (phase 2+)
mod.rs
ppc64le/ # PPC64LE implementation (phase 2+)
mod.rs
umka-compat/ # Linux syscall interface + compat shims
Cargo.toml
src/
syscall/ # ~450 syscall dispatch table
mod.rs # SyscallHandler enum, dispatch table
process.rs # fork, clone, execve, exit, wait
file.rs # open, read, write, close, ioctl
memory.rs # mmap, brk, mprotect, madvise
network.rs # socket, bind, listen, accept, connect
time.rs # clock_gettime, nanosleep, timer_*
misc.rs # getpid, getuid, uname, sysinfo
proc/ # /proc filesystem emulation
mod.rs
meminfo.rs # /proc/meminfo
cpuinfo.rs # /proc/cpuinfo
pid.rs # /proc/[pid]/* (maps, status, fd, etc.)
sys.rs # /proc/sys/* (sysctl interface)
sys/ # /sys filesystem emulation
mod.rs
devices.rs # /sys/devices/ device tree
class.rs # /sys/class/ device classes
bus.rs # /sys/bus/ bus enumeration
dev/ # /dev filesystem emulation
mod.rs
devtmpfs.rs # devtmpfs-compatible device nodes
signal/ # Signal handling
mod.rs
delivery.rs # Signal delivery to user space
handlers.rs # Default handlers, core dump
namespace/ # Linux namespace implementation
mod.rs
mnt.rs # Mount namespace
pid.rs # PID namespace
net.rs # Network namespace
user.rs # User namespace
ipc.rs # IPC namespace
uts.rs # UTS namespace
cgroup.rs # Cgroup namespace
time.rs # Time namespace
cgroup/ # Cgroup v1/v2
mod.rs
v2.rs # Unified hierarchy (primary)
v1_compat.rs # Legacy hierarchy (compatibility)
controllers/ # cpu, memory, io, pids, etc.
io_uring/ # io_uring subsystem
mod.rs
ring.rs # SQ/CQ ring management
sqpoll.rs # SQPOLL kernel thread
ops.rs # Operation dispatch
lsm/ # Linux Security Modules
mod.rs
hooks.rs # Hook framework
selinux.rs # SELinux policy engine
apparmor.rs # AppArmor profile engine
seccomp.rs # seccomp-bpf filter
ebpf/ # eBPF subsystem
mod.rs
vm.rs # eBPF virtual machine
verifier.rs # Static verifier
jit/ # JIT compilers
x86_64.rs
aarch64.rs
armv7.rs
riscv64.rs
ppc32.rs
ppc64le.rs
maps.rs # Map types (hash, array, ringbuf, etc.)
helpers.rs # eBPF helper functions
programs.rs # Program types (XDP, tc, kprobe, etc.)
umka-net/ # Network stack (runs as Tier 1)
Cargo.toml
src/
tcp/ # TCP/IP implementation
udp/ # UDP implementation
ip/ # IP layer (v4 + v6)
arp.rs # ARP
icmp.rs # ICMP
netfilter/ # nftables + iptables compatibility
mod.rs
nft.rs # nftables engine
conntrack.rs # Connection tracking
nat.rs # NAT (SNAT, DNAT, masquerade)
xdp/ # XDP fast path
socket.rs # Socket abstraction
tunnel/ # Tunnel protocol modules (Section 15.2)
mod.rs # TunnelDevice trait
vxlan.rs # VXLAN encap/decap
geneve.rs # Geneve encap/decap
gre.rs # GRE/GRE6
ipip.rs # IPIP/SIT
wireguard.rs # WireGuard VPN
bridge/ # Software L2 switch (Section 15.2)
mod.rs # Bridge device, FDB, STP
vlan.rs # 802.1Q VLAN filtering
veth.rs # Virtual ethernet pairs
macvlan.rs # macvlan/ipvlan devices
vrf.rs # Virtual Routing and Forwarding
umka-vfs/ # Virtual filesystem layer (Tier 1)
Cargo.toml
src/
mod.rs # VFS dispatch, mount table
ext4/ # ext4 filesystem
xfs/ # XFS filesystem
btrfs/ # btrfs filesystem
tmpfs/ # tmpfs (in-memory)
overlayfs/ # OverlayFS (for containers)
dcache.rs # Directory entry cache
umka-block/ # Block I/O layer (Tier 1)
Cargo.toml
src/
mod.rs # Block device abstraction
scheduler.rs # I/O schedulers (mq-deadline, none, bfq)
partition.rs # Partition table parsing (GPT, MBR)
dm/ # Device-mapper framework (Section 14.3)
mod.rs # DM core: target dispatch, table management
linear.rs # dm-linear
striped.rs # dm-striped
mirror.rs # dm-mirror
crypt.rs # dm-crypt (AES-XTS)
verity.rs # dm-verity
snapshot.rs # dm-snapshot (COW)
thin.rs # dm-thin-pool
md.rs # MD RAID (0/1/5/6/10) superblock compat
lvm.rs # LVM2 metadata reader
recovery.rs # Recovery-aware volume state machine
iscsi/ # iSCSI block storage (Section 14.4)
mod.rs # iSCSI common: PDU parsing, session state
initiator.rs # iSCSI initiator (RFC 7143)
target.rs # iSCSI target (LIO-compatible config)
iser.rs # iSER — RDMA transport for iSCSI
chap.rs # CHAP authentication
multipath.rs # dm-multipath integration
nvmeof/ # NVMe over Fabrics (Section 14.4)
mod.rs # NVMe-oF common: capsule parsing, queue pairs
host.rs # NVMe-oF initiator (host) — connect, I/O
target.rs # NVMe-oF target (subsystem) — nvmetcli compat
tcp.rs # NVMe/TCP transport (TP 8000)
rdma.rs # NVMe/RDMA transport (TP 8001)
discovery.rs # Discovery controller client/server
ana.rs # ANA multipath — asymmetric namespace access
umka-kvm/ # KVM hypervisor (Tier 1)
Cargo.toml
src/
mod.rs # /dev/kvm interface
vmx.rs # Intel VMX
svm.rs # AMD SVM
mmu.rs # Nested page tables (EPT/NPT)
tee/ # Confidential VM support (Section 8.6)
sev.rs # AMD SEV-SNP guest/host
tdx.rs # Intel TDX guest/host
cca.rs # ARM CCA realm management
umka-accel/ # AI/ML accelerator subsystem (Section 21.1)
Cargo.toml
src/
mod.rs # AccelBase trait, device registration
scheduler.rs # CBS-based accelerator scheduler
hmm.rs # Heterogeneous memory management
p2p.rs # Peer-to-peer DMA (PCIe, NVLink, CXL)
inference.rs # In-kernel inference engine
rdma.rs # RDMA and collective ops
umka-cluster/ # Distributed kernel (Section 5.1)
Cargo.toml
src/
mod.rs # Cluster topology, node discovery
transport.rs # KernelTransport (RDMA, CXL, TCP)
ipc.rs # Distributed IPC proxy
dsm.rs # Distributed shared memory
dlm.rs # Distributed Lock Manager (Section 14.6)
global_pool.rs # Global memory pool
scheduler.rs # Cluster-wide scheduling
caps.rs # Network-portable capabilities
umka-driver-sdk/ # Stable driver SDK
Cargo.toml
interfaces/ # .kabi IDL definitions
block_device.kabi # Block device interface
net_device.kabi # Network device interface
gpu_device.kabi # GPU device interface
input_device.kabi # Input device interface
usb_device.kabi # USB device interface
char_device.kabi # Character device interface
pci_device.kabi # PCI device interface
platform_device.kabi # Platform device interface
src/
lib.rs # SDK entry point, driver registration
abi.rs # Generated stable ABI types
dma.rs # DMA buffer management
mmio.rs # MMIO access helpers (volatile read/write)
irq.rs # Interrupt handling
ring.rs # Ring buffer helpers for driver use
manifest.rs # Driver manifest parsing
drivers/ # In-tree drivers
tier0/ # Boot-critical (statically linked)
apic/ # Local APIC + I/O APIC
timer/ # PIT / HPET / TSC
serial/ # Early serial console
vga/ # Early VGA text console
tier1/ # Performance-critical (domain-isolated)
nvme/ # NVMe SSD driver
virtio_blk/ # VirtIO block device
virtio_net/ # VirtIO network device
virtio_gpu/ # VirtIO GPU
virtio_console/ # VirtIO console
e1000/ # Intel e1000 NIC
igb/ # Intel igb NIC
ahci/ # AHCI/SATA controller
ext4/ # ext4 driver component
tier2/ # Isolated (user-space process)
usb_xhci/ # USB XHCI host controller
usb_hid/ # USB HID (keyboard, mouse)
usb_storage/ # USB mass storage
hda_audio/ # Intel HDA audio
input/ # Input subsystem (evdev)
tools/
kabi-compiler/ # .kabi IDL -> Rust/C code generator
Cargo.toml
src/
main.rs
parser.rs # IDL parser
codegen_rust.rs # Rust binding generator
codegen_c.rs # C binding generator
kabi-compat-check/ # ABI compatibility CI checker
Cargo.toml
src/
main.rs # Diffs old vs new .kabi, rejects breaks
umka-initramfs/ # Initramfs builder tool
Cargo.toml
src/
main.rs # Packs drivers + early userspace
arch/ # Architecture-specific C/asm
x86_64/
boot/ # UEFI/BIOS boot stub (C + asm)
header.S # Linux boot protocol header
main.c # Early C boot code
efi_stub.c # UEFI stub
asm/
entry.S # Syscall entry/exit
switch.S # Context switch
irq_stubs.S # Interrupt stub table
vdso/
vdso.lds # vDSO linker script
clock_gettime.c # clock_gettime implementation
getcpu.c # getcpu implementation
aarch64/
boot/ # ARM64 boot stub
asm/ # ARM64 assembly
vdso/ # ARM64 vDSO
riscv64/
boot/ # RISC-V boot stub
asm/ # RISC-V assembly
vdso/ # RISC-V vDSO
ppc32/
boot/ # PPC32 boot stub
asm/ # PPC32 assembly
vdso/ # PPC32 vDSO
ppc64le/
boot/ # PPC64LE boot stub
asm/ # PPC64LE assembly
vdso/ # PPC64LE vDSO
tests/
abi_compat/ # Old driver binaries for compat regression
syscall/ # Linux syscall conformance (LTP-based)
driver/ # Driver integration tests
bench/ # Performance regression benchmarks
crash_recovery/ # Fault injection + recovery verification
23.7 What UmkaOS Provides That Linux Cannot
| Feature | Linux | UmkaOS |
|---|---|---|
| Driver crash recovery | Kernel oops or panic depending on fault type. Many driver bugs produce oops (system continues with degraded functionality) rather than panic. Recovery requires at minimum driver module reload; severe faults cause panic and full reboot (30-60s). | Reload driver in ~50-150ms (Tier 1) or ~10ms (Tier 2) |
| Stable driver ABI | None (recompile every update) | Versioned, append-only, binary-stable KABI |
| Driver isolation | None (shared address space) | Domain isolation + IOMMU (Tier 1), full process (Tier 2) |
| Capability-based security | Bolt-on (POSIX caps are coarse) | Foundational architecture |
| Lock ordering enforcement | Runtime lockdep (debug only) | Compile-time via Rust type system: type-level lock ordering using phantom type parameters that encode lock level in the type signature (e.g., Lock<Level3>), preventing out-of-order acquisition at compile time. See umka-core lock design (Section 7.2). |
| io_uring security | Bypasses syscall monitoring | Per-instance operation whitelist |
| Hot driver upgrade | Fragile (unstable ABI) | Clean stop/start with stable KABI |
| Memory safety | C everywhere | Rust with minimal unsafe at hardware boundaries |
| Many-core scalability | Known bottlenecks (RTNL for networking, inode_lock for VFS, cgroup_mutex for cgroups) | No global locks, per-CPU/per-NUMA everywhere |
| Proactive fault management | Ad-hoc (mcelog, rasdaemon) | Unified FMA with diagnosis engine (Section 19.1) |
| Memory compression | zswap/zram (separate, config-heavy) | Integrated NUMA-aware zpool tier (Section 4.2) |
| CPU bandwidth guarantee | No floor mechanism | CBS-backed cpu.guarantee (Section 6.3) |
| Stable observability ABI | Tracepoints are unstable | Versioned, documented stable tracepoints (Section 19.2) |
| Verified boot chain | Fragmented (UEFI SB + IMA + dm-verity) | Unified chain from firmware to drivers (Section 8.2) |
| Kernel object introspection | Per-subsystem (/proc, /sys, scattered) | Unified object namespace via umkafs (Section 19.4) |
| Driver state preservation | Lost on crash — cold restart | Checkpointed state buffer, warm restart (Section 10.8) |
| Core panic diagnostics | kexec + kdump (complex setup) | In-place crash dump to reserved memory (Section 10.8) |
| Context switch XSAVE cost | Eager XSAVE with XSAVEOPT/XSAVES optimizations (skips unmodified components, but still saves full state for context switches involving SIMD). UmkaOS's lazy approach avoids save/restore entirely for non-SIMD threads. | Lazy XSAVE — zero cost for non-SIMD threads (Section 6.1.6) |
| CPU errata management | Scattered #ifdef, ad-hoc | Structured quirk table + boot-param controls (Section 2.1.4) |
| Volume layer + driver crash | Device marked failed, RAID resync | Recovery-aware: pause I/O, resume clean (Section 14.3) |
| VM guest driver crash | VM reboot required | Driver recovers in-place, hypervisor unaware (Section 17.1) |
| Block storage networking | Separate stacks (open-iscsi, nvme-cli, no unified recovery) | Unified iSCSI + NVMe-oF with RDMA upgrade and crash recovery (Section 14.4) |
| Clustered FS + driver crash | Node fenced, ejected from cluster | Driver recovers in-place, node stays in cluster (Section 14.5) |
| Distributed locking | TCP-based DLM (~10-100 μs/op depending on lock locality; local locks <1 μs), global recovery quiesce on any node failure | RDMA-native DLM (~2-3 μs uncontested, ~5-10 μs contested), per-resource recovery, lease-based extension, batch ops (Section 14.6) |
| TPM key management | Userspace daemon (tpm2-abrmd) | Kernel-native resource manager + capability integration (Section 8.3) |
| Runtime integrity | IMA bolted onto VFS, optional | Integrated with capability system and driver loading (Section 8.4) |
| Display stack crash | X/Wayland session lost | DMA-BUF survives driver reload, compositor stalls ~100ms-5s (full recovery window; Section 21.5.2.6) |
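The compile-time lock ordering row above can be illustrated with a small sketch: lock levels become phantom type parameters, and a guard at one level can only acquire locks at a strictly higher level. This is a simplified model of the Section 7.2 design; all names here are illustrative:

```rust
use std::marker::PhantomData;
use std::sync::{Mutex, MutexGuard};

// Lock levels, ordered L1 < L2 < L3.
struct L1; struct L2; struct L3;

// "Self is a strictly higher level than L" — the permitted acquisition edges.
trait Above<L> {}
impl Above<L1> for L2 {}
impl Above<L1> for L3 {}
impl Above<L2> for L3 {}

struct OrderedLock<L, T> { inner: Mutex<T>, _level: PhantomData<L> }
struct Held<'a, L, T> { guard: MutexGuard<'a, T>, _level: PhantomData<L> }

impl<L, T> OrderedLock<L, T> {
    fn new(v: T) -> Self { Self { inner: Mutex::new(v), _level: PhantomData } }
    /// Entry point when no lock is held yet.
    fn lock(&self) -> Held<'_, L, T> {
        Held { guard: self.inner.lock().unwrap(), _level: PhantomData }
    }
}

impl<'a, L, T> Held<'a, L, T> {
    /// Acquire a second lock only if its level is strictly above ours.
    fn lock_next<'b, M: Above<L>, U>(&self, next: &'b OrderedLock<M, U>) -> Held<'b, M, U> {
        next.lock()
    }
}

fn main() {
    let cap_table = OrderedLock::<L1, u32>::new(7);
    let page_map = OrderedLock::<L3, u32>::new(9);
    let g1 = cap_table.lock();
    let g3 = g1.lock_next(&page_map); // L1 -> L3: compiles
    // g3.lock_next(&cap_table);      // L3 -> L1: rejected at compile time
    assert_eq!(*g3.guard, 9);
    println!("acquired {} then {}", *g1.guard, *g3.guard);
}
```

Out-of-order acquisition fails type-checking because no Above impl exists for the reversed pair, so the deadlock-prone interleaving can never be written.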
23.8 Cross-Feature Integration Map
23.8.1 Cross-Feature Integration Map
These features are not independent — they reinforce each other:
Formal verification (Section 23.10) ──────► Confidential computing (Section 8.6)
Proves capability system correct Relies on correct capability enforcement
Safe extensibility (Section 18.7) ◄──────► Live evolution (Section 12.6)
Policy modules are hot-swappable Evolution uses the same mechanism
Intent-based management (Section 6.7) ◄──► In-kernel inference (Section 21.4)
Intent optimizer uses learned models Models optimize for declared intents
EAS / heterogeneous CPU (Section 6.1.5) ◄──► Power budgeting (Section 6.4)
EAS picks energy-optimal core Power budget enforces watt cap
| Feature A | | Feature B | Interaction |
|---|---|---|---|
| Power budgeting (Section 6.4) | ◄──► | Intent-based management (Section 6.7) | Power budget is a constraint; intents include an efficiency preference |
| Hardware memory safety (Section 2.3) | ──► | Tier 1 driver isolation (Section 10.4) | MTE catches C driver bugs; domain isolation catches the resulting faults |
| Confidential computing (Section 8.6) | ──► | Distributed kernel (Section 5.1) | TEE-to-TEE RDMA; DSM coherence for encrypted pages |
| Post-quantum crypto (Section 8.5) | ──► | Distributed capabilities (Section 5.1.10) | PQC signatures on capabilities; network-portable across the cluster |
| SmartNIC/DPU (Section 5.2) | ◄──► | Distributed kernel (Section 5.1) | DPU = close remote node; same proxy driver pattern |
| Persistent memory (Section 14.7) | ◄──► | Memory tiers (Section 21.2) | Persistent memory is another tier, managed by the same PageLocationTracker |
| Computational storage (Section 14.8) | ◄──► | Accelerator framework (Section 21.1) | CSD = storage accelerator; same AccelBase vtable |
| Unified compute (Section 21.6) | ◄──► | EAS / heterogeneous CPU (Section 6.1.5) | Multi-dimensional capacity extends scalar; CPU capacity is a special case |
| Unified compute (Section 21.6) | ◄──► | Accelerator scheduler (Section 21.1.2.4) | Cross-device topology + energy data; accel scheduler consumes the advisory |
| Unified compute (Section 21.6) | ◄──► | Power budgeting (Section 6.4) | Workload profile drives throttle; informed cross-device power decisions |
| Unified compute (Section 21.6) | ◄──► | Intent-based management (Section 6.7) | compute.weight feeds the intent optimizer; the optimizer adjusts per-domain knobs |
| Unified compute (Section 21.6) | ◄──► | Distributed kernel (Section 5.1) | Peer kernel nodes via NodeTransport; accelerator = close compute node |
| Unified compute (Section 21.6) | ◄──► | SmartNIC/DPU offload (Section 5.2) | Same convergence: device → peer node; NodeTransport unifies both transports |
| Distributed Lock Manager (Section 14.6) | ◄──► | RDMA transport (Section 5.1.4) | DLM uses RDMA CAS/Send for locks; the transport provides the kernel RDMA API |
| Distributed Lock Manager (Section 14.6) | ◄──► | Cluster membership (Section 5.1.12) | DLM receives join/leave/dead events; single heartbeat source for both |
| Distributed Lock Manager (Section 14.6) | ◄──► | Clustered filesystems (Section 14.5) | GFS2/OCFS2 use DLM for coordination; DLM lock modes map to FS operations |
| Distributed Lock Manager (Section 14.6) | ◄──► | Driver recovery (Section 10.8) | DLM in umka-core survives driver crashes; no lock recovery needed on Tier 1 reload |
Bootstrap Circular Dependency:
The intent optimizer (Section 6.7) uses in-kernel inference models (Section 21.4), but those models may not be loaded at early boot. Resolution: the intent optimizer degrades gracefully to static defaults when models are unavailable. At boot:
1. The intent optimizer starts with hardcoded heuristics (e.g., "latency target → raise cpu.weight by 20%").
2. When the inference engine loads its models (typically within seconds of boot), the optimizer transitions to learned optimization.
3. The transition is seamless — no reconfiguration needed.
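A minimal sketch of this fallback (all names are hypothetical stand-ins, not the real UmkaOS API; the adjustment factors are illustrative):

```rust
/// Hypothetical sketch of the boot-time fallback described above: consult
/// the inference engine once its models are loaded, otherwise apply the
/// hardcoded heuristic. Illustrative only.
struct InferenceEngine {
    models_loaded: bool,
}

impl InferenceEngine {
    /// Stand-in for a learned recommendation from a loaded model.
    fn recommend_cpu_weight(&self, current: u32) -> u32 {
        current + current / 4 // placeholder for model output
    }
}

/// New cpu.weight for a "latency target" intent.
fn optimize_latency_intent(engine: &InferenceEngine, cpu_weight: u32) -> u32 {
    if engine.models_loaded {
        // Learned optimization path, active once models are loaded.
        engine.recommend_cpu_weight(cpu_weight)
    } else {
        // Static default: raise cpu.weight by 20% (hardcoded heuristic).
        cpu_weight + cpu_weight / 5
    }
}
```

The caller never needs reconfiguration: the same entry point is used before and after the model load, which is what makes the transition seamless.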
23.8.2 D.2 Implementation Dependency Graph
Foundation (no dependencies):
├── Formal verification readiness (Section 23.10) — design methodology
├── Post-quantum crypto abstraction (Section 8.5) — data structure sizing
└── Real-time preemption model (Section 7.2) — lock design
Early integration:
├── Hardware memory safety (Section 2.3) — needs memory allocator
├── Power budgeting (Section 6.4) — needs scheduler
└── Safe extensibility (Section 18.7) — needs KABI vtable mechanism
Mid integration:
├── Confidential computing (Section 8.6) — needs memory manager, IOMMU
├── Intent-based management (Section 6.7) — needs inference engine, cgroups
└── Live evolution (Section 12.6) — needs extensibility mechanism
Late integration:
├── SmartNIC/DPU offload (Section 5.2) — needs proxy driver, device registry
├── Persistent memory (Section 14.7) — needs VFS, memory tiers
├── Computational storage (Section 14.8) — needs AccelBase framework
├── Unified compute topology (Section 21.6) — needs AccelBase, EAS (Section 6.1.5), power budgeting (Section 6.4)
└── Peer kernel nodes (Section 21.6.13) — needs unified compute + distributed kernel (Section 5.1)
23.9 Open Questions
The following cross-cutting items require further design work. Each is tracked as an open question with the affected sections and the specific decision to be made.
io_uring integration (affects Section 18.1.5, Section 8.6, Section 12.6, Section 5.2):
- Registered buffers in confidential computing: io_uring pre-registers DMA buffers
at setup time. When a VM runs under SEV-SNP, these buffers must be in shared
(unencrypted) memory. Decision needed: register-time enforcement vs. lazy conversion.
- State migration during live evolution: io_uring's SQ/CQ rings, registered files, and
registered buffers constitute persistent state. The live evolution framework (Section 12.6) needs
a StateSerializer for io_uring context. Decision needed: drain-and-recreate vs.
in-place serialization.
- DPU submission offload: DPUs can process io_uring submission queues directly, bypassing
host CPU for network and storage operations. Decision needed: how the DPU reads SQ
entries (shared memory mapping vs. DMA push) and how completions are posted back to CQ.
GPU virtualization (affects Section 21.1, Section 8.6):
- Confidential GPU VMs require that GPU VRAM is encrypted and attestable. SEV-SNP does not natively protect PCIe device memory. TDX Connect (Intel) and ARM CCA device assignment are emerging but not yet stable. Decision needed: software bounce buffer path (safe, slow) vs. hardware-assisted device encryption (fast, hardware-dependent).
- Nested virtualization with GPU passthrough: a confidential VM running a nested hypervisor that passes through a GPU adds three layers of IOMMU translation. Decision needed: whether to support this (performance may be prohibitive).
Testing strategy for cross-feature interactions (affects all Section 8.6-Section 23.2):
- Combinatorial explosion: 15 features yield 105 pairwise interactions. Exhaustive
testing is infeasible. Prioritized critical pairs:
1. RT + confidential computing (latency impact of memory encryption)
2. Power budgeting + intent optimization (conflicting objectives)
3. MTE + DSM page migration (tag preservation across RDMA transfer)
4. Live evolution + RT (component swap during hard-RT operation)
5. DPU offload + confidential computing (encrypted DPU-host channel)
- Test matrix and CI strategy:
- Each prioritized pair has a dedicated integration test suite in tests/compat/.
- CI runs the top 5 pairs on every PR. The remaining 100 pairs run nightly on
the develop branch. Failures block merge to master.
- Acceptance threshold per pair: P99 latency regression < 5%, zero correctness
failures on 10,000 test iterations, zero sanitizer findings (KASAN/KCSAN/KMSAN
equivalents via UmkaOS's compile-time and runtime checks).
- For confidential-computing pairs (CC + RT, CC + DSM, CC + GPU): additional
attestation correctness check — remote verifier must accept the measurement
after every feature combination is enabled.
- Fuzz-assisted testing: syzkaller-style syscall fuzzer runs the top 10 pairs
for 24 hours per release candidate, targeting namespace + cgroup + LSM
interactions (historically the highest-bug-density intersection).
- Test coverage gate: each pair's integration test suite must achieve ≥80%
branch coverage of the relevant subsystem code paths before being declared
"passing" (not just "not crashing").
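The pair arithmetic above is simple combinatorics: n features give n·(n−1)/2 unordered pairs, so 15 features give 105, and subtracting the 5 per-PR pairs leaves 100 for the nightly run. A one-line check:

```rust
/// Number of unordered feature pairs among `n` features: C(n, 2).
fn pairwise_interactions(n: u64) -> u64 {
    n * (n - 1) / 2
}
```

With 15 features this yields 105 pairs; running 5 on every PR leaves 100 for the nightly matrix, matching the figures in the text.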
Secure boot measurement chain (affects Section 8.2, Section 8.6, Section 18.7, Section 12.6):
- Live kernel evolution PCR and remote attestation protocol: RESOLVED — see the
"Live Evolution attestation chain" entry below for the full decision (dedicated
PCR[16]/PCR[23], hash-chained event log, TPM2_PolicyAuthorize for sealed secrets).
- Policy modules (Section 18.7) loaded at runtime must also be measured. Decision needed: whether
measurement is mandatory (blocks unsigned modules) or advisory (measure but allow).
CXL 3.0 fabric management (affects Section 5.1, Section 14.7, Section 21.6):
- CXL 3.0 introduces fabric-attached memory with hardware-managed coherence. Decision needed: how this integrates with the distributed kernel's software DSM protocol (Section 5.1.6). Options: CXL replaces DSM for intra-rack while DSM remains for inter-rack; or DSM degrades gracefully when CXL is available.
Multi-architecture parity for advanced features (affects Section 2.2, Section 8.6-Section 23.2):
- Many features in Section 8.6-Section 23.2 are specified in terms of architecture-specific mechanisms (WRPKRU, SEV-SNP, and RAPL on x86-64; MTE on ARM64). Equivalents exist on the other architectures for some but not all of them. Partially addressed: Section 2.2 now includes an "Advanced Feature Architecture Parity" matrix covering 8 key features across all six architectures. Remaining decision: per-feature acceptance criteria for "software fallback" vs. "not supported" (performance thresholds, testing requirements).
eBPF verifier completeness (affects Section 18.1.4): RESOLVED — Full verifier, phased delivery.
Decision: UmkaOS implements a full eBPF verifier equivalent to Linux's verifier.
Rationale:
- UmkaOS targets 100% Linux userspace compatibility. Tools in widespread production use (BCC, bpftrace, libbpf, Cilium, Falco) rely on full verifier semantics — type-safe map access, bounded loops, all helper prototypes. A partial verifier silently rejects valid programs, breaking these tools without any useful error message.
- Safety: a partial verifier is not a safe subset — it is an incomplete verifier. The historical CVEs (CVE-2021-31440, CVE-2021-3490, CVE-2022-23222) arose from incorrect bounds tracking and register-state pruning in the full verifier codebase, not from attempting full verification. The solution is a correct full verifier, not a simpler but still-incorrect one.
- The Linux verifier is a well-understood reference implementation whose semantics are fully documented via the BPF ISA specification and the kernel's internal type system. UmkaOS's clean-room Rust reimplementation targets semantic equivalence, not code equivalence, allowing a cleaner design that avoids the accumulated technical debt of the Linux C implementation.
Verifier capabilities (all required, no deferred items):
- Type safety: all register types are tracked through every instruction — scalars, pointers to map values, pointers to ctx fields, pointers to stack slots, packet data pointers. Type propagation through helper calls uses the full helper prototype table. Pointer arithmetic on typed pointers is tracked with offset bounds.
- Memory bounds checking: every load and store is proven in-bounds before JIT emission. For map value pointers: [0, map.value_size). For ctx pointers: a per-program-type access matrix (e.g., __sk_buff field access rules for TC programs). For stack slots: [-512, 0). For packet data pointers: [data, data_end) with an explicit data_end check before access.
- Termination: bounded loop analysis using the loop bound counter mechanism introduced in Linux 5.3. Maximum iterations per loop: 8,388,608 (8M) by default, matching Linux. Programs that cannot prove termination within the bound are rejected. Back-edge detection uses DFS on the CFG; a back edge without a proven decreasing bound variable is a hard rejection.
- Helper function verification: every bpf_call instruction is checked against the full helper prototype table. Argument types are checked (e.g., ARG_PTR_TO_MAP requires a loaded map fd; ARG_PTR_TO_MEM | MEM_RDONLY requires a readable stack slot or map value). Return value types are recorded for type propagation.
- JIT safety: programs that pass verification are JIT-compiled; programs that fail verification are rejected at BPF_PROG_LOAD time with a structured error report (verifier log, available via BPF_OBJ_GET_INFO_BY_FD). Unverified execution of eBPF bytecode is never permitted at any privilege level.
- Privileged/unprivileged split: without CAP_BPF (and CAP_PERFMON for tracing programs), program types are restricted to socket filters and cgroup skb programs. Pointer arithmetic on packet data is allowed; pointer arithmetic on map values and ctx fields is disallowed. This matches Linux's allow_ptr_leaks and bypass_spec_v1 verifier flags for unprivileged contexts.
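The per-pointer-kind bounds rules can be sketched as a single predicate. This is an illustration of the rules as stated, not the UmkaOS verifier (a real verifier tracks symbolic ranges per register rather than checking one concrete access):

```rust
/// Illustrative pointer kinds with their valid offset ranges, mirroring
/// the bounds rules in the text. Names and types are assumptions.
enum PtrKind {
    MapValue { value_size: i64 }, // valid offsets: [0, value_size)
    Stack,                        // valid offsets: [-512, 0)
    Packet { data_end_off: i64 }, // valid offsets: [0, data_end - data)
}

/// Is a `width`-byte access at offset `off` provably in bounds?
fn access_ok(ptr: &PtrKind, off: i64, width: i64) -> bool {
    match ptr {
        // Map value: entire access must fit within [0, value_size).
        PtrKind::MapValue { value_size } => off >= 0 && off + width <= *value_size,
        // Stack slot: entire access must fit within [-512, 0).
        PtrKind::Stack => off >= -512 && off + width <= 0,
        // Packet data: requires a prior data_end check; the proven
        // distance to data_end bounds the access.
        PtrKind::Packet { data_end_off } => off >= 0 && off + width <= *data_end_off,
    }
}
```

A load that fails this predicate would be rejected before JIT emission rather than guarded at runtime.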
Phase assignment:
- Phase 2 — eBPF bytecode interpreter + verifier for scalar types and simple map access. Program types: BPF_PROG_TYPE_SOCKET_FILTER and BPF_PROG_TYPE_CGROUP_SKB. Map types: BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_PERF_EVENT_ARRAY. No loops (loop-free control flow only). This subset is sufficient for seccomp-bpf and basic network filtering, satisfying the Phase 2 exit criteria (Docker hello-world).
- Phase 3 — full verifier: complete type system including all pointer kinds, bounded loop analysis, all core helper prototypes (>200 helpers). Program types: all types required for bpftrace, BCC, and Cilium CNI — XDP, TC, KPROBE/KRETPROBE, TRACEPOINT, PERF_EVENT, CGROUP_*, SK_*, FLOW_DISSECTOR. JIT backend for x86-64. Map types: all Linux-equivalent types including BPF_MAP_TYPE_RINGBUF, BPF_MAP_TYPE_SOCKHASH, BPF_MAP_TYPE_LPM_TRIE.

  Implementation risk note: the full eBPF verifier is among the most complex single components in the entire implementation plan. Linux's verifier + BTF implementation (verifier.c + btf.c) required years of iterative security hardening to reach production grade. Phase 3 delivers functional completeness (correct programs accepted, incorrect programs rejected); production-grade security hardening against adversarial programs is an ongoing concern through Phase 4-5. Operators running untrusted eBPF programs should use unprivileged BPF restrictions (CAP_BPF required) until the verifier has accumulated sufficient security review.
- Phase 4 — JIT backend for AArch64. Verifier additions: struct_ops program type (required for TCP congestion control via eBPF and sched_ext schedulers).
- Phase 5 — JIT backend for RISC-V 64. Full parity on all six supported architectures: programs compiled for x86-64 are re-verified and JIT-compiled on each architecture; verifier output is architecture-independent (the verifier itself is not JIT-backend-specific).
io_uring + SEV-SNP shared buffer management (affects Section 18.1.5, Section 8.6):
RESOLVED — see Section 18.1.5.1.
Resolution: bounce buffer architecture. SQE/CQE rings remain in encrypted guest memory
(kernel and userspace share the same encryption domain). DMA data payloads are bounced
through a pre-allocated C-bit-clear pool (default 16 MiB per ring, 64 MiB system-wide).
Plaintext bounce buffers are acceptable for block I/O (dm-crypt handles encryption above
io_uring); network buffers carrying secrets use an opt-in IORING_REGISTER_BUFFERS_ENCRYPTED
flag for per-buffer AES-GCM encryption (~1 us per 4 KiB). Performance impact: ~0.6-1.0 us
per I/O for the two extra memcpy operations, additive to SEV-SNP's 5-15% baseline.
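A toy sketch of the bounce-pool accounting described above (the type names, error type, and bump-style layout are assumptions; real staging would copy the payload into C-bit-clear memory and, for opted-in buffers, AES-GCM-seal it first):

```rust
/// Default C-bit-clear pool size per ring, per the resolution text.
const RING_POOL_BYTES: usize = 16 * 1024 * 1024; // 16 MiB

/// Hypothetical per-ring bounce pool, modeled as a bump allocator.
struct BouncePool {
    used: usize,
}

#[derive(Debug, PartialEq)]
enum BounceError {
    PoolExhausted,
}

impl BouncePool {
    fn new() -> Self {
        BouncePool { used: 0 }
    }

    /// Stage `len` payload bytes through the shared (unencrypted) pool,
    /// returning the offset of the staging region. `encrypt` models the
    /// opt-in IORING_REGISTER_BUFFERS_ENCRYPTED flag: when set, the
    /// payload would be AES-GCM-encrypted before the copy (~1 us / 4 KiB).
    fn stage(&mut self, len: usize, encrypt: bool) -> Result<usize, BounceError> {
        if self.used + len > RING_POOL_BYTES {
            return Err(BounceError::PoolExhausted);
        }
        let offset = self.used;
        self.used += len;
        // Real path: encrypt-if-requested, then memcpy into C-bit-clear
        // memory at `offset`; the second memcpy happens on completion.
        let _ = encrypt;
        Ok(offset)
    }
}
```

The two memcpy operations implied by `stage` and its completion-side counterpart are the source of the ~0.6-1.0 us per-I/O cost cited above.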
GPU VRAM encryption for confidential VMs (affects Section 21.1, Section 8.6):
- NVIDIA H100 supports CC (Confidential Computing) mode with hardware-encrypted VRAM and attestable GPU firmware. AMD MI300X has a less mature confidential computing ecosystem: MI300X provides memory encryption via AMD Infinity Fabric (SME/SEV-based whole-system encryption), but full VRAM encryption integration with SEV-SNP guest VMs — where individual guest VM memory is isolated from the hypervisor — should be verified against the current hardware silicon revision and ROCm driver availability before relying on it.
- Decision needed: a software fallback path where GPU computations operate on encrypted host memory via bounce buffers (10-100x slower due to PCIe round-trips and CPU-side encryption), or restricting confidential GPU workloads to inference-only (model weights are public; only input/output needs encryption) and encrypting only the host-to-GPU and GPU-to-host transfer buffers.
CXL 3.0 coherence domain interaction with DSM (affects Section 5.1.6, Section 5.1.13):
- CXL 3.0 Type 3 devices provide hardware-coherent shared memory between hosts via the CXL.mem protocol with back-invalidate support. This capability overlaps with the software DSM protocol defined in Section 5.1.6.
- Decision needed: when a CXL 3.0 fabric is available between two nodes, should the DSM protocol defer entirely to CXL hardware coherence (simpler, lower latency at ~200 ns vs. ~5 microseconds for software DSM, but limited to CXL-connected nodes within a single rack), or should DSM provide a unified abstraction that uses CXL as a fast transport underneath (more complex, but a uniform API across CXL and non-CXL nodes)? The hybrid approach adds a transport-selection layer to DSM that routes coherence traffic over CXL when available and falls back to RDMA otherwise.
Live Evolution attestation chain (affects Section 12.6, Section 8.2): RESOLVED — Deferred attestation via dedicated auxiliary PCR and structured event log.
Decision: Dedicated PCR with hash-chained event log; TPM2_PolicyAuthorize for sealed secrets.
Re-measuring the entire kernel image on each component swap is rejected: it requires the re-measurer to know the full composition of every other loaded component, which is not available to the kernel itself during a hot-swap (components may be loaded from different packages at different times). Extending a single "current kernel" PCR with the new image hash after each swap collapses all ordering information — a verifier cannot distinguish "component A then B" from "component B then A" — and makes rollback detection impossible.
Mechanism: Auxiliary PCR Hash Chain
A dedicated TPM PCR is reserved exclusively for live evolution measurements. It is never extended by the boot firmware, bootloader, or the initial kernel load (those measurements go into their standard PCRs per the TCG PC Client Platform Firmware Profile specification).
PCR assignment by phase:
- Phase 3 development: PCR[16]. PCR[16] is designated by the TCG specification as a
"debug" PCR that is resettable via TPM2_PCR_Reset while the platform is in debug
mode. This allows iterative testing of the attestation chain without requiring a reboot
to clear state after each test run.
- Production (Phase 4+): PCR[23]. PCR[23] is the standard "application-specific"
PCR reserved for OS and application use. It is not reset by firmware transitions and
is not extended by any standard boot component, making it clean for UmkaOS's exclusive
use. The transition from PCR[16] to PCR[23] is a compile-time constant
UMKA_LIVE_EVOLUTION_PCR that changes between development and production builds.
Extension protocol: Before activating a hot-swapped component, UmkaOS executes:
PCR[UMKA_LIVE_EVOLUTION_PCR] =
SHA-256(PCR_current || component_sha256 || component_metadata_hash)
where component_metadata_hash = SHA-256(component_name || component_version || load_timestamp_ns).
This creates a cryptographically ordered chain: each PCR value commits to all previously loaded components in their exact load order. Reordering components or omitting any component produces a different PCR value that no attestation policy will accept.
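The ordering property of the extend chain is easy to demonstrate as a fold. In this sketch, std::hash::DefaultHasher stands in for SHA-256 so the example stays dependency-free; only the chaining structure matters, not the hash function:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One PCR extension step: new = H(current || component_hash || metadata_hash).
/// DefaultHasher is a stand-in; the real chain uses SHA-256 over the raw
/// byte concatenation as shown in the formula above.
fn extend(pcr: u64, component_hash: u64, metadata_hash: u64) -> u64 {
    let mut h = DefaultHasher::new();
    pcr.hash(&mut h);
    component_hash.hash(&mut h);
    metadata_hash.hash(&mut h);
    h.finish()
}

/// Replay (component_hash, metadata_hash) events in sequence order from
/// the baseline value. A verifier recomputes this fold and compares the
/// result to the quoted PCR value.
fn replay(baseline: u64, events: &[(u64, u64)]) -> u64 {
    events.iter().fold(baseline, |pcr, &(c, m)| extend(pcr, c, m))
}
```

Because each step hashes the previous PCR value, reordering or omitting any event changes the final value, which is exactly the property the attestation policy relies on.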
Event log entry: Each extension appends a LiveEvolutionEvent record to the TPM
event log (the standard binary log at /sys/kernel/security/tpm0/binary_bios_measurements,
extended by the kernel via tpm_pcr_extend and the event log API):
/// One record appended to the TPM event log per hot-swapped component.
/// Written before PCR extension; if the extension fails, the record is
/// removed and activation is aborted.
#[repr(C)]
pub struct LiveEvolutionEvent {
/// Event type tag. UmkaOS uses type 0x00000085 (first unallocated vendor
/// range per the TCG spec); log parsers read event_type to skip records
/// they do not understand.
pub event_type: u32,
/// Monotonically increasing sequence number across all live-evolution events
/// since boot. Starts at 1; the baseline extension at boot is sequence 0.
pub sequence: u64,
/// SHA-256 of the component binary payload before signature stripping.
pub component_sha256: [u8; 32],
/// SHA-256 of the metadata fields below (for independent verification).
pub metadata_hash: [u8; 32],
/// Null-terminated UTF-8 component name, e.g. "umka-net" or "umka-nvme".
pub component_name: [u8; 64],
/// Null-terminated UTF-8 semantic version string, e.g. "2.1.0+build.4711".
pub component_version: [u8; 32],
/// Nanoseconds since boot (CLOCK_BOOTTIME) at activation time.
pub load_timestamp_ns: u64,
/// ML-DSA-44 signature over all preceding fields, signed by the
/// kernel's live-evolution signing key (provisioned at boot from the
/// UmkaOS signing certificate in the UEFI Secure Boot db).
pub signature: [u8; 2420],
}
The event log is the complete record of the system's live evolution history. Remote
attestation verifiers reconstruct the expected PCR value by replaying all
LiveEvolutionEvent records in sequence order and computing the same extend chain.
The signature on each record lets verifiers authenticate individual events without
trusting the log's storage integrity (the PCR value itself provides tamper detection
for the chain as a whole).
Baseline event at boot: During kernel initialization, before any component is
hot-swapped, UmkaOS extends UMKA_LIVE_EVOLUTION_PCR with a baseline record representing
the boot-time kernel composition:
component_sha256 = SHA-256(entire_kernel_image_at_boot)
sequence = 0
component_name = "umka-kernel-baseline"
component_version = <kernel version string>
This anchors the chain to the boot measurement. A system that has never applied any live patch has a PCR value equal to this single extension. The attestation policy for "unpatched kernel" requires exactly this single-extension value.
Remote attestation protocol for live-patched kernels:
1. The relying party requests a TPM quote covering the standard boot PCRs and UMKA_LIVE_EVOLUTION_PCR, plus the full binary event log.
2. The verifier confirms the boot PCRs match the expected boot-time measurements (kernel image hash, Secure Boot policy, GRUB measurement).
3. The verifier replays the LiveEvolutionEvent records in sequence order, verifying each record's ML-DSA-44 signature against the UmkaOS signing certificate.
4. The verifier recomputes the expected UMKA_LIVE_EVOLUTION_PCR value from the replay and checks it against the quoted PCR value.
5. The verifier applies its patch policy: each component_sha256 in the event log must appear in the verifier's approved-patch database. Unknown patch hashes cause attestation failure.
TPM-sealed secrets under live evolution:
Secrets sealed to specific PCR values (e.g., disk encryption keys sealed to the boot
PCR set) cannot be unsealed after live patches change UMKA_LIVE_EVOLUTION_PCR. This
is the correct security behavior: a modified kernel must re-prove its trustworthiness
before accessing secrets.
For systems that need to unseal secrets after applying approved patches, the sealing
policy uses TPM2_PolicyAuthorize rather than TPM2_PolicyPCR with fixed values.
The TPM2_PolicyAuthorize policy delegates the unsealing decision to the holder of an
authorized signing key (the UmkaOS attestation key, provisioned at enrollment time). The
UmkaOS attestation service, after verifying the event log and confirming all applied
patches are approved, signs a policy digest that authorizes unsealing. The kernel
presents this signed authorization to the TPM alongside the quote, and the TPM unseals
the secret without requiring the PCR values to match the original sealed-time values.
Existing systems that have secrets sealed under TPM2_PolicyPCR (without
TPM2_PolicyAuthorize) must re-seal their secrets during the first maintenance window
after the live evolution feature is enabled. The umka-attestd daemon handles this
migration automatically: it unseals secrets using the old policy (which still works
before the first patch is applied), re-seals them under the new
TPM2_PolicyAuthorize-based policy, and verifies the re-seal by immediately performing
a test unseal before committing the new sealed blob to persistent storage.
This document is the canonical reference for UmkaOS development. All implementation decisions must be traceable to the architecture described here. Changes to this document require team review and approval.
23.10 Formal Verification Readiness
23.10.1 The Opportunity
Formal verification of kernel code has crossed the practicality threshold:
2009: seL4 — 200,000 lines of proof for 10,000 lines of C. Heroic effort.
2018: RustBelt — Formal soundness proof for Rust's ownership model.
2022-2025: Verus (Carnegie Mellon University, VMware Research, Microsoft Research,
ETH Zurich, and others) — Automated verification for Rust.
Write Rust code + specifications → tool PROVES correctness.
Not testing. Not fuzzing. Mathematical machine-checked proof.
Verus can verify Rust code of realistic complexity: concurrent data structures, state machines, protocols, invariant maintenance. UmkaOS is written in Rust. The verification infrastructure exists.
23.10.2 What To Verify
Not everything needs verification. Focus on security-critical invariants and concurrency-sensitive code where bugs have catastrophic consequences:
| Component | Invariant to Prove | Section |
|---|---|---|
| Capability system | Capabilities cannot be forged. Revocation is complete. Permissions never escalate. | Section 8.1.1 |
| Page table management | No page mapped into two processes simultaneously without explicit sharing. Freed pages never accessible. | Section 4.1 |
| Memory allocator | No page allocated twice. No double-free. Buddy merging preserves free-list consistency. Allocation never returns memory outside tracked ranges. | Section 4.1 |
| KABI vtable dispatch | Vtable calls never escape the driver's isolation domain. Version checks are correct. | Section 11.1 |
| IPC ring buffer | Producer-consumer protocol never loses messages, never delivers duplicates, never deadlocks. | Section 10.6 |
| CBS bandwidth server | Bandwidth guarantees are met. No starvation. | Section 6.3.4 |
| DSM coherence protocol | Multiple-reader / single-writer consistency maintained. No lost writes. | Section 5.1.6 |
| Distributed capabilities | Signature verification is correct. Revocation propagation is complete. | Section 5.1.10 |
| Power budget enforcement | Budgets are never exceeded by more than one tick interval. | Section 6.4 |
23.10.3 Design for Verifiability
Verification readiness is a design property, not a tool. Code must be structured so that specifications can be written and verified:
// Example: capability lookup with verification-ready specification.
// Verus-style annotations (compile-time only, erased from binary).
/// Lookup a capability by handle.
///
/// SPECIFICATION (verified by Verus):
/// requires: handle is valid for calling process
/// ensures: returned capability matches the one in the capability table
/// ensures: returned capability's generation <= object's current generation
/// ensures: returned capability's permissions are a subset of the
/// delegator's permissions (no escalation)
pub fn cap_lookup(
table: &CapabilityTable,
process: ProcessId,
handle: CapHandle,
) -> Result<Capability, CapError> {
    // Implementation must satisfy the specification.
    // Verus proves this at compile time. No runtime overhead.
    unimplemented!() // body elided in this illustration
}
Design rules for verifiability:
- Explicit state: no hidden mutable global state. All state is in named structures with explicit ownership. (Rust already enforces this.)
- Small critical sections: break complex operations into small, individually verifiable steps. Each step has a pre-condition and a post-condition.
- Interface contracts: every public function in security-critical modules has a documented specification (pre/post conditions, invariants). Verus verifies these.
- Algebraic data types for states: use enums with exhaustive matching instead of integer flags. The type checker ensures all states are handled.
- Monotonic counters: generation counters and version numbers use types that enforce monotonicity (they can only increase, never decrease).
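The monotonic-counter rule can be enforced by construction with a newtype whose only mutation is an increment (Generation is an illustrative name, not the real UmkaOS type):

```rust
/// A counter that can only move forward. There is no setter and no
/// decrement, so rollback is ruled out by the type system rather than
/// by a runtime check.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Generation(u64);

impl Generation {
    pub const fn zero() -> Self {
        Generation(0)
    }

    /// The only mutation: advance to the next generation.
    pub fn bump(self) -> Self {
        Generation(self.0 + 1)
    }

    pub fn value(self) -> u64 {
        self.0
    }
}
```

Deriving Ord lets callers compare generations directly, so a stale-handle check becomes a plain `<` comparison with no escape hatch for setting the counter backwards.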
23.10.4 Verification Tooling
Primary tool: Verus (Carnegie Mellon University, VMware Research, Microsoft Research, and others). Automated verification for Rust. Specification-driven proofs of functional correctness and memory safety properties.
Alternative tools (fallback if Verus hits scale limits):
- Kani (Amazon): bounded model checking for Rust. Explores all execution paths up to a configurable bound. Excellent for concurrent code and finding edge cases. Complementary to Verus — Kani finds bugs, Verus proves the absence of bugs.
- Prusti (ETH Zurich): automated verification for Rust. Different proof strategy than Verus (separation logic vs. SMT). Useful as a cross-check.
CI integration strategy:
- Every commit: debug_assert! invariant checks + lightweight type-level assertions.
Compile-time only. Seconds. Catches regressions in verified invariants.
- Every PR: Kani bounded model checks on critical modules (~5-10 min).
Catches concurrency bugs and edge cases.
- Nightly: Full Verus specification proofs (~30-60 min for verified modules).
Mathematical proof of correctness. Any proof failure blocks the next release.
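The per-commit tier can be as cheap as invariant re-checks that compile away in release builds. A hypothetical example (the free-list invariant shown is illustrative, not UmkaOS code):

```rust
/// Toy free-list whose length must always equal the tracked count,
/// standing in for a verified invariant that debug CI re-checks cheaply.
struct FreeList {
    pages: Vec<u64>,
    count: usize,
}

impl FreeList {
    /// Re-check the invariant; compiled out entirely in release builds,
    /// so the per-commit cost is seconds, as the CI tier above requires.
    fn check_invariant(&self) {
        debug_assert!(self.pages.len() == self.count, "free-list count drift");
    }

    fn push(&mut self, page: u64) {
        self.pages.push(page);
        self.count += 1;
        self.check_invariant();
    }
}
```

The same invariant would also appear as a Verus `ensures` clause on the nightly tier; the debug_assert! is only the fast regression tripwire.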
Scope of verification — what is OUT of scope: Cross-component interactions (e.g., DSM coherence protocol interacting with hardware isolation boundaries simultaneously) are beyond current tool capabilities. Individual components are verified against their specifications; the composition is validated by integration testing and fuzzing. This is an honest limitation — complete whole-system verification remains a research problem.
Unsafe Code Verification Strategy:
Rust's unsafe blocks are the primary verification target — they are where memory safety invariants must be manually upheld. The strategy:
- Verus for ownership and invariant proofs: verify that unsafe code upholds the safety contract documented in its // SAFETY: comment. Verus can reason about pointer validity, aliasing, and lifetime guarantees.
- Kani for model-checking unsafe code paths: bounded model checking explores all possible inputs to unsafe functions up to a configurable bound, catching edge cases that specifications might miss.
- Wrap unsafe in safe abstractions: every unsafe block is encapsulated in a safe function with a verified specification. Callers never touch unsafe directly. The safe wrapper's specification becomes the verification boundary.
Verification Complexity by Component:
Based on published Verus effort data and component characteristics:
| Component | Relative Complexity | Rationale |
|---|---|---|
| Capability system (Section 8.1) | Low | Small state machine, clear invariants |
| IPC ring buffer (Section 10.6) | Low | Single producer-consumer, bounded |
| Page table management (Section 4.1) | High | Many edge cases, arch-specific |
| CBS bandwidth server (Section 6.3) | Medium | Well-studied algorithm |
| DSM coherence (Section 5.1.6) | High | Distributed protocol, concurrent access |
Page table management and DSM coherence are the hardest verification targets due to arch-specific code paths and distributed state. The capability system and IPC ring buffer are the easiest starting points for building verification expertise.
23.10.5 Performance Impact
Literally zero. Verification is compile-time. Verus specifications are erased from the binary. The verified code is identical to the unverified code at runtime.
The only cost is developer time writing specifications. But this pays for itself by eliminating bugs that would otherwise require debugging, CVE patches, and emergency releases.
23.11 KABI IDL Compiler Specification
The KABI IDL language and umka-kabi-gen tool are fully specified in
Section 11.1.7.
The roadmap deliverable is to implement umka-kabi-gen conforming to that
specification. See Section 23.2.1 for the
Phase 1 build milestone.