
Chapter 25: Agentic Development Methodology

Development model, parallel workflow, phase timelines, sensitivity analysis, recommendations

Note: This chapter describes development methodology and project planning guidance. It is not kernel specification — the kernel's behavior is defined in Chapters 1-24.


UmkaOS is developed using agentic programming — AI agents perform both design and implementation from the architecture docs. Phase timelines, hardware bottlenecks, and sensitivity analysis are project planning guidance, not definitions of kernel behavior.

25.1 Understanding the Bottleneck

25.1.1 What AI Agents Are Fast At

At 50 t/s, an AI agent can:
  • Read 30K lines of architecture docs: ~5-10 minutes (vs human: 3-5 hours)
  • Write 500 lines of Rust code: ~5-10 minutes (vs human: 2-4 hours)
  • Understand complex context: ~2-5 minutes (vs human: 20-60 minutes)
  • Generate test cases: ~2-5 minutes (vs human: 30-60 minutes)

AI speedup for pure cognitive work: 10-30x

25.1.2 What AI Agents Are NOT Fast At

The real bottlenecks in agentic development:

  1. Compilation time (hardware-bound):

    • Full cargo build --release for the UmkaOS kernel: ~15-25 minutes (300K SLOC Rust with heavy monomorphization via LLVM)
    • Incremental rebuild: ~30 seconds to 2 minutes
    • AI can't speed this up — it's CPU/disk I/O

  2. Test execution time (hardware-bound):

    • QEMU boot + run tests: ~2-5 minutes per test suite
    • Real hardware boot: ~1-2 minutes
    • Integration tests (network, distributed): ~5-15 minutes
    • AI can't speed this up — it's waiting for hardware

  3. Iteration cycles (required for bugs):

    • Average bug requires 3-5 test-fix-test cycles (even for AI)
    • Each cycle: code (2 min) + compile (5 min) + test (3 min) = 10 minutes
    • AI can reduce iteration count slightly (fewer logic bugs) but not eliminate it

  4. Real hardware testing (physics-bound):

    • Testing a WiFi driver on 10 different chipsets: ~1-2 days per chipset (firmware loading, WPA3, roaming, power save, monitor mode)
    • Suspend/resume testing: ~4 hours per laptop (1000 cycles, 10-15 seconds per cycle + failure analysis)
    • Battery life validation: ~10-15 hours per test run (actually drain the battery)
    • AI can't speed this up — you must wait for physical hardware

  5. Unknown unknowns (spec bugs):

    • The architecture had 89 documented flaws from initial reviews; ~50 remain after three rounds of architecture review and targeted fixes (individual findings are resolved in-place across the architecture documents as they are identified)
    • Implementation will find more (estimated 200-300 additional issues)
    • Each requires: discovery → spec fix → re-implementation → re-test
    • AI speeds up the fix but not the discovery

25.2 Development Model: Parallel Agentic Workflow

25.2.1 Agent Parallelization

Key advantage: Unlike humans (1-10 developers), you can run 100+ AI agents in parallel with proper coordination.

Parallelization strategy:

Phase 1.1: Core kernel (Roadmap Phase 1: Foundations)
  - Agent 1: Boot code (x86_64)
  - Agent 2: Boot code (aarch64)
  - Agent 3: Boot code (riscv64)
  - Agent 4: Memory management
  - Agent 5: Scheduler
  - Agent 6: Capabilities
  - Agent 7: IPC
  - Agent 8: RCU
  ... (20 agents in parallel)

Phase 2.1: Essential drivers (Roadmap Phase 2: Self-Hosting Shell)
  - Agent 1: NVMe driver
  - Agent 2: Intel NIC driver
  - Agent 3: Realtek NIC driver
  - Agent 4: USB core
  - Agent 5: WiFi (Intel)
  - Agent 6: WiFi (Realtek)
  ... (50 agents in parallel)

Bottleneck: Integration conflicts, shared infrastructure dependencies.

Agent coordination protocol: Agent coordination uses git branches (one feature branch per agent task), task files (.claude/ skills and plans), and project-level CLAUDE.md for shared instructions — not a formal inter-agent protocol. File-level locking (one agent per file) prevents merge conflicts. CI merge validation (all 8 architectures must pass QEMU boot + unit tests) gates every merge. Agent code review protocol: each agent's changes are reviewed by a separate agent instance before merge.
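The file-level locking rule can be stated as a small scheduling function. This is a minimal sketch, not the actual coordination tooling; the task names and file paths below are illustrative, not the real UmkaOS task graph:

```python
# File-level locking sketch: a task may start only if none of the files it
# touches are currently held by a running agent. Greedy wave scheduler.

def schedule(tasks, max_agents):
    """tasks = {task_name: set_of_files}; returns waves of concurrent tasks."""
    waves = []
    pending = dict(tasks)
    while pending:
        locked, wave = set(), []
        for name, files in list(pending.items()):
            # Start the task only if it fits in this wave and holds no
            # file another task in the wave already locked.
            if len(wave) < max_agents and not (files & locked):
                wave.append(name)
                locked |= files
                del pending[name]
        waves.append(wave)
    return waves

tasks = {
    "A1-boot-alloc": {"mm/boot_alloc.rs", "mm/mod.rs"},
    "A2-page-desc":  {"mm/page.rs", "mm/mod.rs"},   # shares mm/mod.rs with A1
    "B1-locks":      {"sync/spinlock.rs"},
    "F1-kabi":       {"tools/kabi/main.rs"},
}
print(schedule(tasks, max_agents=4))
# A2 waits for A1 because both touch mm/mod.rs; B1 and F1 run alongside A1.
```

The same rule is what makes the CI merge gate cheap: two branches that never touched the same file merge without conflict resolution.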

Realistic parallelism: Theoretical parallelism of 100+ agents per the task graph; practical sustained throughput limited to ~10-20 concurrent agents by coordination overhead, context switching, and merge conflict resolution.

25.2.2 Coordination Overhead

With N agents working in parallel:
  • Code review: each agent's code must be reviewed by another agent
  • Integration: merging N parallel branches requires conflict resolution
  • Testing: the integrated system must be tested after each merge
  • Synchronization: agents must wait for shared infrastructure (memory allocator before scheduler, etc.)

Estimated coordination overhead: ~20-30% of total time with 10-20 agents (estimate based on independent subsystem boundaries; may increase to ~40-50% for tightly-coupled subsystems with shared infrastructure dependencies).
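A toy model makes the overhead estimate concrete. The constants below are illustrative assumptions, chosen only to reproduce the ~20-30% figure cited above for 10-20 agents:

```python
# Toy throughput model: a fixed coordination cost plus a per-additional-agent
# cost. Constants are illustrative, tuned to the ~20-30% overhead estimate.

def effective_speedup(n_agents, base=0.10, per_agent=0.01):
    overhead = min(0.9, base + per_agent * (n_agents - 1))
    return n_agents * (1 - overhead), overhead

for n in (10, 20, 50):
    speedup, ovh = effective_speedup(n)
    print(f"{n:3d} agents: {speedup:5.1f}x effective throughput, {ovh:.0%} overhead")
```

The shape, not the constants, is the point: adding agents past the coordination knee buys progressively less, which is why practical parallelism saturates around 10-20.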


25.3 Phase-by-Phase Timeline (Agentic)

Note: Phase numbering uses the Chapter 24 roadmap as the primary reference. Sub-phases (e.g., Phase 2.1) correspond to agentic workflow steps within each roadmap phase. See Section 24.2 for the top-level five-phase structure.

25.3.1 Phase 1.1: Core Kernel (all 8 architectures, minimal functionality)

Scope: Boot, memory, scheduler, capabilities, syscall interface, ELF loader — sufficient to boot a statically linked hello-world ELF binary on all 8 architectures.

Exit criteria: make test passes on all 8 architectures (hello-world binary boots and prints to serial in QEMU).

Status update (2026-03-20): The architecture spec has been through 3 major review cycles (~900 findings processed, ~400 spec fixes applied). Spec coverage for Phase 1 subsystems is ~95% — struct definitions, pseudocode, per-arch tables, memory ordering annotations, error paths, and lock hierarchies are all explicit. Boot scaffolding (entry assembly, serial drivers, linker scripts) exists for all 8 architectures. This fundamentally changes the agentic development model: the implementation task is spec translation, not design.

25.3.1.1 What Already Exists

  • Boot entry assembly for all 8 architectures (Multiboot1/2, DTB, SBI, IPL, SLOF)
  • Serial drivers for all 8 architectures (COM1, PL011, 16550, SCLP, NS16550)
  • Linker scripts for all 8 architectures
  • umka-core skeleton with cap, ipc, phys modules (~850 LOC, most needs rewrite)
  • umka-driver-sdk types and ring buffer definitions
  • Build system (make build/test/run for all architectures)

25.3.1.2 Spec Readiness by Subsystem

| Stream | Subsystems | Spec LOC | Spec Detail | Implementation LOC |
|---|---|---|---|---|
| A: Memory | Boot alloc → buddy → slab → heap | ~13K | Pseudocode for alloc/free/coalesce, zone model, PCP magazines, GfpFlags, watermarks | ~4K |
| B: Concurrency | Locks → CpuLocal → RCU → IRQ → workqueue | ~12K | Per-arch register tables, lock state machines, ordering annotations, BoundedMpmcRing | ~4K |
| C: Scheduling | EEVDF → timekeeping | ~10K | Augmented RB-tree walk, weight table, vDSO seqlock, per-arch clock sources | ~3.5K |
| D: Security | Capabilities → isolation domains | ~7K | XArray-based CapSpace, delegation protocol, per-arch isolation (MPK/POE/DACR) | ~2K |
| E: Boot/HW | ACPI/DTB → features → SMP → RNG → clocks | ~10K | Per-arch init sequences, fan-out tree, feature flag tables | ~3.5K |
| F: KABI/Syscall | KABI compiler → dispatch → ELF loader | ~6K | IDL grammar, dispatch table, ELF PT_LOAD walk, user stack layout | ~2.3K |
| G: Tests | Host unit tests → in-kernel harness | ~2K | Test patterns, KTest macro, assertion protocol | ~2.5K |
| Z: Core0/Core1 | Image split → boot rewrite | ~5K | Linker sections, vtable population, canonical boot phase table | ~1.1K |

Total: ~65K lines of reviewed spec → ~23K lines of production Rust. Spec-to-code ratio: ~3:1 (expected when spec includes pseudocode).

25.3.1.3 Agent Batching (8 Streams, 7 Sequential Layers)

Phase 1 has 31 subsystems organized into 8 parallel streams (A-G, Z). The critical path is the memory stream: boot_alloc → vmemmap → buddy → slab → heap. All other streams can begin in parallel once their dependencies from the memory stream are met.

Effective parallelism: 7-8 concurrent agents (limited by dependency graph, not agent availability). The KABI compiler (F1) has zero kernel dependencies and can run from the first batch.

Layer 0 (parallel, no deps):
  F1: KABI compiler (host tool)        — independent
  Z2: Boot sequence rewrite            — independent (restructure main.rs)
  E4: CPU feature detection            — independent (reads CPUID/ID regs)
  E5: Hardware RNG                     — independent (reads RDRAND/RNDR)

Layer 1 (after Layer 0):
  A1: Boot allocator                   — needs Z2 (boot flow)
  B1: Locking primitives               — needs E4 (arch features)
  E1: ACPI parsing (x86)               — needs A1 (memory for tables)
  E2: DTB parsing (non-x86)            — needs A1

Layer 2 (after Layer 1):
  A2: Page descriptor + vmemmap        — needs A1
  B2: CpuLocal + PerCpu                — needs B1
  E3: Clock framework                  — needs E2/E1

Layer 3 (after Layer 2):
  A3: Buddy allocator                  — needs A1, A2
  B3: RCU                              — needs B1, B2
  B4: IRQ domain hierarchy             — needs B1, E4
  E6: SMP bringup                      — needs B2, E4

Layer 4 (after Layer 3):
  A4: Slab allocator                   — needs A3, B2
  B5: Workqueues                       — needs B1, B3
  C2: Timekeeping                      — needs B4, E3

Layer 5 (after Layer 4):
  A5: Heap allocator bridge            — needs A4
  C1: EEVDF scheduler                  — needs A4, B2, B5, C2
  D1: Capability system                — needs A4, B1
  D2: Isolation domain infra           — needs E4, B1

Layer 6 (convergence):
  F2: Syscall dispatch (3 syscalls)    — needs C1, D1
  F3: ELF loader (static only)        — needs A5, C1
  Z1: Core0/Core1 boundary             — needs A1
  G1: Host unit tests                  — incremental throughout
  G2: In-kernel test harness           — needs F2

Critical path: Z2 → A1 → A2 → A3 → A4 → C1 → F2 (7 layers; F3 depends on A5 and C1 and completes in the same final layer). Each layer's wall-clock time is dominated by the largest subsystem in that layer.
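The layer count and critical path can be recomputed mechanically from the dependency lists above. The sketch below encodes that subset of the graph (G1/G2 omitted since the test stream is incremental) and derives the longest chain by longest-path depth:

```python
# Recompute the Phase 1 critical path from the stated dependencies.
# Longest-path depth + 1 gives the number of sequential layers.
from functools import lru_cache

DEPS = {
    "F1": [], "Z2": [], "E4": [], "E5": [],
    "A1": ["Z2"], "B1": ["E4"], "E1": ["A1"], "E2": ["A1"],
    "A2": ["A1"], "B2": ["B1"], "E3": ["E2", "E1"],
    "A3": ["A1", "A2"], "B3": ["B1", "B2"], "B4": ["B1", "E4"], "E6": ["B2", "E4"],
    "A4": ["A3", "B2"], "B5": ["B1", "B3"], "C2": ["B4", "E3"],
    "A5": ["A4"], "C1": ["A4", "B2", "B5", "C2"], "D1": ["A4", "B1"], "D2": ["E4", "B1"],
    "F2": ["C1", "D1"], "F3": ["A5", "C1"], "Z1": ["A1"],
}

@lru_cache(maxsize=None)
def depth(node):
    return 0 if not DEPS[node] else 1 + max(depth(d) for d in DEPS[node])

def chain(node):
    """Walk back through a depth-maximizing predecessor at each step."""
    if not DEPS[node]:
        return [node]
    return chain(max(DEPS[node], key=depth)) + [node]

print(chain("F2"))                     # ['Z2', 'A1', 'A2', 'A3', 'A4', 'C1', 'F2']
print(1 + depth("F2"), "sequential layers")
```

This reproduces the 7-layer figure and confirms the memory stream (A1-A4) dominates the schedule.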

25.3.1.4 Key Differences From Prior Estimate

  1. Phase 1.1 and 1.2 are merged. The prior estimate split Phase 1 into x86-only (1.1) then multi-arch port (1.2). This was based on the assumption that multi-arch code would need separate design work. With the spec now containing per-arch tables for every subsystem (register mappings, init sequences, feature flags, barrier instructions), multi-arch support is a compile-time configuration, not a design phase. All 8 architectures are implemented from the start.

  2. Integration testing is NOT the bottleneck. The spec review cycles eliminated most design ambiguities. Integration bugs will be spec-implementation mismatches (caught by unit tests) rather than design-level incompatibilities (which required redesign in the old model).

  3. IPC is not in Phase 1. The original estimate included IPC as a Phase 1 subsystem. Per the current roadmap (Section 24.2), full IPC is Phase 2. Phase 1 needs only the syscall dispatch path (write/execve/exit_group) — no inter-process communication.

  4. Boot code already exists. Entry assembly, serial drivers, and linker scripts for all 8 architectures are implemented and tested. The boot sequence rewrite (Z2) restructures the existing code to follow the canonical phase table, not writes it from scratch.

25.3.1.5 Risk Assessment

High risk (most likely to need iteration):
  • EEVDF augmented RB-tree (C1) — algorithm complexity, subtle invariants
  • BoundedMpmcRing (B5) — lock-free CAS with per-arch barriers
  • Buddy allocator zone model (A3) — largest Phase 1 subsystem
  • Context switch (within C1) — 8 different architectures, register conventions

Medium risk:
  • Capability system (D1) — security-critical, XArray integration
  • SMP bringup (E6) — per-arch IPI mechanisms, timing-sensitive

Low risk (mechanical translation):
  • Boot allocator (A1) — bump allocator, minimal state
  • Page descriptor (A2) — pure data structure
  • Syscall dispatch (F2) — 3 syscalls, trivial table
  • ELF loader (F3) — static binaries only, PT_LOAD walk
  • KABI compiler (F1) — host tool, no kernel dependency

25.3.2 Phase 2.1: Essential Drivers (NVMe, NIC, USB, I/O)

Scope: NVMe, Intel NIC, USB core, serial, framebuffer

Human estimate: 6-9 months
Agent estimate:

| Driver | Agent Work | Hardware Testing | Debug Cycles | Real Time |
|---|---|---|---|---|
| NVMe | 8 hours | 10 hours | 8x | 7 days |
| Intel e1000e NIC | 6 hours | 8 hours | 6x | 5 days |
| USB core | 12 hours | 15 hours | 10x | 10 days |
| USB HID | 4 hours | 5 hours | 4x | 3 days |
| Framebuffer (full per §21.4 DRM/KMS) | 3 hours | 4 hours | 3x | 2 days |
| Serial (all arches) | 2 hours | 3 hours | 2x | 1 day |

With 6 agents in parallel:
  • Wall clock time: ~10 days (2 weeks)
  • Bottleneck: real hardware testing (need NVMe drives, NICs, USB devices)

25.3.3 Phase 2.2: Linux Compatibility Layer

Scope: 330 syscalls, eBPF verifier (ext4 is in Phase 3.1, not here — Phase 2 uses tmpfs/initramfs)

Human estimate: 9-12 months (eBPF verifier alone is 6+ months)
Agent estimate:

| Component | Agent Work | Testing | Iterations | Real Time |
|---|---|---|---|---|
| Syscall dispatch | 4 hours | 6 hours | 4x | 3 days |
| File I/O syscalls (50) | 15 hours | 20 hours | 10x | 12 days |
| Process/thread (40) | 12 hours | 18 hours | 8x | 10 days |
| Memory syscalls (30) | 10 hours | 15 hours | 8x | 8 days |
| Network syscalls (50) | 15 hours | 20 hours | 10x | 12 days |
| Misc syscalls (160) | 30 hours | 40 hours | 15x | 20 days |
| eBPF verifier | 40 hours | 60 hours | 30x | TBD (see caveat below) |

(ext4 moved to Phase 3.1 — implemented once, completely)

Scope caveat (eBPF verifier): The eBPF verifier is implemented as a complete subsystem per Section 19.2: all program types (socket filter, XDP, tc, kprobe, tracepoint, cgroup, LSM, struct_ops), all map types, full abstract interpretation with the complete RegState/RegType type system. This is not a "socket filters only" partial implementation — the verifier is one subsystem and ships complete. Linux's verifier.c is ~23K SLOC (v6.12) with a decade of security hardening and dozens of CVE-driven fixes. Reaching equivalent security coverage will require sustained fuzzing campaigns (see Section 24.3). The eBPF verifier is one of the most complex single subsystems (comparable in complexity to a compiler backend) and its real-time estimate is not meaningfully reducible to a day count.

With 10 agents in parallel (syscall groups can be independent):
  • Wall clock time: dominated by the eBPF verifier (the most complex single subsystem — comparable in complexity to a compiler backend)
  • Bottleneck: eBPF verifier complexity (Linux's verifier.c is ~23K SLOC with a decade of security hardening; even AI needs extensive iteration and security fuzzing)
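To make "full abstract interpretation" concrete, here is a deliberately tiny illustration of the verifier's core idea — tracking value ranges per register and rejecting memory accesses that are not provably in bounds. This is a didactic fragment, not the Section 19.2 RegState/RegType system:

```python
# Toy abstract interpretation: each register carries [min, max] scalar
# bounds; a load is admitted only if the bounds prove it stays in-buffer.
from dataclasses import dataclass

@dataclass
class RegState:
    min: int   # inclusive lower bound
    max: int   # inclusive upper bound

def verify_load(index: RegState, buf_len: int, access_size: int) -> bool:
    """Allow buf[index .. index+access_size) only if provably in bounds."""
    return index.min >= 0 and index.max + access_size <= buf_len

# r1 = packet offset, narrowed to [0, 60] by a preceding bounds check
r1 = RegState(0, 60)
print(verify_load(r1, buf_len=64, access_size=4))   # True:  60 + 4 <= 64
print(verify_load(r1, buf_len=64, access_size=8))   # False: 60 + 8 >  64
```

The real verifier does this across all paths, with pointer provenance, alignment, and speculation constraints — which is why its iteration count dwarfs every other subsystem.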

25.3.4 Phase 2.3: Networking Stack

Scope: TCP/IP, UDP, routing, netfilter, WiFi subsystem

Human estimate: 6-9 months
Agent estimate:

| Component | Agent Work | Testing | Iterations | Real Time |
|---|---|---|---|---|
| Ethernet layer | 6 hours | 10 hours | 6x | 5 days |
| IPv4/IPv6 stack | 15 hours | 25 hours | 12x | 15 days |
| TCP | 20 hours | 35 hours | 15x | 20 days |
| UDP | 8 hours | 12 hours | 6x | 6 days |
| Routing | 10 hours | 15 hours | 8x | 8 days |
| Netfilter/firewall | 12 hours | 18 hours | 10x | 10 days |
| WiFi subsystem | 15 hours | 25 hours | 12x | 15 days |

Scope caveat (TCP): TCP is implemented as a complete, conformant stack per Section 16.1: full state machine, SACK, congestion control (Reno + CUBIC — each a complete pluggable module per §16.4), and ECN. MPTCP multi-path, BBRv2, and TCP-AO are separate subsystems added in Phase 3.2+ — they do not extend the core TCP implementation, they plug into it. The core TCP stack is complete at Phase 2.3 exit.

With 7 agents in parallel:
  • Wall clock time: ~20 days (3 weeks) (core TCP complete; MPTCP/BBRv2 are separate Phase 3.2 subsystems)
  • Bottleneck: TCP complexity, WiFi driver integration
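The "plug into it, don't extend it" claim for congestion control can be sketched as a module registry. The interface and names below are illustrative assumptions, not the §16.4 API:

```python
# Pluggable congestion-control sketch: modules register by name; the core
# TCP stack looks them up per-socket and never special-cases any algorithm.

CC_MODULES = {}

def register_cc(name):
    def wrap(cls):
        CC_MODULES[name] = cls
        return cls
    return wrap

@register_cc("reno")
class Reno:
    def on_ack(self, cwnd, ssthresh):
        # Slow start below ssthresh, then linear congestion avoidance.
        return cwnd * 2 if cwnd < ssthresh else cwnd + 1

@register_cc("cubic")
class Cubic:
    def on_ack(self, cwnd, ssthresh):
        return cwnd + 3   # placeholder; real CUBIC grows along a cubic curve

def cc_for_socket(name="cubic"):
    return CC_MODULES[name]()

print(cc_for_socket("reno").on_ack(cwnd=4, ssthresh=16))   # 8: slow start doubles
```

Under this shape, adding BBRv2 in Phase 3.2+ is one more `register_cc` entry; the core state machine is untouched.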

25.3.5 Phase 3.1: Storage Stack (VFS, filesystems, DM/MD)

Scope: VFS layer, ext4, XFS, Btrfs core, device mapper

Human estimate: 6-9 months
Agent estimate:

| Component | Agent Work | Testing | Iterations | Real Time |
|---|---|---|---|---|
| VFS layer | 20 hours | 30 hours | 15x | 20 days |
| Page cache | 12 hours | 18 hours | 10x | 10 days |
| ext4 (full) | 30 hours | 45 hours | 20x | TBD (see caveat below) |
| XFS | 25 hours | 40 hours | 18x | 25 days |
| Btrfs core (COW, subvols, snaps, checksums) | 35 hours | 50 hours | 22x | TBD (see caveat below) |
| Device mapper (DM) | 15 hours | 25 hours | 12x | 15 days |
| MD RAID | 12 hours | 20 hours | 10x | 12 days |

Scope caveat (filesystems): "ext4 (full)" means feature-complete for the on-disk format (extents, journaling, inline data, encryption hooks) but not bug-for-bug compatibility with Linux's fs/ext4/ (~50-70K SLOC including jbd2). "Btrfs core" means COW B-tree, subvolumes, snapshots, and checksums — a complete subsystem per Section 15.8. Btrfs RAID5/6, send/receive, and deduplication are separate subsystems added in Phase 4+ — they plug into the Btrfs core but do not modify it. These estimates assume the filesystem trait API is stable; if significant VFS redesign is needed, add 50-100% contingency.

With 7 agents in parallel:
  • Wall clock time: ~35 days (5 weeks) (all filesystems complete per spec; Btrfs RAID/dedup are separate Phase 4 subsystems)
  • Bottleneck: filesystem complexity (ext4, Btrfs are massive)

25.3.6 Phase 3.2: Advanced Features (Distributed, Observability, Power)

Scope: DSM, DLM, FMA, power budgeting, live evolution

Human estimate: 9-12 months
Agent estimate:

| Component | Agent Work | Testing | Iterations | Real Time |
|---|---|---|---|---|
| DSM (distributed shared memory) | 40 hours | 60 hours | 25x | 40 days |
| DLM (distributed lock manager) | 35 hours | 50 hours | 20x | 35 days |
| RDMA integration | 20 hours | 30 hours | 15x | 20 days |
| Cluster membership | 15 hours | 25 hours | 12x | 15 days |
| FMA (telemetry) | 12 hours | 18 hours | 10x | 10 days |
| Power budgeting | 18 hours | 28 hours | 14x | 18 days |
| Live kernel evolution | 25 hours | 40 hours | 18x | 25 days |
| Observability (umkafs) | 15 hours | 22 hours | 12x | 12 days |

With 8 agents in parallel:
  • Wall clock time: ~40 days (6 weeks)
  • Bottleneck: distributed systems testing (need cluster hardware)

25.3.7 Phase 4.1: Consumer Hardware (WiFi, Bluetooth, Audio, Graphics)

Scope: WiFi drivers (5 chipsets), Bluetooth, audio, touchpad, suspend/resume

Human estimate: 12-18 months (hardware compatibility is painful)
Agent estimate:

| Component | Agent Work | Hardware Testing | Iterations | Real Time |
|---|---|---|---|---|
| WiFi driver (Intel) | 12 hours | 20 hours | 12x | 15 days |
| WiFi driver (Realtek) | 12 hours | 20 hours | 12x | 15 days |
| WiFi driver (Qualcomm) | 12 hours | 20 hours | 12x | 15 days |
| Bluetooth stack | 15 hours | 25 hours | 15x | 20 days |
| Audio (Intel HDA) | 10 hours | 15 hours | 10x | 10 days |
| Touchpad (I2C-HID) | 8 hours | 12 hours | 8x | 8 days |
| Graphics (i915 modesetting + display) | 20 hours | 30 hours | 18x | 25 days |
| S3 suspend/resume | 15 hours | 40 hours | 20x | 30 days |
| Power management UX | 10 hours | 15 hours | 10x | 10 days |

Scope caveat (i915): "i915 modesetting + display" is a complete subsystem: modesetting, framebuffer, and display output for Gen 9+ (Skylake and later) per Section 21.5. GPU compute (OpenCL/Vulkan) is a separate subsystem requiring the accelerator framework from Section 22.1 — added in Phase 5+, not an extension of the display driver.

With 9 agents in parallel:
  • Wall clock time: ~30 days (4 weeks) (modesetting only; see scope caveat)
  • Bottleneck: suspend/resume testing (need real laptops, slow iteration)

25.3.8 Phase 5.1: Windows Emulation Acceleration (WEA)

Scope: NT object manager, IOCP, memory management, SEH

Human estimate: 12-15 months
Agent estimate:

| Component | Agent Work | Testing | Iterations | Real Time |
|---|---|---|---|---|
| NT object manager | 15 hours | 25 hours | 15x | 20 days |
| Synchronization (wait) | 12 hours | 20 hours | 12x | 15 days |
| IOCP | 18 hours | 30 hours | 18x | 25 days |
| Memory (VirtualAlloc) | 10 hours | 18 hours | 10x | 12 days |
| Thread model (TEB, APC) | 12 hours | 20 hours | 12x | 15 days |
| Security tokens | 8 hours | 12 hours | 8x | 8 days |
| SEH support | 15 hours | 25 hours | 15x | 20 days |
| WINE integration | 10 hours | 30 hours | 15x | 20 days |

With 8 agents in parallel:
  • Wall clock time: ~25 days (3.5 weeks)
  • Bottleneck: WINE testing (need many games, slow iteration)


25.4 Total Timeline (Sequential Phases)

Note: Superseded by realistic-full-timeline.md. Phase 1.1 and 1.2 have been merged per implementation-phases.md.

If phases are done sequentially (each phase depends on previous):

| Phase | Human Estimate | Agentic Estimate (10-20 agents) |
|---|---|---|
| Phase 1: Core kernel + multi-arch | 2-3 months | 3-4 weeks |
| Phase 2.1: Essential drivers | 6-9 months | 2 weeks |
| Phase 2.2: Linux compat | 9-12 months | 8-12 weeks |
| Phase 2.3: Networking | 6-9 months | 3 weeks |
| Phase 3.1: Storage | 6-9 months | 5 weeks |
| Phase 3.2: Advanced features | 9-12 months | 6 weeks |
| Phase 4.1: Consumer hardware | 12-18 months | 4 weeks |
| Phase 5.1: WEA | 12-15 months | 3.5 weeks |
| TOTAL (sequential) | 5-7 years | ~36-42 weeks (~9-10 months) |

But many phases can overlap!


25.5 Total Timeline (Optimized Parallelism)

Key insight: After Phase 1.1 (core kernel), many subsystems are independent:
  • Drivers (Phase 2.1) can start immediately after Phase 1.1
  • Networking (Phase 2.3) can start after essential drivers
  • Storage (Phase 3.1) can start after essential drivers
  • Advanced features (Phase 3.2) can start after Phase 2.2 (syscall layer)
  • Consumer hardware (Phase 4.1) can start after Phase 2.1 (USB core)
  • WEA (Phase 5.1) can start after Phase 2.2 (syscall layer)

Critical path (longest dependency chain):
  1. Phase 1.1: Core kernel + multi-arch (5 weeks)
  2. Phase 2.2: Linux compat (8-12 weeks) — depends on Phase 1.1; dominated by eBPF verifier
  3. Phase 3.2: Advanced features (6 weeks) — depends on Phase 2.2

Critical path total: 19-23 weeks (best case assumes the eBPF verifier at the lower bound)

Parallel work (can happen alongside the critical path):
  • Phase 2.1 (drivers) starts at week 3, finishes week 5
  • Phase 2.3 (networking) starts at week 5, finishes week 8
  • Phase 3.1 (storage) starts at week 5, finishes week 10
  • Phase 4.1 (consumer) starts at week 5, finishes week 9
  • Phase 5.1 (WEA) starts at week 8, finishes week 11.5

Optimized timeline with smart parallelization:

Week 0-5:   Phase 1.1 (Core kernel + multi-arch) [critical path]
            Phase 2.1 (Drivers) [parallel, starts week 3]
Week 5-17:  Phase 2.2 (Linux compat) [critical path, 8-12 weeks]
            Phase 2.3 (Networking) [parallel, weeks 5-8]
            Phase 3.1 (Storage) [parallel, weeks 5-10]
            Phase 4.1 (Consumer) [parallel, weeks 5-9]
Week 17-23: Phase 3.2 (Advanced) [critical path]
            Phase 5.1 (WEA) [parallel, weeks 8-11.5]
Week 23-27: Integration, testing, bug fixes

Total optimized timeline: ~27 weeks (~7 months) (best case; eBPF verifier complexity may extend this)
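The overlapped schedule above can be recomputed from the phase durations and dependencies. Durations are taken from this chapter's tables (eBPF at the 12-week upper bound); the explicit start offsets mirror the chart, and the dependency shape is an illustrative simplification:

```python
# Recompute the optimized timeline: each phase starts at the max finish
# time of its prerequisites (or its stated earliest start). Weeks.

PHASES = {                   # name: (duration_weeks, prerequisites)
    "1.1": (5,   []),
    "2.1": (2,   []),        # overlaps 1.1, starts week 3 (see START)
    "2.2": (12,  ["1.1"]),
    "2.3": (3,   ["2.1"]),
    "3.1": (5,   ["2.1"]),
    "4.1": (4,   ["2.1"]),
    "3.2": (6,   ["2.2"]),
    "5.1": (3.5, ["2.1"]),
}
START = {"2.1": 3, "5.1": 8}   # explicit start offsets from the chart

finish = {}
def done(p):
    if p not in finish:
        dur, pre = PHASES[p]
        start = max([START.get(p, 0)] + [done(q) for q in pre])
        finish[p] = start + dur
    return finish[p]

last = max(done(p) for p in PHASES)
print(f"critical path ends week {last}; +4 weeks integration = week {last + 4}")
```

This reproduces the week-23 critical-path finish (1.1 → 2.2 → 3.2) and the ~27-week total after integration.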


25.6 What About Spec Bugs?

~50 remaining documented flaws (down from 89 after three review rounds) + estimated 200-300 more undiscovered = ~250-350 spec bugs.

Per-bug handling:
  1. Discovery during implementation: ~10-30 minutes (test fails, agent analyzes)
  2. Spec fix: ~30-60 minutes (human architect or agent)
  3. Re-implementation: ~30-120 minutes (agent rewrites affected code)
  4. Re-testing: ~10-30 minutes (compile + test)

Average: ~2-4 hours per bug

300 bugs × 3 hours average = 900 hours = ~37 days with 1 agent

But many bugs can be fixed in parallel (different subsystems):
  • With 10 agents handling bugs in parallel: ~4 days
  • Spread across 5 months: absorbed into iteration cycles
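The arithmetic behind these figures, for the record (note the ~37-day figure assumes agents run around the clock, 24 hours/day):

```python
# Spec-bug budget arithmetic: ~300 bugs at ~3 hours each, divided across
# agents working in parallel on independent subsystems.
bugs, hours_per_bug, agents = 300, 3, 10
total_hours = bugs * hours_per_bug          # 900 agent-hours
days_one_agent = total_hours / 24           # agents run 24h/day
days_parallel = days_one_agent / agents
print(total_hours, days_one_agent, days_parallel)
```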

Impact on timeline: Spec bugs already accounted for in the "iterations" column above. The iteration counts (3-25x) include discovering and fixing spec bugs.


25.7 Hardware Bottlenecks

25.7.1 Real Hardware Testing Requirements

Cannot be parallelized beyond physical hardware availability:

  1. Suspend/resume testing (Phase 4.1):

    • Need: 10 different laptop models
    • Test: 1000 cycles per laptop
    • Time: ~4 hours per laptop (even if fully automated)
    • Total: ~40 hours (2 days) minimum
  2. Battery life validation (Phase 4.1):

    • Need: 5 laptop models
    • Test: Full discharge cycle
    • Time: ~10-15 hours per laptop
    • Total: ~60 hours (3 days) minimum
  3. WiFi compatibility testing (Phase 4.1):

    • Need: 10 different WiFi chipsets
    • Test: Connect, transfer, disconnect, repeat
    • Time: ~1-2 days per chipset (firmware, WPA3, roaming, power save, monitor mode)
    • Total: ~10-20 days (2-4 weeks)
  4. Multi-GPU testing (Phase 3.2):

    • Need: 5 different GPU models
    • Test: P2P transfers, workload distribution
    • Time: ~4 hours per GPU
    • Total: ~20 hours (1 day)
  5. Cluster testing (Phase 3.2):

    • Need: 8-16 node cluster with RDMA
    • Test: DSM, DLM, membership, failover
    • Time: ~40-60 hours (multiple days)
    • Total: ~3-5 days

Hardware testing adds: ~2-3 weeks to timeline (but overlaps with development).

25.7.2 Specialized Hardware Acquisition

Before development can start, the following must be acquired:
  • 10+ laptop models (Intel, AMD, ARM)
  • 20+ WiFi/Bluetooth adapters
  • 10+ NVMe drives (different vendors)
  • 5+ GPUs (NVIDIA, AMD, Intel)
  • 8-16 node RDMA cluster
  • Touchpads, touchscreens, webcams, audio devices

Procurement time: ~2-4 weeks
Cost: $150,000-350,000 for full hardware lab (8-16 dual-socket servers with RDMA NICs, switches, and infrastructure)


25.8 QEMU CPU Feature Testing Matrix

Target: QEMU 10.2.1. Defines the CPU configurations tested per architecture to ensure UmkaOS handles feature presence/absence correctly at boot and runtime.

25.8.1 Design Principles

  1. Every feature-dependent code path must have a test configuration where that feature is absent. If UmkaOS has if cpu_has_X { fast_path } else { fallback }, both branches must be exercised.

  2. One "minimal" config per arch — the weakest CPU the architecture must boot on. Tests fallback paths, graceful degradation, no-SIMD paths, software TLB flush, etc.

  3. One "maximal" config per arch — everything on. Tests feature detection, fast paths, and that no feature combination triggers unexpected interactions.

  4. Feature-targeted configs — isolate specific features UmkaOS cares about (driver isolation, SIMD dispatch, MMU modes, crypto acceleration).

  5. Profiles map to real hardware generations where possible — not synthetic combinations that no silicon ever shipped.
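Principles 1-3 amount to keeping the matrix as data and generating uniform invocations from it. The sketch below shows that shape; profile names match the tables in this section, but the generator itself (and its simplification of machine options, e.g. omitting mte=on) is illustrative:

```python
# Minimal/maximal profile matrix as data, with a uniform command generator.
# Machine options are simplified; e.g. the real aarch64 max profile also
# needs -M virt,mte=on (see the AArch64 section).

PROFILES = {
    "x86_64":  {"minimal": "Haswell",    "maximal": "max"},
    "aarch64": {"minimal": "cortex-a53", "maximal": "max"},
    "riscv64": {"minimal": "sifive-u54", "maximal": "max"},
}
MACHINE = {"x86_64": "", "aarch64": "-M virt", "riscv64": "-M virt"}

def qemu_cmd(arch, profile):
    cpu = PROFILES[arch][profile]
    parts = [f"qemu-system-{arch}", MACHINE[arch], f"-cpu {cpu}",
             "-serial stdio -display none -no-reboot -m 256M"]
    return " ".join(p for p in parts if p)

for arch in PROFILES:
    print(qemu_cmd(arch, "minimal"))
```

Keeping the matrix in one table makes principle 1 auditable: for every `cpu_has_X` branch in the kernel, grep the table for a profile where X is absent.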


25.8.2 x86-64

QEMU binary: qemu-system-x86_64. Machine: QEMU default (no -M flag; profiles are selected via -cpu alone).

Features tested: PKU (Tier 1 isolation), AVX2/AVX-512 (SIMD dispatch), SHA-NI (crypto acceleration), SMAP/SMEP (kernel hardening), PCID/INVPCID (TLB management), UMIP (userspace instruction prevention), LA57 (5-level paging).

| Profile | -cpu Argument | Real HW Equivalent | Key Features | Tests |
|---|---|---|---|---|
| x86-minimal | Haswell | Intel 4th gen (2013) | AVX2, PCID. No PKU, no AVX-512, no SMAP, no SHA-NI | Fallback isolation (no MPK), no SMAP guard, software crypto |
| x86-broadwell | Broadwell | Intel 5th gen (2015) | +SMAP over Haswell. Still no PKU | SMAP enforcement, still no MPK |
| x86-skylake | Skylake-Server | Intel Xeon Scalable 1st gen (2017) | +PKU, +AVX-512F/BW/CD/DQ/VL | MPK Tier 1 isolation, AVX-512 SIMD dispatch |
| x86-icelake | Icelake-Server | Intel 10th gen server (2019) | +SHA-NI, +UMIP, +LA57, +AVX-512-VNNI | 5-level paging, SHA-NI crypto, UMIP |
| x86-sapphire | SapphireRapids | Intel 4th gen Xeon (2023) | +AMX, +AVX-512-FP16, +SERIALIZE | Full Intel feature set |
| x86-epyc-v1 | EPYC-v1 | AMD EPYC Naples (2017) | No PCID, no INVPCID, no PKU. Has SHA-NI | AMD fallback: no PCID TLB, no MPK, AMD crypto |
| x86-epyc-genoa | EPYC-Genoa | AMD EPYC 4th gen (2022) | Full: PKU, AVX-512, SHA-NI, PCID, LA57 | Full AMD feature set |
| x86-max | max | Synthetic (all features) | Everything QEMU can emulate | Maximum coverage, feature interaction testing |

Feature toggle examples (apply to any base model):

# Skylake without PKU (test MPK fallback on otherwise modern CPU)
-cpu Skylake-Server,-pku

# Haswell with PKU added (test MPK on older baseline)
-cpu Haswell,+pku

# Icelake without LA57 (test 4-level paging on modern CPU)
-cpu Icelake-Server,-la57

# EPYC-v1 with PCID added (test PCID on AMD)
-cpu EPYC-v1,+pcid,+invpcid

Concrete QEMU command (example: x86-skylake):

qemu-system-x86_64 \
    -cdrom target/umka-kernel.iso \
    -serial stdio -display none -no-reboot -m 256M \
    -cpu Skylake-Server


25.8.3 AArch64

QEMU binary: qemu-system-aarch64. Machine: -M virt.

Features tested: MTE (memory tagging), SVE (scalable vectors), PAuth (pointer authentication), BTI (branch target identification), LSE/LSE2 (atomics), GICv3 (interrupt controller). POE (FEAT_S1POE) is NOT emulated by QEMU 10.2.1 — Tier 1 POE-based driver isolation cannot be tested; only page-table+ASID fallback is testable.

| Profile | -cpu Argument | Machine Options | Real HW Equivalent | Key Features | Tests |
|---|---|---|---|---|---|
| arm64-minimal | cortex-a53 | -M virt | Raspberry Pi 3/4, many SoCs | ARMv8.0-A only. No MTE, no SVE, no PAuth, no BTI, no LSE2 | All fallback paths |
| arm64-a72 | cortex-a72 | -M virt | AWS Graviton 1, many server SoCs | ARMv8.0-A. No MTE/SVE/PAuth/BTI. LSE (v8.1 atomics via QEMU) | Current default. Baseline server |
| arm64-a76 | cortex-a76 | -M virt | Graviton 2 era, mobile flagship | ARMv8.2-A. +LSE. No MTE/SVE/BTI | LSE atomics, DotProd |
| arm64-n2 | neoverse-n2 | -M virt,mte=on | Graviton 3, Ampere Altra Max | ARMv9.0-A. +MTE2, +SVE (128-bit), +PAuth, +BTI | MTE tagging, SVE dispatch, PAuth, BTI |
| arm64-a710 | cortex-a710 | -M virt,mte=on | Mobile ARMv9 (Cortex-X2 era) | ARMv9.0-A. +MTE2, +SVE2, +BTI, +PAuth | SVE2, mobile-class ARMv9 |
| arm64-max | max | -M virt,mte=on | Synthetic (all features) | MTE3, SVE (max VL), PAuth2, BTI, RME, everything | Maximum coverage |

Feature toggle notes (AArch64 is more restrictive than x86):

# SVE can be toggled on most models:
-cpu neoverse-n2,sve=off

# MTE requires BOTH machine and CPU support:
-M virt,mte=on -cpu neoverse-n2    # MTE on
-M virt         -cpu neoverse-n2    # MTE off (machine doesn't enable it)

# PAuth can be toggled:
-cpu neoverse-n2,pauth=off

# SVE vector length can be set:
-cpu max,sve=on,sve128=on,sve256=on,sve512=off

POE testing gap: FEAT_S1POE (Permission Overlay Extension, ARMv8.9-A / ARMv9.4-A) is not implemented in any QEMU version as of 10.2.1. UmkaOS AArch64 Tier 1 driver isolation falls back to page-table+ASID switching (~150-300 cycles) when POE is absent. This fallback path IS tested (all profiles except a hypothetical future POE-enabled one exercise it). POE fast-path testing requires real hardware (e.g., ARM Neoverse V3, Apple M4).
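The gap forces a runtime mechanism selection: POE on hardware that reports it, page-table+ASID switching everywhere else. The dispatch shape below is an illustrative sketch (feature names and the fallback cost come from the text):

```python
# Isolation-mechanism selection sketch: prefer FEAT_S1POE when the CPU
# reports it; otherwise fall back to page-table+ASID switching, which is
# the only path exercisable under QEMU 10.2.1.

def select_isolation(features: set) -> str:
    if "S1POE" in features:
        return "poe"                          # fast path: real HW only (e.g. Neoverse V3, Apple M4)
    return "pt_asid (~150-300 cycles)"        # fallback: tested under QEMU

print(select_isolation({"MTE2", "SVE"}))      # QEMU neoverse-n2 profile: fallback
print(select_isolation({"S1POE", "MTE2"}))    # future POE-capable hardware
```

Because every QEMU profile lacks S1POE, CI coverage of the POE branch itself must come from unit tests with the feature flag injected, plus eventual real-hardware runs.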

Concrete QEMU command (example: arm64-n2):

qemu-system-aarch64 \
    -M virt,mte=on \
    -cpu neoverse-n2 \
    -serial stdio -display none -no-reboot -m 256M \
    -kernel target/aarch64-unknown-none/release/umka-kernel


25.8.4 ARMv7

QEMU binary: qemu-system-arm. Machine: -M vexpress-a15 (primary), -M virt (for alternative CPUs).

Features tested: DACR (Tier 1 isolation — all ARMv7-A CPUs have this), LPAE (large physical address), VFPv4/NEON (floating point/SIMD), Thumb-2.

| Profile | -cpu Argument | Machine | Real HW Equivalent | Key Features | Tests |
|---|---|---|---|---|---|
| armv7-a15 | cortex-a15 | -M vexpress-a15 | Exynos 5, OMAP5 | LPAE, VFPv4-D32, NEON, HYP, TrustZone | Primary target. Full feature set |
| armv7-a7 | cortex-a7 | -M virt | Raspberry Pi 2, many IoT SoCs | LPAE, VFPv4-D32, NEON. No HYP/TZ by default | Lower-end ARMv7, tests core paths |

Note: The vexpress-a15 machine only accepts cortex-a15. Alternative CPUs require -M virt. Since all ARMv7-A CPUs have DACR (the Tier 1 isolation mechanism), there is no feature-absent fallback to test — DACR is architectural.

The ARMv7 matrix is intentionally small. ARMv7 is a legacy support tier: two profiles suffice to cover the feature space.

Concrete QEMU command (example: armv7-a15):

qemu-system-arm \
    -M vexpress-a15 \
    -cpu cortex-a15 \
    -serial stdio -display none -no-reboot -m 256M \
    -kernel target/armv7a-none-eabi/release/umka-kernel


25.8.5 RISC-V 64

QEMU binary: qemu-system-riscv64. Machine: -M virt.

Features tested: V (vector), H (hypervisor), Zicbom (cache block management), Svpbmt (page-based memory types), Sstc (stimecmp timer), Zba/Zbb/Zbc/Zbs (bitmanip). RISC-V has no fast Tier 1 isolation mechanism — all drivers run as Tier 0 (in-kernel) or Tier 2 (Ring 3 + IOMMU).

| Profile | -cpu Argument | Real HW Equivalent | Key Extensions | Tests |
|---|---|---|---|---|
| rv64-minimal | sifive-u54 | SiFive HiFive Unleashed | RV64GC only. No V, no H, no Bitmanip, no Sstc, no Svpbmt | All fallback paths. Minimal RISC-V |
| rv64-default | rv64 | Generic baseline | +Zba/Zbb/Zbc/Zbs, +Zicbom, +Sstc. No V, no H, no Svpbmt | Current boot default. Bitmanip, timer |
| rv64-rva22 | rva22s64 | RVA22 profile HW | +Svpbmt, +Zicbom, +Zba/Zbb/Zbs. No V, no H, no Sstc | Profile compliance, page-based mem types |
| rv64-rva23 | rva23s64 | RVA23 profile HW (modern) | +V, +H, +Sstc, +Svpbmt, +Zba/Zbb/Zbs. No Zbc | Modern profile. Vector, hypervisor |
| rv64-veyron | veyron-v1 | Ventana Veyron V1 | +H, +Sstc, +Svpbmt, +Zba/Zbb/Zbc/Zbs, +Zicbom. No V | Real high-perf core without Vector |
| rv64-ascalon | tt-ascalon | Tenstorrent Ascalon | +V, +H, +Sstc, +Svpbmt, +Zba/Zbb/Zbs. No Zbc | Real high-perf core with Vector |
| rv64-max | max | Synthetic (all extensions) | Everything: V, H, Zicbom, Svpbmt, Sstc, all Bitmanip, crypto, CFI | Maximum coverage |
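
Because the profiles above differ mainly in which extensions are present, the kernel must pick code paths per boot. The sketch below shows one way to key dispatch off a riscv,isa-style string; the parsing is deliberately simplified (no "g" expansion, no version suffixes), and the function names are illustrative, not UmkaOS APIs.

```rust
// Simplified riscv,isa-string extension check. Single-letter extensions live
// in the base "rv64imafdc..." segment; multi-letter ones appear as "_zba"
// style suffixes. Illustrative only.
fn has_ext(isa: &str, ext: &str) -> bool {
    let isa = isa.to_ascii_lowercase();
    let mut parts = isa.split('_');
    let base = parts.next().unwrap_or("");
    // Strip the "rv64"/"rv32" prefix so its letters aren't mistaken for extensions.
    let base = base
        .strip_prefix("rv64")
        .or_else(|| base.strip_prefix("rv32"))
        .unwrap_or(base);
    if ext.len() == 1 {
        base.contains(ext)
    } else {
        parts.any(|p| p == ext)
    }
}

// Pick a memory-fill implementation: vector path if V is present, scalar fallback otherwise.
fn select_memfill(isa: &str) -> &'static str {
    if has_ext(isa, "v") { "vector" } else { "scalar-unrolled" }
}

fn main() {
    let rva23 = "rv64imafdcv_zba_zbb_zbs_zicbom_sstc_svpbmt"; // rv64-rva23-like
    let u54 = "rv64imafdc";                                   // rv64-minimal-like
    assert!(has_ext(rva23, "zba"));
    assert!(!has_ext(u54, "zba"));
    assert_eq!(select_memfill(rva23), "vector");
    assert_eq!(select_memfill(u54), "scalar-unrolled");
    println!("dispatch ok");
}
```

Each QEMU profile in the matrix then exercises a different outcome of this dispatch, which is exactly why the feature-absent profiles matter.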

Concrete QEMU command (example: rv64-rva23):

```sh
qemu-system-riscv64 \
    -M virt \
    -cpu rva23s64 \
    -serial stdio -display none -no-reboot -m 256M \
    -kernel target/riscv64gc-unknown-none-elf/release/umka-kernel \
    -bios default
```


25.8.6 PPC32

QEMU binary: qemu-system-ppc. Machine: -M ppce500.

Features tested: SPE (signal processing engine), BookE MMU (TLB-based, no hash page table). PPC32 is a legacy support tier.

| Profile | -cpu Argument | Real HW Equivalent | Key Features | Tests |
|---|---|---|---|---|
| ppc32-e500v2 | e500v2 (default) | Freescale/NXP P1020, P2020 | SPE, BookE MMU, 36-bit phys | Primary target. SPE SIMD, BookE TLB |
| ppc32-e500mc | e500mc | QorIQ P3041, P5020 | No SPE. BookE, HW virtualization | Tests SPE-absent path, e500mc multicore |

Concrete QEMU command (example: ppc32-e500v2):

```sh
qemu-system-ppc \
    -M ppce500 \
    -cpu e500v2 \
    -nographic -no-reboot -m 256M \
    -kernel target/powerpc-unknown-none/release/umka-kernel
```


25.8.7 PPC64LE

QEMU binary: qemu-system-ppc64. Machine: -M pseries.

Features tested: VSX (vector-scalar), HTM (hardware transactional memory), Radix MMU vs Hash Page Table, MMA (matrix math). PPC64LE uses the pseries machine with SLOF firmware.

| Profile | -cpu Argument | Real HW Equivalent | Key Features | Tests |
|---|---|---|---|---|
| ppc64-power8 | power8 | IBM POWER8 (2014) | VSX, HTM, Hash Page Table only | HPT MMU, HTM paths |
| ppc64-power9 | power9 | IBM POWER9 (2017) | VSX, HTM, Radix MMU + HPT fallback | Radix MMU (primary UmkaOS path), HTM |
| ppc64-power10 | power10 | IBM POWER10 (2021) | VSX, MMA, Radix MMU only. No HTM | Modern POWER. MMA, no HTM fallback test |

Note on HTM: POWER8 and POWER9 have HTM. POWER10 removed it. UmkaOS should detect HTM absence gracefully — the power10 profile tests this.

Concrete QEMU command (example: ppc64-power10):

```sh
qemu-system-ppc64 \
    -M pseries \
    -cpu power10 \
    -nographic -no-reboot -m 1G \
    -kernel target/powerpc64le-unknown-none/release/umka-kernel
```


25.8.8 s390x

QEMU binary: qemu-system-s390x. Machine: -M s390-ccw-virtio.

Features tested: Vector Facility (SIMD), Vector Enhancements (IEEE 754), MSA (message-security-assist — crypto acceleration), NNPA (neural network processing assist), DFLT (deflate conversion). s390x uses Storage Keys for memory protection, but these are page-granularity and too coarse for Tier 1 fast domain isolation — Tier 1 is unavailable; drivers choose Tier 0 or Tier 2. Tier 2 available via channel I/O subchannel protection.

| Profile | -cpu Argument | Real HW Equivalent | Key Features | Tests |
|---|---|---|---|---|
| s390x-z14 | z14-base | IBM z14 (2017) | Vector Facility, MSA5. No NNPA, no DFLT, no Vector Enhancements 2 | Baseline vector, crypto. No NNPA fallback |
| s390x-z15 | gen15a-base | IBM z15 T01 (2019) | +Vector Enhancements 2, +DFLT, +MSA9 (CPACF enhancements). No NNPA | Enhanced vector, deflate acceleration |
| s390x-z16 | gen16a-base | IBM z16 / 3931 (2022) | +NNPA (AI inference), +enhanced crypto (MSA10). Full facility set | NNPA detection, full crypto suite |
| s390x-max | max | Synthetic (all facilities) | Everything QEMU can emulate | Maximum coverage, facility interaction testing |

Feature matrix:

| Feature | z14-base | gen15a-base | gen16a-base | max |
|---|---|---|---|---|
| Vector Facility | Yes | Yes | Yes | Yes |
| Vector Enhancements 2 | No | Yes | Yes | Yes |
| MSA (crypto) | MSA5 | MSA9 | MSA10 | All |
| ETOKEN | No | Yes | Yes | Yes |
| NNPA | No | No | Yes | Yes |
| DFLT (deflate) | No | Yes | Yes | Yes |

Known testing gaps: SIE (Start Interpretive Execution — nested virtualization) is not functional in QEMU TCG mode. z16-specific NNPA operations are partially emulated; instruction-level fidelity depends on QEMU version. Storage Key protection is emulated but performance characteristics differ from real hardware.

Concrete QEMU command (example: s390x-z16):

```sh
qemu-system-s390x \
    -M s390-ccw-virtio \
    -cpu gen16a-base \
    -nographic -no-reboot -m 512M \
    -kernel target/s390x-unknown-linux-gnu/release/umka-kernel
```


25.8.9 LoongArch64

QEMU binary: qemu-system-loongarch64. Machine: -M virt.

Features tested: LSX (128-bit SIMD), LASX (256-bit SIMD), CRYPTO (CRC32, AES acceleration), LBT (binary translation assist for x86/ARM/MIPS code), PTW (hardware page table walker). LoongArch64 has no hardware memory domain isolation mechanism — Tier 1 is unavailable; drivers choose Tier 0 or Tier 2. Tier 2 available via IOMMU.

| Profile | -cpu Argument | Real HW Equivalent | Key Features | Tests |
|---|---|---|---|---|
| la64-base | la464 | Loongson 3A5000, LA464 core | LSX, LASX, CRYPTO. Software TLB refill | Primary target. SIMD dispatch, software TLB |
| la64-max | max | Synthetic (all features) | Everything QEMU can emulate: LSX, LASX, LBT, PTW | Maximum coverage, all feature paths |

Feature matrix:

| Feature | la464 | max |
|---|---|---|
| LSX (128-bit SIMD) | Yes | Yes |
| LASX (256-bit SIMD) | Yes | Yes |
| CRYPTO (CRC32/AES) | Yes | Yes |
| LBT (binary translation) | No | Yes |
| PTW (HW page table walker) | No | Yes |

Known testing gaps: QEMU has limited CPU model variety for LoongArch — only the LA464 core and max are available. The LA664 core (Loongson 3A6000) is not yet modeled. Hardware page table walker (PTW) behavior in max mode differs from real silicon software TLB refill in LA464 — both paths must be tested but performance characteristics are synthetic. LBT (binary translation assist) is a QEMU max-only feature with no real hardware equivalent in currently modeled cores.

The LoongArch64 matrix is intentionally small — two profiles cover the available QEMU CPU models. As QEMU adds LA664 and future cores, additional profiles should be added.

Concrete QEMU command (example: la64-base):

```sh
qemu-system-loongarch64 \
    -M virt \
    -cpu la464 \
    -serial stdio -display none -no-reboot -m 512M \
    -kernel target/loongarch64-unknown-linux-gnu/release/umka-kernel
```


25.8.10 CI Test Tiers

Three tiers of test coverage, from fast (every commit) to thorough (nightly):

25.8.10.1 Tier 1: Every Commit (8 configs, ~3 min total)

One profile per architecture — the current defaults, ensuring nothing regresses:

| Arch | Profile | -cpu |
|---|---|---|
| x86-64 | x86-max | max |
| AArch64 | arm64-a72 | cortex-a72 |
| ARMv7 | armv7-a15 | cortex-a15 |
| RISC-V 64 | rv64-default | rv64 |
| PPC32 | ppc32-e500v2 | e500v2 |
| PPC64LE | ppc64-power10 | power10 |
| s390x | s390x-max | max |
| LoongArch64 | la64-base | la464 |

25.8.10.2 Tier 2: Every PR (22 configs, ~10 min total)

Adds feature-absent profiles to exercise fallback paths:

All Tier 1 configs, plus:

| Arch | Profile | Tests |
|---|---|---|
| x86-64 | x86-minimal (Haswell) | No PKU, no SMAP, no SHA-NI |
| x86-64 | x86-skylake | PKU + AVX-512 |
| x86-64 | x86-epyc-v1 | AMD, no PCID |
| AArch64 | arm64-minimal (cortex-a53) | No MTE/SVE/PAuth/BTI |
| AArch64 | arm64-n2 + mte=on | MTE + SVE + PAuth + BTI |
| AArch64 | arm64-max + mte=on | Full feature set |
| RISC-V 64 | rv64-minimal (sifive-u54) | Bare RV64GC |
| RISC-V 64 | rv64-rva23 | Vector + Hypervisor |
| PPC64LE | ppc64-power8 | HPT MMU, HTM |
| PPC64LE | ppc64-power9 | Radix + HTM |
| PPC32 | ppc32-e500mc | No SPE |
| ARMv7 | armv7-a7 (virt) | Lower-end ARMv7 |
| s390x | s390x-z14 | Baseline vector, no NNPA/DFLT |
| LoongArch64 | la64-max | All features, HW PTW + LBT |

25.8.10.3 Tier 3: Nightly (34 configs, ~25 min total)

All Tier 2 configs, plus:

| Arch | Profile | Tests |
|---|---|---|
| x86-64 | x86-broadwell | +SMAP, still no PKU |
| x86-64 | x86-icelake | SHA-NI, LA57, UMIP |
| x86-64 | x86-sapphire | Full Intel |
| x86-64 | x86-epyc-genoa | Full AMD |
| x86-64 | Skylake-Server,-pku | Modern CPU, MPK disabled |
| x86-64 | Icelake-Server,-la57 | Modern CPU, 4-level paging |
| AArch64 | arm64-a76 | LSE, no MTE/SVE |
| AArch64 | arm64-a710 + mte=on | SVE2, mobile ARMv9 |
| RISC-V 64 | rv64-rva22 | Svpbmt, no V/H |
| RISC-V 64 | rv64-veyron | Real core, no V |
| s390x | s390x-z15 | Enhanced vector, DFLT |
| s390x | s390x-z16 | NNPA, full crypto |

25.8.11 Known Testing Gaps

| Gap | Impact | Mitigation |
|---|---|---|
| AArch64 POE (FEAT_S1POE) | Cannot test POE-based Tier 1 fast isolation (~40-80 cycles) | Test page-table+ASID fallback (all profiles exercise this). POE fast path requires real hardware (Neoverse V3, Apple M4). |
| AArch64 RME (Realm Management Extension) | Cannot test CCA confidential computing | RME is a Phase 4+ feature. max CPU advertises RME but full CCA requires firmware support not emulated. |
| RISC-V Tier 1 isolation | No fast isolation mechanism exists in RISC-V ISA | By design: RISC-V Tier 1 runs as Tier 0. Tier 2 (Ring 3 + IOMMU) tested via rv64-max with H extension. |
| x86 TDX/SEV | Cannot test confidential VM support | Phase 4+ feature. Requires KVM passthrough, not QEMU TCG. |
| Real NUMA topology | QEMU -smp + -numa can simulate, but latency is synthetic | Functional testing only; performance characteristics require real hardware. |
| PPC64 PowerVM LPAR | pseries emulates, but not full LPAR partitioning | Acceptable for functional testing. |
| s390x SIE (nested virtualization) | Cannot test KVM-on-s390x in QEMU TCG mode | Requires real z/Architecture hardware or KVM passthrough. Functional guest support only. |
| s390x NNPA fidelity | z16 NNPA instruction emulation is partial in QEMU | Feature detection tested; instruction-level behavior requires real z16 hardware. |
| LoongArch64 LA664 core | No LA664 (3A6000) model in QEMU | Only LA464 (3A5000) available. LA664-specific features untestable until QEMU adds the model. |
| LoongArch64 PTW vs software refill | HW PTW in max mode differs from LA464 software TLB refill | Both paths tested functionally, but real PTW latency characteristics require hardware. |

25.9 Human Involvement Required

Agentic development is not fully autonomous. Humans are needed for:

25.9.1 Architectural Decisions (Non-Automatable)

Canonical list: Section 24.11 (§24.11). This table mirrors that list — update both when status changes.

| Question | Status | Notes |
|---|---|---|
| OEM partnerships strategy | OPEN | Framework, System76, Dell, HP — go-to-market for Phase 5b |
| GPU confidential computing (VRAM encryption) | RESOLVED | Both paths supported: runtime detection + admin override (umka.cc_device_dma=). See Section 9.7 |
| Nested GPU passthrough | RESOLVED | Supported if hardware allows it (IOMMU nested + TEE firmware + ≤3x overhead). See Section 9.7 |
| Policy module measurement enforcement | RESOLVED | Tied to boot security posture: enforce/advisory/off. See Section 19.9 |
| DPU io_uring submission offload | RESOLVED | Not a separate question — Tier M peer protocol IS the transport; dumb drivers use normal KABI path |
| Multi-arch fallback acceptance criteria | RESOLVED | Per-feature thresholds (native ≤5%, fallback ≤10%), sysfs + dmesg notification. See Section 2.22 |
| Cross-feature CI testing formalization | RESOLVED | 21 pairs, 3-tier CI spec. See Section 24.10 |

All previously open items (WiFi tier, BlueZ, proprietary drivers, default filesystem, eBPF verifier, io_uring+SEV-SNP, CXL+DSM, live evolution attestation) are also RESOLVED — see the full resolved decisions table in Section 24.11.

25.9.2 Spec Review & Correction

Automated review cycles (ZAI, Opus, Qwen, cross-subsystem) have resolved the majority of spec bugs and design gaps. Remaining work is incremental: each review cycle finds fewer issues as the spec matures. Human review is needed for ambiguous cases where multiple valid design approaches exist.

25.9.3 External Coordination

  • WINE/Proton integration: Negotiate with Valve, CodeWeavers
  • OEM partnerships: Framework, System76, Dell, HP
  • Upstream contributions: Linux driver code reuse, licensing
  • Community building: Documentation, marketing, beta testing

25.10 Realistic Full Timeline (Agentic + Human)

Assuming:

- 50 t/s inference (fast model)
- 10-20 AI agents in parallel
- Human architect for decisions
- Hardware lab available
- Spec is corrected first (arch-review loops)

| Activity | Duration | Notes |
|---|---|---|
| Pre-development | | |
| Arch review + spec fixes | 2-3 weeks | Human-in-loop with AI review agents |
| Hardware procurement | 2-4 weeks | Can overlap with spec fixes |
| Setup CI/CD infrastructure | 1 week | Automated build/test pipelines (see below) |
| Core development | | |
| Phases 1.1–5.1 (optimized) | 20 weeks (~5 months) | AI agents, parallelized |
| Hardware testing | 3 weeks | Overlaps with development |
| Post-development | | |
| Integration testing | 2-3 weeks | Full system, all architectures |
| Security testing | 3-4 weeks | Adversarial testing (see below) |
| Bug fixing (found in integration + security) | 2-3 weeks | Final polish |
| Performance tuning | 2-3 weeks | Optimize hot paths |
| Documentation | 2 weeks | User docs, admin guides |
| Beta testing | | |
| Internal alpha (10 users) | 4 weeks | Find major issues |
| Public beta (100 users) | 8 weeks | Broader hardware, edge cases |
| TOTAL | ~12-14 months | From spec to public beta |

CI/CD infrastructure: GitHub Actions with an 8-architecture QEMU matrix (x86-64, AArch64, ARMv7, RISC-V, PPC32, PPC64LE, s390x, LoongArch64).

- Local pre-push validation: make check (= fmt + lint + build + test for the default architecture).
- Each PR runs: make fmt → make lint-all → make build (per arch) → make test (per arch) → syzkaller fuzz (x86-64 only, 5-minute regression run).
- Hardware lab: real-device testing on x86-64 (Intel + AMD) and AArch64 (Raspberry Pi 5, Apple M1) for non-emulatable hardware paths (PCIe link training, IOMMU fault injection, actual suspend/resume cycles).
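
The 8-architecture matrix can be expressed as data so one harness drives all QEMU boots. The sketch below reproduces the per-commit (Tier 1) rows from the tables in Section 25.8; the q35 machine for x86-64 is an assumption (the x86 profile table is defined elsewhere), and the command builder itself is illustrative, not the real CI tooling.

```rust
// One row per architecture of the every-commit test matrix.
struct Target {
    arch: &'static str,
    qemu: &'static str,
    machine: &'static str,
    cpu: &'static str,
}

const TIER1: &[Target] = &[
    Target { arch: "x86-64",      qemu: "qemu-system-x86_64",      machine: "q35",             cpu: "max" }, // machine assumed
    Target { arch: "aarch64",     qemu: "qemu-system-aarch64",     machine: "virt",            cpu: "cortex-a72" },
    Target { arch: "armv7",       qemu: "qemu-system-arm",         machine: "vexpress-a15",    cpu: "cortex-a15" },
    Target { arch: "riscv64",     qemu: "qemu-system-riscv64",     machine: "virt",            cpu: "rv64" },
    Target { arch: "ppc32",       qemu: "qemu-system-ppc",         machine: "ppce500",         cpu: "e500v2" },
    Target { arch: "ppc64le",     qemu: "qemu-system-ppc64",       machine: "pseries",         cpu: "power10" },
    Target { arch: "s390x",       qemu: "qemu-system-s390x",       machine: "s390-ccw-virtio", cpu: "max" },
    Target { arch: "loongarch64", qemu: "qemu-system-loongarch64", machine: "virt",            cpu: "la464" },
];

// Build one boot command; per-arch extras (-bios, -serial) omitted for brevity.
fn cmdline(t: &Target, kernel: &str) -> String {
    format!("{} -M {} -cpu {} -nographic -no-reboot -kernel {}",
            t.qemu, t.machine, t.cpu, kernel)
}

fn main() {
    assert_eq!(TIER1.len(), 8);
    let c = cmdline(&TIER1[3], "umka-kernel");
    assert!(c.starts_with("qemu-system-riscv64 -M virt -cpu rv64"));
    println!("{} tier-1 configs", TIER1.len());
}
```

Tier 2 and Tier 3 become additional rows in the same table, so the CI harness needs no per-architecture special cases beyond the extra flags each target requires.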

Security testing phase (included in post-development):

- Syzkaller fuzzing: Continuous syscall fuzzing across all 8 architectures. Minimum 72-hour clean fuzzing gate per architecture before beta (continuous fuzzing continues beyond this gate in CI; the 72-hour requirement is the minimum before a release is considered candidate-ready).
- eBPF verifier adversarial testing: Crafted programs targeting bypass, out-of-bounds access, and infinite loops. Coverage-guided with KASAN.
- Namespace/capability escape testing: Adversarial seccomp/setns/clone sequences attempting privilege escalation across namespace boundaries.
- Tier 1 isolation testing: Verify that a compromised Tier 1 driver cannot read PKEY 0 memory (on hardware with MPK; on QEMU, verify the software enforcement path).
- Penetration testing: External audit of the capability system, IPC paths, and SysAPI layer for TOCTOU, use-after-free, and confused deputy.

KVM implementation is included in Roadmap Phase 4 (Production Ready) as a sub-item, since KVM host-side (VMX/SVM, EPT, vCPU scheduling) depends on the scheduler and memory subsystems from Phases 1.1 through 2.2. Estimated: 40-60 agent hours, 30-45 days elapsed (comparable to a substantial driver subsystem).

Breakdown:

- Pre-development: 1 month
- Core development: 5 months
- Post-development (including security testing): 2.5 months
- Beta testing: 3 months
- Buffer: 1 month (unexpected issues)


25.11 Comparison: Human vs Agentic

| Metric | Human Development | Agentic Development (50 t/s) |
|---|---|---|
| Team size | 10-15 developers | 10-20 AI agents (+ 1 architect) |
| Timeline (to public beta) | 5-7 years | 12-14 months |
| Cost (developer salaries) | $10-15 million (7 years × $150K × 10 devs) | $200K-500K (compute + 1 architect; assumes negotiated enterprise API pricing — retail token costs would be 3-5× higher) |
| Cost (hardware) | $100K (same) | $100K (same) |
| Total cost | $10-15M | $0.3-0.6M |
| Speedup | 1x | ~5x faster |
| Code quality | Varies by developer | Consistent (determined by spec) |
| Bugs from spec errors | Same | Same (GIGO applies) |
| Bugs from implementation | Higher (human error) | Different profile (consistent patterns, but risk of systematic errors across similar code) |

Key insight: Agentic development is 5x faster and 10-20x cheaper, but bottlenecked by:

1. Hardware testing (physics-bound)
2. Iteration cycles (compile/test, not coding)
3. Spec quality (AI can't fix bad specs without human guidance)


25.12 Sensitivity Analysis: Slower Inference

What if inference is slower?

| Inference Speed | Agent Coding Time | Impact on Timeline | Total Timeline |
|---|---|---|---|
| 50 t/s (base case) | ~5-10 min/component | | 12-14 months |
| 25 t/s (2x slower) | ~10-20 min/component | +10-15% | 13-16 months |
| 10 t/s (5x slower) | ~25-50 min/component | +25-30% | 15-18 months |
| 5 t/s (10x slower) | ~50-100 min/component | +40-50% | 18-21 months |

Key insight: Even at 10x slower inference, agentic development is only +50% longer (18-21 months vs 12-14 months), because most time is spent in compilation/testing, not AI inference.

Inference speed matters less than you'd expect once it's above ~5-10 t/s.
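
The arithmetic behind the table can be made explicit: only the inference-bound slice of core development scales with token rate, while compile, test, hardware, and beta time are fixed. The month figures and the 15% coding share below are illustrative fits to this chapter's estimates, not measurements.

```rust
// Toy timeline model: fixed phases + core development whose coding share
// scales inversely with tokens/second. All constants are illustrative.
fn timeline_months(tokens_per_s: f64) -> f64 {
    let fixed = 9.0;         // pre-dev + integration + beta + buffer (months)
    let dev_at_50 = 5.0;     // core development months at 50 t/s
    let coding_share = 0.15; // fraction of dev time that is pure inference
    fixed + dev_at_50 * ((1.0 - coding_share) + coding_share * 50.0 / tokens_per_s)
}

fn main() {
    let base = timeline_months(50.0); // 14.0 months
    assert!((base - 14.0).abs() < 0.5);
    // 10x slower inference stretches the total by roughly +50%, not 10x.
    assert!(timeline_months(5.0) / base < 1.6);
    println!("50 t/s: {:.1} mo, 5 t/s: {:.1} mo", base, timeline_months(5.0));
}
```

With these constants, 5 t/s yields about 21 months versus 14, matching the +40-50% row above; the fixed terms dominate, which is the whole point.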


25.13 Optimistic vs Pessimistic Scenarios

25.13.1 Best Case (Everything Goes Right)

Assumptions:

- Spec has zero showstoppers after initial review
- Hardware available immediately
- AI agents rarely hit bugs requiring human intervention
- Beta testing finds only minor issues

Timeline: 10-11 months to public beta

25.13.2 Realistic Case (Some Issues)

Assumptions:

- Spec has ~300 bugs (as analyzed)
- Hardware procurement takes time
- Some subsystems need multiple rewrites
- Beta testing finds 50-100 additional issues

Timeline: 12-14 months to public beta (our base estimate)

25.13.3 Pessimistic Case (Major Problems)

Assumptions:

- Spec has fundamental architectural flaws (e.g., RCU design is unsound)
- Major subsystem needs redesign (e.g., DSM quorum logic)
- Hardware compatibility worse than expected (WiFi works on 3/10 chipsets)
- Beta testing finds showstopper issues (data corruption, security vulnerabilities)

Timeline: 18-24 months to public beta


25.14 What Determines Success?

The bottleneck is NOT AI speed — it's specification quality.

Critical success factors:

1. ✅ Spec correctness (run arch-review → fix loops until zero showstoppers)
2. ✅ Hardware availability (don't wait 6 months for cluster procurement)
3. ✅ Automated testing (CI/CD must catch regressions immediately)
4. ✅ Human architectural guidance (AI can't make strategic decisions)
5. ⚠️ Unknown unknowns (things you discover only during implementation)

With perfect spec: 10-12 months is achievable.
With current spec (~50 remaining flaws): 12-14 months realistic.
With flawed spec (fundamental issues): 18-24 months or requires redesign.


25.15 Recommendations

25.15.1 Before Starting Implementation

  1. Run 2-3 more arch-review cycles (eliminate all showstoppers)
  2. Procure hardware lab (10 laptops, cluster, WiFi adapters)
  3. Set up CI/CD (automated build/test on every commit)
  4. Define architectural decision process (who decides Tier 1 vs Tier 2 for WiFi?)

Time investment: 1 month
Payoff: Saves 2-4 months during implementation

25.15.2 During Implementation

  1. Daily integration testing (catch cross-subsystem bugs early)
  2. Weekly human review (architect reviews AI agent work)
  3. Parallel spec updates (fix spec bugs as they're discovered)
  4. Hardware testing from day 1 (don't wait until "code complete")

25.15.3 Metrics to Track

Leading indicators (predict timeline):

- Spec bugs discovered per week (should decrease over time)
- Test pass rate (should increase toward 95%+)
- Integration conflicts per week (should stabilize <10)

Lagging indicators (measure progress):

- Lines of code (target: ~300K SLOC)
- Test coverage (target: >80%)
- Supported hardware (target: 50+ laptop models)


25.16 Final Answer: Realistic Timeline

Question: With 50 t/s inference and agentic development, how long to develop UmkaOS?

Answer: 12-14 months from spec finalization to public beta

Breakdown:

- Spec review & fixes: 1 month
- Core development (Phases 1.1–5.1): 5 months
- Integration & polish: 2 months
- Beta testing: 3 months
- Buffer for unknowns: 1 month

Compared to human development: 5x faster (5-7 years → 12-14 months)

Cost: roughly 20x cheaper ($10-15M → $0.3-0.6M)

Caveat: This assumes good spec quality and hardware availability. Poor spec quality adds 6-12 months. Hardware unavailability adds 2-6 months.

The bottleneck is not AI — it's specification correctness and hardware testing.


25.17 Agentic Live Development Workflow

Phase numbering: This chapter uses the same phase numbering as Section 24.2: Phase 1 = Foundations (boot to hello-world), Phase 2 = Self-hosting shell + Tier 1 fault recovery (busybox, VirtIO-blk), Phase 3 = Real workloads + Tier M peer demo (systemd, Docker, TCP, NVMe), Phase 4 = Production ready (K8s, KVM, LTP, real hardware), Phase 5 = Ecosystem and platform maturity (5a-5e sub-phases).

UmkaOS's architecture enables a development workflow fundamentally different from traditional kernel development: live, on-host, iterative development where an AI agent develops and tests kernel code on the same machine the kernel is running on, without rebooting.

This is not a distant Phase 5 aspiration — it falls out naturally from three existing architectural features:

  1. KABI tier isolation (Section 11.2): Drivers compiled against umka-driver-sdk run at any tier (0/1/2) without recompilation. Tier 2 (Ring 3 + IOMMU) provides full crash containment — a buggy driver cannot harm the host.

  2. Live kernel evolution (Section 13.18): Core components and KABI services can be replaced at runtime via the EvolvableComponent protocol. State is serialized, swapped atomically (~1-10 μs), and rolled back if the new version crashes.

  3. Multikernel peer model (Section 11.1, Section 5.1): Peer kernels on DPUs or remote nodes provide isolated test environments within the same cluster. A new service version can be deployed to one peer, soaked, and rolled to others.

25.17.1 Driver Development on a Live Host

The standard agentic driver development loop:

```
repeat {
    1. Agent writes/modifies driver code against umka-driver-sdk.
    2. Agent compiles: cargo build --release -p my-driver
    3. Agent loads driver at Tier 2:
         echo "load my-driver.kabi tier=2" > /ukfs/kernel/drivers/control
       Tier 2: Ring 3, IOMMU-isolated. Crashes cannot harm the host kernel.
    4. Agent runs test suite against the driver.
       - Functional tests (does the driver handle I/O correctly?)
       - Stress tests (high IOPS, error injection, timeout simulation)
       - KABI conformance tests (vtable completeness, ring buffer protocol)
    5. If tests fail: agent reads crash log, analyzes, modifies code → goto 1.
    6. If tests pass: agent promotes to Tier 1:
         echo "1" > /ukfs/kernel/drivers/my-driver/tier
}
```
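
Steps 3 and 6 of the loop are plain writes to the /ukfs control files named above. A minimal sketch of how the agent's tooling might script them, assuming those paths and the command syntax shown in the loop; error handling and the real tooling are out of scope.

```rust
use std::io::Write;

// Build the control-file command from step 3 of the loop above.
fn control_cmd(driver: &str, tier: u8) -> String {
    format!("load {driver}.kabi tier={tier}")
}

// Write the command to the driver control file (only works on a live
// UmkaOS host with /ukfs mounted; shown for illustration).
fn load_driver(driver: &str, tier: u8) -> std::io::Result<()> {
    let mut f = std::fs::OpenOptions::new()
        .write(true)
        .open("/ukfs/kernel/drivers/control")?;
    writeln!(f, "{}", control_cmd(driver, tier))
}

fn main() {
    // On a dev box without /ukfs we can still check the command we'd send.
    assert_eq!(control_cmd("my-driver", 2), "load my-driver.kabi tier=2");
    let _ = load_driver; // invoked only on a live UmkaOS host
    println!("{}", control_cmd("my-driver", 2));
}
```

Promotion to Tier 1 (step 6) is the same pattern against /ukfs/kernel/drivers/my-driver/tier.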

Iteration speed: Steps 1-5 take ~2-5 minutes per cycle (compile ~30s incremental, load ~10ms, test ~1-3 min). This is 10-100x faster than traditional kernel development (which requires reboot per change on a monolithic kernel) and comparable to userspace development speed.

No QEMU needed for driver development: Because Tier 2 drivers run in Ring 3 with IOMMU isolation, the host kernel is protected. The agent can develop drivers on the target hardware directly, accessing real devices (NVMe, NIC, GPU) with real firmware and real interrupt behavior — not emulated. QEMU is still needed for:

- Core kernel development (scheduler, memory manager — not tier-isolated)
- Multi-architecture testing (cross-compile + QEMU boot for non-host architectures)
- CI/CD validation (reproducible environments)

25.17.2 Kernel Service Development via Live Replacement

For kernel subsystems (VFS, networking, block layer) — not just drivers — the live evolution framework (Section 13.18) enables iterative development without rebooting:

```
repeat {
    1. Agent modifies KABI service code (e.g., umka-net TCP congestion module).
    2. Agent compiles: cargo build --release -p umka-net
    3. Agent triggers live replacement:
         echo "evolve umka-net /path/to/new/umka-net.uko" > /ukfs/kernel/evolution/control
       The evolution framework runs Phase A/A'/B/C:
         Phase A  — Loads new binary, validates KABI signature and vtable compatibility.
                    Exports old service state in chunks (connection table, routing FIB,
                    etc.) while old service continues handling requests.
         Phase A' — Quiescence (100ms deadline for services): blocks new syscall
                    entries (-ERESTARTSYS), drains in-flight operations, final
                    atomic state re-export. See Section 13.18
                    for quiescence failure handling (abort, rollback, retry).
         Phase B  — Atomic swap (~1-10 μs stop-the-world): IPI, vtable pointer
                    swap, pending ops ring transfer, CPUs released.
         Phase C  — New service drains pending ops, blocked syscalls auto-retry
                    (transparent to userspace).
    4. Agent runs tests against the new service.
    5. If new service crashes within watchdog window (10s for services):
       → Automatic reload of previous version via forward evolution from retained
         state. Agent reads crash log → goto 1.
    6. If tests pass: new version is now the active service.
}
```

Constraints:

- Non-replaceable data components (memory allocator data — PageArray, BuddyFreeList, PcpPagePool; page table hardware ops; capability data — CapTable, CapEntry; page reclaim data; KABI dispatch trampoline; evolution primitive — see Section 13.18) still require QEMU or reboot for development. Their corresponding policy layers (PhysAllocPolicy, VmmPolicy, PageReclaimPolicy, CapPolicy) are replaceable via atomic pointer swap.
- The first deployment of a service (from nothing) requires a boot. Live replacement only works for updating an already-running service.
- State format changes between versions must provide a migration function (v(N-1) → v(N)). If the agent makes a major restructuring, a chained migration or fresh restart may be needed.
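
The chained-migration constraint above can be made concrete: each version ships a v(N-1) → v(N) converter, and an update that skips versions composes them. The types below are illustrative stand-ins, not the EvolvableComponent API.

```rust
// Three state formats for a hypothetical connection-tracking service.
struct StateV1 { conns: Vec<u32> }
struct StateV2 { conns: Vec<u32>, rtt_us: Vec<u32> }  // v2 adds per-conn RTT
struct StateV3 { conns: Vec<(u32, u32)> }             // v3 merges the arrays

// Each migration fills new fields with safe defaults; old data is preserved.
fn migrate_v1_v2(s: StateV1) -> StateV2 {
    let n = s.conns.len();
    StateV2 { conns: s.conns, rtt_us: vec![0; n] } // RTT unknown: default 0
}

fn migrate_v2_v3(s: StateV2) -> StateV3 {
    StateV3 { conns: s.conns.into_iter().zip(s.rtt_us).collect() }
}

// A v1 -> v3 update chains the per-step converters rather than shipping a
// bespoke migration for every version pair.
fn migrate_v1_v3(s: StateV1) -> StateV3 {
    migrate_v2_v3(migrate_v1_v2(s))
}

fn main() {
    let old = StateV1 { conns: vec![7, 9] };
    let new = migrate_v1_v3(old);
    assert_eq!(new.conns, vec![(7, 0), (9, 0)]);
    println!("migration ok");
}
```

If a restructuring is too radical for any sensible per-field default, a fresh restart of the service is the honest fallback, as the constraint notes.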

25.17.3 Multikernel Testing Strategy

On a multikernel cluster (host + DPU peers, or multi-host), the agent can use the rolling replacement protocol (Section 13.18) for safe incremental testing:

```
Development cluster topology:
  ┌──────────┐   ┌──────────┐   ┌──────────┐
  │  Host    │   │  DPU #1  │   │  DPU #2  │
  │ (dev box)│←→│ (test    │←→│ (prod    │
  │          │   │  target) │   │  control)│
  └──────────┘   └──────────┘   └──────────┘

Agent development workflow:
  1. Agent writes new network stack version on Host.
  2. Agent deploys to DPU #1 only (via rolling replacement, one peer).
  3. DPU #1 runs the new version; DPU #2 and Host run the old version.
     Capability service routing ensures clients are served by whichever
     peer has an active provider.
  4. Agent runs workload against DPU #1:
     - Network throughput tests (iperf3, netperf)
     - Latency tests (sockperf, ping)
     - Error injection (packet drop, reorder, corrupt)
  5. If DPU #1 crashes: automatic reload of previous version via forward evolution. Only DPU #1 affected.
     Host and DPU #2 continue serving. Agent analyzes → goto 1.
  6. If soak succeeds (1 hour default):
     → Roll to DPU #2.
     → Roll to Host.
     → Cluster-wide verification.
```

Key advantage: The host running the development tools (compiler, editor, agent runtime) is the last node to receive the update. If the new code has a fatal bug, the development environment is never disrupted — the agent can always diagnose and fix the issue from the unaffected host.

25.17.4 LTP as Agentic Development Substrate

Implementing Linux syscall compatibility is the largest single task in UmkaOS development: ~400 syscalls, each with complex edge cases, architecture-specific behavior, and underdocumented invariants. For human developers, this is years of tedious work. For agentic development, the existence of the Linux Test Project (LTP) transforms this from an open-ended research problem into a structured, test-driven implementation task.

Why LTP is uniquely valuable for agents:

  1. Machine-readable behavioral specification. LTP contains ~5,000+ test cases that encode the actual expected behavior of Linux syscalls — not what the man page says, but what the kernel actually does. Each test is a concrete input → expected output pair. An agent can read a test, understand the contract, implement the syscall to satisfy the test, and verify correctness — all without human involvement.

  2. Natural task decomposition. LTP tests are organized by syscall family (open, mmap, clone, futex, etc.). Each family is an independent work unit that an agent can claim, implement, and validate. The test suite provides a natural progress tracker: "242/400 syscall families passing" is an unambiguous status.

  3. Edge case discovery. LTP tests exercise corner cases that are difficult to derive from documentation alone: mmap with MAP_FIXED overlapping an existing mapping, clone with invalid flag combinations, futex wake-vs-requeue races, signalfd interaction with SA_SIGINFO. These tests encode decades of bug reports and regression fixes. The agent gets this knowledge for free.

  4. Regression prevention. Once a test passes, it must never fail again. The agent runs the full LTP suite after every change. Any regression is caught immediately with a specific failing test that identifies the exact syscall and edge case — the agent can localize the bug without human debugging.

  5. Cross-architecture validation. LTP runs on all architectures. The same test suite validates that syscall behavior is identical on x86-64, AArch64, ARMv7, RISC-V 64, PPC32, PPC64LE, s390x, and LoongArch64. Architecture-specific bugs (wrong register convention, wrong signal frame layout, wrong struct stat padding) are caught automatically.

Agentic LTP workflow:

```
For each syscall family (e.g., "mmap"):
  1. Agent reads LTP tests for the family (e.g., ltp/testcases/kernel/syscalls/mmap/)
  2. Agent reads the architecture spec (sysapi/syscall-interface.md, process/*.md, etc.)
  3. Agent implements the syscall handler to satisfy the spec
  4. Agent runs the LTP tests for that family on all 8 arches in QEMU
  5. If tests fail: agent reads failure output, identifies the discrepancy, fixes → goto 3
  6. If tests pass: commit, move to next family
  7. After every N families: run full LTP regression to catch cross-syscall interactions
```

Scale advantage: A human developer implementing mmap compatibility might spend 2-4 weeks reading documentation, writing code, and debugging edge cases. An agent with LTP tests as a feedback signal can iterate in minutes per cycle. The combinatorial complexity of ~400 syscalls × ~10 edge cases each = ~4,000 implementation decisions is within agent capability when each decision has an immediate correctness signal (LTP test pass/fail).

Complementary test suites: The LTP pattern generalizes to at least 12 subsystems. See Section 25.18 for the comprehensive inventory of Linux test suites usable as agentic development accelerators — including xfstests (~1,500 filesystem tests), packetdrill (2,000+ TCP scripts from Google), BPF selftests (~1,000 eBPF tests), liburing tests, kvm-unit-tests, IGT GPU Tools (2,228+ DRM/KMS subtests), blktests, and 20+ kselftest subdirectories covering mm, cgroup, seccomp, futex, ptrace, and more.

Cross-references: - Syscall interface spec: Section 19.1 - Verification strategy (LTP gate): Section 24.3 - Phase 4 exit criteria (>95% LTP): Section 24.2

25.17.5 Development Acceleration Summary

| Development task | Traditional Linux | UmkaOS agentic |
|---|---|---|
| Driver bug fix | Edit → compile → reboot → reproduce → verify (~10-30 min) | Edit → compile → reload Tier 2 → verify (~2-5 min) |
| Network stack change | Edit → compile → reboot → reconfigure → test (~15-45 min) | Edit → compile → live evolve → test (~3-8 min) |
| Scheduler policy tuning | Edit → compile → reboot → benchmark (~20-60 min) | Edit → compile → policy hot-swap → benchmark (~2-5 min) |
| Cluster service update | N × (reboot + rejoin) (~N × 5-10 min) | N × (live replace + soak) (~N × 200ms + 1 hour soak) |
| Multi-arch driver testing | 8 × (cross-compile → QEMU boot → test) (~8 × 10 min) | Tier 2 on native hardware + 7 × QEMU (~1 × 5 min + 7 × 10 min) |

Net effect: The compile-test cycle for drivers and services drops from 10-60 minutes (reboot-bound) to 2-8 minutes (compile-bound). With incremental compilation (~30s), the bottleneck shifts entirely to test execution time — exactly where it should be.

Cross-references: - KABI driver model: Section 12.1 - Tier isolation: Section 11.2 - Crash recovery: Section 11.9 - Live kernel evolution: Section 13.18 - KABI service live replacement: Section 13.18 - Driver tier promotion: Section 13.18 (promotion protocol) - Policy hot-swap: Section 19.9 - Multikernel peer model: Section 11.1 - Cluster membership: Section 5.1 - ServiceDrainNotify: Section 5.11 - FMA telemetry: Section 20.1

25.18 Linux Test Suite Inventory for Agentic Development

The "LTP as development substrate" pattern (Section 25.17) generalizes far beyond syscalls. Linux has accumulated decades of test suites across nearly every subsystem — each encoding the actual behavioral contract that UmkaOS must implement. For agentic development, these suites convert open-ended "implement Linux compatibility" tasks into structured implement → run tests → fix failures → repeat loops with unambiguous pass/fail signals.

This section catalogues every major Linux test suite usable as an agentic development accelerator for UmkaOS, organized by value tier.

25.18.1 Tier 1: High-Value Test Suites (Pure Userspace API, Directly Usable)

These suites test exclusively via userspace API (syscalls, ioctls, procfs/sysfs) and require no Linux kernel internals. They validate exactly the external ABI boundary UmkaOS must implement.

25.18.1.1 LTP (Linux Test Project)

| Property | Value |
|---|---|
| Repository | github.com/linux-test-project/ltp |
| Test count | ~5,000+ test cases |
| Interface | Userspace: syscalls, procfs, sysfs |
| UmkaOS chapters | Section 8.1 (fork/exec/signals), Section 17.1 (namespaces), Section 17.2 (cgroups), Section 19.1 (syscalls), Section 9.1 (capabilities), Section 17.3 (IPC) |

LTP is the single largest general-purpose Linux kernel test suite. Tests are organized by syscall family (open, mmap, clone, futex, etc.). Each family is an independent work unit. See Section 25.17 for the full agentic LTP workflow.
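Because each family is an independent work unit, an agent can derive its work-unit list mechanically from an LTP runtest file. A minimal sketch — the sample lines are illustrative stand-ins for entries in ltp/runtest/syscalls:

```shell
# Derive syscall family names from LTP runtest entries ("tag command" pairs)
# by stripping everything from the first digit or underscore in the tag.
runtest='open01 open01
open02 open02
mmap01 mmap01
futex_wait01 futex_wait01'
families=$(printf '%s\n' "$runtest" | sed 's/[0-9_].*//' | sort -u)
printf '%s\n' "$families"
```

Each resulting family name (futex, mmap, open, ...) becomes one implement → test → fix loop.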

25.18.1.2 xfstests (fstests)

| Property | Value |
|---|---|
| Repository | git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git, mirror at github.com/kdave/xfstests |
| Test count | ~1,500+ tests across generic + per-filesystem groups |
| Interface | Userspace: open/read/write/fsync/fallocate/ioctl/xattr |
| Filesystems | ext4 (L4), XFS (L4), btrfs (L4), f2fs (L3), tmpfs (L2), NFS (L2), overlayfs (L2), many others |
| UmkaOS chapters | Section 14.1 (VFS), Section 15.6 (ext4), Section 15.7 (XFS), Section 15.8 (Btrfs), Section 14.11 (FUSE), Section 14.7 (overlayfs) |

Probably the single highest-ROI test suite for agentic development. xfstests encodes the complete POSIX filesystem behavioral contract plus Linux-specific extensions (fallocate, FIEMAP, O_TMPFILE, copy_file_range, splice, hole punch, CoW semantics). The tests/generic/ directory contains ~700+ tests that apply to any filesystem — these test VFS-layer semantics independent of filesystem implementation.

Agentic workflow:

For each filesystem operation (e.g., "fsync + CoW"):
  1. Agent reads xfstests/tests/generic/ tests exercising that operation
  2. Agent reads VFS spec ([Section 14.1](14-vfs.md#virtual-filesystem-layer))
  3. Agent implements the VFS operation
  4. Agent runs xfstests generic group: ./check -g generic/quick
  5. Failures → agent reads test output, fixes → goto 3
  6. Pass → run full generic suite, then per-filesystem tests

Full suite runtime: 5-6 days for all filesystems (per LWN 2022 report). The quick group runs in ~30 minutes — suitable for per-iteration feedback.
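The per-iteration feedback in step 5 reduces to two numbers plus a failure list. A sketch of parsing the ./check summary — the sample text is illustrative, but the "Failures:" and "Failed N of M tests" lines are the suite's standard summary format:

```shell
# Turn an xfstests run summary into a machine-readable pass/fail signal.
summary='Ran: generic/001 generic/005 generic/013
Failures: generic/005
Failed 1 of 3 tests'
failed=$(printf '%s\n' "$summary" | awk '/^Failed/ { print $2 }')
total=$(printf '%s\n' "$summary" | awk '/^Failed/ { print $4 }')
failing=$(printf '%s\n' "$summary" | awk '/^Failures:/ { $1=""; print }' | xargs)
echo "failed=$failed/$total tests: $failing"
```

The failure list names the exact tests to read next, which keeps the agent's context focused on one behavioral gap at a time.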

25.18.1.3 Packetdrill

| Property | Value |
|---|---|
| Repository | github.com/google/packetdrill |
| Test count | 300+ open-source scripts; 2,000+ at Google (being incrementally open-sourced); 66 in mainline tools/testing/selftests/net/packetdrill/ as of Linux 6.12 |
| Interface | Userspace: socket API (socket/bind/listen/accept/send/recv) + injected/verified packets via TUN device |
| Protocols | TCP, UDP, ICMP over IPv4 and IPv6 |
| UmkaOS chapters | Section 16.3 (sockets), Section 16.9 (congestion), Section 16.15 (kTLS), Section 16.11 (MPTCP) |

Packetdrill is a scriptable network stack testing tool developed by Google. Each script specifies a precise sequence of socket syscalls interleaved with timestamped packet injections/expectations — testing the full TCP state machine from userspace to wire. Google's internal suite of 2,000+ scripts covers congestion control, loss recovery, buffer management, ECN, SACK, Fast Open, Tail Loss Probe, Path MTU discovery, and more.

Key advantage for agentic development: Packetdrill scripts are machine-readable behavioral specifications. Each script encodes:
- Exact syscall sequence with expected return values
- Exact packet sequence with expected header values (tcpdump-like syntax)
- Timing constraints (±tolerance for jitter)
- Internal TCP state assertions (via TCP_INFO getsockopt)

This is the richest available test suite for TCP implementation correctness. An agent implementing UmkaOS TCP can use packetdrill scripts as the primary feedback signal, iterating in seconds per script.
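For illustration, a minimal script in packetdrill's syntax: a passive-open three-way handshake. This is a hand-written sketch of the script format described above, not a script from the suite, and the field values (window sizes, TCP options) are illustrative:

```
// Passive open: socket syscalls on the timeline, injected (<) and
// expected (>) packets in tcpdump-like syntax.
0    socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0   setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0   bind(3, ..., ...) = 0
+0   listen(3, 1) = 0

// Inject a SYN; the stack under test must answer with SYN+ACK.
+0   < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0   > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.1  < . 1:1(0) ack 1 win 257
+0   accept(3, ..., ...) = 4
```

A failing run reports the first point where observed behavior diverged from the script — exactly the expected-vs-actual signal the agentic loop needs.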

25.18.1.4 BPF Selftests

| Property | Value |
|---|---|
| Location | tools/testing/selftests/bpf/ in Linux source tree |
| Test count | ~1,000+ test cases |
| Interface | Userspace: bpf() syscall, perf_event_open, netlink |
| UmkaOS chapters | Section 19.1 (§eBPF subsystem) |

Tests eBPF verifier correctness, map operations, program types, BTF (BPF Type Format), tracing hooks, networking hooks, and the full bpf() syscall interface. Already identified as a key agentic accelerator.

25.18.1.5 liburing Tests

| Property | Value |
|---|---|
| Repository | github.com/axboe/liburing (test/ directory) |
| Test count | ~200+ test cases |
| Interface | Userspace: io_uring_setup/io_uring_enter syscalls |
| UmkaOS chapters | Section 19.1 (§io_uring subsystem) |

Tests io_uring submission/completion ring semantics: read/write/poll/timeout/link/cancel, fixed files, multishot, provided buffers, and registered buffers. The liburing test suite is the de facto compliance test for io_uring implementations.

25.18.1.6 KVM Unit Tests

| Property | Value |
|---|---|
| Repository | github.com/kvm-unit-tests/kvm-unit-tests |
| Test count | Hundreds of tests per architecture |
| Architectures | x86_64, arm64, s390x, ppc64/ppc64le, riscv64 |
| Interface | KVM ioctl interface (/dev/kvm) |
| UmkaOS chapters | Section 18.1 (KVM), Section 18.4 (migration) |

Each test is a tiny guest OS that exercises specific KVM features: VM-exit handling, interrupt injection, nested virtualization, memory mapping, timer emulation. Tests run as userspace programs that interact with /dev/kvm — they test the hypervisor's ioctl interface as a black box. Multi-architecture coverage aligns perfectly with UmkaOS's 8-arch support matrix.

25.18.1.7 IGT GPU Tools

Property Value
Repository gitlab.freedesktop.org/drm/igt-gpu-tools
Test count 2,228+ subtests (50 KMS test binaries alone). ~6M subtests executed per week across ~130 test machines
Interface Userspace: DRM ioctls (/dev/dri/*)
UmkaOS chapters Section 21.4 (§DRM/KMS in user-io), Section 22.1 (accelerators)

IGT tests DRM/KMS kernel APIs: mode setting, planes, flips, atomic modesetting, vblanks, color management, rotation, cursor. Vendor-agnostic tests cover the core DRM subsystem; vendor-specific directories cover Intel, AMD, and vc4/v3d. Tests operate via DRM ioctls — pure userspace.

Note: Originally "Intel GPU Tools" but now vendor-agnostic. VKMS (Virtual KMS) support means tests can run without physical GPU hardware in QEMU — ideal for agentic CI.

25.18.1.8 blktests

| Property | Value |
|---|---|
| Repository | github.com/osandov/blktests |
| Test count | ~200+ tests across test groups |
| Test groups | block (generic block layer), loop (loop devices), nvme (NVMe), scsi (SCSI), dm (device-mapper), nbd (network block device), zbd (zoned block devices), thp (transparent huge pages interaction) |
| Interface | Userspace: block device ioctls, sysfs, procfs |
| UmkaOS chapters | Section 15.2 (block I/O), Section 15.19 (NVMe), Section 15.18 (I/O scheduling) |

Complements xfstests: while xfstests tests filesystem semantics, blktests tests the block layer below. Full suite runs in ~1 day (vs 5-6 days for xfstests).

25.18.2 Tier 2: Kselftest Subdirectories

tools/testing/selftests/ in the Linux source tree contains 100+ subdirectories, each a mini test suite for a specific subsystem. Kselftest runs as userspace programs. The following subdirectories have the highest value for UmkaOS agentic development:

| Subdirectory | Approx Tests | UmkaOS Chapter | What It Validates |
|---|---|---|---|
| net/ | 100+ scripts | Section 16.2 | Routing, VLANs, tunnels, namespaces, policy routing, GRO/GSO, TCP, UDP, MPTCP |
| net/mptcp/ | 20+ | Section 16.11 | MPTCP subflows, path management, fallback to TCP |
| net/netfilter/ | 20+ | Section 16.18 | nftables, conntrack, NAT, packet filtering rules |
| mm/ | 50+ | Section 4.8, Section 4.7 | mmap, mprotect, madvise, userfaultfd, THP, KSM, NUMA balancing, mremap |
| cgroup/ | 30+ | Section 17.2 | Cgroup v2 hierarchy, memory controller, CPU controller, freezer |
| seccomp/ | 20+ | Section 10.3 | seccomp-BPF filter installation, SECCOMP_RET_* actions, TSYNC |
| capabilities/ | 10+ | Section 9.1, Section 9.9 | Capability inheritance, ambient caps, no_new_privs |
| futex/ | 15+ | Section 19.1 (§futex) | futex ops: wait/wake/waitv, PI futexes, robust lists |
| ptrace/ | 20+ | Section 20.4 | ptrace attach/detach, PEEKUSER, register read/write, syscall tracing |
| ipc/ | 10+ | Section 17.3 | POSIX IPC: semaphores, shared memory, message queues. Achieves 73% line coverage on Linux — small subsystem, extremely well tested |
| io_uring/ | 30+ | Section 19.1 (§io_uring) | io_uring submission/completion, various op types |
| timers/ | 15+ | Section 7.1 (§timekeeping) | POSIX timers, timerfd, clock_gettime, nanosleep precision |
| clone3/ | 10+ | Section 8.1 | clone3() flags, pidfd, CLONE_INTO_CGROUP |
| pidfd/ | 10+ | Section 8.1 | pidfd_open, pidfd_send_signal, pidfd_getfd |
| mount/ | 10+ | Section 14.1 (§mount) | New mount API: fsopen/fsmount/move_mount |
| filesystems/ | 15+ | Section 14.1, Section 14.13 | statx, fanotify/inotify, overlayfs |
| landlock/ | 15+ | Section 9.8 | Landlock LSM access rules |
| perf_events/ | 10+ | Section 20.8 | perf_event_open, PMU counters, sampling modes |
| kvm/ | 20+ | Section 18.1 | KVM ioctls, vCPU creation, memory regions |

Agentic kselftest workflow:

For each subsystem (e.g., "cgroup"):
  1. Agent reads selftests/cgroup/ tests
  2. Agent reads UmkaOS spec ([Section 17.2](17-containers.md#control-groups))
  3. Agent implements the cgroup v2 interface
  4. Agent runs: make -C tools/testing/selftests TARGETS=cgroup run_tests
  5. Failures → fix → goto 3
  6. Pass → commit, move to next subsystem
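Kselftest emits TAP (Test Anything Protocol) output, so the pass/fail signal in step 5 is trivially machine-readable. A sketch using an illustrative kselftest-style TAP fragment:

```shell
# Count passing and failing kselftest cases from TAP output.
tap='TAP version 13
1..3
ok 1 selftests: cgroup: test_core
not ok 2 selftests: cgroup: test_memcontrol
ok 3 selftests: cgroup: test_freezer'
pass=$(printf '%s\n' "$tap" | grep -c '^ok ')
fail=$(printf '%s\n' "$tap" | grep -c '^not ok ')
echo "cgroup: $pass passing, $fail failing"
```

The "not ok" lines name the exact failing test binaries, giving the agent its next work items without any log interpretation.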

25.18.3 Tier 3: Specialized External Suites

| Test Suite | Repository | UmkaOS Chapter | Notes |
|---|---|---|---|
| nftables test suite | nftables.org + selftests/net/netfilter/ | Section 16.18 | Packet filtering rules, sets, maps, chains, conntrack |
| iproute2 tests | iproute2 source tree | Section 16.17 | Validates netlink interface: route/rule/neigh/bridge/tc commands |
| FRRouting Topotests | github.com/FRRouting/frr | Section 16.6 | Routing protocol compliance (BGP, OSPF) via netlink/routing socket |
| perf test (built-in) | perf test command (~80 subtests) | Section 20.8, Section 20.2 | PMU events, tracepoints, symbol resolution, dwarf unwinding |
| syzkaller | github.com/google/syzkaller | All chapters | Fuzz-driven syscall testing. Not deterministic but finds crashes/hangs LTP misses. Agent reads crash reports as bug specs |
| Docker/K8s test suites | Docker CE tests, K8s e2e conformance | All chapters | Application-level validation: "does docker run nginx work?" is the ultimate compat test |

25.18.4 Coverage Map: UmkaOS Chapters × Available Test Suites

The following table shows which test suites provide coverage for each major UmkaOS subsystem area, and the expected value for test-driven agentic development:

| Subsystem Area | Primary Suite | Supporting Suites | Agentic Value |
|---|---|---|---|
| Syscall dispatch | LTP | kselftest (clone3, pidfd) | Very high |
| Memory management | kselftest/mm | LTP (mm tests) | High |
| Process lifecycle | LTP (fork/exec/signals) | kselftest (clone3, pidfd) | High |
| Capabilities + credentials | LTP, kselftest/capabilities | kselftest/landlock | High |
| seccomp-BPF | kselftest/seccomp | — | High |
| VFS + filesystems | xfstests (~1,500 tests) | kselftest/mount, kselftest/filesystems | Very high |
| Block I/O | blktests | xfstests (I/O path tests) | Very high |
| TCP/IP networking | packetdrill (2,000+) | kselftest/net | Very high |
| eBPF | BPF selftests (~1,000) | — | Very high |
| io_uring | liburing tests (~200) | kselftest/io_uring | Very high |
| Cgroups v2 | kselftest/cgroup | LTP (cgroup tests) | High |
| Namespaces | LTP | kselftest/net | High |
| IPC (SysV + POSIX) | LTP | kselftest/ipc | High (73% coverage) |
| KVM hypervisor | kvm-unit-tests | kselftest/kvm | High |
| DRM/KMS display | IGT GPU Tools (2,228+) | — | High |
| Packet filtering | kselftest/net/netfilter | nftables test suite | Medium-high |
| Perf events + PMU | kselftest/perf_events | perf test | Medium |
| ptrace + debugging | kselftest/ptrace | LTP | Medium |
| Timers + clocks | kselftest/timers | LTP | Medium |
| Audio (ALSA) | alsa-utils (bat, speaker-test) | — | Low (thin tests) |
| Scheduler | — | — | Low (not directly testable via API) |

25.18.5 The Test-Driven Agentic Development Pattern

All test suites in this inventory share a common property that makes them uniquely valuable for agentic development: they test via the userspace API boundary, which is exactly the external ABI contract UmkaOS must implement. The pattern is:

For each subsystem with an available test suite:
  1. Agent reads the test suite → derives the behavioral contract
  2. Agent reads the UmkaOS spec → understands the intended design
  3. Agent implements the subsystem
  4. Agent runs the test suite as the feedback signal
  5. Test failures provide specific, actionable error descriptions
     (expected vs actual values, specific syscall, specific edge case)
  6. Agent fixes → re-runs → iterates until pass
  7. Full regression after each subsystem completes

Why this works for agents but not (as well) for humans:
- Agents can run thousands of test iterations per day (compile + test in minutes)
- Agents can read test source code to understand the exact behavioral expectation
- Agents do not get fatigued by repetitive fix → test → fix cycles
- The test suite provides an unambiguous progress metric (N/M tests passing)
- Cross-architecture testing (8 arches × QEMU) is parallelizable
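The progress metric in particular makes iteration verifiable: an agent (or its supervisor) can require that each commit strictly increases the pass rate. A minimal sketch, with illustrative before/after counts (not measured results):

```shell
# Pass-rate comparison between two suite runs, with a simple regression gate.
rate() { awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f", 100 * p / t }'; }
before=$(rate 4200 5000)   # e.g. tests passing before a fix batch
after=$(rate 4750 5000)    # tests passing after the fix batch
echo "pass rate: ${before}% -> ${after}%"
# Gate: succeed only if the rate strictly improved.
awk -v b="$before" -v a="$after" 'BEGIN { exit !(a > b) }' && echo "progress: ok"
```

Wiring this gate into CI turns "iterate until pass" from a convention into an enforced invariant.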

Phase integration (see Section 24.2):
- Phase 2 (Self-hosting): LTP core syscalls, basic kselftest/net, basic xfstests/generic
- Phase 3 (Real workloads): Full LTP, full xfstests, packetdrill TCP suite, BPF selftests, blktests, liburing tests
- Phase 4 (Production): Full kvm-unit-tests, IGT GPU Tools, syzkaller soak, Docker/K8s e2e
- Phase 5 (Ecosystem): nftables suite, FRR topotests, perf test, vendor-specific IGT

Cross-references:
- LTP agentic workflow: Section 25.17
- Phase exit criteria: Section 24.2
- Verification strategy: Section 24.3
- QEMU CPU testing matrix: Section 25.8