UmkaOS Architecture — Review Primer (SPR)
Purpose: Sparse Priming Representation for LLM reviewers. Load this before reviewing any architecture section — even when reviewing a single file, you must understand the whole. ~2500 tokens. Not intended for human consumption.
What UmkaOS Is
UmkaOS is a production OS kernel written in Rust, designed as a drop-in Linux replacement for 2026-era computing infrastructure. Unmodified glibc, systemd, Docker, K8s, and KVM run without recompilation. It is neither a microkernel (too slow) nor a Linux fork (it is a clean design).
Primary premise: a modern server is already a multikernel cluster. A single server contains a DPU/SmartNIC (16 ARM cores, own OS), NVMe SSD (embedded RTOS), GPU (own memory manager), CXL expanders — Linux models these as dumb peripherals. UmkaOS models them as peer kernels.
Core capabilities in order of novelty: multikernel fabric (devices as peers) → distributed-native kernel primitives (DSM, DLM, RDMA-native) → heterogeneous compute (GPU/NPU/CXL unified via AccelBase) → driver isolation for crash containment → Rust memory safety → ~99.99% Linux ABI compatibility.
Supported architectures (all boot and run): x86-64, AArch64, ARMv7, RISC-V 64, PPC32, PPC64LE.
Implementation Context (Critical)
UmkaOS is designed and implemented by AI agents (agentic programming). Agents implement directly from architecture specs. This has consequences for review:
- Vague specs produce wrong code. "Efficient algorithm", "suitable data structure", "TBD", "will be designed later" = the agent guesses, usually incorrectly.
- Every struct must have every field. Every algorithm must have every step. Every error path must be named. Reviewers must flag any section an agent cannot implement without reading another section not linked from it.
- No LOC estimates or time estimates in architecture docs — they become stale and waste review cycles on contested numbers.
- Explicit cross-references are bidirectional. If Ch 9 depends on Ch 3, Ch 3 must say what it exports to Ch 9. Agents implement sections in isolation.
Design Principles (Apply Everywhere)
UmkaOS Inside, Linux Outside
- External ABI = Linux-compatible exactly: syscall numbers, ioctl values, procfs/sysfs formats, binary structs exposed to userspace (ALSA event types, netlink layouts, stat). If Linux exposes it, UmkaOS must expose it identically or be binary-incompatible.
- Internal implementation = design for correctness, not Linux imitation. Linux is the reference for external compat constraints and edge cases, not an internal design template. Where Linux has known flaws, UmkaOS does it correctly from the start:
  - rcu_head intrusive list → RcuCallbackRing (typed pre-allocated ring, no linked list)
  - on_rq: bool → OnRqState enum (captures deferred/migrating states)
  - cong_ops: Box<dyn> → &'static dyn CongOps + CongPriv (no allocation on hot path)
  - MCE log entry 48 bytes → MceLogEntry align(64) (no cache line split)
  - VecDeque in kernel hot paths → fixed-size ring buffers or pre-allocated arrays
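To illustrate the last substitution, here is a minimal sketch of a fixed-capacity ring that performs no allocation after construction. FixedRing and its methods are hypothetical names chosen for illustration, not taken from the UmkaOS specs, and the sketch is single-threaded.

```rust
/// Fixed-capacity FIFO ring. Capacity `N` is part of the type, so the
/// storage is a plain array and push/pop never touch the heap.
pub struct FixedRing<T, const N: usize> {
    slots: [Option<T>; N],
    head: usize, // index of the oldest element
    len: usize,  // number of occupied slots
}

impl<T, const N: usize> FixedRing<T, N> {
    pub fn new() -> Self {
        Self {
            slots: std::array::from_fn(|_| None),
            head: 0,
            len: 0,
        }
    }

    /// Appends `value`, or hands it back if the ring is full.
    pub fn push(&mut self, value: T) -> Result<(), T> {
        if self.len == N {
            return Err(value);
        }
        let tail = (self.head + self.len) % N;
        self.slots[tail] = Some(value);
        self.len += 1;
        Ok(())
    }

    /// Removes and returns the oldest element, if any.
    pub fn pop(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        let value = self.slots[self.head].take();
        self.head = (self.head + 1) % N;
        self.len -= 1;
        value
    }
}
```

Returning the rejected value from push makes the full-ring case an explicit error path instead of a silent reallocation, which is the property VecDeque cannot offer.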
No Stubs, No "Fix Later"
No stubs, no placeholders, no "this will be replaced in production." If not ready to implement properly, it is a documented future milestone — not a half-baked struct. This is a production kernel. Every data structure must be implementable as written.
Runtime Discovery, Never Hardcoded
Memory size, CPU count, NUMA topology, device presence — all discovered at boot from
firmware (ACPI, DTB, SMBIOS). No const MAX_CPUS: usize = 256. No if x86 in generic code.
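A minimal sketch of what this rule implies for per-CPU storage: the table is sized once from a discovered CPU count rather than from a compile-time constant. BootTopology and PerCpuTable are hypothetical names and shapes, assumed here only for illustration.

```rust
/// Facts reported by firmware at boot (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BootTopology {
    pub cpu_count: usize,
    pub numa_nodes: usize,
    pub memory_bytes: u64,
}

/// Per-CPU table whose length comes from the discovered topology.
pub struct PerCpuTable<T> {
    slots: Box<[T]>,
}

impl<T: Default> PerCpuTable<T> {
    /// Allocates exactly one slot per discovered CPU. Intended to run
    /// once during boot, never on a hot path.
    pub fn from_topology(topo: &BootTopology) -> Self {
        let slots: Box<[T]> = (0..topo.cpu_count).map(|_| T::default()).collect();
        Self { slots }
    }
}

impl<T> PerCpuTable<T> {
    /// Returns the slot for `cpu`, or None if the CPU does not exist.
    pub fn get(&self, cpu: usize) -> Option<&T> {
        self.slots.get(cpu)
    }

    pub fn len(&self) -> usize {
        self.slots.len()
    }
}
```

With this shape a 384-CPU machine works without a rebuild, which a const MAX_CPUS of 256 would not allow.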
Architecture Abstraction
arch::current:: is a compile-time type alias (pub use $arch as current), not runtime
dispatch. arch::current::cpu::halt() compiles to a direct call. Hardware-specific concepts
(MPK, GIC, PKRU, LAPIC) must not appear in generic code — use abstract names (isolation_domain,
interrupt_controller). Per-arch code lives exclusively in arch/*/.
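The alias pattern can be sketched as follows. The module contents, the fallback module, and the function names are assumptions made for illustration. In a real tree each per-arch module would itself be gated by cfg, whereas here all of them compile so the example builds on any host.

```rust
#[allow(dead_code)]
mod arch {
    pub mod x86_64 {
        pub mod cpu {
            /// Per-arch detail: lives only under arch/.
            pub fn isolation_mechanism() -> &'static str {
                "MPK"
            }
        }
    }

    pub mod aarch64 {
        pub mod cpu {
            pub fn isolation_mechanism() -> &'static str {
                "POE"
            }
        }
    }

    /// Hypothetical stand-in for every other target in this sketch.
    pub mod fallback {
        pub mod cpu {
            pub fn isolation_mechanism() -> &'static str {
                "page-table"
            }
        }
    }

    // Exactly one alias is compiled in; there is no runtime dispatch.
    #[cfg(target_arch = "x86_64")]
    pub use self::x86_64 as current;
    #[cfg(target_arch = "aarch64")]
    pub use self::aarch64 as current;
    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    pub use self::fallback as current;
}

/// Generic code names only the abstract concept (isolation domain) and
/// resolves to a direct call into the selected arch module.
pub fn isolation_domain_mechanism() -> &'static str {
    arch::current::cpu::isolation_mechanism()
}
```

Because current is resolved by the compiler, the call in isolation_domain_mechanism is an ordinary static call that can be inlined.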
Chapter Map (24 Chapters)
When reviewing any section, this is the full surrounding architecture:
| Ch | File | Domain | Key exports |
|---|---|---|---|
| 1 | 01-overview | Goals, perf budget, design choices | — |
| 2 | 02-boot-hardware | Boot chain, ACPI/DT, multi-arch, HW memory safety | Boot facts for Ch 4, 6 |
| 3 | 03-concurrency | Locking, RCU, CpuLocal, PerCpu, ring buffers | Used by everything |
| 4 | 04-memory | Physical alloc, VMM, slab, NUMA, compression, page cache | Used by Ch 6, 10, 13, 21 |
| 5 | 05-distributed | RDMA, DSM (MOESI), DLM, cluster, SmartNIC/DPU | Used by Ch 14, 15 |
| 6 | 06-scheduling | EEVDF, RT, deadline, CBS, EAS, power, timekeeping | Used by Ch 7, 21 |
| 7 | 07-process | Task/Process, fork/exec/exit, signals, FdTable, rlimits | Used by Ch 8, 16, 18 |
| 8 | 08-security | Caps, creds, LSM, verified boot, TPM, IMA, PQC, CC | Used by Ch 9, 10, 11, 16 |
| 9 | 09-security-extensions | Kernel Crypto API, key retention, seccomp-BPF, ARM MTE, DebugCap | Depends on Ch 8 |
| 10 | 10-drivers | Three-tier model, isolation mechs, device registry, I/O, IPC | Used by Ch 11, 12 |
| 11 | 11-kabi | KABI IDL, vtables, driver signing, multi-transport manifest | Used by all drivers |
| 12 | 12-device-classes | NIC, GPU, WiFi, BT, camera, watchdog, SPI, MTD, SoundWire | Depends on Ch 10, 11 |
| 13 | 13-vfs | VFS, dentry cache, mount tree, overlayfs, pipes, quotas | Used by Ch 14 |
| 14 | 14-storage | Block I/O, dm-*, LVM, NVMe-oF, ext4/XFS/Btrfs, ZFS, NFS, DLM | Depends on Ch 5, 13 |
| 15 | 15-networking | Socket, NetBuf, TCP, congestion, kTLS, tunnels, IPsec, IPVS | Depends on Ch 5 |
| 16 | 16-containers | Namespaces (8 types), cgroups v2, POSIX IPC | Depends on Ch 7, 8 |
| 17 | 17-virtualization | KVM, VMX/VHE/H-ext, live migration, PV, VFIO | Depends on Ch 10 |
| 18 | 18-compat | Syscall layer, io_uring, futex, netlink, WEA, native syscalls | Depends on Ch 7, 8 |
| 19 | 19-observability | FMA, tracepoints, ptrace, umkafs namespace, EDAC, perf | Depends on Ch 3, 4 |
| 20 | 20-user-io | TTY/PTY, evdev, ALSA, DRM/KMS | Depends on Ch 10 |
| 21 | 21-accelerators | AccelBase KABI, GPU mem, CBS sched, inference | Depends on Ch 4, 10 |
| 22 | 22-ml-policy | AI/ML Policy Framework, closed-loop kernel intelligence | Depends on Ch 21 |
| 23 | 23-roadmap | Phases, verification, risks, formal verification | Meta |
| 24 | 24-agentic | Agentic dev methodology, phase timelines | Meta |
Three-Tier Driver Isolation Model
Isolation is a crash-containment mechanism, not UmkaOS's identity. It is what makes the KABI model viable — drivers can crash and reload without taking the system down.
| Tier | Ring | Mechanism | Switch cost | Examples | Crash effect |
|---|---|---|---|---|---|
| 0 | 0 | In-kernel, no boundary | 0 cycles | APIC, timer, early serial | Kernel panic |
| 1 | 0 | MPK/POE/DACR hardware domain | ~23-80 cycles | NVMe, NIC, TCP, FS, GPU, KVM | Driver reload ~50-150ms |
| 2 | 3 | Process + IOMMU | ~200-500 cycles | USB, audio, BT, input | Process restart ~10ms |
| M | n/a | Separate UmkaOS kernel on device | PCIe/CXL latency | BlueField DPU, SoC storage | Peer rejoin, host unaffected |
Tier 1 is deliberately NOT exploitation-resistant. WRPKRU is unprivileged on x86 — a compromised driver can change its own PKRU. This is documented and intentional. Use Tier 2 for untrusted code. Do not flag Tier 1 HMAC key visibility as a security gap.
Per-architecture isolation mechanisms: x86-64: MPK (WRPKRU, ~23 cycles) | AArch64: POE (MSR POR_EL0, ~40-80 cycles, ARMv8.9+) or page table + ASID fallback (~150-300 cycles) | ARMv7: DACR (MCR p15, ~10-20 cycles) | RISC-V: page table only (~200-500 cycles, no fast mechanism — Tier 0/2 fallback) | PPC32: segment registers (mtsr, ~10-30 cycles) | PPC64LE POWER9+: Radix PID (~30-60 cycles).
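The mechanism list above can be read as a lookup plus a placement rule. The sketch below encodes the listed cycle ranges and a deliberately simplified policy (untrusted code goes to Tier 2, trusted code goes to Tier 1 unless only page tables are available). The enum names and place_driver are hypothetical, and the real policy is operator-configured and adaptive (Section 10.2.7), so treat this as an approximation.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IsolationMechanism {
    Mpk,              // x86-64
    Poe,              // AArch64, ARMv8.9+
    PageTableAsid,    // AArch64 fallback
    Dacr,             // ARMv7
    PageTableOnly,    // RISC-V
    SegmentRegisters, // PPC32
    RadixPid,         // PPC64LE POWER9+
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Tier0,
    Tier1,
    Tier2,
}

impl IsolationMechanism {
    /// Approximate domain-switch cost in cycles (low, high), as listed above.
    pub const fn switch_cycles(self) -> (u32, u32) {
        match self {
            IsolationMechanism::Mpk => (23, 23),
            IsolationMechanism::Poe => (40, 80),
            IsolationMechanism::PageTableAsid => (150, 300),
            IsolationMechanism::Dacr => (10, 20),
            IsolationMechanism::PageTableOnly => (200, 500),
            IsolationMechanism::SegmentRegisters => (10, 30),
            IsolationMechanism::RadixPid => (30, 60),
        }
    }

    /// RISC-V is the only listed target without a fast mechanism.
    pub const fn has_fast_domain_switch(self) -> bool {
        !matches!(self, IsolationMechanism::PageTableOnly)
    }
}

/// Simplified placement rule (assumption, not the documented policy).
pub fn place_driver(mech: IsolationMechanism, trusted: bool) -> Tier {
    if !trusted {
        Tier::Tier2
    } else if mech.has_fast_domain_switch() {
        Tier::Tier1
    } else {
        Tier::Tier0
    }
}
```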
KVM: Tier 1 with extended hardware privileges (KvmHardwareCapability). Domain-isolated
like all Tier 1, plus authority to execute VMX/VHE/H-extension trampoline (~200 lines
verified assembly). Crash recovery enabled. Not a security boundary for guest VMs.
See Section 18.1.4.5 for classification rationale.
KABI — Stable Driver ABI (Chapter 11)
- Defined in .kabi IDL files, compiled by umka-kabi-gen → C + Rust bindings.
- KabiDriverManifest embedded in ELF .kabi_manifest section: transport_mask: u8 (bit0=Tier0/Direct, bit1=Tier1/Ring, bit2=Tier2/IPC), three optional entry functions.
- Default policy: every driver binary ships all 3 transports. Tier change = operator config update, no recompilation. minimum_tier/maximum_tier in .kabi to opt out.
- Vtables: append-only, C-compatible, vtable_size field for runtime version detection.
- Support window: 5 major releases per KABI version.
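A minimal sketch of how the transport_mask bits and the vtable_size check could be consumed. Only transport_mask: u8, its bit assignments, and the vtable_size concept come from the text above. The Transport enum, the helper names, and the assumption that each vtable entry is one pointer wide are illustrative.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Transport {
    Direct, // Tier 0
    Ring,   // Tier 1
    Ipc,    // Tier 2
}

impl Transport {
    /// Bit position in transport_mask, as listed above.
    pub const fn bit(self) -> u8 {
        match self {
            Transport::Direct => 1 << 0,
            Transport::Ring => 1 << 1,
            Transport::Ipc => 1 << 2,
        }
    }
}

/// Reduced manifest: the real structure also carries the entry functions.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct KabiDriverManifest {
    pub transport_mask: u8,
}

impl KabiDriverManifest {
    /// True if the driver binary ships the given transport.
    pub const fn supports(&self, transport: Transport) -> bool {
        self.transport_mask & transport.bit() != 0
    }
}

/// Append-only vtables: an entry exists only if it lies entirely within
/// the size the driver reported. Entry width is assumed to be one pointer.
pub fn vtable_has_entry(vtable_size: usize, entry_offset: usize) -> bool {
    let entry_size = std::mem::size_of::<usize>();
    match entry_offset.checked_add(entry_size) {
        Some(end) => end <= vtable_size,
        None => false,
    }
}
```

A caller built against a newer KABI version would check vtable_has_entry before invoking a method that an older driver may not provide.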
Concurrency Model (Chapter 3)
- CpuLocal: register-based (~1-10 cycles). x86-64 GS, AArch64 TPIDR_EL1, ARMv7 TPIDRPRW, PPC64 r13, RISC-V sscratch. For ~10 hottest fields: current_task, runqueue, preempt_count, slab_magazines, rcu_nesting, rcu_passed_quiescent, isolation_shadow, napi_budget, cpu_id.
- PerCpu\<T>: get() requires &PreemptGuard. get_mut() requires &mut PreemptGuard and IRQs disabled. get_mut_nosave() takes &IrqDisabledGuard, skips IRQ save/restore. Debug-only borrow-state CAS (release builds: zero overhead).
- RCU: deferred quiescent state. RcuReadGuard::drop sets CpuLocal flag, tick/switch reports. rcu_call() for deferred free. Use for: caps, dentries, page cache, KRL reads.
- Lock hierarchy: compile-time Lock<T, LEVEL>. TASK_LOCK < RQ_LOCK < PI_LOCK. Violations are compile errors, not deadlocks. Work stealing: always acquire in CPU-ID order.
- Ring buffers: ALL cross-domain IPC uses io_uring-style shared-memory rings. Not function calls. Core↔Tier 1, Tier 1↔Tier 2: the ring IS the isolation mechanism.
- ValidatedCap: validate-once token amortizing KABI dispatch overhead. Store result, don't re-validate on every call.
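One way to obtain compile-time lock ordering in stable Rust is a const-generic level plus a const assertion evaluated at monomorphization. The sketch below is an assumption about how Lock<T, LEVEL> might work, not the UmkaOS definition: it wraps std::sync::Mutex, the numeric level values are invented, and lock_after is a hypothetical method name. It checks ordering only when the caller passes the guard it already holds, and a plain lock() is not checked.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::{Mutex, MutexGuard};

// Level values are illustrative; only their relative order matters.
pub const TASK_LOCK: u8 = 1;
pub const RQ_LOCK: u8 = 2;
pub const PI_LOCK: u8 = 3;

pub struct Lock<T, const LEVEL: u8> {
    inner: Mutex<T>,
}

pub struct Guard<'a, T, const LEVEL: u8> {
    inner: MutexGuard<'a, T>,
}

/// Evaluating CHECK fails the build when HELD >= NEXT.
struct Ordered<const HELD: u8, const NEXT: u8>;

impl<const HELD: u8, const NEXT: u8> Ordered<HELD, NEXT> {
    const CHECK: () = assert!(HELD < NEXT, "lock hierarchy violation");
}

impl<T, const LEVEL: u8> Lock<T, LEVEL> {
    pub fn new(value: T) -> Self {
        Self { inner: Mutex::new(value) }
    }

    /// Acquires the lock with no ordering check (first lock taken).
    pub fn lock(&self) -> Guard<'_, T, LEVEL> {
        Guard { inner: self.inner.lock().expect("lock poisoned") }
    }

    /// Acquires this lock while `_held` is held. Compiles only if the
    /// held lock's level is strictly lower than this lock's level.
    pub fn lock_after<'a, U, const HELD: u8>(
        &'a self,
        _held: &Guard<'_, U, HELD>,
    ) -> Guard<'a, T, LEVEL> {
        let () = Ordered::<HELD, LEVEL>::CHECK;
        self.lock()
    }
}

impl<T, const LEVEL: u8> Deref for Guard<'_, T, LEVEL> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.inner
    }
}

impl<T, const LEVEL: u8> DerefMut for Guard<'_, T, LEVEL> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.inner
    }
}
```

Taking a Lock<_, { RQ_LOCK }> after a Lock<_, { TASK_LOCK }> guard builds. Reversing the two makes the assertion fail during compilation, so the violation never reaches runtime.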
Data Structure Invariants (Flag Violations)
These apply everywhere. Flag any violation:
| Rule | Correct | Incorrect |
|---|---|---|
| No heap alloc in hot paths | Fixed-size arrays, ring buffers | Vec<T>, HashMap without bound |
| No trait objects in KABI ABI | &'static dyn Trait + PrivData | Box<dyn Trait> |
| No bool for multi-state | enum State { Active, Deferred, Migrating } | on_rq: bool |
| Kernel structs on cache lines | #[repr(C, align(64))] | Struct spanning 2 lines in hot path |
| No intrusive linked lists for callbacks | Pre-allocated ring (RcuCallbackRing) | rcu_head embedded in object |
| Generation counters must handle wrap | SlotState::GenerationExhausted + EOVERFLOW | silent wrap |
| PerCpu fields accessed only under guard | get() with PreemptGuard | bare pointer dereference |
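The generation-counter rule can be sketched as a slot that retires itself instead of wrapping. SlotState::GenerationExhausted and EOVERFLOW come from the table. The Slot type, the Free and Occupied variants, the method names, and the EBUSY path are assumptions for illustration. The errno values are the ones in Linux's generic errno table.

```rust
pub const EBUSY: i32 = 16;
pub const EOVERFLOW: i32 = 75;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SlotState {
    Free,
    Occupied,
    /// The generation counter reached its maximum; the slot is retired.
    GenerationExhausted,
}

#[derive(Debug)]
pub struct Slot {
    generation: u32,
    state: SlotState,
}

impl Slot {
    pub fn with_generation(generation: u32) -> Self {
        Self { generation, state: SlotState::Free }
    }

    pub fn state(&self) -> SlotState {
        self.state
    }

    /// Claims the slot and returns the generation the handle must carry.
    pub fn allocate(&mut self) -> Result<u32, i32> {
        match self.state {
            SlotState::Free => {
                self.state = SlotState::Occupied;
                Ok(self.generation)
            }
            SlotState::Occupied => Err(EBUSY),
            SlotState::GenerationExhausted => Err(EOVERFLOW),
        }
    }

    /// Releases the slot. The generation is bumped so stale handles stop
    /// matching; if the bump would wrap, the slot is retired instead.
    pub fn release(&mut self) {
        if self.state != SlotState::Occupied {
            return;
        }
        match self.generation.checked_add(1) {
            Some(next) => {
                self.generation = next;
                self.state = SlotState::Free;
            }
            None => self.state = SlotState::GenerationExhausted,
        }
    }
}
```

Using checked_add rather than wrapping_add is the whole point: a wrapped generation would let a very old handle alias a new object.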
Security Model
- Capabilities (SystemCaps) are the native security model. Unix UID/GID emulated on top.
- Credential model: TaskCredential (immutable, copy-on-write). Privilege checks: cred.has_cap(CAP_X).
- LSM hooks: pluggable, stackable, mandatory pre-check before any privileged operation.
- Verified boot: UEFI SB → GRUB → kernel sig → initramfs sig. Hybrid Ed25519 + ML-DSA-65.
- PQC: ML-KEM-768 (KEM), ML-DSA-44/65/87 (signatures), SLH-DSA-128f (stateless sig).
- Confidential computing: SEV-SNP, TDX, ARM CCA — all share the same live migration auth model.
- IMA: hash-chained measurement log, appraisal, audit. All binaries measured before exec.
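A minimal sketch of the immutable, copy-on-write credential model and the cred.has_cap(CAP_X) check. The bitset representation of SystemCaps, the capability bit values, the uid field, and the without_cap helper are assumptions. Arc stands in for whatever reference-counted pointer the kernel actually uses.

```rust
use std::sync::Arc;

/// Capability set as a bitmask (representation assumed for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SystemCaps(pub u64);

// Hypothetical bit assignments.
pub const CAP_NET_ADMIN: SystemCaps = SystemCaps(1 << 0);
pub const CAP_SYS_BOOT: SystemCaps = SystemCaps(1 << 1);

/// Never mutated after construction; a change produces a new credential.
#[derive(Clone, Debug)]
pub struct TaskCredential {
    uid: u32,
    caps: SystemCaps,
}

impl TaskCredential {
    pub fn new(uid: u32, caps: SystemCaps) -> Arc<Self> {
        Arc::new(Self { uid, caps })
    }

    pub fn uid(&self) -> u32 {
        self.uid
    }

    /// True only if every bit in `cap` is present.
    pub fn has_cap(&self, cap: SystemCaps) -> bool {
        self.caps.0 & cap.0 == cap.0
    }

    /// Copy-on-write: returns a fresh credential and leaves `self`
    /// untouched, so concurrent readers never observe a partial update.
    pub fn without_cap(&self, cap: SystemCaps) -> Arc<Self> {
        let mut copy = self.clone();
        copy.caps.0 &= !cap.0;
        Arc::new(copy)
    }
}
```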
Performance Budget
Target: <5% overhead vs Linux on macro benchmarks.
| Path | Overhead | Mechanism |
|---|---|---|
| NVMe 4KB read | ~0.25-0.5% | Shadow-elided MPK + doorbell coalescing + ValidatedCap |
| TCP RX (NAPI-64) | ~0.02-0.06%/pkt | NAPI batching amortizes MPK switch |
| Context switch | ~0.1-0.3% | CpuLocal register access eliminates most per-CPU loads |
| Pure compute | 0% | Never leaves UmkaOS Core domain |
| Cumulative nginx-class | ~0.8% | ~4.2% headroom to target |
Seven mandatory optimizations (all from day one, none deferred): CpuLocal register access, PerCpu debug-only CAS, IrqDisabledGuard elision, RCU deferred quiescent, isolation shadow elision, ValidatedCap amortization, doorbell coalescing for NVMe/virtio.
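The amortization argument behind the table can be written as simple arithmetic. The functions below are a sketch. The number of domain switches per batch and the per-packet baseline cost are inputs the reader must supply, since this primer does not state them.

```rust
/// Extra cycles per item when a fixed switch cost is spread over a batch.
pub fn amortized_switch_cycles(
    switch_cycles: f64,
    switches_per_batch: f64,
    batch_size: f64,
) -> f64 {
    switch_cycles * switches_per_batch / batch_size
}

/// Overhead as a percentage of the baseline per-item cost.
pub fn overhead_percent(extra_cycles: f64, baseline_cycles: f64) -> f64 {
    100.0 * extra_cycles / baseline_cycles
}

/// Remaining budget against the macro-benchmark target, in percentage points.
pub fn headroom_percent(target_percent: f64, cumulative_percent: f64) -> f64 {
    target_percent - cumulative_percent
}
```

For example, assuming one entry and one exit switch per NAPI batch of 64 at ~23 cycles each, amortized_switch_cycles(23.0, 2.0, 64.0) gives about 0.72 cycles per packet. The headroom row follows the same way: headroom_percent(5.0, 0.8) is about 4.2.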
Linux Compatibility (Chapter 18)
- ~330-350 of ~450 x86-64 syscalls implemented natively. Not wrappers.
- The SLAT entry: untyped C ABI (int fd, void* buf) → typed Rust (CapHandle, UserPtr<u8>).
- umka_syscall(op, args, size) multiplexed entry for UmkaOS-native extensions (Section 18.6).
- All 8 namespace types: mnt, pid, net, ipc, uts, user, cgroup, time.
- cgroups v2 fully specified; v1 compatibility shim for legacy containers.
- io_uring full impl, 64 signals, futex, eBPF (verifier + JIT x86-64, interpreted others).
- Deliberately dropped: binary .ko (KABI replaces it), ia32 multilib, /dev/mem, /dev/kmem, ioperm/iopl, general-purpose kexec. Do not flag these as missing.
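The untyped-to-typed translation at the syscall boundary might look like the sketch below. CapHandle and UserPtr<u8> are named in the list above, but their fields, the translate_read_args function, and the USER_ADDR_END boundary are assumptions. A real implementation would resolve the fd through the task's FdTable rather than wrapping the integer.

```rust
use std::marker::PhantomData;

pub const EBADF: i32 = 9;
pub const EFAULT: i32 = 14;

/// Hypothetical upper bound of the user address range.
const USER_ADDR_END: usize = usize::MAX / 2;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapHandle(pub u32);

/// A validated user-space range; it can only be built by the checks below.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct UserPtr<T> {
    addr: usize,
    len: usize,
    _elem: PhantomData<T>,
}

impl<T> UserPtr<T> {
    pub fn addr(&self) -> usize {
        self.addr
    }

    pub fn len(&self) -> usize {
        self.len
    }
}

/// Converts the raw (int fd, void* buf, size_t len) triple into typed
/// values, rejecting malformed input before any handler sees it.
pub fn translate_read_args(
    fd: i32,
    buf: usize,
    len: usize,
) -> Result<(CapHandle, UserPtr<u8>), i32> {
    if fd < 0 {
        return Err(EBADF);
    }
    // Reject null buffers, address overflow, and ranges past user space.
    let end = buf.checked_add(len).ok_or(EFAULT)?;
    if buf == 0 || end > USER_ADDR_END {
        return Err(EFAULT);
    }
    Ok((
        CapHandle(fd as u32),
        UserPtr { addr: buf, len, _elem: PhantomData },
    ))
}
```

After this point handlers receive only typed values, so a raw integer can no longer be mistaken for a validated pointer.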
Section Dependency Map
Section numbering: Ch.Sec (e.g., Section 10.4 = Chapter 10, Section 4 in 10-drivers.md).
| Section | File | Depends On | Exports To |
|---|---|---|---|
| 3.1 Concurrency | 03-concurrency | — | Everything |
| 4.1 Memory | 04-memory | 3.1, 2.1 (boot) | 4.2 (ZPool), 13.1 (VFS), 21.2 (accel) |
| 5.1 Distributed | 05-distributed | 3.1, 8.1, 4.1 | 5.2 (SmartNIC), 14.6 (DLM) |
| 6.1 Scheduler | 06-scheduling | 3.1, 4.1 | 6.3 (CBS), 7.2 (RT), 21.3 (accel sched) |
| 8.1 Capabilities | 08-security | 3.1 (RCU) | 10.4, 8.2, 16.1 (namespaces) |
| 8.2 Verified Boot | 08-security | 11.1 (KABI), 2.1 | 8.4 (IMA) |
| 9.1-9.5 Security Extensions | 09-security-extensions | 8.1-8.8 | 10.4 (drivers), 11.1 (KABI) |
| 10.1-10.2 Tier Model | 10-drivers | 3.1, 8.1 | 10.4, all tiers |
| 10.4-10.10 Driver Framework | 10-drivers | 10.2, 8.1 | 13.1 (VFS), 21.1 (accel) |
| 11.1 KABI | 11-kabi | 10.4, 8.1 | All driver sections |
| 13.1 VFS | 13-vfs | 10.4, 4.1 | 14.1-14.13 (storage) |
| 16.1 Containers | 16-containers | 7.1, 6.3 | 18.1 (syscalls) |
| 18.1 Syscalls | 18-compat | 8.1, 7.1 | 18.6 (native), 20.1-20.2 (user I/O) |
| 21.1 Accelerators | 21-accelerators | 10.4, 10.6 | 21.4 (inference) |
| 22.1 AI/ML Policy | 22-ml-policy | 21.4, 6.1 | Scheduler, memory, TCP |
Intentional Tradeoffs (Do Not Flag)
- WRPKRU unprivileged on x86 → Tier 1 bypassable by deliberate exploit. By design. Tier 2 for untrusted.
- VTable indirection for KABI → ~2-3 extra memory accesses. Stable ABI worth it.
- No global run queue lock → work stealing acquires remote locks in CPU-ID order. Ordered = no deadlock.
- RISC-V page-table isolation → no fast mechanism exists. Tier 0 for trusted, Tier 2 for untrusted. Accepted.
- TPM latency (5-50ms) → mitigated by async I/O and result caching. Boot overhead ~1-2s accepted.
- CBS budget_remaining_us: AtomicI64 → negative = intentional deficit tracking. Not a sign error.
- seq: u32 in VdsoData → naturally-aligned u32 reads are atomic on all supported archs. Not torn.
- CompressedEntry checksum: u32 CRC32C → 1-in-4B collision = process crash, not kernel corruption. Accepted.
- DSM not sequentially consistent everywhere → strict coherence for shared structs, relaxed for bulk. By design.
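To show why a signed budget is deliberate, here is a sketch of deficit tracking: an overrun drives the counter negative, and the next replenishment pays the debt back before granting new time. Only the field name budget_remaining_us and its AtomicI64 type come from the list above. The CbsBudget type, its methods, and the uncapped replenishment rule are simplifying assumptions.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

pub struct CbsBudget {
    /// Microseconds left in the current period; negative means overrun.
    budget_remaining_us: AtomicI64,
    period_budget_us: i64,
}

impl CbsBudget {
    pub fn new(period_budget_us: i64) -> Self {
        Self {
            budget_remaining_us: AtomicI64::new(period_budget_us),
            period_budget_us,
        }
    }

    /// Charges consumed time and returns the new remaining budget,
    /// which may legitimately be negative.
    pub fn charge(&self, used_us: i64) -> i64 {
        self.budget_remaining_us.fetch_sub(used_us, Ordering::Relaxed) - used_us
    }

    /// Adds one period of budget. Any deficit carries over, so an overrun
    /// is repaid rather than forgotten.
    pub fn replenish(&self) -> i64 {
        self.budget_remaining_us
            .fetch_add(self.period_budget_us, Ordering::Relaxed)
            + self.period_budget_us
    }

    pub fn is_throttled(&self) -> bool {
        self.budget_remaining_us.load(Ordering::Relaxed) <= 0
    }
}
```

An unsigned counter would have to clamp at zero here and would lose the 30 µs overrun in the case of a 130 µs charge against a 100 µs budget.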
Common Reviewer Errors
- Do not flag WRPKRU/MPK as a security vulnerability. Crash containment only. Documented.
- Do not evaluate UmkaOS as primarily an isolation system. Primary novelty: multikernel peer model, distributed primitives, heterogeneous compute. Isolation is one enabling mechanism.
- Do not compare Tier 1 to microkernel security. Tier 1 ≠ security boundary. Tier 2 is.
- Do not assume "device peers" means all devices are peers. Only UmkaOS-kernel devices. Devices with vendor firmware (GPU, NVMe, USB) use traditional Tier 1/2 driver + AccelBase.
- Do not flag RISC-V isolation overhead. Accepted. Adaptive policy documented in Section 10.2.7.
- Do not flag the absence of a GC. UmkaOS uses Rust #![no_std], which has no GC. RCU + reference counting + slab.
- Do not flag Multiboot2 absence. Phase 2 item. Phase 1 target is QEMU (Multiboot1).
- Check the dependency map before flagging missing detail. Detail may exist in a linked chapter.