
UmkaOS Architecture — Review Primer (SPR)

Purpose: Sparse Priming Representation for LLM reviewers. Load this before reviewing any architecture section — even when reviewing a single file, you must understand the whole. ~2500 tokens. Not intended for human consumption.


What UmkaOS Is

UmkaOS is a production OS kernel written in Rust, designed as a drop-in Linux replacement for 2026-era computing infrastructure. Unmodified glibc, systemd, Docker, K8s, and KVM run without recompilation. It is not a microkernel (too slow), not a Linux fork (clean design).

Primary premise: a modern server is already a multikernel cluster. A single server contains a DPU/SmartNIC (16 ARM cores, own OS), NVMe SSD (embedded RTOS), GPU (own memory manager), CXL expanders — Linux models these as dumb peripherals. UmkaOS models them as peer kernels.

Core capabilities in order of novelty: multikernel fabric (devices as peers) → distributed-native kernel primitives (DSM, DLM, RDMA-native) → heterogeneous compute (GPU/NPU/CXL unified via AccelBase) → driver isolation for crash containment → Rust memory safety → ~99.99% Linux ABI compatibility.

Supported architectures (all boot and run): x86-64, AArch64, ARMv7, RISC-V 64, PPC32, PPC64LE.


Implementation Context (Critical)

UmkaOS is designed and implemented by AI agents (agentic programming). Agents implement directly from architecture specs. This has consequences for review:

  • Vague specs produce wrong code. "Efficient algorithm", "suitable data structure", "TBD", "will be designed later" = the agent guesses, usually incorrectly.
  • Every struct must have every field. Every algorithm must have every step. Every error path must be named. Reviewers must flag any section an agent cannot implement without reading another section not linked from it.
  • No LOC estimates or time estimates in architecture docs — they become stale and waste review cycles on contested numbers.
  • Cross-references must be explicit and bidirectional. If Ch 9 depends on Ch 3, Ch 3 must say what it exports to Ch 9. Agents implement sections in isolation.

Design Principles (Apply Everywhere)

UmkaOS Inside, Linux Outside

  • External ABI = Linux-compatible exactly: syscall numbers, ioctl values, procfs/sysfs formats, binary structs exposed to userspace (ALSA event types, netlink layouts, stat). If Linux exposes it, UmkaOS must expose it identically or be binary-incompatible.
  • Internal implementation = design for correctness, not Linux imitation. Linux is the reference for external compat constraints and edge cases, not an internal design template. Where Linux has known flaws, UmkaOS does it correctly from the start:
  • rcu_head intrusive list → RcuCallbackRing (typed pre-allocated ring, no linked list)
  • on_rq: bool → OnRqState enum (captures deferred/migrating states)
  • cong_ops: Box<dyn> → &'static dyn CongOps + CongPriv (no allocation on hot path)
  • MCE log entry 48 bytes → MceLogEntry align(64) (no cache line split)
  • VecDeque in kernel hot paths → fixed-size ring buffers or pre-allocated arrays
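As an illustration of the cong_ops item above, here is a minimal sketch of pairing a `&'static dyn CongOps` with inline `CongPriv` state. The trait methods, the `CongPriv` fields, the `ToyReno` algorithm, and `TcpSock` are all hypothetical names invented for this sketch; only `CongOps` and `CongPriv` appear in the primer, and their real definitions may differ.

```rust
/// Per-connection congestion state stored inline in the socket.
/// Field names are assumptions made for this sketch.
pub struct CongPriv {
    pub cwnd: u32,
    pub ssthresh: u32,
}

/// Stateless algorithm table. Implementations live in statics,
/// so selecting one never allocates.
pub trait CongOps: Sync {
    fn name(&self) -> &'static str;
    fn on_ack(&self, state: &mut CongPriv, acked_segments: u32);
}

/// A deliberately simplified stand-in algorithm, not real Reno.
pub struct ToyReno;

impl CongOps for ToyReno {
    fn name(&self) -> &'static str {
        "toy-reno"
    }

    fn on_ack(&self, state: &mut CongPriv, acked_segments: u32) {
        if state.cwnd < state.ssthresh {
            state.cwnd += acked_segments; // slow-start-like growth
        } else {
            state.cwnd += 1; // simplified linear growth
        }
    }
}

pub static TOY_RENO: ToyReno = ToyReno;

/// Hypothetical socket: a borrowed vtable plus inline private state.
pub struct TcpSock {
    pub cong_ops: &'static dyn CongOps,
    pub cong_priv: CongPriv,
}

impl TcpSock {
    pub fn new(ops: &'static dyn CongOps) -> Self {
        TcpSock {
            cong_ops: ops,
            cong_priv: CongPriv { cwnd: 10, ssthresh: 64 },
        }
    }

    /// Hot path: one indirect call, no heap traffic.
    pub fn ack(&mut self, acked_segments: u32) {
        self.cong_ops.on_ack(&mut self.cong_priv, acked_segments);
    }
}
```

Switching algorithms in this sketch is a pointer store into `cong_ops`, which is the property the `Box<dyn>` replacement is after.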

No Stubs, No "Fix Later"

No stubs, no placeholders, no "this will be replaced in production." If not ready to implement properly, it is a documented future milestone — not a half-baked struct. This is a production kernel. Every data structure must be implementable as written.

Runtime Discovery, Never Hardcoded

Memory size, CPU count, NUMA topology, device presence — all discovered at boot from firmware (ACPI, DTB, SMBIOS). No const MAX_CPUS: usize = 256. No if x86 in generic code.
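A minimal sketch of what this rule can look like in practice, assuming a hypothetical `BootTopology` struct filled in by the boot stage and a hypothetical `PerCpuCounters` consumer. Neither name comes from the primer. The point is that the table is sized once from discovered facts rather than from a compile-time `MAX_CPUS`.

```rust
/// Facts a boot stage would extract from firmware tables.
/// Hypothetical struct for this sketch.
pub struct BootTopology {
    pub cpu_count: usize,
    pub numa_nodes: usize,
}

#[derive(Debug, PartialEq)]
pub struct CpuOutOfRange;

/// Per-CPU counters whose length is fixed at boot, not at compile time.
pub struct PerCpuCounters {
    slots: Box<[u64]>,
}

impl PerCpuCounters {
    /// One allocation at init time, sized from the discovered CPU count.
    pub fn new(topo: &BootTopology) -> Self {
        PerCpuCounters {
            slots: vec![0u64; topo.cpu_count].into_boxed_slice(),
        }
    }

    /// Bounds come from the boot-time length, so an unexpected CPU id
    /// is a named error rather than an out-of-bounds write.
    pub fn add(&mut self, cpu: usize, n: u64) -> Result<(), CpuOutOfRange> {
        match self.slots.get_mut(cpu) {
            Some(slot) => {
                *slot += n;
                Ok(())
            }
            None => Err(CpuOutOfRange),
        }
    }

    pub fn total(&self) -> u64 {
        self.slots.iter().sum()
    }

    pub fn cpus(&self) -> usize {
        self.slots.len()
    }
}
```

The single allocation happens during initialization, so the sketch stays compatible with the rule against heap allocation in hot paths.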

Architecture Abstraction

arch::current:: is a compile-time type alias (pub use $arch as current), not runtime dispatch. arch::current::cpu::halt() compiles to a direct call. Hardware-specific concepts (MPK, GIC, PKRU, LAPIC) must not appear in generic code — use abstract names (isolation_domain, interrupt_controller). Per-arch code lives exclusively in arch/*/.
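The aliasing idea can be sketched with `cfg`-gated modules and a re-export. This is one possible shape, not necessarily how UmkaOS spells `pub use $arch as current`. The `relax()` function and `NAME` constant are hypothetical stand-ins (a real `halt()` is privileged and cannot run in a user-space example), and the `other` fallback module exists only so the sketch compiles on any host.

```rust
pub mod arch {
    #[cfg(target_arch = "x86_64")]
    pub mod x86_64 {
        pub mod cpu {
            pub const NAME: &str = "x86-64";

            #[inline(always)]
            pub fn relax() {
                std::hint::spin_loop();
            }
        }
    }

    #[cfg(target_arch = "aarch64")]
    pub mod aarch64 {
        pub mod cpu {
            pub const NAME: &str = "aarch64";

            #[inline(always)]
            pub fn relax() {
                std::hint::spin_loop();
            }
        }
    }

    // Fallback so the sketch builds on hosts not listed above.
    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    pub mod other {
        pub mod cpu {
            pub const NAME: &str = "other";

            #[inline(always)]
            pub fn relax() {
                std::hint::spin_loop();
            }
        }
    }

    // Exactly one alias is compiled in; callers never branch at runtime.
    #[cfg(target_arch = "x86_64")]
    pub use self::x86_64 as current;
    #[cfg(target_arch = "aarch64")]
    pub use self::aarch64 as current;
    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    pub use self::other as current;
}

/// Generic code names only `arch::current`, never a concrete architecture.
pub fn idle_once() -> &'static str {
    arch::current::cpu::relax();
    arch::current::cpu::NAME
}
```

Because `current` is resolved during name resolution, the call in `idle_once` is an ordinary static call that the compiler is free to inline.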


Chapter Map (22 Chapters + 2 Meta)

When reviewing any section, this is the full surrounding architecture:

| Ch | File | Domain | Key exports |
|----|------|--------|-------------|
| 1 | 01-overview | Goals, perf budget, design choices | |
| 2 | 02-boot-hardware | Boot chain, ACPI/DT, multi-arch, HW memory safety | Boot facts for Ch 4, 6 |
| 3 | 03-concurrency | Locking, RCU, CpuLocal, PerCpu, ring buffers | Used by everything |
| 4 | 04-memory | Physical alloc, VMM, slab, NUMA, compression, page cache | Used by Ch 6, 10, 13, 21 |
| 5 | 05-distributed | RDMA, DSM (MOESI), DLM, cluster, SmartNIC/DPU | Used by Ch 14, 15 |
| 6 | 06-scheduling | EEVDF, RT, deadline, CBS, EAS, power, timekeeping | Used by Ch 7, 21 |
| 7 | 07-process | Task/Process, fork/exec/exit, signals, FdTable, rlimits | Used by Ch 8, 16, 18 |
| 8 | 08-security | Caps, creds, LSM, verified boot, TPM, IMA, PQC, CC | Used by Ch 9, 10, 11, 16 |
| 9 | 09-security-extensions | Kernel Crypto API, key retention, seccomp-BPF, ARM MTE, DebugCap | Depends on Ch 8 |
| 10 | 10-drivers | Three-tier model, isolation mechs, device registry, I/O, IPC | Used by Ch 11, 12 |
| 11 | 11-kabi | KABI IDL, vtables, driver signing, multi-transport manifest | Used by all drivers |
| 12 | 12-device-classes | NIC, GPU, WiFi, BT, camera, watchdog, SPI, MTD, SoundWire | Depends on Ch 10, 11 |
| 13 | 13-vfs | VFS, dentry cache, mount tree, overlayfs, pipes, quotas | Used by Ch 14 |
| 14 | 14-storage | Block I/O, dm-*, LVM, NVMe-oF, ext4/XFS/Btrfs, ZFS, NFS, DLM | Depends on Ch 5, 13 |
| 15 | 15-networking | Socket, NetBuf, TCP, congestion, kTLS, tunnels, IPsec, IPVS | Depends on Ch 5 |
| 16 | 16-containers | Namespaces (8 types), cgroups v2, POSIX IPC | Depends on Ch 7, 8 |
| 17 | 17-virtualization | KVM, VMX/VHE/H-ext, live migration, PV, VFIO | Depends on Ch 10 |
| 18 | 18-compat | Syscall layer, io_uring, futex, netlink, WEA, native syscalls | Depends on Ch 7, 8 |
| 19 | 19-observability | FMA, tracepoints, ptrace, umkafs namespace, EDAC, perf | Depends on Ch 3, 4 |
| 20 | 20-user-io | TTY/PTY, evdev, ALSA, DRM/KMS | Depends on Ch 10 |
| 21 | 21-accelerators | AccelBase KABI, GPU mem, CBS sched, inference | Depends on Ch 4, 10 |
| 22 | 22-ml-policy | AI/ML Policy Framework, closed-loop kernel intelligence | Depends on Ch 21 |
| 23 | 23-roadmap | Phases, verification, risks, formal verification | Meta |
| 24 | 24-agentic | Agentic dev methodology, phase timelines | Meta |

Three-Tier Driver Isolation Model

Isolation is a crash-containment mechanism, not UmkaOS's identity. It is what makes the KABI model viable — drivers can crash and reload without taking the system down.

| Tier | Ring | Mechanism | Switch cost | Examples | Crash effect |
|------|------|-----------|-------------|----------|--------------|
| 0 | 0 | In-kernel, no boundary | 0 cycles | APIC, timer, early serial | Kernel panic |
| 1 | 0 | MPK/POE/DACR hardware domain | ~23-80 cycles | NVMe, NIC, TCP, FS, GPU, KVM | Driver reload ~50-150ms |
| 2 | 3 | Process + IOMMU | ~200-500 cycles | USB, audio, BT, input | Process restart ~10ms |
| M | n/a | Separate UmkaOS kernel on device | PCIe/CXL latency | BlueField DPU, SoC storage | Peer rejoin, host unaffected |

Tier 1 is deliberately NOT exploitation-resistant. WRPKRU is unprivileged on x86 — a compromised driver can change its own PKRU. This is documented and intentional. Use Tier 2 for untrusted code. Do not flag Tier 1 HMAC key visibility as a security gap.

Per-architecture isolation mechanisms: x86-64: MPK (WRPKRU, ~23 cycles) | AArch64: POE (MSR POR_EL0, ~40-80 cycles, ARMv8.9+) or page table + ASID fallback (~150-300 cycles) | ARMv7: DACR (MCR p15, ~10-20 cycles) | RISC-V: page table only (~200-500 cycles, no fast mechanism — Tier 0/2 fallback) | PPC32: segment registers (mtsr, ~10-30 cycles) | PPC64LE POWER9+: Radix PID (~30-60 cycles).

KVM: Tier 1 with extended hardware privileges (KvmHardwareCapability). Domain-isolated like all Tier 1, plus authority to execute VMX/VHE/H-extension trampoline (~200 lines verified assembly). Crash recovery enabled. Not a security boundary for guest VMs. See Section 18.1.4.5 for classification rationale.


KABI — Stable Driver ABI (Chapter 11)

  • Defined in .kabi IDL files, compiled by umka-kabi-gen → C + Rust bindings.
  • KabiDriverManifest embedded in ELF .kabi_manifest section: transport_mask: u8 (bit0=Tier0/Direct, bit1=Tier1/Ring, bit2=Tier2/IPC), three optional entry functions.
  • Default policy: every driver binary ships all 3 transports. Tier change = operator config update, no recompilation. minimum_tier/maximum_tier in .kabi to opt out.
  • Vtables: append-only, C-compatible, vtable_size field for runtime version detection.
  • Support window: 5 major releases per KABI version.
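The manifest bits and the append-only vtable rule above can be sketched as follows. The bit assignments for `transport_mask` come from the list; everything else (the `Tier` enum, `BlockVtableV1`/`BlockVtableV2`, the `read`/`flush` entries, and the choice of `usize` for `vtable_size`) is hypothetical and only meant to show how a size field allows runtime version detection.

```rust
// Bit layout as stated in the manifest description.
pub const TRANSPORT_TIER0_DIRECT: u8 = 1 << 0;
pub const TRANSPORT_TIER1_RING: u8 = 1 << 1;
pub const TRANSPORT_TIER2_IPC: u8 = 1 << 2;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Tier {
    Zero = 0,
    One = 1,
    Two = 2,
}

/// True if the driver binary ships a transport for `tier`.
pub fn supports(transport_mask: u8, tier: Tier) -> bool {
    transport_mask & (1u8 << (tier as u8)) != 0
}

/// Version 1 of a hypothetical block-driver vtable.
#[repr(C)]
pub struct BlockVtableV1 {
    pub vtable_size: usize,
    pub read: extern "C" fn(u64) -> i32,
}

/// Version 2 appends `flush`. Existing fields keep their offsets,
/// which is what "append-only" buys.
#[repr(C)]
pub struct BlockVtableV2 {
    pub vtable_size: usize,
    pub read: extern "C" fn(u64) -> i32,
    pub flush: extern "C" fn() -> i32,
}

/// The caller checks the size reported by the driver before touching
/// any entry that was appended after version 1.
pub fn has_flush(reported_vtable_size: usize) -> bool {
    reported_vtable_size >= std::mem::size_of::<BlockVtableV2>()
}
```

A driver built against version 1 reports the smaller size, so a newer kernel in this sketch would skip `flush` instead of reading past the end of the table.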

Concurrency Model (Chapter 3)

  • CpuLocal: register-based (~1-10 cycles). x86-64 GS, AArch64 TPIDR_EL1, ARMv7 TPIDRPRW, PPC64 r13, RISC-V sscratch. For ~10 hottest fields: current_task, runqueue, preempt_count, slab_magazines, rcu_nesting, rcu_passed_quiescent, isolation_shadow, napi_budget, cpu_id.
  • PerCpu<T>: get() requires &PreemptGuard. get_mut() requires &mut PreemptGuard + IRQs disabled. get_mut_nosave() takes &IrqDisabledGuard, skips IRQ save/restore. Debug-only borrow-state CAS (release builds: zero overhead).
  • RCU: deferred quiescent state — RcuReadGuard::drop sets CpuLocal flag, tick/switch reports. rcu_call() for deferred free. Use for: caps, dentries, page cache, KRL reads.
  • Lock hierarchy: compile-time Lock<T, LEVEL>. TASK_LOCK < RQ_LOCK < PI_LOCK. Violations are compile errors, not deadlocks. Work stealing: always acquire in CPU-ID order.
  • Ring buffers: ALL cross-domain IPC uses io_uring-style shared-memory rings. Not function calls. Core↔Tier 1, Tier 1↔Tier 2 — the ring IS the isolation mechanism.
  • ValidatedCap: validate-once token amortizing KABI dispatch overhead. Store result, don't re-validate on every call.
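One way to make lock-order violations compile errors on stable Rust is a token that carries the highest level held so far, checked by a const assertion. The sketch below assumes this approach; the primer does not say how `Lock<T, LEVEL>` is actually implemented. The numeric level values, the `Held` token, and the `Ascending` helper are hypothetical, and `std::sync::Mutex` stands in for a kernel lock.

```rust
use std::sync::{Mutex, MutexGuard};

// Only the relative order (TASK_LOCK < RQ_LOCK < PI_LOCK) comes from
// the text; the numbers themselves are placeholders.
pub const TASK_LOCK: u32 = 1;
pub const RQ_LOCK: u32 = 2;
pub const PI_LOCK: u32 = 3;

/// Proof of the highest lock level held on the current call path.
pub struct Held<const LEVEL: u32>(());

impl Held<0> {
    /// Entry point of a call path: nothing is held yet.
    pub fn nothing() -> Self {
        Held(())
    }
}

/// Evaluating `OK` fails compilation unless HELD < NEXT.
struct Ascending<const HELD: u32, const NEXT: u32>;

impl<const HELD: u32, const NEXT: u32> Ascending<HELD, NEXT> {
    const OK: () = assert!(HELD < NEXT, "lock hierarchy violation");
}

pub struct Lock<T, const LEVEL: u32> {
    inner: Mutex<T>,
}

impl<T, const LEVEL: u32> Lock<T, LEVEL> {
    pub fn new(value: T) -> Self {
        Lock { inner: Mutex::new(value) }
    }

    /// Consumes access to the previous token for as long as the guard
    /// lives and hands back a token at this lock's level.
    pub fn lock<'a, const HELD: u32>(
        &'a self,
        _held: &'a mut Held<HELD>,
    ) -> (MutexGuard<'a, T>, Held<LEVEL>) {
        let () = Ascending::<HELD, LEVEL>::OK; // checked at compile time
        let guard = self.inner.lock().expect("lock poisoned");
        (guard, Held(()))
    }
}
```

Acquiring `TASK_LOCK` and then `RQ_LOCK` builds; passing the `Held<RQ_LOCK>` token to a `Lock<_, TASK_LOCK>` fails during compilation with the assertion message. The sketch is simplified: the returned token is not tied to the guard's lifetime, which a production design would need to address.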

Data Structure Invariants (Flag Violations)

These apply everywhere. Flag any violation:

| Rule | Correct | Incorrect |
|------|---------|-----------|
| No heap alloc in hot paths | Fixed-size arrays, ring buffers | Vec<T>, HashMap without bound |
| No trait objects in KABI ABI | &'static dyn Trait + PrivData | Box<dyn Trait> |
| No bool for multi-state | enum State { Active, Deferred, Migrating } | on_rq: bool |
| Kernel structs on cache lines | #[repr(C, align(64))] | Struct spanning 2 lines in hot path |
| No intrusive linked lists for callbacks | Pre-allocated ring (RcuCallbackRing) | rcu_head embedded in object |
| Generation counters must handle wrap | SlotState::GenerationExhausted + EOVERFLOW | silent wrap |
| PerCpu fields accessed only under guard | get() with PreemptGuard | bare pointer dereference |
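The generation-counter rule can be illustrated with a small sketch. `SlotState::GenerationExhausted` and `EOVERFLOW` are named in the table; the other variants, the `Slot` struct, and the `recycle` method are hypothetical. The errno value 75 is the Linux generic value for `EOVERFLOW`.

```rust
/// Linux generic errno value for EOVERFLOW.
pub const EOVERFLOW: i32 = 75;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SlotState {
    Free,
    Live,
    GenerationExhausted,
}

/// Hypothetical table slot guarded by a generation counter.
pub struct Slot {
    generation: u32,
    state: SlotState,
}

impl Slot {
    pub fn with_generation(generation: u32) -> Self {
        Slot { generation, state: SlotState::Free }
    }

    /// Reuses the slot under a fresh generation. When the counter cannot
    /// advance, the slot is retired permanently instead of wrapping to a
    /// value that a stale handle might still carry.
    pub fn recycle(&mut self) -> Result<u32, i32> {
        if self.state == SlotState::GenerationExhausted {
            return Err(EOVERFLOW);
        }
        match self.generation.checked_add(1) {
            Some(next) => {
                self.generation = next;
                self.state = SlotState::Live;
                Ok(next)
            }
            None => {
                self.state = SlotState::GenerationExhausted;
                Err(EOVERFLOW)
            }
        }
    }

    pub fn state(&self) -> SlotState {
        self.state
    }
}
```

The incorrect variant from the table would be `wrapping_add(1)`, which silently lets an old handle match a new occupant.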

Security Model

  • Capabilities (SystemCaps) are the native security model. Unix UID/GID emulated on top.
  • Credential model: TaskCredential (immutable, copy-on-write). Privilege checks: cred.has_cap(CAP_X).
  • LSM hooks: pluggable, stackable, mandatory pre-check before any privileged operation.
  • Verified boot: UEFI SB → GRUB → kernel sig → initramfs sig. Hybrid Ed25519 + ML-DSA-65.
  • PQC: ML-KEM-768 (KEM), ML-DSA-44/65/87 (signatures), SLH-DSA-128f (stateless sig).
  • Confidential computing: SEV-SNP, TDX, ARM CCA — all share the same live migration auth model.
  • IMA: hash-chained measurement log, appraisal, audit. All binaries measured before exec.
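A minimal sketch of the immutable, copy-on-write credential idea, assuming a bitmask representation. `TaskCredential` and `has_cap` are named above; the `uid` and `caps` fields, the `without_cap` method, and the use of `Arc` for sharing are assumptions of this sketch. The two capability numbers are Linux's values, borrowed purely for illustration, and may not match UmkaOS's SystemCaps numbering.

```rust
use std::sync::Arc;

// Linux capability numbers, used here only as example values.
pub const CAP_NET_ADMIN: u32 = 12;
pub const CAP_SYS_ADMIN: u32 = 21;

/// Immutable once published; changes produce a new credential.
#[derive(Clone, Debug)]
pub struct TaskCredential {
    uid: u32,
    caps: u64,
}

impl TaskCredential {
    pub fn new(uid: u32, caps: u64) -> Arc<Self> {
        Arc::new(TaskCredential { uid, caps })
    }

    pub fn uid(&self) -> u32 {
        self.uid
    }

    /// Privilege check: a single bit test on the shared snapshot.
    pub fn has_cap(&self, cap: u32) -> bool {
        cap < 64 && self.caps & (1u64 << cap) != 0
    }

    /// Copy-on-write update: readers holding the old credential are
    /// unaffected and keep seeing the old capability set.
    pub fn without_cap(&self, cap: u32) -> Arc<Self> {
        let mut next = self.clone();
        if cap < 64 {
            next.caps &= !(1u64 << cap);
        }
        Arc::new(next)
    }
}
```

In this sketch a task that drops a capability swaps its credential pointer to the new `Arc`, while any in-flight check on the old snapshot completes against consistent data.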

Performance Budget

Target: <5% overhead vs Linux on macro benchmarks.

| Path | Overhead | Mechanism |
|------|----------|-----------|
| NVMe 4KB read | ~0.25-0.5% | Shadow-elided MPK + doorbell coalescing + ValidatedCap |
| TCP RX (NAPI-64) | ~0.02-0.06%/pkt | NAPI batching amortizes MPK switch |
| Context switch | ~0.1-0.3% | CpuLocal register access eliminates most per-CPU loads |
| Pure compute | 0% | Never leaves UmkaOS Core domain |
| Cumulative nginx-class | ~0.8% | ~4.2% headroom to target |

Seven mandatory optimizations (all from day one, none deferred): CpuLocal register access, PerCpu debug-only CAS, IrqDisabledGuard elision, RCU deferred quiescent, isolation shadow elision, ValidatedCap amortization, doorbell coalescing for NVMe/virtio.


Linux Compatibility (Chapter 18)

  • ~330-350 of ~450 x86-64 syscalls implemented natively. Not wrappers.
  • The SLAT entry: untyped C ABI (int fd, void* buf) → typed Rust (CapHandle, UserPtr<u8>).
  • umka_syscall(op, args, size) multiplexed entry for UmkaOS-native extensions (Section 18.6).
  • All 8 namespace types: mnt, pid, net, ipc, uts, user, cgroup, time.
  • cgroups v2 fully specified; v1 compatibility shim for legacy containers.
  • io_uring full impl, 64 signals, futex, eBPF (verifier + JIT x86-64, interpreted others).
  • Deliberately dropped: binary .ko (KABI replaces it), ia32 multilib, /dev/mem, /dev/kmem, ioperm/iopl, general-purpose kexec. Do not flag these as missing.
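The untyped-to-typed conversion at the syscall boundary might look like the sketch below. `CapHandle` and `UserPtr<u8>` are named in the list; their fields, the `USER_SPACE_END` limit, `ReadArgs`, and `decode_read` are hypothetical. A real entry path would resolve the descriptor through the task's FdTable; this sketch only wraps and range-checks the raw values. The errno constants are the Linux values.

```rust
use std::marker::PhantomData;

pub const EBADF: i32 = 9;
pub const EFAULT: i32 = 14;

/// Hypothetical upper bound of user addresses, for illustration only.
const USER_SPACE_END: u64 = 0x0000_8000_0000_0000;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct CapHandle(u32);

impl CapHandle {
    pub fn from_raw_fd(fd: i32) -> Result<Self, i32> {
        if fd < 0 {
            Err(EBADF)
        } else {
            Ok(CapHandle(fd as u32))
        }
    }

    pub fn index(self) -> u32 {
        self.0
    }
}

/// A user-space range that has passed overflow and bounds checks.
pub struct UserPtr<T> {
    addr: u64,
    count: u64,
    _elem: PhantomData<T>,
}

impl<T> UserPtr<T> {
    pub fn new(addr: u64, count: u64) -> Result<Self, i32> {
        let elem = std::mem::size_of::<T>() as u64;
        let bytes = count.checked_mul(elem).ok_or(EFAULT)?;
        let end = addr.checked_add(bytes).ok_or(EFAULT)?;
        if end > USER_SPACE_END {
            return Err(EFAULT);
        }
        Ok(UserPtr { addr, count, _elem: PhantomData })
    }

    pub fn addr(&self) -> u64 {
        self.addr
    }

    pub fn count(&self) -> u64 {
        self.count
    }
}

pub struct ReadArgs {
    pub file: CapHandle,
    pub buf: UserPtr<u8>,
}

/// Raw `read(int fd, void *buf, size_t count)` arguments to typed form.
pub fn decode_read(fd: i32, buf: u64, count: u64) -> Result<ReadArgs, i32> {
    Ok(ReadArgs {
        file: CapHandle::from_raw_fd(fd)?,
        buf: UserPtr::new(buf, count)?,
    })
}
```

After `decode_read` succeeds, the handler body in this sketch never sees a raw integer or pointer, so the validation cannot be skipped by accident.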

Section Dependency Map

Section numbering: Ch.Sec (e.g., Section 10.4 = Chapter 10, Section 4 in 10-drivers.md).

| Section | File | Depends On | Exports To |
|---------|------|------------|------------|
| 3.1 Concurrency | 03-concurrency | | Everything |
| 4.1 Memory | 04-memory | 3.1, 2.1 (boot) | 4.2 (ZPool), 13.1 (VFS), 21.2 (accel) |
| 5.1 Distributed | 05-distributed | 3.1, 8.1, 4.1 | 5.2 (SmartNIC), 14.6 (DLM) |
| 6.1 Scheduler | 06-scheduling | 3.1, 4.1 | 6.3 (CBS), 7.2 (RT), 21.3 (accel sched) |
| 8.1 Capabilities | 08-security | 3.1 (RCU) | 10.4, 8.2, 16.1 (namespaces) |
| 8.2 Verified Boot | 08-security | 11.1 (KABI), 2.1 | 8.4 (IMA) |
| 9.1-9.5 Security Extensions | 09-security-extensions | 8.1-8.8 | 10.4 (drivers), 11.1 (KABI) |
| 10.1-10.2 Tier Model | 10-drivers | 3.1, 8.1 | 10.4, all tiers |
| 10.4-10.10 Driver Framework | 10-drivers | 10.2, 8.1 | 13.1 (VFS), 21.1 (accel) |
| 11.1 KABI | 11-kabi | 10.4, 8.1 | All driver sections |
| 13.1 VFS | 13-vfs | 10.4, 4.1 | 14.1-14.13 (storage) |
| 16.1 Containers | 16-containers | 7.1, 6.3 | 18.1 (syscalls) |
| 18.1 Syscalls | 18-compat | 8.1, 7.1 | 18.6 (native), 20.1-20.2 (user I/O) |
| 21.1 Accelerators | 21-accelerators | 10.4, 10.6 | 21.4 (inference) |
| 22.1 AI/ML Policy | 22-ml-policy | 21.4, 6.1 | Scheduler, memory, TCP |

Intentional Tradeoffs (Do Not Flag)

  • WRPKRU unprivileged on x86 → Tier 1 bypassable by deliberate exploit. By design. Tier 2 for untrusted.
  • VTable indirection for KABI → ~2-3 extra memory accesses. Stable ABI worth it.
  • No global run queue lock → work stealing acquires remote locks in CPU-ID order. Ordered = no deadlock.
  • RISC-V page-table isolation → no fast mechanism exists. Tier 0 for trusted, Tier 2 for untrusted. Accepted.
  • TPM latency (5-50ms) → mitigated by async I/O and result caching. Boot overhead ~1-2s accepted.
  • CBS budget_remaining_us: AtomicI64 → negative = intentional deficit tracking. Not a sign error.
  • seq: u32 in VdsoData → naturally-aligned u32 reads are atomic on all supported archs. Not torn.
  • CompressedEntry checksum: u32 CRC32C → 1-in-4B collision = process crash, not kernel corruption. Accepted.
  • DSM not sequentially consistent everywhere → strict coherence for shared structs, relaxed for bulk. By design.
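To show why a signed budget is useful for the CBS item above, here is a simplified sketch of deficit tracking. Only the field name `budget_remaining_us: AtomicI64` comes from the list; `CbsServer`, `charge`, `replenish`, and the refill-with-cap policy are assumptions of this sketch and are not claimed to match the scheduler chapter.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical constant-bandwidth server state.
pub struct CbsServer {
    budget_remaining_us: AtomicI64,
    budget_per_period_us: i64,
}

impl CbsServer {
    pub fn new(budget_per_period_us: i64) -> Self {
        CbsServer {
            budget_remaining_us: AtomicI64::new(budget_per_period_us),
            budget_per_period_us,
        }
    }

    /// Charges consumed runtime. The counter may go negative: the
    /// overrun is remembered instead of being clamped to zero.
    /// Returns true when the server should be throttled.
    pub fn charge(&self, ran_us: i64) -> bool {
        let before = self.budget_remaining_us.fetch_sub(ran_us, Ordering::AcqRel);
        before - ran_us <= 0
    }

    /// Period boundary: add one period of budget, capped at a full
    /// period. A negative balance is paid back out of the refill.
    pub fn replenish(&self) -> i64 {
        let budget = self.budget_per_period_us;
        let refill = |cur: i64| (cur + budget).min(budget);
        let prev = self
            .budget_remaining_us
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| Some(refill(cur)))
            .expect("closure always returns Some");
        refill(prev)
    }

    pub fn remaining_us(&self) -> i64 {
        self.budget_remaining_us.load(Ordering::Acquire)
    }
}
```

With an unsigned counter the 200 µs overrun in a 1200 µs burst against a 1000 µs budget would be lost; here the next period starts with 800 µs, so long-run bandwidth stays bounded.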

Common Reviewer Errors

  • Do not flag WRPKRU/MPK as a security vulnerability. Crash containment only. Documented.
  • Do not evaluate UmkaOS as primarily an isolation system. Primary novelty: multikernel peer model, distributed primitives, heterogeneous compute. Isolation is one enabling mechanism.
  • Do not compare Tier 1 to microkernel security. Tier 1 ≠ security boundary. Tier 2 is.
  • Do not assume "device peers" means all devices are peers. Only UmkaOS-kernel devices. Devices with vendor firmware (GPU, NVMe, USB) use traditional Tier 1/2 driver + AccelBase.
  • Do not flag RISC-V isolation overhead. Accepted. Adaptive policy documented in Section 10.2.7.
  • Do not flag no GC. UmkaOS uses Rust #![no_std] — no GC. RCU + reference counting + slab.
  • Do not flag Multiboot2 absence. Phase 2 item. Phase 1 target is QEMU (Multiboot1).
  • Check the dependency map before flagging missing detail. Detail may exist in a linked chapter.