Chapter 1: Architecture Overview
Design philosophy, architectural goals, performance budget
1.1 Overview and Philosophy
1.1.1 What UmkaOS Is
UmkaOS is a production OS kernel designed for the computing environment of 2026, built as a drop-in replacement for Linux. Unmodified Linux userspace — glibc, musl, systemd, Docker, Kubernetes, QEMU/KVM, and the entire ecosystem — runs without recompilation. The compatibility target is 99.99% of real-world usage.
The fundamental premise: a modern server is not a CPU commanding passive peripherals. It is a heterogeneous distributed system — CPU complex, DPU/SmartNIC running 16 ARM cores with its own OS, NVMe SSD running an embedded RTOS, GPU running its own memory manager and scheduler, CXL memory expanders making autonomous prefetch decisions. Linux models all of these as peripherals. UmkaOS models them as peers. This is not a stylistic difference — it changes what the OS kernel is, what it does, and how it is structured.
Two models, one kernel. "Peer" has a specific meaning: a device running its own UmkaOS kernel instance that participates in the distributed cluster as a first-class member. Most current hardware does not do this — GPUs run CUDA/ROCm firmware, NVMe SSDs run embedded RTOS, USB controllers run vendor microcode. These devices continue to use the traditional driver model, which UmkaOS also supports and improves: stable KABI for binary compatibility, hardware-assisted crash containment with recovery in milliseconds, and unified compute integration for GPUs and accelerators via AccelBase.
Both models coexist on the same host. A server might run a BlueField DPU as an UmkaOS cluster peer (host driver: ~2,000 lines of generic transport), a GPU under the AccelBase KABI with unified memory scheduling (host driver: full Tier 1, but structured and crash-recoverable), and a USB controller in a Tier 2 userspace process — all at the same time. The architecture is designed for this spectrum, not a single model. As hardware evolves and more devices gain programmable cores capable of running UmkaOS, the boundary between the two models moves — but the kernel handles both ends and everything in between.
The kernel is written primarily in Rust (with C and assembly only for boot code and arch-specific primitives), with a stable driver ABI, hardware-assisted crash containment, and performance parity with monolithic Linux.
Beyond Linux compatibility, UmkaOS provides capabilities that Linux cannot realistically add to its existing architecture: transparent driver crash recovery without reboot, distributed kernel primitives (shared memory, distributed locking, cluster membership), a unified heterogeneous compute framework for GPUs/TPUs/NPUs/CXL devices, structured observability with automated fault management, per-cgroup power budgeting with enforcement, post-quantum cryptographic verification, and live kernel updates without downtime. These features are designed into the architecture from day one — not retrofitted as afterthoughts.
Architecture Coverage
UmkaOS targets six architectures with equal first-class status:
| Architecture | Primary deployment | Hardware isolation | eBPF JIT |
|---|---|---|---|
| x86-64 | Cloud servers, workstations | MPK (PKEY) | Phase 1 |
| AArch64 | Mobile, edge, cloud servers | POE (ARMv8.9+) / page-table | Phase 1 |
| ARMv7 | Embedded, IoT | DACR | Phase 2 |
| RISC-V 64 | Emerging servers, embedded | Page-table only (hardware TBD) | Phase 2 |
| PPC32 | Embedded, automotive | Segment registers | Phase 3 |
| PPC64LE | HPC, IBM servers | Radix PID | Phase 2 |
Performance budgets and optimization priorities apply to all architectures, not just x86-64. Benchmarks and performance claims in this document are validated against each architecture. Where architecture-specific data is given, it is explicitly labelled.
1.1.2 Why UmkaOS Exists
Linux's monolithic architecture has fundamental limitations that are nearly impossible to fix within the existing codebase:
- The machine is no longer what Unix assumed. A 2026 server contains BlueField DPUs (16 ARM cores, own Linux instance, 200Gb/s RDMA), NVMe SSDs (embedded RTOS, flash translation, wear-leveling firmware), GPUs (own memory manager, inter-engine scheduler, PCIe P2P fabric), and CXL memory expanders (autonomous prefetch and power management). Linux models all of these as passive peripherals commanded by the host CPU. This requires 100,000–700,000+ lines of Ring 0 driver code per device class to proxy the device's own intelligence into the host OS. The abstraction is broken at its foundation.
- No first-class distribution. Cluster-wide shared memory, locking, and coherence are implemented as ad-hoc userspace layers (Ceph, GFS2, MPI) rather than kernel primitives. Every distributed system reinvents membership, failure detection, and data placement. Linux cannot express that a DPU and a host CPU share a coherent memory space — it has no model for it.
- Heterogeneous compute as afterthought. GPUs, TPUs, NPUs, DPUs, and CXL memory expanders each have their own driver stack, memory model, and scheduling framework. No unified substrate exists. `amdgpu` alone is ~700K lines of handwritten code (the `amdgpu` directory contains ~5.9M lines total, of which ~4.4M are auto-generated headers); `i915` is ~400K. Nvidia's out-of-tree driver is ~1M lines. Each reimplements the same memory management, scheduling, and DMA infrastructure from scratch in Ring 0.
- Device firmware treated as dumb peripheral. A BlueField DPU runs 16 ARM cores and a full Linux OS. An NVMe controller runs an embedded RTOS. A GPU runs a complete memory manager and scheduler in firmware. These devices already are computers — yet Linux commands them as passive peripherals with no kernel-level coordination, no capability model, and no structured recovery when a device misbehaves.
- No driver isolation. A single driver bug crashes the entire system. Drivers account for approximately 50% of kernel code changes and approximately 50% of all regressions and CVEs. When a driver crashes, the entire machine reboots — taking down every VM, container, and long-running job with it.
- Device drivers are the dominant attack surface. `amdgpu` alone is ~700,000 lines of Ring 0 code; `mlx5` is ~150,000. Every line runs with full kernel privileges. A single memory-safety bug anywhere in that code equals full kernel compromise. Firmware updates require coordinated host driver updates and usually a reboot, entangling device and OS release cycles.
- No stable in-kernel ABI. Every kernel update can break out-of-tree drivers, requiring constant recompilation (DKMS). Nvidia, ZFS, and every other out-of-tree module suffer from this.
- Coarse-grained locking. RTNL and other legacy locks are scalability bottlenecks on many-core systems. Documented regressions exist on 256+ core servers.
- No capability-based security. The monolithic privilege model means any kernel vulnerability equals full system compromise.
- Real-time limitations. PREEMPT_RT still trades throughput for latency and cannot eliminate all unbounded-latency paths.
- Observability bolted on. eBPF, tracepoints, /proc, /sys, and audit are separate subsystems with inconsistent interfaces, added incrementally over decades.
- "Never break userspace" constrains evolution. Decades of API debt cannot be cleaned up without breaking backward compatibility.
1.1.3 What UmkaOS Delivers
UmkaOS is not just "Linux with better isolation." It is a comprehensive rethink of what a production kernel should provide, addressing nine fundamental capabilities:
- Multikernel: device peers, not peripherals (Section 5.2, 5.3, 5.12) — Physically-attached devices that run their own kernel instance (BlueField DPU with 16 ARM cores, RISC-V accelerator, computational storage with Zynq SoC) participate as first-class cluster peers — not managed peripherals. Each device has its own scheduler, memory manager, and capability space. Communication is UmkaOS message passing over PCIe P2P domain ring buffers (Section 10.7 Layer 4), the same abstraction used everywhere else. The host needs no device-specific driver — a single generic `umka-peer-transport` module (~2,000 lines) handles every UmkaOS peer device regardless of what it does, replacing hundreds of thousands of lines of Ring 0 driver code per device class. Firmware updates are entirely the device's own responsibility: the device sends an orderly CLUSTER_LEAVE, updates its own firmware or kernel independently, and rejoins — the host never reboots, the host driver never changes, and device and OS release cycles are fully decoupled. When a device kernel crashes, the host does not crash: it executes an ordered recovery sequence — IOMMU lockout and PCIe bus master disable in under 2ms, followed by distributed state cleanup, then optional FLR and device reboot (Section 5.3). This constitutes Tier M (Multikernel Peer) — a qualitatively different isolation class where no host kernel state is shared with the device. Tier M isolation exceeds Tier 2 (the host's strongest software-defined boundary): the only communication surface is the typed capability channel, not shared kernel address space or IOMMU-protected DMA. See Section 10.1.2. Devices with ARM or RISC-V cores can run the UmkaOS kernel with zero porting effort, as UmkaOS already builds for `aarch64-unknown-none` and `riscv64gc-unknown-none-elf`.
- Distributed kernel primitives (Section 5.1, 5.12) — Cluster-wide distributed shared memory (DSM), a distributed lock manager (DLM) with RDMA-native one-sided operations, and built-in membership and quorum protocols. A cluster of UmkaOS nodes can share memory pages, coordinate locks, and detect failures as kernel-level operations — not userspace libraries. This enables clustered filesystems, distributed caches, and multi-node workloads without bolt-on middleware. The same distributed protocol that connects RDMA-linked servers also connects locally-attached peer kernels (Section 5.2.2), with the transport adapted to PCIe P2P instead of the RDMA network.
- Heterogeneous compute fabric (Section 21.1–18.6) — A unified framework for GPUs, TPUs, NPUs, FPGAs, and CXL memory. Pluggable per-device schedulers, unified memory tiers (HBM, CXL, DDR, NVMe), and cross-device P2P transfers. New accelerator types plug into the existing framework without kernel modifications.
- Driver isolation with crash recovery (Section 10.1, 9.2, 9.8) — The enabling infrastructure for the peer model: when a driver crashes, UmkaOS recovers it in milliseconds without rebooting. Applications see a brief hiccup, not a system failure. On hardware with fast isolation (MPK, POE), this costs near-zero overhead. On hardware without it, administrators choose their trade-off: slower isolation via page tables, full performance without isolation, or per-driver demotion to userspace. The kernel adapts to available hardware rather than demanding specific features.
- Stable driver ABI (Section 11.1) — Drivers are binary-compatible across kernel updates, a contract that device peer kernels also rely on to survive host updates. No DKMS, no recompilation on every kernel update. Third-party drivers (GPU, WiFi, storage) work across kernel versions by contract, not by accident.
- Structured observability (Section 19.1–16.4) — Fault Management Architecture (FMA) with per-device health telemetry, rule-based diagnosis, and automated remediation. An object namespace (umkafs) exposes every kernel object with capability-based access control. Integrated audit logging tied to the capability system, not a separate subsystem.
- Power budgeting with enforcement (Section 6.4) — Per-cgroup power budgets in watts, multi-domain enforcement (CPU, GPU, DRAM, package), and intent-driven optimization. Datacenters can cap power per rack; laptops can maximize battery life per application.
- Post-quantum security (Section 8.2–8.7) — Hybrid classical + ML-DSA signatures for kernel and driver verification from day one. No retrofitting needed when quantum computers threaten RSA/ECDSA. Confidential computing support for Intel TDX, AMD SEV-SNP, and ARM CCA.
- Live kernel evolution (Section 12.6) — Replace kernel subsystems at runtime with versioned state migration. Security patches apply without reboot. No more "Update and Restart."
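The ordered recovery sequence described under the multikernel bullet (IOMMU lockout and bus master disable first, distributed cleanup second, optional FLR last) can be sketched as follows. This is an illustrative model, not the real UmkaOS API: the trait name `PeerPort`, the step enum, and the function are all hypothetical; the point is that DMA containment strictly precedes state cleanup, which strictly precedes any device reset.

```rust
// Hypothetical sketch of the host-side Tier M recovery sequence.
// Ordering is the invariant: cut off DMA before touching shared state.
#[derive(Debug, PartialEq)]
enum RecoveryStep { IommuLockout, BusMasterDisable, StateCleanup, FunctionLevelReset }

trait PeerPort {
    fn iommu_lockout(&mut self);             // revoke all DMA translations
    fn bus_master_disable(&mut self);        // clear PCIe Bus Master Enable
    fn cleanup_distributed_state(&mut self); // drop DSM pages, release DLM locks
    fn flr(&mut self);                       // optional Function Level Reset
}

fn recover_peer<P: PeerPort>(port: &mut P, want_reset: bool) -> Vec<RecoveryStep> {
    let mut log = Vec::new();
    // Phase 1: hardware containment (budget: under 2ms per Section 5.3)
    port.iommu_lockout();
    log.push(RecoveryStep::IommuLockout);
    port.bus_master_disable();
    log.push(RecoveryStep::BusMasterDisable);
    // Phase 2: distributed state cleanup
    port.cleanup_distributed_state();
    log.push(RecoveryStep::StateCleanup);
    // Phase 3: optional device reboot
    if want_reset {
        port.flr();
        log.push(RecoveryStep::FunctionLevelReset);
    }
    log
}
```

Because the host shares no kernel state with the peer, this sequence never has to unwind host data structures — it only revokes the device's reach and the cluster's references to it.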
1.1.4 The Core Technical Challenge
A kernel that treats heterogeneous devices as first-class peers, runs distributed primitives natively, and manages unified heterogeneous compute — while maintaining full Linux compatibility and performance parity with a monolithic kernel.
This is considered "impossible" because traditional microkernel and distributed OS designs impose 10-50% overhead from IPC-based isolation and cross-node coordination. UmkaOS achieves near-zero overhead through four key techniques:
- Hardware-assisted Tier 1 isolation — Using the best available mechanism on each architecture (MPK on x86, POE on AArch64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE), domain switches cost approximately 23-80 cycles — not the 600+ cycle IPC of traditional microkernels. On architectures without fast isolation (RISC-V), the kernel adapts: promote trusted drivers to Tier 0, demote untrusted drivers to Tier 2, or accept the page-table fallback overhead. See Section 10.2.7 for the full adaptive isolation policy.
- io_uring-style shared memory rings at every tier boundary, eliminating data copies
- PCID/ASID for TLB preservation across protection domain switches, avoiding the flush penalty
- Batch amortization of all domain-crossing costs, spreading fixed overhead across many operations
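The batch-amortization point can be made concrete with a line of arithmetic. The sketch below is an illustrative cost model (the 23-cycle MPK figure is from this section; the batch sizes are example values, not benchmarks): a fixed domain-crossing cost paid once per batch shrinks to a negligible per-operation share as the batch grows.

```rust
// Illustrative model of batch amortization: one fixed domain-crossing
// cost is shared across every operation submitted in the same batch.
fn amortized_cycles_per_op(fixed_crossing_cycles: f64, batch: usize) -> f64 {
    fixed_crossing_cycles / batch as f64
}

fn main() {
    // Unbatched: every operation pays the full crossing cost.
    println!("batch=1:  {:.2} cycles/op", amortized_cycles_per_op(23.0, 1));
    // Batch of 32 (a typical ring submission): under one cycle per op.
    println!("batch=32: {:.2} cycles/op", amortized_cycles_per_op(23.0, 32));
}
```

The same arithmetic applies to every fixed cost in the list above — doorbell writes, TLB bookkeeping, capability validation — which is why the techniques compose rather than compete.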
1.1.5 Design Principles
- Device firmware is a peer, not a servant. Modern hardware runs its own kernel: NICs run firmware (BlueField DPUs run full Linux), GPUs run scheduling and memory management firmware, storage controllers run RTOS. UmkaOS's distributed kernel design allows device-local kernels to participate as first-class members of the distributed system — not just passive devices commanded by the host. A cluster can include CPU nodes, GPU nodes, and SmartNIC nodes as equals.
- Plan for distribution from day one. Shared memory, locking, and coherence protocols are core kernel subsystems, not afterthoughts. Retrofitting distribution into a single-node kernel always produces inferior results.
- Heterogeneous compute is first-class. GPUs, accelerators, CXL memory, and disaggregated resources are not special cases — they are the normal operating environment for modern workloads.
- Performance is not negotiable. Every abstraction must justify its overhead in cycles.
- Isolation enables reliability, not security. Driver boundaries are structural — bugs can't escape their tier and crash the system. But isolation mechanisms (MPK, POE, DACR) are crash containment only, not exploitation defense. The security boundary is Tier 2 (Ring 3 + IOMMU). The enforcement mechanism adapts to what hardware provides.
- Adapt to available hardware. When the hardware provides fast isolation, use it. When it does not, degrade gracefully — do not refuse to run. A universal kernel must work on everything, even if it means honest trade-offs on some platforms.
- Rust ownership replaces runtime checks. Compile-time guarantees replace lockdep, KASAN, and similar debug-only tools.
- Stable ABI is a first-class contract. Drivers are binary-compatible across kernel updates by design.
- Linux compatibility is near-complete. If glibc, systemd, or any actively-maintained software calls it, we implement it. Only interfaces deprecated for 15+ years with zero modern users are excluded.
- No new `ioctl` calls for UmkaOS-specific features. `ioctl(2)` is a grab-bag interface: untyped integer commands, unversioned argument structs, no introspection, no capability model, and historically a major source of kernel bugs and CVEs. Linux cannot remove ioctls because of backward-compatibility obligations. UmkaOS starts clean.
New subsystems and features in UmkaOS that do not need Linux binary compatibility use typed alternatives instead of ioctls:
| Instead of | Use |
|---|---|
| `ioctl(fd, CMD_FOO, &arg)` | Named umkafs file: `write("/System/Kernel/foo/config", &typed_arg)` |
| Device control ioctls | io_uring typed operations (`IORING_OP_*`) |
| Query ioctls | Read umkafs attribute files with structured binary format |
| Event subscription ioctls | `FileWatchCap` or `EventRing` subscription |
| Batch control ioctls | io_uring SQE chains with per-operation typed structs |
Existing Linux ioctls (block device, socket, DRM, USB, etc.) are fully supported —
required for binary compatibility. New UmkaOS subsystems (AccelBase, umkafs management,
UmkaOS-specific security primitives, driver configuration extensions) expose their control
plane via umkafs typed files or io_uring operations. Driver KABI extensions are added to
the versioned KernelServicesVTable / DriverVTable (Section 11.1), not via new ioctls on
a device fd. This rule creates no compat gap: the Linux compatibility layer (Section 15)
implements all Linux-specified ioctls as-is.
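What a "typed umkafs file" buys over an ioctl can be shown in a few lines. The sketch below is a hedged illustration, not the real UmkaOS schema: the struct name `FooConfigV1`, its fields, and the `/System/Kernel/foo/config` path are assumptions for this example. The contract is a versioned struct with an explicit byte layout, instead of an untyped command number plus a raw memory blob.

```rust
// Hypothetical UmkaOS control-plane struct. The version field lets the
// kernel reject writes against a schema it does not understand — the
// introspection and versioning that ioctl(2) lacks.
#[repr(C)]
struct FooConfigV1 {
    version: u32,     // checked by the kernel against its schema version
    queue_depth: u32,
    flags: u64,
}

fn encode(cfg: &FooConfigV1) -> Vec<u8> {
    // Field-by-field little-endian encoding: the byte layout is part of
    // the contract, unlike an ioctl struct copied as raw process memory.
    let mut out = Vec::with_capacity(16);
    out.extend_from_slice(&cfg.version.to_le_bytes());
    out.extend_from_slice(&cfg.queue_depth.to_le_bytes());
    out.extend_from_slice(&cfg.flags.to_le_bytes());
    out
    // On UmkaOS this buffer would be passed to
    // write("/System/Kernel/foo/config", ...) (path hypothetical).
}
```

A query is the mirror image: read the attribute file and decode the same versioned layout, so tools can inspect and validate configuration without device-specific command tables.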
1.2 Performance Budget
Target: less than 5% total overhead versus Linux on macro benchmarks.
Platform qualification: This 5% target applies to platforms with hardware isolation support (x86 MPK, ARMv8.9+ POE). On page-table-fallback platforms (RISC-V, pre-POE ARM), domain switch overhead is higher (~200–500 cycles vs ~23 cycles), which may exceed the 5% target on isolation-heavy workloads. These platforms are fully supported but the performance target is relaxed on them; see Section 10.2 for per-architecture overhead estimates.
Architecture-specific overhead budget:
The ≤5% overhead target applies to architectures with native fast isolation mechanisms:
- x86-64: Memory Protection Keys (MPK / WRPKRU, ~23 cycles)
- AArch64: Permission Overlay Extension (POE / MSR POR_EL0, ~40–80 cycles) or page-table ASID fallback
- ARMv7: Domain Access Control Register (DACR / MCR p15, ~10–20 cycles)
- PPC32: Segment registers (mtsr, ~10–30 cycles)
- PPC64LE POWER9+: Radix PID (mtspr PIDR, ~30–60 cycles)
RISC-V exception: RISC-V currently has no equivalent hardware fast-isolation primitive. Tier 1 drivers on RISC-V run as Tier 0 (in-kernel, fully trusted — same model as Linux monolithic drivers) until a suitable RISC-V ISA extension is standardized. The RISC-V overhead budget is therefore tracked separately: page-table-only isolation (Tier 2 only) is budgeted at ≤10% until hardware support arrives. See Section 10.2 for the complete per-architecture isolation analysis.
Per-Architecture Performance Overhead Targets
| Architecture | Isolation mechanism | Isolation overhead per domain switch | Achievable 5% budget? |
|---|---|---|---|
| x86-64 | MPK WRPKRU | ~23 cycles (~0.007 μs) | Yes — budget met at ~0.8% |
| AArch64 (ARMv8.9+ POE) | POR_EL0 write | ~40-80 cycles | Yes — budget met at ~1.2-1.8% |
| AArch64 (no POE, mainstream) | Page table + ASID | ~150-300 cycles | Yes — budget met at ~2-4% (coalescing required) |
| ARMv7 | DACR MCR p15 | ~10-20 cycles | Yes — budget met at ~0.5-1.0% |
| RISC-V 64 | Page table only | ~200-500 cycles | Marginal — Tier 1 isolation disabled by default; upgrade when hardware support arrives |
| PPC32 | Segment registers | ~10-30 cycles | Yes — budget met at ~0.5-1.5% |
| PPC64LE (POWER9+) | Radix PID PIDR | ~30-60 cycles | Yes — budget met at ~1-2% |
RISC-V note: RISC-V has no hardware fast-isolation mechanism (no MPK, no POE equivalent). Page-table-based domain isolation costs ~200–500 cycles per switch. The adaptive isolation policy (Section 10.2) promotes trusted Tier 1 drivers to Tier 0 on RISC-V to avoid this cost, staying within the 5% overhead budget at the cost of reduced fault isolation for those drivers — a documented tradeoff. Tier 1 isolation remains unavailable until RISC-V ISA extensions provide suitable mechanisms.
AArch64 POE domain availability: POE provides 7 usable protection keys (3-bit PTE index, key 0 reserved). After infrastructure allocation (shared read-only ring descriptors, shared DMA pool, userspace domain, debug domain), 3 keys remain for driver isolation domains — significantly fewer than x86 MPK's 12 usable driver domains. Deployments requiring more than 3 concurrent Tier 1 driver isolation domains on AArch64 must use Tier 0 promotion (for trusted drivers) or Tier 2 (Ring 3 + IOMMU, for untrusted ones) for the excess drivers. See Section 10.2.6 for the full per-architecture domain allocation table.
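The POE key budget above is simple arithmetic worth making explicit. The helper below is illustrative only (the function name and parameterization are not kernel API): a 3-bit PTE overlay index yields 2³ = 8 keys; subtracting the reserved key 0 and the four infrastructure allocations leaves 3 driver domains.

```rust
// Key-budget arithmetic for overlay/protection-key isolation schemes.
// index_bits: width of the per-PTE key index field.
// reserved:   keys the architecture reserves (POE: key 0).
// infra:      keys consumed by kernel infrastructure allocations.
fn usable_driver_keys(index_bits: u32, reserved: u32, infra: u32) -> u32 {
    (1u32 << index_bits) - reserved - infra
}

fn main() {
    // AArch64 POE: 3-bit index, key 0 reserved, 4 infrastructure keys.
    println!("POE driver domains: {}", usable_driver_keys(3, 1, 4)); // 3
}
```

The same formula explains why x86 MPK, with a 4-bit key index, retains many more driver domains after the equivalent infrastructure allocations.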
The overhead comes exclusively from protection domain crossings on I/O paths — operations
that must transit between UmkaOS Core and Tier 1 drivers. Operations handled entirely within
UmkaOS Core (scheduling, memory management, page faults, vDSO calls) have zero additional
overhead compared to Linux. Syscall dispatch itself adds ~23 cycles for the MPK domain
transition, but only when the syscall transits to a Tier 1 driver, which makes it
measurable only on I/O-bound syscalls; syscalls handled entirely in UmkaOS Core
(e.g., mmap, mprotect, brk) perform no domain transition at all.
1.2.1 Per-Operation Overhead
Note: cycle counts in the table below are measured on x86-64. A cross-architecture comparison of the same operations is given immediately after the main table.
| Operation | Linux | UmkaOS | Overhead | Notes |
|---|---|---|---|---|
| Syscall dispatch (I/O) | ~100 cycles | ~123 cycles | +23 cycles | x86-64: bare SYSCALL/SYSRET round-trip without KPTI or Spectre mitigations. +23 cycles for MPK domain switch to Tier 1 driver. With KPTI + Spectre mitigations (production default), Linux syscall entry/exit is ~700-1800 cycles. On non-Meltdown-vulnerable CPUs (Intel Ice Lake+, all AMD Zen), UmkaOS avoids KPTI page table switches for intra-core syscalls. On Meltdown-vulnerable CPUs (Intel Skylake through Cascade Lake), KPTI is a hardware requirement that UmkaOS cannot avoid -- both Linux and UmkaOS pay the same KPTI cost on these CPUs.^1 |
| NVMe 4KB read (total) | ~10 us | ~10.1 us | +1% | 4 MPK switches = ~92 cycles on 10 us op |
| TCP packet RX | ~5 us | ~5.1 us | +2% | 4 MPK switches (NIC driver + umka-net, enter + exit each) = ~92 cycles + ring overhead on 5 us op |
| Page fault (anonymous) | ~300 cycles | ~300 cycles | 0% | Handled entirely in UmkaOS Core |
| Context switch (minimal) | ~200 cycles | ~200 cycles | 0% | Register save/restore only (same mechanism as Linux). This is the lmbench-style minimal context switch between threads in the same address space. A full process context switch with TLB flush and cache effects costs 5,000-20,000 cycles on Linux; UmkaOS's intra-tier switches (MPK domain change) avoid this by not changing address spaces. |
| `vDSO (clock_gettime)` | ~25 cycles | ~25 cycles | 0% | Mapped directly into user space |
| `epoll_wait` (ready) | ~80 cycles | ~80 cycles | 0% | Handled in UmkaOS Core |
| `mmap` (anonymous) | ~400 cycles | ~400 cycles | 0% | Handled in UmkaOS Core |
^1 KPTI note: Kernel Page Table Isolation is required on x86 CPUs vulnerable to Meltdown (Intel client/server cores from Nehalem through Cascade Lake). On these CPUs, every kernel entry/exit pays the KPTI page table switch cost (~200-1000 cycles depending on microarchitecture and TLB pressure). This is a hardware-imposed requirement, not a software choice -- UmkaOS cannot avoid it any more than Linux can. The "~123 cycles" figure in the table assumes a non-vulnerable CPU (Intel Ice Lake+, AMD Zen, ARM, RISC-V) or one with hardware Meltdown fixes. On Meltdown-vulnerable hardware, add the KPTI overhead to both the Linux and UmkaOS columns equally.
Per-Architecture Operation Cost Comparison
The table above gives x86-64 figures. The following table shows the same operations across the three most performance-sensitive architectures. These numbers capture structural differences in ISA design — memory ordering, TLB invalidation, trap entry — rather than microarchitectural variation.
| Operation | x86-64 | AArch64 | RISC-V 64 | Notes |
|---|---|---|---|---|
| Syscall dispatch (I/O) | ~123 cycles | ~130-145 cycles | ~160-200 cycles | AArch64: SVC + EL0→EL1 + VBAR dispatch. RISC-V: ecall + stvec trap entry |
| Context switch (minimal) | ~200 cycles | ~220-260 cycles | ~250-320 cycles | AArch64 includes TLBI ASIDE1 if ASID changes. RISC-V: satp write + SFENCE.VMA |
| Tier 1 isolation switch | ~23 cycles (WRPKRU) | ~40-80 cycles (POR_EL0, ARMv8.9+) or ~150-300 cycles (page table, older ARM) | N/A (Tier 1 not available) | See Section 10.2 |
| Memory barrier (Release) | ~0 cycles (TSO) | ~5-15 cycles (DMB ISH) | ~5-20 cycles (FENCE) | x86 stores are already ordered; ARM/RISC-V pay real barrier cost |
| TLB shootdown (single page) | ~50-150 cycles + IPI | ~50-150 cycles + DSB+TLBI | ~100-200 cycles + IPI+SFENCE.VMA | IPI cost dominates at high core counts |
| RCU read-side | ~1-3 cycles | ~1-3 cycles | ~1-3 cycles | CpuLocal rcu_nesting increment; identical across archs |
| Cache line bounce | ~50-70 cycles | ~40-60 cycles | ~50-80 cycles | MESI-style coherence cost; cache line size is typically 64B on all three architectures, so bounce costs are comparable |
All non-x86 figures are architectural estimates based on instruction latencies and design specifications. They will be calibrated against real hardware measurements in Phase 4 (Production Ready). AArch64 figures apply to Graviton 3/Neoverse V1 class hardware. RISC-V figures apply to SiFive P670-class hardware.
1.2.2 Macro Benchmark Targets
| Benchmark | Acceptable overhead vs Linux |
|---|---|
| `fio` randread/randwrite 4K QD32 | < 2% |
| `iperf3` TCP throughput | < 3% |
| `nginx` small-file HTTP | < 3% |
| `sysbench` OLTP | < 5% |
| `hackbench` | < 3% |
| `lmbench` context switch | < 1% |
| Kernel compile (`make -jN`) | < 5% |
1.2.3 Where the Overhead Comes From
The overhead budget is dominated by I/O paths that cross the isolation domain boundary between UmkaOS Core and Tier 1 drivers. Pure compute workloads, memory-intensive workloads, and scheduling-intensive workloads have effectively zero overhead because they stay entirely within UmkaOS Core.
Worst case: a micro-benchmark that issues millions of tiny I/O operations per second (e.g., 4K random IOPS at maximum queue depth). Even here, the ~92 cycles of domain-switch overhead per operation (x86-64 MPK: ~23 cycles; AArch64 POE: ~40-80 cycles) is less than 1% of the approximately 10 us total device latency on hardware with fast isolation. On page-table-fallback platforms (pre-POE AArch64, RISC-V), Tier 1 isolation is coalesced or disabled to stay within the budget; see the per-architecture table above.
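The worst-case claim reduces to one line of arithmetic. The sketch below works it through under stated assumptions (the 3 GHz clock is an example frequency chosen for the calculation, not a figure from this document; switch counts and cycle costs are the x86-64 numbers above): 4 switches × 23 cycles at 3 GHz is about 31 ns against ~10 µs of device latency, roughly 0.3%.

```rust
// Worked form of the worst-case overhead claim: domain-switch cycles
// converted to time at an assumed clock, as a fraction of op latency.
fn switch_overhead_fraction(
    switches: u32,        // domain crossings per I/O operation
    cycles_per_switch: f64,
    clock_ghz: f64,       // assumed core clock (example value)
    op_latency_us: f64,   // total device latency per operation
) -> f64 {
    let overhead_ns = switches as f64 * cycles_per_switch / clock_ghz; // cycles/GHz = ns
    overhead_ns / (op_latency_us * 1_000.0)
}

fn main() {
    // x86-64 MPK: 4 switches x 23 cycles on a 10 us NVMe read at 3 GHz.
    let f = switch_overhead_fraction(4, 23.0, 3.0, 10.0);
    println!("worst-case switch overhead: {:.2}%", f * 100.0);
}
```

Substituting the AArch64 POE figure (~40–80 cycles per switch) into the same formula keeps the fraction comfortably under 1%, which is why the budget survives the worst case on fast-isolation hardware.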
1.2.4 Comprehensive Overhead Budget
The 5% target applies to macro benchmarks on all supported architectures (see per-architecture targets in the table above). The total overhead is the sum of all UmkaOS-specific costs versus a monolithic Linux kernel. This section enumerates every source of overhead so the budget can be audited. Per-event costs are shown for x86-64 MPK as the reference; see Section 10.2.6.2 for equivalent figures on other architectures.
| Source | Per-event cost (x86-64 MPK; reference) | Frequency | Contribution to macro benchmarks |
|---|---|---|---|
| MPK domain switches | ~23 cycles per WRPKRU | 2-6 per I/O op | 1-4% of I/O-heavy workloads. 0% for pure compute. |
| IOMMU DMA mapping | 0 (same as Linux) | Per DMA op | 0% — UmkaOS uses IOMMU identically to Linux. |
| KABI vtable dispatch | ~2-5 cycles (indirect call) | Per driver method call | <0.1% — indirect call vs direct call. Branch predictor hides this. |
| Capability checks | ~5-10 cycles (bit test) | Per privileged op | <0.1% — bitmask test, fully pipelined. |
| Driver state checkpointing | ~0.2-0.5 μs per checkpoint (memcpy + doorbell) | Periodic (every ~1ms) | ~0.02-0.05% — amortized over 1ms. HMAC computed asynchronously by umka-core, not on driver hot path. |
| Scheduler (EAS + PELT) | 0 (same algorithms as Linux) | Per context switch | 0% — UmkaOS uses the same CFS/EEVDF + PELT as Linux. |
| Scheduler (CBS guarantee) | ~50-100 cycles | Per CBS replenishment | <0.05% — replenishment every ~1ms for CBS-enabled groups only. |
| FMA health checks | ~10-50 cycles | Per device poll (~1s) | <0.001% — background, amortized over seconds. |
| Stable tracepoints | 0 when disabled, ~20-50 cycles when enabled | Per tracepoint hit | 0% disabled. <0.1% when actively tracing. |
| umkafs object bookkeeping | ~50-100 cycles | Per object create/destroy | <0.01% — object lifecycle is cold path. |
| In-kernel inference | 500-5000 cycles per invocation | Per prefetch/scheduling decision | <0.1% — invoked on slow-path decisions (page reclaim, I/O reordering), not per-I/O. Clamped by cycle watchdog. |
| Per-CPU access (CpuLocal) | ~1 cycle (x86 `gs:` prefix) | Per slab alloc, per scheduler tick, per NAPI poll | <0.05% — matches Linux `this_cpu_*` cost. See Section 3.1.2. |
| Per-CPU access (PerCpu\<T>) | ~1-3 cycles (nosave) / ~3-8 cycles (with IRQ save) | Per non-hot per-CPU field access | <0.1% — IRQ elision (Section 3.1.3.1) eliminates save/restore in IRQ context. |
| RCU quiescent state | ~1 cycle (CpuLocal flag write) | Per outermost RCU read section exit | <0.01% — deferred to tick/switch, not per-drop. See Section 3.1.1. |
| Capability validation | ~7-14 cycles (amortized via ValidatedCap) | Per KABI dispatch | <0.05% — validate once, use token for 3-5 sub-calls. See Section 8.1.1.1. |
| Doorbell coalescing | ~5 cycles/cmd (amortized batch-32) | Per batched NVMe/virtio submit | <0.02% — one MMIO write per batch. See Section 10.6. |
| Isolation shadow elision | ~1 cycle (compare) vs ~23-80 (write) | Per domain switch that hits shadow | ~0.1-0.2% saved — mandatory, see Section 10.2.5.1. |
Workload-specific overhead estimates (x86-64 MPK, all optimizations active; representative of hardware with fast isolation):
| Workload | Dominant overhead source | Estimated total overhead |
|---|---|---|
| `fio` 4K random IOPS | MPK switches (shadow-elided) + doorbell | ~0.5-1.5% |
| `iperf3` TCP throughput | MPK switches (NIC + TCP, NAPI-batched) | ~1.5-2.5% |
| `nginx` small-file HTTP | MPK switches (NIC + TCP + NVMe) | ~1-2% |
| `sysbench` OLTP | MPK switches (NVMe + TCP) | ~1.5-2.5% |
| `hackbench` (IPC-heavy) | MPK switches (scheduler stays in core) | ~0.5-1.5% |
| Kernel compile (`make -jN`) | Nearly zero (CPU-bound, in-core) | <1% |
| `memcached` (GET-heavy) | MPK switches (NIC + TCP) | ~1.5-2.5% |
| ML training (GPU) | Nearly zero (GPU work, not CPU I/O) | <1% |
Key insight: the overhead budget is dominated by isolation domain switch cost multiplied by the number of domain crossings per operation. Seven core design techniques (Section 3.1.4) reduce the non-isolation overhead to near-zero and cut isolation overhead by ~25-50% through shadow elision and batching:
- CpuLocal register-based access — hottest per-CPU fields accessed via the architecture's dedicated per-CPU base register (x86-64: GS, AArch64: TPIDR_EL1, ARMv7: TPIDRPRW, PPC64: r13, RISC-V: sscratch) at ~1-10 cycles depending on architecture, matching Linux `this_cpu_*` cost (Section 3.1.2)
- Debug-only PerCpu CAS — borrow-state CAS present only in debug builds; zero cost in release builds (Section 3.1.3)
- IRQ save/restore elision via IrqDisabledGuard — `get_mut_nosave()` skips IRQ save/restore (~1-3 cycles saved) when the caller already holds an `IrqDisabledGuard` (Section 3.1.3.1)
- RCU deferred quiescent state to tick/switch — `RcuReadGuard::drop` writes a CpuLocal flag (~1 cycle); the actual quiescent-state report is deferred to the next scheduler tick or context switch (Section 3.1.1)
- Isolation register shadow elision — WRPKRU/MSR write skipped when the shadow register value already matches, saving ~23-80 cycles per elided switch (Section 10.2.5.1)
- ValidatedCap capability amortization — validate once, use a token for 3-5 sub-calls (~7-14 cycles amortized vs per-call) (Section 8.1.1.1)
- Doorbell coalescing for NVMe/virtio — one MMIO write per batch-32, ~5 cycles/cmd amortized (Section 10.6)
The cumulative nginx-class overhead is ~0.8% on x86-64 (MPK), ~1.5% on AArch64 with POE, and ~2-4% on mainstream AArch64 (page-table fallback with coalescing). This leaves substantial headroom under the 5% budget on all architectures for implementation-phase unknowns (cache effects, compiler variation, subsystem interactions). All seven techniques are implemented from day one — none are deferred optimizations. See Section 3.1.4 for the cumulative per-architecture overhead analysis with complete breakdowns.
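Shadow elision is mechanically simple, which is why it can be mandatory. The sketch below is a minimal userspace model under assumed names (`IsolationShadow` and its fields are illustrative, not the kernel's types): cache the last value written to the isolation register, and replace the expensive WRPKRU/MSR write with a ~1-cycle compare whenever consecutive switches target the same domain.

```rust
// Minimal model of isolation register shadow elision: skip the real
// register write when the cached shadow already holds the target value.
struct IsolationShadow {
    shadow: u32, // last value written to the real register
    writes: u64, // counts actual (non-elided) register writes
}

impl IsolationShadow {
    fn new(initial: u32) -> Self {
        Self { shadow: initial, writes: 0 }
    }

    /// Returns true if a real register write was performed.
    fn switch_domain(&mut self, target: u32) -> bool {
        if self.shadow == target {
            return false; // ~1-cycle compare; ~23-80 cycle write elided
        }
        // On real hardware: wrpkru(target) / MSR write. Here we only
        // track the shadow and count the write.
        self.shadow = target;
        self.writes += 1;
        true
    }
}
```

Runs of same-domain operations (a batch of submissions to one driver, for example) then pay the full write cost once per run rather than once per operation, which is where the ~25-50% isolation-overhead reduction cited above comes from.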