Chapter 1: Architecture Overview¶
Design philosophy, architectural goals, performance budget
UmkaOS is a production Rust kernel designed as a drop-in Linux replacement. Unmodified Linux userspace (glibc, systemd, Docker, K8s) runs without recompile. Internal design prioritizes correctness over Linux imitation — where Linux has known flaws, UmkaOS does it right. The performance budget targets ≤5% throughput overhead on macro benchmarks and <8% p99 tail latency for non-batched RPC workloads on fast-isolation architectures.
1.1 Overview and Philosophy¶
1.1.1 What UmkaOS Is¶
UmkaOS (Universal Multi-Kernel Architecture OS) is a production OS kernel designed for the computing environment of 2026, built as a drop-in replacement for Linux. Unmodified Linux userspace — glibc, musl, systemd, Docker, Kubernetes, QEMU/KVM, and the entire ecosystem — runs without recompilation. The compatibility target is 99.99% of real-world usage (measured by instruction count across representative workloads, not by raw syscall number coverage — Linux has ~450 syscall numbers but a handful dominate real workloads).
The fundamental premise: a modern server is not a CPU commanding passive peripherals. It is a heterogeneous distributed system — CPU complex, DPU/SmartNIC running 16 ARM cores with its own OS, NVMe SSD running an embedded RTOS, GPU running its own memory manager and scheduler, CXL memory expanders making autonomous prefetch decisions. Linux models all of these as peripherals. UmkaOS models them as peers. This is not a styling difference — it changes what the OS kernel is, what it does, and how it is structured.
Three paths, one protocol. Devices connect to UmkaOS in three ways:
- Traditional driver (Tier 0/1/2): the host runs a driver for the device. GPUs, NVMe SSDs, USB controllers — any device with vendor firmware that doesn't speak the peer protocol. UmkaOS improves this with stable KABI, crash containment, and unified compute via AccelBase.
- Peer kernel: the device runs its own UmkaOS kernel instance and joins the cluster as a first-class member. BlueField DPUs, computational storage, multi-host RDMA clusters. Host driver: ~2,000 lines of generic transport.
- Firmware shim: the device keeps its existing RTOS but implements the UmkaOS peer protocol (~10-18K lines of C on an existing RTOS, excluding cryptographic primitives already present in the firmware stack; a reference implementation will be published with measured line counts). SAS controllers, NVMe SSD firmware, FPGA soft-cores, USB microcontrollers. No OS replacement — just a protocol implementation. The host cannot distinguish a shim from a full peer kernel.
Paths 2 and 3 speak the same wire protocol (Section 5.1). The only difference is what's behind it. As hardware evolves and more devices gain programmable cores, devices migrate from path 1 → 3 → 2. The architecture handles all three simultaneously.
All three coexist on the same host. A server might run a BlueField DPU as a peer kernel (path 2), a SAS HBA as a firmware shim (path 3), a GPU under AccelBase KABI (path 1, Tier 1), and a USB controller in Tier 2 — all at the same time.
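To make the classification concrete, here is a minimal Rust sketch — all type and function names are invented for illustration, not the UmkaOS API — of the three attachment paths and the property that paths 2 and 3 are indistinguishable to the host transport:

```rust
// Invented types for illustration; not the UmkaOS API.
#[derive(Debug, PartialEq)]
enum AttachmentPath {
    /// Path 1: host-side driver under the KABI (Tier 0/1/2).
    Driver { tier: u8 },
    /// Path 2: device runs its own UmkaOS kernel instance.
    PeerKernel,
    /// Path 3: existing RTOS implementing the peer wire protocol.
    FirmwareShim,
}

/// Paths 2 and 3 speak the same wire protocol; the host transport
/// cannot (and need not) tell them apart.
fn speaks_peer_protocol(path: &AttachmentPath) -> bool {
    matches!(
        path,
        AttachmentPath::PeerKernel | AttachmentPath::FirmwareShim
    )
}

fn main() {
    // The mixed host from the text: all three paths at once.
    let server = [
        ("BlueField DPU", AttachmentPath::PeerKernel),
        ("SAS HBA", AttachmentPath::FirmwareShim),
        ("GPU", AttachmentPath::Driver { tier: 1 }),
        ("USB controller", AttachmentPath::Driver { tier: 2 }),
    ];
    for (name, path) in &server {
        println!("{name}: peer protocol = {}", speaks_peer_protocol(path));
    }
}
```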
The kernel is written primarily in Rust (with C and assembly only for boot code and arch-specific primitives), with a stable driver ABI, hardware-assisted crash containment, and performance parity with monolithic Linux.
Beyond Linux compatibility, UmkaOS provides capabilities that Linux cannot realistically add to its existing architecture: transparent driver crash recovery without reboot, distributed kernel primitives (shared memory, distributed locking, cluster membership), a unified heterogeneous compute framework for GPUs/TPUs/NPUs/CXL devices, structured observability with automated fault management, per-cgroup power budgeting with enforcement, post-quantum cryptographic verification, and live kernel updates without downtime. These features are designed into the architecture from day one — not retrofitted as afterthoughts.
1.1.2 Replaceability Model: Nucleus and Evolvable¶
UmkaOS components are classified along two orthogonal axes: replaceability and isolation tier. These axes are independent and must not be conflated.
1.1.2.1 Replaceability Axis¶
Nucleus is the non-replaceable, formally verified core of the kernel (~18-20 KB per architecture, ~25-35 KB total across all 8 architectures; see Section 2.21 for the component-by-component enumeration). It cannot be live-evolved — a bug in Nucleus requires a reboot. The critical core within Nucleus is the evolution primitive (~2-3 KB of straight-line code), which is the primary formal verification target via Verus. Nucleus's sole active role at runtime is managing live evolution of Evolvable components. Nucleus contains only what must be correct for the evolution mechanism itself to function: the evolution primitive, capability table lookup, physical memory data structures, page table hardware ops, KABI dispatch, and the minimal scaffolding that bootstraps everything else.
Evolvable encompasses all kernel components that can be live-replaced at runtime via the evolution framework (Section 13.18). This includes policy modules, schedulers, filesystems, network stacks, device drivers, and most kernel subsystems. An Evolvable component is factored into non-replaceable verified data structures and replaceable stateless policy — the policy half can be swapped without reboot while the data structures persist across the transition.
1.1.2.2 Isolation Tier Axis¶
The isolation tier determines the hardware memory isolation boundary around a component:
- Tier 0: Ring 0, no hardware isolation domain. Full kernel privilege, shared address space with the rest of the kernel. A crash may bring down the system.
- Tier 1: Ring 0, hardware memory domain isolated (MPK/POE/DACR/segments). Crash is contained; the component can be reloaded in ~50-150 ms.
- Tier 2: Ring 3, full process isolation + IOMMU. Crash is fully contained; restart in ~10 ms.
See Section 11.2 for the complete per-architecture isolation mechanism specification.
1.1.2.3 Orthogonality: Why These Axes Are Independent¶
Both Nucleus and Evolvable code can run in Tier 0 — the same Ring 0 privilege level, the same address space, the same hardware execution context. The replaceability axis determines whether a component can be swapped at runtime; the isolation tier determines whether hardware enforces memory boundaries around it. Neither implies the other:
- A component can be Evolvable (live-replaceable) AND Tier 0 (no hardware isolation).
- A component can be Evolvable AND Tier 1 or Tier 2 (hardware-isolated).
- Nucleus is always Tier 0 — it runs in the kernel's own address space with no isolation domain, because it is the trusted base that manages everything else.
Being bug-free does not mandate Nucleus placement. The design goal is to minimize Nucleus — only code whose correctness is required for the evolution mechanism itself belongs there. Everything else is Evolvable, regardless of how critical or performance-sensitive it is.
Tier 0 inhabitants are diverse. Tier 0 contains Nucleus code, Evolvable kernel subsystems (scheduler, memory allocator, VFS core), AND dynamically loadable Tier 0 kernel modules/drivers. The kernel is not strictly monolithic — some modules load on demand but still run in Tier 0 with full privilege.
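The orthogonality rule can be captured in a few lines of Rust. This is an illustrative model with invented names, not kernel code; the one invariant it encodes is the only coupling the text allows between the axes — Nucleus implies Tier 0:

```rust
// Illustrative model with invented names; not kernel code.
#[derive(Clone, Copy, PartialEq)]
enum Replaceability {
    Nucleus,
    Evolvable,
}

#[derive(Clone, Copy, PartialEq)]
enum Tier {
    Tier0,
    Tier1,
    Tier2,
}

struct Component {
    name: &'static str,
    repl: Replaceability,
    tier: Tier,
}

/// The only coupling between the two axes: Nucleus is always Tier 0.
/// Evolvable components may sit in any tier.
fn placement_is_valid(c: &Component) -> bool {
    c.repl != Replaceability::Nucleus || c.tier == Tier::Tier0
}

fn main() {
    // Rows taken from the "Concrete Examples" table below.
    let examples = [
        Component { name: "evolution primitive", repl: Replaceability::Nucleus, tier: Tier::Tier0 },
        Component { name: "EEVDF scheduler", repl: Replaceability::Evolvable, tier: Tier::Tier0 },
        Component { name: "umka-net (TCP/IP)", repl: Replaceability::Evolvable, tier: Tier::Tier1 },
        Component { name: "USB driver", repl: Replaceability::Evolvable, tier: Tier::Tier2 },
    ];
    for c in &examples {
        assert!(placement_is_valid(c), "invalid placement: {}", c.name);
    }
    println!("all example placements satisfy the invariant");
}
```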
1.1.2.4 Quick Reference¶
| Axis | Values | What it determines |
|---|---|---|
| Replaceability | Nucleus / Evolvable | Can it be live-replaced at runtime? |
| Isolation Tier | 0 / 1 / 2 | Hardware memory isolation boundary |
1.1.2.5 Concrete Examples¶
| Component | Replaceability | Tier | Why |
|---|---|---|---|
| Evolution primitive | Nucleus | 0 | Must be correct — no recovery if buggy |
| Capability data (lookup, generation, rights) | Nucleus | 0 | ~5 instructions per check, formally verified |
| Capability policy (capable, delegation, revocation) | Evolvable | 0 | Policy decisions replaceable over 50-year lifetime |
| EEVDF scheduler | Evolvable | 0 | Can be live-replaced if improved |
| NAPI poll loop | Evolvable | 0 | Replaceable; Tier 0 for performance |
| umka-net (TCP/IP stack) | Evolvable | 1 | MPK-isolated, replaceable |
| NVMe driver | Evolvable | 1 | MPK-isolated, crash-recoverable |
| USB driver | Evolvable | 2 | Ring 3, full process isolation |
The NAPI example is instructive: NAPI runs in Tier 0 (Ring 0, no isolation domain) for performance, AND it is Evolvable (can be live-replaced via the evolution framework). Being performance-critical and running without hardware isolation does not make it Nucleus — it is not part of the formally verified evolution foundation.
1.1.2.6 Platform Isolation Summary¶
Not all architectures provide hardware mechanisms for Tier 1 (in-kernel memory domain) isolation. The following table summarizes per-platform Tier 1 availability:
| Platform | Tier 1 Mechanism | Usable Driver Domains | Tier 1 Status |
|---|---|---|---|
| x86-64 | MPK (WRPKRU) | 12 | Full |
| AArch64 (ARMv8.9+ POE) | POE (POR_EL0 + ISB) | 3 | Grouped (Cortex-X4+ / Neoverse V3+ only) |
| AArch64 (pre-POE) | Page-table + ASID | Unlimited (per-process) | Full (higher switch cost: ~150-300 cycles) |
| ARMv7 | DACR (MCR p15 + ISB) | 12 | Full |
| PPC32 | Segment registers (mtsr + isync) | 12 | Full |
| PPC64LE (POWER9+) | Radix PID (mtspr PIDR) | Per-process | Full |
| RISC-V 64 | None | 0 | Tier 1 unavailable |
| s390x | None (Storage Keys too coarse) | 0 | Tier 1 unavailable |
| LoongArch64 | None | 0 | Tier 1 unavailable |
On platforms where Tier 1 is unavailable (RISC-V, s390x, LoongArch64), drivers run as either Tier 0 (in-kernel, no isolation — equivalent to Linux monolithic drivers) or Tier 2 (Ring 3 + IOMMU, full process isolation). The placement decision depends on three factors:
- Licensing: Proprietary (non-open-source) drivers are required to run as Tier 2. They cannot be granted Tier 0 kernel-space access regardless of platform.
- Driver default preference: Each driver declares a preferred tier reflecting its performance/isolation tradeoff. Performance-critical drivers (e.g., NVMe, NIC) may default to Tier 0; less performance-sensitive drivers (e.g., USB, HID) may default to Tier 2.
- Sysadmin operational decision: The system administrator can override the default and pin any driver to any tier via boot parameters or runtime configuration. This allows site-specific security policies (e.g., "all third-party drivers must be Tier 2").
Tier 2 (Ring 3 + IOMMU) is available on all platforms. Rust memory safety provides the primary defense against driver bugs for Tier 0 drivers on platforms without Tier 1.
See Section 11.2 for the complete per-architecture mechanism specification, domain allocation tables, and switch cost analysis.
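A sketch of the three-factor placement decision, in Rust with invented types. It encodes the precedence implied above: licensing is absolute, a sysadmin pin beats the driver default, and the driver's declared preference applies otherwise.

```rust
// Invented types; sketches the decision order described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Tier0, // in-kernel, no isolation
    Tier2, // Ring 3 + IOMMU
}

struct Driver {
    open_source: bool,
    preferred: Tier,         // driver-declared default
    admin_pin: Option<Tier>, // sysadmin override, if any
}

fn place(d: &Driver) -> Tier {
    // 1. Licensing is absolute: proprietary drivers are always Tier 2.
    if !d.open_source {
        return Tier::Tier2;
    }
    // 2. A sysadmin pin overrides the driver default.
    // 3. Otherwise the driver's declared preference applies.
    d.admin_pin.unwrap_or(d.preferred)
}

fn main() {
    let nvme = Driver { open_source: true, preferred: Tier::Tier0, admin_pin: None };
    let blob = Driver { open_source: false, preferred: Tier::Tier0, admin_pin: Some(Tier::Tier0) };
    assert_eq!(place(&nvme), Tier::Tier0);
    assert_eq!(place(&blob), Tier::Tier2); // licensing wins even over a pin
    println!("placement precedence holds");
}
```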
1.2 Architecture Coverage¶
UmkaOS targets eight architectures with equal first-class status:
| Architecture | Primary deployment | Hardware isolation | eBPF JIT |
|---|---|---|---|
| x86-64 | Cloud servers, workstations | MPK (PKEY) | Phase 1 |
| AArch64 | Mobile, edge, cloud servers | POE (ARMv8.9+) / page-table | Phase 1 |
| ARMv7 | Embedded, IoT | DACR | Phase 3 |
| RISC-V 64 | Emerging servers, embedded | None (Tier 1 unavailable) | Phase 2 |
| PPC32 | Embedded, automotive | Segment registers | Phase 3 |
| PPC64LE | HPC, IBM servers | Radix PID | Phase 2 |
| s390x | IBM z systems, mainframes | Storage Keys (Tier 1 unavailable) | Phase 3 |
| LoongArch64 | Loongson servers, China ecosystem | None (Tier 1 unavailable) | Phase 3 |
eBPF JIT phasing: Phase 1 = native JIT compiler available from day one (x86-64, AArch64). Phase 2 = JIT ported (RISC-V 64, PPC64LE). Phase 3 = JIT ported (ARMv7, PPC32, s390x, LoongArch64). Interpreted fallback available on all architectures from Phase 1. See Section 19.2 for the full per-architecture JIT phasing and Section 24.2 for phase definitions.
Performance budgets and optimization priorities apply to all architectures, not just x86-64. Benchmarks and performance claims in this document are validated against each architecture. Where architecture-specific data is given, it is explicitly labelled.
1.2.1 Why UmkaOS Exists¶
Linux's monolithic architecture has fundamental limitations that are nearly impossible to fix within the existing codebase:
- The machine is no longer what Unix assumed. A 2026 server contains BlueField DPUs (16 ARM cores, own Linux instance, 200Gb/s RDMA), NVMe SSDs (embedded RTOS, flash translation, wear-leveling firmware), GPUs (own memory manager, inter-engine scheduler, PCIe P2P fabric), and CXL memory expanders (autonomous prefetch and power management). Linux models all of these as passive peripherals commanded by the host CPU. This requires 100,000–700,000+ lines of Ring 0 driver code per device class to proxy the device's own intelligence into the host OS. The abstraction is broken at its foundation.
- No first-class distribution. Cluster-wide shared memory, locking, and coherence are implemented as ad-hoc userspace layers (Ceph, GFS2, MPI) rather than kernel primitives. Every distributed system reinvents membership, failure detection, and data placement. Linux cannot express that a DPU and a host CPU share a coherent memory space — it has no model for it.
- Heterogeneous compute as afterthought. GPUs, TPUs, NPUs, DPUs, and CXL memory expanders each have their own driver stack, memory model, and scheduling framework. No unified substrate exists. amdgpu alone is ~1.5M lines of handwritten code (the amdgpu directory contains ~5.9M lines total, of which ~4.4M are auto-generated register headers). i915 is ~400K. Nvidia's out-of-tree driver is ~1M lines. Each reimplements the same memory management, scheduling, and DMA infrastructure from scratch in Ring 0.
- Device firmware treated as dumb peripheral. A BlueField DPU runs 16 ARM cores and a full Linux OS. An NVMe controller runs an embedded RTOS. A GPU runs a complete memory manager and scheduler in firmware. These devices already are computers — yet Linux commands them as passive peripherals with no kernel-level coordination, no capability model, and no structured recovery when a device misbehaves.
- No driver isolation. A single driver bug crashes the entire system. Drivers account for approximately 50% of kernel code changes and approximately 50% of all regressions and CVEs. When a driver crashes, the entire machine reboots — taking down every VM, container, and long-running job with it.
- Device drivers are the dominant attack surface. amdgpu alone is ~1.5M lines of handwritten Ring 0 code. mlx5 is ~150,000. Every line runs with full kernel privileges. A single memory-safety bug anywhere in that code equals full kernel compromise. Firmware updates require coordinated host driver updates and usually a reboot, entangling device and OS release cycles.
- No stable in-kernel ABI. Every kernel update can break out-of-tree drivers, requiring constant recompilation (DKMS). Nvidia, ZFS, and every out-of-tree module suffers from this.
- Coarse-grained locking. RTNL and other legacy locks are scalability bottlenecks on many-core systems. Documented regressions exist on 256+ core servers.
- No capability-based security. The monolithic privilege model means any kernel vulnerability equals full system compromise.
- Real-time limitations. PREEMPT_RT still trades throughput for latency and cannot eliminate all unbounded-latency paths.
- Observability bolted on. eBPF, tracepoints, /proc, /sys, and audit are separate subsystems with inconsistent interfaces, added incrementally over decades.
- "Never break userspace" constrains evolution. Decades of API debt cannot be cleaned up without breaking backward compatibility.
1.2.2 What UmkaOS Delivers¶
UmkaOS is not just "Linux with better isolation." It is a comprehensive rethink of what a production kernel should provide, addressing nine fundamental capabilities:
- Multikernel: device peers, not peripherals (Section 5.2, Section 5.3, Section 5.11) — Physically-attached devices that run their own kernel instance (BlueField DPU with 16 ARM cores, RISC-V accelerator, computational storage with Zynx SoC) participate as first-class cluster peers — not managed peripherals. Each device has its own scheduler, memory manager, and capability space. Communication is UmkaOS message passing over PCIe P2P domain ring buffers (Section 11.8 Layer 4), the same abstraction used everywhere else. The host needs no device-specific driver — a single generic umka-peer-transport module (~2,000 lines) handles every UmkaOS peer device regardless of what it does, replacing hundreds of thousands of lines of Ring 0 driver code per device class. Firmware updates are entirely the device's own responsibility: the device sends an orderly CLUSTER_LEAVE, updates its own firmware or kernel independently, and rejoins — the host never reboots, the host driver never changes, and device and OS release cycles are fully decoupled. When a device kernel crashes, the host does not crash: it executes an ordered recovery sequence — IOMMU lockout and PCIe bus master disable in under 2 ms, followed by distributed state cleanup, then optional FLR and device reboot (Section 5.3). This constitutes Tier M (Multikernel Peer) — a qualitatively different isolation class where no host kernel state is shared with the device. Tier M isolation exceeds Tier 2 (the host's strongest software-defined boundary): the only communication surface is the typed capability channel, not shared kernel address space or IOMMU-protected DMA. See Section 11.1. Devices with ARM or RISC-V cores can run the UmkaOS kernel with zero porting effort, as UmkaOS already builds for aarch64-unknown-none and riscv64gc-unknown-none-elf.
- Distributed kernel primitives (Section 5.1, Section 5.11) — Cluster-wide distributed shared memory (DSM), a distributed lock manager (DLM) with RDMA-native one-sided operations, and built-in membership and quorum protocols. A cluster of UmkaOS nodes can share memory pages, coordinate locks, and detect failures as kernel-level operations — not userspace libraries. This enables clustered filesystems, distributed caches, and multi-node workloads without bolt-on middleware. The same distributed protocol that connects RDMA-linked servers also connects locally-attached peer kernels (Section 5.2), with the transport adapted to PCIe P2P instead of RDMA network.
- Heterogeneous compute fabric (Section 22.1–Section 22.8) — A unified framework for GPUs, TPUs, NPUs, FPGAs, and CXL memory. Pluggable per-device schedulers, unified memory tiers (HBM, CXL, DDR, NVMe), and cross-device P2P transfers. New accelerator types plug into the existing framework without kernel modifications.
- Driver isolation with crash recovery (Section 11.1, Section 11.2, Section 11.9) — The enabling infrastructure for the peer model: when a driver crashes, UmkaOS recovers it in milliseconds without rebooting. Applications see a brief hiccup, not a system failure. On hardware with fast isolation (MPK, POE), this costs near-zero overhead. On hardware without it, administrators choose their trade-off: slower isolation via page tables, full performance without isolation, or per-driver demotion to userspace. The kernel adapts to available hardware rather than demanding specific features.
- Stable driver ABI (Section 12.1) — Required for device peer kernels to maintain binary compatibility across updates: drivers are binary-compatible across kernel updates. No DKMS, no recompilation on every kernel update. Third-party drivers (GPU, WiFi, storage) work across kernel versions by contract, not by accident.
- Structured observability (Section 20.1–Section 20.5) — Fault Management Architecture (FMA) with per-device health telemetry, rule-based diagnosis, and automated remediation. An object namespace (umkafs) exposes every kernel object with capability-based access control. Integrated audit logging tied to the capability system, not a separate subsystem.
- Power budgeting with enforcement (Section 7.7) — Per-cgroup power budgets in watts, multi-domain enforcement (CPU, GPU, DRAM, package), and intent-driven optimization. Datacenters can cap power per rack; laptops can maximize battery life per application.
- Post-quantum security (Section 9.3–Section 9.8) — Hybrid classical + ML-DSA signatures for kernel and driver verification from day one. No retrofitting needed when quantum computers threaten RSA/ECDSA. Confidential computing support for Intel TDX, AMD SEV-SNP, and ARM CCA.
- Live kernel evolution (Section 13.18) — Replace kernel subsystems at runtime with versioned state migration. Security patches apply without reboot. No more "Update and Restart."
1.2.3 The Core Technical Challenge¶
A kernel that treats heterogeneous devices as first-class peers, runs distributed primitives natively, and manages unified heterogeneous compute — while maintaining full Linux compatibility and performance parity with a monolithic kernel.
This is considered "impossible" because traditional microkernel and distributed OS designs impose 10-50% overhead from IPC-based isolation and cross-node coordination. UmkaOS achieves near-zero overhead through four key techniques:
- Hardware-assisted Tier 1 isolation — Using the best available mechanism on each architecture (MPK on x86, POE on AArch64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE), domain switches cost approximately 23-80 cycles — not the 600+ cycle IPC of traditional microkernels. On architectures without fast isolation (RISC-V), the kernel adapts: promote trusted drivers to Tier 0, demote untrusted drivers to Tier 2, or accept the page-table fallback overhead. See Section 11.2 for the full adaptive isolation policy.
- io_uring-style shared memory rings at every tier boundary, eliminating data copies
- PCID/ASID for TLB preservation across protection domain switches, avoiding the flush penalty
- Batch amortization of all domain-crossing costs, spreading fixed overhead across many operations
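The batch-amortization technique is just arithmetic: a fixed domain-crossing cost divided by batch size. A quick Rust check using the NAPI numbers from this chapter:

```rust
/// Fixed crossing cost divided by batch size: the whole trick.
fn amortized_cycles_per_op(switches: u32, cycles_per_switch: u32, batch_size: u32) -> f64 {
    f64::from(switches * cycles_per_switch) / f64::from(batch_size)
}

fn main() {
    // TCP RX numbers from this chapter: 4 MPK switches per NAPI poll
    // at ~23 cycles each, amortized over a 64-packet batch.
    let per_packet = amortized_cycles_per_op(4, 23, 64);
    assert!((per_packet - 1.4375).abs() < 1e-9); // 92 / 64 cycles
    // Unbatched, every packet would pay the full crossing cost.
    let unbatched = amortized_cycles_per_op(4, 23, 1);
    assert_eq!(unbatched, 92.0);
    println!("per-packet overhead: {per_packet:.2} cycles (vs {unbatched:.0} unbatched)");
}
```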
1.2.4 Design Principles¶
- Device firmware is a peer, not a servant. Modern hardware runs its own kernel: NICs run firmware (BlueField DPUs run full Linux), GPUs run scheduling and memory management firmware, storage controllers run RTOS. UmkaOS's distributed kernel design allows device-local kernels to participate as first-class members of the distributed system — not just passive devices commanded by the host. A cluster can include CPU nodes, GPU nodes, and SmartNIC nodes as equals.
- Plan for distribution from day one. Shared memory, locking, and coherence protocols are core kernel subsystems, not afterthoughts. Retrofitting distribution into a single-node kernel always produces inferior results.
- Heterogeneous compute is first-class. GPUs, accelerators, CXL memory, and disaggregated resources are not special cases — they are the normal operating environment for modern workloads.
- Performance is not negotiable. Every abstraction must justify its overhead in cycles.
- Isolation enables reliability, not security. Driver boundaries are structural — bugs can't escape their tier and crash the system. But isolation mechanisms (MPK, POE, DACR) are crash containment only, not exploitation defense. The security boundary is Tier 2 (Ring 3 + IOMMU). The enforcement mechanism adapts to what hardware provides.
- Adapt to available hardware. When the hardware provides fast isolation, use it. When it does not, degrade gracefully — do not refuse to run. A universal kernel must work on everything, even if it means honest trade-offs on some platforms.
- Rust ownership replaces runtime checks. Compile-time guarantees replace lockdep, KASAN, and similar debug-only tools.
- Stable ABI is a first-class contract. Drivers are binary-compatible across kernel updates by design.
- Linux compatibility is near-complete. If glibc, systemd, or any actively-maintained software calls it, we implement it. Only interfaces deprecated for 15+ years with zero modern users are excluded.
- No new ioctl calls for UmkaOS-specific features. ioctl(2) is a grab-bag interface: untyped integer commands, unversioned argument structs, no introspection, no capability model, and historically a major source of kernel bugs and CVEs. Linux cannot remove ioctls because of backward-compatibility obligations. UmkaOS starts clean.
New subsystems and features in UmkaOS that do not need Linux binary compatibility use typed alternatives instead of ioctls:
| Instead of | Use |
|---|---|
| ioctl(fd, CMD_FOO, &arg) | Named umkafs file: write("/ukfs/kernel/foo/config", &typed_arg) |
| Device control ioctls | io_uring typed operations (IORING_OP_*) |
| Query ioctls | Read umkafs attribute files with structured binary format |
| Event subscription ioctls | FileWatchCap or EventRing subscription |
| Batch control ioctls | io_uring SQE chains with per-operation typed structs |
Existing Linux ioctls (block device, socket, DRM, USB, etc.) are fully supported —
required for binary compatibility. New UmkaOS subsystems (AccelBase, umkafs management,
UmkaOS-specific security primitives, driver configuration extensions) expose their control
plane via umkafs typed files or io_uring operations. Driver KABI extensions are added to
the versioned KernelServicesVTable / DriverVTable (Section 12.1), not via new ioctls on
a device fd. This rule creates no compat gap: the Linux compatibility layer (Section 19.1)
implements all Linux-specified ioctls as-is.
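As an illustration of the "typed umkafs file instead of ioctl" pattern, here is a hedged Rust sketch. The FooConfig struct, its fields, and the little-endian layout are invented for this example; umkafs's actual structured binary format is specified elsewhere in this document.

```rust
use std::io::Write;

/// Hypothetical typed configuration for an illustrative
/// "/ukfs/kernel/foo/config" file. Fields are invented for this sketch.
#[repr(C)]
struct FooConfig {
    version: u32,     // a versioned struct replaces unversioned ioctl args
    queue_depth: u32,
    flags: u64,
}

/// In place of ioctl(fd, CMD_FOO, &arg): one typed, versioned write to a
/// named umkafs file. A fixed little-endian layout stands in for whatever
/// structured binary format umkafs actually specifies.
fn write_config(mut sink: impl Write, cfg: &FooConfig) -> std::io::Result<()> {
    sink.write_all(&cfg.version.to_le_bytes())?;
    sink.write_all(&cfg.queue_depth.to_le_bytes())?;
    sink.write_all(&cfg.flags.to_le_bytes())?;
    Ok(())
}

fn main() {
    let cfg = FooConfig { version: 1, queue_depth: 128, flags: 0 };
    // A Vec<u8> stands in for the opened umkafs file handle.
    let mut buf = Vec::new();
    write_config(&mut buf, &cfg).unwrap();
    assert_eq!(buf.len(), 16); // 4 + 4 + 8 bytes, introspectable by version
    println!("serialized {} bytes of typed config", buf.len());
}
```

Unlike an ioctl command number, the version field lets the kernel reject or migrate stale layouts, and the named path makes the operation auditable through the capability system.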
1.3 Performance Budget¶
Target: less than 5% total overhead versus Linux on macro benchmarks.
Steady-state scope: This budget applies to steady-state throughput workloads. Transient operations (cold start, driver reload, service migration) may temporarily exceed 5% without violating the budget.
Tail latency target: For latency-sensitive non-batched workloads (RPC microservices, database transactions, interactive key-value stores), individual request overhead may exceed the 5% throughput target due to cache-cold paths and lack of batch amortization. The tail latency target is <8% p99 overhead on fast-isolation architectures (x86-64 MPK, AArch64 POE, ARMv7 DACR, PPC) and <15% on page-table-fallback architectures. See Section 3.4 for the complete single-request overhead analysis with per-architecture breakdown and mitigation strategies.
Platform qualification: This 5% target applies to platforms with hardware isolation support (x86 MPK, ARMv8.9+ POE). On page-table-fallback platforms (pre-POE ARM), domain switch overhead is higher (~200–500 cycles vs ~23 cycles), which may exceed the 5% target on isolation-heavy workloads; on RISC-V, Tier 1 is unavailable entirely (see below). These platforms are fully supported but the performance target is relaxed on them; see Section 11.2 for per-architecture overhead estimates. See Section 1.1 for which platforms provide Tier 1 hardware isolation and how driver placement is determined on platforms without it.
Architecture-specific overhead budget:
The ≤5% overhead target applies to architectures with native fast isolation mechanisms:
- x86-64: Memory Protection Keys (MPK / WRPKRU, ~23 cycles)
- AArch64: Permission Overlay Extension (POE / MSR POR_EL0, ~40–80 cycles) or page-table ASID fallback
- ARMv7: Domain Access Control Register (DACR / MCR p15 + ISB, ~30–40 cycles)
- PPC32: Segment registers (mtsr, ~10–30 cycles)
- PPC64LE POWER9+: Radix PID (mtspr PIDR, ~30–60 cycles)
RISC-V exception: RISC-V currently has no equivalent hardware fast-isolation primitive. Tier 1 is unavailable on RISC-V — drivers run as either Tier 0 (in-kernel, no isolation) or Tier 2 (Ring 3 + IOMMU), depending on licensing, driver preference, and sysadmin decision (see Section 1.1). The RISC-V overhead budget is therefore tracked separately: Tier 0 drivers have zero isolation overhead; Tier 2 drivers use page-table-only isolation budgeted at ≤10% until hardware support arrives. See Section 11.2 for the complete per-architecture isolation analysis.
1.3.1.1 Per-Architecture Performance Overhead Targets¶
| Architecture | Isolation mechanism | Isolation overhead per domain switch | Achievable 5% budget? |
|---|---|---|---|
| x86-64 | MPK WRPKRU | ~23 cycles (~0.007 μs) | Yes — budget met at ~0.8% |
| AArch64 (ARMv8.9+ POE) | POR_EL0 write | ~40-80 cycles | Yes — budget met at ~1.2-1.8% |
| AArch64 (no POE, mainstream) | Page table + ASID | ~150-300 cycles | Yes — budget met at ~2-4% (coalescing required) |
| ARMv7 | DACR MCR p15 + ISB | ~30-40 cycles | Yes — budget met at ~1-2% |
| RISC-V 64 | None (Tier 1 unavailable) | N/A | No — no hardware fast-isolation primitive; drivers choose Tier 0 or Tier 2 until ISA support arrives |
| PPC32 | Segment registers + isync | ~20-40 cycles | Yes — budget met at ~1-2% |
| PPC64LE (POWER9+) | Radix PID PIDR | ~30-60 cycles | Yes — budget met at ~1-2% |
| s390x | Storage Keys (ISK/SSK) | N/A — Tier 1 unavailable | No — Storage Keys are page-granularity (4-bit per page), too coarse for fast domain isolation; drivers choose Tier 0 or Tier 2 |
| LoongArch64 | None | N/A — Tier 1 unavailable | No — no hardware isolation mechanism exists; drivers choose Tier 0 or Tier 2 |
RISC-V note: RISC-V has no hardware fast-isolation mechanism (no MPK, no POE equivalent). Tier 1 isolation is unavailable on RISC-V. Drivers run as either Tier 0 (in-kernel, no isolation) or Tier 2 (Ring 3 + IOMMU), depending on licensing requirements, driver default preference, and sysadmin configuration (see Section 1.1). Tier 0 drivers have zero isolation overhead but reduced fault containment — a documented tradeoff. Tier 1 remains unavailable until RISC-V ISA extensions provide suitable mechanisms.
AArch64 POE domain availability: POE provides 7 usable protection keys (3-bit PTE index, key 0 reserved). After infrastructure allocation (shared read-only ring descriptors, shared DMA pool, userspace domain, debug domain), 3 keys remain for driver isolation domains — significantly fewer than x86 MPK's 12 usable driver domains. Deployments requiring more than 3 concurrent Tier 1 driver isolation domains on AArch64 must use Tier 0 promotion (for trusted drivers) or Tier 2 (Ring 3 + IOMMU, for untrusted ones) for the excess drivers. See Section 11.2 for the full per-architecture domain allocation table.
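The POE key budget above follows directly from the field width. A one-function Rust check, with the infrastructure allocation counts taken from the paragraph above:

```rust
/// AArch64 POE key arithmetic from the paragraph above.
fn poe_driver_domains() -> u32 {
    let total_keys: u32 = 1 << 3; // 3-bit PTE overlay index => 8 keys
    let reserved = 1;             // key 0 is reserved
    let infrastructure = 4;       // RO ring descriptors, DMA pool, userspace, debug
    total_keys - reserved - infrastructure
}

fn main() {
    // 8 - 1 - 4 = 3 keys left for Tier 1 driver domains,
    // versus 12 usable driver domains under x86 MPK.
    assert_eq!(poe_driver_domains(), 3);
    println!("POE driver domains: {}", poe_driver_domains());
}
```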
The overhead comes exclusively from protection domain crossings on I/O paths — operations
that must transit between UmkaOS Core and Tier 1 drivers. Operations handled entirely within
UmkaOS Core (scheduling, memory management, page faults, vDSO calls) have zero additional
overhead compared to Linux. Syscall dispatch itself adds ~28-38 cycles for the full isolation transition on x86-64:
~23 cycles for the MPK domain switch, ~5-10 cycles for capability validation (bitmask
test via ValidatedCap), and ~2-5 cycles for KABI vtable dispatch (indirect call,
branch-predictor-friendly). This is only measurable on I/O-bound syscalls; for
compute-heavy syscalls (e.g., mmap, mprotect, brk) the ~28-38 cycles are
negligible relative to the work done.
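Summing the quoted per-component costs is a quick sanity check: they total 30-38 cycles, matching the ~28-38 figure within its stated error bars. A minimal Rust sketch:

```rust
/// (min, max) cycles for the full isolation transition, summed from
/// the per-component figures quoted above.
fn dispatch_differential() -> (u32, u32) {
    let mpk = (23, 23); // WRPKRU domain switch
    let cap = (5, 10);  // capability validation (ValidatedCap bitmask test)
    let kabi = (2, 5);  // KABI vtable dispatch (indirect call)
    (mpk.0 + cap.0 + kabi.0, mpk.1 + cap.1 + kabi.1)
}

fn main() {
    let (lo, hi) = dispatch_differential();
    println!("syscall dispatch differential: {lo}-{hi} cycles");
    assert_eq!((lo, hi), (30, 38));
}
```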
1.3.2 Per-Operation Overhead¶
Note: cycle counts in the table below are measured on x86-64. A cross-architecture comparison of the same operations is given immediately after the main table.
| Operation | Linux | UmkaOS | UmkaOS-specific differential | Notes |
|---|---|---|---|---|
| Syscall dispatch (I/O) | ~100 cycles (bare) | ~128-138 cycles (bare) | +28-38 cycles | x86-64: bare SYSCALL/SYSRET round-trip without KPTI or Spectre mitigations. Production baseline: ~700-1800 cycles (range spans microarchitectures from Haswell to Raptor Lake and varies with active Spectre mitigations). +23 cycles for MPK domain switch + ~5-10 cycles capability validation + ~2-5 cycles KABI vtable dispatch to Tier 1 driver. This differential is additive to the shared KPTI/Spectre mitigation base that both Linux and UmkaOS pay equally. UmkaOS pays the same production base cost plus this +28-38 cycle differential. On non-Meltdown-vulnerable CPUs (Intel Ice Lake+, all AMD Zen), UmkaOS avoids KPTI page table switches for intra-core syscalls. On Meltdown-vulnerable CPUs (Intel Skylake through Cascade Lake), KPTI is a hardware requirement that UmkaOS cannot avoid -- both Linux and UmkaOS pay the same KPTI cost on these CPUs.^1 |
| NVMe 4KB read (total) | ~10 us | ~10.025-10.05 us | ~0.25-0.5% | 2 effective MPK switches after shadow elision (submit + completion coalesced = 2 domain switches × 23 cycles = 46 cycles) on 10 us op. Without shadow elision: 4 switches × 23 = 92 cycles (+1%), but the shadow elision optimization (Section 11.2) merges adjacent enter/exit pairs, halving the actual cost. |
| NVMe 4KB write (syscall) | ~10 us | ~10.07-10.15 us | ~0.7-1.5% | Write path adds ~66-152 cycles over read path (itemized below). Doorbell coalescing (Section 11.7) amortizes the NVMe submission; the extra cost is from the writeback→filesystem→block layer domain traversal. See Section 3.4 for the detailed write path breakdown. |
NVMe 4KB write path itemized breakdown (cycles beyond the read path baseline):
| Component | Cycles | Notes |
|---|---|---|
| Writeback domain crossing pair | 23-46 | VFS→filesystem Tier 1 boundary (enter+exit, shadow-elided to 1 pair) |
| Bio dispatch domain crossing pair | 23-46 | Block layer→NVMe driver Tier 1 boundary (enter+exit) |
| LSM file_permission hook | 10-30 | Write permission check on the file object |
| LSM inode_permission hook | 10-30 | Write permission check on the inode |
| Total | 66-152 | Added to the read path's ~46 cycles for 2 effective domain switches |
| Operation | Linux | UmkaOS | UmkaOS-specific differential | Notes |
|---|---|---|---|---|
| TCP packet RX | ~5 us | ~5.001-5.003 us | ~0.02-0.06% | 4 MPK switches per NAPI poll (NIC driver + umka-net, enter + exit each) = ~92 cycles, amortized over NAPI batch of 64 packets = ~1.4 cycles/packet. Per-packet overhead: ~1.4 cycles on a ~5 us (20,000 cycle) op. Without NAPI batching the per-packet cost would be +2%, but NAPI batch-64 amortization reduces it to negligible levels. |
| Page fault (anonymous) | ~300 cycles | ~300 cycles | 0% | Handled entirely in UmkaOS Core |
| Context switch (minimal) | ~200 cycles | ~200 cycles | 0% | Register save/restore only (same mechanism as Linux). This is the lmbench-style minimal context switch between threads in the same address space. A full process context switch with TLB flush and cache effects costs 5,000-20,000 cycles on Linux; UmkaOS's intra-tier switches (MPK domain change) avoid this by not changing address spaces. With N active perf events: perf_schedule_out/in adds +20-50 cycles/event (PMU counter save/restore). 8 events ≈ +160-400 cycles. See Section 20.8. |
| vDSO (clock_gettime) | ~25 cycles | ~25 cycles | 0% | Mapped directly into user space |
| epoll_wait (ready) | ~80 cycles | ~80 cycles | 0% | Handled in UmkaOS Core |
| mmap (anonymous) | ~400 cycles | ~400 cycles | 0% | Handled in UmkaOS Core |
Reading the syscall dispatch row: The "~100 cycles (bare)" and "~128-138 cycles (bare)" figures are the bare SYSCALL/SYSRET instruction overhead without any OS-level mitigations. On production systems with KPTI and Spectre mitigations active, both Linux and UmkaOS pay ~700-1800 cycles for the full kernel entry/exit path. The "+28-38 cycles" differential is the UmkaOS-specific isolation cost, additive to the shared mitigation base. The table compares differentials (not totals) because both kernels pay the same KPTI/Spectre base.
At 4 GHz, 100 cycles ≈ 25 ns; 10 μs ≈ 40,000 cycles.
^1 KPTI note: Kernel Page Table Isolation is required on x86 CPUs vulnerable to Meltdown (Intel client/server cores from Nehalem through Cascade Lake). On these CPUs, every kernel entry/exit pays the KPTI page table switch cost (~200-1000 cycles depending on microarchitecture and TLB pressure). This is a hardware-imposed requirement, not a software choice -- UmkaOS cannot avoid it any more than Linux can. The "~128-138 cycles" figure in the table assumes a non-vulnerable CPU (Intel Ice Lake+, AMD Zen, ARM, RISC-V) or one with hardware Meltdown fixes. On Meltdown-vulnerable hardware, add the KPTI overhead to both the Linux and UmkaOS columns equally.
Spectre mitigation note: The cycle counts in the table above represent bare-instruction costs comparing UmkaOS vs Linux on identical hardware with identical Spectre mitigations. Retpoline and eIBRS overhead applies to both Linux and UmkaOS equally — neither kernel can avoid it. UmkaOS's differential retpoline cost is one additional indirect call per domain crossing (~15-25 cycles on pre-eIBRS hardware) due to the KABI vtable dispatch. On eIBRS-capable hardware (Intel Ice Lake+, AMD Zen 3+), this cost drops to ~2-5 cycles (predicted indirect branch). See Section 2.18 for the full per-mitigation overhead analysis across all architectures.
1.3.2.1.1 Per-Architecture Operation Cost Comparison¶
The table above gives x86-64 figures. The following table shows the same operations across all eight first-class architectures. These numbers capture structural differences in ISA design — memory ordering, TLB invalidation, trap entry — rather than microarchitectural variation.
| Operation | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE | s390x | LoongArch64 | Notes |
|---|---|---|---|---|---|---|---|---|---|
| Syscall dispatch (I/O) | ~128-138 cy | ~137-160 cy | ~150-200 cy | ~167-225 cy | ~180-250 cy | ~140-180 cy | ~150-200 cy | ~100-150 cy | ARMv7: SVC + banked register save. PPC32: sc trap + GPR save. PPC64LE: sc/scv + GPR save |
| Context switch (minimal) | ~200 cy | ~220-260 cy | ~200-300 cy | ~250-320 cy | ~250-400 cy | ~250-400 cy | ~300-500 cy | ~200-400 cy | ARMv7: DACR+ISB for Tier 1 + CONTEXTIDR for ASID. PPC32: segment register reload. PPC64LE: Radix PID or HPT switch |
| Tier 1 isolation switch | ~23 cy (WRPKRU) | ~40-80 cy (POR_EL0) or ~150-300 cy (PT) | ~30-40 cy (DACR+ISB) | N/A (Tier 0/2 fallback) | ~20-40 cy (segment+isync) | ~30-60 cy (Radix PID) | N/A (Tier 0/2 fallback) | N/A (Tier 0/2 fallback) | ARMv7 DACR is fast; PPC32 segment regs cheap; PPC64LE Radix PID moderate |
| Memory barrier (Release) | ~0 cy (TSO) | ~5-15 cy (DMB ISH) | ~5-15 cy (DMB) | ~5-20 cy (FENCE) | ~5-15 cy (lwsync) | ~5-15 cy (lwsync) | ~0 cy (TSO) | ~15-25 cy (DBAR) | PPC lwsync is lighter than hwsync; sufficient for Release |
| TLB shootdown (single page) | ~50-150 cy + IPI | ~50-150 cy + DSB+TLBI | ~100-200 cy + IPI | ~100-200 cy + IPI+SFENCE.VMA | ~100-200 cy + IPI | ~100-200 cy + IPI+TLBIE | ~500-1000 cy (SIGP+IPTE) | ~500-1500 cy (IPI+INVTLB) | ARMv7 broadcast TLBI via inner-shareable domain; PPC64LE TLBIE with RIC/PRS fields |
| RCU read-side | ~1-3 cy | ~1-3 cy | ~1-3 cy | ~1-3 cy | ~3-5 cy | ~3-5 cy | ~5-10 cy | ~5-10 cy | PPC per-CPU access via r13 base; slightly higher latency than x86 GS-based |
| Cache line bounce | ~50-70 cy | ~40-60 cy | ~40-60 cy | ~50-80 cy | ~50-80 cy | ~50-80 cy | ~50-80 cy | ~50-80 cy | ARMv7 32B cache lines on some cores; PPC 128B cache lines on POWER9+ |
All non-x86 figures are architectural estimates based on instruction latencies and design specifications. They will be calibrated against real hardware measurements in Phase 4 (Production Ready). AArch64 figures apply to Graviton 3/Neoverse V1 class hardware. RISC-V figures apply to SiFive P670-class hardware. ARMv7 figures assume Cortex-A15 class. PPC32 figures assume e500mc. PPC64LE figures assume POWER9+.
1.3.3 Macro Benchmark Targets¶
| Benchmark | Acceptable overhead vs Linux |
|---|---|
| fio randread/randwrite 4K QD32 | < 2% |
| iperf3 TCP throughput | < 3% |
| nginx small-file HTTP | < 3% |
| sysbench OLTP | < 5% |
| hackbench | < 3% |
| lmbench context switch | < 1% |
| Kernel compile (make -jN) | < 5% |
Latency-sensitive benchmarks (non-batched, single-request overhead):
| Benchmark | Acceptable overhead vs Linux | Notes |
|---|---|---|
| redis-benchmark (GET/SET, pipeline=1) | < 3% throughput, < 5% p99 | Single-packet NAPI, no batch amortization |
| wrk single-connection HTTP | < 4% throughput, < 8% p99 | 1 RPC = recv + read + send, no NAPI batch |
| pgbench (TPC-B, single client) | < 5% throughput, < 8% p99 | Multiple NVMe reads per transaction |
| gRPC unary (1 req → 1 resp) | < 4% throughput, < 8% p99 | Non-batched NIC + storage path |
1.3.4 Where the Overhead Comes From¶
The overhead budget is dominated by I/O paths that cross the isolation domain boundary between UmkaOS Core and Tier 1 drivers. Pure compute workloads, memory-intensive workloads, and scheduling-intensive workloads have effectively zero overhead because they stay entirely within UmkaOS Core.
Worst case: a micro-benchmark that issues millions of tiny I/O operations per second (e.g., 4K random IOPS at maximum queue depth). Even here, the domain-switch overhead per operation (~92 cycles on x86-64: 4 switches × ~23 cycles MPK; AArch64 POE: ~40-80 cycles per switch) is less than 1% of the approximately 10 us total device latency on hardware with fast isolation. On page-table-fallback platforms (pre-POE AArch64, RISC-V), Tier 1 isolation is coalesced or disabled to stay within the budget; see the per-architecture table above.
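The worst-case claim is simple arithmetic and can be audited directly. The helper below is illustrative (the function name is hypothetical); all inputs are the figures quoted in this chapter, not measurements:

```rust
// Audit the worst-case overhead figure: domain-switch cycles as a percentage
// of the cycles in one I/O operation.
fn overhead_pct(switches: u64, cycles_per_switch: u64, op_latency_us: f64, ghz: f64) -> f64 {
    let op_cycles = op_latency_us * ghz * 1000.0; // cycles in one I/O op
    (switches * cycles_per_switch) as f64 / op_cycles * 100.0
}

// x86-64 MPK without shadow elision: 4 switches x 23 cycles on a 10 us op at
// 4 GHz is 92 / 40,000 cycles = 0.23%. AArch64 POE at the slow end of its
// range (4 x 80 cycles) is 0.8% -- both under the "less than 1%" bound.
```

With shadow elision halving the switch count (2 × 23 cycles), the x86-64 figure drops to ~0.12%, consistent with the ~0.25-0.5% NVMe read row once the remaining per-operation costs are included.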
L1I cache displacement: The cycle counts above assume L1-hot code paths. In production workloads with multiple active Tier 1 drivers, domain switches cause L1I working set displacement — the driver's instruction footprint evicts umka-core's hot code, which must be re-fetched from L2 on return. This is a structural cost that Linux does not incur (monolithic kernel shares one working set). The realistic steady-state penalty is ~2x on isolation-related cycles, raising the x86-64 nginx-class overhead from ~0.8% (L1-hot) to ~1.2-2.4% (L2-warm) — still well within the 5% budget with 4.2% headroom. See Section 3.4 for the complete L1I displacement analysis with per-scenario breakdown.
1.3.5 Comprehensive Overhead Budget¶
The 5% target applies to macro benchmarks on all supported architectures (see per-architecture targets in the table above). The total overhead is the sum of all UmkaOS-specific costs versus a monolithic Linux kernel. This section enumerates every source of overhead so the budget can be audited. Per-event costs are shown for x86-64 MPK as the reference; see Section 11.2 for equivalent figures on other architectures.
| Source | Per-event cost (x86-64 MPK; reference) | Frequency | Contribution to macro benchmarks |
|---|---|---|---|
| MPK domain switches | ~23 cycles per WRPKRU | 2-6 per I/O op | 1-4% of I/O-heavy workloads. 0% for pure compute. |
| IOMMU DMA mapping | 0 (same as Linux) | Per DMA op | 0% — UmkaOS uses IOMMU identically to Linux. |
| KABI vtable dispatch | ~2-5 cycles (indirect call) | Per driver method call | <0.1% — indirect call vs direct call. Branch predictor hides this. |
| Capability checks | ~5-10 cycles (bit test) | Per privileged op | <0.1% — bitmask test, fully pipelined. |
| Driver state checkpointing | ~0.2-0.5 μs per checkpoint (memcpy + doorbell) | Periodic (every ~1ms) | ~0.02-0.05% — amortized over 1ms. HMAC computed asynchronously by umka-core, not on driver hot path. |
| Scheduler (EAS + PELT) | 0 (same algorithms as Linux) | Per context switch | 0% — UmkaOS uses the same CFS/EEVDF + PELT as Linux. |
| Scheduler (CBS guarantee) | ~50-100 cycles | Per CBS replenishment | <0.05% — replenishment per CBS period (typically ~1s, configurable) for CBS-enabled groups only. |
| FMA health checks | ~10-50 cycles | Per device poll (~1s) | <0.001% — background, amortized over seconds. |
| Stable tracepoints | 0 when disabled, ~20-50 cycles when enabled | Per tracepoint hit | 0% disabled. <0.1% when actively tracing. |
| umkafs object bookkeeping | ~50-100 cycles | Per object create/destroy | <0.01% — object lifecycle is cold path. |
| In-kernel inference | 500-5000 cycles per invocation | Per prefetch/scheduling decision | <0.1% — invoked on slow-path decisions (page reclaim, I/O reordering), not per-I/O. Clamped by cycle watchdog (defined in Section 22.6). |
| Per-CPU access (CpuLocal) | ~1 cycle (x86 gs: prefix) | Per slab alloc, per scheduler tick, per NAPI poll | <0.05% — matches Linux this_cpu_* cost. See Section 3.2. |
| Per-CPU access (PerCpu\<T>) | ~1-3 cycles (nosave) / ~3-8 cycles (with IRQ save) | Per non-hot per-CPU field access | <0.1% — IRQ elision (Section 3.3) eliminates save/restore in IRQ context. |
| RCU quiescent state | ~1 cycle (CpuLocal flag write) | Per outermost RCU read section exit | <0.01% — deferred to tick/switch, not per-drop. See Section 3.1. |
| Capability validation | ~7-14 cycles (amortized via ValidatedCap) | Per KABI dispatch | <0.05% — validate once, use token for 3-5 sub-calls. See Section 12.3. |
| Doorbell coalescing | ~5 cycles/cmd (amortized batch-32) | Per batched NVMe/virtio submit | <0.02% — one MMIO write per batch. See Section 11.7. |
| Isolation shadow elision | ~1 cycle (compare) vs ~23-80 (write) | Per domain switch that hits shadow | ~0.1-0.2% saved — mandatory, see Section 11.2. |
| PMU sampler kthread | CBS-limited to 5% CPU per core (default) | Per-CPU, woken on PMU overflow | ≤5% of one core when actively profiling; 0% when no perf events are open. Configurable via perf_sampler_cpu_budget_pct. See Section 20.8. |
| Aggregation counters | ~1-2 ns/event | Per tracepoint/perf event | <0.01% — amortized across batch updates. |
Workload-specific overhead estimates (x86-64 MPK, all optimizations active; representative of hardware with fast isolation):
| Workload | Dominant overhead source | Estimated total overhead |
|---|---|---|
| fio 4K random IOPS | MPK switches (shadow-elided) + doorbell | ~0.5-1.5% |
| iperf3 TCP throughput | MPK switches (NIC + TCP, NAPI-batched) | ~1.5-2.5% |
| nginx small-file HTTP | MPK switches (NIC + TCP + NVMe) | ~1-2% |
| sysbench OLTP | MPK switches (NVMe + TCP) | ~1.5-2.5% |
| hackbench (IPC-heavy) | MPK switches (scheduler stays in core) | ~0.5-1.5% |
| Kernel compile (make -jN) | Nearly zero (CPU-bound, in-core) | <1% |
| memcached (GET-heavy) | MPK switches (NIC + TCP) | ~1.5-2.5% |
| ML training (GPU) | Nearly zero (GPU work, not CPU I/O) | <1% |
| gRPC unary (non-batched) | MPK switches (NIC + NVMe, no NAPI/doorbell batch) | ~1.2-2.4% (L2-warm) |
| redis GET pipeline=1 | MPK switches (NIC, NAPI batch=1) | ~1.5-2.2% |
Distributed TX relay overhead (cross-node communication only):
| Operation | Per-hop cost | Components | Amortizable? |
|---|---|---|---|
| Distributed TX relay (per hop) | ~2-10 μs | RDMA round-trip (~1-3 μs) + capability check (~0.3-0.5 μs) + ring buffer copy (~0.1-0.3 μs) + serialization (~0.1-0.2 μs) | No — hops are sequential |
| 3-hop TX relay (worst case) | ~6-30 μs | 3 × per-hop cost | No |
Note: This overhead applies only to cross-node communication via the peer protocol (Section 5.1). Local I/O paths are unaffected. Each hop is sequential — the request must arrive at hop N before hop N+1 can begin — so batching multiple requests does not reduce per-request hop latency. The per-hop cost is dominated by the RDMA network round-trip time; the kernel overhead (capability check + ring buffer copy) is ~0.5-1 μs per hop. For single-hop operations (direct peer communication), the overhead is ~2-10 μs. Multi-hop relay occurs only when the target service is not directly reachable from the requesting node (e.g., a service behind a DPU or on a node in a different RDMA subnet). The topology-aware placement engine (Section 5.12) minimizes hop count by preferring direct routes.
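The sequential-hop cost model above can be written down directly. The struct and component values below mirror the table's ranges and are purely illustrative:

```rust
// Per-hop cost components from the distributed TX relay table (microseconds).
struct HopCost {
    rdma_rtt: f64,  // ~1-3 us: dominates the per-hop cost
    cap_check: f64, // ~0.3-0.5 us
    ring_copy: f64, // ~0.1-0.3 us
    serialize: f64, // ~0.1-0.2 us
}

impl HopCost {
    fn total(&self) -> f64 {
        self.rdma_rtt + self.cap_check + self.ring_copy + self.serialize
    }
}

// Hops are strictly sequential: request N+1 cannot start before hop N
// completes, so per-request latency is a plain sum and is independent of how
// many requests are batched together.
fn relay_latency_us(per_hop: &HopCost, hops: u32) -> f64 {
    per_hop.total() * hops as f64
}
```

At the upper end of the component ranges (3.0 + 0.5 + 0.3 + 0.2 = 4.0 μs per hop), a 3-hop relay costs 12 μs, inside the 6-30 μs worst-case window in the table.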
Key insight: the overhead budget is dominated by isolation domain switch cost multiplied by the number of domain crossings per operation. Seven core design techniques (Section 3.4) reduce the non-isolation overhead to near-zero and cut isolation overhead by ~25-50% through shadow elision and batching:
- CpuLocal register-based access — hottest per-CPU fields accessed via the architecture's dedicated per-CPU base register (x86-64: GS, AArch64: TPIDR_EL1, ARMv7: TPIDRPRW, PPC64: r13, RISC-V: tp) at ~1-10 cycles depending on architecture, matching Linux `this_cpu_*` cost (Section 3.2)
- Debug-only PerCpu CAS — borrow-state CAS present only in debug builds; zero cost in release builds (Section 3.3)
- IRQ save/restore elision via IrqDisabledGuard — `get_mut_nosave()` skips IRQ save/restore (~1-3 cycles saved) when the caller already holds `IrqDisabledGuard` (Section 3.3)
- RCU deferred quiescent state to tick/switch — `RcuReadGuard::drop` writes a CpuLocal flag (~1 cycle); the actual quiescent-state report is deferred to the next scheduler tick or context switch (Section 3.1)
- Isolation register shadow elision — WRPKRU/MSR write skipped when the shadow register value already matches, saving ~23-80 cycles per elided switch (Section 11.2)
- ValidatedCap capability amortization — validate once, use a token for 3-5 sub-calls (~7-14 cycles amortized vs per-call) (Section 12.3)
- Doorbell coalescing for NVMe/virtio — one MMIO write per batch-32, ~5 cycles/cmd amortized (Section 11.7)
The cumulative nginx-class overhead is ~0.8% on x86-64 (MPK), ~1.5% on AArch64 with POE, and ~2-4% on mainstream AArch64 (page-table fallback with coalescing). This leaves substantial headroom under the 5% budget on all architectures for implementation-phase unknowns (cache effects, compiler variation, subsystem interactions). All seven techniques are implemented from day one — none are deferred optimizations. See Section 3.4 for the cumulative per-architecture overhead analysis with complete breakdowns.
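Of the seven techniques, doorbell coalescing is compact enough to sketch here. The types and the mock MMIO counter are hypothetical illustrations of the batching pattern, not the real NVMe queue implementation (Section 11.7 defines that):

```rust
// Sketch of doorbell coalescing: queue submissions locally and ring the
// device doorbell once per batch instead of once per command, so the MMIO
// cost amortizes to ~5 cycles/cmd at batch-32.
struct SubmitQueue {
    pending: Vec<u64>,    // queued command IDs
    batch: usize,         // flush threshold (batch-32 in the text)
    doorbell_writes: u32, // mock: counts MMIO doorbell writes
}

impl SubmitQueue {
    fn submit(&mut self, cmd: u64) {
        self.pending.push(cmd);
        if self.pending.len() >= self.batch {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        // Real code: a single MMIO write to the submission-queue tail
        // doorbell covers every command queued since the last write.
        self.doorbell_writes += 1;
        self.pending.clear();
    }
}
```

Submitting 64 commands at batch-32 costs two doorbell writes instead of 64; an explicit `flush()` covers partial batches at the end of a submission burst.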
1.3.6 Counter and Identifier Longevity Budget¶
UmkaOS targets 50-year continuous uptime via live kernel evolution (Section 13.18). Every monotonically increasing counter and identifier must be safe against overflow for the full operational lifetime. This section documents the worst-case exhaustion analysis for every such counter.
1.3.6.1 Design Rule¶
All kernel-internal identifiers that are not constrained by an external protocol MUST use u64. The only exceptions are:
- Identifiers constrained by an external ABI (Linux syscall ABI, NFS/RPC wire format, POSIX IPC) — these use the protocol-mandated width (typically u32) and MUST document their wrap-safety analysis inline.
- Identifiers packed into fixed-size structures where space is at a premium (e.g., `NetBufHandle`) — these use the minimum safe width with an inline longevity analysis comment.
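The second exception typically pairs a narrow handle with a full-width internal counter. The sketch below illustrates one plausible slot+generation layout for an EventHandle-style u32 handle (24-bit slot index plus an 8-bit generation tag drawn from a per-slot u64); the exact field widths and type names are assumptions for illustration, not the Section 19.8 definition:

```rust
// Slot+generation handles: the u32 handle packs a 24-bit slot index with the
// low 8 bits of the slot's u64 generation. The u64 never wraps (design rule);
// the 8-bit tag only needs to catch stale handles across slot recycling.
const SLOT_BITS: u32 = 24;
const SLOT_MASK: u32 = (1 << SLOT_BITS) - 1;

struct Slot {
    generation: u64, // full-width counter, bumped on every reuse
    live: bool,
}

struct HandleTable {
    slots: Vec<Slot>,
}

impl HandleTable {
    fn alloc(&mut self, idx: usize) -> u32 {
        let s = &mut self.slots[idx];
        s.generation += 1;
        s.live = true;
        (idx as u32 & SLOT_MASK) | ((s.generation as u32 & 0xFF) << SLOT_BITS)
    }

    fn free(&mut self, idx: usize) {
        self.slots[idx].live = false;
    }

    fn resolve(&self, handle: u32) -> Option<usize> {
        let idx = (handle & SLOT_MASK) as usize;
        let tag = (handle >> SLOT_BITS) as u8;
        let s = self.slots.get(idx)?;
        // Stale handle: slot was freed, or recycled since this handle was issued.
        if !s.live || (s.generation as u8) != tag {
            return None;
        }
        Some(idx)
    }
}
```

A handle issued before a free/realloc cycle fails `resolve` because its generation tag no longer matches the slot's current generation, which is what makes slot recycling safe despite the narrow handle width.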
1.3.6.2 Audit Table¶
| Counter | Width | Location | Worst-case rate | Time to wrap | Mitigation |
|---|---|---|---|---|---|
| Cgroup::id | u64 | Section 17.2 | 10K creates/sec | 58M years | Safe |
| DriverDomainId | u64 | Section 12.1 | 100 loads/sec | 5.8B years | Safe (changed from u32) |
| DriverDomain::generation | u64 | Section 12.1 | 100 crashes/sec | 5.8B years | Safe; operator reset at threshold |
| t0_vtable_generation | u64 | Section 13.18 | 100 swaps/sec | 5.8B years | Incremented on each Tier 0 vtable swap; same counter space as DriverDomain::generation |
| CapValidationToken::cap_generation | u64 | Section 12.1 | 1M revocations/sec | 584K years | Safe |
| NetBufHandle::generation | u32 | Section 16.5 | 8.3M pkts/sec (100Gbps) | 515 sec | Safe: no handle outlives RCU grace period (~10ms). Changed from u16 (was 7.8ms!) |
| MceLog::head | u64 | Section 2.23 | burst: 64/NMI | >10^18 years at 1/sec | Safe: u64 is effectively infinite; drain resets to 0 |
| AutofsMount::next_token | u64 | Section 14.10 | 100 mounts/sec | 5.8B years | Safe (changed from u32; wire token truncated to u32 for Linux ABI) |
| EventHandle / SemHandle | u32 | Section 19.8 | 1M creates/sec | n/a (slot recycled) | Slot+generation: 24-bit slot index recycled; internal per-slot generation is u64 |
| KeySerial | i32 | Section 10.2 | 1K keys/sec | n/a (recycled) | Protocol-mandated signed (Linux key_serial_t = int32_t); only positive values allocated (1..=0x7FFFFFFF). Recycled via XArray slot reuse; per-slot generation in Key struct |
| NFS xid_counter | u32 | Section 15.11 | 100K RPCs/sec | 11.9 hours | Protocol-mandated (RFC 5531); collision handled by (client_addr, xid) tuple |
| PipeBuffer::read_idx/write_idx | u32 | Section 17.3 | 1M ops/sec | 71 min | Modular counter, not monotonic ID — wrapping is correct by design (difference arithmetic). Exempt from u64 rule. Correctness invariants: (1) capacity is power-of-two (so capacity divides 2^32), AND (2) producer blocks when ring is full (guaranteed by pipe semantics — back-pressure prevents producer-consumer gap from exceeding capacity). Without back-pressure, a stalled consumer and a wrapping producer would produce indistinguishable gap values |
| KRL::version | u64 | Section 9.3 | 1 update/sec | 584B years | Safe |
| TrustAnchorChain::epoch | u64 | Section 9.3 | 1 rotation/year | ~1.8×10^19 years | Safe |
| Cgroup::generation | u64 | Section 17.2 | 1M limit changes/sec | 584K years | Safe |
| RCU grace period counter | u64 | Section 3.5 | 1M GPs/sec | 584K years | Safe |