
Chapter 1: Architecture Overview

Design philosophy, architectural goals, performance budget


UmkaOS is a production Rust kernel designed as a drop-in Linux replacement. Unmodified Linux userspace (glibc, systemd, Docker, K8s) runs without recompile. Internal design prioritizes correctness over Linux imitation — where Linux has known flaws, UmkaOS does it right. The performance budget targets ≤5% throughput overhead on macro benchmarks and <8% p99 tail latency for non-batched RPC workloads on fast-isolation architectures.

1.1 Overview and Philosophy

1.1.1 What UmkaOS Is

UmkaOS (Universal Multi-Kernel Architecture OS) is a production OS kernel designed for the computing environment of 2026, built as a drop-in replacement for Linux. Unmodified Linux userspace — glibc, musl, systemd, Docker, Kubernetes, QEMU/KVM, and the entire ecosystem — runs without recompilation. The compatibility target is 99.99% of real-world usage (measured by instruction count across representative workloads, not by raw syscall number coverage — Linux has ~450 syscall numbers but a handful dominate real workloads).

The fundamental premise: a modern server is not a CPU commanding passive peripherals. It is a heterogeneous distributed system — CPU complex, DPU/SmartNIC running 16 ARM cores with its own OS, NVMe SSD running an embedded RTOS, GPU running its own memory manager and scheduler, CXL memory expanders making autonomous prefetch decisions. Linux models all of these as peripherals. UmkaOS models them as peers. This is not a stylistic difference — it changes what the OS kernel is, what it does, and how it is structured.

Three paths, one protocol. Devices connect to UmkaOS in three ways:

  1. Traditional driver (Tier 0/1/2): the host runs a driver for the device. GPUs, NVMe SSDs, USB controllers — any device with vendor firmware that doesn't speak the peer protocol. UmkaOS improves this with stable KABI, crash containment, and unified compute via AccelBase.
  2. Peer kernel: the device runs its own UmkaOS kernel instance and joins the cluster as a first-class member. BlueField DPUs, computational storage, multi-host RDMA clusters. Host driver: ~2,000 lines of generic transport.
  3. Firmware shim: the device keeps its existing RTOS but implements the UmkaOS peer protocol (~10-18K lines of C on an existing RTOS, excluding cryptographic primitives already present in the firmware stack; a reference implementation will be published with measured line counts). SAS controllers, NVMe SSD firmware, FPGA soft-cores, USB microcontrollers. No OS replacement — just a protocol implementation. The host cannot distinguish a shim from a full peer kernel.

Paths 2 and 3 speak the same wire protocol (Section 5.1). The only difference is what's behind it. As hardware evolves and more devices gain programmable cores, devices migrate from path 1 → 3 → 2. The architecture handles all three simultaneously.

All three coexist on the same host. A server might run a BlueField DPU as a peer kernel (path 2), a SAS HBA as a firmware shim (path 3), a GPU under AccelBase KABI (path 1, Tier 1), and a USB controller in Tier 2 — all at the same time.

The kernel is written primarily in Rust (with C and assembly only for boot code and arch-specific primitives), with a stable driver ABI, hardware-assisted crash containment, and performance parity with monolithic Linux.

Beyond Linux compatibility, UmkaOS provides capabilities that Linux cannot realistically add to its existing architecture: transparent driver crash recovery without reboot, distributed kernel primitives (shared memory, distributed locking, cluster membership), a unified heterogeneous compute framework for GPUs/TPUs/NPUs/CXL devices, structured observability with automated fault management, per-cgroup power budgeting with enforcement, post-quantum cryptographic verification, and live kernel updates without downtime. These features are designed into the architecture from day one — not retrofitted as afterthoughts.

1.1.2 Replaceability Model: Nucleus and Evolvable

UmkaOS components are classified along two orthogonal axes: replaceability and isolation tier. These axes are independent and must not be conflated.

1.1.2.1 Replaceability Axis

Nucleus is the non-replaceable, formally verified core of the kernel (~18-20 KB per architecture, ~25-35 KB total across all 8 architectures; see Section 2.21 for the component-by-component enumeration). It cannot be live-evolved — a bug in Nucleus requires a reboot. The critical core within Nucleus is the evolution primitive (~2-3 KB of straight-line code), which is the primary formal verification target via Verus. Nucleus's sole active role at runtime is managing live evolution of Evolvable components. Nucleus contains only what must be correct for the evolution mechanism itself to function: the evolution primitive, capability table lookup, physical memory data structures, page table hardware ops, KABI dispatch, and the minimal scaffolding that bootstraps everything else.

Evolvable encompasses all kernel components that can be live-replaced at runtime via the evolution framework (Section 13.18). This includes policy modules, schedulers, filesystems, network stacks, device drivers, and most kernel subsystems. An Evolvable component is factored into non-replaceable verified data structures and replaceable stateless policy — the policy half can be swapped without reboot while the data structures persist across the transition.
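The data-structure/policy factoring can be sketched in Rust. This is a hypothetical illustration, not the Section 13.18 API: the names (`SchedPolicy`, `Scheduler`, `evolve`) are invented here, and a real evolution would go through the verified evolution primitive with versioned state migration.

```rust
/// Replaceable half: pure policy over borrowed state, holding no state itself.
trait SchedPolicy {
    fn pick_next(&self, runnable: &[TaskState]) -> Option<usize>;
}

/// Non-replaceable half: data that persists across policy swaps.
struct TaskState {
    vruntime: u64,
}

struct Scheduler {
    tasks: Vec<TaskState>,        // persists across evolution
    policy: Box<dyn SchedPolicy>, // swapped by the evolution framework
}

impl Scheduler {
    /// Live evolution: only the stateless policy half is replaced;
    /// `tasks` is untouched, so no reboot and no state loss.
    fn evolve(&mut self, new_policy: Box<dyn SchedPolicy>) {
        self.policy = new_policy;
    }
}

/// One concrete policy: pick the task with the smallest vruntime.
struct MinVruntime;
impl SchedPolicy for MinVruntime {
    fn pick_next(&self, runnable: &[TaskState]) -> Option<usize> {
        runnable
            .iter()
            .enumerate()
            .min_by_key(|(_, t)| t.vruntime)
            .map(|(i, _)| i)
    }
}

fn main() {
    let mut sched = Scheduler {
        tasks: vec![TaskState { vruntime: 40 }, TaskState { vruntime: 10 }],
        policy: Box::new(MinVruntime),
    };
    println!("next: {:?}", sched.policy.pick_next(&sched.tasks)); // Some(1)
    sched.evolve(Box::new(MinVruntime)); // swap in a new policy version
}
```

The point of the factoring is visible in `evolve`: because the policy trait object holds no state, replacing it is a pointer swap, while the persistent `tasks` vector survives the transition.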

1.1.2.2 Isolation Tier Axis

The isolation tier determines the hardware memory isolation boundary around a component:

  • Tier 0: Ring 0, no hardware isolation domain. Full kernel privilege, shared address space with the rest of the kernel. A crash may bring down the system.
  • Tier 1: Ring 0, hardware memory domain isolated (MPK/POE/DACR/segments). Crash is contained; the component can be reloaded in ~50-150 ms.
  • Tier 2: Ring 3, full process isolation + IOMMU. Crash is fully contained; restart in ~10 ms.

See Section 11.2 for the complete per-architecture isolation mechanism specification.

1.1.2.3 Orthogonality: Why These Axes Are Independent

Both Nucleus code and Evolvable code can run in Tier 0 — the same Ring 0 privilege level, the same address space, the same hardware execution context. The replaceability axis determines whether a component can be swapped at runtime; the isolation tier determines whether hardware enforces memory boundaries around it. Neither implies the other:

  • A component can be Evolvable (live-replaceable) AND Tier 0 (no hardware isolation).
  • A component can be Evolvable AND Tier 1 or Tier 2 (hardware-isolated).
  • Nucleus is always Tier 0 — it runs in the kernel's own address space with no isolation domain, because it is the trusted base that manages everything else.

Being bug-free does not mandate Nucleus placement. The design goal is to minimize Nucleus — only code whose correctness is required for the evolution mechanism itself belongs there. Everything else is Evolvable, regardless of how critical or performance-sensitive it is.

Tier 0 inhabitants are diverse. Tier 0 contains Nucleus code, Evolvable kernel subsystems (scheduler, memory allocator, VFS core), AND dynamically loadable Tier 0 kernel modules/drivers. The kernel is not strictly monolithic — some modules load on demand but still run in Tier 0 with full privilege.
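The two axes and their single dependency (Nucleus is always Tier 0) can be modeled as a small Rust sketch. The type names (`Replaceability`, `IsolationTier`, `Component`) are illustrative, not actual UmkaOS types:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Replaceability {
    Nucleus,   // non-replaceable, formally verified; a bug requires reboot
    Evolvable, // live-replaceable via the evolution framework
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IsolationTier {
    Tier0, // Ring 0, no hardware isolation domain
    Tier1, // Ring 0, hardware memory domain (MPK/POE/DACR/segments)
    Tier2, // Ring 3, full process isolation + IOMMU
}

struct Component {
    name: &'static str,
    replaceability: Replaceability,
    tier: IsolationTier,
}

impl Component {
    /// The only constraint linking the axes: Nucleus is always Tier 0.
    /// Evolvable components may inhabit any tier.
    fn is_valid(&self) -> bool {
        self.replaceability != Replaceability::Nucleus
            || self.tier == IsolationTier::Tier0
    }
}

fn main() {
    // Evolvable AND un-isolated: legal (the NAPI case).
    let napi = Component {
        name: "napi-poll",
        replaceability: Replaceability::Evolvable,
        tier: IsolationTier::Tier0,
    };
    // Nucleus outside Tier 0: illegal by construction.
    let bad = Component {
        name: "evolution-primitive-in-tier1",
        replaceability: Replaceability::Nucleus,
        tier: IsolationTier::Tier1,
    };
    println!("{} valid: {}", napi.name, napi.is_valid());
    println!("{} valid: {}", bad.name, bad.is_valid());
}
```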

1.1.2.4 Quick Reference

Axis | Values | What it determines
Replaceability | Nucleus / Evolvable | Can it be live-replaced at runtime?
Isolation | Tier 0 / 1 / 2 | Hardware memory isolation boundary

1.1.2.5 Concrete Examples

Component | Replaceability | Tier | Why
Evolution primitive | Nucleus | 0 | Must be correct — no recovery if buggy
Capability data (lookup, generation, rights) | Nucleus | 0 | ~5 instructions per check, formally verified
Capability policy (capable, delegation, revocation) | Evolvable | 0 | Policy decisions replaceable over 50-year lifetime
EEVDF scheduler | Evolvable | 0 | Can be live-replaced if improved
NAPI poll loop | Evolvable | 0 | Replaceable; Tier 0 for performance
umka-net (TCP/IP stack) | Evolvable | 1 | MPK-isolated, replaceable
NVMe driver | Evolvable | 1 | MPK-isolated, crash-recoverable
USB driver | Evolvable | 2 | Ring 3, full process isolation

The NAPI example is instructive: NAPI runs in Tier 0 (Ring 0, no isolation domain) for performance, AND it is Evolvable (can be live-replaced via the evolution framework). Being performance-critical and running without hardware isolation does not make it Nucleus — it is not part of the formally verified evolution foundation.

1.1.2.6 Platform Isolation Summary

Not all architectures provide hardware mechanisms for Tier 1 (in-kernel memory domain) isolation. The following table summarizes per-platform Tier 1 availability:

Platform | Tier 1 Mechanism | Usable Driver Domains | Tier 1 Status
x86-64 | MPK (WRPKRU) | 12 | Full
AArch64 (ARMv8.9+ POE) | POE (POR_EL0 + ISB) | 3 | Grouped (Cortex-X4+ / Neoverse V3+ only)
AArch64 (pre-POE) | Page-table + ASID | Unlimited (per-process) | Full (higher switch cost: ~150-300 cycles)
ARMv7 | DACR (MCR p15 + ISB) | 12 | Full
PPC32 | Segment registers (mtsr + isync) | 12 | Full
PPC64LE (POWER9+) | Radix PID (mtspr PIDR) | Per-process | Full
RISC-V 64 | None | 0 | Tier 1 unavailable
s390x | None (Storage Keys too coarse) | 0 | Tier 1 unavailable
LoongArch64 | None | 0 | Tier 1 unavailable

On platforms where Tier 1 is unavailable (RISC-V, s390x, LoongArch64), drivers run as either Tier 0 (in-kernel, no isolation — equivalent to Linux monolithic drivers) or Tier 2 (Ring 3 + IOMMU, full process isolation). The placement decision depends on three factors:

  1. Licensing: Proprietary (non-open-source) drivers are required to run as Tier 2. They cannot be granted Tier 0 kernel-space access regardless of platform.
  2. Driver default preference: Each driver declares a preferred tier reflecting its performance/isolation tradeoff. Performance-critical drivers (e.g., NVMe, NIC) may default to Tier 0; less performance-sensitive drivers (e.g., USB, HID) may default to Tier 2.
  3. Sysadmin operational decision: The system administrator can override the default and pin any driver to any tier via boot parameters or runtime configuration. This allows site-specific security policies (e.g., "all third-party drivers must be Tier 2").
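The three placement factors compose into a simple resolution order, sketched below. This is an illustrative model, not the actual UmkaOS configuration API; the names are hypothetical, and the fallback choice (a Tier 1 request degrades to Tier 2 when Tier 1 hardware is absent) is one reasonable reading of the text, chosen here because Tier 2 is available on every platform.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Tier0, // in-kernel, no isolation
    Tier1, // hardware memory domain (where available)
    Tier2, // Ring 3 + IOMMU, available everywhere
}

struct DriverInfo {
    open_source: bool, // licensing: proprietary drivers are forced to Tier 2
    preferred: Tier,   // the driver's declared default tier
}

/// Resolve the tier a driver actually runs in, given platform Tier 1
/// availability and an optional sysadmin pin.
fn resolve_tier(drv: &DriverInfo, tier1_available: bool, admin_pin: Option<Tier>) -> Tier {
    // 1. Licensing is absolute: proprietary code never gets kernel-space
    //    access, regardless of platform or configuration.
    if !drv.open_source {
        return Tier::Tier2;
    }
    // 2. A sysadmin pin overrides the driver's preference;
    // 3. otherwise the driver's declared default applies.
    let wanted = admin_pin.unwrap_or(drv.preferred);
    // On platforms without Tier 1 hardware (RISC-V, s390x, LoongArch64),
    // a Tier 1 request falls back to Tier 2 in this sketch.
    if wanted == Tier::Tier1 && !tier1_available {
        Tier::Tier2
    } else {
        wanted
    }
}

fn main() {
    let nvme = DriverInfo { open_source: true, preferred: Tier::Tier0 };
    let vendor_blob = DriverInfo { open_source: false, preferred: Tier::Tier0 };
    println!("{:?}", resolve_tier(&nvme, false, None));        // Tier0
    println!("{:?}", resolve_tier(&vendor_blob, true, None));  // Tier2
}
```

A site policy such as "all third-party drivers must be Tier 2" is then just an `admin_pin` of `Tier::Tier2` applied to the relevant drivers.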

Tier 2 (Ring 3 + IOMMU) is available on all platforms. Rust memory safety provides the primary defense against driver bugs for Tier 0 drivers on platforms without Tier 1.

See Section 11.2 for the complete per-architecture mechanism specification, domain allocation tables, and switch cost analysis.

1.2 Architecture Coverage

UmkaOS targets eight architectures with equal first-class status:

Architecture | Primary deployment | Hardware isolation | eBPF JIT
x86-64 | Cloud servers, workstations | MPK (PKEY) | Phase 1
AArch64 | Mobile, edge, cloud servers | POE (ARMv8.9+) / page-table | Phase 1
ARMv7 | Embedded, IoT | DACR | Phase 3
RISC-V 64 | Emerging servers, embedded | None (Tier 1 unavailable) | Phase 2
PPC32 | Embedded, automotive | Segment registers | Phase 3
PPC64LE | HPC, IBM servers | Radix PID | Phase 2
s390x | IBM z systems, mainframes | Storage Keys (Tier 1 unavailable) | Phase 3
LoongArch64 | Loongson servers, China ecosystem | None (Tier 1 unavailable) | Phase 3

eBPF JIT phasing: Phase 1 = native JIT compiler available from day one (x86-64, AArch64). Phase 2 = JIT ported during the second development phase (RISC-V 64, PPC64LE). Phase 3 = JIT ported during the third phase (ARMv7, PPC32, s390x, LoongArch64). Interpreted fallback available on all architectures from Phase 1. See Section 19.2 for the full per-architecture JIT phasing and Section 24.2 for phase definitions.

Performance budgets and optimization priorities apply to all architectures, not just x86-64. Benchmarks and performance claims in this document are validated against each architecture. Where architecture-specific data is given, it is explicitly labelled.


1.2.1 Why UmkaOS Exists

Linux's monolithic architecture has fundamental limitations that are nearly impossible to fix within the existing codebase:

  • The machine is no longer what Unix assumed. A 2026 server contains BlueField DPUs (16 ARM cores, own Linux instance, 200Gb/s RDMA), NVMe SSDs (embedded RTOS, flash translation, wear-leveling firmware), GPUs (own memory manager, inter-engine scheduler, PCIe P2P fabric), and CXL memory expanders (autonomous prefetch and power management). Linux models all of these as passive peripherals commanded by the host CPU. This requires 100,000–700,000+ lines of Ring 0 driver code per device class to proxy the device's own intelligence into the host OS. The abstraction is broken at its foundation.

  • No first-class distribution. Cluster-wide shared memory, locking, and coherence are implemented as ad-hoc userspace layers (Ceph, GFS2, MPI) rather than kernel primitives. Every distributed system reinvents membership, failure detection, and data placement. Linux cannot express that a DPU and a host CPU share a coherent memory space — it has no model for it.

  • Heterogeneous compute as afterthought. GPUs, TPUs, NPUs, DPUs, and CXL memory expanders each have their own driver stack, memory model, and scheduling framework. No unified substrate exists. amdgpu alone is ~1.5M lines of handwritten code (the amdgpu directory contains ~5.9M lines total, of which ~4.4M are auto-generated register headers). i915 is ~400K. Nvidia's out-of-tree driver is ~1M lines. Each reimplements the same memory management, scheduling, and DMA infrastructure from scratch in Ring 0.

  • Device firmware treated as dumb peripheral. A BlueField DPU runs 16 ARM cores and a full Linux OS. An NVMe controller runs an embedded RTOS. A GPU runs a complete memory manager and scheduler in firmware. These devices already are computers — yet Linux commands them as passive peripherals with no kernel-level coordination, no capability model, and no structured recovery when a device misbehaves.

  • No driver isolation. A single driver bug crashes the entire system. Drivers account for approximately 50% of kernel code changes and approximately 50% of all regressions and CVEs. When a driver crashes, the entire machine reboots — taking down every VM, container, and long-running job with it.

  • Device drivers are the dominant attack surface. amdgpu alone is ~1.5M lines of handwritten Ring 0 code. mlx5 is ~150,000. Every line runs with full kernel privileges. A single memory-safety bug anywhere in that code equals full kernel compromise. Firmware updates require coordinated host driver updates and usually a reboot, entangling device and OS release cycles.

  • No stable in-kernel ABI. Every kernel update can break out-of-tree drivers, requiring constant recompilation (DKMS). Nvidia, ZFS, and every out-of-tree module suffers from this.

  • Coarse-grained locking. RTNL and other legacy locks are scalability bottlenecks on many-core systems. Documented regressions exist on 256+ core servers.

  • No capability-based security. The monolithic privilege model means any kernel vulnerability equals full system compromise.

  • Real-time limitations. PREEMPT_RT still trades throughput for latency and cannot eliminate all unbounded-latency paths.

  • Observability bolted on. eBPF, tracepoints, /proc, /sys, and audit are separate subsystems with inconsistent interfaces, added incrementally over decades.

  • "Never break userspace" constrains evolution. Decades of API debt cannot be cleaned up without breaking backward compatibility.

1.2.2 What UmkaOS Delivers

UmkaOS is not just "Linux with better isolation." It is a comprehensive rethink of what a production kernel should provide, addressing nine fundamental capabilities:

  1. Multikernel: device peers, not peripherals (Section 5.2, Section 5.3, Section 5.11) — Physically-attached devices that run their own kernel instance (BlueField DPU with 16 ARM cores, RISC-V accelerator, computational storage with Zynq SoC) participate as first-class cluster peers — not managed peripherals. Each device has its own scheduler, memory manager, and capability space. Communication is UmkaOS message passing over PCIe P2P domain ring buffers (Section 11.8 Layer 4), the same abstraction used everywhere else. The host needs no device-specific driver — a single generic umka-peer-transport module (~2,000 lines) handles every UmkaOS peer device regardless of what it does, replacing hundreds of thousands of lines of Ring 0 driver code per device class. Firmware updates are entirely the device's own responsibility: the device sends an orderly CLUSTER_LEAVE, updates its own firmware or kernel independently, and rejoins — the host never reboots, the host driver never changes, and device and OS release cycles are fully decoupled. When a device kernel crashes, the host does not crash: it executes an ordered recovery sequence — IOMMU lockout and PCIe bus master disable in under 2 ms, followed by distributed state cleanup, then optional FLR and device reboot (Section 5.3). This constitutes Tier M (Multikernel Peer) — a qualitatively different isolation class where no host kernel state is shared with the device. Tier M isolation exceeds Tier 2 (the host's strongest software-defined boundary): the only communication surface is the typed capability channel, not shared kernel address space or IOMMU-protected DMA. See Section 11.1. Devices with ARM or RISC-V cores can run the UmkaOS kernel with zero porting effort, as UmkaOS already builds for aarch64-unknown-none and riscv64gc-unknown-none-elf.

  2. Distributed kernel primitives (Section 5.1, Section 5.11) — Cluster-wide distributed shared memory (DSM), a distributed lock manager (DLM) with RDMA-native one-sided operations, and built-in membership and quorum protocols. A cluster of UmkaOS nodes can share memory pages, coordinate locks, and detect failures as kernel-level operations — not userspace libraries. This enables clustered filesystems, distributed caches, and multi-node workloads without bolt-on middleware. The same distributed protocol that connects RDMA-linked servers also connects locally-attached peer kernels (Section 5.2), with the transport adapted to PCIe P2P instead of RDMA network.

  3. Heterogeneous compute fabric (Sections 22.1–22.8) — A unified framework for GPUs, TPUs, NPUs, FPGAs, and CXL memory. Pluggable per-device schedulers, unified memory tiers (HBM, CXL, DDR, NVMe), and cross-device P2P transfers. New accelerator types plug into the existing framework without kernel modifications.

  4. Driver isolation with crash recovery (Section 11.1, Section 11.2, Section 11.9) — The enabling infrastructure for the peer model: when a driver crashes, UmkaOS recovers it in milliseconds without rebooting. Applications see a brief hiccup, not a system failure. On hardware with fast isolation (MPK, POE), this costs near-zero overhead. On hardware without it, administrators choose their trade-off: slower isolation via page tables, full performance without isolation, or per-driver demotion to userspace. The kernel adapts to available hardware rather than demanding specific features.

  5. Stable driver ABI (Section 12.1) — Drivers are binary-compatible across kernel updates, a property the device peer model also depends on. No DKMS, no recompilation on every kernel update. Third-party drivers (GPU, WiFi, storage) work across kernel versions by contract, not by accident.

  6. Structured observability (Sections 20.1–20.5) — Fault Management Architecture (FMA) with per-device health telemetry, rule-based diagnosis, and automated remediation. An object namespace (umkafs) exposes every kernel object with capability-based access control. Integrated audit logging tied to the capability system, not a separate subsystem.

  7. Power budgeting with enforcement (Section 7.7) — Per-cgroup power budgets in watts, multi-domain enforcement (CPU, GPU, DRAM, package), and intent-driven optimization. Datacenters can cap power per rack; laptops can maximize battery life per application.

  8. Post-quantum security (Sections 9.3–9.8) — Hybrid classical + ML-DSA signatures for kernel and driver verification from day one. No retrofitting needed when quantum computers threaten RSA/ECDSA. Confidential computing support for Intel TDX, AMD SEV-SNP, and ARM CCA.

  9. Live kernel evolution (Section 13.18) — Replace kernel subsystems at runtime with versioned state migration. Security patches apply without reboot. No more "Update and Restart."

1.2.3 The Core Technical Challenge

A kernel that treats heterogeneous devices as first-class peers, runs distributed primitives natively, and manages unified heterogeneous compute — while maintaining full Linux compatibility and performance parity with a monolithic kernel.

This is considered "impossible" because traditional microkernel and distributed OS designs impose 10-50% overhead from IPC-based isolation and cross-node coordination. UmkaOS achieves near-zero overhead through four key techniques:

  1. Hardware-assisted Tier 1 isolation — Using the best available mechanism on each architecture (MPK on x86, POE on AArch64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE), domain switches cost approximately 23-80 cycles — not the 600+ cycle IPC of traditional microkernels. On architectures without fast isolation (RISC-V), the kernel adapts: promote trusted drivers to Tier 0, demote untrusted drivers to Tier 2, or accept the page-table fallback overhead. See Section 11.2 for the full adaptive isolation policy.
  2. io_uring-style shared memory rings at every tier boundary, eliminating data copies
  3. PCID/ASID for TLB preservation across protection domain switches, avoiding the flush penalty
  4. Batch amortization of all domain-crossing costs, spreading fixed overhead across many operations
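Technique 4 can be made concrete with a back-of-envelope model: a fixed per-batch cost of domain switches is divided across the operations in the batch. The numbers below reuse the NAPI figures from Section 1.3.2 (4 MPK switches at ~23 cycles each, batch of 64 packets); the function itself is illustrative, not kernel code.

```rust
/// Fixed domain-crossing cost paid once per batch, divided across the batch.
fn per_op_cycles(switches: u32, cycles_per_switch: u32, batch: u32) -> f64 {
    (switches * cycles_per_switch) as f64 / batch as f64
}

fn main() {
    // Unbatched: every packet pays all four domain switches itself.
    println!("unbatched: {} cycles/packet", per_op_cycles(4, 23, 1)); // 92
    // NAPI batch of 64: the same 92 cycles cover the whole batch,
    // matching the ~1.4 cycles/packet figure quoted in Section 1.3.2.
    println!("batch-64:  {:.1} cycles/packet", per_op_cycles(4, 23, 64)); // ~1.4
}
```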

1.2.4 Design Principles

  • Device firmware is a peer, not a servant. Modern hardware runs its own kernel: NICs run firmware (BlueField DPUs run full Linux), GPUs run scheduling and memory management firmware, storage controllers run RTOS. UmkaOS's distributed kernel design allows device-local kernels to participate as first-class members of the distributed system — not just passive devices commanded by the host. A cluster can include CPU nodes, GPU nodes, and SmartNIC nodes as equals.
  • Plan for distribution from day one. Shared memory, locking, and coherence protocols are core kernel subsystems, not afterthoughts. Retrofitting distribution into a single-node kernel always produces inferior results.
  • Heterogeneous compute is first-class. GPUs, accelerators, CXL memory, and disaggregated resources are not special cases — they are the normal operating environment for modern workloads.
  • Performance is not negotiable. Every abstraction must justify its overhead in cycles.
  • Isolation enables reliability, not security. Driver boundaries are structural — bugs can't escape their tier and crash the system. But isolation mechanisms (MPK, POE, DACR) are crash containment only, not exploitation defense. The security boundary is Tier 2 (Ring 3 + IOMMU). The enforcement mechanism adapts to what hardware provides.
  • Adapt to available hardware. When the hardware provides fast isolation, use it. When it does not, degrade gracefully — do not refuse to run. A universal kernel must work on everything, even if it means honest trade-offs on some platforms.
  • Rust ownership replaces runtime checks. Compile-time guarantees replace lockdep, KASAN, and similar debug-only tools.
  • Stable ABI is a first-class contract. Drivers are binary-compatible across kernel updates by design.
  • Linux compatibility is near-complete. If glibc, systemd, or any actively-maintained software calls it, we implement it. Only interfaces deprecated for 15+ years with zero modern users are excluded.
  • No new ioctl calls for UmkaOS-specific features. ioctl(2) is a grab-bag interface: untyped integer commands, unversioned argument structs, no introspection, no capability model, and historically a major source of kernel bugs and CVEs. Linux cannot remove ioctls because of backward-compatibility obligations. UmkaOS starts clean.

New subsystems and features in UmkaOS that do not need Linux binary compatibility use typed alternatives instead of ioctls:

Instead of | Use
ioctl(fd, CMD_FOO, &arg) | Named umkafs file: write("/ukfs/kernel/foo/config", &typed_arg)
Device control ioctls | io_uring typed operations (IORING_OP_*)
Query ioctls | Read umkafs attribute files with structured binary format
Event subscription ioctls | FileWatchCap or EventRing subscription
Batch control ioctls | io_uring SQE chains with per-operation typed structs

Existing Linux ioctls (block device, socket, DRM, USB, etc.) are fully supported — required for binary compatibility. New UmkaOS subsystems (AccelBase, umkafs management, UmkaOS-specific security primitives, driver configuration extensions) expose their control plane via umkafs typed files or io_uring operations. Driver KABI extensions are added to the versioned KernelServicesVTable / DriverVTable (Section 12.1), not via new ioctls on a device fd. This rule creates no compat gap: the Linux compatibility layer (Section 19.1) implements all Linux-specified ioctls as-is.
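The first row of the table can be sketched as follows. This is a hedged illustration: `FooConfig` and its fields are hypothetical, and the `/ukfs/kernel/foo/config` path comes from the example in the table, not from a real subsystem. The point is that the argument is a versioned, fixed-layout struct rather than an untyped ioctl blob.

```rust
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct FooConfig {
    version: u32,     // versioned, unlike a typical ioctl arg struct
    queue_depth: u32,
    flags: u64,
}

impl FooConfig {
    /// View the struct as the exact bytes that would be written to the
    /// umkafs config file. A real kernel would validate `version` and the
    /// payload length before accepting the write.
    fn as_bytes(&self) -> &[u8] {
        // Sound for this plain-old-data #[repr(C)] layout
        // (4 + 4 + 8 bytes, no padding holes).
        unsafe {
            std::slice::from_raw_parts(
                (self as *const FooConfig) as *const u8,
                std::mem::size_of::<FooConfig>(),
            )
        }
    }
}

fn main() {
    let cfg = FooConfig { version: 1, queue_depth: 256, flags: 0 };
    // In place of ioctl(fd, CMD_FOO, &arg), a typed umkafs write:
    //   std::fs::write("/ukfs/kernel/foo/config", cfg.as_bytes())
    println!("typed write payload: {} bytes", cfg.as_bytes().len());
}
```

Because the struct carries an explicit version field and a known length, the kernel side can reject mismatched writes and introspection tools can decode the payload — both impossible with an opaque ioctl command number.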


1.3 Performance Budget

Target: less than 5% total overhead versus Linux on macro benchmarks.

Steady-state scope: This budget applies to steady-state throughput workloads. Transient operations (cold start, driver reload, service migration) may temporarily exceed 5% without violating the budget.

Tail latency target: For latency-sensitive non-batched workloads (RPC microservices, database transactions, interactive key-value stores), individual request overhead may exceed the 5% throughput target due to cache-cold paths and lack of batch amortization. The tail latency target is <8% p99 overhead on fast-isolation architectures (x86-64 MPK, AArch64 POE, ARMv7 DACR, PPC) and <15% on page-table-fallback architectures. See Section 3.4 for the complete single-request overhead analysis with per-architecture breakdown and mitigation strategies.

Platform qualification: This 5% target applies to platforms with hardware isolation support (x86 MPK, ARMv8.9+ POE). On page-table-fallback platforms (RISC-V, pre-POE ARM), domain switch overhead is higher (~200–500 cycles vs ~23 cycles), which may exceed the 5% target on isolation-heavy workloads. These platforms are fully supported but the performance target is relaxed on them; see Section 11.2 for per-architecture overhead estimates. See Section 1.1 for which platforms provide Tier 1 hardware isolation and how driver placement is determined on platforms without it.

Architecture-specific overhead budget:

The ≤5% overhead target applies to architectures with native fast isolation mechanisms:

  • x86-64: Memory Protection Keys (MPK / WRPKRU, ~23 cycles)
  • AArch64: Permission Overlay Extension (POE / MSR POR_EL0, ~40–80 cycles) or page-table ASID fallback
  • ARMv7: Domain Access Control Register (DACR / MCR p15 + ISB, ~30–40 cycles)
  • PPC32: Segment registers (mtsr, ~10–30 cycles)
  • PPC64LE POWER9+: Radix PID (mtspr PIDR, ~30–60 cycles)

RISC-V exception: RISC-V currently has no equivalent hardware fast-isolation primitive. Tier 1 is unavailable on RISC-V — drivers run as either Tier 0 (in-kernel, no isolation) or Tier 2 (Ring 3 + IOMMU), depending on licensing, driver preference, and sysadmin decision (see Section 1.1). The RISC-V overhead budget is therefore tracked separately: Tier 0 drivers have zero isolation overhead; Tier 2 drivers use page-table-only isolation budgeted at ≤10% until hardware support arrives. See Section 11.2 for the complete per-architecture isolation analysis.

1.3.1.1 Per-Architecture Performance Overhead Targets

Architecture | Isolation mechanism | Isolation overhead per domain switch | Achievable 5% budget?
x86-64 | MPK (WRPKRU) | ~23 cycles (~0.007 μs) | Yes — budget met at ~0.8%
AArch64 (ARMv8.9+ POE) | POE (POR_EL0 write) | ~40-80 cycles | Yes — budget met at ~1.2-1.8%
AArch64 (no POE, mainstream) | Page table + ASID | ~150-300 cycles | Yes — budget met at ~2-4% (coalescing required)
ARMv7 | DACR (MCR p15 + ISB) | ~30-40 cycles | Yes — budget met at ~1-2%
RISC-V 64 | Page table only | ~200-500 cycles | Marginal — Tier 1 isolation disabled by default; upgrade when hardware support arrives
PPC32 | Segment registers (+ isync) | ~20-40 cycles | Yes — budget met at ~1-2%
PPC64LE (POWER9+) | Radix PID (mtspr PIDR) | ~30-60 cycles | Yes — budget met at ~1-2%
s390x | Storage Keys (ISK/SSK) | N/A — Tier 1 unavailable | No — Storage Keys are page-granularity (4-bit per page), too coarse for fast domain isolation; drivers choose Tier 0 or Tier 2
LoongArch64 | None | N/A — Tier 1 unavailable | No — no hardware isolation mechanism exists; drivers choose Tier 0 or Tier 2

RISC-V note: RISC-V has no hardware fast-isolation mechanism (no MPK, no POE equivalent). Tier 1 isolation is unavailable on RISC-V. Drivers run as either Tier 0 (in-kernel, no isolation) or Tier 2 (Ring 3 + IOMMU), depending on licensing requirements, driver default preference, and sysadmin configuration (see Section 1.1). Tier 0 drivers have zero isolation overhead but reduced fault containment — a documented tradeoff. Tier 1 remains unavailable until RISC-V ISA extensions provide suitable mechanisms.

AArch64 POE domain availability: POE provides 7 usable protection keys (3-bit PTE index, key 0 reserved). After infrastructure allocation (shared read-only ring descriptors, shared DMA pool, userspace domain, debug domain), 3 keys remain for driver isolation domains — significantly fewer than x86 MPK's 12 usable driver domains. Deployments requiring more than 3 concurrent Tier 1 driver isolation domains on AArch64 must use Tier 0 promotion (for trusted drivers) or Tier 2 (Ring 3 + IOMMU, for untrusted ones) for the excess drivers. See Section 11.2 for the full per-architecture domain allocation table.

The overhead comes exclusively from protection domain crossings on I/O paths — operations that must transit between UmkaOS Core and Tier 1 drivers. Operations handled entirely within UmkaOS Core (scheduling, memory management, page faults, vDSO calls) have zero additional overhead compared to Linux. Syscall dispatch itself adds ~28-38 cycles for the full isolation transition on x86-64: ~23 cycles for the MPK domain switch, ~5-10 cycles for capability validation (bitmask test via ValidatedCap), and ~2-5 cycles for KABI vtable dispatch (indirect call, branch-predictor-friendly). This is only measurable on I/O-bound syscalls; for compute-heavy syscalls (e.g., mmap, mprotect, brk) the ~28-38 cycles are negligible relative to the work done.

1.3.2 Per-Operation Overhead

Note: cycle counts in the table below are measured on x86-64. A cross-architecture comparison of the same operations is given immediately after the main table.

Operation | Linux | UmkaOS | UmkaOS-specific differential | Notes
Syscall dispatch (I/O) | ~100 cycles (bare) | ~128-138 cycles (bare) | +28-38 cycles | x86-64: bare SYSCALL/SYSRET round-trip without KPTI or Spectre mitigations. Production baseline: ~700-1800 cycles (range spans microarchitectures from Haswell to Raptor Lake and varies with active Spectre mitigations). +23 cycles for MPK domain switch + ~5-10 cycles capability validation + ~2-5 cycles KABI vtable dispatch to Tier 1 driver. This differential is additive to the production KPTI/Spectre mitigation base that both Linux and UmkaOS pay equally. On non-Meltdown-vulnerable CPUs (Intel Ice Lake+, all AMD Zen), UmkaOS avoids KPTI page table switches for intra-core syscalls. On Meltdown-vulnerable CPUs (Intel Skylake through Cascade Lake), KPTI is a hardware requirement that UmkaOS cannot avoid — both Linux and UmkaOS pay the same KPTI cost on these CPUs.^1
NVMe 4KB read (total) | ~10 us | ~10.025-10.05 us | ~0.25-0.5% | 2 effective MPK switches after shadow elision (submit + completion coalesced = 2 domain switches × 23 cycles = 46 cycles) on a 10 us op. Without shadow elision: 4 switches × 23 = 92 cycles (+1%), but the shadow elision optimization (Section 11.2) merges adjacent enter/exit pairs, halving the actual cost.
NVMe 4KB write (syscall) | ~10 us | ~10.07-10.15 us | ~0.7-1.5% | Write path adds ~66-152 cycles over read path (itemized below). Doorbell coalescing (Section 11.7) amortizes the NVMe submission; the extra cost is from the writeback→filesystem→block layer domain traversal. See Section 3.4 for the detailed write path breakdown.

NVMe 4KB write path itemized breakdown (cycles beyond the read path baseline):

| Component | Cycles | Notes |
|---|---|---|
| Writeback domain crossing pair | 23-46 | VFS→filesystem Tier 1 boundary (enter+exit, shadow-elided to 1 pair) |
| Bio dispatch domain crossing pair | 23-46 | Block layer→NVMe driver Tier 1 boundary (enter+exit) |
| LSM file_permission hook | 10-30 | Write permission check on the file object |
| LSM inode_permission hook | 10-30 | Write permission check on the inode |
| Total | 66-152 | Added to the read path's ~46 cycles for 2 effective domain switches |
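The component ranges above can be sanity-checked against the quoted total; a minimal sketch, with the constants taken from the breakdown:

```rust
// The four write-path components (two domain-crossing pairs, two LSM hooks)
// should sum to the quoted 66-152 cycle envelope.
fn write_path_differential() -> (u32, u32) {
    let lows: u32 = [23, 23, 10, 10].iter().sum();
    let highs: u32 = [46, 46, 30, 30].iter().sum();
    (lows, highs)
}

fn main() {
    assert_eq!(write_path_differential(), (66, 152));
}
```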
| Operation | Linux | UmkaOS | UmkaOS-specific differential | Notes |
|---|---|---|---|---|
| TCP packet RX | ~5 us | ~5.001-5.003 us | ~0.02-0.06% | 4 MPK switches per NAPI poll (NIC driver + umka-net, enter + exit each) = ~92 cycles, amortized over a NAPI batch of 64 packets ≈ ~1.4 cycles/packet on a ~5 us (20,000-cycle) op. Without NAPI batching the per-packet cost would be +2%; batch-64 amortization reduces it to negligible levels. |
| Page fault (anonymous) | ~300 cycles | ~300 cycles | 0% | Handled entirely in UmkaOS Core |
| Context switch (minimal) | ~200 cycles | ~200 cycles | 0% | Register save/restore only (same mechanism as Linux). This is the lmbench-style minimal context switch between threads in the same address space. A full process context switch with TLB flush and cache effects costs 5,000-20,000 cycles on Linux; UmkaOS's intra-tier switches (MPK domain change) avoid this by not changing address spaces. With N active perf events: perf_schedule_out/in adds +20-50 cycles/event (PMU counter save/restore); 8 events ≈ +160-400 cycles. See Section 20.8. |
| vDSO (clock_gettime) | ~25 cycles | ~25 cycles | 0% | Mapped directly into user space |
| epoll_wait (ready) | ~80 cycles | ~80 cycles | 0% | Handled in UmkaOS Core |
| mmap (anonymous) | ~400 cycles | ~400 cycles | 0% | Handled in UmkaOS Core |
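The NAPI amortization arithmetic in the TCP RX row can be made explicit; the helper name is illustrative, and the constants come from the table:

```rust
// Per-packet isolation cost: a NAPI poll pays a fixed number of MPK
// switches, spread over however many packets the batch contains.
fn per_packet_cycles(switches: u32, cycles_per_switch: u32, batch: u32) -> f64 {
    (switches * cycles_per_switch) as f64 / batch as f64
}

fn main() {
    // 4 switches × 23 cycles over a 64-packet batch ≈ 1.4 cycles/packet.
    assert_eq!(per_packet_cycles(4, 23, 64), 1.4375);
    // Unbatched worst case: the full 92 cycles land on a single packet.
    assert_eq!(per_packet_cycles(4, 23, 1), 92.0);
}
```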

Reading the syscall dispatch row: The "~100 cycles (bare)" and "~128-138 cycles (bare)" figures are the bare SYSCALL/SYSRET instruction overhead without any OS-level mitigations. On production systems with KPTI and Spectre mitigations active, both Linux and UmkaOS pay ~700-1800 cycles for the full kernel entry/exit path. The "+28-38 cycles" differential is the UmkaOS-specific isolation cost, additive to the shared mitigation base. The table compares differentials (not totals) because both kernels pay the same KPTI/Spectre base.

At 4 GHz, 100 cycles ≈ 25 ns; 10 μs ≈ 40,000 cycles.

^1 KPTI note: Kernel Page Table Isolation is required on x86 CPUs vulnerable to Meltdown (Intel client/server cores from Nehalem through Cascade Lake). On these CPUs, every kernel entry/exit pays the KPTI page table switch cost (~200-1000 cycles depending on microarchitecture and TLB pressure). This is a hardware-imposed requirement, not a software choice -- UmkaOS cannot avoid it any more than Linux can. The "~128-138 cycles" figure in the table assumes a non-vulnerable CPU (Intel Ice Lake+, AMD Zen, ARM, RISC-V) or one with hardware Meltdown fixes. On Meltdown-vulnerable hardware, add the KPTI overhead to both the Linux and UmkaOS columns equally.

Spectre mitigation note: The cycle counts in the table above represent bare-instruction costs comparing UmkaOS vs Linux on identical hardware with identical Spectre mitigations. Retpoline and eIBRS overhead applies to both Linux and UmkaOS equally — neither kernel can avoid it. UmkaOS's differential retpoline cost is one additional indirect call per domain crossing (~15-25 cycles on pre-eIBRS hardware) due to the KABI vtable dispatch. On eIBRS-capable hardware (Intel Ice Lake+, AMD Zen 3+), this cost drops to ~2-5 cycles (predicted indirect branch). See Section 2.18 for the full per-mitigation overhead analysis across all architectures.

1.3.2.1 Per-Architecture Operation Cost Comparison

The table above gives x86-64 figures. The following table shows the same operations across all eight first-class architectures. These numbers capture structural differences in ISA design — memory ordering, TLB invalidation, trap entry — rather than microarchitectural variation.

| Operation | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE | s390x | LoongArch64 | Notes |
|---|---|---|---|---|---|---|---|---|---|
| Syscall dispatch (I/O) | ~128-138 cy | ~137-160 cy | ~150-200 cy | ~167-225 cy | ~180-250 cy | ~140-180 cy | ~150-200 cy | ~100-150 cy | ARMv7: SVC + banked register save. PPC32: sc trap + GPR save. PPC64LE: sc/scv + GPR save |
| Context switch (minimal) | ~200 cy | ~220-260 cy | ~200-300 cy | ~250-320 cy | ~250-400 cy | ~250-400 cy | ~300-500 cy | ~200-400 cy | ARMv7: DACR+ISB for Tier 1 + CONTEXTIDR for ASID. PPC32: segment register reload. PPC64LE: Radix PID or HPT switch |
| Tier 1 isolation switch | ~23 cy (WRPKRU) | ~40-80 cy (POR_EL0) or ~150-300 cy (PT) | ~30-40 cy (DACR+ISB) | N/A (Tier 0/2 fallback) | ~20-40 cy (segment+isync) | ~30-60 cy (Radix PID) | N/A (Tier 0/2 fallback) | N/A (Tier 0/2 fallback) | ARMv7 DACR is fast; PPC32 segment regs cheap; PPC64LE Radix PID moderate |
| Memory barrier (Release) | ~0 cy (TSO) | ~5-15 cy (DMB ISH) | ~5-15 cy (DMB) | ~5-20 cy (FENCE) | ~5-15 cy (lwsync) | ~5-15 cy (lwsync) | ~0 cy (TSO) | ~15-25 cy (DBAR) | PPC lwsync is lighter than hwsync; sufficient for Release |
| TLB shootdown (single page) | ~50-150 cy + IPI | ~50-150 cy + DSB+TLBI | ~100-200 cy + IPI | ~100-200 cy + IPI+SFENCE.VMA | ~100-200 cy + IPI | ~100-200 cy + IPI+TLBIE | ~500-1000 cy (SIGP+IPTE) | ~500-1500 cy (IPI+INVTLB) | ARMv7 broadcast TLBI via inner-shareable domain; PPC64LE TLBIE with RIC/PRS fields |
| RCU read-side | ~1-3 cy | ~1-3 cy | ~1-3 cy | ~1-3 cy | ~3-5 cy | ~3-5 cy | ~5-10 cy | ~5-10 cy | PPC per-CPU access via r13 base; slightly higher latency than x86 GS-based |
| Cache line bounce | ~50-70 cy | ~40-60 cy | ~40-60 cy | ~50-80 cy | ~50-80 cy | ~50-80 cy | ~50-80 cy | ~50-80 cy | ARMv7 32B cache lines on some cores; PPC 128B cache lines on POWER9+ |

All non-x86 figures are architectural estimates based on instruction latencies and design specifications. They will be calibrated against real hardware measurements in Phase 4 (Production Ready). AArch64 figures apply to Graviton 3/Neoverse V1 class hardware. RISC-V figures apply to SiFive P670-class hardware. ARMv7 figures assume Cortex-A15 class. PPC32 figures assume e500mc. PPC64LE figures assume POWER9+.

1.3.3 Macro Benchmark Targets

| Benchmark | Acceptable overhead vs Linux |
|---|---|
| fio randread/randwrite 4K QD32 | < 2% |
| iperf3 TCP throughput | < 3% |
| nginx small-file HTTP | < 3% |
| sysbench OLTP | < 5% |
| hackbench | < 3% |
| lmbench context switch | < 1% |
| Kernel compile (make -jN) | < 5% |

Latency-sensitive benchmarks (non-batched, single-request overhead):

| Benchmark | Acceptable overhead vs Linux | Notes |
|---|---|---|
| redis-benchmark (GET/SET, pipeline=1) | < 3% throughput, < 5% p99 | Single-packet NAPI, no batch amortization |
| wrk single-connection HTTP | < 4% throughput, < 8% p99 | 1 RPC = recv + read + send, no NAPI batch |
| pgbench (TPC-B, single client) | < 5% throughput, < 8% p99 | Multiple NVMe reads per transaction |
| gRPC unary (1 req → 1 resp) | < 4% throughput, < 8% p99 | Non-batched NIC + storage path |

1.3.4 Where the Overhead Comes From

The overhead budget is dominated by I/O paths that cross the isolation domain boundary between UmkaOS Core and Tier 1 drivers. Pure compute workloads, memory-intensive workloads, and scheduling-intensive workloads have effectively zero overhead because they stay entirely within UmkaOS Core.

Worst case: a micro-benchmark that issues millions of tiny I/O operations per second (e.g., 4K random IOPS at maximum queue depth). Even here, the ~92 cycles of domain-switch overhead per operation (x86-64 MPK: ~23 cycles; AArch64 POE: ~40-80 cycles) is less than 1% of the approximately 10 us total device latency on hardware with fast isolation. On page-table-fallback platforms (pre-POE AArch64, RISC-V), Tier 1 isolation is coalesced or disabled to stay within the budget; see the per-architecture table above.

L1I cache displacement: The cycle counts above assume L1-hot code paths. In production workloads with multiple active Tier 1 drivers, domain switches cause L1I working set displacement — the driver's instruction footprint evicts umka-core's hot code, which must be re-fetched from L2 on return. This is a structural cost that Linux does not incur (a monolithic kernel shares one working set). The realistic steady-state penalty is ~2x on isolation-related cycles, raising the x86-64 nginx-class overhead from ~0.8% (L1-hot) to ~1.2-2.4% (L2-warm) — still well within the 5% budget, leaving ~2.6-3.8% of headroom. See Section 3.4 for the complete L1I displacement analysis with per-scenario breakdown.

1.3.5 Comprehensive Overhead Budget

The 5% target applies to macro benchmarks on all supported architectures (see per-architecture targets in the table above). The total overhead is the sum of all UmkaOS-specific costs versus a monolithic Linux kernel. This section enumerates every source of overhead so the budget can be audited. Per-event costs are shown for x86-64 MPK as the reference; see Section 11.2 for equivalent figures on other architectures.

| Source | Per-event cost (x86-64 MPK; reference) | Frequency | Contribution to macro benchmarks |
|---|---|---|---|
| MPK domain switches | ~23 cycles per WRPKRU | 2-6 per I/O op | 1-4% of I/O-heavy workloads. 0% for pure compute. |
| IOMMU DMA mapping | 0 (same as Linux) | Per DMA op | 0% — UmkaOS uses the IOMMU identically to Linux. |
| KABI vtable dispatch | ~2-5 cycles (indirect call) | Per driver method call | <0.1% — indirect call vs direct call. The branch predictor hides this. |
| Capability checks | ~5-10 cycles (bit test) | Per privileged op | <0.1% — bitmask test, fully pipelined. |
| Driver state checkpointing | ~0.2-0.5 μs per checkpoint (memcpy + doorbell) | Periodic (every ~1ms) | ~0.02-0.05% — amortized over 1ms. HMAC computed asynchronously by umka-core, not on the driver hot path. |
| Scheduler (EAS + PELT) | 0 (same algorithms as Linux) | Per context switch | 0% — UmkaOS uses the same CFS/EEVDF + PELT as Linux. |
| Scheduler (CBS guarantee) | ~50-100 cycles | Per CBS replenishment | <0.05% — replenishment per CBS period (typically ~1s, configurable) for CBS-enabled groups only. |
| FMA health checks | ~10-50 cycles | Per device poll (~1s) | <0.001% — background, amortized over seconds. |
| Stable tracepoints | 0 when disabled, ~20-50 cycles when enabled | Per tracepoint hit | 0% disabled. <0.1% when actively tracing. |
| umkafs object bookkeeping | ~50-100 cycles | Per object create/destroy | <0.01% — object lifecycle is cold path. |
| In-kernel inference | 500-5000 cycles per invocation | Per prefetch/scheduling decision | <0.1% — invoked on slow-path decisions (page reclaim, I/O reordering), not per-I/O. Clamped by the cycle watchdog (defined in Section 22.6). |
| Per-CPU access (CpuLocal) | ~1 cycle (x86 gs: prefix) | Per slab alloc, per scheduler tick, per NAPI poll | <0.05% — matches Linux this_cpu_* cost. See Section 3.2. |
| Per-CPU access (PerCpu\<T>) | ~1-3 cycles (nosave) / ~3-8 cycles (with IRQ save) | Per non-hot per-CPU field access | <0.1% — IRQ elision (Section 3.3) eliminates save/restore in IRQ context. |
| RCU quiescent state | ~1 cycle (CpuLocal flag write) | Per outermost RCU read section exit | <0.01% — deferred to tick/switch, not per-drop. See Section 3.1. |
| Capability validation | ~7-14 cycles (amortized via ValidatedCap) | Per KABI dispatch | <0.05% — validate once, use token for 3-5 sub-calls. See Section 12.3. |
| Doorbell coalescing | ~5 cycles/cmd (amortized batch-32) | Per batched NVMe/virtio submit | <0.02% — one MMIO write per batch. See Section 11.7. |
| Isolation shadow elision | ~1 cycle (compare) vs ~23-80 (write) | Per domain switch that hits the shadow | ~0.1-0.2% saved — mandatory, see Section 11.2. |
| PMU sampler kthread | CBS-limited to 5% CPU per core (default) | Per-CPU, woken on PMU overflow | ≤5% of one core when actively profiling; 0% when no perf events are open. Configurable via perf_sampler_cpu_budget_pct. See Section 20.8. |
| Aggregation counters | ~1-2 ns/event | Per tracepoint/perf event | <0.01% — amortized across batch updates. |
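The amortization pattern in the checkpointing row is worth making concrete: a periodic fixed cost contributes its duty cycle, not its per-event cost. A minimal sketch, with the ~0.2-0.5 μs / ~1 ms figures taken from the table:

```rust
// Duty cycle of a periodic cost: fraction of runtime spent on the work,
// expressed as a percentage.
fn duty_cycle_pct(cost_us: f64, period_us: f64) -> f64 {
    cost_us / period_us * 100.0
}

fn main() {
    // 0.2-0.5 us of checkpoint work every ~1 ms -> 0.02-0.05% overhead.
    assert!((duty_cycle_pct(0.2, 1000.0) - 0.02).abs() < 1e-9);
    assert!((duty_cycle_pct(0.5, 1000.0) - 0.05).abs() < 1e-9);
}
```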

Workload-specific overhead estimates (x86-64 MPK, all optimizations active; representative of hardware with fast isolation):

| Workload | Dominant overhead source | Estimated total overhead |
|---|---|---|
| fio 4K random IOPS | MPK switches (shadow-elided) + doorbell | ~0.5-1.5% |
| iperf3 TCP throughput | MPK switches (NIC + TCP, NAPI-batched) | ~1.5-2.5% |
| nginx small-file HTTP | MPK switches (NIC + TCP + NVMe) | ~1-2% |
| sysbench OLTP | MPK switches (NVMe + TCP) | ~1.5-2.5% |
| hackbench (IPC-heavy) | MPK switches (scheduler stays in core) | ~0.5-1.5% |
| Kernel compile (make -jN) | Nearly zero (CPU-bound, in-core) | <1% |
| memcached (GET-heavy) | MPK switches (NIC + TCP) | ~1.5-2.5% |
| ML training (GPU) | Nearly zero (GPU work, not CPU I/O) | <1% |
| gRPC unary (non-batched) | MPK switches (NIC + NVMe, no NAPI/doorbell batch) | ~1.2-2.4% (L2-warm) |
| redis GET pipeline=1 | MPK switches (NIC, NAPI batch=1) | ~1.5-2.2% |

Distributed TX relay overhead (cross-node communication only):

| Operation | Per-hop cost | Components | Amortizable? |
|---|---|---|---|
| Distributed TX relay (per hop) | ~2-10 μs | RDMA round-trip (~1-3 μs) + capability check (~0.3-0.5 μs) + ring buffer copy (~0.1-0.3 μs) + serialization (~0.1-0.2 μs) | No — hops are sequential |
| 3-hop TX relay (worst case) | ~6-30 μs | 3 × per-hop cost | No |

Note: This overhead applies only to cross-node communication via the peer protocol (Section 5.1). Local I/O paths are unaffected. Each hop is sequential — the request must arrive at hop N before hop N+1 can begin — so batching multiple requests does not reduce per-request hop latency. The per-hop cost is dominated by the RDMA network round-trip time; the kernel overhead (capability check + ring buffer copy) is ~0.5-1 μs per hop. For single-hop operations (direct peer communication), the overhead is ~2-10 μs. Multi-hop relay occurs only when the target service is not directly reachable from the requesting node (e.g., a service behind a DPU or on a node in a different RDMA subnet). The topology-aware placement engine (Section 5.12) minimizes hop count by preferring direct routes.
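Because hops are strictly sequential, the worst-case relay latency is a straight sum of per-hop cost ranges; a small sketch with the figures from the table above (the helper itself is illustrative):

```rust
// Sequential hops add: total latency range = hop count × per-hop range.
// Batching cannot reduce this, since hop N+1 waits for hop N.
fn relay_latency_us(hops: u32, per_hop_us: (f64, f64)) -> (f64, f64) {
    (hops as f64 * per_hop_us.0, hops as f64 * per_hop_us.1)
}

fn main() {
    // 3 hops at ~2-10 us each -> the table's ~6-30 us worst case.
    assert_eq!(relay_latency_us(3, (2.0, 10.0)), (6.0, 30.0));
}
```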

Key insight: the overhead budget is dominated by isolation domain switch cost multiplied by the number of domain crossings per operation. Seven core design techniques (Section 3.4) reduce the non-isolation overhead to near-zero and cut isolation overhead by ~25-50% through shadow elision and batching:

  1. CpuLocal register-based access — hottest per-CPU fields accessed via the architecture's dedicated per-CPU base register (x86-64: GS, AArch64: TPIDR_EL1, ARMv7: TPIDRPRW, PPC64: r13, RISC-V: tp) at ~1-10 cycles depending on architecture, matching Linux this_cpu_* cost (Section 3.2)
  2. Debug-only PerCpu CAS — borrow-state CAS present only in debug builds; zero cost in release builds (Section 3.3)
  3. IRQ save/restore elision via IrqDisabledGuard — get_mut_nosave() skips the IRQ save/restore (~1-3 cycles saved) when the caller already holds an IrqDisabledGuard (Section 3.3)
  4. RCU quiescent state deferred to tick/switch — RcuReadGuard::drop writes a CpuLocal flag (~1 cycle); the actual quiescent-state report is deferred to the next scheduler tick or context switch (Section 3.1)
  5. Isolation register shadow elision — WRPKRU/MSR write skipped when the shadow register value already matches, saving ~23-80 cycles per elided switch (Section 11.2)
  6. ValidatedCap capability amortization — validate once, use a token for 3-5 sub-calls (~7-14 cycles amortized vs per-call) (Section 12.3)
  7. Doorbell coalescing for NVMe/virtio — one MMIO write per batch-32, ~5 cycles/cmd amortized (Section 11.7)

The cumulative nginx-class overhead is ~0.8% on x86-64 (MPK), ~1.5% on AArch64 with POE, and ~2-4% on mainstream AArch64 (page-table fallback with coalescing). This leaves substantial headroom under the 5% budget on all architectures for implementation-phase unknowns (cache effects, compiler variation, subsystem interactions). All seven techniques are implemented from day one — none are deferred optimizations. See Section 3.4 for the cumulative per-architecture overhead analysis with complete breakdowns.
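Technique 5 (isolation register shadow elision) can be sketched as follows. This is an illustrative model, not the real kernel API: the expensive register write is replaced by a counter so the elision logic itself is visible and testable.

```rust
/// Illustrative shadow-elision model (Section 11.2). The real kernel
/// performs a WRPKRU/MSR write (~23-80 cycles depending on architecture);
/// here the write is modeled by a counter.
struct IsolShadow {
    shadow: u32, // last value actually written to the isolation register
    writes: u64, // number of real (non-elided) register writes
}

impl IsolShadow {
    fn new(initial: u32) -> Self {
        Self { shadow: initial, writes: 0 }
    }

    /// Switch to `target`; returns true if a real register write occurred.
    fn switch_to(&mut self, target: u32) -> bool {
        if self.shadow == target {
            return false; // ~1-cycle compare; the expensive write is elided
        }
        // wrpkru(target) / MSR write would happen here.
        self.shadow = target;
        self.writes += 1;
        true
    }
}

fn main() {
    // Submit and completion both enter driver domain 1: the adjacent
    // enter/enter and exit/exit pairs collapse, matching the NVMe read
    // row's "4 switches -> 2 effective switches" claim.
    let mut pkru = IsolShadow::new(0);
    for target in [1, 1, 0, 0] {
        pkru.switch_to(target);
    }
    assert_eq!(pkru.writes, 2);
}
```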


1.3.6 Counter and Identifier Longevity Budget

UmkaOS targets 50-year continuous uptime via live kernel evolution (Section 13.18). Every monotonically increasing counter and identifier must be safe against overflow for the full operational lifetime. This section documents the worst-case exhaustion analysis for every such counter.

1.3.6.1 Design Rule

All kernel-internal identifiers that are not constrained by an external protocol MUST use u64. The only exceptions are:

  • Identifiers constrained by an external ABI (Linux syscall ABI, NFS/RPC wire format, POSIX IPC) — these use the protocol-mandated width (typically u32) and MUST document their wrap-safety analysis inline.
  • Identifiers packed into fixed-size structures where space is at a premium (e.g., NetBufHandle) — these use the minimum safe width with an inline longevity analysis comment.

1.3.6.2 Audit Table

| Counter | Width | Location | Worst-case rate | Time to wrap | Mitigation |
|---|---|---|---|---|---|
| Cgroup::id | u64 | Section 17.2 | 10K creates/sec | 58M years | Safe |
| DriverDomainId | u64 | Section 12.1 | 100 loads/sec | 5.8B years | Safe (changed from u32) |
| DriverDomain::generation | u64 | Section 12.1 | 100 crashes/sec | 5.8B years | Safe; operator reset at threshold |
| t0_vtable_generation | u64 | Section 13.18 | 100 swaps/sec | 5.8B years | Incremented on each Tier 0 vtable swap; same counter space as DriverDomain::generation |
| CapValidationToken::cap_generation | u64 | Section 12.1 | 1M revocations/sec | 584K years | Safe |
| NetBufHandle::generation | u32 | Section 16.5 | 8.3M pkts/sec (100Gbps) | 515 sec | Safe: no handle outlives the RCU grace period (~10ms). Changed from u16 (was 7.8ms!) |
| MceLog::head | u64 | Section 2.23 | burst: 64/NMI | >10^18 years at 1/sec | Safe: u64 is effectively infinite; drain resets to 0 |
| AutofsMount::next_token | u64 | Section 14.10 | 100 mounts/sec | 5.8B years | Safe (changed from u32; wire token truncated to u32 for Linux ABI) |
| EventHandle / SemHandle | u32 | Section 19.8 | 1M creates/sec | n/a (slot recycled) | Slot+generation: 24-bit slot index recycled; internal per-slot generation is u64 |
| KeySerial | i32 | Section 10.2 | 1K keys/sec | n/a (recycled) | Protocol-mandated signed (Linux key_serial_t = int32_t); only positive values allocated (1..=0x7FFFFFFF). Recycled via XArray slot reuse; per-slot generation in the Key struct |
| NFS xid_counter | u32 | Section 15.11 | 100K RPCs/sec | 11.9 hours | Protocol-mandated (RFC 5531); collisions handled by the (client_addr, xid) tuple |
| PipeBuffer::read_idx/write_idx | u32 | Section 17.3 | 1M ops/sec | 71 min | Modular counter, not a monotonic ID — wrapping is correct by design (difference arithmetic). Exempt from the u64 rule. Correctness invariants: (1) capacity is a power of two (so capacity divides 2^32), AND (2) the producer blocks when the ring is full (guaranteed by pipe semantics — back-pressure prevents the producer-consumer gap from exceeding capacity). Without back-pressure, a stalled consumer and a wrapping producer would produce indistinguishable gap values |
| KRL::version | u64 | Section 9.3 | 1 update/sec | 584B years | Safe |
| TrustAnchorChain::epoch | u64 | Section 9.3 | 1 rotation/year | ~1.8×10^19 years | Safe |
| Cgroup::generation | u64 | Section 17.2 | 1M limit changes/sec | 584K years | Safe |
| RCU grace period counter | u64 | Section 3.5 | 1M GPs/sec | 584K years | Safe |
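The PipeBuffer row's "wrapping is correct by design" invariant deserves a concrete sketch. The capacity value below is illustrative; the arithmetic is the standard free-running-index ring technique the row describes:

```rust
// Wrap-safe ring arithmetic: read/write indices are free-running u32
// counters; occupancy is their wrapping difference. This is correct
// because the capacity is a power of two (so it divides 2^32) and
// back-pressure keeps the producer-consumer gap <= capacity.
const CAPACITY: u32 = 8; // illustrative; must be a power of two

fn occupancy(write_idx: u32, read_idx: u32) -> u32 {
    write_idx.wrapping_sub(read_idx)
}

fn is_full(write_idx: u32, read_idx: u32) -> bool {
    occupancy(write_idx, read_idx) == CAPACITY
}

fn slot(idx: u32) -> usize {
    (idx & (CAPACITY - 1)) as usize
}

fn main() {
    // Occupancy stays correct even across the u32 wrap boundary:
    // the writer has advanced 5 positions past the reader.
    assert_eq!(occupancy(3, u32::MAX - 1), 5);
    // Full when the gap equals the capacity; the producer must block here.
    assert!(is_full(8, 0));
    // Physical slot index: (2^32 - 1) mod 8 = 7.
    assert_eq!(slot(u32::MAX), 7);
}
```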