Chapter 5: Distributed Kernel Architecture
Cluster topology, distance matrix, RDMA transport, distributed shared memory, distributed lock manager, SmartNIC/DPU integration
5.1 Distributed Kernel Architecture
This section extends UmkaOS from a single-machine kernel to a distributed-capable kernel with RDMA-native primitives, page-level distributed shared memory, cluster-aware scheduling, global memory pooling, and network-portable capabilities. These are kernel-internal capabilities — distributed applications remain in userspace, but the kernel transparently optimizes data placement and movement across RDMA fabric.
Design Constraints:
- Drop-in compatibility: A single-node UmkaOS system behaves identically to a non-distributed kernel. All distributed features are opt-in. Existing Linux binaries (MPI, NCCL, Spark, Redis, PostgreSQL) work without modification.
- Superset, not replacement: Standard TCP/IP sockets, POSIX shared memory, and SysV IPC work exactly as before. Distributed capabilities are additional. Applications that opt in (via new interfaces or transparent kernel policies) get better performance.
- RDMA-native from day one: The kernel's core primitives (IPC, page cache, memory management) are designed with RDMA transport in mind, not bolted on after the fact.
- Page-level coherence: Distributed memory coherence operates at page granularity (4KB minimum), not cache-line granularity (64B). This is the fundamental design decision that makes distributed shared memory practical over network latencies.
- Graceful degradation: Node failures are handled. No single point of failure. Split-brain is detected and resolved. Partial cluster operation is always possible.
5.1.1.1 The Hardware Shift
The datacenter is becoming a single, disaggregated computer:
2015: Machines are islands. 10GbE, TCP/IP, millisecond latencies.
Networking = slow, unreliable. Kernel = local machine only.
2020: RDMA everywhere. 100GbE RoCEv2 / InfiniBand HDR (200Gb/s).
1-2 μs latency. Kernel-bypass networking is the norm for HPC.
2024: CXL 2.0 memory pooling. Disaggregated memory over PCIe fabric.
Memory can live outside the machine, accessible at ~200-400ns.
2025-2026: CXL 3.0 hardware-coherent shared memory. PCIe 6.0 (64 GT/s).
400GbE / InfiniBand NDR (400Gb/s). Sub-microsecond RDMA.
2027+: CXL switches, memory fabric topology, composable infrastructure.
The distinction between "local" and "remote" memory blurs.
The hardware is converging on a model where:
- Network latency (RDMA) ≈ remote NUMA latency ≈ CXL latency
- All three are ~1-5 μs, compared to NVMe SSD at ~10-15 μs
- Remote memory over RDMA is faster than local SSD
No existing operating system is designed for this reality.
5.1.1.2 Why Linux Cannot Adapt
Linux has two completely separate networking paradigms:
World A: Socket-based (kernel-managed)
- TCP/IP, UDP, Unix sockets
- Kernel manages connections, buffers, routing
- Page cache, VFS, block I/O all use this world
- Latency: ~5 μs per packet (kernel processing overhead — time for the
kernel network stack to process one packet, not end-to-end round-trip)
World B: RDMA/verbs (kernel-bypass)
- InfiniBand verbs, RoCE
- Application manages everything via libibverbs
- Kernel provides setup (protection domains, memory registration) then gets out
- Page cache, VFS, block I/O know nothing about this world
- Latency: ~1-2 μs per operation
These worlds do not interact. There is no way for the Linux page cache
to fetch a page from a remote node via RDMA. There is no way for the
Linux scheduler to migrate a process to where its data lives across
an RDMA link. There is no way for Linux IPC to transparently extend
to a remote node.
Most distributed features in Linux (DRBD, Ceph, GlusterFS, GFS2, OCFS2)
are built on World A (sockets). NFS and SMB have added RDMA transports
(svcrdma/xprtrdma since 2.6.24, SMB Direct since v4.15, KSMBD since
v5.15), but these are bolt-on transport alternatives — the core protocols,
data structures, and failure handling remain socket-oriented. None use
RDMA verbs (CAS, FAA, one-sided Read/Write) for lock-free data structure
access or coherence protocols. UmkaOS's distinction is using RDMA atomics
and one-sided operations as the primary coordination primitive, not just
as a transport.
Previous attempts to add distributed capabilities to Linux (Kerrighed, OpenSSI, MOSIX, SSI clusters, GAM patches) all failed because:
- Linux's core subsystems (MM, scheduler, VFS, IPC) assume single-machine
- Patches touched thousands of lines across dozens of subsystems
- Every kernel update broke the patches
- Cache-line-level coherence over network was too expensive
- No clean abstraction boundary — distributed logic was smeared everywhere
5.1.1.3 UmkaOS's Structural Advantage
UmkaOS's existing architecture is uniquely suited for distributed extension:
| Existing Feature | Distributed Extension |
|---|---|
| NUMA-aware memory manager (per-node buddy allocators) | Remote node = distant NUMA node |
| PageLocationTracker (Section 21.2.1.5) | Already tracks CPU, GPU, compressed, swap — add RemoteNode |
| MPSC ring buffer IPC | Ring buffer maps naturally to RDMA queue pair |
| Capability tokens (generation-based revocation) | Cryptographic signing → network-portable |
| Device registry (topology tree) | Extend topology to include RDMA fabric |
| AccelBase KABI (Section 21.1) | GPU on remote node = remote accelerator |
| CBS bandwidth guarantees (Section 6.3) | Extend CBS to cluster-wide resource accounting |
| Object namespace (Section 19.4) | \Cluster\Node2\Devices\gpu0 |
The key insight: UmkaOS already models heterogeneous memory (CPU RAM, GPU VRAM, compressed pages, swap) as different tiers in a unified memory hierarchy. Remote memory over RDMA is just another tier. The memory manager already knows how to migrate pages between tiers on demand. Extending "tiers" to include remote nodes is a natural generalization, not a fundamental redesign.
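That generalization can be sketched in a few lines. This is an illustrative sketch only — `PageLocation`, its variants, and the latency figures (taken from the convergence numbers in Section 5.1.1.1) are hypothetical stand-ins for the actual PageLocationTracker extension:

```rust
/// Illustrative sketch: the unified memory hierarchy with a RemoteNode tier.
/// Names and numbers are hypothetical, not UmkaOS's actual definitions.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum PageLocation {
    CpuRam { numa_node: u32 },   // local DDR5
    GpuVram { device: u32 },     // local accelerator VRAM/HBM
    Compressed,                  // zram-style local pool
    Swap,                        // local NVMe swap
    RemoteNode { node: u32 },    // new tier: remote DRAM over RDMA
}

/// Approximate access latency per tier, in nanoseconds
/// (the ~1-5 μs RDMA vs ~10-15 μs NVMe figures from Section 5.1.1.1).
pub fn approx_latency_ns(loc: PageLocation) -> u64 {
    match loc {
        PageLocation::CpuRam { .. } => 100,
        PageLocation::GpuVram { .. } => 500,
        PageLocation::Compressed => 2_000,
        PageLocation::RemoteNode { .. } => 3_000,
        PageLocation::Swap => 12_000,
    }
}
```

The migration machinery already orders tiers by access cost; the point is that `RemoteNode` slots in between compressed memory and swap rather than requiring a new code path.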
5.2 Cluster Topology Model
5.2.1 Extending the Device Registry
The device registry (Section 10.5) models hardware topology as a tree with parent-child and provider-client edges. For distributed operation, extend the tree to span multiple nodes:
\Cluster (root of distributed namespace)
+-- node0 (this machine)
| +-- pci0000:00
| | +-- 0000:41:00.0 (GPU)
| | +-- 0000:06:00.0 (NVMe)
| | +-- 0000:03:00.0 (RDMA NIC, mlx5)
| +-- cpu0 ... cpu31
| +-- numa-node0 (512GB DDR5)
| +-- numa-node1 (512GB DDR5)
|
+-- node1 (remote machine, discovered via RDMA fabric)
| +-- [remote device tree, cached]
| +-- numa-node0 (512GB DDR5, reachable via RDMA)
| +-- gpu0 (80GB VRAM, reachable via GPUDirect RDMA)
|
+-- node2 ...
|
+-- fabric (RDMA fabric topology)
+-- switch0 (InfiniBand switch)
| +-- port0 → node0:mlx5_0
| +-- port1 → node1:mlx5_0
+-- switch1
+-- port0 → node2:mlx5_0
+-- port1 → node3:mlx5_0
5.2.2 Cluster Node Descriptor
/// Describes a node in the cluster (including self).
#[repr(C)]
pub struct ClusterNode {
/// Unique node ID (assigned during cluster join, never reused).
pub node_id: NodeId,
/// Node state.
pub state: NodeState,
/// RDMA endpoint for reaching this node.
pub rdma_endpoint: RdmaEndpoint,
/// Total CPU memory available for remote access (bytes).
pub remote_accessible_memory: u64,
/// Number of NUMA nodes.
pub numa_nodes: u32,
/// Number of accelerators (GPUs, NPUs, etc.).
pub accelerator_count: u32,
/// Round-trip latency to this node (nanoseconds, measured).
/// u32 covers up to ~4.29 seconds, sufficient for datacenter and
/// campus-scale clusters. WAN links with higher RTT use the
/// TCP fallback transport (Section 5.11.2.6), not RDMA, and are
/// represented in the distance matrix (Section 5.2.9) instead.
pub measured_rtt_ns: u32,
pub _pad_rtt: u32, // alignment: brings offset to 72 (multiple of 8) for u64 fields below
/// Unidirectional bandwidth to this node (bytes/sec, measured).
/// Field name uses `bytes_per_sec` to avoid ambiguity with
/// "bps" (bits per second) common in networking contexts.
pub measured_bw_bytes_per_sec: u64,
/// Heartbeat: last received timestamp.
pub last_heartbeat_ns: u64,
/// Heartbeat: monotonic generation (detects restarts).
pub heartbeat_generation: u64,
/// Cluster protocol version (must match to join).
/// Nodes with mismatched protocol_version are rejected during cluster join.
pub protocol_version: u32,
pub _pad_pv: u32, // alignment padding after protocol_version
pub _pad: [u8; 24], // total struct size = 128 bytes (2 × cache-line)
}
pub type NodeId = u32;
#[repr(u32)]
pub enum NodeState {
/// Node is reachable and healthy.
Active = 0,
/// Node is joining (exchanging topology, syncing state).
Joining = 1,
/// Node missed heartbeats but not yet declared dead.
Suspect = 2,
/// Node is unreachable. Its resources are being reclaimed.
Dead = 3,
/// Node is gracefully leaving (draining work, migrating pages).
Leaving = 4,
}
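A minimal failure-detection sketch over these states, assuming the 100 ms heartbeat and 1 s suspect timeout quoted later in this chapter (Mode A defaults). The 10 s dead threshold is an assumed value for illustration, and `NodeState` is restated with derives so the sketch is self-contained:

```rust
/// NodeState restated from above, with derives added for this sketch.
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(u32)]
pub enum NodeState { Active = 0, Joining = 1, Suspect = 2, Dead = 3, Leaving = 4 }

const SUSPECT_AFTER_NS: u64 = 1_000_000_000;  // 1 s without a heartbeat (Mode A default)
const DEAD_AFTER_NS: u64 = 10_000_000_000;    // 10 s — assumed threshold, illustrative

/// Derive the next membership state from heartbeat age alone.
pub fn next_state(current: NodeState, now_ns: u64, last_heartbeat_ns: u64) -> NodeState {
    let age = now_ns.saturating_sub(last_heartbeat_ns);
    match current {
        // Joining and Leaving are driven by the protocol, not heartbeat age.
        NodeState::Joining | NodeState::Leaving => current,
        _ if age >= DEAD_AFTER_NS => NodeState::Dead,
        _ if age >= SUSPECT_AFTER_NS => NodeState::Suspect,
        _ => NodeState::Active,
    }
}
```

A Suspect node that resumes heartbeating returns to Active; a Dead node must rejoin through the full membership handshake rather than silently reappearing.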
#[repr(C)]
pub struct RdmaEndpoint {
/// RDMA GID (Global Identifier) — InfiniBand/RoCE address.
pub gid: [u8; 16],
/// Queue pair number for control channel.
pub control_qpn: u32,
/// Protection domain key for this cluster.
pub pd_key: u32,
/// RDMA device index on the local machine.
pub local_rdma_device: u32,
pub _pad: [u8; 12],
}
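The padding comments above can be made checkable. A sketch of the layout arithmetic as compile-time assertions — field sizes are restated from the struct definitions; in the kernel this would mirror, not replace, a `size_of` check against the real types:

```rust
// Compile-time layout arithmetic for the descriptors above.
const RDMA_ENDPOINT_SIZE: usize = 16 + 4 + 4 + 4 + 12;  // gid + qpn + pd_key + device + pad
const NODE_HEADER: usize = 4 + 4 + RDMA_ENDPOINT_SIZE;  // node_id + state + endpoint = 48
const NODE_COUNTS: usize = 8 + 4 + 4;                   // memory + numa_nodes + accel = 16
const NODE_TIMING: usize = 4 + 4 + 8 + 8 + 8;           // rtt + pad + bw + heartbeat + gen = 32
const NODE_VERSION: usize = 4 + 4;                      // protocol_version + pad = 8
const CLUSTER_NODE_SIZE: usize =
    NODE_HEADER + NODE_COUNTS + NODE_TIMING + NODE_VERSION + 24; // + trailing pad
const _: () = assert!(RDMA_ENDPOINT_SIZE == 40);
const _: () = assert!(CLUSTER_NODE_SIZE == 128); // two cache lines, as claimed
```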
Device-local kernels as cluster members — The ClusterNode structure describes
traditional CPU-based compute nodes, but modern hardware increasingly runs its own
firmware OS:
- SmartNICs/DPUs: NVIDIA BlueField-2/3 DPUs run full Ubuntu with 8-16 ARM cores, 16-32 GB DRAM, and can host containers and VMs. Intel IPU and AMD Pensando DPUs run similar firmware stacks.
- GPUs: NVIDIA GPUs run CUDA firmware that schedules work across thousands of cores, manages HBM memory, and coordinates P2P transfers. AMD GPUs run ROCm firmware.
- Storage controllers: High-end NVMe controllers and RAID cards run embedded RTOS or Linux to manage flash translation layers, wear leveling, and caching.
- CXL devices: CXL defines three device types, each with a different operating model in UmkaOS's multikernel cluster:
  - Type 1 (coherent compute, no device-managed memory): Device compute participates in the host CPU cache coherency domain via CXL.cache. Natural Mode B peer — ring buffers in shared memory are hardware-coherent without explicit flush. Examples: coherent FPGAs, smart NICs with CXL.
  - Type 2 (compute + device-managed memory): Both CXL.cache (device cache in host coherency domain) and CXL.mem (device DRAM accessible to host via load/store). The richest UmkaOS peer type — bidirectional zero-copy coherent access. Device runs UmkaOS on its embedded cores; Mode B ring buffers are coherent in both directions. Examples: future CXL-attached GPUs, AI accelerators with HBM.
  - Type 3 (memory expansion, minimal or no compute): Provides additional DRAM via CXL.mem; host sees it as a slower NUMA node. The tiny management processor (if present, typically ARM/RISC-V) acts as a memory-manager peer, not a compute peer: it manages tiering, compression, encryption, and error reporting for the pool, but does not run workloads. Examples: Samsung CMM-H, Micron CZ120, SK Hynix AiMM. See Section 5.10.1 for the full Type 3 operating model and crash recovery distinction.
Rather than treating these as passive devices controlled exclusively by the host kernel, UmkaOS's distributed design allows device-local kernels to participate as first-class cluster members. A BlueField-3 DPU running UmkaOS could:
- Join the cluster as a peer node with its own NodeId, exchange topology with other nodes, and participate in membership/heartbeat protocols.
- Expose resources in the device registry: its own CPUs, DRAM, and attached storage/network as remotely-accessible resources.
- Run workloads: containers or VMs can be scheduled on the DPU's cores, with distributed locking and DSM providing transparent access to host memory or other cluster nodes.
- Offload functions: RDMA transport, network filtering, encryption, compression, or storage can run on the DPU with kernel-level coordination via the distributed lock manager and DSM.
This multikernel model treats a single physical server as a cluster of heterogeneous kernels — one on the host CPU, one on each DPU, one on each GPU (if the GPU firmware exposes cluster primitives). The distributed protocols (membership, DSM, DLM, quorum) work identically whether communicating between physical servers or between the host and a DPU on the same PCIe bus.
Protocol requirements — For a device-local kernel to participate as a first-class cluster member, it must implement UmkaOS's inter-kernel messaging protocol. This is a wire protocol, not an API — the device kernel does not need to be UmkaOS itself, but it must speak the same language:
- Transport layer: RDMA (for network-attached nodes) or PCIe P2P MMIO+interrupts (for on-board devices like DPUs/GPUs). The device must expose:
  - A control channel for cluster management messages (join, heartbeat, topology sync)
  - A data channel for DSM page transfers and DLM lock requests
  - MMIO-mapped doorbell registers or MSI-X interrupts for signaling
- Cluster membership protocol (Section 5.4):
  - Implement the join handshake: authenticate, exchange topology, sync protocol version
  - Send periodic heartbeats (every 100ms) with monotonic generation counter
  - Respond to membership queries with node state (Active, Suspect, Dead, Leaving)
  - Participate in failure detection: mark other nodes as Suspect if heartbeats missed
- DSM page protocol (Section 5.6):
  - Accept page ownership transfer requests: PAGE_REQUEST(vpfn, read|write)
  - Respond with page data or forward request if not owner
  - Implement cache coherence state machine (Owner, Shared, Invalid)
  - Participate in invalidation broadcasts for write requests
- DLM lock protocol (Section 5.6.1):
  - Accept lock acquisition requests: LOCK_ACQUIRE(lock_id, mode=shared|exclusive)
  - Maintain lock ownership table and grant/deny based on current holders
  - Support one-sided RDMA lock operations (atomic CAS on lock words)
  - Implement deadlock detection timeout (10 seconds default)
- Serialization format: All messages use fixed-size binary structs with explicit padding and versioning. Each message has a 32-byte header:

```rust
#[repr(C)]
pub struct ClusterMessageHeader {
    pub protocol_version: u32, // Currently 1
    pub message_type: u32,     // JOIN_REQUEST, HEARTBEAT, PAGE_REQUEST, etc.
    pub node_id: NodeId,       // Sender's node ID
    _align: u32,               // Explicit padding for u64 alignment
    pub sequence: u64,         // Message sequence number (for ordering)
    pub payload_length: u32,   // Bytes following this header
    pub checksum: u32,         // CRC32C of header + payload
}
```

Payload structs are defined in Section 5.5 (message formats). All fields are little-endian. Nodes with mismatched protocol_version are rejected during join.
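Receive-side validation of this header can be sketched as follows. One assumption not stated above: the checksum field is treated as zero while the CRC is computed — a common wire-protocol convention; the authoritative rule belongs in the protocol specification:

```rust
/// Sketch: validate a received message against the 32-byte header above.
/// Assumes little-endian fields and a zeroed checksum field during CRC
/// computation (the latter is an assumption, not from the spec text).
const PROTOCOL_VERSION: u32 = 1;

/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Header layout: version@0, type@4, node_id@8, pad@12, sequence@16,
/// payload_length@24, checksum@28 — 32 bytes total.
pub fn validate(header: &[u8; 32], payload: &[u8]) -> Result<(), &'static str> {
    let version = u32::from_le_bytes(header[0..4].try_into().unwrap());
    if version != PROTOCOL_VERSION {
        return Err("protocol version mismatch");
    }
    let payload_len = u32::from_le_bytes(header[24..28].try_into().unwrap());
    if payload_len as usize != payload.len() {
        return Err("payload length mismatch");
    }
    let claimed = u32::from_le_bytes(header[28..32].try_into().unwrap());
    let mut buf = header.to_vec();
    buf[28..32].fill(0); // checksum field zeroed for computation
    buf.extend_from_slice(payload);
    if crc32c(&buf) != claimed {
        return Err("checksum mismatch");
    }
    Ok(())
}
```

This is the "treat all messages as untrusted input" discipline: version, length, and checksum are checked before any semantic dispatch on message_type.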
Implementation paths for device vendors:
- Path A: Full UmkaOS on device — Run UmkaOS kernel on the device's embedded CPU (e.g., BlueField DPU with 16 ARM cores runs UmkaOS natively). This gives full protocol support with zero extra work. The device becomes a cluster node indistinguishable from a regular server.
- Path B: Firmware shim — Device vendor implements a minimal protocol adapter in their existing firmware. The adapter translates UmkaOS cluster messages into the device's native operations. Example: NVIDIA GPU firmware receives PAGE_REQUEST messages and responds by copying HBM pages to system memory via GPUDirect RDMA. Does not require rewriting the entire firmware stack.
- Path C: Host-side proxy driver — A Tier 1 driver on the host acts as proxy, translating UmkaOS cluster messages into device-specific commands (PCIe transactions, proprietary protocols). The device appears as a cluster member but is actually controlled by the proxy. Lower performance than Path B (extra CPU involvement) but requires zero firmware changes.
Near-term hardware targets — UmkaOS already builds for aarch64-unknown-none and
riscv64gc-unknown-none-elf. Devices with ARM or RISC-V cores can run the UmkaOS kernel
with zero ISA porting work, making Path A immediately actionable:
| Device | Cores | ISA | Path | Notes |
|---|---|---|---|---|
| NVIDIA BlueField-2 DPU | 8× Cortex-A72 | AArch64 | A | Replace host OS with UmkaOS. PCIe P2P to host. Currently runs Ubuntu. |
| NVIDIA BlueField-3 DPU | 16× Neoverse N2 | AArch64 | A | Same. Higher bandwidth NIC. |
| Marvell OCTEON 10 DPU | ARM Neoverse N2 | AArch64 | A | Open SDK. Same category as BlueField. |
| Microchip PolarFire SoC FPGA | 4× U54 | riscv64gc | A | UmkaOS boot target. FPGA implements custom datapath. Open toolchain. |
| StarFive JH7110 (VisionFive 2) | 4× U74 | riscv64gc | A | Boots UmkaOS today. PCIe expansion for host interconnect. |
| SiFive Intelligence X280 | U74 + RVV | riscv64gc | A | RISC-V vector AI accelerator. UmkaOS-compatible ISA. |
| Netronome Agilio CX (NFP3800) | NFP microengines | proprietary | B | Open C/BPF SDK. Published ring interface specs. Implement UmkaOS ring protocol in NFP firmware. |
| AMD/Xilinx Alveo U50/U250 | FPGA + ARM | AArch64 / FPGA | A or B | Fully programmable. Define any protocol in RTL. UmkaOS on embedded ARM for Path A. |
| Samsung SmartSSD | Zynq UltraScale+ (ARM + FPGA) | AArch64 | A or B | ARM Cortex-A53 runs UmkaOS. FPGA handles NVMe datapath. NVMe CSI spec published. |
| Samsung CMM-H | ARM management core | AArch64 | A (mgmt) | CXL Type 3 memory expander. Management core runs UmkaOS as a memory-manager peer (Type 3 model, Section 5.10.1). 256 GB–1 TB LPDDR5 pool. |
| Micron CZ120 | ARM management core | AArch64 | A (mgmt) | CXL Type 3. Same model as CMM-H. CXL 2.0, 128 GB–512 GB. |
| SK Hynix AiMM | ARM management core | AArch64 | A (mgmt) | CXL Type 3 with in-memory compute (PIM). Management core as memory-manager peer. |
The key insight: RISC-V devices are uniquely positioned as zero-effort Path A targets.
UmkaOS already cross-compiles to riscv64gc-unknown-none-elf with OpenSBI boot. Any device
with a RISC-V core and OpenSBI can boot an unmodified UmkaOS kernel — no porting required.
ARM-based DPUs (BlueField, OCTEON) are equally zero-effort via the AArch64 build target.
Security boundary — If a device firmware participates as a cluster member, it must be trusted to the same degree as any cluster node. A malicious or compromised GPU firmware with cluster membership could:
- Request arbitrary memory pages via DSM (reading sensitive data)
- Corrupt shared memory by writing to DSM pages
- Initiate denial-of-service by flooding lock requests
Therefore, device cluster membership is disabled by default and enabled per-device:
echo 1 > /sys/bus/pci/devices/0000:41:00.0/umka_cluster_enabled
Only devices running signed, verified firmware (Section 10.4) should be granted cluster membership in secure environments.
Initially, only host kernels participate. Device participation is a Phase 5+ capability (Section 8.6, 5.10.1) that requires firmware modifications by hardware vendors or open-source firmware projects. The protocol specification will be published as an RFC-style document to enable third-party implementations.
Real-world precedents:
- Barrelfish multikernel OS: Research OS where each CPU core runs its own kernel instance, coordinating via message passing. UmkaOS generalizes this to heterogeneous hardware.
- BlueField DPU offload: Current NVIDIA BlueField firmware can run OVS, storage targets, or custom applications, but coordination with the host is via ad-hoc userspace protocols. UmkaOS provides kernel-level coordination.
- GPUDirect Storage: NVIDIA GDS allows GPUs to directly access NVMe storage, bypassing the CPU. This is a point solution. UmkaOS's model makes such bypasses general-purpose.
Lightweight mode for intra-machine devices — The full distributed protocol (DSM page transfers, DLM with RDMA CAS, quorum protocols) was designed for multi-node clusters over RDMA networks with 1-5 μs latency. For intra-machine devices (host ↔ DPU/GPU on the same PCIe bus), the latency is 10x lower (~200-500ns), but failure modes are different (no network partitions, but devices can hang or reset independently).
Architectural principle: message passing is the primitive. UmkaOS's IPC model (Section 9.7) is built on message passing — explicit ownership transfer, capability-mediated channels, and defined send/receive semantics. This is the architectural primitive. It composes uniformly across every boundary UmkaOS targets: intra-kernel, kernel-user, cross-process, cross-VM, cross-network, and hardware peer. The "shared memory fast path" described in Mode B below is not a competing model — it is how message passing is implemented locally when hardware cache coherency is available. The abstraction is always message passing; the transport is chosen to match the hardware.
Two coordination modes are supported:
Mode A: Full Distributed Protocol (default for network-attached nodes)
- All messages via RDMA or PCIe P2P with ClusterMessageHeader wire format
- DSM: explicit page ownership transfer with RDMA Read/Write
- DLM: distributed lock tables, RDMA atomic CAS, deadlock detection
- Membership: heartbeat every 100ms, suspect timeout 1 second
- Best for: Multi-node clusters, devices that may have partial failures
Mode B: Hardware-Coherent Transport (optional for trusted local devices)
- The message-passing ownership guarantee is provided by the hardware cache coherency
protocol (PCIe ACS, CCIX, or CXL.cache) rather than by the software ownership
transfer protocol. The MESI state machine in hardware plays the same role that the
software ownership protocol plays over RDMA: only one writer at a time, cached reads
see the latest write. No software ownership messages are needed on top of hardware
coherency — that would be redundant.
- Both host and device map the same physical memory (via PCIe BAR mappings or pinned
system memory). Locks use local atomic ops (x86 LOCK CMPXCHG, ARM LDXR/STXR)
on cache-coherent shared memory instead of RDMA CAS.
- Membership via MMIO doorbell registers (no network heartbeat overhead).
- 5-10x lower latency: ~50-100ns for lock acquire vs ~500ns-1μs for RDMA CAS.
- Requires: device must be cache-coherent with host (PCIe ATS + ACS, CCIX, or CXL.cache).
Non-coherent devices (all current discrete GPUs, most current DPUs) must use Mode A.
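The Mode B fast path for locks can be sketched with local atomics. The lock-word encoding here (0 = free, nonzero = holder's NodeId) is illustrative — the real DLM lock word is defined in Section 5.6.1; in Mode A the same word would be hit with RDMA CAS instead:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

/// Mode B lock acquire: one CAS on cache-coherent shared memory
/// (compiles to LOCK CMPXCHG on x86, LDXR/STXR or CASA on ARM).
/// Encoding is illustrative: 0 = free, nonzero = holder's NodeId.
pub fn try_acquire(lock_word: &AtomicU32, my_node: u32) -> bool {
    debug_assert!(my_node != 0, "0 encodes 'free' in this sketch");
    lock_word
        .compare_exchange(0, my_node, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

/// Release: Release ordering publishes the holder's writes to the next owner.
pub fn release(lock_word: &AtomicU32, my_node: u32) {
    let prev = lock_word.swap(0, Ordering::Release);
    debug_assert_eq!(prev, my_node, "release by non-holder");
}
```

This is the ~50-100 ns acquire path quoted above versus the ~500 ns-1 μs RDMA CAS round-trip: the data structure is identical, only the transport of the atomic changes.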
When to use each mode:
| Scenario | Mode | Reason |
|---|---|---|
| Multi-node RDMA cluster | A | Must handle network failures, can't assume cache coherence |
| BlueField DPU running full UmkaOS | A | Separate memory space, needs explicit coordination |
| Future CXL 3.0-coherent GPU | B | CXL.cache coherency makes hardware the ownership protocol |
| Integrated GPU / APU (UMA) | B | CPU and GPU share coherent on-die/on-package memory |
| NVMe storage controller | B | Controller and host share command queues in coherent memory |
| Untrusted/unverified device | A | Mode B relies on hardware coherency guarantee — only for verified hardware |
Mode B is an optimization, not a separate protocol. It reuses the same data structures (lock tables, membership records) but the coherency guarantee is provided by hardware instead of software message passing. A device can fall back to Mode A if cache coherence fails or if the device resets.
Selection is per-device at join time:
# Force Mode A (full protocol) for untrusted DPU
echo "rdma://pci:0000:41:00.0?mode=distributed" > /sys/kernel/umka/cluster/join
# Enable Mode B (shared memory) for trusted GPU with cache-coherent access
echo "shmem://pci:0000:03:00.0" > /sys/kernel/umka/cluster/join
Implementation note: Mode B requires PCIe ATS (Address Translation Services) or CXL.cache to ensure device accesses to system memory are cache-coherent. Non-coherent devices (most current GPUs) must use Mode A.
5.2.3 Host-Side Component: umka-peer-transport
A device participating as a multikernel peer does not require a traditional
device driver on the host. A traditional driver (e.g., mlx5_core, amdgpu,
nvme) must understand the device's register layout, command format, initialization
sequence, error recovery procedure, and internal resource model. All of that
complexity now lives inside the device's own kernel. The host never touches device
registers and has no knowledge of the device's internals.
What the host requires instead is a single generic module — umka-peer-transport
— that handles the PCIe connection to any UmkaOS peer device, regardless of what the
device actually does:
/// Host-side state for one UmkaOS peer kernel, maintained by umka-peer-transport.
/// Identical structure for every peer device — NIC, GPU, storage, custom ASIC.
/// The device's function is irrelevant to this layer.
pub struct PeerTransport {
/// PCIe BDF of the peer device (for unilateral controls, see Section 5.3).
pub pcie_bdf: PcieBdf,
/// Shared memory region for the inbound/outbound domain ring buffer pair.
/// Allocated by the host, mapped into PCIe BAR by device at join time.
pub ring_region: DmaBuffer,
/// MMIO doorbell register: host writes here to signal the device.
pub doorbell_mmio: MmioRegion,
/// MMIO watchdog register: device writes a counter here every ~10ms.
pub watchdog_mmio: MmioRegion,
/// Cluster membership state (shared with the membership protocol Section 5.4).
pub health: PeerKernelHealth,
/// Negotiated cluster protocol version.
pub protocol_version: u32,
}
umka-peer-transport does five things and nothing else:
- Enumerate — detect that the PCIe device at a given BDF exposes the UmkaOS peer capability register (a new PCI capability ID assigned to the UmkaOS protocol).
- Connect — allocate the shared ring buffer region, map the device's MMIO doorbell, run the cluster join handshake (Section 5.2.8).
- Monitor — poll the MMIO watchdog counter and participate in the heartbeat protocol to detect device failure (Section 5.3.4).
- Contain — execute IOMMU lockout, bus master disable, and FLR if the device fails (Section 5.3.5).
- Disconnect — handle voluntary CLUSTER_LEAVE (planned update, shutdown).
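Those five operations can be viewed as a small lifecycle state machine. A sketch with illustrative state and event names (not from the spec; the real transitions are spread across Sections 5.2.8, 5.3.4, and 5.3.5):

```rust
/// Illustrative peer-transport lifecycle. States/events are hypothetical names.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum PeerState { Detected, Connected, Failed, Disconnected }

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum PeerEvent { JoinHandshakeOk, WatchdogStalled, ClusterLeave }

pub fn step(state: PeerState, event: PeerEvent) -> PeerState {
    match (state, event) {
        // Connect: join handshake completes after enumeration (Section 5.2.8).
        (PeerState::Detected, PeerEvent::JoinHandshakeOk) => PeerState::Connected,
        // Contain: watchdog stall triggers IOMMU lockout + FLR (Section 5.3.5).
        (PeerState::Connected, PeerEvent::WatchdogStalled) => PeerState::Failed,
        // Disconnect: voluntary CLUSTER_LEAVE — no containment actions taken.
        (PeerState::Connected, PeerEvent::ClusterLeave) => PeerState::Disconnected,
        // Anything else leaves the state unchanged in this sketch.
        (s, _) => s,
    }
}
```

The point of the sketch: the transport's control flow is device-agnostic — no branch anywhere depends on what the peer device actually is.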
umka-peer-transport has zero device-specific logic. The same binary handles
a BlueField DPU, a RISC-V AI accelerator, a computational storage device, and any
future device that implements the UmkaOS peer protocol. The host kernel's dependency
on device-specific code goes from hundreds of thousands of lines (per device class)
to zero.
5.2.4 Live Firmware Update Without Host Reboot
Because the host has no device-specific driver and no knowledge of device internals, firmware updates on the device are entirely the device's own responsibility. The host is not involved in the update itself — it only observes the device leaving and rejoining the cluster.
Update procedure from the device's perspective:
1. Device decides to update (admin command, automatic policy, or
device-side health check triggers it).
2. Device sends CLUSTER_LEAVE (orderly departure, not a crash).
ClusterMessageHeader { message_type: MEMBER_LEAVING, node_id: self }
3. Host receives CLUSTER_LEAVE, executes graceful shutdown:
- Migrates any workloads running on device to surviving nodes.
- Drains in-flight IPC channels and DSM operations (waits for
completions, does not issue new requests to the device).
- Revokes cluster membership cleanly (no IOMMU lockout, no FLR —
this is voluntary, not a crash; Section 5.3 crash path not taken).
Host is fully operational throughout. Zero disruption to workloads
not using this device.
4. Device updates its own firmware:
- For Path A (full UmkaOS kernel): device does a kernel rolling update
(Section 12.6) or reboots its own cores with new kernel image.
- For Path B (firmware shim): device applies vendor firmware update
procedure — entirely internal, host has no visibility.
- For all-in-one firmware: same, internal to device.
The host cannot observe what happens inside. It only knows the
device is absent from the cluster.
5. Device reinitializes hardware on new firmware (internal).
6. Device sends CLUSTER_JOIN with new protocol version and capabilities.
Host authenticates (verifies firmware signature per Section 8.2),
negotiates protocol version, exchanges updated topology.
Device rejoins — may announce new capabilities (e.g., firmware
added support for a new DSM extension).
7. Workloads migrate back to device (or new workloads scheduled).
Update cadence is fully independent per device:
| Component | Update authority | Host reboot required? |
|---|---|---|
| Host kernel | Host admin | No (Section 12.6 live evolution) or yes |
| Device firmware (any path) | Device / device admin | Never |
| UmkaOS cluster protocol | Negotiated at join | No (backward-compatible range) |
| Device hardware capabilities | Announced at rejoin | No |
A device can update its firmware multiple times per day. The host never reboots.
The host's umka-peer-transport module never changes. The only host-visible event
is the device being absent for the duration of the update (seconds to minutes,
device-dependent).
For Path A devices running full UmkaOS, Section 12.6 (Live Kernel Evolution) makes it possible to update individual kernel subsystems without even leaving the cluster — no CLUSTER_LEAVE at all. This is a future optimization requiring Section 12.6 to be implemented on the device kernel, but it is architecturally reachable.
5.2.5 Attack Surface Reduction
The shift from device-specific drivers to a generic peer transport has a significant security consequence. Traditional device drivers run in Ring 0 with full kernel privileges. A single memory-safety bug anywhere in driver code equals full kernel compromise. Driver code is the dominant source of kernel CVEs:
Linux kernel CVE distribution (approximate, 2020-2024):
~50% — driver bugs (memory safety, race conditions, use-after-free)
~15% — networking stack
~10% — filesystem
~25% — other subsystems
Lines of driver code per device class (approximate):
mlx5 (Mellanox NIC): ~150,000 lines in Ring 0
amdgpu (AMD GPU): ~700,000 lines in Ring 0
i915 (Intel GPU): ~400,000 lines in Ring 0
nvme (NVMe storage): ~15,000 lines in Ring 0
In the UmkaOS multikernel model:
umka-peer-transport (all devices, combined): ~2,000 lines in Ring 0
Device-specific code on device: lives behind IOMMU boundary
cannot reach host kernel memory
even if completely compromised
A vulnerability in device firmware (whether that is a full UmkaOS kernel, a firmware shim, or an all-in-one firmware) cannot escalate to the host kernel. The IOMMU is the hard boundary (Section 5.3.2). The firmware can be replaced or compromised entirely; the host kernel's critical structures (text, stacks, capability tables, scheduler state) remain unreachable.
The host's trust relationship with a peer device is:
1. At join time: verify firmware signature (Section 8.2) — the device presents a signed identity. If the signature is invalid, the join is rejected.
2. During operation: treat all messages as untrusted input — validate message type, version, checksum, and semantic correctness before acting. Same discipline as any network protocol.
3. On failure: IOMMU lockout bounds the damage regardless of what the device firmware does (Section 5.3).
This is a fundamentally different trust model than Ring 0 driver code, which must be trusted completely because there is no boundary between it and the kernel.
5.2.6 Toward a Universal Device Protocol
The three-path model (Section 5.2.2) has an implication beyond UmkaOS: it describes a universal device management protocol that could eliminate device-specific drivers entirely.
The core observation: every PCIe device — from a simple NIC to a full DPU — can be modeled as a cluster peer that advertises its capabilities. The host doesn't need device-specific knowledge. It needs to know what the device can do and handle everything else locally.
How it works for any device class:
Device boots → speaks cluster protocol → CLUSTER_JOIN with capabilities:
Simple NIC (E810-class):
capabilities: [RSS, FLOW_DIRECTOR, VXLAN_DECAP, CHECKSUM_OFFLOAD, SR_IOV]
→ Host runs: TCP/IP, firewall, eBPF, congestion control, routing
→ Device runs: packet classification, checksum, encap/decap
Smart NIC (BlueField-class):
capabilities: [FULL_KERNEL, TCP_OFFLOAD, FIREWALL, EBPF, IPSEC, RDMA]
→ Host runs: almost nothing — thin peer transport proxy
→ Device runs: full networking stack on its own cores
NVMe controller:
capabilities: [BLOCK_IO, NAMESPACE_MGMT, ZONED_APPEND, HW_CRYPTO]
→ Host runs: filesystem, page cache, I/O scheduling
→ Device runs: flash translation, wear leveling, encryption
GPU:
capabilities: [COMPUTE, RENDER, DISPLAY, VIDEO_DECODE, P2P_DMA]
→ Host runs: command submission scheduling, memory management policy
→ Device runs: shader execution, display scanout, video decode
What this replaces:
| Today (device-specific drivers) | With universal peer protocol |
|---|---|
| ~700K lines of NIC drivers in Linux (e810, mlx5, bnxt, igb, ...) | ~2K generic umka-peer-transport + capability dispatch |
| Every new NIC = write + upstream a driver | New NIC = firmware speaks protocol on day one |
| Driver bugs crash the kernel (Ring 0) | Device is a peer behind IOMMU — host never crashes |
| Vendor-specific management tools (ethtool quirks, proprietary ioctls) | One protocol for config, health, firmware lifecycle |
| Firmware update often requires reboot | CLUSTER_LEAVE → flash → CLUSTER_JOIN |
The host-side logic becomes capability-driven, not device-driven:
/// Host receives CLUSTER_JOIN from a device and fills in the gaps.
fn configure_device_peer(peer: &PeerNode) {
// Device advertises what it can do — host runs everything else.
if !peer.capabilities.contains(TCP_OFFLOAD) {
// Run TCP/IP stack locally, route packets to/from peer
host_networking.attach_local_stack(peer.interface);
}
if !peer.capabilities.contains(FIREWALL) {
// Run netfilter locally for this interface
host_netfilter.attach(peer.interface);
} else {
// Push firewall rules to device
peer.send_config(NetfilterRules::current());
}
if !peer.capabilities.contains(CHECKSUM_OFFLOAD) {
// Software checksum on host
host_networking.enable_software_checksum(peer.interface);
}
// ... same pattern for every capability
}
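The gap-filling logic above can be modeled concretely as a capability bitmask: for every capability the device does not advertise, the host schedules the corresponding local service. The capability bits and service names below are illustrative stand-ins, not the real capability encoding.

```rust
/// Capability bits a device may advertise in CLUSTER_JOIN (illustrative subset).
pub mod cap {
    pub const TCP_OFFLOAD: u64 = 1 << 0;
    pub const FIREWALL: u64 = 1 << 1;
    pub const CHECKSUM_OFFLOAD: u64 = 1 << 2;
}

/// For each capability the device lacks, name the host-side service that
/// fills the gap (service names are hypothetical).
pub fn host_responsibilities(device_caps: u64) -> Vec<&'static str> {
    let gaps: [(u64, &'static str); 3] = [
        (cap::TCP_OFFLOAD, "tcp_stack"),
        (cap::FIREWALL, "netfilter"),
        (cap::CHECKSUM_OFFLOAD, "sw_checksum"),
    ];
    gaps.iter()
        .filter(|&&(bit, _)| device_caps & bit == 0)
        .map(|&(_, svc)| svc)
        .collect()
}
```

An E810-class NIC advertising only `CHECKSUM_OFFLOAD` yields `["tcp_stack", "netfilter"]` — the host runs TCP/IP and the firewall locally; a Bluefield-class device advertising all three bits yields an empty list.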
Precedent: CXL already does this for memory. CXL Type 3 devices (memory expanders) are self-describing — the host doesn't need a device-specific driver. CXL defines standardized discovery, management, and data protocols. The UmkaOS peer protocol generalizes CXL's approach from memory devices to all device classes. USB class drivers are another precedent — a USB mass storage device works without a vendor driver because it speaks a standard class protocol. PCIe never achieved this level of standardization above the transport layer.
Path to adoption:
- UmkaOS proves the model works with existing hardware (Path C firmware shims for major NIC/NVMe families; Path A for DPUs with native support).
- One or two vendors ship native peer protocol support in firmware (likely DPU vendors first — NVIDIA Bluefield, Intel IPU — since they already have general-purpose cores and benefit most from standardized management).
- PCIe-SIG or CXL Consortium evaluates standardization of a device peer capability structure (building on CXL's existing discovery model).
- Eventually: PCIe devices ship with a standard "device peer protocol" capability in their PCIe capability list, the same way devices today advertise MSI-X, AER, or SR-IOV support.
The multikernel peer model is not just an UmkaOS feature. It is a potential industry standard that could eliminate the driver problem at its root — by making devices self-describing cluster members rather than opaque hardware that requires device-specific software.
5.2.7 Hierarchical Cluster Topology
From the host's perspective, every device — local or remote, simple or complex — is a node in a uniform cluster tree. Each node advertises a set of child resources (capabilities, compute, memory, I/O ports). The host doesn't distinguish between "a PCIe device" and "a remote server" at the management layer — both are nodes with different capability sets and different latencies.
Example topology as seen by the host:
Cluster
├── Host node: "server-1" (this machine)
│ ├── 64x CPU cores
│ ├── 256 GB DRAM (4 NUMA zones)
│ ├── Local Tier 1 drivers (USB HCI, audio codec — too simple for peering)
│ └── Peer devices (below)
│
├── Peer node: "bf3-nic0" (Bluefield-3, PCIe slot 2, ~100ns RTT)
│ ├── 16x ARM A78 cores [COMPUTE]
│ ├── 16 GB DRAM [MEMORY]
│ ├── 2x 100G Ethernet ports [NETWORK]
│ ├── eSwitch engine [FLOW_DIRECTOR, L2_L3_OFFLOAD]
│ ├── Crypto engine [IPSEC, TLS_OFFLOAD]
│ ├── RegEx engine [DPI, PATTERN_MATCH]
│ └── ConnectX-7 RDMA [RDMA_VERBS, ROCEV2]
│
├── Peer node: "e810-nic1" (Intel E810, PCIe slot 3, ~100ns RTT)
│ ├── 2x 100G Ethernet ports [NETWORK]
│ ├── RSS (128 queues) [RSS]
│ ├── Flow Director (8192 rules) [FLOW_DIRECTOR]
│ └── SR-IOV (256 VFs) [SR_IOV]
│ (no COMPUTE, no MEMORY — host runs TCP/IP, firewall, eBPF locally)
│
├── Peer node: "nvme0" (NVMe SSD, M.2 slot, ~100ns RTT)
│ ├── 4 namespaces (1TB each) [BLOCK_IO, NAMESPACE_MGMT]
│ ├── HW crypto (AES-256-XTS) [HW_CRYPTO]
│ └── Zoned append [ZONED_APPEND]
│
├── Peer node: "gpu0" (NVIDIA GPU, PCIe slot 1, ~200ns RTT)
│ ├── 16384 CUDA cores [COMPUTE, SHADER]
│ ├── 24 GB VRAM [MEMORY, DEVICE_LOCAL]
│ ├── Video decode engine [VIDEO_DECODE]
│ ├── Display outputs (3x DP, 1x HDMI) [DISPLAY]
│ └── P2P DMA [P2P_DMA → nvme0, bf3-nic0]
│
├── Remote node: "server-2" (RDMA-attached, ~3μs RTT)
│ ├── 128x CPU cores [COMPUTE]
│ ├── 512 GB DRAM [MEMORY, RDMA_ACCESSIBLE]
│ ├── (server-2's own peer devices are opaque from here —
│ │ server-2 manages them locally in its own cluster tree)
│ └── Exported capabilities: [BLOCK_IO, COMPUTE, MEMORY]
│
└── Remote node: "server-3" (RDMA-attached, ~5μs RTT)
└── Exported capabilities: [BLOCK_IO, MEMORY]
Key properties of this model:
- Uniform abstraction. A Bluefield DPU, an E810 NIC, an NVMe SSD, a GPU, and a remote server are all the same thing: a node with capabilities at a measured latency. The cluster manager, scheduler, and capability dispatch logic don't have device-class-specific code paths.
- Recursive composition. Each remote node manages its own local cluster tree internally. From `server-1`'s perspective, `server-2` is a single node that exports aggregated capabilities. `server-2` internally sees its own GPUs, DPUs, and SSDs as peer nodes. The cluster is a tree of trees — each level manages the level below it.
- Capability-driven scheduling. "Find a node with `IPSEC` capability and 100G throughput" returns both `bf3-nic0` (local, ~100ns) and `server-2`'s exported capabilities (remote, ~3μs). The scheduler picks based on latency, current load, and data locality — the same algorithm for local devices and remote servers.
- Graceful heterogeneity. The E810 with no compute resources and the Bluefield with 16 ARM cores are both valid network-capable nodes. The host automatically runs TCP/IP locally for the E810 and offloads it to the Bluefield. No configuration needed — the capability advertisement drives the decision.
- P2P awareness. The topology captures direct peer-to-peer paths (e.g., `gpu0` can DMA directly to `nvme0` and `bf3-nic0` without host involvement). The scheduler uses this for GPU-direct-storage and GPU-direct-RDMA paths.
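Capability-driven scheduling over this tree reduces to a filter-and-rank query: select the nodes advertising the wanted capability bits, then order by measured RTT. A minimal sketch, with a hypothetical capability bit and the RTT figures from the example topology:

```rust
/// A node as the scheduler sees it: advertised capabilities + measured RTT.
pub struct Node {
    pub name: &'static str,
    pub caps: u64,
    pub rtt_ns: u64,
}

pub const IPSEC: u64 = 1 << 0; // illustrative capability bit

/// All nodes offering every bit in `want`, nearest (lowest RTT) first.
pub fn candidates(nodes: &[Node], want: u64) -> Vec<&'static str> {
    let mut hits: Vec<&Node> =
        nodes.iter().filter(|n| n.caps & want == want).collect();
    hits.sort_by_key(|n| n.rtt_ns);
    hits.iter().map(|n| n.name).collect()
}

/// Nodes from the example topology (RTTs as in the diagram above).
pub fn example_nodes() -> Vec<Node> {
    vec![
        Node { name: "server-2", caps: IPSEC, rtt_ns: 3_000 },
        Node { name: "bf3-nic0", caps: IPSEC, rtt_ns: 100 },
        Node { name: "e810-nic1", caps: 0, rtt_ns: 100 },
    ]
}
```

A real implementation would break RTT ties on load and data locality, but the shape — one query over a uniform node list, no device-class branches — is the point.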
5.2.8 Topology Discovery
Cluster formation is explicit (no auto-discovery magic):
1. Admin configures cluster membership:
echo "rdma://10.0.0.2/mlx5_0" > /sys/kernel/umka/cluster/join
2. Kernel establishes RDMA control channel to target node:
- Create RDMA queue pair (reliable connected)
- Perform authenticated key exchange:
- Each node has an Ed25519 signing key pair (for authentication) and an X25519
Diffie-Hellman key pair (for key exchange)
- Nodes exchange X25519 public keys and authenticate them with Ed25519 signatures
- Shared secret derived via X25519 DH, then HKDF-SHA256 to derive session keys
- Mutual authentication via pre-shared cluster secret or PKI
- **Ephemeral key exchange for forward secrecy**: after authenticating with the
long-term Ed25519/ML-DSA-65 keys, both nodes perform an additional X25519 ECDH
exchange using **ephemeral** (per-session) key pairs. Each node generates a fresh
X25519 key pair at session establishment time; these ephemeral keys are never
persisted to disk and are zeroed from memory after the shared secret is derived.
The final session symmetric key is derived as:
`session_key = HKDF-SHA256(static_shared_secret || ephemeral_shared_secret, session_nonce)`
where `static_shared_secret` is from the long-term X25519 exchange,
`ephemeral_shared_secret` is from the ephemeral X25519 exchange, and
`session_nonce` is a 32-byte random value contributed by both nodes (16 bytes
each, concatenated). This ensures that compromise of the long-term signing key
or the long-term X25519 key does not allow decryption of recorded past sessions:
the ephemeral keys needed to reconstruct `ephemeral_shared_secret` no longer
exist after session setup. The ephemeral key pair is regenerated on every
reconnection (including reconnection after RDMA link recovery), so each session
has independent forward secrecy.
3. Exchange topology information:
- Each node sends its device registry summary (devices, NUMA, accelerators)
- Each node sends its memory availability (total, available for remote access)
- Each node sends its RDMA capabilities (bandwidth, latency, features)
4. Measure link quality:
- Ping-pong latency measurement (RTT)
- Bandwidth probe (bulk RDMA write)
- Results stored in ClusterNode.measured_rtt_ns / measured_bw_bytes_per_sec
5. Fabric topology construction:
- If InfiniBand: query subnet manager for switch topology
- If RoCEv2: infer topology from latency measurements + LLDP
- Build cluster-wide distance matrix (for scheduling and placement)
6. Cluster is operational.
Heartbeat monitoring begins (RDMA send every 100ms per node).
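Step 4's link-quality measurement can be sketched as two small reductions: a median over ping-pong RTT samples (robust to scheduling-noise outliers, unlike a mean) and a bandwidth estimate from a timed bulk write. The helper names are illustrative, not kernel API:

```rust
/// Median of ping-pong RTT samples (ns), handling odd and even counts.
pub fn median_rtt_ns(samples: &mut [u64]) -> u64 {
    samples.sort_unstable();
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        (samples[n / 2 - 1] + samples[n / 2]) / 2
    }
}

/// Bandwidth estimate from a bulk RDMA write probe of `bytes_moved` bytes
/// that completed in `elapsed_ns` nanoseconds.
pub fn bandwidth_bytes_per_sec(bytes_moved: u64, elapsed_ns: u64) -> u64 {
    bytes_moved.saturating_mul(1_000_000_000) / elapsed_ns.max(1)
}
```

A single 50 μs outlier among ~3 μs samples barely moves the median, which is why the results stored in `measured_rtt_ns` stay stable across probe runs.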
5.2.9 Distance Matrix
The memory manager and scheduler need a cost model for data movement:
/// Cluster-wide distance matrix.
/// distance[src][dst] = cost in nanoseconds to move one page (4KB).
///
/// Examples (measured, not configured):
/// Local NUMA same-node: ~300ns (cache-hot) to ~3-5μs (cold, typical for migration)
/// Local NUMA cross-socket: ~500-800ns (4KB memcpy, QPI/UPI hop)
/// GPU VRAM (PCIe): ~1000-2000ns (4KB DMA over PCIe)
/// Remote node (same switch): ~3000-5000ns (RDMA RTT + DMA)
/// Remote node (cross-switch): ~5000-8000ns
/// CXL-attached memory: ~200-400ns
/// NVMe SSD: ~10000-15000ns
/// Remote NVMe (NVMe-oF/RDMA): ~15000-25000ns
/// HDD: ~5000000ns (5ms)
///
/// Key observation: remote RDMA is faster than local SSD.
/// Design constant: maximum cluster size. 64 nodes fits in a u64 bitmask
/// for efficient set operations (reader sets, membership views, allowed_nodes).
/// For clusters larger than 64 nodes, extend to `[u64; N]` or a hierarchical
/// directory (see Section 5.11.2.6 open questions).
pub const MAX_CLUSTER_NODES: usize = 64;
/// Heap-allocated via slab (too large for stack: ~49 KB).
///
/// **Allocation**: This struct is ~49 KB and MUST NOT be placed on the kernel stack
/// (which is 8-16 KB). It is allocated from the slab allocator at cluster join time
/// (one instance per cluster membership) and persists for the node's cluster lifetime.
/// The `Box<ClusterDistanceMatrix>` is stored in the `ClusterState` struct.
pub struct ClusterDistanceMatrix {
/// Node count (including self). Must be <= MAX_CLUSTER_NODES.
node_count: u32,
/// Distances[src_node * node_count + dst_node] = page migration cost (ns).
/// Symmetric for RDMA (full-duplex), but stored both directions
/// because asymmetric links exist (satellite, WAN).
distances: [u32; MAX_CLUSTER_NODES * MAX_CLUSTER_NODES],
/// Bandwidth[src_node * node_count + dst_node] = bytes/sec.
bandwidth: [u64; MAX_CLUSTER_NODES * MAX_CLUSTER_NODES],
/// Local NVMe 4KB random read latency in nanoseconds, measured at cluster join
/// time by calibrate_local_storage(). Used to decide whether to fetch a page
/// from a remote node or read it from local SSD. Default: 12,000 ns (12 μs) if
/// no local NVMe is present. Updated on storage topology changes.
ssd_cost_ns: u64,
}
impl ClusterDistanceMatrix {
/// Cost in nanoseconds to transfer one 4KB page from `remote` to `local`.
/// Returns the entry from the distance matrix: `distances[remote * node_count + local]`.
/// Panics in debug builds if either node index is out of range.
pub fn page_cost_ns(&self, local: NodeId, remote: NodeId) -> u64 {
debug_assert!((local as usize) < self.node_count as usize);
debug_assert!((remote as usize) < self.node_count as usize);
self.distances[remote as usize * self.node_count as usize + local as usize] as u64
}
/// Should we fetch a page from a remote node vs. from local SSD?
/// Returns true if remote fetch is expected to be faster.
pub fn prefer_remote_over_ssd(&self, local: NodeId, remote: NodeId) -> bool {
let remote_cost = self.page_cost_ns(local, remote);
// Measured from local NVMe at cluster join time via a calibration 4KB random
// read. Default fallback 12,000 ns if no local NVMe present.
// This field is set by ClusterDistanceMatrix::calibrate_local_storage() called
// during cluster join (Section 5.9) and updated on storage topology changes.
let ssd_cost = self.ssd_cost_ns;
remote_cost < ssd_cost
}
}
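A usage sketch of the cost model above, shrunk to a 4-node matrix and populated with the example latencies from the doc comment (illustrative numbers standing in for calibrated measurements):

```rust
const N: usize = 4; // small illustrative cluster (the real matrix uses MAX_CLUSTER_NODES)

pub struct Matrix {
    pub node_count: usize,
    /// distances[remote * node_count + local] = ns to move one 4KB page.
    pub distances: [u32; N * N],
    pub ssd_cost_ns: u64,
}

impl Matrix {
    pub fn page_cost_ns(&self, local: usize, remote: usize) -> u64 {
        self.distances[remote * self.node_count + local] as u64
    }
    pub fn prefer_remote_over_ssd(&self, local: usize, remote: usize) -> bool {
        self.page_cost_ns(local, remote) < self.ssd_cost_ns
    }
}

/// Example populated from node 0's perspective: node 1 same-switch RDMA,
/// node 2 cross-switch RDMA, node 3 remote NVMe-oF.
pub fn example_matrix() -> Matrix {
    let mut d = [0u32; N * N];
    d[N] = 4_000;          // node 1 -> node 0: same-switch RDMA
    d[2 * N] = 7_000;      // node 2 -> node 0: cross-switch RDMA
    d[3 * N] = 20_000;     // node 3 -> node 0: remote NVMe over RDMA
    Matrix { node_count: N, distances: d, ssd_cost_ns: 12_000 }
}
```

With a 12 μs local NVMe, both RDMA paths win ("remote RDMA is faster than local SSD") while the remote-NVMe path loses — exactly the decision `prefer_remote_over_ssd` encodes.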
5.3 Peer Kernel Isolation and Crash Recovery
This section defines the isolation model for device-local kernels (Section 5.2.2 Path A/B) — devices running UmkaOS or an UmkaOS-protocol firmware as a first-class cluster peer. This model is fundamentally different from both Tier 1 driver crash recovery (Section 10.8) and remote node failure (Section 5.9), and must not be confused with either.
5.3.1 The Isolation Model Shift
With traditional drivers (Tier 1/Tier 2), UmkaOS operates a supervisor hierarchy: the host kernel loads, supervises, and recovers drivers. A Tier 1 driver crash is handled by the kernel — it detects the fault, revokes the domain, issues FLR, and reloads the driver binary. The kernel is always in control.
With a multikernel peer, this hierarchy does not exist. The device runs its own autonomous kernel on its own cores in its own physical memory. The host kernel:
- Does not load the device kernel
- Cannot inspect or modify the device kernel's private memory
- Cannot "reload" the device kernel the way it reloads a Tier 1 driver
- Cannot force the device into a known-good state without a full hardware reset
The isolation unit shifts from software domain (MPK/DACR keys, enforced by the kernel) to physical boundary (PCIe bus, IOMMU, enforced by hardware).
5.3.2 Host Unilateral Controls
Despite being a peer, the device is physically attached to the host's PCIe fabric. The host retains six unilateral controls it can exercise regardless of the device kernel's state — even against a buggy, wedged, or hostile peer. They form an escalating ladder from surgical containment to full hardware reset:
1. IOMMU domain lockout — The host programmed the device's IOMMU domain at join time. It can modify or revoke that domain at any time. Revoking the IOMMU domain prevents all further device DMA into host memory, hardware-enforced, with no cooperation from the device kernel. This is the primary containment action.
2. PCIe bus master disable — A single MMIO write to the device's Command register clears the Bus Master Enable bit. All PCIe transactions from the device are silently dropped by the root complex. Effective immediately; no cooperation needed.
3. Function Level Reset (FLR) — The host can issue FLR via the PCIe capability register. This resets all device hardware state: the device kernel dies, all in-flight DMA is cancelled, device cores reset. The device must re-initialize and rejoin the cluster. FLR is the multikernel equivalent of "driver reload" but takes device-reboot time (~1-30 seconds depending on firmware initialization) rather than driver-reload time (~50-150ms for Tier 1).
CXL Type 1/2 note: On CXL-attached peers with `CXL.cache` (coherent cache in the host CPU's coherency domain), the CXL specification requires the device to flush all dirty cache lines back to host memory before the FLR completes. The host must not access the shared memory region until the FLR completion status is confirmed — cache lines in flight during FLR may be in an indeterminate state. UmkaOS's crash recovery sequence (Section 5.3.5) issues IOMMU lockout and bus master disable (steps 1-2) before FLR (step 7) precisely to ensure no new DMA or cache traffic originates from the device during the flush window.
4. PCIe Secondary Bus Reset (SBR) — The upstream bridge or root port can assert the Secondary Bus Reset bit in its Bridge Control register. This propagates a hard reset signal to all devices on that bus segment — more comprehensive than FLR (resets all PCIe functions on the device, not just one) but less disruptive than a full power cycle. Used when FLR is not supported or does not produce a clean reset.
5. PCIe slot power — The host can power-cycle the PCIe slot via the Hot-Plug Slot
Control register or via ACPI _PS3/_PS0 platform methods. This cuts then restores
power to the slot entirely. Reserved for devices that do not respond to FLR or SBR.
6. Out-of-band management (BMC/IPMI) — On server-class hardware, the Baseboard Management Controller (BMC) has independent hardware control over PCIe slot power and presence, completely bypassing the host OS. The BMC operates on a separate management plane with its own network interface and power rail. This means that even if the host kernel itself is hung or severely degraded, a management controller can power-cycle misbehaving device slots and restore the host's ability to re-enumerate them. UmkaOS treats the BMC as a platform capability, not an UmkaOS component — its availability depends on the hardware platform, and it operates outside UmkaOS's software control path.
The critical consequence of this escalating ladder is: a heartbeat timeout from a peer device is never a dead end. The host always has a hardware path to reset and re-enumerate the device, independent of device cooperation. The peer kernel model does not introduce any situation where a failed device permanently wedges the host.
These controls are one-sided and non-cooperative. The host can execute them without any message to the device, without the device kernel's acknowledgment, and even while the device kernel is executing arbitrary code. They represent the irreducible minimum of host authority over physically attached devices.
5.3.3 Isolation Comparison
Traditional Tier 1 driver:
Same Ring 0 address space as host kernel.
Isolation via MPK/DACR — software-enforced, escape possible via WRPKRU.
Host kernel supervises: detects crash, revokes domain, FLR, reload.
Recovery: ~50-150ms (driver reload).
Can corrupt: own domain memory (not host kernel critical structures).
Multikernel peer (Mode A — message passing):
Separate address space, separate physical DRAM, separate cores.
Isolation via IOMMU — hardware-enforced, physically unreachable.
Host kernel does NOT supervise: detects via heartbeat, acts unilaterally.
Recovery: escalating reset ladder (FLR → SBR → slot power → BMC/IPMI).
At least one reset path always available; heartbeat timeout is never a dead end.
Can corrupt: RDMA pool (IOMMU-bounded, excludes kernel critical structures).
Can leave dirty: distributed state (held locks, DSM page ownership).
Multikernel peer (Mode B — hardware-coherent):
Same as Mode A for isolation (IOMMU still bounds device DMA).
Additional risk: in-flight coherent writes to shared pool may be torn.
Recovery: same as Mode A + pool scan and lock force-release.
Remote cluster node (network-attached):
No physical connection. No unilateral PCIe controls.
Isolation: network boundary, no shared memory at all.
Recovery: membership revocation, DSM page invalidation.
Can corrupt: nothing on host (messages-only interface).
The key asymmetry: a peer kernel crash is more isolated than a Tier 1 driver crash (no shared Ring 0 address space, IOMMU is harder than MPK), but less controlled (no supervisor relationship, no fast reload, possible dirty distributed state).
Relationship to the three-tier model. Multikernel peers are effectively a fourth isolation level beyond Tier 2. The progression — Tier 0 (no boundary) → Tier 1 (software memory domains) → Tier 2 (separate address space) → Multikernel Peer (separate physical hardware behind IOMMU) — represents increasing boundary hardness at each step. Unlike the usual isolation-vs-performance tradeoff (each tier adds latency), multikernel peers break the pattern: they provide the strongest isolation and offload host CPU cycles entirely, since the device runs its own kernel on its own cores. The reason UmkaOS does not label this "Tier 3" is that it has fundamentally different failure semantics: no supervisor relationship, distributed state cleanup on crash, an escalating reset ladder (FLR → SBR → slot power → BMC), and a requirement for dedicated hardware. It is not a drop-in replacement for a driver tier — it is a different operational model that happens to sit at the strongest end of the isolation spectrum.
5.3.4 Crash Detection
Two detection paths run in parallel for attached peer kernels:
Primary: Cluster heartbeat (both Mode A and Mode B) The standard membership protocol (Section 5.4): heartbeat every 100ms. Missed for 300ms (3 heartbeats) → Suspect. Missed for 1000ms (10 heartbeats) → Dead. This covers all failure modes including silent crashes and wedged firmware.
Secondary: MMIO watchdog (required for Mode B; recommended for Mode A) The device firmware writes a monotonically increasing counter to a dedicated MMIO register every 10ms. The host polls this register on any DLM lock acquisition timeout or DSM page request timeout. Stale counter → immediate Suspect escalation. Detection latency: ~20ms instead of up to 1 second via heartbeat alone.
Mode B devices that do not implement the MMIO watchdog must not be granted hardware-coherent transport access. The faster detection is mandatory for Mode B because stale lock state in the coherent pool propagates to all cluster members holding locks in that region.
/// Per-peer crash detection state, maintained by the host kernel.
pub struct PeerKernelHealth {
/// NodeId of the peer kernel.
pub node_id: NodeId,
/// Last observed MMIO watchdog counter value.
pub last_watchdog_count: AtomicU64,
/// Timestamp of last watchdog read (nanoseconds since boot).
pub last_watchdog_ts_ns: AtomicU64,
/// Number of consecutive missed heartbeats.
pub missed_heartbeats: AtomicU32,
/// Current health state.
pub state: AtomicU32, // PeerHealthState enum
/// PCIe BDF for unilateral control operations.
pub pcie_bdf: PcieBdf,
/// Coordination mode (determines recovery procedure).
pub mode: PeerCoordinationMode, // ModeA | ModeB
}
#[repr(u32)]
pub enum PeerHealthState {
Active = 0,
Suspect = 1, // Missed heartbeats or stale watchdog — alert, escalate
Dead = 2, // Confirmed failed — execute recovery sequence
Faulted = 3, // FLR issued, waiting for device to rejoin
ManagementFaulted = 4, // CXL Type 3: memory accessible, management unavailable
}
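The two detection paths combine into a single classification: the heartbeat count drives the Suspect/Dead thresholds, and a stale Mode B watchdog escalates to Suspect immediately rather than waiting out three heartbeat periods. A minimal sketch of that decision, using the thresholds above (the function shape is illustrative):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Health { Active, Suspect, Dead }

pub const SUSPECT_MISSES: u32 = 3; // 3 x 100 ms heartbeat = 300 ms
pub const DEAD_MISSES: u32 = 10;   // 10 x 100 ms = 1000 ms

/// Classify peer health from consecutive missed heartbeats plus the Mode B
/// MMIO watchdog check (counter unchanged past its staleness window).
pub fn classify(missed_heartbeats: u32, watchdog_stale: bool) -> Health {
    if missed_heartbeats >= DEAD_MISSES {
        Health::Dead
    } else if missed_heartbeats >= SUSPECT_MISSES || watchdog_stale {
        Health::Suspect
    } else {
        Health::Active
    }
}
```

This is why the watchdog cuts detection latency to ~20ms: `classify(1, true)` already returns Suspect, where heartbeat counting alone would still report Active.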
5.3.5 Recovery Sequence
On transition to Dead, the host kernel executes the following sequence.
Steps 1-3 are unilateral and non-cooperative. Steps 4-6 involve cleanup of
distributed state. Steps 7-8 are optional and admin-controlled.
Peer kernel crash recovery sequence:
1. IOMMU lockout (immediate, unilateral)
host.iommu.revoke_domain(peer.pcie_bdf)
— All device DMA into host memory blocked in hardware.
— All outstanding RDMA operations targeting this device's QPs are cancelled.
— Estimated time: <1ms (IOMMU domain table update + TLB shootdown).
2. PCIe bus master disable (immediate, unilateral)
host.pcie.clear_bus_master(peer.pcie_bdf)
— Device can no longer initiate any PCIe transactions.
— Defense-in-depth: belts-and-suspenders with IOMMU lockout.
— Estimated time: <1ms.
3. [Mode B only] Pool scan and lock force-release
host.rdma_pool.scan_and_release_locks(peer.node_id)
— Scan all DLM lock entries in the coherent pool owned by the dead peer.
— Force-release each held lock; set tombstone flag (LOCK_OWNER_DEAD).
— Waiters receive LOCK_OWNER_DEAD, not timeout.
— Scan all DSM directory entries owned by peer; mark as LOST.
— Estimated time: O(active_locks) — typically <10ms.
4. Cluster membership revocation
cluster.revoke_membership(peer.node_id)
— Broadcasts MEMBER_DEAD(peer.node_id) to all other cluster nodes.
— All nodes destroy their QP connections to the peer.
— Master eagerly calls `dereg_mr()` on the evicted node's pool-wide RDMA
Memory Region, invalidating the rkey in RNIC hardware (< 1ms). Any
in-flight one-sided RDMA operations from the evicted node using the
old rkey receive Remote Access Error completion (IBA v1.4 Section
10.6.7.2). This is immediate — it does NOT wait for the rkey rotation
cycle (Section 5.3.2, Mitigation 2).
— Estimated time: ~1 RTT to notify all members (~1-3ms on LAN).
5. DSM page invalidation
dsm.invalidate_all_pages_owned_by(peer.node_id)
— All DSM pages in the Owner or Shared state for the dead peer are
invalidated. Processes mapping these pages receive SIGBUS on next
access (same semantics as NFS server failure).
— Home-node responsibilities for pages hosted on the peer are
migrated to surviving nodes via the migration protocol (Section 5.6.3).
— Estimated time: O(pages_owned_by_peer) — typically <100ms.
6. Capability revocation
cap_table.revoke_all_for_node(peer.node_id)
— All capabilities granted to or from the peer are invalidated.
— Ongoing ring buffer channels to the peer are torn down.
— Estimated time: O(capabilities) — typically <10ms.
7. [Optional] Reset and device reboot (admin-controlled or auto-policy)
Escalating ladder — attempt each step in order, stop when device rejoins:
a. FLR: host.pcie.issue_flr(peer.pcie_bdf)
b. SBR: host.pcie.issue_sbr(peer.pcie_bdf) // if FLR fails or unsupported
c. Slot power cycle: host.pcie.power_cycle(peer.pcie_bdf) // last software resort
d. BMC/IPMI slot power: platform.bmc.power_cycle_slot(peer.slot_id) // OOB, if available
— Resets all device hardware state; device firmware re-initializes.
— If device re-joins cluster: resume normal operation.
— If device fails to re-join within timeout after all steps: mark as Faulted,
notify admin, do not attempt further resets automatically.
— At least one reset path (FLR or SBR) is always available for PCIe devices.
8. [Optional] Workload migration
scheduler.migrate_workloads_from(peer.node_id, policy)
— Workloads that were running on the peer's compute resources
(containers, VMs, scheduled tasks) are either:
a. Migrated to surviving nodes if they were checkpointable.
b. Terminated with SIGKILL if not checkpointable.
c. Suspended pending device recovery if short outage expected.
— Policy configured per-cgroup: migrate | terminate | suspend.
Total time to containment (steps 1-2): < 2ms. Total time to clean state (steps 1-6): < 200ms typical. Total time to full recovery (steps 1-8 with FLR): 1-30 seconds (device-dependent).
Compare to Tier 1 driver reload: 50-150ms total. The peer kernel recovery is slower because FLR + firmware re-initialization cannot be parallelized, and because distributed state cleanup (steps 4-6) is inherently network-speed rather than local-memory-speed. Steps 1-3 (containment) are comparable to or faster than Tier 1 domain revocation.
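Step 7's escalating ladder can be sketched as a walk over the rungs that stops at the first successful rejoin and marks the peer Faulted when all rungs are exhausted. The rung names and the `rejoins_after` mock are stand-ins for real hardware behavior:

```rust
#[derive(Debug, PartialEq)]
pub enum ResetOutcome { Rejoined(&'static str), Faulted }

/// Walk FLR -> SBR -> slot power -> BMC, stopping at the first rung after
/// which the device rejoins the cluster. `rejoins_after` mocks device
/// behavior: the rung name that actually clears the fault (None = never).
pub fn run_reset_ladder(rejoins_after: Option<&str>) -> ResetOutcome {
    const LADDER: [&'static str; 4] = ["flr", "sbr", "slot_power", "bmc_power"];
    for rung in LADDER {
        // Issue the reset for this rung (hardware access elided), then wait
        // for the device to rejoin within a timeout before escalating.
        if rejoins_after == Some(rung) {
            return ResetOutcome::Rejoined(rung);
        }
    }
    // All rungs exhausted: mark Faulted, notify admin, no further auto-resets.
    ResetOutcome::Faulted
}
```

The ladder ordering encodes the policy of Section 5.3.2: always try the least disruptive reset first, and never loop — a device that survives all four rungs without rejoining requires human attention.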
CXL Type 3 crash — distinct recovery model: When a CXL Type 3 memory-expander peer loses its management processor (crash, firmware fault, or reset), the recovery is fundamentally different from a compute peer crash:
- The DRAM cells do not disappear. The physical memory remains accessible to the host via `CXL.mem` load/store — it is DRAM on a PCIe/CXL bus, not in the device's compute domain. The host can still read and write those pages.
- The management layer is gone. Tiering decisions, compression metadata, encryption keys (if any), and bad-page tracking maintained by the management processor are no longer available. Compressed pages are unreadable until decompressed; encrypted pages are inaccessible if keys were held in device DRAM.
- Recovery action: The host marks the CXL pool as `ManagementFaulted` (not `Dead`). Pages without compression or encryption remain accessible normally. Pages with compression or encryption are migrated or treated as lost. The management processor is reset (FLR → re-initialize firmware); once recovered, it re-scans the pool metadata and rejoins as a memory-manager peer.
- No workload migration needed: There are no workloads running on a Type 3 management processor. Workloads using the CXL pool memory continue running unaffected (if their pages were uncompressed/unencrypted). Only pool management operations (tiering, new allocation) are suspended during recovery.
The `PeerKernelHealth` state machine (Section 5.3.4) gains an additional state for Type 3 peers: `ManagementFaulted` — memory accessible, management unavailable.
5.3.6 What Survives Peer Kernel Crash Intact
The host kernel's own state is fully protected:
| Component | Protected by | Survives peer crash? |
|---|---|---|
| Host kernel text, rodata | IOMMU (device cannot reach) | Always |
| Host kernel stacks | IOMMU (not in RDMA pool) | Always |
| Capability tables | IOMMU (not in RDMA pool) | Always |
| Scheduler state | IOMMU (not in RDMA pool) | Always |
| Host application memory | IOMMU (not in RDMA pool unless explicitly exported) | Always |
| RDMA pool — kernel structures | IOMMU bounds; pool scan on Mode B crash | Recovered in step 3 |
| RDMA pool — application DSM pages | Invalidated (step 5); app gets SIGBUS | Lost (must re-fetch or re-compute) |
| Other cluster nodes' state | Steps 4-6 clean up distributed references | Recovered |
| Host kernel stability | Nothing to crash | Never affected |
The host kernel never crashes due to a peer kernel crash. The IOMMU is the hardware guarantee; the recovery sequence is the software cleanup.
5.3.7 Relationship to Other Failure Handling Sections
- Section 10.8 (Driver Crash Recovery): Covers Tier 1/Tier 2 driver failures where the host kernel is the supervisor. Peer kernel isolation is a peer relationship, not a supervisor relationship. Do not apply Section 10.8 procedures to peer kernels.
- Section 5.9 (Cluster Membership and Failure Detection): Covers remote nodes connected over RDMA networks. Peer kernel recovery shares the membership protocol (steps 4-6) but adds the unilateral PCIe controls (steps 1-3) that are not available for network-attached nodes.
- DPU failure handling under the offload tier model (Section 5.2): Covers the case where the DPU acts as a managed device via the OffloadProxy KABI. If the same DPU runs a full UmkaOS peer kernel (Section 5.2.2 Path A), this section applies instead. The two models are mutually exclusive for a given device.
5.4 RDMA-Native Transport Layer
5.4.1 Design: Kernel RDMA Transport (umka-rdma)
Unlike Linux where RDMA is a separate subsystem used only by applications, UmkaOS integrates RDMA into the kernel's transport layer so that any kernel subsystem can use RDMA for data movement.
Linux architecture:
┌───────────────────────────────────────────────────┐
│ Kernel subsystems (MM, VFS, IPC, scheduler) │
│ └── All use: memcpy, TCP sockets, block I/O │
│ (no RDMA awareness) │
├───────────────────────────────────────────────────┤
│ RDMA subsystem (ib_core, ib_uverbs) │
│ └── Only used by: userspace apps via libibverbs │
│ (kernel subsystems don't use this) │
└───────────────────────────────────────────────────┘
UmkaOS architecture:
┌───────────────────────────────────────────────────┐
│ Kernel subsystems (MM, VFS, IPC, scheduler) │
│ └── All use: umka-rdma transport (when remote) │
│ Local path: same as before (memcpy, DMA) │
│ Remote path: RDMA read/write via umka-rdma │
├───────────────────────────────────────────────────┤
│ umka-rdma: unified RDMA transport for kernel use │
│ ├── Page migration (MM → remote NUMA node) │
│ ├── Page cache fill (VFS → remote page cache) │
│ ├── IPC ring (IPC → remote process) │
│ ├── Control messages (scheduler, capabilities) │
│ └── Userspace RDMA (libibverbs compat, unchanged) │
├───────────────────────────────────────────────────┤
│ RDMA hardware driver (mlx5, efa, bnxt_re, etc.) │
│ └── Implements RdmaDeviceVTable (Section 21.5) │
└───────────────────────────────────────────────────┘
5.4.2 Transport Abstraction
// umka-core/src/rdma/transport.rs (kernel-internal)
/// Unified transport for kernel-to-kernel communication.
/// Abstracts over RDMA hardware and provides the primitives that
/// other kernel subsystems (MM, IPC, page cache) use.
pub struct KernelTransport {
/// RDMA device used for kernel communication.
rdma_device: DeviceNodeId,
/// Protection domain for all kernel RDMA operations.
kernel_pd: RdmaPdHandle,
/// Pre-registered memory regions for fast page transfer.
/// Covers designated RDMA-eligible regions (pool size policy: Section 5.4.3).
/// Avoids per-transfer memory registration overhead.
kernel_mr: RdmaMrHandle,
/// Per-remote-node connections (reliable connected QPs).
/// O(1) indexed by NodeId. MAX_CLUSTER_NODES=64, so this array is ~N*sizeof(NodeConnection) bytes.
connections: [Option<NodeConnection>; MAX_CLUSTER_NODES],
/// Per-CPU completion queues, one per logical CPU.
/// A single shared CQ becomes a bottleneck under load because all CPUs
/// contend to drain it. Per-CPU CQs allow each CPU's RDMA poll thread
/// to drain its own queue independently, eliminating cross-CPU contention.
/// Each QP in NodeConnection is created against the CQ of the CPU that
/// owns that connection's polling thread (assigned at connection setup time).
/// Indexed by cpu_id; length = num_online_cpus() at transport init.
/// Uses a slab-allocated slice rather than a fixed array: the kernel has no
/// compile-time MAX_CPUS (Section 6.1.1) — CPU count is runtime-discovered.
/// Allocated once at transport init, never resized on the hot path.
cqs: SlabSlice<RdmaCqHandle>,
/// Statistics.
stats: TransportStats,
}
pub struct NodeConnection {
/// Reliable connected queue pair for control messages.
control_qp: RdmaQpHandle,
/// Reliable connected QP for one-sided RDMA (Read/Write/Atomic).
/// RDMA Read/Write requires RC or UC transport; UD only supports Send/Receive.
data_qp: RdmaQpHandle,
/// Remote node's memory region key (for RDMA read/write).
remote_rkey: u32,
/// Remote node's base address for page transfers.
remote_base_addr: u64,
/// Flow control: outstanding RDMA operations.
inflight: AtomicU32,
/// Maximum concurrent RDMA operations to this node.
max_inflight: u32,
}
/// Operations available to kernel subsystems.
impl KernelTransport {
/// RDMA Read: fetch a page from remote node to local memory.
/// One-sided — no remote CPU involvement.
/// Used by: MM (page fault on remote page), page cache (remote fill).
pub fn fetch_page(
&self,
remote_node: NodeId,
remote_phys_addr: u64,
local_page: PhysAddr,
size: u32, // Usually 4096 (4KB) or 2097152 (2MB huge page)
) -> Result<RdmaCompletion, TransportError>;
/// RDMA Write: push a page from local memory to remote node.
/// One-sided — no remote CPU involvement.
/// Used by: MM (page eviction to remote node), DSM (writeback).
pub fn push_page(
&self,
remote_node: NodeId,
local_page: PhysAddr,
remote_phys_addr: u64,
size: u32,
) -> Result<RdmaCompletion, TransportError>;
/// RDMA Atomic Compare-and-Swap (64-bit).
/// Used by: Distributed Lock Manager ([Section 14.6.5](14-storage.md#1465-rdma-native-lock-operations)) uncontested acquire,
/// DSM page ownership transfers.
pub fn atomic_cas(
&self,
remote_node: NodeId,
remote_addr: u64,
expected: u64,
desired: u64,
) -> Result<u64, TransportError>;
/// Send a control message to remote node (two-sided, requires remote CPU).
/// Used by: cluster management, capability distribution, scheduler.
pub fn send_control(
&self,
remote_node: NodeId,
msg: &ControlMessage,
) -> Result<(), TransportError>;
/// Batch page transfer: move N pages in a single RDMA operation.
/// Used by: bulk migration, prefetch, process migration.
pub fn fetch_pages_batch(
&self,
remote_node: NodeId,
pages: &[(u64, PhysAddr)], // (remote_phys, local_phys) pairs
) -> Result<RdmaBatchCompletion, TransportError>;
}
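The `inflight`/`max_inflight` pair in `NodeConnection` implements credit-based flow control for outstanding RDMA operations. A minimal userspace sketch of that logic, assuming illustrative helper names (`try_acquire_slot`/`release_slot` are not part of the KABI):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Credit-based flow control for outstanding RDMA operations (sketch).
struct FlowControl {
    inflight: AtomicU32,
    max_inflight: u32,
}

impl FlowControl {
    /// Try to reserve a slot before posting an RDMA operation.
    /// Returns false if the per-node limit is already reached.
    fn try_acquire_slot(&self) -> bool {
        self.inflight
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < self.max_inflight { Some(n + 1) } else { None }
            })
            .is_ok()
    }

    /// Release the slot when the operation's completion is polled.
    fn release_slot(&self) {
        self.inflight.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let fc = FlowControl { inflight: AtomicU32::new(0), max_inflight: 2 };
    assert!(fc.try_acquire_slot());
    assert!(fc.try_acquire_slot());
    assert!(!fc.try_acquire_slot()); // limit reached: caller must back off
    fc.release_slot();
    assert!(fc.try_acquire_slot()); // completion freed a slot
    println!("ok");
}
```

The `fetch_update` CAS loop makes acquisition race-free across CPUs without a lock, which matters because multiple kernel subsystems may post to the same `NodeConnection` concurrently.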
5.4.3 Pre-Registered Kernel Memory
The biggest performance problem with Linux RDMA is memory registration. Before any RDMA operation, memory must be pinned and registered with the NIC hardware (translated to physical addresses, programmed into NIC's address translation table). This costs ~1-10 μs per registration.
UmkaOS avoids per-transfer registration overhead by pre-registering designated RDMA-eligible memory regions at cluster join time.
Pool sizing policy:
The pool size is determined at cluster join time by the following formula:
rdma_pool_size = clamp(
min(
physical_ram * 25 / 100, // 25% of physical RAM
rdma_max_pool_gib * GIB, // configurable hard cap
),
256 * MIB, // minimum: always reserve 256 MiB
physical_ram, // maximum: never exceed total RAM
)
Where rdma_max_pool_gib is the rdma.max_pool_gib kernel parameter (default: 64 GiB).
The default 64 GiB cap reflects typical InfiniBand HCA memory registration limits and
prevents excessive physical page pinning on large systems. On a 256 GiB machine the
pool is 64 GiB (25%). On a 1 TiB machine the pool is capped at 64 GiB (6.25%), not
256 GiB, avoiding wasteful pinning on systems where most workloads are local. Operators
running dedicated RDMA fabrics on large memory systems can raise the cap via the
rdma.max_pool_gib tunable; there is no hard upper bound other than total RAM.
The 256 MiB floor ensures that even very small systems (e.g., embedded UmkaOS nodes with 8–16 GiB) always have enough RDMA-registered space for control-plane and DSM traffic.
NUMA-local allocation: The pool is divided proportionally across NUMA nodes:
node_pool_size = pool_size * node_physical_ram / total_physical_ram
Each NUMA node registers its own share as a separate memory region. RDMA NIC NUMA affinity (where the NIC is physically closest to a NUMA node) is accounted for in the distance matrix (Section 5.2.9) but does not affect the proportional split — all nodes contribute fairly to the pool.
Timing: The pool is registered at cluster join time, before any user processes issue RDMA requests. Registration is a one-time cost of approximately 100 μs per region.
Runtime adjustment:
- Shrink: drain in-flight RDMA operations that reference the region, unregister the
excess MRs, update the pool descriptor. Requires quiescing DSM pages in the shrunk
range (~1–10 ms depending on page count).
- Grow: register additional MRs, add to the pool descriptor, notify cluster peers
of the new rkey. No quiescing required.
- Both operations are triggered by writing to /sys/kernel/umka/cluster/rdma.max_pool_gib.
Boot sequence:
1. RDMA NIC driver initializes (standard KABI init).
2. Cluster join is requested.
3. umka-rdma computes pool size: min(RAM * 25%, max_pool_gib * GiB), ≥ 256 MiB.
Pool is allocated per-NUMA-node proportionally to node RAM.
4. umka-rdma registers RDMA-eligible memory regions (one per NUMA node):
- Single memory region per NUMA node, one-time cost (~100 μs total)
- NIC can DMA to/from registered pages without per-transfer registration
- Non-registered memory cannot be accessed by remote nodes
5. Remote nodes exchange rkeys (remote access keys).
6. Any kernel subsystem can now RDMA read/write registered pages with zero
setup cost. Pages outside the RDMA pool are not remotely accessible.
This is safe because:
- Only designated RDMA-eligible regions are registered (not all physical memory)
- The rkey is only shared with authenticated cluster members
- RDMA access is gated by the connection (reliable connected QP)
- The kernel controls which pages are exported (via page ownership tracking)
- IOMMU still validates all DMA (RDMA NIC goes through IOMMU like any device)
- Kernel text, kernel stacks, and security-sensitive structures are never in
the RDMA pool
5.4.4 Performance Characteristics
| Operation | Latency | Bandwidth | CPU Involvement |
|---|---|---|---|
| RDMA Read (4KB page) | ~3-5 μs | ~200 Gb/s line rate | Zero on remote side |
| RDMA Write (4KB page) | ~2-3 μs | ~200 Gb/s line rate | Zero on remote side |
| RDMA Atomic CAS | ~2-3 μs | N/A | Zero on remote side |
| Control message (send/recv) | ~1-2 μs | N/A | Interrupt on remote side |
| Batch page transfer (64 pages) | ~5-10 μs | Near line rate | Zero on remote side |
| Memory registration (avoided) | 0 (pre-registered) | N/A | N/A |
Compare with alternatives:

| Alternative | 4KB Fetch Latency | Notes |
|---|---|---|
| Local DRAM (same NUMA) | ~300-500 ns | Baseline (4KB page copy) |
| Local DRAM (cross-NUMA) | ~500-800 ns | QPI/UPI hop (4KB page copy) |
| RDMA (same switch) | ~3-5 μs | ~10x local DRAM, but... |
| CXL 2.0 pooled memory | ~200-400 ns | Hardware-coherent |
| NVMe SSD | ~10-15 μs | Current swap target |
| NFS/CIFS (TCP) | ~50-200 μs | Kernel networking overhead |
| HDD | ~5,000 μs | Rotational latency |
Key insight: Raw RDMA Read latency (~3 μs) is 3-5x faster than NVMe (~10-15 μs). However, the complete DSM page fault path (directory lookup + ownership negotiation + RDMA transfer) totals ~10-18 μs (see Section 5.6.4), which is comparable to NVMe latency rather than 3-5x faster. The advantage of remote DRAM over NVMe is not single-fault latency but bandwidth scalability: RDMA provides ~25 GB/s per port (200 Gb/s InfiniBand) vs. ~7 GB/s for a single NVMe SSD, and multiple RDMA ports can be aggregated. Remote memory is a better "swap" target than local disk for bandwidth-bound workloads.
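The bandwidth argument can be made concrete with the figures above (single port, no aggregation; a back-of-envelope model, not a benchmark):

```rust
/// Seconds to transfer `bytes` at `gbytes_per_sec` sustained bandwidth.
fn transfer_secs(bytes: u64, gbytes_per_sec: f64) -> f64 {
    bytes as f64 / (gbytes_per_sec * 1e9)
}

fn main() {
    let working_set = 64u64 * (1 << 30); // 64 GiB working set
    let rdma = transfer_secs(working_set, 25.0); // ~25 GB/s (one 200 Gb/s port)
    let nvme = transfer_secs(working_set, 7.0);  // ~7 GB/s (one NVMe SSD)
    // Remote DRAM over RDMA drains the working set ~3.6x faster than a
    // single NVMe SSD — and additional ports aggregate, while per-fault
    // latency is roughly a wash (see the key insight above).
    assert!(nvme / rdma > 3.5 && nvme / rdma < 3.6);
    println!("rdma: {:.1}s, nvme: {:.1}s", rdma, nvme);
}
```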
5.4.5 Security Considerations
The pre-registered RDMA pool approach (Section 5.4.3) creates a security trade-off:
any node with the rkey can read/write any address within the RDMA-eligible region on
the remote node via one-sided RDMA. The attack surface is limited to the RDMA pool
(capped at min(25% of RAM, rdma.max_pool_gib GiB), default ≤64 GiB), not all
physical memory — kernel text, stacks, and security structures are excluded.
Explicit trust model assumption: All nodes in an UmkaOS cluster are mutually trusted kernel instances, authenticated during cluster join (X25519 key exchange authenticated with Ed25519 signatures, Section 5.2.8). A compromised node can read/write the RDMA pool of any other node. This is the same trust model used by all production InfiniBand and RoCE deployments today. The cluster join authentication is the trust boundary — once joined, nodes are peers. If a node is suspected of compromise, its cluster membership is revoked (Section 5.9), destroying all QP connections and invalidating its RDMA mappings immediately.
Mitigation 1: RDMA Memory Windows (MW Type 2) — For security-sensitive or multi-tenant deployments, use RDMA Memory Windows instead of pool-wide registration. Each page export creates a short-lived memory window with a unique rkey, scoped to the specific page or region being transferred. The window is revoked when page ownership changes. This adds ~0.5-3μs overhead per window create/destroy (HCA-dependent; ConnectX-5/6 MW Type 2 bind/invalidate involves firmware interaction and PCIe round-trips) but eliminates pool-wide exposure. Caveat: the 0.5-3μs figure is measured on direct-attach single-hop InfiniBand/RoCE; multi-hop fabric topologies or RDMA-over-TCP add additional per-hop latency (typically 5-15μs on ConnectX across two switches). A compromised node can only access pages for which it currently holds a valid memory window, not the entire 25% pool.
Mitigation 2: Rkey rotation — In trusted mode, the pool-wide rkey is rotated periodically (default: every 60 seconds). All nodes exchange new rkeys via the authenticated control channel. Old rkeys are invalidated after a grace period (2x rotation interval = 120s). This limits the window of exposure if an rkey is leaked outside the cluster by a non-RDMA channel (e.g., software bug exposes rkey bytes in a log or network message): an external attacker who obtained the leaked rkey has at most 180 seconds (rotation interval + grace period) before the rkey becomes invalid.
Security invariant: Session encryption keys for all RDMA transfers are stored
in Core-private memory (Tier 0, PKEY 0 on x86, outside the pre-registered RDMA
pool). A compromised node that retains physical RDMA pool access cannot decrypt
other nodes' in-flight traffic without the session keys. On confirmed compromise
detection (CLUSTER_SUSPECT → CLUSTER_EVICT transition), all capability handles
granting RDMA pool access are immediately revoked in the local capability table via
cap_revoke_all(node_id) — in-flight DMA transactions complete but no new transfers
can be initiated by or to the evicted node. The 180-second window governs detection
propagation latency, not credential validity after detection.
Important: Rkey rotation is a defense-in-depth mechanism against rkey leakage to
non-cluster entities. It is NOT the revocation mechanism for evicted cluster members.
When a node's cluster membership is revoked (see "Eager rkey revocation on membership
loss" below), the master calls dereg_mr() on the evicted node's pool-wide Memory
Region immediately, invalidating the rkey in RNIC hardware within < 1ms. The 180s
rotation window does not apply to membership revocation — it applies only to the
background rotation schedule that limits exposure from rkey leakage outside the RDMA
fabric.
For deployments where even the 180s leakage window is unacceptable (multi-tenant, zero-trust), use Mitigation 1 (RDMA Memory Windows) instead, which provides immediate per-page revocation at the cost of higher per-operation overhead.
Mitigation 3: Trusted cluster mode (default) — For single-tenant, physically secured clusters (the common datacenter case), the pool-wide registration with rkey rotation is the default. This matches current InfiniBand practice — all production RDMA deployments assume a trusted fabric.
Mitigation 4: Hardware memory encryption — Note: Total Memory Encryption (Intel TME-MK, AMD SME) operates at the memory controller boundary and does NOT encrypt data visible to DMA devices. RDMA NICs access memory through the IOMMU in the CPU's coherency domain and see plaintext. TME-MK/SME protects against physical DRAM extraction attacks (cold boot) but does not provide defense-in-depth against software-based RDMA attacks. For RDMA data-in-transit protection, use the AES-GCM encryption described above.
Acknowledged limitation: All four mitigations have trade-offs. MW Type 2 adds latency to every page transfer. Rkey rotation adds periodic coordination overhead. Trusted cluster mode requires a physically secured fabric. Hardware encryption prevents cross-node data snooping but doesn't prevent a compromised node from writing garbage to remote memory. A fully zero-trust RDMA solution would require per-operation cryptographic MACs. For RoCE (RDMA over Converged Ethernet), MACsec (IEEE 802.1AE) provides link-layer encryption and integrity at the Ethernet layer, protecting against physical tapping and injection. For native InfiniBand, MACsec is not applicable (IB does not use Ethernet framing); InfiniBand relies on partition-based isolation (P_Key), queue-pair key authentication (Q_Key), and subnet manager access control for security. Neither MACsec nor InfiniBand's native mechanisms provide per-operation cryptographic authentication for one-sided RDMA operations at line rate. UmkaOS's pragmatic stance: match existing InfiniBand/RoCE security practice (trusted fabric with partition isolation), offer MW-based restriction for multi-tenant or security-sensitive deployments, and upgrade to hardware-enforced per-operation authentication when NIC vendors deliver it (CXL 3.0's integrity model is a likely candidate).
One-sided RDMA authentication: There is a fundamental tension between one-sided RDMA
(which requires no remote CPU involvement) and per-operation authentication (which
requires CPU processing). The resolution: the trust boundary is the cluster join,
authenticated via X25519 key exchange with Ed25519 signatures (Section 5.2.8). Within
the cluster, nodes are mutually trusted for RDMA operations. If a node is compromised,
its cluster membership is revoked — all QP connections to that node are destroyed and
its RDMA mappings are invalidated immediately via eager dereg_mr() (see "Eager rkey
revocation on membership loss" under Mitigation 2 above). Rkey revocation is
hardware-enforced with < 1ms latency from dereg_mr() call to rejection of in-flight
RDMA operations. There is no 180s exposure window for membership revocation — RDMA
hardware atomically invalidates the rkey as part of MR deregistration (IBA v1.4
Section 13.6.7.2). The 180s rkey rotation grace period (Mitigation 2) is a separate
defense-in-depth mechanism that limits exposure from rkey leakage to non-cluster
entities, not the revocation path for evicted members.
This matches the security model of all existing RDMA deployments: InfiniBand assumes a trusted fabric, and RoCEv2 relies on network isolation (VLANs, VXLANs) for multi-tenant separation.
Default: Trusted mode (pool-wide registration) is the default. This matches existing InfiniBand/RoCE practice — all production RDMA deployments assume a trusted fabric. Set via:
/sys/kernel/umka/cluster/security_mode
# "trusted" (default) or "secure"
Why trusted mode is the default: The security boundary in an UmkaOS cluster is the cluster join authentication (Section 5.2.8: X25519 key exchange authenticated with Ed25519 signatures). Once a node is authenticated into the cluster, it is a trusted peer. RDMA pool-wide access does not expand the trust boundary — a compromised node already has full access to cluster resources through authenticated channels.
Datacenter environments provide additional isolation layers:
- Physical access control (racks, cages, badge readers)
- Network isolation (VLANs, VXLANs, dedicated RDMA fabrics)
- Node attestation during cluster join
For these environments, the ~5% overhead of memory windows is unnecessary. Secure mode is appropriate for:
- Multi-tenant cloud environments where different customers share RDMA fabric
- Environments without physical network isolation
- Defense-in-depth deployments where the threat model includes node compromise despite authentication
Switching to secure mode:
echo "secure" > /sys/kernel/umka/cluster/security_mode
# Note: This triggers a brief (~100ms) disruption as QPs are reconfigured
5.4.6 QP Tear-down Protocol
Destroying a QP while messages are in-flight or Work Requests (WRs) are pending can result in completion events being lost, memory corruption, or use-after-free in the completion handler. The correct procedure for safe QP tear-down is:
1. Transition to ERROR state. Call `ibv_modify_qp(qp, IBV_QPS_ERR)`. This moves all outstanding WRs to error state and generates flush completions (with `IBV_WC_WR_FLUSH_ERR` status) for every pending send and receive WR. The QP no longer accepts new post operations after this transition.
2. Drain the Completion Queue. Poll the CQ until it is empty:
   ```rust
   loop {
       let n = ibv_poll_cq(cq, &mut wc_buf);
       if n == 0 { break; }
       // Handle or discard flush completions — all have IBV_WC_WR_FLUSH_ERR status.
       // Any non-flush completion here indicates a protocol bug (WRs posted after
       // the ERROR transition).
   }
   ```
   Draining ensures that no completion handler holds a reference to the QP after `ibv_destroy_qp()` is called. Skipping this step is a use-after-free vulnerability if any in-flight WR's completion fires asynchronously after the QP is freed.
3. Destroy the QP. Only after the CQ is empty: call `ibv_destroy_qp(qp)`. At this point no in-flight operation references the QP.
4. Destroy the CQ. After the QP is destroyed. Destroying the CQ before the QP can cause the NIC to write completions to freed memory.
UmkaOS enforcement: UmkaOS's RDMA transport layer tracks QP reference counts and
enforces the protocol:
- ibv_destroy_qp() returns EBUSY if the QP has not been transitioned to
IBV_QPS_ERR state first.
- ibv_destroy_qp() returns EBUSY if the CQ still has pending completions
referencing this QP (detected by checking the QP's tracked inflight counter and
scanning the CQ's pending-completion ring before accepting the destroy call).
- These checks fire in both debug and release builds, because a UAF from an
out-of-order tear-down is a correctness error on all configurations.
This protocol applies to all QP types used by UmkaOS (control and data QPs in
NodeConnection, doorbell QPs in distributed ring buffers). The cluster failure
handler (Section 5.9) follows this protocol when destroying QPs
after a node is marked Dead: it transitions each QP to ERROR state, drains completions,
then destroys the QP, then destroys the associated CQ.
5.5 Distributed IPC
5.5.1 Extending Ring Buffers to RDMA
UmkaOS's IPC is MPSC ring buffer-based (Section 10.6, Zero-Copy I/O Path, which includes the MPSC ring buffer protocol), using SQE/CQE structures compatible with io_uring. The same ring buffer protocol works over RDMA.
Local IPC (current design, unchanged):
┌──────────┐ shared memory ring ┌──────────┐
│ Process A │ ──── SQE/CQE ────► │ Process B │
└──────────┘ (mapped pages) └──────────┘
Same machine, same address space region.
Zero-copy. Latency: ~200ns.
Distributed IPC (new):
┌──────────┐ RDMA ring ┌──────────┐
│ Process A │ ──── SQE/CQE ────► │ Process B │
│ (Node 0) │ (RDMA write) │ (Node 1) │
└──────────┘ └──────────┘
Different machines, RDMA-connected.
Zero-copy (RDMA, no kernel networking stack). Latency: ~2-3 μs (RDMA Write).
The ring buffer protocol (SQE format, CQE format, head/tail pointers,
memory ordering) is identical. Only the transport changes.
5.5.2 Transparent Transport Selection
// umka-core/src/ipc/ring.rs (extend existing)
pub enum RingTransport {
/// Both endpoints on the same machine.
/// Ring is in shared memory (mapped into both processes).
SharedMemory {
ring_pages: PageRange,
},
/// Endpoints on different machines.
/// Submissions: producer RDMA-writes SQEs into consumer's ring memory.
/// Completions: consumer RDMA-writes CQEs back.
/// Doorbell: RDMA send (small control message) notifies consumer.
Rdma {
remote_node: NodeId,
remote_ring_addr: u64,
remote_ring_rkey: u32,
local_ring_pages: PageRange,
doorbell_qp: RdmaQpHandle,
},
/// Endpoints connected via CXL fabric (load/store accessible).
/// Ring is in CXL-attached shared memory. Same as SharedMemory
/// but pages are on a CXL memory pool node.
CxlSharedMemory {
ring_pages: PageRange,
cxl_node: NumaNodeId,
},
}
Transport selection is automatic:
Process A calls connect_ipc(target_process_id):
1. Kernel looks up target process location:
- Same machine → SharedMemory transport
- Remote machine with RDMA → Rdma transport
- CXL-connected pool → CxlSharedMemory transport
2. Kernel sets up ring buffer with appropriate transport
3. Process A gets back an IPC handle (opaque)
4. Process A uses the same SQE submission interface regardless of transport
5. The SQE/CQE format is identical in all cases
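Step 1's policy reduces to a match on the peer's location. A sketch, assuming illustrative names (`PeerLocation` and `select_transport` are not the kernel's actual internal API):

```rust
/// Where the IPC peer lives, as seen by the connecting kernel (illustrative).
#[derive(Debug, PartialEq)]
enum PeerLocation {
    SameMachine,
    RemoteRdma, // different machine, reachable over the RDMA fabric
    CxlPool,    // reachable via load/store on a CXL memory pool
}

/// Which ring transport to use (mirrors the RingTransport variants, data elided).
#[derive(Debug, PartialEq)]
enum TransportKind {
    SharedMemory,
    Rdma,
    CxlSharedMemory,
}

fn select_transport(loc: PeerLocation) -> TransportKind {
    match loc {
        PeerLocation::SameMachine => TransportKind::SharedMemory,
        PeerLocation::RemoteRdma => TransportKind::Rdma,
        PeerLocation::CxlPool => TransportKind::CxlSharedMemory,
    }
}

fn main() {
    // The caller never sees this choice: the IPC handle and SQE/CQE
    // interface are identical regardless of which variant is selected.
    assert_eq!(select_transport(PeerLocation::SameMachine), TransportKind::SharedMemory);
    assert_eq!(select_transport(PeerLocation::RemoteRdma), TransportKind::Rdma);
    assert_eq!(select_transport(PeerLocation::CxlPool), TransportKind::CxlSharedMemory);
    println!("ok");
}
```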
5.5.3 Ring Buffer RDMA Protocol
RDMA Ring Buffer Header Extension
The RDMA transport extends the base DomainRingBuffer header (Section 10.7.2) with
additional fields for cross-node synchronization. These fields are placed in a
separate RdmaRingHeader that precedes the standard header:
/// RDMA-specific ring buffer header extension.
/// Placed immediately before the DomainRingBuffer in RDMA transport mode.
/// Consumer-owned fields are on a separate cache line from producer-owned fields.
#[repr(C, align(64))]
pub struct RdmaRingHeader {
// === Producer-owned cache line (written by remote producer via RDMA) ===
/// Sequence number of the last published SQE.
/// The producer increments this AFTER writing SQE data to the ring.
/// The doorbell message carries this sequence number.
/// Memory ordering: producer uses Release, consumer uses Acquire.
pub producer_seq: AtomicU64,
/// Reserved for future use (cache line padding).
_pad_producer: [u8; 56],
// === Consumer-owned cache line (written locally) ===
/// Sequence number up to which the consumer has processed.
/// Used for flow control and to detect dropped entries.
pub consumer_seq: AtomicU64,
/// Reserved for future use (cache line padding).
_pad_consumer: [u8; 56],
}
Protocol: Sequence-Based Doorbell Synchronization
The core challenge is that RDMA Write and RDMA Send operations may arrive at the consumer out of order — the doorbell (RDMA Send) could arrive before the SQE data (RDMA Write) is visible in memory. To solve this, we use a sequence counter that the consumer polls to determine data readiness:
Producer (Node A) submits work:
1. Write SQE to local staging buffer
2. RDMA Write: push SQE to consumer's ring memory on Node B
(one-sided, no CPU involvement on Node B)
3. RDMA Write: update producer_seq in consumer's RdmaRingHeader.
The new value is (previous_seq + 1).
On RC (Reliable Connection) QPs, RDMA Writes are guaranteed to be delivered
and executed in posting order at the responder. Therefore, the SQE data
(step 2) is always visible before producer_seq (step 3), without
requiring any additional fencing.
Note: IBV_SEND_FENCE is NOT needed here — it only orders operations after
prior RDMA Reads and Atomics, not after prior Writes.
(Each producer has a dedicated QP per consumer, so per-QP ordering suffices.)
4. If consumer was idle: RDMA Send doorbell with payload = new producer_seq value.
The doorbell serves only as a notification to wake the consumer; the consumer
does NOT trust the doorbell's sequence value directly. Instead, the consumer
reads producer_seq from local memory (where it was written by step 3) with
Acquire ordering to establish the happens-before relationship.
Consumer (Node B) processes work:
1. Receive doorbell notification (RDMA Send) OR poll timeout
2. Read producer_seq with Acquire ordering from local RdmaRingHeader
3. Compare producer_seq to consumer_seq to determine how many entries are ready:
ready_count = producer_seq - consumer_seq
4. For each ready entry:
a. Read SQE from local ring memory (RDMA write from step 2 already placed it there)
b. Process request
c. Write CQE to local completion ring
5. Update consumer_seq with Release ordering after processing
6. If completions generated: RDMA Write CQEs back to producer's ring on Node A
Latency breakdown:
RDMA Write (SQE, 64 bytes): ~1 μs
RDMA Write (producer_seq): ~0.5 μs (RC QP guarantees Write-after-Write ordering;
no IBV_SEND_FENCE needed — see note above)
RDMA Send (doorbell): ~0.5 μs (may arrive before or after step 3;
consumer uses seq polling to handle either order)
Consumer processing: application-dependent
RDMA Write (CQE, 32 bytes): ~1 μs
Total overhead: ~3 μs round-trip (plus processing time; +0.5 μs vs. non-seq approach
for the producer_seq write, but eliminates the race condition)
Why sequence-based synchronization is required:
RDMA Send and RDMA Write are different verb types. While RDMA Writes are ordered within a single QP, the relationship between RDMA Send and RDMA Write is not guaranteed by the InfiniBand specification. A naive protocol that sends the doorbell after writing SQE data could observe:
Timeline (broken protocol):
Producer: Write SQE --(network)--> Consumer receives SQE in memory
Producer: Send doorbell --(network)--> Consumer receives doorbell interrupt
If the doorbell packet arrives first (different routing, less data to transfer),
the consumer's interrupt handler reads the ring before the SQE RDMA Write has
arrived — it sees stale or uninitialized data.
The sequence counter solves this by making the data-visibility indication itself travel via RDMA Write (which is ordered with respect to the SQE data writes). The doorbell is merely an optimization to avoid spinning; correctness depends only on the producer_seq field, which the consumer reads with Acquire ordering.
Memory ordering semantics:
| Operation | Ordering | Rationale |
|---|---|---|
| Producer writes SQE data | Relaxed | No ordering requirement until seq is published |
| Producer writes producer_seq | Release | Ensures SQE data is visible before seq advances |
| Consumer reads producer_seq | Acquire | Ensures seq read happens before SQE data read |
| Consumer writes consumer_seq | Release | Ensures SQE processing completes before ack |
| Producer reads consumer_seq | Acquire | Ensures ack read happens before reusing slots |
On x86-64, Release/Acquire compile to plain MOV instructions (TSO provides the required ordering). On AArch64, RISC-V, and PowerPC, the compiler emits the appropriate barriers (STLR/LDAR, fence-qualified atomics, lwsync/isync).
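The publication half of this protocol can be demonstrated in plain userspace Rust, with two threads standing in for the two nodes and plain stores standing in for the RDMA Writes (a model of the ordering rules in the table above, not the kernel implementation):

```rust
use std::array;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Minimal model of the RdmaRingHeader publication protocol.
struct Ring {
    entries: [AtomicU64; 8], // one u64 stands in for an SQE slot
    producer_seq: AtomicU64,
    consumer_seq: AtomicU64,
}

/// Run one producer/consumer exchange; returns the sum of consumed entries.
fn run_ring() -> u64 {
    let ring = Arc::new(Ring {
        entries: array::from_fn(|_| AtomicU64::new(0)),
        producer_seq: AtomicU64::new(0),
        consumer_seq: AtomicU64::new(0),
    });
    let producer = {
        let ring = Arc::clone(&ring);
        thread::spawn(move || {
            for i in 0..8u64 {
                // Write SQE data first (Relaxed: not yet published)...
                ring.entries[i as usize].store(100 + i, Ordering::Relaxed);
                // ...then publish with Release: data is visible before seq advances.
                ring.producer_seq.store(i + 1, Ordering::Release);
            }
        })
    };
    let mut sum = 0;
    loop {
        let cons = ring.consumer_seq.load(Ordering::Relaxed);
        if cons >= 8 { break; }
        // Acquire pairs with the producer's Release: all entries up to
        // `prod` are guaranteed visible after this load.
        let prod = ring.producer_seq.load(Ordering::Acquire);
        for i in cons..prod {
            sum += ring.entries[i as usize].load(Ordering::Relaxed);
        }
        // Acknowledge processed entries with Release (flow control).
        ring.consumer_seq.store(prod, Ordering::Release);
    }
    producer.join().unwrap();
    sum
}

fn main() {
    assert_eq!(run_ring(), (100..108).sum::<u64>()); // all 8 SQEs seen exactly once
    println!("ok");
}
```

In the real transport the producer's two stores are two RDMA Writes on the same RC QP, whose in-order execution at the responder supplies the Release edge; the consumer side is unchanged.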
5.5.4 Batching and Coalescing
For high-throughput scenarios (e.g., database replication, event streaming):
/// Batch submission: write multiple SQEs in a single RDMA operation.
/// Amortizes RDMA overhead across N entries.
pub struct RdmaBatchSubmit {
/// Number of SQEs to submit in this batch.
pub count: u32,
/// Maximum time to wait for batch to fill before flushing (μs).
pub max_coalesce_us: u32,
/// Minimum batch size before flushing.
pub min_batch_size: u32,
}
// With batching:
// Single SQE: ~3 μs overhead per entry
// Batch of 64 SQEs: ~5 μs total = ~78 ns per entry (38x improvement)
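The amortization arithmetic behind those figures, as a quick check:

```rust
/// Amortized per-entry overhead when batching N SQEs into one RDMA op.
fn per_entry_ns(batch_overhead_us: f64, batch_size: u32) -> f64 {
    batch_overhead_us * 1000.0 / batch_size as f64
}

fn main() {
    let single = per_entry_ns(3.0, 1);   // one SQE per RDMA op: ~3000 ns
    let batched = per_entry_ns(5.0, 64); // 64 SQEs per op: ~78 ns each
    assert!((batched - 78.125).abs() < 0.01);
    assert_eq!((single / batched) as u32, 38); // ~38x improvement
    println!("{:.0} ns/entry batched", batched);
}
```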
5.6 Distributed Shared Memory
5.6.1 Design Overview
Distributed Shared Memory (DSM) allows processes on different nodes to share a virtual address space. Pages migrate between nodes on demand, using RDMA for transport and page faults for coherence.
Why previous DSM projects failed and why this one can work:
| Past Problem | Our Solution |
|---|---|
| Cache-line coherence (64B) over network | Page-level coherence (4KB minimum) |
| Bolted onto Linux MM (invasive patches) | Designed into UmkaOS MM from start |
| No hardware fault mechanism for remote | RDMA + CPU page fault = standard demand paging |
| Software TLB shootdown over network | Targeted TLB shootdown via RDMA notification (only invalidating nodes listed in the sharer set) |
| Single writer protocol (slow) | Multiple-reader / single-writer with RDMA invalidation |
| No topology awareness | Cluster distance matrix drives all placement decisions |
| Catastrophic on node failure | Capability-based revocation + replicated directory |
5.6.2 Page Ownership Model
Every shared page has exactly one owner node and zero or more reader nodes:
// umka-core/src/dsm/ownership.rs
/// Ownership state of a distributed shared page.
#[repr(u8)]
pub enum DsmPageState {
/// Page is exclusively owned by this node. No remote copies exist.
/// This node can read and write freely.
Exclusive = 0,
/// Page is owned by this node, but read-only copies exist on other nodes.
/// To write: must first invalidate all reader copies.
SharedOwner = 1,
/// This node has a read-only copy. Owner is elsewhere.
/// To write: must request ownership transfer from current owner.
SharedReader = 2,
/// Page is not present on this node. Owner is elsewhere.
/// To read or write: fault → request page from owner via RDMA.
NotPresent = 3,
/// Page is being transferred (migration in progress).
Migrating = 4,
/// Ownership transfer is in progress: the directory entry has been updated to
/// reflect the new owner, but invalidations to current readers have not yet
/// completed. Nodes that read the directory entry during this window see
/// `Invalidating` and must block on the entry's `wait_queue` until the
/// state transitions to `Exclusive`, indicating all invalidations are acked.
Invalidating = 5,
}
/// Directory entry for a distributed shared page.
/// Stored on the home node (determined by hash of virtual address).
#[repr(C)]
pub struct DsmDirectoryEntry {
    /// Sequence counter for local CPU consistency on the home node.
    /// `DsmDirectoryEntry` is larger than 8 bytes, so concurrent reads/writes on
    /// the home node's CPU require a seqlock.
    ///
    /// Protocol (home node CPU only — NOT for RDMA):
    ///
    /// Writer (equivalent to Linux `write_seqlock()` / `write_sequnlock()`,
    /// which combines the spinlock acquisition and sequence increment into a
    /// single atomic operation):
    /// 1. Spin-wait until `sequence` is even (unlocked).
    /// 2. CAS(sequence, even_value, even_value + 1, Acquire) — acquires exclusive
    ///    access AND starts the seqlock write section (makes sequence odd). If the
    ///    CAS fails (another writer won the race), retry from step 1. The CAS
    ///    provides mutual exclusion: only one writer can transition a given even
    ///    value to odd. The Acquire ordering prevents subsequent field writes from
    ///    being reordered before the sequence becomes odd.
    /// 3. Modify fields.
    /// 4. store(sequence, even_value + 2, Release) — releases exclusive access AND
    ///    completes the seqlock write section (makes sequence even, advanced by 2
    ///    total from the original value). The Release ordering ensures all field
    ///    updates are visible before the sequence becomes even ("consistent").
    ///
    /// Reader: read sequence (Acquire) → read fields → re-read sequence (Acquire);
    /// retry if the two values mismatch or are odd.
    ///
    /// The CAS in step 2 provides spinlock semantics (mutual exclusion among
    /// writers); the even→odd→even transitions provide seqlock semantics for
    /// readers. No separate lock is needed for write serialization. The
    /// `entry_lock` field below serves a different purpose: protecting the wait
    /// queue and blocking path, which cannot use the seqlock (see `entry_lock`
    /// docs).
    ///
    /// Remote nodes do NOT read directory entries via one-sided RDMA Read — a
    /// seqlock cannot work across separate RDMA operations (no atomicity between
    /// reads). Instead, remote nodes send a two-sided directory lookup request to
    /// the home node (see Section 5.6.4 page fault flow), and the home node's CPU
    /// reads the entry locally (where the seqlock works correctly) and returns it.
    pub sequence: AtomicU64,
    /// Current owner of this page.
    pub owner: NodeId,
    /// Nodes that have read-only copies.
    /// Bitfield: bit N = node N has a copy.
    /// Supports up to MAX_CLUSTER_NODES (64) nodes; a u64 bitmask enables
    /// efficient set operations (add/remove reader in O(1) via bit set/clear).
    pub readers: u64,
    /// Page state on the owner node.
    pub state: DsmPageState,
    /// Version counter (incremented on every ownership transfer).
    /// Used for consistency checks and stale-copy detection.
    pub version: u64,
    /// Membership epoch when this entry's home assignment was last computed.
    /// When cluster membership changes (node join/leave), the modular hash
    /// assignment is recomputed and directory entries are reassigned to new home
    /// nodes. This field records the epoch of the last rebalance so that stale
    /// entries from a previous epoch can be detected and re-homed.
    pub rehash_epoch: u32,
    /// Wait queue for blocking on `Invalidating` → `Exclusive` transitions.
    /// Request handlers that encounter `state == Invalidating` block here instead
    /// of spin-waiting; the queue is woken when the invalidation completes.
    /// This field is NOT transmitted over RDMA — it is local to the home node's
    /// kernel memory and does not affect the wire format of directory entries.
    /// Size: 8 bytes (pointer). The WaitQueueHead itself is lazily allocated on
    /// first contention — most entries never have waiters.
    pub wait_queue: AtomicPtr<WaitQueueHead>, // null = no waiters
    /// Per-entry spinlock protecting the wait queue and blocking path.
    /// The seqlock (`sequence`) handles the fast-path read side; this spinlock
    /// protects `prepare_to_wait()` / `wake_up_all()` on `wait_queue` since
    /// seqlocks cannot be held across `schedule()`. Lock ordering: seqlock
    /// writer THEN entry_lock (never reversed).
    pub entry_lock: SpinLock,
}
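The CAS-based seqlock protocol documented on `sequence` can be exercised as a userspace sketch. This is an illustrative approximation, not kernel code: `SeqEntry` and its field types are stand-ins (the real entry uses `NodeId` and a plain `u64` behind the seqlock), and the kernel version additionally manages preemption inside the write section.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Userspace sketch of the CAS-based seqlock from DsmDirectoryEntry::sequence.
struct SeqEntry {
    sequence: AtomicU64,
    owner: AtomicU64,   // stand-in for `owner: NodeId` (atomic so the sketch is race-free)
    version: AtomicU64, // stand-in for `version: u64`
}

impl SeqEntry {
    fn write(&self, owner: u64, version: u64) {
        loop {
            let s = self.sequence.load(Ordering::Relaxed);
            if s % 2 != 0 { continue; } // step 1: wait until even (unlocked)
            // Step 2: CAS even -> odd acquires exclusive access and opens the
            // write section; only one writer can win this transition.
            if self.sequence
                .compare_exchange(s, s + 1, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                // Step 3: modify fields.
                self.owner.store(owner, Ordering::Relaxed);
                self.version.store(version, Ordering::Relaxed);
                // Step 4: store even (s + 2) with Release — publishes the updates.
                self.sequence.store(s + 2, Ordering::Release);
                return;
            }
        }
    }

    fn read(&self) -> (u64, u64) {
        loop {
            let s1 = self.sequence.load(Ordering::Acquire);
            if s1 % 2 != 0 { continue; } // odd: writer in progress, retry
            let snap = (self.owner.load(Ordering::Relaxed),
                        self.version.load(Ordering::Relaxed));
            if self.sequence.load(Ordering::Acquire) == s1 {
                return snap; // sequence unchanged: snapshot is consistent
            } // else a writer intervened: retry
        }
    }
}

fn main() {
    let e = Arc::new(SeqEntry {
        sequence: AtomicU64::new(0),
        owner: AtomicU64::new(0),
        version: AtomicU64::new(0),
    });
    // Each writer always stores owner == version, so any consistent snapshot
    // must satisfy that invariant — a torn read would violate it.
    let writers: Vec<_> = (1..=4u64).map(|w| {
        let e = Arc::clone(&e);
        thread::spawn(move || for _ in 0..1000 { e.write(w, w); })
    }).collect();
    for _ in 0..1000 {
        let (o, v) = e.read();
        assert_eq!(o, v);
    }
    for t in writers { t.join().unwrap(); }
    // 4 writers × 1000 writes × (+2 per write) = sequence 8000, even (unlocked).
    assert_eq!(e.sequence.load(Ordering::Relaxed), 8000);
}
```

The `owner == version` invariant in the test is the point of the seqlock: readers either see a fully consistent snapshot or retry, never a torn mix of two writes.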
Wire format: When a directory entry is transmitted over RDMA (e.g., from primary home to backup, or in a directory lookup response), only the core fields are sent — local-only fields (`sequence`, `wait_queue`, `entry_lock`) are excluded. The wire format is a packed `#[repr(C)]` struct:

| Offset | Size | Field |
|---|---|---|
| 0 | 4 | `owner: NodeId` (u32) |
| 4 | 4 | `state: DsmPageState` (serialized as u32; the enum is `#[repr(u8)]` in memory, zero-extended to u32 on the wire for alignment) |
| 8 | 8 | `readers: u64` (bitmask) |
| 16 | 8 | `version: u64` |
| 24 | 4 | `rehash_epoch: u32` |
| 28 | 4 | `_pad: [u8; 4]` (alignment) |

Total: 32 bytes. Fits in a single RDMA inline send (≤64 bytes on most HCAs). The receiver reconstructs the full `DsmDirectoryEntry` by initializing `sequence` to 0 (even = consistent), `wait_queue` to null, and `entry_lock` to unlocked.

Note: The `u64` bitfield limits clusters to MAX_CLUSTER_NODES (64) nodes, matching the design constant defined in Section 5.2.9. This limit is chosen because a u64 bitmask enables O(1) reader set operations (add, remove, membership test via bit manipulation) and fits in a single atomic operation. For larger clusters, extend to `[u64; N]` or use a compact set representation. The 64-node limit covers the target deployment (rack-scale RDMA clusters); extension is deferred (see Section 5.11.2.6).

RDMA pool constraint: DSM pages are allocated from the RDMA-registered memory pool (Section 5.4.3), sized as `min(RAM × 25%, rdma.max_pool_gib GiB)` with a 256 MiB floor (default cap: 64 GiB). Only pages within the RDMA pool can be transferred to or fetched from remote nodes. If the pool is exhausted, remote page faults block until pages are freed via LRU-based eviction (the DSM eviction policy reclaims the least-recently-accessed remote-resident pages first). The cap and percentage are adjustable via `rdma.max_pool_gib` and `/sys/kernel/umka/cluster/rdma_pool_percent`; workloads with large distributed working sets should increase the cap accordingly.
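The 32-byte wire format can be sketched as a pack/unpack pair, together with the O(1) reader-bitmask operations the `readers` field enables. Little-endian byte order and the `WireEntry` name are assumptions of this sketch; the spec above fixes only the offsets and sizes.

```rust
use std::convert::TryInto;

// Sketch of the 32-byte DsmDirectoryEntry wire format (little-endian assumed).
#[derive(Debug, PartialEq, Clone, Copy)]
struct WireEntry {
    owner: u32,        // NodeId
    state: u32,        // DsmPageState, zero-extended from u8
    readers: u64,      // bit N set = node N holds a read-only copy
    version: u64,
    rehash_epoch: u32,
}

fn pack(e: &WireEntry) -> [u8; 32] {
    let mut b = [0u8; 32];
    b[0..4].copy_from_slice(&e.owner.to_le_bytes());
    b[4..8].copy_from_slice(&e.state.to_le_bytes());
    b[8..16].copy_from_slice(&e.readers.to_le_bytes());
    b[16..24].copy_from_slice(&e.version.to_le_bytes());
    b[24..28].copy_from_slice(&e.rehash_epoch.to_le_bytes());
    // bytes 28..32 remain zero (_pad)
    b
}

fn unpack(b: &[u8; 32]) -> WireEntry {
    WireEntry {
        owner: u32::from_le_bytes(b[0..4].try_into().unwrap()),
        state: u32::from_le_bytes(b[4..8].try_into().unwrap()),
        readers: u64::from_le_bytes(b[8..16].try_into().unwrap()),
        version: u64::from_le_bytes(b[16..24].try_into().unwrap()),
        rehash_epoch: u32::from_le_bytes(b[24..28].try_into().unwrap()),
    }
}

// O(1) reader-set operations on the u64 bitmask (max 64 nodes).
fn add_reader(readers: u64, node: u32) -> u64 { readers | (1u64 << node) }
fn remove_reader(readers: u64, node: u32) -> u64 { readers & !(1u64 << node) }
fn is_reader(readers: u64, node: u32) -> bool { readers & (1u64 << node) != 0 }

fn main() {
    let e = WireEntry { owner: 2, state: 1, readers: add_reader(0, 5), version: 42, rehash_epoch: 3 };
    let bytes = pack(&e);
    assert_eq!(bytes.len(), 32); // fits a single RDMA inline send
    assert_eq!(unpack(&bytes), e);
    assert!(is_reader(e.readers, 5));
    assert!(!is_reader(remove_reader(e.readers, 5), 5));
}
```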
5.6.3 Home Node Directory
Each shared page has a home node determined by hashing its virtual address. The home node stores the authoritative directory entry (who owns the page, who has copies). This avoids a centralized directory server.
Directory indexing data structure: The home node stores directory entries in a
per-region radix tree indexed by virtual page number within the region. The radix tree
uses 9-bit fan-out (512 entries per node, matching page table structure) with
RCU-protected reads and per-node spinlocks for modification. For a 1TB region with
4KB pages (268M entries), the radix tree uses approximately 4 levels with ~524K internal
nodes (~32MB of metadata). Mutual exclusion for directory entry writes is provided by
the CAS-based seqlock writer protocol defined on DsmDirectoryEntry::sequence: the
CAS(even → odd) step provides spinlock semantics (only one writer can succeed), and the
even→odd→even transitions provide seqlock semantics for readers. This eliminates the
need for a separate per-entry spinlock — the sequence field's CAS-based writer
protocol inherently serializes writers (see the DsmDirectoryEntry writer-serialization
notes in Section 5.6.2).
Page with virtual address VA:
home_node = hash(dsm_region_id, VA) % cluster_size
The home node might not be the owner or have a copy.
It just maintains the directory entry.
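A minimal sketch of the home-node computation, using `DefaultHasher` as a stand-in for whatever fast hash the kernel actually uses. The masking of offset-in-page bits is an assumption of this sketch — any address within a page must map to the same home node.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const PAGE_SHIFT: u64 = 12; // 4KB pages

// home_node = hash(dsm_region_id, VA) % cluster_size, computed on the
// virtual page number so all offsets within a page agree.
fn home_node(region_id: u64, va: u64, cluster_size: u64) -> u64 {
    let vpn = va >> PAGE_SHIFT; // drop offset-in-page bits
    let mut h = DefaultHasher::new();
    (region_id, vpn).hash(&mut h);
    h.finish() % cluster_size
}

fn main() {
    // Same page, different offsets -> same home node.
    assert_eq!(home_node(1, 0x7000_1000, 8), home_node(1, 0x7000_1FFF, 8));
    // Deterministic: every node computes the same answer for the same inputs.
    assert_eq!(home_node(1, 0x7000_1000, 8), home_node(1, 0x7000_1000, 8));
    // Result is a valid node index.
    assert!(home_node(1, 0x7000_1000, 8) < 8);
    // Changing cluster_size remaps most pages (modular, not consistent, hashing).
    let before: Vec<u64> = (0..64u64).map(|p| home_node(1, p << PAGE_SHIFT, 8)).collect();
    let after: Vec<u64> = (0..64u64).map(|p| home_node(1, p << PAGE_SHIFT, 9)).collect();
    let moved = before.iter().zip(&after).filter(|(a, b)| a != b).count();
    assert!(moved > 0); // typically the large majority of entries move
}
```

The last assertion illustrates why membership changes trigger the incremental rehash described below: modular hashing remaps most entries when `cluster_size` changes.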
Why hash-based:
- No single point of failure (every node is home for some pages)
- O(1) lookup (no traversal)
- Uniform distribution across nodes (modular hashing)
- If home node fails: rehash to backup (Section 5.9)
Note: This is modular hashing (hash % cluster_size), NOT consistent hashing.
Modular hashing remaps most entries when cluster_size changes. UmkaOS targets
fixed-membership clusters where node join/leave is a rare, coordinated event.
When membership changes, directory rehash uses incremental migration:
1. Quorum leader announces new cluster_size to all nodes.
2. Each home node computes which of its directory entries map to a
different home node under the new hash. These entries are marked
"migrating" but remain readable at the old location.
3. New entries are placed in the new hash location immediately.
4. Existing entries are migrated lazily on access: when a node receives
a directory lookup for an entry it no longer owns (under the new hash),
it forwards the request to the new home node. If the new home node
does not yet have the entry, the old home node transfers it on demand.
5. A background sweep migrates remaining entries at low priority.
6. The directory maintains both old and new hash functions during the
migration window. No stop-the-world pause is required. The migration
duration depends on cluster size and DSM dataset size: for typical
clusters (8-32 nodes, <100GB DSM), the background sweep completes in
~1-5 seconds. For large DSM datasets (e.g., 1TB, ~256M directory entries),
migration of all entries takes longer — potentially minutes if entries are
not accessed during the sweep (each migration requires an RDMA round-trip
of ~3μs). The system remains fully functional during this window via the
dual-hash lookup (old location serves requests until migration completes).
7. After all entries are migrated, the old hash function is retired.
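The dual-hash window in steps 2-7 can be sketched with a toy directory: the old home serves entries that have not yet migrated, and returns a redirect for ones that have. `Directory`, `Lookup`, and the toy modular hash are illustrative names, not the real structures.

```rust
use std::collections::HashMap;

// Sketch of the dual-hash migration window: the old home answers until an
// entry migrates, after which a lookup yields at most one REDIRECT.
#[derive(Debug, PartialEq)]
enum Lookup {
    Entry(u64),    // version, standing in for the full directory entry
    Redirect(u64), // retry at this node id (the new home)
}

struct Directory {
    node_id: u64,
    entries: HashMap<u64, u64>, // vpn -> version
}

fn old_home(vpn: u64, old_size: u64) -> u64 { vpn % old_size } // toy hash
fn new_home(vpn: u64, new_size: u64) -> u64 { vpn % new_size }

impl Directory {
    // Called on the node the *old* hash points at, during the migration window.
    fn lookup(&self, vpn: u64, new_size: u64) -> Lookup {
        match self.entries.get(&vpn) {
            Some(v) => Lookup::Entry(*v), // not migrated yet: serve it directly
            None => Lookup::Redirect(new_home(vpn, new_size)), // migrated: one redirect
        }
    }
}

fn main() {
    // Cluster grows 4 -> 5; vpn 7 had home 7 % 4 = 3, new home 7 % 5 = 2.
    let mut old = Directory { node_id: 3, entries: HashMap::new() };
    old.entries.insert(7, 1);
    assert_eq!(old_home(7, 4), old.node_id);
    // Before migration: the old home still serves the entry.
    assert_eq!(old.lookup(7, 5), Lookup::Entry(1));
    // After the background sweep migrates the entry: one redirect, then done.
    old.entries.remove(&7);
    assert_eq!(old.lookup(7, 5), Lookup::Redirect(2));
}
```

Because the new hash is deterministic, the redirect target is always the correct final home, which is what bounds a fault to at most one redirect in the common case.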
In-flight page faults during rehash: A page fault that arrives at a node which
is no longer the home node (under the new hash) is handled by forwarding:
- The old home node checks if it still has the directory entry locally.
If yes, it services the request directly (the entry hasn't migrated yet).
- If the entry has already migrated, the old home node returns a REDIRECT
response with the new home node ID. The faulting node retries the lookup
at the new home node. At most one redirect per fault in the common case
(the new hash is deterministic, so the second lookup always reaches the
correct node). During active entry migration, the fault handler may spin
for up to 8 retries before the redirect (see write-fault race below),
bounding worst-case latency to ~8μs + one redirect.
- Ownership transfers in progress (Section 5.6.4 steps 3-8) that span the
rehash boundary complete under the old hash. The entry is migrated to the
new home node after the transfer completes. This is safe because the
old home node holds the entry until migration, and the seqlock prevents
concurrent modification.
- **Concurrent rehash**: If a second membership change occurs while a rehash is
in progress, the first rehash is completed before the second begins (the
**quorum leader** — defined as the lowest-ID node in the current majority
partition — serializes membership change announcements). During the first
rehash's 1-5 second migration window, the quorum leader holds a membership-change lock
that prevents new node join/leave from being processed. New membership events
are queued and processed sequentially after the current rehash completes. This
ensures that at most one hash transition is active at any time, preserving the
"at most one redirect" guarantee.
**Failure during rehash**: If a node fails during a rehash, the failure
is handled by the existing membership protocol (Section 5.9) — the dead
node's directory entries are redistributed as part of the rehash, not as a
separate event. If the **quorum leader** dies while holding the membership-change
lock, the new quorum leader is deterministically identified as the lowest-ID
surviving node in the majority partition (Section 5.9.2.3). The new
leader inherits the in-progress rehash by querying all surviving nodes for their
current hash function version (old vs. new). It then either completes the rehash
(if >50% of entries have migrated) or aborts it (rolling back to the old hash
function). The membership-change lock is not a distributed lock — it is a logical
role held by the quorum leader, so quorum-leader reassignment implicitly transfers it.
Write-fault race during directory entry migration: A write fault on a page
whose directory entry is currently being moved to a new home node (step 4
above — lazy on-demand transfer from old to new home node) requires care:
- Each directory entry carries a per-entry rehash_epoch field (u32) that
is incremented when the entry is tagged "migrating to new home" (step 2).
- The fault handler reads the rehash_epoch before and after reading the
directory entry (as part of the existing seqlock protocol). If the epoch
has changed, or if the entry's state is "migrating", the fault handler
treats this as a transient condition and retries the directory lookup after
a short spin (up to 8 retries with 1 μs backoff, then falls back to the
REDIRECT path).
- Alternatively, the home node may hold a per-entry spinlock during the
entry-transfer step (old home → new home). The fault handler acquiring the
same lock before reading the entry ensures it either sees the entry fully
present (pre-transfer) or receives a REDIRECT (post-transfer), with no
window where the entry is partially visible.
Both mechanisms are equivalent in correctness; the epoch/retry approach is
preferred because it avoids blocking the fault path on a lock acquisition.
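A sketch of the preferred epoch/retry approach, with `read_entry` standing in for the seqlock-protected directory read and the retry bound taken from the text; all names here are illustrative.

```rust
// Sketch of the epoch/retry fault path: re-read the entry until the
// rehash_epoch is stable and the state is not "migrating", for up to
// 8 attempts, then fall back to the REDIRECT path.
#[derive(Clone, Copy, PartialEq)]
enum State { Stable, Migrating }

#[derive(Clone, Copy)]
struct Entry { rehash_epoch: u32, state: State, owner: u64 }

enum FaultResult { Owner(u64), Redirect }

fn resolve(read_entry: &mut dyn FnMut() -> Entry) -> FaultResult {
    const MAX_RETRIES: u32 = 8;
    for _ in 0..MAX_RETRIES {
        let before = read_entry();
        let after = read_entry(); // epoch re-read brackets the entry read
        if before.rehash_epoch == after.rehash_epoch && after.state == State::Stable {
            return FaultResult::Owner(after.owner);
        }
        // in the kernel: ~1 μs backoff here before retrying
    }
    FaultResult::Redirect // entry is mid-transfer; let the new home node answer
}

fn main() {
    // Entry settles after two unstable reads: the retry loop rides it out.
    let mut calls = 0;
    let mut read = || {
        calls += 1;
        if calls <= 2 { Entry { rehash_epoch: 1, state: State::Migrating, owner: 0 } }
        else { Entry { rehash_epoch: 2, state: State::Stable, owner: 4 } }
    };
    match resolve(&mut read) {
        FaultResult::Owner(o) => assert_eq!(o, 4),
        FaultResult::Redirect => unreachable!(),
    }
    // Entry that never settles: retries are bounded, then REDIRECT.
    let mut read2 = || Entry { rehash_epoch: 1, state: State::Migrating, owner: 0 };
    assert!(matches!(resolve(&mut read2), FaultResult::Redirect));
}
```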
This is fundamentally different from DLM's consistent hashing (Section 14.6.4
in 14-storage.md), which uses a hash ring for minimal redistribution because
lock master reassignment must be fast and non-disruptive.
5.6.4 Page Fault Flow
Process on Node A reads address VA (not present locally):
1. CPU page fault on Node A.
2. UmkaOS MM handler identifies VA as part of a DSM region.
3. Compute home_node = hash(region, VA) % cluster_size = Node C.
4. RDMA Send to home Node C: "Lookup directory entry for VA."
(Two-sided — the home node's CPU reads the entry locally using the seqlock
and returns it. One-sided RDMA Read cannot be used because the DsmDirectoryEntry
is larger than 8 bytes and a seqlock requires atomic re-reads, which separate
RDMA Read operations cannot provide. Round-trip: ~4-5 μs.)
Node C's request handler reads the directory entry under the seqlock. If the
entry's state is a transient state, the handler resolves it locally before
responding:
- **state == Invalidating**: A concurrent write-fault is invalidating readers.
The handler **blocks on a per-entry wait queue** (see "Invalidating state
blocking mechanism" below) until the state transitions out of Invalidating
(typically to Exclusive, once all invalidation acks arrive). The requesting
node (A) never sees the Invalidating state — the home node resolves it before
replying. The handler does NOT spin-wait; spinning for up to 1000ms (the
membership dead timeout) would block the CPU and could cause priority inversion
or deadlock in the request handler thread pool.
- **state == Migrating**: The page's home assignment is being transferred to a
different node due to a membership change. The handler returns EAGAIN. Node A
retries the directory lookup after a brief backoff (10-50μs), by which time
the migration has typically completed and the new home node is authoritative.
5. Directory says: owner = Node B, state = Exclusive (or SharedOwner).
6. RDMA Send to Node B: "Request read copy of page at VA."
(Two-sided, Node B's kernel handles the request.)
7. Node B's handler:
a. Transitions page state from Exclusive → SharedOwner.
b. RDMA Write: sends 4KB page to Node A.
c. RDMA Send: Node B sends directory update request to Node C (add Node A
to readers). Node C's CPU processes the request locally — updating the
`readers` bitfield, `state`, and `version` fields atomically using the
local seqlock. (~4-5 μs round-trip.)
8. Node A receives page:
a. Installs page in local page table (read-only mapping).
b. Sets local DsmPageState = SharedReader.
c. Resumes faulting process.
Total latency: ~10-18 μs (directory lookup ~4-5 μs + ownership request ~3-5 μs
+ page transfer ~3-5 μs + local install ~1 μs)
Note: "page transfer ~3-5 μs" is the raw 4KB RDMA transfer time (the flow above uses an
RDMA Write; the 4KB RDMA Read row in the Section 5.4.4 Performance Characteristics table
gives the equivalent figure). Software overhead (directory lookup, ownership protocol, TLB
shootdown) is accounted in the other terms.
Compare: NVMe page fault = ~12-15 μs (comparable)
Process on Node A writes to address VA (has read-only copy):
1. CPU page fault (write to read-only page) on Node A.
2. UmkaOS MM handler identifies DSM page with SharedReader state.
3. Request ownership transfer:
a. RDMA Send to home Node C: "Request exclusive ownership of VA."
b. Node C looks up directory: owner = Node B, readers = {A, D}.
Node C transitions the directory entry to `Invalidating` state (setting
`owner = Node A` (the requester), `state = DsmPageState::Invalidating`)
under the seqlock before sending any invalidations. This ensures that any
concurrent directory lookup during the invalidation window sees the
`Invalidating` state and waits (see read-fault flow step 4).
c. Node C sends invalidation to all readers except requester:
- RDMA Send to Node D: "Invalidate your copy of VA."
- Node D unmaps page, flushes TLB, sends ack to Node C.
d. After all sharer acks received:
**Case 1: home != owner (C != B) — normal forwarding:**
- Node C sends ownership transfer request to Node B:
"Transfer ownership of VA to Node A."
- Node B unmaps page, sends page data directly to Node A via RDMA Write.
- Node B sends ack to Node C confirming transfer complete.
- Node B transitions to NotPresent.
**Case 2: home == owner (C == B) — skip forwarding:**
- The home node IS the current owner. No forwarding message is needed.
The home node directly invalidates its own local copy, prepares the
page data, and sends it to Node A via RDMA Write. This avoids
self-messaging (which would deadlock or cause infinite forwarding
if the home node's request handler sends a message to itself).
- Node C (the home/owner) unmaps its own page, transitions to NotPresent.
- **Concurrency model (applies to both Case 1 and Case 2)**: The directory
entry is transitioned to `Invalidating` at step 3b (before any invalidation
messages are sent), using the CAS-based seqlock writer protocol for mutual
exclusion. Other request handlers that read the directory entry during the
invalidation window see the `Invalidating` state and **block on the
per-entry wait queue** (see "Invalidating state blocking mechanism" below)
until the state transitions to `Exclusive` (indicating all invalidation
acks have been received and the transfer is complete). This prevents
stale-ownership reads: no node can observe the old owner in the directory
entry while invalidations are in flight. Invalidation requests to readers
are sent after the seqlock write section completes (sequence returns to
even), allowing concurrent directory reads to see the `Invalidating` state.
After all acks arrive, the handler re-acquires the seqlock, transitions
the state from `Invalidating` to `Exclusive` at step 3f, and **wakes all
waiters** on the per-entry wait queue. This prevents deadlock because
other directory operations on the same home node can proceed while waiting
for invalidation acks (waiters are descheduled, not spinning). Additionally,
invalidation ack processing on a reader node does NOT require any directory
operation — it only involves local page table manipulation and an RDMA
Send reply. Therefore, invalidation acks cannot create circular lock
dependencies.
e. Node A confirms receipt of page data by sending ack to Node C.
f. Node C updates directory: owner = Node A, readers = {}, state = Exclusive.
g. Node C sends grant to Node A (RDMA Send): "You now own VA exclusively."
4. Node A receives exclusive ownership:
a. Installs page with read-write mapping.
b. Resumes faulting process.
Total latency: ~15-25 μs (involves invalidating remote copies)
Home==owner optimization saves one RDMA round-trip (~3-5 μs) in that case.
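Step 3c's invalidation target set falls directly out of the `readers` bitmask — every reader except the requester. A small sketch (function names are illustrative):

```rust
// Sketch of step 3c: compute invalidation targets from the readers bitmask.
// Bit N = node N holds a read-only copy.
fn invalidation_targets(readers: u64, requester: u32) -> u64 {
    readers & !(1u64 << requester) // all readers except the requester
}

// Enumerate set bits as node ids (for sending one RDMA Send per target).
fn targets_as_nodes(mut mask: u64) -> Vec<u32> {
    let mut v = Vec::new();
    while mask != 0 {
        v.push(mask.trailing_zeros());
        mask &= mask - 1; // clear lowest set bit
    }
    v
}

fn main() {
    // Directory: readers = {A=0, D=3}; requester = A(0) -> invalidate only D(3).
    let readers = (1u64 << 0) | (1u64 << 3);
    let t = invalidation_targets(readers, 0);
    assert_eq!(targets_as_nodes(t), vec![3]);
    // Sole reader is the requester itself: nothing to invalidate remotely.
    assert_eq!(invalidation_targets(1 << 2, 2), 0);
}
```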
RDMA Send failure handling: If the initial RDMA Send of an invalidation request
fails (queue pair error, transport failure, out-of-send-WRs), the home node treats
the failure as an immediate entry into the retry escalation path below — equivalent
to an instant ACK timeout. The home node does NOT wait 200 μs before retrying;
instead it immediately attempts the first retry via a backup QP (if available) or
a two-sided fallback channel. If all retries fail, the home node escalates to the
membership protocol (the Suspect step below). This ensures that directory entries never
get stuck in the Invalidating state due to transport-level failures.
Invalidation ack timeout with escalation: Each invalidation request sent by the home node in step 3c carries a 200 μs timeout. The home node does NOT proceed with the ownership transfer until all readers have acknowledged invalidation or been removed from the reader set — doing so would violate coherence (the stale reader could read data that the new exclusive owner has since modified). If a reader does not acknowledge within 200 μs, the home node escalates:
- Retry (up to 3 attempts, 200 μs apart): The home node re-sends the invalidation request. A live but slow reader (e.g., handling a long interrupt or scheduling delay) will respond on retry.
- Suspect (after 600 μs total): The home node reports the non-responding reader to the membership protocol (Section 5.9) as suspect. If the node is genuinely unreachable, the membership protocol will mark it Suspect after 3 missed heartbeats (300 ms) and Dead after 10 missed heartbeats (1000 ms, per Section 5.9.2.2), at which point its reader bit is cleared from all directory entries.
- Proceed after removal: Once the reader has either acknowledged the invalidation or been marked Dead by the membership protocol (at which point its reader bit is cleared from the directory entry), the ownership transfer proceeds. The write-faulting thread blocks during escalation but is guaranteed forward progress — either the reader responds or the membership protocol eventually marks it Dead and removes it.
This ensures strict coherence: no node can hold a read-only mapping while another node holds exclusive ownership. The worst-case latency for a write fault with an unresponsive reader is ~1000 ms (the membership dead timeout: 10 missed heartbeats at 100 ms intervals, per Section 5.9.2.2), compared to Linux's 10-30 second fencing delay. In the common case (all readers responsive), the additional cost is zero — the 200 μs timeout never fires.
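The escalation schedule can be captured as a timing-model sketch; the durations are the upper bounds quoted above, and `escalate`/`Outcome` are illustrative names.

```rust
// Timing-model sketch of the invalidation-ack escalation path.
#[derive(Debug, PartialEq)]
enum Outcome {
    Acked { attempt: u32 },
    ReaderMarkedDead, // reader bit cleared from the directory entry
}

const ACK_TIMEOUT_US: u64 = 200;        // per-attempt invalidation ack timeout
const MAX_ATTEMPTS: u32 = 3;            // initial send counts as attempt 1
const DEAD_TIMEOUT_US: u64 = 1_000_000; // 10 missed heartbeats × 100 ms (Section 5.9.2.2)

/// `acks_on`: Some(k) if the reader acks the k-th send (1-based); None if unreachable.
/// Returns the outcome and an upper bound on elapsed time in μs.
fn escalate(acks_on: Option<u32>) -> (Outcome, u64) {
    for attempt in 1..=MAX_ATTEMPTS {
        if acks_on == Some(attempt) {
            return (Outcome::Acked { attempt }, attempt as u64 * ACK_TIMEOUT_US);
        }
        // timeout fires; re-send (retry) or, after the last attempt, escalate
    }
    // Report Suspect at 600 μs; the membership protocol marks the reader Dead
    // after the dead timeout, then the transfer proceeds without it.
    (Outcome::ReaderMarkedDead, DEAD_TIMEOUT_US)
}

fn main() {
    assert_eq!(escalate(Some(1)), (Outcome::Acked { attempt: 1 }, 200));
    assert_eq!(escalate(Some(3)), (Outcome::Acked { attempt: 3 }, 600));
    // Worst case is bounded by the membership dead timeout, never unbounded.
    assert_eq!(escalate(None), (Outcome::ReaderMarkedDead, 1_000_000));
}
```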
Maximum Invalidating state duration: A directory entry remains in the Invalidating
state from step 3b (when the home node sets it) until step 3f (when all invalidation
acks arrive and the state transitions to Exclusive). The maximum duration is bounded
by the invalidation ack timeout escalation above: 200 μs initial timeout, up to 3
retries (600 μs), then escalation to the membership protocol which marks an
unresponsive node Dead after 1000 ms (10 missed heartbeats at 100 ms intervals, per
Section 5.9.2.2). At that point, the home node removes the unresponsive node from the
sharer set and proceeds with the state transition. Therefore, the worst-case
Invalidating duration is bounded by the membership dead timeout (~1000 ms). Any
concurrent read-fault that arrives at the home node during this window will block
on the per-entry wait queue (see read-fault flow step 4) until the state resolves.
The waiters are descheduled during this time, not spinning, allowing the CPU to
process other requests.
Wait queue pre-allocation (Section 3.2 rule): Wait queue heads on the DSM fault path are pre-allocated from a dedicated pool at initialization (per the Section 3.2 rule: no dynamic allocation in fault handlers). The per-entry lazy pointer in `DsmDirectoryEntry` points into this pre-allocated pool, never into the general slab allocator.
Invalidating state blocking mechanism: Each DsmDirectoryEntry includes a
per-entry spinlock (entry_lock: SpinLock) and a kernel wait queue
(wait_queue: AtomicPtr<WaitQueueHead>) for blocking on state transitions. The seqlock
(sequence field) remains the fast-path mechanism for read-side directory lookups,
but the blocking/wakeup path uses a separate spinlock because seqlocks cannot be
held across schedule() (the seqlock write side disables preemption internally,
and sleeping with preemption disabled is a deadlock).
When a request handler encounters the Invalidating state:
1. The handler acquires `entry_lock` (the per-entry spinlock).
2. The handler re-checks the state under the spinlock. If no longer `Invalidating`, it releases the spinlock and proceeds (avoids an unnecessary sleep).
3. The handler calls `prepare_to_wait(&entry.wait_queue, TASK_INTERRUPTIBLE)` — this adds the handler to the wait queue while holding the spinlock, preventing the race where a waker calls `wake_up_all()` between the lock release and the wait-queue addition.
4. The handler releases `entry_lock`.
5. The handler calls `schedule()`, which deschedules the thread. If a concurrent wakeup occurred between steps 3 and 5, `schedule()` returns immediately without sleeping (the standard Linux wait pattern).
6. On wakeup, the handler calls `finish_wait(&entry.wait_queue)` and re-reads the directory entry under the seqlock (optimistic read path) to proceed.
When the ownership transfer completes (step 3f), the invalidating handler:
1. Acquires the seqlock writer (CAS even→odd, which provides writer serialization).
2. Transitions state from Invalidating to Exclusive.
3. Releases the seqlock writer (stores the even value with Release ordering).
4. Acquires entry_lock, calls wake_up_all(&entry.wait_queue), then releases entry_lock.
Wait queue safety invariant: The wait queue is protected by entry_lock (a
per-entry spinlock), NOT by the seqlock. Waiters add themselves to the queue under
entry_lock (step 3), and wakers hold entry_lock while calling wake_up_all().
The seqlock writer and entry_lock are independent — the seqlock serializes
directory entry updates (state transitions), while entry_lock serializes wait
queue operations. Lock ordering: seqlock writer THEN entry_lock (never reversed).
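The wait/wake discipline can be approximated in userspace with a `Mutex` + `Condvar` standing in for `entry_lock` + `prepare_to_wait()`/`wake_up_all()`; the re-check under the lock closes the same lost-wakeup race. This is a sketch, not the kernel implementation.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Userspace analog of the per-entry blocking mechanism. Mutex plays the role
// of entry_lock (and here also guards the state it protects); Condvar plays
// the role of the per-entry wait queue.
#[derive(PartialEq)]
enum PageState { Invalidating, Exclusive }

struct EntryWait {
    state: Mutex<PageState>,
    wait_queue: Condvar,
}

fn wait_until_resolved(e: &EntryWait) {
    let mut st = e.state.lock().unwrap();
    // Re-check under the lock; Condvar::wait atomically releases the lock and
    // sleeps, so a wakeup cannot slip in between (steps 2-5 of the kernel pattern).
    while *st == PageState::Invalidating {
        st = e.wait_queue.wait(st).unwrap();
    }
}

fn main() {
    let e = Arc::new(EntryWait {
        state: Mutex::new(PageState::Invalidating),
        wait_queue: Condvar::new(),
    });
    let waiters: Vec<_> = (0..4).map(|_| {
        let e = Arc::clone(&e);
        thread::spawn(move || wait_until_resolved(&e))
    }).collect();
    thread::sleep(Duration::from_millis(10)); // let waiters block (descheduled, not spinning)
    // Step 3f analog: transition Invalidating -> Exclusive, then wake all waiters.
    *e.state.lock().unwrap() = PageState::Exclusive;
    e.wait_queue.notify_all();
    for w in waiters { w.join().unwrap(); } // all waiters resume
}
```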
This blocking mechanism ensures that the home node can process thousands of
concurrent DSM requests without CPU-starving due to spin-waits. The wait queue
is per-entry (not global), so blocking on one page's invalidation does not
affect other pages. The wait queue head is allocated lazily — the
DsmDirectoryEntry stores only a pointer (wait_queue: *mut WaitQueueHead,
8 bytes) that is null until the first blocking wait occurs. Most entries never
experience contention, so the common case pays only 8 bytes of pointer overhead
per entry rather than embedding a full 16-32 byte wait queue head in every entry.
Thread pool exhaustion prevention: The blocking mechanism above allows
request handler threads to sleep on per-entry wait queues. This creates a
thread pool exhaustion risk: if all threads in the home node's request handler
pool block on Invalidating state wait queues, no thread remains to process
the invalidation ACK completions that would wake them — a deadlock.
UmkaOS prevents this by separating the two paths:
- **Request handler pool** (bounded, per-NUMA-node): Processes incoming DSM directory lookup requests (read-faults, write-fault ownership requests). These threads may block on per-entry wait queues when encountering the `Invalidating` state. Pool size is configurable (default: 2 × cpu_count per NUMA node, minimum 4).
- **RDMA completion pool** (separate, never blocks on directory state): Processes invalidation ACK completions, page transfer confirmations, and membership heartbeat responses. These threads run to completion without blocking on any DSM directory lock or wait queue — their only operations are: (a) update the directory entry state under the seqlock, (b) wake waiters on the per-entry wait queue, (c) send follow-up RDMA messages. Because they never sleep on directory state, they cannot be starved by request handler blocking.
The two pools use separate RDMA completion queues (CQs): request handler
threads poll the request CQ, completion threads poll the ACK CQ. This
hardware-level separation ensures that ACK processing is never queued behind
blocked request handlers. Even if every request handler thread is sleeping
on an Invalidating wait queue, ACK completions are processed by the
completion pool, which transitions the directory entry to Exclusive and
wakes the blocked handlers.
RDMA operation ordering: Multi-step DSM operations (send page data, then update
directory) require explicit acknowledgment to guarantee ordering across different QPs.
IBV_SEND_FENCE only orders operations within a single QP, but page data goes to the
page recipient while the directory update goes to the home node -- these are different
QPs on different remote nodes. The correct protocol uses an explicit ack step:
1. Owner sends page data to requester via RDMA Write (on QP to requester).
2. Requester confirms receipt by sending an RDMA Send ack to the home node.
3. Home node updates the directory via local CAS (no cross-QP ordering needed since
the home node's CPU processes the ack before updating its own local directory).
This adds one extra round-trip (~2 μs) compared to a naive fence-based approach, but
is required for correctness because RDMA fencing cannot provide cross-QP ordering.
Security requirement — Invalidation ACK authentication: Invalidation ACKs
MUST be authenticated to prevent spoofing by malicious nodes. Without authentication,
a malicious node could send forged invalidation ACKs to the home node, causing it to
prematurely transition the directory entry to Exclusive while a stale reader still
holds a mapping — violating cache coherence and potentially leaking data. Authentication
requirements:
- Each invalidation request from the home node (step 3c) MUST include a cryptographically random 128-bit `invalidation_nonce` generated fresh for each invalidation batch.
- The invalidation ACK from each reader node MUST include:
  - The same `invalidation_nonce` (proving the ACK is a response to this specific invalidation request, not a replay)
  - An HMAC-SHA-256 computed over `{nonce || page_va || reader_node_id}` using a session key established during cluster join (Section 5.2.8, derived via X25519 Diffie-Hellman and HKDF-SHA256 from the shared secret)
- The home node MUST verify the HMAC before accepting the ACK. Invalid or missing HMACs MUST be treated as if the ACK was never received (triggering timeout and escalation to the membership protocol).
- To limit overhead, the session key is established once during cluster join and rotated every 24 hours. Key rotation uses the authenticated control channel (Section 5.2.8).
- Session key compromise recovery: If a session key is suspected of compromise, the detecting node initiates emergency key rotation per the protocol below.
Session Key Compromise Recovery Protocol
Session keys authenticate DSM invalidation ACKs (above) and control channel messages (Section 5.2.8). Compromise of a session key allows an attacker to forge ACKs, potentially corrupting directory coherence. This section specifies detection, coordination, and recovery.
Detection — A node suspects session key compromise when it observes repeated HMAC verification failures from a specific peer. The detection state machine per peer:
State machine per (local_node, peer_node) pair:
NORMAL
│ HMAC verification failure from peer
│ → increment fail_counter, record timestamp
│ → if fail_counter >= threshold within window → transition to SUSPECT
▼
SUSPECT
│ Send KEY_ROTATE_URGENT to peer (Ed25519-signed, independent of session key)
│ Start re-key timer (3× measured RTT to peer)
│ → if peer responds with KEY_ROTATE_ACK → transition to REKEYING
│ → if timer expires → transition to EVICTING
▼
REKEYING
│ Both sides perform X25519 DH exchange (authenticated via Ed25519 identity keys)
│ Derive new session key via HKDF-SHA256
│ Enter dual-key acceptance window
│ → if both sides confirm (KEY_ROTATE_CONFIRM) → transition to NORMAL, zeroize old key
│ → if confirm not received within 3× RTT → transition to EVICTING
▼
EVICTING
│ Escalate to membership protocol (Section 5.9)
│ → Peer is marked Dead, all QPs destroyed, all capabilities revoked
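The state machine above can be sketched as a per-peer monitor. The threshold/window values follow the RDMA LAN row of the tuning table below, and the type/method names are illustrative.

```rust
use std::time::{Duration, Instant};

// Sketch of the per-(local, peer) compromise-detection state machine.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PeerState { Normal, Suspect, Rekeying, Evicting }

struct PeerMonitor {
    state: PeerState,
    failures: Vec<Instant>, // timestamps of recent HMAC verification failures
    threshold: usize,       // cluster.hmac_fail_threshold
    window: Duration,       // cluster.hmac_fail_window_ms
}

impl PeerMonitor {
    fn new(threshold: usize, window: Duration) -> Self {
        Self { state: PeerState::Normal, failures: Vec::new(), threshold, window }
    }
    fn on_hmac_failure(&mut self, now: Instant) {
        self.failures.push(now);
        self.failures.retain(|t| now.duration_since(*t) <= self.window);
        if self.state == PeerState::Normal && self.failures.len() >= self.threshold {
            self.state = PeerState::Suspect; // -> send KEY_ROTATE_URGENT (Ed25519-signed)
        }
    }
    fn on_rotate_ack(&mut self) {
        if self.state == PeerState::Suspect { self.state = PeerState::Rekeying; }
    }
    fn on_rotate_confirm(&mut self) {
        if self.state == PeerState::Rekeying {
            self.state = PeerState::Normal; // zeroize old key here
            self.failures.clear();
        }
    }
    fn on_timer_expired(&mut self) { // 3× measured RTT without progress
        if matches!(self.state, PeerState::Suspect | PeerState::Rekeying) {
            self.state = PeerState::Evicting; // escalate to membership protocol
        }
    }
}

fn main() {
    let mut m = PeerMonitor::new(5, Duration::from_secs(60)); // RDMA LAN tuning row
    let now = Instant::now();
    for _ in 0..4 { m.on_hmac_failure(now); }
    assert_eq!(m.state, PeerState::Normal);  // below threshold
    m.on_hmac_failure(now);
    assert_eq!(m.state, PeerState::Suspect); // 5th failure within the window
    m.on_rotate_ack();
    assert_eq!(m.state, PeerState::Rekeying);
    m.on_rotate_confirm();
    assert_eq!(m.state, PeerState::Normal);
    // Unresponsive peer: Suspect -> timer expiry -> Evicting.
    for _ in 0..5 { m.on_hmac_failure(now); }
    m.on_timer_expired();
    assert_eq!(m.state, PeerState::Evicting);
}
```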
Detection threshold tuning — The threshold is configurable via
cluster.hmac_fail_threshold and cluster.hmac_fail_window_ms:
| Network | RTT | Recommended threshold | Window | Rationale |
|---|---|---|---|---|
| RDMA LAN (<10 μs) | ~1-5 μs | 5 failures | 60s | Low loss rate; 5 failures is highly anomalous |
| TCP LAN (<1 ms) | ~0.2-1 ms | 10 failures | 120s | TCP retransmits can cause transient HMAC mismatches if segments arrive out of order during congestion |
| WAN (10-100 ms) | ~20-100 ms | 20 failures | 300s | Higher packet loss, reordering, and latency variance; larger window avoids false positives |
False positives (legitimate HMAC failures misidentified as compromise) are handled by the re-keying protocol itself — re-keying is safe even if triggered unnecessarily. The only cost of a false positive is one X25519+HKDF key derivation (~50 μs CPU time) and ~2 RTTs of latency.
Multi-node coordination — When multiple nodes simultaneously detect HMAC failures from the same peer (e.g., nodes A and B both detect failures from node C), a tie-breaking protocol prevents conflicting concurrent re-key exchanges:
- Each `KEY_ROTATE_URGENT` message includes the sender's `node_id`.
- The receiving node (C) may receive multiple `KEY_ROTATE_URGENT` messages from different peers (A and B) within a short interval.
- Node C processes re-key requests sequentially in `node_id` order: the lowest `node_id` is re-keyed first. Concurrent requests from higher-`node_id` peers are queued (acknowledged with `KEY_ROTATE_QUEUED`) and processed after the current re-key completes.
- If node A and node B are both trying to re-key with C, and A has `node_id` < B's:
  - A↔C re-key proceeds immediately.
  - B receives `KEY_ROTATE_QUEUED` from C and waits for `KEY_ROTATE_READY` from C before starting its own DH exchange.
  - Maximum serialization delay: one re-key duration (~2 RTTs) per queued peer.
- If two nodes try to re-key with each other (A detects failures from B, and B detects failures from A simultaneously): the node with the lower `node_id` is the initiator (it sends the DH ephemeral public key first); the other node becomes the responder. Both detect the symmetric case by receiving `KEY_ROTATE_URGENT` while already in `SUSPECT` state for the same peer.
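The two tie-breaking rules reduce to simple `node_id` comparisons. A minimal sketch (with `NodeId` as a plain integer for illustration):

```rust
type NodeId = u16;

/// Serve concurrent KEY_ROTATE_URGENT requests in ascending node_id order:
/// the lowest node_id re-keys first, the rest are KEY_ROTATE_QUEUED.
fn rekey_service_order(mut requesters: Vec<NodeId>) -> Vec<NodeId> {
    requesters.sort_unstable();
    requesters
}

/// Symmetric case (both peers in SUSPECT for each other): the lower
/// node_id sends its DH ephemeral public key first (initiator role).
fn dh_initiator(a: NodeId, b: NodeId) -> NodeId {
    a.min(b)
}
```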
Interaction with in-flight DSM operations — During re-keying, DSM invalidation and ACK processing continue without interruption:
- Dual-key acceptance window: During the `REKEYING` state (~1–2 RTTs), both the old and new session keys are accepted for incoming HMAC verification. This covers in-flight invalidation ACKs that were signed with the old key before the peer computed the new key.
- Nonce binding prevents confusion: Each invalidation batch has a unique 128-bit nonce. An ACK signed with the old key for batch N is valid only for batch N (the HMAC covers `{nonce || page_va || reader_node_id}`). Even during dual-key acceptance, there is no risk of an attacker replaying an old ACK for a new batch because the nonce will not match.
- Seqlock isolation: Directory state transitions (`Invalidating` → `Exclusive`) are protected by the per-entry seqlock (Section 5.6.3). The seqlock is independent of key state — re-keying does not hold or wait for directory locks, and directory operations do not hold or wait for re-keying locks.
- Key zeroization timing: The old key is zeroized only after both sides confirm the new key (`KEY_ROTATE_CONFIRM`). At that point, all in-flight messages signed with the old key have either been received (within the dual-key window) or timed out. The timeout escalation (Section 5.6.4 retry sequence) handles any ACKs lost during the window.
Recovery guarantee: After a successful re-key, all subsequent HMAC operations use
the new key. If the old key was compromised, the attacker can no longer forge messages.
Directory coherence is preserved throughout because invalidation correctness depends on
nonce uniqueness and seqlock atomicity, not on which session key was used to authenticate
the ACK.
Duplicate ACKs — ACKs received after the directory state has already transitioned to Exclusive (duplicates caused by network reordering) MUST be silently discarded without error: the nonce lookup will fail since the invalidation batch is complete.
Performance impact: HMAC-SHA-256 verification adds ~500 ns to ACK processing on the home node. This is negligible compared to the ~200 μs timeout window. The nonce prevents replay attacks without requiring per-ACK signatures.
Nonce Lifecycle and Replay Prevention:
Each invalidation batch carries a 128-bit cryptographically random nonce
(InvalidationNonce: [u8; 16]). The nonce is generated by the sender using
UmkaOS's CSPRNG (getrandom(2) equivalent).
Retention window: Each receiving node maintains a
NonceWindow: BTreeSet<InvalidationNonce> of nonces seen in the last 30 seconds.
On receiving an ACK:
1. Verify nonce is in the pending-invalidation table (sent but not yet ACKed).
2. Verify nonce is NOT in NonceWindow (replay check).
3. Mark the batch as ACKed.
4. Insert nonce into NonceWindow.
Expiry: NonceWindow entries are purged after 30 seconds (wall clock). This
window is set to 3× the maximum expected network round-trip time (10s RTT bound
in UmkaOS cluster config). Purging runs on the RCU grace period thread, not on the
ACK hot path.
Batch-nonce binding: Each invalidation message includes the batch sequence number AND the nonce. A replayed ACK with an old nonce for a new batch sequence number fails the pending-table lookup (the old nonce is no longer in the pending table). The nonce and sequence number together provide double validation against replay.
Clock skew: Since replay uses wall-clock expiry, nodes must maintain clock synchronization within the replay window. UmkaOS's cluster time sync (Section 5.11.2) guarantees ±1s drift, well within the 30s window.
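The four-step ACK check above can be sketched as follows. This is a simplified model: the pending table and `NonceWindow` are plain in-memory collections, and the 30-second timestamped purge is omitted for brevity:

```rust
use std::collections::{BTreeSet, HashMap};

type Nonce = [u8; 16]; // InvalidationNonce

/// Simplified sketch of the on-ACK nonce validation steps.
struct AckValidator {
    pending: HashMap<Nonce, u64>, // sent-but-unacked: nonce → batch seq
    window: BTreeSet<Nonce>,      // NonceWindow: nonces seen recently
}

impl AckValidator {
    fn new() -> Self {
        Self { pending: HashMap::new(), window: BTreeSet::new() }
    }

    /// Record an invalidation batch at send time.
    fn send_batch(&mut self, nonce: Nonce, seq: u64) {
        self.pending.insert(nonce, seq);
    }

    /// Returns true iff the ACK is fresh and matches a pending batch.
    fn on_ack(&mut self, nonce: &Nonce) -> bool {
        // 1. Nonce must be in the pending-invalidation table.
        if !self.pending.contains_key(nonce) {
            return false; // duplicate or forged ACK — silently discard
        }
        // 2. Nonce must NOT be in NonceWindow (replay check).
        if self.window.contains(nonce) {
            return false;
        }
        // 3. Mark the batch as ACKed.
        self.pending.remove(nonce);
        // 4. Insert nonce into NonceWindow (purged after 30 s in the kernel).
        self.window.insert(*nonce);
        true
    }
}
```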
5.6.5 DSM Coherence Protocol: MOESI
UmkaOS's DSM implements a distributed directory-based MOESI protocol over RDMA. MOESI is chosen over MSI or MESI because the Owned state allows a node to service read requests for dirty data directly — without first writing the data back to the home node's memory. This avoids an extra network round-trip and reduces home-node memory-controller traffic in multi-reader scenarios.
Each 4KB page has a home node (determined by physical address range assignment via the hash function in Section 5.6.3) that maintains the directory entry tracking which nodes hold copies and in which state.
5.6.5.1 Cache Line States
Each DSM page tracked per-node can be in one of five states:
| State | Abbrev | Meaning |
|---|---|---|
| Modified | M | Node has the only valid copy; dirty (not written to home memory). Node must supply data on any incoming request. |
| Owned | O | Node has a dirty copy; other nodes may have Shared copies; home memory is stale. Node must supply data on read requests without first flushing to home. |
| Exclusive | E | Node has the only valid copy; clean (matches home memory). No remote copies exist. |
| Shared | S | Node has a read-only copy; home memory is up-to-date; other nodes may also have S copies. |
| Invalid | I | Node has no valid copy; must fetch from home before any access. |
The Owned state is the key differentiator from MESI: when a node in state M receives
a FwdGetS (another node wants a read copy), it transitions to O rather than writing
back to home first. The dirty data stays with the owner; home memory remains stale, but
the owner supplies all subsequent read requests. Only a PutO eviction or a FwdGetM
(exclusive request) forces a writeback.
5.6.5.2 Directory Entry at the Home Node
The home node maintains a compact directory entry for each shared page. The home node's directory is stored in the radix tree described in Section 5.6.3.
// umka-core/src/dsm/moesi.rs
/// Per-page directory entry maintained at the home node.
/// The home node uses this to track which nodes hold copies and in what state,
/// so it can send the correct forwarding or invalidation messages on a miss.
pub struct DsmDirEntry {
/// State of the page from the home node's perspective.
pub state: DsmHomeState,
/// Nodes that have Shared (S) or Owned (O) copies.
/// Bit N = node N has a copy eligible to supply read data.
/// Maximum MAX_CLUSTER_NODES (64) nodes; u64 bitmask for O(1) set operations.
pub sharers: u64,
/// Node that holds the M or O (or E) copy, if any.
/// `None` means home memory is authoritative (Uncached state).
pub owner: Option<NodeId>,
/// Lock protecting this entry during message processing.
/// Held only for the duration of a single directory update step;
/// never held across a network round-trip or any blocking call.
pub lock: SpinLock<()>,
}
/// State of a page from the home node's directory perspective.
/// Maps to requestor-visible MOESI states as follows:
/// Uncached → home is authoritative (no remote copies)
/// Shared → one or more nodes have S copies; home memory matches
/// Exclusive → one node has E; home memory matches (owner field set)
/// Modified → one node has M or O; home memory is stale (owner field set)
pub enum DsmHomeState {
/// No node has a copy. Home memory is the only valid copy.
/// The next GetS or GetM transitions to Shared or Modified respectively.
Uncached,
/// One or more nodes have read-only (S) copies.
/// Home memory is up-to-date. `sharers` bitmask is non-zero; `owner` is None.
Shared,
/// Exactly one node has an exclusive clean copy.
/// Home memory matches. `owner` is set; `sharers` is 0.
Exclusive,
/// One node has a Modified (M) or Owned (O) copy; home memory is stale.
/// `owner` is set; `sharers` may be non-zero (non-zero = owner is in O state,
/// supplying data to sharers; zero = owner is in M state, sole copy).
Modified,
}
Relationship to `DsmPageState`: The `DsmHomeState` is the home node's directory view; `DsmPageState` (Section 5.6.2) is the per-node local view held in the requestor's page metadata. The two views are consistent: when the home directory says `Modified` with `owner = Node B`, Node B's local `DsmPageState` is `Modified` (M) or `Owned` (O), depending on whether the `sharers` field is empty or not.
5.6.5.3 State Transitions — Requestor's View
The following table describes the complete MOESI transition function. "Home Actions" are
performed atomically under the DsmDirEntry::lock; messages are sent after releasing the
lock to avoid holding it across network operations.
| From | Event | Message to Home | Home Actions | Message(s) from Home | To |
|---|---|---|---|---|---|
| I | Read miss | `GetS(page, requester)` | Uncached: send data from home memory; add requester to sharers; → Shared. Shared: send data; add requester; remain Shared. Exclusive: forward `FwdGetS` to owner (who will downgrade to O); add requester; → Modified (owner = prior E holder, now O). Modified: forward `FwdGetS` to owner; add requester to sharers; remain Modified. | `DataResp(page, data, ack_count=0)` (home-sourced) or `DataFwd(page, data)` (owner-sourced after `FwdGetS`) | S |
| I | Write miss | `GetM(page, requester)` | Uncached: send data from home memory; → Modified (owner = requester, sharers = ∅). Shared: send `Inv` to all sharers and `DataResp` with ack_count = N simultaneously; → Modified (requester collects the N `InvAck`s). Exclusive: forward `FwdGetM` to owner; → Modified (owner = requester). Modified: forward `FwdGetM` to owner; → Modified (owner = requester). | `DataResp(data, ack_count=N)` + N × `Inv` to sharers (requester collects N `InvAck`s before entering M) | M (after N `InvAck`s) |
| S | Write upgrade | `Upgrade(page, requester)` | Send `Inv` to all other sharers and `AckCount(N−1)` to the requester simultaneously; → Modified (owner = requester, sharers = ∅). No data transfer — requester already has a clean copy. | `AckCount(N−1)` + (N−1) × `Inv` to other sharers (requester collects the N−1 `InvAck`s) | M (after N−1 `InvAck`s) |
| M | Eviction (dirty) | `PutM(page, data)` | Write data to home memory; → Uncached (owner = None). | `PutAck(page)` | I |
| O | Eviction (owned) | `PutO(page, data)` | Write data to home memory; remove evicting node from sharers if present; → Shared if other sharers remain (home memory is now current), else Uncached (owner = None). | `PutAck(page)` | I |
| E | Eviction (clean) | `PutE(page)` | Set state → Uncached (owner = None). No data transfer needed — home memory is already current. | `PutAck(page)` | I |
| S | Eviction (read copy) | `PutS(page)` | Remove requester from sharers bitmask. If sharers becomes 0, → Uncached. | `PutAck(page)` | I |
| M or O | `FwdGetS` received (from home, on behalf of a new reader) | — (response to home forward) | Home has already updated directory: new reader added to sharers; state remains Modified. | Send `DataFwd(page, data)` to new reader node; self → O (if was M, now dirty with sharers). | O |
| M or O | `FwdGetM` received (from home, on behalf of an exclusive requester) | — (response to home forward) | Home has already updated owner field to requester. | Send `DataFwd(page, data)` to requester; self → I (invalidate local copy). | I |
| O | Implicit writeback on `FwdGetM` | — | Home state → Modified (new owner = requester; sharers cleared). | — | I |
Notes on `ack_count`: When the home node receives a `GetM` while N sharers hold the page (or an `Upgrade` with N−1 other sharers), it sends `Inv` to each sharer simultaneously with the `DataResp` to the requester. The requester receives the `DataResp` containing `ack_count = N`, then waits for exactly N (or N−1) `InvAck` messages from those sharers before installing the write-capable mapping. This lets the home node pipeline the invalidations without waiting for the `InvAck`s itself, reducing write-miss latency.
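The requester-side bookkeeping is a simple counter against `ack_count`. A minimal sketch (type and method names are illustrative, not the actual UmkaOS implementation):

```rust
/// Per-write-miss state on the requester: wait for ack_count InvAcks
/// before installing the write-capable (M-state) mapping.
struct WriteMiss {
    acks_needed: u32,
    acks_seen: u32,
}

impl WriteMiss {
    /// Called when DataResp { ack_count, .. } arrives from the home node.
    fn on_data_resp(ack_count: u32) -> Self {
        Self { acks_needed: ack_count, acks_seen: 0 }
    }

    /// Called per InvAck from a former sharer. Returns true once state M
    /// may be entered (the writable mapping can be installed).
    fn on_inv_ack(&mut self) -> bool {
        self.acks_seen += 1;
        self.acks_seen >= self.acks_needed
    }

    /// ack_count == 0 (read miss, or write miss with no sharers):
    /// M can be entered immediately.
    fn can_enter_m(&self) -> bool {
        self.acks_seen >= self.acks_needed
    }
}
```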
5.6.5.4 Message Types
All messages are exchanged over the RDMA transport (Section 5.4):
- Data messages (DataResp, DataFwd, PutM, PutO): use RDMA one-sided Write
for the 4KB payload to avoid receiver CPU involvement in data movement.
- Control messages (everything else): use RDMA two-sided Send/Receive.
// umka-core/src/dsm/moesi.rs
/// Messages exchanged between nodes in the MOESI DSM protocol.
///
/// Naming convention:
/// Requestor → Home: Get*, Put*, Upgrade
/// Home → Requestor: DataResp, AckCount, PutAck, Nack
/// Home → Sharer: FwdGetS, FwdGetM, Inv
/// Sharer → Requestor: InvAck
/// Owner → Requestor: DataFwd
pub enum DsmMsg {
// ── Requestor → Home ─────────────────────────────────────────────────────
/// Read request: requester wants a Shared (read-only) copy.
GetS { page: PageAddr, requester: NodeId },
/// Write request: requester wants a Modified (exclusive writable) copy.
GetM { page: PageAddr, requester: NodeId },
/// Upgrade: requester already has S and wants to upgrade to M.
/// Avoids retransmitting the data (home sends only invalidations + AckCount).
Upgrade { page: PageAddr, requester: NodeId },
/// Eviction of a Modified (dirty) copy. Carries the written-back data.
/// Sent via RDMA one-sided Write for the data payload; the control
/// header is a two-sided Send.
PutM { page: PageAddr, data: PageData },
/// Eviction of an Owned copy. Carries the written-back data.
PutO { page: PageAddr, data: PageData },
/// Eviction of an Exclusive (clean) copy. No data transfer needed.
PutE { page: PageAddr },
/// Silent eviction of a Shared read-only copy. No data transfer.
PutS { page: PageAddr },
// ── Home → Requestor ─────────────────────────────────────────────────────
/// Data response from home memory (for Uncached/Shared states).
/// `ack_count`: number of `InvAck` messages the requester must collect
/// before entering state M (zero for read misses transitioning to S).
DataResp { page: PageAddr, data: PageData, ack_count: u32 },
/// Acknowledgment count only (for Upgrade: no data resent).
AckCount { page: PageAddr, ack_count: u32 },
/// Acknowledgment that a Put* eviction was received and home is updated.
PutAck { page: PageAddr },
/// Transient conflict: requester must retry after backoff.
/// Home returns Nack instead of blocking when the directory entry is
/// momentarily locked (e.g., a concurrent ownership transfer is in progress).
/// Requester retries with exponential backoff: 1 μs, 2 μs, 4 μs, …, max 1 ms.
Nack { page: PageAddr, reason: NackReason },
// ── Home → Sharer / Owner (forwarded requests) ────────────────────────────
/// Forward a read request to the current owner (M or O state).
/// Owner must send `DataFwd` to `requester` and transition to O.
FwdGetS { page: PageAddr, requester: NodeId },
/// Forward an exclusive request to the current owner (M or O state).
/// Owner must send `DataFwd` to `requester` and transition to I.
FwdGetM { page: PageAddr, requester: NodeId },
/// Invalidation request to a sharer (S state node).
/// Sharer must unmap the page, flush TLBs, and reply with `InvAck`.
Inv { page: PageAddr, requester: NodeId },
// ── Sharer → Requestor ───────────────────────────────────────────────────
/// Acknowledgment of invalidation. Sent directly to the requester (not home)
/// to allow the requester to pipeline M-state entry without home round-trip.
InvAck { page: PageAddr },
// ── Owner → Requestor (data forwarding) ──────────────────────────────────
/// Forwarded data from the current owner (M or O state) to a new requester.
/// Sent via RDMA one-sided Write for the 4KB payload.
DataFwd { page: PageAddr, data: PageData },
}
/// Physical page address — identifies a page globally across the cluster.
/// Value is the physical address of the first byte of the page, 4KB-aligned.
/// Globally unique because DSM regions use disjoint physical address ranges
/// assigned at region creation time ([Section 5.6.7](#567-dsm-region-management)).
pub struct PageAddr(u64);
/// One 4KB DSM page of data, matching CPU page size.
pub struct PageData([u8; 4096]);
/// Reason codes for transient Nack responses.
pub enum NackReason {
/// Directory entry is temporarily locked by a concurrent operation. Retry.
Busy,
/// Home node is in the middle of a directory rehash. Retry after redirect.
Transient,
}
5.6.5.5 Deadlock Avoidance
The MOESI protocol is deadlock-free by construction via four invariants:
1. No blocking while holding a directory lock. The home node acquires `DsmDirEntry::lock`, updates the entry, releases the lock, and only then sends forwarding or invalidation messages. It never blocks waiting for remote `InvAck`s or `DataFwd` completions while holding the lock. This ensures the home node's lock is always available to process incoming messages.
2. NACK instead of blocking for transient conflicts. When the directory entry is locked by a concurrent operation, the home node returns `Nack` (with `NackReason::Busy`) rather than queuing the request. Requestors retry with exponential backoff (1 μs, 2 μs, 4 μs, …, capped at 1 ms). This bounds the retry interval, preventing priority inversion and livelock.
3. Separate request and response channels. RDMA QP pairs are partitioned into a request channel and a response channel per peer. Response messages (`InvAck`, `DataFwd`, `PutAck`) never block behind request messages (`GetS`, `GetM`, `Inv`). This prevents the classic deadlock where a node cannot process an incoming `FwdGetS` because it is blocked trying to send its own `GetM`.
4. Owner always responds before issuing new requests. A node that receives `FwdGetS` or `FwdGetM` must send `DataFwd` (and `InvAck` if transitioning to I) before it may issue any new `GetS`, `GetM`, or `Upgrade` requests. This prevents cyclic wait: A waiting on B's `DataFwd` while B is waiting on A's `DataFwd`.
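The backoff schedule from the NACK invariant (1 μs doubling to a 1 ms cap) can be sketched as a pure function. This is an illustrative sketch only; a production kernel would likely add jitter:

```rust
const BACKOFF_START_US: u64 = 1;     // first retry delay: 1 μs
const BACKOFF_CAP_US: u64 = 1_000;   // cap: 1 ms

/// Delay in μs before the `attempt`-th retry after a Nack
/// (attempt 0 = first retry). Doubles each attempt, capped at 1 ms.
fn nack_backoff_us(attempt: u32) -> u64 {
    BACKOFF_START_US
        .checked_shl(attempt)        // None on shift overflow → treat as cap
        .unwrap_or(BACKOFF_CAP_US)
        .min(BACKOFF_CAP_US)
}
```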
5.6.5.6 Performance Characteristics
| Operation | Network Latency | RDMA Operations |
|---|---|---|
| Read hit (M, O, E, or S state) | 0 | 0 |
| Read miss — I → S, home-sourced (Uncached or Shared) | 2× RTT | 2× two-sided Send |
| Read miss — I → S, owner-forwarded (Modified/Exclusive owner) | 3× RTT | 3× two-sided Send + 1× one-sided Write |
| Write miss — I → M, no sharers (Uncached or Exclusive) | 2× RTT | 2× two-sided Send |
| Write miss — I → M, N sharers | 2× RTT + max(InvAck RTTs) | 2× Send + N× Inv/InvAck |
| Upgrade — S → M, N−1 other sharers | 1× RTT + max(InvAck RTTs) | 1× Send + (N−1)× Inv/InvAck |
| Eviction M → I (dirty writeback) | 1× RTT | 1× two-sided Send + 1× one-sided Write |
| Eviction O → I (owned writeback) | 1× RTT | 1× two-sided Send + 1× one-sided Write |
| Eviction E → I (clean eviction) | 1× RTT | 1× two-sided Send |
| Eviction S → I (read-copy eviction) | 1× RTT | 1× two-sided Send |
RTT here is the RDMA network round-trip time (~2–5 μs for a local rack, per Section 5.4.4).
RDMA one-sided Write is used for DataResp, DataFwd, PutM, and PutO (4KB payload)
because it avoids receiver CPU involvement in the data path. Two-sided Send is used for
all control messages (≤64 bytes, sent as RDMA inline) because they require the receiver's
CPU to process the directory update or mapping change.
The Owned (O) state provides a concrete advantage for multi-reader scenarios: when the
first reader (FwdGetS) causes the owner to transition M → O, subsequent readers
(FwdGetS again) are served directly by the O-state node without any home-node memory
write. In a scenario with 1 writer followed by K readers, MOESI requires 1 + K network
round-trips; MESI (which requires a writeback before sharing) would require 2 + K.
5.6.6 Extending PageLocationTracker
The PageLocation enum (Section 21.2.1.5) includes RemoteNode, RemoteDevice, and
CxlPool variants for distributed memory tracking. The following shows the distributed
variants and their semantics:
// Distributed variants of PageLocation (defined canonically in Section 21.2.1.5)
pub enum PageLocation {
// ... existing variants (CpuNode, DeviceLocal, Migrating, etc.) ...
/// Page is in CPU memory on this NUMA node (existing).
CpuNode(u8),
/// Page is in accelerator device-local memory (existing).
DeviceLocal {
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is being transferred (migration in progress).
/// Consistent with DsmPageState::Migrating = 4 (the canonical definition above).
/// **Canonical definition**: Section 21.2.1.5 in 21-accelerators.md defines the
/// Migrating variant with a side-table index (`migration_id: u32`) to keep
/// `PageLocation` at 24 bytes. The `MigrationRecord` side table (defined in
/// Section 21.2.1.5, stored in `PageLocationTracker::active_migrations`) holds
/// the full source/target details (source_kind, source_node, source_device,
/// source_addr, target_kind, target_node, target_device, target_addr).
Migrating {
migration_id: u32,
},
/// Page is not yet allocated (existing).
NotPresent,
/// Page is in compressed pool (existing).
Compressed,
/// Page is in swap (existing).
Swapped,
// === New: distributed memory locations ===
/// Page is on a remote node's CPU memory, accessible via RDMA.
RemoteNode {
node_id: NodeId,
remote_phys_addr: u64,
dsm_state: DsmPageState,
},
/// Page is on a remote node's accelerator memory (GPUDirect RDMA).
RemoteDevice {
node_id: NodeId,
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is in CXL-attached memory pool (hardware-coherent).
CxlPool {
pool_id: u32,
pool_offset: u64,
},
}
Security requirement — CXL pool bounds validation: The kernel MUST validate
pool_id and pool_offset before using them to access memory. An out-of-bounds
pool_id or pool_offset could allow unauthorized access to memory outside the
intended pool, potentially exposing kernel data or allowing privilege escalation.
Validation requirements:
- `pool_id` MUST be validated against the global CXL pool registry (`cxl_pool_count`). Accessing a non-existent pool MUST fail with `EINVAL`.
- `pool_offset` MUST be validated against the target pool's size (`cxl_pools[pool_id].size_bytes`). Accesses beyond the pool boundary MUST fail with `EFAULT`.
- For RDMA-initiated CXL pool access, the remote node's capability (Section 5.8) MUST authorize the specific `pool_id`. A capability granting access to pool 0 MUST NOT be usable to access pool 1.
- CXL pool resize is grow-only — pools can be expanded but never shrunk. This eliminates the TOCTOU race where a pool shrinks between the bounds check and the CPU load/store instruction (seqlocks cannot prevent this race because CPU memory accesses are not rollback-capable). The pool's `size_bytes` is read with `Acquire` ordering; a concurrent grow only increases the valid range, so a stale (smaller) size produces a conservative bounds check, never an out-of-bounds access. Pool deallocation (destroying a pool entirely) requires quiescing all accessors first via the standard RCU grace period mechanism.
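The two bounds checks and the `Acquire` load can be sketched as follows. `CxlPool`, `CxlError`, and `validate_cxl_access` are illustrative names, not the actual UmkaOS API; the grow-only invariant is what makes a stale size safe:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct CxlPool {
    size_bytes: AtomicU64, // grow-only: only ever increases
}

#[derive(Debug, PartialEq)]
enum CxlError {
    Einval, // non-existent pool_id
    Efault, // access beyond the pool boundary
}

/// Validate a CXL pool access of `len` bytes at `pool_offset`.
fn validate_cxl_access(
    pools: &[CxlPool], // stand-in for the global cxl_pools registry
    pool_id: u32,
    pool_offset: u64,
    len: u64,
) -> Result<(), CxlError> {
    // pool_id must name an existing pool (registry bound = cxl_pool_count).
    let pool = pools.get(pool_id as usize).ok_or(CxlError::Einval)?;
    // Acquire load: a concurrent grow may be observed late, but a stale
    // (smaller) size only makes this check more conservative.
    let size = pool.size_bytes.load(Ordering::Acquire);
    let end = pool_offset.checked_add(len).ok_or(CxlError::Efault)?;
    if end > size {
        return Err(CxlError::Efault);
    }
    Ok(())
}
```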
5.6.7 DSM Region Management
Distributed shared memory is opt-in. Processes create DSM regions explicitly:
/// Create a distributed shared memory region.
/// All nodes participating in this region can map it.
pub struct DsmRegionCreate {
/// Unique region identifier (cluster-wide).
pub region_id: u64,
/// Virtual address base for this region (must be page-aligned).
/// This is the starting virtual address at which the region will be mapped
/// on all participating nodes. Must not overlap existing mappings.
pub base_addr: u64,
/// Size of the shared region (bytes, page-aligned).
pub size: u64,
/// Page size for this region.
pub page_size: DsmPageSize,
/// Access permissions for this region (read / write / execute bitfield).
/// DSM_PROT_READ = 0x1, DSM_PROT_WRITE = 0x2, DSM_PROT_EXEC = 0x4.
/// Applied uniformly to all mappings on all participating nodes.
pub permissions: u32,
/// Consistency model.
pub consistency: DsmConsistency,
/// Initial placement: which node holds the pages initially.
pub initial_owner: NodeId,
/// Home node assignment policy for directory entries.
/// DSM_HOME_HASH = 0: home_node = hash(region_id, VA) % cluster_size (default).
/// DSM_HOME_FIXED = 1: all directory entries homed on initial_owner (simple, but
/// creates a single hot-spot; use only for small, rarely-accessed regions).
pub home_policy: u32,
/// Access control: which nodes can join this region.
pub allowed_nodes: u64, // Bitfield, limited to MAX_CLUSTER_NODES (64)
/// Capability required to map this region.
pub required_cap: CapHandle,
/// Behavior flags (bitfield).
/// DSM_EAGER_WRITEBACK = 0x1: flush dirty pages to owner on lock release
/// (Section 5.6.10). Default: lazy writeback on eviction only.
/// DSM_REPLICATE = 0x2: enable directory replication for fault tolerance
/// (Section 5.9.2.4). Default: single home node per entry.
pub flags: u32,
/// Reserved for future extensions; must be zero.
pub _pad: [u8; 4],
}
#[repr(u32)]
pub enum DsmPageSize {
/// Standard 4KB pages. Best for random access patterns.
Page4K = 0,
/// 2MB huge pages. Best for sequential / bulk access.
/// Reduces TLB misses but increases transfer granularity.
HugePage2M = 1,
}
#[repr(u32)]
pub enum DsmConsistency {
/// Release consistency: writes become visible to other nodes
/// after an explicit release (memory barrier / unlock).
/// Best performance. Standard for most HPC/ML workloads.
Release = 0,
/// Sequential consistency is NOT supported. Providing a total order over all
/// memory operations across nodes would require serializing every write through
/// a single ordering point (or using a distributed total-order broadcast), adding
/// ~5-15 μs per write operation even on RDMA. This overhead would negate the
/// performance benefits of DSM for virtually all workloads. Applications requiring
/// sequential consistency should use explicit distributed locking (DLM, [Section
/// 11.6](14-storage.md#146-distributed-lock-manager)) or message-passing instead of DSM.
///
/// Reserved for potential future use if hardware (e.g., CXL 3.0 with hardware
/// coherence) provides efficient total ordering.
SequentialReserved = 1,
/// Eventual consistency: writes propagate asynchronously.
/// Lowest overhead. Suitable for read-heavy, stale-tolerant data.
///
/// **Propagation mechanism** (Phase 3 deliverable — high-level design):
/// - Writes are applied locally and enqueued in a per-region **propagation log**
/// (circular buffer, one per DSM region per node).
/// - A background propagation thread sends log entries to the home node via
/// RDMA Send (two-sided, for ordering). The home node applies them and
/// pushes updates to other readers via the invalidation protocol.
/// - **Staleness bound**: configurable per-region (default: unbounded).
/// With a bound of T ms, the propagation thread ensures all entries
/// are sent within T ms of the write. A bound of 0 falls back to Release.
/// - **Anti-entropy**: On region join or after network partition heal,
/// nodes exchange version vectors and replay missing updates.
/// The full anti-entropy protocol specification is deferred to Phase 3.
Eventual = 2,
/// Synchronous (strong) consistency: write completes only after all replicas
/// acknowledge. Provides per-page linearizability — reads always serve from the
/// local replica (all replicas are up to date by construction). Write cost:
/// ~3–5 μs RDMA round-trip. Use case: shared metadata, distributed lock tables.
/// NOT sequential consistency — there is no total order across pages or regions.
/// (Phase 3 deliverable.)
Synchronous = 3,
/// Causal consistency: if a thread writes X then writes Y, any reader that
/// observes Y must also observe X. Enforced via per-region vector clocks.
/// Write cost: 1 RDMA RTT + vector clock update (~2–4 μs).
/// Use case: distributed queues, logs, pub/sub.
/// (Phase 3 deliverable.)
Causal = 4,
}
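The `DSM_HOME_HASH` policy in `DsmRegionCreate` above assigns each page a home node via `hash(region_id, VA) % cluster_size`. A minimal illustrative sketch — the actual hash function is defined in Section 5.6.3; FNV-1a here is an arbitrary stand-in chosen only so the example runs:

```rust
/// Illustrative DSM_HOME_HASH mapping: home_node = hash(region_id, page_va)
/// % cluster_size. FNV-1a is a placeholder for the real Section 5.6.3 hash.
fn home_node(region_id: u64, page_va: u64, cluster_size: u64) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    for b in region_id
        .to_le_bytes()
        .iter()
        .chain(page_va.to_le_bytes().iter())
    {
        h ^= *b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV-1a prime
    }
    h % cluster_size
}
```

The same `(region_id, VA)` pair always maps to the same home node, so any node can locate a page's directory entry without coordination; `DSM_HOME_FIXED` trades this load spreading for simplicity.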
5.6.8 DSM Region Destruction Protocol
Destroying a DSM region requires coordinated cleanup across all participating nodes:
DSM region destruction for region R:
1. Initiator (any node with the region's destroy capability) sends
REGION_DESTROY(region_id) to all nodes in allowed_nodes bitmask.
2. Each participating node:
a. Unmaps all local pages belonging to region R from process page tables.
b. Flushes TLB entries for region R's address range.
c. For pages where this node is the owner: marks pages as reclaimable
(returned to the local physical page allocator).
d. For pages where this node is a reader: discards the local copy
(no writeback needed — reader copies are clean by the single-writer
invariant).
e. Sends REGION_DESTROY_ACK(region_id, node_id) to the initiator.
3. Initiator collects ACKs from all participating nodes.
Timeout: 5 seconds. Nodes that do not ACK within timeout are assumed
dead — their pages are abandoned (will be reclaimed when those nodes
eventually rejoin or are declared dead via heartbeat).
4. After all ACKs received (or timeout):
a. Initiator sends REGION_DIRECTORY_CLEANUP(region_id) to all nodes
that serve as home nodes for pages in this region.
b. Each home node removes all DsmDirectoryEntry records for region R
from its directory. The backup home node (Section 5.9.2.3) is also
notified to remove shadow entries.
c. The region_id is retired and cannot be reused for 24 hours
(prevents stale references from delayed messages).
5. Region metadata is removed from /sys/kernel/umka/cluster/dsm/regions.
Error handling:
- If a process still has region R mapped when destruction is initiated,
the mapping is force-unmapped and the process receives SIGBUS on
subsequent access attempts.
- If the initiator crashes during destruction, any node can resume
the protocol by re-sending REGION_DESTROY (idempotent — nodes that
already completed destruction simply re-ACK).
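Step 3 of the protocol — collecting `REGION_DESTROY_ACK`s against the `allowed_nodes` bitmask — reduces to bitmask bookkeeping. A sketch with hypothetical names (`DestroyTracker` is not the actual UmkaOS type); the 5-second timer itself is elided:

```rust
/// Initiator-side ACK tracking for DSM region destruction (step 3).
struct DestroyTracker {
    expected: u64, // allowed_nodes bitmask: bit N = node N participates
    acked: u64,    // bitmask of nodes that have sent REGION_DESTROY_ACK
}

impl DestroyTracker {
    fn new(allowed_nodes: u64) -> Self {
        Self { expected: allowed_nodes, acked: 0 }
    }

    /// Record REGION_DESTROY_ACK from `node_id`; returns true when all
    /// participating nodes have ACKed (proceed to directory cleanup).
    fn on_ack(&mut self, node_id: u32) -> bool {
        self.acked |= 1u64 << node_id;
        self.complete()
    }

    fn complete(&self) -> bool {
        self.acked & self.expected == self.expected
    }

    /// On the 5 s timeout: nodes that never ACKed are presumed dead;
    /// their pages are abandoned until they rejoin or are declared dead.
    fn missing(&self) -> u64 {
        self.expected & !self.acked
    }
}
```

Because the acked set is a bitmask, resuming after an initiator crash is naturally idempotent: re-ACKs just re-set bits that are already set.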
5.6.9 Linux Compatibility Interface
DSM is exposed via standard POSIX shared memory with extensions:
Standard POSIX (works unmodified):
fd = shm_open("/my_region", O_RDWR | O_CREAT, 0666);
ftruncate(fd, size);
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
→ On a single node, this is standard shared memory. No DSM.
UmkaOS-specific extension (opt-in):
fd = shm_open("/my_region", O_RDWR | O_CREAT, 0666);
ioctl(fd, UMKA_SHM_MAKE_DISTRIBUTED, &dsm_config);
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
→ Same mmap'd pointer, but pages can now migrate across nodes.
→ Process on Node B can shm_open("/my_region") and map the same region.
For MPI applications: MPI implementations (OpenMPI, MPICH) can use DSM
regions for intra-communicator shared memory windows, replacing the
current combination of mmap + RDMA + application-level coherence with
kernel-managed coherence.
5.6.10 False Sharing Mitigation
When multiple nodes write to different offsets within the same page, the page bounces between owners ("false sharing"), degrading performance severely.
Detection: The home node tracks per-page ownership transfer frequency. If a page
has more than N ownership transfers per second (default: 100), it is flagged as
contended. The contention count is exposed via the umka_tp_stable_dsm_contention
tracepoint.
Mitigation 1: Advisory — Log contended pages to a tracepoint, allowing the application to restructure its data layout to avoid cross-node false sharing. This is the lowest-overhead option.
Mitigation 2: Whole-page writeback on release — On a write fault to a contended page under the single-writer model (Section 5.6.2), the owning node sends the entire 4KB page to the home node at the next release point (memory barrier or unlock). This is simpler than TreadMarks-style Twin/Diff because the single-writer invariant guarantees only one node modifies the page at a time — there are no concurrent writers whose changes need merging. The home node distributes the updated page to readers on their next fault. Overhead: one 4KB RDMA Write per release (~1-2 μs). This is acceptable because contention mitigation already implies the page is frequently transferred. Note: Twin/Diff (creating a copy before writing, then diffing to find changed bytes) would only be beneficial under a relaxed multiple-writer consistency model. UmkaOS's DSM uses single-writer, making whole-page transfer the correct and simpler approach.
Mitigation 3: Sub-page coherence for huge pages — For 2MB huge pages that exhibit contention, the kernel falls back to 4KB sub-page coherence for the contended region. The huge page is logically split into 512 sub-pages, each tracked independently in the DSM directory. The physical huge page is preserved (no TLB cost).
Default: Detection + advisory tracing. Eager writeback on release is opt-in per
DSM region via DsmRegionCreate.flags |= DSM_EAGER_WRITEBACK.
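The detection rule reduces to a rate check, sketched here (the window bookkeeping and names are assumptions; the spec fixes only the default threshold of 100 ownership transfers per second):

```rust
/// Returns true if a page's ownership-transfer rate exceeds the
/// contention threshold (default: 100 transfers/second). The home
/// node evaluates this over a sliding observation window.
fn is_contended(transfers_in_window: u64, window_ms: u64, threshold_per_sec: u64) -> bool {
    if window_ms == 0 {
        return false; // no observation yet
    }
    // Scale the windowed count to a per-second rate before comparing.
    transfers_in_window * 1000 / window_ms > threshold_per_sec
}
```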
5.6.11 Error Handling
Concurrent ownership requests: When two nodes simultaneously request exclusive ownership of the same page, the home node serializes the requests. The first request is processed normally; the second requester is queued and notified when ownership becomes available.
Stale directory: If a node presents an ownership claim with a version counter that doesn't match the directory, the home node detects the stale state and rejects the operation. The requestor retries by re-fetching the directory entry.
Partial transfer: If a page is sent via RDMA Write but the directory update (RDMA Send to home node) fails (e.g., due to a concurrent modification or lost message), the home node detects the version mismatch on the next access and repairs the directory. The transferred page is either adopted (if the transfer was valid) or discarded. Directory updates use two-sided RDMA Send because the DsmDirectoryEntry is 64 bytes (field-by-field: sequence(8) + owner(4) + _pad(4) + readers(8) + state(1) + _pad(7) + version(8) + rehash_epoch(4) + _pad(4) + wait_queue(8) + entry_lock(4) + _pad(4) = 64 bytes; too large for 8-byte RDMA Atomic CAS). The home node's CPU processes the update request locally using the seqlock protocol (Section 5.6.4).
Deadlock: If Node A waits for ownership of a page held by Node B, while Node B
waits for a page held by Node A, a deadlock occurs. DSM ownership deadlocks are
resolved via timeout on ownership requests (default: 10ms). On timeout, the requesting
operation is aborted and returns EAGAIN. The application retries with backoff. This
is a deadlock recovery mechanism (timeout-based), not true deadlock detection
(which requires a wait-for graph). True deadlock detection with a distributed wait-for
graph is implemented in the DLM (Section 14.6.9 in 14-storage.md), where lock
dependencies are tracked explicitly. DSM uses the simpler timeout approach because page
ownership requests are transient (microsecond-scale) and building a wait-for graph for
every page fault would add unacceptable overhead to the critical path.
5.6.12 Honest Performance Expectations
DSM provides a programming convenience (shared address space), NOT transparent performance parity with local memory. Expectations must be calibrated:
What DSM IS good for:
- Read-mostly workloads with occasional writes (replicated data).
- Coarse-grained partitioned workloads where each node mostly touches
its own partition, with infrequent cross-node access.
- Replacing explicit message-passing in applications where shared memory
is a natural fit and page-level granularity matches access patterns.
What DSM is NOT good for:
- Fine-grained shared data structures (concurrent hash maps, work-stealing
queues). These cause page-level false sharing and ownership bouncing.
Use explicit RDMA messaging or partitioned data structures instead.
- Workloads with random write patterns across the shared space.
Every cross-node write fault costs ~5-50μs (RDMA round-trip + TLB
invalidation), vs ~100ns for local memory. This is 50-500x slower.
- Latency-sensitive paths. DSM page faults are unpredictable.
Performance model (InfiniBand 200Gb/s, ~2μs RTT):
Local page access (TLB hit): ~1 ns
Local page fault (mmap, disk): ~1-100 μs
DSM read fault (page not present): ~10-18 μs (directory lookup ~4-5 μs
+ ownership negotiation ~3-5 μs
+ page transfer ~3-5 μs
+ local install ~1 μs;
see Section 5.6.4 for detailed flow)
DSM write fault (exclusive ownership): ~10-50 μs (invalidate readers + transfer)
DSM false sharing (bouncing page): ~100+ μs per iteration (pathological)
The mitigations in Section 5.6.10 help, but cannot eliminate the fundamental cost of network coherence. Applications must be DSM-aware at the data structure level. The kernel's job is to make the common case fast and the pathological case detectable (tracepoints), not to pretend the network is as fast as local memory.
5.6.13 Interaction with Memory Compression (Section 4.2)
The DSM protocol and the memory compression tier (Section 4.2) have a potential conflict: when a page is compressed in the local zpool, should it be advertised as "present" or "not present" to the DSM directory?
Design decision: Compressed pages are treated as locally present in the DSM protocol. The decompression cost (~1-3 microseconds via LZ4) is far lower than a network fetch (~5-50 microseconds over RDMA), so it never makes sense to fetch a remote copy of a page that exists locally in compressed form.
Interaction rules:
- Remote node requests a locally compressed page: The owning node decompresses from its zpool, then transfers the uncompressed data via RDMA Write. The zpool entry is freed after transfer.
- DSM migrates a page to a remote node: The sending node decompresses first, then sends the uncompressed 4KB page. The receiving node may independently compress it based on local memory pressure.
- Compressed page metadata is NOT replicated across nodes. Compression is a node-local optimization, invisible to the DSM directory and coherence protocol.
- DSM coherence protocol uses only the {CpuNode, RemoteNode, NotPresent, Migrating} variants of PageLocation (Section 21.2.1.5) for coherence decisions. A "Compressed" DSM state is NOT added — a compressed page is simply "local on CpuNode(N)" from the DSM protocol's perspective. The full PageLocationTracker tracks all variants (including DeviceLocal, Compressed, Swapped, RemoteDevice, CxlPool), but compression is transparent to the DSM directory.
Edge case — double fault path: If Node B requests a page from Node A, and Node A has that page compressed in its zpool: (1) Node A receives the RDMA request, (2) local MM subsystem decompresses from zpool (~1-2 microseconds), (3) DSM handler completes the RDMA Write with the decompressed data. The requesting node never knows the page was compressed. Total added latency: ~1-2 microseconds.
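The rule in this section — never fetch remotely what exists locally in compressed form — can be expressed as a small decision function (a sketch; the names and enum are illustrative, since the spec states the rule in prose):

```rust
#[derive(Debug, PartialEq)]
enum FetchPlan { AlreadyResident, DecompressLocal, FetchRemote }

/// Where to satisfy a DSM fault: a page present in the local zpool is
/// always served by decompression (~1-3 us LZ4), never by a remote
/// fetch (~5-50 us RDMA), so the compressed state is checked before
/// any network path is considered.
fn plan_fetch(resident_local: bool, compressed_local: bool) -> FetchPlan {
    if resident_local {
        FetchPlan::AlreadyResident
    } else if compressed_local {
        FetchPlan::DecompressLocal
    } else {
        FetchPlan::FetchRemote
    }
}
```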
5.6.1 Global Memory Pool
5.6.1.1 Design: Cluster Memory as a Unified Tier Hierarchy
The memory manager already manages a tier hierarchy:
Current (single-node):
Tier 0: Per-CPU page caches (fastest, smallest)
Tier 1: Local NUMA node DRAM (fast, local)
Tier 2: Remote NUMA node DRAM (cross-socket, ~150ns)
Tier 3: GPU VRAM (Section 21.2)
Tier 4: Compressed pool (Section 4.2)
Tier 5: Swap (NVMe SSD)
Extended (distributed cluster):
Tier 0: Per-CPU page caches (unchanged)
Tier 1: Local NUMA node DRAM (unchanged)
Tier 2: Remote NUMA node DRAM, same machine (unchanged)
Tier 3: CXL-attached memory pool (~200-400ns) ← NEW
Tier 4: GPU VRAM (unchanged)
Tier 5: Compressed pool (unchanged)
Tier 6: Remote node DRAM via RDMA (~3-5 μs) ← NEW
Tier 7: Remote node compressed pool via RDMA ← NEW
Tier 8: Local NVMe swap (unchanged)
Tier 9: Remote NVMe via NVMe-oF/RDMA ← NEW
Key insight: Tier 6 (remote RDMA, ~3-5 μs) is faster than Tier 8 (local NVMe).
The global memory pool inserts remote memory as a tier BETWEEN
compressed pages and local swap. Compressed pool (Tier 5, ~1-2 μs
decompression) precedes remote RDMA because local decompression is
faster than a network round-trip.
Note: Tier numbers are ordinal positions in the latency-sorted hierarchy, not
fixed identifiers. When distributed mode adds new tier sources (CXL-attached memory,
remote node DRAM), existing sources shift to higher tier numbers. Code that uses
tier numbers MUST NOT hardcode numeric values — instead, use the TierKind enum
(defined in Section 4.1.8) to identify tier types by semantic name (e.g.,
TierKind::LocalDram, TierKind::GpuVram, TierKind::DsmRemote) and query
mem::tier_ordinal(kind) for the current ordinal position of each kind.
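The ordinal-shift behavior can be sketched as follows. The TierKind variant names follow this section's examples, but the table-building helper and its signature are assumptions (the real enum lives in Section 4.1.8):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum TierKind { LocalDram, CxlPool, GpuVram, Compressed, DsmRemote, LocalSwap }

/// Latency-sorted tier table. Ordinals are positions in this table,
/// so enabling a new tier source shifts everything after it — which
/// is exactly why code must key off TierKind, not ordinal numbers.
fn tier_table(distributed: bool) -> Vec<TierKind> {
    use TierKind::*;
    if distributed {
        vec![LocalDram, CxlPool, GpuVram, Compressed, DsmRemote, LocalSwap]
    } else {
        vec![LocalDram, GpuVram, Compressed, LocalSwap]
    }
}

/// Current ordinal of a tier kind, or None if that source is absent.
fn tier_ordinal(table: &[TierKind], kind: TierKind) -> Option<usize> {
    table.iter().position(|&k| k == kind)
}
```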
5.6.1.2 Memory Pool Accounting
// umka-core/src/mem/global_pool.rs
/// Global memory pool state (cluster-wide view, maintained per-node).
pub struct GlobalMemoryPool {
/// Per-node memory availability.
/// Fixed-capacity array indexed by NodeId; only entries 0..node_count are valid.
/// Uses MAX_CLUSTER_NODES (Section 5.2.9) to avoid heap allocation.
nodes: [NodeMemoryState; MAX_CLUSTER_NODES],
/// Number of active nodes in the cluster.
node_count: u32,
/// Total cluster memory (sum of all nodes).
total_cluster_bytes: u64,
/// Total available for remote allocation (sum of exported pools).
total_available_bytes: u64,
/// Current remote memory usage by local processes.
local_remote_usage_bytes: AtomicU64,
/// Current memory exported to remote nodes.
exported_usage_bytes: AtomicU64,
/// Policy for remote memory allocation.
policy: GlobalPoolPolicy,
}
pub struct NodeMemoryState {
node_id: NodeId,
/// Total physical memory on this node.
total_bytes: u64,
/// Memory available for remote allocation.
/// Admin-configurable: don't export all memory.
/// Default: min(25% of total, rdma.max_pool_gib GiB) — capped to prevent excessive
/// page pinning on large systems (local workloads get priority).
/// This pool is backed by the RDMA-registered memory region (Section 5.4.3).
/// Only RDMA-registered pages can be served to remote nodes.
export_pool_bytes: u64,
/// Currently allocated to remote nodes.
export_used_bytes: u64,
/// Distance to this node (from cluster distance matrix).
distance_ns: u32,
/// Bandwidth to this node (bytes/sec).
bandwidth_bytes_per_sec: u64,
}
pub struct GlobalPoolPolicy {
/// Maximum percentage of local memory to export for remote use.
/// Default: 25%. Protects local workloads from starvation.
pub max_export_percent: u32,
/// When local memory pressure exceeds this threshold,
/// start reclaiming exported pages (evicting remote users).
/// Default: 80% of local memory usage.
pub reclaim_threshold_percent: u32,
/// Prefer remote memory over local swap?
/// Default: true (RDMA is faster than NVMe).
pub prefer_remote_over_swap: bool,
/// Maximum remote memory a single process can consume (bytes).
/// 0 = unlimited (subject to cgroup limits).
pub per_process_remote_max: u64,
}
Security requirement — Cross-node capability chain validation: Remote memory access via the global memory pool MUST validate capability chains across nodes. A capability issued by Node A that authorizes access to memory on Node A must NOT be usable to access memory on Node B without explicit cross-node authorization. Validation requirements:
- Capability scope validation: When Node A's process accesses memory exported by Node B via the global pool, the kernel MUST verify that:
  - The process holds a valid DistributedCapability (Section 5.8.2) for remote memory access, signed by a trusted issuer
  - The capability's constraints field explicitly authorizes the target NodeId (or contains NODE_ID_ANY for cluster-wide access)
  - The capability has not expired, been revoked, or had its generation invalidated (standard distributed capability validation)
  - The capability's permissions field includes REMOTE_MEMORY_READ and/or REMOTE_MEMORY_WRITE as appropriate for the operation
- Delegation chain integrity: If a capability was derived through delegation (e.g., process P1 on Node A delegates to process P2 on Node B), the kernel MUST verify the entire chain:
  - Each capability in the chain MUST have a valid signature from its issuer
  - Each delegation MUST have DELEGATE permission in the parent capability
  - Derived capabilities MUST be strictly less powerful than their parent (no permission amplification)
  - The delegation depth MUST NOT exceed MAX_CAP_DELEGATION_DEPTH (default: 8) to prevent resource exhaustion
- Remote node attestation: Before honoring a cross-node capability, Node B MUST verify that Node A is a current, authenticated cluster member (not revoked, evicted, or marked Dead per Section 5.9). This check is O(1) via the cluster membership bitmap.
- Audit logging: Cross-node capability validations that fail MUST be logged to the security audit subsystem (Section 2.1) with:
  - Source node ID
  - Target node ID
  - Capability object_id and generation
  - Failure reason (signature invalid, expired, revoked, wrong node, etc.)
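The no-amplification and depth rules can be checked mechanically if permissions are viewed as a bitmask. The sketch below models only the permission-bit and depth checks — the real DistributedCapability validation also covers signatures, expiry, and revocation, which this omits:

```rust
const MAX_CAP_DELEGATION_DEPTH: usize = 8;

/// Validate a delegation chain's permission bits, root first.
/// Each derived capability must be strictly less powerful than its
/// parent (a proper subset of the parent's bits — no amplification),
/// and the number of delegations must not exceed the depth limit.
fn chain_valid(perms: &[u32]) -> bool {
    if perms.is_empty() || perms.len() - 1 > MAX_CAP_DELEGATION_DEPTH {
        return false;
    }
    // w[1] & !w[0] == 0  → child adds no bits its parent lacks;
    // w[1] != w[0]       → child is strictly weaker, not equal.
    perms.windows(2).all(|w| w[1] & !w[0] == 0 && w[1] != w[0])
}
```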
5.6.1.3 The Killer Use Case: AI Model Memory
Cluster: 4 nodes, each 512GB RAM + 4× A100 GPUs (80GB VRAM each)
Total CPU RAM: 2 TB
Total GPU VRAM: 1.28 TB
Total cluster memory: 3.28 TB
Scenario: Run a 405B parameter model (Llama-3.1-405B, ~810GB at FP16)
Without global memory pool (current state of the art):
- Tensor parallelism across 16 GPUs: each GPU holds 1/16 of model (~50GB)
- Model fits in GPU VRAM (80GB per GPU)
- BUT: KV cache for long context (128K tokens) = ~100-200GB additional
- KV cache spills to CPU RAM via UVM, then to NVMe → 10-15 μs per page fault
- Inference latency: dominated by KV cache spills
With global memory pool:
- GPU VRAM: hot layers (active attention heads, current KV cache entries)
- Local CPU RAM: warm layers (recent KV cache, inactive attention heads)
- Remote CPU RAM (RDMA): cold KV cache entries from other nodes
→ 3-5 μs per page fault (faster than local NVMe!)
- Local NVMe: only for truly cold data (old checkpoints, etc.)
The kernel manages placement transparently:
- MigrationPolicy tracks access patterns per page
- Hot pages migrate toward the accessing GPU/CPU
- Cold pages migrate to remote nodes with more available memory
- The ML framework sees a flat address space; kernel handles the rest
Performance impact:
- KV cache "miss" on remote RDMA: ~5 μs (vs. ~15 μs from NVMe)
- 3x improvement in tail latency for long-context inference
- No application code changes required
5.6.1.4 Cgroup Integration
/sys/fs/cgroup/<group>/memory.remote.max
  # Maximum remote memory this cgroup can consume (bytes)
  # Default: "max" (unlimited, subject to global pool policy)
/sys/fs/cgroup/<group>/memory.remote.current
  # Current remote memory usage (read-only)
/sys/fs/cgroup/<group>/memory.remote.stat
  # Remote memory statistics:
  #   remote_alloc <bytes allocated on remote nodes>
  #   remote_faults <page faults resolved from remote>
  #   remote_migrations_in <pages migrated from remote to local>
  #   remote_migrations_out <pages migrated from local to remote>
  #   rdma_bytes_rx <total RDMA data received>
  #   rdma_bytes_tx <total RDMA data sent>
/sys/fs/cgroup/<group>/memory.tier_preference
  # Override default tier ordering for this cgroup:
  #   "local,cxl,remote,compressed,swap" (default)
  #   "local,compressed,swap" (disable remote memory for this cgroup)
  #   "local,cxl,remote,swap" (skip compression, prefer remote)
5.6.2 Distributed Page Cache
5.6.2.1 Problem
When Node A reads a file that Node B recently accessed and has cached, the standard approach (NFS/CIFS) refetches from the storage server over TCP. This ignores that Node B already has the data cached in its page cache and could serve it faster via RDMA.
5.6.2.2 Design: Cooperative Page Cache
The page cache gains awareness of what other nodes have cached:
Traditional NFS read:
Node A: read() → VFS → NFS client → TCP → NFS server → disk → TCP → Node A
Latency: ~200 μs (network + server + disk if not cached on server)
Cooperative page cache (shared filesystem):
Node A: read() → VFS → page cache miss → WHERE is this page?
Option 1: Remote page cache (RDMA read from Node B's page cache)
Latency: ~3-5 μs
Option 2: Local disk
Latency: ~10-15 μs (NVMe)
Option 3: Remote disk (NVMe-oF/RDMA)
Latency: ~15-25 μs
Option 4: Traditional NFS/CIFS
Latency: ~200 μs
Kernel picks the fastest source automatically.
5.6.2.3 Page Cache Directory
// umka-core/src/vfs/cooperative_cache.rs
/// Distributed page cache directory.
/// Tracks which nodes have cached pages for which files.
///
/// Uses a per-node counting Bloom filter architecture (NOT per-file).
/// Each node maintains ONE filter covering ALL files it has cached,
/// and broadcasts it to peers periodically.
///
/// Memory bound: 256KB per node × 64 nodes = 16MB per node to cache
/// every peer's filter. This replaces the prior per-file design that
/// would have consumed 20KB × files × nodes (unbounded: 100K files ×
/// 64 nodes = 128GB).
pub struct CooperativeCache {
/// Per-node Bloom filters received from cluster peers.
/// Key: NodeId of the remote node.
/// Value: that node's Bloom filter (covering all files cached on that node)
/// plus the timestamp when the filter was last received.
/// Fixed-size: one entry per cluster node, direct-indexed by NodeId.
/// Total memory: 64 × 256KB = 16MB (compressed in transit: ~4MB).
peer_filters: [Option<CachedPeerFilter>; MAX_CLUSTER_NODES],
/// This node's own Bloom filter, broadcast to peers.
local_filter: NodeBloomFilter,
}
/// Counting Bloom filter: 4 bits per counter, 8 hash functions.
/// Size: 256KB per node = 512K counters (m). At k=8 hash functions and n≈50K items,
/// FPR = (1 - e^(-kn/m))^k = (1 - e^(-8*50K/512K))^8 ≈ 1%. Saturates at ~50K items.
/// Formula: n_max = m/k * ln(1/(1-p^(1/k))) ≈ 52K for m=512K, k=8, p=0.01.
///
/// The filter keys are (filesystem_id, inode, page_offset) hashes.
/// A positive result means "this node probably has this page cached."
/// A negative result means "this node definitely does not have this page."
pub struct NodeBloomFilter {
/// 256KB = 512K nibbles = 512K 4-bit counters (2 nibbles per byte).
/// Counting filter supports deletion (decrement on eviction).
counters: [u8; 262144],
/// Approximate count of items in the filter (for load factor tracking).
approx_count: AtomicU64,
/// Last reset timestamp. Filters are periodically reset to prevent
/// saturation (reset cycle: configurable, default 300 seconds).
last_reset: AtomicU64,
}
pub struct CachedPeerFilter {
/// The remote node's Bloom filter (decompressed from ZSTD on receipt).
filter: NodeBloomFilter,
/// Timestamp when this filter was received from the peer.
received_ns: u64,
}
impl CooperativeCache {
/// Find the best source for a cache page.
pub fn find_page(
&self,
file: FileId,
page_offset: u64,
local_node: NodeId,
distances: &ClusterDistanceMatrix,
) -> PageSource {
// 1. Check local page cache first (always).
// 2. Hash (file, page_offset) and query each peer's Bloom filter.
// A negative result = definitely not cached on that peer.
// A positive result = maybe cached, worth an RDMA probe.
// 3. Of the candidate nodes (positive results), pick the closest
// (lowest distance in the cluster distance matrix).
// 4. If no remote cache hit: fall back to storage.
}
}
pub enum PageSource {
/// Page is in local page cache. No I/O needed.
LocalCache,
/// Page is likely cached on a remote node. Try RDMA read.
RemoteCache { node_id: NodeId, expected_latency_ns: u32 },
/// Page is not cached anywhere. Read from storage.
Storage { device: DeviceNodeId },
/// Page is on remote storage (NVMe-oF/RDMA).
RemoteStorage { node_id: NodeId, device: DeviceNodeId },
}
Bounded Bloom filter memory: Bloom filters for negative caching are bounded by a two-level architecture. The key invariant: one filter per node covering all files, not one filter per file. This bounds total memory to O(nodes), not O(nodes × files).
Level 1: Per-node Counting Bloom Filter (bounded)
Each node maintains a single counting Bloom filter covering all of its cached file pages, not one filter per file. This is the NodeBloomFilter defined in Section 5.6.2.3: 256KB = 512K 4-bit counters (2 per byte), k=8 hash functions, FPR ≈ 1% at the ~50K-item saturation point.
Memory bound: 256KB × 64 nodes = 16MB total per node (for the full cluster's filters). This replaces the prior per-file design that would have required 20KB × files × nodes (100K files × 64 nodes = 128GB — unbounded and unusable).
False positive rate management: With 256KB = 512K 4-bit counters (m), k=8 hash functions, and n≈50K items, FPR = (1 − e^(−kn/m))^k ≈ 1%. A single node's active page-cache working set is typically tens of thousands of (file, page_offset) pairs, well within this limit. A false positive means we send one extra RDMA probe to a node that does not have the page (cost: ~3 us wasted), not that we miss a page that exists (which would be a correctness error). False negatives cannot occur with a standard Bloom filter.
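The FPR figure can be reproduced directly from the standard Bloom formula — a worked check, not kernel code:

```rust
/// False positive rate of a Bloom filter with m counters, k hash
/// functions, and n inserted items: (1 - e^(-k*n/m))^k.
fn bloom_fpr(m_counters: f64, k_hashes: f64, n_items: f64) -> f64 {
    (1.0 - (-k_hashes * n_items / m_counters).exp()).powf(k_hashes)
}
```

For m = 512K counters, k = 8, n = 50K, this evaluates to roughly 0.7% — consistent with the "≈ 1%" figure used throughout this section.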
When approx_count > 0.8 × capacity (>40K entries), the filter is considered
saturated. On saturation:
- Log a warning (filter accuracy degraded).
- If saturation persists (> 60s), trigger a partial reset: decrement all counters by approx_count / capacity (preserves recent entries while clearing old ones).
- If the count exceeds capacity by 2x (>100K entries), trigger a full reset and rebuild from the current page cache index.
Level 2: Periodic Reset
fn bloom_maintenance_tick(filter: &mut NodeBloomFilter) {
    let age_secs = now() - filter.last_reset.load(Acquire);
    if age_secs >= BLOOM_RESET_INTERVAL_SECS { // default 300s
        // Soft reset: halve every counter.
        // Items inserted in the last BLOOM_RESET_INTERVAL/2 seconds
        // are still present with high probability.
        for byte in filter.counters.iter_mut() {
            // Each byte packs two 4-bit counters; shift each nibble
            // right by one, masking so no bit borrows across the
            // nibble boundary.
            *byte = (*byte >> 1) & 0x77;
        }
        let count = filter.approx_count.load(Relaxed);
        filter.approx_count.store(count / 2, Relaxed);
        filter.last_reset.store(now(), Release);
    }
}
Negative cache lookup protocol:
fn file_page_cached_on_node(
    cache: &CooperativeCache,
    node: NodeId,
    file: FileId,
    page_offset: u64,
) -> BloomResult {
    // No filter received yet from this node.
    let peer = match cache.peer_filters[node.as_usize()].as_ref() {
        Some(peer) => peer,
        None => return BloomResult::Unknown,
    };
    // Expire stale filters (>60 seconds old = 2x the broadcast interval).
    if now_ns() - peer.received_ns > BLOOM_STALE_THRESHOLD_NS {
        return BloomResult::Unknown;
    }
    let key_hash = hash_bloom_key(file, page_offset);
    if !peer.filter.query(key_hash) {
        // Definitely not cached on this node. No RDMA probe needed.
        BloomResult::DefinitelyAbsent
    } else {
        // Might be cached. Worth an RDMA probe.
        BloomResult::MaybePresent
    }
}
Filter distribution: Each node periodically broadcasts its Bloom filter to peers
(compressed via ZSTD, ~64KB after compression for 256KB raw). Broadcast interval: 30
seconds or on significant change (>10K inserts since last broadcast). Peers cache the
received filter in the fixed-size peer_filters array (one slot per cluster node,
direct-indexed by NodeId). Total memory per node for all peer filters: 64 nodes x
256KB = 16MB (or ~4MB if filters are kept in compressed form and decompressed on query).
Local filter maintenance: When the local node caches a page, it inserts the
(file_id, page_offset) hash into local_filter. When a page is evicted from the
local page cache, the corresponding counter is decremented (counting Bloom filter
supports deletion). The bloom_maintenance_tick runs on the periodic maintenance timer
to prevent long-term counter accumulation from hash collisions.
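The insert/evict path manipulates packed 4-bit counters. A self-contained sketch of the nibble arithmetic (index derivation from the key hash is omitted; the real filter probes k=8 positions per key):

```rust
/// Read the 4-bit counter at `idx` from a packed nibble array.
fn get_nibble(counters: &[u8], idx: usize) -> u8 {
    let b = counters[idx / 2];
    if idx % 2 == 0 { b & 0x0F } else { b >> 4 }
}

/// Saturating increment (page cached). A counter pinned at 15 never
/// wraps — wrapping would corrupt the filter.
fn inc_nibble(counters: &mut [u8], idx: usize) {
    let v = get_nibble(counters, idx);
    if v == 0x0F { return; }
    let byte = &mut counters[idx / 2];
    if idx % 2 == 0 { *byte = (*byte & 0xF0) | (v + 1); }
    else { *byte = (*byte & 0x0F) | ((v + 1) << 4); }
}

/// Saturating decrement (page evicted). A saturated counter is left
/// alone since its true count is unknown; the periodic soft reset
/// eventually reclaims it.
fn dec_nibble(counters: &mut [u8], idx: usize) {
    let v = get_nibble(counters, idx);
    if v == 0 || v == 0x0F { return; }
    let byte = &mut counters[idx / 2];
    if idx % 2 == 0 { *byte = (*byte & 0xF0) | (v - 1); }
    else { *byte = (*byte & 0x0F) | ((v - 1) << 4); }
}
```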
5.6.2.4 Cache Coherence for Shared Files
When multiple nodes cache the same file, writes must be coordinated:
Write strategy (per-file, configurable):
1. Write-invalidate (default for shared mutable files):
- Writer acquires exclusive ownership (like DSM write fault)
- All reader copies are invalidated via RDMA
- Writer modifies page, becomes sole cached copy
- Other nodes re-fault on next access (get updated page)
2. Write-through (for append-only logs, databases):
- Writer writes to local page cache AND pushes update to owner
- Owner propagates to all readers via RDMA write
- Higher bandwidth cost, but readers see updates faster
3. No coherence (for read-only data, e.g., shared model weights):
- File is marked immutable (or read-only mounted)
- All nodes cache freely, no invalidation needed
- Best case for ML inference: model weights cached everywhere
Integration with the Distributed Lock Manager (Section 14.6):
For clustered filesystems (Section 14.5), page cache coherence is coordinated through DLM lock operations rather than the DSM coherence protocol. The DLM provides filesystem-aware semantics that the generic DSM protocol cannot:
- Lock downgrade triggers targeted writeback: When a DLM lock is downgraded from EX to PR (Section 14.6.8), only dirty pages tracked by the lock's LockDirtyTracker are flushed — not the entire inode's page cache. This eliminates the Linux problem where dropping a lock on a large file requires flushing all dirty pages regardless of how many were actually modified.
- MOESI-like page states: Each cached page on a clustered filesystem carries a coherence state relative to the DLM lock protecting it:
  - Modified: Page dirty under EX lock. Sole copy in the cluster.
  - Owned: Page was modified, then lock downgraded to PR. This node is responsible for writing back on eviction. Other nodes may have Shared copies.
  - Exclusive: Page clean, held under EX lock. Can transition to Modified without network traffic.
  - Shared: Page clean, held under PR/CR lock. Read-only.
  - Invalid: Lock released or revoked. Page must be re-fetched on next access.
- Per-lock-range dirty tracking: The cooperative page cache directory (Section 5.6.2.3) integrates with Section 14.6.8's LockDirtyTracker to record which pages were dirtied under which lock range. On lock downgrade, the writeback is scoped to the lock's byte range — concurrent holders of non-overlapping ranges on the same file are not affected.
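The EX-to-PR downgrade path described above can be sketched as a pure transition function (a simplification — the real states also interact with the full set of DLM lock modes and eviction):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum PageCoherence { Modified, Owned, Exclusive, Shared, Invalid }

/// State change when the protecting DLM lock is downgraded EX -> PR.
/// A dirty page keeps the writeback obligation (Owned); a clean
/// exclusive page simply becomes Shared. Other states are unaffected
/// by this particular transition.
fn on_ex_to_pr_downgrade(state: PageCoherence) -> PageCoherence {
    use PageCoherence::*;
    match state {
        Modified => Owned,    // was dirty under EX; this node writes back on eviction
        Exclusive => Shared,  // clean: no writeback obligation
        other => other,
    }
}
```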
5.6.2.5 AI Training Data Pipeline
Training data scenario:
- 100TB dataset on shared NVMe storage
- 8 training nodes, each with 4 GPUs
- Each node reads different shards, but shards overlap (data augmentation)
Without cooperative cache:
Each node reads its shard from storage independently.
If Node A and Node B need the same page: two storage reads.
With cooperative cache:
Node A reads page from storage → cached in Node A's page cache.
Node B needs same page → Node A's Bloom filter (cached locally) shows it might have it.
Node B fetches via RDMA Read from Node A: ~3 μs (vs ~15 μs from NVMe).
Storage bandwidth saved: proportional to shard overlap.
For a typical ImageNet-style dataset with 30% shard overlap:
~30% reduction in storage I/O, ~30% more effective storage bandwidth.
5.7 Cluster-Aware Scheduler
5.7.1 Problem
The scheduler (Section 6.1) currently optimizes process placement within a single machine: NUMA-aware load balancing, work stealing between CPUs, migration cost modeling. For a distributed kernel, the scheduler should consider the entire cluster.
5.7.2 Design: Two-Level Scheduler
Level 1: Global Cluster Scheduler (runs every ~10s, lightweight)
- Monitors per-node load (CPU, memory, accelerator utilization)
- Decides process-to-node placement
- Triggers process migration when data locality warrants it
- Respects node affinity, cgroup constraints, capability requirements
Level 2: Per-Node Scheduler (existing, runs every ~4ms)
- CFS/EEVDF + RT + DL queues (unchanged)
- NUMA-aware CPU placement (unchanged)
- Accelerator scheduling (Section 21.1.2.4, unchanged)
5.7.3 Global Scheduler State
// umka-core/src/sched/cluster.rs
pub struct ClusterScheduler {
/// Per-node load summary (updated via periodic RDMA exchange).
/// Fixed-capacity array indexed by NodeId; only entries 0..node_count are valid.
/// Uses MAX_CLUSTER_NODES (Section 5.2.9) to avoid heap allocation.
node_loads: [NodeLoad; MAX_CLUSTER_NODES],
/// Process-to-data affinity map.
/// Tracks which node holds most of a process's working set.
/// Uses a slab allocator keyed by ProcessId for O(1) average lookup;
/// BTreeMap's O(log n) tree traversal with poor cache locality is
/// unsuitable for per-scheduling-tick access under load.
data_affinity: SlabMap<ProcessId, DataAffinity>,
/// Cluster distance matrix (Section 5.2.9).
distances: ClusterDistanceMatrix,
/// Global load balance interval.
balance_interval_ms: u32, // Default: 10000ms (10 seconds)
/// Migration threshold: only migrate if improvement exceeds this.
migration_threshold_ppt: u32, // Default: 300 (300/1000 = 30% locality improvement)
}
pub struct NodeLoad {
node_id: NodeId,
/// CPU utilization 0-100% (average across all CPUs).
cpu_percent: u32,
/// Memory pressure 0-100% (0 = plenty free, 100 = thrashing).
memory_pressure: u32,
/// Accelerator utilization 0-100% (average across all accelerators).
accel_percent: u32,
/// Number of runnable processes.
runnable_count: u32,
/// Remote memory faults per second (high = poor locality).
remote_fault_rate: u32,
}
pub struct DataAffinity {
/// How many pages of this process's working set are on each node.
/// Used to decide where a process should run.
// Fixed array indexed by NodeId; O(1) access, no allocation. MAX_CLUSTER_NODES=64.
pages_per_node: [u64; MAX_CLUSTER_NODES],
/// Total working set size (pages).
total_working_set: u64,
/// Node with most pages (preferred placement).
preferred_node: NodeId,
/// Working set locality on preferred node (0-1000, parts per thousand).
locality_score_ppt: u32,
}
5.7.4 Process Migration
When the cluster scheduler decides a process should move to another node:
Process migration from Node A to Node B:
1. Cluster scheduler on Node A decides: process P should run on Node B.
Reason: 70% of P's working set is on Node B (remote page faults dominate).
2. Pre-migration:
a. Send process metadata to Node B: PID, capabilities, cgroup, open files,
signal handlers, register state.
b. Node B allocates process slot, creates local task struct.
3. Freeze and transfer:
a. Freeze process P on Node A (stop scheduling, save register state).
b. Transfer register state to Node B via RDMA (~64 bytes, ~1 μs).
c. Transfer kernel stack to Node B via RDMA (~16KB, ~2 μs).
d. Mark P's page table on Node A as "migrated to Node B."
Pages are NOT bulk-transferred — they fault in on demand.
4. Resume on Node B:
a. Node B installs process in local scheduler.
b. Process resumes execution on Node B.
c. First memory access → page fault → fetch from Node A via RDMA (~3-5 μs).
d. Subsequent accesses: pages migrate on demand.
e. Working set migrates over ~100ms as pages are faulted in.
Total migration downtime: ~10-50 μs (metadata transfer + freeze/thaw)
(This is the **critical-section freeze time** — the interval during which the
process is stopped for final register/TLB state transfer. Total end-to-end
migration time, including pre-copy, file descriptor proxy setup, IPC handle
conversion, and cgroup recreation, is typically 1–10 ms depending on process
complexity.)
Working set follows lazily: pages migrate over seconds as accessed.
This is the same strategy as live VM migration (pre-copy / post-copy),
but at process granularity. Much lighter weight than VM migration.
Full process migration requires transferring the following state:
- Register state: CPU registers, FPU/SIMD state (saved during freeze).
- Kernel stack: The process's kernel-mode stack (~16KB).
- Page table metadata: Transferred lazily — pages fault in on demand from the source node via RDMA.
- Open file descriptors: For local files on the source node, a proxy is created that forwards read/write/ioctl over RDMA IPC to the source. For shared-filesystem files (NFS, CIFS), the descriptor is re-opened locally on the destination.
- IPC handles: SharedMemory transport handles are converted to Rdma transport handles (Section 5.5.2). The ring buffer contents are preserved.
- Cgroup membership: Recreated on the destination node. The destination cgroup must have sufficient quota.
- Signal state: Pending signals, signal mask, and signal handlers are transferred.
- Timer state: Active timers (POSIX, itimer) are recreated on the destination with remaining duration adjusted for clock synchronization (Section 5.9.2.5).
- Device handles: Local accelerator contexts are converted to RemoteDeviceProxy handles (Section 5.9.1.2). The process continues to use the original device via RDMA-proxied commands.
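The lazy fault-in path described in steps 3d and 4c–e above can be sketched as follows. This is a minimal illustration only: the PageState marker and the place where an RDMA read would be issued are hypothetical names, not the kernel's actual types.

```rust
/// Hypothetical per-page marker left by step 3d ("migrated to Node B").
#[derive(Clone, Copy, PartialEq, Debug)]
enum PageState {
    /// Page is resident on this node.
    Local,
    /// Page is still resident on the source node (NodeId as a bare u32 here).
    OnSourceNode(u32),
}

/// Fault-handler decision sketch: a page still on the source is fetched
/// exactly once over RDMA (~3-5 us per the text), then served locally.
/// Returns the new page state and whether an RDMA read was issued.
fn handle_fault(state: PageState) -> (PageState, bool) {
    match state {
        PageState::Local => (PageState::Local, false),
        PageState::OnSourceNode(_src) => {
            // In the real kernel, an RDMA read of the page from `_src`
            // would happen here before mapping the page locally.
            (PageState::Local, true)
        }
    }
}
```

After the first fault the marker flips to Local, which is why the working set migrates incrementally rather than in bulk.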
Socket Migration
TCP socket migration transfers established connections transparently across nodes. The guiding constraint is that the remote peer must not observe a connection reset — migration must be invisible to the peer.
Pre-migration phase (on source node):
- The source kernel captures a consistent snapshot of the TCP socket state. The snapshot must be taken atomically with respect to the send/receive paths: the socket is placed in TCP_MIGRATING state, which prevents new data from being sent and holds incoming ACKs in a staging buffer.
/// Snapshot of a TCP socket's full state for cross-node migration.
/// Captured atomically: source socket enters TCP_MIGRATING before snapshot
/// and is torn down only after destination confirms active.
pub struct TcpSocketMigrationState {
/// Local endpoint address and port.
pub local_addr: SocketAddr,
/// Remote peer address and port (unchanged across migration).
pub remote_addr: SocketAddr,
/// Next sequence number the source would have sent.
/// Must match what the destination sends on the first post-migration segment.
pub snd_nxt: u32,
/// Oldest unacknowledged sequence number (start of retransmit queue).
pub snd_una: u32,
/// Next sequence number the source expects to receive from the peer.
pub rcv_nxt: u32,
/// Current TCP state machine state at time of snapshot.
/// Must be ESTABLISHED, CLOSE_WAIT, or FIN_WAIT_1; other states are
/// not migratable and return -EOPNOTSUPP.
pub tcp_state: TcpState,
/// Contents of the retransmit queue: bytes sent but not yet acknowledged.
/// The destination may need to retransmit these if the peer's ACK arrives
/// after migration completes.
pub retransmit_queue: Vec<u8>,
/// Contents of the receive buffer: bytes received from the peer but not
/// yet consumed by the process. Delivered in-order on the destination.
pub recv_buffer: Vec<u8>,
/// TCP options negotiated for this connection.
/// Includes SACK state, timestamp offsets, window scaling factor, etc.
pub options: TcpOptionsState,
/// Congestion control algorithm name and its opaque private state blob.
/// The destination must support the same congestion control algorithm;
/// if not, falls back to CUBIC with fresh cwnd.
pub cong_state: CongestionState,
/// Send window size (from most recent ACK from peer).
pub snd_wnd: u32,
/// Receive window size (advertised to peer).
pub rcv_wnd: u32,
/// Maximum segment size negotiated with peer.
pub mss: u16,
/// Smoothed round-trip time estimate (microseconds).
pub srtt_us: u32,
/// RTT variance estimate (microseconds).
pub rttvar_us: u32,
}
- Source quiesces the socket: stops sending, holds incoming ACKs in the staging buffer for a maximum of 100 ms. If quiescence takes longer (e.g., a large retransmit queue draining), migration is aborted with -ETIMEDOUT and the process is left on the source node.
- Source captures the snapshot (sequence numbers, buffers, options, congestion state).
- Source transmits the snapshot to the destination node via the RDMA migration channel.
Post-migration phase (on destination node):
- Destination reconstructs the TCP socket from TcpSocketMigrationState. The socket is created with SO_REUSEPORT + SO_REUSEADDR in TCP_MIGRATING state, so the destination kernel accepts the sequence-number context without performing a three-way handshake.
- The migrated process's source IP must be reachable at the destination. Two models are supported:
  - IP mobility (preferred): A CARP/VRRP failover address or network-overlay virtual IP moves with the process. The remote peer's connection continues without interruption — packets are routed to the destination by the overlay.
  - No IP mobility: The migration is transparent only if the source IP is present on the destination interface (e.g., both nodes share a subnet with the same address via aliasing). Without this, sockets are closed (ECONNRESET is delivered to the remote peer) and reconnection is the application's responsibility. UmkaOS does not silently lie to the application about connectivity; migration fails with -EADDRNOTAVAIL if the source address cannot be installed on the destination.
- Destination sends a TCP keepalive to re-establish the connection from the remote peer's perspective. The keepalive carries sequence number snd_nxt - 1 (one byte before the next send sequence number), causing the peer to confirm with an ACK that advances snd_una on the destination.
- Source tears down the socket and releases the address after receiving a positive acknowledgement from the destination that the destination socket is in ESTABLISHED state and the keepalive ACK has been received.
- For UDP sockets: the state transfer is simpler — socket options, bound address, connected address (if connect() was called), and socket buffer contents are transferred. In-flight datagrams may be lost; UDP is unreliable, and applications using it must tolerate loss independently of migration.
- Sockets in non-migratable states: Sockets in SYN_SENT, SYN_RECV, TIME_WAIT, or LISTEN states are not migrated. LISTEN sockets are re-bound on the destination (new connections land on the destination after migration; connections established pre-migration remain on the source until they close naturally).
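The keepalive re-establishment step leans on 32-bit wrapping TCP sequence arithmetic. A minimal sketch of the relevant helpers, with illustrative names rather than kernel API:

```rust
/// Sequence number of the keepalive probe the destination sends after
/// restore: one byte before snd_nxt, i.e. a byte the peer has already
/// seen. TCP sequence space wraps modulo 2^32.
fn keepalive_probe_seq(snd_nxt: u32) -> u32 {
    snd_nxt.wrapping_sub(1)
}

/// RFC 793-style modular comparison: true if `a` precedes `b` in
/// sequence space (valid while the two are within 2^31 of each other).
fn seq_before(a: u32, b: u32) -> bool {
    (b.wrapping_sub(a) as i32) > 0
}

/// Bytes sent but not yet acknowledged at snapshot time. This is the
/// length the TcpSocketMigrationState retransmit_queue must carry.
fn unacked_bytes(snd_una: u32, snd_nxt: u32) -> u32 {
    snd_nxt.wrapping_sub(snd_una)
}
```

Wrapping subtraction is what keeps these correct across the 2^32 sequence-number rollover, which a naive `<` comparison would get wrong.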
Socket migration state in the process migration record:
/// All TCP/UDP sockets belonging to the migrating process.
pub struct ProcessSocketMigrationState {
/// TCP sockets (ESTABLISHED, CLOSE_WAIT, FIN_WAIT_1 only).
pub tcp_sockets: Vec<TcpSocketMigrationState>,
/// UDP sockets: bound/connected address, socket options, buffer contents.
pub udp_sockets: Vec<UdpSocketMigrationState>,
/// Unix domain sockets connected to peers on the same node: proxied via
/// RemoteSocketProxy after migration (identical mechanism to file proxy).
pub unix_sockets: Vec<UnixSocketProxyState>,
/// Sockets that could not be migrated (LISTEN, TIME_WAIT, etc.).
/// These file descriptors are replaced with a closed fd on the destination;
/// the process receives SIGPIPE or EBADF on next access.
pub non_migratable_fds: Vec<i32>,
}
GPU Context Migration
GPU context migration requires hardware vendor support. Not all GPUs implement the necessary preemption and memory export interfaces. Migration behavior is determined by querying the AccelContext capabilities at migration time.
GPU migration support matrix:
| GPU class | Mechanism | Granularity |
|---|---|---|
| NVIDIA H100+ (Hopper MIG) | NVLink + MIG partition export | Per-MIG slice |
| AMD MI300X (Infinity Fabric) | XGMI memory export + context checkpoint | Full context |
| NVIDIA A100 (non-MIG) | Software checkpoint via CUDA CTK (CRIU-GPU) | Full context, ~500ms |
| Consumer GPUs (RTX, RX) | Not supported; fallback to CPU-side buffer copy | Buffer contents only |
For GPUs that support context migration, the migration sequence is:
/// Full snapshot of an accelerator context for cross-node migration.
/// Populated by the AccelBase driver via the checkpoint vTable call.
pub struct GpuContextMigrationState {
/// Opaque GPU register file snapshot, if the hardware supports it.
/// None for GPUs that do not expose register-file checkpoint (most consumer GPUs).
pub register_file: Option<Vec<u8>>,
/// GPU memory allocations to transfer: virtual GPU address → allocation metadata.
/// The physical memory pages are transferred via HMM migration (see below).
pub allocations: Vec<GpuAllocationRecord>,
/// Pending command buffers: commands submitted to the GPU but not yet dispatched.
/// These are re-submitted to the destination GPU after context restoration.
pub pending_commands: Vec<GpuCommandBuffer>,
/// Fence states at the time of snapshot.
/// Fences that were signaled before snapshot are recorded as SignaledFence so
/// the destination can satisfy `fence_wait()` calls without re-executing work.
pub fence_states: Vec<(FenceId, FenceState)>,
/// GPU push constants and descriptor set state (Vulkan/CUDA-mapped contexts).
pub constants: Vec<u8>,
/// Name of the AccelBase driver that owns this context, for validation on
/// destination (migration between incompatible GPU vendors is rejected).
pub driver_name: [u8; 64],
/// Driver-version tuple: migration is rejected if destination driver version
/// differs by more than one minor version.
pub driver_version: (u32, u32, u32),
}
GPU migration sequence (hardware-supported path):
- Quiesce the AccelContext: issue a context-level drain barrier. For NVIDIA Hopper, this uses the cuCtxSynchronize() equivalent kernel-internal call. For AMD MI300X, the XGMI memory export fence is used. Quiescence preempts the running command buffer at a safe preemption point:
  - InstructionLevel preemption: saves at the current shader instruction boundary.
  - DrawBoundary preemption (fallback): saves at the current draw-call boundary.
  - Preemption timeout is 200 ms; if the GPU cannot preempt within 200 ms, migration is deferred and retried after 1 s.
- Freeze command submissions: set the AccelContext to ACCEL_FROZEN state. New accel_cmd_submit() calls block (they do not return an error) so the process does not observe the migration boundary.
- Snapshot GPU memory via HMM: the Heterogeneous Memory Management layer migrates GPU pages to CPU-accessible memory (migrate_vma_setup() / migrate_vma_pages() equivalent). This copies GPU VRAM contents to system RAM for transfer over the RDMA migration channel.
- Transfer the GPU memory snapshot to the destination via RDMA Write to a pre-registered migration receive region on the destination node.
- On destination: the AccelBase driver allocates a new GPU context, restores GPU memory from the snapshot (HMM reverse migration: system RAM → GPU VRAM), and replays the register file (if available) and push constants.
- Resume: the AccelContext transitions from ACCEL_FROZEN to ACCEL_RUNNING on the destination. Blocked accel_cmd_submit() calls in the process now complete against the destination context.
GPU migration fallback (no hardware support):
When the GPU does not support register-file checkpoint (most consumer GPUs and some data-center GPUs without MIG):
- Wait for all pending GPU commands to complete (drain the command queue). This may take up to 5 s for long-running compute kernels; if the drain does not complete in 5 s, migration fails with -EOPNOTSUPP and the process stays on the source node.
- Copy GPU buffer contents to CPU memory (no register file, no in-flight state).
- Transfer buffer contents to destination.
- On destination: allocate new GPU buffers, restore contents.
- The application's next GPU submission starts from a clean context with restored buffer contents. Any partially-completed GPU computation is lost and must be re-executed by the application.
Processes requiring GPU contexts that cannot be migrated should set
CLUSTER_PIN_NODE to prevent migration attempts.
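The choice between the hardware-supported path, the drain-and-copy fallback, and refusal can be sketched as a capability query at migration time. GpuCaps and gpu_migration_path are hypothetical names introduced for illustration; the text only specifies that AccelContext capabilities are queried:

```rust
/// Hypothetical capability bits reported by an AccelContext.
#[derive(Clone, Copy)]
struct GpuCaps {
    /// Hardware supports register-file checkpoint (e.g. Hopper MIG export).
    register_checkpoint: bool,
    /// HMM page migration between VRAM and system RAM is available.
    hmm_migration: bool,
}

#[derive(Debug, PartialEq)]
enum GpuMigrationPath {
    /// Full context checkpoint: register file + HMM memory snapshot.
    HardwareCheckpoint,
    /// Drain command queue, copy buffer contents only (in-flight work lost).
    DrainAndCopyBuffers,
    /// Process is pinned; migration is not attempted (-EOPNOTSUPP otherwise).
    NotMigratable,
}

/// Path selection sketch: pinned processes are never migrated; otherwise
/// prefer the hardware checkpoint when both capabilities are present.
fn gpu_migration_path(caps: GpuCaps, pinned: bool) -> GpuMigrationPath {
    if pinned {
        return GpuMigrationPath::NotMigratable; // CLUSTER_PIN_NODE set
    }
    if caps.register_checkpoint && caps.hmm_migration {
        GpuMigrationPath::HardwareCheckpoint
    } else {
        GpuMigrationPath::DrainAndCopyBuffers
    }
}
```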
io_uring Migration
io_uring instances span kernel and userspace memory. The submission queue (SQ),
completion queue (CQ), and the SQE/CQE arrays are memory-mapped into the process's
virtual address space. Migration must quiesce the rings, transfer their state, and
remap them at identical virtual addresses on the destination so the process's
mmap-based pointers remain valid.
Challenge: In-flight SQEs (submitted by the application but not yet dispatched by the kernel) may reference registered buffers, registered files, and eventfd notifications — all of which are also being migrated simultaneously. The migration must preserve the semantic ordering: an SQE that was submitted before migration must either complete on the source or be re-submitted on the destination, never silently dropped.
/// Complete snapshot of one io_uring instance for cross-node migration.
pub struct IoUringMigrationState {
/// The parameters used to create this ring (entries, flags, sq_thread_cpu, etc.).
/// The destination re-creates the ring with identical parameters.
pub ring_params: IoUringParams,
/// Raw bytes of the SQ ring (includes head/tail/flags/dropped counters).
pub sq_ring_snapshot: Vec<u8>,
/// Raw bytes of the CQ ring (includes head/tail/flags/overflow counter).
pub cq_ring_snapshot: Vec<u8>,
/// Number of SQ entries (must equal ring_params.sq_entries).
pub sq_entries: u32,
/// Number of CQ entries (must equal ring_params.cq_entries).
pub cq_entries: u32,
/// SQEs that were in-flight (submitted to kernel, not yet dispatched)
/// when the ring was quiesced. These are either re-submitted on the
/// destination (if idempotent) or failed with -EINTR (if not).
pub inflight_sqes: Vec<IoUringSqe>,
/// user_data values of SQEs that were cancelled during migration.
/// The application receives a CQE with res=-EINTR for each of these.
pub cancelled_sqe_user_data: Vec<u64>,
/// Registered buffer table: each entry is (userspace_iov, buffer_len).
/// Buffers are re-registered on the destination after migration.
pub registered_buffers: Vec<RegisteredBufferRecord>,
/// Registered file table: each entry is the migrated fd number on
/// the destination (after file proxy or re-open).
pub registered_files: Vec<i32>,
/// Personality credentials registered with IORING_REGISTER_PERSONALITY.
pub personalities: Vec<IoUringPersonalityState>,
}
io_uring migration sequence:
- Drain in-flight SQEs: the kernel stops accepting new SQE submissions (IORING_SQ_FROZEN internal flag) and waits up to 500 ms for all in-kernel-dispatched operations to complete. Operations that complete normally produce CQEs that are delivered to the process before migration (the CQ ring snapshot will contain them).
- Identify and cancel non-drainable SQEs: any SQEs that have not completed within the drain timeout are cancelled with -EINTR. Their user_data values are recorded in cancelled_sqe_user_data. The application will see a CQE with res = -EINTR for each on the destination.
- Snapshot the rings: copy the SQ ring, CQ ring, and SQE/CQE arrays from the process's kernel-managed memory. The snapshot is taken while the ring is frozen (no concurrent head/tail pointer modification is possible).
- Unregister buffers and files: all registered buffers and file descriptors are unregistered on the source. Their state is captured in registered_buffers and registered_files for re-registration on the destination.
- Transfer the IoUringMigrationState to the destination as part of the process memory image (via the RDMA migration channel).
- On destination: re-create the io_uring instance with io_uring_setup() using the same ring_params. The kernel re-maps the SQ and CQ rings into the process's virtual address space at the same virtual addresses as on the source. This is guaranteed by restoring the full process virtual address space (VMA layout) before re-creating the ring, so mmap(MAP_FIXED) at the original addresses succeeds.
- Re-register buffers: registered buffers are re-registered with io_uring_register(IORING_REGISTER_BUFFERS) using the migrated buffer regions. Buffers backed by anonymous memory have already been transferred in the page table migration step. Buffers backed by files on the source node are proxied.
- Re-register files: registered file descriptors are re-registered with the destination fd numbers from the file descriptor migration step.
- Re-submit idempotent SQEs: SQEs from inflight_sqes that are marked idempotent (reads, IORING_OP_NOP, IORING_OP_POLL_ADD) are re-submitted to the destination ring. Non-idempotent SQEs (writes, IORING_OP_SEND, IORING_OP_CONNECT) are not re-submitted; the application receives -EINTR and must retry.
- Resume: the ring transitions out of IORING_SQ_FROZEN. The process can submit new SQEs immediately. The effective migration latency visible to the application is the drain time (≤500 ms) plus the freeze-transfer-restore critical section (~5–50 ms depending on ring size and registered buffer count).
io_uring idempotency classification (determines re-submission vs. cancellation):
| Opcode | Idempotent | Re-submitted after migration |
|---|---|---|
| IORING_OP_READ / IORING_OP_READV | Yes (read does not mutate) | Yes |
| IORING_OP_WRITE / IORING_OP_WRITEV | No | No — EINTR returned |
| IORING_OP_SEND / IORING_OP_SENDMSG | No | No — EINTR returned |
| IORING_OP_RECV / IORING_OP_RECVMSG | Yes (recv is read-like) | Yes |
| IORING_OP_POLL_ADD | Yes | Yes |
| IORING_OP_TIMEOUT | Yes (re-armed with adjusted expiry) | Yes |
| IORING_OP_CONNECT | No | No — EINTR returned |
| IORING_OP_ACCEPT | Yes (re-armed on destination listen socket) | Yes |
| IORING_OP_FSYNC / IORING_OP_FDATASYNC | Yes | Yes |
| IORING_OP_NOP | Yes | Yes |
| IORING_OP_SPLICE / IORING_OP_TEE | No | No — EINTR returned |
| IORING_OP_PROVIDE_BUFFERS | Yes | Yes |
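The classification reduces to a static per-opcode lookup. A sketch, with an illustrative opcode enum standing in for the real IORING_OP_* constants:

```rust
/// Illustrative subset of io_uring opcodes (mirrors IORING_OP_* names).
#[derive(Clone, Copy, PartialEq, Debug)]
enum IoUringOp {
    Read, Readv, Write, Writev, Send, Recv, PollAdd,
    Timeout, Connect, Accept, Fsync, Nop, Splice, ProvideBuffers,
}

/// True if an in-flight SQE with this opcode may be safely re-submitted
/// on the destination ring after migration. Non-idempotent opcodes are
/// instead cancelled and surfaced to the application as -EINTR CQEs.
fn resubmittable(op: IoUringOp) -> bool {
    use IoUringOp::*;
    match op {
        // Read-like or re-armable operations: no externally visible
        // side effect is duplicated by running them twice.
        Read | Readv | Recv | PollAdd | Timeout | Accept | Fsync | Nop
        | ProvideBuffers => true,
        // Mutating operations: a duplicate write/send/connect would be
        // observable by the peer or the file, so they are cancelled.
        Write | Writev | Send | Connect | Splice => false,
    }
}
```

The exhaustive match means adding a new opcode forces an explicit idempotency decision at compile time, which is the property a real kernel table would want.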
Process migration scope (v1):
- In scope: CPU register state, page tables (lazy), local file proxying, IPC handle
conversion, cgroup membership, signal state, timer state, accelerator handle proxying,
TCP/UDP socket migration (with IP mobility), io_uring ring migration (with drain),
GPU context migration (hardware-supported GPUs only).
- Out of scope for v1: Active RDMA queue pairs (application must handle),
processes with CLONE_VM threads (thread group migration deferred), ptrace
targets, GPU contexts on hardware without checkpoint support (process gets
-EOPNOTSUPP unless it sets CLUSTER_PIN_NODE).
- Limitation: A process holding hardware resources that cannot be proxied or
migrated (e.g., direct GPU rendering context on a consumer GPU without checkpoint
support) will fail migration with -EOPNOTSUPP. Such processes should set
CLUSTER_PIN_NODE to prevent migration attempts.
Migration Rollback Protocol
When migration fails after the freeze phase but before the destination commits, the system must restore the source to a runnable state without data loss. The rollback path depends on where the failure occurred:
- Destination failure: the destination node sends MigrationAbort { reason } to the source via the RDMA control channel. On receipt, the source:
  - Unfreezes all migrating tasks (reverses the SIGSTOP applied during the freeze phase, restoring each task to TASK_RUNNING in the local scheduler).
  - Restores pre-migration resource registrations: file descriptors, IPC handles, cgroup membership, and accelerator context bindings are reverted to their pre-migration state (the source retains these until commit confirmation).
  - Increments the per-node migration_failed counter (exposed via /sys/kernel/umka/cluster/stats/migration_failed). The cluster scheduler uses this counter to back off migration attempts to the failing destination (exponential backoff: 1 s, 2 s, 4 s, ..., capped at 60 s).
- RDMA link failure during transfer: the source detects an RDMA timeout (>500 ms with no completion on the migration QP) and assumes the destination is unreachable. The source:
  - Unfreezes all tasks locally (same procedure as case 1).
  - Marks the destination node as MIGRATION_SUSPECT in the cluster state (distinct from full node failure — the node may still be alive but the migration QP failed).
  The destination, if it is still running and has partially initialized the migrating process, runs migration_cleanup():
  - Releases any allocated address space (VMA teardown, page table deallocation).
  - Destroys partially-constructed task structs and scheduler entries.
  - Releases cgroup reservations.
  - Sends a MigrationCleanupComplete { pid } notification to the cluster coordinator so the process is not double-tracked.
- GPU context restore failure: if the destination GPU rejects the context restore (driver version mismatch detected after transfer, VRAM allocation failure, or hardware error during HMM reverse migration), the destination reports MigrationAbort { reason: GpuContextFailed } to the source. The source:
  - Restores the GPU context from the serialized GpuContextMigrationState snapshot. The snapshot is retained on the source until the destination sends a commit acknowledgment — it is never destroyed speculatively.
  - Unfreezes the AccelContext on the source (transitions from ACCEL_FROZEN back to ACCEL_RUNNING), allowing blocked accel_cmd_submit() calls to proceed locally.
  - If the source GPU is also unavailable (e.g., concurrent hardware failure), the affected tasks are terminated with SIGBUS (indicating an unrecoverable hardware fault). The SIGBUS si_code is set to BUS_MCEERR_AO (action optional) to indicate an asynchronous hardware error.
- Commit timeout: the source waits a maximum of 30 seconds for the destination's commit acknowledgment (MigrationCommitAck { pid }). If the timeout expires:
  - Assume split-brain: it is unsafe for either node to unilaterally resume the process, because the destination may have already committed and started executing.
  - Both source and destination invalidate the migrating process: the source delivers SIGKILL to its local copy, and the destination (if reachable) is instructed to SIGKILL its copy via the cluster coordinator.
  - The cluster coordinator reconciles the process state during the next heartbeat round: it queries all nodes for the process PID and ensures exactly zero or one instance exists. If both copies were killed, the process is gone and its parent receives SIGCHLD with CLD_KILLED status.
  - The migration_timeout counter is incremented and logged at KERN_WARNING level.
Phase-based rollback specification:
Migration proceeds through four distinct phases. Rollback is deterministic based on which phase the failure occurs in:
Phase 1 (Pre-transfer — state snapshot not yet sent):
- Failure: network error before state transfer begins.
- Rollback: Cancel migration. Source process resumes unchanged.
migration_state → Idle. No side effects on either node.
Phase 2 (State transfer in progress):
- Failure: network error mid-transfer, or destination ENOMEM.
- Rollback:
1. Destination: free all allocated VMAs, discard partial page transfers, release
the reserved PID slot.
2. Source: migration_state → Idle. Process was suspended during transfer —
task_wake() resumes it.
3. Source sends MIGRATION_ABORT message to destination; destination ACKs (or
times out after 5s, at which point source assumes cleanup complete).
4. If source crashes before sending ABORT: destination has a 30s
migration-dead timer that triggers self-cleanup on expiry.
Phase 3 (State transferred, switchover not yet committed):
- Failure: destination crashes after receiving full state, before the source
receives COMMIT ACK.
- Rollback:
1. Source detects missing COMMIT ACK after 5s timeout.
2. Source sends MIGRATION_QUERY to destination to check status.
3. If destination responds FAILED or is unreachable: source resumes the process
locally (process was in suspended state on source during transfer). Source
sends MIGRATION_ABORT to invalidate any partial destination state.
4. If destination responds COMMITTED (commit happened but ACK was lost): source
dequeues its local copy and sends MIGRATION_SOURCE_CLEANUP. Destination
owns the process.
Phase 4 (Committed on both sides):
- No rollback possible (migration complete). If the process needs to return, it is a new migration in the reverse direction.
Idempotency: All migration messages carry a 64-bit migration_id. Retransmits
are detected by migration_id deduplication at the destination; duplicate messages
are ACKed without processing.
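The migration_id deduplication rule might look like this in miniature. Note the unbounded HashSet is a simplification: a real kernel would expire ids once the corresponding migration commits or aborts.

```rust
use std::collections::HashSet;

/// Destination-side deduplication of retransmitted migration messages.
/// Every migration message carries a 64-bit migration_id; a retransmit
/// is acknowledged but not re-processed.
struct MigrationDedup {
    seen: HashSet<u64>,
}

#[derive(Debug, PartialEq)]
enum Disposition {
    /// First delivery: process the message normally (and ACK it).
    Process,
    /// Duplicate delivery: ACK without processing.
    AckOnly,
}

impl MigrationDedup {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// `insert` returns true only on first insertion, which cleanly
    /// distinguishes first delivery from any retransmit.
    fn on_message(&mut self, migration_id: u64) -> Disposition {
        if self.seen.insert(migration_id) {
            Disposition::Process
        } else {
            Disposition::AckOnly
        }
    }
}
```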
5.7.5 Capability-Gated Migration
Process migration requires capabilities:
pub const CLUSTER_MIGRATE: u32 = 0x0200; // Allow process migration to remote nodes
pub const CLUSTER_PIN_NODE: u32 = 0x0201; // Pin process to specific node (prevent migration)
pub const CLUSTER_ADMIN: u32 = 0x0202; // Cluster-wide scheduler administration
Processes without CLUSTER_MIGRATE are never migrated. Processes with CLUSTER_PIN_NODE
can pin themselves to their current node. Containers/cgroups can restrict which nodes
their processes can run on:
/sys/fs/cgroup/<group>/cluster.nodes
#   Allowed nodes for this cgroup: "0 1 2" or "all"
#   Default: current node only (no migration)
/sys/fs/cgroup/<group>/cluster.migrate
#   "auto" (kernel decides), "never" (pinned), "prefer" (hint)
#   Default: "never" (existing Linux behavior)
5.7.6 Reconciliation: Local vs Distributed Scheduling
The single-node scheduler (Section 6.1) optimizes for cache locality — keeping tasks on the same CPU core. The distributed scheduler may migrate tasks across nodes, destroying all cache state.
Design principle: Cross-node migration is a last resort, not a default action.
Two-level hierarchy with strict separation:
- Intra-node (Section 6.1): CFS/EEVDF handles all CPU-local decisions (~4ms tick). Entirely unaware of the cluster.
- Inter-node: ClusterScheduler runs every 10 seconds — deliberately slow because cross-node migration is 1000x more expensive than cross-CPU migration.
Migration threshold: A task migrates cross-node only when at least one of these holds:
- Source node CPU utilization exceeds 120% of the cluster average AND the target is below 80% (sustained for 2+ rebalance intervals), OR
- The task's working set is predominantly on the target node (>70% of pages, per DataAffinity), OR
- The task's affinity mask explicitly requests a different node.
Migration cost model:
migration_benefit = (source_load - target_load) * task_weight
migration_cost = cache_refill_time + network_transfer_time + tlb_flush_time
Migration proceeds only when migration_benefit > migration_cost * 1.5 (50% hysteresis
to prevent oscillation).
Warm-up penalty: After cross-node migration, the task's effective load weight is inflated 2x for 20 seconds, preventing immediate re-migration before cache state builds.
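The cost model above, as a minimal sketch. Names are illustrative, and all terms are assumed to be expressed in one common unit (e.g. microseconds) so the comparison is well defined:

```rust
/// Cross-node migration decision sketch: migrate only when the load-balance
/// benefit clearly outweighs the one-time migration cost, with the 50%
/// hysteresis margin from the text to prevent oscillation.
fn should_migrate(
    source_load: f64,      // load metric on the overloaded source node
    target_load: f64,      // load metric on the candidate target node
    task_weight: f64,      // scheduler weight of the candidate task
    cache_refill_us: f64,  // estimated cache refill cost after migration
    net_transfer_us: f64,  // estimated network transfer cost
    tlb_flush_us: f64,     // estimated TLB flush cost
) -> bool {
    let migration_benefit = (source_load - target_load) * task_weight;
    let migration_cost = cache_refill_us + net_transfer_us + tlb_flush_us;
    // 50% hysteresis: require benefit > 1.5x cost.
    migration_benefit > migration_cost * 1.5
}
```

The 1.5x margin means a task near the break-even point stays put, which, combined with the 2x warm-up weight inflation, damps ping-pong migrations.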
5.7.7 Cluster Placement Policy Expression Language (CPPEL)
The cluster scheduler's placement and migration decisions can be customized through a small policy expression language. Policies are set per-cgroup via:
/sys/fs/cgroup/<group>/cluster.placement_policy
A policy is a single expression that evaluates to an action — either migrate(<target>),
stay, or prefer(<node>). The expression language is intentionally minimal: it is not
a general-purpose scripting language, but a constrained DSL for expressing load-based
placement decisions. Expressions are evaluated by the global cluster scheduler every
balance_interval_ms (default 10 s) for each cgroup.
Syntax (EBNF sketch):
policy := expr
expr := cond_expr | action_expr | arith_expr
cond_expr := "if" expr "then" expr "else" expr
action_expr := "migrate" "(" target ")" | "stay" | "prefer" "(" node_ref ")"
target := node_ref | "nearest_idle" | "least_loaded"
node_ref := "node" "." identifier | INTEGER
arith_expr := arith_expr arith_op arith_expr | unary_op arith_expr | atom
atom := field_ref | FLOAT | INTEGER | "(" expr ")"
field_ref := "node" "." IDENTIFIER
| "cluster" "." IDENTIFIER
IDENTIFIER := [a-zA-Z_][a-zA-Z0-9_]*
INTEGER := [0-9]+
FLOAT := [0-9]+ "." [0-9]+
Available node fields:
| Field | Type | Description |
|---|---|---|
| node.load | float (0.0–1.0) | CPU utilization, normalized (0 = idle, 1 = fully loaded) |
| node.cpu_percent | int (0–100) | CPU utilization percentage |
| node.memory_pressure | int (0–100) | Memory pressure (0 = plenty free, 100 = swapping) |
| node.free_mem_mib | int | Free memory in mebibytes |
| node.accel_percent | int (0–100) | Accelerator utilization percentage |
| node.runnable_count | int | Number of runnable tasks |
| node.remote_fault_rate | int | Remote DSM page faults per second |
| cluster.node_count | int | Number of active nodes |
| cluster.avg_load | float (0.0–1.0) | Average CPU load across all nodes |
Arithmetic operators (operate on numeric values, return numeric):
| Operator | Syntax | Description |
|---|---|---|
| Add | a + b | Addition |
| Subtract | a - b | Subtraction |
| Multiply | a * b | Multiplication |
| Divide | a / b | Division (integer division if both operands are integers; divide-by-zero evaluates to 0) |
| Modulo | a % b | Remainder |
| Negate | -a | Unary negation |
Comparison operators (operate on numerics, return bool):
| Operator | Syntax | Description |
|---|---|---|
| Equal | a == b | True if a equals b |
| Not-equal | a != b | True if a does not equal b |
| Less-than | a < b | True if a is less than b |
| Less-or-equal | a <= b | True if a ≤ b |
| Greater-than | a > b | True if a is greater than b |
| Greater-or-equal | a >= b | True if a ≥ b |
Logical operators (operate on bools, short-circuit evaluation, return bool):
| Operator | Syntax | Description |
|---|---|---|
| Logical AND | a && b | True if both operands are true. Short-circuits: if a is false, b is not evaluated. |
| Logical OR | a \|\| b | True if either operand is true. Short-circuits: if a is true, b is not evaluated. |
| Logical NOT | !a | Unary prefix. True if a is false. |
Conditional expression:
if <cond> then <expr> else <expr>
Evaluates cond as a bool; returns the then branch if true, the else branch if false.
Both branches must have the same type (both numeric or both actions).
Type coercion: A numeric value used in a boolean context (e.g., as the condition of if)
is treated as false if zero (or 0.0) and true otherwise. This allows expressions such as
if node.remote_fault_rate then migrate(least_loaded) else stay.
Examples:
# Migrate if this node's CPU load exceeds 80% AND free memory is low.
if node.load > 0.8 && node.free_mem_mib < 512 then migrate(nearest_idle) else stay
# Migrate if this node is significantly more loaded than the cluster average,
# or if memory pressure is critical.
if node.load > cluster.avg_load * 1.2 || node.memory_pressure > 90
then migrate(least_loaded)
else stay
# Prefer the accelerator-lightest node when the local accelerator is saturated,
# but never migrate if memory pressure would be made worse.
if node.accel_percent > 95 && !( node.memory_pressure > 80 )
then prefer(nearest_idle)
else stay
Evaluation rules:
- Expressions are evaluated in the global cluster scheduler context (interrupt-safe, no blocking operations). Evaluation is always bounded — recursive if nesting is limited to depth 8 to prevent stack overflow. An expression exceeding this depth is rejected at policy-load time with -EINVAL.
- Field references reflect the most recent NodeLoad snapshot (updated every balance_interval_ms). Expressions cannot trigger RDMA reads or memory allocation.
- If evaluation produces an error (type mismatch, divide-by-zero, depth exceeded), the result is stay. The error is counted in the cgroup's cluster.policy_errors counter.
- Policies are parsed and type-checked when written to cluster.placement_policy. An invalid policy is rejected immediately; the previous policy remains in effect.
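A toy evaluator for a small CPPEL subset illustrates the numeric-to-bool coercion and the error-evaluates-to-stay rule. The AST shapes and helper names here are assumptions made for the sketch; the real parser and type-checker run at policy-write time.

```rust
/// Result of evaluating a policy (CPPEL actions, simplified).
#[derive(Clone, Debug, PartialEq)]
enum Action {
    Stay,
    Migrate(&'static str), // e.g. "nearest_idle", "least_loaded"
    Prefer(u32),           // node id
}

/// A subset of CPPEL expressions: numbers, field refs, >, &&, and if.
enum Expr {
    Num(f64),
    Field(&'static str), // e.g. "node.load"
    Gt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    If(Box<Expr>, Box<Action>, Box<Action>),
}

/// Field lookup against a tiny stand-in for the NodeLoad snapshot.
fn field(name: &str, load: f64, free_mib: f64) -> Option<f64> {
    match name {
        "node.load" => Some(load),
        "node.free_mem_mib" => Some(free_mib),
        _ => None, // unknown field: evaluation error
    }
}

/// Numeric evaluation; booleans are represented as 0.0 / 1.0, matching the
/// "nonzero is true" coercion rule. None signals an evaluation error.
fn eval_num(e: &Expr, load: f64, free: f64) -> Option<f64> {
    match e {
        Expr::Num(n) => Some(*n),
        Expr::Field(f) => field(f, load, free),
        Expr::Gt(a, b) => {
            Some((eval_num(a, load, free)? > eval_num(b, load, free)?) as u32 as f64)
        }
        Expr::And(a, b) => {
            // Short-circuit: if the left side is false, skip the right side.
            if eval_num(a, load, free)? == 0.0 {
                Some(0.0)
            } else {
                Some((eval_num(b, load, free)? != 0.0) as u32 as f64)
            }
        }
        Expr::If(..) => None, // actions are not numeric
    }
}

/// Top-level policy evaluation: any error yields Stay, per the rules above.
fn eval_policy(e: &Expr, load: f64, free: f64) -> Action {
    match e {
        Expr::If(cond, then_a, else_a) => match eval_num(cond, load, free) {
            Some(v) if v != 0.0 => (**then_a).clone(),
            Some(_) => (**else_a).clone(),
            None => Action::Stay, // evaluation error -> stay
        },
        _ => Action::Stay,
    }
}
```

For example, the policy `if node.load > 0.8 then migrate(nearest_idle) else stay` builds as an `Expr::If` over `Expr::Gt`, and an unknown field reference falls through to `Stay` rather than faulting the scheduler.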
5.8 Network-Portable Capabilities
5.8.1 Problem
UmkaOS's capability system (Section 8.1) uses opaque kernel-memory tokens validated locally. For distributed operation, capabilities must work across nodes: a process migrated from Node A to Node B should retain its capabilities. A remote RDMA operation should be authorized by a capability that the remote node can verify.
5.8.2 Design: Cryptographically-Signed Capabilities
// umka-core/src/cap/distributed.rs
/// Network-portable capability — split into a compact header (fits on the
/// kernel stack, ~64 bytes) and a separately-allocated signature payload.
///
/// Rationale: PQC signatures (ML-DSA-65: 3,309 bytes, hybrid: 3,373 bytes)
/// make the full capability ~3.6 KB, which must NOT be placed on the kernel
/// stack (kernel stacks are 8-16 KB). The split design keeps the hot path
/// (permission checks, expiry checks) on the stack via CapabilityHeader,
/// while the signature data lives in a slab-allocated CapabilitySignature.
#[repr(C)]
pub struct CapabilityHeader {
/// The local capability this was derived from.
pub object_id: ObjectId,
pub permissions: PermissionBits,
pub generation: u64,
pub constraints: CapConstraints,
// === Network portability extensions ===
/// Node that issued this capability.
pub issuer_node: NodeId,
/// Timestamp of issuance (cluster-relative wall clock,
/// synchronized via PTP/NTP, see Section 5.9.2.5).
pub issued_at_ns: u64,
/// Expiry timestamp (cluster-relative wall clock, see Section 5.9.2.5).
/// Capabilities MUST have bounded lifetime
/// (prevents stale capabilities after node failure).
/// Default: 5 minutes. Renewable while issuer is alive.
/// Expiry checking includes a 1ms grace period for clock skew.
pub expires_at_ns: u64,
/// Signature algorithm identifier.
/// Uses SignatureAlgorithm encoding (Section 8.5.2). u16 accommodates
/// hybrid algorithm IDs (0x0200+).
pub sig_algorithm: u16,
/// Pointer to the slab-allocated signature data.
/// The signature is allocated from a dedicated `cap_sig_slab` pool
/// (fixed-size 3,588-byte slots) to avoid general heap allocation
/// on the capability verification path.
pub signature: *const CapabilitySignature,
}
/// Signature payload for a distributed capability.
/// Allocated from a dedicated slab allocator (`cap_sig_slab`), NOT from
/// the general heap, to ensure bounded allocation latency on the
/// capability verification path.
///
/// Signature formats supported for distributed capabilities:
/// - Ed25519: 64 bytes (current default)
/// - ML-DSA-65: 3,309 bytes (PQC migration target, Section 8.5)
/// - Hybrid Ed25519 + ML-DSA-65: 3,373 bytes (transition mode)
///
/// SLH-DSA-128f is deliberately EXCLUDED from distributed capabilities.
/// Its 17,088-byte signatures would add ~17 KB to every cross-node
/// capability transfer, making it impractical for runtime operations
/// that occur at lock-acquisition frequency. SLH-DSA-128f is supported
/// for boot signatures (Section 8.2, KernelSignature/DriverSignature
/// structs with 17,408-byte buffers) where it is verified once at load
/// time. For distributed capabilities, ML-DSA-65 provides NIST PQC
/// security at 1/5th the signature size.
///
/// MAX_DISTRIBUTED_SIG_BYTES = 3,584 (3.5 KiB, accommodates hybrid
/// signatures with alignment headroom).
#[repr(C)]
pub struct CapabilitySignature {
/// Actual signature length in bytes.
pub sig_len: u16,
pub _pad: [u8; 2],
/// Signature data. Only sig_len bytes are meaningful.
pub data: [u8; 3584],
}
/// Convenience type combining header + signature for full capability operations.
/// Never placed on the stack as a whole — the header is on the stack and
/// the signature is accessed via pointer.
pub type DistributedCapability = (CapabilityHeader, *const CapabilitySignature);
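The stated slot size follows directly from the field layout: 2 (`sig_len`) + 2 (`_pad`) + 3,584 (`data`) = 3,588 bytes under `repr(C)`. A compile-checkable restatement of the struct:

```rust
/// Restatement of CapabilitySignature exactly as declared above, so the
/// 3,588-byte cap_sig_slab slot size can be verified mechanically.
#[repr(C)]
pub struct CapabilitySignature {
    /// Actual signature length in bytes.
    pub sig_len: u16,
    pub _pad: [u8; 2],
    /// Signature data. Only sig_len bytes are meaningful.
    pub data: [u8; 3584],
}
// size_of::<CapabilitySignature>() == 3588; alignment is 2 (from u16),
// so no trailing padding is inserted.
```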
Memory layout note: The `CapabilityHeader` is ~64 bytes and safe to place on the kernel stack. The `CapabilitySignature` is 3,588 bytes and allocated from a dedicated slab pool (`cap_sig_slab`, 3,588-byte slots) — never on the stack. For RDMA transmission, the capability is serialized using a compact wire format whose `sig_len` field determines how many signature bytes are included, reducing typical message size to ~200-400 bytes (Ed25519: 64-byte signature + ~100 bytes of header fields).
Slab Allocator and DoS Mitigation: The cap_sig_slab pool uses per-process quotas
to prevent a malicious process from exhausting kernel memory by allocating many signatures:
/// Per-process quota for capability signature allocations.
/// Tracked in the process's capability space (Section 8.1).
pub struct CapSignatureQuota {
/// Maximum signature slots this process may hold simultaneously.
/// Default: 1024 (approximately 3.6 MB worst case).
/// Configurable via prctl(PR_SET_CAP_QUOTA).
pub max_slots: u32,
/// Currently allocated slots for this process.
/// Incremented on successful slab allocation, decremented on free.
pub used_slots: AtomicU32,
}
/// Default quota: 1024 signatures (~3.6 MB per process).
pub const DEFAULT_CAP_SIGNATURE_QUOTA: u32 = 1024;
/// System-wide limit on total signature slab memory.
/// Default: 1 GB (approximately 290,000 signatures).
pub const CAP_SIG_SLAB_TOTAL_LIMIT: usize = 1024 * 1024 * 1024;
Allocation protocol:
- When a process derives a `DistributedCapability`, the kernel attempts to allocate a signature slot from `cap_sig_slab`.
- Before allocation, check `used_slots.load(Acquire) < max_slots`. If exceeded, fail with `EAGAIN` (transient) — the process must free existing capabilities first.
- After successful allocation, increment `used_slots` with `Release` ordering.
- On capability drop (explicit revoke, expiry, or process exit), decrement `used_slots` and return the slot to the slab.
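A sketch of the quota check using the `CapSignatureQuota` fields above. Note that a separate load-then-increment would be racy at the limit; this sketch folds the check and the increment into one atomic step with a compare-exchange loop (a sketch of the protocol, not the kernel's implementation):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub struct CapSignatureQuota {
    pub max_slots: u32,
    pub used_slots: AtomicU32,
}

pub const EAGAIN: i32 = 11;

/// Reserve one signature slot, or fail with EAGAIN when the quota is
/// exhausted. fetch_update retries the closure on contention, so two
/// racing threads cannot both pass the check at the limit.
pub fn try_reserve_slot(q: &CapSignatureQuota) -> Result<(), i32> {
    q.used_slots
        .fetch_update(Ordering::Release, Ordering::Acquire, |used| {
            if used < q.max_slots { Some(used + 1) } else { None }
        })
        .map(|_| ())
        .map_err(|_| -EAGAIN)
}

/// Release one slot on capability drop (revoke, expiry, process exit).
pub fn release_slot(q: &CapSignatureQuota) {
    q.used_slots.fetch_sub(1, Ordering::Release);
}
```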
Eager reclamation: When a process terminates (normal exit or killed), all its
CapabilitySignature slots are freed immediately. The capability space tracks
all signatures via an intrusive linked list per process — O(1) enumeration for cleanup.
Memory pressure handling: If the system-wide cap_sig_slab exceeds 80% of
CAP_SIG_SLAB_TOTAL_LIMIT, the kernel triggers a reclamation pass:
- Scan all processes for expired capabilities (where expires_at_ns < now).
- Free signatures for expired capabilities regardless of process quota.
- This ensures that expired capabilities don't accumulate under memory pressure.
Rationale for 1024-slot default: A typical process holds <100 distributed capabilities (file handles, memory regions, accelerator contexts). 1024 provides 10x headroom while limiting worst-case memory consumption to ~3.6 MB per process. For specialized workloads (distributed databases, HPC orchestrators), the admin can raise the limit via prctl.
5.8.3 Verification
Remote capability verification (any node):
1. Process on Node B presents a CapabilityHeader + CapabilitySignature to access
a resource. The header is on the stack; the signature is dereferenced from the slab.
2. Node B checks:
a. Signature valid? (verify with issuer_node's public key using the
algorithm specified in sig_algorithm — Ed25519 ~25-50 μs depending on hardware, ML-DSA-65 ~110 μs)
→ done once, then cached for lifetime of capability (keyed by object_id + generation)
b. Not expired? (compare expires_at_ns with current time)
→ ~10 ns (clock comparison)
c. Generation still valid? (check local revocation list)
→ ~100 ns (hash table lookup)
d. Permissions sufficient for requested operation?
→ ~10 ns (bitfield comparison)
3. If all checks pass: operation is authorized.
Total first-time verification: ~25-50 μs (Ed25519) or ~110 μs (ML-DSA-65).
Subsequent verifications (signature cached): ~200 ns.
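The four checks can be sketched as one function. The state layout (a hash-set signature cache keyed by `(object_id, generation)`, and a revocation map from object to minimum valid generation) is an assumption for illustration; the check order and cost notes follow the steps above:

```rust
use std::collections::{HashMap, HashSet};

type ObjectId = u64;
type PermissionBits = u32;

/// Per-node verification state (layout assumed for this sketch).
pub struct VerifierState {
    /// Signatures already verified, keyed by (object_id, generation).
    pub sig_cache: HashSet<(ObjectId, u64)>,
    /// Local revocation list: object_id → minimum valid generation.
    pub revoked: HashMap<ObjectId, u64>,
}

pub fn verify(
    state: &mut VerifierState,
    object_id: ObjectId,
    generation: u64,
    permissions: PermissionBits,
    expires_at_ns: u64,
    now_ns: u64,
    requested: PermissionBits,
    sig_valid: impl Fn() -> bool, // Ed25519 / ML-DSA-65 verify (slow path)
) -> bool {
    // (a) Signature: verified once, then cached by (object_id, generation).
    if !state.sig_cache.contains(&(object_id, generation)) {
        if !sig_valid() {
            return false;
        }
        state.sig_cache.insert((object_id, generation));
    }
    // (b) Expiry: plain clock comparison.
    if now_ns >= expires_at_ns {
        return false;
    }
    // (c) Generation: hash-table lookup against the revocation list.
    if let Some(&min_gen) = state.revoked.get(&object_id) {
        if generation < min_gen {
            return false;
        }
    }
    // (d) Permissions: every requested bit must be granted.
    requested & !permissions == 0
}
```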
5.8.4 Revocation
Capability revocation in a distributed system is harder than on a single node because capabilities may be cached on remote nodes that are temporarily unreachable.
Single-node revocation (Section 8.1.1): Generation-based. O(1), no table scanning.
Distributed revocation protocol:
- Initiation: Home node increments the object's generation in the home directory.
- Broadcast: Home node sends `CapRevoke { object_id, new_generation }` to all nodes holding capabilities for this object. The grant log tracks which nodes are affected — only those nodes receive the broadcast.
- Acknowledgment: Each remote node marks matching capabilities invalid and responds with `CapRevokeAck`.
- Stale rejection: If a remote node presents an old-generation capability to the home node before receiving the broadcast, the home node rejects it immediately.
- Partition handling: If a node is unreachable, revocation messages are queued in a durable per-node outbox. When the partition heals, queued revocations replay in order. During the partition, stale capabilities work for local cached reads but fail for any operation requiring home-node validation.
Consistency guarantee: Revocation is eventually consistent. After broadcast completes and all nodes acknowledge, the capability is universally invalid. During the broadcast window (typically <1ms on RDMA), stale capabilities may succeed for local cached reads but fail for home-node operations.
RDMA optimization: Broadcasts use RDMA SEND over reliable connected (RC) queue pairs for guaranteed delivery and ordering.
Interaction with expiry: Distributed capabilities have bounded lifetimes (default: 5 minutes). Revocation is the fast path; expiry is the safety net — even if revocation messages are permanently lost, no stale capability survives beyond its expiry window.
Scalability: The broadcast is O(N) where N = nodes holding capabilities for the specific object (typically 2-5), not the cluster size.
Revocation Propagation Protocol
Distributed capabilities include a creation_epoch: u64 field set at creation time.
Each cluster node maintains a RevocationLog — an RCU-protected append-only ring of
(CapId, epoch): (u64, u64) tuples, capacity 65536 entries.
Revocation procedure:
1. Revoking node calls `revoke_distributed_cap(cap_id)`:
   a. Marks the capability `REVOKED` in its local `CapSpace` (atomic flag).
   b. Appends `(cap_id, current_cluster_epoch)` to its `RevocationLog`.
   c. Increments the node's `revocation_sequence` counter.
2. On each cluster heartbeat (configurable, default 100 ms), the node broadcasts the `RevocationLog` delta (entries since last heartbeat) to all cluster peers via the cluster transport.
3. Receiving nodes: for each `(cap_id, epoch)` in the delta, mark the capability revoked in the local capability translation table. Any thread currently holding a `ValidatedCap` for this `cap_id` will find it invalid on the next `validate_cap()` call (epoch mismatch).
4. `validate_distributed_cap(cap)` checks: (a) the local revocation log, O(1) by CapId hash; then (b) if `cap.creation_epoch < node.revocation_epoch`, synchronously queries the originating node to confirm validity (bounded by cluster RTT).
Bounded revocation latency: heartbeat_period + max_cluster_RTT. Default: ~200 ms.
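The two-step check in `validate_distributed_cap` can be sketched as follows. The state layout (a hash set fed by heartbeat deltas plus a synchronized-epoch watermark) is an assumption for illustration:

```rust
use std::collections::HashSet;

type CapId = u64;

/// Per-node revocation view (layout assumed for this sketch).
pub struct NodeRevocationState {
    /// Local view of revoked capability IDs (fed by heartbeat deltas).
    pub revoked: HashSet<CapId>,
    /// Highest revocation epoch this node has fully synchronized.
    pub revocation_epoch: u64,
}

#[derive(Debug, PartialEq)]
pub enum ValidationResult {
    Valid,
    Revoked,
    /// The cap predates our synchronized view; confirm with the
    /// originating node (bounded by one cluster RTT).
    QueryOriginator,
}

pub fn validate_distributed_cap(
    state: &NodeRevocationState,
    cap_id: CapId,
    creation_epoch: u64,
) -> ValidationResult {
    // (a) O(1) local revocation-log lookup by CapId hash.
    if state.revoked.contains(&cap_id) {
        return ValidationResult::Revoked;
    }
    // (b) The cap predates our synchronized epoch: a revocation broadcast
    // may have been missed, so fall back to a synchronous query.
    if creation_epoch < state.revocation_epoch {
        return ValidationResult::QueryOriginator;
    }
    ValidationResult::Valid
}
```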
Fast path (CAP_FLAG_URGENT_REVOKE): Capabilities marked CAP_FLAG_URGENT_REVOKE
(security-critical: credential caps, RDMA master caps) trigger an immediate
REVOKE_URGENT cluster message bypassing heartbeat batching. Delivery latency:
~max_cluster_RTT (~2–5 μs on RDMA fabric, ~100 μs on TCP fallback).
5.8.5 Use Case: Remote GPU Access
Process on Node A wants to submit work to GPU on Node B:
1. Process has local capability: ACCEL_COMPUTE for GPU on Node B.
2. Kernel derives DistributedCapability, signs with Node A's key.
3. Kernel sends command submission + capability to Node B via RDMA.
4. Node B's kernel verifies capability:
- Signature valid (Node A's key)
- Not expired
- Not revoked
- Has ACCEL_COMPUTE permission for this GPU
5. Node B's AccelScheduler accepts the submission.
6. Completion notification sent back to Node A via RDMA.
The process on Node A uses the same AccelContext API
as for a local GPU. The kernel handles the distribution.
5.9.1 Distributed Device Fabric
5.9.1.1 Remote Device Access
The KABI vtable model (Section 10.5) naturally extends to remote devices. A device on Node B can be accessed from Node A through a proxy driver:
Local device access (existing):
Process → syscall → UmkaOS Core → KABI vtable call → Driver → Device
Remote device access (new):
Process → syscall → UmkaOS Core → KABI vtable call → RDMA Proxy Driver
→ RDMA transport → Node B UmkaOS Core → KABI vtable call → Driver → Device
The proxy driver implements the same KABI vtable as the real driver.
It translates vtable calls into RDMA messages. UmkaOS Core cannot tell
the difference between a local driver and a proxy driver.
5.9.1.2 Proxy Driver
// umka-core/src/distributed/proxy.rs
/// A proxy driver that forwards KABI vtable calls to a remote node.
/// Appears as a normal driver to the local kernel.
pub struct RemoteDeviceProxy {
/// Remote node hosting the actual device.
remote_node: NodeId,
/// Remote device ID (on the remote node's device registry).
remote_device_id: DeviceNodeId,
/// RDMA connection to remote node.
transport: KernelTransport,
/// Cached device info (refreshed periodically).
cached_info: AccelDeviceInfo,
/// Capability header for accessing the remote device.
/// The associated CapabilitySignature is slab-allocated and referenced
/// via the header's signature pointer.
remote_cap: CapabilityHeader,
/// Outstanding remote calls (for timeout/cancellation).
/// Uses a slab allocator (not a BTreeMap) for O(1) insertion and lookup by
/// call ID: the call ID encodes the slab slot index directly, so no tree
/// traversal is needed. The slab is bounded by the per-proxy in-flight limit
/// (max_inflight from NodeConnection), which is fixed at connection setup time.
pending_calls: SlabMap<PendingRemoteCall>,
}
This enables:
| Scenario | How It Works |
|---|---|
| Node A uses GPU on Node B | AccelBase proxy over RDMA. Command buffers sent via RDMA Write. |
| Node A reads NVMe on Node B | Block I/O proxy over RDMA (equivalent to NVMe-oF, but kernel-native). |
| Node A uses FPGA on Node B | AccelBase proxy. Same vtable, different device class. |
| Kubernetes GPU sharing | Kubelet on Node A schedules pod needing GPU. GPU is on Node B. Kernel-transparent. |
5.9.1.3 GPUDirect RDMA Across Nodes
For GPU-to-GPU communication across nodes (essential for distributed training):
Current state (NCCL on Linux):
GPU 0 (Node A) → PCIe → CPU RAM (Node A) → RDMA NIC → Network →
→ RDMA NIC → CPU RAM (Node B) → PCIe → GPU 0 (Node B)
Copies: 2 (GPU→CPU, CPU→GPU). CPU involvement: yes.
With GPUDirect RDMA (supported by Mellanox NICs + NVIDIA GPUs):
GPU 0 (Node A) → PCIe → RDMA NIC → Network →
→ RDMA NIC → PCIe → GPU 0 (Node B)
Copies: 0. CPU involvement: none.
UmkaOS integration:
- P2P DMA (Section 21.2) handles local GPU↔NIC path
- KernelTransport handles the RDMA portion
- RdmaDeviceVTable.register_device_mr() (Section 21.5.1.3) registers GPU VRAM for RDMA
- Combined path: GPU→NIC→Network→NIC→GPU, zero CPU copies
The kernel manages the IOMMU mappings on both ends, ensuring that:
- GPU VRAM is registered as an RDMA memory region
- RDMA NIC has IOMMU permission to DMA to/from GPU BAR
- Remote node's RDMA NIC has permission via remote rkey
- Capability system authorizes the cross-node GPU-to-GPU transfer
5.9 Failure Handling and Distributed Recovery
5.9.2 Split-Brain Detection and Recovery
5.9.2.1 Failure Model
Distributed systems have failure modes that single-machine kernels don't:
| Failure | Detection | Recovery |
|---|---|---|
| Node crash (power loss) | Heartbeat timeout (Suspect at 300ms, Dead at 1000ms per Section 5.9.2.2) | Reclaim resources, invalidate capabilities |
| Network partition | Heartbeat timeout + asymmetric | Split-brain protocol (Section 5.9.2.3) |
| RDMA NIC failure | Link-down event + failed RDMA ops | Fallback to TCP or isolate node |
| Slow node (Byzantine) | Heartbeat latency spike | Mark suspect, reduce trust |
| Storage failure | I/O error from block driver | FMA-managed (Section 19.1) |
5.9.2.2 Heartbeat Protocol
// umka-core/src/distributed/heartbeat.rs
pub struct HeartbeatConfig {
/// Heartbeat interval.
/// Default: 100ms (10 heartbeats/sec).
pub interval_ms: u32,
/// Miss count before marking node Suspect.
/// Default: 3 (300ms of silence → Suspect).
pub suspect_threshold: u32,
/// Miss count before marking node Dead.
/// Default: 10 (1000ms of silence → Dead).
pub dead_threshold: u32,
/// Heartbeat uses RDMA Send (two-sided) not RDMA Write (one-sided)
/// because we need the remote CPU to respond (proof of liveness).
pub transport: HeartbeatTransport,
}
// The heartbeat sender and receiver threads run at SCHED_FIFO priority
// (configurable, default priority 50) to avoid false suspect transitions
// caused by CPU saturation delaying heartbeat processing. For non-RDMA
// (TCP) clusters where network latency is higher and more variable, the
// recommended defaults are: interval_ms=500, suspect_threshold=3 (1500ms),
// dead_threshold=10 (5000ms).
/// Heartbeat message (sent via RDMA Send, 64 bytes).
#[repr(C)]
pub struct HeartbeatMessage {
/// Sender node ID.
pub node_id: NodeId, // 4 bytes
pub _pad0: u32, // 4 bytes — explicit alignment padding for generation
/// Monotonic generation (incremented on node restart).
/// If generation changes, it means the node rebooted.
pub generation: u64,
/// Sender's current timestamp (for clock skew estimation).
pub timestamp_ns: u64,
/// Sender's load summary (for cluster scheduler).
pub cpu_percent: u32,
pub memory_pressure: u32,
pub accel_percent: u32,
pub _pad1: u32, // 4 bytes — explicit alignment padding for membership_view
/// Sender's view of cluster membership (u64 bitfield, MAX_CLUSTER_NODES = 64).
pub membership_view: u64,
/// Explicit padding to bring total struct size to 64 bytes (one cache line).
/// Layout: node_id(4) + _pad0(4) + generation(8) + timestamp_ns(8) +
/// cpu_percent(4) + memory_pressure(4) + accel_percent(4) + _pad1(4) +
/// membership_view(8) + _pad(16) = 64.
pub _pad: [u8; 16],
}
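The one-cache-line claim can be checked mechanically. Restating the struct with a `NodeId` alias (assumed `u32`, per the "4 bytes" comment):

```rust
type NodeId = u32; // assumed: the layout comment gives node_id 4 bytes

/// Restatement of HeartbeatMessage exactly as declared above.
#[repr(C)]
pub struct HeartbeatMessage {
    pub node_id: NodeId,      // offset 0
    pub _pad0: u32,           // offset 4
    pub generation: u64,      // offset 8
    pub timestamp_ns: u64,    // offset 16
    pub cpu_percent: u32,     // offset 24
    pub memory_pressure: u32, // offset 28
    pub accel_percent: u32,   // offset 32
    pub _pad1: u32,           // offset 36
    pub membership_view: u64, // offset 40
    pub _pad: [u8; 16],       // offset 48..64
}
// size_of == 64 (one cache line), align == 8 (from the u64 fields);
// the explicit _pad0/_pad1 fields mean repr(C) inserts no hidden padding.
```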
5.9.2.3 Split-Brain Resolution
Network partition can cause split-brain: two groups of nodes each believe the other group has failed.
Strategy: Majority quorum + lease-based fencing tokens.
Cluster: Nodes {A, B, C, D, E} (5 nodes)
Partition: {A, B, C} can talk to each other. {D, E} can talk to each other.
Neither group can reach the other.
Resolution:
1. Each group counts its members.
2. {A, B, C} has 3/5 = majority. Continues operating.
3. {D, E} has 2/5 = minority. Enters read-only mode.
- No new DSM writes (avoids conflicting updates).
- No process migrations.
- Local workloads continue (processes already on D, E keep running).
- Remote page faults that target {A, B, C} get errors.
4. When partition heals:
- {D, E} rejoin the cluster.
- DSM directory entries are reconciled (version numbers resolve conflicts).
- Process affinity recalculated.
**Lease-Based Fencing Tokens**: To prevent split-brain ambiguity, UmkaOS uses
monotonically-increasing fencing tokens (lease epochs) for all cluster-wide
operations:
```rust
/// Fencing token for split-brain prevention.
/// Monotonically increments on every quorum leadership change.
/// Used to invalidate stale operations from partitioned minorities.
#[repr(C)]
pub struct FencingToken {
/// Monotonically increasing epoch.
/// Incremented when:
/// 1. Cluster membership changes (node join/leave)
/// 2. Quorum leader changes (leader failure)
/// 3. Admin triggers manual fencing (maintenance)
pub epoch: u64,
/// Node ID of the quorum leader that issued this token.
/// Used for token validation and leader identification.
pub leader_node: NodeId,
/// Timestamp when this token was issued (cluster-relative, PTP-synchronized).
/// Tokens expire after FENCING_TOKEN_TTL_NS (default: 30 seconds).
pub issued_at_ns: u64,
}
/// Token time-to-live: tokens older than this are considered invalid.
/// Must be longer than the worst-case partition detection time (heartbeat timeout).
pub const FENCING_TOKEN_TTL_NS: u64 = 30_000_000_000; // 30 seconds
```
Token propagation protocol:
- Leader election: When cluster membership changes, the new quorum leader (determined by the deterministic rules below) increments the fencing epoch and broadcasts the new `FencingToken` to all nodes in its partition.
- Token validation: Every DSM write and capability grant includes the sender's current fencing token. The receiver rejects operations with stale tokens (`token.epoch < local_token.epoch`) or expired tokens (`now - token.issued_at_ns > FENCING_TOKEN_TTL_NS`).
- Partition healing: When partitions heal, the surviving leader's fencing token takes precedence. Nodes from the minority partition discard their stale tokens and adopt the majority's token before resuming normal operations.
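The receiver-side validation rule reduces to two comparisons. A sketch using the `FencingToken` fields defined above (function name assumed):

```rust
type NodeId = u32;

pub const FENCING_TOKEN_TTL_NS: u64 = 30_000_000_000; // 30 seconds

#[repr(C)]
pub struct FencingToken {
    pub epoch: u64,
    pub leader_node: NodeId,
    pub issued_at_ns: u64,
}

/// Receiver-side check applied to every DSM write / capability grant.
pub fn token_accepted(incoming: &FencingToken, local: &FencingToken, now_ns: u64) -> bool {
    // Reject stale epochs (operations from a partitioned minority).
    if incoming.epoch < local.epoch {
        return false;
    }
    // Reject expired tokens; the TTL exceeds worst-case partition
    // detection time, so a fenced-off node cannot keep using old tokens.
    if now_ns.saturating_sub(incoming.issued_at_ns) > FENCING_TOKEN_TTL_NS {
        return false;
    }
    true
}
```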
Deterministic Tie-Breaking Rules: To ensure all nodes independently reach the same decision during a partition, UmkaOS applies the following rules in strict order:
Rule 1: Majority Wins — The partition with >50% of nodes wins.
- 5-node cluster: 3+ nodes wins.
- 6-node cluster: 4+ nodes wins.
Rule 2: Larger Node ID Sum Wins (for equal-size partitions) — If two partitions have exactly the same number of nodes (possible only in even-sized clusters), the partition with the higher total node ID sum wins.
- Example: In a 4-node cluster {A=1, B=2, C=3, D=4}, partitions {A, D} (sum=5) and {B, C} (sum=5) have equal sums, so the tie is resolved by rule 2a below.
- Rule 2a: Equal Sum → Lowest Minimum Node ID Wins: If sums are equal (as in the example), the partition whose lowest node ID is smaller wins. {A, D} has min=1; {B, C} has min=2. {A, D} wins.
- Together, Rules 2 and 2a provide a total order over all possible equal-size partitions without requiring external coordination.
Rule 3: External Witness (optional) — An admin-configured external witness (node ID 255, or a dedicated witness VM) can act as a tiebreaker. If configured, the partition containing the witness wins all ties. This overrides Rules 2/2a.
Rule precedence: Rule 1 > Rule 3 > Rule 2 > Rule 2a.
Implementation for 64-node clusters with fixed-size bitmasks:
/// Deterministically select the winning partition from two candidates.
/// Returns true if partition_a wins over partition_b.
///
/// # Preconditions
/// - Partitions are disjoint (no node in both)
/// - Neither partition is empty
pub fn partition_wins(partition_a: u64, partition_b: u64, cluster_size: u8,
witness_mask: u64) -> bool {
let count_a = partition_a.count_ones();
let count_b = partition_b.count_ones();
// Rule 1: Majority wins (strictly more than half).
// For odd clusters: 3 nodes → threshold 2, 5 → 3. Correct.
// For even clusters: 4 nodes → threshold 3, 6 → 4. A 50/50 split (2-2)
// does NOT meet threshold — neither partition wins, and the decision
// falls through to the tiebreakers (Rule 3: witness, then Rules 2/2a:
// node ID sum and minimum node ID).
let majority_threshold = (cluster_size as u32 / 2) + 1;
if count_a >= majority_threshold { return true; }
if count_b >= majority_threshold { return false; }
// Rule 3: External witness (if configured)
if witness_mask != 0 {
let witness_in_a = (partition_a & witness_mask) != 0;
let witness_in_b = (partition_b & witness_mask) != 0;
if witness_in_a && !witness_in_b { return true; }
if witness_in_b && !witness_in_a { return false; }
}
// Rule 2: Larger partition wins
if count_a != count_b { return count_a > count_b; }
// Rule 2a: Equal size → higher sum wins
let sum_a = node_id_sum(partition_a);
let sum_b = node_id_sum(partition_b);
if sum_a != sum_b { return sum_a > sum_b; }
// Rule 2a final tiebreaker: lower minimum node ID wins
// (guaranteed to differ for disjoint partitions of equal size)
min_node_id(partition_a) < min_node_id(partition_b)
}
/// Compute sum of node IDs in a partition bitmask.
fn node_id_sum(partition: u64) -> u32 {
let mut sum = 0u32;
let mut mask = partition;
while mask != 0 {
let lowest = mask.trailing_zeros();
sum += lowest + 1; // node IDs are 1-indexed
mask &= !(1u64 << lowest);
}
sum
}
/// Find the minimum node ID in a partition bitmask.
fn min_node_id(partition: u64) -> u32 {
partition.trailing_zeros() + 1 // node IDs are 1-indexed
}
Raft/Paxos for Critical Cluster Metadata: For cluster-critical state that cannot tolerate any inconsistency (security policies, node authentication keys, cluster configuration), UmkaOS uses Raft consensus rather than the simpler majority-quorum protocol:
- Raft scope: Security policy replication, node certificate revocation, cluster-wide configuration changes (adding/removing nodes).
- Quorum protocol scope: DSM page ownership, capability caching, heartbeat.
- Rationale: Raft provides linearizability but requires persistent logging and leader election overhead. Using it only for critical metadata avoids imposing Raft's cost on the high-frequency DSM operations.
The Raft implementation is a complete, persistent consensus engine, managed by the
ClusterMetadataReplicator service. Leader election uses the same deterministic
tie-breaking rules as the quorum protocol above to avoid ambiguity during concurrent
elections. What follows is the full Raft specification for UmkaOS.
Raft Log Persistence
A Raft log entry MUST be written to the local write-ahead log (WAL) and fsynced
before the leader sends AppendEntries RPCs to followers. This is the central
durability guarantee of Raft: if the leader crashes immediately after sending
AppendEntries, the entry can be recovered from the leader's WAL on restart and
re-sent. Without WAL persistence, a leader crash before followers acknowledge
would permanently lose the entry — violating the Raft invariant that committed
entries are durable.
/// One entry in the Raft log.
/// Written to the WAL before being sent to followers.
/// The payload immediately follows this header in the WAL file.
#[repr(C)]
pub struct RaftLogEntry {
/// Monotonically increasing log index, 1-based.
/// Gaps are not permitted: every index from 1 to commitIndex is present.
pub index: u64,
/// Raft term in which this entry was created.
/// Entries from earlier terms may be present in the log; they are only
/// considered committed once a current-term entry is committed (Raft §5.4.2).
pub term: u64,
/// CRC32C checksum of the payload bytes, for corruption detection.
/// Verified on read; mismatch triggers WAL recovery from peer snapshot.
pub crc: u32,
/// Length of the serialized `ClusterMetadataOp` payload in bytes.
pub payload_len: u32,
// Followed immediately in the file by `payload_len` bytes of
// serialized `ClusterMetadataOp`.
}
/// The operations that may be stored in a Raft log entry.
/// Each variant is a single atomic change to the cluster metadata state machine.
pub enum ClusterMetadataOp {
/// Transition a node's state (Joining → Active, Active → Leaving, etc.).
SetNodeState { node_id: NodeId, state: NodeState },
/// Revoke a distributed capability (invalidates all cached copies cluster-wide).
RevokeCapability { cap_id: CapabilityId, revoke_epoch: u64 },
/// Update the cluster-wide security policy (e.g., allowed cipher suites,
/// certificate revocation list additions).
UpdateSecurityPolicy { policy_blob: Vec<u8> },
/// Record a new node joining the cluster (triggers joint-consensus transition).
AddNodeToCluster { node_id: NodeId, addr: ClusterAddr, cert: Vec<u8> },
/// Remove a node from the cluster (triggers joint-consensus transition).
RemoveNodeFromCluster { node_id: NodeId },
/// Install a full metadata snapshot (sent via InstallSnapshot RPC;
/// replaces all preceding log entries on the recipient).
InstallSnapshot { snapshot_term: u64, snapshot_index: u64,
metadata_blob: Vec<u8> },
}
Write-ahead log layout:
/// Write-ahead log managing Raft log persistence on local storage.
/// Path: /System/Kernel/cluster/raft-wal (on the umkafs persistent volume).
/// Rotated when it exceeds WAL_ROTATE_THRESHOLD_BYTES (default: 64 MiB).
pub struct RaftWal {
/// Absolute path on the umkafs persistent volume.
pub path: &'static str,
/// Open file descriptor (opened O_WRONLY | O_APPEND | O_DSYNC).
/// O_DSYNC ensures that fsync on completion is not needed for the
/// data pages themselves; only the metadata (inode update) requires fsync.
pub fd: FileDescriptor,
/// Highest log index for which an fsync has completed.
/// Entries with index > fsynced_index are buffered but not yet durable.
pub fsynced_index: AtomicU64,
/// Mutex serializing concurrent appends (one writer at a time).
pub write_lock: Mutex<()>,
}
/// Rotate threshold: when the WAL exceeds this size, create a new segment.
pub const WAL_ROTATE_THRESHOLD_BYTES: u64 = 64 * 1024 * 1024; // 64 MiB
/// Fsync protocol for a log entry append:
/// 1. Lock write_lock.
/// 2. Serialize entry header + payload.
/// 3. write() to fd.
/// 4. fdatasync(fd) — flushes data to durable storage.
/// 5. Update fsynced_index.
/// 6. Unlock write_lock.
/// 7. ONLY THEN send AppendEntries RPC to followers.
/// Steps 3-4 take ~100-500 μs on NVMe (fsync latency dominates).
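The append protocol above can be sketched as a userspace analogue using `std::fs`. This is a sketch only: `RaftWal` itself lives on the umkafs volume and relies on `O_DSYNC`, which this stand-in replaces with an explicit `sync_data()`, and the checksum here is a placeholder rather than CRC32C:

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::sync::Mutex;

/// Userspace analogue of the WAL append protocol (steps 1-6).
pub struct WalFile {
    inner: Mutex<(File, u64 /* fsynced_index */)>,
}

impl WalFile {
    pub fn open(path: &str) -> std::io::Result<WalFile> {
        let f = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(WalFile { inner: Mutex::new((f, 0)) })
    }

    /// Append one entry and make it durable. Only after this returns may
    /// the caller send AppendEntries RPCs to followers (step 7).
    pub fn append(&self, index: u64, term: u64, payload: &[u8]) -> std::io::Result<u64> {
        let mut guard = self.inner.lock().unwrap(); // step 1: serialize writers
        let (file, fsynced) = &mut *guard;
        // Placeholder checksum; the real entry header carries CRC32C.
        let crc = payload.iter().fold(0u32, |a, &b| a.wrapping_mul(31).wrapping_add(b as u32));
        let mut buf = Vec::with_capacity(24 + payload.len()); // step 2: serialize
        buf.extend_from_slice(&index.to_le_bytes());
        buf.extend_from_slice(&term.to_le_bytes());
        buf.extend_from_slice(&crc.to_le_bytes());
        buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        buf.extend_from_slice(payload);
        file.write_all(&buf)?; // step 3: write()
        file.sync_data()?;     // step 4: fdatasync — entry is now durable
        *fsynced = index;      // step 5: advance fsynced_index
        Ok(*fsynced)           // step 6: lock released when guard drops
    }
}
```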
Raft State Machine
Each Raft participant (node) maintains the following persistent and volatile state:
/// Persistent state: must survive crashes. Written to WAL before responding to RPCs.
pub struct RaftPersistentState {
/// Latest term this node has seen. Initialized to 0, increases monotonically.
pub current_term: u64,
/// Node ID of the candidate this node voted for in `current_term`.
/// None if this node has not voted in the current term.
pub voted_for: Option<NodeId>,
/// Log entries (also persisted in the WAL; this mirrors the WAL index).
pub log: Vec<RaftLogEntry>,
}
/// Volatile state: rebuilt from persistent state after a crash.
pub struct RaftVolatileState {
/// Index of the highest log entry known to be committed.
pub commit_index: u64,
/// Index of the highest log entry applied to the metadata state machine.
pub last_applied: u64,
}
/// Volatile leader-only state: reinitialized after every election win.
pub struct RaftLeaderState {
/// For each follower: index of the next log entry to send.
/// Initialized to leader's last log index + 1.
pub next_index: HashMap<NodeId, u64>,
/// For each follower: index of the highest log entry known to be replicated.
/// Initialized to 0.
pub match_index: HashMap<NodeId, u64>,
}
Leader Election
State transitions:
Follower → Candidate (election timeout fires: no heartbeat received)
Candidate → Leader (receives votes from majority of cluster)
Candidate → Follower (receives AppendEntries from higher-term leader,
OR receives RequestVote with higher term)
Leader → Follower (receives RPC with higher term)
Timeouts (tunable via cluster configuration):
election_timeout: 150–300 ms, randomized per-node at startup and after
each timeout event to prevent synchronized split votes.
Randomization range: [base, base * 2], uniform distribution.
heartbeat_interval: 50 ms (leader sends empty AppendEntries to prevent
follower timeouts; must be << election_timeout).
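The randomization rule (uniform over [base, base * 2]) can be sketched as a pure function. The tiny xorshift stands in for a real PRNG; the seed (e.g., node ID mixed with boot time) is a caller-supplied assumption:

```rust
/// Randomized election timeout in [base_ms, 2 * base_ms], recomputed at
/// startup and after every timeout event to break synchronized split votes.
pub fn randomized_election_timeout_ms(base_ms: u64, seed: u64) -> u64 {
    // Minimal xorshift mixing — a stand-in for a real PRNG in this sketch.
    let mut x = seed | 1;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    // x % (base_ms + 1) lies in [0, base_ms], so the result spans
    // [base_ms, 2 * base_ms] as the timeout rule requires.
    base_ms + x % (base_ms + 1)
}
```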
Pre-vote extension (Raft §10.6, mandatory in UmkaOS): Before a follower starts
an election, it runs a PreVote phase. A PreVote RPC asks peers "would you vote
for me if I started an election?" without incrementing current_term. Only if the
node receives PreVote acknowledgements from a majority does it increment its term
and begin a real election. This prevents a partitioned node that reconnects from
disrupting a stable leader by forcing unnecessary term increments.
Leader election algorithm:
Pre-vote phase (Follower → Candidate):
1. Election timeout fires.
2. Send PreVote RPC to all peers:
PreVote { next_term: current_term + 1,
candidate_id: self_id,
last_log_index, last_log_term }
3. Peer grants pre-vote if:
- peer has not heard from a valid leader within election_timeout, AND
- candidate's log is at least as up-to-date as peer's log
(last_log_term > peer.last_log_term, OR
last_log_term == peer.last_log_term AND last_log_index >= peer.last_log_index)
4. If majority pre-vote grants received: proceed to real election.
5. If rejected: remain Follower, reset election timer.
Real election phase (Candidate):
1. Increment current_term (persist to WAL).
2. Vote for self (persist voted_for = self_id to WAL).
3. Reset election timer.
4. Send RequestVote RPC to all peers:
RequestVote { term: current_term,
candidate_id: self_id,
last_log_index, last_log_term }
5. Peer grants vote if:
- term >= peer.current_term, AND
- peer has not voted for a different node in this term, AND
- candidate's log is at least as up-to-date as peer's log (as above)
On grant: peer persists voted_for = candidate_id to WAL.
6. Candidate receives majority votes (> N/2 including self):
- Transition to Leader.
- Initialize next_index[peer] = last_log_index + 1 for all peers.
- Initialize match_index[peer] = 0 for all peers.
- Send initial heartbeat (empty AppendEntries) to all peers immediately.
7. Candidate receives AppendEntries with term >= current_term:
- Recognize new leader. Transition to Follower.
8. Election timer fires again before majority reached:
- Increment term, repeat from step 1 (another round).
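The "at least as up-to-date" comparison used by both the pre-vote grant (step 3) and the real vote grant (step 5) reduces to a term-then-index comparison. A sketch (function name assumed):

```rust
/// Raft log-recency check: a candidate's log is at least as up-to-date
/// as the voter's if its last term is higher, or terms are equal and its
/// last index is at least as large.
pub fn candidate_log_up_to_date(
    cand_last_term: u64, cand_last_index: u64,
    own_last_term: u64, own_last_index: u64,
) -> bool {
    cand_last_term > own_last_term
        || (cand_last_term == own_last_term && cand_last_index >= own_last_index)
}
```

This check is what prevents a node with a stale log from winning an election and overwriting committed entries.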
Log Replication
AppendEntries RPC (leader → follower):
Request:
term: u64 — leader's current term
leader_id: NodeId
prev_log_index: u64 — index of log entry immediately preceding new ones
prev_log_term: u64 — term of prev_log_index entry
entries: Vec<RaftLogEntry> — new entries to store (empty for heartbeat)
leader_commit: u64 — leader's current commitIndex
Response:
term: u64 — follower's current term (for leader to update itself)
success: bool — true if follower matched prevLogIndex/prevLogTerm
match_index: u64 — highest index follower has replicated (for leader tracking)
conflict_term: Option<u64> — for fast log rollback (see below)
conflict_index: Option<u64> — first index of conflict_term in follower's log
Replication sequence:
1. Client operation arrives at leader (or is forwarded to leader by a follower).
2. Leader appends new RaftLogEntry to local WAL:
a. Serialize entry to WAL with fsync (see WAL protocol above).
b. fsynced_index is updated after fdatasync completes.
c. Leader does NOT proceed to step 3 until fsync is complete.
3. Leader sends AppendEntries RPC to all followers in parallel (non-blocking send).
4. Follower receives AppendEntries:
a. If term < follower.current_term: respond {success: false, term: current_term}.
b. Reset election timer (valid leader heartbeat received).
c. Check log consistency: does log[prev_log_index].term == prev_log_term?
- No match: respond {success: false, conflict_term, conflict_index}
where conflict_term = log[prev_log_index].term and conflict_index is the
first index in follower's log with that term.
This allows the leader to skip over conflicting entries in one RPC (fast rollback).
d. Delete any existing entries conflicting with new entries (same index, different term).
e. Append new entries to local WAL with fsync.
f. Update commitIndex = min(leader_commit, last new entry index).
g. Apply newly committed entries to the metadata state machine.
h. Respond {success: true, match_index: last appended index}.
5. Leader receives majority success responses (> N/2 including itself):
a. Advance commitIndex to the highest index replicated by majority.
NOTE: Only advance commitIndex for entries from the current term
(Raft §5.4.2). Entries from earlier terms become committed implicitly
when a current-term entry is committed.
b. Apply newly committed entries to local metadata state machine.
c. Respond to the original client with success.
6. Leader piggybacks updated commitIndex in next heartbeat (or next AppendEntries).
Followers apply committed entries up to the new commitIndex.
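Step 5a's commit rule — majority replication, restricted to current-term entries — can be sketched as follows. The function name and the flat in-memory log representation are assumptions for illustration, not the kernel's actual data structures.

```rust
/// Sketch of step 5a: advance commitIndex to the highest index replicated on
/// a majority, counting only current-term entries (Raft §5.4.2). Earlier-term
/// entries up to that index become committed implicitly.
///
/// `log_terms[i]` holds the term of the log entry at 1-based index i+1.
/// `match_index` has one entry per follower; the leader itself always counts.
pub fn advance_commit_index(
    commit_index: u64,
    current_term: u64,
    log_terms: &[u64],
    match_index: &[u64],
) -> u64 {
    let n = match_index.len() + 1; // followers + leader
    let mut best = commit_index;
    for idx in (commit_index + 1)..=(log_terms.len() as u64) {
        // The leader has every entry; count followers whose match_index covers it.
        let replicas = 1 + match_index.iter().filter(|&&m| m >= idx).count();
        let majority = replicas > n / 2;
        let current_term_entry = log_terms[(idx - 1) as usize] == current_term;
        if majority && current_term_entry {
            best = idx;
        }
    }
    best
}
```

Note how a majority-replicated entry from an earlier term is never committed directly: it only commits once a current-term entry above it reaches a majority.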
Fast log rollback (optimization for log inconsistency after leader change):
Instead of decrementing next_index[follower] by 1 per RPC (O(N) RPCs for
long divergent logs), the follower returns conflict_term and conflict_index.
The leader sets next_index[follower] = conflict_index, skipping the entire
conflicting term in a single round-trip.
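The rollback exchange can be sketched as a pair of helpers — hypothetical names, with the follower's log represented as a flat array of terms for illustration:

```rust
/// Follower side: given a prev_log mismatch at prev_log_index (the entry
/// exists but its term differs from the leader's prev_log_term), report the
/// conflicting term and the first index of that term in the follower's log.
/// `log_terms[i]` holds the term of the entry at 1-based index i+1.
pub fn follower_conflict_hint(log_terms: &[u64], prev_log_index: u64) -> (u64, u64) {
    let conflict_term = log_terms[(prev_log_index - 1) as usize];
    let mut first = prev_log_index;
    // Walk backward to the first entry of the conflicting term.
    while first > 1 && log_terms[(first - 2) as usize] == conflict_term {
        first -= 1;
    }
    (conflict_term, first)
}

/// Leader side (simplest variant): retry replication from the follower's
/// first index of the conflicting term, skipping the divergent run of
/// entries in one round-trip instead of decrementing by one per RPC.
pub fn leader_rewind_next_index(conflict_index: u64) -> u64 {
    conflict_index
}
```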
Snapshot Protocol
When the Raft log grows beyond RAFT_SNAPSHOT_THRESHOLD entries, the leader
compresses the log by sending a full metadata state machine snapshot to followers
whose match_index is far behind. This bounds WAL storage regardless of cluster
uptime.
/// Threshold for triggering a snapshot: leader snapshots when log length exceeds this.
pub const RAFT_SNAPSHOT_THRESHOLD: usize = 1_000;
/// InstallSnapshot RPC (leader → lagging follower).
/// Sent when follower's match_index is behind the leader's snapshot point.
pub struct InstallSnapshotRpc {
/// Leader's current term.
pub term: u64,
pub leader_id: NodeId,
/// The log index covered by this snapshot (all entries up to this index
/// are included in the snapshot and need not be replicated individually).
pub last_included_index: u64,
/// The term of the entry at last_included_index.
pub last_included_term: u64,
/// Serialized full metadata state machine snapshot.
/// Includes: current cluster membership, all node states,
/// all security policies, and the revocation list.
pub data: Vec<u8>,
}
/// Follower response to InstallSnapshot:
/// 1. Discard log entries up to last_included_index (already covered by snapshot).
/// 2. Apply the snapshot to the local metadata state machine.
/// 3. Persist the snapshot to WAL (as a ClusterMetadataOp::InstallSnapshot entry).
/// 4. Update last_applied = last_included_index, commit_index = last_included_index.
/// 5. Respond with {term: current_term, success: true}.
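The follower-side response steps can be sketched as below. The `FollowerLog` struct is illustrative; applying the snapshot to the metadata state machine and persisting it to the WAL (steps 2-3) are elided.

```rust
/// Minimal follower log state for illustrating InstallSnapshot handling.
pub struct FollowerLog {
    pub first_index: u64,   // 1-based index of the first entry still in the log
    pub entries: Vec<u64>,  // entry terms only, for illustration
    pub commit_index: u64,
    pub last_applied: u64,
}

/// Steps 1 and 4 of the follower response: discard log entries covered by
/// the snapshot, then jump commit_index and last_applied to the snapshot point.
pub fn apply_install_snapshot(f: &mut FollowerLog, last_included_index: u64) {
    // 1. Discard entries up to last_included_index (covered by the snapshot).
    if last_included_index >= f.first_index {
        let drop = ((last_included_index - f.first_index + 1) as usize)
            .min(f.entries.len());
        f.entries.drain(..drop);
        f.first_index = last_included_index + 1;
    }
    // 2-3. Apply snapshot to the state machine and persist to WAL — elided.
    // 4. Indices never move backward.
    f.commit_index = f.commit_index.max(last_included_index);
    f.last_applied = f.last_applied.max(last_included_index);
}
```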
Safety Invariants
The following invariants hold at all times in a correctly operating Raft cluster. Implementation correctness is verified against these invariants:
1. Election Safety: At most one leader is elected per term. Guaranteed by the majority vote requirement — two candidates cannot each receive votes from > N/2 nodes in the same term.
2. Log Matching: If two log entries at the same index have the same term, then all entries in both logs up to that index are identical. Guaranteed by the `prevLogIndex`/`prevLogTerm` consistency check in `AppendEntries`.
3. Leader Completeness: If a log entry is committed in term T, it is present in the logs of all leaders elected in terms > T. Guaranteed by the "candidate log at least as up-to-date as voter's log" requirement in `RequestVote` combined with the majority overlap property.
4. State Machine Safety: If a node has applied a log entry at index I, no other node applies a different entry at index I. Follows from invariants 2 and 3.
5. Durability: No entry is sent in `AppendEntries` before the leader has fsynced that entry to its local WAL. Guaranteed by the WAL write-then-send protocol above.
6. Monotonic Terms: A node's `current_term` only increases, never decreases. On receiving any RPC with `term > current_term`, the node immediately updates `current_term` and transitions to Follower.
Cluster Membership Changes (Joint Consensus)
Cluster membership changes (adding or removing nodes) use the joint consensus mechanism (Raft §6) to prevent split-brain during the transition. Direct membership switches (old configuration → new configuration atomically) are not used because they create a window where two different majorities can exist simultaneously.
Joint consensus transition protocol:
Adding node D to cluster {A, B, C}:
1. Admin issues AddNode(D) to the leader.
2. Leader appends AddNodeToCluster(D) entry to the log (Phase 1 start).
During Phase 1, the cluster uses JOINT configuration: {A, B, C, D}.
A majority requires agreement from BOTH the old majority (A,B,C → 2 of 3)
AND the new majority (A,B,C,D → 3 of 4). Both majorities must agree for
any entry to be committed during the joint phase.
3. Leader replicates and commits the AddNodeToCluster(D) entry under joint consensus.
4. Leader appends a ClusterConfig(new={A,B,C,D}) entry (Phase 2 start).
5. Leader replicates and commits ClusterConfig(new) under the new configuration alone.
6. Transition complete: cluster is now {A, B, C, D} with normal majority rules.
Removing node B from cluster {A, B, C, D}:
Same protocol: leader adds RemoveNodeFromCluster(B), transitions through joint
phase {A,C,D} ∩ {A,B,C,D}, then commits ClusterConfig(new={A,C,D}).
Node B is notified via its last heartbeat that it has been removed.
Invariants during joint consensus:
- The joint phase is limited to one pending configuration change at a time.
A second AddNode/RemoveNode is rejected until the first completes.
- If the leader crashes during the joint phase, the new leader (elected under
joint consensus rules) inherits the joint configuration and completes the
transition by re-committing the new-configuration entry.
- A node added to the cluster starts as a non-voting member (log replication
proceeds but its vote is not counted) until it has caught up to within
MEMBERSHIP_CATCHUP_ROUNDS (default: 10) rounds of the leader's log.
This prevents a lagging new member from blocking commit progress.
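The joint-phase commit rule — agreement from both the old and the new majority — reduces to a small predicate. A minimal sketch with node IDs modeled as integers (function names are assumptions):

```rust
/// True if `acks` contains a strict majority of `config`.
fn majority_of(config: &[u32], acks: &[u32]) -> bool {
    let votes = config.iter().filter(|&n| acks.contains(n)).count();
    votes > config.len() / 2
}

/// Joint-consensus commit rule: during the joint phase an entry commits only
/// when BOTH the old and the new configuration have acknowledged it.
pub fn joint_commit_ok(old_cfg: &[u32], new_cfg: &[u32], acks: &[u32]) -> bool {
    majority_of(old_cfg, acks) && majority_of(new_cfg, acks)
}
```

For the AddNode(D) example above (old = {A,B,C}, new = {A,B,C,D}): two acks satisfy the old majority but not the new one, so the entry stays uncommitted until a third node from the new configuration acknowledges.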
Implementation Phasing
- Phase 2 (single-node degenerate case): The Raft log is present but has only one participant. All entries are immediately committed (no network round-trip needed). The WAL is still written and fsynced — this validates the WAL implementation before multi-node testing begins.
- Phase 3 (multi-node Raft): Full implementation as specified above. `ClusterMetadataReplicator` runs Raft across all cluster nodes. Security policy replication, capability revocation, and node state transitions all go through Raft.
- Phase 4 (joint consensus): Online cluster reconfiguration (add/remove nodes) using the joint consensus protocol. Required before UmkaOS clusters can be managed dynamically without downtime.
In-flight RDMA operations during partition: When a network partition occurs, RDMA
NICs report completion with error status for all in-flight operations. KernelTransport
translates these to TransportError::ConnectionLost. Callers (DSM fault handler, IPC
ring) retry once (in case of transient link flap), then return an error to the process:
SIGBUS for DSM page faults, EIO for IPC operations.
Minority partition DSM behavior: Nodes in the minority partition mark all DSM pages
as SUSPECT. Write accesses to SUSPECT pages are blocked (write-protect in page tables);
write attempts fault and return EAGAIN. Read accesses to SUSPECT pages continue to
return locally-cached data (preserving availability for read-heavy workloads during
brief partitions), but set a per-page stale_read flag. Applications can check whether
they have read potentially stale data via madvise(MADV_DSM_CHECK), which returns
EDSM_STALE if any SUSPECT pages were read since the last check. This ensures that
stale reads are never silent — applications that care about consistency can detect and
handle them, while applications that tolerate staleness continue without interruption.
When the partition heals, SUSPECT pages are reconciled with the majority partition's
directory and the SUSPECT marking is cleared.
Unreachable home node: Each DSM page has two home nodes: primary (determined by
hash(region, VA) % cluster_size) and backup (determined by hash_backup(region, VA)
% cluster_size using a different hash seed, guaranteed to differ from the primary).
Backup home node protocol:
1. Shadow directory maintenance: On every directory state change (ownership
transfer, reader set update), the primary home node sends the updated
DsmDirectoryEntry to the backup via RDMA Write to a pre-allocated shadow
directory region on the backup node. Each entry includes a generation counter
(incremented on every update) for consistency verification.
2. Consistency: The backup's shadow directory is write-only from the primary's
perspective — the backup never modifies it independently. The generation counter
ensures that stale writes (e.g., reordered RDMA Writes) are detected and
discarded. The backup compares the incoming generation counter against its last
seen value; only strictly incrementing updates are applied.
3. Failover: When the primary home node is declared Dead by the cluster
membership protocol (Section 5.9), the new home node is determined
deterministically — not by self-promotion. The membership protocol's
NodeDead event triggers re-evaluation of the same hashing rule used for
initial home placement (Section 5.6.3): hash(region, VA) % new_cluster_size
with the dead node removed from the membership set. Every survivor evaluates
the same rule over the same membership epoch, so all compute the same result:
a single deterministic new home, with no self-promotion.
If the deterministic new home happens to be the node already holding the
backup shadow directory, it promotes the shadow to primary. Otherwise, the
backup transfers its shadow entries to the computed new home. Directory
entries on the new home may be slightly stale (by at most one in-flight
update). The version counter in each DsmDirectoryEntry allows requestors
to detect and retry if they encounter a stale entry.
4. Partition healing: When the primary returns, the primary and backup reconcile
their directories: for each entry, the node with the higher generation counter
wins. The backup reverts to shadow mode after reconciliation completes.
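The deterministic re-homing rule can be sketched as follows. The mixing function, the collision handling between the two hash seeds, and all names are illustrative assumptions — the essential property is only that every survivor evaluates the same pure function over the same sorted membership set, so all agree without coordination.

```rust
/// Illustrative mixing function; any fixed hash works as long as every node
/// uses the same one (this is an assumption, not the kernel's actual hash).
fn home_hash(region_id: u64, va: u64, seed: u64) -> u64 {
    let mut x = region_id ^ va.rotate_left(17) ^ seed;
    x ^= x >> 33;
    x = x.wrapping_mul(0xff51_afd7_ed55_8ccd);
    x ^= x >> 33;
    x
}

/// `members` must be the identical, sorted membership set on every node
/// (same membership epoch), with any dead node already removed.
pub fn primary_home(region_id: u64, va: u64, members: &[u32]) -> u32 {
    members[(home_hash(region_id, va, 0) % members.len() as u64) as usize]
}

/// Backup home: different seed, with a deterministic step past the primary
/// if the two hashes happen to pick the same node (assumes >= 2 members).
pub fn backup_home(region_id: u64, va: u64, members: &[u32]) -> u32 {
    let primary = primary_home(region_id, va, members);
    let mut i = home_hash(region_id, va, 1) % members.len() as u64;
    if members[i as usize] == primary {
        i = (i + 1) % members.len() as u64;
    }
    members[i as usize]
}
```

On a `NodeDead` event, every survivor simply re-runs `primary_home` over the shrunken membership set — the failover target falls out of the same function used for initial placement.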
For even-numbered clusters (no strict majority possible):
- Admin designates a "tiebreaker" node (or external witness).
- Or: smaller-numbered-node-set wins (deterministic, no external dependency).

Caveat: the smaller-numbered-node-set heuristic is simple but has a known
weakness — if the lower-numbered nodes are physically co-located, a power event
affecting that rack consistently picks the wrong survivor set. For production
deployments, an external quorum device (a dedicated witness VM, or a third-site
arbitrator reached via RDMA or TCP heartbeat) is recommended. The heuristic
remains the default fallback when no external witness is configured.
#### Cluster Leadership Election
The Distributed Lock Manager requires a leader node for deadlock arbitration and
fencing token allocation. Election uses a **Fencing Bully Algorithm** — O(1) rounds,
no coordinator needed:
**Leadership invariant:** The node with the lexicographically greatest
`(active_epoch, node_id)` tuple that is reachable by a quorum (`⌊N/2⌋ + 1` nodes)
is the leader. `active_epoch` is the current `CLUSTER_ACTIVE` epoch counter;
`node_id` (u32, assigned at cluster join) breaks ties deterministically.
**Election trigger:** Any node that fails to receive a heartbeat from the current
leader for `2 × heartbeat_timeout` (default 200 ms) initiates election by
broadcasting `CLAIM_LEADERSHIP { epoch, node_id, fencing_token }`.
**Fencing token:** Monotonically increasing `u64` stored in durable cluster state
(`ClusterState` journal). A candidate that cannot present a `fencing_token` greater
than all known tokens is rejected. This prevents split-brain after network partition:
the partition with the stale fencing token cannot become leader.
**Quorum check:** Candidate waits for `ACKNOWLEDGE_LEADERSHIP` from `⌊N/2⌋ + 1`
nodes within `2 × heartbeat_timeout`. On quorum: candidate becomes leader, increments
fencing token, broadcasts `NEW_LEADER { fencing_token }` to all peers.
**Leader responsibilities:** Arbitration only — no lock grants. The leader resolves
deadlock cycles (via wait-for graph analysis) and allocates `FencingToken` values
for lock requests. Lock grants remain distributed across all nodes.
**Crash recovery:** On leader crash, election re-runs automatically. Lock requests
with `fencing_token < current_fencing_token` are rejected (stale, issued by
pre-crash leader). Requestors retry with a fresh token.
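On the receiving side, the fencing check reduces to a single monotonic comparison. A minimal sketch (the `FencedResource` type is hypothetical; durable storage of the high-water mark is elided):

```rust
/// Tracks the highest fencing token a resource (or the DLM grant path)
/// has observed. Durable persistence is elided in this sketch.
pub struct FencedResource {
    highest_token_seen: u64,
}

impl FencedResource {
    pub fn new() -> Self {
        Self { highest_token_seen: 0 }
    }

    /// Accepts a request iff its token is at least the newest token seen.
    /// A stale token (issued by a deposed, pre-crash leader) is rejected;
    /// the requestor must retry with a fresh token from the new leader.
    pub fn check(&mut self, fencing_token: u64) -> bool {
        if fencing_token < self.highest_token_seen {
            return false;
        }
        self.highest_token_seen = fencing_token;
        true
    }
}
```

Because the token is monotonically increasing and durably stored, a partitioned old leader can never mint a token that outranks the current leader's — this is what closes the split-brain window.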
#### 5.9.2.4 DSM Recovery After Node Failure
Node B fails (holds exclusive ownership of some DSM pages):
- Heartbeat timeout → Node B marked Dead.
- For each DSM page where B was owner:
  a. Home node (determined by hash) still has the directory entry.
  b. If B was SharedOwner: readers still have valid copies.
     → Promote one reader to owner (pick the reader closest to the home node).
  c. If B was Exclusive: page data is LOST (the only copy was on B).
     → Mark page as "lost." Processes faulting on it get SIGBUS.
     → Application must handle this (checkpoint/restart).
- For each DSM page where B was a reader:
  a. Simply remove B from the reader set. No data loss.
- Capabilities issued by B expire naturally (bounded lifetime). Remote nodes stop accepting B-signed capabilities immediately.
Mitigation for exclusive page loss:
- DSM regions can be created with DSM_REPLICATE flag (replication_factor = 2).
- Every write to an exclusive page is mirrored to a backup node.
- On failure: backup is promoted to owner. No data loss.
- Cost: 2x write bandwidth for replicated regions.
**Design rationale — why replication is NOT the default:**
DSM is a **performance optimization**, not a durability mechanism. The default
(`replication_factor = 1`) is correct for the typical DSM use case:
1. **Ephemeral data**: Caches, temporary buffers, computational scratch space. Loss on
node failure is acceptable — the data can be recomputed or reloaded from source.
2. **Read-mostly workloads**: Configuration data, reference tables, shared code pages.
These are typically backed by persistent storage; losing the in-memory copy just
means reloading from disk.
3. **Application-managed durability**: Databases, distributed file systems, and message
queues implement their own replication and checkpointing. Adding DSM-level
replication would be redundant and wasteful.
4. **Performance sensitivity**: The 2x bandwidth cost and ~15% latency overhead of
synchronous replication would penalize all DSM users, even those who don't need it.
**When to enable replication:**
- DSM regions holding irreplaceable data without application-level persistence
- Workloads where recomputation cost exceeds replication cost
- Environments where node failure is frequent enough to justify the overhead
**Durability is the application's responsibility**: Just as applications using `mmap()`
or `malloc()` must implement their own persistence, DSM users must decide whether
their data warrants replication. The kernel provides the mechanism (`DSM_REPLICATE`);
the policy is left to the application.
#### 5.9.2.5 Clock Synchronization
Distributed capabilities ([Section 5.8](#58-network-portable-capabilities)) and DSM timeouts rely on nodes having
synchronized clocks. UmkaOS requires PTP (IEEE 1588 Precision Time Protocol) as
the primary clock synchronization mechanism:
- **PTP grandmaster**: One node (or a dedicated PTP appliance) serves as the time
reference. All other nodes synchronize to it via hardware PTP timestamping on the
RDMA NIC.
- **Expected accuracy**: <1 μs with hardware PTP (typical for modern RDMA NICs with
PTP hardware timestamping support).
- **NTP fallback**: If PTP is not available (no hardware support), NTP is used as a
fallback. Expected accuracy: 1-10 ms. When using NTP, capability expiry grace period
is increased to 100ms (from the default 1ms with PTP).
- **Maximum acceptable skew**: 1ms for PTP deployments. Capability expiry includes a
1ms grace period to account for this skew. Nodes with clock skew exceeding 10ms
trigger an FMA alert (reported as a `HealthEvent` with
`class: HealthEventClass::Network`, event code `CLOCK_SKEW_EXCEEDED` — clock skew is
a network-level health event, not its own event class).
- **Clock skew estimation**: Each heartbeat message ([Section 5.9.2.2](#5922-heartbeat-protocol)) includes the
  sender's timestamp. The receiver estimates one-way clock skew as
  `(remote_ts - local_ts + RTT/2)` — the heartbeat left the sender roughly
  RTT/2 before local receipt, so a perfectly synchronized pair yields zero.
  Persistent skew > 1ms triggers a PTP resynchronization.
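The heartbeat-based estimate can be sketched as below, assuming symmetric network delay (one-way ≈ RTT/2). Function names and the nanosecond representation are assumptions for illustration:

```rust
/// Estimate how far the sender's clock runs ahead of the receiver's.
/// The heartbeat was sent roughly RTT/2 before local_recv_ts_ns, so
/// skew = remote_ts - (local_recv_ts - RTT/2). Assumes symmetric delay.
pub fn estimate_skew_ns(remote_ts_ns: i64, local_recv_ts_ns: i64, rtt_ns: i64) -> i64 {
    remote_ts_ns - local_recv_ts_ns + rtt_ns / 2
}

/// Persistent skew beyond 1 ms triggers a PTP resynchronization.
pub const MAX_SKEW_NS: i64 = 1_000_000;

pub fn needs_resync(skew_ns: i64) -> bool {
    skew_ns.abs() > MAX_SKEW_NS
}
```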
---
## 5.10 CXL 3.0 Fabric Integration
#### 5.10.1 Why CXL Changes Everything
CXL (Compute Express Link) 3.0 provides **hardware-coherent** shared memory across
a PCIe fabric. Unlike RDMA (which requires software coherence protocols), CXL memory
is accessed via normal CPU load/store instructions with hardware cache coherence.
> **Hardware availability caveat**: CXL 3.0 hardware with full shared memory semantics
> (back-invalidate snooping for multi-host coherence) is not yet commercially available
> as of 2025. UmkaOS's CXL support targets CXL 2.0 Type 3 devices (available in Sapphire
> Rapids, Genoa) with software-managed coherence. CXL 3.0 back-invalidate snooping will
> be supported when hardware becomes available. The CXL 3.0 sections below describe the
> target architecture; implementation is gated on hardware availability.
Memory access latency spectrum:
Local DRAM (same socket):    ~50-80 ns     (load/store)
Local DRAM (cross-socket):   ~100-150 ns   (load/store, QPI/UPI)
CXL 2.0 attached memory:     ~200-400 ns   (load/store, PCIe + CXL)
CXL 3.0 shared memory pool:  ~200-500 ns   (load/store, coherent)
RDMA:                        ~2000-5000 ns (explicit transfer)
NVMe SSD:                    ~10000 ns     (block I/O)
CXL 3.0 shared memory is 5-25x faster than RDMA because:
1. No software protocol (hardware coherence via CXL.cache, CXL.mem)
2. Cache-line granularity (64 bytes, not 4KB pages)
3. No memory registration overhead
4. CPU load/store instructions, not DMA engine
#### 5.10.2 Design: CXL as a First-Class Memory Tier
`PageLocation` (defined canonically in [Section 21.2.1.5](21-accelerators.md#21215-page-location-tracking) of 21-accelerators.md,
reproduced in [Section 5.6.6](#566-extending-pagelocationtracker) above) is reused here for distributed page placement
decisions. See [Section 5.6.6](#566-extending-pagelocationtracker) for the full enum definition including RDMA-specific
variants.
```rust
// Extend NumaNodeType (from Section 21.2.1.3)
pub enum NumaNodeType {
CpuMemory,
AcceleratorMemory { /* ... */ },
CxlMemory {
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
},
// NEW: CXL 3.0 shared memory pool (visible to multiple nodes)
CxlSharedPool {
/// Latency from this CPU to the shared pool.
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
/// Nodes that share this pool (u64 bitfield, MAX_CLUSTER_NODES = 64).
sharing_nodes: u64,
/// Hardware coherence protocol version.
coherence_version: CxlCoherenceVersion,
},
}
#[repr(u32)]
pub enum CxlCoherenceVersion {
/// CXL 2.0: pooled memory, no hardware coherence between hosts.
/// Kernel manages coherence via software protocol (like DSM Section 5.6).
Cxl20Pooled = 0,
/// CXL 3.0: hardware-coherent shared memory.
/// CPU cache coherence protocol extended across CXL fabric.
/// No software coherence needed — hardware handles it.
Cxl30Coherent = 1,
}
```

#### 5.10.3 CXL + RDMA Hybrid
In a realistic datacenter, both CXL and RDMA will coexist:
Rack-level (CXL 3.0 fabric, ~200-500 ns):
┌─────────┐ CXL ┌─────────┐ CXL ┌─────────┐
│ Node 0 │◄────────►│ CXL │◄────────►│ Node 1 │
│ CPU+GPU │ │ Switch │ │ CPU+GPU │
└─────────┘ │ +Memory │ └─────────┘
│ Pool │
└────┬────┘
│ CXL
┌────┴────┐
│ Node 2 │
│ CPU+GPU │
└─────────┘
Cross-rack (RDMA, ~2-5 μs):
Rack 0 ◄──── 400GbE RDMA ────► Rack 1
Memory tier ordering within this topology:
Tier 1: Local DRAM (~80 ns)
Tier 2: CXL shared pool (~300 ns) ← same rack
Tier 3: GPU VRAM (~500 ns)
Tier 4: Compressed (~1-2 μs to decompress)
Tier 5: Remote DRAM via RDMA (~3 μs) ← cross rack
Tier 6: Local NVMe (~12 μs)
Tier 7: Remote NVMe via NVMe-oF/RDMA ← cross rack
The kernel detects CXL and RDMA links automatically (device registry)
and builds the distance matrix accordingly. No manual configuration
of tier ordering — the measured latencies determine placement policy.
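Deriving tier order from measurement rather than configuration can be sketched as a simple sort over measured latencies (the tier names and numbers here are illustrative, taken from the table above):

```rust
/// Order memory tiers by measured access latency, fastest first.
/// No manual tier ranking: the measurement alone decides placement priority.
pub fn order_tiers(mut tiers: Vec<(&'static str, u64 /* latency_ns */)>) -> Vec<&'static str> {
    tiers.sort_by_key(|&(_, latency_ns)| latency_ns);
    tiers.into_iter().map(|(name, _)| name).collect()
}
```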
#### 5.10.4 CXL Shared Memory for DSM
When CXL 3.0 hardware-coherent shared memory is available, the DSM protocol (Section 5.6) simplifies dramatically:
DSM over RDMA (software coherence):
- Page fault → directory lookup → ownership transfer → RDMA page transfer
- ~10-25 μs per fault
- Software TLB invalidation protocol
DSM over CXL 3.0 (hardware coherence):
- Map CXL shared pool pages into process address space
- CPU load/store works directly (hardware coherence)
- No page faults for coherence (hardware handles cache-line invalidation)
- ~200-500 ns access latency (same as accessing CXL memory)
- Software DSM protocol not needed for CXL-connected nodes
The kernel uses CXL shared memory when available (intra-rack),
and falls back to RDMA-based DSM for cross-rack communication.
Best transport is selected automatically per page.
DSM redundancy and DLM: For node pairs connected via a CXL 3.0 fabric, the
software DSM page-ownership state machine (Section 5.6.2) is redundant — the hardware
handles cache-line-granularity coherence without software page faults or ownership
transfers. UmkaOS's DSM routing layer uses CxlPool transport (load/store) for these
pairs and skips the full coherence protocol.
However, the Distributed Lock Manager (DLM, Section 14.6) remains fully required even
with CXL 3.0. Hardware cache coherence does not provide mutual exclusion semantics
for arbitrary data structures (spinlocks, reader-writer locks, cross-node transactions).
DLM lock acquisition (LOCK_ACQUIRE, atomic CAS on lock words) continues to operate
as specified — CXL just makes the lock words accessible via load/store rather than
RDMA, which reduces lock round-trip latency but does not eliminate the need for the
protocol.
CXL 3.0 node pair — what changes vs. RDMA:
DSM page faults: eliminated (hardware coherence)
DSM ownership xfer: eliminated (no protocol needed)
DLM lock acquire: unchanged in protocol, load/store instead of RDMA
DLM deadlock detect: unchanged
Heartbeat / membership: unchanged
Capability tokens: unchanged
#### 5.10.5 CXL Devices as UmkaOS Peers
The three CXL device types map to distinct UmkaOS peer operating models. The classification determines Mode A vs. Mode B, peer role (compute vs. memory-manager), and crash recovery behavior.
#### 5.10.5.1 Type 1: Coherent Compute (CXL.cache only)
Type 1 devices have their own compute and a cache that participates in the host CPU's
coherency domain via CXL.cache. They have no device-managed DRAM visible via CXL.
UmkaOS operating model:
- Mode B peer (hardware-coherent transport) — CXL.cache IS the cache coherence
protocol. Ring buffers placed in the shared region are coherent by hardware without
any explicit cache flush or ownership protocol in software.
- Full compute peer — runs UmkaOS or a firmware shim (Paths A/B). Participates in
cluster membership, heartbeat, DLM, and DSM as a regular cluster node.
- FLR cache flush requirement — see Section 5.3.2. On FLR the device must flush all
CXL.cache dirty lines to host memory. Host waits for FLR completion before
accessing the shared region.
#### 5.10.5.2 Type 2: Coherent Compute + Device Memory (CXL.cache + CXL.mem)
Type 2 is the richest CXL peer type. The device has both a coherent cache
(CXL.cache) AND device-local DRAM accessible to the host via load/store
(CXL.mem).
UmkaOS operating model:
- Mode B peer with bidirectional zero-copy — coherent in both directions:
device cache participates in host coherency domain (CXL.cache); host CPU can
directly load/store device DRAM as a NUMA memory tier (CXL.mem). Neither
direction requires DMA or explicit transfer.
- Memory tier AND compute peer — the device's DRAM is registered as
NumaNodeType::CxlMemory (or CxlSharedPool if multi-host). The device's
cores run UmkaOS and execute workloads. Same ClusterNode participates in both
memory placement decisions and compute scheduling.
- Ring buffers — can live in either host DRAM (coherent via CXL.cache) or
device DRAM (coherent via CXL.mem + host load/store). Placement policy: ring
buffers go in the lower-latency region as measured at runtime.
- FLR cache flush — same requirement as Type 1.
- Examples: future CXL-attached GPUs with HBM, CXL AI accelerators.
#### 5.10.5.3 Type 3: Memory Expander (CXL.mem only, minimal compute)
Type 3 devices provide DRAM accessible to the host via CXL.mem. They have no
accelerator compute from CXL's perspective, though they may have a small management
processor (typically ARM Cortex-A5x or similar).
UmkaOS operating model — memory-manager peer, not compute peer:
- The management processor runs a minimal UmkaOS instance (Path A via AArch64 build target) with a single responsibility: managing the DRAM pool.
- What it does: monitors pool health, handles tiering decisions (hot/cold page migration across CXL memory sub-regions), manages optional compression or encryption, retires bad pages, reports ECC errors and media errors to the host cluster membership layer.
- What it does NOT do: run application workloads, participate in DSM ownership transfers (the pool appears as a NUMA node to the host, not as a peer's memory), or require the full cluster protocol. It uses a lightweight subset: heartbeat + health reporting + pool management messages.
- No DLM, no DSM page protocol: the Type 3 peer does not own pages in the distributed sense. The host NUMA allocator owns pages in the CXL pool; the management processor just monitors and maintains the physical medium.
CXL 3.0 multi-host pool: when a CXL switch connects multiple hosts to the same Type 3 pool, the management processor becomes a shared pool coordinator:
- Arbitrates allocation between hosts (each host has a capacity reservation)
- Reports pool-wide health events to all connected hosts
- Does not arbitrate cache coherence (hardware does that via CXL.cache back-invalidate)
- Does not run DLM for cross-host locking (UmkaOS DLM handles that over CXL.cache)
Crash recovery: see Section 5.3.5 CXL Type 3 section. DRAM persists; management
layer is lost until management processor recovers. Pool transitions to
ManagementFaulted state; uncompressed/unencrypted pages remain accessible.
#### 5.10.5.4 CXL Switch as Fabric Manager Node
CXL 3.0 introduces intelligent CXL switches that route traffic between multiple hosts and multiple memory/compute devices. A CXL switch with embedded compute (ARM or RISC-V management processor) can run UmkaOS as a fabric manager node:
- Topology discovery: the switch sees all CXL endpoints and can report the full fabric topology (which hosts can reach which memory pools, with latency measurements) to the UmkaOS cluster membership layer.
- Routing policy: the switch can enforce traffic shaping, QoS, or bandwidth partitioning between hosts sharing the same CXL fabric.
- No data plane participation: the switch does not own memory pages, run workloads, or participate in DLM. It is a topology and routing oracle.
- This is a future capability: CXL 3.0 switch hardware with embedded compute is not yet commercially available (2025). The architecture is ready; the FabricManager node type and associated cluster messages are deferred to Phase 5+ (Section 5.11.4).
#### 5.10.5.5 Summary Table
| CXL Type | Compute | Memory | UmkaOS Peer Role | Transport | Crash Model |
|---|---|---|---|---|---|
| Type 1 | Yes (CXL.cache) | No | Full compute peer | Mode B | FLR flush + standard recovery |
| Type 2 | Yes (CXL.cache) | Yes (CXL.mem) | Compute + memory tier peer | Mode B | FLR flush + standard recovery |
| Type 3 | No (mgmt only) | Yes (CXL.mem) | Memory-manager peer | Heartbeat + pool msgs | ManagementFaulted; DRAM persists |
| CXL Switch | Yes (mgmt only) | No | Fabric manager (future) | Topology msgs | Fallback to static topology |
## 5.11 Compatibility, Integration, and Phasing
#### 5.11.1 Linux Compatibility and MPI Integration
#### 5.11.1.1 Existing RDMA Applications (Unchanged)
All existing Linux RDMA applications work through the compatibility layer:
| Application | Linux Interface | UmkaOS Path |
|---|---|---|
| MPI (OpenMPI, MPICH) | libibverbs / libfabric | umka-compat RDMA compat layer |
| NCCL (multi-node GPU) | libibverbs + GDR | RDMA compat + GPUDirect RDMA |
| DPDK | ibverbs / EFA | RDMA compat |
| Ceph (msgr2) | RDMA transport | RDMA compat |
| Spark (RDMA shuffle) | libfabric | RDMA compat |
| Redis (RDMA) | ibverbs | RDMA compat |
The compatibility layer (umka-compat/src/rdma/) translates Linux verbs API calls
to KABI RdmaDeviceVTable calls. Binary libibverbs.so works without recompilation.
#### 5.11.1.2 MPI Optimization Opportunities
MPI implementations can opt into UmkaOS-specific features for better performance:
Standard MPI on UmkaOS (no changes, works today):
MPI_Send/MPI_Recv → libibverbs → RDMA NIC
MPI_Win_create (shared memory window) → mmap + RDMA
Performance: same as on Linux
MPI on UmkaOS with DSM (opt-in, future):
MPI_Win_create → UMKA_SHM_MAKE_DISTRIBUTED
MPI_Put/MPI_Get → direct load/store on DSM region
Kernel handles page migration transparently.
No explicit RDMA operations needed by MPI implementation.
Benefit: MPI one-sided operations become load/store.
Latency: the first access pays a ~3-5 μs page fault (comparable to the
~2-5 μs RDMA verbs path), then ~50-150 ns for subsequent accesses (page is local).
For iterative algorithms (most HPC): working set becomes local after
first iteration. Subsequent iterations run at local memory speed.
#### 5.11.1.3 Kubernetes / Container Integration
/sys/fs/cgroup/<group>/cluster.nodes
  # Which nodes this cgroup's processes can run on
  # "0 1 2 3" or "all"
/sys/fs/cgroup/<group>/cluster.memory.remote.max
  # Maximum remote memory for this cgroup
/sys/fs/cgroup/<group>/cluster.accel.devices
  # Allowed accelerators (including remote)
  # "node0:gpu0 node0:gpu1 node1:gpu0"
Kubernetes integration:
- kubelet reads cluster topology from /sys/kernel/umka/cluster/
- Device plugin exposes remote GPUs as schedulable resources
- Pod spec: resources.limits: { umka.dev/gpu: 4, umka.dev/remote-gpu: 4 }
- kubelet sets cgroup constraints; kernel handles placement
5.11.1.4 UmkaOS-Specific Cluster Interfaces
/sys/kernel/umka/cluster/
nodes # List of cluster nodes with state
topology # Cluster distance matrix
dsm/
regions # Active DSM regions
stats # DSM page fault / migration stats
memory_pool/
total # Cluster-wide memory total
available # Cluster-wide memory available
per_node/ # Per-node breakdown
scheduler/
balance_interval_ms # Global load balance interval
migrations # Process migration count
migration_log # Recent migrations (for debugging)
capabilities/
revocation_list # Current revocation list
key_ring # Cluster node public keys
5.11.2 Integration with UmkaOS Architecture
5.11.2.1 Memory Manager Integration
The distributed memory features integrate with the existing MM at two points:
1. Page fault handler (extend existing):
Current: fault → check VMA → allocate page / CoW / swap-in
Extended: fault → check VMA → check PageLocationTracker:
- CpuNode → standard local fault (unchanged)
- DeviceLocal → device fault (Section 21.1, unchanged)
- RemoteNode → RDMA fetch from remote node (NEW)
- CxlPool → CXL load (hardware handles it) (NEW)
- NotPresent → allocate locally (unchanged)
2. Page reclaim / eviction (extend existing):
Current: LRU scan → compress or swap
Extended: LRU scan → compress, OR migrate to remote node, OR swap
- If remote memory is available and faster than swap: migrate
- Decision based on cluster distance matrix + pool availability
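The eviction decision above can be sketched as a cost comparison. This is an illustrative sketch, not kernel code: the type names, thresholds, and signature (`pick_evict_target`, the latency figures) are assumptions for illustration; the document only fixes the policy (prefer compression when cheap, else the nearest remote node with pool space if it beats swap, else swap).

```rust
/// Hypothetical sketch: choose an eviction path for a cold page by comparing
/// estimated refault costs. All names and numbers are illustrative.
#[derive(Debug, PartialEq)]
pub enum EvictTarget {
    Compress,       // zswap-style in-memory compression
    RemoteNode(u8), // migrate page to a remote node's memory pool
    Swap,           // local swap device
}

pub fn pick_evict_target(
    remote_rtt_ns: &[(u8, u64)], // (node_id, RTT) from the cluster distance matrix
    remote_has_space: &[bool],   // per-node pool availability, indexed by node_id
    swap_latency_ns: u64,        // e.g. ~80_000 ns for NVMe swap
    compress_latency_ns: u64,    // e.g. ~2_000 ns decompress cost
    page_is_compressible: bool,
) -> EvictTarget {
    // Cheapest first: compression, if the page compresses well.
    if page_is_compressible && compress_latency_ns < swap_latency_ns {
        return EvictTarget::Compress;
    }
    // Otherwise prefer the nearest remote node that still has pool space
    // and beats the swap device on refault latency.
    let best_remote = remote_rtt_ns
        .iter()
        .filter(|(node, _)| remote_has_space[*node as usize])
        .min_by_key(|(_, rtt)| *rtt);
    match best_remote {
        Some((node, rtt)) if *rtt < swap_latency_ns => EvictTarget::RemoteNode(*node),
        _ => EvictTarget::Swap,
    }
}
```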
5.11.2.2 Device Registry Integration
New device types in the registry:
ClusterFabric (virtual root for cluster topology)
+-- rdma_link_0 (Node 0 ↔ Node 1, 200 Gb/s, 2.5 μs RTT)
+-- rdma_link_1 (Node 0 ↔ Node 2, 200 Gb/s, 4.0 μs RTT)
+-- cxl_link_0 (Node 0 ↔ CXL Pool 0, 64 GB/s, 300 ns)
RemoteNode (virtual device representing a remote machine)
+-- Properties:
| node-id: 1
| state: "active"
| rtt-ns: 2500
| bandwidth-gbit-s: 200
| memory-total: 549755813888
| memory-available: 137438953472
| gpu-count: 4
+-- Services published:
"remote-memory" (GlobalMemoryPool)
"remote-accel" (RemoteDeviceProxy for each GPU)
"remote-block" (RemoteDeviceProxy for each NVMe)
5.11.2.3 FMA Integration
New FMA health events for distributed subsystem (Section 19.1):
| Rule | Threshold | Action |
|---|---|---|
| RDMA link degraded | >10% packet retransmits / minute | Alert + reduce traffic |
| RDMA link down | Link-down event | Failover to TCP or isolate node |
| Remote node unresponsive | 3 missed heartbeats (300ms) | Mark Suspect |
| Remote node dead | 10 missed heartbeats (1000ms) | Mark Dead + reclaim |
| DSM page loss | Exclusive page on dead node | Alert + SIGBUS to process |
| Cluster split-brain | Membership views diverge | Quorum protocol (Section 5.9.2.3) |
| CXL memory error | Uncorrectable ECC on CXL pool | Migrate pages + Alert |
| Clock skew detected | >10ms drift between nodes | Alert (affects capability expiry) |
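The heartbeat thresholds in the table (3 missed beats at a 100 ms interval → Suspect, 10 missed → Dead) reduce to a small classifier. A minimal sketch, assuming a hypothetical `classify` helper and a fixed heartbeat interval; the real failure detector would also handle clock adjustment and per-link state:

```rust
/// Illustrative sketch of the FMA heartbeat thresholds:
/// 3 missed beats (300 ms at a 100 ms interval) => Suspect,
/// 10 missed beats (1000 ms) => Dead. Names are hypothetical.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum NodeHealth { Active, Suspect, Dead }

pub fn classify(now_ms: u64, last_heartbeat_ms: u64, interval_ms: u64) -> NodeHealth {
    let missed = now_ms.saturating_sub(last_heartbeat_ms) / interval_ms;
    match missed {
        0..=2 => NodeHealth::Active,
        3..=9 => NodeHealth::Suspect, // alert; stop placing new distributed state here
        _ => NodeHealth::Dead,        // fence the node and reclaim exported resources
    }
}
```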
5.11.2.4 Stable Tracepoints
New stable tracepoints for distributed observability (Section 19.2):
| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_dsm_fault | node_id, remote_node, vaddr, latency_ns | DSM page fault |
| umka_tp_stable_dsm_migrate | src_node, dst_node, pages, bytes | DSM page migration |
| umka_tp_stable_dsm_invalidate | owner_node, reader_nodes, vaddr | DSM cache invalidation |
| umka_tp_stable_cluster_join | node_id, rdma_gid | Node joined cluster |
| umka_tp_stable_cluster_leave | node_id, reason | Node left/failed |
| umka_tp_stable_cluster_migrate | pid, src_node, dst_node, reason | Process migration |
| umka_tp_stable_rdma_transfer | src_node, dst_node, bytes, latency_ns | RDMA data transfer |
| umka_tp_stable_remote_fault | node_id, tier, vaddr, latency_ns | Remote memory access |
| umka_tp_stable_global_pool_alloc | node_id, remote_node, bytes | Global pool allocation |
| umka_tp_stable_global_pool_reclaim | node_id, bytes, reason | Global pool reclaim |
5.11.2.5 Object Namespace
Cluster objects in the unified namespace (Section 19.4):
\Cluster\
+-- Nodes\
| +-- node0\ (this machine)
| | +-- State "active"
| | +-- Memory "512 GB total, 384 GB free, 32 GB exported"
| | +-- GPUs → symlink to \Accelerators\
| +-- node1\
| +-- State "active"
| +-- Memory "512 GB total, 256 GB free, 64 GB exported"
| +-- RTT "2500 ns"
| +-- Bandwidth "200 Gb/s"
+-- DSM\
| +-- region_0\
| +-- Size "1073741824 (1 GB)"
| +-- Pages "262144 total, 131072 local, 131072 remote"
| +-- Faults "12345 total, 3.2 μs avg"
+-- MemoryPool\
| +-- Total "4096 GB (8 nodes × 512 GB)"
| +-- Available "2048 GB"
| +-- LocalExported "128 GB"
| +-- RemoteUsed "64 GB"
+-- Fabric\
+-- Links (RDMA link table with latency/bandwidth)
+-- Topology (switch-level fabric map)
Browsable via umkafs:
cat /mnt/umka/Cluster/Nodes/node1/RTT
→ 2500 ns
ls /mnt/umka/Cluster/DSM/
→ region_0 region_1 region_2
5.11.2.6 Open Questions
The following items require further design work:
DSM Replication Protocol (Phase 3 Milestone)
The complete write propagation and quorum acknowledgment protocol is a Phase 3 milestone (see Section 5.11.3). The design intent, which drives the data structure choices already specified, is:
Consistency levels (set per-region at dsm_region_create() time):
| Level | Semantics | Write cost | Use case |
|---|---|---|---|
| `DSM_SYNCHRONOUS` | Per-page linearizability (all replicas ack; NOT sequential consistency across pages) | All-replica ack, ~3–5 μs round-trip | Shared structs, lock tables |
| `DSM_RELAXED` | Release consistency | Async propagation | Bulk data, ML tensors |
| `DSM_CAUSAL` | Causal consistency via vector clocks | 1 ack + vector clock update | Distributed queues, logs |
Write propagation model (directory-based MOESI-like):
- Owner node holds the canonical copy and the `DsmDirectory` entry for the page.
- On write: owner sends `WRITE_PROPAGATE { page_id, data, epoch }` to all sharers recorded in the directory entry.
- `DSM_SYNCHRONOUS`: owner marks the write complete after receiving acks from all sharers.
- `DSM_RELAXED`: owner proceeds after 1 ack; remaining propagation is async.
- `DsmPageState` transitions: MODIFIED → OWNED (on first sharer) → SHARED (N readers) → INVALID (stale, must fetch) → EXCLUSIVE (transitioning to modified).
Phase 3 deliverables:
- Complete DsmReplicationState struct with epoch counters and sharer bitmask
- propagate_write() algorithm with failure handling (node eviction mid-write)
- quorum_read() algorithm for DSM_SYNCHRONOUS regions
- Consistency level validation benchmarks against real RDMA hardware
- Vector clock protocol for DSM_CAUSAL
Per-consistency-level protocol specifications:
- `DSM_SYNCHRONOUS`: write completes only after all replicas acknowledge. Uses a two-phase protocol:
  1. The primary (owner) node writes to its local copy and sends `ReplicateWrite { addr, data, epoch }` to all replica nodes recorded in the `DsmDirectory` sharer set.
  2. The primary waits for `WriteAck { addr, epoch }` from all replicas before marking the write complete and returning to the caller.
  Reads always serve from the local replica (no remote fetch needed: all replicas are up to date by construction). If any replica does not acknowledge within the RDMA timeout (500ms), the primary retries once, then evicts the non-responsive replica from the sharer set and completes the write with the remaining replicas. The evicted node must re-join the region and perform a full state sync before serving reads again. Latency: 2x RDMA RTT (approximately 4-6 microseconds for a local cluster). Use case: shared metadata, distributed lock tables, coordination structures where stale reads are unacceptable.
- `DSM_RELAXED` (default): write completes after the local write plus asynchronous propagation to replicas. The primary writes to its local copy and returns immediately to the caller. Propagation to replicas proceeds asynchronously: the primary sends `ReplicateWrite { addr, data, epoch }` but does not wait for acknowledgment. Each page maintains a per-page epoch counter (monotonically increasing Lamport clock). Readers may observe stale data; the `dsm_fence()` call synchronizes all pending writes by waiting for `WriteAck` from all replicas for all epochs up to the current local epoch. Conflict resolution uses last-writer-wins by epoch: when a replica receives two writes for the same address with different epochs, it applies only the higher epoch and discards the lower. Ties (same epoch from different nodes) are broken by node ID (lower node ID wins). Latency: 1x RDMA RTT for write (no wait for acks; acks are processed asynchronously and update the propagation watermark). Use case: bulk data, ML tensors, large working sets where application-managed consistency (explicit `dsm_fence()` calls) is acceptable.
- `DSM_CAUSAL`: writes are ordered by causality: if thread A writes X then writes Y, any reader that sees Y must also see X. Uses vector clocks per memory region. Each node maintains a vector clock `vc[MAX_CLUSTER_NODES]` for each `DSM_CAUSAL` region. On write:
  1. The writer increments its own entry in the vector clock: `vc[self_node] += 1`.
  2. The `ReplicateWrite { addr, data, vector_clock }` message carries the writer's current full vector clock.
  3. The recipient advances its local clock to `max(local_vc[i], received_vc[i])` for each entry `i` before applying the write.
  On read: a read is satisfied locally if the local vector clock dominates (is component-wise >=) the vector clock of the last write to that address. If not, the reader must fetch the page from a node whose clock dominates the write's clock. In practice, this means reads are local unless the reader has not yet received a causally-prior write (which triggers a synchronous fetch from the owner). Latency: 1x RDMA RTT plus vector clock comparison overhead (O(N) where N is cluster size, negligible for N <= 64). Use case: message-passing patterns, distributed queues, event logs where causal ordering matters but global synchronization (`DSM_SYNCHRONOUS`) is too expensive.
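The three vector-clock rules (writer tick, recipient merge, read-side dominance) can be captured in a few lines. A minimal sketch: the `VectorClock` type and method names are illustrative; only the merge (component-wise max) and dominance (component-wise >=) semantics come from the text above.

```rust
/// Minimal vector-clock sketch for the DSM_CAUSAL rules. Illustrative only.
pub const MAX_CLUSTER_NODES: usize = 64;

#[derive(Clone, Copy, PartialEq, Debug)]
pub struct VectorClock(pub [u64; MAX_CLUSTER_NODES]);

impl VectorClock {
    pub fn new() -> Self { VectorClock([0; MAX_CLUSTER_NODES]) }

    /// Writer side: vc[self_node] += 1 before sending ReplicateWrite.
    pub fn tick(&mut self, self_node: usize) { self.0[self_node] += 1; }

    /// Recipient side: advance to max(local, received) before applying the write.
    pub fn merge(&mut self, received: &VectorClock) {
        for i in 0..MAX_CLUSTER_NODES {
            self.0[i] = self.0[i].max(received.0[i]);
        }
    }

    /// Read side: a local read is valid iff the local clock dominates
    /// (is component-wise >=) the clock of the last write.
    pub fn dominates(&self, other: &VectorClock) -> bool {
        (0..MAX_CLUSTER_NODES).all(|i| self.0[i] >= other.0[i])
    }
}
```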
Causal Tracking — Per-Interval Vector Timestamps, Not Per-Page Clocks:
Storing a full vector clock per page ([u64; 64] = 512 bytes) would impose 12.5%
metadata overhead on 1 TB DSM (128 GB of clock storage). Production DSM systems
(TreadMarks, Munin) attach vector timestamps to synchronization intervals —
windows between lock acquire/release or explicit fence points — not to individual
pages. UmkaOS follows this approach.
Per-page state (DsmPageMeta):
```rust
/// Off-page metadata stored in the page frame allocator for each DSM page.
/// Compact by design: 32 bytes total, never mapped into the page itself.
pub struct DsmPageMeta {
    /// Nodes that currently hold a valid copy of this page (one bit per node).
    /// Used for targeted invalidation: only nodes in `copyset` receive
    /// `GetM` / invalidation messages when a writer takes exclusive access.
    /// 8 bytes — 0.2% overhead for 4 KB pages.
    pub copyset: u64,
    /// Epoch of the last `DsmCausalStamp` that dirtied this page.
    /// Used to check whether a reader's open interval has seen this write.
    pub last_stamp_epoch: u32,
    /// Node ID of the last writer (for conflict resolution logging).
    pub last_writer: u8,
    pub _pad: [u8; 3],
    /// Home node for this page (directory entry location).
    pub home_node: u8,
    /// Current MOESI state on the home node's view (matches `DsmHomeState`).
    pub home_state: u8,
    pub _pad2: [u8; 14],
}
```
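The `copyset` field drives targeted invalidation: when a writer takes exclusive access, only the nodes whose bits are set receive invalidation messages. A sketch of the bitmask manipulation, with hypothetical helper names (the struct above only specifies the field):

```rust
/// Sketch of the one-bit-per-node `copyset` operations. Helper names are
/// hypothetical; only the bitmask layout comes from DsmPageMeta.
pub fn copyset_add(copyset: u64, node: u8) -> u64 {
    copyset | (1u64 << node)
}

/// Clear a node's bit, e.g. on node departure.
pub fn copyset_remove(copyset: u64, node: u8) -> u64 {
    copyset & !(1u64 << node)
}

/// Nodes that must receive an invalidation when `writer` takes exclusive
/// access: every holder of a valid copy except the writer itself.
pub fn invalidation_targets(copyset: u64, writer: u8) -> Vec<u8> {
    (0..64u8)
        .filter(|n| *n != writer && copyset & (1u64 << n) != 0)
        .collect()
}
```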
Per-interval causal tracking (DsmCausalStamp):
```rust
/// Allocated at lock-acquire (or explicit `dsm_fence()`) and freed after the
/// interval closes and all participating nodes have acknowledged the release.
/// Carries a full vector clock for the interval's causal frontier.
pub struct DsmCausalStamp {
    /// Vector timestamp at interval open. `clock[i]` = number of writes
    /// node `i` had completed when this interval began (monotonic).
    pub clock: [u64; MAX_CLUSTER_NODES],
    /// Pages dirtied within this interval. Piggybacked onto the lock-release
    /// message so receivers can update only the dirty pages' `last_stamp_epoch`.
    pub dirty_pages: Vec<DsmPageAddr>,
    /// Epoch counter for this stamp (matches `DsmPageMeta.last_stamp_epoch`).
    pub epoch: u32,
}
```
Total clock storage is proportional to concurrently open intervals, not to the number of DSM pages. For 64 concurrent intervals across 64 nodes: 64 × 512 bytes = 32 KB — negligible.
- Read-path dominance check: On a `DSM_CAUSAL` read, if the requesting node holds an open `DsmCausalStamp` for the current interval, it compares the page's `last_stamp_epoch` against its interval's epoch:
  if page.last_stamp_epoch <= interval.epoch: accept (all writes causally prior to this interval are visible)
  else: issue invalidation, fetch latest version from home node
- Propagation: At lock-release, the releasing node increments `clock[own_node_id]` in its `DsmCausalStamp`, sets `epoch` on each dirty page's `DsmPageMeta`, and sends the stamp with the lock-release message. Receivers merge the dirty-page list; non-dirty pages need no update.
- Conflict resolution: Two writes to the same page on different nodes in the same interval epoch create a conflict. UmkaOS uses node-rank ordering: the write from the lower-numbered node ID wins. The losing write is replayed via the application's registered conflict handler (registered at `dsm_region_create()` time) or returns `ECONFLICT` if no handler is registered.
- Node departure: When a node leaves, its bit in each `copyset` is cleared. Open `DsmCausalStamp` instances from the departed node have their clock slots frozen at their last observed value. A slot may be reused after one full cluster epoch during which no open stamp references it.
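The conflict-resolution rule (higher epoch wins; same epoch falls back to node-rank ordering, lower node ID wins) is small enough to state as code. A sketch with an illustrative `WriteTag` type not taken from the document:

```rust
/// Sketch of the node-rank conflict rule: for two writes in the same interval
/// epoch, the lower node ID wins; a higher epoch always wins outright.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct WriteTag {
    pub epoch: u32,
    pub node: u8,
}

/// Returns the winning write under (epoch desc, node asc) ordering.
pub fn resolve(a: WriteTag, b: WriteTag) -> WriteTag {
    if a.epoch != b.epoch {
        if a.epoch > b.epoch { a } else { b } // higher epoch wins
    } else if a.node <= b.node {
        a // same epoch: node-rank ordering, lower node ID wins
    } else {
        b
    }
}
```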
Multi-threaded process migration barrier protocol:
When migrating a multi-threaded process (threads sharing an address space via
CLONE_VM), all threads must be frozen atomically before state transfer begins.
A partially-frozen process with some threads still executing would corrupt the
shared address space during transfer. The cluster-wide barrier protocol:
1. Enumerate participating nodes: the source node queries the cluster thread registry to determine all nodes that have threads belonging to the migrating process (identified by `tgid`). This produces a set of participating nodes `P = { node_1, node_2, ..., node_k }`.
2. Send freeze barrier: the source sends `FreezeBarrier { pid, thread_count }` to all nodes in `P` via the RDMA control channel. `thread_count` is the total number of threads in the thread group across all nodes (the source knows this from the cluster thread registry).
3. Remote thread freeze: each remote node in `P` delivers `SIGSTOP` to all local threads of the process via cross-node signal delivery. The signal is injected into the target node's signal queue using an RDMA message (`RemoteSignal { pid, tid, signo: SIGSTOP }`). The target node's signal processing path handles `RemoteSignal` identically to a local `kill()`.
4. Collect freeze acknowledgments: the source waits for `FreezeAck { pid, frozen_thread_count }` from each node in `P`. Each `FreezeAck` confirms that all of that node's threads for the process are stopped and their register state is saved. Timeout: 5 seconds per node. If any node does not respond within the timeout, migration is aborted: the source sends `FreezeAbort { pid }` to all nodes that did acknowledge, and those nodes unfreeze their threads.
5. Barrier complete: once all `FreezeAck` messages are received (confirming that the sum of `frozen_thread_count` values equals the expected `thread_count`), migration proceeds as for the single-threaded case: each node serializes its local thread states independently and sends them to the destination via the RDMA migration channel.
6. Atomic commit on destination: the destination reconstructs all threads before unfreezing any. Thread reconstruction order:
   - First: the main thread (thread group leader, `tid == pid`).
   - Then: all other threads, in arbitrary order.
   - Last: the destination transitions all threads from `TASK_STOPPED` to `TASK_RUNNING` in a single atomic batch (interrupts disabled on the destination CPU during the batch transition to prevent partial observation). This ensures no thread observes a state where some sibling threads exist on the destination but others do not.
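The barrier-completion check in step 5 is pure accounting: every node in P must have acknowledged, and the acknowledged thread counts must sum to the expected total. A minimal sketch, with illustrative types (the document specifies only the message fields):

```rust
/// Sketch of the freeze-barrier accounting from step 5. Types are illustrative.
pub struct FreezeAck {
    pub node: u8,
    pub frozen_thread_count: u32,
}

pub fn barrier_complete(acks: &[FreezeAck], participants: &[u8], expected_threads: u32) -> bool {
    // Every node in P must have acknowledged...
    let all_acked = participants
        .iter()
        .all(|p| acks.iter().any(|a| a.node == *p));
    // ...and the frozen-thread counts must account for the whole thread group.
    let total: u32 = acks.iter().map(|a| a.frozen_thread_count).sum();
    all_acked && total == expected_threads
}
```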
- TCP fallback mechanism: When RDMA fails (NIC down, driver crash), the transport layer falls back to TCP as follows:
Trigger conditions:
- RDMA link-down event from the NIC driver (immediate detection)
- `transport_send()` returns `TransportError::LinkDown` (detected on first failed op)
- Heartbeat timeout (indirect detection if NIC crashes silently)
In-flight operation handling:
- Any RDMA operation in flight at failure time is retried via TCP.
- The sequence number is preserved: the TCP connection starts at the same message sequence number as the failed RDMA path. Both endpoints track the highest acknowledged sequence number so the receiver can deduplicate re-sent ops.
- Operations that were acknowledged before the failure need not be retried.
- Operations that may have been delivered but not acknowledged (RDMA RC completions are idempotent for read ops; write/CAS ops are idempotent if wrapped in sequence-numbered messages) are re-sent; the receiver deduplicates via sequence number.
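The dedup rule above (track the highest acknowledged sequence number, drop re-sent duplicates) can be sketched as a tiny receiver-side filter. The `DedupReceiver` type is hypothetical; only the highest-sequence tracking comes from the text:

```rust
/// Sketch of receiver-side dedup during RDMA-to-TCP fallback: re-sent
/// operations carry their original sequence numbers; anything at or below
/// the highest sequence already applied is dropped. Illustrative only.
pub struct DedupReceiver {
    highest_applied_seq: u64,
}

impl DedupReceiver {
    pub fn new() -> Self {
        DedupReceiver { highest_applied_seq: 0 }
    }

    /// Returns true if the message should be applied, false if it duplicates
    /// an operation already applied over the old RDMA path.
    pub fn accept(&mut self, seq: u64) -> bool {
        if seq <= self.highest_applied_seq {
            false // duplicate of an already-applied op
        } else {
            self.highest_applied_seq = seq;
            true
        }
    }
}
```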
Reconnection protocol:
1. Detecting node sends TRANSPORT_FALLBACK_REQUEST(seq_num=N, reason) to peer via TCP
(using the pre-established TCP backup connection maintained per peer).
2. Peer acknowledges with TRANSPORT_FALLBACK_ACK(seq_num=N).
3. Both sides switch all subsequent messages to TCP.
4. The failed RDMA QP is torn down cleanly or marked dead if unreachable.
Pre-established TCP backup connection: Each node maintains a single TCP connection to each peer for the fallback path, established at cluster join time. This connection is otherwise idle (only keepalive packets). Establishing TCP during failure adds ~100-500ms latency that would be unacceptable.
Re-establishment of RDMA: After the RDMA link recovers (NIC driver reports link-up), the transport layer re-creates the RC QP and runs an exchange to synchronize sequence numbers. Once the QP reaches RTS state, traffic migrates back from TCP to RDMA.
TCP Fallback Transport
When RDMA is unavailable (non-RDMA fabric, RDMA link failure, or RDMA initialization error), the DSM and DLM subsystems fall back to a kernel-level TCP transport:
Transport abstraction: The ClusterTransport trait is implemented by both
RdmaTransport and TcpTransport. Selection happens at cluster join time:
```rust
pub trait ClusterTransport: Send + Sync {
    /// Send a message to node `target`. Best-effort; no delivery guarantee.
    fn send(&self, target: NodeId, msg: &[u8]) -> Result<(), TransportError>;
    /// Reliable send with ACK. Blocks until ACK received or timeout.
    fn send_reliable(&self, target: NodeId, msg: &[u8], timeout_ms: u32) -> Result<(), TransportError>;
    /// Poll for incoming messages (called from the cluster I/O thread).
    fn poll_recv(&self, buf: &mut [u8]) -> Option<(NodeId, usize)>;
}
```
TcpTransport specifics:
- One persistent TCP connection per peer node (reconnected automatically on drop).
- send(): non-blocking write to a per-connection TX ring buffer (kernel-side
socket send buffer).
- send_reliable(): write + wait for application-level ACK message (4-byte
ACK: u32 echoing message sequence number). Timeout after timeout_ms.
- Framing: 8-byte header [msg_len: u32, seq: u32] + variable payload.
- Performance: ~5–20× higher latency than RDMA (microseconds → tens of
microseconds) and ~10× lower throughput. Acceptable for fallback; cluster
health monitoring triggers a warning and attempts RDMA re-establishment.
Fallback trigger: The cluster subsystem detects RDMA unavailability at
startup (ibv_query_device() failure) or at runtime (persistent RDMA send errors
→ TransportError::FabricDown). After 3 consecutive RDMA failures to the same
node, that peer's transport is demoted to TCP. RDMA is re-attempted every 60s.
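The demotion policy (3 consecutive RDMA failures demote a peer to TCP; RDMA re-attempted after 60 s) reduces to a small per-peer state machine. A sketch under stated assumptions: `PeerState` and its method names are invented for illustration; only the thresholds come from the text:

```rust
/// Sketch of the per-peer RDMA->TCP demotion policy. Names are hypothetical.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum PeerTransport { Rdma, Tcp }

pub struct PeerState {
    pub transport: PeerTransport,
    consecutive_rdma_failures: u32,
    demoted_at_ms: u64,
}

impl PeerState {
    pub fn new() -> Self {
        PeerState { transport: PeerTransport::Rdma, consecutive_rdma_failures: 0, demoted_at_ms: 0 }
    }

    pub fn on_rdma_result(&mut self, ok: bool, now_ms: u64) {
        if ok {
            self.consecutive_rdma_failures = 0;
        } else {
            self.consecutive_rdma_failures += 1;
            if self.consecutive_rdma_failures >= 3 {
                // Third consecutive failure: demote this peer to TCP.
                self.transport = PeerTransport::Tcp;
                self.demoted_at_ms = now_ms;
            }
        }
    }

    /// Called periodically: after 60 s on TCP, attempt RDMA again.
    pub fn should_retry_rdma(&self, now_ms: u64) -> bool {
        self.transport == PeerTransport::Tcp
            && now_ms.saturating_sub(self.demoted_at_ms) >= 60_000
    }
}
```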
- Scale extension beyond 64 nodes: Data structure changes for >64-node clusters (extended bitfields, hierarchical directories).
- Partial network partitions: Non-transitive reachability (A can reach B, B can reach C, but A cannot reach C). Current quorum model assumes transitive reachability.
- Thread group migration: Migrating a process that has threads sharing an address space via `CLONE_VM`. All threads in the group must be frozen atomically before any can be transferred, which requires the cluster-wide barrier coordination outlined above; the protocol still needs validation against real workloads.
- ptrace migration: Transferring a process that is being traced (`PTRACE_ATTACH`) is not handled in v1. The tracer and tracee must reside on the same node.
- GPU context migration on consumer hardware: Migration of AccelContexts on GPUs without hardware-supported checkpoint (most GeForce and Radeon consumer parts). Currently fails with `-EOPNOTSUPP`; a future software-level CUDA/ROCm checkpoint integration may enable this at the cost of longer quiescence time.
5.11.3 Implementation Phasing
| Component | Phase | Dependencies | Notes |
|---|---|---|---|
| KernelTransport (RDMA kernel API) | Phase 3 | RDMA driver KABI | Foundation for everything |
| KernelTransport (TCP fallback) | Phase 3+ | KernelTransport | TCP socket transport for RDMA-less environments; end-to-end round-trip latency ~50-200μs (network + kernel processing) vs ~3-5μs RDMA Read. Note: the ~5μs kernel processing overhead quoted in Section 5.1.1.2 is per-packet processing only, not round-trip time. |
| Cluster join / topology discovery | Phase 3 | KernelTransport | Basic cluster formation |
| Heartbeat + failure detection | Phase 3 | KernelTransport | Must have before distributed state |
| Distributed Lock Manager (Section 14.6) | Phase 3-4 | KernelTransport, Heartbeat | RDMA-native DLM; prerequisite for clustered FS |
| Distributed IPC (RDMA rings) | Phase 3-4 | KernelTransport, IPC | Natural extension of existing IPC |
| Pre-registered kernel memory | Phase 3-4 | RDMA driver, IOMMU | Performance prerequisite |
| PageLocation RemoteNode variant | Phase 4 | Memory manager | Small MM extension |
| DSM page fault handler | Phase 4 | PageLocation, KernelTransport | Core DSM functionality |
| DSM directory (home-node hash) | Phase 4 | DSM fault handler | Page ownership tracking |
| DSM coherence protocol | Phase 4-5 | DSM directory | Multiple-reader / single-writer |
| Distributed capabilities (signed) | Phase 4 | Capability system, Ed25519 | Security foundation |
| Cooperative page cache | Phase 4-5 | DSM, VFS | Distributed page cache |
| Global memory pool (basic) | Phase 5 | DSM, cgroups | Remote memory as swap tier |
| Cluster scheduler | Phase 5 | Global pool, DSM affinity | Process migration |
| Process migration | Phase 5 | Cluster scheduler | Freeze/thaw + lazy page fetch |
| Remote device proxy | Phase 5 | KernelTransport, AccelBase | Remote GPU/NVMe access |
| GPUDirect RDMA cross-node | Phase 5 | P2P DMA, RDMA | GPU↔GPU across network |
| Split-brain resolution | Phase 5 | Heartbeat, DSM | Quorum + fencing |
| CXL 2.0 pooled memory | Phase 5 | Memory manager | CXL memory as NUMA node |
| CXL 3.0 shared memory | Phase 5+ | CXL 2.0, DSM | Hardware-coherent DSM |
| Global memory pool (advanced) | Phase 5+ | All of above | Full cluster memory management |
| Cluster-wide cgroup integration | Phase 5+ | Cluster scheduler, global pool | Kubernetes-ready |
| DSM replication (fault tolerance) | Phase 5+ | DSM, replication protocol | For critical workloads |
5.11.3.1 Priority Rationale
Phase 3-4 (Foundation): RDMA transport + cluster formation + basic DSM. This makes UmkaOS cluster-aware and enables distributed IPC. MPI and NCCL workloads benefit immediately from kernel-native RDMA transport.
Phase 4-5 (Practical Wins): Cooperative page cache + global memory pool + signed capabilities. This is when distributed UmkaOS becomes genuinely useful: remote memory as a tier, shared file caching, and secure cross-node operations.
Phase 5+ (Competitive Advantage): Process migration, CXL integration, cluster-wide resource management. Features that no other OS provides. The kernel manages a cluster of machines as a single coherent system.
5.11.4 Licensing Summary
| Component | IP Source | Risk |
|---|---|---|
| RDMA kernel transport | Original design (uses standard RDMA verbs) | None |
| DSM page coherence | Academic (published research: Ivy, TreadMarks, Munin, GAM) | None |
| Home-node directory | Academic (distributed hash table, published) | None |
| Global memory pool | Original design (extends NUMA model) | None |
| Cooperative page cache | Academic (published research) | None |
| Cluster scheduler | Original design (extends CBS) | None |
| Distributed capabilities | Original design (Ed25519 is public-domain) | None |
| CXL integration | CXL spec (public, royalty-free consortium) | None |
| Process migration | Academic (MOSIX concepts are published research) | None |
| Split-brain / quorum | Academic (Paxos, RAFT, published) | None |
| CRDT revocation list | Academic (Shapiro et al., published) | None |
All components are either original design or based on published academic research and open specifications. No vendor-proprietary APIs or patented algorithms.
5.11.5 Comparison: Why Previous DSM Projects Failed and Why This Succeeds
| Factor | Kerrighed / OpenSSI / MOSIX | UmkaOS Distributed |
|---|---|---|
| Kernel design | Bolted onto Linux (30M+ LOC, assumes single machine) | Designed from scratch with distribution in mind |
| Coherence granularity | Cache-line (64B) — false sharing kills performance | Page-level (4KB) — matches network latency |
| Hardware support | None (pure software coherence) | RDMA (2020s), CXL 3.0 (2025+) |
| Memory model | Patched Linux MM (invasive, broke on updates) | PageLocationTracker already supports heterogeneous tiers |
| Transport | TCP sockets (high overhead) | RDMA one-sided ops (zero remote CPU) |
| Security | Unix permissions (not network-portable) | Cryptographically-signed capabilities |
| Fault tolerance | Fragile (node failure = cluster crash) | Quorum-based, bounded-lifetime capabilities, graceful degradation |
| Application compat | Modified syscall layer, broke things | Standard POSIX + opt-in extensions |
| Maintenance burden | Thousands of patches across all subsystems | Clean integration points (PageLocation, IPC transport, KABI proxy) |
| Timing | 2000s — hardware wasn't ready | 2026 — RDMA ubiquitous, CXL arriving, AI demands it |
5.12 SmartNIC and DPU Integration
5.12.1 Problem
DPUs (Data Processing Units) — NVIDIA BlueField, AMD Pensando, Intel IPU — are processors that sit on the network path. They have their own ARM cores, run their own OS, and can process network traffic, storage I/O, and security policies without using host CPU cycles.
The kernel needs a model for "this driver runs on the DPU, not on the host."
5.12.2 Design: Offload Tier
Extend the three-tier driver model (Section 10.4) with a concept of execution location:
// umka-core/src/driver/offload.rs
/// Where a driver's code executes.
/// `#[repr(C, u32)]` is required (not `#[repr(u32)]`) because the `Offloaded`
/// variant carries data fields. Plain `#[repr(u32)]` only works for fieldless
/// (C-like) enums.
#[repr(C, u32)]
pub enum DriverExecLocation {
/// Driver runs on host CPU (standard: Tier 0, 1, or 2).
Host = 0,
/// Driver runs on a DPU/SmartNIC.
/// Kernel communicates with it via PCIe mailbox or shared memory.
/// Same KABI vtable interface, different transport.
Offloaded {
/// Device ID of the DPU in the device registry.
dpu_device: DeviceNodeId,
/// Communication mechanism.
transport: OffloadTransport,
},
}
#[repr(u32)]
pub enum OffloadTransport {
/// PCIe mailbox (register-based, low-latency, small messages).
PcieMailbox = 0,
/// Shared memory region (mapped via PCIe BAR, bulk data).
SharedMemory = 1,
/// RDMA (if DPU has RDMA capability).
Rdma = 2,
}
5.12.3 How It Works
Standard driver (host):
Process → syscall → UmkaOS Core → KABI vtable call → Driver (host CPU)
Offloaded driver (DPU):
Process → syscall → UmkaOS Core → KABI vtable call →
→ OffloadProxy → PCIe mailbox → DPU driver → Hardware
The OffloadProxy implements the same KABI vtable as the real driver.
It translates vtable calls into DPU mailbox messages.
UmkaOS Core cannot tell the difference.
This is the same proxy pattern used for remote devices in the distributed kernel (Section 5.9.1). A DPU is essentially a very close "remote node" connected via PCIe instead of RDMA.
5.12.4 Use Cases
| Scenario | Host Path | DPU Path | Benefit |
|---|---|---|---|
| Network firewall | CPU processes every packet | DPU processes packets, host sees only allowed traffic | CPU freed for applications |
| NVMe-oF target | CPU handles RDMA + NVMe | DPU handles RDMA + NVMe, host CPU uninvolved | Zero host CPU for storage serving |
| IPsec / TLS | CPU encrypts/decrypts | DPU encrypts/decrypts | CPU freed, lower latency |
| vSwitch (OVS) | CPU handles VM networking | DPU handles VM networking | Major CPU savings in cloud |
| Telemetry | CPU collects and sends metrics | DPU collects and sends | No host CPU overhead |
5.12.5 DPU Discovery and Boot
DPU discovery is via standard PCIe enumeration (the DPU appears as a PCIe device). Firmware loading is vendor-specific: BlueField uses UEFI on the DPU's ARM cores, Intel IPU uses its own loader. The host kernel's role:
1. Enumerate DPU as PCIe device (standard PCI scan at boot).
2. Establish communication channel (PCIe mailbox or shared memory BAR).
3. KABI vtable exchange: host and DPU exchange offload capabilities.
4. DPU firmware loading is out of scope — managed by DPU's own boot chain.
5.12.6 DPU Failure Handling
DPU crash / reboot / PCIe link failure:
1. PCIe error detected (AER or link-down notification).
2. OffloadProxy marks all pending I/O as failed:
- Pending network I/O: -EIO to waiting processes.
- Pending storage I/O: -EIO, filesystem handles error.
- Pending control operations: -ENODEV.
3. Notify userspace via udev/netlink: DPU offline event.
4. Attempt DPU recovery:
a. PCIe hot-reset (if supported by hardware).
b. Wait for DPU firmware to re-initialize.
c. Re-establish mailbox communication.
d. Re-exchange KABI vtables.
5. If recovery succeeds: resume offloaded operations.
6. If recovery fails: fall back to host-side driver
(if one exists for the offloaded function).
Example: DPU was offloading NIC → fall back to host NIC driver.
Not all functions have host fallbacks — admin notified.
5.12.7 Shared State Consistency
DPU and host may share data-plane state (flow tables, packet counters) via PCIe shared memory:
Source of truth:
Data plane (flow entries, counters): DPU is authoritative.
DPU processes packets at line rate. Host reads are stale by ~μs.
Control plane (policy, configuration): Host is authoritative.
Host writes policy. DPU reads and applies.
Consistency model:
Shared memory uses producer-consumer pattern with sequence counters.
No locks across PCIe (latency too high for spinlocks).
Host writes are fenced (SFENCE) before signaling DPU via mailbox.
DPU writes are fenced before updating sequence counter.
Host reads check sequence counter for torn-write detection.
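The sequence-counter scheme above is a seqlock: the writer bumps the counter to an odd value before touching data and to an even value after, and the reader retries if the counter was odd or changed mid-read. A single-threaded illustrative sketch (the real pattern additionally needs the SFENCE ordering described above, and `SharedCounters` is a hypothetical layout):

```rust
/// Single-threaded sketch of the host/DPU seqlock pattern. In the real
/// system, `dpu_update` runs on the DPU and `host_read` on the host, with
/// the fences described in the text between each step.
pub struct SharedCounters {
    pub seq: u64,
    pub packets: u64,
    pub bytes: u64,
}

pub fn dpu_update(s: &mut SharedCounters, packets: u64, bytes: u64) {
    s.seq += 1; // odd: write in progress (fence before data writes)
    s.packets = packets;
    s.bytes = bytes;
    s.seq += 1; // even: write complete (fence before counter update)
}

/// Host read: returns a consistent snapshot, retrying on torn writes.
pub fn host_read(s: &SharedCounters) -> (u64, u64) {
    loop {
        let start = s.seq;
        if start % 2 == 1 {
            continue; // writer mid-update: retry
        }
        let snap = (s.packets, s.bytes);
        if s.seq == start {
            return snap; // counter unchanged: snapshot is consistent
        }
    }
}
```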
DPU Firmware Update Lifecycle:
Updating DPU firmware while services are running:
- Migrate all offloaded functions back to host-side drivers (if host fallbacks exist).
- Quiesce the DPU: drain in-flight I/O, flush shared-memory state.
- Apply firmware update via vendor-specific mechanism (BlueField: `bfb-install`, Intel IPU: vendor tool).
- DPU reboots with new firmware.
- Re-establish KABI vtable exchange (new firmware may have new capabilities).
- Re-offload functions that were migrated to host.
If no host fallback exists for an offloaded function, the update requires a maintenance window (the function is unavailable during DPU reboot).
DPU Multi-Tenancy:
In cloud environments, a single DPU serves multiple VMs/containers:
- SR-IOV VFs: The DPU's NIC exposes SR-IOV Virtual Functions, one per tenant. Each VF is passed through to a VM via IOMMU. Hardware isolates VF traffic.
- Hardware flow classification: The DPU's embedded switch classifies packets by flow (5-tuple or VXLAN VNI) and routes to the correct VF.
- Per-VF offload: Each tenant's offloaded functions (firewall rules, encryption keys, QoS policies) are isolated per VF. The host kernel's cgroup hierarchy maps to VF assignments.
Offload Decision Criteria:
The kernel decides whether to offload a function to the DPU based on:
- Capability: Does the DPU support this function? (Checked via KABI vtable exchange.)
- Admin policy: `/sys/kernel/umka/dpu/<id>/offload_policy`: `auto`, `manual`, `disabled`.
- Intent integration: If the cgroup's `intent.efficiency` is high, prefer DPU offload (reduces host CPU power). If `intent.latency_ns` is very low, check whether the PCIe mailbox round-trip (~1-2μs) exceeds the latency budget.
- Load: If the DPU's embedded cores are saturated (utilization > 90%), do not offload additional functions.
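The four criteria combine into a single predicate. A sketch with invented names and thresholds (`should_offload`, the 90% saturation cutoff is the only figure taken from the text):

```rust
/// Illustrative combination of the offload decision criteria. Not kernel API.
pub enum OffloadPolicy { Auto, Manual, Disabled }

pub fn should_offload(
    dpu_supports_fn: bool,          // from the KABI vtable exchange
    policy: OffloadPolicy,          // admin policy sysfs knob
    latency_budget_ns: Option<u64>, // cgroup intent.latency_ns, if set
    mailbox_rtt_ns: u64,            // ~1_000-2_000 ns PCIe mailbox round-trip
    dpu_utilization_pct: u8,        // embedded-core load
) -> bool {
    if !dpu_supports_fn {
        return false; // DPU cannot run this function at all
    }
    if !matches!(policy, OffloadPolicy::Auto) {
        return false; // manual/disabled: never auto-offload
    }
    if let Some(budget) = latency_budget_ns {
        if mailbox_rtt_ns > budget {
            return false; // mailbox round-trip alone blows the latency budget
        }
    }
    dpu_utilization_pct <= 90 // don't pile onto saturated DPU cores
}
```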
5.12.8 Performance Impact
When using DPU offload: the host CPU does LESS work. Performance improves for host applications because infrastructure processing moves to the DPU.
Overhead: one PCIe mailbox round-trip (~1-2 μs) per KABI vtable call that crosses the host-DPU boundary. But DPU-offloaded drivers handle the fast path entirely on the DPU — the host only sees control-plane operations (setup, teardown, configuration changes).