
Chapter 5: Distributed Kernel Architecture

Cluster topology, distance matrix, RDMA transport, distributed lock manager, SmartNIC/DPU integration


Optional multi-kernel clustering over RDMA. Each node runs an independent UmkaOS instance; the cluster layer provides a peer protocol for capability exchange, distributed IPC, and coordinated scheduling. The same peer protocol is used by Tier M devices (DPUs, SmartNICs) connected via PCIe — a remote server and a local DPU are both "peers" in the topology graph, differing only in transport latency. Failure of one node cannot crash another — peer isolation is a hard invariant. CXL 3.0 fabric is supported as a first-class intra-rack transport.

5.1 Distributed Kernel Architecture

This chapter has three conceptually distinct layers, each independently valuable:

  1. Umka-protocol (Sections 5.2-5.4) — A universal device communication protocol providing self-describing devices, capability-scoped access, crash recovery, and unified management. This is the communication foundation shared with the on-host Tier M peer model (Section 11.1). It runs over any transport (PCIe BAR, RDMA, CXL, USB, TCP) and is implementable today via an 8-12K line C firmware shim on existing device RTOSes.

  2. Cluster management (Sections 5.5-5.8) — Membership, topology discovery, RDMA transport, distributed lock manager. Provides cluster-aware scheduling, global memory pooling, and DLM for multi-node filesystems. Requires umka-protocol but not DSM.

  3. Distributed Shared Memory (Sections 5.9-5.10) — Optional, subsystem-scoped. Page-granularity coherence over RDMA for workloads that benefit from shared-memory semantics across nodes. Incurs ~10-18 μs fault latency (NVMe-class, not DRAM-class). DSM is a toolkit that kernel subsystems opt into by creating regions — it is not a global transparent layer. See Section 6.2 for the usage model.

A fully functional distributed UmkaOS deployment uses layers 1-2 without enabling layer 3. DSM is an advanced feature for workloads that explicitly opt in, not a requirement for distributed operation.

Design Constraints:

  1. Drop-in compatibility: A single-node UmkaOS system behaves identically to a non-distributed kernel. All distributed features are opt-in. Existing Linux binaries (MPI, NCCL, Spark, Redis, PostgreSQL) work without modification.
  2. Superset, not replacement: Standard TCP/IP sockets, POSIX shared memory, and SysV IPC work exactly as before. Distributed capabilities are additional. Applications that opt in (via new interfaces or transparent kernel policies) get better performance.
  3. RDMA-native from day one: The kernel's core primitives (IPC, page cache, memory management) are designed with RDMA transport in mind, not bolted on after the fact.
  4. Page-level coherence (layer 3 only): Distributed memory coherence operates at page granularity (4KB minimum), not cache-line granularity (64B). This is the fundamental design decision that makes distributed shared memory practical over network latencies.
  5. Graceful degradation: Node failures are handled. No single point of failure. Split-brain is detected and resolved. Partial cluster operation is always possible.
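The tradeoff behind constraint 4 can be made concrete. A back-of-envelope sketch (illustrative latency and bandwidth figures, not numbers taken from this spec): when each coherence unit costs one network round-trip, page granularity pays the RTT once where cache-line granularity pays it dozens of times.

```rust
// Cost of pulling a 4 KiB working set across the network when each coherence
// unit costs one round-trip. Illustrative parameters: 1.5 us RTT, 100 Gb/s link.
fn transfer_cost_us(bytes: u64, unit_bytes: u64, rtt_us: f64, link_gbps: f64) -> f64 {
    // One coherence message (miss) per unit:
    let round_trips = (bytes + unit_bytes - 1) / unit_bytes;
    // Serialization time for the payload itself:
    let wire_us = (bytes * 8) as f64 / (link_gbps * 1000.0);
    round_trips as f64 * rtt_us + wire_us
}
```

With these illustrative numbers, fetching 4 KiB at page granularity costs one RTT (~1.8 μs total), while fetching the same 4 KiB as 64 independent cache-line misses costs ~96 μs, almost entirely round-trip latency. This is why page-level coherence is the decision that makes DSM practical over networks.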

Relationship to the UmkaOS Peer Protocol: The distributed kernel uses the same Layer 1 peer protocol as Tier M on-host peers (Section 11.1) — membership, capability negotiation, crash recovery, ring buffer transport. The difference is transport binding (Layer 0: RDMA verbs instead of PCIe BAR/MSI) and the addition of DSM page coherence (Layer 3) on top for shared-memory workloads. See Section 24.1 for the full protocol stack.


5.1.1.1 The Hardware Shift

The datacenter is becoming a single, disaggregated computer:

2015: Machines are islands. 10GbE, TCP/IP, millisecond latencies.
      Networking = slow, unreliable. Kernel = local machine only.

2020: RDMA everywhere. 100GbE RoCEv2 / InfiniBand HDR (200Gb/s).
      1-2 μs latency. Kernel-bypass networking is the norm for HPC.

2024: CXL 2.0 memory pooling. Disaggregated memory over PCIe fabric.
      Memory can live outside the machine, accessible at ~200-400 ns.

2025-2026: CXL 3.0 hardware-coherent shared memory. PCIe 6.0 (64 GT/s).
      400GbE / InfiniBand NDR (400Gb/s). Sub-microsecond RDMA.

2027+: CXL switches, memory fabric topology, composable infrastructure.
      The distinction between "local" and "remote" memory blurs.

The hardware is converging on a model where:

  - Network latency (RDMA) ≈ remote NUMA latency ≈ CXL latency
  - All three are ~1-5 μs, compared to NVMe SSD at ~10-15 μs
  - Remote memory over RDMA is faster than local SSD

No existing operating system is designed for this reality.

5.1.1.2 Why Linux Cannot Adapt

Linux has two completely separate networking paradigms:

World A: Socket-based (kernel-managed)
  - TCP/IP, UDP, Unix sockets
  - Kernel manages connections, buffers, routing
  - Page cache, VFS, block I/O all use this world
  - Latency: ~5 μs per packet (kernel processing overhead — time for the
      kernel network stack to process one packet, not end-to-end round-trip)

World B: RDMA/verbs (kernel-bypass)
  - InfiniBand verbs, RoCE
  - Application manages everything via libibverbs
  - Kernel provides setup (protection domains, memory registration) then gets out
  - Page cache, VFS, block I/O know nothing about this world
  - Latency: ~1-2 μs per operation

These worlds do not interact. There is no way for the Linux page cache
to fetch a page from a remote node via RDMA. There is no way for the
Linux scheduler to migrate a process to where its data lives across
an RDMA link. There is no way for Linux IPC to transparently extend
to a remote node.

Most distributed features in Linux (DRBD, Ceph, GlusterFS, GFS2, OCFS2)
are built on World A (sockets). NFS and SMB have added RDMA transports
(svcrdma/xprtrdma since 2.6.24, SMB Direct since v4.15, KSMBD since
v5.15), but these are bolt-on transport alternatives — the core protocols,
data structures, and failure handling remain socket-oriented. None use
RDMA atomics (CAS, FAA) or one-sided Read/Write verbs for lock-free data
structure access or coherence protocols. UmkaOS's distinction is using RDMA atomics
and one-sided operations as the primary coordination primitive, not just
as a transport.

Previous attempts to add distributed capabilities to Linux (Kerrighed, OpenSSI, MOSIX, SSI clusters, GAM patches) all failed because:

  1. Linux's core subsystems (MM, scheduler, VFS, IPC) assume single-machine
  2. Patches touched thousands of lines across dozens of subsystems
  3. Every kernel update broke the patches
  4. Cache-line-level coherence over network was too expensive
  5. No clean abstraction boundary — distributed logic was smeared everywhere

5.1.1.3 UmkaOS's Structural Advantage

UmkaOS's existing architecture is uniquely suited for distributed extension:

| Existing Feature | Distributed Extension |
| --- | --- |
| NUMA-aware memory manager (per-node buddy allocators) | Remote node = distant NUMA node |
| PageLocationTracker (Section 22.4) | Already tracks CPU, GPU, compressed, swap — add RemoteNode |
| MPSC ring buffer IPC | Ring buffer maps naturally to RDMA queue pair |
| Capability tokens (generation-based revocation) | Cryptographic signing → network-portable |
| Device registry (topology tree) | Extend topology to include RDMA fabric |
| AccelBase KABI (Section 22.1) | GPU on remote node = remote accelerator |
| CBS bandwidth guarantees (Section 7.6) | Extend CBS to cluster-wide resource accounting |
| Object namespace (Section 20.5) | \Cluster\Node2\Devices\gpu0 |

The key insight: UmkaOS already models heterogeneous memory (CPU RAM, GPU VRAM, compressed pages, swap) as different tiers in a unified memory hierarchy. Remote memory over RDMA is just another tier. The memory manager already knows how to migrate pages between tiers on demand. Extending "tiers" to include remote nodes is a natural generalization, not a fundamental redesign.
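A sketch of that generalization (all variant names and latency figures here are illustrative assumptions, not the kernel's actual PageLocationTracker types):

```rust
// Illustrative only: the memory tiers named above, extended with a RemoteNode
// variant. A remote node slots into the tier model the same way GPU VRAM or
// compressed memory already does.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageLocation {
    CpuRam { numa_node: u32 },
    GpuVram { device: u32 },
    Compressed,
    Swap { slot: u64 },
    RemoteNode { node_id: u64 }, // the distributed extension: one new tier
}

// The migration policy question is the same at every tier: "how expensive is
// it to bring this page local?" (latencies below are illustrative).
fn fetch_latency_us(loc: PageLocation) -> f64 {
    match loc {
        PageLocation::CpuRam { .. } => 0.1,
        PageLocation::GpuVram { .. } => 2.0,
        PageLocation::Compressed => 3.0,
        PageLocation::RemoteNode { .. } => 14.0, // ~10-18 us DSM fault (Section 5.1)
        PageLocation::Swap { .. } => 80.0,       // SSD-backed swap
    }
}
```

Note that under these numbers a remote-node fault is cheaper than a swap-in, which is exactly the ordering that makes remote memory a useful tier rather than a last resort.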


5.1.2 UmkaOS Peer Protocol Wire Specification

This section defines the complete Layer 1 peer protocol — the wire format, message types, negotiation state machine, and transport bindings that all multi-kernel communication in UmkaOS shares. Tier M peers on PCIe, distributed nodes over RDMA, firmware shims on USB, and CXL-attached accelerators all speak this protocol. The standalone protocol spec document (Section 24.1) is extracted from this section.

All messages use ClusterMessageHeader (defined in Section 5.2, 40 bytes) as the wire header. All structs are #[repr(C)], little-endian, explicitly padded. No implicit compiler padding. All multi-byte integer fields in wire structs use Le16/Le32/Le64 wrapper types (Section 6.1) to enforce correct endianness on mixed-endian clusters (PPC32 and s390x are big-endian). MMIO-mapped atomic fields use LeAtomicU32/LeAtomicU64. Enum discriminants in wire structs are transmitted as Le32 with explicit conversion at the send/receive boundary.

Padding convention (applies to ALL wire structs in Chapters 5-6): Padding fields (_pad*) in all wire structs must be zeroed by the sender. Receivers MUST ignore padding field contents (forward compatibility: future protocol versions may repurpose padding bytes for new fields, distinguished by protocol_version in the message header). This matches the NVMe, CXL, and USB spec conventions for extensible wire formats.

Boolean field convention (applies to ALL u8 boolean fields in wire structs): All u8 boolean fields use the convention 0 = false, 1 = true per CLAUDE.md rule 8 (no bool in #[repr(C)] wire/KABI structs). Receivers treat any non-zero value as true and log an FMA event WireProtocolMalformedField { peer, message_type, field_name, value } if the value exceeds 1. Processing proceeds normally — the anomaly is diagnostic only (HMAC integrity prevents accidental values).
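A minimal sketch of the receive-side rule (function name assumed, not from the kernel):

```rust
// Any non-zero u8 decodes as true; values above 1 are flagged for an FMA
// diagnostic event but do not fail the message (HMAC integrity already rules
// out accidental corruption, so the anomaly is informational).
fn decode_wire_bool(value: u8) -> (bool, /* anomalous: */ bool) {
    (value != 0, value > 1)
}
```

For example, a wire value of 7 decodes as `(true, true)`: accepted as true, but logged as `WireProtocolMalformedField`.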

5.1.2.1 Message Types

/// All peer protocol message types. Carried in ClusterMessageHeader.message_type.
/// Values 0x0000-0x00FF: control plane. 0x0100-0x01FF: data plane.
/// 0x0200-0x02FF: DSM coherence (06-dsm.md, Section 6.6.8). 0x0300-0x03FF:
/// DSM region management (06-dsm.md, Section 6.8.1). 0x0400-0x04FF: DLM wire protocol.
/// 0x0090-0x009F: thread migration (Section 6.5.1).
/// Graceful shutdown messages: 0x0022-0x002B (Section 5.8.3.3).
#[repr(u32)]
pub enum PeerMessageType {
    // --- Control plane (0x0000-0x00FF) ---

    /// Ping/pong for RTT measurement and liveness probing.
    Ping              = 0x0001,
    /// Response to Ping; carries responder timestamp for RTT calculation.
    Pong              = 0x0002,

    /// Join request (initiator → acceptor). Carries authentication material.
    JoinRequest       = 0x0010,
    /// Join accepted: node_id assigned, session key established.
    JoinAccept        = 0x0011,
    /// Join rejected: reason code + diagnostic message.
    JoinReject        = 0x0012,

    /// Graceful leave notification (departing node → all peers).
    LeaveNotify       = 0x0020,
    /// Death notification (detecting node → all peers, broadcast).
    DeadNotify        = 0x0021,

    // --- Graceful shutdown sequence (§5.8.3.3) ---

    /// Acknowledgement of LeaveNotify (peer → departing node).
    LeaveAck          = 0x0022,
    /// Drain notification: departing node requests peers to drain in-flight
    /// messages (departing → all peers).
    DrainNotify       = 0x0023,
    /// Drain acknowledgement: peer confirms all in-flight messages to the
    /// departing node have been processed (peer → departing).
    DrainAck          = 0x0024,
    /// Migration start: departing node begins migrating owned resources
    /// (DLM locks, DSM pages) to the designated successor (departing → successor).
    MigrateStart      = 0x0025,
    /// Migration acknowledgement: successor confirms resource migration
    /// complete (successor → departing).
    MigrateAck        = 0x0026,
    /// DLM drain complete: all DLM lock state for the departing node has
    /// been migrated or released (DLM subsystem → membership layer).
    DlmDrainComplete  = 0x0027,
    /// Periodic progress report during drain (departing → all peers).
    /// Payload: `DrainProgressPayload` ([Section 5.8](#failure-handling-and-distributed-recovery--graceful-shutdown-protocol)).
    DrainProgress     = 0x0028,
    /// Final departure signal (departing → all peers). After sending this,
    /// the departing peer disconnects. Payload: `LeaveCompletePayload`.
    LeaveComplete     = 0x0029,
    /// Per-service drain initiation (departing → service peer).
    ServiceDrainNotify = 0x002A,
    /// Per-service drain acknowledgement (service peer → departing).
    ServiceDrainAck   = 0x002B,

    /// Periodic heartbeat (uses HeartbeatMessage payload, §5.9).
    Heartbeat         = 0x0030,

    /// Advertise available services (joining peer → host/cluster).
    CapAdvertise      = 0x0040,
    /// Withdraw a previously advertised service.
    CapWithdraw       = 0x0041,
    /// Acknowledge acceptance of an advertised service.
    CapAck            = 0x0042,
    /// Reject an advertised service (unsupported, busy, auth failure).
    CapNack           = 0x0043,

    /// Bind a service to a ring pair (host → peer).
    ServiceBind       = 0x0050,
    /// Confirm service binding with ring parameters (peer → host).
    ServiceBindAck    = 0x0051,
    /// Unbind a service (either direction).
    ServiceUnbind     = 0x0052,

    /// FMA health report (peer → host). Same as HeartbeatMessage but
    /// sent immediately on health event, not on periodic schedule.
    HealthReport      = 0x0060,

    /// Reset request (host → device). Escalating: FLR → SBR → power cycle.
    FlrRequest        = 0x0070,

    /// Fencing token update (leader → all nodes).
    FenceTokenUpdate  = 0x0080,

    // --- Raft consensus protocol (0x00A0-0x00AF) ---
    // Raft RPCs for cluster metadata state machine consensus.
    // Full payload structs defined in
    // [Section 5.1](#distributed-kernel-architecture--raft-wire-payload-structs).

    /// Raft AppendEntries RPC (leader → follower).
    RaftAppendEntries     = 0x00A0,
    /// Raft AppendEntries response (follower → leader).
    RaftAppendEntriesResp = 0x00A1,
    /// Raft RequestVote RPC (candidate → all peers).
    RaftRequestVote       = 0x00A2,
    /// Raft RequestVote response (voter → candidate).
    RaftRequestVoteResp   = 0x00A3,
    /// Raft InstallSnapshot RPC (leader → lagging follower, chunked transfer).
    RaftInstallSnapshot   = 0x00A4,
    /// Raft InstallSnapshot response (follower → leader).
    RaftInstallSnapshotResp = 0x00A5,
    /// Raft PreVote RPC (candidate → all peers, Raft Section 9.6 extension).
    RaftPreVote           = 0x00A6,
    /// Raft PreVote response (voter → candidate).
    RaftPreVoteResp       = 0x00A7,

    // --- Topology link-state (0x00B0-0x00BF) ---
    // Link-state advertisement flooding for topology graph construction.
    // Wire format defined in [Section 5.2](#cluster-topology-model--topology-lsa-wire-format).

    /// Topology link-state advertisement (origin peer → all neighbors, flooded).
    /// Payload: LinkStateAdvertisementWire (variable-length, header + N neighbors).
    TopologyLsa           = 0x00B0,
    /// Topology LSA acknowledgement (receiver → sender).
    /// Payload: 16 bytes (origin_peer_id: Le64, sequence: Le64).
    TopologyLsaAck        = 0x00B1,

    // --- Thread migration (0x0090-0x009F) ---
    // Lightweight thread migration between cluster nodes.
    // Full payload structs and protocol sequence defined in
    // [Section 5.6](#cluster-aware-scheduler--thread-migration-wire-protocol).

    /// Request to migrate a thread to a remote node.
    ThreadMigrateRequest  = 0x0090,
    /// Accept thread migration (destination has capacity).
    ThreadMigrateAccept   = 0x0091,
    /// Reject thread migration (overloaded, incompatible, etc.).
    ThreadMigrateReject   = 0x0092,
    // 0x0093 reserved (state transferred via transport.write_to_peer(), not a message type).
    /// Commit: thread state transfer is complete; destination may schedule.
    /// On RDMA: state was written via one-sided RDMA Write to the pre-registered
    /// receive buffer. On TCP: state was sent as a bulk reliable message.
    ThreadMigrateCommit   = 0x0094,
    /// Abort: migration failed (transport error, timeout). Source resumes thread.
    ThreadMigrateAbort    = 0x0095,

    // --- Data plane (0x0100-0x01FF) ---

    /// Service-specific request message (payload = serialized KABI call).
    ServiceMessage    = 0x0100,
    /// Service-specific response message.
    ServiceResponse   = 0x0101,

    /// Doorbell coalescing batch marker (producer → consumer). Signals
    /// that N entries have been written and the consumer should drain.
    DoorbellBatch     = 0x0110,

    // --- DSM coherence (0x0200-0x02FF) ---
    // DSM coherence messages carry a DsmWireHeader payload that contains
    // the DsmMsgType discriminant ([Section 6.6](06-dsm.md#dsm-coherence-protocol-moesi--dsm-coherence-message-wire-format)).
    // The ClusterMessageHeader.message_type is DsmCoherence for ALL DSM
    // coherence traffic; the specific operation (GetS, GetM, Inv, etc.)
    // is encoded in DsmWireHeader.dsm_type. This avoids consuming a
    // PeerMessageType code per MOESI message and keeps the top-level
    // dispatch to a single branch for the entire DSM coherence subsystem.

    /// DSM page coherence message (MOESI protocol, write-update, causal).
    /// Payload: DsmWireHeader (40 bytes) + optional page data (4 KiB RDMA Write).
    /// Sub-type discriminant in DsmWireHeader.dsm_type (DsmMsgType enum).
    DsmCoherence      = 0x0200,

    // --- DSM region management (0x0300-0x03FF) ---
    // Region lifecycle messages (create, join, leave, destroy, reconstruct).
    // Full payload structs are specified in [Section 6.8](06-dsm.md#dsm-region-management--region-management-wire-messages).
    // Each has its own PeerMessageType code because they are infrequent
    // control-plane operations that benefit from distinct top-level dispatch.

    /// Broadcast new region creation to all peers.
    DsmRegionCreateBcast  = 0x0300,
    /// Acknowledge receipt of region creation broadcast.
    DsmRegionCreateAck    = 0x0301,
    /// Request to join an existing DSM region.
    DsmRegionJoinRequest  = 0x0302,
    /// Accept a region join request (assigns slot).
    DsmRegionJoinAccept   = 0x0303,
    /// Reject a region join request (capacity, auth, etc.).
    DsmRegionJoinReject   = 0x0304,
    /// Graceful departure from a DSM region.
    DsmRegionLeave        = 0x0305,
    /// Acknowledge region departure.
    DsmRegionLeaveAck     = 0x0306,
    /// Slot compaction notification (coordinator → participants).
    DsmSlotCompaction     = 0x0310,
    /// Acknowledge slot compaction completion.
    DsmSlotCompactionAck  = 0x0311,
    /// Destroy a DSM region (coordinator → all participants).
    DsmRegionDestroy      = 0x0320,
    /// Acknowledge region destruction.
    DsmRegionDestroyAck   = 0x0321,

    // --- DSM directory reconstruction (0x0330-0x033F) ---
    // Home directory reconstruction after node failure. Full payload structs
    // in [Section 5.8](#failure-handling-and-distributed-recovery--dsm-home-reconstruction).

    /// Directory reconstruction request (new home → all region peers).
    DsmDirReconstruct         = 0x0330,
    /// Directory reconstruction report (peer → new home: local page states).
    DsmDirReconstructReport   = 0x0331,
    /// Directory reconstruction complete (new home → all peers).
    DsmDirReconstructComplete = 0x0332,

    // --- DLM wire protocol (0x0400-0x04FF) ---
    // DLM lock operations dispatched via the peer protocol layer.
    // Full payload structs are specified in [Section 15.15](15-storage.md#distributed-lock-manager--dlm-wire-protocol).

    /// DLM lock operations. Payload: DlmMessageHeader + per-message payload.
    /// See [Section 15.15](15-storage.md#distributed-lock-manager--dlm-wire-protocol).
    DlmOp                     = 0x0400,
}
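The range layout above keeps the top-level dispatcher to one branch per subsystem. A sketch of that dispatch, with the enum and function names assumed for illustration (arm order matters: the sub-ranges carved out of the control plane must match before the 0x0000-0x00FF catch-all):

```rust
/// Which subsystem handles a given ClusterMessageHeader.message_type value.
#[derive(Debug, PartialEq)]
enum Subsystem {
    ControlPlane,
    ThreadMigration,
    Raft,
    TopologyLsa,
    DataPlane,
    DsmCoherence, // sub-operation lives in DsmWireHeader.dsm_type, not here
    DsmRegionMgmt,
    DlmOp,
    Unknown,
}

fn dispatch(message_type: u32) -> Subsystem {
    match message_type {
        0x0090..=0x009F => Subsystem::ThreadMigration,
        0x00A0..=0x00AF => Subsystem::Raft,
        0x00B0..=0x00BF => Subsystem::TopologyLsa,
        0x0000..=0x00FF => Subsystem::ControlPlane, // remaining control plane
        0x0100..=0x01FF => Subsystem::DataPlane,
        0x0200..=0x02FF => Subsystem::DsmCoherence,
        0x0300..=0x03FF => Subsystem::DsmRegionMgmt,
        0x0400..=0x04FF => Subsystem::DlmOp,
        _ => Subsystem::Unknown,
    }
}
```

This is why DsmCoherence consumes a single message-type code: all MOESI traffic takes one branch here, and the per-operation fan-out happens inside the DSM subsystem.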

5.1.2.2 Message Payload Structs

Every message type has a fixed-layout payload struct. Payload immediately follows the 40-byte ClusterMessageHeader in the ring entry. Unused trailing bytes in the ring entry are undefined (not zeroed — zeroing wastes cycles).

/// Join request payload. Initiator → acceptor.
/// Total: 256 bytes (fixed, padded).
#[repr(C)]
pub struct JoinRequestPayload {
    /// Protocol version; must match acceptor's version to proceed.
    pub protocol_version: Le32,
    /// Transport type over which this join is occurring (PeerTransportType as Le32).
    pub transport_type: Le32,
    /// Bitfield of services this peer can provide.
    pub node_capabilities_mask: Le64,
    /// Human-readable node name (ASCII, NUL-padded).
    pub node_name: [u8; 64],
    /// X25519 ephemeral public key (for forward-secret session key derivation).
    pub auth_public_key: [u8; 32],
    /// Ed25519 signature over (auth_public_key || nonce || protocol_version).
    /// Verifier uses the peer's long-term Ed25519 public key from the cluster
    /// trust store (see `TrustStore` below).
    pub auth_signature: [u8; 64],
    /// 16-byte random nonce contributed by the initiator.
    pub nonce: [u8; 16],
    /// Explicit padding to 256 bytes.
    pub _pad: [u8; 64],
}
const_assert!(core::mem::size_of::<JoinRequestPayload>() == 256);
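The 256-byte claim can be checked mechanically. A layout sketch modeling Le32/Le64 as plain u32/u64 (same size and alignment; the real wrapper types add only endian conversion):

```rust
// Field offsets: 0 version, 4 transport, 8 mask, 16 name, 80 pubkey,
// 112 signature, 176 nonce, 192 pad -> 256. Every field lands on its
// natural alignment, so #[repr(C)] inserts no implicit padding.
#[repr(C)]
pub struct JoinRequestPayload {
    pub protocol_version: u32,
    pub transport_type: u32,
    pub node_capabilities_mask: u64,
    pub node_name: [u8; 64],
    pub auth_public_key: [u8; 32],
    pub auth_signature: [u8; 64],
    pub nonce: [u8; 16],
    pub _pad: [u8; 64],
}

fn layout_ok() -> bool {
    std::mem::size_of::<JoinRequestPayload>() == 256
        && std::mem::align_of::<JoinRequestPayload>() == 8 // u64 dominates
}
```

The same arithmetic (4 + 4 + 8 + 64 + 32 + 64 + 16 + 64 = 256) is what the `const_assert!` above enforces at compile time in the kernel.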

/// Transport type identifier.
/// Value 0 is reserved (invalid / uninitialized). PeerTransportType starts at 1
/// so that zero-initialized memory can be detected as uninitialized. A side
/// benefit: because 0 is never a valid discriminant, `Option<PeerTransportType>`
/// gets niche optimization (the compiler represents `None` as 0, costing no extra byte).
#[repr(u32)]
pub enum PeerTransportType {
    PcieBars      = 1,
    Rdma          = 2,
    Cxl           = 3,
    Usb           = 4,
    TcpIp         = 5,
    /// s390x HiperSockets — in-memory virtual NIC between LPARs on the
    /// same CEC (Central Electronics Complex). Sub-microsecond latency,
    /// no physical network traversal. Phase 3+ optimization; s390x peers
    /// participate via TCP transport (Auto fallback) in Phase 1-2.
    HiperSockets  = 6,
}
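A sketch of the receive-boundary conversion from the Le32 wire value back to the enum (decoder name assumed). Zero (the reserved "uninitialized" sentinel) and unknown future values both fail to decode, per the enum-discriminant convention in Section 5.1.2:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u32)]
enum PeerTransportType {
    PcieBars = 1,
    Rdma = 2,
    Cxl = 3,
    Usb = 4,
    TcpIp = 5,
    HiperSockets = 6,
}

// Explicit conversion at the receive boundary: never transmute a raw wire
// value into the enum, since an out-of-range discriminant would be UB.
fn decode_transport(raw: u32) -> Option<PeerTransportType> {
    use PeerTransportType::*;
    match raw {
        1 => Some(PcieBars),
        2 => Some(Rdma),
        3 => Some(Cxl),
        4 => Some(Usb),
        5 => Some(TcpIp),
        6 => Some(HiperSockets),
        _ => None, // 0 (uninitialized) or an unknown future transport
    }
}
```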

/// Join accept payload. Acceptor → initiator.
/// Total: 256 bytes.
#[repr(C)]
pub struct JoinAcceptPayload {
    /// Node ID assigned to the joining peer (unique within cluster, never reused).
    pub assigned_node_id: Le64,   // NodeId
    /// Current cluster epoch (FencingToken.epoch). The joining node must use
    /// this epoch for all subsequent cluster-wide operations.
    pub cluster_epoch: Le64,
    /// Node ID of the current quorum leader.
    pub leader_node: Le64,        // NodeId
    /// Acceptor's X25519 ephemeral public key.
    pub auth_public_key: [u8; 32],
    /// Ed25519 signature over (auth_public_key || nonce || assigned_node_id).
    pub auth_signature: [u8; 64],
    /// 16-byte random nonce contributed by the acceptor.
    pub nonce: [u8; 16],
    /// Explicit padding to 256 bytes (8+8+8+32+64+16+120 = 256).
    pub _pad: [u8; 120],
}
const_assert!(core::mem::size_of::<JoinAcceptPayload>() == 256);

/// Session key derivation (both sides compute after exchanging JoinRequest/JoinAccept).
/// Double-DH for forward secrecy — both static and ephemeral shared secrets contribute:
///   static_shared = X25519(my_static_secret, peer_static_public)
///   ephemeral_shared = X25519(my_ephemeral_secret, peer_ephemeral_public)
///   session_key = HKDF-SHA256(
///       ikm = static_shared || ephemeral_shared,
///       salt = initiator_nonce || acceptor_nonce,
///       info = b"umkaos-peer-v1"
///   )
/// HKDF uses SHA-256 (standard, RFC 5869). Message HMAC uses SHA3-256
/// (structurally independent from SHA-256 — sponge vs. Merkle-Damgård).
/// A break in SHA-2's internal structure does not affect SHA-3, providing
/// defense-in-depth for the session integrity path.
/// All subsequent messages on this session are integrity-checked with
/// HMAC-SHA3-256(session_key, ClusterMessageHeader || payload). The HMAC
/// is carried in ClusterMessageHeader.checksum (truncated to 64 bits for
/// the header; full 256-bit HMAC appended after payload for messages
/// requiring strong integrity — CapAdvertise, FenceTokenUpdate).

/// Cluster trust store. Populated at Phase 6.x (post-network) from one of:
/// 1. Pre-shared keys in `/etc/umka/cluster-keys.json` (admin-provisioned).
/// 2. PKI: X.509 certificates signed by a cluster CA, stored in
///    `/etc/umka/cluster-ca.pem` + `/etc/umka/node-cert.pem`.
/// 3. TPM-sealed keys ([Section 9.4](09-security.md#tpm-runtime-services)): keys sealed to PCR
///    values, released only on measured boot.
/// Populated once at init and updated on membership changes (JoinAccept
/// inserts, LeaveNotify/DeadNotify removes).
pub struct TrustStore {
    /// Maps NodeId → peer's long-term Ed25519 public key.
    pub keys: XArray<PeerPublicKey>,  // keyed by NodeId (u64)
}

/// Join reject payload.
/// Total: 32 bytes.
#[repr(C)]
pub struct JoinRejectPayload {
    /// Rejection reason code (JoinRejectReason as Le32).
    pub reason: Le32,
    /// ASCII diagnostic message (NUL-terminated, for logging).
    pub message: [u8; 28],
}
const_assert!(core::mem::size_of::<JoinRejectPayload>() == 32);

#[repr(u32)]
pub enum JoinRejectReason {
    VersionMismatch = 1,
    AuthFailed      = 2,
    ClusterFull     = 3,
    PolicyDenied    = 4,
}

/// Graceful leave notification.
/// Total: 16 bytes.
#[repr(C)]
pub struct LeaveNotifyPayload {
    /// Node ID of the departing node.
    pub leaving_node: Le64,       // NodeId
    /// Reason for departure (LeaveReason as Le32).
    pub reason: Le32,
    /// How long to wait for in-flight operations to drain (ms).
    /// Peers should complete outstanding requests within this window.
    pub drain_timeout_ms: Le32,
}
const_assert!(core::mem::size_of::<LeaveNotifyPayload>() == 16);

#[repr(u32)]
pub enum LeaveReason {
    Graceful     = 0,
    AdminCommand = 1,
    Maintenance  = 2,
}

/// Death notification (broadcast by the detecting node).
/// Total: 32 bytes.
#[repr(C)]
pub struct DeadNotifyPayload {
    /// Node ID of the dead node.
    pub dead_node: Le64,          // NodeId
    /// Node ID that detected the failure.
    pub detected_by: Le64,        // NodeId
    /// How the failure was detected (DeathDetectionMethod as Le32).
    pub detection_method: Le32,
    pub _pad0: [u8; 4],
    /// New cluster epoch (incremented by the leader on membership change).
    pub new_epoch: Le64,
}
const_assert!(core::mem::size_of::<DeadNotifyPayload>() == 32);

#[repr(u32)]
pub enum DeathDetectionMethod {
    HeartbeatTimeout = 0,
    WatchdogStale    = 1,
    TransportError   = 2,
    AdminForce       = 3,
}

/// Capability advertisement. Variable-length (up to 4096 bytes).
/// Sent by a joining peer to announce available services.
#[repr(C)]
pub struct CapAdvertisePayload {
    /// Number of services in the `services` array.
    pub service_count: Le32,
    pub _pad: [u8; 4],
    // Followed by `service_count` PeerServiceDescriptor structs (128 bytes each).
    // Max services per message: (4096 - 8) / 128 = 31.
}
const_assert!(core::mem::size_of::<CapAdvertisePayload>() == 8);

/// Wire-format service identifier for the peer protocol. Same layout as
/// `ServiceId` ([Section 12.7](12-kabi.md#kabi-service-dependency-resolution)) but with the
/// `major` field encoded as `Le32` for network byte order. Conversion:
/// `WireServiceId::from(service_id)` / `ServiceId::from(wire_id)`.
#[repr(C)]
pub struct WireServiceId {
    /// ASCII service name, NUL-padded (60 bytes). Identical to `ServiceId.name`.
    pub name: [u8; 60],
    /// Major version, little-endian encoded for the wire protocol.
    pub major: Le32,
}
const_assert!(core::mem::size_of::<WireServiceId>() == 64);

/// Describes one service offered by a peer. Fixed 128 bytes.
#[repr(C)]
pub struct PeerServiceDescriptor {
    /// KABI service identifier (name[60] + major_version[4]).
    /// Wire-format encoding: `WireServiceId` uses `Le32` for `major` to ensure
    /// correct endianness on the wire. Matches `ServiceId` in
    /// [Section 12.7](12-kabi.md#kabi-service-dependency-resolution) for interoperability with the
    /// KABI service registry.
    pub service_id: WireServiceId,
    /// Minor version of the service implementation.
    pub minor_version: Le32,
    /// Bitmask of supported transports for this service.
    /// Bit 0: Tier 0 (direct, in-kernel). Bit 1: Tier 1 (ring).
    /// Bit 2: Tier 2 (IPC). For peer services, bit 1 is always set.
    pub transport_mask: Le32,
    /// Maximum concurrent requests this service can handle.
    pub max_concurrent_requests: Le32,
    /// Length of the properties blob (0-32 bytes).
    pub properties_len: Le16,
    pub _pad: [u8; 18],
    /// Service-specific properties (opaque to the protocol layer).
    /// The 32-byte blob is service-defined. Common schemas:
    ///   block service: `max_lba: Le64, block_size: Le32, queue_depth: Le16,
    ///                   flags: Le16, _pad: [u8; 12]`
    ///   network service: `mtu: Le32, speed_mbps: Le32, mac: [u8; 6], _pad: [u8; 14]`
    ///   crypto service: `algorithms: Le64 (bitmask), max_req_size: Le32, _pad: [u8; 16]`
    pub properties: [u8; 32],
}
const_assert!(core::mem::size_of::<PeerServiceDescriptor>() == 128);
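A sketch of decoding the block-service schema from the properties comment above (the field layout comes from that comment; the decoder function itself is an assumption):

```rust
// Block-service properties per the documented schema:
//   max_lba: Le64, block_size: Le32, queue_depth: Le16, flags: Le16, then padding.
// The blob is opaque to the protocol layer; only the service interprets it.
fn decode_block_props(blob: &[u8; 32]) -> (u64, u32, u16, u16) {
    let max_lba     = u64::from_le_bytes(blob[0..8].try_into().unwrap());
    let block_size  = u32::from_le_bytes(blob[8..12].try_into().unwrap());
    let queue_depth = u16::from_le_bytes(blob[12..14].try_into().unwrap());
    let flags       = u16::from_le_bytes(blob[14..16].try_into().unwrap());
    (max_lba, block_size, queue_depth, flags) // remaining bytes are padding
}
```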

/// Capability acknowledgment or rejection.
/// Total: 16 bytes. One per service — host sends N of these after CapAdvertise.
#[repr(C)]
pub struct CapResponsePayload {
    /// FNV-1a hash of the ServiceId for fast matching (the full ServiceId
    /// was already exchanged in CapAdvertise).
    pub service_id_hash: Le64,
    /// Response status (CapResponseStatus as Le32).
    pub status: Le32,
    pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<CapResponsePayload>() == 16);
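For reference, 64-bit FNV-1a with the standard offset basis and prime. Which exact bytes the kernel hashes for `service_id_hash` (e.g. the full 64-byte WireServiceId) is not pinned down by the spec text, so treat the input choice as an assumption:

```rust
// FNV-1a, 64-bit. XOR each byte into the hash, then multiply by the FNV prime.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    hash
}
```

FNV-1a is a good fit here: trivial to implement identically on both peers, and the full ServiceId was already exchanged in CapAdvertise, so the hash only needs to disambiguate among at most 31 services per message, not resist collisions adversarially.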

#[repr(u32)]
pub enum CapResponseStatus {
    /// Service accepted; host will send ServiceBind next.
    Accepted       = 0,
    /// Service not needed by this host.
    NotNeeded      = 1,
    /// Service version incompatible.
    VersionMismatch = 2,
    /// Host is overloaded; try again later.
    Busy           = 3,
    /// Policy denies this service from this peer.
    PolicyDenied   = 4,
}

/// Bind a service to a ring pair.
/// Total: 128 bytes.
/// Layout: service_id(64) + ring_pair_index(4) + requested_queue_depth(4)
///   + requested_entry_size(4) + _pad(52) = 128.
#[repr(C)]
pub struct ServiceBindPayload {
    /// Which service to bind (full WireServiceId for unambiguous identification).
    pub service_id: WireServiceId,
    /// Ring pair index (0-31) assigned for this service's data plane.
    pub ring_pair_index: Le32,
    /// Requested ring depth (entries per ring, power of 2).
    pub requested_queue_depth: Le32,
    /// Requested entry size (bytes, power of 2, minimum 128).
    pub requested_entry_size: Le32,
    pub _pad: [u8; 52],
}
const_assert!(core::mem::size_of::<ServiceBindPayload>() == 128);

/// Service bind acknowledgment (peer → host).
/// Total: 64 bytes.
///
/// The trailing bytes carry transport-specific connection parameters.
/// The receiver inspects `transport_type` to determine which fields
/// in the transport-specific union are valid.
#[repr(C)]
pub struct ServiceBindAckPayload {
    /// Ring pair index (echoed from ServiceBindPayload).
    pub ring_pair_index: Le32,
    /// Granted ring depth (may be less than requested if device has less memory).
    pub granted_queue_depth: Le32,
    /// Granted entry size (may be larger than requested for alignment).
    pub granted_entry_size: Le32,
    /// Transport type discriminant for the transport-specific fields below.
    pub transport_type: Le32,
    /// Offset of the request DomainRingBuffer in the shared memory region (BAR2).
    pub request_ring_offset: Le64,
    /// Offset of the response DomainRingBuffer in the shared memory region.
    pub response_ring_offset: Le64,
    /// Transport-specific connection parameters (20 bytes).
    /// Interpretation depends on `transport_type`.
    pub transport_params: ServiceBindTransportParams,
    /// Explicit padding to reach 64-byte total.
    pub _pad: [u8; 12],
    // ring_pair_index(4) + granted_queue_depth(4) + granted_entry_size(4)
    // + transport_type(4) + request_ring_offset(8) + response_ring_offset(8)
    // + transport_params(20) + _pad(12) = 64 bytes
}
const_assert!(core::mem::size_of::<ServiceBindAckPayload>() == 64);

/// Transport type discriminant for ServiceBindAckPayload.
#[repr(u32)]
pub enum ServiceBindTransport {
    /// PCIe BAR-based transport. `transport_params` contains BAR doorbell offset.
    Pcie      = 0,
    /// RDMA transport (InfiniBand, RoCE, iWARP). `transport_params` contains
    /// RDMA queue pair number, remote key, and remote address for RDMA operations.
    Rdma      = 1,
    /// TCP fallback transport. `transport_params` contains TCP port and address.
    Tcp       = 2,
    /// CXL shared-memory transport. `transport_params` contains doorbell offset.
    Cxl       = 3,
}

/// Transport-specific parameters in ServiceBindAckPayload.
/// 20 bytes, matching the tail of the 64-byte payload.
#[repr(C)]
pub union ServiceBindTransportParams {
    /// PCIe / CXL transport: doorbell register offset.
    pub pcie: ServiceBindPcieParams,
    /// RDMA transport: queue pair, remote key, and remote address.
    pub rdma: ServiceBindRdmaParams,
    /// TCP fallback: port and address (for completeness; rarely used).
    pub tcp: ServiceBindTcpParams,
}

/// PCIe transport parameters (also used for CXL).
#[repr(C)]
pub struct ServiceBindPcieParams {
    /// Byte offset of the doorbell register for this ring pair within BAR0.
    /// This is NOT a virtual address — it is an offset from the base of the
    /// device's BAR0 MMIO region. The host adds this to its BAR0 mapping
    /// base to derive the kernel virtual address.
    pub doorbell_offset: Le32,
    pub _reserved: [u8; 16],
}

/// RDMA transport parameters. Provided when the peer's coordination
/// transport is RDMA (InfiniBand RC, RoCE, or iWARP). The host uses
/// these to post RDMA Send/Recv operations to the service's dedicated
/// queue pair and to perform RDMA Read/Write to the peer's ring buffers.
#[repr(C)]
pub struct ServiceBindRdmaParams {
    /// RDMA Queue Pair number allocated by the peer for this service binding.
    /// The host must connect its local QP to this remote QPN before issuing
    /// RDMA operations.
    pub rdma_qp_num: Le32,
    /// Remote key (rkey) for RDMA Read/Write access to the peer's ring
    /// buffer memory region. Valid for the lifetime of this service binding;
    /// invalidated on ServiceUnbind or peer failure.
    pub rdma_rkey: Le32,
    /// Remote virtual address of the ring buffer base in the peer's address
    /// space. Used as the remote_addr parameter for RDMA Read/Write work
    /// requests. The host adds ring offsets (request_ring_offset,
    /// response_ring_offset) to this base address.
    pub rdma_remote_addr: Le64,
    pub _reserved: [u8; 4],
}

/// TCP fallback transport parameters.
#[repr(C)]
pub struct ServiceBindTcpParams {
    /// TCP port for this service's data plane connection.
    pub tcp_port: Le16,
    pub _reserved: [u8; 18],
}
const_assert!(core::mem::size_of::<ServiceBindPcieParams>() == 20);
const_assert!(core::mem::size_of::<ServiceBindRdmaParams>() == 20);
const_assert!(core::mem::size_of::<ServiceBindTcpParams>() == 20);
const_assert!(core::mem::size_of::<ServiceBindTransportParams>() == 20);

/// FLR/reset request (host → device).
/// Total: 20 bytes.
#[repr(C)]
pub struct FlrRequestPayload {
    /// Target node to reset.
    pub target_node: Le64,        // NodeId
    /// Reset escalation level (ResetLevel as Le32).
    pub reset_level: Le32,
    /// Reason for reset (diagnostic).
    pub reason: Le32,
    pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<FlrRequestPayload>() == 20);

#[repr(u32)]
pub enum ResetLevel {
    Flr        = 0,
    Sbr        = 1,
    PowerCycle = 2,
    BmcIpmi    = 3,
}

/// Fencing token update (leader → all nodes).
/// Payload is the FencingToken struct (defined in Section 5.8.2.3).
/// Full 256-bit HMAC-SHA256 appended after payload for integrity.

5.1.2.3 Coordination Mode

/// Peer coordination mode. Determines the communication mechanism
/// between host and peer. Stored in PeerKernelHealth.mode (Section 5.3).
#[repr(u32)]
pub enum PeerCoordinationMode {
    /// Mode A: Message-passing via ring buffers (RDMA Send/Recv or PCIe BAR DMA).
    /// No shared memory between host and peer. All state transfer is explicit.
    /// Used for: discrete PCIe devices, remote RDMA hosts, USB peers, TCP fallback.
    MessagePassing  = 0,
    /// Mode B: Hardware-coherent shared memory.
    /// Host and peer share a cache-coherent memory region. Locks use local
    /// atomic instructions (LOCK CMPXCHG on x86, LDXR/STXR on AArch64) on
    /// shared memory instead of RDMA CAS. 5-10x lower latency than Mode A.
    /// Used for: CXL Type 2 devices, on-chip partitions (ARM CCA, RISC-V WorldGuard),
    /// NVLink-C2C (Grace Hopper). Requires: PCIe ATS+ACS, CXL.cache, or equivalent
    /// hardware coherency protocol.
    HardwareCoherent = 1,
}

5.1.2.4 Peer Ring Entry Format

/// A single entry in a peer-to-peer ring buffer.
/// Uses the existing DomainRingBuffer infrastructure (Section 11.5,
/// 10-drivers.md) with a fixed entry layout.
///
/// Entry layout: [ClusterMessageHeader (40 bytes)] [payload (entry_size - 40 bytes)]
///
/// Default entry_size: 256 bytes (40 header + 216 payload). Covers all control
/// messages inline. For data-plane messages exceeding (entry_size - 40) bytes,
/// the payload is split across consecutive entries using a continuation flag
/// (header.flags bit 0, CONTINUATION_FLAG = 0x0000_0001, meaning "more
/// entries follow"):
///   - All entries of the message carry header.message_type = ServiceMessage.
///   - Every entry EXCEPT the last sets the flag (including the first).
///     The receiver reassembles by concatenating payloads until it sees
///     an entry without the flag.
///   - header.payload_length in the FIRST entry carries the total payload
///     length across all entries (for pre-allocation).
///
/// This avoids variable-size entries in the ring (which would break
/// the fixed-stride indexing of DomainRingBuffer).
pub const PEER_RING_DEFAULT_ENTRY_SIZE: usize = 256;
pub const PEER_RING_DEFAULT_DEPTH: usize = 256;
pub const PEER_RING_MIN_ENTRY_SIZE: usize = 128;
/// Continuation flag: bit 0 of ClusterMessageHeader.flags ("more entries
/// follow"). Set on every entry of a multi-entry message except the last.
pub const CONTINUATION_FLAG: u32 = 0x0000_0001;
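As an illustrative (non-normative) sketch of the receiver side, assuming a hypothetical `EntryView` that reduces a ring entry to the two fields reassembly needs, with `total_len` standing in for `header.payload_length` from the first entry:

```rust
/// Hypothetical reduced view of one received ring entry.
pub struct EntryView<'a> {
    pub flags: u32,
    pub payload: &'a [u8],
}

/// Concatenate payloads until an entry without the continuation flag
/// (bit 0, "more entries follow") is seen. `total_len` pre-sizes the
/// buffer and cross-checks the reassembled length.
pub fn reassemble(entries: &[EntryView], total_len: usize) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(total_len);
    for e in entries {
        out.extend_from_slice(e.payload);
        if e.flags & 0x0000_0001 == 0 {
            // CONTINUATION_FLAG clear: this was the last entry.
            return if out.len() == total_len { Some(out) } else { None };
        }
    }
    None // Ring drained before a terminating entry was seen.
}
```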

5.1.2.5 PCIe BAR Memory Layout

For PCIe-attached peers (Tier M on-host devices), the device exposes two BARs:

BAR0 (4 KiB) — Control Registers + Doorbells + Scratchpad
═══════════════════════════════════════════════════════════
+0x000  PeerControlRegs (256 bytes)
+0x100  Doorbell registers: u32[32] (128 bytes)
+0x180  Reserved (128 bytes)
+0x200  Scratchpad (64 bytes — used during join handshake)
+0x240  Reserved to 4 KiB

BAR2 (configurable, default 1 MiB) — Ring Buffer Region
═══════════════════════════════════════════════════════════
Each ring pair occupies 2 × (sizeof(DomainRingBuffer) + depth × entry_size) bytes.
With defaults (depth=256, entry_size=256): 2 × (128 + 65536) = ~128 KiB per pair.
1 MiB BAR2 supports 7 ring pairs. Larger BAR2 (4 MiB) supports 31 pairs.

+0x00000  Ring pair 0 — request ring  [DomainRingBuffer header + entries]
+0x10080  Ring pair 0 — response ring (rings are packed: 128 + 65536 = 0x10080)
+0x20100  Ring pair 1 — request ring
+0x30180  Ring pair 1 — response ring
...
Offsets above assume the default depth/entry_size. The authoritative offsets
are those advertised in ServiceBindAckPayload.request_ring_offset /
response_ring_offset; hosts must not hardcode a stride.
Ring pair 0 is always the control channel (join, cap negotiation, heartbeat).
Ring pairs 1+ are data channels (one per bound service).
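The sizing arithmetic above can be sketched with two hypothetical helpers (assuming the 128-byte DomainRingBuffer header used in the example):

```rust
/// Assumed sizeof(DomainRingBuffer) header, per the BAR2 example above.
pub const DOMAIN_RING_HEADER_SIZE: u64 = 128;

/// Bytes occupied by one packed ring (header immediately followed by entries).
pub fn ring_bytes(depth: u64, entry_size: u64) -> u64 {
    DOMAIN_RING_HEADER_SIZE + depth * entry_size
}

/// How many ring pairs fit in a BAR2 of `bar2_size` bytes (two rings per pair).
pub fn max_ring_pairs(bar2_size: u64, depth: u64, entry_size: u64) -> u64 {
    bar2_size / (2 * ring_bytes(depth, entry_size))
}
```

With the defaults this reproduces the figures in the text: ~128 KiB per pair, 7 pairs in a 1 MiB BAR2, 31 pairs in a 4 MiB BAR2.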
/// PCIe BAR0 control registers for the peer protocol.
/// Little-endian. Host and device must agree on this layout.
/// All atomic fields use LeAtomicU32/LeAtomicU64 for correct endianness on
/// mixed-endian clusters. Non-atomic fields use Le32/Le64.
/// MMIO access: writes are posted, reads are non-posted and serialize prior writes
/// (PCIe ordering guarantees). The LeAtomic types handle endian conversion at
/// every load/store — the MMIO register value is always little-endian on the wire.
#[repr(C)]
pub struct PeerControlRegs {
    /// Protocol magic: 0x554D4B41 ('U','M','K','A' in little-endian ASCII).
    /// Host reads this after PCIe enumeration to identify an UmkaOS peer device.
    pub magic: Le32,
    /// Protocol version (currently 1). Must match for join to proceed.
    pub protocol_version: Le32,
    /// Device → host status register (PeerDeviceStatus as Le32).
    pub device_status: LeAtomicU32,
    /// Host → device command register (PeerHostCommand as Le32).
    pub host_command: LeAtomicU32,
    /// Node ID assigned by host (written during JoinAccept, read by device).
    pub node_id: Le32,
    /// Number of ring pairs available in BAR2.
    pub ring_pair_count: Le32,
    /// Entry size for all rings (bytes, power of 2, min 128). Device sets this
    /// based on its memory budget; host reads during ServiceBindAck.
    pub ring_entry_size: Le32,
    /// Entries per ring (power of 2). Same for all rings on this device.
    pub ring_depth: Le32,
    /// BAR2 total size in bytes.
    pub ring_region_size: Le64,
    /// MMIO watchdog counter. Device firmware increments every 10ms.
    /// Host polls on any timeout or suspected failure. Stale counter (>20ms
    /// since last change) → immediate Suspect escalation via PeerHealthState.
    pub watchdog_counter: LeAtomicU64,
    /// Device capabilities mask (same encoding as
    /// JoinRequestPayload.node_capabilities_mask).
    pub capabilities_mask: Le64,
    /// Padding to 256 bytes (cache-line friendly, room for future fields).
    /// 56 bytes of explicit fields + 200 bytes of reserved = 256 bytes.
    pub _reserved: [u8; 200],
}
const_assert!(core::mem::size_of::<PeerControlRegs>() == 256);

/// Device status values (written by device to PeerControlRegs.device_status).
#[repr(u32)]
pub enum PeerDeviceStatus {
    /// Device is in reset / not yet initialized.
    Reset   = 0,
    /// Device firmware has initialized and written X25519 pubkey to scratchpad.
    Ready   = 1,
    /// Join handshake in progress.
    Joining = 2,
    /// Fully operational: services bound, heartbeat active.
    Active  = 3,
    /// Device encountered an unrecoverable error.
    Error   = 4,
}

/// Host command values (written by host to PeerControlRegs.host_command).
#[repr(u32)]
pub enum PeerHostCommand {
    /// No pending command.
    None   = 0,
    /// Initiate join (host has written JoinRequest to ring pair 0).
    Join   = 1,
    /// Request graceful leave.
    Leave  = 2,
    /// Request device reset.
    Reset  = 3,
}

Scratchpad (BAR0 + 0x200, 64 bytes):

/// Used during the join handshake before ring pair 0 is established.
/// Device writes its ephemeral X25519 public key and nonce here.
/// Host reads them, performs ECDH, then sends JoinRequest via ring pair 0.
/// After join completes, scratchpad is unused.
#[repr(C)]
pub struct PeerScratchpad {
    /// Device's X25519 ephemeral public key (32 bytes).
    pub device_pubkey: [u8; 32],
    /// Device's nonce (16 bytes).
    pub device_nonce: [u8; 16],
    /// Reserved.
    pub _reserved: [u8; 16],
}
const_assert!(core::mem::size_of::<PeerScratchpad>() == 64);

5.1.2.6 Capability Negotiation State Machine

The state machine governs the lifecycle of a peer from power-on to active service delivery. All peers (PCIe, RDMA, USB, CXL) follow this state machine; transport differences are in Layer 0, not here.

    ┌──────────┐
    │   IDLE   │  Device powered on, firmware running, no protocol activity.
    └────┬─────┘
         │ Device: initialize BAR0, write pubkey to scratchpad,
         │         set device_status ← READY.
    ┌──────────┐
    │  READY   │  Host discovers device (PCIe enumeration / USB probe /
    └────┬─────┘  RDMA CM connection request). Reads magic + version.
         │ Host: read scratchpad, perform ECDH, write JoinRequest to ring 0.
         │        Set host_command ← JOIN.
    ┌──────────┐  ◄── timeout 5s → back to IDLE (retry with 1s backoff, max 30s)
    │ JOINING  │  ◄── JoinReject → IDLE (log reason, NO auto-retry)
    └────┬─────┘
         │ Device: receive JoinAccept, store assigned node_id.
         │ Device: send CapAdvertise (list all available services).
    ┌──────────┐  ◄── timeout 5s → IDLE
    │ CAP_SENT │  ◄── CapNack for ALL services → IDLE (no usable services)
    └────┬─────┘
         │ Host: evaluate each PeerServiceDescriptor against local policy.
         │ Host: send CapAck/CapNack per service.
         │ Host: send ServiceBind for each accepted service.
    ┌──────────┐  ◄── timeout 5s per service → partial bind (proceed with what succeeded)
    │ BINDING  │
    └────┬─────┘
         │ Device: configure ring pairs per ServiceBindAck.
         │ Device: set device_status ← ACTIVE.
         │ Both: start heartbeat (100ms interval).
    ┌──────────┐
    │  ACTIVE  │  Normal operation: heartbeats flow, service messages on data rings.
    └────┬─────┘
         │ Trigger: LeaveNotify, DeadNotify, host_command=LEAVE,
         │          transport error, or admin command.
    ┌──────────┐
    │ LEAVING  │  Drain in-flight operations (drain_timeout_ms, default 5s).
    └────┬─────┘  Outstanding ring entries are completed or cancelled.
         │ All rings drained. Device: set device_status ← RESET.
    ┌──────────┐
    │   IDLE   │  Ready for re-join or power-off.
    └──────────┘

Multi-peer service selection: When multiple peers offer the same service, selection uses topology cost: the peer with the lowest measured latency (from the topology graph distance matrix, Section 5.2) is preferred. Ties are broken by node_id (lower wins, deterministic). If the selected peer fails (heartbeat timeout or DeadNotify), the next-lowest-latency peer is bound automatically via a new ServiceBind exchange. Selection is re-evaluated on membership change events (JoinAccept, LeaveNotify, DeadNotify).
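A minimal sketch of this selection rule, with `Candidate` as an assumed flattened view of a topology-graph entry (names are illustrative, not part of the protocol):

```rust
/// Hypothetical flattened view of one peer offering the service.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Candidate {
    pub node_id: u64,
    pub latency_ns: u64, // measured latency from the distance matrix (Section 5.2)
}

/// Deterministic selection: lowest latency wins, ties broken by lower node_id.
/// Returns None if no peer offers the service.
pub fn select_peer(candidates: &[Candidate]) -> Option<Candidate> {
    candidates
        .iter()
        .copied()
        .min_by_key(|c| (c.latency_ns, c.node_id))
}
```

Re-running `select_peer` on every membership change event gives the failover behavior described above: when the selected peer dies, the next-lowest-latency peer becomes the minimum.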

State enum (for tracking in kernel code):

#[repr(u32)]
pub enum PeerNegotiationState {
    Idle     = 0,
    Ready    = 1,
    Joining  = 2,
    CapSent  = 3,
    Binding  = 4,
    Active   = 5,
    Leaving  = 6,
}

Timeout table:

Transition            Timeout                    On timeout
READY → JOINING       10s (host must initiate)   Host logs FMA warning; device stays READY
JOINING → CAP_SENT    5s                         → IDLE, retry with 1s backoff (max 30s)
CAP_SENT → BINDING    5s                         → IDLE, retry
BINDING → ACTIVE      5s per service             Partial bind: proceed with bound services
LEAVING drain         configurable, default 5s   Force-close rings, drop in-flight entries

Error handling:

  - JoinReject → IDLE immediately. Reason logged via FMA. Admin must resolve
    (version mismatch, auth failure, policy). No automatic retry prevents
    authentication brute-force.
  - Transport error during ACTIVE → peer transitions to Suspect via
    PeerHealthState (Section 5.3). If heartbeat resumes within the suspect
    window (300ms), back to Active. Otherwise, Dead → 8-step crash recovery
    sequence (Section 5.3).

5.1.2.7 Transport Bindings

The peer protocol is transport-agnostic. Each transport binding maps the abstract operations (send message, receive message, signal doorbell, read scratchpad) to transport-specific mechanisms:

Send message
  PCIe (BAR):           DMA write to peer's BAR2 ring
  RDMA:                 RDMA Send (RC QP)
  CXL:                  Store to shared region
  USB:                  USB bulk OUT
  TCP:                  TCP send
  HiperSockets (s390x): QDIO SBAL write + SIGA

Receive message
  PCIe (BAR):           Read from local BAR2 ring
  RDMA:                 RDMA Recv completion
  CXL:                  Load from shared region
  USB:                  USB bulk IN
  TCP:                  TCP recv
  HiperSockets (s390x): QDIO SBAL read (input queue)

Doorbell
  PCIe (BAR):           Write to BAR0+0x100 → MSI-X
  RDMA:                 RDMA Send (zero-length, IBV_SEND_SIGNALED)
  CXL:                  CXL back-invalidate / MSI-X
  USB:                  USB interrupt EP
  TCP:                  Implicit (data arrival)
  HiperSockets (s390x): SIGA instruction (adapter interrupt)

Scratchpad
  PCIe (BAR):           Read/write BAR0+0x200
  RDMA:                 RDMA Write to known offset
  CXL:                  Shared memory load/store
  USB:                  USB control transfer
  TCP:                  Initial handshake payload
  HiperSockets (s390x): CCW control command

Watchdog
  PCIe (BAR):           Poll BAR0 watchdog_counter
  RDMA:                 Read via RDMA Read
  CXL:                  Shared memory load
  USB:                  USB control transfer
  TCP:                  TCP keepalive RTT
  HiperSockets (s390x): QDIO heartbeat SBAL

Ring pair setup
  PCIe (BAR):           BAR2 at known offsets
  RDMA:                 Allocate QP pair + register MR
  CXL:                  Map shared region
  USB:                  Allocate bulk EP pair
  TCP:                  Allocate socket pair
  HiperSockets (s390x): Allocate QDIO queue pair + SBAL rings

RDMA-specific notes: Ring pair 0 (control channel) uses a dedicated RC QP. Each data-plane service ring pair uses its own RC QP for isolation. QP creation follows the standard INIT→RTR→RTS state machine (Section 5.4). The RdmaRingHeader (Section 5.4) is prepended to each DomainRingBuffer in RDMA mode for sequence-based doorbell synchronization.

USB-specific notes: USB peers use bulk transfer endpoints. EP1 OUT/IN for control messages, EP2 OUT/IN for data, EP3 IN (interrupt) for doorbells. Each USB transfer carries exactly one PeerRingEntry (no scatter across USB transfers — USB framing is per-transfer). Latency: ~1-10ms (USB bus scheduling). Suitable for low-throughput services (sensor hubs, crypto tokens, debug interfaces). Discovery: USB device descriptor class 0xFF, subclass 0x55 ('U'), protocol 0x4D ('M').

Ethernet+TCP-specific notes: Software transport for development, demo, and IoT deployments. Uses one persistent TCP connection per peer. Framing: 8-byte header [msg_len: u32, sequence: u32] + PeerRingEntry payload. Performance: ~50-500μs latency. The TcpPeerTransport implementation of ClusterTransport (Section 5.10) provides this.

HiperSockets-specific notes (s390x, Phase 3+): HiperSockets is a z/VM and LPAR internal virtual networking facility that provides inter-partition communication within the same Central Electronics Complex (CEC) without traversing any physical network. Latency: <1μs (memory-to-memory via microcode assist). Uses QDIO (Queued Direct I/O) queues with SBAL (Storage Block Address List) entries. Discovery: subchannel scan for HiperSockets device type (CU type 0x8061/0x8062). In Phase 1-2, s390x peers participate in the cluster via standard TCP transport (the Auto selection path falls through to TCP when RDMA and CXL are unavailable). HiperSockets optimization adds a transport binding that maps QDIO queues to the DomainRingBuffer abstraction for near-zero-copy inter-LPAR communication.

5.1.2.8 Doorbell Coalescing Protocol

To avoid per-message interrupt overhead:

  1. Producer writes N entries to the ring (incrementing head/published).
  2. Producer writes doorbell register ONCE after the batch.
  3. Consumer wakes on doorbell interrupt, processes ALL entries up to published.
  4. If consumer finds the ring non-empty after processing a batch, it continues processing without waiting for another doorbell (drain-to-empty loop).
  5. Adaptive coalescing: if ring occupancy exceeds 50% at doorbell time, the producer delays the next doorbell by up to coalesce_timeout_us (default: 50μs for PCIe, 10μs for CXL, 0 for USB). This amortizes interrupt overhead under load while maintaining low latency under light load.
/// Per-ring doorbell coalescing state (producer-side).
pub struct DoorbellCoalescer {
    /// Entries written since last doorbell.
    pub pending_count: u32,
    /// Maximum entries before forced doorbell (0 = every entry).
    pub max_batch: u32,
    /// Maximum delay before forced doorbell (microseconds).
    pub coalesce_timeout_us: u32,
    /// Timestamp of first entry in current batch (for timeout enforcement).
    pub batch_start_ns: u64,
}

This matches the NVMe submission queue doorbell model (write tail pointer once per batch) and the io_uring IORING_ENTER_GETEVENTS model (drain all available CQEs per wakeup).
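The producer-side decision in steps 1-5 can be sketched as a pure function. Parameters mirror the DoorbellCoalescer fields; `occupancy_percent` and `now_ns` are assumed inputs from ring state and clock (an illustrative sketch, not the normative algorithm):

```rust
/// Returns true if the producer should ring the doorbell now.
pub fn should_ring(
    pending_count: u32,       // entries written since last doorbell
    max_batch: u32,           // 0 = doorbell on every entry
    coalesce_timeout_us: u32, // hard upper bound on doorbell delay
    batch_start_ns: u64,      // timestamp of first entry in the batch
    occupancy_percent: u32,   // current ring fill level
    now_ns: u64,
) -> bool {
    if pending_count == 0 {
        return false; // Nothing published since the last doorbell.
    }
    if max_batch == 0 || pending_count >= max_batch {
        return true; // Batch full (max_batch == 0 forces per-entry doorbells).
    }
    if (now_ns - batch_start_ns) / 1_000 >= coalesce_timeout_us as u64 {
        return true; // Never delay a doorbell past coalesce_timeout_us.
    }
    // Adaptive coalescing (step 5): above 50% occupancy keep batching;
    // at or below 50%, ring immediately for low latency under light load.
    occupancy_percent <= 50
}
```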

5.1.2.9 Raft Wire Payload Structs

Wire-format payload structs for the Raft consensus RPCs (PeerMessageType::RaftAppendEntries through RaftPreVoteResp, codes 0x00A0-0x00A7). All integer fields use Le64/Le32 for cross-node endian safety. Payloads follow the 40-byte ClusterMessageHeader.

/// AppendEntries RPC payload (leader → follower).
/// Variable-length: fixed header + `entries_count` log entries.
#[repr(C)]
pub struct RaftAppendEntriesPayload {
    /// Leader's current term.
    pub term: Le64,
    /// Leader's node ID (so followers can redirect clients).
    pub leader_id: Le64,
    /// Index of log entry immediately preceding the new entries.
    pub prev_log_index: Le64,
    /// Term of `prev_log_index` entry.
    pub prev_log_term: Le64,
    /// Leader's commit index.
    pub leader_commit: Le64,
    /// Number of log entries following this header.
    pub entries_count: Le32,
    pub _pad: [u8; 4],
    // Followed by `entries_count` serialized RaftLogEntry records.
    // Empty entries_count = heartbeat (no log entries appended).
}
const_assert!(core::mem::size_of::<RaftAppendEntriesPayload>() == 48);

/// AppendEntries response (follower → leader).
///
/// Both `conflict_term` and `conflict_index` are always populated by followers.
/// On success, they are set to `u64::MAX` (0xFFFFFFFFFFFFFFFF = no conflict).
/// On failure, the follower sets them to enable fast log rollback:
///   - `conflict_term`: term of the conflicting entry at `prev_log_index`, or
///     `u64::MAX` if the follower's log is shorter than `prev_log_index`.
///   - `conflict_index`: first index the follower has for `conflict_term`, or
///     the follower's log length if the log is too short.
///
/// The leader's use of these fields is Evolvable (hot-swappable policy):
///   - **Simple policy** (default): ignore conflict fields, decrement
///     `match_index` by 1 per RTT. Correct, O(divergence) RTTs.
///   - **Fast-rollback policy**: use `conflict_term` to skip entire terms.
///     O(distinct_terms) RTTs. Better for medium divergence (10-100 entries).
///   - **InstallSnapshot**: kicks in for large divergence regardless of policy.
///
/// Wire cost: 16 extra bytes per response is negligible; the whole response
/// still fits in a single RDMA inline send (typical inline limit 64-256 bytes).
#[repr(C)]
pub struct RaftAppendEntriesRespPayload {
    /// Follower's current term (for leader to update itself).
    pub term: Le64,                 // 8 bytes  (offset 0)
    /// True if follower's log matched prev_log_index/prev_log_term.
    /// 0 = false (rejected), 1 = true (accepted).
    /// Receivers treat any non-zero value as `true` (accepted). Values
    /// > 1 are protocol anomalies: log FMA event
    /// `RaftResponseMalformed { peer, field: "success", value }` but
    /// proceed normally. (HMAC integrity prevents accidental values;
    /// only a compromised peer — already inside the trust boundary —
    /// could send intentional invalid values.)
    pub success: u8,                // 1 byte   (offset 8)
    pub _pad: [u8; 3],             // 3 bytes  (offset 9)
    /// Follower's last log index (allows leader to set next_index efficiently).
    // Note: offset 12 is not naturally 8-byte aligned. Le64 (alignment 1)
    // handles this correctly via byte-level access.
    pub match_index: Le64,          // 8 bytes  (offset 12)
    /// Term of the conflicting entry, or 0xFFFFFFFFFFFFFFFF if no conflict
    /// (success=true) or follower's log is too short.
    pub conflict_term: Le64,        // 8 bytes  (offset 20)
    /// First index the follower stores for `conflict_term`, or the follower's
    /// log length if the log is shorter than `prev_log_index`.
    /// Set to 0xFFFFFFFFFFFFFFFF on success (no conflict).
    pub conflict_index: Le64,       // 8 bytes  (offset 28)
}
// Total: 8 + 1 + 3 + 8 + 8 + 8 = 36 bytes (fits single RDMA inline send).
const_assert!(core::mem::size_of::<RaftAppendEntriesRespPayload>() == 36);
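A sketch of the fast-rollback policy on the leader side, assuming an in-memory view of the leader's log as a slice of terms (`fast_rollback_next_index` is a hypothetical helper, not part of the wire protocol):

```rust
/// Mirrors the u64::MAX sentinel in the conflict fields above.
pub const NO_CONFLICT: u64 = u64::MAX;

/// Compute the leader's new next_index for a follower after a rejected
/// AppendEntries, from the follower's conflict hints.
/// `leader_log_terms[i]` = term of the leader's log entry at index i
/// (an assumed in-memory view; real code would consult the Raft log).
pub fn fast_rollback_next_index(
    leader_log_terms: &[u64],
    conflict_term: u64,
    conflict_index: u64,
) -> u64 {
    if conflict_term == NO_CONFLICT {
        // Follower's log is shorter than prev_log_index: conflict_index
        // carries the follower's log length, so resend from there.
        return conflict_index;
    }
    match leader_log_terms.iter().rposition(|&t| t == conflict_term) {
        // Leader also has entries for conflict_term: resend from just past
        // its last entry of that term.
        Some(last) => last as u64 + 1,
        // Leader never had conflict_term: skip the follower's whole run of
        // that term in one RTT.
        None => conflict_index,
    }
}
```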

/// RequestVote RPC payload (candidate → all peers).
#[repr(C)]
pub struct RaftRequestVotePayload {
    /// Candidate's term.
    pub term: Le64,
    /// Candidate requesting the vote.
    pub candidate_id: Le64,
    /// Index of candidate's last log entry.
    pub last_log_index: Le64,
    /// Term of candidate's last log entry.
    pub last_log_term: Le64,
}
const_assert!(core::mem::size_of::<RaftRequestVotePayload>() == 32);

/// RequestVote response (voter → candidate).
#[repr(C)]
pub struct RaftRequestVoteRespPayload {
    /// Voter's current term.
    pub term: Le64,
    /// True if the voter granted its vote to the candidate.
    pub vote_granted: u8,
    pub _pad: [u8; 7],
}
const_assert!(core::mem::size_of::<RaftRequestVoteRespPayload>() == 16);

/// PreVote RPC payload (candidate → all peers; Pre-Vote optimization,
/// Raft dissertation §9.6).
/// Identical format to RequestVote; semantically different:
/// pre-vote does not increment the candidate's term.
#[repr(C)]
pub struct RaftPreVotePayload {
    /// Term the candidate would campaign in (current_term + 1).
    pub term: Le64,
    /// Candidate requesting the pre-vote.
    pub candidate_id: Le64,
    /// Index of candidate's last log entry.
    pub last_log_index: Le64,
    /// Term of candidate's last log entry.
    pub last_log_term: Le64,
}
const_assert!(core::mem::size_of::<RaftPreVotePayload>() == 32);

/// PreVote response (voter → candidate).
#[repr(C)]
pub struct RaftPreVoteRespPayload {
    /// Voter's current term.
    pub term: Le64,
    /// True if the voter would grant its vote.
    pub vote_granted: u8,
    pub _pad: [u8; 7],
}
const_assert!(core::mem::size_of::<RaftPreVoteRespPayload>() == 16);
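A sketch of the voter-side grant check, combining the standard Raft up-to-date rule with the pre-vote condition that no live leader has been heard from recently (simplified; parameter names are assumptions, not spec fields):

```rust
/// Decide whether a voter grants a PreVote request.
pub fn grant_pre_vote(
    my_last_log_term: u64,
    my_last_log_index: u64,
    candidate_last_log_term: u64,
    candidate_last_log_index: u64,
    heard_from_leader_recently: bool, // within the minimum election timeout
) -> bool {
    if heard_from_leader_recently {
        return false; // A live leader exists; don't help disrupt it.
    }
    // Standard Raft up-to-date check: higher last term wins; equal terms
    // compare last index.
    candidate_last_log_term > my_last_log_term
        || (candidate_last_log_term == my_last_log_term
            && candidate_last_log_index >= my_last_log_index)
}
```

Because pre-vote does not bump the candidate's term, a partitioned node that fails this check cannot force the cluster through a disruptive term increase when it rejoins.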

5.1.2.10 Raft Log Compaction and Snapshotting

The peer protocol uses Raft consensus (Section 5.1) for cluster membership and distributed lock arbitration. The Raft log grows unboundedly without compaction. UmkaOS uses log-structured snapshotting to truncate the log:

/// Raft log compaction snapshot. Captures the full state machine at a given
/// log index so that all entries up to (and including) `last_included_index`
/// can be discarded.
///
/// Snapshots are taken when `raft_log.len() > RAFT_SNAPSHOT_THRESHOLD` (default:
/// 10_000 entries). The threshold is configurable via the cluster sysctl
/// `cluster.raft_snapshot_threshold`.
pub struct RaftSnapshot {
    /// Log index of the last entry included in this snapshot.
    /// All log entries with index <= last_included_index are discarded
    /// after the snapshot is persisted.
    pub last_included_index: Le64,
    /// Term of the last included log entry.
    pub last_included_term: Le64,
    /// Serialized state machine (cluster membership table, DLM lock state,
    /// capability grants). Format: length-prefixed sections, each tagged
    /// with a `SnapshotSection` discriminant for forward compatibility.
    /// Length is carried out-of-band (in the InstallSnapshot RPC header).
    ///
    /// **Bound**: Allocated once per snapshot (cold path). Maximum size is
    /// bounded by the state machine size: membership table (~1 KiB per node)
    /// + DLM lock state (~64 bytes per lock) + capability grants. Typical
    /// upper bound: ~16 MiB for a 64-node cluster with 100K active locks.
    /// Freed after all followers acknowledge or after snapshot expiry timeout.
    pub data: Vec<u8>,
    /// CRC-32C of `data` for integrity verification.
    pub checksum: Le32,
}

/// Snapshot section tags for forward-compatible deserialization.
/// New sections can be added without breaking older peers (unknown
/// sections are skipped).
#[repr(u16)]
pub enum SnapshotSection {
    /// Cluster membership: list of (NodeId, addr, capabilities).
    Membership   = 0x0001,
    /// DLM lock table: (resource_name, lock_mode, holder_node).
    DlmLocks     = 0x0002,
    /// Capability grants: (cap_id, granter, grantee, permissions).
    CapGrants    = 0x0003,
    /// DSM region directory: (region_id, home_node, consistency).
    /// Serialization format defined below.
    DsmDirectory = 0x0004,
}

SnapshotSection::DsmDirectory serialization format:

The DSM directory snapshot captures the complete state needed to reconstruct all DSM region directories on a recovering node. The format is a length-prefixed sequence of region entries, each containing the region table metadata, the per-page ownership directory, and dirty page bitmaps.

/// Top-level DSM directory snapshot. Serialized as the payload of
/// SnapshotSection::DsmDirectory (tag 0x0004) in the Raft snapshot.
///
/// Wire format: [region_count: Le32] [padding: 4 bytes]
///              followed by region_count × DsmRegionSnapshot entries.
#[repr(C)]
pub struct DsmDirectorySnapshot {
    /// Number of active DSM regions in the cluster at snapshot time.
    pub region_count: Le32,
    pub _pad: [u8; 4],
    // Followed by region_count × DsmRegionSnapshot (variable size).
}
// Wire format (fixed header): region_count(4) + _pad(4) = 8 bytes.
const_assert!(core::mem::size_of::<DsmDirectorySnapshot>() == 8);

/// Per-region snapshot entry.
///
/// Wire format: [header: 48 bytes]
///              [directory_entries: entry_count × (24 + 8 × ceil(max_participants / 64)) bytes]
///              [dirty_bitmap: ceil(entry_count / 8) bytes, padded to 8-byte alignment]
/// For regions with ≤64 participants (1 sharer word): entry size = 32 bytes.
#[repr(C)]
pub struct DsmRegionSnapshot {
    /// Region identity and configuration (matches DsmRegionCreate).
    pub region_id: Le64,
    pub base_addr: Le64,
    pub size: Le64,
    pub page_size: Le32,        // DsmPageSize as Le32
    pub consistency: Le32,      // DsmConsistency as Le32
    pub max_participants: Le16,
    /// Number of pages in this region (= size / page_size_bytes).
    pub entry_count: Le32,
    /// Explicit padding to reach 48 bytes. Le types have alignment 1 (they
    /// are byte arrays internally), so no implicit padding exists between
    /// Le fields. This padding ensures the trailing variable-length data
    /// (directory entries + dirty bitmap) starts at a known offset.
    pub _pad: [u8; 10],
    // Followed by entry_count × DsmDirectoryEntrySnapshot.
    // Base entry size = 32 bytes (for ≤64 participants); variable for larger regions.
    // Followed by dirty bitmap: ceil(entry_count / 8) bytes, 8-byte aligned.
}
const_assert!(core::mem::size_of::<DsmRegionSnapshot>() == 48);

/// Per-page directory entry in the snapshot.
///
/// Captures the home directory's authoritative state for one page:
/// ownership, sharer set, and MOESI home state. This is the data that
/// the home reconstruction protocol (Section 5.8) rebuilds from peer
/// reports during node failure recovery.
#[repr(C)]
pub struct DsmDirectoryEntrySnapshot {
    /// Virtual address of the page within the region.
    pub va: Le64,
    /// Current owner node (PeerId). 0 if Uncached.
    pub owner: Le64,
    /// Home directory state (DsmHomeState as u8).
    pub home_state: u8,
    pub _pad1: [u8; 3],
    /// Number of sharers (nodes with SharedReader copies).
    pub sharer_count: Le16,
    pub _pad2: [u8; 2],
    /// Sharer bitmap — variable-width, serialized as ceil(max_participants / 64)
    /// words. Stored inline for regions with ≤ 64 participants (1 word);
    /// for larger regions, additional words follow this struct in sequence.
    /// The snapshot reader uses max_participants from the region header
    /// to determine the word count.
    pub sharers_word0: Le64,
}
// Base size for ≤64 participants: va(8) + owner(8) + home_state(1) + _pad1(3) +
// sharer_count(2) + _pad2(2) + sharers_word0(8) = 32 bytes.
// Le* types are [u8; N] with alignment 1, so no implicit padding.
const_assert!(core::mem::size_of::<DsmDirectoryEntrySnapshot>() == 32);

Dirty page bitmap: Follows the directory entries as a packed bitfield (1 bit per page, bit i corresponds to directory entry i). A set bit indicates the page has dirty data on at least one node that differs from the home node's copy. The bitmap is padded to 8-byte alignment. The recovering node uses this bitmap to schedule writeback or re-fetch for dirty pages after directory reconstruction.
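A sketch of the bitmap arithmetic. LSB-first bit order within each byte is an assumption here; the text above fixes only the bit-to-entry mapping, not the in-byte order:

```rust
/// Size in bytes of the dirty bitmap for a region with `entry_count` pages:
/// ceil(entry_count / 8), padded to 8-byte alignment.
pub fn dirty_bitmap_bytes(entry_count: u32) -> usize {
    let packed = (entry_count as usize + 7) / 8; // ceil(entry_count / 8)
    (packed + 7) & !7 // pad to 8-byte alignment
}

/// Test whether directory entry `i` is marked dirty (assumed LSB-first
/// bit order within each byte).
pub fn is_dirty(bitmap: &[u8], i: usize) -> bool {
    bitmap[i / 8] & (1u8 << (i % 8)) != 0
}
```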

Consistency guarantee: The snapshot is taken under a cluster-wide DSM quiescent point — the Raft leader holds all region directory locks during serialization. This ensures the snapshot represents a consistent cut of all directory states. The snapshot does NOT include page data (only directory metadata); page data is reconstructed via the normal MOESI coherence protocol after the directory is restored.

Compaction protocol:

  1. When raft_log.len() > RAFT_SNAPSHOT_THRESHOLD, the leader serializes the current state machine into a RaftSnapshot.
  2. The snapshot is written to persistent storage (NVMe or ramdisk) before truncation.
  3. Log entries [0, last_included_index] are discarded. The log now starts at last_included_index + 1.
  4. Followers that fall behind beyond the compacted region receive the snapshot via the InstallSnapshot RPC (chunked transfer, 64 KiB chunks) instead of individual log entries. This is the Raft InstallSnapshot mechanism from Section 7 of the Raft paper (Ongaro & Ousterhout, 2014).
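Steps 1-3 can be sketched against a simple in-memory log (the threshold value and struct shape here are illustrative, not the kernel's actual types):

```rust
// Illustrative threshold; the real RAFT_SNAPSHOT_THRESHOLD is a tuning knob.
const RAFT_SNAPSHOT_THRESHOLD: usize = 4;

struct RaftLog {
    first_index: u64,      // Raft index of the first retained entry
    entries: Vec<Vec<u8>>, // entries[i] has Raft index first_index + i
}

impl RaftLog {
    // Step 1: compaction is triggered when the log grows past the threshold.
    fn needs_snapshot(&self) -> bool {
        self.entries.len() > RAFT_SNAPSHOT_THRESHOLD
    }

    /// Step 3: discard entries [first_index, last_included_index]. The
    /// caller must have durably written the snapshot first (step 2).
    fn compact(&mut self, last_included_index: u64) {
        let drop = (last_included_index + 1 - self.first_index) as usize;
        self.entries.drain(..drop.min(self.entries.len()));
        self.first_index = last_included_index + 1;
    }
}
```

After `compact(last_included_index)`, the log starts at `last_included_index + 1`, matching step 3.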

InstallSnapshot RPC: Uses PeerMessageType::RaftInstallSnapshot (type 0x00A4). The message is chunked: each chunk carries offset and done flag. The follower assembles chunks into the full snapshot, verifies the CRC-32C, and replaces its state machine. After installation, the follower's log starts at last_included_index + 1.
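Follower-side chunk assembly, sketched under the assumption that each chunk carries (offset, data, done) as described above; CRC-32C verification and state-machine replacement are out of scope here, and the type name is illustrative:

```rust
struct SnapshotAssembly {
    buf: Vec<u8>,
}

impl SnapshotAssembly {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Buffer one chunk at its byte offset; returns Some(snapshot bytes)
    /// once the chunk with the `done` flag arrives.
    fn on_chunk(&mut self, offset: usize, data: &[u8], done: bool) -> Option<Vec<u8>> {
        if self.buf.len() < offset + data.len() {
            self.buf.resize(offset + data.len(), 0);
        }
        self.buf[offset..offset + data.len()].copy_from_slice(data);
        if done { Some(std::mem::take(&mut self.buf)) } else { None }
    }
}
```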

Liveness during snapshot transfer: The follower continues to process heartbeats from the leader during the multi-chunk snapshot transfer. If an election timeout fires mid-transfer, the follower abandons the partial snapshot and participates in the new election.


5.2 Cluster Topology Model

Membership protocol index — Cluster membership is specified across multiple sections: join handshake and key exchange (§5.2.8), heartbeat/failure detection (§5.9), split-brain resolution (§5.9.2), graceful leave (§5.9.3), wire format (§5.1.2 message types 0x0010-0x0017).

5.2.1 Extending the Device Registry

The device registry (Section 11.4) models hardware topology as a tree with parent-child and provider-client edges. For distributed operation, extend the tree to span multiple nodes:

Cluster (root of distributed namespace)
  +-- node0 (this machine)
  |   +-- pci0000:00
  |   |   +-- 0000:41:00.0 (GPU)
  |   |   +-- 0000:06:00.0 (NVMe)
  |   |   +-- 0000:03:00.0 (RDMA NIC, mlx5)
  |   +-- cpu0 ... cpu31
  |   +-- numa-node0 (512GB DDR5)
  |   +-- numa-node1 (512GB DDR5)
  |
  +-- node1 (remote machine, discovered via RDMA fabric)
  |   +-- [remote device tree, cached]
  |   +-- numa-node0 (512GB DDR5, reachable via RDMA)
  |   +-- gpu0 (80GB VRAM, reachable via GPUDirect RDMA)
  |
  +-- node2 ...
  |
  +-- fabric (RDMA fabric topology)
      +-- switch0 (InfiniBand switch)
      |   +-- port0 → node0:mlx5_0
      |   +-- port1 → node1:mlx5_0
      +-- switch1
          +-- port0 → node2:mlx5_0
          +-- port1 → node3:mlx5_0

5.2.2 Cluster Node Descriptor

/// Describes a node in the cluster (including self).
///
/// Transport addressing is polymorphic: every node has a mandatory
/// `ip_addr` + `port` (sufficient for TCP transport). RDMA-capable nodes
/// additionally populate `rdma_endpoint` (GID, QPN, PD key). The
/// `transport_type` discriminant indicates which transport the node
/// advertises as its preferred data path. TCP-only peers zero-fill
/// `rdma_endpoint`.
///
/// Layout: 128 bytes (verified by const_assert below).
// kernel-internal, not KABI
#[repr(C)]
pub struct ClusterNode {
    /// Unique node ID (assigned during cluster join, never reused).
    /// Le64 wire type for endian-safe serialization on mixed-endian clusters.
    pub node_id: Le64,  // NodeId as Le64

    /// Node state (Le32 for wire endianness; convert to/from NodeState at access).
    pub state: Le32,  // NodeState as Le32

    /// Preferred transport type (TransportType as Le32).
    /// Determines how other nodes should reach this peer.
    pub transport_type: Le32,  // TransportType as Le32

    /// RDMA endpoint for reaching this node.
    /// Zero-filled when `transport_type != Rdma`. Callers MUST check
    /// `transport_type` before using these fields — a zeroed GID/QPN
    /// is not a valid RDMA endpoint.
    pub rdma_endpoint: RdmaEndpoint,

    /// Total CPU memory available for remote access (bytes).
    /// Le64 wire type for endian-safe serialization (see wire format rule below).
    pub remote_accessible_memory: Le64,

    /// Number of NUMA nodes.
    pub numa_nodes: Le32,

    /// Number of accelerators (GPUs, NPUs, etc.).
    pub accelerator_count: Le32,

    /// Round-trip latency to this node (nanoseconds, measured).
    /// Le32 covers up to ~4.29 seconds, sufficient for datacenter and
    /// campus-scale clusters. WAN links with higher RTT use the
    /// TCP fallback transport, not RDMA, and are represented in the
    /// distance matrix (Section 5.2.9) instead.
    /// **Saturation**: values exceeding `u32::MAX` are clamped to
    /// `u32::MAX` (~4.29 s). Links at `u32::MAX` are treated as equivalent
    /// (effectively unusable for cluster operations — DLM leases expire
    /// at 5-10 s, DSM faults block). The field is kernel-internal (not a
    /// wire commitment) so it can be widened without protocol breakage if
    /// future requirements demand finer WAN differentiation.
    pub measured_rtt_ns: Le32,

    /// TCP/IP port for cluster communication (Le16 for wire endianness).
    /// Always valid regardless of transport_type — used for TCP fallback,
    /// initial cluster join handshake, and DLM two-sided messages.
    pub port: Le16,

    /// Padding for alignment (brings offset to 8-byte boundary for Le64 below).
    pub _pad_align: [u8; 2],

    /// Unidirectional bandwidth to this node (bytes/sec, measured).
    /// Field name uses `bytes_per_sec` to avoid ambiguity with
    /// "bps" (bits per second) common in networking contexts.
    pub measured_bw_bytes_per_sec: Le64,

    /// Heartbeat: last received timestamp.
    pub last_heartbeat_ns: Le64,

    /// Heartbeat: monotonic generation (detects restarts).
    pub heartbeat_generation: Le64,

    /// Cluster protocol version (must match to join).
    /// Nodes with mismatched protocol_version are rejected during cluster join.
    pub protocol_version: Le32,
    pub _pad_pv: [u8; 4],        // alignment padding after protocol_version; must be zeroed on send

    /// IPv6 address of this node (16 bytes). IPv4 addresses are stored in
    /// IPv4-mapped IPv6 format (::ffff:a.b.c.d). Always valid regardless
    /// of transport_type — the IP address is the universal node identifier
    /// used for TCP fallback, initial discovery, and diagnostic reporting.
    pub ip_addr: [u8; 16],
    // Layout: node_id(8) + state(4) + transport_type(4) + rdma_endpoint(40)
    //   + remote_accessible_memory(8) + numa_nodes(4) + accelerator_count(4)
    //   + measured_rtt_ns(4) + port(2) + _pad_align(2) + measured_bw(8)
    //   + last_heartbeat_ns(8) + heartbeat_generation(8)
    //   + protocol_version(4) + _pad_pv(4) + ip_addr(16) = 128 bytes.
}
const_assert!(size_of::<ClusterNode>() == 128);

/// Transport type advertised by a cluster node. Determines the preferred
/// data path for reaching this peer. Higher-performance transports are
/// preferred when available; TCP is the universal fallback.
#[repr(u32)]
pub enum TransportType {
    /// TCP/IP transport. Latency: ~50-200 μs per operation.
    /// Always available as fallback.
    Tcp   = 0,
    /// RDMA (InfiniBand/RoCEv2) transport. Latency: ~1-5 μs.
    /// `rdma_endpoint` fields are valid.
    Rdma  = 1,
    /// CXL shared memory transport. Latency: ~0.1-0.5 μs.
    /// `rdma_endpoint` is zero-filled; CXL addressing uses the
    /// CXL fabric topology discovered at boot.
    Cxl   = 2,
}
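A sketch of how the discriminant might drive path selection (the function and policy here are illustrative assumptions, not the kernel's actual API): TCP is always available, so a peer's advertised preference is honored only when the local node supports it too.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TransportType {
    Tcp = 0,
    Rdma = 1,
    Cxl = 2,
}

/// Pick the data path for a peer: use its advertised preference when the
/// local node also supports that transport, otherwise fall back to TCP.
fn select_transport(local: &[TransportType], peer_pref: TransportType) -> TransportType {
    if local.contains(&peer_pref) {
        peer_pref
    } else {
        TransportType::Tcp // universal fallback
    }
}
```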

/// Consensus (Raft) node identifier. 64-bit to match PeerId (Section 5.2.1) and
/// avoid truncation when a consensus message references a peer. Each Raft voter
/// is also a peer; `NodeId` and `PeerId` are the same numeric value for a given
/// node. The type alias exists to clarify which protocol layer a field belongs to.
pub type NodeId = u64;

#[repr(u32)]
pub enum NodeState {
    /// Node is reachable and healthy.
    Active          = 0,
    /// Node is joining (exchanging topology, syncing state).
    Joining         = 1,
    /// Node missed heartbeats but not yet declared dead.
    Suspect         = 2,
    /// Node is unreachable. Its resources are being reclaimed.
    Dead            = 3,
    /// Node is gracefully leaving (draining work, migrating pages).
    Leaving         = 4,
}

/// RDMA remote key — opaque 32-bit handle granted by the RNIC hardware for
/// remote memory access. Used in RDMA Read/Write/Atomic work requests.
/// The rkey is scoped to a specific Memory Region (MR) and invalidated on
/// MR deregistration or rkey rotation (Section 5.4.5).
///
/// Newtype enforces type safety: prevents accidental use of raw u32 values
/// as rkeys. Referenced by 13-vfs.md, 14-storage.md, and 21-accelerators.md.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct RdmaRkey(pub u32);

#[repr(C)]
pub struct RdmaEndpoint {
    /// RDMA GID (Global Identifier) — InfiniBand/RoCE address.
    pub gid: [u8; 16],
    /// Queue pair number for control channel (Le32 for wire endianness).
    pub control_qpn: Le32,
    /// Protection domain key for this cluster (Le32 for wire endianness).
    pub pd_key: Le32,
    /// RDMA device index on the local machine (Le32 for wire endianness).
    /// Note: this is a local-to-sender value that the receiver uses for
    /// diagnostic/logging purposes, not for RDMA operations.
    pub local_rdma_device: Le32,
    pub _pad: [u8; 12],
}
const_assert!(core::mem::size_of::<RdmaEndpoint>() == 40);

Device-local kernels as cluster members — The ClusterNode structure describes traditional CPU-based compute nodes, but modern hardware increasingly runs its own firmware OS:

  • SmartNICs/DPUs: NVIDIA BlueField-2/3 DPUs run full Ubuntu with 8-16 ARM cores, 16-32 GB DRAM, and can host containers and VMs. Intel IPU and AMD Pensando DPUs run similar firmware stacks.
  • GPUs: NVIDIA GPUs run CUDA firmware that schedules work across thousands of cores, manages HBM memory, and coordinates P2P transfers. AMD GPUs run ROCm firmware.
  • Storage controllers: High-end NVMe controllers and RAID cards run embedded RTOS or Linux to manage flash translation layers, wear leveling, and caching.
  • CXL devices: CXL defines three device types, each with a different operating model in UmkaOS's multikernel cluster:
      • Type 1 (coherent compute, no device-managed memory): Device compute participates in the host CPU cache coherency domain via CXL.cache. Natural Mode B peer — ring buffers in shared memory are hardware-coherent without explicit flush. Examples: coherent FPGAs, smart NICs with CXL.
      • Type 2 (compute + device-managed memory): Both CXL.cache (device cache in host coherency domain) and CXL.mem (device DRAM accessible to host via load/store). The richest UmkaOS peer type — bidirectional zero-copy coherent access. Device runs UmkaOS on its embedded cores, Mode B ring buffers are coherent in both directions. Examples: future CXL-attached GPUs, AI accelerators with HBM.
      • Type 3 (memory expansion, minimal or no compute): Provides additional DRAM via CXL.mem; host sees it as a slower NUMA node. The tiny management processor (if present, typically ARM/RISC-V) acts as a memory-manager peer, not a compute peer: it manages tiering, compression, encryption, and error reporting for the pool, but does not run workloads. Examples: Samsung CMM-H, Micron CZ120, SK Hynix AiMM. See Section 5.9 for the full Type 3 operating model and crash recovery distinction.

Rather than treating these as passive devices controlled exclusively by the host kernel, UmkaOS's distributed design allows device-local kernels to participate as first-class cluster members. A BlueField-3 DPU running UmkaOS could:

  1. Join the cluster as a peer node with its own NodeId, exchange topology with other nodes, and participate in membership/heartbeat protocols.
  2. Expose resources in the device registry: its own CPUs, DRAM, and attached storage/network as remotely-accessible resources.
  3. Run workloads: containers or VMs can be scheduled on the DPU's cores, with distributed locking and DSM providing transparent access to host memory or other cluster nodes.
  4. Offload functions: RDMA transport, network filtering, encryption, compression, or storage can run on the DPU with kernel-level coordination via the distributed lock manager and DSM.

This multikernel model treats a single physical server as a cluster of heterogeneous kernels — one on the host CPU, one on each DPU, one on each GPU (if the GPU firmware exposes cluster primitives). The distributed protocols (membership, DSM, DLM, quorum) work identically whether communicating between physical servers or between the host and a DPU on the same PCIe bus.

Protocol requirements — For a device-local kernel to participate as a first-class cluster member, it must implement UmkaOS's inter-kernel messaging protocol. This is a wire protocol, not an API — the device kernel does not need to be UmkaOS itself, but it must speak the same language:

  1. Transport layer: RDMA (for network-attached nodes) or PCIe P2P MMIO+interrupts (for on-board devices like DPUs/GPUs). The device must expose:
     • A control channel for cluster management messages (join, heartbeat, topology sync)
     • A data channel for DSM page transfers and DLM lock requests
     • MMIO-mapped doorbell registers or MSI-X interrupts for signaling

  2. Cluster membership protocol (Section 5.4):
     • Implement the join handshake: authenticate, exchange topology, sync protocol version
     • Send periodic heartbeats (every 100ms) with a monotonic generation counter
     • Respond to membership queries with node state (Active, Suspect, Dead, Leaving)
     • Participate in failure detection: mark other nodes as Suspect if heartbeats are missed

  3. DSM page protocol (Section 6.2):
     • Accept page ownership transfer requests: PAGE_REQUEST(vpfn, read|write)
     • Respond with page data, or forward the request if not the owner
     • Implement the cache coherence state machine (Owner, Shared, Invalid)
     • Participate in invalidation broadcasts for write requests

  4. DLM lock protocol (Section 15.15):
     • Accept lock acquisition requests: LOCK_ACQUIRE(lock_id, mode=shared|exclusive)
     • Maintain a lock ownership table and grant/deny based on current holders
     • Support one-sided RDMA lock operations (atomic CAS on lock words)
     • Implement the deadlock detection timeout (5 seconds default)

  5. Serialization format: All messages use fixed-size binary structs with explicit padding and versioning. Each message has a 40-byte header:

    #[repr(C)]
    pub struct ClusterMessageHeader {
        pub protocol_version: Le32,  // Currently 1
        pub message_type: Le32,      // PeerMessageType as Le32
        pub node_id: Le64,           // Sender's node ID (NodeId/PeerId)
        pub payload_length: Le32,    // Bytes following this header
        /// Per-message flags. Bit 0: CONTINUATION (payload continues in the
        /// next message). Bits 1-31: reserved (must be zero).
        pub flags: Le32,             // message flags (replaces old _pad)
        pub sequence: Le64,          // Message sequence number (for ordering)
        pub checksum: Le64,          // 64-bit truncated HMAC-SHA3-256
    }
    const_assert!(core::mem::size_of::<ClusterMessageHeader>() == 40);
    // Total header size: 40 bytes (4+4+8+4+4+8+8). All Le types are [u8; N]
    // with alignment 1, so no implicit padding. The `flags` field provides
    // extensible per-message flags without corrupting the HMAC checksum.
    //
    // All integer fields use Le32/Le64 wire types
    // ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) for correct
    // operation on mixed-endian clusters (PPC32, s390x are big-endian).
    //
    // The `checksum` field carries a 64-bit truncated HMAC-SHA3-256 keyed with
    // the session key. Security properties of 64-bit HMAC truncation:
    //
    //   Forgery resistance: 2^64 (one-target, per NIST SP 800-107r1 Section 5.3.4).
    //     An attacker needs ~2^64 guesses to forge a valid tag for a single message.
    //
    //   Birthday-bound collision resistance: 2^32 — the binding constraint.
    //     At 1M messages/second, collision probability reaches 50% after ~72 minutes
    //     (2^32 messages). The per-peer ephemeral HMAC key rotates every 120s
    //     ([Section 5.1](#distributed-kernel-architecture--session-key-rotation)), resetting
    //     the collision counter well below the birthday bound.
    //
    // Hash selection: HKDF uses SHA-256 (key derivation, RFC 5869). Message HMAC
    // uses SHA3-256 (structurally independent from SHA-256 — sponge vs.
    // Merkle-Damgard). A break in SHA-2's internal structure does not affect
    // SHA-3, providing defense-in-depth for the session integrity path.
    // The long-term cluster session key is rotated every 24 hours.
    //
    // Full 256-bit HMAC is used for security-critical messages: capability tokens
    // (CapAdvertise, CapRevoke) and session establishment (JoinRequest/JoinAccept)
    // carry the full HMAC appended after the payload.
    
    Payload structs are defined in Section 5.5 (message formats). All integer fields use Le32/Le64 wire types — native u32/u64 is never used in wire structs. Nodes with mismatched protocol_version are rejected during join.

Sequence number replay protection: Each receiver maintains last_seen_seq: XArray<u64> keyed by NodeId. On message receipt: if msg.sequence <= last_seen_seq[msg.node_id], the message is a duplicate or replay — drop silently and increment fma_counter(PEER_REPLAY_DROP). Otherwise, update last_seen_seq[msg.node_id] = msg.sequence and process. Senders increment sequence monotonically per destination (not globally) to avoid gaps that would trigger false replay detection. On peer restart (detected via generation mismatch in heartbeat), the receiver resets last_seen_seq[peer_id] = 0 to accept the restarted peer's fresh sequence space.
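The replay filter described above can be sketched as follows (the kernel's XArray is approximated with a HashMap, and the FMA counter bump on a dropped replay is omitted):

```rust
use std::collections::HashMap;

struct ReplayFilter {
    last_seen_seq: HashMap<u64, u64>, // NodeId -> highest sequence seen
}

impl ReplayFilter {
    fn new() -> Self {
        Self { last_seen_seq: HashMap::new() }
    }

    /// Returns true if the message should be processed; false means it is
    /// a duplicate or replay and is dropped silently.
    fn accept(&mut self, node_id: u64, sequence: u64) -> bool {
        let last = self.last_seen_seq.entry(node_id).or_insert(0);
        if sequence <= *last {
            return false;
        }
        *last = sequence;
        true
    }

    /// Heartbeat generation mismatch signals a peer restart: reset its
    /// window so the fresh sequence space is accepted.
    fn peer_restarted(&mut self, node_id: u64) {
        self.last_seen_seq.insert(node_id, 0);
    }
}
```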

Implementation paths for device vendors:

  • Path A: Full UmkaOS on device — Run UmkaOS kernel on the device's embedded CPU (e.g., BlueField DPU with 16 ARM cores runs UmkaOS natively). This gives full protocol support with zero extra work. The device becomes a cluster node indistinguishable from a regular server.

  • Path B: Firmware shim — Device vendor implements a minimal protocol adapter in their existing firmware. The adapter translates UmkaOS cluster messages into the device's native operations. Example: NVIDIA GPU firmware receives PAGE_REQUEST messages and responds by copying HBM pages to system memory via GPUDirect RDMA. Does not require rewriting the entire firmware stack.

  • Path C: Traditional driver + host-proxy service — The device runs a traditional KABI driver on the host (not a cluster peer). The host's subsystem layer (block, VFS, accel) wraps the device as a capability service provider (Section 5.7), sending CapAdvertise on behalf of the device. Lower performance than Path B (extra host CPU involvement and subsystem-level indirection) but requires zero firmware changes — day-one cluster accessibility for any driver-managed device.

Near-term hardware targets — UmkaOS already builds for aarch64-unknown-none and riscv64gc-unknown-none-elf. Devices with ARM or RISC-V cores can run the UmkaOS kernel with zero ISA porting work, making Path A immediately actionable:

| Device | Cores | ISA | Path | Notes |
|---|---|---|---|---|
| NVIDIA BlueField-2 DPU | 8× Cortex-A72 | AArch64 | A | Replace host OS with UmkaOS. PCIe P2P to host. Currently runs Ubuntu. |
| NVIDIA BlueField-3 DPU | 16× Neoverse N2 | AArch64 | A | Same. Higher-bandwidth NIC. |
| Marvell OCTEON 10 DPU | ARM Neoverse N2 | AArch64 | A | Open SDK. Same category as BlueField. |
| Microchip PolarFire SoC FPGA | 4× U54 | riscv64gc | A | UmkaOS boot target. FPGA implements custom datapath. Open toolchain. |
| StarFive JH7110 (VisionFive 2) | 4× U74 | riscv64gc | A | Boots UmkaOS today. PCIe expansion for host interconnect. |
| SiFive Intelligence X280 | U74 + RVV | riscv64gc | A | RISC-V vector AI accelerator. UmkaOS-compatible ISA. |
| Netronome Agilio CX (NFP3800) | NFP microengines | proprietary | B | Open C/BPF SDK. Published ring interface specs. Implement UmkaOS ring protocol in NFP firmware. |
| AMD/Xilinx Alveo U50/U250 | FPGA + ARM | AArch64 / FPGA | A or B | Fully programmable. Define any protocol in RTL. UmkaOS on embedded ARM for Path A. |
| Samsung SmartSSD | Zynq UltraScale+ (ARM + FPGA) | AArch64 | A or B | ARM Cortex-A53 runs UmkaOS. FPGA handles NVMe datapath. NVMe CSI spec published. |
| Samsung CMM-H | ARM management core | AArch64 | A (mgmt) | CXL Type 3 memory expander. Management core runs UmkaOS as a memory-manager peer (Type 3 model, Section 5.9). 256 GB–1 TB LPDDR5 pool. |
| Micron CZ120 | ARM management core | AArch64 | A (mgmt) | CXL Type 3. Same model as CMM-H. CXL 2.0, 128 GB–512 GB. |
| SK Hynix AiMM | ARM management core | AArch64 | A (mgmt) | CXL Type 3 with in-memory compute (PIM). Management core as memory-manager peer. |

The key insight: RISC-V devices are uniquely positioned as zero-effort Path A targets. UmkaOS already cross-compiles to riscv64gc-unknown-none-elf with OpenSBI boot. Any device with a RISC-V core and OpenSBI can boot an unmodified UmkaOS kernel — no porting required. ARM-based DPUs (BlueField, OCTEON) are equally zero-effort via the AArch64 build target.

Security boundary — If a device firmware participates as a cluster member, it must be trusted to the same degree as any cluster node. A malicious or compromised GPU firmware with cluster membership could:

  • Request arbitrary memory pages via DSM (reading sensitive data)
  • Corrupt shared memory by writing to DSM pages
  • Initiate denial-of-service by flooding lock requests

Therefore, device cluster membership is disabled by default and enabled per-device:

echo 1 > /sys/bus/pci/devices/0000:41:00.0/umka_cluster_enabled

Only devices running signed, verified firmware (Section 11.3) should be granted cluster membership in secure environments.

Initially, only host kernels participate. Device participation is a Phase 5+ capability (Section 9.7, Section 5.9) that requires firmware modifications by hardware vendors or open-source firmware projects. The protocol specification will be published as an RFC-style document to enable third-party implementations.

Real-world precedents:

  • Barrelfish multikernel OS: Research OS where each CPU core runs its own kernel instance, coordinating via message passing. UmkaOS generalizes this to heterogeneous hardware.
  • BlueField DPU offload: Current NVIDIA BlueField firmware can run OVS, storage targets, or custom applications, but coordination with the host is ad-hoc userspace protocols. UmkaOS provides kernel-level coordination.
  • GPU Direct Storage: NVIDIA GDS allows GPUs to directly access NVMe storage, bypassing the CPU. This is a point solution; UmkaOS's model makes such bypasses general-purpose.

Lightweight mode for intra-machine devices — The full distributed protocol (DSM page transfers, DLM with RDMA CAS, quorum protocols) was designed for multi-node clusters over RDMA networks with 1-5 μs latency. For intra-machine devices (host ↔ DPU/GPU on the same PCIe bus), the latency is 10x lower (~200-500ns), but failure modes are different (no network partitions, but devices can hang or reset independently).

Architectural principle: message passing is the primitive. UmkaOS's IPC model (Section 11.8) is built on message passing — explicit ownership transfer, capability-mediated channels, and defined send/receive semantics. This is the architectural primitive. It composes uniformly across every boundary UmkaOS targets: intra-kernel, kernel-user, cross-process, cross-VM, cross-network, and hardware peer. The "shared memory fast path" described in Mode B below is not a competing model — it is how message passing is implemented locally when hardware cache coherency is available. The abstraction is always message passing; the transport is chosen to match the hardware.

Two coordination modes are supported:

Mode A: Full Distributed Protocol (default for network-attached nodes)

  • All messages via RDMA or PCIe P2P with the ClusterMessageHeader wire format
  • DSM: explicit page ownership transfer with RDMA Read/Write
  • DLM: distributed lock tables, RDMA atomic CAS, deadlock detection
  • Membership: heartbeat every 100ms, suspect timeout 1 second
  • Best for: multi-node clusters, devices that may have partial failures

Mode B: Hardware-Coherent Transport (optional for trusted local devices)

  • The message-passing ownership guarantee is provided by the hardware cache coherency protocol (PCIe ACS, CCIX, or CXL.cache) rather than by the software ownership transfer protocol. The MESI state machine in hardware plays the same role that the software ownership protocol plays over RDMA: only one writer at a time, cached reads see the latest write. No software ownership messages are needed on top of hardware coherency — that would be redundant.
  • Both host and device map the same physical memory (via PCIe BAR mappings or pinned system memory). Locks use local atomic ops (x86 LOCK CMPXCHG, ARM LDXR/STXR) on cache-coherent shared memory instead of RDMA CAS.
  • Membership via MMIO doorbell registers (no network heartbeat overhead).
  • 5-10x lower latency: ~50-100ns for lock acquire vs. ~500ns-1μs for RDMA CAS.
  • Requires: the device must be cache-coherent with the host (PCIe ATS + ACS, CCIX, or CXL.cache). Non-coherent devices (all current discrete GPUs, most current DPUs) must use Mode A.

When to use each mode:

| Scenario | Mode | Reason |
|---|---|---|
| Multi-node RDMA cluster | A | Must handle network failures; can't assume cache coherence |
| BlueField DPU running full UmkaOS | A | Separate memory space; needs explicit coordination |
| Future CXL 3.0-coherent GPU | B | CXL.cache coherency makes hardware the ownership protocol |
| Integrated GPU / APU (UMA) | B | CPU and GPU share coherent on-die/on-package memory |
| NVMe storage controller | B | Controller and host share command queues in coherent memory |
| Untrusted/unverified device | A | Mode B relies on the hardware coherency guarantee — only for verified hardware |

Mode B is an optimization, not a separate protocol. It reuses the same data structures (lock tables, membership records) but the coherency guarantee is provided by hardware instead of software message passing. A device can fall back to Mode A if cache coherence fails or if the device resets.
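The per-device decision implied by the table above can be sketched as follows (the struct fields and function are illustrative, not a kernel API):

```rust
#[derive(Debug, PartialEq)]
enum CoordMode {
    FullDistributed,  // Mode A
    HardwareCoherent, // Mode B
}

struct PeerDevice {
    same_machine: bool,   // on the local PCIe/CXL fabric, not the network
    cache_coherent: bool, // PCIe ATS + ACS, CCIX, or CXL.cache verified
    trusted: bool,        // signed, verified firmware (Section 11.3)
}

/// Mode B only for trusted, cache-coherent, intra-machine devices;
/// everything else — including the fallback after a coherence failure
/// or device reset — uses the full distributed protocol.
fn select_mode(d: &PeerDevice) -> CoordMode {
    if d.same_machine && d.cache_coherent && d.trusted {
        CoordMode::HardwareCoherent
    } else {
        CoordMode::FullDistributed
    }
}
```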

Selection is per-device at join time:

# Force Mode A (full protocol) for untrusted DPU
echo "rdma://pci:0000:41:00.0?mode=distributed" > /sys/kernel/umka/cluster/join

# Enable Mode B (shared memory) for trusted GPU with cache-coherent access
echo "shmem://pci:0000:03:00.0" > /sys/kernel/umka/cluster/join

Implementation note: Mode B requires PCIe ATS (Address Translation Services) or CXL.cache to ensure device accesses to system memory are cache-coherent. Non-coherent devices (most current GPUs) must use Mode A.

5.2.3 Host-Side Component: umka-peer-transport

A device participating as a multikernel peer does not require a traditional device driver on the host. A traditional driver (e.g., mlx5_core, amdgpu, nvme) must understand the device's register layout, command format, initialization sequence, error recovery procedure, and internal resource model. All of that complexity now lives inside the device's own kernel. The host never touches device registers and has no knowledge of the device's internals.

What the host requires instead is a single generic module — umka-peer-transport — that handles the PCIe connection to any UmkaOS peer device, regardless of what the device actually does:

/// Host-side state for one UmkaOS peer kernel, maintained by umka-peer-transport.
/// Identical structure for every peer device — NIC, GPU, storage, custom ASIC.
/// The device's function is irrelevant to this layer.
pub struct PeerTransport {
    /// PCIe BDF of the peer device (for unilateral controls, see Section 5.3).
    pub pcie_bdf: PcieBdf,
    /// Shared memory region for the inbound/outbound domain ring buffer pair.
    /// Allocated by the host, mapped into PCIe BAR by device at join time.
    pub ring_region: DmaBuffer,
    /// MMIO doorbell register: host writes here to signal the device.
    pub doorbell_mmio: MmioRegion,
    /// MMIO watchdog register: device writes a counter here every ~10ms.
    pub watchdog_mmio: MmioRegion,
    /// Cluster membership state (shared with the membership protocol Section 5.4).
    pub health: PeerKernelHealth,
    /// Negotiated cluster protocol version.
    pub protocol_version: u32,
}

umka-peer-transport does five things and nothing else:

  1. Enumerate — detect that the PCIe device at a given BDF exposes the UmkaOS peer capability register (a new PCI capability ID assigned to the UmkaOS protocol).
  2. Connect — allocate the shared ring buffer region, map the device's MMIO doorbell, run the cluster join handshake (Section 5.2).
  3. Monitor — poll the MMIO watchdog counter and participate in the heartbeat protocol to detect device failure (Section 5.3).
  4. Contain — execute IOMMU lockout, bus master disable, and FLR if the device fails (Section 5.3).
  5. Disconnect — handle voluntary CLUSTER_LEAVE (planned update, shutdown).

umka-peer-transport has zero device-specific logic. The same binary handles a BlueField DPU, a RISC-V AI accelerator, a computational storage device, and any future device that implements the UmkaOS peer protocol. The host kernel's dependency on device-specific code goes from hundreds of thousands of lines (per device class) to zero.
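The Monitor step can be sketched as a stall counter over the watchdog MMIO register (the device increments the counter every ~10ms; the host samples it on its own poll interval). The threshold and names here are illustrative:

```rust
// Illustrative: e.g. 10 stalled polls at a 100 ms poll interval = 1 s.
const SUSPECT_AFTER_STALLED_POLLS: u32 = 10;

struct WatchdogMonitor {
    last_value: u64,
    stalled_polls: u32,
}

impl WatchdogMonitor {
    fn new() -> Self {
        Self { last_value: 0, stalled_polls: 0 }
    }

    /// Feed one sample of the device's watchdog counter; returns true
    /// when the peer should be marked Suspect (counter stopped advancing).
    fn poll(&mut self, current: u64) -> bool {
        if current != self.last_value {
            self.last_value = current;
            self.stalled_polls = 0;
        } else {
            self.stalled_polls += 1;
        }
        self.stalled_polls >= SUSPECT_AFTER_STALLED_POLLS
    }
}
```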

5.2.4 Live Firmware Update Without Host Reboot

Because the host has no device-specific driver and no knowledge of device internals, firmware updates on the device are entirely the device's own responsibility. The host is not involved in the update itself — it only observes the device leaving and rejoining the cluster.

Update procedure from the device's perspective:

1. Device decides to update (admin command, automatic policy, or
   device-side health check triggers it).

2. Device sends CLUSTER_LEAVE (orderly departure, not a crash).
   ClusterMessageHeader { message_type: MEMBER_LEAVING, node_id: self }

3. Host receives CLUSTER_LEAVE, executes graceful shutdown:
   - Migrates any workloads running on device to surviving nodes.
   - Drains in-flight IPC channels and DSM operations (waits for
     completions, does not issue new requests to the device).
   - Revokes cluster membership cleanly (no IOMMU lockout, no FLR —
     this is voluntary, not a crash; Section 5.3 crash path not taken).
   Host is fully operational throughout. Zero disruption to workloads
   not using this device.

4. Device updates its own firmware:
   - For Path A (full UmkaOS kernel): device does a kernel rolling update
     ([Section 13.18](13-device-classes.md#live-kernel-evolution)) or reboots its own cores with new kernel image.
   - For Path B (firmware shim): device applies vendor firmware update
     procedure — entirely internal, host has no visibility.
   - For all-in-one firmware: same, internal to device.
   The host cannot observe what happens inside. It only knows the
   device is absent from the cluster.

5. Device reinitializes hardware on new firmware (internal).

6. Device sends CLUSTER_JOIN with new protocol version and capabilities.
   Host authenticates (verifies firmware signature per [Section 9.3](09-security.md#verified-boot-chain)),
   negotiates protocol version, exchanges updated topology.
   Device rejoins — may announce new capabilities (e.g., firmware
   added support for a new DSM extension).

7. Workloads migrate back to device (or new workloads scheduled).

Update cadence is fully independent per device:

| Component | Update authority | Host reboot required? |
|---|---|---|
| Host kernel | Host admin | No (Section 13.18 live evolution) or yes |
| Device firmware (any path) | Device / device admin | Never |
| UmkaOS cluster protocol | Negotiated at join | No (backward-compatible range) |
| Device hardware capabilities | Announced at rejoin | No |

A device can update its firmware multiple times per day. The host never reboots. The host's umka-peer-transport module never changes. The only host-visible event is the device being absent for the duration of the update (seconds to minutes, device-dependent).

For Path A devices running full UmkaOS, Section 13.18 (Live Kernel Evolution) makes it possible to update individual kernel subsystems without even leaving the cluster — no CLUSTER_LEAVE at all. This is a future optimization requiring Section 13.18 to be implemented on the device kernel, but it is architecturally reachable.

5.2.5 Attack Surface Reduction

The shift from device-specific drivers to a generic peer transport has a significant security consequence. Traditional device drivers run in Ring 0 with full kernel privileges. A single memory-safety bug anywhere in driver code equals full kernel compromise. Driver code is the dominant source of kernel CVEs:

Linux kernel CVE distribution (approximate, 2020-2024):
  ~50% — driver bugs (memory safety, race conditions, use-after-free)
  ~15% — networking stack
  ~10% — filesystem
  ~25% — other subsystems

Lines of driver code per device class (approximate):
  mlx5 (Mellanox NIC):   ~150,000 lines in Ring 0
  amdgpu (AMD GPU):      ~700,000 lines in Ring 0
  i915 (Intel GPU):      ~400,000 lines in Ring 0
  nvme (NVMe storage):   ~15,000 lines in Ring 0

In the UmkaOS multikernel model:

umka-peer-transport (all devices, combined):  ~2,000 lines in Ring 0
Device-specific code on device:               lives behind IOMMU boundary
                                               cannot reach host kernel memory
                                               even if completely compromised

A vulnerability in device firmware (whether that is a full UmkaOS kernel, a firmware shim, or an all-in-one firmware) cannot escalate to the host kernel. The IOMMU is the hard boundary (Section 5.3). The firmware can be replaced or compromised entirely; the host kernel's critical structures (text, stacks, capability tables, scheduler state) remain unreachable.

The host's trust relationship with a peer device is:

1. At join time: verify firmware signature (Section 9.3) — the device presents a signed identity. If the signature is invalid, the join is rejected.
2. During operation: treat all messages as untrusted input — validate message type, version, checksum, and semantic correctness before acting. Same discipline as any network protocol.
3. On failure: IOMMU lockout bounds the damage regardless of what the device firmware does (Section 5.3).

This is a fundamentally different trust model than Ring 0 driver code, which must be trusted completely because there is no boundary between it and the kernel.
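The "untrusted input" discipline during operation can be sketched as a validation gate that every inbound peer message passes before any semantic processing. This is a hypothetical illustration, not the actual umka-protocol wire format — the magic value, version range, size limit, and checksum algorithm here are placeholders:

```rust
// Hypothetical message header; field names and the toy checksum are
// illustrative placeholders, not the real umka-protocol wire format.
pub struct MsgHeader {
    pub magic: u32,
    pub version: u16,
    pub payload_len: u32,
    pub checksum: u32,
}

pub const MAGIC: u32 = 0x554D_4B41; // "UMKA" (assumed)
pub const VER_MIN: u16 = 1;
pub const VER_MAX: u16 = 3;
pub const MAX_PAYLOAD: u32 = 64 * 1024;

#[derive(Debug, PartialEq)]
pub enum MsgError {
    BadMagic,
    BadVersion,
    BadLength,
    BadChecksum,
}

// Toy additive checksum standing in for the real algorithm.
pub fn checksum(payload: &[u8]) -> u32 {
    payload.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}

/// Validate every field before acting on the message — the same
/// discipline applied to any untrusted network input.
pub fn validate(hdr: &MsgHeader, payload: &[u8]) -> Result<(), MsgError> {
    if hdr.magic != MAGIC {
        return Err(MsgError::BadMagic);
    }
    if hdr.version < VER_MIN || hdr.version > VER_MAX {
        return Err(MsgError::BadVersion);
    }
    if hdr.payload_len > MAX_PAYLOAD || hdr.payload_len as usize != payload.len() {
        return Err(MsgError::BadLength);
    }
    if checksum(payload) != hdr.checksum {
        return Err(MsgError::BadChecksum);
    }
    Ok(())
}
```

Rejected messages are dropped (and counted for diagnostics); nothing a peer sends reaches kernel logic without passing this gate.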

5.2.6 Toward a Universal Device Protocol

The three-path model (Section 5.2) has an implication beyond UmkaOS: it describes a universal device management protocol that could eliminate device-specific drivers entirely.

The core observation: every PCIe device — from a simple NIC to a full DPU — can be modeled as a cluster peer that advertises its capabilities. The host doesn't need device-specific knowledge. It needs to know what the device can do and handle everything else locally.

How it works for any device class:

Device boots → speaks cluster protocol → CLUSTER_JOIN with capabilities:

Simple NIC (E810-class):
  capabilities: [RSS, FLOW_DIRECTOR, VXLAN_DECAP, CHECKSUM_OFFLOAD, SR_IOV]
  → Host runs: TCP/IP, firewall, eBPF, congestion control, routing
  → Device runs: packet classification, checksum, encap/decap

Smart NIC (Bluefield-class):
  capabilities: [FULL_KERNEL, TCP_OFFLOAD, FIREWALL, EBPF, IPSEC, RDMA]
  → Host runs: almost nothing — thin peer transport relay
  → Device runs: full networking stack on its own cores

NVMe controller:
  capabilities: [BLOCK_IO, NAMESPACE_MGMT, ZONED_APPEND, HW_CRYPTO]
  → Host runs: filesystem, page cache, I/O scheduling
  → Device runs: flash translation, wear leveling, encryption

GPU:
  capabilities: [COMPUTE, RENDER, DISPLAY, VIDEO_DECODE, P2P_DMA]
  → Host runs: command submission scheduling, memory management policy
  → Device runs: shader execution, display scanout, video decode

What this replaces:

Today (device-specific drivers) With universal peer protocol
~700K lines of NIC drivers in Linux (e810, mlx5, bnxt, igb, ...) ~2K generic umka-peer-transport + capability dispatch
Every new NIC = write + upstream a driver New NIC = firmware speaks protocol on day one
Driver bugs crash the kernel (Ring 0) Device is a peer behind IOMMU — host never crashes
Vendor-specific management tools (ethtool quirks, proprietary ioctls) One protocol for config, health, firmware lifecycle
Firmware update often requires reboot CLUSTER_LEAVE → flash → CLUSTER_JOIN

The host-side logic becomes capability-driven, not device-driven:

/// Host receives CLUSTER_JOIN from a device and fills in the gaps.
fn configure_device_peer(peer: &PeerNode) {
    // Device advertises what it can do — host runs everything else.
    if !peer.capabilities.contains(TCP_OFFLOAD) {
        // Run TCP/IP stack locally, route packets to/from peer
        host_networking.attach_local_stack(peer.interface);
    }
    if !peer.capabilities.contains(FIREWALL) {
        // Run netfilter locally for this interface
        host_netfilter.attach(peer.interface);
    } else {
        // Push firewall rules to device
        peer.send_config(NetfilterRules::current());
    }
    if !peer.capabilities.contains(CHECKSUM_OFFLOAD) {
        // Software checksum on host
        host_networking.enable_software_checksum(peer.interface);
    }
    // ... same pattern for every capability
}

Precedent: CXL already does this for memory. CXL Type 3 devices (memory expanders) are self-describing — the host doesn't need a device-specific driver. CXL defines standardized discovery, management, and data protocols. The UmkaOS peer protocol generalizes CXL's approach from memory devices to all device classes. USB class drivers are another precedent — a USB mass storage device works without a vendor driver because it speaks a standard class protocol. PCIe never achieved this level of standardization above the transport layer.

Path to adoption:

  1. UmkaOS proves the model works with existing hardware (Path C firmware shims for major NIC/NVMe families; Path A for DPUs with native support).
  2. One or two vendors ship native peer protocol support in firmware (likely DPU vendors first — NVIDIA Bluefield, Intel IPU — since they already have general-purpose cores and benefit most from standardized management).
  3. PCIe-SIG or CXL Consortium evaluates standardization of a device peer capability structure (building on CXL's existing discovery model).
  4. Eventually: PCIe devices ship with a standard "device peer protocol" capability in their PCIe capability list, the same way devices today advertise MSI-X, AER, or SR-IOV support.

The multikernel peer model is not just an UmkaOS feature. It is a potential industry standard that could eliminate the driver problem at its root — by making devices self-describing cluster members rather than opaque hardware that requires device-specific software.

5.2.7 Hierarchical Cluster Topology

From the host's perspective, every device — local or remote, simple or complex — is a node in a uniform cluster tree. Each node advertises a set of child resources (capabilities, compute, memory, I/O ports). The host doesn't distinguish between "a PCIe device" and "a remote server" at the management layer — both are nodes with different capability sets and different latencies.

Example topology as seen by the host:

Cluster
├── Host node: "server-1" (this machine)
│   ├── 64x CPU cores
│   ├── 256 GB DRAM (4 NUMA zones)
│   ├── Local Tier 1 drivers (USB HCI, audio codec — too simple for peering)
│   └── Peer devices (below)
├── Peer node: "bf3-nic0" (Bluefield-3, PCIe slot 2, ~100ns RTT)
│   ├── 16x ARM A78 cores             [COMPUTE]
│   ├── 16 GB DRAM                    [MEMORY]
│   ├── 2x 100G Ethernet ports        [NETWORK]
│   ├── eSwitch engine                [FLOW_DIRECTOR, L2_L3_OFFLOAD]
│   ├── Crypto engine                 [IPSEC, TLS_OFFLOAD]
│   ├── RegEx engine                  [DPI, PATTERN_MATCH]
│   └── ConnectX-7 RDMA              [RDMA_VERBS, ROCEV2]
├── Peer node: "e810-nic1" (Intel E810, PCIe slot 3, ~100ns RTT)
│   ├── 2x 100G Ethernet ports        [NETWORK]
│   ├── RSS (128 queues)              [RSS]
│   ├── Flow Director (8192 rules)    [FLOW_DIRECTOR]
│   └── SR-IOV (256 VFs)             [SR_IOV]
│   (no COMPUTE, no MEMORY — host runs TCP/IP, firewall, eBPF locally)
├── Peer node: "nvme0" (NVMe SSD, M.2 slot, ~100ns RTT)
│   ├── 4 namespaces (1TB each)       [BLOCK_IO, NAMESPACE_MGMT]
│   ├── HW crypto (AES-256-XTS)      [HW_CRYPTO]
│   └── Zoned append                  [ZONED_APPEND]
├── Peer node: "gpu0" (NVIDIA GPU, PCIe slot 1, ~200ns RTT)
│   ├── 16384 CUDA cores              [COMPUTE, SHADER]
│   ├── 24 GB VRAM                    [MEMORY, DEVICE_LOCAL]
│   ├── Video decode engine            [VIDEO_DECODE]
│   ├── Display outputs (3x DP, 1x HDMI) [DISPLAY]
│   └── P2P DMA                       [P2P_DMA → nvme0, bf3-nic0]
├── Remote node: "server-2" (RDMA-attached, ~3μs RTT)
│   ├── 128x CPU cores                [COMPUTE]
│   ├── 512 GB DRAM                   [MEMORY, RDMA_ACCESSIBLE]
│   ├── (server-2's own peer devices are opaque from here —
│   │    server-2 manages them locally in its own cluster tree)
│   └── Exported capabilities:        [BLOCK_IO, COMPUTE, MEMORY]
└── Remote node: "server-3" (RDMA-attached, ~5μs RTT)
    └── Exported capabilities:        [BLOCK_IO, MEMORY]

Key properties of this model:

  1. Uniform abstraction. A Bluefield DPU, an E810 NIC, an NVMe SSD, a GPU, and a remote server are all the same thing: a node with capabilities at a measured latency. The cluster manager, scheduler, and capability dispatch logic don't have device-class-specific code paths.

  2. Recursive composition. Each remote node manages its own local cluster tree internally. From server-1's perspective, server-2 is a single node that exports aggregated capabilities. server-2 internally sees its own GPUs, DPUs, and SSDs as peer nodes. The cluster is a tree of trees — each level manages the level below it.

  3. Capability-driven scheduling. "Find a node with IPSEC capability and 100G throughput" returns both bf3-nic0 (local, ~100ns) and server-2's exported capabilities (remote, ~3μs). The scheduler picks based on latency, current load, and data locality — the same algorithm for local devices and remote servers.

  4. Graceful heterogeneity. The E810 with no compute resources and the Bluefield with 16 ARM cores are both valid network-capable nodes. The host automatically runs TCP/IP locally for the E810 and offloads it to the Bluefield. No configuration needed — the capability advertisement drives the decision.

  5. P2P awareness. The topology captures direct peer-to-peer paths (e.g., gpu0 can DMA directly to nvme0 and bf3-nic0 without host involvement). The scheduler uses this for GPU-direct-storage and GPU-direct-RDMA paths.
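Property 3's query ("find a node with IPSEC capability and 100G throughput") reduces to a filter-then-rank pass over the node set. A minimal sketch, with assumed capability bit names and the latencies from the topology above (the real scheduler also weighs load and data locality):

```rust
// Illustrative capability bits — not the real PeerCapFlags encoding.
const IPSEC: u32 = 1 << 0;
const THROUGHPUT_100G: u32 = 1 << 1;

#[derive(Debug)]
struct Node {
    name: &'static str,
    caps: u32,
    rtt_ns: u64,
}

/// Pick the lowest-latency node advertising all required capabilities.
fn place<'a>(nodes: &'a [Node], required: u32) -> Option<&'a Node> {
    nodes
        .iter()
        .filter(|n| n.caps & required == required)
        .min_by_key(|n| n.rtt_ns)
}
```

The same function serves local devices and remote servers — only the measured `rtt_ns` differs, which is the point of the uniform abstraction.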

5.2.8 Topology Discovery

Cluster formation is explicit (no auto-discovery magic):

1. Admin configures cluster membership:
   echo "rdma://10.0.0.2/mlx5_0" > /sys/kernel/umka/cluster/join

2. Kernel establishes RDMA control channel to target node:
   - Create RDMA queue pair (reliable connected)
   - Perform authenticated key exchange:
     - Each node has an Ed25519 signing key pair (for authentication) and an X25519
       Diffie-Hellman key pair (for key exchange)
     - Nodes exchange X25519 public keys and authenticate them with Ed25519 signatures
     - Shared secret derived via X25519 DH, then HKDF-SHA256 to derive session keys
   - Mutual authentication via pre-shared cluster secret or PKI
   - **Ephemeral key exchange for forward secrecy**: after authenticating with the
     long-term Ed25519/ML-DSA-65 keys, both nodes perform an additional X25519 ECDH
     exchange using **ephemeral** (per-session) key pairs. Each node generates a fresh
     X25519 key pair at session establishment time; these ephemeral keys are never
     persisted to disk and are zeroed from memory after the shared secret is derived.
     The final session symmetric key is derived as:
     ```
     session_key = HKDF-SHA256(
         ikm = static_shared_secret || ephemeral_shared_secret,
         salt = initiator_nonce || acceptor_nonce,
         info = b"umkaos-peer-v1"
     )
     ```
     where `static_shared_secret` is from the long-term X25519 exchange,
     `ephemeral_shared_secret` is from the ephemeral X25519 exchange, and
     the salt is a 32-byte random value contributed by both nodes (16 bytes
     each, concatenated). This ensures that compromise of the long-term signing key
     or the long-term X25519 key does not allow decryption of recorded past sessions:
     the ephemeral keys needed to reconstruct `ephemeral_shared_secret` no longer
     exist after session setup. The ephemeral key pair is regenerated on every
     reconnection (including reconnection after RDMA link recovery), so each session
     has independent forward secrecy.

   - **Quantum resistance note**: Session key establishment uses X25519+HKDF-SHA256
     for the initial cluster transport. This is quantum-vulnerable but acceptable for
     Phase 2 because: (1) ephemeral per-peer HMAC keys are rotated every 120s per the
     heartbeat interval (distinct from the 24-hour cluster session key rotation in
     Section 5.2.4), (2) the attack window is too short for store-and-decrypt —
     an adversary would need to break X25519 within the session lifetime, and (3) hybrid
     ML-KEM-768+X25519 session establishment is planned for Phase 3 when the PQC
     transport is validated (see [Section 24.12](24-roadmap.md#kabi-idl-compiler-specification)). Capability
     signatures already use ML-DSA-65 (PQC) because they are long-lived and stored —
     a captured capability token is vulnerable to future quantum cryptanalysis.

3. Exchange topology information:
   - Each node sends its device registry summary (devices, NUMA, accelerators)
   - Each node sends its memory availability (total, available for remote access)
   - Each node sends its RDMA capabilities (bandwidth, latency, features)

4. Measure link quality:
   - Ping-pong latency measurement (RTT)
   - Bandwidth probe (bulk RDMA write)
   - Results stored in ClusterNode.measured_rtt_ns / measured_bw_bytes_per_sec

5. Fabric topology construction:
   - If InfiniBand: query subnet manager for switch topology
   - If RoCEv2: infer topology from latency measurements + LLDP
   - Build cluster-wide distance matrix (for scheduling and placement)

6. Cluster is operational.
   Heartbeat monitoring begins (RDMA send every 100ms per node).
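The step 4 link-quality numbers feed placement decisions directly, so outliers matter. A minimal sketch of reducing raw probe results to the stored values, under the assumption that median filtering over the ping-pong samples is used (the actual estimator is not specified here):

```rust
/// Reduce ping-pong RTT samples (ns) to a stored one-way latency estimate.
/// Median is robust against scheduling and interrupt outliers.
fn one_way_latency_ns(rtt_samples: &mut [u64]) -> u64 {
    rtt_samples.sort_unstable();
    let median_rtt = rtt_samples[rtt_samples.len() / 2];
    median_rtt / 2 // one-way ≈ half the round trip (assumes symmetric links)
}

/// Reduce a bulk RDMA write probe to sustained bandwidth (bytes/sec).
fn bandwidth_bytes_per_sec(bytes_transferred: u64, elapsed_ns: u64) -> u64 {
    bytes_transferred.saturating_mul(1_000_000_000) / elapsed_ns.max(1)
}
```

The results populate `ClusterNode.measured_rtt_ns` and `measured_bw_bytes_per_sec`, which step 5 then uses to build the distance matrix.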

5.2.9 Peer Registry and Topology

The cluster uses two complementary data structures for peer management: a peer registry for discovery ("what exists and what can it do?") and a topology graph for routing ("how do I reach it, and at what cost?").

Both are dynamic — no compile-time constants limit cluster size or topology shape.

5.2.9.1 Peer Registry

Every peer in the cluster is tracked by a flat, eventually-consistent registry. The registry is used for discovery and capability queries, not for data-path decisions.

/// Unique peer identifier. Assigned at peer join time. Never reused within a
/// cluster epoch. 64-bit to avoid birthday collisions across cluster lifetimes.
/// Wraps `NonZeroU64` because PeerId 0 is reserved as "no peer" sentinel.
/// This enables `Option<PeerId>` to be 8 bytes via Rust niche optimization
/// (matching the wire format, which uses 0 as None sentinel).
/// Derives: Clone, Copy, Eq, PartialEq, Ord, PartialOrd, Hash, Debug.
#[derive(Clone, Copy, Eq, PartialEq, Ord, PartialOrd, Hash, Debug)]
#[repr(transparent)]
pub struct PeerId(pub NonZeroU64);

impl PeerId {
    /// Create a PeerId from a raw u64. Returns None if val == 0.
    pub fn new(val: u64) -> Option<Self> {
        NonZeroU64::new(val).map(Self)
    }
    /// Return the underlying u64 value (always non-zero).
    pub fn as_u64(self) -> u64 {
        self.0.get()
    }
}

/// What kind of peer this is. Determines protocol capabilities and Raft
/// eligibility (firmware shim peers are never Raft voters).
#[repr(u8)]
pub enum PeerType {
    /// Host or DPU running the full UmkaOS kernel. Full protocol capabilities.
    FullKernel = 0,
    /// Device (NVMe, SAS controller, smart NIC, GPU) whose firmware implements
    /// the umka-protocol shim. No driver on the host — the device IS a peer.
    /// Implementable in 8-12K lines of C on existing RTOSes.
    FirmwareShim = 1,
}

/// Liveness state. Transitions: Alive → Suspect → Dead (one-way).
/// A peer that recovers after being declared Dead must re-join with a new
/// generation number — it is a new logical peer.
#[repr(u8)]
pub enum PeerStatus {
    /// Heartbeat received within the last `heartbeat_timeout` interval.
    Alive = 0,
    /// Heartbeat missed for `suspect_threshold` consecutive intervals.
    /// The peer is still considered reachable but degraded. DSM operations
    /// to this peer use longer timeouts. No failover yet.
    Suspect = 1,
    /// Heartbeat missed for `dead_threshold` consecutive intervals, or
    /// explicit leave notification received. DSM state is invalidated,
    /// locks held by this peer are released, Raft voter set is updated.
    Dead = 2,
    /// Graceful shutdown in progress. Peer is draining in-flight operations
    /// and transferring state before departing. Treated as Alive for
    /// in-flight operations but excluded from new work placement.
    Leaving = 3,
}
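The transition rules can be sketched as a pure function of the current state and the consecutive missed-heartbeat count. The thresholds here are illustrative, not the configured defaults, and `Leaving` is omitted because it is driven by the explicit leave protocol rather than by heartbeats:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum PeerStatus {
    Alive,
    Suspect,
    Dead,
}

// Illustrative thresholds; the real values are per-cluster configuration.
const SUSPECT_THRESHOLD: u32 = 3;
const DEAD_THRESHOLD: u32 = 10;

/// One-way liveness transition: Dead is absorbing — a recovered peer
/// re-joins with a new generation number as a new logical peer.
fn next_status(current: PeerStatus, missed_heartbeats: u32) -> PeerStatus {
    match current {
        PeerStatus::Dead => PeerStatus::Dead,
        _ if missed_heartbeats >= DEAD_THRESHOLD => PeerStatus::Dead,
        _ if missed_heartbeats >= SUSPECT_THRESHOLD => PeerStatus::Suspect,
        _ => PeerStatus::Alive,
    }
}
```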

bitflags! {
    /// Capabilities advertised by a peer at join time and updated via the
    /// membership protocol. Used for discovery: "which peers can serve
    /// block I/O?", "which peers participate in DSM?", etc.
    pub struct PeerCapFlags: u32 {
        /// Peer provides block storage services (NVMe, SAS, virtio-blk).
        const BLOCK_STORAGE    = 1 << 0;
        /// Peer provides filesystem services to the cluster.
        const FILESYSTEM       = 1 << 1;
        /// Peer provides accelerator services (GPU, FPGA, inference).
        const ACCELERATOR      = 1 << 2;
        /// Peer participates in distributed shared memory ([Section 6.2](06-dsm.md#dsm-design-overview)).
        const DSM_PARTICIPANT  = 1 << 3;
        /// Peer is eligible to vote in Raft consensus ([Section 5.8](#failure-handling-and-distributed-recovery--split-brain-resolution)).
        /// Never set for FirmwareShim peers. Requires runtime role assignment
        /// by the cluster admin in addition to this capability.
        const RAFT_VOTER       = 1 << 4;
        /// Peer can hold distributed locks (DLM, [Section 15.15](15-storage.md#distributed-lock-manager)).
        const DLM_HOLDER       = 1 << 5;
        /// Peer supports RDMA data transfers (not just control messages).
        const RDMA_CAPABLE     = 1 << 6;
        /// Peer supports CXL.mem coherent memory sharing.
        const CXL_COHERENT     = 1 << 7;
        /// Peer provides external network access (L2/L3/RDMA proxy).
        const EXTERNAL_NETWORK = 1 << 8;
        /// Peer provides serial port access (management consoles, industrial I/O).
        const SERIAL_PORT      = 1 << 9;
        /// Peer provides USB device forwarding (any USB device class).
        const USB_DEVICE       = 1 << 10;
        /// Peer provides TPM 2.0 services (attestation, sealing, RNG).
        const TPM_SERVICE      = 1 << 11;
    }
}

**Capability and service inventory:**

| PeerCapFlag | ServiceId | Subsystem | Specification | Provider Types |
|---|---|---|---|---|
| `BLOCK_STORAGE` | `block_io` | Block layer | [Section 15.13](15-storage.md#block-storage-networking--block-service-provider) | Device-native (NVMe/SAS shim), host-proxy |
| `FILESYSTEM` | `vfs_mount` | VFS | [Section 14.11](14-vfs.md#fuse-filesystem-in-userspace--vfs-service-provider) | Host-proxy, host-native |
| `ACCELERATOR` | `accel_compute` | Accelerator framework | [Section 22.7](22-accelerators.md#accelerator-networking-rdma-and-linux-gpu-compatibility--accelerator-service-provider) | Device-native (GPU shim), host-proxy |
| `EXTERNAL_NETWORK` | `external_nic` | Network | [Section 16.31](16-networking.md#network-service-provider) | Host-proxy, host-native |
| `SERIAL_PORT` | `serial` | Serial/TTY | [Section 21.1](21-user-io.md#tty-and-pty-subsystem--serial-service-provider-cluster-wide-serial-access) | Host-proxy |
| `USB_DEVICE` | `usb_device` | USB | [Section 13.29](13-device-classes.md#usb-device-forwarding-service-provider) | Host-proxy |
| `TPM_SERVICE` | `tpm` | Security | [Section 9.4](09-security.md#tpm-runtime-services--tpm-service-provider-cluster-wide-tpm-access) | Host-proxy |
| `DSM_PARTICIPANT` |  | DSM | [Section 6.2](06-dsm.md#dsm-design-overview) | Full kernel peers only |
| `RAFT_VOTER` |  | Consensus | [Section 5.8](#failure-handling-and-distributed-recovery--split-brain-resolution) | Full kernel peers only |
| `DLM_HOLDER` |  | DLM | [Section 15.15](15-storage.md#distributed-lock-manager) | Full kernel peers, some shims |
| `RDMA_CAPABLE` |  | Transport | [Section 5.4](#rdma-native-transport-layer) | Peers with RDMA NIC |
| `CXL_COHERENT` |  | Transport | [Section 5.9](#cxl-30-fabric-integration) | CXL-attached devices |

DPU-specific services (not cluster-wide, local to the host-DPU pair):

| ServiceId | Description | Specification |
|---|---|---|
| `nic_offload` | NIC hardware offload (checksum, TSO, RSS) | [Section 5.11](#smartnic-and-dpu-integration) |
| `nvmeof_target` | NVMe-oF storage target | [Section 22.6](22-accelerators.md#in-kernel-inference-engine--use-cases) |
| `ipsec_offload` | IPsec encryption/decryption offload | [Section 22.6](22-accelerators.md#in-kernel-inference-engine--use-cases) |

Custom services are registered via `PeerServiceEndpoint`
([Section 5.7](#network-portable-capabilities--custom-service-endpoints)). Any subsystem can
define new ServiceIds — the framework is open-ended.

/// Per-peer metadata stored in the registry. 128 bytes, cache-line aligned.
/// All padding is explicit — no compiler-generated holes.
///
/// Field layout (offsets verified):
///   offset  0: peer_id         (8 bytes)
///   offset  8: name           (64 bytes)
///   offset 72: capabilities    (4 bytes)
///   offset 76: status          (1 byte)
///   offset 77: peer_type       (1 byte)
///   offset 78: _pad0           (2 bytes) — aligns generation to 4-byte boundary
///   offset 80: generation      (4 bytes)
///   offset 84: _pad1           (4 bytes) — aligns join_epoch_ns to 8-byte boundary
///   offset 88: join_epoch_ns   (8 bytes)
///   offset 96: _pad2          (32 bytes) — align(64) rounds 96 → 128
#[repr(C, align(64))]
pub struct PeerInfo {
    /// Globally unique peer identifier.
    pub peer_id: PeerId,                    // 8 bytes  (offset 0)
    /// Human-readable name (hostname or device serial). Null-terminated, max 63 bytes.
    pub name: [u8; 64],                     // 64 bytes (offset 8)
    /// Capability bitflags. Immutable after join (capability changes require re-join).
    pub capabilities: PeerCapFlags,         // 4 bytes  (offset 72)
    /// Current liveness state.
    pub status: PeerStatus,                 // 1 byte   (offset 76)
    /// Peer type (full kernel or firmware shim).
    pub peer_type: PeerType,                // 1 byte   (offset 77)
    /// Explicit padding for u32 alignment of `generation`.
    pub _pad0: [u8; 2],                     // 2 bytes  (offset 78)
    /// Monotonically increasing generation number. Incremented on each join.
    /// Distinguishes a rebooted peer from a new one at the same address.
    /// u32 longevity: at 1 join/second (extreme churn), wraps after ~136 years.
    /// In practice, peer joins are rare events (reboots, maintenance), so u32
    /// provides effectively infinite headroom. Protocol-mandated u32 (wire format).
    pub generation: u32,                    // 4 bytes  (offset 80)
    /// Explicit padding for u64 alignment of `join_epoch_ns`.
    pub _pad1: [u8; 4],                     // 4 bytes  (offset 84)
    /// Cluster-wide timestamp (ns) when this peer joined. Used for leader
    /// election tiebreaking and diagnostic ordering.
    pub join_epoch_ns: u64,                 // 8 bytes  (offset 88)
    /// Explicit padding to reach 128 bytes (align(64) rounds 96 → 128).
    pub _pad2: [u8; 32],                    // 32 bytes (offset 96)
}
const_assert!(core::mem::size_of::<PeerInfo>() == 128);

/// The cluster-wide peer registry. Dynamic, heap-allocated, no fixed-size limit.
///
/// **Consistency model**: eventually consistent. Updates propagate via the
/// membership gossip protocol (push-pull anti-entropy, converges in O(log N)
/// gossip rounds). The registry is NOT replicated via Raft — it is soft state
/// reconstructable from peer heartbeats and join/leave events.
///
/// **Concurrency**: RCU-protected for read-heavy workloads. Reads (capability
/// lookups during I/O dispatch) are lock-free. Writes (peer join/leave/status
/// change) take a write lock and publish via RCU.
pub struct PeerRegistry {
    /// Map from PeerId to PeerInfo. XArray keyed by PeerId (u64) — O(1)
    /// lookup with native RCU-compatible reads (lock-free). Grows dynamically.
    peers: XArray<PeerInfo>,
    /// Monotonically increasing generation counter. Incremented on every
    /// mutation (add, remove, status change). Used by heartbeat protocol
    /// to detect registry staleness and trigger delta sync.
    generation: AtomicU64,
}

impl PeerRegistry {
    /// Look up a peer by ID. Returns None if the peer is not registered
    /// or has status Dead. Lock-free (RCU read-side).
    pub fn get(&self, id: PeerId) -> Option<&PeerInfo> { ... }

    /// Find all peers with the given capability. Used for discovery:
    /// "which peers can serve block I/O?" Returns a snapshot (may be
    /// slightly stale — callers must handle peer-gone errors gracefully).
    /// Bounded: max `MAX_CLUSTER_PEERS` (1024) entries. Cold-path only
    /// (service discovery, placement decisions). Not called per-packet.
    pub fn peers_with_cap(&self, cap: PeerCapFlags) -> Vec<PeerId> { ... }

    /// Register a new peer. Called during the join protocol (Section 5.2.8).
    /// Increments the registry generation counter.
    ///
    /// **Capacity enforcement**: The caller (Raft leader processing a JoinRequest)
    /// MUST check `self.active_count() < MAX_CLUSTER_PEERS` before calling `add()`.
    /// If the cluster is at capacity, respond with
    /// `JoinReject { reason: ClusterFull }`. This check is serialized by the Raft
    /// log — concurrent join requests are processed sequentially by the leader.
    /// The `PartitionBitmap` (`[u64; 16]` = 1024 bits) is indexed by
    /// `PeerSlotIndex` (recycled dense index), not raw `PeerId`. Slot indices
    /// are allocated from a free list and recycled after dead-peer GC, so the
    /// bitmap can represent up to `MAX_CLUSTER_PEERS` concurrent peers regardless
    /// of cumulative PeerId growth. Admitting a peer beyond `MAX_CLUSTER_PEERS`
    /// concurrent slots would fail with `ClusterFull`.
    pub fn add(&self, info: PeerInfo) { ... }

    /// Number of active (non-Dead, non-Leaving) peers in the registry.
    /// Used for capacity enforcement at join time.
    pub fn active_count(&self) -> usize { ... }

    /// Transition a peer to Dead status. Does NOT remove the entry — dead
    /// peers are retained for diagnostic queries and to prevent PeerId reuse
    /// within the current cluster epoch. Garbage-collected after a configurable
    /// retention period (default: 1 hour).
    pub fn mark_dead(&self, id: PeerId) { ... }

    /// Current generation counter. Compared against heartbeat's
    /// `membership_gen` field to detect staleness.
    pub fn generation(&self) -> u64 { ... }
}
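To make the registry semantics concrete, here is a toy model with `HashMap` standing in for the RCU-protected `XArray`, and abbreviated capability bits. This is not the kernel implementation — it only mirrors the observable behavior of `add`, `mark_dead`, `peers_with_cap`, and the generation counter:

```rust
use std::collections::HashMap;

const BLOCK_STORAGE: u32 = 1 << 0;
const RDMA_CAPABLE: u32 = 1 << 6;

struct Entry {
    caps: u32,
    dead: bool,
}

struct Registry {
    peers: HashMap<u64, Entry>,
    generation: u64, // bumped on every mutation, as in PeerRegistry
}

impl Registry {
    fn new() -> Self {
        Registry { peers: HashMap::new(), generation: 0 }
    }
    fn add(&mut self, id: u64, caps: u32) {
        self.peers.insert(id, Entry { caps, dead: false });
        self.generation += 1;
    }
    /// Dead peers are retained (diagnostics, PeerId non-reuse), only flagged.
    fn mark_dead(&mut self, id: u64) {
        if let Some(e) = self.peers.get_mut(&id) {
            e.dead = true;
            self.generation += 1;
        }
    }
    /// Discovery query: "which live peers can serve this capability?"
    fn peers_with_cap(&self, cap: u32) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .peers
            .iter()
            .filter(|(_, e)| !e.dead && e.caps & cap == cap)
            .map(|(&id, _)| id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```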

5.2.9.2 Topology Graph

Cluster connectivity is modeled as a weighted, undirected graph. Each peer advertises its direct neighbors; the full graph is assembled by every peer from all advertisements. This replaces three previously separate models:

| Previous Model | Scope | Replaced By |
|---|---|---|
| NumaTopology::distance() (Section 4.1) | Local NUMA distances | Local edges in topology graph (seeded at cluster init) |
| ClusterDistanceMatrix (old §5.2.9) | Remote host-to-host costs | Remote edges with measured latency/bandwidth |
| PCIe/CXL topology (Section 11.1) | Device-to-host links | PCIe/CXL edges in the graph |

Important separation: local subsystems (slab allocator, page allocator, scheduler) continue to use NumaTopology::distance() directly — the topology graph adds no overhead to the local hot path. The topology graph is for cross-peer routing and cost computation only. See Section 4.1 for the local model.

bitflags! {
    /// What a link between two peers can carry. Multiple flags may be set
    /// (e.g., a CXL link supports both coherent memory and bulk data).
    pub struct LinkCapFlags: u16 {
        /// RDMA data transfer (RDMA Write/Read/Send).
        const RDMA           = 1 << 0;
        /// DMA between peers (PCIe DMA, peer-to-peer).
        const DMA            = 1 << 1;
        /// Memory-mapped I/O (PCIe BAR access).
        const MMIO           = 1 << 2;
        /// Bulk data transfer (TCP fallback, non-RDMA).
        const BULK_DATA      = 1 << 3;
        /// Control messages only (low bandwidth, e.g., I2C management).
        const CONTROL_ONLY   = 1 << 4;
        /// CXL.mem coherent memory sharing.
        const CXL_COHERENT   = 1 << 5;
    }
}

/// A single edge in the topology graph. Edges are undirected (cost is stored
/// both ways for asymmetric links) and weighted by measured performance.
///
/// Kernel-internal, not repr(C). Size is compiler-determined.
pub struct TopologyEdge {
    /// One endpoint.
    pub peer_a: PeerId,
    /// Other endpoint.
    pub peer_b: PeerId,
    /// One-way latency in nanoseconds, measured by ping-pong calibration
    /// (Section 5.2.9.5). NOT from spec sheets — real measured values.
    /// **Saturation**: values exceeding `u32::MAX` are clamped to
    /// `u32::MAX`. See `ClusterNode.measured_rtt_ns` for rationale.
    pub latency_ns: u32,
    /// Sustained bandwidth in bytes/sec, measured by bulk transfer test
    /// (Section 5.2.9.5). Used for placement decisions (large transfers
    /// prefer high-bandwidth links even at higher latency).
    pub bandwidth_bytes_per_sec: u64,
    /// What this link can carry.
    pub link_caps: LinkCapFlags,
    /// Timestamp (ns) of last measurement. Stale measurements (older than
    /// `measurement_refresh_interval`) trigger re-calibration.
    pub last_measured_ns: u64,
}

/// Maximum direct neighbors per peer. 32 covers realistic topologies:
/// a host with 8 NVMe drives, 2 GPUs, 2 NICs, and 9 remote hosts = 21.
/// If exceeded, lowest-bandwidth neighbors are pruned from advertisements
/// (they remain reachable via multi-hop routing).
pub const MAX_NEIGHBORS: usize = 32;
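The pruning rule stated above can be sketched directly: when an advertisement would exceed the limit, keep the highest-bandwidth links and drop the rest (the link tuples here are illustrative, not the real LSA encoding):

```rust
const MAX_NEIGHBORS: usize = 32;

/// Keep at most MAX_NEIGHBORS links, dropping the lowest-bandwidth ones
/// from the advertisement. Pruned neighbors stay reachable via multi-hop.
fn prune_advertisement(mut links: Vec<(u64 /* peer */, u64 /* bytes/s */)>) -> Vec<(u64, u64)> {
    links.sort_by(|a, b| b.1.cmp(&a.1)); // highest bandwidth first
    links.truncate(MAX_NEIGHBORS);
    links
}
```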

/// The cluster-wide topology graph. Assembled locally by each peer from
/// link-state advertisements (Section 5.2.9.3). Used for routing decisions
/// and cost estimation.
///
/// **Not replicated via Raft.** Topology is soft state, reconstructable
/// from LSAs. Each peer maintains its own copy.
pub struct TopologyGraph {
    /// All edges in the graph. Heap-allocated, grows with cluster size.
    /// Bounded: max `MAX_CLUSTER_PEERS * MAX_NEIGHBORS` edges (1024 * 32 = 32768).
    /// Topology changes are cold-path (LSA processing); Vec reallocation is acceptable.
    edges: Vec<TopologyEdge>,
    /// Per-peer adjacency list for fast neighbor lookup.
    /// XArray keyed by PeerId (u64), value: indices into `edges`.
    adjacency: XArray<ArrayVec<usize, MAX_NEIGHBORS>>,
    /// Monotonically increasing counter. Incremented on any topology change
    /// (edge added, removed, or weight updated). Used by CachedRoute
    /// (Section 5.2.9.4) for staleness detection.
    generation: AtomicU64,
}

impl TopologyGraph {
    /// Cost in nanoseconds to reach `dst` from `src`, considering the
    /// best path through the graph. Returns None if no path exists.
    ///
    /// **Not for hot-path use.** This runs Dijkstra (O(E log V)) on every
    /// call. Hot-path code uses CachedRoute (Section 5.2.9.5) instead.
    pub fn path_cost_ns(&self, src: PeerId, dst: PeerId) -> Option<u64> { ... }

    /// Compute the best next-hop for reaching `dst` from `src`.
    /// Returns (next_hop PeerId, total cost in ns). Used to populate
    /// CachedRoute entries.
    pub fn best_next_hop(&self, src: PeerId, dst: PeerId)
        -> Option<(PeerId, u64)> { ... }

    /// Seed local NUMA edges from NumaTopology::distance() at cluster init.
    /// These edges are purely local — they are NOT advertised via LSA.
    /// They allow end-to-end cost computation that includes local NUMA hops
    /// (e.g., "page on remote Host B, NUMA node 1" → local NUMA node 0
    /// = remote edge cost + local NUMA hop cost).
    pub fn seed_numa_edges(&mut self, numa: &NumaTopology) { ... }

    /// Apply a received LinkStateAdvertisement. Updates edges, adjacency,
    /// and increments generation if anything changed. Returns true if the
    /// graph was modified (caller should re-flood the LSA to neighbors).
    pub fn apply_lsa(&mut self, lsa: &LinkStateAdvertisement) -> bool { ... }

    /// Current generation counter.
    pub fn generation(&self) -> u64 { ... }
}

Typical edge costs (measured, not configured):

| Link Type | Latency (ns) | Bandwidth | Capabilities |
|-----------|--------------|-----------|--------------|
| Local NUMA same-socket | 300-800 | 50+ GB/s | DMA, MMIO |
| Local NUMA cross-socket (QPI/UPI) | 500-1,500 | 20-40 GB/s | DMA, MMIO |
| CXL-attached memory | 200-400 | 32-64 GB/s | CXL_COHERENT, DMA |
| PCIe device (NVMe, GPU) | 1,000-2,000 | 8-32 GB/s (gen4/5) | DMA, MMIO |
| RDMA same-switch | 3,000-5,000 | 12-25 GB/s (100-200 Gbps) | RDMA, BULK_DATA |
| RDMA cross-switch | 5,000-10,000 | 10-20 GB/s | RDMA, BULK_DATA |
| TCP fallback | 50,000-200,000 | 1-10 GB/s | BULK_DATA |

Key observation: remote RDMA is faster than local NVMe SSD (~12,000 ns for 4KB random read). Placement decisions must account for this.

Topology information is distributed using link-state flooding, analogous to OSPF. Each peer advertises its direct neighbors; all peers assemble the full graph locally.

/// A peer's advertisement of its direct neighbors. Flooded to all peers
/// on join, link change, or periodic refresh.
///
/// Size: variable, typically 200-500 bytes for a peer with 5-15 neighbors.
pub struct LinkStateAdvertisement {
    /// The peer that originated this advertisement.
    pub origin: PeerId,
    /// Monotonically increasing sequence number per origin. Receivers
    /// discard LSAs with sequence ≤ their stored sequence for this origin.
    /// Prevents stale/duplicate advertisements (standard link-state mechanism).
    pub sequence: u64,
    /// Direct neighbors with measured link properties.
    pub neighbors: ArrayVec<NeighborEntry, MAX_NEIGHBORS>,
    /// Time-to-live. Decremented at each hop. Dropped when TTL reaches 0.
    /// Default: 16 (sufficient for realistic cluster diameters).
    pub ttl: u8,
}

/// One neighbor entry within an LSA.
pub struct NeighborEntry {
    /// The neighbor's PeerId.
    pub peer_id: PeerId,
    /// Measured one-way latency (ns). Clamped to `u32::MAX`;
    /// see `ClusterNode.measured_rtt_ns` for saturation semantics.
    pub latency_ns: u32,
    /// Measured sustained bandwidth (bytes/sec).
    pub bandwidth_bytes_per_sec: u64,
    /// Link capabilities.
    pub link_caps: LinkCapFlags,
}

Distribution protocol:

  1. On peer join: after calibration (Section 5.2.9.6), the new peer sends its LSA to all direct neighbors. Neighbors re-flood to their neighbors. Full convergence in O(diameter) hops — sub-second for rack-scale clusters.
  2. On link change: if a neighbor's heartbeat latency changes by >20% or a link goes down/up, the peer issues a new LSA with incremented sequence.
  3. Periodic refresh: every peer re-advertises its LSA every 5 minutes (configurable) to recover from lost advertisements. Receivers ignore LSAs with sequence ≤ stored (idempotent).
  4. Receiver processing: on receiving an LSA, a peer compares the sequence number against its stored value for that origin. If newer: update the local TopologyGraph, decrement TTL, re-flood to all neighbors except the sender. If equal or older: discard silently.

Raft is NOT involved in topology distribution. Raft handles consensus state (security policies, cluster configuration). Topology is soft state, reconstructable from LSAs, and does not require total ordering.
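
The receiver rule in step 4 reduces to a per-origin sequence comparison. The sketch below is illustrative only: `Lsa` and `SeqTable` are simplified stand-ins, and the `HashMap` stands in for the kernel's XArray-based per-origin sequence store (the real store avoids HashMap per collection policy §3.1.13):

```rust
use std::collections::HashMap;

type PeerId = u64;

/// Simplified LSA: just the fields the receiver rule needs.
struct Lsa {
    origin: PeerId,
    sequence: u64,
}

/// Per-origin highest-sequence-seen store (illustrative stand-in).
struct SeqTable {
    seen: HashMap<PeerId, u64>,
}

impl SeqTable {
    /// Returns true if the LSA is new: the caller should then apply it
    /// to the TopologyGraph, decrement its TTL, and re-flood it to all
    /// neighbors except the sender. Equal-or-older LSAs are discarded.
    fn accept(&mut self, lsa: &Lsa) -> bool {
        let stored = self.seen.entry(lsa.origin).or_insert(0);
        if lsa.sequence <= *stored {
            return false; // stale or duplicate: silent discard
        }
        *stored = lsa.sequence;
        true
    }
}
```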

5.2.9.4 Topology LSA Wire Format

The in-memory LinkStateAdvertisement and NeighborEntry use native types (appropriate for local graph construction). For inter-node transmission, a separate wire format with Le types is used. Sent via PeerMessageType::TopologyLsa (0x00B0) and acknowledged via TopologyLsaAck (0x00B1).

/// Wire-format header for a topology link-state advertisement.
/// Followed by `neighbor_count` × `NeighborEntryWire` entries.
///
/// Message type: PeerMessageType::TopologyLsa (0x00B0).
#[repr(C)]
pub struct LinkStateAdvertisementWire {
    /// Origin peer that generated this LSA.
    pub origin: Le64,               // 8 bytes  (offset 0)
    /// Monotonically increasing per-origin sequence number.
    pub sequence: Le64,             // 8 bytes  (offset 8)
    /// Timestamp (ns) of LSA generation (for age estimation).
    pub timestamp_ns: Le64,         // 8 bytes  (offset 16)
    /// Number of NeighborEntryWire records following this header.
    pub neighbor_count: Le16,       // 2 bytes  (offset 24)
    /// Remaining time-to-live. Decremented at each forwarding hop.
    pub ttl: u8,                    // 1 byte   (offset 26)
    /// Reserved for future use; must be zeroed on send.
    pub _pad: [u8; 5],             // 5 bytes  (offset 27)
}
// Total header: 8 + 8 + 8 + 2 + 1 + 5 = 32 bytes.
const_assert!(core::mem::size_of::<LinkStateAdvertisementWire>() == 32);

/// Wire-format neighbor entry within an LSA. All integer fields use Le types
/// for correct operation on mixed-endian clusters (PPC32, s390x are big-endian).
#[repr(C)]
pub struct NeighborEntryWire {
    /// Neighbor's PeerId.
    pub peer_id: Le64,              // 8 bytes  (offset 0)
    /// Measured one-way latency in nanoseconds.
    pub latency_ns: Le32,           // 4 bytes  (offset 8)
    /// Measured sustained bandwidth in bytes/sec.
    /// Le64 alignment is 1 (byte array wrapper); no implicit padding at
    /// offset 12. Using native `u64` here would insert 4 bytes of padding
    /// and fail the const_assert.
    pub bandwidth_bytes_per_sec: Le64, // 8 bytes  (offset 12)
    /// Link capability bitflags. Le32 (not Le16) deliberately: provides
    /// 16 reserved bits for future link capability flags without wire
    /// format change. On receive, bits [31:16] are masked off for the
    /// current LinkCapFlags version.
    pub link_caps: Le32,            // 4 bytes  (offset 20)
    /// Reserved for future use; must be zeroed on send.
    pub _pad: [u8; 4],             // 4 bytes  (offset 24)
}
// Total per-neighbor: 8 + 4 + 8 + 4 + 4 = 28 bytes.
const_assert!(core::mem::size_of::<NeighborEntryWire>() == 28);

/// Total wire size: 32 + neighbor_count × 28 bytes.
/// Maximum: 32 + 32 × 28 = 928 bytes (MAX_NEIGHBORS = 32).
/// Fits in a single RDMA Send (inline for small LSAs, DMA for large).

Conversion at send/receive boundary: the sender serializes its in-memory LinkStateAdvertisement into LinkStateAdvertisementWire + N × NeighborEntryWire. The receiver deserializes back into native LinkStateAdvertisement/NeighborEntry for local graph construction. The in-memory types (TopologyEdge, NeighborEntry, LinkStateAdvertisement) retain native types — no #[repr(C)], no Le encoding.
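
The serialization boundary can be illustrated with a hand-rolled encoder/decoder pair for the 28-byte NeighborEntryWire layout. These functions are a sketch, not the kernel's serializer; they show why the byte-array `Le64` wrapper matters: the bandwidth field sits at the unaligned offset 12, which a native `u64` could not occupy without padding.

```rust
/// Illustrative encoder for the 28-byte NeighborEntryWire layout.
/// Offsets match the wire spec: peer_id @ 0, latency_ns @ 8,
/// bandwidth @ 12 (unaligned), link_caps @ 20, reserved pad @ 24.
fn encode_neighbor(peer_id: u64, latency_ns: u32, bw_bps: u64, link_caps: u32) -> [u8; 28] {
    let mut buf = [0u8; 28];
    buf[0..8].copy_from_slice(&peer_id.to_le_bytes());
    buf[8..12].copy_from_slice(&latency_ns.to_le_bytes());
    buf[12..20].copy_from_slice(&bw_bps.to_le_bytes());
    buf[20..24].copy_from_slice(&link_caps.to_le_bytes());
    // buf[24..28] stays zeroed (reserved; must be zero on send).
    buf
}

/// Matching decoder: back to native types for local graph construction.
fn decode_neighbor(buf: &[u8; 28]) -> (u64, u32, u64, u32) {
    (
        u64::from_le_bytes(buf[0..8].try_into().unwrap()),
        u32::from_le_bytes(buf[8..12].try_into().unwrap()),
        u64::from_le_bytes(buf[12..20].try_into().unwrap()),
        u32::from_le_bytes(buf[20..24].try_into().unwrap()),
    )
}
```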

5.2.9.5 Cached Routes

The topology graph is queried infrequently (on topology changes). The data path uses pre-computed cached routes for single-lookup forwarding.

/// Opaque handle into the per-peer transport connection pool. Obtained from
/// the transport layer at route computation time; used to send messages
/// without any further lookup or connection resolution on the data path.
///
/// The inner `usize` is an index into the transport layer's connection table
/// (one entry per established RDMA QP, PCIe P2P channel, or TCP fallback
/// connection). The transport layer validates the index on send and returns
/// `Err(TransportError::StaleHandle)` if the connection has been torn down
/// since the route was cached.
#[repr(transparent)]
pub struct TransportHandle(pub usize);

/// Pre-computed route to a destination peer. Stored in a per-peer lookup
/// table for O(1) data-path access.
///
/// Size: 32 bytes.
pub struct CachedRoute {
    /// Destination peer.
    pub destination: PeerId,
    /// Next hop toward the destination (may be the destination itself for
    /// direct neighbors).
    pub next_hop: PeerId,
    /// Pre-resolved transport handle — ready to send without any further
    /// lookup. Obtained from the transport layer at route computation time.
    pub transport_handle: TransportHandle,
    /// Total path cost in nanoseconds (sum of edge costs along the path).
    /// u64 to match `path_cost_ns()` return type and avoid truncation for
    /// multi-hop paths across high-latency links.
    pub cost_ns: u64,
    /// Topology generation at which this route was computed. Compared
    /// against TopologyGraph::generation() before use. Mismatch → recompute.
    pub generation: u64,
}

Data-path lookup (hot path, inlined):

/// O(1) route lookup. Called on every cross-peer message send.
#[inline(always)]
fn send_to_peer(routes: &RouteTable, dest: PeerId, msg: &[u8]) {
    let route = routes.get(dest);
    if likely(route.generation == topology.generation()) {
        // Fast path: cached route is current. One RCU-protected array lookup + send.
        route.transport_handle.send(msg);
    } else {
        // Slow path: topology changed. Recompute via Dijkstra, update cache.
        // Topology changes are seconds-to-hours apart; this branch is ~0.01%.
        let new_route = topology.recompute_route(self_id, dest);
        routes.update(dest, new_route);
        new_route.transport_handle.send(msg);
    }
}

Performance: the fast path is one RCU-protected array lookup (indexed by PeerId, O(1)) plus a function pointer call — no HashMap on the hot path (collection policy §3.1.13 prohibits it). The generation check is a single atomic load (relaxed ordering, no fence). The slow path (Dijkstra recompute) runs at most once per topology change per destination, amortized across all messages.

5.2.9.6 Edge Measurement Protocol

All edge weights in the topology graph are measured, never configured from spec sheets. Real numbers reflect actual hardware, firmware, cabling, and switch configuration.

On-join calibration (~1 second):

  1. Latency: 1000 ping-pong round trips (RDMA Send/Recv or PCIe doorbell). Discard top/bottom 5%. Median RTT ÷ 2 = one-way latency.
  2. Bandwidth: 10 MB sustained bulk transfer (RDMA Write or DMA). Measure elapsed time. Bytes ÷ time = bandwidth.
  3. Results stored in TopologyEdge.latency_ns and bandwidth_bytes_per_sec, advertised via LSA (Section 5.2.9.3).
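
A minimal sketch of the latency step (trimmed-median RTT, halved for one-way latency); `one_way_latency_ns` is an illustrative helper, not the kernel's calibration routine:

```rust
/// Illustrative on-join latency calibration (step 1 above): sort the
/// RTT samples, trim the top and bottom 5%, take the median of the
/// remainder, and halve it for one-way latency.
fn one_way_latency_ns(mut rtt_samples_ns: Vec<u64>) -> u64 {
    assert!(!rtt_samples_ns.is_empty());
    rtt_samples_ns.sort_unstable();
    let n = rtt_samples_ns.len();
    let trim = n / 20; // drop 5% from each end
    let trimmed = &rtt_samples_ns[trim..n - trim];
    let median_rtt = trimmed[trimmed.len() / 2];
    median_rtt / 2
}
```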

Periodic refresh: every N minutes (default 5, configurable per-cluster). Same procedure as on-join but interleaved with normal traffic to avoid measurement-induced jitter. Detects link degradation (failing optics, congestion changes, firmware bugs).

Free latency signal: heartbeat RTT (Section 5.8.2.2) is a continuous latency measurement. If RTT doubles relative to the calibrated value, the edge weight is updated immediately and a new LSA is issued — no need to wait for the periodic refresh cycle.
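
The >20% link-change trigger (Section 5.2.9.3, rule 2) reduces to a cheap integer check on each heartbeat. `should_reissue_lsa` below is a hypothetical helper illustrating the threshold arithmetic, not kernel code:

```rust
/// Illustrative threshold check: a >20% deviation of observed heartbeat
/// latency from the calibrated value triggers an immediate edge update
/// and a new LSA. `delta * 5 > calibrated` is the integer form of
/// `delta / calibrated > 0.2`, avoiding floating point.
fn should_reissue_lsa(calibrated_ns: u64, observed_ns: u64) -> bool {
    let delta = observed_ns.abs_diff(calibrated_ns);
    delta.saturating_mul(5) > calibrated_ns
}
```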

Local NVMe cost: each host peer measures local NVMe 4KB random read latency at calibration time (calibrate_local_storage()). This value is stored locally (not advertised) and used to decide whether to fetch a DSM page from a remote peer or read it from local SSD:

/// Should we fetch a page from a remote peer vs. from local SSD?
/// Returns true if remote fetch is expected to be faster.
///
/// Key insight: RDMA RTT (~3-5 μs) is often faster than NVMe random
/// read (~12 μs). This function makes the right call based on measured
/// values, not assumptions.
pub fn prefer_remote_over_ssd(route_cost_ns: u64, local_ssd_cost_ns: u64) -> bool {
    route_cost_ns < local_ssd_cost_ns
}

5.2.9.7 DSM Design Requirements for Subscriber-Controlled Caching

The DSM rework (deferred Q1/Q6) must support subscribers that need explicit control over caching — not only transparent page-fault-driven shared memory. A clustered filesystem (UPFS), a distributed database, or any subsystem that manages its own cache policy needs DSM to accept external hints for prefetch, invalidation, and writeback. Without these interfaces, such subsystems must build a separate cache layer on top of DSM — a bolt-on design we must avoid.

Requirements the DSM rework must satisfy:

  1. Subscriber-controlled prefetch. A subscriber (e.g., UPFS, a distributed database) knows its access patterns better than the DSM's generic fault handler. DSM must accept prefetch hints: dsm_prefetch(region, va_range, priority). The DSM fetches pages proactively without waiting for faults. Examples: UPFS sequential read-ahead, database index scan prefetch, stripe-aware prefetch.

  2. Subscriber-controlled invalidation. A subscriber must be able to invalidate specific cached pages: dsm_invalidate(region, va_range). The DSM must not serve stale cached data after the subscriber releases its coherence token. This must integrate with the DLM's targeted writeback (Section 15.15): dirty pages are flushed before invalidation, clean pages are simply discarded.

  3. Subscriber-controlled writeback. Subscribers need to control when dirty pages are written back — it cannot be purely timer-driven (the kernel's default writeback). DSM must support explicit writeback: dsm_writeback(region, va_range, sync). Use cases: DLM lock downgrade (flush dirty data before releasing write token), database checkpoint, filesystem journal commit.

  4. Integration with DLM token lifecycle. The canonical flow for a DLM-coordinated subscriber (UPFS, distributed database): acquire DLM token → access pages via DSM (page cache) → dirty pages tracked by DLM's LockDirtyTracker → on token downgrade, flush dirty pages via DSM writeback → invalidate cache → release token. When a DsmLockBinding is active, the DLM's LockDirtyTracker delegates to the binding's DsmDirtyBitmap (Section 15.15), ensuring a single source of truth for dirty page state and avoiding double bookkeeping.

  5. Per-region cache policy. Different DSM regions may have different caching characteristics: metadata regions (small, frequently accessed, long-lived cache) vs. data regions (large, sequential, short-lived cache) vs. scratch regions (no caching). DSM must support per-region cache policy configuration.

These requirements do not change the DSM's wire format or coherence protocol (MOESI) — they add control interfaces that subscribers use to guide DSM behavior. The DSM remains transparent to applications using plain mmap; the control interfaces are used by kernel-level subscribers (UPFS, distributed databases, service provider layers).
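
One possible shape for these control interfaces, as a sketch only: the function names follow the text (dsm_prefetch, dsm_invalidate, dsm_writeback), but the trait, its argument types, and the toy recorder are assumptions, not the final API.

```rust
pub type RegionId = u64;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct VaRange {
    pub start: u64,
    pub len: u64,
}

/// Hypothetical subscriber-facing control trait mirroring requirements
/// 1-3. Signatures follow the hint functions named in the text.
pub trait DsmSubscriberControl {
    fn dsm_prefetch(&mut self, region: RegionId, range: VaRange, priority: u8);
    fn dsm_invalidate(&mut self, region: RegionId, range: VaRange);
    fn dsm_writeback(&mut self, region: RegionId, range: VaRange, sync: bool);
}

/// Toy recorder used only to exercise the trait shape and the canonical
/// token-downgrade ordering (writeback before invalidate, requirement 4).
#[derive(Default)]
pub struct RecordingDsm {
    pub ops: Vec<&'static str>,
}

impl DsmSubscriberControl for RecordingDsm {
    fn dsm_prefetch(&mut self, _r: RegionId, _v: VaRange, _p: u8) {
        self.ops.push("prefetch");
    }
    fn dsm_invalidate(&mut self, _r: RegionId, _v: VaRange) {
        self.ops.push("invalidate");
    }
    fn dsm_writeback(&mut self, _r: RegionId, _v: VaRange, _s: bool) {
        self.ops.push("writeback");
    }
}
```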

5.2.9.8 DSM Foundational Types

DSM foundational types (RegionBitmap, RegionSlotMap, RegionSlotIndex, slot allocation protocol) are defined in Section 6.1.

5.2.9.9 Topology Reasoning Engine

The topology graph (Section 5.2) stores measured latencies, bandwidths, and link capabilities for every edge in the cluster. The cached routes (Section 5.2.9.5) provide O(1) next-hop lookup for the data path. The topology reasoning engine adds a structured query layer on top, enabling subsystems to make constraint-based placement and routing decisions without implementing their own graph algorithms.

Design principle: Inspired by Barrelfish's System Knowledge Base (ETH Zurich), which uses declarative constraint logic for hardware reasoning. UmkaOS implements the same principle without a logic programming engine — structured queries against the topology graph with cached results, bounded computation, and no hot-path involvement.

// umka-core/src/cluster/topology_query.rs

/// Constraint for topology queries. Multiple constraints are AND-combined.
pub enum TopologyConstraint {
    /// Path latency from query origin to candidate must be ≤ this value.
    MaxLatencyNs(u32),
    /// Path bandwidth from query origin to candidate must be ≥ this value.
    MinBandwidthBps(u64),
    /// At least one link on the path must support these capabilities.
    RequiresLinkCap(LinkCapFlags),
    /// Candidate peer must have these capability flags set.
    HasPeerCap(PeerCapFlags),
    /// Exclude this peer from results (e.g., the querier itself).
    ExcludePeer(PeerId),
    /// Rank results by proximity to this peer (lowest path cost first).
    PreferClosestTo(PeerId),
    /// Candidate must have at least this much free capacity for the
    /// specified resource type (memory bytes, CPU cores, accelerator slots).
    MinFreeCapacity { resource: ResourceType, amount: u64 },
}

/// Resource types for capacity constraints.
#[repr(u8)]
pub enum ResourceType {
    MemoryBytes   = 0,
    CpuCores      = 1,
    AccelSlots    = 2,
    StorageBytes  = 3,
    NetworkBwBps  = 4,
}

/// Result of a topology query. Fixed-size, caller-provided buffer.
pub struct TopologyQueryResult<'a> {
    /// Matching peers, sorted by cost (lowest first) if PreferClosestTo
    /// was specified, otherwise unordered.
    pub peers: &'a mut [PeerId],
    /// Number of matching peers written to `peers`.
    pub count: usize,
    /// Topology generation at query time. Used for cache validation.
    pub generation: u64,
}

Query API:

impl TopologyGraph {
    /// Find peers matching all constraints. Results written to caller's buffer.
    /// Returns the number of matching peers (may exceed buffer size — caller
    /// can re-query with a larger buffer if needed).
    ///
    /// Computational cost: O(N × E) where N = peers, E = avg edges per peer.
    /// For 1024 peers with 10 edges each: ~10K operations, <100μs.
    /// For 10 peers: <1μs.
    ///
    /// This function is NOT for the data-path hot path. Use CachedRoute
    /// (Section 5.2.9.5) for per-message routing. This function is for
    /// placement decisions, service binding, and topology-change responses.
    pub fn find_peers(
        &self,
        origin: PeerId,
        constraints: &[TopologyConstraint],
        result: &mut TopologyQueryResult,
    ) -> usize { ... }

    /// Find the best path from src to dst satisfying all constraints.
    /// Returns the path cost in nanoseconds, or None if no path satisfies
    /// the constraints. The path itself (list of hops) is written to
    /// `path_buf` if provided.
    ///
    /// Uses constrained Dijkstra: edges that violate link capability
    /// constraints are excluded from the search.
    pub fn find_path(
        &self,
        src: PeerId,
        dst: PeerId,
        constraints: &[TopologyConstraint],
        path_buf: Option<&mut ArrayVec<PeerId, 16>>,
    ) -> Option<u64> { ... }

    /// Suggest optimal peer for placing a service, combining topology
    /// constraints with affinity rules (Section 5.12).
    ///
    /// Returns the recommended PeerId and a cost in nanoseconds (lower = better).
    /// Returns None if no peer satisfies all constraints. `u64` matches
    /// `path_cost_ns()` and `CachedRoute.cost_ns` for consistency.
    pub fn suggest_placement(
        &self,
        service_id: &ServiceId,
        constraints: &[TopologyConstraint],
        affinity_rules: &[AffinityRule],
    ) -> Option<(PeerId, u64)> { ... }
}
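
The per-candidate filter step inside find_peers() can be illustrated with simplified types. This sketch covers only constraint evaluation (AND-combination); graph traversal, capacity lookup, and ranking are omitted, and `Candidate` is a hypothetical stand-in:

```rust
/// Hypothetical query candidate: a peer plus the cost of the best
/// path to it from the query origin.
#[derive(Clone, Copy)]
struct Candidate {
    peer: u64,
    path_latency_ns: u32,
    path_bw_bps: u64,
}

/// Subset of TopologyConstraint, enough to show AND-combination.
enum Constraint {
    MaxLatencyNs(u32),
    MinBandwidthBps(u64),
    ExcludePeer(u64),
}

/// A candidate matches only if every constraint holds (AND semantics).
fn satisfies(c: &Candidate, constraints: &[Constraint]) -> bool {
    constraints.iter().all(|k| match k {
        Constraint::MaxLatencyNs(max) => c.path_latency_ns <= *max,
        Constraint::MinBandwidthBps(min) => c.path_bw_bps >= *min,
        Constraint::ExcludePeer(p) => c.peer != *p,
    })
}
```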

Cached query results:

/// Cache for topology query results. Avoids re-running Dijkstra on every
/// placement decision. Invalidated automatically when the topology changes.
pub struct TopologyQueryCache {
    /// Cached results in an XArray keyed by a mixed hash of
    /// `(origin_id, constraint_hash)`. O(1) lookup per cache check;
    /// integer key → XArray per §3.1.13. Cold path (placement decisions).
    entries: XArray<CachedQueryEntry>,
    /// Topology generation at cache population time.
    generation: u64,
}

struct CachedQueryEntry {
    /// Matching peer IDs (heap-allocated, immutable after creation).
    peers: Box<[PeerId]>,
    /// Timestamp of cache entry creation (for TTL-based eviction).
    created_ns: u64,
    /// Origin peer ID for hit validation (prevents cross-query collisions).
    origin: PeerId,
    /// Constraint hash for hit validation.
    constraint_hash: u64,
}

impl TopologyQueryCache {
    /// Look up cached results. Returns None if cache miss or stale
    /// (topology generation changed).
    ///
    /// Staleness check: one u64 comparison (same pattern as CachedRoute,
    /// Section 5.2.9.5). Cache hit rate: >99% in stable clusters
    /// (topology changes are seconds-to-hours apart).
    pub fn get(
        &self,
        origin: PeerId,
        constraints: &[TopologyConstraint],
        current_generation: u64,
    ) -> Option<&[PeerId]> {
        if self.generation != current_generation { return None; }
        let c_hash = constraint_hash(constraints);
        let key = Self::cache_key_inner(origin, c_hash);
        self.entries.get(key).and_then(|e| {
            // Hit validation: verify origin and constraint_hash match
            // to prevent cross-query collisions from hash mixing.
            if e.origin == origin && e.constraint_hash == c_hash {
                Some(e.peers.as_ref())
            } else {
                None // Hash collision — treat as cache miss.
            }
        })
    }

    /// Populate cache entry from a fresh query result.
    pub fn insert(
        &mut self,
        origin: PeerId,
        constraints: &[TopologyConstraint],
        result: &[PeerId],
        generation: u64,
    ) {
        let c_hash = constraint_hash(constraints);
        let key = Self::cache_key_inner(origin, c_hash);
        self.entries.store(key, CachedQueryEntry {
            peers: result.into(),
            created_ns: current_time_ns(),
            origin,
            constraint_hash: c_hash,
        });
        self.generation = generation;
    }

    /// Compute the XArray key from origin peer and constraint hash.
    /// Uses multiplicative mixing (not XOR) to avoid cross-axis collisions
    /// where XOR would allow `key(A,X) == key(B,Y)` whenever the hashes
    /// accidentally align. The hit-validation in `get()` catches any
    /// remaining collisions. Cold-path cache — mixing cost is negligible.
    fn cache_key_inner(origin: PeerId, c_hash: u64) -> u64 {
        fxhash64(origin.as_u64().wrapping_mul(0x517cc1b727220a95) ^ c_hash)
    }

    /// Evict all entries. Called on topology change (generation increment).
    /// O(1) — just increments the generation counter. Stale entries are
    /// lazily evicted on next access.
    pub fn invalidate(&mut self, new_generation: u64) { ... }
}
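
`fxhash64` and `constraint_hash` are assumed elsewhere; as an illustration of the key-mixing idea, the sketch below substitutes a plain FNV-1a hash (a different hash than the kernel actually uses) and reproduces the multiplicative mix from cache_key_inner:

```rust
/// Plain FNV-1a 64 standing in for fxhash64 (chosen here only because
/// it fits in a few lines; not the kernel's hash).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100_0000_01b3);
    }
    h
}

/// Reproduction of the cache_key_inner mixing step: multiplicative mix
/// of origin with the constraint hash, then a final scramble. The
/// multiplication prevents the cross-axis collisions XOR alone allows.
fn cache_key(origin: u64, c_hash: u64) -> u64 {
    let mixed = origin.wrapping_mul(0x517c_c1b7_2722_0a95) ^ c_hash;
    fnv1a_64(&mixed.to_le_bytes())
}
```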

CXL constraint solver:

For CXL 3.0 memory pooling (Section 5.9), memory region allocation is a constraint satisfaction problem: find a CXL device that meets latency, bandwidth, capacity, and coherence requirements.

/// Allocate a CXL memory range satisfying constraints. Runs once at
/// region creation (e.g., DSM region, GPU memory pool), not per-access.
///
/// Uses find_peers() with CXL-specific constraints, then selects the
/// best candidate by available capacity.
pub fn allocate_cxl_range(
    topology: &TopologyGraph,
    requestor: PeerId,
    size_bytes: u64,
    constraints: &[TopologyConstraint],
) -> Result<CxlAllocation, CxlAllocError> { ... }

pub struct CxlAllocation {
    /// CXL device peer that will host the memory range.
    pub device_peer: PeerId,
    /// Physical address range on the CXL device.
    pub phys_range: PhysRange,
    /// Measured latency from requestor to this device.
    /// Saturation: clamped to `u32::MAX`; see `ClusterNode.measured_rtt_ns`.
    pub latency_ns: u32,
    /// Measured bandwidth.
    pub bandwidth_bytes_per_sec: u64,
}
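
The final selection step ("best candidate by available capacity") might look like the following sketch; `pick_by_capacity` and its tuple representation are illustrative simplifications:

```rust
/// Illustrative final selection step of allocate_cxl_range(): among
/// constraint-satisfying candidates, pick the peer with the most free
/// capacity that can hold the request. The (peer, free_bytes) tuples
/// are a simplification; the real allocator also carves a PhysRange.
fn pick_by_capacity(candidates: &[(u64, u64)], need_bytes: u64) -> Option<u64> {
    candidates
        .iter()
        .filter(|(_, free)| *free >= need_bytes)
        .max_by_key(|(_, free)| *free)
        .map(|(peer, _)| *peer)
}
```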

Performance bounds:

| Operation | Complexity | Typical (10 peers) | Max (1024 peers) |
|-----------|------------|--------------------|------------------|
| find_peers() | O(N × E) | <1 μs | <100 μs |
| find_path() (Dijkstra) | O((N + E) log N) | <2 μs | <1 ms |
| suggest_placement() | O(N × E × A), A = affinity rules | <5 μs | <5 ms |
| Cache lookup | O(1) hash + u64 compare | <100 ns | <100 ns |
| Cache invalidation | O(1) generation bump | <10 ns | <10 ns |
| allocate_cxl_range() | O(find_peers) + O(capacity check) | <5 μs | <200 μs |

All operations run off the data path. The data path uses CachedRoute (Section 5.2.9.5), which is an O(1) array index.

When queries run:

  • Peer join/leave → re-evaluate cached placements
  • Topology edge weight change (re-measurement) → invalidate cache
  • Service binding request → suggest_placement() for optimal peer
  • CXL region creation → allocate_cxl_range()
  • Admin-triggered rebalance → find_peers() with updated constraints

What does NOT trigger queries:

  • Page faults (use CachedRoute)
  • Packet forwarding (use CachedRoute)
  • DLM lock operations (use pre-resolved PeerId)
  • DSM coherence messages (use pre-resolved PeerId)
  • Any per-request operation


5.3 Peer Kernel Isolation and Crash Recovery

This section defines the isolation model for device-local kernels (Section 5.2 Path A/B) — devices running UmkaOS or an UmkaOS-protocol firmware as a first-class cluster peer. This model is fundamentally different from both Tier 1 driver crash recovery (Section 11.9) and remote node failure (Section 5.8), and must not be confused with either.

5.3.1 The Isolation Model Shift

With traditional drivers (Tier 1/Tier 2), UmkaOS operates a supervisor hierarchy: the host kernel loads, supervises, and recovers drivers. A Tier 1 driver crash is handled by the kernel — it detects the fault, revokes the domain, issues FLR, and reloads the driver binary. The kernel is always in control.

With a multikernel peer, this hierarchy does not exist. The device runs its own autonomous kernel on its own cores in its own physical memory. The host kernel:

  • Does not load the device kernel
  • Cannot inspect or modify the device kernel's private memory
  • Cannot "reload" the device kernel the way it reloads a Tier 1 driver
  • Cannot force the device into a known-good state without a full hardware reset

The isolation unit shifts from software domain (MPK/DACR keys, enforced by the kernel) to physical boundary (PCIe bus, IOMMU, enforced by hardware).

5.3.2 Host Unilateral Controls

Despite being a peer, the device is physically attached to the host's PCIe fabric. The host retains six unilateral controls it can exercise regardless of the device kernel's state — even against a buggy, wedged, or hostile peer. They form an escalating ladder from surgical containment to full hardware reset:

1. IOMMU domain lockout — The host programmed the device's IOMMU domain at join time. It can modify or revoke that domain at any time. Revoking the IOMMU domain prevents all further device DMA into host memory, hardware-enforced, with no cooperation from the device kernel. This is the primary containment action.

2. PCIe bus master disable — A single MMIO write to the device's Command register clears the Bus Master Enable bit. All PCIe transactions from the device are silently dropped by the root complex. Effective immediately; no cooperation needed.

3. Function Level Reset (FLR) — The host can issue FLR via the PCIe capability register. This resets all device hardware state: the device kernel dies, all in-flight DMA is cancelled, device cores reset. The device must re-initialize and rejoin the cluster. FLR is the multikernel equivalent of "driver reload" but takes device-reboot time (~1-30 seconds depending on firmware initialization) rather than driver-reload time (~50-150ms for Tier 1).

CXL Type 1/2 note: On CXL-attached peers with CXL.cache (coherent cache in the host CPU's coherency domain), the CXL specification requires the device to flush all dirty cache lines back to host memory before the FLR completes. The host must not access the shared memory region until the FLR completion status is confirmed — cache lines in flight during FLR may be in an indeterminate state. UmkaOS's crash recovery sequence (Section 5.3) issues IOMMU lockout and bus master disable (steps 1-2) before FLR (step 7) precisely to ensure no new DMA or cache traffic originates from the device during the flush window.

4. PCIe Secondary Bus Reset (SBR) — The upstream bridge or root port can assert the Secondary Bus Reset bit in its Bridge Control register. This propagates a hard reset signal to all devices on that bus segment — more comprehensive than FLR (resets all PCIe functions on the device, not just one) but less disruptive than a full power cycle. Used when FLR is not supported or does not produce a clean reset.

5. PCIe slot power — The host can power-cycle the PCIe slot via the Hot-Plug Slot Control register or via ACPI _PS3/_PS0 platform methods. This cuts then restores power to the slot entirely. Reserved for devices that do not respond to FLR or SBR.

6. Out-of-band management (BMC/IPMI) — On server-class hardware, the Baseboard Management Controller (BMC) has independent hardware control over PCIe slot power and presence, completely bypassing the host OS. The BMC operates on a separate management plane with its own network interface and power rail. This means that even if the host kernel itself is hung or severely degraded, a management controller can power-cycle misbehaving device slots and restore the host's ability to re-enumerate them. UmkaOS treats the BMC as a platform capability, not an UmkaOS component — its availability depends on the hardware platform, and it operates outside UmkaOS's software control path.

The critical consequence of this escalating ladder is: a heartbeat timeout from a peer device is never a dead end. The host always has a hardware path to reset and re-enumerate the device, independent of device cooperation. The peer kernel model does not introduce any situation where a failed device permanently wedges the host.

These controls are one-sided and non-cooperative. The host can execute them without any message to the device, without the device kernel's acknowledgment, and even while the device kernel is executing arbitrary code. They represent the irreducible minimum of host authority over physically attached devices.
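
The ladder's ordering invariant (try the least disruptive control first, escalate only on failure) can be captured in a small state machine. The enum and `escalate` helper below are a sketch, not UmkaOS code:

```rust
/// Hypothetical encoding of the six-step ladder. Declaration order is
/// escalation order: each action is attempted only after the previous
/// one fails to yield a clean, re-enumerable device.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ResetAction {
    IommuLockout,      // 1: revoke the IOMMU domain (primary containment)
    BusMasterDisable,  // 2: clear Bus Master Enable
    Flr,               // 3: Function Level Reset
    SecondaryBusReset, // 4: bridge-level hard reset
    SlotPowerCycle,    // 5: hot-plug slot power control
    BmcPowerCycle,     // 6: out-of-band, bypasses the host OS entirely
}

/// Next rung of the ladder, or None once the ladder is exhausted.
fn escalate(current: ResetAction) -> Option<ResetAction> {
    use ResetAction::*;
    Some(match current {
        IommuLockout => BusMasterDisable,
        BusMasterDisable => Flr,
        Flr => SecondaryBusReset,
        SecondaryBusReset => SlotPowerCycle,
        SlotPowerCycle => BmcPowerCycle,
        BmcPowerCycle => return None,
    })
}
```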

5.3.2.1 s390x Peer Isolation Controls

On s390x, peer kernels run in separate LPARs (Logical Partitions) or z/VM guests. The isolation model is fundamentally different from PCIe — there is no PCIe fabric, no IOMMU domain, and no BAR memory. Instead, s390x provides partition-level controls managed by the hypervisor (z/VM or PR/SM):

1. Virtual device disableHSCH (Halt Subchannel) + CSCH (Clear Subchannel) immediately halt all I/O on a subchannel. QDIO queues are drained and the subchannel is placed in a quiesced state. Effective immediately; no cooperation from the peer partition needed. Equivalent to PCIe bus master disable.

2. Partition quiesce — z/VM SMAPI (Systems Management Application Programming Interface) Image_Deactivate or PR/SM partition deactivate. Stops all CPUs in the peer partition, preventing further execution. Equivalent to PCIe FLR.

3. Partition reset — z/VM SMAPI Image_Reset or PR/SM partition reset via the HMC (Hardware Management Console). Resets all partition state: CPUs, memory, virtual devices. The peer must re-IPL (Initial Program Load) and rejoin the cluster. Equivalent to PCIe SBR.

4. Out-of-band management — The SE (Service Element) / HMC provides independent hardware control over partitions, accessible via a separate management network. Operates independently of the host UmkaOS instance. Equivalent to BMC/IPMI.

The same escalating ladder principle applies: a heartbeat timeout from an s390x peer partition is never a dead end. The hypervisor always has a path to quiesce and reset the peer, independent of peer cooperation.

5.3.3 Isolation Comparison

Traditional Tier 1 driver:
  Same Ring 0 address space as host kernel.
  Isolation via MPK/DACR — software-enforced, escape possible via WRPKRU.
  Host kernel supervises: detects crash, revokes domain, FLR, reload.
  Recovery: ~50-150ms (driver reload).
  Can corrupt: own domain memory (not host kernel critical structures).

Multikernel peer (Mode A — message passing):
  Separate address space, separate physical DRAM, separate cores.
  Isolation via IOMMU — hardware-enforced, physically unreachable.
  Host kernel does NOT supervise: detects via heartbeat, acts unilaterally.
  Recovery: escalating reset ladder (FLR → SBR → slot power → BMC/IPMI).
  At least one reset path always available; heartbeat timeout is never a dead end.
  Can corrupt: RDMA pool (IOMMU-bounded, excludes kernel critical structures).
  Can leave dirty: distributed state (held locks, DSM page ownership).

Multikernel peer (Mode B — hardware-coherent):
  Same as Mode A for isolation (IOMMU still bounds device DMA).
  Additional risk: in-flight coherent writes to shared pool may be torn.
  Recovery: same as Mode A + pool scan and lock force-release.

Remote cluster node (network-attached):
  No physical connection. No unilateral PCIe controls.
  Isolation: network boundary, no shared memory at all.
  Recovery: membership revocation, DSM page invalidation.
  Can corrupt: nothing on host (messages-only interface).

The key asymmetry: a peer kernel crash is more isolated than a Tier 1 driver crash (no shared Ring 0 address space, IOMMU is harder than MPK), but less controlled (no supervisor relationship, no fast reload, possible dirty distributed state).

Relationship to the three-tier model. Multikernel peers are effectively a fourth isolation level beyond Tier 2. The progression — Tier 0 (no boundary) → Tier 1 (software memory domains) → Tier 2 (separate address space) → Multikernel Peer (separate physical hardware behind IOMMU) — represents increasing boundary hardness at each step. Unlike the usual isolation-vs-performance tradeoff (each tier adds latency), multikernel peers break the pattern: they provide the strongest isolation and offload host CPU cycles entirely, since the device runs its own kernel on its own cores. The reason UmkaOS does not label this "Tier 3" is that it has fundamentally different failure semantics: no supervisor relationship, distributed state cleanup on crash, an escalating reset ladder (FLR → SBR → slot power → BMC), and a requirement for dedicated hardware. It is not a drop-in replacement for a driver tier — it is a different operational model that happens to sit at the strongest end of the isolation spectrum.

5.3.4 Crash Detection

Two detection paths run in parallel for attached peer kernels:

Primary: Cluster heartbeat (both Mode A and Mode B). The standard membership protocol (Section 5.8) sends a heartbeat every 100ms. Missed for 300ms (3 heartbeats) → Suspect. Missed for 1000ms (10 heartbeats) → Dead. This covers all failure modes, including silent crashes and wedged firmware.

Secondary: MMIO watchdog (required for Mode B; recommended for Mode A). The device firmware writes a monotonically increasing counter to a dedicated MMIO register every 10ms. The host polls this register on any DLM lock acquisition timeout or DSM page request timeout. A stale counter triggers immediate escalation to Suspect. Detection latency: ~20ms instead of up to 1 second via heartbeat alone.

Mode B devices that do not implement the MMIO watchdog must not be granted hardware-coherent transport access. The faster detection is mandatory for Mode B because stale lock state in the coherent pool propagates to all cluster members holding locks in that region.

/// Per-peer crash detection state, maintained by the host kernel.
pub struct PeerKernelHealth {
    /// PeerId of the peer kernel.
    pub peer_id: PeerId,
    /// Last observed MMIO watchdog counter value.
    pub last_watchdog_count: AtomicU64,
    /// Timestamp of last watchdog read (nanoseconds since boot).
    pub last_watchdog_ts_ns: AtomicU64,
    /// Number of consecutive missed heartbeats.
    pub missed_heartbeats: AtomicU32,
    /// Current health state.
    pub state: AtomicU32, // PeerHealthState enum
    /// PCIe BDF for unilateral control operations.
    pub pcie_bdf: PcieBdf,
    /// Coordination mode (determines recovery procedure).
    pub mode: PeerCoordinationMode, // ModeA | ModeB
}

#[repr(u32)]
pub enum PeerHealthState {
    Active             = 0,
    Suspect            = 1, // Missed heartbeats or stale watchdog — alert, escalate
    Dead               = 2, // Confirmed failed — execute recovery sequence
    Faulted            = 3, // FLR issued, waiting for device to rejoin
    ManagementFaulted  = 4, // CXL Type 3: memory accessible, management unavailable
}
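The staleness decision itself reduces to a pure function over the watchdog counter and elapsed time. The sketch below is a simplified, self-contained model (the kernel's real check reads the `AtomicU64` fields of `PeerKernelHealth`); allowing two watchdog periods of slack yields the ~20ms detection latency quoted above:

```rust
// Simplified model of the MMIO watchdog staleness check. The device
// increments its counter every 10ms; if the counter has not advanced
// and more than two periods have elapsed since the last read, the peer
// is escalated to Suspect immediately (instead of waiting up to 1s for
// the heartbeat path).

#[derive(Debug, PartialEq, Clone, Copy)]
enum PeerHealthState {
    Active,
    Suspect,
}

const WATCHDOG_PERIOD_NS: u64 = 10_000_000; // device writes every 10ms

/// Decide the new health state given the previously observed counter,
/// the counter just read from the MMIO register, and the nanoseconds
/// elapsed since the previous read.
fn check_watchdog(last_count: u64, current_count: u64, elapsed_ns: u64) -> PeerHealthState {
    // Two periods of slack: the host poll is asynchronous to the
    // device's 10ms timer, so one missed tick is not yet suspicious.
    if current_count == last_count && elapsed_ns > 2 * WATCHDOG_PERIOD_NS {
        PeerHealthState::Suspect
    } else {
        PeerHealthState::Active
    }
}
```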

5.3.5 Recovery Sequence

On transition to Dead, the host kernel executes the following sequence. Steps 1-3 are unilateral and non-cooperative. Steps 4-6 involve cleanup of distributed state. Steps 7-8 are optional and admin-controlled.

Peer kernel crash recovery sequence:

1. IOMMU lockout (immediate, unilateral)
   host.iommu.revoke_domain(peer.pcie_bdf)
   — All device DMA into host memory blocked in hardware.
   — All outstanding RDMA operations targeting this device's QPs are cancelled.
   — Estimated time: <1ms (IOMMU domain table update + TLB shootdown).

2. PCIe bus master disable (immediate, unilateral)
   host.pcie.clear_bus_master(peer.pcie_bdf)
   — Device can no longer initiate any PCIe transactions.
   — Defense-in-depth: belts-and-suspenders with IOMMU lockout.
   — Estimated time: <1ms.

3. [Mode B only] Pool scan and lock force-release
   host.rdma_pool.scan_and_release_locks(peer.node_id)
   — Scan all DLM lock entries in the coherent pool owned by the dead peer.
   — Force-release each held lock; set tombstone flag (LOCK_OWNER_DEAD).
   — Waiters receive LOCK_OWNER_DEAD, not timeout.
   — Scan all DSM directory entries owned by peer; mark as LOST.
   — Estimated time: O(active_locks) — typically <10ms.

4. Cluster membership revocation
   cluster.revoke_membership(peer.node_id)
   — Broadcasts MEMBER_DEAD(peer.node_id) to all other cluster nodes.
   — All nodes destroy their QP connections to the peer.
   — Master eagerly calls `dereg_mr()` on the evicted node's pool-wide RDMA
     Memory Region. Per the InfiniBand Architecture Specification, MR
     deregistration invalidates the associated rkey in RNIC hardware (< 1ms);
     any in-flight one-sided RDMA operations from the evicted node using the
     old rkey receive a Remote Access Error completion. This is immediate —
     it does NOT wait for the rkey rotation
     cycle ([Section 5.3](#peer-kernel-isolation-and-crash-recovery--host-unilateral-controls), Mitigation 2).
   — Estimated time: ~1 RTT to notify all members (~1-3ms on LAN).

5. DSM page invalidation
   dsm.invalidate_all_pages_owned_by(peer.node_id)
   — All DSM pages in the Owner or Shared state for the dead peer are
     invalidated. Processes mapping these pages receive SIGBUS on next
     access (same semantics as NFS server failure).
   — Home-node responsibilities for pages hosted on the peer are
     migrated to surviving nodes via the migration protocol (Section 5.7.3).
   — Estimated time: O(pages_owned_by_peer) — typically <100ms.

6. Capability revocation
   cap_table.revoke_all_for_node(peer.node_id)
   — All capabilities granted to or from the peer are invalidated.
   — Ongoing ring buffer channels to the peer are torn down.
   — Estimated time: O(capabilities) — typically <10ms.

7. [Optional] Reset and device reboot (admin-controlled or auto-policy)
   Escalating ladder — attempt each step in order, stop when device rejoins:
   a. FLR:  host.pcie.issue_flr(peer.pcie_bdf)
   b. SBR:  host.pcie.issue_sbr(peer.pcie_bdf)   // if FLR fails or unsupported
   c. Slot power cycle: host.pcie.power_cycle(peer.pcie_bdf)  // last software resort
   d. BMC/IPMI slot power: platform.bmc.power_cycle_slot(peer.slot_id)  // OOB, if available
   — Resets all device hardware state; device firmware re-initializes.
   — If device re-joins cluster: resume normal operation.
   — If device fails to re-join within timeout after all steps: mark as Faulted,
     notify admin, do not attempt further resets automatically.
   — At least one reset path (FLR or SBR) is always available for PCIe devices.

8. [Optional] Workload migration
   scheduler.migrate_workloads_from(peer.node_id, policy)
   — Workloads that were running on the peer's compute resources
     (containers, VMs, scheduled tasks) are either:
     a. Migrated to surviving nodes if they were checkpointable.
     b. Terminated with SIGKILL if not checkpointable.
     c. Suspended pending device recovery if short outage expected.
   — Policy configured per-cgroup: migrate | terminate | suspend.

Total time to containment (steps 1-2): < 2ms. Total time to clean state (steps 1-6): < 200ms typical. Total time to full recovery (steps 1-8 with FLR): 1-30 seconds (device-dependent).

Compare to Tier 1 driver reload: 50-150ms total. The peer kernel recovery is slower because FLR + firmware re-initialization cannot be parallelized, and because distributed state cleanup (steps 4-6) is inherently network-speed rather than local-memory-speed. Steps 1-3 (containment) are comparable to or faster than Tier 1 domain revocation.
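The step-7 escalating ladder is a loop over reset attempts that stops at the first success. The following is a hedged sketch, not the kernel's actual code: each `fn() -> bool` stands in for one reset primitive (FLR, SBR, slot power, BMC slot power) followed by a bounded wait for cluster rejoin:

```rust
// Hypothetical sketch of the escalating reset ladder (recovery step 7).
// Each entry issues one reset and waits (with timeout) for the device
// to rejoin the cluster, returning true on rejoin.

#[derive(Debug, PartialEq, Clone, Copy)]
enum ResetOutcome {
    Rejoined { after_step: usize },
    Faulted, // all steps exhausted: notify admin, no further auto-resets
}

/// Walk the ladder in order (FLR -> SBR -> slot power -> BMC), stopping
/// as soon as one step brings the device back.
fn escalate_resets(steps: &[fn() -> bool]) -> ResetOutcome {
    for (i, reset_and_wait_for_rejoin) in steps.iter().enumerate() {
        if reset_and_wait_for_rejoin() {
            return ResetOutcome::Rejoined { after_step: i };
        }
    }
    ResetOutcome::Faulted
}
```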

CXL Type 3 crash — distinct recovery model: When a CXL Type 3 memory-expander peer loses its management processor (crash, firmware fault, or reset), the recovery is fundamentally different from a compute peer crash:

  • The DRAM cells do not disappear. The physical memory remains accessible to the host via CXL.mem load/store — it is DRAM on a PCIe/CXL bus, not in the device's compute domain. The host can still read and write those pages.
  • The management layer is gone. Tiering decisions, compression metadata, encryption keys (if any), and bad-page tracking maintained by the management processor are no longer available. Compressed pages are unreadable until decompressed; encrypted pages are inaccessible if keys were held in device DRAM.
  • Recovery action: The host marks the CXL pool as ManagementFaulted (not Dead). Pages without compression or encryption remain accessible normally. Pages with compression or encryption are migrated or treated as lost. The management processor is reset (FLR → re-initialize firmware); once recovered, it re-scans the pool metadata and rejoins as a memory-manager peer.
  • No workload migration needed: There are no workloads running on a Type 3 management processor. Workloads using the CXL pool memory continue running unaffected (if their pages were uncompressed/unencrypted). Only pool management operations (tiering, new allocation) are suspended during recovery.

The PeerKernelHealth state machine (Section 5.3) gains an additional state for Type 3 peers: ManagementFaulted — memory accessible, management unavailable.

5.3.6 What Survives Peer Kernel Crash Intact

The host kernel's own state is fully protected:

| Component | Protected by | Survives peer crash? |
|-----------|--------------|----------------------|
| Host kernel text, rodata | IOMMU (device cannot reach) | Always |
| Host kernel stacks | IOMMU (not in RDMA pool) | Always |
| Capability tables | IOMMU (not in RDMA pool) | Always |
| Scheduler state | IOMMU (not in RDMA pool) | Always |
| Host application memory | IOMMU (not in RDMA pool unless explicitly exported) | Always |
| RDMA pool — kernel structures | IOMMU bounds; pool scan on Mode B crash | Recovered in step 3 |
| RDMA pool — application DSM pages | Invalidated (step 5); app gets SIGBUS | Lost (must re-fetch or re-compute) |
| Other cluster nodes' state | Steps 4-6 clean up distributed references | Recovered |
| Host kernel stability | Nothing to crash | Never affected |

The host kernel never crashes due to a peer kernel crash. The IOMMU is the hardware guarantee; the recovery sequence is the software cleanup.

5.3.7 Relationship to Other Failure Handling Sections

  • Section 11.9 (Driver Crash Recovery): Covers Tier 1/Tier 2 driver failures where the host kernel is the supervisor. Peer kernel isolation is a peer relationship, not a supervisor relationship. Do not apply Section 11.9 procedures to peer kernels.

  • Section 5.8 (Cluster Membership and Failure Detection): Covers remote nodes connected over RDMA networks. Peer kernel recovery shares the membership protocol (steps 4-6) but adds the unilateral PCIe controls (steps 1-3) that are not available for network-attached nodes.

  • Section 5.11 (DPU Failure Handling): DPUs are Tier M peers. Their crash recovery follows the standard peer recovery sequence (this section, steps 1-7) with one DPU-specific extension: host fallback to a local driver for the same ServiceId.


5.4 RDMA-Native Transport Layer

5.4.1 Design: Kernel RDMA Transport (umka-rdma)

Unlike Linux where RDMA is a separate subsystem used only by applications, UmkaOS integrates RDMA into the kernel's transport layer so that any kernel subsystem can use RDMA for data movement.

Linux architecture:
  ┌───────────────────────────────────────────────────┐
  │  Kernel subsystems (MM, VFS, IPC, scheduler)      │
  │  └── All use: memcpy, TCP sockets, block I/O      │
  │      (no RDMA awareness)                           │
  ├───────────────────────────────────────────────────┤
  │  RDMA subsystem (ib_core, ib_uverbs)              │
  │  └── Only used by: userspace apps via libibverbs   │
  │      (kernel subsystems don't use this)            │
  └───────────────────────────────────────────────────┘

UmkaOS architecture:
  ┌───────────────────────────────────────────────────┐
  │  Kernel subsystems (MM, VFS, IPC, scheduler)      │
  │  └── All use: umka-rdma transport (when remote)    │
  │      Local path: same as before (memcpy, DMA)      │
  │      Remote path: RDMA read/write via umka-rdma    │
  ├───────────────────────────────────────────────────┤
  │  umka-rdma: unified RDMA transport for kernel use  │
  │  ├── Page migration (MM → remote NUMA node)        │
  │  ├── Page cache fill (VFS → remote page cache)     │
  │  ├── IPC ring (IPC → remote process)               │
  │  ├── Control messages (scheduler, capabilities)    │
  │  └── Userspace RDMA (libibverbs compat, unchanged) │
  ├───────────────────────────────────────────────────┤
  │  RDMA hardware driver (mlx5, efa, bnxt_re, etc.)  │
  │  └── Implements RdmaDeviceVTable (Section 22.5) │
  └───────────────────────────────────────────────────┘

Nucleus data path rationale: The RDMA data path (post_send, poll_cq, doorbell MMIO writes) executes entirely in Nucleus (Tier 0) with zero tier crossings. This is a deliberate performance decision: DSM page faults (~10-18μs, network-dominated) would gain 2-8% overhead (~400-800ns) from tier crossings if RDMA verbs moved to Tier 1. The NIC driver (setup, teardown, interrupt handling — the code most likely to have bugs) remains Tier 1 for crash isolation. RDMA verbs code is small (~2-3KB), amenable to formal verification, and rarely changes. Session encryption keys are stored in Core-private memory (PKEY 0, outside the RDMA pool) so even physical RDMA access cannot decrypt in-flight traffic.

5.4.2 RDMA Transport Implementation

The RDMA transport implements the ClusterTransport trait (Section 5.10) for RDMA-connected peers. RDMA infrastructure (protection domain, completion queues, memory regions) is shared across all RDMA peers via Arc<RdmaInfra>, while per-peer state (queue pairs, flow control) lives in RdmaPeerTransport.

// umka-core/src/rdma/transport.rs (kernel-internal)

/// Shared RDMA infrastructure. Created once at RDMA transport init,
/// shared by all `RdmaPeerTransport` instances via `Arc`.
pub struct RdmaInfra {
    /// RDMA device used for kernel communication.
    rdma_device: DeviceNodeId,

    /// Protection domain for all kernel RDMA operations.
    kernel_pd: RdmaPdHandle,

    /// Pre-registered memory regions for fast page transfer.
    /// Covers designated RDMA-eligible regions (pool size policy below).
    /// Avoids per-transfer memory registration overhead.
    kernel_mr: RdmaMrHandle,

    /// Per-CPU completion queues, one per logical CPU.
    /// A single shared CQ becomes a bottleneck under load because all CPUs
    /// contend to drain it. Per-CPU CQs allow each CPU's RDMA poll thread
    /// to drain its own queue independently, eliminating cross-CPU contention.
    /// Each QP in RdmaPeerTransport is created against the CQ of the CPU that
    /// owns that connection's polling thread (assigned at connection setup time).
    /// Indexed by cpu_id; length = num_online_cpus() at transport init.
    /// Uses a slab-allocated slice rather than a fixed array: the kernel has no
    /// compile-time MAX_CPUS (Section 7.1.1) — CPU count is runtime-discovered.
    /// Allocated once at transport init, never resized on the hot path.
    cqs: SlabSlice<RdmaCqHandle>,

    /// Statistics (across all RDMA peers).
    stats: TransportStats,
}

/// Per-peer RDMA transport binding. Implements `ClusterTransport`.
/// Created when a peer joins the cluster via RDMA and the QP handshake
/// completes. Stored in `PeerNode.transport` as `Arc<dyn ClusterTransport>`.
///
/// **QP Handshake Protocol** (executed once per peer connection):
///
/// ```text
/// 1. Local creates two RC QPs (control + data) in RESET state:
///    ibv_create_qp(pd, RC, send_cq=per_cpu_cq, recv_cq=per_cpu_cq)
/// 2. Transition both QPs to INIT state:
///    ibv_modify_qp(qp, INIT, { pkey_index=0, port=1, access=LOCAL_WRITE|REMOTE_WRITE|REMOTE_READ|REMOTE_ATOMIC })
/// 3. Exchange QP metadata with peer via out-of-band TCP (cluster join).
///    The exchange uses the ClusterMessageHeader-framed message format with
///    Le32/Le64 encoding for all integer fields. Upon receipt, values are
///    converted to native byte order for storage in `RdmaPeerTransport`.
///    Send: (control_qp.qp_num, data_qp.qp_num, lid, gid, rkey, base_addr, security_mode)
///    Recv: peer's corresponding metadata
/// 4. Transition both QPs to RTR (Ready To Receive):
///    ibv_modify_qp(qp, RTR, { dest_qp_num=peer.qp_num, ah_attr={dlid=peer.lid, dgid=peer.gid}, path_mtu, max_dest_rd_atomic=16, min_rnr_timer=12 })
/// 5. Transition both QPs to RTS (Ready To Send):
///    ibv_modify_qp(qp, RTS, { max_rd_atomic=16, timeout=14, retry_cnt=7, rnr_retry=7 })
/// 6. Post initial receive buffers on control QP for incoming messages.
/// 7. Construct RdmaPeerTransport with the established QPs.
/// ```
///
/// After step 5, both control and data QPs are ready for two-sided and one-sided
/// operations. The out-of-band TCP connection (step 3) is closed after the handshake
/// — all subsequent communication uses RDMA.
pub struct RdmaPeerTransport {
    /// Shared RDMA infrastructure (PD, MRs, CQs). All RDMA peers on the
    /// same NIC share this via `Arc` — no per-peer duplication of expensive
    /// NIC resources.
    infra: Arc<RdmaInfra>,

    /// Reliable connected queue pair for control messages (two-sided).
    control_qp: RdmaQpHandle,

    /// Reliable connected QP for one-sided RDMA (Read/Write/Atomic).
    /// RDMA Read/Write requires RC or UC transport; UD only supports Send/Receive.
    data_qp: RdmaQpHandle,

    /// Remote node's memory region key (for RDMA read/write).
    remote_rkey: u32,

    /// Remote node's base address for page transfers.
    remote_base_addr: u64,

    /// Flow control: outstanding RDMA operations.
    inflight: AtomicU32,

    /// Maximum concurrent RDMA operations to this peer.
    max_inflight: u32,

    /// Per-connection security posture (Permissive / Strict / Auto).
    security_mode: RdmaSecurityMode,
}

/// ClusterTransport implementation for RDMA peers.
/// All methods operate on the single peer this instance is bound to.
impl ClusterTransport for RdmaPeerTransport {
    /// RDMA Send (RC QP). Used for control messages, coherence protocol.
    fn send(&self, msg: &[u8]) -> Result<(), TransportError>;

    /// RDMA Send (RC QP) + poll CQ for completion.
    fn send_reliable(&self, msg: &[u8], timeout_ms: u32) -> Result<(), TransportError>;

    /// Poll the per-CPU CQ for incoming messages on this peer's control QP.
    fn poll_recv(&self, buf: &mut [u8]) -> Option<usize>;

    /// RDMA Read: fetch a page from this peer to local memory.
    /// One-sided — no remote CPU involvement (~3-5 μs).
    /// Used by: MM (page fault on remote page), page cache (remote fill).
    fn fetch_page(
        &self,
        remote_addr: u64,
        local_addr: PhysAddr,
        size: u32,
    ) -> Result<(), TransportError>;

    /// RDMA Write: push a page from local memory to this peer.
    /// One-sided — no remote CPU involvement (~2-3 μs).
    /// Used by: MM (page eviction to remote peer), DSM (writeback).
    fn push_page(
        &self,
        local_addr: PhysAddr,
        remote_addr: u64,
        size: u32,
    ) -> Result<(), TransportError>;

    /// Batch page transfer: chained RDMA Work Requests in a single
    /// post_send() (~5-10 μs for 64 pages). Near line rate.
    fn fetch_pages_batch(
        &self,
        pages: &[(u64, PhysAddr)],
    ) -> Result<(), TransportError>;

    /// RDMA Atomic Compare-and-Swap (64-bit, NIC-side, ~2-3 μs).
    /// Used by: DLM ([Section 15.15](15-storage.md#distributed-lock-manager--transport-agnostic-lock-operations))
    /// uncontested acquire, DSM page ownership transfers.
    fn atomic_cas(
        &self,
        remote_addr: u64,
        expected: u64,
        desired: u64,
    ) -> Result<u64, TransportError>;

    /// RDMA Atomic Fetch-and-Add (64-bit, NIC-side, ~2-3 μs).
    fn atomic_faa(
        &self,
        remote_addr: u64,
        addend: u64,
    ) -> Result<u64, TransportError>;

    /// RC QP in-order delivery guarantees ordering for writes. For
    /// ordering after reads/atomics: zero-length Send with IBV_SEND_FENCE.
    fn fence(&self) -> Result<(), TransportError>;

    fn transport_name(&self) -> &'static str { "rdma" }
    fn supports_one_sided(&self) -> bool { true }
    fn is_coherent(&self) -> bool { false }
}
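The `inflight`/`max_inflight` pair in `RdmaPeerTransport` is a simple per-peer admission counter. A minimal self-contained sketch of the assumed semantics (in the real transport, `release` is driven by draining completions from the per-CPU CQ):

```rust
// Sketch of per-peer RDMA flow control: admit a new one-sided operation
// only while fewer than `max_inflight` are outstanding; completions
// decrement the counter.

use std::sync::atomic::{AtomicU32, Ordering};

struct FlowControl {
    inflight: AtomicU32,
    max_inflight: u32,
}

impl FlowControl {
    fn new(max_inflight: u32) -> Self {
        Self { inflight: AtomicU32::new(0), max_inflight }
    }

    /// Try to admit a new operation. Returns false when the peer's
    /// in-flight budget is exhausted (caller backs off or queues).
    fn try_acquire(&self) -> bool {
        self.inflight
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < self.max_inflight { Some(n + 1) } else { None }
            })
            .is_ok()
    }

    /// Called when a completion for this peer is drained from the CQ.
    fn release(&self) {
        self.inflight.fetch_sub(1, Ordering::AcqRel);
    }
}
```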

5.4.3 RDMA Device Capability Flags

Device capabilities are discovered at initialization by querying the NIC's ibv_device_attr (or equivalent KABI vtable call for Tier 1 RDMA drivers). Subsystems check these flags before using optional hardware features.

bitflags! {
    /// RDMA device capability flags. Discovered at device init, immutable thereafter.
    /// Stored in `RdmaInfra::device_caps` and queried by DLM, DSM, and transport code.
    pub struct RdmaDeviceCaps: u32 {
        /// Device supports Reliable Connected (RC) queue pairs.
        const RC_QP           = 1 << 0;
        /// Device supports Unreliable Connected (UC) queue pairs.
        const UC_QP           = 1 << 1;
        /// Device supports Unreliable Datagram (UD) queue pairs.
        const UD_QP           = 1 << 2;
        /// Device supports RDMA Read (one-sided).
        const RDMA_READ       = 1 << 3;
        /// Device supports RDMA Write (one-sided).
        const RDMA_WRITE      = 1 << 4;
        /// Device supports 64-byte atomic operations (required for DSM
        /// compare-and-swap on LVB structures). Confirmed on ConnectX-5/6/7
        /// and AWS EFA. When absent, the DLM falls back to the double-read
        /// protocol for LVB reads ([Section 15.15](15-storage.md#distributed-lock-manager--lock-value-block)).
        const RDMA_ATOMIC_64B = 1 << 5;
        /// Device supports 8-byte atomic Compare-and-Swap.
        const ATOMIC_CAS      = 1 << 6;
        /// Device supports 8-byte atomic Fetch-and-Add.
        const ATOMIC_FAA      = 1 << 7;
        /// Device supports Shared Receive Queues (SRQ).
        const SRQ             = 1 << 8;
        /// Device supports Memory Windows Type 2 (fine-grained access control).
        const MW_TYPE2        = 1 << 9;
    }
}
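A sketch of how a subsystem consumes these flags, using a plain `u32` constant in place of the `bitflags!` type so the example is self-contained. The fallback mirrors the `RDMA_ATOMIC_64B` doc comment above: without 64-byte atomics, the DLM uses the double-read protocol for LVB reads:

```rust
// Illustrative capability gating. The constant mirrors bit 5 of
// RdmaDeviceCaps above; the enum names are hypothetical.

const RDMA_ATOMIC_64B: u32 = 1 << 5;

#[derive(Debug, PartialEq)]
enum LvbReadPath {
    Atomic64,   // NIC-side 64-byte atomic on the LVB structure
    DoubleRead, // software fallback when the device lacks 64B atomics
}

/// Choose the LVB read strategy from the device capability word,
/// discovered once at device init and immutable thereafter.
fn select_lvb_read_path(device_caps: u32) -> LvbReadPath {
    if device_caps & RDMA_ATOMIC_64B != 0 {
        LvbReadPath::Atomic64
    } else {
        LvbReadPath::DoubleRead
    }
}
```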

5.4.4 Pre-Registered Kernel Memory

The biggest performance problem with Linux RDMA is memory registration. Before any RDMA operation, memory must be pinned and registered with the NIC hardware (translated to physical addresses and programmed into the NIC's address translation table). This costs ~1-10 μs per registration.

UmkaOS avoids per-transfer registration overhead by pre-registering designated RDMA-eligible memory regions at cluster join time.

Pool sizing policy:

The pool size is determined at cluster join time by the following formula:

rdma_pool_size = clamp(
    min(
        physical_ram * 25 / 100,      // 25% of physical RAM
        rdma_max_pool_gib * GIB,      // configurable hard cap
    ),
    256 * MIB,                        // minimum: always reserve 256 MiB
    physical_ram,                     // maximum: never exceed total RAM
)

Where rdma_max_pool_gib is the rdma.max_pool_gib kernel parameter (default: 64 GiB).

The default 64 GiB cap reflects typical InfiniBand HCA memory registration limits and prevents excessive physical page pinning on large systems. On a 256 GiB machine the pool is 64 GiB (25%). On a 1 TiB machine the pool is capped at 64 GiB (6.25%), not 256 GiB, avoiding wasteful pinning on systems where most workloads are local. Operators running dedicated RDMA fabrics on large memory systems can raise the cap via the rdma.max_pool_gib tunable; there is no hard upper bound other than total RAM.

The 256 MiB floor ensures that even very small systems (e.g., embedded UmkaOS nodes with 8–16 GiB) always have enough RDMA-registered space for control-plane and DSM traffic.
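The sizing formula transcribes directly into code. A self-contained sketch in byte units (`rdma_max_pool_gib` defaults to 64; machines with less than 256 MiB of RAM are out of scope here, since `clamp` requires min ≤ max):

```rust
// Pool sizing: min(25% of physical RAM, configurable hard cap),
// clamped to [256 MiB, total RAM].

const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;

fn rdma_pool_size(physical_ram: u64, rdma_max_pool_gib: u64) -> u64 {
    let target = (physical_ram * 25 / 100).min(rdma_max_pool_gib * GIB);
    target.clamp(256 * MIB, physical_ram)
}
```

With the default 64 GiB cap this reproduces the examples in the text: 64 GiB on a 256 GiB machine (25%), 64 GiB on a 1 TiB machine (capped), and the 256 MiB floor on very small nodes.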

NUMA-local allocation: The pool is divided proportionally across NUMA nodes:

node_pool_size = pool_size * node_physical_ram / total_physical_ram

Each NUMA node registers its own share as a separate memory region. RDMA NIC NUMA affinity (where the NIC is physically closest to a NUMA node) is accounted for in the distance matrix (Section 5.2) but does not affect the proportional split — all nodes contribute fairly to the pool.
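The proportional split, as a sketch. A `u128` intermediate avoids overflow (a 64 GiB pool multiplied by a 128 GiB node share exceeds `u64`); integer division can leave a few bytes unassigned, which a real implementation would fold into one node's share:

```rust
// node_pool_size = pool_size * node_physical_ram / total_physical_ram,
// computed in u128 to avoid intermediate overflow.

const GIB: u64 = 1 << 30;

fn node_pool_size(pool_size: u64, node_ram: u64, total_ram: u64) -> u64 {
    (pool_size as u128 * node_ram as u128 / total_ram as u128) as u64
}
```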

Timing: The pool is registered at cluster join time, before any user processes issue RDMA requests. Registration is a one-time cost that varies by region size and NIC: ~10 μs for 4 KB, ~100 μs for 256 MB, ~1 ms for multi-GB regions (measured on ConnectX-6). For typical pool sizes (64-256 MB), budget ~100 μs per region.

Runtime adjustment:
  - Shrink: drain in-flight RDMA operations that reference the region, unregister the excess MRs, update the pool descriptor. Requires quiescing DSM pages in the shrunk range (~1–10 ms depending on page count).
  - Grow: register additional MRs, add to the pool descriptor, notify cluster peers of the new rkey. No quiescing required.
  - Both operations are triggered by writing to /sys/kernel/umka/cluster/rdma.max_pool_gib.

Boot sequence:
  1. RDMA NIC driver initializes (standard KABI init).
  2. Cluster join is requested.
  3. umka-rdma computes pool size: min(RAM * 25%, max_pool_gib * GiB), ≥ 256 MiB.
     Pool is allocated per-NUMA-node proportionally to node RAM.
  4. umka-rdma registers RDMA-eligible memory regions (one per NUMA node):
     - Single memory region per NUMA node, one-time cost (~100 μs total)
     - NIC can DMA to/from registered pages without per-transfer registration
     - Non-registered memory cannot be accessed by remote nodes
  5. Remote nodes exchange rkeys (remote access keys).
  6. Any kernel subsystem can now RDMA read/write registered pages with zero
     setup cost. Pages outside the RDMA pool are not remotely accessible.

This is safe because:
  - Only designated RDMA-eligible regions are registered (not all physical memory)
  - The rkey is only shared with authenticated cluster members
  - RDMA access is gated by the connection (reliable connected QP)
  - The kernel controls which pages are exported (via page ownership tracking)
  - IOMMU still validates all DMA (RDMA NIC goes through IOMMU like any device)
  - Kernel text, kernel stacks, and security-sensitive structures are never in
    the RDMA pool

5.4.5 Performance Characteristics

| Operation | Latency | Bandwidth | CPU Involvement |
|-----------|---------|-----------|-----------------|
| RDMA Read (4KB page) | ~3-5 μs | ~200 Gb/s line rate | Zero on remote side |
| RDMA Write (4KB page) | ~2-3 μs | ~200 Gb/s line rate | Zero on remote side |
| RDMA Atomic CAS | ~2-3 μs | N/A | Zero on remote side |
| Control message (send/recv) | ~1-2 μs | N/A | Interrupt on remote side |
| Batch page transfer (64 pages) | ~5-10 μs | Near line rate | Zero on remote side |
| Memory registration (avoided) | 0 (pre-registered) | N/A | N/A |

Compare with alternatives:

| Alternative | 4KB Fetch Latency | Notes |
|------------|-------------------|-------|
| Local DRAM (same NUMA) | ~300-500 ns | Baseline (4KB page copy) |
| Local DRAM (cross-NUMA) | ~500-800 ns | QPI/UPI hop (4KB page copy) |
| RDMA (same switch) | ~3-5 μs | ~10x local DRAM, but... |
| CXL 2.0 pooled memory | ~200-400 ns | Hardware-coherent |
| NVMe SSD | ~10-15 μs | Current swap target |
| NFS/CIFS (TCP) | ~50-200 μs | Kernel networking overhead |
| HDD | ~5,000 μs | Rotational latency |

Key insight: Raw RDMA Read latency (~3 μs) is 3-5x faster than NVMe (~10-15 μs). However, the complete DSM page fault path (directory lookup + ownership negotiation + RDMA transfer) totals ~10-18 μs (see Section 6.5), which is comparable to NVMe latency rather than 3-5x faster. The advantage of remote DRAM over NVMe is not single-fault latency but bandwidth scalability: RDMA provides ~25 GB/s per port (200 Gb/s InfiniBand) vs. ~7 GB/s for a single NVMe SSD, and multiple RDMA ports can be aggregated. Remote memory is a better "swap" target than local disk for bandwidth-bound workloads.
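A quick sanity check of the batching numbers from the tables above: a 64-page chained transfer completing in ~10 μs sustains ~26 GB/s, near the ~25 GB/s payload rate of a 200 Gb/s port, while serialized single-page fetches at ~4 μs each sustain only ~1 GB/s:

```rust
// Effective throughput of a page transfer in GB/s, given total bytes
// moved and the wall-clock microseconds it took. Used here only as a
// back-of-envelope check on the figures quoted in the tables.

fn throughput_gb_s(bytes: u64, micros: u64) -> f64 {
    bytes as f64 / (micros as f64 * 1e-6) / 1e9
}
```

The ~25x gap between the batched and serialized paths is why `fetch_pages_batch` exists: bandwidth-bound workloads must amortize the per-operation round trip.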

5.4.6 Security Considerations

The pre-registered RDMA pool approach (Section 5.4.4) creates a security trade-off: any node with the rkey can read/write any address within the RDMA-eligible region on the remote node via one-sided RDMA. The attack surface is limited to the RDMA pool (capped at min(25% of RAM, rdma.max_pool_gib GiB), default ≤64 GiB), not all physical memory — kernel text, stacks, and security structures are excluded.

Explicit trust model assumption: All nodes in an UmkaOS cluster are mutually trusted kernel instances, authenticated during cluster join (X25519 key exchange authenticated with Ed25519 signatures, Section 5.2). A compromised node can read/write the RDMA pool of any other node. This is the same trust model used by all production InfiniBand and RoCE deployments today. The cluster join authentication is the trust boundary — once joined, nodes are peers. Nodes are mutually trusted for correctness (they run the same kernel and obey the protocol). Encryption protects against passive network observers and hardware-level attacks (compromised switches). A compromised node kernel can bypass all software protections regardless. If a node is suspected of compromise, its cluster membership is revoked (Section 5.8), destroying all QP connections and invalidating its RDMA mappings immediately.

Mitigation 1: RDMA Memory Windows (MW Type 2) — For security-sensitive or multi-tenant deployments, use RDMA Memory Windows instead of pool-wide registration. Each page export creates a short-lived memory window with a unique rkey, scoped to the specific page or region being transferred. The window is revoked when page ownership changes. This adds ~0.5-3μs overhead per window create/destroy (HCA-dependent; ConnectX-5/6 MW Type 2 bind/invalidate involves firmware interaction and PCIe round-trips) but eliminates pool-wide exposure. Caveat: the 0.5-3μs figure is measured on direct-attach single-hop InfiniBand/RoCE; multi-hop fabric topologies or RDMA-over-TCP add additional per-hop latency (typically 5-15μs on ConnectX across two switches). A compromised node can only access pages for which it currently holds a valid memory window, not the entire 25% pool.

Mitigation 2: Rkey rotation — In trusted mode, the pool-wide rkey is rotated periodically (default rotation interval: 60 seconds). All nodes exchange new rkeys via the authenticated control channel. Old rkeys remain valid during a grace period of 2× the rotation interval (= 120 seconds) to allow in-flight RDMA operations to complete, then are invalidated. This limits the window of exposure if an rkey is leaked outside the cluster by a non-RDMA channel (e.g., software bug exposes rkey bytes in a log or network message): an external attacker who obtained the leaked rkey has at most 180 seconds (60s rotation + 120s grace) before the rkey becomes invalid.
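The rotation window arithmetic can be sketched with two-generation bookkeeping (assumed structures, not the kernel's actual state): the current rkey is always valid, the previous generation is honored only within the 120 s grace period, and a key leaked at the start of an interval is therefore dead after at most 60 + 120 = 180 seconds:

```rust
// Two-generation rkey validity under rotation. Times are in seconds
// since an arbitrary epoch; field names are illustrative.

const ROTATION_INTERVAL_S: u64 = 60;
const GRACE_PERIOD_S: u64 = 2 * ROTATION_INTERVAL_S; // 120s

struct RkeyState {
    current: u32,
    previous: u32,
    rotated_at_s: u64, // time of the most recent rotation
}

/// The current rkey is always valid; the previous one only while the
/// grace period since rotation has not elapsed; anything else is stale.
fn rkey_is_valid(state: &RkeyState, rkey: u32, now_s: u64) -> bool {
    if rkey == state.current {
        true
    } else if rkey == state.previous {
        now_s.saturating_sub(state.rotated_at_s) < GRACE_PERIOD_S
    } else {
        false
    }
}
```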

Security invariant: Session encryption keys for all RDMA transfers are stored in Core-private memory (Tier 0, PKEY 0 on x86, outside the pre-registered RDMA pool). A compromised node that retains physical RDMA pool access cannot decrypt other nodes' in-flight traffic without the session keys. On confirmed compromise detection (CLUSTER_SUSPECT → CLUSTER_EVICT transition), all capability handles granting RDMA pool access are immediately revoked in the local capability table via cap_revoke_all(node_id) — in-flight DMA transactions complete but no new transfers can be initiated by or to the evicted node. The 180-second window governs detection propagation latency, not credential validity after detection.

Important: Rkey rotation is a defense-in-depth mechanism against rkey leakage to non-cluster entities. It is NOT the revocation mechanism for evicted cluster members. When a node's cluster membership is revoked, the master calls dereg_mr() on the evicted node's pool-wide Memory Region immediately, invalidating the rkey in RNIC hardware within < 1 ms. The 180s rotation window does not apply to membership revocation — it applies only to the background rotation schedule that limits exposure from rkey leakage outside the RDMA fabric.

For deployments where even the 180s leakage window is unacceptable (multi-tenant, zero-trust), use Mitigation 1 (RDMA Memory Windows) instead, which provides immediate per-page revocation at the cost of higher per-operation overhead.

Mitigation 3: Trusted cluster mode (default) — For single-tenant, physically secured clusters (the common datacenter case), the pool-wide registration with rkey rotation is the default. This matches current InfiniBand practice — all production RDMA deployments assume a trusted fabric.

Mitigation 4: Hardware memory encryption — Note: Total Memory Encryption (Intel TME-MK, AMD SME) operates at the memory controller boundary and does NOT encrypt data visible to DMA devices. RDMA NICs access memory through the IOMMU in the CPU's coherency domain and see plaintext. TME-MK/SME protects against physical DRAM extraction attacks (cold boot) but does not provide defense-in-depth against software-based RDMA attacks. For RDMA data-in-transit protection, use the AES-GCM encryption described above.

Acknowledged limitation: All four mitigations have trade-offs. MW Type 2 adds latency to every page transfer. Rkey rotation adds periodic coordination overhead. Trusted cluster mode requires a physically secured fabric. Hardware encryption prevents cross-node data snooping but doesn't prevent a compromised node from writing garbage to remote memory. A fully zero-trust RDMA solution would require per-operation cryptographic MACs.

For RoCE (RDMA over Converged Ethernet), MACsec (IEEE 802.1AE) provides link-layer encryption and integrity at the Ethernet layer, protecting against physical tapping and injection. For native InfiniBand, MACsec is not applicable (IB does not use Ethernet framing); InfiniBand relies on partition-based isolation (P_Key), queue-pair key authentication (Q_Key), and subnet manager access control for security. Neither MACsec nor InfiniBand's native mechanisms provide per-operation cryptographic authentication for one-sided RDMA operations at line rate.

UmkaOS's pragmatic stance: match existing InfiniBand/RoCE security practice (trusted fabric with partition isolation), offer MW-based restriction for multi-tenant or security-sensitive deployments, and upgrade to hardware-enforced per-operation authentication when NIC vendors deliver it (CXL 3.0's integrity model is a likely candidate).

One-sided RDMA authentication: There is a fundamental tension between one-sided RDMA (which requires no remote CPU involvement) and per-operation authentication (which requires CPU processing). The resolution: the trust boundary is the cluster join, authenticated via X25519 key exchange with Ed25519 signatures (Section 5.2). Within the cluster, nodes are mutually trusted for RDMA operations. If a node is compromised, its cluster membership is revoked — all QP connections to that node are destroyed and its RDMA mappings are invalidated immediately via eager dereg_mr() (see the Important note under Mitigation 2 above). Rkey revocation is hardware-enforced with < 1ms latency from dereg_mr() call to rejection of in-flight RDMA operations. There is no 180s exposure window for membership revocation — RDMA hardware atomically invalidates the rkey as part of MR deregistration (per the InfiniBand Architecture Specification, MR deregistration invalidates the rkey). The 180s rkey rotation grace period (Mitigation 2) is a separate defense-in-depth mechanism that limits exposure from rkey leakage to non-cluster entities, not the revocation path for evicted members.

This matches the security model of all existing RDMA deployments: InfiniBand assumes a trusted fabric, and RoCEv2 relies on network isolation (VLANs, VXLANs) for multi-tenant separation.

Default: Trusted mode (pool-wide registration) is the default, matching existing InfiniBand/RoCE practice (Mitigation 3 above). Set via:

/sys/kernel/umka/cluster/security_mode
# "trusted" (default) or "secure"

RDMA Security Mode — Per-connection security posture is governed by RdmaSecurityMode, which controls whether individual peer connections use pool-wide rkeys (fast, trusted) or per-transfer memory windows (slower, hardened). The mode is selected automatically by default, based on network topology, but can be overridden globally or per-peer.

/// Per-connection RDMA security posture. Determines whether one-sided RDMA
/// operations use pool-wide rkeys (Permissive) or per-transfer Memory Windows
/// (Strict). Stored per `PeerConnection` and evaluated at QP setup time.
#[repr(u8)]
pub enum RdmaSecurityMode {
    /// Pool-wide rkey — fast path. Remote peer accesses the entire RDMA pool
    /// via a single rkey. No per-transfer MW create/destroy overhead.
    /// Appropriate for physically isolated, single-tenant fabrics where all
    /// nodes are mutually trusted and share a rack or pod.
    Permissive = 0,

    /// Per-transfer Memory Window (MW Type 2). Each RDMA data transfer creates
    /// a short-lived MW scoped to the specific page(s) being transferred. The
    /// MW is invalidated immediately after the transfer completes (or on
    /// ownership change). Adds ~0.5-3 μs overhead per transfer (HCA-dependent)
    /// but eliminates pool-wide exposure.
    /// Appropriate for cross-rack links, multi-tenant fabrics, and any
    /// connection traversing untrusted network segments.
    Strict     = 1,

    /// Topology-aware automatic selection (default). The kernel inspects
    /// the `TopologyEdge` between the local node and the remote peer
    /// at QP setup time and selects the mode as follows:
    ///
    /// - **Intra-rack** (`TopologyEdge.hop_count <= 1` AND both nodes share
    ///   the same `rack_id` in the topology graph): `Permissive`. Rationale:
    ///   single-hop InfiniBand within a rack is physically isolated; the only
    ///   attack vector is a compromised peer node, which is already within the
    ///   cluster trust boundary.
    /// - **Cross-rack** (`hop_count > 1` OR different `rack_id`): `Strict`.
    ///   Rationale: traffic traverses at least one inter-rack switch, which
    ///   widens the physical attack surface (shared cabling, potentially shared
    ///   spine switches with other tenants). Memory Windows bound the blast
    ///   radius of a compromised switch or rogue node on a shared fabric.
    ///
    /// The auto-selection runs once per peer connection at QP creation time
    /// and is recorded in `PeerConnection.security_mode`. It does not change
    /// dynamically — if topology changes (e.g., a peer moves to a different
    /// rack), the QP must be torn down and re-established to re-evaluate.
    Auto       = 2,
}

Configuration interface:

/proc/umka/rdma/default_security_mode
# Read/write. Accepts: "permissive", "strict", "auto" (default: "auto").
# Sets the cluster-wide default for new peer connections.
# Existing connections are not affected — the mode is locked at QP
# creation time. To apply a new default to all connections, write the
# new mode and then trigger a cluster-wide QP refresh via:
#   echo 1 > /proc/umka/rdma/refresh_qps
# which tears down and re-establishes all peer QPs (~100 ms disruption).

/proc/umka/rdma/peer/<peer_id>/security_mode
# Per-peer override. Takes precedence over the cluster-wide default.
# Accepts: "permissive", "strict", "auto".
# Useful for pinning specific cross-datacenter links to Strict while
# keeping intra-rack links at Permissive.

Selection criteria summary:

| Topology | Auto selects | Rationale |
|---|---|---|
| Same rack, single-hop IB | Permissive | Physically isolated; no shared infrastructure |
| Cross-rack, multi-hop IB | Strict | Shared spine switches widen attack surface |
| RoCE (any topology) | Strict | Ethernet fabric is typically shared; MACsec may not cover all hops |
| CXL fabric | Permissive | CXL 3.0 provides hardware-enforced access control via LD-ID |
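
The Auto rule can be condensed into a testable sketch. The `FabricKind` enum, `Edge` struct, and `auto_select` name are illustrative stand-ins for the spec's `TopologyEdge`-based selection, not kernel API:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode { Permissive, Strict }

#[derive(Clone, Copy)]
enum FabricKind { InfiniBand, Roce, Cxl }

/// Topology facts consulted once at QP setup time (mirrors TopologyEdge).
struct Edge { hop_count: u8, same_rack: bool }

fn auto_select(fabric: FabricKind, edge: Edge) -> Mode {
    match fabric {
        FabricKind::Roce => Mode::Strict,     // Ethernet fabric is typically shared
        FabricKind::Cxl => Mode::Permissive,  // LD-ID access control in hardware
        FabricKind::InfiniBand => {
            if edge.hop_count <= 1 && edge.same_rack {
                Mode::Permissive              // single-hop, physically isolated
            } else {
                Mode::Strict                  // traffic crosses a shared spine
            }
        }
    }
}

fn main() {
    let intra = Edge { hop_count: 1, same_rack: true };
    let cross = Edge { hop_count: 3, same_rack: false };
    assert_eq!(auto_select(FabricKind::InfiniBand, intra), Mode::Permissive);
    assert_eq!(auto_select(FabricKind::InfiniBand, cross), Mode::Strict);
}
```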

Why trusted mode is the default: The security boundary in an UmkaOS cluster is the cluster join authentication (Section 5.2: X25519 key exchange authenticated with Ed25519 signatures). Once a node is authenticated into the cluster, it is a trusted peer. RDMA pool-wide access does not expand the trust boundary — a compromised node already has full access to cluster resources through authenticated channels.

Datacenter environments provide additional isolation layers:

- Physical access control (racks, cages, badge readers)
- Network isolation (VLANs, VXLANs, dedicated RDMA fabrics)
- Node attestation during cluster join

For these environments, the ~5% overhead of memory windows buys no additional protection. Secure mode is appropriate for:

- Multi-tenant cloud environments where different customers share an RDMA fabric
- Environments without physical network isolation
- Defense-in-depth deployments where the threat model includes node compromise despite authentication

Switching to secure mode:

echo "secure" > /sys/kernel/umka/cluster/security_mode
# Note: This triggers a brief (~100ms) disruption as QPs are reconfigured

5.4.7 RDMA Pool Manager

All subsystems that require RDMA-registered memory (DSM, DLM, NVMe-oF, Block Service Provider) allocate from a single managed pool rather than independently registering their own memory regions. This ensures coordinated memory registration, per-subsystem quota enforcement, and safe resize operations without MR fragmentation or NIC resource exhaustion.

/// Maximum number of subsystems that can register with the RDMA pool manager.
/// Current consumers: DSM, NVMe-oF, Block Service Provider, DLM, reserve.
/// 8 provides headroom for future subsystems (distributed VFS, cluster
/// networking, etc.).
pub const MAX_RDMA_SUBSYSTEMS: usize = 8;

/// Centralized RDMA memory pool manager. Initialized at cluster join time
/// (boot sequence step 3, see Section 5.4.3). All RDMA-registered memory
/// is managed through this single instance.
///
/// **Allocation**: `RdmaPoolManager` MUST be heap-allocated (`Box` or slab),
/// never on the kernel stack. The `regions` array contains per-NUMA-node
/// `RdmaNodeRegion` entries, each embedding a `SpinLock<RdmaBuddyAllocator>`
/// (~21.9 KiB on 64-bit). With `NUMA_NODES_STACK_CAP = 64` nodes, the
/// inline size would exceed 1.4 MiB — far beyond any kernel stack (8-16 KiB).
/// Using `Box<[RdmaNodeRegion]>` allocates only the actual number of NUMA
/// nodes discovered at boot (typically 1-8), avoiding both stack overflow
/// and wasted inline capacity.
pub struct RdmaPoolManager {
    /// Total pool size (bytes). Determined at cluster join time by the
    /// sizing formula in Section 5.4.3: clamp(min(RAM * 25%, max_pool_gib),
    /// 256 MiB, total_ram). Configurable via `rdma.max_pool_gib` boot
    /// parameter.
    pub total_size: usize,

    /// Pre-registered memory regions, one per NUMA node. Each MR covers
    /// the node's proportional share of the pool. Registered once at
    /// cluster join time (~100 μs total).
    ///
    /// Heap-allocated (not `ArrayVec<..., NUMA_NODES_STACK_CAP>`) because
    /// each `RdmaNodeRegion` embeds a ~21.9 KiB buddy allocator.
    /// `ArrayVec<RdmaNodeRegion, 64>` would consume ~1.4 MiB inline,
    /// making `RdmaPoolManager` itself unsafe for stack allocation.
    /// Length equals the number of NUMA nodes discovered at boot.
    pub regions: Box<[RdmaNodeRegion]>,

    /// Per-subsystem quota allocation and current usage. Protected by
    /// a spinlock because quota updates happen on warm paths (subsystem
    /// init, resize) — never on the per-operation hot path.
    pub quotas: SpinLock<ArrayVec<RdmaSubsystemQuota, MAX_RDMA_SUBSYSTEMS>>,

    /// Number of MR registrations currently active on the NIC. Tracked
    /// against the NIC's hardware MR limit (typically 128-512 MRs,
    /// queried from the HCA at init). If `mr_count` approaches
    /// `mr_limit`, subsystems are warned to consolidate registrations.
    pub mr_count: AtomicU32,
    /// Hardware MR limit reported by the NIC (ibv_query_device.max_mr).
    pub mr_limit: u32,
}

/// Per-NUMA-node RDMA memory region.
pub struct RdmaNodeRegion {
    /// NUMA node ID (local hardware topology, not cluster-level NodeId which is u64).
    pub numa_node: u32,
    /// Registered memory region handle (NIC-side).
    pub mr: RdmaMemoryRegion,
    /// Base virtual address of this node's pool region.
    pub base: *mut u8,
    /// Size of this node's pool region in bytes.
    pub size: usize,
    /// Bump allocator offset within this region. Allocation is
    /// monotonically increasing within a region; freed slices are
    /// returned to the buddy allocator for reuse.
    pub alloc_offset: AtomicUsize,
    /// Buddy allocator for freed regions. Guarantees O(log N) free regions
    /// by construction: adjacent power-of-two blocks are coalesced on free,
    /// preventing unbounded fragmentation regardless of alloc/free patterns.
    ///
    /// For a 1 TiB pool with 4 KiB minimum block size: 29 order levels
    /// (log2(1 TiB / 4 KiB) = 28, plus order 0), with a small bounded
    /// number of free blocks per order in steady state. Total free list
    /// entries are bounded by O(log(pool_size / min_block_size)) —
    /// never more than ~50 entries for any practical pool size.
    ///
    /// Allocation: find smallest order >= requested size, split if needed.
    /// Free: return block, coalesce with buddy if buddy is also free.
    /// Both operations: O(log N) where N = pool_size / min_block_size.
    ///
    /// This replaces the prior ArrayVec<RdmaFreeSlice, 256> free list which
    /// could overflow under fragmented alloc/free patterns (DSM page churn
    /// with interleaved allocations), silently leaking RDMA pool memory.
    pub buddy: SpinLock<RdmaBuddyAllocator>,
}

/// Buddy allocator for a contiguous RDMA memory region.
///
/// Each order k tracks free blocks of size `min_block_size * 2^k`.
/// Order 0 = minimum allocation unit (typically PAGE_SIZE = 4 KiB).
/// Maximum order = log2(region_size / min_block_size).
///
/// Free lists per order use ArrayVec with a generous bound — coalescing
/// merges any free buddy pair on free, so no two free blocks at the same
/// order are ever buddies, and steady state holds ~2 entries per order.
/// The ArrayVec capacity of 64 per order handles transient fragmentation
/// during concurrent free bursts.
pub struct RdmaBuddyAllocator {
    /// Minimum block size (typically PAGE_SIZE = 4096).
    pub min_block_size: usize,
    /// Number of orders: log2(region_size / min_block_size) + 1.
    /// 29 for a 1 TiB region with 4 KiB blocks; the free-list capacity
    /// of 42 order levels supports regions up to 8 PiB.
    pub max_order: u8,
    /// Free list per order. Index k contains free blocks of size
    /// `min_block_size * 2^k`. Each entry is the byte offset within
    /// the region.
    /// Capacity 64 per order: buddy coalescing ensures steady-state is
    /// ~2 entries per order; capacity 16 would suffice for all
    /// non-pathological workloads. 64 is generous to handle burst
    /// concurrency during rapid alloc/free cycles. Total inline size:
    /// ~42 × (64 × 8 + overhead) ≈ 21.9 KiB per NUMA node — negligible
    /// compared to the RDMA memory region (hundreds of MiB to TiB).
    pub free_lists: ArrayVec<ArrayVec<usize, 64>, 42>,
    /// Bitmap tracking which blocks are free (for buddy coalescing).
    /// One bit per minimum-sized block. Allocated from the region's
    /// metadata slab (not the RDMA region itself).
    pub bitmap: Box<[u64]>,
}
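
The split-and-coalesce mechanics described above can be illustrated with a minimal userspace sketch. This is an offsets-only model — the kernel version stores free offsets in bounded ArrayVecs and uses the bitmap for O(1) buddy lookup; `Vec` and linear search are used here purely for brevity:

```rust
/// Minimal buddy allocator over byte offsets (illustrative, not kernel code).
struct Buddy {
    min_block: usize,
    max_order: usize,
    free: Vec<Vec<usize>>, // free[k]: offsets of free blocks of size min_block << k
}

impl Buddy {
    fn new(region_size: usize, min_block: usize) -> Self {
        let max_order = (region_size / min_block).trailing_zeros() as usize;
        let mut free = vec![Vec::new(); max_order + 1];
        free[max_order].push(0); // the whole region starts as one free block
        Buddy { min_block, max_order, free }
    }

    fn order_for(&self, size: usize) -> usize {
        let blocks = ((size + self.min_block - 1) / self.min_block).max(1);
        blocks.next_power_of_two().trailing_zeros() as usize
    }

    /// Find the smallest free block of sufficient order, splitting larger
    /// blocks as needed. O(log N).
    fn alloc(&mut self, size: usize) -> Option<usize> {
        let want = self.order_for(size);
        let from = (want..=self.max_order).find(|&k| !self.free[k].is_empty())?;
        let off = self.free[from].pop().unwrap();
        for k in (want..from).rev() {
            self.free[k].push(off + (self.min_block << k)); // give back upper half
        }
        Some(off)
    }

    /// Return a block, coalescing with its buddy at each order while possible.
    fn free_block(&mut self, mut off: usize, size: usize) {
        let mut k = self.order_for(size);
        while k < self.max_order {
            let buddy = off ^ (self.min_block << k);
            match self.free[k].iter().position(|&o| o == buddy) {
                Some(i) => { self.free[k].swap_remove(i); off = off.min(buddy); k += 1; }
                None => break,
            }
        }
        self.free[k].push(off);
    }
}

fn main() {
    let mut b = Buddy::new(64 * 1024, 4096); // 64 KiB region, 4 KiB blocks
    let a = b.alloc(4096).unwrap();
    let c = b.alloc(4096).unwrap();
    b.free_block(a, 4096);
    b.free_block(c, 4096);
    // Freeing everything coalesces back to a single max-order block.
    assert_eq!(b.free[b.max_order], vec![0]);
}
```

Note how coalescing maintains the invariant discussed above: no two free blocks at the same order are buddies of each other, since any such pair is merged on free.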

/// Per-subsystem quota tracking.
///
/// **Hot-path vs warm-path**: Hot-path allocation checks use
/// `allocated: AtomicUsize` (lock-free CAS, ~1-3 cycles). The
/// `quotas: SpinLock` in `RdmaPoolManager` protects only warm-path
/// operations (subsystem init, quota resize — ~10 per system lifetime).
/// These are distinct paths by design.
pub struct RdmaSubsystemQuota {
    /// Subsystem name (for diagnostics and sysfs reporting).
    pub name: &'static str,
    /// Percentage of total pool allocated to this subsystem (0-100).
    /// The sum of all subsystem quotas must equal 100.
    pub quota_pct: u8,
    /// Current bytes allocated by this subsystem. Tracked atomically
    /// for lock-free hot-path allocation checks.
    pub allocated: AtomicUsize,
    /// Maximum bytes this subsystem may allocate (quota_pct * total_size / 100).
    /// Computed once at init and updated on resize.
    pub max_bytes: usize,
}
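
The hot-path/warm-path split above implies a lock-free reservation on `allocated`. A sketch of what that CAS loop could look like, using std atomics for illustration (`Quota` and `try_reserve` are illustrative names, not the kernel API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct Quota { allocated: AtomicUsize, max_bytes: usize }

impl Quota {
    /// Hot-path reservation: lock-free CAS loop. Fails with no partial
    /// effect if the reservation would exceed max_bytes.
    fn try_reserve(&self, bytes: usize) -> bool {
        self.allocated
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
                let next = cur.checked_add(bytes)?;
                (next <= self.max_bytes).then_some(next)
            })
            .is_ok()
    }

    fn release(&self, bytes: usize) {
        self.allocated.fetch_sub(bytes, Ordering::AcqRel);
    }
}

fn main() {
    let q = Quota { allocated: AtomicUsize::new(0), max_bytes: 100 };
    assert!(q.try_reserve(60));
    assert!(!q.try_reserve(50)); // would exceed max_bytes; counter unchanged
    q.release(60);
    assert!(q.try_reserve(100));
}
```

The warm-path resize only rewrites `max_bytes` under the spinlock; the CAS loop observes the new bound on its next iteration without ever taking the lock.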

/// A slice of RDMA-registered memory returned by the pool manager.
/// The slice is within a pre-registered MR, so it is immediately
/// eligible for RDMA operations (zero registration overhead).
pub struct RdmaSlice {
    /// Pointer to the start of the allocated region.
    pub ptr: *mut u8,
    /// Length in bytes.
    pub len: usize,
    /// NUMA node this slice was allocated from (not cluster-level NodeId).
    pub numa_node: u32,
    /// Local key for the enclosing MR (needed for RDMA work requests).
    pub lkey: u32,
    /// Remote key for the enclosing MR (shared with cluster peers).
    pub rkey: u32,
}

Default quota distribution:

| Subsystem | Quota | Rationale |
|---|---|---|
| DSM | 40% | Page migration dominates RDMA traffic volume. A 64 GiB pool gives DSM 25.6 GiB of RDMA-registered pages for remote memory. Reduced from 50% to accommodate the Block Service Provider; DSM page working sets rarely exhaust 40% of the pool in practice — hot pages are reused, and cold pages are evicted to local swap before remote memory. |
| NVMe-oF | 25% | Storage I/O buffers. Each NVMe-oF queue pair needs pre-registered buffers for SQE/CQE data payloads. Reduced from 30% — most deployments use either NVMe-oF or Block Service Provider, not both simultaneously, so the combined storage quota (25% + 15%) exceeds the original 30%. |
| Block Service Provider | 15% | RDMA data buffers for the block service provider (Section 15.13). Each BlockServiceQueue requires pre-registered memory for RDMA Read/Write data transfer (read data is RDMA-Written into client buffers; write data is RDMA-Read from client buffers). Buffer size per queue: queue_depth x max_io_bytes (default: 128 x 1 MB = 128 MB). At 16 queues per client x 4 clients = ~8 GB, well within 15% of a 64 GiB pool (9.6 GiB). Deployments without block export can redistribute this quota to DSM or NVMe-oF via sysfs. |
| DLM | 10% | Lock messages are small (64-256 bytes) but frequent. 10% provides headroom for lock storms during failover. |
| Reserve | 10% | Burst headroom and future subsystems. Any subsystem can temporarily borrow from reserve when its quota is exhausted, subject to a 5-second borrow timeout after which it must free excess allocations. |
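
A quick consistency check of the default split (constants mirror the defaults above; names are illustrative, not kernel API). The percentages must sum to exactly 100, per the `RdmaSubsystemQuota` invariant:

```rust
// Default quota split from the table above.
const DEFAULT_QUOTAS: [(&str, u8); 5] =
    [("dsm", 40), ("nvme_of", 25), ("block_svc", 15), ("dlm", 10), ("reserve", 10)];

/// Sum of quota percentages (must equal 100).
fn quotas_sum(q: &[(&str, u8)]) -> u32 {
    q.iter().map(|&(_, p)| p as u32).sum()
}

/// max_bytes = quota_pct * total_size / 100 (integer arithmetic).
fn max_bytes(total_size: u64, pct: u8) -> u64 {
    total_size * pct as u64 / 100
}

fn main() {
    assert_eq!(quotas_sum(&DEFAULT_QUOTAS), 100);
    // 64 GiB pool -> DSM's 40% share is 25.6 GiB, as stated in the table.
    let gib = 1u64 << 30;
    assert_eq!(max_bytes(64 * gib, 40), 27_487_790_694);
}
```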

Quotas are configurable via sysfs:

/sys/kernel/umka/cluster/rdma_quota/<subsystem>/pct

impl RdmaPoolManager {
    /// Allocate RDMA-registered memory for a subsystem.
    /// Returns a slice within the pre-registered MR (zero-copy eligible).
    /// Allocation is NUMA-aware: prefers the local NUMA node of the
    /// calling CPU, falls back to remote nodes if the local node's
    /// region is exhausted.
    ///
    /// Returns `RdmaPoolError::QuotaExceeded` if the subsystem has
    /// reached its quota and the reserve pool is also exhausted.
    /// Returns `RdmaPoolError::OutOfMemory` if all regions are full.
    pub fn alloc(
        &self,
        subsystem: &str,
        size: usize,
    ) -> Result<RdmaSlice, RdmaPoolError> { /* ... */ }

    /// Free previously allocated RDMA memory. The freed slice is added
    /// to the per-node free list for reuse. Adjacent free slices are
    /// coalesced to reduce fragmentation.
    pub fn free(&self, slice: RdmaSlice) { /* ... */ }

    /// Resize the pool. This is a disruptive operation that requires
    /// quiescing all subsystems:
    ///
    /// 1. Notify all subsystems to drain in-flight RDMA operations
    ///    that reference the current MRs.
    /// 2. Wait for drain completion (each subsystem acknowledges).
    /// 3. Deregister old MRs (one per NUMA node).
    /// 4. Allocate new regions, register new MRs.
    /// 5. Update quota max_bytes for all subsystems.
    /// 6. Notify subsystems to re-bind to the new MRs (update lkey/rkey
    ///    in outstanding work requests).
    /// 7. Exchange new rkeys with cluster peers.
    ///
    /// Returns error if any subsystem cannot quiesce within `timeout_ms`.
    /// The old pool remains active if resize fails (no partial state).
    ///
    /// **DSM deadlock prevention**: Resize MUST NOT block DSM page fault
    /// resolution. DSM page faults acquire RDMA descriptors on the hot path
    /// (`alloc()` with `AtomicUsize` — lock-free). Resize operates on the
    /// global quota structure under `SpinLock` (warm path). The two paths
    /// are independent: `alloc()` checks `allocated < max_bytes` atomically,
    /// while resize updates `max_bytes` under lock after all subsystems
    /// quiesce. Quiescence request to the DSM subsystem means "complete
    /// in-flight RDMA operations and do not start new page transfers" —
    /// it does NOT mean "stop servicing page faults" (page faults that
    /// arrive during resize quiescence are queued and serviced from the
    /// new pool after resize completes). If DSM cannot quiesce within
    /// `timeout_ms`, resize is aborted (old pool stays active).
    pub fn resize(
        &self,
        new_size: usize,
        timeout_ms: u32,
    ) -> Result<(), RdmaPoolError> { /* ... */ }
}

pub enum RdmaPoolError {
    /// Subsystem has exceeded its quota allocation.
    QuotaExceeded { subsystem: &'static str, used: usize, max: usize },
    /// All RDMA pool regions are exhausted (no free space on any NUMA node).
    OutOfMemory,
    /// Resize failed: a subsystem did not quiesce within the timeout.
    QuiesceTimeout { subsystem: &'static str },
    /// NIC MR limit reached — cannot register additional memory regions.
    MrLimitReached { current: u32, limit: u32 },
}
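
The all-or-nothing property of `resize()` — the old pool stays active unless every subsystem quiesces in time — can be modeled with a toy sketch (not the kernel code; `Subsystem` and `resize` here are illustrative, and the error mirrors the `QuiesceTimeout` variant above):

```rust
struct Subsystem { name: &'static str, quiesces_in_ms: u32 }

#[derive(Debug, PartialEq)]
enum ResizeError { QuiesceTimeout { subsystem: &'static str } }

/// Returns the new pool size only if every subsystem quiesces within the
/// timeout; on failure the caller keeps the old size (no partial state).
fn resize(old: usize, new: usize, subs: &[Subsystem], timeout_ms: u32)
    -> Result<usize, (usize, ResizeError)>
{
    for s in subs {
        if s.quiesces_in_ms > timeout_ms {
            // Abort before touching any MR: old pool stays active.
            return Err((old, ResizeError::QuiesceTimeout { subsystem: s.name }));
        }
    }
    Ok(new) // then: dereg old MRs, register new, update quotas, exchange rkeys
}

fn main() {
    let subs = [
        Subsystem { name: "dsm", quiesces_in_ms: 20 },
        Subsystem { name: "dlm", quiesces_in_ms: 900 },
    ];
    // A slow subsystem aborts the resize and the old size survives intact.
    assert_eq!(
        resize(100, 200, &subs, 500),
        Err((100, ResizeError::QuiesceTimeout { subsystem: "dlm" }))
    );
    assert_eq!(resize(100, 200, &subs[..1], 500), Ok(200));
}
```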

NIC MR limit tracking: The pool manager tracks the number of MR registrations against the NIC's hardware limit (queried via ibv_query_device().max_mr at init time, typically 128-512 MRs on ConnectX-5/6/7). The per-NUMA-node registration model uses one MR per NUMA node (typically 1-8 MRs for 1-8 NUMA nodes), well within hardware limits. If the system approaches the MR limit (e.g., due to memory window usage in secure mode), the pool manager logs a warning and refuses new registrations until existing MRs are freed. Subsystems are warned via a callback so they can consolidate registrations (e.g., merge multiple small MRs into a single larger one).

Relationship to Section 5.4.3: The pool sizing formula, NUMA-local allocation strategy, boot sequence, and runtime adjustment protocol defined in Section 5.4 remain authoritative. The RdmaPoolManager is the implementation of that policy — it enforces the sizing formula, manages per-NUMA regions, and provides the alloc/free API that subsystems use instead of directly calling ibv_reg_mr().
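
The sizing formula quoted in the `total_size` doc comment — clamp(min(RAM * 25%, max_pool_gib), 256 MiB, total_ram) — can be sketched directly (function and constant names are illustrative):

```rust
const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;

/// clamp(min(RAM * 25%, max_pool_gib), 256 MiB, total_ram)
fn pool_size(total_ram: u64, max_pool_gib: u64) -> u64 {
    (total_ram / 4)                 // 25% of RAM
        .min(max_pool_gib * GIB)    // capped by the boot parameter
        .clamp(256 * MIB, total_ram)
}

fn main() {
    // 64 GiB host with rdma.max_pool_gib=8: the cap wins over 16 GiB (25%).
    assert_eq!(pool_size(64 * GIB, 8), 8 * GIB);
    // 512 MiB host: 25% would be 128 MiB, but the 256 MiB floor applies.
    assert_eq!(pool_size(512 * MIB, 64), 256 * MIB);
}
```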

5.4.8 QP Tear-down Protocol

Destroying a QP while messages are in-flight or Work Requests (WRs) are pending can result in completion events being lost, memory corruption, or use-after-free in the completion handler. The correct procedure for safe QP tear-down is:

  1. Transition to ERROR state. Call ibv_modify_qp(qp, IBV_QPS_ERR). This moves all outstanding WRs to error state and generates flush completions (with IBV_WC_WR_FLUSH_ERR status) for every pending send and receive WR. The QP no longer accepts new post operations after this transition.

  2. Drain the Completion Queue. Poll the CQ until it is empty:

    loop {
        let n = ibv_poll_cq(cq, &mut wc_buf);
        if n == 0 { break; }
        // Handle or discard flush completions — all have IBV_WC_WR_FLUSH_ERR status.
        // Any non-flush completion here indicates a protocol bug (WRs posted after
        // the ERROR transition).
    }
    
    Draining ensures that no completion handler holds a reference to the QP after ibv_destroy_qp() is called. Skipping this step is a use-after-free vulnerability if any in-flight WR's completion fires asynchronously after the QP is freed.

  3. Destroy the QP. Only after the CQ is empty: call ibv_destroy_qp(qp). At this point no in-flight operation references the QP.

  4. Destroy the CQ. After the QP is destroyed. Destroying the CQ before the QP can cause the NIC to write completions to freed memory.

UmkaOS enforcement: UmkaOS's RDMA transport layer tracks QP reference counts and enforces the protocol:

- ibv_destroy_qp() returns EBUSY if the QP has not been transitioned to IBV_QPS_ERR state first.
- ibv_destroy_qp() returns EBUSY if the CQ still has pending completions referencing this QP (detected by checking the QP's tracked inflight counter and scanning the CQ's pending-completion ring before accepting the destroy call).
- These checks fire in both debug and release builds, because a UAF from an out-of-order tear-down is a correctness error on all configurations.

This protocol applies to all QP types used by UmkaOS (control and data QPs in NodeConnection, doorbell QPs in distributed ring buffers). The cluster failure handler (Section 5.8) follows this protocol when destroying QPs after a node is marked Dead: it transitions each QP to ERROR state, drains completions, then destroys the QP, then destroys the associated CQ.
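
A toy model of the two enforcement guards (the EBUSY value, `QpState`, and `destroy_qp` here are illustrative; the real checks live in the RDMA transport layer wrapping libibverbs):

```rust
#[derive(PartialEq)]
enum QpState { Rts, Error } // RTS = ready-to-send (normal operation)

const EBUSY: i32 = 16; // conventional errno value, shown for illustration

struct TrackedQp { state: QpState, inflight_wrs: u32 }

/// Mirrors the guards above: the QP must be in ERROR state (step 1)
/// and the CQ drained of its completions (step 2) before destruction.
fn destroy_qp(qp: &TrackedQp) -> Result<(), i32> {
    if qp.state != QpState::Error { return Err(EBUSY); } // step 1 skipped
    if qp.inflight_wrs != 0 { return Err(EBUSY); }       // step 2 skipped
    Ok(()) // safe: no in-flight operation references the QP
}

fn main() {
    assert_eq!(destroy_qp(&TrackedQp { state: QpState::Rts, inflight_wrs: 0 }), Err(EBUSY));
    assert_eq!(destroy_qp(&TrackedQp { state: QpState::Error, inflight_wrs: 3 }), Err(EBUSY));
    assert_eq!(destroy_qp(&TrackedQp { state: QpState::Error, inflight_wrs: 0 }), Ok(()));
}
```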


5.5 Distributed IPC

5.5.1 Extending Ring Buffers to RDMA

UmkaOS's IPC is built on MPSC ring buffers (Section 11.7, Zero-Copy I/O Path, which defines the MPSC ring buffer protocol), using SQE/CQE structures compatible with io_uring. The same ring buffer protocol works over RDMA.

Local IPC (current design, unchanged):
  ┌───────────┐  shared memory ring  ┌───────────┐
  │ Process A │ ───── SQE/CQE ─────► │ Process B │
  └───────────┘    (mapped pages)    └───────────┘
  Same machine, same address space region.
  Zero-copy. Latency: ~200ns.

Distributed IPC (new):
  ┌───────────┐  RDMA ring           ┌───────────┐
  │ Process A │ ───── SQE/CQE ─────► │ Process B │
  │ (Node 0)  │    (RDMA write)      │ (Node 1)  │
  └───────────┘                      └───────────┘
  Different machines, RDMA-connected.
  Zero-copy (RDMA, no kernel networking stack). Latency: ~2-3 μs (RDMA Write).

The ring buffer protocol (SQE format, CQE format, head/tail pointers,
memory ordering) is identical. Only the transport changes.

5.5.2 Transparent Transport Selection

// umka-core/src/ipc/ring.rs (extend existing)

pub enum RingTransport {
    /// Both endpoints on the same machine.
    /// Ring is in shared memory (mapped into both processes).
    SharedMemory {
        ring_pages: PageRange,
    },

    /// Endpoints on different machines.
    /// Submissions: producer RDMA-writes SQEs into consumer's ring memory.
    /// Completions: consumer RDMA-writes CQEs back.
    /// Doorbell: RDMA send (small control message) notifies consumer.
    Rdma {
        remote_peer: PeerId,
        remote_ring_addr: u64,
        remote_ring_rkey: u32,
        local_ring_pages: PageRange,
        doorbell_qp: RdmaQpHandle,
    },

    /// Endpoints connected via CXL fabric (load/store accessible).
    /// Ring is in CXL-attached shared memory. Same as SharedMemory
    /// but pages are on a CXL memory pool node.
    CxlSharedMemory {
        ring_pages: PageRange,
        cxl_node: NumaNodeId,
    },
}

Transport selection is automatic and follows a deterministic algorithm:

/// Per-peer transport preference. Set via cluster configuration or
/// overridden per-connection by the application (via `setsockopt`-style
/// IPC option).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[repr(u8)]
pub enum PeerTransportPreference {
    /// Automatically select the best available transport.
    /// Selection order: CXL shared memory > RDMA > TCP.
    /// This is the default.
    Auto = 0,

    /// Force RDMA transport. Fail the connection if RDMA is not
    /// available to the target peer (returns `ENODEV`).
    RdmaOnly = 1,

    /// Force TCP transport even if RDMA is available. Useful for
    /// debugging, testing, or when RDMA NICs are reserved for
    /// specific workloads.
    TcpOnly = 2,
}

Transport selection algorithm (select_transport()):

select_transport(local_peer, remote_peer, preference) -> Result<RingTransport>:

  // Step 0: Same machine — always use shared memory.
  if remote_peer.node_id == local_peer.node_id {
      return Ok(RingTransport::SharedMemory { ring_pages })
  }

  // Step 1: Check explicit preference.
  match preference {
      TcpOnly => return setup_tcp_transport(remote_peer),
      RdmaOnly => {
          return try_rdma(local_peer, remote_peer)
              .map_err(|_| Error::ENODEV)  // Fail hard if RDMA unavailable.
      }
      Auto => { /* Fall through to auto-selection */ }
  }

  // Step 2 (Auto): Try CXL shared memory first.
  // If both peers are on the same CXL fabric and can access a shared
  // CXL memory pool, use load/store semantics (lowest latency).
  if let Some(cxl_node) = cxl_fabric_shared_pool(local_peer, remote_peer) {
      return Ok(RingTransport::CxlSharedMemory { ring_pages, cxl_node })
  }

  // Step 3 (Auto): Try RDMA.
  // Requirements: both peers have RDMA-capable NICs, IB subnet is
  // reachable (GID resolution succeeds), and QP creation succeeds.
  match try_rdma(local_peer, remote_peer) {
      Ok(transport) => return Ok(transport),
      Err(reason) => {
          // Log why RDMA was not selected (for diagnostics).
          log_info!("RDMA unavailable to peer {}: {:?}, falling back to TCP",
                    remote_peer.id, reason);
      }
  }

  // Step 4 (Auto): Fallback to TCP.
  // This is the default path for s390x peers (no PCIe, no RDMA NIC in most
  // configurations). s390x peers use TCP over HiperSockets (intra-CEC) or
  // OSA-Express (physical network). Phase 3+ adds PeerTransportType::HiperSockets
  // as a dedicated QDIO-based transport for sub-microsecond intra-CEC latency.
  setup_tcp_transport(remote_peer)

try_rdma(local_peer, remote_peer) -> Result<RingTransport, RdmaSetupError>:
  // (a) Check local RDMA NIC availability.
  let local_nic = rdma_device_for_peer(local_peer)?;  // ENODEV if no RDMA NIC

  // (b) Check remote RDMA NIC availability (from PeerRegistry metadata).
  let remote_nic = remote_peer.rdma_capable()?;  // ENODEV if peer has no RDMA

  // (c) Resolve IB subnet reachability: query the local SM (Subnet Manager)
  //     for a path to the remote GID. Timeout: 500ms.
  let path = ib_resolve_path(local_nic.gid, remote_nic.gid, timeout_ms=500)?;

  // (d) Create QP (Queue Pair) for this connection.
  //     Uses RC (Reliable Connection) QP type for ordering guarantees.
  let qp = rdma_create_qp(local_nic, &path, QpType::RC)?;

  // (e) Register the local ring pages as an RDMA MR (memory region) so the
  //     remote producer can Write into them. (rdma_register_ring is
  //     pseudocode shorthand for MR allocation + registration.)
  let local_ring_pages = rdma_register_ring(local_nic)?;

  // (f) Exchange ring buffer memory registration (rkey) with remote peer
  //     via the TCP control channel (one-time setup).
  let (remote_ring_addr, remote_ring_rkey) =
      exchange_ring_keys(qp, remote_peer)?;

  Ok(RingTransport::Rdma {
      remote_peer: remote_peer.id,
      remote_ring_addr,
      remote_ring_rkey,
      local_ring_pages,
      doorbell_qp: qp,
  })

RDMA setup failure reasons (returned by try_rdma):

| RdmaSetupError variant | Meaning                                             | Auto fallback? |
|------------------------|-----------------------------------------------------|----------------|
| NoLocalDevice          | This node has no RDMA NIC                           | Yes → TCP      |
| NoRemoteDevice         | Remote peer advertises no RDMA capability           | Yes → TCP      |
| PathResolutionFailed   | IB subnet manager cannot find a route to remote GID | Yes → TCP      |
| QpCreationFailed       | QP creation error (resource limit, NIC error)       | Yes → TCP      |
| KeyExchangeFailed      | Control channel timeout during rkey exchange        | Yes → TCP      |

Process A calls connect_ipc(target_process_id):
  1. Kernel looks up target process location (local or remote peer).
  2. Kernel calls select_transport() with the peer's PeerTransportPreference.
  3. Transport is selected per the algorithm above.
  4. Kernel sets up ring buffer with the selected transport.
  5. Process A gets back an IPC handle (opaque).
  6. Process A uses the same SQE submission interface regardless of transport.
  7. The SQE/CQE format is identical in all cases.
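The transport-agnostic handle in steps 5-7 can be sketched as follows. This is a minimal model, not spec API: the `RingTransport` variants here are simplified stand-ins (the real variants carry ring pages, QPs, and remote addresses as shown earlier), and `Sqe` is a placeholder 64-byte entry layout.

```rust
/// Simplified stand-ins for the transports selected above. The real
/// RingTransport variants carry ring pages, QPs, rkeys, etc.
enum RingTransport {
    SharedMemory,
    CxlSharedMemory,
    Rdma,
    Tcp,
}

/// A 64-byte submission queue entry, identical on every transport.
/// (Field layout is illustrative only.)
#[derive(Clone, Copy)]
struct Sqe {
    opcode: u16,
    flags: u16,
    payload: [u8; 60],
}

/// The caller-visible submit path: the SQE never changes shape; only the
/// delivery mechanism behind the opaque handle differs.
fn submit(transport: &RingTransport, _sqe: Sqe) -> &'static str {
    match transport {
        RingTransport::SharedMemory => "local ring write",
        RingTransport::CxlSharedMemory => "CXL load/store ring write",
        RingTransport::Rdma => "RDMA Write + seq + doorbell",
        RingTransport::Tcp => "TCP framed send",
    }
}
```

The point of the sketch is that process A's code path is identical in all four cases; only the kernel-internal dispatch differs.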

5.5.3 Ring Buffer RDMA Protocol

RDMA Ring Buffer Header Extension

The RDMA transport extends the base DomainRingBuffer header (Section 11.8) with additional fields for cross-node synchronization. These fields are placed in a separate RdmaRingHeader that precedes the standard header:

/// RDMA-specific ring buffer header extension.
/// Placed immediately before the DomainRingBuffer in RDMA transport mode.
/// Consumer-owned fields are on a separate cache line from producer-owned fields.
/// Uses LeAtomicU64 for correct endianness on mixed-endian clusters
/// ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)).
#[repr(C, align(64))]
pub struct RdmaRingHeader {
    // === Producer-owned cache line (written by remote producer via RDMA) ===
    /// Sequence number of the last published SQE.
    /// The producer increments this AFTER writing SQE data to the ring.
    /// The doorbell message carries this sequence number.
    /// Memory ordering: producer uses Release, consumer uses Acquire.
    pub producer_seq: LeAtomicU64,
    /// Reserved for future use (cache line padding).
    _pad_producer: [u8; 56],

    // === Consumer-owned cache line (written locally) ===
    /// Sequence number up to which the consumer has processed.
    /// Used for flow control and to detect dropped entries.
    pub consumer_seq: LeAtomicU64,
    /// Reserved for future use (cache line padding).
    _pad_consumer: [u8; 56],
}
// 64-byte cache line separation. On POWER9/10 with 128-byte L1 cache
// lines, both fields share the same L1 line; this is negligible compared
// to RDMA round-trip latency (~1-5 μs). Widening to 256 bytes would
// waste memory on all other architectures for a marginal PPC64LE benefit.
const_assert!(core::mem::size_of::<RdmaRingHeader>() == 128);

Protocol: Sequence-Based Doorbell Synchronization

The core challenge is that RDMA Write and RDMA Send operations may arrive at the consumer out of order — the doorbell (RDMA Send) could arrive before the SQE data (RDMA Write) is visible in memory. To solve this, we use a sequence counter that the consumer polls to determine data readiness:

Producer (Node A) submits work:
  1. Write SQE to local staging buffer
  2. RDMA Write: push SQE to consumer's ring memory on Node B
     (one-sided, no CPU involvement on Node B)
  3. RDMA Write: update producer_seq in consumer's RdmaRingHeader.
     The new value is (previous_seq + 1).
     On RC (Reliable Connection) QPs, RDMA Writes are guaranteed to be delivered
     and executed in posting order at the responder. Therefore, the SQE data
     (step 2) is always visible before producer_seq (step 3), without
     requiring any additional fencing.
     Note: IBV_SEND_FENCE is NOT needed here — it only orders operations after
     prior RDMA Reads and Atomics, not after prior Writes.
     (Each producer has a dedicated QP per consumer, so per-QP ordering suffices.)
  4. If consumer was idle: RDMA Send doorbell with payload = new producer_seq value.
     The doorbell serves only as a notification to wake the consumer; the consumer
     does NOT trust the doorbell's sequence value directly. Instead, the consumer
     reads producer_seq from local memory (where it was written by step 3) with
     Acquire ordering to establish the happens-before relationship.

Consumer (Node B) processes work:
  1. Receive doorbell notification (RDMA Send) OR poll timeout
  2. Read producer_seq with Acquire ordering from local RdmaRingHeader
  3. Compare producer_seq to consumer_seq to determine how many entries are ready:
     ready_count = producer_seq - consumer_seq
  4. For each ready entry:
     a. Read SQE from local ring memory (RDMA write from step 2 already placed it there)
     b. Process request
     c. Write CQE to local completion ring
  5. Update consumer_seq with Release ordering after processing
  6. If completions generated: RDMA Write CQEs back to producer's ring on Node A

Latency breakdown:
  RDMA Write (SQE, 64 bytes):  ~1 μs
  RDMA Write (producer_seq):   ~0.5 μs (RC QP guarantees Write-after-Write ordering;
                                no IBV_SEND_FENCE needed — see note above)
  RDMA Send (doorbell):        ~0.5 μs (may arrive before or after step 3;
                                consumer uses seq polling to handle either order)
  Consumer processing:         application-dependent
  RDMA Write (CQE, 32 bytes):  ~1 μs
  Total overhead: ~3 μs round-trip (plus processing time; +0.5 μs vs. non-seq approach
  for the producer_seq write, but eliminates the race condition)

Why sequence-based synchronization is required:

RDMA Send and RDMA Write are different verb types. While RDMA Writes are ordered within a single QP, the relationship between RDMA Send and RDMA Write is not guaranteed by the InfiniBand specification. A naive protocol that sends the doorbell after writing SQE data could observe:

  Timeline (broken protocol):
  Producer: Write SQE --(network)--> Consumer receives SQE in memory
  Producer: Send doorbell --(network)--> Consumer receives doorbell interrupt

  If the doorbell packet arrives first (different routing, less data to transfer),
  the consumer's interrupt handler reads the ring before the SQE RDMA Write has
  arrived — it sees stale or uninitialized data.

The sequence counter solves this by making the data-visibility indication itself travel via RDMA Write (which is ordered with respect to the SQE data writes). The doorbell is merely an optimization to avoid spinning; correctness depends only on the producer_seq field, which the consumer reads with Acquire ordering.

Memory ordering semantics:

| Operation                    | Ordering | Rationale                                       |
|------------------------------|----------|-------------------------------------------------|
| Producer writes SQE data     | Relaxed  | No ordering requirement until seq is published  |
| Producer writes producer_seq | Release  | Ensures SQE data is visible before seq advances |
| Consumer reads producer_seq  | Acquire  | Ensures seq read happens before SQE data read   |
| Consumer writes consumer_seq | Release  | Ensures SQE processing completes before ack     |
| Producer reads consumer_seq  | Acquire  | Ensures ack read happens before reusing slots   |

On x86-64, Release/Acquire compile to plain MOV instructions (TSO provides the required ordering). On AArch64, RISC-V, and PowerPC, the compiler emits the appropriate barriers (STLR/LDAR, fence-qualified atomics, lwsync/isync).
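The ordering table maps directly onto atomic operations. Below is a single-address-space model of the seq handshake — the real consumer reads a header that a remote producer wrote via RDMA Write, but the Release/Acquire pairing is identical (the struct and method names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Local model of the producer_seq / consumer_seq handshake.
struct SeqPair {
    producer_seq: AtomicU64,
    consumer_seq: AtomicU64,
}

impl SeqPair {
    fn new() -> Self {
        SeqPair { producer_seq: AtomicU64::new(0), consumer_seq: AtomicU64::new(0) }
    }

    /// Producer: publish one entry. SQE data must already be written
    /// (Relaxed) before this Release store makes it visible.
    fn publish(&self) {
        self.producer_seq.fetch_add(1, Ordering::Release);
    }

    /// Consumer: how many entries are ready? The Acquire load pairs with
    /// the producer's Release, making the SQE data visible before any
    /// subsequent ring read.
    fn ready(&self) -> u64 {
        let p = self.producer_seq.load(Ordering::Acquire);
        let c = self.consumer_seq.load(Ordering::Relaxed);
        p - c
    }

    /// Consumer: acknowledge n processed entries with Release so the
    /// producer's Acquire read of consumer_seq can safely reuse slots.
    fn ack(&self, n: u64) {
        self.consumer_seq.fetch_add(n, Ordering::Release);
    }
}
```

On x86-64 every one of these operations compiles to a plain load/store plus `lock xadd` for the increments; on weaker memory models the compiler inserts the barriers listed above.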

5.5.4 Batching and Coalescing

For high-throughput scenarios (e.g., database replication, event streaming):

/// Batch submission: write multiple SQEs in a single RDMA operation.
/// Amortizes RDMA overhead across N entries.
pub struct RdmaBatchSubmit {
    /// Number of SQEs to submit in this batch.
    pub count: u32,
    /// Maximum time to wait for batch to fill before flushing (μs).
    pub max_coalesce_us: u32,
    /// Minimum batch size before flushing.
    pub min_batch_size: u32,
}

// With batching:
//   Single SQE: ~3 μs overhead per entry
//   Batch of 64 SQEs: ~5 μs total = ~78 ns per entry (38x improvement)
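The flush policy implied by `RdmaBatchSubmit` and the amortization arithmetic can be sketched in a few lines (`should_flush` and `per_entry_ns` are illustrative helpers, not spec API):

```rust
/// Flush when the batch reaches min_batch_size, or when max_coalesce_us
/// has elapsed with at least one pending entry (never flush an empty batch).
fn should_flush(pending: u32, elapsed_us: u32, min_batch_size: u32, max_coalesce_us: u32) -> bool {
    pending >= min_batch_size || (pending > 0 && elapsed_us >= max_coalesce_us)
}

/// Per-entry overhead in nanoseconds when a fixed RDMA cost is amortized
/// across a batch: batch_overhead_us * 1000 / batch_size.
fn per_entry_ns(batch_overhead_us: u64, batch_size: u64) -> u64 {
    batch_overhead_us * 1_000 / batch_size
}
```

With the figures above, `per_entry_ns(5, 64)` reproduces the ~78 ns per-entry cost quoted for a 64-SQE batch.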

5.6 Cluster-Aware Scheduler

Dependency note: The cluster-aware scheduler is a consumer of the distributed infrastructure (peer protocol, topology graph, DLM, DSM) but is not a dependency of any other distributed subsystem. The peer protocol, cluster membership, DLM, DSM, CXL integration, and all capability services are fully functional without the cluster scheduler or process migration.

Within this section, lightweight thread migration (§5.6.4, DSM-triggered, ~17 KB transfer, ~10-20 μs) is tightly integrated with DSM — it's the natural response when a thread's working set is strongly remote. Full process migration (§5.6.4, fd remapping, TCP redirect, GPU contexts) is a standalone Phase 4+ feature with complex cross-subsystem dependencies. It is specified here for completeness but should not be considered a prerequisite for any other feature in the distributed architecture.

5.6.1 Problem

The scheduler (Section 7.1) currently optimizes process placement within a single machine: NUMA-aware load balancing, work stealing between CPUs, migration cost modeling. For a distributed kernel, the scheduler should consider the entire cluster.

5.6.2 Design: Two-Level Scheduler

Level 1: Global Cluster Scheduler (runs every ~10s, lightweight)
  - Monitors per-node load (CPU, memory, accelerator utilization)
  - Decides process-to-node placement
  - Triggers process migration when data locality warrants it
  - Respects node affinity, cgroup constraints, capability requirements

Level 2: Per-Node Scheduler (existing, runs every ~4ms)
  - CFS/EEVDF + RT + DL queues (unchanged)
  - NUMA-aware CPU placement (unchanged)
  - Accelerator scheduling (Section 22.1.2.4, unchanged)

5.6.3 Global Scheduler State

// umka-core/src/sched/cluster.rs

pub struct ClusterScheduler {
    /// Per-peer load summary (updated via periodic RDMA exchange).
    /// XArray keyed by PeerId (u64) — O(1) lookup with native RCU-compatible
    /// reads (scheduling decisions are frequent and lock-free);
    /// writes (periodic load updates from peers) use XArray's internal locking.
    peer_loads: XArray<PeerLoad>,

    /// Process-to-data affinity map.
    /// Tracks which peer holds most of a process's working set.
    /// Uses a slab allocator keyed by ProcessId for O(1) average lookup;
    /// tree-based maps with O(log n) traversal and poor cache locality are
    /// unsuitable for per-scheduling-tick access under load.
    data_affinity: SlabMap<ProcessId, DataAffinity>,

    /// Topology graph (Section 5.2.5) — replaces ClusterDistanceMatrix.
    /// Used for cost computation between peers.
    topology: &'static TopologyGraph,

    /// Global load balance interval.
    balance_interval_ms: u32,   // Default: 10000ms (10 seconds)

    /// Migration threshold: only migrate if improvement exceeds this.
    migration_threshold_ppt: u32,   // Default: 300 (300/1000 = 30% locality improvement)
}

pub struct PeerLoad {
    peer_id: PeerId,

    /// CPU utilization 0-100% (average across all CPUs).
    /// Populated from HeartbeatMessage.cpu_percent (u8, same 0-100 scale).
    cpu_percent: u32,

    /// Memory pressure 0-100% (0 = plenty free, 100 = thrashing).
    /// Populated from HeartbeatMessage.memory_pressure (u8, 0-255 scale).
    /// Conversion: `(hb.memory_pressure as u32 * 100) / 255`.
    memory_pressure: u32,

    /// Accelerator utilization 0-100% (average across all accelerators).
    /// Populated from HeartbeatMessage.accel_percent (u8, same 0-100 scale).
    accel_percent: u32,

    /// Number of runnable processes.
    /// Populated from HeartbeatMessage.runnable_count (Le32).
    runnable_count: u32,

    /// Remote memory faults per second (high = poor locality).
    /// Populated from HeartbeatMessage.remote_fault_rate (Le32).
    remote_fault_rate: u32,
}
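The per-field conversions documented in the comments above can be collected into one constructor sketch. `HeartbeatFields` is a hypothetical subset of `HeartbeatMessage` (the real message carries more fields); the scales are as documented:

```rust
/// Hypothetical subset of HeartbeatMessage fields consumed by PeerLoad.
struct HeartbeatFields {
    cpu_percent: u8,     // 0-100
    memory_pressure: u8, // 0-255
    accel_percent: u8,   // 0-100
    runnable_count: u32,
    remote_fault_rate: u32,
}

struct PeerLoadSample {
    cpu_percent: u32,
    memory_pressure: u32,
    accel_percent: u32,
    runnable_count: u32,
    remote_fault_rate: u32,
}

/// 0-100 fields widen directly; the 0-255 memory_pressure scales to 0-100
/// via (hb.memory_pressure as u32 * 100) / 255, as documented above.
fn peer_load_from_heartbeat(hb: &HeartbeatFields) -> PeerLoadSample {
    PeerLoadSample {
        cpu_percent: hb.cpu_percent as u32,
        memory_pressure: (hb.memory_pressure as u32 * 100) / 255,
        accel_percent: hb.accel_percent as u32,
        runnable_count: hb.runnable_count,
        remote_fault_rate: hb.remote_fault_rate,
    }
}
```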

pub struct DataAffinity {
    /// How many pages of this process's working set are on each peer.
    /// Used to decide where a process should run (warm-path: updated on
    /// page fault/migration, read at task placement — not per-syscall).
    /// XArray keyed by PeerId (u64) — O(1) lookup, bounded by MAX_PEERS (1024).
    pages_per_peer: XArray<u64>,

    /// Total working set size (pages).
    total_working_set: u64,

    /// Peer with most pages (preferred placement).
    preferred_peer: PeerId,

    /// Working set locality on preferred peer (0-1000, parts per thousand).
    locality_score_ppt: u32,
}
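A sketch of how `preferred_peer` and `locality_score_ppt` could be derived from the per-peer page counts, using a plain slice of `(peer_id, pages)` pairs in place of the kernel's XArray (the function is illustrative, not spec API):

```rust
/// Returns the peer holding the most working-set pages, plus its locality
/// score in parts per thousand. None when the working set is empty.
fn preferred_peer(pages_per_peer: &[(u64, u64)]) -> Option<(u64, u32)> {
    let total: u64 = pages_per_peer.iter().map(|&(_, n)| n).sum();
    if total == 0 {
        return None;
    }
    // Peer with the largest page count wins placement.
    let &(peer, pages) = pages_per_peer.iter().max_by_key(|&&(_, n)| n)?;
    // locality_score_ppt: fraction of the working set on that peer, 0-1000.
    Some((peer, (pages * 1000 / total) as u32))
}
```

With the default `migration_threshold_ppt` of 300, a process whose locality score on a remote peer exceeds its local score by more than 300 ppt becomes a migration candidate.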

5.6.4 Process Migration

Two distinct mechanisms:

  1. Lightweight thread migration (DSM-triggered): Moves only registers + kernel stack (~17 KB, ~10-20 μs). Organic extension of DSM — triggered by DsmFaultHint::MigrateThread when a thread's working set is strongly remote. Wire protocol: Section 5.6.4.1. Phase 3 — tightly coupled with DSM.

  2. Full process migration (below): Moves entire process state including file descriptors, TCP connections, GPU contexts, io_uring rings. Complex cross-subsystem dependencies on VFS (§13), TCP (§15), KABI (§11), and device-specific state. Phase 4+ — standalone feature, not required by any other distributed subsystem. The serialization of device-specific state (GPU contexts, io_uring rings) via opaque blobs is an acknowledged design limitation — these require per-subsystem migration handlers that are not yet specified for all device classes.

When the cluster scheduler decides a process should move to another node:

Process migration from Node A to Node B:

1. Cluster scheduler on Node A decides: process P should run on Node B.
   Reason: 70% of P's working set is on Node B (remote page faults dominate).

2. Pre-migration:
   a. Send process metadata to Node B: PID, capabilities, cgroup, open files,
      signal handlers, register state.
   b. Node B allocates process slot, creates local task struct.

3. Freeze and transfer:
   a. Freeze process P on Node A (stop scheduling, save register state).
   b. Transfer register state to Node B via RDMA (~64 bytes, ~1 μs).
   c. Transfer kernel stack to Node B via RDMA (~16KB, ~2 μs).
   d. Mark P's page table on Node A as "migrated to Node B."
      Pages are NOT bulk-transferred — they fault in on demand.

4. Resume on Node B:
   a. Node B installs process in local scheduler.
   b. Process resumes execution on Node B.
   c. First memory access → page fault → fetch from Node A via RDMA (~3-5 μs).
   d. Subsequent accesses: pages migrate on demand.
   e. Working set migrates over ~100ms as pages are faulted in.

Total migration downtime: ~10-50 μs for single-threaded metadata-only migration
on unloaded interconnect (metadata transfer + freeze/thaw). Multi-threaded
processes with large working sets incur additional transfer time; see full
migration latency model in Section 5.7.
(This is the **critical-section freeze time** — the interval during which the
process is stopped for final register/TLB state transfer. Total end-to-end
migration time, including pre-copy, file descriptor proxy setup, IPC handle
conversion, and cgroup recreation, is typically 1–10 ms depending on process
complexity.)
Working set follows lazily: pages migrate on demand, with the hot set arriving within ~100 ms and the long tail over seconds as pages are accessed.

This is the same strategy as live VM migration (pre-copy / post-copy),
but at process granularity. Much lighter weight than VM migration.

Full process migration requires transferring the following state:

  • Register state: CPU registers, FPU/SIMD state (saved during freeze).
  • Kernel stack: The process's kernel-mode stack (~16KB).
  • Page table metadata: Transferred lazily — pages fault in on demand from the source node via RDMA.
  • Open file descriptors: For local files on the source node, a proxy is created that forwards read/write/ioctl over RDMA IPC to the source. For shared-filesystem files (NFS, CIFS), the descriptor is re-opened locally on the destination.
  • IPC handles: SharedMemory transport handles are converted to Rdma transport handles (Section 5.5). The ring buffer contents are preserved.
  • Cgroup membership: Recreated on the destination node. The destination cgroup must have sufficient quota.
  • Signal state: Pending signals, signal mask, and signal handlers are transferred.
  • Timer state: Active timers (POSIX, itimer) are recreated on the destination with remaining duration adjusted for clock synchronization (Section 5.8).
  • Device handles: Local accelerator contexts are converted to remote service client handles (Section 5.7). The process continues to use the original device via the subsystem service provider (block, VFS, or accel).

Lightweight Thread Migration (DSM-Triggered)

Full process migration (above) is a heavyweight operation: 1-10 ms end-to-end, involving file descriptor proxies, IPC handle conversion, cgroup recreation, and timer adjustment. It is appropriate when a process's working set has durably shifted to another peer.

For transient data locality shifts — a thread that temporarily accesses a hot region owned by a remote peer — a lightweight thread migration path avoids the full process migration overhead. This is triggered by DsmFaultHint::MigrateThread (Section 6.12) when a subscriber detects strong remote locality for a single thread.

Lightweight migration transfers only:

  • CPU registers + FPU/SIMD state (~512 bytes with AVX-512, less on other arches).
  • Kernel stack (~16 KB).
  • A back-pointer to the process's home peer (for file descriptors, cgroup, etc.).

Total: ~17 KB over RDMA → ~10-20 μs freeze time (comparable to TidalScale's ~6 μs vCPU context switch, which transfers less state over a custom protocol).

The thread runs on the destination peer using the process's DSM address space — page table entries are shared via DSM, so no page table transfer is needed. File I/O, signals, and IPC are proxied back to the home peer (similar to MOSIX's deputy model, but only for the duration of the lightweight migration — not permanently).

/// Lightweight thread migration state. Transferred via the cluster transport
/// abstraction: `transport.write_to_peer()` maps to RDMA Write on RDMA
/// transports and `send_reliable()` with bulk data on TCP transports.
/// Process migration is a cold path (seconds-scale), so TCP latency is
/// acceptable.
///
/// # Allocation
///
/// This struct is ~17 KiB (KERNEL_STACK_SIZE + ~512 bytes register state)
/// and **MUST NOT be stack-allocated** (would overflow a 16 KiB kernel stack).
/// - **Source**: allocated from a per-CPU migration slab (pre-allocated at
///   cluster join, one slot per CPU — only one migration can be in-flight per CPU).
/// - **Destination**: populated directly from the transport receive buffer. On RDMA,
///   this is a pre-registered MR; on TCP, it is a kernel-allocated page buffer.
///   The receive buffer is pre-allocated at cluster join time and reused across
///   migrations.
pub struct ThreadMigrationState {
    /// Thread ID (globally unique within the cluster).
    pub thread_id: u64,
    /// Process ID on the home peer. The process itself does NOT migrate.
    pub home_process_id: u64,
    /// Home peer where the process's file descriptors, cgroup, etc. reside.
    pub home_peer: PeerId,
    /// Saved CPU register state (arch-specific, opaque to the migration layer).
    pub register_state: ArchRegisterState,
    /// Kernel stack contents.
    pub kernel_stack: [u8; KERNEL_STACK_SIZE],
    /// DSM region ID that triggered the migration (for affinity tracking).
    pub trigger_region: u64,
}

Return migration: When the thread's DSM access pattern shifts back to the home peer (detected by the subscriber returning DsmFaultHint::MigrateThread pointing home), or when the thread performs an operation that requires the home peer (direct device I/O, non-proxyable syscall), the thread migrates back. Return migration uses the same lightweight path.

Interaction with full migration: If a thread remains on a remote peer for longer than THREAD_MIGRATION_PROMOTE_MS (default: 5000 ms), the cluster scheduler evaluates whether the full process should migrate. If >50% of the process's threads are on the same remote peer, full migration is triggered. This prevents the "permanent deputy" problem that limits MOSIX's I/O performance.
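The promotion rule above can be stated in a few lines. The constant name and thresholds (5000 ms window, >50% of threads) are from the text; the function name is illustrative:

```rust
/// Default promotion window from the text above.
const THREAD_MIGRATION_PROMOTE_MS: u64 = 5000;

/// Promote to full process migration when a thread has stayed remote
/// longer than the promote window AND more than half of the process's
/// threads now sit on the same remote peer.
fn should_promote(remote_ms: u64, threads_on_peer: u32, total_threads: u32) -> bool {
    // threads_on_peer * 2 > total_threads is the integer form of ">50%".
    remote_ms > THREAD_MIGRATION_PROMOTE_MS && threads_on_peer * 2 > total_threads
}
```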

5.6.4.1 Thread Migration Wire Protocol

Lightweight thread migration uses PeerMessageType codes in the 0x0090-0x009F range. The protocol is a three-phase handshake (request → accept/reject → commit/abort) with bulk data transferred via transport.write_to_peer() between the request acceptance and commit (maps to RDMA Write on RDMA transports, reliable send on TCP).

// PeerMessageType codes for thread migration:
ThreadMigrateRequest  = 0x0090,
ThreadMigrateAccept   = 0x0091,
ThreadMigrateReject   = 0x0092,
// 0x0093 reserved (data transferred via transport.write_to_peer(), not a message type)
ThreadMigrateCommit   = 0x0094,
ThreadMigrateAbort    = 0x0095,

Wire payload structs:

/// Migration request from source to destination.
/// Total: 64 bytes (cache-line aligned for RDMA inline sends).
/// Layout: thread_id(8) + home_process_id(8) + home_peer(8) +
/// trigger_region(8) + register_state_size(4) + kernel_stack_size(4) +
/// _reserved(24) = 64.
#[repr(C)]
pub struct ThreadMigrateRequestPayload {
    pub thread_id: Le64,                        // 8 bytes
    pub home_process_id: Le64,                  // 8 bytes
    pub home_peer: Le64,                        // 8 bytes (PeerId)
    /// DSM region that triggered the migration hint (for destination to
    /// verify it actually owns pages in that region).
    pub trigger_region: Le64,                   // 8 bytes
    /// Size of the arch-specific register state blob (bytes). Varies by
    /// architecture: x86-64 with AVX-512 ≈ 512 bytes, AArch64 ≈ 256 bytes.
    pub register_state_size: Le32,              // 4 bytes
    /// Kernel stack size (bytes). Typically KERNEL_STACK_SIZE (16384).
    pub kernel_stack_size: Le32,                // 4 bytes
    pub _reserved: [u8; 24],
}
const_assert!(core::mem::size_of::<ThreadMigrateRequestPayload>() == 64);

/// Destination accepts the migration and provides a receive buffer for
/// the register state + kernel stack transfer.
/// Total: 32 bytes.
#[repr(C)]
pub struct ThreadMigrateAcceptPayload {
    pub thread_id: Le64,                        // 8 bytes
    /// Address of the pre-registered receive buffer on the destination.
    /// On RDMA: source uses one-sided RDMA Write to this address.
    /// On TCP: this address is informational (destination allocates the
    /// buffer but data arrives via send_reliable(), not one-sided write).
    /// Register state starts at offset 0; kernel stack at offset
    /// register_state_size (rounded up to 64-byte alignment).
    pub receive_buffer_addr: Le64,              // 8 bytes
    /// RDMA rkey for the receive buffer (allows source to Write).
    /// Zero when transport is TCP (not used — data arrives via reliable send).
    pub receive_buffer_rkey: Le32,              // 4 bytes
    pub _pad: [u8; 4],
    pub _reserved: [u8; 8],
}
const_assert!(core::mem::size_of::<ThreadMigrateAcceptPayload>() == 32);

/// Destination rejects the migration.
/// Total: 16 bytes.
#[repr(C)]
pub struct ThreadMigrateRejectPayload {
    pub thread_id: Le64,                        // 8 bytes
    /// 0 = no capacity (CPU overloaded), 1 = no DSM region access,
    /// 2 = cgroup quota exceeded, 3 = shutting down (PeerStatus::Leaving).
    pub reason: Le32,                           // 4 bytes
    pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<ThreadMigrateRejectPayload>() == 16);

/// Source confirms the migration after thread state transfer is complete.
/// On receipt, destination installs the thread in its local scheduler.
/// Total: 16 bytes.
#[repr(C)]
pub struct ThreadMigrateCommitPayload {
    pub thread_id: Le64,                        // 8 bytes
    /// CRC32C of the transferred data (register state + kernel stack).
    /// Destination verifies before installing the thread.
    pub data_crc: Le32,                         // 4 bytes
    pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<ThreadMigrateCommitPayload>() == 16);

/// Source aborts the migration (transport write failed or source-side error).
/// Destination frees the receive buffer and forgets the request.
/// Total: 16 bytes.
#[repr(C)]
pub struct ThreadMigrateAbortPayload {
    pub thread_id: Le64,                        // 8 bytes
    pub reason: Le32,                           // 0 = transport failure, 1 = source error
    pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<ThreadMigrateAbortPayload>() == 16);

Transport operation sequence:

Source                                     Destination
  │                                          │
  │  1. Send: ThreadMigrateRequest ────────→ │
  │     (64B, transport.send_reliable())     │
  │                                          │  2. Check: CPU capacity, cgroup quota,
  │                                          │     DSM region access, PeerStatus
  │                                          │
  │  ←──── Send: ThreadMigrateAccept ────────│  (if ok: allocate recv buffer)
  │  ←──── Send: ThreadMigrateReject ────────│  (if not ok: reject with reason)
  │                                          │
  │  3. Freeze thread (stop scheduling,      │
  │     save registers to ThreadMigration-   │
  │     State, capture kernel stack).        │
  │                                          │
  │  4. Write: register state ─────────────→ │  (transport.write_to_peer())
  │     (~512 bytes)                         │
  │  5. Write: kernel stack ───────────────→ │  (transport.write_to_peer())
  │     (~16 KB)                             │
  │                                          │
  │  6. Send: ThreadMigrateCommit ─────────→ │  (ordering: writes visible first)
  │                                          │
  │                                          │  7. Verify CRC32C of received data.
  │                                          │     Install thread in local scheduler.
  │                                          │     Resume execution.
  │                                          │
  │  8. Source cleans up: remove the thread  │
  │     from its scheduler, free its stack.  │

Transport ordering guarantees: On RDMA, steps 4-6 are posted to the same RC QP; RC in-order delivery guarantees that the Write data is visible at the destination before the Commit message is processed. On TCP, send_reliable() provides the same ordering guarantee (TCP byte stream is in-order). The Commit message is sent after write completion on all transports.

Abort and rollback:

  • Reject at step 2: source unfreezes the thread and resumes local execution. The DSM subsystem falls back to DsmFaultHint::Default (fetch the page instead of migrating the thread).
  • Write failure (step 4/5): source sends ThreadMigrateAbort via the transport's fallback path (TCP if RDMA QP is in error state, or over the same TCP connection if already on TCP). Source unfreezes the thread locally.
  • Destination crash after Accept but before Commit: source detects via heartbeat timeout (Section 5.8), unfreezes the thread locally. The destination's allocated receive buffer is freed by crash recovery.
  • Source crash after Commit: destination has a fully installed thread with a home_peer that is now dead. The thread sees -EHOSTDOWN on the next file I/O proxy call and is killed with SIGKILL (same as any process whose home peer dies).
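The abort/rollback rules above imply a small source-side state machine. The state and event names here are illustrative, not part of the wire protocol — the sketch only shows that every failure path before Commit resolves to a local unfreeze:

```rust
/// Source-side migration states (illustrative names).
#[derive(PartialEq, Debug)]
enum MigState { Requested, Transferring, Committed, RolledBack }

/// Events observed by the source (illustrative names).
enum MigEvent { Accept, Reject, WriteOk, WriteFail, PeerDead }

fn step(state: MigState, ev: MigEvent) -> MigState {
    match (state, ev) {
        (MigState::Requested, MigEvent::Accept) => MigState::Transferring,
        // Reject: unfreeze locally, DSM falls back to fetching the page.
        (MigState::Requested, MigEvent::Reject) => MigState::RolledBack,
        (MigState::Transferring, MigEvent::WriteOk) => MigState::Committed,
        // Write failure: send Abort on the fallback path, unfreeze locally.
        (MigState::Transferring, MigEvent::WriteFail) => MigState::RolledBack,
        // Heartbeat timeout before Commit: unfreeze locally.
        (MigState::Transferring, MigEvent::PeerDead) => MigState::RolledBack,
        // Terminal states absorb all further events.
        (s, _) => s,
    }
}
```

After Commit the source no longer rolls back; a subsequent source crash is the destination's problem (the -EHOSTDOWN path above).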

Wire budget:

| Component              | Size            |
|------------------------|-----------------|
| Request (Send inline)  | 40 + 64 = 104 B |
| Accept (Send inline)   | 40 + 32 = 72 B  |
| Register state (Write) | ~512 B          |
| Kernel stack (Write)   | ~16,384 B       |
| Commit (Send inline)   | 40 + 16 = 56 B  |
| Total                  | ~17.1 KB        |

At 100 Gbps RDMA bandwidth: ~1.4 μs for the Write data. Two Send round-trips add ~4 μs (2 × ~2 μs RTT). Total wire time: ~6-8 μs. Remaining time in the 10-20 μs freeze budget is CPU overhead for register save/restore and TLB setup on the destination.
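The arithmetic behind this estimate, as a sketch. It assumes the small Send messages are dominated by RTT and the bulk data moves at line rate; the function name and parameterization are illustrative:

```rust
/// Wire-time estimate: bulk bytes at a given link rate plus a fixed
/// number of Send round-trips.
fn wire_time_ns(bulk_bytes: u64, gbps: u64, round_trips: u64, rtt_ns: u64) -> u64 {
    // gbps gigabits/s -> bytes take (bytes * 8 / gbps) nanoseconds.
    let transfer_ns = bulk_bytes * 8 / gbps;
    transfer_ns + round_trips * rtt_ns
}
```

`wire_time_ns(17_100, 100, 0, 0)` gives ~1.4 μs for the Write data alone, and adding two ~2 μs round-trips lands in the ~5-6 μs range quoted above.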

Socket Migration

TCP socket migration transfers established connections transparently across nodes. The guiding constraint is that the remote peer must not observe a connection reset — migration must be invisible to the peer.

Pre-migration phase (on source node):

  1. The source kernel captures a consistent snapshot of the TCP socket state. The snapshot must be taken atomically with respect to the send/receive paths: a per-socket migration_freeze: AtomicBool flag is set, which prevents new data from being sent (tcp_sendmsg() returns -EAGAIN) and prevents tcp_recvmsg() from consuming buffered data (returns -EAGAIN). Incoming ACKs continue to update snd_una normally (the retransmit queue drains naturally), but no new data segments are transmitted.

The socket's TCP state is unchanged during migration freeze -- it remains in its current state (Established, CloseWait, or FinWait1). The freeze flag is kernel-internal: not visible to getsockopt(SO_PROTOCOL), inet_diag, or /proc/net/tcp. This avoids modifying the TCP state machine (which would require handling the new state in every match on TcpState across the networking stack, timer paths, congestion control, and diagnostic interfaces).

/// Maximum bytes transferred per socket during migration. Larger buffers
/// are truncated: TCP will retransmit lost data from the retransmit queue,
/// and the sender will retransmit data beyond recv_buffer capacity.
/// 256 KiB balances migration speed vs. data preservation.
pub const MAX_MIGRATION_BUFFER_SIZE: usize = 256 * 1024;

/// Snapshot of a TCP socket's full state for cross-node migration.
/// Captured atomically: source socket has `migration_freeze` set before
/// snapshot and is torn down only after destination confirms active.
///
/// Both `retransmit_queue` and `recv_buffer` are truncated to
/// `MAX_MIGRATION_BUFFER_SIZE` bytes by the snapshot function.
/// Truncation is safe: the retransmit queue's lost tail will be
/// retransmitted by TCP's normal retransmission logic, and lost
/// receive-buffer data will be retransmitted by the remote sender
/// when the destination advertises its receive window.
pub struct TcpSocketMigrationState {
    /// Local endpoint address and port.
    pub local_addr: SocketAddr,
    /// Remote peer address and port (unchanged across migration).
    pub remote_addr: SocketAddr,
    /// Next sequence number the source would have sent.
    /// Must match what the destination sends on the first post-migration segment.
    pub snd_nxt: u32,
    /// Oldest unacknowledged sequence number (start of retransmit queue).
    pub snd_una: u32,
    /// Next sequence number the source expects to receive from the peer.
    pub rcv_nxt: u32,
    /// Current TCP state machine state at time of snapshot.
    /// Must be ESTABLISHED, CLOSE_WAIT, or FIN_WAIT_1; other states are
    /// not migratable and return -EOPNOTSUPP.
    pub tcp_state: TcpState,
    /// Contents of the retransmit queue (bytes sent but not ACKed).
    /// Truncated to MAX_MIGRATION_BUFFER_SIZE; TCP retransmits lost tail.
    pub retransmit_queue: Vec<u8>,  // len() ≤ MAX_MIGRATION_BUFFER_SIZE
    /// Contents of the receive buffer (bytes received but not consumed).
    /// Truncated to MAX_MIGRATION_BUFFER_SIZE; sender retransmits lost tail.
    pub recv_buffer: Vec<u8>,  // len() ≤ MAX_MIGRATION_BUFFER_SIZE
    /// TCP options negotiated for this connection.
    /// Includes SACK state, timestamp offsets, window scaling factor, etc.
    pub options: TcpOptionsState,
    /// Congestion control algorithm name and its opaque private state blob.
    /// The destination must support the same congestion control algorithm;
    /// if not, falls back to CUBIC with fresh cwnd.
    pub cong_state: CongestionState,
    /// Send window size (from most recent ACK from peer).
    pub snd_wnd: u32,
    /// Receive window size (advertised to peer).
    pub rcv_wnd: u32,
    /// Maximum segment size negotiated with peer.
    pub mss: u16,
    /// Smoothed round-trip time estimate (microseconds).
    pub srtt_us: u32,
    /// RTT variance estimate (microseconds).
    pub rttvar_us: u32,
}

Pre-migration phase (on source node):
  1. Source quiesces the socket: stops sending, holds incoming ACKs in the staging buffer for a maximum of 100 ms. If quiescence takes longer (e.g., large retransmit queue draining), migration is aborted with -ETIMEDOUT and the process is left on the source node.
  2. Source captures the snapshot (sequence numbers, buffers, options, congestion state).
  3. Source transmits snapshot to destination node via the RDMA migration channel.

Post-migration phase (on destination node):

  1. Destination reconstructs the TCP socket from TcpSocketMigrationState. The socket is created with SO_REUSEPORT + SO_REUSEADDR and migration_freeze set, so the destination kernel accepts the sequence-number context without performing a three-way handshake. The migration_freeze flag is cleared once the socket is fully reconstructed and the keepalive probe confirms connectivity.
  2. The destination must have the migrated process's source IP reachable. Two models are supported:
     • IP mobility (preferred): A CARP/VRRP failover address or network-overlay virtual IP moves with the process. The remote peer's connection continues without interruption — packets are routed to the destination by the overlay.
     • No IP mobility: The migration is transparent only if the source IP is present on the destination interface (e.g., both nodes share a subnet with the same address via aliasing). Without this, sockets are closed (ECONNRESET is delivered to the remote peer) and reconnection is the application's responsibility. UmkaOS does not silently lie to the application about connectivity; migration fails with -EADDRNOTAVAIL if the source address cannot be installed on the destination.
  3. Destination sends a TCP keepalive to re-establish the connection from the remote peer's perspective. The keepalive carries sequence number snd_nxt - 1 (the last acknowledged byte), causing the peer to confirm with an ACK that advances snd_una on the destination.
  4. Source tears down the socket and releases the address after receiving a positive acknowledgement from the destination that the destination socket is in ESTABLISHED state and the keepalive ACK has been received.

UDP sockets: the state transfer is simpler — socket options, bound address, connected address (if connect() was called), and socket buffer contents are transferred. In-flight datagrams may be lost; UDP is unreliable, and applications using it must tolerate loss independently of migration.

Non-migratable states: Sockets in SYN_SENT, SYN_RECV, TIME_WAIT, or LISTEN states are not migrated. LISTEN sockets are re-bound on the destination (new connections land on the destination after migration; connections established pre-migration remain on the source until they close naturally).
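As a concrete illustration of the consistency checks a destination could run before reconstructing a socket, here is a minimal sketch — not the kernel's actual code; the `TcpSnapshot` type is an illustrative stand-in, and the 512 KiB per-socket cap comes from the worst-case analysis later in this section. Sequence numbers use wrapping 32-bit arithmetic:

```rust
/// Per-socket buffer cap (512 KiB, per the worst-case analysis below).
const MAX_MIGRATION_BUFFER_SIZE: usize = 512 * 1024;

/// Abbreviated, hypothetical view of the snapshot fields relevant here.
pub struct TcpSnapshot {
    pub snd_una: u32,          // oldest unacknowledged sequence number
    pub snd_nxt: u32,          // next sequence number to send
    pub retransmit_len: usize, // bytes captured from the retransmit queue
}

pub fn validate_snapshot(s: &TcpSnapshot) -> Result<(), &'static str> {
    // The unacked span must fit in half the sequence space; otherwise
    // snd_una is "ahead" of snd_nxt, which no valid connection produces.
    let unacked = s.snd_nxt.wrapping_sub(s.snd_una);
    if unacked > u32::MAX / 2 {
        return Err("snd_una ahead of snd_nxt");
    }
    // The retransmit queue holds at most the unacked bytes; it may be
    // truncated to MAX_MIGRATION_BUFFER_SIZE (TCP retransmits the lost tail).
    if s.retransmit_len > unacked as usize || s.retransmit_len > MAX_MIGRATION_BUFFER_SIZE {
        return Err("retransmit queue inconsistent with sequence window");
    }
    Ok(())
}
```

Note that the wrapping subtraction makes the check correct even when the sequence number space wraps around between `snd_una` and `snd_nxt`.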

Socket migration state in the process migration record:

/// Maximum number of TCP sockets migrated per process. Sockets beyond this
/// limit are closed on the source (process receives ECONNRESET on those fds).
pub const MAX_MIGRATABLE_TCP_SOCKETS: usize = 256;
/// Maximum number of UDP sockets migrated per process.
pub const MAX_MIGRATABLE_UDP_SOCKETS: usize = 256;
/// Per-migration total memory budget for socket buffer snapshots.
/// Caps the aggregate allocation burst during the freeze window regardless
/// of socket count. Sockets are snapshotted in priority order (oldest
/// established first). When cumulative buffer size reaches this limit,
/// remaining sockets are closed (ECONNRESET) rather than migrated.
/// FMA metric: `cluster.migration_sockets_dropped` counts sockets dropped
/// due to budget exhaustion.
pub const MAX_MIGRATION_BUFFER_TOTAL: usize = 16 * 1024 * 1024; // 16 MiB

/// All TCP/UDP sockets belonging to the migrating process.
///
/// Hard upper bounds are enforced to prevent unbounded migration state:
/// a process with 10K TCP connections at 512 KiB each would produce ~5 GB of
/// migration state, far exceeding any reasonable freeze window. Sockets beyond
/// the limit are closed on the source — processes with many connections should
/// pre-close non-essential connections before migration or use migration-aware
/// connection pooling.
///
/// **Per-migration budget**: In addition to per-socket count limits, the total
/// buffer allocation is capped at `MAX_MIGRATION_BUFFER_TOTAL` (16 MiB). This
/// prevents the worst-case allocation burst (256 sockets × 512 KiB = 128 MiB)
/// from exhausting memory during the time-sensitive freeze window. Sockets are
/// snapshotted in priority order (oldest established connection first). When
/// the cumulative buffer size reaches 16 MiB, remaining sockets are closed on
/// the source with ECONNRESET.
pub struct ProcessSocketMigrationState {
    /// TCP sockets (ESTABLISHED, CLOSE_WAIT, FIN_WAIT_1 only).
    /// Maximum: MAX_MIGRATABLE_TCP_SOCKETS (256).
    /// Beyond this limit: closed on source (ECONNRESET).
    pub tcp_sockets: Vec<TcpSocketMigrationState>,
    /// UDP sockets: bound/connected address, socket options, buffer contents.
    /// Maximum: MAX_MIGRATABLE_UDP_SOCKETS (256).
    pub udp_sockets: Vec<UdpSocketMigrationState>,
    /// Unix domain sockets connected to peers on the same node: proxied via
    /// RemoteSocketProxy after migration (identical mechanism to file proxy).
    pub unix_sockets: Vec<UnixSocketProxyState>,
    /// Sockets that could not be migrated (LISTEN, TIME_WAIT, etc.).
    /// These file descriptors are replaced with a closed fd on the destination;
    /// the process receives SIGPIPE or EBADF on next access.
    pub non_migratable_fds: Vec<i32>,
}
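The 16 MiB aggregate budget can be sketched as follows; `SocketBuffers` and the return shape are illustrative stand-ins for the kernel's socket bookkeeping, but the ordering (oldest established first) and the close-on-exhaustion rule follow the comments above:

```rust
pub const MAX_MIGRATION_BUFFER_TOTAL: usize = 16 * 1024 * 1024; // 16 MiB

/// Hypothetical per-socket view used by the budget pass.
pub struct SocketBuffers {
    pub established_at_ns: u64,
    pub buffered_bytes: usize, // retransmit queue + receive buffer
}

/// Returns (indices to migrate, indices to close), in priority order.
pub fn apply_budget(sockets: &[SocketBuffers]) -> (Vec<usize>, Vec<usize>) {
    // Priority order: oldest established connection first.
    let mut order: Vec<usize> = (0..sockets.len()).collect();
    order.sort_by_key(|&i| sockets[i].established_at_ns);

    let (mut migrate, mut close) = (Vec::new(), Vec::new());
    let mut used = 0usize;
    let mut exhausted = false;
    for i in order {
        if !exhausted && used + sockets[i].buffered_bytes <= MAX_MIGRATION_BUFFER_TOTAL {
            used += sockets[i].buffered_bytes;
            migrate.push(i);
        } else {
            // Budget reached: this and all remaining sockets are closed
            // (ECONNRESET) and counted by cluster.migration_sockets_dropped.
            exhausted = true;
            close.push(i);
        }
    }
    (migrate, close)
}
```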

GPU Context Migration

GPU context migration requires hardware vendor support. Not all GPUs implement the necessary preemption and memory export interfaces. Migration behavior is determined by querying the AccelContext capabilities at migration time.

GPU migration support matrix:

| GPU class | Mechanism | Granularity |
|---|---|---|
| NVIDIA H100+ (Hopper MIG) | NVLink + MIG partition export | Per-MIG slice |
| AMD MI300X (Infinity Fabric) | XGMI memory export + context checkpoint | Full context |
| NVIDIA A100 (non-MIG) | Software checkpoint via CUDA CTK (CRIU-GPU) | Full context, ~500 ms |
| Consumer GPUs (RTX, RX) | Not supported; fallback to CPU-side buffer copy | Buffer contents only |

For GPUs that support context migration, the migration sequence is:

/// Hard maximums for GPU migration state fields. If any field exceeds its
/// maximum, migration returns `-E2BIG` and the process remains on the source node.
pub const MAX_GPU_REGISTER_FILE: usize = 4 * 1024 * 1024;   // 4 MiB
pub const MAX_GPU_ALLOCATIONS: usize = 16_384;
pub const MAX_GPU_COMMANDS: usize = 4_096;
pub const MAX_GPU_FENCES: usize = 1_024;
pub const MAX_GPU_CONSTANTS: usize = 64 * 1024;              // 64 KiB

/// Full snapshot of an accelerator context for cross-node migration.
/// Populated by the AccelBase driver via the checkpoint vTable call.
///
/// All Vec fields have hard upper bounds (see constants above). If the driver
/// reports state exceeding any bound, migration is rejected with `-E2BIG` and
/// the process remains on the source node. This prevents unbounded allocation
/// during the time-sensitive migration freeze window.
pub struct GpuContextMigrationState {
    /// Opaque GPU register file snapshot, if the hardware supports it.
    /// None for GPUs that do not expose register-file checkpoint (most consumer GPUs).
    /// Hard maximum: MAX_GPU_REGISTER_FILE (4 MiB).
    pub register_file: Option<Vec<u8>>,
    /// GPU memory allocations to transfer: virtual GPU address → allocation metadata.
    /// The physical memory pages are transferred via HMM migration (see below).
    /// Hard maximum: MAX_GPU_ALLOCATIONS (16,384 entries).
    pub allocations: Vec<GpuAllocationRecord>,
    /// Pending command buffers: commands submitted to the GPU but not yet dispatched.
    /// These are re-submitted to the destination GPU after context restoration.
    /// Hard maximum: MAX_GPU_COMMANDS (4,096 entries).
    pub pending_commands: Vec<GpuCommandBuffer>,
    /// Fence states at the time of snapshot.
    /// Fences that were signaled before snapshot are recorded as SignaledFence so
    /// the destination can satisfy `fence_wait()` calls without re-executing work.
    /// Hard maximum: MAX_GPU_FENCES (1,024 entries).
    pub fence_states: Vec<(FenceId, FenceState)>,
    /// GPU push constants and descriptor set state (Vulkan/CUDA-mapped contexts).
    /// Hard maximum: MAX_GPU_CONSTANTS (64 KiB).
    pub constants: Vec<u8>,
    /// Name of the AccelBase driver that owns this context, for validation on
    /// destination (migration between incompatible GPU vendors is rejected).
    pub driver_name: [u8; 64],
    /// Driver-version tuple: migration is rejected if destination driver version
    /// differs by more than one minor version.
    pub driver_version: (u32, u32, u32),
}
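The -E2BIG admission check described in the comments might look like this sketch; `GpuSnapshotSizes` and the errno constant are assumptions for illustration, not kernel types — the point is that every variable-length field is validated against its hard maximum before any transfer begins:

```rust
pub const MAX_GPU_REGISTER_FILE: usize = 4 * 1024 * 1024; // 4 MiB
pub const MAX_GPU_ALLOCATIONS: usize = 16_384;
pub const MAX_GPU_COMMANDS: usize = 4_096;
pub const MAX_GPU_FENCES: usize = 1_024;
pub const MAX_GPU_CONSTANTS: usize = 64 * 1024; // 64 KiB

pub const E2BIG: i32 = 7; // errno value, assumed for illustration

/// Sizes reported by the driver's checkpoint vTable call (hypothetical view).
pub struct GpuSnapshotSizes {
    pub register_file: Option<usize>,
    pub allocations: usize,
    pub pending_commands: usize,
    pub fence_states: usize,
    pub constants: usize,
}

pub fn validate_gpu_snapshot(s: &GpuSnapshotSizes) -> Result<(), i32> {
    if s.register_file.map_or(false, |n| n > MAX_GPU_REGISTER_FILE)
        || s.allocations > MAX_GPU_ALLOCATIONS
        || s.pending_commands > MAX_GPU_COMMANDS
        || s.fence_states > MAX_GPU_FENCES
        || s.constants > MAX_GPU_CONSTANTS
    {
        // Migration rejected before any state is transferred; the process
        // remains on the source node.
        return Err(-E2BIG);
    }
    Ok(())
}
```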

GPU migration sequence (hardware-supported path):

  1. Quiesce the AccelContext: issue a context-level drain barrier. For NVIDIA Hopper, this uses the cuCtxSynchronize() equivalent kernel-internal call. For AMD MI300X, the XGMI memory export fence is used. Quiescence preempts the running command buffer at a safe preemption point:
     • InstructionLevel preemption: saves at the current shader instruction boundary.
     • DrawBoundary preemption (fallback): saves at the current draw-call boundary. The preemption timeout is 200 ms; if the GPU cannot preempt within 200 ms, migration is deferred and retried after 1 s.
  2. Freeze command submissions: set the AccelContext to ACCEL_FROZEN state. New accel_cmd_submit() calls block (do not return an error) so the process does not observe the migration boundary.
  3. Snapshot GPU memory via HMM: the Heterogeneous Memory Management layer migrates GPU pages to CPU-accessible memory (migrate_vma_setup() / migrate_vma_pages() equivalent). This copies GPU VRAM contents to system RAM for transfer via the cluster transport.
  4. Transfer the GPU memory snapshot to the destination via transport.write_to_peer() to a pre-registered migration receive region on the destination node. On RDMA, this uses one-sided Write for maximum bandwidth; on TCP, bulk send_reliable().
  5. On destination: the AccelBase driver allocates a new GPU context, restores GPU memory from the snapshot (HMM reverse migration: system RAM → GPU VRAM), and replays the register file (if available) and push constants.
  6. Resume: the AccelContext transitions from ACCEL_FROZEN to ACCEL_RUNNING on the destination. Blocked accel_cmd_submit() calls on the process now complete against the destination context.

GPU migration fallback (no hardware support):

When the GPU does not support register-file checkpoint (most consumer GPUs and some data-center GPUs without MIG):

  1. Wait for all pending GPU commands to complete (drain the command queue). This may take up to 5 s for long-running compute kernels; if the drain does not complete in 5 s, migration fails with -EOPNOTSUPP and the process stays on the source node.
  2. Copy GPU buffer contents to CPU memory (no register file, no in-flight state).
  3. Transfer buffer contents to destination.
  4. On destination: allocate new GPU buffers, restore contents.
  5. The application's next GPU submission starts from a clean context with restored buffer contents. Any partially-completed GPU computation is lost and must be re-executed by the application.

Processes requiring GPU contexts that cannot be migrated should set CLUSTER_PIN_NODE to prevent migration attempts.

io_uring Migration

io_uring instances span kernel and userspace memory. The submission queue (SQ), completion queue (CQ), and the SQE/CQE arrays are memory-mapped into the process's virtual address space. Migration must quiesce the rings, transfer their state, and remap them at identical virtual addresses on the destination so the process's mmap-based pointers remain valid.

Challenge: In-flight SQEs (submitted by the application but not yet dispatched by the kernel) may reference registered buffers, registered files, and eventfd notifications — all of which are also being migrated simultaneously. The migration must preserve the semantic ordering: an SQE that was submitted before migration must either complete on the source or be re-submitted on the destination, never silently dropped.

/// Complete snapshot of one io_uring instance for cross-node migration.
pub struct IoUringMigrationState {
    /// The parameters used to create this ring (entries, flags, sq_thread_cpu, etc.).
    /// The destination re-creates the ring with identical parameters.
    pub ring_params: IoUringParams,
    /// Raw bytes of the SQ ring (includes head/tail/flags/dropped counters).
    pub sq_ring_snapshot: Vec<u8>,
    /// Raw bytes of the CQ ring (includes head/tail/flags/overflow counter).
    pub cq_ring_snapshot: Vec<u8>,
    /// Number of SQ entries (must equal ring_params.sq_entries).
    pub sq_entries: u32,
    /// Number of CQ entries (must equal ring_params.cq_entries).
    pub cq_entries: u32,
    /// SQEs that were in-flight (submitted to kernel, not yet dispatched)
    /// when the ring was quiesced. These are either re-submitted on the
    /// destination (if idempotent) or failed with -EINTR (if not).
    pub inflight_sqes: Vec<IoUringSqe>,
    /// user_data values of SQEs that were cancelled during migration.
    /// The application receives a CQE with res=-EINTR for each of these.
    pub cancelled_sqe_user_data: Vec<u64>,
    /// Registered buffer table: each entry is (userspace_iov, buffer_len).
    /// Buffers are re-registered on the destination after migration.
    pub registered_buffers: Vec<RegisteredBufferRecord>,
    /// Registered file table: each entry is the migrated fd number on
    /// the destination (after file proxy or re-open).
    pub registered_files: Vec<i32>,
    /// Personality credentials registered with IORING_REGISTER_PERSONALITY.
    pub personalities: Vec<IoUringPersonalityState>,
}

io_uring migration sequence:

  1. Drain in-flight SQEs: the kernel stops accepting new SQE submissions (IORING_SQ_FROZEN internal flag) and waits up to 500 ms for all in-kernel-dispatched operations to complete. Operations that complete normally produce CQEs that are delivered to the process before migration (the CQ ring snapshot will contain them).
  2. Identify and cancel non-drainable SQEs: any SQEs that have not completed within the drain timeout are cancelled with -EINTR. Their user_data values are recorded in cancelled_sqe_user_data. The application will see a CQE with res = -EINTR for each on the destination.
  3. Snapshot the rings: copy the SQ ring, CQ ring, and SQE/CQE arrays from the process's kernel-managed memory. The snapshot is taken while the ring is frozen (no concurrent head/tail pointer modification is possible).
  4. Unregister buffers and files: all registered buffers and file descriptors are unregistered on the source. Their state is captured in registered_buffers and registered_files for re-registration on the destination.
  5. Transfer the IoUringMigrationState to the destination as part of the process memory image (via RDMA migration channel).
  6. On destination: re-create the io_uring instance with io_uring_setup() using the same ring_params. The kernel re-maps the SQ and CQ rings into the process's virtual address space at the same virtual addresses as on the source. This is guaranteed by restoring the full process virtual address space (VMA layout) before re-creating the ring, so mmap(MAP_FIXED) at the original addresses succeeds.
  7. Re-register buffers: registered buffers are re-registered with io_uring_register(IORING_REGISTER_BUFFERS) using the migrated buffer regions. Buffers backed by anonymous memory have already been transferred in the page table migration step. Buffers backed by files on the source node are proxied.
  8. Re-register files: registered file descriptors are re-registered with the destination fd numbers from the file descriptor migration step.
  9. Re-submit idempotent SQEs: SQEs from inflight_sqes that are marked idempotent (reads, IORING_OP_NOP, IORING_OP_POLL_ADD) are re-submitted to the destination ring. Non-idempotent SQEs (writes, IORING_OP_SEND, IORING_OP_CONNECT) are not re-submitted; the application receives -EINTR and must retry.
  10. Resume: the ring transitions out of IORING_SQ_FROZEN. The process can submit new SQEs immediately. The effective migration latency visible to the application is the drain time (≤500 ms) plus the freeze-transfer-restore critical section (~5–50 ms depending on ring size and registered buffer count).

io_uring idempotency classification (determines re-submission vs. cancellation):

| Opcode | Idempotent | Re-submitted after migration |
|---|---|---|
| IORING_OP_READ / IORING_OP_READV | Yes (read does not mutate) | Yes |
| IORING_OP_WRITE / IORING_OP_WRITEV | No | No — EINTR returned |
| IORING_OP_SEND / IORING_OP_SENDMSG | No | No — EINTR returned |
| IORING_OP_RECV / IORING_OP_RECVMSG | Yes (recv is read-like) | Yes |
| IORING_OP_POLL_ADD | Yes | Yes |
| IORING_OP_TIMEOUT | Yes (re-armed with adjusted expiry) | Yes |
| IORING_OP_CONNECT | No | No — EINTR returned |
| IORING_OP_ACCEPT | Yes (re-armed on destination listen socket) | Yes |
| IORING_OP_FSYNC / IORING_OP_FDATASYNC | Yes | Yes |
| IORING_OP_NOP | Yes | Yes |
| IORING_OP_SPLICE / IORING_OP_TEE | No | No — EINTR returned |
| IORING_OP_PROVIDE_BUFFERS | Yes | Yes |
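The classification above can be expressed as a total function — a sketch with an assumed `Opcode` enum rather than the real SQE opcode encoding:

```rust
/// Illustrative opcode enum mirroring the classification table.
#[derive(Clone, Copy, PartialEq)]
pub enum Opcode {
    Read, Readv, Write, Writev, Send, SendMsg, Recv, RecvMsg,
    PollAdd, Timeout, Connect, Accept, Fsync, Fdatasync, Nop,
    Splice, Tee, ProvideBuffers,
}

/// true  → the SQE is re-submitted on the destination ring;
/// false → the SQE is cancelled and a CQE with res = -EINTR is delivered.
pub fn is_idempotent(op: Opcode) -> bool {
    use Opcode::*;
    match op {
        Read | Readv | Recv | RecvMsg | PollAdd | Timeout
        | Accept | Fsync | Fdatasync | Nop | ProvideBuffers => true,
        Write | Writev | Send | SendMsg | Connect | Splice | Tee => false,
    }
}
```

Because the match is exhaustive, adding a new opcode without classifying it fails to compile, which keeps the drain step's re-submission decision total.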

Process migration scope (v1):

  • In scope: CPU register state, page tables (lazy), local file proxying, IPC handle conversion, cgroup membership, signal state, timer state, accelerator handle proxying, TCP/UDP socket migration (with IP mobility), io_uring ring migration (with drain), GPU context migration (hardware-supported GPUs only).
  • Out of scope for v1: Active RDMA queue pairs (application must handle), processes with CLONE_VM threads (thread group migration deferred), ptrace targets, GPU contexts on hardware without checkpoint support (process gets -EOPNOTSUPP unless it sets CLUSTER_PIN_NODE).
  • Limitation: A process holding hardware resources that cannot be proxied or migrated (e.g., a direct GPU rendering context on a consumer GPU without checkpoint support) will fail migration with -EOPNOTSUPP. Such processes should set CLUSTER_PIN_NODE to prevent migration attempts.

Migration Rollback Protocol

When migration fails after the freeze phase but before the destination commits, the system must restore the source to a runnable state without data loss. The rollback path depends on where the failure occurred:

  1. Destination failure: the destination node sends MigrationAbort { reason } to the source via the RDMA control channel. On receipt, the source:
     • Unfreezes all migrating tasks (reverses the SIGSTOP applied during the freeze phase, restoring each task to TASK_RUNNING in the local scheduler).
     • Restores pre-migration resource registrations: file descriptors, IPC handles, cgroup membership, and accelerator context bindings are reverted to their pre-migration state (the source retains these until commit confirmation).
     • Increments the per-node migration_failed counter (exposed via /sys/kernel/umka/cluster/stats/migration_failed). The cluster scheduler uses this counter to back off migration attempts to the failing destination (exponential backoff: 1 s, 2 s, 4 s, ..., capped at 60 s).

  2. RDMA link failure during transfer: the source detects an RDMA timeout (>500 ms with no completion on the migration QP) and assumes the destination is unreachable. The source:
     • Unfreezes all tasks locally (same procedure as case 1).
     • Marks the destination node as MIGRATION_SUSPECT in the cluster state (distinct from full node failure — the node may still be alive but the migration QP failed).
     The destination, if it is still running and has partially initialized the migrating process, runs migration_cleanup():
     • Releases any allocated address space (VMA teardown, page table deallocation).
     • Destroys partially-constructed task structs and scheduler entries.
     • Releases cgroup reservations.
     • Sends a MigrationCleanupComplete { pid } notification to the cluster coordinator so the process is not double-tracked.

  3. GPU context restore failure: if the destination GPU rejects the context restore (driver version mismatch detected after transfer, VRAM allocation failure, or hardware error during HMM reverse migration), the destination reports MigrationAbort { reason: GpuContextFailed } to the source. The source:
     • Restores the GPU context from the serialized GpuContextMigrationState snapshot. The snapshot is retained on the source until the destination sends a commit acknowledgment — it is never destroyed speculatively.
     • Unfreezes the AccelContext on the source (transitions from ACCEL_FROZEN back to ACCEL_RUNNING), allowing blocked accel_cmd_submit() calls to proceed locally.
     • If the source GPU is also unavailable (e.g., concurrent hardware failure), the affected tasks are terminated with SIGBUS (indicating an unrecoverable hardware fault). The SIGBUS si_code is set to BUS_MCEERR_AO (action optional) to indicate an asynchronous hardware error.

  4. Commit timeout: the source waits a maximum of 30 seconds for the destination's commit acknowledgment (MigrationCommitAck { pid }). If the timeout expires:
     • Assume split-brain: it is unsafe for either node to unilaterally resume the process, because the destination may have already committed and started executing.
     • Both source and destination invalidate the migrating process: the source delivers SIGKILL to its local copy, and the destination (if reachable) is instructed to SIGKILL its copy via the cluster coordinator.
     • The cluster coordinator reconciles the process state during the next heartbeat round: it queries all nodes for the process PID and ensures exactly zero or one instance exists. If both copies were killed, the process is gone and its parent receives SIGCHLD with CLD_KILLED status.
     • The migration_timeout counter is incremented and logged at KERN_WARNING level.

Phase-based rollback specification:

Migration proceeds through four distinct phases. Rollback is deterministic based on which phase the failure occurs in:

Phase 1 (Pre-transfer — state snapshot not yet sent):
  • Failure: network error before state transfer begins.
  • Rollback: Cancel migration. Source process resumes unchanged. migration_state → Idle. No side effects on either node.

Phase 2 (State transfer in progress):
  • Failure: network error mid-transfer, or destination ENOMEM.
  • Rollback:
    1. Destination: free all allocated VMAs, discard partial page transfers, release the reserved PID slot.
    2. Source: migration_state → Idle. The process was suspended during transfer — task_wake() resumes it.
    3. Source sends a MIGRATION_ABORT message to the destination; destination ACKs (or times out after 5 s, at which point the source assumes cleanup is complete).
    4. If the source crashes before sending ABORT: the destination has a 30 s migration-dead timer that triggers self-cleanup on expiry.

Phase 3 (State transferred, switchover not yet committed):
  • Failure: destination crashes after receiving full state, before the source receives COMMIT ACK.
  • Rollback:
    1. Source detects the missing COMMIT ACK after a 5 s timeout.
    2. Source sends MIGRATION_QUERY to the destination to check status.
    3. If the destination responds FAILED or is unreachable: the source resumes the process locally (the process was in suspended state on the source during transfer). Source sends MIGRATION_ABORT to invalidate any partial destination state.
    4. If the destination responds COMMITTED (commit happened but the ACK was lost): the source dequeues its local copy and sends MIGRATION_SOURCE_CLEANUP. The destination owns the process.

Phase 4 (Committed on both sides):
  • No rollback possible (migration complete). If the process needs to return, it is a new migration in the reverse direction.
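The deterministic phase-to-rollback mapping can be summarized as a small decision function; all type and variant names here are illustrative labels, not kernel types:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum MigrationPhase { PreTransfer, TransferInProgress, TransferredUncommitted, Committed }

/// Destination status as learned from MIGRATION_QUERY (or the lack of a reply).
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DestinationStatus { Failed, Unreachable, Committed }

#[derive(PartialEq, Debug)]
pub enum Rollback {
    CancelNoSideEffects,          // Phase 1: nothing was sent
    FreePartialStateResumeSource, // Phase 2: destination discards, source resumes
    ResumeSourceAbortDestination, // Phase 3: destination failed or unreachable
    CleanupSourceDestinationOwns, // Phase 3: commit happened but ACK was lost
    None,                         // Phase 4: migration complete, no rollback
}

pub fn rollback_action(phase: MigrationPhase, dest: DestinationStatus) -> Rollback {
    match phase {
        MigrationPhase::PreTransfer => Rollback::CancelNoSideEffects,
        MigrationPhase::TransferInProgress => Rollback::FreePartialStateResumeSource,
        MigrationPhase::TransferredUncommitted => match dest {
            DestinationStatus::Committed => Rollback::CleanupSourceDestinationOwns,
            _ => Rollback::ResumeSourceAbortDestination,
        },
        MigrationPhase::Committed => Rollback::None,
    }
}
```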

Idempotency: All migration messages carry a 64-bit migration_id. Retransmits are detected by migration_id deduplication at the destination; duplicate messages are ACKed without processing.
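The deduplication rule can be sketched with a set of seen ids; `MigrationReceiver` is a hypothetical name for the destination-side bookkeeping:

```rust
use std::collections::HashSet;

/// Destination-side deduplication: each migration message carries a 64-bit
/// migration_id; a retransmit of an already-seen id is ACKed without being
/// reprocessed.
pub struct MigrationReceiver {
    seen: HashSet<u64>,
}

#[derive(PartialEq, Debug)]
pub enum Disposition {
    Process, // first delivery: handle the message, then ACK
    AckOnly, // retransmit: ACK without processing
}

impl MigrationReceiver {
    pub fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    pub fn receive(&mut self, migration_id: u64) -> Disposition {
        // HashSet::insert returns false if the id was already present.
        if self.seen.insert(migration_id) {
            Disposition::Process
        } else {
            Disposition::AckOnly
        }
    }
}
```

A production version would also bound the set (e.g., evict ids once the corresponding migration reaches Phase 4), which is omitted here.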

5.6.5 Capability-Gated Migration

Process migration requires capabilities:

pub const CLUSTER_MIGRATE: u32  = 0x0200;  // Allow process migration to remote nodes
pub const CLUSTER_PIN_NODE: u32 = 0x0201;  // Pin process to specific node (prevent migration)
pub const CLUSTER_ADMIN: u32    = 0x0202;  // Cluster-wide scheduler administration

Processes without CLUSTER_MIGRATE are never migrated. Processes with CLUSTER_PIN_NODE can pin themselves to their current node. Containers/cgroups can restrict which nodes their processes can run on:

/sys/fs/cgroup/<group>/cluster.nodes
# Allowed nodes for this cgroup: "0 1 2" or "all"
# Default: current node only (no migration)

/sys/fs/cgroup/<group>/cluster.migrate
# "auto" (kernel decides), "never" (pinned), "prefer" (hint)
# Default: "never" (existing Linux behavior)
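A hypothetical parser for the cluster.migrate values, sketching the reject-invalid-writes behavior; the function name and errno constant are assumptions for illustration:

```rust
#[derive(Debug, PartialEq)]
pub enum MigratePolicy {
    Auto,   // kernel decides
    Never,  // pinned (default, matching existing Linux behavior)
    Prefer, // hint
}

pub fn parse_migrate_policy(s: &str) -> Result<MigratePolicy, i32> {
    const EINVAL: i32 = 22; // errno value, assumed for illustration
    match s.trim() {
        "auto" => Ok(MigratePolicy::Auto),
        "never" => Ok(MigratePolicy::Never),
        "prefer" => Ok(MigratePolicy::Prefer),
        // An unrecognized value is rejected; the previous policy stays in effect.
        _ => Err(-EINVAL),
    }
}
```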

5.6.6 Reconciliation: Local vs Distributed Scheduling

The single-node scheduler (Section 7.1) optimizes for cache locality — keeping tasks on the same CPU core. The distributed scheduler may migrate tasks across nodes, destroying all cache state.

Design principle: Cross-node migration is a last resort, not a default action.

Two-level hierarchy with strict separation:

  1. Intra-node (Section 7.1): CFS/EEVDF handles all CPU-local decisions (~4ms tick). Entirely unaware of the cluster.
  2. Inter-node: ClusterScheduler runs every 10 seconds — deliberately slow because cross-node migration is 1000x more expensive than cross-CPU migration.

Migration threshold: A task migrates cross-node only when at least one of the following holds:

  • Source node CPU utilization exceeds 120% of cluster average AND target is below 80% (sustained for 2+ rebalance intervals), OR
  • The task's working set is predominantly on the target node (>70% of pages, per DataAffinity), OR
  • The task's affinity mask explicitly requests a different node.

Migration cost model:

migration_benefit = (source_load - target_load) * task_weight
migration_cost = cache_refill_time + network_transfer_time + tlb_flush_time

Migration proceeds only when migration_benefit > migration_cost * 1.5 (50% hysteresis to prevent oscillation).

Warm-up penalty: After cross-node migration, the task's effective load weight is inflated 2x for 20 seconds, preventing immediate re-migration before cache state builds.
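The threshold, hysteresis, and warm-up rules can be sketched as a single gate. For simplicity, the 2x effective-weight inflation is modeled here as a hard veto during the 20 s window, and all parameter names and units are illustrative:

```rust
/// Decide whether a cross-node migration is worthwhile.
/// `task_weight` and the three cost terms are abstract units.
pub fn should_migrate(
    source_load: f64,
    target_load: f64,
    task_weight: f64,
    seconds_since_last_migration: Option<f64>,
    cache_refill: f64,
    net_transfer: f64,
    tlb_flush: f64,
) -> bool {
    // Warm-up penalty: do not re-migrate before cache state rebuilds
    // (simplified stand-in for the 2x weight inflation).
    if matches!(seconds_since_last_migration, Some(s) if s < 20.0) {
        return false;
    }
    let migration_benefit = (source_load - target_load) * task_weight;
    let migration_cost = cache_refill + net_transfer + tlb_flush;
    // 50% hysteresis to prevent oscillation.
    migration_benefit > migration_cost * 1.5
}
```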

5.6.7 Cluster Placement Policy Expression Language (CPPEL)

Design note — in-kernel vs. userspace: CPPEL runs in the kernel's global scheduler context. An alternative is a userspace policy daemon consuming topology/load data via procfs/sysfs. The in-kernel approach was chosen because: (a) placement evaluation needs atomic access to PeerLoad snapshots and TopologyGraph — a userspace daemon would round-trip through sysfs per field, (b) the language is intentionally constrained (no loops, no allocation, depth-8 nesting limit) so the attack surface is bounded, (c) evaluation frequency is low (~every 10s per cgroup) so it doesn't affect hot-path performance. However, the kernel MUST reject expressions at parse time if they violate the nesting limit — no runtime stack overflow is acceptable. A future alternative: expose PeerLoad snapshots as a structured procfs file and let a Tier 2 userspace scheduler make placement decisions, writing results back via the cgroup cluster.preferred_node knob. Both approaches are compatible.

The cluster scheduler's placement and migration decisions can be customized through a small policy expression language. Policies are set per-cgroup via:

/sys/fs/cgroup/<group>/cluster.placement_policy

A policy is a single expression that evaluates to an action — either migrate(<target>), stay, or prefer(<node>). The expression language is intentionally minimal: it is not a general-purpose scripting language, but a constrained DSL for expressing load-based placement decisions. Expressions are evaluated by the global cluster scheduler every balance_interval_ms (default 10 s) for each cgroup.

Syntax (authoritative EBNF — the supported expression syntax):

policy      := expr
expr        := cond_expr | action_expr | bool_expr | arith_expr
cond_expr   := "if" expr "then" expr "else" expr
action_expr := "migrate" "(" target ")" | "stay" | "prefer" "(" node_ref ")"
target      := node_ref | "nearest_idle" | "least_loaded"
node_ref    := "node" "." IDENTIFIER | INTEGER
bool_expr   := arith_expr cmp_op arith_expr | bool_expr logic_op bool_expr | "!" bool_expr
arith_expr  := arith_expr arith_op arith_expr | "-" arith_expr | atom
atom        := field_ref | FLOAT | INTEGER | "(" expr ")"
field_ref   := "node" "." IDENTIFIER
              | "cluster" "." IDENTIFIER
arith_op    := "+" | "-" | "*" | "/" | "%"
cmp_op      := "==" | "!=" | "<" | "<=" | ">" | ">="
logic_op    := "&&" | "||"
IDENTIFIER  := [a-zA-Z_][a-zA-Z0-9_]*
INTEGER     := [0-9]+
FLOAT       := [0-9]+ "." [0-9]+

Available node fields:

| Field | Type | Description |
|---|---|---|
| node.load | float (0.0–1.0) | CPU utilization, normalized (0 = idle, 1 = fully loaded) |
| node.cpu_percent | int (0–100) | CPU utilization percentage |
| node.memory_pressure | int (0–100) | Memory pressure (0 = plenty free, 100 = swapping) |
| node.free_mem_mib | int | Free memory in mebibytes |
| node.accel_percent | int (0–100) | Accelerator utilization percentage |
| node.runnable_count | int | Number of runnable tasks |
| node.remote_fault_rate | int | Remote DSM page faults per second |
| cluster.node_count | int | Number of active nodes |
| cluster.avg_load | float (0.0–1.0) | Average CPU load across all nodes |

Arithmetic operators (operate on numeric values, return numeric):

| Operator | Syntax | Description |
|---|---|---|
| Add | a + b | Addition |
| Subtract | a - b | Subtraction |
| Multiply | a * b | Multiplication |
| Divide | a / b | Division (integer division if both operands are integers; divide-by-zero evaluates to 0) |
| Modulo | a % b | Remainder |
| Negate | -a | Unary negation |

Comparison operators (operate on numerics, return bool):

| Operator | Syntax | Description |
|---|---|---|
| Equal | a == b | True if a equals b |
| Not-equal | a != b | True if a does not equal b |
| Less-than | a < b | True if a is less than b |
| Less-or-equal | a <= b | True if a ≤ b |
| Greater-than | a > b | True if a is greater than b |
| Greater-or-equal | a >= b | True if a ≥ b |

Logical operators (operate on bools, short-circuit evaluation, return bool):

| Operator | Syntax | Description |
|---|---|---|
| Logical AND | a && b | True if both operands are true. Short-circuits: if a is false, b is not evaluated. |
| Logical OR | a \|\| b | True if either operand is true. Short-circuits: if a is true, b is not evaluated. |
| Logical NOT | !a | Unary prefix. True if a is false. |

Conditional expression:

if <cond> then <expr> else <expr>

Evaluates cond as a bool; returns the then branch if true, the else branch if false. Both branches must have the same type (both numeric or both actions).

Type coercion: A numeric value used in a boolean context (e.g., as the condition of if) is treated as false if zero (or 0.0) and true otherwise. This allows expressions such as if node.remote_fault_rate then migrate(least_loaded) else stay.

Examples:

# Migrate if this node's CPU load exceeds 80% AND free memory is low.
if node.load > 0.8 && node.free_mem_mib < 512 then migrate(nearest_idle) else stay

# Migrate if this node is significantly more loaded than the cluster average,
# or if memory pressure is critical.
if node.load > cluster.avg_load * 1.2 || node.memory_pressure > 90
then migrate(least_loaded)
else stay

# Prefer the accelerator-lightest node when the local accelerator is saturated,
# but never migrate if memory pressure would be made worse.
if node.accel_percent > 95 && !( node.memory_pressure > 80 )
then prefer(nearest_idle)
else stay

Evaluation rules:

  • Expressions are evaluated in the global cluster scheduler context (interrupt-safe, no blocking operations). Evaluation is always bounded — recursive if nesting is limited to depth 8 to prevent stack overflow. An expression exceeding this depth is rejected at policy-load time with -EINVAL.
  • Field references reflect the most recent PeerLoad snapshot (updated every balance_interval_ms). Expressions cannot trigger RDMA reads or memory allocation.
  • If evaluation produces an error (type mismatch, divide-by-zero, depth exceeded), the result is stay. The error is counted in the cgroup's cluster.policy_errors counter.
  • Policies are parsed and type-checked when written to cluster.placement_policy. An invalid policy is rejected immediately; the previous policy remains in effect.
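A miniature model of the bounded evaluator: only a fragment of CPPEL (numbers, `+`, `>`, `if`) is covered, and the AST shape is an assumption. It shows the parse-time depth-8 admission check that makes runtime evaluation total — a too-deep tree is rejected with -EINVAL before evaluation, so the evaluator itself can never overflow the stack:

```rust
pub const MAX_DEPTH: u32 = 8;
pub const EINVAL: i32 = 22; // errno value, assumed for illustration

/// Tiny illustrative AST; booleans are modeled as 1.0 / 0.0, matching the
/// numeric-to-boolean coercion rule described above.
pub enum Expr {
    Num(f64),
    Add(Box<Expr>, Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
    If(Box<Expr>, Box<Expr>, Box<Expr>), // nonzero condition selects the then branch
}

pub fn depth(e: &Expr) -> u32 {
    match e {
        Expr::Num(_) => 1,
        Expr::Add(a, b) | Expr::Gt(a, b) => 1 + depth(a).max(depth(b)),
        Expr::If(c, t, f) => 1 + depth(c).max(depth(t)).max(depth(f)),
    }
}

/// Admission check at policy-load time, then evaluation (which cannot fail).
pub fn check_and_eval(e: &Expr) -> Result<f64, i32> {
    if depth(e) > MAX_DEPTH {
        return Err(-EINVAL); // rejected when written to cluster.placement_policy
    }
    Ok(eval(e))
}

fn eval(e: &Expr) -> f64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Add(a, b) => eval(a) + eval(b),
        Expr::Gt(a, b) => if eval(a) > eval(b) { 1.0 } else { 0.0 },
        Expr::If(c, t, f) => if eval(c) != 0.0 { eval(t) } else { eval(f) },
    }
}
```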

5.7 Network-Portable Capabilities

5.7.1 Problem

UmkaOS's capability system (Section 9.1) uses opaque kernel-memory tokens validated locally. For distributed operation, capabilities must work across nodes: a process migrated from Node A to Node B should retain its capabilities. A remote RDMA operation should be authorized by a capability that the remote node can verify.

5.7.2 Design: Cryptographically-Signed Capabilities

// umka-core/src/cap/distributed.rs

/// Network-portable capability — split into a compact header (fits on the
/// kernel stack, ~64 bytes) and a separately-allocated signature payload.
/// kernel-internal, not KABI — contains native types (ObjectId, PeerId, u64).
///
/// Rationale: PQC signatures (ML-DSA-65: 3,309 bytes, hybrid: 3,373 bytes)
/// make the full capability ~3.6 KB, which must NOT be placed on the kernel
/// stack (kernel stacks are 8-16 KB). The split design keeps the hot path
/// (permission checks, expiry checks) on the stack via CapabilityHeader,
/// while the signature data lives in a slab-allocated CapabilitySignature.
// kernel-internal, not KABI
#[repr(C)]
pub struct CapabilityHeader {
    /// The local capability this was derived from.
    pub object_id: ObjectId,
    pub permissions: PermissionBits,
    /// Epoch when this capability was created. Used for revocation comparison:
    /// if `creation_epoch < node.revocation_epoch`, the capability may have been
    /// revoked and requires synchronous revalidation against the originating node.
    pub creation_epoch: u64,
    pub constraints: CapConstraints,

    // === Network portability extensions ===

    /// Peer that issued this capability.
    pub issuer_peer: PeerId,

    /// Timestamp of issuance (cluster-relative wall clock,
    /// synchronized via PTP/NTP, see Section 5.8.2.5).
    pub issued_at_ns: u64,

    /// Expiry timestamp (cluster-relative wall clock, see Section 5.8.2.5).
    /// Capabilities MUST have bounded lifetime
    /// (prevents stale capabilities after node failure).
    /// Default: 5 minutes. Renewable while issuer is alive.
    /// Expiry checking includes a 1ms grace period for clock skew.
    pub expires_at_ns: u64,

    /// Signature algorithm identifier.
    /// Uses SignatureAlgorithm encoding (Section 9.5.2). u16 accommodates
    /// hybrid algorithm IDs (0x0200+).
    pub sig_algorithm: u16,

    /// Epoch of the signing key used to create this capability.
    /// Enables receivers to identify which key to use for verification
    /// and to reject capabilities signed with rotated-out keys.
    pub signing_key_epoch: u64,

    /// User namespace ID that scoped this capability's creation.
    /// Distributed capabilities are only valid within the same user namespace
    /// scope — a capability issued under user_ns_id N is rejected by a receiver
    /// whose user namespace root does not include namespace N. This prevents
    /// privilege escalation across user namespace boundaries in multi-tenant
    /// clusters (e.g., a container's capability cannot grant access on the host).
    pub user_ns_id: u64,

    /// Slab slot index into the `cap_sig_slab` pool.
    /// The signature is allocated from a dedicated slab allocator
    /// (fixed-size 3,588-byte slots) to avoid general heap allocation
    /// on the capability verification path.
    ///
    /// Access via `cap_sig_slab.get(signature_slot)` which returns
    /// `Option<&CapabilitySignature>`. A `None` return indicates the
    /// capability was revoked between header creation and signature
    /// access — the caller must treat this as a revoked capability.
    ///
    /// Using a u32 slot index instead of `*const CapabilitySignature`:
    ///   - Trivially `Send`/`Sync` (no raw pointer lifetime concerns).
    ///   - Bounds-checked at access time (no dangling pointer risk).
    ///   - Slab reclamation during memory pressure cannot cause use-after-free;
    ///     the slot simply returns `None` after reclamation.
    pub signature_slot: CapSigSlotId,
}

/// Slab slot identifier for a `CapabilitySignature` in the `cap_sig_slab` pool.
/// u32 supports up to ~4 billion concurrent capability signatures — sufficient
/// for any practical cluster size. The slot is bounds-checked on every access.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapSigSlotId(pub u32);

/// Signature payload for a distributed capability.
/// Allocated from a dedicated slab allocator (`cap_sig_slab`), NOT from
/// the general heap, to ensure bounded allocation latency on the
/// capability verification path.
///
/// Signature formats supported for distributed capabilities:
///   - Ed25519: 64 bytes (current default)
///   - ML-DSA-65: 3,309 bytes (PQC migration target, Section 9.5)
///   - Hybrid Ed25519 + ML-DSA-65: 3,373 bytes (transition mode)
///
/// SLH-DSA-SHAKE-128f is deliberately EXCLUDED from distributed capabilities.
/// Its 17,088-byte signatures would add ~17 KB to every cross-node
/// capability transfer, making it impractical for runtime operations
/// that occur at lock-acquisition frequency. SLH-DSA-SHAKE-128f is supported
/// for boot signatures (Section 9.2, KernelSignature/DriverSignature
/// structs with 17,408-byte buffers) where it is verified once at load
/// time. For distributed capabilities, ML-DSA-65 provides NIST PQC
/// security at 1/5th the signature size.
///
/// MAX_DISTRIBUTED_SIG_DATA_BYTES = 3,584 (maximum signature payload;
/// the `CapabilitySignature` struct is 3,588 bytes total including the
/// 4-byte header: `sig_len`(2) + `_pad`(2)). Accommodates hybrid PQC
/// signatures with alignment headroom.
#[repr(C)]
pub struct CapabilitySignature {
    /// Actual signature length in bytes.
    pub sig_len: u16,
    pub _pad: [u8; 2],
    /// Signature data. Only sig_len bytes are meaningful.
    pub data: [u8; 3584],
}
const_assert!(core::mem::size_of::<CapabilitySignature>() == 3588);

/// Convenience type combining header + signature for full capability operations.
/// Never placed on the stack as a whole — the header is on the stack and
/// the signature is accessed via slab slot index.
pub type DistributedCapability = (CapabilityHeader, CapSigSlotId);

Memory layout note: The CapabilityHeader is ~72 bytes and safe to place on the kernel stack. The CapabilitySignature is 3,588 bytes and allocated from a dedicated slab pool (cap_sig_slab, 3,588-byte slots) — never on the stack. The signature_slot: CapSigSlotId (u32) in the header is a bounds-checked index into the slab pool. For RDMA transmission, the capability is serialized using a variable-length wire format that includes only sig_len bytes of signature data, giving typical message size of ~170 bytes (Ed25519: 64-byte signature + 106 bytes header) to ~3,415 bytes (ML-DSA-65: 3,309-byte signature + 106 bytes header).

Wire encoding of CapabilityHeader: The in-memory CapabilityHeader uses native integer types and a slab slot index (CapSigSlotId). For RDMA/network transmission, the header is serialized into a variable-length wire format with little-endian encoded fields and the signature appended (only sig_len bytes):

/// Wire encoding of ObjectId for cross-node capability transfer.
/// ObjectId is {slot: u32, generation: u64} = 12 bytes in-memory.
/// The wire format preserves both fields with explicit little-endian encoding.
#[repr(C)]
pub struct ObjectIdWire {
    /// Slot index in the object registry.
    pub slot: Le32,                     // 4 bytes
    /// Generation counter for stale-capability detection.
    pub generation: Le64,               // 8 bytes
}
// Total: 12 bytes. Le32/Le64 are `[u8; N]` wrappers with alignment 1,
// so no implicit padding between fields.
const_assert!(core::mem::size_of::<ObjectIdWire>() == 12);

/// Capability wire encoding specification (variable-length).
///
/// The wire format is NOT a `repr(C)` struct — it is a byte-level encoding
/// specification. The total wire size is `CAP_WIRE_HEADER_SIZE + sig_len` bytes:
/// ~170 bytes for Ed25519, ~3,415 bytes for ML-DSA-65.
///
/// | Offset | Size | Field |
/// |--------|------|-------|
/// | 0      | 12   | `object_id: ObjectIdWire` (Le32 slot + Le64 generation) |
/// | 12     | 8    | `permissions: Le64` |
/// | 20     | 8    | `creation_epoch: Le64` |
/// | 28     | 32   | `constraints: CapConstraintsWire` (fixed 32 bytes) |
/// | 60     | 8    | `issuer_peer: Le64` |
/// | 68     | 8    | `issued_at_ns: Le64` |
/// | 76     | 8    | `expires_at_ns: Le64` |
/// | 84     | 2    | `sig_algorithm: Le16` |
/// | 86     | 8    | `signing_key_epoch: Le64` |
/// | 94     | 8    | `user_ns_id: Le64` |
/// | 102    | 2    | `sig_len: Le16` |
/// | 104    | 2    | `_pad: [u8; 2]` (reserved, zero) |
/// | 106    | sig_len | signature data (Ed25519: 64 bytes, ML-DSA-65: 3309 bytes) |
///
/// Total header (fixed part): CAP_WIRE_HEADER_SIZE = 106 bytes.
/// Total message: 106 + sig_len bytes.
pub const CAP_WIRE_HEADER_SIZE: usize = 106;

/// Encode a capability into the wire format.
/// `buf` must be at least `CAP_WIRE_HEADER_SIZE + sig.len()` bytes.
/// Returns the number of bytes written.
pub fn encode_capability_wire(
    buf: &mut [u8],
    header: &CapabilityHeader,
    constraints: &CapConstraintsWire,
    sig: &[u8],
) -> Result<usize, Error>;

/// Decode a capability from the wire format.
/// Returns (CapabilityHeader, CapConstraintsWire, sig_offset, sig_len).
/// The caller retrieves signature bytes from `buf[sig_offset..sig_offset + sig_len]`.
pub fn decode_capability_wire(
    buf: &[u8],
) -> Result<(CapabilityHeader, CapConstraintsWire, usize, u16), Error>;
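To make the offset table concrete, here is a hypothetical user-space encoder that follows the layout above with explicit little-endian writes. The `_sketch` suffix marks it as illustrative, not the kernel routine; field values are passed as plain scalars rather than the in-memory structs:

```rust
/// Fixed part of the wire format, per the offset table above.
pub const WIRE_HEADER_SIZE: usize = 106;

pub fn encode_capability_wire_sketch(
    object_slot: u32, object_gen: u64, permissions: u64, creation_epoch: u64,
    constraints: &[u8; 32], issuer_peer: u64, issued_at_ns: u64, expires_at_ns: u64,
    sig_algorithm: u16, signing_key_epoch: u64, user_ns_id: u64, sig: &[u8],
) -> Vec<u8> {
    let mut buf = Vec::with_capacity(WIRE_HEADER_SIZE + sig.len());
    buf.extend_from_slice(&object_slot.to_le_bytes());        // offset 0:   ObjectIdWire.slot
    buf.extend_from_slice(&object_gen.to_le_bytes());         // offset 4:   ObjectIdWire.generation
    buf.extend_from_slice(&permissions.to_le_bytes());        // offset 12
    buf.extend_from_slice(&creation_epoch.to_le_bytes());     // offset 20
    buf.extend_from_slice(constraints);                       // offset 28:  CapConstraintsWire
    buf.extend_from_slice(&issuer_peer.to_le_bytes());        // offset 60
    buf.extend_from_slice(&issued_at_ns.to_le_bytes());       // offset 68
    buf.extend_from_slice(&expires_at_ns.to_le_bytes());      // offset 76
    buf.extend_from_slice(&sig_algorithm.to_le_bytes());      // offset 84
    buf.extend_from_slice(&signing_key_epoch.to_le_bytes());  // offset 86
    buf.extend_from_slice(&user_ns_id.to_le_bytes());         // offset 94
    buf.extend_from_slice(&(sig.len() as u16).to_le_bytes()); // offset 102: sig_len
    buf.extend_from_slice(&[0u8, 0u8]);                       // offset 104: _pad (zero)
    assert_eq!(buf.len(), WIRE_HEADER_SIZE);
    buf.extend_from_slice(sig);                               // offset 106: sig_len bytes
    buf
}
```

With a 64-byte Ed25519 signature the result is the 170-byte message quoted earlier; with ML-DSA-65 it is 106 + 3,309 = 3,415 bytes.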

Wire format for CapConstraints: The in-memory CapConstraints struct (Section 9.1) contains a cpu_affinity_mask: CpuMask field whose size is variable (depends on nr_cpus discovered at boot). Variable-size fields cannot appear in a fixed-size wire format. The wire encoding of CapConstraints replaces CpuMask with a fixed-size NodeAffinityHint:

/// Wire encoding of CapConstraints for distributed capabilities.
/// Fixed-size (32 bytes). Included in the capability wire format at offset 28.
///
/// CpuMask is NOT serialized on the wire — it is a node-local concept
/// (CPU numbering differs across nodes). Instead, the wire format carries
/// a NodeAffinityHint that expresses the constraint at the topology level
/// appropriate for cross-node semantics.
#[repr(C)]
pub struct CapConstraintsWire {
    /// Capability expiry timestamp (0 = no expiry). This is the constraint-
    /// level expiry (per-delegation policy). The header-level `expires_at_ns`
    /// at wire offset 76 is the signature expiry (signed by the issuer).
    /// Both are checked: the capability is expired if EITHER timestamp has
    /// passed. The constraint-level expiry may be SHORTER than the signature
    /// expiry (a delegatee restricts further), but never longer.
    pub expires_at: Le64,                          // 8 bytes
    /// Maximum delegation depth (0 = no further delegation).
    pub max_delegation_depth: Le32,                // 4 bytes
    /// Flags:
    ///   bit 0: delegatable
    ///   bit 1: revocable
    ///   bit 2: urgent_revoke (capability uses Tier 2 revocation)
    pub flags: Le16,                               // 2 bytes
    /// Maximum delegation tier (0 = Tier 0 only, 1 = up to Tier 1, etc.).
    /// Controls how far down the isolation hierarchy this capability may
    /// be re-delegated. 0xFF = unrestricted.
    pub max_delegation_tier: u8,                   // 1 byte
    pub _pad: [u8; 1],                             // 1 byte
    /// Node affinity hint — the topology-level equivalent of CpuMask
    /// for cross-node capability constraints.
    pub affinity: NodeAffinityHint,                // 16 bytes
}

/// Fixed-size (16 bytes) topology-level affinity constraint for distributed
/// capabilities. Replaces CpuMask on the wire because CPU numbering is
/// node-local and meaningless across nodes.
///
/// A receiving node translates NodeAffinityHint into a local CpuMask by
/// looking up the NUMA node IDs that match the specified topology constraints
/// (using the local topology discovery from {ref:boot-and-installation}  <!-- UNRESOLVED -->).
#[repr(C)]
pub struct NodeAffinityHint {
    /// Target NUMA node ID on the destination node. u16::MAX = any node.
    /// The receiver maps this to a CpuMask covering CPUs on that NUMA node.
    pub preferred_numa_node: Le16,                 // 2 bytes
    /// Target peer ID. Zero = local node (no cross-node constraint).
    /// Non-zero = capability is only valid when exercised on this peer.
    pub target_peer: Le64,                         // 8 bytes
    /// Minimum memory tier required (0 = any, 1 = DRAM, 2 = HBM, 3 = CXL).
    /// See [Section 4.2](04-memory.md#physical-memory-allocator--memory-tier-model).
    pub min_memory_tier: u8,                       // 1 byte
    pub _reserved: [u8; 5],                        // 5 bytes
}
const_assert!(core::mem::size_of::<CapConstraintsWire>() == 32);
const_assert!(core::mem::size_of::<NodeAffinityHint>() == 16);
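The dual-expiry semantics documented on `expires_at` (either timestamp passing expires the capability, with the 1 ms clock-skew grace period from Section 5.7.2) condense into a small check. This is a sketch of the rule, not the kernel's implementation:

```rust
/// 1 ms grace period absorbing residual cluster clock skew.
const EXPIRY_GRACE_NS: u64 = 1_000_000;

/// A capability is expired if EITHER the signed header-level expiry or the
/// (possibly shorter) constraint-level expiry has passed.
/// `constraint_expires_ns == 0` means "no constraint-level expiry";
/// the header-level expiry is always set (bounded lifetime is mandatory).
pub fn is_expired(now_ns: u64, header_expires_ns: u64, constraint_expires_ns: u64) -> bool {
    let past = |deadline: u64| deadline != 0 && now_ns > deadline.saturating_add(EXPIRY_GRACE_NS);
    past(header_expires_ns) || past(constraint_expires_ns)
}
```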

Serialization protocol:

| Direction | Conversion |
|-----------|------------|
| In-memory → wire (sending) | `CapConstraints` → `CapConstraintsWire`: copy scalar fields; convert `cpu_affinity_mask` to `NodeAffinityHint` by looking up the NUMA node that contains the majority of set bits in the mask; set `preferred_numa_node` to that NUMA ID. If the mask is empty (no constraint), set `preferred_numa_node = u16::MAX`. |
| Wire → in-memory (receiving) | `CapConstraintsWire` → `CapConstraints`: copy scalar fields; convert `NodeAffinityHint.preferred_numa_node` to a local `CpuMask` covering all CPUs on the specified NUMA node. If `preferred_numa_node == u16::MAX`, set `cpu_affinity_mask = CpuMask::empty()` (no constraint). |

This ensures that the wire format is fixed-size (32 bytes for CapConstraintsWire) and that CPU affinity constraints are translated to the local node's topology rather than carrying meaningless remote CPU numbers across the network.
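A minimal sketch of the two conversions, assuming a `cpu_to_numa` table produced by local topology discovery (all names and signatures here are illustrative, and the mask is simplified to a `bool` slice):

```rust
use std::collections::HashMap;

/// Sending side: pick the NUMA node containing the majority of set bits.
/// An empty mask means "no constraint" and maps to u16::MAX.
pub fn mask_to_preferred_numa(mask: &[bool], cpu_to_numa: &[u16]) -> u16 {
    let mut counts: HashMap<u16, u32> = HashMap::new();
    for (cpu, &set) in mask.iter().enumerate() {
        if set {
            *counts.entry(cpu_to_numa[cpu]).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .max_by_key(|&(_, n)| n)
        .map(|(node, _)| node)
        .unwrap_or(u16::MAX)
}

/// Receiving side: expand the hint into a mask over the LOCAL topology.
/// u16::MAX means "no constraint" and maps to an empty mask.
pub fn preferred_numa_to_mask(preferred: u16, cpu_to_numa: &[u16]) -> Vec<bool> {
    if preferred == u16::MAX {
        return vec![false; cpu_to_numa.len()];
    }
    cpu_to_numa.iter().map(|&n| n == preferred).collect()
}
```

Note that the round trip is lossy by design: only the dominant NUMA node survives, which is the appropriate granularity for cross-node placement hints.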

Slab Allocator and DoS Mitigation: The cap_sig_slab pool uses per-process quotas to prevent a malicious process from exhausting kernel memory by allocating many signatures:

/// Per-process quota for capability signature allocations.
/// Tracked in the process's capability space (Section 9.1).
pub struct CapSignatureQuota {
    /// Maximum signature slots this process may hold simultaneously.
    /// Default: 1024 (approximately 3.6 MB worst case).
    /// Configurable via prctl(PR_SET_CAP_QUOTA).
    pub max_slots: u32,

    /// Currently allocated slots for this process.
    /// Incremented on successful slab allocation, decremented on free.
    pub used_slots: AtomicU32,
}

/// Default quota: 1024 signatures (~3.6 MB per process).
pub const DEFAULT_CAP_SIGNATURE_QUOTA: u32 = 1024;

/// System-wide limit on total signature slab memory.
/// Default: 1 GB (approximately 290,000 signatures).
pub const CAP_SIG_SLAB_TOTAL_LIMIT: usize = 1024 * 1024 * 1024;

Allocation protocol:

  1. When a process derives a DistributedCapability, the kernel attempts to allocate a signature slot from cap_sig_slab.
  2. Atomically reserve a slot via fetch_add(1, AcqRel). The returned value is the PREVIOUS used_slots count. If prev >= max_slots, the reservation failed: immediately fetch_sub(1, Relaxed) to undo the reservation and return EAGAIN. This atomic reservation pattern avoids the TOCTOU race in the previous load-check-then-increment design (two concurrent allocators could both pass the check but only one should succeed).
  3. On success (prev < max_slots): proceed with slab allocation. If the slab allocation itself fails (out of memory), fetch_sub(1, Relaxed) to undo.
  4. On capability drop (explicit revoke, expiry, or process exit), fetch_sub(1, Release) and return the slot to the slab.

Eager reclamation: When a process terminates (normal exit or killed), all its CapabilitySignature slots are freed immediately. The capability space tracks all signatures via an intrusive linked list per process — O(1) enumeration for cleanup.

Memory pressure handling: If the system-wide cap_sig_slab exceeds 80% of CAP_SIG_SLAB_TOTAL_LIMIT, the kernel triggers a reclamation pass:

  • Scan all processes for expired capabilities (where expires_at_ns < now).
  • Free signatures for expired capabilities regardless of process quota.

This ensures that expired capabilities don't accumulate under memory pressure.

System-wide exhaustion (slab at 100%): all new DistributedCapability derivations return -ENOMEM until slots are freed. To prevent a coordinated DoS where many processes each fill their per-process quota:

  • Cgroup-level cap quota: distributed capability signatures count against the cgroup's memory limit (memory.current includes cap_sig_slab allocations for processes in that cgroup). Exceeding the cgroup memory limit triggers the standard OOM reclaim path, which includes capability signature reclamation.
  • FMA alert: when system-wide slab exceeds 90%, log FMA warning "cap_sig_slab nearing exhaustion" with the top-3 consuming processes.

Rationale for 1024-slot default: A typical process holds <100 distributed capabilities (file handles, memory regions, accelerator contexts). 1024 provides 10x headroom while limiting worst-case memory consumption to ~3.6 MB per process. For specialized workloads (distributed databases, HPC orchestrators), the admin can raise the limit via prctl.

5.7.3 Verification

Remote capability verification (any node):

1. Process on Node B presents a CapabilityHeader + CapabilitySignature to access
   a resource. The header is on the stack; the signature is dereferenced from the slab.
2. Node B checks, in order (the signature check dominates first-time cost but is
   cached thereafter, so steady-state verification fails fast on the cheap checks):
   a. Signature valid? (verify with issuer_node's public key using the
      algorithm specified in sig_algorithm — Ed25519 ~25-50 μs depending on hardware, ML-DSA-65 ~110 μs)
      → done once, then cached for lifetime of capability (keyed by object_id + generation)
   b. Not expired? (compare expires_at_ns with current time)
      → ~10 ns (clock comparison)
   c. Generation still valid? (check local revocation list)
      → ~100 ns (hash table lookup)
   d. Permissions sufficient for requested operation?
      → ~10 ns (bitfield comparison)
   e. LSM policy check on receiving node?
      → Invoke security_cap_validate_remote(issuer_peer, object_id, permissions)
      → ~50-200 ns (LSM hook dispatch + policy evaluation)
      The LSM hook runs AFTER cryptographic and generation checks to avoid
      wasting LSM evaluation cycles on forged or revoked capabilities.
      The hook receives the issuer PeerId, the ObjectId, and the requested
      PermissionBits. The registered LSM module (SELinux/AppArmor/SMACK) can:
        - Deny based on the issuer peer's security label (e.g., untrusted peer).
        - Deny based on the target object's security context on this node.
        - Deny based on cross-node mandatory access control policy
          (e.g., SELinux MLS level dominance between cluster nodes).
      If the LSM denies: return CapError::LsmDenied. The capability is
      cryptographically valid but policy-rejected on this node.
   f. User namespace validation: If the capability header carries a `user_ns_id`
      field (non-zero), verify that the requesting process belongs to a user
      namespace that is either equal to or a descendant of the capability's
      originating user namespace. This prevents a capability issued in a
      container's user namespace from being used by processes in a sibling or
      ancestor namespace where the capability's authority should not apply.
      → Walk `current_task().nsproxy.user_ns` up via `.parent` and compare
        each level's `ns_id` against the capability's `user_ns_id`.
      → If no match found: return CapError::NamespaceMismatch.
      → ~20-60 ns (namespace depth typically 1-3 levels).
   g. SystemCaps cross-check on the receiving node: verify that the receiving
      node's own SystemCaps permit the operation the capability grants. A
      capability may be cryptographically valid and issued by a trusted peer,
      but the receiving node may lack the system-level authority to honor it
      (e.g., the receiving node is itself constrained by its cluster role or
      is inside a non-init user namespace). Specifically:
      → Map the capability's PermissionBits to the required SystemCaps
        (e.g., RDMA_REGISTER_MR → CAP_NET_ADMIN, DSM access → CAP_DSM_CREATE).
      → Check `current_node_syscaps().contains(required_caps)`.
      → If the receiving node lacks any required SystemCap: return
        CapError::ReceiverLacksSystemCap.
      → ~10-20 ns (bitfield comparison against the node's cached SystemCaps).
      This check runs after LSM (step 2e) and namespace validation (step 2f)
      because it is the cheapest and rarely fails on properly-configured nodes.
3. If all checks pass: operation is authorized.
   Total first-time verification: ~25-50 μs (Ed25519) or ~110 μs (ML-DSA-65).
   Subsequent verifications (signature cached, LSM cached): ~260-420 ns.

LSM hook specification for remote capability validation:

/// LSM hook invoked on the receiving node after cryptographic validation
/// of a distributed capability succeeds. This is the enforcement point
/// for mandatory access control policy on cross-node capability use.
///
/// Called from validate_distributed_cap() after steps 2a-2d pass.
/// The hook is NOT called for cached verifications where the LSM result
/// was already cached (the cache key includes the LSM policy sequence
/// number — cache entries are invalidated when LSM policy is reloaded).
///
/// # Arguments
///
/// * `issuer` — PeerId of the node that signed the capability.
/// * `object_id` — The target object identifier from the capability header.
/// * `permissions` — The permission bits being exercised.
/// * `requesting_cred` — Credentials of the local process requesting access.
///
/// # Returns
///
/// * `Ok(())` — LSM permits the operation.
/// * `Err(EACCES)` — LSM denies the operation (logged via audit subsystem).
///
/// # LSM module implementations
///
/// - **SELinux**: checks whether the issuer peer's security context
///   (mapped via the cluster peer label database) is allowed to provide
///   the requested permission class to the requesting process's domain.
///   Uses `avc_has_perm()` with a `SECCLASS_CAP_REMOTE` object class.
/// - **AppArmor**: checks the requesting process's profile for a
///   `cap_remote` permission rule matching the issuer peer and object type.
/// - **SMACK**: checks the issuer peer's SMACK label against the
///   requesting process's SMACK label using the configured access rules.
/// - **No LSM loaded**: hook is a no-op, returns `Ok(())`.
pub fn security_cap_validate_remote(
    issuer: PeerId,
    object_id: ObjectId,
    permissions: PermissionBits,
    requesting_cred: &Credentials,
) -> Result<(), Errno> {
    // Dispatch to registered LSM hook (compile-time static dispatch
    // via the LSM hook infrastructure, Section 9.3).
    lsm_hooks::cap_validate_remote(issuer, object_id, permissions, requesting_cred)
}

LSM result caching: To avoid repeated LSM hook invocations for the same capability on every use, the verification cache entry (keyed by object_id + generation) includes the LSM verdict and the LSM policy sequence number (security_policy_seqno, incremented on every selinux_load_policy() or apparmor_replace_profile() call). On cache hit, if cached_seqno == current_seqno, the cached LSM verdict is reused. If the sequence numbers differ, the LSM hook is re-invoked and the cache entry is updated. This ensures policy changes take effect within one capability verification cycle.
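A sketch of the seqno-guarded cache logic, with the LSM hook abstracted as a closure (cache key shape and all names are illustrative):

```rust
use std::collections::HashMap;

pub struct CachedVerdict {
    allowed: bool,
    lsm_seqno: u64, // security_policy_seqno at evaluation time
}

pub struct VerdictCache {
    pub entries: HashMap<(u64, u64), CachedVerdict>, // keyed by (object_id, generation)
}

impl VerdictCache {
    /// Return the LSM verdict, re-invoking the hook only when the cached
    /// entry predates the current policy sequence number.
    pub fn check(
        &mut self,
        key: (u64, u64),
        current_seqno: u64,
        mut reevaluate: impl FnMut() -> bool,
    ) -> bool {
        if let Some(c) = self.entries.get(&key) {
            if c.lsm_seqno == current_seqno {
                return c.allowed; // policy unchanged since last evaluation
            }
        }
        let allowed = reevaluate(); // cold entry, or policy was reloaded
        self.entries.insert(key, CachedVerdict { allowed, lsm_seqno: current_seqno });
        allowed
    }
}
```

Bumping the sequence number on every policy reload means stale verdicts are never served; they are simply re-derived on the next use of each capability.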

5.7.4 Revocation

Capability revocation in a distributed system is harder than on a single node because capabilities may be cached on remote nodes that are temporarily unreachable.

5.7.4.1 Three-Tier Revocation Model

UmkaOS uses a unified three-tier revocation model that covers both local and distributed capabilities. The tiers differ in latency, scope, and mechanism but share the same generation-based invalidation semantic: incrementing an object's generation counter instantly invalidates all capabilities with older generations.

| Tier | Name | Scope | Mechanism | Latency | Use Case |
|------|------|-------|-----------|---------|----------|
| 1 | Local-sync | Single node | Generation increment + epoch fence + in-flight drain | Microseconds (~1-10 μs) | Local capability revocation, credential change, resource teardown |
| 2 | Remote-urgent | Targeted remote nodes | Dedicated REVOKE_URGENT RDMA Send | Milliseconds (~0.1-5 ms) | Security-critical remote caps (credential caps, RDMA master caps, key compromise) |
| 3 | Remote-batched | All remote nodes | Heartbeat-piggybacked RevocationLog delta | Hundreds of milliseconds (~100-300 ms) | Non-critical remote caps (file handles, read-only resource caps, service bindings) |

5.7.4.1.1 Tier 1: Local-Sync Revocation

Local revocation uses the generation-based protocol from Section 9.1:

  1. Revoking code increments the object's generation in the capability table.
  2. All ValidatedCap tokens for that object become invalid on the next validate_cap() call (generation mismatch). In-flight operations holding a CapOperationGuard complete normally; drain() waits for them (bounded by CAP_DRAIN_TIMEOUT = 5ms).
  3. After drain, the capability is universally invalid on this node.
  4. Cost: O(1) per capability (no table scan), bounded drain wait.

This is the same protocol used by KABI dispatch revocation (Section 12.3).
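The generation-mismatch invariant of steps 1-2 in miniature; this sketch omits the epoch fence and the in-flight drain:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-object entry in the capability table (sketch).
pub struct ObjectEntry {
    generation: AtomicU64,
}

/// A capability snapshots the object generation at grant time.
pub struct Cap {
    object_generation: u64,
}

impl ObjectEntry {
    pub fn new() -> Self {
        ObjectEntry { generation: AtomicU64::new(0) }
    }

    pub fn grant(&self) -> Cap {
        Cap { object_generation: self.generation.load(Ordering::Acquire) }
    }

    /// Tier 1 revocation: an O(1) generation bump. Every previously granted
    /// Cap now fails validate_cap() with no table scan.
    pub fn revoke_all(&self) {
        self.generation.fetch_add(1, Ordering::AcqRel);
    }

    pub fn validate_cap(&self, cap: &Cap) -> bool {
        cap.object_generation == self.generation.load(Ordering::Acquire)
    }
}
```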

5.7.4.1.2 Tier 2: Remote-Urgent Revocation

For security-critical capabilities that must be invalidated across the cluster within milliseconds, the revoking node sends a dedicated REVOKE_URGENT message:

  1. Revoking node performs local-sync revocation (Tier 1) for the local copy.
  2. Revoking node sends REVOKE_URGENT { cap_id, new_generation, reason } via RDMA Send (reliable connected QP) to each node in the capability's grant log. This is a dedicated message, not piggybacked on a heartbeat.
  3. Each receiving node performs local-sync revocation for its cached copy and responds with CapRevokeAck { cap_id, new_generation }.
  4. Stale rejection: if a remote node presents an old-generation capability to the home node before receiving the REVOKE_URGENT, the home node rejects it immediately (the local generation was already incremented in step 1).
  5. Partition handling: if a node is unreachable, the REVOKE_URGENT message is queued in a durable per-node outbox. When the partition heals, queued revocations replay in order. During the partition, stale capabilities work for local cached reads but fail for any operation requiring home-node validation.

Latency: ~max_cluster_RTT (~2-5 μs on RDMA fabric, ~100 μs on TCP fallback). Scalability: O(N) where N = nodes holding capabilities for the specific object (typically 2-5), not the cluster size.

RDMA optimization: all REVOKE_URGENT messages use RDMA SEND over reliable connected (RC) queue pairs for guaranteed delivery and ordering.

Capabilities flagged CAP_FLAG_URGENT_REVOKE at creation time always use Tier 2. This flag is set for: credential capabilities, RDMA master capabilities, key revocation responses, and any capability whose CapConstraints includes urgent_revoke: true.

5.7.4.1.3 Tier 3: Remote-Batched Revocation

For non-critical capabilities where millisecond-level revocation latency is unnecessary, revocation notifications are batched into heartbeat messages:

  1. Revoking node performs local-sync revocation (Tier 1) for the local copy.
  2. Revoking node appends (cap_id, current_cluster_epoch) to its per-node RevocationLog — an RCU-protected ring buffer of (CapId, epoch) tuples (two u64s per entry) with a monotonically increasing write_seq. Default capacity is 65,536 entries, configurable at cluster join time via cluster.revocation_ring_capacity (range [4096, 1048576]). The capacity bounds the sustainable revocation rate: max_sustained_revocations_per_sec = ring_size / heartbeat_interval_sec. The FMA metric cluster.revocation_ring_util_pct tracks high-water-mark utilization.
/// One revocation record: (CapId, epoch). Two u64s = 16 bytes.
#[repr(C)]
pub struct RevocationEntry {
    pub cap_id: u64,
    pub epoch: u64,
}

/// RevocationLog is allocated via `vmalloc` at cluster join time (one
/// instance per cluster node — 1 MiB total, regardless of cluster size).
/// Never stack-allocated. Freed on cluster leave/shutdown.
/// Access: `write_seq` and entries written under the cluster heartbeat
/// timer; `peer_ack_seq` updated on per-peer ACK receipt. Ring entries
/// are read by the heartbeat delta sender (RCU reader).
pub struct RevocationLog {
    /// Ring buffer of (CapId, epoch) tuples. Capacity: 65,536 entries.
    /// 16 bytes per entry × 65,536 = 1 MiB per node.
    ring: [RevocationEntry; 65_536],
    /// Monotonically increasing write sequence number.
    /// Each append increments this. Used for per-peer delta tracking.
    write_seq: AtomicU64,
    /// Per-peer last-acknowledged sequence number. Keyed by PeerId.
    /// Each peer tracks how far it has consumed the ring.
    peer_ack_seq: XArray<AtomicU64>,
}
  3. On each cluster heartbeat (configurable, default 100 ms), the node sends each peer the delta: entries from peer_ack_seq[peer] to write_seq. Each peer independently tracks its position — fast peers get precise deltas, slow/partitioned peers may fall behind.

Overflow semantics (per-peer high-water mark): If a peer's ack_seq falls more than 65,536 entries behind write_seq (i.e., the ring has wrapped past that peer's position), the node cannot send a precise delta to that peer. Instead, it sends a FullRevocationEpoch flag in the heartbeat to that specific peer. The receiving peer invalidates ALL capabilities with creation_epoch < full_revocation_epoch and performs synchronous revalidation on next access. An FMA event CapRevocationRingWrap { peer_id } is emitted.

This design ensures:

  • Healthy, fast peers always receive precise deltas (no thundering herd).
  • Only the specific slow/partitioned peer gets the full-epoch revalidation.
  • Fixed memory (1 MiB per node, regardless of revocation rate).

  4. Receiving nodes: for each (cap_id, epoch) in the delta, mark the capability revoked in their local capability translation table. Any thread currently holding a ValidatedCap for this cap_id will find it invalid on the next validate_cap() call (epoch mismatch). After processing, the receiver sends an ack updating peer_ack_seq on the sender.
  5. validate_distributed_cap(cap) checks: (a) local revocation log O(1) by CapId hash, then (b) if cap.creation_epoch < node.revocation_epoch, synchronously queries the originating node to confirm validity (bounded by cluster RTT).

Bounded revocation latency: heartbeat_period + max_cluster_RTT. Default: ~200 ms. At 655,360 revocations/sec (ring capacity / heartbeat interval), the ring supports sustained revocation rates far exceeding any practical workload.
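The per-peer delta decision, including the wrap fallback, reduces to sequence arithmetic. A sketch with ring contents elided (only the payload selection is shown):

```rust
pub const RING_CAPACITY: u64 = 65_536;

#[derive(Debug, PartialEq)]
pub enum HeartbeatPayload {
    /// Precise delta: the peer replays revocation entries in (from_seq, to_seq].
    Delta { from_seq: u64, to_seq: u64 },
    /// Ring wrapped past this peer's position: the peer must invalidate all
    /// capabilities with creation_epoch < the carried epoch and revalidate
    /// on next access (FMA event CapRevocationRingWrap is emitted alongside).
    FullRevocationEpoch(u64),
}

pub fn delta_for_peer(write_seq: u64, peer_ack_seq: u64, current_epoch: u64) -> HeartbeatPayload {
    if write_seq - peer_ack_seq > RING_CAPACITY {
        // Precise history for this peer has been overwritten in the ring.
        HeartbeatPayload::FullRevocationEpoch(current_epoch)
    } else {
        HeartbeatPayload::Delta { from_seq: peer_ack_seq, to_seq: write_seq }
    }
}
```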

5.7.4.1.4 Tier Selection Logic
/// SystemCaps that grant credential-level privilege (identity impersonation,
/// namespace manipulation). Revocation must be urgent — stale credential caps
/// allow privilege escalation across nodes.
/// Note: CAP_SETUID/CAP_SETGID/CAP_SYS_ADMIN are system-administration
/// capabilities (SystemCaps), not fine-grained object permissions (PermissionBits).
/// CAP_SYS_ADMIN (Linux bit 21) controls namespace creation/manipulation
/// (setns, unshare) — there is no separate SET_NAMESPACE capability in Linux.
const CREDENTIAL_SYSCAPS: SystemCaps = SystemCaps::from_bits_truncate(
    SystemCaps::CAP_SETUID.bits()
    | SystemCaps::CAP_SETGID.bits()
    | SystemCaps::CAP_SYS_ADMIN.bits()
);

/// PermissionBits that grant RDMA master access (queue pair creation,
/// memory registration, direct hardware DMA). Revocation must be urgent —
/// stale RDMA caps allow unauthorized DMA to remote memory.
const RDMA_MASTER_PERMS: PermissionBits = PermissionBits::from_bits_truncate(
    PermissionBits::RDMA_REGISTER_MR.bits()
    | PermissionBits::RDMA_CREATE_QP.bits()
);

/// Determine which revocation tier to use for a distributed capability.
/// Called by revoke_distributed_cap() after local-sync (Tier 1) completes.
/// Checks both SystemCaps (credential-level privileges) and PermissionBits
/// (fine-grained object permissions) separately, since they are orthogonal.
fn select_revocation_tier(
    cap: &CapabilityHeader,
    issuer_cred: &Credentials,
) -> RevocationTier {
    if cap.constraints.urgent_revoke
        || issuer_cred.cap_effective.intersects(CREDENTIAL_SYSCAPS)
        || cap.permissions.intersects(RDMA_MASTER_PERMS)
    {
        RevocationTier::RemoteUrgent    // Tier 2
    } else {
        RevocationTier::RemoteBatched   // Tier 3
    }
}

#[repr(u8)]
pub enum RevocationTier {
    /// Tier 1: local generation increment + drain (always performed first).
    LocalSync     = 1,
    /// Tier 2: dedicated REVOKE_URGENT message per affected node.
    RemoteUrgent  = 2,
    /// Tier 3: batched in next heartbeat delta.
    RemoteBatched = 3,
}

Interaction with expiry: Distributed capabilities have bounded lifetimes (default: 5 minutes). Revocation (Tier 2 or Tier 3) is the fast path; expiry is the safety net — even if revocation messages are permanently lost, no stale capability survives beyond its expiry window.

Consistency guarantee: After Tier 2 completes (all CapRevokeAck received) or Tier 3 completes (next heartbeat round-trip), the capability is universally invalid. During the propagation window, stale capabilities may succeed for local cached reads but fail for any operation requiring home-node validation.
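Under the stated defaults (100 ms heartbeat; ~100 ms worst-case cluster RTT is assumed here), the per-tier invalidation bound can be expressed as a small helper. This is an illustrative model of the guarantee above, not kernel code:

```rust
#[derive(Clone, Copy)]
pub enum Tier {
    LocalSync,     // Tier 1
    RemoteUrgent,  // Tier 2
    RemoteBatched, // Tier 3
}

/// Worst-case time (ms) until a capability is universally invalid,
/// measured from the moment revocation is initiated.
pub fn max_invalidity_latency_ms(tier: Tier, heartbeat_ms: u64, max_rtt_ms: u64) -> u64 {
    match tier {
        // Local generation bump + drain completes before revoke() returns.
        Tier::LocalSync => 0,
        // REVOKE_URGENT messages go out in parallel; completion is bounded
        // by the slowest per-node round-trip (message + CapRevokeAck).
        Tier::RemoteUrgent => max_rtt_ms,
        // Wait for the next heartbeat delta, then its ack round-trip.
        Tier::RemoteBatched => heartbeat_ms + max_rtt_ms,
    }
}
```

With the defaults this reproduces the ~200 ms Tier 3 bound quoted in Section 5.7.4.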

5.7.5 Key Rotation and Revocation

Distributed capabilities are signed by the issuing peer's private key. Peers must be able to rotate signing keys (routine hygiene, algorithm migration) and revoke keys immediately (compromise response). This subsection specifies both mechanisms.

5.7.5.1 Key Store

Each peer maintains a KeyStore that holds its current signing key and a bounded set of recently-retired keys still within their grace period:

/// Per-peer signing key store. The `current` key is used for all new signatures.
/// Retired keys remain valid for verification during their grace period.
///
/// Max 4 retired keys — supports up to 4 rapid consecutive rotations without
/// unbounded growth. If a 5th rotation occurs before the oldest retired key
/// expires, the oldest is forcibly purged (capabilities signed by it become
/// unverifiable and must be re-issued).
pub struct KeyStore {
    /// Active signing key pair.
    pub current: KeyEpochEntry,
    /// Recently retired keys, ordered by epoch (oldest first).
    /// Entries are purged when `expires_at_ns` elapses or on explicit revocation.
    pub retired: ArrayVec<KeyEpochEntry, 4>,
}

/// A single epoch's key material.
pub struct KeyEpochEntry {
    /// Monotonically increasing epoch counter. Starts at 1 for the initial key.
    pub epoch: u64,
    /// Public key (Ed25519: 32 bytes, ML-DSA-65: 1,952 bytes, hybrid: 1,984 bytes).
    /// Private key is held only by the owning peer and never transmitted.
    pub pubkey: SigningPublicKey,
    /// Cluster-relative wall clock timestamp after which this key is no longer
    /// accepted for signature verification. Set to `u64::MAX` for the current key.
    pub expires_at_ns: u64,
}

5.7.5.2 CapabilityHeader Extension

The CapabilityHeader includes a signing_key_epoch: u64 field that identifies which epoch's key was used to sign the capability. Verifiers look up the corresponding public key in the issuer peer's KeyStore (current or retired) to verify the signature.
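A verifier-side lookup for this field might look like the following sketch. The types follow Section 5.7.5.1 but are simplified: a `Vec` stands in for the bounded 4-entry ArrayVec, and the pubkey is an opaque byte buffer:

```rust
pub struct KeyEpochEntry {
    pub epoch: u64,
    pub pubkey: Vec<u8>,    // 32-1,952 bytes depending on algorithm
    pub expires_at_ns: u64, // u64::MAX for the current key
}

pub struct KeyStore {
    pub current: KeyEpochEntry,
    pub retired: Vec<KeyEpochEntry>, // bounded at 4 in the real kernel
}

/// Resolve a CapabilityHeader's `signing_key_epoch` to a public key.
/// Retired keys are only usable while inside their grace period.
pub fn lookup_verification_key<'a>(
    ks: &'a KeyStore,
    signing_key_epoch: u64,
    now_ns: u64,
) -> Option<&'a [u8]> {
    if ks.current.epoch == signing_key_epoch {
        return Some(&ks.current.pubkey);
    }
    ks.retired
        .iter()
        .find(|e| e.epoch == signing_key_epoch && now_ns < e.expires_at_ns)
        .map(|e| e.pubkey.as_slice())
}
```

A `None` result is what forces the re-signing path: the capability's signature can no longer be checked, so the issuer must re-issue it under the current epoch.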

5.7.5.3 Rotation Protocol

Routine key rotation follows a graceful overlap protocol that avoids invalidating in-flight capabilities:

  1. Generate: Peer generates a new key pair and increments its epoch counter.
  2. Announce: Peer broadcasts a KeyRotateMsg via the cluster heartbeat extension:
/// Cluster heartbeat extension for key rotation announcements.
/// Delivered reliably via the heartbeat protocol (Section 5.2.7).
/// Wire struct — size depends on SigningPublicKey (algorithm-specific: 32-1984 bytes).
/// const_assert deferred until SigningPublicKey is monomorphized per algorithm.
// kernel-internal, not KABI
#[repr(C)]
pub struct KeyRotateMsg {
    /// Peer performing the rotation.
    pub peer_id: Le64,             // PeerId
    /// New epoch number (must be exactly `old_epoch + 1`).
    pub new_epoch: Le64,
    /// Public key for the new epoch.
    pub new_pubkey: SigningPublicKey,
    /// Cluster-relative wall clock timestamp when the old epoch's key
    /// stops being accepted for verification.
    pub old_epoch_expires_at: Le64,
}
  3. Grace period: The old key is moved to retired with expires_at_ns set to old_epoch_expires_at. Default grace period: KEY_GRACE_PERIOD = 300 seconds (5 minutes, matching the default capability lifetime). Capabilities signed with the old key remain verifiable during this window.
  4. Purge: After old_epoch_expires_at elapses, the retired key entry is removed from the KeyStore. Any capability still bearing that epoch's signature fails verification — the issuer must re-sign it with the current key if still valid.

Consistency: Remote peers update their cached copy of the issuer's KeyStore upon receiving KeyRotateMsg. If a peer missed the announcement (partition), it queries the issuer's current key set on the next capability verification failure (bounded by cluster RTT).
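The rotating peer's own KeyStore transition for steps 1-4 can be sketched like this (simplified types: a `Vec` bounded by hand instead of the real ArrayVec; not the kernel's actual signatures):

```rust
pub struct KeyEpochEntry {
    pub epoch: u64,
    pub pubkey: Vec<u8>,
    pub expires_at_ns: u64,
}

pub struct KeyStore {
    pub current: KeyEpochEntry,
    pub retired: Vec<KeyEpochEntry>,
}

const MAX_RETIRED: usize = 4;

/// Retire the current key with the announced expiry, install the new
/// epoch (exactly old_epoch + 1), purge expired entries, and enforce
/// the 4-entry bound (oldest forcibly purged — its capabilities must
/// be re-issued).
pub fn rotate(ks: &mut KeyStore, new_pubkey: Vec<u8>, old_epoch_expires_at: u64, now_ns: u64) {
    let new_epoch = ks.current.epoch + 1;
    let mut old = std::mem::replace(
        &mut ks.current,
        KeyEpochEntry { epoch: new_epoch, pubkey: new_pubkey, expires_at_ns: u64::MAX },
    );
    old.expires_at_ns = old_epoch_expires_at;
    ks.retired.push(old);
    ks.retired.retain(|e| now_ns < e.expires_at_ns);
    while ks.retired.len() > MAX_RETIRED {
        ks.retired.remove(0);
    }
}
```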

5.7.5.4 Immediate Key Revocation (Compromise Response)

When a signing key is believed compromised, the peer (or a cluster administrator with CAP_CLUSTER_ADMIN) broadcasts a KeyRevokeMsg:

/// Immediate key revocation — no grace period.
/// Delivered via REVOKE_URGENT cluster message (same path as CAP_FLAG_URGENT_REVOKE).
/// Wire struct — contains CapabilitySignature (3588 bytes). Total: 8+8+3588 = 3604 bytes.
///
/// **Intentionally fixed-size** despite the Ed25519 overhead (only 64 of 3588 bytes
/// used). Key revocation is a rare emergency operation (compromise response, perhaps
/// once per year) where wire parsing simplicity is more valuable than bandwidth
/// optimization. If PQC signatures (ML-DSA-65, 3,309 bytes) become the default,
/// the fixed-size struct avoids a variable-length parsing vulnerability on the
/// compromise response path. Exceeds RDMA inline send limits (typically 64-256
/// bytes), so this message uses a standard RDMA Send (posted to send queue).
#[repr(C)]
pub struct KeyRevokeMsg {
    /// Peer whose key is revoked.
    pub peer_id: Le64,             // PeerId
    /// Epoch of the compromised key.
    pub revoked_epoch: Le64,
    /// Signature over this message using a key with epoch > revoked_epoch,
    /// proving the revoker controls a newer key. If the current key itself
    /// is compromised, the cluster administrator signs with the cluster
    /// root key (Section 9.2).
    pub proof_sig: CapabilitySignature,
}
// Wire format: peer_id(8) + revoked_epoch(8) + proof_sig(3588) = 3604 bytes.
const_assert!(core::mem::size_of::<KeyRevokeMsg>() == 3604);

Semantics: Upon receiving KeyRevokeMsg, all peers immediately remove the revoked epoch from their cached KeyStore for that peer. No grace period — the key is invalid as of the message receipt time. All capabilities signed with the revoked key are immediately unverifiable. The issuing peer must re-derive and re-sign any still-valid capabilities using its current (non-compromised) key.

Interaction with capability expiry: Even without explicit re-signing, all affected capabilities expire naturally within their bounded lifetime (default 5 minutes). Revocation eliminates the exposure window; expiry is the safety net.
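Receiver-side handling of KeyRevokeMsg can be sketched as follows. The cached key set is modeled as plain (epoch, pubkey) pairs and the cryptographic verification of `proof_sig` is stubbed out — real code must verify the signature against the newer-epoch key (or the cluster root key) before applying the revocation:

```rust
/// Cached view of a remote peer's key set: (epoch, pubkey).
pub type CachedKeys = Vec<(u64, Vec<u8>)>;

/// Apply a KeyRevokeMsg to the cached key set for the named peer.
/// Returns false (and changes nothing) unless the proof is anchored in
/// a strictly newer epoch than the one being revoked.
pub fn apply_key_revoke(keys: &mut CachedKeys, revoked_epoch: u64, proof_epoch: u64) -> bool {
    if proof_epoch <= revoked_epoch {
        return false; // cannot revoke with an equal or older key
    }
    // No grace period: the epoch is invalid as of message receipt.
    keys.retain(|(epoch, _)| *epoch != revoked_epoch);
    true
}
```

The strictly-newer-epoch rule is what prevents an attacker holding only the compromised key from revoking its replacement.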

5.7.6 Use Case: Remote GPU Access

Process on Node A wants to submit work to GPU on Node B:

1. Process has local capability: ACCEL_COMPUTE for GPU on Node B.
2. Kernel derives DistributedCapability, signs with Node A's key.
3. Kernel sends command submission + capability to Node B via RDMA.
4. Node B's kernel verifies capability:
   - Signature valid (Node A's key)
   - Not expired
   - Not revoked
   - Has ACCEL_COMPUTE permission for this GPU
5. Node B's AccelScheduler accepts the submission.
6. Completion notification sent back to Node A via RDMA.

The process on Node A uses the same AccelContext API
as for a local GPU. The kernel handles the distribution.

5.7.7 Distributed Device Fabric

5.7.7.1 Remote Device Access — Capability Provider Taxonomy

Every cluster-accessible device is represented by a capability service provider — an entity that advertises services via CapAdvertise (Section 5.1), accepts connections via ServiceBind, and serves requests through per-service wire protocols. The mechanism is the same regardless of who implements the provider:

Capability service provider taxonomy:

Device-native provider (Path A/B):
  Consumer → peer protocol → Device peer (firmware shim or full kernel)
  Device sends CapAdvertise. No host driver. The device IS a peer —
  directly addressable via the peer protocol and topology graph
  (Section 5.2.9). Lowest latency, no host CPU involvement.

Host-proxy provider (Path C):
  Consumer → peer protocol → Host subsystem → KABI driver → Device
  Host runs a KABI driver for the device. The host's subsystem layer
  (block, VFS, accel) wraps the driver-managed device as a
  PeerServiceEndpoint and sends CapAdvertise on behalf of the device.
  Day-one cluster accessibility without firmware changes.

Host-native provider:
  Consumer → peer protocol → Host subsystem (no backing device)
  Host provides a service from its own resources — e.g., exporting a
  locally-mounted filesystem, sharing host DRAM as a DSM region.
  The host sends CapAdvertise for a service backed by host resources.

All three: CapAdvertise → ServiceBind → per-service wire protocol.
Consumer cannot distinguish which provider type it is talking to.

ServiceBind validation flow: When a ServiceBind request arrives, the receiving node validates it through a four-step gate before accepting the binding:

  1. PeerCapFlags check: The requesting node's PeerCapFlags (exchanged during cluster join and cached in PeerState) must include the service-specific flag for the requested service (e.g., PEER_CAP_BLOCK for block service, PEER_CAP_NET for network service, PEER_CAP_ACCEL for accelerator service). If the flag is absent, the request is rejected — the peer has not advertised the required capability class.

  2. Namespace trust policy: The CapabilityHeader.user_ns_id carried in the ServiceBind message is validated against the receiver's namespace trust policy. Capabilities scoped to init_user_ns (user_ns_id = 0) are accepted from all peers in the trust domain. Capabilities scoped to a non-init user namespace are accepted only from same-cluster peers (peers sharing the same cluster_id in PeerState). This prevents namespace-confused privilege escalation across administrative boundaries.

  3. Concurrency limit: The service provider's max_concurrent_requests limit is checked. If the current active binding count for this service equals or exceeds the limit, the request is deferred (queued) or rejected depending on the provider's backpressure policy.

  4. LSM hook: lsm_check_service_bind(service_id, peer_id, cap_header) is called. The LSM module (SELinux, AppArmor, or custom) may deny the binding based on security labels, peer identity, or service-specific policy rules.

If any step fails, the receiver sends ServiceBindNack { error_code, step } back to the requester, identifying which validation step failed and the specific error code (e.g., EPERM for LSM denial, EACCES for namespace mismatch, EBUSY for concurrency limit, ENOTSUP for missing PeerCapFlags).
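The four-step gate can be sketched as a single validation function. The flag constant, errno values, request/service structs, and the stubbed LSM hook are illustrative stand-ins, not the real kernel definitions:

```rust
// Illustrative constants — the real flag and errno definitions live elsewhere.
pub const PEER_CAP_BLOCK: u32 = 1 << 0;

pub const EPERM: i32 = 1;
pub const EACCES: i32 = 13;
pub const EBUSY: i32 = 16;
pub const ENOTSUP: i32 = 95;

pub struct BindRequest {
    pub peer_cap_flags: u32, // requester's PeerCapFlags from PeerState
    pub required_flag: u32,  // service-specific flag, e.g. PEER_CAP_BLOCK
    pub user_ns_id: u64,     // from CapabilityHeader (0 = init_user_ns)
    pub same_cluster: bool,  // requester shares our cluster_id
}

pub struct ServiceState {
    pub active_bindings: u32,
    pub max_concurrent_requests: u32,
}

/// Stubbed LSM hook — a real module consults labels and policy rules.
fn lsm_check_service_bind(_req: &BindRequest) -> bool {
    true
}

/// Returns Ok(()) to accept, or Err((failed_step, errno)) which maps
/// directly onto ServiceBindNack { error_code, step }.
pub fn validate_service_bind(req: &BindRequest, svc: &ServiceState) -> Result<(), (u8, i32)> {
    if req.peer_cap_flags & req.required_flag == 0 {
        return Err((1, ENOTSUP)); // step 1: PeerCapFlags
    }
    if req.user_ns_id != 0 && !req.same_cluster {
        return Err((2, EACCES)); // step 2: namespace trust policy
    }
    if svc.active_bindings >= svc.max_concurrent_requests {
        return Err((3, EBUSY)); // step 3: concurrency limit
    }
    if !lsm_check_service_bind(req) {
        return Err((4, EPERM)); // step 4: LSM hook
    }
    Ok(())
}
```

Ordering matters: the cheap bitmask check runs first, and the (potentially policy-heavy) LSM hook runs last, only for requests that already passed the structural gates.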

Namespace ID translation for cross-node validation: The user_ns_id in CapabilityHeader is local to each node. For cross-node capability validation, the verifying node maps the capability's user_ns_id to its own namespace hierarchy via the cluster namespace registry — a replicated XArray mapping (origin_node, remote_ns_id) to local_ns_id. If no mapping exists, the capability is treated as belonging to a foreign namespace with no local privileges. The registry is updated when nodes join the cluster and exchange namespace topology. Consistency: eventual (heartbeat-piggybacked deltas, same as capability revocation batching Section 5.8).

A device instance either runs firmware speaking umka-protocol (device-native provider) or is managed by a host KABI driver (host-proxy provider). Same hardware model can ship either way — it is a firmware/vendor decision. The vendor can implement both and let operators choose; once device-native mode is proven sufficient, the host-proxy path can be dropped. Host-proxy providers ensure day-one cluster accessibility for devices with traditional firmware.

5.7.7.2 Capability Service Providers

Host-proxy and host-native providers use subsystem-level service endpoints to make local resources available to cluster peers. Each subsystem (block layer, VFS, accelerator framework) implements one PeerServiceEndpoint handler that wraps its local abstraction and serves it via the standard peer protocol.

Host-proxy service provider model:
  Host A: Driver → block device (local)
          Block service provider → CapAdvertise → peer protocol → cluster
  Host B: peer protocol → block service client → "block device on Host A"
          (same block device interface — consumer doesn't know it's remote)

Why subsystem-level service providers, not driver-level proxy:

  • No per-driver proxy code. One service provider per subsystem (block, VFS, accel), not per-driver. Adding a new NVMe driver doesn't require writing a matching proxy.
  • Right abstraction level. The consumer sees a block device, a filesystem, or an accelerator context — not a driver vtable. The service provider can coalesce I/O, enforce quotas, and cache hot data.
  • No KABI vtable forwarding across the network. KABI vtables are designed for local, synchronous calls with sub-microsecond latency. Forwarding them across RDMA (3-5 μs RTT) would turn every vtable call into a remote procedure call — a performance disaster for high-frequency operations like block I/O completion.

Subsystem service providers:

| Subsystem | Service Provider | Specification | Service Provided |
|---|---|---|---|
| Block layer | Block I/O forwarding over peer protocol | Section 15.13 | Block devices (NVMe, SCSI, virtio-blk) |
| VFS | File operation forwarding over peer protocol | Section 14.11 | Mounted filesystems, individual mounts |
| Accelerator framework | Command buffer submission over peer protocol | Section 22.7 | GPU contexts, FPGA slots, inference engines |
| Network | Packet forwarding over peer protocol | Section 16.31 | External network interfaces (L2/L3 gateway, RDMA proxy) |
| Serial/TTY | Byte-stream forwarding over peer protocol | Section 21.1 | Serial ports (management consoles, industrial I/O) |
| USB | URB-level forwarding over peer protocol | Section 13.29 | Any USB device (FIDO2, smartcard, HID, storage, audio) |
| TPM | TPM2 command forwarding over peer protocol | Section 9.4 | Attestation, key sealing, hardware RNG |

Each subsystem service provider is ~1-2K lines. Access is capability-gated: CAP_BLOCK_REMOTE, CAP_FS_REMOTE, CAP_ACCEL_REMOTE, or CAP_NET_REMOTE (see Section 9.1).

For device-native providers (firmware shim or full kernel), host-proxy service providers are unnecessary — the device already advertises its capabilities directly via CapAdvertise and is addressable in the PeerRegistry (Section 5.2.9.1).

Coexistence: the same device class can be a device-native provider on one host (new firmware) and served by a host-proxy provider on another (old firmware). No conflict — different instances with different capabilities advertised in the registry. Over time, vendors ship shim firmware, those devices become device-native providers, and the host-proxy path becomes unused for that device class.

5.7.7.2.1 Custom Service Endpoints

The three built-in subsystem service providers (block, VFS, accel) cover common device classes. However, UPFS or another specialized subsystem may need FS-aware server-side logic that goes beyond generic block I/O — for example, FS-aware prefetching, stripe-aware read-ahead, or metadata-aware write coalescing.

For such cases, the peer protocol allows subsystems to register custom service endpoints on the peer protocol's data path:

/// A custom service registered on the peer protocol.
/// The service receives raw messages on its own RDMA queue pairs and
/// implements its own wire protocol. The peer protocol provides:
/// - Connection management (queue pair setup/teardown)
/// - Authentication (capability check at connection time)
/// - Discovery (via PeerRegistry capability advertisement)
/// - Failure detection (via heartbeat)
///
/// The service provides:
/// - Wire protocol (message format, request/response semantics)
/// - Server-side processing logic
/// - Client-side library for submitting requests
pub struct PeerServiceEndpoint {
    /// Service name (e.g., "upfs-data", "upfs-metadata").
    pub name: ServiceName,
    /// Capability required to connect (registered with the capability system).
    pub required_cap: CapabilityType,
    /// RDMA queue pairs allocated for this service.
    /// Bounded: max `MAX_SERVICE_QUEUES` (32) per endpoint — one per NUMA node
    /// in a large system. Allocated at service registration time (warm path).
    pub queues: ArrayVec<RdmaQueuePair, MAX_SERVICE_QUEUES>,
    /// Handler called for each incoming message.
    pub handler: &'static dyn PeerServiceHandler,
}

/// Trait implemented by custom service handlers.
pub trait PeerServiceHandler: Send + Sync {
    /// Process an incoming message from a connected client.
    /// Returns a response to send back.
    fn handle(&self, client: PeerId, msg: &[u8]) -> ServiceResponse;
    /// Called when a client connects.
    fn on_connect(&self, client: PeerId);
    /// Called when a client disconnects or is declared Dead.
    fn on_disconnect(&self, client: PeerId);
}

This enables a GPFS-class filesystem to implement its own data service (analogous to GPFS's NSD or Lustre's OSS) that speaks a FS-aware protocol over the peer protocol's RDMA infrastructure. The custom service benefits from all cluster infrastructure (membership, heartbeat, topology, capabilities) without being constrained by the generic block export's abstraction level.

The built-in block, VFS, and accel exports are themselves implemented as PeerServiceEndpoint handlers — they are not special-cased in the peer protocol.
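As a shape illustration, here is a hypothetical minimal service against a simplified version of the handler trait: an owned `Vec<u8>` response stands in for ServiceResponse, a plain `u64` for PeerId, and the RDMA plumbing is omitted. None of this is the real kernel API:

```rust
pub type PeerId = u64;

/// Simplified variant of PeerServiceHandler for illustration.
pub trait PeerServiceHandler: Send + Sync {
    fn handle(&self, client: PeerId, msg: &[u8]) -> Vec<u8>;
    fn on_connect(&self, _client: PeerId) {}
    fn on_disconnect(&self, _client: PeerId) {}
}

/// Hypothetical echo service: replies with the request prefixed.
/// A real service ("upfs-data", "upfs-metadata") would parse its own
/// wire protocol here and track per-client state in on_connect /
/// on_disconnect.
pub struct EchoService;

impl PeerServiceHandler for EchoService {
    fn handle(&self, _client: PeerId, msg: &[u8]) -> Vec<u8> {
        let mut out = b"echo:".to_vec();
        out.extend_from_slice(msg);
        out
    }
}
```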

5.7.7.3 GPUDirect RDMA Across Nodes

For GPU-to-GPU communication across nodes (essential for distributed training):

Current state (NCCL on Linux):
  GPU 0 (Node A) → PCIe → CPU RAM (Node A) → RDMA NIC → Network →
  → RDMA NIC → CPU RAM (Node B) → PCIe → GPU 0 (Node B)
  Copies: 2 (GPU→CPU, CPU→GPU). CPU involvement: yes.

With GPUDirect RDMA (supported by Mellanox NICs + NVIDIA GPUs):
  GPU 0 (Node A) → PCIe → RDMA NIC → Network →
  → RDMA NIC → PCIe → GPU 0 (Node B)
  Copies: 0. CPU involvement: none.

UmkaOS integration:
  - P2P DMA (Section 22.2) handles local GPU↔NIC path
  - ClusterTransport handles the RDMA/CXL/TCP portion (per-peer transport binding)
  - RdmaDeviceVTable.register_device_mr() (Section 22.5.1.3) registers GPU VRAM for RDMA
  - Combined path: GPU→NIC→Network→NIC→GPU, zero CPU copies

The kernel manages the IOMMU mappings on both ends, ensuring that:
  - GPU VRAM is registered as an RDMA memory region
  - RDMA NIC has IOMMU permission to DMA to/from GPU BAR
  - Remote node's RDMA NIC has permission via remote rkey
  - Capability system authorizes the cross-node GPU-to-GPU transfer
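The four conditions above can be modeled as a gate that must fully hold before a cross-node GPU-to-GPU transfer is posted. This is purely illustrative — the real checks live in the MR registration, IOMMU, and capability subsystems:

```rust
/// Preconditions for a cross-node GPUDirect RDMA transfer (illustrative).
pub struct GpuDirectPath {
    pub vram_registered_as_mr: bool,      // GPU VRAM registered as RDMA MR
    pub local_iommu_grants_nic_dma: bool, // NIC may DMA to/from the GPU BAR
    pub remote_rkey_valid: bool,          // remote NIC holds a valid rkey
    pub capability_authorized: bool,      // cross-node transfer authorized
}

/// All four must hold; a missing rkey or capability means the transfer
/// is never posted, regardless of hardware support.
pub fn can_post_gpu_transfer(p: &GpuDirectPath) -> bool {
    p.vram_registered_as_mr
        && p.local_iommu_grants_nic_dma
        && p.remote_rkey_valid
        && p.capability_authorized
}
```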

5.8 Failure Handling and Distributed Recovery

5.8.1 Split-Brain Detection and Recovery

5.8.1.1 Failure Model

Distributed systems have failure modes that single-machine kernels don't:

| Failure | Detection | Recovery |
|---|---|---|
| Node crash (power loss) | Heartbeat timeout (Suspect at 300ms, Dead at 1000ms per Section 5.8) | Reclaim resources, invalidate capabilities |
| Network partition | Heartbeat timeout + asymmetric reachability | Split-brain protocol (Section 5.8) |
| RDMA NIC failure | Link-down event + failed RDMA ops | Fallback to TCP or isolate node |
| Slow node (Byzantine) | Heartbeat latency spike | Mark suspect, reduce trust |
| Storage failure | I/O error from block driver | FMA-managed (Section 20.1) |

5.8.1.2 Heartbeat Protocol

A peer heartbeats only its direct neighbors in the topology graph (Section 5.2.9.2), not every peer in the cluster. This keeps heartbeat traffic proportional to the number of physical links, not the square of the cluster size.

// umka-core/src/distributed/heartbeat.rs

pub struct HeartbeatConfig {
    /// Heartbeat interval.
    /// Default: 100ms (10 heartbeats/sec).
    pub interval_ms: u32,

    /// Miss count before marking peer Suspect.
    /// Default: 3 (300ms of silence → Suspect).
    pub suspect_threshold: u32,

    /// Miss count before marking peer Dead.
    /// Default: 10 (1000ms of silence → Dead).
    pub dead_threshold: u32,

    /// Heartbeat transport mode. Selected automatically based on peer
    /// capabilities; override via sysctl `cluster.heartbeat.transport`.
    pub transport: HeartbeatTransport,
}

/// Heartbeat transport mode. Determines how heartbeat messages are
/// delivered to each peer. Selection follows standard transport priority:
/// CxlMmio > RdmaUd > TcpUnicast. Each peer may use a different
/// transport depending on connectivity.
#[repr(u32)]
pub enum HeartbeatTransport {
    /// RDMA Unreliable Datagram (UD) Send. Two-sided (requires remote CPU
    /// to process — proof of liveness). Supports multicast for O(1)
    /// delivery to multiple neighbors. Default for RDMA-connected peers.
    /// Latency: ~1-3 μs.
    RdmaUd     = 0,

    /// TCP unicast heartbeat to each peer individually via
    /// `transport.send_reliable()`. Used when RDMA is unavailable.
    /// O(peers) messages per heartbeat interval. Latency: ~50-200 μs.
    /// Recommended defaults for TCP: interval_ms=500, suspect_threshold=3,
    /// dead_threshold=10.
    TcpUnicast = 1,

    /// CXL shared memory polling. The heartbeat sender writes to a
    /// pre-agreed MMIO location on the CXL fabric; the receiver polls
    /// the location. Lowest latency (~0.1-0.3 μs), suitable for
    /// CXL-attached peers on the same fabric.
    CxlMmio    = 2,
}

// The heartbeat sender and receiver threads run at SCHED_FIFO priority
// (configurable, default priority 50) to avoid false suspect transitions
// caused by CPU saturation delaying heartbeat processing. For non-RDMA
// (TCP) clusters where network latency is higher and more variable, the
// recommended defaults are: interval_ms=500, suspect_threshold=3 (1500ms),
// dead_threshold=10 (5000ms).

/// Heartbeat message (sent via RDMA Send or PCIe doorbell, 44 bytes).
///
/// Heartbeat's only job is proving liveness. It does NOT carry membership
/// state — membership is exchanged separately via the PeerRegistry gossip
/// protocol (Section 5.2.9.1). The `membership_gen` field enables
/// lightweight staleness detection: if the receiver's registry generation
/// is behind, it triggers a delta sync from the sender.
#[repr(C)]
pub struct HeartbeatMessage {
    /// Sender's PeerId.
    pub peer_id: Le64,            // 8 bytes  (PeerId)
    /// Monotonic generation (incremented on peer restart).
    /// If generation changes, it means the peer rebooted — treat as a
    /// new logical peer (old state is invalid). Le64 matches
    /// `ClusterNode.heartbeat_generation` — no narrowing on the wire.
    pub generation: Le64,         // 8 bytes
    /// Sender's current timestamp in nanoseconds (for clock skew
    /// estimation and RTT measurement — heartbeat RTT is a free
    /// continuous latency signal fed to the topology graph).
    pub timestamp_ns: Le64,       // 8 bytes
    /// Sender's load summary (for cluster scheduler).
    /// CPU utilization 0-100 (percent, u8 sufficient for 0-100 range).
    pub cpu_percent: u8,          // 1 byte
    /// Memory pressure: 0-255 scale (0=idle, 255=OOM imminent).
    /// Finer granularity than percentage for pressure gradient detection.
    /// **Conversion to PeerLoad**: `peer_load.memory_pressure = (hb.memory_pressure as u32 * 100) / 255`.
    pub memory_pressure: u8,      // 1 byte
    /// Accelerator utilization 0-100 (percent).
    pub accel_percent: u8,        // 1 byte
    /// Reserved for future use.
    pub _reserved: u8,            // 1 byte
    /// Sender's PeerRegistry generation counter. If the receiver's
    /// registry generation is lower, it knows its membership view is
    /// stale and should request a delta sync.
    pub membership_gen: Le64,     // 8 bytes
    /// Number of runnable tasks on this peer (for cluster scheduler
    /// load balancing). Transmitted here to avoid a separate exchange
    /// mechanism; PeerLoad.runnable_count is populated from this field.
    pub runnable_count: Le32,     // 4 bytes
    /// Remote page faults per second (for DSM data locality scoring).
    /// PeerLoad.remote_fault_rate is populated from this field.
    pub remote_fault_rate: Le32,  // 4 bytes
}
// Total: 8 + 8 + 8 + 1 + 1 + 1 + 1 + 8 + 4 + 4 = 44 bytes.
// Le64/Le32 are byte-array-backed (alignment 1), so no implicit padding.
// 44 bytes is sufficient for RDMA alignment (RDMA Send payloads need
// only 4-byte alignment for most HCAs).
const_assert!(core::mem::size_of::<HeartbeatMessage>() == 44);
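The Suspect/Dead thresholds in HeartbeatConfig imply a simple per-neighbor classifier, sketched here with illustrative names:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum PeerLiveness {
    Alive,
    Suspect,
    Dead,
}

/// Classify a neighbor from its consecutive missed-heartbeat count,
/// using HeartbeatConfig thresholds (defaults: 3 → Suspect, 10 → Dead).
pub fn classify(missed: u32, suspect_threshold: u32, dead_threshold: u32) -> PeerLiveness {
    if missed >= dead_threshold {
        PeerLiveness::Dead
    } else if missed >= suspect_threshold {
        PeerLiveness::Suspect
    } else {
        PeerLiveness::Alive
    }
}
```

With the default 100 ms interval, 3 misses corresponds to 300 ms of silence and 10 misses to 1000 ms, matching the detection latencies quoted in Section 5.8.1.1; any received heartbeat resets the miss counter to zero.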

Neighbor-only heartbeat model:

  • A peer heartbeats only its direct neighbors in the topology graph.
  • On-host devices (NVMe, GPU, SAS controllers running firmware shim) heartbeat their host kernel only, via PCIe doorbell — effectively free, no network traffic consumed.
  • Host kernels heartbeat other host kernels via RDMA Send (actual network traffic, but only to direct neighbors).
  • A DPU or smart NIC with its own network port heartbeats independently of its host — it is a separate peer with its own neighbor set.

Traffic analysis (10-host cluster, 8 devices per host):

| Link Type | Count | Transport | Network Cost |
|---|---|---|---|
| Host-to-host | 10 × 4 = 40 (avg 4 neighbors each) | RDMA Send | 40 × 44B = 1.76 KB per interval |
| Device-to-host | 80 | PCIe doorbell | Zero network traffic |
| Total network | 40 | RDMA Send | 1.76 KB / 100ms ≈ 17.6 KB/s |

Negligible. A 100 Gbps RDMA fabric carries this without measurable impact.

TCP cluster traffic analysis (neighbor-only model, avg 4 neighbors per node):

| Cluster Size | Messages/Interval (cluster-wide) | Per-Node Send Rate | Bandwidth (cluster-wide) |
|---|---|---|---|
| 10 nodes | 40 (10 × 4 neighbors) | 4 msgs / 500ms = 8/sec | 40 × 44B = 1.76 KB / 500ms ≈ 3.5 KB/s |
| 50 nodes | 200 (50 × 4 neighbors) | 4 msgs / 500ms = 8/sec | 200 × 44B = 8.8 KB / 500ms ≈ 17.6 KB/s |
| 100 nodes | 400 (100 × 4 neighbors) | 4 msgs / 500ms = 8/sec | 400 × 44B = 17.6 KB / 500ms ≈ 35.2 KB/s |
| 1024 nodes | 4096 (1024 × 4 neighbors) | 4 msgs / 500ms = 8/sec | 4096 × 44B = 180 KB / 500ms ≈ 360 KB/s |

The neighbor-only model keeps the per-node send rate constant at O(neighbors) regardless of cluster size. Even at 1024 nodes, cluster-wide heartbeat bandwidth is ~360 KB/s, trivial on any network. This contrasts with an all-to-all model, which would produce O(N^2) messages: 1024 nodes × 1023 peers = ~1M messages per interval, consuming ~90 MB/s of payload bandwidth and exceeding 1 Gbps link capacity once per-message TCP framing overhead is added.

Scalability recommendation: TCP clusters up to 1024 nodes are well-supported by the neighbor-only heartbeat model with recommended TCP defaults (interval_ms=500, suspect_threshold=3, dead_threshold=10). Failure detection latency on TCP is bounded by the gossip propagation diameter of the topology graph (typically 3-5 hops for well-connected clusters), so a node failure is detected cluster-wide within ~1.5-2.5 seconds on TCP (3 hops x 500ms interval).
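The traffic figures can be recomputed from first principles using the 44-byte wire size asserted for HeartbeatMessage; this helper is purely illustrative:

```rust
/// Cluster-wide heartbeat bandwidth (bytes/sec) for the neighbor-only
/// model: every node sends one message per neighbor per interval.
pub fn heartbeat_bytes_per_sec(
    nodes: u64,
    avg_neighbors: u64,
    msg_bytes: u64,
    interval_ms: u64,
) -> u64 {
    nodes * avg_neighbors * msg_bytes * 1000 / interval_ms
}
```

For the 10-host RDMA cluster (100 ms interval) this yields ~17.6 KB/s; for a 1024-node TCP cluster (500 ms interval), ~360 KB/s.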

Failure inference from topology: if Host A stops heartbeating, all peers whose only path in the topology graph goes through Host A (its on-host devices) are presumed dead. This falls out naturally from the topology — no separate "device failure" protocol needed. A device with an independent network path (DPU, smart NIC) is NOT presumed dead if its host fails; it heartbeats independently and its liveness is assessed separately.

5.8.1.3 On-Host Tier M Peer Heartbeat Format

On-host Tier M peers (Section 11.1) use bus-specific mechanisms for heartbeat delivery. The message content is the same 44-byte HeartbeatMessage regardless of bus — only the physical delivery differs:

| Bus | Host → device | Device → host |
|---|---|---|
| PCIe | MMIO write to BAR0 + PEER_DOORBELL_OFFSET (0x100), value = 32-bit sequence counter | DMA write to pre-agreed host memory (HeartbeatMessage in control ring pair) |
| CXL | Store to shared-memory doorbell region | Store to shared memory (hardware-coherent) |
| s390x | Channel I/O SIGA on QDIO output queue (heartbeat SBAL entry) | I/O interrupt with subchannel status word |
| USB | Control transfer SET_FEATURE(HEARTBEAT) with seq counter | Interrupt IN endpoint (HeartbeatMessage) |
| virtio | Virtqueue kick (notify) on control virtqueue | Used buffer with HeartbeatMessage payload |
| On-chip partition | IPI or mailbox register write | IPI or mailbox register write |

All use the same interval (default 100ms) and detection thresholds (Suspect at 3 misses, Dead at 10 misses).

5.8.1.4 Tier M Peer Hot-Plug and Surprise Removal

Hot-add of Tier M device (any bus, at runtime after boot):

  1. Bus detects the new device:
     • PCIe: link training → hot-add interrupt
     • USB: hub port status change interrupt
     • s390x: channel report word (CRW) machine check → STSCH reveals new subchannel
     • virtio: device configuration change notification
  2. Kernel enumerates the new device (same as the Phase 4.4a bus scan, but at runtime).
  3. Check bus-specific Tier M magic → if Tier M: run the Phase 4.8 detection sequence.
  4. Peer join handshake → CapAdvertise → PeerServiceProxy creation in KabiServiceRegistry (Section 5.11).
  5. Host subsystems discover the new service via KabiServiceRegistry change notification (Section 11.6).

Surprise removal of Tier M device (any bus):

  1. Bus-specific disconnect event:
     • PCIe: link-down event or AER (Advanced Error Reporting) fatal error
     • USB: hub port disconnect interrupt
     • s390x: subchannel gone (CRW with solicited flag)
     • Universal fallback: heartbeat timeout (3 misses → Suspect, 10 → Dead)
  2. Peer marked Dead (standard peer failure path).
  3. Bus-specific isolation lockout:
     • PCIe: IOMMU domain teardown + bus master disable (<1ms)
     • s390x: CSCH (Clear SubChannel) stops all pending I/O
     • USB: all pending URBs cancelled, endpoints halted
  4. PeerServiceProxy generation counter set to even (dead). All in-flight vtable calls return -ENODEV.
  5. KabiServiceRegistry auto-resolves to next-best provider (host KABI driver, if loaded and registered with lower priority).
  6. Device unregistered from peer registry and device registry.

5.8.1.5 System Suspend with Tier M Peers

Tier M peers follow standard device suspend ordering — suspended AFTER host subsystems quiesce (Phase DeviceSuspend in the suspend sequence, Section 7.5).

For Tier M NICs providing cluster transport (EXTERNAL_NETWORK or RDMA_CAPABLE): the host sends LeaveNotify to all remote cluster peers BEFORE suspending the NIC. The host is temporarily Dead to the cluster during sleep. On resume:

  1. NIC Tier M peer re-initializes (firmware restarts or resumes).
  2. Host re-runs Phase 4.8 peer detection for the NIC.
  3. Host re-runs Phase 7.0a transport activation.
  4. Host re-joins the cluster (Phase 7.2 cluster_join()).
  5. Duration: ~100-500ms for cluster rejoin after resume.

If the host has multiple NICs, one can stay in D0 (active) for cluster keepalive while others suspend. Admin-configurable via /sys/class/net/<nic>/device/peer/suspend_policy (suspend or keepalive).

For Tier M storage devices (NVMe with Tier M shim): suspended after all filesystems are synced and all I/O queues are drained. On resume: peer re-join is automatic; mounted filesystems see no error (the device was idle during suspend).

5.8.1.6 Firmware Update of Sole Cluster Transport

When the RDMA NIC (or only Ethernet NIC) providing cluster transport requires firmware update and is the sole path to remote peers:

  1. Host sends LeaveNotify to all remote peers (via the NIC, while it still works).
  2. Wait for DrainAck from all peers (bounded timeout: 5 seconds).
  3. Host is now gracefully Dead to the cluster (expected, not an error).
  4. Host sends firmware update command to NIC via control ring pair.
  5. NIC reboots firmware (~1-5 seconds, device temporarily unresponsive).
  6. NIC re-initializes → Phase 4.8 peer re-detection → Phase 7.0a transport reactivation.
  7. Host re-joins cluster via Phase 7.2.

Total cluster blackout: ~2-10 seconds (firmware reboot + cluster rejoin). During the blackout, the host functions normally for LOCAL operations (local block devices, local GPU, local serial, etc.). Only remote cluster operations are unavailable.

If the host has a secondary NIC (even a slow Ethernet adapter), cluster membership can be maintained via the secondary path during the update — the topology graph routes around the updating NIC automatically.

5.8.1.7 Split-Brain Resolution

A network partition can cause split-brain: two groups of nodes each believe the other group has failed.

Strategy: Majority quorum + lease-based fencing tokens.

Cluster: Nodes {A, B, C, D, E} (5 nodes)
Partition: {A, B, C} can talk to each other. {D, E} can talk to each other.
           Neither group can reach the other.

Resolution:
  1. Each group counts its members.
  2. {A, B, C} has 3/5 = majority. Continues operating.
  3. {D, E} has 2/5 = minority. Enters read-only mode.
     - No new DSM writes (avoids conflicting updates).
     - No process migrations.
     - Local workloads continue (processes already on D, E keep running).
     - Remote page faults that target {A, B, C} get errors.
  4. When partition heals:
     - {D, E} rejoin the cluster.
     - DSM directory entries are reconciled (version numbers resolve conflicts).
     - Process affinity recalculated.
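
The quorum arithmetic in steps 2-3 is simple enough to pin down in a few lines. This is an illustrative sketch, not the kernel's membership code; the names are local to the example:

```rust
/// Strict-majority test each partition applies after counting its members.
/// `reachable` is the size of this node's partition; `cluster_size` is the
/// last agreed-upon membership size.
fn has_quorum(reachable: u32, cluster_size: u32) -> bool {
    reachable > cluster_size / 2
}

fn main() {
    assert!(has_quorum(3, 5));   // {A, B, C}: majority, continues operating
    assert!(!has_quorum(2, 5));  // {D, E}: minority, enters read-only mode
    assert!(!has_quorum(2, 4));  // even 2-2 split: neither side has quorum
}
```

Note the even-split case: neither side passes the strict-majority test, which is exactly why the tie-breaking rules below exist.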

**Lease-Based Fencing Tokens**: To prevent split-brain ambiguity, UmkaOS uses
monotonically-increasing fencing tokens (lease epochs) for all cluster-wide
operations:

```rust
/// Fencing token for split-brain prevention.
/// Monotonically increments on every quorum leadership change.
/// Used to invalidate stale operations from partitioned minorities.
#[repr(C)]
pub struct FencingToken {
    /// Monotonically increasing epoch.
    /// Incremented when:
    ///   1. Cluster membership changes (node join/leave)
    ///   2. Quorum leader changes (leader failure)
    ///   3. Admin triggers manual fencing (maintenance)
    pub epoch: Le64,

    /// PeerId of the quorum leader that issued this token.
    /// Used for token validation and leader identification.
    pub leader_peer: Le64,         // PeerId

    /// Timestamp when this token was issued (cluster-relative, PTP-synchronized).
    /// Tokens expire after FENCING_TOKEN_TTL_NS (default: 30 seconds).
    pub issued_at_ns: Le64,
}
const_assert!(core::mem::size_of::<FencingToken>() == 24);

/// Token time-to-live: tokens older than this are considered invalid.
/// Must be longer than the worst-case partition detection time (heartbeat timeout).
pub const FENCING_TOKEN_TTL_NS: u64 = 30_000_000_000; // 30 seconds
```

**Analysis: TTL vs heartbeat detection.** The 30-second TTL is NOT the primary split-brain protection. The minority partition enters read-only mode within 1.5 seconds (3 missed heartbeats × 500 ms). The TTL is a backstop for external services that cannot observe membership changes directly. A token with a stale epoch is rejected immediately by any service that tracks the current epoch — TTL expiry is only relevant for services without epoch tracking.

Token propagation protocol:

  1. Leader election: When cluster membership changes, the new quorum leader (determined by deterministic rules below) increments the fencing epoch and broadcasts the new FencingToken to all nodes in its partition.
  2. Token validation: Every DSM write and capability grant includes the sender's current fencing token. The receiver rejects operations with stale tokens (token.epoch < local_token.epoch) or expired tokens (now - token.issued_at_ns > FENCING_TOKEN_TTL_NS).
  3. Partition healing: When partitions heal, the surviving leader's fencing token takes precedence. Nodes from the minority partition discard their stale tokens and adopt the majority's token before resuming normal operations.
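
Steps 2-3 reduce to two comparisons on the receiver. A minimal sketch, assuming a plain-u64 view of the FencingToken fields; the TokenError enum is illustrative, not a spec type:

```rust
const FENCING_TOKEN_TTL_NS: u64 = 30_000_000_000; // 30 s, matching the spec constant

#[derive(Debug, PartialEq)]
enum TokenError { StaleEpoch, Expired }

/// Receiver-side validation: reject stale epochs first (the primary
/// protection), then expired tokens (the TTL backstop for services
/// that do not track the current epoch).
fn validate_token(epoch: u64, issued_at_ns: u64,
                  local_epoch: u64, now_ns: u64) -> Result<(), TokenError> {
    if epoch < local_epoch {
        return Err(TokenError::StaleEpoch); // sender is in a stale partition
    }
    if now_ns.saturating_sub(issued_at_ns) > FENCING_TOKEN_TTL_NS {
        return Err(TokenError::Expired);    // backstop for epoch-blind services
    }
    Ok(())
}

fn main() {
    assert_eq!(validate_token(5, 1_000, 5, 2_000), Ok(()));
    assert_eq!(validate_token(4, 1_000, 5, 2_000), Err(TokenError::StaleEpoch));
    assert_eq!(validate_token(5, 0, 5, 40_000_000_000), Err(TokenError::Expired));
}
```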

Deterministic Tie-Breaking Rules: To ensure all nodes independently reach the same decision during a partition, UmkaOS applies the following rules in strict order:

Rule 1: Majority Wins — The partition with >50% of the nodes wins.
  - 5-node cluster: 3+ nodes win.
  - 6-node cluster: 4+ nodes win.

Rule 2: Larger Node Set Wins — If neither partition holds a strict majority (possible when some cluster members are unreachable by both sides), the partition with more nodes wins.

Rule 2a: Equal Size → Higher Node ID Sum Wins — If both partitions have exactly the same number of nodes (possible only in even-sized clusters), the partition with the higher total node ID sum wins.
  - Example: In a 4-node cluster {A=1, B=2, C=3, D=4}, partition {A, D} (sum=5) ties with partition {B, C} (sum=5); the final tiebreaker resolves it.
  - Final tiebreaker: Equal Sum → Lowest Minimum Node ID Wins. {A, D} has min=1; {B, C} has min=2. {A, D} wins.
  - Together these rules provide a total order over all possible equal-size partitions without requiring external coordination.

Rule 3: External Witness (optional) — An admin-configured external witness (node ID 255, or a dedicated witness VM) can act as a tiebreaker. If configured, the partition containing the witness wins all ties. This overrides Rules 2/2a.

Rule precedence: Rule 1 > Rule 3 > Rule 2 > Rule 2a.

Slot-indexed addressing: PartitionBitmap uses PeerSlotIndex (not raw PeerId) to avoid overflow after cumulative peer joins exceed 1024. PeerId(NonZeroU64) is never reused within a cluster epoch and grows monotonically. After 1024 cumulative joins (even with only 10 peers active), raw PeerId would exceed the bitmap range. PeerSlotIndex is a recycled dense index in [0, 1023] managed by PeerRegistry:

```rust
/// Dense slot index for bitmap addressing. Recycled from a free list after
/// dead-peer garbage collection (with a grace period matching the GC retention
/// period of 1 hour — see PeerRegistry.mark_dead()). The slot index is
/// independent of PeerId: a peer with PeerId=50000 may occupy slot 7.
///
/// Invariant: at most MAX_PEERS (1024) slots are allocated at any time.
/// Ord is required: partition_wins() compares min_slot() results.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct PeerSlotIndex(u16);  // [0, 1023], recycled

impl PeerRegistry {
    /// Allocate a dense slot index for a newly joining peer.
    /// Returns ClusterFull if all 1024 slots are occupied.
    /// The slot is returned to the free list by `release_slot()`.
    fn allocate_slot(&self) -> Result<PeerSlotIndex, ClusterFull> { ... }

    /// Release a slot after dead-peer garbage collection grace period.
    /// The slot becomes available for reuse by future joining peers.
    fn release_slot(&self, slot: PeerSlotIndex) { ... }

    /// Look up the PeerSlotIndex for a given PeerId.
    /// Used by split-brain resolution to convert PeerId → slot for bitmap ops.
    fn slot_for(&self, peer: PeerId) -> Option<PeerSlotIndex> { ... }
}
```

Implementation for up to MAX_PEERS (1024) nodes with scalable bitmasks:

```rust
/// Scalable partition bitmask supporting up to MAX_PEERS (1024) nodes.
/// Backed by `[u64; 16]` = 1024 bits, matching the cluster-wide MAX_PEERS limit.
/// Operations delegate to per-word bitwise logic — no heap allocation.
///
/// **Indexed by PeerSlotIndex, NOT raw PeerId.** PeerSlotIndex is a dense
/// recycled index in [0, 1023], ensuring bitmap addressing remains valid
/// regardless of cumulative PeerId growth. The PeerRegistry maps PeerId →
/// PeerSlotIndex; callers must convert before using bitmap operations.
#[derive(Clone, Debug)]
pub struct PartitionBitmap {
    /// Each bit represents a slot index (0-indexed, matching PeerSlotIndex).
    words: [u64; 16],
}

impl PartitionBitmap {
    /// Set the bit for slot `slot`. Returns `true` on success,
    /// `false` if `slot.0` is out of range [0, 1023].
    ///
    /// Precondition: slot in [0, 1023], enforced by PeerRegistry.allocate_slot().
    /// Out-of-range slots are silently ignored (returns false) with an FMA
    /// event `PartitionBitmapOobAccess { slot }`, rather than panicking,
    /// because this code runs on the failure-handling path where defense
    /// in depth is more important than fail-fast semantics.
    pub fn set(&mut self, slot: PeerSlotIndex) -> bool {
        let idx = slot.0 as usize;
        if idx >= 1024 {
            // FMA event: PartitionBitmapOobAccess { slot }
            return false;
        }
        self.words[idx / 64] |= 1u64 << (idx % 64);
        true
    }

    /// Test whether slot `slot` is present. Returns `false` for
    /// out-of-range slots (same defensive behavior as `set()`).
    pub fn contains(&self, slot: PeerSlotIndex) -> bool {
        let idx = slot.0 as usize;
        if idx >= 1024 { return false; }
        (self.words[idx / 64] & (1u64 << (idx % 64))) != 0
    }

    /// Count the number of set bits (population count).
    pub fn count_ones(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }

    /// Test whether any bit in `other` overlaps with `self`.
    pub fn intersects(&self, other: &Self) -> bool {
        self.words.iter().zip(other.words.iter()).any(|(a, b)| a & b != 0)
    }

    /// Minimum slot index present (0-indexed). Returns `None` if empty.
    pub fn min_slot(&self) -> Option<PeerSlotIndex> {
        for (i, &w) in self.words.iter().enumerate() {
            if w != 0 {
                return Some(PeerSlotIndex((i as u16 * 64) + w.trailing_zeros() as u16));
            }
        }
        None
    }

    /// Sum of all slot indices present (0-indexed). u64 to avoid overflow
    /// (worst case: slots 0..=1023, sum = 523776).
    pub fn slot_sum(&self) -> u64 {
        let mut sum = 0u64;
        for (i, &w) in self.words.iter().enumerate() {
            let base = i as u64 * 64;
            let mut mask = w;
            while mask != 0 {
                let lowest = mask.trailing_zeros() as u64;
                sum += base + lowest; // slot indices are 0-indexed
                mask &= !(1u64 << lowest);
            }
        }
        sum
    }
}

/// Deterministically select the winning partition from two candidates.
/// Returns true if partition_a wins over partition_b.
///
/// # Preconditions
/// - Partitions are disjoint (no node in both)
/// - Neither partition is empty
pub fn partition_wins(partition_a: &PartitionBitmap, partition_b: &PartitionBitmap,
                      cluster_size: u16, witness: &PartitionBitmap) -> bool {
    let count_a = partition_a.count_ones();
    let count_b = partition_b.count_ones();

    // Rule 1: Majority wins (strictly more than half).
    // For odd clusters: 3 nodes → threshold 2, 5 → 3. Correct.
    // For even clusters: 4 nodes → threshold 3, 6 → 4. A 50/50 split (2-2)
    // does NOT meet threshold — neither partition wins, falls through to
    // tiebreakers (Rule 3: witness, Rule 2: larger set / lowest-ID).
    let majority_threshold = (cluster_size as u32 / 2) + 1;
    if count_a >= majority_threshold { return true; }
    if count_b >= majority_threshold { return false; }

    // Rule 3: External witness (if configured)
    if witness.count_ones() != 0 {
        let witness_in_a = partition_a.intersects(witness);
        let witness_in_b = partition_b.intersects(witness);
        if witness_in_a && !witness_in_b { return true; }
        if witness_in_b && !witness_in_a { return false; }
    }

    // Rule 2: Larger partition wins
    if count_a != count_b { return count_a > count_b; }

    // Rule 2a: Equal size → higher slot sum wins
    let sum_a = partition_a.slot_sum();
    let sum_b = partition_b.slot_sum();
    if sum_a != sum_b { return sum_a > sum_b; }

    // Rule 2a final tiebreaker: lower minimum slot index wins
    // (guaranteed to differ for disjoint partitions of equal size)
    partition_a.min_slot() < partition_b.min_slot()
}
```
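
As a sanity check, the rules can be exercised on the 4-node example from the tie-breaking rules above. This condensed sketch collapses the bitmap to a single u64 word (slots 0-3 stand for nodes A-D); it mirrors, but is not, the kernel implementation:

```rust
/// Sum of set slot indices in a one-word bitmap.
fn slot_sum(mut b: u64) -> u64 {
    let mut s = 0u64;
    while b != 0 {
        s += b.trailing_zeros() as u64;
        b &= b - 1; // clear lowest set bit
    }
    s
}

/// One-word version of the deterministic tie-breaking rules.
fn partition_wins(a: u64, b: u64, cluster_size: u32, witness: u64) -> bool {
    let threshold = cluster_size / 2 + 1;
    if a.count_ones() >= threshold { return true; }    // Rule 1: majority
    if b.count_ones() >= threshold { return false; }
    if witness != 0 {                                  // Rule 3: witness
        match (a & witness != 0, b & witness != 0) {
            (true, false) => return true,
            (false, true) => return false,
            _ => {}
        }
    }
    if a.count_ones() != b.count_ones() {              // Rule 2: larger set
        return a.count_ones() > b.count_ones();
    }
    if slot_sum(a) != slot_sum(b) {                    // Rule 2a: higher sum
        return slot_sum(a) > slot_sum(b);
    }
    a.trailing_zeros() < b.trailing_zeros()            // final: lower min slot
}

fn main() {
    let (ad, bc) = (0b1001u64, 0b0110u64);  // {A, D} vs {B, C}
    assert_eq!(slot_sum(ad), slot_sum(bc)); // sums tie (3 == 3)
    assert!(partition_wins(ad, bc, 4, 0));  // min slot 0 < 1 → {A, D} wins
    assert!(!partition_wins(ad, bc, 4, 0b0100)); // witness on C → {B, C} wins
}
```

The witness case shows the precedence order in action: Rule 3 is consulted before the size and sum tiebreakers, so configuring a witness overrides Rule 2/2a.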

Raft/Paxos for Critical Cluster Metadata: For cluster-critical state that cannot tolerate any inconsistency (security policies, node authentication keys, cluster configuration), UmkaOS uses Raft consensus rather than the simpler majority-quorum protocol:

  • Raft scope: Security policy replication, node certificate revocation, cluster-wide configuration changes (adding/removing nodes).
  • Quorum protocol scope: DSM page ownership, capability caching, heartbeat.
  • Rationale: Raft provides linearizability but requires persistent logging and leader election overhead. Using it only for critical metadata avoids imposing Raft's cost on the high-frequency DSM operations.

The Raft implementation is a complete, persistent consensus engine, managed by the ClusterMetadataReplicator service. Leader election uses the same deterministic tie-breaking rules as the quorum protocol above to avoid ambiguity during concurrent elections. What follows is the full Raft specification for UmkaOS.

Raft Log Persistence

A Raft log entry MUST be written to the local write-ahead log (WAL) and fsynced before the leader sends AppendEntries RPCs to followers. This is the central durability guarantee of Raft: if the leader crashes immediately after sending AppendEntries, the entry can be recovered from the leader's WAL on restart and re-sent. Without WAL persistence, a leader crash before followers acknowledge would permanently lose the entry — violating the Raft invariant that committed entries are durable.

```rust
/// One entry in the Raft log.
/// Written to the WAL before being sent to followers.
/// The payload immediately follows this header in the WAL file.
#[repr(C)]
pub struct RaftLogEntry {
    /// Monotonically increasing log index, 1-based.
    /// Gaps are not permitted: every index from 1 to commitIndex is present.
    pub index: Le64,
    /// Raft term in which this entry was created.
    /// Entries from earlier terms may be present in the log; they are only
    /// considered committed once a current-term entry is committed (Raft §5.4.2).
    pub term: Le64,
    /// CRC32C checksum of the payload bytes, for corruption detection.
    /// Verified on read; mismatch triggers WAL recovery from peer snapshot.
    pub crc: Le32,
    /// Length of the serialized `ClusterMetadataOp` payload in bytes.
    pub payload_len: Le32,
    // Followed immediately in the file by `payload_len` bytes of
    // serialized `ClusterMetadataOp`.
}
const_assert!(core::mem::size_of::<RaftLogEntry>() == 24);
```

```rust
/// Maximum serialized size of a security policy update blob.
pub const MAX_SECURITY_POLICY_BLOB: usize = 64 * 1024; // 64 KiB

/// Maximum serialized size of a peer certificate (X.509 DER + PQC extensions).
pub const MAX_PEER_CERT_SIZE: usize = 8 * 1024; // 8 KiB

/// Maximum serialized size of a metadata snapshot blob.
pub const MAX_METADATA_SNAPSHOT_SIZE: usize = 1024 * 1024; // 1 MiB

/// The operations that may be stored in a Raft log entry.
/// Each variant is a single atomic change to the cluster metadata state machine.
pub enum ClusterMetadataOp {
    /// Transition a peer's state (Joining → Active, Active → Leaving, etc.).
    SetPeerState { peer_id: PeerId, state: PeerState },
    /// Revoke a distributed capability (invalidates all cached copies cluster-wide).
    RevokeCapability { cap_id: CapabilityId, revoke_epoch: u64 },
    /// Update the cluster-wide security policy (e.g., allowed cipher suites,
    /// certificate revocation list additions).
    /// Bounded: max `MAX_SECURITY_POLICY_BLOB` (64 KiB).
    UpdateSecurityPolicy { policy_blob: Vec<u8> },
    /// Record a new peer joining the cluster (triggers joint-consensus transition).
    /// `cert` bounded: max `MAX_PEER_CERT_SIZE` (8 KiB).
    AddPeerToCluster { peer_id: PeerId, addr: ClusterAddr, cert: Vec<u8> },
    /// Remove a peer from the cluster (triggers joint-consensus transition).
    RemovePeerFromCluster { peer_id: PeerId },
    /// Install a full metadata snapshot (sent via InstallSnapshot RPC;
    /// replaces all preceding log entries on the recipient).
    /// `metadata_blob` bounded: max `MAX_METADATA_SNAPSHOT_SIZE` (1 MiB).
    InstallSnapshot { snapshot_term: u64, snapshot_index: u64,
                      metadata_blob: Vec<u8> },
}
```

ClusterMetadataOp uses a custom binary wire encoding (not the Rust enum representation). Each variant is serialized as [discriminant: Le32] [payload...] with deterministic encoding (same entry produces identical bytes on all nodes, required for CRC verification). Variable-length blobs use [blob_len: Le32] [blob: [u8; blob_len]]:

ClusterMetadataOp wire encoding:
  [discriminant: Le32]   Variant ID (see table below)
  [payload...]           Variant-specific fields (all integers Le-encoded)

Discriminant table:
  0x0001  SetPeerState:           [peer_id: Le64] [state: Le32]
  0x0002  RevokeCapability:       [cap_id: Le64] [revoke_epoch: Le64]
  0x0003  UpdateSecurityPolicy:   [blob_len: Le32] [blob: [u8; blob_len]]
  0x0004  AddPeerToCluster:       [peer_id: Le64] [addr: ClusterAddrWire] [cert_len: Le32] [cert: [u8; cert_len]]
  0x0005  RemovePeerFromCluster:  [peer_id: Le64]
  0x0006  InstallSnapshot:        [snapshot_term: Le64] [snapshot_index: Le64] [blob_len: Le32] [blob: [u8; blob_len]]

No external serialization dependency. This encoding is used both on the wire (AppendEntries RPC payloads) and in the WAL (persistent log entries). On mixed-endian clusters (PPC32/s390x big-endian, others little-endian), all integers are Le-encoded.

Deserialization validation (applies to both wire receive and WAL replay):
  - UpdateSecurityPolicy: blob_len > MAX_SECURITY_POLICY_BLOB (64 KiB) → RaftError::PayloadTooLarge
  - AddPeerToCluster: cert_len > MAX_PEER_CERT_SIZE (8 KiB) → RaftError::PayloadTooLarge
  - InstallSnapshot: blob_len > MAX_METADATA_SNAPSHOT_SIZE (1 MiB) → RaftError::PayloadTooLarge

The validation MUST occur before allocation: read blob_len/cert_len, validate against the constant, then allocate. A malformed entry in the WAL triggers WAL corruption recovery (truncate to last valid entry). This prevents OOM from malicious or corrupt oversized payloads on the Raft leader during log replay.
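
For concreteness, the 0x0001 SetPeerState encoding can be round-tripped in a few lines. This is an illustrative sketch with local helper names, not the kernel's serializer:

```rust
// Illustrative round-trip of the 0x0001 SetPeerState variant using the
// Le layout above; helper names are local to this sketch, not spec API.
fn encode_set_peer_state(peer_id: u64, state: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(16);
    out.extend_from_slice(&0x0001u32.to_le_bytes()); // [discriminant: Le32]
    out.extend_from_slice(&peer_id.to_le_bytes());   // [peer_id: Le64]
    out.extend_from_slice(&state.to_le_bytes());     // [state: Le32]
    out
}

fn read_u32(buf: &[u8], off: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[off..off + 4]);
    u32::from_le_bytes(b)
}

fn read_u64(buf: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(b)
}

fn decode_set_peer_state(buf: &[u8]) -> Option<(u64, u32)> {
    if buf.len() != 16 || read_u32(buf, 0) != 0x0001 {
        return None; // wrong length or wrong discriminant
    }
    Some((read_u64(buf, 4), read_u32(buf, 12)))
}

fn main() {
    let bytes = encode_set_peer_state(42, 3);
    // Deterministic encoding: identical bytes on every node, so the
    // CRC32C in RaftLogEntry matches across replicas.
    assert_eq!(bytes, encode_set_peer_state(42, 3));
    assert_eq!(decode_set_peer_state(&bytes), Some((42, 3)));
}
```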

Write-ahead log layout:

```rust
/// Write-ahead log managing Raft log persistence on local storage.
/// Path: /ukfs/kernel/cluster/raft-wal (on the umkafs persistent volume).
/// Rotated when it exceeds WAL_ROTATE_THRESHOLD_BYTES (default: 64 MiB).
pub struct RaftWal {
    /// Absolute path on the umkafs persistent volume.
    pub path: &'static str,
    /// Open file descriptor (opened O_WRONLY | O_APPEND).
    /// O_DSYNC is NOT used: it forces synchronous data write on every write()
    /// call, defeating WAL batching. Instead, the WAL batches multiple log
    /// entries into a single write() + fdatasync() pair. The fdatasync() at
    /// commit time ensures durability for the entire batch. This matches the
    /// design of production WAL implementations (SQLite WAL, PostgreSQL WAL)
    /// where explicit fsync/fdatasync after batch writes is preferred over
    /// per-write O_DSYNC.
    pub fd: FileDescriptor,
    /// Highest log index for which an fsync has completed.
    /// Entries with index > fsynced_index are buffered but not yet durable.
    pub fsynced_index: AtomicU64,
    /// PI-aware Mutex serializing the batch write+fsync operation (inherited
    /// from Ch 3 kernel Mutex default). Entry submission to the append queue
    /// is lock-free (bounded MPSC ring). This Mutex serializes only the batch
    /// write+fsync step — concurrent submitters are not blocked by fsync.
    pub write_lock: Mutex<()>,
}

/// Rotate threshold: when the WAL exceeds this size, create a new segment.
pub const WAL_ROTATE_THRESHOLD_BYTES: u64 = 64 * 1024 * 1024; // 64 MiB

/// Fsync protocol for log entry appends (batched):
///
/// Multiple concurrent client operations batch their entries before a single fsync:
/// 1. Lock write_lock.
/// 2. Drain all pending entries from the append queue (up to WAL_BATCH_MAX entries).
/// 3. Serialize all entry headers + payloads in a single write() call.
/// 4. fdatasync(fd) — flushes ALL batched data in one syscall.
/// 5. Update fsynced_index to the highest index in the batch.
/// 6. Unlock write_lock.
/// 7. Send AppendEntries RPCs for all entries in the batch.
///
/// Batching amortizes the fsync cost (~100-500 μs on NVMe) across multiple entries.
/// Under load, a batch of 16 entries costs the same as a single entry.
/// Under low load, the batch timer (WAL_BATCH_TIMEOUT_US) flushes partial batches
/// to bound latency.
pub const WAL_BATCH_MAX: usize = 32;
/// Maximum delay before flushing a partial batch (microseconds).
/// 0 = no batching (immediate fsync per entry). Default: 500 μs.
pub const WAL_BATCH_TIMEOUT_US: u64 = 500;
```
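
A back-of-envelope check of the amortization claim, taking ~300 μs as a midpoint of the quoted 100-500 μs NVMe fdatasync cost (the midpoint figure is an assumption for illustration only):

```rust
/// Per-entry fsync cost when `batch` entries share one fdatasync.
fn per_entry_fsync_us(fsync_us: u64, batch: u64) -> u64 {
    fsync_us / batch
}

fn main() {
    assert_eq!(per_entry_fsync_us(300, 1), 300); // unbatched: full cost per entry
    assert_eq!(per_entry_fsync_us(300, 16), 18); // 16-entry batch: ~16x cheaper
    assert_eq!(per_entry_fsync_us(300, 32), 9);  // WAL_BATCH_MAX saturated
}
```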

Raft State Machine

Each Raft participant (node) maintains the following persistent and volatile state:

```rust
/// Persistent state: must survive crashes. Written to WAL before responding to RPCs.
pub struct RaftPersistentState {
    /// Latest term this node has seen. Initialized to 0, increases monotonically.
    pub current_term: u64,
    /// PeerId of the candidate this peer voted for in `current_term`.
    /// None if this peer has not voted in the current term.
    pub voted_for: Option<PeerId>,
    /// Log entries (mirrors the WAL index). Holds entry headers only (24 bytes
    /// each). Full payloads are read from the WAL on demand during leader
    /// replication. Practical bound: `RAFT_SNAPSHOT_THRESHOLD` entries (240 KiB).
    /// If `log.len() > 2 * RAFT_SNAPSHOT_THRESHOLD`, trigger emergency synchronous
    /// snapshot before accepting new entries. This provides a hard upper bound
    /// of 480 KiB (20,000 entries × 24 bytes). The snapshot protocol
    /// ([Section 5.8](#failure-handling-and-distributed-recovery--raft-snapshot-compaction))
    /// truncates entries before the snapshot index, keeping the Vec bounded.
    pub log: Vec<RaftLogEntry>,
}

/// Volatile state: rebuilt from persistent state after a crash.
pub struct RaftVolatileState {
    /// Index of the highest log entry known to be committed.
    pub commit_index: u64,
    /// Index of the highest log entry applied to the metadata state machine.
    pub last_applied: u64,
}

/// Volatile leader-only state: reinitialized after every election win.
pub struct RaftLeaderState {
    /// For each follower: index of the next log entry to send.
    /// Initialized to leader's last log index + 1.
    /// XArray keyed by PeerId (u64) — O(1) lookup.
    pub next_index: XArray<u64>,
    /// For each follower: index of the highest log entry known to be replicated.
    /// Initialized to 0.
    /// XArray keyed by PeerId (u64) — O(1) lookup.
    pub match_index: XArray<u64>,
}
```

Leader Election

State transitions:
  Follower  → Candidate  (election timeout fires: no heartbeat received)
  Candidate → Leader     (receives votes from majority of cluster)
  Candidate → Follower   (receives AppendEntries from higher-term leader,
                          OR receives RequestVote with higher term)
  Leader    → Follower   (receives RPC with higher term)

Timeouts (tunable via cluster configuration):
  election_timeout:   150–300 ms, randomized per-node at startup and after
                      each timeout event to prevent synchronized split votes.
                      Randomization range: [base, base * 2], uniform distribution.
  heartbeat_interval: 50 ms (leader sends empty AppendEntries to prevent
                      follower timeouts; must be << election_timeout).
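
The timeout randomization above can be sketched in a few lines. A tiny xorshift generator stands in for the kernel's RNG (an assumption of this sketch); modulo sampling introduces negligible bias for illustration purposes:

```rust
/// Minimal xorshift64 PRNG standing in for the kernel RNG (illustrative).
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Draw a fresh election timeout, uniform-ish over [base, 2*base],
/// re-drawn after every timeout event to desynchronize candidates.
fn election_timeout_ms(base: u64, rng: &mut u64) -> u64 {
    base + xorshift(rng) % (base + 1)
}

fn main() {
    let mut rng = 0x9E37_79B9_7F4A_7C15u64; // arbitrary nonzero seed
    for _ in 0..1000 {
        let t = election_timeout_ms(150, &mut rng);
        assert!((150..=300).contains(&t)); // always within [base, 2*base]
    }
}
```

Re-drawing after each timeout (rather than fixing the value at boot) is what breaks repeated split votes: two candidates that tied once are unlikely to tie again.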

Pre-vote extension (Ongaro thesis §9.6, mandatory in UmkaOS): Before a follower starts an election, it runs a PreVote phase. A PreVote RPC asks peers "would you vote for me if I started an election?" without incrementing current_term. Only if the node receives PreVote acknowledgements from a majority does it increment its term and begin a real election. This prevents a partitioned node that reconnects from disrupting a stable leader by forcing unnecessary term increments.

Leader election algorithm:

Pre-vote phase (Follower → Candidate):
  1. Election timeout fires.
  2. Send PreVote RPC to all peers:
       PreVote { next_term: current_term + 1,
                 candidate_id: self_id,
                 last_log_index, last_log_term }
  3. Peer grants pre-vote if:
       - peer has not heard from a valid leader within election_timeout, AND
       - candidate's log is at least as up-to-date as peer's log
         (last_log_term > peer.last_log_term, OR
          last_log_term == peer.last_log_term AND last_log_index >= peer.last_log_index)
  4. Peer responds with PreVoteResponse:
       PreVoteResponse { term: peer.current_term,
                         grant: bool }
     - grant=true: peer would vote for this candidate
     - grant=false: peer has a valid leader or candidate's log is stale
  5. Pre-vote timeout: 75% of the randomized election timeout.
     If timeout expires before majority responds: remain Follower, retry on next
     election timeout (do NOT escalate to real election on timeout alone).
  6. If majority pre-vote grants received (>50% of cluster, including self):
     proceed to real election.
  7. If majority rejects OR timeout: remain Follower, reset election timer.
     Even-sized clusters: exact 50% is NOT sufficient (strict majority required).

Edge cases during PreVote:

  E1. AppendEntries from valid leader arrives during PreVote:
      - Abort PreVote immediately. Transition to Follower state.
      - Reset election timer. Do NOT increment term.
      - Discard any accumulated PreVoteResponse results.
      - Rationale: a valid leader exists, so the election is unnecessary.
        "Valid" means the AppendEntries has term >= candidate's current_term.

  E2. RequestVote from higher term arrives during PreVote:
      - Abort PreVote immediately. Step down to Follower.
      - Update current_term to the incoming term (persist to WAL).
      - Grant or deny the vote using standard RequestVote rules.
      - Do NOT start a new PreVote round; wait for the next election timeout.

  E3. PreVoteResponse with term > candidate's current_term:
      - The responding peer has a higher term. This means the cluster has
        advanced past the candidate's term. Abort PreVote, update current_term
        to the response term (persist to WAL), revert to Follower.

  E4. Network partition heals mid-PreVote (stale responses arrive late):
      - Responses for a PreVote round are only valid if the candidate is still
        in PreVote state for the SAME round (same next_term). If the candidate
        has transitioned to Follower (via E1/E2/E3 or timeout) and a late
        PreVoteResponse arrives, it is silently discarded.

Real election phase (Candidate):
  1. Increment current_term (persist to WAL).
  2. Vote for self (persist voted_for = self_id to WAL).
  3. Reset election timer.
  4. Send RequestVote RPC to all peers:
       RequestVote { term: current_term,
                     candidate_id: self_id,
                     last_log_index, last_log_term }
  5. Peer grants vote if:
       - term >= peer.current_term, AND
       - peer has not voted for a different node in this term, AND
       - candidate's log is at least as up-to-date as peer's log (as above)
     On grant: peer persists voted_for = candidate_id to WAL.
  6. Candidate receives majority votes (> N/2 including self):
       - Transition to Leader.
       - Initialize next_index[peer] = last_log_index + 1 for all peers.
       - Initialize match_index[peer] = 0 for all peers.
       - Send initial heartbeat (empty AppendEntries) to all peers immediately.
  7. Candidate receives AppendEntries with term >= current_term:
       - Recognize new leader. Transition to Follower.
  8. Election timer fires again before majority reached:
       - Increment term, repeat from step 1 (another round).

Log Replication

AppendEntries RPC (leader → follower):

  Request:
    term:         u64   — leader's current term
    leader_id:    PeerId
    prev_log_index: u64 — index of log entry immediately preceding new ones
    prev_log_term:  u64 — term of prev_log_index entry
    entries:      Vec<RaftLogEntry>  — new entries to store (empty for heartbeat)
    leader_commit: u64  — leader's current commitIndex

  Response:
    term:       u64     — follower's current term (for leader to update itself)
    success:    bool    — true if follower matched prevLogIndex/prevLogTerm
    match_index: u64    — highest index follower has replicated (for leader tracking)
    conflict_term: Option<u64>   — for fast log rollback (see below)
                                   (wire encoding: Le64, 0xFFFFFFFFFFFFFFFF = None)
    conflict_index: Option<u64>  — first index of conflict_term in follower's log
                                   (wire encoding: Le64, 0xFFFFFFFFFFFFFFFF = None)

Replication sequence:

1. Client operation arrives at leader (or is forwarded to leader by a follower).
2. Leader appends new RaftLogEntry to local WAL:
   a. Serialize entry to WAL with fsync (see WAL protocol above).
   b. fsynced_index is updated after fdatasync completes.
   c. Leader does NOT proceed to step 3 until fsync is complete.
3. Leader sends AppendEntries RPC to all followers in parallel (non-blocking send).
4. Follower receives AppendEntries:
   a. If term < follower.current_term: respond {success: false, term: current_term}.
   b. Reset election timer (valid leader heartbeat received).
   c. Check log consistency: does log[prev_log_index].term == prev_log_term?
      - No match: respond {success: false, conflict_term, conflict_index}
        where conflict_term = log[prev_log_index].term and conflict_index is the
        first index in follower's log with that term.
        This allows the leader to skip over conflicting entries in one RPC (fast rollback).
   d. Delete any existing entries conflicting with new entries (same index, different term).
   e. Append new entries to local WAL with fsync.
   f. Update commitIndex = min(leader_commit, last new entry index).
   g. Apply newly committed entries to the metadata state machine.
   h. Respond {success: true, match_index: last appended index}.
5. Leader receives majority success responses (> N/2 including itself):
   a. Advance commitIndex to the highest index replicated by majority.
      NOTE: Only advance commitIndex for entries from the current term
      (Raft §5.4.2). Entries from earlier terms become committed implicitly
      when a current-term entry is committed.
   b. Apply newly committed entries to local metadata state machine.
   c. Respond to the original client with success.
6. Leader piggybacks updated commitIndex in next heartbeat (or next AppendEntries).
   Followers apply committed entries up to the new commitIndex.

Fast log rollback (optimization for log inconsistency after leader change):
  Instead of decrementing next_index[follower] by 1 per RPC (O(N) RPCs for
  long divergent logs), the follower returns conflict_term and conflict_index.
  The leader sets next_index[follower] = conflict_index, skipping the entire
  conflicting term in a single round-trip.
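
The follower's side of the fast rollback (step 4c) is a first-occurrence scan over its own log. A minimal sketch with 0-based indices for brevity (Raft proper uses 1-based indices):

```rust
/// Given the follower's per-index terms and a prev_log_index whose term did
/// not match the leader's prev_log_term, compute (conflict_term,
/// conflict_index): the conflicting term and its FIRST index in the log.
fn conflict_info(terms: &[u64], prev_log_index: usize) -> (u64, usize) {
    let conflict_term = terms[prev_log_index];
    // First index carrying conflict_term; always found, since the term
    // exists at prev_log_index itself.
    let first = terms.iter().position(|&t| t == conflict_term).unwrap();
    (conflict_term, first)
}

fn main() {
    // Follower log terms by index: two entries of term 1, three of term 2,
    // one of term 3. Leader's prev_log_term at index 4 did not match.
    let follower_terms = [1, 1, 2, 2, 2, 3];
    let (term, idx) = conflict_info(&follower_terms, 4);
    // Leader sets next_index = 2, skipping all of term 2 in one round-trip.
    assert_eq!((term, idx), (2, 2));
}
```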

Snapshot Protocol

When the Raft log grows beyond RAFT_SNAPSHOT_THRESHOLD entries, the leader compresses the log by sending a full metadata state machine snapshot to followers whose match_index is far behind. This bounds WAL storage regardless of cluster uptime.

```rust
/// Threshold for triggering a snapshot: leader snapshots when log length exceeds this.
/// Default: 10_000 entries (matches etcd v3.6 — <1 second recovery at 100K entries/sec).
/// Configurable via sysctl `cluster.raft_snapshot_threshold` (valid range: [100, 1_000_000]).
/// Lower values = faster recovery + more snapshot I/O; higher = less I/O + slower recovery.
pub const RAFT_SNAPSHOT_THRESHOLD: usize = 10_000;

/// InstallSnapshot RPC (leader → lagging follower).
/// Sent when follower's match_index is behind the leader's snapshot point.
/// Wire struct — all integer fields use Le64 for cross-node endian safety.
#[repr(C)]
pub struct InstallSnapshotRpc {
    /// Leader's current term.
    pub term: Le64,                     // 8 bytes  (offset 0)
    /// Leader's PeerId (Le64 on the wire).
    pub leader_id: Le64,                // 8 bytes  (offset 8)
    /// The log index covered by this snapshot (all entries up to this index
    /// are included in the snapshot and need not be replicated individually).
    pub last_included_index: Le64,      // 8 bytes  (offset 16)
    /// The term of the entry at last_included_index.
    pub last_included_term: Le64,       // 8 bytes  (offset 24)
    /// Length of the snapshot data payload (in bytes).
    pub data_len: Le32,                 // 4 bytes  (offset 32)
    /// Padding for 8-byte alignment of the variable-length payload that
    /// follows this header. RDMA DMA transfers benefit from 8-byte aligned
    /// payload start addresses. Reserved for future use; must be zeroed on send.
    pub _pad: [u8; 4],                  // 4 bytes  (offset 36)
    // Followed by `data_len` bytes of serialized metadata state machine snapshot.
    // Includes: current cluster membership, all node states,
    // all security policies, and the revocation list.
    // Variable-length payload starts at offset 40 (8-byte aligned).
}
// Total header: 8 + 8 + 8 + 8 + 4 + 4 = 40 bytes.
const_assert!(core::mem::size_of::<InstallSnapshotRpc>() == 40);
```

/// Follower response to InstallSnapshot:
/// 1. Discard log entries up to last_included_index (already covered by snapshot).
/// 2. Apply the snapshot to the local metadata state machine.
/// 3. Persist the snapshot to WAL (as a ClusterMetadataOp::InstallSnapshot entry).
/// 4. Update last_applied = last_included_index, commit_index = last_included_index.
/// 5. Respond with {term: current_term, success: true}.

Safety Invariants

The following invariants hold at all times in a correctly operating Raft cluster. Implementation correctness is verified against these invariants:

  1. Election Safety: At most one leader is elected per term. Guaranteed by the majority vote requirement — two candidates cannot each receive votes from > N/2 nodes in the same term.

  2. Log Matching: If two log entries at the same index have the same term, then all entries in both logs up to that index are identical. Guaranteed by the prevLogIndex/prevLogTerm consistency check in AppendEntries.

  3. Leader Completeness: If a log entry is committed in term T, it is present in the logs of all leaders elected in terms > T. Guaranteed by the "candidate log at least as up-to-date as voter's log" requirement in RequestVote combined with the majority overlap property.

  4. State Machine Safety: If a node has applied a log entry at index I, no other node applies a different entry at index I. Follows from invariants 2 and 3.

  5. Durability: No entry is sent in AppendEntries before the leader has fsynced that entry to its local WAL. Guaranteed by the WAL write-then-send protocol above.

  6. Monotonic Terms: A node's current_term only increases, never decreases. On receiving any RPC with term > current_term, the node immediately updates current_term and transitions to Follower.
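Invariant 6 can be illustrated with a minimal sketch (type and method names are assumed for illustration, not the kernel's actual types):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Role { Follower, Candidate, Leader }

struct TermState {
    current_term: u64,
    role: Role,
}

impl TermState {
    /// Called on receipt of ANY RPC. The term only moves forward: if the
    /// RPC carries a newer term, adopt it and step down to Follower.
    /// Returns true if the local term advanced.
    fn observe_rpc_term(&mut self, rpc_term: u64) -> bool {
        if rpc_term > self.current_term {
            self.current_term = rpc_term; // monotonic: never decreased
            self.role = Role::Follower;
            true
        } else {
            false
        }
    }
}
```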

Cluster Membership Changes (Joint Consensus)

Cluster membership changes (adding or removing nodes) use the joint consensus mechanism (Ongaro thesis §4.3, Cluster Membership Changes) to prevent split-brain during the transition. Direct membership switches (old configuration → new configuration atomically) are not used because they create a window where two different majorities can exist simultaneously.

Joint consensus transition protocol:

Adding node D to cluster {A, B, C}:

1. Admin issues AddNode(D) to the leader.
2. Leader appends AddNodeToCluster(D) entry to the log (Phase 1 start).
   During Phase 1, the cluster uses JOINT configuration: {A, B, C, D}.
   A majority requires agreement from BOTH the old majority (A,B,C → 2 of 3)
   AND the new majority (A,B,C,D → 3 of 4). Both majorities must agree for
   any entry to be committed during the joint phase.
3. Leader replicates and commits the AddNodeToCluster(D) entry under joint consensus.
4. Leader appends a ClusterConfig(new={A,B,C,D}) entry (Phase 2 start).
5. Leader replicates and commits ClusterConfig(new) under the new configuration alone.
6. Transition complete: cluster is now {A, B, C, D} with normal majority rules.

Removing node B from cluster {A, B, C, D}:
  Same protocol: the leader appends RemoveNodeFromCluster(B), transitions through
  a joint phase requiring majorities of both the old configuration {A,B,C,D} and
  the new configuration {A,C,D}, then commits ClusterConfig(new={A,C,D}).
  Node B is notified via its last heartbeat that it has been removed.

Invariants during joint consensus:
  - The joint phase is limited to one pending configuration change at a time.
    A second AddNode/RemoveNode is rejected until the first completes.
  - If the leader crashes during the joint phase, the new leader (elected under
    joint consensus rules) inherits the joint configuration and completes the
    transition by re-committing the new-configuration entry.
  - A node added to the cluster starts as a non-voting member (log replication
    proceeds but its vote is not counted) until it has caught up to within
    MEMBERSHIP_CATCHUP_ROUNDS (default: 10) rounds of the leader's log.
    This prevents a lagging new member from blocking commit progress.
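The dual-majority commit rule of the joint phase can be sketched as follows (function names are assumed, illustrative only):

```rust
/// True if a majority of `voters` appears in `acks`.
fn majority_of(acks: &[u64], voters: &[u64]) -> bool {
    let got = voters.iter().filter(|v| acks.contains(v)).count();
    got > voters.len() / 2
}

/// During the joint phase, an entry commits only if it is acknowledged by
/// a majority of the OLD voter set AND a majority of the NEW voter set.
fn joint_committed(acks: &[u64], old_voters: &[u64], new_voters: &[u64]) -> bool {
    majority_of(acks, old_voters) && majority_of(acks, new_voters)
}
```

For the AddNode(D) example above: acks from {A, B} satisfy the old majority (2 of 3) but not the new one (needs 3 of 4), so the entry does not commit until a third ack arrives.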

Voter Subset

Not every peer in the cluster votes in Raft elections or participates in log replication as a voter. Raft voters are a small configured subset of peers (3-7, typically host kernels). All other peers are learners (observers): they receive replicated state but do not vote.

/// Raft voter eligibility is controlled by two independent conditions:
/// 1. The peer must have `PEER_CAP_RAFT_VOTER` capability (Section 5.2.9.1).
///    Firmware shim peers NEVER have this capability — Raft is too complex
///    for a shim implementation. They observe consensus results.
/// 2. The cluster admin must assign the voter role at runtime via the
///    cluster management API. Having the capability alone is not sufficient.
///
/// The voter set is stored in Raft persistent state, replicated via Raft
/// itself (changes use joint consensus, as specified above).

pub struct RaftPersistentState {
    /// Current set of voting peers. Small (3-7 entries typically, max 15 for
    /// large clusters). Bounded by `MAX_RAFT_VOTERS` (15). ArrayVec avoids
    /// heap allocation for this tiny, bounded collection (15 * 8 = 120 bytes).
    /// NOT a bitmask — PeerId is u64. Linear scan for `is_voter()` is fine
    /// (120 bytes fits in two cache lines).
    pub voters: ArrayVec<PeerId, MAX_RAFT_VOTERS>,
    // ... current_term, voted_for, log (see "Raft WAL Boot Recovery" below)
}

impl RaftPersistentState {
    /// Quorum size, derived from the current voter set:
    /// `quorum = voters.len() / 2 + 1`
    pub fn quorum(&self) -> usize {
        self.voters.len() / 2 + 1
    }

    /// Check if a peer is a voter.
    pub fn is_voter(&self, peer: PeerId) -> bool {
        self.voters.contains(&peer)
    }
}

Learner peers receive all committed log entries (security policies, cluster configuration, capability revocations) and apply them locally. They benefit from consensus without adding voting overhead. A cluster with 80 peers (10 hosts × 8 devices) has only 5-7 voters — Raft message complexity remains O(voters), not O(cluster_size).

Implementation Phasing

  • Phase 2 (single-node degenerate case): The Raft log is present but has only one participant. All entries are immediately committed (no network round-trip needed). The WAL is still written and fsynced — this validates the WAL implementation before multi-node testing begins.
  • Phase 3 (multi-node Raft): Full implementation as specified above. ClusterMetadataReplicator runs Raft across voter peers. Security policy replication, capability revocation, and peer state transitions all go through Raft. Learner peers receive committed entries via a separate replication stream.
  • Phase 4 (joint consensus): Online cluster reconfiguration (add/remove voters) using the joint consensus protocol. Required before UmkaOS clusters can be managed dynamically without downtime.

In-flight RDMA operations during partition: When a network partition occurs, RDMA NICs report completion with error status for all in-flight operations. RdmaPeerTransport translates these to TransportError::ConnectionLost. Callers (DSM fault handler, IPC ring) retry once (in case of transient link flap), then return an error to the process: SIGBUS for DSM page faults, EIO for IPC operations.

Minority partition DSM behavior: Nodes in the minority partition mark all DSM pages as SUSPECT:

  • Writes: Write accesses to SUSPECT pages are blocked (write-protect in page tables); write attempts fault and return EAGAIN.
  • Reads: Read accesses to SUSPECT pages continue to return locally-cached data (preserving availability for read-heavy workloads during brief partitions), but set a per-page stale_read flag.
  • Detection: Applications can check whether they have read potentially stale data via madvise(MADV_DSM_CHECK), which returns EDSM_STALE if any SUSPECT pages were read since the last check. Stale reads are therefore never silent — applications that care about consistency can detect and handle them, while applications that tolerate staleness continue without interruption.
  • Healing: When the partition heals, SUSPECT pages are reconciled with the majority partition's directory and the SUSPECT marking is cleared.
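The SUSPECT bookkeeping can be sketched as follows. This is an illustrative model with assumed names that mirrors the report-and-clear semantics of madvise(MADV_DSM_CHECK); it is not the kernel's actual data structures.

```rust
use std::collections::HashSet;

/// Illustrative per-node tracker for SUSPECT pages during a minority partition.
#[derive(Default)]
struct SuspectTracker {
    suspect: HashSet<u64>,    // page addresses marked SUSPECT
    stale_read: HashSet<u64>, // SUSPECT pages actually read since last check
}

impl SuspectTracker {
    fn mark_suspect(&mut self, page: u64) {
        self.suspect.insert(page);
    }

    /// Read path: allowed, but flagged if the page is SUSPECT.
    fn on_read(&mut self, page: u64) {
        if self.suspect.contains(&page) {
            self.stale_read.insert(page);
        }
    }

    /// Write path: denied on SUSPECT pages (the fault handler returns EAGAIN).
    fn write_allowed(&self, page: u64) -> bool {
        !self.suspect.contains(&page)
    }

    /// MADV_DSM_CHECK semantics: report-and-clear. Returns true if any
    /// potentially stale data was read since the previous check.
    fn check_stale(&mut self) -> bool {
        let stale = !self.stale_read.is_empty();
        self.stale_read.clear();
        stale
    }

    /// Partition heal: pages are reconciled and SUSPECT marks cleared.
    fn heal(&mut self) {
        self.suspect.clear();
    }
}
```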

Unreachable home node: Each DSM page has two home nodes: primary (determined by hash(region, VA) % cluster_size) and backup (determined by hash_backup(region, VA) % cluster_size using a different hash seed, guaranteed to differ from the primary).

Backup home node protocol:

1. Shadow directory maintenance: On every directory state change (ownership transfer, reader set update), the primary home node sends the updated DsmDirectoryEntry to the backup via transport.write_to_peer() — a transport-agnostic bulk write that maps to RDMA Write on RDMA transports and send_reliable() with serialized data on TCP transports. The target is a pre-allocated shadow directory region on the backup node. Each entry includes a generation counter (incremented on every update) for consistency verification. On TCP transports, throughput is lower (~1 GB/s TCP vs ~12 GB/s RDMA) but the protocol is identical.

2. Consistency: The backup's shadow directory is write-only from the primary's perspective — the backup never modifies it independently. The generation counter ensures that stale writes (e.g., reordered transport writes) are detected and discarded: the backup compares the incoming generation counter against its last seen value and applies only strictly increasing updates.

3. Failover: When the primary home node is declared Dead by the cluster membership protocol (Section 5.8), the new home node is determined deterministically — not by self-promotion. The membership protocol's NodeDead event triggers re-evaluation of the same hashing rule used for initial home placement (Section 5.7.3): hash(region, VA) % new_cluster_size with the dead node removed from the membership set. All surviving nodes compute the same result from the membership epoch, producing a single deterministic new home. If the new home happens to be the node already holding the backup shadow directory, it promotes the shadow to primary; otherwise, the backup transfers its shadow entries to the computed new home. Directory entries on the new home may be slightly stale (by at most one in-flight update); the version counter in each DsmDirectoryEntry allows requestors to detect a stale entry and retry.

4. Partition healing: When the primary returns, the primary and backup reconcile their directories: for each entry, the node with the higher generation counter wins. The backup reverts to shadow mode after reconciliation completes.

For even-numbered clusters (no strict majority possible), two options exist:

  • Admin designates a "tiebreaker" node (or external witness).
  • Or: the smaller-numbered node set wins (deterministic, no external dependency).

Caveat: the smaller-numbered-node-set heuristic is simple but has a known weakness — if the lower-numbered nodes are physically co-located, a power event affecting that rack consistently picks the wrong survivor set. For production deployments, an external quorum device (a dedicated witness VM, or a third-site arbitrator reached via RDMA or TCP heartbeat) is recommended. The heuristic remains the default fallback when no external witness is configured.
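The deterministic fallback can be sketched as follows (function name assumed; this reading interprets "smaller-numbered node set" as a lexicographic comparison of the sorted node-id lists, which for disjoint partitions reduces to comparing the smallest node id):

```rust
/// Given the two halves of an even split, return the partition that
/// survives under the smaller-numbered-node-set heuristic. Both sides
/// compute the same answer with no coordination.
fn surviving_partition(mut a: Vec<u32>, mut b: Vec<u32>) -> Vec<u32> {
    a.sort_unstable();
    b.sort_unstable();
    // Vec<u32> compares lexicographically; sorted disjoint sets compare
    // by their smallest member first.
    if a <= b { a } else { b }
}
```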

5.8.1.7.1 Cluster Leadership Election

The Distributed Lock Manager requires a leader node for deadlock arbitration and fencing token allocation. Election uses a Fencing Bully Algorithm — O(1) rounds, no coordinator needed.

Relationship to Raft leader: The DLM/fencing leader and the Raft metadata leader are always the same node during normal operation. The Fencing Bully Algorithm provides O(1) deterministic leader selection during two specific windows: (1) cluster bootstrap before Raft's first election completes, and (2) as a fast fallback during network partitions where Raft elections may be delayed by split-vote rounds. Once the Raft leader is elected, it assumes all leadership responsibilities including DLM arbitration and fencing token issuance. If the Bully algorithm and Raft disagree (e.g., during concurrent partition healing), the Raft leader takes precedence because Raft provides stronger consistency guarantees (log-up-to-date requirement). The Bully leader's authority is strictly bounded: it may fence stale nodes during partition, but it cannot commit metadata changes to the Raft log. On partition heal, the Raft leader's cluster state view is authoritative.

Leadership invariant: The node with the lexicographically greatest (active_epoch, node_id) tuple that is reachable by quorum (⌊N/2⌋ + 1 nodes) is the leader. active_epoch is the current CLUSTER_ACTIVE epoch counter; node_id (u32, assigned at cluster join) breaks ties deterministically.

Election trigger: Any node that fails to receive a heartbeat from the current leader for 2 × heartbeat_timeout (default 200 ms) initiates election by broadcasting CLAIM_LEADERSHIP { epoch, node_id, fencing_token }.

Fencing token: Monotonically increasing u64 stored in durable cluster state (ClusterState journal). A candidate that cannot present a fencing_token greater than all known tokens is rejected. This prevents split-brain after network partition: the partition with the stale fencing token cannot become leader.

Quorum check: Candidate waits for ACKNOWLEDGE_LEADERSHIP from ⌊N/2⌋ + 1 nodes within 2 × heartbeat_timeout. On quorum: candidate becomes leader, increments fencing token, broadcasts NEW_LEADER { fencing_token } to all peers.

Leader responsibilities: Arbitration only — no lock grants. The leader resolves deadlock cycles (via wait-for graph analysis) and allocates FencingToken values for lock requests. Lock grants remain distributed across all nodes.

Crash recovery: On leader crash, election re-runs automatically. Lock requests with fencing_token < current_fencing_token are rejected (stale, issued by pre-crash leader). Requestors retry with a fresh token.
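The stale-token rejection described above can be sketched as follows (illustrative names, not the DLM's actual types):

```rust
/// Fencing-token validation at a lock participant. Tokens are
/// monotonically increasing u64s issued by the leader; any request
/// carrying a token below the highest token seen is stale.
struct FencingGate {
    highest_seen: u64,
}

impl FencingGate {
    fn new() -> Self {
        FencingGate { highest_seen: 0 }
    }

    /// Accept the request iff its token is >= every token seen so far,
    /// then remember it. Requests issued by a pre-crash leader carry
    /// older tokens and are rejected; the requestor retries with a
    /// fresh token from the new leader.
    fn admit(&mut self, token: u64) -> bool {
        if token < self.highest_seen {
            return false; // stale: issued before the last election
        }
        self.highest_seen = token;
        true
    }
}
```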

Raft WAL Boot Recovery

When a node restarts after a crash, it must reconstruct Raft state from the persisted WAL before participating in the cluster. The recovery procedure:

raft_recover(wal: &RaftWal) -> Result<(RaftPersistentState, RaftVolatileState), RaftError>:

  // Phase 1: Validate and replay WAL segments.
  // Segments are ordered by creation timestamp (filename encodes epoch).
  // Within each segment, entries are sequential by log index.
  segments = list_wal_segments(wal.path)  // sorted oldest → newest
  log = Vec::new()
  current_term = 0
  voted_for = None
  last_valid_index = 0

  for segment in segments:
    offset = 0
    while offset < segment.len():
      entry = parse_raft_log_entry(segment, offset)
      if entry.crc != crc32(entry.payload):
        // CRC mismatch: truncate this segment at the corruption point.
        // All entries after the corruption are discarded — they were not
        // fsynced before the crash. This is safe because Raft's replication
        // protocol will resend any entries that the leader has committed
        // but this follower has lost.
        truncate_segment(segment, offset)
        break
      log.push(entry)
      if entry.term > current_term:
        current_term = entry.term
      if entry.voted_for.is_some():
        voted_for = entry.voted_for
      last_valid_index = entry.index
      offset += entry.serialized_size()

  // Phase 2: Apply latest snapshot (if exists).
  // Snapshots compact the log up to snapshot_index.
  snapshot = load_latest_snapshot(wal.path)
  if snapshot.is_some():
    apply_snapshot_to_state_machine(snapshot)
    discard_log_entries_before(snapshot.last_included_index)

  // Phase 3: Reconstruct volatile state.
  // commit_index and last_applied are initialized conservatively:
  // - commit_index = 0: the leader will inform us of the true commit_index
  //   via AppendEntries RPCs after we rejoin.
  // - last_applied = snapshot.last_included_index (or 0 if no snapshot):
  //   we replay committed entries from the snapshot point forward.
  // Voter set is reconstructed from the latest ConfigChange entry in the
  // recovered log (or from the snapshot's config if no ConfigChange entries
  // exist post-snapshot). This ensures a crashed node knows whether it was
  // a voter or learner without contacting the leader.
  persistent = RaftPersistentState { current_term, voted_for, log }
  volatile = RaftVolatileState {
    commit_index: 0,  // leader will update via AppendEntries
    last_applied: snapshot.map(|s| s.last_included_index).unwrap_or(0),
  }

  // Phase 4: Replay committed log entries from last_applied+1 to the last
  // entry whose commit is known. On a fresh restart, commit_index=0 means
  // no replay happens until the leader sends AppendEntries. This is correct:
  // the Raft protocol guarantees that the leader will bring us up to date.

  return Ok((persistent, volatile))

Post-recovery leader behavior: A node that wins an election after recovery must probe all followers to reconstruct RaftLeaderState:

  • next_index[follower] = leader's last log index + 1 (optimistic; decremented on AppendEntries rejection per standard Raft protocol).
  • match_index[follower] = 0 (unknown; updated as AppendEntries succeed).
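The probe initialization can be sketched as follows (field names follow the text; the struct shape is assumed):

```rust
/// Volatile per-follower replication state, rebuilt after each election.
struct RaftLeaderState {
    next_index: Vec<u64>,  // next log index to send to each follower
    match_index: Vec<u64>, // highest index known replicated on each follower
}

fn init_leader_state(num_followers: usize, last_log_index: u64) -> RaftLeaderState {
    RaftLeaderState {
        // Optimistic: assume each follower is fully caught up; decremented
        // (or jumped via fast-rollback conflict hints) on rejection.
        next_index: vec![last_log_index + 1; num_followers],
        // Pessimistic: nothing is known to be replicated until acked.
        match_index: vec![0; num_followers],
    }
}
```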

WAL corruption beyond CRC: If the WAL header itself is corrupted (cannot parse segment metadata), the node refuses to start and logs a fatal error. Operator intervention is required (restore from snapshot or remove the node from the cluster and re-add as a fresh follower).

5.8.1.8 DSM Recovery After Node Failure

Node B fails (holds exclusive ownership of some DSM pages):

1. Heartbeat timeout → Node B marked Dead.
2. For each DSM page where B was owner:
   a. Home node (determined by hash) still has the directory entry.
   b. If B was SharedOwner: readers still have valid copies.
      → Promote one reader to owner (pick closest to home node).
   c. If B was Exclusive: page data is LOST (only copy was on B).
      → Mark page as "lost." Processes faulting on it get SIGBUS.
      → Application must handle this (checkpoint/restart).
3. For each DSM page where B was a reader:
   a. Simply remove B from reader set. No data loss.
4. Capabilities issued by B expire naturally (bounded lifetime).
   Remote nodes stop accepting B-signed capabilities immediately.

**Mitigation for exclusive page loss:**
  - DSM regions can be created with `DSM_REPLICATE` flag (replication_factor = 2).
  - Every write to an exclusive page is mirrored to a backup node.
  - On failure: backup is promoted to owner. No data loss.
  - Cost: 2x write bandwidth for replicated regions.

Design rationale — why replication is NOT the default:

DSM is a performance optimization, not a durability mechanism. The default (replication_factor = 1) is correct for the typical DSM use case:

  1. Ephemeral data: Caches, temporary buffers, computational scratch space. Loss on node failure is acceptable — the data can be recomputed or reloaded from source.

  2. Read-mostly workloads: Configuration data, reference tables, shared code pages. These are typically backed by persistent storage; losing the in-memory copy just means reloading from disk.

  3. Application-managed durability: Databases, distributed file systems, and message queues implement their own replication and checkpointing. Adding DSM-level replication would be redundant and wasteful.

  4. Performance sensitivity: The 2x bandwidth cost and ~15% latency overhead of synchronous replication would penalize all DSM users, even those who don't need it.

When to enable replication:

  • DSM regions holding irreplaceable data without application-level persistence.
  • Workloads where recomputation cost exceeds replication cost.
  • Environments where node failure is frequent enough to justify the overhead.

Durability is the application's responsibility: Just as applications using mmap() or malloc() must implement their own persistence, DSM users must decide whether their data warrants replication. The kernel provides the mechanism (DSM_REPLICATE); the policy is left to the application.

5.8.1.8.1 Home Node Failure — Directory Reconstruction

When a home node dies, page directory entries hashed to it are lost. The new home (determined by rehashing with the dead node removed from the hash ring) must reconstruct the directory from surviving peers' local page metadata.

Reconstruction protocol:

1. New home H' detects home reassignment (DeadNotify for old home H).
2. H' sends DsmDirReconstruct (PeerMessageType = 0x0330) to all peers
   in the affected region.
3. Each surviving peer P scans its local DsmPageMeta for pages whose home
   was the dead node H (home_slot matches H's old slot). For each such page,
   P reports its local state.
4. H' collects reports, resolves conflicts, builds new directory entries.
5. H' sends DsmDirReconstructComplete (0x0332) to all peers. Normal
   coherence traffic resumes for these pages.

Wire messages:

DsmDirReconstruct         = 0x0330,  // New home → all region peers
DsmDirReconstructReport   = 0x0331,  // Peer → new home: local page states
DsmDirReconstructComplete = 0x0332,  // New home → all: reconstruction done

DsmDirReconstructPayload (24 bytes):

#[repr(C)]
pub struct DsmDirReconstructPayload {
    /// Region identifier. Le64 (not Le32 — all DSM region IDs are u64).
    pub region_id: Le64,
    /// Slot of the dead home node (peers match against DsmPageMeta.home_slot).
    pub dead_home_slot: Le16,                     // RegionSlotIndex
    pub _pad: [u8; 6],
    /// New home's peer ID (for peers to redirect future messages).
    pub new_home: Le64,                           // PeerId
}
// Wire format: region_id(8) + dead_home_slot(2) + _pad(6) + new_home(8) = 24 bytes.
const_assert!(core::mem::size_of::<DsmDirReconstructPayload>() == 24);

DsmDirReconstructReportPayload (variable):

#[repr(C)]
pub struct DsmDirReconstructReportPayload {
    /// Region identifier (u64, matches all other DSM region IDs).
    pub region_id: Le64,
    pub reporting_peer_slot: Le16,                 // RegionSlotIndex
    /// Number of page entries in this report.
    pub page_count: Le16,
    pub _pad: [u8; 4],
    // Followed by `page_count` × DsmPageReport entries.
}
// Wire format (fixed header): region_id(8) + reporting_peer_slot(2) + page_count(2) + _pad(4) = 16 bytes.
const_assert!(core::mem::size_of::<DsmDirReconstructReportPayload>() == 16);

/// Per-page state report from a surviving peer. 32 bytes.
///
/// For DSM_CAUSAL regions, the new home must reconstruct both the MOESI
/// directory AND the per-page causal vector clock entries. Without the
/// vector clock, the new home cannot enforce causal ordering for pages
/// whose old home was the dead node — readers could observe writes out
/// of causal order because the home's authoritative clock snapshot was
/// lost with the dead node. The `vc_own_slot_value` field carries the
/// reporting peer's own vector clock entry for this page, which is the
/// only component the peer can authoritatively report (each peer knows
/// its own slot's clock value). The new home reconstructs the full
/// vector clock by taking the element-wise maximum across all reports.
///
/// For non-causal regions (Release, Eventual, Synchronous, Relaxed),
/// the `vc_own_slot_value` field is zero and ignored by the receiver.
#[repr(C)]
pub struct DsmPageReport {
    /// Physical page address (4KB-aligned).
    pub page_addr: Le64,                           // 8 bytes
    // Note: u8 fields are endian-neutral (single byte) — no LeXX wrapper needed
    // per §6.1.1 wire type policy exception.
    /// This peer's local state for the page.
    pub local_state: u8,                           // DsmPageState (M/O/E/S/I)
    /// Whether this peer has dirty (uncommitted) data.
    pub is_dirty: u8,                              // 1 byte (0 or 1)
    /// DsmConsistency mode of the region (needed so receiver knows
    /// whether to interpret vc_own_slot_value). DsmConsistency has only
    /// 5 variants (0-4); u8 wire encoding is space-efficient with no
    /// information loss. Use `DsmConsistency::from_wire_u8_validated()` to convert.
    pub consistency_mode: u8,                      // 1 byte
    pub _pad: [u8; 5],                             // 5 bytes (align to 8)
    /// The reporting peer's vector clock entry for this page.
    /// Only meaningful for DSM_CAUSAL regions; zero otherwise.
    /// The reporting peer's RegionSlotIndex is in the enclosing
    /// DsmDirReconstructReportPayload.reporting_peer_slot, so the
    /// receiver knows which vector clock slot this value belongs to.
    pub vc_own_slot_value: Le64,                   // 8 bytes
    /// Last causal stamp epoch this peer observed for this page.
    /// Used by the new home to reconstruct the page's
    /// `last_stamp_epoch` field. Zero for non-causal regions.
    pub last_stamp_epoch: Le64,                    // 8 bytes
}
// Wire format: page_addr(8)+local_state(1)+is_dirty(1)+consistency_mode(1)+_pad(5)+vc_own_slot_value(8)+last_stamp_epoch(8) = 32 bytes.
const_assert!(core::mem::size_of::<DsmPageReport>() == 32);

Conflict resolution at new home H':

| Reports received | New directory state | Action |
|------------------|---------------------|--------|
| One peer reports M | Modified, that peer is owner | Set owner, sharers empty |
| One peer reports O, others report S | Owned, O-peer is owner, S-peers in sharers | Set owner + sharers bitmap |
| One peer reports E | Exclusive, that peer is owner | Set owner, sharers empty |
| Multiple peers report S, none M/O/E | Shared, sharers = reporting S-peers | No owner, build sharers bitmap |
| No reports for a page | Uncached | Page was only on dead node → lost (SIGBUS on fault) |
| Two peers report M (impossible in correct protocol) | Error: protocol violation | FMA event, arbitrarily pick one, invalidate other |

Vector clock reconstruction (DSM_CAUSAL regions): For causal regions, the new home H' reconstructs each page's vector clock from the collected reports:

  1. Allocate a fresh vector clock array (size = region's max_participants).
  2. For each DsmPageReport where consistency_mode == Causal: set vc[reporting_peer_slot] = max(vc[reporting_peer_slot], vc_own_slot_value).
  3. For each slot belonging to the dead node: leave it at zero (the dead node's writes are either already visible via other peers' merged clocks, or lost).
  4. Set the page's last_stamp_epoch = max(all reported last_stamp_epoch values).

This produces a conservative clock: any write that was visible to at least one surviving peer is reflected. The resulting clock may be slightly ahead of the true causal frontier (harmless — it only causes unnecessary waits, not missed causality). If no peer reports a causal entry for a page, the clock starts at zero — equivalent to "no causal history known," which forces a full synchronous fetch on the next read (safe fallback).
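The element-wise maximum merge used in this reconstruction can be sketched as follows (helper name assumed):

```rust
/// Rebuild a page's vector clock at the new home from surviving peers'
/// reports. Each report contributes one authoritative slot value (the
/// reporter's own slot); the merge is an element-wise max. Slots with no
/// report (including the dead node's) stay at zero.
fn rebuild_vector_clock(
    max_participants: usize,
    reports: &[(usize, u64)], // (reporting_peer_slot, vc_own_slot_value)
) -> Vec<u64> {
    let mut vc = vec![0u64; max_participants];
    for &(slot, value) in reports {
        vc[slot] = vc[slot].max(value);
    }
    vc
}
```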

Bitmap on the wire: Note that DsmDirReconstructReport does NOT carry RegionBitmap data — it carries per-page state reports (page address + local MOESI state + optional vector clock entry). The new home builds the RegionBitmap fields (sharers, copyset) locally from the collected reports. Bitmaps are a directory-internal data structure; they never travel on the wire.

Incremental reconstruction: Pages become accessible individually as their directory entries are reconstructed — there is no cluster-wide quiescent period that blocks all affected pages:

  1. New home H' collects DsmDirReconstructReport messages from all peers.
  2. As reports arrive, H' builds directory entries page-by-page. Each page's entry is marked Reconstructing until all peers have reported for it.
  3. Once all reports for a specific page are collected (or the per-page timeout of 5 seconds expires), H' finalizes that page's directory entry and begins accepting coherence messages for it immediately.
  4. Pages for which no peer reported within the timeout are marked Uncached (lost → SIGBUS on fault).
  5. DsmDirReconstructComplete is sent after ALL pages are finalized. This is informational — peers have already been accessing reconstructed pages incrementally.

Coherence messages for pages still in Reconstructing state are buffered at H' (not dropped). The buffer is bounded: if more than 4096 messages accumulate for a single page, the oldest are NACKed (requestor retries after DsmDirReconstructComplete). In practice, most pages are reconstructed within 1-2 RTTs (~10 μs) as reports arrive, so the buffer rarely grows.

5.8.1.8.2 Bitmap Wire Format — Design Note

Per-region bitmaps (RegionBitmap, Decision 7 in design decisions) are directory-internal data structures that do not appear on the DSM coherence wire protocol. The MOESI protocol operates point-to-point:

  • Home → sharers: home iterates the sharer bitmap locally and sends individual Inv or FwdGetS messages to each sharer. No bitmap on the wire.
  • Requester → home: GetS, GetM, PutM etc. carry only the requester's PeerId (8 bytes in DsmWireHeader.peer_id). Home updates the bitmap locally.
  • Home → requester: DataResp carries ack_count (a count derived from the bitmap, not the bitmap itself). AckCount is a u32 in DsmWireHeader.aux.
  • Owner → requester: DataFwd carries page data + sender's PeerId. No bitmap.
  • Recovery: DsmDirReconstructReport carries per-page state reports, not bitmaps. New home builds bitmaps from reports.

The only place bitmap-sized data appears on the wire is SlotCompactionPayload (Section 6.8), which carries explicit (old_slot, new_slot, peer_id) remapping entries — a slot mapping table, not a raw bitmap.

Message framing: All DSM coherence messages carry region_id in DsmWireHeader. The receiver looks up region metadata to determine max_participants and therefore the bitmap size W = ceil(max_participants / 64). This is used only for local directory operations, not for parsing wire messages (which are fixed-size per message type, as shown in the transport binding table).
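The sizing rule from the framing note can be written as a one-line helper (name assumed):

```rust
/// Number of u64 words in a region's sharer bitmap:
/// W = ceil(max_participants / 64).
fn bitmap_words(max_participants: usize) -> usize {
    (max_participants + 63) / 64
}
```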

5.8.1.9 Clock Synchronization

Distributed capabilities (Section 5.7) and DSM timeouts rely on nodes having synchronized clocks. UmkaOS requires PTP (IEEE 1588 Precision Time Protocol) as the primary clock synchronization mechanism:

  • PTP grandmaster: One node (or a dedicated PTP appliance) serves as the time reference. All other nodes synchronize to it via hardware PTP timestamping on the RDMA NIC.
  • Expected accuracy: <1 μs with hardware PTP (typical for modern RDMA NICs with PTP hardware timestamping support).
  • NTP fallback: If PTP is not available (no hardware support), NTP is used as a fallback. Expected accuracy: 1-10 ms. When using NTP, capability expiry grace period is increased to 100ms (from the default 1ms with PTP).
  • Maximum acceptable skew: 1ms for PTP deployments. Capability expiry includes a 1ms grace period to account for this skew. Nodes with clock skew exceeding 10ms trigger an FMA alert (reported as a HealthEvent with class: HealthEventClass::Network, event code CLOCK_SKEW_EXCEEDED — clock skew is a network-level health event, not its own event class).
  • Clock skew estimation: Each heartbeat message (Section 5.8) includes the sender's timestamp. The receiver estimates one-way clock skew as (remote_ts - local_ts - RTT/2). Persistent skew > 1ms triggers a PTP resynchronization.
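
As a concrete sketch of the estimator above (function and parameter names are illustrative, not from the spec), the receiver computes the one-way skew from the heartbeat timestamp and the measured RTT, and flags persistent skew above 1ms for PTP resynchronization:

```rust
/// One-way clock skew estimate per the heartbeat formula above:
/// skew ≈ remote_ts - local_ts - RTT/2. All values in nanoseconds.
/// (Illustrative sketch; names are not from the spec.)
fn estimate_skew_ns(remote_ts_ns: i64, local_ts_ns: i64, rtt_ns: i64) -> i64 {
    remote_ts_ns - local_ts_ns - rtt_ns / 2
}

/// Persistent skew beyond 1ms (in either direction) triggers a PTP
/// resynchronization on the drifting node.
fn needs_ptp_resync(skew_ns: i64) -> bool {
    skew_ns.unsigned_abs() > 1_000_000
}
```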

5.8.1.10 Clock Drift Failure Handling

Clock drift beyond acceptable bounds threatens correctness of distributed capabilities, DSM timeouts, DLM lease validation, and fencing token expiry. The kernel uses a tiered response model based on measured drift magnitude:

/// Action to take when clock drift is detected on a peer.
/// Determined by comparing the estimated one-way clock skew (from
/// heartbeat timestamps) against configurable thresholds.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[repr(u8)]
pub enum ClockDriftAction {
    /// Drift < 10ms: Normal operation. No corrective action.
    /// PTP/NTP is handling synchronization within acceptable bounds.
    /// This is the steady-state for healthy clusters.
    Normal = 0,

    /// 10ms ≤ drift < 100ms: Warning zone.
    /// - Emit FMA warning event (`CLOCK_DRIFT_WARNING`, severity: degraded).
    /// - Increase heartbeat tolerance: temporarily raise `suspect_threshold`
    ///   by +2 (e.g., from 3 → 5 missed heartbeats before Suspect) to avoid
    ///   false suspect transitions caused by clock-skewed timestamp checks.
    /// - Widen capability expiry grace period from 1ms to `drift + 10ms`
    ///   (prevents premature capability revocation on the drifting node).
    /// - Trigger PTP forced resynchronization on the drifting node.
    /// - Log once per 10-second interval (avoid log spam).
    Warn = 1,

    /// Drift ≥ 100ms: Clock suspect. Node is fenced from time-sensitive
    /// distributed subsystems.
    /// - Emit FMA critical event (`CLOCK_SUSPECT`, severity: faulted).
    /// - Mark peer as `PeerStatus::ClockSuspect` in PeerRegistry.
    /// - Fence the suspect peer from DLM: reject all new lock grant
    ///   requests; existing locks continue but are not renewable.
    /// - Fence the suspect peer from DSM: reject RegionJoin requests;
    ///   existing regions are drained (invalidate pages, revoke
    ///   write permissions) within the DSM drain timeout (120s).
    /// - Increase Raft election timeout for this peer to prevent it
    ///   from triggering spurious leader elections.
    /// - Capability validation for this peer uses server-side timestamps
    ///   only (ignores the suspect peer's claimed timestamps).
    /// - Recovery: when drift returns below 10ms for 30 consecutive
    ///   heartbeat intervals (3 seconds at default 100ms interval),
    ///   clear ClockSuspect status and restore normal operation.
    Suspect = 2,
}

/// Evaluate clock drift for a peer and determine the action.
/// Called on every heartbeat receive after computing skew estimate.
fn evaluate_clock_drift(skew_ms: u64) -> ClockDriftAction {
    if skew_ms < 10 {
        ClockDriftAction::Normal
    } else if skew_ms < 100 {
        ClockDriftAction::Warn
    } else {
        ClockDriftAction::Suspect
    }
}

Drift threshold summary:

| Measured drift | ClockDriftAction | Heartbeat tolerance | DLM | DSM | Capabilities | FMA event |
|---|---|---|---|---|---|---|
| < 10ms | Normal | Default | Normal | Normal | Normal (1ms grace) | None |
| 10-100ms | Warn | +2 to suspect threshold | Normal | Normal | Widened grace period | CLOCK_DRIFT_WARNING (degraded) |
| ≥ 100ms | Suspect | N/A (peer fenced) | Fenced (no new locks) | Fenced (drain regions) | Server-side timestamps only | CLOCK_SUSPECT (faulted) |

Hysteresis: The Suspect → Normal transition requires 30 consecutive heartbeats (3 seconds) with drift < 10ms. The Warn → Normal transition requires 10 consecutive heartbeats (1 second) with drift < 10ms. This prevents oscillation when drift hovers near a threshold boundary.
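
The hysteresis rule can be sketched as a small state tracker (illustrative; the real kernel type names differ). Escalations take effect immediately, while de-escalation to Normal requires the consecutive-healthy streak:

```rust
/// Drift severity, mirroring ClockDriftAction (sketch).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Drift { Normal = 0, Warn = 1, Suspect = 2 }

struct DriftHysteresis {
    state: Drift,
    healthy: u32, // consecutive heartbeats with drift < 10ms
}

impl DriftHysteresis {
    fn new() -> Self { Self { state: Drift::Normal, healthy: 0 } }

    /// Feed one heartbeat's measured drift (ms); returns the effective state.
    fn on_heartbeat(&mut self, drift_ms: u64) -> Drift {
        let measured = if drift_ms < 10 { Drift::Normal }
            else if drift_ms < 100 { Drift::Warn }
            else { Drift::Suspect };
        if measured >= self.state {
            // Escalation (or no change) applies immediately; streak resets.
            self.state = measured;
            self.healthy = 0;
        } else if measured == Drift::Normal {
            // De-escalation needs 30 consecutive healthy heartbeats from
            // Suspect, 10 from Warn.
            self.healthy += 1;
            let needed = if self.state == Drift::Suspect { 30 } else { 10 };
            if self.healthy >= needed {
                self.state = Drift::Normal;
                self.healthy = 0;
            }
        } else {
            // e.g. Suspect peer measuring in the Warn band: not healthy.
            self.healthy = 0;
        }
        self.state
    }
}
```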

Interaction with split-brain detection: A ClockSuspect peer counts toward quorum for split-brain detection (it is alive, just clock-drifted) but cannot participate in Raft leader election until its clock stabilizes. This prevents a drifting node from becoming leader and issuing fencing tokens with incorrect timestamps.

5.8.2 Graceful Shutdown Protocol

Section 5.8 handles unplanned failure (crash, power loss). This section specifies the planned departure path: a peer gracefully leaving the cluster with zero data loss and minimal disruption to remaining peers.

Graceful shutdown is triggered by admin command (umka-cluster leave), the reboot(2) syscall, or a maintenance workflow (firmware update, hardware replacement). The protocol ensures that all subsystems drain their state in the correct dependency order, all dirty data is written back, all locks are released, and remaining peers have time to redirect traffic before the departing peer disconnects.

5.8.2.1 Protocol Overview

Graceful shutdown proceeds in seven phases, numbered 0 through 6:

Phase 0: INITIATION
  Admin command or reboot(2) → departing peer sets PeerStatus::Leaving locally.
  Raft logs SetPeerState { peer_id, state: Leaving } via ClusterMetadataOp.

Phase 1: EXCLUSION (immediate, <1ms)
  Departing peer is removed from new work placement:
  - Cluster scheduler ([Section 5.6](#cluster-aware-scheduler)): excluded from
    migration targets.
  - DLM ([Section 15.15](15-storage.md#distributed-lock-manager)): stop granting
    new locks to this peer.
  - DSM ([Section 6.2](06-dsm.md#dsm-design-overview)): reject RegionJoin requests
    from this peer.
  - Capability services: stop accepting new client connections.

Phase 2: SUBSYSTEM DRAIN (ordered per dependency DAG, §5.8.2.2)
  Each subsystem drains its state, completing in-flight operations and
  transferring ownership. Drains proceed in DAG order — each subsystem
  starts only after its dependencies have completed.

Phase 3: LEAVENOTIFY BROADCAST
  After all subsystem drains complete locally, the departing peer broadcasts
  LeaveNotify ([Section 5.1](#distributed-kernel-architecture--message-payload-structs)) to all peers.

Phase 4: DRAIN ACKNOWLEDGMENT
  Departing peer collects LeaveAck from every Alive peer. Each LeaveAck
  indicates which subsystems the responding peer has finished draining on
  its end (its own in-flight requests to the departing peer).

Phase 5: FINAL CLEANUP
  QP tear-down ([Section 5.4](#rdma-native-transport-layer--transport-teardown)), RDMA memory region
  deregistration, service endpoint closure.

Phase 6: DEPARTURE
  Departing peer sends LeaveComplete and disconnects.
  Remaining peers transition the departing peer from Leaving → Dead → removed
  from PeerRegistry.

Total shutdown timeout: LeaveNotifyPayload.drain_timeout_ms (default: 300,000 ms = 5 minutes). If the total timeout expires before Phase 6, the departing peer force-disconnects and remaining peers handle it via the crash recovery path (Section 5.8).

5.8.2.2 Subsystem Drain Ordering

Subsystems drain in a strict dependency order. Each subsystem may only begin draining after all subsystems it depends on have completed:

Scheduler ──→ Capability Services ──→ DLM ──→ DSM ──→ Raft ──→ IPC Rings ──→ RDMA Transport
| Order | Subsystem | Drain Action | Depends On | Timeout |
|---|---|---|---|---|
| 1 | Cluster scheduler | Recall lightweight-migrated threads; cancel pending migrations | — | 30s |
| 2 | Capability services | ServiceDrainNotify to clients; drain queues; hand off to alternates | Scheduler | 60s |
| 3 | DLM | Release all held locks (reverse acquisition order); notify masters | Services | 30s |
| 4 | DSM | PutM/PutO all dirty pages; PutS all shared copies; RegionLeave all regions | DLM | 120s |
| 5 | Raft | Voter stepdown / leadership transfer | DSM | 10s |
| 6 | IPC rings | Drain RDMA ring buffers; flush pending CQEs | All above | 5s |
| 7 | RDMA transport | QP tear-down | IPC rings | 5s |

Dependency rationale:

  • Services before DLM: Active block/VFS operations hold DLM locks. Draining services first ensures no new lock-protected I/O is generated. Only then can DLM release locks without risking data corruption from in-flight write operations.
  • DLM before DSM: Subscriber-controlled caching (Section 6.12) ties DLM lock state to DSM dirty pages. DLM lock release triggers dsm_writeback() for dirty pages covered by the lock. Only after all DLM locks are released can DSM safely write back remaining dirty pages and leave regions.
  • DSM before Raft: DSM drain may generate ClusterMetadataOp entries (e.g., SetPeerState updates). These must commit to the Raft log before the voter steps down. Raft drains last among stateful subsystems.
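
Because the dependency chain is linear, the worst-case serial drain time is simply the sum of the per-subsystem timeouts — 260 seconds, which fits under the default 300,000 ms total shutdown timeout with a 40-second margin for the remaining phases. A sketch of that budget check (names illustrative):

```rust
/// Per-subsystem drain timeouts from the ordering table, in seconds.
const DRAIN_CHAIN: [(&str, u64); 7] = [
    ("scheduler", 30),
    ("capability-services", 60),
    ("dlm", 30),
    ("dsm", 120),
    ("raft", 10),
    ("ipc-rings", 5),
    ("rdma-transport", 5),
];

/// Worst-case serial drain time: the dependency chain is linear,
/// so the bound is the plain sum of the timeouts.
fn worst_case_drain_s() -> u64 {
    DRAIN_CHAIN.iter().map(|&(_, t)| t).sum()
}
```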

5.8.2.3 Wire Messages

New PeerMessageType values, extending the control plane range adjacent to existing LeaveNotify (0x0020) and DeadNotify (0x0021):

// Graceful shutdown messages — canonical codes from [Section 5.1](#distributed-kernel-architecture):
LeaveAck              = 0x0022,  // canonical (shared)
DrainNotify           = 0x0023,  // canonical: departing → all peers
DrainAck              = 0x0024,  // canonical: peer → departing
MigrateStart          = 0x0025,  // canonical: departing → successor
MigrateAck            = 0x0026,  // canonical: successor → departing
DlmDrainComplete      = 0x0027,  // canonical (shared)
// Service-level drain (extends the core graceful shutdown protocol):
DrainProgress         = 0x0028,  // periodic progress report during drain
LeaveComplete         = 0x0029,  // departing peer final disconnect notification
ServiceDrainNotify    = 0x002A,  // per-service drain initiation
ServiceDrainAck       = 0x002B,  // per-service drain acknowledgement

Wire payload structs:

/// Sent by each remaining peer to the departing peer after local drain
/// preparation is complete. The bitmask indicates which subsystems the
/// responding peer has finished draining on its end (no more in-flight
/// requests to the departing peer for those subsystems).
///
/// Total: 16 bytes.
#[repr(C)]
pub struct LeaveAckPayload {
    /// Peer sending the acknowledgment.
    pub acking_peer: Le64,                      // 8 bytes (PeerId)
    /// Bitmask of subsystems this peer has finished draining.
    /// bit 0 = scheduler, bit 1 = capability services, bit 2 = DLM,
    /// bit 3 = DSM, bit 4 = Raft, bit 5 = IPC rings, bit 6 = RDMA transport.
    /// A peer may send multiple LeaveAcks with progressively more bits set
    /// as each subsystem completes.
    pub drain_subsystem_mask: Le32,             // 4 bytes
    pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<LeaveAckPayload>() == 16);
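
The bit assignments documented above can be expressed directly; a departing peer tracking cumulative LeaveAcks needs only an OR to merge each ack and a mask test for completion (sketch; constant names are illustrative):

```rust
/// Subsystem bits in LeaveAckPayload.drain_subsystem_mask.
const SCHEDULER: u32      = 1 << 0;
const SERVICES: u32       = 1 << 1;
const DLM: u32            = 1 << 2;
const DSM: u32            = 1 << 3;
const RAFT: u32           = 1 << 4;
const IPC_RINGS: u32      = 1 << 5;
const RDMA: u32           = 1 << 6;
const ALL_SUBSYSTEMS: u32 = (1 << 7) - 1; // 0x7F

/// Fold one more LeaveAck into a peer's cumulative drain state.
/// LeaveAcks arrive with progressively more bits set, so OR is enough.
fn merge_ack(cumulative: u32, ack_mask: u32) -> u32 {
    cumulative | ack_mask
}

/// True once every subsystem bit has been seen from this peer.
fn peer_fully_drained(cumulative: u32) -> bool {
    cumulative & ALL_SUBSYSTEMS == ALL_SUBSYSTEMS
}
```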

/// Informational progress report from the departing peer. Remaining peers
/// log this for diagnostics but do not act on it. Useful for operators
/// monitoring a long drain (e.g., DSM writeback of millions of dirty pages).
///
/// Total: 28 bytes (8 + 1 + 1 + 2 + 4 + 8 + 4).
#[repr(C)]
pub struct DrainProgressPayload {
    pub leaving_peer: Le64,                     // 8 bytes (PeerId)
    /// Current drain phase (0-6, matching the phase table in §5.8.2.1).
    pub phase: u8,                              // 1 byte
    /// Subsystem currently draining (0=scheduler, 1=services, ..., 6=RDMA).
    pub subsystem: u8,                          // 1 byte
    pub _pad: [u8; 2],                          // 2 bytes
    /// Estimated completion percentage (0-100) for the current subsystem.
    pub progress_pct: Le32,                     // 4 bytes
    /// Number of items remaining (dirty pages, held locks, pending I/Os, etc.).
    pub items_remaining: Le64,                  // 8 bytes
    pub _pad2: [u8; 4],                         // 4 bytes
}
const_assert!(core::mem::size_of::<DrainProgressPayload>() == 28);

/// Final departure signal. After sending this, the departing peer disconnects.
/// Remaining peers transition its status from Leaving → Dead → removed.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct LeaveCompletePayload {
    pub leaving_peer: Le64,                     // 8 bytes (PeerId)
    /// Final Raft epoch at departure. Remaining peers use this to confirm
    /// all metadata operations from the departing peer have been committed.
    pub final_epoch: Le64,                      // 8 bytes
}
const_assert!(core::mem::size_of::<LeaveCompletePayload>() == 16);

/// Sent by the departing peer to each connected client of a capability service
/// (block, filesystem, accelerator). Notifies the client to complete or cancel
/// outstanding requests and reconnect to an alternative provider.
///
/// Total: 32 bytes.
#[repr(C)]
pub struct ServiceDrainNotifyPayload {
    /// FNV-1a hash of the `WireServiceId` being drained. Matches the
    /// `CapResponsePayload.service_id_hash` convention. The full
    /// `WireServiceId` was exchanged during ServiceBind and is cached
    /// in the per-peer service table; the drain handler uses this hash
    /// to index the cached table.
    pub service_id_hash: Le64,                  // 8 bytes
    /// Time (ms) the client has to complete outstanding requests.
    /// After this timeout, the departing peer will force-close the service.
    pub drain_timeout_ms: Le32,                 // 4 bytes
    pub _pad: [u8; 4],
    /// Alternative peer that can serve the same resource. Discovered via
    /// PeerRegistry capability flags ([Section 5.2](#cluster-topology-model--membership-and-topology)).
    /// Zero if no alternative is known — the client must fall back to a
    /// local driver or accept unavailability.
    pub alternative_peer: Le64,                 // 8 bytes (PeerId)
    pub _pad2: [u8; 8],
}
const_assert!(core::mem::size_of::<ServiceDrainNotifyPayload>() == 32);

/// Client acknowledges service drain. Sent after the client has completed
/// or cancelled all outstanding requests for this service and is ready
/// for the departing peer to close the service endpoint.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct ServiceDrainAckPayload {
    pub service_id_hash: Le64,                  // 8 bytes (FNV-1a hash, matches ServiceDrainNotifyPayload.service_id_hash)
    pub acking_peer: Le64,                      // 8 bytes (PeerId)
}
const_assert!(core::mem::size_of::<ServiceDrainAckPayload>() == 16);

/// Sent by the departing peer to each DLM master after releasing all locks
/// on resources mastered by that master. The master removes the departing
/// peer from all lock resource holder/waiter lists.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct DlmDrainCompletePayload {
    pub draining_peer: Le64,                    // 8 bytes (PeerId)
    /// Number of lock resources released on this master.
    pub released_count: Le32,                   // 4 bytes
    pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<DlmDrainCompletePayload>() == 16);

5.8.2.4 Per-Subsystem Drain Protocols

1. Cluster Scheduler Drain

  • Enumerate all lightweight-migrated threads where ThreadMigrationState.home_peer is NOT this peer (i.e., threads from other peers running here). For each, initiate return migration via the thread migration wire protocol (Section 5.6): send ThreadMigrateRequest back to home_peer.
  • Cancel any pending full-process migrations where this peer is source or destination. In-flight migrations that have already transferred register state are completed (the destination peer owns the process); pending migrations that haven't started transfer are aborted.
  • Processes that were full-migrated FROM this peer but currently run on remote peers require no action — the remote peer owns execution. File descriptor proxies and home-peer metadata are cleaned up in Phase 5 (IPC ring drain).

2. Capability Service Drain

For each active capability service (block/VFS/accel) on this peer:

  1. Send ServiceDrainNotify to every connected client peer. The alternative_peer field is populated by querying the PeerRegistry for another peer with matching PeerCapFlags that provides an equivalent service (same block device, same filesystem mount, same accelerator type). If no alternative exists, set to 0.
  2. Clients that receive ServiceDrainNotify must:
     a. Complete or cancel all outstanding requests within drain_timeout_ms.
     b. If alternative_peer != 0, begin connecting to the alternative.
     c. Send ServiceDrainAck when ready.
  3. After all ServiceDrainAck messages received (or drain_timeout_ms expires): close the PeerServiceEndpoint.

Service-specific drain behavior is defined in the respective subsystem specs:

  • Block: Section 15.13 — FLUSH all queues before closing. Requests submitted after ServiceDrainNotify are rejected with status -ESHUTDOWN.
  • VFS: Section 14.11 — Close-to-open flush for all open files. Lease invalidation broadcast to all clients. Grace period for clients to reclaim open file state on the alternative server.
  • Accelerator: Section 22.7 — Wait for in-flight command buffers to complete. Running inference/compute contexts are migrated to alternative_peer if available, or returned as error to the submitting process.
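
The client-side decision on receiving ServiceDrainNotify reduces to a three-way dispatch on alternative_peer and local device accessibility. A minimal sketch (PeerId simplified to u64; type and function names are illustrative, not from the spec):

```rust
/// Client-side outcome when a capability service drains (sketch).
#[derive(Debug, PartialEq)]
enum DrainOutcome {
    /// alternative_peer != 0: reconnect to the named peer.
    ReconnectTo(u64),
    /// No alternative, but the underlying device is locally accessible.
    LocalFallback,
    /// No alternative and no local driver: report unavailability.
    Unavailable,
}

fn handle_service_drain(alternative_peer: u64, local_driver: bool) -> DrainOutcome {
    if alternative_peer != 0 {
        DrainOutcome::ReconnectTo(alternative_peer)
    } else if local_driver {
        DrainOutcome::LocalFallback
    } else {
        DrainOutcome::Unavailable
    }
}
```

In every branch the client still completes or cancels outstanding requests and sends ServiceDrainAck before the departing peer closes the endpoint.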

3. DLM Drain

Release all locally-held locks in reverse acquisition order (latest first) to avoid deadlock cycles with other peers that may be acquiring locks in forward order:

  1. For each exclusive (EX) or protected-write (PW) lock: trigger dsm_writeback() for dirty pages covered by the lock (subscriber-controlled caching integration, Section 6.12). Wait for writeback completion. Then downgrade to NL (no lock).
  2. For each shared (PR/CR) lock: release directly (no writeback needed).
  3. After all local locks released, send DlmDrainComplete to each DLM master node that this peer held locks on.
  4. Each master removes the departing peer from all lock resource holder/waiter lists and processes the waiting queue (granting compatible waiters per the standard DLM protocol, Section 15.15).

No lock transfer is performed — releasing is sufficient. Other peers re-acquire locks as needed. This is simpler and safer than lock transfer (which would require coordinating lock state, Lock Value Blocks, and dirty page tracking across peers).
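
The release plan above — latest-acquired first, with writeback only for exclusive-mode locks — can be sketched as a pure planning step (illustrative types; the real DlmLock structure differs):

```rust
/// A held lock, stored in acquisition order (sketch).
struct HeldLock { resource: u64, exclusive: bool } // EX/PW vs PR/CR

#[derive(Debug, PartialEq)]
enum Release {
    /// EX/PW: dsm_writeback() for covered dirty pages, then downgrade to NL.
    WritebackThenDowngrade(u64),
    /// PR/CR: release directly, no writeback needed.
    Direct(u64),
}

/// Plan the drain latest-acquired-first to avoid deadlock cycles with
/// peers still acquiring in forward order.
fn plan_dlm_drain(held: &[HeldLock]) -> Vec<Release> {
    held.iter().rev().map(|l| {
        if l.exclusive {
            Release::WritebackThenDowngrade(l.resource)
        } else {
            Release::Direct(l.resource)
        }
    }).collect()
}
```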

4. DSM Drain

For each DSM region this peer participates in:

  1. Write back all dirty pages. For every page in M (Modified) or O (Owned) state: send PutM or PutO to the home node via the standard MOESI eviction path (Section 6.6). Wait for PutAck for each page. For write-update regions (DSM_WRITE_UPDATE flag): flush any pending diffs via WriteDiff before evicting.
  2. Drop all shared copies. For every page in S (Shared) state: send PutS to the home node. This is a silent eviction — no data transfer, just directory update.
  3. Send RegionLeave to the region coordinator (§5.7.7.1). The coordinator tombstones the departing peer's slot in the RegionSlotMap (Section 6.1).
  4. Wait for RegionLeaveAck from the coordinator confirming slot tombstone.

Order within a region: writeback (step 1) completes before RegionLeave (step 3). This ensures no M/O pages are lost.

Across regions: drains proceed in parallel (independent regions have no ordering dependency). The DSM drain timeout (120s) bounds the total wall-clock time.

For regions where this peer is the home node for some pages: the directory entries are rehashed to surviving peers. Home node reassignment is triggered by the coordinator after the departing peer leaves — surviving participants compute a new consistent hash excluding the departed peer and redistribute directory entries.
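
The per-page eviction choice in steps 1-2 is a straight function of the page's MOESI state; dirty states carry data and wait for PutAck, shared state is a silent directory update. A sketch (names illustrative):

```rust
/// Cached-page states relevant to DSM drain (MOESI subset).
#[derive(Clone, Copy)]
enum PageState { Modified, Owned, Shared, Invalid }

#[derive(Debug, PartialEq)]
enum Evict {
    PutM,    // dirty: data travels to home, wait for PutAck
    PutO,    // dirty (owned): data travels, wait for PutAck
    PutS,    // silent eviction: directory update only, no data
    Nothing, // already invalid, nothing to send
}

fn eviction_msg(s: PageState) -> Evict {
    match s {
        PageState::Modified => Evict::PutM,
        PageState::Owned    => Evict::PutO,
        PageState::Shared   => Evict::PutS,
        PageState::Invalid  => Evict::Nothing,
    }
}
```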

5. Raft Voter Stepdown

  • If this peer is not a Raft voter: no action required.
  • If this peer is a voter but not the leader:
      1. Submit RemovePeerFromCluster { peer_id: self } to the current leader.
      2. The leader processes this via joint consensus (Raft §7.3 — it commits a configuration entry removing this peer from the voter set).
      3. Wait for commit confirmation.
  • If this peer is the Raft leader:
      1. Select the best successor: the voter with the highest match_index in RaftLeaderState. Break ties by lowest PeerId.
      2. Send a TimeoutNow RPC to the successor, forcing it to start an election immediately without waiting for the election timeout.
      3. Wait for the new leader to be elected (detected via AppendEntries from the new leader with a higher term).
      4. Submit RemovePeerFromCluster to the new leader.
      5. Wait for commit confirmation.
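
The successor-selection rule (highest match_index, ties broken by lowest PeerId) is a one-liner over the leader's replication state. A sketch with RaftLeaderState reduced to (peer_id, match_index) tuples:

```rust
/// Pick the leadership-transfer successor: the voter with the highest
/// match_index; ties broken by lowest PeerId. Returns None if there
/// are no other voters. (Sketch; real RaftLeaderState differs.)
fn pick_successor(voters: &[(u64 /* peer_id */, u64 /* match_index */)]) -> Option<u64> {
    voters.iter()
        .copied()
        // Higher match_index wins; on a tie, the *lower* peer_id must
        // compare greater, hence the reversed id comparison.
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
        .map(|(id, _)| id)
}
```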

5.8.2.5 Timeout and Crash Handling

Per-subsystem timeout enforcement:

If any subsystem exceeds its timeout (the Timeout column in the §5.8.2.2 table), the departing peer force-completes that subsystem's drain:

| Subsystem | Force-Complete Action |
|---|---|
| Scheduler | Abandon unreturned migrated threads (destination peer inherits ownership) |
| Capability services | Force-close service endpoints; clients see -ECONNRESET |
| DLM | Release all locks without waiting for writeback completion |
| DSM | Invalidate remaining M/O pages unilaterally (home sees PutAck timeout → marks pages lost, same as crash recovery) |
| Raft | Step down as voter without waiting for RemovePeerFromCluster commit |

Crash during shutdown:

If the departing peer crashes mid-shutdown (detected by remaining peers via heartbeat timeout, Section 5.8):

  • Remaining peers execute the standard crash recovery path (Section 5.8).
  • The partially-drained state is safe because all drain operations are idempotent: releasing an already-released lock is a no-op, PutM for an already-evicted page returns PutAck, and RegionLeave for an already-departed peer is silently ignored.

Remaining peer crash during drain:

If a remaining peer crashes while the departing peer is draining:

  • The departing peer detects the crash via heartbeat timeout, removes the crashed peer from its drain-ack-needed set, and continues draining.
  • The crashed peer's resources are handled by the standard crash recovery path, independently of the graceful shutdown.

5.8.2.6 Remaining Peer Actions on LeaveNotify

When a remaining peer receives LeaveNotify from the departing peer:

  1. Mark the departing peer as PeerStatus::Leaving in the local PeerRegistry. Leaving peers are treated as Alive for in-flight operations but excluded from new work placement (same semantics as Phase 1 exclusion, but now enforced by all peers, not just the departing peer).

  2. Complete all in-flight requests to/from the departing peer. Outstanding RDMA operations (pending DataResp, InvAck, PutAck, etc.) are processed normally.

  3. Stop issuing new requests to the departing peer:
     • DSM: no new GetS/GetM/Upgrade messages where the departing peer is home.
     • DLM: no new lock requests on resources mastered by the departing peer.
     • Services: no new I/O submissions to the departing peer's capability services.

  4. Prepare for directory rehash. If the departing peer is the home node for any locally-cached DSM pages, mark those directory entries as pending rehash. The actual rehash executes after LeaveComplete is received.

  5. Reconnect services. If this peer is a client of the departing peer's capability services, process the ServiceDrainNotify messages:
     • If alternative_peer != 0: connect to the alternative and re-establish the service session (block/VFS/accel-specific reconnection per §15.4.2, §14.7.10, §22.5.3).
     • If no alternative: fall back to a local driver (if the underlying device is accessible locally) or report service unavailability to affected processes.

  6. Send LeaveAck after local preparation is complete. The drain_subsystem_mask in LeaveAckPayload indicates which subsystems this peer has finished preparing. The peer may send multiple LeaveAck messages as each subsystem completes, with progressively more bits set.

5.8.2.7 Interaction with Maintenance Workflows

Common maintenance scenarios and how graceful shutdown supports them:

Firmware update (e.g., NVMe SSD firmware, DPU firmware):

  1. Run umka-cluster leave --reason=maintenance --timeout=60s on the peer.
  2. Graceful shutdown proceeds per §5.8.2.1-§5.8.2.6.
  3. The firmware update is applied offline.
  4. umka-cluster join re-adds the peer with fresh firmware.

Rolling kernel update across a cluster:

  1. For each peer in sequence: graceful leave → update → rejoin.
  2. The capability-service alternative_peer redirect ensures clients always have a live provider during the rolling update.
  3. DSM regions with DSM_REPLICATE survive each individual departure.

Hardware replacement:

  1. Graceful leave (drains all state off the node).
  2. Physical hardware swap.
  3. Fresh join (a new PeerId is assigned, with clean slot allocation in DSM regions).

5.8.2.8 Membership Event Model

The heartbeat state machine generates membership events consumed by multiple subsystems (DSM, DLM, capability service providers, cluster scheduler). A formal enum captures all possible membership transitions.

/// Cluster membership event. Generated by the heartbeat state machine
/// when a peer's liveness state changes. Delivered synchronously to all
/// registered `MembershipListener` implementations.
///
/// Subsystems that consume membership events:
/// - DSM: triggers anti-entropy acceleration (PeerSuspect) or full home
///   reconstruction (PeerDead). See heartbeat-DSM bridge below.
/// - DLM: triggers re-mastering on PeerDead
///   ([Section 15.15](15-storage.md#distributed-lock-manager--recovery-protocol)).
/// - Capability service providers: drain service endpoints on PeerLeaving,
///   revoke capabilities on PeerDead.
/// - Cluster scheduler: rebalance work on PeerDead, accept work on PeerJoined.
///
/// Kernel-internal enum, not wire/KABI. `#[repr(u32)]` fixes the discriminant
/// type for stable memory layout within the kernel; it does not make this
/// enum C-compatible. Wire encoding of membership changes uses separate
/// PeerMessage payloads.
#[repr(u32)]
pub enum MembershipEvent {
    /// A new peer has joined the cluster (completed the join handshake,
    /// received a slot in the PeerRegistry, and is heartbeating normally).
    PeerJoined {
        peer_id: PeerId,
        /// Cluster epoch after this join (monotonically increasing).
        epoch: u64,
        /// Capability flags advertised by the joining peer
        /// ([Section 5.2](#cluster-topology-model--membership-and-topology)).
        cap_flags: PeerCapFlags,
    },
    /// Heartbeats from this peer have been missed but the peer has not
    /// been declared dead. Subsystems should prepare for possible failure
    /// (e.g., accelerate anti-entropy, pre-stage recovery metadata).
    PeerSuspect {
        peer_id: PeerId,
        /// Number of consecutive missed heartbeat intervals.
        missed_heartbeats: u32,
    },
    /// The peer has been declared dead (missed `dead_threshold` heartbeats).
    /// All resources mastered by this peer must be recovered. The peer's
    /// slot in the PeerRegistry is invalidated; a future join from the
    /// same physical node receives a fresh PeerId and slot.
    PeerDead {
        peer_id: PeerId,
        /// Cluster epoch after this death declaration.
        epoch: u64,
    },
    /// The peer is voluntarily leaving the cluster (graceful shutdown).
    /// Service drain has been initiated; consumers should complete
    /// outstanding requests within `drain_timeout_ms`.
    PeerLeaving {
        peer_id: PeerId,
        /// Milliseconds until the peer forcibly disconnects.
        drain_timeout_ms: u32,
    },
    /// A network partition has been detected. The local node is in the
    /// specified partition set. If the local node is in the minority
    /// partition, it enters read-only mode (no DSM writes, no new locks,
    /// no process migrations). See split-brain resolution above.
    PartitionDetected {
        /// Cluster epoch at partition detection.
        epoch: u64,
        /// Number of nodes in the local partition.
        local_partition_size: u32,
        /// Total cluster size before partition.
        total_cluster_size: u32,
        /// True if the local partition has majority quorum.
        has_quorum: bool,
    },
}

/// Listener trait for membership events. Implemented by subsystems that
/// need to react to cluster topology changes.
///
/// Listeners are registered via `membership_register_listener()` during
/// subsystem init and called synchronously from the heartbeat state
/// machine's transition handler. Implementations must be fast (no blocking
/// I/O) — heavy recovery work should be deferred to a workqueue.
pub trait MembershipListener: Send + Sync {
    /// Called synchronously when a membership event occurs.
    /// Implementations must not block — queue work if recovery is needed.
    fn on_event(&self, event: &MembershipEvent);
}
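
A minimal listener, with the spec's types reduced to a self-contained sketch (PeerId as u64, only two event variants shown), might record the event and defer all heavy recovery work, honoring the non-blocking contract:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Simplified copies of the spec's types for a self-contained sketch.
type PeerId = u64;
enum MembershipEvent {
    PeerJoined { peer_id: PeerId },
    PeerDead { peer_id: PeerId },
}
trait MembershipListener: Send + Sync {
    fn on_event(&self, event: &MembershipEvent);
}

/// Example listener: counts peer deaths. Per the contract, on_event must
/// not block — here we only bump an atomic counter; a real subsystem
/// would enqueue recovery work to a workqueue instead.
struct DeathCounter { deaths: AtomicU64 }

impl MembershipListener for DeathCounter {
    fn on_event(&self, event: &MembershipEvent) {
        if let MembershipEvent::PeerDead { .. } = event {
            self.deaths.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```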

5.8.2.9 Heartbeat → DSM Anti-Entropy Bridge

The cluster heartbeat operates at the membership layer (per-peer liveness). DSM operates at the region layer (per-page coherence). These layers are bridged through two event hooks:

  1. PeerSuspect (heartbeat missed 2× interval, ~200ms): For all DSM regions where the suspected peer is a participant, accelerate anti-entropy sync frequency from the normal background interval (default 10s) to 1ms. This ensures that any writes the suspected peer had not yet propagated are detected and replicated before a potential Dead declaration. If the heartbeat resumes (false alarm), anti-entropy frequency returns to normal.

  2. PeerDead (heartbeat missed dead_threshold, ~1000ms): For all DSM regions where the dead peer was a participant, trigger force_region_rebuild(region_id, dead_peer). This is the full home reconstruction protocol (Phase 1 below). Additionally, clear the dead peer's slot in each region's RegionSlotMap and invalidate any cached pages where the dead peer was the sole Exclusive owner (data lost — application sees SIGBUS on next access).

These hooks are registered by the DSM subsystem during init (dsm_register_membership_hooks()) and are called synchronously from the heartbeat state machine's transition handler.
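
The interval switch in hook 1 reduces to a function of the peer's liveness state; the values below are the ones given in the text (sketch, illustrative name):

```rust
/// Anti-entropy interval for regions touching a given peer (sketch).
/// 10s background interval normally; 1ms while the peer is suspect.
/// Returns to 10s automatically if the heartbeat resumes.
fn anti_entropy_interval_ms(peer_suspect: bool) -> u64 {
    if peer_suspect { 1 } else { 10_000 }
}
```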

5.8.3 Cross-Subsystem Recovery Ordering: DSM and DLM

When a node fails, both the DSM and DLM must recover state. These recoveries have a dependency: DLM lock state may reside in DSM-managed memory (lock state words are stored in RDMA-registered regions that may overlap with DSM pages), and the DLM recovery protocol (Section 15.15) requires reading lock state from the master's memory, which may need DSM page reconstruction first.

Recovery ordering protocol:

On NodeDead(failed_node) confirmed:

Phase 1 — DSM home reconstruction (runs first):
  1. For each DSM region where failed_node was a participant:
     a. Identify pages where failed_node was the home node
        (directory authority).
     b. Rehash those pages to surviving home nodes using the
        home assignment policy (hash(region_id, VA) % surviving_count
        or fixed-home fallback).
     c. Rebuild directory entries from surviving copies:
        - SharedReader nodes report their cached copies.
        - SharedOwner nodes report their authoritative copies.
        - Pages where failed_node was the sole Exclusive owner
          are marked INVALID (data lost — application must
          re-initialize from persistent storage).
     d. Signal `dsm_recovery_complete(region_id)` event for each
        recovered region.

  DSM reconstruction is region-parallel: independent regions recover
  concurrently. Total time: ~50-200ms per region (dominated by
  directory rebuild network round-trips).

Phase 2 — DLM re-mastering (per-resource DSM dependency):
  2. Consistent hashing reassigns failed_node's mastered resources
     to surviving nodes.
  3. Surviving holders report their lock state to new masters via
     transport.send_reliable().
  4. New masters rebuild granted/converting/waiting queues.

  Re-mastering proceeds per-resource, NOT blocked on a lockspace-wide
  barrier. The key insight is that only a small fraction of DLM resources
  (~1% in typical deployments) have their CAS words stored in DSM-homed
  pages. Blocking the entire lockspace until all dependent DSM regions
  complete recovery penalizes the 99% of resources that have no DSM
  dependency.

  Per-resource DSM dependency tracking:
  Each DlmResource records whether its CAS word page is DSM-homed via
  `DlmResourceDsmDep` (canonical definition:
  [Section 15.15](15-storage.md#distributed-lock-manager--data-structures)).
  This per-resource metadata is populated when the master allocates CAS
  word arrays from RDMA-registered memory at resource creation time.
  Overhead: 24 bytes per DlmResource (one Option<u64> + one Option<u64>).

  Re-mastering decision per resource:
  a. If `region_id` is None: resource has no DSM dependency.
     Re-mastering proceeds immediately — new master reads CAS word from
     local RDMA pool memory (or reconstructs from survivor reports).
  b. If `region_id` is Some(rid) AND the CAS word's page was NOT homed
     on the failed node: the page's home is a surviving node, directory
     is intact. Re-mastering proceeds immediately.
  c. If `region_id` is Some(rid) AND the CAS word's page WAS homed on
     the failed node: this resource's re-mastering is deferred until
     dsm_recovery_complete(rid) is signaled. The resource is placed in
     a per-lockspace deferred queue (DsmPendingResources) indexed by
     region_id for efficient batch release.

  The "was homed on failed node" check uses the DSM home assignment
  function: home_node(region_id, cas_word_va) == failed_node. This is
  a pure function (hash of region_id + VA modulo cluster size) that can
  be evaluated locally without network traffic.
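A minimal sketch of this locality check, assuming an FNV-1a hash as the (illustrative) hash function and using hypothetical helper names `home_node` and `must_defer` — the canonical home assignment policy is whatever Section 5 specifies, not this particular hash:

```rust
/// Illustrative sketch of the DSM home-assignment function. Pure: depends
/// only on (region_id, page VA, node list), so any node can evaluate
/// "was this CAS word's page homed on the failed node?" with no network I/O.
fn home_node(region_id: u64, page_va: u64, nodes: &[u32]) -> u32 {
    let a = region_id.to_le_bytes();
    let b = (page_va & !0xfff).to_le_bytes(); // page-align the VA
    // FNV-1a over (region_id, page VA) — illustrative hash choice.
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in a.iter().chain(b.iter()) {
        h = (h ^ *byte as u64).wrapping_mul(0x0000_0100_0000_01b3);
    }
    nodes[(h % nodes.len() as u64) as usize]
}

/// Re-mastering decision (case c above): defer only if the page's home,
/// evaluated against the PRE-failure membership, was the failed node.
fn must_defer(region_id: u64, cas_word_va: u64, survivors: &[u32], failed: u32) -> bool {
    let mut old: Vec<u32> = survivors.to_vec();
    old.push(failed);
    old.sort_unstable(); // deterministic ordering regardless of report order
    home_node(region_id, cas_word_va, &old) == failed
}
```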

Phase 3 — Lock grants resume (incremental):
  5. Resources with no DSM dependency (cases a and b above) resume lock
     grant processing as soon as their individual re-mastering completes.
     This means ~99% of resources in a typical lockspace resume within
     milliseconds of the DLM re-mastering phase starting — they do NOT
     wait for DSM recovery.
  6. When dsm_recovery_complete(rid) is signaled, all resources in
     DsmPendingResources[rid] are released: their CAS word pages are now
     reconstructed and readable. These resources complete re-mastering
     and resume lock grant processing.
  7. Pending lock requests that were blocked during recovery are
     evaluated against the rebuilt queues as each resource becomes
     available.
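The deferred-queue mechanics above can be sketched as follows — a hypothetical shape for `DsmPendingResources`, assuming simple `u64` resource and region identifiers (the real structure lives in the per-lockspace state, Section 15.15):

```rust
use std::collections::HashMap;

type ResourceId = u64;
type RegionId = u64;

/// Per-lockspace deferred queue: resources whose CAS-word pages were homed
/// on the failed node, indexed by DSM region so that a single
/// dsm_recovery_complete(rid) event releases the whole batch.
#[derive(Default)]
struct DsmPendingResources {
    by_region: HashMap<RegionId, Vec<ResourceId>>,
}

impl DsmPendingResources {
    /// Case (c): park a resource until its region's recovery completes.
    fn defer(&mut self, rid: RegionId, res: ResourceId) {
        self.by_region.entry(rid).or_default().push(res);
    }

    /// Called when dsm_recovery_complete(rid) fires: returns the batch of
    /// resources that may now finish re-mastering and resume lock grants.
    fn release(&mut self, rid: RegionId) -> Vec<ResourceId> {
        self.by_region.remove(&rid).unwrap_or_default()
    }

    /// Resources still blocked — feeds the per-lockspace monitoring view.
    fn pending_count(&self) -> usize {
        self.by_region.values().map(Vec::len).sum()
    }
}
```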

  Per-lockspace DSM dependency set (retained for monitoring):
  Each DlmLockspace still records the aggregate set of DSM region IDs
  used by any resource in the lockspace (populated incrementally as
  resources are created). This set is used for:
  - Monitoring: umkafs exposes per-lockspace recovery status showing
    how many resources are DSM-pending vs. recovered.
  - Timeout fallback: if a DSM region does not signal recovery within
    30 seconds, all resources in DsmPendingResources[rid] fall back to
    survivor-reported lock state exclusively (same as scenario 4 in the
    DLM recovery protocol). The per-resource timeout avoids blocking
    the entire lockspace indefinitely.

Cross-subsystem barrier implementation:

  /// One-shot synchronization barrier for cross-subsystem recovery ordering.
  /// Producers signal completion; consumers wait (with timeout) for the signal.
  /// Backed by a futex-like wait/wake mechanism — no spin-waiting.
  pub struct EventBarrier<T: Clone> {
      /// Signalled flag. Set to `true` by `signal()`, read by `wait()`.
      signalled: AtomicBool,
      /// Payload delivered on signal (e.g., region_id for DSM recovery).
      payload: UnsafeCell<MaybeUninit<T>>,
      /// Wait queue for blocked waiters.
      waiters: WaitQueue,
  }

  impl<T: Clone> EventBarrier<T> {
      /// Signal the barrier, waking all waiters.
      pub fn signal(&self, payload: T);
      /// Wait for the barrier to be signalled, with a timeout.
      /// Returns `Ok(payload)` on signal, `Err(Timeout)` if `timeout` expires.
      pub fn wait(&self, timeout: Duration) -> Result<T, BarrierTimeout>;
  }

  /// Completion notification for DSM region recovery. Signalled by the DSM
  /// subsystem when a region's home directory has been fully reconstructed
  /// and all pages are accessible. The DLM subscribes to this event for
  /// each region that contains CAS words used by deferred resources.
  pub struct DsmRecoveryComplete {
      /// The DSM region whose recovery is complete.
      pub region_id: u64,
  }

  dsm_recovery_complete is an EventBarrier<DsmRecoveryComplete> (umka-core event bus):
    - DSM signals: event_bus.signal(DsmRecoveryComplete { region_id })
    - DLM subscribes per deferred resource batch:
      event_bus.wait(DsmRecoveryComplete { region_id }) with a timeout
      of 30 seconds per region. If DSM recovery does not complete within
      30 seconds (catastrophic failure), the DLM proceeds with
      re-mastering for the affected resources using survivor-reported
      lock state exclusively — CAS words in the timed-out DSM region
      are treated as INVALID (same as scenario 4 in the DLM recovery
      protocol).
    - Resources with no DSM dependency never subscribe to any event
      barrier — they proceed immediately.
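The signal/wait contract above can be demonstrated with a user-space analogue of `EventBarrier` built on `std::sync::Condvar` — the kernel version uses the futex-like `WaitQueue` instead, and the error type here (`&'static str`) stands in for `BarrierTimeout`:

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// User-space analogue of the kernel EventBarrier. One-shot: signal()
/// stores the payload and wakes all waiters; wait() blocks until
/// signalled or the timeout expires. No spin-waiting.
struct EventBarrier<T: Clone> {
    state: Mutex<Option<T>>,
    cond: Condvar,
}

impl<T: Clone> EventBarrier<T> {
    fn new() -> Self {
        Self { state: Mutex::new(None), cond: Condvar::new() }
    }

    /// Signal the barrier, delivering `payload` to all current and future waiters.
    fn signal(&self, payload: T) {
        *self.state.lock().unwrap() = Some(payload);
        self.cond.notify_all();
    }

    /// Wait for the signal with a timeout (the DLM uses 30 s per region).
    fn wait(&self, timeout: Duration) -> Result<T, &'static str> {
        let guard = self.state.lock().unwrap();
        let (guard, _res) = self
            .cond
            .wait_timeout_while(guard, timeout, |s| s.is_none())
            .unwrap();
        match &*guard {
            Some(payload) => Ok(payload.clone()),
            None => Err("BarrierTimeout"),
        }
    }
}
```

A timed-out `wait` maps to the catastrophic-failure path: the DLM proceeds with survivor-reported lock state and treats the region's CAS words as INVALID.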

Why this ordering matters:

Without the ordering constraint, the DLM could attempt to read a resource's CAS word from a DSM page whose home directory has not yet been reconstructed. If the CAS word's page was homed on the failed node, the DLM would read stale or invalid data, leading to incorrect lock state rebuild (phantom locks, lost locks, or split-brain lock grants).

The ordering guarantee is: a DSM page is fully reconstructed and accessible before the DLM reads any lock state from that page. This is a per-resource dependency, not a lockspace-wide barrier. Resources whose CAS words are not DSM-homed, or whose CAS words are in DSM pages homed on surviving nodes, proceed immediately. Only the small fraction of resources (~1% in typical deployments) whose CAS word pages were homed on the failed node must wait for DSM region recovery. This per-resource granularity reduces recovery latency for the common case from O(slowest_DSM_region) to O(1) for unaffected resources.

Graceful departure optimization: During a graceful leave (Section 5.8), the departing node transfers its mastered DLM resources to new masters BEFORE departing from DSM regions. This avoids the recovery ordering constraint entirely: the DLM state is already migrated before any DSM directory entries need rehashing.


5.9 CXL 3.0 Fabric Integration

5.9.1 Why CXL Changes Everything

CXL (Compute Express Link) 3.0 provides hardware-coherent shared memory across a PCIe fabric. Unlike RDMA (which requires software coherence protocols), CXL memory is accessed via normal CPU load/store instructions with hardware cache coherence.

Hardware availability caveat: CXL 3.0 hardware with full shared memory semantics (back-invalidate snooping for multi-host coherence) is not yet commercially available as of 2025. UmkaOS's CXL support targets CXL 2.0 Type 3 devices (available in Sapphire Rapids, Genoa) with software-managed coherence. CXL 3.0 back-invalidate snooping will be supported when hardware becomes available. The CXL 3.0 sections below describe the target architecture; implementation is gated on hardware availability.

Memory access latency spectrum:

  Local DRAM (same socket):     ~50-80 ns     (load/store)
  Local DRAM (cross-socket):    ~100-150 ns   (load/store, QPI/UPI)
  CXL 2.0 attached memory:      ~200-400 ns   (load/store, PCIe + CXL)
  CXL 3.0 shared memory pool:   ~200-500 ns   (load/store, coherent)
  RDMA:                         ~2000-5000 ns (explicit transfer)
  NVMe SSD:                     ~10000 ns     (block I/O)

CXL 3.0 shared memory is 5-25x faster than RDMA because:

1. No software protocol (hardware coherence via CXL.cache, CXL.mem)
2. Cache-line granularity (64 bytes, not 4KB pages)
3. No memory registration overhead
4. CPU load/store instructions, not a DMA engine

5.9.2 Design: CXL as a First-Class Memory Tier

PageLocation (defined canonically in Section 22.4 of 21-accelerators.md, reproduced in Section 6.7 above) is reused here for distributed page placement decisions. See Section 6.7 for the full enum definition including RDMA-specific variants.

NumaNodeType is defined canonically in Section 22.4, including the CxlSharedPool variant. All CXL-related fields (latency_ns, bandwidth_gbs, sharing_nodes, coherence_version) are specified there. This section defines only the CxlCoherenceVersion enum used by that variant:

#[repr(u32)]
pub enum CxlCoherenceVersion {
    /// CXL 2.0: pooled memory, no hardware coherence between hosts.
    /// Kernel manages coherence via software protocol (like DSM Section 5.6).
    Cxl20Pooled     = 0,
    /// CXL 3.0: hardware-coherent shared memory.
    /// CPU cache coherence protocol extended across CXL fabric.
    /// No software coherence needed — hardware handles it.
    Cxl30Coherent   = 1,
}

5.9.3 CXL + RDMA Hybrid

In a realistic datacenter, both CXL and RDMA will coexist:

Rack-level (CXL 3.0 fabric, ~200-500 ns):
  ┌─────────┐   CXL    ┌─────────┐   CXL    ┌─────────┐
  │ Node 0  │◄────────►│ CXL     │◄────────►│ Node 1  │
  │ CPU+GPU │          │ Switch  │          │ CPU+GPU │
  └─────────┘          │ +Memory │          └─────────┘
                       │  Pool   │
                       └────┬────┘
                            │ CXL
                       ┌────┴────┐
                       │ Node 2  │
                       │ CPU+GPU │
                       └─────────┘

Cross-rack (RDMA, ~2-5 μs):
  Rack 0 ◄──── 400GbE RDMA ────► Rack 1

Memory tier ordering within this topology:
  Tier 1: Local DRAM (~80 ns)
  Tier 2: CXL shared pool (~300 ns)         ← same rack
  Tier 3: GPU VRAM (~500 ns)
  Tier 4: Compressed (~1-2 μs to decompress)
  Tier 5: Remote DRAM via RDMA (~3 μs)      ← cross rack
  Tier 6: Local NVMe (~12 μs)
  Tier 7: Remote NVMe via NVMe-oF/RDMA      ← cross rack

Note: This ordering is illustrative (measured latency for this specific topology).
The canonical `MemoryTier` enum is defined in [Section 4.2](04-memory.md#physical-memory-allocator--memory-tier-model);
the kernel discovers and ranks tiers dynamically at boot from measured latencies.

The kernel detects CXL and RDMA links automatically (device registry)
and builds the distance matrix accordingly. No manual configuration
of tier ordering — the measured latencies determine placement policy.
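A sketch of that measurement-driven ranking, using a hypothetical `MemTier` struct (the canonical `MemoryTier` enum is in Section 4.2; names and fields here are illustrative):

```rust
/// Illustrative boot-time tier record: the kernel ranks tiers by measured
/// latency rather than a hard-coded table like the one above.
#[derive(Debug, Clone, PartialEq)]
struct MemTier {
    name: &'static str,
    measured_latency_ns: u64,
    free_bytes: u64,
}

/// Order tiers fastest-first from boot-time latency probes.
fn rank_tiers(mut tiers: Vec<MemTier>) -> Vec<MemTier> {
    tiers.sort_by_key(|t| t.measured_latency_ns);
    tiers
}

/// Placement: the fastest ranked tier with enough free capacity.
fn place(ranked: &[MemTier], needed: u64) -> Option<&MemTier> {
    ranked.iter().find(|t| t.free_bytes >= needed)
}
```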

5.9.4 CXL Shared Memory for DSM

When CXL 3.0 hardware-coherent shared memory is available, the DSM protocol (Section 6.2) simplifies dramatically:

DSM over RDMA (software coherence):
  - Page fault → directory lookup → ownership transfer → RDMA page transfer
  - ~10-25 μs per fault
  - Software TLB invalidation protocol

DSM over CXL 3.0 (hardware coherence):
  - Map CXL shared pool pages into process address space
  - CPU load/store works directly (hardware coherence)
  - No page faults for coherence (hardware handles cache-line invalidation)
  - ~200-500 ns access latency (same as accessing CXL memory)
  - Software DSM protocol not needed for CXL-connected nodes

The kernel uses CXL shared memory when available (intra-rack),
and falls back to RDMA-based DSM for cross-rack communication.
Best transport is selected automatically per page.

DSM redundancy and DLM: For node pairs connected via a CXL 3.0 fabric, the software DSM page-ownership state machine (Section 6.3) is redundant — the hardware handles cache-line-granularity coherence without software page faults or ownership transfers. UmkaOS's DSM routing layer uses CxlPool transport (load/store) for these pairs and skips the full coherence protocol.

However, the Distributed Lock Manager (DLM, Section 15.15) remains fully required even with CXL 3.0. Hardware cache coherence does not provide mutual exclusion semantics for arbitrary data structures (spinlocks, reader-writer locks, cross-node transactions). DLM lock acquisition (LOCK_ACQUIRE, atomic CAS on lock words) continues to operate as specified — CXL just makes the lock words accessible via load/store rather than RDMA, which reduces lock round-trip latency but does not eliminate the need for the protocol.

CXL 3.0 node pair — what changes vs. RDMA:
  DSM page faults:       eliminated (hardware coherence)
  DSM ownership xfer:    eliminated (no protocol needed)
  DLM lock acquire:      unchanged in protocol, load/store instead of RDMA
  DLM deadlock detect:   unchanged
  Heartbeat / membership: unchanged
  Capability tokens:     unchanged
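The routing split in the comparison above reduces to a small per-node-pair decision — sketched here with hypothetical names (`PairTransport`, `dsm_transport`); the point is that CXL 3.0 pairs skip software DSM coherence while the DLM protocol survives on every transport:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum PairTransport {
    CxlLoadStore, // same CXL 3.0 fabric: hardware coherence
    Rdma,         // cross-rack: software DSM protocol
}

/// Per-pair transport choice for DSM traffic.
fn dsm_transport(same_cxl3_fabric: bool) -> PairTransport {
    if same_cxl3_fabric { PairTransport::CxlLoadStore } else { PairTransport::Rdma }
}

/// DSM page faults / ownership transfers only exist on the RDMA path.
fn dsm_software_coherence_needed(t: PairTransport) -> bool {
    matches!(t, PairTransport::Rdma)
}

/// The DLM protocol is required regardless: cache coherence alone never
/// provides mutual exclusion — only the lock-word access method changes.
fn dlm_protocol_needed(_t: PairTransport) -> bool {
    true
}
```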

5.9.5 CXL Devices as UmkaOS Peers

The three CXL device types map to distinct UmkaOS peer operating models. The classification determines Mode A vs. Mode B, peer role (compute vs. memory-manager), and crash recovery behavior.

5.9.5.1 Type 1: Coherent Compute (CXL.cache only)

Type 1 devices have their own compute and a cache that participates in the host CPU's coherency domain via CXL.cache. They have no device-managed DRAM visible via CXL.

UmkaOS operating model:
- Mode B peer (hardware-coherent transport) — CXL.cache IS the cache coherence protocol. Ring buffers placed in the shared region are coherent by hardware without any explicit cache flush or ownership protocol in software.
- Full compute peer — runs UmkaOS or a firmware shim (Paths A/B). Participates in cluster membership, heartbeat, DLM, and DSM as a regular cluster node.
- FLR cache flush requirement — see Section 5.3. On FLR the device must flush all CXL.cache dirty lines to host memory. The host waits for FLR completion before accessing the shared region.

5.9.5.2 Type 2: Coherent Compute + Device Memory (CXL.cache + CXL.mem)

Type 2 is the richest CXL peer type. The device has both a coherent cache (CXL.cache) AND device-local DRAM accessible to the host via load/store (CXL.mem).

UmkaOS operating model:
- Mode B peer with bidirectional zero-copy — coherent in both directions: the device cache participates in the host coherency domain (CXL.cache); the host CPU can directly load/store device DRAM as a NUMA memory tier (CXL.mem). Neither direction requires DMA or explicit transfer.
- Memory tier AND compute peer — the device's DRAM is registered as NumaNodeType::CxlMemory (or CxlSharedPool if multi-host). The device's cores run UmkaOS and execute workloads. The same ClusterNode participates in both memory placement decisions and compute scheduling.
- Ring buffers — can live in either host DRAM (coherent via CXL.cache) or device DRAM (coherent via CXL.mem + host load/store). Placement policy: ring buffers go in the lower-latency region as measured at runtime.
- FLR cache flush — same requirement as Type 1.
- Examples: future CXL-attached GPUs with HBM, CXL AI accelerators.

5.9.5.3 Type 3: Memory Expander (CXL.mem only, minimal compute)

Type 3 devices provide DRAM accessible to the host via CXL.mem. They have no accelerator compute from CXL's perspective, though they may have a small management processor (typically ARM Cortex-A5x or similar).

UmkaOS operating model — memory-manager peer, not compute peer:
- The management processor runs a minimal UmkaOS instance (Path A via the AArch64 build target) with a single responsibility: managing the DRAM pool.
- What it does: monitors pool health, handles tiering decisions (hot/cold page migration across CXL memory sub-regions), manages optional compression or encryption, retires bad pages, and reports ECC and media errors to the host cluster membership layer.
- What it does NOT do: run application workloads, participate in DSM ownership transfers (the pool appears as a NUMA node to the host, not as a peer's memory), or require the full cluster protocol. It uses a lightweight subset: heartbeat + health reporting + pool management messages.
- No DLM, no DSM page protocol: the Type 3 peer does not own pages in the distributed sense. The host NUMA allocator owns pages in the CXL pool; the management processor just monitors and maintains the physical medium.

CXL 3.0 multi-host pool: when a CXL switch connects multiple hosts to the same Type 3 pool, the management processor becomes a shared pool coordinator:
- Arbitrates allocation between hosts (each host has a capacity reservation)
- Reports pool-wide health events to all connected hosts
- Does not arbitrate cache coherence (hardware does that via CXL.cache back-invalidate)
- Does not run DLM for cross-host locking (UmkaOS DLM handles that over CXL.cache)

Crash recovery: see Section 5.3 CXL Type 3 section. DRAM persists; management layer is lost until management processor recovers. Pool transitions to ManagementFaulted state; uncompressed/unencrypted pages remain accessible.

5.9.5.4 CXL Switch as Fabric Manager Node

CXL 3.0 introduces intelligent CXL switches that route traffic between multiple hosts and multiple memory/compute devices. A CXL switch with embedded compute (ARM or RISC-V management processor) can run UmkaOS as a fabric manager node:

  • Topology discovery: the switch sees all CXL endpoints and can report the full fabric topology (which hosts can reach which memory pools, with latency measurements) to the UmkaOS cluster membership layer.
  • Routing policy: the switch can enforce traffic shaping, QoS, or bandwidth partitioning between hosts sharing the same CXL fabric.
  • No data plane participation: the switch does not own memory pages, run workloads, or participate in DLM. It is a topology and routing oracle.
  • This is a future capability: CXL 3.0 switch hardware with embedded compute is not yet commercially available (2025). The architecture is ready; the FabricManager node type and associated cluster messages are deferred to Phase 5+ (Section 24.2).

5.9.5.5 Summary Table

| CXL Type | Compute | Memory | UmkaOS Peer Role | Transport | Crash Model |
|---|---|---|---|---|---|
| Type 1 | Yes (CXL.cache) | No | Full compute peer | Mode B | FLR flush + standard recovery |
| Type 2 | Yes (CXL.cache) | Yes (CXL.mem) | Compute + memory tier peer | Mode B | FLR flush + standard recovery |
| Type 3 | No (mgmt only) | Yes (CXL.mem) | Memory-manager peer | Heartbeat + pool msgs | ManagementFaulted; DRAM persists |
| CXL Switch | Yes (mgmt only) | No | Fabric manager (future) | Topology msgs | Fallback to static topology |

5.10 Compatibility, Integration, and Phasing

5.10.1 Linux Compatibility and MPI Integration

5.10.1.1 Existing RDMA Applications (Unchanged)

All existing Linux RDMA applications work through the compatibility layer:

| Application | Linux Interface | UmkaOS Path |
|---|---|---|
| MPI (OpenMPI, MPICH) | libibverbs / libfabric | umka-sysapi RDMA compat layer |
| NCCL (multi-node GPU) | libibverbs + GDR | RDMA compat + GPUDirect RDMA |
| DPDK | ibverbs / EFA | RDMA compat |
| Ceph (msgr2) | RDMA transport | RDMA compat |
| Spark (RDMA shuffle) | libfabric | RDMA compat |
| Redis (RDMA) | ibverbs | RDMA compat |

The compatibility layer (umka-sysapi/src/rdma/) translates Linux verbs API calls to KABI RdmaDeviceVTable calls. Binary libibverbs.so works without recompilation.

5.10.1.2 MPI Optimization Opportunities

MPI implementations can opt into UmkaOS-specific features for better performance:

Standard MPI on UmkaOS (no changes, works today):
  MPI_Send/MPI_Recv → libibverbs → RDMA NIC
  MPI_Win_create (shared memory window) → mmap + RDMA
  Performance: same as on Linux

MPI on UmkaOS with DSM (opt-in, future):
  MPI_Win_create → UMKA_SHM_MAKE_DISTRIBUTED
  MPI_Put/MPI_Get → direct load/store on DSM region
  Kernel handles page migration transparently.
  No explicit RDMA operations needed by MPI implementation.

  Benefit: MPI one-sided operations become load/store.
  Latency: ~2-5 μs per operation (RDMA verbs overhead) becomes ~3-5 μs
  for the first access (page fault), then ~50-150 ns for subsequent
  accesses (page is local).

  For iterative algorithms (most HPC): working set becomes local after
  first iteration. Subsequent iterations run at local memory speed.

5.10.1.3 Kubernetes / Container Integration

/sys/fs/cgroup/<group>/cluster.nodes
# Which nodes this cgroup's processes can run on
# "0 1 2 3" or "all"

/sys/fs/cgroup/<group>/cluster.memory.remote.max
# Maximum remote memory for this cgroup

/sys/fs/cgroup/<group>/cluster.accel.devices
# Allowed accelerators (including remote)
# "node0:gpu0 node0:gpu1 node1:gpu0"

Kubernetes integration:
  - kubelet reads cluster topology from /sys/kernel/umka/cluster/
  - Device plugin exposes remote GPUs as schedulable resources
  - Pod spec: resources.limits: { umka.dev/gpu: 4, umka.dev/remote-gpu: 4 }
  - kubelet sets cgroup constraints; kernel handles placement

5.10.1.4 UmkaOS-Specific Cluster Interfaces

/sys/kernel/umka/cluster/
    nodes                   # List of cluster nodes with state
    topology                # Cluster distance matrix
    dsm/
        regions             # Active DSM regions
        stats               # DSM page fault / migration stats
    memory_pool/
        total               # Cluster-wide memory total
        available           # Cluster-wide memory available
        per_node/           # Per-node breakdown
    scheduler/
        balance_interval_ms # Global load balance interval
        migrations          # Process migration count
        migration_log       # Recent migrations (for debugging)
    capabilities/
        revocation_list     # Current revocation list
        key_ring            # Cluster node public keys

5.10.2 Integration with UmkaOS Architecture

5.10.2.1 Memory Manager Integration

The distributed memory features integrate with the existing MM at two points:

1. Page fault handler (extend existing):
   Current: fault → check VMA → allocate page / CoW / swap-in
   Extended: fault → check VMA → check PageLocationTracker:
     - CpuNode → standard local fault (unchanged)
     - DeviceLocal → device fault (Section 22.1, unchanged)
     - RemoteNode → RDMA fetch from remote node (NEW)
     - CxlPool → CXL load (hardware handles it) (NEW)
     - NotPresent → allocate locally (unchanged)

2. Page reclaim / eviction (extend existing):
   Current: LRU scan → compress or swap
   Extended: LRU scan → compress, OR migrate to remote node, OR swap
     - If remote memory is available and faster than swap: migrate
     - Decision based on cluster distance matrix + pool availability
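The migrate-vs-swap decision in point 2 can be sketched as a small policy function — names here (`ReclaimTarget`, `reclaim_target`) are illustrative, and the latency inputs stand in for the distance-matrix entry and measured swap latency:

```rust
#[derive(Debug, PartialEq)]
enum ReclaimTarget {
    MigrateRemote, // push the page to a remote node's exported pool
    Swap,          // fall back to local swap
}

/// Eviction policy sketch: migrate to remote memory only when remote
/// capacity exists AND its measured access latency (from the cluster
/// distance matrix) beats local swap latency.
fn reclaim_target(
    remote_free_bytes: u64,
    remote_latency_ns: u64,
    swap_latency_ns: u64,
) -> ReclaimTarget {
    if remote_free_bytes > 0 && remote_latency_ns < swap_latency_ns {
        ReclaimTarget::MigrateRemote
    } else {
        ReclaimTarget::Swap
    }
}
```

For example, with remote DRAM at ~3 μs and NVMe swap at ~12 μs, a node with available remote memory migrates instead of swapping.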

5.10.2.2 Device Registry Integration

New device types in the registry:

ClusterFabric (virtual root for cluster topology)
  +-- rdma_link_0 (Node 0 ↔ Node 1, 200 Gb/s, 2.5 μs RTT)
  +-- rdma_link_1 (Node 0 ↔ Node 2, 200 Gb/s, 4.0 μs RTT)
  +-- cxl_link_0 (Node 0 ↔ CXL Pool 0, 64 GB/s, 300 ns)

RemoteNode (virtual device representing a remote machine)
  +-- Properties:
  |     node-id: 1
  |     state: "active"
  |     rtt-ns: 2500
  |     bandwidth-gbit-s: 200
  |     memory-total: 549755813888
  |     memory-available: 137438953472
  |     gpu-count: 4
  +-- Services published:
        "remote-memory" (GlobalMemoryPool)
        "remote-accel" (AccelExportService for each GPU)
        "remote-block" (BlockExportService for each NVMe)

5.10.2.3 FMA Integration

New FMA health events for distributed subsystem (Section 20.1):

| Rule | Threshold | Action |
|---|---|---|
| RDMA link degraded | >10% packet retransmits / minute | Alert + reduce traffic |
| RDMA link down | Link-down event | Failover to TCP or isolate node |
| Remote node unresponsive | 3 missed heartbeats (300 ms) | Mark Suspect |
| Remote node dead | 10 missed heartbeats (1000 ms) | Mark Dead + reclaim |
| DSM page loss | Exclusive page on dead node | Alert + SIGBUS to process |
| Cluster split-brain | Membership views diverge | Quorum protocol (Section 5.8) |
| CXL memory error | Uncorrectable ECC on CXL pool | Migrate pages + Alert |
| Clock skew detected | >10 ms drift between nodes | Alert (affects capability expiry) |

5.10.2.4 Stable Tracepoints

New stable tracepoints for distributed observability (Section 20.2):

| Tracepoint | Arguments | Description |
|---|---|---|
| umka_tp_stable_dsm_fault | node_id, remote_node, vaddr, latency_ns | DSM page fault |
| umka_tp_stable_dsm_migrate | src_node, dst_node, pages, bytes | DSM page migration |
| umka_tp_stable_dsm_invalidate | owner_node, reader_nodes, vaddr | DSM cache invalidation |
| umka_tp_stable_cluster_join | node_id, rdma_gid | Node joined cluster |
| umka_tp_stable_cluster_leave | node_id, reason | Node left/failed |
| umka_tp_stable_cluster_migrate | pid, src_node, dst_node, reason | Process migration |
| umka_tp_stable_rdma_transfer | src_node, dst_node, bytes, latency_ns | RDMA data transfer |
| umka_tp_stable_remote_fault | node_id, tier, vaddr, latency_ns | Remote memory access |
| umka_tp_stable_global_pool_alloc | node_id, remote_node, bytes | Global pool allocation |
| umka_tp_stable_global_pool_reclaim | node_id, bytes, reason | Global pool reclaim |

5.10.2.5 Object Namespace

Cluster objects in the unified namespace (Section 20.5):

\Cluster\
  +-- Nodes\
  |   +-- node0\          (this machine)
  |   |   +-- State       "active"
  |   |   +-- Memory      "512 GB total, 384 GB free, 32 GB exported"
  |   |   +-- GPUs        → symlink to \Accelerators\
  |   +-- node1\
  |       +-- State       "active"
  |       +-- Memory      "512 GB total, 256 GB free, 64 GB exported"
  |       +-- RTT         "2500 ns"
  |       +-- Bandwidth   "200 Gb/s"
  +-- DSM\
  |   +-- region_0\
  |       +-- Size        "1073741824 (1 GB)"
  |       +-- Pages       "262144 total, 131072 local, 131072 remote"
  |       +-- Faults      "12345 total, 3.2 μs avg"
  +-- MemoryPool\
  |   +-- Total           "4096 GB (8 nodes × 512 GB)"
  |   +-- Available       "2048 GB"
  |   +-- LocalExported   "128 GB"
  |   +-- RemoteUsed      "64 GB"
  +-- Fabric\
      +-- Links           (RDMA link table with latency/bandwidth)
      +-- Topology        (switch-level fabric map)

Browsable via umkafs:
  cat /mnt/umka/Cluster/Nodes/node1/RTT
  → 2500 ns

  ls /mnt/umka/Cluster/DSM/
  → region_0  region_1  region_2

5.10.2.6 Open Questions

5.10.2.6.1 DSM Protocol Status

The DSM coherence protocol, wire format, and consistency models are fully specified:

| Component | Specification |
|---|---|
| MOESI coherence protocol | Section 6.6 |
| Wire format (DsmWireHeader, DsmMsgType) | Section 6.6 |
| Write-update wire encoding | Section 6.6 |
| Causal consistency (vector clocks) | Section 6.6 |
| Region management wire messages | Section 6.8 |
| Subscriber-controlled caching + DLM binding | Section 6.12 |
| Anti-entropy (DSM_RELAXED) | Section 6.13 |
| Home node failure recovery | Section 5.8 |
| Graceful shutdown drain | Section 5.8 |

5.10.2.6.2 Remaining Open Items

Multi-threaded process migration barrier protocol:

When migrating a multi-threaded process (threads sharing an address space via CLONE_VM), all threads must be frozen atomically before state transfer begins. A partially-frozen process with some threads still executing would corrupt the shared address space during transfer. The cluster-wide barrier protocol:

  1. Enumerate participating nodes: the source node queries the cluster thread registry to determine all nodes that have threads belonging to the migrating process (identified by tgid). This produces a set of participating nodes P = { node_1, node_2, ..., node_k }.

  2. Send freeze barrier: the source sends FreezeBarrier { pid, thread_count } to all nodes in P via the RDMA control channel. thread_count is the total number of threads in the thread group across all nodes (the source knows this from the cluster thread registry).

  3. Remote thread freeze: each remote node in P delivers SIGSTOP to all local threads of the process via cross-node signal delivery. The signal is injected into the target node's signal queue using an RDMA message (RemoteSignal { pid, tid, signo: SIGSTOP }). The target node's signal processing path handles RemoteSignal identically to a local kill().

  4. Collect freeze acknowledgments: the source waits for FreezeAck { pid, frozen_thread_count } from each node in P. Each FreezeAck confirms that all of that node's threads for the process are stopped and their register state is saved. Timeout: 5 seconds per node. If any node does not respond within the timeout, migration is aborted: the source sends FreezeAbort { pid } to all nodes that did acknowledge, and those nodes unfreeze their threads.

  5. Barrier complete: once all FreezeAck messages are received (confirming that the sum of frozen_thread_count values equals the expected thread_count), migration proceeds as for the single-threaded case: each node serializes its local thread states independently and sends them to the destination via the RDMA migration channel.

  6. Atomic commit on destination: the destination reconstructs all threads before unfreezing any. Thread reconstruction order:
     a. First: the main thread (thread group leader, tid == pid).
     b. Then: all other threads, in arbitrary order.
     c. Last: the destination transitions all threads from TASK_STOPPED to TASK_RUNNING in a single atomic batch (interrupts disabled on the destination CPU during the batch transition to prevent partial observation). This ensures no thread observes a state where some sibling threads exist on the destination but others do not.
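Step 4's commit-or-abort decision on the source node can be sketched as a pure tally over collected acknowledgments — message delivery and the 5-second per-node timer are elided, and the type names (`FreezeAck`, `BarrierOutcome`) are illustrative:

```rust
/// Acknowledgment from one participating node: all its local threads of
/// the migrating process are stopped with saved register state.
struct FreezeAck {
    node: u32,
    frozen_thread_count: u32,
}

#[derive(Debug, PartialEq)]
enum BarrierOutcome {
    /// All nodes acked and the frozen-thread sum matches: proceed to transfer.
    Proceed,
    /// Some node missed the deadline: send FreezeAbort, unfreeze ackers.
    Abort { missing_nodes: Vec<u32> },
}

/// Evaluate the freeze barrier once the per-node timeout window closes.
fn collect_freeze_acks(
    participants: &[u32],
    expected_threads: u32,
    acks: &[FreezeAck],
) -> BarrierOutcome {
    let missing: Vec<u32> = participants
        .iter()
        .copied()
        .filter(|n| !acks.iter().any(|a| a.node == *n))
        .collect();
    let frozen: u32 = acks.iter().map(|a| a.frozen_thread_count).sum();
    if missing.is_empty() && frozen == expected_threads {
        BarrierOutcome::Proceed
    } else {
        BarrierOutcome::Abort { missing_nodes: missing }
    }
}
```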

TCP fallback mechanism: when RDMA fails (NIC down, driver crash), the transport layer falls back to TCP as follows:

Trigger conditions:
- RDMA link-down event from the NIC driver (immediate detection)
- transport_send() returns TransportError::LinkDown (detected on the first failed op)
- Heartbeat timeout (indirect detection if the NIC crashes silently)

In-flight operation handling:
- Any RDMA operation in flight at failure time is retried via TCP.
- The sequence number is preserved: the TCP connection starts at the same message sequence number as the failed RDMA path. Both endpoints track the highest acknowledged sequence number so the receiver can deduplicate re-sent ops.
- Operations that were acknowledged before the failure need not be retried.
- Operations that may have been delivered but not acknowledged (RDMA RC completions are idempotent for read ops; write/CAS ops are idempotent if wrapped in sequence-numbered messages) are re-sent — the receiver deduplicates via sequence number.
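The receiver-side deduplication is a one-field state machine, sketched here under the assumption (stated above) that sequence numbers increase monotonically per peer channel and each transport delivers in order:

```rust
/// Per-peer receive-side dedup state for the RDMA→TCP retry scheme.
struct DedupState {
    /// Highest sequence number already applied on this channel.
    last_delivered: u64,
}

impl DedupState {
    /// Returns true if the message should be applied, false if it is a
    /// retransmit of an op already delivered via the failed RDMA path.
    fn accept(&mut self, seq: u64) -> bool {
        if seq <= self.last_delivered {
            false // duplicate — drop without re-applying
        } else {
            self.last_delivered = seq;
            true
        }
    }
}
```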

DSM coherence reconciliation on RDMA→TCP fallback:

DSM page coherence operations have multi-step semantics (read fault = directory lookup + page transfer; write fault = directory update + invalidation broadcast + ACK collection). When RDMA fails mid-operation, the home node must reconcile:

| In-flight operation | RDMA failure point | Reconciliation |
|---|---|---|
| Read fault (GetS) | Request sent, no response | Re-send GetS via TCP. Idempotent — home node checks directory, sends page if still shared. |
| Write fault (GetM) | Request sent, no response | Re-send GetM via TCP. Idempotent — home node starts or resumes ownership transfer. |
| Invalidation in progress | Directory updated to Invalidating, some ACKs received | Home node retains Invalidating state. Outstanding invalidation messages are re-sent via TCP (sequence-numbered, idempotent). Nodes that already ACKed are deduplicated. Transition to Exclusive occurs when all ACKs (original RDMA + retried TCP) are collected. |
| Page transfer in progress | Source began RDMA Write of page data, incomplete | Page transfer is restarted via TCP (copy entire page, not partial). Source retains page until TCP transfer completes and ACK received. |
| Directory CAS in progress | Atomic CAS on home node half-completed | CAS on home node is local — RDMA failure doesn't affect it. The CAS either completed (directory updated) or didn't (no change). The retried operation checks directory state and proceeds accordingly. |

Key invariant: all DSM messages are idempotent when wrapped in sequence-numbered ClusterMessageHeader entries. The home directory's seqlock (§5.7.2) ensures that concurrent directory reads see a consistent state — either before or after the ownership transfer, never during. Nodes in Invalidating state block on the directory entry's wait_queue until the transition completes regardless of transport.

Reconnection protocol:
1. The detecting node sends TRANSPORT_FALLBACK_REQUEST(seq_num=N, reason) to the peer via TCP (using the pre-established TCP backup connection maintained per peer).
2. The peer acknowledges with TRANSPORT_FALLBACK_ACK(seq_num=N).
3. Both sides switch all subsequent messages to TCP.
4. The failed RDMA QP is torn down cleanly or marked dead if unreachable.

Pre-established TCP backup connection: Each node maintains a single TCP connection to each peer for the fallback path, established at cluster join time. This connection is otherwise idle (only keepalive packets). Establishing TCP during failure adds ~100-500ms latency that would be unacceptable.

Re-establishment of RDMA: After the RDMA link recovers (NIC driver reports link-up), the transport layer re-creates the RC QP and runs an exchange to synchronize sequence numbers. Once the QP reaches RTS state, traffic migrates back from TCP to RDMA.

5.10.2.7 Unified Cluster Transport

All kernel-to-kernel communication uses the ClusterTransport trait. Every peer holds its own transport binding (Arc<dyn ClusterTransport>) — different peers in the same cluster can use different transports simultaneously (e.g., RDMA to rack-local peers, TCP to cross-datacenter peers, CXL to memory-pool-attached peers, USB to sensor hubs).

Per-peer transport model: There is no global "cluster transport". Each peer connection independently selects and holds its own transport:

/// Per-peer cluster connection. Stored in the cluster's peer registry
/// (`XArray<PeerNode>` keyed by `PeerId`). Every kernel subsystem
/// (DSM, DLM, distributed IPC, capability distribution) reaches a
/// peer through this struct — never by constructing transport-specific
/// objects directly.
pub struct PeerNode {
    /// Unique cluster-wide peer identifier.
    pub peer_id: PeerId,

    /// Transport binding for this peer. Each peer independently selects
    /// the best available transport at connection time (CXL > RDMA > TCP).
    /// Transport upgrades (e.g., TCP→RDMA after RDMA NIC init) swap this
    /// field atomically via `Arc::swap`. Downgrades (e.g., RDMA→TCP after
    /// 3 consecutive failures) likewise.
    ///
    /// `Arc` because multiple subsystems (DSM, DLM, IPC) hold references
    /// to the same peer's transport concurrently. The transport itself is
    /// per-peer (one QP pair for RDMA, one TCP socket for TCP, etc.).
    pub transport: Arc<dyn ClusterTransport>,

    /// Cluster membership and health state.
    pub health: PeerKernelHealth,

    /// Negotiated cluster protocol version.
    pub protocol_version: u32,

    /// Transport type tag (for diagnostics and sysfs reporting).
    pub transport_type: PeerTransportType,
}

The ClusterTransport trait — the universal transport interface. Every transport implementation (RDMA, TCP, CXL, PCIe BAR, NVLink, USB, HiperSockets) implements this trait. Methods do NOT take PeerId because the trait instance is already bound to a specific peer.

/// Universal transport for kernel-to-kernel communication.
/// Each instance is bound to a single remote peer. Implementations:
///   RdmaPeerTransport   — RDMA (InfiniBand / RoCE)
///   TcpPeerTransport    — TCP/IP fallback
///   CxlPeerTransport    — CXL 3.0 shared memory
///   PcieBarTransport    — PCIe BAR-mapped (DPUs, local accelerators)
///   NvLinkTransport     — GPU↔GPU or CPU↔GPU (Grace Hopper)
///   UsbPeerTransport    — USB bulk (sensor hubs, crypto tokens)
///   HiperSockTransport  — s390x inter-LPAR (QDIO)
pub trait ClusterTransport: Send + Sync {
    // --- Two-sided messaging ---

    /// Send a message. Best-effort; no delivery guarantee.
    fn send(&self, msg: &[u8]) -> Result<(), TransportError>;

    /// Reliable send with ACK. Blocks until ACK received or timeout.
    fn send_reliable(&self, msg: &[u8], timeout_ms: u32) -> Result<(), TransportError>;

    /// Poll for incoming messages (called from the cluster I/O thread).
    /// Returns the number of bytes received, or None if no message is ready.
    fn poll_recv(&self, buf: &mut [u8]) -> Option<usize>;

    // --- One-sided data transfer ---
    // These operations move bulk data. On RDMA, they map to one-sided
    // RDMA Read/Write (zero remote CPU involvement). On TCP, they use
    // request-response messages where the remote kernel thread performs
    // the memory copy and responds with the data. On CXL, they are
    // direct load/store to hardware-coherent shared memory.

    /// Fetch `size` bytes from remote peer's memory at `remote_addr`
    /// into local memory at `local_addr`. One-sided on RDMA/CXL;
    /// request-response on TCP.
    fn fetch_page(
        &self,
        remote_addr: u64,
        local_addr: PhysAddr,
        size: u32,
    ) -> Result<(), TransportError>;

    /// Push `size` bytes from local memory at `local_addr` to remote
    /// peer's memory at `remote_addr`. One-sided on RDMA/CXL;
    /// request-response on TCP.
    fn push_page(
        &self,
        local_addr: PhysAddr,
        remote_addr: u64,
        size: u32,
    ) -> Result<(), TransportError>;

    /// Batch page transfer: move N pages in a single operation.
    /// On RDMA: chained Work Requests in a single post_send().
    /// On TCP: pipelined requests (send all, then collect all ACKs).
    /// On CXL: sequential memcpy (already hardware-coherent).
    /// Default implementation calls fetch_page() in a loop.
    fn fetch_pages_batch(
        &self,
        pages: &[(u64, PhysAddr)],   // (remote_addr, local_addr) pairs
    ) -> Result<(), TransportError> {
        for &(remote, local) in pages {
            self.fetch_page(remote, local, PAGE_SIZE as u32)?;
        }
        Ok(())
    }

    // --- Remote atomics ---
    // On RDMA: NIC-side atomics (zero remote CPU, ~2-3 μs).
    // On TCP: request-response with server-side operation (~50-200 μs).
    // On CXL (coherent): hardware CAS/FAA via shared memory.
    // On PCIe BAR: MMIO atomic if supported, else doorbell + response.
    // Protocol logic is identical on all transports; only latency differs.

    /// Atomic compare-and-swap on a u64 at `remote_addr`.
    /// Returns the previous value.
    fn atomic_cas(
        &self,
        remote_addr: u64,
        expected: u64,
        desired: u64,
    ) -> Result<u64, TransportError>;

    /// Atomic fetch-and-add on a u64 at `remote_addr`.
    /// Returns the previous value.
    fn atomic_faa(
        &self,
        remote_addr: u64,
        addend: u64,
    ) -> Result<u64, TransportError>;

    // --- Ordering ---

    /// Fence: all preceding operations via this transport are visible to
    /// the remote side before any subsequent operation.
    /// RDMA: RC QP in-order delivery (writes); IBV_SEND_FENCE for
    ///       reads/atomics. TCP: no-op (TCP is ordered). CXL (coherent):
    ///       no-op. NVLink (non-coherent): GPU membar.sys.
    fn fence(&self) -> Result<(), TransportError>;

    // --- Transport metadata ---

    /// Returns the transport type name for diagnostics.
    fn transport_name(&self) -> &'static str;

    /// True if this transport supports one-sided operations (RDMA
    /// Read/Write/Atomics, CXL load/store) without remote CPU involvement.
    fn supports_one_sided(&self) -> bool;

    /// True if this transport provides hardware cache coherence (CXL 3.0,
    /// coherent NVLink). When true, the DSM directory can skip software
    /// invalidation for this peer — hardware maintains coherence.
    fn is_coherent(&self) -> bool { false }
}

Transport binding semantics by implementation:

| ClusterTransport method | RdmaPeerTransport | TcpPeerTransport | CxlPeerTransport | PcieBarTransport |
|---|---|---|---|---|
| send() | RDMA Send (RC) | TCP send (non-blocking) | Store to doorbell region | Write to BAR2 ring |
| send_reliable() | RDMA Send (RC) + poll CQ | TCP send + wait for ACK | Store + read-back fence | BAR write + MSI-X ACK |
| fetch_page() | RDMA Read (~3-5 μs) | TCP request + recv (~50-200 μs) | memcpy from shared region (~0.2-0.4 μs) | DMA read from BAR (~1-5 μs) |
| push_page() | RDMA Write (~2-3 μs) | TCP send + recv ACK (~50-200 μs) | memcpy to shared region (~0.2-0.4 μs) | DMA write to BAR (~1-5 μs) |
| atomic_cas() | RDMA Atomic CAS (~2-3 μs) | TCP request + recv (~50-200 μs) | Hardware CAS (~0.1-0.3 μs) | MMIO atomic or doorbell (~1-10 μs) |
| atomic_faa() | RDMA Atomic FAA (~2-3 μs) | TCP request + recv (~50-200 μs) | Hardware FAA (~0.1-0.3 μs) | MMIO atomic or doorbell (~1-10 μs) |
| fence() | IBV_SEND_FENCE (if needed) | no-op | no-op (hw coherent) | SFENCE + BAR read-back |
| supports_one_sided() | true | false | true | true |
| is_coherent() | false | false | true | false |

On TCP transports, one-sided and atomic operations degrade gracefully to request-response pairs. The remote kernel's cluster I/O thread processes the request synchronously. This adds ~50-200 μs latency per operation (vs ~2-5 μs on RDMA) but preserves protocol correctness. The DLM and DSM protocols are written against ClusterTransport and are transport-agnostic — they use the same message sequences regardless of the underlying transport.
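The degradation is invisible to protocol code. A minimal sketch, using simplified stand-ins for the trait and error type (the real trait is `ClusterTransport` with `TransportError`): the same directory-claim logic runs unchanged whether the CAS is executed NIC-side (RDMA), by the remote I/O thread (TCP), or in hardware (CXL):

```rust
use std::sync::Mutex;

/// Simplified stand-in for the one-sided atomic portion of
/// ClusterTransport (assumed names, for illustration only).
trait Transport {
    /// Atomic CAS on a u64 in the peer's memory; returns the previous value.
    fn atomic_cas(&self, addr: u64, expected: u64, desired: u64) -> Result<u64, ()>;
}

/// Protocol code written once against the trait: try to claim a
/// directory slot (0 = unowned) for node `me`. Transport-agnostic.
fn try_claim(t: &dyn Transport, slot_addr: u64, me: u64) -> Result<bool, ()> {
    Ok(t.atomic_cas(slot_addr, 0, me)? == 0)
}

/// Loopback mock standing in for a TCP transport: the "remote" kernel's
/// cluster I/O thread performs the CAS and replies with the old value.
struct MockTcp {
    remote_word: Mutex<u64>,
}

impl Transport for MockTcp {
    fn atomic_cas(&self, _addr: u64, expected: u64, desired: u64) -> Result<u64, ()> {
        let mut w = self.remote_word.lock().unwrap(); // server-side operation
        let old = *w;
        if old == expected {
            *w = desired;
        }
        Ok(old)
    }
}
```

The DLM and DSM never see which implementation is behind the trait; only the operation latency changes.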

5.10.2.8 Service Data Region Semantics

Service providers reference bulk data via data_region_offset — an offset into a ServiceDataRegion allocated at ServiceBind time. The offset is a local value: it tells the receiver where in its own region to find or place the data. How data gets there is the transport's job, fully transparent to service providers.

Transports fall into two classes for data region handling:

Shared-memory transports (PCIe BAR, CXL, QDIO, virtio): both sides map the same physical memory. When the sender writes data at region_offset, the receiver can read it immediately. No explicit data transfer — the write IS the transfer. This is the fastest path (~0 additional latency).

Message-based transports (RDMA, TCP, USB, s390x CCW): sides have separate address spaces. The transport layer transfers data between local regions transparently:

| Transport | Data transfer mechanism | Latency |
|---|---|---|
| RDMA | One-sided Write from sender's region to receiver's region (rkey from ServiceBind) | ~2-3 μs |
| TCP | Sender transmits data after control message; receiver copies to region at offset | ~50-200 μs |
| USB | Sender transmits via bulk endpoint; receiver copies to offset | ~1-10 ms |
| s390x CCW | Data transfer via CCW chain (channel subsystem DMA) | ~5-50 μs |

For message-based transports, the ClusterTransport implementation handles the data transfer internally when processing ring pair entries that reference data_region_offset. Service providers never trigger explicit data transfers — they write data to their local region, post a ring entry with the offset, and the transport does the rest. The performance difference is in the transport, not the service protocol.
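A minimal sketch of the offset contract for a message-based transport (the `RingEntry` shape and `deliver()` are assumptions for illustration): the provider writes into its local region and posts only the offset; the transport moves the bytes between the two regions when it processes the entry.

```rust
/// Illustrative ring entry referencing bulk data by region offset.
struct RingEntry {
    data_region_offset: usize,
    len: usize,
}

/// Message-based transport: sender and receiver have separate regions;
/// the transport copies between them transparently. (On a shared-memory
/// transport this function would be unnecessary — the write IS the transfer.)
fn deliver(tx_region: &[u8], rx_region: &mut [u8], e: &RingEntry) {
    let span = e.data_region_offset..e.data_region_offset + e.len;
    rx_region[span.clone()].copy_from_slice(&tx_region[span]);
}
```

The service provider's code is identical on both transport classes: write locally, post the offset.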

5.10.2.9 Per-Peer Transport Selection

Transport is selected independently per peer at connection time, and can be upgraded or downgraded at runtime:

/// Select transport for a peer. Called during cluster join handshake.
/// Returns the best available transport binding for the given peer.
///
/// Selection hierarchy (highest priority first):
///   1. CXL shared memory (if peer is CXL-attached, ~0.2-0.4 μs)
///   2. NVLink (if peer is NVLink-connected GPU/DPU)
///   3. PCIe BAR (if peer is on the same machine, ~1-5 μs)
///   4. RDMA (if both peers have RDMA NICs on the same fabric, ~3-5 μs)
///   5. HiperSockets (s390x inter-LPAR, if available, <1 μs)
///   6. TCP/IP (always available as last resort, ~50-200 μs)
///   7. USB (for USB-attached peers — sensor hubs, crypto tokens, ~1-10 ms)
///
/// The `preference` field in JoinRequestPayload can force a specific
/// transport (e.g., TcpOnly for debugging) or allow Auto selection.
fn select_peer_transport(
    local: &LocalNodeInfo,
    remote: &PeerJoinInfo,
    preference: PeerTransportPreference,
) -> Result<Arc<dyn ClusterTransport>, TransportError>;

Transport upgrade: If a peer initially connects via TCP (e.g., RDMA NIC was still initializing), and RDMA later becomes available, the transport can be upgraded:

  1. The cluster I/O thread detects RDMA availability for a TCP peer.
  2. It creates a new RdmaPeerTransport and performs the RDMA handshake (QP creation, rkey exchange) over the existing TCP connection.
  3. Once the RDMA path is verified (successful ping), the PeerNode.transport is atomically swapped via Arc::swap().
  4. In-flight operations on the old TCP transport complete normally (the old Arc<TcpPeerTransport> is kept alive by existing references until they drop).
  5. New operations use the RDMA transport.

Transport downgrade: After 3 consecutive failures to the same peer on a non-TCP transport, that peer is demoted to TCP. The failing transport is retried every 60 seconds in the background.
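The swap semantics in upgrade step 4 can be sketched as follows. The kernel swaps the field with a lock-free atomic Arc swap; a `Mutex` stands in here so the example runs on plain std, and the trait is a simplified stand-in:

```rust
use std::sync::{Arc, Mutex};

/// Simplified stand-in for ClusterTransport.
trait Transport {
    fn name(&self) -> &'static str;
}
struct Tcp;
struct Rdma;
impl Transport for Tcp {
    fn name(&self) -> &'static str { "tcp" }
}
impl Transport for Rdma {
    fn name(&self) -> &'static str { "rdma" }
}

/// Sketch of the per-peer transport slot. In-flight callers keep their
/// cloned Arc alive until they drop it, exactly as described in step 4.
struct PeerNode {
    transport: Mutex<Arc<dyn Transport + Send + Sync>>,
}

impl PeerNode {
    /// Subsystems snapshot the current transport before each operation.
    fn current(&self) -> Arc<dyn Transport + Send + Sync> {
        self.transport.lock().unwrap().clone()
    }
    /// Swap in the verified new transport; the old Arc is freed only
    /// when the last in-flight reference drops.
    fn upgrade(&self, new_t: Arc<dyn Transport + Send + Sync>) {
        *self.transport.lock().unwrap() = new_t;
    }
}
```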

5.10.2.10 Transport Construction from Tier M NICs

When the local NIC is a Tier M peer (Section 11.1), the host builds ClusterTransport from the NIC's advertised capabilities instead of loading a KABI NIC driver. The NIC doesn't know or care whether it's used for cluster transport — it advertises what it IS (EXTERNAL_NETWORK, optionally RDMA_CAPABLE). The host decides how to use it.

Tier M NIC with RDMA_CAPABLE: The ring pair established at ServiceBind carries native RDMA work request entries. The NIC hardware already knows how to process these — same format as if a KABI driver posted them, just exposed via the Tier M ring pair in BAR2 instead of driver-specific MMIO work queues. RdmaPeerTransport writes WQEs to the ring pair; the NIC hardware executes them. One-sided operations (push_page, fetch_page, atomics) are hardware-accelerated with no firmware CPU involvement on the data path. This is the fastest path (~2-5 μs), with the additional benefit of eliminating the Tier 0/1 crossing that a KABI driver would require (~300-800 ns saved per operation).

Tier M NIC with EXTERNAL_NETWORK only (no RDMA): The ring pair carries TxPacket/RxPacket network service messages. The host runs its own TCP stack; the NIC handles L2 frame send/receive with available offloads (checksum, TSO, RSS advertised in ExternalNicProperties). TcpPeerTransport builds TCP connections using the NIC's packet I/O service. Slower (~50-200 μs) but works with any NIC.

KABI NIC driver (no Tier M shim): Traditional path — host loads a per-device KABI driver (e.g., mlx5, ixgbe), which registers a netdev. ClusterTransport is built on top of the netdev interface. Same as Linux today. Compatibility path for devices without firmware shim.

The select_peer_transport() hierarchy applies the same priority logic regardless of whether the NIC is Tier M or KABI-driven — it selects the best available transport. The Tier M path is preferred when available (eliminates driver, removes tier crossing, reduces host-side overhead).

5.10.2.11 TCP Fallback Transport

When RDMA is unavailable (non-RDMA fabric, RDMA link failure, or RDMA initialization error), peers fall back to TCP:

TcpPeerTransport (implements ClusterTransport, consumes the kernel TCP stack via SocketOps trait, Section 16.3):

/// TCP transport binding for a single remote peer.
/// Created when RDMA/CXL are unavailable or as initial transport during
/// cluster join (before RDMA handshake completes).
pub struct TcpPeerTransport {
    /// Kernel-internal socket to this peer. Persistent connection,
    /// auto-reconnected on drop. Created via `SocketOps::connect()`.
    socket: KernelSocket,

    /// Pre-allocated send buffer pool for hot-path messages (DSM page
    /// fault path). 4 outstanding messages × max message size. The
    /// hot-path send never allocates and never blocks on memory.
    send_pool: SendBufferPool,

    /// Monotonically increasing sequence number. u64 internally; truncated
    /// to u32 on the TCP framing wire (wraps every ~7 min at 10M msg/sec,
    /// harmless — see framing detail below). Duplicate detection at the
    /// cluster protocol layer uses `ClusterMessageHeader.sequence` (Le64),
    /// not this framing seq.
    seq: AtomicU64,
}

impl ClusterTransport for TcpPeerTransport { /* ... */ }

  • One persistent TCP connection per peer (reconnected automatically on drop). TX uses SocketOps::sendmsg() with NetBuf scatter-gather for zero-copy on large cluster messages (DSM page transfers, capability tokens).
  • send(): non-blocking write to a per-connection TX ring buffer (kernel-side socket send buffer).
  • send_reliable(): write + wait for application-level ACK message (4-byte ACK: u32 echoing message sequence number). Timeout after timeout_ms.
  • fence(): no-op (TCP is ordered within a connection).
  • is_coherent(): returns false.
  • Framing: 8-byte header [msg_len: u32, seq: u32] + variable payload; the full layout and sanity checks are specified in the framing detail below. On reconnect, both sides reset their framing seq counters to 0, and the receiver discards any buffered partial frames from the previous connection (TCP close drains the socket buffer, so partial frames should not survive reconnection; this is a defense-in-depth measure).

Allocation context: send() is called from the cluster I/O worker thread (process context, preemption enabled). Framing buffer allocation uses GFP_KERNEL. For the DSM page fault hot path, a pre-allocated per-peer send buffer pool (4 outstanding messages × max message size per peer) is used — send_from_pool() never allocates and never blocks on memory.

TCP framing detail: Each TCP message is: [msg_len: u32 LE] [seq: u32 LE] [ClusterMessageHeader: 40 bytes] [payload: msg_len - 40 bytes]. msg_len includes the ClusterMessageHeader but excludes the 8-byte framing header. The seq is a connection-level stream-integrity field (detects truncated or corrupt frames); duplicate detection uses ClusterMessageHeader.sequence (Le64). The TCP receiver verifies msg_len == 40 + payload_length as a sanity check (where payload_length comes from the parsed ClusterMessageHeader). On mismatch the message is dropped and fma_counter(PEER_FRAMING_ERROR) is incremented.

Performance: ~5-20× higher latency than RDMA (microseconds to tens of microseconds) and ~10× lower throughput. Acceptable for a fallback path; cluster health monitoring raises a warning and attempts RDMA re-establishment.
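A minimal sketch of the framing layout and the receiver-side sanity check (the 40-byte ClusterMessageHeader is treated as an opaque byte array here, and `expected_payload_len` stands in for the payload length parsed from it):

```rust
/// Framing sketch: [msg_len: u32 LE][seq: u32 LE][header: 40 B][payload].
/// msg_len covers header + payload, not the 8-byte framing prefix.
const CLUSTER_HDR_LEN: usize = 40;

fn frame(seq: u32, header: &[u8; CLUSTER_HDR_LEN], payload: &[u8]) -> Vec<u8> {
    let msg_len = (CLUSTER_HDR_LEN + payload.len()) as u32;
    let mut out = Vec::with_capacity(8 + msg_len as usize);
    out.extend_from_slice(&msg_len.to_le_bytes());
    out.extend_from_slice(&seq.to_le_bytes());
    out.extend_from_slice(header);
    out.extend_from_slice(payload);
    out
}

/// Receiver-side check from the text: msg_len must equal
/// 40 + payload_length. Returns (seq, header, payload), or None when
/// the frame is dropped (and the real kernel bumps PEER_FRAMING_ERROR).
fn parse(buf: &[u8], expected_payload_len: usize) -> Option<(u32, &[u8], &[u8])> {
    if buf.len() < 8 + CLUSTER_HDR_LEN {
        return None;
    }
    let msg_len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    let seq = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    if msg_len != CLUSTER_HDR_LEN + expected_payload_len || buf.len() != 8 + msg_len {
        return None;
    }
    Some((seq, &buf[8..8 + CLUSTER_HDR_LEN], &buf[8 + CLUSTER_HDR_LEN..]))
}
```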

Fallback trigger: The cluster subsystem detects RDMA unavailability at startup (ibv_query_device() failure) or at runtime (persistent RDMA send errors → TransportError::FabricDown). After 3 consecutive RDMA failures to the same node, that peer's transport is demoted to TCP. RDMA is re-attempted every 60s.

Known limitations (deferred beyond v1):

  • Scale extension beyond 64 nodes: Data structure changes for >64-node clusters (extended bitfields, hierarchical directories).
  • Partial network partitions: Non-transitive reachability (A can reach B, B can reach C, but A cannot reach C). The current quorum model assumes transitive reachability.
  • Thread group migration: Migrating a process whose threads share an address space via CLONE_VM. Requires the cluster-wide freeze barrier protocol specified in Section 5.6 above: all threads across all nodes must be frozen atomically before transfer.
  • ptrace migration: Transferring a process that is being traced (PTRACE_ATTACH) is not handled in v1. The tracer and tracee must reside on the same node.
  • GPU context migration on consumer hardware: Migration of AccelContexts on GPUs without hardware-supported checkpoint (most GeForce and Radeon consumer parts). Currently fails with -EOPNOTSUPP; a future software-level CUDA/ROCm checkpoint integration may enable this at the cost of longer quiescence time.

5.10.3 Implementation Phasing

| Component | Phase | Dependencies | Notes |
|---|---|---|---|
| ClusterTransport trait + RdmaPeerTransport | Phase 3 | RDMA driver KABI | Foundation for everything |
| TcpPeerTransport (TCP fallback) | Phase 3+ | ClusterTransport | TCP socket transport for RDMA-less environments; end-to-end round-trip latency ~50-200 μs (network + kernel processing) vs ~3-5 μs RDMA Read. Note: the ~5 μs kernel processing overhead quoted in Section 5.1.1.2 is per-packet processing only, not round-trip time. |
| Cluster join / topology discovery | Phase 3 | ClusterTransport | Basic cluster formation |
| Heartbeat + failure detection | Phase 3 | ClusterTransport | Must have before distributed state |
| Distributed Lock Manager (Section 15.15) | Phase 3-4 | ClusterTransport, Heartbeat | RDMA-native DLM; prerequisite for clustered FS |
| Distributed IPC (RDMA rings) | Phase 3-4 | ClusterTransport, IPC | Natural extension of existing IPC |
| Pre-registered kernel memory | Phase 3-4 | RDMA driver, IOMMU | Performance prerequisite |
| PageLocation RemoteNode variant | Phase 4 | Memory manager | Small MM extension |
| DSM page fault handler | Phase 4 | PageLocation, ClusterTransport | Core DSM functionality |
| DSM directory (home-node hash) | Phase 4 | DSM fault handler | Page ownership tracking |
| DSM coherence protocol | Phase 4-5 | DSM directory | Multiple-reader / single-writer |
| Distributed capabilities (signed) | Phase 4 | Capability system, Ed25519 | Security foundation |
| Cooperative page cache | Phase 4-5 | DSM, VFS | Distributed page cache |
| Global memory pool (basic) | Phase 5 | DSM, cgroups | Remote memory as swap tier |
| Cluster scheduler | Phase 5 | Global pool, DSM affinity | Process migration |
| Process migration | Phase 5 | Cluster scheduler | Freeze/thaw + lazy page fetch |
| Capability services (block, VFS, accel) | Phase 5 | ClusterTransport, peer protocol | Remote access to driver-managed devices |
| GPUDirect RDMA cross-node | Phase 5 | P2P DMA, RDMA | GPU↔GPU across network |
| Split-brain resolution | Phase 5 | Heartbeat, DSM | Quorum + fencing |
| CXL 2.0 pooled memory | Phase 5 | Memory manager | CXL memory as NUMA node |
| CXL 3.0 shared memory | Phase 5+ | CXL 2.0, DSM | Hardware-coherent DSM |
| Global memory pool (advanced) | Phase 5+ | All of above | Full cluster memory management |
| Cluster-wide cgroup integration | Phase 5+ | Cluster scheduler, global pool | Kubernetes-ready |
| DSM replication (fault tolerance) | Phase 5+ | DSM, replication protocol | For critical workloads |

5.10.3.1 Priority Rationale

Phase 3-4 (Foundation): RDMA transport + cluster formation + basic DSM. This makes UmkaOS cluster-aware and enables distributed IPC. MPI and NCCL workloads benefit immediately from kernel-native RDMA transport.

Phase 4-5 (Practical Wins): Cooperative page cache + global memory pool + signed capabilities. This is when distributed UmkaOS becomes genuinely useful: remote memory as a tier, shared file caching, and secure cross-node operations.

Phase 5+ (Competitive Advantage): Process migration, CXL integration, cluster-wide resource management. Features that no other OS provides. The kernel manages a cluster of machines as a single coherent system.


5.10.4 Licensing Summary

| Component | IP Source | Risk |
|---|---|---|
| RDMA kernel transport | Original design (uses standard RDMA verbs) | None |
| DSM page coherence | Academic (published research: Ivy, TreadMarks, Munin, GAM) | None |
| Home-node directory | Academic (distributed hash table, published) | None |
| Global memory pool | Original design (extends NUMA model) | None |
| Cooperative page cache | Academic (published research) | None |
| Cluster scheduler | Original design (extends CBS) | None |
| Distributed capabilities | Original design (Ed25519 is public-domain) | None |
| CXL integration | CXL spec (public, royalty-free consortium) | None |
| Process migration | Academic (MOSIX concepts are published research) | None |
| Split-brain / quorum | Academic (Paxos, Raft, published) | None |
| CRDT revocation list | Academic (Shapiro et al., published) | None |

All components are either original design or based on published academic research and open specifications. No vendor-proprietary APIs or patented algorithms.


5.10.5 Comparison: Why Previous DSM Projects Failed and Why This Succeeds

| Factor | Kerrighed / OpenSSI / MOSIX | UmkaOS Distributed |
|---|---|---|
| Kernel design | Bolted onto Linux (30M+ LOC, assumes single machine) | Designed from scratch with distribution in mind |
| Coherence granularity | Cache-line (64B) — false sharing kills performance | Page-level (4KB) — matches network latency |
| Hardware support | None (pure software coherence) | RDMA (2020s), CXL 3.0 (2025+) |
| Memory model | Patched Linux MM (invasive, broke on updates) | PageLocationTracker already supports heterogeneous tiers |
| Transport | TCP sockets (high overhead) | RDMA one-sided ops (zero remote CPU) |
| Security | Unix permissions (not network-portable) | Cryptographically-signed capabilities |
| Fault tolerance | Fragile (node failure = cluster crash) | Quorum-based, bounded-lifetime capabilities, graceful degradation |
| Application compat | Modified syscall layer, broke things | Standard POSIX + opt-in extensions |
| Maintenance burden | Thousands of patches across all subsystems | Clean integration points (PageLocation, IPC transport, capability service providers) |
| Timing | 2000s — hardware wasn't ready | 2026 — RDMA ubiquitous, CXL arriving, AI demands it |

5.11 SmartNIC and DPU Integration

5.11.1 Problem

DPUs (Data Processing Units) — NVIDIA BlueField, AMD Pensando, Intel IPU — are processors that sit on the network path. They have their own ARM cores, run their own OS, and can process network traffic, storage I/O, and security policies without using host CPU cycles.

Linux models DPUs as dumb PCIe devices with massive host-side drivers. UmkaOS models them as what they are: Tier M peers that join the cluster via the standard peer protocol (Section 5.1).

5.11.2 Design: DPUs as Tier M Peers

A DPU is a Tier M peer (Section 11.1). It connects to the host via the same peer protocol that all multikernel communication uses:

  • Full UmkaOS port (Path 2): the DPU runs a full UmkaOS kernel instance on its ARM cores. Joins the cluster as a ClusterNode. Full DSM, DLM, and distributed scheduling participation. Best for DPUs with 8+ cores and 8+ GB DRAM (BlueField-3, Pensando Elba).
  • Firmware shim (Path 3): the DPU keeps its existing vendor OS (NVIDIA DOCA, AMD firmware) and implements the UmkaOS peer protocol as a shim (~10-18K lines of C, excluding crypto primitives already in firmware). The host cannot distinguish a shim from a full port. Best for DPUs with constrained firmware environments or vendor lock-in.

Trust boundary: Firmware shims are not in the Trusted Computing Base (TCB). A compromised shim is equivalent to a malicious network peer — isolated by IOMMU, constrained by capability scoping (the host grants only the capabilities the DPU needs per ServiceBind), and subject to unilateral host-side reset (FLR/SBR). The Tier M boundary is a hardware fabric (PCIe BAR + IOMMU domain); crossing it requires physical access or firmware compromise — a categorically different threat from software policy bypass. The host never trusts DPU-originated data without validation: ring buffer positions are bounds-checked, ServiceResponse payloads are validated against the service schema, and capability tokens are verified on every dispatch.

Both paths use the same wire format: PeerMessageType messages over DomainRingBuffer ring pairs in BAR2, with PeerControlRegs in BAR0 and doorbells at BAR0+0x100 (Section 5.1).

There is no separate "offload transport" or "offload proxy." The peer protocol IS the transport. The DPU advertises its services via CapAdvertise (Section 5.1), the host binds services via ServiceBind, and data flows through ring pairs using ServiceMessage/ServiceResponse. The host's generic umka-peer-transport module (~2,000 lines) handles all Tier M peers identically — the DPU gets no special-case code.

5.11.3 How It Works

Traditional driver (host, Tier 0/1/2):
  Process → syscall → UmkaOS Core → KABI vtable → Host driver → Hardware

DPU peer (Tier M):
  Process → syscall → UmkaOS Core → peer transport →
    → ServiceMessage via ring pair → DPU peer → DPU hardware

The peer transport implements the same KABI service interface as a host driver.
UmkaOS Core dispatches to a ServiceId; whether that service lives on a local
driver or a remote peer is transparent — the ServiceId resolves to a ring pair
endpoint instead of a local vtable call.

5.11.4 DPU Discovery and Join

DPU discovery follows the standard peer negotiation state machine (Section 5.1):

  1. PCIe enumeration: Host discovers the DPU as a PCIe device. Reads BAR0 for PeerControlRegs.magic (0x554D4B41 = "UMKA"). If magic matches → Tier M peer. If not → standard PCIe device, use traditional driver.
  2. Handshake: Host reads scratchpad (BAR0+0x200) for X25519 public key. Sends JoinRequest via ring pair 0. DPU responds with JoinAccept.
  3. Capability advertisement: DPU sends CapAdvertise listing available services (e.g., ServiceId("nic_offload", 1), ServiceId("nvmeof_target", 1), ServiceId("ipsec_offload", 1)).
  4. Service binding: Host sends ServiceBind for each service it wants. DPU responds with ServiceBindAck including ring pair offsets and doorbell indices.
  5. Active: Heartbeat begins (100ms). Services process requests.

DPU firmware loading is out of scope — managed by the DPU's own boot chain (BlueField: UEFI on ARM cores, Intel IPU: vendor loader). The peer protocol begins after the DPU firmware is running.
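The discovery sequence can be sketched as a magic check (step 1) plus a join-state progression (steps 2-5). The enum, event strings, and function names below are illustrative stand-ins, not the kernel's actual identifiers:

```rust
/// Step 1: read PeerControlRegs.magic from BAR0 and classify.
const UMKA_MAGIC: u32 = 0x554D_4B41; // ASCII "UMKA"

fn is_tier_m_peer(bar0_magic: u32) -> bool {
    bar0_magic == UMKA_MAGIC // else: standard PCIe device, traditional driver
}

/// Steps 2-5 as a state progression (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum PeerState {
    Enumerated, // magic matched
    Joined,     // JoinRequest/JoinAccept handshake done
    Advertised, // CapAdvertise received
    Bound,      // ServiceBindAck received
    Active,     // heartbeat running, services processing requests
}

/// Advance the join state machine on a received protocol message.
fn on_message(s: PeerState, msg: &str) -> Option<PeerState> {
    match (s, msg) {
        (PeerState::Enumerated, "JoinAccept") => Some(PeerState::Joined),
        (PeerState::Joined, "CapAdvertise") => Some(PeerState::Advertised),
        (PeerState::Advertised, "ServiceBindAck") => Some(PeerState::Bound),
        (PeerState::Bound, "Heartbeat") => Some(PeerState::Active),
        _ => None, // out-of-order message: protocol violation
    }
}
```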

5.11.5 Use Cases

| Scenario | Host Path | DPU Peer Path | Benefit |
|---|---|---|---|
| Network firewall | CPU processes every packet | DPU processes packets, host sees only allowed traffic | CPU freed for applications |
| NVMe-oF target | CPU handles RDMA + NVMe | DPU handles RDMA + NVMe, host CPU uninvolved | Zero host CPU for storage serving |
| IPsec / TLS | CPU encrypts/decrypts | DPU encrypts/decrypts via crypto service | CPU freed, lower latency |
| vSwitch (OVS) | CPU handles VM networking | DPU handles VM networking via vswitch service | Major CPU savings in cloud |
| Telemetry | CPU collects and sends metrics | DPU collects and sends via telemetry service | No host CPU overhead |

5.11.6 DPU Failure Handling

DPU failure follows the standard peer crash recovery sequence (Section 5.3), with one DPU-specific extension: host fallback.

DPU crash / reboot / PCIe link failure:

Standard peer recovery (Section 5.3, condensed):
  1. IOMMU lockout (<1ms)
  2. PCIe bus master disable (<1ms)
  3. Capability revocation — all services provided by this DPU are unregistered.
     In-flight ServiceMessage requests on all ring pairs receive error completions.
  4. Cluster membership revocation (DeadNotify broadcast)
  5. FLR → SBR → power cycle escalation ladder

DPU-specific extension — host fallback (steps 6-8):
  6. For each service the DPU was providing, check if a host-side driver
     exists for the same ServiceId:
     a. ServiceId("nic_offload") → host has e1000/mlx5 driver → activate it.
     b. ServiceId("ipsec_offload") → host has software IPsec → activate it.
     c. ServiceId("nvmeof_target") → no host equivalent → function unavailable.
  7. Fallback drivers are pre-registered in the KABI service registry with
     `fallback_priority: u32` lower than the DPU's priority. When the DPU's
     service is revoked, the registry automatically resolves to the next-best
     provider (the host driver).
  8. If DPU recovers (FLR + rejoin via JoinRequest): DPU re-advertises services
     → host re-binds → host fallback drivers are deactivated.

This fallback mechanism is unique to DPUs — generic Tier M peers (SAS controllers, FPGAs) typically have no host-side equivalent. The KABI service registry's priority-based resolution makes fallback automatic: no special-case code.
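The priority-based resolution that makes fallback automatic can be sketched as follows (`Provider` and `resolve()` are simplified stand-ins for the KABI service registry; the priority values mirror the defaults in Section 5.11.7):

```rust
/// Simplified stand-in for a registry entry for one ServiceId.
struct Provider {
    name: &'static str,
    priority: u32, // DPU peer default 100, host KABI driver default 50
    alive: bool,   // false once the provider's services are revoked
}

/// Resolve a ServiceId to the highest-priority live provider.
/// When the DPU entry dies, the host fallback driver (lower priority)
/// is selected automatically — no special-case fallback code.
fn resolve(providers: &[Provider]) -> Option<&Provider> {
    providers.iter().filter(|p| p.alive).max_by_key(|p| p.priority)
}
```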

5.11.7 Service Registry Integration (PeerServiceProxy)

When a Tier M peer advertises a service via CapAdvertise, the host kernel must make it available through the standard KabiServiceRegistry (Section 12.7) so that host subsystems (block layer, network stack, accelerator framework) can discover and use it via registry_lookup_service() without any Tier M-specific awareness.

The bridge is PeerServiceProxy — a thin wrapper that implements a standard KABI vtable backed by ring pair operations to the Tier M peer:

/// Bridges a Tier M peer service into the host's KABI service registry.
/// One instance per (peer, ServiceId) pair. Created during CapAdvertise
/// processing; destroyed when the peer disconnects or withdraws the service.
///
/// Registered in KabiServiceRegistry at priority `peer_priority` (default 100,
/// higher than host KABI drivers at default 50). When both a Tier M peer and
/// a host KABI driver provide the same ServiceId, the peer wins (higher
/// priority) and the KABI driver is dormant. On peer disconnect, the registry
/// auto-resolves to the KABI driver (lower priority, now best available).
pub struct PeerServiceProxy {
    /// The Tier M peer providing this service.
    peer_id: PeerId,
    /// ServiceId from the CapAdvertise message.
    service_id: ServiceId,
    /// Ring pair for ServiceMessage/ServiceResponse exchange.
    ring_pair: PeerQueuePair,
    /// ServiceDataRegion for bulk transfers.
    data_region: ServiceDataRegion,
    /// Registry handle (for deregistration on disconnect).
    registry_handle: KabiServiceHandle,
    /// Generation counter — odd = active, even = dead.
    /// Checked at every vtable dispatch (same as any KABI service handle).
    generation: AtomicU32,
    /// Service-specific vtable implementation.
    /// The vtable methods translate high-level calls (e.g., `submit_bio`,
    /// `ndo_start_xmit`) into ServiceMessage ring pair entries.
    vtable: &'static dyn PeerServiceVtable,
}

Registration flow:

CapAdvertise received from Tier M peer:
  1. Peer protocol layer validates: PeerCapFlags, namespace trust, LSM check.
  2. For each ServiceId in the advertisement:
     a. Create PeerServiceProxy with appropriate vtable:
        - ServiceId("block_io")   → BlockServiceVtable (translates bio → BlockServiceRequest)
        - ServiceId("external_nic") → NetServiceVtable (translates xmit → TxPacket)
        - ServiceId("accel_compute") → AccelServiceVtable (translates ioctl → AccelServiceOp)
        - ... (one vtable per service type, part of the peer transport module)
     b. ServiceBind to the peer → negotiate ring pair, data region.
     c. Register in KabiServiceRegistry with peer_priority.
     d. Set generation to 1 (active).
  3. Host subsystems discover the service via registry_lookup_service()
     — no awareness that it's backed by a Tier M peer.

Peer disconnect (crash, drain, or admin removal):
  1. Set generation to 2 (dead) — all in-flight vtable calls return -ENODEV.
  2. Deregister PeerServiceProxy from KabiServiceRegistry.
  3. Registry auto-resolves: next-best provider (host KABI driver, if loaded).
  4. Destroy ring pair and data region.

Why this design:

  • Host subsystems are unmodified — they use registry_lookup_service() and get a vtable, same as for any KABI driver. No if (is_peer_service) conditionals.
  • Fallback is automatic — the registry's priority-based resolution handles it. No special fallback code in each subsystem.
  • The cost is one vtable indirection + ring pair post per operation. For Tier M peers on PCIe (~1-2 μs ring pair latency), this is faster than a KABI driver with a Tier 0/1 crossing (~300-800 ns saved), making PeerServiceProxy the preferred path when available.
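The generation gate described above (odd = active, even = dead, checked at every vtable dispatch) can be sketched in a few lines. This is an illustrative stand-in, not the kernel's actual type: `ProxyGate`, `dispatch`, `kill`, and the `ENODEV` constant are hypothetical names chosen for the example.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative stand-in for the kernel's -ENODEV error code.
const ENODEV: i32 = 19;

/// Minimal sketch of the proxy's liveness gate: odd = active, even = dead.
struct ProxyGate {
    generation: AtomicU32,
}

impl ProxyGate {
    fn new() -> Self {
        // Registration flow step 2d: set generation to 1 (active).
        Self { generation: AtomicU32::new(1) }
    }

    /// Called at every vtable dispatch before posting to the ring pair.
    /// Returns Err(-ENODEV) once the peer has disconnected.
    fn dispatch(&self) -> Result<(), i32> {
        if self.generation.load(Ordering::Acquire) % 2 == 1 {
            Ok(()) // active: proceed to post a ServiceMessage
        } else {
            Err(-ENODEV) // dead: all in-flight calls fail fast
        }
    }

    /// Peer disconnect path: bump to the next even value (1 -> 2, dead).
    fn kill(&self) {
        self.generation.fetch_add(1, Ordering::Release);
    }
}
```

The Acquire/Release pairing ensures a dispatcher that observes the even generation also observes any teardown writes that preceded it.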

5.11.8 Shared State Consistency

DPU and host share data-plane state (flow tables, packet counters) via the DomainRingBuffer ring pairs in BAR2 (Section 5.1). The consistency model:

Source of truth:
  Data plane (flow entries, counters): DPU is authoritative.
    DPU processes packets at line rate. Host reads are stale by ~μs.
  Control plane (policy, configuration): Host is authoritative.
    Host writes policy via ServiceMessage. DPU reads and applies.

Consistency:
  All cross-boundary communication uses DomainRingBuffer (Section 11.5,
  10-drivers.md) with producer/consumer atomics. No locks across PCIe.
  Doorbell coalescing (Section 5.1.2.8) amortizes interrupt overhead.
  Host writes are fenced (PCIe write ordering) before doorbell.
  DPU writes are fenced before updating ring published position.
  Host reads check published position for availability.
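The publish protocol above (producer writes entries, fences, then advances the published position; consumer checks the published position before reading) can be illustrated with ordinary atomics standing in for PCIe-visible memory. A toy sketch, not the real DomainRingBuffer (Section 11.5); `PublishedRing`, `produce`, and `consume` are hypothetical names, and the Release store models the fence-before-doorbell rule.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Toy single-producer ring slice illustrating the publish protocol.
struct PublishedRing {
    slots: Vec<AtomicU64>, // stand-in for ring entries in BAR2
    published: AtomicU64,  // producer's published position
}

impl PublishedRing {
    fn new(len: usize) -> Self {
        Self {
            slots: (0..len).map(|_| AtomicU64::new(0)).collect(),
            published: AtomicU64::new(0),
        }
    }

    /// Producer: write the entry, then publish with Release ordering
    /// (analogous to fencing writes before ringing the doorbell).
    fn produce(&self, val: u64) {
        let pos = self.published.load(Ordering::Relaxed);
        self.slots[pos as usize % self.slots.len()].store(val, Ordering::Relaxed);
        self.published.store(pos + 1, Ordering::Release);
    }

    /// Consumer: check the published position for availability first;
    /// the Acquire load pairs with the producer's Release store.
    fn consume(&self, idx: u64) -> Option<u64> {
        if self.published.load(Ordering::Acquire) > idx {
            Some(self.slots[idx as usize % self.slots.len()].load(Ordering::Relaxed))
        } else {
            None
        }
    }
}
```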

DPU Firmware Update Lifecycle:

Updating DPU firmware while services are running:

  1. For each DPU service with a host fallback: send ServiceUnbind, activate host fallback driver (automatic via KABI service registry priority).
  2. Send LeaveNotify to the DPU with drain_timeout_ms (default 5s).
  3. DPU drains in-flight I/O, transitions to LEAVING → IDLE.
  4. Apply firmware update via vendor-specific mechanism (BlueField: bfb-install, Intel IPU: vendor tool).
  5. DPU reboots with new firmware. Peer protocol handshake restarts (READY → JOINING → CapAdvertise → ServiceBind). New firmware may advertise different or additional services.
  6. Host re-binds services. Host fallback drivers are deactivated.

If no host fallback exists for a service, the function is unavailable during DPU reboot (maintenance window required).

DPU Multi-Tenancy:

In cloud environments, a single DPU serves multiple VMs/containers:

  • SR-IOV VFs: The DPU's NIC exposes SR-IOV Virtual Functions, one per tenant. Each VF is passed through to a VM via IOMMU. Hardware isolates VF traffic.
  • Hardware flow classification: The DPU's embedded switch classifies packets by flow (5-tuple or VXLAN VNI) and routes to the correct VF.
  • Per-VF offload: Each tenant's offloaded functions (firewall rules, encryption keys, QoS policies) are isolated per VF. The host kernel's cgroup hierarchy maps to VF assignments. Per-VF services are advertised as separate PeerServiceDescriptor entries in the CapAdvertise message, each with a properties blob containing the VF index.

Offload Decision Criteria:

The kernel decides whether to bind a DPU service (vs using a host driver) based on:

  1. Capability: Does the DPU advertise this ServiceId in CapAdvertise?
  2. Admin policy: /sys/kernel/umka/peers/<node_id>/bind_policy — auto (prefer DPU if available), manual (admin explicitly binds), or disabled.
  3. Intent integration: If the cgroup's intent.efficiency is high, prefer DPU (reduces host CPU power). If intent.latency_ns is very low, check whether the PCIe ring round-trip (~1-2μs) exceeds the latency budget.
  4. Peer load: If the DPU's heartbeat reports cpu_percent > 90, do not bind additional services (backpressure from HeartbeatMessage.cpu_percent).
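The four criteria above compose into a simple gate. A minimal sketch, assuming illustrative types: `BindPolicy`, `PeerStatus`, `Intent`, and `should_bind_dpu` are hypothetical names, and `PCIE_RING_RTT_NS` uses the upper end of the ~1-2 μs ring round-trip quoted above.

```rust
/// Hypothetical mirror of the sysfs bind_policy values.
#[derive(PartialEq)]
enum BindPolicy { Auto, Manual, Disabled }

/// Illustrative slice of peer state relevant to the decision.
struct PeerStatus {
    advertises_service: bool, // 1. ServiceId present in CapAdvertise?
    cpu_percent: u8,          // 4. load from HeartbeatMessage.cpu_percent
}

/// Illustrative cgroup intent: latency budget, if one is set.
struct Intent {
    latency_ns: Option<u64>,
}

const PCIE_RING_RTT_NS: u64 = 2_000; // ~2 μs ring round-trip (upper bound)

fn should_bind_dpu(policy: &BindPolicy, peer: &PeerStatus, intent: &Intent) -> bool {
    if !peer.advertises_service { return false; }      // 1. capability
    if *policy != BindPolicy::Auto { return false; }   // 2. admin policy
    if let Some(budget) = intent.latency_ns {          // 3. intent: does the
        if PCIE_RING_RTT_NS > budget { return false; } //    RTT fit the budget?
    }
    peer.cpu_percent <= 90                             // 4. backpressure
}
```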

5.11.9 Performance Impact

When using a DPU peer: the host CPU does LESS work. Performance improves for host applications because infrastructure processing moves to the DPU.

Overhead: one ring buffer round-trip (~1-2 μs) per ServiceMessage that crosses the host-DPU boundary. But DPU services handle the fast path entirely on the DPU — the host only sees control-plane operations (setup, teardown, configuration changes) via ServiceMessage. Data-plane packets flow through DPU hardware directly (SR-IOV VFs, hardware flow rules) without touching the host CPU or the ring buffers.


5.12 Affinity-Based Service Placement

When a cluster has multiple peer kernel instances (host + DPU + computational storage + CXL memory controller), the question "where should this service run?" needs a principled answer. Section 5.11 describes DPU offload decisions via static policy (bind_policy). This section generalizes that into a declarative affinity model where services declare their placement preferences and the cluster automatically determines optimal placement.

Inspired by: Helios (Microsoft Research, SOSP 2009) — satellite kernel model where processes declare affinities and the OS automatically places them on the best available kernel instance.

5.12.1 Affinity Model

// umka-core/src/cluster/affinity.rs

/// Affinity rule for a service. Multiple rules are evaluated together
/// to determine optimal placement.
pub enum AffinityRule {
    /// Co-locate with this service (positive affinity).
    /// The placement algorithm prefers peers that already host the
    /// named service. Strength (0-100) determines priority when
    /// multiple rules conflict.
    CoLocate { service: ServiceId, strength: u8 },

    /// Isolate from this service (negative affinity).
    /// The placement algorithm avoids peers that host the named service.
    /// Use case: isolate latency-sensitive services from batch workloads.
    Isolate { service: ServiceId, strength: u8 },

    /// Prefer the peer closest (lowest path cost) to the specified peer.
    /// Use case: place a storage cache service near the storage device.
    /// `max_latency_ns` uses the same u32 nanosecond range as
    /// `TopologyEdge.latency_ns` — clamped at `u32::MAX` (~4.29 s);
    /// see `ClusterNode.measured_rtt_ns` for saturation semantics.
    NearTo { peer: PeerId, max_latency_ns: u32 },

    /// Require a specific capability on the target peer.
    /// Use case: service needs RDMA, CXL coherence, or GPU.
    RequireCap(PeerCapFlags),

    /// Require a specific link capability on the path from the consumer
    /// to the service peer.
    RequireLinkCap(LinkCapFlags),

    /// Prefer peers with available capacity for the specified resource.
    PreferCapacity { resource: ResourceType, min_free: u64 },
}

/// Complete affinity specification for a service.
pub struct ServiceAffinity {
    /// Service this affinity applies to.
    pub service_id: ServiceId,
    /// Affinity rules, evaluated in order of strength (highest first).
    pub rules: ArrayVec<AffinityRule, 8>,
    /// Minimum improvement (percentage) required to trigger migration
    /// from current placement. Prevents flapping on noisy measurements.
    /// Default: 20 (20% cost reduction required).
    pub hysteresis_pct: u8,
    /// Whether admin can override automatic placement.
    /// If false, placement is always automatic. If true, admin's
    /// bind_policy=manual takes precedence.
    pub admin_overridable: bool,
}

5.12.2 Affinity in KABI Manifests

Services declare affinity in their KABI IDL files (Section 12.1). The umka-kabi-gen compiler embeds affinity rules into the KabiDriverManifest:

// Example: network stack service KABI manifest
service "net_stack" version 1 {
    affinity {
        co_locate "nic_offload" strength 90    // strong: near the NIC
        co_locate "ipsec_offload" strength 70  // moderate: near crypto
        isolate "batch_compute" strength 50    // moderate: away from batch
        require_cap RDMA_CAPABLE               // must have RDMA
        require_link_cap RDMA                  // path must support RDMA
        hysteresis 20                          // 20% improvement to migrate
    }
}

Compiled into KabiDriverManifest.affinity_rules: ArrayVec<AffinityRule, 8>. Services without an affinity block get no automatic placement — they use the existing bind_policy from Section 5.11.

5.12.3 Placement Algorithm

Three-pass algorithm, runs asynchronously on topology changes (peer join/leave, link weight change, service registration). Never on the data-path hot path.

Pass 1 — FILTER (which peers CAN host this service?):
  candidates = topology.find_peers(origin, [
      RequireCap(affinity.rules.require_cap),
      RequireLinkCap(affinity.rules.require_link_cap),
      MinFreeCapacity(affinity.rules.prefer_capacity),
  ])

  If candidates is empty: service cannot be placed. Log FMA event.
  If candidates has 1 entry: place there (no choice). Skip passes 2-3.

Pass 2 — RANK by positive affinity:
  For each candidate peer P:
    score_positive = 0
    For each CoLocate rule:
      if P hosts the co-located service:
        score_positive += rule.strength
    For each NearTo rule:
      latency = topology.path_cost_ns(P, rule.peer)
      if latency <= rule.max_latency_ns:
        score_positive += 100 - (latency * 100 / rule.max_latency_ns)

Pass 3 — RANK by negative affinity + topology cost:
  For each candidate peer P:
    score_negative = 0
    For each Isolate rule:
      if P hosts the isolated service:
        score_negative += rule.strength
    topology_cost = topology.path_cost_ns(consumer, P)

  Final score = score_positive - score_negative - (topology_cost / 1000)
  Best candidate = highest final score.

Hysteresis check:
  If service is already placed on peer Q:
    improvement = (score(best) - score(Q)) / score(Q) * 100
    If improvement < hysteresis_pct: keep current placement (no migration).

Output: PlacementDecision { service_id, target_peer, score, reason }. The decision is cached in PlacementCache (generation-tagged, same pattern as TopologyQueryCache, Section 5.2).
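The pass 2/3 scoring reduces to a small pure function. A sketch under assumed types: `Candidate`, `Rules`, and `final_score` are illustrative names, with CoLocate/Isolate rules as (service, strength) pairs and the NearTo term omitted for brevity.

```rust
/// A candidate peer: services it already hosts, plus path cost to the consumer.
struct Candidate<'a> {
    hosts: &'a [&'a str],
    topology_cost_ns: i64,
}

/// CoLocate and Isolate rules as (service, strength) pairs.
struct Rules<'a> {
    co_locate: &'a [(&'a str, i64)],
    isolate: &'a [(&'a str, i64)],
}

/// Final score = score_positive - score_negative - topology_cost_ns / 1000,
/// mirroring passes 2-3 above (NearTo scoring elided).
fn final_score(c: &Candidate, r: &Rules) -> f64 {
    let mut positive: i64 = 0;
    for (svc, strength) in r.co_locate {
        if c.hosts.contains(svc) { positive += *strength; }
    }
    let mut negative: i64 = 0;
    for (svc, strength) in r.isolate {
        if c.hosts.contains(svc) { negative += *strength; }
    }
    (positive - negative) as f64 - c.topology_cost_ns as f64 / 1000.0
}
```

Applied to the worked example in Section 5.12.4 (CoLocate strengths 90 and 70, both hosted on the DPU, 200 ns PCIe cost), this yields the DPU score of 159.8 versus 0 for the local host.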

5.12.4 Automatic Offload Example

Cluster: Host A (x86, mlx5 NIC driver) + BlueField DPU (Tier M peer)

1. DPU joins, sends CapAdvertise:
   - ServiceId("nic_offload", 1) — NIC hardware offload
   - ServiceId("ipsec_offload", 1) — IPsec crypto
   - ServiceId("nvmeof_target", 1) — NVMe-oF storage

2. Host A's placement engine evaluates registered affinities:
   - "net_stack" has CoLocate("nic_offload", 90) → DPU hosts nic_offload → score +90
   - "net_stack" has CoLocate("ipsec_offload", 70) → DPU hosts ipsec_offload → score +70
   - "net_stack" is currently on Host A (local) → topology_cost = 0
   - DPU topology_cost = 200ns (PCIe) → -0.2
   - Final: DPU score = 159.8, Host A score = 0
   - Improvement > 20% → migrate net_stack to DPU

3. Placement engine sends ServiceDrainNotify for net_stack on Host A.
4. Rebinds net_stack on DPU via ServiceBind.
5. Network packets now flow through DPU hardware. Host A CPU freed.

No hardcoded "if DPU present, offload networking." The affinity rules in
the net_stack KABI manifest drive the decision automatically.

5.12.5 Re-evaluation Triggers

| Event | Action | Delay |
|---|---|---|
| Peer join | Re-evaluate all services with affinity rules | 1 second (allow capability advertisement to complete) |
| Peer leave | Re-evaluate services that were co-located with the departed peer | Immediate (graceful shutdown already handles drain) |
| Topology edge weight change >20% | Re-evaluate services with NearTo rules | 5 seconds (debounce noisy measurements) |
| Admin bind_policy change | Re-evaluate the affected service | Immediate |
| New service registration with affinity rules | Evaluate placement for the new service | Immediate |

Re-evaluation is rate-limited: at most once per second per service. Multiple triggers within the window are coalesced.
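The rate limiting and coalescing rule can be sketched as a per-service window check. `ReevalLimiter` and `try_trigger` are hypothetical names; the clock is passed in explicitly for clarity, and triggers rejected inside the window are the "coalesced" ones.

```rust
use std::collections::HashMap;

/// Per-service rate limiter: at most one re-evaluation per window.
/// Triggers arriving inside the window are coalesced (dropped here;
/// a real implementation would mark them pending).
struct ReevalLimiter {
    window_ms: u64,
    last_ms: HashMap<String, u64>,
}

impl ReevalLimiter {
    fn new(window_ms: u64) -> Self {
        Self { window_ms, last_ms: HashMap::new() }
    }

    /// Returns true if the trigger should run now; false if coalesced
    /// into the current window for this service.
    fn try_trigger(&mut self, service: &str, now_ms: u64) -> bool {
        let last = self.last_ms.get(service).copied();
        match last {
            Some(t) if now_ms < t + self.window_ms => false,
            _ => {
                self.last_ms.insert(service.to_string(), now_ms);
                true
            }
        }
    }
}
```

Windows are tracked independently per service, so a storm of triggers for one service cannot starve re-evaluation of another.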

5.12.6 Performance Bounds

| Operation | Cost | When |
|---|---|---|
| Placement algorithm (3-pass) | O(P × R) where P = candidate peers, R = rules | On topology change (rare) |
| Typical cluster (10 peers, 5 rules) | <50 μs | |
| Large cluster (100 peers, 8 rules) | <500 μs | |
| Placement cache lookup | O(1) hash | On service binding (rare) |
| Service migration | Same as graceful shutdown (§5.9.3) | When placement changes |

The placement algorithm never runs on the data path. All per-message, per-packet, per-page-fault operations use pre-resolved PeerId and CachedRoute — zero placement overhead on the hot path.

5.12.7 Relationship to Existing Mechanisms

| Mechanism | Scope | Affinity Integration |
|---|---|---|
| bind_policy (§5.11) | Per-peer, admin-controlled | Affinity is the automatic mode; bind_policy=manual overrides |
| Intent-based management (§7.7) | Per-cgroup workload intent | intent.efficiency high → prefer DPU offload (maps to CoLocate with offload services) |
| ML policy (§23.1) | Predictive resource allocation | Can observe placement decisions and recommend affinity rule adjustments |
| Topology reasoning (§5.2.9.8) | Constraint-based peer queries | Affinity algorithm uses find_peers() and suggest_placement() |

5.12.8 Small Cluster Optimization

On clusters with ≤3 peers, the placement algorithm is skipped — there is insufficient topology diversity to benefit. Services use the static bind_policy from Section 5.11. The threshold is configurable: cluster.affinity_min_peers (default: 4).