Chapter 6: Distributed Shared Memory¶
Page-granularity distributed shared memory over RDMA for workloads that benefit from shared-memory semantics across nodes. Uses a MOESI-derived coherence protocol with home-node directory tracking. This is an optional subsystem — the cluster foundation in Chapter 5 (Section 5.1) operates without it.
6.1 DSM Foundational Types¶
The following types are used throughout the DSM subsystem and are defined here as prerequisites.
6.1.1 DsmMsgType Range Allocation¶
All DSM message type codes (DsmMsgType enum values and impl extensions) are
allocated from the following ranges. When adding new message types, allocate from
the correct range to avoid collisions.
| Range | Subsystem | Source file |
|---|---|---|
| 0x0001-0x0007 | MOESI requestor→home | dsm-coherence-protocol-moesi.md |
| 0x0010-0x0013 | MOESI home→requestor | dsm-coherence-protocol-moesi.md |
| 0x0020-0x0023 | MOESI forwarding + invalidation + InvAck | dsm-coherence-protocol-moesi.md |
| 0x0030 | MOESI owner→requestor data forward | dsm-coherence-protocol-moesi.md |
| 0x0040-0x0043 | Subscriber control (reserved, local-only) | dsm-coherence-protocol-moesi.md |
| 0x0050-0x0051 | Write-update protocol | dsm-coherence-protocol-moesi.md |
| 0x0060-0x0062 | Causal consistency | dsm-coherence-protocol-moesi.md |
| 0x0070-0x0072 | Anti-entropy | dsm-anti-entropy-protocol.md |
| 0x0080-0x008F | Cooperative cache probes (impl extension) | dsm-distributed-page-cache.md |
| 0x0090-0x009F | Distributed futex | dsm-coherence-protocol-moesi.md |
| 0x0300-0x0321 | Region management (via PeerMessageType) | dsm-region-management.md |
| 0x0330-0x0332 | Directory reconstruction (via PeerMessageType) | dsm-region-management.md |
6.1.2 Wire Format Integer Types¶
All multi-byte integers in wire-format structures (peer protocol messages, RDMA-visible shared memory, on-disk WAL entries) use explicitly little-endian wrapper types. This prevents silent data corruption when a cluster contains mixed-endian nodes (PPC32 and s390x are big-endian; the other six supported architectures are little-endian).
Policy: Every #[repr(C)] struct that crosses a node boundary — whether via RDMA
write, peer message send, or persistent WAL — MUST use Le16/Le32/Le64 for all
integer fields. Native u16/u32/u64 types are reserved for kernel-internal
(non-wire) structures. Enum discriminants in wire structs are transmitted as Le32
with explicit from_le32() / to_le32() conversion methods on the enum type.
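As an illustration of this policy, a minimal sketch of a wire-format header follows. The struct and field names are invented for the example (only `Le32` itself comes from the definitions below, restated here so the sketch is self-contained):

```rust
// Minimal restatement of the Le32 wrapper defined below, so this
// sketch compiles on its own; Le16/Le64 are analogous.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct Le32([u8; 4]);
impl Le32 {
    pub const fn from_ne(v: u32) -> Self { Self(v.to_le_bytes()) }
    pub const fn to_ne(self) -> u32 { u32::from_le_bytes(self.0) }
}

/// Hypothetical wire-format header obeying the policy: every integer
/// field is a Le* wrapper, and the enum discriminant travels as Le32.
#[repr(C)]
pub struct ExampleMsgHeader {
    pub msg_type: Le32, // DsmMsgType discriminant (via from_le32()/to_le32())
    pub length: Le32,   // payload length in bytes
}
```

Because `Le32` wraps `[u8; 4]`, the header has alignment 1 and no padding, so its in-memory image is exactly its wire image on every architecture.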
/// Little-endian 16-bit integer for wire format.
///
/// Stored as `[u8; 2]` to prevent accidental arithmetic on wire values and to
/// guarantee correct alignment on all architectures (alignment = 1). Matches
/// the semantics of Linux kernel's `__le16`.
///
/// To perform arithmetic, convert to native first: `val.to_ne()`.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Le16([u8; 2]);
impl Le16 {
/// Zero value (all bytes 0x00).
pub const ZERO: Self = Self([0; 2]);
/// Convert from native-endian u16 to wire (little-endian) representation.
#[inline(always)]
pub const fn from_ne(v: u16) -> Self {
Self(v.to_le_bytes())
}
/// Convert from wire (little-endian) representation to native-endian u16.
#[inline(always)]
pub const fn to_ne(self) -> u16 {
u16::from_le_bytes(self.0)
}
}
/// Little-endian 32-bit integer for wire format.
///
/// Stored as `[u8; 4]`. See `Le16` for design rationale. Matches Linux's `__le32`.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Le32([u8; 4]);
impl Le32 {
pub const ZERO: Self = Self([0; 4]);
#[inline(always)]
pub const fn from_ne(v: u32) -> Self {
Self(v.to_le_bytes())
}
#[inline(always)]
pub const fn to_ne(self) -> u32 {
u32::from_le_bytes(self.0)
}
}
/// Little-endian 64-bit integer for wire format.
///
/// Stored as `[u8; 8]`. See `Le16` for design rationale. Matches Linux's `__le64`.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Le64([u8; 8]);
impl Le64 {
pub const ZERO: Self = Self([0; 8]);
#[inline(always)]
pub const fn from_ne(v: u64) -> Self {
Self(v.to_le_bytes())
}
#[inline(always)]
pub const fn to_ne(self) -> u64 {
u64::from_le_bytes(self.0)
}
}
/// Big-endian 16-bit integer for on-disk formats (JBD2, XFS metadata).
///
/// Stored as `[u8; 2]` to prevent accidental arithmetic on disk values and to
/// guarantee correct alignment on all architectures (alignment = 1). Matches
/// the semantics of Linux kernel's `__be16`.
///
/// JBD2 (ext3/ext4 journal) and XFS use big-endian on-disk formats due to
/// historical design choices (ext3 originated on big-endian SPARC/PA-RISC,
/// XFS on big-endian MIPS/IRIX). Unlike wire-format types (Le*), which are
/// used for inter-node communication, Be* types are used exclusively for
/// persistent on-disk structures that must match Linux's on-disk formats
/// bit-for-bit.
///
/// To perform arithmetic, convert to native first: `val.to_ne()`.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Be16([u8; 2]);
impl Be16 {
/// Zero value (all bytes 0x00).
pub const ZERO: Self = Self([0; 2]);
/// Convert from native-endian u16 to disk (big-endian) representation.
#[inline(always)]
pub const fn from_ne(v: u16) -> Self {
Self(v.to_be_bytes())
}
/// Convert from disk (big-endian) representation to native-endian u16.
#[inline(always)]
pub const fn to_ne(self) -> u16 {
u16::from_be_bytes(self.0)
}
}
/// Big-endian 32-bit integer for on-disk formats (JBD2, XFS metadata).
///
/// Stored as `[u8; 4]`. See `Be16` for design rationale. Matches Linux's `__be32`.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Be32([u8; 4]);
impl Be32 {
pub const ZERO: Self = Self([0; 4]);
#[inline(always)]
pub const fn from_ne(v: u32) -> Self {
Self(v.to_be_bytes())
}
#[inline(always)]
pub const fn to_ne(self) -> u32 {
u32::from_be_bytes(self.0)
}
}
/// Big-endian 64-bit integer for on-disk formats (JBD2, XFS metadata).
///
/// Stored as `[u8; 8]`. See `Be16` for design rationale. Matches Linux's `__be64`.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Be64([u8; 8]);
impl Be64 {
pub const ZERO: Self = Self([0; 8]);
#[inline(always)]
pub const fn from_ne(v: u64) -> Self {
Self(v.to_be_bytes())
}
#[inline(always)]
pub const fn to_ne(self) -> u64 {
u64::from_be_bytes(self.0)
}
}
/// Little-endian atomic 32-bit integer for MMIO-mapped and RDMA-visible fields.
///
/// Used in structures that are written by one node and read by another via
/// RDMA read or MMIO-mapped shared memory (e.g., `PeerControlRegs`,
/// `RdmaRingHeader`). The underlying representation is `AtomicU32` storing
/// a little-endian value — the caller must convert to/from native endianness
/// at every access point.
///
/// Unlike `Le32`, this type supports atomic load/store/CAS operations required
/// for lock-free RDMA ring buffer protocols.
///
/// Memory ordering: the caller specifies ordering per operation. This type
/// does not impose a default ordering — each call site must choose appropriately.
#[repr(transparent)]
pub struct LeAtomicU32(AtomicU32);
impl LeAtomicU32 {
/// Create a new LeAtomicU32 with the given native-endian value,
/// stored in little-endian format.
pub const fn new(v: u32) -> Self {
Self(AtomicU32::new(v.to_le()))
}
/// Atomic load. Returns the value converted to native endianness.
#[inline(always)]
pub fn load(&self, order: Ordering) -> u32 {
u32::from_le(self.0.load(order))
}
/// Atomic store. Converts the native-endian value to little-endian before storing.
#[inline(always)]
pub fn store(&self, val: u32, order: Ordering) {
self.0.store(val.to_le(), order);
}
/// Atomic compare-and-swap. Both `current` and `new` are native-endian;
/// they are converted to little-endian for the CAS operation.
/// Returns the previous value (native-endian) and success flag.
#[inline(always)]
pub fn compare_exchange(
&self, current: u32, new: u32,
success: Ordering, failure: Ordering,
) -> Result<u32, u32> {
self.0
.compare_exchange(current.to_le(), new.to_le(), success, failure)
.map(|v| u32::from_le(v))
.map_err(|v| u32::from_le(v))
}
}
/// Little-endian atomic 64-bit integer for MMIO-mapped and RDMA-visible fields.
///
/// 64-bit counterpart of `LeAtomicU32`. See its documentation for design rationale.
#[repr(transparent)]
pub struct LeAtomicU64(AtomicU64);
impl LeAtomicU64 {
pub const fn new(v: u64) -> Self {
Self(AtomicU64::new(v.to_le()))
}
#[inline(always)]
pub fn load(&self, order: Ordering) -> u64 {
u64::from_le(self.0.load(order))
}
#[inline(always)]
pub fn store(&self, val: u64, order: Ordering) {
self.0.store(val.to_le(), order);
}
#[inline(always)]
pub fn compare_exchange(
&self, current: u64, new: u64,
success: Ordering, failure: Ordering,
) -> Result<u64, u64> {
self.0
.compare_exchange(current.to_le(), new.to_le(), success, failure)
.map(|v| u64::from_le(v))
.map_err(|v| u64::from_le(v))
}
/// Atomic fetch-add. `val` is native-endian. Only valid when the stored
/// value represents a monotonic counter — NOT for arbitrary bitfields.
///
/// Implementation: load-LE → from_le → add → to_le → CAS loop. This is
/// NOT a single hardware fetch_add because the byte-swap prevents using
/// the native atomic add instruction on big-endian machines. For fields
/// that need true atomic fetch_add performance on big-endian, use a
/// native `AtomicU64` with explicit endian conversion at the wire boundary.
#[inline]
pub fn fetch_add(&self, val: u64, order: Ordering) -> u64 {
let mut current = self.0.load(Ordering::Relaxed);
loop {
let native = u64::from_le(current);
let new_le = (native.wrapping_add(val)).to_le();
match self.0.compare_exchange_weak(
current, new_le, order, Ordering::Relaxed,
) {
Ok(_) => return native,
Err(actual) => current = actual,
}
}
}
}
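To show how call sites choose orderings per the rule above, here is a sketch of a producer publishing a ring index. The `RingHeader` struct, `produce()` helper, and the restated minimal `LeAtomicU32` are illustrative, not part of the spec:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

// Minimal restatement of LeAtomicU32 (full definition above) so this
// sketch is self-contained.
#[repr(transparent)]
pub struct LeAtomicU32(AtomicU32);
impl LeAtomicU32 {
    pub const fn new(v: u32) -> Self { Self(AtomicU32::new(v.to_le())) }
    pub fn load(&self, o: Ordering) -> u32 { u32::from_le(self.0.load(o)) }
    pub fn store(&self, v: u32, o: Ordering) { self.0.store(v.to_le(), o) }
}

/// Hypothetical ring header: the producer publishes `head` with Release
/// after filling the slot; a consumer pairs this with an Acquire load
/// before reading the slot contents.
pub struct RingHeader {
    pub head: LeAtomicU32,
}

pub fn produce(ring: &RingHeader) -> u32 {
    let next = ring.head.load(Ordering::Relaxed).wrapping_add(1);
    ring.head.store(next, Ordering::Release); // publish slot to the reader
    next
}
```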
RegionBitmap and Le types:
`RegionBitmap.words` remains `[u64; W]` — NOT `[Le64; W]`. Bitmap words are transmitted as raw little-endian bytes over RDMA (the wire format is the memory image). On big-endian nodes, the DSM subsystem byte-swaps each word at the RDMA send/receive boundary using the bulk `bitmap_to_wire()` / `bitmap_from_wire()` helpers, rather than paying the per-bit-operation overhead of `Le64` wrappers. This is the correct trade-off: bitmap operations (set/clear/test/or/and) are hot-path and must run at native speed; RDMA send/receive of bitmaps is warm-path (per-invalidation, not per-bit-op).
/// Convert a native-endian bitmap to wire format (little-endian).
/// Each u64 word is individually byte-swapped. Called at the RDMA
/// send boundary for RegionBitmap fields. On little-endian architectures
/// (x86_64, aarch64, riscv64, loongarch64), this is a no-op memcpy
/// (compiler elides the identity byte-swap).
pub fn bitmap_to_wire(native: &[u64], wire: &mut [Le64]) {
assert_eq!(native.len(), wire.len());
for (n, w) in native.iter().zip(wire.iter_mut()) {
*w = Le64::from_ne(*n);
}
}
/// Convert a wire-format bitmap to native-endian.
/// Called at the RDMA receive boundary. Same no-op optimization
/// on little-endian architectures.
pub fn bitmap_from_wire(wire: &[Le64], native: &mut [u64]) {
assert_eq!(wire.len(), native.len());
for (w, n) in wire.iter().zip(native.iter_mut()) {
*n = w.to_ne();
}
}
Region Slot Index¶
/// Dense slot index within a DSM region. Assigned when a peer joins a region
/// via the slot allocation protocol (below). Slots are unique within a region
/// and index into per-page `RegionBitmap` fields and vector clock arrays.
///
/// u16 supports up to 65,535 slots; hard cap is MAX_REGION_PARTICIPANTS (1024).
/// Slot 0 is valid. SLOT_INVALID is a sentinel for "no slot assigned."
pub type RegionSlotIndex = u16;
/// Sentinel value: peer has no slot in this region.
pub const SLOT_INVALID: RegionSlotIndex = u16::MAX;
DsmRegionHandle¶
/// Opaque handle to a DSM region. Wraps the cluster-wide `region_id` (u64)
/// and provides the API surface for DSM operations: fence open/close, cache
/// policy configuration, lock binding, and application-visible mapping.
///
/// Obtained from `dsm_region_create()` or `dsm_region_join()`. The handle
/// is valid until the region is destroyed or the local node leaves.
/// Not `Send` — region handles are pinned to the creating context. (A bare
/// u64 wrapper is auto-`Send`, so the implementation must opt out
/// explicitly, e.g. with a `PhantomData<*const ()>` marker field.) Use
/// `DsmRegionInfo` ([Section 6.14](#dsm-application-visible)) for cross-thread queries.
///
/// The inner `region_id` is a u64 (not u32) to support the 50-year uptime
/// requirement: at one region creation per microsecond, u64 exhausts in
/// ~584,942 years.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct DsmRegionHandle(u64);
impl DsmRegionHandle {
/// Create a handle from a region ID. Called by the DSM subsystem after
/// successful region creation or join — not a public constructor.
pub(crate) const fn from_region_id(id: u64) -> Self { Self(id) }
/// Extract the underlying region ID.
pub const fn region_id(&self) -> u64 { self.0 }
}
Region Bitmap¶
/// Fixed-size bitmap for per-page peer tracking within a DSM region.
///
/// Word count W = ceil(max_participants / 64), determined at region creation.
/// Within a region, every `RegionBitmap` instance has the same W, making it
/// safe for seqlock-based memcpy (fixed-size, no heap pointers).
///
/// All per-page structures in the same region are allocated from a slab pool
/// whose slot size matches the region's bitmap word count. Different regions
/// may have different W values.
///
/// Memory layout: `#[repr(C)]`, inline array of u64. No pointers, no heap.
/// Safe to transmit over RDMA (wire format is the raw u64 array, little-endian).
///
/// Invariant: bits above `max_participants` in the last word are always zero.
/// Enforced at set-time, not checked on read (hot path).
///
/// **Implementation note — slab pool strategy:**
///
/// RegionBitmap's W varies per region but is constant within a region. Every
/// struct that embeds a RegionBitmap (DsmDirectoryEntry via its inner
/// DsmDirEntry) must be allocated from a slab pool whose slot size matches
/// the containing struct's total size for that region's W value.
/// DsmPageMeta does not embed a RegionBitmap — its size is constant (16 bytes)
/// regardless of W — but it is still allocated from a per-region slab pool
/// for cache locality with the directory entries it accompanies.
///
/// The kernel maintains a global table of **5 bitmap slab pool classes**
/// (one per valid W value: 1, 2, 4, 8, 16). At region creation, each
/// per-page structure type gets a slab pool from the class matching the
/// region's W:
///
/// ```
/// // Field-by-field derivation:
/// //
/// // DsmDirEntry (inner, contains RegionBitmap):
/// // state(1) + _pad1(1) + owner_slot(2) + _pad2(4) + owner(8, Option<PeerId>) + sharers(W*8) + lock(4) + _pad3(4) = 24 + W*8
/// //
/// // DsmDirectoryEntry (outer wrapper, allocated from slab):
/// // sequence(8) + inner(24+W*8) + version(8) + rehash_epoch(4) + _pad2(4) + wait_queue(8) + entry_lock(4) + _pad3(4) = 64 + W*8
/// //
/// // DsmPageMeta (no bitmap — constant size):
/// // last_writer(2) + _pad(6) + last_transition_ns(8) = 16
/// // (home_slot moved to DsmPage; local_state replaced by DsmPage.state_dirty)
/// //
/// static BITMAP_SLAB_CLASSES: [SlabClass; 5] = [
/// SlabClass { w: 1, bitmap_bytes: 8, dir_entry_size: 72, page_meta_size: 16 },
/// SlabClass { w: 2, bitmap_bytes: 16, dir_entry_size: 80, page_meta_size: 16 },
/// SlabClass { w: 4, bitmap_bytes: 32, dir_entry_size: 96, page_meta_size: 16 },
/// SlabClass { w: 8, bitmap_bytes: 64, dir_entry_size: 128, page_meta_size: 16 },
/// SlabClass { w: 16, bitmap_bytes: 128, dir_entry_size: 192, page_meta_size: 16 },
/// ];
/// ```
///
/// The Rust implementation uses a type-erased slab allocator — all bitmap
/// operations take `&[u64]` slices with runtime-checked length, not generic
/// `[u64; W]` const-generic arrays. This avoids monomorphization explosion
/// (5 copies of every DSM function) while keeping the hot-path cost at one
/// bounds check (optimized away when the compiler can prove W is constant
/// within a region's code path).
///
/// Alternative: const-generic `RegionBitmap<const W: usize>` with 5
/// monomorphized specializations. Benchmarks should compare both approaches;
/// the type-erased version is simpler and the hot path (single-bit test)
/// is identical in both.
///
/// Operations (all O(1) for single-bit ops, O(W) for aggregate):
///   set(slot), clear(slot), test(slot) — single-bit, O(1)
///   count() — popcount across W words
///   or(other), and(other), andnot(other) — word-wise, O(W), vectorizable
///   iter_set() — yields each set slot index
///   clear_all(), is_empty() — O(W)
///
/// Kernel-internal, not KABI: variable-width (W fixed at region creation),
/// never sent on the wire directly — converted at the RDMA boundary via
/// `bitmap_to_wire()` / `bitmap_from_wire()`.
#[repr(C)]
pub struct RegionBitmap {
/// Bitmap words. Length W = ceil(max_participants / 64).
/// Inline array; size is fixed per region. `W` is symbolic here to
/// illustrate the layout — the type-erased implementation stores the
/// words inline in a slab slot and hands out `&[u64]` slices of
/// runtime-checked length (see the implementation note above).
words: [u64; W],
}
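The type-erased single-bit and aggregate operations described above can be sketched as free functions over `&[u64]` slices (helper names are illustrative; the slice index is the runtime-checked bound):

```rust
/// Set the bit for `slot`. The slice index is the runtime bounds check
/// the design note mentions; panics if slot >= words.len() * 64.
pub fn bitmap_set(words: &mut [u64], slot: u16) {
    words[slot as usize / 64] |= 1u64 << (slot % 64);
}

/// Single-bit test — the hot-path operation, identical cost to the
/// const-generic alternative.
pub fn bitmap_test(words: &[u64], slot: u16) -> bool {
    words[slot as usize / 64] & (1u64 << (slot % 64)) != 0
}

/// Popcount across all W words — an O(W) aggregate op.
pub fn bitmap_count(words: &[u64]) -> u32 {
    words.iter().map(|w| w.count_ones()).sum()
}
```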
Per-page memory overhead (1 bitmap per page: sharers):
| max_participants | W (words) | Bitmap bytes |
|---|---|---|
| ≤ 64 | 1 | 8 bytes |
| ≤ 128 | 2 | 16 bytes |
| ≤ 256 | 4 | 32 bytes |
| ≤ 512 | 8 | 64 bytes |
| ≤ 1024 | 16 | 128 bytes |
For regions with ≤ 64 participants, the bitmap is identical to the old u64
bitmask — no regression in overhead. For 256 participants (the recommended
default), per-page bitmap overhead is 32 bytes.
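The table's W derivation, plus rounding up to the nearest of the five slab classes, can be sketched as follows (helper names are assumptions, not spec API):

```rust
/// W = ceil(max_participants / 64), as in the table above.
pub fn bitmap_words(max_participants: u16) -> usize {
    (max_participants as usize + 63) / 64
}

/// Round up to the nearest bitmap slab class width {1, 2, 4, 8, 16}.
pub fn slab_class_w(max_participants: u16) -> usize {
    let w = bitmap_words(max_participants);
    [1, 2, 4, 8, 16]
        .into_iter()
        .find(|&c| c >= w)
        .expect("max_participants exceeds MAX_REGION_PARTICIPANTS (1024)")
}
```

Note that participant counts between class boundaries (say, 200) round up to the next class (W = 4, 32 bytes per bitmap), which is the slab slot actually allocated.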
Serialization for evolution: RegionBitmap has runtime-determined width W.
For state export during live kernel evolution, each bitmap is serialized as:
/// On-wire / export format for RegionBitmap (variable-width).
/// Wire struct — fixed 8-byte header followed by variable-length Le64 array.
#[repr(C)]
pub struct RegionBitmapExport {
/// Number of u64 words in this bitmap (= W).
pub width_words: Le32,
/// Padding for alignment.
pub _pad: Le32,
/// Bitmap data in little-endian u64 words. Length = width_words.
/// The full region's bitmaps are exported contiguously:
/// [sharers_bitmap] per page.
///
/// C99-style flexible array member. In Rust, declared as `[Le64; 0]`
/// and accessed via unsafe pointer arithmetic:
/// ```
/// // SAFETY: caller ensures allocation is >= 8 + width_words * 8 bytes.
/// let words = unsafe {
/// core::slice::from_raw_parts(
/// self.words.as_ptr(),
/// self.width_words.to_ne() as usize,
/// )
/// };
/// ```
pub words: [Le64; 0],
}
// Wire format fixed header: width_words(4) + _pad(4) = 8 bytes.
// Variable payload: width_words × 8 bytes follows.
// size_of covers the fixed header only; the `[Le64; 0]` flexible array
// member contributes nothing.
const_assert!(core::mem::size_of::<RegionBitmapExport>() == 8);
const_assert!(core::mem::offset_of!(RegionBitmapExport, words) == 8);
During evolution, all active DSM regions export their per-page bitmaps using
this format. Import reconstructs RegionBitmap from width_words + data.
If the new component has a different MAX_REGION_PARTICIPANTS, bitmaps are
zero-extended (new W > old W) or truncated with a check that no set bits are
lost (new W < old W → abort evolution if any participant beyond new max exists).
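The zero-extend/truncate rule can be sketched as follows (the function name and error string are illustrative; the real import path aborts the evolution on lost bits):

```rust
/// Adapt a bitmap from its exported width to the new component's width.
/// Zero-extends when growing; refuses to truncate if any set bit (i.e.,
/// any participant beyond the new maximum) would be lost.
pub fn resize_bitmap(old: &[u64], new_w: usize) -> Result<Vec<u64>, &'static str> {
    let mut out = vec![0u64; new_w];
    if new_w >= old.len() {
        out[..old.len()].copy_from_slice(old); // zero-extend
    } else {
        // Truncation: every dropped word must be all-zero.
        if old[new_w..].iter().any(|&w| w != 0) {
            return Err("participant beyond new max — abort evolution");
        }
        out.copy_from_slice(&old[..new_w]);
    }
    Ok(out)
}
```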
/// Maximum participants per DSM region (hard cap).
/// Beyond 1024, seqlock copy size degrades (>260 bytes per directory entry)
/// and per-page metadata exceeds 128 bytes for the sharers bitmap.
pub const MAX_REGION_PARTICIPANTS: u16 = 1024;
/// Recommended default for most workloads.
pub const DEFAULT_REGION_PARTICIPANTS: u16 = 256;
Region Slot Map¶
/// Bidirectional mapping between PeerId and RegionSlotIndex within a region.
/// One instance per DSM region, shared by all participating peers.
///
/// Concurrency: RCU-protected for reads (page fault path queries slot index
/// for a peer — lock-free). SpinLock for writes (peer join/leave — rare).
///
/// Invariant: the mapping is bijective — each PeerId maps to exactly one
/// slot, and each active slot maps to exactly one PeerId. Departed peers
/// have their slot tombstoned (slot_to_peer[slot] = None) but the slot
/// number is NOT reused within a region's lifetime to prevent ABA problems
/// in bitmap interpretation.
pub struct RegionSlotMap {
/// Forward map: PeerId → slot index.
/// XArray keyed by PeerId (u64) — O(1) lookup.
peer_to_slot: XArray<RegionSlotIndex>,
/// Reverse map: slot index → PeerId. Dense array indexed by slot.
/// Length = next_slot (all indices 0..next_slot are either active or
/// tombstoned). Bounded by `max_participants` (max `MAX_REGION_PARTICIPANTS` = 1024).
slot_to_peer: Vec<Option<PeerId>>,
/// Next slot to allocate. Monotonically increasing within a region's
/// lifetime. Never decremented (slots are not reused).
next_slot: RegionSlotIndex,
/// Region's configured maximum participants.
max_participants: u16,
/// Per-slot last-seen timestamp (monotonic nanoseconds). Updated on
/// every coherence message received from the peer (not just heartbeats —
/// any message proves liveness: GetS, GetM, PutM, Inv, InvAck, etc.).
///
/// Used for partition duration check on `RegionJoinRequest`: the
/// coordinator compares `now_monotonic - last_heartbeat_ns[slot]` against
/// `DsmCachePolicy.max_partition_duration_secs` to detect stale partitions.
/// If a peer has been partitioned longer than the configured maximum,
/// the join request is rejected and the peer must perform a full resync
/// via the anti-entropy protocol ([Section 6.13](#dsm-anti-entropy-protocol)).
///
/// Indexed by slot (dense array, same length as `slot_to_peer`).
/// Atomic for lock-free updates from coherence message handlers.
last_heartbeat_ns: Vec<AtomicU64>,
}
Size: per-region (not per-page). ~16 KB worst case for 1024 participants.
Hot-path rule: no slot map lookup on the data path. Although the
RegionSlotMap now uses XArray (O(1) lookup, ~10-30 ns), the page fault
critical path still avoids even this cost by caching the RegionSlotIndex
alongside the PeerId in hot-path structures:
| Structure | Cached field | Avoids |
|---|---|---|
| `DsmDirectoryEntry` (via `inner: DsmDirEntry`) | `owner_slot: RegionSlotIndex` | Lookup when processing GetS/GetM at home |
| `DsmPage` | `home_slot: RegionSlotIndex` | Lookup when sending coherence messages to home |
| `DsmPageMeta` (via `DsmPage.meta`) | `last_writer: RegionSlotIndex` | Lookup when checking causal ordering |
/// Per-page metadata on the local (requestor) node. One per DSM-mapped page.
///
/// **Relationship to `DsmPage`**: `DsmPageMeta` is **embedded inside** `DsmPage`
/// via the `meta` field (see [Section 6.11](#dsm-distributed-page-cache--dsm-page-eviction)).
/// They are a SINGLE allocation from the DSM slab cache. `DsmPage` contains
/// the page frame reference, the packed atomic coherence state (`state_dirty`),
/// and this metadata struct. Access the meta via `dsm_page.meta`.
///
/// **Coherence state**: The authoritative coherence state is stored in
/// `DsmPage.state_dirty` (AtomicU16, packed state + dirty flag). There is
/// no `local_state` field -- it was removed when the coherence state was
/// unified into the atomic `state_dirty` on `DsmPage`. Similarly, `home_slot`
/// was moved to `DsmPage` to avoid duplication. This struct retains only
/// `last_writer` and `last_transition_ns` bookkeeping that `state_dirty`
/// does not carry.
///
/// This struct exists on every node that has a cached copy of a DSM page.
/// It is NOT sent over the wire -- it is local bookkeeping only.
#[repr(C)]
pub struct DsmPageMeta {
/// Last peer that wrote to this page (used for causal ordering checks
/// in DSM_CAUSAL regions, [Section 6.6](#dsm-coherence-protocol-moesi--causal-consistency-protocol-dsmcausal)).
pub last_writer: RegionSlotIndex,
/// Explicit padding for C-layout alignment of last_transition_ns.
_pad: [u8; 6],
/// Timestamp of last state transition (monotonic nanoseconds from
/// `arch::current::time::monotonic_ns()`). Used for stale page detection:
/// pages in SharedReader state with no access for > region.stale_threshold_ns
/// are candidates for silent eviction (PutS) to reduce memory pressure.
pub last_transition_ns: u64,
}
// DsmPageMeta: last_writer(2, RegionSlotIndex/u16) + _pad(6) +
// last_transition_ns(8, u64) = 16 bytes.
const_assert!(core::mem::size_of::<DsmPageMeta>() == 16);
The RegionSlotMap is only queried on the control path: peer join/leave,
slot map replication, and the initial setup when a peer first participates in
a region. These are rare operations (seconds-to-hours apart).
Invariant: any code path that has a PeerId and needs a RegionSlotIndex
on the data path MUST use a cached slot field from a nearby structure. If no
cached slot is available, the code is on the wrong path — it should be using
the control path instead. This invariant is enforced by not exposing
RegionSlotMap::peer_to_slot() as a public method; only
RegionSlotMap::lookup_control_path() (which logs a warning if called
more than 100 times/sec per region) is public.
Slot Allocation Protocol¶
Peer P joins DSM region R:
1. P sends REGION_JOIN(region_id, peer_id) to the region's creator node
(or any peer holding the region's slot map — the slot map is replicated
to all participating peers via the membership protocol).
2. Receiving peer acquires RegionSlotMap write lock.
3. Check: next_slot < max_participants.
If not: return ENOSPC (region is full).
4. Check: peer_to_slot does not contain peer_id.
If it does: return EALREADY (peer already joined).
5. Assign: slot = next_slot; next_slot += 1.
Insert into both maps. Publish new slot map via RCU.
6. Release write lock.
7. Broadcast SLOT_ASSIGNED(region_id, peer_id, slot) to all participating
peers. Each peer updates its local copy of the slot map.
Peer P leaves DSM region R:
1. P sends REGION_LEAVE(region_id, peer_id) to the slot map holder.
2. Receiving peer acquires write lock.
3. Look up slot = peer_to_slot[peer_id].
Remove from peer_to_slot. Set slot_to_peer[slot] = None (tombstone).
The slot is NOT reused (next_slot is not decremented).
4. Release write lock. Publish via RCU.
5. Broadcast SLOT_REMOVED(region_id, peer_id, slot) to all peers.
6. Background sweep: clear bit `slot` in all page `sharers` bitmaps
for this region. Until the sweep completes, the stale bit is
harmless: invalidation messages to a departed peer are silently dropped
(peer is no longer in the registry).
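The join and leave steps above can be sketched with a simplified slot map (a `HashMap` stands in for the RCU-protected XArray, and error strings stand in for errno values):

```rust
use std::collections::HashMap;

pub struct SlotMap {
    peer_to_slot: HashMap<u64, u16>, // forward map (XArray in the kernel)
    slot_to_peer: Vec<Option<u64>>,  // dense reverse map; None = tombstone
    next_slot: u16,                  // monotonic, never decremented
    max_participants: u16,
}

impl SlotMap {
    pub fn new(max_participants: u16) -> Self {
        Self {
            peer_to_slot: HashMap::new(),
            slot_to_peer: Vec::new(),
            next_slot: 0,
            max_participants,
        }
    }

    /// Steps 3-5 of the join protocol.
    pub fn join(&mut self, peer_id: u64) -> Result<u16, &'static str> {
        if self.next_slot >= self.max_participants {
            return Err("ENOSPC"); // region is full
        }
        if self.peer_to_slot.contains_key(&peer_id) {
            return Err("EALREADY"); // peer already joined
        }
        let slot = self.next_slot;
        self.next_slot += 1; // slots are never reused
        self.peer_to_slot.insert(peer_id, slot);
        self.slot_to_peer.push(Some(peer_id));
        Ok(slot)
    }

    /// Step 3 of the leave protocol: tombstone; never decrement next_slot.
    pub fn leave(&mut self, peer_id: u64) -> Option<u16> {
        let slot = self.peer_to_slot.remove(&peer_id)?;
        self.slot_to_peer[slot as usize] = None;
        Some(slot)
    }
}
```

Note how a leave does not free capacity: once `next_slot` reaches `max_participants`, further joins fail until compaction, which is exactly the exhaustion condition described next.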
Slot exhaustion (all max_participants slots consumed, many tombstoned):
Triggered when: next_slot == max_participants AND a new peer tries to join.
Also triggered proactively when: next_slot > max_participants * 3/4 AND
live_peers < next_slot / 2 (more than half the slots are tombstoned and
we're running low on capacity).
**Coordination with directory rehash**: Slot compaction MUST NOT overlap with
directory rehash ([Section 6.4](#dsm-home-node-directory)) for the same region. Sequence:
rehash first (update modulus), then compaction (reclaim slots under new modulus).
The region coordinator holds the slot map write lock for both operations
sequentially, preventing concurrent modification of peer indexing.
Compaction protocol (wire format: [Section 6.8](#dsm-region-management--region-management-wire-messages)):
1. Coordinator (region creator or Raft leader) acquires slot map write lock.
2. Build new slot map: assign live peers to dense slots 0..live_count-1.
3. Broadcast SLOT_COMPACTION(region_id, old_map, new_map) to all participants.
4. Each participant reaches a **DSM quiescent point** before remapping:
a. Upon receiving SLOT_COMPACTION, immediately stop initiating new
coherence requests (GetS, GetM, Upgrade) for this region. Continue
responding to in-flight requests from other peers (InvAck, DataFwd).
Drain all pending coherence messages: complete pending GetS/GetM
requests, deliver all queued InvAck/DataFwd responses, flush any
write-update diffs waiting at a release point. **Drain timeout: 2
seconds.** If the drain does not complete within 2 seconds, the
participant NACKs all remaining pending requests (the requesters
will retry after compaction) and proceeds to step 4b.
(Inspired by Kerrighed's safe-point extraction at ret_from_schedule —
compaction must not race with half-completed state transitions.)
a2. **Message cutoff**: After the drain completes (or times out), the
participant enters the paused state (step 4b). All new incoming
coherence messages for this region are buffered in the compaction
message queue (replayed at step 4f). This prevents new messages
from extending the drain.
b. Pauses DSM coherence traffic for this region (buffers incoming messages).
c. Walks its local page `sharers` bitmaps and remaps
bits from old slots to new slots. Cost: O(pages × W) — but W is
constant and the walk is sequential (cache-friendly).
d. Swaps to the new slot map (RCU publish).
e. Sends COMPACTION_ACK to coordinator.
f. Resumes coherence traffic, replaying buffered messages with new slots.
5. Coordinator waits for all ACKs (timeout: 5s). On timeout, unresponsive
peers are presumed dead (standard membership failure path).
6. Coordinator releases write lock. next_slot = live_count.
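Step 4c's bitmap remap can be sketched as below; the helper name and the old-slot→new-slot table representation are assumptions for illustration:

```rust
/// Sentinel from the slot index definitions above.
pub const SLOT_INVALID: u16 = u16::MAX;

/// Rewrite one sharers bitmap under the compaction mapping.
/// `old_to_new[old_slot]` gives the peer's new dense slot, or
/// SLOT_INVALID for a tombstoned (departed) peer whose bit is dropped.
pub fn remap_bitmap(old: &[u64], old_to_new: &[u16], new: &mut [u64]) {
    new.iter_mut().for_each(|w| *w = 0);
    for (old_slot, &new_slot) in old_to_new.iter().enumerate() {
        if new_slot == SLOT_INVALID {
            continue; // departed peer: bit is not carried over
        }
        if old[old_slot / 64] & (1u64 << (old_slot % 64)) != 0 {
            new[new_slot as usize / 64] |= 1u64 << (new_slot % 64);
        }
    }
}
```

This is the O(pages × W) sequential walk cited in the cost bounds: one pass like this per local page.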
Cost bounds:
- For a region with 100K active pages and W=4: each participant remaps
100K × 4 words ≈ 400K u64 operations ≈ 3-5 ms. Coherence pause is
bounded to this duration.
- For 1M pages and W=16: ~50 ms pause. Acceptable because compaction is
extremely rare (requires max_participants worth of cumulative churn).
Recommendation: set max_participants >= 2× expected peak concurrent
participants. This ensures compaction never triggers in normal operation.
A region expecting 50 concurrent peers should use max_participants = 128;
one expecting 200 should use 512.
6.2 Design Overview¶
Distributed Shared Memory (DSM) allows processes on different nodes to share a virtual address space. Pages migrate between nodes on demand, using RDMA for transport and page faults for coherence.
Why previous DSM projects failed and why this one can work:
| Past Problem | Our Solution |
|---|---|
| Cache-line coherence (64B) over network | Page-level coherence (4KB minimum) |
| Bolted onto Linux MM (invasive patches) | Designed into UmkaOS MM from start |
| No hardware fault mechanism for remote | RDMA + CPU page fault = standard demand paging |
| Software TLB shootdown over network | Targeted TLB shootdown via RDMA notification (only invalidating nodes listed in the sharer set) |
| Single writer protocol (slow) | Multiple-reader / single-writer with RDMA invalidation |
| No topology awareness | Cluster distance matrix drives all placement decisions |
| Catastrophic on node failure | Capability-based revocation + replicated directory |
DSM is opt-in and subsystem-scoped, not global.
Unlike TidalScale (which transparently DSM-backs every process page via hypervisor EPT traps), UmkaOS DSM is a toolkit that kernel subsystems explicitly opt into by creating regions. The usage model has three levels:
| Level | Scope | Status |
|---|---|---|
| 0 — No DSM | Node participates in cluster (Raft, heartbeat, capability services, DLM) but never creates or joins a DSM region. DSM subsystem consumes zero resources. | Supported |
| 1 — Subsystem-scoped | Individual kernel subsystems create DSM regions for their own needs. UPFS creates small metadata-coherence regions. Block export creates read-cache regions. Accelerator framework creates shared-GPU-memory regions. Each region has independent participant sets, caching policies (DsmSubscriber), coherence modes (invalidate vs. write-update), and max_participants. Subsystems that don't need DSM never interact with it. | Current design |
| 2 — Application-visible | User-space processes create and mmap DSM regions directly via syscall interface (dsm_create, dsm_attach, dsm_mmap, dsm_detach). Distributed futex for cross-node synchronization. See Section 6.14. | Specified (Phase 3+) |
Level 3 (full transparent DSM — every process address space automatically coherent across the cluster) is an explicit non-goal. Historical evidence (Kerrighed, MOSIX, TidalScale) shows that transparent DSM incurs unacceptable overhead for general-purpose workloads. UmkaOS's region-based approach lets consumers choose the right granularity: GPFS-class workloads use a few targeted regions for token/metadata caching while performing bulk data I/O via direct RDMA to storage peers with DLM coordination.
6.3 Page Ownership Model¶
Every shared page has exactly one owner node and zero or more reader nodes:
```rust
// umka-core/src/dsm/ownership.rs

/// Ownership state of a distributed shared page.
#[repr(u8)]
pub enum DsmPageState {
    /// Page is exclusively owned by this node. No remote copies exist.
    /// This node can read and write freely.
    Exclusive = 0,
    /// Page is owned by this node, but read-only copies exist on other nodes.
    /// To write: must first invalidate all reader copies.
    SharedOwner = 1,
    /// This node has a read-only copy. Owner is elsewhere.
    /// To write: must request ownership transfer from current owner.
    SharedReader = 2,
    /// Page is not present on this node. Owner is elsewhere.
    /// To read or write: fault → request page from owner via RDMA.
    NotPresent = 3,
    /// Page is being transferred (migration in progress).
    Migrating = 4,
    /// Ownership transfer is in progress: the directory entry has been updated to
    /// reflect the new owner, but invalidations to current readers have not yet
    /// completed. Nodes that read the directory entry during this window see
    /// `Invalidating` and must block on the entry's `wait_queue` until the
    /// state transitions to `Exclusive`, indicating all invalidations are acked.
    Invalidating = 5,
}

impl TryFrom<u8> for DsmPageState {
    type Error = ();

    fn try_from(v: u8) -> Result<Self, ()> {
        match v {
            0 => Ok(Self::Exclusive),
            1 => Ok(Self::SharedOwner),
            2 => Ok(Self::SharedReader),
            3 => Ok(Self::NotPresent),
            4 => Ok(Self::Migrating),
            5 => Ok(Self::Invalidating),
            _ => Err(()),
        }
    }
}
```
```rust
/// **MOESI mapping** — `DsmPageState` extends the classic MOESI protocol with
/// two transient states for ownership transfer:
///
/// | DsmPageState | MOESI Equivalent | Notes |
/// |----------------|------------------|-------|
/// | `Exclusive` | Exclusive / Modified | Modified after first write (dirty bit tracked separately) |
/// | `SharedOwner` | Owned | This node owns the page; read copies exist on other nodes |
/// | `SharedReader` | Shared | Read-only copy; owner is elsewhere |
/// | `NotPresent` | Invalid | No local copy; must fault to obtain |
/// | `Migrating` | *(transient)* | Collapses to Exclusive or SharedReader on completion |
/// | `Invalidating` | *(transient)* | Collapses to Exclusive when all invalidation ACKs received |
///
/// Transient states are never visible to the coherence protocol's steady state —
/// they exist only during ownership transfer and are protected by the directory
/// entry's `wait_queue`. Any node reading the directory during a transient state
/// blocks until the transition completes.

/// Directory entry for a distributed shared page.
/// Stored on the home node (determined by hash of virtual address).
///
/// Seqlock-wrapped directory entry. Contains a `DsmDirEntry` (defined in
/// [Section 6.4](#dsm-home-node-directory--directory-entry-at-the-home-node)) plus per-entry
/// metadata used for home node lookups. All `DsmDirectoryEntry` instances
/// within a region have the same bitmap word count W (set at creation via
/// `max_participants`). Per-page structures are allocated from a per-region
/// slab pool whose slot size matches W.
///
/// **Relationship to `DsmDirEntry`**: `DsmDirEntry` holds the MOESI state,
/// owner, and sharers bitmap. `DsmDirectoryEntry` wraps it with a seqlock
/// for concurrent local readers and adds the version/rehash metadata.
/// The MOESI transition logic in [Section 6.6](#dsm-coherence-protocol-moesi) operates on the inner
/// `DsmDirEntry` fields under `entry.lock`. Local readers that need a
/// consistent snapshot use the seqlock protocol on this outer wrapper.
///
/// kernel-internal, not KABI — variable size (DsmDirEntry contains W-word bitmap),
/// contains AtomicPtr and SpinLock, never sent on wire.
#[repr(C)]
pub struct DsmDirectoryEntry {
    /// Seqlock sequence counter for local CPU consistency on the home node.
    ///
    /// Protocol (home node CPU only — NOT for RDMA):
    /// Writer (must also hold entry.inner.lock):
    ///   1. Spin-wait until `sequence` is even (unlocked).
    ///   2. CAS(sequence, even, even + 1, Acquire) — starts write section.
    ///   3. Modify inner DsmDirEntry fields.
    ///   4. store(sequence, even + 2, Release) — completes write section.
    /// Reader:
    ///   1. read sequence (Acquire) → if odd, retry.
    ///   2. memcpy inner fields.
    ///   3. re-read sequence (Acquire) → if changed, retry.
    ///
    /// Remote nodes do NOT read via one-sided RDMA Read (seqlock cannot work
    /// across separate RDMA ops). They send a two-sided directory lookup
    /// request; the home node reads locally and returns the result.
    pub sequence: AtomicU64, // 8 bytes
    /// The MOESI directory entry (state, owner, sharers, spinlock).
    /// See [Section 6.3](#dsm-page-ownership-model) for field definitions.
    pub inner: DsmDirEntry, // variable size
    /// Version counter (incremented on every ownership transfer).
    pub version: u64, // 8 bytes
    /// Membership epoch when this entry's home assignment was last computed.
    pub rehash_epoch: u32, // 4 bytes
    pub _pad2: [u8; 4], // align to 8
    /// Wait queue for blocking on `Invalidating` → `Exclusive` transitions.
    /// Local-only (not transmitted over RDMA). Lazily allocated on first
    /// contention — most entries never have waiters.
    pub wait_queue: AtomicPtr<WaitQueueHead>, // 8 bytes
    /// Per-entry spinlock protecting the wait queue and blocking path.
    /// The seqlock handles the fast-path read side; this spinlock protects
    /// `prepare_to_wait()` / `wake_up_all()` since seqlocks cannot be held
    /// across `schedule()`. Lock ordering: seqlock writer THEN entry_lock.
    pub entry_lock: SpinLock, // 4 bytes
    pub _pad3: [u8; 4], // align to 8
}
```
Size (64-bit architectures): 64 + W×8 bytes (W = bitmap words). Fixed overhead:
sequence(8) + inner fixed(24) + version(8) + rehash_epoch(4) + _pad2(4) +
wait_queue(8) + entry_lock(4) + _pad3(4) = 64 bytes. Inner fixed (excluding
sharers bitmap W×8): state(1) + _pad1(1) + owner_slot(2) + _pad2(4) +
owner(8, Option\<PeerId>) + lock(4) + _pad3(4) = 24.
For 256 participants (W=4): 96 bytes ≈ 2 cache lines under seqlock.
For 64 participants (W=1): 72 bytes ≈ 2 cache lines.
For 1024 participants (W=16): 192 bytes ≈ 3 cache lines.
32-bit architectures (ARMv7, PPC32): AtomicPtr<WaitQueueHead> is 4 bytes
(not 8), reducing fixed overhead by 4 bytes to 60 + W×8. AtomicU64
(sequence) remains 8 bytes on 32-bit Rust. Layout: sequence(8) + inner
fixed(24) + version(8) + rehash_epoch(4) + _pad2(4) + wait_queue(4) +
entry_lock(4) + _pad3(4) = 60 bytes fixed.
Slab class table (BITMAP_SLAB_CLASSES), directory entry sizes by bitmap word count W:
| W | dir_entry_size (32-bit) | dir_entry_size (64-bit) |
|---|---|---|
| 1 | 68 | 72 |
| 2 | 76 | 80 |
| 4 | 92 | 96 |
| 8 | 124 | 128 |
| 16 | 188 | 192 |
```rust
#[cfg(target_pointer_width = "32")]
const_assert!(core::mem::size_of::<DsmDirectoryEntry<1>>() == 68);
```
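The slab class sizes follow directly from the layout above: fixed overhead (64 bytes on 64-bit, 60 on 32-bit) plus W bitmap words of 8 bytes. A minimal sketch, assuming the constants from the layout breakdown in this section (the function name is illustrative, not umka-core API):

```rust
/// Illustrative helper mirroring the BITMAP_SLAB_CLASSES table:
/// fixed DsmDirectoryEntry overhead plus W sharer-bitmap words.
const FIXED_OVERHEAD_64: usize = 64; // sequence + inner fixed + version + epoch + pads + ptrs
const FIXED_OVERHEAD_32: usize = 60; // 4 bytes less: AtomicPtr is 4 bytes on 32-bit

fn dir_entry_size(w: usize, is_64bit: bool) -> usize {
    let fixed = if is_64bit { FIXED_OVERHEAD_64 } else { FIXED_OVERHEAD_32 };
    fixed + w * 8
}
```

Each row of the table is `dir_entry_size(W, ...)` for W in {1, 2, 4, 8, 16}.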
Wire format: When transmitted over RDMA (home-to-backup, directory lookup response), only core fields are sent. Local-only fields (`sequence`, `wait_queue`, `entry_lock`) are excluded.

| Offset | Size | Field |
|---|---|---|
| 0 | 8 | `owner: PeerId` (u64; 0 = None — PeerId 0 is reserved, node IDs start at 1) |
| 8 | 2 | `owner_slot: RegionSlotIndex` (u16) |
| 10 | 1 | `state: DsmPageState` (u8) |
| 11 | 1 | `_pad` |
| 12 | 4 | `rehash_epoch: u32` |
| 16 | 8 | `version: u64` |
| 24 | W×8 | `sharers: [u64; W]` (RegionBitmap raw words) |

Total: 24 + W×8 bytes. For 256 participants (W=4): 56 bytes — fits in a single RDMA inline send (≤64 bytes on most HCAs). For 1024 participants (W=16): 152 bytes — exceeds the inline threshold but fits in a standard RDMA Send. The receiver reconstructs the full `DsmDirectoryEntry` by initializing `sequence` to 0, `wait_queue` to null, and `entry_lock` to unlocked.

Note: The `RegionBitmap` replaces the old `u64` bitfield. Bitmap size is determined per-region by `max_participants` (default 256, hard cap 1024). For ≤64 participants, the bitmap is exactly one `u64` word — same performance as the old design. See Section 6.1 for the full bitmap specification and memory overhead analysis.

RDMA pool constraint: DSM pages are allocated from the RDMA-registered memory pool (Section 5.4), sized as `min(RAM × 25%, rdma.max_pool_gib GiB)` with a 256 MiB floor (default cap: 64 GiB). Only pages within the RDMA pool can be transferred to or fetched from remote nodes. If the pool is exhausted, remote page faults block until pages are freed via LRU-based eviction (the DSM eviction policy reclaims the least-recently-accessed remote-resident pages first). The cap and percentage are adjustable via `rdma.max_pool_gib` and `/sys/kernel/umka/cluster/rdma_pool_percent`; workloads with large distributed working sets should increase the cap accordingly.
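To make the wire layout concrete, here is a hedged sketch of an encoder for it. The parameter names mirror the offset table, and every multi-byte field uses `to_le_bytes()` per the Section 6.1.2 endianness rule; this is illustrative, not the umka-core implementation, and `PeerId` 0 encodes "no owner":

```rust
/// Illustrative encoder for the 24 + W*8 byte directory-entry wire
/// format (all fields little-endian).
fn encode_dir_entry(
    owner: u64,      // PeerId (0 = None)
    owner_slot: u16, // RegionSlotIndex
    state: u8,       // DsmPageState discriminant
    rehash_epoch: u32,
    version: u64,
    sharers: &[u64], // RegionBitmap raw words, length = W
) -> Vec<u8> {
    let mut buf = Vec::with_capacity(24 + sharers.len() * 8);
    buf.extend_from_slice(&owner.to_le_bytes());        // offset 0
    buf.extend_from_slice(&owner_slot.to_le_bytes());   // offset 8
    buf.push(state);                                    // offset 10
    buf.push(0);                                        // offset 11: _pad
    buf.extend_from_slice(&rehash_epoch.to_le_bytes()); // offset 12
    buf.extend_from_slice(&version.to_le_bytes());      // offset 16
    for w in sharers {
        buf.extend_from_slice(&w.to_le_bytes());        // offset 24 onward
    }
    buf
}
```

For W=4 the result is exactly 56 bytes, matching the inline-send budget noted above.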
6.4 Home Node Directory¶
Each shared page has a home node determined by hashing its virtual address. The home node stores the authoritative directory entry (who owns the page, who has copies). This avoids a centralized directory server.
Directory indexing data structure: The home node stores directory entries in a
per-region radix tree indexed by virtual page number within the region. The radix tree
uses 9-bit fan-out (512 entries per node, matching page table structure) with
RCU-protected reads and per-node spinlocks for modification. For a 1TB region with
4KB pages (268M entries), the radix tree uses approximately 4 levels with ~524K internal
nodes (~32MB of metadata). Mutual exclusion for directory entry writes is provided by
the CAS-based seqlock writer protocol defined on DsmDirectoryEntry::sequence: the
CAS(even → odd) step provides seqlock semantics for readers (lock-free consistent
snapshots via retry on sequence mismatch). This eliminates locking on the read path
— local readers never acquire any spinlock. The DsmDirEntry::lock SpinLock remains
for write coordination: multi-step directory operations (e.g., state transition +
sharer bitmap update + owner change) must be atomic, and the CAS-based seqlock only
serializes the sequence counter update, not the multi-field modification. Writers hold
entry.inner.lock for the duration of the directory modification, then update the
seqlock sequence to publish the change to readers (see DsmDirectoryEntry writer
protocol in Section 6.3).
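The CAS-based seqlock writer and lock-free reader can be sketched in userspace Rust as follows. A single `AtomicU64` payload stands in for the inner `DsmDirEntry` fields; the real writer additionally holds `entry.inner.lock` across the write section:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal seqlock sketch: CAS even -> odd opens a write section,
/// storing even + 2 closes it; readers retry on an odd or changed
/// sequence, getting consistent snapshots without any lock.
struct SeqEntry {
    sequence: AtomicU64,
    payload: AtomicU64, // stand-in for the inner DsmDirEntry fields
}

impl SeqEntry {
    fn write(&self, v: u64) {
        loop {
            let s = self.sequence.load(Ordering::Acquire);
            if s % 2 == 0
                && self
                    .sequence
                    .compare_exchange(s, s + 1, Ordering::Acquire, Ordering::Relaxed)
                    .is_ok()
            {
                self.payload.store(v, Ordering::Relaxed); // modify inner fields
                self.sequence.store(s + 2, Ordering::Release); // publish
                return;
            }
            // Odd (writer active) or lost the CAS race: retry.
        }
    }

    fn read(&self) -> u64 {
        loop {
            let s1 = self.sequence.load(Ordering::Acquire);
            if s1 % 2 != 0 {
                continue; // write section in progress: retry
            }
            let v = self.payload.load(Ordering::Relaxed);
            let s2 = self.sequence.load(Ordering::Acquire);
            if s1 == s2 {
                return v; // consistent snapshot
            }
        }
    }
}
```

Note how the reader never performs an atomic write, which is why this works only for local CPUs on the home node and not across separate one-sided RDMA Reads.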
```
Page with virtual address VA:
    home_node = hash(dsm_region_id, VA) % participant_count

The home node might not be the owner or have a copy.
It just maintains the directory entry.
```
Note: The modulus is `participant_count` (number of peers participating in
this DSM region), NOT `cluster_size` (total cluster membership). A DSM region
may involve a subset of the cluster. The RegionSlotMap maps slot indices to
participating PeerIds.
Why hash-based:
- No single point of failure (every node is home for some pages)
- O(1) lookup (no traversal)
- Uniform distribution across nodes (modular hashing)
- If home node fails: rehash to backup ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery))
Note: This is modular hashing (hash % participant_count), NOT consistent hashing.
Modular hashing remaps most entries when participant_count changes. UmkaOS targets
fixed-membership clusters where node join/leave is a rare, coordinated event.
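A sketch of the home computation under these rules, using FNV-1a as an illustrative stand-in for the kernel's hash and a plain slice as the RegionSlotMap (slot index to participating PeerId):

```rust
/// Home slot for a page: hash(region_id, page number) modulo the
/// region's participant_count (NOT cluster_size). FNV-1a is a
/// stand-in; the real kernel hash may differ.
fn home_slot(region_id: u64, va: u64, participant_count: u64) -> u64 {
    let rid = region_id.to_le_bytes();
    let vpn = (va >> 12).to_le_bytes(); // page-align: same page, same home
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in rid.iter().chain(vpn.iter()) {
        h ^= *byte as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h % participant_count
}

/// Resolve the slot to a PeerId via the RegionSlotMap (modeled as a slice).
fn home_node(region_id: u64, va: u64, slot_map: &[u64]) -> u64 {
    slot_map[home_slot(region_id, va, slot_map.len() as u64) as usize]
}
```

Because the hash covers the virtual page number rather than the raw address, every byte of a 4KB page resolves to the same home.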
When membership changes, directory rehash uses incremental migration:
1. Quorum leader announces new participant_count to all nodes.
2. Each home node computes which of its directory entries map to a
different home node under the new hash. These entries are marked
"migrating" but remain readable at the old location.
3. New entries are placed in the new hash location immediately.
4. Existing entries are migrated lazily on access: when a node receives
a directory lookup for an entry it no longer owns (under the new hash),
it forwards the request to the new home node. If the new home node
does not yet have the entry, the old home node transfers it on demand.
5. A background sweep migrates remaining entries at low priority.
6. The directory maintains both old and new hash functions during the
migration window. No stop-the-world pause is required. The migration
duration depends on cluster size and DSM dataset size: for typical
clusters (8-32 nodes, <100GB DSM), the background sweep completes in
~1-5 seconds. For large DSM datasets (e.g., 1TB, ~256M directory entries),
migration of all entries takes longer — potentially minutes if entries are
not accessed during the sweep (each migration requires an RDMA round-trip
of ~3μs). The system remains fully functional during this window via the
dual-hash lookup (old location serves requests until migration completes).
7. After all entries are migrated, the old hash function is retired.
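The old-home behavior during the migration window (serve if not yet migrated, otherwise redirect to the deterministic new home) can be sketched as follows. The types are illustrative, and a modular page-index hash stands in for the real hash function:

```rust
use std::collections::HashMap;

/// Result of a directory lookup arriving at the OLD home node during
/// the dual-hash migration window.
#[derive(Debug, PartialEq)]
enum LookupResult {
    Served { version: u64 },
    Redirect { new_home: u64 },
}

struct OldHome {
    /// Directory entries this node still holds, keyed by page index.
    local: HashMap<u64, u64>, // page index -> entry version
    /// Participant count under the NEW hash function.
    new_participant_count: u64,
}

impl OldHome {
    fn lookup(&self, page: u64) -> LookupResult {
        match self.local.get(&page) {
            // Entry not yet migrated: the old location stays authoritative.
            Some(&version) => LookupResult::Served { version },
            // Already migrated: at most one redirect, because the new
            // (modular) hash is deterministic.
            None => LookupResult::Redirect {
                new_home: page % self.new_participant_count,
            },
        }
    }
}
```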
In-flight page faults during rehash: A page fault that arrives at a node which
is no longer the home node (under the new hash) is handled by forwarding:
- The old home node checks if it still has the directory entry locally.
If yes, it services the request directly (the entry hasn't migrated yet).
- If the entry has already migrated, the old home node returns a REDIRECT
response with the new home node ID. The faulting node retries the lookup
at the new home node. At most one redirect per fault in the common case
(the new hash is deterministic, so the second lookup always reaches the
correct node). During active entry migration, the fault handler may spin
for up to 8 retries before the redirect (see write-fault race below),
bounding worst-case latency to ~8μs + one redirect.
- Ownership transfers in progress ([Section 6.5](#dsm-page-fault-flow) steps 3-8) that span the
rehash boundary complete under the old hash. The entry is migrated to the
new home node after the transfer completes. This is safe because the
old home node holds the entry until migration, and the seqlock prevents
concurrent modification.
- **Concurrent rehash**: If a second membership change occurs while a rehash is
in progress, the first rehash is completed before the second begins (the
**quorum leader** — defined as the lowest-ID node in the current majority
partition — serializes membership change announcements). During the first
rehash's 1-5 second migration window, the quorum leader holds a membership-change lock
that prevents new node join/leave from being processed. New membership events
are queued and processed sequentially after the current rehash completes. This
ensures that at most one hash transition is active at any time, preserving the
"at most one redirect" guarantee.
**Failure during rehash**: If a node fails during a rehash, the failure
is handled by the existing membership protocol ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery)) — the dead
node's directory entries are redistributed as part of the rehash, not as a
separate event. If the **quorum leader** dies while holding the membership-change
lock, the new quorum leader is deterministically identified as the lowest-ID
surviving node in the majority partition ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--split-brain-resolution)). The new
leader inherits the in-progress rehash by querying all surviving nodes for their
current hash function version (old vs. new). It then either completes the rehash
(if >50% of entries have migrated) or aborts it (rolling back to the old hash
function). The membership-change lock is not a distributed lock — it is a logical
role held by the quorum leader, so quorum-leader reassignment implicitly transfers it.
Write-fault race during directory entry migration: A write fault on a page
whose directory entry is currently being moved to a new home node (step 4
above — lazy on-demand transfer from old to new home node) requires care:
- Each directory entry carries a per-entry rehash_epoch field (u32) that
is incremented when the entry is tagged "migrating to new home" (step 2).
- The fault handler reads the rehash_epoch before and after reading the
directory entry (as part of the existing seqlock protocol). If the epoch
has changed, or if the entry's state is "migrating", the fault handler
treats this as a transient condition and retries the directory lookup after
a short spin (up to 8 retries with 1 μs backoff, then falls back to the
REDIRECT path).
- Alternatively, the home node may hold a per-entry spinlock during the
entry-transfer step (old home → new home). The fault handler acquiring the
same lock before reading the entry ensures it either sees the entry fully
present (pre-transfer) or receives a REDIRECT (post-transfer), with no
window where the entry is partially visible.
Both mechanisms are equivalent in correctness; the epoch/retry approach is
preferred because it avoids blocking the fault path on a lock acquisition.
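The preferred epoch/retry approach can be sketched as: snapshot the entry twice, treat an epoch change or a "migrating" tag as transient, and give up to the REDIRECT path after 8 attempts. Types and names here are illustrative:

```rust
const MAX_RETRIES: u32 = 8;

#[derive(Clone, Copy)]
struct EntrySnapshot {
    epoch: u32,      // per-entry rehash_epoch
    migrating: bool, // "migrating to new home" tag
    version: u64,
}

/// Returns Some(snapshot) once a stable read is observed, or None after
/// MAX_RETRIES so the caller can fall back to the REDIRECT path.
fn read_entry_stable<F: FnMut() -> EntrySnapshot>(mut read: F) -> Option<EntrySnapshot> {
    for _ in 0..MAX_RETRIES {
        let before = read();
        let after = read();
        // Stable only if the epoch did not move between the two reads
        // and the entry is not tagged as migrating.
        if before.epoch == after.epoch && !after.migrating {
            return Some(after);
        }
        // Real code spins with ~1 us backoff between attempts here.
    }
    None
}
```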
This is fundamentally different from DLM's consistent hashing
([Section 15.15](15-storage.md#distributed-lock-manager--lock-resource-naming-and-master-assignment)),
which uses a hash ring for minimal redistribution because lock master
reassignment must be fast and non-disruptive.
6.5 Page Fault Flow¶
Process on Node A reads address VA (not present locally):
1. CPU page fault on Node A.
2. UmkaOS MM handler identifies VA as part of a DSM region.
3. Compute home_node = hash(region, VA) % participant_count = Node C.
(The modulus is `participant_count` — the number of peers participating in
this DSM region — NOT `cluster_size`, which is the total cluster membership.
See [Section 6.4](#dsm-home-node-directory).)
4. RDMA Send to home Node C: "Lookup directory entry for VA."
(Two-sided — the home node's CPU reads the entry locally using the seqlock
and returns it. One-sided RDMA Read cannot be used because the DsmDirectoryEntry
is larger than 8 bytes and a seqlock requires atomic re-reads, which separate
RDMA Read operations cannot provide. Round-trip: ~4-5 μs.)
Node C's request handler reads the directory entry under the seqlock. If the
entry's state is a transient state, the handler resolves it locally before
responding:
- **state == Invalidating**: A concurrent write-fault is invalidating readers.
The handler **blocks on a per-entry wait queue** (see "Invalidating state
blocking mechanism" below) until the state transitions out of Invalidating
(typically to Exclusive, once all invalidation acks arrive). The requesting
node (A) never sees the Invalidating state — the home node resolves it before
replying. The handler does NOT spin-wait; spinning for up to 1000ms (the
membership dead timeout) would block the CPU and could cause priority inversion
or deadlock in the request handler thread pool.
- **state == Migrating**: The page's home assignment is being transferred to a
different node due to a membership change. The handler returns EAGAIN. Node A
retries the directory lookup after a brief backoff (10-50μs), by which time
the migration has typically completed and the new home node is authoritative.
5. Home Node C looks up directory. Four cases based on the directory state:
**Case 5a — Uncached (first-ever access):** Home memory is authoritative.
No owner exists, no sharers exist. Node C sends `DataResp(page_data,
ack_count=0)` directly to Node A. Adds Node A to sharers, state → Shared.
No forwarding needed — 2× RTT total (GetS + DataResp). This is the most
common case on cold-start first access to any DSM page.
**Case 5b — Shared (data at home, existing sharers):** Home memory is
up-to-date. Node C sends `DataResp(page_data, ack_count=0)` directly to
Node A. Adds Node A to sharers, remains Shared. No forwarding needed.
**Case 5c — Exclusive:** Owner = Node B, home memory is up-to-date (E state
means clean). Adds Node A to `sharers`, adds former owner (Node B) to
`sharers`, clears `owner`, state → `Shared`. Sends `FwdGetS(requester=A)`
to Node B. Proceeds to step 6.
**Case 5d — Modified/Owned:** Owner = Node B, home memory is stale (owner B
retains dirty data). Adds Node A to `sharers`, state remains `Modified`.
Sends `FwdGetS(requester=A)` to Node B. Proceeds to step 6.
For cases 5a and 5b, step 6 (owner responds) is SKIPPED entirely — data
comes from home, not from an owner. Execution proceeds directly to step 7.
Note: Node A does NOT directly contact Node B. The home node forwards
the request — this is the standard directory-based MOESI forwarding model.
6. Node B receives `FwdGetS(requester=A)` from home Node C (cases 5c/5d only):
a. Transitions local page state: if Exclusive → Shared (clean);
if Modified → Owned (dirty data retained, services future reads).
b. RDMA Write: sends 4KB page data directly to Node A (`DataFwd`).
c. No directory update message to C — the home already updated the
directory in step 5 before forwarding.
7. Node A receives page data (`DataResp` from home for cases 5a/5b, or
`DataFwd` from Node B for cases 5c/5d):
**Concurrent fault coalescing** (logically between steps 3 and 4 of the
read-fault flow above; numbered with sub-steps to avoid confusion with
the main sequence): Before sending GetS to the home, Node A checks for
an in-progress fault on the same page:
(i) Acquire PTL(185). Check if PTE is already installed (a
concurrent fault on the same page completed first). If present
and satisfies the access type (read): release PTL, return Ok
(fault already resolved — no RDMA needed).
(ii) Release PTL. Check the per-region inflight XArray for a pending
`DsmFetchCompletion` keyed by page index
(`(addr - region.base_addr) >> PAGE_SHIFT`).
If found: block on the existing completion's wait queue. On
wake, re-check PTE (goto step 7i).
If not found: allocate a `DsmFetchCompletion` from the per-region
slab pool, insert into the inflight XArray.
(iii) Proceed with RDMA request (step 4).
(iv) On completion: remove from inflight XArray, wake all waiters.
This prevents duplicate RDMA round-trips when multiple threads fault
on the same DSM page simultaneously.
a. Allocate a local page frame from the RDMA-registered pool
([Section 5.4](05-distributed.md#rdma-native-transport-layer--pre-registered-kernel-memory)).
The RDMA Write deposits page data into this frame.
a2. Allocate `DsmPage` from the per-region slab pool (class
determined by the region's bitmap word count W). Initialize:
- `home_slot`: set from the directory lookup response (step 5)
(field on `DsmPage`, not `DsmPageMeta` — see [Section 6.11](#dsm-distributed-page-cache))
- `meta.last_writer`: `SLOT_INVALID` (read fault — no writer yet)
- `state_dirty`: `SharedReader as u16` (atomic, set via `.store(_, Release)`)
- `meta.last_transition_ns`: `arch::current::time::monotonic_ns()`
On allocation failure: free the page frame (step a), remove the
inflight XArray entry (step 7iv), return `FaultError::Oom`.
b. Set `PageFlags::DSM | PageFlags::UPTODATE | PageFlags::ACCESSED` on
the page descriptor.
c. Insert page into the per-region page tracking XArray, keyed by
**page index** within the region: `(addr - region.base_addr) >>
PAGE_SHIFT`. The XArray value is a pointer to the `DsmPage` struct
(which embeds both the page frame reference and coherence metadata).
Increment page refcount: one ref for the PTE, one for the tracking
structure.
d. Add to LRU generation list (youngest generation) for reclaim
integration with local file-backed pages (see
[Section 6.11](#dsm-distributed-page-cache--dsm-page-eviction)).
e. Set coherence state BEFORE PTE install (while PTL is NOT yet held):
`DsmPage.state_dirty.store(SharedReader as u16, Release)` -- bit 8
(dirty) is 0 for a read fault. This MUST be done before `install_pte()`
because the PTE makes the page visible to coherence protocol handlers
(a concurrent `Inv` from the home node could arrive as soon as the PTE
is installed). Setting the state after PTE install creates a race window
where the invalidation handler reads a stale `state_dirty` value.
(The `DsmPage.state_dirty` store above already set the coherence state;
no separate `local_state` field exists — `state_dirty` is authoritative.)
f. Acquire PTL (page table lock, level 185). Install read-only PTE:
```rust
if let Err(e) = install_pte(pgd, addr, pfn, pte_flags) {
// install_pte can fail with FaultError::Oom if intermediate
// page table allocation fails. Clean up all allocated resources:
// 1. Remove from per-region page tracking XArray, drop tracking ref.
// 2. Free DsmPageMeta (return to per-region slab pool).
// 3. Free page frame (return to RDMA-registered pool).
// 4. Remove inflight XArray entry, wake waiters.
// 5. Reset state_dirty to NotPresent (undo the pre-install set).
return Err(e);
}
```
Release PTL.
g. Remove inflight XArray entry (step 7iv), wake any blocked waiters.
h. Resumes faulting process.
Total latency: ~10-18 μs (directory lookup ~4-5 μs + forwarding ~3-5 μs
+ page transfer ~3-5 μs + local install ~1 μs)
Note: "page transfer ~3-5 μs" is the raw RDMA Write for 4KB (see
[Section 5.4](05-distributed.md#rdma-native-transport-layer--performance-characteristics)).
Software overhead (directory lookup, forwarding protocol, TLB
shootdown) is accounted in the other terms.
Compare: NVMe page fault = ~12-15 μs (comparable)
Per-request timeout: Each step that involves an RDMA round-trip (steps 4
and 6) has a per-request timeout of 10 ms (matching the ownership transfer
timeout). If the home node (step 4) or owner node (step 6) does not respond
within 10 ms, Node A retries with exponential backoff (10 ms, 20 ms, 40 ms).
If the node is declared PeerDead by the membership protocol (~1000 ms), the
DSM subsystem triggers home reconstruction ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery))
and the fault retries after recovery. The faulting thread blocks during timeout
but is guaranteed forward progress via the membership protocol's eventual
PeerDead declaration.
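The fault-coalescing sub-steps (i)-(iv) above reduce to a keyed inflight table. A single-threaded sketch, with a HashMap standing in for the per-region XArray and an integer waiter count standing in for the wait queue:

```rust
use std::collections::HashMap;

/// First faulting thread on a page registers a completion and issues
/// the RDMA fetch; later faults on the same page coalesce onto it.
struct InflightTable {
    pending: HashMap<u64, u32>, // page index -> waiter count
}

enum FaultAction {
    IssueRdmaFetch,  // no fault in flight: this thread fetches
    WaitForExisting, // coalesce: block on the in-flight completion
}

impl InflightTable {
    fn begin_fault(&mut self, page_index: u64) -> FaultAction {
        match self.pending.get_mut(&page_index) {
            Some(waiters) => {
                *waiters += 1;
                FaultAction::WaitForExisting
            }
            None => {
                self.pending.insert(page_index, 0);
                FaultAction::IssueRdmaFetch
            }
        }
    }

    /// Called by the fetching thread on completion: removes the entry
    /// and reports how many waiters must be woken to re-check the PTE.
    fn complete(&mut self, page_index: u64) -> u32 {
        self.pending.remove(&page_index).unwrap_or(0)
    }
}
```

This is why at most one RDMA round-trip is issued per page regardless of how many threads fault on it concurrently.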
Process on Node A writes to address VA (has read-only copy):
1. CPU page fault (write to read-only page) on Node A.
2. UmkaOS MM handler identifies DSM page with SharedReader state.
3. Request ownership transfer:
a. **Determine wire message**: If the faulting node already has a local copy
(DsmPageState == SharedReader, i.e., S→M upgrade), send `Upgrade(page, requester=A)`
to home Node C — no data transfer needed since Node A already has the page data.
If the faulting node has no local copy (DsmPageState == NotPresent, i.e., I→M miss),
send `GetM(page, requester=A)` to home Node C — data transfer needed.
b. Node C looks up directory: owner = Node B, readers = {A, D}.
**For Cases 1 and 2** (home is NOT the current owner): Node C transitions the
directory entry directly to `Modified(owner=A, sharers={})` under the seqlock
before sending invalidations and data/forward messages. No `Invalidating`
transient state is needed — the home is not the data source.
**For Case 3** (home == owner, C == B): Node C transitions to
`DsmHomeState::Invalidating` (setting `owner = Node A`) under the seqlock.
This prevents self-forwarding loops while the home prepares its own page data.
The `Invalidating` state blocks concurrent `GetS` requests (they wait on the
per-entry wait queue — see read-fault flow step 4). After data is sent,
the home transitions to `Modified(owner=A)`.
This distinction ensures that the `Invalidating` transient state is used ONLY
when the home is also the data source (Case 3), matching the `DsmHomeState`
enum documentation.
c. Node C sends invalidation to all readers except requester (N sharers):
- RDMA Send to each sharer (e.g., Node D): `Inv(page, requester=A)`.
- Node C simultaneously sends `DataResp(page_data, ack_count=N)` to
Node A (if home has the data, i.e., Shared state), or forwards
`FwdGetM(requester=A)` to the current owner Node B (if Modified/
Exclusive state). Node C does NOT wait for InvAcks — it pipelines.
- Each invalidated sharer (Node D) executes the invalidation receiver
sequence:
1. Acquire PTL (page table lock).
2. Clear PTE, flush TLB.
3. Release PTL.
4. Store `NotPresent` to `DsmPage.state_dirty` (Release ordering --
not CAS, no concurrent state writer after unmap).
5. Send `InvAck` directly to Node A (the requester), NOT to the home.
This is the standard MOESI ack_count model: the requester collects
InvAcks, allowing the home to remain available for other requests.
d. Node A collects InvAcks and receives page data:
- Node A receives `DataResp(page_data, ack_count=N)` from home (for
Shared→Modified transition), or `DataFwd(page_data)` from owner B
(forwarded via home for Modified/Exclusive→Modified transition).
- Node A waits for exactly N `InvAck` messages from the invalidated
sharers. Node A does NOT install the write mapping until all N
InvAcks have been received — this ensures no stale readers exist.
- While waiting for InvAcks, the faulting thread blocks on a local
per-page wait queue (not the home node's wait queue).
The data arrives from one of two sources (pipelined with Inv messages):
**Case 1: home state was Shared (home memory current):**
- Node C sends `DataResp(page_data, ack_count=N)` directly to Node A.
The data comes from the home node's local memory (up-to-date in S state).
Node C updates directory: `owner = Node A`, `state = Modified`,
`sharers = {}`. The directory update is completed at the home BEFORE
Node A finishes collecting InvAcks — this is safe because any
concurrent `GetS` arriving at the home during this window will be
forwarded to the new owner (Node A), and Node A will defer serving
it until it has installed the write mapping.
**Case 2: home state was Modified/Exclusive (data at remote owner B):**
- Node C forwards `FwdGetM(requester=A)` to current owner Node B.
Node C updates directory: `owner = Node A`, `state = Modified`,
`sharers = {}`.
- Node B receives `FwdGetM`, unmaps its local copy, sends page data
directly to Node A via RDMA Write (`DataFwd`), transitions local
state to NotPresent.
**Case 3: home == owner (C == B) — skip forwarding:**
- The home node IS the current owner. No forwarding message is needed.
The home node directly invalidates its own local copy, prepares the
page data, and sends `DataResp(page_data, ack_count=N)` to Node A.
Node C updates directory: `owner = Node A`, `state = Modified`,
`sharers = {}`.
- **Concurrency model**: The directory entry is transitioned at step 3b
to `Modified(owner=A)` under the seqlock. Concurrent `GetS` requests
arriving at the home during the invalidation window are forwarded to
the new owner (Node A). Node A defers serving forwarded `GetS` until
it has received all InvAcks and installed the write mapping. The
`Invalidating` transient state is used only for Case 3 (home==owner)
to prevent self-forwarding loops: the home enters `Invalidating`
while it prepares its own page data, then transitions to
`Modified(owner=A)` before sending the data. For Cases 1 and 2, the
home transitions directly to `Modified(owner=A)` — no `Invalidating`
state needed because the home is not the data source.
e. Node A installs write mapping after all InvAcks collected:
- Allocate local page frame from RDMA-registered pool if not already
present (I→M case). For S→M upgrade, the page is already local.
- For I→M case: allocate `DsmPage` from per-region slab pool.
Initialize: `home_slot` from directory response (field on `DsmPage`),
`meta.last_writer` = local node's `RegionSlotIndex`,
`state_dirty = (Exclusive as u16) | 0x100` (via `.store(_, Release)`),
`meta.last_transition_ns = monotonic_ns()`. On allocation failure:
free page frame, return `FaultError::Oom`.
- Set `PageFlags::DSM | PageFlags::UPTODATE | PageFlags::DIRTY |
PageFlags::ACCESSED`.
- Insert into per-region page tracking XArray (if new), keyed by
page index `(addr - region.base_addr) >> PAGE_SHIFT`. Increment
refcount (PTE ref + tracking ref).
- Add to LRU generation list if new page.
- Acquire PTL (level 185). Install read-write PTE:
```rust
if let Err(e) = install_pte(pgd, addr, pfn, pte_flags_rw) {
// Clean up: remove from tracking XArray, drop tracking ref,
// free DsmPageMeta, free page frame. Return Err(e).
return Err(e);
}
```
Release PTL.
- Set coherence state + dirty bit atomically:
`DsmPage.state_dirty.store((Exclusive as u16) | 0x100, Release)`
— bit 8 = dirty flag. `state_dirty` is the sole authoritative
coherence state field (no separate `local_state`).
Set `PageFlags::DIRTY` on the page descriptor.
f. Process deferred `FwdGetS`/`FwdGetM` requests (if any arrived
during InvAck collection — see "Deferred forwarded requests" below).
g. Resumes faulting process.
Total latency: ~15-25 μs (involves invalidating remote copies + collecting InvAcks)
Home==owner optimization saves one RDMA round-trip (~3-5 μs) in that case.
Note: the ack_count model pipelines data transfer with InvAck collection,
so the latency is dominated by max(data_transfer, slowest_InvAck).
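The `state_dirty` packing used above (state in the low byte, dirty flag at bit 8) can be sketched as follows — a minimal illustration in which the enum discriminant values are assumptions, not the authoritative encoding:

```rust
/// Illustrative `state_dirty` encoding: low byte = coherence state
/// discriminant, bit 8 (0x100) = dirty flag. Discriminant values here
/// are assumed for the sketch.
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(u16)]
pub enum DsmPageState {
    NotPresent = 0,
    SharedReader = 1,
    SharedOwner = 2,
    Exclusive = 3,
}

pub const DIRTY_BIT: u16 = 0x100;

/// Pack state + dirty into the single atomic `state_dirty` word.
pub fn pack_state_dirty(state: DsmPageState, dirty: bool) -> u16 {
    (state as u16) | if dirty { DIRTY_BIT } else { 0 }
}

/// Unpack: returns (raw state discriminant, dirty flag).
pub fn unpack_state_dirty(raw: u16) -> (u16, bool) {
    (raw & 0x00FF, raw & DIRTY_BIT != 0)
}
```

With this encoding, `(Exclusive as u16) | 0x100` yields `0x103`, matching the `.store(_, Release)` value shown in the flow above.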
Write-fault upgrade path (S -> M, local copy already exists):
Process on Node A writes to address VA (has read-only copy, DsmPageState = SharedReader):
1. CPU write fault (write to read-only page).
2. Handler identifies DsmPageState = SharedReader (local copy exists).
**Concurrent fault coalescing** (before sending Upgrade in step 3):
Same protocol as the read-fault coalescing (step 7 of the read path above):
(i) Acquire PTL(185). Check if PTE is already read-write (a concurrent
write fault on the same page completed first). If present and writable:
release PTL, return Ok (ownership already acquired).
(ii) Release PTL. Check the per-region inflight XArray for a pending
`DsmFetchCompletion` keyed by page index.
If found: block on the existing completion's wait queue. On wake,
re-check PTE (goto step i).
If not found: allocate a `DsmFetchCompletion`, insert into inflight XArray.
(iii) Proceed with Upgrade request (step 3).
(iv) On completion: remove from inflight XArray, wake all waiters.
This prevents duplicate Upgrade messages when multiple threads write-fault
on the same DSM page simultaneously. Without this check, the home node
would receive duplicate Upgrade requests, potentially causing the second
Upgrade to invalidate the first requester's freshly-acquired exclusive copy.
3. Send Upgrade(page, requester=A) to home Node C (NOT GetM -- no data
transfer needed since Node A already has the page data).
4. Home Node C: send Inv to all other sharers, send AckCount(N-1) to Node A.
Update directory: owner = Node A, state = Modified, sharers = {}.
5. Node A collects N-1 InvAcks directly from sharers.
6. Acquire PTL (level 185). Upgrade local PTE to read-write:
```rust
if let Err(e) = install_pte(pgd, addr, pfn, pte_flags_rw) {
// install_pte failure on upgrade is rare (page table already exists)
// but possible if intermediate levels were reclaimed. Clean up:
// release PTL, return Err(e). The existing SharedReader state is
// preserved; the process will re-fault and retry.
return Err(e);
}
```
Release PTL.
**TLB flush**: `tlb_flush_page(addr)` — the old read-only PTE may be cached
in TLB. Per-architecture requirements:
- x86-64: RO→RW upgrade does NOT require an explicit TLB flush (a stale
  read-only TLB entry causes at most a benign spurious fault, which re-walks
  the page table and picks up the widened permission).
- AArch64: TLBI required (break-before-make for PTE changes).
- RISC-V: `sfence.vma` required.
- PPC32/PPC64LE: `tlbie` + `ptesync` required.
- s390x: IPTE + global TLB purge.
- ARMv7: TLBI + DSB.
- LoongArch64: INVTLB required.
The `install_pte` documentation states: "Replacing existing mapping:
caller must issue `tlb_flush_page(addr)` AFTER install_pte returns."
Set coherence state + dirty bit atomically:
`DsmPage.state_dirty.store((Exclusive as u16) | 0x100, Release)`
— bit 8 = dirty flag. `state_dirty` is the sole authoritative
coherence state field (no separate `local_state`).
Set `PageFlags::DIRTY` on the page descriptor.
7. Resume faulting process.
No data transfer needed — Node A already has the page. This saves one
RDMA Write (~3-5 μs) compared to a full I→M write fault.
Total latency: ~8-15 μs (directory + Inv + InvAck collection, no data).
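The concurrent-fault coalescing from step 2 can be sketched with std primitives standing in for the kernel structures — `Mutex`/`Condvar` in place of the kernel wait queue, a `HashMap` in place of the per-region inflight XArray; `InflightTable` and `FaultRole` are illustrative names, not the kernel's:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// Simplified stand-in for `DsmFetchCompletion`.
pub struct FetchCompletion {
    pub done: Mutex<bool>,
    pub cv: Condvar,
}

impl FetchCompletion {
    /// Follower path: block until the leader completes, then the caller
    /// re-checks the PTE (step i in the text).
    pub fn wait(&self) {
        let mut done = self.done.lock().unwrap();
        while !*done {
            done = self.cv.wait(done).unwrap();
        }
    }
}

pub enum FaultRole {
    /// This thread inserted the entry and must send the Upgrade/GetM.
    Leader(Arc<FetchCompletion>),
    /// A fault for this page is already in flight; wait on its completion.
    Follower(Arc<FetchCompletion>),
}

/// Stand-in for the per-region inflight XArray, keyed by page index.
pub struct InflightTable {
    entries: Mutex<HashMap<usize, Arc<FetchCompletion>>>,
}

impl InflightTable {
    pub fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Step (ii): either join an existing in-flight fault or become leader.
    pub fn enter(&self, page_index: usize) -> FaultRole {
        let mut map = self.entries.lock().unwrap();
        if let Some(c) = map.get(&page_index) {
            return FaultRole::Follower(c.clone());
        }
        let c = Arc::new(FetchCompletion { done: Mutex::new(false), cv: Condvar::new() });
        map.insert(page_index, c.clone());
        FaultRole::Leader(c)
    }

    /// Step (iv): leader removes the entry and wakes all followers.
    pub fn complete(&self, page_index: usize, c: &FetchCompletion) {
        self.entries.lock().unwrap().remove(&page_index);
        *c.done.lock().unwrap() = true;
        c.cv.notify_all();
    }
}
```

Only the leader sends the Upgrade; followers block and, on wake, re-check the PTE, which is exactly the dedup guarantee the text requires.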
**Deferred forwarded requests during InvAck collection**: Between "home updates
directory to `Modified(owner=A)`" and "Node A collects all InvAcks and installs
write PTE", Node A is the official directory owner but cannot yet serve data.
If a concurrent `GetS` from Node E arrives at the home during this window, the
home forwards `FwdGetS(requester=E)` to Node A.
Node A handles these deferred requests as follows:
1. **Queue**: Incoming `FwdGetS`/`FwdGetM` messages for pages where Node A is
still collecting InvAcks are queued in a **per-page pending-forward list**
(embedded in the `DsmFetchCompletion` struct for the in-progress fault).
Maximum queue depth: 16 entries per page. If the queue overflows, Node A
sends `Nack(reason=Busy)` to the home, which returns `Nack` to the requester.
2. **Process after PTE install**: After all InvAcks arrive and the write PTE is
installed (step 3e/3f), Node A drains the pending-forward list:
- `FwdGetS`: send `DataFwd(page_data)` to each requester. Transition
`DsmPageState` from `Exclusive` to `SharedOwner` (MOESI: M → O). Update
`DsmPage.state_dirty` accordingly.
- `FwdGetM`: send `DataFwd(page_data)` to the requester, invalidate local
copy, transition to `NotPresent` (MOESI: M → I). Only the last `FwdGetM`
is honored; earlier ones receive `Nack(reason=Busy)`.
3. **Timeout cleanup**: If InvAck collection times out and Node A gives up
ownership (returns `FaultError::Sigbus`), all pending forwarded requests are
discarded. The home will re-evaluate directory state on retry from the
original requesters.
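A sketch of the per-page pending-forward list described above, with the 16-entry cap, Nack-on-overflow, and last-`FwdGetM`-wins drain policy; the types and helper names are illustrative, not the kernel's:

```rust
const MAX_PENDING_FORWARDS: usize = 16;

#[derive(Debug, PartialEq)]
pub enum FwdKind { GetS, GetM }

pub struct PendingFwd { pub kind: FwdKind, pub requester: u16 }

#[derive(Debug, PartialEq)]
pub enum QueueResult { Queued, NackBusy }

/// Step 1: queue an incoming forwarded request while InvAck collection is
/// in progress. Overflow produces Nack(Busy), per the spec text.
pub fn queue_forward(list: &mut Vec<PendingFwd>, req: PendingFwd) -> QueueResult {
    if list.len() >= MAX_PENDING_FORWARDS {
        return QueueResult::NackBusy;
    }
    list.push(req);
    QueueResult::Queued
}

/// Step 2: drain after the write PTE is installed. Every FwdGetS receives
/// DataFwd; among FwdGetM requests only the LAST is honored (ownership can
/// transfer away only once), earlier ones are Nacked.
/// Returns (GetS requesters served, honored GetM requester, Nacked GetM requesters).
pub fn drain(list: Vec<PendingFwd>) -> (Vec<u16>, Option<u16>, Vec<u16>) {
    let last_getm = list.iter().rposition(|p| p.kind == FwdKind::GetM);
    let mut served_gets = Vec::new();
    let mut nacked = Vec::new();
    let mut honored_getm = None;
    for (i, p) in list.into_iter().enumerate() {
        match p.kind {
            FwdKind::GetS => served_gets.push(p.requester),
            FwdKind::GetM => {
                if Some(i) == last_getm {
                    honored_getm = Some(p.requester);
                } else {
                    nacked.push(p.requester);
                }
            }
        }
    }
    (served_gets, honored_getm, nacked)
}
```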
**Early-timeout deduplication**: When the bounded-wait early-timeout fires (see
below), the retry creates a potential for duplicate in-flight RDMA requests.
The deduplication protocol:
1. Each DSM fault request carries a **request sequence number** (`req_seq: u64`,
monotonically increasing per-node) in the `DsmWireHeader.aux` field.
2. On early timeout: mark the `DsmFetchCompletion` for this page as `TIMED_OUT`.
3. If the late RDMA response arrives, the completion handler checks the status.
If `TIMED_OUT`, the response is discarded (free the received page frame if
any). The `req_seq` in the response is compared to the current inflight
entry's `req_seq`; mismatches are silently dropped.
4. The retry creates a NEW `DsmFetchCompletion` with a new `req_seq`. The home
node does not need to deduplicate — it processes each request independently.
If two GetS arrive for the same page from the same node, the home adds the
node to sharers (idempotent on bitmap) and sends DataResp twice. The
requester's inflight XArray dedup ensures only one response is consumed.
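The late-response filter from steps 2-3 reduces to a small predicate; `InflightEntry` here is a simplified stand-in for the inflight XArray entry with its `TIMED_OUT` mark:

```rust
#[derive(Debug, PartialEq)]
pub enum RespAction { Consume, Discard }

/// Simplified inflight entry: the current request sequence number and
/// whether the bounded-wait early timeout has fired for this fault.
pub struct InflightEntry {
    pub req_seq: u64,
    pub timed_out: bool,
}

/// A response is consumed only if no early timeout has fired AND its
/// `req_seq` matches the current inflight entry; anything else (a late
/// response to a timed-out request, or a stale sequence number) is
/// silently discarded and its page frame freed by the caller.
pub fn on_data_resp(entry: &InflightEntry, resp_seq: u64) -> RespAction {
    if entry.timed_out || resp_seq != entry.req_seq {
        RespAction::Discard
    } else {
        RespAction::Consume
    }
}
```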
**vma.vm_lock hold duration during DSM fault**: The vma.vm_lock (read mode) is
held for the duration of the DSM page fault, which includes RDMA round-trips
(10-18 μs for read, 15-25 μs for write, up to 1000 ms if a node is unresponsive
before timeout expiry). During this time, any concurrent `munmap()` on the same
VMA range blocks. This is documented as an expected behavior — the per-VMA lock
is designed for short critical sections, but DSM faults are inherently network-
latency-bound. The 10ms per-request timeout and 1000ms aggregate timeout bound
the worst case.
**Phase 3 optimization (P1)**: Release vma.vm_lock before the RDMA wait and
re-acquire after, re-validating VMA state. This requires careful interaction
with the `invalidate_seq` protocol ([Section 4.8](04-memory.md#virtual-memory-manager)): after
re-acquiring vma.vm_lock, the fault handler must re-check that (a) the VMA
still exists, (b) the VMA still covers `addr`, and (c) the invalidate_seq has
not advanced (indicating a concurrent munmap/mprotect that may have changed
the VMA's properties).
**Bounded-wait early-timeout mechanism**: If a DSM fault waits longer than
`DSM_FAULT_TIMEOUT_US` (default: 5000 µs = 5ms) for an RDMA response, the
fault handler returns `-EAGAIN` and the caller retries the fault from the
top (re-acquiring vma.vm_lock and re-checking VMA state). This bounds the
worst-case vma.vm_lock hold duration to 5ms, even on slow or partitioned
networks. The retry loop is bounded by the aggregate fault timeout (1000ms);
after exhaustion, the fault returns `FaultError::Sigbus` (delivering SIGBUS to
the process). `-ENOMEM` would be semantically wrong — the process is not out
of memory, the remote node is unreachable. Using `-ENOMEM` would trigger the
OOM killer (per `FaultError::Oom` handling), which would kill an innocent
process to free memory that cannot resolve a network partition. SIGBUS is the
correct signal: it matches the DataFwd timeout path and the semantic meaning
(remote memory unreachable = bus error, not OOM).
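The retry policy above (5 ms early timeout, 1000 ms aggregate budget, SIGBUS on exhaustion) can be sketched as a bounded loop; the `attempt` closure stands in for the real RDMA wait so the policy can be shown deterministically:

```rust
const DSM_FAULT_TIMEOUT_US: u64 = 5_000;         // per-wait early timeout (5 ms)
const DSM_AGGREGATE_TIMEOUT_US: u64 = 1_000_000; // aggregate budget (1000 ms)

#[derive(Debug, PartialEq)]
pub enum FaultError { Sigbus }

/// Each `attempt(elapsed_us)` models one bounded RDMA wait: `true` means the
/// response arrived, `false` means the 5 ms early timeout fired (-EAGAIN,
/// retry from the top after re-acquiring vma.vm_lock). After the aggregate
/// budget is exhausted the fault fails with SIGBUS — not ENOMEM, since the
/// remote node being unreachable is not a memory shortage.
pub fn fault_with_retry(mut attempt: impl FnMut(u64) -> bool) -> Result<(), FaultError> {
    let mut elapsed_us = 0;
    while elapsed_us < DSM_AGGREGATE_TIMEOUT_US {
        if attempt(elapsed_us) {
            return Ok(()); // response arrived; fault completes
        }
        elapsed_us += DSM_FAULT_TIMEOUT_US; // early timeout: retry
    }
    Err(FaultError::Sigbus)
}
```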
**DataFwd timeout retry semantics**: If Node A's 10ms timer fires waiting for
data from Node B (the forwarded DataFwd), Node A retries by re-sending the
original request (`GetS` or `GetM`/`Upgrade`) to the **home node**, NOT to
Node B. The home node re-evaluates directory state — if Node B has been marked
as failed by the membership protocol, the home may serve the data itself (if it
has a copy) or redirect to a different owner. If the home is also unresponsive,
the fault path retries with exponential backoff (max 3 retries); once retries
are exhausted, the fault returns `FaultError::Sigbus` and the faulting thread
receives `SIGBUS`.
RDMA Send failure handling: If the initial RDMA Send of an invalidation request
fails (queue pair error, transport failure, out-of-send-WRs), the home node treats
the failure as an immediate entry into the retry escalation path below — equivalent
to an instant ACK timeout. The home node does NOT wait 200 μs before retrying;
instead it immediately attempts the first retry via a backup QP (if available) or
a two-sided fallback channel. If all retries fail, the home node escalates to the
membership protocol as in step 2 below. This ensures that directory entries never
get stuck in the Invalidating state due to transport-level failures.
Invalidation ack timeout with escalation: Each invalidation request sent by the home node in step 3c carries a 200 μs timeout. The home node does NOT proceed with the ownership transfer until all readers have acknowledged invalidation or been removed from the reader set — doing so would violate coherence (the stale reader could read data that the new exclusive owner has since modified). If a reader does not acknowledge within 200 μs, the home node escalates:
- Retry (up to 3 attempts, 200 μs apart): The home node re-sends the invalidation request. A live but slow reader (e.g., handling a long interrupt or scheduling delay) will respond on retry.
- Suspect (after 600 μs total): The home node reports the non-responding reader to the membership protocol (Section 5.8) as suspect. If the node is genuinely unreachable, the membership protocol will mark it Suspect after 3 missed heartbeats (300 ms) and Dead after 10 missed heartbeats (1000 ms, per Section 5.8), at which point its reader bit is cleared from all directory entries.
- Proceed after removal: Once the reader has either acknowledged the invalidation or been marked Dead by the membership protocol (at which point its reader bit is cleared from the directory entry), the ownership transfer proceeds. The write-faulting thread blocks during escalation but is guaranteed forward progress — either the reader responds or the membership protocol eventually marks it Dead and removes it.
This ensures strict coherence: no node can hold a read-only mapping while another node holds exclusive ownership. The worst-case latency for a write fault with an unresponsive reader is ~1000 ms (the membership dead timeout: 10 missed heartbeats at 100 ms intervals, per Section 5.8), compared to Linux's 10-30 second fencing delay. In the common case (all readers responsive), the additional cost is zero — the 200 μs timeout never fires.
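The escalation ladder above can be condensed into a small decision function; the function name and the `reported` flag are illustrative bookkeeping kept by the home node per outstanding invalidation:

```rust
#[derive(Debug, PartialEq)]
pub enum InvAckAction {
    /// Re-send the invalidation (attempt number shown), 200 μs apart.
    Retry(u32),
    /// 600 μs elapsed (3 retries): report the reader as suspect to the
    /// membership protocol (Section 5.8).
    ReportSuspect,
    /// Already reported: block until the membership protocol either hears
    /// from the reader or marks it Dead and clears its reader bit.
    WaitForMembership,
}

/// Decision on each 200 μs invalidation-ack timeout for one reader.
pub fn on_inv_ack_timeout(retries_done: u32, reported: bool) -> InvAckAction {
    if retries_done < 3 {
        InvAckAction::Retry(retries_done + 1)
    } else if !reported {
        InvAckAction::ReportSuspect
    } else {
        InvAckAction::WaitForMembership
    }
}
```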
Maximum Invalidating state duration: A directory entry remains in the Invalidating
state from step 3b (when the home node sets it) until step 3f (when all invalidation
acks arrive and the state transitions to Exclusive). The maximum duration is bounded
by the invalidation ack timeout escalation above: 200 μs initial timeout, up to 3
retries (600 μs), then escalation to the membership protocol which marks an
unresponsive node Dead after 1000 ms (10 missed heartbeats at 100 ms intervals, per
Section 5.8). At that point, the home node removes the unresponsive node from the
sharer set and proceeds with the state transition. Therefore, the worst-case
Invalidating duration is bounded by the membership dead timeout (~1000 ms). Any
concurrent read-fault that arrives at the home node during this window will block
on the per-entry wait queue (see read-fault flow step 4) until the state resolves.
The waiters are descheduled during this time, not spinning, allowing the CPU to
process other requests.
Wait queue allocation (Section 3.14 rule): Wait queue heads are drawn on demand from a dedicated pre-allocated pool (sized at region creation time, never from the general slab allocator). The per-entry pointer in `DsmDirectoryEntry` (`wait_queue: AtomicPtr<WaitQueueHead>`) is null until the first blocking wait occurs on that entry. On first contention, a wait queue head is acquired from the pool and stored via CAS. This "lazily drawn from pre-allocated pool" design satisfies the Section 3.14 rule (no heap allocation in fault handlers) while avoiding the memory overhead of embedding a full wait queue head in every directory entry (most entries never experience contention).
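A sketch of the lazy CAS-install, with `draw`/`put_back` closures standing in for the pre-allocated pool; a thread that loses the publish race returns its drawn head to the pool and uses the winner's:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Placeholder for the real wait queue head type.
pub struct WaitQueueHead { pub id: usize }

/// Returns the entry's wait queue head, installing one from the pool on
/// first contention. Exactly one CAS winner publishes its head; the loser
/// returns its drawn head via `put_back` and adopts the winner's pointer.
pub fn get_or_install(
    slot: &AtomicPtr<WaitQueueHead>,
    draw: impl FnOnce() -> *mut WaitQueueHead,
    put_back: impl FnOnce(*mut WaitQueueHead),
) -> *mut WaitQueueHead {
    let cur = slot.load(Ordering::Acquire);
    if !cur.is_null() {
        return cur; // fast path: already installed
    }
    let fresh = draw();
    match slot.compare_exchange(ptr::null_mut(), fresh, Ordering::AcqRel, Ordering::Acquire) {
        Ok(_) => fresh,
        Err(winner) => {
            put_back(fresh); // lost the race: recycle into the pool
            winner
        }
    }
}
```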
Invalidating state blocking mechanism: Each DsmDirectoryEntry includes a
per-entry spinlock (`entry_lock: SpinLock`) and a kernel wait queue
(`wait_queue: AtomicPtr<WaitQueueHead>`) for blocking on state transitions. The seqlock
(`sequence` field) remains the fast-path mechanism for read-side directory lookups,
but the blocking/wakeup path uses a separate spinlock because seqlocks cannot be
held across schedule() (the seqlock write side disables preemption internally,
and sleeping with preemption disabled is a deadlock).
When a request handler encounters the Invalidating state:
1. The handler acquires `entry_lock` (per-entry spinlock).
2. The handler re-checks the state under the spinlock. If no longer Invalidating, it releases the spinlock and proceeds (avoids unnecessary sleep).
3. The handler calls `prepare_to_wait(&entry.wait_queue, TASK_INTERRUPTIBLE)` — this adds the handler to the wait queue while holding the spinlock, preventing the race where a waker calls `wake_up_all()` between the lock release and the wait queue addition.
4. The handler releases `entry_lock`.
5. The handler calls `schedule()`, which deschedules the thread. If a concurrent wakeup occurred between steps 3 and 5, `schedule()` returns immediately without sleeping (the standard Linux wait pattern).
6. On wakeup, the handler calls `finish_wait(&entry.wait_queue)` and re-reads the directory entry under the seqlock (optimistic read path) to proceed.
When the ownership transfer completes (step 3f), the invalidating handler:
- Acquires the seqlock writer (CAS even→odd, provides writer serialization).
- Transitions state from Invalidating to Exclusive.
- Releases the seqlock writer (writes the sequence back to even).
- Acquires entry_lock, calls wake_up_all(&entry.wait_queue), releases
entry_lock.
Wait queue safety invariant: The wait queue is protected by entry_lock (a
per-entry spinlock), NOT by the seqlock. Waiters add themselves to the queue under
entry_lock (step 3), and wakers hold entry_lock while calling wake_up_all().
The seqlock writer and entry_lock are independent — the seqlock serializes
directory entry updates (state transitions), while entry_lock serializes wait
queue operations. Lock ordering: seqlock writer THEN entry_lock (never reversed).
This blocking mechanism ensures that the home node can process thousands of
concurrent DSM requests without CPU-starving due to spin-waits. The wait queue
is per-entry (not global), so blocking on one page's invalidation does not
affect other pages. The wait queue head is allocated lazily — the
DsmDirectoryEntry stores only a pointer (wait_queue: *mut WaitQueueHead,
8 bytes) that is null until the first blocking wait occurs. Most entries never
experience contention, so the common case pays only 8 bytes of pointer overhead
per entry rather than embedding a full 16-32 byte wait queue head in every entry.
Thread pool exhaustion prevention: The blocking mechanism above allows
request handler threads to sleep on per-entry wait queues. This creates a
thread pool exhaustion risk: if all threads in the home node's request handler
pool block on Invalidating state wait queues, no thread remains to process
the invalidation ACK completions that would wake them — a deadlock.
UmkaOS prevents this by separating the two paths:
- Request handler pool (bounded, per-NUMA-node): Processes incoming DSM directory lookup requests (read-faults, write-fault ownership requests). These threads may block on per-entry wait queues when encountering the `Invalidating` state. Pool size is configurable (default: 2 × cpu_count per NUMA node, minimum 4).
- RDMA completion pool (separate, never blocks on directory state): Processes invalidation ACK completions, page transfer confirmations, and membership heartbeat responses. These threads run to completion without blocking on any DSM directory lock or wait queue — their only operations are: (a) update the directory entry state under the seqlock, (b) wake waiters on the per-entry wait queue, (c) send follow-up RDMA messages. Because they never sleep on directory state, they cannot be starved by request handler blocking.
The two pools use separate RDMA completion queues (CQs): request handler
threads poll the request CQ, completion threads poll the ACK CQ. This
hardware-level separation ensures that ACK processing is never queued behind
blocked request handlers. Even if every request handler thread is sleeping
on an Invalidating wait queue, ACK completions are processed by the
completion pool, which transitions the directory entry to Exclusive and
wakes the blocked handlers.
Seqlock reader retry policy: The home node's request handler reads the directory entry under the seqlock (optimistic read). If the seqlock sequence number indicates a concurrent writer (odd sequence), the reader retries. To prevent live-lock under sustained write contention:
```rust
const SEQLOCK_SPIN_RETRIES: u32 = 4;          // fast path: retry in-line
const SEQLOCK_EXTENDED_SPIN_RETRIES: u32 = 8; // medium path: spin_loop() hint between retries

fn seqlock_read_with_backoff(entry: &DsmDirectoryEntry) -> DsmDirectorySnapshot {
    for attempt in 0..(SEQLOCK_SPIN_RETRIES + SEQLOCK_EXTENDED_SPIN_RETRIES) {
        let seq = entry.sequence.load(Acquire);
        if seq & 1 == 0 {
            // Writer not active. Try optimistic read.
            let snapshot = entry.memcpy_snapshot();
            if entry.sequence.load(Acquire) == seq {
                return snapshot; // Success: consistent read.
            }
        }
        if attempt >= SEQLOCK_SPIN_RETRIES {
            // spin_loop() is NOT a scheduler yield. It emits a PAUSE instruction
            // (x86) or YIELD hint (ARM) that signals the CPU to save power and
            // release pipeline resources to the sibling hyperthread. The calling
            // thread does NOT relinquish its timeslice.
            core::hint::spin_loop();
        }
    }
    // Exhausted retries. Fall through to entry_lock path:
    // acquire entry_lock, read entry under spinlock (guaranteed progress),
    // release entry_lock. This path is strictly slower but bounded.
    entry.read_under_spinlock()
}
```
The 3-tier retry (in-line spin → `spin_loop()` hint → spinlock fallback) guarantees forward progress: worst case is ~200-500 ns for the spinlock fallback, which is negligible compared to the ~3-5 μs RDMA round-trip that follows. The spinlock fallback is expected <0.01% of reads (only under sustained write storms).
RDMA operation ordering: This section describes the ordering protocol for the split-transfer case (owner sends data to requester while home updates directory).
In the actual fault flows above, the home node updates the directory FIRST (under the seqlock), THEN sends DataResp or forwards FwdGetM. The directory update is a local operation on the home node (no cross-QP ordering needed). The split-transfer protocol below applies to the specific case where the owner (Node B) sends data directly to the requester (Node A) via RDMA Write, and the home needs to know that the data has arrived before allowing further state transitions:
- Owner sends page data to requester via RDMA Write (on QP to requester).
- Owner sends a completion notification to requester via RDMA Send on the SAME QP. RC (Reliable Connection) QP ordering guarantees that the Write data is visible at the receiver before the Send completion is processed (in-order delivery).
- The requester, upon receiving the Send completion, knows the data is present. No explicit ack to the home is needed — the home updated its directory before forwarding, and the requester's subsequent InvAck collection provides implicit ordering (the requester cannot complete InvAck collection until it has the data).
This avoids the extra round-trip that a home-ack model would require. The home does NOT need to wait for data arrival at the requester — it updated its directory atomically before sending, and any concurrent requests see the new directory state.
Security requirement — Invalidation ACK authentication: Invalidation ACKs MUST be authenticated to prevent spoofing by malicious nodes. Without authentication, a malicious node could send forged invalidation ACKs to the requester, causing it to prematurely install the write mapping while stale readers still hold copies — violating cache coherence and potentially leaking data.
Key design constraint: InvAcks are sent directly to the requester (Node A), NOT to the home node — this is the standard MOESI ack_count model where the home remains available for other requests. Therefore, the requester must verify the HMAC, not the home.
Authentication protocol:
- Each invalidation request from the home node (step 3c) MUST include a cryptographically random 128-bit `invalidation_nonce` generated fresh for each invalidation batch.
- The home MUST also include the `invalidation_nonce` in the `DataResp` or `AckCount` message sent to the requester (Node A). This is how the requester learns the nonce — it cannot verify InvAcks without knowing the expected nonce.
- The HMAC is computed using a cluster-wide per-epoch key (not a pairwise session key), derived via HKDF-SHA256 from the shared cluster secret (Section 5.2). This ensures that any node can verify InvAcks from any other node without needing a separate pairwise key setup for each requester-sharer pair.
- The invalidation ACK from each reader node MUST include:
  - The same `invalidation_nonce` (proving the ACK is a response to this specific invalidation request, not a replay)
  - An HMAC-SHA-256 computed over `{nonce || page_va || reader_node_id}` using the cluster-wide epoch key
- The requester (Node A) MUST verify the HMAC before accepting each InvAck. Invalid or missing HMACs MUST be treated as if the ACK was never received (triggering timeout and escalation to membership protocol).
- To limit overhead, the session key is established once during cluster join and rotated every 24 hours. Key rotation uses the authenticated control channel (Section 5.2). During routine rotation, both old and new session keys are accepted for a 30-second grace period (same dual-key acceptance window as emergency re-key below). DSM invalidation ACKs and control messages in flight during rotation are accepted with either key. After the grace period, the old key is zeroized from memory. This ensures that in-flight messages sent just before rotation are never rejected.
- Session key compromise recovery: If a session key is suspected of compromise, the detecting node initiates emergency key rotation per the protocol below.
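The MAC input `{nonce || page_va || reader_node_id}` can be serialized deterministically before it is handed to the HMAC primitive. A minimal sketch, assuming a u64 page VA, a u16 node id, and little-endian byte order per the wire-format rule in Section 6.1.2 (the HMAC-SHA-256 itself is assumed to come from the kernel crypto layer):

```rust
/// Serialize the InvAck MAC input: nonce (16 bytes) || page_va (8 bytes,
/// little-endian) || reader_node_id (2 bytes, little-endian) = 26 bytes.
/// Field widths are illustrative assumptions consistent with Section 6.1.2.
pub fn invack_mac_input(nonce: &[u8; 16], page_va: u64, reader_node_id: u16) -> [u8; 26] {
    let mut buf = [0u8; 26];
    buf[..16].copy_from_slice(nonce);
    buf[16..24].copy_from_slice(&page_va.to_le_bytes());
    buf[24..26].copy_from_slice(&reader_node_id.to_le_bytes());
    buf
}
```

Fixed-width, explicit-endian serialization avoids ambiguity attacks where differently structured inputs concatenate to the same byte string.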
Session Key Compromise Recovery Protocol
Session keys authenticate DSM invalidation ACKs (above) and control channel messages (Section 5.2). Compromise of a session key allows an attacker to forge ACKs, potentially corrupting directory coherence. This section specifies detection, coordination, and recovery.
Detection — A node suspects session key compromise when it observes repeated HMAC verification failures from a specific peer. The detection state machine per peer:
State machine per (local_node, peer_node) pair:
NORMAL
│ HMAC verification failure from peer
│ → increment fail_counter, record timestamp
│ → if fail_counter >= threshold within window → transition to SUSPECT
▼
SUSPECT
│ Send KEY_ROTATE_URGENT to peer (Ed25519-signed, independent of session key)
│ Start re-key timer (3× measured RTT to peer)
│ → if peer responds with KEY_ROTATE_ACK → transition to REKEYING
│ → if timer expires → transition to EVICTING
▼
REKEYING
│ Both sides perform X25519 DH exchange (authenticated via Ed25519 identity keys)
│ Derive new session key via HKDF-SHA256
│ Enter dual-key acceptance window
│ → if both sides confirm (KEY_ROTATE_CONFIRM) → transition to NORMAL, zeroize old key
│ → if confirm not received within 3× RTT → transition to EVICTING
▼
EVICTING
│ Escalate to membership protocol ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery))
│ → Peer is marked Dead, all QPs destroyed, all capabilities revoked
Detection threshold tuning — The threshold is configurable via
cluster.hmac_fail_threshold and cluster.hmac_fail_window_ms:
| Network | RTT | Recommended threshold | Window | Rationale |
|---|---|---|---|---|
| RDMA LAN (<10 μs) | ~1-5 μs | 5 failures | 60s | Low loss rate; 5 failures is highly anomalous |
| TCP LAN (<1 ms) | ~0.2-1 ms | 10 failures | 120s | TCP retransmits can cause transient HMAC mismatches if segments arrive out of order during congestion |
| WAN (10-100 ms) | ~20-100 ms | 20 failures | 300s | Higher packet loss, reordering, and latency variance; larger window avoids false positives |
False positives (legitimate HMAC failures misidentified as compromise) are handled by the re-keying protocol itself — re-keying is safe even if triggered unnecessarily. The only cost of a false positive is one X25519+HKDF key derivation (~50 μs CPU time) and ~2 RTTs of latency.
Multi-node coordination — When multiple nodes simultaneously detect HMAC failures from the same peer (e.g., nodes A and B both detect failures from node C), a tie-breaking protocol prevents conflicting concurrent re-key exchanges:
- Each `KEY_ROTATE_URGENT` message includes the sender's `node_id`.
- The receiving node (C) may receive multiple `KEY_ROTATE_URGENT` messages from different peers (A and B) within a short interval.
- Node C processes re-key requests sequentially in `node_id` order: lowest `node_id` is re-keyed first. Concurrent requests from higher `node_id` peers are queued (acknowledged with `KEY_ROTATE_QUEUED`) and processed after the current re-key completes.
- If node A and node B are both trying to re-key with C, and A has `node_id` < B:
  - A↔C re-key proceeds immediately.
  - B receives `KEY_ROTATE_QUEUED` from C, and waits for `KEY_ROTATE_READY` from C before starting its own DH exchange.
  - Maximum serialization delay: one re-key duration (~2 RTTs) per queued peer.
- If two nodes try to re-key with each other (A detects failures from B, and B detects failures from A simultaneously): the node with lower `node_id` is the initiator (sends the DH ephemeral public key first). The other node becomes the responder. Both detect the symmetric case via receiving `KEY_ROTATE_URGENT` while already in `SUSPECT` state for the same peer.
Interaction with in-flight DSM operations — During re-keying, DSM invalidation and ACK processing continue without interruption:
- Dual-key acceptance window: During the `REKEYING` state (~1-2 RTTs), both old and new session keys are accepted for incoming HMAC verification. This covers in-flight invalidation ACKs that were signed with the old key before the peer computed the new key.
- Nonce binding prevents confusion: Each invalidation batch has a unique 128-bit nonce. An ACK signed with the old key for batch N is valid only for batch N (the HMAC covers `{nonce || page_va || reader_node_id}`). Even during dual-key acceptance, there is no risk of an attacker replaying an old ACK for a new batch because the nonce will not match.
- Seqlock isolation: Directory state transitions (`Invalidating` → `Exclusive`) are protected by the per-entry seqlock (Section 6.4). The seqlock is independent of key state — re-keying does not hold or wait for directory locks, and directory operations do not hold or wait for re-keying locks.
- Key zeroization timing: The old key is zeroized only after both sides confirm the new key (`KEY_ROTATE_CONFIRM`). At that point, all in-flight messages signed with the old key have either been received (within the dual-key window) or timed out. The timeout escalation (Section 6.5 retry sequence) handles any ACKs lost during the window.
Recovery guarantee: After a successful re-key, all subsequent HMAC operations use
the new key. If the old key was compromised, the attacker can no longer forge messages.
Directory coherence is preserved throughout because invalidation correctness depends on
nonce uniqueness and seqlock atomicity, not on which session key was used to authenticate
the ACK.
Duplicate ACK handling: ACKs received after the directory state has already transitioned to Exclusive (duplicates caused by network reordering) MUST be silently discarded without error — the nonce lookup will fail since the invalidation batch is complete.
Performance impact: HMAC-SHA-256 verification adds ~500 ns to ACK processing on the home node. This is negligible compared to the ~200 μs timeout window. The nonce prevents replay attacks without requiring per-ACK signatures.
HMAC verification rate limiting — To prevent a malicious or malfunctioning peer from exhausting CPU time on the home node by flooding HMAC-bearing messages, each peer connection has a per-peer rate limiter that bounds the number of HMAC verifications processed per second:
```rust
use core::sync::atomic::{AtomicU64, Ordering};

/// Per-peer rate limiter for HMAC-SHA-256 verification on the home node.
/// One instance per (local_node, remote_peer) pair, stored alongside the
/// peer's session key in the cluster key table.
///
/// Uses a token-bucket algorithm: tokens are consumed on each HMAC
/// verification attempt and refilled at a fixed rate. When the bucket is
/// empty, incoming messages from the peer are dropped without HMAC
/// verification (the message payload is never processed).
///
/// Concurrency: `tokens` and `last_refill_ns` are `AtomicU64` for lock-free
/// hot-path operation. The refill is performed inline by the verifying CPU
/// using a CAS loop — no background timer thread. This is acceptable because
/// refill arithmetic is ~5 ns (two u64 multiplies + one CAS), negligible
/// compared to the ~500 ns HMAC verification itself.
pub struct HmacRateLimiter {
    /// Current token count. Each HMAC verification consumes one token.
    /// Decremented atomically (CAS loop) on each verification attempt.
    /// If zero, the verification is skipped and the message is dropped.
    pub tokens: AtomicU64,
    /// Maximum verifications per second per peer. Default: 100,000.
    ///
    /// Rationale: at ~500 ns per HMAC-SHA-256, 100K verifications/sec
    /// consumes ~50 ms of CPU time — 5% of one core. This is generous
    /// enough for legitimate DSM traffic (a fully saturated 200 Gb/s
    /// RDMA link transferring 4KB pages generates at most ~6.25M
    /// page transfers/sec, but each transfer produces at most one
    /// HMAC-bearing ACK, and a single home node is unlikely to be the
    /// target of all transfers). The cap prevents a single rogue peer
    /// from consuming more than 5% of a core on HMAC verification.
    pub max_per_sec: u64,
    /// Timestamp (monotonic nanoseconds) of the last token refill.
    /// Used to compute elapsed time and proportional token replenishment.
    pub last_refill_ns: AtomicU64,
}

impl HmacRateLimiter {
    /// Attempt to consume one token. Returns `true` if the verification
    /// should proceed, `false` if the peer has exceeded its rate limit.
    ///
    /// On `false`, the caller MUST:
    /// 1. Drop the message without processing the payload.
    /// 2. Emit an FMA warning: `FmaEvent::HmacRateLimitExceeded {
    ///    peer_id, current_rate, max_rate }`.
    /// 3. Increment the per-peer `hmac_drops` counter (exported via
    ///    `/sys/kernel/umka/cluster/peer/<id>/hmac_drops`).
    ///
    /// The FMA warning is itself rate-limited to 1 per second per peer
    /// (via the standard FMA deduplication window) to avoid log flooding.
    pub fn try_consume(&self) -> bool {
        // 1. Compute elapsed time since last refill. `monotonic_ns()` is the
        //    kernel monotonic clock source used throughout this chapter.
        let now = monotonic_ns();
        let last = self.last_refill_ns.load(Ordering::Relaxed);
        let elapsed = now.saturating_sub(last);
        // Only the CPU that wins the CAS on `last_refill_ns` performs the
        // refill, so tokens are never replenished twice for the same window.
        if elapsed > 0
            && self
                .last_refill_ns
                .compare_exchange(last, now, Ordering::Relaxed, Ordering::Relaxed)
                .is_ok()
        {
            // 2. Add proportional tokens: elapsed_ns * max_per_sec / 1e9.
            let refill = elapsed.saturating_mul(self.max_per_sec) / 1_000_000_000;
            // 3. Clamp to max_per_sec (bucket capacity = one second's worth).
            let _ = self.tokens.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| {
                Some(cur.saturating_add(refill).min(self.max_per_sec))
            });
        }
        // 4. CAS-decrement tokens by 1. `checked_sub` yields `None` at zero,
        //    so the update fails and the message is dropped.
        self.tokens
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| cur.checked_sub(1))
            .is_ok()
    }
}
```
Default: 100,000 verifications/sec/peer. Configurable per-peer via:
/sys/kernel/umka/cluster/peer/<peer_id>/hmac_rate_limit
# Read/write. Accepts integer (verifications per second). Default: 100000.
# Setting to 0 disables rate limiting for this peer (not recommended).
Interaction with the compromise detection state machine: Dropped messages due
to rate limiting do NOT increment the fail_counter in the HMAC compromise
detection state machine (above). Rate-limited drops are a resource-protection
mechanism, not evidence of key compromise. A rogue peer that floods HMAC messages
is handled by rate limiting (drops) and, if it persists, by the cluster membership
protocol's misbehavior detection (separate from HMAC failure counting).
Nonce Lifecycle and Replay Prevention:
Each invalidation batch carries a 128-bit cryptographically random nonce
(InvalidationNonce: [u8; 16]). The nonce is generated by the sender using
UmkaOS's CSPRNG (getrandom(2) equivalent).
Retention window: Each receiving node maintains a
NonceWindow: HashMap<InvalidationNonce, u64> (value = insertion timestamp in ns
for expiry) of nonces seen in the last 30 seconds. HMAC nonce verification
is configurable per-region via DsmRegionConfig::nonce_hmac_enabled: bool
(default: true for cross-trust-domain regions, false for intra-rack
clusters where all nodes share a hardware root of trust). When disabled,
the NonceWindow is not allocated and no per-ACK overhead is incurred.
When enabled, the per-peer rate limiter (100K verifications/sec/peer, see above)
bounds practical NonceWindow growth to ~3M entries per peer (100K × 30s = 3M
entries, ~100 MB worst case at ~33 bytes per entry including SwissTable
metadata). In practice <10K entries
under normal load. Defense-in-depth: if entries exceed 500K (configurable
max_nonce_window_entries, default 500,000), oldest entries are evicted
immediately. At 500K entries x ~33 bytes = ~16.5 MB peak, which is
proportionate for a security-critical replay prevention table.
Collection Policy justification: HashMap is used here despite the
warm-path insertion temperature because: (a) the key is a 128-bit nonce
(not an integer -- XArray requires integer keys), (b) no range queries are
needed, (c) the HashMap is initialized with
HashMap::with_capacity(16_384) at region join time (small initial
allocation, ~540 KiB), and grows dynamically as entries accumulate.
Hashbrown's amortized O(1) growth adds at most ~8 reallocations to reach
500K entries. This avoids the prior design's ~132 MB upfront allocation
(with_capacity(4_000_000)) for a table that typically holds <10K entries.
The per-entry cost of SipHash on 16-byte keys (~20 ns) is within budget at
100K ops/sec (~2 ms/sec total CPU). BTreeSet's O(log N) is unnecessary
since range queries are not used.
The 30-second expiry window is maintained by a separate
time-ordered eviction performed on the RCU grace period thread.
On receiving an ACK:
1. Verify nonce is in the pending-invalidation table (sent but not yet ACKed).
2. Verify nonce is NOT in NonceWindow (replay check).
3. Mark the batch as ACKed.
4. Insert nonce into NonceWindow.
Expiry: NonceWindow entries are purged after 30 seconds (wall clock). This
window is set to 3× the maximum expected network round-trip time (10s RTT bound
in UmkaOS cluster config). Purging runs on the RCU grace period thread, not on the
ACK hot path.
Batch-nonce binding: Each invalidation message includes the batch sequence number AND the nonce. A replayed ACK with an old nonce for a new batch sequence number fails the pending-table lookup (the old nonce is no longer in the pending table). The nonce and sequence number together provide double validation against replay.
Clock skew: Since replay prevention uses wall-clock expiry, nodes must maintain clock synchronization within the replay window. UmkaOS's cluster time sync (Section 5.8) guarantees ±1s drift, well within the 30s window.
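The ACK-side steps 1-4 can be condensed into a user-space sketch. NonceWindow below is a hypothetical simplification: it folds the pending-invalidation table into the same struct and treats each batch as a single ACK, whereas the real protocol collects ack_count ACKs per batch and runs expiry on the RCU grace period thread.

```rust
use std::collections::{HashMap, HashSet};

type Nonce = [u8; 16];

/// Sketch of the replay check (illustrative names; not kernel API).
pub struct NonceWindow {
    pending: HashSet<Nonce>,   // sent-but-unACKed invalidation batches
    seen: HashMap<Nonce, u64>, // nonce -> insertion timestamp (ns)
    retention_ns: u64,         // 30-second window
}

impl NonceWindow {
    pub fn new() -> Self {
        NonceWindow {
            pending: HashSet::new(),
            // Small initial capacity, amortized O(1) growth (see the
            // Collection Policy justification above).
            seen: HashMap::with_capacity(16_384),
            retention_ns: 30_000_000_000,
        }
    }

    pub fn record_sent(&mut self, nonce: Nonce) {
        self.pending.insert(nonce);
    }

    /// Returns true iff the ACK is fresh: its nonce is pending (step 1) and
    /// not previously seen (step 2). On success the batch is marked ACKed
    /// (step 3) and the nonce enters the replay window (step 4).
    pub fn accept_ack(&mut self, nonce: Nonce, now_ns: u64) -> bool {
        if !self.pending.contains(&nonce) {
            return false; // not a batch we sent, or already consumed
        }
        if self.seen.contains_key(&nonce) {
            return false; // replayed ACK
        }
        self.pending.remove(&nonce);
        self.seen.insert(nonce, now_ns);
        true
    }

    /// Purge entries older than the retention window (off the hot path).
    pub fn expire(&mut self, now_ns: u64) {
        let cutoff = now_ns.saturating_sub(self.retention_ns);
        self.seen.retain(|_, &mut t| t >= cutoff);
    }
}
```

A replayed ACK with a valid-looking nonce fails the pending-table lookup, matching the batch-nonce binding argument above.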
6.6 DSM Coherence Protocol: MOESI¶
UmkaOS's DSM implements a distributed directory-based MOESI protocol over RDMA. MOESI
is chosen over MSI or MESI because the Owned state (DsmPageState: SharedOwner)
allows a node to service read requests for dirty data directly — without first writing
the data back to the home node's memory. This avoids an extra network round-trip and
reduces home-node memory-controller traffic in multi-reader scenarios.
Each 4KB page has a home node (determined by virtual address hash via the hash function in Section 6.4) that maintains the directory entry tracking which nodes hold copies and in which state.
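Home-node selection can be sketched as follows. The actual hash function is defined in Section 6.4; the mixing constants and the `home_node` helper below are hypothetical placeholders, shown only to illustrate the `hash(region_id, VA) % participant_count` shape.

```rust
/// Sketch of home-node selection (illustrative hash, not the Section 6.4
/// function). Any stable 64-bit hash works, as long as every node computes
/// the same value for the same (region_id, VA). participant_count must be > 0.
fn home_node(region_id: u64, page_va: u64, participant_count: u64) -> u64 {
    // Hypothetical mix over the region id and the 4KB page frame number.
    let mut h = region_id ^ (page_va >> 12).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    h ^= h >> 33;
    h % participant_count
}
```

Because the input is the virtual address (not a physical frame), every peer deterministically agrees on the same home node without coordination.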
6.6.1 MOESI Protocol States¶
Each DSM page tracked per-node can be in one of five steady-state MOESI protocol states.
The transition table (Section 6.6) and
performance table (Section 14.8) use the MOESI
single-letter abbreviations (M, O, E, S, I) for compactness because the transition
logic is pure protocol theory. The DsmPageState enum
(Section 6.3) maps these to implementation-level variants
as shown in the mapping table on DsmPageState (repeated here for reference):
| MOESI State | Abbrev | DsmPageState Variant | Meaning |
|---|---|---|---|
| Modified | M | Exclusive (dirty bit set) | Node has the only valid copy; dirty (not written to home memory). Node must supply data on any incoming request. |
| Owned | O | SharedOwner | Node has a dirty copy; other nodes may have SharedReader copies; home memory is stale. Node must supply data on read requests without first flushing to home. |
| Exclusive | E | Exclusive (dirty bit clear) | Node has the only valid copy; clean (matches home memory). No remote copies exist. |
| Shared | S | SharedReader | Node has a read-only copy; home memory is up-to-date; other nodes may also have SharedReader copies. |
| Invalid | I | NotPresent | Node has no valid copy; must fault to obtain. |
The SharedOwner state (MOESI: O) is the key differentiator from MESI: when a node in the
Exclusive state with dirty data (MOESI: M) receives a FwdGetS (another node wants a
read copy), it transitions to SharedOwner (MOESI: O) rather than writing back to home
first. The dirty data stays with the owner; home memory remains stale, but the owner
supplies all subsequent read requests. Only a PutO eviction or a FwdGetM (exclusive
request) forces a writeback.
6.6.2 Directory Entry at the Home Node¶
The home node maintains a compact directory entry for each shared page. The home node's directory is stored in the radix tree described in Section 6.4.
// umka-core/src/dsm/moesi.rs
/// Per-page directory entry maintained at the home node for MOESI coherence.
/// The home node uses this to track which peers hold copies and in what state,
/// so it can send the correct forwarding or invalidation messages on a miss.
///
/// Write-side protection: `lock` (SpinLock) serializes multi-field modifications.
/// Held only for a single directory update step; never across a network round-trip.
/// Read-side protection: local CPU readers use the outer `DsmDirectoryEntry::sequence`
/// seqlock for lock-free consistent snapshots — they never acquire `lock`.
///
/// The `sharers` field is a `RegionBitmap` ([Section 6.1](#dsm-foundational-types))
/// with the same word count W as all other bitmaps in this region.
pub struct DsmDirEntry {
/// State of the page from the home node's perspective.
pub state: DsmHomeState, // 1 byte (repr(u8))
pub _pad1: [u8; 1],
/// Owner's slot index in this region. SLOT_INVALID if no owner (Uncached).
pub owner_slot: RegionSlotIndex, // 2 bytes
pub _pad2: [u8; 4], // align to 8
/// Peer that holds the Exclusive (MOESI: M or E) or SharedOwner (MOESI: O) copy, if any.
/// `None` means home memory is authoritative (Uncached state).
pub owner: Option<PeerId>, // 8 bytes (niche optimization: PeerId(NonZeroU64))
/// Peers that have SharedReader (MOESI: S) or SharedOwner (MOESI: O) copies.
/// Bit N = slot N in this region has a copy eligible to supply read data.
/// `RegionBitmap` with W words — O(1) test/set/clear, O(W) iteration.
pub sharers: RegionBitmap, // W × 8 bytes
/// Lock protecting this entry during message processing.
/// Held only for a single directory update step; never across a network
/// round-trip or any blocking call.
pub lock: SpinLock<()>, // 4 bytes
pub _pad3: [u8; 4], // align to 8
}
// PeerId wraps NonZeroU64, so Option<PeerId> is 8 bytes via niche optimization.
const_assert!(core::mem::size_of::<Option<PeerId>>() == 8);
/// State of a page from the home node's directory perspective.
/// Maps to requestor-visible MOESI states as follows:
/// Uncached → home is authoritative (no remote copies)
/// Shared → one or more peers have S copies; home memory matches
/// Exclusive → one peer has Exclusive (MOESI: E); home memory matches (owner field set)
/// Modified → one peer has Exclusive-dirty (MOESI: M) or SharedOwner (MOESI: O); home memory is stale (owner field set)
#[repr(u8)]
pub enum DsmHomeState {
/// No peer has a copy. Home memory is the only valid copy.
Uncached = 0,
/// One or more peers have read-only SharedReader (MOESI: S) copies.
/// Home memory is up-to-date. `sharers` bitmap is non-empty; `owner` is None.
Shared = 1,
/// Exactly one peer has an Exclusive clean (MOESI: E) copy.
/// Home memory matches. `owner` is set; `sharers` is empty.
Exclusive = 2,
/// One peer has an Exclusive-dirty (MOESI: M) or SharedOwner (MOESI: O) copy;
/// home memory is stale. `owner` is set; `sharers` may be non-empty
/// (non-empty = owner is in SharedOwner state, supplying data to sharers;
/// empty = owner is in Exclusive-dirty state, sole copy).
Modified = 3,
/// *(transient)* Used only when home==owner (Case 3 in the write-fault flow):
/// the home node is preparing its own page data for transfer. Concurrent
/// requests that observe this state block on the per-entry wait queue until
/// the transition completes. In the ack_count model, the home does NOT enter
/// Invalidating for the general case — it transitions directly to
/// `Modified(owner=requester)` and the requester collects InvAcks.
/// Maximum duration: bounded by the membership dead timeout (~1000 ms,
/// 10 missed heartbeats at 100 ms intervals per
/// [Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--heartbeat-protocol)).
Invalidating = 4,
/// *(transient)* The page's home assignment is being transferred to another
/// node. Concurrent requests block on the per-entry wait queue until the
/// migration completes (collapses to `Exclusive` or `Uncached` at the old home,
/// new home directory re-created at the destination).
Migrating = 5,
}
Relationship to DsmPageState: DsmHomeState is the home node's directory view; DsmPageState (Section 6.3) is the per-node local view held in the requestor's page metadata. The two views are consistent: when the home directory says DsmHomeState::Modified with owner = Node B, Node B's local DsmPageState is Exclusive with dirty bit set (MOESI: M) if sharers is empty, or SharedOwner (MOESI: O) if sharers is non-empty. The transient states Invalidating and Migrating exist in both enums but serve different roles: DsmHomeState::Invalidating means the home directory is waiting for invalidation ACKs; DsmPageState::Invalidating means the local page is being invalidated by a remote request.
6.6.3 State Transitions — Requestor's View¶
The following table describes the complete MOESI transition function. The From and
To columns use MOESI single-letter abbreviations for the requestor's local state
(see Section 6.6 for the mapping to DsmPageState
variants). The Home Actions column uses DsmHomeState variant names (Uncached,
Shared, Exclusive, Modified) for directory state transitions. "Home Actions" are
performed atomically under the DsmDirEntry::lock; messages are sent after releasing the
lock to avoid holding it across network operations.
| From | Event | Message to Home | Home Actions | Message(s) from Home | To |
|---|---|---|---|---|---|
| I | Read miss | GetS(page, requester) | Uncached: send data from home memory; add requester to sharers; → Shared. Shared: send data; add requester; remain Shared. Exclusive: forward FwdGetS to owner (who will downgrade E → S, since E state means home memory is current); add requester to sharers; add current owner to sharers; clear owner; → Shared (home memory up-to-date; prior E holder and requester both have S copies). Modified: forward FwdGetS to owner; add requester to sharers; remain Modified. | DataResp(page, data, ack_count=0) (home-sourced) or DataFwd(page, data) (owner-sourced after FwdGetS) | S |
| I | Write miss | GetM(page, requester) | Uncached: send data from home memory; → Modified (owner = requester, sharers = ∅). Shared: send Inv to all sharers; send DataResp(data, ack_count=N) to requester; → Modified (owner = requester, sharers = ∅). Home does NOT wait — requester collects N InvAcks before installing write mapping. Exclusive: forward FwdGetM to owner; → Modified (owner = requester). Modified (no sharers): forward FwdGetM to owner; → Modified (owner = requester). Modified (with sharers — SharedOwner/O scenario where owner has dirty data and S readers exist): forward FwdGetM to owner; send Inv to all sharers; send AckCount(N_sharers) to requester (home does not have the data); → Modified (owner = requester, sharers = ∅). The owner sends DataFwd with the page data to the requester; sharers send InvAck to the requester. | DataResp(data, ack_count=N) + N × Inv to sharers (requester collects N InvAcks before entering M) | M (after N InvAcks) |
| S | Write upgrade | Upgrade(page, requester) | Send Inv to all other sharers; send AckCount(N−1) to requester; → Modified (owner = requester, sharers = ∅). Home does NOT wait — requester collects N−1 InvAcks. No data transfer — requester already has a clean copy. | AckCount(N−1) + (N−1) × Inv to other sharers | M (after N−1 InvAcks) |
| E | Local write (silent upgrade) | — (no message) | — (home already shows this node as exclusive owner; home directory state DsmHomeState::Exclusive remains unchanged — the home discovers dirtiness only on eviction via PutM instead of PutE) | — | M (set dirty bit via state_dirty.fetch_or(0x100, Release); no network traffic — this is the E→M optimization that distinguishes MOESI from MSI) |
| M | Eviction (dirty) | PutM(page, data) | Write data to home memory; → Uncached (owner = None). | PutAck(page) | I |
| O | Eviction (owned) | PutO(page, data) | Write data to home memory (home now up-to-date); set owner = None. If sharers becomes empty: → Uncached. If sharers non-empty: → Shared (home memory now up-to-date, remaining sharers have valid S copies). | PutAck(page) | I |
| E | Eviction (clean) | PutE(page) | Set state → Uncached (owner = None). No data transfer needed — home memory is already current. | PutAck(page) | I |
| S | Eviction (read copy) | PutS(page) | Remove requester from sharers bitmask. If sharers becomes 0, → Uncached. | PutAck(page) | I |
| E | FwdGetS received (from home, on behalf of a new reader) | — (response to home forward) | Home has already updated directory: added requester to sharers, added former E holder to sharers, cleared owner, state → Shared (home memory up-to-date). | Send DataFwd(page, data) to new reader node; self → S (clean — home memory is current since E state is clean; no need for O state). | S |
| M or O | FwdGetS received (from home, on behalf of a new reader) | — (response to home forward) | Home has already updated directory: added new reader to sharers, state remains Modified. | Send DataFwd(page, data) to new reader node; self → O (if was M, now dirty with sharers). | O |
| M or O | FwdGetM received (from home, on behalf of an exclusive requester) | — (response to home forward) | Home has already updated owner field to requester. | Send DataFwd(page, data) to requester; self → I (invalidate local copy). | I |
| O | Implicit writeback on FwdGetM | — | Home state → Modified (new owner = requester; sharers cleared). | — | I |
Notes on ack_count: When the home processes a GetM with N sharers (or an Upgrade with N−1
other sharers), it sends Inv to each sharer simultaneously with the DataResp to the
requester. The requester receives the DataResp containing ack_count = N, then waits
for exactly N (or N−1) InvAck messages from those sharers before installing the
write-capable mapping. This allows the home node to pipeline the invalidations without
waiting for all InvAcks itself, reducing write-miss latency.
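The requester-side InvAck counting can be sketched as a small atomic countdown. PendingWrite below is a hypothetical simplification of the fault-flow bookkeeping, not the actual kernel structure; it assumes each InvAck has already passed nonce/HMAC verification.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Sketch: the home sends ack_count = N with the DataResp; the requester
/// counts N verified InvAck messages before installing the write mapping.
struct PendingWrite {
    acks_remaining: AtomicU32,
}

impl PendingWrite {
    fn new(ack_count: u32) -> Self {
        PendingWrite { acks_remaining: AtomicU32::new(ack_count) }
    }

    /// Called once per verified InvAck. Returns true when the last ACK
    /// arrives and the page may enter Exclusive (MOESI: M). Must not be
    /// called more than ack_count times.
    fn on_inv_ack(&self) -> bool {
        self.acks_remaining.fetch_sub(1, Ordering::AcqRel) == 1
    }

    /// A write miss with ack_count == 0 (Uncached home) enters M immediately.
    fn ready(&self) -> bool {
        self.acks_remaining.load(Ordering::Acquire) == 0
    }
}
```

This is why the home can pipeline invalidations: it never tracks the countdown itself, only hands the requester the expected count.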
E→M silent upgrade: The DsmPageState::Exclusive variant represents both MOESI E
(clean) and M (dirty). The distinction is tracked by the per-page dirty bit
(DsmPage.state_dirty bit 8). On first local write to an E page, the kernel sets the
dirty bit without any state machine transition or network message. The home node's
directory state (DsmHomeState::Exclusive) remains unchanged — the home node discovers
the dirty state only when the node eventually evicts the page via PutM instead of
PutE. This silent upgrade is the performance reason MOESI adds the E state over MSI.
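The requestor-side transition table can be condensed into a small step function. This is a sketch: the Moesi and Event names are illustrative, ack collection is folded into a single WriteAcksCollected event, and transient states are omitted.

```rust
/// Minimal model of the requestor-side transitions in the table above
/// (sketch; the real code in umka-core/src/dsm/moesi.rs handles ack
/// counting, RDMA transfers, and transient states).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Moesi { M, O, E, S, I }

#[derive(Clone, Copy)]
enum Event {
    ReadMissData,       // DataResp/DataFwd received after GetS
    WriteAcksCollected, // all InvAcks collected after GetM/Upgrade
    LocalWrite,         // silent E -> M upgrade (no network traffic)
    FwdGetS,            // home forwarded a read request to us
    FwdGetM,            // home forwarded an exclusive request to us
    Evict,              // Put* sent and PutAck received
}

fn step(state: Moesi, ev: Event) -> Moesi {
    use Event::*;
    use Moesi::*;
    match (state, ev) {
        (I, ReadMissData) => S,
        (I, WriteAcksCollected) | (S, WriteAcksCollected) => M,
        (E, LocalWrite) => M,              // silent upgrade, dirty bit only
        (E, FwdGetS) => S,                 // clean copy: downgrade, no O needed
        (M, FwdGetS) | (O, FwdGetS) => O,  // keep dirty data, supply readers
        (M, FwdGetM) | (O, FwdGetM) | (E, FwdGetM) => I,
        (_, Evict) => I,
        (s, _) => s,                       // unmodeled combinations: no change
    }
}
```

The M/O-on-FwdGetS arm is the MOESI-specific path: the dirty owner becomes O and keeps serving reads instead of writing back to home.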
6.6.4 Message Types¶
All messages are exchanged over the RDMA transport (Section 5.4):
- Data messages (DataResp, DataFwd, PutM, PutO): use RDMA one-sided Write
for the 4KB payload to avoid receiver CPU involvement in data movement.
- Control messages (everything else): use RDMA two-sided Send/Receive.
When RDMA is unavailable, DSM falls back to TcpTransport
(Section 5.10). TCP
messages use 8-byte framing: [msg_len: u32 LE] [seq: u32 LE] followed by
ClusterMessageHeader + DsmWireHeader + payload.
ClusterTransport mapping: DSM uses the ClusterTransport trait
(Section 5.10) for all inter-node operations and never creates QPs, sockets,
or transport-specific objects directly. Each peer's transport is obtained
from PeerNode.transport (Arc<dyn ClusterTransport>). The mapping is:
- transport.send() / transport.send_reliable() for all coherence control
  messages (GetS, GetM, Inv, InvAck, FwdGetS, FwdGetM, AckCount, PutAck,
  Nack, PutE, PutS, Upgrade, WriteDiff, FwdDiff, causal messages, and
  anti-entropy headers).
- transport.fetch_page() for one-sided page retrieval (RDMA Read on RDMA
  peers, request-response on TCP peers, memcpy on CXL peers).
- transport.push_page() for one-sided page data transfer (DataResp, DataFwd,
  PutM, PutO, AntiEntropyData).
This abstraction allows DSM to operate over any transport — RDMA, TCP, CXL,
PCIe BAR — without transport-specific code in the coherence protocol.
Different peers in the same DSM region can use different transports
simultaneously.
// umka-core/src/dsm/moesi.rs
/// Messages exchanged between nodes in the MOESI DSM protocol.
///
/// Naming convention:
/// Requestor → Home: Get*, Put*, Upgrade
/// Home → Requestor: DataResp, AckCount, PutAck, Nack
/// Home → Sharer: FwdGetS, FwdGetM, Inv
/// Sharer → Requestor: InvAck
/// Owner → Requestor: DataFwd
pub enum DsmMsg {
// ── Requestor → Home ─────────────────────────────────────────────────────
/// Read request: requester wants a SharedReader (MOESI: S) copy.
/// The `receiver_rdma_addr` and `receiver_rkey` carry the requester's
/// pre-allocated RDMA receive buffer for the data transfer. The home
/// forwards these fields in `FwdGetS` so the owner can RDMA Write
/// directly to the requester without an extra round-trip.
GetS {
page: PageAddr,
requester: PeerId,
receiver_rdma_addr: u64,
receiver_rkey: u32,
},
/// Write request: requester wants an Exclusive (MOESI: M) copy.
/// Same RDMA buffer fields as GetS — forwarded in `FwdGetM`.
GetM {
page: PageAddr,
requester: PeerId,
receiver_rdma_addr: u64,
receiver_rkey: u32,
},
/// Upgrade: requester already has SharedReader (MOESI: S) and wants
/// Exclusive (MOESI: M). Avoids retransmitting the data (home sends
/// only invalidations + AckCount).
Upgrade { page: PageAddr, requester: PeerId },
/// Eviction of an Exclusive dirty (MOESI: M) copy. Carries the
/// written-back data. Sent via RDMA one-sided Write for the data
/// payload; the control header is a two-sided Send.
PutM { page: PageAddr, data: DsmPageBuf },
/// Eviction of a SharedOwner (MOESI: O) copy. Carries the written-back data.
PutO { page: PageAddr, data: DsmPageBuf },
/// Eviction of an Exclusive clean (MOESI: E) copy. No data transfer needed.
PutE { page: PageAddr },
/// Silent eviction of a SharedReader (MOESI: S) copy. No data transfer.
PutS { page: PageAddr },
// ── Home → Requestor ─────────────────────────────────────────────────────
/// Data response from home memory (for DsmHomeState::Uncached/Shared).
/// `ack_count`: number of `InvAck` messages the requester must collect
/// before entering Exclusive (MOESI: M) state (zero for read misses
/// transitioning to SharedReader (MOESI: S)).
/// When `ack_count > 0`, the `invalidation_nonce` carries the 128-bit nonce
/// that was included in each `Inv` message. The requester uses this nonce
/// to verify InvAck HMACs. When `ack_count == 0` (read miss, no invalidations),
/// the nonce is zeroed.
DataResp {
page: PageAddr,
data: DsmPageBuf,
ack_count: u32,
/// 128-bit nonce for InvAck authentication. Zeroed when ack_count == 0.
invalidation_nonce: [u8; 16],
},
/// Acknowledgment count only (for Upgrade: no data resent).
/// Carries the `invalidation_nonce` so the requester can verify InvAck HMACs.
AckCount {
page: PageAddr,
ack_count: u32,
/// 128-bit nonce included in each Inv message of this batch.
invalidation_nonce: [u8; 16],
},
/// Acknowledgment that a Put* eviction was received and home is updated.
PutAck { page: PageAddr },
/// Transient conflict: requester must retry after backoff.
/// Home returns Nack instead of blocking when the directory entry is
/// momentarily locked (e.g., a concurrent ownership transfer is in progress).
/// Requester retries with exponential backoff: 1 μs, 2 μs, 4 μs, …, max 1 ms.
Nack { page: PageAddr, reason: NackReason },
// ── Home → Sharer / Owner (forwarded requests) ────────────────────────────
/// Forward a read request to the current owner (Exclusive or SharedOwner).
/// Owner must send `DataFwd` to `requester` and transition to SharedOwner (MOESI: O).
/// The `receiver_rdma_addr` and `receiver_rkey` fields carry the requester's
/// pre-allocated RDMA receive buffer information, forwarded from the original
/// GetS message. The owner (Node B) uses these to RDMA Write the page data
/// directly to the requester (Node A) without an extra round-trip.
FwdGetS {
page: PageAddr,
requester: PeerId,
/// Requester's RDMA receive buffer offset (pre-allocated before GetS send).
receiver_rdma_addr: u64,
/// Requester's RDMA remote key for the receive buffer.
receiver_rkey: u32,
},
/// Forward an exclusive request to the current owner (Exclusive or SharedOwner).
/// Owner must send `DataFwd` to `requester` and transition to NotPresent (MOESI: I).
FwdGetM {
page: PageAddr,
requester: PeerId,
/// Requester's RDMA receive buffer offset.
receiver_rdma_addr: u64,
/// Requester's RDMA remote key for the receive buffer.
receiver_rkey: u32,
},
/// Invalidation request to a SharedReader (MOESI: S) peer.
/// Sharer must unmap the page, flush TLBs, and reply with `InvAck`.
/// The `invalidation_nonce` is a cryptographically random 128-bit value
/// generated fresh by the home for each invalidation batch. The sharer
/// echoes this nonce back in the `InvAck` with an HMAC for authentication.
Inv {
page: PageAddr,
requester: PeerId,
/// 128-bit nonce for InvAck authentication. The sharer echoes this
/// back in InvAck.invalidation_nonce with an HMAC.
invalidation_nonce: [u8; 16],
},
// ── Sharer → Requestor ───────────────────────────────────────────────────
/// Acknowledgment of invalidation. Sent directly to the requester (not home)
/// to allow the requester to pipeline Exclusive (MOESI: M) entry without home round-trip.
/// Includes the invalidation nonce and HMAC for authentication — the requester
/// verifies these before accepting the ACK. See
/// [Section 6.5](#dsm-page-fault-flow--invalidation-ack-authentication).
InvAck {
page: PageAddr,
/// 128-bit nonce from the original Inv message. Proves this ACK is a
/// response to a specific invalidation batch, not a replay.
invalidation_nonce: [u8; 16],
/// HMAC-SHA-256 over {nonce || page_va || sender_node_id} using the
/// cluster-wide epoch key. 32 bytes.
hmac: [u8; 32],
},
// Wire size: DsmWireHeader(40) + PageAddr(8) + nonce(16) + hmac(32) = 96 bytes.
// Transport: Send (inline) | 96 B.
// ── Owner → Requestor (data forwarding) ──────────────────────────────────
/// Forwarded data from the current owner (Exclusive or SharedOwner) to a new requester.
/// Sent via RDMA one-sided Write for the 4KB payload.
DataFwd { page: PageAddr, data: DsmPageBuf },
// ── Subscriber-Controlled Caching ([Section 6.12](#dsm-subscriber-controlled-caching)) ──
// Local-only variants (never serialized to wire). Rust `bool` and
// native types are acceptable per CLAUDE.md rule 8 exemption
// (rule 8 applies only to wire/KABI structs). DsmMsgType codes
// 0x0040-0x0043 are reserved but never appear on the wire.
/// Subscriber requests prefetch of a page range into DSM cache.
/// Home fetches pages proactively (before a page fault occurs).
Prefetch {
region_id: u64,
va_start: u64,
page_count: u32,
priority: DsmPrefetchPriority,
requester: PeerId,
},
/// Subscriber requests explicit invalidation of locally cached pages.
/// Used by subsystems that know their cached data is stale (e.g., after
/// a DLM lock downgrade or a distributed transaction abort).
SubscriberInvalidate {
region_id: u64,
va_start: u64,
page_count: u32,
requester: PeerId,
},
/// Subscriber requests writeback of dirty pages to home.
/// `sync`: if true, requester blocks until writeback completes.
SubscriberWriteback {
region_id: u64,
va_start: u64,
page_count: u32,
sync: bool,
requester: PeerId,
},
/// Acknowledgment of a subscriber control operation.
SubscriberAck {
region_id: u64,
va_start: u64,
page_count: u32,
op: DsmSubscriberOp,
},
}
/// Subscriber operation type for SubscriberAck.
#[repr(u8)]
pub enum DsmSubscriberOp {
Prefetch = 0,
Invalidate = 1,
Writeback = 2,
}
/// Prefetch priority hint.
#[repr(u8)]
pub enum DsmPrefetchPriority {
/// Background: low-priority, yield to demand faults.
Background = 0,
/// Normal: standard priority (e.g., sequential read-ahead).
Normal = 1,
/// Urgent: subscriber expects imminent access (e.g., DLM lock acquired,
/// about to touch the data). Prefetch bypasses background queue.
Urgent = 2,
}
/// Virtual page address within a DSM region — identifies a page globally
/// across the cluster. Value is the virtual address of the first byte of the
/// page, 4KB-aligned. Globally unique because DSM regions use disjoint virtual
/// address ranges assigned at region creation time, and all peers map the
/// region at the same base address ([Section 6.8](#dsm-region-management)).
///
/// This is a VIRTUAL address, not physical. Different nodes have different
/// physical frames for the same DSM page — that is the fundamental property
/// of DSM. The home node for a page is determined by hashing the virtual
/// address: `home_node = hash(region_id, VA) % participant_count`
/// ([Section 6.4](#dsm-home-node-directory)).
pub struct PageAddr(u64);
/// Reference to a pool-allocated page buffer. The pool is per-region,
/// pre-allocated at region join time (max_inflight_transfers × PAGE_SIZE).
/// This avoids placing 4KB data inline in DsmMsg variants, which would
/// make DsmMsg too large for the stack (~4KB+ per enum value).
///
/// DsmPageBuf is DMA-capable and 4KB-aligned (required for RDMA one-sided
/// Write of page data). On Drop, the buffer is returned to its pool.
///
/// SAFETY: DsmPageBuf is Send but not Sync — only one thread may write to the
/// buffer at a time. The MOESI protocol's ownership model guarantees this.
pub struct DsmPageBuf {
/// Pointer to a 4KB-aligned, DMA-capable page buffer.
ptr: NonNull<[u8; PAGE_SIZE]>,
/// The pool this buffer was allocated from. Used by Drop to return it.
pool: &'static DsmPagePool,
}
impl Drop for DsmPageBuf {
fn drop(&mut self) {
// SAFETY: ptr was allocated from this pool and has not been freed.
// Drop cannot take &PreemptGuard (fixed trait signature), so
// return_buf() uses debug_assert!(!preempt_enabled()) internally.
// DsmPageBuf is only created in non-preemptible context (softirq/
// NAPI) and must not be held across a schedule() point.
self.pool.return_buf(self.ptr);
}
}
/// Pre-allocated pool of DMA-capable page buffers for DSM data transfers.
/// One per region per node, sized at region join time to accommodate the
/// maximum number of in-flight transfers (default: 64).
///
/// **Context requirement**: `alloc()` and `free()` take `&PreemptGuard`
/// to enforce non-preemptible context at compile time (same pattern as
/// `PerCpu::get()`). This is required for ABA safety of the lock-free
/// free list — see the `free_list` field documentation below.
pub struct DsmPagePool {
/// Free list of page buffers. Lock-free stack using `AtomicU64` with
/// a generation counter and buffer index packed into a single word.
///
/// **64-bit packing layout** (x86-64, AArch64, RISC-V 64, PPC64LE,
/// s390x, LoongArch64):
/// bits 63-48: 16-bit generation counter (ABA guard)
/// bits 47-0: 48-bit buffer index
/// Buffer addresses: `base + index * PAGE_SIZE`.
///
/// **32-bit packing layout** (ARMv7, PPC32):
/// bits 63-32: 32-bit generation counter (ABA guard)
/// bits 31-0: 32-bit buffer index
/// The full 32-bit address space fits in the lower half. The 32-bit
/// generation counter provides ~4 billion ABA guard cycles — vastly
/// more headroom than the 64-bit layout's 65,536.
///
/// ABA is avoided because the generation counter increments on every
/// push, and this pool is ONLY accessed from non-preemptible context
/// (softirq/NAPI on the RDMA completion path, or under
/// `preempt_disable`). In non-preemptible context, the maximum number
/// of interleaved operations between a thread's `load` and its CAS is
/// bounded by interrupt handlers (NMI/IRQ), which complete in <10 us
/// (~100 operations at ~100 ns each). The 16-bit generation (65,536
/// wrap) on 64-bit provides >650x safety margin.
///
/// **Compile-time enforcement**: `alloc()` and `free()` require
/// `&PreemptGuard`, which can only be obtained by calling
/// `preempt_disable()`. This is zero-cost in release builds (the
/// guard is a ZST). Debug builds additionally verify
/// `!preempt_enabled()` as a belt-and-suspenders check.
free_list: AtomicU64,
/// Total capacity (for diagnostics / OOM detection).
capacity: u32,
/// Number of currently allocated (in-use) buffers.
allocated: AtomicU32,
}
impl DsmPagePool {
/// Allocate a DMA-capable page buffer from the pool.
/// Requires `&PreemptGuard` to enforce non-preemptible context at
/// compile time (ABA safety invariant). Returns `None` if the pool
/// is exhausted (caller should back-pressure or reclaim).
pub fn alloc(&self, _guard: &PreemptGuard) -> Option<DsmPageBuf> {
// CAS loop on free_list: pop top, increment generation.
/* ... */
}
/// Return a buffer to the pool (called from DsmPageBuf::Drop).
/// Uses `debug_assert!(!preempt_enabled())` because the Drop trait
/// cannot accept `&PreemptGuard`. Callers must ensure DsmPageBuf is
/// not held across a schedule() point.
fn return_buf(&self, ptr: NonNull<[u8; PAGE_SIZE]>) {
// CAS loop on free_list: push, increment generation.
/* ... */
}
}
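The generation/index word packing documented on free_list can be shown directly. pack and unpack below are illustrative helpers for the 64-bit layout (generation in bits 63-48, index in bits 47-0), not kernel API; the CAS loop in alloc() is sketched in a comment.

```rust
/// Sketch of the 64-bit free-list word layout (illustrative helpers).
const GEN_SHIFT: u32 = 48;
const IDX_MASK: u64 = (1 << GEN_SHIFT) - 1;

/// Pack a 16-bit generation counter and a 48-bit buffer index into one word.
fn pack(gen: u16, index: u64) -> u64 {
    debug_assert!(index <= IDX_MASK);
    ((gen as u64) << GEN_SHIFT) | (index & IDX_MASK)
}

/// Split a free-list word back into (generation, index).
fn unpack(word: u64) -> (u16, u64) {
    ((word >> GEN_SHIFT) as u16, word & IDX_MASK)
}

// A pop in `alloc()` would then look roughly like (pseudocode):
//
//   loop {
//       let head = free_list.load(Acquire);
//       let (gen, idx) = unpack(head);
//       let next = next_of(idx); // read the next index from the buffer
//       let new = pack(gen.wrapping_add(1), next); // bump gen: ABA guard
//       if free_list.compare_exchange(head, new, AcqRel, Acquire).is_ok() {
//           return idx;
//       }
//   }
```

Incrementing the generation on every push/pop is what makes a same-index CAS after an interleaved pop-push sequence fail, which is the ABA guard the field documentation relies on.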
// DsmMsg stack size: ~64 bytes (largest variant is DataResp with
// PageAddr + DsmPageBuf + ack_count + invalidation_nonce
// = 8 + 16 + 4 + 16 = 44 bytes payload, plus enum discriminant and
// padding). Compare to 4KB+ if the page data were inline.
/// Reason codes for transient Nack responses.
pub enum NackReason {
/// Directory entry is temporarily locked by a concurrent operation. Retry.
Busy,
/// Home node is in the middle of a directory rehash. Retry after redirect.
Transient,
}
Wire encoding: The DsmMsg enum above is the in-memory representation. The #[repr(C)] wire format, RDMA verb bindings, and split-transfer protocol for data messages are specified in Section 6.6. Write-update diff wire encoding is in Section 6.6.
6.6.5 Deadlock Avoidance¶
The MOESI protocol is deadlock-free by construction via four invariants:

- No blocking while holding a directory lock. The home node acquires `DsmDirEntry::lock`, updates the entry, releases the lock, and only then sends forwarding or invalidation messages. It never blocks waiting for remote `InvAck`s or `DataFwd` completions while holding the lock. This ensures the home node's lock is always available to process incoming messages.
- NACK instead of blocking for transient conflicts. When the directory entry is locked by a concurrent operation, the home node returns `Nack` (with `NackReason::Busy`) rather than queuing the request. Requestors retry with exponential backoff (1 μs, 2 μs, 4 μs, …, capped at 1 ms). This prevents priority inversion and livelock by bounding the retry interval.
- Separate request and response channels. RDMA QP pairs are partitioned into a request channel and a response channel per peer. Response messages (`InvAck`, `DataFwd`, `PutAck`) never block behind request messages (`GetS`, `GetM`, `Inv`). This prevents the classic deadlock where a node cannot process an incoming `FwdGetS` because it is blocked trying to send its own `GetM`.
- Owner always responds before issuing new requests. A node that receives `FwdGetS` or `FwdGetM` must send `DataFwd` (and `InvAck` if transitioning to I) before it may issue any new `GetS`, `GetM`, or `Upgrade` requests. This prevents cyclic wait: A waiting on B's `DataFwd` while B is waiting on A's `DataFwd`.
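The bounded retry interval from the NACK invariant can be sketched as a pure backoff schedule. This is a minimal illustration; `backoff_us` is a hypothetical helper, not part of the documented kernel API:

```rust
/// Exponential backoff for Nack retries: 1 µs, 2 µs, 4 µs, …, capped at
/// 1 ms, as described in the NACK invariant. `attempt` is zero-based.
fn backoff_us(attempt: u32) -> u64 {
    const CAP_US: u64 = 1_000; // 1 ms cap bounds the retry interval
    // Clamp the shift so `1 << shift` cannot overflow; 2^10 µs already
    // exceeds the cap, so larger shifts add nothing.
    let shift = attempt.min(10);
    (1u64 << shift).min(CAP_US)
}
```

The cap is what turns potential livelock into a bounded, predictable retry load on the home node.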
6.6.6 Performance Characteristics¶
State abbreviations below use MOESI single letters (see
Section 6.6 for DsmPageState mapping).
DsmHomeState names (Uncached, Shared, Exclusive, Modified) appear where the
home directory state determines the code path.
| Operation | Network Latency | RDMA Operations |
|---|---|---|
| Read hit (S or O or E state) | 0 | 0 |
| Read miss — I → S, home-sourced (Uncached or Shared) | 2× RTT | 2× two-sided Send |
| Read miss — I → S, owner-forwarded (Modified/Exclusive owner) | 3× RTT | 3× two-sided Send + 1× one-sided Write |
| Write miss — I → M, no sharers (Uncached or Exclusive) | 2× RTT | 2× two-sided Send |
| Write miss — I → M, N sharers | 2× RTT + max(InvAck RTTs) | 2× Send + N× Inv/InvAck |
| Upgrade — S → M, N−1 other sharers | 1× RTT + max(InvAck RTTs) | 1× Send + (N−1)× Inv/InvAck |
| Eviction M → I (dirty writeback) | 1× RTT | 1× two-sided Send + 1× one-sided Write |
| Eviction O → I (owned writeback) | 1× RTT | 1× two-sided Send + 1× one-sided Write |
| Eviction E → I (clean eviction) | 1× RTT | 1× two-sided Send |
| Eviction S → I (read-copy eviction) | 1× RTT | 1× two-sided Send |
RTT here is the RDMA network round-trip time (~2–5 μs for a local rack, per Section 14.8).
RDMA one-sided Write is used for DataResp, DataFwd, PutM, and PutO (4KB payload)
because it avoids receiver CPU involvement in the data path. Two-sided Send is used for
all control messages (≤64 bytes, sent as RDMA inline) because they require the receiver's
CPU to process the directory update or mapping change.
The SharedOwner (MOESI: O) state provides a concrete advantage for multi-reader scenarios:
when the first reader (FwdGetS) causes the owner to transition from Exclusive-dirty
(MOESI: M) to SharedOwner (MOESI: O), subsequent readers (FwdGetS again) are served
directly by the SharedOwner node without any home-node memory write. In a scenario with
1 writer followed by K readers, MOESI requires 1 + K network round-trips; MESI (which
requires a writeback before sharing) would require 2 + K.
6.6.7 Write-Update Protocol (DSM_WRITE_UPDATE Flag)¶
The default MOESI protocol uses write-invalidate: when a peer writes a page, all shared copies are invalidated. This is optimal when write sharing is rare (most workloads), but causes ping-pong when multiple peers frequently write to different fields within the same page — each write invalidates all readers, who immediately re-fetch the page, only to be invalidated again on the next write.
Kerrighed's KRC (kernel release consistency) demonstrated that write-update with byte-level diffs eliminates this ping-pong for true-sharing workloads (database index nodes, shared counters, lock-free data structures that happen to share a page).
Regions created with DSM_WRITE_UPDATE (flag 0x04 in DsmRegionCreate.flags) use a
modified coherence protocol:
On write (Release consistency — diff computed at release point):
- At lock acquire / region entry, the DSM records a pristine copy of each page the peer has in SharedReader (MOESI: S) state (copy-on-write snapshot via page protection — no actual copy until the first write, same as TreadMarks' lazy diff creation).
- At lock release / barrier, the DSM compares each written page against its pristine copy and produces a diff: a compact encoding of `(offset, length, new_bytes)` tuples. Only the changed bytes are encoded, not the full page.
- The diff is sent to the home node via RDMA Send (two-sided, for ordering).
- The home node applies the diff to its authoritative copy and forwards the diff to all peers in the `sharers` bitmap. No invalidation — readers apply the diff to their local copies directly and remain in SharedReader state.
- Peers apply incoming diffs under the `DsmDirEntry::lock` (same lock as invalidations). Application order is guaranteed by the home node's serialization.
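The release-point diff computation can be sketched as a byte-wise comparison against the pristine snapshot. This is an illustrative routine, not the kernel's actual implementation:

```rust
/// Compare a written page against its pristine snapshot and emit
/// (offset, length, changed bytes) runs — the logical content of a diff.
/// Pages are 4 KB as elsewhere in this chapter.
fn compute_diff(pristine: &[u8], current: &[u8]) -> Vec<(u16, u16, Vec<u8>)> {
    assert_eq!(pristine.len(), current.len());
    let mut runs = Vec::new();
    let mut i = 0;
    while i < current.len() {
        if pristine[i] != current[i] {
            let start = i;
            // Extend the run to the end of the maximal changed range.
            while i < current.len() && pristine[i] != current[i] {
                i += 1;
            }
            runs.push((start as u16, (i - start) as u16, current[start..i].to_vec()));
        } else {
            i += 1;
        }
    }
    runs
}
```

If the summed run data exceeds 2048 bytes (50% of the page), the sender falls back to write-invalidate per the fallback rule in this section.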
Diff encoding:
/// Compact diff for a single page. Sent as RDMA inline if ≤256 bytes,
/// otherwise as RDMA Send with DMA buffer.
///
/// `PageDiff` is a dynamically-sized type (DST) due to the trailing
/// `[DiffRun]` slice. It can only be used behind a reference or `Box`.
/// Construction requires building a fat pointer:
///
/// ```
/// let layout = Layout::from_size_align(
/// size_of::<PageAddr>() + size_of::<u16>() + 6 + run_count * size_of::<DiffRun>(),
/// align_of::<PageAddr>(),
/// ).unwrap();
/// let ptr = slab_alloc(layout);
/// // Initialize fields, then create fat pointer:
/// let diff: &PageDiff = &*(core::ptr::slice_from_raw_parts(ptr, run_count) as *const PageDiff);
/// ```
///
/// The wire encoding is `PageDiffWire` (fixed 16-byte header + variable
/// DiffRunWire headers + data), not this in-memory DST.
pub struct PageDiff {
pub page: PageAddr,
/// Number of changed regions within the page.
pub run_count: u16,
pub _pad: [u8; 6],
/// Variable-length array of (offset: u16, length: u16) run headers.
/// The changed data bytes are stored contiguously after all runs
/// (DiffRun is a fixed 4-byte element, so data cannot be interleaved),
/// mirroring the PageDiffWire layout. Total encoded size ≤ 4096 bytes
/// (worst case: entire page changed → fall back to full-page invalidate,
/// which is cheaper than a 4KB diff).
pub runs: [DiffRun],
}
pub struct DiffRun {
pub offset: u16, // byte offset within page (0..4095)
pub length: u16, // bytes changed (1..4096)
// the run's data bytes follow all DiffRun entries contiguously
}
Fallback: If the diff exceeds 50% of page size (2048 bytes), the protocol falls back
to write-invalidate for that page (sending Inv instead of the diff). This prevents
pathological cases where the diff is larger than a full page transfer.
When to use: DSM_WRITE_UPDATE is beneficial when:
- Multiple peers write to disjoint fields of the same page (struct with per-peer counters,
database page with multiple row slots, shared hash table bucket).
- Write regions are small relative to page size (a few cache lines per write).
- Readers significantly outnumber writers (diffs are multicast to all sharers).
When NOT to use: Large sequential writes (entire page rewritten), single-writer workloads (no sharing to update), or when writes and reads are temporally separated (invalidation is cheaper — the reader fetches the full page once, not N diffs).
Write-Update Interval Abort Path
If a write-update interval is abandoned before the normal close (lock release / barrier), the COW snapshots must be cleaned up to avoid dangling read-only page mappings and leaked pristine copies. Three abort triggers share a common cleanup sequence:
1. Thread exit / SIGKILL with open write-update interval:
   a. Restore original page protections (remove COW write-trap from all pages in the interval's pristine list).
   b. Discard all pristine copies (free the COW shadow pages).
   c. Any writes made during the partial interval are treated as full-page modifications: send `PutM` for each dirty page (write-invalidate fallback), not diffs. The home node processes `PutM` normally, invalidating all sharers.
   d. Log FMA event `DsmWriteUpdateAbort { region_id, reason: ThreadExit, dirty_pages, pristine_pages }`.
2. OOM during diff computation:
   a. Fall back to write-invalidate for the affected pages only (send `PutM` for pages where diff allocation failed).
   b. Pages whose diffs were successfully computed are sent normally.
   c. Log FMA event `DsmWriteUpdateDiffOom { region_id, affected_pages }`.
3. DLM lock timeout (lock not released within configured timeout):
   a. Same cleanup as thread exit (steps 1a-1d), with `reason: LockTimeout`.
The cleanup sequence is adapted from TreadMarks' abort path. The key invariant: after cleanup, no page retains a COW write-trap mapping from the aborted interval. All dirty pages are reconciled with the home node via the standard PutM path (not lost silently).
DsmMsg additions for write-update:
// Added to DsmMsg enum:
/// Write-update diff from writer to home node.
WriteDiff { page: PageAddr, diff: PageDiff, writer: PeerId },
/// Home forwards diff to all sharers (multicast).
FwdDiff { page: PageAddr, diff: PageDiff },
Interaction with DsmConsistency: Write-update is orthogonal to the consistency model.
It works with Release (diffs sent at release), Eventual (diffs propagated
asynchronously), and Causal (diffs carry causal stamps). It is not applicable to
Synchronous (which requires full-page acknowledgment from all replicas).
6.6.8 DSM Coherence Message Wire Format¶
The DsmMsg enum (Section 6.6) defines the in-memory protocol
representation. This section specifies the #[repr(C)] wire encoding for RDMA transport.
Wire header:
Every DSM coherence message on the wire consists of a ClusterMessageHeader
(Section 5.1, 40 bytes) followed by a DsmWireHeader
(40 bytes). Total header: 80 bytes.
/// Wire header for all DSM coherence messages. Follows ClusterMessageHeader.
/// Total: 40 bytes. All integer fields use Le types for mixed-endian clusters
/// ([Section 6.1](#dsm-foundational-types--wire-format-integer-types)).
#[repr(C)]
pub struct DsmWireHeader {
/// DSM message type — identifies which DsmMsg variant this encodes
/// (DsmMsgType as Le16).
pub dsm_type: Le16, // 2 bytes
/// Flags:
/// DSM_FLAG_HAS_DATA = 0x01 — a 4KB page payload follows (RDMA Write).
/// DSM_FLAG_HAS_DIFF = 0x02 — a PageDiffWire payload follows.
pub flags: Le16, // 2 bytes
/// Message-specific auxiliary field:
/// DataResp: ack_count (number of InvAcks the requester must collect).
/// Prefetch/SubscriberInvalidate/SubscriberWriteback: page_count.
/// Data messages with DSM_FLAG_HAS_DATA: offset of the 4KB data slot
/// in the receiver's pre-registered receive region.
/// Nack: NackReason encoded as Le32.
/// All others: 0.
pub aux: Le32, // 4 bytes
/// Region ID for region-scoped messages (subscriber control, region
/// management). Zero for page-addressed MOESI messages (GetS, GetM, etc.),
/// which identify pages by the global PageAddr. u64 to match DsmRegion.region_id.
pub region_id: Le64, // 8 bytes
/// Virtual page address (4KB-aligned). Zero for non-page messages.
pub page_addr: Le64, // 8 bytes
/// Peer ID of the requester or sender.
pub peer_id: Le64, // 8 bytes (PeerId)
/// Reserved for future use (alignment padding to 40 bytes).
pub _pad: [u8; 8], // 8 bytes
}
// Layout: 2 + 2 + 4 + 8 + 8 + 8 + 8 = 40 bytes (Le types are alignment 1).
const_assert!(core::mem::size_of::<DsmWireHeader>() == 40);
DSM messages are carried as payloads within the RDMA ring buffer
(Section 5.5). Each
DsmWireHeader is preceded by a ClusterMessageHeader and wrapped in
an RdmaRingHeader entry when using RDMA transport.
/// Explicit wire type codes for each DsmMsg variant. These are the values
/// that appear in DsmWireHeader.dsm_type on the wire.
#[repr(u16)]
pub enum DsmMsgType {
// ── Requestor → Home ────────────────────────────────────────────────
GetS = 0x0001,
GetM = 0x0002,
Upgrade = 0x0003,
PutM = 0x0004,
PutO = 0x0005,
PutE = 0x0006,
PutS = 0x0007,
// ── Home → Requestor ────────────────────────────────────────────────
DataResp = 0x0010,
AckCount = 0x0011,
PutAck = 0x0012,
Nack = 0x0013,
// ── Home → Sharer/Owner ─────────────────────────────────────────────
FwdGetS = 0x0020,
FwdGetM = 0x0021,
Inv = 0x0022,
// ── Sharer → Requestor ──────────────────────────────────────────────
InvAck = 0x0023,
// ── Owner → Requestor ───────────────────────────────────────────────
DataFwd = 0x0030,
// ── Subscriber control (local-only, [Section 6.12](#dsm-subscriber-controlled-caching--subscriber-control-api)) ──
// Reserved codes 0x0040-0x0043. These are never sent over the wire;
// the DSM subsystem translates subscriber API calls into standard
// MOESI messages (GetS, PutS, PutM, PutO). Codes reserved for type
// completeness in debug tracing.
_SubscriberPrefetchReserved = 0x0040,
_SubscriberInvalidateReserved = 0x0041,
_SubscriberWritebackReserved = 0x0042,
_SubscriberAckReserved = 0x0043,
// ── Write-update ([Section 6.6](#dsm-coherence-protocol-moesi--write-update-protocol-dsm_write_update-flag)) ──
WriteDiff = 0x0050,
FwdDiff = 0x0051,
// ── Causal consistency ([Section 6.6](#dsm-coherence-protocol-moesi--causal-consistency-protocol-dsm_causal)) ──
CausalPropagate = 0x0060,
CausalAck = 0x0061,
CausalWait = 0x0062,
// ── Anti-entropy ([Section 6.13](#dsm-anti-entropy-protocol)) ────────────────────
AntiEntropyRequest = 0x0070,
AntiEntropyData = 0x0071,
AntiEntropyComplete = 0x0072,
// ── Distributed futex ([Section 6.14](#dsm-application-visible--distributed-futex-on-dsm-pages)) ──
DsmFutexWake = 0x0090,
DsmFutexWakeTarget = 0x0091,
DsmFutexWaitRegister = 0x0092,
DsmFutexWaitUnregister = 0x0093,
// ── Home directory reconstruction ──
// DSM directory reconstruction messages use PeerMessage framing
// (PeerMessageType::DsmDirReconstruct* 0x0330-0x0332), not DSM
// data-plane DsmMsg framing. See
// [Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--dsm-home-reconstruction).
}
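For debug tracing, the range allocation from Section 6.1.1 can be checked with a simple classifier. This is a sketch; `dsm_range_name` is a hypothetical helper, not a documented kernel function:

```rust
/// Map a DsmMsgType wire code to its allocation range (Section 6.1.1).
/// Useful as a debug-tracing aid when adding new message types; returns
/// "unallocated" for codes outside the documented DsmMsgType ranges.
fn dsm_range_name(code: u16) -> &'static str {
    match code {
        0x0001..=0x0007 => "MOESI requestor->home",
        0x0010..=0x0013 => "MOESI home->requestor",
        0x0020..=0x0023 => "MOESI forwarding + invalidation + InvAck",
        0x0030 => "MOESI owner->requestor data forward",
        0x0040..=0x0043 => "subscriber control (reserved, local-only)",
        0x0050..=0x0051 => "write-update protocol",
        0x0060..=0x0062 => "causal consistency",
        0x0070..=0x0072 => "anti-entropy",
        0x0080..=0x008F => "cooperative cache probes",
        0x0090..=0x009F => "distributed futex",
        _ => "unallocated",
    }
}
```

(Region management and directory reconstruction codes, 0x0300-0x0332, travel as `PeerMessageType`, not `DsmMsgType`, so they are intentionally absent here.)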
Transport binding table:
Each DsmMsgType is bound to a specific RDMA verb type. The choice between one-sided
Write and two-sided Send depends on whether the receiver's CPU needs to process the
message (Send) or only the data needs to land in memory (Write).
Size convention: "DSM Payload Size" below is the DSM-level size (DsmWireHeader + any variant-specific payload). The full on-wire size adds the framing layers: 40 bytes of ClusterMessageHeader prepended by the cluster transport, plus 8 bytes of RdmaRingHeader when using RDMA transport. For TCP fallback, add 8 bytes of TCP framing + 40 bytes of ClusterMessageHeader instead.
| DsmMsgType | RDMA Verb | DSM Payload Size | Notes |
|---|---|---|---|
| GetS, GetM, Upgrade | Send (inline) | 40 B | DsmWireHeader only; no extra payload |
| PutE, PutS | Send (inline) | 40 B | Clean eviction — no data transfer |
| PutM, PutO | Send (header) + Write (data) | 40 + 4096 B | Split transfer (see below) |
| DataResp | Send (header) + Write (data) | 40 + 4096 B | Split transfer |
| DataFwd | Send (header) + Write (data) | 40 + 4096 B | Split transfer |
| AckCount, PutAck, Nack | Send (inline) | 40 B | Control only; aux field carries count/reason |
| FwdGetS, FwdGetM, Inv | Send (inline) | 40 B | Forwarded control |
| InvAck | Send (inline) | 40 B | Direct to requester, not home |
| WriteDiff | Send (variable) | 40 + 16 + R×4 + ≤2048 B | Variable-length diff (PageDiffWire); inline if total ≤ 256 B |
| FwdDiff | Send (variable) | 40 + 16 + R×4 + ≤2048 B | Home fans out to each sharer individually |
| CausalPropagate | Send (variable) | 40 + 24 + N×8 + D×8 | Inline if total ≤ 256 B; 24 B is the CausalStampWire header |
| CausalAck | Send (inline) | 40 B | Control only |
| CausalWait | Send (inline) | 40 B | Encoded in DsmWireHeader fields |
| AntiEntropyRequest | Send (variable) | 40 + N×8 | Version vector; continuation for large regions |
| AntiEntropyData | Send (header) + Write (data) | 40 + 4096 B | Same split-transfer as DataResp |
| AntiEntropyComplete | Send (variable) | 40 + N×8 | Final version vector |
| DsmFutexWake | Send (inline) | 40 B | Cross-node futex wake (waker → home) |
| DsmFutexWakeTarget | Send (inline) | 40 B | Targeted wake (home → specific peer) |
| DsmFutexWaitRegister | Send (inline) | 40 B | Waiter registration (waiter → home) |
| DsmFutexWaitUnregister | Send (inline) | 40 B | Waiter unregistration (waiter → home) |
Inline threshold: Control messages (total wire size ≤ 256 bytes) use
IBV_SEND_INLINE — data is copied directly from CPU registers to the NIC, avoiding
DMA setup overhead. This applies to all messages except data-bearing ones (PutM, PutO,
DataResp, DataFwd) and large diffs.
Split-transfer protocol (for PutM, PutO, DataResp, DataFwd):
Sender Receiver
│ │
│ 1. Allocate 4KB slot from receiver's │
│ pre-registered receive region │
│ (ring buffer or dedicated pool, │
│ [Section 5.4](05-distributed.md#rdma-native-transport-layer--pre-registered-kernel-memory)) │
│ │
│ 2. RDMA Write: 4KB page data ──────→ │ (one-sided, no CPU involvement)
│ to allocated slot │
│ │
│ 3. RDMA Send: DsmWireHeader ────────→ │ (two-sided, triggers CPU processing)
│ flags = DSM_FLAG_HAS_DATA │
│ aux = slot offset in recv region│
│ │
│ │ 4. Process Send completion:
│ │ read data from slot[aux]
│ │ update page table / directory
RC (Reliable Connection) QP ordering guarantees that the RDMA Write (step 2) data is visible at the receiver before the RDMA Send (step 3) completion is processed, because both are posted to the same QP and RC enforces in-order delivery.
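The ordering argument can be modeled with a toy in-order queue: because the Write and the Send are posted to the same RC QP, the receiver observes the page data before it processes the Send completion. Everything below is an illustrative model, not RDMA verb code:

```rust
/// Toy model of the split transfer. The Vec plays the role of the RC QP:
/// work requests complete in posting order, so the one-sided Write's data
/// is already in the receive region when the Send completion is processed.
enum WorkReq {
    Write { slot: usize, data: [u8; 4] }, // step 2: one-sided, no receiver CPU
    Send { aux: usize },                  // step 3: header; aux = slot offset
}

fn drain(qp: Vec<WorkReq>, recv_region: &mut Vec<[u8; 4]>) -> Option<[u8; 4]> {
    let mut processed = None;
    for wr in qp {
        match wr {
            // One-sided Write: data lands in the slot, no CPU processing.
            WorkReq::Write { slot, data } => recv_region[slot] = data,
            // Step 4: the Send completion tells the CPU which slot to read.
            WorkReq::Send { aux } => processed = Some(recv_region[aux]),
        }
    }
    processed
}
```

The model captures why no extra flush or fence is needed between steps 2 and 3: in-order delivery on the single QP is the synchronization.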
No message batching:
DSM coherence messages are NOT batched. Each GetS/GetM/Inv/InvAck is a separate RDMA operation. Rationale: coherence messages are on the page fault critical path — every microsecond of added latency directly increases process stall time. Batching would add queueing delay (waiting to fill a batch) that exceeds the per-message RDMA overhead (~0.5 μs for an inline Send). The IPC ring buffer layer (Section 5.5) provides batching for bulk data operations that are not latency-critical.
Subscriber control messages are local, not wire:
DsmMsg::Prefetch, SubscriberInvalidate, SubscriberWriteback, and SubscriberAck
(Section 6.12) are internal kernel API calls
between a subscriber (e.g., UPFS, block export layer) and the local DSM subsystem. They
are NOT sent over RDMA. The DsmMsgType codes 0x0040-0x0043 are reserved but never appear
on the wire.
The DSM subsystem translates subscriber API calls into standard MOESI wire messages:
- dsm_prefetch() → pipelined GetS messages (up to 8 concurrent, to fill a 100 Gbps
link at 4KB/page with ~3 μs RTT).
- dsm_invalidate() → PutS (for SharedReader pages) or PutM/PutO (for dirty pages).
- dsm_writeback() → PutM/PutO (with sync=true: wait for PutAck; sync=false:
completion via on_writeback_complete() callback).
6.6.9 Write-Update Wire Encoding¶
The PageDiff and DiffRun structures (Section 6.6)
define the logical diff representation. This section specifies their wire encoding.
/// Wire encoding of a page diff. Follows DsmWireHeader with
/// dsm_type = WriteDiff or FwdDiff, flags = DSM_FLAG_HAS_DIFF.
///
/// Layout: fixed header, then run_count × DiffRunWire headers (4 bytes each),
/// then raw diff data bytes (contiguous, not interleaved with headers).
/// This separation allows the receiver to read all headers first (small,
/// cache-friendly) and then apply data runs by computed offset.
///
/// Total wire size: 16 + (run_count × 4) + total_data_len bytes.
/// Maximum: 16 + (N × 4) + 2048 = ≤ 2560 bytes (50% page fallback rule
/// limits total_data_len to 2048).
#[repr(C)]
pub struct PageDiffWire {
/// Virtual page address being updated.
pub page_addr: Le64, // 8 bytes
/// Number of changed byte ranges within the page.
pub run_count: Le16, // 2 bytes
/// Sum of all DiffRunWire.length values. Allows receiver to pre-allocate
/// a contiguous buffer for the data section without parsing individual runs.
pub total_data_len: Le16, // 2 bytes
/// Padding to align to 8 bytes.
pub _pad: [u8; 4], // 4 bytes
// Followed by: run_count × DiffRunWire (4 bytes each)
// Followed by: total_data_len bytes of raw diff data
}
/// Wire encoding of a single changed byte range within a page.
/// The actual data bytes are NOT inline in this struct — they follow
/// all DiffRunWire headers contiguously.
///
/// Data offset for run[i] = sum(run[0..i].length).
#[repr(C)]
pub struct DiffRunWire {
/// Byte offset within the 4KB page (0..4095).
pub offset: Le16, // 2 bytes
/// Number of bytes changed starting at offset (1..4096).
pub length: Le16, // 2 bytes
}
const_assert!(core::mem::size_of::<PageDiffWire>() == 16);
const_assert!(core::mem::size_of::<DiffRunWire>() == 4);
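Applying a received diff follows directly from the layout comments above: run[i]'s bytes start at the sum of the preceding lengths. A minimal sketch, with `apply_diff` as an illustrative name:

```rust
/// Apply a parsed diff to a local page copy. `runs` holds the decoded
/// (offset, length) headers; `data` is the contiguous data section that
/// follows them on the wire. Data offset for run[i] = sum(run[0..i].length).
fn apply_diff(page: &mut [u8], runs: &[(u16, u16)], data: &[u8]) {
    let mut data_off = 0usize;
    for &(offset, length) in runs {
        let (o, l) = (offset as usize, length as usize);
        page[o..o + l].copy_from_slice(&data[data_off..data_off + l]);
        data_off += l;
    }
    // A well-formed PageDiffWire satisfies total_data_len == data_off here.
    assert_eq!(data_off, data.len());
}
```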
WriteDiff transport: Always RDMA two-sided Send (the receiver's CPU must apply the
diff to its local page copy). If the total message size (ClusterMessageHeader +
DsmWireHeader + PageDiffWire + data) ≤ 256 bytes, uses IBV_SEND_INLINE.
FwdDiff fan-out: When the home node receives a WriteDiff from a writer, it:
1. Applies the diff to its own authoritative page copy.
2. Forwards FwdDiff to each peer in the sharers bitmap — one separate RDMA Send
per sharer, on the RC QP to that peer.
RDMA UD (Unreliable Datagram) multicast is NOT used because coherence operations require guaranteed delivery. RC sends to different peers are pipelined: the home node posts all FwdDiff sends concurrently to separate QPs, so the total fan-out latency is approximately one RDMA RTT (not N × RTT). For typical write-update workloads, the sharer count is small (2-8 peers writing to disjoint fields of the same page).
6.6.10 Causal Consistency Protocol (DSM_CAUSAL)¶
Regions created with DsmConsistency::Causal use per-region vector clocks
to enforce causal ordering: if peer A writes X then writes Y, any peer that
observes Y must also observe X. The protocol operates over causal intervals
— bounded windows of writes tracked by a DsmCausalStamp.
Interval lifecycle:
1. OPEN: A causal interval begins when:
a. A DLM lock is acquired on a DSM_CAUSAL region (automatic if
DsmLockBinding exists, [Section 6.12](#dsm-subscriber-controlled-caching--dlm-token-binding)), OR
b. The subscriber explicitly calls dsm_fence_open().
The DSM subsystem allocates a DsmCausalStamp from the region's slab pool.
The stamp's clock is initialized to the peer's current vector clock
snapshot (copy of local vc[0..max_participants]).
2. ACTIVE: During the interval, every write to a DSM_CAUSAL page:
a. Increments vc[own_slot] in the peer's local vector clock.
b. Sets page.last_stamp_epoch = interval.epoch.
c. Appends the page address to interval.dirty_pages.
d. If dirty_count reaches DSM_MAX_DIRTY_PER_INTERVAL (4096):
force-flush — close this interval and open a new one.
3. CLOSE: The interval closes when:
a. The DLM lock is released (automatic), OR
b. The subscriber calls dsm_fence_close(), OR
c. Force-flush due to dirty page overflow.
On close:
i. Increment vc[own_slot] once more (marks interval boundary).
ii. Build CausalStampWire from the current vc[] and dirty_pages list.
iii. Attach CausalStampWire to the DLM lock-release message (if lock-
triggered) or send as a standalone CausalPropagateMsg.
Propagation is asynchronous — the sender does NOT wait for
CausalAck from all peers. Causal visibility is guaranteed lazily
via the CausalWait protocol if a reader encounters a gap.
iv. Free the DsmCausalStamp back to the slab pool.
4. MERGE: When a peer receives a CausalStampWire (via lock-release or
standalone message):
a. For each slot i: local_vc[i] = max(local_vc[i], received_vc[i]).
b. For each page in dirty_pages: set page.last_stamp_epoch =
max(page.last_stamp_epoch, received_stamp.epoch).
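The MERGE step (4a) is the standard component-wise vector clock join; a minimal sketch:

```rust
/// Step 4a: merge a received vector clock into the local one.
/// Both slices are indexed by region slot (0..max_participants); the
/// merge takes the per-slot maximum, so the local clock dominates
/// everything it has observed.
fn causal_merge(local_vc: &mut [u64], received_vc: &[u64]) {
    for (l, &r) in local_vc.iter_mut().zip(received_vc.iter()) {
        *l = (*l).max(r);
    }
}
```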
CW-mode DLM lock and causal ordering: When a DLM lock is granted in CW
(Concurrent Write) mode, the lock-grant message carries the previous holder's
CausalStampWire. The new holder MUST call the causal-wait protocol before
accessing protected data to ensure all causal predecessors are visible. This is
enforced by the DlmLockHandle API: lock.access() checks the causal stamp
and blocks if predecessors are pending. Specifically:
- The DLM master includes the previous CW holder's `CausalStampWire` in the grant message (`DlmGrantMsg.causal_stamp` field).
- On grant receipt, the DSM subsystem calls `causal_merge(stamp)` to update the local vector clock.
- `lock.access()` verifies that all pages referenced in `stamp.dirty_pages` have `last_stamp_epoch >= stamp.epoch` locally. If any page is stale, the CausalWait protocol fetches the missing data (one RTT per missing page, pipelined).
- Only after all causal predecessors are visible does `lock.access()` return, allowing the caller to proceed with protected data.
This ensures that CW-mode locks — which permit concurrent writers — still maintain causal ordering: writer B sees all writes that causally preceded writer A's release, even though A and B may overlap in time.
Fence API:
impl DsmRegionHandle {
/// Open a causal interval explicitly (without a DLM lock).
/// Returns a handle used to close the interval later.
/// The interval tracks all writes to DSM_CAUSAL pages in this region
/// until closed via dsm_fence_close().
///
/// Use case: DSM_RELAXED regions that need occasional causal ordering
/// (e.g., publish a batch of updates, then fence, so readers see all
/// updates or none). Also used by DSM_CAUSAL regions when the
/// subscriber manages its own synchronization (no DLM).
pub fn dsm_fence_open(&self) -> Result<DsmFenceHandle, DsmError>;
/// Close a causal interval and propagate the stamp to all region
/// participants. Propagation is asynchronous — the stamp is sent to
/// all peers but this call does NOT block waiting for acknowledgments.
/// Causal visibility is guaranteed by the protocol: any peer that
/// subsequently accesses a page dirtied in this interval will either
/// (a) have already merged the stamp (fast path), or (b) discover the
/// causal gap via the epoch check and resolve it via CausalWait
/// (adding one RTT). This avoids tail-latency problems from slow peers.
///
/// For callers that need synchronous visibility (rare — most workloads
/// rely on DLM locks for synchronization), use dsm_fence_close_sync().
pub fn dsm_fence_close(
&self,
handle: DsmFenceHandle,
) -> Result<(), DsmError>;
/// Like dsm_fence_close(), but blocks until all peers have acknowledged
/// the stamp (CausalAck received from every alive peer in the region).
/// Timeout: 500 ms. If any peer hasn't acknowledged by then, the fence
/// completes anyway and a tracepoint is emitted. The unacknowledged
/// peer will merge the stamp lazily via CausalWait when it next reads
/// an affected page.
pub fn dsm_fence_close_sync(
&self,
handle: DsmFenceHandle,
) -> Result<(), DsmError>;
}
Wire format — CausalStampWire:
/// Wire encoding of a DsmCausalStamp. Variable-length: 24 bytes header
/// + clock_len × 8 bytes (clock array) + dirty_count × 8 bytes (page addrs).
///
/// Carried as payload of CausalPropagate (DsmMsgType = 0x0060) or
/// piggybacked onto DLM lock-release messages.
#[repr(C)]
pub struct CausalStampWire {
/// Region this stamp belongs to.
pub region_id: Le64, // 8 bytes
/// Slot of the peer that produced this stamp.
pub writer_slot: Le16, // 2 bytes (RegionSlotIndex)
/// Number of entries in the clock array (= region's max_participants).
pub clock_len: Le16, // 2 bytes
/// Number of dirty page addresses following the clock array.
pub dirty_count: Le32, // 4 bytes
/// Epoch counter (monotonically increasing per-peer). Le64 to match
/// DsmPageReport.last_stamp_epoch and the project-wide u64 counter
/// policy. At 1 increment/μs a u32 wraps in ~72 minutes — far too
/// short for production clusters. u64 provides >500,000 years.
pub epoch: Le64, // 8 bytes
// Followed by:
// clock_len × Le64 — the vector clock array (each entry is Le64)
// dirty_count × Le64 — dirty page addresses (4KB-aligned, each Le64)
}
const_assert!(core::mem::size_of::<CausalStampWire>() == 24);
Total wire size: 24 + (max_participants × 8) + (dirty_count × 8).

- 8 participants, 100 dirty pages: 24 + 64 + 800 = 888 bytes. Inline Send.
- 256 participants, 1000 dirty pages: 24 + 2048 + 8000 = 10,072 bytes. Ring entry continuation (Section 5.1).
- 1024 participants, 4096 dirty pages: 24 + 8192 + 32768 = 40,984 bytes. Continuation + RDMA Write for the bulk data.
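The size arithmetic above reduces to one helper (illustrative; the name is not part of the documented API):

```rust
/// CausalStampWire total size: 24-byte fixed header + one Le64 per
/// participant (the vector clock array) + one Le64 per dirty page address.
fn causal_stamp_wire_size(max_participants: usize, dirty_count: usize) -> usize {
    24 + max_participants * 8 + dirty_count * 8
}
```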
New DsmMsgType:
// Causal consistency ([Section 6.6](#dsm-coherence-protocol-moesi--causal-consistency-protocol-dsm_causal))
CausalPropagate = 0x0060, // Standalone causal stamp propagation
CausalAck = 0x0061, // Receiver acknowledges stamp merge
CausalWait = 0x0062, // Reader requests a missing causal predecessor
Transport binding:
| DsmMsgType | RDMA Verb | DSM Payload Size | Notes |
|---|---|---|---|
| CausalPropagate | Send (variable) | 40 + 24 + N×8 + D×8 | Inline if ≤256 B; continuation for larger |
| CausalAck | Send (inline) | 40 B | Control only |
| CausalWait | Send (inline) | 72 B | DsmWireHeader + 32-byte CausalWaitPayload |
Integration with MOESI state transitions:
The vector clock protocol is orthogonal to MOESI — it does not change the GetS/GetM/Inv/PutM message flow. The integration points are:
| MOESI Event (see Section 6.6 for DsmPageState mapping) | Causal Action | When |
|---|---|---|
| Write fault (S→M via GetM) | `vc[own_slot] += 1`; record dirty page | After GetM completes, before resuming thread |
| Write to Exclusive-dirty (M) page | `vc[own_slot] += 1`; record dirty page | On store instruction (no GetM needed — already Exclusive) |
| Read fault (I→S via GetS) | Epoch fast-path check; slow-path if needed | Before mapping page into requester's address space |
| PutM/PutO (writeback) | No clock action | Writeback is data movement, not causal event |
| Inv (invalidation from home) | No clock action | Invalidation drops local copy; no causal effect |
| Lock release (DLM) | Close interval; propagate stamp | After writeback, before lock message sent |
| Lock acquire (DLM) | Open interval; snapshot clock | After lock granted, before page access |
Read-path protocol (detailed):
Peer B wants to read page P in a DSM_CAUSAL region:
1. Page fault → GetS to home.
2. Home sends DataResp with page data.
3. Before mapping the page into B's address space, the DSM subsystem checks
causal visibility:
a. EPOCH FAST PATH (O(1)):
if P.last_stamp_epoch <= B.current_interval.epoch:
ACCEPT — all writes prior to B's interval are visible.
b. SLOW PATH (O(max_participants)):
Fetch P's last writer's vector clock from the home's per-page causal
metadata (DsmPageCausalMeta, see below).
Compare component-wise: for all i, B.vc[i] >= P.writer_vc[i]?
If yes: ACCEPT — B has seen all causal predecessors of P's write.
If no: STALL — B must wait for the missing causal predecessor.
B sends CausalWait(region_id, page_addr, missing_slot, missing_epoch)
to the peer at missing_slot. That peer responds with a
CausalPropagate once it reaches the requested epoch.
B merges the stamp, then re-checks. Typically resolves in one RTT.
TIMEOUT: If CausalWait is not answered within 100 ms, the default
behavior is to return `ETIMEDOUT` to the faulting thread. The
application can retry, fall back to explicit synchronization, or
re-read with a relaxed consistency hint.
A per-region flag `DSM_CAUSAL_DEGRADE_ON_TIMEOUT` (default: false,
set via `dsm_create()` flags) allows the application to opt into
degraded behavior: when set, the timeout serves the page with a
tracepoint warning (`umka_tp_stable_dsm_causal_wait_timeout`)
instead of returning an error. This preserves the safety-by-default
principle while allowing latency-sensitive applications to explicitly
accept the tradeoff.
Rationale: causal ordering IS the consistency guarantee of
`DSM_CAUSAL`. Silently violating it would make lock-free
`DSM_CAUSAL` access (via `dsm_fence_open/close`) quietly unsound.
The timeout is necessary for liveness (avoiding indefinite blocking
on dead peers), but the default behavior must preserve the safety
property.
4. Map page into B's address space. Resume faulting thread.
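The step-3 visibility check combines the O(1) epoch fast path with the O(max_participants) dominance comparison. A sketch under the names used above (`causally_visible` itself is illustrative):

```rust
/// Returns true if reader B may accept page P without stalling.
/// Fast path: every write B could be missing predates B's interval.
/// Slow path: B's vector clock must dominate the last writer's clock
/// component-wise (B has seen all causal predecessors of P's write).
fn causally_visible(
    page_last_stamp_epoch: u64,
    reader_interval_epoch: u64,
    reader_vc: &[u64],
    writer_vc: &[u64],
) -> bool {
    if page_last_stamp_epoch <= reader_interval_epoch {
        return true; // epoch fast path, O(1)
    }
    reader_vc.iter().zip(writer_vc.iter()).all(|(b, w)| b >= w)
}
```

A `false` result is what triggers the CausalWait message to the peer at the missing slot.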
CausalWait message (for stall resolution):
/// CausalWait payload — sent by a reader that cannot satisfy the causal
/// dominance check. The target peer responds with CausalPropagate once
/// it has reached the requested epoch for the specified slot.
///
/// Transmitted inline as a DsmMsg with type CausalWait (72 bytes total,
/// fits within the inline Send threshold).
#[repr(C)]
pub struct CausalWaitPayload {
/// Epoch the reader is waiting for. Le64 to match the widened
/// CausalStampWire.epoch and DsmPageReport.last_stamp_epoch.
pub missing_epoch: Le64, // 8 bytes (offset 0)
/// Vector clock slot of the peer whose epoch is missing. The reader
/// sends this CausalWait to the peer at this slot index. Le16 to match
/// all other slot wire fields (RegionSlotIndex is u16).
pub missing_slot: Le16, // 2 bytes (offset 8)
/// PeerId of the requesting node (who is stalled waiting).
/// Le64 to match PeerId(u64) — all other PeerId wire fields use Le64.
pub requesting_node: Le64, // 8 bytes (offset 10)
/// Virtual address of the page being read (for diagnostics and
/// region lookup on the receiver). The region is identified by
/// page_addr's home-node assignment — no separate region_id needed.
pub page_addr: Le64, // 8 bytes (offset 18)
/// Explicit padding to round to 32 bytes total.
pub _pad: [u8; 6], // 6 bytes (offset 26)
}
// Total payload: 8 + 2 + 8 + 8 + 6 = 32 bytes.
// With DsmWireHeader (40 bytes): 72 bytes inline.
const_assert!(core::mem::size_of::<CausalWaitPayload>() == 32);
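The unaligned offsets above (e.g. `requesting_node` at offset 10) are only legal because the `Le*` wrappers have alignment 1. A host-side model makes this checkable: the struct below substitutes plain byte arrays for the `Le64`/`Le16` wrappers (an assumption about their representation), so `repr(C)` inserts no hidden padding.

```rust
// Host-side model of the CausalWaitPayload wire layout. Le types are modeled
// as byte arrays (alignment 1), so repr(C) packs fields back to back exactly
// as the offset comments in the real struct claim.
#[repr(C)]
struct CausalWaitPayloadModel {
    missing_epoch: [u8; 8],   // offset 0
    missing_slot: [u8; 2],    // offset 8
    requesting_node: [u8; 8], // offset 10 (legal: byte arrays have align 1)
    page_addr: [u8; 8],       // offset 18
    _pad: [u8; 6],            // offset 26, rounds total to 32
}
```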
6.6.10.1 Per-Page Causal Metadata (DSM_CAUSAL Regions Only)¶
The causal slow path (step 3b above) requires the writer's vector clock at the time of the last write to each page. This metadata is stored in a side-table at the home node:
/// Per-page causal metadata stored at the home node for DSM_CAUSAL regions only.
/// Allocated from the region's causal slab pool. Keyed by PageAddr in a per-region
/// XArray alongside the DsmDirectoryEntry.
///
/// Only created on the first write to a page under DSM_CAUSAL consistency. Pages
/// that have never been written have no causal metadata (the fast path always
/// accepts reads for pristine pages since there is no writer to wait for).
pub struct DsmPageCausalMeta {
/// Vector clock of the last writer at the time of the write.
/// Length = region's max_participants. Allocated once from the region's
/// causal slab pool and reused across writes (only the contents change).
pub writer_vc: Box<[u64]>,
/// Epoch of the last writer's causal interval.
pub writer_epoch: u64,
/// Slot index of the last writer (for CausalWait target resolution).
pub writer_slot: RegionSlotIndex,
}
/// Per-region side-table: XArray<DsmPageCausalMeta> keyed by PageAddr.
/// Stored at the home node's region coordinator state.
///
/// Memory overhead: O(dirty_pages × max_participants × 8).
/// For 10K tracked pages × 64 participants: 10_000 × 64 × 8 = ~5 MiB.
/// Acceptable for DSM_CAUSAL regions (which are explicitly opted into).
///
/// Updated by the home node on every PutM/PutO that carries a causal stamp.
/// Read by the home node on causal slow-path queries from readers.
Integration with DLM token binding (Section 6.12):
When a DsmLockBinding covers a DSM_CAUSAL region:
- Lock acquire → opens a causal interval automatically (no explicit
dsm_fence_open() needed).
- Writeback groups (Section 6.12) are respected within the interval: group 0
pages are written back before group 1, but all are part of the same
causal interval. The stamp is propagated after ALL groups complete.
- Lock release → closes the interval, propagates the stamp, THEN releases
the DLM lock. The ordering guarantee: stamp propagation happens-before
lock release, so the next lock holder sees all causal predecessors.
ReplicateWrite clarification:
The ReplicateWrite message referenced in the DSM_CAUSAL description
(Section 5.10) is NOT a separate wire message.
In UmkaOS's DSM, writes propagate via the standard MOESI protocol (GetM for
exclusive access, then local store). The vector clock is propagated via
CausalPropagate at interval close — not per-write. This is the key
optimization: per-write vector clock propagation would add 1 RDMA RTT per
store instruction, which is unacceptable. Per-interval propagation batches
clock updates across all writes in the interval, amortizing the cost.
6.7 PageLocationTracker Extension¶
The PageLocation enum (Section 22.4) includes RemoteNode, RemoteDevice, and
CxlPool variants for distributed memory tracking. The following shows the distributed
variants and their semantics:
// Distributed variants of PageLocation (defined canonically in
// [Section 22.4](22-accelerators.md#accelerator-memory-and-p2p-dma--page-location-tracking))
pub enum PageLocation {
// ... existing variants (CpuNode, DeviceLocal, Migrating, etc.) ...
/// Page is in CPU memory on this NUMA node (existing).
CpuNode(u8),
/// Page is in accelerator device-local memory (existing).
DeviceLocal {
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is being transferred (migration in progress).
/// Consistent with DsmPageState::Migrating = 4 (the canonical definition above).
/// **Canonical definition**: [Section 22.4](22-accelerators.md#accelerator-memory-and-p2p-dma--page-location-tracking)
/// defines the Migrating variant with a side-table index (`migration_id: u32`) to keep
/// `PageLocation` at 24 bytes. The `MigrationRecord` side table (defined in
/// [Section 22.4](22-accelerators.md#accelerator-memory-and-p2p-dma--page-location-tracking), stored in
/// `PageLocationTracker::active_migrations`) holds
/// the full source/target details (source_kind, source_node, source_device,
/// source_addr, target_kind, target_node, target_device, target_addr).
Migrating {
migration_id: u32,
},
/// Page is not yet allocated (existing).
NotPresent,
/// Page is in compressed pool (existing).
Compressed,
/// Page is in swap (existing).
Swapped,
// === New: distributed memory locations ===
/// Page is on a remote peer's CPU memory, accessible via RDMA.
RemotePeer {
peer_id: PeerId,
remote_phys_addr: u64,
dsm_state: DsmPageState,
},
/// Page is on a remote peer's accelerator memory (GPUDirect RDMA).
RemoteDevice {
peer_id: PeerId,
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is in CXL-attached memory pool (hardware-coherent).
CxlPool {
pool_id: u32,
pool_offset: u64,
},
}
Security requirement — CXL pool bounds validation: The kernel MUST validate
pool_id and pool_offset before using them to access memory. An out-of-bounds
pool_id or pool_offset could allow unauthorized access to memory outside the
intended pool, potentially exposing kernel data or allowing privilege escalation.
Validation requirements:
- `pool_id` MUST be validated against the global CXL pool registry (`cxl_pool_count`). Accessing a non-existent pool MUST fail with `EINVAL`.
- `pool_offset` MUST be validated against the target pool's size (`cxl_pools[pool_id].size_bytes`). Accesses beyond the pool boundary MUST fail with `EFAULT`.
- For RDMA-initiated CXL pool access, the remote node's capability (Section 5.7) MUST authorize the specific `pool_id`. A capability granting access to pool 0 MUST NOT be usable to access pool 1.
- CXL pool resize is grow-only — pools can be expanded but never shrunk. This eliminates the TOCTOU race where a pool shrinks between the bounds check and the CPU load/store instruction (seqlocks cannot prevent this race because CPU memory accesses are not rollback-capable). The pool's `size_bytes` is read with `Acquire` ordering; a concurrent grow only increases the valid range, so a stale (smaller) size produces a conservative bounds check, never an out-of-bounds access. Pool deallocation (destroying a pool entirely) requires quiescing all accessors first via the standard RCU grace period mechanism.
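The validation rules can be sketched as a host-side check. The `CxlPool` type, errno constants, and function name here are illustrative stand-ins for the kernel's pool registry, but the ordering and error semantics follow the requirements above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative stand-in for one registered CXL pool (grow-only size).
pub struct CxlPool {
    pub size_bytes: AtomicU64,
}

/// Validate (pool_id, pool_offset, access_len) against the registry.
/// Unknown pool -> EINVAL; out-of-bounds (including offset overflow) -> EFAULT.
pub fn validate_cxl_access(
    pools: &[CxlPool],
    pool_id: u32,
    pool_offset: u64,
    access_len: u64,
) -> Result<(), i32> {
    const EINVAL: i32 = 22;
    const EFAULT: i32 = 14;
    let pool = pools.get(pool_id as usize).ok_or(EINVAL)?;
    // Acquire load: a concurrent grow can only make this check conservative,
    // because pools never shrink while accessors exist.
    let size = pool.size_bytes.load(Ordering::Acquire);
    let end = pool_offset.checked_add(access_len).ok_or(EFAULT)?;
    if end > size {
        return Err(EFAULT);
    }
    Ok(())
}
```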
6.8 DSM Region Lifecycle¶
6.8.1 DSM Region Management¶
Distributed shared memory is opt-in. Processes create DSM regions explicitly:
/// Create a distributed shared memory region.
/// All peers participating in this region can map it.
///
/// Access control uses the capability system ([Section 5.7](05-distributed.md#network-portable-capabilities))
/// instead of a bitmask: any peer holding a valid DistributedCapability with
/// the correct `object_id` (matching `region_id`) and appropriate permissions
/// can join. Capability revocation instantly de-authorizes a peer.
///
/// `max_participants` determines the size of per-page `RegionBitmap` fields
/// and vector clocks. All per-page structures in this region are allocated
/// from slab pools whose slot size matches `ceil(max_participants / 64)` words.
/// Cannot be changed after region creation.
// kernel-internal, not KABI
#[repr(C)]
pub struct DsmRegionCreate {
/// Unique region identifier (cluster-wide).
pub region_id: Le64,
/// Virtual address base for this region (must be page-aligned).
/// Mapped at the same address on all participating peers.
pub base_addr: Le64,
/// Size of the shared region (bytes, page-aligned).
pub size: Le64,
/// Page size for this region (DsmPageSize as Le32).
pub page_size: Le32,
/// Access permissions (DSM_PROT_READ=0x1, DSM_PROT_WRITE=0x2, DSM_PROT_EXEC=0x4).
pub permissions: Le32,
/// Consistency model (DsmConsistency as Le32).
pub consistency: Le32,
/// Maximum number of peers that can participate in this region.
/// Determines `RegionBitmap` word count: W = ceil(max_participants / 64).
/// Range: 1..=MAX_REGION_PARTICIPANTS (1024). Default: 256.
/// Cannot be changed after creation. Join when full returns ENOSPC.
pub max_participants: Le16,
pub _pad1: [u8; 2],
/// Initial placement: which peer holds the pages initially.
/// This peer gets slot 0 in the region's slot map.
pub initial_owner: Le64, // PeerId
/// Home node assignment policy for directory entries.
/// DSM_HOME_HASH = 0: home = hash(region_id, VA) % participant_count (default).
/// DSM_HOME_FIXED = 1: all entries homed on initial_owner.
pub home_policy: Le32,
pub _pad2: [u8; 4],
/// Capability required to join this region. Replaces the old
/// `allowed_nodes: u64` bitmask. Any peer holding a valid
/// DistributedCapability with object_id = region_id and permissions
/// including REMOTE_MEMORY_READ/WRITE can join. No fixed node set —
/// new peers are authorized by issuing capabilities.
pub required_cap: Le64, // CapHandle
/// Behavior flags (bitfield).
/// DSM_EAGER_WRITEBACK = 0x01: flush dirty pages on lock release.
/// DSM_REPLICATE = 0x02: enable directory replication for fault tolerance.
/// DSM_WRITE_UPDATE = 0x04: use write-update protocol instead of write-invalidate
/// (see below).
pub flags: Le32,
/// Maximum dirty pages per causal interval (DSM_CAUSAL regions only).
/// Default: DSM_MAX_DIRTY_PER_INTERVAL (4096). Set to 0 for non-causal.
pub max_dirty_per_interval: Le32,
/// Reserved for future extensions; must be zero.
pub _reserved: [u8; 8],
}
/// Total: 80 bytes.
const_assert!(core::mem::size_of::<DsmRegionCreate>() == 80);
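The slab sizing rule from the `max_participants` doc comment, `W = ceil(max_participants / 64)`, is a one-line helper (illustrative name):

```rust
/// Number of u64 words in a RegionBitmap covering `max_participants` slots:
/// W = ceil(max_participants / 64). All per-page bitmap structures for the
/// region are allocated from slab pools with this slot size.
pub fn region_bitmap_words(max_participants: u16) -> usize {
    (max_participants as usize + 63) / 64
}
```

At the default of 256 participants this is 4 words (32 bytes) per bitmap; at the MAX_REGION_PARTICIPANTS cap of 1024, 16 words.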
#[repr(u32)]
pub enum DsmPageSize {
/// Standard 4KB pages. Best for random access patterns.
Page4K = 0,
/// 2MB huge pages. Best for sequential / bulk access.
/// Reduces TLB misses but increases transfer granularity.
HugePage2M = 1,
}
#[repr(u32)]
pub enum DsmConsistency {
/// Release consistency: writes become visible to other nodes
/// after an explicit release (memory barrier / unlock).
/// Best performance. Standard for most HPC/ML workloads.
Release = 0,
/// Sequential consistency is NOT supported. Providing a total order over all
/// memory operations across nodes would require serializing every write through
/// a single ordering point (or using a distributed total-order broadcast), adding
/// ~5-15 μs per write operation even on RDMA. This overhead would negate the
/// performance benefits of DSM for virtually all workloads. Applications requiring
/// sequential consistency should use explicit distributed locking (DLM, [Section 15.15](15-storage.md#distributed-lock-manager)) or message-passing instead of DSM.
///
/// Reserved for potential future use if hardware (e.g., CXL 3.0 with hardware
/// coherence) provides efficient total ordering.
/// **NOT SUPPORTED**: `dsm_create()` and `DsmRegionCreateBcast` handlers MUST
/// reject `consistency = SequentialReserved (1)` with `EINVAL`. The
/// `from_wire_u8()` method accepts value 1 for forward-compatible
/// deserialization, but all region creation paths must check
/// `consistency != SequentialReserved` before proceeding.
SequentialReserved = 1,
/// Eventual consistency: writes propagate asynchronously.
/// Lowest overhead. Suitable for read-heavy, stale-tolerant data.
///
/// **Propagation mechanism** (Phase 3 deliverable — high-level design):
/// - Writes are applied locally and enqueued in a per-region **propagation log**
/// (circular buffer, one per DSM region per node).
/// - A background propagation thread sends log entries to the home node via
/// RDMA Send (two-sided, for ordering). The home node applies them and
/// pushes updates to other readers via the invalidation protocol.
/// - **Staleness bound**: configurable per-region (default: unbounded).
/// With a bound of T ms, the propagation thread ensures all entries
/// are sent within T ms of the write. A bound of 0 falls back to Release.
/// - **Anti-entropy**: On region join or after network partition heal,
/// nodes exchange version vectors and replay missing updates.
/// See [Section 6.13](#dsm-anti-entropy-protocol).
Eventual = 2,
/// Synchronous (strong) consistency: write completes only after all replicas
/// acknowledge. Provides per-page linearizability — reads always serve from the
/// local replica (all replicas are up to date by construction). Write cost:
/// ~3–5 μs RDMA round-trip. Use case: shared metadata, distributed lock tables.
/// NOT sequential consistency — there is no total order across pages or regions.
/// (Phase 3 deliverable.)
Synchronous = 3,
/// Causal consistency: if a thread writes X then writes Y, any reader that
/// observes Y must also observe X. Enforced via per-region vector clocks.
/// Write cost: 1 RDMA RTT + vector clock update (~2–4 μs).
/// Use case: distributed queues, logs, pub/sub.
/// (Phase 3 deliverable.)
Causal = 4,
}
impl DsmConsistency {
/// Convert to u8 for wire encoding in DsmPageReport.consistency_mode.
/// DsmConsistency has only 5 variants (0-4); u8 wire encoding is
/// space-efficient with no information loss (saves 3 bytes per report
/// vs. transmitting the full repr(u32)).
///
/// **Two wire encodings exist**: u8 (this method, for compact per-page
/// metadata in anti-entropy reports — millions of reports during sync)
/// and Le32 (region management messages, matching `#[repr(u32)]`). Both
/// round-trippable for values 0-4. Intentional.
pub fn to_wire_u8(self) -> u8 { self as u32 as u8 }
/// Reconstruct from u8 wire encoding. Returns None for invalid discriminant
/// values (>4). Accepts SequentialReserved (1) for forward-compatible
/// deserialization — use `from_wire_u8_validated()` for all region creation
/// and operational paths.
///
/// **Visibility**: `pub(crate)` — intended only for raw deserialization
/// in diagnostics, logging, and error reporting where the caller needs to
/// inspect the exact variant before rejecting. External code MUST use
/// `from_wire_u8_validated()`.
pub(crate) fn from_wire_u8(v: u8) -> Option<Self> {
match v {
0 => Some(Self::Release),
1 => Some(Self::SequentialReserved),
2 => Some(Self::Eventual),
3 => Some(Self::Synchronous),
4 => Some(Self::Causal),
_ => None,
}
}
/// Reconstruct from u8 wire encoding AND validate that the consistency mode
/// is supported. This is the recommended public API for all region creation
/// and operational paths. Returns Err for:
/// - Invalid discriminant values (>4)
/// - SequentialReserved (reserved for future use, not implemented)
///
/// All region creation paths, syscall handlers, and wire message processing
/// MUST use this method, not `from_wire_u8()`.
pub fn from_wire_u8_validated(v: u8) -> Result<Self, DsmUnsupportedConsistency> {
match Self::from_wire_u8(v) {
Some(mode) if mode.is_supported() => Ok(mode),
Some(mode) => Err(DsmUnsupportedConsistency::Reserved(mode)),
None => Err(DsmUnsupportedConsistency::InvalidDiscriminant(v)),
}
}
/// Returns true if this consistency mode is currently implemented.
pub fn is_supported(&self) -> bool {
!matches!(self, Self::SequentialReserved)
}
}
/// Error returned by `DsmConsistency::from_wire_u8_validated()`.
pub enum DsmUnsupportedConsistency {
/// A valid but unsupported variant (currently: SequentialReserved).
Reserved(DsmConsistency),
/// An invalid discriminant value (not in the enum range).
InvalidDiscriminant(u8),
}
6.8.1.1 Region Management Wire Messages¶
DSM region lifecycle messages use PeerMessageType codes in the 0x0300-0x03FF range.
These messages are sent via RDMA two-sided Send (all are control messages; none carry
bulk data). All payloads are #[repr(C)], explicitly padded, with all integer fields
using Le types (Section 6.1) for mixed-endian clusters.
// PeerMessageType codes for DSM region management:
DsmRegionCreateBcast = 0x0300,
DsmRegionCreateAck = 0x0301,
DsmRegionJoinRequest = 0x0302,
DsmRegionJoinAccept = 0x0303,
DsmRegionJoinReject = 0x0304,
DsmRegionLeave = 0x0305,
DsmRegionLeaveAck = 0x0306,
DsmSlotCompaction = 0x0310,
DsmSlotCompactionAck = 0x0311,
DsmRegionDestroy = 0x0320,
DsmRegionDestroyAck = 0x0321,
Region creation broadcast:
/// Broadcast when a new DSM region is created. Sent to all peers in the
/// cluster. Only peers holding a valid capability matching `required_cap`
/// will issue a JoinRequest.
///
/// Total: 128 bytes. Wire format for region creation broadcast.
/// Note: this is the wire payload struct, not the same as DsmRegionCreate
/// (which is the in-memory representation with different field types).
// kernel-internal, not KABI
#[repr(C)]
pub struct RegionCreateBcastPayload {
pub region_id: Le64, // 8 (offset 0)
pub base_addr: Le64, // 8 (offset 8)
pub size: Le64, // 8 (offset 16)
pub page_size: Le32, // 4 (offset 24) DsmPageSize as Le32
pub permissions: Le32, // 4 (offset 28)
pub consistency: Le32, // 4 (offset 32) DsmConsistency as Le32
pub max_participants: Le16, // 2 (offset 36)
pub _pad: [u8; 2], // 2 (offset 38)
pub initial_owner: Le64, // 8 (offset 40) PeerId
pub home_policy: Le32, // 4 (offset 48)
pub _pad2: [u8; 4], // 4 (offset 52)
pub required_cap: Le64, // 8 (offset 56) CapHandle
pub flags: Le32, // 4 (offset 64)
pub max_dirty_per_interval: Le32, // 4 (offset 68)
pub _reserved: [u8; 56], // 56 (offset 72) pad to 128 bytes (72 + 56 = 128)
}
const_assert!(core::mem::size_of::<RegionCreateBcastPayload>() == 128);
/// Acknowledgment of RegionCreateBcast. Sent by each peer that receives
/// the broadcast, regardless of whether it will join. Allows the creator
/// to confirm broadcast delivery.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct RegionCreateAckPayload {
pub region_id: Le64,
pub acking_peer: Le64, // PeerId
}
const_assert!(core::mem::size_of::<RegionCreateAckPayload>() == 16);
Region join handshake:
/// Request to join an existing DSM region. Sent by a peer that holds
/// the required capability to the region coordinator (initial_owner or
/// current Raft-assigned coordinator).
///
/// Total: 56 bytes.
#[repr(C)]
pub struct RegionJoinRequestPayload {
pub region_id: Le64, // 8 bytes
pub joining_peer: Le64, // 8 bytes (PeerId)
/// Full HMAC-SHA256 proving the peer holds the required capability.
/// Computed as: HMAC-SHA256(session_key, region_id || peer_id).
/// 32 bytes provides 128+ bit security level (no truncation).
/// Truncating to 8 bytes (64 bits) would allow brute-force forgery
/// in ~2^64 operations, which is below the 128-bit security target.
pub cap_proof: [u8; 32], // 32 bytes
/// DSM protocol version supported by the joining peer. Coordinator
/// rejects the join if `dsm_protocol_version < region.min_supported_version`.
/// This prevents a long-partitioned node running stale protocol from
/// corrupting region state with incompatible wire formats.
pub dsm_protocol_version: Le32, // 4 bytes
pub _pad: [u8; 4], // 4 bytes
}
const_assert!(core::mem::size_of::<RegionJoinRequestPayload>() == 56);
/// Join accepted: coordinator assigns a slot in the RegionSlotMap.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct RegionJoinAcceptPayload {
pub region_id: Le64, // 8 bytes
pub assigned_slot: Le16, // RegionSlotIndex
/// Current number of live participants (for the joiner to size its
/// initial bloom filter and cache structures).
pub current_participants: Le16, // 2 bytes
pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<RegionJoinAcceptPayload>() == 16);
/// Join rejected: capacity exceeded, capability proof invalid, version
/// mismatch, or partition too long.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct RegionJoinRejectPayload {
pub region_id: Le64, // 8 bytes
/// Rejection reason:
/// - 0 = capacity (max_participants reached)
/// - 1 = cap_invalid (HMAC proof failed)
/// - 2 = shutting_down (region being destroyed)
/// - 3 = version_mismatch (dsm_protocol_version too old)
/// - 4 = partition_too_long (exceeded max_partition_duration_secs)
/// - 5 = rdma_quota_exceeded (RDMA pool cannot allocate resources for
/// this region — peer remains in cluster but does not participate in
/// coherence for this region; retry when RDMA capacity is available)
pub reason: Le32,
pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<RegionJoinRejectPayload>() == 16);
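For diagnostics and logging, the reason codes above map to strings as follows (a sketch; the function name is illustrative, and unknown codes are reported rather than rejected, matching the forward-compatibility stance taken elsewhere in the wire protocol):

```rust
/// Decode RegionJoinRejectPayload.reason for human-readable diagnostics.
pub fn join_reject_reason(reason: u32) -> &'static str {
    match reason {
        0 => "capacity (max_participants reached)",
        1 => "cap_invalid (HMAC proof failed)",
        2 => "shutting_down (region being destroyed)",
        3 => "version_mismatch (dsm_protocol_version too old)",
        4 => "partition_too_long (exceeded max_partition_duration_secs)",
        5 => "rdma_quota_exceeded (retry when RDMA capacity is available)",
        _ => "unknown (future reason code)",
    }
}
```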
Region departure:
/// Peer departing from a DSM region. Sent after all dirty pages have been
/// written back (PutM/PutO) and all shared copies dropped (PutS).
/// The coordinator tombstones the departing peer's slot.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct RegionLeavePayload {
pub region_id: Le64, // 8 bytes
pub leaving_slot: Le16, // RegionSlotIndex
pub _pad: [u8; 6],
}
const_assert!(core::mem::size_of::<RegionLeavePayload>() == 16);
/// Coordinator confirms slot tombstone. Safe for the departing peer
/// to forget about this region.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct RegionLeaveAckPayload {
pub region_id: Le64, // 8 bytes
pub acking_peer: Le64, // 8 bytes (coordinator's PeerId)
}
const_assert!(core::mem::size_of::<RegionLeaveAckPayload>() == 16);
Slot compaction (wire format for the protocol described in Section 6.1):
/// Broadcast by the coordinator to all region participants when slot
/// compaction is triggered. Contains the full old→new slot mapping.
///
/// Variable size: 16 + (entry_count × 16) bytes.
/// Maximum: 16 + (1024 × 16) = 16,400 bytes. Messages exceeding the
/// 224-byte ring entry payload limit use the continuation flag
/// ([Section 5.1](05-distributed.md#distributed-kernel-architecture--peer-ring-entry-format)) for multi-entry transmission.
#[repr(C)]
pub struct SlotCompactionPayload {
pub region_id: Le64, // 8 bytes
/// Number of live peers being remapped.
pub entry_count: Le16, // 2 bytes
pub _pad: [u8; 6],
// Followed by entry_count × SlotRemapEntry (16 bytes each).
}
const_assert!(core::mem::size_of::<SlotCompactionPayload>() == 16);
/// One entry in the slot compaction mapping.
#[repr(C)]
pub struct SlotRemapEntry {
pub old_slot: Le16, // RegionSlotIndex (source)
pub new_slot: Le16, // RegionSlotIndex (destination)
pub _pad: [u8; 4],
pub peer_id: Le64, // 8 bytes (PeerId, for verification)
}
const_assert!(core::mem::size_of::<SlotRemapEntry>() == 16);
/// Participant acknowledges compaction completion. Sent after the
/// participant has remapped all local bitmaps and swapped to the new
/// slot map via RCU.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct SlotCompactionAckPayload {
pub region_id: Le64, // 8 bytes
pub acking_peer: Le64, // 8 bytes (PeerId)
}
const_assert!(core::mem::size_of::<SlotCompactionAckPayload>() == 16);
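The participant-side remap that precedes a SlotCompactionAck can be sketched in terms of `SlotRemapEntry` semantics. This is a minimal model: plain `u64` words stand in for the region's slab-allocated bitmaps, and the `SlotRemap` type mirrors the wire entry minus padding and the verification `peer_id`.

```rust
/// One old -> new slot mapping (mirrors SlotRemapEntry, minus wire padding).
pub struct SlotRemap {
    pub old_slot: u16,
    pub new_slot: u16,
}

/// Rebuild one page's sharer bitmap under the new slot numbering.
/// Bits for slots absent from the mapping (departed peers) are dropped,
/// which is the point of compaction.
pub fn remap_bitmap(words: &[u64], map: &[SlotRemap]) -> Vec<u64> {
    let mut out = vec![0u64; words.len()];
    for e in map {
        let (ow, ob) = (e.old_slot as usize / 64, e.old_slot as usize % 64);
        if ow < words.len() && words[ow] & (1u64 << ob) != 0 {
            let (nw, nb) = (e.new_slot as usize / 64, e.new_slot as usize % 64);
            out[nw] |= 1u64 << nb;
        }
    }
    out
}
```

In the kernel this runs at the DSM quiescent point (step 2 of the compaction flow), and the rebuilt maps are published via the RCU slot-map swap.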
Region destruction:
/// Broadcast by the coordinator to initiate region destruction.
/// All participants must write back dirty pages and leave the region
/// before acknowledging.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct RegionDestroyPayload {
pub region_id: Le64, // 8 bytes
pub initiator: Le64, // 8 bytes (PeerId)
}
const_assert!(core::mem::size_of::<RegionDestroyPayload>() == 16);
/// Participant acknowledges region destruction. Sent after the
/// participant has completed all writebacks and released all local
/// state for this region.
///
/// Total: 16 bytes.
#[repr(C)]
pub struct RegionDestroyAckPayload {
pub region_id: Le64, // 8 bytes
pub acking_peer: Le64, // 8 bytes (PeerId)
}
const_assert!(core::mem::size_of::<RegionDestroyAckPayload>() == 16);
Protocol sequences:
Region creation:
1. Creator calls select_transport() (Section 5.5)
for each peer that will participate in the region. The selected transport
(RDMA, CXL shared memory, or TCP fallback) is stored per-peer in the region's
DsmRegionPeerState. All subsequent DSM coherence operations for this region
use the ClusterTransport trait (Section 5.10)
— page fetches use transport.fetch_page(), writebacks use transport.push_page(),
and coherence control messages use transport.send_reliable(). The DSM protocol
is transport-agnostic; only latency differs across transports.
2. Creator broadcasts RegionCreateBcast to all peers.
3. Each eligible peer sends RegionJoinRequest to the creator (who is the initial
coordinator and holds slot 0).
4. Creator validates cap_proof, assigns slots, responds with RegionJoinAccept
or RegionJoinReject.
5. Each accepted peer calls select_transport() for the coordinator and all other
accepted peers, storing the selected transport in its local DsmRegionPeerState.
6. Each accepted peer initializes its local region state (allocate bitmap slab pools,
create directory entries for pages homed on this peer, register with the local
DSM fault handler).
Region departure (graceful):
1. Departing peer writes back all Exclusive-dirty/SharedOwner pages (PutM/PutO)
and drops all SharedReader copies (PutS) — standard MOESI eviction.
2. Departing peer sends RegionLeave to coordinator.
3. Coordinator tombstones the slot, broadcasts SLOT_REMOVED to participants.
4. Coordinator responds with RegionLeaveAck.
5. Background sweep clears the departed peer's bit from all page bitmaps.
Slot compaction (wire flow for the in-memory protocol in
Section 6.1):
1. Coordinator sends SlotCompaction to all participants.
2. Each participant reaches a DSM quiescent point (drains in-flight coherence).
3. Each participant remaps local bitmaps, swaps slot map via RCU.
4. Each participant sends SlotCompactionAck.
5. Coordinator waits for all acks (timeout: 5s), then releases write lock.
6.8.2 DSM Region Destruction Protocol¶
Destroying a DSM region requires coordinated cleanup across all participating nodes:
DSM region destruction for region R:
1. Initiator (any node with the region's destroy capability) sends
REGION_DESTROY(region_id) to all nodes in the region's participant
list (the slot map maintained by the home node directory, see
[Section 6.1](#dsm-foundational-types)).
2. Each participating node:
a. Unmaps all local pages belonging to region R from process page tables.
b. Flushes TLB entries for region R's address range.
c. For pages where this node is the owner: marks pages as reclaimable
(returned to the local physical page allocator).
d. For pages where this node is a reader: discards the local copy
(no writeback needed — reader copies are clean by the single-writer
invariant).
e. Sends REGION_DESTROY_ACK(region_id, node_id) to the initiator.
3. Initiator collects ACKs from all participating nodes.
Timeout: 5 seconds. Nodes that do not ACK within timeout are assumed
dead — their pages are abandoned (will be reclaimed when those nodes
eventually rejoin or are declared dead via heartbeat).
4. After all ACKs received (or timeout):
a. Initiator sends REGION_DIRECTORY_CLEANUP(region_id) to all nodes
that serve as home nodes for pages in this region.
b. Each home node removes all DsmDirectoryEntry records for region R
from its directory. The backup home node ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--split-brain-resolution)) is also
notified to remove shadow entries.
c. The region_id is retired and cannot be reused for 24 hours
(prevents stale references from delayed messages).
5. Region metadata is removed from /sys/kernel/umka/cluster/dsm/regions.
Error handling:
- If a process still has region R mapped when destruction is initiated,
the mapping is force-unmapped and the process receives SIGBUS on
subsequent access attempts.
- If the initiator crashes during destruction, any node can resume
the protocol by re-sending REGION_DESTROY (idempotent — nodes that
already completed destruction simply re-ACK).
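The idempotency property in the last bullet can be modeled as a two-state machine (illustrative; the real handler also performs the unmap/flush/reclaim steps of phase 2):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DestroyState {
    Active,
    Destroyed,
}

/// Handling REGION_DESTROY is idempotent: a node that has already torn the
/// region down simply re-ACKs, so any peer can safely resume the protocol
/// after an initiator crash.
fn handle_region_destroy(state: &mut DestroyState) -> &'static str {
    match state {
        DestroyState::Active => {
            // Phase 2 work would happen here: unmap pages, flush TLB,
            // reclaim owned pages, discard reader copies.
            *state = DestroyState::Destroyed;
            "REGION_DESTROY_ACK"
        }
        DestroyState::Destroyed => "REGION_DESTROY_ACK", // re-ACK, no-op
    }
}
```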
6.8.3 Linux Compatibility Interface¶
DSM is exposed via standard POSIX shared memory with extensions:
Standard POSIX (works unmodified):
fd = shm_open("/my_region", O_RDWR | O_CREAT, 0666);
ftruncate(fd, size);
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
→ On a single node, this is standard shared memory. No DSM.
UmkaOS-specific extension (opt-in):
fd = shm_open("/my_region", O_RDWR | O_CREAT, 0666);
ioctl(fd, UMKA_SHM_MAKE_DISTRIBUTED, &dsm_config);
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
→ Same mmap'd pointer, but pages can now migrate across nodes.
→ Process on Node B can shm_open("/my_region") and map the same region.
For MPI applications: MPI implementations (OpenMPI, MPICH) can use DSM
regions for intra-communicator shared memory windows, replacing the
current combination of mmap + RDMA + application-level coherence with
kernel-managed coherence.
6.9 DSM Operational Properties¶
6.9.1 False Sharing Mitigation¶
When multiple nodes write to different offsets within the same page, the page bounces between owners ("false sharing"), degrading performance severely.
Detection: The home node tracks per-page ownership transfer frequency. If a page
has more than N ownership transfers per second (default: 100), it is flagged as
contended. The contention count is exposed via the umka_tp_stable_dsm_contention
tracepoint.
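A sketch of the home node's per-page detector, under the default threshold of 100 transfers per second (the type and field names are illustrative; `Duration` values stand in for monotonic timestamps):

```rust
use std::time::Duration;

/// Per-page ownership-transfer rate tracker, modeling the home-node
/// contention detector (default threshold: 100 transfers/second).
pub struct ContentionTracker {
    window: Duration,
    threshold: u32,
    window_start: Duration, // monotonic timestamp of the current window
    transfers: u32,
    pub contended: bool,
}

impl ContentionTracker {
    pub fn new(threshold: u32) -> Self {
        Self {
            window: Duration::from_secs(1),
            threshold,
            window_start: Duration::ZERO,
            transfers: 0,
            contended: false,
        }
    }

    /// Record one ownership transfer observed at monotonic time `now`.
    pub fn record_transfer(&mut self, now: Duration) {
        if now - self.window_start >= self.window {
            // New 1-second window: restart the count.
            self.window_start = now;
            self.transfers = 0;
        }
        self.transfers += 1;
        if self.transfers > self.threshold {
            // In the kernel this is where umka_tp_stable_dsm_contention fires.
            self.contended = true;
        }
    }
}
```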
Mitigation 1: Advisory — Log contended pages to a tracepoint, allowing the application to restructure its data layout to avoid cross-node false sharing. This is the lowest-overhead option.
Mitigation 2: Whole-page writeback on release — On a write fault to a contended page under the single-writer model (Section 6.3), the owning node sends the entire 4KB page to the home node at the next release point (memory barrier or unlock). This is simpler than TreadMarks-style Twin/Diff because the single-writer invariant guarantees only one node modifies the page at a time — there are no concurrent writers whose changes need merging. The home node distributes the updated page to readers on their next fault. Overhead: one 4KB RDMA Write per release (~1-2 μs). This is acceptable because contention mitigation already implies the page is frequently transferred. Note: Twin/Diff (creating a copy before writing, then diffing to find changed bytes) would only be beneficial under a relaxed multiple-writer consistency model. UmkaOS's DSM uses single-writer, making whole-page transfer the correct and simpler approach.
Mitigation 3: Sub-page coherence for huge pages — For 2MB huge pages that exhibit contention, the kernel falls back to 4KB sub-page coherence for the contended region. The huge page is logically split into 512 sub-pages, each tracked independently in the DSM directory. The physical huge page is preserved (no TLB cost).
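The sub-page split in Mitigation 3 is pure index arithmetic: a 2MB huge page contains exactly 512 independently tracked 4KB sub-pages. A sketch (function name illustrative):

```rust
const HUGE_2M: u64 = 2 * 1024 * 1024;
const PAGE_4K: u64 = 4096;

/// Index of the 4KB sub-page within its 2MB huge page (0..512), used when a
/// contended huge page falls back to sub-page coherence. The physical huge
/// page is untouched; only the DSM directory tracks the finer granularity.
fn subpage_index(vaddr: u64) -> usize {
    ((vaddr % HUGE_2M) / PAGE_4K) as usize
}
```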
Default: Detection + advisory tracing. Eager writeback on release is opt-in per
DSM region via DsmRegionCreate.flags |= DSM_EAGER_WRITEBACK.
6.9.2 Error Handling¶
Concurrent ownership requests: When two nodes simultaneously request exclusive ownership of the same page, the home node serializes the requests. The first request is processed normally; the second requester is queued and notified when ownership becomes available.
Stale directory: If a node presents an ownership claim with a version counter that doesn't match the directory, the home node detects the stale state and rejects the operation. The requestor retries by re-fetching the directory entry.
Partial transfer: If a page is sent via RDMA Write but the directory update (RDMA Send to home node) fails (e.g., due to a concurrent modification or lost message), the home node detects the version mismatch on the next access and repairs the directory. The transferred page is either adopted (if the transfer was valid) or discarded. Directory updates use two-sided RDMA Send because the DsmDirectoryEntry is at least 72 bytes (for clusters with ≤64 nodes, W=1 bitmap word: sequence(8) + state(1) + _pad(1) + owner_slot(2) + _pad(4) + owner(8) + sharers(8) + lock(4) + _pad(4) + version(8) + rehash_epoch(4) + _pad(4) + wait_queue(8) + entry_lock(4) + _pad(4) = 72 bytes; for larger clusters the sharers bitmap grows by 8 bytes per additional 64-node group, making it even larger). This is far too large for 8-byte RDMA Atomic CAS. The home node's CPU processes the update request locally using the seqlock protocol (Section 6.5).
Deadlock: If Node A waits for ownership of a page held by Node B, while Node B
waits for a page held by Node A, a deadlock occurs. DSM ownership deadlocks are
resolved via timeout on ownership requests (default: 10ms). On timeout, the requesting
operation is aborted and returns EAGAIN. The application retries with backoff. This
is a deadlock recovery mechanism (timeout-based), not true deadlock detection
(which requires a wait-for graph). True deadlock detection with a distributed wait-for
graph is implemented in the DLM (Section 15.15), where lock
dependencies are tracked explicitly. DSM uses the simpler timeout approach because page
ownership requests are transient (microsecond-scale) and building a wait-for graph for
every page fault would add unacceptable overhead to the critical path.
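The timeout-and-retry recovery above can be sketched from the caller's side. `acquire_with_backoff`, the `request_ownership` closure, and the backoff constants are illustrative stand-ins for the real fault path:

```rust
// Caller-side recovery sketch for the 10ms ownership timeout: retry on
// EAGAIN with capped exponential backoff. In-kernel this would sleep or
// yield between attempts; the sketch only tracks the backoff value.
const EAGAIN: i32 = 11;

pub fn acquire_with_backoff<F>(mut request_ownership: F, max_attempts: u32) -> Result<(), i32>
where
    F: FnMut() -> Result<(), i32>,
{
    let mut backoff_us: u64 = 100; // initial backoff after first EAGAIN
    for _ in 0..max_attempts {
        match request_ownership() {
            Ok(()) => return Ok(()),
            Err(e) if e == EAGAIN => {
                // Double the backoff, capped at 10ms (the timeout itself).
                backoff_us = (backoff_us * 2).min(10_000);
            }
            Err(e) => return Err(e), // non-retryable error
        }
    }
    Err(EAGAIN)
}
```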
6.9.3 Honest Performance Expectations¶
DSM provides a programming convenience (shared address space), NOT transparent performance parity with local memory. Expectations must be calibrated:
What DSM IS good for:
- Read-mostly workloads with occasional writes (replicated data).
- Coarse-grained partitioned workloads where each node mostly touches
its own partition, with infrequent cross-node access.
- Replacing explicit message-passing in applications where shared memory
is a natural fit and page-level granularity matches access patterns.
What DSM is NOT good for:
- Fine-grained shared data structures (concurrent hash maps, work-stealing
queues). These cause page-level false sharing and ownership bouncing.
Use explicit RDMA messaging or partitioned data structures instead.
- Workloads with random write patterns across the shared space.
Every cross-node write fault costs ~5-50μs (RDMA round-trip + TLB
invalidation), vs ~100ns for local memory. This is 50-500x slower.
- Latency-sensitive paths. DSM page faults are unpredictable.
Performance model (InfiniBand 200Gb/s, ~2μs RTT):
Local page access (TLB hit): ~1 ns
Local page fault (mmap, disk): ~1-100 μs
DSM read fault (page not present): ~10-18 μs (directory lookup ~4-5 μs
+ ownership negotiation ~3-5 μs
+ page transfer ~3-5 μs
+ local install ~1 μs;
see [Section 6.5](#dsm-page-fault-flow) for detailed flow)
DSM write fault (exclusive ownership): ~10-50 μs (invalidate readers + transfer)
DSM false sharing (bouncing page): ~100+ μs per iteration (pathological)
The mitigations in Section 6.9 help, but cannot eliminate the fundamental cost of network coherence. Applications must be DSM-aware at the data structure level. The kernel's job is to make the common case fast and the pathological case detectable (tracepoints), not to pretend the network is as fast as local memory.
6.9.4 Interaction with Memory Compression¶
The DSM protocol and the memory compression tier (Section 4.12) have a potential conflict: when a page is compressed in the local compress pool, should it be advertised as "present" or "not present" to the DSM directory?
Design decision: Compressed pages are treated as locally present in the DSM protocol. The decompression cost (~1-3 microseconds via LZ4) is far lower than a network fetch (~5-50 microseconds over RDMA), so it never makes sense to fetch a remote copy of a page that exists locally in compressed form.
Interaction rules:
- Remote node requests a locally compressed page: The owning node decompresses from its compress pool, then transfers the uncompressed data via RDMA Write. The compress pool entry is freed after transfer.
- DSM migrates a page to a remote node: The sending node decompresses first, then sends the uncompressed 4KB page. The receiving node may independently compress it based on local memory pressure.
- Compressed page metadata is NOT replicated across nodes. Compression is a node-local optimization, invisible to the DSM directory and coherence protocol.
- DSM coherence protocol uses only the {CpuNode, RemoteNode, NotPresent, Migrating} variants of PageLocation (Section 22.4) for coherence decisions. A "Compressed" DSM state is NOT added — a compressed page is simply "local on CpuNode(N)" from the DSM protocol's perspective. The full PageLocationTracker tracks all variants (including DeviceLocal, Compressed, Swapped, RemoteDevice, CxlPool), but compression is transparent to the DSM directory.
Edge case — double fault path: If Node B requests a page from Node A, and Node A has that page compressed in its compress pool: (1) Node A receives the RDMA request, (2) local MM subsystem decompresses from compress pool (~1-2 microseconds), (3) DSM handler completes the RDMA Write with the decompressed data. The requesting node never knows the page was compressed. Total added latency: ~1-2 microseconds.
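The owner-side serve path for a compressed page can be sketched as follows; the types and the run-length `decompress` stand-in are illustrative (the real pool uses LZ4 and frees the pool entry after transfer):

```rust
// Sketch of the owner-side serve path: a remote read is always answered
// with uncompressed bytes, regardless of local compression state.
pub enum LocalPage {
    Uncompressed(Vec<u8>),        // resident 4KB frame
    Compressed(Vec<(u8, usize)>), // stand-in: run-length encoded (byte, run)
}

// Stand-in decompressor (real path: LZ4, ~1-2 us).
fn decompress(runs: &[(u8, usize)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4096);
    for &(byte, len) in runs {
        out.extend(std::iter::repeat(byte).take(len));
    }
    out
}

/// Serve a remote read: hand back uncompressed bytes; the compress-pool
/// entry is consumed in the process. The requester never sees compression.
pub fn serve_remote_read(page: LocalPage) -> Vec<u8> {
    match page {
        LocalPage::Uncompressed(data) => data,
        LocalPage::Compressed(runs) => decompress(&runs),
    }
}
```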
6.10 Global Memory Pool¶
6.10.1 Design: Cluster Memory as a Unified Tier Hierarchy¶
The memory manager already manages a tier hierarchy:
Current (single-node):
Tier 0: Per-CPU free page pools (fastest, smallest)
Tier 1: Local NUMA node DRAM (fast, local)
Tier 2: Remote NUMA node DRAM (cross-socket, ~150ns)
Tier 3: GPU VRAM ([Section 22.4](22-accelerators.md#accelerator-memory-and-p2p-dma))
Tier 4: Compressed pool ([Section 4.12](04-memory.md#memory-compression-tier))
Tier 5: Swap (NVMe SSD)
Extended (distributed cluster):
Tier 0: Per-CPU free page pools (unchanged)
Tier 1: Local NUMA node DRAM (unchanged)
Tier 2: Remote NUMA node DRAM, same machine (unchanged)
Tier 3: CXL-attached memory pool (~200-400ns) ← NEW
Tier 4: GPU VRAM (unchanged)
Tier 5: Compressed pool (unchanged)
Tier 6: Remote node DRAM via RDMA (~3-5 μs) ← NEW
Tier 7: Remote node compressed pool via RDMA ← NEW
Tier 8: Local NVMe swap (unchanged)
Tier 9: Remote NVMe via NVMe-oF/RDMA ← NEW
Key insight: Tier 6 (remote RDMA, ~3-5 μs) is faster than Tier 8 (local NVMe).
The global memory pool inserts remote memory as a tier BETWEEN
compressed pages and local swap. Compressed pool (Tier 5, ~1-2 μs
decompression) precedes remote RDMA because local decompression is
faster than a network round-trip.
Note: Tier numbers are ordinal positions in the latency-sorted hierarchy, not
fixed identifiers. When distributed mode adds new tier sources (CXL-attached memory,
remote node DRAM), existing sources shift to higher tier numbers. Code that uses
tier numbers MUST NOT hardcode numeric values — instead, use the TierKind enum
(defined in Section 4.11) to identify tier types by semantic name (e.g.,
TierKind::LocalDram, TierKind::GpuVram, TierKind::DsmRemote) and query
mem::tier_ordinal(kind) for the current ordinal position of each kind.
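The ordinal-vs-kind distinction can be sketched as a lookup over the latency-sorted tier list. Variant names beyond those quoted above (e.g. PerCpuPool, RemoteNumaDram) are illustrative:

```rust
// Sketch of TierKind-to-ordinal lookup. Distributed mode inserts the CXL
// and RDMA tiers, shifting later kinds to higher ordinals, which is why
// callers must resolve ordinals through a lookup rather than hardcode them.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum TierKind {
    PerCpuPool, LocalDram, RemoteNumaDram, CxlPool, GpuVram,
    Compressed, DsmRemote, DsmRemoteCompressed, LocalSwap, RemoteNvme,
}

/// Latency-sorted tier list for the current mode (per the tables above).
fn active_tiers(distributed: bool) -> Vec<TierKind> {
    use TierKind::*;
    if distributed {
        vec![PerCpuPool, LocalDram, RemoteNumaDram, CxlPool, GpuVram,
             Compressed, DsmRemote, DsmRemoteCompressed, LocalSwap, RemoteNvme]
    } else {
        vec![PerCpuPool, LocalDram, RemoteNumaDram, GpuVram, Compressed, LocalSwap]
    }
}

/// Current ordinal of a tier kind, or None if the kind is not active.
pub fn tier_ordinal(kind: TierKind, distributed: bool) -> Option<usize> {
    active_tiers(distributed).iter().position(|&k| k == kind)
}
```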
6.10.2 Memory Pool Accounting¶
// umka-core/src/mem/global_pool.rs
/// Global memory pool state (cluster-wide view).
/// Dynamic peer set — no fixed-size arrays.
pub struct GlobalMemoryPool {
/// Per-peer memory availability.
/// XArray keyed by PeerId (u64) — O(1) lookup with native RCU-compatible
/// reads (DSM page fault → check remote peer availability: lock-free,
/// no contention). Writes (peer join/leave, periodic capacity update)
/// use XArray's internal locking. Writes are rare (membership changes +
/// periodic refresh every 10s).
peers: XArray<PeerMemoryState>,
/// Total cluster memory (sum of all peers).
total_cluster_bytes: AtomicU64,
/// Total available for remote allocation (sum of exported pools).
total_available_bytes: AtomicU64,
/// Current remote memory usage by local processes.
local_remote_usage_bytes: AtomicU64,
/// Current memory exported to remote peers.
exported_usage_bytes: AtomicU64,
/// Policy for remote memory allocation.
policy: GlobalPoolPolicy,
/// Write serialization. Reads are RCU (lock-free). Writes are serialized
/// by cluster membership protocol (one join/leave at a time).
write_lock: SpinLock<()>,
}
pub struct PeerMemoryState {
peer_id: PeerId,
/// Total physical memory on this peer.
total_bytes: u64,
/// Memory available for remote allocation.
/// Admin-configurable: don't export all memory.
/// Default: min(25% of total, rdma.max_pool_gib GiB) — capped to prevent excessive
/// page pinning on large systems (local workloads get priority).
/// This pool is backed by the RDMA-registered memory region
/// ([Section 5.4](05-distributed.md#rdma-native-transport-layer--pre-registered-kernel-memory)).
/// Only RDMA-registered pages can be served to remote peers.
export_pool_bytes: u64,
/// Currently allocated to remote peers.
export_used_bytes: u64,
/// Cost to reach this peer (ns), from TopologyGraph ([Section 5.2](05-distributed.md#cluster-topology-model)).
cost_ns: u64,
/// Bandwidth to this peer (bytes/sec), from TopologyGraph.
bandwidth_bytes_per_sec: u64,
}
pub struct GlobalPoolPolicy {
/// Maximum percentage of local memory to export for remote use.
/// Default: 25%. Protects local workloads from starvation.
pub max_export_percent: u32,
/// When local memory pressure exceeds this threshold,
/// start reclaiming exported pages (evicting remote users).
/// Default: 80% of local memory usage.
pub reclaim_threshold_percent: u32,
/// Prefer remote memory over local swap?
/// Default: true (RDMA is faster than NVMe).
pub prefer_remote_over_swap: bool,
/// Maximum remote memory a single process can consume (bytes).
/// 0 = unlimited (subject to cgroup limits).
pub per_process_remote_max: u64,
}
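The default export-pool sizing quoted in the PeerMemoryState comment works out as follows; the function name and cap parameter are illustrative:

```rust
// Sketch of the default export pool computation:
// min(25% of total RAM, rdma.max_pool_gib). The 25% matches the
// GlobalPoolPolicy.max_export_percent default.
const GIB: u64 = 1 << 30;

pub fn default_export_pool_bytes(total_bytes: u64, max_pool_gib: u64) -> u64 {
    let quarter = total_bytes / 4; // max_export_percent default: 25%
    quarter.min(max_pool_gib * GIB)
}
```

For a 512 GiB node with a 64 GiB cap, the cap wins; for a 128 GiB node, the 25% rule wins (32 GiB).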
Security requirement — Cross-node capability chain validation: Remote memory access via the global memory pool MUST validate capability chains across nodes. A capability issued by Node A that authorizes access to memory on Node A must NOT be usable to access memory on Node B without explicit cross-node authorization. Validation requirements:
- Capability scope validation: When Node A's process accesses memory exported by Node B via the global pool, the kernel MUST verify that:
  - The process holds a valid DistributedCapability (Section 5.7) for remote memory access, signed by a trusted issuer
  - The capability's constraints field explicitly authorizes the target PeerId (or contains PEER_ID_ANY for cluster-wide access)
  - The capability has not expired, been revoked, or had its generation invalidated (standard distributed capability validation)
  - The capability's permissions field includes REMOTE_MEMORY_READ and/or REMOTE_MEMORY_WRITE as appropriate for the operation
- Delegation chain integrity: If a capability was derived through delegation (e.g., process P1 on Node A delegates to process P2 on Node B), the kernel MUST verify the entire chain:
  - Each capability in the chain MUST have a valid signature from its issuer
  - Each delegation MUST have DELEGATE permission in the parent capability
  - Derived capabilities MUST be strictly less powerful than their parent (no permission amplification)
  - The delegation depth MUST NOT exceed MAX_CAP_DELEGATION_DEPTH (default: 8) to prevent resource exhaustion
- Remote node attestation: Before honoring a cross-node capability, Node B MUST verify that Node A is a current, authenticated cluster member (not revoked, evicted, or marked Dead per Section 5.8). This check is O(1) via the cluster membership bitmap.
- Audit logging: Cross-node capability validations that fail MUST be logged to the security audit subsystem (Section 2.1) with:
  - Source node ID
  - Target node ID
  - Capability object_id and generation
  - Failure reason (signature invalid, expired, revoked, wrong node, etc.)
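The delegation-chain monotonicity and depth rules can be sketched as a pure check (signature verification elided). The Cap struct shape is illustrative; the permission bits and depth limit follow the text:

```rust
// Sketch of delegation chain validation: root-to-leaf, each parent must
// hold DELEGATE and each child must be strictly less powerful (a proper
// subset of the parent's permission bits). Signature checks are elided.
const DELEGATE: u32 = 1 << 0;
const REMOTE_MEMORY_READ: u32 = 1 << 1;
const REMOTE_MEMORY_WRITE: u32 = 1 << 2;
const MAX_CAP_DELEGATION_DEPTH: usize = 8;

pub struct Cap {
    pub permissions: u32,
}

pub fn validate_chain(chain: &[Cap]) -> bool {
    if chain.is_empty() || chain.len() > MAX_CAP_DELEGATION_DEPTH {
        return false;
    }
    for pair in chain.windows(2) {
        let (parent, child) = (&pair[0], &pair[1]);
        // Amplification: any bit in the child not present in the parent.
        let amplified = child.permissions & !parent.permissions != 0;
        // Strictly weaker: subset AND not identical.
        let strictly_weaker = !amplified && child.permissions != parent.permissions;
        if parent.permissions & DELEGATE == 0 || !strictly_weaker {
            return false;
        }
    }
    true
}
```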
6.10.3 The Killer Use Case: AI Model Memory¶
Cluster: 4 nodes, each 512GB RAM + 4× A100 GPUs (80GB VRAM each)
Total CPU RAM: 2 TB
Total GPU VRAM: 1.28 TB
Total cluster memory: 3.28 TB
Scenario: Run a 405B parameter model (Llama-3.1-405B, ~810GB at FP16)
Without global memory pool (current state of the art):
- Tensor parallelism across 16 GPUs: each GPU holds 1/16 of model (~50GB)
- Model fits in GPU VRAM (80GB per GPU)
- BUT: KV cache for long context (128K tokens) = ~100-200GB additional
- KV cache spills to CPU RAM via UVM → 10-15 μs per page fault to NVMe
- Inference latency: dominated by KV cache spills
With global memory pool:
- GPU VRAM: hot layers (active attention heads, current KV cache entries)
- Local CPU RAM: warm layers (recent KV cache, inactive attention heads)
- Remote CPU RAM (RDMA): cold KV cache entries from other nodes
→ 3-5 μs per page fault (faster than local NVMe!)
- Local NVMe: only for truly cold data (old checkpoints, etc.)
The kernel manages placement transparently via MigrationPolicy
(defined below). The ML framework sees a flat address space; the
kernel handles the rest.
Performance impact:
- KV cache "miss" on remote RDMA: ~5 μs (vs. ~15 μs from NVMe)
- 3x improvement in tail latency for long-context inference
- No application code changes required
6.10.4 Migration Policy¶
Page placement decisions — whether to migrate a page toward its accessor, replicate
it for read sharing, or leave it in place — are governed by the MigrationPolicy
trait. The global memory pool evaluates this policy on every remote page fault to
determine the optimal action.
/// Decision returned by the migration policy for a remote-faulted page.
#[repr(u8)]
pub enum MigrationDecision {
/// Transfer ownership to the accessing node. The page is moved (not
/// copied): the old owner's mapping is invalidated and the directory
/// entry is updated to reflect the new owner. Appropriate when a single
/// node is the dominant accessor.
Migrate = 0,
/// Create a read-only replica on the accessing node without changing
/// ownership. The owner retains the authoritative copy. Appropriate
/// for read-shared data accessed by many nodes (e.g., model weights,
/// lookup tables). Replication does NOT apply to write faults — a
/// write fault always triggers ownership transfer (GetM protocol).
Replicate = 1,
/// Do not migrate or replicate. The fault is serviced by fetching the
/// page via RDMA Read into a temporary mapping that is not cached in
/// the local page table beyond the current access. Appropriate for
/// infrequent or one-shot accesses where the cost of directory updates
/// and invalidation tracking outweighs the benefit of local caching.
Stay = 2,
}
/// Pluggable migration policy trait. Implementations are stateless — all
/// decision inputs are passed via `should_migrate()` arguments. The active
/// policy is selected per DSM region via `DsmCachePolicy.migration_policy`
/// (see [Section 6.8](#dsm-region-management)) and can be changed at runtime without draining
/// in-flight operations (the new policy takes effect on the next fault).
///
/// Hot-path cost: `should_migrate()` is called on every remote page fault
/// (~10-18 μs total fault path). The implementation MUST complete in < 200 ns
/// (no heap allocation, no locks, no I/O). The default policy uses only
/// integer comparisons on the arguments provided.
pub trait MigrationPolicy: Send + Sync + 'static {
/// Evaluate whether `page` should be migrated, replicated, or left in
/// place given the current access pattern.
///
/// # Arguments
///
/// * `page` — Metadata for the faulted page (home node, current state,
/// last transition timestamp). Obtained from the local `DsmPageMeta`
/// (see [Section 6.1](#dsm-foundational-types)).
/// * `accessor` — Node that triggered the fault (the local node).
/// * `access_count` — Number of remote faults from `accessor` to this
/// page's home node for this specific page, accumulated in the
/// directory entry's per-requester fault counter (a saturating u32
/// stored in a compact per-page XArray keyed by `RegionSlotIndex`,
/// reset to zero on ownership transfer).
/// * `is_write` — Whether the fault is a write fault (GetM) or read
/// fault (GetS). Write faults that return `Replicate` are promoted
/// to `Migrate` by the caller (coherence protocol requires exclusive
/// ownership for writes).
/// * `home_state` — Home directory state for the faulted page, piggybacked
/// in the DataResp message. The requestor evaluates the policy locally
/// after receiving data — zero extra round-trip latency.
/// * `sharer_count` — Number of sharers in the home directory's sharer set.
fn should_migrate(
&self,
page: &DsmPageMeta,
accessor: NodeId,
access_count: u32,
is_write: bool,
home_state: DsmHomeState,
sharer_count: u16,
) -> MigrationDecision;
}
Default policy: ThresholdMigrationPolicy
The built-in default policy uses simple threshold-based rules derived from empirical DSM research (Cashmere-2L, Grappa). It is designed for the common case of mixed read/write workloads on 4-32 node clusters.
/// Default migration policy. Stateless — all thresholds are compile-time
/// constants. Registered as the initial policy for all DSM regions unless
/// overridden via `DsmCachePolicy.migration_policy`.
pub struct ThresholdMigrationPolicy;
impl MigrationPolicy for ThresholdMigrationPolicy {
fn should_migrate(
&self,
page: &DsmPageMeta,
_accessor: NodeId,
access_count: u32,
is_write: bool,
home_state: DsmHomeState,
sharer_count: u16,
) -> MigrationDecision {
// Write faults always migrate ownership (coherence protocol
// requires exclusive access for writes). The caller enforces
// this — Replicate is promoted to Migrate for writes — but
// the policy returns Migrate directly for clarity.
if is_write {
// Migrate on first write fault. Rationale: write locality
// is strongly predictive — a node that writes a page once
// is very likely to write it again (temporal locality in
// producer-consumer patterns). Delaying migration increases
// invalidation traffic without benefit.
return MigrationDecision::Migrate;
}
// Read faults: migrate after repeated access from the same node.
if access_count >= 3 {
// 3 remote read faults from the same node indicates sustained
// read locality. Migrating the page eliminates future RDMA
// round-trips (~3-5 μs each) at the cost of one ownership
// transfer (~10-18 μs). Break-even after 3-6 subsequent
// local accesses.
return MigrationDecision::Migrate;
}
// Read-shared: if the home directory shows the page is already
// shared among multiple peers, replicate rather than migrate.
// This avoids ping-ponging ownership between competing readers.
// Uses home_state (from DataResp), NOT the local DsmPage.state_dirty
// (which is NotPresent on the fault path — that's why we faulted).
if home_state == DsmHomeState::Shared && sharer_count > 1 {
return MigrationDecision::Replicate;
}
// Few accesses, not shared: fetch once without caching.
MigrationDecision::Stay
}
}
Decision summary:
| Fault type | access_count | Home state | Decision | Rationale |
|---|---|---|---|---|
| Write | any | any | Migrate | Coherence requires exclusive ownership |
| Read | >= 3 | any | Migrate | Sustained locality; amortize transfer cost |
| Read | < 3 | DsmHomeState::Shared (sharer_count > 1) | Replicate | Avoid ownership ping-pong |
| Read | < 3 | other | Stay | Too few accesses to justify caching overhead |
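The decision table collapses to a small pure function. This standalone restatement (with illustrative names, detached from the trait machinery) is useful for checking the rules in isolation:

```rust
// Condensed restatement of ThresholdMigrationPolicy's decision table.
// The real policy threads DsmPageMeta and node ids; this keeps only the
// four inputs the table depends on.
#[derive(Debug, PartialEq)]
pub enum Decision { Migrate, Replicate, Stay }

pub fn threshold_decide(is_write: bool, access_count: u32,
                        home_shared: bool, sharer_count: u16) -> Decision {
    if is_write { return Decision::Migrate; }          // exclusive ownership required
    if access_count >= 3 { return Decision::Migrate; } // sustained read locality
    if home_shared && sharer_count > 1 { return Decision::Replicate; } // avoid ping-pong
    Decision::Stay
}
```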
Configurable thresholds: The migration threshold (default: 3) and the write-on-first-fault behavior are configurable per region via sysfs:
/sys/kernel/umka/dsm/region/<region_id>/migrate_read_threshold
# # Read/write. Default: 3. Number of remote read faults from the same
# # node before migrating the page. Range: 1-255 (u8). Setting to 1
# # migrates aggressively on first read fault; setting to 255 effectively
# # disables read migration (only writes trigger migration).
/sys/kernel/umka/dsm/region/<region_id>/migrate_write_threshold
# # Read/write. Default: 1. Number of remote write faults before migrating.
# # Range: 1-255. Default of 1 migrates on first write (recommended for
# # most workloads). Higher values reduce migration churn for write-shared
# # patterns at the cost of increased invalidation traffic.
Custom policies: Subsystems or userspace (via the UMKA_DSM_SET_POLICY
ioctl on the DSM file descriptor) can register alternative policies that
implement the MigrationPolicy trait. The policy is a &'static dyn
MigrationPolicy stored in the region's DsmCachePolicy — no heap allocation
on the fault path. Example use cases:
- ML inference: Always replicate model weight pages (read-only, shared
across all GPU nodes). Migrate KV cache pages aggressively (write-heavy,
producer-consumer).
- Database: Higher migration threshold (e.g., 10) to reduce churn for
index scans that touch many pages briefly.
6.10.5 Cgroup Integration¶
/sys/fs/cgroup/<group>/memory.remote.max
# # Maximum remote memory this cgroup can consume (bytes)
# # Default: "max" (unlimited, subject to global pool policy)
/sys/fs/cgroup/<group>/memory.remote.current
# # Current remote memory usage (read-only)
/sys/fs/cgroup/<group>/memory.remote.stat
# # Remote memory statistics:
# # remote_alloc <bytes allocated on remote nodes>
# # remote_faults <page faults resolved from remote>
# # remote_migrations_in <pages migrated from remote to local>
# # remote_migrations_out <pages migrated from local to remote>
# # rdma_bytes_rx <total RDMA data received>
# # rdma_bytes_tx <total RDMA data sent>
/sys/fs/cgroup/<group>/memory.tier_preference
# # Override default tier ordering for this cgroup:
# # "local,cxl,remote,compressed,swap" (default)
# # "local,compressed,swap" (disable remote memory for this cgroup)
# # "local,cxl,remote,swap" (skip compression, prefer remote)
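A sketch of parsing the tier_preference string into an ordered list. The token set follows the examples above; rejecting unknown tokens (failing the write) is an assumption:

```rust
// Sketch parser for memory.tier_preference. Tokens are comma-separated,
// whitespace-tolerant; an unknown token rejects the whole string (the
// assumed sysfs write semantics: all-or-nothing).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum TierPref { Local, Cxl, Remote, Compressed, Swap }

pub fn parse_tier_preference(s: &str) -> Result<Vec<TierPref>, String> {
    s.trim()
        .split(',')
        .map(|tok| match tok.trim() {
            "local" => Ok(TierPref::Local),
            "cxl" => Ok(TierPref::Cxl),
            "remote" => Ok(TierPref::Remote),
            "compressed" => Ok(TierPref::Compressed),
            "swap" => Ok(TierPref::Swap),
            other => Err(format!("unknown tier: {}", other)),
        })
        .collect()
}
```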
6.11 Distributed Page Cache¶
6.11.1 Problem¶
When Node A reads a file that Node B recently accessed and has cached, the standard approach (NFS/CIFS) refetches from the storage server over TCP. This ignores that Node B already has the data cached in its page cache and could serve it faster via RDMA.
6.11.2 Design: Cooperative Page Cache¶
The page cache gains awareness of what other nodes have cached:
Traditional NFS read:
Node A: read() → VFS → NFS client → TCP → NFS server → disk → TCP → Node A
Latency: ~200 μs (network + server + disk if not cached on server)
Cooperative page cache (shared filesystem):
Node A: read() → VFS → page cache miss → WHERE is this page?
Option 1: Remote page cache (RDMA read from Node B's page cache)
Latency: ~3-5 μs
Option 2: Local disk
Latency: ~10-15 μs (NVMe)
Option 3: Remote disk (NVMe-oF/RDMA)
Latency: ~15-25 μs
Option 4: Traditional NFS/CIFS
Latency: ~200 μs
Kernel picks the fastest source automatically.
Page cache miss hook — three-stage speculative protocol: The VFS page cache
read path (Section 14.1) inserts a cooperative cache check on
cache misses for DSM-aware superblocks (sb.flags & MS_DSM_COOPERATIVE). The
protocol is designed to never add latency to the common case:
| Stage | Check | Cost | Eliminates |
|---|---|---|---|
| 1 | Per-peer Bloom filter (NodeBloomFilter) | ~15-30ns (3-5 cache lines) | ~90-95% of remote lookups |
| 2 | Sequential access rejection (FileRaState) | ~5ns (single branch) | ~3-5% more (sequential readahead streams) |
| 3 | Speculative parallel RDMA + NVMe | 0ns added (parallel) | N/A — first responder wins |
Stage 1 uses the counting Bloom filters exchanged via DSM heartbeat piggyback
(see NodeBloomFilter below). A negative result means "definitely not cached
remotely" — the miss proceeds directly to local NVMe. No RDMA traffic generated.
Stage 2 checks whether the readahead engine has classified the current access
pattern as sequential (ra_state.prev_pos + 1 == current_idx). Sequential streams
are unlikely to be cached by a cooperative peer (each node reads its own sequential
range), so RDMA lookup is skipped even if the Bloom filter returns "maybe."
Stage 3 fires for the remaining ~2-5% of random-access misses where Bloom says
"maybe." Both an RDMA cooperative lookup (dsm_cooperative_cache_lookup_async) and
the normal NVMe readahead are issued simultaneously. The RDMA response (~2-5μs
on InfiniBand) typically arrives before NVMe (~10-100μs), providing a real speedup
when the remote cache is warm. When the remote cache is cold, NVMe completes
normally — the RDMA future is cancelled on drop with zero overhead.
Net effect: zero overhead for ~95% of misses; ~5-10μs speedup for ~2-5% where remote cache is warm; never slower than local-only I/O.
See Section 14.1 for the implementation in filemap_get_pages().
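Stages 1 and 2 reduce to a two-input gate ahead of the speculative probe; a minimal sketch (signature illustrative):

```rust
// Gate for stages 1-2 of the cooperative-cache miss path: issue the
// stage-3 speculative RDMA probe only when the Bloom filter says "maybe"
// AND the readahead state does not classify the stream as sequential.
pub fn should_issue_probe(bloom_maybe: bool, prev_pos: u64, current_idx: u64) -> bool {
    let sequential = prev_pos + 1 == current_idx; // FileRaState check
    bloom_maybe && !sequential
}
```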
6.11.3 Page Cache Directory¶
// umka-core/src/vfs/cooperative_cache.rs
/// Unique file identifier for DSM cooperative cache lookups.
/// Constructed from (superblock_id, inode_number) to provide a
/// globally unique identifier across filesystems and nodes.
pub type FileId = u64;
/// Construct a FileId from superblock and inode identifiers.
/// The superblock ID occupies the upper 16 bits; the inode number
/// occupies the lower 48 bits. This supports up to 65536 mounted
/// filesystems and 2^48 inodes per filesystem.
pub fn make_file_id(sb_id: u32, ino: u64) -> FileId {
(((sb_id as u64) & 0xFFFF) << 48) | (ino & 0x0000_FFFF_FFFF_FFFF)
}
/// Distributed page cache directory.
/// Tracks which nodes have cached pages for which files.
///
/// Uses a per-node counting Bloom filter architecture (NOT per-file).
/// Each node maintains ONE filter covering ALL files it has cached,
/// and broadcasts it to peers periodically.
///
/// Memory bound: 256KB per peer. For 100 peers: 25MB total local cache
/// of all peer filters. This replaces the prior per-file design that would
/// have consumed 20KB × files × peers (unbounded).
pub struct CooperativeCache {
/// Per-peer Bloom filters received from cluster peers.
/// Key: PeerId (u64) of the remote peer.
/// Value: that peer's Bloom filter (covering all files cached on that peer)
/// plus the timestamp when the filter was last received.
/// XArray keyed by PeerId — O(1) lookup with native RCU-compatible reads
/// (cache lookup on page fault is lock-free);
/// writes (filter update from peer broadcast) use XArray's internal locking.
peer_filters: XArray<CachedPeerFilter>,
/// This peer's own Bloom filter, broadcast to peers.
local_filter: NodeBloomFilter,
/// Configuration. Tunable at runtime via sysctl.
config: CooperativeCacheConfig,
}
pub struct CooperativeCacheConfig {
/// Maximum memory for cached peer filters (bytes).
/// Default: 64 MiB. When this limit would be exceeded by a new peer's
/// filter, the cooperative cache drops the most distant peer's filter
/// (highest topology cost) instead of allocating. This bounds memory
/// even in very large clusters.
/// Set to 0 to disable the cooperative cache entirely (all lookups
/// return PageSource::Storage — no RDMA probes to remote caches).
pub max_filter_memory_bytes: u64,
/// Maximum number of peer filters to cache. Derived from
/// max_filter_memory_bytes / BLOOM_FILTER_SIZE (256 KB).
/// Default: 256 (= 64 MiB / 256 KB). In a cluster with more peers
/// than this limit, only the nearest peers (by topology cost) have
/// their filters cached.
pub max_cached_peers: u32,
/// Broadcast interval (seconds). Default: 30.
pub broadcast_interval_secs: u32,
/// Stale threshold (seconds). Filters older than this are ignored.
/// Default: 60 (= 2 × broadcast_interval).
pub stale_threshold_secs: u32,
/// Timeout for RDMA probe requests (microseconds).
/// If a DsmProbeRequest does not receive a DsmProbeResponse within this
/// duration, the probe is treated as a miss and the requester falls back
/// to storage. Default: 500 μs (~100× expected RDMA RTT).
/// Set to 0 to disable probing entirely (Bloom filter hits are ignored;
/// all page fetches go directly to storage).
pub probe_timeout_us: u32,
}
/// Counting Bloom filter: 4 bits per counter, 8 hash functions.
/// Size: 256KB per node = 512K counters (m). At k=8 hash functions and n≈50K items,
/// FPR = (1 - e^(-kn/m))^k = (1 - e^(-8*50K/512K))^8 ≈ 1%. Saturates at ~50K items.
/// Formula: n_max = m/k * ln(1/(1-p^(1/k))) ≈ 52K for m=512K, k=8, p=0.01.
///
/// The filter keys are (filesystem_id, inode, page_offset) hashes.
/// A positive result means "this node probably has this page cached."
/// A negative result means "this node definitely does not have this page."
pub struct NodeBloomFilter {
/// 256KB = 512K nibbles = 512K 4-bit counters (2 nibbles per byte).
/// Counting filter supports deletion (decrement on eviction).
counters: [u8; 262144],
/// Approximate count of items in the filter (for load factor tracking).
approx_count: AtomicU64,
/// Last reset timestamp. Filters are periodically reset to prevent
/// saturation (reset cycle: configurable, default 300 seconds).
last_reset: AtomicU64,
}
pub struct CachedPeerFilter {
/// The remote node's Bloom filter (decompressed from ZSTD on receipt).
filter: NodeBloomFilter,
/// Timestamp when this filter was received from the peer.
received_ns: u64,
}
/// Stage 1 fast path: probe all peer Bloom filters for (file, page_offset).
/// Returns `true` if ANY peer's Bloom filter reports a positive result
/// (page might be cached remotely). Returns `false` if ALL filters are
/// negative (page is definitely not cached remotely).
///
/// Cost: ~15-30ns — iterates cached peer filters (RCU read, lock-free),
/// each Bloom probe is 8 hash lookups across 3-5 cache lines.
/// Called from `filemap_get_pages()` on every cache miss for
/// `MS_DSM_COOPERATIVE` filesystems (Stage 1 of the three-stage protocol).
pub fn dsm_bloom_probe(file_id: FileId, page_offset: u64) -> bool {
let cache = dsm_cooperative_cache();
let _rcu = rcu_read_lock();
let hash = bloom_hash(file_id, page_offset);
for (_peer_id, filter) in cache.peer_filters.iter_rcu() {
if filter.filter.test(hash) {
return true; // at least one peer might have it
}
}
false
}
/// Stage 3 async path: initiate a speculative RDMA cooperative lookup.
/// Returns a lightweight future that completes when the RDMA response
/// arrives or the probe timeout expires. The future is cancel-safe —
/// dropping it cancels the outstanding RDMA request with no side effects.
///
/// Internally selects the closest peer with a positive Bloom result
/// (lowest topology cost) and sends a `DsmProbeRequest` via RDMA two-sided
/// Send. The peer looks up the page in its local cache, performs an RDMA
/// Write of the page data to the requester's pre-allocated buffer, and
/// replies with a `DsmProbeResponse` via two-sided Send. If the probe hits,
/// the page data is returned as `Some(PageRef)`.
///
/// Called from `filemap_get_pages()` Stage 3 (speculative parallel issue).
/// Runs concurrently with the NVMe readahead path.
pub fn dsm_cooperative_cache_lookup_async(
file_id: FileId,
page_offset: u64,
) -> DsmProbeFuture {
// 1. Select closest Bloom-positive peer.
// 2. Allocate local page frame (GFP_NOFS — called from page fault context).
// 3. Send DsmProbeRequest via RDMA Send (two-sided; peer responds with
// RDMA Write of page data + DsmProbeResponse Send).
// 4. Return future; completion callback fills the page and wakes waiter.
// Timeout: CooperativeCacheConfig.probe_timeout_us (default 500μs).
}
/// Cancel-safe future for a speculative DSM probe. Dropping this future
/// cancels the outstanding RDMA request and frees the pre-allocated page
/// frame if the probe has not yet completed.
pub struct DsmProbeFuture { /* opaque */ }
impl DsmProbeFuture {
/// Non-blocking check: returns `Some(PageRef)` if the RDMA probe
/// completed successfully, `None` if still pending or missed.
/// After returning `Some`, the page has been handed off; subsequent
/// calls return `None`.
pub fn try_complete(&mut self) -> Option<PageRef> { /* ... */ }
}
impl Drop for DsmProbeFuture {
fn drop(&mut self) {
// Cancel outstanding RDMA request if not yet completed.
// Free pre-allocated page frame. No-op if already completed.
}
}
impl CooperativeCache {
/// Find the best source for a cache page.
pub fn find_page(
&self,
file: FileId,
page_offset: u64,
local_peer: PeerId,
topology: &TopologyGraph,
) -> PageSource {
// 1. Check local page cache first (always).
// 2. Hash (file, page_offset) and query each peer's Bloom filter
// (RCU read traversal of peer_filters).
// A negative result = definitely not cached on that peer.
// A positive result = maybe cached, worth an RDMA probe.
// 3. Of the candidate peers (positive results), pick the closest
// (lowest cost in the topology graph).
// 4. If no remote cache hit: fall back to storage.
}
}
pub enum PageSource {
/// Page is in local page cache. No I/O needed.
LocalCache,
/// Page is likely cached on a remote peer. Try RDMA read.
RemoteCache { peer_id: PeerId, expected_latency_ns: u64 },
/// Page is not cached anywhere. Read from storage.
Storage { device: DeviceNodeId },
/// Page is on remote storage (NVMe-oF/RDMA).
RemoteStorage { peer_id: PeerId, device: DeviceNodeId },
}
Bounded Bloom filter memory: Bloom filters for negative caching are bounded by a two-level architecture. The key invariant: one filter per node covering all files, not one filter per file. This bounds total memory to O(nodes), not O(nodes × files).
Level 1: Per-Peer Counting Bloom Filter (bounded)
Each peer maintains a single counting Bloom filter (NodeBloomFilter, defined above)
for all its cached file pages, not one filter per file.
Memory bound: 256KB per peer. For a 100-peer cluster: 25MB total per peer (for the full cluster's filters). This replaces the prior per-file design that would have required 20KB × files × peers (unbounded and unusable).
False positive rate management: With 256KB = 512K 4-bit counters (m), k=8 hash functions, and n≈50K items, FPR = (1 − e^(−kn/m))^k ≈ 1%. A single peer's active page-cache working set is typically tens of thousands of (file, page_offset) pairs, well within this limit. A false positive means we send one extra RDMA probe to a peer that does not have the page (cost: ~3 μs wasted), not that we miss a page that exists (which would be a correctness error). False negatives cannot occur with a standard Bloom filter.
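The FPR figure above can be checked numerically with the standard Bloom filter approximation (the constants are the ones quoted in this paragraph; `bloom_fpr` is an illustrative helper, not a spec function):

```rust
/// Standard Bloom filter false-positive approximation: (1 - e^(-kn/m))^k,
/// where m = number of counter slots, k = hash functions, n = inserted items.
fn bloom_fpr(m_counters: f64, k_hashes: f64, n_items: f64) -> f64 {
    (1.0 - (-(k_hashes * n_items) / m_counters).exp()).powf(k_hashes)
}
```

For m = 512×1024, k = 8, n = 50,000 this evaluates to roughly 0.66%, comfortably under the ~1% budget stated above.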
When approx_count > 0.8 × capacity (>40K entries), the filter is considered
saturated. On saturation:
- Log a warning (filter accuracy degraded).
- If persistent (>60s), trigger a partial reset: halve all counters
  (right-shift each 4-bit counter by 1), preserving relative insertion recency
  while reducing saturation. This is the same decay applied by the periodic
  maintenance tick (see `bloom_maintenance_tick` below). Halving avoids false
  negatives that would occur from fixed-amount decrement (a counter set by
  exactly one item would be zeroed, making the item appear absent).
- If count exceeds capacity by 2× (>100K entries), trigger a full reset and
  rebuild from the current page cache index.
Level 2: Periodic Reset
/// Periodic maintenance for the counting Bloom filter.
///
/// Uses RCU copy-swap to avoid data races: the maintenance tick
/// allocates a new counter array, applies the decay, and atomically
/// swaps the pointer. Concurrent lookups (RCU readers) continue to
/// read the old array until the grace period elapses.
///
/// This function runs on the DSM maintenance kthread (one per node,
/// not on the hot path). The allocation cost is acceptable:
/// 256 KB per filter, once every BLOOM_RESET_INTERVAL_SECS (300s).
fn bloom_maintenance_tick(filter: &NodeBloomFilter) {
let age_secs = now() - filter.last_reset.load(Acquire);
if age_secs >= BLOOM_RESET_INTERVAL_SECS { // default 300s
// Allocate a new counter array (256 KB).
let mut new_counters = Box::new([0u8; BLOOM_FILTER_SIZE]);
// Copy existing counters with decay: halve each 4-bit counter.
let old_counters = filter.counters.rcu_read();
for (i, &byte) in old_counters.iter().enumerate() {
// Each byte packs two 4-bit counters (high nibble, low nibble).
// Right-shift each nibble by 1 to halve (floor division).
let hi = ((byte >> 4) & 0x0F) >> 1;
let lo = (byte & 0x0F) >> 1;
new_counters[i] = (hi << 4) | lo;
}
// Atomically swap the counter array (RCU update).
filter.counters.update(new_counters);
// Approximate the new count after halving all counters.
// Relaxed store: single writer (300s maintenance tick), TOCTOU between
// load and store is benign. Value is a heuristic for filter saturation.
filter.approx_count.store(
filter.approx_count.load(Relaxed) / 2,
Relaxed,
);
filter.last_reset.store(now(), Release);
}
}
Negative cache lookup protocol:
fn file_page_cached_on_peer(
cache: &CooperativeCache,
peer_id: PeerId,
file: FileId,
page_offset: u64,
) -> BloomResult {
let rcu_guard = rcu_read_lock();
    let Some(peer) = cache.peer_filters.get(&peer_id, &rcu_guard) else {
        return BloomResult::Unknown; // No filter received yet from this peer.
    };
// Expire stale filters (>60 seconds old = 2x the broadcast interval).
if now_ns() - peer.received_ns > BLOOM_STALE_THRESHOLD_NS {
return BloomResult::Unknown;
}
let key_hash = hash_bloom_key(file, page_offset);
if !peer.filter.query(key_hash) {
// Definitely not cached on this peer. No RDMA probe needed.
BloomResult::DefinitelyAbsent
} else {
// Might be cached. Worth an RDMA probe.
BloomResult::MaybePresent
}
}
Filter distribution: Each peer periodically broadcasts its Bloom filter to all
other peers (compressed via ZSTD, ~64KB after compression for 256KB raw). Broadcast
interval: 30 seconds or on significant change (>10K inserts since last broadcast).
Peers cache the received filter in the peer_filters XArray (keyed by PeerId).
Total memory per peer for all peer filters: N_peers × 256KB (e.g., 100 peers = 25MB,
or ~6MB if filters are kept in compressed form and decompressed on query).
Local filter maintenance: When the local peer caches a page, it inserts the
(file_id, page_offset) hash into local_filter. When a page is evicted from the
local page cache, the corresponding counter is decremented (counting Bloom filter
supports deletion). The bloom_maintenance_tick runs on the periodic maintenance timer
to prevent long-term counter accumulation from hash collisions.
6.11.4 RDMA Probe Protocol¶
When the Bloom filter indicates a remote peer may have a page cached, the requester sends an RDMA probe to confirm and fetch the data. This is a best-effort optimization — storage fallback is always correct.
6.11.4.1 Wire Structures¶
/// Request to probe a remote peer's page cache for a specific file page.
/// Sent via RDMA two-sided Send (inline, ≤64 bytes).
///
/// Protocol: requester pre-allocates a 4KB receive buffer from the
/// RDMA-registered memory pool ([Section 5.4](05-distributed.md#rdma-native-transport-layer--pre-registered-kernel-memory))
/// and provides its rkey + physical address so the responder can RDMA-Write
/// the page data directly into the requester's buffer (zero-copy).
#[repr(C)]
pub struct DsmProbeRequest {
/// Unique per-requester monotonic request ID.
/// Used to match responses to requests and detect stale/duplicate replies.
/// Requester increments a per-CPU AtomicU64 counter; uniqueness is
/// guaranteed within a single requester peer.
pub request_id: Le64,
/// Filesystem identity + inode number identifying the file.
/// Same encoding as the Bloom filter key — the responder uses this to
/// look up the page in its local page cache.
/// Wire-safe encoding: native `FileId` (u64) is encoded as `Le64` for
/// cross-endian safety on big-endian peers (PPC32, s390x).
pub file_id: Le64,
/// Page offset within the file (in PAGE_SIZE units).
pub page_offset: Le64,
/// PeerId of the requesting node (for response routing).
pub requester_peer: Le64,
/// RDMA remote key for the requester's pre-allocated receive buffer.
/// The responder uses this rkey in its RDMA Write operation to deposit
/// the page data directly into the requester's memory.
pub rdma_rkey: Le32,
/// Padding for alignment.
pub _pad: Le32,
/// Physical address of the requester's receive buffer (4KB-aligned).
/// The responder's RDMA Write targets this address using `rdma_rkey`.
pub rdma_addr: Le64,
}
const_assert!(core::mem::size_of::<DsmProbeRequest>() == 48);
/// Response to a page cache probe.
/// Sent via RDMA two-sided Send (inline, ≤64 bytes).
///
/// **Wire framing**: Probe messages use the standard `ClusterMessageHeader`
/// (40 bytes) as their outer framing — they are cluster transport messages,
/// NOT DSM coherence messages. They do NOT carry a `DsmWireHeader`. The
/// `ClusterMessageHeader.msg_type` distinguishes them (MSG_DSM_PROBE_REQ
/// and MSG_DSM_PROBE_RESP). This is because probes are page-cache-level
/// operations, not MOESI coherence operations — they operate on the
/// distributed page cache, not on the DSM directory.
///
/// If status == Hit: the responder has already completed an RDMA Write of
/// the 4KB page data to the requester's buffer (rdma_addr/rdma_rkey from
/// the request) before sending this response. The requester can use the
/// data immediately upon receiving this message.
///
/// If status != Hit: no RDMA Write was performed. The requester's receive
/// buffer is unchanged and must be released back to the RDMA pool.
#[repr(C)]
pub struct DsmProbeResponse {
/// Echo of the request_id from DsmProbeRequest.
pub request_id: Le64,
/// Result of the probe (ProbeStatus as Le32).
pub status: Le32,
/// Padding for alignment.
pub _pad: Le32,
/// Generation counter of the page at the time of the RDMA Write.
/// The requester uses this to detect staleness: if a concurrent write
/// invalidated the page between the RDMA Write and this response,
/// the requester detects the mismatch via the page cache's own
/// generation tracking and re-fetches from storage.
/// Zero when status != Hit.
pub page_generation: Le64,
}
const_assert!(core::mem::size_of::<DsmProbeResponse>() == 24);
/// Outcome of a remote page cache probe.
#[repr(u32)]
pub enum ProbeStatus {
/// Page was found in the responder's cache. Data has been RDMA-Written
/// to the requester's buffer. `page_generation` is valid.
Hit = 0,
/// Page is not in the responder's cache (Bloom filter false positive
/// or page was never cached on this peer).
Miss = 1,
/// Page was cached but has been evicted between the Bloom filter check
/// and the probe arrival. This is expected under memory pressure —
/// the Bloom filter's counting decrement may not have propagated yet.
Evicted = 2,
/// Page is currently being written or transferred (MOESI transient state).
/// Requester should fall back to storage rather than retry — the page
/// may be mid-transfer for an extended period.
Busy = 3,
/// Internal error on the responder. The low 16 bits of the Le32 encode
/// a DsmError code for diagnostics; requester treats this as a miss.
Error = 4,
}
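Both wire structs rely on the explicitly little-endian wrapper types from Section 6.1.2. A minimal model of `Le64` shows the byte-order guarantee (a sketch; the field and method names here are assumptions, and the real module also provides `Le32` and friends):

```rust
/// Minimal little-endian u64 wrapper: stores bytes in LE order regardless of
/// host endianness, so the same #[repr(C)] layout is wire-safe on big-endian
/// peers (PPC32, s390x) and little-endian peers alike.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq)]
struct Le64([u8; 8]);

impl Le64 {
    fn from_native(v: u64) -> Self { Le64(v.to_le_bytes()) }
    fn to_native(self) -> u64 { u64::from_le_bytes(self.0) }
}
```

Round-tripping through the wrapper is a no-op on the value, but the in-memory byte order is fixed: the least-significant byte always comes first on the wire.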
6.11.4.2 Protocol Flow¶
Requester Responder (candidate peer)
│ │
│ 1. Allocate 4KB receive buffer from │
│ RDMA pool; obtain rkey + phys addr │
│ │
│ 2. RDMA Send: DsmProbeRequest ───────────→ │
│ (file_id, page_offset, rkey, addr) │
│ │ 3. Look up (file_id, page_offset)
│ │ in local page cache.
│ │
│ ┌── Hit: ──────┤
│ │ │ 4a. Pin page (prevent eviction
│ │ │ during RDMA Write).
│ 5a. ←─── RDMA Write: 4KB ──┘ │ 4b. RDMA Write page data to
│ (one-sided, lands in recv buffer) │ requester's rdma_addr/rkey.
│ │ 4c. Record page_generation.
│ 6a. ←─── RDMA Send: ProbeResponse(Hit) ───┤ 4d. Unpin page.
│ page_generation set │
│ │
│ ┌── Miss: ─────┤
│ 6b. ←─── RDMA Send: ProbeResponse(Miss) ──┘
│ page_generation = 0 │
│ │
│ 7. On Hit: verify page_generation │
│ matches expected. Insert into local │
│ page cache. Release RDMA buffer. │
│ │
│ 7. On Miss/Evicted/Busy/Error: │
│ Release RDMA buffer. Fall back to │
│ storage I/O (standard read path). │
6.11.4.3 Page Cache Integration After RDMA Fetch¶
When a remote probe succeeds (step 7, Hit), the RDMA-fetched page is inserted into the local page cache. Two additional integration steps are required to maintain consistency with the readahead engine and filesystem integrity:
Readahead state update: RDMA-fetched pages bypass the standard ra_submit() path
(Section 4.4), so the per-file FileRaState is not aware of them.
Without correction, the readahead engine would misinterpret the fetched pages as a
readahead window gap and either re-fetch them from storage or reset the sequential
detection heuristic. After inserting an RDMA-fetched page:
- Advance `FileRaState.start` to `max(ra.start, fetched_page_index + 1)` if the
  fetched page falls within or just beyond the current readahead window.
- If the fetch covers a contiguous range (e.g., cooperative cache returned pages
  N through N+k), update `FileRaState.size` to account for the remotely-filled
  portion so that `async_size` triggers at the correct position.
- Set `PageFlags::UPTODATE` on the inserted page (standard for any newly-valid
  page). Do NOT set `PageFlags::READAHEAD` — the page was demand-fetched, not
  speculatively prefetched. The readahead engine will prefetch subsequent pages
  on the next sequential access.
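The window adjustment can be sketched as follows. This is a simplified model: the field names mirror the `FileRaState` description, but the real struct (Section 4.4) also carries `async_size`, which is omitted here:

```rust
/// Simplified readahead window: pages [start, start + size).
struct FileRaState { start: u64, size: u64 }

/// After inserting RDMA-fetched pages [first, first + count), advance the
/// window past the remotely-filled pages so the readahead engine neither
/// re-fetches them from storage nor resets its sequential-detection state.
fn ra_note_rdma_fetch(ra: &mut FileRaState, first: u64, count: u64) {
    let window_end = ra.start + ra.size;
    // Only adjust if the fetch falls within or just beyond the window.
    if first <= window_end {
        let fetch_end = first + count;
        if fetch_end > ra.start {
            // Advance start past the fetched range; shrink size so the
            // window's end position is preserved (floored at zero).
            let new_start = fetch_end.max(ra.start);
            ra.size = window_end.saturating_sub(new_start);
            ra.start = new_start;
        }
    }
}
```

A fetch entirely beyond the window (a random access) leaves the window untouched, matching the "within or just beyond" condition in the first bullet.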
Filesystem-level verification: Some filesystems store per-page checksums (btrfs data checksums, ext4 metadata checksums) to detect silent data corruption. RDMA-fetched pages arrive as raw 4KB data with no filesystem-level integrity guarantee — the remote peer's page cache may have been corrupted by a hardware fault (bitflip in DRAM, silent DMA error).
The DSM calls `mapping.ops.verify_page(mapping, page_index, page)` before
setting `PageFlags::UPTODATE`:
- If verification fails (`Ok(false)` — checksum mismatch): discard the
  RDMA-fetched page, release the RDMA buffer, and fall back to storage I/O.
  Log an `FmaEvent` at `Warning` severity for the remote peer's node
  (potential hardware issue on that node).
- If `verify_page()` returns `Ok(true)` (the default for filesystems without
  per-page checksums, e.g., tmpfs, ext2): the page is accepted as-is (same
  trust model as reading from the local page cache, where no per-page checksum
  exists).
The `verify_page()` method is defined on the `AddressSpaceOps` trait
(Section 14.1). The default implementation returns `Ok(true)` (accept without
verification); filesystems with per-page checksums (btrfs, ZFS) override it to
validate the RDMA-fetched data against their stored checksums.
6.11.4.4 Timeout and Fallback¶
- Default timeout: 500 μs (configurable via
  `CooperativeCacheConfig.probe_timeout_us`). This is ~100× the expected RDMA
  round-trip (~3-5 μs), tolerating transient congestion without blocking the
  page fault path for too long.
- On timeout: treat as a miss. Release the RDMA receive buffer and fall back to
  storage. No retry — the probe is a best-effort latency optimization, and
  storage is always the correct fallback.
- No retry protocol: Probes are never retried. If a probe fails for any reason
  (timeout, error, miss, eviction), the requester immediately falls through to
  the next `PageSource` in priority order (local storage → remote storage →
  NFS). This keeps the worst-case page fault latency bounded to
  `probe_timeout_us + storage_latency` rather than accumulating retries.
6.11.4.5 Race Condition Handling¶
| Race | Detection | Resolution |
|---|---|---|
| Page evicted between Bloom filter check and probe arrival | Responder returns `ProbeStatus::Evicted` | Requester falls back to storage |
| Page modified during RDMA Write (concurrent writer) | `page_generation` mismatch: requester's cached generation differs from the page cache entry's current generation after insertion | Requester discards the stale page and re-fetches from the authoritative source (home node or storage) |
| Page evicted on responder after RDMA Write but before ProbeResponse sent | Response still carries `Hit` with valid `page_generation` — the data was already written to the requester's buffer before eviction | No issue: requester has a valid copy; it becomes the cached copy |
| Responder crashes mid-probe | Requester's timeout fires (500 μs) | Treated as miss; standard storage fallback |
| Multiple requesters probe same peer for same page simultaneously | Each probe is independent (different `request_id`, different receive buffers) | Responder pins the page once per outstanding probe; concurrent RDMA Writes to different buffers are safe |
6.11.4.6 Wire Format Integration¶
Probe messages use the cooperative page cache's own message type codes, separate from the DSM MOESI coherence protocol. They are carried on the same RDMA ring buffer (Section 5.4) but are dispatched to the cooperative cache handler, not the DSM coherence state machine:
/// DsmMsgType extensions for the cooperative page cache probe protocol.
/// Range 0x0080-0x008F (formerly 0x0090-0x0091; reassigned to avoid a numeric
/// collision with PeerMessageType::ThreadMigrateRequest = 0x0090 in the
/// top-level message type space). Separate from the MOESI coherence ranges
/// (0x0001-0x0030) and DSM directory reconstruction (0x0330-0x0332).
impl DsmMsgType {
    pub const ProbeRequest: u16 = 0x0080;
    pub const ProbeResponse: u16 = 0x0081;
}
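As a dispatch-side sanity check, the range allocation table from Section 6.1.1 can be encoded as a classifier. This is an illustrative helper, not part of the spec; the `MsgClass` enum and function name are hypothetical:

```rust
/// Coarse classification of a DsmMsgType code per the Section 6.1.1
/// range-allocation table.
#[derive(Debug, PartialEq)]
enum MsgClass {
    Moesi, Subscriber, WriteUpdate, Causal, AntiEntropy,
    CacheProbe, Futex, Region, Unknown,
}

fn classify_msg_type(code: u16) -> MsgClass {
    match code {
        // MOESI: requestor→home, home→requestor, forwarding/inv, data forward.
        0x0001..=0x0007 | 0x0010..=0x0013 | 0x0020..=0x0023 | 0x0030 => MsgClass::Moesi,
        0x0040..=0x0043 => MsgClass::Subscriber,  // reserved, local-only
        0x0050..=0x0051 => MsgClass::WriteUpdate,
        0x0060..=0x0062 => MsgClass::Causal,
        0x0070..=0x0072 => MsgClass::AntiEntropy,
        // Dispatched to the cooperative cache handler, not the MOESI machine.
        0x0080..=0x008F => MsgClass::CacheProbe,
        0x0090..=0x009F => MsgClass::Futex,
        // Region management + directory reconstruction (via PeerMessageType).
        0x0300..=0x0321 | 0x0330..=0x0332 => MsgClass::Region,
        _ => MsgClass::Unknown,
    }
}
```

The classifier makes the collision-avoidance comment above checkable: 0x0080/0x0081 land in the probe range, while 0x0090 stays in the futex range.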
6.11.5 Cache Coherence for Shared Files¶
When multiple nodes cache the same file, writes must be coordinated:
Write strategy (per-file, configurable):
1. Write-invalidate (default for shared mutable files):
- Writer acquires exclusive ownership (like DSM write fault)
- All reader copies are invalidated via RDMA
- Writer modifies page, becomes sole cached copy
- Other nodes re-fault on next access (get updated page)
2. Write-through (for append-only logs, databases):
- Writer writes to local page cache AND pushes update to owner
- Owner propagates to all readers via RDMA write
- Higher bandwidth cost, but readers see updates faster
3. No coherence (for read-only data, e.g., shared model weights):
- File is marked immutable (or read-only mounted)
- All nodes cache freely, no invalidation needed
- Best case for ML inference: model weights cached everywhere
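The three modes could surface as a per-file configuration enum. This sketch is hypothetical: the enum, the heuristic, and its inputs are assumptions, since the actual configuration surface is not specified here:

```rust
/// Per-file write coherence strategy (the three modes above).
#[derive(Debug, Clone, Copy, PartialEq)]
enum WriteStrategy {
    /// Default for shared mutable files: invalidate all reader copies on write.
    WriteInvalidate,
    /// Append-only logs/databases: push updates to readers via RDMA write.
    WriteThrough,
    /// Immutable or read-only data (e.g., model weights): cache freely,
    /// no invalidation traffic at all.
    NoCoherence,
}

/// Pick a default strategy from file properties (illustrative heuristic).
fn default_strategy(read_only: bool, append_only: bool) -> WriteStrategy {
    if read_only {
        WriteStrategy::NoCoherence
    } else if append_only {
        WriteStrategy::WriteThrough
    } else {
        WriteStrategy::WriteInvalidate
    }
}
```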
Integration with the Distributed Lock Manager (Section 15.15):
For clustered filesystems (Section 15.14), page cache coherence is coordinated through DLM lock operations rather than the DSM coherence protocol. The DLM provides filesystem-aware semantics that the generic DSM protocol cannot:
- Lock downgrade triggers targeted writeback: When a DLM lock is downgraded
  from EX to PR (Section 15.15), only dirty pages tracked by the lock's
  `LockDirtyTracker` are flushed — not the entire inode's page cache. This
  eliminates the Linux problem where dropping a lock on a large file requires
  flushing all dirty pages regardless of how many were actually modified.
- MOESI-like page states: Each cached page on a clustered filesystem carries a
  coherence state relative to the DLM lock protecting it. These use MOESI
  protocol names (not `DsmPageState` variants) because the file cache coherence
  is managed by DLM lock transitions, not by the DSM coherence protocol:
  - Modified (cf. DsmPageState `Exclusive` with dirty bit): Page dirty under
    EX lock. Sole copy in the cluster.
  - Owned (cf. DsmPageState `SharedOwner`): Page was modified, then lock
    downgraded to PR. This node is responsible for writing back on eviction.
    Other nodes may have Shared copies.
  - Exclusive (cf. DsmPageState `Exclusive` clean): Page clean, held under EX
    lock. Can transition to Modified without network traffic.
  - Shared (cf. DsmPageState `SharedReader`): Page clean, held under PR/CR
    lock. Read-only.
  - Invalid (cf. DsmPageState `NotPresent`): Lock released or revoked. Page
    must be re-fetched on next access.
- Per-lock-range dirty tracking: The cooperative page cache directory
  (Section 6.11) integrates with Section 15.15's `LockDirtyTracker` to record
  which pages were dirtied under which lock range. On lock downgrade, the
  writeback is scoped to the lock's byte range — concurrent holders of
  non-overlapping ranges on the same file are not affected. When a
  `DsmLockBinding` is active, the DLM's `LockDirtyTracker` delegates to the
  binding's `DsmDirtyBitmap` (Section 15.15) — a single bitmap serves both DLM
  writeback scoping and DSM coherence tracking, avoiding double bookkeeping.
- Lock upgrade coherence: When a DLM lock is upgraded from PR to EX on a bound
  DSM region, the `DsmLockBinding` auto-lifecycle issues GetM/Upgrade for all
  cached SharedReader pages before signaling grant completion (Section 6.12).
  This guarantees that holding a DLM EX lock implies all bound pages are
  DSM-writable (Exclusive or Modified state). Without this guarantee, the
  first write to each cached page would fault on the hot path.
6.11.6 AI Training Data Pipeline¶
Training data scenario:
- 100TB dataset on shared NVMe storage
- 8 training nodes, each with 4 GPUs
- Each node reads different shards, but shards overlap (data augmentation)
Without cooperative cache:
Each node reads its shard from storage independently.
If Node A and Node B need the same page: two storage reads.
With cooperative cache:
Node A reads page from storage → cached in Node A's page cache.
Node B needs same page → Node A's Bloom filter (cached locally) shows it might have it.
Node B fetches it from Node A via the RDMA probe protocol (Section 6.11.4): ~3 μs (vs ~15 μs from NVMe).
Storage bandwidth saved: proportional to shard overlap.
For a typical ImageNet-style dataset with 30% shard overlap:
~30% reduction in storage I/O, ~30% more effective storage bandwidth.
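The latency side of the same claim follows from a weighted average: overlapped pages are served over RDMA, the rest from NVMe. A sketch of the arithmetic (idealized, assuming every overlapped page is a remote cache hit and using the ~3 μs / ~15 μs figures above):

```rust
/// Expected per-page read latency with a cooperative cache: a fraction
/// `overlap` of pages is served over RDMA, the remainder from NVMe storage.
fn avg_read_latency_us(overlap: f64, rdma_us: f64, nvme_us: f64) -> f64 {
    overlap * rdma_us + (1.0 - overlap) * nvme_us
}
```

At 30% overlap this gives 0.3 × 3 + 0.7 × 15 = 11.4 μs per page versus 15 μs without the cooperative cache, alongside the ~30% storage-I/O reduction.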
6.11.7 DSM Dirty Tracking Coordination¶
DSM pages have two independent dirty tracking mechanisms that must be coordinated to avoid double-writeback or lost writes:
- `PageFlags::DIRTY` — set by the VFS write path (`generic_perform_write`) and
  inspected by the local writeback subsystem (Section 4.6). Drives
  `kupdate`/`bdflush` writeback to the backing filesystem.
- `DsmDirtyBitmap` — set by the DSM coherence subsystem when a page transitions
  to MOESI Modified or Owned state (Section 6.12). Drives DSM MOESI writeback
  (PutM/PutO to the home node).
Both flags track the same underlying event (page content was modified), but they drive different writeback paths (local storage vs. DSM home node). Without coordination, the local writeback thread might flush a page to disk while the DSM writeback is simultaneously sending PutM to the home node, causing redundant I/O and potential state inconsistency.
Coordination protocol:
Page write on a DSM-cached page:
1. VFS write path sets PageFlags::DIRTY (standard).
2. DSM subsystem sets DsmDirtyBitmap bit for this page (standard).
Both flags are set atomically from the writer's context — no race between
the two because a single writer holds the page lock during write.
Local writeback (kupdate/bdflush) encounters a DSM page:
1. Check PageFlags::DSM on the page.
2. If DSM flag is set, check DsmDirtyBitmap for this page.
3. If DsmDirtyBitmap bit is set: **skip this page** — defer to DSM writeback.
The local writeback thread does NOT write the page to the backing filesystem.
Rationale: the DSM home node is the authoritative owner of the page's
coherence state. Writing to local storage would create a stale copy that
diverges from the DSM-managed version once the home node processes PutM.
4. If DsmDirtyBitmap bit is NOT set (page was dirtied by a non-DSM path, or
DSM writeback already completed): proceed with normal local writeback.
DSM writeback (PutM/PutO to home node):
1. DSM subsystem initiates PutM for a Modified page.
2. On PutAck receipt from the home node (confirming data stored):
a. Clear DsmDirtyBitmap bit for this page.
b. Clear PageFlags::DIRTY for this page.
Both clears are performed atomically under the page lock.
3. The page is now clean from both the local and DSM perspectives.
DSM invalidation (Inv from home node):
1. Page transitions to NotPresent. Both PageFlags::DIRTY and DsmDirtyBitmap
bit are cleared (the invalidation discards the local copy entirely).
2. If the page was Modified, the Inv handler sends PutM first (flush-before-
invalidate), which clears both flags via the PutAck path above.
3. After the page transitions to NotPresent: call `dsm_remove_page_tracking(page)`
to remove the page from the per-region tracking XArray, drop the tracking
refcount (freeing the page frame to the RDMA-registered pool if this is
the last reference), and free the embedded DsmPageMeta back to the
per-region slab pool. This is the same cleanup as the eviction path --
invalidation is coherence-mandated eviction.
Key invariant: A DSM page's PageFlags::DIRTY bit is never cleared by local
writeback alone — only DSM writeback (PutAck) or invalidation clears it. Local
writeback is prevented from touching DSM-dirty pages by the DsmDirtyBitmap check.
This ensures the DSM home node always receives the authoritative dirty data before
the page is considered clean.
Performance impact: The additional DsmDirtyBitmap check in the local writeback
path is a single bitmap bit test (O(1), ~2 ns) per DSM page encountered. Non-DSM pages
(the vast majority in typical workloads) skip the check entirely via the PageFlags::DSM
fast-path test.
6.11.8 DSM Eviction Policy¶
DSM-cached pages participate in the standard generational LRU reclaim (Section 4.4) but require additional MOESI-aware writeback logic: evicting a page that is Modified or Owned requires a network round-trip to the home node before the page frame can be reclaimed, while Shared or Exclusive-clean pages can be silently dropped.
6.11.8.1 DsmPage Descriptor¶
/// Metadata for a page managed by the DSM subsystem. Extends the base
/// `Page` descriptor with DSM-specific fields (home node, coherence state,
/// dirty bitmap index). Allocated from the DSM slab cache when a page
/// enters DSM management.
#[repr(C)]
pub struct DsmPage {
/// Base page descriptor (shared with local page cache).
/// Uses PageRef (non-owning refcounted reference to the page cache's
/// Page descriptor) for consistency with the local page cache and VMM
/// subsystems. Arc<Page> would create a separate allocation; PageRef
/// points directly into the page descriptor array managed by the
/// physical memory allocator ([Section 4.2](04-memory.md#physical-memory-allocator)).
pub page: PageRef,
/// Home node for this page, stored as a region slot index (u16).
/// The PeerId can be recovered from the region's slot map when needed.
/// Using RegionSlotIndex instead of PeerId saves 6 bytes per DsmPage
/// and matches the RegionSlotIndex type used in DsmPageMeta.last_writer.
pub home_slot: RegionSlotIndex,
/// Packed MOESI coherence state + dirty flag in a single atomic.
/// Low byte (bits 0-7): DsmPageState discriminant (from DsmPageState::TryFrom<u8>).
/// Bit 8: dirty flag (1 = locally modified since last writeback).
/// Bits 9-15: reserved (must be zero).
///
/// Packing state and dirty into a single AtomicU16 eliminates the TOCTOU race
/// between separate loads of coherence_state and dirty in the eviction path.
/// A single atomic load provides a consistent snapshot of both fields.
///
/// Accessor methods:
/// `fn state(&self) -> DsmPageState`:
/// `DsmPageState::try_from((self.state_dirty.load(Acquire) & 0xFF) as u8)`
/// `fn is_dirty(&self) -> bool`:
/// `(self.state_dirty.load(Acquire) >> 8) & 1 != 0`
/// `fn set_dirty(&self)`:
/// `self.state_dirty.fetch_or(0x100, Release)`
/// `fn clear_dirty(&self)`:
/// `self.state_dirty.fetch_and(!0x0100u16, Release)`
/// `fn set_state(&self, state: DsmPageState)`:
/// CAS loop: read current, replace low byte, write back.
/// CAS bounded to 1 retry (2 independent bit groups: low byte = state,
/// bit 8 = dirty). A concurrent `set_dirty()` changes bit 8; `set_state()`
/// changes low byte. CAS fails once, retries with new bit 8, succeeds.
/// No livelock.
///
/// **Relationship to DsmDirtyBitmap**: The bitmap tracks dirty status at
/// region granularity for bulk writeback scheduling. This per-page flag
/// (bit 8) provides the definitive per-page dirty state for the eviction and
/// writeback hot paths without requiring a bitmap lookup.
pub state_dirty: AtomicU16,
/// Index into the per-region DsmDirtyBitmap.
pub bitmap_index: u32,
/// Timestamp of last remote access (for eviction scoring).
pub last_access_ns: AtomicU64,
/// Embedded per-page metadata (last_writer, last_transition_ns).
/// Stored inline in DsmPage to avoid a separate allocation and to
/// eliminate the undefined `meta_ptr()` accessor. Previous spec
/// versions had DsmPageMeta as a separate slab allocation with
/// duplicated fields. The unified layout consolidates them:
/// - `home_slot` is in DsmPage only (removed from DsmPageMeta).
/// - Coherence state is in `state_dirty` only (no `local_state` field).
/// - DsmPageMeta retains only `last_writer` and `last_transition_ns`.
/// See [Section 6.1](#dsm-foundational-types) for the DsmPageMeta field docs.
pub meta: DsmPageMeta,
}
impl DsmPage {
/// Return a pointer to the embedded DsmPageMeta for slab free operations.
/// Used by `dsm_remove_page_tracking()` when freeing the DsmPage back to
/// the per-region slab pool. Since meta is embedded, this returns a
/// pointer into `self` (the entire DsmPage is the slab allocation unit).
pub fn meta_ptr(&self) -> *const DsmPageMeta {
&self.meta as *const DsmPageMeta
}
}
// DsmPage is kernel-internal (not wire/KABI). Size on 64-bit:
// page(8, PageRef) + home_slot(2, RegionSlotIndex/u16) +
// state_dirty(2, AtomicU16) + bitmap_index(4, u32) +
// last_access_ns(8, AtomicU64) + meta(16, DsmPageMeta) = 40 bytes.
// Same on all 64-bit architectures (PageRef is pointer-sized = 8).
#[cfg(target_pointer_width = "64")]
const_assert!(core::mem::size_of::<DsmPage>() == 40);
// On 32-bit (ARMv7, PPC32): PageRef is 4 bytes (pointer-sized).
// Layout: page(4) + home_slot(2) + state_dirty(2) + bitmap_index(4) = 12 bytes.
// AtomicU64 alignment (8) inserts 4 bytes implicit padding after bitmap_index.
// Total: 4+2+2+4 + 4(pad) + 8+16 = 40 bytes (same as 64-bit by coincidence
// of padding alignment).
#[cfg(target_pointer_width = "32")]
const_assert!(core::mem::size_of::<DsmPage>() == 40);
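The accessor sketch in the `state_dirty` field documentation can be exercised as ordinary Rust. This is a model only: the u8 discriminants are arbitrary stand-ins for `DsmPageState`, and the `PackedState` name is hypothetical:

```rust
use std::sync::atomic::{AtomicU16, Ordering::{Acquire, Release, AcqRel}};

/// Bit 8 of the packed word is the dirty flag; the low byte is the state.
const DIRTY_BIT: u16 = 0x0100;

struct PackedState(AtomicU16);

impl PackedState {
    fn state(&self) -> u8 {
        (self.0.load(Acquire) & 0xFF) as u8
    }
    fn is_dirty(&self) -> bool {
        self.0.load(Acquire) & DIRTY_BIT != 0
    }
    fn set_dirty(&self) {
        self.0.fetch_or(DIRTY_BIT, Release);
    }
    fn clear_dirty(&self) {
        self.0.fetch_and(!DIRTY_BIT, Release);
    }
    /// Replace the low byte (state) while preserving bit 8 (dirty).
    /// CAS loop: only set_dirty/clear_dirty race with it, and they touch
    /// bit 8 only, so retries are bounded in practice.
    fn set_state(&self, state: u8) {
        let mut cur = self.0.load(Acquire);
        loop {
            let new = (cur & !0x00FF) | state as u16;
            match self.0.compare_exchange(cur, new, AcqRel, Acquire) {
                Ok(_) => return,
                Err(actual) => cur = actual,
            }
        }
    }
}
```

A single load of the packed word yields a consistent (state, dirty) snapshot, which is exactly what the eviction path below relies on to avoid the TOCTOU race between two separate fields.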
6.11.8.2 MOESI-Aware Writeback on Eviction¶
When the page reclaim path selects a DSM-cached page for eviction, the DSM eviction
handler inspects the page's coherence state (DsmPage.state() via state_dirty) and performs
the appropriate protocol action:
/// DSM eviction handler. Called from the generational LRU reclaim path
/// ([Section 4.4](04-memory.md#page-cache)) when a DSM-cached page is selected for eviction.
///
/// Returns `EvictionResult::Done` if the page can be immediately freed,
/// or `EvictionResult::Async` if an asynchronous writeback was initiated
/// (the page frame will be freed when PutAck is received from the home node).
///
/// The subscriber's `on_eviction_candidate()` callback
/// ([Section 6.12](#dsm-subscriber-controlled-caching--subscriber-trait)) is consulted BEFORE
/// this function: if the subscriber returns `EvictDecision::Deny`, the page
/// is skipped entirely and remains in the LRU.
pub fn dsm_evict_page(page: &DsmPage) -> EvictionResult {
// Load state and dirty flag atomically from the packed AtomicU16.
// This eliminates the TOCTOU race between separate loads of
// coherence_state and dirty: a concurrent MOESI transition that
// changes both fields is seen atomically (single load on the
// combined value). See DsmPage.state_dirty field documentation.
let combined = page.state_dirty.load(Acquire);
let state = match DsmPageState::try_from((combined & 0xFF) as u8) {
Ok(s) => s,
Err(_) => {
fma_event!(Warning, "DSM page invalid coherence state", combined);
return EvictionResult::Skip;
}
};
let dirty = (combined >> 8) & 1 != 0;
match state {
// ── Modified (MOESI: M) ─────────────────────────────────────────
// This node has the sole dirty copy. The home node's memory is stale.
// Must write back to home before freeing the local frame.
//
// Protocol: send PutM(page_addr, data) to home node.
// Home updates its directory entry: state → Uncached, owner → None.
// Home stores the data into its memory (or forwards to another cacher).
// Home replies PutAck.
// On PutAck receipt: free the local page frame.
//
// Cost: 1× RDMA Write (4KB data) + 1× RDMA Send (PutM header) +
// 1× RDMA Send (PutAck). Total: ~5-8 μs.
DsmPageState::Exclusive if dirty => {
initiate_putm(page);
// Tracking XArray removal and refcount drop happen in the
// PutAck handler (dsm_remove_page_tracking) — NOT here.
// The page frame must remain valid until the home acknowledges
// the writeback, since the RDMA Write reads from the local frame.
EvictionResult::Async
}
// ── Owned (MOESI: O) ────────────────────────────────────────────
// This node has a dirty copy; other nodes may have SharedReader copies.
// Must write back to home so the data survives and sharers can be
// redirected to home for future reads.
//
// Protocol: send PutO(page_addr, data) to home node.
// Home updates directory: removes this node as owner, stores data.
// If sharers exist: home becomes the data source for future reads
// (directory transitions to a state where home memory is up-to-date
// and sharers remain valid).
// If no sharers: home transitions to Uncached.
// Home replies PutAck.
// On PutAck receipt: free the local page frame.
//
// Cost: same as PutM (~5-8 μs).
DsmPageState::SharedOwner => {
initiate_puto(page);
// Tracking XArray removal in PutAck handler (see PutM comment).
EvictionResult::Async
}
// ── Shared (MOESI: S) ───────────────────────────────────────────
// This node has a clean read-only copy. Home memory is up-to-date.
// Can be dropped silently — no data transfer needed.
//
// Protocol: send PutS(page_addr) to home node (control message only,
// no page data). Home removes this node from the sharers bitmap.
// The PutS is fire-and-forget: the local frame is freed immediately
// without waiting for PutAck, because the data is not at risk
// (home has an up-to-date copy). If the PutS is lost (network
// failure), the home directory retains a stale sharer entry. This
// causes performance degradation (not correctness loss): a future
// Inv to this node will be silently dropped (page already evicted),
// and the requesting node will interpret the missing InvAck as a
// timeout (~200 μs) and retry.
//
// Stale sharer cleanup: the anti-entropy protocol
// ([Section 6.13](#dsm-anti-entropy-protocol)) includes a sharer bitmap
// reconciliation step during its periodic sync (every
// BLOOM_RESET_INTERVAL_SECS = 300s). During anti-entropy, the home
// node sends a lightweight "sharer liveness probe" to each peer with
// a sharer bit set for pages that have not seen coherence traffic
// since the last anti-entropy cycle. Peers that no longer cache the
// page respond with PutS, clearing the stale bit. This bounds the
// maximum duration of stale sharer degradation to 300 seconds.
//
// Cost: 1× RDMA Send (PutS header, inline). ~1-2 μs, non-blocking.
DsmPageState::SharedReader => {
send_puts_fire_and_forget(page);
// Remove from per-region page tracking XArray and drop tracking ref.
dsm_remove_page_tracking(page);
EvictionResult::Done
}
// ── Exclusive clean (MOESI: E) ──────────────────────────────────
// This node has the sole copy, but it is clean (identical to home).
// Can be dropped silently — same as Shared for eviction purposes.
//
// Protocol: send PutE(page_addr) to home node (control only).
// Home transitions directory to Uncached (no sharers, no owner).
// Fire-and-forget like PutS.
//
// Cost: 1× RDMA Send (PutE header, inline). ~1-2 μs, non-blocking.
DsmPageState::Exclusive => {
send_pute_fire_and_forget(page);
// Remove from per-region page tracking XArray and drop tracking ref.
dsm_remove_page_tracking(page);
EvictionResult::Done
}
// ── NotPresent ──────────────────────────────────────────────────
// Page is not cached locally. Nothing to do.
DsmPageState::NotPresent => EvictionResult::Done,
// ── Transient states (Migrating, Invalidating) ──────────────────
// Page is mid-transfer. Cannot evict — skip it. The LRU reclaim
// path moves to the next candidate. The transfer will complete
// shortly and the page will be evictable on the next reclaim pass.
DsmPageState::Migrating | DsmPageState::Invalidating => {
EvictionResult::Skip
}
}
}
/// Remove a page from the per-region tracking XArray and drop the tracking
/// refcount. Called during eviction (synchronous paths: PutS, PutE, NotPresent)
/// and after PutAck receipt (asynchronous paths: PutM, PutO).
///
/// The page index key is `(page.page.addr() - region.base_addr) >> PAGE_SHIFT`.
/// After XArray removal, the tracking refcount is decremented via `put_ref()`.
/// If the refcount reaches zero (PTE ref already dropped), the physical page
/// frame is returned to the RDMA-registered pool.
///
/// For async eviction (PutM/PutO), this function is called from the PutAck
/// handler, NOT from `dsm_evict_page()`. The PutAck handler runs in the RDMA
/// completion pool thread.
fn dsm_remove_page_tracking(page: &DsmPage) {
let region = page.region();
let page_index = (page.page.addr() - region.base_addr) >> PAGE_SHIFT;
region.tracking_xa.remove(page_index);
// Drop the tracking refcount. If this is the last ref, the page frame
// is freed to the RDMA-registered pool.
page.page.put_ref();
// Free DsmPageMeta back to per-region slab pool.
region.page_meta_slab.free(page.meta_ptr());
}
pub enum EvictionResult {
/// Page frame freed immediately (synchronous eviction).
Done,
/// Writeback initiated; frame will be freed asynchronously on PutAck.
/// When `EvictionResult::Async` is returned, the caller increments
/// `zone.nr_pages_writeback_pending` by the number of pages submitted
/// for RDMA writeback. Kswapd considers `nr_pages_writeback_pending > 0`
/// as "making progress" and does not trigger OOM until pending writebacks
/// complete or timeout (default 5s). This prevents false OOM kills while
/// RDMA PutM/PutO writebacks are in-flight.
Async,
/// Page cannot be evicted right now (transient state or pinned).
Skip,
}
6.11.8.3 Eviction Triggers¶
DSM page eviction is triggered by three sources:
| Trigger | Mechanism | Urgency |
|---|---|---|
| Local memory pressure | The generational LRU kswapd (Section 4.4) scans the oldest generation for eviction candidates. DSM pages are interleaved with local file-backed pages in the same generation lists. The dsm_evict_page() handler is called for pages identified as DSM-backed (via the PageFlags::DSM flag). | Standard reclaim — backpressure from the physical allocator watermarks. |
| DSM region quota exceeded | When DsmCachePolicy.max_cached_pages is non-zero and the region's local cache count exceeds the quota (Section 6.12), the DSM subsystem proactively evicts the oldest pages in that region's local cache. This runs as a background workqueue task, not on the page fault path. | Region-scoped — does not wait for global memory pressure. Prevents a single region from monopolizing local memory. |
| Explicit invalidation from directory | A remote node's GetM request causes the home node to send Inv to all sharers or FwdGetM to the current owner (Section 6.6). The local node must transition the page to NotPresent and free the frame. This is NOT an eviction in the LRU sense — it is a coherence-mandated invalidation. The page is removed from the LRU generation list as a side effect. | Immediate — blocks the remote requester's page fault until the invalidation completes. |
6.11.8.4 Cost-Aware Eviction Ordering¶
Within the generational LRU's oldest generation, DSM pages are not all equally expensive to re-fetch. The eviction policy incorporates re-fetch cost alongside LRU age to prefer evicting pages that are cheap to bring back:
/// Cost-aware eviction score for a DSM-cached page.
/// Lower score = evict first. Used by the reclaim path to sort eviction
/// candidates within the same LRU generation.
///
/// The score balances two factors:
/// 1. LRU age (older pages score lower — standard LRU behavior).
/// 2. Re-fetch cost (pages with cheap re-fetch paths score lower).
///
/// This prevents the pathological case where reclaim evicts a page that can
/// only be fetched from a remote peer (3-5 μs RDMA) when a page backed by
/// local NVMe storage (10-15 μs, but always available) is equally old.
pub fn dsm_eviction_score(page: &DsmPage, region: &DsmRegion) -> u64 {
let age_score = generation_age(page); // 0 = oldest (evict first)
// Re-fetch cost tiers (lower = cheaper to re-fetch = evict first):
// 0: page backed by local storage (can re-read from local disk)
// 1: page backed by remote storage (NVMe-oF, always available)
// 2: page only available from remote peer cache (peer might evict too)
let refetch_cost = match page_refetch_source(page) {
RefetchSource::LocalStorage => 0_u64,
RefetchSource::RemoteStorage => 1_u64,
RefetchSource::RemotePeerOnly => 2_u64,
};
// Region eviction priority (0 = evict last, 1000 = evict first).
// From DsmCachePolicy.eviction_priority.
let region_priority = region.cache_policy.eviction_priority as u64;
// Composite score: age dominates, cost breaks ties within same age.
// Region priority provides cross-region ordering.
//
// Encoding: [age_score:48 | refetch_cost:8 | region_priority_inverted:8]
// Lower composite = evict first.
    // saturating_sub guards against a priority above the documented 0-1000 range.
    (age_score << 16) | (refetch_cost << 8) | 1000u64.saturating_sub(region_priority).min(255)
}
/// How the page can be re-fetched after eviction.
#[repr(u8)]
pub enum RefetchSource {
/// Page is backed by a local filesystem on local storage.
/// Re-fetch cost: ~10-15 μs (NVMe read). Always available.
LocalStorage = 0,
/// Page is backed by remote storage (NVMe-oF, iSCSI).
/// Re-fetch cost: ~15-25 μs. Available as long as storage network is up.
RemoteStorage = 1,
/// Page exists only in remote peer caches (no persistent backing store
/// within this node's storage reach). Re-fetch cost: ~3-5 μs if the
/// peer still has it, but the peer may also evict under pressure.
/// Evicting this page risks a cascade of remote misses.
RemotePeerOnly = 2,
}
Eviction priority interaction with subscribers: The subscriber's
on_eviction_candidate() callback (Section 6.12)
runs before the cost-aware scoring. If the subscriber returns EvictDecision::Deny,
the page is unconditionally skipped regardless of its score. If the subscriber returns
EvictDecision::WritebackFirst, the dsm_evict_page() handler initiates the appropriate
PutM/PutO writeback and returns EvictionResult::Async. The cost-aware score only
determines the order in which non-denied pages are considered.
Async writeback throttling: When multiple Modified/Owned DSM pages are selected for
eviction simultaneously, the writeback PutM/PutO messages are rate-limited to avoid
saturating the RDMA fabric. The DSM subsystem maintains a per-peer outstanding-writeback
counter (AtomicU32); when it exceeds max_concurrent_writebacks (default: 64), further
evictions are deferred to the next reclaim pass. This caps in-flight writeback data at
64 × 4 KiB = 256 KiB per peer — a theoretical burst of ~50 GB/s at a 5 μs RTT, so on a
typical 100 Gbps fabric (≈12.5 GB/s) the link rate is the binding limit. The counter
therefore bounds queue depth and completion backlog rather than sustained bandwidth,
preventing a reclaim burst from building deep send queues toward any single peer.
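The per-peer gate can be sketched as a lock-free counter with a compare-exchange reservation loop. This is an illustrative userspace model (the names `PeerWritebackGate`, `try_acquire`, and `release` are assumptions, not the kernel's API):

```rust
use std::sync::atomic::{AtomicU32, Ordering::{AcqRel, Acquire, Release}};

const MAX_CONCURRENT_WRITEBACKS: u32 = 64; // default from the text

/// Sketch of the per-peer outstanding-writeback throttle.
pub struct PeerWritebackGate {
    pub outstanding: AtomicU32,
}

impl PeerWritebackGate {
    /// Try to reserve a writeback slot. Returns false when the peer is
    /// saturated and the eviction should be deferred to the next pass.
    pub fn try_acquire(&self) -> bool {
        let mut cur = self.outstanding.load(Acquire);
        loop {
            if cur >= MAX_CONCURRENT_WRITEBACKS {
                return false;
            }
            // CAS loop: only one of N racing evictors claims each slot.
            match self.outstanding.compare_exchange_weak(cur, cur + 1, AcqRel, Acquire) {
                Ok(_) => return true,
                Err(observed) => cur = observed,
            }
        }
    }

    /// Release a slot; called from the PutAck completion handler.
    pub fn release(&self) {
        self.outstanding.fetch_sub(1, Release);
    }
}
```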
6.11.8.4.1 Compound Failure: Memory Pressure + DSM¶
DSM's RDMA transport is designed to operate under local memory pressure without circular dependencies:
- **RDMA message pool**: Pre-allocated at cluster join time from a dedicated `GFP_KERNEL` reservation. The pool is exempt from memory reclaim — the OOM killer cannot reclaim RDMA message buffers. Pool size is configured at join time based on the number of peer nodes and expected concurrent page operations (default: 256 message buffers per peer, 512 bytes each = 128 KiB per peer).
- **PutS (shared eviction)**: When the local node evicts a DSM page in Shared state, it sends a 32-byte inline RDMA Send to the home node to surrender the sharing right. This operation is zero-allocation — the message is constructed on the stack and sent via a pre-allocated RDMA send buffer from the message pool.
- **PutM (modified eviction)**: When evicting a page in Modified state, the page data must be written back to the home node before the local frame can be reclaimed. The RDMA WRITE uses the evicted page itself as the source buffer — no additional memory allocation is required. The page is pinned for the duration of the RDMA transfer, then unpinned and freed upon completion.
- **Emergency reclaim path**: Under extreme memory pressure (below `min_free_kbytes`), the DSM reclaim path can evict DSM-cached pages without waiting for RDMA completion. The page is marked as `DsmEvictPending` and the RDMA writeback proceeds asynchronously. If the PutM RDMA fails for an emergency-evicted Modified page, the page data is lost — the home node's memory is stale for M-state pages (this is a fundamental MOESI property: Modified means home memory is not current). The application observes `SIGBUS` on next access, the same behavior as when the sole Exclusive owner crashes (Section 5.8). Shared (S) and Exclusive-clean (E) pages can be safely evicted without writeback because the home node retains a valid copy.
See also: Section 4.5 for OOM scoring adjustments for DSM-owning processes.
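The zero-allocation property of the message pool reduces to a fixed free list carved out up front. A userspace sketch under the default sizing above (a `Mutex` stands in for the kernel's lock; `MsgPool` and its methods are illustrative names):

```rust
use std::sync::Mutex;

const MSG_BUF_SIZE: usize = 512; // default buffer size from the text

/// Sketch of a pre-allocated per-peer RDMA message pool. All buffers are
/// reserved at construction (cluster join time); `acquire()` only pops a
/// free index and never allocates, so it is safe under memory pressure.
pub struct MsgPool {
    free: Mutex<Vec<usize>>, // indices of free buffers
    count: usize,
}

impl MsgPool {
    pub fn new(count: usize) -> Self {
        MsgPool { free: Mutex::new((0..count).collect()), count }
    }

    /// Returns a free buffer index, or None when the pool is exhausted
    /// (callers back off and retry rather than allocating).
    pub fn acquire(&self) -> Option<usize> {
        self.free.lock().unwrap().pop()
    }

    /// Return a buffer after the RDMA send completion.
    pub fn release(&self, idx: usize) {
        debug_assert!(idx < self.count);
        self.free.lock().unwrap().push(idx);
    }

    /// Total pre-reserved bytes (default: 256 × 512 B = 128 KiB per peer).
    pub fn reserved_bytes(&self) -> usize {
        self.count * MSG_BUF_SIZE
    }
}
```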
6.12 Subscriber-Controlled Caching¶
Kernel subsystems that manage DSM-backed data (clustered filesystems, distributed databases, service provider layers) need direct control over DSM caching behavior. The default demand-fault-only model is insufficient for workloads that know their access patterns in advance and need to coordinate cache state with distributed locks (DLM) or transaction protocols.
6.12.1 Subscriber Trait¶
// umka-core/src/dsm/subscriber.rs
/// Trait for kernel subsystems that control DSM caching behavior.
/// Registered per-region via `dsm_register_subscriber()`.
/// At most one subscriber per region (the subsystem that created it).
///
/// **Time budget contract**: All callbacks are invoked on the DSM page fault
/// critical path (between TLB miss and RDMA page fetch). Callback latency
/// directly adds to fault latency. The contract:
///
/// - `on_page_fault()`: MUST complete within **1 μs** (worst case).
/// Typical implementation: check a per-thread DLM lock table (hash
/// lookup, O(1)). Must NOT acquire locks, allocate memory, or perform
/// I/O. If the subscriber cannot make a decision within 1 μs, it MUST
/// return `DsmFaultHint::Default` and defer complex logic to an
/// asynchronous path.
///
/// - `on_eviction_candidate()`: MUST complete within **1 μs**.
/// Typical: check a DLM dirty-page bitmap (bit test, O(1)).
///
/// - `on_writeback_complete()`: MUST complete within **10 μs**.
/// Called from RDMA completion context (not fault path). May update
/// subscriber-internal bookkeeping. Must NOT block.
///
/// The DSM subsystem enforces the time budget via a tracepoint watchdog:
/// if any callback exceeds its budget, the kernel emits
/// `umka_tp_stable_dsm_subscriber_slow(region_id, callback, elapsed_ns)`
/// and increments the region's `subscriber_slow_count` counter (visible
/// via `/sys/kernel/dsm/regions/<id>/subscriber_slow_count`). After 100
/// consecutive violations, the subscriber is forcibly deregistered and the
/// region reverts to default DSM behavior (no subscriber callbacks). This
/// prevents a buggy subscriber from degrading the entire DSM subsystem.
pub trait DsmSubscriber: Send + Sync {
/// Called on every page fault within the subscribed region, BEFORE the
/// page fault handler acquires `PageFlags::LOCKED`. The callback runs
/// with a 1 μs budget and MUST NOT block — it performs only XArray
/// lookups and bloom filter checks.
///
/// Returns a hint that influences DSM fault handling:
/// - `DsmFaultHint::Default`: proceed with normal local fault handling
/// (standard demand fetch, single page).
/// - `DsmFaultHint::PrefetchAhead(n)`: fetch this page + n subsequent pages.
    /// - `DsmFaultHint::Reject`: subscriber knows this access is invalid (e.g.,
    ///   DLM lock not held). The VMM fault handler delivers `SIGSEGV` with
    ///   `si_code == SEGV_ACCERR` to the faulting thread (matching the VMM
    ///   fault dispatch table's error path for DSM faults -- see
    ///   [Section 4.8](04-memory.md#virtual-memory-manager--page-fault-handler)).
/// - `DsmFaultHint::FetchRemote { source_node }`: DSM will fetch the page
/// via RDMA from the specified node; the fault handler waits on a
/// `DsmFetchCompletion` future (no busy-wait — the thread is parked
/// and woken by the RDMA completion callback).
/// - `DsmFaultHint::MigrateThread { target_peer }`: suggest migrating the
/// faulting thread to the page's owner peer instead of fetching the page.
/// **Phase 4+ feature**: not implemented in Phase 2/3. The VMM bridge
/// function (`dsm_handle_fault`) falls back to `DsmFaultHint::Default`
/// when this variant is returned, logging a warning. Subscribers SHOULD
/// NOT return this variant until Phase 4 process migration is available.
///
/// Time budget: 1 μs. Must not block.
fn on_page_fault(&self, region_id: u64, va: u64,
fault_type: DsmFaultType) -> DsmFaultHint;
/// Called when DSM needs to evict a page from local cache (memory pressure).
/// Subscriber can override the decision:
/// - `EvictDecision::Allow`: proceed with eviction.
/// - `EvictDecision::Deny`: page is pinned (subscriber holds DLM lock on it).
/// - `EvictDecision::WritebackFirst`: page is dirty and must be written back
/// before eviction (subscriber tracks dirty state via DLM).
/// Time budget: 1 μs. Must not block.
fn on_eviction_candidate(&self, region_id: u64, va: u64) -> EvictDecision;
/// Called when an asynchronous writeback completes (RDMA write to home
/// node completed + PutAck received). The subscriber clears the page's
/// dirty state only after receiving this callback — without the ack, the
/// page remains dirty, ensuring crash consistency even if DSM writeback
/// is lost mid-flight.
///
/// The ack is piggybacked on the next coherence message to the
/// subscriber's node (no extra round-trip) when possible; if no
/// coherence traffic is pending, a standalone PutAck is sent.
///
/// Time budget: 10 μs. Must not block.
fn on_writeback_complete(&self, region_id: u64, va_start: u64,
page_count: u32);
/// Called AFTER a page has been evicted from local cache (reclaim
/// completed). This is a post-eviction notification — distinct from
/// `on_eviction_candidate()` which is a pre-eviction permission check.
///
/// The subscriber uses this callback to:
/// (a) Update the page location tracker (directory entry) to reflect
/// that this node no longer holds the page.
/// (b) Optionally initiate asynchronous ownership transfer to another
/// node that has expressed interest (via the DSM interest bitmap).
///
/// Called from: `shrink_folio_list()` in the reclaim path, after the
/// page frame has been freed. Referenced as `dsm_subscriber_on_eviction()`
/// in [Section 4.4](04-memory.md#page-cache).
///
/// Time budget: 10 μs. Must not block (reclaim context).
fn on_eviction(&self, region_id: u64, va: u64);
}
#[repr(u8)]
pub enum DsmFaultType {
Read = 0,
Write = 1,
}
pub enum DsmFaultHint {
/// Standard demand fetch (single page). For read faults, this triggers
/// a `GetS` to the home node. For write faults, this triggers an
/// `Upgrade` (if local copy exists, S→M) or `GetM` (if no local copy, I→M).
/// The DSM handler uses `AccessType` to determine the protocol path —
/// the subscriber does not need to differentiate.
Default,
/// Fetch the faulting page plus `n` subsequent pages.
PrefetchAhead(u32),
/// Reject the fault. DSM returns -EACCES.
Reject,
/// DSM will fetch the page via RDMA from the specified source node.
/// The fault handler waits on a `DsmFetchCompletion` future: the
/// faulting thread is parked (not busy-waiting) and woken by the
/// RDMA completion callback when data arrives.
/// NOTE: `source_node` is a hint — the home node's directory lookup
/// determines the actual data source. For read faults, this issues
/// `GetS` to the home (not directly to source_node).
FetchRemote { source_node: PeerId },
/// Request exclusive ownership of the page. Used by subscribers that
/// know the next access will be a write (e.g., write-intensive regions).
/// Issues `GetM` (if I→M) or `Upgrade` (if S→M) to the home node.
/// Data is transferred only for I→M; for S→M the local copy is already
/// present. This is semantically equivalent to `Default` on a write fault
/// but allows the subscriber to force ownership acquisition on a read fault.
FetchExclusive,
/// Suggest migrating the faulting thread to the page's owner peer
/// instead of fetching the page. Inspired by TidalScale's "wandering
/// vCPU": when a thread repeatedly faults on pages owned by the same
/// remote peer, moving ~20 KB of thread context (registers + kernel
/// stack) is cheaper than fetching N × 4 KB pages.
///
/// DSM checks: (a) destination peer is alive, (b) thread has no
/// pinned local resources (open device handles, RT scheduling class,
/// CPU affinity mask that excludes the destination). If any check
/// fails, falls back to `Default`.
///
/// The subscriber returns this when it detects data locality is
/// strongly remote — e.g., >80% of recent faults for this thread
/// target the same peer. The cluster scheduler ([Section 5.6](05-distributed.md#cluster-aware-scheduler--process-migration))
/// performs the actual migration using the lightweight thread
/// migration path (register state + kernel stack only, ~10-20 μs).
MigrateThread { target_peer: PeerId },
}
/// **No-subscriber fallback**: When a DSM region has no registered subscriber
/// (the common case for application-visible DSM via `sys_dsm_map()`), the VMM
/// DSM handler behaves as if the subscriber returned `DsmFaultHint::Default`.
/// The handler uses `AccessType` (from the page fault) to determine protocol:
/// - `AccessType::Read` → `GetS` (shared read copy)
/// - `AccessType::Write` → `Upgrade` (if SharedReader) or `GetM` (if NotPresent)
///
/// The `dsm_subscriber_on_page_fault()` function returns `Default` immediately
/// when no subscriber is registered (O(1) — check `region.subscriber.is_none()`).
/// Completion token for DSM remote page fetch. The faulting thread blocks
/// on this until the RDMA completion callback wakes it.
///
/// Implemented as a `WaitQueue` entry linked to the RDMA CQ completion handler.
/// When the RDMA completion event arrives (data written to the local page frame),
/// the CQ handler calls `complete.wake()` which wakes the parked thread.
///
/// Lock level: The `DsmFetchCompletion`'s internal WaitQueue uses lock level 175
/// (between VMA_LOCK(105) and PAGE_LOCK(180) in the lock hierarchy).
///
/// Interruptibility: The wait is `TASK_KILLABLE` — `SIGKILL` wakes the thread
/// immediately, which then checks the signal and returns `-EINTR` up the fault
/// path (ultimately causing SIGSEGV or process death). Other signals do not
/// interrupt the wait (preventing spurious fault retries).
pub struct DsmFetchCompletion {
/// WaitQueue for the faulting thread.
pub waitq: WaitQueue,
/// Completion status: set by the RDMA CQ handler.
pub status: AtomicU8, // 0 = pending, 1 = success, 2 = timeout, 3 = error
/// The physical page frame number (PFN) where data was written by RDMA.
/// Stored as `AtomicU64` wrapping a `Pfn` value (0 = not yet allocated).
/// `Pfn` is a newtype `Pfn(u64)` with no atomic variant; `AtomicU64` is
/// the appropriate atomic container. Type-safe access pattern:
/// Read: `Pfn::from_raw(self.page_frame.load(Acquire))`
/// Write: `self.page_frame.store(pfn.raw(), Release)`
/// The sentinel value 0 is safe because PFN 0 is never used for DSM
/// pages (it is below the kernel's direct-map floor on all architectures).
pub page_frame: AtomicU64, // Pfn value; 0 = not yet allocated
/// Per-page pending-forward list for deferred FwdGetS/FwdGetM requests
/// that arrive while this fault is collecting InvAcks (write faults only).
/// Capacity 16; overflow returns Nack(Busy) to the home.
pub pending_forwards: SpinLock<ArrayVec<DsmMsg, 16>>,
/// Request sequence number for early-timeout deduplication.
/// Monotonically increasing per-node. Responses with mismatched req_seq
/// are silently dropped.
pub req_seq: u64,
}
pub enum EvictDecision {
/// Allow eviction (page is clean or subscriber doesn't care).
Allow,
/// Deny eviction — page is pinned by subscriber (e.g., active DLM lock).
Deny,
/// Allow eviction, but write back to home first.
WritebackFirst,
}
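To make the callback contract concrete, here is a userspace sketch of a subscriber that enforces DLM lock coverage. The types are simplified stand-ins for the kernel's `DsmSubscriber`, `DsmFaultHint`, and `EvictDecision`, and `held_lock_pages`/`dirty_pages` are hypothetical O(1) hash tables (compatible with the 1 μs budget):

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum FaultHint { Default, PrefetchAhead(u32), Reject }

#[derive(Debug, PartialEq)]
pub enum EvictChoice { Allow, Deny, WritebackFirst }

/// Hypothetical filesystem subscriber: a fault is valid only when the DLM
/// lock covering the page is held; a dirty page under a held lock must be
/// written back before eviction. Both checks are O(1) lookups.
pub struct FsSubscriber {
    pub held_lock_pages: HashSet<u64>, // pages covered by a held DLM lock
    pub dirty_pages: HashSet<u64>,     // pages dirtied under a held lock
}

impl FsSubscriber {
    pub fn on_page_fault(&self, va: u64, sequential: bool) -> FaultHint {
        let page = va >> 12;
        if !self.held_lock_pages.contains(&page) {
            return FaultHint::Reject; // DLM lock not held: invalid access
        }
        if sequential { FaultHint::PrefetchAhead(16) } else { FaultHint::Default }
    }

    pub fn on_eviction_candidate(&self, va: u64) -> EvictChoice {
        let page = va >> 12;
        if !self.held_lock_pages.contains(&page) {
            return EvictChoice::Allow;  // no lock: subscriber doesn't care
        }
        if self.dirty_pages.contains(&page) {
            EvictChoice::WritebackFirst // dirty under lock: flush first
        } else {
            EvictChoice::Deny           // clean but pinned by the active lock
        }
    }
}
```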
6.12.2 Per-Region Cache Policy¶
/// Cache policy set at region creation or updated at runtime.
/// Controls DSM behavior for all pages in the region.
pub struct DsmCachePolicy {
/// Maximum pages cached locally from this region.
/// 0 = unlimited (bounded only by system memory pressure).
pub max_cached_pages: u64,
/// Number of pages to prefetch on sequential access detection.
/// Default: 16. Set to 0 to disable prefetch entirely.
pub prefetch_window: u32,
/// Writeback mode for dirty pages.
pub writeback_mode: DsmWritebackMode,
/// Eviction priority relative to other regions.
/// 0 = highest priority (evict last), 1000 = lowest (evict first).
/// Default: 500.
pub eviction_priority: u16,
/// Maximum partition duration (seconds) before forced state discard
/// on rejoin. If a peer is partitioned longer than this, it must
/// perform a full snapshot sync instead of incremental anti-entropy.
/// Default: 86400 (24 hours). Set to 0 to disable (always allow
/// incremental sync regardless of partition length). Satellite or
/// high-latency deployments should increase this value.
pub max_partition_duration_secs: u64,
}
#[repr(u8)]
pub enum DsmWritebackMode {
/// Write-back: dirty pages are written to home lazily (on eviction
/// or explicit writeback). Default. Lowest network traffic.
WriteBack = 0,
/// Write-through: every write is immediately sent to home.
/// Highest consistency, highest network traffic. Used by subscribers
/// that need strong durability guarantees (e.g., database WAL regions).
WriteThrough = 1,
/// Write-combining: dirty pages are batched and written to home
/// at configurable intervals. Balance of consistency and throughput.
WriteCombining = 2,
}
6.12.3 Subscriber Control API¶
Subscriber operations (Prefetch, SubscriberInvalidate, SubscriberWriteback,
SubscriberAck) are local kernel API calls, not wire messages. The DSM subsystem
translates them into standard MOESI coherence messages for the wire — see
Section 6.6 for the wire encoding
and transport binding.
/// Explicit cache control operations on a DSM region.
/// These are in addition to the callback-based DsmSubscriber trait.
impl DsmRegionHandle {
/// Prefetch pages into local DSM cache proactively.
/// Pages are fetched from their home peers via the standard coherence
/// protocol (GetS). Prefetch failures (page busy, network error) are
/// silently dropped — the subsequent demand fault will retry.
///
/// This sends a `DsmMsg::Prefetch` to the local DSM subsystem, which
/// issues GetS for each page in the range.
pub fn dsm_prefetch(
&self,
va_start: u64,
page_count: u32,
priority: DsmPrefetchPriority,
) -> Result<(), DsmError>;
/// Invalidate locally cached pages. Pages in Exclusive or SharedOwner state
/// are silently dropped (data loss if not written back first — caller's
/// responsibility to call dsm_writeback() before invalidating dirty pages).
///
/// Use case: DLM lock downgrade — subscriber knows its cached copy is
/// about to become stale, so it proactively invalidates.
pub fn dsm_invalidate(
&self,
va_start: u64,
page_count: u32,
) -> Result<(), DsmError>;
/// Write back dirty pages to home. If `sync` is true, blocks until
/// home acknowledges all writebacks. If false, returns immediately
/// and writeback proceeds asynchronously (completion notified via
/// `DsmSubscriber::on_writeback_complete()`).
///
/// Use case: DLM lock release — subscriber writes back dirty pages
/// before releasing the lock so the next lock holder sees current data.
pub fn dsm_writeback(
&self,
va_start: u64,
page_count: u32,
sync: bool,
) -> Result<(), DsmError>;
/// Update the cache policy for this region at runtime.
pub fn dsm_set_cache_policy(
&self,
policy: &DsmCachePolicy,
) -> Result<(), DsmError>;
}
6.12.4 DLM-DSM Bidirectional Notification Hooks¶
The DLM and DSM subsystems maintain bidirectional notification hooks to keep
lock state and page state synchronized. Both hooks are registered via the
DsmDlmBridge struct, initialized at DLM startup when DSM regions with lock
bindings exist.
/// Bidirectional bridge between DLM lock state and DSM page state.
/// One instance per DLM lockspace that has DSM-bound resources.
/// Registered during lockspace creation if DSM regions are active.
pub struct DsmDlmBridge {
pub lockspace_id: u64,
}
impl DsmDlmBridge {
/// Called by DLM when a lock is downgraded from EX/PW to CR/NL.
/// Triggers dirty page writeback for any DSM pages protected by
/// the downgraded lock's DsmLockBinding. The writeback respects
/// writeback group ordering (lower groups first).
pub fn dsm_on_lock_downgrade(
&self, resource: &DlmLockResource,
old_mode: LockMode, new_mode: LockMode,
);
/// Called by DSM when a page is invalidated by the coherence
/// protocol (Inv message from home). Clears the page from the
/// lock's dirty tracker bitmap so the next lock release does not
/// attempt to write back an already-invalidated page.
pub fn dlm_dirty_tracker_clear(
&self, resource: &DlmLockResource, page_offset: u64,
);
}
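The writeback-group ordering that `dsm_on_lock_downgrade()` must respect can be illustrated with a small sketch. `Binding` here is a simplified stand-in for `DsmLockBinding` (Section 6.12.7), keeping only the fields the ordering decision needs:

```rust
/// Simplified stand-in for `DsmLockBinding`: on lock downgrade, dirty
/// bindings are flushed in ascending `writeback_group` order (e.g., WAL
/// pages in group 0 strictly before data pages in group 1).
pub struct Binding {
    pub name: &'static str,
    pub writeback_group: u16,
    pub dirty: bool,
}

/// Returns the names of dirty bindings in the order their writebacks
/// would be issued on a lock downgrade.
pub fn downgrade_writeback_order(bindings: &mut Vec<Binding>) -> Vec<&'static str> {
    // sort_by_key is stable: bindings within a group keep creation order.
    bindings.sort_by_key(|b| b.writeback_group);
    bindings
        .iter()
        .filter(|b| b.dirty) // clean bindings need no flush
        .map(|b| b.name)
        .collect()
}
```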
6.12.5 Coherence Mechanism Selection¶
Filesystems choose the coherence mechanism per lock resource via the
subscriber's coherence_mode() method:
/// Coherence path selection for a DLM-protected resource.
#[repr(u8)]
pub enum CoherenceMode {
/// DLM-only: coarse-grained, lock-based consistency. Used for metadata
/// (inode locks, extent locks) where lock granularity matches the
/// access pattern and DSM page-level tracking adds no benefit.
DlmOnly = 0,
/// DSM-only: page-granularity MOESI coherence. Used for data pages
/// where fine-grained, automatic coherence is needed and the DLM lock
/// covers a large range of pages.
DsmOnly = 1,
/// DLM + DSM: DLM provides synchronization points (lock acquire/release);
/// DSM provides page-level coherence between synchronization points.
/// Used when both coarse ordering (DLM) and fine-grained data movement
/// (DSM) are needed — e.g., a database buffer pool where DLM serializes
/// transactions but DSM manages individual page transfers.
DlmPlusDsm = 2,
}
// `coherence_mode()` is a method of the `DsmSubscriber` trait
// (Section 6.12.1), shown separately here for exposition — it is not an
// inherent impl.
pub trait DsmSubscriber: Send + Sync {
    // ... callbacks defined in Section 6.12.1 elided ...

    /// Returns the coherence mode for the given DLM lock resource.
    /// Called by the DSM-DLM bridge during lock state transitions to
    /// determine whether DSM page operations are needed.
    /// **Time budget: 100 ns** (O(1), return a cached enum value — must not
    /// acquire locks or perform I/O).
    fn coherence_mode(&self, resource: &DlmLockResource) -> CoherenceMode;
}
6.12.6 DLM Integration Pattern¶
The subscriber control API is designed to integrate with the Distributed Lock Manager (Section 15.15). The canonical usage pattern for a clustered filesystem or distributed database:
DLM lock acquire (exclusive):
1. dsm_prefetch(region, va_range, Urgent) — warm cache before use
2. Access pages (demand faults fill any misses)
3. dsm_writeback(region, va_range, sync=true) — flush dirty pages
4. dsm_invalidate(region, va_range) — drop local copies
5. DLM lock release
DLM lock acquire (shared/read):
1. dsm_prefetch(region, va_range, Normal)
2. Read pages (all reads, no writes — SharedReader state)
3. DLM lock release (no writeback needed — pages are clean)
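The exclusive-lock sequence above can be expressed as a call-order sketch. `MockRegion` is an illustrative recording stand-in for `DsmRegionHandle`, not the kernel API:

```rust
use std::cell::RefCell;

/// Mock region handle that records subscriber-control calls, used to
/// illustrate the canonical exclusive-lock cycle.
#[derive(Default)]
pub struct MockRegion {
    pub ops: RefCell<Vec<&'static str>>,
}

impl MockRegion {
    pub fn dsm_prefetch(&self)   { self.ops.borrow_mut().push("prefetch"); }
    pub fn dsm_writeback(&self)  { self.ops.borrow_mut().push("writeback"); }
    pub fn dsm_invalidate(&self) { self.ops.borrow_mut().push("invalidate"); }
}

/// Canonical exclusive-lock cycle: warm the cache, use the pages, flush
/// dirty data synchronously, then drop local copies before lock release.
pub fn exclusive_lock_cycle(region: &MockRegion) {
    // (DLM exclusive lock acquired by the caller)
    region.dsm_prefetch();    // 1. warm cache before use
    /* 2. access pages; demand faults fill any misses */
    region.dsm_writeback();   // 3. sync flush of dirty pages
    region.dsm_invalidate();  // 4. drop local copies
    // 5. DLM lock released by the caller
}
```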
VFS page fault integration: DsmSubscriber::on_page_fault() is called from the
VFS page fault handler BEFORE filemap_get_pages(). If the subscriber returns
FetchRemote, the fault handler skips filemap_get_pages() and waits for the DSM
fetch. If Default or PrefetchAhead, the normal filemap_get_pages() path runs.
The DsmSubscriber::on_page_fault() callback enables the subscriber to enforce
that DLM locks are held before DSM pages are accessed — returning
DsmFaultHint::Reject if the faulting thread does not hold the appropriate lock.
Lock Acquisition Retry Protocol
When an application receives -EACCES from a DSM page fault (because
DsmSubscriber::on_page_fault() returned DsmFaultHint::Reject), the
recommended recovery pattern is:
1. The faulting thread catches the `SIGSEGV` with `si_code == SEGV_ACCERR` (or receives `-EACCES` from a syscall that triggered the fault).
2. Determine which DLM lock protects the faulted region (application-specific mapping, typically maintained in a per-region metadata table).
3. Acquire the DLM lock: `dlm_lock(lockspace, resource, mode)`.
4. Retry the memory access. The page fault now succeeds because `on_page_fault()` finds the DLM lock held.
5. When done with the protected region, release the lock: `dlm_unlock(lockspace, resource)`.
Note: The on_page_fault() callback itself must NOT acquire locks — it runs
on the page fault hot path with a 1 μs time budget (Section 6.12).
Lock acquisition happens in the application's fault recovery handler, outside
the kernel's fault path.
6.12.7 DLM Token Binding¶
A DsmLockBinding ties a DLM lock resource to the DSM pages it protects.
The subscriber creates bindings at initialization; the DSM subsystem uses them
to automate writeback and invalidation on lock state transitions.
// umka-core/src/dsm/lock_binding.rs
/// Binds a DLM lock resource to a range of DSM pages.
/// When the DLM lock state changes, the DSM subsystem automatically
/// performs the corresponding cache operations.
/// Kernel-internal struct. Not KABI. Not wire.
pub struct DsmLockBinding {
/// Client-side DLM lock handle that owns this binding.
/// Uses `DlmLockHandle` (client-side) rather than `DlmLockResource`
/// (master-side) because bindings are created by the subscriber node
/// that holds the lock, not by the DLM master.
pub lock_handle: DlmLockHandle,
/// DSM region containing the pages.
pub region: DsmRegionHandle,
/// Virtual address range covered by this lock.
/// All pages in [va_start, va_start + page_count * PAGE_SIZE) are
/// bound to this DLM lock.
pub va_start: u64,
pub page_count: u32,
/// Dirty page bitmap. One bit per page in the range.
/// Set when a page transitions to Exclusive-dirty or SharedOwner state while this lock is held.
/// Cleared on writeback completion.
pub dirty: DsmDirtyBitmap,
/// Writeback ordering group. Bindings with the same group ID are
/// written back in group-order (lower group first). Used for WAL:
/// log pages (group 0) before data pages (group 1).
pub writeback_group: u16,
/// Whether to prefetch on lock acquire (optimistic prefetch).
pub prefetch_on_acquire: bool,
/// Prefetch priority for optimistic prefetch.
pub prefetch_priority: DsmPrefetchPriority,
}
/// Per-binding dirty page tracker. Fixed-size bitmap, allocated once
/// at binding creation. Size = ceil(page_count / 64) AtomicU64 words.
pub struct DsmDirtyBitmap {
/// Bitmap words. One bit per page in the **binding's range** (NOT the
/// full region). Size = ceil(page_count / 64) AtomicU64 words. Indexing:
/// `bit_index = (page_idx - binding.base_page_idx)`.
///
/// The 32 MiB figure is the theoretical max (one binding covering the
/// entire 1 TB region with 4 KiB pages: 256M bits = 32 MiB).
/// `DSM_MAX_BITMAP_SIZE` rejects bindings exceeding 32 MiB at creation
/// time with `-ENOMEM`.
///
/// In practice, most bindings cover a sub-range of the region (e.g., one
/// node caches only 1 GB of a 1 TB region → 32K bits = 4 KB).
/// AtomicU64 for lock-free concurrent mark_dirty/clear_dirty/iter_dirty.
words: Box<[AtomicU64]>,
/// Number of dirty pages (cached count — avoids popcount on query).
dirty_count: AtomicU32,
}
impl DsmDirtyBitmap {
/// Mark page at offset `page_idx` as dirty. Called by DSM subsystem
/// when a page transitions to Exclusive-dirty (MOESI: M) or SharedOwner (MOESI: O) state and this binding is active.
pub fn mark_dirty(&self, page_idx: u32) { ... }
/// Clear dirty bit on writeback completion.
pub fn clear_dirty(&self, page_idx: u32) { ... }
/// Return the number of dirty pages.
pub fn dirty_count(&self) -> u32 { ... }
/// Iterate dirty page indices. Used by writeback to avoid scanning
/// clean pages.
pub fn iter_dirty(&self) -> impl Iterator<Item = u32> + '_ { ... }
}
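A userspace sketch of the elided method bodies, assuming the same one-bit-per-page AtomicU64 layout. The key detail is that mark_dirty/clear_dirty only adjust dirty_count when the bit actually flips, so concurrent markers cannot double-count:

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Simplified model of DsmDirtyBitmap (kernel version elided above).
pub struct DirtyBitmap {
    words: Box<[AtomicU64]>,
    dirty_count: AtomicU32,
}

impl DirtyBitmap {
    pub fn new(page_count: u32) -> Self {
        let n = (page_count as usize + 63) / 64; // ceil(page_count / 64)
        DirtyBitmap {
            words: (0..n).map(|_| AtomicU64::new(0)).collect(),
            dirty_count: AtomicU32::new(0),
        }
    }
    /// Set the dirty bit; bump the count only on a 0→1 transition.
    pub fn mark_dirty(&self, page_idx: u32) {
        let (w, b) = ((page_idx / 64) as usize, page_idx % 64);
        let prev = self.words[w].fetch_or(1u64 << b, Ordering::AcqRel);
        if prev & (1u64 << b) == 0 {
            self.dirty_count.fetch_add(1, Ordering::Relaxed);
        }
    }
    /// Clear the dirty bit; drop the count only on a 1→0 transition.
    pub fn clear_dirty(&self, page_idx: u32) {
        let (w, b) = ((page_idx / 64) as usize, page_idx % 64);
        let prev = self.words[w].fetch_and(!(1u64 << b), Ordering::AcqRel);
        if prev & (1u64 << b) != 0 {
            self.dirty_count.fetch_sub(1, Ordering::Relaxed);
        }
    }
    pub fn dirty_count(&self) -> u32 {
        self.dirty_count.load(Ordering::Relaxed)
    }
    /// Iterate dirty page indices without scanning clean pages bit-by-bit
    /// in the caller: each word is loaded once, then its set bits decoded.
    pub fn iter_dirty(&self) -> impl Iterator<Item = u32> + '_ {
        self.words.iter().enumerate().flat_map(|(w, word)| {
            let bits = word.load(Ordering::Acquire);
            (0u32..64)
                .filter(move |&b| bits & (1u64 << b) != 0)
                .map(move |b| (w as u32) * 64 + b)
        })
    }
}
```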
Dirty state canonical arbiter: PageFlags::DIRTY is the single source of truth
for the VFS/writeback layer. AddressSpace.nr_dirty tracks the count. DsmPage.is_dirty()
(via state_dirty bit 8 and DsmDirtyBitmap) tracks DSM-specific dirty state for writeback ordering. The
filesystem trusts PageFlags::DIRTY; DSM trusts DsmDirtyBitmap; both are set and
cleared together by the writeback engine. No component sets one without the other.
Automatic lock lifecycle operations:
When a DLM lock with a DsmLockBinding changes state, the DSM subsystem
performs the corresponding cache operations automatically:
| DLM Transition | DSM Action | Rationale |
|---|---|---|
| Lock grant (EX/PW) | Prefetch bound pages if prefetch_on_acquire | Warm cache before use |
| Lock grant (PR/CR) | Prefetch bound pages if prefetch_on_acquire | Warm cache for reads |
| Lock upgrade (PR→EX/PW) | Issue GetM/Upgrade for all SharedReader pages in bound range; block grant completion until all pages reach Exclusive (MOESI: M or E) | Write safety: DLM EX grant must guarantee DSM write permission |
| Lock downgrade (EX→PR) | Writeback dirty pages, then mark as read-only | Next EX holder sees writes |
| Lock release (EX/PW) | Writeback dirty pages, then invalidate | Clean slate for next holder |
| Lock release (PR/CR) | Invalidate (no writeback — pages are clean) | Drop stale copies |
| Lock cancel (AST) | Writeback dirty + invalidate (urgent) | Another peer needs the lock |
The subscriber does NOT need to call dsm_writeback() / dsm_invalidate()
manually for bound pages — the lock binding handles it. Manual control
(Section 6.12) remains available for pages outside lock bindings or for
custom patterns.
GetM-on-upgrade protocol (PR→EX coherence guarantee):
A DLM PR → EX lock conversion grants exclusive write permission at the DLM
level, but does NOT by itself change the MOESI state of cached DSM pages.
Pages fetched under the PR lock are in SharedReader (MOESI: S) state — writing
to a SharedReader page would fault because the PTE is read-only and the DSM
coherence directory still lists other sharers. Without an explicit coherence
upgrade, the first write to every cached page would trigger an on-demand
Upgrade (S→M) via the MOESI protocol, adding ~3-5 μs latency per page on the
write hot path.
The DsmLockBinding auto-lifecycle eliminates this problem by eagerly
upgrading page states when the DLM lock is converted from PR to EX/PW:
PR→EX upgrade with DsmLockBinding:
1. DLM master grants the PR→EX conversion.
2. Before signaling grant completion to the caller, the DSM subsystem
scans the binding's page range for pages in SharedReader state.
3. For each SharedReader page:
a. If no other sharers exist (home directory shows sharer_count == 1):
Send Upgrade(page, self) to home. Home invalidates no one
(AckCount = 0) and transitions directory to Modified(owner = self).
Local page transitions S → M. Cost: one control message RTT (~2 μs).
b. If other sharers exist (sharer_count > 1):
Send Upgrade(page, self) to home. Home sends Inv to all other
sharers, waits for InvAcks, then sends AckCount to self.
Local page transitions S → M after all InvAcks are collected.
Cost: one RTT + invalidation fan-out (~3-8 μs depending on
sharer count).
c. If page is not cached locally (NotPresent / evicted since PR grant):
Send GetM(page, self) to home. Home forwards to current owner
(FwdGetM) or supplies from home memory. Local page transitions
I → M. Cost: one RTT + possible owner forward (~5-10 μs).
4. Pages already in Exclusive or Modified state (e.g., from a previous
EX hold that was downgraded to PR and then re-upgraded) require no
action — they are already writable.
5. All upgrades are issued concurrently (pipelined RDMA control messages).
The DSM subsystem tracks outstanding upgrades with a completion counter
(AtomicU32, decremented on each DataResp/AckCount receipt).
6. Grant completion is signaled to the DLM caller only after the
completion counter reaches zero — all pages in the bound range are
in Exclusive or Modified state.
Performance:
- Common case (single holder, no other sharers): step 3a for all pages.
N pages × 1 RTT, pipelined. Total: ~5-10 μs regardless of page count
(limited by RDMA pipeline depth, not page count).
- Worst case (many sharers): step 3b dominates. Latency bounded by the
slowest sharer's InvAck response (typically < 10 μs on RDMA).
- Zero-page case (no cached pages in range): no coherence messages needed.
Grant completes immediately after DLM conversion.
Why grant completion must block on DSM Exclusive state:
If the DLM reports the PR→EX conversion as complete while pages are still in SharedReader state, the caller's first write would trigger a synchronous MOESI Upgrade on the page-fault path. This is incorrect for two reasons: (1) the page-fault path has a 1 μs time budget (Section 6.12) and cannot tolerate a multi-microsecond coherence round-trip; (2) the caller reasonably expects that holding an EX lock means writes proceed without coherence faults — the DLM grant is the synchronization point, not individual page writes. By completing all Upgrade/GetM operations before signaling the grant, the binding preserves the invariant: DLM EX grant ⟹ all bound pages are DSM-writable.
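The per-page decision in step 3 (and the grant-completion condition in steps 5-6) can be sketched as pure functions. The enum names here are simplified stand-ins for the DSM page-state machine, not the actual kernel types:

```rust
/// Simplified local page states relevant to the PR→EX upgrade scan.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PageState { NotPresent, SharedReader, Exclusive, Modified }

/// Coherence message to issue for one page during the upgrade scan.
#[derive(Debug, PartialEq)]
enum CoherenceOp {
    None,    // step 4: already writable, no action
    Upgrade, // steps 3a/3b: S → M; home handles any sharer fan-out
    GetM,    // step 3c: I → M; home supplies data or forwards to owner
}

/// Step 3: classify one cached page.
fn upgrade_op(state: PageState) -> CoherenceOp {
    match state {
        PageState::SharedReader => CoherenceOp::Upgrade,
        PageState::NotPresent => CoherenceOp::GetM,
        PageState::Exclusive | PageState::Modified => CoherenceOp::None,
    }
}

/// Steps 5-6: the completion counter starts at the number of pages that
/// need a coherence message; the grant is signaled when it reaches zero.
fn outstanding_upgrades(pages: &[PageState]) -> usize {
    pages.iter().filter(|&&s| upgrade_op(s) != CoherenceOp::None).count()
}
```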
Registration:
impl DsmRegionHandle {
/// Register a lock binding. The DSM subsystem begins tracking dirty
/// state for pages in the bound range. Multiple bindings per region
/// are allowed (different lock resources covering different page ranges).
/// Overlapping bindings are permitted — dirty tracking is per-binding.
pub fn dsm_bind_lock(
&self,
binding: DsmLockBinding,
) -> Result<DsmLockBindingHandle, DsmError>;
/// Remove a lock binding. Outstanding dirty pages are written back
/// synchronously before the binding is removed.
pub fn dsm_unbind_lock(
&self,
handle: DsmLockBindingHandle,
) -> Result<(), DsmError>;
}
6.12.8 Writeback Ordering and Barriers¶
Subscribers with write-ahead requirements (database WAL, journaled filesystems) need ordered writeback: log pages must reach the home node before data pages.
Writeback groups: Each DsmLockBinding has a writeback_group: u16. When
the DSM subsystem writes back dirty pages for a lock release or downgrade, it
processes groups in ascending order:
Lock release with writeback ordering:
1. Writeback all dirty pages in group 0 (log/journal). Wait for PutAck.
2. Writeback all dirty pages in group 1 (data). Wait for PutAck.
3. Writeback all dirty pages in group 2 (metadata). Wait for PutAck.
...
N. Invalidate all pages in the binding.
N+1. Release DLM lock.
Within a group, pages are written back concurrently (pipelined RDMA Writes). Between groups, a writeback barrier ensures all PutAck messages from the previous group are received before the next group begins.
Group dependency resolution: Before writing data pages in a group, the
writeback engine calls dsm_subscriber.get_group_barriers(group_id) which
returns a list of prerequisite groups whose writeback must complete first
(e.g., journal log group before data group). The writeback engine ensures all
prerequisite groups are flushed before starting the current group. This
dependency check is performed per-writeback-cycle, not per-page — the
prerequisite list is cached at cycle start and not re-evaluated during the
cycle.
impl DsmSubscriber {
/// Returns prerequisite group IDs that must complete writeback
/// before the given group can begin. Called once per writeback
/// cycle per group. Default implementation returns empty (no
/// ordering constraints beyond ascending group number).
/// **Time budget: 1 μs** (must not block — return cached dependency list).
fn get_group_barriers(&self, group_id: u16) -> ArrayVec<u16, 8> {
ArrayVec::new()
}
}
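The resulting ordering rule — ascending group number, except where get_group_barriers() names prerequisites — can be sketched as a small scheduling function. This is an illustrative model, not the writeback engine itself; the barriers closure stands in for the cached per-cycle dependency list:

```rust
/// Compute the order in which writeback groups flush for one cycle.
/// Groups flush in ascending numeric order; a group additionally waits
/// for any prerequisite groups returned by `barriers` (the cached
/// get_group_barriers() result). Prerequisites with no dirty pages this
/// cycle (absent from `groups`) are treated as already satisfied.
fn flush_order(groups: &[u16], barriers: impl Fn(u16) -> Vec<u16>) -> Vec<u16> {
    let mut todo: Vec<u16> = groups.to_vec();
    todo.sort_unstable();
    todo.dedup();
    let all = todo.clone();
    let mut flushed: Vec<u16> = Vec::new();
    while !todo.is_empty() {
        let before = flushed.len();
        // One ascending pass: start every group whose prerequisites
        // (that are present this cycle) have already flushed.
        todo.retain(|&g| {
            let ready = barriers(g)
                .iter()
                .all(|p| flushed.contains(p) || !all.contains(p));
            if ready {
                flushed.push(g);
            }
            !ready
        });
        assert!(flushed.len() > before, "cyclic writeback-group dependency");
    }
    flushed
}
```

With no extra barriers this degenerates to plain ascending order (log group 0, then data group 1, then metadata group 2), matching the lock-release sequence above.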
Relationship to DLM Targeted Writeback (Section 15.15)
DSM writeback groups and DLM targeted writeback are orthogonal mechanisms:
- DSM writeback groups (this section): Define ordering of pages within a DSM region during any writeback event. Group 0 pages (e.g., journal/log) are flushed before group 1 pages (e.g., data blocks).
- DLM targeted writeback (Section 15.15): Defines the scope of writeback on lock downgrade — only pages within the lock's byte range are flushed, not all dirty pages in the region.
When both apply (DSM region protected by DLM lock), the DLM determines
which pages to flush (range-scoped), and DSM writeback groups determine
the order in which those pages are flushed. The subscriber's
on_writeback_complete() callback enforces group ordering by scheduling
the next higher-group flush only after the current group's writeback has
completed (i.e., all PutAcks received for group N pages trigger the
submission of group N+1 pages).
DSM health check and local fallback: The writeback engine checks DSM health
before attempting DSM writeback: dsm_subscriber.is_healthy(). If unhealthy
(home node unreachable, RDMA timeout >100 ms), the engine falls through to
local writeback with a degraded-mode flag. An FMA event DsmWritebackDegraded
is emitted (Section 20.1). Local writeback proceeds normally — the page
is written to the local block device. When DSM recovers, the anti-entropy
protocol (Section 6.13) reconciles divergent pages.
impl DsmSubscriber {
/// Returns true if DSM transport to the home node is healthy.
/// Called by the writeback engine before each writeback cycle.
/// Must complete in O(1) — checks cached RDMA connection state,
/// does not probe the network.
/// **Time budget: 100 ns** (O(1), check cached RDMA connection state only).
fn is_healthy(&self) -> bool;
}
fsync() semantics for DSM pages: When an application calls fsync() on a file
with DSM-managed pages, the VFS writeback engine flushes all dirty pages. For DSM
pages, "flushed" means the page has been sent to the home node AND the PutAck has
been received (on_writeback_complete callback fired). fsync() blocks until all
PutAcks for the file's dirty DSM pages are received. If a PutAck times out (RDMA
timeout, default 500 μs), fsync() retries the RDMA write up to 3 times with
exponential backoff (500 μs, 1 ms, 2 ms). After 3 failures, fsync() returns
-EIO and the page remains dirty. This ensures fsync() on DSM pages provides
the same durability guarantee as on local pages: data is on stable storage (the
home node) before fsync() returns.
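The retry schedule can be sketched as a short loop. `send_and_wait_putack` is a hypothetical stand-in for re-issuing the RDMA write and waiting for the home node's PutAck; the sketch assumes each of the three attempts uses the next timeout in the backoff sequence:

```rust
use std::time::Duration;

/// Errno value for -EIO.
const EIO: i32 = 5;

/// fsync() durability loop for one dirty DSM page: up to 3 attempts with
/// exponentially growing timeouts (500 μs, 1 ms, 2 ms). On success the
/// PutAck proves the data reached stable storage on the home node; after
/// 3 failures the page remains dirty and -EIO is returned.
fn fsync_dsm_page(mut send_and_wait_putack: impl FnMut(Duration) -> bool) -> Result<(), i32> {
    for attempt in 0..3u32 {
        let timeout = Duration::from_micros(500u64 << attempt); // 500 μs, 1 ms, 2 ms
        if send_and_wait_putack(timeout) {
            return Ok(()); // PutAck received: durable on home node
        }
    }
    Err(-EIO) // page remains dirty
}
```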
Explicit barriers:
impl DsmRegionHandle {
/// Wait until all pending async writebacks for pages in the given
/// range have completed (PutAck received from home for each page).
/// Returns the number of pages that were pending.
pub fn dsm_writeback_barrier(
&self,
va_start: u64,
page_count: u32,
) -> Result<u32, DsmError>;
}
6.12.9 Optimistic Prefetch¶
When DsmLockBinding.prefetch_on_acquire is true, the DSM subsystem begins
prefetching pages from the bound range as soon as the DLM lock request is
submitted — not when the lock is granted. This hides network latency:
Optimistic prefetch timeline:
T=0 DLM lock request submitted
T=0 DSM begins prefetching (up to max_optimistic_prefetch pages)
T=3μs First pages arrive (RDMA one-way latency)
T=5μs DLM lock granted (typical grant latency)
T=5μs Application begins accessing pages — many already cached
Prefetch cap: At most max_optimistic_prefetch pages (default: 256)
are prefetched at lock-request time. If the binding covers more pages,
only the first 256 are prefetched; the rest are filled by demand faults or
by the subscriber's DsmFaultHint::PrefetchAhead during access. This
prevents a large lock range (e.g., tablespace lock covering millions of
pages) from flooding the network with GetS requests.
max_optimistic_prefetch is configurable per DsmLockBinding:
pub struct DsmLockBinding {
// ... existing fields ...
/// Maximum pages to prefetch at lock-request time.
/// Default: 256. Set to 0 to disable optimistic prefetch entirely
/// (equivalent to prefetch_on_acquire = false).
pub max_optimistic_prefetch: u32,
}
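The cap is a simple truncation of the bound range at lock-request time. A minimal sketch, assuming 4 KiB pages (the function name is illustrative, not a kernel API):

```rust
/// Page addresses to prefetch when the DLM lock request is submitted:
/// only the first min(page_count, max_optimistic_prefetch) pages of the
/// binding's range. The remainder are filled by demand faults or by
/// DsmFaultHint::PrefetchAhead during access.
fn optimistic_prefetch_pages(
    va_start: u64,
    page_count: u32,
    max_optimistic_prefetch: u32,
) -> impl Iterator<Item = u64> {
    const PAGE_SIZE: u64 = 4096;
    let n = page_count.min(max_optimistic_prefetch) as u64;
    (0..n).map(move |i| va_start + i * PAGE_SIZE)
}
```

Setting max_optimistic_prefetch to 0 yields an empty iterator, matching the documented "disable entirely" behavior.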
Safety: Prefetched pages are in SharedReader (MOESI: S) state. If the lock is an exclusive lock (EX/PW), the first write to a prefetched page triggers a standard Upgrade (S→M) via the MOESI protocol. This adds ~3-5 μs per write-faulted page but avoids the full GetM latency because the page data is already local.
If the lock request is denied or times out: Prefetched pages remain in cache as normal SharedReader pages. No cleanup needed — they'll be evicted naturally by memory pressure or invalidated by the home node when another peer writes to them.
6.12.10 Subscriber Usage Patterns¶
UPFS (clustered filesystem):
Region: per-filesystem metadata region + per-file data regions
Lock binding:
- Inode lock → metadata pages (group 0: journal, group 1: inode blocks)
- Byte-range lock → data pages (single group, no ordering needed)
- prefetch_on_acquire = true (metadata), false (data — sequential prefetch
via DsmFaultHint::PrefetchAhead instead)
Cache policy:
- Metadata region: WriteBack, max_cached_pages = unlimited, eviction_priority = 0
- Data region: WriteBack, prefetch_window = 64, eviction_priority = 500
DsmSubscriber:
- on_page_fault: check DLM lock held for the inode/range → Reject if not
- on_eviction_candidate: Deny if DLM lock held (pinned by active lock)
Distributed database (e.g., shared-disk PostgreSQL):
Region: per-tablespace DSM region
Lock binding:
- Buffer lock → data pages (group 1)
- WAL lock → WAL pages (group 0 — written before data pages)
- prefetch_on_acquire = true (both)
Cache policy:
- WriteBack, prefetch_window = 0 (database has its own prefetch logic)
- eviction_priority = 100 (prefer keeping DB pages over other DSM regions)
DsmSubscriber:
- on_page_fault: check buffer pin held → Reject if not
- on_eviction_candidate: WritebackFirst if dirty, Allow if clean
- on_writeback_complete: clear buffer-level dirty flag, advance checkpoint LSN
Block service provider (host-proxy for remote block access):
Region: per-block-device cache region
Lock binding:
- DLM lock per LBA range → cached blocks
- No writeback ordering (block layer has its own barrier semantics via FLUSH)
- prefetch_on_acquire = false (block layer prefetches via read-ahead)
Cache policy:
- WriteBack, max_cached_pages = configurable via /sys, eviction_priority = 800
DsmSubscriber:
- on_page_fault: always Default (no lock check — block layer handles ordering)
- on_eviction_candidate: WritebackFirst if dirty, Allow if clean
6.13 Anti-Entropy Protocol for DSM_RELAXED¶
Terminology:
DSM_RELAXED is the mmap flag name used in the syscall interface (e.g., mmap(MAP_DSM | DSM_RELAXED, ...)). The corresponding internal enum variant is DsmConsistency::Eventual. They refer to the same consistency model.
DSM_RELAXED regions use asynchronous write propagation — the writer
completes locally and a background thread pushes updates to replicas. This
creates a window where replicas may be stale. The anti-entropy protocol
handles two situations where replicas need explicit synchronization:
- Region join — a new peer joins and needs to catch up to the current state of all pages it will cache.
- Partition heal — after a network partition, peers on both sides may have diverged and need to reconcile.
6.13.1 Version Vectors¶
Each peer in a DSM_RELAXED region maintains a version vector —
one monotonic counter per slot, tracking how many writes from each peer
have been applied locally:
/// Per-peer version vector for a DSM_RELAXED region.
/// Indexed by RegionSlotIndex (dense, per-region).
pub struct DsmVersionVector {
/// One counter per participant. `vv[slot]` = number of writes from
/// the peer at `slot` that have been applied to this peer's local
/// replica. Monotonically increasing.
/// Length = region's `max_participants`, bounded by
/// `MAX_REGION_PARTICIPANTS` (1024). Heap-allocated at region join.
pub vv: Box<[u64]>,
/// Number of entries (= region's max_participants).
pub len: u16,
}
On every local write: vv[own_slot] += 1.
On receiving a propagated write from peer P: vv[P.slot] = max(vv[P.slot], P.epoch).
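The two update rules, plus the comparison used by the sync source in Section 6.13.2 step 3, fit in a few lines. A minimal sketch using Vec in place of the kernel's Box<[u64]>:

```rust
/// Minimal version-vector model implementing the update rules above.
pub struct VersionVector {
    vv: Vec<u64>, // one counter per slot
}

impl VersionVector {
    pub fn new(max_participants: usize) -> Self {
        VersionVector { vv: vec![0; max_participants] }
    }
    /// On every local write: vv[own_slot] += 1.
    pub fn record_local_write(&mut self, own_slot: usize) {
        self.vv[own_slot] += 1;
    }
    /// On receiving a propagated write from peer P:
    /// vv[P.slot] = max(vv[P.slot], P.epoch).
    pub fn apply_remote_write(&mut self, slot: usize, epoch: u64) {
        self.vv[slot] = self.vv[slot].max(epoch);
    }
    /// Slots for which `source` holds writes that `self` has not yet
    /// applied — the set a sync source replays during anti-entropy.
    pub fn slots_behind(&self, source: &VersionVector) -> Vec<usize> {
        (0..self.vv.len())
            .filter(|&i| source.vv[i] > self.vv[i])
            .collect()
    }
}
```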
Evolution safety: Version vectors are indexed by slot (not node ID). During
live kernel evolution, slot assignments are stable — the RegionSlotMap is part
of the exported DSM state. The version vector is exported as:
{ len: u16, entries: [Le64; len] } alongside the slot map. Import reconstructs
DsmVersionVector with the same slot→node mapping. If the new component changes
MAX_REGION_PARTICIPANTS, version vectors are zero-extended (safe — new slots
have zero writes). Shrinking max_participants below the current active slot count
aborts evolution (cannot discard active participants' history).
6.13.2 Anti-Entropy on Region Join¶
When a new peer joins a DSM_RELAXED region:
1. New peer N sends AntiEntropyRequest to the region coordinator:
{ region_id, version_vector: [0, 0, ..., 0] }
2. Coordinator selects the peer with the most complete state
(highest sum of version vector entries) as the sync source S.
3. S compares N's version vector against its own:
For each slot i where S.vv[i] > N.vv[i]:
S has writes that N is missing.
4. S sends missing writes as a stream of AntiEntropyData messages:
{ region_id, page_addr, data, writer_slot, epoch }
Ordered by epoch (oldest first) to preserve causal ordering.
5. N applies each write, updating its local pages and version vector.
6. When all missing writes are sent, S sends AntiEntropyComplete:
{ region_id, final_vv: S.vv }
7. N verifies: for all i, N.vv[i] >= final_vv[i]. If yes: sync complete.
If no: retry (S received new writes during sync — rare, converges
quickly because writes during sync are few).
Optimization: If N's version vector is all-zeros (fresh join), S sends a full snapshot of all non-Uncached pages rather than replaying the write log. This avoids replaying potentially millions of historical writes.
6.13.3 Anti-Entropy on Partition Heal¶
After a network partition heals, peers on both sides may have applied writes that the other side hasn't seen. The reconciliation protocol:
1. Each peer P broadcasts its version vector to all peers in the region
via AntiEntropyRequest { region_id, version_vector: P.vv }.
2. Each peer compares received vectors against its own.
For each peer Q where P.vv[i] > Q.vv[i] for some i:
P has writes that Q is missing → P sends them.
For each peer Q where Q.vv[i] > P.vv[i] for some i:
Q has writes that P is missing → Q sends them.
3. Concurrent conflicting writes (same page written on both sides of
the partition) are resolved by last-writer-wins (highest epoch wins).
Ties: lowest slot index wins (deterministic, same rule as DSM_CAUSAL
conflict resolution). **Note**: epoch comparison is non-causal — epoch
values are monotonic wall-clock timestamps (from NTP-synchronized
`CLOCK_REALTIME`), not Lamport/vector clocks. Two writes with the same
epoch are concurrent in the Lamport sense; the tie-breaker is arbitrary
but deterministic. This is acceptable for `DSM_RELAXED` where the
consistency model already tolerates non-causal ordering.
**Note**: LWW conflict resolution silently discards the losing write.
Applications using `DSM_RELAXED` must tolerate write loss during
partitions. If durability guarantees are required, use `DSM_SYNCHRONOUS`
(all-replica ack) or `DSM_CAUSAL` with DLM locks (which serialize
writers and prevent conflicts). The `fsync()` syscall on a `DSM_RELAXED`
page guarantees data reaches the home node, but does not prevent
subsequent overwrite by a concurrent writer's data during anti-entropy
reconciliation.
4. After exchanging all missing writes, all peers should have identical
version vectors. Each peer verifies and logs any discrepancies as
FMA events.
Convergence: The protocol is idempotent — applying the same write
twice is a no-op (epoch check). In the worst case (N peers, each with
unique writes), the protocol exchanges N × (pages modified during
partition) messages. For short partitions (<1 minute) with moderate write
rates, this is negligible. For extended partitions exceeding
DsmCachePolicy.max_partition_duration_secs, anti-entropy is skipped
entirely in favor of full state discard and snapshot sync (see below).
Long-partition recovery: If a peer has been partitioned longer than
DsmCachePolicy.max_partition_duration_secs (default: 86400 = 24 hours),
the coordinator rejects rejoining with RegionJoinReject(reason=4,
partition_too_long). The rejected peer must perform forced state
discard: drop all locally cached pages for the region, reset its
version vector to all-zeros, and re-issue RegionJoinRequest. This
triggers a full snapshot sync (same as fresh join) rather than
incremental anti-entropy, which avoids unbounded write-log replay for
very long partitions.
The partition duration is measured as now_monotonic - last_heartbeat_seen
on the coordinator side. The coordinator tracks last_seen_ns[slot] for
each participant and checks it on RegionJoinRequest.
Rationale: incremental anti-entropy is correct regardless of partition length (version vector comparison is mathematically sound for any gap), but the practical cost of replaying months of accumulated writes makes full re-sync faster and simpler. The 24-hour default is configurable per-region to accommodate satellite uplinks or other high-latency environments.
Post-sync verification: AntiEntropyCompletePayload.region_digest
(SipHash-2-4 of all non-Uncached pages) allows the joining peer to
verify sync correctness. On digest mismatch: the peer logs an FMA event
(DSM_SYNC_DIGEST_MISMATCH) and initiates a full re-sync from scratch
(version vector reset to zeros). Two consecutive digest mismatches
trigger DSM_SYNC_FATAL — the peer leaves the region and requires
operator intervention.
6.13.4 Wire Messages¶
// Anti-entropy messages (DsmMsgType range 0x0070-0x007F)
AntiEntropyRequest = 0x0070, // Peer → coordinator or broadcast
AntiEntropyData = 0x0071, // Source → joining/stale peer (page data)
AntiEntropyComplete = 0x0072, // Source → joining peer (sync done)
AntiEntropyRequestPayload (variable):
#[repr(C)]
pub struct AntiEntropyRequestPayload {
pub region_id: Le64, // 8 bytes (offset 0)
pub requesting_slot: Le16, // 2 bytes (offset 8, RegionSlotIndex)
pub vv_len: Le16, // 2 bytes (offset 10, = max_participants)
/// DSM protocol version of the requesting peer. Source peer rejects
/// sync if version is incompatible (below min_supported_version).
pub dsm_protocol_version: Le32, // 4 bytes (offset 12)
// Followed by vv_len × Le64 — the requesting peer's version vector.
}
const_assert!(core::mem::size_of::<AntiEntropyRequestPayload>() == 16);
// Fixed header: 16 bytes. Variable payload: vv_len × 8 bytes.
// Maximum total: 16 + MAX_REGION_PARTICIPANTS × 8 = 16 + 1024 × 8 = 8,208 bytes.
//
// Inline threshold: regions with <= 26 participants fit in a single 224-byte
// ring entry payload: (224 - 16) / 8 = 26.
// Regions with > 26 participants require continuation framing per the ring
// entry protocol ([Section 6.6](#dsm-coherence-protocol-moesi--dsm-coherence-message-wire-format)).
AntiEntropyDataPayload: Uses the existing split-transfer protocol (Section 6.6) — RDMA Write for page data, RDMA Send for the header. Same mechanism as DataResp/DataFwd.
/// Wire header for anti-entropy data transfer (RDMA Send).
/// The page data itself is delivered via a preceding RDMA Write to the
/// pre-registered receive buffer at `dest_offset`. This header follows
/// and tells the receiver which page the data belongs to.
#[repr(C)]
pub struct AntiEntropyDataPayload {
/// Region this page belongs to.
pub region_id: Le64, // 8 bytes (offset 0)
/// Virtual page address within the region (page-aligned).
pub page_addr: Le64, // 8 bytes (offset 8)
/// Slot index of the writer that last modified this page.
pub writer_slot: Le16, // 2 bytes (offset 16)
/// Writer's epoch at the time of the last modification.
/// The receiver uses this to determine causal ordering:
/// accept the page only if writer_epoch > local epoch for this slot.
pub writer_epoch: Le64, // 8 bytes (offset 18)
/// Byte offset into the RDMA receive buffer where the preceding
/// RDMA Write deposited the 4 KiB page data.
pub dest_offset: Le32, // 4 bytes (offset 26)
/// Rounds total size to 32 bytes (not gap padding — all fields pack
/// tightly because Le types have alignment 1).
pub _align: [u8; 2], // 2 bytes (offset 30)
}
// Total: 8 + 8 + 2 + 8 + 4 + 2 = 32 bytes.
const_assert!(core::mem::size_of::<AntiEntropyDataPayload>() == 32);
AntiEntropyCompletePayload (variable):
#[repr(C)]
pub struct AntiEntropyCompletePayload {
pub region_id: Le64, // 8 bytes
pub source_slot: Le16, // 2 bytes (RegionSlotIndex)
pub vv_len: Le16, // 2 bytes
pub _pad: [u8; 4], // 4 bytes
/// SipHash-2-4 of all non-Uncached page addresses XORed with their
/// content hashes. The SipHash key is derived from the region's session
/// key: `digest_key = HKDF-SHA256(region_session_key, salt=b"dsm-ae-digest",
/// info=region_id)`. All peers derive the same key from the shared session key.
/// This is an integrity check for debugging, not a security mechanism.
/// **Known limitation**: XOR combining is commutative and associative but
/// weak — it is insensitive to element multiplicity and produces false
/// matches when two sets differ by elements with equal XOR contributions
/// (e.g., two pages with identical content hashes). Phase 4+ optimization:
/// replace XOR combining with GF(2^64) polynomial evaluation or
/// order-insensitive hash (e.g., sum of per-page hashes modulo a large
/// prime) to eliminate this weakness.
/// For now, a mismatch triggers full re-sync, and a false match is benign (the
/// system converges through subsequent anti-entropy rounds).
/// The joining peer computes its own digest after applying all received
/// writes and compares. Mismatch → FMA event + full re-sync from scratch.
pub region_digest: Le64, // 8 bytes
// Followed by vv_len × Le64 — the source peer's final version vector.
}
const_assert!(core::mem::size_of::<AntiEntropyCompletePayload>() == 24);
6.13.5 Performance Bounds¶
| Operation | Cost | When |
|---|---|---|
| Version vector comparison | O(max_participants) u64 compares | On join or partition heal |
| Full snapshot (fresh join) | O(pages × 4KB) RDMA Writes | Once per join |
| Incremental sync | O(missing_writes × 4KB) RDMA Writes | Once per partition heal |
| Normal operation (no sync) | Zero overhead — anti-entropy is dormant | Always |
Anti-entropy never runs on the data path. During normal operation, it consumes zero CPU and zero network bandwidth. It activates only on region join or partition heal — both are rare events.
Rate limiting: Anti-entropy data transfer is rate-limited to 50% of
the available bandwidth to the sync source, preventing anti-entropy from
starving normal coherence traffic. The rate limit is configurable per
region via DsmCachePolicy.anti_entropy_bandwidth_pct (default: 50).
Enforcement: A per-link TokenBucket rate limiter sized to 50% of
measured link bandwidth (from TopologyEdge.bandwidth_bytes_per_sec). Anti-entropy
sends are throttled (deferred, not dropped — anti-entropy data is
idempotent but required for convergence) when the bucket is empty. The
bucket is initialized at region creation and updated on topology change
events.
/// Token-bucket rate limiter for anti-entropy bandwidth enforcement.
/// One instance per (region, sync_target_peer) pair, allocated at region
/// creation time and stored in the region's anti-entropy state.
///
/// The bucket limits the byte rate of anti-entropy data transfers to
/// prevent starvation of normal coherence traffic. The capacity is sized
/// to one refill interval's worth of tokens, allowing short bursts up to
/// the full interval budget.
///
/// Concurrency: `tokens` and `last_refill_ns` are `AtomicU64` for lock-free
/// operation on the anti-entropy sender thread. Only one thread sends
/// anti-entropy data per (region, peer) pair, so contention is minimal —
/// the atomics are for visibility across the sender thread and the topology
/// update path (which may adjust `capacity` and `refill_rate_per_ms`).
///
/// **Enforcement**: Each TokenBucket is owned by one region's anti-entropy
/// thread (sole caller of `try_consume`). The topology update path writes
/// ONLY `capacity` and `refill_rate_per_ms` — it MUST NOT write `tokens`
/// or `last_refill_ns`.
pub struct TokenBucket {
/// Current token count (bytes). Each anti-entropy RDMA Write consumes
/// `page_size` tokens (typically 4096). Decremented atomically before
/// each send. If insufficient tokens remain, the sender defers the
/// transfer until the next refill.
pub tokens: AtomicU64,
/// Maximum tokens the bucket can hold (bytes). Set to
/// `refill_rate_per_ms * 1000` (one second's worth of bandwidth).
/// This bounds the maximum burst size: a sender that has been idle
/// can send up to `capacity` bytes immediately before throttling.
pub capacity: u64,
/// Tokens added per millisecond (bytes/ms). Derived from the link's
/// measured bandwidth:
/// `refill_rate_per_ms = topology_edge.bandwidth_bytes_per_sec
/// * anti_entropy_bandwidth_pct / 100 / 1000`
///
/// For a 200 Gb/s InfiniBand link at 50%:
/// 200_000_000_000 * 50 / 100 / 8 / 1000 = 12_500_000 bytes/ms
/// = 12.5 GB/s anti-entropy budget.
///
/// Updated on topology change events (link speed renegotiation,
/// failover to a slower path). The update is a simple atomic store —
/// the sender picks up the new rate on its next refill cycle.
pub refill_rate_per_ms: u64,
/// Timestamp (monotonic nanoseconds) of the last token refill.
/// The sender computes elapsed time since `last_refill_ns`, adds
/// proportional tokens, and clamps to `capacity`.
pub last_refill_ns: AtomicU64,
}
impl TokenBucket {
/// Attempt to consume `n` bytes of bandwidth. Returns `true` if the
/// tokens were available and consumed, `false` if the bucket has
/// insufficient tokens (the caller should defer and retry after a
/// short sleep — typically 1 ms, aligned with the refill granularity).
///
/// Internally performs an inline refill before the consumption check:
/// 1. Read `last_refill_ns` (Acquire).
/// 2. Compute `elapsed_ms = (now_ns - last_refill_ns) / 1_000_000`.
/// 3. If `elapsed_ms > 0`:
/// a. `new_tokens = min(tokens + elapsed_ms * refill_rate_per_ms, capacity)`.
/// b. CAS `last_refill_ns` to `now_ns` (Acquire/Release).
/// c. Store `new_tokens` into `tokens` (Release).
/// 4. CAS `tokens` from `current` to `current - n` (Acquire/Release).
/// If `current < n`, return `false` without modifying `tokens`.
/// Invariant: single-sender per TokenBucket instance. CAS in step 4
/// cannot fail under single-sender (no concurrent writer); used as a
/// checked store, not a retry loop. If concurrent senders were needed,
/// the refill+consume steps would need a single CAS loop to prevent
/// lost refill tokens.
pub fn try_consume(&self, n: u64) -> bool { /* ... */ }
}
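The refill-then-consume steps above can be sketched in plain Rust. This is a minimal single-sender model for illustration (the clock is injected as a parameter for testability; the kernel would read monotonic time itself):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal sketch of the single-sender bucket described above. Assumes
/// the documented invariant: exactly one thread calls `try_consume`.
pub struct TokenBucket {
    pub tokens: AtomicU64,
    pub capacity: u64,
    pub refill_rate_per_ms: u64,
    pub last_refill_ns: AtomicU64,
}

impl TokenBucket {
    /// Starts full: capacity is one refill-second's worth of tokens.
    pub fn new(refill_rate_per_ms: u64, now_ns: u64) -> Self {
        let capacity = refill_rate_per_ms * 1000;
        TokenBucket {
            tokens: AtomicU64::new(capacity),
            capacity,
            refill_rate_per_ms,
            last_refill_ns: AtomicU64::new(now_ns),
        }
    }

    /// Inline refill, then checked subtract (steps 1-4 above).
    pub fn try_consume(&self, n: u64, now_ns: u64) -> bool {
        let last = self.last_refill_ns.load(Ordering::Acquire);
        let elapsed_ms = now_ns.saturating_sub(last) / 1_000_000;
        if elapsed_ms > 0 {
            let cur = self.tokens.load(Ordering::Acquire);
            let refilled = cur
                .saturating_add(elapsed_ms * self.refill_rate_per_ms)
                .min(self.capacity);
            // Single sender: plain Release stores suffice; a CAS loop
            // over refill + consume is only needed with concurrent
            // consumers (see the invariant note above).
            self.last_refill_ns.store(now_ns, Ordering::Release);
            self.tokens.store(refilled, Ordering::Release);
        }
        let cur = self.tokens.load(Ordering::Acquire);
        if cur < n {
            return false; // caller defers ~1 ms and retries
        }
        self.tokens.store(cur - n, Ordering::Release);
        true
    }
}
```

Because there is a single sender, the plain stores act as the "checked store" described in step 4; concurrent consumers would require folding refill and consume into one CAS loop.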
Sizing example: On a 200 Gb/s InfiniBand link with the default 50% bandwidth cap, the bucket permits ~12.5 GB/s of anti-entropy traffic. A full snapshot of 1M pages (4 GB) completes in ~0.33 seconds. Normal coherence traffic (the other 50%) is unaffected because anti-entropy sends block when the bucket is empty rather than competing for link bandwidth.
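The budget arithmetic can be checked directly; below is a small helper pair applying the formula from the `refill_rate_per_ms` documentation above (function names are illustrative):

```rust
/// Anti-entropy refill rate in bytes/ms, from a link speed in bits/s
/// and a percentage cap, per the refill_rate_per_ms derivation above.
fn refill_rate_per_ms(link_bits_per_sec: u64, pct: u64) -> u64 {
    link_bits_per_sec * pct / 100 / 8 / 1000
}

/// Milliseconds to push `bytes` through the bucket at steady state.
fn transfer_ms(bytes: u64, rate_per_ms: u64) -> u64 {
    bytes / rate_per_ms
}
```

For the 200 Gb/s example: `refill_rate_per_ms(200_000_000_000, 50)` yields 12_500_000 bytes/ms, and a 1M-page (4096-byte) snapshot takes about 328 ms.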
6.13.6 Stale Sharer Reconciliation¶
PutS messages sent during eviction of SharedReader pages are fire-and-forget (Section 6.11). If a PutS is lost (network failure), the home directory retains a stale sharer bit. Each stale bit adds a ~200 μs timeout to subsequent write-miss Inv cycles for that page.
The anti-entropy protocol includes a sharer bitmap reconciliation step piggybacked on its periodic sync cycle. This applies to ALL consistency modes (not just DSM_RELAXED) because stale sharer bits are a transport-layer issue, not a consistency-model issue:
1. During each anti-entropy cycle (every `BLOOM_RESET_INTERVAL_SECS` = 300 s for DSM_RELAXED regions; triggered by the coordinator for other consistency modes), the home node identifies pages with sharer bits set that have not seen any coherence traffic (GetS, GetM, Inv, InvAck, PutS, PutM) from the sharing peer since the last reconciliation cycle.
2. For each stale-candidate page × peer combination, the home sends a lightweight `SharerLivenessProbe { region_id, page_addr }` message to the peer.
3. The peer responds with either:
   - `PutS` — the page is no longer cached locally (clears the stale bit).
   - `SharerConfirm` — the page IS still cached (the bit is valid, no action needed).
4. If the peer does not respond within 500 ms, the home node clears the sharer bit proactively (the peer is likely dead or partitioned — the peer departure cleanup path will handle it). FMA event: `DsmStaleSharerCleared { region_id, page_addr, peer }`.
Memory cost: zero (uses existing coherence message path). Network cost: bounded by the number of stale candidates per cycle (typically <100 pages per region, ~3.2 KB of probes).
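The cycle can be sketched as two pure decision functions. This is an illustrative model, assuming the home tracks a per-cycle set of (page, peer) pairs that produced coherence traffic; the type and function names are not the kernel's:

```rust
use std::collections::HashSet;

/// Peer's answer to a SharerLivenessProbe (None = 500 ms timeout).
#[derive(Debug, PartialEq)]
enum ProbeReply {
    PutS,          // page no longer cached on the peer
    SharerConfirm, // page still cached
}

#[derive(Debug, PartialEq)]
enum SharerBit {
    Keep,
    Clear, // clearing emits the DsmStaleSharerCleared FMA event
}

/// Step 1: sharer bit set, but no coherence traffic from that peer
/// since the last reconciliation cycle.
fn stale_candidates(
    sharers: &[(u64, u64)],       // (page_addr, peer) pairs with bit set
    active: &HashSet<(u64, u64)>, // pairs that saw traffic this cycle
) -> Vec<(u64, u64)> {
    sharers.iter().copied().filter(|k| !active.contains(k)).collect()
}

/// Steps 3-4: resolve one probe outcome (timeout counts as Clear).
fn resolve_probe(reply: Option<ProbeReply>) -> SharerBit {
    match reply {
        Some(ProbeReply::SharerConfirm) => SharerBit::Keep,
        Some(ProbeReply::PutS) | None => SharerBit::Clear,
    }
}
```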
6.13.7 Interaction with Other Consistency Modes¶
- DSM_SYNCHRONOUS: No anti-entropy needed — all replicas are always up to date (writes wait for all acks). Stale sharer reconciliation still runs.
- DSM_CAUSAL: Uses the causal stamp propagation protocol (Section 6.6) for ordering, not version vectors. Anti-entropy is not applicable. Stale sharer reconciliation still runs.
- DSM_RELAXED: This section. Version vectors + anti-entropy + sharer reconciliation.
6.14 Application-Visible DSM (Level 2)¶
Level 1 (subsystem-scoped) DSM is used exclusively by kernel subsystems — UPFS,
block service providers, accelerator framework. Application-visible DSM exposes
DSM regions to user-space processes via a syscall interface, providing distributed
shared memory semantics similar to SYSV shared memory (shmget/shmat) but
across cluster nodes.
Use cases:

- MPI applications that benefit from shared-memory transport across nodes (replaces MPI_Win for one-sided communication)
- Distributed databases with shared buffer pools (e.g., shared-disk PostgreSQL)
- Scientific computing with large shared arrays (e.g., distributed NumPy-like)
- Actor frameworks with shared state (e.g., Ray object store)
6.14.1 Syscall Interface¶
Six new syscalls, registered in the UmkaOS syscall table (Section 19.1):
/// Create a new DSM region. Returns a region file descriptor.
/// The region is not mapped into any address space until dsm_mmap().
///
/// Permissions: requires CAP_DSM_CREATE capability ([Section 9.1](09-security.md#capability-based-foundation)).
/// Region ID is allocated by the kernel (cluster-unique via Raft).
///
/// Returns: file descriptor for the region (can be passed to other
/// processes via SCM_RIGHTS, similar to memfd_create).
pub fn dsm_create(
name: *const u8, // NUL-terminated name (max 63 bytes, informational)
size: u64, // Region size in bytes (page-aligned)
consistency: u32, // DsmConsistency value (0=Release, 2=Eventual, 3=Synchronous, 4=Causal; 1=SequentialReserved is REJECTED with EINVAL)
max_participants: u16, // Max peers (1-1024, default 256)
flags: u32, // DSM_WRITE_UPDATE, DSM_REPLICATE, etc.
) -> Result<RawFd, Errno>;
/// Attach to an existing DSM region by name. Returns a region file descriptor.
/// The calling process's peer must join the region (RegionJoinRequest is sent
/// to the coordinator automatically).
///
/// Permissions: the caller must hold a DistributedCapability for the region
/// (obtained via the capability system, [Section 9.1](09-security.md#capability-based-foundation)). The join is rejected
/// if the region is full (max_participants reached) or the capability is
/// invalid/expired.
pub fn dsm_attach(
name: *const u8, // Region name (must match an existing region)
flags: u32, // DSM_RDONLY = 0x01 (join as read-only participant)
) -> Result<RawFd, Errno>;
/// Detach from a DSM region. Unmaps all mappings of this region in the
/// calling process, writes back dirty pages, and sends RegionLeave.
/// The file descriptor is closed.
///
/// If other processes on this peer still have the region attached, the
/// peer remains a participant — only this process's mappings are removed.
/// The peer sends RegionLeave only when the last process detaches.
pub fn dsm_detach(fd: RawFd) -> Result<(), Errno>;
/// Map a DSM region into the calling process's address space.
/// Uses the same semantics as mmap(MAP_SHARED) — writes are visible
/// to other processes and peers. Page faults trigger the standard DSM
/// coherence protocol (GetS/GetM to home node).
///
/// addr: hint address (NULL for kernel-chosen). Must be page-aligned.
/// offset: offset within the region (page-aligned).
/// length: bytes to map (page-aligned, <= region size - offset).
/// prot: PROT_READ, PROT_WRITE, PROT_EXEC (must be subset of region permissions).
///
/// Returns: mapped virtual address.
pub fn dsm_mmap(
fd: RawFd,
addr: *mut u8, // hint (NULL = kernel chooses)
length: u64,
offset: u64,
prot: u32, // PROT_READ | PROT_WRITE | PROT_EXEC
) -> Result<*mut u8, Errno>;
/// Unmap a DSM region from the calling process's address space.
/// Dirty pages in the unmapped range are written back to home.
/// The process remains attached — dsm_detach() must be called separately.
pub fn dsm_munmap(
addr: *mut u8,
length: u64,
) -> Result<(), Errno>;
/// Query DSM region information.
pub fn dsm_info(
fd: RawFd,
info: *mut DsmRegionInfo,
) -> Result<(), Errno>;
/// User-visible region info struct. Returned to userspace by `dsm_info()`.
/// **Security**: All padding bytes MUST be zero-initialized before `copy_to_user`
/// (e.g., `MaybeUninit::zeroed()`) to prevent information disclosure.
/// Fields ordered to eliminate hidden repr(C) padding gaps.
#[repr(C)]
pub struct DsmRegionInfo {
pub region_id: u64, // offset 0, 8 bytes
pub name: [u8; 64], // offset 8, 64 bytes
pub size: u64, // offset 72, 8 bytes
pub consistency: u32, // offset 80, 4 bytes
pub max_participants: u16, // offset 84, 2 bytes
pub current_participants: u16, // offset 86, 2 bytes
pub flags: u32, // offset 88, 4 bytes
pub home_policy: u32, // offset 92, 4 bytes (moved before my_slot to avoid gap)
pub my_slot: u16, // offset 96, 2 bytes (this peer's slot; SLOT_INVALID if not joined)
pub _pad: [u8; 6], // offset 98, 6 bytes — explicit padding to 104 total
}
const_assert!(core::mem::size_of::<DsmRegionInfo>() == 104);
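The stated offsets are mechanically checkable. A standalone mirror of the struct (using `core::mem::offset_of!`, stable since Rust 1.77) confirms there are no hidden padding gaps:

```rust
use std::mem::{offset_of, size_of};

/// Standalone mirror of `DsmRegionInfo` for layout checking; the field
/// order is copied verbatim from the definition above.
#[repr(C)]
pub struct DsmRegionInfoMirror {
    pub region_id: u64,
    pub name: [u8; 64],
    pub size: u64,
    pub consistency: u32,
    pub max_participants: u16,
    pub current_participants: u16,
    pub flags: u32,
    pub home_policy: u32,
    pub my_slot: u16,
    pub _pad: [u8; 6],
}
```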
Syscall numbers: Allocated from UmkaOS-specific range (450+), not conflicting with Linux syscall numbers. These are UmkaOS extensions — Linux binaries don't use them.
6.14.2 Kernel-Side Implementation¶
The syscall layer is a thin wrapper around the existing kernel DSM API:
| Syscall | Kernel Action |
|---|---|
| `dsm_create()` | Allocate region_id via Raft, create DsmRegionCreate, broadcast DsmRegionCreateBcast |
| `dsm_attach()` | Look up region by name, send DsmRegionJoinRequest, wait for DsmRegionJoinAccept |
| `dsm_mmap()` | Insert VMA into process address space with vm_ops pointing to DSM fault handler |
| `dsm_detach()` | Write back dirty pages, send DsmRegionLeave (if last process on this peer) |
| `dsm_munmap()` | Remove VMA, write back dirty pages in range |
| `dsm_info()` | Read from local region metadata (no network traffic) |
Page fault path (same as Level 1):
Process accesses unmapped DSM page:
→ Page fault → VMA identifies DSM region
→ DSM subsystem: GetS/GetM to home node (RDMA)
→ Home responds with DataResp (page data via RDMA Write)
→ Page installed in process page table
→ Process resumes
No DsmSubscriber is registered for application-visible regions — they use
default DSM behavior (demand-fault, LRU eviction, no lock binding). The
application manages its own synchronization using:
- dsm_create() with DSM_SYNCHRONOUS consistency (automatic all-replica ack)
- Explicit dsm_fence() calls for DSM_RELAXED/DSM_CAUSAL
- Standard POSIX synchronization (futex, pthread_mutex) on DSM-mapped memory
— futex operations on DSM pages are transparently cluster-aware
(Section 19.8)
6.14.3 Per-Process DSM Region Tracking¶
/// Per-process DSM state. Stored as `Option<Box<ProcessDsmState>>` in the
/// process's task_struct extension, initialized to `None`. Allocated lazily
/// on first `dsm_create()` or `dsm_attach()` call. This avoids the ~13 KB
/// overhead for processes that do not use DSM (the vast majority).
/// Tracks which DSM regions this process has open file descriptors for.
pub struct ProcessDsmState {
/// Open DSM region file descriptors. Keyed by fd number.
/// Max 64 simultaneous DSM regions per process (configurable via
/// /proc/sys/kernel/dsm_max_regions_per_process). **Compile-time cap**:
/// ArrayVec<_, 64> is the hard upper bound. The sysctl can reduce the
/// limit below 64 but cannot increase it beyond 64 without recompilation.
/// 64 is sufficient for all anticipated workloads (most distributed
/// applications use 1-4 regions).
regions: ArrayVec<ProcessDsmRegion, 64>,
}
/// Per-process per-region state.
pub struct ProcessDsmRegion {
/// File descriptor number.
pub fd: RawFd,
/// Region handle (kernel-internal).
pub region: DsmRegionHandle,
/// VMAs mapping this region in this process's address space.
/// Multiple mappings of the same region are allowed (different offsets
/// or overlapping — same as mmap(MAP_SHARED) of the same file).
pub mappings: ArrayVec<DsmMapping, 8>,
}
/// One mapping of a DSM region in a process's address space.
pub struct DsmMapping {
pub va_start: u64,
pub length: u64,
pub offset: u64, // offset within the region
pub prot: u32,
}
Process lifecycle interactions:
| Event | DSM Action |
|---|---|
| `fork()` | Child inherits DSM file descriptors and mappings (copy-on-fork for the fd table). Both parent and child share the same DSM region participation — writes from either are coherent. |
| `exec()` | All DSM mappings are unmapped (same as mmap semantics). DSM file descriptors with O_CLOEXEC are closed — those without are inherited by the new binary. |
| `exit()` | All DSM file descriptors closed. Dirty pages written back. If last process on this peer with this region attached: peer sends RegionLeave. |
| Process migration (Section 6.5) | DSM mappings are recreated on the destination peer. If the destination peer is not yet a participant in the region, it joins automatically (RegionJoinRequest). Page data is demand-faulted on the destination (not transferred as part of migration — DSM handles it). |
6.14.4 Distributed Futex on DSM Pages¶
User-space processes using DSM regions will naturally use futex-based synchronization (pthread_mutex, pthread_cond, semaphores) on DSM-mapped memory. UmkaOS extends the futex subsystem to handle DSM-backed pages:
Process A (Node 1) calls futex_wait(dsm_addr, expected_val):
1. Kernel detects that dsm_addr is in a DSM VMA.
2. Register the waiter at the home node via a FutexWaitRegister message
that includes the expected_val:
{ region_id, offset, expected_val, waiter_peer_id, waiter_slot }.
(The home identifies waiters by (waiter_peer, waiter_slot); TID is
not needed on the wire — it is available locally for tracing.)
3. Home node receives the registration and performs the TOCTOU-safe check:
a. Acquire the futex hash bucket lock for (region_id, offset).
b. Read the current value at (region_id, offset) from the home's
page cache. If the home node does not have a local copy (directory
state = Modified/Exclusive, owner elsewhere), the home issues a GetS
to obtain a SharedReader copy before the TOCTOU comparison. This adds
~3-5 us (one RDMA round-trip). The page remains cached for subsequent
futex checks on the same address.
c. If current_val != expected_val: send FutexWakeTarget back to the
waiter immediately (spurious wakeup — the value changed between
the waiter's userspace load and the registration arrival). The
waiter returns -EAGAIN.
d. If current_val == expected_val: insert the waiter into the home's
futex wait table. Release bucket lock. The waiter sleeps.
This home-side recheck eliminates the TOCTOU race where a wake on
another node occurs between the waiter's userspace value check and
registration: the home node is the single serialization point for both
wake and wait operations on this futex address.
Process B (Node 2) calls futex_wake(dsm_addr, num_waiters):
1. Kernel detects DSM VMA.
2. Compute home node: home = hash(region_id, futex_offset) % participant_count.
3. Send FutexWake message to the home node:
{ region_id, offset, num_waiters }.
4. Home node maintains the authoritative waiter list for this futex address
(ordered by arrival time, same ordering as Linux futex hash bucket chain).
5. Home node wakes exactly num_waiters waiters total (not per-peer):
- Walk the waiter list. For each waiter on a remote peer, send a targeted
FutexWakeTarget message to that specific peer.
- Stop after waking num_waiters total.
6. For num_waiters == INT_MAX (0x7FFFFFFF, wake-all): home broadcasts
FutexWakeAll to all peers with waiters (same as FUTEX_WAKE with INT_MAX).
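The home computation in step 2 can be sketched as a pure function. The FNV-1a-style mix here is purely illustrative; the only real requirement is that every peer computes the same hash:

```rust
/// Illustrative home selection for a futex address. The kernel would use
/// whatever hash function the cluster has agreed on; determinism across
/// all peers is the only requirement.
fn futex_home(region_id: u64, offset: u64, participant_count: u64) -> u64 {
    const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x100000001b3;
    let mut h = FNV_OFFSET;
    for b in region_id
        .to_le_bytes()
        .iter()
        .chain(offset.to_le_bytes().iter())
    {
        h ^= *b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h % participant_count
}
```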
This hierarchical design preserves POSIX semantics: pthread_cond_signal() calls
futex_wake(addr, 1) and exactly one thread wakes across the entire cluster, not
one per node. The prior broadcast-and-hope design woke up to N threads (one per peer),
violating POSIX FUTEX_WAKE semantics where val=1 means exactly 1 total.
Registration: futex_wait registers the waiter both locally (for fast wakeup if
the waker is on the same node) AND at the home node (via a FutexWaitRegister message).
If the waker is on the same node as the waiter, the local wake path short-circuits
without involving the home node. The home node registration is needed only for
cross-node wakes.
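The home node's role as the single serialization point (TOCTOU recheck on registration, arrival-ordered wake of exactly `num_waiters`) can be modeled compactly. This is a single-process sketch with illustrative names, not the kernel's wait-table implementation:

```rust
use std::collections::HashMap;

type FutexKey = (u64, u64); // (region_id, offset)

#[derive(Debug, PartialEq)]
enum WaitResult {
    Queued,         // value matched expected; waiter sleeps
    SpuriousWakeup, // value changed; waiter returns -EAGAIN
}

#[derive(Default)]
struct HomeFutexTable {
    // Arrival-ordered waiter list per futex address: (peer, slot).
    waiters: HashMap<FutexKey, Vec<(u64, u16)>>,
}

impl HomeFutexTable {
    /// FutexWaitRegister: compare the current value (read under the
    /// conceptual bucket lock) against expected_val; queue only on match.
    fn wait_register(
        &mut self,
        key: FutexKey,
        current_val: u32,
        expected_val: u32,
        peer: u64,
        slot: u16,
    ) -> WaitResult {
        if current_val != expected_val {
            return WaitResult::SpuriousWakeup; // home replies FutexWakeTarget
        }
        self.waiters.entry(key).or_default().push((peer, slot));
        WaitResult::Queued
    }

    /// FutexWake: pop up to `num_waiters` in arrival order; the returned
    /// (peer, slot) pairs are the FutexWakeTarget destinations.
    fn wake(&mut self, key: FutexKey, num_waiters: usize) -> Vec<(u64, u16)> {
        match self.waiters.get_mut(&key) {
            Some(list) => {
                let n = num_waiters.min(list.len());
                list.drain(..n).collect()
            }
            None => Vec::new(),
        }
    }
}
```

Because both `wait_register` and `wake` go through this one table, `wake(key, 1)` releases exactly one waiter cluster-wide, matching POSIX FUTEX_WAKE semantics.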
Wire messages:
// Futex coordination (DsmMsgType)
DsmFutexWake = 0x0090, // Waker → home node
DsmFutexWakeTarget = 0x0091, // Home → specific peer with waiter
DsmFutexWaitRegister = 0x0092, // Waiter → home node (register)
DsmFutexWaitUnregister = 0x0093, // Waiter → home node (unregister on timeout/signal)
#[repr(C)]
pub struct DsmFutexWakePayload {
/// Region this futex belongs to. Included in the payload (not just the
/// transport header) so that the receiver can validate the region_id
/// without parsing the transport header, and so that the payload is
/// self-describing for logging/debugging. The transport header's
/// region_id (if present) must match this field; mismatch is a protocol error.
pub region_id: Le64, // 8 bytes (offset 0)
/// Offset within the region (page-aligned + intra-page offset).
pub offset: Le64, // 8 bytes (offset 8)
/// Number of waiters to wake (INT_MAX = 0x7FFFFFFF to wake all,
/// matching Linux FUTEX_WAKE val semantics; 0 = wake none).
pub num_waiters: Le32, // 4 bytes (offset 16)
pub _pad: [u8; 12], // 12 bytes (offset 20)
}
const_assert!(core::mem::size_of::<DsmFutexWakePayload>() == 32);
/// Home node → specific peer: wake one or more waiters at the given futex
/// address. Sent when the home node selects a waiter on this peer from the
/// ordered wait list in response to a DsmFutexWake(num_waiters >= 1).
#[repr(C)]
pub struct DsmFutexWakeTargetPayload {
/// Region the futex belongs to.
pub region_id: Le64, // 8 bytes (offset 0)
/// Futex address within the region (page-aligned + intra-page offset).
pub offset: Le64, // 8 bytes (offset 8)
/// Number of waiters to wake on this specific peer. Normally 1, but
/// may be >1 if the home node is collapsing multiple wakes to the same
/// peer in a single message (optimization for wake-all).
pub wake_count: Le32, // 4 bytes (offset 16)
pub _pad: [u8; 12], // 12 bytes (offset 20)
}
const_assert!(core::mem::size_of::<DsmFutexWakeTargetPayload>() == 32);
/// Waiter → home node: register a futex waiter at the given address.
/// The home node adds this entry to its per-address ordered wait list
/// and performs a TOCTOU-safe comparison of `expected_val` against the
/// current value at (region_id, offset) — see the distributed futex
/// protocol description above. If the values differ, the home sends
/// `DsmFutexWakeTarget` immediately (spurious wakeup, waiter returns
/// `-EAGAIN`). If they match, the waiter is inserted into the home's
/// futex wait table.
///
/// If the waiter is woken locally (same-node) before the home acknowledges,
/// the waiter sends DsmFutexWaitUnregister to cancel.
#[repr(C)]
pub struct DsmFutexWaitRegisterPayload {
/// Region the futex belongs to.
pub region_id: Le64, // 8 bytes (offset 0)
/// Futex address within the region.
pub offset: Le64, // 8 bytes (offset 8)
/// PeerId of the waiting node (for the home node's wait list).
pub waiter_peer: Le64, // 8 bytes (offset 16)
/// Slot index of the waiting peer (for fast bitmap lookup).
pub waiter_slot: Le16, // 2 bytes (offset 24)
pub _pad1: [u8; 2], // 2 bytes (offset 26)
/// Expected futex value for TOCTOU-safe comparison at the home node.
/// Le32 matches Linux FUTEX_WAIT semantics (compares a 32-bit value:
/// `u32 __user *uaddr, u32 val` in the Linux syscall interface).
/// The home node reads the current value at (region_id, offset) and
/// compares it to `expected_val` under the futex hash bucket lock.
pub expected_val: Le32, // 4 bytes (offset 28)
}
// Layout: region_id(8) + offset(8) + waiter_peer(8) + waiter_slot(2) +
// _pad1(2) + expected_val(4) = 32 bytes.
const_assert!(core::mem::size_of::<DsmFutexWaitRegisterPayload>() == 32);
/// Waiter → home node: unregister a futex waiter (timeout, signal, or
/// local wake). The home node removes this entry from the wait list.
/// Idempotent: if the waiter was already woken (race with DsmFutexWakeTarget),
/// the home node silently ignores the unregister.
#[repr(C)]
pub struct DsmFutexWaitUnregisterPayload {
/// Region the futex belongs to.
pub region_id: Le64, // 8 bytes (offset 0)
/// Futex address within the region.
pub offset: Le64, // 8 bytes (offset 8)
/// PeerId of the unregistering waiter.
pub waiter_peer: Le64, // 8 bytes (offset 16)
pub _pad: [u8; 8], // 8 bytes (offset 24)
}
const_assert!(core::mem::size_of::<DsmFutexWaitUnregisterPayload>() == 32);
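A minimal sketch of what the `Le32` wrapper used in the payloads above is assumed to look like (byte-array storage, conversion on access), making the structs byte-identical on little- and big-endian nodes; `Le64`/`Le16` would follow the same pattern with `[u8; 8]`/`[u8; 2]`:

```rust
/// Sketch of an explicitly little-endian u32 wire integer. The bytes are
/// stored in LE order regardless of host endianness; conversion happens
/// only on get/new, so a memcpy of the containing struct is endian-safe.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(transparent)]
pub struct Le32([u8; 4]);

impl Le32 {
    pub fn new(v: u32) -> Self {
        Le32(v.to_le_bytes())
    }
    pub fn get(self) -> u32 {
        u32::from_le_bytes(self.0)
    }
}
```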
Performance: futex_wait is local-first: the waiter sleeps locally and sends a
registration message to the home node (one RDMA Send, ~32 bytes). futex_wake(addr, 1)
sends one message to the home node; the home sends one targeted wake message to the
specific peer. Total: 2 RDMA messages, ~6-10 us (home has data cached) or ~9-15 us
(home must demand-fetch via GetS before TOCTOU comparison). For wake-all: 1 message
to home + W messages to peers with waiters (W typically 1-3). The home node's waiter
list is O(1) per-address (futex hash keyed by (region_id, offset)).
Why not a distributed lock: Application-level synchronization (mutexes,
semaphores) should use futex, not DLM. DLM is for kernel subsystems that
need distributed byte-range locking with complex conflict modes (EX/PW/PR/CR).
User-space applications using DSM should use futex for simplicity and Linux
compatibility — pthread_mutex_lock() on DSM memory just works.
6.14.5 Security Model¶
- `dsm_create()` requires the `CAP_DSM_CREATE` capability (Section 9.1).
- `dsm_attach()` requires a valid `DistributedCapability` for the region, obtained via the capability system. The region creator issues capabilities to authorized peers.
- DSM region file descriptors can be passed between processes on the same node via `SCM_RIGHTS` (Unix domain sockets), similar to `memfd_create()`.
- Cross-node access is controlled by the region's `required_cap` field — peers must hold the capability to join.
- Mapped pages respect `prot` flags — `PROT_WRITE` without the region's `DSM_PROT_WRITE` permission returns `EACCES` on `dsm_mmap()`.
- Per-process limit: `dsm_max_regions_per_process` (default 64, sysctl).
- Per-node limit: `dsm_max_regions_per_node` (default 256, sysctl).
6.14.6 Compatibility with Linux Shared Memory¶
| Linux API | DSM Equivalent | Compatibility |
|---|---|---|
| `shmget()`/`shmat()` | `dsm_create()`/`dsm_mmap()` | Different API — DSM is an UmkaOS extension |
| `mmap(MAP_SHARED, fd)` | `dsm_mmap(dsm_fd)` | Similar semantics — MAP_SHARED coherence |
| `memfd_create()` | `dsm_create()` returns fd | Same fd-passing pattern |
| `futex()` | Transparent — works on DSM pages | Full compatibility |
| POSIX shm (`shm_open`) | Not integrated | POSIX shm is local-only |
Linux SYSV shared memory (shmget/shmat) continues to work for local
shared memory. DSM regions are a separate namespace — there is no automatic
"make all shared memory distributed" mode. Applications that want
distributed shared memory explicitly create DSM regions.
6.14.7 Performance Considerations¶
| Operation | Latency | Notes |
|---|---|---|
| `dsm_create()` | ~10-50 ms | Raft consensus for region_id allocation |
| `dsm_attach()` | ~5-20 ms | RegionJoin handshake |
| `dsm_mmap()` | <1 μs | Local VMA insertion (no network) |
| First page access (cold) | ~10-18 μs | Standard DSM page fault (GetS + RDMA) |
| Subsequent access (hot) | <1 ns | Local memory access (page cached) |
| `futex_wake()` cross-node | ~6-10 μs | Two RDMA messages (waker → home → target peer, Section 6.14.4) |
| `dsm_detach()` | ~1-100 ms | Writeback of dirty pages (depends on working set) |
No performance bombs:
- dsm_create() is slow (Raft) but is a one-time setup operation.
- Page faults are the standard DSM cost (~10-18 μs) — same as NVMe latency.
- No hidden costs — the application explicitly opts into distributed shared
memory and pays only for what it uses (demand-fault, not eager replication).
- Futex wake is lightweight (one small message per peer, no page transfers).