
Chapter 14: Storage and Filesystems

Durability guarantees, block I/O, volume management, block storage networking, clustered filesystems, persistent memory, SATA/AHCI, ext4/XFS/Btrfs, ZFS


14.1 Durability Guarantees

Linux problem: Applications couldn't reliably know when data was on disk. The ext4 delayed-allocation data loss bugs (2008-2009) were a symptom. Worse, fsync() error reporting was broken — errors could be silently lost between calls. Partially fixed with errseq_t in kernel 4.13 (with subsequent refinements in 4.14 and 4.16), but the contract between applications and filesystems around durability remains murky.

UmkaOS design:
  • Error reporting: Every filesystem operation tracks errors via a per-file error sequence counter. fsync() returns errors exactly once and never silently drops them. The VFS layer enforces this — individual filesystem implementations cannot bypass it.
  • Durability contract: Three explicit levels, documented and testable:
      1. write() → data in page cache (may be lost on crash)
      2. fsync() → data + metadata on stable storage (guaranteed)
      3. O_SYNC / O_DSYNC → each write waits for stable storage
  • Filesystem crash consistency: All filesystem implementations must declare their consistency model (journal, COW, log-structured) and pass a crash-consistency test suite as part of KABI certification.
  • Error propagation: Writeback errors propagate to ALL file descriptors that have the file open, not just the one that triggered writeback. No silent data loss.
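The error-reporting contract can be sketched as a shared per-file sequence counter with a per-descriptor cursor. This is a minimal sketch inspired by Linux's errseq_t, not the UmkaOS implementation; the real errseq_t also coalesces repeated errors via a "seen" bit, which is omitted here for brevity.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Shared per-file error state: high bits = sequence counter, low 12 bits = errno.
pub struct ErrSeq(AtomicU64);

const ERRNO_MASK: u64 = 0xFFF;
const SEQ_STEP: u64 = 1 << 12;

impl ErrSeq {
    pub const fn new() -> Self {
        ErrSeq(AtomicU64::new(0))
    }

    /// Writeback failure path: record the errno and advance the sequence.
    pub fn record(&self, errno: u64) {
        let mut cur = self.0.load(Ordering::Acquire);
        loop {
            let next = (cur & !ERRNO_MASK).wrapping_add(SEQ_STEP) | (errno & ERRNO_MASK);
            match self.0.compare_exchange_weak(cur, next, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return,
                Err(observed) => cur = observed,
            }
        }
    }

    /// fsync path for one file description. `seen` is the descriptor's private
    /// cursor (sampled at open). Returns Some(errno) exactly once per recorded
    /// error per descriptor — every open descriptor observes the error, but
    /// none observes it twice.
    pub fn check(&self, seen: &mut u64) -> Option<u64> {
        let cur = self.0.load(Ordering::Acquire);
        if cur == *seen {
            return None; // nothing new since this descriptor last checked
        }
        *seen = cur; // advance the cursor: report exactly once
        Some(cur & ERRNO_MASK)
    }
}
```

Because the cursor lives in the file description rather than the shared file state, one descriptor consuming the error cannot hide it from the others — the property the VFS layer enforces above.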


14.2 ZFS Integration

14.2.1 Native ZFS and Filesystem Licensing

Linux problem: ZFS can't be merged due to CDDL vs GPL license incompatibility. Users rely on out-of-tree OpenZFS which breaks with kernel updates.

UmkaOS design:
  • The kernel is licensed under UmkaOS's proposed OKLF v1.3 license framework (see Appendix A of 23-roadmap.md, Section 23.1 for the full specification — OKLF is a novel license being developed for UmkaOS, not a pre-existing published license): GPLv2 base with the Approved Linking License Registry (ALLR), which explicitly includes CDDL as an approved license. CDDL-licensed code (like OpenZFS) communicates with the kernel via KABI IPC without license conflict (no in-kernel linking occurs).
  • ZFS is a first-class Tier 1 filesystem driver, same tier as ext4, XFS, and Btrfs. The KABI interface provides the license boundary: ZFS is dynamically loaded, has one resolved symbol (__kabi_driver_entry), and communicates exclusively through ring buffer IPC and vtable dispatch — no linking, no shared symbols. This provides more isolation than Linux's EXPORT_SYMBOL_GPL boundary (where modules ARE linked into the kernel and share function calls). The license separation is provided by KABI, not by the isolation tier — running a filesystem as Tier 2 (process isolation) for licensing reasons would impose catastrophic I/O overhead (~200-500 cycles per VFS operation) for zero additional legal benefit.
  • NFSv4 ACLs are first-class (Section 8.1.4), so ZFS's native ACL model works natively.
  • The filesystem KABI interface is rich enough to support ZFS's advanced features: snapshots, send/receive, datasets, native encryption, dedup, special vdevs.
  • ZFS benefits from the stable driver ABI, so it won't break with kernel updates — eliminating the primary pain point of Linux's out-of-tree OpenZFS module.

14.2.2 ZFS Advanced Features

Section 14.2.1 establishes that ZFS is a first-class UmkaOS citizen via KABI (Tier 1 driver). This section covers advanced ZFS features that benefit from UmkaOS's architecture: capability-based dataset management, RDMA-accelerated replication, and cluster integration.

Dataset hierarchy as capability objects — ZFS datasets form a hierarchy (pool → dataset → child dataset → snapshot → clone). In UmkaOS, each dataset is a capability object. The capability token for a dataset encodes the specific operations permitted:

Capability Permits
CAP_ZFS_MOUNT Mount the dataset as a filesystem
CAP_ZFS_SNAPSHOT Create/destroy snapshots of the dataset
CAP_ZFS_SEND Generate a send stream (for replication)
CAP_ZFS_RECV Receive a send stream into this dataset
CAP_ZFS_CREATE Create child datasets
CAP_ZFS_DESTROY Destroy the dataset (highest privilege)

Delegation means transferring a subset of your capabilities to another local entity (a container, a user). A pool administrator holding all capabilities can delegate CAP_ZFS_MOUNT + CAP_ZFS_SNAPSHOT + CAP_ZFS_CREATE for a subtree to a container — the container can mount, snapshot, and create children within its subtree, but cannot destroy the parent dataset or send replication streams. For shared storage across hosts, use clustered filesystems (Section 14.5) backed by the DLM (Section 14.6) over shared block devices (Section 14.4).
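The delegation rule above — a token may carry only a subset of the delegator's rights — can be sketched as a bitmask subset check. The ZfsCaps type and bit values here are illustrative assumptions, not the UmkaOS capability token format.

```rust
/// Illustrative capability mask for ZFS dataset operations.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct ZfsCaps(u32);

impl ZfsCaps {
    pub const MOUNT: ZfsCaps = ZfsCaps(1 << 0);
    pub const SNAPSHOT: ZfsCaps = ZfsCaps(1 << 1);
    pub const SEND: ZfsCaps = ZfsCaps(1 << 2);
    pub const RECV: ZfsCaps = ZfsCaps(1 << 3);
    pub const CREATE: ZfsCaps = ZfsCaps(1 << 4);
    pub const DESTROY: ZfsCaps = ZfsCaps(1 << 5);
    pub const ALL: ZfsCaps = ZfsCaps(0b11_1111);

    pub fn union(self, other: ZfsCaps) -> ZfsCaps {
        ZfsCaps(self.0 | other.0)
    }

    pub fn contains(self, other: ZfsCaps) -> bool {
        self.0 & other.0 == other.0
    }

    /// Delegation: the new token may only carry rights the delegator holds.
    /// A holder can narrow its capabilities but never amplify them.
    pub fn delegate(self, requested: ZfsCaps) -> Result<ZfsCaps, &'static str> {
        if self.contains(requested) {
            Ok(requested)
        } else {
            Err("EPERM: delegation cannot amplify capabilities")
        }
    }
}
```

In the scenario above, the pool administrator (holding ALL) mints a MOUNT + SNAPSHOT + CREATE token for a container; the container's subsequent attempt to delegate DESTROY or SEND fails the subset check.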

zvol (ZFS volumes) — ZFS volumes are datasets that expose a block device interface instead of a POSIX filesystem. UmkaOS integrates zvols with umka-block's device-mapper framework — a zvol can serve as the backing store for dm-crypt, dm-mirror, or as an iSCSI LUN (Section 14.4). This enables ZFS's checksumming, compression, and snapshot capabilities for raw block storage consumers.

zfs send/recv over RDMA — ZFS replication streams (zfs send) are often used for backup, disaster recovery, and dataset migration. In Linux, zfs send | ssh remote zfs recv pushes the stream over TCP (typically SSH-encrypted). UmkaOS provides a native RDMA transport option:
  • Uses Section 5.1.4's RDMA infrastructure
  • Kernel-to-kernel path: when both source and destination run UmkaOS, the send stream bypasses userspace entirely — data moves directly from the source ZFS module through RDMA to the destination ZFS module
  • Zero-copy: send stream data is RDMA READ from source memory, written directly into destination's transaction group
  • Encryption: if the dataset uses ZFS native encryption, the stream is already encrypted end-to-end. Otherwise, RDMA transport encryption (Section 5.1.4) protects data in transit

Import/export compatibility — UmkaOS's ZFS implementation reads and writes the standard ZFS on-disk format (as defined by OpenZFS). Existing zpools created on Linux, FreeBSD, or illumos can be imported by UmkaOS without modification. Conversely, zpools created by UmkaOS can be exported and imported on any OpenZFS-compatible system.


14.3 Block I/O and Volume Management

Linux problem: LVM/mdadm are mature but fragile when a block device disappears momentarily — the volume layer panics or marks the device as failed. An NVMe driver reload that takes 50ms can cascade into a degraded RAID array and an unnecessary multi-hour resync.

UmkaOS design:

14.3.1 Block Device Trait

/// Block device abstraction — the interface between the block I/O layer
/// and storage device drivers (NVMe, SATA, virtio-blk, eMMC, SD, dm-*).
///
/// Every storage driver registers a `BlockDevice` with umka-block.
/// The block I/O layer routes bio requests through this trait.
pub trait BlockDeviceOps: Send + Sync {
    /// Submit a block I/O request. The request contains one or more
    /// bio segments (contiguous LBA ranges with associated memory pages).
    /// Returns immediately; completion is signaled via the bio's completion
    /// callback. For synchronous I/O, the caller waits on the callback.
    fn submit_bio(&self, bio: &mut Bio) -> Result<()>;

    /// Flush volatile write cache to stable storage. Called by fsync(),
    /// sync(), and journal commit paths. Must not return until all
    /// previously submitted writes are on stable media.
    fn flush(&self) -> Result<()>;

    /// Discard (TRIM/UNMAP) the specified LBA range. The device may
    /// deallocate the underlying storage. Not all devices support this;
    /// return ENOSYS if unsupported.
    fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()>;

    /// Return device geometry and capabilities.
    fn get_info(&self) -> BlockDeviceInfo;

    /// Shut down the device. Flushes caches and releases hardware resources.
    fn shutdown(&self) -> Result<()>;
}

/// Block device metadata and capabilities.
pub struct BlockDeviceInfo {
    /// Logical sector size in bytes (typically 512 or 4096).
    pub logical_block_size: u32,
    /// Physical sector size in bytes (4096 for AF drives).
    pub physical_block_size: u32,
    /// Total device capacity in logical sectors.
    pub capacity_sectors: u64,
    /// Maximum segments per bio request.
    pub max_segments: u16,
    /// Maximum total bytes per bio request.
    pub max_bio_size: u32,
    /// Device supports discard/TRIM.
    pub supports_discard: bool,
    /// Device supports flush (volatile write cache).
    pub supports_flush: bool,
    /// Device supports FUA (Force Unit Access) — write directly to media
    /// without requiring a separate flush.
    pub supports_fua: bool,
    /// Optimal I/O size in bytes (for alignment).
    pub optimal_io_size: u32,
    /// NUMA node affinity (for interrupt/queue placement).
    pub numa_node: u16,
}

/// Block I/O request — carries data between the block layer and device drivers.
///
/// A Bio represents a contiguous logical block range and its associated
/// memory pages. Multiple bios can be chained for scatter-gather I/O.
/// The bio is the fundamental unit of block I/O in UmkaOS, equivalent to
/// Linux's `struct bio`.
pub struct Bio {
    /// Target block device.
    pub bdev: Arc<dyn BlockDeviceOps>,
    /// Operation type.
    pub op: BioOp,
    /// Starting logical block address (in logical sectors).
    pub start_lba: u64,
    /// Scatter-gather list of memory segments.
    pub segments: ArrayVec<BioSegment, 16>,
    /// Extension segment list for bios with >16 segments.
    pub segments_ext: Option<Box<[BioSegment]>>,
    /// Completion callback. Called by the device driver when the I/O completes.
    pub completion: BioCompletion,
    /// Error status (set by the driver on completion).
    pub status: AtomicI32,
    /// Flags (REQ_FUA, REQ_PREFLUSH, etc.).
    pub flags: BioFlags,
}

/// A single segment of a bio — a contiguous range of physical memory.
pub struct BioSegment {
    /// Physical page containing the data.
    pub page: PageId,
    /// Offset within the page (bytes).
    pub offset: u32,
    /// Length of this segment (bytes).
    pub len: u32,
}

#[repr(u8)]
pub enum BioOp {
    Read = 0,
    Write = 1,
    Flush = 2,
    Discard = 3,
    WriteZeroes = 4,
    ZoneAppend = 5,
}
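The BlockDeviceInfo limits (max_segments, max_bio_size) constrain how a large I/O must be split into bios. The helper below is a hypothetical planning sketch, not the umka-block submission path; it assumes one 4 KiB page per BioSegment and uses a pared-down Limits struct in place of BlockDeviceInfo.

```rust
/// Pared-down view of BlockDeviceInfo for split planning (illustrative).
pub struct Limits {
    pub max_segments: u16,
    pub max_bio_size: u32,
    pub logical_block_size: u32,
}

/// Plan how a request of `len_bytes` splits into bios, assuming the data is
/// built from 4 KiB pages with one segment per page. Returns
/// (bio_count, sectors_per_full_bio).
pub fn split_plan(limits: &Limits, len_bytes: u64) -> (u64, u64) {
    const PAGE: u64 = 4096;
    // A bio can carry at most max_segments pages AND max_bio_size bytes;
    // whichever limit binds first determines the per-bio payload.
    let by_segments = limits.max_segments as u64 * PAGE;
    let per_bio = by_segments.min(limits.max_bio_size as u64);
    // Round up: a partial final bio still costs one submission.
    let bios = (len_bytes + per_bio - 1) / per_bio;
    (bios, per_bio / limits.logical_block_size as u64)
}
```

For a device advertising 16 segments and a 1 MiB max bio size with 512-byte sectors, the segment limit binds (16 × 4 KiB = 64 KiB per bio), so a 1 MiB request becomes 16 bios of 128 sectors each.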

14.3.2 Device-Mapper and Volume Management

Device-mapper framework — UmkaOS implements a device-mapper layer in umka-block with standard targets:

Target Description Linux equivalent
dm-linear Simple linear mapping dm-linear
dm-striped Stripe across N devices dm-stripe
dm-mirror Synchronous mirror (RAID-1) dm-mirror
dm-crypt Transparent encryption (AES-XTS) dm-crypt
dm-verity Read-only integrity verification dm-verity
dm-snapshot Copy-on-write snapshots dm-snapshot
dm-thin Thin provisioning with overcommit dm-thin-pool

LVM2 metadata compatibility — UmkaOS reads the LVM2 on-disk metadata format (PV headers, VG descriptors, LV segment maps) and constructs logical volumes using device-mapper targets. Existing LVM2 volume groups created under Linux are usable without conversion. LVM2 userspace tools (lvm, pvs, vgs, lvs) work unmodified via the standard device-mapper ioctl interface.

Software RAID — RAID levels 0/1/5/6/10 are implemented as device-mapper targets. MD superblock formats (0.90, 1.0, 1.2) are read for compatibility with existing Linux mdadm arrays. mdadm works unmodified.

Recovery-aware volume layer — This is where UmkaOS diverges meaningfully from Linux. A block device's temporary disappearance during a Tier 1 driver reload (~50-150ms) does NOT mark the device as failed:

Volume Layer State Machine:
  DEVICE_ACTIVE       → Normal I/O flow
  DEVICE_RECOVERING   → Driver reload in progress, I/O queued
  DEVICE_FAILED       → Device permanently gone, failover/degrade

Transition rules:
  ACTIVE → RECOVERING:  When driver supervisor signals reload start
  RECOVERING → ACTIVE:  When new driver instance signals ready (typical: <100ms)
  RECOVERING → FAILED:  When recovery timeout expires (default: 5 seconds)
  • During DEVICE_RECOVERING, the volume layer pauses I/O in its ring buffer. No requests are failed; they simply wait.
  • RAID resync is NOT triggered for sub-100ms driver reloads — the array stays clean. The volume layer distinguishes "device temporarily gone for driver reload" from "device removed from bus" by checking the driver supervisor state.
  • If the recovery window exceeds the configurable timeout (default 5s), the device transitions to DEVICE_FAILED and normal degraded-mode behavior applies (RAID rebuilds, error returns for non-redundant volumes).
  • dm-verity for verified boot is already designed (Section 8.2.6).
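The state machine above can be sketched as a pure transition function. The Rust state and event names are illustrative renderings of DEVICE_ACTIVE / DEVICE_RECOVERING / DEVICE_FAILED and the supervisor signals, not the umka-block API.

```rust
/// Volume-layer device state (mirrors DEVICE_ACTIVE / RECOVERING / FAILED).
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DevState {
    Active,
    Recovering,
    Failed,
}

/// Events delivered by the driver supervisor and the volume layer's timer
/// (names are illustrative).
#[derive(Clone, Copy)]
pub enum Event {
    ReloadStart,    // supervisor signals Tier 1 driver reload
    DriverReady,    // new driver instance signals ready
    TimeoutExpired, // recovery window (default 5s) expired
    BusRemoval,     // device physically removed from the bus
}

pub fn step(state: DevState, ev: Event) -> DevState {
    use DevState::*;
    match (state, ev) {
        (Active, Event::ReloadStart) => Recovering,    // queue I/O, do not fail it
        (Recovering, Event::DriverReady) => Active,    // typical: <100ms, array stays clean
        (Recovering, Event::TimeoutExpired) => Failed, // degraded mode / failover applies
        (_, Event::BusRemoval) => Failed,              // genuinely gone: fail immediately
        (s, _) => s,                                   // all other events: no transition
    }
}
```

The key property is that ReloadStart never leads directly to Failed — only a timeout or genuine bus removal does, which is what prevents the spurious RAID resync.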

14.4 Block Storage Networking

Storage networking protocols that expose remote block devices as local storage. These integrate with UmkaOS's block layer (umka-block), RDMA infrastructure (Section 5.1.4), and driver recovery model.

iSCSI Initiator

Tier 1 umka-block module implementing the iSCSI initiator role (RFC 7143):
  • Session management: login, logout, connection multiplexing, session recovery
  • SCSI command encapsulation over TCP
  • CHAP authentication (unidirectional and mutual)
  • Header and data digests (CRC32C) for integrity
  • Multi-connection sessions (MC/S) for bandwidth aggregation
  • Error recovery levels 0, 1, and 2

iSCSI Target

Tier 1 module exposing local block devices as iSCSI LUNs:
  • LIO-compatible configuration interface (existing targetcli works via compat layer)
  • ACL-based access control (initiator IQN whitelist + CHAP)
  • Multiple LUNs per target portal group
  • SCSI Persistent Reservations (PR) support (required for clustered filesystems)

iSER (iSCSI Extensions for RDMA)

When an RDMA fabric is available (InfiniBand, RoCE, iWARP — Section 5.1.4), iSCSI sessions transparently upgrade to RDMA transport:
  • Zero-copy data transfer: RDMA READ/WRITE directly between initiator/target memory
  • Kernel-bypass data path: data moves without CPU involvement
  • Same iSCSI session management and authentication, different transport
  • Transparent upgrade: if both ends advertise RDMA capability during login, iSER is negotiated automatically. Applications and management tools see a standard iSCSI session.

NVMe-oF Initiator (Host)

Tier 1 umka-block module implementing the NVMe over Fabrics host side (NVM Express 2.0; NVMe/TCP per the NVMe TCP Transport Specification, TP 8000; NVMe/RDMA per the original NVMe-oF specification, June 2016):

  • Discovery: NVMe-oF discovery protocol (well-known discovery NQN) — initiator queries a discovery controller to enumerate available subsystems and transport addresses. Supports Discovery Log Page, referrals, and persistent discovery connections, unique discovery controller identification (TP 8013a).
  • NVMe/TCP transport: NVMe commands encapsulated in TCP (NVMe TCP Transport Specification, TP 8000, widely deployed). Lighter than iSCSI — no SCSI translation layer, native NVMe command set. Supports header and data digests (CRC32C), and TLS 1.3 for in-transit encryption (TP 8011).
  • NVMe/RDMA transport: NVMe commands over RDMA (InfiniBand, RoCE, iWARP). Capsule commands sent via RDMA SEND, data transferred via RDMA READ/WRITE — zero-copy, kernel-bypass. Lowest latency option (~3-5 μs network transport; ~10-20 μs end-to-end including NVMe target processing).
  • Multipath: native NVMe multipath (ANA — Asymmetric Namespace Access). Multiple paths to the same namespace are managed by the NVMe driver itself (not dm-multipath). ANA groups indicate path optimality (optimized, non-optimized, inaccessible). UmkaOS's NVMe multipath integrates with the recovery-aware volume layer (Section 14.3) — if a path fails due to driver crash, the volume layer waits for recovery rather than immediately failing over.
  • Namespace management: attach/detach namespaces, resize, format — full NVMe-oF namespace management command set.
  • Zoned namespaces (ZNS): NVMe-oF supports zoned namespaces. UmkaOS exposes these through the block layer's zone interface, compatible with zonefs and f2fs.

NVMe-oF Target (Subsystem)

Tier 1 module exposing local NVMe devices (or any block device) as NVMe-oF subsystems:

  • Subsystem management: create/destroy NVMe subsystems, each with one or more namespaces backed by local block devices (NVMe, zvol, dm device, or any umka-block device).
  • Transport bindings: simultaneous TCP and RDMA listeners on the same subsystem. Clients connect via whichever transport is available.
  • Access control: per-host NQN ACLs. Each allowed host can be restricted to specific namespaces within the subsystem.
  • ANA groups: configure asymmetric namespace access for multipath. Allows active/passive and active/active configurations.
  • Passthrough mode: for local NVMe devices, optionally pass NVMe commands directly to the hardware (no block layer translation). Provides the lowest-latency target implementation — remote host gets near-local NVMe performance.
  • Configuration interface: nvmetcli-compatible JSON configuration (existing Linux NVMe target management tools work via compat layer).

NVMe-oF over Fabrics — Why It Matters

NVMe-oF is replacing iSCSI in new deployments because it eliminates the SCSI translation layer. iSCSI encapsulates SCSI commands (a protocol designed for parallel buses in 1986) over TCP. NVMe-oF speaks NVMe natively — the same command set used by local NVMe SSDs. This means:
  • No SCSI CDB translation overhead
  • Native support for NVMe features (multipath/ANA, zoned namespaces, NVMe reservations)
  • Simpler protocol state machine (NVMe queue pairs vs iSCSI session/connection/task)
  • Lower latency at every layer

UmkaOS supports both because iSCSI remains dominant in existing infrastructure (and iSER makes it competitive on RDMA fabrics), while NVMe-oF is the clear direction for new deployments.

Protocol comparison:

Protocol Transport CPU overhead Latency Bandwidth
iSCSI TCP High (TCP stack + SCSI) ~100μs 10-25 Gbps
iSER RDMA Minimal (zero-copy) ~15-25μs end-to-end (~5-10μs transport only) Line rate (100+ Gbps)
NVMe-oF/TCP TCP Medium (no SCSI layer) ~15-30μs 25-100 Gbps
NVMe-oF/RDMA RDMA Minimal ~10-20μs end-to-end¹ Line rate

¹ NVMe-oF/RDMA latency breakdown: ~3-5 μs network transport (RDMA) + NVMe target processing. The 3-5 μs figure commonly cited represents RDMA transport latency only; end-to-end I/O latency including NVMe device processing is typically ~10-20 μs.

Recovery advantage — Both iSCSI and NVMe-oF initiators run as Tier 1 drivers with state preservation (Section 10.8). If an initiator driver crashes:
  1. Connection state is checkpointed to the state preservation buffer
  2. Driver reloads in ~50-150ms
  3. RDMA transports (iSER, NVMe-oF/RDMA): When a driver crashes, the local RNIC's Queue Pair enters Error state, and the remote side's QP also transitions to Error state from retransmission timeouts. QP state cannot be transparently restored from a checkpoint — the QP must be destroyed and re-created (Reset → Init → RTR → RTS). UmkaOS performs a fast QP re-creation: checkpointed session parameters (remote QPN, GID, LID, PSN, MTU, RDMA capabilities) allow the new QP to be configured without full connection manager negotiation. The remote side detects the QP failure (via async error event or failed RDMA operation) and cooperates in re-establishing the QP pair. Total recovery: ~50-150ms (fast re-creation, not transparent restore), vs. 10-30 seconds for full re-discovery in Linux.
  4. TCP transports (iSCSI/TCP, NVMe-oF/TCP): Full TCP connection state cannot be reliably restored after a crash (the remote peer's TCP state has advanced: retransmissions, window adjustments, etc.). Instead, UmkaOS performs a fast reconnect: the checkpointed session parameters (target portal, ISID, TSIH for iSCSI; NQN, controller ID for NVMe-oF) allow session re-establishment without full discovery. The target accepts the reconnect as a session continuation (iSCSI RFC 7143 Section 6.3.5 session reinstatement; NVMe-oF controller reconnect). I/O commands in flight are retried by the block layer. Total recovery: ~200-500ms (vs. 10-30 seconds for full re-discovery in Linux).

In Linux, an initiator crash requires full session re-establishment: TCP/RDMA reconnection, login/connect, LUN/namespace re-discovery, and filesystem remount. This can take 10-30 seconds and may cause I/O errors visible to applications.

Multipath — Two multipath models coexist:
  • iSCSI: dm-multipath integration with the recovery-aware volume layer (Section 14.3). Multiple iSCSI paths (via different network interfaces or through different target portals) provide redundancy.
  • NVMe-oF: native NVMe ANA multipath (managed by the NVMe driver, not dm-multipath). ANA state changes are handled in-driver with recovery awareness.

Both models coordinate with the volume state machine — if a path fails due to driver crash (not network failure), the volume layer waits for driver recovery rather than immediately failing over.

14.4.1 NVMe-oF Reconnect Policy

The external NVMe-oF protocol is Linux-compatible (same wire format, same controller reconnect semantics). The reconnect strategy — when and how to retry — is UmkaOS's internal design space. Without backoff and jitter, all hosts in a cluster that lose fabric connectivity simultaneously will attempt to reconnect simultaneously, overloading the target's accept queue and prolonging the outage. UmkaOS uses exponential backoff with jitter to spread reconnect attempts across the cluster.

Algorithm: exponential backoff with jitter

When a fabric connection drops (TCP disconnect, QP error event) or an initial connect attempt fails:

attempt = 0
base_delay_ms = 100
max_delay_ms  = 30_000   // 30 seconds
jitter_frac   = 0.25     // ±25%

loop:
    delay = min(base_delay_ms * 2^attempt, max_delay_ms)
    jitter = random_uniform(-delay * jitter_frac, +delay * jitter_frac)
    sleep(delay + jitter)
    attempt = attempt + 1
    try connect()
    if connected: reset attempt = 0, break

Delays without jitter (for reference): 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 25.6s, 30s, 30s, ...

With jitter, the actual delay is uniformly random in [0.75×delay, 1.25×delay], so reconnect attempts spread across the jitter window rather than clustering at the same instant — the protection that matters for synchronized reconnects in large clusters. Note that the ±25% scheme above is a bounded-jitter variant; "full jitter" in the AWS sense draws the delay uniformly from [0, delay] and spreads attempts even more aggressively, at the cost of occasionally very short delays. Reference: AWS Architecture Blog "Exponential Backoff And Jitter" (2015).
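The delay computation in the pseudocode above translates directly to Rust. This sketch takes the random sample as a parameter (a uniform value in [-1.0, 1.0]) so the function stays deterministic for testing; the constants mirror the defaults above, and the function name is illustrative.

```rust
const BASE_DELAY_MS: u64 = 100;
const MAX_DELAY_MS: u64 = 30_000; // 30 second ceiling
const JITTER_FRAC: f64 = 0.25;    // ±25%

/// Compute the reconnect delay for the given attempt number.
/// `unit` is a uniform random sample in [-1.0, 1.0] supplied by the caller.
pub fn reconnect_delay_ms(attempt: u32, unit: f64) -> u64 {
    // Exponential growth: base * 2^attempt, saturating instead of overflowing.
    let pow2 = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let exp = BASE_DELAY_MS.saturating_mul(pow2);
    // Cap at the ceiling, then apply ±25% jitter around the capped delay.
    let delay = exp.min(MAX_DELAY_MS) as f64;
    (delay + delay * JITTER_FRAC * unit).round() as u64
}
```

With unit = 0 this reproduces the reference sequence (100ms, 200ms, 400ms, ..., 30s); with unit drawn uniformly at random, each host lands somewhere in its own jitter window.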

ANA path failover — If a path transitions to ANAState::Inaccessible, UmkaOS immediately tries the next available ANA-optimized path before entering the reconnect loop for the failed path. The reconnect loop is only entered after all optimized paths for a namespace are exhausted. This preserves I/O availability during single-path failures without incurring any reconnect delay.
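The failover order described above — exhaust optimized paths, then non-optimized, and only then enter the reconnect loop — can be sketched as a small selection function. AnaState, Path, and select_path are illustrative names, not the UmkaOS NVMe driver API.

```rust
/// ANA state of one path to a namespace (illustrative subset).
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum AnaState {
    Optimized,
    NonOptimized,
    Inaccessible,
}

pub struct Path {
    pub id: u32,
    pub ana: AnaState,
}

/// Pick the best available path: prefer ANA-optimized, fall back to
/// non-optimized. Returning None means every path is inaccessible and the
/// caller enters the reconnect loop for the failed paths.
pub fn select_path(paths: &[Path]) -> Option<u32> {
    paths
        .iter()
        .find(|p| p.ana == AnaState::Optimized)
        .or_else(|| paths.iter().find(|p| p.ana == AnaState::NonOptimized))
        .map(|p| p.id)
}
```

Because selection never blocks on a reconnect while any usable path remains, single-path failures cost only the in-driver path switch, not a backoff delay.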

Fast-path reconnect (NVMe-oF/TCP only) — If the TCP connection drops but the NVMe-oF controller was previously established (implying a fabric-layer issue rather than a target reset or controller crash), the first reconnect attempt uses a fixed 10ms delay instead of the normal 100ms base delay. The rationale: the target controller is likely still healthy and ready to accept the reconnect immediately; the full backoff sequence is reserved for cases where the target itself is unavailable.

Maximum reconnect attempts — After 20 consecutive failed attempts (approximately 10 minutes at the 30s ceiling), the controller is marked NvmeControllerState::Offline and I/O to namespaces served only by this controller fails with EIO. The controller remains registered; operators can re-trigger connection attempts via sysfs or the umkafs control interface at /System/Kernel/nvmeof/<nqn>/reconnect.


14.5 Clustered Filesystems

Shared-disk filesystems where multiple nodes access the same block device simultaneously, coordinated by a distributed lock manager (DLM).

Linux problem — GFS2 and OCFS2 require a complex multi-daemon stack:
  • Corosync: cluster membership and messaging
  • Pacemaker: resource manager and fencing coordinator
  • DLM: distributed lock manager (kernel module + userspace daemon)
  • Fencing agent: STONITH (Shoot The Other Node In The Head) — kills unresponsive nodes to prevent split-brain corruption

These components are developed by different teams, have different configuration languages, and interact in subtle ways. Diagnosing failures requires understanding all four components and their interactions. A single daemon crash can fence the entire node.

UmkaOS design — The cluster infrastructure from Section 5.1 provides the foundation. UmkaOS integrates these components into a coherent architecture:

DLM over RDMA — The DLM (Section 14.6) uses Section 5.1.4's RDMA transport for lock operations. Lock grant/release round-trip is ~3-5μs over RDMA (vs ~30-50μs over TCP in Linux's DLM). This directly impacts filesystem performance — every metadata operation (create, rename, delete, stat) requires at least one DLM lock. At 3-5μs per lock, clustered filesystem metadata operations approach local filesystem performance. See Section 14.6 for the full DLM design, including RDMA-native lock protocols, lease-based extension, batch operations, and recovery.

Fencing — When a node becomes unresponsive, the cluster must fence it (prevent it from accessing shared storage) before allowing other nodes to recover its locks:
  • IPMI/BMC fencing: power-cycle the node via out-of-band management
  • SCSI-3 Persistent Reservations: revoke the node's reservation on the shared storage device — the storage controller itself blocks I/O from the fenced node
  • Same mechanisms as Linux, but integrated into Section 5.1.12's cluster membership protocol rather than requiring a separate Pacemaker/STONITH stack

Quorum — Inherits from Section 5.1.12's split-brain handling. A partition with fewer than quorum nodes self-fences (stops accessing shared storage) to prevent data corruption.

GFS2 compatibility — Reads the GFS2 on-disk format, implemented as an umka-vfs module:
  • Resource groups, dinodes, journaled metadata
  • GFS2 DLM lock types mapped to DLM lock modes (Section 14.6.2)
  • Journal recovery for failed nodes
  • Existing GFS2 volumes can be mounted by UmkaOS without reformatting

OCFS2 compatibility — Similar approach: reads the OCFS2 on-disk format, implemented as an umka-vfs module. Lower priority than GFS2.

Recovery advantage — This is where UmkaOS's architecture fundamentally changes clustered filesystem behavior:
  • Linux: if a node's storage driver crashes, the DLM loses heartbeat from that node. Fencing kicks in — the node is killed (power-cycled or SCSI-3 PR revoked). After reboot (~60s), the node must rejoin the cluster, replay its journal, and re-acquire locks. Other nodes are blocked on any locks held by the crashed node until fencing and recovery complete.
  • UmkaOS: if a node's storage driver crashes, the driver recovers in ~50-150ms (Tier 1 reload). The DLM heartbeat continues throughout (heartbeat is in umka-core, not the storage driver). The node stays in the cluster. Its locks remain valid. No fencing, no journal replay, no lock recovery. Other nodes never notice.

This transforms clustered filesystem reliability from "minutes of disruption per failure" to "50ms blip per failure." See Section 14.6.12 for detailed recovery comparison.


14.6 Distributed Lock Manager

The Distributed Lock Manager (DLM) is a first-class kernel subsystem in umka-core that provides cluster-wide lock coordination for shared-disk filesystems (Section 14.5), distributed applications, and any kernel subsystem requiring cross-node synchronization. It implements the VMS/DLM lock model — the same model used by Linux's DLM, GFS2, OCFS2, and VMS clustering.

The DLM lives in umka-core (not a separate daemon or Tier 1 driver). This is a deliberate architectural choice: lock state survives Tier 1 driver restarts, DLM heartbeat continues during storage driver reloads, and there are zero kernel/userspace boundary crossings for lock operations.

14.6.1 Design Overview and Linux Problem Statement

Linux's DLM implementation suffers from seven systemic problems that limit clustered filesystem performance. Each problem stems from architectural decisions made when the Linux DLM was designed for 1 Gbps Ethernet and 4-node clusters in the early 2000s. UmkaOS's DLM addresses each problem by design:

# Linux Problem Impact UmkaOS Fix
1 Global recovery quiesce — DLM stops ALL lock activity cluster-wide during any node failure recovery Seconds of cluster-wide stall; all nodes blocked, not just those sharing resources with the dead node Per-resource recovery: only resources mastered on the dead node are affected; all other lock operations continue uninterrupted (Section 14.6.11)
2 TCP lock transport (~30-50 μs per lock operation) Orders of magnitude slower than hardware allows; metadata-heavy workloads bottleneck on lock latency RDMA-native: Atomic CAS for uncontested locks (~3-5 μs including confirmation, zero remote CPU on CAS path), RDMA Send for contested locks (~5-8 μs) (Section 14.6.5)
3 No lock batching — each lock request is a separate network round-trip rename() requires 3 locks = 3 round-trips = ~90-150 μs on Linux DLM Batch API: up to 64 locks grouped by master in a single RDMA Write (~5-10 μs total) (Section 14.6.5)
4 BAST (Blocking AST) callback storms — O(N) invalidation messages for N holders of a contended resource, including uncontended downgrades Metadata-heavy workloads on large clusters see network saturation from invalidation traffic Lease-based extension: holders extend cheaply via RDMA Write; minimal traffic for uncontended resources — only periodic one-sided RDMA lease renewals that bypass the remote CPU (zero CPU-consuming traffic, vs. Linux BASTs on every downgrade that require CPU processing); contended worst case is still O(K) for K active holders but K ≤ N because expired leases are reclaimed without messaging (Section 14.6.6)
5 Separate daemon architecture — corosync + pacemaker + dlm_controld with kernel/userspace boundary crossings Every membership change requires multiple kernel↔userspace transitions; diagnosis requires understanding 4 separate components Integrated in-kernel: membership events from Section 5.1.12 delivered directly to DLM; single heartbeat source; no userspace daemons (Section 14.6.10)
6 Lock holder must flush ALL dirty pages on lock downgrade Dropping an EX lock on a 100 GB file flushes all dirty pages, even if only 4 KB was written Targeted writeback: DLM tracks dirty page ranges per lock; only modified pages within the lock's range are flushed (Section 14.6.8)
7 No speculative multi-resource lock acquire GFS2 rgrp allocation: each attempt to lock a resource group is a full round-trip; 8 attempts = 8 × 30-50 μs lock_any_of(N) primitive: single message tries N resources, first available is granted (Section 14.6.7)

14.6.2 Lock Modes and Compatibility Matrix

The DLM implements the six standard VMS/DLM lock modes. GFS2 uses all six modes — this is not a simplification, it is the minimum required for correct clustered filesystem operation.

/// DLM lock modes, ordered by exclusivity (lowest to highest).
/// Compatible with Linux DLM, GFS2, and OCFS2 expectations.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum LockMode {
    /// Null Lock — placeholder, compatible with everything.
    /// Used to hold a position in the lock queue without blocking others.
    NL = 0,

    /// Concurrent Read — read access, compatible with all except EX.
    /// Used by GFS2 for inode lookup (reading inode from disk).
    CR = 1,

    /// Concurrent Write — write access, compatible with NL, CR, CW.
    /// Used by GFS2 for writing to a file while others read metadata.
    CW = 2,

    /// Protected Read — read-only, blocks writers.
    /// Used by GFS2 for operations requiring consistent metadata snapshot.
    PR = 3,

    /// Protected Write — write, compatible with NL and CR only.
    /// Used by GFS2 for metadata modification (create, rename, unlink).
    PW = 4,

    /// Exclusive — sole access, incompatible with everything except NL.
    /// Used by GFS2 for operations requiring exclusive inode access.
    EX = 5,
}

Compatibility matrix ("yes" means the two modes can be held concurrently by different nodes):

|    | NL  | CR  | CW  | PR  | PW  | EX  |
|----|-----|-----|-----|-----|-----|-----|
| NL | yes | yes | yes | yes | yes | yes |
| CR | yes | yes | yes | yes | yes | no  |
| CW | yes | yes | yes | no  | no  | no  |
| PR | yes | yes | no  | yes | no  | no  |
| PW | yes | yes | no  | no  | no  | no  |
| EX | yes | no  | no  | no  | no  | no  |

This matrix follows the standard VMS/DLM compatibility semantics (OpenVMS Programming Concepts Manual, Red Hat DLM Programming Guide Table 2-2; Linux kernel fs/dlm/lock.c __dlm_compat table). Key points: PW is compatible with NL and CR only (PW is the "update lock" — allows one writer with concurrent readers); CW is compatible with NL, CR, and CW (CW allows concurrent writers); PW and CW are mutually incompatible (PW forbids other writers, including CW holders). The matrix is stored as a compile-time constant lookup table for zero-cost compatibility checks on the lock grant path.
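The compile-time lookup table mentioned above might look like the following sketch. The names `COMPAT` and `compatible` are illustrative, not the actual UmkaOS symbols; row/column order follows the enum (NL, CR, CW, PR, PW, EX), and each check compiles to a single array load:

```rust
/// Illustrative compile-time compatibility table, indexed as
/// COMPAT[held][requested]. Order: NL, CR, CW, PR, PW, EX.
const COMPAT: [[bool; 6]; 6] = [
    //            NL     CR     CW     PR     PW     EX
    /* NL */ [ true,  true,  true,  true,  true,  true ],
    /* CR */ [ true,  true,  true,  true,  true,  false],
    /* CW */ [ true,  true,  true,  false, false, false],
    /* PR */ [ true,  true,  false, true,  false, false],
    /* PW */ [ true,  true,  false, false, false, false],
    /* EX */ [ true,  false, false, false, false, false],
];

/// Zero-cost compatibility check: one bounds-checked array load.
pub fn compatible(held: u8, requested: u8) -> bool {
    COMPAT[held as usize][requested as usize]
}

fn main() {
    // PW ("update lock") allows concurrent CR readers but no second writer.
    assert!(compatible(4, 1));  // PW vs CR
    assert!(!compatible(4, 4)); // PW vs PW
}
```

Because the matrix is symmetric, `compatible(a, b) == compatible(b, a)` holds for every mode pair, which a unit test can verify exhaustively.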

14.6.3 Lock Value Blocks (LVBs)

Each lock resource carries a 64-byte Lock Value Block — a small metadata payload piggybacked on lock state. LVBs are the critical optimization that makes clustered filesystem metadata operations efficient.

/// Lock Value Block — 64 bytes of metadata attached to a lock resource.
/// Updated by the last EX/PW holder on downgrade or unlock.
/// Read by PR/CR holders on lock grant.
///
/// MUST be cache-line aligned (`align(64)`). On all target RDMA hardware
/// (ConnectX-5+, EFA, RoCEv2 NICs), a cache-line-aligned 64-byte RDMA Read
/// is performed as a single PCIe transaction, providing de facto atomicity.
/// The alignment is a correctness requirement for the double-read protocol;
/// see the "LVB read consistency" section below.
#[repr(C, align(64))]
pub struct LockValueBlock {
    /// Application-defined data (e.g., inode size, mtime, block count).
    pub data: [u8; 56],

    /// Sequence counter — incremented on every LVB update.
    /// Readers use this to detect stale LVBs after recovery.
    ///
    /// Stored as u64 for alignment and RDMA atomic operation compatibility
    /// (RDMA atomics require 8-byte aligned 8-byte values).
    ///
    /// **Odd/even protocol**: Writers use FAA to increment the counter before
    /// and after writing data. An odd value indicates mid-update (reader should
    /// retry); an even value indicates stable data. The counter is initialized
    /// to 0 (even) on LVB creation.
    ///
    /// **Masking requirement**: Readers MUST mask with `LVB_SEQUENCE_MASK`
    /// (0x0000_FFFF_FFFF_FFFF) before checking parity or comparing values.
    /// The high 16 bits are used for special sentinel values (e.g., INVALID)
    /// and should not be interpreted as part of the sequence counter.
    ///
    /// The 48-bit counter wraps after ~8.9 years at 1M increments/sec
    /// (2^48 / 10^6 ≈ 281 million seconds). See the wrap limitation section
    /// below for handling guidance.
    pub sequence: u64,
}

/// Mask to extract the 48-bit sequence counter from the u64 field.
/// MUST be applied before checking odd/even parity or comparing sequence values.
pub const LVB_SEQUENCE_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;

/// Sentinel value indicating an invalid LVB (after recovery from dead holder).
/// Uses high bits outside the 48-bit sequence space to avoid collision.
/// Readers observing this value must treat the LVB as invalid and refresh
/// from disk before use.
pub const LVB_SEQUENCE_INVALID: u64 = 0xFFFF_0000_0000_0000;
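As a minimal illustration of the masking rules, the following hypothetical helpers (`seq_counter`, `is_mid_update`, `is_invalid` are illustrative names, not UmkaOS API) show how a reader separates the 48-bit counter from the high sentinel bits before checking parity:

```rust
pub const LVB_SEQUENCE_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;
pub const LVB_SEQUENCE_INVALID: u64 = 0xFFFF_0000_0000_0000;

/// Extract the 48-bit counter (mask first, per the masking requirement).
pub fn seq_counter(raw: u64) -> u64 {
    raw & LVB_SEQUENCE_MASK
}

/// Odd counter value = writer mid-update; reader must retry.
pub fn is_mid_update(raw: u64) -> bool {
    seq_counter(raw) & 1 == 1
}

/// Sentinel check looks only at the high 16 bits, never the counter.
pub fn is_invalid(raw: u64) -> bool {
    raw & !LVB_SEQUENCE_MASK == LVB_SEQUENCE_INVALID
}

fn main() {
    assert!(!is_mid_update(2));                  // even = stable
    assert!(is_mid_update(3));                   // odd = writer active
    assert!(is_invalid(LVB_SEQUENCE_INVALID));   // recovery sentinel
    assert!(!is_invalid(7));                     // ordinary counter value
}
```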

Why LVBs matter: Consider the common case of reading a file's size on a clustered filesystem:

Without LVB:
  Node A holds inode EX lock → writes file → updates size on disk → releases EX
  Node B acquires inode PR lock → reads inode FROM DISK → gets current size
  Cost: 1 lock operation (~3-5 μs) + 1 disk read (~10-15 μs NVMe) = ~13-20 μs

With LVB:
  Node A holds inode EX lock → writes file → writes size to LVB → releases EX
  Node B acquires inode PR lock → reads size FROM LVB (in lock grant message)
  Cost: 1 lock operation (~4-6 μs, LVB included) + 0 disk reads = ~4-6 μs

LVBs eliminate one disk read per metadata operation in the common case. GFS2 uses LVBs to cache inode attributes (i_size, i_mtime, i_blocks, i_nlink) and resource group statistics (free blocks, free dinodes). The VFS layer reads these attributes from the LVB via Section 13.3's per-field inode validity mechanism.

Note: UmkaOS uses 64-byte LVBs (56 data + 8 sequence counter), vs Linux's 32 bytes, to accommodate extended metadata including the sequence counter and capability token. GFS2 on-disk format compatibility requires translating between 32-byte and 64-byte LVB formats at the filesystem layer: UmkaOS's GFS2 implementation packs the standard 32-byte GFS2 LVB fields into the first 32 bytes of the 56-byte data portion, using the remaining 24 bytes for UmkaOS-specific metadata (sequence validation, capability references). When importing a GFS2 volume from Linux, the filesystem driver zero-extends Linux's 32-byte LVBs into the 64-byte format on first lock acquire.
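The zero-extension on import can be sketched as follows. `import_linux_lvb` is a hypothetical name, and the 32-byte GFS2 field layout is opaque to this step, which only places the bytes and zeroes the UmkaOS-specific tail:

```rust
/// Sketch: zero-extend a Linux 32-byte GFS2 LVB into the 64-byte format
/// on first lock acquire. Returns (data portion, initial sequence value).
fn import_linux_lvb(linux_lvb: &[u8; 32]) -> ([u8; 56], u64) {
    let mut data = [0u8; 56];
    data[..32].copy_from_slice(linux_lvb); // standard GFS2 fields, verbatim
    // data[32..56] stays zero: UmkaOS-specific metadata, unset on import.
    (data, 0) // sequence starts at 0 (even = stable)
}

fn main() {
    let (data, seq) = import_linux_lvb(&[0xAB; 32]);
    assert_eq!(&data[..32], &[0xAB; 32]);
    assert!(data[32..].iter().all(|&b| b == 0));
    assert_eq!(seq, 0);
}
```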

LVB read consistency: RDMA does not provide atomic reads for 64-byte payloads (RDMA atomics are limited to 8 bytes). When a node reads an LVB via RDMA Read, a concurrent writer could update the LVB mid-read, producing a torn value. The protocol:

1. Reader performs an RDMA Read of the full 64-byte LVB.
2. Reader checks the sequence counter. If it is odd, the writer is mid-update (writers set the sequence to an odd value before writing data, then increment to even after). Retry the read.
3. Reader performs a second RDMA Read of the full 64-byte LVB. If every byte (data + sequence) matches the first read, the data is consistent. If any byte differs, retry from step 1.

The full-payload comparison (not just the sequence field) catches the case where a writer completes two full updates between the reader's two reads: the 48-bit sequence counter (bits 47:0 of the sequence field) is monotonically increasing (it wraps after ~8.9 years at 500K writes/sec — two FAAs per write equals 1M increments/sec — far exceeding practical deployment lifetimes; the correctness argument holds for any deployment shorter than this), so it will differ after any update. The full-payload comparison is a defence-in-depth measure that also detects torn reads where the sequence counter itself was partially updated.

LVB sequence counter wrap limitation: The 48-bit sequence counter (bits 47:0 of the sequence field, masked by LVB_SEQUENCE_MASK) wraps after 2^48 increments. At the maximum sustained write rate (500,000 writes/sec = 1,000,000 FAA operations/sec), wrap occurs in approximately 281 million seconds (~8.9 years). During the wrap transition, a reader could observe sequence=2^48-1 on the first read and sequence=0 on the second read, incorrectly concluding that no write occurred between reads (an ABA problem on the sequence field). This is an acceptable limitation because: (1) the wrap interval far exceeds typical cluster deployment lifetimes; (2) the full-payload comparison (data + sequence) still detects torn reads even during wrap, since the writer's data changes between FAA operations; (3) production deployments monitor LVB write rate and proactively replace LVB structures approaching the wrap threshold. Clusters with write-intensive workloads exceeding ~50,000 writes/sec on critical LVBs may configure periodic LVB rotation (allocating a fresh LVB with a new generation counter) to avoid theoretical wrap scenarios in long-running deployments.

Outside the wrap transition, the sequence counter's role is torn-read detection: the reader retries if the sequence changed during the read. This is a consistency mechanism, not an ABA prevention mechanism — ABA does not otherwise apply because the reader never performs compare-and-swap on the LVB data. The writer protocol uses RDMA Fetch-and-Add (FAA) for both transitions: FAA(sequence, 1) (now odd = writing) → update data → FAA(sequence, 1) (now even = stable). FAA is a standard RDMA atomic operation, ensuring visibility to concurrent one-sided readers.
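The writer and reader halves of the protocol can be sketched with process-local atomics standing in for the RDMA FAA, Write, and Read operations. The `Lvb` type and the single-u64 payload are simplifications for illustration, not the actual 56-byte structure:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Local stand-in for the remote LVB: RDMA FAA/Write/Read are modeled
/// with process-local atomic operations.
struct Lvb {
    seq: AtomicU64,
    data: AtomicU64, // stands in for the 56-byte payload
}

impl Lvb {
    /// Writer protocol: FAA(seq) -> odd, write data, FAA(seq) -> even.
    fn write(&self, value: u64) {
        self.seq.fetch_add(1, Ordering::SeqCst); // now odd: mid-update
        self.data.store(value, Ordering::SeqCst);
        self.seq.fetch_add(1, Ordering::SeqCst); // now even: stable
    }

    /// Double-read protocol: retry until an even, unchanged (masked)
    /// sequence brackets the payload read.
    fn read(&self) -> u64 {
        loop {
            let s1 = self.seq.load(Ordering::SeqCst) & 0x0000_FFFF_FFFF_FFFF;
            if s1 & 1 == 1 {
                continue; // odd: writer mid-update, retry
            }
            let d = self.data.load(Ordering::SeqCst);
            let s2 = self.seq.load(Ordering::SeqCst) & 0x0000_FFFF_FFFF_FFFF;
            if s1 == s2 {
                return d; // sequence unchanged: consistent snapshot
            }
        }
    }
}

fn main() {
    let lvb = Lvb { seq: AtomicU64::new(0), data: AtomicU64::new(0) };
    lvb.write(42);
    assert_eq!(lvb.read(), 42);
    assert_eq!(lvb.seq.load(Ordering::SeqCst), 2); // two FAAs per write
}
```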

LVB single-writer guarantee: The double-read protocol's correctness depends on there being at most one concurrent LVB writer for a given resource. This invariant is provided by the DLM lock itself: only a node holding an EX (Exclusive) or PW (Protected Write) lock on a resource may write to that resource's LVB (per the DLM compatibility matrix in Section 14.6.2). Because the DLM guarantees that at most one node holds EX or PW on a resource at any time, the single-writer invariant is guaranteed by the lock mode rules — no additional coordination is needed. During master failover, LVB writes are suspended until the new master is established and the lock state has been recovered, preventing interleaved writes from two nodes each believing they hold the lock.

RDMA ordering correctness argument: The writer updates the LVB via three RDMA operations posted to a single Reliable Connection (RC) Queue Pair: (1) FAA on sequence, (2) RDMA Write to data bytes, (3) FAA on sequence. Per the InfiniBand Architecture Specification (Vol 1, Section 10.5), operations within a single RC QP are processed at the responder (target NIC) in posting order. Therefore, when FAA #3 completes, the data Write #2 has already completed at the responder's memory. A reader on a DIFFERENT QP (QP_B) may see operations from QP_A interleaved with its own reads — this is the "no inter-QP ordering" property of RDMA. However, the double-read protocol handles this correctly: if QP_A's operations interleave with QP_B's first Read, the torn value will differ from QP_B's second Read (because the writer changed data and/or sequence between reads), causing a retry. The only remaining concern is whether QP_A's three operations can interleave with BOTH of QP_B's reads to produce identical torn values — this is impossible because the FAA operations on the sequence counter are 8-byte RDMA atomics (always observed atomically, no partial reads), and the sequence counter is monotonically increasing. If the reader's two RDMA Reads see the same sequence value (even), the writer either completed all three operations before both reads (data is consistent) or has not started (data is unchanged). If the sequence values differ between the two reads, the reader retries. The double-read protocol is therefore correct under RDMA's relaxed inter-QP ordering model without requiring explicit fencing between QPs.

RDMA Read atomicity and the SIGMOD 2023 analysis: The InfiniBand Architecture Specification does not formally guarantee that an RDMA Read larger than 8 bytes is delivered atomically. Ziegler et al. (SIGMOD 2023) investigated this question and found that in practice, cache-line-aligned 64-byte RDMA Reads are delivered atomically on all tested hardware — their experiments observed no torn reads for objects that fit within a single cache line. This empirical finding supports our cache-line-aligned LVB design. Nevertheless, the IB spec provides no formal guarantee, and future NICs or memory subsystems could behave differently. The double-read protocol provides defence-in-depth across three complementary layers:

  1. Cache-line alignment (de facto atomicity): The #[repr(C, align(64))] requirement ensures the 64-byte LVB is always cache-line aligned. On all shipping RDMA NICs (ConnectX-5+, AWS EFA, RoCEv2 adapters), the responder NIC reads from the last-level cache or memory controller, which operates at cache-line granularity. A cache-line-aligned 64-byte read therefore arrives from the responder as a single coherent unit — a single PCIe TLP — providing de facto atomicity even without formal IB spec guarantees. This is the primary defence.

Hardware qualification note: 64-byte RDMA read atomicity is a de-facto property of specific NICs, not guaranteed by the InfiniBand specification. It is confirmed on: Mellanox/NVIDIA ConnectX-5, ConnectX-6, ConnectX-7 (single cache-line reads are atomic in the NIC's memory subsystem), and AWS EFA (Elastic Fabric Adapter) NICs. It is NOT guaranteed on iWARP NICs (Chelsio T6, Intel X722) or InfiniBand HCAs without this property. UmkaOS's LVB implementation checks for the RDMA_ATOMIC_64B capability flag at device initialization and falls back to the double-read protocol (read → check sequence → read again if sequence changed) when the flag is absent. The double-read protocol is correct regardless of hardware atomicity; the single-read optimization is enabled only when the flag is present.

  2. Probabilistic defence via double-read: Even if a torn read occurs on a specific platform (e.g., under unusual NUMA topology or memory subsystem conditions), the double-read comparison provides a strong probabilistic defence. For both reads to produce identical torn values, the writer's in-progress modifications must create the EXACT same byte pattern in both torn snapshots — including the monotonically increasing sequence counter. Because the sequence counter changes by exactly 2 per complete write (odd during update, even after), reconstructing the same even sequence value twice from independent torn reads of two different write phases would require an astronomically unlikely alignment of byte delivery from two distinct PCIe transactions. In practice this is negligible.

  3. Two-sided fallback (absolute correctness): After 8 retries the reader falls back to a two-sided RDMA Send to the resource master, which reads the LVB under its local lock and returns a consistent snapshot. This path is unconditionally correct regardless of RDMA read atomicity guarantees or NIC implementation details.

Together these three layers ensure correctness: the first eliminates torn reads on all known hardware, the second provides defence-in-depth on any hypothetical future hardware, and the third guarantees forward progress regardless of RDMA semantics.

Livelock prevention: A continuously-updated LVB could cause a reader to retry indefinitely (the writer keeps changing the sequence counter between the reader's two RDMA Reads). To prevent this, the reader enforces a maximum of 8 retries with exponential backoff (1 μs, 2 μs, 4 μs, ..., 128 μs). If all retries are exhausted, the reader falls back to a two-sided RDMA Send to the resource master, requesting a consistent LVB snapshot. The master reads the LVB under its local lock (preventing concurrent writer updates during the read) and returns the consistent value. This fallback adds ~5-8 μs but guarantees forward progress. In practice, a single retry suffices in over 99% of cases — the 8-retry limit is a safety bound for pathological writer contention.
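The retry budget works out as follows (a sketch; `backoff_us` is a hypothetical helper): with 8 attempts backed off at 1, 2, 4, ..., 128 μs, a reader spends at most 255 μs in backoff before taking the two-sided fallback:

```rust
/// Illustrative backoff schedule for the 8-retry bound: 1 << attempt μs,
/// capped at 128 μs (attempt 7).
fn backoff_us(attempt: u32) -> u64 {
    1u64 << attempt.min(7)
}

fn main() {
    // Worst case before the two-sided fallback: 1+2+4+...+128 = 255 μs.
    let total: u64 = (0..8).map(backoff_us).sum();
    assert_eq!(total, 255);
}
```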

Typical case: 2 RDMA Reads (64 bytes each) = ~3-4 μs total.

After lock master recovery (Section 14.6.11), LVBs from dead holders are marked INVALID (sequence counter set to LVB_SEQUENCE_INVALID). The next EX or PW holder must refresh the LVB from disk before other nodes can trust it (both EX and PW are write modes that can update the LVB, per the compatibility matrix above).

14.6.4 Lock Resource Naming and Master Assignment

Lock resources are identified by hierarchical names that encode the filesystem, resource type, and specific object:

Format: <filesystem>:<uuid>:<type>:<id>[:<subresource>]

Examples:
  gfs2:550e8400-e29b:inode:12345:data      — data lock for inode 12345
  gfs2:550e8400-e29b:inode:12345:meta      — metadata lock for inode 12345
  gfs2:550e8400-e29b:rgrp:42               — resource group 42 allocation lock
  gfs2:550e8400-e29b:journal:3             — journal 3 ownership lock
  gfs2:550e8400-e29b:dir:789:bucket:5      — directory 789 hash bucket 5
  app:mydb:table:users:row:1001            — application-level row lock

Master assignment: Each lock resource is assigned a master node responsible for maintaining the granted/converting/waiting queues. The master is determined by consistent hashing using a virtual-node ring (note: this is deliberately different from DSM home-node assignment in Section 5.1.6.3, which uses modular hashing — hash % cluster_size — for simpler O(1) lookups; DLM uses consistent hashing because lock resources are more numerous and benefit from minimal redistribution on node changes):

// Each physical node has V virtual nodes on the ring (default V=64).
// The ring is a sorted array of (hash, physical_node_id) pairs.
ring = [(hash(node_0, vnode_0), 0), (hash(node_0, vnode_1), 0), ...,
        (hash(node_N, vnode_V), N)]

master(resource_name) = ring.successor(hash(resource_name)).physical_node_id

When a node joins or leaves the cluster, only ~1/N of total resources are remapped (the resources whose ring position falls between the departed node's virtual nodes and their successors). This is the key property of consistent hashing — unlike modular hashing (hash % cluster_size), which remaps nearly all resources on membership change.

Design choice — consistent hashing vs. directory-based master assignment: Linux's DLM uses modular hashing for lock resource mastering. UmkaOS uses consistent hashing with virtual nodes because: (1) it is fully distributed with no single point of failure — any node can compute any resource's master locally from the ring (O(log V×N) binary search); (2) membership changes remap only ~1/N of resources instead of ~all. Note that the DLM's consistent hashing is deliberately different from DSM's modular hashing (Section 5.1.6.3, hash % cluster_size): DSM uses modular hashing for simpler O(1) lookups with full rehash on membership change, while the DLM uses consistent hashing for minimal redistribution on node changes. These are separate protocols with different tradeoffs, not a shared scheme. The tradeoff is that consistent hashing cannot optimize for locality (a node that uses a resource heavily is not preferentially assigned as its master). For workloads where locality matters (e.g., a single node accessing a file exclusively), the DLM's lease mechanism (Section 14.6.6) compensates: the holder simply extends its lease without contacting the master, so master location is irrelevant on the fast path.
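A minimal sketch of the virtual-node ring in Rust. The `Ring` type, the `master` method, and the use of the standard library's `DefaultHasher` are illustrative assumptions, not the kernel's actual hash or data layout:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash helper (illustrative; the kernel would use its own hash).
fn h<T: Hash>(x: &T) -> u64 {
    let mut s = DefaultHasher::new();
    x.hash(&mut s);
    s.finish()
}

/// Sorted array of (hash, physical_node_id) pairs, V vnodes per node.
struct Ring {
    points: Vec<(u64, u32)>,
}

impl Ring {
    fn new(nodes: &[u32], vnodes: u32) -> Self {
        let mut points: Vec<(u64, u32)> = nodes
            .iter()
            .flat_map(|&n| (0..vnodes).map(move |v| (h(&(n, v)), n)))
            .collect();
        points.sort_unstable();
        Ring { points }
    }

    /// master(resource) = first ring point clockwise from hash(resource),
    /// wrapping around to the start of the sorted array.
    fn master(&self, resource: &str) -> u32 {
        let hv = h(&resource);
        let idx = self.points.partition_point(|&(p, _)| p < hv);
        self.points[idx % self.points.len()].1
    }
}

fn main() {
    let full = Ring::new(&[0, 1, 2], 64);
    let reduced = Ring::new(&[0, 2], 64); // node 1 leaves
    let moved = (0..1000)
        .filter(|i| {
            let r = format!("gfs2:uuid:inode:{i}");
            full.master(&r) != reduced.master(&r)
        })
        .count();
    // Only resources mastered by the departed node remap (~1/3 here);
    // resources mastered by surviving nodes keep their master.
    assert!(moved > 0 && moved < 600);
}
```

The remap experiment in `main` demonstrates the key property: removing one of three nodes moves only the resources that the departed node mastered, unlike modular hashing, which would remap nearly everything.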

/// Intrusive doubly-linked list node. Embedded in structs that need to
/// be linked without heap allocation.
///
/// # Safety invariant
/// A node must be removed from all lists before its containing struct
/// is freed. Leaving a dangling node pointer causes use-after-free.
pub struct IntrusiveListNode {
    pub prev: *mut IntrusiveListNode,
    pub next: *mut IntrusiveListNode,
}

/// Head sentinel for an intrusive list. The `prev`/`next` pointers
/// form a circular doubly-linked list with the head acting as a
/// sentinel. An empty list has `head.prev == head.next == &head`.
pub struct IntrusiveListHead {
    pub sentinel: IntrusiveListNode,
    pub len: usize,
}

/// Typed intrusive list. `T` must embed an `IntrusiveListNode` accessible
/// via the `node_offset` (computed by `field_offset!` at the call site).
pub struct IntrusiveList<T> {
    head: IntrusiveListHead,
    _marker: PhantomData<T>,
}

/// DLM resource name. Variable-length, hierarchical (e.g., "gfs2:fsid:inode:12345").
/// Maximum 256 bytes. Compared by byte equality for lock matching.
pub struct ResourceName {
    /// Name bytes (NUL-terminated, max 256 bytes including NUL).
    pub bytes: [u8; 256],
    /// Length of the name (excluding NUL terminator).
    pub len: u16,
}

/// Wait-for graph for distributed deadlock detection.
/// Nodes are lock holders (identified by (node_id, lock_id) pairs).
/// Edges represent "waits for" relationships. Cycle detection runs
/// periodically (default: every 100ms) using a DFS traversal.
pub struct WaitForGraph {
    /// Adjacency list: waiter → set of holders it's waiting for.
    /// Bounded to MAX_CONCURRENT_LOCKS (65536) entries.
    ///
    /// **BTreeMap rationale**: Deadlock detection runs only after a lock
    /// has been waiting >5 seconds (see Section 14.6.9) — this is off the
    /// hot lock-grant path entirely. BTreeMap provides O(log N) insert/lookup
    /// and O(N) ordered iteration (needed for consistent cycle detection
    /// across gossip rounds — consistent ordering ensures all nodes detect
    /// the same cycle and select the same victim). The 65536-entry bound
    /// caps memory at roughly 10MB (65536 × ~152 bytes: a 16-byte WaiterId
    /// key plus an 8-slot ArrayVec of 16-byte WaiterIds), acceptable for
    /// a background structure. An alternative HashMap would give O(1) average
    /// but non-deterministic iteration order would require an explicit sort
    /// before cycle detection, eliminating the main advantage.
    pub edges: BTreeMap<WaiterId, ArrayVec<WaiterId, 8>>,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct WaiterId {
    pub node_id: NodeId,
    pub lock_id: u64,
}
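The periodic cycle detection can be sketched as a standard depth-first search with an on-path set: revisiting a node that is already on the current DFS path means a "waits-for" cycle, i.e. a deadlock. `WaiterId` is simplified to `u64` here and all names are illustrative:

```rust
use std::collections::{BTreeMap, BTreeSet};

type WaiterId = u64; // simplified stand-in for (node_id, lock_id)

/// DFS from `node`; `on_path` is the current DFS stack membership,
/// `done` holds fully explored nodes (never part of a cycle).
fn has_cycle_from(
    node: WaiterId,
    edges: &BTreeMap<WaiterId, Vec<WaiterId>>,
    on_path: &mut BTreeSet<WaiterId>,
    done: &mut BTreeSet<WaiterId>,
) -> bool {
    if on_path.contains(&node) {
        return true; // back edge: node already on the current path
    }
    if done.contains(&node) {
        return false; // already fully explored, no cycle through here
    }
    on_path.insert(node);
    for &next in edges.get(&node).into_iter().flatten() {
        if has_cycle_from(next, edges, on_path, done) {
            return true;
        }
    }
    on_path.remove(&node);
    done.insert(node);
    false
}

/// BTreeMap iteration order is deterministic, so every node scans
/// waiters in the same order and finds the same cycle.
fn has_cycle(edges: &BTreeMap<WaiterId, Vec<WaiterId>>) -> bool {
    let mut on_path = BTreeSet::new();
    let mut done = BTreeSet::new();
    edges.keys().any(|&n| has_cycle_from(n, edges, &mut on_path, &mut done))
}

fn main() {
    // A waits for B, B waits for A: classic two-node deadlock.
    let mut g = BTreeMap::new();
    g.insert(1u64, vec![2u64]);
    g.insert(2, vec![1]);
    assert!(has_cycle(&g));
    g.insert(2, vec![3]); // break the cycle: B now waits for C
    assert!(!has_cycle(&g));
}
```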

/// A lock resource managed by the DLM.
pub struct DlmResource {
    /// Resource name (hierarchical, variable-length).
    pub name: ResourceName,

    /// Node ID of the resource master.
    pub master: NodeId,

    /// Lock Value Block for this resource.
    pub lvb: LockValueBlock,

    /// Granted queue — locks currently held.
    /// Intrusive linked list: DlmLock nodes are allocated from a per-lockspace
    /// slab allocator (fixed-size, no heap resizing on the lock grant path).
    pub granted: IntrusiveList<DlmLock>,

    /// Converting queue — locks being converted (upgrade/downgrade).
    /// Processed in FIFO order before the waiting queue.
    pub converting: IntrusiveList<DlmLock>,

    /// Waiting queue — new lock requests waiting for compatibility.
    pub waiting: IntrusiveList<DlmLock>,

    /// Pending CAS confirmations (Section 14.6.5). When remote nodes acquire a
    /// lock via RDMA CAS but have not yet sent the confirmation RDMA Send, this
    /// field tracks the expected confirmations. The master defers processing new
    /// incompatible-mode requests against this resource until all confirmations
    /// arrive or time out. A bounded collection is required — not Option<PendingCas>
    /// — because shared-mode CAS operations (e.g., PR acquires) allow multiple
    /// nodes to win concurrently (each successive shared-mode CAS increments
    /// holder_count). For exclusive-mode CAS (EX, PW), at most one entry exists.
    pub pending_cas: ArrayVec<PendingCas, MAX_CLUSTER_NODES>,
}

/// Tracks a pending CAS confirmation for a DlmResource.
pub struct PendingCas {
    /// Node that performed the CAS.
    pub node: NodeId,
    /// Lock mode the node acquired.
    pub mode: LockMode,
    /// Sequence value in the CAS word after the acquire (for timeout reset).
    pub post_cas_sequence: u64,
    /// Timestamp when the CAS was detected (for 500 μs timeout).
    pub detected_at_ns: u64,
}

// Note on allocation strategy: DlmLock nodes are allocated from a per-lockspace
// slab allocator (umka-core Section 4.1.2). The slab pre-allocates DlmLock-sized objects
// and grows in page-sized chunks, so individual lock grant/release operations
// never trigger the general-purpose heap allocator. This ensures bounded latency
// on the contested lock path. The intrusive list avoids the pointer indirection
// and dynamic resizing of VecDeque/Vec.
//
// Note on byte-range lock tracking: each DlmLock's associated LockDirtyTracker
// (Section 14.6.8) uses LargeRangeBitmap (not a flat SparseBitmap) to track dirty
// pages within the lock's byte range. This supports files of any practical size:
// ≤ 1 GiB files use the flat SparseBitmap path (zero overhead), while larger files
// use the two-level LargeRangeBitmap with lazily-allocated 1 GiB slots.

/// A single lock held or requested by a node. Embeds an IntrusiveListNode
/// so it can live on a DlmResource's granted/converting/waiting lists
/// (required by IntrusiveList<DlmLock>, see above).
pub struct DlmLock {
    /// Intrusive linkage for the granted/converting/waiting queues.
    pub link: IntrusiveListNode,

    /// Node that owns this lock.
    pub node: NodeId,

    /// Requested/granted lock mode.
    pub mode: LockMode,

    /// Process ID on the owning node (for deadlock detection).
    pub pid: u32,

    /// Flags (NOQUEUE, CONVERT, CANCEL, etc.).
    pub flags: LockFlags,

    /// Timestamp for ordering and deadlock victim selection.
    pub timestamp_ns: u64,
}

14.6.5 RDMA-Native Lock Operations

All DLM operations use Section 5.1.4's RDMA transport. Four protocol flows cover the full lock lifecycle:

1. Uncontested acquire (RDMA Atomic CAS, ~3-5 μs full operation)

When a resource has no current holders or only compatible holders, the requesting node can acquire the lock via RDMA Atomic Compare-and-Swap on the master's lock state word — a 64-bit value encoding the current lock state:

/// 64-bit lock state word, laid out for RDMA Atomic CAS.
/// Stored in master's RDMA-accessible memory for each DlmResource.
///
///   bits [63:61] = current_mode (3 bits: 0=NL, 1=CR, 2=CW, 3=PR, 4=PW, 5=EX)
///   bits [60:48] = holder_count (13 bits: up to 8191 concurrent holders;
///                   sufficient for MAX_CLUSTER_NODES=64, with margin for
///                   future expansion up to ~128x the cluster size limit)
///   bits [47:0]  = sequence (48 bits: monotonic counter for ABA prevention)
///
/// IMPORTANT: current_mode encodes a SINGLE lock mode. This means the CAS fast
/// path only works for HOMOGENEOUS holder sets — all holders must be in the same
/// mode. When holders have different compatible modes (e.g., CR + PR, or CR + PW),
/// the CAS word cannot represent the mixed state. These transitions MUST use the
/// two-sided RDMA Send path (protocol 2 below), where the master's control thread
/// maintains per-holder mode information in the full DlmResource granted queue.
///
/// This is a deliberate design tradeoff: the CAS fast path covers the most common
/// lock patterns in practice:
///   - EX for exclusive write access (single writer)
///   - PR for shared read access (multiple readers)
///   - CR for concurrent read (e.g., GFS2 inode attribute reads via LVB)
/// Mixed-mode combinations (CR+PR, CR+PW, CR+CW) are valid but uncommon in
/// GFS2 workloads — they arise primarily during mode transitions (one node
/// downgrades while another acquires). The two-sided path at ~5-8μs is still
/// 5-10x faster than Linux's TCP-based DLM.
///
/// ABA safety: 48-bit sequence counter. At 500,000 lock ops/sec on a single
/// resource (sustained maximum), wrap time = 2^48 / 500,000 = ~563 million
/// seconds (~17.8 years). This eliminates ABA as a practical concern.
/// (Note: this is the CAS lock-word sequence counter, which increments once
/// per lock acquisition. The LVB sequence counter in Section 14.6.3 wraps in ~8.9
/// years because it increments twice per write — once at begin_write, once
/// at end_write — giving 1M increments/sec at 500K writes/sec.)
///
/// The full granted/converting/waiting queues are maintained separately in the
/// master's local memory. The CAS word is a fast-path optimization — it
/// encodes enough state for common homogeneous transitions without remote CPU
/// involvement. The master's granted queue is the authoritative lock state;
/// the CAS word is a cache of that state for the fast path.
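Pack/unpack helpers for this bit layout might look like the following sketch (function names are hypothetical). The CAS expected/desired pair for a transition is built by packing the current state and the target state with the sequence advanced by one:

```rust
/// Illustrative helpers for the 64-bit lock state word:
/// mode[63:61] | holder_count[60:48] | sequence[47:0].
const SEQ_MASK: u64 = (1 << 48) - 1;

fn pack(mode: u8, holders: u16, seq: u64) -> u64 {
    ((mode as u64) << 61) | (((holders as u64) & 0x1FFF) << 48) | (seq & SEQ_MASK)
}

fn mode(word: u64) -> u8 {
    (word >> 61) as u8
}

fn holders(word: u64) -> u16 {
    ((word >> 48) & 0x1FFF) as u16
}

fn seq(word: u64) -> u64 {
    word & SEQ_MASK
}

fn main() {
    // Unlocked → EX fast path: expected = NL|0|seq, desired = EX|1|seq+1.
    let expected = pack(0, 0, 41); // NL = 0
    let desired = pack(5, 1, seq(expected) + 1); // EX = 5
    assert_eq!(mode(desired), 5);
    assert_eq!(holders(desired), 1);
    assert_eq!(seq(desired), 42);
}
```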

CAS fast path cases (homogeneous mode only):

| Transition | CAS expected | CAS desired | Ops | Notes |
|---|---|---|---|---|
| Unlocked → EX | NL\|0\|seq | EX\|1\|seq+1 | 1 CAS | First exclusive holder |
| Unlocked → PR | NL\|0\|seq | PR\|1\|seq+1 | 1 CAS | First protected reader |
| Unlocked → CR | NL\|0\|seq | CR\|1\|seq+1 | 1 CAS | First concurrent reader |
| PR → PR (add reader) | PR\|K\|seq | PR\|K+1\|seq+1 | Read + CAS | Add same-mode holder |
| CR → CR (add reader) | CR\|K\|seq | CR\|K+1\|seq+1 | Read + CAS | Add same-mode holder |
| EX → NL (unlock) | EX\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last holder releases |
| PR → NL (last reader) | PR\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last holder releases |
| CR → NL (last reader) | CR\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last holder releases |
| PR (remove reader) | PR\|K\|seq | PR\|K-1\|seq+1 | Read + CAS | K>1, decrement count |
| CR (remove reader) | CR\|K\|seq | CR\|K-1\|seq+1 | Read + CAS | K>1, decrement count |
| Unlocked → PW | NL\|0\|seq | PW\|1\|seq+1 | 1 CAS | Single PW holder (PW+PW incompatible) |
| Unlocked → CW | NL\|0\|seq | CW\|1\|seq+1 | 1 CAS | First concurrent writer |
| CW → CW (add writer) | CW\|K\|seq | CW\|K+1\|seq+1 | Read + CAS | CW is self-compatible (per Section 14.6.2 matrix) |
| CW → NL (last writer) | CW\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last CW holder releases |
| CW (remove writer) | CW\|K\|seq | CW\|K-1\|seq+1 | Read + CAS | K>1, decrement count |

Transitions that CANNOT use CAS (require the two-sided path):
- Any mode conversion (e.g., PR→EX, EX→PR, CR→PW)
- Acquiring a mode different from current holders (e.g., CW when current_mode=CR, or PR when current_mode=CW)
- Adding a second PW holder (PW is not self-compatible)

These transitions require the master's control thread to evaluate the full compatibility matrix and update per-holder mode tracking in the granted queue.

Requester                                   Master (remote memory)
    |                                            |
    |--- RDMA Atomic CAS (expected=UNLOCKED, --->|
    |    desired=EX|1|seq+1)                     |
    |<-- CAS result (old value) ------------------|
    |                                            |
    If old_value matched expected: lock acquired.
    Zero remote CPU involvement.
    Raw CAS round-trip: ~2-3 μs.
    Full acquire (CAS + confirmation): ~3-5 μs.

For the Read+CAS path (adding a shared reader when holders exist), the requester first performs an RDMA Read to learn the current state, then a CAS to atomically increment the holder count. Total: 2 RDMA operations (~3-5 μs). CAS failure (due to concurrent modification) triggers retry with the returned value as the new expected value.
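The Read+CAS retry loop can be modeled locally with an `AtomicU64` standing in for the master's RDMA-accessible lock word; `compare_exchange` plays the role of RDMA Atomic CAS, and the load plays the role of the initial RDMA Read. All names and the bit-layout helpers are illustrative:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const PR: u64 = 3; // Protected Read, per the LockMode enum

/// mode[63:61] | holder_count[60:48] | sequence[47:0]
fn pack(mode: u64, count: u64, seq: u64) -> u64 {
    (mode << 61) | (count << 48) | (seq & ((1 << 48) - 1))
}

/// Try to add a PR reader: Read the word, then CAS count+1 / seq+1.
/// On CAS failure, retry with the freshly returned old value.
/// Returns false if the mode is not homogeneous PR (two-sided path needed).
fn add_pr_reader(word: &AtomicU64) -> bool {
    let mut cur = word.load(Ordering::SeqCst); // models the RDMA Read
    loop {
        if cur >> 61 != PR {
            return false; // incompatible/mixed mode: fall back to two-sided
        }
        let count = (cur >> 48) & 0x1FFF;
        let seq = cur & ((1 << 48) - 1);
        let desired = pack(PR, count + 1, seq + 1);
        match word.compare_exchange(cur, desired, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(_) => return true,           // CAS won: lock acquired
            Err(actual) => cur = actual,    // retry with returned value
        }
    }
}

fn main() {
    let word = AtomicU64::new(pack(PR, 1, 10)); // one existing PR holder
    assert!(add_pr_reader(&word));
    let w = word.load(Ordering::SeqCst);
    assert_eq!((w >> 48) & 0x1FFF, 2);          // holder count incremented
    assert_eq!(w & ((1 << 48) - 1), 11);        // sequence advanced
}
```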

Important: The CAS word is an optimization for the uncontested fast path. It does NOT replace the full lock queues maintained in the master's local memory. When a CAS succeeds, the acquiring node MUST send a two-sided RDMA Send to the master confirming its identity (node ID) and the acquired lock mode. The master updates the full granted queue upon receiving this confirmation. If the master does not receive confirmation within ~500 μs (the confirmation timeout), it assumes the CAS winner crashed before completing the acquire and resets the lock state word via its own CAS (restoring the pre-acquire state). The CAS target word includes a generation counter (the 48-bit sequence field) to prevent ABA issues during this reclamation — the master's restoration CAS uses the post-acquire sequence value as the expected value, so a concurrent legitimate acquire by another node will not be clobbered. This confirmation step is a required correctness measure, not an optimization: without it, if the CAS winner crashes before the master processes its queue entry, recovery would iterate the granted queue and find no record of the holder, leaving the lock state word permanently wedged. When a CAS fails (contested lock, incompatible mode), the requester falls back to the two-sided protocol below. The master's control thread is the sole authority for complex operations (conversions, waiters, deadlock detection).

CAS outcome determination and transport failure recovery. RDMA Atomic CAS is a single round-trip operation: the RNIC performs the compare-and-swap on the remote memory and returns the previous value of the target word in the CAS completion. The requester determines the CAS outcome entirely from this return value — if the returned old value matches the expected value, the CAS succeeded and the lock is held. No separate "confirmation response" from the master's CPU is involved in determining CAS success or failure; the RDMA NIC hardware handles the entire operation atomically. This means the requester always knows whether it acquired the lock, as long as the RDMA completion is delivered.

If the RDMA transport itself fails during a CAS operation (e.g., the Queue Pair enters Error state due to a link failure, cable pull, or remote RNIC reset), the requester receives a Work Completion with an error status (not a successful CAS completion). In this case, the CAS may or may not have been applied to the master's memory — the requester cannot distinguish between "CAS was never sent", "CAS was sent but not executed", and "CAS succeeded but the response was lost in transit." The requester must handle this ambiguity:

  1. Assume the CAS may have succeeded. The requester must not retry the CAS blindly (doing so could double-acquire or corrupt the sequence counter).
  2. Query the master via a recovery path. The requester establishes a fresh RDMA connection (or uses a separate TCP fallback if the RDMA fabric is partitioned) and sends a two-sided lock state query to the master's control thread. The master reads its authoritative lock state — the CAS word in registered memory — and responds with the current lock state plus the sequence counter value.
  3. Master's lock word is ground truth. If the CAS word shows the requester's expected post-CAS value (matching mode, holder count, and sequence), the CAS succeeded and the requester proceeds with the confirmation RDMA Send (on the new connection). If the CAS word shows a different state, the CAS either was not applied or was already reclaimed by the master's confirmation timeout (the ~500 μs timeout described above). In either case, the requester starts a fresh lock acquisition attempt.
  4. Interaction with confirmation timeout. If the CAS succeeded but the requester takes longer than ~500 μs to query the master (due to connection re-establishment), the master may have already reclaimed the lock via its confirmation timeout logic. This is safe: the master's reclamation CAS uses the post-acquire sequence value, so if reclamation occurred, the lock word has been reset and the requester's recovery query will see the reset state. The requester then re-acquires normally.

This recovery path is exercised rarely (only on RDMA transport failures, not on normal CAS contention), so its higher latency (~1-5 ms for connection re-establishment + query) does not affect steady-state performance.
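As a concrete illustration, the outcome decision in steps 1-4 above can be sketched as follows (the types, field layout, and function names here are illustrative assumptions, not the actual KABI definitions):

```rust
/// Decoded contents of the lock state word (illustrative layout).
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct LockWord {
    pub mode: u8,
    pub holder_count: u8,
    pub sequence: u64, // 48-bit generation counter in the real word
}

pub enum RecoveryOutcome {
    /// The CAS was applied: proceed with the confirmation Send
    /// on the new connection.
    CasSucceeded,
    /// The CAS was never applied, or was already reclaimed by the
    /// master's ~500 us confirmation timeout: start a fresh acquire.
    RetryAcquire,
}

/// After a Work Completion error, the requester queries the master's
/// authoritative lock word over a fresh connection and compares it
/// against the post-CAS value it expected to install.
pub fn recover_after_transport_failure(
    master_word: LockWord,     // from the two-sided lock state query
    expected_post_cas: LockWord,
) -> RecoveryOutcome {
    if master_word == expected_post_cas {
        // Mode, holder count, and sequence all match: the CAS landed.
        RecoveryOutcome::CasSucceeded
    } else {
        // Either never applied, or reclaimed (the reclaimed word's
        // sequence no longer matches). Either way, we do not hold
        // the lock and must re-acquire.
        RecoveryOutcome::RetryAcquire
    }
}
```

The sequence comparison is what makes step 4 safe: after reclamation the lock word no longer matches the requester's expected post-CAS value, so recovery correctly falls back to a fresh acquire.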

Pending CAS confirmation window: Between a successful CAS and the arrival of the confirmation Send, the CAS word and the master's granted queue are temporarily inconsistent — the CAS word shows a lock held, but the granted queue has no entry. During this window, if another node's CAS fails and it falls back to the two-sided path, the master must handle the discrepancy correctly:

  1. When the master receives a two-sided lock request, it checks BOTH the granted queue AND the CAS word state. If the granted queue is empty but the CAS word shows a held lock, the master knows a CAS confirmation is pending.
  2. The master enqueues the incoming request in the waiting queue and defers processing until either: (a) the CAS confirmation arrives (at which point the granted queue is updated and the waiting queue is processed normally), or (b) the confirmation timeout expires (at which point the master resets the CAS word and processes the waiting queue against the now-empty granted queue).
  3. If the pending CAS mode is compatible with the incoming request's mode (per the Section 14.6.2 compatibility matrix), the master grants the incoming request immediately without waiting for the CAS confirmation. The master also updates the CAS word via its own local CAS to reflect the new holder (incrementing holder_count in the CAS word to account for both the pending CAS winner and the newly granted node). The CAS winner's confirmation, when it arrives, simply adds the CAS winner to the already-updated granted queue. This eliminates the blocking window entirely for same-mode shared requests (e.g., multiple concurrent PR acquires), which are the most common contested case.
  4. For incompatible-mode requests, this deferred processing adds at most 500 μs of latency to the second node's request in the worst case (CAS winner crashed). In the normal case, the confirmation arrives within ~1-2 μs (one RDMA Send), so the deferred processing completes almost immediately. A crashed node's 500 μs delay is negligible compared to the 50-200 ms DLM recovery time.
  5. The master tracks pending CAS confirmations with a per-resource pending_cas: ArrayVec<PendingCas, MAX_CLUSTER_NODES> field (see DlmResource struct in Section 13.6.4). A bounded collection is required — not Option<PendingCas> — because shared-mode CAS operations (e.g., PR acquires) allow multiple nodes to win concurrently: each successive shared-mode CAS increments the holder_count field embedded in the CAS word and updates the sequence number, so two or more nodes can complete their CAS atomics before any confirmation arrives.

     The master must reconcile ALL concurrent CAS winners: it reads the final CAS word once all confirmations have arrived (or the polling timeout expires) and uses the holder_count to verify that the number of confirmations received matches the number of nodes that successfully CAS'd. Any node whose confirmation does not arrive within the timeout is treated as crashed and is excluded from the granted queue. For exclusive-mode CAS (EX, PW), at most one node can win — the CAS word format enforces mutual exclusion — so the collection will contain at most one entry in that case.

     The pending_cas field is set when the master observes a CAS word change via periodic polling of the CAS word in its registered memory region, and cleared when all confirmations arrive or the polling timeout expires. Note: the master does NOT receive RDMA completion queue notifications for remote CAS operations (one-sided RDMA is CPU-transparent at the responder). Detection relies on targeted polling of CAS words with pending requests only — the master maintains a per-lockspace pending set of resources with outstanding CAS operations, and polls only those CAS words (poll interval: ~100 μs per pending resource). Resources with no pending CAS operations are not polled, so the CPU overhead scales with O(pending), not O(total_resources). On a lockspace with 10,000 resources but only 50 with pending CAS operations, polling generates ~500K polls/second — manageable on a single core.
Optimization note: For workloads with consistently high pending-CAS counts (>100), an interrupt-driven notification path is available: the requesting node sends a two-sided RDMA Send to the master after completing its CAS, triggering a completion queue event instead of requiring polling. The master switches to interrupt-driven mode per-resource when the pending count exceeds a configurable threshold (default: 100). This trades higher per-lock latency (~1μs CQ processing vs ~0.1μs poll) for reduced CPU overhead.
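The targeted polling described above can be sketched as follows. CasPoller and its methods are hypothetical names for illustration; the real master reads CAS words directly out of its registered memory region rather than through a closure:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch of the master's targeted CAS-word polling.
/// Only resources in the pending set are polled, so the cost per
/// pass is O(pending), not O(total_resources).
pub struct CasPoller {
    /// Last observed value of each pending resource's CAS word.
    last_seen: HashMap<u64, u64>,
    /// Resources with an outstanding CAS (maintained by the request path).
    pending: HashSet<u64>,
}

impl CasPoller {
    pub fn new() -> Self {
        CasPoller { last_seen: HashMap::new(), pending: HashSet::new() }
    }

    /// Register a resource with an outstanding CAS and its current word.
    pub fn add_pending(&mut self, resource: u64, current_word: u64) {
        self.pending.insert(resource);
        self.last_seen.insert(resource, current_word);
    }

    /// One polling pass. `read_word` stands in for reading the CAS word
    /// from the registered memory region. Returns the resources whose
    /// word changed since the last pass (i.e., a remote CAS winner exists).
    pub fn poll(&mut self, read_word: impl Fn(u64) -> u64) -> Vec<u64> {
        let mut changed = Vec::new();
        for &res in &self.pending {
            let now = read_word(res);
            if self.last_seen.get(&res) != Some(&now) {
                self.last_seen.insert(res, now);
                changed.push(res);
            }
        }
        changed
    }
}
```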

Security: RDMA CAS access to the lock state word is controlled via RDMA memory registration (Memory Regions / MRs). The master registers each lockspace's CAS word array as a separate RDMA MR and distributes the Remote Key (rkey) only to nodes that hold CAP_DLM_LOCK for that lockspace. Capability verification happens at lockspace join time (a two-sided RDMA Send to the master, which checks CAP_DLM_LOCK via umka-core's capability system before returning the rkey). Nodes that lose CAP_DLM_LOCK have their rkey revoked via RDMA MR re-registration (which invalidates the old rkey). This enforces the capability boundary at the RDMA transport layer — a node without the rkey physically cannot issue CAS operations to the lock state words. The rkey is per-lockspace, so CAP_DLM_LOCK scoping (Section 14.6.14) maps directly to RDMA access control.

Rkey lifetime and TOCTOU safety: RDMA rkeys are registered for the lifetime of the node's DLM membership in the lockspace, not per-operation. When a node joins a lockspace, the master registers the RDMA Memory Region and returns the rkey; when the node leaves (graceful or fenced), the MR is deregistered and the rkey is invalidated. This eliminates TOCTOU (time-of-check-to-time-of-use) races: a node that passes the capability check at join time retains a valid rkey for all subsequent lock operations until membership ends. Rkey revocation (for CAP_DLM_LOCK loss) uses RDMA MR re-registration, which atomically invalidates the old rkey -- any in-flight CAS using the old rkey will fail with a remote access error (IBA v1.4 Section 13.6.7.2: deregistered MR causes Remote Access Error completion), and the node must re-join the lockspace (re-passing the capability check) to obtain a new rkey.

Revocation ordering: The MR re-registration is the authoritative enforcement mechanism — it must complete before the capability is marked as revoked in the local capability table. Sequence: (1) master calls dereg_mr() on the RNIC, which invalidates the rkey in hardware; (2) master updates the lockspace membership record (removes node); (3) capability revocation propagates to the evicted node. This ordering ensures no window exists where the capability is revoked but the rkey is still valid. If the evicted node races a CAS between steps (1) and (3), the RNIC rejects it (rkey already invalid). Rkey revocation is hardware-enforced with < 1ms latency from the dereg_mr() call — there is no exposure window. This is the same eager dereg_mr() mechanism used for cluster membership revocation (Section 5.1.12); the 180s rkey rotation grace period described in Section 5.1.3.2 (Mitigation 2) is a separate defense-in-depth against rkey leakage to non-cluster entities, not the revocation path for DLM membership loss.
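The ordering can be sketched as a sequence of stubbed steps (all function names here are hypothetical stand-ins for the real RNIC and membership calls; the recorded order is the point):

```rust
/// Stubbed sketch of the revocation sequence: hardware rkey
/// invalidation strictly precedes the membership and capability-table
/// updates, so no window exists where the capability is revoked but
/// the rkey still works.
pub struct RevocationLog {
    pub steps: Vec<&'static str>,
}

impl RevocationLog {
    pub fn new() -> Self {
        RevocationLog { steps: Vec::new() }
    }

    fn dereg_mr(&mut self) {
        // (1) RNIC invalidates the rkey in hardware; any in-flight CAS
        // using it now completes with a Remote Access Error.
        self.steps.push("dereg_mr");
    }

    fn remove_member(&mut self) {
        // (2) Lockspace membership record updated (node removed).
        self.steps.push("remove_member");
    }

    fn propagate_capability_revocation(&mut self) {
        // (3) CAP_DLM_LOCK revocation reaches the evicted node last.
        self.steps.push("revoke_capability");
    }

    pub fn revoke(&mut self) {
        self.dereg_mr();
        self.remove_member();
        self.propagate_capability_revocation();
    }
}
```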

2. Contested acquire (RDMA Send, ~5-8 μs)

When the CAS fails (resource is already locked in an incompatible mode), the requester falls back to a two-sided RDMA Send to the master's control queue pair:

Requester                                   Master
    |                                            |
    |--- RDMA Send (lock request) ------------->|
    |                                [enqueue in waiting list]
    |                                [check compatibility]
    |                                [if compatible: grant]
    |<-- RDMA Send (lock grant + LVB) ----------|
    |                                            |
    Total latency: 2 RDMA round-trips (~5-8 μs).|

The master's kernel thread processes the request, checks compatibility against the granted queue, and either grants immediately or enqueues for later grant.

3. Lock conversion (upgrade/downgrade)

A node holding a lock can convert it to a different mode without releasing and reacquiring. Conversions use the same protocol as contested acquire (RDMA Send to master). The converting queue is processed before the waiting queue — a conversion request from an existing holder takes priority over new requests.

Common conversions:

  • PR → EX: upgrade from read to write (e.g., before modifying an inode)
  • EX → PR: downgrade from write to read (triggers targeted writeback, Section 14.6.8)
  • EX → NL: release write lock but keep queue position (for future reacquire)

4. Batch request (up to 64 locks, ~5-10 μs total)

Multiple lock requests destined for the same master are grouped into a single RDMA Write:

Requester                                   Master
    |                                            |
    |--- RDMA Write (batch: 8 lock requests) -->|
    |                                [process all 8]
    |<-- RDMA Send (batch: 8 grants/queued) ----|
    |                                            |
    Total: ~5-10 μs for 8 locks.                |
    Linux DLM: 8 × 30-50 μs = 240-400 μs.      |

Batch requests are critical for operations that require multiple locks atomically. A rename() requires locks on the source directory, destination directory, and the file being renamed — three locks that can be batched into a single network operation when they share the same master.

When batch locks span multiple masters, the requester sends one batch per master in parallel and waits for all grants. Worst case: N masters = N parallel RDMA operations completing in max(individual latencies) rather than sum(individual latencies).
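A minimal sketch of the per-master grouping (the consistent-hash lookup is abstracted as a closure; names are illustrative, not the real DLM API):

```rust
use std::collections::HashMap;

/// Group lock requests by their master so each group becomes a single
/// RDMA Write. Batches for different masters are then sent in parallel.
pub fn group_by_master(
    resources: &[u64],
    master_of: impl Fn(u64) -> u32,
) -> HashMap<u32, Vec<u64>> {
    let mut batches: HashMap<u32, Vec<u64>> = HashMap::new();
    for &r in resources {
        // One entry per master; preserves request order within a batch.
        batches.entry(master_of(r)).or_default().push(r);
    }
    batches
}
```

For the rename() example above, the three resources collapse into one batch whenever all three hash to the same master, and into at most three parallel batches otherwise.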

14.6.6 Lease-Based Lock Extension

Problem solved: Linux DLM's BAST (Blocking AST) callback storms.

In Linux, when a node requests a lock in a mode incompatible with current holders, the DLM sends a BAST callback to every holder. For a popular file with 100 readers (PR mode), a writer requesting EX mode triggers 100 BAST messages — O(N) network traffic per contention event. On large clusters (64+ nodes), this becomes a significant source of network overhead.

UmkaOS's lease-based approach:

  • Every granted lock includes a lease duration (configurable per resource type):
  • Metadata locks: 30 seconds default
  • Data locks: 5 seconds default
  • Application locks: configurable (1-300 seconds)

  • Lease extension: Holders extend their lease cheaply via RDMA Write to the master's lease table — a single one-sided RDMA operation that updates a timestamp. No master CPU involvement. Cost: ~1-2 μs per extension.

  • Revocation strategy:

  • Uncontended resource: No revocation needed. Holders extend leases indefinitely. Minimal network traffic for uncontended locks — only periodic one-sided RDMA lease renewals, which do not interrupt the remote CPU (vs. Linux's periodic BAST heartbeats that require CPU processing on every node).
  • Contended resource (incompatible request arrives): Master checks lease expiry for all incompatible holders. If all leases have expired, master grants to new requester immediately. If any leases are active, master sends revocation messages to those holders. For the worst case (EX request on a resource with K active CR/PR holders), this is O(K) revocations — the same as Linux's BAST count. The improvement over Linux is for the common case: uncontended resources have zero CPU-consuming traffic — only one-sided RDMA lease renewals that bypass the remote CPU (Linux BASTs are sent even for uncontended downgrade requests and require CPU processing on the receiving node), and resources where most holders' leases have naturally expired need only revoke the few remaining active holders.
  • Emergency revocation: For locks with NOQUEUE flag (non-blocking), the master immediately checks compatibility and returns EAGAIN if blocked. No revocation attempted.

  • Correctness guarantee: Lease expiry is a sufficient condition for revocation — if a holder fails to extend its lease, the master knows the lock can be safely reclaimed. For contended resources, the fallback to immediate revocation (single targeted message) preserves correctness identically to Linux's BAST mechanism.

  • Clock skew safety: Lease timing is master-clock-relative only. The master is the sole arbiter of lease validity. To handle clock skew between holder and master:

  • Grant messages include the master's absolute expiry timestamp.
  • Holders renew at 50% of lease duration (e.g., 15s for a 30s metadata lease), providing a safety margin larger than any reasonable clock skew (seconds).
  • Holders track the master's clock offset from grant/renewal responses and adjust their renewal timing accordingly.
  • If a holder discovers its lease was revoked (via a failed extension response), it must immediately stop using cached data and flush any dirty pages before reacquiring the lock. This is the hard correctness boundary: the holder's opinion of lease validity does not matter — only the master's.
  • NTP or PTP synchronization is recommended but not required for correctness. The protocol is safe with unbounded clock skew — only the renewal safety margin shrinks, increasing the probability of unnecessary revocations (performance, not correctness).

  • Network traffic reduction: From O(N) BASTs per contention event to O(1) for uncontended resources (no active holders — just clear the lease) and O(K) for contended resources with K active holders. Cluster-wide lock traffic is reduced by orders of magnitude on large clusters.
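Holder-side renewal timing under the clock-skew rules above can be sketched as follows (units and names are illustrative; the real logic lives in the DLM client):

```rust
/// Compute when (on the holder's local clock, in microseconds) the next
/// lease renewal should fire. The holder renews at 50% of the lease
/// duration, interpreted against the MASTER's clock and translated to
/// local time using the offset learned from grant/renewal responses.
pub fn next_renewal_local_time(
    master_expiry_us: u64,       // absolute expiry from the grant message
    lease_duration_us: u64,
    master_clock_offset_us: i64, // master_clock - local_clock
) -> u64 {
    // Renew when half the lease remains, in master-clock terms...
    let renew_at_master = master_expiry_us - lease_duration_us / 2;
    // ...then translate to the holder's local clock. Skew only shifts
    // the safety margin; the master remains the sole validity arbiter.
    (renew_at_master as i64 - master_clock_offset_us) as u64
}
```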

14.6.7 Speculative Multi-Resource Lock Acquire

Problem solved: GFS2 resource group contention.

GFS2 must find a resource group (rgrp) with free blocks before allocating file data. In Linux, this is sequential: try rgrp 0, if locked → full round-trip (~30-50 μs); try rgrp 1, if locked → another round-trip. On a busy cluster with 8 rgrps, worst case is 8 × 30-50 μs = 240-400 μs just to find a free rgrp.

UmkaOS's lock_any_of() primitive:

/// Request an exclusive lock on ANY ONE of the provided resources.
/// The DLM tries all resources and grants the first available one.
/// Returns the index of the granted resource and the lock handle.
pub fn lock_any_of(
    resources: &[ResourceName],
    mode: LockMode,
    flags: LockFlags,
) -> Result<(usize, DlmLockHandle), DlmError>;

The requester sends a single message listing N candidate resources. The master (or masters, if resources span multiple masters) evaluates each candidate and grants the first one that is available in the requested mode.

Requester                              Master(s)
    |                                       |
    |--- "Lock any of [rgrp0..rgrp7]" ---->|
    |                           [try rgrp0: locked]
    |                           [try rgrp1: locked]
    |                           [try rgrp2: FREE → grant]
    |<-- "Granted: rgrp2" ------------------|
    |                                       |
    Total: ~5-10 μs (single round-trip).   |
    Linux: up to 8 × 30-50 μs = 240-400 μs.|

For resources spanning multiple masters, the requester sends parallel requests to each master. The first grant received is accepted; the requester cancels remaining requests.

14.6.8 Targeted Writeback on Lock Downgrade

Problem solved: Linux's "flush ALL pages" on lock drop.

In Linux, when a node holding an EX lock on a GFS2 inode downgrades to PR or releases to NL, the kernel must flush ALL dirty pages for that inode to disk. This is because Linux's page cache has no concept of which pages were dirtied under which lock — the dirty tracking is per-inode, not per-lock-range.

For a 100 GB file where only 4 KB was modified, Linux flushes ALL dirty pages (which could be the entire file if it was recently written). This turns a lock downgrade into a multi-second I/O operation.

UmkaOS's per-lock-range dirty tracking:

The DLM integrates with the VFS layer (Section 13.3) to track dirty pages per lock range:

/// A 512-byte chunk holding 64 consecutive u64 words of the bitmap.
/// Allocated from the slab allocator as a unit; freed when all 64 words
/// become zero. One chunk covers 64 × 64 = 4,096 bit positions.
pub struct SparseBitmapChunk {
    /// 64 consecutive bitmap words. Index within the chunk is `(bit / 64) % 64`.
    pub words: [u64; 64],
}

/// Sparse bitmap for tracking dirty page ranges.
///
/// Two-level structure:
/// - **Top level**: a 64-bit presence word per chunk. Bit `c` of `top` is set
///   whenever `chunks[c]` is allocated (i.e., has at least one set bit). This
///   allows O(1) skip of empty chunks during iteration.
/// - **Bottom level**: up to 64 chunk slots, each covering 64 u64 words.
///
/// A chunk is allocated on the first `set()` that falls within it and freed
/// when the last `clear()` empties all 64 words. Maximum coverage:
/// 64 chunks × 64 words × 64 bits = 262,144 tracked positions.
///
/// **Addressing**: bit `b` maps to chunk `b / 4096`, word-in-chunk
/// `(b / 64) % 64`, bit-in-word `b % 64`.
///
/// **Allocation cost**: O(set_chunks), not O(set_bits). A fully-dense
/// 262,144-bit bitmap requires 64 slab allocations of 512 bytes each,
/// versus 4,096 individual allocations under the old per-word scheme.
/// Cache locality: all 64 words of a chunk occupy 8 consecutive cache lines,
/// so sequential scans stay within L1 for the active chunk.
///
/// Used by DLM targeted writeback ([Section 14.6.8](#1468-targeted-writeback-on-lock-downgrade)) to track dirty pages
/// within a lock range.
pub struct SparseBitmap {
    /// Top-level presence map. Bit `c` is set iff `chunks[c]` is `Some(_)`.
    /// Allows fast iteration: `leading_zeros()` / `trailing_zeros()` locate
    /// the next non-empty chunk in one instruction.
    pub top: u64,
    /// Chunk slots. `None` means the chunk is all-zeros and not allocated.
    /// 64 slots × 512 bytes/chunk = 32 KiB maximum live data.
    pub chunks: [Option<Box<SparseBitmapChunk>>; 64],
    /// Total number of set bits across all chunks. Maintained by `set()`
    /// and `clear()`. Allows O(1) `is_empty()` and density checks.
    pub popcount: u32,
}

/// Sparse bitmap for tracking locked byte ranges.
///
/// A flat `SparseBitmap` covers 262,144 bit positions. When each bit represents
/// a 4 KiB page, that covers 1 GiB — sufficient for most files. However, the
/// DLM must track byte-range locks on files that can be much larger (e.g.,
/// 100 GB NFS exports). `LargeRangeBitmap` provides a two-level fallback:
///
/// - **Files ≤ 1 GiB** (common case): uses a flat `SparseBitmap` directly.
///   Zero overhead versus the existing flat bitmap — `small` is `Some(bitmap)`,
///   `large` is `None`.
/// - **Files > 1 GiB**: uses a two-level structure where each top-level slot
///   covers a 1 GiB region and is lazily allocated as a `SparseBitmap` when
///   first needed.
pub struct LargeRangeBitmap {
    /// For files ≤ 1 GiB (common case): flat bitmap.
    small: Option<SparseBitmap>,
    /// For files > 1 GiB: array of 1 GiB-covering SparseBitmaps, lazily allocated.
    /// Index N covers byte range [N * 2^30, (N+1) * 2^30).
    /// Maximum file size supported: 1 TiB (1024 slots × 1 GiB each).
    large: Option<Box<[Option<SparseBitmap>; 1024]>>,
    /// Total file size in bytes (determines which level to use).
    file_size: u64,
}

LargeRangeBitmap design notes:

  • Lazy transition: The bitmap starts in small mode. On the first set() call targeting a bit position beyond the 1 GiB boundary (bit index ≥ 262,144), the small bitmap is moved into slot 0 of the newly-allocated large array, and small is set to None. Subsequent accesses compute the slot index as bit / 262_144 and the intra-slot bit index as bit % 262_144.

  • Lazy slot allocation: The large array is heap-allocated only when needed (files > 1 GiB that actually have locks past the 1 GiB boundary). Within the large array, each slot's SparseBitmap is allocated on first set() to that slot — empty slots remain None.

  • Maximum coverage: 1 TiB (1024 slots × 1 GiB each). Files larger than 1 TiB use coarse-grained lock tracking: byte-range locks map to 1 GiB granules, with potential false conflicts for adjacent byte ranges within the same 1 GiB granule. This is acceptable because files > 1 TiB with fine-grained byte-range locking are extremely rare in practice; whole-file or large-region locks dominate.

  • Performance: For files ≤ 1 GiB (the common case), zero overhead versus the existing flat SparseBitmap — one branch on small.is_some(). For large files, each access adds one pointer dereference (slot lookup) plus the existing SparseBitmap O(1) per-bit cost.

  • range_coverage_bytes() -> u64: Returns the current maximum byte range the bitmap can track at full granularity. In small mode: 1 GiB (262,144 × 4 KiB). In large mode: 1 TiB (1024 × 1 GiB). For files beyond 1 TiB: returns file_size (coarse tracking covers the entire file, but at 1 GiB granularity beyond the 1 TiB fine-grained limit).
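The addressing rules above can be sketched as plain helper functions (names are illustrative):

```rust
/// Bit positions covered by one flat SparseBitmap: 64 chunks x 64 words
/// x 64 bits = 262,144. At one bit per 4 KiB page this spans 1 GiB.
const BITS_PER_SLOT: u64 = 262_144;
const PAGE_SIZE: u64 = 4096;

/// Large-mode addressing: a global bit index splits into a 1 GiB slot
/// index and an intra-slot bit index.
pub fn large_mode_address(bit: u64) -> (usize, u64) {
    ((bit / BITS_PER_SLOT) as usize, bit % BITS_PER_SLOT)
}

/// Page-granularity bit index for a byte offset, as used by the
/// dirty tracker.
pub fn page_bit_for_offset(byte_offset: u64) -> u64 {
    byte_offset / PAGE_SIZE
}
```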

SparseBitmap method contracts:

  • set(b: u64): Computes chunk index c = b / 4096, word index w = (b / 64) % 64, bit index k = b % 64. If chunks[c] is None, allocates a SparseBitmapChunk from the slab allocator and sets bit c in top. Sets bit k in chunks[c].words[w]. If the bit was previously clear, increments popcount.

  • clear(b: u64): Computes (c, w, k) as above. Clears bit k in chunks[c].words[w]. If the bit was set, decrements popcount. If all 64 words in chunks[c] are now zero, frees the chunk and clears bit c in top.

  • test(b: u64) -> bool: Computes (c, w, k). If chunks[c] is None, returns false. Otherwise returns (chunks[c].words[w] >> k) & 1 != 0.

  • iter_set() -> impl Iterator<Item = u64>: Iterates over set chunk indices using top.trailing_zeros() / bit-clear loop. Within each chunk, iterates over non-zero words using words[w].trailing_zeros(). Yields absolute bit positions. Total cost: O(set_chunks + set_bits).
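A self-contained sketch of these contracts follows. The kernel version allocates chunks from the slab allocator; this sketch uses Box, and iter_set() returns a Vec where the real contract returns an iterator:

```rust
pub struct SparseBitmapChunk {
    pub words: [u64; 64],
}

pub struct SparseBitmap {
    pub top: u64,
    pub chunks: [Option<Box<SparseBitmapChunk>>; 64],
    pub popcount: u32,
}

impl SparseBitmap {
    pub fn new() -> Self {
        SparseBitmap { top: 0, chunks: std::array::from_fn(|_| None), popcount: 0 }
    }

    /// Decompose bit `b` into (chunk, word-in-chunk, bit-in-word).
    fn index(b: u64) -> (usize, usize, u64) {
        ((b / 4096) as usize, ((b / 64) % 64) as usize, b % 64)
    }

    pub fn set(&mut self, b: u64) {
        let (c, w, k) = Self::index(b);
        if self.chunks[c].is_none() {
            // First bit in this chunk: allocate it and publish in `top`.
            self.chunks[c] = Some(Box::new(SparseBitmapChunk { words: [0; 64] }));
            self.top |= 1u64 << c;
        }
        let chunk = self.chunks[c].as_mut().unwrap();
        if (chunk.words[w] >> k) & 1 == 0 {
            chunk.words[w] |= 1u64 << k;
            self.popcount += 1;
        }
    }

    pub fn clear(&mut self, b: u64) {
        let (c, w, k) = Self::index(b);
        let Some(chunk) = self.chunks[c].as_mut() else { return };
        if (chunk.words[w] >> k) & 1 == 1 {
            chunk.words[w] &= !(1u64 << k);
            self.popcount -= 1;
            if chunk.words.iter().all(|&x| x == 0) {
                // Last bit in the chunk: free it and clear the presence bit.
                self.chunks[c] = None;
                self.top &= !(1u64 << c);
            }
        }
    }

    pub fn test(&self, b: u64) -> bool {
        let (c, w, k) = Self::index(b);
        match &self.chunks[c] {
            Some(chunk) => (chunk.words[w] >> k) & 1 != 0,
            None => false,
        }
    }

    /// O(set_chunks + set_bits): skip empty chunks via `top`, then
    /// skip zero bits via trailing_zeros within each word.
    pub fn iter_set(&self) -> Vec<u64> {
        let mut out = Vec::new();
        let mut top = self.top;
        while top != 0 {
            let c = top.trailing_zeros() as u64;
            top &= top - 1; // clear lowest set presence bit
            let chunk = self.chunks[c as usize].as_ref().unwrap();
            for w in 0..64u64 {
                let mut word = chunk.words[w as usize];
                while word != 0 {
                    let k = word.trailing_zeros() as u64;
                    word &= word - 1;
                    out.push(c * 4096 + w * 64 + k);
                }
            }
        }
        out
    }
}
```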

/// Dirty page tracker associated with a DLM lock.
/// Tracks which pages were modified while this lock was held.
pub struct LockDirtyTracker {
    /// Byte range covered by this lock (for range locks).
    /// For whole-file locks: 0..u64::MAX.
    pub range: core::ops::Range<u64>,

    /// Bitmap of dirty pages within the lock's range.
    /// Indexed by (page_offset - range.start) / PAGE_SIZE.
    ///
    /// Uses `LargeRangeBitmap` to support files of any practical size:
    /// - Files ≤ 1 GiB: flat `SparseBitmap` (O(1) per page, zero overhead).
    /// - Files > 1 GiB: two-level structure with lazily-allocated 1 GiB slots.
    /// - Files > 1 TiB: coarse 1 GiB granule tracking (rare in practice).
    ///
    /// O(1) set/clear per page, O(dirty_chunks + dirty_bits) iteration.
    /// Slab allocation is per-chunk (512 bytes), not per set bit, keeping
    /// allocator pressure and fragmentation proportional to the number of
    /// 16 MiB dirty regions (one 512-byte chunk covers 4,096 pages)
    /// rather than the number of dirty pages.
    pub dirty_pages: LargeRangeBitmap,
}

Downgrade behavior:

  • EX/PW → PR (downgrade to read): Flush only pages in dirty_pages bitmap. If 4 KB of a 100 GB file was modified, flush exactly 1 page (~10-15 μs for NVMe), not the entire file. PW (Protected Write) follows the same writeback rules as EX, since both are write modes that can dirty pages (per the compatibility matrix in Section 14.6.2).
  • EX/PW → NL (release): Flush dirty pages, then invalidate only pages covered by this lock's range. Other cached pages (from other lock ranges or read-only access) remain valid.
  • Range lock downgrade: When a byte-range lock is downgraded, only dirty pages within that specific byte range are flushed. Pages outside the range are untouched.

Cost reduction: From O(file_size) to O(dirty_pages_in_range). For the common case of small writes to large files, this reduces lock downgrade cost by orders of magnitude.
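The downgrade rules can be sketched as a pure planning function (types are illustrative; dirty_bits stands in for the tracker's iter_set() output):

```rust
use core::ops::Range;

#[derive(PartialEq, Debug)]
pub enum Downgrade {
    ToPr, // EX/PW -> PR: keep cached pages, flush the dirty ones
    ToNl, // EX/PW -> NL: flush, then invalidate the lock's range
}

pub struct WritebackPlan {
    /// Byte offsets of the pages to write back.
    pub flush_offsets: Vec<u64>,
    /// Cached pages to drop after the flush (release only).
    pub invalidate_range: Option<Range<u64>>,
}

/// Plan the targeted writeback for a lock downgrade. `dirty_bits` is
/// the dirty-page bitmap's set positions: bit N means the page at
/// range.start + N * 4096 was dirtied under this lock.
pub fn plan_downgrade(
    range: Range<u64>,
    dirty_bits: &[u64],
    target: Downgrade,
) -> WritebackPlan {
    WritebackPlan {
        // Only pages dirtied under THIS lock are flushed, regardless of
        // how large the file or the lock range is.
        flush_offsets: dirty_bits.iter().map(|&b| range.start + b * 4096).collect(),
        invalidate_range: if target == Downgrade::ToNl { Some(range) } else { None },
    }
}
```

For the 100 GB file example above, a single dirty bit yields a single 4 KiB flush and, on downgrade to PR, no invalidation at all.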

14.6.9 Deadlock Detection

Distributed deadlock detection uses a gossip-based wait-for graph:

  • Each node maintains a local wait-for graph of lock dependencies. Vertices are globally unique process identifiers (node_id, pid) — bare PIDs are insufficient because PID 1234 on Node A and PID 1234 on Node B are different processes. Edges represent lock dependencies: process (N1, P) holds lock A, process (N2, Q) waits for lock A → edge (N2, Q) → (N1, P). The pid field always refers to the initial (host) PID namespace, not a container-local PID namespace, so containers that each have PID 1 are unambiguously distinguished. Container-local PIDs are translated to initial-namespace PIDs at the DLM boundary before insertion into the wait-for graph.
  • Every 100 ms, nodes exchange wait-for graph edges with their neighbors (gossip protocol). Each gossip message includes the (node_id, pid) tuples for both endpoints of each edge, ensuring no PID aliasing across nodes or containers.
  • Neighbor selection: Each node maintains a membership list from the cluster membership protocol (Section 5.1.12). On each gossip round, the node selects ceil(log2(N)) random peers from the membership list (anti-entropy gossip). Random selection ensures convergence in O(log N) rounds with high probability (Erdős–Rényi connectivity result); worst-case propagation is O(N) rounds if selections are consistently unlucky, though this probability falls exponentially with N.
  • Centralized fallback: After 3 × ⌈log₂(N)⌉ rounds without a complete cycle being detected, the detector falls back to a centralized query to the DLM coordinator (lowest live node-id), adding one extra RTT but guaranteeing termination regardless of gossip convergence.
  • Edge removal: When a lock request is granted or cancelled, the node removes the corresponding edge from its local graph and propagates a tombstone (edge + deletion timestamp) in the next gossip round. Tombstones are garbage-collected after 2× the gossip interval (200 ms).
  • Each node runs local cycle detection on its accumulated graph. If a cycle is found, the youngest transaction (highest timestamp) is selected as the victim and receives EDEADLK.
  • Victim selection is configurable: youngest (default), lowest priority, or smallest transaction (fewest locks held).

Zero overhead on fast path: Deadlock detection only activates when a lock request has been waiting for longer than a configurable threshold (default: 5 seconds). Short waits (the common case for contended locks) complete before deadlock detection engages. The gossip protocol runs on a low-priority background thread and uses minimal bandwidth (each gossip message is typically <1 KB).

Latency tradeoff justification: The 5-second activation threshold means a true deadlock waits ~5 seconds before detection begins, which is 1,000,000x the typical lock latency (~5 μs). This is acceptable because: (1) deadlocks are rare in practice -- most lock waits resolve within milliseconds; (2) the alternative (immediate distributed cycle detection on every wait) would add gossip overhead to every contended lock operation, degrading the common-case latency that the DLM is optimized for; (3) the 5-second threshold matches Linux DLM's deadlock detection timeout and is well within application tolerance for the rare deadlock case.

Local fast-path detection: For locks mastered on the same node, the master performs immediate local cycle detection when enqueueing a new waiter -- if the waiter and all holders in the cycle are on the same node, the deadlock is detected in O(edges) time without any network round-trips, typically within microseconds. The 5-second gossip-based detection is only needed for cross-node deadlock cycles, where the wait-for graph edges span multiple nodes and must be aggregated via the gossip protocol.
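The cycle search over the (node_id, pid)-keyed wait-for graph can be sketched as an iterative DFS (illustrative, not the kernel implementation):

```rust
use std::collections::{HashMap, HashSet};

/// Globally unique process identifier: (node_id, initial-namespace pid).
pub type Gpid = (u32, u32);

/// Edge A -> B means "A waits for a lock held by B". Returns the
/// processes on one cycle, if any; the caller then applies the victim
/// selection policy (youngest by default) to choose who gets EDEADLK.
pub fn find_cycle(edges: &HashMap<Gpid, Vec<Gpid>>) -> Option<Vec<Gpid>> {
    let empty: Vec<Gpid> = Vec::new();
    let mut done: HashSet<Gpid> = HashSet::new(); // fully explored
    for &start in edges.keys() {
        if done.contains(&start) {
            continue;
        }
        // Iterative DFS: `stack` holds (vertex, next successor index);
        // `on_path` mirrors the current DFS path for back-edge checks.
        let mut on_path: Vec<Gpid> = vec![start];
        let mut stack: Vec<(Gpid, usize)> = vec![(start, 0)];
        while let Some(top) = stack.last_mut() {
            let (node, idx) = *top;
            let succs = edges.get(&node).unwrap_or(&empty);
            if idx < succs.len() {
                top.1 += 1;
                let next = succs[idx];
                if let Some(pos) = on_path.iter().position(|&g| g == next) {
                    // Back edge: the path tail from `pos` is a cycle.
                    return Some(on_path[pos..].to_vec());
                }
                if !done.contains(&next) {
                    on_path.push(next);
                    stack.push((next, 0));
                }
            } else {
                done.insert(node);
                on_path.pop();
                stack.pop();
            }
        }
    }
    None
}
```

The cost is O(vertices + edges) per invocation, matching the O(edges) bound quoted for the local fast path.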

14.6.10 Integration with Cluster Membership (Section 5.1.12)

The DLM receives cluster membership events directly from Section 5.1.12's cluster membership protocol:

  • NodeJoined: New node added to consistent hash ring. Some lock resources are remapped to the new master (~1/N of resources). The new node receives resource state from the old masters.
  • NodeSuspect: Heartbeat missed. DLM begins preparing for potential recovery but does NOT stop lock operations. Current lock holders continue normally.
  • NodeDead: Confirmed node failure. DLM initiates recovery for resources mastered on or held by the dead node (Section 14.6.11).
  • NodeLeaving: Graceful departure. Node transfers mastered resources to their new owners before leaving. Zero disruption.

Single heartbeat source: The DLM does NOT run its own heartbeat. It piggybacks on Section 5.1.12.2's HeartbeatMessage, which already includes per-node liveness information. This eliminates the Linux problem where DLM and corosync can disagree on node liveness — in UmkaOS, there is exactly one source of truth for cluster membership.

14.6.11 Recovery Protocol

Four failure scenarios, each with a targeted recovery flow:

1. Lock holder failure (a node holding locks crashes)

Timeline:
  t=0:    Node B crashes while holding locks on resources R1, R2, R3
  t=300ms: Section 5.1.12 heartbeat detects NodeSuspect(B) (3 missed heartbeats at 100ms interval)
  t=1000ms: NodeDead(B) confirmed (10 missed heartbeats)

Recovery (per-resource, NOT global):
  For each resource where B held a lock:
    1. Master removes B's lock from granted queue
    2. If B held EX with dirty LVB: mark LVB as INVALID (sequence = LVB_SEQUENCE_INVALID)
    3. Process converting queue, then waiting queue (grant compatible waiters)
    4. If B held journal lock: trigger journal recovery for B's journal

  Resources NOT involving B: completely unaffected. Zero disruption.

  Lease expiry race handling: NodeSuspect is detected at 300ms (3 missed heartbeats),
  but leases may not expire until their full timeout (metadata: 30s, data: 5s). If the
  master attempts to send revocation messages to B during recovery and B is already
  dead (RDMA Send fails), the master does not block indefinitely waiting for B to
  acknowledge revocation. Instead, the master records B as "revocation pending" and
  proceeds with resource recovery immediately — the lease timeout will naturally
  invalidate B's access rights when it expires. For data locks (5s timeout), the
  recovery completes within the lease window; for metadata locks (30s timeout), the
  master may grant new locks on the resource before B's lease expires. This is
  correct because B is confirmed dead at t=1000ms and cannot access the resource.
  The lease timeout provides a safety net in the corner case where NodeDead
  confirmation is delayed beyond the lease duration — if the master cannot confirm
  B's death, B retains access until lease expiry, preserving correctness at the cost
  of temporary unavailability for incompatible lock requests.
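
The per-resource recovery steps above can be sketched as a small state transition. The types below are simplified stand-ins for the DlmResource queues of Section 14.6.16 (two lock modes instead of six, Vec queues instead of the real structures), not the real implementation:

```rust
// Sketch of scenario-1 recovery: per-resource cleanup after a lock
// holder dies. Simplified stand-ins for the Section 14.6.16 types.

const LVB_SEQUENCE_INVALID: u64 = u64::MAX;

#[derive(Clone, Copy, PartialEq)]
enum Mode { Pr, Ex }

#[derive(Clone, Copy)]
struct Lock { node: u32, mode: Mode, dirty_lvb: bool }

struct Resource {
    granted: Vec<Lock>,
    waiting: Vec<Lock>,
    lvb_sequence: u64,
}

/// Recover one resource after `dead` is confirmed NodeDead.
/// Resources where `dead` holds nothing are left untouched.
fn recover_dead_holder(res: &mut Resource, dead: u32) {
    // 1. Remove the dead node's locks from the granted queue.
    let held_dirty_ex = res.granted.iter()
        .any(|l| l.node == dead && l.mode == Mode::Ex && l.dirty_lvb);
    res.granted.retain(|l| l.node != dead);

    // 2. If the dead node held EX with a dirty LVB, no survivor has
    //    the latest value: invalidate it.
    if held_dirty_ex {
        res.lvb_sequence = LVB_SEQUENCE_INVALID;
    }

    // 3. Grant compatible waiters (simplified: first waiter only).
    if let Some(w) = res.waiting.first().copied() {
        let unblocked = res.granted.iter()
            .all(|g| g.mode == Mode::Pr && w.mode == Mode::Pr);
        if unblocked {
            res.waiting.remove(0);
            res.granted.push(w);
        }
    }
}

fn main() {
    // Node 2 crashes holding EX (dirty LVB) on r1; node 3 is waiting.
    let mut r1 = Resource {
        granted: vec![Lock { node: 2, mode: Mode::Ex, dirty_lvb: true }],
        waiting: vec![Lock { node: 3, mode: Mode::Ex, dirty_lvb: false }],
        lvb_sequence: 42,
    };
    recover_dead_holder(&mut r1, 2);
    assert_eq!(r1.lvb_sequence, LVB_SEQUENCE_INVALID); // dirty LVB lost
    assert_eq!(r1.granted[0].node, 3);                 // waiter granted

    // A resource not involving node 2 is completely unaffected.
    let mut r2 = Resource {
        granted: vec![Lock { node: 4, mode: Mode::Pr, dirty_lvb: false }],
        waiting: vec![],
        lvb_sequence: 7,
    };
    recover_dead_holder(&mut r2, 2);
    assert_eq!(r2.lvb_sequence, 7);
}
```

The journal-recovery trigger (step 4 in the text) is omitted; it hangs off the same per-resource walk.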

2. Lock master failure (the node responsible for a resource's lock queues crashes)

Timeline:
  t=0:    Node M crashes (was master for resources hashing to M)
  t=1000ms: NodeDead(M) confirmed (10 missed heartbeats per Section 5.1.12.2)

Recovery:
  1. Consistent hashing reassigns M's resources to surviving nodes.
     (~1/N resources move, distributed across all survivors.)
  2. Each survivor that held locks on M's resources reports its lock
     state to the new master via RDMA Send.
  3. New master rebuilds granted/converting/waiting queues from
     survivor reports.
  4. Lock operations resume for affected resources.

  Timeline: ~50-200ms for affected resources.
  All other resources: unaffected (their masters are alive).

3. Split-brain (network partition divides cluster)

Inherits Section 5.1.12.3's quorum protocol:
  • Majority partition: Continues normal DLM operation. Resources mastered on nodes in the minority partition are remapped.
  • Minority partition: Blocks new EX/PW lock acquisitions to prevent conflicting writes. Existing EX/PW locks are downgraded to PR — the holder retains the lock (avoiding re-acquisition on partition heal) but cannot write. Dirty pages held under the downgraded lock are flushed before the downgrade completes (targeted writeback, Section 14.6.8). Existing PR and CR locks remain valid for local cached reads.

How nodes learn they are in the minority partition: The cluster membership subsystem (Section 5.1.12) calls dlm_partition_event(PartitionRole::Minority) on the DLM when quorum is lost. This is the single notification entry point — the DLM does not independently monitor heartbeats or quorum; it relies entirely on the membership layer's event. The event is delivered on a dedicated kernel thread and holds the DLM partition_lock during processing to serialize with ongoing lock grant decisions.

In-flight write handling: An in-flight write is any operation where a write() syscall has returned to userspace but the dirtied pages have not yet been included in the LockDirtyTracker for the covering EX lock. Two sub-cases:

Case A — write() completed before partition detected: Pages are already in the dirty page cache and tracked by LockDirtyTracker. The downgrade flushes them via targeted writeback (normal path).

Case B — write() in progress (PTE dirty bit set, LockDirtyTracker not yet updated): The VFS page-fault path sets the dirty bit before returning to userspace. DLM's partition handler waits for the write_seq counter to stabilize (spin at most 1ms — write() syscall cannot hold a page lock indefinitely) then calls sync_file_range(ALL) on all files covered by EX locks. This forces any PTE-dirty pages into tracked writeback before downgrade.

Atomic writeback-then-downgrade sequence:

For each EX or PW lock held by this node:
  1. Set lock.state = LOCK_CONVERTING (blocks new writers via KABI fence).
  2. Flush in-flight writes: sync_file_range(file, lock.range.start, lock.range.end). This is synchronous: returns only when all dirty pages in the range are submitted to the block layer (not necessarily persisted to disk).
  3. Call targeted_writeback_flush(lock) (Section 14.6.8): Walk LockDirtyTracker, submit writeback for each dirty page. Wait for writeback completion (submit + await journal commit).
  4. Only after step 3 completes: change lock mode from EX/PW → PR. This is the atomic downgrade: no window where lock is PR but pages are dirty.
  5. Send LOCK_DOWNGRADE message to lock master (majority partition). Master updates granted queue: replaces EX entry with PR entry.

The "atomic" guarantee is within a single CPU: steps 3→4 are serialized by partition_lock. Concurrent readers (PR/CR holders) may read stale data from the page cache during the flush window (steps 2-3), but they cannot read partially-flushed state because each page is either fully clean or fully dirty at page cache granularity. No intermediate state is visible.

Lease enforcement is suspended in the minority partition: since masters in the majority partition cannot be reached for lease renewal, lease expiry cannot be used to revoke locks. No new writes are permitted. No data corruption is possible because the minority cannot acquire or hold write locks, and read-only access to stale data is explicitly safe for PR/CR modes at the filesystem level (no on-disk corruption or metadata structure damage, though application-visible staleness is possible: e.g., readdir may return deleted entries or miss new files created on the majority partition). Applications requiring linearizable reads (e.g., databases with ACID guarantees) may see stale values during the partition; this is inherent to any system that allows minority-partition reads (CAP theorem).

DSM integration: The DLM's write-lock downgrade is consistent with the DSM's SUSPECT page mechanism (Section 5.1.12.3): DSM write-protects SUSPECT pages while allowing reads. Both subsystems independently block writes in the minority partition, providing defense-in-depth.

Partition heals: Minority nodes rejoin. Lock state is reconciled:
  1. Minority nodes report their held lock state to the (majority-elected) masters.
  2. Masters compare against current granted queues (majority wins for conflicts).
  3. Any minority-held locks that conflict with locks granted during the partition are forcibly revoked on the minority nodes (cached data invalidated).
  4. Non-conflicting locks are re-validated and lease timers restarted.
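
As an illustration of the ordering invariant in steps 1-5, here is a minimal mock (in-memory stand-ins for the VFS and block-layer calls, with a trace in place of real I/O) that checks the downgrade never happens while dirty pages remain:

```rust
// Sketch of the minority-partition writeback-then-downgrade sequence.
// The invariant: the lock never reads as PR while its range still has
// dirty pages tracked by the (mocked) LockDirtyTracker.

#[derive(Clone, Copy, PartialEq, Debug)]
enum LockState { Ex, Converting, Pr }

struct HeldLock {
    state: LockState,
    dirty_pages: Vec<u64>, // page indices, stand-in for LockDirtyTracker
}

struct Trace(Vec<&'static str>);

fn downgrade_for_partition(lock: &mut HeldLock, trace: &mut Trace) {
    // Step 1: block new writers before touching dirty state.
    lock.state = LockState::Converting;
    trace.0.push("fence_new_writers");

    // Step 2: flush in-flight writes (stand-in for sync_file_range).
    trace.0.push("sync_file_range");

    // Step 3: targeted writeback — walk the dirty tracker, submit,
    // and wait for completion (stand-in for targeted_writeback_flush).
    lock.dirty_pages.clear();
    trace.0.push("targeted_writeback_flush");

    // Step 4: only now is the downgrade safe: no dirty pages remain.
    assert!(lock.dirty_pages.is_empty());
    lock.state = LockState::Pr;
    trace.0.push("downgrade_to_pr");

    // Step 5: inform the master in the majority partition.
    trace.0.push("send_lock_downgrade");
}

fn main() {
    let mut lock = HeldLock { state: LockState::Ex, dirty_pages: vec![1, 2, 3] };
    let mut trace = Trace(Vec::new());
    downgrade_for_partition(&mut lock, &mut trace);
    assert_eq!(lock.state, LockState::Pr);
    // The downgrade happens strictly after the writeback flush.
    let flush = trace.0.iter().position(|s| *s == "targeted_writeback_flush").unwrap();
    let down = trace.0.iter().position(|s| *s == "downgrade_to_pr").unwrap();
    assert!(flush < down);
}
```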

4. Simultaneous holder + master failure (the node holding locks is also the master for those resources, or both the holder and master crash at the same time)

Timeline:
  t=0:    Node B crashes. B held EX on resources R1, R2 (with dirty LVBs).
          B was also the master for R1 (self-mastered). Node M was the master
          for R2 and also crashes at t=0 (e.g., rack power failure).
  t=1000ms: NodeDead(B) and NodeDead(M) confirmed.

Recovery (composes scenarios 1 + 2, master rebuild first):

  Phase 1 — Master rebuild (scenario 2):
    1. Consistent hashing reassigns R1 (was mastered on B) and R2 (was
       mastered on M) to surviving nodes. New master N1 gets R1, new master
       N2 gets R2.
    2. Surviving nodes report their lock state to N1 and N2:
       - For R1: Node C reports "I have PR on R1", Node D reports "I am
         waiting for EX on R1." No node reports holding EX on R1.
       - For R2: Node C reports "I have PR on R2." No node reports holding
         EX on R2.
    3. N1 and N2 rebuild granted/converting/waiting queues from survivor
       reports. Dead node B's locks are absent (B cannot report).

  Phase 2 — Dead holder cleanup (scenario 1, applied by new masters):
    4. N1 examines R1's rebuilt state: PR holders exist (C), but no EX
       holder. A waiting EX request exists (D). N1 infers that the dead
       node B held the missing EX lock:
       - INFERENCE RULE: If a resource has waiters for an incompatible mode
         but no granted lock blocking them, the dead node(s) held the
         blocking lock. The new master does not need to know WHICH dead
         node — the lock is simply gone.
    5. N1 marks R1's LVB as INVALID (LVB_SEQUENCE_INVALID) because the
       dead EX holder may have written a dirty LVB that no survivor has.
    6. N1 processes the waiting queue: grants D's EX request on R1.
    7. N2 performs the same for R2: marks LVB INVALID, grants waiters.

  Phase 3 — Journal recovery:
    8. If B held journal locks, journal recovery runs against B's journal
       slot (same as scenario 1 step 4). The new master coordinates this.

  Timeline: same as scenario 2 (~50-200ms for affected resources).
  The holder cleanup (phase 2) adds negligible time — it is local queue
  manipulation on the new master, no network round-trips.

The key insight is ordering: master rebuild (phase 1) must complete before dead holder cleanup (phase 2), because the new master needs the rebuilt queue state to infer which locks the dead node held. An implementer must NOT attempt scenario 1 cleanup before scenario 2 rebuild — the old master is dead and cannot execute holder cleanup steps.
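
The inference rule of phase 2 can be captured in a few lines. This sketch uses a two-mode (PR/EX) subset of the six-mode compatibility matrix; the shape of the check is the same with all six modes:

```rust
// Phase-2 INFERENCE RULE sketch: after the master rebuild, a waiter
// that no surviving granted lock blocks must have been blocked by a
// dead node's (now vanished) lock, so the new master invalidates the
// LVB before granting.

#[derive(Clone, Copy, PartialEq)]
enum Mode { Pr, Ex }

/// PR is compatible with PR; any pairing involving EX conflicts.
fn compatible(a: Mode, b: Mode) -> bool {
    matches!((a, b), (Mode::Pr, Mode::Pr))
}

/// True if some waiter is unblocked by every surviving granted lock —
/// the signature that a dead node held the blocking lock.
fn infer_dead_blocker(granted: &[Mode], waiting: &[Mode]) -> bool {
    waiting
        .iter()
        .any(|w| granted.iter().all(|g| compatible(*g, *w)))
}

fn main() {
    // Survivors hold nothing, yet an EX waiter exists: the dead node
    // must have held the blocking lock. Invalidate the LVB.
    assert!(infer_dead_blocker(&[], &[Mode::Ex]));
    // A surviving PR holder fully explains why an EX waiter waits:
    // no inference about the dead node is needed.
    assert!(!infer_dead_blocker(&[Mode::Pr], &[Mode::Ex]));
}
```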

Key difference from Linux: NO global recovery quiesce. Linux's DLM stops ALL lock activity cluster-wide while recovering from ANY node failure. This is because Linux's DLM recovery protocol requires a globally consistent view of all lock state before it can proceed — every node must acknowledge the recovery, and no new lock operations can be processed until all nodes agree.

UmkaOS's DLM recovers per-resource: only resources mastered on or held by the dead node require recovery. The remaining (typically 90%+) of lock resources continue operating without any pause.

14.6.12 UmkaOS Recovery Advantage

The combination of umka-core's architecture and the per-resource DLM recovery protocol creates a fundamentally different failure experience:

Linux path (storage driver crash on Node B):

t=0:      Driver crash
t=0-30s:  Fencing: cluster must confirm B is dead (IPMI/BMC power-cycle
          or SCSI-3 PR revocation). Conservative timeout.
t=30-90s: Reboot: Node B reboots, OS loads, cluster stack starts.
t=90-120s: Rejoin: B rejoins cluster. DLM recovery begins.
          GLOBAL QUIESCE: ALL nodes stop ALL lock operations.
t=120-130s: DLM recovery: all nodes exchange lock state, rebuild queues.
t=130s:    Normal operation resumes.
Total: 80-130 seconds of disruption. ALL nodes affected.

UmkaOS path (storage driver crash on Node B):

t=0:       Driver crash in Tier 1 storage driver.
t=0:       DLM heartbeat CONTINUES (heartbeat is in umka-core, not the
           storage driver). Cluster does NOT detect a node failure.
t=50-150ms: Driver reloads (Tier 1 recovery, Section 10.8). State restored
           from checkpoint.
t=150ms:   Driver operational. Lock state was never lost (DLM is in
           umka-core). No fencing needed. No recovery needed.
Total: 50-150ms I/O pause on Node B only. Zero lock disruption.
Zero impact on other nodes.

The difference is architectural: in Linux, the DLM runs in the same failure domain as storage drivers (all are kernel modules that crash together). In UmkaOS, the DLM is in umka-core — it survives driver crashes. The DLM only needs recovery when umka-core itself fails (which means the entire node is down).

14.6.13 Application-Level Distributed Locking

The DLM provides application-visible locking interfaces:

  • flock() on clustered filesystem → transparently maps to DLM lock operations. Applications using flock() for coordination get cluster-wide locking without code changes.
  • fcntl(F_SETLK) byte-range locks → DLM range lock resources. POSIX byte-range locks on clustered filesystems provide true cluster-wide exclusion.
  • Explicit DLM API via /dev/dlm → compatible with Linux's dlm_controld interface. Applications that use libdlm for explicit distributed locking work without modification.
  • flock2() system call (new, UmkaOS extension) — enhanced distributed lock with:
  • Lease semantics: caller specifies desired lease duration
  • Failure callback: notification when lock is lost due to node failure
  • Partition behavior: configurable (block, release, or fence)
  • Batch support: lock multiple files in a single system call
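
The text defines flock2() only at the feature level. The argument layout below is purely an illustrative assumption of how the four features could be encoded — the names and field types are hypothetical, not a committed ABI:

```rust
// Hypothetical flock2() argument struct (illustration only; the real
// ABI is unspecified in this chapter).

#[repr(u32)]
#[derive(Clone, Copy, PartialEq)]
enum PartitionBehavior { Block = 0, Release = 1, Fence = 2 }

#[repr(C)]
struct Flock2Args {
    /// File descriptors to lock atomically (batch support).
    fds: *const i32,
    nfds: u32,
    /// Requested lease duration in nanoseconds (lease semantics).
    lease_ns: u64,
    /// What to do with held locks when this node lands in a minority
    /// partition (configurable partition behavior).
    partition: PartitionBehavior,
    /// Signal delivered when the lock is lost to node failure
    /// (failure callback); 0 = no notification.
    lost_signal: i32,
}

fn main() {
    let fds = [3i32, 4];
    let args = Flock2Args {
        fds: fds.as_ptr(),
        nfds: fds.len() as u32,
        lease_ns: 5_000_000_000, // request a 5 s lease
        partition: PartitionBehavior::Fence,
        lost_signal: 0,
    };
    assert_eq!(args.nfds, 2);
    assert!(args.partition == PartitionBehavior::Fence);
}
```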

14.6.14 Capability Model

DLM operations are gated by capabilities (Section 8.1):

Capability Permits
CAP_DLM_LOCK Acquire, convert, and release locks on resources in permitted lockspaces
CAP_DLM_ADMIN Create and destroy lockspaces, configure parameters, view lock state
CAP_DLM_CREATE Create new lock resources (for application-level locking via /dev/dlm)

Lockspaces provide namespace isolation — a container with CAP_DLM_LOCK scoped to its own lockspace cannot interfere with locks in other lockspaces. GFS2 creates a lockspace per filesystem; applications create lockspaces via /dev/dlm.

14.6.15 Performance Summary

Operation UmkaOS Latency Linux DLM Improvement
Uncontested acquire ~3-5 μs (RDMA CAS + confirmation) ~30-50 μs (TCP) ~10-15×
Uncontested acquire + LVB read ~4-6 μs ~100 μs ~20×
Contested acquire (same master) ~5-8 μs (RDMA Send) ~100-200 μs (TCP) ~20-30×
Batch N locks (same master) ~5-10 μs N × 30-50 μs ~8N×
Lock any of N resources ~5-10 μs N × 30-50 μs (sequential) ~8N×
Lease extension ~1-2 μs (RDMA Write) N/A (no leases)
Lock holder recovery ~50-200 ms (affected resources only) 5-10 s (global quiesce) ~50×
Lock master recovery ~200-500 ms (affected resources only) 5-10 s (global quiesce) ~20×

Arithmetic basis: RDMA CAS latency is measured at 1.5-2.5 μs on InfiniBand HDR (200 Gb/s) and RoCEv2 (100 Gb/s) in published benchmarks. The full uncontested acquire includes the raw CAS (~1.5-2.5 μs) plus the mandatory confirmation RDMA Send (~1-2 μs), totaling ~3-5 μs. RDMA Send/Receive for contested locks adds ~1-2 μs for receive-side processing. Linux DLM TCP latency includes TCP stack processing (~15-20 μs round-trip), DLM lock manager processing (~10-15 μs), and completion notification (~5-10 μs), totaling ~30-50 μs in published GFS2 benchmarks. Note: The Linux DLM runs entirely in-kernel since kernel 2.6; dlm_controld handles only membership events, not lock operations.

14.6.16 Data Structures

/// Fixed-capacity open-addressing hash table shard.
/// Capacity is chosen at construction time and never changes — no rehashing,
/// no heap allocation after initialization, no spinlock hold during allocation.
pub struct ShardedMapShard<K, V, const CAP: usize> {
    /// Open-addressing table. Each slot is Option<(K, V)>.
    /// CAP must be a power of 2. Load factor kept ≤ 0.75 by construction.
    slots: [Option<(K, V)>; CAP],
    count: usize,
}

/// Sharded lock table for DLM. Each shard has its own spinlock to minimize contention.
///
/// ShardedMap uses fixed-capacity open-addressing to ensure spinlock hold times are
/// bounded and O(1). The DLM must pre-allocate sufficient capacity based on expected
/// concurrent lock count; capacity exhaustion returns `LockError::TableFull` rather
/// than blocking. Insertion returns `Err` if the load factor would exceed 75%.
/// `insert_or_update` and `remove` complete in bounded time under the spinlock —
/// there is no rehashing, no heap allocation, and no unbounded iteration.
pub struct ShardedMap<K: Hash + Eq, V, const SHARDS: usize = 256, const SHARD_CAP: usize = 64> {
    shards: [SpinLock<ShardedMapShard<K, V, SHARD_CAP>>; SHARDS],
}
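
A runnable sketch of the shard insert path follows. Vec stands in for the const-generic array so the sketch compiles standalone; the point is the bounded-probe, no-rehash logic described in the comments above:

```rust
// Shard insert sketch: bounded-time open addressing with the ≤ 75%
// load factor rule. Capacity is fixed at construction; a full shard
// returns TableFull instead of rehashing or allocating.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct Shard<K, V> {
    slots: Vec<Option<(K, V)>>, // fixed at construction, power of 2
    count: usize,
}

#[derive(Debug, PartialEq)]
enum LockError { TableFull }

impl<K: Hash + Eq, V> Shard<K, V> {
    fn new(cap: usize) -> Self {
        assert!(cap.is_power_of_two());
        Shard { slots: (0..cap).map(|_| None).collect(), count: 0 }
    }

    fn insert(&mut self, key: K, val: V) -> Result<(), LockError> {
        // Refuse rather than rehash: load factor must stay ≤ 75%.
        if 4 * (self.count + 1) > 3 * self.slots.len() {
            return Err(LockError::TableFull);
        }
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let mask = self.slots.len() - 1;
        let mut i = (h.finish() as usize) & mask;
        // Linear probing terminates: load factor < 1 guarantees a
        // free slot within one pass, so spinlock hold time is bounded.
        loop {
            let hit = matches!(&self.slots[i], Some((k, _)) if *k == key);
            if hit || self.slots[i].is_none() {
                if self.slots[i].is_none() { self.count += 1; }
                self.slots[i] = Some((key, val)); // insert or update
                return Ok(());
            }
            i = (i + 1) & mask;
        }
    }
}

fn main() {
    let mut shard: Shard<u32, &str> = Shard::new(4); // 75% of 4 slots = 3 entries
    assert!(shard.insert(1, "a").is_ok());
    assert!(shard.insert(2, "b").is_ok());
    assert!(shard.insert(3, "c").is_ok());
    // Fourth insert would exceed the load factor: rejected, not rehashed.
    assert_eq!(shard.insert(4, "d"), Err(LockError::TableFull));
}
```

In the kernel the table is the const-generic array above and the hash is seeded per-lockspace; the Vec and DefaultHasher here are only for a standalone build.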
/// DLM lockspace — namespace for a set of related lock resources.
pub struct DlmLockspace {
    /// Lockspace name (e.g., "gfs2:550e8400-e29b" for a GFS2 filesystem).
    pub name: LockspaceName,

    /// Lock resources in this lockspace.
    /// Sharded concurrent hash map: 256 shards, each with its own SpinLock.
    /// Shard = hash(resource_name) & 0xFF. This reduces lock contention from
    /// a single global bottleneck to per-shard contention. Individual lock
    /// operations only hold their shard's SpinLock, allowing concurrent access
    /// to resources in different shards. DlmResource entries are allocated
    /// from a per-lockspace slab allocator.
    pub resources: ShardedMap<ResourceName, DlmResource, 256>,

    /// Lease configuration for this lockspace.
    pub lease_config: LeaseConfig,

    /// Deadlock detection state.
    pub wait_for_graph: Mutex<WaitForGraph>,

    /// Statistics counters.
    pub stats: DlmStats,
}

/// Per-lockspace lease configuration.
pub struct LeaseConfig {
    /// Default lease duration for metadata locks.
    pub metadata_lease_ns: u64,

    /// Default lease duration for data locks.
    pub data_lease_ns: u64,

    /// Default lease duration for application locks.
    pub app_lease_ns: u64,

    /// Grace period after lease expiry before forced revocation.
    pub grace_period_ns: u64,
}

/// DLM statistics (per-lockspace, exposed via umkafs Section 19.4).
pub struct DlmStats {
    /// Total lock operations (acquire + convert + release).
    pub lock_ops: AtomicU64,

    /// Operations served by RDMA CAS fast path (uncontested).
    pub fast_path_ops: AtomicU64,

    /// Operations requiring RDMA Send (contested).
    pub slow_path_ops: AtomicU64,

    /// Batch operations.
    pub batch_ops: AtomicU64,

    /// Lock-any-of operations.
    pub lock_any_ops: AtomicU64,

    /// Deadlocks detected.
    pub deadlocks_detected: AtomicU64,

    /// Recovery events (holder + master).
    pub recovery_events: AtomicU64,
}

14.6.17 Licensing

The VMS/DLM lock model is published academic work (VAX/VMS Internals and Data Structures, Digital Press, 1984). The six-mode compatibility matrix, Lock Value Block concept, and granted/converting/waiting queue model are well-documented in public literature and implemented by multiple independent projects (Linux DLM, Oracle DLM, HP OpenVMS DLM). No patent or proprietary IP concerns.

RDMA Atomic CAS and Send/Receive operations are standard InfiniBand/RoCE verbs defined by the IBTA (InfiniBand Trade Association) specification, which is publicly available.

14.6.18 DLM Master Election and Heartbeat Protocol

The DLM uses a deterministic master election based on node ranking rather than a Paxos/Raft round to minimize election latency in the common case (no failures).

Master selection rule: The node with the lowest node_id among currently healthy cluster members is the DLM master. On membership change (join/leave), all nodes independently compute the new master from the updated membership view — no election protocol needed. This requires consistent failure detection.

/// DLM master state. One instance per DLM domain (per filesystem/cluster).
pub struct DlmMaster {
    /// Node ID of the current master (determined by lowest-node-id rule).
    /// Atomically updated on membership changes. Zero = no master (election in progress).
    pub master_node_id: AtomicU32,
    /// True if this node is the current DLM master.
    pub is_master: AtomicBool,
    /// Monotonic epoch counter. Incremented on each master transition.
    /// Used to detect stale messages from a previous master.
    pub epoch: AtomicU64,
    /// Per-peer heartbeat tracking. Index = node_id.
    pub peers: Box<[DlmPeerState; MAX_CLUSTER_NODES]>,
    /// Last time this node received any message from each peer (nanoseconds).
    pub last_heard_ns: Box<[AtomicU64; MAX_CLUSTER_NODES]>,
}

/// Heartbeat and failure detection state for one peer node.
pub struct DlmPeerState {
    /// Node is considered live (last heartbeat within FAILURE_TIMEOUT_NS).
    pub alive: AtomicBool,
    /// Number of consecutive missed heartbeats.
    pub missed_heartbeats: AtomicU32,
    /// Sequence number of the last heartbeat received from this peer.
    pub last_seq: AtomicU64,
}

/// DLM heartbeat interval: every 500ms a heartbeat message is sent to all peers.
/// Lower than typical cluster monitors (1-2s) for faster DLM-specific failure detection.
pub const DLM_HEARTBEAT_INTERVAL_NS: u64 = 500_000_000; // 500 ms

/// A node is declared failed if no heartbeat is received for this duration.
/// 3× heartbeat interval tolerates two consecutive dropped heartbeat packets.
pub const DLM_FAILURE_TIMEOUT_NS: u64 = 1_500_000_000; // 1.5 s (3 × 500 ms)

/// After declaring a node failed, wait this long before reclaiming its locks.
/// Allows the failed node's RDMA NIC to drain in-flight operations.
pub const DLM_LOCK_RECLAIM_DELAY_NS: u64 = 200_000_000; // 200 ms

Failure detection algorithm (runs in kthread/dlm_monitor):

  1. Every DLM_HEARTBEAT_INTERVAL_NS (500 ms): send DlmHeartbeat { node_id, epoch, seq } to all peers via RDMA UD (unreliable datagram) multicast.
  2. On receiving a heartbeat: update last_heard_ns[sender], reset missed_heartbeats[sender].
  3. On each monitor wakeup: for each peer, if now_ns - last_heard_ns[peer] > DLM_FAILURE_TIMEOUT_NS and peer.alive == true:
  4. Set peer.alive = false, increment missed_heartbeats.
  5. Notify the membership layer (triggers cluster reconfiguration).
  6. After DLM_LOCK_RECLAIM_DELAY_NS, begin reclaiming locks granted to the failed node.
  7. Master recomputation: after any membership change, all nodes compute new_master = min(alive_node_ids). If new_master != master_node_id, atomically swap and increment epoch.
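
Steps 3-7 can be exercised standalone with an injected clock. `Peer` and `monitor_tick` below are simplified stand-ins for DlmPeerState and the kthread/dlm_monitor loop (heartbeat send and membership notification are elided):

```rust
// One wakeup of the failure-detection monitor: expire silent peers,
// then recompute the master as the lowest live node id.

const DLM_FAILURE_TIMEOUT_NS: u64 = 1_500_000_000; // 1.5 s (3 × 500 ms)

struct Peer { node_id: u32, alive: bool, last_heard_ns: u64 }

/// Returns the recomputed master id: min(alive node ids), counting
/// this node itself as alive.
fn monitor_tick(now_ns: u64, self_id: u32, peers: &mut [Peer]) -> u32 {
    for p in peers.iter_mut() {
        if p.alive && now_ns.saturating_sub(p.last_heard_ns) > DLM_FAILURE_TIMEOUT_NS {
            p.alive = false; // would notify the membership layer here
        }
    }
    peers.iter()
        .filter(|p| p.alive)
        .map(|p| p.node_id)
        .chain(std::iter::once(self_id))
        .min()
        .unwrap() // non-empty: self_id is always present
}

fn main() {
    let mut peers = vec![
        Peer { node_id: 1, alive: true, last_heard_ns: 0 },             // silent since boot
        Peer { node_id: 2, alive: true, last_heard_ns: 1_900_000_000 }, // heard 100 ms ago
    ];
    // At t=2 s, node 1 has been silent for 2 s > 1.5 s: declared failed,
    // and mastership recomputes to the lowest live id.
    let master = monitor_tick(2_000_000_000, 3, &mut peers);
    assert!(!peers[0].alive);
    assert!(peers[1].alive);
    assert_eq!(master, 2);
}
```

The real path would then wait DLM_LOCK_RECLAIM_DELAY_NS before reclaiming the failed node's locks, and swap master_node_id/epoch atomically on a change.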

Relationship to Section 14.6.10: The cluster membership layer (Section 5.1.12) remains the single source of truth for node liveness: the DLM acts only on its confirmed NodeDead events for authoritative failure decisions. DLM_HEARTBEAT_INTERVAL_NS governs a separate, DLM-scoped monitoring signal (500 ms, tuned for lock reclaim timing) that pre-stages lock reclaim before the membership layer confirms a failure; it never declares a node dead on its own. This two-tier approach eliminates the Linux DLM problem where DLM and corosync can disagree on node liveness during partial-failure scenarios.


14.7 Persistent Memory

14.7.1 The Hardware

CXL-attached persistent memory is coming (Samsung CMM-H with NAND-backed persistence via CXL GPF, SK Hynix). Also: battery-backed DRAM (NVDIMM-N) for enterprise storage. The model: byte-addressable memory that survives power loss.

14.7.2 Design: DAX (Direct Access) Integration

// umka-core/src/mem/persistent.rs

/// Persistent memory region descriptor.
pub struct PersistentMemoryRegion {
    /// Physical address range.
    pub base: PhysAddr,
    pub size: u64,

    /// NUMA node this persistent memory is attached to.
    pub numa_node: u16,

    /// Technology type (affects performance characteristics).
    pub tech: PmemTechnology,

    /// Is this region backed by a filesystem (DAX mode)?
    pub dax_device: Option<DeviceNodeId>,
}

#[repr(u32)]
pub enum PmemTechnology {
    /// Intel Optane / 3D XPoint (legacy, for existing deployments).
    Optane          = 0,
    /// CXL-attached persistent memory.
    CxlPersistent   = 1,
    /// Battery-backed DRAM (NVDIMM-N).
    BatteryBacked   = 2,
}

14.7.3 Memory-Mapped Persistent Storage

When a filesystem on persistent memory is mounted with DAX:

Standard file I/O (non-DAX):
  read() → VFS → page cache → memcpy to userspace
  write() → VFS → page cache → writeback → storage device

DAX file I/O:
  read() → VFS → mmap directly to persistent memory → load instruction
  write() → VFS → store instruction → persistent memory
  No page cache. No copies. No writeback.
  CPU load/store talks directly to persistent media.

The memory manager must handle persistent pages differently:
  • Persistent pages are NOT evictable (they ARE the storage)
  • fsync() → CPU cache flush (CLWB/CLFLUSH) not block I/O
  • MAP_SYNC flag ensures metadata (file size, timestamps) is also persistent
  • Crash consistency: partial writes are visible after reboot (see Section 14.7.4)

14.7.4 Crash Consistency Protocol

Persistent memory stores survive power loss, but CPU caches do not. Without explicit cache flushing, writes to persistent memory may be reordered or lost in the CPU write-back cache. The kernel must enforce a strict persistence protocol:

Persistence primitives (x86):
  CLWB addr     — Write-back cache line, leave line CLEAN but VALID in cache.
                  (Preferred: no performance penalty on subsequent reads.)
  CLFLUSHOPT addr — Flush cache line, INVALIDATE from cache.
                  (Legacy: forces re-fetch on next read.)
  SFENCE        — Store fence. Guarantees all preceding CLWB/CLFLUSHOPT
                  have reached the persistence domain (ADR/eADR boundary).

Correct write sequence for persistent data:
  1. Store data to persistent memory region (mov/memcpy)
  2. CLWB for each modified cache line (64 bytes each)
  3. SFENCE  ← data is now durable
  4. Store metadata update (e.g., committed flag, log tail pointer)
  5. CLWB for metadata cache line(s)
  6. SFENCE  ← metadata is now durable (atomically marks data as committed)

ARM equivalent:
  DC CVAP addr  — Clean data cache to Point of Persistence (ARMv8.2+)
  DSB           — Data Synchronization Barrier

fsync() on a DAX-mounted filesystem translates to CLWB + SFENCE (not block I/O). msync(MS_SYNC) on DAX mappings follows the same path. The kernel provides pmem_flush() and pmem_drain() helpers that abstract the architecture-specific instructions.
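
The six-step sequence can be checked against a mock persistence domain that records primitives instead of issuing real CLWB/SFENCE, verifying the data-before-commit-flag ordering without pmem hardware (the `Op` log is a test harness, not kernel code):

```rust
// Mock of the correct write sequence for persistent data: the commit
// flag must not become durable before the data it covers.

#[derive(Debug, PartialEq)]
enum Op { Store(&'static str), Clwb(&'static str), Sfence }

fn commit_record(log: &mut Vec<Op>) {
    log.push(Op::Store("data"));   // 1. store payload to pmem
    log.push(Op::Clwb("data"));    // 2. write back each dirty 64 B line
    log.push(Op::Sfence);          // 3. data is now durable
    log.push(Op::Store("commit")); // 4. flip committed flag / log tail
    log.push(Op::Clwb("commit"));  // 5. write back the flag's line
    log.push(Op::Sfence);          // 6. flag durable: record committed
}

fn main() {
    let mut log = Vec::new();
    commit_record(&mut log);
    // The commit-flag store must come after the fence that made the
    // data durable; otherwise a crash could expose a set commit flag
    // over incomplete data.
    let data_fence = log.iter().position(|o| *o == Op::Sfence).unwrap();
    let commit = log.iter().position(|o| *o == Op::Store("commit")).unwrap();
    assert!(data_fence < commit);
}
```

On real hardware the pushes in steps 2-3 and 5-6 become pmem_flush()/pmem_drain(), which compile to CLWB+SFENCE on x86 and DC CVAP+DSB on ARMv8.2+.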

14.7.5 PMEM Error Handling

Persistent memory is physical media and can develop errors (bit rot, wear-out, manufacturing defects). The error model mirrors Linux badblocks:

Error sources:
  1. UCE (Uncorrectable Error) — MCE (Machine Check Exception) on x86,
     SEA (Synchronous External Abort) on ARM.
     CPU receives #MC / abort when reading a poisoned cache line.

  2. ARS (Address Range Scrub) — ACPI background scan discovers latent
     errors before they're read. Results reported via ACPI NFIT.

  3. CXL Media Error — CXL 3.0 devices report media errors via CXL
     event log (Get Event Records command).

Kernel response:
  MCE/SEA on PMEM page:
    1. Mark physical page as HWPoison (same as DRAM MCE path).
    2. Add to per-region badblocks list.
    3. If a process has the page mapped:
       a. DAX mapping → deliver SIGBUS (BUS_MCEERR_AR) with fault address.
       b. Process can handle SIGBUS and skip/retry the corrupted region.
    4. Filesystem (ext4/xfs DAX) is notified via dax_notify_failure().
       Filesystem marks affected file range as damaged.

  ARS/CXL background error:
    1. ACPI notification or CXL event interrupt.
    2. Add to badblocks list.
    3. If mapped: deliver SIGBUS (BUS_MCEERR_AO — action optional).
    4. Userspace can query badblocks via /sys/block/pmemN/badblocks.

14.7.6 Integration with Memory Tiers

Persistent memory becomes another level in the memory hierarchy. Note: the "Memory Level" numbering below refers to the memory distance hierarchy, NOT the driver isolation tiers (Tier 0/1/2) used elsewhere in this architecture.

Existing memory levels (Section 21.2, Section 5.1.7):
  Level 0: Per-CPU caches
  Level 1: Local DRAM
  Level 2: Remote DRAM (cross-socket)
  Level 3: CXL pooled memory
  ...

Extended:
  Level N: Persistent memory (CXL-attached or NVDIMM)
    Properties:
      - Byte-addressable (like DRAM)
      - Survives power loss (like storage)
      - Higher latency than DRAM (~200-500ns vs ~80ns)
      - Lower bandwidth than DRAM
      - Cannot be evicted (it IS the backing store)

14.7.7 Linux Compatibility

Linux persistent memory interfaces are preserved:

/dev/pmem0, /dev/pmem1:       Block device interface (libnvdimm)
/dev/dax0.0, /dev/dax1.0:    Character DAX device (devdax)
mount -o dax /dev/pmem0 /mnt: DAX-mounted filesystem
mmap() with MAP_SYNC:         Guaranteed persistence of metadata

Optane Discontinuation Note:

Intel discontinued Optane persistent memory products in 2022. The persistent memory design in this section is hardware-agnostic — it applies to any byte-addressable persistent medium. CXL 3.0 Type 3 devices with persistence (battery-backed or inherently persistent media) are the expected successor. The PmemTechnology enum includes CxlPersistent for this reason. The DAX path, cache flush protocol, and error handling are technology-independent.

PMEM Namespace Discovery:

Persistent memory regions are discovered via:

  • ACPI NFIT (NVDIMM Firmware Interface Table): For NVDIMM-N and legacy Optane. The NFIT describes each PMEM region's physical address range, interleave set, and health status.
  • CXL DVSEC (Designated Vendor-Specific Extended Capability): For CXL-attached persistent memory. CXL devices advertise memory regions via PCIe DVSEC structures. The kernel's CXL driver enumerates regions and creates /dev/daxN.M device nodes.
  • Namespace management: Regions are partitioned into namespaces via ndctl (userspace tool) using the Linux-compatible namespace management ioctl interface. UmkaOS implements the same ioctls via umka-compat.

14.7.8 Performance Impact

Zero overhead for systems without persistent memory. When persistent memory is present: DAX I/O is faster than standard I/O (eliminates page cache copies and writeback). Performance improves.

14.7.9 Filesystem Repair and Consistency Checking

Filesystem repair (fsck, xfs_repair, btrfs check) is handled by existing Linux userspace utilities running against UmkaOS's block device interface. UmkaOS does not implement in-kernel repair paths — the standard Linux repair tools are unmodified userspace binaries that interact with block devices via standard syscalls (open, read, write, ioctl). Since UmkaOS implements the complete block device interface (Section 14.3) and the relevant filesystem syscalls (Section 18.1), these tools work unchanged:

  • e2fsck / fsck.ext4 for ext4 repair
  • xfs_repair for XFS repair
  • btrfs check / btrfs scrub for btrfs repair (btrfs scrub runs online)
  • ZFS self-heals via block-level checksums (Section 14.2); zpool scrub is the equivalent of fsck for ZFS

No kernel-side changes are needed to support these tools. The only UmkaOS-specific consideration is that filesystem drivers should expose consistent BLKFLSBUF and BLKRRPART ioctl behavior matching Linux, as some repair tools use these to synchronize cache state.

14.7.10 SCSI-3 Persistent Reservations

SCSI-3 Persistent Reservations (PR) are required for shared-storage cluster fencing (Section 14.5). UmkaOS's block I/O layer implements the following PR commands as ioctls on block devices:

  • PR_REGISTER / PR_REGISTER_AND_IGNORE: register a reservation key with the storage target. Each node registers a unique key (derived from node ID).
  • PR_RESERVE: acquire a reservation (Write Exclusive, Exclusive Access, or their "Registrants Only" variants).
  • PR_RELEASE: release a held reservation.
  • PR_CLEAR: clear all registrations and reservations.
  • PR_PREEMPT / PR_PREEMPT_AND_ABORT: preempt another node's reservation (used for fencing — a surviving node preempts the fenced node's key).

These map to SCSI PR IN / PR OUT commands (SPC-4) for SCSI/SAS devices and to NVMe Reservation Register/Acquire/Release/Report commands for NVMe devices. The block layer translates between the common ioctl interface and the device-specific command set. The fencing integration with Section 5.1.12's membership protocol uses PR_PREEMPT_AND_ABORT to revoke a dead node's storage access before recovering its DLM locks.
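
A sketch of the fencing flow under these commands, with an in-memory target standing in for the SPC-4 service actions (reservation semantics are simplified: write permission is reduced to key registration, ignoring reservation-type nuances):

```rust
// Fencing sketch: each node registers a key derived from its node id;
// a survivor fences a dead peer with PREEMPT_AND_ABORT, after which
// the dead node's key can no longer reach the LUN.

use std::collections::HashSet;

struct PrTarget {
    registered: HashSet<u64>,
    reservation: Option<u64>, // e.g. Write Exclusive - Registrants Only holder
}

impl PrTarget {
    fn register(&mut self, key: u64) { self.registered.insert(key); } // PR_REGISTER

    fn reserve(&mut self, key: u64) -> bool {                          // PR_RESERVE
        if self.registered.contains(&key) && self.reservation.is_none() {
            self.reservation = Some(key);
            true
        } else { false }
    }

    /// PR_PREEMPT_AND_ABORT: drop the victim's registration (and its
    /// reservation, if held), transferring the reservation to `by`.
    fn preempt_and_abort(&mut self, by: u64, victim: u64) -> bool {
        if !self.registered.contains(&by) { return false; }
        self.registered.remove(&victim);
        if self.reservation == Some(victim) { self.reservation = Some(by); }
        true
    }

    fn can_write(&self, key: u64) -> bool { self.registered.contains(&key) }
}

/// Reservation key derived from the node id ("UMKA" prefix is an
/// illustrative convention, not specified in the text).
fn pr_key(node_id: u32) -> u64 { 0x554D_4B41_0000_0000 | node_id as u64 }

fn main() {
    let mut lun = PrTarget { registered: HashSet::new(), reservation: None };
    lun.register(pr_key(1));
    lun.register(pr_key(2));
    assert!(lun.reserve(pr_key(2)));    // node 2 holds the reservation
    // Node 2 dies; node 1 fences it BEFORE DLM lock recovery begins.
    assert!(lun.preempt_and_abort(pr_key(1), pr_key(2)));
    assert!(!lun.can_write(pr_key(2))); // fenced node is locked out
    assert!(lun.can_write(pr_key(1)));
}
```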


14.8 Computational Storage

14.8.1 Problem

NVMe Computational Storage Devices (CSDs) can run compute on the storage device: filter, aggregate, search, compress — without moving data to the host CPU.

14.8.2 Design: CSD as AccelBase Device

A CSD naturally fits the accelerator framework (Section 21.1). It's a device with local memory (flash) and compute capability (embedded processor):

// Extends AccelDeviceClass (Section 21.1)

#[repr(u32)]
pub enum AccelDeviceClass {
    Gpu             = 0,
    GpuCompute      = 1,
    Npu             = 2,
    Tpu             = 3,
    Fpga            = 4,
    Dsp             = 5,
    MediaProcessor  = 6,
    /// Computational Storage Device.
    /// "Local memory" = flash storage on the device.
    /// "Compute" = embedded processor running submitted programs.
    ComputeStorage  = 7,
    Other           = 255,
}

Note: The AccelDeviceClass enum is canonically defined in Section 21.1.1 (11-accelerators.md). The ComputeStorage variant (value 7) must be added to the canonical definition to support computational storage devices.

14.8.3 CSD Command Submission

Standard NVMe read (move data to compute):
  Host CPU ← 1 TB data ← NVMe SSD
  Host CPU processes 1 TB → produces 1 MB result
  Total data moved: 1 TB

CSD compute (move compute to data):
  Host CPU → submit "grep pattern" → CSD
  CSD processes 1 TB internally → produces 1 MB result
  Host CPU ← 1 MB ← CSD
  Total data moved: 1 MB (1000x reduction)

The CSD accepts commands via the AccelBase vtable:

  • create_context: allocate a CSD execution context.
  • submit_commands: submit a compute program (filter, aggregate, map, etc.).
  • poll_completion: check whether the computation is done.

Results are returned via DMA to host memory.

14.8.4 CSD Security Model

CSDs run arbitrary compute programs on the device's embedded processor. The kernel must enforce access boundaries:

Capability-gated namespace access:
  1. Each NVMe namespace has an owner (cgroup or capability).
  2. CSD compute programs can ONLY access namespaces granted to
     the submitting process's capability set.
  3. Cross-namespace access (e.g., join across two datasets on
     different namespaces) requires capabilities for BOTH namespaces.
  4. The CSD driver enforces this BEFORE submitting to hardware
     via the NVMe Computational Storage command set.

Program validation:
  - CSD programs are opaque to the kernel (device-specific bytecode).
  - The kernel does NOT inspect or validate program contents.
  - Trust boundary: the NVMe device enforces isolation between
    namespaces at the hardware level (NVMe namespace isolation).
  - If the CSD hardware lacks namespace isolation, the kernel
    treats the device as single-tenant (only one cgroup at a time).

DMA buffer isolation:
  - Result DMA buffers are allocated from the submitting process's
    address space (via IOMMU-mapped regions, same as GPU DMA).
  - CSD cannot DMA to arbitrary host memory — IOMMU enforces this.

CSD Program Validation and IOMMU Enforcement:

Before submitting a CSD program to a device, the kernel performs:

1. IOMMU domain restriction: The CSD device is placed in an isolated IOMMU domain (one per process/namespace submitting CSD work). The IOMMU mapping for the CSD domain is restricted to:
   • The input data region(s) specified in the submission descriptor.
   • The output data region(s) specified in the submission descriptor.
   • The program binary itself (if stored in a device-accessible region).
   Any attempt by the CSD device to DMA outside these regions raises an IOMMU fault, which terminates the CSD operation and returns EPERM to the submitting process.

2. Capability check: CSD program submission requires CAP_ACCEL_SUBMIT (Section 8.1.3) on the CSD device's capability object. Programs submitted via a cgroup with storage quota enforcement additionally require that the submission's estimated compute units do not exceed the cgroup's CSD budget.

3. Program opaqueness vs. DMA opaqueness: The program logic is opaque to the kernel (vendor-specific bytecode). However, the DMA access pattern is NOT opaque: the IOMMU enforces that the device can only DMA to the addresses explicitly listed in the submission. The program cannot expand its DMA scope at runtime.

4. Namespace isolation: Each process namespace maps to a distinct IOMMU domain. Programs from process A cannot access data mapped into process B's CSD domain. Shared CSD regions (for cooperative workloads) require an explicit capability grant from process B to process A (Section 8.1.1 capability delegation) and a corresponding IOMMU mapping shared between the two domains.

5. Program signing (optional policy): Operators can configure CSD device policies to reject programs without a valid signature (csd_policy: require_signed = true). The signature is checked against the system's IMA policy (Section 8.4). Unsigned programs return EKEYREJECTED.
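The pre-submission gate (steps 1–4) can be sketched as a pure check over the submitter's capability set and cgroup budget. All type names here (`CsdSubmission`, `SubmitterContext`, `check_csd_submission`) are hypothetical illustrations of the policy, not the real KABI:

```rust
use std::collections::HashSet;

/// What a CSD submission declares up front (illustrative).
pub struct CsdSubmission {
    pub input_namespaces: Vec<u32>,   // NSIDs the program reads
    pub output_namespaces: Vec<u32>,  // NSIDs the program writes
    pub estimated_compute_units: u64,
}

/// The submitting process's relevant state (illustrative).
pub struct SubmitterContext {
    pub granted_namespaces: HashSet<u32>, // capability-granted NSIDs
    pub has_cap_accel_submit: bool,
    pub csd_budget_remaining: u64,        // cgroup CSD compute budget
}

pub fn check_csd_submission(ctx: &SubmitterContext, sub: &CsdSubmission) -> Result<(), i32> {
    const EPERM: i32 = 1;
    const EDQUOT: i32 = 122;
    // Step 2: CAP_ACCEL_SUBMIT is required for any submission.
    if !ctx.has_cap_accel_submit {
        return Err(EPERM);
    }
    // Step 4 (and 14.8.4 rule 3): EVERY namespace touched needs a capability.
    for nsid in sub.input_namespaces.iter().chain(sub.output_namespaces.iter()) {
        if !ctx.granted_namespaces.contains(nsid) {
            return Err(EPERM);
        }
    }
    // Step 2, budget half: reject submissions exceeding the cgroup budget.
    if sub.estimated_compute_units > ctx.csd_budget_remaining {
        return Err(EDQUOT);
    }
    Ok(())
}
```

Only after this check passes would the driver build the restricted IOMMU mapping (step 1) and hand the program to the device.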

14.8.5 CSD Error Handling

Error scenarios and kernel response:

Timeout (program runs too long):
  1. CSD command timeout (default: 30s, configurable via AccelBase).
  2. Kernel sends NVMe Abort command for the specific command ID.
  3. Returns -ETIMEDOUT to the submitting process.
  4. If abort fails: NVMe controller reset (same path as NVMe I/O timeout).

Hardware error (device reports failure):
  1. CSD returns NVMe status code (e.g., Internal Error, Data Transfer Error).
  2. Kernel maps to errno: -EIO for hardware faults, -ENOMEM for device
     memory exhaustion, -EINVAL for malformed programs.
  3. Error counter incremented in /sys/class/accel/csdN/errors.
  4. If error rate exceeds threshold: driver marks device degraded,
     stops accepting new submissions, notifies userspace via udev event.

Device reset:
  1. NVMe controller reset via PCIe FLR (Function Level Reset).
  2. All in-flight CSD commands are failed with -EIO.
  3. Contexts are invalidated; processes must re-create them.
  4. Same recovery path as standard NVMe timeout handling in Linux.
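The status-to-errno policy above can be condensed into one mapping function. This is a sketch with hypothetical names (`CsdOutcome`, `csd_outcome_to_errno`); the errno numbers are the conventional Linux values:

```rust
/// Illustrative mapping of CSD completion outcomes to errno values,
/// following the 14.8.5 policy. Names are hypothetical.
#[derive(Debug, PartialEq)]
pub enum CsdOutcome {
    Ok,
    Timeout,               // command exceeded its deadline, abort issued
    DeviceInternalError,   // NVMe Internal Error / Data Transfer Error
    DeviceMemoryExhausted, // device-local memory exhaustion
    MalformedProgram,      // device rejected the program
}

pub fn csd_outcome_to_errno(o: CsdOutcome) -> i32 {
    const EIO: i32 = 5;
    const ENOMEM: i32 = 12;
    const EINVAL: i32 = 22;
    const ETIMEDOUT: i32 = 110;
    match o {
        CsdOutcome::Ok => 0,
        CsdOutcome::Timeout => ETIMEDOUT,
        CsdOutcome::DeviceInternalError => EIO,
        CsdOutcome::DeviceMemoryExhausted => ENOMEM,
        CsdOutcome::MalformedProgram => EINVAL,
    }
}
```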

14.8.6 Linux Compatibility

NVMe Computational Storage is defined in separate NVMe technical proposals — primarily TP 4091 (Computational Programs) and TP 4131 (Subsystem Local Memory) — not in the NVMe 2.0 base specification. These TPs define the Computational Programs I/O command set and the Subsystem Local Memory I/O command set as independent command sets within the NVMe 2.0 specification library architecture (which separates base spec, command set specs, and transport specs into distinct documents). Linux support is emerging (/dev/ngXnY namespace devices). UmkaOS supports the same device files and NVMe ioctls through umka-compat.

CSD Programming Model:

CSD programs are opaque command buffers — the kernel does not interpret or compile them. The programming model:

  1. Vendor SDK in userspace: Each CSD vendor provides a userspace SDK that compiles programs for their embedded processor (e.g., Samsung SmartSSD SDK, ScaleFlux CSD SDK).
  2. NVMe TP 4091 (Computational Programs): The NVMe technical proposal defines a standard command set for managing computational programs on CSDs. Programs are uploaded via NVMe admin commands and executed via NVMe I/O commands.
  3. Kernel role: The kernel manages namespace access (capability-gated), DMA buffer allocation (IOMMU-protected), command timeout enforcement, and error reporting. The kernel does NOT validate program correctness — that is the vendor SDK's responsibility.

CSD Data Affinity:

For workloads that benefit from computational storage, data should be placed on the CSD's local namespaces:

  • Filesystem-level routing: Mount a CSD-backed filesystem and place data files on it. CSD compute programs access data locally (no PCIe transfer).
  • Cgroup hint: csd.preferred_device cgroup knob suggests which CSD device should be preferred for new file allocations within that cgroup's processes. Advisory only — the filesystem makes the final placement decision.
  • Explicit placement: Applications using O_DIRECT + the NVMe passthrough interface can target specific CSD namespaces directly.

14.8.7 Performance Impact

CSD offload reduces host CPU usage and PCIe bandwidth consumption. Performance improves for data-heavy workloads. Zero overhead when CSDs are not present.


14.9 SATA/AHCI and Embedded Flash Storage

SATA and eMMC are general-purpose block storage buses present in servers, edge nodes, embedded systems, and consumer devices alike. They belong in the core storage architecture alongside NVMe.

14.9.1 SATA/AHCI

SATA (Serial ATA) remains widely deployed: HDDs in cold/warm storage tiers, SATA SSDs in cost-sensitive edge nodes, and legacy server hardware. AHCI (Advanced Host Controller Interface) is the standard host-side register interface for SATA controllers.

Driver tier: Tier 1. SATA is a block-latency-sensitive path.

AHCI register interface: The AHCI controller exposes a set of memory-mapped registers (HBA memory space, BAR5) and per-port command list / FIS receive areas. The driver:

  1. Discovers ports via HBA_CAP.NP (number of ports).
  2. For each implemented port: reads PxSIG to identify device type (ATA, ATAPI, PM, SEMB).
  3. Issues IDENTIFY DEVICE (ATA command 0xEC) to retrieve geometry, capabilities, LBA48 support, NCQ depth.
  4. Allocates per-port command list (up to 32 slots) and FIS receive buffer.
  5. Registers the device with umka-block as a BlockDevice with sector size 512 or 4096 (Advanced Format).

Command submission: AHCI uses a memory-based command list. Each command slot contains a Command Table with a Physical Region Descriptor Table (PRDT) for scatter-gather DMA. Native Command Queuing (NCQ, up to 32 outstanding commands) is used when the device reports IDENTIFY.SATA_CAP.NCQ_SUPPORTED.

/// AHCI port state (per-port, Tier 1 driver domain).
pub struct AhciPort {
    /// MMIO base for this port's register set.
    mmio: PortedMmio,
    /// Command list: up to 32 slots, each 32 bytes (AHCI 1.3.1 Section 4.2.2).
    cmd_list: DmaBox<[AhciCmdHeader; 32]>,
    /// FIS receive area: 256 bytes (AHCI 1.3.1 Section 4.2.1).
    fis_rx: DmaBox<AhciFisRxArea>,
    /// Per-slot command tables (scatter-gather descriptors).
    cmd_tables: [DmaBox<AhciCmdTable>; 32],
    /// Tracks which command slots are in-flight.
    inflight: AtomicU32,
}
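The `inflight` bitmap lends itself to lock-free slot allocation: claim the lowest free bit with a compare-exchange loop, release it with a fetch-and. A minimal sketch, assuming the port supports all 32 slots (real code would mask by the HBA's reported slot count):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Claim a free AHCI command slot (one bit per slot in `inflight`).
/// Returns None if all 32 slots are busy.
pub fn claim_slot(inflight: &AtomicU32) -> Option<u32> {
    loop {
        let cur = inflight.load(Ordering::Acquire);
        if cur == u32::MAX {
            return None; // every slot in-flight
        }
        let slot = (!cur).trailing_zeros(); // lowest zero bit = lowest free slot
        let new = cur | (1 << slot);
        // CAS; on contention another CPU claimed a slot first, so retry.
        if inflight
            .compare_exchange_weak(cur, new, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return Some(slot);
        }
    }
}

/// Release a slot after its command completes (called from the port ISR).
pub fn release_slot(inflight: &AtomicU32, slot: u32) {
    inflight.fetch_and(!(1 << slot), Ordering::Release);
}
```

The acquire/release orderings ensure the command table writes for a slot are visible before the slot's bit is published to the hardware-issue path.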

Power management: AHCI supports three interface power states: Active, Partial (exit latency on the order of 10 µs), and Slumber (on the order of 10 ms). The driver uses Aggressive Link Power Management (ALPM) to enter Partial/Slumber when the port is idle. On system suspend (Section 6.2.11), the driver flushes the write cache (FLUSH CACHE EXT, ATA 0xEA) and issues STANDBY IMMEDIATE (ATA 0xE0) before the controller is powered down.

Integration with Section 14.3 Block I/O: AHCI ports register as BlockDevice instances with umka-block. The volume layer (Section 14.3) treats SATA devices identically to NVMe namespaces — RAID, dm-crypt, dm-verity, thin provisioning all work on SATA block devices without modification.

14.9.2 eMMC (Embedded MultiMediaCard)

eMMC is a managed NAND flash storage interface used in embedded systems, edge servers with soldered storage, and cost-sensitive devices. The host interface is a parallel bus (up to 8-bit data width) with an MMC command set.

Driver tier: Tier 1 for the MMC host controller; device command processing follows the same ring buffer model as NVMe.

eMMC register interface: The eMMC host controller (typically SDHCI-compatible or vendor-specific) exposes MMIO registers for command/response, data FIFO, and interrupt status. The driver:

  1. Initializes the host controller and negotiates bus width (1/4/8-bit) and speed (HS200/HS400 where supported).
  2. Issues CMD8 (SEND_EXT_CSD) to retrieve the extended CSD register (512 bytes), which contains capacity, supported features, lifetime estimation, and write-protect status.
  3. Registers partitions (boot partitions BP1/BP2, RPMB, user area, general purpose partitions) as separate BlockDevice instances with umka-block.

RPMB (Replay-Protected Memory Block): eMMC RPMB is a hardware-authenticated storage area with replay protection, used for secure credential storage (e.g., TPM secrets, disk encryption keys). Access requires HMAC-SHA256-authenticated commands using a device-specific key programmed once at manufacturing. The kernel exposes RPMB as a capability-gated block device; only processes with the CAP_RPMB_ACCESS capability can issue RPMB commands.

Lifetime and wear: The Extended CSD PRE_EOL_INFO and DEVICE_LIFE_TIME_EST fields report device health. The kernel reads these periodically and exposes them via sysfs (/sys/block/mmcblk0/device/life_time). No kernel policy is applied — userspace storage daemons make retention/migration decisions.
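Extracting these health fields is a matter of indexing the 512-byte EXT_CSD buffer. A sketch, using the byte offsets from the JEDEC eMMC 5.0 standard (the `EmmcHealth` type and helper name are illustrative):

```rust
// EXT_CSD byte offsets per JEDEC eMMC 5.0 (JESD84-B50).
const EXT_CSD_PRE_EOL_INFO: usize = 267;
const EXT_CSD_LIFE_TIME_EST_TYP_A: usize = 268; // type A (e.g. SLC) region
const EXT_CSD_LIFE_TIME_EST_TYP_B: usize = 269; // type B (e.g. MLC) region

#[derive(Debug, PartialEq)]
pub struct EmmcHealth {
    /// 0x01 = normal, 0x02 = warning (80% of reserved blocks consumed),
    /// 0x03 = urgent.
    pub pre_eol: u8,
    /// Lifetime estimates in 10% steps: 0x01 = 0-10% used, ..., 0x0B = exceeded.
    pub life_time_a: u8,
    pub life_time_b: u8,
}

/// Parse the health fields out of a raw EXT_CSD register dump.
pub fn parse_emmc_health(ext_csd: &[u8; 512]) -> EmmcHealth {
    EmmcHealth {
        pre_eol: ext_csd[EXT_CSD_PRE_EOL_INFO],
        life_time_a: ext_csd[EXT_CSD_LIFE_TIME_EST_TYP_A],
        life_time_b: ext_csd[EXT_CSD_LIFE_TIME_EST_TYP_B],
    }
}
```

These are the values the kernel would surface via the sysfs life_time node described above.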

Integration with Section 14.3: eMMC user-area partitions register as BlockDevice instances. All Section 14.3 volume management targets (dm-crypt, dm-mirror, dm-thin) work on eMMC partitions identically to NVMe namespaces.

14.9.3 SD Card Reader (SDHCI)

SDHCI (SD Host Controller Interface) is the standard register interface for built-in SD card slot controllers. SD cards register as BlockDevice instances with umka-block.

Driver tier: Tier 1.

Speed mode negotiation: UHS-I (up to SDR104, 104 MB/s), UHS-II (312 MB/s), and UHS-III (624 MB/s), negotiated per the SD Association's SD 8.0 specification. The driver reads the SD card's OCR, CID, CSD, and SCR registers at initialization to determine supported speed modes and switches the bus to the highest mutually supported mode.
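The "highest mutually supported mode" rule reduces to an intersection plus a maximum. A minimal sketch (the `SdSpeedMode` ordering and names are illustrative, not the real driver types):

```rust
/// Bus speed modes in ascending speed order (illustrative subset).
/// The derived PartialOrd follows declaration order.
#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
pub enum SdSpeedMode {
    DefaultSpeed,
    HighSpeed,
    Sdr50,
    Sdr104,
    UhsII,
    UhsIII,
}

/// Pick the fastest mode both the host controller and the card support.
/// Falls back to DefaultSpeed, which every SD card must support.
pub fn best_mode(host: &[SdSpeedMode], card: &[SdSpeedMode]) -> SdSpeedMode {
    host.iter()
        .copied()
        .filter(|m| card.contains(m))
        .fold(SdSpeedMode::DefaultSpeed, |best, m| if m > best { m } else { best })
}
```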

Presence detection: SD cards are hot-plug devices. The SDHCI controller raises an interrupt on card insertion/removal. The driver posts a BlockDeviceChanged event to the system event bus (Section 6.6, umka-core) on state change.

Consumer vs. embedded: SD cards are used in consumer laptops (built-in SD slot), embedded systems (primary boot/storage medium), and IoT devices. The SDHCI driver is general-purpose; its presence in consumer devices is the most common deployment.


14.10 Filesystem Drivers: ext4, XFS, and Btrfs

The kernel ships three general-purpose local filesystem drivers. All three implement the FileSystemOps and InodeOps traits defined in Section 13.1 (VFS layer). All three are used in server, workstation, embedded, and consumer contexts; they are not consumer-specific.

14.10.1 ext4

Use cases: Default Linux filesystem. Ubiquitous across servers, containers (overlayfs on ext4), embedded roots, VM images, CI/CD storage nodes, and most existing Linux deployments. UmkaOS must read/write ext4 volumes from day one for bare-metal Linux migration compatibility.

Tier: Tier 1 (in-kernel driver; no privilege boundary makes sense for a root filesystem that must be available before any domain infrastructure is up).

Journal modes (selected at mount time via data= option):

Mode                   | What is journalled                                 | Durability on crash
-----------------------|----------------------------------------------------|--------------------------------------------
data=writeback         | Metadata only                                      | Stale data may appear in reallocated blocks
data=ordered (default) | Metadata only; data flushed before metadata commit | No stale data
data=journal           | Metadata and data                                  | Strongest; ~2× write amplification

UmkaOS exposes these as mount flags via the FileSystemOps::mount() options string, consistent with Linux behaviour. The VFS durability contract (Section 14.1) requires data=ordered or data=journal to satisfy O_SYNC/fsync guarantees; drivers must reject data=writeback if the volume is mounted as a root or journalled data store unless the operator explicitly overrides.
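The mount-time policy above can be sketched as a small validation function. Names here (`Ext4DataMode`, `validate_data_mode`, the override flag) are illustrative, not the real mount-path types:

```rust
/// ext4 data journalling modes, per the table above.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Ext4DataMode {
    Writeback,
    Ordered,
    Journal,
}

/// Reject data=writeback for root / journalled data-store mounts unless
/// the operator explicitly overrides, per the Section 14.1 contract.
pub fn validate_data_mode(
    mode: Ext4DataMode,
    is_root_or_journalled_store: bool,
    operator_override: bool,
) -> Result<Ext4DataMode, i32> {
    const EINVAL: i32 = 22;
    if mode == Ext4DataMode::Writeback
        && is_root_or_journalled_store
        && !operator_override
    {
        // data=writeback cannot satisfy O_SYNC/fsync guarantees here.
        return Err(EINVAL);
    }
    Ok(mode)
}
```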

Key features the driver must implement:

  • Extents (ext4_extent_tree): 48-bit logical-to-physical block mapping via an extent B-tree rooted in the inode. Supports extents up to 128 MiB contiguous. Replaces the older indirect-block scheme (which must also remain readable for old volumes without the extents feature flag).
  • HTree directory indexing: dir_index feature flag. Directories stored as B-trees keyed by filename hash (half-MD4). Required for directories with more than ~10,000 entries; without it readdir degrades to O(n).
  • 64-bit support: 64bit feature flag extends the block count from 32 to 48 bits, enabling volumes >16 TiB. Required for modern datacenter deployments; the driver must handle both 32-bit and 64-bit superblocks.
  • Inline data: Small files (≤60 bytes) stored directly in the inode body. Important for filesystems hosting millions of tiny files (container layers, npm caches).
  • Fast commit (fast_commit feature, Linux 5.10+): Appends a small delta journal entry instead of a full transaction commit for common operations (rename, link, unlink). Reduces journal write amplification by 4–10× for metadata-heavy workloads.

Crash recovery: Replay the ext4 journal (jbd2 compatible format) on mount. The VFS freeze/thaw interface (Section 13.1 freeze() / thaw()) is used for consistent snapshots (LVM thin, VM live migration).

Linux compatibility: UmkaOS's ext4 driver is wire-compatible with Linux's ext4. Volumes formatted with mkfs.ext4 on Linux are mountable by UmkaOS without conversion. The tune2fs -l feature list (FEATURE_COMPAT, FEATURE_INCOMPAT, FEATURE_RO_COMPAT) governs which features are required vs. optional; the driver rejects mount if any INCOMPAT bit is set that it does not understand.
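The INCOMPAT gate reduces to one bitmask comparison: refuse the mount if the superblock advertises any incompat bit outside the driver's supported set. A sketch — the bit values match ext4's published on-disk flags, but the supported set shown is illustrative:

```rust
// ext4 FEATURE_INCOMPAT flag values (on-disk format).
const INCOMPAT_FILETYPE: u32 = 0x0002;
const INCOMPAT_EXTENTS: u32 = 0x0040;
const INCOMPAT_64BIT: u32 = 0x0080;
const INCOMPAT_INLINE_DATA: u32 = 0x8000;

/// Reject the mount if the superblock sets any INCOMPAT bit we do not
/// implement. (The hypothetical supported set here covers the features
/// listed above; a real driver would include more bits.)
pub fn check_incompat(sb_incompat: u32) -> Result<(), i32> {
    const EOPNOTSUPP: i32 = 95;
    let supported =
        INCOMPAT_FILETYPE | INCOMPAT_EXTENTS | INCOMPAT_64BIT | INCOMPAT_INLINE_DATA;
    let unknown = sb_incompat & !supported;
    if unknown != 0 {
        Err(EOPNOTSUPP) // unknown incompat feature: unsafe to touch the volume
    } else {
        Ok(())
    }
}
```

COMPAT and RO_COMPAT bits get weaker treatment: unknown COMPAT bits are ignored, and unknown RO_COMPAT bits force a read-only mount rather than a rejection.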

14.10.2 XFS

Use cases: Default filesystem on RHEL, CentOS, Fedora, Rocky Linux, and Oracle Linux. Dominant in enterprise storage servers, HPC scratch filesystems, media production storage, and large-scale NFS servers. Designed for very large files and very large directories.

Tier: Tier 1 (same rationale as ext4).

Design:

XFS partitions the volume into allocation groups (AGs), each an independent unit with its own free-space B-trees (bnobt, cntbt), inode B-tree (inobt), and reverse-mapping B-tree (rmapbt, v5 only). Allocation groups enable parallel allocation for multi-threaded workloads — different AGs are independent, so concurrent file creation on different CPUs does not serialize.

Volume layout (simplified):
  [ Superblock | AG 0 | AG 1 | ... | AG N ]
  Each AG: [ AG header | free-space B-trees | inode B-tree | data blocks ]

Key features:

  • Delayed allocation (delalloc): Blocks are not physically allocated until writeback, allowing the allocator to choose large contiguous extents instead of the first available fragment. Critical for streaming-write performance.
  • Speculative preallocation: XFS preallocates beyond the current EOF during sequential writes, then trims unused preallocation on close. Dramatically reduces fragmentation for growing files (logs, databases, media files).
  • Reflink (XFS v5, Linux 4.16+): Copy-on-write extent sharing for cheap file copies (same semantics as Btrfs reflinks). Required for efficient container image layering and cp --reflink.
  • Reverse mapping B-tree (rmapbt, v5): Tracks which owner (inode or B-tree structure) holds each physical block. Required for online scrub, online repair, and reflink. Adds ~5% space overhead.
  • Real-time device: XFS optionally uses a separate real-time device for files tagged with XFS_XFLAG_REALTIME, guaranteeing allocation from a contiguous extent region. Used in HPC and media production for deterministic I/O latency. UmkaOS supports the real-time device as a second BlockDevice passed in the mount option rtdev=.
  • xattr namespaces: user., trusted., security., system.posix_acl_*. The trusted. namespace is restricted to CAP_SYS_ADMIN; the kernel enforces this via capability checks in setxattr(2).

Journal (xlog): XFS uses a write-ahead log (xlog) for all metadata mutations. The log is circular; the driver replays from the last checkpoint on mount after unclean shutdown. Log can be on the same device (default) or an external device (logdev=) for better write isolation on HDD-based arrays.

Linux compatibility: XFS v5 (identified by the superblock version number XFS_SB_VERSION_5; metadata CRCs are mandatory in v5) is required for all new volumes. v5 adds a CRC32C checksum to every metadata block, catching silent corruption that ext4 without metadata checksums would miss. UmkaOS rejects mounting v4 volumes unless a compatibility shim is provided (the v4 format has been deprecated upstream since Linux 5.10 and is not worth supporting at launch).

14.10.3 Btrfs

Use cases: Default filesystem for UmkaOS desktop/laptop deployments (see consumer roadmap Section 23.2.9, 23-roadmap.md; open questions decision), Fedora workstations, Steam Deck. Also used in enterprise for its snapshot and send/receive capabilities (Proxmox, SUSE). Relevant at kernel level wherever atomic snapshots, compression, or multi-device volumes are needed.

Tier: Tier 1.

Design: Btrfs is a copy-on-write (CoW) B-tree filesystem. Every write produces a new copy of the modified data/metadata; the old copy is retained until freed. This is the foundation for snapshots (zero-cost at creation) and atomic multi-file transactions.

Key features:

Feature                 | Kernel behaviour
------------------------|-----------------
Subvolumes              | Independent CoW trees within a volume; each mountable separately. The kernel tracks the active subvolume ID per mount point.
Snapshots               | Read-write or read-only clone of a subvolume at a point in time. Zero-cost creation (no data copied). Used by UmkaOS live update rollback (Section 12.6).
Reflinks                | Shallow file copy (cp --reflink). Shares extent references until written. Critical for container runtimes and package managers.
Transparent compression | Per-file or per-subvolume, online. Algorithms: LZO (fast), ZLIB (balanced), ZSTD (best ratio, default for UmkaOS). Kernel compresses on writeback; decompresses on read.
RAID profiles           | RAID 0 / 1 / 1C3 / 1C4 / 5 / 6 / 10 across multiple BlockDevice instances. RAID 5/6 has known write-hole issues (upstream caveat); UmkaOS documents this and defaults to RAID 1 for redundant volumes.
Online scrub            | Background verification of all data and metadata checksums. Driven by a kernel thread (btrfs-scrub); progress exposed via ioctl and sysfs.
Send/receive            | Incremental snapshot delta serialisation. btrfs send produces a stream; btrfs receive applies it on another volume. Used for backup, replication, and container image distribution.
Free space tree         | v2 free-space cache (B-tree based); replaces the v1 file-based cache. Required for large volumes (>1 TiB); UmkaOS always mounts with space_cache=v2.

CoW and O_SYNC interaction: Because Btrfs delays the final tree root update until transaction commit, fsync must trigger a full transaction commit (not just a data flush) to satisfy durability. The driver calls btrfs_commit_transaction() on fsync for non-nodatacow files. This is a known latency source for databases; the architecture recommends nodatacow mount option for database subvolumes (trades crash consistency for performance, consistent with how PostgreSQL and MySQL recommend mounting their data directories on any CoW filesystem).

Live update integration (Section 12.6): Btrfs subvolume snapshots can support snapshot-based atomic OS updates. A live update agent can create a read-only snapshot of the root subvolume before applying an update, making rollback trivial and zero-downtime. This makes Btrfs a natural fit for deployments that use snapshot-based atomic updates; on servers where ext4 or XFS is already in use, this advantage does not justify a migration.

Linux compatibility: Btrfs on-disk format is stable since Linux 3.14. UmkaOS's Btrfs driver is wire-compatible with Linux's. Volumes created on Linux are mountable by UmkaOS. Feature detection uses the incompat_flags superblock field; the driver rejects mount if any unknown INCOMPAT bit is set.

Limitations documented:

  • RAID 5/6: the write hole remains unfixed in upstream Btrfs. Use RAID 1 or RAID 10 for redundant server volumes.
  • nodatacow files cannot have checksums. Applications that disable CoW for performance must accept no data integrity checking on those files.
  • Very large directories (>1M entries) perform worse than XFS due to CoW overhead on directory mutations.

14.10.4 Removable Media and Interoperability Filesystems

These filesystem drivers serve interoperability with Windows, macOS, and removable media standards. They are not consumer-specific — embedded systems, edge nodes, and industrial devices also use FAT/exFAT/NTFS for removable storage interoperability.

UmkaOS's strategy for these filesystems is native in-kernel drivers implemented as Tier 1 drivers using the standard FileSystemOps / InodeOps / FileOps trait set (Section 13.1). FUSE-backed userspace drivers are supported as a compatibility mechanism for filesystems where a full native implementation is deferred; the FUSE subsystem is specified in Section 14.10.4.4.

14.10.4.1 exFAT

Use case: SDXC (SD cards >32 GB) mandates exFAT per the SD Association's SD specification. USB flash drives commonly use exFAT. Required for read/write interop with Windows and macOS systems.

Tier: Tier 1 (in-kernel umka-exfat driver).

Implementation: Microsoft published the exFAT specification as an open specification in 2019 (SPDX: LicenseRef-exFAT-Specification; no royalty or patent encumbrance for implementors). The exFAT on-disk format is simpler than ext4 or XFS: a flat cluster chain FAT or an Allocation Bitmap (preferred for exFAT), a root directory cluster chain, and per-file directory entries using UTF-16 with UpCase table normalization. UmkaOS's native umka-exfat driver implements the full read/write path using the FileSystemOps trait.

Compatibility: Read/write. Cluster sizes from 512 B to 32 MB. Files up to 16 EiB (volume limit). Directory entries use UTF-16LE with the volume's UpCase table. Timestamps include UTC offset field (Windows 10+). No journaling; power loss can corrupt a directory entry mid-write. The driver issues a FLUSH CACHE command to the underlying block device after each fsync to bound exposure.

Linux compatibility: exFAT volumes created on Linux (kernel exFAT driver, merged in 5.7) are mountable by UmkaOS and vice versa. The UpCase table format and cluster allocation bitmap are identical.

14.10.4.2 NTFS

Use case: External drives shared with Windows installations. Common on USB hard drives purchased pre-formatted. Required for read/write interop with Windows-hosted data volumes.

Tier: Tier 1 (in-kernel ntfs3 driver; based on the Paragon ntfs3 implementation merged into Linux 5.15).

Implementation: UmkaOS's ntfs3 driver is derived from the upstream Linux ntfs3 implementation by Paragon Software Group. It provides full read/write support including NTFS compression (LZNT1, applied per 16-cluster compression unit), sparse files (sparse runs in the data attribute), and hard links (multiple $FILE_NAME attributes per MFT record).

Features not supported (return EOPNOTSUPP on access):

  • Alternate Data Streams exposed as separate mount namespace entries (ADS content is preserved on read/write of the primary stream but not enumerable via openat/readdir).
  • Reparse points used as Windows junction points or symlinks (IO_REPARSE_TAG_SYMLINK, IO_REPARSE_TAG_MOUNT_POINT) — accessed as regular files or returned as DT_UNKNOWN in directory listings.
  • Encrypted files ($EFS attribute) — opened successfully, but content reads return raw ciphertext with a warning in the kernel log.

Phase constraint: Full NTFS write support is present from Phase 2; no subset of write support is deferred. The NTFS journaling structures ($LogFile, $UsnJrnl) are replayed on mount to ensure volume consistency after unclean shutdown, matching Linux ntfs3 behavior. The complexity of NTFS journaling, compression, and sparse files is handled by the derived ntfs3 implementation.

Linux compatibility: Wire-compatible with Linux ntfs3. Volumes created on Linux ntfs3 are mountable by UmkaOS and vice versa.

14.10.4.3 APFS (Read-Only)

Use case: External drives formatted by macOS. Required for data migration from macOS systems and for mounting Apple Silicon boot drives in dual-boot or forensic scenarios.

Tier: Tier 1 (in-kernel read-only driver, Phase 4+).

Phase constraint: APFS write support is permanently deferred. The APFS on-disk format is not a public specification; Apple documents only enough for APFS tooling on macOS. Reverse-engineered write support risks silent metadata corruption when Apple makes undocumented changes between macOS releases. The read-only constraint is therefore not a temporary limitation but a deliberate design boundary: APFS volumes mounted by UmkaOS are always mounted read-only, enforced in the FileSystemOps::mount() implementation by returning EROFS if MountFlags::READ_WRITE is set.

Implementation: Read-only native kernel driver derived from the apfs-fuse project's reverse-engineered format analysis (MIT licensed). Supported features:

  • APFS container and volume superblock parsing.
  • B-tree (object map, file system tree) traversal.
  • Extent-based and inline file data.
  • Compression (APFS_COMPRESS_ZLIB, APFS_COMPRESS_LZVN, APFS_COMPRESS_LZFSE).
  • Symlinks, hard links (inode numbers via DREC_TYPE_HARDLINK).
  • Extended attributes (xattr tree).
  • Time Machine snapshot enumeration (read-only).

Phase ordering: Phase 3 delivers HFS+ read-only support (for older macOS volumes). Phase 4 delivers APFS read-only, layered on the HFS+ driver's infrastructure for Apple partition map and CoreStorage detection.

Until Phase 4, APFS volumes are accessible via the FUSE subsystem (Section 14.10.4.4) using the apfs-fuse userspace daemon, which provides a compatible FileDescriptor interface through FuseSession.

14.10.4.4 FUSE — Userspace Filesystem Framework

FUSE (Filesystem in Userspace) enables userspace daemons to implement filesystems served through the kernel VFS. UmkaOS implements the FUSE kernel interface as a Tier 2 bridge driver, compatible with the Linux /dev/fuse protocol (FUSE protocol version 7.x; minimum negotiated minor version: 26, which guarantees availability of FUSE_RENAME2 (added in protocol 7.23) and FUSE_LSEEK (added in 7.24)).

Scope: FUSE is a compatibility and extensibility mechanism. Native in-kernel drivers are preferred for performance-critical or widely-used filesystems. FUSE is the appropriate path for: - Filesystems with complex or proprietary on-disk formats where a native kernel driver is not feasible (e.g., APFS before Phase 4). - Userspace tools that already implement a filesystem (e.g., sshfs, s3fs, custom FUSE daemons in container runtimes). - Development and prototyping of new filesystem drivers before promotion to Tier 1.

Protocol: The FUSE kernel↔daemon protocol uses /dev/fuse. The kernel writes request messages (opcodes: FUSE_LOOKUP, FUSE_OPEN, FUSE_READ, FUSE_WRITE, FUSE_READDIR, etc.) into the fd; the daemon reads them, processes them, and writes reply messages back. Each request carries a unique identifier (the unique field) matching it to its reply. The wire format is identical to Linux libfuse protocol version 7.x, ensuring compatibility with all existing FUSE daemons without recompilation.

FuseSession struct — kernel-side state for one mounted FUSE filesystem:

/// Kernel-side state for one active FUSE mount.
///
/// Created when the userspace daemon opens `/dev/fuse` and calls `mount(2)`
/// with `fstype = "fuse"`. Destroyed when the daemon closes the fd or the
/// mount is forcibly unmounted (`umount -f`).
pub struct FuseSession {
    /// Negotiated FUSE protocol version (major, minor).
    /// Major is always 7 for current FUSE protocol; minor is negotiated
    /// during `FUSE_INIT` handshake. The kernel refuses to mount if the
    /// daemon proposes major != 7.
    pub proto_version: (u32, u32),

    /// The `/dev/fuse` file descriptor held open by the daemon process.
    /// Closing this fd triggers an implicit `FUSE_DESTROY` + unmount.
    pub dev_fd: FileDescriptor,

    /// Mount flags captured at mount time (read-only, no-exec, etc.).
    /// Propagated to `InodeOps::permission()` checks within this session.
    pub mount_flags: MountFlags,

    /// Maximum write payload the daemon declared it can handle, in bytes.
    /// Capped at `FUSE_MAX_MAX_PAGES * PAGE_SIZE` (128 × 4096 = 512 KiB).
    /// The kernel splits `FUSE_WRITE` requests larger than this value.
    pub max_write: u32,

    /// Maximum `readahead` size the kernel will request, in bytes.
    /// Negotiated during `FUSE_INIT`; 0 disables kernel readahead for
    /// this mount.
    pub max_readahead: u32,

    /// Whether the daemon supports `FUSE_ASYNC_READ` (concurrent reads
    /// on the same file handle without serialization). Declared by the
    /// daemon in `FUSE_INIT` flags. When false, the kernel serializes
    /// all reads per file handle.
    pub async_read: bool,

    /// Whether the daemon supports `FUSE_WRITEBACK_CACHE` mode.
    /// When true, the kernel VFS page cache handles write coalescing and
    /// fsync; individual 4 KB write-cache flushes are not sent per page.
    /// When false, every `write(2)` generates a `FUSE_WRITE` request.
    pub writeback_cache: bool,

    /// Pending request queue. Requests generated by VFS operations are
    /// enqueued here; the daemon's `read(2)` on `/dev/fuse` dequeues them.
    /// Bounded to `FUSE_MAX_PENDING` (default: 12 + 1 per CPU) requests
    /// to apply backpressure to VFS callers when the daemon is slow.
    pub pending: FuseRequestQueue,

    /// In-flight requests awaiting a reply from the daemon. Keyed by
    /// `unique` identifier. On daemon close, all in-flight requests are
    /// completed with `ENOTCONN`.
    pub inflight: FuseInflightMap,
}

FuseRequestQueue and FuseInflightMap are internal kernel types; their exact layout is not part of the KABI — only the FuseSession fields visible to the Tier 2 FuseDriver are stable.

FUSE_INIT handshake: On first read(2) from the daemon, the kernel sends a FUSE_INIT request with major = 7, minor = UMKA_FUSE_MINOR (the maximum minor the kernel supports). The daemon replies with its supported minor; the negotiated minor is min(kernel_minor, daemon_minor). Capabilities (flags field) are intersected: a capability is active only if both sides declare it. The kernel stores the negotiated values in FuseSession::proto_version and the derived async_read, writeback_cache, max_write, max_readahead fields.
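
The negotiation rule can be condensed into a small sketch (`UMKA_FUSE_MINOR` and the flag constants below are illustrative stand-ins, not the real KABI values):

```rust
/// Illustrative stand-ins for the kernel-side constants.
const UMKA_FUSE_MINOR: u32 = 40;
const FUSE_ASYNC_READ: u32 = 1 << 0;
const FUSE_WRITEBACK_CACHE: u32 = 1 << 1;

/// Outcome of the FUSE_INIT handshake.
pub struct Negotiated {
    pub minor: u32,
    pub flags: u32,
}

/// The negotiated minor is the lower of the two sides; a capability is
/// active only if both the kernel and the daemon declare it.
pub fn negotiate(daemon_minor: u32, kernel_flags: u32, daemon_flags: u32) -> Negotiated {
    Negotiated {
        minor: UMKA_FUSE_MINOR.min(daemon_minor),
        flags: kernel_flags & daemon_flags,
    }
}
```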

Error handling: If the daemon crashes or closes /dev/fuse with in-flight requests, all pending VFS operations on the mount return ENOTCONN. The mount remains in the VFS tree but is marked MS_DEAD; subsequent operations return ENOTCONN until the mount is explicitly removed with umount. A daemon can reconnect to a dead mount by opening /dev/fuse with O_RDWR | O_CLOEXEC and the same mount cookie — this is the basis for daemon live-restart without unmounting (supported when FUSE_CONN_INIT_WAIT is negotiated).

Security: The /dev/fuse fd is accessible only to the mounting user (or root). Filesystem operations that arrive from processes outside the mounting user's UID are checked against the allow_other mount option. Without allow_other, FUSE_ACCESS is called only for processes with the mounting UID/GID; others receive EACCES at the VFS permission check before the FUSE request is even generated.

Phase: FUSE kernel infrastructure is delivered in Phase 3. FUSE daemons such as apfs-fuse, sshfs, and custom drivers are usable from Phase 3 onward. The native APFS in-kernel driver (Phase 4) supersedes apfs-fuse for performance-sensitive workloads but does not remove FUSE support.

14.10.5 Summary of Design Decisions

  1. Tier 1 placement: overlayfs runs in the VFS domain because it is a pure VFS stacking layer with moderate code complexity. Tier 2 would double domain-crossing overhead for every file operation in every container.

  2. xattr-based whiteouts as default: Avoids CAP_MKNOD requirement for rootless containers. Character device 0:0 whiteouts are recognized on read for backward compatibility.

  3. Metacopy enabled by default: Matches modern Docker/containerd behavior. The security caveat (attacker-crafted xattrs) is mitigated by the trusted.* namespace restriction and container runtime control of layer provenance.

  4. Atomic copy-up via workdir rename: Uses the same-filesystem rename guarantee. The workdir must share a superblock with upperdir, which the mount validation enforces.

  5. Dentry invalidation on copy-up: Uses d_invalidate() on the parent directory's dentry for the affected name, forcing re-lookup through the overlay lookup() path which will find the new upper entry.

  6. d_revalidate() for overlay dentries: Checks for copy-up state changes. This is the primary mechanism by which concurrent readers discover that a file has been copied up.

  7. Readdir merge with HashSet dedup: O(entries × layers) with hash-based dedup. The merged listing is cached per-opendir for consistency.

  8. xattr escaping for nested overlays: Supports overlayfs-on-overlayfs via the trusted.overlay.overlay.* prefix convention, matching Linux.

  9. Volatile sentinel directory: Prevents mounting on unclean upper layers. The sentinel is created on mount, removed on clean unmount.

  10. dm-verity + IMA dual coverage: Lower layers protected by dm-verity (block-level, Section 8.2.6), upper layer by IMA (file-level, Section 8.4). This is cross-referenced for clarity.


14.11 NFS Client, SunRPC, and RPCSEC_GSS

NFS is UmkaOS's primary network filesystem. This section specifies the complete kernel-side stack:

  • SunRPC transport layer: connection management, XDR encoding, RPC dispatch
  • RPCSEC_GSS + Kerberos: Kerberos-authenticated NFS (NFSv4 + Kerberos = "krb5i/krb5p")
  • NFSv4 client state machine: open/lock/delegation/lease
  • Network filesystem cache (netfs layer): shared page cache for NFS, Ceph, and other network filesystems

14.11.1 SunRPC Transport Layer

SunRPC (RFC 5531) is the RPC framework underlying NFS, lockd, and the mount protocol.

RpcTransport trait — abstraction over TCP and UDP transports:

pub trait RpcTransport: Send + Sync {
    fn send_request(&self, req: &RpcMsg, xid: u32, timeout: Duration) -> Result<(), RpcError>;
    fn recv_reply(&self, xid: u32) -> impl Future<Output = Result<RpcMsg, RpcError>>;
    fn close(&self);
    fn reconnect(&self) -> Result<(), RpcError>;
    fn max_payload_size(&self) -> usize;
}

TCP transport — one persistent TCP connection per server per NFS client. Record marking (RFC 5531 §10): each RPC message prefixed with a 4-byte record mark (u32 with high bit set indicating last fragment, low 31 bits = fragment length). Multiple RPC messages may be pipelined on one TCP connection. Connection maintained as long as mounts are active; reconnect on ECONNRESET.
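
The record-mark framing follows directly from the RFC 5531 §10 rule; a minimal sketch (not the transport's actual encoder):

```rust
/// Encode a 4-byte RPC record mark (RFC 5531 §10): high bit set on the
/// last fragment, low 31 bits carrying the fragment length.
pub fn encode_record_mark(len: u32, last_fragment: bool) -> [u8; 4] {
    assert!(len < 1 << 31, "fragment length must fit in 31 bits");
    let word = if last_fragment { len | 0x8000_0000 } else { len };
    word.to_be_bytes()
}

/// Decode a record mark into (fragment length, is-last-fragment).
pub fn decode_record_mark(mark: [u8; 4]) -> (u32, bool) {
    let word = u32::from_be_bytes(mark);
    (word & 0x7FFF_FFFF, word & 0x8000_0000 != 0)
}
```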

XClnt (RPC client) struct:

pub struct XClnt {
    pub server_addr:   SockAddr,
    pub transport:     Arc<dyn RpcTransport>,
    pub prog:          u32,    // RPC program number (NFS = 100003, mountd = 100005)
    pub vers:          u32,    // Program version (NFSv4 = 4)
    pub auth:          Arc<dyn RpcAuth>,
    pub xid_counter:   AtomicU32,
    pub pending:       Mutex<HashMap<u32, PendingRpc>>,  // xid → waker
    pub timeout:       Duration,
    pub retries:       u32,
}

pub struct PendingRpc {
    pub xid:    u32,
    pub waker:  Waker,
    pub result: Option<Result<RpcMsg, RpcError>>,
}

XDR (External Data Representation) — RFC 4506. UmkaOS implements XDR as zero-copy where possible: XdrEncoder writes directly into a NetBuf chain; XdrDecoder reads from received NetBuf without copying. Fixed-size types (u32, u64, bool) are directly encoded; variable-length strings and arrays have a 4-byte length prefix followed by zero-padded data to a 4-byte boundary.
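
The variable-length encoding rule (4-byte length prefix, zero-padded to a 4-byte boundary) can be sketched as follows; the real `XdrEncoder` writes into `NetBuf` chains rather than a `Vec`:

```rust
/// Encode an XDR opaque/string value (RFC 4506): a big-endian u32 length
/// prefix, then the bytes, zero-padded up to a 4-byte boundary.
pub fn xdr_encode_opaque(data: &[u8], out: &mut Vec<u8>) {
    out.extend_from_slice(&(data.len() as u32).to_be_bytes());
    out.extend_from_slice(data);
    let pad = (4 - data.len() % 4) % 4;
    out.extend(std::iter::repeat(0u8).take(pad));
}
```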

Async RPC dispatch — call_async(proc: u32, args: impl XdrEncode) -> impl Future<Output = Result<R, RpcError>>: builds RpcMsg { xid, call: RpcCall { rpc_version: 2, program, version, procedure, auth, verifier } }, encodes args via XDR, sends via transport, registers PendingRpc in the pending map, returns a future that resolves when the matching reply arrives. The reply receiver loop runs as a Tier 1 kernel task.

14.11.2 RPC Authentication (RpcAuth)

RpcAuth trait:

pub trait RpcAuth: Send + Sync {
    fn auth_type(&self) -> RpcAuthFlavor;
    fn marshal_cred(&self, encoder: &mut XdrEncoder) -> Result<()>;
    fn verify_verf(&self, decoder: &mut XdrDecoder) -> Result<()>;
    fn refresh(&self) -> Result<()>;  // Re-fetch credentials if expired
}

Built-in auth flavors:

  • AuthNone (flavor 0): null credentials. Used only for portmap/rpcbind.
  • AuthUnix / AUTH_SYS (flavor 1): uid, gid, supplementary groups. Used for NFSv3; not secure.
  • RPCSEC_GSS (flavor 6): GSS-API based authentication. Described in Section 14.11.3.

14.11.3 RPCSEC_GSS and Kerberos

RPCSEC_GSS (RFC 2203) wraps any GSS-API mechanism. UmkaOS implements the Kerberos V5 mechanism (RFC 4121).

Service types (negotiated at mount time via the sec= mount option):

  • krb5: authentication only (integrity of the RPC header)
  • krb5i: authentication + integrity (checksum of the entire RPC payload)
  • krb5p: authentication + integrity + privacy (encryption of the RPC payload)

GssContext struct:

pub struct GssContext {
    pub mech_oid:    GssMechOid,     // 1.2.840.113554.1.2.2 for Kerberos V5
    pub context_hdl: u64,            // Opaque handle to the GSS security context
    pub service:     GssService,     // None / Integrity / Privacy
    pub seq_num:     AtomicU64,      // Monotonic sequence counter
    pub session_key: Zeroizing<[u8; 32]>,  // AES-256 session key
    pub expiry:      Instant,        // When this context expires
    pub uid:         UserId,
}

RPCSEC_GSS credential exchange (happens automatically on the first NFS connection):

  1. The client sends an RPCSEC_GSS_INIT call with a Kerberos AP_REQ (service ticket + authenticator) obtained from the kernel keyring (Section 9.2). The request_key("krb5", "nfs@server.example.com", NULL) lookup triggers a gssd upcall if no ticket is cached.
  2. The server responds with an AP_REP (session key confirmation) and assigns a gss_proc_handle.
  3. Subsequent RPCs carry the gss_proc_handle + sequence number + integrity/privacy checksum in the credential field.

RpcsecGssAuth struct — implements RpcAuth:

pub struct RpcsecGssAuth {
    pub ctx:     Arc<RwLock<GssContext>>,
    pub handle:  u32,          // gss_proc_handle from server
    pub service: GssService,
}
  • marshal_cred(): writes RPCSEC_GSS credential with current seq_num.
  • verify_verf(): checks server's GSS MIC (Message Integrity Code) over the reply XID.
  • refresh(): if ctx.expiry < now, calls request_key() to fetch a new service ticket, re-runs RPCSEC_GSS_INIT.

Key retrieval integration with Section 9.2: Kerberos TGTs are cached as LogonKey entries in the kernel Key Retention Service. When refresh() needs a new service ticket, it calls request_key("krb5tgt", "REALM", NULL) to retrieve the cached TGT LogonKey, then calls request_key("krb5", "nfs@server.example.com", NULL) to obtain (or derive) a service ticket. If no TGT is present, the request_key upcall invokes userspace gssd, which performs the full Kerberos AS exchange, deposits the resulting TGT as a LogonKey, and provides the service ticket. This path requires Capability::SysAdmin only for initial keyring population; subsequent ticket requests use the session keyring of the process that triggered the mount.

Sequence number anti-replay: Each GssContext maintains a monotonic seq_num (AtomicU64). The server rejects any RPC with a sequence number more than 256 positions behind the current window (RFC 2203 §5.3.3). The client never reuses sequence numbers within a context lifetime.
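
The server-side window check can be sketched as a single predicate (an illustrative sketch; a real implementation also tracks which sequence numbers inside the window have already been seen):

```rust
/// Anti-replay window size from RFC 2203 as used in the text above.
pub const GSS_SEQ_WINDOW: u64 = 256;

/// Accept a sequence number only if it is not more than GSS_SEQ_WINDOW
/// positions behind the highest sequence number seen so far.
pub fn seq_acceptable(highest_seen: u64, seq: u64) -> bool {
    seq >= highest_seen.saturating_sub(GSS_SEQ_WINDOW)
}
```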

GSS Upcall Mechanism:

Kerberos authentication requires obtaining credentials from userspace (the gssd daemon), since the kernel cannot contact a KDC directly. UmkaOS uses an upcall mechanism:

Channel: A per-mount Unix domain socket (/run/umka/gss/{mount_id}) created when the NFS mount is established. The kernel writes requests and reads responses using a simple binary framing protocol.

Request format (GssUpcallRequest):

#[repr(C)]
pub struct GssUpcallRequest {
    /// Protocol version (currently 1).
    pub version: u32,
    /// Request type: 1=INIT_SEC_CONTEXT, 2=ACCEPT_SEC_CONTEXT, 3=GET_MIC, 4=VERIFY_MIC.
    pub req_type: u32,
    /// Client principal name (NUL-terminated, max 256 bytes).
    pub client_principal: [u8; 256],
    /// Target service name, e.g., "nfs@server.example.com" (NUL-terminated, max 256 bytes).
    pub target: [u8; 256],
    /// Input token length (0 for INIT, non-zero for mutual auth response).
    pub input_token_len: u32,
    /// Input token data (up to 65535 bytes; variable length follows this struct).
    // (actual data follows at offset sizeof(GssUpcallRequest))
}

Response format (GssUpcallResponse):

#[repr(C)]
pub struct GssUpcallResponse {
    pub version:      u32,
    pub status:       i32,  // 0 = success; negative = GSS error code
    /// GSS context handle (opaque; returned to kernel for subsequent calls).
    pub context_id:   u64,
    /// Output token length (for INIT_SEC_CONTEXT response token).
    pub output_token_len: u32,
    // output token data follows at offset sizeof(GssUpcallResponse)
}
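
The binary framing on the upcall socket (fixed header followed by the variable-length token) can be sketched as below. The field order and little-endian layout here are illustrative assumptions, matching the structs above only loosely:

```rust
/// Serialize a GSS upcall request frame: a fixed-size header followed by
/// the variable-length input token. Layout is illustrative.
pub fn frame_upcall(req_type: u32, target: &[u8], token: &[u8]) -> Vec<u8> {
    // Fixed 256-byte NUL-terminated target field, as in GssUpcallRequest.
    let mut target_buf = [0u8; 256];
    let n = target.len().min(255); // leave room for the NUL terminator
    target_buf[..n].copy_from_slice(&target[..n]);

    let mut frame = Vec::with_capacity(4 + 4 + 256 + 4 + token.len());
    frame.extend_from_slice(&1u32.to_le_bytes());              // protocol version
    frame.extend_from_slice(&req_type.to_le_bytes());          // request type
    frame.extend_from_slice(&target_buf);                      // target service name
    frame.extend_from_slice(&(token.len() as u32).to_le_bytes());
    frame.extend_from_slice(token);                            // token payload
    frame
}
```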

Timeout: 30 seconds per upcall. If gssd does not respond within 30 s:

  • The kernel returns ETIMEDOUT to the NFS operation.
  • The upcall socket is closed and re-created; a new connection attempt is made.
  • After 3 consecutive timeouts, the mount is marked NFS_MOUNT_SECFLAVOUR_FORCE_NONE and falls back to AUTH_SYS (if configured) or returns EACCES permanently until the mount is remounted.

Concurrent upcalls: Multiple upcalls may be in flight simultaneously (one per in-progress authentication). Each upcall is tagged with a unique upcall_id: u32; responses match by upcall_id. A ring buffer of 32 concurrent upcalls is supported.

14.11.3.1 GSS Context Lifecycle and Proactive Renewal

Linux behavior (reference): Linux hard-fails all NFS RPCs with EKEYEXPIRED when the GSS/Kerberos TGT or service ticket expires. The user sees I/O errors on NFS mounts until they re-authenticate (kinit). This is a poor user experience for long-running workloads.

UmkaOS improvement — proactive renewal + grace period:

UmkaOS's GSS context manager proactively renews credentials and provides a short grace period for in-flight RPCs, eliminating spurious I/O errors in well-managed environments.

/// Lifecycle state of a GSS security context.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GssContextState {
    /// Context is valid and usable for RPC signing/encryption.
    Valid,
    /// Context expires within GSS_RENEWAL_LEAD_TIME_SEC (60 s); a renewal upcall
    /// has been sent to the gssd daemon. New RPCs may still use this context.
    RenewPending,
    /// Renewal failed or the context has just expired; within the grace period
    /// (GSS_GRACE_PERIOD_MS = 500 ms). In-flight RPCs are allowed to complete.
    /// New RPCs are queued pending renewal or context replacement.
    GracePeriod,
    /// Grace period elapsed; all RPCs return EKEYEXPIRED until re-authentication.
    Expired,
    /// Context has been explicitly destroyed (session logout or server reset).
    Destroyed,
}

/// Per-server-per-credential GSS context. One context per (client principal,
/// server principal) pair, shared by all threads using the same credentials
/// on the same NFS server connection. This view extends the GssContext of
/// Section 14.11.3 with lifecycle-tracking fields.
pub struct GssContext {
    /// Opaque GSS context token (from gss_init_sec_context). Variable length;
    /// stored as a heap allocation updated atomically on renewal.
    pub token: RwLock<Box<[u8]>>,
    /// Absolute expiry time (nanoseconds since boot).
    pub expiry_ns: AtomicU64,
    /// Current lifecycle state.
    pub state: AtomicU8, // GssContextState as u8
    /// Number of RPCs currently in-flight using this context.
    /// Grace-period teardown waits for this to reach zero before expiring.
    pub in_flight: AtomicU32,
    /// Upcall ID sent to gssd for renewal (0 = none pending).
    pub renewal_upcall_id: AtomicU64,
}

/// Renewal timing constants.
/// Renewal is triggered this many seconds before expiry.
pub const GSS_RENEWAL_LEAD_TIME_SEC: u64 = 60;
/// After expiry, in-flight RPCs have this long to complete before the context
/// is torn down and new RPCs start returning EKEYEXPIRED.
pub const GSS_GRACE_PERIOD_MS: u64 = 500;

Renewal algorithm (runs in the kthread/gss_renewer background thread):

  1. Wake every 5 seconds (or when notified by an expiry timer).
  2. For each GssContext with state == Valid, if now_ns >= expiry_ns - GSS_RENEWAL_LEAD_TIME_SEC * NS_PER_SEC:
    • Transition state to RenewPending.
    • Send a renewal upcall to gssd.
  3. If the renewal upcall succeeds (gssd responds within 30 s):
    • Update token and expiry_ns under token.write().
    • Transition state back to Valid.
  4. If the renewal upcall fails or times out:
    • If now_ns < expiry_ns: retry after 10 s (transient failure).
    • If now_ns >= expiry_ns: transition to GracePeriod and start a 500 ms timer; when the timer fires, wait for in_flight == 0, then transition to Expired.
  5. New RPCs arriving while state == GracePeriod are queued (not failed); they proceed if renewal succeeds, or fail with EKEYEXPIRED if the grace period expires.
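
The Valid → RenewPending → (Valid | GracePeriod) transitions can be condensed into a pure decision function (a sketch with illustrative names; the real renewer also drives the timers and the gssd upcall):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State { Valid, RenewPending, GracePeriod, Expired }

/// Renewal lead time in nanoseconds (60 s, from GSS_RENEWAL_LEAD_TIME_SEC).
pub const LEAD_NS: u64 = 60 * 1_000_000_000;

/// Decide the next lifecycle state from the current state, the clock, and
/// the outcome of a pending renewal upcall (None = no reply yet).
pub fn next_state(state: State, now_ns: u64, expiry_ns: u64,
                  renewal_ok: Option<bool>) -> State {
    match (state, renewal_ok) {
        // Within the renewal lead window: kick off a renewal upcall.
        (State::Valid, _) if now_ns >= expiry_ns.saturating_sub(LEAD_NS) => State::RenewPending,
        // Renewal succeeded: context is usable again with a new expiry.
        (State::RenewPending, Some(true)) => State::Valid,
        // Renewal failed and the context is already expired: grace period.
        (State::RenewPending, Some(false)) if now_ns >= expiry_ns => State::GracePeriod,
        // Otherwise: no transition (includes retry-after-transient-failure).
        (s, _) => s,
    }
}
```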

14.11.4 NFSv4 Client State Machine

NFSv4 (RFC 7530 for v4.0, RFC 5661 for v4.1) is the primary NFS version. Key concepts:

  • Leases: all NFSv4 state (open files, locks, delegations) is held under a time-limited lease. The client must renew its lease before it expires (default 90 s) or all state is purged by the server.
  • Client ID: a 64-bit clientid identifying the client, established via SETCLIENTID (v4.0) or EXCHANGE_ID (v4.1).
  • Sessions (v4.1): connection-independent; RPCs can arrive on any TCP connection in the session. CREATE_SESSION establishes a session; a SEQUENCE operation prefixes every compound.
  • Compounds: NFSv4 operations are batched into compounds (multiple operations per RPC call), e.g. PUTFH + GETATTR in one RPC.
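
Compound batching can be sketched with a hypothetical op type; the real client encodes compounds directly to XDR, so this in-memory list is illustrative only:

```rust
/// Illustrative subset of NFSv4 compound operations.
#[derive(Debug, PartialEq)]
pub enum NfsOp {
    Sequence { slot: u32 },   // v4.1: prefixes every compound
    PutFh(Vec<u8>),           // set the current file handle
    GetAttr,                  // fetch attributes of the current handle
}

/// Build a v4.1 compound fetching attributes for a file handle:
/// SEQUENCE + PUTFH + GETATTR in a single RPC round-trip.
pub fn getattr_compound(fh: &[u8], slot: u32) -> Vec<NfsOp> {
    vec![
        NfsOp::Sequence { slot },
        NfsOp::PutFh(fh.to_vec()),
        NfsOp::GetAttr,
    ]
}
```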

NfsClient struct:

pub struct NfsClient {
    pub server_addr:   SockAddr,
    pub rpc_clnt:      Arc<XClnt>,
    pub clientid:      AtomicU64,       // NFSv4 client ID
    pub verifier:      [u8; 8],         // Client verifier (random, per boot)
    pub lease_time_s:  u32,             // Negotiated from server
    pub lease_renewer: JoinHandle<()>,  // Background task renewing the lease
    pub state_lock:    Mutex<NfsClientState>,
    pub nfs_version:   NfsVersion,      // V4_0 or V4_1
    // NFSv4.1 only:
    pub session_id:    Option<[u8; 16]>,
    pub fore_channel:  Option<SessionChannel>,
    pub back_channel:  Option<SessionChannel>,
}

Open state machine — NfsOpenState per open file handle:

pub struct NfsOpenState {
    pub open_stateid: [u8; 16],       // NFSv4 stateid (4-byte seqid + 12-byte other) from the server
    pub seqid:        u32,            // Local sequence for state transitions
    pub access:       NfsOpenAccess,  // Read / Write / Both
    pub deny:         NfsOpenDeny,    // None / Read / Write / Both
    pub delegation:   Option<NfsDelegation>,
    pub locks:        Vec<NfsLockState>,
}

pub struct NfsDelegation {
    pub stateid:   [u8; 16],
    pub type_:     DelegationType,  // Read or Write
    pub recall_wq: WaitQueue,       // Signaled when server sends CB_RECALL
}

Write delegation — when the server grants a write delegation, the client may write and cache locally without contacting the server for each operation. On recall (server sends CB_RECALL via the NFSv4 callback channel), the client must flush all dirty pages and send DELEGRETURN before the server can grant access to other clients. The callback channel (established in CREATE_SESSION for v4.1, or via SETCLIENTID for v4.0) is a reverse TCP connection: server connects to client. The back_channel in NfsClient tracks this connection.

Lease renewal — a background kernel task (running as a Tier 1 task) calls RENEW (v4.0) or sends a SEQUENCE-only compound (v4.1) every lease_time_s / 2 seconds. On a network partition, lease renewal fails; after lease_time_s the server purges all client state. The client must then perform state recovery: it sends SETCLIENTID / EXCHANGE_ID to re-establish its client identity, then a CLAIM_PREVIOUS open for each open file and a LOCK reclaim for each lock, concluding with RECLAIM_COMPLETE.

State recovery error paths:

  • If the server returns NFS4ERR_STALE_CLIENTID during recovery, the client lost its lease entirely: all open-file state is gone, and all in-progress writes that were not yet flushed are lost. The VFS layer returns EIO to all blocked file operations.
  • If CLAIM_PREVIOUS returns NFS4ERR_RECLAIM_BAD, the server no longer has a record of the open: the file descriptor is invalidated and pending writes are dropped with EIO.
  • Recovery is gated by a per-client recovering flag; new operations block (interruptibly if the intr mount option is set) until recovery completes or fails.

14.11.5 netfs Page Cache Layer

The netfs layer provides a shared page cache infrastructure for network filesystems. UmkaOS implements it as the cache tier between NFS (and future Ceph/AFS) and the page allocator. It replaces ad-hoc per-filesystem readahead and writeback logic with a unified, testable implementation.

Core abstractions:

pub trait NetfsInode: Send + Sync {
    /// Populate subrequests for a read covering [rreq.start, rreq.start + rreq.len).
    fn init_read_request(&self, rreq: &mut NetfsReadRequest);
    /// Issue a single subrequest to the server (or local cache).
    fn issue_read(&self, subreq: &mut NetfsSubrequest);
    /// Issue a write request to the server.
    fn issue_write(&self, wreq: &mut NetfsWriteRequest);
    /// Split a dirty range into write requests.
    fn create_write_requests(&self, wreq: &mut NetfsWriteRequest, start: u64, len: u64);
}

pub struct NetfsReadRequest {
    pub inode:       Arc<dyn NetfsInode>,
    pub start:       u64,             // Byte offset in file
    pub len:         usize,
    pub subrequests: Vec<NetfsSubrequest>,
    pub netfs_priv:  u64,             // Filesystem-private field
}

pub struct NetfsSubrequest {
    pub rreq:   Weak<NetfsReadRequest>,
    pub start:  u64,
    pub len:    usize,
    pub source: NetfsSource,   // Server, Cache, LocalXfer
    pub state:  AtomicU32,     // Pending / InFlight / Completed / Failed
}

Read path: On page fault or explicit read() hitting an NFS-backed folio not in the page cache, netfs_read_folio() creates a NetfsReadRequest, calls init_read_request() which the NFS implementation uses to split the range into subrequests (one per READ RPC, sized to rsize), issues them concurrently via async tasks, and waits for all subrequests to complete. If a local CacheFiles cache is configured, subsets of reads may be served from disk cache rather than issuing an RPC.

Write path: On writeback(), netfs_writeback() groups dirty folios into write requests sorted by file offset, calls create_write_requests() to split into WRITE RPC-sized chunks (sized to wsize), and issues them via issue_write(). Ordering within a single writeback is by offset to maximize sequential I/O on the server. NFSv4 WRITE with FILE_SYNC stability mode is used when O_SYNC is active; otherwise UNSTABLE writes are used followed by a COMMIT RPC at fsync() time.

Readahead: The NetfsReadaheadControl struct drives speculative prefetch. When sequential read access is detected (via pos tracking in the file's NetfsInode), the readahead window expands up to max_readahead pages (default: 128 pages = 512 KiB at 4 KiB page size, configurable via mount option readahead=N). Readahead requests are lower priority than demand reads and are cancelled if memory pressure rises.
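
Window expansion under sequential access can be sketched as simple doubling capped at the mount's maximum (a sketch; the real NetfsReadaheadControl also shrinks the window under memory pressure):

```rust
/// Grow the readahead window on each confirmed sequential access:
/// double it, capped at the mount's max_readahead (both in pages).
pub fn grow_readahead(current_pages: u32, max_pages: u32) -> u32 {
    current_pages.saturating_mul(2).min(max_pages)
}
```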

14.11.6 Mount Options and Integration

NFS mounts use the new mount API (fsopen("nfs4") + fsconfig() + fsmount(), as specified in Section 13.2):

fsconfig(fd, FSCONFIG_SET_STRING, "source", "server.example.com:/export")
fsconfig(fd, FSCONFIG_SET_STRING, "sec",    "krb5p")
fsconfig(fd, FSCONFIG_SET_STRING, "vers",   "4.1")
fsconfig(fd, FSCONFIG_SET_STRING, "rsize",  "1048576")
fsconfig(fd, FSCONFIG_SET_STRING, "wsize",  "1048576")
fsconfig(fd, FSCONFIG_SET_STRING, "timeo",  "600")    // 60 seconds (units: 1/10 s)
fsconfig(fd, FSCONFIG_SET_STRING, "retrans","2")
fsconfig(fd, FSCONFIG_SET_FLAG,   "hard",   NULL)     // Hard mount: retry indefinitely

Key mount options:

Option        Values                    Meaning
------        ------                    -------
vers          4.0, 4.1, 4.2             NFSv4 minor version
sec           sys, krb5, krb5i, krb5p   Security flavor
rsize         4096–1048576              Read buffer size (bytes); must be a multiple of 4096
wsize         4096–1048576              Write buffer size (bytes); must be a multiple of 4096
hard / soft   flag                      Hard: retry indefinitely; soft: return an error after retrans timeouts
intr          flag                      Allow signals to interrupt hard-mount retries
timeo         integer (1/10 s)          Per-RPC timeout before retransmit
retrans       integer                   Number of retransmits before a soft-mount error
nconnect      1–16                      Number of parallel TCP connections to the server
readahead     pages                     Readahead window size (default 128)
ac / noac     flag                      Attribute caching; noac disables the client-side attribute cache
actimeo       seconds                   Unified attribute cache timeout

nconnect implementation: When nconnect=N is set, the XClnt maintains N TcpTransport instances. Each async RPC call is dispatched to the transport with the lowest in-flight queue depth (round-robin with depth tie-breaking). This spreads NFS traffic across multiple TCP flows, which improves throughput on high-bandwidth links where a single TCP flow is CPU- or window-limited.
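
The dispatch policy can be sketched as follows; round-robin with depth tie-breaking is reduced here to picking the first transport with the minimum in-flight count:

```rust
/// Pick the transport index with the lowest in-flight queue depth.
/// Ties go to the lowest index; the real dispatcher rotates a round-robin
/// cursor so that ties spread across transports.
pub fn pick_transport(in_flight: &[u32]) -> usize {
    in_flight
        .iter()
        .enumerate()
        .min_by_key(|&(_, depth)| *depth)
        .map(|(i, _)| i)
        .expect("nconnect >= 1 guarantees at least one transport")
}
```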

Capability requirements:

  • Capability::SysAdmin: required to mount NFS (same as Linux). Enforced in nfs4_validate_mount_data(), called from the fsconfig() implementation.
  • Capability::NetAdmin: required to configure NFS server-side parameters (not client mounts).
  • Rootless containers: NFS mounts inside a user namespace require that the filesystem server grant access to the mapped UID/GID range; the mount itself is permitted only if the user namespace has a mapping for UID 0 (i.e., is a privileged user namespace in the context of the host).

sysfs interface — /sys/kernel/umka/nfs/:

  • clients/: one directory per active NfsClient, containing:
    • clientid: hex-encoded 64-bit client ID
    • server: server address
    • lease_time_s: negotiated lease period
    • state: active / recovering / expired
    • session_id (v4.1 only): hex-encoded 128-bit session ID
  • servers/: per-server aggregate statistics:
    • rtt_us: exponentially smoothed round-trip time (microseconds)
    • retransmissions: total retransmitted RPCs since mount
    • ops_per_sec: rolling 1-second average of completed RPCs

14.11.7 Locking: lockd and NFSv4 Built-in Locks

NFSv3 uses lockd (Network Lock Manager, NLM protocol, RFC 1813 appendix) for advisory file locking. NFSv4 has locking built into the compound protocol (LOCK / UNLOCK / LOCKT operations).

NfsLockState (NFSv4):

pub struct NfsLockState {
    pub stateid: [u8; 16],
    pub type_:   NfsLockType,  // Read / Write
    pub offset:  u64,
    pub length:  u64,          // u64::MAX = to end of file
    pub seqid:   u32,
}

NFSv4 LOCK compound — SEQUENCE + PUTFH + LOCK { type_, reclaim, offset, length, locker: OpenToLockOwner { open_seqid, open_stateid, lock_seqid, lock_owner } }. On success returns lock_stateid used for subsequent LOCKU. On NFS4ERR_DENIED, returns the conflicting lock's owner, offset, and length so the caller can implement blocking via POSIX F_SETLKW semantics (the client polls with exponential backoff up to timeo).

lockd (NFSv3) — NLM protocol between kernel lockd threads. lockd starts automatically when the first NFSv3 mount is established (Capability::SysAdmin required). The NLM daemon:

  1. Registers with portmap/rpcbind as program 100021 version 4.
  2. Accepts NLM_LOCK, NLM_UNLOCK, NLM_TEST RPCs from clients (server role) and issues them to remote servers (client role).
  3. Implements the grace period subsystem: after a server reboot, accepts only NLM_LOCK with reclaim=true until all clients have re-claimed their locks or the grace period (default 45 s) expires.

Interaction between NLM and the VFS lock layer: NLM calls vfs_lock_file() (which calls the filesystem's lock() inode operation) on behalf of remote clients. UmkaOS's lock layer tracks pending NLM locks in INode::nlm_locks: Vec<NlmLock>, serialized by the inode's lock_mutex. When a lock is granted to a remote client, the NlmLock entry records the remote host and lock owner opaque identifier so it can be released on client crash (detected via NSM — Network Status Monitor callbacks, registered via SM_NOTIFY).

14.11.8 Design Decisions

  1. NFSv4.1 as the default minor version: v4.1 sessions eliminate the need for the callback channel to traverse firewalls (server uses the established fore channel for callbacks in v4.1), simplify lease recovery (session semantics), and enable parallel slot usage. The client attempts v4.1 first and falls back to v4.0 only if the server rejects EXCHANGE_ID.

  2. RPCSEC_GSS in-kernel, not userspace: Keeping GSS context management in the kernel (with upcalls to gssd only for ticket acquisition) eliminates a round-trip to userspace per-RPC at krb5i/krb5p security levels. The integrity and privacy transforms (AES-256-CTS + HMAC-SHA-512/256 per RFC 8009) are performed in-kernel using the crypto subsystem.

  3. nconnect for throughput scaling: A single TCP connection is limited by the TCP window and per-CPU processing. Multiple connections allow the NFS client to drive higher server throughput without RDMA. This matches Linux behavior since kernel 5.3.

  4. Hard mounts as default: Soft mounts return EIO on transient network failures and can corrupt application data. Hard mounts block until the server is reachable again. Applications that need timeout behavior use intr + SIGINT handling or O_NONBLOCK at the VFS layer.

  5. netfs layer as shared infrastructure: Rather than NFS implementing its own readahead and writeback, the netfs layer provides a single tested implementation. Future addition of Ceph or AFS clients reuses the same infrastructure without duplicating logic.

  6. Zero-copy XDR via NetBuf chains: RPC payloads for large reads and writes avoid data copies by encoding directly into or decoding directly from the NetBuf chains used by the TCP transport (Section 12). The record-mark framing is prepended as a single 4-byte header NetBuf node; the data pages are appended as additional NetBuf nodes referencing page cache pages directly.

  7. Attribute caching (ac option): NFS attributes (size, mtime, ctime, nlinks) are cached for actimeo seconds (default: 3–60 s, scaling with how frequently the file changes). noac disables caching entirely, providing close-to-open coherence at the cost of one GETATTR per VFS operation. The attribute cache is stored in the NfsInode overlaid on the Inode (as with all UmkaOS filesystem-specific inode data).


14.12 NFS Server (nfsd)

UmkaOS's NFS server (nfsd) enables exporting local filesystems to remote NFS clients over NFSv3 (RFC 1813) and NFSv4.1 (RFC 5661). The server runs as a pool of kernel threads that service SunRPC requests arriving on UDP and TCP port 2049. Configuration is via /proc/fs/nfsd/ and the exportfs(8) utility, which parses /etc/exports and writes export records into the kernel. NFSv4.1 is the default negotiated minor version; NFSv4.0 and NFSv3 clients are accepted by capability negotiation at connection time. The NFS server integrates with:

  • Section 10 (VFS) for all filesystem operations (lookup, read, write, getattr, setattr, readdir, lock, fsync).
  • Section 14.11 (NFS Client) for the shared SunRPC transport and RPCSEC_GSS machinery (the same RpcTransport infrastructure is used in both client and server roles).
  • Section 8 (Security) for Kerberos GSS context establishment and UID/GID credential validation.

14.12.1 Overview

The NFS server is structured into four layers:

  1. Transport: svc_recv() — per-thread blocking receive over the shared RpcSocket.
  2. Dispatch: svc_dispatch() — demultiplex by RPC program / version / procedure.
  3. NFS handlers: per-procedure functions that validate export permissions, decode XDR arguments, call into the VFS, and encode XDR replies.
  4. Stable state: the NFSv4 state machine (clients, sessions, opens, locks, delegations) and the stable-storage journal for crash recovery.

The Duplicate Request Cache (DRC) sits between layers 2 and 3 to suppress re-execution of non-idempotent operations on retransmitted requests.
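
The DRC can be sketched as a reply cache keyed by (client address, XID); a production cache also bounds memory and ages out entries:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Duplicate Request Cache sketch: on a retransmitted request, replay the
/// cached reply instead of re-executing the non-idempotent operation.
pub struct Drc {
    replies: HashMap<(SocketAddr, u32), Vec<u8>>,
}

impl Drc {
    pub fn new() -> Self {
        Drc { replies: HashMap::new() }
    }

    /// Returns the cached reply for a retransmission, if any.
    pub fn lookup(&self, client: SocketAddr, xid: u32) -> Option<&Vec<u8>> {
        self.replies.get(&(client, xid))
    }

    /// Record the encoded reply after executing the operation.
    pub fn insert(&mut self, client: SocketAddr, xid: u32, reply: Vec<u8>) {
        self.replies.insert((client, xid), reply);
    }
}
```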

14.12.2 VFS ExportOps Interface

The NFS server requires filesystems to implement ExportOperations to allow stable file handles — handles that survive server restart and that the server can use to reconstruct a dentry from opaque bytes alone, without a mounted path hierarchy.

/// Implemented by filesystems that support being NFS-exported.
///
/// Stable file handles survive server restarts. The server must be able to
/// reconstruct a `Dentry` from the opaque handle bytes alone. Filesystems
/// that do not implement this trait cannot be NFS-exported; attempting to do
/// so returns `EINVAL`.
///
/// # Safety invariant
/// `encode_fh` and `fh_to_dentry` must be inverses: for any inode `i`,
/// `fh_to_dentry(sb, buf, ty)` where `(buf, ty) = encode_fh(i, buf, None)`
/// must return a dentry pointing to the same inode.
pub trait ExportOperations: Send + Sync {
    /// Encode `inode` (and optionally its `parent`) into `fh`.
    ///
    /// Returns the handle-type byte stored in the on-wire NFS file handle.
    /// Typical implementations encode `(ino, generation)` for `parent = None`
    /// and `(ino, generation, parent_ino, parent_generation)` when a parent
    /// is supplied.
    fn encode_fh(
        &self,
        inode: &Inode,
        fh: &mut [u8; 128],
        parent: Option<&Inode>,
    ) -> u8;

    /// Reconstruct a dentry from a file handle.
    ///
    /// Called on every NFS operation that arrives with a file handle. The
    /// implementation must locate the inode (by inode number + generation or
    /// by UUID) and return an instantiated dentry. Returns `ESTALE` if the
    /// inode no longer exists.
    fn fh_to_dentry(
        &self,
        sb: &SuperBlock,
        fh: &[u8],
        fh_type: u8,
    ) -> Result<Arc<Dentry>, KernelError>;

    /// Reconstruct the parent dentry from a file handle that contains parent
    /// information (i.e., was encoded with `parent = Some(...)`).
    ///
    /// Returns `ESTALE` if the parent inode no longer exists.
    fn fh_to_parent(
        &self,
        sb: &SuperBlock,
        fh: &[u8],
        fh_type: u8,
    ) -> Result<Arc<Dentry>, KernelError>;

    /// Return the filename of `child` within `parent`.
    ///
    /// Used during NFSv4 `READDIR` to build parent-relative paths for
    /// directory entries. Returns `ENOENT` if `child` is not in `parent`.
    fn get_name(
        &self,
        parent: &Dentry,
        child: &Dentry,
    ) -> Result<String, KernelError>;

    /// Return the parent dentry of `child`.
    ///
    /// Used to walk upward toward the export root when the client traverses
    /// beyond the export boundary. Returns `EXDEV` if `child` is already the
    /// filesystem root.
    fn get_parent(&self, child: &Dentry) -> Result<Arc<Dentry>, KernelError>;
}

Standard UmkaOS-supported filesystems (ext4, XFS, Btrfs, tmpfs) implement ExportOperations using (inode_number, generation_number) as the file handle payload. The generation number is incremented each time an inode number is reused, ensuring handles from before a delete are correctly rejected as ESTALE rather than silently aliasing a new file.
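The (inode_number, generation_number) payload described above can be sketched as a pair of pure encode/decode helpers. This is an illustrative sketch, not the actual UmkaOS implementation; the handle-type constants and function shapes are assumptions, and the real `encode_fh`/`fh_to_dentry` operate on `Inode`/`Dentry` rather than raw integers.

```rust
/// Handle-type byte for a (ino, generation) handle without parent info (illustrative value).
const FH_TYPE_INO_GEN: u8 = 1;
/// Handle-type byte when parent info is also encoded (illustrative value).
const FH_TYPE_INO_GEN_PARENT: u8 = 2;

/// Encode (ino, generation) — and optionally (parent_ino, parent_gen) —
/// into the 128-byte handle buffer. Returns the handle-type byte.
fn encode_fh(ino: u64, gen: u32, parent: Option<(u64, u32)>, fh: &mut [u8; 128]) -> u8 {
    fh[0..8].copy_from_slice(&ino.to_be_bytes());
    fh[8..12].copy_from_slice(&gen.to_be_bytes());
    match parent {
        None => FH_TYPE_INO_GEN,
        Some((p_ino, p_gen)) => {
            fh[12..20].copy_from_slice(&p_ino.to_be_bytes());
            fh[20..24].copy_from_slice(&p_gen.to_be_bytes());
            FH_TYPE_INO_GEN_PARENT
        }
    }
}

/// Decode the (ino, generation) pair — the inverse required by the trait's
/// safety invariant. Returns None for an unknown type or truncated handle,
/// which the caller maps to ESTALE.
fn decode_fh(fh: &[u8], fh_type: u8) -> Option<(u64, u32)> {
    if fh_type != FH_TYPE_INO_GEN && fh_type != FH_TYPE_INO_GEN_PARENT {
        return None;
    }
    let ino = u64::from_be_bytes(fh.get(0..8)?.try_into().ok()?);
    let gen = u32::from_be_bytes(fh.get(8..12)?.try_into().ok()?);
    Some((ino, gen))
}
```

A real `fh_to_dentry` would then look up the decoded inode number, compare generations, and return ESTALE on mismatch.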

14.12.3 Exports Database

The exports table maps (host_pattern, local_path) to ExportOptions. It is loaded at server startup and updated by exportfs -a writing binary records to /proc/fs/nfsd/exports.

/// One row in the NFS exports table.
pub struct NfsExport {
    /// Root dentry of the exported directory tree.
    pub path:       Arc<Dentry>,
    /// Unique filesystem-ID for this export, embedded in NFSv3 `fsstat` and
    /// NFSv4 `fs_locations`. Auto-assigned from `sb.dev` unless overridden
    /// by `fsid=` option.
    pub fsid:       u64,
    /// Host specifier: single IP (`192.168.1.5`), CIDR subnet
    /// (`10.0.0.0/24`), DNS name (`host.example.com`), NIS netgroup
    /// (`@cluster`), or wildcard (`*`).
    pub client:     NfsClientSpec,
    /// Parsed export options.
    pub options:    ExportOptions,
    /// Effective UID for unauthenticated or squashed access (default 65534,
    /// the traditional `nfsnobody` UID).
    pub anon_uid:   u32,
    /// Effective GID for unauthenticated or squashed access (default 65534).
    pub anon_gid:   u32,
}

/// Parsed export options from `/etc/exports`.
pub struct ExportOptions {
    /// Allow write access. Default: `false` (read-only).
    pub rw:            bool,
    /// Require that every `WRITE` is committed to stable storage before the
    /// RPC reply is sent (`sync` option). Default: `false` (`async`).
    pub sync:          bool,
    /// Map UID 0 to `anon_uid`. Default: `true`.
    pub root_squash:   bool,
    /// Map all UIDs to `anon_uid`. Default: `false`.
    pub all_squash:    bool,
    /// Verify that file handles refer to a file within the exported subtree
    /// (not just the exported filesystem). Incurs a full path walk per
    /// request. Default: `false` (`no_subtree_check` has been the Linux
    /// default since nfs-utils 1.1.0; the per-request path-walk cost is
    /// rarely worth the marginal security benefit on modern systems).
    pub subtree_check: bool,
    /// Accepted security flavors, in preference order. Default: `[Sys]`.
    pub sec:           Vec<NfsSec>,
    /// Explicit `fsid=` override. Supersedes the auto-assigned value.
    pub fsid:          Option<u64>,
    /// Automatically re-export submounts visible under this path. Default:
    /// `false`.
    pub crossmnt:      bool,
    /// Do not hide submounts from clients; clients must traverse them
    /// explicitly via a separate mount. Default: `false`.
    pub nohide:        bool,
    /// Do not require authentication of NFSv3 lock (NLM) requests (the
    /// `no_auth_nlm` / `insecure_locks` export option). Default: `false`.
    pub no_auth_nlm:   bool,
    /// Require that the export is only activated when this path is an active
    /// mountpoint (the `mp=` option). `None` = no requirement.
    pub mp:            Option<String>,
}

/// Security flavor accepted on this export.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum NfsSec {
    /// AUTH_SYS (UID/GID in RPC credential, no authentication).
    Sys,
    /// RPCSEC_GSS Kerberos 5: authentication only.
    Krb5,
    /// RPCSEC_GSS Kerberos 5: authentication + integrity.
    Krb5i,
    /// RPCSEC_GSS Kerberos 5: authentication + integrity + privacy.
    Krb5p,
}

The NfsExportTable is an RCU-protected hash table keyed on (path_hash, client_addr). On each NFS request the server calls export_table.lookup(dentry, peer_addr) — an O(1) RCU read with no lock acquisition on the hot path. Updates (from exportfs -a writing /proc/fs/nfsd/exports) take the writer lock, rebuild the affected bucket, and publish via an RCU grace period.

14.12.4 Server Threads

/// The nfsd thread pool. One pool per NUMA node (optional; by default a
/// single pool is used for all CPUs).
pub struct NfsdPool {
    /// Active kernel threads servicing RPC requests.
    pub threads:      Vec<KernelThread>,
    /// Current configured thread count. Writable via
    /// `/proc/fs/nfsd/threads`. Default: 8. Typical production: 32–512.
    pub count:        AtomicU32,
    /// Shared UDP + TCP listener sockets on port 2049.
    pub socket:       Arc<RpcSocket>,
    /// Duplicate request cache shared across all threads in this pool.
    pub drc:          Arc<DuplicateRequestCache>,
    /// Per-pool statistics (requests received, dispatched, dropped).
    pub stats:        NfsdPoolStats,
}

Thread lifecycle:

  1. rpc.nfsd(8) opens /proc/fs/nfsd/threads and writes the desired thread count.
  2. The kernel spawns that many nfsd/<n> kernel threads.
  3. Each thread loops: svc_recv(socket)svc_authenticate(req)svc_dispatch(req)svc_send(reply).
  4. svc_recv() blocks in poll()/epoll_wait() on the shared socket; threads compete for incoming requests (one request per wakeup).
  5. Each thread owns a private 16 KB request buffer and a private 16 KB reply buffer, allocated on the kernel thread's stack; no per-request heap allocation is required in the common case.
  6. Writing 0 to /proc/fs/nfsd/threads shuts down all threads after draining in-flight requests.

Because nfsd threads are kernel threads (not user processes), each VFS call from a thread executes directly in kernel context with the caller's effective credential set — no context switch to user space is required between RPC dispatch and filesystem operation.

14.12.5 Duplicate Request Cache (DRC)

The DRC prevents non-idempotent operations from being re-executed on retransmitted requests. It is mandatory for correctness: a client that retransmits CREATE foo after a network timeout would otherwise create foo a second time if the first succeeded.

Non-idempotent procedures covered: SETATTR, WRITE, CREATE, MKDIR, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK (NFSv3); OPEN, CLOSE, SETATTR, WRITE, CREATE, REMOVE, RENAME, LINK, LOCK, LOCKU (NFSv4 — note: NFSv4.1 sessions provide their own exactly-once semantics via slot + sequence IDs, so the DRC is used only for NFSv3 and NFSv4.0 in UmkaOS).

/// Duplicate request cache: keyed by `(client_addr, xid)`.
pub struct DuplicateRequestCache {
    /// LRU cache. Capacity = `1024 * nfsd_thread_count` entries.
    entries:     Mutex<LruCache<DrcKey, DrcEntry>>,
    max_entries: usize,
}

/// Cache key: uniquely identifies one RPC call from one client.
#[derive(Hash, Eq, PartialEq, Clone)]
pub struct DrcKey {
    /// IPv4 or IPv6 address of the originating client.
    pub client_addr: IpAddr,
    /// RPC transaction ID (XID) from the call header.
    pub xid:         u32,
}

/// Cached reply for a completed non-idempotent operation.
pub struct DrcEntry {
    /// Serialized XDR reply bytes, ready to retransmit.
    pub reply:     Vec<u8>,
    /// Wall-clock time the entry was inserted (for eviction policy).
    pub timestamp: Instant,
    /// Adler-32 of the full request body. Used to detect the degenerate case
    /// where two different requests happen to share the same XID — in that
    /// case the cached reply is discarded and the new request is executed.
    pub checksum:  u32,
}

Request processing for non-idempotent procedures:

  1. Compute DrcKey { client_addr, xid } and checksum = adler32(request_body).
  2. Lock the DRC and look up the key.
  3. Hit, checksum matches: return entry.reply directly; skip VFS execution.
  4. Hit, checksum mismatch: evict stale entry; proceed to execute (new request collided with an old XID).
  5. Miss: release lock, execute VFS operation, acquire lock, insert DrcEntry { reply, timestamp, checksum }, release lock, send reply.
  6. Entries are evicted LRU when capacity is exceeded, or after 120 seconds (hard TTL).
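The hit/miss/collision logic in steps 1–5 can be sketched as follows. This is a minimal illustration using a plain `HashMap` in place of the kernel's `LruCache` (so eviction and TTL from step 6 are omitted); the function names are assumptions, not the actual UmkaOS API.

```rust
use std::collections::HashMap;

#[derive(Hash, Eq, PartialEq, Clone)]
struct DrcKey { client_addr: String, xid: u32 }
struct DrcEntry { reply: Vec<u8>, checksum: u32 }

enum DrcDecision { CachedReply(Vec<u8>), Execute }

/// Steps 2–4: look up the key and decide whether to replay or execute.
fn drc_check(cache: &mut HashMap<DrcKey, DrcEntry>, key: &DrcKey, checksum: u32) -> DrcDecision {
    match cache.get(key) {
        // Hit, checksum matches: retransmission — replay the cached reply.
        Some(e) if e.checksum == checksum => DrcDecision::CachedReply(e.reply.clone()),
        // Hit, checksum mismatch: a new request collided with an old XID.
        // Evict the stale entry and execute the new request.
        Some(_) => { cache.remove(key); DrcDecision::Execute }
        // Miss: execute the VFS operation, then record it via drc_insert().
        None => DrcDecision::Execute,
    }
}

/// Step 5: after executing a non-idempotent operation, cache its reply.
fn drc_insert(cache: &mut HashMap<DrcKey, DrcEntry>, key: DrcKey, reply: Vec<u8>, checksum: u32) {
    cache.insert(key, DrcEntry { reply, checksum });
}
```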

14.12.6 NFSv3 Protocol Dispatch

NFSv3 (RPC program 100003, version 3, RFC 1813) uses a stateless request/reply model. All NFS file handles are opaque blobs of up to 64 bytes. The server reconstructs a dentry from the file handle on every request via ExportOperations::fh_to_dentry().

Procedure Handler Idempotent
NULL (0) nfsd3_null() yes
GETATTR (1) nfsd3_getattr() yes
SETATTR (2) nfsd3_setattr() no
LOOKUP (3) nfsd3_lookup() yes
ACCESS (4) nfsd3_access() yes
READLINK (5) nfsd3_readlink() yes
READ (6) nfsd3_read() yes
WRITE (7) nfsd3_write() no
CREATE (8) nfsd3_create() no
MKDIR (9) nfsd3_mkdir() no
SYMLINK (10) nfsd3_symlink() no
MKNOD (11) nfsd3_mknod() no
REMOVE (12) nfsd3_remove() no
RMDIR (13) nfsd3_rmdir() no
RENAME (14) nfsd3_rename() no
LINK (15) nfsd3_link() no
READDIR (16) nfsd3_readdir() yes
READDIRPLUS (17) nfsd3_readdirplus() yes
FSSTAT (18) nfsd3_fsstat() yes
FSINFO (19) nfsd3_fsinfo() yes
PATHCONF (20) nfsd3_pathconf() yes
COMMIT (21) nfsd3_commit() yes

WRITE stability semantics: NFSv3 WRITE carries a stable_how field:

  • FILE_SYNC: data and metadata must be written to stable storage before reply. Implemented by calling vfs_write() followed by vfs_fsync(file, 0, len, 1).
  • DATA_SYNC: data must reach stable storage; metadata update may be deferred. Implemented by vfs_write() + vfs_fdatasync().
  • UNSTABLE: data written to page cache only (no fsync). The server returns the current write_verifier (a 64-bit value, initialized to ktime_get_boot_ns() at server start and exposed read-only at /proc/fs/nfsd/write_verifier). The client must issue a COMMIT RPC before treating UNSTABLE writes as durable.
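The stable_how dispatch above reduces to a small mapping from requested stability to the sync call that must complete before the reply. A minimal sketch, with `SyncAction` standing in for the `vfs_fsync()`/`vfs_fdatasync()` calls (the enum names are illustrative, not the actual UmkaOS types):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StableHow { Unstable, DataSync, FileSync }

#[derive(Debug, PartialEq, Eq)]
enum SyncAction { None, Fdatasync, Fsync }

/// Which durability call (if any) must complete before the WRITE reply.
fn write_sync_action(stable: StableHow) -> SyncAction {
    match stable {
        StableHow::FileSync => SyncAction::Fsync,     // data + metadata durable
        StableHow::DataSync => SyncAction::Fdatasync, // data durable, metadata may lag
        StableHow::Unstable => SyncAction::None,      // page cache only; COMMIT later
    }
}
```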

COMMIT: nfsd3_commit() calls vfs_fsync_range(file, offset, offset + count - 1, 0) and returns the write_verifier. If the verifier has changed since the client last received it (indicating a server restart), the client must re-issue all UNSTABLE writes.

READDIRPLUS: returns both directory entry names and their attributes in a single RPC, amortizing the per-entry GETATTR round trips. Implemented by iterating vfs_iterate_dir() and calling vfs_getattr() on each child inode, packing results into a single XDR reply up to the maxcount limit supplied by the client.

14.12.7 NFSv4.1 Compound Dispatch

NFSv4.1 (RPC program 100003, version 4, RFC 5661) replaces the per-procedure dispatch model with a compound RPC: a single RPC carries a sequence of operations processed left-to-right. If an operation fails with any status other than NFS4_OK, the server stops processing and returns partial results — only the first failed operation's status is returned along with the results of all preceding successful operations.

SEQUENCE must be the first operation in every compound (except BIND_CONN_TO_SESSION and EXCHANGE_ID). It provides session ID, slot ID, sequence ID, and cache-this flag. The server's slot table enforces exactly-once semantics: slot i may not carry a new request until the previous request on slot i has been replied to. This replaces the NFSv3/v4.0 DRC with a per-session, per-slot mechanism.
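The per-slot check can be sketched against the NfsdSlot shape defined in Section 14.12.8. This is an illustrative simplification (RFC 5661 §2.10.6 semantics): a retransmission whose reply was not cached would strictly draw NFS4ERR_RETRY_UNCACHED_REP, folded into Misordered here.

```rust
struct Slot { seq_id: u32, cached_reply: Option<Vec<u8>>, in_use: bool }

enum SeqCheck { NewRequest, Replay(Vec<u8>), Misordered }

/// Exactly-once admission for one fore-channel slot.
fn check_sequence(slot: &mut Slot, seqid: u32) -> SeqCheck {
    if seqid == slot.seq_id.wrapping_add(1) {
        slot.seq_id = seqid;      // accept: advance the slot sequence
        slot.cached_reply = None; // the previous reply can be discarded
        SeqCheck::NewRequest
    } else if seqid == slot.seq_id {
        // Same seqid: retransmission — replay the cached reply if present.
        match &slot.cached_reply {
            Some(r) => SeqCheck::Replay(r.clone()),
            None => SeqCheck::Misordered,
        }
    } else {
        // Neither current nor next: NFS4ERR_SEQ_MISORDERED.
        SeqCheck::Misordered
    }
}
```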

Key operations and their VFS mappings:

NFSv4.1 Operation VFS call Notes
EXCHANGE_ID (none) Client registration; returns clientid + capabilities
CREATE_SESSION (none) Establishes fore/back channels; negotiates slot counts and max RPC sizes
DESTROY_SESSION (none) Tears down session; releases slot table
DESTROY_CLIENTID (none) Releases all state for a clientid
SEQUENCE (none) Slot/sequence enforcement; lease renewal
PUTROOTFH VFS root dentry Sets current FH to the export root
PUTFH fh_to_dentry() Sets current FH from wire handle
GETFH (none) Returns current FH to client
SAVEFH / RESTOREFH (none) Push/pop FH onto per-compound stack
LOOKUP vfs_lookup() Walks one path component
LOOKUPP vfs_lookup("..") Walks to parent directory
OPEN vfs_open() Returns stateid + open flags
CLOSE vfs_release() Releases open stateid
READ vfs_read() Returns data + EOF flag
WRITE vfs_write() Returns bytes written + stability
COMMIT vfs_fsync_range() Flushes unstable writes
GETATTR vfs_getattr() Returns requested attribute bitmask
SETATTR vfs_setattr() Sets attributes; stateid required for size truncation
CREATE vfs_mkdir() / vfs_symlink() / vfs_mknod() Non-regular files only (regular files via OPEN)
REMOVE vfs_unlink() / vfs_rmdir() Inferred from inode type
RENAME vfs_rename() Atomic cross-directory rename
LINK vfs_link() Hard link
READDIR vfs_iterate_dir() Returns entries with requested attributes
READLINK vfs_readlink() Returns symlink target
LOCK vfs_lock_file() Byte-range lock; returns lock stateid
LOCKT vfs_lock_file(F_GETLK) Test for conflicting lock
LOCKU vfs_lock_file(F_UNLCK) Release byte-range lock
DELEGRETURN (none) Client returns a read or write delegation
LAYOUTGET pNFS metadata pNFS layout (optional; Tier 1 storage backends only)
LAYOUTRETURN pNFS metadata Client returns layout

14.12.7.1 pNFS Data Server Interface

pNFS (parallel NFS, RFC 5661 Section 12 and RFC 8435) distributes file data across multiple data servers (DSes) while the metadata server (MDS) handles namespace operations and layout leases. The following trait must be implemented by any Tier 1 block driver that wishes to serve as a pNFS data server.

/// pNFS data server operations. A pNFS layout divides file data across one or more
/// data servers (DSes); the metadata server (MDS) provides layout leases.
/// Each data server implements this trait to provide layout-specific I/O.
///
/// Layouts defined by RFC 5661 (NFS 4.1): FILE, BLOCK, OBJECT, FLEX_FILE (RFC 8435).
/// UmkaOS implements FILE layout (direct NFS I/O to data servers) and FLEX_FILE layout
/// (mirrors/striping with per-DS error tolerance).
pub trait PnfsDataServer: Send + Sync {
    /// Unique server identifier (IP:port or RDMA endpoint address).
    fn server_addr(&self) -> &PnfsServerAddr;

    /// Read `len` bytes from the data server at file offset `file_offset` into `buf`.
    /// Uses the layout credential from `layout_stateid`.
    ///
    /// Returns `Ok(bytes_read)` or an error. On `PNFS_NO_LAYOUT` error, the caller
    /// must fall back to the metadata server (MDS) for I/O.
    fn read(
        &self,
        layout_stateid: &LayoutStateId,
        file_offset: u64,
        len: u32,
        buf: &mut [u8],
    ) -> Result<u32, PnfsError>;

    /// Write `data` to the data server at file offset `file_offset`.
    /// `stable` indicates whether stable (synchronous) or unstable write is requested.
    ///
    /// Unstable writes are buffered in the data server; a subsequent `commit()`
    /// flushes them to stable storage. Stable writes are immediately persistent.
    fn write(
        &self,
        layout_stateid: &LayoutStateId,
        file_offset: u64,
        data: &[u8],
        stable: WriteStability,
    ) -> Result<WriteResponse, PnfsError>;

    /// Flush unstable writes to stable storage on the data server.
    /// Returns the write verifier that can be compared with previous unstable writes.
    fn commit(
        &self,
        layout_stateid: &LayoutStateId,
        file_offset: u64,
        count: u64,
    ) -> Result<WriteVerifier, PnfsError>;

    /// Return the data server's capabilities (supported layout types, max I/O size).
    fn capabilities(&self) -> PnfsDataServerCaps;

    /// Called when the layout is recalled by the MDS or invalidated. The data server
    /// must flush all pending writes and return the layout.
    fn layout_recall(&self, layout_stateid: &LayoutStateId, recall_type: RecallType);
}

/// Write stability mode for pNFS data server writes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriteStability {
    /// Data may remain in the data server's cache (UNSTABLE); a subsequent
    /// `commit()` is required before the data is durable.
    Unstable,
    /// Data written to server stable storage before reply (FILE_SYNC).
    Stable,
}

/// Response from a pNFS data server write operation.
pub struct WriteResponse {
    /// Number of bytes actually written.
    pub count: u32,
    /// Write stability achieved (may be higher than requested).
    pub stability: WriteStability,
    /// Write verifier: random value chosen by server at startup.
    /// If verifier changes between write and commit, a server restart occurred
    /// and uncommitted writes are lost (client must retry).
    pub verifier: WriteVerifier,
}

/// Capabilities of a pNFS data server.
pub struct PnfsDataServerCaps {
    /// Maximum I/O size for a single read or write RPC.
    pub max_rw_size: u32,
    /// Supported layout types (FILE, BLOCK, OBJECT, FLEX_FILE).
    pub layout_types: u32,
    /// True if RDMA transport is available for this data server.
    pub rdma_available: bool,
}

/// Opaque pNFS layout stateid (per RFC 5661 §13.5.2).
pub type LayoutStateId = [u8; 16];
/// pNFS write verifier (per RFC 5661 §16.3): 8-byte opaque value.
pub type WriteVerifier = [u8; 8];
/// Opaque server network address.
pub struct PnfsServerAddr { pub addr: [u8; 48], pub len: u8, pub _pad: [u8; 7] }

/// Errors specific to pNFS data server operations.
#[derive(Debug)]
pub enum PnfsError {
    /// The layout stateid is no longer valid (server recalled or expired it).
    /// Client must fetch a new layout from the MDS.
    NoLayout,
    /// Data server is temporarily unavailable. Client may retry or fall back to MDS.
    Unavailable,
    /// I/O error on the data server.
    Io(KernelError),
    /// Layout type not supported by this data server.
    UnsupportedLayout,
}

14.12.8 NFSv4 State Management

NFSv4 introduces stateful file access. The server tracks client IDs, sessions, open owners, lock owners, and delegations. All state has an associated lease; state from clients whose leases expire is reclaimed by the server.

/// All per-client NFSv4 state. Protected by `NfsdStateTable::client_lock`.
pub struct NfsdClientState {
    /// 64-bit client ID assigned at `EXCHANGE_ID`. Unique for the server's
    /// lifetime.
    pub clientid:     u64,
    /// 8-byte verifier supplied by the client at `EXCHANGE_ID`. Used to
    /// detect client restarts (same IP, new verifier → client rebooted).
    pub verifier:     [u8; 8],
    /// Confirmed IP address of the client (from the TCP connection that
    /// issued `CREATE_SESSION`).
    pub client_addr:  IpAddr,
    /// RPCSEC_GSS principal name if the client authenticated with Kerberos.
    /// `None` for AUTH_SYS clients.
    pub principal:    Option<String>,
    /// Active sessions (fore + back channel pairs).
    pub sessions:     Vec<Arc<NfsdSession>>,
    /// Open owners: keyed by the 28-byte `open_owner` opaque identifier.
    pub open_owners:  BTreeMap<[u8; 28], Arc<OpenOwner>>,
    /// Lock owners: keyed by the 28-byte `lock_owner` opaque identifier.
    pub lock_owners:  BTreeMap<[u8; 28], Arc<LockOwner>>,
    /// Read and write delegations currently granted to this client.
    pub delegations:  Vec<Arc<Delegation>>,
    /// Absolute time at which this client's lease expires if not renewed.
    /// Renewed on every `SEQUENCE` from this client.
    pub lease_expiry: Instant,
}

/// An NFSv4.1 session (one `CREATE_SESSION` creates one session).
pub struct NfsdSession {
    pub session_id:   [u8; 16],
    /// Fore channel: client → server request slots.
    pub fore_slots:   Vec<NfsdSlot>,
    /// Back channel: server → client callback slots.
    pub back_channel: Option<RpcBackChannel>,
    /// Maximum request size negotiated at `CREATE_SESSION` (bytes).
    pub max_req_sz:   u32,
    /// Maximum response size negotiated at `CREATE_SESSION` (bytes).
    pub max_resp_sz:  u32,
}

/// One slot in a session's fore channel.
pub struct NfsdSlot {
    pub seq_id:       u32,
    /// Cached reply for the last compound on this slot (for replay detection).
    pub cached_reply: Option<Vec<u8>>,
    pub in_use:       AtomicBool,
}

/// An open-owner and the associated open stateid.
pub struct OpenOwner {
    /// Current stateid (seqid increments on each OPEN/CLOSE/OPEN_DOWNGRADE).
    pub stateid:    StateId,
    /// The opened file's dentry.
    pub file:       Arc<Dentry>,
    /// Share access bits granted to this open (read, write, or both).
    pub access:     OpenAccess,
    /// Share deny bits this open holds (deny read, deny write, or neither).
    pub deny:       OpenDeny,
    /// Reference count: number of times the client has opened this
    /// (owner, file) pair without a corresponding CLOSE.
    pub open_count: u32,
}

/// An NFSv4 stateid: identifies one open, lock, or delegation instance.
pub struct StateId {
    /// Sequence number, incremented on each state transition.
    pub seqid: u32,
    /// 12 opaque bytes unique within the server's lifetime.
    pub other: [u8; 12],
}

/// A delegation granted to a client.
pub struct Delegation {
    pub stateid:     StateId,
    pub dtype:       DelegationType,  // Read or Write
    pub file:        Arc<Dentry>,
    pub client:      u64,             // clientid
    /// Time at which a pending recall (CB_RECALL) was sent. `None` if no
    /// recall is in progress.
    pub recall_sent: Option<Instant>,
}

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum DelegationType {
    /// Read delegation: client may cache reads without contacting server.
    Read,
    /// Write delegation: client has exclusive write access; all writes are
    /// cached locally and flushed on DELEGRETURN or recall.
    Write,
}

Lease renewal: Each NfsdClientState has a lease_expiry deadline. Any SEQUENCE operation from the client resets the deadline to now + nfsd_lease_time (default: 90 seconds). The lease reaper task runs every 10 seconds and reclaims state for clients whose lease_expiry is in the past: all OpenOwner entries are closed, byte-range locks are released via vfs_lock_file(F_UNLCK), and delegations are revoked.

Grace period: After nfsd starts (or restarts), the server enters a grace period of nfsd_gracetime seconds (default: 90 seconds, equal to the lease time). During the grace period, the server:

  • Accepts OPEN with claim_type = CLAIM_PREVIOUS (state reclaim) from clients that held opens or delegations before the restart.
  • Rejects new OPEN with claim_type = CLAIM_NULL with NFS4ERR_GRACE.
  • Reads the stable-storage journal (Section 14.12.10) to learn which clients had state before the restart, populating the set of expected reclaimants.

Once the grace period expires (or all expected reclaimants have completed reclaim, whichever is first), the server transitions to normal operation.
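The grace-period admission rule for OPEN reduces to a small decision table. A minimal sketch under illustrative names (RFC 5661 prescribes NFS4ERR_NO_GRACE for a CLAIM_PREVIOUS reclaim attempted after the grace period ends):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClaimType { Null, Previous }

#[derive(Debug, PartialEq, Eq)]
enum OpenVerdict { Proceed, ErrGrace, ErrNoGrace }

/// Admission decision for an OPEN given the server's grace state.
fn admit_open(in_grace: bool, claim: ClaimType) -> OpenVerdict {
    match (in_grace, claim) {
        (true, ClaimType::Previous) => OpenVerdict::Proceed,     // state reclaim
        (true, ClaimType::Null) => OpenVerdict::ErrGrace,        // NFS4ERR_GRACE
        (false, ClaimType::Previous) => OpenVerdict::ErrNoGrace, // NFS4ERR_NO_GRACE
        (false, ClaimType::Null) => OpenVerdict::Proceed,        // normal operation
    }
}
```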

Delegations and recalls: The server grants a Read delegation when a file is opened for read and there are no write opens or write delegations outstanding. It grants a Write delegation when a file is opened for write and there is exactly one open (the requesting client's) and no conflicting opens or delegations. When a conflicting open arrives for a delegated file, the server issues CB_RECALL on the back channel to the delegating client and waits nfsd_lease_time / 2 seconds for DELEGRETURN before forcibly revoking the delegation with NFS4ERR_DELEG_REVOKED.
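The grant policy above can be sketched as a predicate over a per-file state summary. The `FileState` counts (which include the requesting client's own open) and the function name are illustrative assumptions, not the actual UmkaOS types:

```rust
/// Summary of a file's open/delegation state as seen by the server.
/// The requester's own open is included in the counts.
struct FileState { read_opens: u32, write_opens: u32, read_delegs: u32, write_delegs: u32 }

#[derive(Debug, PartialEq, Eq)]
enum Deleg { None, Read, Write }

/// Decide which delegation (if any) to grant with the OPEN reply.
fn delegation_to_grant(opening_for_write: bool, s: &FileState) -> Deleg {
    if opening_for_write {
        // Write delegation: the requester's open must be the only open,
        // with no delegations of either kind outstanding.
        if s.write_opens == 1 && s.read_opens == 0
            && s.read_delegs == 0 && s.write_delegs == 0 {
            Deleg::Write
        } else {
            Deleg::None
        }
    } else {
        // Read delegation: no write opens and no write delegations.
        if s.write_opens == 0 && s.write_delegs == 0 { Deleg::Read } else { Deleg::None }
    }
}
```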

14.12.9 Authentication and Security

AUTH_SYS (auth_flavor = AUTH_UNIX): the RPC credential carries a plaintext UID, GID, and supplementary GID list, which the server uses directly as the effective credential for VFS calls. No cryptographic authentication is performed; the credentials are trivially forgeable by any host on the network segment, so AUTH_SYS is a legacy mechanism for trusted private networks only. Production deployments should use sec=krb5p (authentication + integrity + privacy), or at minimum sec=krb5i (authentication + integrity); AUTH_SYS should be restricted to legacy appliances or isolated lab networks where deploying a Kerberos KDC is not feasible. AUTH_SYS requests are rejected on exports that specify sec=krb5 or stronger.

RPCSEC_GSS / Kerberos 5 (RFC 2203 + RFC 7861): three protection levels:

  • krb5: authentication only. The RPC call header contains a GSS MIC token covering the XID and procedure number; the server verifies the MIC using the session key. Payload is transmitted in clear.
  • krb5i: authentication + integrity. The entire RPC body (arguments + results) is covered by a GSS MIC token. Payload is transmitted in clear but any tampering is detected.
  • krb5p: authentication + integrity + privacy. The entire RPC body is wrapped with GSS Wrap (encrypt-then-MAC). Payload is opaque to network observers.

In all three cases the cryptographic transforms use AES-256-CTS-HMAC-SHA512-256 (enctypes aes256-cts-hmac-sha512-256, RFC 8009) when negotiated with a Kerberos 5 KDC that supports it, falling back to aes128-cts-hmac-sha256-128 (RFC 8009) or aes256-cts-hmac-sha1-96 (RFC 3962) for older KDCs.

GSS context establishment flow:

  1. The client sends RPCSEC_GSS_INIT with a GSS_Init_sec_context token (Kerberos AP-REQ encapsulated in GSS-API).
  2. The server calls rpc_gss_svc_accept_sec_context() which makes a synchronous upcall to gssd via a kernel–user pipe. gssd calls gss_accept_sec_context() with the host's keytab (/etc/krb5.keytab) and returns the derived session key and client principal to the kernel.
  3. The kernel stores the session key in GssContext::session_key (protected by a Mutex<>; the key is zeroed on context expiry via Drop). Subsequent RPCs perform MIC/Wrap/Unwrap in-kernel using the UmkaOS crypto subsystem (Section 8).
  4. The svcgssd daemon (alternative to gssd) is also supported; the upcall interface is identical.

UID mapping is applied after credential extraction, before any VFS call:

  • root_squash (default on): UID 0 → anon_uid (65534), GID 0 → anon_gid (65534).
  • all_squash: all UIDs/GIDs → anon_uid/anon_gid.
  • Neither option: credentials passed through unchanged.

UID mapping is applied per-export, so the same file can be accessed with different effective credentials by clients matched to different export rows.
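The squash rules can be sketched as a pure mapping over a (uid, gid) pair and the relevant export options. A minimal illustration; the names are hypothetical, not the actual UmkaOS credential code:

```rust
struct SquashOpts { root_squash: bool, all_squash: bool, anon_uid: u32, anon_gid: u32 }

/// Apply the export's squash policy to the extracted RPC credential.
fn squash(uid: u32, gid: u32, o: &SquashOpts) -> (u32, u32) {
    if o.all_squash {
        // all_squash: every credential maps to the anonymous identity.
        return (o.anon_uid, o.anon_gid);
    }
    if o.root_squash {
        // root_squash: UID 0 and GID 0 are squashed independently.
        let u = if uid == 0 { o.anon_uid } else { uid };
        let g = if gid == 0 { o.anon_gid } else { gid };
        return (u, g);
    }
    (uid, gid) // no squashing: pass credentials through unchanged
}
```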

14.12.10 /proc/fs/nfsd Interface

The /proc/fs/nfsd/ pseudo-filesystem is the control plane for the NFS server. It is mounted at boot when the nfsd kernel module is loaded (or when the first export is created, if nfsd is built-in).

/proc/fs/nfsd/
├── threads          (rw): read = "N\n" current thread count; write N to spawn/trim threads
├── exports          (rw): current exports table in exportfs format; write to update
├── clients/         (r-x): one subdirectory per active NFSv4 client
│   └── <clientid>/        clientid in lowercase hex (16 hex digits)
│       ├── info     (r--): "addr: ...\nprincipal: ...\nlease_remaining: ...s\n"
│       ├── states   (r--): one line per open stateid and delegation
│       └── ctl      (-w-): write "expire\n" to immediately revoke this client's lease
├── pool_stats       (r--): per-pool thread count, requests served, DRC hit rate
├── write_verifier   (r--): current write verifier as 16-char lowercase hex
├── nfsv4leasetime   (rw): NFSv4 lease duration in seconds (default 90, range 10–3600)
├── nfsv4gracetime   (rw): grace period duration in seconds (default = nfsv4leasetime)
├── nfsv4minorversion (rw): highest NFSv4 minor version offered (0 or 1; default 1)
└── stable_storage   (rw): path to the stable-state recovery directory
                            (default: /var/lib/nfs/nfsd4_recoverydir)

The stable_storage path points to a directory on a local persistent filesystem. The server writes one file per client (named by clientid) containing serialized NfsdClientState (open owners, lock owners, delegation stateids) using a binary format with a CRC-32C checksum. These files are read during the grace period to populate the set of expected reclaimants. They are deleted when a client sends DESTROY_CLIENTID or when its lease expires normally.
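The per-client record layout above (serialized state plus a CRC-32C checksum) can be sketched as a simple length-prefixed frame. This is an illustrative sketch: the framing layout and function names are assumptions, and a table-free bitwise CRC-32C (Castagnoli polynomial, reflected form) stands in for whatever accelerated implementation the kernel uses.

```rust
/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Frame a serialized client-state record as: length | payload | CRC-32C.
fn frame_record(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 8);
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out
}

/// Parse and verify a framed record; None on truncation or checksum mismatch.
fn parse_record(buf: &[u8]) -> Option<&[u8]> {
    let len = u32::from_le_bytes(buf.get(0..4)?.try_into().ok()?) as usize;
    let payload = buf.get(4..4 + len)?;
    let stored = u32::from_le_bytes(buf.get(4 + len..8 + len)?.try_into().ok()?);
    if crc32c(payload) == stored { Some(payload) } else { None }
}
```

During the grace period the server would parse each client file this way, discarding any record whose checksum fails (e.g. from a torn write at crash time).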

14.12.11 NLM (Network Lock Manager) Server

NFSv3 byte-range locking uses a separate RPC protocol: NLM (program 100021, version 4, defined in the OpenGroup XNFS specification). The NLM server runs as part of lockd alongside the NFS server.

NLM server procedures:

Procedure Handler Notes
NLM_TEST nlm4_test() Test for conflicting lock (non-destructive)
NLM_LOCK nlm4_lock() Acquire byte-range lock; may block if block=true
NLM_CANCEL nlm4_cancel() Cancel a pending blocked lock request
NLM_UNLOCK nlm4_unlock() Release a byte-range lock
NLM_GRANTED nlm4_granted() Callback: server notifies client of granted blocked lock
NLM_TEST_MSG async variant of TEST One-way; reply via NLM_TEST_RES callback
NLM_LOCK_MSG async variant of LOCK One-way; reply via NLM_LOCK_RES callback
NLM_UNLOCK_MSG async variant of UNLOCK One-way; reply via NLM_UNLOCK_RES callback
NLM_SHARE nlm4_share() DOS-style share reservation (rarely used)
NLM_UNSHARE nlm4_unshare() Release share reservation
NLM_NM_LOCK nlm4_nm_lock() Non-monitored lock (NSM not involved)
NLM_FREE_ALL nlm4_free_all() Release all locks for a client (NSM reboot notification)

VFS integration: nlm4_lock() calls vfs_lock_file(file, F_SETLKW, flock) with the translated struct file_lock. Granted locks are recorded in Inode::nlm_locks: Vec<NlmLock>, protected by Inode::lock_mutex. Each NlmLock entry stores the remote host address and the NLM lock_owner opaque cookie so the lock can be released if the client crashes.

NSM (Network Status Monitor) integration: rpc.statd (program 100024) runs in user space and monitors client liveness. When lockd grants a lock to a remote client, it calls nsm_monitor(client_addr) to register the client with rpc.statd. If the client reboots, its statd sends SM_NOTIFY to the server's rpc.statd, which forwards the notification to the kernel's nfsd_sm_notify() entry point. The kernel then calls nlm_host_rebooted(), which iterates Inode::nlm_locks for all inodes holding locks from that host and calls vfs_lock_file(F_UNLCK) to release them, allowing other waiters to proceed.

Grace period: After lockd restarts (following a server crash), it enters a grace period (default 45 seconds) during which it accepts only NLM_LOCK requests with reclaim = true. This allows clients to re-acquire locks they held before the crash before the server accepts new competing lock requests.

14.12.12 Linux Compatibility

  • /etc/exports format: identical to Linux nfsd, including all documented export options (rw, ro, sync, async, root_squash, no_root_squash, all_squash, no_all_squash, subtree_check, no_subtree_check, sec=, fsid=, anonuid=, anongid=, crossmnt, nohide, no_auth_nlm, mp=). Unrecognized options are rejected with a logged warning (not silently ignored).
  • exportfs(8), showmount(8), nfsstat(8), rpc.nfsd(8), rpc.mountd(8) all operate without modification.
  • /proc/fs/nfsd/ layout matches Linux kernel 5.15+ nfsd. Fields that do not exist in older kernels (e.g., nfsv4minorversion) are additive and ignored by older tools.
  • NFSv3 wire protocol: RFC 1813 compliant, interoperable with Linux, Solaris, macOS, FreeBSD, and Windows NFS clients.
  • NFSv4.1 wire protocol: RFC 5661 compliant. pNFS metadata operations (LAYOUTGET, LAYOUTRETURN, LAYOUTCOMMIT, GETDEVICEINFO) are implemented; pNFS data-server operations require a Tier 1 block driver that exposes the PnfsDataServer interface (optional; falls back to MDS-only mode if unavailable).
  • NFSv4.0 minor version: accepted (negotiated down from v4.1 if the client does not support EXCHANGE_ID). The DRC (Section 14.12.5) provides exactly-once semantics for v4.0.
  • NFSv3 and NFSv4 servers can run concurrently; both are enabled by default. There is no separate version-selection knob; versions are enabled through the standard mechanism (exportfs options plus kernel compile flags), as on Linux.

14.12.13 Design Decisions

  1. NFSv4.1 sessions replace the DRC for v4.1 clients: The per-session, per-slot sequence-ID mechanism in NFSv4.1 (RFC 5661 §2.10) provides exactly-once semantics without the hash-table overhead of the DRC. The DRC is retained only for NFSv3 and NFSv4.0 clients. NFSv4.1 clients receive NFS4ERR_SEQ_MISORDERED on sequence violations rather than a cached reply.

  2. Stable storage journal for NFSv4 state: Writing client open/lock/delegation state synchronously to disk on every OPEN, CLOSE, LOCK, LOCKU, and DELEGRETURN allows the server to survive a crash and offer clients a grace period for state reclamation (RFC 5661 §8.4.2). Without stable storage, the server would be forced to return NFS4ERR_NO_GRACE to all clients, requiring them to re-open all files from scratch — disruptive for workloads with thousands of open files.

  3. Thread pool model over event-driven dispatch: Kernel threads (one thread per outstanding request, blocking on svc_recv) keep the code path from RPC arrival to VFS call entirely synchronous. An event-driven model (one thread multiplexing many connections via epoll) would require explicit continuation passing through VFS callbacks, adding complexity with negligible throughput benefit at the connection counts typical for NFS servers (100s–1000s of clients, not millions).

  4. ExportOperations as a required trait: Requiring filesystems to provide stable file handles (encode_fh / fh_to_dentry) makes the correctness contract explicit at the type level. Filesystems that cannot provide stable handles (e.g., a synthetic in-memory filesystem with no persistent inode allocation) simply do not implement the trait and cannot be exported — instead of being exported with silently broken ESTALE behavior.

  5. AUTH_SYS and Kerberos both in-kernel: The Kerberos per-RPC integrity and privacy transforms (AES-256-CTS + HMAC) are performance-critical at high RPC rates and belong in the kernel crypto subsystem. Only the initial GSS context negotiation (involving the KDC and the host keytab) uses a user-space upcall to gssd. This is identical to Linux nfsd's approach and ensures compatibility with existing gssd/svcgssd deployments.

  6. NLM co-located with nfsd: The NLM lock manager shares the lockd kernel threads and the per-inode nlm_locks list with the NFS server rather than running as a separate subsystem. This avoids a cross-subsystem RPC for every lock operation and allows lock grants and lock releases to be performed atomically with respect to VFS inode locking.


14.13 I/O Priority and Scheduling

UmkaOS implements per-task I/O priority with full Linux ioprio_set/ioprio_get syscall compatibility. The UmkaOS I/O scheduler (MQPA — Multi-Queue Priority-Aware) is a unified implementation that replaces the Linux family of pluggable schedulers (CFQ, mq-deadline, BFQ, kyber) with a single, purpose-built scheduler that is correct, composable, and integrates natively with NVMe multi-queue hardware.

14.13.1 Syscall Interface

ioprio_set(which: i32, who: i32, ioprio: i32) -> 0 | -EINVAL | -EPERM | -ESRCH
ioprio_get(which: i32, who: i32) -> ioprio: i32 | -EINVAL | -EPERM | -ESRCH

Syscall numbers (x86-64): ioprio_set = 251, ioprio_get = 252. Syscall numbers (i386 compat): ioprio_set = 289, ioprio_get = 290. Syscall numbers (AArch64): ioprio_set = 30, ioprio_get = 31.

which argument — target scope:

Constant Value Meaning
IOPRIO_WHO_PROCESS 1 Single process or thread identified by who (PID/TID). If who = 0, the calling thread.
IOPRIO_WHO_PGRP 2 All processes in the process group identified by who. If who = 0, the caller's process group.
IOPRIO_WHO_USER 3 All processes whose real UID matches who.

ioprio_get with PGRP/USER: When multiple processes match, returns the highest priority found: RT > BE > Idle, and within the same class, the numerically lowest level (0 = highest).
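The "highest priority found" rule can be expressed as a comparison on (class rank, level) pairs; the helper names below are illustrative:

```rust
/// I/O scheduling class as in Section 14.13.2 (numeric values match Linux).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IoSchedClass { None = 0, RealTime = 1, BestEffort = 2, Idle = 3 }

/// Rank for ioprio_get aggregation: RT strongest, then BE, then Idle, then None.
fn class_rank(c: IoSchedClass) -> u8 {
    match c {
        IoSchedClass::RealTime => 0,
        IoSchedClass::BestEffort => 1,
        IoSchedClass::Idle => 2,
        IoSchedClass::None => 3,
    }
}

/// Return the higher-priority of two (class, level) pairs; within a class,
/// the numerically lowest level wins (0 = highest).
pub fn stronger(a: (IoSchedClass, u8), b: (IoSchedClass, u8)) -> (IoSchedClass, u8) {
    if (class_rank(a.0), a.1) <= (class_rank(b.0), b.1) { a } else { b }
}
```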

Error conditions:

Error Condition
EINVAL which is not one of the three valid values; ioprio encodes an invalid class (> 3) or level (> 7); the level is non-zero for IoSchedClass::Idle.
EPERM Caller lacks CAP_SYS_ADMIN when setting RT class; caller lacks CAP_SYS_NICE when setting another user's tasks.
ESRCH No process matching the given which/who combination was found.

14.13.2 IoPriority Encoding

The ioprio value is a 16-bit quantity passed as a 32-bit int (upper 16 bits must be zero). The bit layout is identical to Linux's <linux/ioprio.h>:

bits 15-13: I/O scheduling class (3 bits)
bits 12-0:  Priority level within the class (13 bits; only values 0-7 are meaningful)
/// Per-task I/O priority. Wire-compatible with Linux `ioprio` values.
///
/// Bit layout (little-endian u16):
///   [15:13] = IoSchedClass (3 bits)
///   [12:0]  = level (13 bits; values 0–7 meaningful; 0 = highest priority)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct IoPriority(u16);

impl IoPriority {
    /// Construct an `IoPriority` from a class and level.
    ///
    /// `level` must be in 0..=7. Values 8..=0x1fff are invalid and rejected
    /// by `ioprio_set`; this constructor does not clamp — callers should
    /// validate before constructing.
    pub const fn new(class: IoSchedClass, level: u8) -> Self {
        IoPriority(((class as u16) << 13) | (level as u16 & 0x1fff))
    }

    /// Decode the scheduling class from the encoded value.
    pub fn class(self) -> IoSchedClass {
        match (self.0 >> 13) & 0x7 {
            0 => IoSchedClass::None,
            1 => IoSchedClass::RealTime,
            2 => IoSchedClass::BestEffort,
            3 => IoSchedClass::Idle,
            _ => IoSchedClass::None, // bits 4-7 are invalid; treat as None
        }
    }

    /// Decode the priority level (0 = highest within the class).
    pub fn level(self) -> u8 {
        (self.0 & 0x1fff) as u8
    }

    /// Round-trip to/from the raw `i32` syscall argument.
    pub fn from_raw(raw: i32) -> Option<Self> {
        if raw < 0 || raw > 0xffff { return None; }
        Some(IoPriority(raw as u16))
    }

    pub fn to_raw(self) -> i32 {
        self.0 as i32
    }

    /// The zero value: class = None, level = 0.
    /// Semantics: inherit priority from CPU nice value.
    pub const NONE: IoPriority = IoPriority(0);
}

/// I/O scheduling class. Numeric values are identical to Linux
/// `IOPRIO_CLASS_*` constants — do not renumber.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum IoSchedClass {
    /// Class not set. I/O priority is derived from CPU nice (see Section 14.13.3).
    None       = 0,
    /// Real-time. Levels 0–7, 0 = highest. Preempts all BestEffort and Idle I/O.
    RealTime   = 1,
    /// Best-effort. Levels 0–7, 0 = highest. Default class for all tasks.
    BestEffort = 2,
    /// Idle. Served only when no RT or BE I/O is pending.
    /// The level field is ignored; all Idle I/O is equal.
    Idle       = 3,
}

Validation rules (enforced by ioprio_set before storing):

  • Class must be 0–3 (values 4–7 are reserved; return EINVAL).
  • For RT and BE: level must be 0–7 (values 8–0x1fff are invalid; return EINVAL).
  • For Idle: level must be 0 (any non-zero level is EINVAL).
  • For None: level must be 0.
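These checks can be sketched as a pure function over the raw syscall argument (errno handling simplified; the real syscall additionally checks permissions and target existence):

```rust
/// Validate a raw ioprio value per the rules above. Returns -EINVAL on any
/// violation (sketch; errno constant inlined).
pub fn validate_ioprio(raw: i32) -> Result<(), i32> {
    const EINVAL: i32 = 22;
    // Upper 16 bits of the 32-bit argument must be zero.
    if !(0..=0xffff).contains(&raw) {
        return Err(-EINVAL);
    }
    let class = (raw >> 13) & 0x7;
    let level = raw & 0x1fff;
    match class {
        // None (0) and Idle (3): level must be 0.
        0 | 3 if level == 0 => Ok(()),
        // RT (1) and BE (2): levels 0-7 only.
        1 | 2 if level <= 7 => Ok(()),
        // Classes 4-7 are reserved; out-of-range levels are invalid.
        _ => Err(-EINVAL),
    }
}
```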

14.13.3 Priority Inheritance from CPU Nice

When a task has IoPriority::NONE (class = IoSchedClass::None), its effective I/O priority is derived from its CPU nice value at dispatch time. This matches Linux behavior:

effective_class = BestEffort
effective_level = (nice + 20) / 5

This maps the nice range −20..+19 to BE levels 0..7:

nice effective BE level
−20 0 (highest)
−15 1
−10 2
−5 3
0 4 (default)
5 5
10 6
19 7 (lowest)

The derivation happens in the dispatch path, not at ioprio_set time, so that a subsequent setpriority(2) call continues to influence I/O priority as expected.
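The mapping is a one-line helper; this is a sketch of the nice_to_be_level() function referenced in the submission path (clamping out-of-range input is an assumption added for the sketch):

```rust
/// Effective BE level for a task in class None, derived from its CPU nice
/// value at dispatch time. Valid nice values are -20..=19.
pub fn nice_to_be_level(nice: i32) -> u8 {
    // (nice + 20) / 5 maps -20..=19 onto BE levels 0..=7.
    ((nice.clamp(-20, 19) + 20) / 5) as u8
}
```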

14.13.4 Task Storage and Inheritance

/// Fields added to the Task structure (see Chapter 7).
pub struct Task {
    // ... existing fields ...

    /// Explicitly set I/O priority. `IoPriority::NONE` means "derive from nice".
    pub io_priority: IoPriority,
}

Fork semantics: On fork(2) and clone(2) without CLONE_IO, the child inherits the parent's io_priority value verbatim. If the parent had IoPriority::NONE, the child also starts with IoPriority::NONE and its effective priority is derived from its own nice value (which it also inherits from the parent, but may be changed independently).

CLONE_IO: When CLONE_IO is set, the child shares the parent's I/O context (same io_context pointer). In this case the io_priority is also shared — a write by either task is visible to the other. This is the same as Linux.

Thread groups: POSIX threads within the same process do NOT share io_priority by default (consistent with Linux). Each thread has an independent io_priority. Tools that wish to set the priority for all threads of a process must call ioprio_set(IOPRIO_WHO_PROCESS, tid, ioprio) once per thread, using TIDs from /proc/<pid>/task/.

14.13.5 Permission Model

UmkaOS enforces the same permission rules as Linux:

Operation Required capability
Set IoSchedClass::RealTime for any task CAP_SYS_ADMIN
Set IoSchedClass::BestEffort or IoSchedClass::Idle for own tasks None
Set IoSchedClass::BestEffort or IoSchedClass::Idle for another user's tasks CAP_SYS_NICE
Set priority for a process group or all processes of a UID Same as for individual processes
Read priority of any task None (always permitted)

"Own tasks" means: tasks whose real or effective UID matches the caller's real UID, or tasks in the caller's process group when which = IOPRIO_WHO_PGRP. Setting a higher-than-current BE level (lower priority number) for one's own tasks is always permitted.
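The table above reduces to a small predicate. The `Cred` type and its field names are hypothetical; only the rules come from the table:

```rust
/// Hypothetical caller-credential type for the sketch.
pub struct Cred {
    pub ruid: u32,
    pub cap_sys_admin: bool,
    pub cap_sys_nice: bool,
}

#[derive(PartialEq)]
pub enum Class { None, RealTime, BestEffort, Idle }

/// May `caller` set `class` on a task with the given real/effective UID?
/// Reads are always permitted and need no check.
pub fn may_set_ioprio(caller: &Cred, target_ruid: u32, target_euid: u32, class: Class) -> bool {
    match class {
        // RT for any task requires CAP_SYS_ADMIN.
        Class::RealTime => caller.cap_sys_admin,
        // BE/Idle/None: own tasks always; another user's tasks need CAP_SYS_NICE.
        _ => caller.ruid == target_ruid
            || caller.ruid == target_euid
            || caller.cap_sys_nice,
    }
}
```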

14.13.6 UmkaOS I/O Scheduler: Multi-Queue Priority-Aware (MQPA)

UmkaOS does not implement CFQ, BFQ, mq-deadline, or kyber as separate pluggable schedulers. Instead, UmkaOS implements a single unified scheduler — MQPA — that provides the correct behavior for all workloads without the configuration complexity of Linux's scheduler selection knob.

Design rationale vs Linux schedulers:

  • CFQ: removed along with the legacy block layer in Linux 5.0. Had a global elevator lock, per-process queues with O(n) dispatch, and poor NVMe multi-queue support.
  • BFQ: per-process B-WF2Q+ scheduling with budget tracking. Good fairness, but complex, and a single per-device lock limits scaling on high-queue-depth SSDs.
  • mq-deadline: simple, fast, low overhead, but only provides read/write starvation prevention — no per-class prioritization beyond that.
  • kyber: good SSD latency targeting, but no class-based priority support.

MQPA provides class-based strict priority (RT > BE > Idle), weighted round-robin within BE levels, per-CPU queues for lock-free submission, elevator merge optimization, and NVMe hardware queue integration — without any of the above limitations.

14.13.6.1 Scheduler Data Structures

/// One MQPA scheduler instance per NVMe hardware queue (or per storage device
/// for non-NVMe targets).
pub struct IoScheduler {
    /// Per-CPU dispatch queues for RT class, indexed by level (0 = highest).
    rt_queues: [PerCpu<IoQueue>; 8],

    /// Per-CPU dispatch queues for BE class, indexed by level (0 = highest).
    be_queues: [PerCpu<IoQueue>; 8],

    /// Single per-CPU idle queue.
    idle_queue: PerCpu<IoQueue>,

    /// Count of in-flight requests across all classes (incremented at
    /// dispatch, decremented on completion).
    inflight: AtomicU32,

    /// In-flight RT requests (used to determine when BE/Idle may proceed).
    inflight_rt: AtomicU32,

    /// Maximum queue depth supported by the device.
    queue_depth: u32,
}

/// Backing storage for an `IoQueue`, parameterised by media type.
///
/// - **Sorted** (rotational media — HDD): requests ordered by LBA for elevator
///   merge and seek-distance minimisation. `BTreeMap` provides O(log N) insert,
///   O(log N) predecessor/successor lookup for merge checks, and O(log N)
///   `pop_first()` for dispatch. Allocation per insert is negligible vs HDD
///   access latency (~4 ms for a 7200 RPM drive).
/// - **Fifo** (non-rotational media — NVMe, SSD, PMEM): no seek penalty; FIFO
///   preserves submission order and is optimal for deep hardware queues. Uses
///   `VecDeque` (amortised O(1) push/pop, no per-element allocation).
///
/// The variant is set once at `IoScheduler` creation from `blk_queue_flag_set(QUEUE_FLAG_NONROT)`
/// and never changes at runtime. All `IoQueue` instances within one `IoScheduler`
/// use the same variant.
pub enum IoQueueBacking {
    Sorted(BTreeMap<Lba, Arc<IoRequest>>),
    Fifo(VecDeque<Arc<IoRequest>>),
}

/// A single priority-level dispatch queue.
pub struct IoQueue {
    /// Backing storage, parameterised by media type. See `IoQueueBacking`.
    pub backing: IoQueueBacking,

    /// Enqueue time of the oldest request in this queue, compared against the
    /// per-class starvation deadline at dispatch. `None` if the queue is empty.
    oldest_deadline: Option<Instant>,

    /// Number of requests dispatched from this queue in the current WRR round
    /// (BE queues only; unused for RT and Idle).
    dispatched_this_round: u32,
}
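The dispatch path's pop_front can be implemented once over both backings; a sketch with plain values in place of Arc<IoRequest>:

```rust
use std::collections::{BTreeMap, VecDeque};

pub type Lba = u64;

#[derive(Debug, PartialEq)]
pub struct IoRequest { pub lba: Lba, pub len: u32 }

/// The two backings from Section 14.13.6.1.
pub enum IoQueueBacking {
    Sorted(BTreeMap<Lba, IoRequest>), // HDD: lowest LBA first
    Fifo(VecDeque<IoRequest>),        // SSD/NVMe: submission order
}

impl IoQueueBacking {
    /// Unified pop used by the dispatch loop.
    pub fn pop_front(&mut self) -> Option<IoRequest> {
        match self {
            // O(log N): remove the entry with the lowest start LBA.
            IoQueueBacking::Sorted(m) => m.pop_first().map(|(_, r)| r),
            // Amortised O(1): remove the oldest submission.
            IoQueueBacking::Fifo(q) => q.pop_front(),
        }
    }
}
```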

14.13.6.2 Dispatch Algorithm

The dispatch loop runs when the device signals readiness for more commands (doorbell ring, completion interrupt, or explicit dispatch_pending() call from the submit path).

fn dispatch_one(sched: &IoScheduler, cpu: CpuId) -> Option<Arc<IoRequest>> {
    // Step 1: RT always wins. Scan RT levels 0..7, take first non-empty queue.
    for level in 0..8 {
        if let Some(req) = sched.rt_queues[level].get_mut(cpu).pop_front() {
            sched.inflight_rt.fetch_add(1, Release);
            sched.inflight.fetch_add(1, Release);
            return Some(req);
        }
    }

    // Step 2: Starvation promotion (BE). If any BE request has waited beyond
    // the starvation deadline (500ms), treat it as RT-priority for one dispatch.
    for level in 0..8 {
        let q = sched.be_queues[level].get_mut(cpu);
        if q.oldest_deadline.map_or(false, |d| d.elapsed() > Duration::from_millis(500)) {
            if let Some(req) = q.pop_front() {
                sched.inflight.fetch_add(1, Release);
                return Some(req);
            }
        }
    }

    // Step 3: Starvation promotion (Idle). 5s deadline. Checked before the BE
    // WRR pass so that a starved Idle request is dispatched even while BE I/O
    // is pending; Step 4 would otherwise always win when BE queues are non-empty.
    {
        let iq = sched.idle_queue.get_mut(cpu);
        if iq.oldest_deadline.map_or(false, |d| d.elapsed() > Duration::from_secs(5)) {
            if let Some(req) = iq.pop_front() {
                sched.inflight.fetch_add(1, Release);
                return Some(req);
            }
        }
    }

    // Step 4: BE weighted round-robin.
    // Weights: level 0 = 8, level 1 = 4, level 2 = 2, levels 3-7 = 1.
    let be_weights: [u32; 8] = [8, 4, 2, 1, 1, 1, 1, 1];
    for level in 0..8 {
        let q = sched.be_queues[level].get_mut(cpu);
        if q.dispatched_this_round < be_weights[level] {
            if let Some(req) = q.pop_front() {
                q.dispatched_this_round += 1;
                sched.inflight.fetch_add(1, Release);
                return Some(req);
            }
        }
    }
    // End of WRR round: reset counters and retry from level 0.
    for level in 0..8 {
        sched.be_queues[level].get_mut(cpu).dispatched_this_round = 0;
    }
    for level in 0..8 {
        let q = sched.be_queues[level].get_mut(cpu);
        if let Some(req) = q.pop_front() {
            q.dispatched_this_round = 1;
            sched.inflight.fetch_add(1, Release);
            return Some(req);
        }
    }

    // Step 5: Idle — only when RT and BE are empty.
    sched.idle_queue.get_mut(cpu).pop_front().map(|req| {
        sched.inflight.fetch_add(1, Release);
        req
    })
}

Starvation prevention:

  • BE requests that wait longer than 500ms are promoted once (dispatched as if RT, then return to normal BE accounting afterward).
  • Idle requests that wait longer than 5s are promoted once (dispatched regardless of pending BE I/O).
  • Promotion is per-request, not per-queue: only the single oldest request in a queue is promoted at a time, preserving ordering within the queue.

14.13.6.3 Elevator Merge Optimization

For rotational media (IoQueueBacking::Sorted), requests are sorted by starting LBA. When a new request arrives:

  1. Back-merge check: Look up the predecessor entry via BTreeMap::range(..lba).next_back(). If the predecessor's end LBA + 1 == new request's start LBA, and the combined bio size is ≤ 64 KB (the merge size limit), extend the predecessor's IoRequest to cover the new range and discard the new request object.
  2. Front-merge check: Look up the successor via BTreeMap::range(lba..).next(). If the successor's start LBA == new request's end LBA + 1, and combined size ≤ 64 KB, extend the new request and replace the successor.
  3. No merge: Insert the new request into the BTreeMap keyed by its start LBA.

Each merge check is O(log N). There is no global elevator lock: the per-CPU IoQueue is accessed only while holding the per-CPU scheduler lock (preempt-disable critical section on the submitting CPU).

For non-rotational media (IoQueueBacking::Fifo), back/front merge checks are still attempted (same logic, but searching by LBA in the VecDeque is O(N)); dispatch pops from the front of the deque rather than the lowest-LBA entry.

The 64 KB merge limit is chosen to match a typical NVMe preferred transfer size and to bound the latency spike of a merged request. This can be adjusted per-device at initialization time by querying the device's MDTS (Maximum Data Transfer Size) field in the NVMe identify controller data structure.
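A minimal sketch of steps 1–3, operating on whole extents keyed by start LBA. The `Extent` type, the 512-byte block size, and the function name are assumptions made for the sketch; the real code merges bios:

```rust
use std::collections::BTreeMap;

pub type Lba = u64;
const MERGE_LIMIT_BYTES: u64 = 64 * 1024; // merge size limit from the text
const BLOCK: u64 = 512;                   // assumed sector size for the sketch

/// Contiguous extent keyed by start LBA (sketch of the Sorted backing).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Extent { pub start: Lba, pub blocks: u64 }

pub fn insert_merged(q: &mut BTreeMap<Lba, Extent>, new: Extent) {
    // Back-merge check: predecessor ends exactly where `new` starts.
    let back = q.range(..new.start).next_back().map(|(&k, &e)| (k, e));
    if let Some((key, prev)) = back {
        if prev.start + prev.blocks == new.start
            && (prev.blocks + new.blocks) * BLOCK <= MERGE_LIMIT_BYTES
        {
            q.insert(key, Extent { start: prev.start, blocks: prev.blocks + new.blocks });
            return;
        }
    }
    // Front-merge check: `new` ends exactly where the successor starts.
    let front = q.range(new.start..).next().map(|(&k, &e)| (k, e));
    if let Some((key, next)) = front {
        if new.start + new.blocks == next.start
            && (new.blocks + next.blocks) * BLOCK <= MERGE_LIMIT_BYTES
        {
            q.remove(&key);
            q.insert(new.start, Extent { start: new.start, blocks: new.blocks + next.blocks });
            return;
        }
    }
    // No merge: plain insert keyed by start LBA.
    q.insert(new.start, new);
}
```

Both range lookups are the O(log N) predecessor/successor probes described above.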

14.13.6.4 Submission Path

pub fn submit(sched: &IoScheduler, req: Arc<IoRequest>, task: &Task) {
    let priority = task.io_priority; // IoSchedClass::None is resolved below
    let cpu = current_cpu();

    // Per-CPU queues are touched only from the submitting CPU inside a
    // preempt-disabled critical section (Section 14.13.6.3), so `get_mut(cpu)`
    // needs no cross-CPU locking.
    match priority.class() {
        IoSchedClass::RealTime => {
            sched.rt_queues[priority.level() as usize]
                .get_mut(cpu)
                .insert_merged(req);
        }
        IoSchedClass::BestEffort | IoSchedClass::None => {
            let level = match priority.class() {
                IoSchedClass::None => task.nice_to_be_level(), // Section 14.13.3
                _ => priority.level() as usize,
            };
            sched.be_queues[level].get_mut(cpu).insert_merged(req);
        }
        IoSchedClass::Idle => {
            sched.idle_queue.get_mut(cpu).insert_merged(req);
        }
    }

    sched.kick_dispatch(cpu);
}

14.13.7 NVMe Multi-Queue Integration

NVMe hardware supports multiple independent submission/completion queue pairs. UmkaOS maps the MQPA scheduler to NVMe hardware queues as follows:

Queue layout per NVMe controller:

  • One hardware queue pair per online CPU (as Linux does with blk-mq).
  • Each hardware queue has its own IoScheduler instance — no cross-queue locking.
  • Tasks submit requests to the IoScheduler associated with their current CPU. The dispatcher drains that scheduler's queues into the hardware submission queue doorbell.

NVMe queue priority (QPRIO): When the NVMe controller supports the Weighted Round Robin with Urgent Priority Class arbitration mechanism (reported in CAP.AMS), UmkaOS creates dedicated submission queue tiers:

NVMe QPRIO Value (CDW11[2:1]) Used for
Urgent 00b RT I/O class (all levels 0-7)
High 01b BE levels 0-1
Medium 10b BE levels 2-4
Low 11b BE levels 5-7 and Idle

Queue priority is set at queue creation time via the QPRIO field in CDW11 of the Create I/O Submission Queue admin command. This maps UmkaOS's software priority classes to NVMe hardware arbitration, so that the drive's internal scheduler also respects UmkaOS priorities — not just the host-side MQPA scheduler.

If the controller does not support CAP.AMS priority, all queues are created at the default (equal) priority and MQPA's software dispatch order is the sole priority mechanism.

RT fast path: RT requests are eligible for direct hardware queue submission without going through the sorted BTreeMap, provided the hardware queue has available slots. This reduces the RT dispatch latency to approximately one PCIe round trip (2–4 μs on Gen4/Gen5 NVMe) without waiting for a dispatch tick.

Completion handling: NVMe completions arrive per-queue. Each completion decrements inflight and inflight_rt (if RT), then calls dispatch_one to fill the freed slot. This keeps queue depth at the device's preferred level for maximum throughput.

14.13.8 cgroup Integration

UmkaOS's io cgroup v2 controller and blkio cgroup v1 controller interact with MQPA:

cgroup v2 io controller:

/sys/fs/cgroup/<group>/io.weight

Integer weight 1–10000 (default 100). Maps to an effective BE level multiplier:

effective_weight = io.weight           // 1-10000, default 100
be_dispatch_quota = be_weights[level] * effective_weight / 100

Tasks in a cgroup with io.weight=500 (5× default) get 5× the per-round dispatch quota at their BE level. Tasks in a cgroup with io.weight=10 get 0.1× quota (rounded up to 1 dispatch per round to avoid starvation).

The per-cgroup weight applies within the same BE priority level. A task at BE level 0 with io.weight=10 still preempts a task at BE level 1 with io.weight=10000 — class and level take strict priority; cgroup weight only affects relative bandwidth within the same level.
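The quota computation can be sketched directly from the formula above; the function name is illustrative:

```rust
/// Per-round BE dispatch quota after applying the cgroup io.weight.
pub fn be_dispatch_quota(level: usize, io_weight: u32) -> u32 {
    const BE_WEIGHTS: [u32; 8] = [8, 4, 2, 1, 1, 1, 1, 1];
    // Integer division can round down to zero; clamp to 1 dispatch per round
    // so low-weight cgroups are never starved.
    (BE_WEIGHTS[level] * io_weight / 100).max(1)
}
```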

cgroup v2 io.max — hard rate limits:

/sys/fs/cgroup/<group>/io.max

Format (Linux compatible): MAJ:MIN rbps=N wbps=N riops=N wiops=N

Implemented as a token bucket per cgroup per device. Tokens refill at the configured rate; requests that arrive when the bucket is empty are held in a per-cgroup delay queue and released when tokens become available. Rate-limited requests retain their original MQPA priority and are inserted into the normal dispatch queue when released from the delay queue.

Token bucket parameters:

  • Bucket capacity: 4× the per-second rate limit (allows burst up to 4 seconds of quota).
  • Refill granularity: every 1ms tick (avoids thundering herd on 1-second boundaries).
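The refill-and-consume cycle described above can be sketched as follows; `TokenBucket` and its field names are illustrative, not from the UmkaOS source:

```rust
/// Minimal per-cgroup, per-device token bucket (units: bytes for rbps/wbps;
/// the same shape works for riops/wiops with ops as the unit).
pub struct TokenBucket {
    pub rate_per_sec: u64, // configured limit from io.max
    pub capacity: u64,     // 4x rate: up to 4 s of burst
    pub tokens: u64,
    pub last_refill_ms: u64,
}

impl TokenBucket {
    pub fn new(rate_per_sec: u64) -> Self {
        let capacity = rate_per_sec * 4;
        TokenBucket { rate_per_sec, capacity, tokens: capacity, last_refill_ms: 0 }
    }

    /// Called from the 1 ms tick: add rate * elapsed tokens, clamp to capacity.
    pub fn refill(&mut self, now_ms: u64) {
        let elapsed = now_ms.saturating_sub(self.last_refill_ms);
        self.tokens = (self.tokens + self.rate_per_sec * elapsed / 1000).min(self.capacity);
        self.last_refill_ms = now_ms;
    }

    /// Try to charge `units`; on failure the request goes to the delay queue.
    pub fn try_consume(&mut self, units: u64) -> bool {
        if self.tokens >= units { self.tokens -= units; true } else { false }
    }
}
```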

cgroup v1 blkio controller:

Supported knobs and their v2 equivalents:

v1 knob v2 equivalent Notes
blkio.weight io.weight Per-cgroup default weight
blkio.weight_device io.weight (per-device) Per-device weight override
blkio.throttle.read_bps_device io.max rbps= Hard rate limit
blkio.throttle.write_bps_device io.max wbps= Hard rate limit
blkio.throttle.read_iops_device io.max riops= Hard rate limit
blkio.throttle.write_iops_device io.max wiops= Hard rate limit

v1 blkio.bfq.* knobs are accepted but ignored with a logged warning (BFQ is not implemented; MQPA provides equivalent or better behavior).

cgroup v2 io.stat — I/O accounting:

/sys/fs/cgroup/<group>/io.stat

Format (Linux 4.16+ compatible):

MAJ:MIN rbytes=N wbytes=N rios=N wios=N dbytes=N dios=N

Fields:

  • rbytes / wbytes: bytes read/written from storage (not page cache hits)
  • rios / wios: number of completed read/write I/O operations
  • dbytes / dios: bytes/ops issued as discard (TRIM/UNMAP) commands

Counters are updated on I/O completion, not on submission. Accounted per-task first, then aggregated to the cgroup hierarchy on read.

14.13.9 /proc/PID/io Accounting

Each task accumulates I/O counters in its RusageAccum structure (defined in Chapter 7). These are exposed in /proc/<pid>/io with the following format (Linux compatible):

rchar: <N>
wchar: <N>
syscr: <N>
syscw: <N>
read_bytes: <N>
write_bytes: <N>
cancelled_write_bytes: <N>

Field definitions:

Field Type Description
rchar u64 Bytes passed to read(2) and similar calls. Includes page cache hits. Does not represent physical I/O.
wchar u64 Bytes passed to write(2) and similar calls. Includes writes to page cache. Does not represent physical I/O.
syscr u64 Number of read-class syscalls (read, pread64, readv, preadv, preadv2, sendfile, copy_file_range).
syscw u64 Number of write-class syscalls (write, pwrite64, writev, pwritev, pwritev2, sendfile, copy_file_range).
read_bytes u64 Bytes actually fetched from storage (cache misses that triggered block I/O). Updated at I/O completion.
write_bytes u64 Bytes actually written to storage (writeback completions). Updated at writeback completion.
cancelled_write_bytes u64 Bytes charged to write_bytes that were subsequently cancelled because the page was truncated before writeback.

Implementation:

  • rchar and wchar are incremented in the VFS read/write path before checking the page cache.
  • syscr and syscw are incremented at syscall entry.
  • read_bytes is incremented in the block I/O completion handler when the originating task can be attributed (via IoRequest::submitter_pid).
  • write_bytes is incremented in the writeback completion handler. Writeback is attributed to the task that dirtied the page (recorded in the page's DirtyAccountable field).
  • cancelled_write_bytes is incremented in truncate_inode_pages when a dirty page is discarded before writeback.

Thread aggregation: /proc/<pid>/io reports the sum across all threads in the process. Per-thread values are available at /proc/<pid>/task/<tid>/io.
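Thread aggregation is a plain sum over per-thread counters; a sketch with an assumed subset of RusageAccum:

```rust
/// Subset of the per-thread RusageAccum counters, using the field names from
/// the /proc/<pid>/io table above.
#[derive(Default, Clone, Copy, Debug, PartialEq)]
pub struct IoAccum {
    pub rchar: u64,
    pub wchar: u64,
    pub syscr: u64,
    pub syscw: u64,
    pub read_bytes: u64,
    pub write_bytes: u64,
    pub cancelled_write_bytes: u64,
}

/// /proc/<pid>/io reports the sum over all threads of the process.
pub fn process_io(threads: &[IoAccum]) -> IoAccum {
    threads.iter().fold(IoAccum::default(), |mut sum, t| {
        sum.rchar += t.rchar;
        sum.wchar += t.wchar;
        sum.syscr += t.syscr;
        sum.syscw += t.syscw;
        sum.read_bytes += t.read_bytes;
        sum.write_bytes += t.write_bytes;
        sum.cancelled_write_bytes += t.cancelled_write_bytes;
        sum
    })
}
```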

14.13.10 sysfs Interface

/sys/block/<dev>/queue/scheduler:

UmkaOS presents the MQPA scheduler under the name umka-mqpa. For compatibility with tools that check this file (e.g., fio, tuned, irqbalance), the file also accepts none, mq-deadline, bfq, and kyber as writes — all are silently mapped to umka-mqpa. The read value always shows [umka-mqpa] in the list of available schedulers.

/sys/block/<dev>/queue/iosched/:

For maximum tool compatibility, this directory exposes the mq-deadline tunable set (iostat, blktrace, and fio detect the tunables present and adjust their output accordingly). The following tunables are honored:

Tunable Default Meaning in UmkaOS
read_expire 500ms Starvation deadline for BE read requests
write_expire 5000ms Starvation deadline for BE write requests
writes_starved 2 (ignored; MQPA WRR handles fairness)
front_merges 1 0 = disable front-merge check; 1 = enable (default)
fifo_batch 16 (ignored; MQPA dispatches one request per call)

All other mq-deadline tunables (async_depth, prio_aging_expire, etc.) are accepted via sysfs write but have no effect. A single-line message is logged at info level when an ignored tunable is written: umka-mqpa: tunable '<name>' accepted but has no effect.

/sys/block/<dev>/queue/ common knobs honored by MQPA:

Knob Description
nr_requests Maximum queue depth. UmkaOS clamps to the device's reported NVMe MQES.
rq_affinity 0 = complete on any CPU, 1 = complete on submitting CPU's socket, 2 = complete on exact submitting CPU.
add_random 0 = do not contribute to /dev/random entropy pool on I/O completion.
rotational 0 = SSD/NVMe (disable the LBA-sorted elevator; use FIFO-within-level order instead of LBA order).

When rotational=0, each IoQueue is created with backing: IoQueueBacking::Fifo (a VecDeque<Arc<IoRequest>>). Back/front merge checks are still performed but dispatch pops from the front of the deque rather than the lowest-LBA entry. This avoids unnecessary seek-optimisation work on random-access media.

14.13.11 Linux Compatibility Notes

Item Detail
Syscall numbers (x86-64) ioprio_set = 251, ioprio_get = 252
Syscall numbers (i386 compat) ioprio_set = 289, ioprio_get = 290
Syscall numbers (AArch64) ioprio_set = 30, ioprio_get = 31
IOPRIO_CLASS_NONE 0
IOPRIO_CLASS_RT 1
IOPRIO_CLASS_BE 2
IOPRIO_CLASS_IDLE 3
IOPRIO_PRIO_CLASS(ioprio) (ioprio >> 13) & 0x7
IOPRIO_PRIO_DATA(ioprio) ioprio & 0x1fff
IOPRIO_PRIO_VALUE(class, data) ((class) << 13) | (data)
ionice(1) (util-linux) Works without modification
iopriority field in /proc/<pid>/status Not exposed; use ioprio_get(2)
taskset / chrt Unaffected; these set CPU/RT scheduler priority, not I/O priority
cgroup v2 io.stat format Compatible with Linux 4.16+
cgroup v2 io.weight range 1–10000, default 100 (Linux compatible)
blkio.weight v1 range 10–1000, mapped to v2 weight via weight * 10
/proc/<pid>/io format Identical to Linux (all 7 fields, same names)

ionice(1) tool compatibility: The ionice utility from util-linux calls ioprio_set(2) and ioprio_get(2) directly via syscall(2) (no glibc wrapper exists). No modification is required.

Tools that query /sys/block/<dev>/queue/scheduler: Tools like fio, tuned, and storage benchmarks that read or write the scheduler knob will see [umka-mqpa] and accept writes of mq-deadline without error. The reported scheduler name has no effect on the fio io_uring and libaio engines; their requests, including O_DIRECT submissions, are scheduled by MQPA like any other block I/O.

O_DIRECT and io_uring with fixed buffers: Requests submitted via io_uring with IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED on O_DIRECT file descriptors are still subject to MQPA priority. The submitting task's io_priority is sampled at io_uring_enter(2) time and embedded in each IoRequest generated from the submission ring.