
Chapter 15: Storage and Filesystems

Durability guarantees, block I/O, volume management, block storage networking, clustered filesystems, persistent memory, SATA/AHCI, ext4/XFS/Btrfs, ZFS


The storage subsystem spans block I/O, volume management (device-mapper), filesystem drivers (ext4, XFS, Btrfs, ZFS), the NFS client/server, and persistent memory. Durability guarantees are explicit: every I/O path documents its crash-consistency model. The I/O scheduler is a replaceable policy layer, supporting 50-year uptime via live kernel evolution.

15.1 Durability Guarantees

Linux problem: Applications couldn't reliably know when data was on disk. The ext4 delayed-allocation data loss bugs (2008-2009) were a symptom. Worse, fsync() error reporting was broken — errors could be silently lost between calls. Partially fixed with errseq_t in kernel 4.13 (with subsequent refinements in 4.14 and 4.16), but the contract between applications and filesystems around durability remains murky.

UmkaOS design:

  • Error reporting: Every filesystem operation tracks errors via a per-file error sequence counter. fsync() returns errors exactly once and never silently drops them. The VFS layer enforces this — individual filesystem implementations cannot bypass it.
  • Durability contract: Three explicit levels, documented and testable:
    1. write() → data in page cache (may be lost on crash)
    2. fsync() → data + metadata on stable storage (guaranteed)
    3. O_SYNC / O_DSYNC → each write waits for stable storage
  • Filesystem crash consistency: All filesystem implementations must declare their consistency model (journal, COW, log-structured) and pass a crash-consistency test suite as part of KABI certification.
  • Error propagation: Writeback errors propagate to ALL file descriptors that have the file open, not just the one that triggered writeback. No silent data loss.
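The exactly-once error-reporting contract can be sketched with a simplified per-file error sequence counter. This is an illustrative model, not the UmkaOS implementation: the `ErrSeq` type, its method names, and the per-descriptor snapshot are assumptions for the sketch.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-file error sequence counter (illustrative, simplified).
/// Each open descriptor keeps its own `snapshot` of the sequence.
pub struct ErrSeq(AtomicU64);

impl ErrSeq {
    pub fn new() -> Self {
        ErrSeq(AtomicU64::new(0))
    }

    /// Writeback path: record a new error by advancing the sequence.
    pub fn record_error(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }

    /// fsync() path: returns true exactly once per new error per
    /// descriptor. Advancing the snapshot consumes the error for this
    /// descriptor only, so every open descriptor still observes it.
    pub fn check_and_advance(&self, snapshot: &mut u64) -> bool {
        let cur = self.0.load(Ordering::Acquire);
        if cur == *snapshot {
            return false; // no new error since last check
        }
        *snapshot = cur;
        true
    }
}
```

Because each descriptor compares against its own snapshot, an error recorded once is reported to every open descriptor, but never twice to the same one — the property the design text requires.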

15.1.1 Boot Initialization Sequence

The storage subsystem initializes in dependency order during the canonical boot sequence (Section 2.3). Phase numbers below match the master table.

Phase 4.4a bus_enumerate() + Phase 4.5 block_init():

  • Bus enumeration (Phase 4.4a) discovers NVMe/AHCI/VirtIO controllers in the ACPI/DT namespace.
  • Block layer registration (Phase 4.5): I/O scheduler, bio slab, request queues.

Phase 5.4 storage_probe() — NVMe/SCSI/AHCI/VirtIO/eMMC:

  • Probe discovered controllers. Allocate NVMe submission/completion queues from slab. Register block devices. This step requires Tier 1 driver loading (5.3) for storage drivers behind the KABI boundary.

Phase 5.45 dm_init():

  • Initialize device-mapper: register target types (linear, striped, crypt, verity, thin, cache, mirror), then assemble dm devices specified on the kernel command line or in the initramfs. Depends on 5.4 because dm devices are built on top of physical block devices that must already be probed.

Phase 5.5 mount_rootfs():

  • Scan registered block devices for GPT/MBR partition tables. Identify the root device by PARTUUID from the kernel command line. If root= specifies a dm device (root=/dev/dm-0 for LVM/LUKS root), the dm device was assembled in 5.45.
  • Register filesystem types (ext4_init(), xfs_init(), btrfs_init()).
  • Mount the root filesystem. On failure, panic with a diagnostic showing the failed PARTUUID and all detected block devices/partitions.

Phase 6.x — Post-root (on demand):

  • fuse_init(): Register the FUSE filesystem type. The daemon is not yet running.
  • nfs_init(): Register NFS. The network stack must be up for NFS root (the initramfs handles this).
  • dlm_init(): Distributed Lock Manager for clustered filesystems.

Ordering constraints:

Constraint Canonical steps Reason
storage_probe before dm_init 5.4 → 5.45 dm devices are layered on physical block devices
NVMe/VirtIO/SATA before root scan 5.4 → 5.5 Devices must be registered before root scan
Network stack before NFS root 5.2 → 6.1 NFS client requires TCP; initramfs manages
dlm_init() after nfs_init() if co-located 6.1 → 6.3 Standalone NFS does not need DLM; clustered filesystems (GFS2/OCFS2) need DLM. DLM uses its own TCP transport (not NFS). NFS inits first (Phase 6.1) for early NFS root; DLM inits later (Phase 6.3) for cluster locking.

Error handling: If any phase 5.4 device init fails, that device is marked unavailable and an FMA event is raised; boot continues without it. If root mount fails in 5.5, the kernel panics with a diagnostic serial console message showing the failed PARTUUID and all detected block devices and partitions.

15.1.2 Filesystem Error Mode Selection by Error Code

Operators configure the error mode per mount via the errors=continue|remount-ro|panic mount option. The table below defines the default error mode when no mount option is specified:

Error Default mode Rationale
EIO Continue (retry) Transient device error, may recover
ENOSPC Continue Out of space is recoverable (free space, retry)
EROFS RemountRo Filesystem corruption detected
EUCLEAN RemountRo Metadata checksum failure
EREMOTEIO Continue (retry) Remote transport failure (NFS, iSCSI, cluster FS); transient network issue, retry after reconnect
ETIMEDOUT Continue (retry) I/O timeout; device or network may recover on retry
EUCLEAN (critical) Panic Critical filesystem corruption (superblock/journal/bitmap). Same errno as non-critical EUCLEAN above; the FMA severity (Critical vs Warning) determines escalation to Panic. Linux uses EUCLEAN (aliased as EFSCORRUPTED) for all corruption levels.

The check_fs_error_mode() function (Section 15.2) consults the superblock's error_mode field (set at mount time). If the operator has set errors=, that overrides the per-error-code defaults above.
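The default mapping in the table above can be expressed as a short sketch. `ErrorAction`, `Severity`, and `default_error_mode()` are illustrative names — the real check_fs_error_mode() is defined in Section 15.2; only the errno constants carry Linux's actual values.

```rust
#[derive(Debug, PartialEq)]
pub enum ErrorAction { Continue, RemountRo, Panic }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Severity { Warning, Critical }

// Standard Linux errno values, as used in the table above.
pub const EIO: i32 = 5;
pub const ENOSPC: i32 = 28;
pub const EROFS: i32 = 30;
pub const ETIMEDOUT: i32 = 110;
pub const EUCLEAN: i32 = 117;
pub const EREMOTEIO: i32 = 121;

/// Default mode when the operator has not set errors= on the mount.
/// EUCLEAN is the one errno whose action depends on FMA severity:
/// Critical corruption escalates to Panic, otherwise RemountRo.
pub fn default_error_mode(errno: i32, severity: Severity) -> ErrorAction {
    match errno {
        EIO | ENOSPC | EREMOTEIO | ETIMEDOUT => ErrorAction::Continue,
        EROFS => ErrorAction::RemountRo,
        EUCLEAN if severity == Severity::Critical => ErrorAction::Panic,
        EUCLEAN => ErrorAction::RemountRo,
        _ => ErrorAction::Continue,
    }
}
```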

15.1.3 I/O Result Codes

IoResultCode is a type alias for i32 (negated errno), matching Linux bio completion semantics. Every KABI vtable method that performs I/O returns IoResultCode. This includes BlockDeviceOps::submit_bio() completion callbacks, filesystem AddressSpaceOps::writepage() completions, and NVMe passthrough command results.

/// I/O completion result code. Negated errno value.
/// 0 = success, negative = error.
///
/// This is the same encoding used by Linux's `blk_status_to_errno()` and
/// bio completion callbacks — existing driver code and filesystem error
/// handling logic works without translation.
pub type IoResultCode = i32;

Common values:

Value Constant Meaning
0 (success) I/O completed successfully
-5 -EIO Device error (timeout, transport failure, uncorrectable media error)
-28 -ENOSPC No space left on device (filesystem or thin-provisioned volume)
-30 -EROFS Read-only filesystem (write attempted after remount-ro)
-117 -EUCLEAN Metadata checksum failure (block or filesystem layer detected corruption)
-74 -EBADMSG CRC verification failure (used by XFS for xfs_buf_verify() failures). Handled identically to EUCLEAN by check_fs_error_mode(). Note: Linux defines EFSCORRUPTED as alias for EUCLEAN (117), not 74. ext4 uses EUCLEAN for corruption; XFS uses both EUCLEAN and EBADMSG.
-121 -EREMOTEIO Remote I/O error (NFS, iSCSI, or cluster filesystem transport failure)
-110 -ETIMEDOUT I/O timeout (command did not complete within device timeout)

The error mode mapping table above defines the default filesystem response to each IoResultCode. The mapping from IoResultCode to filesystem action is: bio_completion(IoResultCode) -> check_fs_error_mode(errno) -> ErrorAction.


15.2 Block I/O and Volume Management

Linux problem: LVM/mdadm are mature but fragile when a block device disappears momentarily — the volume layer panics or marks the device as failed. An NVMe driver reload that takes 50ms can cascade into a degraded RAID array and an unnecessary multi-hour resync.

UmkaOS design:

15.2.1 Evolvable/Nucleus Classification

The block I/O subsystem follows the UmkaOS Evolvable component model (Section 13.18). The table below classifies every major data structure, trait, and algorithm in this section.

Nucleus (non-replaceable, verified correctness, survives live evolution):

Component Rationale
Bio struct layout and lifetime Correctness: every I/O path depends on Bio field semantics. Changing Bio layout requires full subsystem quiesce.
IoRequest struct and merge rules Correctness: elevator merge correctness depends on request ordering invariants. A broken merge can corrupt data.
BlockDeviceOps trait signature ABI contract: drivers implement this trait. Changing the signature breaks all compiled drivers.
Elevator merge algorithm correctness Correctness: merge must never combine requests that cross partition or stripe boundaries. This is a safety invariant, not a policy choice.
Write barrier ordering guarantees Correctness: barrier semantics are part of the durability contract (Section 15.1). Violating barrier ordering causes data loss.
Device-mapper target interface (DmTarget trait) ABI contract: dm targets implement this trait. Must be stable across live evolution.

Evolvable (replaceable policy, hot-swappable via EvolvableComponent):

Component Rationale
I/O scheduler algorithm (mq-deadline, BFQ, none) Policy: which requests to dispatch first is a heuristic. Different workloads benefit from different schedulers. ML can tune or replace.
Writeback throttling policy Policy: how aggressively to throttle dirty page generation is a tunable heuristic. Optimal policy depends on device speed and workload.
Readahead strategy Policy: how many pages to prefetch is a heuristic. Sequential vs random detection and prefetch window sizing are ML-tunable.
Stripe log flush policy Policy: when to flush the RAID write-hole journal is a latency/durability tradeoff. Tunable per workload.
I/O priority class mapping Policy: how ioprio classes map to dispatch weights is a scheduling policy decision.
Device-mapper thin provisioning overcommit thresholds Policy: when to warn or block on overcommitted thin pools is an operator-tunable policy.

15.2.2 Storage Driver Isolation Tiers at Boot

Boot-critical storage drivers (NVMe, AHCI/SATA, virtio-blk) follow a two-phase isolation model that reconciles the requirements of fast boot with post-boot fault containment.

Phase 1 — Tier 0 at boot (Phases 5.1–5.4): During boot, storage drivers needed for root filesystem access load as Tier 0 (in-kernel, statically linked, no isolation domain). This is required because:

  • The root filesystem must be mounted (Phase 5.5) before the full module loader infrastructure and IOMMU domain allocation are exercised under load.
  • Boot-critical storage I/O runs a single code path with no concurrent untrusted work — the isolation overhead of Tier 1 domain switching has no security benefit during early boot when only kernel-authored code is executing.
  • The canonical boot sequence (Section 2.3) loads Tier 0 drivers at Phase 5.1, then Tier 1 at Phase 5.3, followed by storage probe at Phase 5.4. Boot-critical storage drivers are loaded in Phase 5.1 (Tier 0) so they are ready for Phase 5.4 probing.

The boot command line identifies boot-critical storage drivers via the root device specification (e.g., root=/dev/nvme0n1p2, root=UUID=...). The kernel's root device resolver maps this to the required driver (NVMe, AHCI, or virtio-blk) and ensures it loads in Phase 5.1 as Tier 0 rather than waiting for Phase 5.3 Tier 1 loading.
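The root-device-to-driver mapping might look like the following sketch. `BootStorageDriver` and `resolve_boot_driver()` are hypothetical names; a UUID-based root cannot be resolved to a single driver before devices are probed, so the sketch falls through to `Unknown` (in which case all boot-capable storage drivers would load as Tier 0).

```rust
#[derive(Debug, PartialEq)]
pub enum BootStorageDriver { Nvme, Ahci, VirtioBlk, Unknown }

/// Map a root= specification to the driver that must load as Tier 0 in
/// Phase 5.1. Prefix matching is a simplification for illustration —
/// /dev/sd* can also be USB mass storage, which is never boot-critical.
pub fn resolve_boot_driver(root: &str) -> BootStorageDriver {
    if root.starts_with("/dev/nvme") {
        BootStorageDriver::Nvme
    } else if root.starts_with("/dev/sd") {
        BootStorageDriver::Ahci
    } else if root.starts_with("/dev/vd") {
        BootStorageDriver::VirtioBlk
    } else {
        // root=UUID=... or root=PARTUUID=...: cannot resolve before probe.
        BootStorageDriver::Unknown
    }
}
```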

Phase 2 — Optional Tier 1 assignment (post-boot): After rootfs mount, the operator or system policy may set a boot-critical storage driver to Tier 1 to gain crash recovery and fault isolation:

1. System is booted, rootfs mounted, init running.
2. Operator (or systemd unit) sets tier to 1:
   echo 1 > /ukfs/kernel/drivers/nvme0/tier
3. Registry initiates tier change for the target device:
   a. Quiesce I/O: drain all in-flight bios for this device (flush + barrier).
   b. Allocate isolation domain (MPK PKEY, POE overlay, or per-arch equivalent).
   c. Remap driver pages into the new domain.
   d. Resume I/O through the Tier 1 dispatch trampoline.
4. Device is now Tier 1: crashes cause driver reload (~50-150ms), not kernel panic.

Non-boot storage drivers (USB mass storage, SD/eMMC card readers, iSCSI initiator) always load as Tier 1 via the standard Phase 5.3 module loader path. They are never Tier 0 because they are not on the rootfs critical path.
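Steps 3a-3d above can be condensed into a state-transition sketch. `DriverEntry`, `Tier`, and `promote_to_tier1()` are illustrative names; domain allocation and page remapping are elided as comments.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Tier { Tier0, Tier1 }

#[derive(Debug, PartialEq)]
pub enum TierError { Busy }

pub struct DriverEntry {
    /// Current isolation tier.
    pub tier: Tier,
    /// In-flight bio count; must reach zero before the tier change (step 3a).
    pub inflight: u32,
}

/// Promote a boot-critical storage driver from Tier 0 to Tier 1
/// (steps 3a-3d). Idempotent if the driver is already Tier 1.
pub fn promote_to_tier1(d: &mut DriverEntry) -> Result<(), TierError> {
    if d.tier == Tier::Tier1 {
        return Ok(()); // already isolated
    }
    // 3a: quiesce — all in-flight bios must be drained (flush + barrier).
    if d.inflight != 0 {
        return Err(TierError::Busy);
    }
    // 3b: allocate isolation domain (MPK PKEY / POE overlay) — elided.
    // 3c: remap driver pages into the new domain — elided.
    d.tier = Tier::Tier1;
    // 3d: resume I/O through the Tier 1 dispatch trampoline — elided.
    Ok(())
}
```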

Decision matrix:

Driver Boot role Initial tier Post-boot promotion Rationale
NVMe Root device Tier 0 Yes (recommended) Root FS access; promote after init
AHCI/SATA Root device Tier 0 Yes (recommended) Root FS access on SATA systems
virtio-blk Root device (VMs) Tier 0 Yes (recommended) Root FS in virtual machines
USB mass storage Never root Tier 1 N/A (already Tier 1) Removable media, not boot-critical
SD/eMMC Rarely root Tier 0 if root, else Tier 1 Yes if Tier 0 Embedded systems may boot from eMMC
iSCSI Network root Tier 1 N/A Network boot uses initramfs pivot

Cross-references: isolation tier model (Section 11.3), device registry boot sequence (Section 11.6), crash recovery for Tier 1 block drivers (Section 11.9).


15.2.3 Block Device Trait

/// Block device abstraction — the interface between the block I/O layer
/// and storage device drivers (NVMe, SATA, virtio-blk, eMMC, SD, dm-*).
///
/// Every storage driver registers a `BlockDevice` with umka-block.
/// The block I/O layer routes bio requests through this trait.
pub trait BlockDeviceOps: Send + Sync {
    /// Submit a block I/O request. The request contains one or more
    /// bio segments (contiguous LBA ranges with associated memory pages).
    /// Returns immediately; completion is signaled via the bio's completion
    /// callback. For synchronous I/O, the caller waits on the callback.
    fn submit_bio(&self, bio: &mut Bio) -> Result<()>;

    /// Flush volatile write cache to stable storage. Called by fsync(),
    /// sync(), and journal commit paths. Must not return until all
    /// previously submitted writes are on stable media.
    fn flush(&self) -> Result<()>;

    /// Discard (TRIM/UNMAP) the specified LBA range. The device may
    /// deallocate the underlying storage. Not all devices support this;
    /// return ENOSYS if unsupported.
    fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()>;

    /// Return device geometry and capabilities.
    fn get_info(&self) -> BlockDeviceInfo;

    /// Shut down the device. Flushes caches and releases hardware resources.
    fn shutdown(&self) -> Result<()>;
}

bitflags! {
    /// Block device capability flags. Replaces individual bool fields for
    /// extensibility — new capabilities (zoned, write zeroes, zone append,
    /// secure erase, copy offload) can be added as new bits without changing
    /// the struct layout.
    pub struct BlockDeviceFlags: u32 {
        /// Device supports discard/TRIM (ATA TRIM, NVMe Deallocate, SCSI UNMAP).
        const DISCARD    = 1 << 0;
        /// Device has a volatile write cache and supports flush commands.
        const FLUSH      = 1 << 1;
        /// Device supports FUA (Force Unit Access) — write directly to media
        /// without requiring a separate flush.
        const FUA        = 1 << 2;
        /// Device is a rotational disk (HDD). If not set, assumed non-rotational (SSD/NVMe).
        const ROTATIONAL = 1 << 3;
        /// Device supports write zeroes command (NVMe Write Zeroes, SCSI Write Same).
        const WRITE_ZEROES = 1 << 4;
        /// Device is a zoned block device (ZNS NVMe, SMR HDD).
        const ZONED      = 1 << 5;
        /// Device supports secure erase.
        const SECURE_ERASE = 1 << 6;
    }
}

/// Block device metadata and capabilities.
/// Kernel-internal, not KABI — populated within the same compilation unit
/// (Tier 0 block layer or Tier 1 driver via KABI ring serialization).
pub struct BlockDeviceInfo {
    /// Logical sector size in bytes (typically 512 or 4096).
    pub logical_block_size: u32,
    /// Physical sector size in bytes (4096 for AF drives).
    pub physical_block_size: u32,
    /// Total device capacity in logical sectors.
    pub capacity_sectors: u64,
    /// Maximum segments per bio request.
    pub max_segments: u16,
    /// Maximum total bytes per bio request.
    /// A value of 0 means "no explicit limit beyond segment count" — the block
    /// layer uses `max_segments * PAGE_SIZE` as the effective limit. Drivers
    /// that set this to 0 (e.g., VirtIO-blk) rely solely on segment limits.
    pub max_bio_size: u32,
    /// Device capability flags (discard, flush, FUA, rotational, etc.).
    pub flags: BlockDeviceFlags,
    /// Optimal I/O size in bytes (for alignment).
    pub optimal_io_size: u32,
    /// NUMA node affinity (for interrupt/queue placement).
    pub numa_node: u16,
}
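The max_bio_size == 0 convention documented above can be captured in a small helper; `effective_max_bio_size()` and the fixed `PAGE_SIZE` constant are assumptions for illustration.

```rust
/// Assumed page size for the sketch (4 KiB).
pub const PAGE_SIZE: u32 = 4096;

/// Effective per-bio byte limit. A max_bio_size of 0 means "no explicit
/// limit beyond segment count", so the block layer falls back to
/// max_segments * PAGE_SIZE.
pub fn effective_max_bio_size(max_bio_size: u32, max_segments: u16) -> u32 {
    if max_bio_size != 0 {
        max_bio_size
    } else {
        max_segments as u32 * PAGE_SIZE
    }
}
```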

/// Cached immutable block device parameters. Populated once at device
/// registration from `BlockDeviceOps::get_info()` and stored in the
/// `BlockDevice` wrapper struct. Avoids vtable dispatch on the hot
/// bio-to-request conversion path — device geometry is immutable after
/// registration.
///
/// Fields are a subset of `BlockDeviceInfo` — only those needed on
/// the hot I/O path. Additional fields may be added as needed.
pub struct BlockDeviceCachedParams {
    /// Logical sector size in bytes (typically 512 or 4096).
    pub logical_block_size: u32,
    /// Physical sector size in bytes.
    pub physical_block_size: u32,
    /// Maximum total bytes per bio request.
    pub max_bio_size: u32,
    /// Device capability flags (discard, flush, FUA, etc.).
    pub flags: BlockDeviceFlags,
}

/// Concrete block device wrapper. Holds the driver's `BlockDeviceOps` vtable
/// together with cached geometry, I/O queues, and per-device accounting. Every
/// registered block device produces one `BlockDevice` instance stored in the
/// device registry XArray (keyed by `dev_t`). The `Bio.bdev` field holds
/// `Arc<BlockDevice>`, not `Arc<dyn BlockDeviceOps>` — this gives the block
/// layer access to both the ops vtable and the cached parameters without
/// double-indirection.
///
/// **Nucleus component**: The struct layout is Nucleus (field changes require
/// full subsystem quiesce). The device registration/teardown code is Evolvable.
pub struct BlockDevice {
    /// Driver-provided operations vtable (submit_bio, flush, get_info, etc.).
    pub ops: Arc<dyn BlockDeviceOps>,
    /// I/O scheduler queues. `None` for devices using hardware multi-queue
    /// dispatch (NVMe). `Some` for devices that benefit from software
    /// scheduling (AHCI/SATA with single hardware queue).
    pub io_queues: Option<DeviceIoQueues>,
    /// Immutable geometry cached at registration time. Avoids vtable dispatch
    /// on the hot bio-to-request conversion path.
    pub cached_params: BlockDeviceCachedParams,
    /// Device number (major:minor). Unique identifier in the device registry.
    pub dev: DevT,
    /// Human-readable device name (e.g., "nvme0n1", "sda").
    pub name: ArrayVec<u8, 32>,
    /// Per-device requeue list for bios returned with EAGAIN by the driver.
    /// Bounded by `MAX_REQUEUE_DEPTH` (default 4096). Re-drained by the
    /// device's completion IRQ handler via `blk_kick_requeue()`.
    ///
    /// **Use-after-free prevention (BIO-09 fix)**: Each entry stores a
    /// `(generation, *mut Bio)` tuple. The `generation` is a snapshot of
    /// `bio.generation` at enqueue time. When `blk_kick_requeue()` dequeues
    /// an entry, it first CAS's `bio.state` from `Inflight` to `Inflight`
    /// (a no-op CAS that succeeds only if the bio is still inflight). If the
    /// CAS fails, the bio was completed or timed out — the entry is stale.
    /// Additionally, the generation check (`bio.generation == saved_gen`)
    /// detects the ABA case where the slab recycled the memory for a new bio
    /// that happens to be in `Inflight` state. The generation counter is
    /// incremented on every `bio_alloc()`, so a recycled bio will have a
    /// different generation than the saved snapshot.
    ///
    /// **Why not Arc**: `Arc<Bio>` would prevent the slab from recycling the
    /// bio until the requeue list drops its reference, defeating the purpose
    /// of slab-based allocation (bounded pool, no heap growth). The generation
    /// counter achieves the same safety guarantee without extending bio lifetime.
    ///
    /// Uses `BoundedDeque` (fixed-capacity ring buffer with O(1) push/pop at
    /// both ends) instead of `ArrayVec` because `blk_kick_requeue()` needs
    /// FIFO semantics: drain from front, re-insert deferred bios at front.
    /// `ArrayVec` has no `pop_front()`/`push_front()` methods.
    pub requeue_list: SpinLock<BoundedDeque<RequeueEntry, 4096>>,
}

/// Entry in the per-device requeue list. Pairs a bio pointer with a
/// generation snapshot to detect use-after-free (BIO-09 fix).
pub struct RequeueEntry {
    /// Raw pointer to the bio. Valid only if `generation` matches
    /// `bio.generation` at dequeue time.
    pub bio: *mut Bio,
    /// Snapshot of `bio.generation` at enqueue time. If
    /// `bio.generation != saved_generation` at dequeue, the slab
    /// recycled the memory — the entry is stale and must be skipped.
    pub generation: u64,
}
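The two-step validity test at dequeue time can be sketched in isolation. `MiniBio` and `requeue_entry_valid()` are illustrative stand-ins — a real Bio carries full state, and the no-op CAS here uses an AtomicBool in place of the BioState field.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Minimal bio model: just the fields the requeue validity check reads.
pub struct MiniBio {
    pub generation: AtomicU64,
    pub inflight: AtomicBool,
}

/// Returns true if a requeue entry is still valid: the bio is still
/// inflight AND the slab has not recycled it (generation unchanged).
pub fn requeue_entry_valid(bio: &MiniBio, saved_gen: u64) -> bool {
    // No-op CAS: succeeds only if the bio is still inflight.
    if bio
        .inflight
        .compare_exchange(true, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        return false; // completed or timed out — stale entry
    }
    // ABA guard: a recycled bio gets a fresh generation at bio_alloc(),
    // so a matching generation proves the memory was not reused.
    bio.generation.load(Ordering::Acquire) == saved_gen
}
```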

/// Bio lifecycle states. Resolves the completion/timeout race via CAS.
///
/// Both the device completion handler and the synchronous timeout path
/// attempt to CAS from `Inflight` to their target state. The winner
/// proceeds with its action; the loser observes a non-`Inflight` state
/// and bails out (no-op). This eliminates the `mem::replace` race that
/// existed before: `mem::replace` is NOT atomic and two concurrent
/// `mem::replace` calls corrupt the completion field.
///
/// State diagram:
/// ```
/// Inflight ──CAS──→ Completing ──store──→ Done
///    │                                      ↑
///    └──CAS──→ TimedOut ──store─────────────┘
/// ```
///
/// **IoRequest.bio state tracking invariant** (BIO-12 resolution):
/// When a bio is wrapped in an IoRequest (scheduler path), the bio's
/// `state` field remains the single source of truth. The IoRequest does
/// NOT have its own state field. The completion path always goes through
/// the bio:
///
/// 1. Hardware signals completion → driver calls `bio_complete(req.bio, status)`.
/// 2. `bio_complete()` CAS's `bio.state` from `Inflight` to `Completing`.
/// 3. If CAS succeeds: invoke `bio.end_io` callback, store `Done`.
/// 4. If CAS fails: timeout handler already won — completion is a no-op.
///
/// The IoRequest is freed by the scheduler after `bio_complete()` returns
/// (whether the CAS succeeded or failed). The bio is freed by its `end_io`
/// callback (or by the synchronous waiter, depending on the submission
/// path). There is exactly one completion attempt per bio, regardless of
/// whether it went through the scheduler or not. The `*mut Bio` in
/// IoRequest is never dereferenced after `bio_complete()` transitions the
/// bio to `Done` — the IoRequest is dropped immediately after.
#[repr(u32)]
pub enum BioState {
    /// Bio submitted, I/O in progress.
    Inflight   = 0,
    /// Device completion handler won the CAS; executing callback.
    Completing = 1,
    /// Timeout handler won the CAS; executing timeout action.
    TimedOut   = 2,
    /// Terminal state — bio lifecycle complete.
    Done       = 3,
}

/// Bio completion callback type. A function pointer set by the submitter
/// (filesystem, page cache, io_uring, synchronous waiter) before calling
/// `bio_submit()`. Called by `bio_complete()` when I/O finishes.
///
/// **Design rationale (Decision 4)**: The previous `BioCompletion` enum had 5
/// variants (`None`, `Callback`, `DeferredCallback`, `Waiter`, `StackWaiter`)
/// and required a bridging conversion (`IoCompletion::from_bio_completion()`)
/// to route scheduler completion back to the bio. That bridge was never
/// defined, creating a broken completion chain (BIO-01, BIO-05). The function
/// pointer replaces ALL variants: each submitter provides its own callback
/// that performs the appropriate completion action. The I/O scheduler wraps
/// the Bio in an IoRequest for merging/sorting, and on completion calls
/// `bio_complete()` which invokes this callback.
///
/// **Matches Linux's `bio->bi_end_io`**: Linux uses `void (*bi_end_io)(struct bio *)`
/// — a function pointer set by the submitter. UmkaOS adds a `status: i32`
/// parameter for direct error propagation (Linux uses `bio->bi_status` which
/// the callback reads separately; passing it avoids an extra atomic load).
///
/// **`*mut Bio` parameter**: Raw pointer because the callback runs in
/// interrupt/softirq context after the CAS-protected state transition in
/// `bio_complete()`. The CAS guarantees exclusive access — no aliasing.
/// The callback may free the bio (via `SlabBox::from_raw()`) or retain it
/// for retry. `&mut Bio` is unsuitable because the bio may already be
/// behind a raw pointer in the requeue list or I/O scheduler.
///
/// **Callback context constraints**: See "Bio Completion Callback Constraints"
/// below. Callbacks that need process context (page cache updates, sleeping
/// locks) must schedule work on the `blk-io` workqueue and return immediately.
///
/// **Common callback implementations**:
///
/// | Submitter | Callback | Action |
/// |-----------|----------|--------|
/// | `bio_submit_and_wait()` | `bio_sync_end_io` | Sets status, wakes stack waiter |
/// | Async writeback | `writeback_end_io_deferred` | Enqueues `blk-io` workqueue item for page cache updates |
/// | io_uring block ops | `io_uring_bio_end_io` | Posts CQE to completion ring |
/// | Direct I/O (no waiter) | `bio_noop_end_io` | Logs warning (catches double-signal bugs) |
///
/// **Status convention**: `status` is 0 on success, negative errno on error
/// (e.g., `-(EIO as i32)`). Matches the `bio.status` AtomicI32 encoding.
pub type BioEndIo = fn(bio: *mut Bio, status: i32);

/// Default (no-op) bio completion callback. Logs a warning if invoked —
/// catches double-signal bugs and bios submitted without a completion
/// callback set (programming error).
fn bio_noop_end_io(_bio: *mut Bio, _status: i32) {
    klog_warn!("bio_complete: called on bio with default (noop) completion");
}

/// Synchronous bio completion callback. Used by `bio_submit_and_wait()`.
/// Sets `bio.status` and wakes the stack-allocated waiter. The waiter
/// pointer is stored in `bio.private` (set by `bio_submit_and_wait()`
/// before submission).
///
/// # Safety
/// - `bio` is valid (CAS-protected exclusive access in `bio_complete()`).
/// - `bio.private` is a valid `*const WaitQueueHead` pointing to the
///   caller's stack frame. The caller blocks until completion, so the
///   stack frame outlives this callback. If the timeout path wins the
///   CAS instead, this callback is never invoked.
fn bio_sync_end_io(bio: *mut Bio, status: i32) {
    // SAFETY: bio pointer is valid (CAS-protected in bio_complete).
    let bio = unsafe { &mut *bio };
    bio.status.store(status, Ordering::Release);
    // SAFETY: bio.private was set to a valid *const WaitQueueHead by
    // bio_submit_and_wait(). The caller is blocked, so the stack frame
    // (and thus the WaitQueueHead) is alive. The CAS in bio_complete()
    // ensures this callback runs only if the timeout path did NOT win.
    let wq = bio.private as *const WaitQueueHead;
    unsafe { (*wq).wake_up(); }
}
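A sketch of the bio_submit_and_wait() pattern referenced above, modeled in userspace: a Mutex/Condvar pair stands in for the stack-allocated WaitQueueHead, and a spawned thread plays the device completion handler. Names and structure are illustrative only.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Sentinel meaning "I/O not yet complete" (illustrative value).
const BIO_STATUS_PENDING: i32 = i32::MIN;

struct SyncWaiter {
    status: AtomicI32,
    done: Mutex<bool>,
    cv: Condvar,
}

/// Submit simulated I/O and block until the completion side signals.
/// The closure below plays the role of bio_sync_end_io: it stores the
/// status BEFORE waking the waiter, matching the ordering in the text.
fn bio_submit_and_wait(simulated_status: i32) -> i32 {
    let w = Arc::new(SyncWaiter {
        status: AtomicI32::new(BIO_STATUS_PENDING),
        done: Mutex::new(false),
        cv: Condvar::new(),
    });
    let w2 = Arc::clone(&w);
    let t = thread::spawn(move || {
        // "Device completion" path.
        w2.status.store(simulated_status, Ordering::Release);
        *w2.done.lock().unwrap() = true;
        w2.cv.notify_all();
    });
    // Caller blocks until completion; the waiter outlives the callback.
    let mut done = w.done.lock().unwrap();
    while !*done {
        done = w.cv.wait(done).unwrap();
    }
    drop(done);
    t.join().unwrap();
    w.status.load(Ordering::Acquire)
}
```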

/// Deferred writeback completion callback. Enqueues the actual page cache
/// update work (`writeback_end_io`) on the `blk-io` per-CPU workqueue.
/// This two-phase approach is required because page cache operations
/// (xa_lock, wait queue wake, `nr_dirty` decrement) are forbidden in
/// interrupt/softirq context where `bio_complete()` runs.
fn writeback_end_io_deferred(bio: *mut Bio, status: i32) {
    // Schedule deferred execution on the `blk-io` per-CPU workqueue.
    // The workqueue item captures the bio pointer and status. The
    // bio remains valid until the deferred callback runs and either
    // frees or recycles it.
    workqueue_enqueue("blk-io", move || {
        // SAFETY: bio is valid — ownership transferred from bio_complete()
        // through the workqueue item. No other path accesses the bio
        // between bio_complete() and this deferred execution.
        writeback_end_io(unsafe { &mut *bio }, status);
    });
}

/// Unified bio completion entry point. ALL completion paths MUST use this.
///
/// Performs CAS(Inflight -> Completing), stores the status, then invokes
/// the bio's `end_io` callback. There is deliberately no post-callback
/// store to `Done` — the callback may free the bio (see the body comment).
/// If the CAS fails (timeout or double-completion), does nothing — the
/// timeout path or prior completion already owns the bio.
///
/// **Why a function, not a method**: The CAS guarantees exclusive access
/// to the bio. Calling the `end_io` callback requires passing `bio` as
/// `*mut Bio` — the callback may free the bio, retry it, or chain it.
/// A `&mut self` method on Bio would be unsound because the callback
/// receives the same pointer. The free function takes `*mut Bio`
/// explicitly, and the CAS ensures no aliasing.
///
/// This eliminates the TOCTOU race (SF-373): the callback is a function
/// pointer (not an enum extracted via `mem::take`), so there is no
/// extraction step between CAS and invocation. The CAS guarantees
/// exclusive access; the callback is invoked directly.
///
/// Usage (in device IRQ handler, scheduler completion, or Tier 0 ring consumer):
/// ```
/// bio_complete(bio, 0);           // success
/// bio_complete(bio, -(EIO as i32)); // error
/// ```
///
/// **Caller context**: May be called from interrupt/softirq context
/// (device IRQ handler), process context (timeout handler), or the
/// I/O scheduler completion path. The `end_io` callback must respect
/// the Bio Completion Callback Constraints documented below.
pub fn bio_complete(bio: *mut Bio, status: i32) {
    // SAFETY: caller guarantees bio is a valid pointer to a live Bio.
    // The CAS below establishes exclusive ownership before any mutation.
    let bio_ref = unsafe { &*bio };
    match bio_ref.state.compare_exchange(
        BioState::Inflight as u32,
        BioState::Completing as u32,
        Ordering::AcqRel,
        Ordering::Acquire,
    ) {
        Ok(_) => {
            // Won the CAS — exclusive ownership of the bio.
            // Store status before invoking callback (callback may read it).
            bio_ref.status.store(status, Ordering::Release);
            // Invoke the submitter's completion callback. The callback
            // receives `*mut Bio` and the status. It may:
            // - Free the bio (SlabBox::from_raw)
            // - Wake a waiter (bio_sync_end_io)
            // - Schedule deferred work (writeback_end_io_deferred)
            // - Post an io_uring CQE (io_uring_bio_end_io)
            (bio_ref.end_io)(bio, status);
            // No post-callback state store. After end_io returns, the
            // bio may have been freed and its slab slot recycled by
            // another CPU's bio_alloc(). Writing to freed memory would
            // corrupt the new bio's state.
            //
            // The CAS to BioState::Completing is the terminal state for
            // bio_complete()'s ownership. Sync waiters (bio_sync_end_io)
            // wake on `bio.status != BIO_STATUS_PENDING`, which is set
            // BEFORE the callback. eBPF/blktrace observers treat
            // Completing as equivalent to Done.
        }
        Err(_) => {
            // Lost the CAS — timeout path or another completion already
            // claimed the bio. Do nothing.
        }
    }
}
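The symmetric CAS claim used by both the completion and timeout paths can be isolated as a sketch; `claim_bio()` and the bare u32 constants are illustrative stand-ins for the BioState transitions.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Mirror of the BioState discriminants used in the text.
pub const INFLIGHT: u32 = 0;
pub const COMPLETING: u32 = 1;
pub const TIMED_OUT: u32 = 2;

/// Attempt to claim an inflight bio for either completion or timeout.
/// Returns true iff this path won the race and now owns the bio; the
/// loser observes a non-Inflight state and must bail out (no-op).
pub fn claim_bio(state: &AtomicU32, target: u32) -> bool {
    state
        .compare_exchange(INFLIGHT, target, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```

Exactly one of the two concurrent claims can succeed, which is what eliminates the non-atomic mem::replace race described above.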

Bio Completion Callback Constraints: The end_io callback function executes in interrupt or softirq context (Section 3.8) — specifically the BLOCK_SOFTIRQ vector (index 4) for block I/O completion processing, or in the I/O scheduler's completion path (which may also run in softirq context for scheduler-attached devices). Callbacks MUST NOT:

  • Acquire sleeping locks (Mutex, RwLock, Semaphore)
  • Allocate memory with GFP_KERNEL (only pre-allocated objects or GFP_ATOMIC from the emergency reserve)
  • Call filesystem operations or page cache methods
  • Trigger KABI domain crossings
  • Block or sleep for any reason

Because the callback runs in this restricted context, page cache state updates (clearing PG_WRITEBACK, waking waiters, updating AddressSpace.wb_err) must be deferred to the blk-io workqueue from within the callback (e.g., writeback_end_io_deferred schedules a workqueue item). This handoff adds ~1-5 us of latency but is required for correctness — page cache operations may acquire sleeping locks.

Clarification: end_page_writeback() (clearing PG_WRITEBACK and waking waiters) is the operation deferred to the workqueue. Read-path completions that only clear PageFlags::LOCKED and wake waiters run directly in softirq — no workqueue deferral for page flag updates on the read path. The "workqueue deferral" described here applies to writeback completion, where filesystem journal updates and AddressSpace.wb_err manipulation may need sleeping locks.

For the performance impact of this deferral on fsync/O_SYNC paths, see Section 3.4.
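The split above — a constrained completion callback that only enqueues, and a worker that later does the sleepable part — can be illustrated with a minimal user-space sketch. `MockWorkqueue`, `writeback_end_io`, and the log strings are all illustrative stand-ins, not the kernel's actual API:

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for the blk-io workqueue: completion callbacks
// ("softirq context") may only enqueue work; the worker drains the queue
// later, in a context where sleeping locks would be allowed.
struct MockWorkqueue {
    items: VecDeque<Box<dyn FnOnce(&mut Vec<String>)>>,
}

impl MockWorkqueue {
    fn new() -> Self {
        Self { items: VecDeque::new() }
    }
    // Called from the "softirq" callback: O(1), no sleeping.
    fn schedule(&mut self, work: Box<dyn FnOnce(&mut Vec<String>)>) {
        self.items.push_back(work);
    }
    // Called from the worker: the deferred work runs here.
    fn drain(&mut self, log: &mut Vec<String>) {
        while let Some(work) = self.items.pop_front() {
            work(log);
        }
    }
}

// Writeback completion sketch: record status immediately, defer the
// end_page_writeback-style work to the queue.
fn writeback_end_io(wq: &mut MockWorkqueue, log: &mut Vec<String>) {
    log.push("softirq: status recorded".into());
    wq.schedule(Box::new(|log: &mut Vec<String>| {
        log.push("worker: PG_WRITEBACK cleared".into())
    }));
}
```

The key property is visible in the ordering: after `writeback_end_io` returns, the page-cache side effect has not happened yet; it only happens when the worker drains the queue.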

Permitted operations:

  • Atomic bitfield updates (set/clear PageFlags)
  • Wake wait queues (WaitQueueHead::wake_up)
  • Update per-CPU counters
  • Schedule work on a workqueue for deferred processing
  • Set bio.status and signal completion
/// Block I/O request — carries data between the block layer and device drivers.
///
/// A Bio represents a contiguous logical block range and its associated
/// memory pages. Multiple bios can be chained for scatter-gather I/O.
/// The bio is the fundamental unit of block I/O in UmkaOS, equivalent to
/// Linux's `struct bio`.
pub struct Bio {
    /// Target block device (concrete wrapper, not trait object).
    ///
    /// `Arc<BlockDevice>` provides access to both the driver's `BlockDeviceOps`
    /// vtable (`bdev.ops`) and the cached device parameters (`bdev.cached_params`)
    /// without double-indirection. The block layer uses `bdev.cached_params` on
    /// the hot bio-to-request path and `bdev.ops.submit_bio()` for dispatch.
    ///
    /// **Collection policy exemption**: `Arc<BlockDevice>` is used despite being
    /// on the I/O hot path because the block device outlives all its bios.
    /// The Arc refcount increment/decrement is a single atomic op (~5 ns)
    /// amortized across the full bio lifecycle. The alternative (raw pointer +
    /// manual lifetime tracking) would sacrifice Rust's use-after-free
    /// prevention for negligible gain. Clone occurs at bio_alloc time (warm
    /// path), not per-sector.
    pub bdev: Arc<BlockDevice>,
    /// Operation type.
    pub op: BioOp,
    /// Starting logical block address (in logical sectors).
    pub start_lba: u64,
    /// Scatter-gather list of memory segments.
    pub segments: ArrayVec<BioSegment, 16>,
    /// Extension segment list for bios with >16 segments.
    ///
    /// **Hot-path allocation note**: Most I/O requests fit within the inline
    /// 16-segment ArrayVec (filesystem block I/O, direct I/O up to 64 KB with
    /// 4 KB pages). The `Box<[BioSegment]>` fallback is allocated only for
    /// large scatter-gather lists (e.g., O_DIRECT reads >64 KB into a
    /// fragmented user buffer). This allocation is from the `bio_slab` pool
    /// (a dedicated slab cache with pre-allocated pages), NOT from the general
    /// heap, ensuring bounded allocation latency on the I/O submit path. The
    /// slab cache is sized at boot: `min(1024, nr_cpus * 64)` entries, each
    /// holding up to `BIO_MAX_SEGMENTS - 16 = 240` BioSegments. If the slab
    /// is exhausted, `bio_submit()` blocks on the slab mempool (same as Linux's
    /// `bioset` mempool behaviour). The `Box` is freed to the slab on Bio
    /// completion (drop path).
    pub segments_ext: Option<Box<[BioSegment]>>,
    /// Completion callback. Set by the submitter (filesystem, page cache,
    /// io_uring, sync waiter) before calling `bio_submit()`. Invoked by
    /// `bio_complete()` when I/O finishes. See `BioEndIo` type documentation
    /// for callback constraints and common implementations.
    ///
    /// **Default**: `bio_noop_end_io` (logs warning — catches bios submitted
    /// without a callback set). Submitters MUST set this before `bio_submit()`.
    pub end_io: BioEndIo,
    /// Opaque per-submitter private data. Used by the completion callback to
    /// locate submitter-specific state without an extra indirection. Common
    /// uses:
    /// - `bio_submit_and_wait()`: `*const WaitQueueHead` (stack waiter)
    /// - io_uring: `*const IoRingBioPrivate` (ring + user_data)
    /// - Writeback: `*const AddressSpace` (for wb_err update)
    ///
    /// Stored as `usize` (pointer-sized opaque value). Each callback knows
    /// how to interpret it. Initialized to 0 by `bio_alloc()`.
    pub private: usize,
    /// Atomic state machine for bio lifecycle. Resolves the completion/timeout
    /// race: both the completion handler and timeout path CAS from INFLIGHT to
    /// their target state. Winner proceeds, loser bails. No `mem::replace` race.
    ///
    /// ```
    /// INFLIGHT ──CAS──→ COMPLETING ──store──→ DONE
    ///    │                                      ↑
    ///    └──CAS──→ TIMED_OUT ──store────────────┘
    /// ```
    ///
    /// States:
    /// - `Inflight` (0): bio submitted, I/O in progress.
    /// - `Completing` (1): device completion handler won the CAS; executing callback.
    /// - `TimedOut` (2): timeout handler won the CAS; executing timeout action.
    /// - `Done` (3): terminal state, bio lifecycle complete.
    pub state: AtomicU32,
    /// Error status (set by the driver on completion).
    pub status: AtomicI32,
    /// Flags controlling I/O semantics and crash recovery behavior.
    pub flags: BioFlags,
    /// Originating cgroup ID for I/O accounting and throttling.
    /// Set to 0 by default; populated by `bio_submit()` before dispatch.
    /// This is the **global** cgroup ID (unique across all cgroup namespaces,
    /// assigned monotonically by the cgroup core). Not namespace-scoped —
    /// the block layer operates below namespace boundaries and uses the
    /// global ID for blkcg throttling and accounting.
    pub cgroup_id: u64,
    /// Generation counter for use-after-free detection (BIO-09 fix).
    /// Incremented by `bio_alloc()` on every allocation from the slab.
    /// The requeue list stores a snapshot of this value; at dequeue time,
    /// if `bio.generation != saved_generation`, the slab recycled the
    /// memory for a new bio and the requeue entry is stale.
    ///
    /// **Longevity**: u64 at 10 billion I/Os per second wraps after ~58
    /// years. Exceeds the 50-year operational target.
    pub generation: u64,
}
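The `state` field's race resolution can be demonstrated in isolation. The sketch below (state constants and the `claim` helper are illustrative, not the kernel's names) shows two contenders — a completion handler and a timeout handler — both CASing from `Inflight`; exactly one can win:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative state encoding matching the documented lifecycle.
const INFLIGHT: u32 = 0;
const COMPLETING: u32 = 1;
const TIMED_OUT: u32 = 2;

// One contender CASes Inflight -> Completing (device completion path),
// the other Inflight -> TimedOut (timeout path). The CAS succeeds for
// at most one of them; the loser bails without touching the bio.
fn claim(state: &AtomicU32, target: u32) -> bool {
    state
        .compare_exchange(INFLIGHT, target, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```

Racing the two claims from separate threads always yields exactly one winner, which is what makes the `mem::replace`-style race impossible by construction.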

/// A single segment of a bio — a contiguous range of physical memory.
pub struct BioSegment {
    /// Physical page containing the data. Raw pointer with manual refcount
    /// management via `page_get()` / `page_put()` on `Page._refcount`.
    ///
    /// # Why not `Arc<Page>`
    ///
    /// `Page` already has an intrinsic atomic refcount (`_refcount: AtomicI32`).
    /// Wrapping it in `Arc` adds a second refcount, doubling the atomic
    /// operations on the hot I/O path and creating confusion about which
    /// refcount is authoritative for page lifetime.
    ///
    /// # Safety
    ///
    /// - `page_get()` must be called before storing the pointer (bio_add_page).
    /// - `page_put()` must be called when the segment is consumed (bio_endio).
    /// - The page must not be freed while any BioSegment holds a pointer to it.
    /// - Callers must not dereference the pointer after `page_put()`.
    pub page: *const Page,
    /// Offset within the page (bytes).
    pub offset: u32,
    /// Length of this segment (bytes).
    pub len: u32,
}

// SAFETY: BioSegment is Send because the page pointer validity is
// maintained by page refcount (page_get/page_put). The page refcount
// ensures the page is not freed while any BioSegment holds a pointer.
// Cross-CPU completion (interrupt on different CPU) requires Send.
unsafe impl Send for BioSegment {}
// SAFETY: BioSegment is Sync because all fields are read-only after
// construction. The page pointer is dereferenced only for DMA address
// calculation, which is a pure read of Page.phys_addr.
unsafe impl Sync for BioSegment {}
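The `page_get()`/`page_put()` discipline that backs these Send/Sync claims can be sketched against a minimal stand-in for `Page`'s intrinsic refcount (the struct and helpers below are illustrative; the real `Page._refcount` lives in the memory subsystem):

```rust
use std::sync::atomic::{AtomicI32, Ordering};

// Minimal stand-in for Page's intrinsic atomic refcount.
struct Page {
    refcount: AtomicI32,
}

// Taken before storing the pointer in a BioSegment (bio_add_page).
fn page_get(p: &Page) {
    p.refcount.fetch_add(1, Ordering::AcqRel);
}

// Dropped when the segment is consumed (bio_endio). Returns true when
// the caller released the last reference — the point at which the real
// allocator would free the page.
fn page_put(p: &Page) -> bool {
    p.refcount.fetch_sub(1, Ordering::AcqRel) == 1
}
```

As long as every BioSegment holds one reference taken by `page_get()`, the page cannot reach refcount zero while any segment's raw pointer is live, which is exactly the invariant the `unsafe impl Send` relies on.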

/// Block I/O operation type. Values MUST match Linux `include/linux/blk_types.h`
/// `enum req_op` exactly — BioOp values are serialized across the KABI ring
/// boundary and observed by eBPF/blktrace tools. Gaps in the numbering
/// (4, 6, 8, 10-16) are reserved for future zone management ops matching Linux.
#[repr(u8)]
pub enum BioOp {
    Read        = 0,   // REQ_OP_READ
    Write       = 1,   // REQ_OP_WRITE
    Flush       = 2,   // REQ_OP_FLUSH
    Discard     = 3,   // REQ_OP_DISCARD
    SecureErase = 5,   // REQ_OP_SECURE_ERASE
    ZoneAppend  = 7,   // REQ_OP_ZONE_APPEND
    WriteZeroes = 9,   // REQ_OP_WRITE_ZEROES
    // Note: Read-ahead is signaled via `BioOp::Read` + `BioFlags::RAHEAD`,
    // not as a separate BioOp variant. This matches Linux's design: read-ahead
    // has identical device-level semantics to Read but different error handling
    // (silently droppable on resource pressure). Keeping it as a flag avoids
    // duplicating all Read handling in the block layer dispatch.
    //
    // Future zone ops (Phase 3+), matching Linux: ZoneOpen=10, ZoneClose=11,
    // ZoneFinish=12, ZoneReset=13, ZoneResetAll=15, DrvIn=34, DrvOut=35.
}
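Because BioOp values cross the KABI ring boundary as raw bytes, decoding must treat the reserved gaps as invalid rather than mapping them to anything. A self-contained sketch of that invariant (the `decode` helper is illustrative, not a documented kernel function):

```rust
// Sketch of the BioOp wire encoding: explicit discriminants with
// reserved gaps, so serialized values stay stable across the KABI ring.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum BioOp {
    Read = 0,
    Write = 1,
    Flush = 2,
    Discard = 3,
    SecureErase = 5,
    ZoneAppend = 7,
    WriteZeroes = 9,
}

// Deserialization on the consumer side of the ring: reserved gaps and
// unknown future ops decode to None and must be rejected.
fn decode(raw: u8) -> Option<BioOp> {
    Some(match raw {
        0 => BioOp::Read,
        1 => BioOp::Write,
        2 => BioOp::Flush,
        3 => BioOp::Discard,
        5 => BioOp::SecureErase,
        7 => BioOp::ZoneAppend,
        9 => BioOp::WriteZeroes,
        _ => return None, // reserved gap or future zone op
    })
}
```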

bitflags! {
    /// Bio flags controlling I/O semantics and crash recovery behavior.
    pub struct BioFlags: u32 {
        /// Force Unit Access — bypass volatile write cache.
        const FUA        = 1 << 0;
        /// Pre-flush — flush device write cache before this I/O.
        const PREFLUSH   = 1 << 1;
        /// Metadata I/O (journal, superblock) — higher priority.
        const META       = 1 << 2;
        /// Synchronous I/O — caller expects low latency.
        const SYNC       = 1 << 3;
        /// Read-ahead — low priority, can be dropped under pressure.
        const RAHEAD     = 1 << 4;
        /// Persistent bio — must be replayed after Tier 1 driver crash
        /// recovery. Bios WITHOUT this flag are drained with `-EIO` on
        /// crash. Used for filesystem journal commits, superblock writes,
        /// and any I/O where silent loss causes data corruption.
        /// See [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation--bio-crash-recovery).
        const PERSISTENT = 1 << 5;
        /// No-merge hint — do not merge with adjacent bios.
        const NOMERGE    = 1 << 6;
        /// Marks bio as submitted from an async context (io_uring, AIO).
        /// Completion uses async notification rather than blocking waiter.
        const ASYNC      = 1 << 7;
    }
}

15.2.3.1 Bio Crash Recovery

When a Tier 1 block device driver crashes, in-flight bios are handled based on the PERSISTENT flag:

  • BioFlags::PERSISTENT set: The bio is preserved in the per-device bio retry list (allocated in umka-core memory, outside the driver's isolation domain). After driver reload, these bios are replayed automatically — cleared of BIO_ERROR flags and re-submitted to the new driver instance. Used for journal commits, superblock writes, and other I/O that cannot be silently lost.

Replay ordering: PERSISTENT bios are replayed in submission order within each device (FIFO, matching the order in which they were originally submitted to the driver). This preserves the filesystem's write-ordering assumptions (e.g., the journal commit block is written after the journal data blocks). For RAID arrays, submission-order replay is critical: ascending-LBA replay could reorder data and parity writes within a stripe, corrupting the parity. The bio retry list is maintained as a FIFO queue (append on capture, replay from head) during the crash recovery collection phase. Each captured bio records a monotonically increasing capture_seq: u64 for tie-breaking when bios are captured from multiple CPUs concurrently (sorted by capture_seq after collection, before replay).
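The capture_seq merge step can be sketched as follows (the `CapturedBio` struct and `order_for_replay` are illustrative names, assuming per-CPU capture lists have already been concatenated):

```rust
// Sketch: per-CPU capture lists are merged, then ordered by capture_seq
// so replay preserves the original submission order (FIFO) across CPUs.
#[derive(Debug, PartialEq)]
struct CapturedBio {
    capture_seq: u64,
    start_lba: u64,
}

fn order_for_replay(mut captured: Vec<CapturedBio>) -> Vec<CapturedBio> {
    // Ascending capture_seq == original submission order. Sorting by
    // start_lba instead could reorder data vs parity writes within a
    // RAID stripe — exactly the hazard the text warns about.
    captured.sort_by_key(|b| b.capture_seq);
    captured
}
```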

Replay set size bound: The maximum number of PERSISTENT bios retained per device is bounded by MAX_PERSISTENT_BIOS_PER_DEVICE (default: 256). This limit corresponds to the maximum number of concurrent journal commit + superblock writes that can be in-flight for a single block device. If the retry list exceeds this limit (indicating a pathological workload or a stuck driver), the oldest bios are drained with -EIO to prevent unbounded memory consumption. The limit is configurable via umka.max_persistent_bios=N.

Error handling during replay: If a replayed bio fails on the new driver instance (the new driver returns an error for that I/O), the error is handled as follows:

  • The failed bio is logged via FMA with severity Degraded, including the LBA range, bio flags, and error code.
  • The bio is skipped (not retried again) — replay continues with the next bio in the sorted list. The failed bio's completion callback fires with the error status from the new driver.
  • The block device is marked DEGRADED in the volume layer state machine. If the device is part of a RAID array, the standard degraded-mode handling applies (parity reconstruction for RAID5/6, mirror failover for RAID1).
  • Replay does NOT abort on a single failure. All remaining PERSISTENT bios are still replayed — a transient error on one LBA range should not prevent replay of unrelated ranges.

  • BioFlags::PERSISTENT not set: The bio is drained with bio.status = -EIO. The completion callback fires with error status. Applications retry via standard I/O error handling (fsync retry, read retry). This avoids replaying stale data bios whose contents may have been superseded.

Filesystem layers set PERSISTENT on critical I/O:

  • ext4/XFS/Btrfs journal commits: BioFlags::FUA | BioFlags::PERSISTENT
  • Superblock writes: BioFlags::PREFLUSH | BioFlags::FUA | BioFlags::PERSISTENT
  • Regular data writes: no PERSISTENT — drained with -EIO on crash
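These flag combinations can be sketched with plain bit constants (the real type uses the bitflags! macro; consts keep the example dependency-free, and the helper function names are illustrative):

```rust
// Bit positions matching the documented BioFlags layout.
const FUA: u32 = 1 << 0;
const PREFLUSH: u32 = 1 << 1;
const PERSISTENT: u32 = 1 << 5;

// Journal commit: durable write, replayed after driver crash.
fn journal_commit_flags() -> u32 {
    FUA | PERSISTENT
}

// Superblock write: flush the cache first, then a durable write.
fn superblock_flags() -> u32 {
    PREFLUSH | FUA | PERSISTENT
}

// Crash recovery asks exactly this question for each in-flight bio:
// replay it, or drain it with -EIO.
fn is_persistent(flags: u32) -> bool {
    flags & PERSISTENT != 0
}
```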

15.2.3.2 Bio Lifecycle and Ownership

Bio objects are allocated from a dedicated permanent slab cache (bio_slab). This slab is marked PERMANENT — it is never garbage-collected, ensuring that bio allocation latency remains bounded even under sustained memory pressure.

Ownership rules:

  • bio_submit_and_wait() (synchronous): The caller owns the bio for its entire lifetime. After bio_submit_and_wait() returns, the caller may reuse the bio for another I/O or drop it. The Drop impl frees segments_ext (if allocated) and returns the bio to the bio_slab cache.

  • bio_submit() (asynchronous): After calling bio_submit(), ownership of the bio is logically transferred to the I/O completion path. The bio's end_io callback is responsible for either reusing the bio (e.g., for retry or chained I/O) or calling bio_free() to return it to the slab. The submitter must not access the bio after bio_submit() returns.

  • Completion callback context: The completion callback fires in interrupt or softirq context (see Bio Completion Callback Constraints above). If the callback needs to perform work that may sleep (e.g., page cache updates), it must schedule a workqueue item and return immediately.

/// Allocate a bio from the bio slab cache.
///
/// Returns a `SlabBox<Bio>` — a slab-allocated owning pointer with RAII
/// `Drop`. `SlabBox<T>` wraps `NonNull<T>` + `&'static SlabCache<T>`;
/// the `Drop` impl calls `cache.free(ptr)`, making use-after-free
/// impossible by construction. No explicit `bio_free()` needed.
///
/// The previous `&'static mut Bio` return type was unsound: it claimed
/// the borrow had `'static` lifetime, but the bio may be freed (returned
/// to the slab) at any time — creating a dangling reference.
///
/// For the completion callback path (where ownership transfers to the
/// I/O stack), use `ManuallyDrop<SlabBox<Bio>>` or explicit consumption
/// via `SlabBox::into_raw()` / `SlabBox::from_raw()`.
///
/// `SlabBox<T>` is defined in [Section 4.2](04-memory.md#physical-memory-allocator--slabbox).
pub fn bio_alloc() -> SlabBox<Bio>;

Ownership model: SlabBox<Bio> provides type-safe slab lifetime management. When dropped, the bio is returned to its originating SlabCache<Bio>. For bios transferred to the completion path via bio_submit(), the submitter uses ManuallyDrop::new(bio) to suppress the automatic drop; the completion callback calls ManuallyDrop::into_inner() to resume RAII ownership, or uses SlabBox::into_raw() / SlabBox::from_raw() for raw pointer interop with the block layer's per-request state.
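The suppress-then-resume pattern can be demonstrated with std's `ManuallyDrop` and a drop-counting stand-in for `SlabBox<Bio>` (`FakeSlabBox`, `submit`, and `complete` are illustrative names, not the kernel API):

```rust
use std::mem::ManuallyDrop;
use std::sync::atomic::{AtomicU32, Ordering};

// Counts "frees to the slab" so the test can verify exactly-once drop.
static DROPS: AtomicU32 = AtomicU32::new(0);

// Stand-in for SlabBox<Bio>: Drop returns the object to its slab cache.
struct FakeSlabBox;
impl Drop for FakeSlabBox {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::Relaxed);
    }
}

// Submitter side: wrap in ManuallyDrop to suppress the automatic drop
// when handing the bio over to the completion path.
fn submit(bio: FakeSlabBox) -> ManuallyDrop<FakeSlabBox> {
    ManuallyDrop::new(bio) // no drop when this value goes out of scope
}

// Completion side: resume RAII ownership; dropping now frees to the slab.
fn complete(bio: ManuallyDrop<FakeSlabBox>) {
    let owned = ManuallyDrop::into_inner(bio);
    drop(owned); // exactly one free, at the point the callback chooses
}
```

The same shape applies to `SlabBox::into_raw()` / `SlabBox::from_raw()`: ownership is parked as an inert value at submission and resumed exactly once in the completion callback.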

15.2.3.3 Bio-to-IoRequest Conversion

For block devices with an I/O scheduler attached (Section 15.18), bios are converted to IoRequest objects before being dispatched to hardware queues. This conversion bridges the filesystem-facing bio interface with the scheduler-facing IoRequest interface.

/// Scheduler-facing I/O request. Created from a Bio by `bio_to_io_request()`.
/// The IoRequest wraps a Bio for the scheduler's merging/sorting/priority
/// logic. On completion, the scheduler calls `bio_complete()` on the
/// originating Bio, which invokes the Bio's `end_io` callback.
// Kernel-internal, not KABI.
pub struct IoRequest {
    /// Starting logical block address.
    pub lba: Lba,
    /// Request length in **bytes** (not sectors). Corresponds to Linux's
    /// `struct request.__data_len`. Named `len_bytes` (not `len`) to prevent
    /// ambiguity between bytes and sectors.
    pub len_bytes: u64,
    /// I/O operation type. Uses BioOp directly (see [Section 15.2](#block-io-and-volume-management)).
    pub op: BioOp,
    /// Scheduling priority (derived from task + cgroup).
    pub priority: IoPriority,
    /// Submission timestamp (monotonic nanoseconds).
    pub submit_ns: u64,
    /// Deadline (set by the I/O scheduler on insertion). 0 = not yet assigned.
    pub deadline_ns: u64,
    /// PID of the submitting task.
    pub pid: Pid,
    /// Cgroup ID for accounting and throttling.
    pub cgroup_id: u64,
    /// Scatter-gather list of DMA-mapped segments.
    pub sgl: DmaSgl,
    /// Back-pointer to the originating Bio. The scheduler uses this to:
    /// 1. Extract the Bio at dispatch time (`submit_bio` takes `&mut Bio`).
    /// 2. Call `bio_complete()` on the Bio when hardware signals completion.
    ///
    /// # Safety
    /// The Bio is kept alive for the duration of the IoRequest's lifetime.
    /// The submitter transfers ownership of the Bio to the I/O completion
    /// path via `ManuallyDrop` / `SlabBox::into_raw()` at `bio_submit()`
    /// time. The Bio is not freed until `bio_complete()` invokes `end_io`,
    /// which may free it. The IoRequest must not outlive the Bio.
    ///
    /// **Lifetime guarantee**: The Bio's slab allocation is pinned (the
    /// `bio_slab` is PERMANENT — never garbage-collected). The raw pointer
    /// remains valid until `bio_complete()` transitions the Bio to `Done`
    /// and the `end_io` callback frees or recycles it.
    pub bio: *mut Bio,
}

/// Convert a bio into an IoRequest for the I/O scheduler.
/// Called by `bio_submit()` when the target device has an I/O scheduler
/// attached (i.e., `bdev.io_queues` is `Some`).
///
/// The bio's LBA range, operation type, and memory segments are transferred
/// to the IoRequest. The bio pointer is stored as a back-reference for
/// dispatch (the driver's `submit_bio` takes `&mut Bio`) and completion
/// (the scheduler calls `bio_complete()` on the originating Bio).
///
/// Priority is derived from the submitting task's effective I/O priority.
///
/// For devices WITHOUT an I/O scheduler (e.g., NVMe with native multi-queue),
/// `bio_submit()` dispatches directly to `bdev.ops.submit_bio()` without
/// conversion.
/// **DmaSgl construction paths** (resolves BIO-07 — DmaSgl is NOT built twice):
///
/// - **Scheduler path** (`bio_to_io_request()`): DmaSgl is built HERE from
///   bio segments. The I/O scheduler stores it in `IoRequest.sgl`. At
///   dispatch time, the scheduler calls `bdev.ops.submit_request(req)`
///   which passes the pre-built SGL to the driver. The driver does NOT
///   rebuild it from the bio — it uses `req.sgl` directly.
///
/// - **Direct dispatch path** (NVMe with no scheduler): `bio_submit()`
///   calls `bdev.ops.submit_bio(bio)` directly. The NVMe driver builds
///   the DmaSgl inside its `submit_bio()` implementation. No IoRequest
///   is ever created — the DmaSgl is built exactly once.
///
/// In both paths, `DmaSgl::from_bio_segments()` is called exactly once
/// per bio. The two call sites are mutually exclusive (scheduler attached
/// vs no scheduler).
fn bio_to_io_request(bio: &mut Bio, task: &Task) -> IoRequest {
    IoRequest {
        lba: Lba(bio.start_lba),
        // Device geometry is cached at registration time in the block device
        // wrapper struct. No vtable dispatch per bio — BlockDeviceCachedParams
        // holds immutable geometry discovered during device probe.
        len_bytes: bio.total_sectors() as u64
            * bio.bdev.cached_params.logical_block_size as u64,
        op: bio.op,
        priority: task.effective_io_priority(),
        submit_ns: clock_monotonic_ns(),
        deadline_ns: 0, // Populated by the I/O scheduler on insertion
        pid: task.pid(),
        cgroup_id: bio.cgroup_id,
        sgl: DmaSgl::from_bio_segments(&bio.segments, bio.segments_ext.as_deref()),
        // Store the bio pointer for dispatch and completion routing.
        // The bio remains alive until bio_complete() is called.
        bio: bio as *mut Bio,
    }
}
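The `len_bytes` derivation above sums the inline segments plus the optional extension list and scales by the cached logical block size. A self-contained sketch of that arithmetic (the free-standing `total_sectors` helper is illustrative; in the real code it is a `Bio` method):

```rust
// Minimal segment stand-in: only the byte length matters here.
struct BioSegment {
    len: u32,
}

// Total logical sectors covered by a bio: inline segments plus the
// optional >16-segment extension list, divided by the logical block size.
fn total_sectors(
    inline: &[BioSegment],
    ext: Option<&[BioSegment]>,
    logical_block_size: u32,
) -> u64 {
    let bytes: u64 = inline
        .iter()
        .chain(ext.unwrap_or(&[]).iter())
        .map(|s| s.len as u64)
        .sum();
    // Bios are always block-aligned; a remainder would be a layer bug.
    debug_assert_eq!(bytes % logical_block_size as u64, 0);
    bytes / logical_block_size as u64
}
```

For example, two inline 4 KB segments plus one 8 KB extension segment on a 512-byte-sector device is 16384 / 512 = 32 sectors, so `len_bytes` would be 32 * 512 = 16384.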

15.2.3.4 Bio Submission Functions

The block I/O layer provides two free functions for submitting bios. All bio submission flows through bio_submit(), which handles cgroup accounting and throttling before dispatching to the device driver.

/// Submit a bio for asynchronous processing. Returns immediately.
/// The bio's `end_io` callback is invoked when I/O finishes.
///
/// **I/O scheduler path**: If the target block device has an I/O scheduler
/// attached (`bdev.io_queues.is_some()`), the bio is converted to an
/// `IoRequest` via `bio_to_io_request()` and submitted to the scheduler's
/// per-CPU queue ([Section 15.18](#io-priority-and-scheduling)). The scheduler merges,
/// reorders, and dispatches requests to hardware queues.
///
/// **Direct dispatch path**: If the device has no I/O scheduler (e.g.,
/// NVMe devices with hardware multi-queue), the bio is dispatched directly
/// to `bdev.ops.submit_bio()`.
pub fn bio_submit(bio: &mut Bio) {
    // 1. Tag bio with originating cgroup for I/O accounting.
    //
    // **Readahead cgroup attribution**: Readahead bios are submitted by the
    // readahead engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)) which may run in
    // a kernel worker thread context (kworker), not the original task that
    // triggered the readahead. Naively using `current_task().cgroup_id()`
    // here would attribute readahead I/O to the root cgroup (the kworker's
    // cgroup), starving the triggering task's cgroup of its I/O budget and
    // allowing readahead to bypass cgroup I/O throttling entirely.
    //
    // To fix this, the readahead engine stamps `bio.cgroup_id` with the
    // triggering task's cgroup context *before* calling `bio_submit()`.
    // The readahead entry points (`page_cache_readahead()`,
    // `page_cache_async_readahead()`) capture `current_task().cgroup_id()`
    // at call time (when the triggering task is still current) and propagate
    // it through the `ReadaheadControl` struct to all bios created during
    // the readahead window. If `bio.cgroup_id` is already set (non-zero)
    // when `bio_submit()` is entered, step 1 preserves it; only bios with
    // an unset cgroup_id (zero) are tagged with `current_task().cgroup_id()`.
    if bio.cgroup_id == 0 {
        bio.cgroup_id = current_task().cgroup_id();
    }
    // 2. Check cgroup I/O throttling (Section 17.2.5).
    if let Some(throttle) = cgroup_io_throttle(bio) {
        throttle.wait_for_token(bio);  // may sleep; caller must be in process context
    }
    // 3. Resolve backing device (concrete `BlockDevice` wrapper,
    //    gives access to both `ops` vtable and `cached_params`/`io_queues`).
    let bdev = &bio.bdev;
    // 4. Set bio state to Inflight. This must happen BEFORE dispatch so
    //    that bio_complete()'s CAS from Inflight→Completing succeeds.
    //    Without this, asynchronous callers would have state = uninitialized,
    //    and bio_complete's CAS from Inflight would silently fail.
    bio.state.store(BioState::Inflight as u32, Release);
    // 5. Dispatch: scheduler path or direct.
    if let Some(ref queues) = bdev.io_queues {
        let task = current_task();
        let req = bio_to_io_request(bio, task);
        // **Hot-path allocation note**: `IoRequest` is allocated from a
        // dedicated per-CPU slab cache (`io_request_slab`), not from
        // `Arc::new()` (general heap). The slab pool is sized at boot:
        // `nr_cpus * 128` entries. `SlabArc::new(io_request_slab, req)`
        // returns an `Arc<IoRequest>` backed by the slab cache, avoiding
        // heap allocation on every I/O submission.
        let arc_req = SlabArc::new(&queues.request_slab, req);
        // See [Section 15.18](#io-priority-and-scheduling) for the `submit()` function
        // (inserts into per-CPU scheduler queue and calls kick_dispatch).
        submit(queues, arc_req, task);
    } else {
        // Direct dispatch — no scheduler.
        // **Tier 1 domain crossing**: If the block device driver is Tier 1
        // (post-boot NVMe/AHCI promotion), `bdev.ops.submit_bio()` is
        // dispatched via `kabi_call!(bdev.block_handle, submit_bio, bio)`.
        // The KABI transport serializes the Bio's relevant fields (sector,
        // op, segments) into the T1CommandEntry argument buffer. The
        // driver's consumer loop deserializes and programs DMA. Completion
        // is signaled via the driver's KABI completion ring, which the
        // Tier 0 block layer consumer converts back to `bio_complete()`.
        // Domain crossing cost: ~23-46 cycles per bio (amortized with
        // batched submission via plugging/unplug).
        // Handle EAGAIN from the driver.
        match kabi_call!(bdev.block_handle, submit_bio, bio) {
            Ok(()) => {}
            Err(e) if e == Error::AGAIN => {
                // Descriptor ring full or device temporarily unavailable.
                // Place on per-device requeue list. Bounded by
                // `MAX_REQUEUE_DEPTH` (4096). The device's completion IRQ
                // handler calls `blk_kick_requeue()` after freeing
                // descriptors, which re-submits bios in FIFO order.
                let mut requeue = bdev.requeue_list.lock();
                if requeue.len() < requeue.capacity() {
                    requeue.push(RequeueEntry {
                        bio: bio as *mut Bio,
                        generation: bio.generation,
                    });
                } else {
                    // Requeue list full — fail the bio immediately.
                    bio_complete(bio as *mut Bio, -(Error::NOSPC as i32));
                }
            }
            Err(e) => {
                // Permanent error — fail the bio.
                bio_complete(bio as *mut Bio, -(e as i32));
            }
        }
    }
}

/// Re-drain the per-device requeue list after the driver frees descriptors.
/// Called from the device's completion IRQ handler (or Tier 1 completion
/// ring consumer) after freeing DMA descriptors, to re-submit bios that
/// previously returned EAGAIN.
///
/// The requeue list is a bounded FIFO (`SpinLock<BoundedDeque<RequeueEntry, 4096>>`).
/// Bios are re-submitted in FIFO order. For each bio:
/// 1. Check bio state: if not `Inflight` (e.g., timeout handler already
///    transitioned to `TimedOut` or `Done`), skip — do not re-submit.
/// 2. Re-submit via `kabi_call!(bdev.block_handle, submit_bio, bio)`.
/// 3. On EAGAIN: push back to front of requeue list (will retry on next
///    completion). On permanent error: `bio_complete(bio, error)`.
pub fn blk_kick_requeue(bdev: &BlockDevice) {
    let mut list = bdev.requeue_list.lock();
    let mut retry_later = BoundedDeque::<RequeueEntry, 64>::new();
    while let Some(entry) = list.pop_front() {
        // BIO-09 fix: Two-phase validation before dereferencing the bio pointer.
        //
        // Phase 1: Generation check. If the slab recycled the bio's memory
        // for a new allocation, the generation counter will differ. Reading
        // the generation field is safe because the slab does not return
        // pages to the page allocator (bio_slab is PERMANENT), so the
        // memory is always mapped and readable — we just may read a
        // different bio's generation value, which will not match.
        //
        // Phase 2: State check. Even if the generation matches, the timeout
        // handler may have transitioned the bio to TimedOut/Done. A
        // load-acquire of the state confirms the bio is genuinely still
        // Inflight and waiting for re-submission.
        //
        // SAFETY: The slab memory backing entry.bio is always mapped
        // (PERMANENT slab). Reading generation and state is safe even if
        // the bio was freed and reallocated — we validate before any
        // mutation or callback invocation.
        let bio_ref = unsafe { &*entry.bio };
        // Phase 1: generation mismatch → slab recycled this slot.
        if bio_ref.generation != entry.generation {
            continue; // stale entry — skip
        }
        // Phase 2: state check — bio must still be Inflight.
        let state = bio_ref.state.load(Acquire);
        if state != BioState::Inflight as u32 {
            continue; // timeout handler already processed this bio
        }
        // SAFETY: generation matches AND state is Inflight → this is
        // the original bio, still alive and waiting for re-submission.
        let bio = unsafe { &mut *entry.bio };
        match kabi_call!(bdev.block_handle, submit_bio, bio) {
            Ok(()) => {}
            Err(e) if e == Error::AGAIN => {
                // Still busy — the device has no free descriptors, so later
                // entries would also fail. Defer this entry and stop draining;
                // the next completion IRQ retries the rest.
                retry_later.push_back(entry);
                break;
            }
            Err(e) => {
                bio_complete(entry.bio, -(e as i32));
            }
        }
    }
    // Re-insert deferred bios at the front (maintain FIFO order).
    for entry in retry_later.into_iter().rev() {
        list.push_front(entry);
    }
}
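The BIO-09 two-phase staleness check can be isolated into a self-contained sketch (the `SlotView`/`RequeueEntry` shapes and `is_live` helper are illustrative; the real check reads the generation and state fields of the PERMANENT-slab bio directly):

```rust
// View of a bio slab slot at validation time: generation counter plus
// lifecycle state (0 = Inflight, matching the documented encoding).
#[derive(Clone, Copy)]
struct SlotView {
    generation: u64,
    state: u32,
}

// Requeue entry: slot index plus the generation snapshot taken at capture.
struct RequeueEntry {
    slot: usize,
    generation: u64,
}

fn is_live(slots: &[SlotView], e: &RequeueEntry) -> bool {
    let slot = slots[e.slot];
    // Phase 1: generation must match the snapshot — otherwise the slab
    // recycled this slot for a new bio and the entry is stale.
    // Phase 2: the bio must still be Inflight — otherwise the timeout
    // handler already claimed it.
    slot.generation == e.generation && slot.state == 0
}
```

Only entries that pass both phases are dereferenced as live bios; everything else is silently skipped, which is what makes the requeue list safe against slab recycling.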

// Support items for bio_submit_and_wait() (below), which submits a bio and
// blocks until I/O completion. That function sleeps, so callers must not
// hold any spinlocks.

/// Sentinel value indicating the bio has not yet completed. Set at submission;
/// cleared by the completion handler with the actual status (0 = success,
/// negative = error).
const BIO_STATUS_PENDING: i32 = i32::MIN;

/// Default: 30 seconds. Linux has no single bio sync timeout; its
/// `BLK_DEFAULT_SG_TIMEOUT` (60s) applies to SG_IO passthrough, not
/// internal block I/O. UmkaOS uses 30 seconds for faster fault detection
/// on synchronous I/O paths (fsync, sync read). Tunable via sysctl
/// `block.sync_timeout_ms`.
const BIO_SYNC_TIMEOUT_MS: u64 = 30_000;

/// Synchronous bio submission: submits the bio and blocks until completion.
///
/// **Preferred pattern**: Uses a stack-allocated `StackBioWaiter` to avoid
/// the heap allocation of `Arc<WaitQueueHead>`. This is safe because the
/// caller's stack frame outlives the bio in this synchronous path — the
/// function blocks until completion or timeout. For high-fsync workloads
/// (databases doing thousands of fsync/sec), eliminating the ~50-100ns
/// Arc allocation per synchronous I/O is measurable.
///
/// ```
/// // StackBioWaiter: stack-allocated waiter for synchronous bio completion.
/// // The WaitQueueHead lives on the caller's stack and is borrowed by the bio.
/// // SAFETY: The caller blocks until completion, so the stack frame outlives
/// // the bio's reference to the waiter.
/// struct StackBioWaiter {
///     wq: WaitQueueHead,
/// }
/// ```
pub fn bio_submit_and_wait(bio: &mut Bio) -> Result<(), IoError> {
    // Stack-allocated waiter — no heap allocation.
    let waiter = StackBioWaiter { wq: WaitQueueHead::new() };
    // Set the synchronous completion callback and store the waiter pointer
    // in bio.private. The callback (bio_sync_end_io) reads bio.private
    // to locate the stack-allocated WaitQueueHead and wakes it.
    // SAFETY: waiter lives on this stack frame; we block below until
    // the bio completes, so the waiter outlives the bio's reference.
    bio.end_io = bio_sync_end_io;
    bio.private = &waiter.wq as *const WaitQueueHead as usize;
    bio.status.store(BIO_STATUS_PENDING, Ordering::Release);
    bio.state.store(BioState::Inflight as u32, Ordering::Release);
    bio_submit(bio);
    let completed = waiter.wq.wait_event_timeout(
        || bio.status.load(Relaxed) != BIO_STATUS_PENDING,
        BIO_SYNC_TIMEOUT_MS,
    );
    if !completed {
        // I/O did not complete within the timeout. The bio is still
        // in-flight in the device queue; it will complete eventually
        // (or be aborted by the error handler).
        //
        // CRITICAL: Atomically claim the bio via CAS(Inflight→TimedOut).
        // If we win the CAS, we own the bio — the completion handler will
        // see TimedOut (not Inflight) and bail out without touching our
        // stack-allocated waiter. If we lose the CAS, the device already
        // transitioned to Completing/Done and the waiter was already signaled.
        match bio.state.compare_exchange(
            BioState::Inflight as u32,
            BioState::TimedOut as u32,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => {
                // We won: the timeout path owns the bio. Leave the state at
                // TimedOut so the eventual device completion handler observes
                // it, skips the callback, and performs the TimedOut → Done
                // transition itself. It must never dereference our
                // stack-allocated waiter after this frame is destroyed.
                return Err(IoError::TimedOut);
            }
            Err(_) => {
                // Device completed between the timeout and our CAS.
                // The waiter was signaled; fall through to check status.
            }
        }
    }
    // Verify state reached Done (completion handler sets this).
    debug_assert!(bio.state.load(Ordering::Relaxed) == BioState::Done as u32);
    match bio.status.load(Acquire) {
        0 => Ok(()),
        err => Err(IoError::from_errno(err)),
    }
}
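The completion-side counterpart `bio_sync_end_io` is referenced above but not shown. Below is a minimal, self-contained sketch of its claim protocol; the `BioStub` stand-in, the numeric state values, and the `woken` counter (standing in for waking the stack waiter stored in `bio.private`) are illustrative assumptions, not the real definitions:

```rust
use std::sync::atomic::{AtomicI32, AtomicU32, Ordering};

// Assumed numeric encoding of BioState for this sketch.
const INFLIGHT: u32 = 1;
const TIMED_OUT: u32 = 2;
const COMPLETING: u32 = 3;
const DONE: u32 = 4;

// Minimal stand-in for the Bio fields the callback touches.
struct BioStub {
    state: AtomicU32,  // BioState
    status: AtomicI32, // BIO_STATUS_PENDING (i32::MIN) until completion
    woken: AtomicU32,  // stands in for waking the stack-allocated waiter
}

/// Completion side of the CAS protocol in `bio_submit_and_wait()`:
/// claim the bio via CAS(Inflight -> Completing). Losing the CAS means the
/// timeout path owns the bio (state == TimedOut); the submitter's stack frame
/// may already be gone, so the waiter must not be touched.
fn bio_sync_end_io(bio: &BioStub, status: i32) {
    if bio
        .state
        .compare_exchange(INFLIGHT, COMPLETING, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
    {
        bio.status.store(status, Ordering::Release); // publish the result first
        bio.state.store(DONE, Ordering::Release);    // Completing -> Done
        bio.woken.fetch_add(1, Ordering::Release);   // wake the stack waiter
    }
    // else: timed out; skip the callback entirely.
}
```

The intermediate Completing state matches the "transitioned to Completing/Done" wording in the timeout comment above: the timeout CAS can still distinguish an in-progress completion from one that never started.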

15.2.3.5 Cgroup I/O Throttling

Where throttling occurs: In bio_submit(), before dispatch to bdev.ops.submit_bio(). This ensures every bio — whether from filesystems, raw block reads, or device-mapper — passes through the cgroup I/O controller.

Throttling algorithm: Token-bucket rate limiter per device per cgroup:

pub struct IoThrottleState {
    /// Tokens available (bytes or IOPS depending on limit type).
    pub tokens: AtomicI64,
    /// Refill rate (bytes/sec or IOPS from io.max).
    pub rate: u64,
    /// Last refill timestamp (nanoseconds).
    pub last_refill_ns: AtomicU64,
    /// Wait queue for throttled bios.
    pub waiters: WaitQueueHead,
}

Integration with io.max: When io.max is set on a cgroup, an IoThrottleState is created per (cgroup, device) pair. cgroup_io_throttle() looks up the throttle state from the cgroup's IoController using the bio's cgroup_id and bdev. If the token bucket has insufficient tokens, the calling task sleeps on waiters until tokens are refilled — the refill rate is derived directly from the io.max limit values.

Bypass: If no io.max is set for the bio's cgroup (or the cgroup is the root cgroup), cgroup_io_throttle() returns None — zero overhead on the unthrottled path. No atomic operations, no lock acquisitions, no branch mispredictions on the fast path.
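The refill/consume step can be sketched as a self-contained model. The burst cap (one second of tokens), the helper names, and the nanosecond arithmetic are assumptions of this sketch; the real `cgroup_io_throttle()` additionally parks the task on `waiters` when this returns false:

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Simplified per-(cgroup, device) token bucket mirroring `IoThrottleState`.
/// Field names follow the struct above; `burst` and the refill arithmetic
/// are illustrative assumptions.
pub struct TokenBucket {
    tokens: AtomicI64,         // available tokens (bytes or IOPS)
    rate: u64,                 // refill per second, from io.max
    burst: i64,                // cap: at most one second of tokens accumulates
    last_refill_ns: AtomicU64, // timestamp of the last refill
}

impl TokenBucket {
    pub fn new(rate: u64, now_ns: u64) -> Self {
        Self {
            tokens: AtomicI64::new(rate as i64), // start full
            rate,
            burst: rate as i64,
            last_refill_ns: AtomicU64::new(now_ns),
        }
    }

    /// Refill proportionally to elapsed time, then try to take `cost` tokens.
    /// Returns false when the bio must wait (the caller would sleep on
    /// `waiters` until the next refill).
    pub fn try_consume(&self, cost: i64, now_ns: u64) -> bool {
        let last = self.last_refill_ns.swap(now_ns, Ordering::AcqRel);
        let elapsed_ns = now_ns.saturating_sub(last);
        let refill = (self.rate as u128 * elapsed_ns as u128 / 1_000_000_000) as i64;
        let mut cur = self.tokens.fetch_add(refill, Ordering::AcqRel) + refill;
        if cur > self.burst {
            // Cap the bucket (a racing consumer may briefly observe more;
            // acceptable for this sketch).
            self.tokens.fetch_sub(cur - self.burst, Ordering::AcqRel);
            cur = self.burst;
        }
        if cur >= cost {
            self.tokens.fetch_sub(cost, Ordering::AcqRel);
            true
        } else {
            false
        }
    }
}
```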

Cross-reference: See Section 17.2 for cgroup io controller configuration.

15.2.3.6 Block Device Page Cache (Buffer Cache)

Block device special files (/dev/sda, /dev/nvme0n1p1) are accessed through the page cache just like regular files. Every block device has an associated Inode (the "bdev inode") whose i_mapping (AddressSpace) caches raw block data. This is the "buffer cache" — reads from /dev/sda via read(2) or dd go through this page cache; O_DIRECT bypasses it.

bdev inode creation: When a block device is first opened, the VFS creates (or reuses) a bdev inode in the bdevfs pseudo-filesystem. The bdev inode's i_size is set to the device's capacity in bytes. Its i_mapping.ops is set to BDEV_ADDRESS_SPACE_OPS.

/// AddressSpaceOps for block device special files.
/// These translate page cache operations into raw block I/O
/// without filesystem metadata interpretation.
pub static BDEV_ADDRESS_SPACE_OPS: &dyn AddressSpaceOps = &BdevAddressSpaceOps;

struct BdevAddressSpaceOps;

impl AddressSpaceOps for BdevAddressSpaceOps {
    /// Read a single page of raw block data.
    /// Builds a Bio targeting the page's byte offset ÷ logical_block_size,
    /// submits it, and waits for completion.
    /// The caller (`filemap_get_pages`) has already allocated a page, inserted
    /// it into the page cache XArray via `try_store`, and set `PageFlags::LOCKED`.
    /// This function fills the page with data from the block device. It must NOT
    /// allocate a new page or overwrite the XArray slot — doing so would orphan
    /// the original locked page and deadlock concurrent waiters.
    fn read_page(
        &self,
        mapping: &AddressSpace,
        index: u64,
        page: &Arc<Page>,
    ) -> Result<(), IoError> {
        let bdev = bdev_from_inode(mapping.host())?;
        let lba = (index * PAGE_SIZE as u64)
                  / bdev.cached_params.logical_block_size as u64;
        let mut bio = Bio::new_read(bdev, lba, page);
        bio_submit_and_wait(&mut bio)?;
        Ok(())
    }

    /// Write a dirty page of raw block data back to the device.
    /// For async writeback (sync_mode == Background), submits the bio and returns
    /// immediately — errors are delivered via the bio completion callback
    /// which sets `AS_EIO`/`AS_ENOSPC` on the address space. For sync
    /// writeback, uses `bio_submit_and_wait()` to block until completion.
    fn writepage(
        &self,
        mapping: &AddressSpace,
        page: &Page,
        wbc: &WritebackControl,
    ) -> Result<(), IoError> {
        let bdev = bdev_from_inode(mapping.host())?;
        // page.index is shorthand for Page::index_or_freelist (the page-cache
        // file offset in page-sized units). Only valid for page-cache pages
        // (not slab pages, where this field is unused). See Page struct in
        // [Section 4.3](04-memory.md#slab-allocator--page-frame-descriptor).
        let lba = (page.index as u64 * PAGE_SIZE as u64)
                  / bdev.cached_params.logical_block_size as u64;
        let mut bio = Bio::new_write(bdev, lba, page);
        // WritebackSyncMode defined in [Section 4.6](04-memory.md#writeback-subsystem--writebacksyncmode).
        if wbc.sync_mode == WritebackSyncMode::Background {
            // Async writeback: fire-and-forget. Errors are reported via
            // the bio completion callback → address_space error flags.
            bio.flags |= BioFlags::ASYNC;
            bio_submit(&mut bio);
            Ok(())
        } else {
            // Sync writeback (fsync path): block until I/O completes.
            bio_submit_and_wait(&mut bio)
        }
    }

    /// No releasepage needed for raw block devices.
    fn releasepage(&self, _page: &Page) -> bool { true }
}

Data flow for raw block reads (e.g., dd if=/dev/sda bs=4096 count=1):

read(fd, buf, 4096)
  → vfs_read()
  → generic_file_read_iter()              // same as regular files
  → filemap_get_pages(mapping, pgoff=0)   // check page cache
  → [cache miss] → BdevAddressSpaceOps::read_page()
    → Bio { op: Read, start_lba: 0, segments: [page] }
    → bio_submit() → BlockDeviceOps::submit_bio()
    → [I/O completion] → page installed in cache
  → copy_to_user(buf, page_data, 4096)

Subsequent reads of the same block hit the page cache directly (no I/O). The bdev page cache is invalidated by blkdev_invalidate_pages() when a partition is re-read or a device is closed with exclusive access.

15.2.3.7 Writeback I/O Completion Callback

Async writeback bios submitted by BdevAddressSpaceOps::writepage() (and filesystem writepage implementations) use a dedicated completion callback to update page cache state and propagate errors to fsync() waiters.

/// Writeback I/O completion callback. Called from the `blk-io` workqueue
/// (deferred from interrupt/softirq context — see Bio Completion Callback
/// Constraints above) when an async writeback bio completes.
///
/// This function bridges the block I/O completion path and the page cache
/// error reporting path. It is the sole point where writeback errors are
/// recorded on the `AddressSpace` — all writeback paths (bdev, ext4, XFS,
/// Btrfs) use this callback or a filesystem-specific variant that calls
/// the same `wb_err` update logic.
///
/// Handles the page cache updates after writeback I/O completes: clears the
/// WRITEBACK/DIRTY flags, decrements `nr_dirty`, and records errors on the
/// AddressSpace via ErrSeq.
///
/// **IRQ-safety**: This function is NOT called directly from
/// `bio_complete()`. Instead, `writeback_end_io_deferred` (the `end_io`
/// callback set by the writeback path) schedules this function on the
/// `blk-io` workqueue. This allows it to perform page cache
/// operations (xa_lock, wait queue wake) that are forbidden in IRQ context.
///
/// **Counter ownership**: This function is the SOLE owner of the
/// DIRTY→clean transition and `nr_dirty` decrement for Tier 0 (in-kernel)
/// filesystems. For Tier 1 filesystems, the Tier 0 `WritebackResponse`
/// handler (step 11 in [Section 4.6](04-memory.md#writeback-subsystem)) owns the transition
/// instead — `writeback_end_io` is NOT called for Tier 1 writeback bios.
fn writeback_end_io(bio: &mut Bio, status: i32) {
    let errno = status;

    // Iterate ALL segments in the bio. A single writeback bio may span multiple
    // pages (bio_add_page() coalesces contiguous pages). Processing only
    // segments[0] would silently drop error/completion handling for all
    // subsequent pages, causing writeback hangs (tasks blocked on
    // wait_on_page_writeback() for pages whose WRITEBACK flag is never cleared)
    // and silent data loss (pages left in WRITEBACK state, never re-dirtied on error).
    // Capture the mapping reference ONCE before the per-page loop. After the
    // loop clears WRITEBACK, a concurrent truncate_inode_pages() can null out
    // page.mapping — so post-loop access to page.mapping() is unsafe. This
    // capture is safe because WRITEBACK is still set (pinning the mapping).
    // SAFETY: bio.segments is non-empty (guaranteed by bio_submit validation).
    let mapping_for_error = unsafe { (*bio.segments[0].page).mapping() };

    for seg in &bio.segments {
        // SAFETY: page validity is guaranteed by page_get() in bio_add_page().
        // The page is pinned for the bio's lifetime (page_put in bio_endio).
        let page = unsafe { &*seg.page };

        // Track whether this page should be re-dirtied for retry.
        let mut should_redirty = false;

        if errno == 0 {
            // Success path: page is now clean on disk.
            // **Counter maintenance**: decrement nr_dirty (balancing the increment
            // in __set_page_dirty). If this decrement is omitted,
            // balance_dirty_pages() will eventually throttle all writes to zero.
            //
            // **page.mapping()** resolves the `AtomicPtr<u8>` in `Page.mapping`
            // to `&AddressSpace`:
            // ```rust
            // impl Page {
            //     /// Resolve the mapping pointer to a typed AddressSpace reference.
            //     /// SAFETY: The caller must ensure the page is still attached to
            //     /// an AddressSpace (i.e., not truncated). For pages in writeback,
            //     /// this is guaranteed: the WRITEBACK flag pins the mapping.
            //     pub unsafe fn mapping(&self) -> &AddressSpace {
            //         &*(self.mapping.load(Acquire) as *const AddressSpace)
            //     }
            //     /// Wake tasks waiting for this page's writeback/lock to complete.
            //     pub fn wake_waiters(&self) {
            //         self.waiters.wake_up_all();
            //     }
            // }
            // ```
            page.flags.fetch_and(!PageFlags::DIRTY, Release);
            // SAFETY: page is in writeback — mapping is pinned.
            unsafe { page.mapping() }.page_cache.as_ref().unwrap().nr_dirty.fetch_sub(1, Relaxed);
        } else {
            // Error path: mark page for retry and record error on AddressSpace.
            page.flags.fetch_or(PageFlags::ERROR, Release);
            // Check consecutive failure count. After 3 failures, mark the page
            // PERMANENT_ERROR and exclude from future writeback (matching the
            // writeback subsystem's retry policy in [Section 4.6](04-memory.md#writeback-subsystem)).
            let fail_count = page.wb_fail_count.fetch_add(1, Relaxed) + 1;
            if fail_count >= 3 {
                page.flags.fetch_or(PageFlags::PERMANENT_ERROR, Release);
                // Page excluded from writeback. fsync() returns -EIO.
                // FMA event emitted by the writeback subsystem.
            } else {
                should_redirty = true;
            }
            // Record the error on the AddressSpace for fsync() reporting
            // via ErrSeq (errseq_t pattern). set_err() atomically increments
            // the sequence counter and stores the errno. Concurrent fsync()
            // callers on different fds each observe the error exactly once.
            // The error is recorded once per AddressSpace (not once per page) —
            // set_err() is idempotent within the same generation.
            // SAFETY: page is in writeback — mapping is pinned.
            let mapping = unsafe { page.mapping() };
            mapping.wb_err.set_err(errno as i32);
            // Set legacy AS_EIO / AS_ENOSPC flags on the AddressSpace for
            // backward compatibility with callers that check flags directly
            // (older filesystems, memory-mapped I/O error detection). These
            // flags complement the ErrSeq mechanism — both must be set.
            if errno == -(ENOSPC as i32) {
                mapping.flags.fetch_or(AS_ENOSPC, Release);
            } else {
                mapping.flags.fetch_or(AS_EIO, Release);
            }

            // Re-dirty the page so the writeback subsystem will retry it on
            // the next writeback cycle (only for non-permanent errors).
            if should_redirty {
                page.flags.fetch_or(PageFlags::DIRTY, Release);
                // SAFETY: page is in writeback — mapping is pinned.
                unsafe { page.mapping() }.page_cache.as_ref().unwrap().nr_dirty.fetch_add(1, Relaxed);
            }
        }

        // Unified completion path for ALL outcomes (success, retryable error,
        // permanent error). Order matters: decrement nrwriteback FIRST, then
        // clear WRITEBACK flag. If we cleared WRITEBACK first, a concurrent
        // fsync() could observe WRITEBACK cleared (page "done") while
        // nrwriteback still counts it, leading to stale count or missed waiters.
        // SAFETY: page is in writeback — mapping is pinned.
        unsafe { page.mapping() }.nrwriteback.fetch_sub(1, Release);
        page.flags.fetch_and(!PageFlags::WRITEBACK, Release);

        // Wake any tasks blocked in fsync() or sync_page() waiting for this
        // page's writeback to complete. The waiters check wb_err after waking
        // to detect and propagate errors.
        page.wake_waiters();
    }

    // Check filesystem error mode ONCE per bio (not per page). Multiple pages
    // in the same bio share the same AddressSpace and superblock. The error mode
    // action (continue/remount-ro/panic) is a per-filesystem decision, not per-page.
    //
    // IMPORTANT: `mapping_for_error` was captured BEFORE the per-page loop
    // cleared WRITEBACK flags. After WRITEBACK is cleared, a concurrent
    // truncate_inode_pages() could remove the page from the page cache and
    // null out page.mapping — making a post-loop page.mapping() dereference
    // unsafe (null pointer / use-after-free). The pre-loop capture avoids
    // this race.
    if errno != 0 {
        check_fs_error_mode(mapping_for_error.host().superblock());
    }

    // Dirty extent protocol completion: if this page was part of a dirty
    // extent reservation ([Section 14.1](14-vfs.md#virtual-filesystem-layer--copy-on-write-and-redirect-on-write-infrastructure)),
    // the filesystem's own completion callback (registered in bio.private)
    // calls vfs_flush_extent_complete(token) AFTER this function returns.
    // writeback_end_io() handles only the page cache and errseq_t updates;
    // the filesystem completion callback handles journal commit, extent tree
    // updates, and dirty extent token release. For bdev (raw block) I/O,
    // no dirty extent token exists — this step is a no-op.
}

/// Check the filesystem's error-handling policy and take the appropriate
/// action for a writeback I/O error. Called from `writeback_end_io()` after
/// the error has been recorded on the `AddressSpace` (wb_err, AS_EIO/AS_ENOSPC).
///
/// The superblock's `s_error_behavior` field (set at mount time via the `errors=`
/// mount option) determines the response:
///
/// - `FsErrorMode::Continue` — log the error via FMA
///   ([Section 20.1](20-observability.md#fault-management-architecture)), continue operation. The re-dirtied
///   page will be retried on the next writeback cycle.
/// - `FsErrorMode::RemountRo` — set `SB_RDONLY` on the superblock, log via FMA.
///   Subsequent write operations return `EROFS`. Read operations continue.
/// - `FsErrorMode::Panic` — kernel panic. Used by filesystems where data
///   integrity is critical and unrecoverable corruption is worse than downtime
///   (e.g., ext4 with `errors=panic`, XFS default on metadata error).
///
/// This function does not return a value — in the `Panic` case, it does not
/// return at all. In `Continue` and `RemountRo` cases, `writeback_end_io()`
/// proceeds to wake waiters.
fn check_fs_error_mode(sb: &SuperBlock) {
    match sb.s_error_behavior {
        FsErrorMode::Continue => {
            fma_report(sb.device_handle, HealthEventClass::Storage,
                       FMA_WRITEBACK_IO_ERROR, HealthSeverity::Warning, &[]);
        }
        FsErrorMode::RemountRo => {
            sb.flags.fetch_or(SB_RDONLY, Release);
            fma_report(sb.device_handle, HealthEventClass::Storage,
                       FMA_WRITEBACK_IO_ERROR_REMOUNT_RO, HealthSeverity::Major, &[]);
        }
        FsErrorMode::Panic => {
            panic!("writeback I/O error on {:?} with errors=panic", sb.dev_name);
        }
    }
}

Callback registration: The writeback path sets bio.end_io = writeback_end_io_deferred before calling bio_submit(). The writeback_end_io_deferred callback enqueues a workqueue item that calls writeback_end_io() on the blk-io workqueue in process context, not in IRQ/softirq context — this is required because the function performs page cache operations (xa_lock, wait queue wake, nr_dirty decrement) that are forbidden under IRQ-disabled spinlocks.

For filesystem-specific writeback (ext4 journal, XFS log), the filesystem provides its own end_io callback that schedules filesystem-specific deferred work (e.g., updating journal state) and then calls writeback_end_io() for the common page cache update logic.

Error propagation to userspace: When fsync(fd) is called, the VFS reads the file's AddressSpace.wb_err and compares it against the fd's file.f_wb_err (stamped at open() time). If the generation has advanced with a non-zero errno, fsync() returns that errno to the caller and updates file.f_wb_err to the current generation (so the error is reported exactly once per fd, matching Linux errseq_t semantics).
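The exactly-once-per-fd check can be modeled with a small self-contained sketch. This `ErrSeq` is a simplified stand-in: real errseq_t also packs a "seen" bit and only advances its counter once the value has been observed, whereas this sketch advances the generation on every recorded error:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified error-sequence counter. The packing (generation in the high
/// 32 bits, |errno| in the low 32) is an assumption of this sketch.
pub struct ErrSeq(AtomicU64);

impl ErrSeq {
    pub const fn new() -> Self {
        ErrSeq(AtomicU64::new(0))
    }

    /// Record a writeback error: bump the generation and store the errno.
    /// A lost CAS race means a concurrent error already advanced the
    /// generation, which is fine for this sketch.
    pub fn set_err(&self, errno: i32) {
        let cur = self.0.load(Ordering::Acquire);
        let next = (((cur >> 32) + 1) << 32) | errno.unsigned_abs() as u64;
        let _ = self
            .0
            .compare_exchange(cur, next, Ordering::AcqRel, Ordering::Acquire);
    }

    /// fsync() path: `since` is the fd's stamped value (file.f_wb_err).
    /// Returns the errno at most once per fd per generation, then advances
    /// the stamp.
    pub fn check_and_advance(&self, since: &mut u64) -> i32 {
        let cur = self.0.load(Ordering::Acquire);
        if cur >> 32 != *since >> 32 {
            *since = cur; // this fd has now seen the error
            -((cur & 0xffff_ffff) as i32)
        } else {
            0 // no new error since this fd's last check
        }
    }
}
```

Two fds checking the same AddressSpace each observe the error once; a second check on the same fd returns 0.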

Cross-references:
- AddressSpace and AddressSpaceOps: Section 14.1
- Bio and BlockDeviceOps: Section 15.3.1 (above)
- Page cache dirty tracking: Section 4.2
- Writeback subsystem and dirty page lifecycle: Section 4.6

15.2.4 Device-Mapper and Volume Management

Device-mapper framework — UmkaOS implements a device-mapper layer in umka-block with standard targets:

| Target | Description | Linux equivalent |
|---|---|---|
| dm-linear | Simple linear mapping | dm-linear |
| dm-striped | Stripe across N devices | dm-stripe |
| dm-mirror | Synchronous mirror (RAID-1) | dm-mirror |
| dm-crypt | Transparent encryption (AES-XTS) | dm-crypt |
| dm-verity | Read-only integrity verification | dm-verity |
| dm-snapshot | Copy-on-write snapshots | dm-snapshot |
| dm-thin | Thin provisioning with overcommit | dm-thin-pool |

LVM2 metadata compatibility — UmkaOS reads the LVM2 on-disk metadata format (PV headers, VG descriptors, LV segment maps) and constructs logical volumes using device-mapper targets. Existing LVM2 volume groups created under Linux are usable without conversion. LVM2 userspace tools (lvm, pvs, vgs, lvs) work unmodified via the standard device-mapper ioctl interface.

Software RAID — RAID levels 0/1/5/6/10 are implemented as device-mapper targets. MD superblock formats (0.90, 1.0, 1.2) are read for compatibility with existing Linux mdadm arrays. mdadm works unmodified. The RAID5/6 write hole is closed by the stripe log mechanism (Section 15.2.5) — auto-enabled on import for md 1.1/1.2 arrays.

Recovery-aware volume layer — This is where UmkaOS diverges meaningfully from Linux. Block device temporary disappearance during Tier 1 driver reload (~50-150ms) does NOT mark the device as failed:

Volume Layer State Machine:
  DEVICE_ACTIVE       → Normal I/O flow
  DEVICE_RECOVERING   → Driver reload in progress, I/O queued
  DEVICE_FAILED       → Device permanently gone, failover/degrade

Transition rules:
  ACTIVE → RECOVERING:  When driver supervisor signals reload start
  RECOVERING → ACTIVE:  When new driver instance signals ready (typical: <100ms)
  RECOVERING → FAILED:  When recovery timeout expires (default: 5 seconds)
  • During DEVICE_RECOVERING, the volume layer pauses I/O in its ring buffer. No requests are failed; they simply wait.
  • RAID resync is NOT triggered for sub-100ms driver reloads — the array stays clean. The volume layer distinguishes "device temporarily gone for driver reload" from "device removed from bus" by checking the driver supervisor state.
  • If the recovery window exceeds the configurable timeout (default 5s), the device transitions to DEVICE_FAILED and normal degraded-mode behavior applies (RAID rebuilds, error returns for non-redundant volumes).
  • dm-verity for verified boot is already designed (Section 9.3).

DmTarget trait — Every device-mapper target must implement this trait. All methods are called with preemption disabled; no sleeping is permitted.

/// Error conditions for device-mapper target operations.
pub enum DmError {
    /// Underlying block device returned an I/O error.
    IoError,
    /// Target received a bio with an invalid sector range (outside target bounds).
    InvalidMapping,
    /// Target-specific error (e.g., integrity check failed for dm-verity).
    TargetError,
    /// Target is suspended and cannot process bios.
    Suspended,
    /// No space available on the underlying device (thin provisioning exhausted).
    NoSpace,
    /// Underlying device is busy (e.g., being removed).
    DeviceBusy,
}

/// Result of mapping a bio to an underlying device.
pub enum DmMapResult {
    /// Bio submitted to the underlying device; device-mapper is done.
    Submitted,
    /// Bio remapped in place (`bio.dev` and `bio.sector` updated); caller submits.
    Remapped,
    /// Bio must be requeued (target is suspending or temporarily unavailable).
    Requeue,
    /// Bio failed.
    Error(DmError),
}

pub enum DmStatusType {
    /// Return human-readable target status (I/O counts, health).
    Status,
    /// Return the target table string (as loaded by `dmsetup`).
    Table,
}

/// Output buffer for `DmTarget::status()`.
pub struct DmStatusBuf<'a> {
    pub buf: &'a mut [u8],
    pub len: usize,  // bytes written so far
}

impl<'a> DmStatusBuf<'a> {
    /// Append formatted text; output is truncated once `buf` is full.
    pub fn write_fmt(&mut self, args: core::fmt::Arguments<'_>) { /* body elided */ }
    /// Append a raw string; same truncation behavior.
    pub fn write_str(&mut self, s: &str) { /* body elided */ }
}

/// Core trait every device-mapper target must implement.
pub trait DmTarget: Send + Sync {
    /// Map a bio to the underlying device(s). May update `bio.dev` and `bio.sector`.
    fn map(&self, bio: &mut Bio) -> DmMapResult;

    /// Write human-readable status or table string into `result`.
    /// Used for `/sys/block/dmN/dm/name` and the `DM_TABLE_STATUS` ioctl.
    fn status(&self, type_: DmStatusType, result: &mut DmStatusBuf<'_>) -> Result<(), DmError>;

    /// Iterate over all constituent block devices, calling `cb` for each with the
    /// (device, start_sector, length_sectors) tuple. Used by sysfs topology, iostat,
    /// and blk-integrity propagation.
    fn iterate_devices(
        &self,
        cb: &mut dyn FnMut(&BlockDevice, u64, u64) -> i32, // (dev, start_sector, len_sectors)
    ) -> i32;

    /// Handle a device-mapper message (from the `DM_MESSAGE` ioctl). Optional.
    /// **Return convention**: 0 on success, negative errno on failure (e.g.,
    /// `-libc::EINVAL` for unrecognized message, `-libc::EOPNOTSUPP` if the
    /// target does not support messages). Matches Linux `dm_target_type::message`.
    fn message(&self, _argc: u32, _argv: &[&str], _result: &mut DmStatusBuf<'_>) -> i32 { -libc::EOPNOTSUPP }

    /// Called before target is suspended (e.g., for live resize or snapshot).
    fn presuspend(&self) {}
    fn postsuspend(&self) {}

    /// Called when target resumes after suspension.
    fn resume(&self) {}

    /// Target type name (e.g., `"linear"`, `"crypt"`, `"verity"`).
    fn name(&self) -> &'static str;

    /// Target version tuple for `DM_LIST_VERSIONS`. Follows semver (major, minor, patch).
    fn version(&self) -> (u32, u32, u32);
}

/// Registration record. Each target type registers at boot via `dm_register_target()`.
pub struct DmTargetType {
    pub name:    &'static str,
    pub version: (u32, u32, u32),
    /// Create a new target instance from a device-mapper table entry.
    pub create:  fn(
        ti:   &DmTableInfo,
        argc: u32,
        argv: &[&str],
    ) -> Result<Arc<dyn DmTarget>, DmError>,
}
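As a usage illustration of the `map()` contract, here is a minimal dm-linear-style target. `BioStub` and `MapResult` are simplified stand-ins modeling only the fields `map()` touches; the real target returns `DmMapResult::Remapped`, after which device-mapper resubmits the bio:

```rust
/// Stand-in types for this sketch; the real `Bio` and `DmMapResult` are
/// defined earlier in this chapter.
pub struct BioStub {
    pub dev: u32,    // target block device id
    pub sector: u64, // sector within the dm device
}

#[derive(Debug, PartialEq)]
pub enum MapResult {
    Remapped, // bio.dev/bio.sector updated; caller resubmits
    Error,    // outside target bounds (InvalidMapping)
}

/// dm-linear: shift every bio by `start` sectors onto the underlying device.
pub struct DmLinear {
    pub dev: u32,   // underlying device
    pub start: u64, // first sector of the mapping on `dev`
    pub len: u64,   // length of the target in sectors
}

impl DmLinear {
    pub fn map(&self, bio: &mut BioStub) -> MapResult {
        if bio.sector >= self.len {
            return MapResult::Error; // bio falls outside this target
        }
        bio.dev = self.dev;
        bio.sector += self.start; // remap in place
        MapResult::Remapped
    }
}
```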

15.2.5 RAID Write Hole Mitigation

The RAID5/6 write hole is a fundamental problem: updating data and parity chunks in a stripe requires multiple disk writes. Power failure between writes leaves the stripe inconsistent — parity doesn't match data. On rebuild, the wrong data is reconstructed from the mismatched parity. Silent data corruption.

Linux's approaches:

- Write-intent bitmap: knows which stripes are dirty but not which chunks completed — insufficient for reconstruction.
- PPL (Partial Parity Log): stores parity diffs in the metadata region — 30-40% write overhead, RAID5 only, max 64 disks.
- Journal device: full stripe journal on a separate device — effective but requires extra hardware.

UmkaOS provides two solutions depending on whether the array was created by UmkaOS or imported from Linux.

15.2.5.1 New UmkaOS Arrays: Inline Per-Chunk Metadata

New RAID5/6 arrays created by UmkaOS use a native format with per-chunk metadata:

/// Stored at the start of each chunk in UmkaOS-native RAID arrays.
/// Cost: 16 bytes per chunk (0.02% for 64 KB chunks — negligible).
/// On-disk format: all multi-byte fields use Le types to ensure disks are
/// portable across architectures (PPC32/s390x big-endian ↔ x86-64 little-endian).
/// Le* types defined in [Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types).
#[repr(C)]
pub struct ChunkMeta {
    /// Stripe write sequence number. Incremented on every stripe update.
    /// All chunks in a consistent stripe have the same seq value.
    pub seq: Le64,
    /// CRC32C of the chunk data (excluding this header). Hardware-accelerated
    /// on all 8 architectures (x86 SSE4.2, ARM CRC32, RISC-V Zbc, PPC vpmsum, s390x KIMD, LoongArch CRC32).
    pub checksum: Le32,
    pub _reserved: Le32,
}
// On-disk format: seq(8) + checksum(4) + _reserved(4) = 16 bytes.
const_assert!(core::mem::size_of::<ChunkMeta>() == 16);

Write path (zero extra I/O):

1. Read old data + old parity (standard RAID5 read-modify-write).
2. Compute new parity.
3. Write all modified chunks with seq = old_seq + 1 and updated CRC32C.
4. No journal. No extra I/O. Just 16 extra bytes per chunk write.

Recovery after crash (scans dirty stripes from the write-intent bitmap):

1. For each dirty stripe: read all chunk seq values and checksum fields.
2. All chunks have the same seq AND all checksums valid → stripe is consistent.
3. Mixed seq values → partial write detected:
   - Chunks with the lower seq (old) form a mutually consistent set.
   - Recompute parity from the old-seq chunks. The partial write is rolled back.
   - Chunks with the higher seq whose checksum does NOT match their data (seq written, data partially written) are also detected and treated as old.
4. The in-flight write is lost, but the filesystem journal replays the logical operation. The application sees "write completed" or "write didn't happen" — never corruption.
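The per-stripe classification in the recovery steps above can be sketched as a self-contained function; the tuple representation of per-chunk metadata is an assumption for illustration:

```rust
/// Classify one dirty stripe from its per-chunk metadata.
/// `chunks[i]` = (seq, checksum_ok) for chunk i, as read during recovery.
/// Returns None if the stripe is consistent, or Some(indices) of the chunks
/// to treat as "old" (parity is then recomputed from that set, rolling back
/// the partial write).
fn classify_stripe(chunks: &[(u64, bool)]) -> Option<Vec<usize>> {
    let max_seq = chunks.iter().map(|c| c.0).max()?; // None for an empty stripe
    if chunks.iter().all(|&(seq, ok)| seq == max_seq && ok) {
        return None; // same seq everywhere and all checksums valid: consistent
    }
    // Old set: lower-seq chunks, plus higher-seq chunks whose checksum fails
    // (seq written but data torn): both carry pre-update or unusable content.
    Some(
        chunks
            .iter()
            .enumerate()
            .filter(|(_, c)| c.0 < max_seq || !c.1)
            .map(|(i, _)| i)
            .collect(),
    )
}
```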

Performance cost: effectively zero. 16 bytes per 64 KB+ chunk. CRC32C: well under a microsecond per 4 KB on hardware-accelerated platforms. No journal device. No extra fsync.

Trade-off: the in-flight write is rolled back (not preserved). For workloads that need the write to survive power failure, combine with a journal device (see tiered approach below).

On-disk format: UmkaOS-native arrays are not mountable by Linux md. Chunk data starts at offset 16 instead of offset 0. This is a deliberate design choice for new arrays — same situation as Btrfs RAID or ZFS (different format, stronger guarantees).

15.2.5.2 Imported Linux Arrays: Auto-Enabled Batched Stripe Log

Existing Linux md arrays must be usable read-write without offline conversion. UmkaOS auto-enables a batched stripe log in the existing metadata region of each member drive.

Compatibility assessment on import:

| md superblock version | Metadata region | Auto-enable stripe log? |
|---|---|---|
| 1.1 (most common) | ~1020 KB between superblock and data_offset (typically 1 MiB) | Yes — sufficient for in-flight stripe log |
| 1.2 | ~4 KB between superblock and data_offset (typically 4 KB offset) | Depends on data_offset; if ≥64 KB free, yes |
| 1.0 (superblock at end) | Space between data end and superblock | Depends — check actual free space |
| 0.90 (legacy) | 64 KB reserved at end | No — too small; FMA warning issued |

Stripe log format (stored in metadata region, circular buffer):

/// Header for the stripe log region on each member drive.
/// Placed at a fixed offset in the metadata region (after md superblock + bitmap).
/// On-disk format: all multi-byte fields use Le types to ensure disks are
/// portable across architectures. Le* types defined in
/// [Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types).
#[repr(C)]
pub struct StripeLogHeader {
    /// Magic: 0x554D_534C ("UMSL" = UmkaOS Stripe Log).
    pub magic: Le32,
    /// Log version (currently 1).
    pub version: Le32,
    /// Usable log region size in bytes (metadata_region_free - sizeof(StripeLogHeader)).
    /// **Bounded**: metadata region is at most a few MB; u32 (4 GB) is sufficient.
    pub log_size: Le32,
    /// Current write position in the circular log (byte offset from log start).
    /// **Bounded**: wraps within log_size (circular buffer); always < log_size.
    /// **Wrap semantics**: when a batch would straddle the end of the circular
    /// log (`write_pos + batch_size > log_size`), the remaining bytes are
    /// zero-filled and `write_pos` is reset to 0. The batch is written at
    /// offset 0. Recovery skips zero-filled tail regions by checking for a
    /// valid `StripeLogEntry` header (non-zero `stripe_id`).
    pub write_pos: Le32,
    /// Sequence number of the last flushed batch.
    /// u64: monotonic, never wraps in practice (at 1M flushes/sec, lasts 584K years).
    /// **Invariant**: `flush_seq` is always equal to the highest `batch_seq` among
    /// all log entries that have been durably written. During recovery, entries with
    /// `batch_seq > flush_seq` are considered incomplete (in-flight at crash time)
    /// and are replayed. Entries with `batch_seq <= flush_seq` are confirmed durable.
    pub flush_seq: Le64,
}
// On-disk format: magic(4)+version(4)+log_size(4)+write_pos(4)+flush_seq(8) = 24 bytes.
const_assert!(core::mem::size_of::<StripeLogHeader>() == 24);
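The wrap rule documented on `write_pos` (a batch that would straddle the end of the circular log is written at offset 0 instead) reduces to a small pure function. A minimal sketch with a hypothetical helper name:

```rust
/// Sketch of the circular-log placement rule: if `write_pos + batch_size`
/// would exceed `log_size`, the tail is zero-filled and the batch is
/// written at offset 0; otherwise it is written at `write_pos`.
fn next_batch_offset(write_pos: u32, batch_size: u32, log_size: u32) -> u32 {
    assert!(batch_size <= log_size, "batch must fit in the log region");
    if write_pos + batch_size > log_size {
        0 // wrap: recovery skips the zero-filled tail
    } else {
        write_pos
    }
}
```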

/// One entry in the stripe log. Records the parity diff for a single stripe write.
/// The log stores parity diffs (not full chunk data) to fit within the ~1 MB region.
/// On-disk format: all multi-byte fields use Le types to ensure disks are
/// portable across architectures. Le* types defined in
/// [Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types).
#[repr(C)]
pub struct StripeLogEntry {
    /// Stripe number: 1-based (physical stripe number + 1).
    /// **Convention**: stripe_id 0 is reserved as the empty sentinel. The first
    /// physical stripe of the array uses stripe_id=1. Recovery uses non-zero
    /// stripe_id to distinguish valid log entries from zero-filled (empty) slots.
    pub stripe_id: Le64,
    /// Sequence number for this batch. All entries in the same batch share the
    /// same `batch_seq` value. Monotonically increasing across batches.
    pub batch_seq: Le64,
    /// CRC32C of old parity XOR new parity (the parity diff).
    pub parity_diff_checksum: Le32,
    /// Length of the parity diff data following this header.
    pub parity_diff_len: Le32,
    // Followed by `parity_diff_len` bytes of (old_parity XOR new_parity).
}
// On-disk format: stripe_id(8)+batch_seq(8)+parity_diff_checksum(4)+parity_diff_len(4) = 24 bytes.
const_assert!(core::mem::size_of::<StripeLogEntry>() == 24);

Batched write path (key improvement over Linux PPL):

Linux PPL flushes one parity diff per stripe write — serializing every write through the log. This causes 30-40% write overhead. UmkaOS batches:

  1. Accumulate N stripe writes in RAM (default batch size: 16, configurable via /sys/block/mdN/md/stripe_log_batch).
  2. Flush one batched log entry covering all N stripes to the metadata region (single sequential write, ~16 × parity_diff_size ≈ 16-64 KB).
  3. Issue all N stripe writes (data + parity) in parallel.
  4. On completion of all N stripe writes: advance StripeLogHeader.write_pos. Old log entries are now reclaimable.

Overhead: ~5-8% write throughput reduction (amortized over batch). Compared to Linux PPL's 30-40%, this is a 4-6x improvement.

Batch flush triggers (whichever comes first):

- Batch reaches stripe_log_batch entries (default 16).
- Timer expires: stripe_log_flush_ms (default 5ms — matches ext4/XFS commit interval).
- fsync() from userspace: immediate flush of current batch.
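The three triggers compose into a single predicate checked on each stripe write and timer tick. A sketch under assumed names (the real driver tracks the timer internally):

```rust
/// Sketch of the batch-flush decision: flush when the batch is full,
/// the flush timer has expired, or userspace requested fsync().
fn should_flush(
    pending: usize,        // stripe writes accumulated in RAM
    batch: usize,          // stripe_log_batch (default 16)
    elapsed_ms: u64,       // time since first pending entry
    flush_ms: u64,         // stripe_log_flush_ms (default 5)
    fsync_requested: bool, // immediate flush on fsync()
) -> bool {
    pending >= batch || elapsed_ms >= flush_ms || fsync_requested
}
```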

Recovery after crash:

1. Read StripeLogHeader from each member drive.
2. Scan log entries from write_pos backward to find the last complete batch (matching batch_seq on all members, valid CRC32C on parity diffs).
3. For each logged stripe: re-apply the parity diff to reconstruct correct parity.
4. Stripes NOT in the log were not in-flight — they are already consistent.
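The selection rule follows directly from the StripeLogHeader invariant: entries with `batch_seq > flush_seq` were in flight at crash time, and `stripe_id == 0` marks a zero-filled (empty) slot. A minimal sketch with hypothetical names:

```rust
/// Simplified view of one decoded log entry (on-disk it is StripeLogEntry).
struct LogEntry {
    stripe_id: u64, // 0 = empty sentinel (zero-filled slot)
    batch_seq: u64, // all entries in one batch share this value
}

/// Sketch of replay selection: return the stripe IDs whose parity diffs
/// must be re-applied after an unclean shutdown.
fn entries_to_replay(entries: &[LogEntry], flush_seq: u64) -> Vec<u64> {
    entries
        .iter()
        .filter(|e| e.stripe_id != 0 && e.batch_seq > flush_seq)
        .map(|e| e.stripe_id)
        .collect()
}
```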

Tiered stripe log with optional journal device:

For workloads requiring even lower overhead or guaranteed in-flight write survival:

| Tier | Log location | Overhead | In-flight writes survive? |
|---|---|---|---|
| Auto (default) | Metadata region on member drives | ~5-8% | Yes (parity diffs logged) |
| PMEM journal | PMEM/NVDIMM device | ~0% (PMEM latency ≈ 100 ns) | Yes |
| NVMe journal | Dedicated NVMe partition | ~2-3% (fast sequential writes) | Yes |
| None (legacy compat) | Disabled (write-intent bitmap only) | 0% | No — same risk as Linux |

The "none" tier is available for users who explicitly accept the write hole risk (e.g., arrays protected by UPS + filesystem journal, where the combined failure probability is acceptable). Configured via: echo none > /sys/block/mdN/md/stripe_log_policy

15.2.5.3 dm-raid and LVM Integration

dm-raid uses the same stripe mechanism as md — the batched stripe log applies identically. When LVM creates a RAID logical volume via dm-raid, the stripe log is auto-enabled using the same metadata region policy.

For dm-thin (thin provisioning): the write hole is metadata consistency (thin pool superblock + space maps), not data stripe consistency. dm-thin already uses a metadata journal (two-copy metadata with atomic swap). UmkaOS preserves this mechanism — no additional stripe log is needed for dm-thin metadata.

15.2.5.4 FMA Integration

| Event | FMA severity | Action |
|---|---|---|
| Stripe log auto-enabled on import | Info | Log: "Stripe write protection enabled for mdN" |
| Metadata region too small for stripe log | Warning | Log: "mdN has RAID write hole risk — metadata region insufficient. Run umka-md-upgrade to convert superblock to 1.2" |
| Stripe log recovery replayed entries | Warning | Log: "mdN: recovered N stripes from stripe log after unclean shutdown" |
| CRC32C mismatch during recovery | Degraded | Log: "mdN stripe S: checksum mismatch, parity recomputed from data" |
| Legacy mode (no stripe log, user-disabled) | Info | One-time log: "mdN: stripe log disabled by admin, write hole risk accepted" |

15.3 SATA/AHCI and Embedded Flash Storage

SATA and eMMC are general-purpose block storage buses present in servers, edge nodes, embedded systems, and consumer devices alike. They belong in the core storage architecture alongside NVMe.

15.3.1 SATA/AHCI

SATA (Serial ATA) remains widely deployed: HDDs in cold/warm storage tiers, SATA SSDs in cost-sensitive edge nodes, and legacy server hardware. AHCI (Advanced Host Controller Interface) is the standard host-side register interface for SATA controllers.

Full driver architecture: Section 15.4 defines the complete AHCI driver: HBA/port register maps, FIS formats, command header/table layouts, NCQ tag management, error recovery state machine, hot-plug, and ATAPI passthrough.

Driver tier: Tier 1. SATA is a block-latency-sensitive path.

AHCI register interface: The AHCI controller exposes a set of memory-mapped registers (HBA memory space, BAR5) and per-port command list / FIS receive areas. The driver:

  1. Discovers ports via HBA_CAP.NP (number of ports).
  2. For each implemented port: reads PxSIG to identify device type (ATA, ATAPI, PM, SEMB).
  3. Issues IDENTIFY DEVICE (ATA command 0xEC) to retrieve geometry, capabilities, LBA48 support, NCQ depth.
  4. Allocates per-port command list (up to 32 slots) and FIS receive buffer.
  5. Registers the device with umka-block as a BlockDevice with sector size 512 or 4096 (Advanced Format).

Command submission: AHCI uses a memory-based command list. Each command slot contains a Command Table with a Physical Region Descriptor Table (PRDT) for scatter-gather DMA. Native Command Queuing (NCQ, up to 32 outstanding commands) is used when the device reports IDENTIFY.SATA_CAP.NCQ_SUPPORTED.

The canonical AhciPort struct (per-port driver state including command list, FIS receive area, NCQ support, and in-flight tracking) is defined in Section 15.4. This summary section covers the integration points; see the detailed architecture for the full field-level definition (port registers, NCQ depth, power state tracking, etc.).

Power management: AHCI supports three interface power states: Active, Partial (~10 µs wake), Slumber (~10 ms wake). The driver uses Aggressive Link Power Management (ALPM) to enter Partial/Slumber when the port is idle. On system suspend (Section 7.9), the driver flushes the write cache (FLUSH CACHE EXT, ATA 0xEA) and issues STANDBY IMMEDIATE (ATA 0xE0) before the controller is powered down.

Integration with Section 15.2 Block I/O: AHCI ports register as BlockDevice instances with umka-block. The volume layer (Section 15.2) treats SATA devices identically to NVMe namespaces — RAID, dm-crypt, dm-verity, thin provisioning all work on SATA block devices without modification.

15.3.2 eMMC (Embedded MultiMediaCard)

eMMC is a managed NAND flash storage interface used in embedded systems, edge servers with soldered storage, and cost-sensitive devices. The host interface is a parallel bus (up to 8-bit data width) with an MMC command set.

Driver tier: Tier 1 for the MMC host controller; device command processing follows the same ring buffer model as NVMe.

eMMC register interface: The eMMC host controller (typically SDHCI-compatible or vendor-specific) exposes MMIO registers for command/response, data FIFO, and interrupt status. The driver:

  1. Initializes the host controller and negotiates bus width (1/4/8-bit) and speed (HS200/HS400 where supported).
  2. Issues CMD8 (SEND_EXT_CSD) to retrieve the extended CSD register (512 bytes), which contains capacity, supported features, lifetime estimation, and write-protect status.
  3. Registers partitions (boot partitions BP1/BP2, RPMB, user area, general purpose partitions) as separate BlockDevice instances with umka-block.

RPMB (Replay-Protected Memory Block): eMMC RPMB is a hardware-authenticated storage area with replay protection, used for secure credential storage (e.g., TPM secrets, disk encryption keys). Access requires HMAC-SHA256-authenticated commands using a device-specific key programmed once at manufacturing. The kernel exposes RPMB as a capability-gated block device; only processes with the CAP_RPMB_ACCESS capability (Section 9.1) can issue RPMB commands.

Lifetime and wear: The Extended CSD PRE_EOL_INFO and DEVICE_LIFE_TIME_EST fields report device health. The kernel reads these periodically and exposes them via sysfs (/sys/block/mmcblk0/device/life_time). No kernel policy is applied — userspace storage daemons make retention/migration decisions.

Integration with Section 15.2: eMMC user-area partitions register as BlockDevice instances. All Section 15.2 volume management targets (dm-crypt, dm-mirror, dm-thin) work on eMMC partitions identically to NVMe namespaces.

15.3.3 SD Card Reader (SDHCI)

SDHCI (SD Host Controller Interface) is the standard register interface for built-in SD card slot controllers. SD cards register as BlockDevice instances with umka-block.

Driver tier: Tier 1.

Speed mode negotiation: UHS-I (SDR104, 104 MB/s max), UHS-II (312 MB/s), and UHS-III (624 MB/s) negotiated per the SD Association SD 8.0 specification. The driver reads the SD card's OCR, CID, CSD, and SCR registers at initialization to determine supported speed modes and switches the bus to the highest mutually supported mode.

Presence detection: SD cards are hot-plug devices. The SDHCI controller raises an interrupt on card insertion/removal. The driver posts a BlockDeviceChanged event to the system event bus (Section 7.9, umka-core) on state change.

Consumer vs. embedded: SD cards are used in consumer laptops (built-in SD slot), embedded systems (primary boot/storage medium), and IoT devices. The SDHCI driver is general-purpose; its presence in consumer devices is the most common deployment.


15.4 AHCI/SATA Driver Architecture

Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules. &self methods use interior mutability for mutation. Atomic fields use .store()/.load(). All #[repr(C)] structs have const_assert! size verification. See CLAUDE.md Spec Pseudocode Quality Gates.

The AHCI driver is a Tier 1 KABI driver that manages SATA storage devices through the AHCI (Advanced Host Controller Interface) register specification. This section defines the complete driver architecture: HBA register model, per-port state machines, FIS (Frame Information Structure) formats, command submission, NCQ (Native Command Queuing), error recovery, hot-plug, and ATAPI passthrough.

Reference specification: Serial ATA AHCI 1.3.1 (Intel, June 2011). SATA 3.5 (SATA-IO, 2024) for link-layer features.

15.4.1 HBA Global Registers

The AHCI HBA exposes a memory-mapped register set at PCI BAR5 (ABAR). The global registers (register offsets 0x00-0x28, last register BOHC ends at byte 0x2B) control HBA-wide behavior:

/// AHCI HBA global registers (ABAR + 0x00).
/// All registers are 32-bit. Access via MMIO (volatile read/write).
///
/// AHCI defines all multi-byte register fields as little-endian.
/// Le* types ensure correct byte order on big-endian architectures
/// (PPC32, s390x). PPC64LE is little-endian and needs no byte-swap.
#[repr(C)]
pub struct AhciHbaRegisters {
    /// Host Capabilities (CAP) — read-only.
    /// Bits: NP (4:0) number of ports - 1, SXS (5) external SATA,
    /// EMS (6) enclosure management, CCCS (7) command completion coalescing,
    /// NCS (12:8) number of command slots - 1, PSC (13) partial state capable,
    /// SSC (14) slumber state capable, PMD (15) PIO multiple DRQ block,
    /// FBSS (16) FIS-based switching, SPM (17) port multiplier,
    /// SAM (18) AHCI-only (no legacy IDE), ISS (23:20) interface speed,
    /// SCLO (24) command list override, SAL (25) activity LED,
    /// SALP (26) aggressive link power mgmt, SSS (27) staggered spin-up,
    /// SMPS (28) mechanical presence switch, SSNTF (29) SNotification,
    /// SNCQ (30) NCQ support, S64A (31) 64-bit addressing.
    pub cap: Le32,
    /// Global HBA Control (GHC).
    /// Bits: HR (0) HBA reset, IE (1) interrupt enable, MRSM (2) MSI revert,
    /// AE (31) AHCI enable.
    pub ghc: Le32,
    /// Interrupt Status (IS) — one bit per port. Write-1-to-clear.
    pub is: Le32,
    /// Ports Implemented (PI) — bitmask of implemented ports.
    pub pi: Le32,
    /// AHCI Version (VS) — major (31:16), minor (15:0). E.g., 0x00010301 = 1.3.1.
    pub vs: Le32,
    /// Command Completion Coalescing Control (CCC_CTL).
    pub ccc_ctl: Le32,
    /// Command Completion Coalescing Ports (CCC_PORTS).
    pub ccc_ports: Le32,
    /// Enclosure Management Location (EM_LOC).
    pub em_loc: Le32,
    /// Enclosure Management Control (EM_CTL).
    pub em_ctl: Le32,
    /// Host Capabilities Extended (CAP2).
    /// Bits: BOH (0) BIOS/OS handoff, NVMP (1) NVMHCI present,
    /// APST (2) automatic partial-to-slumber, SDS (3) DevSleep,
    /// SADM (4) aggressive DevSleep, DESO (5) DevSleep entrance from slumber only.
    pub cap2: Le32,
    /// BIOS/OS Handoff Control and Status (BOHC).
    pub bohc: Le32,
}
// 11 × Le32 = 11 × 4 = 44 bytes. AHCI spec HBA registers: 0x00-0x2B = 44 bytes.
const_assert!(core::mem::size_of::<AhciHbaRegisters>() == 44);

15.4.2 Per-Port Registers

Each port occupies 0x80 bytes at ABAR + 0x100 + (port × 0x80):

/// AHCI per-port register set.
///
/// AHCI defines all multi-byte register fields as little-endian.
/// Le* types ensure correct byte order on big-endian architectures
/// (PPC32, s390x). PPC64LE is little-endian and needs no byte-swap.
// kernel-internal, not KABI
#[repr(C)]
pub struct AhciPortRegisters {
    /// Command List Base Address (PxCLB) — physical address of command list (1024-byte aligned).
    pub clb: Le32,
    /// Command List Base Address Upper 32-bits (PxCLBU) — for 64-bit addressing.
    pub clbu: Le32,
    /// FIS Base Address (PxFB) — physical address of received FIS area (256-byte aligned).
    pub fb: Le32,
    /// FIS Base Address Upper 32-bits (PxFBU).
    pub fbu: Le32,
    /// Interrupt Status (PxIS) — write-1-to-clear.
    /// Bits: DHRS (0) D2H Register FIS, PSS (1) PIO Setup FIS,
    /// DSS (2) DMA Setup FIS, SDBS (3) Set Device Bits FIS,
    /// UFS (4) Unknown FIS, DPS (5) descriptor processed,
    /// PCS (6) port connect change, DMPS (7) device mechanical presence,
    /// PRCS (22) PhyRdy change, IPMS (23) incorrect port multiplier,
    /// OFS (24) overflow, INFS (26) interface non-fatal error,
    /// IFS (27) interface fatal error, HBDS (28) host bus data error,
    /// HBFS (29) host bus fatal error, TFES (30) task file error.
    pub is: Le32,
    /// Interrupt Enable (PxIE) — same bit layout as PxIS.
    pub ie: Le32,
    /// Command and Status (PxCMD).
    /// Bits: ST (0) start, SUD (1) spin-up device, POD (2) power on device,
    /// CLO (3) command list override, FRE (4) FIS receive enable,
    /// CCS (12:8) current command slot, MPSS (13) mechanical presence switch,
    /// FR (14) FIS receive running (RO), CR (15) command list running (RO),
    /// CPS (16) cold presence, PMA (17) port multiplier attached,
    /// HPCP (18) hot-plug capable, MPSP (19) mechanical presence switch present,
    /// CPD (20) cold presence detection, ESP (21) external SATA port,
    /// FBSCP (22) FIS-based switching capable, APSTE (23) auto partial-to-slumber,
    /// ATAPI (24) device is ATAPI, DLAE (25) drive LED on ATAPI enable,
    /// ALPE (26) aggressive link power management enable,
    /// ASP (27) aggressive slumber/partial.
    /// ICC (31:28) interface communication control.
    pub cmd: Le32,
    /// Reserved.
    pub _reserved0: Le32,
    /// Task File Data (PxTFD) — read-only.
    /// Bits (7:0): STS (status register — BSY, DRQ, ERR).
    /// Bits (15:8): ERR (error register).
    pub tfd: Le32,
    /// Signature (PxSIG) — device signature from D2H Register FIS.
    /// 0x00000101 = ATA device, 0xEB140101 = ATAPI device,
    /// 0xC33C0101 = enclosure management bridge, 0x96690101 = port multiplier.
    pub sig: Le32,
    /// Serial ATA Status (PxSSTS) — read-only. SStatus register.
    /// Bits: DET (3:0) device detection, SPD (7:4) interface speed,
    /// IPM (11:8) interface power management.
    pub ssts: Le32,
    /// Serial ATA Control (PxSCTL) — SControl register.
    /// Bits: DET (3:0) device detection init, SPD (7:4) speed allowed,
    /// IPM (11:8) power management transitions allowed.
    pub sctl: Le32,
    /// Serial ATA Error (PxSERR) — write-1-to-clear. SError register.
    pub serr: Le32,
    /// Serial ATA Active (PxSACT) — one bit per NCQ tag. Set by SW before issuing
    /// FPDMA commands; cleared by HW via Set Device Bits FIS on completion.
    pub sact: Le32,
    /// Command Issue (PxCI) — one bit per command slot. Set by SW to issue;
    /// cleared by HW on command completion.
    pub ci: Le32,
    /// SNotification (PxSNTF) — SNotification register (port multiplier).
    pub sntf: Le32,
    /// FIS-based Switching Control (PxFBS).
    pub fbs: Le32,
    /// Device Sleep (PxDEVSLP).
    pub devslp: Le32,
    /// Reserved to 0x6F.
    pub _reserved1: [Le32; 10],
    /// Vendor-specific registers (0x70-0x7F).
    pub vendor: [Le32; 4],
}
// 18 named + 10 reserved + 4 vendor = 32 × Le32 = 128 bytes. AHCI spec: 0x80 per port = 128 bytes.
const_assert!(core::mem::size_of::<AhciPortRegisters>() == 128);

15.4.3 FIS (Frame Information Structure) Types

All host-to-device and device-to-host communication uses FIS frames. The AHCI driver uses these FIS types:

/// FIS types used by AHCI.
#[repr(u8)]
pub enum FisType {
    /// Register FIS — Host to Device (H2D). 20 bytes.
    /// Used for all ATA commands (IDENTIFY, READ DMA EXT, WRITE DMA EXT, etc.).
    RegH2D     = 0x27,
    /// Register FIS — Device to Host (D2H). 20 bytes.
    /// Delivered to FIS receive area on command completion (non-NCQ).
    RegD2H     = 0x34,
    /// DMA Activate FIS — Device to Host. 4 bytes.
    /// Requests host to proceed with DMA transfer (legacy DMA, not used with NCQ).
    DmaActivate = 0x39,
    /// DMA Setup FIS — Bidirectional. 28 bytes.
    /// Auto-activate for first-party DMA (NCQ). Contains DMA buffer offset + transfer count.
    DmaSetup   = 0x41,
    /// Data FIS — Bidirectional. Variable length (up to 8K payload).
    /// Carries actual read/write data.
    Data       = 0x46,
    /// BIST Activate FIS. 12 bytes.
    /// Built-In Self Test pattern generation.
    BistActivate = 0x58,
    /// PIO Setup FIS — Device to Host. 20 bytes.
    /// Precedes PIO data transfer; contains byte count and new status.
    PioSetup   = 0x5F,
    /// Set Device Bits FIS — Device to Host. 8 bytes.
    /// Updates SActive register for NCQ completion notification; carries interrupt bit.
    SetDevBits = 0xA1,
}

/// Register H2D FIS — the primary command FIS. 20 bytes (5 DWORDs).
/// This is what the driver writes into the Command Table CFIS area.
///
/// AHCI defines all multi-byte register fields as little-endian.
/// Le* types ensure correct byte order on big-endian architectures
/// (PPC32, s390x). PPC64LE is little-endian and needs no byte-swap.
#[repr(C)]
pub struct FisRegH2D {
    /// FIS type (0x27).
    pub fis_type: u8,
    /// Bits: PM_PORT (3:0) port multiplier port, C (7) 1=Command, 0=Control.
    pub flags: u8,
    /// ATA command register (e.g., 0x25 = READ DMA EXT, 0x35 = WRITE DMA EXT,
    /// 0x60 = READ FPDMA QUEUED, 0x61 = WRITE FPDMA QUEUED, 0xEC = IDENTIFY DEVICE,
    /// 0xA1 = IDENTIFY PACKET DEVICE, 0xA0 = PACKET, 0xEA = FLUSH CACHE EXT,
    /// 0xE0 = STANDBY IMMEDIATE).
    pub command: u8,
    /// Features register (7:0).
    pub features_lo: u8,
    /// LBA (23:0).
    pub lba_lo: [u8; 3],
    /// Device register. Bit 6 = LBA mode.
    pub device: u8,
    /// LBA (47:24).
    pub lba_hi: [u8; 3],
    /// Features register (15:8).
    pub features_hi: u8,
    /// Sector count (15:0).
    pub count: Le16,
    /// ICC (Isochronous Command Completion).
    pub icc: u8,
    /// Control register.
    pub control: u8,
    /// Reserved (auxiliary).
    pub _reserved: [u8; 4],
}
// 1+1+1+1+3+1+3+1+2+1+1+4 = 20 bytes. ATA Register H2D FIS: 5 DWORDs = 20 bytes.
const_assert!(core::mem::size_of::<FisRegH2D>() == 20);

impl FisRegH2D {
    /// Set 48-bit LBA across the split lba_lo[3] and lba_hi[3] fields.
    pub fn set_lba48(&mut self, lba: u64) {
        self.lba_lo[0] = (lba & 0xFF) as u8;
        self.lba_lo[1] = ((lba >> 8) & 0xFF) as u8;
        self.lba_lo[2] = ((lba >> 16) & 0xFF) as u8;
        self.lba_hi[0] = ((lba >> 24) & 0xFF) as u8;
        self.lba_hi[1] = ((lba >> 32) & 0xFF) as u8;
        self.lba_hi[2] = ((lba >> 40) & 0xFF) as u8;
    }

    /// Clear both LBA fields to zero (for non-data commands like FLUSH).
    pub fn clear_lba(&mut self) {
        self.lba_lo = [0; 3];
        self.lba_hi = [0; 3];
    }

    /// Set sector count in the `count: Le16` field.
    pub fn set_sector_count(&mut self, count: u16) {
        self.count = Le16::from_ne(count);
    }

    /// Zero the FIS to a clean state.
    pub fn zeroed() -> Self {
        // SAFETY: FisRegH2D is #[repr(C)] with all-integer fields; zero is valid.
        unsafe { core::mem::zeroed() }
    }
}
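The `set_lba48` helper above splits the 48-bit LBA into two 3-byte fields in little-endian order. A standalone sketch of the same split, returning the byte groups instead of mutating the FIS, makes the byte layout easy to verify:

```rust
/// Sketch of the 48-bit LBA split performed by FisRegH2D::set_lba48:
/// bytes 0..3 of the little-endian LBA go to lba_lo, bytes 3..6 to lba_hi.
fn split_lba48(lba: u64) -> ([u8; 3], [u8; 3]) {
    let b = lba.to_le_bytes();
    (
        [b[0], b[1], b[2]], // LBA bits 23:0
        [b[3], b[4], b[5]], // LBA bits 47:24
    )
}
```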

15.4.4 Command Header and Command Table

Each port has a command list of up to 32 entries (determined by CAP.NCS). Each entry is a 32-byte command header that points to a variable-length command table:

/// AHCI Command Header — 32 bytes. One per command slot (up to 32 per port).
/// The command list is a contiguous DMA buffer of 32 × AhciCmdHeader.
#[repr(C)]
pub struct AhciCmdHeader {
    /// DW0: Flags.
    /// CFL (4:0): Command FIS Length in DWORDs (2-16, typically 5 for Register H2D).
    /// A (5): ATAPI command (1 if ACMD contains SCSI CDB).
    /// W (6): Write direction (1 = host-to-device, 0 = device-to-host).
    /// P (7): Prefetchable (hint — HBA may prefetch PRD entries).
    /// R (8): Reset (1 = this command performs a device reset).
    /// B (9): BIST FIS.
    /// C (10): Clear Busy upon R_OK (for overlapped commands).
    /// PMP (15:12): Port Multiplier Port.
    /// PRDTL (31:16): Physical Region Descriptor Table Length (entries, 0-65535).
    pub flags_prdtl: Le32,
    /// DW1: Physical Region Descriptor Byte Count (PRDBC).
    /// Updated by HBA on transfer completion — total bytes transferred.
    pub prdbc: Le32,
    /// DW2-3: Command Table Descriptor Base Address (CTBA, 128-byte aligned).
    pub ctba: Le64,
    /// DW4-7: Reserved.
    pub _reserved: [Le32; 4],
}
// Le32(4) + Le32(4) + Le64(8) + [Le32;4](16) = 32 bytes. AHCI spec: 32-byte command header.
const_assert!(core::mem::size_of::<AhciCmdHeader>() == 32);

impl AhciCmdHeader {
    /// Construct the DW0 value with CFL, write direction, and PRDTL,
    /// then store it as Le32. This is the single method for populating
    /// `flags_prdtl` — callers never manipulate the packed field directly.
    ///
    /// `cfl`: Command FIS Length in DWORDs (typically 5 for Register H2D).
    /// `write`: true if host-to-device (W bit 6).
    /// `prdtl`: PRDT entry count (bits 31:16).
    pub fn set_flags_prdtl(&mut self, cfl: u8, write: bool, prdtl: u16) {
        let dw0 = (cfl as u32 & 0x1F)
            | (if write { 1u32 << 6 } else { 0 })
            | ((prdtl as u32) << 16);
        self.flags_prdtl = Le32::from_ne(dw0);
    }
}
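The DW0 packing above is pure bit arithmetic; restating it as a free function (a sketch, not the driver API) lets the field placement be checked directly:

```rust
/// Sketch of the DW0 packing done by AhciCmdHeader::set_flags_prdtl:
/// CFL in bits 4:0, W (write direction) in bit 6, PRDTL in bits 31:16.
fn pack_dw0(cfl: u8, write: bool, prdtl: u16) -> u32 {
    (cfl as u32 & 0x1F)
        | (if write { 1u32 << 6 } else { 0 })
        | ((prdtl as u32) << 16)
}
```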

/// AHCI Command Table — variable size. Contains CFIS, ACMD, and PRDT.
/// Minimum size: 128 bytes (CFIS) + 0 (ACMD) + N × 16 (PRDT entries).
/// The command table must be 128-byte aligned.
#[repr(C)]
pub struct AhciCmdTable {
    /// Command FIS area — 64 bytes (only first CFL×4 bytes are valid).
    pub cfis: [u8; 64],
    /// ATAPI Command area — 16 bytes (12-byte SCSI CDB + 4 padding).
    /// Only valid when AhciCmdHeader.flags.A = 1.
    pub acmd: [u8; 16],
    /// Reserved — 48 bytes.
    pub _reserved: [u8; 48],
    /// Physical Region Descriptor Table — up to 65535 entries.
    /// Actual count is in AhciCmdHeader.flags_prdtl (PRDTL field).
    /// For UmkaOS: capped at 248 entries per command (matching `max_segments`
    /// from BlockDeviceInfo). Each PRDT entry can address up to 4MB (22-bit DBC
    /// field), but the block layer's bio splitting caps practical transfers at ~1MB.
    /// Rationale for 248: the command table header (CFIS + ACMD + reserved) is
    /// 128 bytes; 128 + 248 × 16 = 4096 bytes = exactly one 4KB page. This
    /// maximizes scatter-gather capacity within a single-page DMA allocation.
    /// Linux uses `LIBATA_MAX_PRD` = 128 (half of `ATA_MAX_PRD` = 256);
    /// UmkaOS uses 248 to fill the page without crossing a page boundary.
    ///
    /// **Memory footprint**: 32 command slots × 4 KB per command table = 128 KB
    /// per port. Most I/O uses 1-4 PRDT entries, leaving ~244 entries unused
    /// per command. This is intentional: the AHCI spec requires the command
    /// table to be a contiguous DMA allocation, and per-command dynamic sizing
    /// would require separate DMA allocations per I/O (slower, more fragmentation).
    /// The 128 KB/port cost is fixed and acceptable for SATA controllers.
    pub prdt: [AhciPrdtEntry; 248],
}
// AhciCmdTable: cfis(64) + acmd(16) + _reserved(48) + prdt(248×16) = 4096 bytes.
const_assert!(core::mem::size_of::<AhciCmdTable>() == 4096);

/// AHCI PRDT Entry — 16 bytes. Describes one scatter-gather DMA region.
///
/// AHCI defines all multi-byte register fields as little-endian.
/// Le* types ensure correct byte order on big-endian architectures
/// (PPC32, s390x). PPC64LE is little-endian and needs no byte-swap.
#[repr(C)]
pub struct AhciPrdtEntry {
    /// Data Base Address (DBA) — physical byte address of data buffer.
    /// Must be word-aligned (bit 0 = 0).
    pub dba: Le32,
    /// Data Base Address Upper 32-bits (DBAU).
    pub dbau: Le32,
    /// Reserved.
    pub _reserved: Le32,
    /// Data Byte Count (DBC) and Interrupt flag.
    /// DBC (21:0): byte count - 1 (0 = 1 byte, max 0x3FFFFF = 4MB).
    /// I (31): Interrupt on completion of this PRD entry.
    pub dbc_i: Le32,
}
// Le32(4) + Le32(4) + Le32(4) + Le32(4) = 16 bytes. AHCI PRDT entry: 16 bytes.
const_assert!(core::mem::size_of::<AhciPrdtEntry>() == 16);

/// ATA DATA SET MANAGEMENT (TRIM) LBA Range Entry.
/// ACS-4 §7.10: the payload is an array of these 8-byte entries packed into
/// 512-byte blocks. Each entry specifies a contiguous range of LBAs to trim.
/// Command 0x06 (DATA SET MANAGEMENT) with feature register bit 0 = TRIM.
#[repr(C, packed)]
pub struct AtaTrimRangeEntry {
    /// Starting LBA of the range to trim (48-bit).
    /// Stored as 6 little-endian bytes: lba[0..6].
    pub lba: [u8; 6],
    /// Number of logical sectors to trim (16-bit, little-endian).
    /// 0 = entry is unused (skip). Max 65535 sectors per entry.
    pub count: Le16,
}
const_assert!(core::mem::size_of::<AtaTrimRangeEntry>() == 8);
// Each 512-byte block holds 64 entries. The DATA SET MANAGEMENT
// command transfers 1-65535 blocks (count register), each containing
// up to 64 range entries.
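Packing one range entry is a straightforward little-endian serialization of the 48-bit LBA and 16-bit count. A minimal sketch (hypothetical helper name) of the 8-byte layout described above:

```rust
/// Sketch of serializing one DATA SET MANAGEMENT (TRIM) range entry:
/// 6 little-endian bytes of the 48-bit starting LBA, then the 16-bit
/// little-endian sector count. count = 0 marks an unused entry.
fn pack_trim_entry(lba: u64, count: u16) -> [u8; 8] {
    let l = lba.to_le_bytes();
    let c = count.to_le_bytes();
    [l[0], l[1], l[2], l[3], l[4], l[5], c[0], c[1]]
}
```

A full 512-byte payload block would hold 64 such entries, zero-padded when fewer ranges are trimmed.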

15.4.5 FIS Receive Area

Each port has a dedicated 256-byte DMA buffer for received FIS frames:

/// AHCI received FIS area — 256 bytes per port.
/// The HBA writes incoming FIS frames to fixed offsets in this buffer.
#[repr(C, align(256))]
pub struct AhciFisRxArea {
    /// DMA Setup FIS (offset 0x00, 28 bytes).
    pub dma_setup: [u8; 28],
    pub _pad0: [u8; 4],
    /// PIO Setup FIS (offset 0x20, 20 bytes).
    pub pio_setup: [u8; 20],
    pub _pad1: [u8; 12],
    /// D2H Register FIS (offset 0x40, 20 bytes).
    pub d2h_reg: [u8; 20],
    pub _pad2: [u8; 4],
    /// Set Device Bits FIS (offset 0x58, 8 bytes).
    pub set_dev_bits: [u8; 8],
    /// Unknown FIS area (offset 0x60, 64 bytes).
    pub unknown_fis: [u8; 64],
    /// Reserved (offset 0xA0, 96 bytes).
    pub _reserved: [u8; 96],
}
// 28+4+20+12+20+4+8+64+96 = 256 bytes. AHCI spec: 256-byte FIS receive area.
const_assert!(core::mem::size_of::<AhciFisRxArea>() == 256);

15.4.6 AhciPort Driver State

/// Per-port AHCI driver state — lives in the Tier 1 driver domain.
/// One instance per implemented port. Allocated at driver probe time.
pub struct AhciPort {
    /// Port number (0-31).
    pub port_num: u8,
    /// MMIO accessor for this port's register set.
    pub regs: PortedMmio,
    /// DMA-coherent command list (32 entries × 32 bytes = 1024 bytes, 1K-aligned).
    pub cmd_list: DmaBox<[AhciCmdHeader; 32]>,
    /// DMA-coherent received FIS area (256 bytes, 256-byte aligned).
    pub fis_rx: DmaBox<AhciFisRxArea>,
    /// Per-slot command tables. Pre-allocated at probe time — no allocation on the I/O path.
    /// Only `ncs` entries are valid (CAP.NCS + 1).
    pub cmd_tables: ArrayVec<DmaBox<AhciCmdTable>, 32>,
    /// Bitmask of in-flight command slots (mirrors PxCI for driver-side tracking).
    pub inflight: AtomicU32,
    /// Bitmask of in-flight NCQ tags (mirrors PxSACT for driver-side tracking).
    pub ncq_inflight: AtomicU32,
    /// Number of command slots supported (CAP.NCS + 1, max 32).
    pub ncs: u8,
    /// Maximum NCQ depth reported by IDENTIFY DEVICE word 75 (0-based, max 31 → 32 tags).
    pub ncq_depth: u8,
    /// True if the device supports NCQ (IDENTIFY word 76 bit 8).
    pub ncq_capable: bool,
    /// Device type detected from PxSIG.
    pub device_type: AhciDeviceType,
    /// Logical sector size (512 or 4096).
    pub logical_sector_size: u32,
    /// Physical sector size (512 or 4096 for Advanced Format).
    pub physical_sector_size: u32,
    /// Total capacity in logical sectors.
    pub capacity_sectors: u64,
    /// Device supports TRIM (IDENTIFY word 169 bit 0).
    pub supports_trim: bool,
    /// Device supports write cache (IDENTIFY word 82 bit 5).
    pub write_cache_enabled: bool,
    /// Device supports 48-bit LBA (IDENTIFY word 83 bit 10).
    pub lba48: bool,
    /// Device supports Force Unit Access (WRITE DMA FUA EXT).
    /// Requires BOTH LBA48 AND IDENTIFY word 84 bit 6. Not all LBA48
    /// devices support FUA — it is optional.
    pub supports_fua: bool,
    /// Device supports SANITIZE command (IDENTIFY word 59 bit 12).
    pub supports_sanitize: bool,
    /// Nominal media rotation rate from IDENTIFY word 217.
    /// 0x0001 = non-rotating (SSD), any other non-zero = RPM.
    /// Used by get_info() to set BlockDeviceFlags::ROTATIONAL.
    pub nominal_rotation_rate: u16,
    /// Per-slot bio pointer — maps completed command slot/NCQ tag back to
    /// the originating Bio for `bio_complete()`. Analogous to NVMe's
    /// `inflight: Box<[Option<NvmeInflightCmd>]>`.
    ///
    /// Set by `submit_bio()` after claiming a slot. Taken (swap to null) by
    /// the IRQ completion handler when processing a D2H FIS / SDB FIS. For
    /// synchronous commands (IDENTIFY, non-bio flush), this is null — those
    /// use `wait_for_completion()` instead.
    ///
    /// Uses `AtomicPtr<Bio>` (null = no bio) for interior mutability:
    /// submit paths and IRQ handler both access through `&self`/`&AhciPort`.
    /// The `inflight` bitmask provides mutual exclusion (a slot is only
    /// written by the submit path after claiming it, and only read/cleared
    /// by the IRQ handler after the device signals completion).
    ///
    /// **SAFETY**: Raw pointer to a Bio whose lifetime extends until
    /// `bio_complete()` signals completion. The AHCI port state machine
    /// ensures each slot processes exactly one bio at a time.
    pub slot_bios: [AtomicPtr<Bio>; 32],
    /// Per-slot completion error status — maps completed slot to the
    /// errno value (0 = success, -EIO = error) for `bio_complete()`.
    pub slot_status: [AtomicI32; 32],
    /// Per-slot completion state — one per command slot.
    /// Values: IDLE(0), PENDING(1), COMPLETED(2), ERROR(3).
    /// Submit path sets `slot_completions[slot] = PENDING`.
    /// IRQ handler sets `slot_completions[slot] = COMPLETED/ERROR`
    /// and wakes the port's WaitQueue.
    /// `wait_for_completion()` blocks on the port WaitQueue until
    /// `slot_completions[slot] != PENDING`.
    ///
    /// **Tag reuse constraint**: A completed slot MUST NOT be reused
    /// (bit re-set in `inflight`/`ncq_inflight`) until the completion
    /// handler has consumed the previous completion. The submit path
    /// checks `slot_completions[slot] == IDLE` before claiming the slot.
    pub slot_completions: [AtomicU8; 32],
    /// Per-port WaitQueue for synchronous command completion (flush,
    /// IDENTIFY). Woken by the IRQ handler after updating slot_completions.
    /// See [Section 3.6](03-concurrency.md#lock-free-data-structures--completion-one-shot-or-multi-shot-signaling-primitive)
    /// for the formal `Completion` primitive. AHCI uses `WaitQueue` directly
    /// (not `Completion`) because multiple command slots share a single
    /// per-port wait queue with per-slot state discrimination.
    pub completion_waitq: WaitQueue,
    /// Port error state — set by error recovery, checked by submit path.
    pub error_state: AtomicU8,
    /// Link power management state.
    pub link_pm_state: AtomicU8,
}

#[repr(u8)]
pub enum AhciDeviceType {
    /// Standard ATA disk (PxSIG = 0x00000101).
    Ata = 0,
    /// ATAPI device — optical drive, tape (PxSIG = 0xEB140101).
    Atapi = 1,
    /// Port multiplier (PxSIG = 0x96690101).
    PortMultiplier = 2,
    /// Enclosure management bridge (PxSIG = 0xC33C0101).
    Semb = 3,
    /// No device detected.
    None = 0xFF,
}

/// Port error recovery state.
#[repr(u8)]
pub enum AhciPortErrorState {
    /// Normal operation.
    Normal = 0,
    /// Error recovery in progress — new I/O submission blocked.
    Recovering = 1,
    /// Port disabled after unrecoverable error.
    Disabled = 2,
}

/// Link power management state.
#[repr(u8)]
pub enum AhciLinkPmState {
    /// Active (no power saving).
    Active = 0,
    /// AHCI Partial state (low-latency sleep, ~10us resume).
    Partial = 1,
    /// AHCI Slumber state (deeper sleep, ~10ms resume).
    Slumber = 2,
    /// SATA DevSleep (device-initiated deep sleep, ~20ms resume).
    DevSleep = 3,
}

impl AhciPort {
    /// Read the current error state as a typed enum.
    /// Returns `AhciPortErrorState::Disabled` for any unrecognized value
    /// (defensive — treats corruption as worst-case).
    pub fn error_state(&self) -> AhciPortErrorState {
        match self.error_state.load(Acquire) {
            0 => AhciPortErrorState::Normal,
            1 => AhciPortErrorState::Recovering,
            _ => AhciPortErrorState::Disabled,
        }
    }

    /// Set the error state atomically.
    pub fn set_error_state(&self, s: AhciPortErrorState) {
        self.error_state.store(s as u8, Release);
    }

    /// Read the current link power management state as a typed enum.
    /// Returns `AhciLinkPmState::Active` for any unrecognized value.
    pub fn link_pm_state(&self) -> AhciLinkPmState {
        match self.link_pm_state.load(Acquire) {
            0 => AhciLinkPmState::Active,
            1 => AhciLinkPmState::Partial,
            2 => AhciLinkPmState::Slumber,
            3 => AhciLinkPmState::DevSleep,
            _ => AhciLinkPmState::Active,
        }
    }

    /// Set the link power management state atomically.
    pub fn set_link_pm_state(&self, s: AhciLinkPmState) {
        self.link_pm_state.store(s as u8, Release);
    }
}

15.4.7 Initialization Sequence

  1. PCI probe: Match PCI class code 01:06:01 (Mass Storage → SATA → AHCI 1.0). Map BAR5 as uncacheable MMIO.
  2. BIOS/OS handoff: If CAP2.BOH is set, perform BIOS/OS handoff via BOHC register (set OOS bit, wait for BOS clear, timeout 25ms per AHCI spec §11.6).
  3. Enable AHCI mode: Set GHC.AE (bit 31). If CAP.SAM is clear (legacy supported), the HBA may start in IDE mode; AE forces AHCI.
  4. Enumerate ports: Read the PI register. Extract num_ports = CAP.NP + 1 (max 32). Bounds validation: verify num_ports <= ports.capacity() (the AhciPort array has a fixed capacity of 32, matching the AHCI spec maximum). If CAP.NP + 1 exceeds the array capacity (hardware bug or MMIO corruption), log an FMA error event and clamp num_ports to the array capacity. The interrupt handler iterates 0..num_ports and indexes hba.ports[port_num]; this bounds check ensures the loop never exceeds the array length. For each bit set in PI:
     a. Allocate DmaBox<[AhciCmdHeader; 32]> (command list, 1K-aligned).
     b. Allocate DmaBox<AhciFisRxArea> (FIS receive, 256-byte aligned).
     c. Write physical addresses to PxCLB/PxCLBU and PxFB/PxFBU.
     d. Pre-allocate ncs command tables (each 128-byte aligned).
     e. Clear PxSERR (write all-ones to clear).
     f. Set PxCMD.FRE (FIS Receive Enable).
     g. If CAP.SSS (staggered spin-up): set PxCMD.SUD to spin up the device.
     h. Wait for PxSSTS.DET = 3 (device present and communication established), timeout 1 second.
     i. Read PxSIG → classify device type.
     j. Set PxCMD.ST (Start command processing).
  5. IDENTIFY DEVICE: For ATA devices, issue IDENTIFY DEVICE (0xEC). For ATAPI, issue IDENTIFY PACKET DEVICE (0xA1). Parse:
     • Words 60-61: Total addressable sectors (28-bit LBA).
     • Words 100-103: Total addressable sectors (48-bit LBA).
     • Word 75: NCQ queue depth (0-based).
     • Word 76 bit 8: NCQ supported.
     • Word 82 bit 5: Write cache supported.
     • Word 83 bit 10: 48-bit LBA supported.
     • Word 84 bit 6: FUA (Force Unit Access) supported.
     • Word 106: Logical/physical sector size.
     • Word 169 bit 0: TRIM (DATA SET MANAGEMENT) supported.
     • Word 217: Nominal media rotation rate (1 = non-rotating = SSD).
  6. Enable NCQ: If the device supports NCQ, set ncq_capable = true. NCQ depth = min(device depth from IDENTIFY word 75, HBA slots from CAP.NCS).
  7. Enable interrupts: Set PxIE to enable DHRS, SDBS, PCS, IFS, HBFS, HBDS, TFES. Set GHC.IE (global interrupt enable).
  8. Register with umka-block: Create a BlockDevice with sector size, capacity, supports_flush = write_cache_enabled, supports_discard = supports_trim, supports_fua = lba48 && IDENTIFY word 84 bit 6.
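The word 106 sector-size decode is fiddly enough to deserve a concrete sketch. The following is a minimal, self-contained version, assuming the ACS field layout (word 106 valid when bit 14 is set and bit 15 clear; bit 12 → logical sector size in words 117-118; bit 13 → bits 3:0 hold log2 of logical sectors per physical sector). The helper name is illustrative, not the driver's actual API:

```rust
/// Decode (logical_bytes, physical_bytes) from a 256-word IDENTIFY
/// DEVICE buffer. Falls back to 512/512 when word 106 is not valid.
fn decode_sector_sizes(id: &[u16; 256]) -> (u32, u32) {
    let w106 = id[106];
    // Validity signature: bit 14 set, bit 15 clear.
    if w106 & 0xC000 != 0x4000 {
        return (512, 512);
    }
    // Bit 12: logical sector longer than 256 words — actual size
    // (in 16-bit words) lives in words 117-118.
    let logical: u32 = if w106 & (1u16 << 12) != 0 {
        let words = (id[117] as u32) | ((id[118] as u32) << 16);
        words * 2 // convert words to bytes
    } else {
        512
    };
    // Bit 13: multiple logical sectors per physical sector;
    // bits 3:0 = log2(logical sectors per physical sector).
    let physical = if w106 & (1u16 << 13) != 0 {
        logical << (w106 & 0x0F) as u32
    } else {
        logical
    };
    (logical, physical)
}
```

A 512e Advanced Format drive reports word 106 = 0x6003 (valid, 2³ logical sectors per physical) and decodes to (512, 4096).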

15.4.8 Command Submission (Non-NCQ)

For legacy (non-NCQ) commands — IDENTIFY, FLUSH, STANDBY, TRIM (non-queued):

  1. Find a free slot: atomically claim a clear bit in inflight via a CAS loop (the CAS itself sets the slot's bit, so no separate "mark in-flight" step is needed):
    loop {
        let current = inflight.load(Acquire);
        let free_bit = (!current).trailing_zeros();
        if free_bit >= 32 { return Err(Error::AGAIN); }
        let new = current | (1 << free_bit);
        if inflight.compare_exchange_weak(current, new, AcqRel, Acquire).is_ok() {
            return Ok(free_bit as u8);
        }
    }
    
    If all slots are busy, return EAGAIN (the caller retries via block layer backpressure).
  2. Build FisRegH2D in cmd_tables[slot].cfis:
     a. Set fis_type = 0x27, flags = 0x80 (C bit = command).
     b. Fill command, lba_lo/hi, count, device (bit 6 = LBA mode).
  3. Build PRDT entries from Bio.segments — one AhciPrdtEntry per segment. Set dbc_i = (segment.len - 1). Set the I bit on the last entry.
  4. Write AhciCmdHeader: set CFL = 5 (20 bytes / 4), the W bit if write, PRDTL = segment count.
  5. Write the slot bit to PxCI — the HBA fetches the command header and begins execution.
  6. Completion arrives via the D2H Register FIS interrupt (PxIS.DHRS).
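The AhciCmdHeader write packs CFL, the W bit, and PRDTL into DW0 of the 32-byte command header. A hedged sketch of the packing, using the DW0 field layout from AHCI 1.3.1 §4.2.2 (bits 4:0 = CFL, bit 6 = W, bits 31:16 = PRDTL; the function name is illustrative — the spec above calls the method `set_flags_prdtl`):

```rust
/// Pack command header DW0. CFL is the command FIS length in dwords
/// (5 for a 20-byte Register H2D FIS); W = host-to-device data
/// direction; PRDTL = number of PRDT entries. The remaining DW0 bits
/// (ATAPI, Prefetchable, Reset, BIST, Clear-busy, PMP) are left zero.
fn cmd_header_dw0(cfl: u8, write: bool, prdtl: u16) -> u32 {
    debug_assert!(cfl <= 0x1F, "CFL is a 5-bit field");
    (cfl as u32 & 0x1F)
        | ((write as u32) << 6)
        | ((prdtl as u32) << 16)
}
```

For the maximum scatter-gather case (write, 248 PRDT entries), DW0 comes out to 0x00F8_0045.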

15.4.9 NCQ Command Submission

For READ/WRITE FPDMA QUEUED (commands 0x60/0x61) — the fast path:

  1. Find a free NCQ tag: scan ncq_inflight for a clear bit (max ncq_depth tags).
  2. Build FisRegH2D:
     a. command = 0x60 (read) or 0x61 (write).
     b. features_lo/features_hi = sector count (16-bit). For NCQ the COUNT register does not carry the sector count.
     c. lba_lo/lba_hi = 48-bit LBA.
     d. count bits (7:3) = NCQ tag number. The COUNT register carries only the tag.
     e. device = 0x40 (LBA mode). Bit 7 of device = FUA if requested (ACS-3 §7.63.6.4 — for FPDMA QUEUED, FUA lives in the DEVICE register, not COUNT).
  3. Build PRDT from bio segments (same as non-NCQ).
  4. Write AhciCmdHeader: CFL = 5, W bit, PRDTL.
  5. Initialize slot status: port.slot_status[tag as usize].store(0, Relaxed). This clears any stale error code from a previous command that used this slot; without it, the completion handler would read a stale error for a successful NCQ completion.
  6. Set the tag bit in ncq_inflight. Write the tag bit to PxSACT.
  7. Write the slot bit to PxCI.
  8. Completion: the device sends a Set Device Bits FIS with SActive bits cleared. The HBA sets PxIS.SDBS. The interrupt handler reads PxSACT to determine which tags completed (bits that transitioned 1→0).

Tag-to-slot mapping: UmkaOS uses identity mapping (tag N = slot N). This is the simplest model and avoids the complexity of split tag/slot namespaces. Since ncq_depth ≤ ncs, there are always enough slots.
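The SDB-FIS completion detection (tags whose SActive bits transitioned 1→0) reduces to two bitmask operations over the driver's in-flight view and a fresh PxSACT read. A minimal sketch (the helper name is illustrative):

```rust
/// Given the driver's view of issued NCQ tags and a fresh PxSACT
/// read, return the tags the device has completed: bits the driver
/// still considers in-flight but the device has cleared in SActive.
fn completed_ncq_tags(ncq_inflight: u32, px_sact: u32) -> u32 {
    ncq_inflight & !px_sact
}
```

With tags 0, 1, 3 in flight and the device still working on tag 1 (PxSACT = 0b0010), the completed set is 0b1001 — tags 0 and 3.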

15.4.9.1 Callable Function Bodies

impl AhciPort {
    /// Allocate a free command slot from the port's `inflight` bitmask.
    /// Returns the slot index (0..ncs-1). Returns `Err(Error::AGAIN)` if
    /// all slots are busy (caller retries via block layer backpressure).
    pub fn alloc_slot(&self) -> Result<u8> {
        loop {
            let current = self.inflight.load(Acquire);
            let free_bit = (!current).trailing_zeros();
            if free_bit >= self.ncs as u32 {
                return Err(Error::AGAIN); // all slots busy
            }
            let new = current | (1 << free_bit);
            if self.inflight.compare_exchange_weak(current, new, AcqRel, Acquire).is_ok() {
                return Ok(free_bit as u8);
            }
        }
    }

    /// Submit an NCQ READ/WRITE FPDMA QUEUED command for a Bio.
    /// Uses NCQ tag = slot index (identity mapping).
    ///
    /// Steps: alloc NCQ tag, build FIS, build PRDT, set PxSACT, set PxCI.
    /// Completion arrives via SDB FIS interrupt (PxIS.SDBS).
    ///
    /// Takes `&self` — all mutable state uses interior mutability:
    /// `slot_bios` is `[AtomicPtr<Bio>; 32]`, `cmd_tables` DMA memory
    /// is mutated via unsafe pointer cast (safe because the `inflight`
    /// bitmask guarantees exclusive access to the claimed slot).
    pub fn submit_ncq(&self, bio: &mut Bio) -> Result<()> {
        let tag = self.alloc_slot()?;
        self.slot_bios[tag as usize].store(bio as *mut Bio, Release);
        self.slot_status[tag as usize].store(0, Relaxed);
        // Build FisRegH2D in cmd_tables[tag].cfis.
        // SAFETY: DmaBox provides raw_ptr() returning *mut T (analogous to
        // UnsafeCell — the DMA buffer is owned memory accessible via raw
        // pointer even through &self). cfis is [u8; 64]; the first 20 bytes
        // are the H2D FIS. The inflight bitmask guarantees exclusive access
        // to this slot (no concurrent writer or reader for this tag).
        let cmd_table_ptr = self.cmd_tables[tag as usize].raw_ptr();
        let fis = unsafe {
            &mut *((*cmd_table_ptr).cfis.as_mut_ptr() as *mut FisRegH2D)
        };
        *fis = FisRegH2D::zeroed();
        fis.fis_type = 0x27;
        fis.flags = 0x80; // C bit (command, not control)
        fis.command = if bio.op == BioOp::Read { 0x60 } else { 0x61 };
        fis.set_lba48(bio.start_lba);
        // NCQ FUA: ACS-3 §7.63.6.4 — FUA is bit 7 of the DEVICE register,
        // NOT the COUNT register. The COUNT register carries only the NCQ
        // tag in bits [7:3]. Placing FUA in COUNT silently downgrades FUA
        // writes to non-FUA, risking data loss on power failure.
        let fua_bit: u8 = if bio.flags.contains(BioFlags::FUA) { 0x80 } else { 0 };
        fis.device = 0x40 | fua_bit; // LBA mode | FUA (bit 7)
        // NCQ: sector count goes in features_lo/features_hi (not count).
        // Count register carries only the NCQ tag (bits 7:3).
        let sector_count = bio.total_sectors();
        fis.features_lo = (sector_count & 0xFF) as u8;
        fis.features_hi = ((sector_count >> 8) & 0xFF) as u8;
        fis.count = Le16::from_ne((tag as u16) << 3);
        // Build PRDT from bio segments
        let prdtl = self.build_prdt(tag, bio)?;
        // Write AhciCmdHeader DW0: CFL=5 (20 bytes / 4), W=write, PRDTL=prdtl
        self.cmd_list[tag as usize].set_flags_prdtl(
            5, bio.op == BioOp::Write, prdtl,
        );
        // Set NCQ inflight and issue
        self.ncq_inflight.fetch_or(1u32 << tag, Release);
        self.regs.px_sact.write(1u32 << tag);
        self.regs.px_ci.write(1u32 << tag);
        Ok(())
    }

    /// Submit a legacy (non-NCQ) DMA READ/WRITE command for a Bio.
    /// Used for devices that do not support NCQ (old SATA-I, ATAPI, etc.).
    pub fn submit_legacy_dma(&self, bio: &mut Bio) -> Result<()> {
        let slot = self.alloc_slot()?;
        self.slot_bios[slot as usize].store(bio as *mut Bio, Release);
        self.slot_status[slot as usize].store(0, Relaxed);
        // SAFETY: same as submit_ncq — DmaBox::raw_ptr() + inflight bitmask exclusion.
        let cmd_table_ptr = self.cmd_tables[slot as usize].raw_ptr();
        let fis = unsafe {
            &mut *((*cmd_table_ptr).cfis.as_mut_ptr() as *mut FisRegH2D)
        };
        *fis = FisRegH2D::zeroed();
        fis.fis_type = 0x27;
        fis.flags = 0x80;
        fis.command = if bio.op == BioOp::Read { 0x25 } else { 0x35 }; // READ/WRITE DMA EXT
        fis.set_lba48(bio.start_lba);
        fis.set_sector_count(bio.total_sectors());
        fis.device = 0x40;
        let prdtl = self.build_prdt(slot, bio)?;
        self.cmd_list[slot as usize].set_flags_prdtl(
            5, bio.op == BioOp::Write, prdtl,
        );
        self.regs.px_ci.write(1u32 << slot);
        Ok(())
    }

    /// Submit a non-data ATA command (FLUSH CACHE, STANDBY IMMEDIATE, etc.).
    /// No PRDT needed — the command has no data transfer phase.
    /// Note: IDENTIFY DEVICE is NOT non-data (it is a 512-byte PIO
    /// data-in command and needs a one-entry PRDT), so it uses a
    /// separate submission path.
    pub fn submit_non_data_command(&self, command: u8) -> Result<()> {
        let slot = self.alloc_slot()?;
        // No bio for non-data commands — store null.
        self.slot_bios[slot as usize].store(core::ptr::null_mut(), Release);
        self.slot_status[slot as usize].store(0, Relaxed);
        // SAFETY: same as submit_ncq — DmaBox::raw_ptr() + inflight bitmask exclusion.
        let cmd_table_ptr = self.cmd_tables[slot as usize].raw_ptr();
        let fis = unsafe {
            &mut *((*cmd_table_ptr).cfis.as_mut_ptr() as *mut FisRegH2D)
        };
        *fis = FisRegH2D::zeroed();
        fis.fis_type = 0x27;
        fis.flags = 0x80;
        fis.command = command;
        fis.device = 0x00;
        // LBA and count zeroed by FisRegH2D::zeroed().
        self.cmd_list[slot as usize].set_flags_prdtl(5, false, 0);
        self.regs.px_ci.write(1u32 << slot);
        Ok(())
    }

    /// Submit FLUSH CACHE EXT using a pre-allocated command slot.
    /// Called from BlockDeviceOps::submit_bio (BioOp::Flush) where the
    /// caller has already allocated the slot and stored the bio pointer.
    pub fn submit_flush_with_slot(&self, slot: u8) -> Result<()> {
        let cmd = if self.lba48 { 0xEA } else { 0xE7 };
        // SAFETY: same as submit_ncq — DmaBox::raw_ptr() + inflight bitmask exclusion.
        let cmd_table_ptr = self.cmd_tables[slot as usize].raw_ptr();
        let fis = unsafe {
            &mut *((*cmd_table_ptr).cfis.as_mut_ptr() as *mut FisRegH2D)
        };
        *fis = FisRegH2D::zeroed();
        fis.fis_type = 0x27;
        fis.flags = 0x80;
        fis.command = cmd;
        fis.device = 0x00;
        // LBA and count zeroed by FisRegH2D::zeroed().
        self.cmd_list[slot as usize].set_flags_prdtl(5, false, 0);
        self.regs.px_ci.write(1u32 << slot);
        Ok(())
    }
}

15.4.9.2 Non-NCQ Completion Handler

/// Process completion for non-NCQ commands. Called from the IRQ handler
/// when PxIS.DHRS (D2H Register FIS Received) is set.
///
/// For each completed slot: read status from D2H FIS, map slot to bio,
/// call bio_complete() with errno (i32: 0 = success, negative = error).
fn complete_non_ncq_slots(port: &AhciPort, completed: u32) {
    for slot in 0..port.ncs {
        if completed & (1 << slot) == 0 { continue; }
        // Read completion status from the D2H Register FIS.
        // `fis_rx` is `DmaBox<AhciFisRxArea>`, `d2h_reg` is `[u8; 20]`.
        // D2H FIS format: byte 0 = FIS type (0x34), byte 1 = flags,
        // byte 2 = status register, byte 3 = error register.
        // Note: the FIS receive area holds only the most recent D2H FIS;
        // with multiple outstanding non-NCQ commands, PxTFD.STS is the
        // authoritative per-port status.
        let ata_status = port.fis_rx.d2h_reg[2];
        let _ata_error = port.fis_rx.d2h_reg[3];
        // Map ATA error to errno: ERR bit (status bit 0) = -EIO, else success.
        // (A production driver may refine using _ata_error: ABRT, IDNF, UNC, etc.)
        let errno: i32 = if ata_status & 0x01 != 0 {
            -(EIO as i32)
        } else {
            0
        };
        // Consume the bio pointer BEFORE clearing the inflight bit: once
        // the bit is clear, the submit path may reclaim the slot and store
        // a new bio pointer, which this handler must not complete.
        // AtomicPtr::swap(null) atomically retrieves and clears the pointer.
        let bio_ptr = port.slot_bios[slot as usize].swap(core::ptr::null_mut(), AcqRel);
        if !bio_ptr.is_null() {
            // SAFETY: bio_ptr was stored by submit_bio and is valid
            // until bio_complete is called.
            let bio = unsafe { &mut *bio_ptr };
            bio_complete(bio, errno);
            port.slot_completions[slot as usize].store(IDLE, Release);
        } else {
            // Synchronous command (null bio): publish the result and wake
            // wait_for_completion(), per the slot_completions contract.
            // The waiter stores IDLE after consuming the completion.
            port.slot_status[slot as usize].store(errno, Release);
            port.slot_completions[slot as usize]
                .store(if errno == 0 { COMPLETED } else { ERROR }, Release);
            port.completion_waitq.wake_all();
        }
        // Release the slot last, after all per-slot state is settled.
        port.inflight.fetch_and(!(1u32 << slot), Release);
    }
}

15.4.10 Flush and Standby Submission

impl AhciPort {
    /// Submit FLUSH CACHE EXT (command 0xEA) asynchronously.
    /// Returns immediately; completion is signaled via the D2H FIS interrupt.
    /// For non-48-bit devices, falls back to FLUSH CACHE (0xE7).
    pub fn submit_flush(&self) -> Result<()> {
        let cmd = if self.lba48 { 0xEA } else { 0xE7 };
        self.submit_non_data_command(cmd)
    }

    /// Submit FLUSH CACHE EXT and wait for completion (blocking).
    /// Used by the `flush()` block device method and shutdown path.
    pub fn submit_flush_sync(&self) -> Result<()> {
        self.submit_flush()?;
        self.wait_for_completion()
    }

    /// Submit STANDBY IMMEDIATE (command 0xE0) — spins down the device.
    /// Used during shutdown to ensure clean power-off.
    pub fn submit_standby_immediate(&self) -> Result<()> {
        self.submit_non_data_command(0xE0)?;
        self.wait_for_completion()
    }
}

15.4.11 Interrupt Handler

The AHCI interrupt handler runs in hardirq context (Tier 1 domain):

fn ahci_irq_handler(hba: &AhciHba) -> IrqReturn {
    let global_is = hba.regs.read_is();
    if global_is == 0 { return IrqReturn::None; }

    for port_num in 0..hba.num_ports {
        if global_is & (1 << port_num) == 0 { continue; }

        let port = &hba.ports[port_num];
        let port_is = port.regs.read_is();

        // NCQ completions — Set Device Bits FIS received.
        if port_is & AHCI_PxIS_SDBS != 0 {
            let completed = port.ncq_inflight.load(Acquire)
                          & !port.regs.read_sact();
            // For each completed tag: retrieve the associated Bio from the
            // per-slot inflight table and call bio_complete() per the unified
            // completion API ([Section 15.2](#block-io-and-volume-management--bio-completion)).
            for tag in 0..32u8 {
                if completed & (1 << tag) != 0 {
                    let bio_ptr = port.slot_bios[tag as usize].swap(
                        core::ptr::null_mut(), AcqRel,
                    );
                    if !bio_ptr.is_null() {
                        let bio = unsafe { &mut *bio_ptr };
                        let status = port.slot_status[tag as usize].load(Acquire);
                        bio_complete(bio, status);
                    }
                    port.slot_completions[tag as usize].store(IDLE, Release);
                    port.ncq_inflight.fetch_and(!(1u32 << tag), Release);
                    // Identity tag/slot mapping: the slot was claimed in the
                    // `inflight` bitmask by alloc_slot(), so it must be
                    // released here too — otherwise slots leak until the
                    // port wedges with EAGAIN.
                    port.inflight.fetch_and(!(1u32 << tag), Release);
                }
            }
        }

        // Non-NCQ completion — D2H Register FIS received.
        if port_is & AHCI_PxIS_DHRS != 0 {
            let completed = port.inflight.load(Acquire)
                          & !port.regs.read_ci();
            // Same pattern: retrieve Bio, call bio_complete(bio, status),
            // clear inflight bit. Non-NCQ uses slot_completions tracking but
            // MUST use bio_complete() for the BioState CAS state machine.
            complete_non_ncq_slots(port, completed);
        }

        // Hot-plug event — port connect change.
        if port_is & AHCI_PxIS_PCS != 0 {
            handle_hotplug(port);
        }

        // Error conditions.
        if port_is & AHCI_PxIS_ERROR_MASK != 0 {
            handle_port_error(port, port_is);
        }

        // Clear handled interrupts.
        port.regs.write_is(port_is);
    }

    // Clear global IS.
    hba.regs.write_is(global_is);
    IrqReturn::Handled
}

const AHCI_PxIS_ERROR_MASK: u32 =
    (1 << 27) |  // IFS: interface fatal error
    (1 << 29) |  // HBFS: host bus fatal error
    (1 << 28) |  // HBDS: host bus data error
    (1 << 30);   // TFES: task file error status

15.4.12 Error Recovery

AHCI defines three error classes, each with a different recovery procedure:

Non-fatal errors (PxIS.INFS — interface non-fatal): Log the error. Clear PxSERR. No command retry needed — the link layer recovered automatically.

Fatal errors (PxIS.IFS, HBFS, HBDS — interface fatal, host bus fatal/data):

  1. Set error_state = Recovering. New submissions are blocked.
  2. Clear PxCMD.ST (stop command engine). Wait for PxCMD.CR = 0 (timeout 500ms).
  3. If PxCMD.CR doesn't clear: set PxCMD.CLO (Command List Override) if CAP.SCLO is supported, then retry. If CLO fails: perform COMRESET (write PxSCTL.DET = 1, wait 1ms, write PxSCTL.DET = 0).
  4. Clear PxSERR (write all-ones). Clear PxIS (write all-ones).
  5. Set PxCMD.FRE, then PxCMD.ST — restart the port.
  6. Re-identify the device (IDENTIFY DEVICE) to confirm it's still responsive.
  7. Read the NCQ error log via READ LOG EXT (log page 10h) to identify the failed command tag and error reason. For non-NCQ commands, read PxTFD.ERR directly.
  8. Classify in-flight commands by type:
     • Write commands: Cancel with -EIO. Retrying writes after a fatal error risks data corruption — the device may have partially written the data, and a retry would produce a second write with potentially different data ordering. The filesystem layer (journal or COW) is responsible for replaying writes with proper ordering guarantees.
     • Read commands: Retry up to 3 times. Read retries are safe because reads are idempotent — the device returns the same data regardless of how many times the read is issued.
     • Non-data commands (FLUSH, IDENTIFY): Retry once.
  9. Set error_state = Normal. Resume accepting new submissions.

Task file errors (PxIS.TFES — device reported error in TFD.STS.ERR):

  1. For NCQ: the device error log must be read via READ LOG EXT (log page 10h, "NCQ Command Error"). This identifies which tag failed and the error reason. Steps: stop port, issue READ LOG EXT (non-queued, slot 0), read the failing tag + error code, retry or fail that specific bio, restart NCQ for remaining tags.
  2. For non-NCQ: read PxTFD.ERR directly. The error register indicates the cause (ABRT, IDNF, UNC, etc.). Map to the appropriate errno:
     • UNC (uncorrectable data error) → EIO
     • IDNF (ID not found) → EIO (LBA out of range)
     • ABRT (command aborted) → EIO (retry once, then fail)
     • ICRC (interface CRC error) → retry (link issue)

Retry policy: Each bio gets up to 3 retries for transient errors (CRC, timeout). Permanent media errors (UNC) are reported immediately — no retry.
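The errno mapping and retry policy can be collapsed into one classification over the ATA error register. A sketch, assuming the standard error register bit positions (ABRT = bit 2, IDNF = bit 4, UNC = bit 6, ICRC = bit 7); the enum and function names are illustrative:

```rust
/// Disposition of a failed command, derived from the ATA error
/// register (D2H FIS byte 3 / PxTFD.ERR).
#[derive(Debug, PartialEq)]
enum ErrorAction {
    /// Permanent error — report -EIO immediately, no retry.
    FailNow,
    /// Transient error — retry within the bio's retry budget (3).
    Retry,
}

fn classify_ata_error(err: u8) -> ErrorAction {
    const ABRT: u8 = 1 << 2; // command aborted
    const IDNF: u8 = 1 << 4; // ID not found (LBA out of range)
    const UNC: u8 = 1 << 6;  // uncorrectable media error
    const ICRC: u8 = 1 << 7; // interface CRC error

    if err & UNC != 0 {
        ErrorAction::FailNow // media error: retrying won't help
    } else if err & IDNF != 0 {
        ErrorAction::FailNow // out-of-range LBA: deterministic failure
    } else if err & (ICRC | ABRT) != 0 {
        ErrorAction::Retry   // link noise or transient abort
    } else {
        ErrorAction::Retry   // unknown: treat as transient
    }
}
```

A real handler would also consult the per-bio retry counter before choosing Retry, downgrading to FailNow once the budget is exhausted.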

15.4.13 Hot-Plug

AHCI supports hot-plug detection via PxIS.PCS (Port Connect change) and PxIS.DMPS (Device Mechanical Presence):

  • Device insertion: PxSSTS.DET transitions to 3 (device present + communication established). The driver allocates command list/FIS buffers (if not pre-allocated), issues IDENTIFY DEVICE, and registers a new BlockDevice.
  • Device removal: PxSSTS.DET transitions to 0. The driver unregisters the BlockDevice, fails all in-flight bios with ENODEV, and releases DMA buffers. Active filesystem mounts on the device receive I/O errors — unmount is the user's responsibility.

15.4.14 ATAPI Passthrough

ATAPI devices (optical drives, tape) use the PACKET command (ATA 0xA0) to carry 12-byte or 16-byte SCSI CDBs:

  1. Build FisRegH2D with command = 0xA0.
  2. Write the SCSI CDB (e.g., READ(10), INQUIRY, TEST UNIT READY) into cmd_table.acmd[0..12].
  3. Set AhciCmdHeader.flags.A = 1 (ATAPI bit).
  4. PRDT carries data for data-in/data-out CDBs.
  5. The device responds with PIO Setup FIS (for PIO data) or D2H Register FIS (for non-data commands). Check sense data on error (REQUEST SENSE CDB).

ATAPI is exposed to userspace via the standard Linux SG_IO ioctl (Section 19.7) for CD/DVD burning tools (cdrecord, growisofs) and media players.
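The CDB written in step 2 is plain SCSI. As one concrete case, a READ(10) CDB (opcode 0x28, big-endian LBA in bytes 2-5, big-endian transfer length in blocks in bytes 7-8 per SBC) padded to the 12-byte ATAPI packet can be packed like this — a sketch with an illustrative helper name:

```rust
/// Build a READ(10) CDB padded to the 12-byte ATAPI PACKET size.
/// `lba` and `blocks` are in logical blocks; multi-byte fields are
/// big-endian per SCSI convention.
fn build_read10_cdb(lba: u32, blocks: u16) -> [u8; 12] {
    let mut cdb = [0u8; 12];
    cdb[0] = 0x28; // READ(10) opcode
    // Bytes 2-5: logical block address, big-endian.
    cdb[2..6].copy_from_slice(&lba.to_be_bytes());
    // Bytes 7-8: transfer length in blocks, big-endian.
    cdb[7..9].copy_from_slice(&blocks.to_be_bytes());
    cdb
}
```

The result goes into cmd_table.acmd[0..12] with the A bit set in the command header, as described above.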

15.4.15 Power Management

The AHCI driver supports four link power states and two device power states:

State | Wake latency | Triggered by
--- | --- | ---
Active | 0 | I/O activity
Partial | ~10 μs | ALPM: idle >5 ms (configurable)
Slumber | ~10 ms | ALPM: idle >100 ms (configurable)
DevSleep | ~20 ms | CAP2.SDS + PxDEVSLP: idle >1 s
Standby (device) | ~5-15 s | System suspend or explicit STANDBY IMMEDIATE
Sleep (device) | full reset | System hibernate (not commonly used)

Aggressive Link Power Management (ALPM): When PxCMD.ALPE is set, the HBA automatically transitions the link to Partial or Slumber after inactivity. The driver sets PxCMD.ASP = 1 for Slumber preference (deeper sleep). ALPM is enabled by default on battery-powered systems and disabled on servers (latency sensitivity). The policy is controlled by sysfs: /sys/class/scsi_host/hostN/link_power_management_policy — values: min_power, med_power_with_dipm, max_performance (Linux-compatible).

System suspend path (Section 7.9):

  1. Flush write cache: FLUSH CACHE EXT (0xEA).
  2. Standby: STANDBY IMMEDIATE (0xE0).
  3. Stop port: clear PxCMD.ST, wait for PxCMD.CR = 0.
  4. Disable FIS receive: clear PxCMD.FRE, wait for PxCMD.FR = 0.

System resume path: Reverse — enable FRE, start port (ST), re-identify device.

15.4.16 BlockDeviceOps Implementation

/// Per-port block device wrapper. One `AhciBlockDevice` is created per
/// AHCI port that has a device attached (detected during port probe).
/// Registered with the block layer via `register_block_device()`.
pub struct AhciBlockDevice {
    /// Reference to the AHCI port state (command list, FIS receive, NCQ state).
    pub port: Arc<AhciPort>,
    /// NUMA node closest to this controller's PCIe slot (for allocation affinity).
    /// u16 matches BlockDeviceInfo.numa_node width (supports 65535 nodes).
    pub numa_node: u16,
}

impl BlockDeviceOps for AhciBlockDevice {
    fn submit_bio(&self, bio: &mut Bio) -> Result<()> {
        if self.port.error_state.load(Acquire) != AhciPortErrorState::Normal as u8 {
            return Err(Error::IO); // Port in error recovery
        }
        match bio.op {
            BioOp::Read | BioOp::Write => {
                if self.port.ncq_capable {
                    self.port.submit_ncq(bio)
                } else {
                    self.port.submit_legacy_dma(bio)
                }
            }
            BioOp::Flush => {
                // Allocate a command slot and store the bio pointer so the
                // D2H FIS completion handler can map the slot back to this
                // bio and call bio_complete(). Without this, the flush bio's
                // StackWaiter would never be woken (same fix as NVMe F15-03).
                let slot = self.port.alloc_slot()?;
                self.port.slot_bios[slot as usize].store(bio as *mut Bio, Release);
                self.port.slot_status[slot as usize].store(0, Relaxed);
                self.port.submit_flush_with_slot(slot)
            }
            BioOp::Discard => {
                if self.port.supports_trim {
                    self.port.submit_trim(bio)
                } else {
                    Err(Error::NOSYS)
                }
            }
            BioOp::SecureErase => {
                if self.port.supports_sanitize {
                    self.port.submit_sanitize(bio)
                } else {
                    Err(Error::NOSYS)
                }
            }
            BioOp::WriteZeroes => Err(Error::NOSYS), // ATA has no write-zeroes
            BioOp::ZoneAppend => Err(Error::NOSYS),  // Not a zoned device
        }
    }

    fn flush(&self) -> Result<()> {
        self.port.submit_flush_sync()
    }

    fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()> {
        if !self.port.supports_trim { return Err(Error::NOSYS); }
        // DATA SET MANAGEMENT (0x06) with TRIM bit. Payload: array of
        // (LBA, count) pairs in 512-byte LBA Range Entry format.
        self.port.submit_trim_range(start_lba, len_sectors)
    }

    fn get_info(&self) -> BlockDeviceInfo {
        BlockDeviceInfo {
            logical_block_size: self.port.logical_sector_size,
            physical_block_size: self.port.physical_sector_size,
            capacity_sectors: self.port.capacity_sectors,
            max_segments: 248, // PRDT entries per command table (fills one 4KB page)
            // 1 MiB intentional constant (not PAGE_SIZE-dependent). AHCI PRDT entries
            // support up to 4 MiB each, but 1 MiB is a practical limit that works well
            // across all page sizes (4K/16K/64K) without exceeding DMA mapping budgets.
            max_bio_size: 1024 * 1024, // 1 MiB
            flags: {
                let mut f = BlockDeviceFlags::empty();
                if self.port.supports_trim { f |= BlockDeviceFlags::DISCARD; }
                if self.port.write_cache_enabled { f |= BlockDeviceFlags::FLUSH; }
                if self.port.supports_fua { f |= BlockDeviceFlags::FUA; } // LBA48 AND IDENTIFY word 84 bit 6
                // IDENTIFY word 217: nominal media rotation rate.
                // 0x0001 = SSD (non-rotating), any other non-zero value = RPM.
                if self.port.nominal_rotation_rate != 1 {
                    f |= BlockDeviceFlags::ROTATIONAL;
                }
                f
            },
            optimal_io_size: self.port.physical_sector_size,
            numa_node: self.numa_node,
        }
    }

    fn shutdown(&self) -> Result<()> {
        // Flush cache, standby, stop port.
        self.port.submit_flush_sync()?;
        self.port.submit_standby_immediate()?;
        self.port.stop()
    }
}

15.4.17 KABI Driver Manifest

[driver]
name = "ahci"
version = "1.0.0"
tier = 1
bus-type = "pci"

[match]
pci-class = "01:06:01"  # Mass Storage / SATA / AHCI 1.0

[capabilities]
dma = true
interrupts = "msi-x"    # Preferred; falls back to MSI, then legacy INTx
max-memory = "4MB"       # Per-port: 1K cmd_list + 256 FIS + 32×cmd_tables

[recovery]
crash-action = "reload"
state-preservation = true  # Replay in-flight bios on reload
max-reload-time-ms = 500

15.4.18 Design Decisions

| Decision | Rationale |
|---|---|
| Tier 1 (not Tier 2) | SATA is block-latency-sensitive; Ring 3 crossing adds ~5-15 μs per bio — unacceptable for HDD seek-bound workloads. |
| Pre-allocated command tables | No heap allocation on the I/O hot path. All 32 command tables allocated at probe time. |
| Identity tag-to-slot mapping | Avoids tag/slot translation overhead. NCQ depth ≤ NCS is guaranteed by spec. |
| NCQ by default | NCQ (FPDMA) is strictly better than legacy DMA for multi-outstanding I/O. Fall back to legacy DMA only for IDENTIFY, FLUSH, and error recovery. |
| 248 PRDT entries | Fills one 4KB page (128B header + 248 × 16B). Each entry addresses up to 4MB (DBC field), but block layer bio splitting caps practical I/O at ~1MB. Linux uses LIBATA_MAX_PRD = 128; UmkaOS uses 248 to maximize scatter-gather capacity within one page. |
| COMRESET as last resort | Port reset is expensive (~1-2s with device spin-up). Used only when CLO fails. |
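The 248-entry sizing can be verified arithmetically. A minimal sketch (constant names are illustrative, not the driver's actual identifiers): the AHCI command table begins with a 128-byte header (64-byte CFIS + 16-byte ACMD + 48 bytes reserved), and each PRDT entry is 16 bytes, so one 4 KiB page holds exactly 248 entries.

```rust
// Worked check of the PRDT sizing from the table above.
const CMD_TABLE_HEADER_BYTES: usize = 128; // CFIS(64) + ACMD(16) + reserved(48)
const PRDT_ENTRY_BYTES: usize = 16;
const PAGE_BYTES: usize = 4096;

// Entries that fit after the header in one page.
const MAX_PRDT_ENTRIES: usize = (PAGE_BYTES - CMD_TABLE_HEADER_BYTES) / PRDT_ENTRY_BYTES;

fn main() {
    assert_eq!(MAX_PRDT_ENTRIES, 248);
    // Header + 248 entries exactly fills the 4 KiB command table page.
    assert_eq!(
        CMD_TABLE_HEADER_BYTES + MAX_PRDT_ENTRIES * PRDT_ENTRY_BYTES,
        PAGE_BYTES
    );
}
```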

15.5 VirtIO-blk Driver Architecture

Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules. &self methods use interior mutability for mutation. Atomic fields use .store()/.load(). All #[repr(C)] structs have const_assert! size verification. See CLAUDE.md Spec Pseudocode Quality Gates.

The VirtIO-blk driver is a Tier 1 KABI driver that provides block storage access in virtualized environments. VirtIO-blk is the primary boot disk driver for QEMU/KVM, Firecracker, Cloud Hypervisor, and other VMMs using the VirtIO specification. This is a Phase 2 driver — required for the busybox boot demo on all 8 architectures.

Reference specification: VirtIO 1.2 (OASIS, July 2022), Section 5.2.

15.5.1 VirtIO Transport

The VirtIO-blk driver uses the common VirtIO transport layer defined in Section 11.3: the VirtioTransport trait, the VirtqDesc/VirtqAvail/VirtqUsed ring structures, the packed ring (VirtqPackedDesc), the feature negotiation protocol, and the common feature bits (VIRTIO_F_*). This section defines only the block-device-specific configuration, request format, and driver state.

VirtIO-blk is PCI device vendor 0x1AF4, device ID 0x1001 (transitional) or 0x1042 (modern, non-transitional). On MMIO transports (AArch64, ARMv7, RISC-V, PPC), the device type is 2.

15.5.2 Device Configuration Space

The VirtIO-blk device exposes a device-specific configuration structure:

/// VirtIO block device configuration (VirtIO 1.2 §5.2.4).
/// Read via VirtioTransport::read_config().
/// All multi-byte fields are little-endian per VirtIO 1.2 §2.4.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures including big-endian
/// PPC32 and s390x. Matches Linux's `__virtio_le16`/`__virtio_le32`/`__virtio_le64`.
///
/// `packed` matches Linux's `__attribute__((packed))` on `struct virtio_blk_config`.
/// Without it, compiler padding between fields with different alignment requirements
/// (e.g., the nested `VirtioBlkTopology` struct) could silently shift field offsets.
/// The current layout happens to be naturally aligned, but `packed` enforces this
/// invariant against future edits.
#[repr(C, packed)]
pub struct VirtioBlkConfig {
    /// Device capacity in 512-byte sectors.
    pub capacity: Le64,
    /// Maximum size of any single segment (if VIRTIO_BLK_F_SIZE_MAX).
    pub size_max: Le32,
    /// Maximum number of segments in a request (if VIRTIO_BLK_F_SEG_MAX).
    pub seg_max: Le32,
    /// Device geometry (if VIRTIO_BLK_F_GEOMETRY).
    pub geometry: VirtioBlkGeometry,
    /// Logical block size in bytes (if VIRTIO_BLK_F_BLK_SIZE). Default 512.
    pub blk_size: Le32,
    /// Topology information (if VIRTIO_BLK_F_TOPOLOGY).
    pub topology: VirtioBlkTopology,
    /// Write Cache Enable (if VIRTIO_BLK_F_CONFIG_WCE). 1 = writeback, 0 = writethrough.
    /// Linux/VirtIO canonical name: `wce`.
    pub wce: u8,
    /// Padding.
    pub _unused0: u8,
    /// Number of virtqueues (if VIRTIO_BLK_F_MQ). Default 1.
    pub num_queues: Le16,
    /// Maximum discard sectors (if VIRTIO_BLK_F_DISCARD).
    pub max_discard_sectors: Le32,
    /// Maximum discard segments (if VIRTIO_BLK_F_DISCARD).
    pub max_discard_seg: Le32,
    /// Discard sector alignment (if VIRTIO_BLK_F_DISCARD).
    pub discard_sector_alignment: Le32,
    /// Maximum write-zeroes sectors (if VIRTIO_BLK_F_WRITE_ZEROES).
    pub max_write_zeroes_sectors: Le32,
    /// Maximum write-zeroes segments (if VIRTIO_BLK_F_WRITE_ZEROES).
    pub max_write_zeroes_seg: Le32,
    /// Write-zeroes may unmap (if VIRTIO_BLK_F_WRITE_ZEROES).
    pub write_zeroes_may_unmap: u8,
    /// Padding (3 bytes to match Linux's `unused1[3]`).
    pub _unused1: [u8; 3],
    /// Maximum secure erase sectors (if VIRTIO_BLK_F_SECURE_ERASE, VirtIO 1.2+).
    pub max_secure_erase_sectors: Le32,
    /// Maximum secure erase segments (if VIRTIO_BLK_F_SECURE_ERASE).
    pub max_secure_erase_seg: Le32,
    /// Secure erase sector alignment (if VIRTIO_BLK_F_SECURE_ERASE).
    pub secure_erase_sector_alignment: Le32,
    /// Zoned device characteristics (if VIRTIO_BLK_F_ZONED).
    /// VirtIO 1.2 §5.2.6.2: 5 × Le32 + 1 × u8 + 3 bytes padding = 24 bytes.
    pub zoned: VirtioBlkZonedCharacteristics,
}
const_assert!(core::mem::size_of::<VirtioBlkConfig>() == 96);

/// Device geometry (VirtIO 1.2 §5.2.4).
/// Multi-byte fields are little-endian per VirtIO spec.
#[repr(C)]
pub struct VirtioBlkGeometry {
    pub cylinders: Le16,
    pub heads: u8,
    pub sectors: u8,
}
const_assert!(core::mem::size_of::<VirtioBlkGeometry>() == 4);

/// Block topology hints for alignment (VirtIO 1.2 §5.2.4).
/// Multi-byte fields are little-endian per VirtIO spec.
#[repr(C)]
pub struct VirtioBlkTopology {
    /// Number of logical blocks per physical block (log2).
    pub physical_block_exp: u8,
    /// Offset of first aligned logical block.
    pub alignment_offset: u8,
    /// Suggested minimum I/O size in logical blocks.
    pub min_io_size: Le16,
    /// Suggested optimal I/O size in logical blocks.
    pub opt_io_size: Le32,
}
// VirtIO spec: physical_block_exp(1)+alignment_offset(1)+min_io_size(2)+opt_io_size(4) = 8 bytes.
const_assert!(core::mem::size_of::<VirtioBlkTopology>() == 8);

/// Zoned block device characteristics (VirtIO 1.2 §5.2.6.2).
/// Present when VIRTIO_BLK_F_ZONED is negotiated.
/// All multi-byte fields are little-endian per VirtIO spec.
#[repr(C)]
pub struct VirtioBlkZonedCharacteristics {
    /// Zone size in sectors.
    pub zone_sectors: Le32,
    /// Maximum number of open zones.
    pub max_open_zones: Le32,
    /// Maximum number of active zones.
    pub max_active_zones: Le32,
    /// Maximum append sectors.
    pub max_append_sectors: Le32,
    /// Write granularity.
    pub write_granularity: Le32,
    /// Zoned model: 0=none, 1=host-aware, 2=host-managed.
    pub model: u8,
    /// Padding to 24 bytes.
    pub _pad: [u8; 3],
}
const_assert!(core::mem::size_of::<VirtioBlkZonedCharacteristics>() == 24);
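One subtlety worth making concrete: `capacity` is always expressed in 512-byte sectors, even when `blk_size` reports a larger logical block (VirtIO 1.2 §5.2.4). A minimal sketch with illustrative helper names (plain integers stand in for the Le* wrappers to keep it self-contained):

```rust
/// Capacity in bytes: the config `capacity` field is in 512-byte sectors
/// regardless of the negotiated logical block size.
fn capacity_bytes(capacity_sectors: u64) -> u64 {
    capacity_sectors * 512
}

/// Number of logical blocks at the device's reported `blk_size`.
fn logical_blocks(capacity_sectors: u64, blk_size: u32) -> u64 {
    capacity_bytes(capacity_sectors) / blk_size as u64
}

fn main() {
    // A 10 GiB disk: 20_971_520 sectors of 512 bytes.
    let sectors = 10 * 1024 * 1024 * 1024 / 512;
    assert_eq!(capacity_bytes(sectors), 10 * 1024 * 1024 * 1024);
    // With 4 KiB logical blocks the same disk has 2_621_440 blocks.
    assert_eq!(logical_blocks(sectors, 4096), 2_621_440);
}
```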

15.5.3 Feature Negotiation

Feature negotiation follows the VirtIO standard 3-step process:

  1. Device offers features: Driver reads 64-bit feature bitmap.
  2. Driver accepts subset: Driver writes back only the features it supports.
  3. Driver sets FEATURES_OK: Device validates; if cleared, negotiation failed.
/// VirtIO block feature bits (VirtIO 1.2 §5.2.3).
pub mod virtio_blk_features {
    /// Maximum size of any single segment is in `size_max`.
    pub const VIRTIO_BLK_F_SIZE_MAX: u64       = 1 << 1;
    /// Maximum number of segments in a request is in `seg_max`.
    pub const VIRTIO_BLK_F_SEG_MAX: u64        = 1 << 2;
    /// Disk-style geometry specified in `geometry`.
    pub const VIRTIO_BLK_F_GEOMETRY: u64       = 1 << 4;
    /// Device is read-only.
    pub const VIRTIO_BLK_F_RO: u64             = 1 << 5;
    /// Disk logical block size in `blk_size`.
    pub const VIRTIO_BLK_F_BLK_SIZE: u64       = 1 << 6;
    /// Cache flush command support (VIRTIO_BLK_T_FLUSH).
    pub const VIRTIO_BLK_F_FLUSH: u64          = 1 << 9;
    /// Device exports topology information in `topology`.
    pub const VIRTIO_BLK_F_TOPOLOGY: u64       = 1 << 10;
    /// Device can toggle its cache between writeback and writethrough.
    pub const VIRTIO_BLK_F_CONFIG_WCE: u64     = 1 << 11;
    /// Device supports multi-queue (num_queues virtqueues).
    pub const VIRTIO_BLK_F_MQ: u64             = 1 << 12;
    /// Device supports discard (VIRTIO_BLK_T_DISCARD).
    pub const VIRTIO_BLK_F_DISCARD: u64        = 1 << 13;
    /// Device supports write-zeroes (VIRTIO_BLK_T_WRITE_ZEROES).
    pub const VIRTIO_BLK_F_WRITE_ZEROES: u64   = 1 << 14;
    // Bit 15 is unassigned in both Linux UAPI and VirtIO 1.2/1.3 spec.
    /// Device supports secure erase (VIRTIO_BLK_T_SECURE_ERASE).
    pub const VIRTIO_BLK_F_SECURE_ERASE: u64   = 1 << 16;
    /// Device reports zoned block device characteristics.
    pub const VIRTIO_BLK_F_ZONED: u64          = 1 << 17;

    // Transport-level common feature bits (VIRTIO_F_*) are defined in
    // Section 11.4.3.1 — virtio_features module.
}

The UmkaOS driver always negotiates: VIRTIO_F_VERSION_1 (required for modern), VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_FLUSH (if offered), VIRTIO_BLK_F_TOPOLOGY (if offered), VIRTIO_BLK_F_MQ (if offered), VIRTIO_BLK_F_DISCARD (if offered), VIRTIO_BLK_F_WRITE_ZEROES (if offered), VIRTIO_F_INDIRECT_DESC (if offered), VIRTIO_F_EVENT_IDX (if offered), VIRTIO_F_RING_PACKED (if offered). Transport-level feature bits are defined in Section 11.3.
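The "accept subset" step reduces to a mask intersection plus a mandatory-bit check. A sketch under the assumption that features are plain u64 bitmaps (`DRIVER_SUPPORTED` and `accept_features` are illustrative names; only a few of the §15.5.3 bits are reproduced):

```rust
// Transport-level bit (Section 11.3) and a subset of the block feature bits.
const VIRTIO_F_VERSION_1: u64 = 1 << 32;
const VIRTIO_BLK_F_SEG_MAX: u64 = 1 << 2;
const VIRTIO_BLK_F_FLUSH: u64 = 1 << 9;
const VIRTIO_BLK_F_MQ: u64 = 1 << 12;

/// Everything this driver is willing to use.
const DRIVER_SUPPORTED: u64 =
    VIRTIO_F_VERSION_1 | VIRTIO_BLK_F_SEG_MAX | VIRTIO_BLK_F_FLUSH | VIRTIO_BLK_F_MQ;

/// Step 2 of negotiation: intersect the device's offer with driver support.
/// VIRTIO_F_VERSION_1 is mandatory — legacy-only devices are refused.
fn accept_features(offered: u64) -> Result<u64, &'static str> {
    if offered & VIRTIO_F_VERSION_1 == 0 {
        return Err("legacy-only device: VIRTIO_F_VERSION_1 not offered");
    }
    Ok(offered & DRIVER_SUPPORTED)
}

fn main() {
    // Device offers VERSION_1 + FLUSH + DISCARD (bit 13, not in our set here).
    let offered = VIRTIO_F_VERSION_1 | VIRTIO_BLK_F_FLUSH | (1 << 13);
    let accepted = accept_features(offered).unwrap();
    assert_eq!(accepted, VIRTIO_F_VERSION_1 | VIRTIO_BLK_F_FLUSH);
}
```

The driver then writes `accepted` back and sets FEATURES_OK; if the device clears the bit, negotiation failed and the device is rejected.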

15.5.4 Virtqueue Usage

The VirtIO-blk driver uses the split ring (VirtqDesc/VirtqAvail/VirtqUsed) or packed ring (VirtqPackedDesc) layouts defined in Section 11.3. Split ring is the default; packed ring (VIRTIO_F_RING_PACKED) is negotiated when the device offers it.

Virtqueue sizing: The driver queries max_queue_size() from the transport. Typical values: 128, 256, or 512 entries. The UmkaOS driver uses the device's maximum (no benefit to restricting it). All DMA regions are allocated as a single contiguous buffer with the alignment requirements specified in §11.4.3.1.

15.5.5 Request Format

All VirtIO-blk requests use a three-part descriptor chain:

/// VirtIO block request header — 16 bytes (device-readable).
/// All multi-byte fields are little-endian per VirtIO 1.2 §5.2.6.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct VirtioBlkReqHeader {
    /// Request type.
    pub req_type: Le32,
    /// I/O priority (class and priority, Linux `IOPRIO_PRIO_VALUE` encoding).
    /// Only meaningful if device supports prioritized I/O (currently advisory).
    pub ioprio: Le32,
    /// Starting sector (512-byte units regardless of logical block size).
    pub sector: Le64,
}
const_assert!(core::mem::size_of::<VirtioBlkReqHeader>() == 16);

/// VirtIO block request types.
#[repr(u32)]
pub enum VirtioBlkReqType {
    /// Read from device to guest memory.
    In          = 0,
    /// Write from guest memory to device.
    Out         = 1,
    /// Flush volatile cache. sector field is ignored.
    Flush       = 4,
    /// Discard sectors (if VIRTIO_BLK_F_DISCARD).
    Discard     = 11,
    /// Write zeroes (if VIRTIO_BLK_F_WRITE_ZEROES).
    WriteZeroes = 13,
    /// Secure erase (if VIRTIO_BLK_F_SECURE_ERASE).
    SecureErase = 14,
}

/// VirtIO block request status — 1 byte (device-writable).
/// Written by the device as the last byte of the used descriptor chain.
#[repr(u8)]
pub enum VirtioBlkStatus {
    /// Success.
    Ok       = 0,
    /// Device or driver error.
    IoErr    = 1,
    /// Request unsupported by device.
    Unsupp   = 2,
}

Descriptor chain layout for a read/write request:

| Descriptor | Direction | Contents |
|---|---|---|
| 0 (head) | Device-readable | VirtioBlkReqHeader (16 bytes) |
| 1..N | Read: device-writable; Write: device-readable | Data segments (from bio) |
| N+1 (tail) | Device-writable | Status byte (VirtioBlkStatus, 1 byte) |

For flush requests: head (16 bytes, type=Flush) + tail (1 byte status). No data descriptors.

For discard/write-zeroes: head + discard/write-zeroes segment descriptors (each VirtioBlkDiscardWriteZeroes, 16 bytes: sector, num_sectors, flags) + tail.
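The 16-byte segment described above can be sketched as a struct (plain integers stand in for the Le* wrappers to keep the example self-contained; the `flags` bit-0 meaning follows VirtIO 1.2 §5.2.6):

```rust
/// One discard/write-zeroes segment — 16 bytes, device-readable.
/// Fields are little-endian on the wire.
#[repr(C)]
struct VirtioBlkDiscardWriteZeroes {
    sector: u64,      // starting sector, 512-byte units
    num_sectors: u32, // number of sectors to discard/zero
    flags: u32,       // bit 0 = unmap (write-zeroes requests only)
}

fn main() {
    // Matches the 16-byte layout stated in the text.
    assert_eq!(core::mem::size_of::<VirtioBlkDiscardWriteZeroes>(), 16);
}
```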

15.5.6 Multi-Queue Support

When VIRTIO_BLK_F_MQ is negotiated, the device provides num_queues virtqueues. The driver creates one virtqueue per CPU (up to num_queues). I/O requests from a CPU are submitted to that CPU's virtqueue without any lock contention:

/// VirtIO-blk driver state.
pub struct VirtioBlkDevice {
    /// VirtIO transport handle (PCI or MMIO).
    /// `&'static dyn` instead of `Box<dyn>`: transport objects are registered
    /// once at boot (PCI) or device probe (MMIO) and live for the device's
    /// lifetime. No heap allocation needed on the hot path.
    pub transport: &'static dyn VirtioTransport,
    /// Negotiated features.
    pub features: u64,
    /// Per-queue state. One per virtqueue (1 for single-queue, up to num_queues for MQ).
    /// Warm-path allocation at device probe. Size = negotiated num_queues
    /// (from VirtIO config). Bounded by VirtIO spec: max 65535, practical QEMU 256.
    /// Each queue wrapped in SpinLock for interior mutability from &self.
    pub queues: Box<[SpinLock<VirtioBlkQueue>]>,
    /// Device capacity in 512-byte sectors.
    pub capacity: u64,
    /// Logical block size in bytes.
    pub blk_size: u32,
    /// Physical block exponent (log2 of physical/logical ratio).
    pub physical_block_exp: u8,
    /// Device is read-only.
    pub read_only: bool,
    /// Writeback mode (if CONFIG_WCE negotiated).
    pub writeback: bool,
}

/// Per-virtqueue state.
pub struct VirtioBlkQueue {
    /// Queue index (0-based).
    pub index: u16,
    /// Queue size (number of descriptors, power of 2).
    pub size: u16,
    /// DMA-coherent descriptor table (VirtqDesc defined in §11.4.3.1).
    pub desc_table: DmaBox<[VirtqDesc]>,
    /// DMA-coherent available ring (VirtqAvail defined in §11.4.3.1).
    pub avail_ring: DmaBox<VirtqAvail>,
    /// DMA-coherent used ring (VirtqUsed defined in §11.4.3.1).
    pub used_ring: DmaBox<VirtqUsed>,
    /// Free descriptor list — indices of available descriptors.
    /// Pre-populated at init time (all descriptors free).
    ///
    /// Capacity is `queue_size` (negotiated with the device at probe time,
    /// power-of-2, range 1-32768 per VirtIO spec 1.2 §2.7). The backing
    /// storage is allocated from slab at queue init (warm path) with
    /// `Box<[u16]>` of length `queue_size`. This avoids a fixed 512-entry
    /// cap that would silently drop descriptors on devices with larger
    /// queues (e.g., cloud VirtIO-blk with 1024 or 4096 queue depth).
    pub free_list: Box<[u16]>,
    /// Number of valid entries in `free_list` (0..=queue_size).
    pub free_count: u16,
    /// Last seen used ring index (for polling completions).
    pub last_used_idx: u16,
    /// In-flight request tracking: maps descriptor head index → Bio.
    /// Allocated with `queue_size` entries at init.
    ///
    /// # Safety
    ///
    /// Each `*mut Bio` is a borrow of a Bio owned by the block layer.
    /// Ownership contract: the block layer retains ownership; the driver
    /// holds a mutable borrow from `submit_bio()` until the corresponding
    /// used ring entry is consumed in `virtio_blk_complete()`. After
    /// completion, the driver sets the slot to `None` and calls
    /// `bio_end_io(bio, status)`, returning the borrow to the block layer.
    /// The raw pointer (rather than `&mut Bio`) is required because the Bio
    /// may be accessed from interrupt context (used ring polling) where
    /// Rust's borrow checker cannot track the lifetime across the async
    /// hardware boundary.
    ///
    /// Invariants:
    /// - Each `*mut Bio` is valid from `submit_bio()` until the
    ///   corresponding used ring entry is consumed and `bio_end_io()` is called.
    /// - The block layer guarantees the Bio remains allocated while inflight.
    /// - Only the owning VirtioBlkQueue may read/write a slot (per-queue lock
    ///   or per-CPU exclusivity ensures no data races).
    /// - The driver MUST NOT dereference a `*mut Bio` after calling
    ///   `bio_end_io()`. The slot MUST be set to `None` before any
    ///   subsequent access.
    pub inflight: Box<[Option<*mut Bio>]>,
}

Queue selection: submit_bio() selects the queue for the current CPU: queue_index = cpu_id % num_queues. No lock is needed because each CPU has its own queue. If num_queues < num_cpus, some CPUs share a queue (protected by a per-queue SpinLock).

15.5.7 I/O Submission

fn virtio_blk_submit(queue: &mut VirtioBlkQueue, bio: &mut Bio) -> Result<()> {
    // 1. Allocate descriptors: 1 (header) + N (data segments) + 1 (status) = N+2.
    let num_descs = 2 + bio.segments.len();
    if (queue.free_count as usize) < num_descs {
        // Queue full — the block layer requeues: the bio is placed on the
        // per-device dispatch list, and the device's completion IRQ kicks
        // requeue processing when descriptor space becomes available.
        // See [Section 15.2](#block-io-and-volume-management--eagain-requeue).
        return Err(Error::AGAIN);
    }

    // 2. Build header descriptor.
    let header_desc = queue.alloc_desc();
    queue.desc_table[header_desc].addr = header_dma_addr;
    queue.desc_table[header_desc].len = 16;
    queue.desc_table[header_desc].flags = VIRTQ_DESC_F_NEXT;

    // 3. Chain data descriptors.
    let mut prev = header_desc;
    for seg in &bio.segments {
        let data_desc = queue.alloc_desc();
        queue.desc_table[data_desc].addr = seg.page_phys + seg.offset as u64;
        queue.desc_table[data_desc].len = seg.len;
        queue.desc_table[data_desc].flags = if bio.op == BioOp::Read {
            VIRTQ_DESC_F_WRITE | VIRTQ_DESC_F_NEXT  // Device writes to buffer
        } else {
            VIRTQ_DESC_F_NEXT  // Device reads from buffer
        };
        queue.desc_table[prev].next = data_desc;
        prev = data_desc;
    }

    // 4. Chain status descriptor (1 byte, device-writable).
    // The preceding descriptor (last data segment, or the header when the
    // request has no data) already carries VIRTQ_DESC_F_NEXT from steps 2-3;
    // the status descriptor terminates the chain, so it gets WRITE but no NEXT.
    let status_desc = queue.alloc_desc();
    queue.desc_table[status_desc].addr = status_dma_addr;
    queue.desc_table[status_desc].len = 1;
    queue.desc_table[status_desc].flags = VIRTQ_DESC_F_WRITE; // No NEXT — end of chain
    queue.desc_table[prev].next = status_desc;

    // 5. Track in-flight.
    queue.inflight[header_desc] = Some(bio as *mut Bio);

    // 6. Add to available ring.
    let avail_idx = queue.avail_ring.idx;
    queue.avail_ring.ring[avail_idx as usize % queue.size as usize] = header_desc as u16;
    // Write memory barrier — ensure descriptor writes are visible before idx update.
    core::sync::atomic::fence(Release);
    queue.avail_ring.idx = avail_idx.wrapping_add(1);

    // 7. Notify device (doorbell kick).
    // With EVENT_IDX: only notify if device needs it.
    if needs_notification(queue) {
        queue.transport.notify(queue.index);
    }

    Ok(())
}
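The `needs_notification()` check in step 7 is the standard EVENT_IDX window test (VirtIO 1.2 §2.7.7.2, equivalent to Linux's `vring_need_event()`). A self-contained sketch:

```rust
/// True iff `event_idx` (the index the device asked to be notified at)
/// lies in the half-open window (old_idx, new_idx], computed with
/// wrapping u16 arithmetic so ring-index wraparound is handled correctly.
fn vring_need_event(event_idx: u16, new_idx: u16, old_idx: u16) -> bool {
    new_idx.wrapping_sub(event_idx).wrapping_sub(1) < new_idx.wrapping_sub(old_idx)
}

fn main() {
    // Device asked to be kicked at avail index 5; we just advanced 5 -> 6: notify.
    assert!(vring_need_event(5, 6, 5));
    // Device asked at 10; advancing 5 -> 6 does not cross it: suppress the kick.
    assert!(!vring_need_event(10, 6, 5));
}
```

Without EVENT_IDX negotiated, the driver falls back to checking the device's NO_NOTIFY flag (split ring) or notifying unconditionally.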

15.5.8 I/O Completion

Completions are processed in the interrupt handler or by polling. The bio_complete() free function (Section 15.2) performs CAS + extraction + dispatch, using the formal Completion primitive defined in Section 3.6:

fn virtio_blk_complete(queue: &mut VirtioBlkQueue) {
    // Read memory barrier — ensure we see device's writes to used ring.
    core::sync::atomic::fence(Acquire);

    // used_ring.idx is written by the device via DMA. read_volatile() prevents
    // the compiler from caching the value across loop iterations or reordering
    // the read relative to the Acquire fence above.
    while queue.last_used_idx != read_volatile(&queue.used_ring.idx) {
        let used_elem = &queue.used_ring.ring[
            queue.last_used_idx as usize % queue.size as usize
        ];
        let head_idx = used_elem.id as usize;

        // Read status byte from the last descriptor in the chain.
        let status = read_status_byte(queue, head_idx);

        // Complete the bio.
        if let Some(bio_ptr) = queue.inflight[head_idx].take() {
            let bio = unsafe { &mut *bio_ptr };
            let errno = match status {
                VirtioBlkStatus::Ok => 0,
                VirtioBlkStatus::IoErr => -EIO,
                VirtioBlkStatus::Unsupp => -ENOSYS,
            };
            // Unified completion API: CAS + extraction + dispatch in one call.
            // This eliminates the TOCTOU race from the previous mem::take
            // pattern where extraction preceded the CAS (SF-373).
            bio_complete(bio, errno);
        }

        // Free all descriptors in the chain.
        free_descriptor_chain(queue, head_idx);

        queue.last_used_idx = queue.last_used_idx.wrapping_add(1);
    }

    // Update used_event (the EVENT_IDX field at the tail of the avail ring,
    // driver-written) so the device knows when to interrupt us next.
    // avail_event (tail of the used ring) is the device-written counterpart
    // consulted by needs_notification() on the submit path.
    if has_event_idx(queue) {
        queue.avail_ring.used_event = queue.last_used_idx;
    }
}

15.5.9 Initialization Sequence

Follows the common VirtIO initialization protocol defined in Section 11.3, with block-specific steps:

  1. Discovery: Match PCI vendor 0x1AF4, device 0x1001/0x1042 (block), or MMIO magic value 0x74726976 ("virt") with device type 2.
  2. Steps 1-6: Standard VirtIO init (reset → acknowledge → driver → feature negotiation → FEATURES_OK → queue setup) per §11.4.3.1. Block-specific features from §15.5.3 are selected in the negotiation step.
  3. Queue setup: For each virtqueue (1 for single-queue, num_queues for MQ):
     a. Select queue (write queue index to queue_sel).
     b. Read max queue size.
     c. Allocate DMA buffers for desc, avail, used rings.
     d. Pre-populate free descriptor list.
     e. Write addresses to the device via setup_queue().
     f. Enable the queue.
  4. Read config: Read VirtioBlkConfig — capacity, blk_size, topology.
  5. DRIVER_OK: Set DRIVER_OK (bit 2) in status. Device is now live.
  6. Register with umka-block: Create BlockDevice with parsed config.

15.5.10 Crash Recovery

VirtIO-blk crash recovery follows the Tier 1 recovery protocol (Section 11.9):

  1. Fault detection: PCI error (SERR, AER), timeout (no completion within 30s), or device status DEVICE_NEEDS_RESET (bit 6).
  2. Quiesce: Block new submissions. Drain completion handler.
  3. Device reset: Write 0 to device status register. This resets all device state including virtqueues.
  4. Re-initialize: Repeat steps 2-5 of §15.5.9 (acknowledge, driver, features, queues, config, DRIVER_OK).
  5. Replay in-flight I/O: The block layer retains bios that were submitted but not completed. After re-initialization, these bios are resubmitted to the new virtqueues. Replay is safe because sector writes are idempotent: writing the same data to the same sector twice produces the same result. Writes may have been partially or fully executed by the device before the crash; replaying them is correct regardless. Non-idempotent operations (discard, write-same with side effects) are not replayed — they are failed with -EIO and reported to the block layer for upper-layer recovery.
  6. Resume: Set error_state = Normal. Accept new submissions.

Recovery time: ~10-50ms (dominated by device reset + re-negotiation). No device firmware to reload — VirtIO is a software device.
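Step 5's replay decision can be sketched as a pure function over the bio opcode (simplified stand-in types; the real block-layer `Bio`/`BioOp` are defined in Section 15.2):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum BioOp { Read, Write, Flush, Discard }

#[derive(PartialEq, Debug)]
enum Outcome { Resubmit, FailEio }

/// Replay filter applied after device re-initialization.
fn replay_action(op: BioOp) -> Outcome {
    match op {
        // Idempotent: re-executing a possibly partially completed request
        // produces the same on-disk state, so resubmission is always safe.
        BioOp::Read | BioOp::Write | BioOp::Flush => Outcome::Resubmit,
        // Non-idempotent (per the text): not replayed — failed with -EIO
        // and reported upward for upper-layer recovery.
        BioOp::Discard => Outcome::FailEio,
    }
}

fn main() {
    assert_eq!(replay_action(BioOp::Write), Outcome::Resubmit);
    assert_eq!(replay_action(BioOp::Discard), Outcome::FailEio);
}
```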

15.5.11 BlockDeviceOps Implementation

impl BlockDeviceOps for VirtioBlkDevice {
    fn submit_bio(&self, bio: &mut Bio) -> Result<()> {
        if self.read_only && bio.op == BioOp::Write {
            return Err(Error::ROFS);
        }
        let queue_idx = arch::current::cpu::id() % self.queues.len();
        // Each VirtioBlkQueue is wrapped in SpinLock for interior mutability
        // (submit_bio takes &self; queue mutation needs &mut through &self).
        // Uncontended in the per-CPU case (~5-10 ns); required for shared
        // queues when num_queues < num_cpus.
        let mut queue = self.queues[queue_idx].lock();
        match bio.op {
            BioOp::Read => queue.submit_request(VirtioBlkReqType::In, bio),
            BioOp::Write => queue.submit_request(VirtioBlkReqType::Out, bio),
            BioOp::Flush => {
                if self.features & VIRTIO_BLK_F_FLUSH != 0 {
                    // submit_flush builds a 2-descriptor chain
                    // (header with type=Flush + status byte; no data
                    // descriptors, per §15.5.5), stores the bio in
                    // inflight[header_desc] for completion matching, and
                    // submits to the available ring. The completion handler
                    // calls bio_complete() on the flush bio when the device
                    // signals used[header_desc].
                    queue.submit_flush(bio)
                } else {
                    // No volatile cache — flush is a no-op, complete immediately.
                    bio_complete(bio, 0);
                    Ok(())
                }
            }
            BioOp::Discard => {
                if self.features & VIRTIO_BLK_F_DISCARD != 0 {
                    queue.submit_discard(bio)
                } else {
                    Err(Error::NOSYS)
                }
            }
            BioOp::WriteZeroes => {
                if self.features & VIRTIO_BLK_F_WRITE_ZEROES != 0 {
                    queue.submit_write_zeroes(bio)
                } else {
                    Err(Error::NOSYS)
                }
            }
            BioOp::ZoneAppend => Err(Error::NOSYS),
        }
    }

    fn flush(&self) -> Result<()> {
        if self.features & VIRTIO_BLK_F_FLUSH != 0 {
            let queue_idx = arch::current::cpu::id() % self.queues.len();
            self.queues[queue_idx].lock().submit_flush_sync()
        } else {
            Ok(())
        }
    }

    fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()> {
        if self.features & VIRTIO_BLK_F_DISCARD == 0 {
            return Err(Error::NOSYS);
        }
        let queue_idx = arch::current::cpu::id() % self.queues.len();
        self.queues[queue_idx].lock().submit_discard_range(start_lba, len_sectors)
    }

    fn get_info(&self) -> BlockDeviceInfo {
        BlockDeviceInfo {
            logical_block_size: self.blk_size,
            physical_block_size: self.blk_size << self.physical_block_exp,
            capacity_sectors: self.capacity,
            max_segments: if self.features & VIRTIO_BLK_F_SEG_MAX != 0 {
                // seg_max (u32) from config offset 12, clamped to u16 range.
                // `as u16` truncation for values >65535 produces silently wrong
                // results (65536 truncates to 0). Use explicit clamping.
                let seg_max_u32 = self.transport.read_config(12, 4);
                let clamped = core::cmp::min(seg_max_u32, u16::MAX as u32) as u16;
                core::cmp::min(
                    clamped,
                    self.queues[0].size - 2,
                )
            } else {
                self.queues[0].size - 2
            },
            max_bio_size: 0, // Sentinel: 0 means "no explicit byte limit beyond
                             // segment count × page size". The block layer treats
                             // max_bio_size == 0 as unlimited (capped only by
                             // max_segments × PAGE_SIZE). See BlockDeviceInfo docs.
            flags: {
                let mut f = BlockDeviceFlags::empty();
                if self.features & VIRTIO_BLK_F_DISCARD != 0 { f |= BlockDeviceFlags::DISCARD; }
                if self.features & VIRTIO_BLK_F_FLUSH != 0 { f |= BlockDeviceFlags::FLUSH; }
                // VirtIO-blk has no FUA — use flush
                f
            },
            optimal_io_size: if self.features & VIRTIO_BLK_F_TOPOLOGY != 0 {
                // opt_io_size (Le32) is at config offset 28, NOT 24.
                // Offset 24 contains physical_block_exp(u8) + alignment_offset(u8)
                // + min_io_size(Le16). See VirtIO 1.2 §5.2.4 struct virtio_blk_config.
                self.blk_size * self.transport.read_config(28, 4) as u32
            } else {
                self.blk_size
            },
            numa_node: 0, // Virtual device — no NUMA affinity
        }
    }

    fn shutdown(&self) -> Result<()> {
        // Flush pending writes.
        self.flush()?;
        // Reset device.
        self.transport.write_status(0);
        Ok(())
    }
}

15.5.12 KABI Driver Manifest

[driver]
name = "virtio-blk"
version = "1.0.0"
tier = 1
bus-type = "pci"  # Also supports "platform" for MMIO transport

[match]
pci-vendor = "0x1AF4"
pci-device = ["0x1001", "0x1042"]  # Transitional and modern (non-transitional)

[match.mmio]
virtio-device-type = 2  # Block device

[capabilities]
dma = true
interrupts = "msi-x"  # Preferred; falls back to MSI, then INTx (PCI) or shared IRQ (MMIO)
max-memory = "2MB"     # Per-queue: desc + avail + used rings

[recovery]
crash-action = "reload"
state-preservation = true  # Replay in-flight bios on reload
max-reload-time-ms = 100

15.5.13 Design Decisions

| Decision | Rationale |
|---|---|
| Tier 1 (not Tier 2) | VirtIO-blk is the primary boot disk for all virtualized environments. Latency budget is tight (~5 μs per I/O round-trip in QEMU). Tier 2 ring-crossing adds ~5-15 μs — unacceptable. |
| Split ring as default, packed as upgrade | Split ring is universally supported. Packed ring (VIRTIO_F_RING_PACKED) is negotiated when available for better cache behavior. |
| One queue per CPU | Eliminates lock contention. Standard for modern VirtIO-blk (F_MQ). Falls back to single-queue with per-queue SpinLock. |
| Pre-allocated descriptor pools | All descriptors and tracking arrays allocated at probe time. Zero allocation on the I/O hot path. |
| No indirect descriptors by default | Indirect descriptors (VIRTIO_F_INDIRECT_DESC) add an extra DMA read. Only enabled for large bios (>8 segments) to save descriptor slots. |
| EVENT_IDX for notification suppression | Reduces VM exits (and thus I/O latency) by batching notifications. Essential for performance — reduces hypervisor round-trips by ~40%. |
| Both PCI and MMIO transports | PCI is the production path (x86, AArch64 servers). MMIO is required for embedded platforms (ARMv7, RISC-V, PPC) and QEMU -M virt. |

15.6 ext4 Filesystem Driver

Scope note: This section provides UmkaOS-specific ext4 filesystem driver specifications: journal modes, error handling, Linux compatibility constraints, and data structure layouts. The on-disk format specification for ext4 is defined by the upstream project and is not duplicated here — UmkaOS implements the same on-disk format bit-for-bit.

The ext4 driver implements the FileSystemOps and InodeOps traits defined in Section 14.1 (VFS layer). ext4 is used in server, workstation, embedded, and consumer contexts; it is not consumer-specific.

15.6.1.1 Evolvable/Nucleus Classification

Component Classification Rationale
JournalSuperblock, JournalHeader, JournalBlockTag, JournalCommitBlock on-disk structs Nucleus On-disk format compatibility with Linux ext4. Any change breaks cross-mount.
Journal in-memory struct fields and state machine Nucleus Transaction state machine correctness is a crash-consistency invariant.
Transaction lifecycle (T_RUNNING through T_FINISHED) Nucleus Ordering guarantees are required for durability.
Handle API (journal_start/journal_stop/journal_get_write_access) Nucleus Correctness contract with filesystem operations.
Recovery and replay algorithm (3-pass) Nucleus Must match Linux JBD2 for cross-mount compatibility.
Revoke record semantics Nucleus Freed-block replay hazard prevention is a correctness invariant.
Adaptive commit interval algorithm Evolvable Heuristic for commit timing. ML-tunable without affecting correctness.
commit_interval_ms bounds (100-5000 ms) Evolvable Policy choice for recovery time vs I/O overhead tradeoff.
Checkpoint thread scheduling priority (nice 5) Evolvable Policy: how urgently to reclaim journal space.
errors= mode selection Evolvable Policy: operator-configurable response to filesystem errors.

15.6.1.2 const_assert! Verification

All #[repr(C)] on-disk structs in this section have const_assert! size verification:

| Struct | Expected size (bytes) | const_assert! present |
|---|---|---|
| JournalSuperblock | 1024 | Yes |
| JournalBlockTag | 16 | Yes |
| JournalCommitBlock | 60 | Yes |

15.6.2 ext4

Use cases: Default Linux filesystem. Ubiquitous across servers, containers (overlayfs on ext4), embedded roots, VM images, CI/CD storage nodes, and most existing Linux deployments. UmkaOS must read/write ext4 volumes from day one for bare-metal Linux migration compatibility.

Tier: Tier 1 (in-kernel driver; no privilege boundary makes sense for a root filesystem that must be available before any domain infrastructure is up).

Journal modes (selected at mount time via data= option):

| Mode | What is journalled | Durability on crash |
|---|---|---|
| data=writeback | Metadata only | Stale data may appear in reallocated blocks |
| data=ordered (default) | Metadata only; data flushed before metadata commit | No stale data |
| data=journal | Metadata and data | Strongest; ~2× write amplification |

UmkaOS exposes these as mount flags via the FileSystemOps::mount() options string, consistent with Linux behavior. The VFS durability contract (Section 15.1) requires data=ordered or data=journal to satisfy O_SYNC/fsync guarantees; drivers must reject data=writeback if the volume is mounted as a root or journalled data store unless the operator explicitly overrides.

Key features the driver must implement:

- Extents (ext4_extent_tree): Mapping of 32-bit logical blocks to 48-bit physical blocks via a four-level B-tree embedded in the inode. Supports extents up to 128 MiB contiguous. Replaces the older indirect-block scheme (which must also be readable for old volumes without the extents feature flag).
- HTree directory indexing: dir_index feature flag. Directories stored as B-trees keyed by filename hash (half-MD4). Required for directories with more than ~10,000 entries; without it readdir degrades to O(n).
- 64-bit support: 64bit feature flag extends block count from 32 to 48 bits, enabling volumes >16 TiB. Required for modern datacenter deployments; the driver must handle both 32-bit and 64-bit superblocks.
- Inline data: Small files (≤60 bytes) stored directly in the inode body. Important for filesystems hosting millions of tiny files (container layers, npm caches).
- Fast commit (fast_commit feature, Linux 5.10+): Appends a small delta journal entry instead of a full transaction commit for common operations (rename, link, unlink). Reduces journal write amplification by 4–10× for metadata-heavy workloads.
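As an illustration of the extent lookup path, a minimal sketch of resolving a logical block against a depth-0 (inline) extent array. Field names follow the upstream on-disk extent format; map_block and its scaffolding are hypothetical, not the driver's actual API:

```rust
/// On-disk ext4 extent entry (leaf node). Layout per the upstream extent
/// format: a run of logical blocks mapped to a contiguous physical run.
#[repr(C)]
pub struct Ext4Extent {
    pub ee_block: u32,    // first logical block covered by this extent
    pub ee_len: u16,      // run length in blocks (high bit marks unwritten)
    pub ee_start_hi: u16, // high 16 bits of the 48-bit physical block
    pub ee_start_lo: u32, // low 32 bits of the 48-bit physical block
}

/// Resolve a logical block against a depth-0 (inline) extent array.
/// Returns the physical block number, or None for a hole.
pub fn map_block(extents: &[Ext4Extent], logical: u32) -> Option<u64> {
    // Extents are sorted by ee_block; a linear scan suffices for the
    // at-most-4 inline entries (a real tree walk binary-searches each node).
    for e in extents {
        let len = (e.ee_len & 0x7FFF) as u32; // mask the unwritten-extent bit
        if logical >= e.ee_block && logical < e.ee_block + len {
            let phys = ((e.ee_start_hi as u64) << 32) | e.ee_start_lo as u64;
            return Some(phys + (logical - e.ee_block) as u64);
        }
    }
    None
}
```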

Error handling: ext4 supports the standard errors= mount option (Section 14.1):

- errors=continue (default): Log the error, continue operation.
- errors=remount-ro: Remount the filesystem read-only (see FsErrorMode::RemountRo in Section 14.1 for the procedure). This is the recommended setting for data-critical volumes.
- errors=panic: Trigger a kernel panic. Only appropriate for root filesystems with automatic reboot and fsck.

The errors= value is persisted in the ext4 on-disk superblock field s_errors (__le16 at byte offset 0x3C). If not specified at mount time, the on-disk value is used. If neither mount option nor on-disk value is set, Continue is the default.
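The resolution order can be sketched as follows. The FsErrorMode variants mirror Section 14.1 and the numeric values are the upstream on-disk encoding (1 = continue, 2 = remount-ro, 3 = panic); effective_error_mode is an illustrative helper, not the driver's actual entry point:

```rust
/// ext4 error behavior, mirroring Section 14.1's FsErrorMode.
#[derive(Debug, PartialEq)]
pub enum FsErrorMode { Continue, RemountRo, Panic }

/// Resolve the effective error mode: mount option first, then the on-disk
/// superblock value, then the Continue default.
pub fn effective_error_mode(
    mount_opt: Option<FsErrorMode>,
    sb: &[u8; 1024],
) -> FsErrorMode {
    if let Some(m) = mount_opt {
        return m; // explicit mount option wins
    }
    // s_errors: __le16 at byte offset 0x3C of the superblock.
    let raw = u16::from_le_bytes([sb[0x3C], sb[0x3D]]);
    match raw {
        2 => FsErrorMode::RemountRo,
        3 => FsErrorMode::Panic,
        _ => FsErrorMode::Continue, // 1, 0 (unset), or unknown value
    }
}
```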

XFS does not use errors=; it has its own error handling configuration via sysfs (/sys/fs/xfs/<device>/error/). Btrfs always remounts read-only on metadata corruption (equivalent to RemountRo) and has no user-configurable error mode.

Crash recovery: Replay the ext4 journal (jbd2 compatible format) on mount. The VFS freeze/thaw interface (Section 14.1 freeze() / thaw()) is used for consistent snapshots (LVM thin, VM live migration).

Journal writeback error handling: When a writeback error occurs during a journal transaction:

1. The transaction is NOT committed — it remains open.
2. All dirty pages in the transaction are re-dirtied (preserving data for retry).
3. The error is propagated to fsync() callers via the ErrSeq mechanism (Section 4.6).
4. The journal retries the transaction on the next writeback cycle.
5. After 3 consecutive failures, the filesystem enters the configured error mode (default: RemountRo for journal errors).

Linux compatibility: UmkaOS's ext4 driver is wire-compatible with Linux's ext4. Volumes formatted with mkfs.ext4 on Linux are mountable by UmkaOS without conversion. The tune2fs -l feature list (FEATURE_COMPAT, FEATURE_INCOMPAT, FEATURE_RO_COMPAT) governs which features are required vs. optional; the driver rejects mount if any INCOMPAT bit is set that it does not understand.

15.6.2.1 JBD2 Journaling Subsystem

The ext4 filesystem delegates all crash-consistency guarantees to the JBD2 (Journaling Block Device 2) subsystem. UmkaOS's JBD2 implementation is on-disk format compatible with Linux's fs/jbd2/ — volumes journaled by Linux are recoverable by UmkaOS and vice versa. Internal improvements (adaptive commit interval, u64 transaction IDs) do not affect on-disk layout; they change only runtime behavior.

15.6.2.1.1 Transaction State Machine

Every metadata mutation is grouped into a transaction. At most one transaction is in T_RUNNING state at any time (exactly one, except for the brief handoff window while a commit begins); a second may be in T_COMMIT (being written to the journal). The state machine is:

T_RUNNING → T_LOCKED → T_FLUSH → T_COMMIT → T_FINISHED

| State | Description |
|---|---|
| T_RUNNING | Accepting new metadata modifications via journal handles. journal_start() attaches a handle to this transaction. Multiple handles may be active concurrently (one per in-flight filesystem operation). |
| T_LOCKED | No new handles accepted. The commit thread sets this state to drain active handles. Callers of journal_start() block until a new T_RUNNING transaction is created after the current one advances past T_LOCKED. |
| T_FLUSH | Data pages for all inodes in inode_list are being flushed to their final on-disk locations (ordered mode only). This ensures data is stable before metadata referencing it is committed. In writeback mode this state is a no-op pass-through; in journal mode, data blocks are written to the journal instead. |
| T_COMMIT | Journal descriptor blocks, metadata blocks, and the final commit block are being written to the journal device. The commit block carries a CRC32C checksum covering all descriptor and metadata blocks in the transaction. A FUA write (or FLUSH + write + FLUSH on devices without FUA) ensures the commit block is durable before the state advances. |
| T_FINISHED | Commit is complete and durable. The transaction is moved to the checkpoint list. fsync() callers waiting on commit_wq are woken. A new T_RUNNING transaction may now be created. |

State transitions are serialized by Journal::state_lock. The T_RUNNING → T_LOCKED transition is triggered by: (a) the periodic commit timer firing, (b) a synchronous fsync() forcing a commit, or (c) journal free space falling below the reservation threshold.
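The single-step invariant — a transaction only ever advances one state along the commit pipeline — is cheap to validate. A minimal sketch (names illustrative; the real driver stores the state as an AtomicU8 in Transaction::state):

```rust
/// JBD2 transaction states, in commit-pipeline order.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum TxnState { Running, Locked, Flush, Commit, Finished }

/// A transaction may only advance one step along the pipeline; any other
/// transition indicates a state-machine bug.
pub fn is_valid_transition(from: TxnState, to: TxnState) -> bool {
    use TxnState::*;
    matches!(
        (from, to),
        (Running, Locked) | (Locked, Flush) | (Flush, Commit) | (Commit, Finished)
    )
}
```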

15.6.2.1.2 Core Data Structures
/// In-memory representation of a JBD2 journal instance.
///
/// One `Journal` exists per mounted ext4 filesystem. The journal may reside
/// on the same block device as the filesystem (internal journal, inode 8) or
/// on a separate block device (external journal, specified via `journal_dev=`
/// mount option).
pub struct Journal {
    /// Block device backing the journal. For an internal journal this is the
    /// same device as the filesystem; for an external journal it is a separate
    /// `BlockDeviceHandle`.
    pub dev: BlockDeviceHandle,

    /// On-disk journal superblock (1024 bytes at journal block 0).
    /// Cached in memory; written back on checkpoint advance and clean unmount.
    pub sb: JournalSuperblock,

    /// Journal block number of the first un-checkpointed transaction.
    /// Advances when checkpoint frees journal space.
    pub head: u32,

    /// Journal block number of the next free block (write cursor).
    /// Wraps circularly within the journal.
    pub tail: u32,

    /// Number of free blocks remaining in the journal.
    /// `free = total_blocks - (tail - head)` modulo wrap.
    pub free: u32,

    /// Maximum metadata buffers a single transaction may accumulate before
    /// the commit thread forces a commit. Derived from journal size:
    /// `max_transaction_buffers = journal_blocks / 4` (same heuristic as Linux).
    pub max_transaction_buffers: u32,

    /// Adaptive commit interval in milliseconds. Range: 100–5000 ms.
    /// Adjusted at each commit based on handle start rate (see §Adaptive
    /// Commit Interval below).
    pub commit_interval_ms: AtomicU32,

    /// The currently active transaction accepting new handles.
    /// `None` only during the brief window between one transaction entering
    /// `T_LOCKED` and the next `T_RUNNING` transaction being created.
    pub running_transaction: Option<Arc<Transaction>>,

    /// The transaction currently being written to the journal.
    /// At most one transaction is in the commit pipeline at any time.
    pub committing_transaction: Option<Arc<Transaction>>,

    /// Oldest-first list of committed transactions whose metadata blocks
    /// have not yet been written to their final on-disk locations.
    /// Checkpoint frees journal space by flushing these.
    /// **Collection policy note**: IntrusiveList is acceptable here despite
    /// the general preference for XArray. Transactions form a strict FIFO
    /// order (checkpoint oldest-first), are never accessed by integer key,
    /// and each Transaction already embeds a `checkpoint_link` node. The
    /// list is O(1) insert (tail) and O(1) remove (head), matching the
    /// checkpoint access pattern exactly.
    pub checkpoint_transactions: IntrusiveList<Transaction>,

    /// Protects transaction state transitions (`T_RUNNING` → `T_LOCKED` etc.)
    /// and the `running_transaction` / `committing_transaction` fields.
    pub state_lock: Mutex<()>,

    /// Woken when a commit completes (`T_COMMIT` → `T_FINISHED`).
    /// `fsync()` callers sleep here after requesting a commit.
    pub commit_wq: WaitQueue,

    /// Woken when journal free space increases (after checkpoint).
    /// `journal_start()` callers sleep here when the journal is full.
    pub checkpoint_wq: WaitQueue,

    /// Journal block size in bytes (must equal filesystem block size).
    pub block_size: u32,

    /// Total number of usable journal blocks (excluding the superblock block).
    pub total_blocks: u32,

    /// Feature flags from the on-disk journal superblock.
    pub features: JournalFeatureFlags,
}

/// A single journal transaction.
///
/// Transactions are the unit of atomicity: either all metadata changes in a
/// transaction are replayed on recovery, or none are.
pub struct Transaction {
    /// Monotonically increasing transaction ID. u64 to avoid wrap within
    /// any operational lifetime (at 10,000 commits/sec, wraps after 58
    /// million years). The on-disk format stores only the low 32 bits
    /// (`t_tid` in the commit block); the full u64 is internal only.
    ///
    /// **Recovery disambiguation**: On journal replay, the kernel reads
    /// u32 `t_tid` values from commit blocks. The full u64 is
    /// reconstructed by tracking the high 32 bits across the recovery
    /// scan: start with `epoch = superblock.s_last_tid >> 32`, then for
    /// each commit block, if `t_tid < (prev_t_tid & 0xFFFF_FFFF)` the
    /// u32 has wrapped and `epoch += 1`. The reconstructed tid is
    /// `(epoch << 32) | t_tid`. This correctly handles up to one u32
    /// wrap per recovery scan (at 10,000 commits/sec, u32 wraps every
    /// ~4.97 days — well above the maximum replay window). The
    /// superblock's `s_last_tid` stores the full u64 for cross-mount
    /// continuity.
    ///
    /// // LONGEVITY: u32 on-disk tid wraps at ~4.97 days at 10K
    /// // commits/sec. Acceptable: recovery scans at most the journal
    /// // size (~128 MB / 4 KB = 32K blocks, hence at most ~32K
    /// // transactions = ~3.2 sec of writes at that rate), far below
    /// // one u32 period. The u64 internal counter never wraps in
    /// // practice.
    pub tid: u64,

    /// Current state of this transaction.
    /// Transitions: `T_RUNNING(0) → T_LOCKED(1) → T_FLUSH(2) → T_COMMIT(3) → T_FINISHED(4)`.
    pub state: AtomicU8,

    /// Number of active `JournalHandle` instances attached to this transaction.
    /// Decremented by `journal_stop()`. When this reaches zero in `T_LOCKED`
    /// state, the commit thread is woken to proceed to `T_FLUSH`.
    pub handle_count: AtomicI32,

    /// Total number of metadata buffers accumulated in this transaction.
    /// Used to enforce `max_transaction_buffers`.
    pub nr_buffers: u32,

    /// Metadata blocks to be written to the journal during commit.
    /// Each entry holds a reference to a kernel buffer and the on-disk
    /// block number it maps to. Bounded by `max_transaction_buffers`.
    pub metadata_list: ArrayVec<JournalBufferEntry, MAX_TRANSACTION_BUFFERS>,

    /// After commit: metadata blocks that must still be written to their
    /// final on-disk locations before this transaction's journal space can
    /// be reclaimed. Drained by the checkpoint mechanism.
    pub checkpoint_list: IntrusiveList<JournalBufferEntry>,

    /// Inodes with dirty data pages that must be flushed before metadata
    /// commit (ordered mode only). Populated by `journal_dirty_inode()`.
    /// Bounded by the number of unique inodes touched in one transaction
    /// (typically < 1000; uses a bounded Vec with documented maximum).
    /// InodeId: u64 -- see [Section 14.1](14-vfs.md#virtual-filesystem-layer--core-vfs-data-structures).
    pub inode_list: ArrayVec<InodeId, MAX_TRANSACTION_INODES>,

    /// Block numbers revoked by this transaction. A revoked block must NOT
    /// be replayed during recovery, even if an earlier transaction wrote it
    /// to the journal. This prevents replaying freed-and-reallocated blocks.
    ///
    /// Warm path (populated during truncate/unlink), bounded by the number
    /// of metadata blocks freed in one transaction.
    /// XArray per integer-key policy; presence of block number in the tree
    /// means "revoked". XArray<()> acts as a set (key = block number,
    /// value = unit type for presence-only semantics).
    pub revoke_table: XArray<()>,

    /// Intrusive list linkage for `Journal::checkpoint_transactions`.
    pub checkpoint_link: IntrusiveListLink,

    /// Wall-clock time of commit completion (for adaptive interval tuning).
    pub commit_time_ns: u64,

    /// Number of `journal_start()` calls during this transaction's
    /// `T_RUNNING` lifetime. Used for adaptive commit interval calculation.
    pub handle_starts: AtomicU64,
}

/// ext4-specific per-inode info, stored via `Inode.i_private`.
///
/// In Linux, this is `struct ext4_inode_info`. UmkaOS stores it as a
/// separate slab-allocated struct pointed to by `Inode.i_private: *mut ()`.
/// Access: `unsafe { &*(inode.i_private as *const Ext4InodeInfo) }`.
/// SAFETY: `i_private` is set by ext4's `alloc_inode()` and valid for
/// the inode's lifetime.
/// `#[repr(C)]` ensures deterministic field layout for debugging
/// (consistent offsets in core dumps) and const_assert compatibility.
/// Kernel-internal — not KABI.
#[repr(C)]
pub struct Ext4InodeInfo {
    /// Transaction ID of the last data-modifying operation on this inode.
    /// Used by `ext4_fsync()` to determine which journal transaction to
    /// force-commit. Updated in `ext4_write_end()` after marking pages dirty.
    pub i_datasync_tid: AtomicU64,
    /// Transaction ID of the last metadata-modifying operation.
    /// Used for full `fsync()` (not `fdatasync()`). Updated when inode
    /// metadata (timestamps, size, block pointers) changes.
    pub i_sync_tid: AtomicU64,
    /// ext4 inode flags (EXT4_EXTENTS_FL, EXT4_INLINE_DATA_FL, etc.).
    pub i_flags: u32,
    /// Extent tree depth (0 = inline extents in inode, >0 = B-tree).
    pub i_extent_depth: u16,
    /// Explicit padding.
    pub _pad: [u8; 2],
}

/// Per-handle reservation for a filesystem operation.
///
/// A `JournalHandle` is acquired via `journal_start()` and released via
/// `journal_stop()`. It represents one logical filesystem operation (e.g.,
/// one `unlink`, one `write` metadata update) that may dirty multiple
/// metadata blocks. The `nr_credits` field reserves journal space upfront
/// so the operation cannot deadlock mid-way for lack of journal space.
pub struct JournalHandle {
    /// The transaction this handle is attached to.
    pub transaction: Arc<Transaction>,

    /// Number of journal blocks reserved for this handle.
    /// Set by the caller at `journal_start()` based on the worst-case
    /// number of metadata blocks the operation may dirty. Common values:
    /// - `EXT4_DATA_TRANS_BLOCKS` (8): simple file write metadata
    /// - `EXT4_DELETE_TRANS_BLOCKS` (24): unlink/truncate
    /// - `EXT4_RESERVE_TRANS_BLOCKS` (4): quota update
    pub nr_credits: u32,
}

/// One metadata block tracked by a transaction.
pub struct JournalBufferEntry {
    /// Filesystem block number this buffer maps to.
    pub blocknr: u64,

    /// Copy of the metadata block contents at the time of journaling.
    /// This frozen copy is what gets written to the journal — not the
    /// live buffer, which may have been modified by a subsequent
    /// transaction.
    ///
    /// **Hot-path allocation note**: `frozen_data` is allocated from a
    /// dedicated `jbd2_frozen_slab` cache (block-size-aligned slab objects,
    /// one pool per superblock). The allocation occurs in
    /// `journal_get_write_access()` — the warm path (per metadata write,
    /// not per syscall). The slab is pre-populated at journal init with
    /// `min(journal.j_max_transaction_buffers, 1024)` entries. If the slab
    /// is exhausted under heavy metadata load, the allocation blocks on
    /// the slab mempool (same as Linux's `jbd2_slab_create` pools). The
    /// `Box` is freed to the slab when the transaction commits and the
    /// journal buffer is released.
    pub frozen_data: Option<Box<[u8]>>,

    /// Reference to the live kernel buffer (for checkpoint writeback).
    pub bh: BufferRef,

    /// Intrusive list linkage for `Transaction::checkpoint_list`.
    pub checkpoint_link: IntrusiveListLink,
}

/// Maximum metadata buffers per transaction. Derived from journal size at
/// mount time: `journal_blocks / 4`. This constant is the upper bound for
/// `ArrayVec` capacity; the actual limit is `Journal::max_transaction_buffers`.
///
/// **Collection policy (warm path)**: ArrayVec chosen to embed metadata_list
/// in the Transaction allocation (one heap alloc via `Arc<Transaction>`,
/// contiguous layout for cache locality). Trade: ~1 MB per Transaction
/// (16384 entries x ~64 bytes each) even if sparsely filled. Typical fill
/// under production workloads: <2000 entries (~128 KB used). Vec alternative
/// would add a second heap allocation but reduce waste for small transactions.
/// Transaction creation is warm-path (once per commit cycle, ~100ms-5s), so
/// either choice satisfies the collection policy. ArrayVec is preferred for
/// single-allocation simplicity and contiguous memory.
const MAX_TRANSACTION_BUFFERS: usize = 16384;

/// Maximum unique inodes with dirty data per transaction (ordered mode).
/// Empirically, even extreme workloads rarely exceed 4096 unique inodes
/// in a single 5-second commit window. If exceeded, the transaction is
/// committed early and a new one started.
const MAX_TRANSACTION_INODES: usize = 4096;
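The circular free-space invariant documented on Journal::free can be sketched as a standalone helper (hypothetical function name; the driver maintains the value incrementally rather than recomputing it):

```rust
/// Free blocks in a circular journal. `head` is the oldest live block,
/// `tail` is the write cursor; both wrap within `total` usable blocks.
/// Implements `free = total_blocks - (tail - head)` with circular wrap.
pub fn journal_free_blocks(head: u32, tail: u32, total: u32) -> u32 {
    // Forward distance from head to tail, accounting for wrap-around.
    let used = (tail as u64 + total as u64 - head as u64) % total as u64;
    total - used as u32
}
```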
15.6.2.1.3 Handle API

The handle API is the interface between the ext4 filesystem driver and the JBD2 journal. Every metadata-modifying filesystem operation brackets its work with journal_start() / journal_stop().

journal_start(journal, nr_credits) -> Result<JournalHandle>

Reserve nr_credits journal blocks and attach a new handle to the current T_RUNNING transaction. If the journal has insufficient free space, the caller blocks on checkpoint_wq until checkpoint frees space. If the running transaction is in T_LOCKED (being committed), the caller blocks until a new T_RUNNING transaction is created. Returns EROFS if the journal has been aborted due to I/O errors.

Concurrency: multiple handles may be active on the same transaction simultaneously — each filesystem operation (from different tasks) gets its own handle. The handle_count atomic tracks the number of active handles.

journal_stop(handle)

Release the handle. Decrements handle_count. If this is the last handle on a T_LOCKED transaction (handle_count reaches zero), wakes the commit thread to proceed with T_FLUSH. If the transaction is still T_RUNNING and the commit timer has not fired, no commit is triggered (the transaction remains open for more operations).
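The decrement-and-wake logic can be sketched with the atomics from the Transaction struct (handle_release is an illustrative model of the journal_stop() hot path, not the full implementation):

```rust
use std::sync::atomic::{AtomicI32, AtomicU8, Ordering::{AcqRel, Acquire}};

const T_LOCKED: u8 = 1;

/// Drop this handle's reference on its transaction. Returns true when the
/// commit thread must be woken: this was the last active handle and the
/// transaction is draining in T_LOCKED.
pub fn handle_release(handle_count: &AtomicI32, state: &AtomicU8) -> bool {
    // fetch_sub returns the previous value; compute the remaining count.
    let remaining = handle_count.fetch_sub(1, AcqRel) - 1;
    remaining == 0 && state.load(Acquire) == T_LOCKED
}
```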

journal_get_write_access(handle, bh) -> Result<()>

Mark a buffer for journaling. Creates a frozen copy of the buffer's current contents (the "before" image for undo logging in journal mode, or simply the snapshot that will be written to the journal). Must be called before modifying the buffer. If the buffer is already tracked by this transaction, this is a no-op.

journal_dirty_metadata(handle, bh) -> Result<()>

Add the buffer to the transaction's metadata_list. Called after the buffer has been modified. The buffer will be written to the journal during the commit phase. If the buffer was not previously registered via journal_get_write_access(), returns EINVAL (programming error in the filesystem driver).

journal_revoke(handle, blocknr) -> Result<()>

Record that blocknr must not be replayed during recovery. Used when a metadata block is freed (e.g., extent tree block freed during truncate). Adds the block number to the transaction's revoke_table. During commit, revoke records are written to the journal as revoke descriptor blocks.

journal_dirty_inode(handle, inode_id) -> Result<()>

Add an inode to the current transaction's inode_list for ordered-mode data flushing. Called from ext4_write_end() (the AddressSpaceOps::write_end implementation) after a data page is dirtied via the write path. The inode is added only once per transaction (idempotent — checks a per-inode I_DIRTY_DATASYNC flag against the current transaction ID).

fn journal_dirty_inode(
    handle: &JournalHandle,
    inode_id: InodeId,
) -> Result<()> {
    let txn = &handle.transaction;
    let inode = inode_lookup(inode_id);
    // Check if this inode is already registered for this transaction.
    // i_datasync_tid tracks the last transaction that dirtied this inode.
    let current_tid = txn.tid;
    if inode.i_datasync_tid.load(Acquire) == current_tid {
        return Ok(()); // Already in this transaction's inode_list.
    }
    // Set the datasync tid to mark this inode as dirty in this txn.
    inode.i_datasync_tid.store(current_tid, Release);
    // Add to the transaction's inode_list (bounded, triggers early
    // commit if MAX_TRANSACTION_INODES is reached).
    txn.inode_list.try_push(inode_id)
        .map_err(|_| IoError::new(Errno::ENOSPC))?;
    Ok(())
}

The call chain from write() to inode_list population:

1. generic_file_write_iter() calls mapping.ops.write_begin().
2. ext4_write_begin() calls journal_start() to open a handle.
3. generic_file_write_iter() copies user data into the page.
4. generic_file_write_iter() calls mapping.ops.write_end().
5. ext4_write_end() calls journal_dirty_inode(handle, inode_id).
6. ext4_write_end() calls journal_stop(handle).

At commit time, Transaction.inode_list contains all inodes with dirty data pages that must be flushed before the metadata commit proceeds (step 3 of the commit protocol below).

journal_force_commit(journal, tid) -> Result<()>

Force the transaction identified by tid through the full commit sequence and wait for T_FINISHED. Called by fsync() to ensure durability. If the requested transaction is already committed, returns immediately.

impl Journal {
    /// Force the specified transaction to reach T_FINISHED state.
    ///
    /// # Arguments
    /// - `tid`: Transaction ID to commit (from `inode.i_datasync_tid`).
    ///
    /// # Algorithm
    /// 1. If `tid` is older than the committing transaction: already done.
    /// 2. If `tid` matches the running transaction: trigger T_RUNNING -> T_LOCKED.
    /// 3. Wait on `commit_wq` until the transaction reaches T_FINISHED.
    ///
    /// # Locking
    /// Acquires `state_lock` briefly to read transaction state, then drops
    /// it before sleeping on `commit_wq`.
    pub fn force_commit(&self, tid: u64) -> Result<(), IoError> {
        let guard = self.state_lock.lock();
        // Check if the requested transaction is already committed.
        // committing_transaction has a higher tid than running_transaction.
        if let Some(ref committing) = self.committing_transaction {
            if tid < committing.tid {
                // Already fully committed (tid is in the past).
                return Ok(());
            }
            if committing.tid == tid {
                // Currently being committed — wait for it.
                drop(guard);
                self.commit_wq.wait_event(|| {
                    let g = self.state_lock.lock();
                    self.committing_transaction
                        .as_ref()
                        .map_or(true, |t| t.tid != tid
                            || t.state.load(Acquire) == T_FINISHED)
                });
                return Ok(());
            }
        }
        if let Some(ref running) = self.running_transaction {
            if running.tid == tid {
                // Trigger commit: T_RUNNING -> T_LOCKED.
                // Wake the commit thread to begin the commit sequence.
                running.state.store(T_LOCKED, Release);
                self.commit_thread_wq.wake_up();
                drop(guard);
                // Wait for the commit to complete.
                self.commit_wq.wait_event(|| {
                    let g = self.state_lock.lock();
                    self.committing_transaction
                        .as_ref()
                        .map_or(true, |t| t.tid != tid
                            || t.state.load(Acquire) == T_FINISHED)
                });
                return Ok(());
            }
            if tid < running.tid {
                // Already committed in a previous round.
                return Ok(());
            }
        }
        // No matching transaction — journal is idle or tid is in the future.
        Ok(())
    }
}
15.6.2.1.4 Commit Protocol

The commit thread (jbd2/<device>) runs as a kernel thread, one per mounted ext4 filesystem. It wakes on: (a) commit timer expiry, (b) explicit journal_force_commit(), or (c) journal free space pressure.

Ordered mode (default, data=ordered):

  1. Lock transaction: Atomically transition T_RUNNING → T_LOCKED. Create a new T_RUNNING transaction so new filesystem operations can proceed without blocking on the commit. New journal_start() callers attach to the new transaction.

  2. Drain handles: Wait for handle_count on the locked transaction to reach zero. Each journal_stop() decrements the count; the last one wakes the commit thread.

  3. Flush data (T_LOCKED → T_FLUSH): For each inode in inode_list, issue writeback for all dirty data pages. Wait for all data I/O to complete. This guarantees that data blocks referenced by the metadata being committed are already stable on disk — preventing the stale-data exposure that data=writeback permits.

Note on fsync interaction: When the commit is triggered by fsync(), filemap_write_and_wait_range() (step 1 of the fsync flow in Section 14.4) has already flushed and waited for the target file's dirty data pages. The T_FLUSH step is therefore a no-op for the fsync-triggering inode — its data is already stable. However, T_FLUSH is still necessary for the periodic commit path (not fsync-triggered), where data pages for OTHER inodes in the same transaction may still be dirty and need flushing before their metadata can be committed.

  4. Write journal blocks (T_FLUSH → T_COMMIT):
     a. Write one or more descriptor blocks containing an array of JournalBlockTag entries. Each tag identifies a metadata block by its filesystem block number and flags.
     b. Write the metadata blocks themselves, in the order described by the descriptor block tags.
     c. Write revoke descriptor blocks containing all block numbers in revoke_table.
     d. Write the commit block with a CRC32C checksum covering all descriptor, metadata, and revoke blocks in this transaction. The commit block stores the sequence number as the low 32 bits of tid.

  5. Flush commit block: Issue the commit block write with BioFlags::FUA | BioFlags::PERSISTENT (Force Unit Access for durability, PERSISTENT for Tier 1 crash recovery preservation (Section 15.2)). On devices that do not support FUA, issue BioFlags::PREFLUSH before the commit block write and a full cache flush after it. The commit block landing on stable storage is the atomicity point — if recovery sees a valid commit block, all metadata in the transaction is replayed; if the commit block is missing or has an invalid checksum, the entire transaction is discarded.

  6. Complete (T_COMMIT → T_FINISHED): Move all metadata buffers from metadata_list to checkpoint_list. Append the transaction to Journal::checkpoint_transactions. Wake all waiters on commit_wq. Update Journal::sb with the new sequence number and tail position.

Writeback mode (data=writeback): Step 3 is skipped entirely. Data may be written to disk in any order relative to metadata, which means a crash can expose stale data in recently-allocated blocks.

Journal mode (data=journal): In step 3, data blocks are written to the journal alongside metadata blocks (each data block gets a descriptor tag with JBD2_FLAG_DATA). This provides the strongest crash consistency (both data and metadata are atomic) at the cost of approximately 2x write amplification — every data block is written twice (once to journal, once to final location during checkpoint).
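The commit-block checksum is CRC32C (Castagnoli). A bitwise sketch for reference — production code would be table-driven or use hardware CRC instructions, and exactly which bytes JBD2 feeds into the checksum (UUID seeding, block ordering) follows the upstream format and is omitted here:

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78,
/// init 0xFFFFFFFF, final XOR 0xFFFFFFFF. Pass `crc = 0` to start a
/// fresh checksum; pass a previous result to chain over multiple blocks.
pub fn crc32c(mut crc: u32, data: &[u8]) -> u32 {
    crc = !crc;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all-ones if the low bit is set, else zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F63B78 & mask);
        }
    }
    !crc
}
```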

15.6.2.1.5 Checkpoint Mechanism

Checkpointing reclaims journal space by writing committed metadata blocks to their final on-disk locations. Until a transaction is checkpointed, the journal blocks it occupies cannot be reused.

Background checkpoint: A kernel thread (jbd2/<device>-ckpt, SCHED_OTHER, nice 5) runs periodically (every 5 seconds or when journal occupancy exceeds 50%). It walks checkpoint_transactions oldest-first:

  1. For each transaction in the checkpoint list, oldest first:
     a. For each JournalBufferEntry in its checkpoint_list:
        • If the buffer is still dirty (not yet written back by normal writeback), issue an async write to its final on-disk location.
        • If the buffer is clean (already written back), remove it from the checkpoint list.
     b. Wait for all issued writes to complete.
     c. If all buffers in the transaction are clean: remove the transaction from checkpoint_transactions and free its journal space.
  2. Advance Journal::head to the journal block following the last checkpointed transaction.
  3. Update the on-disk journal superblock with the new head position.
  4. Wake checkpoint_wq (unblocking any journal_start() callers waiting for free space).

Foreground checkpoint: When journal_start() discovers that Journal::free is less than the requested nr_credits, it triggers a synchronous foreground checkpoint in the calling task's context. This walks the same checkpoint list but blocks the caller until sufficient space is freed. RT tasks that call fsync() and need journal space run the foreground checkpoint at RT priority (inheriting the calling task's priority), preventing priority inversion where an RT fsync() waits for the nice-5 background thread. If even after checkpointing all eligible transactions the journal is still too full (because in-flight commits occupy the space), the caller sleeps on checkpoint_wq until the committing transaction completes.

Checkpoint ordering: Transactions must be checkpointed in order. A newer transaction cannot be checkpointed before an older one, because advancing the journal head past an uncheckpointed transaction would make that transaction unrecoverable after a crash.
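The ordering invariant can be sketched as a small model (the `Txn` type and `advance_head` name are illustrative, not the actual UmkaOS structures): the journal head advances only past the contiguous prefix of fully clean transactions, never skipping a dirty one.

```rust
/// Illustrative model of checkpoint ordering. `first_block` is the journal
/// block where the transaction begins; `dirty` counts buffers not yet
/// written back to their final locations.
struct Txn {
    first_block: u64,
    dirty: usize,
}

/// Returns the new journal head: the start of the oldest transaction that
/// still has dirty buffers, or `journal_tail` if every transaction is clean.
/// The head never skips past an uncheckpointed transaction, because doing so
/// would make that transaction unrecoverable after a crash.
fn advance_head(txns: &[Txn], journal_tail: u64) -> u64 {
    for t in txns {
        if t.dirty > 0 {
            return t.first_block; // stop: this txn is not yet checkpointed
        }
    }
    journal_tail // all transactions clean; head catches up to the tail
}
```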

15.6.2.1.6 Recovery and Replay Algorithm

On mount after an unclean shutdown (journal superblock indicates s_start != 0), the JBD2 recovery algorithm replays committed transactions to restore filesystem consistency. The algorithm has three passes:

Pass 1 — Scan (discover transaction boundaries):

  1. Read the journal superblock. Extract s_start (first block of the oldest un-checkpointed transaction) and s_sequence (its expected sequence number).
  2. Scan forward from s_start, wrapping circularly:
     a. Read each block. Check whether it is a descriptor block (magic JBD2_MAGIC_NUMBER and blocktype JBD2_DESCRIPTOR_BLOCK).
     b. If it is a descriptor block: verify the sequence number matches the expected value. If so, record the descriptor and skip over the metadata blocks it describes.
     c. If it is a commit block: verify the sequence number and CRC32C checksum. If valid, the transaction is complete. Increment the expected sequence number.
     d. If it is a revoke block: record it for Pass 2.
     e. If the block is not a valid journal block (wrong magic or sequence number) or the commit block checksum fails: stop scanning. All transactions up to the last valid commit block are replayable.

Pass 2 — Revoke table construction:

  1. Collect all revoke records from all complete transactions discovered in Pass 1 into a single hash table keyed by (block number, sequence number).
  2. A revoke record for block B in transaction T means: "do not replay any write to block B from transaction T or any earlier transaction."

Pass 3 — Replay:

  1. For each complete transaction (oldest to newest):
     a. For each metadata block described in the transaction's descriptor blocks:
        • Look up the block number in the revoke table. If revoked by this or a later transaction, skip it.
        • Otherwise, read the journaled copy from the journal and write it to the block's final on-disk location.
  2. After all transactions are replayed:
     a. Clear the journal by writing a new journal superblock with s_start = 0.
     b. The filesystem is now consistent.

Recovery correctness invariant: Because the commit block is the atomicity point (written with FUA), and because replay only processes transactions with valid commit blocks, recovery never applies a partial transaction. Revoke records prevent stale metadata from overwriting blocks that were freed and reallocated in a later transaction — without revoke, truncating a file and then creating a new file that reuses the same blocks could cause recovery to overwrite the new file's metadata with the old file's freed metadata.
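The Pass 2 / Pass 3 interaction can be modeled with a short sketch (names hypothetical; the (block, sequence) key from Pass 2 is collapsed here to the newest revoking tid per block): a journaled write of block B from transaction T is replayed only if no revoke from T or a newer transaction covers B.

```rust
use std::collections::HashMap;

/// Simplified revoke table: block number -> tid of the newest transaction
/// that revoked it. Pass 3 replays a journaled write of `block` from
/// transaction `tid` only if no revoke from `tid` or later covers it.
fn should_replay(revokes: &HashMap<u64, u64>, block: u64, tid: u64) -> bool {
    match revokes.get(&block) {
        // "Do not replay any write from transaction T or earlier":
        // only strictly newer writes survive the revoke.
        Some(&revoked_tid) => tid > revoked_tid,
        None => true, // never revoked: always replay
    }
}
```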

Fast commit replay: When JBD2_FEATURE_INCOMPAT_FAST_COMMIT is set, the fast commit area occupies the last s_num_fc_blks blocks of the journal (superblock field at offset 0x54). During recovery, after the standard 3-pass replay completes, the recovery algorithm scans the fast commit area for delta records. Each delta encodes a single metadata change (inode update, extent add/remove, directory entry link/unlink). Deltas are applied in sequence order. If a delta's parent_tid does not match the last replayed transaction's tid, the delta is skipped — it belongs to an incomplete fast commit cycle whose parent transaction was not committed. After all valid deltas are applied, the journal is cleared normally (s_start = 0).

15.6.2.1.7 Revoke Records

Revoke records solve the freed-block replay hazard:

  1. Transaction T1 writes metadata block B (e.g., an extent tree node).
  2. Transaction T2 frees block B (truncate) and allocates it for a different purpose (e.g., a data block for a new file). T2 records a revoke for B.
  3. Crash occurs after T2 commits but before T1 is checkpointed.
  4. Recovery sees T1's write to block B in the journal. Without revoke, it would replay T1's stale extent tree node over the new file's data block, silently corrupting the new file.
  5. With revoke: recovery checks the revoke table, finds B revoked by T2 (which is newer than T1), and skips the replay. The new file's data is preserved.

On-disk format: Revoke records are written to the journal as revoke descriptor blocks during commit. Each revoke block contains: - A block header (JBD2_MAGIC_NUMBER, blocktype JBD2_REVOKE_BLOCK, sequence number). - A r_count field indicating the number of bytes of revoke data. - An array of 8-byte block numbers (when JBD2_FEATURE_INCOMPAT_64BIT is set) or 4-byte block numbers (legacy 32-bit journals).

The revoke table is transient — it exists only during recovery. Normal operation does not consult revoke records; they are written to the journal during commit and read back only during replay.

15.6.2.1.8 Adaptive Commit Interval (UmkaOS Improvement)

Linux uses a fixed 5-second commit interval (commit=5 mount option). UmkaOS replaces this with an adaptive algorithm that bounds both recovery time (by committing more frequently under load) and I/O overhead (by deferring commits when idle):

| Condition | Commit interval | Rationale |
|---|---|---|
| High metadata rate (>100 journal_start() calls/sec) | 100 ms | Bound worst-case recovery replay to ~100 ms of transactions |
| Moderate rate (10–100 starts/sec) | Linear interpolation: 100–5000 ms | Smooth transition avoids oscillation |
| Low rate (<10 starts/sec) | 5000 ms | Minimize journal I/O for mostly-idle filesystems |
| Idle (0 handles for >1 second) | Immediate commit | Minimize window of dirty uncommitted metadata |

The algorithm samples Transaction::handle_starts at each commit and stores the result in Journal::commit_interval_ms (an AtomicU32). The commit timer re-arms itself with the new interval after each commit.
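The policy in the table above can be sketched as a pure function (the function name and the 0 ms encoding of the idle case are illustrative; the real policy lives in the commit timer re-arm path):

```rust
/// Adaptive commit interval in milliseconds, per the policy table.
/// `starts_per_sec` is the sampled journal_start() rate over the last
/// commit window; the idle case is encoded here as an immediate (0 ms)
/// commit.
fn commit_interval_ms(starts_per_sec: u32) -> u32 {
    match starts_per_sec {
        0 => 0,               // idle: commit immediately
        r if r >= 100 => 100, // high rate: bound recovery replay to ~100 ms
        r if r < 10 => 5000,  // low rate: minimize journal I/O
        // Moderate rate: linear interpolation from (10 starts/sec, 5000 ms)
        // down to (100 starts/sec, 100 ms).
        r => 5000 - (r - 10) * (5000 - 100) / (100 - 10),
    }
}
```

The interpolation endpoints join continuously with the high- and low-rate clamps, which is what keeps the interval from oscillating as the rate crosses a threshold.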

Override: The commit=N mount option forces a fixed interval (in seconds), disabling adaptive behavior. This provides Linux-compatible behavior for workloads that depend on a predictable commit cadence.

Recovery time bound: At the highest commit rate, worst-case recovery replays at most ~100 ms of transactions (bounded by commit interval × transaction size). At the default Linux interval of 5 seconds, recovery may need to replay up to 5 seconds of metadata mutations — on a busy database server this can mean gigabytes of journal replay.

15.6.2.1.9 On-Disk Journal Format

The on-disk format is byte-identical to Linux JBD2 for volume interoperability. UmkaOS must read journals written by Linux and vice versa.

Journal superblock (1024 bytes at journal block 0):

/// On-disk journal superblock. Layout matches Linux `journal_superblock_s`
/// exactly (1024 bytes).
///
/// **JBD2 on-disk format is big-endian** (defined by ext3 legacy on
/// SPARC/PA-RISC). All multi-byte integer fields use `Be32`/`Be64` types
/// ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) to enforce
/// correct byte-order conversion on all eight supported architectures.
/// On big-endian platforms (PPC32, s390x), `Be32::to_ne()` is a no-op;
/// on little-endian platforms (x86-64, AArch64, ARMv7, RISC-V, PPC64LE,
/// LoongArch64), it performs a byte-swap.
// kernel-internal, not KABI
#[repr(C)]
pub struct JournalSuperblock {
    // --- Static information (set at journal creation) ---
    /// Header: magic (0xC03B3998), blocktype (4 = superblock v2), sequence.
    pub header: JournalHeader,
    /// Journal device block size in bytes. Must equal filesystem block size.
    pub s_blocksize: Be32,
    /// Total number of blocks in the journal (including superblock block).
    pub s_maxlen: Be32,
    /// First usable block in the journal (usually 1, after superblock).
    pub s_first: Be32,

    // --- Dynamic information (updated on checkpoint / clean unmount) ---
    /// Sequence number of the first transaction in the log.
    /// 0 means the journal is clean (no recovery needed).
    pub s_sequence: Be32,
    /// Block number of the first transaction's first block in the log.
    /// 0 when journal is clean.
    pub s_start: Be32,

    /// Error number from a previous abort (0 = no error).
    pub s_errno: Be32,

    // --- Feature flags (superblock v2 only) ---
    /// Compatible feature flags (journal can be mounted even if unknown bits set).
    pub s_feature_compat: Be32,
    /// Incompatible feature flags (journal must not be mounted if unknown bits set).
    pub s_feature_incompat: Be32,
    /// Read-only compatible feature flags.
    pub s_feature_ro_compat: Be32,

    /// UUID of this journal (128-bit). Byte array — no endianness conversion.
    pub s_uuid: [u8; 16],
    /// Number of filesystems sharing this journal (0 or 1 for ext4).
    pub s_nr_users: Be32,
    /// Location of the dynamic superblock copy.
    pub s_dynsuper: Be32,

    /// Maximum number of blocks per transaction.
    pub s_max_transaction: Be32,
    /// Maximum number of data blocks per transaction.
    pub s_max_trans_data: Be32,

    /// Checksum type (1 = CRC32, 2 = MD5, 3 = SHA1, 4 = CRC32C).
    /// ext4 uses CRC32C (4) exclusively since Linux 3.5+. Single byte — no endianness.
    pub s_checksum_type: u8,
    pub s_padding2: [u8; 3],
    /// Number of fast commit blocks (offset 0x54). Required by
    /// JBD2_FEATURE_INCOMPAT_FAST_COMMIT to determine the fast commit
    /// area boundaries.
    pub s_num_fc_blks: Be32,
    /// Block number of the head of the log (offset 0x58). Used for clean
    /// unmount optimization — avoids full journal scan on mount.
    pub s_head: Be32,
    /// Padding to 1024 bytes.
    pub s_padding: [Be32; 40],
    /// CRC32C of this superblock (with this field set to 0 during computation).
    pub s_checksum: Be32,

    /// UUIDs of filesystems sharing this journal. Byte array — no endianness.
    pub s_users: [u8; 768],
}
const_assert!(core::mem::size_of::<JournalSuperblock>() == 1024);

Journal block header (common to all journal block types):

/// Common header at the start of every journal metadata block.
/// All fields are big-endian on disk (JBD2 legacy format).
#[repr(C)]
pub struct JournalHeader {
    /// Magic number: `JBD2_MAGIC_NUMBER` (0xC03B3998), stored big-endian.
    pub h_magic: Be32,
    /// Block type (see `JBD2_DESCRIPTOR_BLOCK` etc.).
    pub h_blocktype: Be32,
    /// Transaction sequence number (low 32 bits of `Transaction::tid`).
    pub h_sequence: Be32,
}
// On-disk JBD2 format: h_magic(4) + h_blocktype(4) + h_sequence(4) = 12 bytes.
const_assert!(core::mem::size_of::<JournalHeader>() == 12);

/// Journal block types.
pub const JBD2_DESCRIPTOR_BLOCK: u32 = 1;
pub const JBD2_COMMIT_BLOCK: u32 = 2;
pub const JBD2_SUPERBLOCK_V1: u32 = 3;
pub const JBD2_SUPERBLOCK_V2: u32 = 4;
pub const JBD2_REVOKE_BLOCK: u32 = 5;

Descriptor block (precedes a sequence of metadata blocks):

/// Tag describing one metadata block in a descriptor block.
///
/// **V3 layout only** (`JBD2_FEATURE_INCOMPAT_CSUM_V3` +
/// `JBD2_FEATURE_INCOMPAT_64BIT`). Matches Linux's `struct journal_block_tag3_s`.
///
/// UmkaOS does not support the V2 tag layout (`journal_block_tag_s`, 12 bytes
/// with Be16 checksum and Be16 flags). V2 journals are rejected at mount time
/// with `EUCLEAN` — Linux forcibly upgrades V2→V3 since kernel 3.18 (2014).
/// Any ext4 volume mounted read-write by any Linux in the last 12 years already
/// has V3. If a V2-only volume is encountered:
///   `return Err(EUCLEAN)` with diagnostic:
///   "JBD2 checksum version 2 not supported; mount with Linux kernel to upgrade."
///
/// An additional 16 bytes for UUID if `JBD2_FLAG_SAME_UUID` is NOT
/// set (first tag only, appended after the tag).
///
/// `journal_tag_bytes()` always returns 16 (no V2 conditional).
///
/// All multi-byte fields are big-endian on disk (JBD2 legacy format).
#[repr(C)]
pub struct JournalBlockTag {
    /// Filesystem block number (low 32 bits).
    pub t_blocknr: Be32,
    /// Flags (`JBD2_FLAG_ESCAPE`, `JBD2_FLAG_SAME_UUID`, etc.).
    /// V3 widens this from Be16 (V2) to Be32. Upper 16 bits are reserved
    /// and must be zero on write, ignored on read.
    pub t_flags: Be32,
    /// Filesystem block number (high 32 bits). Always present (UmkaOS
    /// requires `JBD2_FEATURE_INCOMPAT_64BIT`).
    pub t_blocknr_high: Be32,
    /// Full CRC32C checksum of the journaled block. V3 widens from
    /// 16-bit (CSUM_V2) to full 32-bit for stronger integrity.
    pub t_checksum: Be32,
}
// JournalBlockTag V3: t_blocknr(4) + t_flags(4) + t_blocknr_high(4) +
// t_checksum(4) = 16 bytes.
const_assert!(core::mem::size_of::<JournalBlockTag>() == 16);

/// Tag flag bits. Values match Linux `include/linux/jbd2.h`.
/// Note: in V3 layout, `t_flags` is Be32 but only the low 16 bits
/// carry defined flags. Upper 16 bits are reserved (zero on write).
pub const JBD2_FLAG_ESCAPE: u32     = 0x01; // block content has JBD2_MAGIC at offset 0; escaped
pub const JBD2_FLAG_SAME_UUID: u32  = 0x02; // same UUID as previous tag (omit UUID field)
pub const JBD2_FLAG_DELETED: u32    = 0x04; // block deleted by this transaction
pub const JBD2_FLAG_LAST_TAG: u32   = 0x08; // last tag in this descriptor block

Commit block (marks the end of a transaction):

/// Commit record written as the final block of each transaction.
/// The CRC32C in this block covers all descriptor blocks, metadata blocks,
/// AND revoke blocks in the transaction. A valid commit block = atomic commit
/// point. The checksum is computed incrementally: each block (descriptor,
/// metadata, or revoke) is fed into the running CRC32C as it is written to
/// the journal. The final CRC32C is stored in `h_chksum[0]`. During recovery,
/// the journal replayer recomputes the CRC32C over all blocks between the
/// descriptor block and the commit block (inclusive of revoke blocks) and
/// compares against `h_chksum[0]`; a mismatch means the transaction is
/// incomplete and is discarded.
#[repr(C)]
pub struct JournalCommitBlock {
    /// Standard header: magic, blocktype = JBD2_COMMIT_BLOCK, sequence.
    pub header: JournalHeader,
    /// Checksum type (matches `JournalSuperblock::s_checksum_type`).
    /// Single byte — no endianness conversion.
    pub h_chksum_type: u8,
    /// Checksum size in bytes (4 for CRC32C). Single byte — no endianness.
    pub h_chksum_size: u8,
    pub h_padding: [u8; 2],
    /// CRC32C checksum of all descriptor, metadata, and revoke blocks.
    /// Array of big-endian u32 words.
    pub h_chksum: [Be32; JBD2_CHECKSUM_ELEMENTS],
    /// Commit timestamp (seconds since epoch). Written for debugging;
    /// not used by recovery. Big-endian on disk.
    pub h_commit_sec: Be64,
    /// Commit timestamp (nanoseconds component). Big-endian on disk.
    pub h_commit_nsec: Be32,
}

/// Checksum array element count. 8 elements × 4 bytes per Be32 = 32 bytes.
/// Matches Linux's `JBD2_CHECKSUM_BYTES = 8` (element count).
/// Named `_ELEMENTS` (not `_SIZE`) to prevent misuse as a byte count.
const JBD2_CHECKSUM_ELEMENTS: usize = 8;

// JournalCommitBlock: header(12) + chksum_type(1) + chksum_size(1) +
// h_padding(2) + h_chksum(32) + h_commit_sec(8) + h_commit_nsec(4) = 60.
const_assert!(core::mem::size_of::<JournalCommitBlock>() == 60);
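The transaction checksum is CRC32C (Castagnoli). As a reference sketch, a minimal bitwise implementation using the reflected polynomial 0x82F63B78 — real implementations are table-driven or use hardware instructions (SSE4.2, ARMv8 CRC), and the exact seeding/chaining in JBD2 follows the kernel's crc32c helper:

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
/// `crc` is the running value (pass 0 for the first block); the result
/// can be fed back in to chain the checksum across the descriptor,
/// metadata, and revoke blocks of a transaction.
fn crc32c(crc: u32, data: &[u8]) -> u32 {
    let mut c = !crc; // standard pre-inversion
    for &byte in data {
        c ^= byte as u32;
        for _ in 0..8 {
            c = if c & 1 != 0 { (c >> 1) ^ 0x82F6_3B78 } else { c >> 1 };
        }
    }
    !c // standard post-inversion
}
```

The pre/post inversion makes chaining transparent: `crc32c(crc32c(0, a), b)` equals `crc32c(0, a ++ b)`, which is exactly the incremental computation the commit path needs.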

Revoke descriptor block:

/// Revoke block header. Followed by an array of block numbers.
/// All multi-byte fields are big-endian on disk (JBD2 legacy format).
#[repr(C)]
pub struct JournalRevokeHeader {
    /// Standard header: magic, blocktype = JBD2_REVOKE_BLOCK, sequence.
    pub header: JournalHeader,
    /// Number of bytes of revoke data following this header
    /// (including this r_count field).
    pub r_count: Be32,
}
// On-disk JBD2 format: header(12) + r_count(4) = 16 bytes.
const_assert!(core::mem::size_of::<JournalRevokeHeader>() == 16);
// Followed by: array of Be64 (with 64BIT feature) or Be32 block numbers.
// Number of entries = (r_count.to_ne() - sizeof(JournalRevokeHeader)) / sizeof(blocknr).
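As a sketch, the entry-count computation from the comment above (function name illustrative; sizes per the structs in this section):

```rust
/// Number of revoked block numbers carried by one revoke block.
/// `r_count` is the on-disk byte count, which includes the 16-byte
/// JournalRevokeHeader; entries are 8 bytes when
/// JBD2_FEATURE_INCOMPAT_64BIT is set, else 4 bytes (legacy journals).
fn revoke_entry_count(r_count: u32, is_64bit: bool) -> u32 {
    const HEADER_BYTES: u32 = 16; // size_of::<JournalRevokeHeader>()
    let entry_bytes = if is_64bit { 8 } else { 4 };
    (r_count - HEADER_BYTES) / entry_bytes
}
```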

Feature flags (from journal superblock s_feature_incompat):

| Flag | Value | Meaning |
|---|---|---|
| JBD2_FEATURE_INCOMPAT_REVOKE | 0x01 | Journal contains revoke records (always set for ext4) |
| JBD2_FEATURE_INCOMPAT_64BIT | 0x02 | Block tags use 64-bit block numbers |
| JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | 0x04 | Commit blocks may be written without a preceding cache flush |
| JBD2_FEATURE_INCOMPAT_CSUM_V2 | 0x08 | Descriptor tags carry per-block CRC16; commit block carries CRC32C of entire transaction |
| JBD2_FEATURE_INCOMPAT_CSUM_V3 | 0x10 | Extended tag format with full 32-bit checksums |
| JBD2_FEATURE_INCOMPAT_FAST_COMMIT | 0x20 | Fast commit area follows the main journal (Linux 5.10+) |

UmkaOS mount-time validation:

  • CSUM_V3 set → accepted (normal path). journal_tag_bytes() = 16.
  • CSUM_V2 set, CSUM_V3 NOT set → rejected with EUCLEAN. Diagnostic: "JBD2 checksum version 2 not supported; mount with Linux kernel to upgrade." Linux has auto-upgraded V2→V3 since kernel 3.18 (2014); dropping V2 support avoids carrying code for an obsolete 12-year-old format.
  • Neither CSUM_V2 nor CSUM_V3 set → accepted in read-only mode only. Read-write mount logs a warning and rejects with EROFS. Non-checksummed journals predate Linux 3.5 (2012) and should be upgraded by a Linux fsck.
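The three validation outcomes above can be sketched as follows (the flag constants match the feature table; the decision enum and function name are illustrative, not the actual mount-path API):

```rust
const JBD2_FEATURE_INCOMPAT_CSUM_V2: u32 = 0x08;
const JBD2_FEATURE_INCOMPAT_CSUM_V3: u32 = 0x10;

#[derive(Debug, PartialEq)]
enum JournalMountDecision {
    ReadWrite,     // CSUM_V3: normal path, journal_tag_bytes() = 16
    RejectEuclean, // CSUM_V2 only: refuse; Linux mount will upgrade
    ReadOnlyErofs, // no checksums: read-only mount only
}

/// Mount-time checksum feature validation for a JBD2 journal.
fn validate_journal_checksums(s_feature_incompat: u32) -> JournalMountDecision {
    if s_feature_incompat & JBD2_FEATURE_INCOMPAT_CSUM_V3 != 0 {
        JournalMountDecision::ReadWrite
    } else if s_feature_incompat & JBD2_FEATURE_INCOMPAT_CSUM_V2 != 0 {
        JournalMountDecision::RejectEuclean
    } else {
        JournalMountDecision::ReadOnlyErofs
    }
}
```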

Magic number and endianness: All on-disk journal fields are big-endian (network byte order), matching the original JBD design from ext3. The magic number 0xC03B3998 is the first 4 bytes of every journal descriptor, commit, and revoke block. If a metadata block being journaled happens to start with 0xC03B3998 at offset 0, the JBD2_FLAG_ESCAPE tag flag is set and the first 4 bytes of the journaled copy are zeroed (restored on replay).
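The escape transform described above can be sketched as a pair of inverse functions (names illustrative): the commit path zeroes a colliding magic and sets the tag flag; replay restores it when the tag carries JBD2_FLAG_ESCAPE.

```rust
/// Big-endian byte representation of JBD2_MAGIC_NUMBER (0xC03B3998).
const JBD2_MAGIC_BE: [u8; 4] = [0xC0, 0x3B, 0x39, 0x98];

/// Prepare a metadata block for journaling. If its first 4 bytes collide
/// with the journal magic, zero them in the journaled copy and report that
/// JBD2_FLAG_ESCAPE must be set in the descriptor tag.
fn escape_for_journal(block: &[u8]) -> (Vec<u8>, bool) {
    let mut copy = block.to_vec();
    if copy.len() >= 4 && copy[..4] == JBD2_MAGIC_BE {
        copy[..4].fill(0);
        (copy, true) // tag gets JBD2_FLAG_ESCAPE
    } else {
        (copy, false)
    }
}

/// Replay-side inverse: restore the magic if the tag was escaped.
fn unescape_on_replay(journaled: &mut [u8], escaped: bool) {
    if escaped {
        journaled[..4].copy_from_slice(&JBD2_MAGIC_BE);
    }
}
```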

15.7 XFS Filesystem Driver

Scope note: This section provides UmkaOS-specific XFS filesystem driver specifications: allocation group design, log architecture, Linux compatibility constraints, and feature set. The on-disk format specification for XFS is defined by the upstream project and is not duplicated here — UmkaOS implements the same on-disk format bit-for-bit.

The XFS driver implements the FileSystemOps and InodeOps traits defined in Section 14.1 (VFS layer). XFS is used in server, workstation, HPC, and enterprise contexts; it is not consumer-specific.

15.7.1.1 Evolvable/Nucleus Classification

Component Classification Rationale
Allocation group B-tree structures (bnobt, cntbt, inobt, rmapbt) Nucleus On-disk format compatibility with Linux XFS v5.
xlog write-ahead log format and replay Nucleus Crash-consistency invariant. Must match Linux for cross-mount.
CRC32C metadata checksums Nucleus Integrity verification is a correctness property, not a policy.
Reflink extent sharing semantics Nucleus CoW correctness invariant (see Section 14.4).
Delayed allocation heuristics Evolvable Policy: how long to defer allocation is a performance tuning choice.
Speculative preallocation strategy Evolvable Policy: how much to preallocate beyond EOF is workload-dependent.
AG selection for new allocations Evolvable Policy: which allocation group to prefer is a parallelism/fragmentation tradeoff. ML-tunable.

15.7.2 XFS

Use cases: Default filesystem on RHEL, CentOS, Fedora, Rocky Linux, and Oracle Linux. Dominant in enterprise storage servers, HPC scratch filesystems, media production storage, and large-scale NFS servers. Designed for very large files and very large directories.

Tier: Tier 1 (same rationale as ext4).

Design:

XFS partitions the volume into allocation groups (AGs), each an independent unit with its own free-space B-trees (bnobt, cntbt), inode B-tree (inobt), and reverse-mapping B-tree (rmapbt, v5 only). Allocation groups enable parallel allocation for multi-threaded workloads — different AGs are independent, so concurrent file creation on different CPUs does not serialize.

Volume layout (simplified):
  [ Superblock | AG 0 | AG 1 | ... | AG N ]
  Each AG: [ AG header | free-space B-trees | inode B-tree | data blocks ]

Key features:

  • Delayed allocation (delalloc): Blocks are not physically allocated until writeback, allowing the allocator to choose large contiguous extents instead of the first available fragment. Critical for streaming-write performance.
  • Speculative preallocation: XFS preallocates beyond the current EOF during sequential writes, then trims unused preallocation on close. Dramatically reduces fragmentation for growing files (logs, databases, media files).
  • Reflink (XFS v5, Linux 4.16+): Copy-on-write extent sharing for cheap file copies (same semantic as Btrfs reflinks). Required for efficient container image layering and cp --reflink. XFS declares WriteMode::CopyOnWrite and implements ExtentSharingOps — see Section 14.4 for the VFS CoW/RoW infrastructure.
  • Reverse mapping B-tree (rmapbt, v5): Tracks which owner (inode or B-tree structure) holds each physical block. Required for online scrub, online repair, and reflink. Adds ~5% space overhead.
  • Real-time device: XFS optionally uses a separate real-time device for files tagged with XFS_XFLAG_REALTIME, guaranteeing allocation from a contiguous extent region. Used in HPC and media production for deterministic I/O latency. UmkaOS supports the real-time device as a second BlockDevice passed in the mount option rtdev=.
  • xattr namespaces: user., trusted., security., system.posix_acl_*. The trusted. namespace is restricted to CAP_SYS_ADMIN; the kernel enforces this via capability checks in setxattr(2).

Journal (xlog): XFS uses a write-ahead log (xlog) for all metadata mutations. The log is circular; the driver replays from the last checkpoint on mount after unclean shutdown. Log can be on the same device (default) or an external device (logdev=) for better write isolation on HDD-based arrays.

Linux compatibility: XFS v5 (indicated by the superblock version field, XFS_SB_VERSION_5) is required for all new volumes. v5 includes a CRC checksum on every metadata block (CRC32C), catching silent corruption that ext4 without metadata checksums would miss. UmkaOS rejects mounting v4 volumes unless a compatibility shim is provided (v4 is deprecated upstream as of Linux 6.x and not worth supporting at launch).

15.8 Btrfs Filesystem Driver

Scope note: This section provides UmkaOS-specific Btrfs filesystem driver specifications: CoW design, RAID profiles, subvolumes, Linux compatibility constraints, and known limitations. The on-disk format specification for Btrfs is defined by the upstream project and is not duplicated here — UmkaOS implements the same on-disk format bit-for-bit.

The Btrfs driver implements the FileSystemOps and InodeOps traits defined in Section 14.1 (VFS layer). Btrfs is used for workstations, snapshots, and deployments requiring transparent compression or send/receive; it is not a general-purpose default.

15.8.1.1 Evolvable/Nucleus Classification

Component Classification Rationale
CoW B-tree structure and transaction commit semantics Nucleus On-disk format compatibility with Linux Btrfs. Correctness invariant for atomic snapshots.
Subvolume and snapshot tree relationships Nucleus Snapshot correctness depends on CoW tree sharing invariants.
Checksum verification (CRC32C, xxhash, sha256, blake2b) Nucleus Data integrity verification is a correctness property.
RAID 1/1C3/1C4/10 mirror placement Nucleus Mirror placement correctness ensures data survives device failure.
incompat_flags feature gating on mount Nucleus Must reject unknown INCOMPAT bits to prevent silent corruption.
Transparent compression algorithm selection (LZO, ZLIB, ZSTD) Evolvable Policy: which algorithm to use for new data is a space/CPU tradeoff. ML-tunable.
Free space cache management strategy (v2 B-tree) Evolvable Policy: how to organize free space metadata is a performance heuristic.
nodatacow decision for database subvolumes Evolvable Policy: operator-configurable CoW bypass per mount/subvolume.
Scrub scheduling and priority Evolvable Policy: when and how aggressively to run background verification.

15.8.2 Btrfs

Use cases: Fedora workstations, Steam Deck, openSUSE. Used in enterprise for snapshot and send/receive capabilities (Proxmox, SUSE). Relevant at kernel level wherever atomic snapshots, compression, or multi-device volumes are needed. Not recommended as a default filesystem — ext4 (general purpose), XFS (enterprise/large files), and ZFS (data integrity/servers) are preferred defaults depending on workload. Btrfs is appropriate when its unique features (subvolume snapshots, transparent compression, send/receive) are specifically required and the operator accepts the limitations documented below.

Tier: Tier 1.

Design: Btrfs is a copy-on-write (CoW) B-tree filesystem. Every write produces a new copy of the modified data/metadata; the old copy is retained until freed. This is the foundation for snapshots (zero-cost at creation) and atomic multi-file transactions. Btrfs declares WriteMode::RedirectOnWrite and implements ExtentSharingOps — see Section 14.4 for the VFS CoW/RoW infrastructure that Btrfs, ZFS, and future UPFS all build upon.

Key features:

| Feature | Kernel behaviour |
|---|---|
| Subvolumes | Independent CoW trees within a volume; each mountable separately. The kernel tracks the active subvolume ID per mount point. |
| Snapshots | Read-write or read-only clone of a subvolume at a point in time. Zero-cost creation (no data copied). Used by UmkaOS live update rollback (Section 13.18). |
| Reflinks | Shallow file copy (cp --reflink). Shares extent references until written. Critical for container runtimes and package managers. |
| Transparent compression | Per-file or per-subvolume, online. Algorithms: LZO (fast), ZLIB (balanced), ZSTD (best ratio, default for UmkaOS). Kernel compresses on writeback; decompresses on read. |
| RAID profiles | RAID 0 / 1 / 1C3 / 1C4 / 5 / 6 / 10 across multiple BlockDevice instances. Btrfs implements its own RAID layer above the block I/O interface, so the block layer stripe log (Section 15.2) cannot apply; the RAID 5/6 write hole is a filesystem-internal risk — CoW reduces but does not eliminate it, because the parity update is not crash-atomic. Users requiring parity RAID should use ZFS (RAIDZ) or md-raid + ext4/XFS (see the limitations list below). |
| Online scrub | Background verification of all data and metadata checksums. Driven by a kernel thread (btrfs-scrub); progress exposed via ioctl and sysfs. |
| Send/receive | Incremental snapshot delta serialisation. btrfs send produces a stream; btrfs receive applies it on another volume. Used for backup, replication, and container image distribution. |
| Free space tree | v2 free-space cache (b-tree based); replaces the v1 file-based cache. Required for large volumes (>1 TiB); UmkaOS always mounts with space_cache=v2. |

CoW and fsync/O_SYNC interaction: Because Btrfs delays the final tree root update until transaction commit, fsync must trigger a full transaction commit (not just a data flush) to satisfy durability. The driver calls btrfs_commit_transaction() on fsync for non-nodatacow files. This is a known latency source for databases; the architecture recommends the nodatacow mount option for database subvolumes (trades crash consistency for performance, consistent with how PostgreSQL and MySQL recommend mounting their data directories on any CoW filesystem).

Live update integration (Section 13.18): Btrfs subvolume snapshots can support snapshot-based atomic OS updates. A live update agent can create a read-only snapshot of the root subvolume before applying an update, making rollback trivial and zero-downtime. This makes Btrfs a natural fit for deployments that use snapshot-based atomic updates; on servers where ext4 or XFS is already in use, this advantage does not justify a migration.

Linux compatibility: Btrfs on-disk format is stable since Linux 3.14. UmkaOS's Btrfs driver is wire-compatible with Linux's. Volumes created on Linux are mountable by UmkaOS. Feature detection uses the incompat_flags superblock field; the driver rejects mount if any unknown INCOMPAT bit is set.

Limitations documented (these are well-known, upstream-acknowledged problems):

  • RAID 5/6 reliability: The Btrfs RAID 5/6 write hole remains an active concern on LKML as of 2025 despite partial mitigations. Btrfs upstream documentation still marks RAID 5/6 as "not recommended for production." The block layer stripe log (Section 15.2) applies only to md-raid and dm-raid arrays, NOT to Btrfs RAID 5/6 (Btrfs implements its own RAID layer above the block I/O interface — the block layer has no visibility into Btrfs stripe operations). Users requiring parity RAID should use ZFS (RAIDZ) or md-raid + ext4/XFS instead of Btrfs RAID 5/6. Btrfs RAID 1/1C3/1C4/10 are stable and recommended.
  • fsync latency: CoW transaction commit on fsync is a known latency source for database workloads. The nodatacow workaround trades crash consistency for performance. Database servers should prefer ext4 or XFS.
  • nodatacow files cannot have checksums. Applications that disable CoW for performance must accept no data integrity checking on those files.
  • Very large directories (>1M entries) have worse performance than XFS due to CoW overhead on directory mutations.
  • Less battle-tested than ext4/XFS: Btrfs has a shorter production track record. ext4 has been the Linux default since 2008; XFS has been the RHEL default since 2014. Btrfs became Fedora's desktop default in 2020 and openSUSE's in 2014, but enterprise adoption remains limited outside snapshot-centric workflows.

15.9 Removable Media, Interoperability Filesystems, and FUSE

15.9.1 Removable Media and Interoperability Filesystems

These filesystem drivers serve interoperability with Windows, macOS, and removable media standards. They are not consumer-specific — embedded systems, edge nodes, and industrial devices also use FAT/exFAT/NTFS for removable storage interoperability.

UmkaOS's strategy for these filesystems is native in-kernel drivers implemented as Tier 1 drivers using the standard FileSystemOps / InodeOps / FileOps trait set (Section 14.1). FUSE-backed userspace drivers are supported as a compatibility mechanism for filesystems where a full native implementation is deferred; the FUSE subsystem is specified later in this section.

15.9.1.1 exFAT

Use case: SDXC (SD cards >32 GB) mandates exFAT per the SD Association's SD specification. USB flash drives commonly use exFAT. Required for read/write interop with Windows and macOS systems.

Tier: Tier 1 (in-kernel umka-exfat driver).

Implementation: Microsoft published the exFAT specification as an open specification in 2019 (SPDX: LicenseRef-exFAT-Specification; no royalty or patent encumbrance for implementors). The exFAT on-disk format is simpler than ext4 or XFS: a FAT for fragmented cluster chains plus an Allocation Bitmap for free-space tracking (contiguous files bypass the FAT entirely), a root directory cluster chain, and per-file directory entries using UTF-16 with UpCase table normalization. UmkaOS's native umka-exfat driver implements the full read/write path using the FileSystemOps trait.

Compatibility: Read/write. Cluster sizes from 512 B to 32 MB. Files up to 16 EiB (volume limit). Directory entries use UTF-16LE with the volume's UpCase table. Timestamps include UTC offset field (Windows 10+). No journaling; power loss can corrupt a directory entry mid-write. The driver issues a FLUSH CACHE command to the underlying block device after each fsync to bound exposure.

Linux compatibility: exFAT volumes created on Linux (kernel exFAT driver, merged in 5.7) are mountable by UmkaOS and vice versa. The UpCase table format and cluster allocation bitmap are identical.

15.9.1.2 NTFS

Use case: External drives shared with Windows installations. Common on USB hard drives purchased pre-formatted. Required for read/write interop with Windows-hosted data volumes.

Tier: Tier 1 (in-kernel ntfs3 driver; based on the Paragon ntfs3 implementation merged into Linux 5.15).

Implementation: UmkaOS's ntfs3 driver is derived from the upstream Linux ntfs3 implementation by Paragon Software Group. It provides full read/write support including NTFS compression (LZNT1 compression units for read/write; LZX/Xpress "CompactOS" files are readable, matching ntfs3's CONFIG_NTFS3_LZX_XPRESS option), sparse files, and hard links (multiple $FILE_NAME attributes per MFT record).

Features not supported (behavior as noted per feature):

- Alternate Data Streams: ADS content is preserved on read/write of the primary stream, but streams are not enumerable via openat/readdir.
- Reparse points used as Windows junction points or symlinks (IO_REPARSE_TAG_SYMLINK, IO_REPARSE_TAG_MOUNT_POINT): accessed as regular files or returned as DT_UNKNOWN in directory listings.
- Encrypted files ($EFS attribute): opened successfully, but content reads return raw ciphertext with a warning in the kernel log.

Phase constraint: Full NTFS write support is present from Phase 2. The NTFS journal ($LogFile) is replayed on mount to ensure volume consistency after unclean shutdown, matching Linux ntfs3 behavior. No NTFS write support is deferred; the complexity of NTFS journaling, compression, and sparse files is handled by the derived ntfs3 implementation.

Linux compatibility: Wire-compatible with Linux ntfs3. Volumes created on Linux ntfs3 are mountable by UmkaOS and vice versa.

15.9.1.3 APFS (Read-Only)

Use case: External drives formatted by macOS. Required for data migration from macOS systems and for mounting Apple Silicon boot drives in dual-boot or forensic scenarios.

Tier: Tier 1 (in-kernel read-only driver, Phase 4+).

Phase constraint: APFS write support is permanently deferred. The APFS on-disk format is not a public specification; Apple documents only enough for APFS tooling on macOS. Reverse-engineered write support risks silent metadata corruption when Apple makes undocumented changes between macOS releases. The read-only constraint is therefore not a temporary limitation but a deliberate design boundary: APFS volumes mounted by UmkaOS are always mounted read-only, enforced in the FileSystemOps::mount() implementation by returning EROFS if MountFlags::READ_WRITE is set.

Implementation: Read-only native kernel driver derived from the apfs-fuse project's reverse-engineered format analysis (MIT licensed). Supported features:

- APFS container and volume superblock parsing.
- B-tree (object map, filesystem tree) traversal.
- Extent-based and inline file data.
- Compression (APFS_COMPRESS_ZLIB, APFS_COMPRESS_LZVN, APFS_COMPRESS_LZFSE).
- Symlinks and hard links (inode numbers via DREC_TYPE_HARDLINK).
- Extended attributes (xattr tree).
- Time Machine snapshot enumeration (read-only).

Phase ordering: Phase 3 delivers HFS+ read-only support (for older macOS volumes). Phase 4 delivers APFS read-only, layered on the HFS+ driver's infrastructure for Apple partition map and CoreStorage detection.

Until Phase 4, APFS volumes are accessible via the FUSE subsystem (Section 15.9) using the apfs-fuse userspace daemon, which provides a compatible FileDescriptor interface through FuseSession.

15.9.1.4 FUSE — Userspace Filesystem Framework

FUSE (Filesystem in Userspace) enables userspace daemons to implement filesystems served through the kernel VFS. UmkaOS implements the FUSE kernel interface as a Tier 2 bridge driver, compatible with the Linux /dev/fuse protocol (FUSE protocol version 7.x; minimum negotiated minor version: 26, which guarantees support for FUSE_RENAME2 (added in protocol 7.23) and FUSE_LSEEK (7.24)).

Scope: FUSE is a compatibility and extensibility mechanism. Native in-kernel drivers are preferred for performance-critical or widely used filesystems. FUSE is the appropriate path for:

- Filesystems with complex or proprietary on-disk formats where a native kernel driver is not feasible (e.g., APFS before Phase 4).
- Userspace tools that already implement a filesystem (e.g., sshfs, s3fs, custom FUSE daemons in container runtimes).
- Development and prototyping of new filesystem drivers before promotion to Tier 1.

Protocol: The FUSE kernel↔daemon protocol uses /dev/fuse. The kernel enqueues request messages (opcodes: FUSE_LOOKUP, FUSE_OPEN, FUSE_READ, FUSE_WRITE, FUSE_READDIR, etc.) on the fd; the daemon reads them, processes them, and writes reply messages back. Each request carries a unique identifier (the `unique` field) matching it to its reply. The wire format is identical to the Linux FUSE protocol version 7.x, ensuring compatibility with all existing FUSE daemons without recompilation.

FuseSession struct — kernel-side state for one mounted FUSE filesystem:

/// Kernel-side state for one active FUSE mount.
///
/// Created when the userspace daemon opens `/dev/fuse` and calls `mount(2)`
/// with `fstype = "fuse"`. Destroyed when the daemon closes the fd or the
/// mount is forcibly unmounted (`umount -f`).
pub struct FuseSession {
    /// Negotiated FUSE protocol version (major, minor).
    /// Major is always 7 for current FUSE protocol; minor is negotiated
    /// during `FUSE_INIT` handshake. The kernel refuses to mount if the
    /// daemon proposes major != 7.
    pub proto_version: (u32, u32),

    /// The `/dev/fuse` file descriptor held open by the daemon process.
    /// Closing this fd triggers an implicit `FUSE_DESTROY` + unmount.
    pub dev_fd: FileDescriptor,

    /// Mount flags captured at mount time (read-only, no-exec, etc.).
    /// Propagated to `InodeOps::permission()` checks within this session.
    pub mount_flags: MountFlags,

    /// Maximum write payload the daemon declared it can handle, in bytes.
    /// Capped at `FUSE_MAX_MAX_PAGES * PAGE_SIZE` (128 × 4096 = 512 KiB).
    /// The kernel splits `FUSE_WRITE` requests larger than this value.
    pub max_write: u32,

    /// Maximum `readahead` size the kernel will request, in bytes.
    /// Negotiated during `FUSE_INIT`; 0 disables kernel readahead for
    /// this mount.
    pub max_readahead: u32,

    /// Whether the daemon supports `FUSE_ASYNC_READ` (concurrent reads
    /// on the same file handle without serialization). Declared by the
    /// daemon in `FUSE_INIT` flags. When false, the kernel serializes
    /// all reads per file handle.
    pub async_read: bool,

    /// Whether the daemon supports `FUSE_WRITEBACK_CACHE` mode.
    /// When true, the kernel VFS page cache handles write coalescing and
    /// fsync; individual 4 KB write-cache flushes are not sent per page.
    /// When false, every `write(2)` generates a `FUSE_WRITE` request.
    pub writeback_cache: bool,

    /// Pending request queue. Requests generated by VFS operations are
    /// enqueued here; the daemon's `read(2)` on `/dev/fuse` dequeues them.
    /// Bounded to `FUSE_MAX_PENDING` (default: 4096) requests to apply
    /// backpressure to VFS callers when the daemon is slow.
    pub pending: FuseRequestQueue,

    /// In-flight requests awaiting a reply from the daemon. Keyed by
    /// `unique` identifier. On daemon close, all in-flight requests are
    /// completed with `ENOTCONN`.
    pub inflight: FuseInflightMap,
}

FuseRequestQueue and FuseInflightMap are internal kernel types; their exact layout is not part of the KABI — only the FuseSession fields visible to the Tier 2 FuseDriver are stable.

// Internal type aliases (not KABI-stable):
// FuseRequestQueue: bounded MPMC ring for pending requests. Capacity is
// FUSE_MAX_PENDING (default: 4096). VFS operations push, daemon read() pops.
type FuseRequestQueue = BoundedMpmcRing<Arc<FuseRequest>, FUSE_MAX_PENDING>;

// FuseInflightMap: integer-keyed XArray for in-flight requests. Keyed by
// `unique` (u64 monotonic request ID). O(1) lookup on daemon reply. RCU-safe
// reads for the abort-all-on-close path.
type FuseInflightMap = XArray<Arc<FuseRequest>>;

See Section 14.11 for the canonical FuseConn struct, which uses these types directly and carries their full documentation.

FUSE_INIT handshake: On first read(2) from the daemon, the kernel sends a FUSE_INIT request with major = 7, minor = UMKA_FUSE_MINOR (the maximum minor the kernel supports). The daemon replies with its supported minor; the negotiated minor is min(kernel_minor, daemon_minor). Capabilities (flags field) are intersected: a capability is active only if both sides declare it. The kernel stores the negotiated values in FuseSession::proto_version and the derived async_read, writeback_cache, max_write, max_readahead fields.
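The negotiation rule above (minor = min of both sides, capabilities intersected, major fixed at 7) can be sketched as a pure function. The two flag constants match Linux include/uapi/linux/fuse.h; the kernel-maximum minor and function name are illustrative assumptions.

```rust
// Sketch of FUSE_INIT negotiation: minor is min(kernel, daemon) and a
// capability is active only if both sides declare it.
const FUSE_ASYNC_READ: u32 = 1 << 0;       // matches Linux fuse.h
const FUSE_WRITEBACK_CACHE: u32 = 1 << 16; // matches Linux fuse.h
const UMKA_FUSE_MINOR: u32 = 38;           // assumed kernel maximum, for illustration

/// Returns ((major, minor), active_flags), or Err for an unsupported daemon.
fn fuse_init_negotiate(daemon_major: u32, daemon_minor: u32, daemon_flags: u32,
                       kernel_flags: u32) -> Result<((u32, u32), u32), &'static str> {
    if daemon_major != 7 {
        return Err("refusing mount: daemon proposed major != 7");
    }
    let minor = UMKA_FUSE_MINOR.min(daemon_minor);
    if minor < 26 {
        return Err("daemon protocol minor below negotiated minimum (26)");
    }
    // Intersect capability flags: active only if both sides declare it.
    Ok(((7, minor), kernel_flags & daemon_flags))
}

fn main() {
    let kernel = FUSE_ASYNC_READ | FUSE_WRITEBACK_CACHE;
    // Daemon supports async reads only, at protocol 7.31.
    let (ver, flags) = fuse_init_negotiate(7, 31, FUSE_ASYNC_READ, kernel).unwrap();
    assert_eq!(ver, (7, 31));
    assert_eq!(flags, FUSE_ASYNC_READ);
    // Wrong major is refused outright.
    assert!(fuse_init_negotiate(8, 31, 0, kernel).is_err());
    println!("negotiated {:?}, flags {:#x}", ver, flags);
}
```

The intersection rule is what lets an old daemon mount on a new kernel (and vice versa) with a safely degraded feature set.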

Error handling: If the daemon crashes or closes /dev/fuse with in-flight requests, all pending VFS operations on the mount return ENOTCONN. The mount remains in the VFS tree but is marked MS_DEAD; subsequent operations return ENOTCONN until the mount is explicitly removed with umount. A daemon can reconnect to a dead mount by opening /dev/fuse with O_RDWR | O_CLOEXEC and the same mount cookie — this is the basis for daemon live-restart without unmounting (supported when FUSE_CONN_INIT_WAIT is negotiated).

Security: The /dev/fuse fd is accessible only to the mounting user (or root). Filesystem operations that arrive from processes outside the mounting user's UID are checked against the allow_other mount option. Without allow_other, FUSE_ACCESS is called only for processes with the mounting UID/GID; others receive EACCES at the VFS permission check before the FUSE request is even generated.

Phase: FUSE kernel infrastructure is delivered in Phase 3. FUSE daemons such as apfs-fuse, sshfs, and custom drivers are usable from Phase 3 onward. The native APFS in-kernel driver (Phase 4) supersedes apfs-fuse for performance-sensitive workloads but does not remove FUSE support.

15.9.1.4.1.1 FUSE KABI Ring Protocol

FUSE communication uses two BoundedMpmcRing buffers in a shared memory region mapped into both the kernel and the Tier 2 daemon process: - Request ring (kernel → daemon): kernel posts a FuseRequest; daemon pops and processes. - Reply ring (daemon → kernel): daemon posts a FuseReply; kernel pops and unblocks the waiting VFS caller.

Wire format (matches Linux FUSE ABI for daemon compatibility):

#[repr(C, align(8))]
pub struct FuseInHeader {
    pub len:     u32,    // total message length including header
    pub opcode:  u32,    // FuseOpcode
    pub unique:  u64,    // request correlation ID; daemon must echo in reply
    pub nodeid:  u64,    // inode number
    pub uid:     u32,    // requesting process UID
    pub gid:     u32,    // requesting process GID
    pub pid:     u32,    // requesting process PID
    pub _pad:    u32,
}
const_assert!(core::mem::size_of::<FuseInHeader>() == 40);

#[repr(C, align(8))]
pub struct FuseOutHeader {
    pub len:    u32,
    pub error:  i32,     // 0 on success; negative errno on error
    pub unique: u64,     // matches FuseInHeader::unique
}
const_assert!(core::mem::size_of::<FuseOutHeader>() == 16);

// Subset — see linux/fuse.h for the complete list. UmkaOS implements
// every Linux FUSE opcode through FUSE_STATX (= 52). Only the most
// commonly used opcodes are shown here; the remaining opcodes and
// their values are listed in the trailing comment inside the enum.
#[repr(u32)]
pub enum FuseOpcode {
    Lookup      = 1,
    Forget      = 2,
    Getattr     = 3,
    Setattr     = 4,
    Readlink    = 5,
    Mknod       = 8,
    Mkdir       = 9,
    Unlink      = 10,
    Rmdir       = 11,
    Rename      = 12,
    Open        = 14,
    Read        = 15,
    Write       = 16,
    Release     = 18,
    Fsync       = 20,
    Flush       = 25,
    Init        = 26,
    Opendir     = 27,
    Readdir     = 28,
    Releasedir  = 29,
    Create      = 35,
    Rename2     = 45,
    Lseek       = 46,
    // ... all remaining opcodes (Symlink=6, Link=13, Statfs=17,
    // SetXattr=21..RemoveXattr=24, FsyncDir=30, GetLk=31..SetLkW=33,
    // Access=34, Interrupt=36, Bmap=37, Destroy=38, Ioctl=39, Poll=40,
    // NotifyReply=41, BatchForget=42, Fallocate=43, ReaddirPlus=44,
    // CopyFileRange=47, SetupMapping=48, RemoveMapping=49, SyncFs=50,
    // TmpFile=51, Statx=52) are implemented identically.
}
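The wire format above can be exercised with a small framing routine: a FUSE_LOOKUP request is the 40-byte FuseInHeader followed by the NUL-terminated name, with `len` covering header plus payload. This is a minimal sketch; the zeroed uid/gid/pid and the function name are illustrative.

```rust
// Frame a FUSE_LOOKUP request in the Linux-compatible wire format:
// FuseInHeader (40 bytes, native endianness) + name + NUL terminator.
fn frame_lookup(unique: u64, parent_nodeid: u64, name: &str) -> Vec<u8> {
    let payload: Vec<u8> = name.bytes().chain(std::iter::once(0)).collect();
    let len = (40 + payload.len()) as u32;
    let mut msg = Vec::with_capacity(len as usize);
    msg.extend_from_slice(&len.to_ne_bytes());           // len: header + payload
    msg.extend_from_slice(&1u32.to_ne_bytes());          // opcode = Lookup (1)
    msg.extend_from_slice(&unique.to_ne_bytes());        // unique: reply correlation
    msg.extend_from_slice(&parent_nodeid.to_ne_bytes()); // nodeid: parent inode
    msg.extend_from_slice(&[0u8; 16]);                   // uid, gid, pid, _pad (zeroed here)
    msg.extend_from_slice(&payload);
    msg
}

fn main() {
    let msg = frame_lookup(42, 1, "etc");
    assert_eq!(msg.len(), 44); // 40-byte header + "etc\0"
    assert_eq!(u32::from_ne_bytes(msg[0..4].try_into().unwrap()), 44);
    assert_eq!(&msg[40..], &b"etc\0"[..]);
    println!("framed {} bytes", msg.len());
}
```

The daemon echoes `unique` in its FuseOutHeader, which is how the kernel routes the reply back to the blocked VFS caller.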

KABI vtable (registered by the in-kernel FUSE driver; called by the Tier 2 daemon):

#[repr(C)]
pub struct FuseKabiVTable {
    pub vtable_size: u64,
    /// Primary version discriminant: `KabiVersion::as_u64()`. See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
    pub kabi_version: u64,

    /// Called once by the daemon after opening `/dev/fuse` and mapping the shared rings.
    /// `req_ring` and `rep_ring` are the two ring buffers in the shared memory region.
    pub fuse_connect: unsafe extern "C" fn(
        session:  *mut FuseSession,
        req_ring: *mut BoundedMpmcRing<FuseRequest, FUSE_RING_DEPTH>,
        rep_ring: *mut BoundedMpmcRing<FuseReply,   FUSE_RING_DEPTH>,
    ) -> i32,

    /// Called by the daemon to signal it has drained all pending requests
    /// and is ready for the mount to be torn down.
    pub fuse_disconnect: unsafe extern "C" fn(session: *mut FuseSession) -> i32,

    /// Called by the daemon to invalidate a cached inode or dentry.
    pub fuse_notify: unsafe extern "C" fn(
        session: *mut FuseSession,
        nodeid:  u64,
        notify:  FuseNotifyCode,
    ) -> i32,
}
// KABI vtable — pointer-width-dependent (contains fn pointers).
// 64-bit: vtable_size(8) + kabi_version(8) + 3 fn ptrs(24) = 40 bytes.
#[cfg(target_pointer_width = "64")]
const_assert!(core::mem::size_of::<FuseKabiVTable>() == 40);
#[cfg(target_pointer_width = "32")]
const_assert!(core::mem::size_of::<FuseKabiVTable>() == 28);

pub const FUSE_RING_DEPTH: usize = 256; // power-of-two; tunable via sysctl fuse.ring_depth

/// FUSE notify codes — daemon-to-kernel unsolicited notifications.
/// Values MUST match Linux `include/uapi/linux/fuse.h` `enum fuse_notify_code`
/// exactly — FUSE daemons send these numeric values on the wire.
#[repr(u32)]
pub enum FuseNotifyCode {
    Poll       = 1,   // wake all pollers on the specified file handle
    InvalInode = 2,   // invalidate cached inode attributes
    InvalEntry = 3,   // invalidate a dentry in the parent directory
    Store      = 4,   // push data into the kernel page cache
    Retrieve   = 5,   // pull data from the kernel page cache
    Delete     = 6,   // remove a dentry (daemon-side deletion)
    Resend     = 7,   // resend previously interrupted requests (newer protocol minors)
    IncEpoch   = 8,   // increment kernel-side epoch counter
    Prune      = 9,   // prune (evict) dentries from a directory
}

Session lifecycle:

  1. Daemon opens /dev/fuse (major=10, minor=229, same as Linux).
  2. Daemon maps the shared ring memory region via mmap(2) on the fd.
  3. Daemon calls fuse_connect() via the KABI vtable, passing ring pointers.
  4. Kernel posts a FUSE_INIT request; daemon replies with its capability flags.
  5. After FUSE_INIT handshake, kernel dispatches VFS requests to the request ring.
  6. Daemon pops requests, processes them, pushes FuseReply entries to the reply ring.
  7. On unmount: kernel posts FUSE_DESTROY; daemon responds and calls fuse_disconnect().

Blocking semantics: VFS calls block until the daemon posts the matching reply (matched by unique ID). The blocked task waits on a per-request WaitQueue. If the daemon exits before posting a reply, the kernel detects session teardown and completes all pending operations with ENOTCONN, consistent with the error-handling rules for daemon crash above.

15.9.2 Summary of Design Decisions

  1. Tier 1 placement: overlayfs runs in the VFS domain because it is a pure VFS stacking layer with moderate code complexity. Tier 2 would double domain-crossing overhead for every file operation in every container.

  2. xattr-based whiteouts as default: Avoids CAP_MKNOD requirement for rootless containers. Character device 0:0 whiteouts are recognized on read for backward compatibility.

  3. Metacopy enabled by default: Matches modern Docker/containerd behavior. The security caveat (attacker-crafted xattrs) is mitigated by the trusted.* namespace restriction and container runtime control of layer provenance.

  4. Atomic copy-up via workdir rename: Uses the same-filesystem rename guarantee. The workdir must share a superblock with upperdir, which the mount validation enforces.

  5. Dentry invalidation on copy-up: Uses d_invalidate() on the parent directory's dentry for the affected name, forcing re-lookup through the overlay lookup() path which will find the new upper entry.

  6. d_revalidate() for overlay dentries: Checks for copy-up state changes. This is the primary mechanism by which concurrent readers discover that a file has been copied up.

  7. Readdir merge with HashSet dedup: O(entries × layers) with hash-based dedup. The merged listing is cached per-opendir for consistency.

  8. xattr escaping for nested overlays: Supports overlayfs-on-overlayfs via the trusted.overlay.overlay.* prefix convention, matching Linux.

  9. Volatile sentinel directory: Prevents mounting on unclean upper layers. The sentinel is created on mount, removed on clean unmount.

  10. dm-verity + IMA dual coverage: Lower layers protected by dm-verity (block-level, Section 9.3), upper layer by IMA (file-level, Section 9.5). This is cross-referenced for clarity.
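The merge-with-dedup rule from design decision 7 can be illustrated in isolation. This is a simplified sketch: layers are ordered upper-first, an upper entry shadows lower copies, and a whiteout claims a name without emitting it; the types are stand-ins for the real overlay readdir path.

```rust
// Illustrative overlay readdir merge with hash-based dedup.
use std::collections::HashSet;

/// Each layer is a list of (name, is_whiteout) entries, upper layer first.
fn merge_readdir(layers: &[Vec<(&str, bool)>]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut merged = Vec::new();
    for layer in layers {
        for &(name, is_whiteout) in layer {
            // First occurrence wins; whiteouts claim the name but emit nothing,
            // so the same name in lower layers is suppressed.
            if seen.insert(name) && !is_whiteout {
                merged.push(name.to_string());
            }
        }
    }
    merged.sort();
    merged
}

fn main() {
    let upper = vec![("a", false), ("gone", true)]; // "gone" is whited out
    let lower = vec![("a", false), ("b", false), ("gone", false)];
    let names = merge_readdir(&[upper, lower]);
    assert_eq!(names, ["a", "b"]);
    println!("{:?}", names);
}
```

The HashSet lookup is what keeps the merge at O(entries × layers) rather than quadratic in total entries.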


15.10 ZFS Integration

15.10.1 Native ZFS and Filesystem Licensing

Linux problem: ZFS can't be merged due to CDDL vs GPL license incompatibility. Users rely on out-of-tree OpenZFS which breaks with kernel updates.

UmkaOS design:

- The kernel is licensed under UmkaOS's proposed OKLF v1.3 license framework (see Appendix A of 23-roadmap.md, Section 24.1 for the full specification — OKLF is a novel license being developed for UmkaOS, not a pre-existing published license): GPLv2 base with the Approved Linking License Registry (ALLR), which explicitly includes CDDL as an approved license. CDDL-licensed code (like OpenZFS) communicates with the kernel via KABI IPC without license conflict (no in-kernel linking occurs).
- ZFS is a first-class Tier 1 filesystem driver, same tier as ext4, XFS, and Btrfs. The KABI interface provides the license boundary: ZFS is dynamically loaded, has one resolved symbol (__kabi_driver_entry), and communicates exclusively through ring buffer IPC and vtable dispatch — no linking, no shared symbols. This provides more isolation than Linux's EXPORT_SYMBOL_GPL boundary (where modules ARE linked into the kernel and share function calls). The license separation is provided by KABI, not by the isolation tier — running a filesystem as Tier 2 (process isolation) for licensing reasons would impose catastrophic I/O overhead (~200-500 cycles per VFS operation) for zero additional legal benefit.
- NFSv4 ACLs are first-class (Section 9.2), so ZFS's native ACL model works natively.
- The filesystem KABI interface is rich enough to support ZFS's advanced features: snapshots, send/receive, datasets, native encryption, dedup, special vdevs.
- ZFS benefits from the stable driver ABI, so it won't break with kernel updates — eliminating the primary pain point of Linux's out-of-tree OpenZFS module.

15.10.2 ZFS Advanced Features

Section 15.10 establishes that ZFS is a first-class UmkaOS citizen via KABI (Tier 1 driver). This section covers advanced ZFS features that benefit from UmkaOS's architecture: capability-based dataset management, RDMA-accelerated replication, and cluster integration.

Dataset hierarchy as capability objects — ZFS datasets form a hierarchy (pool → dataset → child dataset → snapshot → clone). In UmkaOS, each dataset is a capability object (Section 9.2). The capability token for a dataset encodes the specific operations permitted (Section 9.2):

Capability Permits
CAP_ZFS_MOUNT Mount the dataset as a filesystem
CAP_ZFS_SNAPSHOT Create/destroy snapshots of the dataset
CAP_ZFS_SEND Generate a send stream (for replication)
CAP_ZFS_RECV Receive a send stream into this dataset
CAP_ZFS_CREATE Create child datasets
CAP_ZFS_DESTROY Destroy the dataset (highest privilege)

Phase 4+ note: These CAP_ZFS_* capability names are conceptual placeholders. They are not yet defined in the capability model (Ch 9). Until Phase 4+ ZFS-specific KABI is implemented, all ZFS administrative operations require Capability::SysAdmin. The capability names shown here illustrate the target delegation model.

Delegation means transferring a subset of your capabilities to another local entity (a container, a user). A pool administrator holding all capabilities can delegate CAP_ZFS_MOUNT + CAP_ZFS_SNAPSHOT + CAP_ZFS_CREATE for a subtree to a container — the container can mount, snapshot, and create children within its subtree, but cannot destroy the parent dataset or send replication streams. For shared storage across hosts, use clustered filesystems (Section 15.14) backed by the DLM (Section 15.15) over shared block devices (Section 15.13).
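The delegation rule above can be sketched as a subset check. This is a hedged illustration: the CAP_ZFS_* names are the chapter's own conceptual placeholders, modeled here as bit flags with invented values; the real capability model (Ch 9) uses token objects, not integers.

```rust
// Capability delegation as a subset check: a token can narrow but never widen.
const CAP_ZFS_MOUNT: u32    = 1 << 0;
const CAP_ZFS_SNAPSHOT: u32 = 1 << 1;
const CAP_ZFS_SEND: u32     = 1 << 2;
const CAP_ZFS_RECV: u32     = 1 << 3;
const CAP_ZFS_CREATE: u32   = 1 << 4;
const CAP_ZFS_DESTROY: u32  = 1 << 5;

/// Returns the delegated token, or Err if `requested` exceeds `held`.
fn delegate(held: u32, requested: u32) -> Result<u32, &'static str> {
    if requested & !held == 0 {
        Ok(requested)
    } else {
        Err("cannot delegate capabilities the grantor does not hold")
    }
}

fn main() {
    let admin = CAP_ZFS_MOUNT | CAP_ZFS_SNAPSHOT | CAP_ZFS_SEND
              | CAP_ZFS_RECV | CAP_ZFS_CREATE | CAP_ZFS_DESTROY;
    // Pool admin delegates mount + snapshot + create for a subtree.
    let container = delegate(admin, CAP_ZFS_MOUNT | CAP_ZFS_SNAPSHOT | CAP_ZFS_CREATE).unwrap();
    // The container cannot re-delegate destroy or replication rights.
    assert!(delegate(container, CAP_ZFS_DESTROY).is_err());
    assert!(delegate(container, CAP_ZFS_SEND).is_err());
    println!("container token: {:#08b}", container);
}
```

The same check applied recursively is what makes delegation monotonically narrowing down a dataset subtree.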

zvol (ZFS volumes) — ZFS volumes are datasets that expose a block device interface instead of a POSIX filesystem. UmkaOS integrates zvols with umka-block's device-mapper framework — a zvol can serve as the backing store for dm-crypt, dm-mirror, or as an iSCSI LUN (Section 15.13). This enables ZFS's checksumming, compression, and snapshot capabilities for raw block storage consumers.

zfs send/recv over RDMA — ZFS replication streams (zfs send) are often used for backup, disaster recovery, and dataset migration. In Linux, zfs send | ssh remote zfs recv pushes the stream over TCP (typically SSH-encrypted). UmkaOS provides a native RDMA transport option:

- Uses Section 5.4's RDMA infrastructure.
- Kernel-to-kernel path: when both source and destination run UmkaOS, the send stream bypasses userspace entirely — data moves directly from the source ZFS module through RDMA to the destination ZFS module.
- Zero-copy: send stream data is RDMA READ from source memory, written directly into the destination's transaction group.
- Encryption: if the dataset uses ZFS native encryption, the stream is already encrypted end-to-end. Otherwise, RDMA transport encryption (Section 5.4) protects data in transit.

Import/export compatibility — UmkaOS's ZFS implementation reads and writes the standard ZFS on-disk format (as defined by OpenZFS). Existing zpools created on Linux, FreeBSD, or illumos can be imported by UmkaOS without modification. Conversely, zpools created by UmkaOS can be exported and imported on any OpenZFS-compatible system.

ZFS-specific KABI extensions — ZFS uses the common filesystem KABI interface (FileSystemOps, InodeOps, FileOps) defined in Section 14.1 for standard POSIX operations. ZFS-specific administrative operations (dataset create/destroy, snapshot management, zfs send/recv, pool scrub/resilver, ZFS_IOC_* ioctls) require additional KABI definitions not yet specified. These ZFS-specific KABI extensions are deferred to Phase 4+ implementation. The common filesystem KABI is sufficient for basic ZFS functionality (mount, read, write, fsync, xattr, ACL). Dataset management operations will be routed through the D-Bus bridge (Section 11.11) or ioctl passthrough until dedicated KABI vtable extensions are defined.


15.11 NFS Client, SunRPC, and RPCSEC_GSS

NFS is UmkaOS's primary network filesystem. This section specifies the complete kernel-side stack:

- SunRPC transport layer: connection management, XDR encoding, RPC dispatch.
- RPCSEC_GSS + Kerberos: Kerberos-authenticated NFS (NFSv4 + Kerberos = "krb5i/krb5p").
- NFSv4 client state machine: open/lock/delegation/lease.
- Network filesystem cache (netfs layer): shared page cache for NFS, Ceph, and other network filesystems.

15.11.1 SunRPC Transport Layer

SunRPC (RFC 5531) is the RPC framework underlying NFS, lockd, and the mount protocol.

RpcTransport trait — abstraction over TCP and UDP transports:

pub trait RpcTransport: Send + Sync {
    fn send_request(&self, req: &RpcMsg, xid: u32, timeout: Duration) -> Result<(), RpcError>;
    /// Boxed future so the trait stays object-safe — `XClnt` holds
    /// transports as `Arc<dyn RpcTransport>`.
    fn recv_reply(&self, xid: u32)
        -> Pin<Box<dyn Future<Output = Result<RpcMsg, RpcError>> + Send + '_>>;
    fn close(&self);
    fn reconnect(&self) -> Result<(), RpcError>;
    fn max_payload_size(&self) -> usize;
}

15.11.1.1 RPC Error Taxonomy

The RpcError enum distinguishes transient from permanent failures, enabling the retry logic in XClnt to make correct decisions without ambiguity.

/// RPC transport and protocol error taxonomy.
///
/// Variants are ordered by severity. The `XClnt` retry loop uses the
/// variant to decide:
/// - **Retry immediately**: `Timeout` (with exponential backoff up to
///   `XClnt.retries` attempts).
/// - **Reconnect, then retry**: `ConnReset` (calls `transport.reconnect()`
///   first; retries up to `XClnt.retries` after successful reconnect).
/// - **Refresh credentials, then retry**: `AuthFailed` (calls
///   `auth.refresh()` first; retry once).
/// - **Wait and retry**: `GracePeriod` (sleep for the server's grace
///   period duration, typically 90 seconds for NFSv4, then retry).
/// - **Permanent failure**: `ProgramMismatch`, `GarbageArgs`,
///   `SystemError` — return error to caller without retry.
pub enum RpcError {
    /// RPC call timed out (server or network). Retryable with backoff.
    /// Timeout duration is `XClnt.timeout`; the RPC layer retries up to
    /// `XClnt.retries` times before surfacing the error.
    Timeout,

    /// TCP connection reset by peer or network failure. Requires
    /// `transport.reconnect()` before retrying. If reconnect fails,
    /// the error is surfaced to the caller as `EIO`.
    ConnReset,

    /// Authentication failed. For RPCSEC_GSS (Kerberos), this typically
    /// means the ticket has expired. The RPC layer calls `auth.refresh()`
    /// to obtain a new ticket and retries once. If refresh fails or the
    /// retry also fails, the error is surfaced as `EACCES`.
    AuthFailed,

    /// Server does not support the requested RPC program or version.
    /// Permanent failure — the client must negotiate a different version
    /// or report the error to the caller.
    ProgramMismatch {
        /// The program version the client requested.
        expected_ver: u32,
        /// The highest version the server supports for this program.
        server_ver: u32,
    },

    /// Server rejected the call arguments as malformed (RPC_GARBAGE_ARGS).
    /// Permanent failure — indicates a serialization bug or protocol
    /// mismatch. Surfaced as `EIO`.
    GarbageArgs,

    /// Server returned a system-level error (RPC_SYSTEM_ERR).
    /// The `i32` is a POSIX errno from the server. Surfaced as-is
    /// to the caller (mapped through the NFS error translation table
    /// defined below).
    SystemError(i32),

    /// Server is in NFS grace period (NFSv4 `NFS4ERR_GRACE` or
    /// `NFS4ERR_DELAY`). The client must wait and retry after the grace
    /// period expires. The NFSv4 client state machine
    /// ([Section 15.11](#nfs-client-sunrpc-and-rpcsecgss--nfsv4-client-state-machine))
    /// handles grace period detection and automatic retry scheduling.
    GracePeriod,
}

NFS error translation table — maps NFS protocol error codes to POSIX errno values for the syscall return path. The NFS client applies this mapping when translating an NFS reply status to a kernel Errno. NFSv3 uses nfsstat3 (RFC 1813 §2.6); NFSv4 uses nfsstat4 (RFC 7530 §13.1). The table covers both:

NFS Error Value Errno Notes
NFS3ERR_PERM / NFS4ERR_PERM 1 EPERM Operation not permitted
NFS3ERR_NOENT / NFS4ERR_NOENT 2 ENOENT No such file or directory
NFS3ERR_IO / NFS4ERR_IO 5 EIO I/O error
NFS3ERR_NXIO / NFS4ERR_NXIO 6 ENXIO No such device or address
NFS3ERR_ACCES / NFS4ERR_ACCESS 13 EACCES Permission denied
NFS3ERR_EXIST / NFS4ERR_EXIST 17 EEXIST File exists
NFS3ERR_XDEV / NFS4ERR_XDEV 18 EXDEV Cross-device link
NFS3ERR_NODEV 19 ENODEV No such device
NFS3ERR_NOTDIR / NFS4ERR_NOTDIR 20 ENOTDIR Not a directory
NFS3ERR_ISDIR / NFS4ERR_ISDIR 21 EISDIR Is a directory
NFS3ERR_INVAL / NFS4ERR_INVAL 22 EINVAL Invalid argument
NFS3ERR_FBIG / NFS4ERR_FBIG 27 EFBIG File too large
NFS3ERR_NOSPC / NFS4ERR_NOSPC 28 ENOSPC No space left on device
NFS3ERR_ROFS / NFS4ERR_ROFS 30 EROFS Read-only filesystem
NFS3ERR_MLINK 31 EMLINK Too many links
NFS3ERR_NAMETOOLONG / NFS4ERR_NAMETOOLONG 63 ENAMETOOLONG Filename too long
NFS3ERR_NOTEMPTY / NFS4ERR_NOTEMPTY 66 ENOTEMPTY Directory not empty
NFS3ERR_DQUOT / NFS4ERR_DQUOT 69 EDQUOT Disk quota exceeded
NFS3ERR_STALE / NFS4ERR_STALE 70 ESTALE Stale file handle
NFS3ERR_BADHANDLE 10001 ESTALE Invalid NFS file handle (mapped to ESTALE per Linux nfs3_stat_to_errno)
NFS3ERR_SERVERFAULT 10006 EIO Server internal error
NFS4ERR_DENIED 10010 EAGAIN Lock denied (non-blocking)
NFS4ERR_EXPIRED 10011 EIO Lease/delegation expired
NFS4ERR_LOCKED 10012 EAGAIN File is locked
NFS4ERR_GRACE 10013 EAGAIN Server in grace period (retry)
NFS4ERR_DELAY 10008 EAGAIN Server busy (retry with backoff)
NFS4ERR_WRONGSEC 10016 EPERM Wrong security flavor (EPERM triggers sec= fallback)
NFS4ERR_MOVED 10019 EREMOTE Filesystem migrated
(unknown) * EIO Unmapped errors → EIO

The mapping function is:

/// Translate an NFS protocol status code to a POSIX errno.
/// Handles both NFSv3 (nfsstat3) and NFSv4 (nfsstat4) error spaces.
///
/// This function is called only for errors NOT intercepted by the NFSv4
/// state recovery machine. `NFS4ERR_EXPIRED`, `NFS4ERR_STALE_CLIENTID`,
/// and similar lease/state errors are normally handled by the recovery
/// path before reaching this function. The `EIO` mapping applies only
/// when recovery itself has failed.
fn nfs_status_to_errno(status: i32) -> Errno {
    match status {
        0     => unreachable!("success should not reach error path"),
        1     => Errno::EPERM,
        2     => Errno::ENOENT,
        5     => Errno::EIO,
        6     => Errno::ENXIO,
        13    => Errno::EACCES,
        17    => Errno::EEXIST,
        18    => Errno::EXDEV,
        19    => Errno::ENODEV,
        20    => Errno::ENOTDIR,
        21    => Errno::EISDIR,
        22    => Errno::EINVAL,
        27    => Errno::EFBIG,
        28    => Errno::ENOSPC,
        30    => Errno::EROFS,
        31    => Errno::EMLINK,
        63    => Errno::ENAMETOOLONG,
        66    => Errno::ENOTEMPTY,
        69    => Errno::EDQUOT,
        70    => Errno::ESTALE,
        10001 => Errno::ESTALE,
        10006 => Errno::EIO,
        10008 | 10010 | 10012 | 10013 => Errno::EAGAIN,
        10011 => Errno::EIO,
        10016 => Errno::EPERM, // NFS4ERR_WRONGSEC: EPERM triggers sec= fallback
        10019 => Errno::EREMOTE,
        _     => Errno::EIO, // unmapped NFS errors → EIO
    }
}

TCP transport — one persistent TCP connection per server per NFS client. Record marking (RFC 5531 §10): each RPC message prefixed with a 4-byte record mark (u32 with high bit set indicating last fragment, low 31 bits = fragment length). Multiple RPC messages may be pipelined on one TCP connection. Connection maintained as long as mounts are active; reconnect on ECONNRESET.
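The record-marking scheme above is small enough to show directly. A sketch following RFC 5531 §10 (the helper names are illustrative):

```rust
// RFC 5531 §10 record marking: a 4-byte big-endian mark whose high bit
// flags the last fragment and whose low 31 bits carry the fragment length.
fn encode_record_mark(frag_len: u32, last: bool) -> [u8; 4] {
    assert!(frag_len < 1 << 31, "fragment length must fit in 31 bits");
    let mark = if last { frag_len | 0x8000_0000 } else { frag_len };
    mark.to_be_bytes()
}

fn decode_record_mark(mark: [u8; 4]) -> (u32, bool) {
    let v = u32::from_be_bytes(mark);
    (v & 0x7fff_ffff, v & 0x8000_0000 != 0)
}

fn main() {
    // A 120-byte final fragment.
    let mark = encode_record_mark(120, true);
    assert_eq!(mark, [0x80, 0x00, 0x00, 0x78]);
    assert_eq!(decode_record_mark(mark), (120, true));
    // A non-final fragment keeps the high bit clear.
    assert_eq!(decode_record_mark(encode_record_mark(120, false)), (120, false));
    println!("mark = {:02x?}", mark);
}
```

The last-fragment bit is what lets multiple pipelined RPC messages share one TCP stream without any out-of-band delimiter.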

Network namespace binding: RpcTransport captures the network namespace at construction time: transport.net_ns = params.net_ns.clone(). All socket operations use this captured namespace. XClnt holds the transport; the namespace is transitively available via self.transport.net_ns.

XClnt (RPC client) struct:

pub struct XClnt {
    pub server_addr:   SockAddr,
    /// Transport connections. Single-element for default (nconnect=1).
    /// `nconnect=N` mount option creates N TCP connections to the server
    /// for bandwidth aggregation. Each PendingRpc records which transport
    /// index it was dispatched on (for reply routing). New RPCs are
    /// dispatched round-robin weighted by `inflight_count` (least-loaded).
    pub transports:    ArrayVec<Arc<dyn RpcTransport>, 16>,
    pub prog:          u32,    // RPC program number (NFS = 100003, mountd = 100005)
    pub vers:          u32,    // Program version (NFSv4 = 4)
    pub auth:          Arc<dyn RpcAuth>,
    // XID is a per-connection transaction correlation tag (RFC 5531).
    // Wrapping is safe: stale XIDs are garbage-collected by RPC timeout
    // (pending map entry removed after `timeout` elapses). The (client_addr,
    // xid) tuple provides uniqueness; no 50-year longevity concern.
    pub xid_counter:   AtomicU32,
    pub pending:       XArray<PendingRpc>,  // xid (u32) → waker; XArray internal lock replaces Mutex
    pub timeout:       Duration,
    pub retries:       u32,
}

pub struct PendingRpc {
    pub xid:    u32,
    pub waker:  Waker,
    pub result: Option<Result<RpcMsg, RpcError>>,
    /// Index into XClnt.transports — identifies which connection this RPC
    /// was dispatched on, for reply routing.
    pub transport_idx: u8,
}
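The "least-loaded" dispatch noted in the transports comment amounts to picking the transport with the smallest in-flight count. A minimal sketch (the Transport shape here is hypothetical):

```rust
/// Stand-in for a transport's load counter (illustrative only).
struct Transport {
    inflight_count: u32,
}

/// Pick the index of the least-loaded transport; ties go to the lowest index.
fn pick_transport(transports: &[Transport]) -> usize {
    transports
        .iter()
        .enumerate()
        .min_by_key(|(_, t)| t.inflight_count)
        .map(|(i, _)| i)
        .expect("nconnect >= 1 guarantees at least one transport")
}
```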

XDR (External Data Representation) — RFC 4506. UmkaOS implements XDR as zero-copy where possible: XdrEncoder writes directly into a NetBuf chain; XdrDecoder reads from received NetBuf without copying. Fixed-size types (u32, u64, bool) are directly encoded; variable-length strings and arrays have a 4-byte length prefix followed by zero-padded data to a 4-byte boundary.
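The length-prefix-plus-padding rule can be illustrated with a minimal encoder for variable-length opaque data (the real XdrEncoder writes into a NetBuf chain; a Vec stands in here):

```rust
/// Encode XDR variable-length opaque data (RFC 4506 §4.10): a 4-byte
/// big-endian length prefix, then the bytes, zero-padded to the next
/// 4-byte boundary.
fn xdr_encode_opaque(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + (data.len() + 3) / 4 * 4);
    out.extend_from_slice(&(data.len() as u32).to_be_bytes());
    out.extend_from_slice(data);
    while out.len() % 4 != 0 {
        out.push(0); // zero padding, never counted in the length prefix
    }
    out
}
```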

Async RPC dispatch — call_async(proc: u32, args: impl XdrEncode) -> impl Future<Output = Result<R, RpcError>>: builds RpcMsg { xid, call: RpcCall { rpc_version: 2, program, version, procedure, auth, verifier } }, encodes args via XDR, sends via the transport, registers a PendingRpc in the pending map, and returns a future that resolves when the matching reply arrives. The reply receiver loop runs as a Tier 1 kernel task.

15.11.2 RPC Authentication (RpcAuth)

RpcAuth trait:

pub trait RpcAuth: Send + Sync {
    fn auth_type(&self) -> RpcAuthFlavor;
    fn marshal_cred(&self, encoder: &mut XdrEncoder) -> Result<()>;
    fn verify_verf(&self, decoder: &mut XdrDecoder) -> Result<()>;
    fn refresh(&self) -> Result<()>;  // Re-fetch credentials if expired
}

Built-in auth flavors:
- AuthNone (flavor 0): null credentials. Used only for portmap/rpcbind.
- AuthUnix / AUTH_SYS (flavor 1): uid, gid, supplementary groups. Used for NFSv3; not secure. In-namespace UID/GID are translated through user_ns.uid_map/gid_map before encoding on the wire (preventing container root from appearing as host root). The translated host-scope UID/GID are placed into the XDR credential body; if no mapping exists for the caller's in-namespace UID, the RPC fails with EOVERFLOW.
- RPCSEC_GSS (flavor 6): GSS-API based authentication. Described in Section 15.11.3.

15.11.3 RPCSEC_GSS and Kerberos

RPCSEC_GSS (RFC 2203) wraps any GSS-API mechanism. UmkaOS implements the Kerberos V5 mechanism (RFC 4121).

Service types (negotiated at mount time via the sec= mount option):
- krb5: authentication only (integrity of RPC header)
- krb5i: authentication + integrity (checksum of entire RPC payload)
- krb5p: authentication + integrity + privacy (encryption of RPC payload)

GssContext struct — Per-server-per-credential GSS context. One context per (client principal, server principal) pair, shared by all threads using the same credentials on the same NFS server connection:

pub struct GssContext {
    // --- Authentication state ---
    /// GSS mechanism OID (1.2.840.113554.1.2.2 for Kerberos V5).
    pub mech_oid:    GssMechOid,
    /// Opaque handle to the GSS security context (from gss_init_sec_context).
    pub context_hdl: u64,
    /// Negotiated service level: None / Integrity / Privacy.
    pub service:     GssService,
    /// Monotonic sequence counter for anti-replay (see below).
    /// Stored as u64 internally; truncated to u32 on the wire per RFC 2203.
    /// At ~100K RPCs/sec, the u32 wire space wraps in ~12 hours. Before
    /// wrap, the context must be re-established (Kerberos ticket renewal
    /// typically triggers this well before wrap — default ticket lifetime
    /// is 10 hours). The wire-wrap check examines the **low 32 bits**:
    /// `if (seq_num.load(Relaxed) as u32) >= 0xFFFF_FF00 { force renewal }`.
    /// Comparing the full u64 against 0xFFFF_FF00 would check ~4 billion
    /// total RPCs, not the wire representation approaching wrap.
    pub seq_num:     AtomicU64,
    /// AES-256 session key, zeroed on context destruction.
    pub session_key: Zeroizing<[u8; 32]>,
    /// User ID that established this context.
    pub uid:         UserId,

    // --- Lifecycle state ---
    /// Opaque GSS context token (from gss_init_sec_context). Variable length;
    /// stored as a heap allocation updated atomically on renewal.
    pub token: RwLock<Box<[u8]>>,
    /// Absolute expiry time (nanoseconds since boot).
    pub expiry_ns: AtomicU64,
    /// Current lifecycle state (see `GssContextState` enum below).
    pub state: AtomicU8, // GssContextState as u8
    /// Number of RPCs currently in-flight using this context.
    /// Grace-period teardown waits for this to reach zero before expiring.
    pub in_flight: AtomicU32,
    /// Upcall ID sent to gssd for renewal (0 = none pending).
    pub renewal_upcall_id: AtomicU64,
}
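The wire-wrap predicate from the seq_num comment, stated standalone:

```rust
/// True when the low 32 bits (the wire representation per RFC 2203)
/// are within 256 of wrapping, forcing context renewal.
fn wire_wrap_imminent(seq_num: u64) -> bool {
    (seq_num as u32) >= 0xFFFF_FF00
}
```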

RPCSEC_GSS credential exchange (happens automatically on first NFS connection):

  1. Client sends an RPCSEC_GSS_INIT call with a Kerberos AP_REQ (service ticket + authenticator) obtained from the kernel keyring (Section 10.2). The request_key("krb5", "nfs@server.example.com", NULL) lookup triggers a gssd upcall if no ticket is cached.
  2. Server responds with AP_REP (session key confirmation) and assigns a gss_proc_handle.
  3. Subsequent RPCs carry the gss_proc_handle + sequence number + integrity/privacy checksum in the credential field.

RpcsecGssAuth struct — implements RpcAuth:

pub struct RpcsecGssAuth {
    /// GSS context. All mutable fields in GssContext use interior mutability
    /// (AtomicU64 for seq_num, Zeroizing for session_key). Context replacement
    /// during renewal creates a new Arc<GssContext> and atomically swaps via
    /// ArcSwap — eliminating the RwLock read-lock overhead (~15-20ns) on every RPC.
    pub ctx:     Arc<GssContext>,
    pub handle:  u32,          // gss_proc_handle from server
    pub service: GssService,
}
- marshal_cred(): writes the RPCSEC_GSS credential with the current seq_num.
- verify_verf(): checks the server's GSS MIC (Message Integrity Code) over the reply XID.
- refresh(): if ctx.expiry < now, calls request_key() to fetch a new service ticket and re-runs RPCSEC_GSS_INIT.

Key retrieval integration with Section 10.2: Kerberos TGTs are cached as LogonKey entries in the kernel Key Retention Service. When refresh() needs a new service ticket, it calls request_key("krb5tgt", "REALM", NULL) to retrieve the cached TGT LogonKey, then calls request_key("krb5", "nfs@server.example.com", NULL) to obtain (or derive) a service ticket. If no TGT is present, the request_key upcall invokes userspace gssd, which performs the full Kerberos AS exchange, deposits the resulting TGT as a LogonKey, and provides the service ticket. This path requires Capability::SysAdmin only for initial keyring population; subsequent ticket requests use the session keyring of the process that triggered the mount.

Sequence number anti-replay: Each GssContext maintains a monotonic seq_num (AtomicU64). The server rejects any RPC with a sequence number more than 256 positions behind the current window (RFC 2203 §5.3.3). The client never reuses sequence numbers within a context lifetime.
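The window arithmetic reduces to a single comparison; a sketch (names illustrative, and the RFC 2203 distinction between silently dropping and rejecting is omitted):

```rust
/// Server-side anti-replay window size (RFC 2203 lets the server choose;
/// 256 matches the text above).
const REPLAY_WINDOW: u64 = 256;

/// Accept a sequence number unless it is more than REPLAY_WINDOW
/// positions behind the highest seen so far.
fn accept_seq(highest_seen: u32, seq: u32) -> bool {
    (seq as u64) + REPLAY_WINDOW >= highest_seen as u64
}
```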

GSS Upcall Mechanism:

Kerberos authentication requires obtaining credentials from userspace (the gssd daemon), since the kernel cannot contact a KDC directly. UmkaOS uses an upcall mechanism:

Channel: A per-mount Unix domain socket (/run/umka/gss/{mount_id}) created when the NFS mount is established. The kernel writes requests and reads responses using a simple binary framing protocol.

Namespace scoping: The GSS upcall socket is created in the network namespace of the NFS mount (i.e., NfsSessionParams::net_ns). The gssd daemon that answers upcalls must be running in the same network namespace — a gssd in the host namespace cannot see upcall sockets created inside a container's network namespace. This ensures that container-scoped NFS mounts with Kerberos authentication use the container's own gssd instance and Kerberos credential cache.

UID namespace awareness: RPCSEC_GSS credential mapping uses the user namespace of the NFS mount for UID translation. When an NFS mount is established inside a user namespace, the GssContext::uid field stores the host UID (translated via the mount's user namespace uid_map). RPCs sent to the server carry the host UID in the GSS credential, not the in-namespace UID. This ensures the NFS server sees consistent identities regardless of the container's UID mapping.

Request format (GssUpcallRequest):

#[repr(C)]
pub struct GssUpcallRequest {
    /// Protocol version (currently 1).
    pub version: u32,
    /// Unique upcall ID for matching responses to requests. Allocated from
    /// a per-mount atomic counter. Required for concurrent upcall support
    /// (up to 32 simultaneous upcalls).
    pub upcall_id: u32,
    /// Request type: 1=INIT_SEC_CONTEXT, 2=ACCEPT_SEC_CONTEXT, 3=GET_MIC, 4=VERIFY_MIC.
    pub req_type: u32,
    /// Client principal name (NUL-terminated, max 256 bytes).
    pub client_principal: [u8; 256],
    /// Target service name, e.g., "nfs@server.example.com" (NUL-terminated, max 256 bytes).
    pub target: [u8; 256],
    /// Input token length (0 for INIT, non-zero for mutual auth response).
    pub input_token_len: u32,
    /// Input token data (up to 65535 bytes; variable length follows this struct).
    // (actual data follows at offset sizeof(GssUpcallRequest))
}
// Userspace boundary (upcall pipe): version(4)+upcall_id(4)+req_type(4)+client_principal(256)+target(256)+input_token_len(4) = 528 bytes.
// repr(C): all u32 fields naturally aligned, [u8;256] align 1. No padding.
const_assert!(core::mem::size_of::<GssUpcallRequest>() == 528);

Response format (GssUpcallResponse):

#[repr(C)]
pub struct GssUpcallResponse {
    pub version:      u32,           // offset 0
    /// Must match the `upcall_id` from the corresponding `GssUpcallRequest`.
    pub upcall_id:    u32,           // offset 4
    pub status:       i32,           // offset 8; 0 = success; negative = GSS error code
    /// Explicit padding for u64 alignment of `context_id`. Zero-initialized
    /// on construction to prevent kernel heap information disclosure.
    pub _pad0:        [u8; 4],       // offset 12
    /// GSS context handle (opaque; returned to kernel for subsequent calls).
    pub context_id:   u64,           // offset 16
    /// Output token length (for INIT_SEC_CONTEXT response token).
    pub output_token_len: u32,       // offset 24
    /// Explicit trailing padding to struct alignment. Zero-initialized.
    pub _pad1:        [u8; 4],       // offset 28
    // output token data follows at offset sizeof(GssUpcallResponse)
}
// repr(C): u32(4)+u32(4)+i32(4)+pad0(4)+u64(8)+u32(4)+pad1(4) = 32 bytes.
const_assert!(core::mem::size_of::<GssUpcallResponse>() == 32);

Timeout: 30 seconds per upcall. If gssd does not respond within 30s:
- The kernel returns ETIMEDOUT to the NFS operation.
- The upcall socket is closed and re-created; a new connection attempt is made.
- After 3 consecutive timeouts, the mount is marked NFS_MOUNT_SECFLAVOUR_FORCE_NONE and falls back to AUTH_SYS (if configured) or returns EACCES permanently until the mount is remounted.

Concurrent upcalls: Multiple upcalls may be in flight simultaneously (one per in-progress authentication). Each upcall is tagged with a unique upcall_id: u32; responses match by upcall_id. A ring buffer of 32 concurrent upcalls is supported.
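Allocation and response matching for concurrent upcalls can be sketched as follows (struct and method names are illustrative; std collections stand in for kernel equivalents):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};

/// Up to 32 upcalls may be outstanding at once (per the text).
const MAX_CONCURRENT_UPCALLS: usize = 32;

struct UpcallTable {
    next_id: AtomicU32,
    pending: HashMap<u32, &'static str>, // upcall_id -> request tag
}

impl UpcallTable {
    /// Allocate an upcall_id and record the pending request.
    /// Returns None when the table is full (caller must wait).
    fn send(&mut self, tag: &'static str) -> Option<u32> {
        if self.pending.len() >= MAX_CONCURRENT_UPCALLS {
            return None;
        }
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        self.pending.insert(id, tag);
        Some(id)
    }

    /// Match a response to its request by upcall_id; responses may
    /// arrive in any order.
    fn on_response(&mut self, upcall_id: u32) -> Option<&'static str> {
        self.pending.remove(&upcall_id)
    }
}
```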

15.11.3.1 GSS Context Lifecycle and Proactive Renewal

Linux behavior (reference): Linux hard-fails all NFS RPCs with EKEYEXPIRED when the GSS/Kerberos TGT or service ticket expires. The user sees I/O errors on NFS mounts until they re-authenticate (kinit). This is a poor user experience for long-running workloads.

UmkaOS improvement — proactive renewal + grace period:

UmkaOS's GSS context manager proactively renews credentials and provides a short grace period for in-flight RPCs, eliminating spurious I/O errors in well-managed environments.

/// Lifecycle state of a GSS security context.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GssContextState {
    /// Context is valid and usable for RPC signing/encryption.
    Valid,
    /// Context expires within GSS_RENEWAL_LEAD_TIME_SEC (60 s); renewal upcall
    /// has been sent to the gssd daemon. New RPCs may still use this context.
    RenewPending,
    /// Renewal failed or context has just expired; within the grace period
    /// (GRACE_PERIOD_MS = 500 ms). In-flight RPCs are allowed to complete.
    /// New RPCs are queued pending renewal or context replacement.
    GracePeriod,
    /// Grace period elapsed; all RPCs return EKEYEXPIRED until re-authentication.
    Expired,
    /// Context has been explicitly destroyed (session logout or server reset).
    Destroyed,
}

// GssContext: defined above in Section 15.11.3 (the canonical definition
// includes both authentication state and lifecycle state fields).

/// Renewal timing constants.
/// Renewal is triggered this many seconds before expiry.
pub const GSS_RENEWAL_LEAD_TIME_SEC: u64 = 60;
/// After expiry, in-flight RPCs have this long to complete before the context
/// is torn down and new RPCs start returning EKEYEXPIRED.
pub const GSS_GRACE_PERIOD_MS: u64 = 500;

Renewal algorithm (runs in the kthread/gss_renewer background thread):

  1. Wake every 5 seconds (or when notified by an expiry timer).
  2. For each GssContext with state == Valid: if now_ns >= expiry_ns - GSS_RENEWAL_LEAD_TIME_SEC * NS_PER_SEC, or (ctx.seq_num.load(Relaxed) as u32) >= 0xFFFF_FF00 (wire-wrap imminent):
    • Transition state to RenewPending.
    • Send upcall to gssd: GssUpcallRequest { op: Renew, ... }.
  3. If the renewal upcall succeeds (gssd responds within 30 s):
    • Update token and expiry_ns under token.write().
    • Transition state back to Valid.
  4. If the renewal upcall fails or times out:
    • If now_ns < expiry_ns: retry after 10 s (transient failure).
    • If now_ns >= expiry_ns: transition to GracePeriod; start a 500 ms timer. When the timer fires, wait for in_flight == 0, then transition to Expired.
  5. New RPCs arriving while state == GracePeriod are queued (not failed); they proceed if renewal succeeds, or fail with EKEYEXPIRED if the grace period expires.
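The renewal trigger condition reduces to one predicate (constants inlined; GSS_RENEWAL_LEAD_TIME_SEC = 60 s as above):

```rust
/// GSS_RENEWAL_LEAD_TIME_SEC expressed in nanoseconds.
const RENEWAL_LEAD_NS: u64 = 60 * 1_000_000_000;

/// True when the context should move to RenewPending: either expiry is
/// within the lead time, or the u32 wire sequence space is nearly wrapped.
fn needs_renewal(now_ns: u64, expiry_ns: u64, seq_num: u64) -> bool {
    now_ns >= expiry_ns.saturating_sub(RENEWAL_LEAD_NS)
        || (seq_num as u32) >= 0xFFFF_FF00
}
```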

15.11.4 NFSv4 Client State Machine

NFSv4 (RFC 7530 for v4.0, RFC 5661 for v4.1) is the primary NFS version. Key concepts:
- Leases: all NFSv4 state (open files, locks, delegations) is held under a time-limited lease. The client must renew its lease before it expires (default 90s) or all state is purged by the server.
- Client ID: a 64-bit clientid identifying the client, established via SETCLIENTID (v4.0) or EXCHANGE_ID (v4.1).
- Sessions (v4.1): connection-independent; RPCs can arrive on any TCP connection in the session. CREATE_SESSION establishes a session; a SEQUENCE operation prefixes every compound.
- Compounds: NFSv4 operations are batched into compounds (multiple operations per RPC call), e.g., PUTFH + GETATTR in one RPC.

NfsClient struct:

pub struct NfsClient {
    pub server_addr:   SockAddr,
    pub rpc_clnt:      Arc<XClnt>,
    pub clientid:      AtomicU64,       // NFSv4 client ID
    pub verifier:      [u8; 8],         // Client verifier (random, per boot)
    pub lease_time_s:  u32,             // Negotiated from server
    pub lease_renewer: JoinHandle<()>,  // Background task renewing the lease
    /// Read-heavy session parameters (capabilities, addresses, negotiated
    /// values). Read lock-free under RCU on the RPC dispatch hot path.
    /// Updated only on session creation, renewal, or server migration (cold).
    pub session_params: RcuCell<NfsSessionParams>,
    /// Per-slot sequence counters for NFSv4.1 session slots.
    /// Separated from `NfsSessionParams` because slot sequences are
    /// shared mutable state (updated on every RPC via `fetch_add`) while
    /// session parameters are read-mostly config (updated at renegotiation
    /// ~90s). Session parameter renewal no longer forces reallocation of
    /// 256 AtomicU32 counters. Updated independently: only when the server
    /// changes the slot count (rare — typically at session creation only).
    pub slot_table: RcuCell<Arc<SlotTable>>,
    /// Write-heavy lease/recovery state. Acquired only on lease renewal,
    /// state recovery, and session reset — all rare operations.
    pub lease_state:   SpinLock<NfsLeaseState>,
    pub nfs_version:   NfsVersion,      // V4_0 or V4_1
    // NFSv4.1 only:
    pub session_id:    Option<[u8; 16]>,
    pub fore_channel:  Option<SessionChannel>,
    pub back_channel:  Option<SessionChannel>,
}

/// Read-heavy session state — accessed under RCU read guard on every RPC
/// dispatch. No lock acquisition in the common path.
///
/// **Note**: Per-slot sequence counters are NOT in this struct. They live in
/// `SlotTable` (separate `RcuCell` on `NfsClient`). This separation ensures
/// that session parameter renegotiation (which replaces `NfsSessionParams`
/// via RCU) does not force reallocation of the 256 AtomicU32 slot counters.
pub struct NfsSessionParams {
    /// Server-advertised capabilities (from EXCHANGE_ID / CREATE_SESSION).
    pub server_caps:    u64,
    /// Negotiated maximum request/response sizes.
    pub max_req_sz:     u32,
    pub max_resp_sz:    u32,
    /// Number of active fore-channel slots. Read from here for config;
    /// `SlotTable::num_slots` is the authoritative count for RPC dispatch.
    pub num_slots:      u32,
    /// Network namespace for all RPC socket creation. Set during mount from the
    /// mounting process's network namespace. All TCP connections to the NFS server
    /// use this namespace for routing, firewall rules, and address resolution.
    /// Ensures container-scoped NFS mounts use the container's network stack.
    pub net_ns: Arc<NetNamespace>,
}

/// Per-slot sequence counters for NFSv4.1 session slots.
///
/// Separated from `NfsSessionParams` to avoid RCU replacement churn.
/// Session parameters change on renegotiation (~90s); slot sequences
/// change on every RPC (`fetch_add`). Independent RcuCell updates mean
/// session renewal never forces reallocation of this table.
///
/// **RPC dispatch hot path**: `rcu_read_lock()` → read `slot_table`
/// RcuCell → `seqs[slot_idx].fetch_add(1, Relaxed)` → release RCU guard.
/// Cost: one RCU read + one Arc deref + one atomic increment.
pub struct SlotTable {
    /// Sequence counters indexed by slot number. `Arc<[AtomicU32]>`
    /// because the server may negotiate any number of slots (typically
    /// 64-256). The Arc allows the array to outlive RCU replacement
    /// (readers that hold a reference during RCU swap continue to use
    /// the old table until they release).
    pub seqs: Arc<[AtomicU32]>,
    /// Number of active fore-channel slots. This is the authoritative
    /// count for RPC dispatch (may differ from `NfsSessionParams::num_slots`
    /// briefly during slot count renegotiation).
    pub num_slots: u32,
}
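The hot path described above amounts to one atomic increment per dispatch; a standalone sketch using std types in place of the kernel's RCU cell:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

struct SlotTable {
    seqs: Arc<[AtomicU32]>,
    num_slots: u32,
}

/// Return the sequence number to place in this compound's SEQUENCE op,
/// post-incrementing the slot's counter.
fn next_slot_seq(table: &SlotTable, slot_idx: usize) -> u32 {
    assert!((slot_idx as u32) < table.num_slots);
    table.seqs[slot_idx].fetch_add(1, Ordering::Relaxed)
}
```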

NFS mounts inside containers: the `net_ns` field ensures NFS client TCP connections
use the container's network namespace. Set at mount time from
`current_task().nsproxy.net_ns`. Immutable after mount — namespace changes via
`setns()` do not affect existing NFS mounts. Remount (`MS_REMOUNT`) from a task
in a different network namespace returns `EINVAL` — NFS cannot switch network
context on a live mount because it would invalidate all TCP connections and state
IDs. A container with a private `net_ns` gets its own NFS connections (separate
TCP sockets, separate NFSv4 sessions, separate clientid). This means an NFS mount
established inside a container always routes through that container's network stack
(routing tables, firewall rules, DNS resolution), regardless of subsequent namespace
manipulation. If the container's network namespace is destroyed while the NFS mount
is still active, all RPC operations return `ENETUNREACH` and the mount enters the
recovery path (lease expiry → reclaim sequence).

/// Write-heavy lease and recovery state — protected by SpinLock, acquired
/// only during lease renewal, state recovery, and session reset.
pub struct NfsLeaseState {
    /// Open owners: keyed by the 28-byte open_owner opaque identifier.
    /// Bounded by MAX_NFS_OPEN_OWNERS (65536). Returns NFS4ERR_RESOURCE on overflow.
    pub open_owners:    BTreeMap<[u8; 28], Arc<NfsOpenState>>,
    /// Current lease expiry (absolute time).
    pub lease_expiry:   Instant,
    /// True while state recovery is in progress. AtomicBool allows the
    /// RPC dispatch path to check `recovering.load(Relaxed)` without
    /// acquiring the NfsLeaseState SpinLock — reduces contention during
    /// recovery (the worst time for extra lock contention).
    pub recovering:     AtomicBool,
    /// Delegation return queue (delegations pending DELEGRETURN).
    pub pending_returns: ArrayVec<[u8; 16], 64>,
}

Open state machine — NfsOpenState per open file handle:

pub struct NfsOpenState {
    pub open_stateid: [u8; 16],       // 4-component stateid from server
    pub seqid:        u32,            // Local sequence for state transitions
    pub access:       NfsOpenAccess,  // Read / Write / Both
    pub deny:         NfsOpenDeny,    // None / Read / Write / Both
    pub delegation:   Option<NfsDelegation>,
    /// Per-file byte-range locks. Bounded by per-file lock limit
    /// (MAX_NFS_LOCKS = 256, matching typical server-side limits).
    /// Vec is acceptable: lock operations are warm-path (per-lock syscall),
    /// and typical files have << 10 concurrent locks.
    pub locks:        Vec<NfsLockState>,  // max MAX_NFS_LOCKS (256)
}

pub struct NfsDelegation {
    pub stateid:   [u8; 16],
    pub type_:     DelegationType,  // Read or Write
    pub recall_wq: WaitQueue,       // Signaled when server sends CB_RECALL
}

Write delegation — when the server grants a write delegation, the client may write and cache locally without contacting the server for each operation. On recall (server sends CB_RECALL via the NFSv4 callback channel), the client must flush all dirty pages and send DELEGRETURN before the server can grant access to other clients. The callback channel (established in CREATE_SESSION for v4.1, or via SETCLIENTID for v4.0) is a reverse TCP connection: server connects to client. The back_channel in NfsClient tracks this connection.

Lease renewal — a background kernel task (running as a Tier 1 task) calls RENEW (v4.0) or sends a SEQUENCE-only compound (v4.1) every lease_time_s * 2 / 3 seconds (default: 60s for a 90s lease). The renewal check also examines the GSS context sequence number for wire-wrap proximity: if (ctx.seq_num.load(Relaxed) as u32) >= 0xFFFF_FF00 { force GSS context renewal }. On network partition: lease renewal fails; after lease_time_s the server purges all client state. Client must perform state recovery: sends SETCLIENTID / EXCHANGE_ID (to re-establish client identity), then CLAIM_PREVIOUS opens for each open file, and LOCK reclaims for each lock, concluding with RECLAIM_COMPLETE.
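The renewal cadence is simple integer arithmetic (hypothetical helper):

```rust
/// Interval between lease renewals: two thirds of the negotiated lease.
fn renewal_interval_s(lease_time_s: u32) -> u32 {
    lease_time_s * 2 / 3
}
```

For the default 90 s lease this yields a 60 s renewal period, leaving a 30 s margin for retries before the server purges state.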

State recovery error paths:
- If the server returns NFS4ERR_STALE_CLIENTID during recovery, the client lost its lease entirely: all open-file state is gone, and all in-progress writes that were not yet flushed are lost. The VFS layer returns EIO to all blocked file operations.
- If CLAIM_PREVIOUS returns NFS4ERR_RECLAIM_BAD, the server no longer has a record of the open: the file descriptor is invalidated, and pending writes are dropped with EIO.
- Recovery is gated by a per-client recovering flag; new operations block (interruptibly if the intr mount option is set) until recovery completes or fails.

Open file state machine — Each NfsOpenState (per open file handle) transitions through the following states:

| State | Meaning |
|---|---|
| CLOSED | No open, no stateid |
| OPENING | OPEN RPC sent, awaiting server response |
| OPEN | File open, lease active, stateid valid |
| DELEGATED | Delegation granted (read or write) |
| RECALL_PENDING | Server sent CB_RECALL; grace period active (90s) |
| RETURNING | DELEGRETURN RPC sent, awaiting server acknowledgment |
| LEASE_EXPIRED | Lease timer fired; stateid may be invalid on server |
| RECLAIMING | Server restart detected; reclaim sequence in progress |
| RECLAIM_COMPLETE | All files reclaimed; resuming normal operation |

State transitions:

| From | Event | To | Action |
|---|---|---|---|
| CLOSED | open(2) called | OPENING | Send OPEN RPC |
| OPENING | OPEN response OK | OPEN | Store stateid; start lease timer |
| OPENING | OPEN response error | CLOSED | Return errno to caller |
| OPEN | Server grants delegation | DELEGATED | Store delegation stateid |
| OPEN | Lease timer fires | LEASE_EXPIRED | Attempt RENEW RPC |
| OPEN | NFS4ERR_STALE_CLIENTID | RECLAIMING | Re-establish client ID; pause I/O |
| DELEGATED | CB_RECALL received | RECALL_PENDING | Start 90s grace timer |
| RECALL_PENDING | DELEGRETURN sent | RETURNING | |
| RECALL_PENDING | Grace timer expires | RETURNING | Force send DELEGRETURN |
| RETURNING | DELEGRETURN OK | OPEN | Delegation relinquished; normal I/O |
| LEASE_EXPIRED | RENEW RPC OK | OPEN | Lease refreshed |
| LEASE_EXPIRED | RENEW fails (NFS4ERR_EXPIRED) | RECLAIMING | Server evicted state; reclaim needed |
| RECLAIMING | All OPEN CLAIM_PREVIOUS sent | RECLAIM_COMPLETE | Send RECLAIM_COMPLETE RPC |
| RECLAIM_COMPLETE | RECLAIM_COMPLETE RPC OK | OPEN | Normal operation resumes |
| RECLAIMING | Grace period expires (60s) | CLOSED | All stateids invalidated; return EIO |
| Any | close(2) + CLOSE RPC OK | CLOSED | Stateid invalidated |

Lease renewal timer: fires at lease_time_s * 2 / 3 (default: 60s for a 90s lease). Three consecutive renewal failures → LEASE_EXPIRED transition. Under NFSv4.1, renewal is implicit via the SEQUENCE operation in every compound RPC.

RECLAIM phase (triggered by NFS4ERR_STALE_CLIENTID or server restart detection):

  1. Pause all pending I/O on this client (operations return EINPROGRESS internally). Note: EINPROGRESS is an NFS-internal status during RECLAIM — it is never returned to userspace write(2) callers. The VFS layer translates EINPROGRESS to EIO or retries transparently depending on the operation and mount flags.
  2. Send SETCLIENTID (v4.0) or EXCHANGE_ID (v4.1) to re-establish client identity.
  3. For each cached NfsOpenState: send OPEN with CLAIM_PREVIOUS to reclaim the open.
  4. For each cached NfsLockState: send LOCK with reclaim = true.
  5. Send RECLAIM_COMPLETE; resume paused I/O. If RECLAIM fails for a specific file (server returns NFS4ERR_RECLAIM_BAD): that file's state transitions to CLOSED and all pending operations on it return ESTALE.

15.11.5 netfs Page Cache Layer

The netfs layer provides a shared page cache infrastructure for network filesystems. UmkaOS implements it as the cache tier between NFS (and future Ceph/AFS) and the page allocator. It replaces ad-hoc per-filesystem readahead and writeback logic with a unified, testable implementation.

Core abstractions:

pub trait NetfsInode: Send + Sync {
    /// Populate subrequests for a read covering [rreq.start, rreq.start + rreq.len).
    fn init_read_request(&self, rreq: &mut NetfsReadRequest);
    /// Issue a single subrequest to the server (or local cache).
    fn issue_read(&self, subreq: &mut NetfsSubrequest);
    /// Issue a write request to the server.
    fn issue_write(&self, wreq: &mut NetfsWriteRequest);
    /// Split a dirty range into write requests.
    fn create_write_requests(&self, wreq: &mut NetfsWriteRequest, start: u64, len: u64);
}

pub struct NetfsReadRequest {
    pub inode:       Arc<dyn NetfsInode>,
    pub start:       u64,             // Byte offset in file
    pub len:         usize,
    pub subrequests: Vec<NetfsSubrequest>,
    pub netfs_priv:  u64,             // Filesystem-private field
}

pub struct NetfsSubrequest {
    pub rreq:   Weak<NetfsReadRequest>,
    pub start:  u64,
    pub len:    usize,
    pub source: NetfsSource,   // Server, Cache, LocalXfer
    pub state:  AtomicU32,     // Pending / InFlight / Completed / Failed
}

/// Write request covering a contiguous dirty range.
/// Created by `netfs_writeback()`, split into sub-ranges by
/// `create_write_requests()`, issued via `issue_write()`.
pub struct NetfsWriteRequest {
    /// The inode being written to.
    pub inode:    Arc<dyn NetfsInode>,
    /// Byte offset of the first dirty byte in the file.
    pub offset:   u64,
    /// Total length of the dirty range in bytes.
    pub len:      u32,
    /// Folios backing this write. Pinned for the duration of the write;
    /// unpinned on completion. Fixed-size to avoid heap allocation in the
    /// writeback path (maximum 16 folios = 64 KiB at 4 KiB pages, matching
    /// the typical NFS `wsize`).
    pub folios:   ArrayVec<PageRef, 16>,
    /// Write stability mode: Unstable (deferred COMMIT), FileSync (O_SYNC),
    /// or DataSync (O_DSYNC).
    pub stability: NetfsWriteStability,
    /// Lifecycle state of this write request.
    pub state:    NetfsWriteState,
    /// Filesystem-private field (e.g., NFS verifier for COMMIT correlation).
    pub netfs_priv: u64,
}

/// Write request lifecycle states.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum NetfsWriteState {
    /// Folios collected, not yet issued.
    Pending,
    /// RPC in flight to server.
    InFlight,
    /// Server acknowledged the write.
    Complete,
    /// Write failed; errno stored for propagation to `fsync()` callers.
    Error(i32),
}

/// Write stability modes (matching NFSv4 stable_how).
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum NetfsWriteStability {
    /// Deferred commit: data may be in server's volatile cache until
    /// explicit `COMMIT` RPC (issued at `fsync()` time).
    Unstable,
    /// Write + fsync semantics: server flushes to stable storage before ACK.
    FileSync,
    /// Data-only sync: server flushes data (not metadata) before ACK.
    DataSync,
}

Read path: On page fault or explicit read() hitting an NFS-backed folio not in the page cache, netfs_read_folio() creates a NetfsReadRequest, calls init_read_request() which the NFS implementation uses to split the range into subrequests (one per READ RPC, sized to rsize), issues them concurrently via async tasks, and waits for all subrequests to complete. If a local CacheFiles cache is configured, subsets of reads may be served from disk cache rather than issuing an RPC.

Write path: On writeback(), netfs_writeback() groups dirty folios into write requests sorted by file offset, calls create_write_requests() to split into WRITE RPC-sized chunks (sized to wsize), and issues them via issue_write(). Ordering within a single writeback is by offset to maximize sequential I/O on the server. NFSv4 WRITE with FILE_SYNC stability mode is used when O_SYNC is active; otherwise UNSTABLE writes are used followed by a COMMIT RPC at fsync() time.
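The wsize chunking done by create_write_requests() can be sketched as a pure function over the dirty range (illustrative helper returning (offset, len) pairs):

```rust
/// Split [start, start + len) into chunks of at most wsize bytes,
/// in ascending offset order (maximizing sequential I/O on the server).
fn split_dirty_range(start: u64, len: u64, wsize: u64) -> Vec<(u64, u64)> {
    let mut chunks = Vec::new();
    let mut off = start;
    let end = start + len;
    while off < end {
        let n = (end - off).min(wsize);
        chunks.push((off, n));
        off += n;
    }
    chunks
}
```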

Readahead: The NetfsReadaheadControl struct drives speculative prefetch. When sequential read access is detected (via pos tracking in the file's NetfsInode), the readahead window expands up to max_readahead pages (default: 128 pages = 512 KiB at 4 KiB page size, configurable via mount option readahead=N). Readahead requests are lower priority than demand reads and are cancelled if memory pressure rises.
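A sketch of window growth on detected sequential access; the doubling policy here is an assumption for illustration (the text specifies only the cap, max_readahead = 128 pages by default):

```rust
/// Expand the readahead window, doubling (assumed policy) up to the
/// configured cap in pages.
fn grow_window(current_pages: u32, max_readahead: u32) -> u32 {
    (current_pages * 2).clamp(1, max_readahead)
}
```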

NFS dirty page backpressure: Network filesystems require writeback throttling beyond what local filesystems need, because NFS write RPCs can stall indefinitely when the server is unreachable. Without throttling, dirty pages accumulate in the page cache until memory pressure triggers the OOM killer — a catastrophic failure mode for NFS-heavy workloads.

/// Per-NFS-mount writeback throttle state. Embedded in the NFS superblock's
/// `NfsMountState` struct. Cooperates with the kernel's `balance_dirty_pages()`
/// infrastructure to throttle dirty page generation when NFS RPCs are stalled.
pub struct NfsWritebackThrottle {
    /// Number of outstanding NFS WRITE RPCs for this mount.
    /// Incremented when a WRITE RPC is dispatched; decremented on RPC
    /// completion (success or failure). Read by the throttle check on
    /// every balance_dirty_pages() callback.
    pub outstanding_rpcs: AtomicU32,

    /// Maximum outstanding WRITE RPCs before new writes begin blocking.
    /// Default: 256. Configurable via mount option `nfs_max_writes=N`.
    /// Upper bound: clamped to 4096 at mount time (prevents misconfiguration
    /// from exhausting RPC slot table or consuming excessive memory for
    /// in-flight buffers). At wsize=1048576 (1 MiB), 256 RPCs = 256 MiB
    /// of in-flight dirty data, which is a reasonable default for most
    /// NFS servers. The upper bound of 4096 caps at 4 GiB in-flight.
    pub max_outstanding_rpcs: u32,

    /// Set to true when the NFS transport detects server unreachable
    /// (TCP connection reset without successful reconnect, or RPC timeout
    /// exceeding 3 × timeo without response). When true, ALL new writes
    /// block in balance_dirty_pages() until the flag is cleared (server
    /// becomes reachable again and at least one RPC completes).
    /// Relaxed ordering is acceptable: cache coherence guarantees propagation
    /// within microseconds, and server congestion transitions are rare events
    /// lasting seconds to minutes. The slight observation delay is irrelevant.
    pub server_congested: AtomicBool,

    /// Wait queue for writers blocked by `server_congested == true` on hard
    /// mounts. When the SunRPC transport clears `server_congested` (after
    /// successful reconnect + first RPC completion), it calls
    /// `congestion_waitq.wake_up_all()` to unblock all waiting writers.
    /// Without this, blocked writers would need to poll, wasting CPU.
    pub congestion_waitq: WaitQueueHead,
}

Integration with balance_dirty_pages(): The kernel's writeback subsystem calls balance_dirty_pages() on every write() path to enforce global and per-BDI dirty page limits. NFS registers a BDI-specific dirty throttle callback that adds NFS-aware checks:

  1. If outstanding_rpcs.load(Relaxed) >= max_outstanding_rpcs: the callback returns a throttle rate of zero, causing balance_dirty_pages() to block the writing process until RPCs drain below the threshold. This prevents unbounded dirty page accumulation.
  2. If server_congested.load(Relaxed) == true: the callback blocks unconditionally. No new dirty pages are generated for this mount until the server is reachable. On hard mounts, this blocks indefinitely (correct: the write will eventually complete when the server returns). On soft mounts, the RPC layer returns EIO after retrans timeouts, which clears the congestion and propagates the error to the writing process.
  3. Otherwise: the callback returns a proportional throttle rate based on outstanding_rpcs / max_outstanding_rpcs, smoothly reducing write rate as the RPC queue fills.
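The three cases above can be sketched against a reduced version of the NfsWritebackThrottle struct (the wait queue is omitted); ThrottleDecision and the dirty_throttle() method name are assumptions standing in for the BDI callback contract, not the real hook signature.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering::Relaxed};

// Reduced stand-in for the NfsWritebackThrottle struct defined above.
pub enum ThrottleDecision {
    /// balance_dirty_pages() blocks the writing task.
    Block,
    /// Permitted write rate, as a percent of full speed.
    Rate(u32),
}

pub struct NfsWritebackThrottle {
    pub outstanding_rpcs: AtomicU32,
    pub max_outstanding_rpcs: u32,
    pub server_congested: AtomicBool,
}

impl NfsWritebackThrottle {
    /// The three-case decision from the list above.
    pub fn dirty_throttle(&self) -> ThrottleDecision {
        // Case 2: server unreachable, block unconditionally.
        if self.server_congested.load(Relaxed) {
            return ThrottleDecision::Block;
        }
        let out = self.outstanding_rpcs.load(Relaxed);
        // Case 1: RPC queue full, block until RPCs drain.
        if out >= self.max_outstanding_rpcs {
            return ThrottleDecision::Block;
        }
        // Case 3: proportional slowdown as the queue fills.
        let free = self.max_outstanding_rpcs - out;
        ThrottleDecision::Rate(free * 100 / self.max_outstanding_rpcs)
    }
}
```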

Congestion detection: The server_congested flag is set by the SunRPC transport layer when ConnReset errors persist for longer than 3 × timeo (default: 3 × 600 = 1800 deciseconds = 180 seconds). It is cleared when a reconnect() succeeds and at least one subsequent RPC completes. This avoids false congestion signals during brief network glitches (a single TCP reset triggers reconnect, not congestion).

Soft-mount EIO and dirty page preservation: When a soft-mount NFS client returns EIO due to server timeout, dirty pages are re-dirtied in the page cache (NOT released). This preserves data: the application can retry after server recovery. The server_congested flag prevents new writes from queueing. When the server becomes reachable again, the writeback engine flushes the re-dirtied pages normally. If the application calls fsync() during the outage, it receives EIO. If the application does not call fsync(), the data is silently re-flushed when the server recovers.
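A miniature model of this redirty-and-latch behavior; FileWriteback and its fields are hypothetical stand-ins for the page cache's dirty tracking and the per-file error latch of Section 15.1.

```rust
use std::collections::BTreeSet;

// Hypothetical model: failed pages return to the dirty set rather than
// being dropped, and the latched error is reported exactly once by fsync().
pub struct FileWriteback {
    pub dirty: BTreeSet<u64>, // dirty page indices awaiting writeback
    pub pending_err: bool,    // latched EIO to report at the next fsync()
}

impl FileWriteback {
    /// Writeback completion on a soft mount: on EIO the pages are
    /// re-dirtied (data preserved, NOT released) and the error latched.
    pub fn on_writeback_done(&mut self, pages: &[u64], result: Result<(), ()>) {
        if result.is_err() {
            self.dirty.extend(pages.iter().copied());
            self.pending_err = true;
        }
    }

    /// fsync() reports the latched error exactly once, then clears it.
    pub fn fsync(&mut self) -> Result<(), &'static str> {
        if std::mem::take(&mut self.pending_err) {
            Err("EIO")
        } else {
            Ok(())
        }
    }
}
```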

15.11.6 Mount Options and Integration

NFS mounts use the new mount API (fsopen("nfs4") + fsconfig() + fsmount(), as specified in Section 14.6):

fsconfig(fd, FSCONFIG_SET_STRING, "source", "server.example.com:/export")
fsconfig(fd, FSCONFIG_SET_STRING, "sec",    "krb5p")
fsconfig(fd, FSCONFIG_SET_STRING, "vers",   "4.1")
fsconfig(fd, FSCONFIG_SET_STRING, "rsize",  "1048576")
fsconfig(fd, FSCONFIG_SET_STRING, "wsize",  "1048576")
fsconfig(fd, FSCONFIG_SET_STRING, "timeo",  "600")    // 60 seconds (units: 1/10 s)
fsconfig(fd, FSCONFIG_SET_STRING, "retrans","2")
fsconfig(fd, FSCONFIG_SET_FLAG,   "hard",   NULL)     // Hard mount: retry indefinitely

Key mount options:

| Option | Values | Meaning |
| --- | --- | --- |
| vers | 4.0, 4.1, 4.2 | NFSv4 minor version. When vers=4.2 is specified, the client negotiates NFSv4.2 (RFC 7862) with the server. v4.2 operations deferred to Phase 4: COPY (server-side copy), SEEK (hole/data), ALLOCATE/DEALLOCATE, CLONE. Until Phase 4, the client uses v4.1-equivalent fallbacks. |
| sec | sys, krb5, krb5i, krb5p | Security flavor |
| rsize | 4096–1048576 | Read buffer size (bytes); must be a multiple of 4096 |
| wsize | 4096–1048576 | Write buffer size (bytes); must be a multiple of 4096 |
| hard / soft | flag | Hard: retry indefinitely; soft: return error after retrans timeouts |
| intr | flag | Allow signals to interrupt hard-mount retries |
| timeo | integer (1/10 s) | Per-RPC timeout before retransmit |
| retrans | integer | Number of retransmits before soft-mount error |
| nconnect | 1–16 | Number of parallel TCP connections to the server |
| readahead | pages | Readahead window size in pages (default 128) |
| ac / noac | flag | Attribute caching; noac disables the client-side attribute cache |
| actimeo | seconds | Unified attribute cache timeout |

nconnect implementation: When nconnect=N is set, the XClnt maintains N TcpTransport instances. Each async RPC call is dispatched to the transport with the lowest in-flight queue depth (round-robin with depth tie-breaking). This spreads NFS traffic across multiple TCP flows, which improves throughput on high-bandwidth links where a single TCP flow is CPU- or window-limited. Connections in the RECONNECTING state are excluded from the dispatch pool. In-flight RPCs whose underlying TCP connection fails are re-dispatched to a surviving connection (selected by lowest queue depth). When the failed connection completes reconnection, it is re-added to the dispatch pool with queue depth 0.
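The dispatch rule can be sketched as follows, with Transport as a stand-in for the real TcpTransport instances and pick_transport() an illustrative helper name.

```rust
// Sketch of lowest-queue-depth dispatch across nconnect transports.
#[derive(PartialEq)]
pub enum XportState { Connected, Reconnecting }

pub struct Transport {
    pub state: XportState,
    pub in_flight: u32, // outstanding RPCs queued on this connection
}

/// Pick the connected transport with the fewest in-flight RPCs.
/// Transports in the RECONNECTING state are excluded from the pool;
/// min_by_key keeps the first minimum (the real dispatcher adds
/// round-robin tie-breaking on top of this).
pub fn pick_transport(pool: &[Transport]) -> Option<usize> {
    pool.iter()
        .enumerate()
        .filter(|(_, t)| t.state == XportState::Connected)
        .min_by_key(|(_, t)| t.in_flight)
        .map(|(i, _)| i)
}
```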

Capability requirements:

  • Capability::SysAdmin: required to mount NFS (same as Linux). Enforced in nfs4_validate_mount_data(), called from the fsconfig() implementation.
  • Capability::NetAdmin: required to configure NFS server-side parameters (not client mounts).
  • Rootless containers: NFS mounts inside a user namespace require that the filesystem server grant access to the mapped UID/GID range; the mount itself is permitted only if the user namespace has a mapping for UID 0 (i.e., is a privileged user namespace in the context of the host).

sysfs interface /sys/kernel/umka/nfs/:

  • clients/: one directory per active NfsClient, containing:
      - clientid: hex-encoded 64-bit client ID
      - server: server address
      - lease_time_s: negotiated lease period
      - state: active / recovering / expired
      - session_id (v4.1 only): hex-encoded 128-bit session ID
  • servers/: per-server aggregate statistics:
      - rtt_us: exponentially smoothed round-trip time (microseconds)
      - retransmissions: total retransmitted RPCs since mount
      - ops_per_sec: rolling 1-second average of completed RPCs

15.11.7 Locking: lockd and NFSv4 Built-in Locks

NFSv3 uses lockd (Network Lock Manager, NLM protocol, RFC 1813 appendix) for advisory file locking. NFSv4 has locking built into the compound protocol (LOCK / UNLOCK / LOCKT operations).

NfsLockState (NFSv4):

/// Sentinel value for byte-range locks extending to end of file.
/// Used in `NfsLockState.length` and NLM/NFSv4 LOCK operations.
/// Matches Linux NFS_LOCK_TO_EOF (0xFFFF_FFFF_FFFF_FFFF).
pub const NFS_LOCK_TO_EOF: u64 = u64::MAX;

pub struct NfsLockState {
    pub stateid: [u8; 16],
    pub type_:   NfsLockType,  // Read / Write
    pub offset:  u64,
    pub length:  u64,          // NFS_LOCK_TO_EOF = to end of file
    pub seqid:   u32,
}

NFSv4 LOCK compound: SEQUENCE + PUTFH + LOCK { type_, reclaim, offset, length, locker: OpenToLockOwner { open_seqid, open_stateid, lock_seqid, lock_owner } }. On success returns lock_stateid used for subsequent LOCKU. On NFS4ERR_DENIED, returns the conflicting lock's owner, offset, and length so the caller can implement blocking via POSIX F_SETLKW semantics (client polls with exponential backoff up to timeo).
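The polling fallback for F_SETLKW can be sketched as follows; setlkw(), the lock_rpc closure, and the 10 ms initial interval are illustrative, while the exponential backoff capped at timeo comes from the text.

```rust
// Sketch of blocking-lock emulation by re-polling after NFS4ERR_DENIED.
pub enum LockResult {
    Granted([u8; 16]), // lock_stateid for subsequent LOCKU
    Denied,            // NFS4ERR_DENIED: conflicting lock held
}

/// Retry the LOCK compound until granted; returns the stateid and the
/// final poll interval. `lock_rpc` stands in for issuing the compound.
pub fn setlkw(
    mut lock_rpc: impl FnMut() -> LockResult,
    timeo_ms: u64,
) -> ([u8; 16], u64) {
    let mut poll_interval_ms: u64 = 10; // illustrative starting delay
    loop {
        if let LockResult::Granted(stateid) = lock_rpc() {
            return (stateid, poll_interval_ms);
        }
        // A real client sleeps here before re-polling; the interval
        // grows exponentially but never exceeds the mount's timeo.
        poll_interval_ms = (poll_interval_ms * 2).min(timeo_ms);
    }
}
```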

lockd (NFSv3) — NLM protocol between kernel lockd threads. lockd starts automatically when the first NFSv3 mount is established (Capability::SysAdmin required). The NLM daemon:

  1. Registers with portmap/rpcbind as program 100021, version 4.
  2. Accepts NLM_LOCK, NLM_UNLOCK, NLM_TEST RPCs from clients (server role) and issues them to remote servers (client role).
  3. Implements the grace period subsystem: after server reboot, accepts only NLM_LOCK with reclaim=true until all clients have re-claimed their locks or the grace period (default 45 s) expires.

Interaction between NLM and the VFS lock layer: NLM calls vfs_lock_file() (which calls the filesystem's lock() inode operation) on behalf of remote clients. UmkaOS's lock layer tracks pending NLM locks in INode::nlm_locks: Vec<NlmLock>, serialized by the inode's lock_mutex. When a lock is granted to a remote client, the NlmLock entry records the remote host and lock owner opaque identifier so it can be released on client crash (detected via NSM — Network Status Monitor callbacks, registered via SM_NOTIFY).

15.11.8 Design Decisions

  1. NFSv4.1 as the default minor version: v4.1 sessions eliminate the need for the callback channel to traverse firewalls (server uses the established fore channel for callbacks in v4.1), simplify lease recovery (session semantics), and enable parallel slot usage. The client attempts v4.1 first and falls back to v4.0 only if the server rejects EXCHANGE_ID.

  2. RPCSEC_GSS in-kernel, not userspace: Keeping GSS context management in the kernel (with upcalls to gssd only for ticket acquisition) eliminates a per-RPC round trip to userspace at the krb5i/krb5p security levels. The integrity and privacy transforms (AES-256-CTS + HMAC-SHA-512/256 per RFC 8009) are performed in-kernel using the crypto subsystem.

  3. nconnect for throughput scaling: A single TCP connection is limited by the TCP window and per-CPU processing. Multiple connections allow the NFS client to drive higher server throughput without RDMA. This matches Linux behavior since kernel 5.3.

  4. Hard mounts as default: Soft mounts return EIO on transient network failures and can corrupt application data. Hard mounts block until the server is reachable again. Applications that need timeout behavior use intr + SIGINT handling or O_NONBLOCK at the VFS layer.

Layered retry semantics: TCP retransmission and NFS RPC retry operate independently at different layers. TCP handles segment-level retransmission (typically 3 retransmits, ~60s total before connection reset). NFS RPC retry is above TCP: on a soft mount, the RPC layer returns EIO after retrans timeouts of timeo each (default for TCP: 2 × 600 = 1200 deciseconds = 120 s total). On a hard mount, the RPC layer retries indefinitely, reconnecting the TCP transport if the connection drops. The total visible timeout is max(TCP retransmit window, NFS RPC timeout) — typically the NFS RPC layer dominates because it waits for the TCP connection to be re-established before retrying.
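The soft-mount arithmetic reduces to one line (the function name is illustrative):

```rust
/// Visible soft-mount timeout from the mount options: retrans timeouts
/// of timeo each. timeo is in deciseconds, the result in seconds.
pub fn soft_timeout_secs(timeo_deciseconds: u32, retrans: u32) -> u32 {
    // Worked example from the text: 2 x 600 ds = 1200 ds = 120 s.
    retrans * timeo_deciseconds / 10
}
```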

  5. netfs layer as shared infrastructure: Rather than NFS implementing its own readahead and writeback, the netfs layer provides a single tested implementation. Future addition of Ceph or AFS clients reuses the same infrastructure without duplicating logic.

  6. Zero-copy XDR via NetBuf chains: RPC payloads for large reads and writes avoid data copies by encoding directly into or decoding directly from the NetBuf chains used by the TCP transport (Section 12). The record-mark framing is prepended as a single 4-byte header NetBuf node; the data pages are appended as additional NetBuf nodes referencing page cache pages directly.

  7. Attribute caching (ac option): NFS attributes (size, mtime, ctime, nlinks) are cached for actimeo seconds (default: 3–60s, scaling with file size change frequency). noac disables caching entirely, providing close-to-open coherence at the cost of one GETATTR per VFS operation. The attribute cache is stored in the NfsInode overlaid on the Inode (as with all UmkaOS filesystem-specific inode data).

NFS d_revalidate lock ordering: During ref-walk path resolution, NFS d_revalidate acquires i_rwsem shared on the parent inode before issuing a GETATTR RPC. Lock ordering: mmap_lock < i_rwsem < socket_lock. The RPC may block on network I/O; i_rwsem shared mode allows concurrent lookups on the same directory.

  8. Network namespace composition: Each NFS mount is bound to the network namespace active at mount time (NfsSessionParams::net_ns). The NFS client (NfsClntState) holds an Arc<NetNamespace> reference captured from current_task().nsproxy.net_ns during FileSystemOps::mount(). All SunRPC connections for this mount are created within the captured network namespace: socket creation calls sock_create_kern(net_ns, AF_INET/AF_INET6, SOCK_STREAM, 0) with the stored net_ns reference, ensuring TCP connections use the namespace's routing table, firewall rules, and port space. When a container with its own network namespace mounts NFS, the mount's RPC connections are confined to the container's network stack. If the network namespace is destroyed while NFS mounts remain (container teardown without explicit unmount), all in-flight RPCs receive ENETUNREACH and the mount enters NFS4ERR_STALE recovery — the hard mount option blocks until the namespace is re-created or the admin force-unmounts with umount -f.

15.12 NFS Server (nfsd)

UmkaOS's NFS server (nfsd) enables exporting local filesystems to remote NFS clients over NFSv3 (RFC 1813) and NFSv4.1 (RFC 5661). The server runs as a pool of kernel threads that service SunRPC requests arriving on UDP and TCP port 2049. Configuration is via /proc/fs/nfsd/ and the exportfs(8) utility, which parses /etc/exports and writes export records into the kernel. NFSv4.1 is the default negotiated minor version; NFSv4.0 and NFSv3 clients are accepted by capability negotiation at connection time. The NFS server integrates with:

  • Section 13 (VFS) for all filesystem operations (lookup, read, write, getattr, setattr, readdir, lock, fsync).
  • Section 15.11 (NFS Client) for the shared SunRPC transport and RPCSEC_GSS machinery (the same RpcTransport infrastructure is used in both client and server roles).
  • Section 8 (Security) for Kerberos GSS context establishment and UID/GID credential validation.

15.12.1 Overview

The NFS server is structured into four layers:

  1. Transport: svc_recv() — per-thread blocking receive over the shared RpcSocket.
  2. Dispatch: svc_dispatch() — demultiplex by RPC program / version / procedure.
  3. NFS handlers: per-procedure functions that validate export permissions, decode XDR arguments, call into the VFS, and encode XDR replies.
  4. Stable state: the NFSv4 state machine (clients, sessions, opens, locks, delegations) and the stable-storage journal for crash recovery.

The Duplicate Request Cache (DRC) sits between layers 2 and 3 to suppress re-execution of non-idempotent operations on retransmitted requests.

15.12.2 VFS ExportOps Interface

The NFS server requires filesystems to implement ExportOperations to allow stable file handles — handles that survive server restart and that the server can use to reconstruct a dentry from opaque bytes alone, without a mounted path hierarchy.

/// Implemented by filesystems that support being NFS-exported.
///
/// Stable file handles survive server restarts. The server must be able to
/// reconstruct a `Dentry` from the opaque handle bytes alone. Filesystems
/// that do not implement this trait cannot be NFS-exported; attempting to do
/// so returns `EINVAL`.
///
/// # Safety invariant
/// `encode_fh` and `fh_to_dentry` must be inverses: for any inode `i`,
/// `fh_to_dentry(sb, buf, ty)` where `(buf, ty) = encode_fh(i, buf, None)`
/// must return a dentry pointing to the same inode.
pub trait ExportOperations: Send + Sync {
    /// Encode `inode` (and optionally its `parent`) into `fh`.
    ///
    /// Returns the handle-type byte stored in the on-wire NFS file handle.
    /// Typical implementations encode `(ino, generation)` for `parent = None`
    /// and `(ino, generation, parent_ino, parent_generation)` when a parent
    /// is supplied.
    fn encode_fh(
        &self,
        inode: &Inode,
        fh: &mut [u8; 128],
        parent: Option<&Inode>,
    ) -> u8;

    /// Reconstruct a dentry from a file handle.
    ///
    /// Called on every NFS operation that arrives with a file handle. The
    /// implementation must locate the inode (by inode number + generation or
    /// by UUID) and return an instantiated dentry. Returns `ESTALE` if the
    /// inode no longer exists.
    fn fh_to_dentry(
        &self,
        sb: &SuperBlock,
        fh: &[u8],
        fh_type: u8,
    ) -> Result<Arc<Dentry>, KernelError>;

    /// Reconstruct the parent dentry from a file handle that contains parent
    /// information (i.e., was encoded with `parent = Some(...)`).
    ///
    /// Returns `ESTALE` if the parent inode no longer exists.
    fn fh_to_parent(
        &self,
        sb: &SuperBlock,
        fh: &[u8],
        fh_type: u8,
    ) -> Result<Arc<Dentry>, KernelError>;

    /// Return the filename of `child` within `parent`.
    ///
    /// Used during NFSv4 `READDIR` to build parent-relative paths for
    /// directory entries. Returns `ENOENT` if `child` is not in `parent`.
    ///
    /// Returns `ArrayString<256>` (stack-allocated, no heap) because
    /// filenames are bounded by `NAME_MAX` (255 bytes) on all supported
    /// filesystems. This avoids a `String` heap allocation on a path
    /// that can be called frequently during NFS stale-handle recovery
    /// and LOOKUPP operations.
    fn get_name(
        &self,
        parent: &Dentry,
        child: &Dentry,
    ) -> Result<ArrayString<256>, KernelError>;

    /// Return the parent dentry of `child`.
    ///
    /// Used to walk upward toward the export root when the client traverses
    /// beyond the export boundary. Returns `EXDEV` if `child` is already the
    /// filesystem root.
    fn get_parent(&self, child: &Dentry) -> Result<Arc<Dentry>, KernelError>;
}

Standard UmkaOS-supported filesystems (ext4, XFS, Btrfs, tmpfs) implement ExportOperations using (inode_number, generation_number) as the file handle payload. The generation number is incremented each time an inode number is reused, ensuring handles from before a delete are correctly rejected as ESTALE rather than silently aliasing a new file.
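A sketch of this handle scheme follows, with an illustrative byte layout and type value; the real on-wire encoding is filesystem-specific.

```rust
// Sketch of the (inode number, generation) handle payload.
pub const FH_TYPE_INO_GEN: u8 = 1; // illustrative handle-type byte

/// Encode (ino, generation) into the handle buffer; returns the type byte.
pub fn encode_fh(ino: u64, generation: u32, fh: &mut [u8; 128]) -> u8 {
    fh[0..8].copy_from_slice(&ino.to_be_bytes());
    fh[8..12].copy_from_slice(&generation.to_be_bytes());
    FH_TYPE_INO_GEN
}

/// Decode and validate a handle. A generation mismatch means the inode
/// number was reused after a delete, so the handle is rejected as ESTALE
/// rather than silently aliasing the new file. `current_gen` stands in
/// for the filesystem's inode lookup.
pub fn fh_to_ino(
    fh: &[u8],
    fh_type: u8,
    current_gen: impl Fn(u64) -> Option<u32>,
) -> Result<u64, &'static str> {
    if fh_type != FH_TYPE_INO_GEN || fh.len() < 12 {
        return Err("EINVAL");
    }
    let ino = u64::from_be_bytes(fh[0..8].try_into().unwrap());
    let gen = u32::from_be_bytes(fh[8..12].try_into().unwrap());
    match current_gen(ino) {
        Some(g) if g == gen => Ok(ino),
        _ => Err("ESTALE"), // inode gone, or generation bumped on reuse
    }
}
```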

15.12.3 Exports Database

The exports table maps (host_pattern, local_path) to ExportOptions. It is loaded at server startup and updated by exportfs -a writing binary records to /proc/fs/nfsd/exports.

/// One row in the NFS exports table.
pub struct NfsExport {
    /// Root dentry of the exported directory tree.
    pub path:       Arc<Dentry>,
    /// Unique filesystem-ID for this export, embedded in NFSv3 `fsstat` and
    /// NFSv4 `fs_locations`. Auto-assigned from `sb.dev` unless overridden
    /// by `fsid=` option.
    pub fsid:       u64,
    /// Host specifier: single IP (`192.168.1.5`), CIDR subnet
    /// (`10.0.0.0/24`), DNS name (`host.example.com`), NIS netgroup
    /// (`@cluster`), or wildcard (`*`).
    pub client:     NfsClientSpec,
    /// Parsed export options.
    pub options:    ExportOptions,
    /// Effective UID for unauthenticated or squashed access (default 65534,
    /// the traditional `nfsnobody` UID).
    pub anon_uid:   u32,
    /// Effective GID for unauthenticated or squashed access (default 65534).
    pub anon_gid:   u32,
}

/// Parsed export options from `/etc/exports`.
pub struct ExportOptions {
    /// Allow write access. Default: `false` (read-only).
    pub rw:            bool,
    /// Require that every `WRITE` is committed to stable storage before the
    /// RPC reply is sent (`sync` option). Default: `true` (`sync`).
    ///
    /// **Rationale**: Linux changed the default from `async` to `sync` in
    /// kernel 2.6.33 (2010). The `async` default caused silent data loss
    /// on server crash: the server acknowledged writes that had not yet
    /// reached stable storage, and NFS clients (which trust the server's
    /// response) discarded their cached copies. This violated the POSIX
    /// `write()` durability contract that applications depend on. UmkaOS
    /// follows the modern `sync` default. Administrators who accept the
    /// data-loss risk for performance can explicitly set `async` in
    /// `/etc/exports`.
    pub sync:          bool,
    /// Map UID 0 to `anon_uid`. Default: `true`.
    pub root_squash:   bool,
    /// Map all UIDs to `anon_uid`. Default: `false`.
    pub all_squash:    bool,
    /// Verify that file handles refer to a file within the exported subtree
    /// (not just the exported filesystem). Incurs a full path walk per
    /// request. Default: `false` (disabled since Linux 2.6.x; the
    /// performance cost is rarely worth the security benefit on modern
    /// systems).
    pub subtree_check: bool,
    /// Accepted security flavors, in preference order. Default: `[Sys]`.
    pub sec:           ArrayVec<NfsSec, 4>,
    /// Explicit `fsid=` override. Supersedes the auto-assigned value.
    pub fsid:          Option<u64>,
    /// Automatically re-export submounts visible under this path. Default:
    /// `false`.
    pub crossmnt:      bool,
    /// Do not hide submounts from clients; clients must traverse them
    /// explicitly via a separate mount. Default: `false`.
    pub nohide:        bool,
    /// Skip AUTH_NLM authentication for NFSv3 lock requests. Default:
    /// `false`.
    pub no_auth_nlm:   bool,
    /// Require that the export is only activated when this path is an active
    /// mountpoint (the `mp=` option). `None` = no requirement.
    /// Bounded by PATH_MAX (4096 bytes).
    pub mp:            Option<String>,
}

/// Security flavor accepted on this export.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum NfsSec {
    /// AUTH_SYS (UID/GID in RPC credential, no authentication).
    Sys,
    /// RPCSEC_GSS Kerberos 5: authentication only.
    Krb5,
    /// RPCSEC_GSS Kerberos 5: authentication + integrity.
    Krb5i,
    /// RPCSEC_GSS Kerberos 5: authentication + integrity + privacy.
    Krb5p,
}
/// Global export table. RCU-protected hash table keyed on (path_hash, client_addr).
/// Hot-path lookup is O(1) with no lock acquisition. Writer path (exportfs -a)
/// holds a Mutex and publishes via RCU grace period.
pub struct NfsExportTable {
    /// RCU-protected hash map: (path_hash, client_addr) → NfsExport.
    /// Non-integer composite key; RcuHashMap is the correct collection
    /// per collection usage policy.
    pub entries: RcuHashMap<(u64, IpAddr), NfsExport>,
    /// Writer lock for export add/remove operations.
    pub write_lock: Mutex<()>,
}

On each NFS request the server calls export_table.lookup(dentry, peer_addr) — an O(1) RCU read with no lock acquisition on the hot path. Updates (from exportfs -a writing /proc/fs/nfsd/exports) take the writer lock, rebuild the affected bucket, and publish via an RCU grace period.

15.12.4 Server Threads

/// The nfsd thread pool. One pool per NUMA node (optional; by default a
/// single pool is used for all CPUs).
pub struct NfsdPool {
    /// Active kernel threads servicing RPC requests.
    /// Bounded: max `NFSD_MAX_THREADS` (8192) per pool.
    /// Collection policy: warm-path allocation (thread count changes are
    /// admin operations via `/proc/fs/nfsd/threads`). Vec with documented
    /// bound is acceptable per collection policy.
    /// Memory budget at max: 8192 threads x ~48 KB (16 KB stack + 32 KB
    /// request/reply buffers) = ~384 MB. Typical production: 32-512
    /// threads (~1.5-24 MB). Probe-time validation rejects thread counts
    /// exceeding `NFSD_MAX_THREADS`.
    pub threads:      Vec<KernelThread>,
    /// Current configured thread count. Writable via
    /// `/proc/fs/nfsd/threads`. Default: 8. Typical production: 32–512.
    /// Hard upper bound: `NFSD_MAX_THREADS` (8192).
    pub count:        AtomicU32,
    /// Shared RPC transport abstraction for port 2049. Despite the singular
    /// name, `RpcSocket` internally multiplexes both UDP and TCP listeners
    /// (and optionally RDMA). The singular name follows Linux's `svc_serv`
    /// convention. See `RpcSocket` for multi-transport details.
    pub socket:       Arc<RpcSocket>,
    /// Duplicate request cache shared across all threads in this pool.
    pub drc:          Arc<DuplicateRequestCache>,
    /// Per-pool statistics (requests received, dispatched, dropped).
    pub stats:        NfsdPoolStats,
}

Thread lifecycle:

  1. rpc.nfsd(8) opens /proc/fs/nfsd/threads and writes the desired thread count.
  2. The kernel spawns that many nfsd/<n> kernel threads.
  3. Each thread loops: svc_recv(socket)svc_authenticate(req)svc_dispatch(req)svc_send(reply).
  4. svc_recv() blocks in poll()/epoll_wait() on the shared socket; threads compete for incoming requests (one request per wakeup).
  5. Each thread owns a private 16 KB request buffer and a private 16 KB reply buffer, allocated once at thread creation and accounted separately from the thread's 16 KB stack (see the NfsdPool memory budget above); no per-request heap allocation is required in the common case.
  6. Writing 0 to /proc/fs/nfsd/threads shuts down all threads after draining in-flight requests.

Because nfsd threads are kernel threads (not user processes), each VFS call from a thread executes directly in kernel context with the caller's effective credential set — no context switch to user space is required between RPC dispatch and filesystem operation.

15.12.5 Duplicate Request Cache (DRC)

The DRC prevents non-idempotent operations from being re-executed on retransmitted requests. It is mandatory for correctness: a client that retransmits CREATE foo after a network timeout would otherwise create foo a second time if the first succeeded.

Non-idempotent procedures covered: SETATTR, WRITE, CREATE, MKDIR, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK (NFSv3); OPEN, CLOSE, SETATTR, WRITE, CREATE, REMOVE, RENAME, LINK, LOCK, LOCKU (NFSv4 — note: NFSv4.1 sessions provide their own exactly-once semantics via slot + sequence IDs, so the DRC is used only for NFSv3 and NFSv4.0 in UmkaOS).

/// Number of DRC shards. Each RPC locks only its shard — 64x reduction
/// in contention compared to a single global lock.
pub const DRC_SHARD_COUNT: usize = 64;

/// Duplicate request cache: sharded by `hash(client_addr, xid)`.
/// Each shard has its own SpinLock and LRU cache, so concurrent RPCs
/// targeting different shards proceed without contention.
pub struct DuplicateRequestCache {
    shards: [SpinLock<DrcShard>; DRC_SHARD_COUNT],
}

/// One shard of the DRC. Per-shard capacity =
/// `(1024 * nfsd_thread_count) / DRC_SHARD_COUNT`.
pub struct DrcShard {
    entries:     LruCache<DrcKey, DrcEntry>,
    max_entries: usize,
}

/// Cache key: uniquely identifies one RPC call from one client.
#[derive(Hash, Eq, PartialEq, Clone)]
pub struct DrcKey {
    /// IPv4 or IPv6 address of the originating client.
    pub client_addr: IpAddr,
    /// RPC transaction ID (XID) from the call header.
    pub xid:         u32,
}

/// Maximum NFS RPC reply size cached in the DRC (8 KiB covers all NFSv3/v4
/// non-idempotent replies including GETATTR post-op attributes).
const NFS_DRC_MAX_REPLY: usize = 8192;

/// Cached reply for a completed non-idempotent operation.
pub struct DrcEntry {
    /// Serialized XDR reply bytes, ready to retransmit.
    /// Bounded by `NFS_DRC_MAX_REPLY` — replies exceeding this are not cached
    /// (the operation is re-executed on replay, which is safe because the
    /// DRC only caches non-idempotent ops that already committed).
    pub reply:     ArrayVec<u8, NFS_DRC_MAX_REPLY>,
    /// Wall-clock time the entry was inserted (for eviction policy).
    pub timestamp: Instant,
    /// Adler-32 of the full request body. Used to detect the degenerate case
    /// where two different requests happen to share the same XID — in that
    /// case the cached reply is discarded and the new request is executed.
    pub checksum:  u32,
}

Request processing for non-idempotent procedures:

  1. Compute DrcKey { client_addr, xid } and checksum = adler32(request_body).
  2. Select shard: shard_idx = hash(client_addr, xid) % DRC_SHARD_COUNT.
  3. Lock shards[shard_idx] (SpinLock) and look up the key.
  4. Hit, checksum matches: return entry.reply directly; skip VFS execution.
  5. Hit, checksum mismatch: evict stale entry; release shard lock; proceed to execute (new request collided with an old XID).
  6. Miss: release shard lock, execute VFS operation, re-lock shard, insert DrcEntry { reply, timestamp, checksum }, release shard lock, send reply.
  7. Entries are evicted LRU when per-shard capacity is exceeded, or after 120 seconds (hard TTL).
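Steps 3-6 can be sketched with a plain HashMap standing in for one LRU shard, a (xid, addr) tuple for DrcKey, and a u32 for the Adler-32 request checksum.

```rust
use std::collections::HashMap;

// Sketch of the DRC hit/miss/collision logic from the steps above.
pub struct DrcEntry {
    pub reply: Vec<u8>, // serialized XDR reply, ready to retransmit
    pub checksum: u32,  // Adler-32 of the original request body
}

pub enum DrcOutcome {
    Replay(Vec<u8>), // retransmit cached reply; skip VFS execution
    Execute,         // run the operation (miss, or stale-XID collision)
}

pub fn drc_check(
    shard: &mut HashMap<(u32, u32), DrcEntry>,
    key: (u32, u32),
    checksum: u32,
) -> DrcOutcome {
    match shard.get(&key) {
        // Hit with matching checksum: a true retransmit.
        Some(e) if e.checksum == checksum => DrcOutcome::Replay(e.reply.clone()),
        // Hit with mismatched checksum: a new request reused an old XID.
        // Evict the stale entry and execute the new request.
        Some(_) => {
            shard.remove(&key);
            DrcOutcome::Execute
        }
        None => DrcOutcome::Execute,
    }
}
```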

15.12.6 NFSv3 Protocol Dispatch

NFSv3 (RPC program 100003, version 3, RFC 1813) uses a stateless request/reply model. All NFS file handles are opaque blobs of up to 64 bytes. The server reconstructs a dentry from the file handle on every request via ExportOperations::fh_to_dentry().

| Procedure | Handler | Idempotent |
| --- | --- | --- |
| NULL (0) | nfsd3_null() | yes |
| GETATTR (1) | nfsd3_getattr() | yes |
| SETATTR (2) | nfsd3_setattr() | no |
| LOOKUP (3) | nfsd3_lookup() | yes |
| ACCESS (4) | nfsd3_access() | yes |
| READLINK (5) | nfsd3_readlink() | yes |
| READ (6) | nfsd3_read() | yes |
| WRITE (7) | nfsd3_write() | no |
| CREATE (8) | nfsd3_create() | no |
| MKDIR (9) | nfsd3_mkdir() | no |
| SYMLINK (10) | nfsd3_symlink() | no |
| MKNOD (11) | nfsd3_mknod() | no |
| REMOVE (12) | nfsd3_remove() | no |
| RMDIR (13) | nfsd3_rmdir() | no |
| RENAME (14) | nfsd3_rename() | no |
| LINK (15) | nfsd3_link() | no |
| READDIR (16) | nfsd3_readdir() | yes |
| READDIRPLUS (17) | nfsd3_readdirplus() | yes |
| FSSTAT (18) | nfsd3_fsstat() | yes |
| FSINFO (19) | nfsd3_fsinfo() | yes |
| PATHCONF (20) | nfsd3_pathconf() | yes |
| COMMIT (21) | nfsd3_commit() | yes |

WRITE stability semantics: NFSv3 WRITE carries a stable_how field:

  • FILE_SYNC: data and metadata must be written to stable storage before reply. Implemented by calling vfs_write() followed by vfs_fsync(file, 0, len, 1).
  • DATA_SYNC: data must reach stable storage; metadata update may be deferred. Implemented by vfs_write() + vfs_fdatasync().
  • UNSTABLE: data written to page cache only (no fsync). The server returns the current write_verifier (a 64-bit value, initialized to ktime_get_boot_ns() at server start and written to /proc/fs/nfsd/write_verifier). The client must issue a COMMIT RPC before treating UNSTABLE writes as durable.

COMMIT: nfsd3_commit() calls vfs_fsync_range(file, offset, offset + count - 1, 0) and returns the write_verifier. If the verifier has changed since the client last received it (indicating a server restart), the client must re-issue all UNSTABLE writes.

READDIRPLUS: returns both directory entry names and their attributes in a single RPC, amortizing the per-entry GETATTR round trips. Implemented by iterating vfs_iterate_dir() and calling vfs_getattr() on each child inode, packing results into a single XDR reply up to the maxcount limit supplied by the client.
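The packing loop can be sketched as follows; the Entry type and the 16-byte per-entry XDR framing overhead are illustrative.

```rust
// Sketch of the READDIRPLUS packing loop: entries (name + attributes) are
// appended until the next one would overflow the client's maxcount.
pub struct Entry {
    pub name: String,
    pub attr_bytes: usize, // encoded post-op attribute size
}

/// Returns how many entries fit in a single reply of at most `maxcount`
/// bytes. The real handler also stops when vfs_iterate_dir() is exhausted
/// and records a resume cookie for the next READDIRPLUS call.
pub fn pack_readdirplus(entries: &[Entry], maxcount: usize) -> usize {
    let mut used = 0;
    let mut packed = 0;
    for e in entries {
        let sz = e.name.len() + e.attr_bytes + 16; // illustrative framing
        if used + sz > maxcount {
            break;
        }
        used += sz;
        packed += 1;
    }
    packed
}
```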

15.12.7 NFSv4.1 Compound Dispatch

NFSv4.1 (RPC program 100003, version 4, RFC 5661) replaces the per-procedure dispatch model with a compound RPC: a single RPC carries a sequence of operations processed left-to-right. If an operation fails with any status other than NFS4_OK, the server stops processing and returns partial results — only the first failed operation's status is returned along with the results of all preceding successful operations.

SEQUENCE must be the first operation in every compound (except BIND_CONN_TO_SESSION and EXCHANGE_ID). It provides session ID, slot ID, sequence ID, and cache-this flag. The server's slot table enforces exactly-once semantics: slot i may not carry a new request until the previous request on slot i has been replied to. This replaces the NFSv3/v4.0 DRC with a per-session, per-slot mechanism.
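The slot check can be sketched as follows; Slot, SeqCheck, and check_sequence() are illustrative names for the server's slot-table bookkeeping.

```rust
// Sketch of per-slot exactly-once enforcement: seqid == last + 1 is a new
// request, seqid == last is a retransmit whose cached reply is replayed,
// and anything else is a protocol error (NFS4ERR_SEQ_MISORDERED).
pub struct Slot {
    pub last_seqid: u32,
    pub cached_reply: Option<Vec<u8>>, // retained when cache-this was set
}

pub enum SeqCheck {
    NewRequest,      // advance the slot and execute the compound
    Replay(Vec<u8>), // retransmit: return the cached reply verbatim
    Misordered,      // protocol error
}

pub fn check_sequence(slot: &Slot, seqid: u32) -> SeqCheck {
    if seqid == slot.last_seqid.wrapping_add(1) {
        SeqCheck::NewRequest
    } else if seqid == slot.last_seqid {
        match &slot.cached_reply {
            Some(r) => SeqCheck::Replay(r.clone()),
            None => SeqCheck::Misordered, // reply was not cached
        }
    } else {
        SeqCheck::Misordered
    }
}
```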

Key operations and their VFS mappings:

| NFSv4.1 Operation | VFS call | Notes |
| --- | --- | --- |
| EXCHANGE_ID | (none) | Client registration; returns clientid + capabilities |
| CREATE_SESSION | (none) | Establishes fore/back channels; negotiates slot counts and max RPC sizes |
| DESTROY_SESSION | (none) | Tears down session; releases slot table |
| DESTROY_CLIENTID | (none) | Releases all state for a clientid |
| SEQUENCE | (none) | Slot/sequence enforcement; lease renewal |
| PUTROOTFH | VFS root dentry | Sets current FH to the export root |
| PUTFH | fh_to_dentry() | Sets current FH from wire handle |
| GETFH | (none) | Returns current FH to client |
| SAVEFH / RESTOREFH | (none) | Push/pop FH onto per-compound stack |
| LOOKUP | path_lookup() | Walks one path component |
| LOOKUPP | path_lookup("..") | Walks to parent directory |
| OPEN | vfs_open() | Returns stateid + open flags |
| CLOSE | vfs_release() | Releases open stateid |
| READ | vfs_read() | Returns data + EOF flag |
| WRITE | vfs_write() | Returns bytes written + stability |
| COMMIT | vfs_fsync_range() | Flushes unstable writes |
| GETATTR | vfs_getattr() | Returns requested attribute bitmask |
| SETATTR | vfs_setattr() | Sets attributes; stateid required for size truncation |
| CREATE | vfs_mkdir() / vfs_symlink() / vfs_mknod() | Non-regular files only (regular files via OPEN) |
| REMOVE | vfs_unlink() / vfs_rmdir() | Inferred from inode type |
| RENAME | vfs_rename() | Atomic cross-directory rename |
| LINK | vfs_link() | Hard link |
| READDIR | vfs_iterate_dir() | Returns entries with requested attributes |
| READLINK | vfs_readlink() | Returns symlink target |
| LOCK | vfs_lock_file() | Byte-range lock; returns lock stateid |
| LOCKT | vfs_lock_file(F_GETLK) | Test for conflicting lock |
| LOCKU | vfs_lock_file(F_UNLCK) | Release byte-range lock |
| DELEGRETURN | (none) | Client returns a read or write delegation |
| LAYOUTGET | pNFS metadata | pNFS layout (optional; Tier 1 storage backends only) |
| LAYOUTRETURN | pNFS metadata | Client returns layout |

15.12.7.1 pNFS Data Server Interface

pNFS (parallel NFS, RFC 5661 Section 12 and RFC 8435) distributes file data across multiple data servers (DSes) while the metadata server (MDS) handles namespace operations and layout leases. The following trait must be implemented by any Tier 1 block driver that wishes to serve as a pNFS data server.

/// pNFS data server operations. A pNFS layout divides file data across one or more
/// data servers (DSes); the metadata server (MDS) provides layout leases.
/// Each data server implements this trait to provide layout-specific I/O.
///
/// Layouts defined by RFC 5661 (NFS 4.1): FILE, BLOCK, OBJECT, FLEX_FILE (RFC 8435).
/// UmkaOS implements FILE layout (direct NFS I/O to data servers) and FLEX_FILE layout
/// (mirrors/striping with per-DS error tolerance).
pub trait PnfsDataServer: Send + Sync {
    /// Unique server identifier (IP:port or RDMA endpoint address).
    fn server_addr(&self) -> &PnfsServerAddr;

    /// Read `len` bytes from the data server at file offset `file_offset` into `buf`.
    /// Uses the layout credential from `layout_stateid`.
    ///
    /// Returns `Ok(bytes_read)` or an error. On `PNFS_NO_LAYOUT` error, the caller
    /// must fall back to the metadata server (MDS) for I/O.
    fn read(
        &self,
        layout_stateid: &LayoutStateId,
        file_offset: u64,
        len: u32,
        buf: &mut [u8],
    ) -> Result<u32, PnfsError>;

    /// Write `data` to the data server at file offset `file_offset`.
    /// `stable` indicates whether stable (synchronous) or unstable write is requested.
    ///
    /// Unstable writes are buffered in the data server; a subsequent `commit()`
    /// flushes them to stable storage. Stable writes are immediately persistent.
    fn write(
        &self,
        layout_stateid: &LayoutStateId,
        file_offset: u64,
        data: &[u8],
        stable: WriteStability,
    ) -> Result<WriteResponse, PnfsError>;

    /// Flush unstable writes to stable storage on the data server.
    /// Returns the write verifier that can be compared with previous unstable writes.
    fn commit(
        &self,
        layout_stateid: &LayoutStateId,
        file_offset: u64,
        count: u64,
    ) -> Result<WriteVerifier, PnfsError>;

    /// Return the data server's capabilities (supported layout types, max I/O size).
    fn capabilities(&self) -> PnfsDataServerCaps;

    /// Called when the layout is recalled by the MDS or invalidated. The data server
    /// must flush all pending writes and return the layout.
    fn layout_recall(&self, layout_stateid: &LayoutStateId, recall_type: RecallType);
}

/// Write stability mode for pNFS data server writes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriteStability {
    /// Data buffered on the data server, not yet on stable storage
    /// (UNSTABLE on the wire); durable only after a subsequent `commit()`.
    Unstable,
    /// Data written to server stable storage before reply (FILE_SYNC).
    Stable,
}

/// Response from a pNFS data server write operation.
pub struct WriteResponse {
    /// Number of bytes actually written.
    pub count: u32,
    /// Write stability achieved (may be higher than requested).
    pub stability: WriteStability,
    /// Write verifier: random value chosen by server at startup.
    /// If verifier changes between write and commit, a server restart occurred
    /// and uncommitted writes are lost (client must retry).
    pub verifier: WriteVerifier,
}

/// Capabilities of a pNFS data server. Returned by `PnfsDataServer::capabilities()`
/// which is implemented by Tier 1 drivers communicating via KABI ring boundary.
/// `#[repr(C)]` is required for stable layout across compilation units.
#[repr(C)]
pub struct PnfsDataServerCaps {
    /// Maximum I/O size for a single read or write RPC.
    pub max_rw_size: u32,
    /// Supported layout types (FILE, BLOCK, OBJECT, FLEX_FILE).
    pub layout_types: u32,
    /// 1 if RDMA transport is available for this data server, 0 otherwise.
    /// u8 (not bool) avoids the bool validity invariant across KABI boundary.
    pub rdma_available: u8,
    /// Explicit padding to 4-byte alignment.
    pub _pad: [u8; 3],
}
const_assert!(core::mem::size_of::<PnfsDataServerCaps>() == 12);

/// Opaque pNFS layout stateid (per RFC 5661 §14.5.2).
pub type LayoutStateId = [u8; 16];
/// pNFS write verifier (per RFC 5661 §17.3): 8-byte opaque value.
pub type WriteVerifier = [u8; 8];
/// Opaque server network address.
pub struct PnfsServerAddr { pub addr: [u8; 48], pub len: u8, pub _pad: [u8; 7] }

/// Errors specific to pNFS data server operations.
#[derive(Debug)]
pub enum PnfsError {
    /// The layout stateid is no longer valid (server recalled or expired it).
    /// Client must fetch a new layout from the MDS.
    NoLayout,
    /// Data server is temporarily unavailable. Client may retry or fall back to MDS.
    Unavailable,
    /// I/O error on the data server.
    Io(KernelError),
    /// Layout type not supported by this data server.
    UnsupportedLayout,
}

15.12.8 NFSv4 State Management

NFSv4 introduces stateful file access. The server tracks client IDs, sessions, open owners, lock owners, and delegations. All state has an associated lease; state from clients whose leases expire is reclaimed by the server.

/// All per-client NFSv4 state. Protected by `NfsdStateTable::client_lock`.
pub struct NfsdClientState {
    /// 64-bit client ID assigned at `EXCHANGE_ID`. Unique for the server's
    /// lifetime.
    pub clientid:     u64,
    /// 8-byte verifier supplied by the client at `EXCHANGE_ID`. Used to
    /// detect client restarts (same IP, new verifier → client rebooted).
    pub verifier:     [u8; 8],
    /// Confirmed IP address of the client (from the TCP connection that
    /// issued `CREATE_SESSION`).
    pub client_addr:  IpAddr,
    /// RPCSEC_GSS principal name if the client authenticated with Kerberos.
    /// `None` for AUTH_SYS clients. Maximum 256 bytes (matching
    /// `GssUpcallRequest.client_principal` size); EXCHANGE_ID rejects
    /// overlong principals with `NFS4ERR_BADXDR`. Cold-path allocation
    /// (one per client session).
    pub principal:    Option<String>,
    /// Active sessions (fore + back channel pairs).
    /// NFSv4.1 clients typically maintain 1-4 sessions; 16 is a generous upper bound.
    pub sessions:     ArrayVec<Arc<NfsdSession>, 16>,
    /// Open owners: keyed by the 28-byte `open_owner` opaque identifier.
    /// Bounded by MAX_NFS_OPEN_OWNERS (65536). Returns NFS4ERR_RESOURCE on overflow.
    /// BTreeMap is correct per collection policy: [u8; 28] is a non-integer ordered
    /// key. Ordered iteration is useful for crash recovery (deterministic state
    /// replay). Matches Linux's rb-tree for state_owner lookup. At N=65536 with
    /// 28-byte keys, BTreeMap lookup is ~16 comparisons — fast with BTreeMap's
    /// cache-friendly node layout.
    pub open_owners:  BTreeMap<[u8; 28], Arc<OpenOwner>>,
    /// Lock owners: keyed by the 28-byte `lock_owner` opaque identifier.
    /// Bounded by MAX_NFS_LOCK_OWNERS (65536). Returns NFS4ERR_RESOURCE on overflow.
    /// Same BTreeMap justification as `open_owners` above.
    pub lock_owners:  BTreeMap<[u8; 28], Arc<LockOwner>>,
    /// Read and write delegations currently granted to this client.
    /// Bounded by the server's per-client delegation limit
    /// (`NFSD_MAX_DELEGATIONS_PER_CLIENT`, default 1024). Uses Vec instead of
    /// ArrayVec<_, 1024> to avoid 8 KiB inline allocation per client — most
    /// clients hold 0-10 delegations.
    ///
    /// **Enforcement**: Checked in `nfsd_grant_delegation()` before inserting.
    /// If `self.delegations.len() >= NFSD_MAX_DELEGATIONS_PER_CLIENT`, the
    /// server declines the delegation (returns the OPEN response without a
    /// delegation stateid). This is a UmkaOS improvement over Linux, which
    /// enforces only a global delegation limit and allows a single client to
    /// consume all global slots. The per-client limit prevents delegation
    /// starvation across clients.
    pub delegations:  Vec<Arc<Delegation>>,
    /// Absolute time at which this client's lease expires if not renewed.
    /// Renewed on every `SEQUENCE` from this client.
    pub lease_expiry: Instant,
}

/// Maximum fore-channel slots per session (RFC 8881 recommends up to 256).
const NFSD_MAX_SLOTS: usize = 256;

/// An NFSv4.1 session (one `CREATE_SESSION` creates one session).
pub struct NfsdSession {
    pub session_id:   [u8; 16],
    /// Fore channel: client → server request slots.
    /// Slot count negotiated at `CREATE_SESSION` time (max 256 per RFC 8881).
    pub fore_slots:   ArrayVec<NfsdSlot, NFSD_MAX_SLOTS>,
    /// Back channel: server → client callback slots.
    pub back_channel: Option<RpcBackChannel>,
    /// Maximum request size negotiated at `CREATE_SESSION` (bytes).
    pub max_req_sz:   u32,
    /// Maximum response size negotiated at `CREATE_SESSION` (bytes).
    pub max_resp_sz:  u32,
}

/// One slot in a session's fore channel.
pub struct NfsdSlot {
    pub seq_id:       u32,
    /// Cached reply for the last compound on this slot (for replay detection).
    /// Bounded by `NFS_DRC_MAX_REPLY` (same limit as DRC entries).
    /// Heap-allocated to avoid ~8 KiB inline per slot — with 256 slots per
    /// session and potentially thousands of sessions, inline would be ~2 MB/session.
    pub cached_reply: Option<Box<[u8]>>,
    pub in_use:       AtomicBool,
}

/// An open-owner and the associated open stateid.
pub struct OpenOwner {
    /// Current stateid (seqid increments on each OPEN/CLOSE/OPEN_DOWNGRADE).
    pub stateid:    StateId,
    /// The opened file's dentry.
    pub file:       Arc<Dentry>,
    /// Share access bits granted to this open (read, write, or both).
    pub access:     OpenAccess,
    /// Share deny bits this open holds (deny read, deny write, or neither).
    pub deny:        OpenDeny,
    /// Reference count: number of times the client has opened this
    /// (owner, file) pair without a corresponding CLOSE.
    pub open_count: u32,
}

/// An NFSv4 stateid: identifies one open, lock, or delegation instance.
pub struct StateId {
    /// Sequence number, incremented on each state transition.
    pub seqid: u32,
    /// 12 opaque bytes unique within the server's lifetime.
    pub other: [u8; 12],
}

/// A delegation granted to a client.
pub struct Delegation {
    pub stateid:     StateId,
    pub dtype:       DelegationType,  // Read or Write
    pub file:        Arc<Dentry>,
    pub client:      u64,             // clientid
    /// Time at which a pending recall (CB_RECALL) was sent. `None` if no
    /// recall is in progress.
    pub recall_sent: Option<Instant>,
}

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum DelegationType {
    /// Read delegation: client may cache reads without contacting server.
    Read,
    /// Write delegation: client has exclusive write access; all writes are
    /// cached locally and flushed on DELEGRETURN or recall.
    Write,
}

Lease renewal: Each NfsdClientState has a lease_expiry deadline. Any SEQUENCE operation from the client resets the deadline to now + nfsd_lease_time (default: 90 seconds). The lease reaper task runs every 10 seconds and reclaims state for clients whose lease_expiry is in the past: all OpenOwner entries are closed, byte-range locks are released via vfs_lock_file(F_UNLCK), and delegations are revoked.
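The renewal and reaping rules above can be modeled as follows (a sketch with illustrative names; times are plain seconds rather than kernel Instants):

```python
# Lease model: SEQUENCE resets lease_expiry to now + lease time; the
# reaper reclaims state for any client whose deadline has passed.

NFSD_LEASE_TIME = 90   # seconds (default per the text above)

class Client:
    def __init__(self, clientid, now):
        self.clientid = clientid
        self.lease_expiry = now + NFSD_LEASE_TIME

    def on_sequence(self, now):
        # Any SEQUENCE from the client renews the lease.
        self.lease_expiry = now + NFSD_LEASE_TIME

def reap_expired(clients, now):
    """Return clients whose state must be reclaimed; keep the rest."""
    expired = [c for c in clients if c.lease_expiry <= now]
    clients[:] = [c for c in clients if c.lease_expiry > now]
    return expired
```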

Grace period: After nfsd starts (or restarts), the server enters a grace period of nfsd_gracetime seconds (default: 90 seconds, equal to the lease time). During the grace period, the server:

  • Accepts OPEN with claim_type = CLAIM_PREVIOUS (state reclaim) from clients that held opens or delegations before the restart.
  • Rejects new OPEN with claim_type = CLAIM_NULL with NFS4ERR_GRACE.
  • Reads the stable-storage journal (Section 15.12.10) to learn which clients had state before the restart, populating the set of expected reclaimants.

Once the grace period expires (or all expected reclaimants have completed reclaim, whichever is first), the server transitions to normal operation.

Delegations and recalls: The server grants a Read delegation when a file is opened for read and there are no write opens or write delegations outstanding. It grants a Write delegation when a file is opened for write and there is exactly one open (the requesting client's) and no conflicting opens or delegations. When a conflicting open arrives for a delegated file, the server issues CB_RECALL on the back channel to the delegating client and waits nfsd_lease_time / 2 seconds for DELEGRETURN before forcibly revoking the delegation with NFS4ERR_DELEG_REVOKED.
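The grant policy just described, including the per-client cap from Section 15.12.8, can be sketched as a pure decision function (names and the counter-based file state are illustrative):

```python
# Delegation grant decision: read delegation when no conflicting write
# opens or write delegations exist; write delegation only when the
# requesting client's open is the sole open on the file.

def delegation_to_grant(want_write, opens, write_opens,
                        read_delegs, write_delegs,
                        client_deleg_count, limit=1024):
    if client_deleg_count >= limit:
        return None  # per-client cap: decline (OPEN succeeds, no delegation)
    if want_write:
        if opens == 1 and read_delegs == 0 and write_delegs == 0:
            return "write"
        return None
    if write_opens == 0 and write_delegs == 0:
        return "read"
    return None
```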

Dirty page limit during server unreachability: When the NFS server is unreachable, dirty pages accumulate in the client's page cache. The accumulation is bounded by the cgroup's memory.max limit and the global dirty_ratio / dirty_bytes thresholds. If memory pressure triggers writeback to an unreachable server, the writeback thread blocks for up to nfs_timeout seconds (default: 60s) before reporting EIO. Delegations are returned on lease expiry (default: 90s) regardless of server reachability.

15.12.9 Authentication and Security

AUTH_SYS (auth_flavor = AUTH_UNIX): the RPC credential carries a plaintext UID, GID, and supplementary GID list. When running inside a container (nfsd in a non-init user namespace), the server translates incoming AUTH_SYS wire UIDs through the container's user_ns.uid_map before constructing filesystem credentials. This prevents a containerized NFS server from accessing files as the host UID. Outside a container (init user namespace), the server uses the wire credentials directly as the effective credential for VFS calls.

No cryptographic authentication is performed: AUTH_SYS is a legacy mechanism for trusted private-network deployments only, and its credentials are trivially forgeable by any host on the network segment. Use sec=krb5p (authentication + integrity + privacy) for production deployments, or at minimum sec=krb5i (authentication + integrity). AUTH_SYS should be restricted to legacy appliances or isolated lab networks where deploying a Kerberos KDC is not feasible; it is rejected on exports that specify sec=krb5 or stronger.

RPCSEC_GSS / Kerberos 5 (RFC 2203 + RFC 7861): three protection levels:

  • krb5: authentication only. The RPC call header contains a GSS MIC token covering the XID and procedure number; the server verifies the MIC using the session key. Payload is transmitted in clear.
  • krb5i: authentication + integrity. The entire RPC body (arguments + results) is covered by a GSS MIC token. Payload is transmitted in clear but any tampering is detected.
  • krb5p: authentication + integrity + privacy. The entire RPC body is wrapped with GSS Wrap (encrypt-then-MAC). Payload is opaque to network observers.

In all three cases the cryptographic transforms use AES-256-CTS-HMAC-SHA512-256 (enctypes aes256-cts-hmac-sha512-256, RFC 8009) when negotiated with a Kerberos 5 KDC that supports it, falling back to aes128-cts-hmac-sha256-128 (RFC 8009) or aes256-cts-hmac-sha1-96 (RFC 3962) for older KDCs.

GSS context establishment flow:

  1. The client sends RPCSEC_GSS_INIT with a GSS_Init_sec_context token (Kerberos AP-REQ encapsulated in GSS-API).
  2. The server calls rpc_gss_svc_accept_sec_context() which makes a synchronous upcall to gssd via a kernel–user pipe. gssd calls gss_accept_sec_context() with the host's keytab (/etc/krb5.keytab) and returns the derived session key and client principal to the kernel.
  3. The kernel stores the session key in GssContext::session_key (protected by a mutex; the key is zeroed on context expiry via Drop). Subsequent RPCs perform MIC/Wrap/Unwrap in-kernel using the UmkaOS crypto subsystem (Section 8).
  4. The svcgssd daemon (alternative to gssd) is also supported; the upcall interface is identical.

UID mapping: applied after credential extraction, before any VFS call:

  • root_squash (default on): UID 0 → anon_uid (65534), GID 0 → anon_gid (65534).
  • all_squash: all UIDs/GIDs → anon_uid/anon_gid.
  • Neither option: credentials passed through unchanged.

UID mapping is applied per-export, so the same file can be accessed with different effective credentials by clients matched to different export rows.
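A minimal sketch of the squash rules (the 65534 anon defaults are from the text above; the function name is illustrative):

```python
# Per-export credential squashing: all_squash maps everything to the
# anonymous identity; root_squash maps only UID/GID 0.

ANON_UID = 65534
ANON_GID = 65534

def squash(uid, gid, root_squash=True, all_squash=False):
    """Map wire credentials to effective filesystem credentials."""
    if all_squash:
        return ANON_UID, ANON_GID
    if root_squash:
        return (ANON_UID if uid == 0 else uid,
                ANON_GID if gid == 0 else gid)
    return uid, gid
```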

15.12.10 /proc/fs/nfsd Interface

The /proc/fs/nfsd/ pseudo-filesystem is the control plane for the NFS server. It is mounted at boot when the nfsd kernel module is loaded (or when the first export is created, if nfsd is built-in).

/proc/fs/nfsd/
├── threads          (rw): read = "N\n" current thread count; write N to spawn/trim threads
├── exports          (rw): current exports table in exportfs format; write to update
├── clients/         (r-x): one subdirectory per active NFSv4 client
│   └── <clientid>/        clientid in lowercase hex (16 hex digits)
│       ├── info     (r--): "addr: ...\nprincipal: ...\nlease_remaining: ...s\n"
│       ├── states   (r--): one line per open stateid and delegation
│       └── ctl      (-w-): write "expire\n" to immediately revoke this client's lease
├── pool_stats       (r--): per-pool thread count, requests served, DRC hit rate
├── write_verifier   (r--): current write verifier as 16-char lowercase hex
├── nfsv4leasetime   (rw): NFSv4 lease duration in seconds (default 90, range 10–3600)
├── nfsv4gracetime   (rw): grace period duration in seconds (default = nfsv4leasetime)
├── nfsv4minorversion (rw): highest NFSv4 minor version offered (0 or 1; default 1; v4.2 deferred to Phase 4)
└── stable_storage   (rw): path to the stable-state journal file
                            (default: /var/lib/nfs/v4recovery)

The stable_storage path points to a directory on a local persistent filesystem. The server writes one file per client (named by clientid) containing serialized NfsdClientState (open owners, lock owners, delegation stateids) using a binary format. Each client file begins with an 8-byte header: UMKA magic (4 bytes) + version: Le32 (4 bytes, currently 1). Unknown versions cause the file to be ignored — the client must re-establish state from scratch during the grace period. A CRC-32C checksum covers the header and body. These files are read during the grace period to populate the set of expected reclaimants. They are deleted when a client sends DESTROY_CLIENTID or when its lease expires normally.
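The file format can be sketched as below. The spec above fixes the magic, the Le32 version, and that a CRC-32C covers header and body; the trailing-Le32 placement of the CRC is an assumption for illustration:

```python
# Sketch of the per-client stable-storage file: "UMKA" magic, Le32
# version, opaque body, trailing Le32 CRC-32C (placement assumed).
import struct

MAGIC, VERSION = b"UMKA", 1

def crc32c(data, crc=0):
    # Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    crc ^= 0xFFFFFFFF
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def encode_client_file(body):
    payload = MAGIC + struct.pack("<I", VERSION) + body
    return payload + struct.pack("<I", crc32c(payload))

def decode_client_file(raw):
    """Return the body, or None if the file must be ignored."""
    if len(raw) < 12 or raw[:4] != MAGIC:
        return None
    payload, (crc,) = raw[:-4], struct.unpack("<I", raw[-4:])
    if crc32c(payload) != crc:
        return None  # corrupt: ignore; client reclaims in grace period
    (version,) = struct.unpack("<I", raw[4:8])
    if version != VERSION:
        return None  # unknown version: ignore per the spec
    return raw[8:-4]
```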

15.12.11 NLM (Network Lock Manager) Server

NFSv3 byte-range locking uses a separate RPC protocol: NLM (program 100021, version 4, defined in the OpenGroup XNFS specification). The NLM server runs as part of lockd alongside the NFS server.

NLM server procedures:

| Procedure | Handler | Notes |
|---|---|---|
| NLM_TEST | nlm4_test() | Test for conflicting lock (non-destructive) |
| NLM_LOCK | nlm4_lock() | Acquire byte-range lock; may block if block=true |
| NLM_CANCEL | nlm4_cancel() | Cancel a pending blocked lock request |
| NLM_UNLOCK | nlm4_unlock() | Release a byte-range lock |
| NLM_GRANTED | nlm4_granted() | Callback: server notifies client of granted blocked lock |
| NLM_TEST_MSG | async variant of TEST | One-way; reply via NLM_TEST_RES callback |
| NLM_LOCK_MSG | async variant of LOCK | One-way; reply via NLM_LOCK_RES callback |
| NLM_UNLOCK_MSG | async variant of UNLOCK | One-way; reply via NLM_UNLOCK_RES callback |
| NLM_SHARE | nlm4_share() | DOS-style share reservation (rarely used) |
| NLM_UNSHARE | nlm4_unshare() | Release share reservation |
| NLM_NM_LOCK | nlm4_nm_lock() | Non-monitored lock (NSM not involved) |
| NLM_FREE_ALL | nlm4_free_all() | Release all locks for a client (NSM reboot notification) |

VFS integration: nlm4_lock() calls vfs_lock_file(file, F_SETLKW, flock) with the translated struct file_lock. Granted locks are recorded in INode::nlm_locks: Vec<NlmLock>, protected by INode::lock_mutex. Maximum NLM_MAX_LOCKS_PER_INODE (default: 1024, configurable via sysctl nfs.nlm_max_locks_per_inode). Lock requests exceeding this limit are rejected with NLM_DENIED_NOLOCKS. Each NlmLock entry stores the remote host address and the NLM lock_owner opaque cookie so the lock can be released if the client crashes.

NSM (Network Status Monitor) integration: rpc.statd (program 100024) runs in user space and monitors client liveness. When lockd grants a lock to a remote client, it calls nsm_monitor(client_addr) to register the client with rpc.statd. If the client reboots, rpc.statd calls nsm_callback() which delivers SM_NOTIFY to the kernel's nfsd_sm_notify() entry point. The kernel then calls nlm_host_rebooted(), which iterates INode::nlm_locks for all inodes holding locks from that host and calls vfs_lock_file(F_UNLCK) to release them, allowing other waiters to proceed.

Grace period: After lockd restarts (following a server crash), it enters a grace period (default 45 seconds) during which it accepts only NLM_LOCK requests with reclaim = true. This allows clients to re-acquire locks they held before the crash before the server accepts new competing lock requests.

15.12.12 Linux Compatibility

  • /etc/exports format: identical to Linux nfsd, including all documented export options (rw, ro, sync, async, root_squash, no_root_squash, all_squash, no_all_squash, subtree_check, no_subtree_check, sec=, fsid=, anonuid=, anongid=, crossmnt, nohide, no_auth_nlm, mp=). Unrecognized options are rejected with a logged warning (not silently ignored).
  • exportfs(8), showmount(8), nfsstat(8), rpc.nfsd(8), rpc.mountd(8) all operate without modification.
  • /proc/fs/nfsd/ layout matches Linux kernel 5.15+ nfsd. Fields that do not exist in older kernels (e.g., nfsv4minorversion) are additive and ignored by older tools.
  • NFSv3 wire protocol: RFC 1813 compliant, interoperable with Linux, Solaris, macOS, FreeBSD, and Windows NFS clients.
  • NFSv4.1 wire protocol: RFC 5661 compliant. pNFS metadata operations (LAYOUTGET, LAYOUTRETURN, LAYOUTCOMMIT, GETDEVICEINFO) are implemented; pNFS data-server operations require a Tier 1 block driver that exposes the PnfsDataServer interface (optional; falls back to MDS-only mode if unavailable).
  • NFSv4.0 minor version: accepted (negotiated down from v4.1 if the client does not support EXCHANGE_ID). The DRC (Section 15.12.5) provides exactly-once semantics for v4.0. NFSv4.0 compatibility requires:
      • SETCLIENTID (opcode 35) / SETCLIENTID_CONFIRM (opcode 36) handlers for initial client identification (v4.0 does not use EXCHANGE_ID/CREATE_SESSION).
      • v4.0 callback channel established via reverse TCP connection to the client address from SETCLIENTID's r_netid/r_addr fields.
      • v4.0 compounds go through the DRC directly (no session slots or SEQUENCE).
      • RENEW (opcode 30) for lease renewal (replaces SEQUENCE in v4.1).
  • NFSv3 and NFSv4 server can run concurrently; both are enabled by default. Writing 3 or 4 to a hypothetical nfsv_versions knob is not yet implemented; the standard mechanism (exportfs options + kernel compile flags) applies as on Linux.

15.12.13 Design Decisions

  1. NFSv4.1 sessions replace the DRC for v4.1 clients: The per-session, per-slot sequence-ID mechanism in NFSv4.1 (RFC 5661 §2.10) provides exactly-once semantics without the hash-table overhead of the DRC. The DRC is retained only for NFSv3 and NFSv4.0 clients. NFSv4.1 clients receive NFS4ERR_SEQ_MISORDERED on sequence violations rather than a cached reply.

  2. Stable storage journal for NFSv4 state: Writing client open/lock/delegation state synchronously to disk on every OPEN, CLOSE, LOCK, LOCKU, and DELEGRETURN allows the server to survive a crash and offer clients a grace period for state reclamation (RFC 5661 §9.4.2). Without stable storage, the server would be forced to return NFS4ERR_NO_GRACE to all clients, requiring them to re-open all files from scratch — disruptive for workloads with thousands of open files.

  3. Thread pool model over event-driven dispatch: Kernel threads (one thread per outstanding request, blocking on svc_recv) keep the code path from RPC arrival to VFS call entirely synchronous. An event-driven model (one thread multiplexing many connections via epoll) would require explicit continuation passing through VFS callbacks, adding complexity with negligible throughput benefit at the connection counts typical for NFS servers (100s–1000s of clients, not millions).

  4. ExportOperations as a required trait: Requiring filesystems to provide stable file handles (encode_fh / fh_to_dentry) makes the correctness contract explicit at the type level. Filesystems that cannot provide stable handles (e.g., a synthetic in-memory filesystem with no persistent inode allocation) simply do not implement the trait and cannot be exported — instead of being exported with silently broken ESTALE behavior.

  5. AUTH_SYS and Kerberos both in-kernel: The Kerberos per-RPC integrity and privacy transforms (AES-256-CTS + HMAC) are performance-critical at high RPC rates and belong in the kernel crypto subsystem. Only the initial GSS context negotiation (involving the KDC and the host keytab) uses a user-space upcall to gssd. This is identical to Linux nfsd's approach and ensures compatibility with existing gssd/svcgssd deployments.

  6. NLM co-located with nfsd: The NLM lock manager shares the lockd kernel threads and the per-inode nlm_locks list with the NFS server rather than running as a separate subsystem. This avoids a cross-subsystem RPC for every lock operation and allows lock grants and lock releases to be performed atomically with respect to VFS inode locking.


15.13 Block Storage Networking

Storage networking protocols that expose remote block devices as local storage. These integrate with UmkaOS's block layer (umka-block), RDMA infrastructure (Section 5.4), and driver recovery model.

15.13.1 Wire Format Validation

All #[repr(C)] structs in this section cross node boundaries via the peer protocol. Every struct has:

  • Explicit _pad fields for all implicit alignment padding
  • Le16/Le32/Le64 for all multi-byte integers (never native endian)
  • u8 (not bool) for boolean fields
  • const_assert!(size_of::<T>() == N) after every struct definition

Verified structs in this section:

| Struct | Size | Le types | Pad fields | const_assert | bool-as-u8 | Notes |
|---|---|---|---|---|---|---|
| NvmeDiscoveryLogEntry | 1024 | Yes (Le16) | Yes (_reserved0, _reserved1) | Yes | N/A | NVMe spec struct |
| BlockServiceRequest | 64 | Yes (Le64, Le32, Le16) | Yes (_pad1) | Yes | N/A | |
| SglEntry | 12 | Yes (Le64, Le32) | N/A | N/A (trivially 12B) | N/A | |
| BlockServiceCompletion | 32 | Yes (Le64, Le32) | Yes (_reserved, _pad) | Yes | N/A | |
| BlockServiceDeviceInfo | 112 | Yes (Le64, Le32, Le16) | Yes (_pad) | Yes | Yes (7 bool fields) | const_assert present and correct |
| DataIntegrityField | 8 | Yes (Le16, Le32) | N/A (guard(2) + app_tag(2) + ref_tag(4) = 8) | Yes | N/A | Fixed: converted to Le16/Le32 |

All struct deviations in this section have been fixed: DataIntegrityField now uses Le16/Le32, and BlockServiceDeviceInfo has an active const_assert!.
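These layout invariants can be checked mechanically. A sketch using Python's struct module as a stand-in for const_assert! — "<" forces little-endian with no implicit padding, so every pad byte must appear explicitly. Field breakdowns are taken from the table above and from PnfsDataServerCaps in Section 15.12.7.1; the format strings themselves are illustrative:

```python
# Layout check: each format string spells out the wire layout; calcsize
# must equal the documented struct size or the layout has drifted.
import struct

LAYOUTS = {
    "DataIntegrityField": "<HHI",    # guard(Le16) app_tag(Le16) ref_tag(Le32)
    "SglEntry":           "<QI",     # addr(Le64) len(Le32)
    "PnfsDataServerCaps": "<IIB3x",  # max_rw_size, layout_types, rdma_available(u8), 3 pad
}

EXPECTED = {"DataIntegrityField": 8, "SglEntry": 12, "PnfsDataServerCaps": 12}

def check_layouts():
    return {name: struct.calcsize(fmt) == EXPECTED[name]
            for name, fmt in LAYOUTS.items()}
```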

iSCSI Initiator

Tier 1 umka-block module implementing the iSCSI initiator role (RFC 7143):

  • Session management: login, logout, connection multiplexing, session recovery
  • SCSI command encapsulation over TCP
  • CHAP authentication (unidirectional and mutual)
  • Header and data digests (CRC32C) for integrity
  • Multi-connection sessions (MC/S) for bandwidth aggregation
  • Error recovery levels 0, 1, and 2

iSCSI Target

Tier 1 module exposing local block devices as iSCSI LUNs:

  • LIO-compatible configuration interface (existing targetcli works via SysAPI layer)
  • ACL-based access control (initiator IQN whitelist + CHAP)
  • Multiple LUNs per target portal group
  • SCSI Persistent Reservations (PR) support (required for clustered filesystems)

iSCSI CHAP Authentication

CHAP (Challenge-Handshake Authentication Protocol, RFC 3720 §12.1) is the standard authentication mechanism for iSCSI. Required for enterprise deployments; both initiator and target sides must support unidirectional and mutual CHAP.

/// CHAP authentication configuration for an iSCSI session.
/// RFC 3720 §12.1, updated by RFC 7143 §12.1.
pub struct IscsiChapAuth {
    /// CHAP algorithm: MD5 (5, legacy) or SHA-256 (7, preferred).
    pub algorithm: ChapAlgorithm,
    /// Target authenticates initiator (one-way CHAP).
    pub target_auth: ChapCredential,
    /// Initiator authenticates target (mutual CHAP, recommended).
    /// Mutual CHAP prevents rogue targets from harvesting initiator credentials.
    pub mutual_auth: Option<ChapCredential>,
}

/// CHAP credential pair (name + shared secret).
pub struct ChapCredential {
    /// CHAP name (typically the iSCSI qualified name of the peer).
    pub name: KString,
    /// CHAP secret (shared secret, 12-256 bytes per RFC 7143 §12.1.3).
    /// Wrapped in `Zeroizing` to ensure memory is cleared on drop.
    pub secret: Zeroizing<ArrayVec<u8, 256>>,
}

/// CHAP hash algorithm identifier (IANA "PPP Authentication Algorithms" registry).
pub enum ChapAlgorithm {
    /// MD5 (legacy, for backward compatibility with older initiators only).
    Md5 = 5,
    /// SHA-1 (legacy, for interop with older initiators).
    Sha1 = 6,
    /// SHA-256 (recommended, RFC 7143 §13.11). Preferred for all new deployments.
    Sha256 = 7,
    /// SHA3-256 (FIPS 202), IANA-assigned algorithm number 8.
    /// Interoperability limited to implementations that support CHAP algorithm 8.
    Sha3_256 = 8,
}

CHAP negotiation flow during iSCSI login:

  1. Initiator sends LoginRequest with AuthMethod=CHAP in the key-value text parameters.
  2. Target selects CHAP and responds with CHAP_A (algorithm list), CHAP_I (identifier, one byte), CHAP_C (challenge, random bytes).
  3. Initiator computes response = Hash(CHAP_I || secret || CHAP_C) using the selected algorithm, sends CHAP_N (name) and CHAP_R (response).
  4. Target verifies the response against its stored credential for the initiator's name.
  5. If mutual CHAP is negotiated: the target sends its own CHAP_I and CHAP_C in the same response. The initiator verifies the target's identity using the mutual secret. This prevents man-in-the-middle attacks where a rogue target impersonates the real one.
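Step 3 above can be sketched with hashlib. The response is Hash(CHAP_I || secret || CHAP_C) with CHAP_I as a single octet (this is the RFC 1994 construction for MD5; the same construction applies to the SHA variants). Function names are illustrative:

```python
# CHAP response computation and target-side verification.
import hashlib
import hmac

ALGORITHMS = {5: "md5", 6: "sha1", 7: "sha256"}  # CHAP_A values from the enum above

def chap_response(algorithm, chap_i, secret, challenge):
    """response = Hash(CHAP_I || secret || CHAP_C)."""
    h = hashlib.new(ALGORITHMS[algorithm])
    h.update(bytes([chap_i]) + secret + challenge)
    return h.digest()

def chap_verify(algorithm, chap_i, secret, challenge, response):
    # Target-side check against the stored credential for CHAP_N.
    # Constant-time comparison avoids a timing oracle on the response.
    return hmac.compare_digest(
        chap_response(algorithm, chap_i, secret, challenge), response)
```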

CHAP credentials are stored in the kernel key retention service (Section 10.2) under the iscsi_chap keyring. The targetcli configuration interface writes credentials via the configfs auth/ directory (see targetcli configfs management below).

iSER (iSCSI Extensions for RDMA)

When RDMA fabric is available (InfiniBand, RoCE, iWARP — Section 5.4), iSCSI sessions transparently upgrade to RDMA transport:

  • Zero-copy data transfer: RDMA READ/WRITE directly between initiator/target memory
  • Kernel-bypass data path: data moves without CPU involvement
  • Same iSCSI session management and authentication, different transport
  • Transparent upgrade: if both ends advertise RDMA capability during login, iSER is negotiated automatically. Applications and management tools see a standard iSCSI session.

NVMe-oF Initiator (Host)

Tier 1 umka-block module implementing the NVMe over Fabrics host side (NVM Express 2.0; NVMe/TCP per the NVMe TCP Transport Specification, TP 8000; NVMe/RDMA per the original NVMe-oF specification of June 2016). For NVMe/TCP transport, the initiator creates kernel-internal TCP sockets via the standard networking stack (Section 16.1). All network-level operations (routing, congestion control, netfilter) apply to NVMe-oF TCP traffic as normal flows. For NVMe/RDMA transport, it uses the RDMA pool manager (Section 5.4) for zero-copy buffer registration.

  • Discovery: NVMe-oF discovery protocol (well-known discovery NQN) — the initiator queries a discovery controller to enumerate available subsystems and transport addresses. Supports the Discovery Log Page, referrals, persistent discovery connections, and unique discovery controller identification (TP 8013a).
  • NVMe/TCP transport: NVMe commands encapsulated in TCP (NVMe TCP Transport Specification, TP 8000, widely deployed). Lighter than iSCSI — no SCSI translation layer, native NVMe command set. Supports header and data digests (CRC32C), and TLS 1.3 for in-transit encryption (TP 8011). Uses the kernel-internal socket API (SocketOps trait, Section 16.3) for TCP connections — connect(), sendmsg() with MSG_MORE for PDU framing, recvmsg() for response parsing. Each NVMe I/O queue maps to one TCP connection. Zero-copy TX uses NetBuf scatter-gather (Section 16.5) to avoid copying NVMe command capsules and data payloads.
  • NVMe/RDMA transport: NVMe commands over RDMA (InfiniBand, RoCE, iWARP). Capsule commands sent via RDMA SEND, data transferred via RDMA READ/WRITE — zero-copy, kernel-bypass. Lowest latency option (~3-5 μs network transport; ~10-20 μs end-to-end including NVMe target processing). NVMe-oF I/O SGLs (scatter-gather lists for data buffers) are allocated from RdmaPoolManager::alloc("nvmeof", size) (Section 5.4). On quota exhaustion (the NVMe-oF RDMA pool is depleted), the initiator returns BLK_STS_RESOURCE to the block layer, which applies backpressure by re-queuing the I/O request and throttling submission until pool capacity is recovered.
  • Multipath: native NVMe multipath (ANA — Asymmetric Namespace Access). Multiple paths to the same namespace are managed by the NVMe driver itself (not dm-multipath). ANA groups indicate path optimality (optimized, non-optimized, inaccessible). UmkaOS's NVMe multipath integrates with the recovery-aware volume layer (Section 15.2) — if a path fails due to driver crash, the volume layer waits for recovery rather than immediately failing over.
  • Namespace management: attach/detach namespaces, resize, format — full NVMe-oF namespace management command set.
  • Zoned namespaces (ZNS): NVMe-oF supports zoned namespaces. UmkaOS exposes these through the block layer's zone interface, compatible with zonefs and f2fs.

NVMe-oF Target (Subsystem)

Tier 1 module exposing local NVMe devices (or any block device) as NVMe-oF subsystems:

  • Subsystem management: create/destroy NVMe subsystems, each with one or more namespaces backed by local block devices (NVMe, zvol, dm device, or any umka-block device).
  • Transport bindings: simultaneous TCP and RDMA listeners on the same subsystem. Clients connect via whichever transport is available.
  • Access control: per-host NQN ACLs. Each allowed host can be restricted to specific namespaces within the subsystem.
  • ANA groups: configure asymmetric namespace access for multipath. Allows active/passive and active/active configurations.
  • Passthrough mode: for local NVMe devices, optionally pass NVMe commands directly to the hardware (no block layer translation). Provides the lowest-latency target implementation — remote host gets near-local NVMe performance.
  • Configuration interface: nvmetcli-compatible JSON configuration (existing Linux NVMe target management tools work via SysAPI layer).

NVMe-oF TLS 1.3 Security

NVMe/TCP supports in-band TLS 1.3 encryption (NVMe TP 8011). This protects data in transit without requiring IPsec or network-level encryption, and is mandatory for deployments where storage traffic traverses untrusted network segments.

/// TLS 1.3 configuration for an NVMe-oF target port or initiator connection.
/// Implements NVMe TP 8011 (Secure Channel — TLS for NVMe/TCP).
pub struct NvmeofTlsConfig {
    /// TLS mode for this port/connection.
    pub mode: NvmeofTlsMode,
    /// PSK identity hint (for PSK mode). Matches the identity configured
    /// on the initiator via `nvme gen-tls-key` / `nvme set-key`.
    pub psk_identity: Option<KString>,
    /// Pre-shared key value (for PSK mode). TLS 1.3 PSK, up to 48 bytes
    /// (SHA-384 output). Wrapped in `Zeroizing` for secure memory handling.
    pub psk: Option<Zeroizing<[u8; 48]>>,
    /// X.509 certificate for certificate-based TLS. The certificate and
    /// private key are stored in the kernel key retention service
    /// ([Section 10.2](10-security-extensions.md#kernel-key-retention-service)).
    pub cert: Option<Arc<X509Cert>>,
    /// Whether to require client (initiator) authentication.
    /// When true, the target requests and verifies a client certificate
    /// during the TLS handshake (mutual TLS).
    pub require_client_auth: bool,
}

/// NVMe-oF TLS mode selection.
pub enum NvmeofTlsMode {
    /// No TLS (plaintext NVMe/TCP). Default for backward compatibility.
    None,
    /// TLS 1.3 with Pre-Shared Key (NVMe TP 8011). The PSK is provisioned
    /// out-of-band and identified by `psk_identity`. Simpler deployment
    /// than certificates; suitable for static clusters.
    Psk,
    /// TLS 1.3 with X.509 certificates. Provides identity verification
    /// via certificate chain validation. Required for multi-tenant or
    /// cross-organizational deployments.
    Certificate,
}
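The cross-field invariants implied by these two types can be captured in a small validation sketch. The simplified `TlsConfig`/`TlsMode` stand-ins below (with `KString`, `Zeroizing`, and `X509Cert` omitted), and the rule that `require_client_auth` is meaningful only in certificate mode, are assumptions of this example — not the UmkaOS API:

```rust
#[derive(PartialEq, Debug)]
enum TlsMode { None, Psk, Certificate }

// Simplified stand-in for NvmeofTlsConfig.
struct TlsConfig {
    mode: TlsMode,
    psk_identity: Option<String>,
    psk: Option<[u8; 48]>,
    has_cert: bool,
    require_client_auth: bool,
}

/// Validate a TLS config before bringing up a port:
/// PSK mode needs both the key and its identity; certificate mode needs a
/// certificate; client-certificate auth only applies in certificate mode
/// (with a PSK, both sides are already authenticated by key knowledge).
fn validate(cfg: &TlsConfig) -> Result<(), &'static str> {
    match cfg.mode {
        TlsMode::None => Ok(()),
        TlsMode::Psk => {
            if cfg.psk.is_none() { return Err("PSK mode requires a pre-shared key"); }
            if cfg.psk_identity.is_none() { return Err("PSK mode requires a PSK identity"); }
            if cfg.require_client_auth { return Err("client certificates apply to certificate mode only"); }
            Ok(())
        }
        TlsMode::Certificate => {
            if !cfg.has_cert { return Err("certificate mode requires an X.509 certificate"); }
            Ok(())
        }
    }
}
```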

TLS offload: when the NIC supports kTLS offload (Section 16.15), TLS record-layer encryption and decryption are performed in hardware, making encrypted NVMe/TCP effectively zero-copy. The NVMe-oF initiator and target negotiate TLS during the NVMe/TCP connection setup phase (before the NVMe Connect command). The TLS session keys are installed into the kTLS socket via setsockopt(SOL_TLS, TLS_TX) / setsockopt(SOL_TLS, TLS_RX).

DH-HMAC-CHAP (NVMe TP 8001) provides an alternative in-band authentication mechanism that does not require TLS infrastructure. It can be used standalone or as a pre-authentication step before the TLS handshake. The discovery controller also supports DH-HMAC-CHAP (see NVMe-oF Discovery Controller below).

NVMe-oF Discovery Controller

The NVMe-oF discovery controller provides interoperability with non-UmkaOS nodes (Linux, Windows, ESXi) that discover NVMe-oF targets using the standard NVMe-oF discovery protocol. Without a discovery controller, non-UmkaOS initiators cannot find UmkaOS NVMe-oF subsystems — they require out-of-band configuration of target addresses, which defeats the self-discovery model that NVMe-oF was designed for.

Implementation (Tier 1, part of the NVMe-oF target module):

  • Well-known discovery NQN: listens as nqn.2014-08.org.nvmexpress.discovery (the standard NVMe-oF discovery NQN defined in NVMe Base Specification 2.0 §4.1). Initiators connect to this NQN to retrieve the discovery log page.

  • Well-known port: listens on TCP port 8009 (the IANA-assigned NVMe-oF discovery port, also used by Linux nvmet and commercial NVMe-oF targets). Also listens for RDMA connections on the same port when RDMA transport is available.

  • Dual transport: the discovery controller accepts connections over both TCP and RDMA transports simultaneously. Initiators connect via whichever transport they support. The discovery log page includes entries for both TCP and RDMA target addresses, allowing the initiator to select its preferred data transport.

  • Discovery Log Page (NVMe Base Specification 2.0 §5.3): responds to the Get Log Page command (Log Identifier 0x70) with a standard discovery log page containing one entry per locally-exported NVMe subsystem + transport binding.

/// NVMe-oF Discovery Log Page Entry (NVMe Base Spec 2.0, Figure 292).
/// One entry per (subsystem, transport, address) tuple.
/// Size: 1024 bytes per entry (fixed, per NVMe spec).
/// Multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct NvmeDiscoveryLogEntry {
    /// Transport type: 0x01 = RDMA, 0x03 = TCP.
    pub trtype: u8,
    /// Address family: 0x01 = IPv4, 0x02 = IPv6.
    pub adrfam: u8,
    /// Subsystem type: 0x01 = NVMe I/O subsystem, 0x02 = discovery.
    pub subtype: u8,
    /// Transport requirements (RDMA: RDMA_QPTYPE, RDMA_PRTYPE, RDMA_CMS).
    pub treq: u8,
    /// Port ID (unique per transport address on this target).
    pub portid: Le16,
    /// Controller ID (0xFFFF = dynamic, assigned at connect).
    pub cntlid: Le16,
    /// Admin max SQ size.
    pub asqsz: Le16,
    /// Extended discovery flags (NVMe 1.4+). Bit 0: EPCSD (explicit persistent
    /// connection to discovery controller). Bit 1: DUPRETINFO (duplicate return info).
    pub eflags: Le16,
    /// Reserved padding.
    pub _reserved0: [u8; 20],
    /// Transport service identifier (port number as ASCII string,
    /// e.g., "4420" for NVMe-oF I/O, "8009" for discovery).
    pub trsvcid: [u8; 32],
    /// Explicit padding.
    pub _reserved1: [u8; 192],
    /// NVMe subsystem qualified name (NQN) — null-terminated ASCII,
    /// max 223 characters + NUL (NVMe spec §4.1).
    pub subnqn: [u8; 256],
    /// Transport address (IP address as ASCII string for TCP/RDMA,
    /// e.g., "192.168.1.100" or "fe80::1").
    pub traddr: [u8; 256],
    /// Transport-specific address subtype (RDMA: partition key, TCP: unused).
    pub tsas: [u8; 256],
}
// NVMe Base Spec 2.1: discovery log entry is exactly 1024 bytes.
// trtype(1) + adrfam(1) + subtype(1) + treq(1) + portid(2) + cntlid(2) +
// asqsz(2) + eflags(2) + _reserved0(20) + trsvcid(32) + _reserved1(192) +
// subnqn(256) + traddr(256) + tsas(256) = 1024.
const_assert!(core::mem::size_of::<NvmeDiscoveryLogEntry>() == 1024);

  • Automatic enumeration: the discovery controller scans all locally-configured NVMe-oF subsystems (from the nvmet configuration) and generates discovery log page entries for each subsystem + transport combination. When subsystems are added or removed, the discovery log page generation counter is incremented, and initiators with persistent discovery connections receive an AEN (Asynchronous Event Notification) prompting them to re-read the log page.

  • Persistent discovery connections (TP 8013a): initiators can maintain a long-lived connection to the discovery controller. The controller sends AENs when the discovery log changes (subsystem added/removed, transport address changed, ANA state changed). This eliminates periodic polling — the initiator is notified immediately of topology changes.

  • Referrals: the discovery controller can include referral entries pointing to discovery controllers on other UmkaOS nodes. This enables distributed discovery: an initiator connects to any one UmkaOS node's discovery controller and learns about NVMe-oF subsystems across the entire cluster. Referral entries use subtype = 0x02 (discovery subsystem) with the remote node's transport address.

  • Security: discovery controller connections support DH-HMAC-CHAP authentication (NVMe TP 8001) and TLS 1.3 (NVMe TP 8011) when configured. Unauthenticated discovery is permitted by default for compatibility with existing initiators; operators can require authentication via the nvmet access control configuration.

  • Mixed cluster interoperability: in a cluster containing both UmkaOS and non-UmkaOS nodes, UmkaOS nodes discover storage via the native PeerRegistry (Section 5.2), while non-UmkaOS nodes use the NVMe-oF discovery controller on TCP port 8009. Both paths expose the same NVMe subsystems — the discovery controller simply translates PeerRegistry storage advertisements into standard NVMe-oF discovery log entries.
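As a small illustration of building such entries, here is a hypothetical helper for populating the fixed-size ASCII fields (`trsvcid`, `traddr`, `subnqn`). It NUL-pads for simplicity — verify the spec's exact padding rule (space vs. NUL per field) before relying on it:

```rust
/// Copy an ASCII string into a fixed-size, NUL-padded field of a discovery
/// log entry. Fails if the value would not leave room for a terminating NUL
/// (e.g., subnqn allows at most 223 characters in its 256-byte field per
/// the stricter NQN length rule).
fn ascii_field<const N: usize>(s: &str) -> Result<[u8; N], &'static str> {
    let bytes = s.as_bytes();
    if !bytes.is_ascii() {
        return Err("field must be ASCII");
    }
    if bytes.len() >= N {
        return Err("value too long for field (room for NUL required)");
    }
    let mut out = [0u8; N]; // zero-filled = NUL padding
    out[..bytes.len()].copy_from_slice(bytes);
    Ok(out)
}
```

For example, the discovery port becomes `ascii_field::<32>("8009")` for `trsvcid`, and an IPv4 listen address becomes `ascii_field::<256>("192.168.1.100")` for `traddr`.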

targetcli configfs Management

Both iSCSI and NVMe-oF targets are configured via configfs (Section 14.12). The configfs tree layout is Linux-compatible so that existing user-space tools (targetcli, targetcli-fb, rtslib-fb, nvmetcli) work without modification via the SysAPI layer.

iSCSI target configfs hierarchy (/sys/kernel/config/target/iscsi/):

| Path | Purpose |
| --- | --- |
| `<iqn>/` | `mkdir`: create an iSCSI target with the given IQN |
| `<iqn>/tpgt_<n>/` | `mkdir`: create target portal group N |
| `<iqn>/tpgt_<n>/enable` | `echo 1 >`: activate the portal group |
| `<iqn>/tpgt_<n>/lun/lun_<m>/` | `mkdir` + symlink to backstore: map a LUN |
| `<iqn>/tpgt_<n>/acls/<initiator_iqn>/` | `mkdir`: create ACL entry for an initiator |
| `<iqn>/tpgt_<n>/acls/<initiator_iqn>/auth/` | CHAP credentials: `userid`, `password`, `userid_mutual`, `password_mutual` |
| `<iqn>/tpgt_<n>/np/<ip>:<port>/` | `mkdir`: create a network portal (listen address) |
| `<iqn>/tpgt_<n>/param/` | iSCSI session parameters: `MaxRecvDataSegmentLength`, `MaxBurstLength`, `FirstBurstLength`, `DefaultTime2Wait`, `DefaultTime2Retain`, `HeaderDigest`, `DataDigest` |

NVMe-oF target configfs hierarchy (/sys/kernel/config/nvmet/):

| Path | Purpose |
| --- | --- |
| `subsystems/<nqn>/` | `mkdir`: create an NVMe subsystem |
| `subsystems/<nqn>/attr_allow_any_host` | `echo 1 >`: disable host NQN ACL checking |
| `subsystems/<nqn>/namespaces/<nsid>/` | `mkdir`: create a namespace |
| `subsystems/<nqn>/namespaces/<nsid>/device_path` | `echo /dev/nvme0n1 >`: set backing device |
| `subsystems/<nqn>/namespaces/<nsid>/enable` | `echo 1 >`: activate the namespace |
| `ports/<port_id>/` | `mkdir`: create a transport port |
| `ports/<port_id>/addr_trtype` | Transport type: `tcp`, `rdma` |
| `ports/<port_id>/addr_traddr` | Listen address (e.g., 192.0.2.1) |
| `ports/<port_id>/addr_trsvcid` | Listen port (e.g., 4420) |
| `ports/<port_id>/param_tls` | TLS mode: `none`, `psk`, `certificate` (see NVMe-oF TLS 1.3 above) |
| `ports/<port_id>/subsystems/<nqn>` | Symlink: bind a subsystem to this port |
| `hosts/<nqn>` | `mkdir`: register an allowed host NQN for ACL |

Configuration operations are serialized by the configfs group_mutex (one writer at a time). Reads (e.g., cat param/MaxBurstLength) are lock-free via RCU-protected snapshots of the parameter structures. Runtime parameter changes (e.g., adjusting MaxRecvDataSegmentLength) take effect on new sessions only; existing sessions retain the parameters negotiated at login time.

NVMe over Fabrics — Why It Matters

NVMe-oF is replacing iSCSI in new deployments because it eliminates the SCSI translation layer. iSCSI encapsulates SCSI commands (a protocol designed for parallel buses in 1986) over TCP. NVMe-oF speaks NVMe natively — the same command set used by local NVMe SSDs. This means:

  • No SCSI CDB translation overhead
  • Native support for NVMe features (multipath/ANA, zoned namespaces, NVMe reservations)
  • Simpler protocol state machine (NVMe queue pairs vs iSCSI session/connection/task)
  • Lower latency at every layer

UmkaOS supports both because iSCSI remains dominant in existing infrastructure (and iSER makes it competitive on RDMA fabrics), while NVMe-oF is the clear direction for new deployments.

Protocol comparison:

| Protocol | Transport | CPU overhead | Latency | Bandwidth |
| --- | --- | --- | --- | --- |
| iSCSI | TCP | High (TCP stack + SCSI) | ~100 μs | 10-25 Gbps |
| iSER | RDMA | Minimal (zero-copy) | ~5-10 μs transport; ~15-25 μs end-to-end with SCSI target | Line rate (100+ Gbps) |
| NVMe-oF/TCP | TCP | Medium (no SCSI layer) | ~15-30 μs | 25-100 Gbps |
| NVMe-oF/RDMA | RDMA | Minimal | ~10-20 μs end-to-end¹ | Line rate |

¹ NVMe-oF/RDMA latency breakdown: ~3-5 μs network transport (RDMA) + NVMe target processing. The 3-5 μs figure commonly cited represents RDMA transport latency only; end-to-end I/O latency including NVMe device processing is typically ~10-20 μs.

Recovery advantage — Both iSCSI and NVMe-oF initiators run as Tier 1 drivers with state preservation (Section 11.9). If an initiator driver crashes:

  1. Connection state is checkpointed to the state preservation buffer.
  2. The driver reloads in ~50-150ms.
  3. RDMA transports (iSER, NVMe-oF/RDMA): when a driver crashes, the local RNIC's Queue Pair enters the Error state, and the remote side's QP also transitions to Error from retransmission timeouts. QP state cannot be transparently restored from a checkpoint — the QP must be destroyed and re-created (Reset → Init → RTR → RTS). UmkaOS performs a fast QP re-creation: checkpointed session parameters (remote QPN, GID, LID, PSN, MTU, RDMA capabilities) allow the new QP to be configured without full connection manager negotiation. The remote side detects the QP failure (via an async error event or a failed RDMA operation) and cooperates in re-establishing the QP pair. Total recovery: ~50-150ms (fast re-creation, not transparent restore), vs. 10-30 seconds for full re-discovery in Linux.
  4. TCP transports (iSCSI/TCP, NVMe-oF/TCP): full TCP connection state cannot be reliably restored after a crash (the remote peer's TCP state has advanced: retransmissions, window adjustments, etc.). Instead, UmkaOS performs a fast reconnect: the checkpointed session parameters (target portal, ISID, TSIH for iSCSI; NQN, controller ID for NVMe-oF) allow session re-establishment without full discovery. The target accepts the reconnect as a session continuation (iSCSI session reinstatement, RFC 7143 Section 7.3.5; NVMe-oF controller reconnect). In-flight I/O commands are retried by the block layer. Total recovery: ~200-500ms (vs. 10-30 seconds for full re-discovery in Linux).

In Linux, an initiator crash requires full session re-establishment: TCP/RDMA reconnection, login/connect, LUN/namespace re-discovery, and filesystem remount. This can take 10-30 seconds and may cause I/O errors visible to applications.

Multipath — Two multipath models coexist:

  • iSCSI: dm-multipath integration with the recovery-aware volume layer (Section 15.2). Multiple iSCSI paths (via different network interfaces or through different target portals) provide redundancy.
  • NVMe-oF: native NVMe ANA multipath (managed by the NVMe driver, not dm-multipath). ANA state changes are handled in-driver with recovery awareness.

Both models coordinate with the volume state machine — if a path fails due to driver crash (not network failure), the volume layer waits for driver recovery rather than immediately failing over.

15.13.2 NVMe-oF Reconnect Policy

The external NVMe-oF protocol is Linux-compatible (same wire format, same controller reconnect semantics). The reconnect strategy — when and how to retry — is UmkaOS's internal design space. Without backoff and jitter, all hosts in a cluster that lose fabric connectivity simultaneously will attempt to reconnect simultaneously, overloading the target's accept queue and prolonging the outage. UmkaOS uses exponential backoff with full jitter to spread reconnect attempts across the cluster.

Algorithm: exponential backoff with full jitter

When a fabric connection drops (TCP disconnect, QP error event) or an initial connect attempt fails:

attempt = 0
base_delay_ms = 100
max_delay_ms  = 30_000   // 30 seconds

loop:
    ceiling = min(base_delay_ms * 2^attempt, max_delay_ms)
    delay   = random_uniform(0, ceiling)   // full jitter
    sleep(delay)
    attempt = attempt + 1
    try connect()
    if connected: reset attempt = 0, break

Delay ceilings without jitter (for reference): 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 25.6s, 30s, 30s, ...

With full jitter, the actual delay for attempt n is uniformly random in [0, ceiling(n)]. Full jitter (as opposed to equal jitter or decorrelated jitter) provides the best protection against synchronized reconnects in large clusters — reconnect attempts spread across the entire backoff window rather than clustering at the same instant. Reference: AWS Architecture Blog "Exponential Backoff And Jitter" (2015).
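A minimal sketch of this schedule, assuming full jitter (uniform in [0, ceiling]); the xorshift PRNG and function names are illustrative, not the kernel's RNG API:

```rust
/// Tiny xorshift64 PRNG so the example is self-contained. Seed must be
/// nonzero; the kernel would use its own entropy source instead.
struct XorShift(u64);
impl XorShift {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

const BASE_DELAY_MS: u64 = 100;
const MAX_DELAY_MS: u64 = 30_000;

/// Delay before reconnect attempt `attempt` (0-based): uniformly random in
/// [0, min(base * 2^attempt, max)] — the "full jitter" policy.
fn reconnect_delay_ms(attempt: u32, rng: &mut XorShift) -> u64 {
    let ceiling = BASE_DELAY_MS
        .saturating_mul(1u64.checked_shl(attempt).unwrap_or(u64::MAX))
        .min(MAX_DELAY_MS);
    rng.next() % (ceiling + 1)
}
```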

ANA path failover — If a path transitions to ANAState::Inaccessible, UmkaOS immediately tries the next available ANA-optimized path before entering the reconnect loop for the failed path. The reconnect loop is only entered after all optimized paths for a namespace are exhausted. This preserves I/O availability during single-path failures without incurring any reconnect delay.
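The failover policy above reduces to an ordered search over ANA states; the types below are sketch stand-ins, not the driver's:

```rust
/// ANA states relevant to path selection (subset, for illustration).
#[derive(Clone, Copy, PartialEq, Debug)]
enum AnaState { Optimized, NonOptimized, Inaccessible }

struct NvmePath { id: u32, ana: AnaState }

/// Select the best available path: any Optimized path first, then any
/// NonOptimized path; Inaccessible paths are never selected. Returns None
/// only when every path is exhausted — which is exactly when the reconnect
/// loop is entered.
fn select_path(paths: &[NvmePath]) -> Option<u32> {
    paths.iter().find(|p| p.ana == AnaState::Optimized)
        .or_else(|| paths.iter().find(|p| p.ana == AnaState::NonOptimized))
        .map(|p| p.id)
}
```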

Fast-path reconnect (NVMe-oF/TCP only) — If the TCP connection drops but the NVMe-oF controller was previously established (implying a fabric-layer issue rather than a target reset or controller crash), the first reconnect attempt uses a fixed 10ms delay instead of the normal 100ms base delay. The rationale: the target controller is likely still healthy and ready to accept the reconnect immediately; the full backoff sequence is reserved for cases where the target itself is unavailable.

Maximum reconnect attempts — After 20 consecutive failed attempts (approximately 10 minutes at the 30s ceiling), the controller is marked NvmeControllerState::Offline and I/O to namespaces served only by this controller fails with EIO. The controller remains registered; operators can re-trigger connection attempts via sysfs or the umkafs control interface at /ukfs/kernel/nvmeof/<nqn>/reconnect.


15.13.3 Block Service Provider

When a host has a storage device managed by a traditional KABI driver (NVMe, SCSI, virtio-blk), the block layer can provide that device as a cluster service via the peer protocol. Remote peers access the device through the standard block device interface — they do not know or care which driver manages it on the serving host.

This is the block-layer instantiation of the capability service provider model described in Section 5.7. In a uniform UmkaOS cluster, the block service provider provides remote storage access without NVMe-oF targets, iSCSI daemons, or any external protocol stack.

15.13.3.1 Service Provider and Wire Protocol

Device-native providers (Tier M): When an NVMe drive's firmware implements the umka peer protocol (Section 11.1), the drive IS the block service provider — no host-side KABI NVMe driver is involved. The drive advertises BLOCK_STORAGE via CapAdvertise, the host creates a BlockServiceClient via the PeerServiceProxy bridge (Section 5.11), and I/O flows through the ring pair directly to drive hardware. The wire protocol (BlockServiceRequest/Completion) is identical whether the provider is device firmware, a host-proxy kernel module, or a remote host. Sharing model: multiple hosts can ServiceBind to the same block device simultaneously with reservation coordination (Reserve/Release/Preempt opcodes).

// umka-block/src/service_provider.rs

/// Registers a local block device for remote access by cluster peers.
/// The service provider listens for incoming block I/O requests on the
/// peer protocol and dispatches them to the local block layer.
pub struct BlockServiceProvider {
    /// The local block device being served.
    device: BlockDeviceHandle,
    /// Unique service instance identifier. Used for reservation namespace
    /// and multi-path target identification.
    service_id: ServiceInstanceId,
    /// Per-CPU I/O queues (see "Multi-Queue I/O" below). One queue pair per
    /// connected client CPU, up to `max_queues` per client.
    /// Bounded: max `max_queues_per_client × MAX_CONNECTED_CLIENTS` (32 × 1024 = 32768).
    /// `MAX_CONNECTED_CLIENTS` is enforced in the connection accept path: new
    /// connections beyond the limit are rejected with a protocol-level error.
    /// Allocated at client connect time (warm path).
    queues: Vec<BlockServiceQueue>,
    /// Maximum I/O queues per client connection. Default: min(server_cpus, 32).
    /// Each queue is a separate peer queue pair for full parallelism.
    max_queues_per_client: u16,
    /// Maximum concurrent I/O operations per queue (backpressure).
    /// Default: 128. Total max inflight = max_queues × queue_depth.
    queue_depth: u16,
    /// Write-back cache for coalescing remote writes (optional).
    /// Disabled by default for safety. Enabled via export configuration
    /// when the remote consumer tolerates write-back semantics.
    writeback_cache: Option<WritebackCache>,
    /// Connected clients, tracked for reservation state and recovery.
    /// Keyed by PeerId (u64). XArray provides O(1) lookup with native
    /// RCU-protected reads and internal xa_lock for write serialization.
    clients: XArray<BlockServiceClientState>,
}

/// Server-side write coalescing cache. Buffers remote writes in memory
/// before flushing to the backing block device, reducing small-write
/// amplification for workloads with temporal locality (e.g., metadata
/// updates). Disabled by default for safety — only enabled via explicit
/// export configuration when the remote consumer tolerates write-back
/// semantics (i.e., acknowledges that unflushed writes are lost on
/// server crash, same as a local volatile write cache).
pub struct WritebackCache {
    /// Per-client dirty page tracking. XArray keyed by (offset / block_size).
    /// Provides O(1) lookup for coalescing successive writes to the same block
    /// and ordered iteration for sequential flush.
    dirty_map: XArray<DirtyEntry>,
    /// Maximum dirty bytes before flush (backpressure). Default: 64 MiB.
    /// When `dirty_bytes` reaches this threshold, new writes block until
    /// the periodic flush or an explicit Flush request drains enough data.
    max_dirty_bytes: u64,
    /// Current dirty bytes. Updated atomically on write (add) and flush (sub).
    dirty_bytes: AtomicU64,
    /// Flush interval in milliseconds. A periodic writeback timer fires at
    /// this interval to flush aged dirty entries. Default: 5000.
    flush_interval_ms: u32,
    /// Write-through threshold: writes larger than this bypass the cache
    /// and go directly to the block device. Default: 256 KiB. Large
    /// sequential writes do not benefit from coalescing and would evict
    /// useful cached small writes.
    write_through_threshold: u32,
}

/// A single dirty block in the writeback cache.
struct DirtyEntry {
    /// Client that wrote this block (for invalidation on disconnect).
    client_id: PeerId,
    /// Data buffer (slab-allocated, block_size bytes).
    data: SlabRef<[u8]>,
    /// Timestamp of last write (monotonic ns, for age-based flush).
    last_write_ns: u64,
}

/// Per-client state tracked on the server side. One entry per connected
/// remote peer, stored in `BlockServiceProvider::clients` (XArray keyed
/// by PeerId).
pub struct BlockServiceClientState {
    /// Remote peer identity.
    peer_id: PeerId,
    /// Number of queues established by this client.
    nr_queues: u16,
    /// Reservation state (if this client holds a reservation on the device).
    reservation: Option<ReservationState>,
    /// In-flight I/O count for this client (for fair scheduling across
    /// clients sharing the same export).
    inflight_count: AtomicU32,
    /// Bandwidth consumed (bytes/sec, EWMA with α = 1/16). For QoS
    /// enforcement — the server throttles clients exceeding their
    /// per-client bandwidth limit.
    bandwidth_ewma: AtomicU64,
    /// Connection timestamp (monotonic ns). Used for diagnostics and
    /// connection age reporting in sysfs.
    connected_since_ns: u64,
    /// Request ID deduplication window for reconnect. Tracks the last
    /// N completed request_ids to reject duplicates after reconnection.
    /// Fixed-capacity ring buffer: 256 entries (2× the default per-queue
    /// queue_depth of 128). On reconnect, the client may re-submit requests
    /// that already completed on the server before the connection dropped.
    /// The server checks incoming request_ids against this window and
    /// returns the cached completion without re-executing the I/O.
    dedup_window: ArrayVec<u64, 256>,
}
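The deduplication window's behavior can be sketched with a plain ring buffer (capacity 256, matching the field above); this is an illustrative model, not the kernel code:

```rust
/// Ring of the most recently completed request IDs. Once full, recording a
/// new ID evicts the oldest. Lookup is O(n) over at most 256 entries, which
/// is acceptable because it only runs on the (cold) reconnect path.
struct DedupWindow {
    ids: Vec<u64>, // stand-in for ArrayVec<u64, 256>
    next: usize,   // ring cursor (index of the oldest entry once full)
}

impl DedupWindow {
    const CAP: usize = 256;

    fn new() -> Self {
        Self { ids: Vec::new(), next: 0 }
    }

    /// Record a completed request ID, evicting the oldest when full.
    fn record(&mut self, id: u64) {
        if self.ids.len() < Self::CAP {
            self.ids.push(id);
        } else {
            self.ids[self.next] = id;
            self.next = (self.next + 1) % Self::CAP;
        }
    }

    /// True if this request already completed — the server then returns the
    /// cached completion instead of re-executing the I/O.
    fn is_duplicate(&self, id: u64) -> bool {
        self.ids.contains(&id)
    }
}
```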

/// Per-client reservation state (server-side).
pub struct ReservationState {
    /// Reservation type (exclusive write, shared read, etc.).
    reservation_type: BlockReservationType,
    /// Reservation key (client-chosen, used for preemption identification).
    key: u64,
    /// Generation counter for SCSI-3 PR compatibility. Incremented on
    /// every reservation change for this client. Used by clustered
    /// filesystems (GFS2, OCFS2) to detect stale reservations.
    generation: u32,
}

// NOTE: ReservationType enum was removed. Use `BlockReservationType` (below,
// with correct SCSI-3 PR values starting at 1) for all reservation state.

/// A single I/O queue within an export. Each queue is serviced by a
/// dedicated kernel thread pinned to one CPU — no lock contention
/// between queues (same model as NVMe hardware queues).
pub struct BlockServiceQueue {
    /// Peer protocol queue pair for this I/O queue. Established at
    /// ServiceBind time; the concrete implementation depends on the
    /// transport binding (RDMA RC QP, TCP socket, CXL doorbell, PCIe
    /// BAR ring). Service providers use the ring pair abstraction
    /// ([Section 5.1](05-distributed.md#distributed-kernel-architecture--peer-ring-entry-format)),
    /// not raw transport operations.
    qp: PeerQueuePair,
    /// Submission ring: client writes requests here.
    submit_ring: RingBuffer<BlockServiceRequest>,
    /// Completion ring: server writes completions here.
    completion_ring: RingBuffer<BlockServiceCompletion>,
    /// CPU this queue is bound to on the server.
    cpu: u32,
}

/// Block I/O request from a remote peer.
/// Size: 64 bytes (one cache line, fits in one transport send).
/// This struct crosses node boundaries via the peer protocol. Per the DSM
/// wire format policy ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)):
/// every `#[repr(C)]` struct that crosses a node boundary MUST use
/// `Le16`/`Le32`/`Le64` for all multi-byte integer fields. Single-byte
/// fields (`u8`) and byte arrays (`[u8; N]`) are endianness-neutral.
#[repr(C, align(64))]
pub struct BlockServiceRequest {
    /// Client-assigned request ID. Echoed in completion.
    pub request_id: Le64,
    /// Operation code.
    pub opcode: BlockServiceOpcode,
    /// I/O priority class (see "I/O Priority and QoS" below). Higher = more urgent.
    /// 0 = default (best-effort). Used by server-side I/O scheduler
    /// for QoS enforcement when multiple clients share an export.
    pub priority: BlockServicePriority,
    /// Flags: FUA, barrier, scatter-gather, data integrity.
    pub flags: Le16,
    /// Explicit padding: Le types have alignment 1, so no implicit padding
    /// exists. This 4-byte pad ensures `offset` starts at byte 16 (8-byte
    /// aligned), which is conventional for wire formats.
    pub _pad1: [u8; 4],
    /// Byte offset on the block device.
    pub offset: Le64,
    /// Length in bytes (for Read/Write/Discard/CompareAndWrite).
    pub len: Le32,
    /// Number of scatter-gather entries (see "Scatter-Gather I/O" below).
    /// 0 = single contiguous buffer (data_region_offset).
    /// 1-15 = scatter-gather list follows the request as inline SGL.
    /// The inline SGL (sgl_count × 12 bytes) plus the 64-byte header must
    /// fit within one transport send inline threshold. Standard ConnectX NICs
    /// support 256-byte inline send → max 16 SGL entries inline
    /// ((256 - 64) / 12 = 16, capped at 15 by sgl_count).
    /// If the SGL exceeds the inline threshold, it is written into a
    /// pre-registered server buffer via push_page() and
    /// data_region_offset points to the SGL, not the data.
    pub sgl_count: u8,
    /// Reserved for alignment.
    pub _reserved: [u8; 3],
    /// Offset within the per-queue ServiceDataRegion established at
    /// ServiceBind time. For Read: server writes data here; for Write:
    /// server reads data from here. Zero for non-data ops.
    /// When sgl_count > 0, this points to the first SglEntry.
    pub data_region_offset: Le64,
    /// For CompareAndWrite: offset within the ServiceDataRegion for the
    /// compare buffer. The compare buffer contains the expected data;
    /// data_region_offset contains the new data to write if comparison
    /// succeeds.
    pub compare_region_offset: Le64,
    /// Explicit padding to fill the 64-byte cache line.
    pub _pad: [u8; 16],
    // Layout: request_id(8) + opcode(1) + priority(1) + flags(2) +
    // _pad1(4) + offset(8) + len(4) + sgl_count(1) + _reserved(3)
    // + data_region_offset(8) + compare_region_offset(8) + _pad(16) = 64.
}
const_assert!(core::mem::size_of::<BlockServiceRequest>() == 64);

/// Scatter-gather list entry for multi-segment I/O (see "Scatter-Gather I/O" below).
/// Size: 12 bytes (offset: Le64 + len: Le32). Cross-node wire format — all
/// multi-byte fields use Le types per DSM wire format policy
/// ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)).
#[repr(C)]
pub struct SglEntry {
    /// Offset within the per-queue ServiceDataRegion.
    pub region_offset: Le64,
    /// Length of this segment in bytes.
    pub len: Le32,
}
// Wire format: Le64(8) + Le32(4) = 12 bytes. Le types have alignment 1.
const_assert!(core::mem::size_of::<SglEntry>() == 12);

/// Block service wire protocol opcode. These values are INDEPENDENT of
/// `BioOp` values — the block service protocol has its own opcode space
/// because it includes operations (GetInfo, Abort, Reserve, CompareAndWrite,
/// etc.) that have no `BioOp` equivalent.
///
/// **Conversion**: The block service provider MUST convert between `BioOp`
/// and `BlockServiceOpcode` using explicit match arms, NOT numeric casting.
/// After the SF-192 fix, `BioOp` values match Linux's `req_op` (with gaps),
/// while `BlockServiceOpcode` uses sequential numbering. Numeric casting
/// (`bio.op as u8`) produces WRONG opcodes.
///
/// ```rust
/// fn bio_op_to_service_opcode(op: BioOp) -> BlockServiceOpcode {
///     match op {
///         BioOp::Read        => BlockServiceOpcode::Read,
///         BioOp::Write       => BlockServiceOpcode::Write,
///         BioOp::Flush       => BlockServiceOpcode::Flush,
///         BioOp::Discard     => BlockServiceOpcode::Discard,
///         BioOp::WriteZeroes => BlockServiceOpcode::WriteZeroes,
///         BioOp::SecureErase => BlockServiceOpcode::Discard, // mapped to discard on wire
///         BioOp::ZoneAppend  => BlockServiceOpcode::Write,   // treated as write on wire
///     }
/// }
/// ```
#[repr(u8)]
pub enum BlockServiceOpcode {
    Read = 0,
    Write = 1,
    Flush = 2,
    Discard = 3,
    WriteZeroes = 4,
    GetInfo = 5,
    /// Abort a previously submitted request by request_id.
    Abort = 6,
    /// Reservation operations (see "Reservations for Shared Access" below).
    Reserve = 7,
    ReleaseReservation = 8,
    Preempt = 9,
    /// Atomic compare-and-write (see "Atomic Compare-and-Write" below).
    /// Reads `len` bytes at `offset`, compares with `compare_region_offset`
    /// buffer. If equal, writes `data_region_offset` buffer. If not equal,
    /// fails with ECANCELED and returns the current data in
    /// `compare_region_offset`.
    CompareAndWrite = 10,
    /// Reset the exported device (see "Error Recovery and Reconnection" below). Last-resort recovery
    /// when Abort fails. Aborts all in-flight I/O, resets device state.
    ResetDevice = 11,
}

/// I/O priority class. Maps to Linux I/O priority (ioprio) levels.
#[repr(u8)]
pub enum BlockServicePriority {
    /// Background — lowest priority. Batch jobs, scrubbing.
    Idle = 0,
    /// Best-effort, low urgency.
    BestEffortLow = 1,
    /// Best-effort, normal urgency (default for most workloads).
    BestEffort = 2,
    /// Best-effort, high urgency.
    BestEffortHigh = 3,
    /// Real-time, low urgency. Latency-sensitive but not critical.
    RealTimeLow = 4,
    /// Real-time, normal urgency. Database journal commits.
    RealTime = 5,
    /// Real-time, high urgency. UPFS metadata operations.
    RealTimeHigh = 6,
    /// Real-time, critical. Fencing and reservation operations.
    RealTimeCritical = 7,
}

bitflags! {
    /// In-memory representation of block service flags.
    ///
    /// The wire format in `BlockServiceRequest.flags` is `Le16`. Conversion:
    /// - Deserialize: `BlockServiceFlags::from_bits_truncate(request.flags.to_ne())`
    /// - Serialize: `Le16::from_ne(flags.bits())`
    pub struct BlockServiceFlags: u16 {
        /// Force Unit Access — bypass volatile write cache, ensure data
        /// reaches persistent storage before completion. Maps to
        /// `BioFlags::FUA` in the block layer.
        const FUA = 1 << 0;
        /// This request is part of a write barrier sequence
        /// (see "Write Ordering and Barriers" below). Server must preserve ordering.
        const BARRIER = 1 << 1;
        /// Data integrity fields are present (see "Data Integrity" below).
        /// Completion will include integrity verification result.
        const DATA_INTEGRITY = 1 << 2;
    }
}

/// Completion sent back to the requesting peer.
/// Size: 32 bytes (power-of-two for ring buffer slot alignment).
/// Layout: request_id(8) + status(4) + bytes_done(4) + info_len(4) +
/// integrity_status(1) + _reserved(3) + _pad(8) = 32.
/// Cross-node wire format — all multi-byte fields use Le types
/// per DSM wire format policy ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)).
#[repr(C, align(32))]
pub struct BlockServiceCompletion {
    /// Matches the request_id from the original request.
    pub request_id: Le64,
    /// 0 on success, negative errno on failure.
    /// CompareAndWrite: -ECANCELED if comparison failed.
    /// Transmitted as `Le32` (unsigned wire representation of a signed i32).
    /// Receiver converts: `status.to_ne() as i32`. This avoids introducing a
    /// separate `Lei32` type — the Le* family covers unsigned integers only.
    pub status: Le32,
    /// Bytes transferred (for Read/Write). 0 for non-data ops.
    pub bytes_done: Le32,
    /// For GetInfo: serialized BlockServiceDeviceInfo follows as inline data.
    /// For other ops: reserved, zero.
    pub info_len: Le32,
    /// Data integrity result (see "Data Integrity" below).
    /// 0 = integrity check passed or not requested.
    /// Non-zero = integrity error (DIF_GUARD_ERROR, DIF_REF_ERROR, DIF_APP_ERROR).
    pub integrity_status: u8,
    pub _reserved: [u8; 3],
    /// Explicit padding to fill the 32-byte alignment boundary. Must be zeroed.
    pub _pad: [u8; 8],
}
const_assert!(core::mem::size_of::<BlockServiceCompletion>() == 32);

Wire protocol: per-queue ring pairs on the peer transport. Each queue has a submission ring and a completion ring. Data transfers use remote write (server pushes read data into client's data region) and remote read (server fetches write data from client's data region) via the peer transport. The request/completion messages themselves are sent via ring pair entries.

Transport abstraction: All service provider wire structs use transport-neutral addressing. Data references are region_offset: u64 values — offsets within the ServiceDataRegion established at ServiceBind time (Section 5.1). The peer transport layer maps these offsets to the concrete mechanism: on RDMA, offset + bind-time rkey + base_addr form an RDMA Write/Read target; on CXL, offset indexes into hardware-coherent shared memory; on TCP, the sender transmits the data inline (remote memory access is not available). Service providers never reference transport-specific types (rkeys, RDMA work requests, etc.) — they use peer protocol ring pairs and region offsets.

This is structurally identical to NVMe-oF over RDMA fabrics (submission queue + completion queue per CPU), but uses the native peer protocol instead of NVMe capsules.

15.13.3.2 Multi-Queue I/O

A single I/O queue is a bottleneck for high-IOPS devices. Modern NVMe SSDs deliver 1M+ IOPS; a single queue pair saturates at ~200-400K IOPS (limited by completion polling and doorbell overhead).

Client connection setup:
1. Client connects to server's BlockServiceProvider.
2. Server advertises max_queues_per_client and queue_depth.
3. Client creates N queue pairs (typically one per local CPU that will
   issue I/O, up to max_queues_per_client).
4. Each queue pair is an independent reliable connected transport queue.
5. Client pins each queue to a local CPU. Server pins the corresponding
   server-side queue to a server CPU.

I/O dispatch (client side):
  cpu = smp_processor_id()
  queue = export_queues[cpu % nr_queues]
  queue.submit(request)
  // No cross-CPU contention — each CPU uses its own queue.

I/O processing (server side):
  // Each server queue thread is pinned to one CPU.
  // Polls its submission ring, dispatches to local block layer,
  // posts completions. No locks between queues.

Queue count negotiation: the client requests its preferred queue count (typically min(nr_cpus, 32)). The server grants up to max_queues_per_client. For a 32-core client talking to a 16-core server, the server grants 16 queues. The client maps 2 CPUs per queue.
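The negotiation and per-CPU dispatch described above can be sketched as follows (a minimal illustration; the function names are not part of the wire protocol):

```rust
/// Queue-count negotiation: the client requests min(nr_cpus, preferred cap);
/// the server grants up to its max_queues_per_client.
fn negotiate_queue_count(client_cpus: u32, client_pref_cap: u32, server_max: u32) -> u32 {
    client_cpus.min(client_pref_cap).min(server_max)
}

/// Per-CPU dispatch: each CPU deterministically maps to one queue,
/// so the submission path has no cross-CPU contention.
fn queue_for_cpu(cpu: u32, nr_queues: u32) -> u32 {
    cpu % nr_queues
}
```

For the 32-core client / 16-core server example, `negotiate_queue_count(32, 32, 16)` yields 16 queues, and CPUs 1 and 17 both map to queue 1 (2 CPUs per queue).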

RDMA resource partitioning: NVMe-oF and DLM allocate QPs from separate pools to prevent resource starvation. NVMe-oF allocates from the I/O QP pool (budget: num_cpus × num_targets QPs). DLM allocates from the control QP pool (budget: 2 × num_peers QPs). Both pools draw from the RDMA device's total QP capacity. If either pool is exhausted, the requesting subsystem queues allocation and retries on QP release — no cross-pool borrowing.

15.13.3.3 Write Ordering and Barriers

Filesystem journaling requires strict write ordering: journal data must reach persistent storage before the commit record. The block layer expresses this through write barriers (BioFlags::PREFLUSH, BioFlags::FUA).

Block service provider preserves write ordering within each queue:

Ordering guarantees:
1. WITHIN a single queue: requests complete in submission order.
   Write A submitted before Write B → A completes before B.
   This matches NVMe command ordering within a single SQ.

2. ACROSS queues: no ordering guarantee. Same as NVMe across
   different SQs, same as local block layer across different CPUs.

3. FLUSH: drains all prior writes in ALL queues to persistent storage.
   Server translates to blkdev_issue_flush() on the local device.
   Flush completion means all prior writes are persistent.

4. FUA (Force Unit Access): this specific write bypasses volatile cache.
   Server translates to `BioFlags::FUA` on the local block layer. The write
   is persistent when the completion is returned.

5. BARRIER flag: server processes this request only after all prior
   requests in the same queue have completed. Used by filesystem
   journaling to sequence: writes → flush → commit_record(FUA).

Correctness argument: journaling filesystems issue their commit sequences from a single journal thread, so the journal writes, the flush, and the FUA commit record all originate on one CPU and therefore land in the same queue. Within-queue ordering guarantees the sequence: writes complete → flush drains to disk → commit record is FUA-written. This is the same guarantee that local NVMe provides.

15.13.3.4 Error Recovery and Reconnection

I/O timeout: each submitted request has a timeout (default: 30 seconds, configurable). If no completion arrives within the timeout:

  1. Client sends Abort { request_id } to the server.
  2. If the server responds with abort confirmation, the original request is failed with ETIMEDOUT. The client's block layer retries or fails upward depending on the filesystem's error handling.
  3. If the abort itself times out (server unreachable), the client transitions to reconnection.

Reconnection follows the same model as NVMe-oF reconnect (Section 15.13):

Reconnect protocol:
1. Client detects server unreachable (heartbeat Dead, or I/O + abort timeout).
2. Client enters RECONNECTING state. All new I/O is queued (not failed).
3. Client attempts to reconnect with exponential backoff:
   initial=1s, max=30s, multiplier=2, jitter_frac=0.25.
   Note: jitter prevents reconnect storms when many clients lose connectivity
   simultaneously (same rationale as NVMe-oF reconnect in Section 15.7).
4. On successful reconnect:
   a. Client re-creates queue pairs.
   b. Client re-sends all in-flight (unacknowledged) requests.
   c. Server detects duplicate request_ids and deduplicates.
   d. Queued I/O is drained.
5. After 20 failed attempts (~10 minutes), client marks the export
   OFFLINE. I/O fails with EIO. Manual reconnect via sysfs.
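The backoff schedule in step 3 can be sketched as below. The `jitter` argument stands in for a per-attempt random draw in [-1.0, 1.0]; a real implementation would take it from an RNG rather than the caller:

```rust
/// Reconnect backoff with the parameters from the text:
/// initial=1s, max=30s, multiplier=2, jitter_frac=0.25. `attempt` starts at 0.
fn reconnect_delay_ms(attempt: u32, jitter: f64) -> u64 {
    const INITIAL_MS: u64 = 1_000;
    const MAX_MS: u64 = 30_000;
    const JITTER_FRAC: f64 = 0.25;
    // Exponential growth, saturating at the cap: 1s, 2s, 4s, 8s, 16s, 30s, 30s, ...
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let base = INITIAL_MS.saturating_mul(factor).min(MAX_MS);
    // +/- 25% jitter so simultaneous clients do not reconnect in lockstep.
    let jittered = base as f64 * (1.0 + JITTER_FRAC * jitter.clamp(-1.0, 1.0));
    jittered as u64
}
```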

During RECONNECTING:
  - Read I/O: queued (stalls the calling process).
  - Write I/O: queued (filesystem journal stalls until reconnect).
  - New opens of the block device: succeed (device is still registered).
  - fsync: stalls until reconnect or OFFLINE.

Server reboot recovery: client detects server reboot via PeerRegistry generation change. Client reconnects as above. Server-side volatile write cache (if enabled) is lost — client must assume unflushed writes are lost and rely on the filesystem's journal replay for consistency. This is the same guarantee as a local power loss: FUA writes survived, cached writes may not have.

In-flight I/O deduplication: the server maintains a sliding window of recently completed request IDs per client (size: 2 × queue_depth). On reconnect, if a retransmitted request_id matches a recently completed request, the server returns the cached completion without re-executing. This prevents duplicate writes after reconnect.
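A minimal sketch of that per-client dedup window (the struct and method names are illustrative; the cached payload is reduced to a status code here):

```rust
use std::collections::VecDeque;

/// Sliding window of recently completed request IDs (size = 2 x queue_depth).
/// On reconnect, a retransmitted request_id that matches a cached entry
/// returns the cached completion instead of re-executing the I/O.
struct DedupWindow {
    cap: usize,
    entries: VecDeque<(u64 /* request_id */, i32 /* cached status */)>,
}

impl DedupWindow {
    fn new(queue_depth: usize) -> Self {
        Self { cap: 2 * queue_depth, entries: VecDeque::new() }
    }

    /// Record a completed request; evict the oldest entry when full.
    fn record(&mut self, request_id: u64, status: i32) {
        if self.entries.len() == self.cap {
            self.entries.pop_front();
        }
        self.entries.push_back((request_id, status));
    }

    /// On retransmit: return the cached status if this id already completed
    /// recently; None means the request must be (re-)executed.
    fn lookup(&self, request_id: u64) -> Option<i32> {
        self.entries.iter().rev().find(|(id, _)| *id == request_id).map(|(_, s)| *s)
    }
}
```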

15.13.3.5 Reservations for Shared Access

When multiple peers need coordinated access to the same exported block device (e.g., for clustered filesystems — Section 15.14), they use block reservations managed through the DLM (Section 15.15).

/// Reservation type. Mirrors SCSI Persistent Reservation types
/// for compatibility with clustered filesystem expectations (GFS2, OCFS2).
#[repr(u8)]
pub enum BlockReservationType {
    /// Write Exclusive — one peer can write, all can read.
    WriteExclusive = 1,
    /// Exclusive Access — one peer can read and write.
    ExclusiveAccess = 2,
    /// Write Exclusive, Registrants Only — registered peers can
    /// write, all can read.
    WriteExclusiveRegistrantsOnly = 3,
    /// Exclusive Access, Registrants Only — only registered peers
    /// can read or write.
    ExclusiveAccessRegistrantsOnly = 4,
}

/// Reservation state for one export.
pub struct BlockReservationState {
    /// Current reservation holder (None if unreserved).
    holder: Option<PeerId>,
    /// Reservation type.
    res_type: BlockReservationType,
    /// Registered peers (may access device under RegistrantsOnly types).
    registrants: ArrayVec<PeerId, 16>,
    /// DLM lock resource for this reservation.
    dlm_resource: DlmLockResource,
    /// Generation counter — incremented on every reservation change.
    /// Used for fencing (stale reservations are rejected).
    generation: u64,
}

Reservation flow (peer B reserves export on peer A):

  1. Peer B sends Reserve { type: WriteExclusive } to peer A.
  2. Peer A acquires a DLM lock on the reservation resource in exclusive mode. If another peer holds a conflicting reservation, the request blocks or fails with EBUSY.
  3. On DLM grant, peer A records peer B as the holder and responds success.
  4. Subsequent I/O from non-holders is rejected per the reservation type (e.g., writes from non-holders fail with EACCES under WriteExclusive).
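The per-type access check applied in step 4 can be sketched as follows (a hypothetical helper; `holder` and `registrants` mirror the `BlockReservationState` fields, with peers reduced to `u64` IDs):

```rust
#[derive(Clone, Copy)]
enum BlockReservationType {
    WriteExclusive,
    ExclusiveAccess,
    WriteExclusiveRegistrantsOnly,
    ExclusiveAccessRegistrantsOnly,
}

/// May `peer` write under the current reservation?
fn write_allowed(res: BlockReservationType, holder: u64, registrants: &[u64], peer: u64) -> bool {
    match res {
        // Only the holder may write under the exclusive types.
        BlockReservationType::WriteExclusive | BlockReservationType::ExclusiveAccess => peer == holder,
        // Registrants-only types: any registered peer (including the holder) may write.
        BlockReservationType::WriteExclusiveRegistrantsOnly
        | BlockReservationType::ExclusiveAccessRegistrantsOnly => {
            peer == holder || registrants.contains(&peer)
        }
    }
}

/// May `peer` read under the current reservation?
fn read_allowed(res: BlockReservationType, holder: u64, registrants: &[u64], peer: u64) -> bool {
    match res {
        // "All can read" types.
        BlockReservationType::WriteExclusive
        | BlockReservationType::WriteExclusiveRegistrantsOnly => true,
        BlockReservationType::ExclusiveAccess => peer == holder,
        BlockReservationType::ExclusiveAccessRegistrantsOnly => {
            peer == holder || registrants.contains(&peer)
        }
    }
}
```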

Preemption: a peer with higher priority (or admin action) can preempt an existing reservation via Preempt { request_id }. The DLM handles the lock transfer; the preempted peer receives an asynchronous notification and must cease I/O.

Fencing on peer failure: when a reservation-holding peer is declared Dead (heartbeat timeout), the DLM releases its locks. The export server clears the reservation and notifies remaining registrants. Clustered filesystems detect the reservation change and trigger journal replay for the failed peer.

SCSI PR compatibility: the reservation types map directly to SCSI Persistent Reservation types. Clustered filesystems (GFS2, OCFS2) that expect SCSI PR semantics work without modification — the block export translates reservation operations to DLM locks internally.

15.13.3.6 Multi-Path I/O

When a client has multiple network paths to the same export server (e.g., two transport devices, or a direct CXL link plus an RDMA link), block resource export supports multi-path I/O for both performance and high availability.

Multi-path model:
  Client has two transport devices: NIC-A (port 1) and NIC-B (port 2).
  Server exports block device with service_id=42.

  Client creates two connections to the same export:
    Connection 1: NIC-A → Server NIC-X  (queues 0-7)
    Connection 2: NIC-B → Server NIC-Y  (queues 8-15)

  I/O policy (configurable per-export):
    round-robin:  distribute I/O across all healthy paths.
    active-standby: use path 1; failover to path 2 on failure.
    min-latency:  use the path with lowest measured RTT
                  (from topology graph edge weights, Section 5.2.9.5).

  Failover:
    1. Path failure detected (transport error or heartbeat miss on that link).
    2. All queues on the failed path are drained (in-flight I/O retried
       on surviving paths).
    3. When the path recovers, queues are re-created and I/O is
       rebalanced across all healthy paths.

Path identification: the client identifies paths by the pair (local_nic, remote_nic). Multiple paths to the same service_id are recognized as the same device. The client block device presents a single /dev/umkaN device regardless of path count.

Relationship to topology graph: the topology graph (Section 5.2) models all links between peers. Multi-path I/O uses the same link information but operates at the block layer: the topology graph provides cost/latency for path selection; the block multi-path layer handles I/O distribution and failover. This is analogous to the separation between routing (L3) and link aggregation (L2) in networking.

15.13.3.7 Scope and Relationship to NVMe-oF/iSCSI

Block service provider is designed for uniform UmkaOS clusters. It provides the same functionality as NVMe-oF and iSCSI but integrated into the cluster infrastructure:

| Feature | Block Export | NVMe-oF/RDMA | iSCSI |
|---|---|---|---|
| Wire protocol | Native peer protocol | NVMe capsules | SCSI CDB over TCP/RDMA |
| Multi-queue | Per-CPU queue pairs | Per-CPU SQ/CQ | Per-session queues |
| Write ordering | In-queue ordering + FUA + Flush | NVMe ordering + FUA | SCSI ordering + FUA |
| Reservations | DLM-backed (SCSI PR compatible) | NVMe reservations | SCSI Persistent Reservations |
| Multi-path | Built-in (topology-aware) | ANA + dm-multipath | dm-multipath |
| Reconnection | Exponential backoff + dedup | NVMe-oF reconnect | iSCSI session recovery |
| Compare-and-write | Atomic CAS (Section 15.13) | NVMe Compare | SCSI COMPARE AND WRITE |
| Data integrity | T10-DIF compatible (Section 15.13) | NVMe PI | T10-DIF |
| I/O priority | 8-level priority (Section 15.13) | NVMe urgency | iSCSI task priority |
| Scatter-gather | Per-request SGL (Section 15.13) | NVMe SGL | iSCSI data segments |
| Max I/O negotiation | Connection setup (Section 15.13) | NVMe MDTS | iSCSI login params |
| Discovery | PeerRegistry (automatic) | Discovery controller | iSNS / SendTargets |
| Authentication | Peer capabilities | DH-HMAC-CHAP | CHAP |
| Configuration | Zero (auto-discovered) | nvmet-cli / configfs | tgtd / LIO configfs |
| Daemons required | None | nvmet kernel target | tgtd or LIO |

For non-UmkaOS initiators (Linux, Windows, ESXi), NVMe-oF (Section 15.13) and iSCSI targets remain available as compatibility protocols.

For device-native storage providers (firmware shim implementing the umka-protocol): the block service provider is unnecessary. The device is directly addressable as a peer — remote hosts submit I/O via the peer protocol without any host-proxy layer.

15.13.3.8 Atomic Compare-and-Write

Atomic compare-and-write (CAS at block level) is essential for building high-performance clustered filesystems. It enables lock-free metadata updates: instead of acquiring a DLM lock, reading a metadata block, modifying it, and releasing the lock, the filesystem can read the block, prepare the update locally, and submit a single CompareAndWrite that atomically succeeds or fails.

CompareAndWrite flow:
1. Client reads metadata block at offset X (normal Read).
2. Client prepares updated metadata locally.
3. Client submits CompareAndWrite:
     offset = X
     len = block_size (typically 4096)
     compare_region_offset → buffer containing the ORIGINAL data read in step 1
     data_region_offset → buffer containing the UPDATED data from step 2
4. Server atomically:
   a. Reads current data at offset X.
   b. Compares with compare buffer (byte-for-byte).
   c. If equal: writes data buffer to offset X. Returns success.
   d. If not equal: does NOT write. Returns ECANCELED.
      The current (conflicting) data is written into compare_region_offset
      so the client can retry with the updated baseline.
5. Client on ECANCELED: re-read current data from compare_region_offset,
   re-compute the update, retry from step 3.
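The client-side retry loop (steps 1-5) can be sketched as below, with the server's compare-and-write modeled as an in-memory block; `cas_update` and its attempt counter are hypothetical helpers for illustration:

```rust
/// Optimistic-update loop: compare against the last-read baseline, and on
/// ECANCELED adopt the returned current data as the new baseline and retry.
/// `update` computes the new block contents from the current contents.
/// Returns the number of attempts taken.
fn cas_update(block: &mut Vec<u8>, mut expected: Vec<u8>, update: impl Fn(&[u8]) -> Vec<u8>) -> u32 {
    let mut attempts = 0;
    loop {
        attempts += 1;
        // Step 2-3: prepare the update locally, submit CompareAndWrite.
        let new = update(&expected);
        if *block == expected {
            // Step 4c: compare matched -> write the new data, return success.
            *block = new;
            return attempts;
        }
        // Step 4d-5: compare failed (ECANCELED); the current data comes back
        // in compare_region_offset, so retry from the updated baseline.
        expected = block.clone();
    }
}
```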

Atomicity guarantee:
  The server executes CompareAndWrite under a per-range spinlock keyed by
  (device, offset / max_compare_write_bytes). The lock covers read +
  compare + conditional write as a single critical section. This spinlock
  is separate from the device's I/O submission path — it only serializes
  overlapping CAS operations. Non-CAS reads and writes proceed without the
  lock (they are naturally ordered by the submission queue). The per-range
  granularity ensures that CAS operations on non-overlapping ranges execute
  in parallel with no contention.

Maximum CAS size: limited by max_compare_write_bytes negotiated at connection setup (Section 15.13). Minimum: 512 bytes (one sector). Typical: 4096 bytes (one filesystem block). Maximum: 1 MB (for large metadata structures). Larger CAS increases the chance of conflicts; UPFS metadata blocks are typically 4-64 KB.

FUA support: CompareAndWrite respects the FUA flag. With FUA set, the written data reaches persistent storage before the completion is returned. Essential for UPFS metadata integrity.

Interaction with reservations: CompareAndWrite is subject to reservation checks — a peer without the correct reservation type cannot perform CAS on a reserved device.

15.13.3.9 I/O Priority and QoS

When multiple clients share an exported block device (common in UPFS deployments), the server must arbitrate I/O fairly. The priority field in BlockServiceRequest enables server-side QoS enforcement.

Priority model:
  8 priority levels (BlockServicePriority), mapped to the server's
  local I/O scheduler:

  Level 0 (Idle):             Background scrub, RAID rebuild.
  Level 1-3 (BestEffort):     Normal application I/O.
  Level 4-6 (RealTime):       Latency-sensitive workloads, UPFS metadata.
  Level 7 (RealTimeCritical): Fencing, reservation operations.

  Server-side enforcement:
  - Each priority level gets a token bucket (configurable rate + burst).
  - Higher-priority I/O is dispatched first.
  - Within the same priority: FIFO per queue.
  - Starvation prevention: even Idle I/O gets a minimum share
    (default: 5% of device bandwidth).

  Client-side mapping:
  - Process I/O priority (ioprio_set) maps to BlockServicePriority.
  - UPFS metadata operations use RealTimeHigh (level 6).
  - UPFS journal commits use RealTime (level 5) + FUA.
  - Regular data I/O uses BestEffort (level 2).
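The per-level token bucket in the enforcement model can be sketched as follows (rates and units are illustrative configuration, not mandated values):

```rust
/// Per-priority token bucket: `rate` refills the bucket over time,
/// `burst` caps how much can accumulate.
struct TokenBucket {
    tokens: u64,      // current tokens, in bytes
    burst: u64,       // bucket capacity, in bytes
    rate_per_ms: u64, // refill rate, bytes per millisecond
}

impl TokenBucket {
    fn new(rate_per_ms: u64, burst: u64) -> Self {
        Self { tokens: burst, burst, rate_per_ms }
    }

    /// Refill for `elapsed_ms` of wall time, capped at the burst size.
    fn refill(&mut self, elapsed_ms: u64) {
        self.tokens = (self.tokens + self.rate_per_ms * elapsed_ms).min(self.burst);
    }

    /// Dispatch an I/O of `bytes` if the bucket has capacity.
    fn try_consume(&mut self, bytes: u64) -> bool {
        if self.tokens >= bytes {
            self.tokens -= bytes;
            true
        } else {
            false
        }
    }
}
```

The server would hold one bucket per priority level and dispatch from the highest non-empty level whose bucket admits the request, with the Idle bucket's minimum rate providing the 5% starvation floor.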

Linux ioprio mapping: Linux encodes ioprio as (class << 13) | data. UmkaOS converts BlockServicePriority to ioprio encoding at the syscall compatibility boundary (Section 19.1).

| BlockServicePriority | Linux ioprio class | ioprio data |
|---|---|---|
| Idle = 0 | IOPRIO_CLASS_IDLE | 0 |
| BestEffortLow = 1 | IOPRIO_CLASS_BE | 7 |
| BestEffort = 2 | IOPRIO_CLASS_BE | 5 |
| BestEffortHigh = 3 | IOPRIO_CLASS_BE | 4 (Linux default ioprio) |
| RealTimeLow = 4 | IOPRIO_CLASS_BE | 2 |
| RealTime = 5 | IOPRIO_CLASS_BE | 0 |
| RealTimeHigh = 6 | IOPRIO_CLASS_RT | 4 |
| RealTimeCritical = 7 | IOPRIO_CLASS_RT | 0 |

The mapping preserves relative ordering within each Linux class. UmkaOS's eight levels provide finer granularity than Linux's three classes (IDLE, BE, RT) while remaining fully compatible at the syscall boundary: ioprio_get() returns the mapped Linux value, and ioprio_set() maps the Linux value back to the closest BlockServicePriority level.
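The mapping table can be expressed directly as code (a sketch assuming the standard Linux class values RT=1, BE=2, IDLE=3 and the `(class << 13) | data` encoding from the text):

```rust
const IOPRIO_CLASS_RT: u16 = 1;
const IOPRIO_CLASS_BE: u16 = 2;
const IOPRIO_CLASS_IDLE: u16 = 3;

/// Map a BlockServicePriority level (0-7) to the Linux ioprio encoding.
fn to_linux_ioprio(priority: u8) -> u16 {
    let (class, data) = match priority {
        0 => (IOPRIO_CLASS_IDLE, 0), // Idle
        1 => (IOPRIO_CLASS_BE, 7),   // BestEffortLow
        2 => (IOPRIO_CLASS_BE, 5),   // BestEffort
        3 => (IOPRIO_CLASS_BE, 4),   // BestEffortHigh (Linux default ioprio)
        4 => (IOPRIO_CLASS_BE, 2),   // RealTimeLow
        5 => (IOPRIO_CLASS_BE, 0),   // RealTime
        6 => (IOPRIO_CLASS_RT, 4),   // RealTimeHigh
        _ => (IOPRIO_CLASS_RT, 0),   // RealTimeCritical
    };
    (class << 13) | data
}
```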

Per-client bandwidth limits: the server can enforce per-client bandwidth and IOPS limits via export configuration (sysfs). This prevents one client from monopolizing the device. Limits are enforced by the token bucket independently of I/O priority.

15.13.3.10 Scatter-Gather I/O

Large I/O requests (1 MB+ stripe writes in UPFS) often span multiple non-contiguous memory regions on the client. Without scatter-gather support, the client must copy data into a contiguous buffer — defeating zero-copy.

Scatter-gather model:
  Request with sgl_count = 0:
    Single contiguous buffer. data_region_offset points to the data.
    (This is the common case for small I/O.)

  Request with sgl_count = N (1-15):
    data_region_offset points to an array of N SglEntry structs.
    Each SglEntry describes one data region segment: {region_offset, len}.
    Total I/O length = sum of all segment lengths = request.len.

    Server processes the SGL by issuing N remote read/write operations
    (one per segment), coalesced into a single local block I/O.

  Example: 1 MB striped write from UPFS
    SGL: [
      { region_offset=0x1000, len=256KB },   // journal header
      { region_offset=0x5000, len=512KB },   // data block 1
      { region_offset=0xA000, len=256KB },   // data block 2
    ]
    sgl_count = 3, len = 1MB
    Server issues 3 remote reads, assembles into 1MB contiguous write
    to the local block device.

Maximum SGL entries: 15 per request (sgl_count is u8, capped at 15 to keep the SGL within one transport send inline threshold). For I/O requiring more segments, the client splits into multiple requests.

Maximum total I/O size per request: max_io_bytes negotiated at connection setup (Section 15.13). The sum of all SGL segment lengths must not exceed this.
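The SGL validity rules above (1-15 entries, header plus inline SGL within the inline threshold, segment lengths summing to the request length) can be checked with a short sketch (`validate_sgl` is a hypothetical helper):

```rust
/// Wire-format SGL entry: {region_offset, len} per segment.
struct SglEntry {
    region_offset: u64,
    len: u32,
}

fn validate_sgl(sgl: &[SglEntry], request_len: u64, inline_threshold: usize) -> bool {
    const HEADER_BYTES: usize = 64;    // BlockServiceRequest size
    const SGL_ENTRY_BYTES: usize = 12; // wire size: Le64 + Le32
    // sgl_count must be 1-15 for an inline SGL.
    if sgl.is_empty() || sgl.len() > 15 {
        return false;
    }
    // Header + inline SGL must fit within one transport send inline threshold.
    if HEADER_BYTES + sgl.len() * SGL_ENTRY_BYTES > inline_threshold {
        return false;
    }
    // Total I/O length = sum of all segment lengths = request.len.
    sgl.iter().map(|e| e.len as u64).sum::<u64>() == request_len
}
```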

15.13.3.11 Data Integrity (T10-DIF Compatible)

For production clustered filesystems, silent data corruption must be detected. Block service provider supports end-to-end data integrity compatible with T10-DIF (Data Integrity Field), the industry standard used by both SCSI and NVMe.

/// Data integrity metadata. Appended after each protected block
/// (typically 512 or 4096 bytes) when DATA_INTEGRITY flag is set.
/// Size: 8 bytes per protected block (T10-DIF Type 1 layout).
#[repr(C)]
pub struct DataIntegrityField {
    /// CRC-16 of the data block (T10-DIF guard tag).
    /// Computed by the client before remote write; verified by the server
    /// before writing to disk; re-verified on read before returning.
    pub guard: Le16,
    /// Application tag. UPFS uses this for inode number or metadata type.
    /// Enables detection of misdirected writes (data written to wrong block).
    pub app_tag: Le16,
    /// Reference tag. Contains the expected LBA (lower 32 bits).
    /// Detects misdirected writes where data is written to the correct
    /// device but wrong offset.
    pub ref_tag: Le32,
}
// Wire/on-disk format (T10-DIF): guard(Le16=2) + app_tag(Le16=2) + ref_tag(Le32=4) = 8 bytes.
// All fields little-endian per CLAUDE.md rule 12 (wire struct crossing node boundary).
const_assert!(core::mem::size_of::<DataIntegrityField>() == 8);

Protection path (end-to-end):

Client:
  1. Compute CRC-16 guard for each data block.
  2. Set app_tag (filesystem-assigned), ref_tag (LBA).
  3. Append DIF metadata after each data block in the data region buffer.
  4. Set DATA_INTEGRITY flag in request.

Network:
  5. The transport provides its own integrity at the transport level (RDMA
     iCRC, TCP checksums, CXL link CRC). Double protection: CRC-16 for data
     + transport integrity.

Server:
  6. Verify guard, app_tag, ref_tag before writing to local device.
  7. If local device supports T10-DIF (PI): pass DIF through to device.
     Device verifies again on write (triple protection).
  8. If local device does NOT support PI: server verifies DIF, strips it,
     writes raw data. DIF is re-computed on read.

Server (on read):
  9. Read data from device (with DIF if supported, without if not).
  10. Compute/verify DIF.
  11. Remote write data + DIF to client buffer (via peer transport).

Client:
  12. Verify guard, app_tag, ref_tag on received data.
  13. Strip DIF, return data to filesystem.
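Client steps 1-2 can be sketched as below, using the standard CRC-16/T10-DIF polynomial (0x8BB7, MSB-first, zero initial value); plain `u16`/`u32` stand in for the `Le16`/`Le32` wire types, and `build_dif` is a hypothetical helper:

```rust
/// Guard-tag computation: bitwise CRC-16/T10-DIF over one protected block.
fn crc16_t10dif(data: &[u8]) -> u16 {
    let mut crc: u16 = 0;
    for &byte in data {
        crc ^= (byte as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 { (crc << 1) ^ 0x8BB7 } else { crc << 1 };
        }
    }
    crc
}

/// Fill the three DIF fields for one block: guard = CRC of the data,
/// app_tag = filesystem-assigned, ref_tag = lower 32 bits of the LBA.
fn build_dif(block: &[u8], app_tag: u16, lba: u64) -> (u16, u16, u32) {
    (crc16_t10dif(block), app_tag, lba as u32)
}
```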

Integrity error handling: if any DIF check fails (guard mismatch, ref_tag mismatch, app_tag mismatch), the server returns an error with integrity_status set in the completion. The client retries from a different path (if multi-path) or returns EIO to the filesystem. UPFS logs the integrity violation via FMA (Section 20.1).

Negotiation: data integrity support is negotiated at connection setup (Section 15.13). Both client and server must support it. If the server's underlying device supports T10-DIF (PI), end-to-end protection covers the entire path including the physical media. If not, the server provides software DIF (covers network + server memory, not physical media).

15.13.3.12 Connection Setup and Capability Negotiation

When a client connects to a block export, the server and client negotiate capabilities and limits. This replaces the complex login phase of iSCSI and the NVMe-oF Connect command with a single exchange.

/// Server advertises these capabilities in the connection response.
/// The client uses the minimum of its own and the server's capabilities.
///
/// This struct crosses node boundaries in `ConnectResponse`. Per the DSM
/// wire format policy ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)):
/// all multi-byte integer fields use `Le*` types. Bool fields use `u8`
/// (0/1) to avoid Rust UB from non-0/1 bytes received from remote peers.
#[repr(C)]
pub struct BlockServiceDeviceInfo {
    /// Export identifier.
    pub service_id: ServiceInstanceId,
    /// Block device name (human-readable, for diagnostics).
    pub name: [u8; 64],
    /// Total device capacity in bytes.
    pub capacity_bytes: Le64,
    /// Logical block size (typically 512 or 4096).
    pub block_size: Le32,
    /// Physical block size (alignment hint for optimal I/O).
    pub physical_block_size: Le32,
    /// Maximum I/O size in bytes per request. Client must not submit
    /// requests with len > max_io_bytes. Typical: 1 MB - 4 MB.
    /// Determined by: min(server_rdma_max_msg, device_max_transfer,
    ///                     server_configured_limit).
    pub max_io_bytes: Le32,
    /// Maximum compare-and-write size in bytes. 0 = CAS not supported.
    /// Typical: 4096 (one FS block) to 1 MB (large metadata).
    pub max_compare_write_bytes: Le32,
    /// Maximum I/O queues the server will grant to this client.
    pub max_queues: Le16,
    /// Maximum queue depth (outstanding requests per queue).
    pub max_queue_depth: Le16,
    /// Server supports data integrity (T10-DIF).
    pub supports_integrity: u8, // 0 = false, 1 = true
    /// Server supports scatter-gather.
    pub supports_sgl: u8, // 0 = false, 1 = true
    /// Maximum SGL entries per request (0 if !supports_sgl).
    pub max_sgl_entries: u8,
    /// Device supports discard (TRIM/UNMAP).
    pub supports_discard: u8, // 0 = false, 1 = true
    /// Device supports write zeroes.
    pub supports_write_zeroes: u8, // 0 = false, 1 = true
    /// Device is read-only.
    pub read_only: u8, // 0 = false, 1 = true
    /// Volatile write cache present (Flush is meaningful).
    pub has_volatile_cache: u8, // 0 = false, 1 = true
    /// Explicit padding (1 byte) for 4-byte alignment of next field.
    pub _pad: u8,
    /// Optimal I/O alignment in bytes (for best performance).
    /// Client should align offsets and lengths to this boundary.
    pub optimal_io_alignment: Le32,
}
// Le types are byte-array-backed (alignment 1). ServiceInstanceId is 8 bytes (Le64).
// Layout: 8 (service_id) + 64 (name) + 8 + 4 + 4 + 4 + 4 + 2 + 2 + 7×1 + 1 (pad) + 4 = 112.
const_assert!(size_of::<BlockServiceDeviceInfo>() == 112);

Connection handshake:

1. Client sends ConnectRequest:
   { service_id, requested_queues, requested_queue_depth,
     want_integrity, want_sgl, client_max_io_bytes }

2. Server responds with ConnectResponse:
   { status, device_info: BlockServiceDeviceInfo }
   The device_info contains negotiated values (min of client request
   and server capability).

3. Client creates queue pairs based on negotiated queue count.
4. Client registers data regions with the peer transport, sized
   based on max_io_bytes and max_queue_depth.
5. I/O can begin.

Capability gating: remote block access requires CAP_BLOCK_REMOTE (Section 22.5). Checked once at connection setup, not per-I/O.

Data region authentication: The ConnectRequest carries a CapabilityToken (signed by the capability subsystem, Section 22.5) proving the requester holds CAP_BLOCK_ACCESS for the target device. The responder validates the token signature and scope before establishing data regions in the ConnectResponse. This prevents unauthorized nodes from obtaining remote memory access to device data buffers. The token is validated once at connection setup; subsequent I/O operations on the established connection are authorized by the transport binding established at bind time (revocable via region re-registration if the capability is later revoked).

Discovery: hosts exporting block devices advertise BLOCK_STORAGE in their PeerRegistry capabilities (Section 5.2). Remote peers discover available exports by querying PeerRegistry::peers_with_cap(BLOCK_STORAGE), then sending GetInfo to the exporting peer to retrieve the list of available exports with their BlockServiceDeviceInfo.

Why two-phase discovery (PeerCapFlags + GetInfo RPC): block service uses a two-phase discovery model because block device properties (capacity, block size, cache status) change at runtime (online resize, cache mode switch). Inline properties (32 bytes, set at advertisement time in PeerCapFlags) cannot reflect runtime changes. The GetInfo RPC fetches current device state, ensuring the client sees accurate geometry before connecting. Other capability service providers (e.g., serial, USB) use inline properties because their advertised characteristics are static for the lifetime of the advertisement.

15.13.3.13 Client-Side Block Device (BlockServiceClient)

The server-side BlockServiceProvider and wire protocol are defined above. This section specifies the client-side kernel module that turns a ServiceBindAck into a usable local block device. The client registers as a standard BlockDeviceOps implementation (Section 15.2), so filesystems, dm-*, LVM, and every other block consumer work without modification.

Tier assignment: Tier 1 (Evolvable). The client is an umka-block module running in a hardware-isolated domain (MPK/POE/DACR on supported architectures, Tier 0 fallback on RISC-V/s390x/LoongArch64). A client crash triggers driver reload (~50-150ms); in-flight I/O is re-submitted from the block layer's retry queue. No kernel panic.

Phase: Phase 3 (requires RDMA stack, peer protocol, block layer, and block service provider to be functional).

15.13.3.13.1 BlockServiceClient Struct
// umka-block/src/service_client.rs

/// Memory region registered with the peer transport at ServiceBind time.
/// Backing depends on transport: RDMA MR, CXL shared-memory window,
/// TCP bounce buffer. Service providers access it via region_offset
/// values in wire structs.
pub struct ServiceDataRegion {
    /// Local virtual address of the region base.
    base: *mut u8,
    /// Size of the region in bytes.
    size: usize,
    /// Opaque transport handle (RDMA lkey, CXL window ID, etc.).
    transport_handle: u64,
}

/// Client-side module that creates a local block device backed by a remote
/// BlockServiceProvider. Holds connection state, per-CPU transport queues,
/// and adaptive polling state. One instance per remote block device.
///
/// Implements `BlockDeviceOps` — the block layer routes bios here exactly
/// as it would for a local NVMe device.
pub struct BlockServiceClient {
    /// Remote peer hosting the BlockServiceProvider.
    peer_id: PeerId,
    /// ServiceInstanceId of the remote export (from ServiceBindAck).
    service_id: ServiceInstanceId,
    /// Negotiated device info (capacity, block size, limits).
    /// Immutable after connection setup; replaced atomically on reconnect
    /// if the remote device geometry changed (e.g., online resize).
    device_info: RcuCell<BlockServiceDeviceInfo>,
    /// Per-queue state. Array length = negotiated queue count.
    /// Index = queue_id (0..nr_queues-1). Each CPU maps to one queue
    /// via the `cpu_to_queue` table below.
    queues: ArrayVec<ClientQueue, 64>,
    /// CPU-to-queue mapping. Populated at connection setup based on
    /// negotiated queue count: `cpu_to_queue[cpu] = cpu % nr_queues`.
    /// Length = nr_possible_cpus (runtime-discovered). Allocated once
    /// from slab at connection time (warm path).
    cpu_to_queue: Box<[u16]>,
    /// Connection state machine.
    state: AtomicU8, // ClientState as u8
    /// Pre-registered data regions for bulk transfer. One region per queue,
    /// sized to hold `queue_depth × max_io_bytes` of concurrent I/O.
    /// Registered once at connection setup with the peer transport; avoids
    /// per-I/O registration overhead (saves ~1-3μs per I/O).
    data_regions: ArrayVec<ServiceDataRegion, 64>,
    /// Multipath state (None if single-path).
    multipath: Option<MultipathState>,
    /// Block device handle for deregistration on disconnect.
    bdev_handle: Option<BlockDeviceHandle>,
    /// Reconnection backoff state.
    reconnect: ReconnectState,
    /// Per-device I/O timeout in milliseconds. Default: 30_000.
    /// Range: 1_000..=600_000 (1 second to 10 minutes).
    /// Values outside this range are clamped on write.
    /// Configurable via sysfs at `/sys/block/umkaXpYbZ/queue/io_timeout`.
    io_timeout_ms: AtomicU32,
}

/// Per-queue client state. Each queue has its own peer queue pair,
/// request/completion rings, and polling thread. Queues are fully independent
/// — no locks between them on the I/O submission or completion paths.
pub struct ClientQueue {
    /// Peer transport queue pair connected to the server's corresponding
    /// BlockServiceQueue. Reliable connected mode.
    qp: PeerQueuePair,
    /// Request IDs for in-flight tracking. Pre-allocated bitmap +
    /// request metadata array. Size = queue_depth.
    inflight: InflightTracker,
    /// Adaptive poll/interrupt mode for completions (see below).
    poll_mode: AtomicU8, // PollMode as u8
    /// Completion thread handle (one per queue).
    completion_thread: Option<TaskHandle>,
    /// Queue index (matches server-side queue index).
    queue_id: u16,
    /// Data region for this queue's bulk transfers.
    data_region: ServiceDataRegion,
}

/// Tracks in-flight requests per queue. Fixed-size, no heap allocation
/// on the I/O path.
pub struct InflightTracker {
    /// Bitmap of in-use request IDs (1 = in-flight).
    /// Size: queue_depth bits, rounded up to u64 words.
    bitmap: ArrayVec<AtomicU64, 16>, // supports up to 1024 queue depth (16*64=1024)
    /// Per-slot metadata for in-flight requests.
    slots: Box<[InflightSlot]>,
    /// Queue depth (number of slots).
    depth: u16,
}

/// Bitmap allocation algorithm (lock-free, O(1) amortized):
///
/// Allocate:
///   1. `hint = per-CPU last_allocated_hint` (avoids contention across CPUs
///      scanning the same word).
///   2. `word = bitmap[hint / 64]`
///   3. If word has any zero bits:
///        `bit = ctz(!word.load(Acquire))`  // count trailing zeros of inverted
///        attempt CAS: `word.compare_exchange(old, old | (1 << bit), AcqRel, Acquire)`
///        if CAS succeeds: update hint, return `hint_base + bit`
///        else: retry same word (contention — another CPU claimed this bit)
///   4. If word is all 1s (full): advance hint to next word, wrap at `depth / 64`.
///      After scanning all words without finding a free bit:
///      return `Err(QueueFull)` → `BLK_STS_RESOURCE` (block layer re-queues the bio).
///
/// Free:
///   `bitmap[slot / 64].fetch_and(!(1 << (slot % 64)), Release)`
///
/// This is the standard lock-free bitmap allocator used in high-performance
/// I/O stacks (NVMe blk-mq tag allocator, SPDK). The per-CPU hint eliminates
/// false sharing: each CPU scans from where it last succeeded, so concurrent
/// CPUs naturally spread across different bitmap words.
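The allocator described above can be sketched as a minimal, runnable version. This uses a single caller-supplied hint rather than per-CPU hints, and std atomics in place of kernel primitives:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of the InflightTracker bitmap: CAS-claim the lowest free bit
/// starting from a hint word, free with a Release fetch_and. The real
/// tracker uses per-CPU hints and kernel atomics.
struct Bitmap {
    words: Vec<AtomicU64>,
    depth: usize, // number of valid slots
}

impl Bitmap {
    fn new(depth: usize) -> Self {
        let nwords = (depth + 63) / 64;
        Bitmap {
            words: (0..nwords).map(|_| AtomicU64::new(0)).collect(),
            depth,
        }
    }

    /// Returns a slot id, or None if all slots are in flight
    /// (the real client maps None to BLK_STS_RESOURCE).
    fn alloc(&self, hint: usize) -> Option<usize> {
        let nwords = self.words.len();
        for i in 0..nwords {
            let widx = (hint / 64 + i) % nwords;
            let word = &self.words[widx];
            loop {
                let old = word.load(Ordering::Acquire);
                let free = !old; // zero bits (free slots) become one
                if free == 0 {
                    break; // word full, advance to next word
                }
                let bit = free.trailing_zeros() as usize; // ctz of inverted word
                let slot = widx * 64 + bit;
                if slot >= self.depth {
                    break; // tail bits of a partial last word are not real slots
                }
                if word
                    .compare_exchange(old, old | (1u64 << bit),
                                      Ordering::AcqRel, Ordering::Acquire)
                    .is_ok()
                {
                    return Some(slot);
                }
                // CAS lost: another CPU claimed a bit in this word; retry it.
            }
        }
        None
    }

    fn free(&self, slot: usize) {
        self.words[slot / 64].fetch_and(!(1u64 << (slot % 64)), Ordering::Release);
    }
}
```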

/// Metadata for one in-flight I/O request.
/// kernel-internal, not KABI — pointer-width-dependent (contains *mut Bio).
#[repr(C, align(64))]
pub struct InflightSlot {
    /// Original bio pointer (for completion callback).
    ///
    /// SAFETY: The bio pointer is valid for the entire duration the slot is
    /// marked as in-use (bitmap bit set). The block layer guarantees that a
    /// bio is not freed until its completion callback has been invoked, and
    /// BlockServiceClient only invokes the callback when clearing the bitmap
    /// bit (in `process_completion`). Therefore, the pointer is always valid
    /// when accessed through a set bitmap bit. The Release ordering on
    /// bitmap clear ensures the bio completion is visible before the slot
    /// is reused.
    bio: *mut Bio,
    /// Submission timestamp (nanoseconds, monotonic). For timeout detection.
    submit_ns: u64,
    /// Request ID assigned to this slot (= slot index, unique per queue).
    request_id: u64,
    /// Which multipath path this request was submitted on (0 if single-path).
    path_index: u8,
    /// Retry count for this request (0 = first attempt).
    retries: u8,
    _pad: [u8; 6],
}

/// Connection state machine.
#[repr(u8)]
pub enum ClientState {
    /// Initial state. No connection to server.
    Disconnected = 0,
    /// ServiceBind sent, awaiting ServiceBindAck.
    Connecting = 1,
    /// Queue pairs being created and connected.
    QueueSetup = 2,
    /// Fully connected. I/O flows normally.
    Active = 3,
    /// Connection lost. I/O is queued, reconnection in progress.
    Reconnecting = 4,
    /// Max reconnect attempts exceeded. I/O fails with EIO.
    Offline = 5,
    /// Graceful disconnect in progress. Draining in-flight I/O.
    Draining = 6,
}

/// Adaptive polling mode for completion processing.
#[repr(u8)]
pub enum PollMode {
    /// Busy-poll completions. Used when I/O rate exceeds the poll
    /// threshold (default: 10K IOPS per queue). Lowest latency,
    /// highest CPU usage. SPDK-inspired.
    Poll = 0,
    /// Interrupt-driven. Completion queue event triggers wakeup.
    /// Used when queue is idle or below the poll threshold.
    /// Saves CPU at the cost of ~2-5μs interrupt latency.
    Interrupt = 1,
    /// Hybrid: poll for `poll_spin_us` microseconds after each
    /// completion batch, then fall back to interrupt if no new
    /// completions arrive. Default mode.
    Hybrid = 2,
}

15.13.3.13.2 Device Registration

When BlockServiceClient connects successfully, it registers a block device with the umka-block layer. The device appears as a standard block device accessible to filesystems, dm-*, LVM, and all block consumers.

Device naming: /dev/umka/peer{N}_blk{M} where N is the PeerId (u64, rendered as hex) and M is the service instance index on that peer. These are NOT /dev/sd* (reserved for local SCSI) or /dev/nvme* (reserved for local NVMe). The umka/ subdirectory groups all cluster block devices. Symlinks in /dev/disk/by-id/ use the format umka-{service_id} for stable identification across reconnects.

sysfs integration: the device appears in /sys/block/umkaXpYbZ/ with standard block device attributes (size, queue/, stat). Additional cluster-specific attributes:

| sysfs path | Content |
|---|---|
| device/peer_id | Remote PeerId (hex) |
| device/service_id | ServiceInstanceId (hex) |
| device/state | Current ClientState name |
| device/transport | rdma or tcp |
| queue/io_timeout | Per-request timeout in ms (r/w) |
| queue/nr_queues | Number of transport queues |
| queue/queue_depth | Depth per queue |
| queue/poll_mode | poll, interrupt, or hybrid (r/w) |
| device/multipath/policy | Multipath policy name (r/w, if multipath) |
| device/multipath/paths | Per-path state table |

BlockDeviceOps implementation:

impl BlockDeviceOps for BlockServiceClient {
    /// Convert bio → BlockServiceRequest and submit to the transport queue
    /// for the current CPU. No intermediate request queue, no I/O
    /// scheduler between client and network — the server runs its own
    /// scheduler ([Section 15.13](#block-storage-networking--io-priority-and-qos)).
    ///
    /// This is the hot path. Zero heap allocation. The bio's memory pages
    /// are already registered in `data_region` (pre-registered) so no
    /// per-I/O transport registration is needed.
    fn submit_bio(&self, bio: &mut Bio) -> Result<()> {
        // 1. Check state. If not Active, queue or fail.
        //    Reconnecting → queue bio in backlog (bounded, queue_depth × 4).
        //    Offline → return -EIO immediately.
        //    Other → return -ENODEV.

        // 2. Select queue: cpu_to_queue[smp_processor_id()].
        //    If multipath: path_select() first, then queue on chosen path.

        // 3. Allocate inflight slot (bitmap scan, O(1) amortized).
        //    If no slots → return BLK_STS_RESOURCE (block layer retries).

        // 4. Build BlockServiceRequest from bio:
        //    - request_id = slot index (unique per queue).
        //    - opcode = bio.op → BlockServiceOpcode mapping:
        //        BioOp::Read → Read, BioOp::Write → Write,
        //        BioOp::Flush → Flush, BioOp::Discard → Discard,
        //        BioOp::WriteZeroes → WriteZeroes.
        //    - offset = bio.start_lba × device_info.block_size.
        //    - len = sum of bio segment lengths.
        //    - flags: FUA if bio.flags.contains(BioFlags::FUA),
        //             BARRIER if bio.flags.contains(BioFlags::PREFLUSH).
        //    - priority: bio.cgroup_id → ioprio → BlockServicePriority
        //      (table in "I/O Priority and QoS" above).
        //    - data_region_offset: points into pre-registered data_region.
        //      For writes: bio pages are COPIED into a slot in data_region
        //      (the bio's own page frames are not pre-registered; only
        //      data_region is registered with the transport). For reads:
        //      the server will remote-write into the same data_region slot.
        //    - sgl_count: if bio has >1 segment and server supports SGL,
        //      build inline SGL entries. Otherwise, bounce into contiguous
        //      buffer within data_region.

        // 5. Record bio pointer and timestamp in inflight slot.

        // 6. Send BlockServiceRequest via the queue's ring pair.

        // 7. Return Ok(()). Completion is asynchronous.
    }

    fn flush(&self) -> Result<()> {
        // Submit Flush opcode synchronously (wait on completion).
        // Uses bio_submit_and_wait() which sets bio_sync_end_io callback.
    }

    fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()> {
        // Submit Discard opcode. Returns ENOSYS if
        // !device_info.supports_discard.
    }

    fn get_info(&self) -> BlockDeviceInfo {
        // Read from RcuCell<BlockServiceDeviceInfo>, convert to
        // BlockDeviceInfo. Fields map directly:
        //   logical_block_size = device_info.block_size
        //   physical_block_size = device_info.physical_block_size
        //   capacity_sectors = device_info.capacity_bytes / block_size
        //   max_segments = device_info.max_sgl_entries.max(1)
        //   max_bio_size = device_info.max_io_bytes
        //   supports_discard = device_info.supports_discard
        //   supports_flush = device_info.has_volatile_cache
        //   supports_fua = true (always supported by wire protocol)
        //   optimal_io_size = device_info.optimal_io_alignment
        //   numa_node = local transport device's NUMA node
    }

    fn shutdown(&self) -> Result<()> {
        // Graceful disconnect: drain → ServiceUnbind → deregister.
    }
}
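The offset/length/flag derivation in step 4 of submit_bio can be sketched as follows. `MiniBio` and the flag bit positions are illustrative assumptions introduced for the example, not the wire encoding:

```rust
/// Sketch of step 4 of submit_bio: deriving the wire offset, length,
/// and flag bits from a bio. MiniBio and the flag bit values are
/// assumptions for illustration.
const FLAG_FUA: u32 = 1 << 0;
const FLAG_BARRIER: u32 = 1 << 1;

struct MiniBio {
    start_lba: u64,
    segment_lens: Vec<u32>,
    fua: bool,
    preflush: bool,
}

/// Returns (offset_bytes, len_bytes, flags).
fn build_request(bio: &MiniBio, block_size: u32) -> (u64, u32, u32) {
    // offset = bio.start_lba × device_info.block_size
    let offset = bio.start_lba * block_size as u64;
    // len = sum of bio segment lengths
    let len: u32 = bio.segment_lens.iter().sum();
    let mut flags = 0u32;
    if bio.fua {
        flags |= FLAG_FUA;
    }
    if bio.preflush {
        flags |= FLAG_BARRIER;
    }
    (offset, len, flags)
}
```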

No intermediate I/O scheduler: unlike iSCSI and NVMe-oF kernel initiators which funnel through the full Linux block layer multi-queue infrastructure (blk-mq → I/O scheduler → hw dispatch queue), BlockServiceClient submits directly from submit_bio() to the transport queue. The server has its own I/O scheduler (Section 15.13) — a client-side scheduler would add latency without improving ordering. This saves ~1-3μs per I/O compared to the Linux blk-mq path.

15.13.3.13.3 Connection Lifecycle
Phase 1 — Discovery:
  1. Client queries PeerRegistry::peers_with_cap(BLOCK_STORAGE)
     ([Section 5.2](05-distributed.md#cluster-topology-model--peer-registry)).
  2. For each peer with BLOCK_STORAGE: send GetInfo (BlockServiceOpcode::GetInfo)
     to retrieve available exports and their BlockServiceDeviceInfo.
  3. Client filters exports by policy (admin config, cgroup affinity, NUMA
     proximity to local transport device).

Phase 2 — Connect:
  4. Client validates it holds CAP_BLOCK_REMOTE
     ([Section 9.2](09-security.md#permission-and-acl-model)). Checked once here, not per-I/O.
  5. Client sends ServiceBind ([Section 5.1](05-distributed.md#distributed-kernel-architecture--message-payload-structs))
     for the selected export:
       service_id = target export's ServiceInstanceId
       ring_pair_index = 0 (first; more will follow)
       requested_queue_depth = min(local_preference, 128)
       requested_entry_size = 128 (minimum for BlockServiceRequest + alignment)
  6. Server validates CapabilityToken, responds with ServiceBindAck:
       granted_queue_depth, granted_entry_size, transport_params (transport-specific:
       RDMA QP number + rkey + remote_addr; CXL doorbell offset; TCP port).
  7. Client stores negotiated parameters in device_info.

Phase 3 — Queue Setup:
  8. Client creates N peer queue pairs (N = negotiated queue count).
     Each queue pair: reliable connected mode, linked to server's
     corresponding queue.
  9. Client registers data regions with the peer transport (one per queue):
       region_size = queue_depth × max_io_bytes
     These are pre-registered once; subsequent I/O uses region offsets.
  10. Client populates cpu_to_queue mapping:
       for cpu in 0..nr_cpus: cpu_to_queue[cpu] = cpu % nr_queues
  11. Client spawns one completion thread per queue (see "I/O Completion" below).
  12. State transitions: Connecting → QueueSetup → Active.

Phase 4 — Steady State:
  13. Bios arrive via submit_bio(). Converted to BlockServiceRequest, posted
      to the per-CPU queue's ring pair. Completions arrive on the same queue.
  14. Adaptive polling adjusts per-queue poll mode based on I/O rate.

Phase 5 — Disconnect:
  15. Trigger: admin request, or device removal on server, or peer departure.
  16. State → Draining. Block layer is notified: no new bios accepted
      (QUEUE_FLAG_QUIESCING set on the block device).
  17. Wait for all in-flight requests to complete or timeout (io_timeout_ms).
  18. Destroy transport queue pairs and deregister data regions.
  19. Send ServiceUnbind ([Section 5.1](05-distributed.md#distributed-kernel-architecture)).
  20. Deregister block device from umka-block. /dev/umka/peer{N}_blk{M}
      disappears. Any open file descriptors see -ENODEV on subsequent I/O.
  21. State → Disconnected.

Phase 6 — Reconnection (on transient failure):
  22. Trigger: transport error, completion timeout, peer heartbeat Suspect/Dead.
  23. State → Reconnecting. New bios are queued in a bounded backlog
      (capacity: nr_queues × queue_depth × 4). Overflow → BLK_STS_RESOURCE.
  24. Reconnect loop with exponential backoff + bounded jitter:
        base_delay_ms = 100, max_delay_ms = 30_000, jitter = ±25%.
      Same algorithm as NVMe-oF reconnect
      ([Section 15.13](#block-storage-networking--nvme-of-reconnect-policy)).
  25. On successful reconnect:
      a. Re-create QPs, re-register MRs.
      b. Re-submit in-flight requests (server deduplicates by request_id —
         see "In-flight I/O deduplication" in server spec above).
      c. Drain backlog.
      d. State → Active.
  26. After 20 consecutive failures (~10 minutes): State → Offline.
      I/O fails with EIO. Manual recovery via:
        echo reconnect > /sys/block/umkaXpYbZ/device/state

Reconnect state:

/// Reconnection backoff state. Tracks consecutive failures and
/// computes the next retry delay.
pub struct ReconnectState {
    /// Consecutive failed reconnect attempts.
    attempt: u32,
    /// Timestamp of last reconnect attempt (monotonic ns).
    last_attempt_ns: u64,
    /// Maximum attempts before transitioning to Offline.
    max_attempts: u32, // default: 20
    /// Base delay in milliseconds. Default: 100.
    base_delay_ms: u32,
    /// Maximum delay in milliseconds. Default: 30_000.
    max_delay_ms: u32,
}

impl ReconnectState {
    /// Compute next delay with exponential backoff + ±25% jitter.
    /// Returns delay in milliseconds.
    pub fn next_delay(&mut self) -> u32 {
        let delay = core::cmp::min(
            self.base_delay_ms.saturating_mul(1u32 << self.attempt.min(20)),
            self.max_delay_ms,
        );
        let jitter_range = delay / 4; // ±25%
        let jitter = prng_uniform_u32(jitter_range * 2) as i32 - jitter_range as i32;
        let result = (delay as i32 + jitter).max(1) as u32;
        self.attempt = self.attempt.saturating_add(1);
        result
    }

    /// Reset on successful reconnect.
    pub fn reset(&mut self) {
        self.attempt = 0;
    }
}
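With the jitter term removed, `next_delay` reduces to a capped exponential schedule: 100, 200, 400, ... ms, capped at 30 seconds. A deterministic sketch (constants match the ReconnectState defaults above):

```rust
/// next_delay without the ±25% jitter term, exposing the capped
/// exponential schedule. Deterministic sketch for illustration.
fn backoff_delay_ms(attempt: u32) -> u32 {
    const BASE_DELAY_MS: u32 = 100;
    const MAX_DELAY_MS: u32 = 30_000;
    BASE_DELAY_MS
        .saturating_mul(1u32 << attempt.min(20))
        .min(MAX_DELAY_MS)
}
```

With these defaults the delay reaches the 30-second cap at the 10th attempt, so 20 attempts span roughly 10 minutes, matching the Offline transition described above.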

15.13.3.13.4 Multipath

Multiple paths to the same remote export (via different transport devices or different network fabrics) are managed directly inside BlockServiceClient. No dm-multipath dependency — multipath is built-in.

/// Multipath state for a BlockServiceClient with multiple paths to
/// the same remote export.
pub struct MultipathState {
    /// All known paths to this export. XArray keyed by path_id (u64).
    /// Path IDs are assigned sequentially and never reused (u64 counter).
    paths: XArray<PathInfo>,
    /// Active path selection policy.
    policy: AtomicU8, // MultipathPolicy as u8
    /// Round-robin counter (used by RoundRobin policy).
    rr_counter: AtomicU64,
    /// Number of currently active paths.
    active_count: AtomicU16,
}

/// Per-path connection state.
pub struct PathInfo {
    /// Unique path identifier within this BlockServiceClient instance.
    /// Monotonically increasing per-instance (starts at 0 on device creation,
    /// never reset). Scoped to one BlockServiceClient — not shared across
    /// devices. At 1 billion path events/sec, wraps after 584 years.
    path_id: u64,
    /// The underlying client connection for this path. Each path has
    /// its own set of transport queue pairs and data regions.
    queues: ArrayVec<ClientQueue, 64>,
    /// CPU-to-queue mapping for this path.
    cpu_to_queue: Box<[u16]>,
    /// Local transport device used for this path (RDMA NIC, CXL port, etc.).
    local_transport: TransportDeviceRef,
    /// Remote transport endpoint for this path.
    remote_endpoint: PeerEndpoint,
    /// Current path state.
    state: AtomicU8, // PathState as u8
    /// NUMA node of the local transport device. Used by NUMA-aware path selection
    /// to prefer paths whose device is on the same NUMA node as the submitting CPU.
    numa_node: u16,
    /// Exponentially weighted moving average of completion latency (ns).
    /// Updated on each completion. Used by LeastLatency policy.
    avg_latency_ns: AtomicU64,
    /// Number of in-flight requests on this path. Used by LeastQueueDepth policy.
    inflight_count: AtomicU32,
}

/// Path health state.
#[repr(u8)]
pub enum PathState {
    /// Path is healthy and accepting I/O.
    Active = 0,
    /// Path is configured but not preferred (admin-designated standby).
    Standby = 1,
    /// Path has failed (transport error or timeout). Reconnection in progress.
    /// I/O is redirected to other Active paths.
    Failed = 2,
    /// Path is being removed (admin disconnect or NIC removal).
    Removing = 3,
}

/// Multipath I/O distribution policy.
#[repr(u8)]
pub enum MultipathPolicy {
    /// Distribute I/O across active paths in round-robin order.
    /// Simple, fair, good default for symmetric paths.
    RoundRobin = 0,
    /// Select the path with fewest in-flight requests.
    /// Best for asymmetric paths (different bandwidths).
    LeastQueueDepth = 1,
    /// Select the path whose local transport device is on the same NUMA node
    /// as the submitting CPU. Falls back to RoundRobin for CPUs
    /// without a same-node path.
    NumaAware = 2,
    /// Select the path with lowest measured completion latency (EWMA).
    /// Best for paths with different link speeds or hop counts.
    LeastLatency = 3,
}

Path selection (hot path — no locks, no allocation):

path_select(bio) → PathInfo:
  match policy:
    RoundRobin:
      idx = rr_counter.fetch_add(1, Relaxed) % active_count
      return nth_active_path(idx)
    LeastQueueDepth:
      scan active paths, return one with lowest inflight_count.load(Relaxed)
      (tie-break: lowest path_id)
    NumaAware:
      cpu_numa = cpu_to_numa_node(smp_processor_id())
      scan active paths for one with matching numa_node
      if found: return it
      else: fall back to RoundRobin
    LeastLatency:
      scan active paths, return one with lowest avg_latency_ns.load(Relaxed)
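The RoundRobin and LeastQueueDepth arms can be sketched in Rust. `Path` carries only the fields the selector reads, and the slice is assumed to already contain only Active paths (the spec's `nth_active_path` is modeled as direct indexing):

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Sketch of two path_select arms. Lock-free: only relaxed atomic
/// loads and a fetch_add on the shared round-robin counter.
struct Path {
    path_id: u64,
    inflight: AtomicU32,
}

fn select_round_robin<'a>(paths: &'a [Path], rr_counter: &AtomicU64) -> &'a Path {
    // idx = rr_counter.fetch_add(1, Relaxed) % active_count
    let idx = rr_counter.fetch_add(1, Ordering::Relaxed) as usize % paths.len();
    &paths[idx]
}

fn select_least_queue_depth(paths: &[Path]) -> &Path {
    // Lowest inflight_count wins; ties break on lowest path_id.
    paths
        .iter()
        .min_by_key(|p| (p.inflight.load(Ordering::Relaxed), p.path_id))
        .expect("at least one active path")
}
```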

Failover: automatic, completion-timeout-based. No dedicated heartbeat — the peer heartbeat (Section 5.8) already detects peer-level failures. Path-level failures are detected by I/O timeout:

  1. Request completes with transport error, or no completion within io_timeout_ms → mark the request's path_index path as Failed.
  2. Re-submit the failed request on the next Active path (up to 3 retries total across different paths).
  3. If all paths are Failed → enter Reconnecting state on all paths.
  4. Path recovery: when transport reconnect succeeds, transition Failed → Active. I/O rebalances automatically.

Path discovery: new paths are discovered when:

- A new transport device comes online (hotplug) that has connectivity to the server.
- The topology graph (Section 5.2) reports a new link to the server peer.
- Admin explicitly adds a path via sysfs: echo <transport_dev>:<remote_endpoint> > /sys/block/umkaXpYbZ/device/multipath/add_path

15.13.3.13.5 I/O Completion and Error Handling

Each queue has a dedicated completion thread. The thread uses adaptive polling to balance latency against CPU usage.

Adaptive poll/interrupt mode:

Completion thread main loop (per queue):

  poll_spin_us = 10  // configurable, default 10μs
  poll_threshold_iops = 10_000  // per-queue IOPS to enter poll mode

  loop:
    match poll_mode:
      Poll:
        // Busy-poll the completion queue. Lowest latency (~0.5-1μs).
        // Burns one CPU core per queue — only used at high IOPS.
        batch = poll_cq(qp.cq, max_batch=32)
        if batch.is_empty():
          spin_loop_hint()  // pause instruction
          continue

      Interrupt:
        // Arm the CQ for completion notification, then sleep.
        arm_cq(qp.cq)
        wait_for_cq_completion(qp.cq)  // blocks until completion interrupt
        batch = poll_cq(qp.cq, max_batch=32)

      Hybrid:
        // Poll for poll_spin_us, then fall back to interrupt.
        deadline = now_ns() + poll_spin_us * 1000
        loop:
          batch = poll_cq(qp.cq, max_batch=32)
          if !batch.is_empty(): break
          if now_ns() >= deadline:
            // No completions in poll window → switch to interrupt.
            arm_cq(qp.cq)
            wait_for_cq_completion(qp.cq)
            batch = poll_cq(qp.cq, max_batch=32)
            break
          spin_loop_hint()

    // Process batch of completions.
    for wc in batch:
      process_completion(wc)

    // Adaptive mode switching (evaluated every 1000 completions):
    //   recent_iops > poll_threshold_iops → switch to Poll
    //   recent_iops < poll_threshold_iops / 2 → switch to Hybrid
    //   no completions for 100ms → switch to Interrupt
    update_poll_mode()

Completion batching: completion queue poll returns up to 32 completions at once (max_batch=32). Each completion is a BlockServiceCompletion received via the ring pair. Processing multiple completions per poll avoids per-completion overhead (doorbell, cache line bouncing).

Per-completion processing (process_completion):

process_completion(wc: TransportCompletion):
  1. Extract request_id from the BlockServiceCompletion in the recv buffer.
  2. Look up inflight slot: inflight.slots[request_id].
  3. Validate: slot must be in-use (bitmap bit set). If not → log + discard
     (stale completion from pre-reconnect).
  4. Clear bitmap bit (atomic, release ordering).
  5. Map completion status → bio status:
       0 → BIO_OK (0)
       -ECANCELED → -ECANCELED (CompareAndWrite mismatch)
       -EIO → -EIO
       -ENOSPC → -ENOSPC
       -ENOMEM → -ENOMEM (server out of memory)
       other negative → -EIO (unexpected server error)
  6. If integrity_status != 0 and DATA_INTEGRITY was requested:
       Log integrity violation via FMA ([Section 20.1](20-observability.md#fault-management-architecture)).
       Set bio status to -EIO.
  7. Set bio.status to mapped value.
  8. Invoke bio completion:
       bio_complete(bio, status)
     This calls the bio's `end_io` callback, which wakes the waiting
     filesystem/application or triggers the next stage in an async I/O pipeline.
  9. Update path statistics:
       path.avg_latency_ns EWMA update.
       path.inflight_count.fetch_sub(1, Release).
  10. Repost receive entry for the next completion (pre-posted recv pool).
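The EWMA update in step 9 can be sketched as follows. The 1/8 sample weight is an assumption (the spec only requires an EWMA), and relaxed ordering suffices because `avg_latency_ns` is an advisory statistic read by the path selector:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of the per-path EWMA latency update from step 9.
/// new = 7/8 × old + 1/8 × sample, in integer arithmetic.
fn update_avg_latency(avg_latency_ns: &AtomicU64, sample_ns: u64) {
    let old = avg_latency_ns.load(Ordering::Relaxed);
    let new = if old == 0 {
        sample_ns // first completion seeds the average
    } else {
        old - old / 8 + sample_ns / 8
    };
    avg_latency_ns.store(new, Ordering::Relaxed);
}
```

A racing load/store pair can lose a concurrent update, which is acceptable for an advisory average; a CAS loop would add cost on the completion hot path for no functional benefit.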

Timeout detection: a per-queue timer fires every io_timeout_ms / 4 (default: 7.5 seconds). It scans the inflight bitmap for requests whose submit_ns exceeds the timeout:

timeout_scan(queue):
  now = monotonic_ns()
  deadline = now - io_timeout_ms * 1_000_000
  for slot in inflight.slots where bitmap bit is set:
    if slot.submit_ns < deadline:
      // Request timed out.
      if multipath and other paths Active:
        // Retry on different path (up to 3 total retries).
        if slot.retries < 3:
          slot.retries += 1
          resubmit_on_different_path(slot)
          continue
      // No retry possible → fail the bio.
      mark_path_failed(slot.path_index)
      complete_bio_with_error(slot, -EIO)

Error classification and retry policy:

| Completion status | Retry? | Action |
|---|---|---|
| 0 (success) | No | Complete bio successfully |
| -ECANCELED (CAS mismatch) | No | Pass to caller (not a transport error) |
| -EIO | Yes (different path) | Transient storage error — retry on another path |
| -ENOMEM | Yes (same path, after backoff) | Server memory pressure — back off 10ms, retry |
| -ENOSPC | No | Propagate to filesystem |
| Transport error | Yes (different path) | Path failure — failover |
| Timeout (no completion) | Yes (different path) | Path failure — failover |
| Transport fatal error | No (path dead) | Mark path Failed, trigger reconnect |

Maximum retries: 3 per I/O request across all paths. After 3 retries with no success, the bio completes with -EIO. The filesystem or application handles the error (journal replay, read retry, user notification).
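The classification and retry-budget rules above can be expressed as a decision function. This is a sketch: `RetryAction` and the errno literals are illustrative (the real client uses the kernel's errno constants and drives the path state machine as a side effect):

```rust
/// Sketch of the error-classification table as a decision function.
#[derive(Debug, PartialEq)]
enum RetryAction {
    Complete,                  // deliver result to the bio's end_io
    Propagate,                 // caller-visible error, no retry (ENOSPC)
    RetryOtherPath,            // transient error or path failure: failover
    RetrySamePathAfterBackoff, // server memory pressure: back off, retry
    FailBio,                   // retry budget exhausted: complete with -EIO
}

const MAX_RETRIES: u8 = 3;

fn classify(status: i32, retries: u8) -> RetryAction {
    match status {
        0 => RetryAction::Complete,
        -125 => RetryAction::Complete, // ECANCELED: CAS mismatch, caller handles
        -28 => RetryAction::Propagate, // ENOSPC: filesystem's problem
        s => {
            if retries >= MAX_RETRIES {
                return RetryAction::FailBio;
            }
            if s == -12 {
                RetryAction::RetrySamePathAfterBackoff // ENOMEM
            } else {
                RetryAction::RetryOtherPath // EIO, transport error, timeout
            }
        }
    }
}
```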

15.13.3.13.6 Pre-Registered Memory Regions

Per-I/O transport memory registration costs ~1-3μs (kernel page pin, device doorbell, PTE update). At 1M IOPS, that is 1-3 seconds of CPU time per second — unacceptable. BlockServiceClient pre-registers data regions at connection setup.

Pre-registration model:
  Per queue:
    region_size = queue_depth × max_io_bytes
    Typical example: 32 slots × 128KB = 4MB per queue.
    For 8 queues: 32MB total pre-registered.
    High-performance example: 128 slots × 1MB = 128MB per queue.
    For 16 queues: 2GB total pre-registered (requires dedicated RDMA NIC memory).

  The region is a contiguous virtual allocation (vmalloc) backed by
  physical pages. It is registered as a single ServiceDataRegion with
  the peer transport, yielding a local handle and a remote-accessible
  token (communicated to the server at connection setup via
  ServiceBindTransportParams).

  I/O submission:
    For writes: bio pages are copied into a slot in the pre-registered
    region (memcpy cost ~0.5μs for 4KB, amortized by zero registration
    overhead). For large I/O with SGL: each SGL entry points into the
    pre-registered region (no copy if bio pages happen to be within
    the region — but this is not guaranteed, so copy is the common path).

    For reads: the server remote-writes directly into the slot. On
    completion, the client copies from the slot to the bio's target pages.

  Trade-off: memory copy (~0.5μs/4KB) vs per-I/O registration (~1-3μs/op).
  For 4KB I/O the copy wins outright. For 1MB I/O the copy cost (~30μs)
  exceeds the registration cost — but registration latency has much higher
  variance (device doorbell contention), so copying into the pre-registered
  region remains the better default for mixed workloads.

  Alternative (future optimization): on-demand region cache. Keep a pool
  of pre-registered page-aligned regions. On submit, check if bio pages
  are already registered. If yes, use directly (zero-copy). If no, fall
  back to copy into pre-registered region. This is Phase 4 work.

Memory budget: the pre-registered region size is configurable via sysfs (/sys/block/umkaXpYbZ/queue/mr_size_mb). Default is computed from negotiated parameters. On memory-constrained systems, reducing queue_depth or max_io_bytes proportionally reduces MR size.
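The sizing rule above (region = queue_depth × max_io_bytes, times the queue count) can be sketched as a small helper. The struct and function names here are illustrative, not the actual UmkaOS API:

```rust
/// Illustrative sketch of pre-registered region sizing (names are assumptions).
pub struct QueueParams {
    pub queue_depth: u64,  // slots per queue (in-flight I/Os)
    pub max_io_bytes: u64, // largest single I/O a slot must hold
    pub num_queues: u64,
}

/// One slot per in-flight request, each large enough for the biggest I/O;
/// total pre-registered memory = per-queue region × number of queues.
pub fn preregistered_bytes(p: &QueueParams) -> u64 {
    p.queue_depth * p.max_io_bytes * p.num_queues
}
```

With the document's typical parameters (32 slots × 128KB × 8 queues) this yields 32MB; the high-performance example (128 slots × 1MB × 16 queues) yields 2GB.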

15.13.3.13.7 Performance Comparison

Why this is better than existing remote block protocols:

Overhead source iSCSI (Linux) NVMe-oF/RDMA (Linux) BlockServiceClient
Protocol translation SCSI CDB encode/decode NVMe capsule build None — native wire format
I/O scheduler (client) blk-mq + mq-deadline blk-mq + none None — direct submit
Request conversion bio → scsi_cmnd → iSCSI PDU bio → nvme_request → capsule bio → BlockServiceRequest (1 step)
Region registration Per-I/O (no pre-reg in Linux iSCSI) Per-I/O or FMR pool Pre-registered (zero per-I/O cost)
Completion model Interrupt-only Interrupt + poll (since 5.x) Adaptive poll/hybrid/interrupt
Multipath dm-multipath (separate layer) NVMe ANA (in-driver) Built-in (no layer crossing)
Connection setup iSCSI login (multi-round) NVMe Connect (2 rounds) ServiceBind (1 round)

Expected latency (4KB random read, RDMA/RoCE, single hop):
- Network RTT: ~3-5μs (RDMA RC one-sided)
- Client overhead (submit + completion): ~1-2μs
  - submit_bio → build request + post to ring pair: ~0.5μs
  - Poll CQ + process completion + bio callback: ~0.5-1μs
- Server overhead: ~2-4μs (receive + local NVMe submit + local completion + remote write back)
- Total: ~6-11μs end-to-end (vs ~15-25μs for NVMe-oF/RDMA in Linux, ~50-100μs for iSCSI/TCP)


15.14 Clustered Filesystems

Shared-disk filesystems where multiple nodes access the same block device simultaneously, coordinated by a distributed lock manager (DLM).

Linux problem — GFS2 and OCFS2 require a complex multi-daemon stack:
- Corosync: cluster membership and messaging
- Pacemaker: resource manager and fencing coordinator
- DLM: distributed lock manager (kernel module + userspace daemon)
- Fencing agent: STONITH (Shoot The Other Node In The Head) — kills unresponsive nodes to prevent split-brain corruption

These components are developed by different teams, have different configuration languages, and interact in subtle ways. Diagnosing failures requires understanding all four components and their interactions. A single daemon crash can fence the entire node.

UmkaOS design — The cluster infrastructure from Section 5.1 provides the foundation. UmkaOS integrates these components into a coherent architecture:

DLM over RDMA — The DLM (Section 15.15) uses Section 5.4's RDMA transport for lock operations. Lock grant/release round-trip is ~3-5μs over RDMA (vs ~30-50μs over TCP in Linux's DLM). This directly impacts filesystem performance — every metadata operation (create, rename, delete, stat) requires at least one DLM lock. At 3-5μs per lock, clustered filesystem metadata operations approach local filesystem performance. See Section 15.15 for the full DLM design, including RDMA-native lock protocols, lease-based extension, batch operations, and recovery.

Fencing — When a node becomes unresponsive, the cluster must fence it (prevent it from accessing shared storage) before allowing other nodes to recover its locks:
- IPMI/BMC fencing: power-cycle the node via out-of-band management
- SCSI-3 Persistent Reservations: revoke the node's reservation on the shared storage device — the storage controller itself blocks I/O from the fenced node
- Same mechanisms as Linux, but integrated into Section 5.8's cluster membership protocol rather than requiring a separate Pacemaker/STONITH stack

Quorum — Inherits from Section 5.8's split-brain handling. A partition with fewer than quorum nodes self-fences (stops accessing shared storage) to prevent data corruption.

GFS2 compatibility — Read the GFS2 on-disk format, implemented as an umka-vfs module:
- Resource groups, dinodes, journaled metadata
- GFS2 lock types mapped to DLM lock modes (Section 15.15)
- Journal recovery for failed nodes
- Existing GFS2 volumes can be mounted by UmkaOS without reformatting

OCFS2 compatibility — Similar approach: read OCFS2 on-disk format, implement as an umka-vfs module. Lower priority than GFS2.

Recovery advantage — This is where UmkaOS's architecture fundamentally changes clustered filesystem behavior:
- Linux: if a node's storage driver crashes, the DLM loses heartbeat from that node. Fencing kicks in — the node is killed (power-cycled or SCSI-3 PR revoked). After reboot (~60s), the node must rejoin the cluster, replay its journal, and re-acquire locks. Other nodes are blocked on any locks held by the crashed node until fencing and recovery complete.
- UmkaOS: if a node's storage driver crashes, the driver recovers in ~50-150ms (Tier 1 reload). The cluster heartbeat continues throughout (heartbeat runs in umka-core, not the storage driver), so the node is never declared dead. The node stays in the cluster. Its locks remain valid. No fencing, no journal replay, no lock recovery. Other nodes never notice.

This transforms clustered filesystem reliability from "minutes of disruption per failure" to "50ms blip per failure." See Section 15.15 for detailed recovery comparison.


15.15 Distributed Lock Manager

The Distributed Lock Manager (DLM) is a first-class kernel subsystem in umka-core that provides cluster-wide lock coordination for shared-disk filesystems (Section 15.14), distributed applications, and any kernel subsystem requiring cross-node synchronization. It implements the VMS/DLM lock model — the same model used by Linux's DLM, GFS2, OCFS2, and VMS clustering.

The DLM lives in umka-core (not a separate daemon or Tier 1 driver). This is a deliberate architectural choice: lock state survives Tier 1 driver restarts, cluster heartbeat continues during storage driver reloads (keeping the node alive in the cluster), and there are zero kernel/userspace boundary crossings for lock operations.

15.15.1 Design Overview and Linux Problem Statement

Linux's DLM implementation suffers from seven systemic problems that limit clustered filesystem performance. Each problem stems from architectural decisions made when the Linux DLM was designed for 1 Gbps Ethernet and 4-node clusters in the early 2000s. UmkaOS's DLM addresses each problem by design:

# Linux Problem Impact UmkaOS Fix
1 Global recovery quiesce — DLM stops ALL lock activity cluster-wide during any node failure recovery Seconds of cluster-wide stall; all nodes blocked, not just those sharing resources with the dead node Per-resource recovery: only resources mastered on the dead node are affected; all other lock operations continue uninterrupted (Section 15.15)
2 TCP lock transport (~30-50 μs per lock operation) Orders of magnitude slower than hardware allows; metadata-heavy workloads bottleneck on lock latency RDMA-native: Atomic CAS for uncontested locks (~3-5 μs including confirmation, zero remote CPU on CAS path), RDMA Send for contested locks (~5-8 μs) (Section 15.15)
3 No lock batching — each lock request is a separate network round-trip rename() requires 3 locks = 3 round-trips = ~90-150 μs on Linux DLM Batch API: up to 64 locks grouped by master in a single RDMA Write (~5-10 μs total) (Section 15.15)
4 BAST (Blocking AST) callback storms — O(N) invalidation messages for N holders of a contended resource, including uncontended downgrades Metadata-heavy workloads on large clusters see network saturation from invalidation traffic Lease-based extension: holders extend cheaply via RDMA Write; minimal traffic for uncontended resources — only periodic one-sided RDMA lease renewals that bypass the remote CPU (zero CPU-consuming traffic, vs. Linux BASTs on every downgrade that require CPU processing); contended worst case is still O(K) for K active holders but K ≤ N because expired leases are reclaimed without messaging (Section 15.15)
5 Separate daemon architecture — corosync + pacemaker + dlm_controld with kernel/userspace boundary crossings Every membership change requires multiple kernel↔userspace transitions; diagnosis requires understanding 4 separate components Integrated in-kernel: membership events from Section 5.8 delivered directly to DLM; single heartbeat source; no userspace daemons (Section 15.15)
6 Lock holder must flush ALL dirty pages on lock downgrade Dropping an EX lock on a 100 GB file flushes all dirty pages, even if only 4 KB was written Targeted writeback: DLM tracks dirty page ranges per lock; only modified pages within the lock's range are flushed (Section 15.15)
7 No speculative multi-resource lock acquire GFS2 rgrp allocation: each attempt to lock a resource group is a full round-trip; 8 attempts = 8 × 30-50 μs lock_any_of(N) primitive: single message tries N resources, first available is granted (Section 15.15)

15.15.2 Lock Modes and Compatibility Matrix

The DLM implements the six standard VMS/DLM lock modes. GFS2 uses all six modes — this is not a simplification, it is the minimum required for correct clustered filesystem operation.

/// DLM lock modes, ordered by exclusivity (lowest to highest).
/// Compatible with Linux DLM, GFS2, and OCFS2 expectations.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum LockMode {
    /// Null Lock — placeholder, compatible with everything.
    /// Used to hold a position in the lock queue without blocking others.
    NL = 0,

    /// Concurrent Read — read access, compatible with all except EX.
    /// Used by GFS2 for inode lookup (reading inode from disk).
    CR = 1,

    /// Concurrent Write — write access, compatible with NL, CR, CW.
    /// Used by GFS2 for writing to a file while others read metadata.
    CW = 2,

    /// Protected Read — read-only, blocks writers.
    /// Used by GFS2 for operations requiring consistent metadata snapshot.
    PR = 3,

    /// Protected Write — write, compatible with NL and CR only.
    /// Used by GFS2 for metadata modification (create, rename, unlink).
    PW = 4,

    /// Exclusive — sole access, incompatible with everything except NL.
    /// Used by GFS2 for operations requiring exclusive inode access.
    EX = 5,
}

Compatibility matrix ("yes" means the two modes can be held concurrently by different nodes):

NL CR CW PR PW EX
NL yes yes yes yes yes yes
CR yes yes yes yes yes no
CW yes yes yes no no no
PR yes yes no yes no no
PW yes yes no no no no
EX yes no no no no no

This matrix follows the standard VMS/DLM compatibility semantics (OpenVMS Programming Concepts Manual; Red Hat DLM Programming Guide Table 2-2; Linux kernel fs/dlm/lock.c __dlm_compat_matrix table). Key points: PW is compatible with NL and CR only (PW is the "update lock" — it allows one writer with concurrent readers); CW is compatible with NL, CR, and CW (CW allows concurrent writers); PW and CW are mutually incompatible (PW forbids other writers, including CW holders). The matrix is stored as a compile-time constant lookup table for zero-cost compatibility checks on the lock grant path.
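The compile-time lookup table might look like the following sketch. `COMPAT` and `modes_compatible` are illustrative names; the indices follow the `LockMode` discriminants (NL=0 through EX=5) defined above:

```rust
/// The compatibility matrix above as a compile-time constant table.
/// Indices follow the LockMode discriminants: NL=0, CR=1, CW=2, PR=3, PW=4, EX=5.
pub const COMPAT: [[bool; 6]; 6] = [
    //        NL     CR     CW     PR     PW     EX
    /*NL*/ [ true,  true,  true,  true,  true,  true],
    /*CR*/ [ true,  true,  true,  true,  true, false],
    /*CW*/ [ true,  true,  true, false, false, false],
    /*PR*/ [ true,  true, false,  true, false, false],
    /*PW*/ [ true,  true, false, false, false, false],
    /*EX*/ [ true, false, false, false, false, false],
];

/// Zero-cost compatibility check on the grant path: a single table lookup.
#[inline]
pub const fn modes_compatible(held: u8, requested: u8) -> bool {
    COMPAT[held as usize][requested as usize]
}
```

Because lock compatibility is symmetric, the table must equal its own transpose — a useful invariant to assert in tests.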

15.15.3 Lock Value Blocks (LVBs)

Each lock resource carries a 64-byte Lock Value Block — a small metadata payload piggybacked on lock state. LVBs are the critical optimization that makes clustered filesystem metadata operations efficient.

/// Lock Value Block — 64 bytes of metadata attached to a lock resource.
/// Updated by the last EX/PW holder on downgrade or unlock.
/// Read by PR/CR holders on lock grant.
///
/// MUST be cache-line aligned (`align(64)`). On all target RDMA hardware
/// (ConnectX-5+, EFA, RoCEv2 NICs), a cache-line-aligned 64-byte RDMA Read
/// is performed as a single PCIe transaction, providing de facto atomicity.
/// The alignment is a correctness requirement for the double-read protocol;
/// see the "LVB read consistency" section below.
#[repr(C, align(64))]
pub struct LockValueBlock {
    /// Application-defined data (e.g., inode size, mtime, block count).
    pub data: [u8; 56],

    /// Sequence counter — incremented on every LVB update.
    /// Readers use this to detect stale LVBs after recovery.
    ///
    /// Stored as u64 for alignment and RDMA atomic operation compatibility
    /// (RDMA atomics require 8-byte aligned 8-byte values).
    ///
    /// **Odd/even protocol**: Writers use FAA to increment the counter before
    /// and after writing data. An odd value indicates mid-update (reader should
    /// retry); an even value indicates stable data. The counter is initialized
    /// to 0 (even) on LVB creation.
    ///
    /// **Masking requirement**: Readers MUST mask with `LVB_SEQUENCE_MASK`
    /// (0x0000_FFFF_FFFF_FFFF) before checking parity or comparing values.
    /// The high 16 bits are used for special sentinel values (e.g., INVALID)
    /// and should not be interpreted as part of the sequence counter.
    ///
    /// The 48-bit counter wraps after ~8.9 years at 1M increments/sec
    /// (2^48 / 10^6 ≈ 281 million seconds). The LVB rotation protocol
    /// ensures the effective counter lifetime exceeds the 50-year uptime
    /// target at any sustained write rate. At 1M increments/sec, rotation
    /// triggers every ~8 years; the rotation is transparent to lock holders
    /// (50-100 us pause). See "LVB sequence counter wrap limitation" below
    /// for wrap-safety analysis and handling guidance.
    pub sequence: u64,
}
// Wire/RDMA format: data(56) + sequence(8) = 64 bytes (cache-line aligned).
const_assert!(core::mem::size_of::<LockValueBlock>() == 64);

/// Mask to extract the 48-bit sequence counter from the u64 field.
/// MUST be applied before checking odd/even parity or comparing sequence values.
pub const LVB_SEQUENCE_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;

/// Sentinel value indicating an invalid LVB (after recovery from dead holder).
/// Uses high bits outside the 48-bit sequence space to avoid collision.
/// Readers observing this value must treat the LVB as invalid and refresh
/// from disk before use.
pub const LVB_SEQUENCE_INVALID: u64 = 0xFFFF_0000_0000_0000;

Why LVBs matter: Consider the common case of reading a file's size on a clustered filesystem:

Without LVB:
  Node A holds inode EX lock → writes file → updates size on disk → releases EX
  Node B acquires inode PR lock → reads inode FROM DISK → gets current size
  Cost: 1 lock operation (~3-5 μs) + 1 disk read (~10-15 μs NVMe) = ~13-20 μs

With LVB:
  Node A holds inode EX lock → writes file → writes size to LVB → releases EX
  Node B acquires inode PR lock → reads size FROM LVB (in lock grant message)
  Cost: 1 lock operation (~4-6 μs, LVB included) + 0 disk reads = ~4-6 μs

LVBs eliminate one disk read per metadata operation in the common case. GFS2 uses LVBs to cache inode attributes (i_size, i_mtime, i_blocks, i_nlink) and resource group statistics (free blocks, free dinodes). The VFS layer reads these attributes from the LVB via Section 14.7's per-field inode validity mechanism.

Note: UmkaOS uses 64-byte LVBs (56 data + 8 sequence counter), vs Linux's 32 bytes, to accommodate extended metadata including the sequence counter and capability token. GFS2 on-disk format compatibility requires translating between 32-byte and 64-byte LVB formats at the filesystem layer: UmkaOS's GFS2 implementation packs the standard 32-byte GFS2 LVB fields into the first 32 bytes of the 56-byte data portion, using the remaining 24 bytes for UmkaOS-specific metadata. The layout:

/// UmkaOS LVB extension — the 24 bytes after the standard 32-byte LVB data.
/// Layout will be defined when DLM capability integration is implemented (Phase 3+).
#[repr(C)]
pub struct UmkaLvbExtension {
    pub _reserved: [u8; 24],
}
const_assert!(core::mem::size_of::<UmkaLvbExtension>() == 24);

When importing a GFS2 volume from Linux, the filesystem driver zero-extends Linux's 32-byte LVBs into the 64-byte format on first lock acquire.
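The zero-extension on first lock acquire can be sketched as follows. `import_gfs2_lvb` is a hypothetical helper; the layout (standard 32-byte GFS2 LVB in the first 32 bytes, zeroed UmkaLvbExtension in the remaining 24) follows the note above:

```rust
/// Illustrative sketch: zero-extend a Linux GFS2 32-byte LVB into the
/// 64-byte UmkaOS format (56 data bytes + u64 sequence counter).
pub fn import_gfs2_lvb(linux_lvb: &[u8; 32]) -> ([u8; 56], u64) {
    let mut data = [0u8; 56];
    // Standard GFS2 LVB fields occupy the first 32 bytes; the trailing
    // 24 bytes (UmkaLvbExtension) stay zeroed until their layout is defined.
    data[..32].copy_from_slice(linux_lvb);
    // Fresh sequence counter: 0 is even, i.e. "stable data" under the
    // odd/even protocol.
    (data, 0)
}
```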

LVB read consistency: RDMA does not provide atomic reads for 64-byte payloads (RDMA atomics are limited to 8 bytes). When a node reads an LVB via RDMA Read, a concurrent writer could update the LVB mid-read, producing a torn value. The protocol:

  1. Reader performs an RDMA Read of the full 64-byte LVB.
  2. Reader checks the sequence counter. If the sequence is odd, the writer is mid-update (writers set the sequence to an odd value before writing data, then increment it to even after). Retry the read.
  3. Reader performs a second RDMA Read of the full 64-byte LVB. If every byte (data + sequence) matches the first read, the data is consistent. If any byte differs, retry from step 1.

The full-payload comparison (not just the sequence field) catches the case where a writer completes two full updates between the reader's two reads: the 48-bit sequence counter (bits 47:0 of the sequence field) is monotonically increasing (it wraps only after ~8.9 years at 500K writes/sec — two FAAs per write equals 1M increments/sec — far exceeding practical deployment lifetimes; the correctness argument holds for any deployment shorter than this), so it will differ after any update. The full-payload comparison is a defense-in-depth measure that also detects torn reads where the sequence counter itself was partially updated.

LVB sequence counter wrap limitation: The 48-bit sequence counter (bits 47:0 of the sequence field, masked by LVB_SEQUENCE_MASK) wraps after 2^48 increments. At the maximum sustained write rate (500,000 writes/sec = 1,000,000 FAA operations/sec), wrap occurs in approximately 281 million seconds (~8.9 years). During the wrap transition, a reader could observe sequence=2^48-1 on the first read and sequence=0 on the second read, incorrectly concluding that no write occurred between reads (an ABA problem on the sequence field). This is an acceptable limitation because: (1) the wrap interval far exceeds typical cluster deployment lifetimes; (2) the full-payload comparison (data + sequence) still detects torn reads even during wrap, since the writer's data changes between FAA operations; (3) production deployments monitor LVB write rate and proactively rotate LVB counters approaching the wrap threshold. Clusters with write-intensive workloads exceeding ~50,000 writes/sec on critical LVBs may configure periodic LVB rotation to avoid theoretical wrap scenarios in long-running deployments.

LVB rotation protocol (for wrap avoidance in long-running clusters):

  1. The DLM master monitors each LVB's sequence counter. When (current_seq & LVB_SEQUENCE_MASK) > LVB_ROTATION_THRESHOLD (default: 0x0000_E000_0000_0000, ~87.5% of the 48-bit space), the master initiates rotation.
  2. The master acquires an exclusive (EX) lock on the resource owning the LVB, blocking all other lock operations on that resource.
  3. Under the EX lock, the master zeros the embedded sequence counter (lvb.sequence = 0) and increments rotation_epoch, preserving the existing data payload in place. No allocation or pointer swap is needed — the LVB is a by-value field in DlmResourceInner, and the rotation is simply a counter reset + epoch bump under exclusive access.
  4. The master releases the EX lock. Subsequent LVB writes start from sequence = 0 with a fresh 48-bit counter space.

Rotation failure handling: If the master cannot acquire the EX lock within the rotation timeout (default 30 seconds — e.g., because the EX holder has died and the lock is stuck in recovery), the rotation is deferred. If FAA increments sequence past the 48-bit boundary ((value >> 48) != 0 and value != LVB_SEQUENCE_INVALID), the LVB enters a degraded state: all readers fall back to two-sided read (same as the LVB_SEQUENCE_INVALID path), which remains correct but slower. The degraded state is cleared by the next successful rotation. An FMA event (LVB_ROTATION_DEFERRED) is emitted at warning severity to alert the cluster administrator.

This protocol is transport-agnostic: it operates on the master's local DlmResource.inner.lvb field under the resource's SpinLock + an EX DLM lock and does not involve any transport-specific operations. The EX lock alone provides the necessary serialization — no RDMA fences or transport-specific ordering mechanisms are involved in the rotation itself.

Post-rotation even/odd invariant: After rotation, lvb.sequence is 0 (even), indicating stable data. This is correct because the data IS stable (the EX lock holder preserved the payload). The first subsequent writer will FAA to 1 (odd = writing), update data, then FAA to 2 (even = stable) — the even/odd protocol continues correctly from 0 without any special handling.

Visibility guarantees after rotation (per-transport):

  • RDMA transports: The LVB resides in RDMA-registered memory. One-sided readers see the new counter value after the EX lock release because the release triggers an RDMA Send to the holder transitioning out of EX, which provides an ordering fence at the responder NIC. The one-sided double-read protocol continues to work correctly with the reset counter.
  • TCP transports: LVB reads are always two-sided (see "Two-sided LVB read fallback" below). The master returns the fresh counter (and the updated rotation_epoch) in subsequent DlmLvbReadResponsePayload messages.

Rotation frequency: At the maximum sustained rate (500K writes/sec), rotation occurs approximately every 8 years (87.5% of the ~8.9-year wrap interval). The rotation itself takes ~50-100 us (one EX lock cycle + counter reset) and blocks the resource for the duration. The LVB_ROTATION_THRESHOLD is configurable via sysctl cluster.dlm.lvb_rotation_threshold (valid range: 50%-99% of 2^48).

Wrap-safety for cache invalidation ordering: The LVB sequence counter is also used to detect stale LVB data during cache invalidation. When a node receives an LVB update, it compares the incoming sequence with its last-known sequence. This comparison is wrap-safe because the comparison window (the difference between two consecutive reads of the same LVB by a single node) is always much smaller than 2^47 (half the 48-bit counter space). Specifically, a node that reads an LVB will re-read it only upon the next lock acquire, which happens milliseconds to seconds later — accumulating at most a few thousand sequence increments. Since the wrap interval is ~8.9 years, the comparison window is negligible relative to 2^47, and signed 48-bit comparison ((new_seq - old_seq) > 0 in 48-bit modular arithmetic) correctly determines ordering even near the wrap boundary.
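The signed 48-bit modular comparison can be sketched as a small helper. `seq_newer` is an illustrative name; it implements the comparison described above and is valid as long as the true distance between the two observations is under 2^47 increments:

```rust
/// Mask extracting the 48-bit sequence counter (as defined earlier).
pub const LVB_SEQUENCE_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;
const HALF_RANGE: u64 = 1 << 47; // half the 48-bit counter space

/// Wrap-safe "is new_seq newer than old_seq?" in 48-bit modular arithmetic.
/// Correct whenever the real increment distance is < 2^47, which the text
/// argues always holds for the cache-invalidation comparison window.
pub fn seq_newer(new_seq: u64, old_seq: u64) -> bool {
    let diff = new_seq.wrapping_sub(old_seq) & LVB_SEQUENCE_MASK;
    diff != 0 && diff < HALF_RANGE
}
```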

Rotation safety for lockless read_lvb() callers: The rotation protocol (above) resets the sequence counter from ~87.5% of 2^48 to 0 under an EX lock. Lock-holding readers are notified of the reset via BAST/revocation (they re-acquire after rotation and see the new counter). However, read_lvb() callers (TCP two-sided path) do NOT hold locks and have no revocation channel. A lockless reader that cached sequence S near LVB_ROTATION_THRESHOLD before rotation, then calls read_lvb() after rotation and gets sequence ~0, would compute a massive negative difference and incorrectly conclude the LVB is stale.

To handle this, the DlmLvbReadResponsePayload includes a rotation_epoch field (incremented on each rotation). Lockless readers MUST compare rotation_epoch values before using sequence comparison for ordering:

  • If new_rotation_epoch != cached_rotation_epoch: a rotation occurred. The reader MUST discard its cached sequence and treat the new LVB as authoritative (no ordering comparison is meaningful across rotation boundaries).
  • If the rotation_epoch values match: the standard signed 48-bit sequence comparison applies.

Callers that use read_lvb() solely for torn-read detection (checking even/odd parity) are unaffected by rotation — the zeroed counter is even (stable), and the even/odd protocol continues correctly from 0 (see "Post-rotation even/odd invariant" above).

The sequence counter detects torn reads: the reader retries if the sequence changed during the read. This is a consistency mechanism, not an ABA prevention mechanism — ABA is not applicable because the reader does not perform compare-and-swap on the LVB data. The writer protocol uses RDMA Fetch-and-Add (FAA) for both transitions: FAA(sequence, 1) (now odd = writing) → update data → FAA(sequence, 1) (now even = stable). FAA is a standard RDMA atomic operation, ensuring visibility to concurrent one-sided readers.
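The writer-side odd/even sequence can be modeled locally as follows. This is a single-threaded sketch showing only the ordering — on the wire the two increments are RDMA Fetch-and-Add operations posted to the same RC QP as the data Write, and the `Lvb`/`lvb_write` names are illustrative:

```rust
/// Local model of the LVB (real LVBs live in RDMA-registered memory).
pub struct Lvb {
    pub data: [u8; 56],
    pub sequence: u64,
}

/// Writer-side odd/even update, per the protocol above:
/// FAA #1 makes the counter odd (update in progress), the data write
/// follows, FAA #2 makes it even again (stable).
pub fn lvb_write(lvb: &mut Lvb, new_data: &[u8; 56]) {
    lvb.sequence += 1; // FAA #1: odd = mid-update
    debug_assert!(lvb.sequence & 1 == 1);
    lvb.data.copy_from_slice(new_data);
    lvb.sequence += 1; // FAA #2: even = stable
}
```

A reader that observes the same even sequence on two reads therefore saw either a fully completed update or no update at all.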

LVB single-writer guarantee: The double-read protocol's correctness depends on there being at most one concurrent LVB writer for a given resource. This invariant is provided by the DLM lock itself: only a node holding an EX (Exclusive) or PW (Protected Write) lock on a resource may write to that resource's LVB (per the DLM compatibility matrix in Section 15.15). Because the DLM guarantees that at most one node holds EX or PW on a resource at any time, the single-writer invariant is guaranteed by the lock mode rules — no additional coordination is needed. During master failover, LVB writes are suspended until the new master is established and the lock state has been recovered, preventing interleaved writes from two nodes each believing they hold the lock.

RDMA ordering correctness argument: The writer updates the LVB via three RDMA operations posted to a single Reliable Connection (RC) Queue Pair: (1) FAA on sequence, (2) RDMA Write to data bytes, (3) FAA on sequence. Per the InfiniBand Architecture Specification (Vol 1, Section 11.5), operations within a single RC QP are processed at the responder (target NIC) in posting order. Therefore, when FAA #3 completes, the data Write #2 has already completed at the responder's memory. A reader on a DIFFERENT QP (QP_B) may see operations from QP_A interleaved with its own reads — this is the "no inter-QP ordering" property of RDMA. However, the double-read protocol handles this correctly: if QP_A's operations interleave with QP_B's first Read, the torn value will differ from QP_B's second Read (because the writer changed data and/or sequence between reads), causing a retry. The only remaining concern is whether QP_A's three operations can interleave with BOTH of QP_B's reads to produce identical torn values — this is impossible because the FAA operations on the sequence counter are 8-byte RDMA atomics (always observed atomically, no partial reads), and the sequence counter is monotonically increasing. If the reader's two RDMA Reads see the same sequence value (even), the writer either completed all three operations before both reads (data is consistent) or has not started (data is unchanged). If the sequence values differ between the two reads, the reader retries. The double-read protocol is therefore correct under RDMA's relaxed inter-QP ordering model without requiring explicit fencing between QPs.

RDMA Read atomicity and the SIGMOD 2023 analysis: The InfiniBand Architecture Specification does not formally guarantee that an RDMA Read larger than 8 bytes is delivered atomically. Ziegler et al. (SIGMOD 2023) investigated this question and found that in practice, cache-line-aligned 64-byte RDMA Reads are delivered atomically on all tested hardware — their experiments observed no torn reads for objects that fit within a single cache line. This empirical finding supports our cache-line-aligned LVB design. Nevertheless, the IB spec provides no formal guarantee, and future NICs or memory subsystems could behave differently. The double-read protocol provides defence-in-depth across three complementary layers:

  1. Cache-line alignment (de facto atomicity): The #[repr(C, align(64))] requirement ensures the 64-byte LVB is always cache-line aligned. On all shipping RDMA NICs (ConnectX-5+, AWS EFA, RoCEv2 adapters), the responder NIC reads from the last-level cache or memory controller, which operates at cache-line granularity. A cache-line-aligned 64-byte read therefore arrives from the responder as a single coherent unit — a single PCIe TLP — providing de facto atomicity even without formal IB spec guarantees. This is the primary defence.

Hardware qualification note: 64-byte RDMA read atomicity is a de-facto property of specific NICs, not guaranteed by the InfiniBand specification. It is confirmed on: Mellanox/NVIDIA ConnectX-5, ConnectX-6, ConnectX-7 (single cache-line reads are atomic in the NIC's memory subsystem), and AWS EFA (Elastic Fabric Adapter) NICs. It is NOT guaranteed on iWARP NICs (Chelsio T6, Intel X722) or InfiniBand HCAs without this property. UmkaOS's LVB implementation checks for the RDMA_ATOMIC_64B capability flag at device initialization and falls back to the double-read protocol (read → check sequence → read again if sequence changed) when the flag is absent. The double-read protocol is correct regardless of hardware atomicity; the single-read optimization is enabled only when the flag is present.

  2. Probabilistic defence via double-read: Even if a torn read occurs on a specific platform (e.g., under unusual NUMA topology or memory subsystem conditions), the double-read comparison provides a strong probabilistic defence. For both reads to produce identical torn values, the writer's in-progress modifications must create the EXACT same byte pattern in both torn snapshots — including the monotonically increasing sequence counter. Because the sequence counter changes by exactly 2 per complete write (odd during update, even after), reconstructing the same even sequence value twice from independent torn reads of two different write phases would require an astronomically unlikely alignment of byte delivery from two distinct PCIe transactions. In practice this is negligible.

  3. Two-sided fallback (absolute correctness): After 8 retries the reader falls back to a two-sided RDMA Send to the resource master, which reads the LVB under its local lock and returns a consistent snapshot. This path is unconditionally correct regardless of RDMA read atomicity guarantees or NIC implementation details.

Together these three layers ensure correctness: the first eliminates torn reads on all known hardware, the second provides defence-in-depth on any hypothetical future hardware, and the third guarantees forward progress regardless of RDMA semantics.

Livelock prevention: A continuously-updated LVB could cause a reader to retry indefinitely (the writer keeps changing the sequence counter between the reader's two RDMA Reads). To prevent this, the reader enforces a maximum of 8 retries with exponential backoff (1 μs, 2 μs, 4 μs, ..., 128 μs). If all retries are exhausted, the reader falls back to a two-sided RDMA Send to the resource master, requesting a consistent LVB snapshot. The master reads the LVB under its local lock (preventing concurrent writer updates during the read) and returns the consistent value. This fallback adds ~5-8 μs but guarantees forward progress. In practice, a single retry suffices in over 99% of cases — the 8-retry limit is a safety bound for pathological writer contention.

Typical case: 1 RDMA Read + 1 RDMA Read (64 bytes) = ~3-4 μs total.
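Putting the double-read, parity check, retry bound, and two-sided fallback together, the reader loop might look like the sketch below. The transport operations are abstracted as closures, and the little-endian placement of the sequence counter in the last 8 bytes is an assumption made for the sketch:

```rust
pub const LVB_SEQUENCE_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;
pub const MAX_LVB_RETRIES: u32 = 8;

/// Illustrative reader loop: up to 8 double-read attempts with exponential
/// backoff (1, 2, 4, ..., 128 μs), then a two-sided fallback to the master.
pub fn read_lvb_consistent(
    mut rdma_read_lvb: impl FnMut() -> [u8; 64],     // one-sided RDMA Read
    two_sided_fallback: impl FnOnce() -> [u8; 64],   // snapshot from master
    mut backoff_us: impl FnMut(u64),                 // sleep/spin for N μs
) -> [u8; 64] {
    for attempt in 0..MAX_LVB_RETRIES {
        let first = rdma_read_lvb();
        // Sequence counter occupies the last 8 bytes (little-endian here).
        let seq = u64::from_le_bytes(first[56..64].try_into().unwrap());
        if (seq & LVB_SEQUENCE_MASK) & 1 == 1 {
            backoff_us(1u64 << attempt); // odd: writer mid-update → retry
            continue;
        }
        let second = rdma_read_lvb();
        if first == second {
            return first; // every byte matched: consistent snapshot
        }
        backoff_us(1u64 << attempt); // torn or concurrent update → retry
    }
    // Pathological contention: master reads the LVB under its local lock.
    two_sided_fallback()
}
```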

After lock master recovery (Section 15.15), LVBs from dead holders are marked INVALID (sequence counter set to u64::MAX). The next EX or PW holder must refresh the LVB from disk before other nodes can trust it (both EX and PW are write modes that can update the LVB, per the compatibility matrix above).

15.15.3.1 Two-Sided LVB Read Fallback

Applicability of the one-sided LVB read protocol above: The double-read/seqlock protocol described above (RDMA Read → check sequence → retry) applies ONLY when transport.supports_one_sided() == true (RDMA, CXL). For TCP peers, LVBs are read via the two-sided LvbReadRequest/LvbReadResponse path described in this section, or piggybacked on lock grant messages (DlmLockGrantPayload.lvb_len). The double-read protocol is never used on TCP.

For peers connected via TCP (where transport.supports_one_sided() == false), a node that needs to read an LVB without acquiring a lock uses the two-sided LVB read path:

  1. The requester sends a DlmMessageType::LvbReadRequest message to the resource master, identifying the resource by name hash.
  2. The master receives the request, acquires the resource's inner SpinLock (preventing concurrent LVB writes), reads the current LVB data, sequence counter, and rotation_epoch, releases the lock, and sends a DlmMessageType::LvbReadResponse containing the 64-byte LVB content, sequence counter, and rotation epoch.
  3. The requester uses the received LVB data directly — no double-read or sequence checking is needed because the master serialized the read under its local lock.

Cost: ~50-200 μs on TCP (one round-trip) vs ~3-5 μs for RDMA one-sided double-read. Still far cheaper than a full lock acquire+release round-trip (~100-400 μs on TCP), because the LVB read does not modify lock state, does not enter the granted/waiting queues, and does not generate BAST callbacks.

The DlmResource::read_lvb() API dispatches transparently:

/// Read the LVB for a resource without acquiring a lock.
/// Returns the 64-byte LVB data and the sequence counter.
///
/// Transport dispatch:
/// - RDMA/CXL: one-sided double-read with seqlock protocol (fast path).
/// - TCP: two-sided LvbReadRequest/LvbReadResponse (message path).
pub fn read_lvb(&self, resource: &ResourceName) -> Result<LockValueBlock, DlmError> {
    let master = self.hash_ring.master(resource);
    let transport = self.peer_transport(master);
    if transport.supports_one_sided() {
        self.read_lvb_one_sided(master, resource)
    } else {
        self.read_lvb_two_sided(master, resource)
    }
}

Wire message structs for the two-sided LVB read path are defined in the "DLM Wire Protocol" section below (DlmLvbReadRequestPayload, DlmLvbReadResponsePayload).
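
For illustration only, a plausible flat encoding of the response payload described in step 2 (64-byte LVB content, then sequence counter and rotation epoch, little-endian). This is a hypothetical layout sketch; the authoritative DlmLvbReadResponsePayload definition lives in the "DLM Wire Protocol" section:

```rust
use std::convert::TryInto;

/// Hypothetical flat wire layout: [0..64) LVB, [64..72) seq, [72..80) epoch.
pub fn encode_lvb_response(lvb: &[u8; 64], seq: u64, epoch: u64) -> [u8; 80] {
    let mut buf = [0u8; 80];
    buf[..64].copy_from_slice(lvb);
    buf[64..72].copy_from_slice(&seq.to_le_bytes());
    buf[72..80].copy_from_slice(&epoch.to_le_bytes());
    buf
}

/// Inverse of `encode_lvb_response`: recover (lvb, seq, rotation_epoch).
pub fn decode_lvb_response(buf: &[u8; 80]) -> ([u8; 64], u64, u64) {
    let mut lvb = [0u8; 64];
    lvb.copy_from_slice(&buf[..64]);
    let seq = u64::from_le_bytes(buf[64..72].try_into().unwrap());
    let epoch = u64::from_le_bytes(buf[72..80].try_into().unwrap());
    (lvb, seq, epoch)
}
```

No sequence validation is needed on the requester side precisely because the master serialized the read under its local lock before encoding.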

LVB write protocol — TCP alternative: The FAA+Write+FAA sequence described in the "RDMA ordering correctness argument" section above applies only to RDMA transports where one-sided writes are used. TCP peers do not use FAA or RDMA Write for LVB updates. Instead, LVB updates are carried in LockConvert or LockRelease messages: the LVB data is appended to the wire message per the lvb_len field in DlmLockConvertPayload / DlmLockReleasePayload. The master updates DlmResourceInner.lvb under the resource's SpinLock. No FAA or RDMA Write operations are involved — the seqlock protocol is unnecessary for message-based transports because the master serializes all updates.

LVB read direction — LockGrant: LVBs are also piggybacked on LockGrant messages (master-to-requester direction) via DlmLockGrantPayload.lvb_len. This is the read path: when a node acquires a PR/CR lock, the grant message carries the current LVB snapshot. The write direction (holder-to-master) uses LockConvert and LockRelease as described above.

15.15.4 Lock Resource Naming and Master Assignment

Lock resources are identified by hierarchical names that encode the filesystem, resource type, and specific object:

Format: <filesystem>:<uuid>:<type>:<id>[:<subresource>]

Examples:
  gfs2:550e8400-e29b:inode:12345:data      — data lock for inode 12345
  gfs2:550e8400-e29b:inode:12345:meta      — metadata lock for inode 12345
  gfs2:550e8400-e29b:rgrp:42               — resource group 42 allocation lock
  gfs2:550e8400-e29b:journal:3             — journal 3 ownership lock
  gfs2:550e8400-e29b:dir:789:bucket:5      — directory 789 hash bucket 5
  app:mydb:table:users:row:1001            — application-level row lock
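
Assembling such a name is plain string concatenation over the colon-separated components. A trivial sketch (the helper name is hypothetical; the kernel builds the name into a fixed 256-byte ResourceName buffer rather than a heap String):

```rust
/// Build the hierarchical DLM resource name for an inode lock,
/// following the <filesystem>:<uuid>:<type>:<id>:<subresource> format.
pub fn inode_lock_name(fs: &str, uuid: &str, ino: u64, sub: &str) -> String {
    format!("{fs}:{uuid}:inode:{ino}:{sub}")
}
```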

Master assignment: Each lock resource is assigned a master node responsible for maintaining the granted/converting/waiting queues. The master is determined by consistent hashing using a virtual-node ring (note: this is deliberately different from DSM home-node assignment in Section 6.4, which uses modular hashing — hash % cluster_size — for simpler O(1) lookups; DLM uses consistent hashing because lock resources are more numerous and benefit from minimal redistribution on node changes):

// Each physical node has V virtual nodes on the ring (default V=64).
// The ring is a sorted array of (hash, physical_node_id) pairs.
ring = [(hash(node_0, vnode_0), 0), (hash(node_0, vnode_1), 0), ...,
        (hash(node_N, vnode_V), N)]

master(resource_name) = ring.successor(hash(resource_name)).physical_node_id

When a node joins or leaves the cluster, only ~1/N of total resources are remapped — those that previously mapped to the departed node's virtual nodes, i.e., whose hash falls between each departed ring point and its predecessor on the ring. This is the key property of consistent hashing — unlike modular hashing (hash % cluster_size), which remaps nearly all resources on membership change.
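
The ~1/N remap property is easy to verify with a standalone simulation. In this sketch, std's deterministic DefaultHasher stands in for keyed SipHash-2-4 and the helper names are illustrative; the lookup uses the same successor-with-wraparound rule as the ring above:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in hash (deterministic; the real ring uses keyed SipHash-2-4).
fn h<T: Hash>(x: &T) -> u64 {
    let mut s = DefaultHasher::new();
    x.hash(&mut s);
    s.finish()
}

/// Build a ring of (hash_point, node) pairs, `vnodes` points per node.
fn build_ring(nodes: &[u32], vnodes: u32) -> Vec<(u64, u32)> {
    let mut ring: Vec<(u64, u32)> = nodes
        .iter()
        .flat_map(|&n| (0..vnodes).map(move |v| (h(&(n, v)), n)))
        .collect();
    ring.sort_unstable();
    ring
}

/// Successor lookup: first point with hash >= key, wrapping to ring[0].
fn master(ring: &[(u64, u32)], key: u64) -> u32 {
    let idx = ring.partition_point(|&(p, _)| p < key);
    ring[idx % ring.len()].1
}
```

Removing one node from an 8-node ring remaps roughly 1/8 of keys (only those previously mastered by the departed node), whereas switching `hash % 8` to `hash % 7` remaps about 7/8 of them.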

Design choice — consistent hashing vs. directory-based master assignment: Linux's DLM uses modular hashing for lock resource mastering. UmkaOS uses consistent hashing with virtual nodes because: (1) it is fully distributed with no single point of failure — any node can compute any resource's master locally from the ring (an O(log(V×N)) binary search); (2) membership changes remap only ~1/N of resources instead of nearly all. (The contrast with DSM's modular home-node hashing in Section 6.4 is covered in the note above: these are separate protocols with different tradeoffs, not a shared scheme.) The tradeoff is that consistent hashing cannot optimize for locality (a node that uses a resource heavily is not preferentially assigned as its master). For workloads where locality matters (e.g., a single node accessing a file exclusively), the DLM's lease mechanism (Section 15.15) compensates: the holder simply extends its lease without contacting the master, so master location is irrelevant on the fast path.

/// Consistent hash ring for DLM master assignment. Each physical node
/// contributes V virtual nodes (default 64) to the ring. The ring is a
/// sorted array of (hash_point, node_id) pairs; master lookup is an
/// O(log(N × V)) binary search for the successor of hash(resource_name).
///
/// The ring is immutable between membership changes. On node join/departure,
/// a new ring is computed and swapped atomically via RCU. Lock operations in
/// flight see a consistent snapshot — either the old ring or the new one,
/// never a partially-updated ring.
pub struct DlmHashRing {
    /// Sorted array of (hash_point, physical_node_id) pairs.
    /// Length = N_nodes * VNODES_PER_NODE. Sorted by hash_point ascending.
    /// Binary search finds the successor: the first entry with
    /// hash_point >= hash(resource_name). If no such entry exists (wrap),
    /// the first entry in the array is the successor (ring wraps around).
    ///
    /// Hash function: SipHash-2-4(resource_name) → u64 for resource lookups;
    /// SipHash-2-4(node_id || vnode_index) → u64 for ring point generation.
    /// Both use the cluster-wide key in `sip_key`. SipHash is chosen for
    /// DoS resistance (the keyed hash prevents adversarial resource name
    /// selection that skews master assignment).
    pub points: ArrayVec<HashRingPoint, MAX_RING_POINTS>,

    /// Cluster-wide SipHash key (derived from the lockspace creation seed).
    /// Identical on every node, so all nodes compute identical masters.
    pub sip_key: SipKey,
}

/// Maximum ring points = max cluster nodes * vnodes per node.
/// 256 nodes * 64 vnodes = 16384 points. Sufficient for the largest
/// supported cluster size.
pub const MAX_RING_POINTS: usize = 16384;

/// Virtual nodes per physical node on the consistent hash ring.
/// **Imbalance analysis**: With 64 vnodes per node and SipHash-2-4, the
/// expected load imbalance between the most- and least-loaded master nodes
/// is ~10-15% for clusters of 4-16 nodes, decreasing to ~5% at 64+ nodes
/// (standard deviation ~1/sqrt(vnodes)). 64 vnodes provides a practical
/// balance between ring size (memory: 16 bytes × 64 = 1 KiB per node)
/// and distribution uniformity. Increasing to 256 would reduce imbalance
/// to ~2-3% but quadruples per-node ring memory.
pub const VNODES_PER_NODE: u32 = 64;

/// A single point on the consistent hash ring.
/// Points are sorted by `(hash, node_id)` to break ties deterministically.
/// With SipHash-2-4 and 16384 points in a 64-bit hash space, collisions
/// have probability ~7e-12 per ring build, but the deterministic
/// tie-break ensures all nodes agree on the master even in the
/// degenerate case.
// kernel-internal, not KABI — local in-memory consistent hashing structure.
#[repr(C)]
pub struct HashRingPoint {
    /// Hash value (SipHash-2-4 of node_id || vnode_index).
    pub hash: u64,
    /// Physical node ID that owns this ring point.
    pub node_id: NodeId,
}
// kernel-internal: hash(8) + node_id(8) = 16 bytes.
const_assert!(core::mem::size_of::<HashRingPoint>() == 16);

impl DlmHashRing {
    /// Look up the master node for a given resource name.
    /// Returns the node_id of the first ring point whose hash >= hash(resource_name).
    /// O(log(N * VNODES_PER_NODE)) binary search.
    pub fn master(&self, resource_name: &ResourceName) -> NodeId {
        // Keyed lookup: the same cluster-wide SipHash key is used for
        // resource hashing and ring point generation, so adversarial
        // resource names cannot be crafted offline to skew mastering.
        let h = siphash_2_4_keyed(&self.sip_key, resource_name.as_bytes());
        // Binary search for first point with hash >= h.
        // If none found (h > all points), wrap to points[0].
        match self.points.binary_search_by_key(&h, |p| p.hash) {
            Ok(idx) => self.points[idx].node_id,
            Err(idx) => {
                if idx < self.points.len() {
                    self.points[idx].node_id
                } else {
                    self.points[0].node_id // wrap around
                }
            }
        }
    }

    /// Rebuild the ring after a membership change. Called when a node joins
    /// or departs the cluster. The new ring is built from scratch from the
    /// current member set and swapped in via RCU.
    ///
    /// `members`: current cluster member set (post-join or post-departure).
    /// `sip_key`: SipHash key for ring point generation (cluster-wide constant,
    ///   derived from the lockspace creation seed). Stored in the ring for
    ///   subsequent keyed resource lookups.
    pub fn rebuild(members: &[NodeId], sip_key: &SipKey) -> Self {
        let mut ring = DlmHashRing {
            points: ArrayVec::new(),
            sip_key: *sip_key,
        };
        for &node_id in members {
            for vnode in 0..VNODES_PER_NODE {
                let h = siphash_2_4_keyed(sip_key, &(node_id, vnode));
                ring.points.push(HashRingPoint { hash: h, node_id });
            }
        }
        ring.points.sort_unstable_by_key(|p| (p.hash, p.node_id));
        ring
    }
}

Master migration on membership change: When a node departs (crash or graceful leave), the surviving nodes rebuild the hash ring. Resources whose master was the departed node are now hashed to their successor in the new ring. The new master broadcasts MasterMigration to all nodes that hold locks on affected resources. Each node re-targets pending lock requests to the new master. Resources mastered on surviving nodes are unaffected — their hash position and successor are unchanged.

When a node joins, the new ring is computed and ~1/N of resources shift to the new node. The old master for each shifted resource sends the resource's granted/converting/waiting queues to the new master via a MasterTransfer message. Lock operations on shifted resources are briefly queued (not rejected) during the transfer window (~1-5 ms typical).

/// Intrusive doubly-linked list node. Embedded in structs that need to
/// be linked without heap allocation.
///
/// # Safety invariant
/// A node must be removed from all lists before its containing struct
/// is freed. Leaving a dangling node pointer causes use-after-free.
///
/// Fields are private to encapsulate unsafe pointer manipulation.
/// All modifications go through `IntrusiveList` methods that document
/// their safety contracts.
pub struct IntrusiveListNode {
    prev: *mut IntrusiveListNode,
    next: *mut IntrusiveListNode,
}

impl IntrusiveListNode {
    /// Create a new unlinked node with null prev/next pointers.
    ///
    /// After slab allocation places the node at its permanent address,
    /// the caller must invoke [`init_at_final_address`] to establish the
    /// self-referential "unlinked" state. Until then, `is_unlinked()`
    /// returns `true` (null pointers are treated as unlinked).
    pub const fn new() -> Self {
        IntrusiveListNode {
            prev: core::ptr::null_mut(),
            next: core::ptr::null_mut(),
        }
    }

    /// Initialise a node at its permanent (pinned) address so that
    /// prev/next point to itself, establishing the "unlinked" sentinel
    /// state.
    ///
    /// # Safety
    ///
    /// `this` must point to a valid, pinned `IntrusiveListNode` that
    /// will not be moved for the lifetime of the containing allocation
    /// (e.g., a slab object).
    pub unsafe fn init_at_final_address(this: *mut IntrusiveListNode) {
        // SAFETY: Caller guarantees `this` is valid and pinned.
        unsafe {
            (*this).prev = this;
            (*this).next = this;
        }
    }

    /// Returns true if this node is not currently linked in any list.
    ///
    /// A node is unlinked if both pointers are null (freshly constructed)
    /// or both point to self (initialised but not inserted).
    pub fn is_unlinked(&self) -> bool {
        let self_ptr = self as *const _ as *mut _;
        (self.prev.is_null() && self.next.is_null())
            || (self.prev == self_ptr && self.next == self_ptr)
    }
}

/// Head sentinel for an intrusive list. The `prev`/`next` pointers
/// form a circular doubly-linked list with the head acting as a
/// sentinel. An empty list has `head.prev == head.next == &head`.
pub struct IntrusiveListHead {
    sentinel: IntrusiveListNode,
    len: usize,
}

impl IntrusiveListHead {
    /// Return the number of entries in this list.
    pub fn len(&self) -> usize { self.len }
    /// Return true if the list is empty.
    pub fn is_empty(&self) -> bool { self.len == 0 }
}

/// Typed intrusive list. `T` must embed an `IntrusiveListNode` accessible
/// via the `node_offset` (computed by `field_offset!` at the call site).
///
/// All pointer manipulation is encapsulated in the `insert_front()`,
/// `insert_back()`, and `remove()` methods. These are the only
/// entry points that modify `IntrusiveListNode` fields, ensuring
/// safety invariants are auditable in one location.
pub struct IntrusiveList<T> {
    head: IntrusiveListHead,
    _marker: PhantomData<T>,
}

impl<T> IntrusiveList<T> {
    /// Insert `node` after the sentinel (at the front of the list).
    ///
    /// # Safety
    /// `node` must be a valid pointer to an `IntrusiveListNode` embedded
    /// in a live `T`. The node must not be currently linked in any list.
    /// The caller must hold the protecting lock (e.g., `DlmResource.inner`).
    pub unsafe fn insert_front(&mut self, node: *mut IntrusiveListNode) {
        // SAFETY: caller guarantees node validity and mutual exclusion.
        (*node).next = self.head.sentinel.next;
        (*node).prev = &mut self.head.sentinel as *mut _;
        (*self.head.sentinel.next).prev = node;
        self.head.sentinel.next = node;
        self.head.len += 1;
    }

    /// Insert `node` before the sentinel (at the back of the list).
    ///
    /// # Safety
    /// Same preconditions as `insert_front`.
    pub unsafe fn insert_back(&mut self, node: *mut IntrusiveListNode) {
        // SAFETY: caller guarantees node validity and mutual exclusion.
        (*node).prev = self.head.sentinel.prev;
        (*node).next = &mut self.head.sentinel as *mut _;
        (*self.head.sentinel.prev).next = node;
        self.head.sentinel.prev = node;
        self.head.len += 1;
    }

    /// Remove `node` from this list.
    ///
    /// # Safety
    /// `node` must be currently linked in THIS list (not another list).
    /// The caller must hold the protecting lock.
    pub unsafe fn remove(&mut self, node: *mut IntrusiveListNode) {
        // SAFETY: caller guarantees node is in this list and mutual exclusion.
        (*(*node).prev).next = (*node).next;
        (*(*node).next).prev = (*node).prev;
        // Reset to self-referential (unlinked sentinel).
        (*node).prev = node;
        (*node).next = node;
        self.head.len -= 1;
    }
}

/// DLM resource name. Variable-length, hierarchical (e.g., "gfs2:fsid:inode:12345").
/// Maximum 256 bytes (matching the DLM protocol maximum resource name length).
/// Compared by byte equality for lock matching.
///
/// **Memory budget**: At 258 bytes per `ResourceName`, 100K lock resources
/// consume ~25 MB; 1M resources consume ~246 MB. For workloads exceeding
/// 500K concurrent lock resources, a compact representation (inline 64 bytes
/// + slab-allocated overflow) is recommended as a Phase 4+ optimization.
pub struct ResourceName {
    /// Name bytes (NUL-terminated, max 256 bytes including NUL).
    pub bytes: [u8; 256],
    /// Length of the name (excluding NUL terminator).
    pub len: u16,
}

/// Wait-for graph for distributed deadlock detection.
/// Nodes are lock holders (identified by (node_id, lock_id) pairs).
/// Edges represent "waits for" relationships. Cycle detection runs
/// periodically (default: every 100 ms) using a DFS traversal, and only
/// considers locks that have exceeded the 5-second deadlock-suspicion
/// threshold (Section 15.12.9).
pub struct WaitForGraph {
    /// Adjacency list: waiter → set of holders it's waiting for.
    /// Bounded to MAX_CONCURRENT_LOCKS (65536) entries.
    ///
    /// **BTreeMap rationale**: Deadlock detection runs only after a lock
    /// has been waiting >5 seconds (see Section 15.12.9) — this is off the
    /// hot lock-grant path entirely. BTreeMap provides ordered iteration by
    /// WaiterId (NodeId, lock_id), ensuring all cluster nodes evaluate
    /// deadlock victim candidates in the same deterministic order. This
    /// eliminates the need for an explicit sort before victim selection.
    /// The DFS cycle detection itself traverses per-vertex adjacency lists
    /// (ArrayVec), not the BTreeMap iteration order; the BTreeMap ordering
    /// matters only for victim selection when multiple candidate victims
/// exist. The 65536-entry bound caps memory at ~10 MB
/// (65536 × (16-byte key + ~136-byte ArrayVec<WaiterId, 8> value), plus
/// BTreeMap node overhead), acceptable for a background structure.
    /// An alternative HashMap would give O(1) average but non-deterministic
    /// iteration order would require an explicit sort before victim
    /// selection.
    pub edges: BTreeMap<WaiterId, ArrayVec<WaiterId, 8>>,
}
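
The deterministic victim selection described above can be sketched as a standalone DFS over a BTreeMap-ordered graph. In this illustrative version, WaiterId is simplified to a (node_id, lock_id) tuple and adjacency lists are plain Vecs (the in-kernel structure uses ArrayVec); the victim is the smallest WaiterId on the first cycle found, which every node computes identically thanks to the ordered iteration:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified waiter identity: (node_id, lock_id). Ordered for determinism.
type WaiterId = (u64, u64);

fn dfs(
    v: WaiterId,
    edges: &BTreeMap<WaiterId, Vec<WaiterId>>,
    path: &mut Vec<WaiterId>,
    done: &mut BTreeSet<WaiterId>,
) -> Option<WaiterId> {
    if let Some(pos) = path.iter().position(|&p| p == v) {
        // Cycle found: path[pos..] closes back to v. Pick the smallest
        // WaiterId on the cycle so every node selects the same victim.
        return path[pos..].iter().copied().min();
    }
    if done.contains(&v) {
        return None; // Already fully explored, no cycle through v.
    }
    path.push(v);
    if let Some(next) = edges.get(&v) {
        for &w in next {
            if let Some(victim) = dfs(w, edges, path, done) {
                return Some(victim);
            }
        }
    }
    path.pop();
    done.insert(v);
    None
}

/// Scan the wait-for graph in deterministic (BTreeMap) order and return
/// the victim of the first cycle found, or None if the graph is acyclic.
pub fn find_deadlock_victim(edges: &BTreeMap<WaiterId, Vec<WaiterId>>) -> Option<WaiterId> {
    let mut done = BTreeSet::new();
    for &start in edges.keys() {
        let mut path = Vec::new();
        if let Some(v) = dfs(start, edges, &mut path, &mut done) {
            return Some(v);
        }
    }
    None
}
```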

/// Deterministic ordering is required for consistent cycle detection
/// across all cluster nodes.
#[derive(Ord, PartialOrd, Eq, PartialEq, Clone, Copy)]
pub struct WaiterId {
    pub node_id: NodeId,
    pub lock_id: u64,
}

/// A lock resource managed by the DLM.
///
/// **Locking protocol**: The `inner` field wraps all mutable resource state
/// (LVB, lock queues, pending CAS) in a `SpinLock`. This lock is held for
/// O(1) operations only: queue manipulation, LVB read/write, pending CAS
/// update. It MUST NOT be held across any network message send or RDMA
/// operation — those are performed after releasing the lock with the
/// relevant data copied out.
///
/// Lock ordering: `DlmResource.inner` is below lockspace-level locks
/// (e.g., `DlmLockspace.shard_locks`) and above nothing — it is a leaf
/// lock. Acquiring two `DlmResource.inner` locks simultaneously is
/// forbidden (deadlock risk with lock conversion across resources).
pub struct DlmResource {
    /// Resource name (hierarchical, variable-length).
    /// Immutable after creation — no lock needed for reads.
    pub name: ResourceName,

    /// Node ID of the resource master.
    /// Updated only during re-mastering (under lockspace shard lock).
    pub master: NodeId,

    /// Mutable resource state protected by a SpinLock.
    pub inner: SpinLock<DlmResourceInner>,

    /// Per-resource DSM dependency for recovery ordering. Tracks whether
    /// this resource's CAS word page resides in a DSM region. Used during
    /// node failure recovery to determine whether re-mastering must wait
    /// for DSM home reconstruction
    /// ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--cross-subsystem-recovery-ordering-dsm-and-dlm)).
    /// Populated at resource creation time when the master allocates CAS
    /// word arrays from RDMA-registered memory. Immutable after creation.
    pub dsm_dep: DlmResourceDsmDep,
}

/// Mutable state within a `DlmResource`, protected by `DlmResource.inner`
/// SpinLock. All fields in this struct require holding the lock for access.
pub struct DlmResourceInner {
    /// Lock Value Block for this resource.
    pub lvb: LockValueBlock,

    /// Rotation epoch — incremented each time the LVB sequence counter is
    /// rotated (reset to 0). Returned in `DlmLvbReadResponsePayload` so
    /// lockless `read_lvb()` callers can detect rotation discontinuities.
    /// See "Rotation safety for lockless `read_lvb()` callers".
    pub rotation_epoch: u64,

    /// Granted queue — locks currently held.
    /// Intrusive linked list: DlmLock nodes are allocated from a per-lockspace
    /// slab allocator (fixed-size, no heap resizing on the lock grant path).
    pub granted: IntrusiveList<DlmLock>,

    /// Converting queue — locks being converted (upgrade/downgrade).
    /// Processed in FIFO order before the waiting queue.
    pub converting: IntrusiveList<DlmLock>,

    /// Waiting queue — new lock requests waiting for compatibility.
    pub waiting: IntrusiveList<DlmLock>,

    /// Pending CAS confirmations ([Section 15.15](#distributed-lock-manager--rdma-atomic-cas-lock-fast-path)).
    /// When remote nodes acquire a lock via RDMA CAS but have not yet sent
    /// the confirmation RDMA Send, this field tracks the expected confirmations.
    /// The master defers processing new incompatible-mode requests against this
    /// resource until all confirmations arrive or time out. A bounded collection
    /// is required — not Option<PendingCas> — because shared-mode CAS operations
    /// (e.g., PR acquires) allow multiple peers to win concurrently (each
    /// successive shared-mode CAS increments holder_count). For exclusive-mode
    /// CAS (EX, PW), at most one entry exists. Cap of 64 is bounded by CAS
    /// serialization within the 500 us confirmation timeout, not by cluster size
    /// (a single CAS takes ~1 us round-trip; at most ~64 can complete within
    /// 500 us under contention).
    pub pending_cas: ArrayVec<PendingCas, MAX_PENDING_CAS>,
}

/// Region identifier for DSM (Distributed Shared Memory) regions.
/// Matches the DsmRegionId defined in the DSM subsystem ([Section 6.2](06-dsm.md#dsm-design-overview)).
/// `DsmRegionId` (u64 alias in DLM) is the unwrapped value from
/// `DsmRegionHandle(u64)` in the DSM subsystem ([Section 6.1](06-dsm.md#dsm-foundational-types)).
type DsmRegionId = u64;

/// Per-resource DSM dependency metadata for recovery ordering.
/// 16 bytes overhead per DlmResource.
pub struct DlmResourceDsmDep {
    /// DSM region containing this resource's CAS word page.
    /// 0 = no DSM region dependency (CAS word is in local, non-DSM RDMA
    /// pool memory — the common case for lockspaces that do not use
    /// DSM-backed state). DSM region IDs are always positive (assigned
    /// by the region coordinator starting at 1).
    pub region_id: u64,
    /// Virtual address of the CAS word within the DSM region.
    /// Used during recovery to check whether the specific page was homed
    /// on the failed node: `home_node(region_id, cas_word_va) == failed_node`.
    /// 0 when `region_id == 0` (kernel virtual addresses are never 0).
    pub cas_word_va: u64,
}
const_assert!(core::mem::size_of::<DlmResourceDsmDep>() == 16);

pub const MAX_PENDING_CAS: usize = 64;

/// Tracks a pending CAS confirmation for a DlmResource.
pub struct PendingCas {
    /// Peer that performed the CAS.
    pub peer: PeerId,
    /// Lock mode the node acquired.
    pub mode: LockMode,
    /// Sequence value in the CAS word after the acquire (for timeout reset).
    pub post_cas_sequence: u64,
    /// Timestamp when the CAS was detected (for 500 μs timeout).
    pub detected_at_ns: u64,
}

// Note on allocation strategy: DlmLock nodes are allocated from a per-lockspace
// slab allocator (umka-core Section 4.3). The slab pre-allocates DlmLock-sized objects
// and grows in page-sized chunks, so individual lock grant/release operations
// never trigger the general-purpose heap allocator. This ensures bounded latency
// on the contested lock path. The intrusive list avoids the pointer indirection
// and dynamic resizing of VecDeque/Vec.
//
// Note on byte-range lock tracking: each DlmLock's associated LockDirtyTracker
// (Section 15.12.8) uses LargeRangeBitmap (not a flat SparseBitmap) to track dirty
// pages within the lock's byte range. This supports files of any practical size:
// ≤ 1 GiB files use the flat SparseBitmap path (zero overhead), while larger files
// use the two-level LargeRangeBitmap with lazily-allocated 1 GiB slots.

/// A single lock held or requested by a node.
pub struct DlmLock {
    /// Node that owns this lock.
    pub node: NodeId,

    /// Requested/granted lock mode.
    pub mode: LockMode,

    /// Process ID on the owning node (for deadlock detection).
    pub pid: u32,

    /// Flags (NOQUEUE, CONVERT, CANCEL, etc.).
    pub flags: LockFlags,

    /// Timestamp for ordering and deadlock victim selection.
    pub timestamp_ns: u64,

    /// Revocation handler — called on the holder's node when the master
    /// sends a revocation/downgrade request due to a conflicting lock.
    ///
    /// The handler encapsulates all application-specific revocation logic:
    /// - UPFS: flush dirty pages (targeted writeback), invalidate cache,
    ///   update LVB with latest metadata, then downgrade or release.
    /// - VFS export: break client leases, flush data, release.
    /// - Generic application: release immediately.
    ///
    /// The DLM drives the entire flow: detect contention → send revocation
    /// → handler runs on holder → handler calls dlm_convert() or
    /// dlm_unlock() → DLM grants to the new requester. No separate
    /// "token layer" needed — the handler IS the token behavior.
    ///
    /// Set at lock acquire time. If None, the DLM uses a default handler
    /// that releases the lock immediately on revocation.
    pub revocation_handler: Option<&'static dyn DlmRevocationHandler>,

    /// Intrusive list linkage for membership in `DlmResourceInner.granted`,
    /// `.converting`, or `.waiting` queues. A DlmLock is in exactly one
    /// queue at any time. Access requires holding `DlmResource.inner`.
    pub queue_link: IntrusiveListNode,
}

// **DlmLock intrusive list lifecycle**:
// 1. **Allocation**: DlmLock is allocated from the per-lockspace slab on `dlm_lock()`.
// 2. **Waiting**: Inserted into `DlmResourceInner.waiting` queue. Remains there until
//    compatibility check passes or the request is cancelled.
// 3. **Granted**: Moved from `waiting` → `granted` when the lock mode is compatible
//    with all existing grants. The move is O(1) (unlink + relink).
// 4. **Converting**: Moved from `granted` → `converting` on `dlm_convert()`.
//    Moved back to `granted` when the conversion is compatible.
// 5. **Release**: Removed from whichever queue it occupies on `dlm_unlock()`.
//    The slab object is returned to the per-lockspace slab allocator.
// A DlmLock is NEVER on two queues simultaneously.
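
The grant check in lifecycle step 3 can be sketched as a pure function over lock modes. This standalone version assumes the classic six-mode DLM compatibility matrix (NL/CR/CW/PR/PW/EX, as in VMS and Linux's dlm); the enum and helper names here are illustrative, not the kernel's definitions:

```rust
/// The six classic DLM lock modes, weakest to strongest.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[repr(u8)]
pub enum LockMode { Nl = 0, Cr = 1, Cw = 2, Pr = 3, Pw = 4, Ex = 5 }

/// Classic six-mode compatibility matrix: NL is compatible with
/// everything; CR conflicts only with EX; CW is self-compatible;
/// PR is self-compatible; PW and EX conflict with all write modes.
pub fn compatible(a: LockMode, b: LockMode) -> bool {
    use LockMode::*;
    match (a, b) {
        (Nl, _) | (_, Nl) => true,
        (Cr, Ex) | (Ex, Cr) => false,
        (Cr, _) | (_, Cr) => true,
        (Cw, Cw) => true,
        (Pr, Pr) => true,
        _ => false,
    }
}

/// A waiter is grantable iff its mode is compatible with every lock
/// currently on the granted queue.
pub fn grantable(request: LockMode, granted: &[LockMode]) -> bool {
    granted.iter().all(|&g| compatible(request, g))
}
```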

/// Trait for lock revocation handlers. Implemented by subsystems that
/// need custom behavior on lock downgrade/revocation (UPFS, VFS export,
/// block export reservations).
///
/// The handler runs in a DLM worker thread on the lock holder's node.
/// It must complete within a bounded time (configurable per lockspace,
/// default: 5 seconds). If the handler exceeds the timeout, the DLM
/// forcibly releases the lock and logs an FMA event.
pub trait DlmRevocationHandler: Send + Sync {
    /// Called when the DLM master requests downgrade from `current_mode`
    /// to `requested_mode` (e.g., EX → PR when a reader arrives).
    ///
    /// The handler MUST:
    /// 1. Perform any application-specific cleanup (flush dirty data,
    ///    invalidate caches, update LVB).
    /// 2. Call `lock.convert(requested_mode)` to complete the downgrade.
    ///    OR call `lock.unlock()` to release entirely.
    ///
    /// `context` carries the conflicting requester's information (node,
    /// requested mode) for handlers that need it (e.g., UPFS may choose
    /// different flush strategies based on the requester's lock type).
    fn on_revoke(&self, lock: &DlmLock, current_mode: LockMode,
                 requested_mode: LockMode, context: &RevocationContext);
}

/// Context passed to revocation handlers.
pub struct RevocationContext {
    /// Node requesting the conflicting lock.
    pub requester_node: NodeId,
    /// Mode requested by the conflicting lock.
    pub requester_mode: LockMode,
    /// Urgency: Normal (best-effort timing) or Urgent (minimize delay,
    /// used for fencing and reservation preemption).
    pub urgency: RevocationUrgency,
}

#[repr(u8)]
pub enum RevocationUrgency {
    /// Normal revocation. Handler has the full timeout to complete.
    Normal = 0,
    /// Urgent revocation (fencing, reservation preempt). Handler should
    /// complete as quickly as possible. DLM reduces the timeout to 1 second.
    Urgent = 1,
}

/// Opaque handle to an acquired DLM lock. Returned by lock_acquire()
/// and lock_any_of(), used by lock_release() and lock_convert().
/// Contains the lock identity (resource + mode) and a version counter
/// to detect stale handles after lock migration or failover.
pub struct DlmLockHandle {
    /// Unique ID assigned by the DLM master at grant time.
    pub lock_id: u64,
    /// Name of the locked resource.
    pub resource_name: ResourceName,
    /// Granted lock mode (may differ from requested mode after convert).
    pub mode: LockMode,
    /// Version counter — incremented on each convert or migration.
    /// Used by the DLM to reject operations on stale handles.
    pub version: u64,
    /// Optional causal consistency attachment for DSM-coordinated locks.
    /// When a DLM lock protects DSM-managed pages, the lock release or
    /// downgrade message carries the releasing node's CausalStampWire so
    /// the next lock holder can verify causal ordering of DSM page updates.
    /// Set by `dsm_bind_lock()` when the lock is bound to a DSM region;
    /// `None` for non-DSM locks. On lock release/downgrade, if this field
    /// is `Some`, the CausalStampWire is serialized into the LOCK_DOWNGRADE
    /// or LOCK_RELEASE message payload sent to the lock master, which
    /// forwards it to the next granted holder.
    /// See [Section 6.6](06-dsm.md#dsm-coherence-protocol-moesi) §6.6 for the causal consistency
    /// protocol and CausalStampWire wire format.
    pub dsm_causal_stamp: Option<CausalStampWire>,
}

/// Handle binding a DLM lock to DSM dirty tracking.
/// Created by `dsm_bind_dirty_tracker()`, released on `lock_release()`.
/// While this binding is active, dirty page tracking on the locked
/// resource is forwarded to the DSM bitmap identified by `region_id`.
pub struct DsmLockBindingHandle {
    /// The underlying DLM lock that owns this binding.
    pub lock: DlmLockHandle,
    /// DSM region whose dirty bitmap is bound to this lock.
    pub region_id: DsmRegionId,
    /// Offset within the DSM dirty bitmap where this lock's
    /// dirty tracking begins. Used to partition a single DSM region
    /// across multiple lock-protected sub-ranges.
    pub bitmap_offset: u32,
}

15.15.5 Transport-Agnostic Lock Operations

The DLM uses the ClusterTransport trait (Section 5.10) for all network operations. Each peer's transport is obtained from PeerNode.transport (Arc<dyn ClusterTransport>). On RDMA peers, lock operations use one-sided RDMA atomics for lowest latency (~2-3 μs CAS round-trip, ~3-5 μs full acquire). On TCP peers, lock operations use serialized request-response messages (~50-200 μs). On CXL peers, hardware CAS provides the fastest path (~0.1-0.3 μs). The DLM protocol is identical across all transports; only the per-peer latency differs.

Transport selection at lockspace initialization: When a DLM lockspace is created or a node joins an existing lockspace, the DLM calls select_transport() (Section 5.5) for each peer in the lockspace. The selected transport is stored per-peer in DlmLockspace and used for all subsequent lock operations to that peer. Transport selection follows the standard priority: CXL shared memory > RDMA > TCP. If transport.supports_one_sided() returns true (RDMA, CXL), the DLM enables the CAS fast path for uncontested acquires. If supports_one_sided() returns false (TCP), all lock operations use the two-sided transport.send_reliable() path (protocol 2 below), which is still 5-10x faster than Linux's TCP-based DLM due to integrated kernel-to-kernel messaging without userspace daemon involvement.
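The priority order and the fast-path gate can be sketched as a simple ranking. This is a minimal illustration; the `TransportKind` enum and helper names here are hypothetical stand-ins for the Section 5.5 `select_transport()` machinery, not the actual trait.

```rust
/// Hypothetical stand-ins for the Section 5.5 transport machinery,
/// illustrating the selection priority and the CAS fast-path gate.
#[derive(Debug, PartialEq)]
enum TransportKind { Cxl, Rdma, Tcp }

impl TransportKind {
    /// Standard priority: CXL shared memory > RDMA > TCP.
    fn priority(&self) -> u8 {
        match self {
            TransportKind::Cxl => 0,
            TransportKind::Rdma => 1,
            TransportKind::Tcp => 2,
        }
    }

    /// One-sided operations (and thus the CAS fast path) are available
    /// on CXL and RDMA, but not on TCP.
    fn supports_one_sided(&self) -> bool {
        !matches!(self, TransportKind::Tcp)
    }
}

/// Pick the highest-priority transport both peers support.
fn select_transport(available: Vec<TransportKind>) -> Option<TransportKind> {
    available.into_iter().min_by_key(|t| t.priority())
}
```

The selected transport is cached per-peer at lockspace join, so this ranking runs once per peer, not per lock operation.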

Four protocol flows cover the full lock lifecycle:

1. Uncontested acquire (transport.atomic_cas(), ~3-5 μs on RDMA, ~50-200 μs on TCP)

When a resource has no current holders or only compatible holders, and the transport supports one-sided operations (transport.supports_one_sided() == true), the requesting node can acquire the lock via transport.atomic_cas() on the master's lock state word — a 64-bit value encoding the current lock state. On RDMA transports, this maps to a NIC-side RDMA Atomic CAS (zero remote CPU involvement). On TCP transports, the remote kernel thread performs the CAS locally and returns the old value:

/// 64-bit lock state word, laid out for RDMA Atomic CAS.
/// Stored in master's RDMA-accessible memory for each DlmResource.
///
///   bits [63:61] = current_mode (3 bits: 0=NL, 1=CR, 2=CW, 3=PR, 4=PW, 5=EX)
///   bits [60:48] = holder_count (13 bits: up to 8191 concurrent holders;
///                   sufficient for clusters with hundreds of peers, with
///                   margin for future expansion)
///   bits [47:0]  = sequence (48 bits: monotonic counter for ABA prevention)
///
/// IMPORTANT: current_mode encodes a SINGLE lock mode. This means the CAS fast
/// path only works for HOMOGENEOUS holder sets — all holders must be in the same
/// mode. When holders have different compatible modes (e.g., CR + PR, or CR + PW),
/// the CAS word cannot represent the mixed state. These transitions MUST use the
/// two-sided RDMA Send path (protocol 2 below), where the master's control thread
/// maintains per-holder mode information in the full DlmResource granted queue.
///
/// This is a deliberate design tradeoff: the CAS fast path covers the most common
/// lock patterns in practice:
///   - EX for exclusive write access (single writer)
///   - PR for shared read access (multiple readers)
///   - CR for concurrent read (e.g., GFS2 inode attribute reads via LVB)
/// Mixed-mode combinations (CR+PR, CR+PW, CR+CW) are valid but uncommon in
/// GFS2 workloads — they arise primarily during mode transitions (one node
/// downgrades while another acquires). The two-sided path at ~5-8μs is still
/// 5-10x faster than Linux's TCP-based DLM.
///
/// ABA safety: 48-bit sequence counter. At 500,000 lock ops/sec on a single
/// resource (sustained maximum), wrap time = 2^48 / 500,000 = ~563 million
/// seconds (~17.8 years). This eliminates ABA as a practical concern.
/// (Note: this is the CAS lock-word sequence counter, which increments once
/// per lock acquisition. The LVB sequence counter in Section 15.12.3 wraps in ~9.2
/// years because it increments twice per write — once at begin_write, once
/// at end_write — giving 1M increments/sec at 500K writes/sec.)
///
/// The full granted/converting/waiting queues are maintained separately in the
/// master's local memory. The CAS word is a fast-path optimization — it
/// encodes enough state for common homogeneous transitions without remote CPU
/// involvement. The master's granted queue is the authoritative lock state;
/// the CAS word is a cache of that state for the fast path.

CAS fast path cases (homogeneous mode only):

| Transition | CAS expected | CAS desired | Ops | Notes |
|---|---|---|---|---|
| Unlocked → EX | NL\|0\|seq | EX\|1\|seq+1 | 1 CAS | First exclusive holder |
| Unlocked → PR | NL\|0\|seq | PR\|1\|seq+1 | 1 CAS | First protected reader |
| Unlocked → CR | NL\|0\|seq | CR\|1\|seq+1 | 1 CAS | First concurrent reader |
| PR → PR (add reader) | PR\|K\|seq | PR\|K+1\|seq+1 | Read + CAS | Add same-mode holder |
| CR → CR (add reader) | CR\|K\|seq | CR\|K+1\|seq+1 | Read + CAS | Add same-mode holder |
| EX → NL (unlock) | EX\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last holder releases |
| PR → NL (last reader) | PR\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last holder releases |
| CR → NL (last reader) | CR\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last holder releases |
| PR (remove reader) | PR\|K\|seq | PR\|K-1\|seq+1 | Read + CAS | K>1, decrement count |
| CR (remove reader) | CR\|K\|seq | CR\|K-1\|seq+1 | Read + CAS | K>1, decrement count |
| Unlocked → PW | NL\|0\|seq | PW\|1\|seq+1 | 1 CAS | Single PW holder (PW+PW incompatible) |
| Unlocked → CW | NL\|0\|seq | CW\|1\|seq+1 | 1 CAS | First concurrent writer |
| CW → CW (add writer) | CW\|K\|seq | CW\|K+1\|seq+1 | Read + CAS | CW is self-compatible (per Section 15.15 matrix) |
| CW → NL (last writer) | CW\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last CW holder releases |
| CW (remove writer) | CW\|K\|seq | CW\|K-1\|seq+1 | Read + CAS | K>1, decrement count |

Transitions that CANNOT use CAS (require two-sided path):

- Any mode conversion (e.g., PR→EX, EX→PR, CR→PW)
- Acquiring a mode different from current holders (e.g., CW when current_mode=CR, or PR when current_mode=CW)
- Adding a second PW holder (PW is not self-compatible)

These transitions require the master's control thread to evaluate the full compatibility matrix and update per-holder mode tracking in the granted queue.

Requester                                   Master (remote memory)
    |                                            |
    |--- transport.atomic_cas(master,       ---->|
    |    lock_word_addr,                         |
    |    expected=UNLOCKED,                      |
    |    desired=EX|1|seq+1)                     |
    |<-- old_value (CAS result) ------------------|
    |                                            |
    If old_value matched expected: lock acquired.|
    RDMA: zero remote CPU involvement, ~2-3 μs.  |
    TCP: server-side CAS, ~50-200 μs.            |
    Full acquire (CAS + confirmation): ~3-5 μs   |
    on RDMA, ~100-400 μs on TCP.                 |

For the Read+CAS path (adding a shared reader when holders exist), the requester first reads the current state (transport.fetch_page() or transport.send_reliable() for small reads), then calls transport.atomic_cas() to atomically increment the holder count. Total: 2 transport operations (~3-5 μs on RDMA, ~100-400 μs on TCP). CAS failure (due to concurrent modification) triggers retry with the returned value as the new expected value.
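The Read+CAS retry loop can be sketched as follows. A local `AtomicU64` stands in for the master's remote lock word; a real implementation would issue `transport.fetch_page()` and `transport.atomic_cas()` instead, and the function name is illustrative only.

```rust
// Illustrative sketch of the Read+CAS "add shared holder" path.
// `AtomicU64` models the master's remote lock word for demonstration.
use std::sync::atomic::{AtomicU64, Ordering};

const MODE_SHIFT: u32 = 61;
const COUNT_SHIFT: u32 = 48;
const SEQ_MASK: u64 = (1u64 << 48) - 1;

/// Try to add one more same-mode holder: expected = mode|K|seq,
/// desired = mode|K+1|seq+1. On CAS failure, retry with the returned
/// value as the new expected value; on a mode mismatch, return Err so
/// the caller falls back to the two-sided path.
fn add_shared_holder(word: &AtomicU64, want_mode: u64) -> Result<u64, u64> {
    let mut current = word.load(Ordering::Acquire); // the initial "Read"
    loop {
        if current >> MODE_SHIFT != want_mode {
            return Err(current); // heterogeneous state: two-sided fallback
        }
        let count = (current >> COUNT_SHIFT) & 0x1FFF;
        let seq = current & SEQ_MASK;
        let desired = (want_mode << MODE_SHIFT)
            | ((count + 1) << COUNT_SHIFT)
            | ((seq + 1) & SEQ_MASK);
        match word.compare_exchange(current, desired, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return Ok(desired),
            Err(old) => current = old, // concurrent modification: retry
        }
    }
}
```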

Important: The CAS word is an optimization for the uncontested fast path. It does NOT replace the full lock queues maintained in the master's local memory. When a CAS succeeds, the acquiring node MUST send transport.send_reliable() to the master confirming its identity (node ID) and the acquired lock mode. The master updates the full granted queue upon receiving this confirmation.

If the master does not receive confirmation within ~500 μs (the confirmation timeout), it assumes the CAS winner crashed before completing the acquire and resets the lock state word via its own CAS (restoring the pre-acquire state). The CAS target word includes a generation counter (the 48-bit sequence field) to prevent ABA issues during this reclamation — the master's restoration CAS uses the post-acquire sequence value as the expected value, so a concurrent legitimate acquire by another node will not be clobbered. This confirmation step is a required correctness measure, not an optimization: without it, if the CAS winner crashes before the master processes its queue entry, recovery would iterate the granted queue and find no record of the holder, leaving the lock state word permanently wedged.

When a CAS fails (contested lock, incompatible mode), the requester falls back to the two-sided protocol below. The master's control thread is the sole authority for complex operations (conversions, waiters, deadlock detection).

CAS outcome determination and transport failure recovery. RDMA Atomic CAS is a single round-trip operation: the RNIC performs the compare-and-swap on the remote memory and returns the previous value of the target word in the CAS completion. The requester determines the CAS outcome entirely from this return value — if the returned old value matches the expected value, the CAS succeeded and the lock is held. No separate "confirmation response" from the master's CPU is involved in determining CAS success or failure; the RDMA NIC hardware handles the entire operation atomically. This means the requester always knows whether it acquired the lock, as long as the RDMA completion is delivered.

If the RDMA transport itself fails during a CAS operation (e.g., the Queue Pair enters Error state due to a link failure, cable pull, or remote RNIC reset), the requester receives a Work Completion with an error status (not a successful CAS completion). In this case, the CAS may or may not have been applied to the master's memory — the requester cannot distinguish between "CAS was never sent", "CAS was sent but not executed", and "CAS succeeded but the response was lost in transit." The requester must handle this ambiguity:

  1. Assume the CAS may have succeeded. The requester must not retry the CAS blindly (doing so could double-acquire or corrupt the sequence counter).
  2. Query the master via a recovery path. The requester establishes a fresh RDMA connection (or uses a separate TCP fallback if the RDMA fabric is partitioned) and sends a two-sided lock state query to the master's control thread. The master reads its authoritative lock state — the CAS word in registered memory — and responds with the current lock state plus the sequence counter value.
  3. Master's lock word is ground truth. If the CAS word shows the requester's expected post-CAS value (matching mode, holder count, and sequence), the CAS succeeded and the requester proceeds with the confirmation RDMA Send (on the new connection). If the CAS word shows a different state, the CAS either was not applied or was already reclaimed by the master's confirmation timeout (the ~500 μs timeout described above). In either case, the requester starts a fresh lock acquisition attempt.
  4. Interaction with confirmation timeout. If the CAS succeeded but the requester takes longer than ~500 μs to query the master (due to connection re-establishment), the master may have already reclaimed the lock via its confirmation timeout logic. This is safe: the master's reclamation CAS uses the post-acquire sequence value, so if reclamation occurred, the lock word has been reset and the requester's recovery query will see the reset state. The requester then re-acquires normally.

This recovery path is exercised rarely (only on RDMA transport failures, not on normal CAS contention), so its higher latency (~1-5 ms for connection re-establishment + query) does not affect steady-state performance.

Pending CAS confirmation window: Between a successful CAS and the arrival of the confirmation Send, the CAS word and the master's granted queue are temporarily inconsistent — the CAS word shows a lock held, but the granted queue has no entry. During this window, if another node's CAS fails and it falls back to the two-sided path, the master must handle the discrepancy correctly:

  1. When the master receives a two-sided lock request, it checks BOTH the granted queue AND the CAS word state. If the granted queue is empty but the CAS word shows a held lock, the master knows a CAS confirmation is pending.
  2. The master enqueues the incoming request in the waiting queue and defers processing until either: (a) the CAS confirmation arrives (at which point the granted queue is updated and the waiting queue is processed normally), or (b) the confirmation timeout expires (at which point the master resets the CAS word and processes the waiting queue against the now-empty granted queue).
  3. If the pending CAS mode is compatible with the incoming request's mode (per the Section 15.15 compatibility matrix), the master grants the incoming request immediately without waiting for the CAS confirmation. The master also updates the CAS word via its own local CAS to reflect the new holder (incrementing holder_count in the CAS word to account for both the pending CAS winner and the newly granted node). The CAS winner's confirmation, when it arrives, simply adds the CAS winner to the already-updated granted queue. This eliminates the blocking window entirely for same-mode shared requests (e.g., multiple concurrent PR acquires), which are the most common contested case.
  4. For incompatible-mode requests, this deferred processing adds at most 500 μs of latency to the second node's request in the worst case (CAS winner crashed). In the normal case, the confirmation arrives within ~1-2 μs (one RDMA Send), so the deferred processing completes almost immediately. A crashed node's 500 μs delay is negligible compared to the 50-200 ms DLM recovery time.
  5. The master tracks pending CAS confirmations with a per-resource pending_cas: ArrayVec<PendingCas, MAX_PENDING_CAS> field (see DlmResource struct in Section 15.15). A bounded collection is required — not Option<PendingCas> — because shared-mode CAS operations (e.g., PR acquires) allow multiple peers to win concurrently: each successive shared-mode CAS increments the holder_count field embedded in the CAS word and updates the sequence number, so two or more nodes can complete their CAS atomics before any confirmation arrives.

     The master must reconcile ALL concurrent CAS winners: it reads the final CAS word once all confirmations have arrived (or the polling timeout expires) and uses the holder_count to verify that the number of confirmations received matches the number of nodes that successfully CAS'd. Any node whose confirmation does not arrive within the timeout is treated as crashed and is excluded from the granted queue. For exclusive-mode CAS (EX, PW), at most one node can win — the CAS word format enforces mutual exclusion — so the collection will contain at most one entry in that case.

     The pending_cas field is set when the master observes a CAS word change via periodic polling of the CAS word in its registered memory region, and cleared when all confirmations have arrived or the timeout expires. Note: the master does NOT receive RDMA completion queue notifications for remote CAS operations (one-sided RDMA is CPU-transparent at the responder). Detection relies on the master's targeted polling of CAS words with pending requests only — the master maintains a per-lockspace pending set of resources with outstanding CAS operations, and polls only those CAS words (poll interval: ~100μs per pending resource). Resources with no pending CAS operations are not polled, so the CPU overhead scales with O(pending), not O(total_resources). On a lockspace with 10,000 resources but only 50 with pending CAS operations, polling generates ~500K polls/second — manageable on a single core.
Optimization note: For workloads with consistently high pending-CAS counts (>100), an interrupt-driven notification path is available: the requesting node sends a two-sided RDMA Send to the master after completing its CAS, triggering a completion queue event instead of requiring polling. The master switches to interrupt-driven mode per-resource when the pending count exceeds a configurable threshold (default: 100). This trades higher per-lock latency (~1μs CQ processing vs ~0.1μs poll) for reduced CPU overhead.

Security: RDMA CAS access to the lock state word is controlled via RDMA memory registration (Memory Regions / MRs). The master registers each lockspace's CAS word array as a separate RDMA MR and distributes the Remote Key (rkey) only to nodes that hold CAP_DLM_LOCK for that lockspace. Capability verification happens at lockspace join time (a two-sided RDMA Send to the master, which checks CAP_DLM_LOCK via umka-core's capability system before returning the rkey). Nodes that lose CAP_DLM_LOCK have their rkey revoked via RDMA MR re-registration (which invalidates the old rkey). This enforces the capability boundary at the RDMA transport layer — a node without the rkey physically cannot issue CAS operations to the lock state words. The rkey is per-lockspace, so CAP_DLM_LOCK scoping (Section 15.15) maps directly to RDMA access control.

Rkey lifetime and TOCTOU safety: RDMA rkeys are registered for the lifetime of the node's DLM membership in the lockspace, not per-operation. When a node joins a lockspace, the master registers the RDMA Memory Region and returns the rkey; when the node leaves (graceful or fenced), the MR is deregistered and the rkey is invalidated. This eliminates TOCTOU (time-of-check-to-time-of-use) races: a node that passes the capability check at join time retains a valid rkey for all subsequent lock operations until membership ends. Rkey revocation (for CAP_DLM_LOCK loss) uses RDMA MR re-registration, which atomically invalidates the old rkey -- any in-flight CAS using the old rkey will fail with a remote access error (IBA v1.4 Section 14.6.7.2: deregistered MR causes Remote Access Error completion), and the node must re-join the lockspace (re-passing the capability check) to obtain a new rkey.

Revocation ordering: The MR re-registration is the authoritative enforcement mechanism — it must complete before the capability is marked as revoked in the local capability table. Sequence: (1) master calls dereg_mr() on the RNIC, which invalidates the rkey in hardware; (2) master updates the lockspace membership record (removes node); (3) capability revocation propagates to the evicted node. This ordering ensures no window exists where the capability is revoked but the rkey is still valid. If the evicted node races a CAS between steps (1) and (3), the RNIC rejects it (rkey already invalid). Rkey revocation is hardware-enforced with < 1ms latency from the dereg_mr() call — there is no exposure window. This is the same eager dereg_mr() mechanism used for cluster membership revocation (Section 5.8); the 180s rkey rotation grace period described in Section 5.3 (Mitigation 2) is a separate defense-in-depth against rkey leakage to non-cluster entities, not the revocation path for DLM membership loss.

2. Contested acquire (transport.send_reliable(), ~5-8 μs on RDMA, ~100-400 μs on TCP)

When the CAS fails (resource is already locked in an incompatible mode), or when the transport does not support one-sided operations (transport.supports_one_sided() == false), the requester uses a two-sided exchange via transport.send_reliable():

Requester                                   Master
    |                                            |
    |--- transport.send_reliable(master,    ---->|
    |    lock_request_msg)                       |
    |                                [enqueue in waiting list]
    |                                [check compatibility]
    |                                [if compatible: grant]
    |<-- transport.send_reliable(requester, -----|
    |    lock_grant_msg + LVB)                   |
    |                                            |
    RDMA: 2 RDMA Send round-trips (~5-8 μs).    |
    TCP: 2 TCP request-response (~100-400 μs).   |

The master's kernel thread processes the request, checks compatibility against the granted queue, and either grants immediately or enqueues for later grant.

3. Lock conversion (upgrade/downgrade)

A node holding a lock can convert it to a different mode without releasing and reacquiring. Conversions use the same protocol as contested acquire (RDMA Send to master). The converting queue is processed before the waiting queue — a conversion request from an existing holder takes priority over new requests.

Common conversions:

- PR → EX: upgrade from read to write (e.g., before modifying an inode)
- EX → PR: downgrade from write to read (triggers targeted writeback, Section 15.15)
- EX → NL: release write lock but keep queue position (for future reacquire)

4. Batch request (up to 64 locks, ~5-10 μs on RDMA, ~150-500 μs on TCP)

Multiple lock requests destined for the same master are grouped into a single transport message:

Requester                                   Master
    |                                            |
    |--- transport.send_reliable(master,    ---->|
    |    batch_msg: 8 lock requests)             |
    |                                [process all 8]
    |<-- transport.send_reliable(requester, -----|
    |    batch_response: 8 grants/queued)        |
    |                                            |
    RDMA: ~5-10 μs for 8 locks.                 |
    TCP: ~150-500 μs for 8 locks.               |
    Linux DLM: 8 × 30-50 μs = 240-400 μs.      |

Batch requests are critical for operations that require multiple locks atomically. A rename() requires locks on the source directory, destination directory, and the file being renamed — three locks that can be batched into a single network operation when they share the same master.

When batch locks span multiple masters, the requester sends one batch per master in parallel and waits for all grants. Worst case: N masters = N parallel RDMA operations completing in max(individual latencies) rather than sum(individual latencies).

15.15.6 Lease-Based Lock Extension

Problem solved: Linux DLM's BAST (Blocking AST) callback storms.

In Linux, when a node requests a lock in a mode incompatible with current holders, the DLM sends a BAST callback to every holder. For a popular file with 100 readers (PR mode), a writer requesting EX mode triggers 100 BAST messages — O(N) network traffic per contention event. On large clusters (64+ nodes), this becomes a significant source of network overhead.

UmkaOS's lease-based approach:

  • Every granted lock includes a lease duration (configurable per resource type):
      • Metadata locks: 30 seconds default
      • Data locks: 5 seconds default
      • Application locks: configurable (1-300 seconds)

  • Lease extension: Holders extend their lease cheaply via transport.push_page() to update a timestamp in the master's lease table. On RDMA transports, this is a single one-sided RDMA Write (zero master CPU involvement, ~1-2 μs). On TCP transports, this is a request-response pair (~50-200 μs). Cost is amortized because renewals happen at 50% of lease duration (e.g., every 15s for 30s metadata leases).

  • Revocation strategy:

      • Uncontended resource: No revocation needed. Holders extend leases indefinitely. Minimal network traffic for uncontended locks — only periodic one-sided RDMA lease renewals, which do not interrupt the remote CPU (vs. Linux's periodic BAST heartbeats that require CPU processing on every node).
      • Contended resource (incompatible request arrives): Master checks lease expiry for all incompatible holders. If all leases have expired, master grants to new requester immediately. If any leases are active, master sends revocation messages to those holders. For the worst case (EX request on a resource with K active CR/PR holders), this is O(K) revocations — the same as Linux's BAST count. The improvement over Linux is for the common case: uncontended resources have zero CPU-consuming traffic — only one-sided RDMA lease renewals that bypass the remote CPU (Linux BASTs are sent even for uncontended downgrade requests and require CPU processing on the receiving node), and resources where most holders' leases have naturally expired need only revoke the few remaining active holders.
      • Emergency revocation: For locks with NOQUEUE flag (non-blocking), the master immediately checks compatibility and returns EAGAIN if blocked. No revocation attempted.

  • Correctness guarantee: Lease expiry is a sufficient condition for revocation — if a holder fails to extend its lease, the master knows the lock can be safely reclaimed. For contended resources, the fallback to immediate revocation (single targeted message) preserves correctness identically to Linux's BAST mechanism.

  • Clock skew safety: Lease timing is master-clock-relative only. The master is the sole arbiter of lease validity. To handle clock skew between holder and master:

      • Grant messages include the master's absolute expiry timestamp.
      • Holders renew at 50% of lease duration (e.g., 15s for a 30s metadata lease), providing a safety margin larger than any reasonable clock skew (seconds).
      • Holders track the master's clock offset from grant/renewal responses and adjust their renewal timing accordingly.
      • If a holder discovers its lease was revoked (via a failed extension response), it must immediately stop using cached data and flush any dirty pages before reacquiring the lock. This is the hard correctness boundary: the holder's opinion of lease validity does not matter — only the master's.
      • NTP or PTP synchronization is recommended but not required for correctness. The protocol is safe with unbounded clock skew — only the renewal safety margin shrinks, increasing the probability of unnecessary revocations (performance, not correctness).

  • Network traffic reduction: From O(N) BASTs per contention event to O(1) for uncontended resources (no active holders — just clear the lease) and O(K) for contended resources with K active holders. Cluster-wide lock traffic is reduced by orders of magnitude on large clusters.
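The renewal timing described above can be sketched as a small computation: interpret the master's absolute expiry through the tracked clock offset, then renew when half the lease remains. The function name and microsecond units are illustrative assumptions, not the spec's API.

```rust
/// Illustrative lease-renewal timing (names hypothetical). All times in
/// microseconds on the holder's local clock unless noted.
///
/// `master_offset_us` = master_clock - local_clock, estimated from
/// grant/renewal responses as described above.
fn next_renewal_local_us(
    master_expiry_us: u64,   // absolute expiry from the grant message (master clock)
    lease_duration_us: u64,  // e.g., 30_000_000 for a metadata lock
    master_offset_us: i64,   // positive if the master's clock runs ahead
) -> u64 {
    // Convert the master's expiry timestamp into local-clock terms.
    let local_expiry = (master_expiry_us as i64 - master_offset_us) as u64;
    // Renew at the 50% point: when half the lease duration remains.
    local_expiry.saturating_sub(lease_duration_us / 2)
}
```

Note how a positive offset (master ahead) pulls the renewal earlier on the local clock, which is the safe direction: the holder never trusts its own clock past the master's expiry.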

15.15.7 Speculative Multi-Resource Lock Acquire

Problem solved: GFS2 resource group contention.

GFS2 must find a resource group (rgrp) with free blocks before allocating file data. In Linux, this is sequential: try rgrp 0, if locked → full round-trip (~30-50 μs); try rgrp 1, if locked → another round-trip. On a busy cluster with 8 rgrps, worst case is 8 × 30-50 μs = 240-400 μs just to find a free rgrp.

UmkaOS's lock_any_of() primitive:

/// Request an exclusive lock on ANY ONE of the provided resources.
/// The DLM tries all resources and grants the first available one.
/// Returns the index of the granted resource and the lock handle.
pub fn lock_any_of(
    resources: &[ResourceName],
    mode: LockMode,
    flags: LockFlags,
) -> Result<(usize, DlmLockHandle), DlmError>;

The requester sends a single message listing N candidate resources. The master (or masters, if resources span multiple masters) evaluates each candidate and grants the first one that is available in the requested mode.

Requester                              Master(s)
    |                                       |
    |--- "Lock any of [rgrp0..rgrp7]" ---->|
    |                           [try rgrp0: locked]
    |                           [try rgrp1: locked]
    |                           [try rgrp2: FREE → grant]
    |<-- "Granted: rgrp2" ------------------|
    |                                       |
    Total: ~5-10 μs (single round-trip).   |
    Linux: up to 8 × 30-50 μs = 240-400 μs.|

For resources spanning multiple masters, the requester sends parallel requests to each master. The first grant received is accepted; the requester cancels the remaining requests. Cancel uses a two-phase protocol: (1) send CANCEL to all nodes where the lock was requested, (2) wait for either CANCEL_ACK or GRANT from each. A GRANT that arrives after the CANCEL was sent is released immediately via an unconditional UNLOCK message. This prevents double-grant: at most one resource is held after lock_any_of() returns, regardless of message reordering.
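The master-side evaluation is a first-fit scan over the candidates. A minimal sketch, with the compatibility check passed in as a closure standing in for the Section 15.15 matrix (the function name and the use of raw mode numbers, 0=NL, are illustrative assumptions):

```rust
/// Illustrative master-side evaluation for lock_any_of(): scan candidates
/// in request order and grant the first whose currently held mode is
/// compatible with the requested mode. `compatible` stands in for the
/// full Section 15.15 compatibility matrix.
fn grant_first_available(
    held_modes: &[u8],                     // current mode per candidate (0 = NL/free)
    requested: u8,                         // requested mode for all candidates
    compatible: impl Fn(u8, u8) -> bool,   // (held, requested) -> grantable?
) -> Option<usize> {
    held_modes.iter().position(|&held| compatible(held, requested))
}
```

For the GFS2 rgrp case, `held_modes` would be the lock words of rgrp0..rgrp7 and the returned index identifies the rgrp the requester may allocate from.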

15.15.8 Targeted Writeback on Lock Downgrade

Problem solved: Linux's "flush ALL pages" on lock drop.

In Linux, when a node holding an EX lock on a GFS2 inode downgrades to PR or releases to NL, the kernel must flush ALL dirty pages for that inode to disk. This is because Linux's page cache has no concept of which pages were dirtied under which lock — the dirty tracking is per-inode, not per-lock-range.

For a 100 GB file where only 4 KB was modified, Linux flushes ALL dirty pages (which could be the entire file if it was recently written). This turns a lock downgrade into a multi-second I/O operation.

UmkaOS's per-lock-range dirty tracking:

The DLM integrates with the VFS layer (Section 14.7) to track dirty pages per lock range:

/// A 512-byte chunk holding 64 consecutive u64 words of the bitmap.
/// Allocated from the slab allocator as a unit; freed when all 64 words
/// become zero. One chunk covers 64 × 64 = 4,096 bit positions.
pub struct SparseBitmapChunk {
    /// 64 consecutive bitmap words. Index within the chunk is `(bit / 64) % 64`.
    pub words: [u64; 64],
}

/// Sparse bitmap for tracking dirty page ranges.
///
/// Two-level structure:
/// - **Top level**: a 64-bit presence word per chunk. Bit `c` of `top` is set
///   whenever `chunks[c]` is allocated (i.e., has at least one set bit). This
///   allows O(1) skip of empty chunks during iteration.
/// - **Bottom level**: up to 64 chunk slots, each covering 64 u64 words.
///
/// A chunk is allocated on the first `set()` that falls within it and freed
/// when the last `clear()` empties all 64 words. Maximum coverage:
/// 64 chunks × 64 words × 64 bits = 262,144 tracked positions.
///
/// **Addressing**: bit `b` maps to chunk `b / 4096`, word-in-chunk
/// `(b / 64) % 64`, bit-in-word `b % 64`.
///
/// **Allocation cost**: O(set_chunks), not O(set_bits). A fully-dense
/// 262,144-bit bitmap requires 64 slab allocations of 512 bytes each,
/// versus 4,096 individual allocations under the old per-word scheme.
/// Cache locality: all 64 words of a chunk occupy 8 consecutive cache lines,
/// so sequential scans stay within L1 for the active chunk.
///
/// Used by DLM targeted writeback ([Section 15.15](#distributed-lock-manager--targeted-writeback-on-lock-downgrade)) to track dirty pages
/// within a lock range.
pub struct SparseBitmap {
    /// Top-level presence map. Bit `c` is set iff `chunks[c]` is `Some(_)`.
    /// Allows fast iteration: `leading_zeros()` / `trailing_zeros()` locate
    /// the next non-empty chunk in one instruction.
    pub top: u64,
    /// Chunk slots. `None` means the chunk is all-zeros and not allocated.
    /// 64 slots × 512 bytes/chunk = 32 KiB maximum live data.
    pub chunks: [Option<Box<SparseBitmapChunk>>; 64],
    /// Total number of set bits across all chunks. Maintained by `set()`
    /// and `clear()`. Allows O(1) `is_empty()` and density checks.
    pub popcount: u32,
}
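The documented addressing can be demonstrated with a standalone, simplified version of the structure. This sketch implements only `set()` and `test()` (it omits `clear()` and chunk freeing), and uses `Box<[u64; 64]>` in place of `SparseBitmapChunk`:

```rust
/// Minimal standalone sketch of the SparseBitmap addressing:
/// bit b -> chunk b/4096, word-in-chunk (b/64)%64, bit-in-word b%64.
/// Omits clear()/chunk-freeing for brevity.
struct SparseBitmap {
    top: u64,                               // bit c set iff chunks[c] is Some
    chunks: [Option<Box<[u64; 64]>>; 64],   // lazily allocated 512-byte chunks
    popcount: u32,                          // total set bits
}

impl SparseBitmap {
    fn new() -> Self {
        Self { top: 0, chunks: std::array::from_fn(|_| None), popcount: 0 }
    }

    fn set(&mut self, bit: u32) {
        let (c, w, b) = ((bit / 4096) as usize, ((bit / 64) % 64) as usize, bit % 64);
        // Allocate the chunk on first set within its 4,096-bit range.
        let chunk = self.chunks[c].get_or_insert_with(|| Box::new([0u64; 64]));
        if chunk[w] & (1u64 << b) == 0 {
            chunk[w] |= 1u64 << b;
            self.popcount += 1;
        }
        self.top |= 1u64 << c; // mark chunk as present in the top-level word
    }

    fn test(&self, bit: u32) -> bool {
        let (c, w, b) = ((bit / 4096) as usize, ((bit / 64) % 64) as usize, bit % 64);
        self.chunks[c].as_ref().map_or(false, |ch| ch[w] & (1u64 << b) != 0)
    }
}
```

Iteration skips empty chunks in O(1) by scanning `top` with `trailing_zeros()`, exactly as the presence-word comment above describes.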

/// Sparse bitmap for tracking locked byte ranges.
///
/// A flat `SparseBitmap` covers 262,144 bit positions. When each bit represents
/// a 4 KiB page, that covers 1 GiB — sufficient for most files. However, the
/// DLM must track byte-range locks on files that can be much larger (e.g.,
/// 100 GB NFS exports). `LargeRangeBitmap` provides a two-level fallback:
///
/// - **Files ≤ 1 GiB** (common case): uses a flat `SparseBitmap` directly.
///   Zero overhead versus the existing flat bitmap — `small` is `Some(bitmap)`,
///   `large` is `None`.
/// - **Files > 1 GiB**: uses a two-level structure where each top-level slot
///   covers a 1 GiB region and is lazily allocated as a `SparseBitmap` when
///   first needed.
pub struct LargeRangeBitmap {
    /// For files ≤ 1 GiB (common case): flat bitmap.
    small: Option<SparseBitmap>,
    /// For files > 1 GiB: array of 1 GiB-covering SparseBitmaps, lazily allocated.
    /// Index N covers byte range [N * 2^30, (N+1) * 2^30).
    /// Maximum file size supported: 1 TiB (1024 slots × 1 GiB each).
    /// Uses `Box<SparseBitmap>` per-slot to keep the top-level array small:
    /// 1024 * 8 = 8 KiB (pointers only). Individual SparseBitmaps (~528 bytes
    /// each) are heap-allocated only on first `set()` to that slot.
    large: Option<Box<[Option<Box<SparseBitmap>>; 1024]>>,
    /// Total file size in bytes (determines which level to use).
    file_size: u64,
}

LargeRangeBitmap design notes:

  • Lazy transition: The bitmap starts in small mode. On the first set() call targeting a bit position beyond the 1 GiB boundary (bit index ≥ 262,144), the small bitmap is moved into slot 0 of the newly-allocated large array, and small is set to None. Subsequent accesses compute the slot index as bit / 262_144 and the intra-slot bit index as bit % 262_144.

  • Two levels of lazy allocation: (1) The large array itself (8 KiB of Option<Box<SparseBitmap>> pointers) is heap-allocated only when needed (files > 1 GiB that actually have locks past the 1 GiB boundary). (2) Within the large array, each slot's Box<SparseBitmap> is allocated on first set() to that slot — empty slots remain None (8-byte null pointer).

  • Maximum coverage: 1 TiB (1024 slots × 1 GiB each). Files larger than 1 TiB use coarse-grained lock tracking: byte-range locks map to 1 GiB granules, with potential false conflicts for adjacent byte ranges within the same 1 GiB granule. This is acceptable because files > 1 TiB with fine-grained byte-range locking are extremely rare in practice; whole-file or large-region locks dominate.

  • Performance: For files ≤ 1 GiB (the common case), zero overhead versus the existing flat SparseBitmap — one branch on small.is_some(). For large files, each access adds one pointer dereference (slot lookup) plus the existing SparseBitmap O(1) per-bit cost.

  • range_coverage_bytes() -> u64: Returns the current maximum byte range the bitmap can track at full granularity. In small mode: 1 GiB (262,144 × 4 KiB). In large mode: 1 TiB (1024 × 1 GiB). For files beyond 1 TiB: returns file_size (coarse tracking covers the entire file, but at 1 GiB granularity beyond the 1 TiB fine-grained limit).
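The lazy small→large transition described above can be sketched as follows. This is a minimal illustration, not the kernel implementation: `SparseBitmap` is stubbed as a `HashSet<u64>` so that only the promotion logic and slot arithmetic are shown, and the >1 TiB coarse-granule path is omitted.

```rust
use std::collections::HashSet;

type SparseBitmap = HashSet<u64>; // stand-in for the real flat bitmap

const BITS_PER_SLOT: u64 = 262_144; // 1 GiB of coverage at 4 KiB/page

struct LargeRangeBitmap {
    small: Option<SparseBitmap>,
    large: Option<Box<[Option<Box<SparseBitmap>>; 1024]>>,
}

impl LargeRangeBitmap {
    fn new() -> Self {
        Self { small: Some(HashSet::new()), large: None }
    }

    fn set(&mut self, bit: u64) {
        // Common case: flat bitmap, bit within the first 1 GiB.
        if self.large.is_none() && bit < BITS_PER_SLOT {
            self.small.as_mut().unwrap().insert(bit);
            return;
        }
        // Promotion: move the flat bitmap into slot 0 of a fresh large array.
        if self.large.is_none() {
            let mut slots: Box<[Option<Box<SparseBitmap>>; 1024]> =
                Box::new(std::array::from_fn(|_| None));
            slots[0] = Some(Box::new(self.small.take().unwrap()));
            self.large = Some(slots);
        }
        let (slot, local) = ((bit / BITS_PER_SLOT) as usize, bit % BITS_PER_SLOT);
        self.large.as_mut().unwrap()[slot]
            .get_or_insert_with(|| Box::new(HashSet::new())) // lazy slot alloc
            .insert(local);
    }

    fn test(&self, bit: u64) -> bool {
        match (&self.small, &self.large) {
            (Some(s), _) => s.contains(&bit),
            (_, Some(l)) => {
                let (slot, local) =
                    ((bit / BITS_PER_SLOT) as usize, bit % BITS_PER_SLOT);
                l[slot].as_ref().map_or(false, |s| s.contains(&local))
            }
            _ => false,
        }
    }
}
```

Note how a bit set past the boundary permanently switches the structure to large mode; subsequent sub-1-GiB accesses go through slot 0.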

SparseBitmap method contracts:

  • set(b: u64): Computes chunk index c = b / 4096, word index w = (b / 64) % 64, bit index k = b % 64. If chunks[c] is None, allocates a SparseBitmapChunk from the slab allocator and sets bit c in top. Sets bit k in chunks[c].words[w]. If the bit was previously clear, increments popcount.

  • clear(b: u64): Computes (c, w, k) as above. Clears bit k in chunks[c].words[w]. If the bit was set, decrements popcount. If all 64 words in chunks[c] are now zero, frees the chunk and clears bit c in top.

  • test(b: u64) -> bool: Computes (c, w, k). If chunks[c] is None, returns false. Otherwise returns (chunks[c].words[w] >> k) & 1 != 0.

  • iter_set() -> impl Iterator<Item = u64>: Iterates over set chunk indices using top.trailing_zeros() / bit-clear loop. Within each chunk, iterates over non-zero words using words[w].trailing_zeros(). Yields absolute bit positions. Total cost: O(set_chunks + set_bits).
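The method contracts above can be realized by the following sketch. It mirrors the documented layout (64 chunks × 64 words × 64 bits) but substitutes `Box` and the heap for the kernel slab allocator, and collects `iter_set()` into a `Vec` for simplicity; treat it as illustrative, not the in-tree code.

```rust
const WORDS_PER_CHUNK: usize = 64;
const BITS_PER_CHUNK: u64 = 4096; // 64 words x 64 bits

pub struct SparseBitmapChunk {
    pub words: [u64; WORDS_PER_CHUNK],
}

pub struct SparseBitmap {
    pub top: u64,
    pub chunks: [Option<Box<SparseBitmapChunk>>; 64],
    pub popcount: u32,
}

impl SparseBitmap {
    pub fn new() -> Self {
        SparseBitmap { top: 0, chunks: std::array::from_fn(|_| None), popcount: 0 }
    }

    /// bit b -> (chunk, word-in-chunk, bit-in-word) per the addressing rule.
    fn split(b: u64) -> (usize, usize, u32) {
        ((b / BITS_PER_CHUNK) as usize, ((b / 64) % 64) as usize, (b % 64) as u32)
    }

    pub fn set(&mut self, b: u64) {
        let (c, w, k) = Self::split(b);
        let chunk = self.chunks[c].get_or_insert_with(|| {
            Box::new(SparseBitmapChunk { words: [0; WORDS_PER_CHUNK] })
        });
        self.top |= 1 << c;
        if chunk.words[w] & (1 << k) == 0 {
            chunk.words[w] |= 1 << k;
            self.popcount += 1;
        }
    }

    pub fn clear(&mut self, b: u64) {
        let (c, w, k) = Self::split(b);
        let Some(chunk) = self.chunks[c].as_mut() else { return };
        if chunk.words[w] & (1 << k) != 0 {
            chunk.words[w] &= !(1 << k);
            self.popcount -= 1;
        }
        if chunk.words.iter().all(|&wd| wd == 0) {
            self.chunks[c] = None; // free the now-empty chunk
            self.top &= !(1 << c);
        }
    }

    pub fn test(&self, b: u64) -> bool {
        let (c, w, k) = Self::split(b);
        self.chunks[c].as_ref().map_or(false, |ch| ch.words[w] >> k & 1 != 0)
    }

    /// O(set_chunks + set_bits): `top` skips empty chunks, the bit-clear
    /// loop skips clear bits within each word.
    pub fn iter_set(&self) -> impl Iterator<Item = u64> {
        let mut out = Vec::new();
        let mut top = self.top;
        while top != 0 {
            let c = top.trailing_zeros() as u64;
            top &= top - 1;
            let chunk = self.chunks[c as usize].as_ref().unwrap();
            for (w, &word) in chunk.words.iter().enumerate() {
                let mut word = word;
                while word != 0 {
                    let k = word.trailing_zeros() as u64;
                    word &= word - 1;
                    out.push(c * BITS_PER_CHUNK + w as u64 * 64 + k);
                }
            }
        }
        out.into_iter()
    }
}
```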

/// Dirty page tracker associated with a DLM lock.
/// Tracks which pages were modified while this lock was held.
pub struct LockDirtyTracker {
    /// Byte range covered by this lock (for range locks).
    /// For whole-file locks: 0..u64::MAX.
    pub range: core::ops::Range<u64>,

    /// Bitmap of dirty pages within the lock's range.
    /// Indexed by (page_offset - range.start) / PAGE_SIZE.
    ///
    /// Uses `LargeRangeBitmap` to support files of any practical size:
    /// - Files ≤ 1 GiB: flat `SparseBitmap` (O(1) per page, zero overhead).
    /// - Files > 1 GiB: two-level structure with lazily-allocated 1 GiB slots.
    /// - Files > 1 TiB: coarse 1 GiB granule tracking (rare in practice).
    ///
    /// O(1) set/clear per page, O(dirty_chunks + dirty_bits) iteration.
    /// Slab allocation is per-chunk (512 bytes), not per set bit, keeping
    /// allocator pressure and fragmentation proportional to the number of
    /// 256 KB dirty regions rather than the number of dirty pages.
    pub dirty_pages: LargeRangeBitmap,

    /// Optional delegation to a DSM dirty bitmap. When a `DsmLockBinding`
    /// ([Section 6.12](06-dsm.md#dsm-subscriber-controlled-caching--dlm-token-binding)) is active
    /// for this lock, the DLM delegates all dirty tracking to the binding's
    /// `DsmDirtyBitmap` instead of maintaining its own `dirty_pages` bitmap.
    /// This avoids double bookkeeping and ensures a single source of truth.
    ///
    /// Set by `dsm_bind_lock()` at binding registration time; cleared by
    /// `dsm_unbind_lock()` at binding teardown.
    pub dsm_delegate: Option<DsmDirtyDelegate>,
}

/// Delegation handle connecting a DLM lock's dirty tracking to a
/// `DsmLockBinding`'s `DsmDirtyBitmap`. When present, all dirty page
/// tracking operations on `LockDirtyTracker` are forwarded to the
/// DSM bitmap.
pub struct DsmDirtyDelegate {
    /// Handle to the active DsmLockBinding that owns the canonical
    /// dirty bitmap. The DLM never reads or writes `dirty_pages`
    /// while this handle is live.
    pub binding_handle: DsmLockBindingHandle,
}

Dirty tracking delegation contract (DLM ↔ DSM):

When a DsmLockBinding is registered for a DLM lock (Section 6.12), the DLM's LockDirtyTracker and the DSM's DsmDirtyBitmap would both track the same set of dirty pages — one driven by VFS page-fault write paths (setting PTE dirty bits), the other by MOESI state transitions (Exclusive-dirty / SharedOwner). Maintaining both bitmaps independently wastes memory, risks divergence if one path marks a page dirty but the other does not, and complicates writeback (which bitmap is authoritative?).

Setup sequencing: The binding must follow a strict ordering: (1) DLM lock acquire completes (the lock is held in EX or PW mode), (2) DSM region join (the region is locally mapped and the coherence protocol is active), (3) dirty tracker bind to DLM lock via dsm_bind_lock(). This ordering ensures the lock is held before tracking begins — if the dirty tracker were bound before the lock was granted, incoming MOESI invalidations could mark pages dirty in a tracker that has no corresponding lock protection, violating the invariant that every tracked dirty page is covered by a held DLM lock.

Resolution — single-owner delegation:

  1. Bind: When dsm_bind_lock() is called, the DSM subsystem sets lock.dirty_tracker.dsm_delegate = Some(DsmDirtyDelegate { binding_handle }) on the DLM lock's tracker. From this point:
     a. LockDirtyTracker::mark_dirty(page) forwards to DsmDirtyBitmap::mark_dirty(page_idx) via the delegate handle.
     b. LockDirtyTracker::iter_dirty() returns DsmDirtyBitmap::iter_dirty() via the delegate.
     c. LockDirtyTracker::dirty_count() returns DsmDirtyBitmap::dirty_count().
     d. The local dirty_pages: LargeRangeBitmap is not accessed; it remains in its last state (or empty if the binding was created before any writes).

  2. Writeback: On lock downgrade or release, the DLM's targeted writeback path (Section 15.15) calls lock.dirty_tracker.iter_dirty(). Because the delegate is active, this iterates the DsmDirtyBitmap, which reflects every MOESI M/O transition — including pages dirtied through DSM coherence protocol messages that the VFS write path would not have seen.

  3. Unbind: When dsm_unbind_lock() is called (or the DLM lock is released), the DSM subsystem clears lock.dirty_tracker.dsm_delegate. Any remaining dirty pages in the DsmDirtyBitmap are written back synchronously before the delegate is cleared (per the existing unbind contract). After unbind, LockDirtyTracker reverts to its local dirty_pages bitmap for any subsequent non-DSM use.

  4. Invariant: At no point are both dirty_pages and dsm_delegate actively tracking writes. The DLM checks dsm_delegate.is_some() on every mark_dirty / iter_dirty / dirty_count call (single branch, predicted taken when DSM is active). This is a warm-path check (called per dirty page, not per instruction), so the branch cost is negligible.

VFS call site for mark_dirty(): The VFS set_page_dirty() path checks if the page's inode has an active DLM lock with dirty tracking enabled. If so, it calls lock.dirty_tracker.mark_dirty(page.index) to record the dirty page index in the per-lock bitmap. This is the sole entry point for populating the LockDirtyTracker during normal file writes — page-fault write paths and buffered-write paths both converge on set_page_dirty(), ensuring no dirty page is missed regardless of the I/O path taken.
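The single-owner delegation branch can be sketched as below. This is a stub for illustration: in the real design the delegate is a `DsmLockBindingHandle` resolving to a shared `DsmDirtyBitmap`, whereas here the tracker owns the stub bitmap directly and `Vec<u64>` stands in for both bitmap types.

```rust
struct DsmDirtyBitmap { dirty: Vec<u64> } // stand-in for the MOESI-driven bitmap

impl DsmDirtyBitmap {
    fn mark_dirty(&mut self, page: u64) {
        if !self.dirty.contains(&page) { self.dirty.push(page); }
    }
    fn dirty_count(&self) -> usize { self.dirty.len() }
}

struct LockDirtyTracker {
    dirty_pages: Vec<u64>,                // stand-in for LargeRangeBitmap
    dsm_delegate: Option<DsmDirtyBitmap>, // stand-in for DsmDirtyDelegate
}

impl LockDirtyTracker {
    /// Single branch on dsm_delegate: exactly one bitmap ever tracks writes.
    fn mark_dirty(&mut self, page: u64) {
        match self.dsm_delegate.as_mut() {
            Some(dsm) => dsm.mark_dirty(page), // delegated: DSM bitmap is canonical
            None => {
                if !self.dirty_pages.contains(&page) { self.dirty_pages.push(page); }
            }
        }
    }

    fn dirty_count(&self) -> usize {
        match self.dsm_delegate.as_ref() {
            Some(dsm) => dsm.dirty_count(),
            None => self.dirty_pages.len(),
        }
    }
}
```

The branch sits on the warm path (once per dirtied page), which is why the spec treats its cost as negligible.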

Downgrade behavior:

  • EX/PW → PR (downgrade to read): Flush only pages in dirty_pages bitmap. If 4 KB of a 100 GB file was modified, flush exactly 1 page (~10-15 μs for NVMe), not the entire file. PW (Protected Write) follows the same writeback rules as EX, since both are write modes that can dirty pages (per the compatibility matrix in Section 15.15).
  • EX/PW → NL (release): Flush dirty pages, then invalidate only pages covered by this lock's range. Other cached pages (from other lock ranges or read-only access) remain valid.
  • Range lock downgrade: When a byte-range lock is downgraded, only dirty pages within that specific byte range are flushed. Pages outside the range are untouched.

Page cache invalidation on lock release/downgrade: When a DLM lock is released or downgraded, cached pages protected by that lock must be invalidated to prevent stale reads by other nodes. The DLM calls dlm_invalidate_pages(resource, node) in the lock release path after dirty page writeback completes. This callback invokes invalidate_inode_pages2_range() on the inode's address space for the byte range covered by the lock. Pages that are currently under writeback are waited on before invalidation. The invalidation is synchronous — the lock release message is not sent to the master until all local cached pages for the lock's range are evicted.

Cost reduction: From O(file_size) to O(dirty_pages_in_range). For the common case of small writes to large files, this reduces lock downgrade cost by orders of magnitude.
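The cost reduction can be made concrete with a small filter: a downgrade flushes only the tracked dirty pages that fall inside the lock's byte range, regardless of file size. The helper below is a hypothetical sketch (the real path walks the `LockDirtyTracker` bitmap and submits block-layer writeback); 4 KiB pages as elsewhere in this chapter.

```rust
const PAGE_SIZE: u64 = 4096;

/// Pages a downgrade must flush: dirty page indices whose byte offset
/// falls inside the lock's range. Cost is O(dirty_pages_in_range),
/// independent of file size.
fn pages_to_flush(dirty_pages: &[u64], lock_range: core::ops::Range<u64>) -> Vec<u64> {
    dirty_pages
        .iter()
        .copied()
        .filter(|&p| lock_range.contains(&(p * PAGE_SIZE)))
        .collect()
}
```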

15.15.9 Deadlock Detection

The DLM uses a distributed wait-for graph (WFG) with two detection tiers: immediate local cycle detection for same-node deadlocks, and a probe-based protocol for cross-node deadlocks that activates after a configurable wait threshold.

15.15.9.1 Local Wait-For Graph Construction

Each node maintains a local wait-for graph of lock dependencies. Vertices are globally unique process identifiers (node_id, pid) — bare PIDs are insufficient because PID 1234 on Node A and PID 1234 on Node B are different processes. Edges represent lock dependencies: process (N1, P) holds lock A, process (N2, Q) waits for lock A → edge (N2, Q) → (N1, P). The pid field always refers to the initial (host) PID namespace, not a container-local PID namespace. Containers that each have PID 1 are unambiguously distinguished this way. Container-local PIDs are translated to initial-namespace PIDs at the DLM boundary before insertion into the wait-for graph.

Edge management:

  • Insertion: When a lock request blocks (enqueued on the waiting queue), an edge is added from the requesting task to each current holder of the conflicting lock mode. For mode conversions, the edge points from the converting task to each holder of an incompatible mode.
  • Removal: When the lock is granted or the request is cancelled (dlm_unlock() or EDEADLK victim cancellation), all edges originating from that task for the given resource are removed.

15.15.9.2 Local Cycle Detection (Immediate)

On each new edge insertion, the master node runs a depth-first search (DFS) starting from the newly blocked task. If the DFS visits a node already on the current traversal stack, a cycle is detected locally.

/// Perform local cycle detection starting from `waiter`.
///
/// Returns `Some(victim)` if a cycle is found, `None` otherwise.
/// Runs under the WFG lock (held for the duration of the DFS).
/// Worst case O(E) where E = number of edges in the local graph.
fn detect_local_cycle(
    graph: &WaitForGraph,
    waiter: WaiterId,
    policy: VictimPolicy,
) -> Option<WaiterId>;

Algorithm:

  1. Mark waiter as visiting (push onto DFS stack).
  2. For each holder h that waiter is waiting for:
     a. If h is already on the DFS stack → cycle found. Collect all nodes on the cycle path from h back to h on the stack.
     b. If h is waiting for other locks on this node, recurse into h.
     c. If h is waiting for a lock mastered on a remote node, stop local DFS for this branch — the dependency crosses a node boundary and requires the distributed probe protocol (below).
  3. If no cycle found locally, return None.
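A minimal sketch of this DFS, using a plain adjacency map in place of the kernel's WFG structure. A vertex re-encountered on the traversal stack closes a cycle; remote-mastered dependencies are simply absent from the local edge map here, so the DFS naturally stops at them.

```rust
use std::collections::HashMap;

type WaiterId = (u32, u32); // (node_id, pid)

/// Edges: blocked waiter -> holders of the conflicting lock.
/// Returns the cycle (as a stack suffix) if one is reachable from `waiter`.
fn detect_local_cycle(
    edges: &HashMap<WaiterId, Vec<WaiterId>>,
    waiter: WaiterId,
) -> Option<Vec<WaiterId>> {
    fn dfs(
        edges: &HashMap<WaiterId, Vec<WaiterId>>,
        v: WaiterId,
        stack: &mut Vec<WaiterId>,
    ) -> Option<Vec<WaiterId>> {
        if let Some(pos) = stack.iter().position(|&s| s == v) {
            return Some(stack[pos..].to_vec()); // cycle = stack suffix from v
        }
        stack.push(v);
        for &holder in edges.get(&v).into_iter().flatten() {
            if let Some(cycle) = dfs(edges, holder, stack) {
                return Some(cycle);
            }
        }
        stack.pop();
        None
    }
    dfs(edges, waiter, &mut Vec::new())
}
```

Worst case is O(E) as stated, since each edge is traversed at most once per invocation.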

Victim selection: from the set of tasks in the detected cycle, the victim is chosen by the configured VictimPolicy:

/// Policy for selecting the deadlock victim from a cycle.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VictimPolicy {
    /// Cancel the youngest transaction (most recent lock_id / highest timestamp).
    /// Default. Minimizes wasted work by aborting the task that has done the least.
    Youngest,
    /// Cancel the lowest-priority task (largest `nice` value).
    LowestPriority,
    /// Cancel the task holding the fewest locks (smallest transaction footprint).
    SmallestTransaction,
}

The victim's lock request is cancelled with EDEADLK. For same-node deadlocks, this completes in O(edges) time without any network round-trips, typically within microseconds of the blocking request.
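Victim selection over a detected cycle reduces to one comparator per policy. The sketch below restates the `VictimPolicy` enum for self-containment; `CycleMember` and its fields (`lock_timestamp`, `nice`, `locks_held`) are illustrative stand-ins for the real per-task metadata, with lowest priority read as the largest `nice` value per Unix convention.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VictimPolicy { Youngest, LowestPriority, SmallestTransaction }

#[derive(Clone, Copy)]
struct CycleMember {
    waiter: (u32, u32),  // (node_id, pid)
    lock_timestamp: u64, // higher = more recently started
    nice: i32,           // higher = lower priority
    locks_held: u32,     // transaction footprint
}

fn select_victim(cycle: &[CycleMember], policy: VictimPolicy) -> (u32, u32) {
    let victim = match policy {
        VictimPolicy::Youngest => cycle.iter().max_by_key(|m| m.lock_timestamp),
        VictimPolicy::LowestPriority => cycle.iter().max_by_key(|m| m.nice),
        VictimPolicy::SmallestTransaction => cycle.iter().min_by_key(|m| m.locks_held),
    };
    victim.expect("a cycle has at least two members").waiter
}
```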

15.15.9.3 Distributed Probe Protocol

If local DFS reaches a task waiting on a lock mastered on a remote node, and the lock request has been waiting longer than the activation threshold (default: 5 seconds, configurable via DlmLockspaceConfig.deadlock_timeout_ns), the node initiates a distributed probe.

Probe message format:

/// Distributed deadlock detection probe message.
/// Sent between DLM nodes to trace cross-node wait-for chains.
/// Kernel-internal, not KABI — contains ArrayVec, which is not wire-safe.
/// Serialized to an explicit wire format before transmission, so the
/// in-memory layout carries no ABI meaning.
pub struct DlmProbe {
    /// Monotonically increasing probe ID (per initiator node).
    /// Used for deduplication — nodes cache recent probe_ids to avoid
    /// re-processing probes that have already been forwarded.
    pub probe_id: u64,
    /// Node that initiated this probe (the node where the blocked task resides).
    pub initiator_node: NodeId,
    /// The blocked task that triggered the probe (cycle target).
    pub initiator_waiter: WaiterId,
    /// Probe path: list of `WaiterId`s — (node_id, pid) pairs — traversed so far.
    /// Bounded to MAX_PROBE_PATH_LEN (32) entries. If a probe exceeds this
    /// depth, it is dropped — real deadlock cycles in practice involve <10 nodes.
    pub path: ArrayVec<WaiterId, 32>,
}

/// Maximum probe path length. Probes exceeding this depth are dropped
/// as they are unlikely to represent real deadlock cycles.
pub const MAX_PROBE_PATH_LEN: usize = 32;

Protocol steps:

  1. Initiation: When local DFS reaches a remote dependency and the wait time exceeds the threshold, the initiator node constructs a DlmProbe with a fresh probe_id, sets initiator_waiter to the originally blocked task, and appends the local path traversed so far. The probe is sent to the remote node that masters the lock being waited on.

  2. Forwarding: The receiving node looks up the lock resource, identifies the current holders, and continues the DFS in its local WFG:
     a. For each holder that is itself waiting on a local lock: extend the DFS locally.
     b. For each holder waiting on a lock mastered on yet another remote node: append the local path segment to DlmProbe.path and forward the probe to that node.
     c. If a holder is not waiting on anything (it holds the lock and is running): the probe terminates on this branch (no deadlock on this path).

  3. Cycle detection: If the probe arrives back at the initiator node (the receiving node's node_id matches DlmProbe.initiator_node) and the DFS reaches initiator_waiter, a distributed cycle is confirmed. The full cycle path is the concatenation of DlmProbe.path plus the local segment.

  4. Victim selection: The node that detects the cycle selects the victim using the same VictimPolicy applied to all tasks in the cycle path. The detecting node sends a DLM_MSG_CANCEL to the victim's home node, which cancels the victim's lock request with EDEADLK.

  5. Probe deduplication: Each node maintains a bounded LRU cache of recently seen (initiator_node, probe_id) pairs (capacity: 1024 entries). When a probe arrives whose ID is already in the cache, it is dropped without processing. This prevents probe storms in dense wait-for graphs where multiple paths lead to the same node.
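The probe-deduplication cache described above can be sketched as a bounded LRU over (initiator_node, probe_id) pairs. A linear scan is tolerable at capacity 1024; the real kernel would more likely pair a hash set with an eviction list. This is an illustrative stub, not the in-tree structure.

```rust
use std::collections::VecDeque;

struct ProbeDedupCache {
    seen: VecDeque<(u32, u64)>, // front = oldest, back = most recently seen
    capacity: usize,
}

impl ProbeDedupCache {
    fn new(capacity: usize) -> Self {
        Self { seen: VecDeque::new(), capacity }
    }

    /// Returns true if the probe is new and should be processed;
    /// false if it was seen recently and must be dropped.
    fn admit(&mut self, initiator_node: u32, probe_id: u64) -> bool {
        let key = (initiator_node, probe_id);
        if let Some(pos) = self.seen.iter().position(|&k| k == key) {
            let k = self.seen.remove(pos).unwrap();
            self.seen.push_back(k); // refresh recency
            return false;
        }
        if self.seen.len() == self.capacity {
            self.seen.pop_front(); // evict least recently seen
        }
        self.seen.push_back(key);
        true
    }
}
```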

15.15.9.4 Gossip-Based Edge Propagation

In addition to the probe protocol (which is demand-driven), nodes exchange wait-for graph edges with their neighbors via periodic gossip. This provides a secondary detection mechanism and accelerates probe convergence:

  • Every 100 ms, each node selects ceil(log2(N)) random peers from the cluster membership list (Section 5.8) and sends its current local WFG edges (anti-entropy gossip). Random selection ensures convergence in O(log N) rounds with high probability.
  • Each gossip message includes the (node_id, pid) tuples for both endpoints of each edge, ensuring no PID aliasing across nodes or containers.
  • Edge removal: When a lock request is granted or cancelled, the node removes the corresponding edge from its local graph and propagates a tombstone (edge + deletion timestamp) in the next gossip round. Tombstones are garbage-collected after 2x the gossip interval (200 ms).
  • Each node runs local cycle detection on its accumulated graph (local edges + edges received via gossip). If a cycle is found, a victim is selected per the configured VictimPolicy (default: youngest transaction, i.e., highest timestamp) and its lock request is cancelled with EDEADLK.
  • After 3 * ceil(log2(N)) gossip rounds without detecting a complete cycle, the detector falls back to a centralized query to the DLM coordinator (lowest live node-id), adding one extra RTT but guaranteeing termination regardless of gossip convergence.

Victim selection is configurable per lockspace: youngest (default), lowest priority, or smallest transaction (fewest locks held).
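The gossip parameters above reduce to simple arithmetic: per-round fanout of ceil(log2(N)) peers and a centralized fallback after 3 × ceil(log2(N)) rounds. A sketch (function names are illustrative):

```rust
/// ceil(log2(n)) for the cluster size, with a floor of 1 peer so that
/// two-node clusters still gossip.
fn gossip_fanout(n_nodes: u32) -> u32 {
    let ceil_log2 = 32 - (n_nodes.max(2) - 1).leading_zeros();
    ceil_log2.max(1)
}

/// Rounds without a complete cycle before falling back to a centralized
/// query at the DLM coordinator.
fn fallback_rounds(n_nodes: u32) -> u32 {
    3 * gossip_fanout(n_nodes)
}
```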

15.15.9.5 Performance Characteristics

Zero overhead on fast path: Deadlock detection only activates when a lock request has been waiting for longer than the configurable threshold (default: 5 seconds). Short waits (the common case for contended locks) complete before deadlock detection engages. The gossip protocol runs on a low-priority background thread and uses minimal bandwidth.

Per-message bound: Each gossip message carries at most MAX_GOSSIP_EDGES (128) WFG edges. If a node has more local edges than the limit, edges are sent across multiple gossip rounds (round-robin). At 16 bytes per edge (two (node_id, pid) pairs + mode + timestamp), 128 edges = 2 KiB per message, well within a single RDMA inline send or UDP datagram.

Latency tradeoff justification: The 5-second activation threshold means a true deadlock waits ~5 seconds before detection begins, which is 1,000,000x the typical lock latency (~5 μs). This is acceptable because: (1) deadlocks are rare in practice — most lock waits resolve within milliseconds; (2) the alternative (immediate distributed cycle detection on every wait) would add gossip overhead to every contended lock operation, degrading the common-case latency that the DLM is optimized for; (3) the 5-second threshold matches Linux DLM's deadlock detection timeout and is well within application tolerance for the rare deadlock case.

Local fast-path detection: For locks mastered on the same node, the master performs immediate local cycle detection when enqueueing a new waiter — if the waiter and all holders in the cycle are on the same node, the deadlock is detected in O(edges) time without any network round-trips, typically within microseconds. The 5-second probe-based detection is only needed for cross-node deadlock cycles, where the wait-for graph edges span multiple nodes and must be traced via the probe protocol.

15.15.10 Integration with Cluster Membership (Section 5.8)

The DLM receives cluster membership events directly from Section 5.8's cluster membership protocol:

  • NodeJoined: New node added to consistent hash ring. Some lock resources are remapped to the new master (~1/N of resources). The new node receives resource state from the old masters.
  • NodeSuspect: Heartbeat missed. DLM begins preparing for potential recovery but does NOT stop lock operations. Current lock holders continue normally.
  • NodeDead: Confirmed node failure. DLM initiates recovery for resources mastered on or held by the dead node (Section 15.15). Ordering constraint: The DLM lock reclaim timer (DLM_LOCK_RECLAIM_DELAY_NS, 200 ms) starts ONLY after the membership layer has delivered the NodeDead event. The DLM MUST NOT initiate lock reclaim based solely on its own liveness probe (DLM_MONITOR_INTERVAL_NS, 500 ms) — the monitor is advisory and may pre-stage recovery preparation, but actual lock removal from granted queues requires authoritative NodeDead confirmation from the membership layer. The callback:
/// Called by the membership layer ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery))
/// when a node is authoritatively confirmed dead (10 missed heartbeats, 1000ms).
/// This is the ONLY entry point for DLM lock reclaim initiation.
fn on_node_dead(node_id: PeerId) {
    // 1. Cancel any in-flight RDMA operations to the dead node.
    dlm_cancel_rdma_to(node_id);
    // 2. Start the reclaim delay timer. Lock reclaim begins when
    //    this timer fires, allowing the dead node's RDMA NIC to
    //    drain any in-flight operations (200ms grace).
    dlm_recovery.start_reclaim_timer(node_id, DLM_LOCK_RECLAIM_DELAY_NS);
}

Total worst-case recovery latency (lock holder failure):

  • Heartbeat detection: 1000 ms (10 missed heartbeats at 100 ms interval)
  • NodeDead delivery to DLM: < 1 ms (in-kernel function call)
  • Reclaim delay: 200 ms (DLM_LOCK_RECLAIM_DELAY_NS)
  • Lock queue processing: < 10 ms (per-resource, not global)
  • Total: ~1210 ms from crash to lock availability for new requesters.
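Summing the budget explicitly: the < 1 ms and < 10 ms entries are upper bounds, so the total is itself an upper bound, landing at ~1210 ms in round figures. Constant names below are illustrative.

```rust
const HEARTBEAT_DETECTION_MS: u64 = 1000; // 10 missed heartbeats @ 100 ms
const NODEDEAD_DELIVERY_MS: u64 = 1;      // in-kernel call, < 1 ms bound
const RECLAIM_DELAY_MS: u64 = 200;        // DLM_LOCK_RECLAIM_DELAY_NS
const QUEUE_PROCESSING_MS: u64 = 10;      // per-resource, < 10 ms bound

/// Upper bound on crash-to-lock-availability latency for a lock holder failure.
fn worst_case_recovery_ms() -> u64 {
    HEARTBEAT_DETECTION_MS + NODEDEAD_DELIVERY_MS + RECLAIM_DELAY_MS + QUEUE_PROCESSING_MS
}
```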

  • NodeLeaving: Graceful departure. Node transfers mastered resources to their new owners before leaving. Zero disruption.

Single membership source: The DLM does NOT make its own authoritative membership decisions. It relies on the cluster membership layer (Section 5.8) as the single source of truth for node liveness. The DLM does run its own lightweight monitor thread (DLM_MONITOR_INTERVAL_NS, 500 ms) to pre-stage lock reclaim before the membership layer confirms failure, but this monitor cannot unilaterally declare a node dead or trigger fencing. This eliminates the Linux problem where DLM and corosync can disagree on node liveness — in UmkaOS, there is exactly one authority for cluster membership.

15.15.11 Recovery Protocol

Cross-subsystem ordering: When both DSM and DLM require recovery after a node failure, DSM home reconstruction runs first for pages that the DLM depends on. DLM re-mastering proceeds per-resource: resources whose CAS word pages have no DSM dependency (or whose pages are homed on surviving nodes) re-master immediately; only resources whose CAS word pages were homed on the failed node wait for dsm_recovery_complete on the affected DSM region (~1% of resources in typical deployments). See Section 5.8 for the full per-resource ordering protocol, DlmResourceDsmDep tracking structure, and rationale.

Four failure scenarios, each with a targeted recovery flow:

1. Lock holder failure (a node holding locks crashes)

Timeline:
  t=0:    Node B crashes while holding locks on resources R1, R2, R3
  t=300ms: Cluster membership heartbeat ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery),
           100ms interval) detects NodeSuspect(B) (3 missed heartbeats).
           Note: the DLM's own monitor thread (DLM_MONITOR_INTERVAL_NS=500ms)
           may have pre-staged lock reclaim by this point but does NOT trigger
           failure autonomously — it only sends an advisory hint to the membership layer.
  t=1000ms: NodeDead(B) confirmed by membership layer (10 missed heartbeats)

Recovery (per-resource, NOT global):
  For each resource where B held a lock:
    1. Master removes B's lock from granted queue
    2. If B held EX with dirty LVB: mark LVB as INVALID (sequence = LVB_SEQUENCE_INVALID)
    3. Process converting queue, then waiting queue (grant compatible waiters)
    4. If B held journal lock: trigger journal recovery for B's journal

  Resources NOT involving B: completely unaffected. Zero disruption.

  Lease expiry race handling: NodeSuspect is detected at 300ms (3 missed heartbeats),
  but leases may not expire until their full timeout (metadata: 30s, data: 5s). If the
  master attempts to send revocation messages to B during recovery and B is already
  dead (RDMA Send fails), the master does not block indefinitely waiting for B to
  acknowledge revocation. Instead, the master records B as "revocation pending" and
  proceeds with resource recovery immediately — the lease timeout will naturally
  invalidate B's access rights when it expires. For data locks (5s timeout), the
  recovery completes within the lease window; for metadata locks (30s timeout), the
  master may grant new locks on the resource before B's lease expires. This is
  correct because B is confirmed dead at t=1000ms and cannot access the resource.
  The lease timeout provides a safety net in the corner case where NodeDead
  confirmation is delayed beyond the lease duration — if the master cannot confirm
  B's death, B retains access until lease expiry, preserving correctness at the cost
  of temporary unavailability for incompatible lock requests.

2. Lock master failure (the node responsible for a resource's lock queues crashes)

Timeline:
  t=0:    Node M crashes (was master for resources hashing to M)
  t=1000ms: NodeDead(M) confirmed (10 missed heartbeats per Section 5.8.2.2)

Recovery:
  1. Consistent hashing reassigns M's resources to surviving nodes.
     (~1/N resources move, distributed across all survivors.)
  2. Each survivor that held locks on M's resources reports its lock
     state to the new master via RDMA Send.
  3. New master rebuilds granted/converting/waiting queues from
     survivor reports.
  4. Lock operations resume for affected resources.

  Timeline: ~50-200ms for affected resources.
  All other resources: unaffected (their masters are alive).

3. Split-brain (network partition divides cluster)

Inherits Section 5.8's quorum protocol:

  • Majority partition: Continues normal DLM operation. Resources mastered on nodes in the minority partition are remapped.
  • Minority partition: Blocks new EX/PW lock acquisitions to prevent conflicting writes. Existing EX/PW locks are downgraded to PR — the holder retains the lock (avoiding re-acquisition on partition heal) but cannot write. Dirty pages held under the downgraded lock are flushed before the downgrade completes (targeted writeback, Section 15.15). Existing PR and CR locks remain valid for local cached reads.

How nodes learn they are in the minority partition: The cluster membership subsystem (Section 5.8) calls dlm_partition_event(PartitionRole::Minority) on the DLM when quorum is lost. This is the single notification entry point — the DLM does not independently monitor heartbeats or quorum; it relies entirely on the membership layer's event. The event is delivered on a dedicated kernel thread and holds the DLM partition_lock during processing to serialize with ongoing lock grant decisions.

In-flight write handling: An in-flight write is any operation where a write() syscall has returned to userspace but the dirtied pages have not yet been included in the LockDirtyTracker for the covering EX lock. Two sub-cases:

Case A — write() completed before partition detected:
  Pages are already in the dirty page cache and tracked by LockDirtyTracker.
  The downgrade flushes them via targeted writeback (normal path).

Case B — write() in progress (PTE dirty bit set, LockDirtyTracker not yet updated):
  The VFS page-fault path sets the dirty bit before returning to userspace.
  DLM's partition handler waits for the write_seq counter to stabilize
  (spin at most 1ms — write() syscall cannot hold a page lock indefinitely)
  then calls sync_file_range(ALL) on all files covered by EX locks.
  This forces any PTE-dirty pages into tracked writeback before downgrade.

Atomic writeback-then-downgrade sequence:

For each EX or PW lock held by this node:
  1. Set lock.state = LOCK_CONVERTING (blocks new writers via KABI fence).
  2. Flush in-flight writes: sync_file_range(file, lock.range.start, lock.range.end).
     This is synchronous: returns only when all dirty pages in the range
     are submitted to the block layer (not necessarily persisted to disk).
  3. Call targeted_writeback_flush(lock) ([Section 15.12.8]):
     Walk LockDirtyTracker, submit writeback for each dirty page.
     Wait for writeback completion (submit + await journal commit).
  4. Only after step 3 completes: change lock mode from EX/PW → PR.
     This is the atomic downgrade: no window where lock is PR but pages are dirty.
  5. Send LOCK_DOWNGRADE message to lock master (majority partition).
     If the lock has a DSM binding (`dsm_causal_stamp.is_some()`), the
     LOCK_DOWNGRADE message includes the CausalStampWire payload
     ([Section 6.6](06-dsm.md#dsm-coherence-protocol-moesi)). The master forwards this
     stamp to the next granted holder so it can verify causal ordering
     of DSM page updates made under the previous lock tenure.
     Master updates granted queue: replaces EX entry with PR entry.

The "atomic" guarantee is node-local: steps 3→4 are serialized by partition_lock. Concurrent readers (PR/CR holders) may read stale data from the page cache during the flush window (steps 2-3), but they cannot observe partially-flushed state, because each page is either fully clean or fully dirty at page-cache granularity. No intermediate state is visible.

Lease enforcement is suspended in the minority partition: since masters in the majority partition cannot be reached for lease renewal, lease expiry cannot be used to revoke locks. No new writes are permitted. No data corruption is possible, because the minority cannot acquire or hold write locks, and read-only access to stale data is safe for PR/CR modes at the filesystem level: there is no on-disk corruption or metadata structure damage, though application-visible staleness is possible (e.g., readdir may return deleted entries or miss new files created on the majority partition). Applications requiring linearizable reads (e.g., databases with ACID guarantees) may see stale values during the partition; this is inherent to any system that allows minority-partition reads (CAP theorem).

DSM integration: The DLM's write-lock downgrade is consistent with the DSM's SUSPECT page mechanism (Section 5.8): DSM write-protects SUSPECT pages while allowing reads. Both subsystems independently block writes in the minority partition, providing defense-in-depth.

Partition heal: Minority nodes rejoin and lock state is reconciled:
  1. Minority nodes report their held lock state to the (majority-elected) masters.
  2. Masters compare against current granted queues (majority wins for conflicts).
  3. Any minority-held locks that conflict with locks granted during the partition are forcibly revoked on the minority nodes (cached data invalidated).
  4. Non-conflicting locks are re-validated and lease timers restarted.
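The writeback-then-downgrade ordering for minority-partition EX/PW locks can be sketched as a state transition. All DLM/VFS calls are stubbed here (the flush is modeled as draining a page list); only the ordering — fence writers, flush, then change mode — reflects the sequence above.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum LockMode { Ex, Pw, Pr }

struct PartitionedLock {
    mode: LockMode,
    converting: bool, // LOCK_CONVERTING: blocks new writers while set
    dirty: Vec<u64>,  // stand-in for LockDirtyTracker
}

/// Returns the number of pages flushed. The mode changes only after the
/// flush drains, so no window exists where the lock is PR with dirty pages.
fn downgrade_for_minority(lock: &mut PartitionedLock) -> usize {
    assert!(matches!(lock.mode, LockMode::Ex | LockMode::Pw));
    lock.converting = true;         // 1. fence new writers
    let flushed = lock.dirty.len(); // 2-3. sync_file_range + targeted writeback (stub)
    lock.dirty.clear();
    lock.mode = LockMode::Pr;       // 4. atomic downgrade
    lock.converting = false;        // 5. LOCK_DOWNGRADE sent to master (stubbed)
    flushed
}
```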

4. Simultaneous holder + master failure (the node holding locks is also the master for those resources, or both the holder and master crash at the same time)

Timeline:
  t=0:    Node B crashes. B held EX on resources R1, R2 (with dirty LVBs).
          B was also the master for R1 (self-mastered). Node M was the master
          for R2 and also crashes at t=0 (e.g., rack power failure).
  t=1000ms: NodeDead(B) and NodeDead(M) confirmed.

Recovery (composes scenarios 1 + 2, master rebuild first):

  Phase 1 — Master rebuild (scenario 2):
    1. Consistent hashing reassigns R1 (was mastered on B) and R2 (was
       mastered on M) to surviving nodes. New master N1 gets R1, new master
       N2 gets R2.
    2. Surviving nodes report their lock state to N1 and N2:
       - For R1: Node C reports "I have PR on R1", Node D reports "I am
         waiting for EX on R1." No node reports holding EX on R1.
       - For R2: Node C reports "I have PR on R2." No node reports holding
         EX on R2.
    3. N1 and N2 rebuild granted/converting/waiting queues from survivor
       reports. Dead node B's locks are absent (B cannot report).

  Phase 2 — Dead holder cleanup (scenario 1, applied by new masters):
    4. N1 examines R1's rebuilt state: PR holders exist (C), but no EX
       holder. A waiting EX request exists (D). N1 infers that the dead
       node B held the missing EX lock:
       - INFERENCE RULE: If a resource has waiters for an incompatible mode
         but no granted lock blocking them, the dead node(s) held the
         blocking lock. The new master does not need to know WHICH dead
         node — the lock is simply gone.
    5. N1 marks R1's LVB as INVALID (LVB_SEQUENCE_INVALID) because the
       dead EX holder may have written a dirty LVB that no survivor has.
    6. N1 processes the waiting queue: grants D's EX request on R1.
    7. N2 performs the same for R2: marks LVB INVALID, grants waiters.

  Phase 3 — Journal recovery:
    8. If B held journal locks, journal recovery runs against B's journal
       slot (same as scenario 1 step 4). The new master coordinates this.

  Timeline: same as scenario 2 (~50-200ms for affected resources).
  The holder cleanup (phase 2) adds negligible time — it is local queue
  manipulation on the new master, no network round-trips.

The key insight is ordering: master rebuild (phase 1) must complete before dead holder cleanup (phase 2), because the new master needs the rebuilt queue state to infer which locks the dead node held. An implementer must NOT attempt scenario 1 cleanup before scenario 2 rebuild — the old master is dead and cannot execute holder cleanup steps.
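The inference rule can be made concrete with a small sketch. `Mode` and `compatible` below are hypothetical reductions (an abbreviated compatibility matrix covering only the pairs this example needs, with all other pairs treated conservatively as conflicting); the real six-mode matrix and the `DlmResource` queues are defined later in this section.

```rust
/// Abbreviated lock-mode compatibility — a hedged sketch, not the full
/// six-mode matrix. Only the NL/CR/PR pairs are modeled precisely.
#[derive(Clone, Copy, PartialEq)]
pub enum Mode { Nl, Cr, Cw, Pr, Pw, Ex }

fn compatible(a: Mode, b: Mode) -> bool {
    use Mode::*;
    match (a, b) {
        (Nl, _) | (_, Nl) => true,
        (Cr, Cr) | (Cr, Pr) | (Pr, Cr) | (Pr, Pr) => true,
        _ => false, // conservative: remaining pairs treated as conflicting
    }
}

/// The inference rule: if some waiter is compatible with every granted lock
/// reported by survivors, yet it was blocked, the blocking lock must have
/// been held by a dead node. The master need not know WHICH dead node.
pub fn dead_node_held_blocker(granted: &[Mode], waiting: &[Mode]) -> bool {
    waiting.iter().any(|w| granted.iter().all(|g| compatible(*g, *w)))
}
```

When the function returns true, the new master marks the LVB invalid (the dead holder may have written a dirty LVB no survivor has) before granting the waiter.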

Key difference from Linux: NO global recovery quiesce. Linux's DLM stops ALL lock activity cluster-wide while recovering from ANY node failure. This is because Linux's DLM recovery protocol requires a globally consistent view of all lock state before it can proceed — every node must acknowledge the recovery, and no new lock operations can be processed until all nodes agree.

UmkaOS's DLM recovers per-resource: only resources mastered on or held by the dead node require recovery. The remaining (typically 90%+) of lock resources continue operating without any pause.

15.15.12 UmkaOS Recovery Advantage

The combination of umka-core's architecture and the per-resource DLM recovery protocol creates a fundamentally different failure experience:

Linux path (storage driver crash on Node B):

t=0:      Driver crash
t=0-30s:  Fencing: cluster must confirm B is dead (IPMI/BMC power-cycle
          or SCSI-3 PR revocation). Conservative timeout.
t=30-90s: Reboot: Node B reboots, OS loads, cluster stack starts.
t=90-120s: Rejoin: B rejoins cluster. DLM recovery begins.
          GLOBAL QUIESCE: ALL nodes stop ALL lock operations.
t=120-130s: DLM recovery: all nodes exchange lock state, rebuild queues.
t=130s:    Normal operation resumes.
Total: 80-130 seconds of disruption. ALL nodes affected.

UmkaOS path (storage driver crash on Node B):

t=0:       Driver crash in Tier 1 storage driver.
t=0:       Cluster heartbeat CONTINUES (heartbeat runs in umka-core, not
           the storage driver). Cluster does NOT detect a node failure.
t=50-150ms: Driver reloads (Tier 1 recovery, Section 11.7). State restored
           from checkpoint.
t=150ms:   Driver operational. Lock state was never lost (DLM is in
           umka-core). No fencing needed. No recovery needed.
Total: 50-150ms I/O pause on Node B only. Zero lock disruption.
Zero impact on other nodes.

The difference is architectural: in Linux, the DLM runs in the same failure domain as storage drivers (all are kernel modules that crash together). In UmkaOS, the DLM is in umka-core — it survives driver crashes. The DLM only needs recovery when umka-core itself fails (which means the entire node is down).

DLM-driver supervisor integration: The DLM and cluster heartbeat run in umka-core, independent of any driver. When the driver supervisor detects a Tier 1 crash, it notifies the DLM via dlm_driver_recovering(driver_id). The DLM suspends lock grant callbacks for that driver's lockspaces until the driver signals ready via dlm_driver_ready(driver_id).

15.15.13 Application-Level Distributed Locking

The DLM provides application-visible locking interfaces:

  • flock() on clustered filesystem → transparently maps to DLM lock operations. Applications using flock() for coordination get cluster-wide locking without code changes.
  • fcntl(F_SETLK) byte-range locks → DLM range lock resources. POSIX byte-range locks on clustered filesystems provide true cluster-wide exclusion.
  • Explicit DLM API via /dev/dlm → compatible with Linux's dlm_controld interface. Applications that use libdlm for explicit distributed locking work without modification.
  • flock2() system call (new, UmkaOS extension) — enhanced distributed lock with:
      - Lease semantics: caller specifies desired lease duration
      - Failure callback: notification when lock is lost due to node failure
      - Partition behavior: configurable (block, release, or fence)
      - Batch support: lock multiple files in a single system call
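Since flock2() is a new syscall, its calling convention is not specified here; the struct below is one possible shape for the request, purely illustrative — every field name is an assumption derived from the feature list above, not a defined ABI.

```rust
/// HYPOTHETICAL request shape for flock2() — a sketch of how the four
/// listed features could surface in one structure. Not a normative ABI.
pub struct Flock2Request {
    /// File descriptors to lock in a single system call (batch support).
    pub fds: Vec<i32>,
    /// Desired lease duration in nanoseconds (lease semantics).
    pub lease_ns: u64,
    /// Behavior when connectivity to the lock master is lost.
    pub partition_behavior: PartitionBehavior,
    /// If true, the caller is notified when the lock is lost due to
    /// node failure (failure callback).
    pub want_failure_callback: bool,
}

/// The three configurable partition behaviors named above.
pub enum PartitionBehavior {
    /// Block the caller until the partition heals.
    Block,
    /// Release the lock and return an error to the caller.
    Release,
    /// Fence: stop I/O to the protected resource on this node.
    Fence,
}
```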

15.15.14 Capability Model

DLM operations are gated by capabilities (Section 9.1):

| Capability | Permits |
|---|---|
| CAP_DLM_LOCK | Acquire, convert, and release locks on resources in permitted lockspaces |
| CAP_DLM_ADMIN | Create and destroy lockspaces, configure parameters, view lock state |
| CAP_DLM_CREATE | Create new lock resources (for application-level locking via /dev/dlm) |

Lockspaces provide namespace isolation — a container with CAP_DLM_LOCK scoped to its own lockspace cannot interfere with locks in other lockspaces, and lock names in one lockspace have no relationship to identically named locks in another. This gives both container isolation and domain separation (filesystem vs. application locks). GFS2 creates a lockspace per filesystem; applications create lockspaces via /dev/dlm.

New node provisioning: When a node joins the cluster, it does not initially hold any CAP_DLM_LOCK capabilities. The cluster coordinator (Raft leader) provisions DLM capabilities via capability delegation (Section 5.7): the coordinator creates a scoped CAP_DLM_LOCK for each lockspace the node is authorized to join, signs it, and sends it as part of the membership acknowledgment. The node presents this capability when joining a lockspace (the two-sided RDMA Send that returns the rkey). Nodes not delegated CAP_DLM_LOCK for a lockspace cannot join it.

15.15.15 Lockspace Lifecycle API

/// Create a new DLM lockspace.
///
/// The caller must hold `CAP_DLM_ADMIN`. The lockspace name must be unique
/// cluster-wide; if a lockspace with the same name already exists, returns
/// `DlmError::AlreadyExists`. The creating node becomes the initial member
/// and master assignment seed.
///
/// `config`: lockspace-level parameters (lease durations, slab pre-allocation
/// capacity, deadlock detection policy). Defaults are used for any field set
/// to zero.
///
/// On success, the lockspace is broadcast to all cluster peers via the Raft
/// state machine ([Section 5.1](05-distributed.md#distributed-kernel-architecture--raft-consensus)). Peers
/// that hold `CAP_DLM_LOCK` for the new lockspace may join immediately.
pub fn dlm_lockspace_create(
    name: &LockspaceName,
    config: &LockspaceConfig,
) -> Result<DlmLockspaceHandle, DlmError>;

/// Destroy a DLM lockspace.
///
/// The caller must hold `CAP_DLM_ADMIN`. All locks in the lockspace must be
/// released before destruction; if any locks remain, returns
/// `DlmError::LockspaceBusy`. Destruction is propagated to all peers via
/// Raft — peers that are still members receive a `LockspaceDestroyed` event
/// and drop their local state.
///
/// After destruction, the lockspace name may be reused by a subsequent
/// `dlm_lockspace_create()`. The old lockspace's generation is retained
/// to prevent stale `DlmLockspaceHandle` reuse.
pub fn dlm_lockspace_destroy(
    handle: DlmLockspaceHandle,
) -> Result<(), DlmError>;

/// Join an existing DLM lockspace on this node.
///
/// The caller must hold `CAP_DLM_LOCK` scoped to the target lockspace.
/// This node registers with the lockspace master, receives the current
/// RDMA rkey for CAS word arrays ([Section 15.15](#distributed-lock-manager--transport-agnostic-lock-operations)),
/// and begins participating in lock operations. Transport selection
/// ([Section 5.5](05-distributed.md#distributed-ipc--transparent-transport-selection)) is performed
/// for each existing peer in the lockspace during join.
///
/// If this node is already a member, returns `DlmError::AlreadyJoined`.
pub fn dlm_lockspace_join(
    name: &LockspaceName,
) -> Result<DlmLockspaceHandle, DlmError>;

/// Leave a DLM lockspace on this node.
///
/// Graceful departure: all locks held by this node in the lockspace are
/// released (with LVB writeback for EX/PW holders). The node's RDMA rkey
/// is deregistered. Remaining members re-master any resources that were
/// mastered on this node using consistent hashing
/// ([Section 15.15](#distributed-lock-manager--recovery-protocol)).
///
/// Unlike `dlm_lockspace_destroy()`, the lockspace continues to exist for
/// other members. The leaving node's local `DlmLockspaceHandle` is
/// invalidated.
pub fn dlm_lockspace_leave(
    handle: DlmLockspaceHandle,
) -> Result<(), DlmError>;

/// Lockspace configuration parameters.
pub struct LockspaceConfig {
    /// Lease configuration (metadata, data, application lease durations
    /// and grace period). Zero values select defaults.
    pub lease: LeaseConfig,

    /// Pre-allocated slab capacity for DlmResource entries.
    /// The ShardedMap grows in page-sized chunks; this sets the initial
    /// allocation. Default: 1024 resources per shard (256K total).
    pub initial_resource_capacity: u32,

    /// Deadlock detection mode.
    /// `true`: enable wait-for graph cycle detection (adds ~2-5 μs to
    /// contested lock path). `false`: disable (caller responsible for
    /// deadlock avoidance, e.g., lock ordering discipline).
    /// Default: `true`.
    pub deadlock_detection: bool,

    /// Transport preference override. If `None`, standard priority
    /// (CXL > RDMA > TCP) is used. If `Some`, the specified transport
    /// is forced for all peers in this lockspace.
    pub transport_override: Option<TransportType>,
}

/// Opaque handle to a joined lockspace on this node.
/// Carries an internal generation counter to prevent use-after-destroy.
pub struct DlmLockspaceHandle {
    /// Index into the node-local lockspace table.
    index: u32,
    /// Generation at creation time. Compared on every operation to detect
    /// stale handles after lockspace destroy + name reuse.
    generation: u64,
}

Lockspace lifecycle state machine:

Created ──(dlm_lockspace_join() by any peer)──► Active (≥1 member)
Active  ──(dlm_lockspace_leave() by last member)──► Empty (no members, state preserved)
Active  ──(dlm_lockspace_destroy() with no locks)──► Destroyed
Empty   ──(dlm_lockspace_join() by any peer)──► Active
Empty   ──(dlm_lockspace_destroy())──► Destroyed
Destroyed: lockspace name freed, generation counter retained.

Kernel-internal vs. application lockspaces: GFS2 and other kernel subsystems call dlm_lockspace_create() / dlm_lockspace_join() directly from kernel context. Application-level locking (Section 15.15) uses the /dev/dlm ioctl interface, which maps to the same lifecycle functions with capability checks against the calling process's credential.
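The lifecycle state machine above can be expressed as a transition function. This is an executable sketch with hypothetical names (`LsState`, `LsEvent`, `step`); the real lifecycle is driven by the API functions and Raft-propagated events.

```rust
/// Lockspace lifecycle states from the diagram above. Active carries a
/// member count so join/leave transitions are explicit.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum LsState { Created, Active { members: u32 }, Empty, Destroyed }

/// Events corresponding to the lifecycle API calls.
pub enum LsEvent { Join, Leave, Destroy { locks_held: bool } }

/// One lifecycle transition. Destroy with outstanding locks is rejected
/// (DlmError::LockspaceBusy in the real API); anything not in the diagram
/// is an invalid transition.
pub fn step(s: LsState, e: LsEvent) -> Result<LsState, &'static str> {
    use LsState::*;
    use LsEvent::*;
    match (s, e) {
        (Created, Join) | (Empty, Join) => Ok(Active { members: 1 }),
        (Active { members }, Join) => Ok(Active { members: members + 1 }),
        (Active { members: 1 }, Leave) => Ok(Empty),
        (Active { members }, Leave) if members > 1 => Ok(Active { members: members - 1 }),
        (Active { .. }, Destroy { locks_held: true }) => Err("LockspaceBusy"),
        (Active { .. }, Destroy { locks_held: false }) | (Empty, Destroy { .. }) => Ok(Destroyed),
        _ => Err("invalid transition"),
    }
}
```

After Destroyed, the name is freed but the generation counter is retained, which is what lets a stale `DlmLockspaceHandle` be detected on reuse.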

15.15.16 Performance Summary

| Operation | UmkaOS (RDMA) | UmkaOS (TCP fallback) | Linux DLM | vs Linux (RDMA) |
|---|---|---|---|---|
| Uncontested acquire | ~3-5 μs (CAS + confirmation) | ~100-400 μs (two-sided) | ~30-50 μs (TCP) | ~10-15x |
| Uncontested acquire + LVB read | ~4-6 μs | ~150-500 μs | ~100 μs | ~20x |
| Contested acquire (same master) | ~5-8 μs | ~100-400 μs | ~100-200 μs (TCP) | ~20-30x |
| Batch N locks (same master) | ~5-10 μs | ~150-500 μs | N x 30-50 μs | ~Nx8x |
| Lock any of N resources | ~5-10 μs | ~150-500 μs | N x 30-50 μs (sequential) | ~Nx8x |
| Lease extension | ~1-2 μs (push_page) | ~50-200 μs | N/A (no leases) | -- |
| Lock holder recovery | ~50-200 ms (affected only) | ~50-200 ms | 5-10 s (global quiesce) | ~50x |
| Lock master recovery | ~200-500 ms (affected only) | ~200-500 ms | 5-10 s (global quiesce) | ~20x |

TCP fallback note: On TCP transports, the DLM CAS fast path is unavailable (transport.supports_one_sided() == false). All lock operations use the two-sided transport.send_reliable() path. TCP latency (~100-400 μs per lock) is comparable to Linux DLM latency but benefits from integrated kernel-to-kernel messaging (no kernel/userspace transitions, no separate daemon processes). Recovery times are identical across transports because recovery is dominated by heartbeat timeouts and queue rebuilds, not transport latency.

Arithmetic basis: RDMA CAS latency is measured at 1.5-2.5 μs on InfiniBand HDR (200 Gb/s) and RoCEv2 (100 Gb/s) in published benchmarks. The full uncontested acquire includes the raw CAS (~2-3 μs) plus the mandatory confirmation transport.send_reliable() (~1-2 μs on RDMA), totaling ~3-5 μs. Contested locks add ~1-2 μs for receive-side processing on RDMA. On TCP, each transport.send_reliable() call incurs kernel TCP stack processing (~15-20 μs per direction) plus cluster message framing, totaling ~50-200 μs per round-trip. Linux DLM TCP latency includes TCP stack processing (~15-20 μs round-trip), DLM lock manager processing (~10-15 μs), and completion notification (~5-10 μs), totaling ~30-50 μs in published GFS2 benchmarks. Note: The Linux DLM runs entirely in-kernel since kernel 2.6; dlm_controld handles only membership events, not lock operations.
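As a quick check on the arithmetic above, the component latencies compose as interval sums. The helper below is a hypothetical illustration (not part of the DLM) that reproduces the uncontested-acquire figure from its stated parts.

```rust
/// Sum lower/upper latency bounds, in microseconds. Illustrative only —
/// used here to verify the component arithmetic in the text.
pub fn budget_us(parts: &[(f64, f64)]) -> (f64, f64) {
    parts
        .iter()
        .copied()
        .fold((0.0, 0.0), |(lo, hi), (a, b)| (lo + a, hi + b))
}
```

Raw CAS (~2-3 μs) plus the mandatory confirmation send (~1-2 μs) yields (3.0, 5.0), matching the ~3-5 μs uncontested-acquire claim.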

15.15.17 Data Structures

/// Fixed-capacity open-addressing hash table shard with slab-backed storage.
/// Capacity is chosen at construction time and never changes — no rehashing,
/// no heap allocation on the insert hot path, no spinlock hold during allocation.
///
/// Slots store `SlabHandle` indices (8 bytes each) rather than inline `(K, V)`
/// pairs. The actual `(K, V)` entries are allocated from a per-lockspace slab
/// allocator, keeping the per-shard inline array small: 4096 × 8 = 32 KiB
/// per shard. Without this indirection, inline `(K, V)` pairs for
/// `(ResourceName, DlmResource)` would be ~2,760 bytes each, producing
/// ~11 MiB per shard and ~2.83 GiB per lockspace — catastrophically unusable.
pub struct ShardedMapShard<K, V, const CAP: usize> {
    /// Open-addressing table. Each slot stores a slab handle pointing to
    /// the actual `(K, V)` entry. CAP must be a power of 2. Load factor
    /// kept <= 0.75 by construction.
    slots: [Option<SlabHandle<(K, V)>>; CAP],
    count: usize,
}

/// Opaque handle into the per-lockspace slab allocator.
/// 8 bytes (u64 index), allowing O(1) entry access via slab_get(handle).
/// SlabHandle is Copy + Clone for hash table operations.
#[derive(Copy, Clone)]
pub struct SlabHandle<T> {
    index: u64,
    _marker: core::marker::PhantomData<T>,
}

/// Sharded lock table for DLM. Each shard has its own spinlock to minimize contention.
///
/// ShardedMap uses fixed-capacity open-addressing with slab-backed entries to ensure
/// spinlock hold times are bounded and O(1). The DLM must pre-allocate sufficient
/// slab capacity based on expected concurrent lock count; capacity exhaustion returns
/// `DlmError::TableFull` rather than blocking. Insertion returns `Err` if the load
/// factor would exceed 75%.
/// `insert_or_update` and `remove` complete in bounded time under the spinlock —
/// there is no rehashing, no inline entry allocation, and no unbounded iteration.
/// Default SHARD_CAP of 4096 with 256 shards gives 256 * 4096 * 0.75 = ~786K
/// resources at 75% load factor. Per-shard memory: 4096 * 8 = 32 KiB;
/// 256 shards total: ~8 MiB (slot arrays only; slab memory is separate).
/// GFS2 workloads may have millions of locked inodes;
/// `initial_resource_capacity` in `LockspaceConfig` allows tuning the
/// per-lockspace capacity at creation time.
pub struct ShardedMap<K: Hash + Eq, V, const SHARDS: usize = 256, const SHARD_CAP: usize = 4096> {
    shards: [SpinLock<ShardedMapShard<K, V, SHARD_CAP>>; SHARDS],
    /// Per-lockspace slab allocator for `(K, V)` entries. Grows in
    /// page-sized chunks; individual entries are O(1) alloc/free.
    slab: SlabAllocator<(K, V)>,
}
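A minimal sketch of the bounded insert path described above, with slab handles reduced to plain `u64` values and hashing left to the caller. `ShardSketch` is a hypothetical reduction of `ShardedMapShard`, not the real type — key comparison via the slab is omitted.

```rust
/// Reduced model of one shard: open-addressing slots holding slab-handle
/// indices, fixed capacity, no rehashing. CAP must be a power of 2.
pub struct ShardSketch<const CAP: usize> {
    slots: [Option<u64>; CAP], // slab handle indices, 8 bytes per slot
    count: usize,
}

impl<const CAP: usize> ShardSketch<CAP> {
    pub fn new() -> Self {
        Self { slots: [None; CAP], count: 0 }
    }

    /// Insert a slab handle at the probe sequence for `hash`.
    /// Returns Err(()) — DlmError::TableFull in the real code — if the
    /// insert would push the load factor above 75%. Because the load
    /// factor is bounded, the linear probe is bounded too: spinlock hold
    /// time stays O(1) with no allocation and no rehash.
    pub fn insert(&mut self, hash: u64, handle: u64) -> Result<(), ()> {
        if (self.count + 1) * 4 > CAP * 3 {
            return Err(()); // would exceed 0.75 load factor
        }
        let mask = CAP - 1;
        let mut i = (hash as usize) & mask;
        loop {
            if self.slots[i].is_none() {
                self.slots[i] = Some(handle);
                self.count += 1;
                return Ok(());
            }
            i = (i + 1) & mask; // linear probe
        }
    }
}
```

With CAP = 4096 this reproduces the text's numbers: at most 3072 entries per shard (75%) and 32 KiB of slot storage.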
/// DLM lockspace — namespace for a set of related lock resources.
pub struct DlmLockspace {
    /// Lockspace name (e.g., "gfs2:550e8400-e29b" for a GFS2 filesystem).
    pub name: LockspaceName,

    /// Lock resources in this lockspace.
    /// Sharded concurrent hash map: 256 shards, each with its own SpinLock.
    /// Shard = hash(resource_name) & 0xFF. This reduces lock contention from
    /// a single global bottleneck to per-shard contention. Individual lock
    /// operations only hold their shard's SpinLock, allowing concurrent access
    /// to resources in different shards. DlmResource entries are allocated
    /// from a per-lockspace slab allocator.
    pub resources: ShardedMap<ResourceName, DlmResource, 256>,

    /// Lease configuration for this lockspace.
    pub lease_config: LeaseConfig,

    /// Deadlock detection state.
    pub wait_for_graph: Mutex<WaitForGraph>,

    /// Statistics counters.
    pub stats: DlmStats,
}

/// Per-lockspace lease configuration.
pub struct LeaseConfig {
    /// Default lease duration for metadata locks.
    pub metadata_lease_ns: u64,

    /// Default lease duration for data locks.
    pub data_lease_ns: u64,

    /// Default lease duration for application locks.
    pub app_lease_ns: u64,

    /// Grace period after lease expiry before forced revocation.
    pub grace_period_ns: u64,
}

/// DLM statistics (per-lockspace, exposed via umkafs unified-object-namespace).
pub struct DlmStats {
    /// Total lock operations (acquire + convert + release).
    pub lock_ops: AtomicU64,

    /// Operations served by RDMA CAS fast path (uncontested).
    pub fast_path_ops: AtomicU64,

    /// Operations requiring RDMA Send (contested).
    pub slow_path_ops: AtomicU64,

    /// Batch operations.
    pub batch_ops: AtomicU64,

    /// Lock-any-of operations.
    pub lock_any_ops: AtomicU64,

    /// Deadlocks detected.
    pub deadlocks_detected: AtomicU64,

    /// Recovery events (holder + master).
    pub recovery_events: AtomicU64,
}

/// DLM error type returned by lock/unlock operations.
/// Maps to standard errno values for Linux compatibility.
pub enum DlmError {
    /// Deadlock detected by the wait-for graph (Section 15.15).
    Deadlock,
    /// Trylock failed — lock is held in an incompatible mode.
    Again,
    /// Lockspace does not exist.
    NoEntry,
    /// Resource exhaustion (slab allocator, ShardedMap capacity).
    NoMemory,
    /// Wait interrupted by signal delivery.
    Interrupted,
    /// Timeout expired before lock was granted.
    TimedOut,
    /// Lock not held by caller (unlock of unheld lock).
    NotHeld,
    /// Invalid lock handle (already released or corrupted).
    InvalidHandle,
    /// ShardedMap shard is at capacity (load factor > 75%).
    TableFull,
}

/// Per-node DLM recovery state machine. Each entry in the DLM's node table
/// carries one of these states. The state transitions enforce the ordering
/// constraint that actual lock reclaim (removal from granted queues) MUST NOT
/// proceed until the membership layer has delivered authoritative `NodeDead`.
///
/// State transitions:
///   Normal → AwaitingNodeDead: DLM's own liveness probe detects suspect node.
///   AwaitingNodeDead → Recovering: membership layer delivers `NodeDead`.
///   Recovering → Normal: all affected resources have been re-mastered and
///     lock queues unfrozen.
///   AwaitingNodeDead → Normal: membership layer declares the node alive
///     (false positive from the DLM probe — network partition healed).
pub enum DlmRecoveryState {
    /// No recovery in progress for this node.
    Normal,
    /// DLM probe detected a suspect node but the membership layer has NOT yet
    /// confirmed `NodeDead`. During this phase, the DLM pre-stages recovery:
    /// freezes local lock queues mentioning the suspect node, pre-computes
    /// new master assignments, and cancels in-flight RDMA operations. No locks
    /// are removed from granted queues — the suspect node may still be alive
    /// (e.g., transient network partition).
    AwaitingNodeDead {
        /// The node under suspicion.
        failed_node: NodeId,
        /// Monotonic instant when the DLM probe first detected the failure.
        detected_at: Instant,
    },
    /// Membership layer has confirmed `NodeDead`. Actual lock reclaim is in
    /// progress: dead node's lock entries are removed from granted queues,
    /// waiting/converting queues are re-evaluated, and blocked lock requests
    /// are granted where the dead node's hold was the sole blocker. Resources
    /// whose CAS-word pages have DSM dependency on the dead node wait for
    /// `dsm_recovery_complete` before re-mastering
    /// ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--cross-subsystem-recovery-ordering-dsm-and-dlm)).
    Recovering {
        /// The confirmed-dead node.
        dead_node: NodeId,
        /// Resources whose lock queues are currently frozen pending re-mastering.
        /// Bounded by the number of resources mastered on or held by the dead
        /// node — typically O(R/N) where R is total resources and N is cluster
        /// size. Stored as an XArray keyed by `ResourceId` for O(1) lookup.
        frozen_resources: XArray<ResourceId>,
    },
}

DLM error codes: All lock and unlock operations return Result<_, DlmError>. The following table defines the complete error space:

| Operation | Error | errno | Meaning |
|---|---|---|---|
| lock() | (none) | 0 | Success |
| lock() | Deadlock | EDEADLK | Deadlock detected by wait-for graph (Section 15.15) |
| lock() | Again | EAGAIN | Trylock failed (lock held in incompatible mode) |
| lock() | NoEntry | ENOENT | Lockspace does not exist |
| lock() | NoMemory | ENOMEM | Resource exhaustion (slab or lock table full) |
| lock() | Interrupted | EINTR | Wait interrupted by signal |
| lock() | TimedOut | ETIMEDOUT | Timeout expired before grant |
| lock() | TableFull | ENOSPC | ShardedMap shard at capacity |
| unlock() | (none) | 0 | Success |
| unlock() | NotHeld | ENOENT | Lock not held by caller |
| unlock() | InvalidHandle | EINVAL | Invalid lock handle (already released or corrupted) |

The errno mapping is applied at the syscall compatibility boundary (Section 19.1) for application-level DLM operations via /dev/dlm (Section 15.15) and flock()/fcntl() on clustered filesystems. Kernel-internal callers (e.g., GFS2, UPFS) receive the typed DlmError enum directly.
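The errno mapping can be sketched as a total function over `DlmError`. The enum is restated from the definition above so the fragment is self-contained; the errno constants are the standard Linux x86-64 numeric values for the names in the table.

```rust
/// Restated from the DlmError definition above, so this sketch compiles
/// standalone.
pub enum DlmError {
    Deadlock, Again, NoEntry, NoMemory, Interrupted,
    TimedOut, NotHeld, InvalidHandle, TableFull,
}

/// Sketch of the syscall-boundary errno translation from the table above.
/// Numeric values are Linux x86-64 errno constants.
pub fn dlm_errno(e: DlmError) -> i32 {
    match e {
        DlmError::Deadlock => 35,                    // EDEADLK
        DlmError::Again => 11,                       // EAGAIN
        DlmError::NoEntry | DlmError::NotHeld => 2,  // ENOENT
        DlmError::NoMemory => 12,                    // ENOMEM
        DlmError::Interrupted => 4,                  // EINTR
        DlmError::TimedOut => 110,                   // ETIMEDOUT
        DlmError::InvalidHandle => 22,               // EINVAL
        DlmError::TableFull => 28,                   // ENOSPC
    }
}
```

Note that the mapping is not injective (NoEntry and NotHeld both map to ENOENT), which is one reason kernel-internal callers receive the typed enum instead.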

DlmError → LockError conversion: The ClusterLockAdapter trait (Section 15.15) returns LockError, not DlmError. The conversion is defined as:

impl From<DlmError> for LockError {
    fn from(e: DlmError) -> Self {
        match e {
            DlmError::Deadlock => LockError::Deadlock,
            DlmError::TimedOut => LockError::Timeout,
            DlmError::Interrupted => LockError::Cancelled,
            DlmError::Again => LockError::WouldBlock,
            DlmError::NoEntry => LockError::ResourceDestroyed,
            DlmError::NoMemory => LockError::WouldBlock, // transient slab exhaustion; VFS retries
            DlmError::NotHeld => LockError::InvalidMode,
            DlmError::InvalidHandle => LockError::ResourceDestroyed,
            DlmError::TableFull => LockError::WouldBlock, // capacity limit; VFS retries with backoff
        }
    }
}

This mapping is intentionally lossy — LockError is the VFS-facing error type with coarser granularity than the DLM's internal DlmError. Filesystem drivers that need finer-grained error handling (e.g., distinguishing NoMemory from TableFull for retry backoff) should call the DLM API directly and receive DlmError.

15.15.18 Licensing

The VMS/DLM lock model is published academic work (VAX/VMS Internals and Data Structures, Digital Press, 1984). The six-mode compatibility matrix, Lock Value Block concept, and granted/converting/waiting queue model are well-documented in public literature and implemented by multiple independent projects (Linux DLM, Oracle DLM, HP OpenVMS DLM). No patent or proprietary IP concerns.

RDMA Atomic CAS and Send/Receive operations are standard InfiniBand/RoCE verbs defined by the IBTA (InfiniBand Trade Association) specification, which is publicly available.

15.15.19 DLM Master Election and Liveness Integration

The DLM uses a deterministic master election based on node ranking rather than a Paxos/Raft round to minimize election latency in the common case (no failures).

Master selection rule: The node with the lowest node_id among currently healthy cluster members is the DLM master. On membership change (join/leave), all nodes independently compute the new master from the updated membership view — no election protocol needed. This requires consistent failure detection.

/// DLM master state. One instance per DLM domain (per filesystem/cluster).
pub struct DlmMaster {
    /// Node ID of the current master (determined by lowest-node-id rule).
    /// Atomically updated on membership changes. Zero = no master (election in progress).
    /// u64 matches NodeId width — no truncation on large clusters.
    pub master_node_id: AtomicU64,
    /// True if this node is the current DLM master.
    pub is_master: AtomicBool,
    /// Monotonic epoch counter. Incremented on each master transition.
    /// Used to detect stale messages from a previous master.
    pub epoch: AtomicU64,
    /// Per-peer liveness tracking. Keyed by PeerId (u64). XArray provides
    /// O(1) lookup with native RCU-protected reads and ordered iteration.
    /// Updated from cluster heartbeat callbacks and DLM message receipt.
    pub peers: XArray<DlmPeerState>,
    /// Last time this node received any DLM message from each peer (nanoseconds).
    /// Keyed by PeerId (u64). XArray native RCU reads replace RcuHashMap.
    /// Updated on receipt of ANY DLM message (lock request, grant, convert, etc.)
    /// — every DLM message is implicit proof of liveness. Also updated when the
    /// cluster heartbeat layer notifies the DLM of a received heartbeat from a
    /// peer that participates in this lockspace.
    pub last_heard_ns: XArray<AtomicU64>,
}

/// Liveness tracking state for one peer node within this DLM domain.
/// Updated from two sources: (1) cluster heartbeat callbacks
/// ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--heartbeat-protocol)), and
/// (2) receipt of any DLM wire message from the peer.
pub struct DlmPeerState {
    /// Node is considered live by the DLM's local view. Set to `false` when
    /// the DLM monitor detects prolonged silence (no DLM messages AND no
    /// cluster heartbeats forwarded for this peer). This is an advisory
    /// signal only — authoritative failure is determined by the cluster
    /// membership layer's `NodeDead` event.
    pub alive: AtomicBool,
    /// Number of consecutive DLM monitor wakeups with no activity from this peer.
    pub missed_intervals: AtomicU32,
    /// Cluster heartbeat sequence from the last forwarded heartbeat.
    /// Used to correlate DLM liveness with the cluster-level heartbeat
    /// sequence (avoids re-processing stale forwarded heartbeats).
    pub last_hb_seq: AtomicU64,
}

/// DLM monitor wakeup interval. The DLM monitor thread wakes at this interval
/// to check `last_heard_ns` for each peer and pre-stage lock reclaim if a
/// peer appears unresponsive. The DLM does NOT send its own heartbeat messages
/// — liveness information comes from the cluster-level heartbeat protocol
/// ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--heartbeat-protocol)), which
/// uses neighbor-only topology and scales O(neighbors) not O(peers).
pub const DLM_MONITOR_INTERVAL_NS: u64 = 500_000_000; // 500 ms

/// A peer is considered suspect by the DLM if no activity (DLM messages or
/// forwarded cluster heartbeats) has been observed for this duration.
/// 3x monitor interval provides tolerance for transient delays.
pub const DLM_SUSPECT_TIMEOUT_NS: u64 = 1_500_000_000; // 1.5 s (3 x 500 ms)

/// After the cluster membership layer delivers a `NodeDead` event, wait this
/// long before reclaiming the dead node's locks. Allows the failed node's
/// RDMA NIC to drain in-flight operations.
pub const DLM_LOCK_RECLAIM_DELAY_NS: u64 = 200_000_000; // 200 ms

Liveness model — no DLM-specific heartbeat: The DLM does NOT run its own heartbeat protocol. Sending DLM-specific heartbeats to all peers in a lockspace would create O(N^2) traffic on TCP clusters (N nodes each sending to N-1 peers), making clusters of >100 nodes impractical. Instead, the DLM derives liveness from two existing sources:

  1. Cluster-level heartbeat (Section 5.8): The cluster membership layer heartbeats only direct neighbors in the topology graph (O(neighbors) per node, typically 2-6). When a cluster heartbeat is received from a neighbor that participates in this DLM lockspace, the heartbeat layer forwards a callback to the DLM, which updates last_heard_ns[sender]. This piggyback is free — no additional network traffic.

  2. DLM message receipt: Every DLM wire message (lock request, grant, convert, release, revocation, etc.) is implicit proof of liveness. The DLM updates last_heard_ns[sender] on receipt of any message. For active lockspaces with ongoing lock traffic, this provides sub-millisecond failure detection without any heartbeat at all.

Failure detection algorithm (runs in kthread/dlm_monitor):

  1. Every DLM_MONITOR_INTERVAL_NS (500 ms): the monitor thread wakes and checks last_heard_ns for each peer in the lockspace.
  2. For each peer, if now_ns - last_heard_ns[peer] > DLM_SUSPECT_TIMEOUT_NS and peer.alive == true:
     - Set peer.alive = false, increment missed_intervals.
     - Send an advisory liveness-suspect hint to the membership layer (not an authoritative failure declaration — the membership layer independently confirms or rejects the failure via its own quorum-protected protocol).
     - Pre-stage lock reclaim preparation (freeze local lock queues mentioning the suspect node, pre-compute new master assignments). Actual lock removal from granted queues does NOT occur until the membership layer delivers authoritative NodeDead confirmation — see the ordering constraint in the Integration with Cluster Membership section above.
  3. On receipt of any DLM message or forwarded cluster heartbeat from a peer: update last_heard_ns[sender], reset missed_intervals[sender], and set peer.alive = true if previously suspect.
  4. Master recomputation: after any membership change (authoritative NodeDead or NodeJoined from the membership layer), all nodes compute new_master = min(alive_node_ids). If new_master != master_node_id, atomically swap and increment epoch.
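The suspicion check can be sketched in isolation. `monitor_tick` below is a hypothetical reduction of one kthread/dlm_monitor wakeup: per-peer state is flattened to `(last_heard_ns, alive)` pairs and the function returns the indices of peers newly marked suspect. Per the ordering constraint, this marking is advisory only — no locks are reclaimed here.

```rust
/// Matches DLM_SUSPECT_TIMEOUT_NS from the text: 3 x 500 ms.
pub const SUSPECT_TIMEOUT_NS: u64 = 1_500_000_000;

/// One monitor wakeup over a flattened peer table. Each entry is
/// (last_heard_ns, alive). Returns the indices of peers that transitioned
/// alive -> suspect on this tick. Advisory only: authoritative failure is
/// still the membership layer's NodeDead event.
pub fn monitor_tick(now_ns: u64, peers: &mut [(u64, bool)]) -> Vec<usize> {
    let mut newly_suspect = Vec::new();
    for (i, (last_heard, alive)) in peers.iter_mut().enumerate() {
        if *alive && now_ns.saturating_sub(*last_heard) > SUSPECT_TIMEOUT_NS {
            *alive = false; // pre-stage reclaim; do NOT touch granted queues
            newly_suspect.push(i);
        }
    }
    newly_suspect
}
```

A peer that later sends any DLM message or forwarded heartbeat is flipped back to alive by the receive path, corresponding to the AwaitingNodeDead → Normal transition.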

Relationship to cluster heartbeat: The DLM is a consumer of the cluster heartbeat, not a producer. The cluster membership layer (Section 5.8) provides neighbor-only heartbeat with O(neighbors) traffic, failure detection via Suspect/Dead state transitions, and authoritative NodeDead events protected by quorum consensus. The DLM's monitor thread provides DLM-scoped early warning (pre-staging lock reclaim before the membership layer confirms failure) using locally observed liveness signals. This two-tier approach eliminates the Linux DLM problem where DLM and corosync can disagree on node liveness during partial-failure scenarios, while avoiding the O(N^2) traffic that a DLM-specific heartbeat would produce.

15.15.20 VFS Lock Integration (ClusterLockAdapter)

Clustered filesystems (GFS2, OCFS2, future UPFS) must translate POSIX file locking semantics (flock(), fcntl(F_SETLK)) into DLM lock operations. Rather than each filesystem implementing ad-hoc DLM integration, UmkaOS defines a standard adapter trait that the VFS file locking layer calls directly.

/// A byte range for file-level locking. Used by the ClusterLockAdapter to
/// translate POSIX/flock byte-range locks into DLM lock scope. The range
/// is inclusive on both ends: `[start, end]`. A range of `[0, u64::MAX]`
/// represents a whole-file lock (equivalent to flock semantics).
pub struct LockRange {
    /// Start byte offset (inclusive).
    pub start: u64,
    /// End byte offset (inclusive). u64::MAX = end-of-file.
    pub end: u64,
}

/// Errors returned by DLM lock operations (lock, unlock, convert).
#[derive(Debug)]
pub enum LockError {
    /// The lock request would cause a deadlock (detected by the distributed
    /// wait-for graph). The caller should abort the operation and retry.
    Deadlock,
    /// The lock request timed out waiting for grant (exceeded the caller's
    /// specified timeout or the default 30-second DLM timeout).
    Timeout,
    /// The non-blocking trylock failed because the lock is held in an
    /// incompatible mode. The caller should retry after a brief backoff.
    /// Maps from `DlmError::Again` (errno `EAGAIN`). Distinct from `Timeout`
    /// which indicates an expired wait duration.
    WouldBlock,
    /// The lock request was cancelled by the caller or by cluster membership
    /// change (the requesting node was evicted during the wait).
    Cancelled,
    /// The requested lock mode is invalid for this operation (e.g., converting
    /// from EX to a mode incompatible with the resource's current state).
    InvalidMode,
    /// The target lock resource has been destroyed (e.g., the filesystem was
    /// unmounted or the lockspace was released while the lock was pending).
    ResourceDestroyed,
}

/// A lock resource entry in the DLM master's resource directory. Tracks the
/// resource name, current master node, and the set of waiters for deadlock
/// detection. This is the master-side view; each node also maintains a local
/// resource cache (`DlmResource`) for resources it has active locks on.
pub struct DlmLockResource {
    /// Resource name (hierarchical, variable-length, max 256 bytes).
    /// Format: "fsname:type:object" (e.g., "gfs2:inode:0x1234").
    /// Uses `ResourceName` (byte array, not UTF-8-enforced `ArrayString`)
    /// because DLM resource names may contain non-UTF-8 bytes (e.g.,
    /// binary inode numbers in GFS2/OCFS2 lock naming conventions).
    pub name: ResourceName,
    /// Node ID of the current master for this resource. Updated during
    /// re-mastering after node failure recovery.
    pub master_node: NodeId,
    /// Queue of waiters for this resource, used by the distributed deadlock
    /// detector to construct the wait-for graph. Each entry represents a
    /// pending lock request that is blocked by an incompatible holder.
    /// Bounded at 64 concurrent waiters per resource — if exceeded, new lock
    /// requests receive `DlmError::Again` (backpressure). In practice,
    /// clustered filesystems rarely exceed 8-16 waiters per inode lock.
    pub lock_queue: SpinLock<ArrayVec<DlmWaitEdge, 64>>,
}

/// Adapter trait for translating VFS file locking operations to DLM operations.
/// Implemented by clustered filesystems (GFS2, OCFS2) in their filesystem driver.
pub trait ClusterLockAdapter {
    /// Translate a POSIX/flock lock request to a DLM lock acquisition.
    /// mode mapping: LOCK_SH → DLM_LOCK_PR, LOCK_EX → DLM_LOCK_EX
    /// LOCK_UN → dlm_unlock(). Blocking locks: DLM_LKF_WAIT.
    fn lock_file(
        &self,
        inode_id: InodeId,
        range: LockRange,
        mode: LockMode,
        wait: bool,
    ) -> Result<DlmLockHandle, LockError>;

    /// Release a DLM lock when the corresponding VFS lock is dropped.
    fn unlock_file(&self, handle: DlmLockHandle) -> Result<(), LockError>;

    /// Integrate with deadlock detection. Returns the DLM's global wait-for
    /// graph entries for this filesystem. VFS merges these with its local
    /// WaitForGraph ([Section 14.14](14-vfs.md#local-file-locking)) for cross-node deadlock detection.
    /// Returns at most `MAX_DLM_WAIT_EDGES` (64) remote wait-for edges for deadlock
    /// detection. 64 matches the per-resource `lock_queue` bound (see `DlmLockResource`
    /// above) — there can never be more than 64 waiters per resource, so 64 edges
    /// suffices. If multiple resources are queried, the deadlock detector calls
    /// `get_remote_waiters()` per inode and merges results.
    fn get_remote_waiters(&self, inode_id: InodeId) -> ArrayVec<DlmWaitEdge, 64>;
}

/// An edge in the distributed wait-for graph. Maps a DLM lock waiter to
/// the VFS thread space so that the VFS deadlock detector can merge DLM
/// (cross-node) and local (single-node) wait-for relationships.
pub struct DlmWaitEdge {
    /// Thread ID of the waiter (translated from DLM owner ID to local
    /// ThreadId space via the cluster membership table).
    pub waiter: ThreadId,
    /// Thread ID of the holder that the waiter is blocked on.
    /// For remote holders, this is a synthetic ThreadId allocated from
    /// a per-node range to avoid collision with local threads.
    pub holder: ThreadId,
    /// The DLM lock mode requested by the waiter.
    pub requested_mode: LockMode,
    /// The DLM lock mode currently held by the holder.
    pub held_mode: LockMode,
    /// Node ID of the waiter (for diagnostic logging and cycle reporting).
    pub waiter_node: u64,
    /// Node ID of the holder.
    pub holder_node: u64,
}

VFS integration point:

- FileSystemOps::cluster_lock_ops() -> Option<&dyn ClusterLockAdapter> — returns the adapter for clustered filesystems, None for local filesystems.
- vfs_lock_file() (the kernel's central file locking entry point, Section 14.14): checks sb.fs_ops.cluster_lock_ops(). If Some → DLM path via the adapter. If None → local VFS locking (Section 14.14).

Lock mode mapping:

| VFS Lock | DLM Mode | Semantics |
|----------|----------|-----------|
| LOCK_SH / F_RDLCK | DLM_LOCK_PR (Protected Read) | Multiple readers allowed, writers excluded |
| LOCK_EX / F_WRLCK | DLM_LOCK_EX (Exclusive) | Single writer, all others excluded |
| LOCK_UN / F_UNLCK | dlm_unlock() | Release the DLM lock |

Blocking semantics: When wait == true (corresponding to F_SETLKW or flock() without LOCK_NB), the DLM lock request is submitted with DLM_LKF_WAIT. The calling task sleeps interruptibly until the lock is granted or a signal is delivered (returning EINTR). The DLM's deadlock detector (Section 15.15) may return EDEADLK if a cross-node deadlock cycle is detected.

Deadlock detection: The VFS maintains a local WaitForGraph for single-node deadlock detection (Section 15.15). For clustered filesystems, this graph must be extended with cross-node wait edges. The adapter's get_remote_waiters() returns DLM-tracked wait edges for a given inode. The VFS deadlock detector merges these remote edges with its local graph during cycle detection, providing unified cross-node deadlock detection without requiring the VFS to understand DLM internals.
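The merge-and-detect step can be illustrated with a minimal DFS cycle check over (waiter, holder) thread-ID pairs — a sketch of the algorithm only, not the kernel's WaitForGraph API. Remote `DlmWaitEdge` entries reduce to the same pair shape via their `waiter`/`holder` fields.

```rust
use std::collections::{HashMap, HashSet};

/// Detect a cycle in a merged wait-for graph. Edges are (waiter, holder)
/// thread-ID pairs; local VFS edges and remote DLM edges are simply
/// concatenated before the check.
fn has_cycle(edges: &[(u64, u64)]) -> bool {
    let mut adj: HashMap<u64, Vec<u64>> = HashMap::new();
    let mut nodes: HashSet<u64> = HashSet::new();
    for &(waiter, holder) in edges {
        adj.entry(waiter).or_default().push(holder);
        nodes.insert(waiter);
        nodes.insert(holder);
    }
    fn visit(
        n: u64,
        adj: &HashMap<u64, Vec<u64>>,
        path: &mut HashSet<u64>,
        done: &mut HashSet<u64>,
    ) -> bool {
        if done.contains(&n) {
            return false; // already fully explored, no cycle through n
        }
        if !path.insert(n) {
            return true; // n is already on the current DFS path: cycle
        }
        for &next in adj.get(&n).into_iter().flatten() {
            if visit(next, adj, path, done) {
                return true;
            }
        }
        path.remove(&n);
        done.insert(n);
        false
    }
    let mut path = HashSet::new();
    let mut done = HashSet::new();
    nodes.iter().any(|&n| visit(n, &adj, &mut path, &mut done))
}
```

A local edge (T1 waits on T2) plus a remote edge (T2 waits on T1, reported by `get_remote_waiters()`) forms exactly the cross-node cycle this merge exists to catch.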

DLM resource naming: The adapter translates inode IDs and byte ranges into DLM resource names. The recommended convention is "FL:<fsid>:<inode_id>:<start>:<end>" for byte-range locks and "FL:<fsid>:<inode_id>" for whole-file flock() locks. Each filesystem's adapter implementation defines its own naming scheme.
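As a sketch, the recommended convention reduces to two formatting helpers. The hex encoding of the numeric fields is an assumption here — each adapter defines its own encoding, as noted above.

```rust
/// Build a byte-range lock resource name: "FL:<fsid>:<inode_id>:<start>:<end>".
/// Hex field encoding is illustrative, not mandated by the convention.
fn byte_range_resource_name(fsid: u64, inode_id: u64, start: u64, end: u64) -> String {
    format!("FL:{:x}:{:x}:{:x}:{:x}", fsid, inode_id, start, end)
}

/// Build a whole-file flock() resource name: "FL:<fsid>:<inode_id>".
fn whole_file_resource_name(fsid: u64, inode_id: u64) -> String {
    format!("FL:{:x}:{:x}", fsid, inode_id)
}
```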

15.15.21 DLM as Foundation for UPFS Token Management

The DLM is designed to serve as the foundation for UPFS — UmkaOS's own GPFS-class clustered filesystem. GPFS's "token manager" — its core coordination mechanism — maps naturally onto the DLM's existing primitives:

| GPFS Token Concept | DLM Equivalent |
|--------------------|----------------|
| Token types (data, metadata, layout, quota) | Resource name prefix: "D:", "M:", "L:", "Q:" + inode |
| Byte-range tokens | Byte-range lock resources: "D:inode:start:len" |
| Token revocation callback | Lease-based revocation (Section 15.15) — master sends targeted revocation to active holders |
| Downgrade callback (EX→PR) | Lock conversion (Section 15.15) with targeted writeback (Section 15.15) |
| Token batching (multi-resource) | Speculative multi-resource acquire (Section 15.15) |
| Lock Value Block for metadata piggybacking | LVB (Section 15.15) — 64-byte inline data attached to lock |

What the DLM provides natively that GPFS needs:

  1. Downgrade with targeted writeback. When a UPFS metadata server revokes a write token, the DLM's EX → PR conversion triggers LockDirtyTracker-based writeback of only modified pages within the lock's range (Section 15.15). This is directly equivalent to GPFS's "flush dirty data, then downgrade token" flow — and better than Linux DLM's BAST storms (Section 15.15).

  2. Lock Value Blocks for metadata piggybacking. GPFS piggybacks small metadata updates (inode timestamps, file sizes) on token grant/revoke messages to avoid separate metadata RPCs. The DLM's 64-byte LVB (Section 15.15) provides exactly this: the last EX holder writes updated metadata into the LVB on downgrade; the next PR holder reads it from the LVB without contacting the metadata server.

  3. Speculative multi-resource acquire. UPFS allocators need to lock a resource group (block allocation bitmap). With hundreds of resource groups, contention is common. The DLM's lock_any_of() primitive (Section 15.15) tries multiple resource groups in a single round-trip — same optimization GPFS uses for allocation.

  4. RDMA-native fast path. Uncontested token acquire via RDMA CAS (Section 15.15) at ~2 μs is competitive with GPFS's token manager latency. Contested path at ~5-8 μs (RDMA Send/Recv) matches GPFS's two-sided token path.

What UPFS builds on top (minimal — naming conventions only):

  • Token type semantics. The DLM provides generic lock modes (NL, CR, CW, PR, PW, EX). UPFS defines which modes map to its token types: e.g., "data write token = EX on D:inode:range", "metadata read token = PR on M:inode". This is a resource name prefix convention, not a DLM change.

  • Revocation handlers. UPFS registers DlmRevocationHandler implementations for each token type at lock acquire time. The DLM calls these handlers directly on revocation — no intermediate token layer needed:

  • Data token handler: targeted writeback → DSM invalidation → convert/release
  • Metadata token handler: update LVB with latest inode attrs → convert/release
  • Layout token handler: flush stripe map changes → release
  • Quota token handler: flush quota deltas to LVB → convert/release

The DLM drives the entire revocation flow. The UPFS handlers are stateless functions that operate on the lock's associated state (LockDirtyTracker, LockValueBlock). No "token manager" object, no token state machine, no separate revocation protocol.

  • Stripe-group coordination. When UPFS stripes a file across N storage servers, each stripe has independent data tokens. The UPFS client holds N byte-range data tokens (one per stripe) and submits I/O to N block exports in parallel. Coordination between stripes (e.g., extending a file across a stripe boundary) uses metadata tokens on the stripe map inode.

  • Quota tokens. GPFS uses tokens for quota enforcement (user/group quota fragments cached on each node). Maps to DLM locks on quota resources with LVB carrying the cached quota values.

The "token layer" is essentially zero code. UPFS's token management consists of:

  1. A set of resource name prefix conventions ("D:", "M:", "L:", "Q:").
  2. A set of DlmRevocationHandler implementations (one per token type, ~50-100 lines each).
  3. Helper functions that call dlm_lock() with the right resource name and handler.

There is no token manager object, no token state machine, no separate protocol, and no separate recovery mechanism. The DLM IS the token manager. This is the design goal: the DLM is not a "bolt-on lock service" that UPFS wraps — it is the native token infrastructure.
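A sketch of those helper functions follows. `Mode` is a hypothetical stand-in for the DLM's lock mode type; the real helpers would pass the resulting name and mode to dlm_lock() along with the registered revocation handler.

```rust
/// Stand-in for the DLM lock modes used by the token conventions.
#[derive(Debug, PartialEq)]
enum Mode { Pr, Ex }

/// Build a token resource name from a prefix convention.
fn token_resource(prefix: &str, inode: u64, range: Option<(u64, u64)>) -> String {
    match range {
        Some((start, len)) => format!("{}:{}:{}:{}", prefix, inode, start, len),
        None => format!("{}:{}", prefix, inode),
    }
}

/// "Data write token" = EX on "D:inode:start:len".
fn data_write_token_name(inode: u64, start: u64, len: u64) -> (String, Mode) {
    (token_resource("D", inode, Some((start, len))), Mode::Ex)
}

/// "Metadata read token" = PR on "M:inode".
fn metadata_read_token_name(inode: u64) -> (String, Mode) {
    (token_resource("M", inode, None), Mode::Pr)
}
```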

15.15.22 DLM Wire Protocol

All DLM messages are carried over ClusterTransport (Section 5.10) using the standard ClusterMessageHeader (40 bytes, defined in Section 5.2). DLM messages use PeerMessageType::DlmOp as the message_type. The payload after ClusterMessageHeader is a DlmMessageHeader followed by a message-type-specific payload struct.

All multi-byte integers in wire structs use Le types (Section 6.1) for correct operation on mixed-endian clusters (PPC32, s390x are big-endian).

/// DLM message types. Carried in DlmMessageHeader.msg_type.
#[repr(u16)]
pub enum DlmMessageType {
    /// Request a lock on a resource. Sent by requester to resource master.
    LockRequest      = 0x0001,
    /// Grant a lock to a requester. Sent by master to requester.
    LockGrant        = 0x0002,
    /// Convert an existing lock to a different mode. Sent to master.
    LockConvert      = 0x0003,
    /// Confirm a conversion. Sent by master to holder.
    LockConvertGrant = 0x0004,
    /// Release a lock. Sent by holder to master.
    LockRelease      = 0x0005,
    /// Revoke a lock (lease expiry or contention). Sent by master to holder.
    LockRevocation   = 0x0006,
    /// Deadlock detection probe. Forwarded along wait-for edges.
    DeadlockProbe    = 0x0007,
    /// Deadlock victim notification. Sent by detector to victim.
    DeadlockVictim   = 0x0008,
    /// Look up the master node for a resource. Sent when the local hash
    /// ring indicates a different master than expected (post-migration).
    MasterLookup     = 0x0009,
    /// Reply to MasterLookup with the authoritative master node.
    MasterLookupReply = 0x000A,
    /// Transfer resource master state (granted/converting/waiting queues)
    /// to a new master during membership change.
    MasterTransfer   = 0x000B,
    /// Notify nodes that a resource's master has migrated.
    MasterMigration  = 0x000C,
    /// Lease renewal (one-sided RDMA write of lease timestamp; this type
    /// is used only on the TCP fallback path where one-sided is unavailable).
    LeaseRenew       = 0x000D,
    /// Batch lock request (up to 64 locks in a single message).
    LockBatch        = 0x000E,
    /// Read an LVB without acquiring a lock (TCP two-sided fallback).
    /// Sent by requester to resource master when `supports_one_sided() == false`.
    /// See [Section 15.15](#distributed-lock-manager--two-sided-lvb-read-fallback).
    LvbReadRequest   = 0x000F,
    /// Response to LvbReadRequest. Contains the 64-byte LVB data + sequence.
    LvbReadResponse  = 0x0010,
}

/// DLM message header. Follows ClusterMessageHeader in every DLM message.
/// Total: 24 bytes.
///
/// Note: Le64 fields have alignment 1 (they are `#[repr(transparent)]` over
/// `[u8; 8]`), so `#[repr(C)]` layout would pack `lockspace_id` at offset 4
/// with no implicit padding, producing a 20-byte struct. We add explicit
/// `_pad` to ensure 8-byte alignment of the lockspace_id field on the wire,
/// which prevents deserialization bugs on architectures that trap on unaligned
/// access and makes the wire layout conventional (all 8-byte fields at 8-byte
/// offsets).
#[repr(C)]
pub struct DlmMessageHeader {
    /// DLM message type (DlmMessageType as Le16).
    pub msg_type: Le16,
    /// Flags (reserved for future use). Must be zero on send.
    pub flags: Le16,
    /// Explicit padding to align lockspace_id at offset 8. Must be zeroed
    /// on send to prevent information disclosure.
    pub _pad: [u8; 4],
    /// Lockspace ID. Identifies the lockspace context for this message.
    /// Each lockspace has a unique 64-bit ID assigned at creation time.
    pub lockspace_id: Le64,
    /// Resource name hash (SipHash-2-4 of the full resource name).
    /// Used for routing and shard selection on the receiver. The full
    /// resource name is carried in the payload when needed (LockRequest,
    /// MasterLookup) but omitted from compact messages (LockGrant,
    /// LockRelease) where the hash suffices for lookup.
    pub resource_hash: Le64,
}
const_assert!(size_of::<DlmMessageHeader>() == 24);
// Layout: msg_type(2) + flags(2) + _pad(4) + lockspace_id(8) + resource_hash(8) = 24 bytes.
// Total wire message: ClusterMessageHeader (40) + DlmMessageHeader (24) + payload.
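The alignment-1 property and the resulting 24-byte layout can be verified with a minimal model of the Le wrappers. The real types live in Section 6.1; this sketch only reproduces the properties the layout note relies on.

```rust
/// Minimal model of the Section 6.1 little-endian wire wrappers:
/// #[repr(transparent)] over a byte array gives alignment 1, so
/// #[repr(C)] packs fields with no implicit padding.
#[repr(transparent)]
#[derive(Clone, Copy)]
struct Le16([u8; 2]);

#[repr(transparent)]
#[derive(Clone, Copy)]
struct Le64([u8; 8]);

impl Le16 {
    fn new(v: u16) -> Self { Le16(v.to_le_bytes()) }
    fn get(self) -> u16 { u16::from_le_bytes(self.0) }
}
impl Le64 {
    fn new(v: u64) -> Self { Le64(v.to_le_bytes()) }
    fn get(self) -> u64 { u64::from_le_bytes(self.0) }
}

/// Same field layout as the wire struct above: the explicit _pad puts
/// lockspace_id at offset 8 and resource_hash at offset 16.
#[repr(C)]
struct DlmMessageHeader {
    msg_type: Le16,
    flags: Le16,
    _pad: [u8; 4],
    lockspace_id: Le64,
    resource_hash: Le64,
}
```

Dropping `_pad` from this model shrinks the struct to 20 bytes with `lockspace_id` at offset 4, which is exactly the hazard the layout note describes.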

/// LockRequest payload. Sent by requester to resource master.
/// Total: 48 bytes.
#[repr(C)]
pub struct DlmLockRequestPayload {
    /// Requester's node-local lock ID. Unique within the requester node.
    /// The master returns this ID in LockGrant for correlation.
    pub lock_id: Le64,
    /// Requested lock mode (LockMode as u8).
    pub mode: u8,
    /// Lock request flags.
    pub flags: u8,
    /// Padding to 8-byte boundary for requester_node alignment.
    pub _pad: [u8; 6],
    /// Requester's node ID (redundant with ClusterMessageHeader.node_id
    /// but included for self-contained payload parsing).
    pub requester_node: Le64,
    /// Resource name length (bytes). The full resource name follows this
    /// struct as a variable-length tail (max 256 bytes).
    pub name_len: Le16,
    /// Padding.
    pub _pad2: [u8; 6],
    /// Lease duration requested (nanoseconds). 0 = use lockspace default.
    pub lease_ns: Le64,
    /// DSM causal epoch for lock requests that coordinate with DSM
    /// coherence. 0 = no DSM dependency. Non-zero values carry the
    /// epoch component of the CausalStampWire, which is sufficient for
    /// causal ordering between lock acquire and DSM page access. The
    /// full CausalStampWire (epoch + dirty page bitmap) is variable-length
    /// and cannot fit in a fixed-size field; the dirty bitmap is only
    /// needed for page reconstruction and is sent in a separate
    /// DsmReconstructRequest message when the lock holder needs to
    /// reconstruct pages.
    pub dsm_causal_epoch: Le64,
    // Followed by: resource_name: [u8; name_len]
}
const_assert!(size_of::<DlmLockRequestPayload>() == 48);

/// LockGrant payload. Sent by master to requester.
/// Total: 24 bytes.
#[repr(C)]
pub struct DlmLockGrantPayload {
    /// Lock ID from the original LockRequest.
    pub lock_id: Le64,
    /// Granted lock mode (may differ from requested if a conversion was
    /// applied by the master's compatibility check).
    pub granted_mode: u8,
    /// Grant status: 0 = success, non-zero = error code.
    pub status: u8,
    /// Padding.
    pub _pad: [u8; 2],
    /// Master-assigned sequence number for this lock instance. Used for
    /// CAS-word ABA prevention and lease tracking.
    pub master_seq: Le32,
    /// LVB data length attached to this grant (0-64 bytes). If non-zero,
    /// the LVB data follows this struct.
    pub lvb_len: Le32,
    /// Padding.
    pub _pad2: [u8; 4],
    // Followed by: lvb_data: [u8; lvb_len] (0-64 bytes)
}
const_assert!(size_of::<DlmLockGrantPayload>() == 24);

/// LockConvert payload. Sent by holder to master.
/// Total: 28 bytes.
#[repr(C)]
pub struct DlmLockConvertPayload {
    /// Lock ID of the lock to convert.
    pub lock_id: Le64,
    /// Current lock mode (for validation).
    pub current_mode: u8,
    /// Requested new mode.
    pub new_mode: u8,
    /// Padding.
    pub _pad: [u8; 2],
    /// LVB data length to write on downgrade (0-64). If converting from
    /// EX/PW to a lower mode, the holder writes updated LVB data here.
    pub lvb_len: Le32,
    /// Padding.
    pub _pad2: [u8; 4],
    /// Causal consistency epoch from the converting node's CausalStampWire.
    /// Le64 stores the epoch component only; full CausalStampWire
    /// reconstruction is handled by DSM.
    pub dsm_causal_epoch: Le64,
    // Followed by: lvb_data: [u8; lvb_len] (0-64 bytes)
}
const_assert!(size_of::<DlmLockConvertPayload>() == 28);

/// LockRelease payload. Sent by holder to master.
/// Total: 24 bytes.
#[repr(C)]
pub struct DlmLockReleasePayload {
    /// Lock ID to release.
    pub lock_id: Le64,
    /// LVB data length to write on release (0-64). EX/PW holders write
    /// the final LVB on release so the next acquirer sees updated data.
    pub lvb_len: Le32,
    /// Padding.
    pub _pad: [u8; 4],
    /// Causal consistency epoch from the releasing node's CausalStampWire.
    /// Le64 stores the epoch component only; full CausalStampWire
    /// reconstruction is handled by DSM.
    pub dsm_causal_epoch: Le64,
    // Followed by: lvb_data: [u8; lvb_len] (0-64 bytes)
}
const_assert!(size_of::<DlmLockReleasePayload>() == 24);

/// LockRevocation payload. Sent by master to holder.
/// Total: 16 bytes.
#[repr(C)]
pub struct DlmLockRevocationPayload {
    /// Lock ID to revoke.
    pub lock_id: Le64,
    /// Requested downgrade mode. The holder should convert to this mode
    /// (or release entirely if mode == NL). The holder has until
    /// deadline_ns to comply before the master forcibly revokes.
    pub requested_mode: u8,
    /// Padding.
    pub _pad: [u8; 3],
    /// Deadline (nanoseconds from message timestamp) by which the holder
    /// must comply. After this deadline, the master force-revokes.
    pub deadline_ns: Le32,
}
const_assert!(size_of::<DlmLockRevocationPayload>() == 16);

/// DeadlockProbe payload. Forwarded along wait-for graph edges.
/// Total: 40 bytes.
#[repr(C)]
pub struct DlmDeadlockProbePayload {
    /// Originator of the probe (the node that started cycle detection).
    pub origin_node: Le64,
    /// Lock ID that initiated the probe (the waiting lock).
    pub origin_lock_id: Le64,
    /// Current probe depth (incremented at each hop). If depth exceeds
    /// MAX_DEADLOCK_DEPTH (16), the probe is dropped (prevents infinite
    /// forwarding in pathological wait-for graphs).
    pub depth: Le32,
    /// Padding.
    pub _pad: [u8; 4],
    /// Probe generation (from the origin node's monotonic counter).
    /// Duplicate probes with the same (origin_node, probe_gen) are dropped.
    pub probe_gen: Le64,
    /// Number of WaiterId entries in the variable-length path that follows
    /// this fixed payload. Each hop appends its local waiter to the path
    /// before forwarding, enabling cycle reconstruction on detection.
    pub path_len: Le32,
    /// Padding.
    pub _pad2: [u8; 4],
    // Variable-length path follows the fixed payload: path_len × Le64
    // (WaiterId values). Per-hop reconstruction: each node appends its
    // local waiter to the path and forwards. When the probe returns to
    // the origin_node, the complete cycle is the path array.
}
const_assert!(size_of::<DlmDeadlockProbePayload>() == 40);

/// Maximum deadlock probe depth before dropping. 16 hops covers any
/// realistic deadlock cycle in clustered filesystem workloads.
/// **Relationship to MAX_PROBE_PATH_LEN**: MAX_DEADLOCK_DEPTH (16) is the
/// wire protocol hop limit for `DlmDeadlockProbePayload.depth` — probes
/// exceeding this are dropped by the receiver. MAX_PROBE_PATH_LEN (32) is
/// the in-memory path array capacity for `DlmProbe.path`, which is used
/// in the gossip-based protocol. The wire limit is stricter because each
/// wire hop has RDMA latency cost; the in-memory limit is more generous
/// to accommodate fan-out in the wait-for graph reconstruction.
pub const MAX_DEADLOCK_DEPTH: u32 = 16;

/// MasterLookup payload. Sent when a node needs to confirm or discover
/// the current master for a resource (e.g., after membership change).
/// Total: 8 bytes + variable name.
#[repr(C)]
pub struct DlmMasterLookupPayload {
    /// Resource name length.
    pub name_len: Le16,
    /// Padding.
    pub _pad: [u8; 6],
    // Followed by: resource_name: [u8; name_len]
}
const_assert!(size_of::<DlmMasterLookupPayload>() == 8);

/// MasterLookupReply payload.
/// Total: 24 bytes.
#[repr(C)]
pub struct DlmMasterLookupReplyPayload {
    /// Resource name hash (from the original lookup).
    pub resource_hash: Le64,
    /// Authoritative master node for this resource.
    pub master_node: Le64,
    /// Status: 0 = known master, 1 = resource not found (no active locks).
    pub status: u8,
    /// Padding.
    pub _pad: [u8; 7],
}
const_assert!(size_of::<DlmMasterLookupReplyPayload>() == 24);

/// MasterTransfer payload. Sent from old master to new master during
/// membership change. Carries the full lock state for a resource.
/// Total: 16 bytes header + variable queues.
#[repr(C)]
pub struct DlmMasterTransferPayload {
    /// Resource name hash.
    pub resource_hash: Le64,
    /// Resource name length.
    pub name_len: Le16,
    /// Number of entries in the granted queue (following the name).
    pub granted_count: Le16,
    /// Number of entries in the converting queue.
    pub converting_count: Le16,
    /// Number of entries in the waiting queue.
    pub waiting_count: Le16,
    // Followed by:
    //   resource_name: [u8; name_len]
    //   granted: [DlmQueueEntry; granted_count]
    //   converting: [DlmQueueEntry; converting_count]
    //   waiting: [DlmQueueEntry; waiting_count]
}
const_assert!(size_of::<DlmMasterTransferPayload>() == 16);

/// A lock queue entry for MasterTransfer wire format.
/// Total: 24 bytes.
#[repr(C)]
pub struct DlmQueueEntry {
    /// Node ID of the lock holder/waiter.
    pub node_id: Le64,
    /// Node-local lock ID.
    pub lock_id: Le64,
    /// Lock mode (held or requested).
    pub mode: u8,
    /// Padding.
    pub _pad: [u8; 7],
}
const_assert!(size_of::<DlmQueueEntry>() == 24);
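With these fixed sizes, the total wire size of a MasterTransfer message is a straightforward computation. The constants below restate the struct sizes established in this section; the helper itself is illustrative.

```rust
/// Header = ClusterMessageHeader (40) + DlmMessageHeader (24).
const HDR: usize = 40 + 24;
/// Fixed DlmMasterTransferPayload size.
const TRANSFER_FIXED: usize = 16;
/// DlmQueueEntry wire size.
const QUEUE_ENTRY: usize = 24;

/// Total wire size of a MasterTransfer message: fixed headers, the
/// variable-length resource name, then the three queues of entries.
fn master_transfer_size(name_len: usize, granted: usize, converting: usize, waiting: usize) -> usize {
    HDR + TRANSFER_FIXED + name_len + QUEUE_ENTRY * (granted + converting + waiting)
}
```

For example, a resource with a 16-byte name, two granted locks, and one converting lock transfers in a single 168-byte message.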

/// LvbReadRequest payload. Sent by a node that wants to read an LVB without
/// acquiring a lock when `transport.supports_one_sided() == false` (TCP).
/// The master reads the LVB under its internal resource lock and responds
/// with `DlmLvbReadResponsePayload`.
/// Total: 8 bytes + variable name.
/// See [Section 15.15](#distributed-lock-manager--two-sided-lvb-read-fallback).
#[repr(C)]
pub struct DlmLvbReadRequestPayload {
    /// Resource name length.
    pub name_len: Le16,
    /// Padding.
    pub _pad: [u8; 6],
    // Followed by: resource_name: [u8; name_len]
}
const_assert!(size_of::<DlmLvbReadRequestPayload>() == 8);

/// LvbReadResponse payload. Sent by the resource master in reply to
/// LvbReadRequest. Contains the full 64-byte LVB snapshot read under
/// the resource's `inner` SpinLock (guaranteed consistent — no double-read
/// needed by the receiver).
/// Total: 88 bytes.
/// Layout: resource_hash(8) + status(1) + _pad(3) + lvb_len(4) +
///   rotation_epoch(8) + lvb_data(64) = 88.
#[repr(C)]
pub struct DlmLvbReadResponsePayload {
    /// Resource name hash (for correlation with the request).
    pub resource_hash: Le64,
    /// Status: 0 = success (LVB data valid), 1 = resource not found,
    /// 2 = LVB invalid (INVALID sentinel, needs disk refresh).
    pub status: u8,
    /// Padding.
    pub _pad: [u8; 3],
    /// LVB data length (0-64). 0 if the resource has no LVB or the
    /// LVB is in INVALID state (status == 2).
    pub lvb_len: Le32,
    /// Rotation epoch — monotonically increasing counter incremented each
    /// time the LVB sequence counter is rotated (reset to 0). Lockless
    /// `read_lvb()` callers MUST compare this with their cached
    /// `rotation_epoch` before using sequence comparison for ordering.
    /// If epochs differ, the cached sequence is stale across a rotation
    /// boundary and must be discarded. See "Rotation safety for lockless
    /// `read_lvb()` callers" above.
    pub rotation_epoch: Le64,
    /// LVB data (56 bytes of application data + 8 bytes sequence counter).
    /// Only the first `lvb_len` bytes are meaningful.
    pub lvb_data: [u8; 64],
}
const_assert!(size_of::<DlmLvbReadResponsePayload>() == 88);

Wire message size summary:

| Message Type | Header | Payload (fixed) | Variable | Typical Total |
|--------------|--------|-----------------|----------|---------------|
| LockRequest | 64 | 48 | 0-256 (name) | ~112-368 bytes |
| LockGrant | 64 | 24 | 0-64 (LVB) | ~88-152 bytes |
| LockConvert | 64 | 28 | 0-64 (LVB) | ~92-156 bytes |
| LockRelease | 64 | 24 | 0-64 (LVB) | ~88-152 bytes |
| LockRevocation | 64 | 16 | 0 | 80 bytes |
| DeadlockProbe | 64 | 40 | N*8 (path) | ~104+ bytes |
| MasterLookup | 64 | 8 | 0-256 (name) | ~72-328 bytes |
| MasterTransfer | 64 | 16 | name + queues | variable |
| LockBatch | 64 | 8 (count) | N * 48 | ~120-3144 bytes |
| LvbReadRequest | 64 | 8 | 0-256 (name) | ~72-328 bytes |
| LvbReadResponse | 64 | 88 | 0 | 152 bytes |

Header = ClusterMessageHeader (40) + DlmMessageHeader (24) = 64 bytes.

Batch messages: LockBatch carries up to 64 DlmLockRequestPayload entries in a single wire message, grouped by destination master. The master processes all entries atomically and returns a single batch reply with per-lock grant/reject status. This reduces RDMA round-trips for operations like rename() (3 locks) and GFS2 resource group allocation (8+ locks). The batch reply uses DlmMessageType::LockGrant with the batch flag set in DlmMessageHeader.flags.
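The grouping step on the sender side can be sketched as follows. The `(master, lock_id)` request representation is a simplified stand-in for the full `DlmLockRequestPayload` entries.

```rust
use std::collections::BTreeMap;

/// Maximum lock requests per LockBatch wire message.
const MAX_BATCH: usize = 64;

/// Group pending lock requests by destination master, then split each
/// group into LockBatch-sized chunks — one wire message per chunk.
fn batch_requests(reqs: &[(u64 /* master node */, u64 /* lock_id */)]) -> Vec<(u64, Vec<u64>)> {
    let mut by_master: BTreeMap<u64, Vec<u64>> = BTreeMap::new();
    for &(master, lock_id) in reqs {
        by_master.entry(master).or_default().push(lock_id);
    }
    let mut messages = Vec::new();
    for (master, locks) in by_master {
        for chunk in locks.chunks(MAX_BATCH) {
            messages.push((master, chunk.to_vec()));
        }
    }
    messages
}
```

A rename() needing three locks whose resources all hash to the same master thus produces one wire message instead of three round-trips.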


15.16 Persistent Memory

15.16.1 The Hardware

CXL-attached persistent memory is coming (Samsung CMM-H with NAND-backed persistence via CXL GPF, SK Hynix). Also: battery-backed DRAM (NVDIMM-N) for enterprise storage. The model: byte-addressable memory that survives power loss.

15.16.2 Design: DAX (Direct Access) Integration

// umka-core/src/mem/persistent.rs

/// Persistent memory region descriptor.
pub struct PersistentMemoryRegion {
    /// Physical address range.
    pub base: PhysAddr,
    pub size: u64,

    /// NUMA node this persistent memory is attached to.
    pub numa_node: u16,

    /// Technology type (affects performance characteristics).
    pub tech: PmemTechnology,

    /// Is this region backed by a filesystem (DAX mode)?
    pub dax_device: Option<DeviceNodeId>,
}

#[repr(u32)]
pub enum PmemTechnology {
    /// Intel Optane / 3D XPoint (legacy, for existing deployments).
    Optane          = 0,
    /// CXL-attached persistent memory.
    CxlPersistent   = 1,
    /// Battery-backed DRAM (NVDIMM-N).
    BatteryBacked   = 2,
}

15.16.3 Memory-Mapped Persistent Storage

When a filesystem on persistent memory is mounted with DAX:

Standard file I/O (non-DAX):
  read() → VFS → page cache → memcpy to userspace
  write() → VFS → page cache → writeback → storage device

DAX file I/O:
  read() → VFS → mmap directly to persistent memory → load instruction
  write() → VFS → store instruction → persistent memory
  No page cache. No copies. No writeback.
  CPU load/store talks directly to persistent media.

AS_DAX inode initialization: The AS_DAX flag is set per-inode during inode initialization when: (1) the filesystem is mounted with -o dax=always (all regular file inodes), or (2) the filesystem is mounted with -o dax=inode and the inode has the FS_DAX_FL persistent attribute (set via ioctl(FS_IOC_SETFLAGS) or chattr +x). The flag is set in the filesystem's iget()/alloc_inode() path by calling mapping_set_dax(inode.i_mapping), which sets bit AS_DAX in AddressSpace.flags. Once set, the flag is immutable for the lifetime of the inode in memory.
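The per-inode decision reduces to a small predicate. The mount modes and flag names below mirror the text; the types are illustrative stand-ins, and dax=never is assumed as the default mode when neither option is given.

```rust
/// Mount-time DAX modes (-o dax=never|inode|always). Never is the
/// assumed default when no dax option is given.
enum DaxMount { Never, Inode, Always }

/// Per-inode AS_DAX decision, evaluated once in the filesystem's
/// iget()/alloc_inode() path. Only regular files are DAX candidates.
fn should_set_as_dax(mount: &DaxMount, is_regular_file: bool, has_fs_dax_fl: bool) -> bool {
    match mount {
        DaxMount::Always => is_regular_file,
        DaxMount::Inode => is_regular_file && has_fs_dax_fl,
        DaxMount::Never => false,
    }
}
```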

The memory manager must handle persistent pages differently:

- Persistent pages are NOT evictable (they ARE the storage)
- fsync() → CPU cache flush (CLWB/CLFLUSH) not block I/O
- MAP_SYNC flag ensures metadata (file size, timestamps) is also persistent
- Crash consistency: partial writes are visible after reboot (see Section 15.16)

15.16.4 Crash Consistency Protocol

Persistent memory stores survive power loss, but CPU caches do not. Without explicit cache flushing, writes to persistent memory may be reordered or lost in the CPU write-back cache. The kernel must enforce a strict persistence protocol:

Persistence primitives (x86):
  CLWB addr     — Write-back cache line, leave line CLEAN but VALID in cache.
                  (Preferred: no performance penalty on subsequent reads.)
  CLFLUSHOPT addr — Flush cache line, INVALIDATE from cache.
                  (Legacy: forces re-fetch on next read.)
  SFENCE        — Store fence. Guarantees all preceding CLWB/CLFLUSHOPT
                  have reached the persistence domain (ADR/eADR boundary).

Correct write sequence for persistent data:
  1. Store data to persistent memory region (mov/memcpy)
  2. CLWB for each modified cache line (64 bytes each)
  3. SFENCE  ← data is now durable
  4. Store metadata update (e.g., committed flag, log tail pointer)
  5. CLWB for metadata cache line(s)
  6. SFENCE  ← metadata is now durable (atomically marks data as committed)

ARM equivalent:
  DC CVAP addr  — Clean data cache to Point of Persistence (ARMv8.2+)
  DSB           — Data Synchronization Barrier

fsync() on a DAX-mounted filesystem translates to cache writeback + store fence (not block I/O). msync(MS_SYNC) on DAX mappings follows the same path. The kernel provides pmem_flush() and pmem_drain() helpers that abstract the architecture-specific instructions.
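The six-step sequence above can be sketched against a mock persistence backend so the required ordering is checkable without hardware. The PmemOps trait and Recorder are illustrative; the text's pmem_flush()/pmem_drain() helpers play the roles of flush/drain here.

```rust
/// Abstraction over the persistence primitives. A real kernel backend
/// issues CLWB/SFENCE (x86) or DC CVAP/DSB (ARM); the mock records calls.
pub trait PmemOps {
    fn store(&mut self, what: &str);  // mov/memcpy to persistent memory
    fn flush(&mut self, what: &str);  // CLWB each modified cache line
    fn drain(&mut self);              // SFENCE: reach persistence domain
}

/// Two-phase commit: the first drain makes the data durable, the second
/// makes the metadata (committed flag / log tail pointer) durable,
/// atomically publishing the data.
pub fn pmem_commit<P: PmemOps>(p: &mut P) {
    p.store("data");   // 1. store data to the persistent region
    p.flush("data");   // 2. CLWB each modified 64-byte line
    p.drain();         // 3. SFENCE: data is now durable
    p.store("meta");   // 4. store the metadata update
    p.flush("meta");   // 5. CLWB the metadata line(s)
    p.drain();         // 6. SFENCE: commit is now published
}

/// Mock backend that records the operation sequence for verification.
pub struct Recorder(pub Vec<String>);
impl PmemOps for Recorder {
    fn store(&mut self, w: &str) { self.0.push(format!("store:{w}")); }
    fn flush(&mut self, w: &str) { self.0.push(format!("flush:{w}")); }
    fn drain(&mut self)          { self.0.push("drain".into()); }
}
```

Skipping either drain reorders durability: without step 3, a crash can leave the commit record durable while the data it points at is still in a CPU cache.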

15.16.4.1 DAX fsync Path

When fsync(fd) is called on a DAX file (AS_DAX set in AddressSpace.flags, Section 14.1), the standard page-cache writeback path is bypassed entirely. Instead, the VFS dispatches to dax_fsync():

fsync(fd)
  → sys_fsync()
  → vfs_fsync_range(file, 0, LLONG_MAX, datasync=false)
  → file.f_op.fsync(file, start, end, datasync)
  → dax_fsync(file, start, end)         // DAX-specific path

dax_fsync() implementation:

/// Flush dirty DAX mappings to persistent media.
/// Called instead of the standard filemap_write_and_wait_range() path
/// for DAX files. There are no page cache pages — data lives directly
/// in persistent memory. The only action needed is to flush CPU caches
/// for any dirty ranges.
fn dax_fsync(
    file: &File,
    start: i64,
    end: i64,
) -> Result<(), IoError> {
    let mapping = &file.inode.i_mapping;

    // (1) Check for DAX hardware errors (MCE/SEA).
    //     Compare file.f_dax_err (AtomicU32, snapshot at open time) with
    //     mapping.dax_err. If different, report -EIO.
    //     f_dax_err is AtomicU32 for interior mutability (dax_fsync takes &File).
    if file.f_dax_err.load(Acquire) != mapping.dax_err.load(Acquire) {
        file.f_dax_err.store(mapping.dax_err.load(Acquire), Release);
        return Err(IoError::EIO);
    }

    // (2) Walk dirty DAX mappings in [start, end] and flush
    //     CPU caches to the persistence domain.
    dax_writeback_range(mapping, start, end)?;

    Ok(())
}

/// Walk all DAX mappings in the given range that have been dirtied
/// (PTE dirty bit set) and issue architecture-specific cache writeback
/// instructions to push data from CPU caches to persistent media.
///
/// This is the DAX equivalent of filemap_write_and_wait_range() for
/// page-cache-backed files.
fn dax_writeback_range(
    mapping: &AddressSpace,
    start: i64,
    end: i64,
) -> Result<(), IoError> {
    // Defensive guard: the VFS should never pass negative values, but
    // a programming error upstream would cause catastrophic behavior
    // (wrapping u64 creating an infinite-length range).
    if start < 0 || end < 0 || start > end {
        return Err(IoError::EINVAL);
    }
    // Walk the filesystem's iomap to find the physical addresses
    // backing the dirty range.
    let mut pos = start as u64;
    while pos <= end as u64 {
        let iomap = mapping.inode.i_op.iomap_begin(
            pos,
            (end as u64 + 1).saturating_sub(pos),
            AccessType::Read,
        )?;

        match iomap.kind {
            IomapKind::Mapped { phys_addr } => {
                let len = core::cmp::min(iomap.length, (end as u64 + 1) - pos);
                // Flush CPU caches for this physical range.
                arch_wb_cache_pmem(phys_addr, len);
            }
            IomapKind::Hole | IomapKind::Unwritten => {
                // No data to flush.
            }
        }

        // Guard against a zero-length iomap from the filesystem, which
        // would otherwise spin forever without advancing pos.
        if iomap.length == 0 {
            return Err(IoError::EIO);
        }
        pos += iomap.length;
    }

    // Issue a store fence to ensure all cache writebacks have reached
    // the persistence domain before returning.
    arch_pmem_drain();

    Ok(())
}

arch_wb_cache_pmem(addr, len) — per-architecture cache writeback:

| Architecture | Instructions | Notes |
|---|---|---|
| x86-64 | CLWB for each 64-byte cache line in [addr, addr+len) | CLWB preferred (leaves line clean but valid in cache). Falls back to CLFLUSHOPT on CPUs without CLWB (pre-Skylake). Trailing fence is exclusively arch_pmem_drain(). |
| AArch64 | DC CVAP for each 64-byte cache line in [addr, addr+len) | DC CVAP = Clean to Point of Persistence (ARMv8.2-A DPB feature). Falls back to DC CVAC (clean to Point of Coherency) on CPUs without DPB. |
| RISC-V | cbo.flush for each cache line (Zicbom extension) | Without Zicbom, RISC-V has no cache management instructions — persistent memory requires explicit fence only (assumes platform guarantees writeback on fence). |
| PPC64LE | dcbst for each cache line | dcbst forces writeback of the specified cache block. Trailing fence deferred to arch_pmem_drain() to avoid redundant fences when flushing multiple extents. |
| ARMv7 | DCCMVAC (MCR p15, 0, Rd, c7, c10, 1) for each cache line | Clean D-cache by MVA to Point of Coherency. No Point of Persistence concept in ARMv7; relies on platform-level battery backup. |
| PPC32 | dcbst for each cache line | Same as PPC64LE. |
| s390x | PFMF (Perform Frame Management Function) or store + BCR serialization | z/Architecture uses channel I/O for persistent storage; CPU cache is write-through to SCM via firmware-managed paths. |
| LoongArch64 | cacop 0x19 (D-cache writeback + invalidate) for each cache line | LoongArch cache operations use cacop instructions with type/level encoding. |

arch_pmem_drain() — per-architecture store fence:

| Architecture | Instruction | Semantics |
|---|---|---|
| x86-64 | SFENCE | Guarantees all preceding CLWB/CLFLUSHOPT have reached the persistence domain (ADR/eADR boundary). |
| AArch64 | DSB SY | Data Synchronization Barrier — all prior DC CVAP operations complete before subsequent memory accesses. |
| RISC-V | fence w, w | Write-write ordering fence. Ensures all prior stores (including cbo.flush) are visible. |
| PPC64LE | sync | Heavyweight barrier. All prior dcbst operations reach the persistence domain. |
| ARMv7 | DSB | Data Synchronization Barrier — all prior cache maintenance operations complete. |
| PPC32 | sync | Same as PPC64LE. |
| s390x | BCR 14,0 | Serialization instruction. All prior store operations reach the persistence domain. |
| LoongArch64 | dbar 0 | Full barrier (dbar 0x00). All prior cache operations complete before subsequent accesses. |
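Every row in the writeback table iterates per cache line, and because a byte range may start and end mid-line, the loop bounds come from the aligned first and last lines rather than from len alone. A small helper (illustrative name; a real implementation would inline this into the per-arch flush loop) captures that arithmetic:

```rust
/// Number of cache-line writeback instructions (CLWB / DC CVAP / dcbst
/// / cbo.flush, depending on architecture) that arch_wb_cache_pmem(addr,
/// len) must issue for a byte range, given the cache line size in bytes.
pub fn wb_line_count(addr: u64, len: u64, line: u64) -> u64 {
    if len == 0 {
        return 0; // nothing to flush
    }
    let first = addr / line;            // line containing the first byte
    let last = (addr + len - 1) / line; // line containing the last byte
    last - first + 1
}
```

Note that a 2-byte write straddling a 64-byte boundary needs two flushes, which is why the count cannot be computed as a simple len / 64.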

Why the page cache path does not apply: DAX files have page_cache: None in their AddressSpace (Section 14.1). There are no dirty page cache pages to write back. Data was written directly to persistent memory via CPU store instructions (through the DAX mapping). The only thing that might not be persistent is data sitting in CPU write-back caches. dax_fsync() + arch_wb_cache_pmem() flushes exactly those cache lines.

15.16.5 PMEM Error Handling

Persistent memory is physical media and can develop errors (bit rot, wear-out, manufacturing defects). The error model mirrors Linux badblocks:

Error sources:
  1. UCE (Uncorrectable Error) — MCE (Machine Check Exception) on x86,
     SEA (Synchronous External Abort) on ARM.
     CPU receives #MC / abort when reading a poisoned cache line.

  2. ARS (Address Range Scrub) — ACPI background scan discovers latent
     errors before they're read. Results reported via ACPI NFIT.

  3. CXL Media Error — CXL 3.0 devices report media errors via CXL
     event log (Get Event Records command).

Kernel response:
  MCE/SEA on PMEM page:
    1. Mark physical page as HWPoison (same as DRAM MCE path).
    2. Add to per-region badblocks list.
    3. If a process has the page mapped:
       a. DAX mapping → deliver SIGBUS (BUS_MCEERR_AR) with fault address.
       b. Process can handle SIGBUS and skip/retry the corrupted region.
    4. Filesystem (ext4/xfs DAX) is notified via dax_notify_failure().
       Filesystem marks affected file range as damaged.

  ARS/CXL background error:
    1. ACPI notification or CXL event interrupt.
    2. Add to badblocks list.
    3. If mapped: deliver SIGBUS (BUS_MCEERR_AO — action optional).
    4. Userspace can query badblocks via /sys/block/pmemN/badblocks.
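The badblocks list feeds back into the I/O path: before satisfying an access, the pmem driver checks whether the requested range touches a known-bad region. A hedged sketch of that check, using (start_sector, num_sectors) pairs mirroring the /sys/block/pmemN/badblocks format (the function name and representation are illustrative):

```rust
/// Return true if the requested sector range [start, start+len)
/// intersects any recorded badblock range. An intersecting access is
/// failed with EIO (or, for DAX mappings, faults with SIGBUS) rather
/// than touching poisoned media.
pub fn range_hits_badblock(
    badblocks: &[(u64, u64)], // (start_sector, num_sectors) entries
    start: u64,
    len: u64,
) -> bool {
    let end = start + len; // exclusive
    badblocks.iter().any(|&(bb_start, bb_len)| {
        let bb_end = bb_start + bb_len;
        // Standard half-open interval overlap test.
        start < bb_end && bb_start < end
    })
}
```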

15.16.6 Integration with Memory Tiers

Persistent memory becomes another level in the memory hierarchy. Note: the "Memory Level" numbering below refers to the memory distance hierarchy, NOT the driver isolation tiers (Tier 0/1/2) used elsewhere in this architecture.

Existing memory levels (see numa-topology-and-policy, dsm-global-memory-pool):
  Level 0: Per-CPU caches
  Level 1: Local DRAM
  Level 2: Remote DRAM (cross-socket)
  Level 3: CXL pooled memory
  ...

Extended:
  Level N: Persistent memory (CXL-attached or NVDIMM)
    Properties:
      - Byte-addressable (like DRAM)
      - Survives power loss (like storage)
      - Higher latency than DRAM (~200-500ns vs ~80ns)
      - Lower bandwidth than DRAM
      - Cannot be evicted (it IS the backing store)

15.16.7 Linux Compatibility

Linux persistent memory interfaces are preserved:

/dev/pmem0, /dev/pmem1:       Block device interface (libnvdimm)
/dev/dax0.0, /dev/dax1.0:    Character DAX device (devdax)
mount -o dax /dev/pmem0 /mnt: DAX-mounted filesystem
mmap() with MAP_SYNC:         Guaranteed persistence of metadata

Optane Discontinuation Note:

Intel discontinued Optane persistent memory products in 2022. The persistent memory design in this section is hardware-agnostic — it applies to any byte-addressable persistent medium. CXL 3.0 Type 3 devices with persistence (battery-backed or inherently persistent media) are the expected successor. The PmemTechnology enum includes CxlPersistent for this reason. The DAX path, cache flush protocol, and error handling are technology-independent.

PMEM Namespace Discovery:

Persistent memory regions are discovered via:

  • ACPI NFIT (NVDIMM Firmware Interface Table): For NVDIMM-N and legacy Optane. The NFIT describes each PMEM region's physical address range, interleave set, and health status.
  • CXL DVSEC (Designated Vendor-Specific Extended Capability): For CXL-attached persistent memory. CXL devices advertise memory regions via PCIe DVSEC structures. The kernel's CXL driver enumerates regions and creates /dev/daxN.M device nodes.
  • Namespace management: Regions are partitioned into namespaces via ndctl (userspace tool) using the Linux-compatible namespace management ioctl interface. UmkaOS implements the same ioctls via umka-sysapi.

15.16.8 Performance Impact

Zero overhead for systems without persistent memory. When persistent memory is present, DAX I/O is faster than standard buffered I/O because it eliminates page cache copies and writeback entirely.

15.16.9 Filesystem Repair and Consistency Checking

Filesystem repair (fsck, xfs_repair, btrfs check) is handled by existing Linux userspace utilities running against UmkaOS's block device interface. UmkaOS does not implement in-kernel repair paths — the standard Linux repair tools are unmodified userspace binaries that interact with block devices via standard syscalls (open, read, write, ioctl). Since UmkaOS implements the complete block device interface (Section 15.2) and the relevant filesystem syscalls (Section 19.1), these tools work unchanged:

  • e2fsck / fsck.ext4 for ext4 repair
  • xfs_repair for XFS repair
  • btrfs check / btrfs scrub for btrfs repair (btrfs scrub runs online)
  • ZFS self-heals via block-level checksums (Section 15.10); zpool scrub is the equivalent of fsck for ZFS

No kernel-side changes are needed to support these tools. The only UmkaOS-specific consideration is that filesystem drivers should expose consistent BLKFLSBUF and BLKRRPART ioctl behavior matching Linux, as some repair tools use these to synchronize cache state.

15.16.10 SCSI-3 Persistent Reservations

SCSI-3 Persistent Reservations (PR) are required for shared-storage cluster fencing (Section 15.14). UmkaOS's block I/O layer implements the following PR commands as ioctls on block devices:

  • PR_REGISTER / PR_REGISTER_AND_IGNORE: register a reservation key with the storage target. Each node registers a unique key (derived from node ID).
  • PR_RESERVE: acquire a reservation (Write Exclusive, Exclusive Access, or their "Registrants Only" variants).
  • PR_RELEASE: release a held reservation.
  • PR_CLEAR: clear all registrations and reservations.
  • PR_PREEMPT / PR_PREEMPT_AND_ABORT: preempt another node's reservation (used for fencing — a surviving node preempts the fenced node's key).

These map to SCSI PR IN / PR OUT commands (SPC-4) for SCSI/SAS devices and to NVMe Reservation Register/Acquire/Release/Report commands for NVMe devices. The block layer translates between the common ioctl interface and the device-specific command set. The fencing integration with Section 5.8's membership protocol uses PR_PREEMPT_AND_ABORT to revoke a dead node's storage access before recovering its DLM locks.
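The fencing flow can be sketched as a function producing the ordered PR commands a surviving node issues. PrCmd and fence_sequence are illustrative names; in particular, using the node ID directly as the 64-bit reservation key is an assumption for the sketch, since the text only says keys are derived from node IDs.

```rust
/// Block-layer PR commands relevant to fencing (mirroring the ioctl
/// list above; not an exhaustive enum).
#[derive(Debug, PartialEq)]
pub enum PrCmd {
    RegisterKey(u64),
    PreemptAndAbort { victim_key: u64, our_key: u64 },
}

/// Commands issued to fence a dead peer before DLM lock recovery:
/// ensure our own key is registered with the target, then
/// preempt-and-abort the dead node's key, which both revokes its
/// access and aborts its in-flight commands at the storage target.
pub fn fence_sequence(our_node: u64, dead_node: u64) -> Vec<PrCmd> {
    let our_key = our_node;   // assumed key derivation (illustrative)
    let dead_key = dead_node;
    vec![
        PrCmd::RegisterKey(our_key),
        PrCmd::PreemptAndAbort { victim_key: dead_key, our_key },
    ]
}
```

Ordering matters: PREEMPT AND ABORT is only accepted from a registered initiator, so the registration must reach the target first.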


15.17 Computational Storage

15.17.1 Problem

NVMe Computational Storage Devices (CSDs) can run compute on the storage device: filter, aggregate, search, compress — without moving data to the host CPU.

15.17.2 Design: CSD as AccelBase Device

A CSD naturally fits the accelerator framework (Section 22.1). It's a device with local memory (flash) and compute capability (embedded processor):

// Extends AccelDeviceClass (Section 22.1)

#[repr(u32)]
pub enum AccelDeviceClass {
    Gpu             = 0,
    GpuCompute      = 1,
    Npu             = 2,
    Tpu             = 3,
    Fpga            = 4,
    Dsp             = 5,
    MediaProcessor  = 6,
    /// Computational Storage Device.
    /// "Local memory" = flash storage on the device.
    /// "Compute" = embedded processor running submitted programs.
    ComputeStorage  = 7,
    Other           = 255,
}

Note: The AccelDeviceClass enum is canonically defined in Section 22.1 (11-accelerators.md). The ComputeStorage variant (value 7) must be added to the canonical definition to support computational storage devices.

15.17.3 CSD Command Submission

Standard NVMe read (move data to compute):
  Host CPU ← 1 TB data ← NVMe SSD
  Host CPU processes 1 TB → produces 1 MB result
  Total data moved: 1 TB

CSD compute (move compute to data):
  Host CPU → submit "grep pattern" → CSD
  CSD processes 1 TB internally → produces 1 MB result
  Host CPU ← 1 MB ← CSD
  Total data moved: 1 MB (a 1,000,000x reduction)

The CSD accepts commands via the AccelBase vtable:

- create_context: allocate CSD execution context
- submit_commands: submit a compute program (filter, aggregate, map, etc.)
- poll_completion: check if computation is done
- Results returned via DMA to host memory
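The vtable flow can be sketched against a toy in-memory device. CsdDevice and ToyCsd are illustrative stand-ins for the AccelBase interface named above; a real driver submits NVMe Computational Programs commands and the filter runs on the device's embedded processor, not on the host.

```rust
/// Minimal CSD submission interface, mirroring the vtable entry points.
pub trait CsdDevice {
    fn create_context(&mut self) -> u64;
    fn submit_commands(&mut self, ctx: u64, program: &str) -> Result<(), &'static str>;
    /// Returns Some(result bytes) once the computation completes.
    fn poll_completion(&mut self, ctx: u64) -> Option<Vec<u8>>;
}

/// Toy device: "executes" a grep-style filter against an in-memory
/// namespace, modelling move-compute-to-data. Only matching records
/// cross back to the host.
pub struct ToyCsd {
    pub namespace: Vec<String>,
    next_ctx: u64,
    pending: Option<(u64, String)>,
}

impl ToyCsd {
    pub fn new(data: Vec<String>) -> Self {
        ToyCsd { namespace: data, next_ctx: 0, pending: None }
    }
}

impl CsdDevice for ToyCsd {
    fn create_context(&mut self) -> u64 {
        self.next_ctx += 1;
        self.next_ctx
    }
    fn submit_commands(&mut self, ctx: u64, program: &str) -> Result<(), &'static str> {
        self.pending = Some((ctx, program.to_string()));
        Ok(())
    }
    fn poll_completion(&mut self, ctx: u64) -> Option<Vec<u8>> {
        match self.pending.take() {
            Some((c, pat)) if c == ctx => {
                // The "compute program": filter records on-device.
                let hits: Vec<String> = self.namespace.iter()
                    .filter(|r| r.contains(pat.as_str()))
                    .cloned()
                    .collect();
                Some(hits.join("\n").into_bytes())
            }
            other => { self.pending = other; None }
        }
    }
}
```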

15.17.4 CSD Security Model

CSDs run arbitrary compute programs on the device's embedded processor. The kernel must enforce access boundaries:

Capability-gated namespace access:
  1. Each NVMe namespace has an owner (cgroup or capability).
  2. CSD compute programs can ONLY access namespaces granted to
     the submitting process's capability set.
  3. Cross-namespace access (e.g., join across two datasets on
     different namespaces) requires capabilities for BOTH namespaces.
  4. The CSD driver enforces this BEFORE submitting to hardware
     via the NVMe Computational Storage command set.

Program validation:
  - CSD programs are opaque to the kernel (device-specific bytecode).
  - The kernel does NOT inspect or validate program contents.
  - Trust boundary: the NVMe device enforces isolation between
    namespaces at the hardware level (NVMe namespace isolation).
  - If the CSD hardware lacks namespace isolation, the kernel
    treats the device as single-tenant (only one cgroup at a time).

DMA buffer isolation:
  - Result DMA buffers are allocated from the submitting process's
    address space (via IOMMU-mapped regions, same as GPU DMA).
  - CSD cannot DMA to arbitrary host memory — IOMMU enforces this.

CSD Program Validation and IOMMU Enforcement:

Before submitting a CSD program to a device, the kernel performs:

1. IOMMU domain restriction: The CSD device is placed in an isolated IOMMU domain (one per process/namespace submitting CSD work). The IOMMU mapping for the CSD domain is restricted to:
   - The input data region(s) specified in the submission descriptor.
   - The output data region(s) specified in the submission descriptor.
   - The program binary itself (if stored in a device-accessible region).

   Any attempt by the CSD device to DMA outside these regions raises an IOMMU fault, which terminates the CSD operation and returns EPERM to the submitting process.

2. Capability check: CSD program submission requires CAP_ACCEL_SUBMIT (Section 9.2) on the CSD device's capability object. Programs submitted via a cgroup with storage quota enforcement additionally require that the submission's estimated compute units do not exceed the cgroup's CSD budget.

3. Program opaqueness vs. DMA opaqueness: The program logic is opaque to the kernel (vendor-specific bytecode). However, the DMA access pattern is NOT opaque: the IOMMU enforces that the device can only DMA to the addresses explicitly listed in the submission. The program cannot expand its DMA scope at runtime.

4. Namespace isolation: Each process namespace maps to a distinct IOMMU domain. Programs from process A cannot access data mapped into process B's CSD domain. Shared CSD regions (for cooperative workloads) require an explicit capability grant from process B to process A (Section 9.1 capability delegation) and a corresponding IOMMU mapping shared between the two domains.

5. Program signing (optional policy): Operators can configure CSD device policies to reject programs without a valid signature (csd_policy: require_signed = true). The signature is checked against the system's IMA policy (Section 9.5). Unsigned programs return EKEYREJECTED.
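Step 1's scope rule can be expressed as a small predicate. The function name and range representation are illustrative (half-open physical address ranges taken from the submission descriptor); a real IOMMU enforces this at page granularity in hardware.

```rust
/// Return true if the requested DMA access falls entirely inside one
/// of the ranges granted to the CSD's IOMMU domain. Ranges are
/// half-open (start, end) physical address pairs. An access straddling
/// two ranges is rejected here for simplicity; hardware checks page
/// by page and would fault on the first unmapped page instead.
pub fn dma_access_allowed(allowed: &[(u64, u64)], addr: u64, len: u64) -> bool {
    allowed.iter().any(|&(s, e)| addr >= s && addr + len <= e)
}
```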

15.17.5 CSD Error Handling

Error scenarios and kernel response:

Timeout (program runs too long):
  1. CSD command timeout (default: 300s, configurable via AccelBase).
     300s default accommodates long-running CSD programs (e.g., full-scan
     compression, dedup over multi-TB datasets). Short timeouts can be set
     per-submission via AccelSubmitParams.timeout_ns for latency-sensitive ops.
  2. Kernel sends NVMe Abort command for the specific command ID.
  3. Returns -ETIMEDOUT to the submitting process.
  4. If abort fails: NVMe controller reset (same path as NVMe I/O timeout).

Hardware error (device reports failure):
  1. CSD returns NVMe status code (e.g., Internal Error, Data Transfer Error).
  2. Kernel maps to errno: -EIO for hardware faults, -ENOMEM for device
     memory exhaustion, -EINVAL for malformed programs.
  3. Error counter incremented in /sys/class/accel/csdN/errors.
  4. If error rate exceeds threshold: driver marks device degraded,
     stops accepting new submissions, notifies userspace via udev event.

Device reset:
  1. NVMe controller reset via PCIe FLR (Function Level Reset).
  2. All in-flight CSD commands are failed with -EIO.
  3. Contexts are invalidated; processes must re-create them.
  4. Same recovery path as standard NVMe timeout handling in Linux.
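The errno mapping from step 2 of the hardware-error path, as a standalone sketch. The NvmeStatus variants are illustrative labels for the status codes named above, not a full NVMe status-code enum; errno constants carry their Linux values.

```rust
/// Illustrative subset of NVMe completion statuses a CSD may return.
#[derive(Clone, Copy)]
pub enum NvmeStatus {
    Success,
    InternalError,
    DataTransferError,
    DeviceMemoryExhausted,
    MalformedProgram,
}

pub const EIO: i32 = 5;
pub const ENOMEM: i32 = 12;
pub const EINVAL: i32 = 22;

/// Map a CSD completion status to the errno returned to the submitter,
/// following the policy above: hardware faults become -EIO, device
/// memory exhaustion -ENOMEM, malformed programs -EINVAL.
pub fn csd_status_to_errno(s: NvmeStatus) -> i32 {
    match s {
        NvmeStatus::Success => 0,
        NvmeStatus::InternalError | NvmeStatus::DataTransferError => -EIO,
        NvmeStatus::DeviceMemoryExhausted => -ENOMEM,
        NvmeStatus::MalformedProgram => -EINVAL,
    }
}
```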

15.17.6 Linux Compatibility

NVMe Computational Storage is defined in separate NVMe technical proposals — primarily TP 4091 (Computational Programs) and TP 4131 (Subsystem Local Memory) — not in the NVMe 2.0 base specification. These TPs define the Computational Programs I/O command set and the Subsystem Local Memory I/O command set as independent command sets within the NVMe 2.0 specification library architecture (which separates base spec, command set specs, and transport specs into distinct documents). Linux support is emerging (/dev/ngXnY namespace devices). UmkaOS supports the same device files and NVMe ioctls through umka-sysapi.

CSD Programming Model:

CSD programs are opaque command buffers — the kernel does not interpret or compile them. The programming model:

  1. Vendor SDK in userspace: Each CSD vendor provides a userspace SDK that compiles programs for their embedded processor (e.g., Samsung SmartSSD SDK, ScaleFlux CSD SDK).
  2. NVMe TP 4091 (Computational Programs): The NVMe technical proposal defines a standard command set for managing computational programs on CSDs. Programs are uploaded via NVMe admin commands and executed via NVMe I/O commands.
  3. Kernel role: The kernel manages namespace access (capability-gated), DMA buffer allocation (IOMMU-protected), command timeout enforcement, and error reporting. The kernel does NOT validate program correctness — that is the vendor SDK's responsibility.

CSD Data Affinity:

For workloads that benefit from computational storage, data should be placed on the CSD's local namespaces:

  • Filesystem-level routing: Mount a CSD-backed filesystem and place data files on it. CSD compute programs access data locally (no PCIe transfer).
  • Cgroup hint: csd.preferred_device cgroup knob suggests which CSD device should be preferred for new file allocations within that cgroup's processes. Advisory only — the filesystem makes the final placement decision.
  • Explicit placement: Applications using O_DIRECT + the NVMe passthrough interface can target specific CSD namespaces directly.

15.17.7 Performance Impact

CSD offload reduces host CPU usage and PCIe bandwidth consumption, improving performance for data-heavy workloads. Zero overhead when CSDs are not present.


15.18 I/O Priority and Scheduling

UmkaOS implements per-task I/O priority with full Linux ioprio_set/ioprio_get syscall compatibility. The UmkaOS I/O scheduler (MQPA — Multi-Queue Priority-Aware) is a unified implementation that replaces the Linux family of pluggable schedulers (CFQ, mq-deadline, BFQ, kyber) with a single, purpose-built scheduler that is correct, composable, and integrates natively with NVMe multi-queue hardware.

15.18.1 Syscall Interface

ioprio_set(which: i32, who: i32, ioprio: i32) -> 0 | -EINVAL | -EPERM | -ESRCH
ioprio_get(which: i32, who: i32) -> ioprio: i32 | -EINVAL | -EPERM | -ESRCH

Syscall numbers (x86-64): ioprio_set = 251, ioprio_get = 252. Syscall numbers (i386 compat): ioprio_set = 289, ioprio_get = 290. Syscall numbers (AArch64): ioprio_set = 30, ioprio_get = 31.

which argument — target scope:

| Constant | Value | Meaning |
|---|---|---|
| IOPRIO_WHO_PROCESS | 1 | Single process or thread identified by who (PID/TID). If who = 0, the calling thread. |
| IOPRIO_WHO_PGRP | 2 | All processes in the process group identified by who. If who = 0, the caller's process group. |
| IOPRIO_WHO_USER | 3 | All processes whose real UID matches who. |

ioprio_get with PGRP/USER: When multiple processes match, returns the highest priority found: RT > BE > Idle, and within the same class, the numerically lowest level (0 = highest).
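Because the Linux class values order RT (1) < BE (2) < Idle (3) and a lower level is a higher priority, this aggregation is simply the lexicographic minimum over (class, level) pairs. A sketch (illustrative function; explicit classes only, the None class is excluded for simplicity):

```rust
/// Aggregate the priorities of all tasks matched by IOPRIO_WHO_PGRP or
/// IOPRIO_WHO_USER. Input pairs are (class, level) with Linux class
/// values (1 = RT, 2 = BE, 3 = Idle). Returns None when no task matched
/// (the syscall would then return -ESRCH).
pub fn best_ioprio(prios: &[(u8, u8)]) -> Option<(u8, u8)> {
    // Lower class value = stronger class; lower level = higher priority
    // within a class. The lexicographic minimum is therefore the winner.
    prios.iter().copied().min()
}
```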

Error conditions:

| Error | Condition |
|---|---|
| EINVAL | which is not one of the three valid values; ioprio encodes an invalid class (> 3) or level (> 7); the level is non-zero for IoSchedClass::Idle. |
| EPERM | Caller lacks CAP_SYS_ADMIN when setting RT class; caller lacks CAP_SYS_NICE when setting another user's tasks. |
| ESRCH | No process matching the given which/who combination was found. |

15.18.2 IoPriority Encoding

The ioprio value is a 16-bit quantity passed as a 32-bit int (upper 16 bits must be zero). The bit layout is identical to Linux's <linux/ioprio.h>:

bits 15-13: I/O scheduling class (3 bits)
bits 12-3:  Priority hint (10 bits; used for SCSI command duration limits)
bits  2-0:  Priority level within the class (3 bits; values 0-7)

The 13-bit "data" field (bits 12-0, accessed via IOPRIO_PRIO_DATA()) is further split into a 10-bit hint and a 3-bit level. Linux added the hint sub-field in 6.5 for SCSI command duration limit hints (IOPRIO_HINT_DEV_DURATION_LIMIT_*).

/// Per-task I/O priority. Wire-compatible with Linux `ioprio` values.
///
/// Bit layout (little-endian u16):
///   [15:13] = IoSchedClass (3 bits)
///   [12:3]  = hint (10 bits; IOPRIO_HINT_* values)
///   [2:0]   = level (3 bits; values 0–7; 0 = highest priority)
///
/// The `IOPRIO_PRIO_DATA(ioprio)` macro returns bits [12:0] (hint + level combined).
/// UmkaOS exposes separate `hint()` and `level()` accessors for clarity.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct IoPriority(u16);

impl IoPriority {
    /// Construct an `IoPriority` from class, hint, and level.
    ///
    /// `level` must be in 0..=7. `hint` must be in 0..=0x3ff (10 bits).
    /// Callers should validate before constructing.
    pub const fn new(class: IoSchedClass, hint: u16, level: u8) -> Self {
        IoPriority(
            ((class as u16) << 13)
            | ((hint & 0x3ff) << 3)
            | (level as u16 & 0x7)
        )
    }

    /// Decode the scheduling class from the encoded value.
    pub fn class(self) -> IoSchedClass {
        match (self.0 >> 13) & 0x7 {
            0 => IoSchedClass::None,
            1 => IoSchedClass::RealTime,
            2 => IoSchedClass::BestEffort,
            3 => IoSchedClass::Idle,
            _ => IoSchedClass::None, // class values 4-7 are invalid; treat as None
        }
    }

    /// Decode the priority hint (10 bits). Used for SCSI command duration
    /// limit hints (`IOPRIO_HINT_DEV_DURATION_LIMIT_*`).
    pub fn hint(self) -> u16 {
        (self.0 >> 3) & 0x3ff
    }

    /// Decode the priority level (3 bits, 0 = highest within the class).
    pub fn level(self) -> u8 {
        (self.0 & 0x7) as u8
    }

    /// Return the combined 13-bit data field (hint + level), matching
    /// Linux's `IOPRIO_PRIO_DATA(ioprio)` = `ioprio & 0x1fff`.
    pub fn data(self) -> u16 {
        self.0 & 0x1fff
    }

    /// Round-trip to/from the raw `i32` syscall argument.
    pub fn from_raw(raw: i32) -> Option<Self> {
        if raw < 0 || raw > 0xffff { return None; }
        let class = (raw >> 13) & 0x7;
        if class > 3 { return None; } // Invalid class (4-7 reserved)
        Some(IoPriority(raw as u16))
    }

    pub fn to_raw(self) -> i32 {
        self.0 as i32
    }

    /// The zero value: class = None, hint = 0, level = 0.
    /// Semantics: inherit priority from CPU nice value.
    pub const NONE: IoPriority = IoPriority(0);
}

/// SCSI command duration limit hint values (bits [12:3] of ioprio).
/// Linux `include/uapi/linux/ioprio.h` defines these since kernel 6.5.
pub const IOPRIO_HINT_NONE: u16                    = 0;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_1: u16    = 1;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_2: u16    = 2;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_3: u16    = 3;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_4: u16    = 4;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_5: u16    = 5;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_6: u16    = 6;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_7: u16    = 7;

/// I/O scheduling class. Numeric values are identical to Linux
/// `IOPRIO_CLASS_*` constants — do not renumber.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum IoSchedClass {
    /// Class not set. I/O priority is derived from CPU nice (see Section 15.18.3).
    None       = 0,
    /// Real-time. Levels 0–7, 0 = highest. Preempts all BestEffort and Idle I/O.
    RealTime   = 1,
    /// Best-effort. Levels 0–7, 0 = highest. Default class for all tasks.
    BestEffort = 2,
    /// Idle. Served only when no RT or BE I/O is pending.
    /// The level field is ignored; all Idle I/O is equal.
    Idle       = 3,
}

Validation rules (enforced by ioprio_set before storing):

- Class must be 0–3 (values 4–7 are reserved; return EINVAL).
- For RT and BE: level must be 0–7 (return EINVAL if level > 7). Hint bits are NOT validated — arbitrary hint values are passed through to the block layer (Linux behavior). Unknown hints are ignored by drivers.
- For Idle: level must be 0 (any non-zero level is EINVAL). Hint is ignored.
- For None: level must be 0, hint must be 0.
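These rules can be checked by a standalone validator over the raw syscall argument. This is a self-contained sketch (the function name is illustrative; the EINVAL constant carries its Linux value); the in-kernel path would perform the same checks before storing into Task.io_priority.

```rust
pub const EINVAL: i32 = 22;

/// Validate a raw ioprio value per the rules above. Field extraction
/// follows the bit layout in Section 15.18.2: class in [15:13],
/// hint in [12:3], level in [2:0].
pub fn validate_ioprio(raw: u32) -> Result<(), i32> {
    if raw > 0xffff {
        return Err(EINVAL); // upper 16 bits must be zero
    }
    let class = (raw >> 13) & 0x7;
    let hint = (raw >> 3) & 0x3ff;
    let level = raw & 0x7;
    match class {
        // None: both level and hint must be zero.
        0 if level == 0 && hint == 0 => Ok(()),
        0 => Err(EINVAL),
        // RT / BE: any 3-bit level; hint bits pass through unvalidated.
        1 | 2 => Ok(()),
        // Idle: level must be zero; hint is ignored.
        3 if level == 0 => Ok(()),
        3 => Err(EINVAL),
        // Class values 4-7 are reserved.
        _ => Err(EINVAL),
    }
}
```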

15.18.3 Priority Inheritance from CPU Nice

When a task has IoPriority::NONE (class = IoSchedClass::None), its effective I/O priority is derived from its CPU nice value at dispatch time. This matches Linux behavior:

effective_class = BestEffort
effective_level = (nice + 20) / 5

This maps the nice range −20..+19 to BE levels 0..7:

| nice | effective BE level |
|---|---|
| −20 | 0 (highest) |
| −15 | 1 |
| −10 | 2 |
| −5 | 3 |
| 0 | 4 (default) |
| 5 | 5 |
| 10 | 6 |
| 19 | 7 (lowest) |

The derivation happens in the dispatch path, not at ioprio_set time, so that a subsequent setpriority(2) call continues to influence I/O priority as expected.
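The derivation is a one-liner; a sketch of the dispatch-time computation (the function name is illustrative, and the defensive clamp is an addition not stated in the text):

```rust
/// Effective I/O priority for a task whose class is None, evaluated
/// at dispatch time from its current nice value. Returns (class, level)
/// with class 2 = BestEffort (Linux IOPRIO_CLASS_BE).
pub fn ioprio_from_nice(nice: i32) -> (u8, u8) {
    // Defensively clamp to the valid nice range before mapping
    // -20..=19 onto BE levels 0..=7.
    let n = nice.clamp(-20, 19);
    (2, ((n + 20) / 5) as u8)
}
```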

15.18.4 Task Storage and Inheritance

/// Fields added to the Task structure (see Chapter 8).
pub struct Task {
    // ... existing fields ...

    /// Explicitly set I/O priority. `IoPriority::NONE` means "derive from nice".
    pub io_priority: IoPriority,
}

Fork semantics: On fork(2) and clone(2) without CLONE_IO, the child inherits the parent's io_priority value verbatim. If the parent had IoPriority::NONE, the child also starts with IoPriority::NONE and its effective priority is derived from its own nice value (which it also inherits from the parent, but may be changed independently).

CLONE_IO: When CLONE_IO is set, the child shares the parent's I/O context (same io_context pointer). In this case the io_priority is also shared — a write by either task is visible to the other. This is the same as Linux.

Thread groups: POSIX threads within the same process do NOT share io_priority by default (consistent with Linux). Each thread has an independent io_priority. Tools that wish to set the priority for all threads of a process must call ioprio_set(IOPRIO_WHO_PROCESS, tid, ioprio) once per thread, using TIDs from /proc/<pid>/task/.

15.18.5 Permission Model

UmkaOS enforces the same permission rules as Linux:

| Operation | Required capability |
|---|---|
| Set IoSchedClass::RealTime for any task | CAP_SYS_ADMIN |
| Set IoSchedClass::BestEffort or IoSchedClass::Idle for own tasks | None |
| Set IoSchedClass::BestEffort or IoSchedClass::Idle for another user's tasks | CAP_SYS_NICE |
| Set priority for a process group or all processes of a UID | Same as for individual processes |
| Read priority of any task | None (always permitted) |

"Own tasks" means: tasks whose real or effective UID matches the caller's real UID, or tasks in the caller's process group when which = IOPRIO_WHO_PGRP. Setting a higher-than-current BE level (lower priority number) for one's own tasks is always permitted.

15.18.6 UmkaOS I/O Scheduler: Multi-Queue Priority-Aware (MQPA)

UmkaOS does not implement CFQ, BFQ, mq-deadline, or kyber as separate pluggable schedulers. Instead, UmkaOS implements a single unified scheduler — MQPA — that provides the correct behavior for all workloads without the configuration complexity of Linux's scheduler selection knob.

Design rationale vs Linux schedulers:

- CFQ: removed in Linux 5.0 along with the legacy (single-queue) block layer. Had a global elevator lock, per-process queues with O(n) dispatch, poor NVMe multi-queue support.
- BFQ: per-process B-WF2Q+ scheduling with budget tracking. Good fairness, but complex and has a single per-device lock that limits scaling on high-queue-depth SSDs.
- mq-deadline: simple, fast, low overhead, but only provides read/write starvation prevention — no per-class prioritization beyond that.
- kyber: good SSD latency targeting, but no class-based priority support.

MQPA provides class-based strict priority (RT > BE > Idle), weighted round-robin within BE levels, per-CPU queues for lock-free submission, elevator merge optimization, and NVMe hardware queue integration — without any of the above limitations.

15.18.6.1 Scheduler Data Structures

Note on state ownership: The canonical per-device queue container is DeviceIoQueues, owned by the BlockDevice (not by the scheduler algorithm). The dispatch algorithm (dispatch_one) becomes the IoSchedOps::pick_next() implementation. See Section 15.18.10 for the full ownership model.

/// A single block I/O request, created by the submission path and dispatched
/// through the DeviceIoQueues to the hardware queue. Each IoRequest corresponds
/// to one contiguous LBA range (possibly merged from adjacent submissions).
pub struct IoRequest {
    /// Logical block address of the first sector.
    pub lba: Lba,
    /// Length in bytes (aligned to sector size). Named `len_bytes` (not `len`)
    /// to prevent ambiguity between bytes and sectors. u64: UmkaOS rule — no u32
    /// for sizes. u32 caps at 4 GiB, which is too restrictive for NVMe 128 KiB+
    /// scatter-gather and CXL memory-mapped storage.
    pub len_bytes: u64,
    /// Operation type. Uses BioOp directly (no redundant IoOp enum).
    /// See [Section 15.2](#block-io-and-volume-management) for BioOp values.
    pub op: BioOp,
    /// Resolved priority from the submitting task (Section 15.15.2).
    pub priority: IoPriority,
    /// Monotonic timestamp of submission (ns). Used for latency accounting
    /// in `/proc/PID/io` (Section 15.15.9) and deadline starvation detection.
    pub submit_ns: u64,
    /// Absolute deadline (ns). For RT class: `submit_ns + rt_deadline_ns`.
    /// For BE/Idle: `submit_ns + be_deadline_ns`. If the scheduler has not
    /// dispatched this request by `deadline_ns`, it is promoted to the head
    /// of its queue (starvation prevention).
    pub deadline_ns: u64,
    /// PID of the submitting task (for per-process I/O accounting).
    pub pid: Pid,
    /// cgroup ID of the submitting task (for cgroup I/O throttling, Section 15.15.8).
    pub cgroup_id: u64,
    /// Scatter-gather list of physical pages backing this request.
    /// Pinned for the duration of the I/O. See [Section 4.14](04-memory.md#dma-subsystem).
    pub sgl: DmaSgl,
    /// Back-pointer to the originating Bio. The scheduler uses this to:
    /// 1. Extract the Bio at dispatch time — `BlockDeviceOps::submit_bio()`
    ///    takes `&mut Bio`, so the scheduler unwraps the IoRequest back to
    ///    its originating Bio for driver dispatch.
    /// 2. Call `bio_complete(req.bio, status)` when hardware signals completion,
    ///    routing the result through the Bio's `end_io` callback back to the
    ///    submitter (filesystem, page cache, io_uring, sync waiter).
    ///
    /// # Safety
    /// The Bio is kept alive for the duration of the IoRequest's lifetime.
    /// The submitter transfers ownership to the completion path at
    /// `bio_submit()` time. The Bio is not freed until `bio_complete()`
    /// invokes `end_io`. See [Section 15.2](#block-io-and-volume-management--bio-lifecycle-and-ownership).
    pub bio: *mut Bio,
}

// IoOp removed — use BioOp ([Section 15.2](#block-io-and-volume-management)) directly.
// IoOp was a redundant subset of BioOp that lacked SecureErase and ZoneAppend.
// The I/O scheduler classifies operations for WRR dispatch and merge eligibility
// using BioOp directly via IoRequest.bio.op.

Design note (Decision 4 — IoCompletion removal): The previous IoCompletion enum had three variants (TaskWake, IoUringCqe, None) and required a bridging conversion (IoCompletion::from_bio_completion()) to route completion from IoRequest back to Bio. That bridge was never defined, creating a broken completion chain (BIO-01, BIO-05). The fix: IoRequest carries a *mut Bio back-pointer. On completion, the scheduler calls bio_complete(req.bio, status), which invokes the Bio's end_io callback — the callback set by the original submitter. No bridging conversion needed. The submitter (filesystem, io_uring, sync waiter) controls completion routing by setting Bio.end_io before bio_submit().
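A minimal sketch of that completion chain, with stand-in types — the real Bio (Section 15.2) carries the sector range, page list, and flags; only the end_io routing matters here:

```rust
/// Stand-in Bio: the real structure carries far more state. This sketch
/// keeps only what Decision 4 needs to show.
pub struct Bio {
    pub status: i32,
    pub completed: bool,
    /// Set by the submitter before bio_submit(); this is the only routing
    /// mechanism — no IoCompletion enum, no bridging conversion.
    pub end_io: fn(&mut Bio),
}

/// Completion entry point called when hardware signals completion:
/// record the status, then hand control back to the submitter's callback.
pub fn bio_complete(bio: *mut Bio, status: i32) {
    // SAFETY: the submitter keeps the Bio alive until end_io runs
    // (ownership rule from Section 15.2).
    let bio = unsafe { &mut *bio };
    bio.status = status;
    (bio.end_io)(bio);
}

fn main() {
    fn wake_submitter(b: &mut Bio) {
        b.completed = true; // e.g. wake a sync waiter or post an io_uring CQE
    }
    let mut bio = Bio { status: 0, completed: false, end_io: wake_submitter };
    bio_complete(&mut bio as *mut Bio, -5); // hardware reported an error
    assert!(bio.completed);
    assert_eq!(bio.status, -5);
}
```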

15.18.6.2 State Ownership for Live Evolution

The I/O scheduler follows the state spill avoidance pattern (see Section 13.18): per-device I/O queues are owned by the BlockDevice, not by the scheduler component. The scheduler is a stateless dispatch function that reads queue state and selects the next request to issue. This enables:

  1. Scheduler swap without queue drain: replacing the MQPA algorithm (or swapping to a future alternative) replaces only the pick_next function. All queued requests, in-flight counters, and deadline tracking survive the swap untouched.
  2. Driver crash recovery: when a Tier 1 storage driver crashes and reloads (Section 11.9), the I/O queues survive in the BlockDevice (Tier 0 kernel memory). The new driver instance sees all pending requests — no I/O is lost.
  3. Zero-overhead steady state: the IoSchedOps trait is a &'static dyn pointer resolved once at device init. No vtable indirection per-request beyond the single dispatch call.

/// Scheduler-private metadata embedded in each IoRequest.
/// The scheduler may use this to store per-request scheduling state
/// (e.g., virtual time, WRR credits, BFQ budget slice) without heap
/// allocation. The format is opaque to the block layer — only the
/// active scheduler implementation reads/writes these bytes.
///
/// 64 bytes = one cache line. Sufficient for all known scheduling
/// algorithms (BFQ budget: 24 bytes, mq-deadline: 8 bytes, MQPA WRR: 4 bytes).
pub const SCHED_DATA_SIZE: usize = 64;

Additional IoRequest field for scheduler state:

    /// Scheduler-private per-request metadata. Opaque to the block layer.
    /// Written by `IoSchedOps::on_submit()`, read by `IoSchedOps::pick_next()`.
    /// Zeroed on request allocation; the scheduler initializes it during submission.
    pub sched_data: [u8; SCHED_DATA_SIZE],

/// I/O scheduler algorithm interface (stateless pattern).
///
/// The scheduler does NOT own the I/O queues — they are owned by the
/// BlockDevice (via `DeviceIoQueues`). The scheduler provides stateless
/// decision functions that operate on queue references. This enables
/// live scheduler replacement without draining queues.
///
/// **Steady-state cost**: one `&'static dyn IoSchedOps` pointer dereference
/// per dispatch call. No additional indirection.
pub trait IoSchedOps: Send + Sync {
    /// Algorithm name (for `/sys/block/<dev>/queue/scheduler`).
    fn name(&self) -> &'static str;

    /// Called when a new request enters a queue. The scheduler may
    /// initialize `req.sched_data` (e.g., compute virtual time, assign
    /// WRR credits). The request is already inserted into the appropriate
    /// `IoQueue` by the block layer based on its `IoPriority` class/level.
    fn on_submit(&self, queues: &DeviceIoQueues, req: &mut IoRequest);

    /// Select the next request to dispatch to hardware. Returns `None` if
    /// all queues are empty or rate-limited. The scheduler reads queue state
    /// and `sched_data` but does NOT modify the queues — the block layer
    /// calls `IoQueue::pop_front()` on the selected queue after `pick_next`
    /// returns.
    fn pick_next(&self, queues: &DeviceIoQueues, cpu: CpuId) -> Option<PickResult>;

    /// Notification that a request completed. Used for accounting
    /// (e.g., decrement inflight counters, update WRR round state).
    fn on_complete(&self, queues: &DeviceIoQueues, req: &IoRequest);
}

/// Result of `pick_next`: identifies which queue and (for sorted queues)
/// which request to dispatch.
pub struct PickResult {
    /// Class of the selected queue (RT, BE, or Idle).
    pub class: IoSchedClass,
    /// Level within the class (0-7 for RT/BE, 0 for Idle).
    pub level: u8,
    /// CPU whose per-CPU queue to dequeue from.
    pub cpu: CpuId,
}

/// Per-device I/O queue set. Owned by BlockDevice (Tier 0 kernel memory),
/// NOT by the scheduler. Survives both scheduler swap and driver crash.
///
/// Created once during `BlockDevice` registration. Destroyed only when the
/// device is permanently removed.
///
/// **Memory budget**: 8 RT + 8 BE + 1 idle = 17 PerCpu<IoQueue> arrays per
/// device. On a 256-CPU system: 17 * 256 = 4352 IoQueue instances per device.
/// Each IoQueue is ~64 bytes (backing + oldest_enqueue_time + dispatched_this_round),
/// total ~272 KiB per device. For a server with 24 NVMe devices: ~6.4 MiB.
/// This is warm-path allocation (device registration), bounded per device.
pub struct DeviceIoQueues {
    /// Per-CPU dispatch queues for RT class, indexed by level (0 = highest).
    pub rt_queues: [PerCpu<IoQueue>; 8],
    /// Per-CPU dispatch queues for BE class, indexed by level (0 = highest).
    pub be_queues: [PerCpu<IoQueue>; 8],
    /// Single per-CPU idle queue.
    pub idle_queue: PerCpu<IoQueue>,
    /// Monotonic count of in-flight requests across all classes.
    pub inflight: AtomicU32,
    /// In-flight RT requests.
    pub inflight_rt: AtomicU32,
    /// Maximum queue depth supported by the device.
    pub queue_depth: u32,
    /// Block device identifier for per-cgroup-per-device io.latency budget lookup.
    /// Set at BlockDevice registration time.
    pub device_id: DeviceId,
    /// KABI service handle for the block device driver. Used by
    /// `dispatch_pending()` to call `submit_bio()` on the driver via
    /// `kabi_call!(block_handle, submit_bio, bio)`. The handle encodes
    /// the transport decision (direct call for Tier 0, ring for Tier 1)
    /// cached at device registration time.
    pub block_handle: KabiServiceHandle,
    /// Per-device slab cache for `IoRequest` allocation. Sized at boot:
    /// `nr_cpus * 128` entries. Used by `bio_to_io_request()` via
    /// `SlabArc::new(&request_slab, req)` to avoid heap allocation on
    /// the I/O submission hot path.
    pub request_slab: SlabCache<IoRequest>,
    /// Active scheduler algorithm. Swapped atomically during live evolution.
    ///
    /// Uses `RcuCell<&'static dyn IoSchedOps>` instead of `AtomicPtr` because
    /// `dyn IoSchedOps` is an unsized trait object with a fat pointer (data
    /// pointer + vtable pointer = 2 words). `AtomicPtr` only handles thin
    /// (single-word) pointers; `AtomicPtr<dyn IoSchedOps>` is not valid Rust
    /// because the pointee of `AtomicPtr` must be `Sized`.
    /// `RcuCell` stores both words atomically and provides RCU-protected
    /// reads (lock-free on the dispatch hot path) with safe writer-side swap.
    pub sched_ops: RcuCell<&'static dyn IoSchedOps>,
}

Scheduler evolution protocol:

  1. Prep: new IoSchedOps implementation loaded and verified.
  2. Atomic swap: DeviceIoQueues::sched_ops is replaced via sched_ops.rcu_replace(new_ops). The old reference is freed after an RCU grace period. No queue quiescence needed — the queues are untouched.
  3. Post-swap: new scheduler's on_submit() is called for newly arriving requests. Existing requests in queues retain their old sched_data; the new scheduler's pick_next() must handle sched_data written by the old scheduler (or treat unknown data as default priority). This is safe because pick_next() can always fall back to FIFO order within each priority queue.

Swap latency: ~1 us (RCU pointer swap + release fence). No stop-the-world IPI. No queue drain. Compare with the general component evolution path (Section 13.18) which requires 1-10 us stop-the-world.
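The swap in step 2 can be sketched as follows. RcuCell here is a stand-in built on RwLock purely so the example runs; the kernel's RcuCell gives lock-free readers and defers freeing the old value until an RCU grace period, as described above.

```rust
use std::sync::RwLock;

/// Stand-in for the kernel's RcuCell (illustration only). The real cell
/// provides lock-free rcu_read() and grace-period-deferred reclamation.
pub struct RcuCell<T: Copy>(RwLock<T>);

impl<T: Copy> RcuCell<T> {
    pub fn new(v: T) -> Self {
        RcuCell(RwLock::new(v))
    }
    pub fn rcu_read(&self) -> T {
        *self.0.read().unwrap()
    }
    /// Atomically publish the new value, returning the old one
    /// (which the kernel would free after a grace period).
    pub fn rcu_replace(&self, v: T) -> T {
        std::mem::replace(&mut *self.0.write().unwrap(), v)
    }
}

pub trait IoSchedOps {
    fn name(&self) -> &'static str;
}

struct Mqpa;
impl IoSchedOps for Mqpa {
    fn name(&self) -> &'static str { "umka-mqpa" }
}
struct MqpaV2; // hypothetical successor algorithm
impl IoSchedOps for MqpaV2 {
    fn name(&self) -> &'static str { "umka-mqpa-v2" }
}

fn main() {
    static MQPA: Mqpa = Mqpa;
    static MQPA_V2: MqpaV2 = MqpaV2;
    let sched_ops: RcuCell<&'static dyn IoSchedOps> = RcuCell::new(&MQPA);
    // Step 2 of the evolution protocol: atomic swap, queues untouched.
    let old = sched_ops.rcu_replace(&MQPA_V2);
    assert_eq!(old.name(), "umka-mqpa");
    assert_eq!(sched_ops.rcu_read().name(), "umka-mqpa-v2");
}
```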

/// Backing storage for an `IoQueue`, parameterised by media type.
///
/// - **Sorted** (rotational media — HDD): requests ordered by LBA for elevator
///   merge and seek-distance minimisation. `BTreeMap` provides O(log N) insert,
///   O(log N) predecessor/successor lookup for merge checks, and O(log N)
///   `pop_first()` for dispatch. Allocation per insert is negligible vs HDD
///   access latency (~4 ms for a 7200 RPM drive).
/// - **Fifo** (non-rotational media — NVMe, SSD, PMEM): no seek penalty; FIFO
///   preserves submission order and is optimal for deep hardware queues. Uses
///   `BoundedRing` (O(1) push/pop, pre-allocated at device init, no per-element allocation).
///
/// The variant is set once at `DeviceIoQueues` creation from `blk_queue_flag_set(QUEUE_FLAG_NONROT)`
/// and never changes at runtime. All `IoQueue` instances within one `DeviceIoQueues`
/// use the same variant.
pub enum IoQueueBacking {
    /// **Arc overhead tradeoff**: `Arc<IoRequest>` adds 16 bytes (refcount +
    /// allocation header) and one atomic increment/decrement per submit/complete.
    /// This is acceptable because: (1) `IoRequest` must be shared between the
    /// submitter, the scheduler, and the completion path (three owners); (2) the
    /// atomic refcount cost (~5-15 ns) is negligible compared to device I/O
    /// latency (~2-10 us for NVMe, ~4 ms for HDD); (3) the alternative (raw
    /// pointers) would require unsafe lifetime tracking across the async I/O
    /// boundary with no performance benefit.
    Sorted(BTreeMap<Lba, Arc<IoRequest>>),
    /// Bounded ring buffer for non-rotational media. Capacity is set to the
    /// device's hardware queue depth (NVMe MQES) at `DeviceIoQueues` creation
    /// time — no heap allocation on the I/O submission hot path.
    Fifo(BoundedRing<Arc<IoRequest>>),
}

/// A single priority-level dispatch queue.
pub struct IoQueue {
    /// Backing storage, parameterised by media type. See `IoQueueBacking`.
    pub backing: IoQueueBacking,

    /// Timestamp when the oldest request in this queue was enqueued.
    /// Used for starvation detection: if `Instant::now() - oldest_enqueue_time`
    /// exceeds the starvation threshold, the request is promoted.
    /// `None` if the queue is empty.
    oldest_enqueue_time: Option<Instant>,

    /// Number of requests dispatched from this queue in the current WRR round
    /// (BE queues only; unused for RT and Idle).
    dispatched_this_round: u32,
}

impl IoQueue {
    /// Dequeue the highest-priority request from this queue.
    /// For rotational media (BTreeMap), this pops the lowest-LBA entry.
    /// For non-rotational media (BoundedRing/Fifo), this pops from the front.
    /// Returns `None` if the queue is empty.
    pub fn pop_front(&mut self) -> Option<Arc<IoRequest>> { /* ... */ }

    /// Insert a request, attempting back-merge or front-merge with adjacent
    /// requests sharing the same block device and contiguous LBA range.
    /// If no merge is possible, the request is appended. Merged requests
    /// are capped at `MERGE_SIZE_LIMIT` (64 KiB) to bound latency.
    pub fn insert_merged(&mut self, req: Arc<IoRequest>) { /* ... */ }
}

15.18.6.3 Dispatch Algorithm

The dispatch loop runs when the device signals readiness for more commands (doorbell ring, completion interrupt, or explicit dispatch_pending() call from the submit path). The listing below fuses queue selection and the pop_front() into one function for exposition; in the IoSchedOps split (Section 15.18.6.2), the scan portion is the pick_next() implementation and the block layer performs the pop and the in-flight accounting on the returned PickResult.

fn dispatch_one(sched: &DeviceIoQueues, cpu: CpuId) -> Option<Arc<IoRequest>> {
    // IRQ-disable scope: held for the full scan+pop to prevent a completion
    // IRQ on the same CPU from modifying queue state between the starvation
    // check and the pop_front(). If IRQs were re-enabled between the check
    // and the pop, a completion IRQ could drain the queue, invalidating the
    // starvation check result.
    //
    // Worst-case IRQ-disabled window: ~5-10 us on cold cache (8 RT levels +
    // 8 BE levels with starvation checks + cgroup budget lookups). This is
    // below hard-RT deadlines but is a known overhead. Future optimization:
    // split into preempt-disabled scan (select target queue) + IRQ-disabled
    // pop (narrow critical section), retrying if the pop returns None.
    let irq_guard = IrqDisabledGuard::acquire();
    // IrqDisabledGuard implies preemption disabled. Obtain a PreemptGuard
    // from the disabled-IRQ context for the PerCpu API.
    let mut preempt_guard = PreemptGuard::from_irq_disabled(&irq_guard);
    // All PerCpu accesses below use .get_mut_nosave(&mut preempt_guard, &irq_guard).

    // Step 1: RT always wins. Scan RT levels 0..7, take first non-empty queue.
    for level in 0..8 {
        if let Some(req) = sched.rt_queues[level]
            .get_mut_nosave(&mut preempt_guard, &irq_guard).pop_front()
        {
            sched.inflight_rt.fetch_add(1, Release);
            sched.inflight.fetch_add(1, Release);
            return Some(req);
        }
    }

    // Step 2: Starvation promotion (BE). If any BE request has waited beyond
    // the starvation threshold (500ms since enqueue), treat it as RT-priority
    // for one dispatch.
    for level in 0..8 {
        let q = sched.be_queues[level]
            .get_mut_nosave(&mut preempt_guard, &irq_guard);
        if q.oldest_enqueue_time.map_or(false, |t| t.elapsed() > Duration::from_millis(500)) {
            if let Some(req) = q.pop_front() {
                sched.inflight.fetch_add(1, Release);
                return Some(req);
            }
        }
    }

    // Step 3: BE weighted round-robin with io.latency enforcement.
    // Weights: level 0 = 8, level 1 = 4, level 2 = 2, levels 3-7 = 1.
    //
    // io.latency enforcement (cgroup `io.latency` target):
    // For each cgroup with an active lat_target_us, the block layer tracks
    // per-cgroup I/O completion latency as an EMA (7/8 decay). When a
    // cgroup's EMA exceeds its target, sibling cgroups' dispatch budgets
    // are reduced proportionally:
    //   sibling_budget = max(1, normal_budget * target_us / sibling_ema)
    // This throttles siblings to give the latency-sensitive cgroup more
    // I/O bandwidth. The budget reduction is per-device, recalculated on
    // each completion. See [Section 17.2](17-containers.md#control-groups) for the full specification.
    let be_weights: [u32; 8] = [8, 4, 2, 1, 1, 1, 1, 1];
    for level in 0..8 {
        let q = sched.be_queues[level]
            .get_mut_nosave(&mut preempt_guard, &irq_guard);
        if q.dispatched_this_round < be_weights[level] {
            if let Some(req) = q.pop_front() {
                // io.latency check: if the dequeued request's cgroup has
                // exhausted its dispatch budget for this device, re-queue
                // and try the next level.
                if let Some(cg) = cgroup_for_bio(&req) {
                    if cg.io_dispatch_budget(sched.device_id).load(Relaxed) == 0 {
                        q.push_front(req);
                        continue;
                    }
                }
                q.dispatched_this_round += 1;
                sched.inflight.fetch_add(1, Release);
                return Some(req);
            }
        }
    }
    // End of WRR round: reset counters and retry from level 0.
    for level in 0..8 {
        sched.be_queues[level]
            .get_mut_nosave(&mut preempt_guard, &irq_guard)
            .dispatched_this_round = 0;
    }
    for level in 0..8 {
        let q = sched.be_queues[level]
            .get_mut_nosave(&mut preempt_guard, &irq_guard);
        if let Some(req) = q.pop_front() {
            q.dispatched_this_round = 1;
            sched.inflight.fetch_add(1, Release);
            return Some(req);
        }
    }

    // Step 4: Starvation promotion (Idle). 5s threshold since enqueue.
    {
        let iq = sched.idle_queue
            .get_mut_nosave(&mut preempt_guard, &irq_guard);
        if iq.oldest_enqueue_time.map_or(false, |t| t.elapsed() > Duration::from_secs(5)) {
            if let Some(req) = iq.pop_front() {
                sched.inflight.fetch_add(1, Release);
                return Some(req);
            }
        }
    }

    // Step 5: Idle — only when RT and BE are empty.
    sched.idle_queue
        .get_mut_nosave(&mut preempt_guard, &irq_guard)
        .pop_front()
        .map(|req| {
            sched.inflight.fetch_add(1, Release);
            req
        })
}

Starvation prevention:

  - BE requests that wait longer than 500ms are promoted once (dispatched as if RT, then return to normal BE accounting afterward).
  - Idle requests that wait longer than 5s are promoted once (dispatched regardless of pending BE I/O).
  - Promotion is per-request, not per-queue: only the single oldest request in a queue is promoted at a time, preserving ordering within the queue.

15.18.6.4 Elevator Merge Optimization

For rotational media (IoQueueBacking::Sorted), requests are sorted by starting LBA. When a new request arrives:

  1. Back-merge check: Look up the predecessor entry via BTreeMap::range(..lba).next_back(). If the predecessor's end LBA + 1 == the new request's start LBA, and the combined bio size is ≤ 64 KiB (the merge size limit), extend the predecessor's IoRequest to cover the new range and discard the new request object.
  2. Front-merge check: Look up the successor via BTreeMap::range(lba..).next(). If the successor's start LBA == the new request's end LBA + 1, and the combined size is ≤ 64 KiB, extend the new request and replace the successor.
  3. No merge: Insert the new request into the BTreeMap keyed by its start LBA.

Each merge check is O(log N). There is no global elevator lock: the per-CPU IoQueue is accessed only while holding the per-CPU scheduler lock (preempt-disable critical section on the submitting CPU).
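Under the assumption that a request is a (start LBA, sector count) pair over 512-byte sectors, the two merge checks sketch as follows (DmaSgl concatenation and Bio chaining are elided):

```rust
use std::collections::BTreeMap;

const MERGE_SIZE_LIMIT: u64 = 64 * 1024; // bytes, per Section 15.18.6.4
const SECTOR: u64 = 512;

/// Simplified request: start LBA plus length in sectors. The real IoRequest
/// also carries the DmaSgl and Bio back-pointer, whose merging is elided.
#[derive(Debug, PartialEq)]
struct Req {
    lba: u64,
    sectors: u64,
}

/// Back-merge / front-merge into an LBA-sorted queue, following the three
/// steps above. Each range() lookup is O(log N).
fn insert_merged(q: &mut BTreeMap<u64, Req>, mut req: Req) {
    let max_sectors = MERGE_SIZE_LIMIT / SECTOR;
    // 1. Back-merge: predecessor ends exactly where the new request starts.
    let back = q
        .range(..req.lba)
        .next_back()
        .filter(|&(_, p)| p.lba + p.sectors == req.lba && p.sectors + req.sectors <= max_sectors)
        .map(|(&k, _)| k);
    if let Some(plba) = back {
        q.get_mut(&plba).unwrap().sectors += req.sectors;
        return;
    }
    // 2. Front-merge: successor starts exactly where the new request ends.
    let front = q
        .range(req.lba..)
        .next()
        .filter(|&(_, s)| req.lba + req.sectors == s.lba && req.sectors + s.sectors <= max_sectors)
        .map(|(&k, _)| k);
    if let Some(slba) = front {
        let succ = q.remove(&slba).unwrap();
        req.sectors += succ.sectors;
    }
    // 3. Insert (possibly extended by the front-merge), keyed by start LBA.
    q.insert(req.lba, req);
}

fn main() {
    let mut q = BTreeMap::new();
    insert_merged(&mut q, Req { lba: 0, sectors: 8 });
    insert_merged(&mut q, Req { lba: 8, sectors: 8 }); // back-merges into [0, 16)
    insert_merged(&mut q, Req { lba: 100, sectors: 8 });
    insert_merged(&mut q, Req { lba: 92, sectors: 8 }); // front-merges into [92, 108)
    assert_eq!(q.len(), 2);
    assert_eq!(q[&0].sectors, 16);
    assert_eq!(q[&92].sectors, 16);
}
```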

For non-rotational media (IoQueueBacking::Fifo), back/front merge checks are still attempted (same logic, but searching by LBA in the BoundedRing is O(N)); dispatch pops from the front of the ring rather than the lowest-LBA entry.

Default-off for NVMe: On devices with rotational=0 and native NVMe multi-queue, merge is disabled by default (/sys/block/<dev>/queue/nomerges=2) because NVMe controllers handle coalescing internally and the O(N) scan cost exceeds any merge benefit. Merge can be re-enabled via sysfs for devices where software merge is beneficial (e.g., SATA SSDs behind AHCI with a single HW queue).

The 64 KiB merge limit is chosen to match a typical NVMe preferred transfer size and to bound the latency spike of a merged request. It can be adjusted per-device at initialization time by querying the device's MDTS (Maximum Data Transfer Size) field in the NVMe Identify Controller data structure.
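A sketch of that per-device initialization, assuming the policy is "respect the device's MDTS but never exceed the 64 KiB latency bound" (MDTS semantics per the NVMe spec: 2^MDTS units of the minimum memory page size 2^(12 + CAP.MPSMIN), with 0 meaning no limit):

```rust
/// Compute the per-device merge cap from the NVMe Identify Controller MDTS
/// field. Assumed policy: honor the device limit, capped at the 64 KiB
/// latency bound from Section 15.18.6.4.
fn merge_limit_bytes(mdts: u8, cap_mpsmin: u8) -> u64 {
    const LATENCY_BOUND: u64 = 64 * 1024;
    if mdts == 0 {
        // MDTS == 0: the controller reports no transfer-size limit.
        return LATENCY_BOUND;
    }
    // Max transfer = 2^MDTS units of the minimum page size 2^(12 + MPSMIN).
    let max_xfer = 1u64 << (12 + u64::from(cap_mpsmin) + u64::from(mdts));
    max_xfer.min(LATENCY_BOUND)
}

fn main() {
    assert_eq!(merge_limit_bytes(0, 0), 64 * 1024); // unlimited device -> bound
    assert_eq!(merge_limit_bytes(2, 0), 16 * 1024); // 2^(12+2) = 16 KiB device cap
    assert_eq!(merge_limit_bytes(9, 0), 64 * 1024); // 2 MiB device cap, clamped
}
```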

15.18.6.5 Submission Path

pub fn submit(sched: &DeviceIoQueues, req: Arc<IoRequest>, task: &Task) {
    let priority = task.effective_io_priority(); // derives from nice if NONE
    let cpu = current_cpu();
    let irq_guard = IrqDisabledGuard::acquire();
    // Same guard pattern as dispatch_one(): IrqDisabledGuard implies
    // preemption disabled; derive the PreemptGuard for the PerCpu API.
    let mut preempt_guard = PreemptGuard::from_irq_disabled(&irq_guard);

    match priority.class() {
        IoSchedClass::RealTime => {
            sched.rt_queues[priority.level() as usize]
                .get_mut_nosave(&mut preempt_guard, &irq_guard)
                .insert_merged(req);
        }
        IoSchedClass::BestEffort | IoSchedClass::None => {
            let level = match priority.class() {
                IoSchedClass::None => task.nice_to_be_level(),
                _ => priority.level() as usize,
            };
            sched.be_queues[level]
                .get_mut_nosave(&mut preempt_guard, &irq_guard)
                .insert_merged(req);
        }
        IoSchedClass::Idle => {
            sched.idle_queue
                .get_mut_nosave(&mut preempt_guard, &irq_guard)
                .insert_merged(req);
        }
    }

    dispatch_pending(sched, cpu);
}

15.18.6.6 dispatch_pending() — Submit-to-hardware bridge

/// Drains the I/O scheduler queue and submits the originating Bios to the
/// block device via `kabi_call!`. This is the critical bridge between
/// "request inserted into scheduler queue" and "request dispatched to hardware."
///
/// Called from:
/// - `submit()` after inserting a new request (submit path kickoff)
/// - NVMe/AHCI completion handler after freeing a hardware slot (refill)
///
/// **BIO-06 fix**: The driver's `BlockDeviceOps::submit_bio()` accepts
/// `&mut Bio`, not `&IoRequest`. The scheduler extracts the originating Bio
/// from `req.bio` (the `*mut Bio` back-pointer stored by `bio_to_io_request()`)
/// and dispatches the Bio directly to the driver. The IoRequest is a
/// scheduler-internal wrapper for priority/merging/deadline tracking — the
/// driver never sees it.
///
/// The tier awareness lives in `kabi_call!` — the handle knows whether the
/// driver is Tier 0 (direct vtable call) or Tier 1 (ring submission). No
/// `DispatchMode` enum needed. The dispatch loop is transport-agnostic.
///
/// # Algorithm
///
/// ```text
/// fn dispatch_pending(queues: &DeviceIoQueues, cpu: usize) {
///     let irq_guard = disable_irqs();
///     let sched_ops = queues.sched_ops.rcu_read();
///     loop {
///         let pick = match sched_ops.pick_next(queues, cpu) {
///             Some(p) => p,
///             None => break, // all queues empty
///         };
///         let req = queues.dequeue(pick);
///         // Extract the originating Bio from the IoRequest.
///         // SAFETY: req.bio was set by bio_to_io_request() and the Bio is
///         // alive (owned by the completion path via ManuallyDrop).
///         let bio = unsafe { &mut *req.bio };
///         // Submit the Bio to the driver via KABI transport.
///         // The handle cached at device registration time determines
///         // direct call (Tier 0) vs ring submission (Tier 1).
///         match kabi_call!(queues.block_handle, submit_bio, bio) {
///             Ok(()) => {
///                 // Request accepted by hardware. Track in-flight count.
///                 queues.inflight.fetch_add(1, Relaxed);
///                 if req.priority.class() == IoSchedClass::RealTime {
///                     queues.inflight_rt.fetch_add(1, Relaxed);
///                 }
///             }
///             Err(e) if e == Error::BUSY => {
///                 // Hardware queue full. Requeue the IoRequest at the front
///                 // of the scheduler's dispatch queue for the next attempt.
///                 // The completion handler will call dispatch_pending()
///                 // again when a slot frees up.
///                 queues.requeue_front(req, pick);
///                 break;
///             }
///             Err(e) => {
///                 // Permanent error — complete the originating Bio.
///                 bio_complete(req.bio, -(e as i32));
///             }
///         }
///     }
/// }
/// ```
///
/// **Completion path (Decision 4)**: When hardware signals completion (via
/// IRQ ring for Tier 1, or direct callback for Tier 0), the completion
/// handler calls `bio_complete(req.bio, status)`. This invokes the Bio's
/// `end_io` callback — the function pointer set by the original submitter
/// (filesystem, io_uring, sync waiter). No `IoCompletion` bridging needed.
///
/// The scheduler's `on_complete()` hook is called for accounting (decrement
/// inflight counters, update WRR round state) before or after `bio_complete()`,
/// depending on whether the callback may free the Bio:
///
/// ```text
/// fn complete_request(queues: &DeviceIoQueues, req: &IoRequest, status: i32) {
///     // 1. Notify scheduler for accounting.
///     let sched_ops = queues.sched_ops.rcu_read();
///     sched_ops.on_complete(queues, req);
///     // 2. Decrement in-flight counters.
///     queues.inflight.fetch_sub(1, Relaxed);
///     if req.priority.class() == IoSchedClass::RealTime {
///         queues.inflight_rt.fetch_sub(1, Relaxed);
///     }
///     // 3. Complete the originating Bio (invokes end_io callback).
///     bio_complete(req.bio, status);
///     // 4. Free the IoRequest back to the slab.
///     // (Arc<IoRequest> refcount drops to zero here.)
///     // 5. Kick dispatch to refill the freed hardware slot.
///     dispatch_pending(queues, current_cpu());
/// }
/// ```

15.18.7 NVMe Multi-Queue Integration

NVMe hardware supports multiple independent submission/completion queue pairs. UmkaOS maps the MQPA scheduler to NVMe hardware queues as follows:

Queue layout per NVMe controller:

  - One hardware queue pair per online CPU (as Linux does with blk-mq).
  - Each hardware queue has its own DeviceIoQueues instance — no cross-queue locking.
  - Tasks submit requests to the DeviceIoQueues associated with their current CPU. The dispatcher drains that scheduler's queues into the hardware submission queue doorbell.

NVMe queue priority (QPRIO): When the NVMe controller supports the Weighted Round Robin with Urgent Priority Class arbitration mechanism (reported in CAP.AMS), UmkaOS creates dedicated submission queue tiers:

| NVMe QPRIO | Value (CDW11[2:1]) | Used for |
|------------|--------------------|----------|
| Urgent | 00b | RT I/O class (all levels 0-7) |
| High | 01b | BE levels 0-1 |
| Medium | 10b | BE levels 2-4 |
| Low | 11b | BE levels 5-7 and Idle |

Queue priority is set at queue creation time via the QPRIO field in CDW11 of the Create I/O Submission Queue admin command. This maps UmkaOS's software priority classes to NVMe hardware arbitration, so that the drive's internal scheduler also respects UmkaOS priorities — not just the host-side MQPA scheduler.

If the controller does not support CAP.AMS priority, all queues are created at the default (equal) priority and MQPA's software dispatch order is the sole priority mechanism.
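The table rows translate directly into a mapping function. A sketch, using the IoSchedClass names from this section:

```rust
#[derive(Clone, Copy)]
enum IoSchedClass {
    RealTime,
    BestEffort,
    Idle,
}

/// Map an UmkaOS (class, level) pair to the NVMe QPRIO encoding for CDW11
/// bits 2:1 of the Create I/O Submission Queue command, per the table above.
fn qprio(class: IoSchedClass, level: u8) -> u8 {
    match class {
        IoSchedClass::RealTime => 0b00,                        // Urgent: RT, all levels
        IoSchedClass::BestEffort if level <= 1 => 0b01,        // High:   BE 0-1
        IoSchedClass::BestEffort if level <= 4 => 0b10,        // Medium: BE 2-4
        IoSchedClass::BestEffort | IoSchedClass::Idle => 0b11, // Low:    BE 5-7, Idle
    }
}

fn main() {
    assert_eq!(qprio(IoSchedClass::RealTime, 7), 0b00);
    assert_eq!(qprio(IoSchedClass::BestEffort, 3), 0b10);
    assert_eq!(qprio(IoSchedClass::Idle, 0), 0b11);
}
```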

RT fast path: RT requests are eligible for direct hardware queue submission, bypassing the per-CPU scheduler queues, provided the hardware queue has available slots. This reduces the RT dispatch latency to approximately one PCIe round trip (2–4 μs on Gen4/Gen5 NVMe) without waiting for a dispatch tick.

Completion handling: NVMe completions arrive per-queue. Each completion decrements inflight and inflight_rt (if RT), then calls dispatch_pending() to fill the freed slot. This keeps queue depth at the device's preferred level for maximum throughput.

CPU hotplug handling: When a CPU goes offline, the I/O scheduler must drain or migrate requests from the dead CPU's per-CPU IoQueue. The hotplug sequence:

  1. The CPU_DEAD notifier fires for the going-offline CPU.
  2. For each block device: acquire the dead CPU's IoQueue lock.
  3. Drain all pending requests from the dead CPU's queues (RT, BE[0..7], Idle).
  4. Re-submit the drained requests via submit() on the current (live) CPU, which inserts them into the live CPU's queues at their original priority.
  5. In-flight requests (already submitted to hardware) complete normally on any CPU via interrupt steering — no migration needed.
  6. When a CPU comes online (CPU_ONLINE), a fresh per-CPU IoQueue set is allocated and registered. No request migration is needed for online events.
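The drain/re-submit core of that sequence can be sketched with plain Vecs standing in for the per-CPU RT/BE/Idle queues (locking and the notifier plumbing are elided):

```rust
/// Stand-in for one CPU's 17 per-priority queues (8 RT + 8 BE + 1 Idle).
/// Each inner Vec holds opaque request handles; real queues are IoQueues.
struct CpuQueues {
    levels: [Vec<u64>; 17],
}

impl CpuQueues {
    fn new() -> Self {
        CpuQueues { levels: std::array::from_fn(|_| Vec::new()) }
    }
}

/// CPU_DEAD handling: drain every queue of the dead CPU and re-submit each
/// request on a live CPU at its original priority level. Returns the number
/// of requests migrated.
fn drain_dead_cpu(dead: &mut CpuQueues, live: &mut CpuQueues) -> usize {
    let mut moved = 0;
    for lvl in 0..dead.levels.len() {
        for req in dead.levels[lvl].drain(..) {
            live.levels[lvl].push(req); // same class/level on the live CPU
            moved += 1;
        }
    }
    moved
}

fn main() {
    let mut dead = CpuQueues::new();
    let mut live = CpuQueues::new();
    dead.levels[0].push(101); // an RT level-0 request
    dead.levels[9].push(202); // a BE level-1 request
    assert_eq!(drain_dead_cpu(&mut dead, &mut live), 2);
    assert!(dead.levels.iter().all(|q| q.is_empty()));
    assert_eq!(live.levels[0], vec![101]);
}
```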

15.18.8 cgroup Integration

UmkaOS's cgroup v2 io controller and cgroup v1 blkio controller interact with MQPA as follows:

cgroup v2 io controller:

/sys/fs/cgroup/<group>/io.weight

Integer weight 1–10000 (default 100). Maps to an effective BE level multiplier:

effective_weight = io.weight           // 1-10000, default 100
be_dispatch_quota = be_weights[level] * effective_weight / 100

Tasks in a cgroup with io.weight=500 (5× default) get 5× the per-round dispatch quota at their BE level. Tasks in a cgroup with io.weight=10 get 0.1× quota (rounded up to 1 dispatch per round to avoid starvation).

/// Read the I/O weight for a cgroup. Called by the I/O scheduler at dispatch
/// time to determine the cgroup's proportional share.
pub fn cgroup_io_weight(cgroup: &Cgroup) -> u32 {
    cgroup.io.as_ref().map_or(100, |io| io.weight.load(Ordering::Relaxed))
}

The per-cgroup weight applies within the same BE priority level. A task at BE level 0 with io.weight=10 still preempts a task at BE level 1 with io.weight=10000 — class and level take strict priority; cgroup weight only affects relative bandwidth within the same level.
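Combining the WRR base weights with io.weight gives the per-round quota. A sketch of the formula above, including the round-up-to-1 starvation floor:

```rust
/// Per-round BE dispatch quota for a cgroup at a given BE level: the WRR
/// base weight scaled by io.weight (default 100), floored at 1 so even a
/// low-weight cgroup dispatches once per round.
fn be_dispatch_quota(level: usize, io_weight: u32) -> u32 {
    let be_weights: [u32; 8] = [8, 4, 2, 1, 1, 1, 1, 1];
    (be_weights[level] * io_weight / 100).max(1)
}

fn main() {
    assert_eq!(be_dispatch_quota(0, 100), 8);  // default weight: base quota
    assert_eq!(be_dispatch_quota(0, 500), 40); // 5x weight -> 5x quota
    assert_eq!(be_dispatch_quota(3, 10), 1);   // 0.1x rounds up to 1
}
```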

cgroup v2 io.max — hard rate limits:

/sys/fs/cgroup/<group>/io.max

Format (Linux compatible): MAJ:MIN rbps=N wbps=N riops=N wiops=N

Implemented as a token bucket per cgroup per device. Tokens refill at the configured rate; requests that arrive when the bucket is empty are held in a per-cgroup delay queue and released when tokens become available. Rate-limited requests retain their original MQPA priority and are inserted into the normal dispatch queue when released from the delay queue.

Token bucket parameters:

  - Bucket capacity: 4× the per-second rate limit (allows a burst of up to 4 seconds of quota).
  - Refill granularity: every 1ms tick (avoids a thundering herd on 1-second boundaries).
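A sketch of the per-(cgroup, device) bucket with those parameters, with tokens counted in bytes (the iops buckets are identical with tokens counted in operations):

```rust
/// Token bucket for one (cgroup, device) pair: capacity 4x the per-second
/// rate, refilled from the 1 ms tick. Starts full.
struct TokenBucket {
    rate_per_sec: u64,
    tokens: u64,
    capacity: u64,
}

impl TokenBucket {
    fn new(rate_per_sec: u64) -> Self {
        let capacity = rate_per_sec * 4; // 4-second burst allowance
        TokenBucket { rate_per_sec, tokens: capacity, capacity }
    }

    /// Called from the 1 ms tick: add 1/1000 of the per-second rate.
    fn refill_tick(&mut self) {
        self.tokens = (self.tokens + self.rate_per_sec / 1000).min(self.capacity);
    }

    /// Returns true if the request may dispatch now; otherwise the caller
    /// parks it on the per-cgroup delay queue until a refill.
    fn try_consume(&mut self, bytes: u64) -> bool {
        if self.tokens >= bytes {
            self.tokens -= bytes;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut b = TokenBucket::new(1_000_000); // 1 MB/s limit
    assert!(b.try_consume(4_000_000));       // full burst allowed
    assert!(!b.try_consume(1));              // bucket empty: request is delayed
    b.refill_tick();                         // one 1 ms tick adds 1000 tokens
    assert!(b.try_consume(1_000));
}
```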

cgroup v1 blkio controller:

Supported knobs and their v2 equivalents:

| v1 knob | v2 equivalent | Notes |
|---------|---------------|-------|
| blkio.weight | io.weight | Per-cgroup default weight |
| blkio.weight_device | io.weight (per-device) | Per-device weight override |
| blkio.throttle.read_bps_device | io.max rbps= | Hard rate limit |
| blkio.throttle.write_bps_device | io.max wbps= | Hard rate limit |
| blkio.throttle.read_iops_device | io.max riops= | Hard rate limit |
| blkio.throttle.write_iops_device | io.max wiops= | Hard rate limit |

v1 blkio.bfq.* knobs are accepted but ignored with a logged warning (BFQ is not implemented; MQPA provides equivalent or better behavior).

cgroup v2 io.stat — I/O accounting:

/sys/fs/cgroup/<group>/io.stat

Format (Linux 4.16+ compatible):

MAJ:MIN rbytes=N wbytes=N rios=N wios=N dbytes=N dios=N

Fields:

  - rbytes / wbytes: bytes read/written from storage (not page cache hits)
  - rios / wios: number of completed read/write I/O operations
  - dbytes / dios: bytes/ops issued as discard (TRIM/UNMAP) commands

Counters are updated on I/O completion, not on submission. Accounted per-task first, then aggregated to the cgroup hierarchy on read.
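Rendering one device's aggregated counters in that format can be sketched as follows (the struct is a stand-in for the values summed up the cgroup hierarchy on read):

```rust
/// Aggregated per-device counters for one cgroup — a stand-in for the
/// per-task counters aggregated to the hierarchy on read.
#[derive(Default)]
struct IoStat {
    rbytes: u64,
    wbytes: u64,
    rios: u64,
    wios: u64,
    dbytes: u64,
    dios: u64,
}

/// Format one io.stat line in the Linux 4.16+ compatible layout shown above.
fn io_stat_line(maj: u32, min: u32, s: &IoStat) -> String {
    format!(
        "{}:{} rbytes={} wbytes={} rios={} wios={} dbytes={} dios={}",
        maj, min, s.rbytes, s.wbytes, s.rios, s.wios, s.dbytes, s.dios
    )
}

fn main() {
    let s = IoStat { rbytes: 4096, wbytes: 8192, rios: 1, wios: 2, ..Default::default() };
    assert_eq!(
        io_stat_line(259, 0, &s),
        "259:0 rbytes=4096 wbytes=8192 rios=1 wios=2 dbytes=0 dios=0"
    );
}
```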

15.18.9 /proc/PID/io Accounting

Each task accumulates I/O counters in its RusageAccum structure (defined in Chapter 8). These are exposed in /proc/<pid>/io with the following format (Linux compatible):

rchar: <N>
wchar: <N>
syscr: <N>
syscw: <N>
read_bytes: <N>
write_bytes: <N>
cancelled_write_bytes: <N>

Field definitions:

| Field | Type | Description |
|---|---|---|
| rchar | u64 | Bytes passed to read(2) and similar calls. Includes page cache hits. Does not represent physical I/O. |
| wchar | u64 | Bytes passed to write(2) and similar calls. Includes writes to page cache. Does not represent physical I/O. |
| syscr | u64 | Number of read-class syscalls (read, pread64, readv, preadv, preadv2, sendfile, copy_file_range). |
| syscw | u64 | Number of write-class syscalls (write, pwrite64, writev, pwritev, pwritev2, sendfile, copy_file_range). |
| read_bytes | u64 | Bytes actually fetched from storage (cache misses that triggered block I/O). Updated at I/O completion. |
| write_bytes | u64 | Bytes actually written to storage (writeback completions). Updated at writeback completion. |
| cancelled_write_bytes | u64 | Bytes charged to write_bytes that were subsequently cancelled because the page was truncated before writeback. |

Implementation:

- rchar and wchar are incremented in the VFS read/write path before checking the page cache.
- syscr and syscw are incremented at syscall entry.
- read_bytes is incremented in the block I/O completion handler when the originating task can be attributed (via IoRequest::pid).
- write_bytes is incremented in the writeback completion handler. Writeback is attributed to the task that dirtied the page (recorded in the page's DirtyAccountable field).
- cancelled_write_bytes is incremented in truncate_inode_pages when a dirty page is discarded before writeback.

Thread aggregation: /proc/<pid>/io reports the sum across all threads in the process. Per-thread values are available at /proc/<pid>/task/<tid>/io.
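As a usage illustration, the seven-field format above can be parsed generically into (name, value) pairs (the helper name is hypothetical, not a spec API):

```rust
/// Parse the /proc/<pid>/io text format ("name: value" per line)
/// into (field name, counter value) pairs. Malformed lines are skipped.
fn parse_proc_io(text: &str) -> Vec<(String, u64)> {
    text.lines()
        .filter_map(|line| {
            let (name, val) = line.split_once(':')?;
            Some((name.trim().to_string(), val.trim().parse().ok()?))
        })
        .collect()
}
```

The same parser works for /proc/<pid>/task/<tid>/io, since per-thread files use the identical format.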

15.18.10 sysfs Interface

/sys/block/<dev>/queue/scheduler:

UmkaOS presents the MQPA scheduler under the name umka-mqpa. For compatibility with tools that check this file (e.g., fio, tuned, irqbalance), the file also accepts none, mq-deadline, bfq, and kyber as writes — all are silently mapped to umka-mqpa. The read value always shows [umka-mqpa] in the list of available schedulers.

/sys/block/<dev>/queue/iosched/:

UmkaOS exposes an mq-deadline-compatible tunable set in this directory for maximum tool compatibility (iostat, blktrace, and fio detect the scheduler's tunables and adjust output accordingly). The following tunables are honored:

| Tunable | Default | Meaning in UmkaOS |
|---|---|---|
| read_expire | 500ms | Starvation deadline for BE read requests |
| write_expire | 5000ms | Starvation deadline for BE write requests |
| writes_starved | 2 | (ignored; MQPA WRR handles fairness) |
| front_merges | 1 | 0 = disable front-merge check; 1 = enable (default) |
| fifo_batch | 16 | (ignored; MQPA dispatches one request per call) |

All other mq-deadline tunables (async_depth, prio_aging_expire, etc.) are accepted via sysfs write but have no effect. A single-line message is logged at info level when an ignored tunable is written: umka-mqpa: tunable '<name>' accepted but has no effect.

MQPA-native tunables (exposed under /sys/block/<dev>/queue/iosched/):

| Tunable | Default | Description |
|---|---|---|
| wrr_quantum_us | 100 | WRR time quantum per BE level (microseconds) |
| rt_starve_limit | 64 | Max RT requests dispatched before one BE is served |
| idle_batch | 4 | Max idle-class requests dispatched per round |
| merge_max_kb | 64 | Maximum merged request size (KiB) |

/sys/block/<dev>/queue/ common knobs honored by MQPA:

| Knob | Description |
|---|---|
| nr_requests | Maximum queue depth. UmkaOS clamps to the device's reported NVMe MQES. |
| rq_affinity | 0 = complete on any CPU, 1 = complete on submitting CPU's socket, 2 = complete on exact submitting CPU. |
| add_random | 0 = do not contribute to /dev/random entropy pool on I/O completion. |
| rotational | 0 = SSD/NVMe (disable elevator C-scan; use FIFO-within-level order instead of LBA order). |

When rotational=0, each IoQueue is created with backing: IoQueueBacking::Fifo (a BoundedRing<Arc<IoRequest>> pre-allocated to the device's hardware queue depth). Back/front merge checks are still performed, but dispatch pops from the front of the ring rather than the lowest-LBA entry. This avoids unnecessary seek-optimization work on random-access media.
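The dispatch-order difference can be illustrated with standard collections (types simplified for illustration; this is not the spec's IoQueueBacking):

```rust
use std::collections::{BTreeMap, VecDeque};

/// Simplified request: LBA plus an identifier for tracking.
struct Req {
    lba: u64,
    id: u32,
}

/// Rotational media: pop the lowest-LBA request (elevator-style order),
/// here modeled with a BTreeMap keyed by LBA.
fn pop_lowest_lba(q: &mut BTreeMap<u64, Req>) -> Option<Req> {
    let key = *q.keys().next()?;
    q.remove(&key)
}

/// rotational=0: pop in submission order from a FIFO ring.
fn pop_fifo(q: &mut VecDeque<Req>) -> Option<Req> {
    q.pop_front()
}
```

With the same two requests (LBA 900 submitted first, LBA 100 second), the sorted queue dispatches LBA 100 first while the FIFO dispatches LBA 900 first.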

15.18.11 Linux Compatibility Notes

| Item | Detail |
|---|---|
| Syscall numbers (x86-64) | ioprio_set = 251, ioprio_get = 252 |
| Syscall numbers (i386 compat) | ioprio_set = 289, ioprio_get = 290 |
| Syscall numbers (AArch64) | ioprio_set = 30, ioprio_get = 31 |
| IOPRIO_CLASS_NONE | 0 |
| IOPRIO_CLASS_RT | 1 |
| IOPRIO_CLASS_BE | 2 |
| IOPRIO_CLASS_IDLE | 3 |
| IOPRIO_PRIO_CLASS(ioprio) | (ioprio >> 13) & 0x7 |
| IOPRIO_PRIO_DATA(ioprio) | ioprio & 0x1fff (13-bit combined hint+level) |
| IOPRIO_PRIO_HINT(ioprio) | (ioprio >> 3) & 0x3ff (10-bit hint, Linux 6.6+) |
| IOPRIO_PRIO_LEVEL(ioprio) | ioprio & 0x7 (3-bit level, Linux 6.6+) |
| IOPRIO_PRIO_VALUE(class, level) | ((class) << 13) \| (level) (hint=0 compat) |
| ionice(1) (util-linux) | Works without modification |
| iopriority field in /proc/<pid>/status | Not exposed; use ioprio_get(2) |
| taskset / chrt | Unaffected; these set CPU/RT scheduler priority, not I/O priority |
| cgroup v2 io.stat format | Compatible with Linux 4.16+ |
| cgroup v2 io.weight range | 1–10000, default 100 (Linux compatible) |
| blkio.weight v1 range | 10–1000, mapped to v2 weight via weight * 10 |
| /proc/<pid>/io format | Identical to Linux (all 7 fields, same names) |

ionice(1) tool compatibility: The ionice utility from util-linux calls ioprio_set(2) and ioprio_get(2) directly via syscall(2) (no glibc wrapper exists). No modification is required.
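The IOPRIO_* macros in the table above translate directly into bit arithmetic; a Rust sketch (helper names are illustrative, bit layout as documented in the table):

```rust
/// Class occupies the top 3 bits of the 16-bit ioprio value.
const IOPRIO_CLASS_SHIFT: u32 = 13;

/// Extract the class (RT=1, BE=2, IDLE=3).
fn ioprio_prio_class(ioprio: u32) -> u32 {
    (ioprio >> IOPRIO_CLASS_SHIFT) & 0x7
}

/// Extract the combined 13-bit hint+level data field.
fn ioprio_prio_data(ioprio: u32) -> u32 {
    ioprio & 0x1fff
}

/// Extract the 3-bit priority level.
fn ioprio_prio_level(ioprio: u32) -> u32 {
    ioprio & 0x7
}

/// Compose class + level with the hint bits left at zero.
fn ioprio_prio_value(class: u32, level: u32) -> u32 {
    (class << IOPRIO_CLASS_SHIFT) | level
}
```

For example, BE class (2) at level 4 encodes to (2 << 13) | 4 = 16388, and the extractors recover class 2 and level 4 from that value.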

Tools that query /sys/block/<dev>/queue/scheduler: fio, tuned, and storage benchmarks that read or write the scheduler knob will see [umka-mqpa] and accept writes of mq-deadline without error. The fio io_uring and libaio engines are unaffected by scheduler selection — they bypass the scheduler for direct I/O (O_DIRECT).

O_DIRECT and io_uring with fixed buffers: Requests submitted via io_uring with IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED on O_DIRECT file descriptors are still subject to MQPA priority. The submitting task's io_priority is sampled at io_uring_enter(2) time and embedded in each IoRequest generated from the submission ring.


15.19 NVMe Host Controller Driver Architecture

Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules. &self methods use interior mutability for mutation. Atomic fields use .store()/.load(). All #[repr(C)] structs have const_assert! size verification. See CLAUDE.md Spec Pseudocode Quality Gates.

The NVMe driver is a Tier 1 KABI driver that manages local PCIe-attached NVMe solid-state drives through the NVM Express register and command interface. This is the primary high-performance block storage driver for UmkaOS — NVMe SSDs are the default boot and data disk on modern servers, workstations, and laptops.

Reference specification: NVM Express Base Specification 2.1 (NVM Express, Inc., August 2024). NVM Express Zoned Namespace Command Set Specification 1.1b (August 2022).

NVMe-oF (over Fabrics) is a separate subsystem; the NVMe-oF initiator and target are defined in Section 15.13. This section covers the local PCIe NVMe host controller driver only. The two share the NvmeCommand format and the namespace abstraction defined below.

15.19.1 Controller Memory Space (CMS) Registers

The NVMe controller exposes a memory-mapped register set at PCI BAR0. All registers are little-endian. The first 0x40 bytes are controller-wide registers; doorbell registers start at offset 0x1000 (configurable via CAP.DSTRD).

/// NVMe controller registers (BAR0 MMIO, offsets 0x00-0x3F).
/// All registers are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures including big-endian
/// PPC32 and s390x. Matches Linux `struct nvme_bar` which uses `__le64`/`__le32`.
#[repr(C)]
pub struct NvmeRegisters {
    /// Controller Capabilities (CAP) — read-only, 64-bit.
    /// Bits: MQES (15:0) maximum queue entries supported (0-based),
    /// CQR (16) contiguous queues required,
    /// AMS (18:17) arbitration mechanism supported (0=round-robin, 1=WRR+urgent),
    /// TO (31:24) timeout in 500ms units (worst-case time for CSTS.RDY transitions),
    /// DSTRD (35:32) doorbell stride (2^(2+DSTRD) bytes between doorbells),
    /// NSSRS (36) NVM subsystem reset supported,
    /// CSS (44:37) command set supported (bit 0=NVM, bit 6=I/O command sets, bit 7=admin only),
    /// BPS (45) boot partition support,
    /// CPS (47:46) controller power scope,
    /// MPSMIN (51:48) memory page size minimum (2^(12+MPSMIN) bytes),
    /// MPSMAX (55:52) memory page size maximum (2^(12+MPSMAX) bytes),
    /// PMRS (56) persistent memory region supported,
    /// CMBS (57) controller memory buffer supported,
    /// NSSS (58) NVM subsystem shutdown supported,
    /// CRMS (60:59) controller ready modes supported.
    pub cap: Le64,
    /// Version (VS) — read-only, 32-bit.
    /// Major (31:16), minor (15:8), tertiary (7:0). E.g., 0x00020100 = 2.1.0.
    pub vs: Le32,
    /// Interrupt Mask Set (INTMS) — write-only, 32-bit.
    /// Set bits to mask corresponding interrupt vectors.
    pub intms: Le32,
    /// Interrupt Mask Clear (INTMC) — write-only, 32-bit.
    /// Set bits to unmask corresponding interrupt vectors.
    pub intmc: Le32,
    /// Controller Configuration (CC) — read-write, 32-bit.
    /// Bits: EN (0) enable,
    /// CSS (6:4) I/O command set selected,
    /// MPS (10:7) memory page size (2^(12+MPS) bytes, must be within MPSMIN..MPSMAX),
    /// AMS (13:11) arbitration mechanism selected,
    /// SHN (15:14) shutdown notification (00=none, 01=normal, 10=abrupt),
    /// IOSQES (19:16) I/O submission queue entry size (2^N bytes, must be 6 for 64-byte),
    /// IOCQES (23:20) I/O completion queue entry size (2^N bytes, must be 4 for 16-byte),
    /// CRIME (24) controller ready independent of media enable.
    pub cc: Le32,
    /// Reserved (0x18).
    pub _reserved0: Le32,
    /// Controller Status (CSTS) — read-only, 32-bit.
    /// Bits: RDY (0) ready,
    /// CFS (1) controller fatal status,
    /// SHST (3:2) shutdown status (00=normal, 01=in-progress, 10=complete),
    /// NSSRO (4) NVM subsystem reset occurred,
    /// PP (5) processing paused.
    pub csts: Le32,
    /// NVM Subsystem Reset (NSSR) — read-write, 32-bit.
    /// Write 0x4E564D65 ("NVMe") to initiate subsystem reset (if CAP.NSSRS=1).
    pub nssr: Le32,
    /// Admin Queue Attributes (AQA) — read-write, 32-bit.
    /// ASQS (11:0) admin submission queue size (0-based, max 4095 entries),
    /// ACQS (27:16) admin completion queue size (0-based, max 4095 entries).
    pub aqa: Le32,
    /// Admin Submission Queue Base Address (ASQ) — read-write, 64-bit.
    /// Physical address, page-aligned (bits 11:0 must be zero).
    pub asq: Le64,
    /// Admin Completion Queue Base Address (ACQ) — read-write, 64-bit.
    /// Physical address, page-aligned (bits 11:0 must be zero).
    pub acq: Le64,
    /// Controller Memory Buffer Location (CMBLOC) — offset 0x38, 32-bit.
    /// Indicates the location and access parameters of the CMB if CAP.CMBS=1.
    /// If CMB is not supported, this register is reserved.
    pub cmbloc: Le32,
    /// Controller Memory Buffer Size (CMBSZ) — offset 0x3C, 32-bit.
    /// Indicates the size and capabilities of the CMB if CAP.CMBS=1.
    /// SZU (3:0) size units, SZ (31:4) size.
    pub cmbsz: Le32,
}
// NVMe Base Spec 2.1: registers 0x00-0x3F = 64 bytes.
const_assert!(core::mem::size_of::<NvmeRegisters>() == 64);
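As an illustration of the CAP bit layout documented above, a few field decoders over the raw 64-bit register value (helper names are illustrative, not the driver's API; the bit positions match the NvmeRegisters comments):

```rust
/// Maximum queue entries: MQES (bits 15:0) is 0-based, so the actual
/// entry count is MQES + 1 (hence u32 — 65535 + 1 overflows u16).
fn cap_mqes(cap: u64) -> u32 {
    ((cap & 0xffff) as u32) + 1
}

/// Worst-case CSTS.RDY transition time: TO (bits 31:24) in 500ms units.
fn cap_timeout_ms(cap: u64) -> u64 {
    ((cap >> 24) & 0xff) * 500
}

/// Doorbell stride exponent: DSTRD (bits 35:32); stride = 4 << dstrd bytes.
fn cap_dstrd(cap: u64) -> u8 {
    ((cap >> 32) & 0xf) as u8
}

/// Minimum host memory page size: MPSMIN (bits 51:48), 2^(12+MPSMIN) bytes.
fn cap_mpsmin_bytes(cap: u64) -> u64 {
    1u64 << (12 + ((cap >> 48) & 0xf))
}
```

A CAP value with MQES=1023 and TO=30 thus reports 1024 usable queue entries and a 15-second ready timeout.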

15.19.2 Submission/Completion Queue Pair Model

NVMe uses paired ring buffers for command submission and completion. Each pair consists of a Submission Queue (SQ) and a Completion Queue (CQ). The admin queue pair (QID 0) handles controller management; I/O queue pairs (QID 1+) handle data transfer.

Submission Queue (SQ): Circular buffer of 64-byte command entries. The host writes commands and advances the SQ tail doorbell. The controller fetches commands from the SQ head (tracked internally by the controller).

Completion Queue (CQ): Circular buffer of 16-byte completion entries. The controller writes completions and the host detects new entries via the phase bit (bit 16 of DW3). The phase bit toggles on each CQ wraparound, allowing the host to distinguish new completions from stale entries without reading a head register. The host advances the CQ head doorbell after processing completions.

Doorbell registers start at BAR0 + 0x1000, spaced by 4 << CAP.DSTRD bytes:

- SQ Y Tail Doorbell: offset 0x1000 + (2Y) * (4 << DSTRD)
- CQ Y Head Doorbell: offset 0x1000 + (2Y + 1) * (4 << DSTRD)
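The doorbell offset formulas above can be written out as (a sketch; function names are illustrative):

```rust
/// Byte offset of the SQ tail doorbell for queue `qid`, given CAP.DSTRD.
fn sq_tail_doorbell(qid: u32, dstrd: u8) -> u32 {
    0x1000 + (2 * qid) * (4u32 << dstrd)
}

/// Byte offset of the CQ head doorbell for queue `qid`, given CAP.DSTRD.
fn cq_head_doorbell(qid: u32, dstrd: u8) -> u32 {
    0x1000 + (2 * qid + 1) * (4u32 << dstrd)
}
```

With DSTRD=0 (4-byte stride) the admin pair lands at 0x1000/0x1004 and QID 1 at 0x1008/0x100C; a DSTRD=1 controller doubles the spacing.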

/// NVMe Submission Queue Entry — 64 bytes. All NVMe commands use this format.
/// The first 16 bytes are common; CDW10-CDW15 are command-specific.
/// All multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct NvmeCommand {
    /// Command Dword 0 (CDW0).
    /// OPC (7:0) opcode,
    /// FUSE (9:8) fused operation (00=normal, 01=first, 10=second),
    /// PSDT (15:14) PRP or SGL for data transfer (00=PRP, 01/10=SGL),
    /// CID (31:16) command identifier (unique per SQ, used to correlate completions).
    pub cdw0: Le32,
    /// Namespace Identifier (NSID). 0xFFFFFFFF for controller-wide commands.
    pub nsid: Le32,
    /// Command Dword 2-3 — reserved for most commands.
    pub cdw2: Le32,
    pub cdw3: Le32,
    /// Metadata Pointer (MPTR) — physical address of metadata buffer (if applicable).
    pub mptr: Le64,
    /// Data Pointer — two PRP entries (PRP1, PRP2) or one SGL descriptor.
    /// For PRP mode: PRP1 = first page, PRP2 = second page or PRP list address.
    pub dptr: [Le64; 2],
    /// Command Dwords 10-15 — command-specific parameters.
    pub cdw10: Le32,
    pub cdw11: Le32,
    pub cdw12: Le32,
    pub cdw13: Le32,
    pub cdw14: Le32,
    pub cdw15: Le32,
}
// NVMe Base Spec: cdw0(4)+nsid(4)+cdw2(4)+cdw3(4)+mptr(8)+dptr(16)+cdw10-15(24) = 64 bytes.
const_assert!(core::mem::size_of::<NvmeCommand>() == 64);

/// NVMe Completion Queue Entry — 16 bytes.
/// All multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct NvmeCompletion {
    /// Command-specific result (DW0).
    pub result: Le32,
    /// Command-specific result (DW1). Zero for NVM I/O commands; carries
    /// additional data for some admin commands (e.g., Identify, Create I/O
    /// Queue). See NVMe Base Specification 2.0+ Figure 89.
    pub result_hi: Le32,
    /// SQ Head Pointer — controller's current SQ head position.
    /// The host uses this to reclaim SQ entries.
    pub sq_head: Le16,
    /// SQ Identifier — identifies which SQ this completion is for.
    pub sq_id: Le16,
    /// Command Identifier — matches the CID from the submitted NvmeCommand.
    pub cid: Le16,
    /// Status Field (NVMe Base Spec 2.0, Figure 89).
    /// P (bit 0) phase bit — toggled on each CQ wraparound.
    /// SC (bits 8:1) status code — 8-bit field.
    /// SCT (bits 11:9) status code type (0=generic, 1=command-specific, 2=media, 3=path, 6-7=vendor).
    /// CRD (bits 13:12) command retry delay.
    /// M (bit 14) more — more status available (via Error Info log page).
    /// DNR (bit 15) do not retry — 1 means permanent error, 0 means transient (may retry).
    pub status: Le16,
}
// NVMe Base Spec: result(4)+result_hi(4)+sq_head(2)+sq_id(2)+cid(2)+status(2) = 16 bytes.
const_assert!(core::mem::size_of::<NvmeCompletion>() == 16);
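The phase-bit mechanism described above reduces to a one-bit comparison plus a wraparound toggle (sketch; names are illustrative, status bit 0 = P per the NvmeCompletion layout):

```rust
/// A CQ entry is new when its phase bit (status bit 0) matches the
/// host's current expected phase for this queue.
fn cqe_is_new(status: u16, expected_phase: bool) -> bool {
    (status & 0x1 != 0) == expected_phase
}

/// Advance the CQ head; on wraparound, toggle the expected phase so
/// stale entries from the previous pass read as "not new".
fn cq_advance(head: u16, depth: u16, phase: bool) -> (u16, bool) {
    let next = head + 1;
    if next == depth { (0, !phase) } else { (next, phase) }
}
```

This is why the host never needs to read a CQ head register from the controller: freshness is encoded in the entries themselves.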

15.19.3 NVMe Command Opcodes

/// Admin command opcodes (Opcode field in CDW0, used on QID 0).
#[repr(u8)]
pub enum NvmeAdminOpcode {
    /// Delete I/O Submission Queue.
    DeleteIoSq    = 0x00,
    /// Create I/O Submission Queue.
    CreateIoSq    = 0x01,
    /// Get Log Page (error log, SMART, firmware slot, AEN config, etc.).
    GetLogPage    = 0x02,
    /// Delete I/O Completion Queue.
    DeleteIoCq    = 0x04,
    /// Create I/O Completion Queue.
    CreateIoCq    = 0x05,
    /// Identify — returns controller or namespace data structures.
    Identify      = 0x06,
    /// Abort — request cancellation of a previously submitted command.
    Abort         = 0x08,
    /// Set Features — configure controller parameters.
    SetFeatures   = 0x09,
    /// Get Features — read controller parameters.
    GetFeatures   = 0x0A,
    /// Asynchronous Event Request — register for async notifications.
    AsyncEventReq = 0x0C,
    /// Namespace Management — create/delete namespaces.
    NsMgmt        = 0x0D,
    /// Firmware Commit — activate firmware image.
    FwCommit      = 0x10,
    /// Firmware Image Download — transfer firmware to controller.
    FwDownload    = 0x11,
    /// Namespace Attachment — attach/detach namespace to controller.
    NsAttach      = 0x15,
    /// Format NVM — low-level format a namespace.
    FormatNvm     = 0x80,
}

/// NVM I/O command opcodes (used on I/O queues, QID 1+).
#[repr(u8)]
pub enum NvmeIoOpcode {
    /// Flush — commit volatile write cache to non-volatile media.
    Flush         = 0x00,
    /// Write — transfer data from host to namespace.
    Write         = 0x01,
    /// Read — transfer data from namespace to host.
    Read          = 0x02,
    /// Write Uncorrectable — mark LBA range as invalid (read returns error).
    WriteUncor    = 0x04,
    /// Compare — compare data in namespace with host buffer.
    Compare       = 0x05,
    /// Write Zeroes — set LBA range to zero without transferring data.
    WriteZeroes   = 0x08,
    /// Dataset Management — TRIM/deallocate, volatile write cache hints.
    Dsm           = 0x09,
    /// Verify — verify data integrity without transferring data.
    Verify        = 0x0C,
    /// Reservation Register — register/unregister reservation keys.
    ResrvRegister = 0x0D,
    /// Reservation Report — report current reservations.
    ResrvReport   = 0x0E,
    /// Reservation Acquire — acquire/preempt reservations.
    ResrvAcquire  = 0x11,
    /// Reservation Release — release reservations.
    ResrvRelease  = 0x15,
    /// Zone Append (ZNS) — write data to zone write pointer.
    ZoneAppend    = 0x7D,
    /// Zone Management Send (ZNS) — open/close/finish/reset zone.
    ZoneMgmtSend  = 0x79,
    /// Zone Management Receive (ZNS) — report zone descriptors.
    ZoneMgmtRecv  = 0x7A,
}

15.19.4 Driver State

/// NVMe controller driver state — lives in the Tier 1 driver domain.
/// One instance per NVMe controller (PCI function).
pub struct NvmeController {
    /// PCI BAR0 MMIO accessor for controller registers.
    pub regs: MmioRegion,
    /// Controller capabilities (cached from CAP register at init).
    pub cap: NvmeCapabilities,
    /// Maximum queue entries supported (CAP.MQES + 1).
    /// u32 because MQES is 16 bits: when MQES = 65535, the actual entry count
    /// is 65536, which overflows u16.
    pub max_queue_entries: u32,
    /// Doorbell stride in bytes (4 << CAP.DSTRD).
    pub doorbell_stride: u32,
    /// Host memory page size configured in CC.MPS (bytes, power of 2).
    pub page_size: u32,
    /// Admin queue pair (QID 0). Always present after initialization.
    pub admin_queue: NvmeQueuePair,
    /// I/O queue pairs (QID 1+). One per CPU, up to controller maximum.
    /// ArrayVec capacity 256: compile-time upper bound for the number of
    /// I/O queues. Actual count is min(nr_cpu_ids, CAP.MQES, 256) at init.
    /// 256 is sufficient for current NVMe controllers (most support <=128
    /// queues). Systems with >256 CPUs share queues (queue_idx = cpu % N).
    /// If future controllers support >256 queues, this constant must be
    /// increased or replaced with a slab-allocated slice.
    pub io_queues: ArrayVec<NvmeQueuePair, 256>,
    /// Active namespaces discovered via Identify. One NvmeNamespace per NSID.
    /// XArray keyed by NSID (u32) — integer-keyed mapping per collection policy.
    /// NVMe allows up to 2^32-1 namespaces (NN field from Identify Controller);
    /// XArray provides runtime-sized, O(log₆₄ N) lookup without hardcoded limits.
    /// Populated at probe time (warm-path), accessed on I/O submission (hot-path
    /// via cached queue→namespace binding, not repeated XArray lookup).
    pub namespaces: XArray<NvmeNamespace>,
    /// Number of MSI-X vectors allocated.
    pub msix_vectors: u16,
    /// Controller serial number (20 ASCII bytes from Identify Controller).
    pub serial: [u8; 20],
    /// Controller model number (40 ASCII bytes from Identify Controller).
    pub model: [u8; 40],
    /// Firmware revision (8 ASCII bytes from Identify Controller).
    pub firmware_rev: [u8; 8],
    /// Maximum Data Transfer Size in bytes. Derived from controller MDTS field.
    /// 0 means no limit (use host page size × max PRP list length).
    pub max_transfer_size: u32,
    /// Number of outstanding Async Event Requests (AER) the controller supports.
    pub aerl: u8,
    /// Controller supports volatile write cache (Identify Controller, VWC bit 0).
    pub volatile_write_cache: bool,
    /// Controller power state management.
    pub power_state: NvmePowerState,
    /// Error recovery state.
    pub error_state: AtomicU8,
    /// NUMA node of the PCI device (for queue/interrupt affinity).
    pub numa_node: u16,
}

/// Cached controller capabilities from the CAP register.
pub struct NvmeCapabilities {
    /// Maximum Queue Entries Supported (0-based). Actual max = mqes + 1.
    pub mqes: u16,
    /// Contiguous Queues Required.
    pub cqr: bool,
    /// Timeout in 500ms units (for CSTS.RDY transitions).
    pub timeout: u8,
    /// Doorbell Stride (2^(2+dstrd) bytes).
    pub dstrd: u8,
    /// Minimum host memory page size (2^(12+mpsmin) bytes).
    pub mpsmin: u8,
    /// Maximum host memory page size (2^(12+mpsmax) bytes).
    pub mpsmax: u8,
}

/// NVMe submission/completion queue pair.
pub struct NvmeQueuePair {
    /// Queue identifier (0 = admin, 1+ = I/O).
    pub qid: u16,
    /// Queue depth (number of entries, power of 2, max CAP.MQES+1).
    pub depth: u16,
    /// DMA-coherent submission queue buffer.
    pub sq: DmaBox<[NvmeCommand]>,
    /// DMA-coherent completion queue buffer.
    pub cq: DmaBox<[NvmeCompletion]>,
    /// SQ tail index — next slot to write a command. Advanced by the host.
    pub sq_tail: u16,
    /// CQ head index — next slot to read a completion. Advanced by the host.
    pub cq_head: u16,
    /// Current CQ phase bit. Starts at 1; toggles on each CQ wraparound.
    pub cq_phase: bool,
    /// Doorbell offset for SQ tail (BAR0 + 0x1000 + qid*2*stride).
    pub sq_doorbell_offset: u32,
    /// Doorbell offset for CQ head (BAR0 + 0x1000 + (qid*2+1)*stride).
    pub cq_doorbell_offset: u32,
    /// In-flight command tracking: maps CID → pending Bio.
    /// Allocated at queue creation with length = actual queue depth
    /// (discovered from controller CAP.MQES, typically 64-1024).
    /// Warm-path allocation (driver init only).
    pub inflight: Box<[Option<NvmeInflightCmd>]>,
    /// Next command identifier. Wraps at queue depth.
    pub next_cid: u16,
    /// MSI-X vector assigned to this queue's CQ.
    pub irq_vector: u16,
    /// Number of commands posted since the last doorbell write.
    /// Used by `nvme_ring_doorbell()` to skip redundant MMIO writes.
    pub pending_doorbells: u16,
    /// Batch mode flag. When true, `nvme_submit_io()` defers doorbell
    /// writes. Set by `nvme_submit_batch()`, cleared after the batch
    /// doorbell write.
    pub batch_mode: bool,
    /// DMA device handle for IOMMU/SWIOTLB address translation.
    pub dma_device: DmaDevice,
    /// Reference to the controller's BAR0 MMIO region (shared across all
    /// queues). Used by `nvme_ring_doorbell()` to write the SQ tail and
    /// CQ head doorbells at the queue-specific offsets.
    pub regs: MmioRegion,
    /// Per-CID flush waiter. `submit_flush_sync` stores a slab-allocated
    /// `Completion` handle here; the IRQ handler wakes it on flush
    /// completion. Allocated at queue creation with length = actual queue
    /// depth. `None` for CIDs not used by synchronous flush.
    ///
    /// Uses slab-allocated `Completion` (not stack references) to avoid
    /// dangling pointers if the submit path returns early via `?`.
    pub flush_waiters: Box<[Option<SlabBox<Completion>>]>,
    /// Pre-allocated PRP list page pool. Each entry is a DMA-coherent
    /// page (4096 bytes, 512 Le64 entries) for multi-page I/O commands.
    /// Pool size = queue depth (one PRP list per in-flight command).
    /// Allocated at queue creation; no allocation on the I/O hot path.
    pub prp_pool: PrpPool,
    /// Domain ID of the NVMe driver's isolation domain. Set at driver init
    /// from the `DomainService.domain_id` passed during module registration.
    /// `CORE_DOMAIN_ID` (0) during early boot (Tier 0); updated to the Tier 1
    /// domain ID after promotion. Used by `nvme_signal_completion()` to select
    /// direct (`bio_complete()`) vs ring-based completion path.
    pub domain_id: DomainId,
    /// Outbound KABI completion ring for Tier 1 mode. Initialized during
    /// module registration when the driver binds to the block layer service.
    /// The Tier 0 block layer consumer drains this ring and calls
    /// `bio_complete()` for each entry. `None` in Tier 0 (boot) mode.
    pub outbound_ring: Option<CrossDomainRing>,
}

/// In-flight command context — tracks a submitted command until completion.
/// Stores all information needed to rebuild the SQ entry on retry
/// (opcode, nsid, DMA mapping, PRP list).
pub struct NvmeInflightCmd {
    /// Pointer to the originating Bio (for completion callback).
    pub bio: *mut Bio,
    /// PRP list page (if the command required a PRP list for >2 segments).
    /// None if the command fit in the two inline PRP pointers. All PRP
    /// entries are little-endian per NVMe spec — Le64, not native u64.
    pub prp_list: Option<DmaBox<[Le64; 512]>>,
    /// DMA mapping for this command's data transfer. Unmapped in the
    /// completion handler after bio signaling to prevent IOMMU leak.
    pub dma_map: Option<DmaMapping>,
    /// Command opcode (for retry classification on error).
    pub opcode: u8,
    /// Namespace identifier (needed for command rebuild on retry).
    pub nsid: u32,
    /// Retry count (incremented on transient error, max 3).
    pub retries: u8,
}

15.19.5 Namespace State

/// NVMe namespace — one per active NSID on the controller.
pub struct NvmeNamespace {
    /// Namespace Identifier (1-based).
    pub nsid: u32,
    /// Namespace capacity in logical blocks.
    pub capacity_blocks: u64,
    /// Logical Block Address (LBA) format: sector size in bytes.
    /// Derived from Identify Namespace LBAF[FLBAS].LBADS: size = 2^lbads.
    pub block_size: u32,
    /// Metadata size per block (from LBAF[FLBAS].MS). 0 if no metadata.
    pub metadata_size: u16,
    /// Namespace supports thin provisioning (NSFEAT bit 0).
    pub thin_provisioned: bool,
    /// Namespace supports deallocate (DSM, Dataset Management command).
    pub supports_deallocate: bool,
    /// Maximum number of LBA ranges per DSM command (from Identify Namespace DMRL).
    /// 0 means no limit reported; driver uses 256 (spec maximum).
    pub dsm_range_limit: u16,
    /// Namespace is a Zoned Namespace (ZNS). See [Section 15.19](#nvme-driver-architecture--zoned-namespaces-zns).
    pub zns: Option<NvmeZnsInfo>,
    /// Optimal I/O boundary in logical blocks (NOIOB from Identify Namespace).
    /// Straddling this boundary may degrade performance. 0 = no boundary.
    pub optimal_io_boundary: u16,
    /// Preferred write granularity in logical blocks (NPWG from Identify Namespace).
    pub preferred_write_granularity: u16,
    /// Preferred write alignment in logical blocks (NPWA from Identify Namespace).
    pub preferred_write_alignment: u16,
    /// End-to-end data protection type (0=none, 1=Type1, 2=Type2, 3=Type3).
    pub pi_type: u8,
}

15.19.6 Initialization Sequence

Eight steps, from PCI probe to ready:

  1. PCI probe and BAR mapping: Match PCI class 01:08:02 (Mass Storage, NVM Express, NVM Express I/O Controller). Map BAR0 as uncacheable MMIO. Read CAP and VS registers. Validate spec version (minimum 1.0 for basic operation).

  2. Controller reset: Clear CC.EN (bit 0). Poll CSTS.RDY until it clears. Timeout = CAP.TO × 500ms. If CSTS.RDY does not clear, the controller is non-functional — abort initialization with EIO.

  3. Configure controller: Set CC.MPS to match the host page size (must be within CAP.MPSMIN..CAP.MPSMAX). Set CC.IOSQES = 6 (64-byte SQ entries). Set CC.IOCQES = 4 (16-byte CQ entries). Set CC.AMS = 1 (WRR with urgent) if CAP.AMS bit 0 is set; otherwise CC.AMS = 0 (round-robin). Set CC.CSS = 0 (NVM command set).

  4. Admin queue setup: Determine admin queue depth (min of 4096 and CAP.MQES + 1). DMA memory tradeoff: 4096-entry admin queue consumes 256 KiB (4096 × 64B SQ entries) + 64 KiB (4096 × 16B CQ entries) = 320 KiB of DMA-coherent memory per controller. This is generous for admin commands (typically <100 concurrent), but simplifies firmware update, namespace management, and device self-test flows that can submit many commands concurrently. For memory-constrained systems, the default can be reduced via boot parameter nvme.admin_queue_depth=256. Allocate DMA-coherent buffers for admin SQ and CQ. Write physical addresses to ASQ and ACQ. Write queue sizes to AQA (both ASQS and ACQS fields). Set CC.EN = 1. Poll CSTS.RDY until it sets (timeout = CAP.TO × 500ms). If CSTS.CFS (Controller Fatal Status) is set, reset and retry once before aborting.

  5. Identify Controller: Submit Identify command (opcode 0x06, CNS=0x01) on the admin queue. Parse the 4096-byte Identify Controller data structure:

  6. Bytes 24-63: Serial Number (SN), Model Number (MN).
  7. Byte 77: MDTS — Maximum Data Transfer Size as a power of 2 in units of CAP.MPSMIN. If 0, no limit. Otherwise, max transfer = (1 << MDTS) × page_size.
  8. Bytes 257-258: OACS — Optional Admin Command Support (namespace management, firmware commands, format NVM).
  9. Byte 259: ACL — Abort Command Limit (max outstanding Abort commands).
  10. Byte 260: AERL — Async Event Request Limit.
  11. Byte 525: VWC — Volatile Write Cache (bit 0: present).
  12. Bytes 514-515: NVSCC — NVM Vendor Specific Command Configuration.

  13. Identify Namespaces: Submit Identify command (CNS=0x02) to get the active namespace list. For each NSID in the list, submit Identify Namespace (CNS=0x00) to discover:

  14. NSZE (bytes 0-7): namespace size in logical blocks.
  15. NCAP (bytes 8-15): namespace capacity.
  16. FLBAS (byte 26): formatted LBA size index (selects entry from LBAF array).
  17. LBAF[0..63] (bytes 128-191): LBA format descriptors (LBADS = data size log2, MS = metadata size, RP = relative performance).
  18. DPS (byte 29): data protection settings.
  19. NSFEAT (byte 24): namespace features (thin provisioning, deallocate support).
  20. NOIOB (bytes 72-73): namespace optimal I/O boundary.
  21. NPWG/NPWA (bytes 74-77): preferred write granularity and alignment.

  22. I/O queue creation: Allocate MSI-X vectors — one per I/O queue plus one for the admin queue. Determine I/O queue count: min(online_cpus, controller_max_queues). Set Features (feature ID 0x07 — Number of Queues) to request the desired count. The controller may grant fewer. For each I/O queue pair: a. Allocate DMA-coherent CQ buffer. Submit Create I/O CQ (opcode 0x05) on admin queue: CDW10 = (size-1) << 16 | QID, CDW11 = irq_vector << 16 | IEN | PC (physically contiguous, interrupts enabled). b. Allocate DMA-coherent SQ buffer. Submit Create I/O SQ (opcode 0x01) on admin queue: CDW10 = (size-1) << 16 | QID, CDW11 = CQID << 16 | QPRIO | PC. c. Assign the queue pair to a specific CPU for interrupt affinity.

  23. Ready: Register Async Event Requests (up to AERL+1 outstanding). Enable interrupt coalescing via Set Features (feature ID 0x08) if the workload benefits from batching (tunable: threshold count + aggregation time). Register each namespace with umka-block as a BlockDevice.
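The MDTS limit above caps how large a single I/O command may be. A host-side sketch of the computation (hypothetical helper name; the driver's actual field names may differ):

```rust
/// Maximum data transfer size implied by Identify Controller MDTS and
/// CAP.MPSMIN. The controller's minimum memory page size is
/// 2^(12 + MPSMIN); MDTS expresses the limit as a power of two in
/// units of that page size. MDTS == 0 means no limit is advertised.
pub fn max_transfer_bytes(mdts: u8, cap_mpsmin: u8) -> Option<u64> {
    if mdts == 0 {
        return None; // no limit advertised
    }
    let page_size = 1u64 << (12 + cap_mpsmin as u32);
    Some((1u64 << (mdts as u32)) * page_size)
}
```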

15.19.7 I/O Path

Bio-to-NVMe command translation: The block layer submits a Bio containing an LBA range and a scatter-gather list of memory segments. The NVMe driver translates this into an NvmeCommand with PRP (Physical Region Page) data pointers.

PRP construction: NVMe uses two PRP pointers in each command (PRP1 and PRP2):

  • 1 segment (data fits in one page): PRP1 = physical address of the data buffer. PRP2 = 0 (unused).
  • 2 segments (data spans two pages): PRP1 = first page physical address. PRP2 = second page physical address.
  • >2 segments: PRP1 = first page physical address. PRP2 = physical address of a PRP list — a page-aligned buffer of u64 physical addresses for the remaining pages. Each PRP list page holds up to 512 entries (4096 / 8). If more than 512 additional pages are needed, the last entry in the PRP list points to the next PRP list page (chained).

fn nvme_submit_io(queue: &mut NvmeQueuePair, bio: &mut Bio,
                  ns: &NvmeNamespace) -> Result<()> {
    // Allocate a free Command ID (NvmeQueuePair::alloc_cid, §15.19.7.1).
    // Err(Error::BUSY) when every inflight slot is occupied: queue full.
    let cid = queue.alloc_cid()?;

    // Build NvmeCommand at SQ tail.
    let cmd = &mut queue.sq[queue.sq_tail as usize];
    *cmd = NvmeCommand::zeroed();

    // All NvmeCommand fields are Le32/Le64 — explicit conversion via
    // Le32::from_ne() required on big-endian architectures (PPC32, s390x).
    cmd.cdw0 = Le32::from_ne(match bio.op {
        BioOp::Read  => NvmeIoOpcode::Read as u32,
        BioOp::Write => NvmeIoOpcode::Write as u32,
        _ => unreachable!(), // Flush/Discard handled separately
    } | ((cid as u32) << 16));

    cmd.nsid = Le32::from_ne(ns.nsid);

    // CDW10-11: Starting LBA (64-bit).
    let slba = bio.start_lba;
    cmd.cdw10 = Le32::from_ne(slba as u32);
    cmd.cdw11 = Le32::from_ne((slba >> 32) as u32);

    // CDW12: Number of logical blocks (0-based) | FUA bit 30.
    let total_bytes: u64 = bio.segments.iter().map(|s| s.len as u64).sum::<u64>()
        + bio.segments_ext.as_deref().map_or(0u64, |ext| ext.iter().map(|s| s.len as u64).sum());
    let nlb = (total_bytes / ns.block_size as u64) - 1;
    let fua = if bio.flags.contains(BioFlags::FUA) { 1u32 << 30 } else { 0u32 };
    cmd.cdw12 = Le32::from_ne(nlb as u32 | fua);

    // Build PRP entries from bio segments.
    let opcode = match bio.op {
        BioOp::Read  => NvmeIoOpcode::Read as u8,
        BioOp::Write => NvmeIoOpcode::Write as u8,
        _ => unreachable!(),
    };
    let mut inflight = NvmeInflightCmd {
        bio: bio as *mut Bio,
        prp_list: None,
        dma_map: None,
        opcode,
        nsid: ns.nsid,
        retries: 0,
    };

    // Map bio segments to DMA addresses via IOMMU/SWIOTLB translation.
    // BioSegment contains (page: Arc<Page>, offset: u32, len: u32) —
    // physical/bus addresses are obtained by calling dma_map_sgl(),
    // not by direct field access. See §4.11 DMA Subsystem.
    let sgl = DmaSgl::from_bio_segments(&bio.segments, bio.segments_ext.as_deref());
    let dma_map = queue.dma_device.dma_map_sgl(
        &sgl, DmaDirection::from_bio_op(bio.op),
    )?;
    // Extract DMA addresses BEFORE moving dma_map into inflight.
    // Rust move semantics: accessing dma_map after move is a compilation error.
    let dma_addrs = dma_map.addresses();
    inflight.dma_map = Some(dma_map);

    // DmaAddr is u64 (native-endian); NvmeCommand.dptr is [Le64; 2]
    // and PRP list entries are Le64 — NVMe is a little-endian wire
    // protocol. Le64::from_ne() is a no-op on LE architectures (x86-64,
    // AArch64 LE, RISC-V LE) and a byte-swap on BE (PPC32, PPC64 BE,
    // s390x). See also the Le32::from_ne() conversions for cdw0/nsid above.
    match dma_addrs.len() {
        0 => {} // No data (should not happen for read/write)
        1 => {
            cmd.dptr[0] = Le64::from_ne(dma_addrs[0]);
        }
        2 => {
            cmd.dptr[0] = Le64::from_ne(dma_addrs[0]);
            cmd.dptr[1] = Le64::from_ne(dma_addrs[1]);
        }
        n => {
            cmd.dptr[0] = Le64::from_ne(dma_addrs[0]);
            // Allocate a PRP list from the per-queue pre-allocated pool;
            // Err(Error::NOMEM) when the pool is exhausted.
            // PRP list entries are Le64 per NVMe spec.
            let prp_list = queue.alloc_prp_list()?;
            for i in 1..n {
                prp_list[i - 1] = Le64::from_ne(dma_addrs[i]);
            }
            cmd.dptr[1] = Le64::from_ne(prp_list.phys_addr());
            inflight.prp_list = Some(prp_list);
        }
    }

    queue.inflight[cid as usize] = Some(inflight);

    // Advance SQ tail. Doorbell write is deferred for batch submission.
    queue.sq_tail = (queue.sq_tail + 1) % queue.depth;
    queue.pending_doorbells += 1;

    // Ring doorbell immediately only for single-command submission.
    // Batch callers use `nvme_submit_batch()` which defers the doorbell
    // write until all commands in the batch are posted. This reduces
    // MMIO writes from N to 1 per batch (MMIO writes are ~100-500ns each
    // due to uncacheable PCIe BAR access). For single-command submission
    // (the common case for fsync/flush), ring immediately.
    if !queue.batch_mode {
        nvme_ring_doorbell(queue);
    }

    Ok(())
}
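The PRP chaining rules above fix how many PRP list pages a given transfer consumes from the per-queue pool. A standalone sketch of that arithmetic (hypothetical helper; 4 KiB pages assumed):

```rust
/// Number of 4 KiB PRP list pages needed for a transfer spanning
/// `data_pages` memory pages. PRP1 always addresses the first page;
/// two pages fit entirely in PRP1 + PRP2. Beyond that, PRP2 points at
/// a list: each chained (non-final) list page holds 511 data entries
/// plus one link entry, and the final page holds up to 512 entries.
pub fn prp_list_pages(data_pages: u64) -> u64 {
    if data_pages <= 2 {
        return 0; // PRP1 (+PRP2) suffice, no list needed
    }
    let entries = data_pages - 1; // pages described by the list
    if entries <= 512 {
        return 1;
    }
    // First page: 511 data entries + 1 chain link; remaining entries
    // fill further pages at 511 each (the final page holds up to 512).
    1 + (entries - 512 + 510) / 511
}
```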

/// Ring the NVMe submission queue doorbell. Writes the current SQ tail
/// to the controller's doorbell register, notifying hardware of new commands.
/// Must be called after a write barrier to ensure all command data is visible.
fn nvme_ring_doorbell(queue: &mut NvmeQueuePair) {
    if queue.pending_doorbells == 0 {
        return;
    }
    // Write memory barrier — ensure commands are visible before doorbell write.
    core::sync::atomic::fence(Release);
    queue.regs.write32(queue.sq_doorbell_offset, queue.sq_tail as u32);
    queue.pending_doorbells = 0;
}

/// Submit a batch of bios with a single deferred doorbell write.
/// Each bio is individually placed into the SQ; the doorbell is rung
/// once after all commands are posted. Reduces MMIO overhead from N
/// writes to 1 per batch. Used by the block layer's request merging
/// and plugging infrastructure.
fn nvme_submit_batch(queue: &mut NvmeQueuePair, bios: &mut [&mut Bio],
                     ns: &NvmeNamespace) -> Result<()> {
    queue.batch_mode = true;
    for bio in bios.iter_mut() {
        nvme_submit_io(queue, bio, ns)?;
    }
    queue.batch_mode = false;
    nvme_ring_doorbell(queue);
    Ok(())
}

15.19.7.1 NVMe Helper Functions

impl NvmeQueuePair {
    /// Allocate a free Command ID by scanning forward from `next_cid`.
    /// Returns the CID index. Returns `Err(Error::BUSY)` if all slots in-flight.
    pub fn alloc_cid(&mut self) -> Result<u16> {
        let start = self.next_cid as usize;
        let depth = self.depth as usize;
        for i in 0..depth {
            let idx = (start + i) % depth;
            if self.inflight[idx].is_none() {
                self.next_cid = ((idx + 1) % depth) as u16;
                return Ok(idx as u16);
            }
        }
        Err(Error::BUSY)
    }

    /// Allocate a PRP list page from the per-queue pre-allocated PRP pool.
    /// Each NvmeQueuePair has a slab of pre-allocated PRP list pages (one
    /// per queue depth entry). The PRP list is page-aligned (4096 bytes)
    /// and holds up to 512 Le64 entries.
    /// Returns a `PrpList` handle (indexes into the page's Le64 entries
    /// and exposes `phys_addr()` for the page's DMA address), or
    /// Err(Error::NOMEM) if the pool is exhausted.
    pub fn alloc_prp_list(&mut self) -> Result<PrpList> {
        self.prp_pool.alloc().ok_or(Error::NOMEM)
    }

    /// Return a PRP list page to the per-queue pool.
    pub fn free_prp_list(&mut self, list: PrpList) {
        self.prp_pool.free(list);
    }

    /// Build and submit a flush command (NvmeIoOpcode::Flush, opcode 0x00).
    /// The inflight entry must already be stored by the caller.
    /// Uses the packed CDW0 API: OPC(7:0) | CID(31:16).
    pub fn submit_flush_cmd(&mut self, nsid: u32, cid: u16) -> Result<()> {
        let cmd = &mut self.sq[self.sq_tail as usize];
        *cmd = NvmeCommand::zeroed();
        cmd.cdw0 = Le32::from_ne(
            NvmeIoOpcode::Flush as u32 | ((cid as u32) << 16),
        );
        cmd.nsid = Le32::from_ne(nsid);
        // No data pointers, no CDW10-15 for flush.
        self.sq_tail = (self.sq_tail + 1) % self.depth;
        nvme_ring_doorbell(self);
        Ok(())
    }

    /// Submit a flush command and block until completion.
    /// Used by BlockDeviceOps::flush() for synchronous flush.
    ///
    /// Uses a slab-allocated `Completion` (not a stack-local reference)
    /// so that early return via `?` cannot leave a dangling pointer in
    /// `flush_waiters`. Cleanup of `flush_waiters[cid]` and `inflight[cid]`
    /// is performed explicitly on all exit paths.
    pub fn submit_flush_sync(&mut self, nsid: u32) -> Result<()> {
        let cid = self.alloc_cid()?;
        let completion = SlabBox::new(Completion::new());
        self.inflight[cid as usize] = Some(NvmeInflightCmd {
            bio: core::ptr::null_mut(), // no bio — completion wakes waiter
            prp_list: None,
            dma_map: None,
            opcode: NvmeIoOpcode::Flush as u8,
            nsid,
            retries: 0,
        });
        self.flush_waiters[cid as usize] = Some(completion);
        // Submit the flush command. On error, clean up both slots.
        if let Err(e) = self.submit_flush_cmd(nsid, cid) {
            self.flush_waiters[cid as usize] = None;
            self.inflight[cid as usize] = None;
            return Err(e);
        }
        // Block until completion handler signals (TASK_KILLABLE).
        self.flush_waiters[cid as usize].as_ref().unwrap().wait_killable();
        // Cleanup: the completion handler has already processed the inflight
        // entry (via .take()). Clear the waiter slot.
        self.flush_waiters[cid as usize] = None;
        Ok(())
    }
}

/// Requeue a command after a transient NVMe error (SC != 0, DNR == 0).
/// Re-submits the same command with the same CID: increments `retries`,
/// rebuilds the SQ entry from the inflight state, and re-rings the
/// doorbell.
///
/// Takes `NvmeInflightCmd` by value because the caller `.take()`d it
/// from the inflight array. After the SQ entry is rebuilt, the entry is
/// stored back into `queue.inflight[cid]` so the completion handler can
/// find it when the retried command completes. The caller must NOT have
/// consumed the DMA mapping or PRP list; both are reused as-is.
fn requeue_command(queue: &mut NvmeQueuePair, mut inflight: NvmeInflightCmd, cid: u16) {
    inflight.retries += 1;
    // Rebuild SQ entry from inflight fields (same packed CDW0 API as nvme_submit_io).
    let cmd = &mut queue.sq[queue.sq_tail as usize];
    *cmd = NvmeCommand::zeroed();
    cmd.cdw0 = Le32::from_ne(
        inflight.opcode as u32 | ((cid as u32) << 16),
    );
    cmd.nsid = Le32::from_ne(inflight.nsid);
    // Re-use existing DMA mapping and PRP list (still valid — NOT consumed).
    if let Some(ref dma_map) = inflight.dma_map {
        let addrs = dma_map.addresses();
        if !addrs.is_empty() {
            cmd.dptr[0] = Le64::from_ne(addrs[0]);
        }
        if addrs.len() == 2 {
            cmd.dptr[1] = Le64::from_ne(addrs[1]);
        } else if let Some(ref prp_list) = inflight.prp_list {
            cmd.dptr[1] = Le64::from_ne(prp_list.phys_addr());
        }
    }
    // Re-insert inflight entry so the completion handler finds it.
    queue.inflight[cid as usize] = Some(inflight);
    queue.sq_tail = (queue.sq_tail + 1) % queue.depth;
    nvme_ring_doorbell(queue);
}

Flush command: NvmeIoOpcode::Flush (opcode 0x00), no data pointers, NSID set. CDW10-15 all zero. The controller commits volatile write cache to non-volatile media. Flush inflight construction (must include opcode and nsid fields):

queue.inflight[cid as usize] = Some(NvmeInflightCmd {
    bio: bio as *mut Bio,
    prp_list: None,
    dma_map: None,
    opcode: NvmeIoOpcode::Flush as u8,
    nsid: ns.nsid,
    retries: 0,
});

Discard (DSM/Deallocate): NvmeIoOpcode::Dsm (opcode 0x09). CDW10 = number of ranges - 1. CDW11 = attribute bits (bit 2 = Deallocate). The data buffer contains an array of NvmeDsmRange entries:

/// Dataset Management range descriptor — 16 bytes per range.
/// All multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct NvmeDsmRange {
    /// Context attributes (optional hints).
    pub attributes: Le32,
    /// Number of logical blocks in this range.
    pub length: Le32,
    /// Starting LBA of this range.
    pub slba: Le64,
}
// NVMe DSM range: attributes(4) + length(4) + slba(8) = 16 bytes.
const_assert!(core::mem::size_of::<NvmeDsmRange>() == 16);
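The CDW10/CDW11 encoding described above can be sketched as a pure function (hypothetical helper; the 1..=256 range count limit follows the spec's 8-bit, 0-based NR field):

```rust
/// CDW10/CDW11 for a Dataset Management command.
/// CDW10 carries the number of ranges, 0-based (NR). CDW11 bit 2 (AD)
/// requests deallocation of the listed ranges.
pub fn dsm_cdws(num_ranges: u32, deallocate: bool) -> (u32, u32) {
    debug_assert!((1..=256).contains(&num_ranges)); // spec: 1..=256 ranges
    let cdw10 = num_ranges - 1;                      // NR is 0-based
    let cdw11 = if deallocate { 1 << 2 } else { 0 }; // AD = bit 2
    (cdw10, cdw11)
}
```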

15.19.8 Interrupt Handling

MSI-X preferred: The driver requests one MSI-X vector per I/O queue plus one for the admin queue. This provides per-queue interrupt isolation — each I/O queue's completions are delivered to the CPU that owns that queue, avoiding cross-CPU interrupt migration. If MSI-X is unavailable, fall back to MSI (single vector, shared across all queues) then to INTx legacy (pin-based).

Interrupt coalescing: Configured via Set Features (Feature ID 0x08 — Interrupt Coalescing). Parameters: aggregation threshold (number of completions before interrupt) and aggregation time (100μs units). Default: threshold=8, time=1ms (10 × 100μs ticks). Tunable per workload — latency-sensitive workloads disable coalescing; throughput workloads increase the threshold.
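The CDW11 payload for this feature packs the threshold into bits 7:0 and the aggregation time (in 100 μs ticks) into bits 15:8. A minimal sketch (hypothetical helper name):

```rust
/// Build CDW11 for Set Features 0x08 (Interrupt Coalescing).
/// Bits 7:0 = aggregation threshold (completions before an interrupt),
/// bits 15:8 = aggregation time in 100 microsecond increments.
pub fn irq_coalescing_cdw11(threshold: u8, time_100us: u8) -> u32 {
    (threshold as u32) | ((time_100us as u32) << 8)
}
```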

Completion processing (see Section 3.6 for the formal Completion primitive).

Tier boundary: The NVMe driver runs in a Tier 1 domain. bio_complete() is a Tier 0 function (Section 15.2). Per the Unified Domain Model (Section 12.8), the driver cannot call bio_complete() directly -- that would be a cross-domain direct call violating isolation.

Instead, the driver enqueues completion events on its outbound KABI completion ring targeting the Tier 0 block layer. The Tier 0 block layer consumer dequeues these events and calls bio_complete() in Tier 0 context. The bio pointer is passed as an opaque cookie: u64 (cast from *mut Bio); the Tier 0 consumer recovers the Bio reference and invokes bio_complete(bio, status).

Tier 0 boot path exception: During early boot (before Tier 1 promotion), the NVMe driver runs in Tier 0 (Domain 0). In this mode, bio_complete() is in the same domain and can be called directly. The completion path checks self.domain_id == CORE_DOMAIN_ID to select the direct or ring path. After promotion to Tier 1, all completions go through the outbound ring.

fn nvme_irq_handler(queue: &mut NvmeQueuePair) -> IrqReturn {
    let mut completed = 0u32;

    loop {
        let cqe = &queue.cq[queue.cq_head as usize];

        // Check phase bit -- if it doesn't match our expected phase,
        // there are no more new completions.
        let phase = (cqe.status.to_ne() & 1) != 0;
        if phase != queue.cq_phase {
            break;
        }

        // Read memory barrier -- ensure CQE fields are visible after phase check.
        core::sync::atomic::fence(Acquire);

        // Le16 fields must be converted to native before bit extraction.
        let cid = cqe.cid.to_ne();
        let status = cqe.status.to_ne();
        let status_code = (status >> 1) & 0xFF;  // SC: bits 8:1, 8-bit field
        let status_type = (status >> 9) & 0x07; // SCT: bits 11:9, 3-bit field
        let dnr = (status >> 15) & 1;

        if let Some(inflight) = queue.inflight[cid as usize].take() {
            if status_type == 0 && status_code == 0 {
                // Success path -- unmap DMA BEFORE signaling completion.
                // After completion, the waiter may free the bio and reuse
                // the data pages. The IOMMU mapping must be torn down first
                // to prevent stale mappings.
                if let Some(dma_map) = inflight.dma_map {
                    queue.dma_device.dma_unmap(dma_map);
                }
                if let Some(prp_list) = inflight.prp_list {
                    queue.free_prp_list(prp_list);
                }
                nvme_signal_completion(queue, inflight.bio, 0);
            } else if dnr == 0 && inflight.retries < 3 {
                // Transient error -- retry. Do NOT consume DMA mapping
                // or PRP list: they are reused by the retried command.
                // requeue_command takes ownership and re-inserts into
                // queue.inflight[cid].
                requeue_command(queue, inflight, cid);
            } else {
                // Permanent error or retries exhausted -- unmap DMA, complete.
                if let Some(dma_map) = inflight.dma_map {
                    queue.dma_device.dma_unmap(dma_map);
                }
                if let Some(prp_list) = inflight.prp_list {
                    queue.free_prp_list(prp_list);
                }
                let errno = nvme_status_to_errno(status_type, status_code);
                nvme_signal_completion(queue, inflight.bio, errno);
            }

            // Check if there is a flush waiter for this CID.
            if let Some(ref completion) = queue.flush_waiters[cid as usize] {
                completion.signal();
            }
        }

        // Advance CQ head. Toggle phase on wraparound.
        queue.cq_head = (queue.cq_head + 1) % queue.depth;
        if queue.cq_head == 0 {
            queue.cq_phase = !queue.cq_phase;
        }
        completed += 1;
    }

    if completed > 0 {
        // Update CQ head doorbell -- tells controller it can reuse CQ entries.
        queue.regs.write32(queue.cq_doorbell_offset, queue.cq_head as u32);
        IrqReturn::Handled
    } else {
        IrqReturn::None
    }
}

/// Signal bio completion respecting the Tier 0/Tier 1 boundary.
///
/// In Tier 0 (boot, before promotion): calls `bio_complete()` directly.
/// In Tier 1 (post-promotion): enqueues a completion event on the
/// outbound KABI ring targeting the Tier 0 block layer consumer.
///
/// The Tier 0 block layer consumer ([Section 15.2](#block-io-and-volume-management))
/// dequeues the completion and calls `bio_complete(bio, status)`.
///
/// # Arguments
///
/// - `queue`: The NVMe queue pair (provides access to the outbound ring
///   handle and domain_id).
/// - `bio`: Raw pointer to the Bio being completed.
/// - `status`: 0 = success, negative = -errno.
fn nvme_signal_completion(
    queue: &NvmeQueuePair,
    bio: *mut Bio,
    status: i32,
) {
    // Flush-sync commands carry a null bio (see submit_flush_sync);
    // the IRQ handler signals the flush waiter directly, so there is
    // nothing to complete here.
    if bio.is_null() {
        return;
    }
    if queue.domain_id == CORE_DOMAIN_ID {
        // Tier 0 (boot path) -- same domain, direct call is safe.
        // SAFETY: bio pointer was validated at submit_bio() time and
        // stored in the inflight table. The inflight entry was consumed
        // by take() above, so we have exclusive access.
        let bio_ref = unsafe { &mut *bio };
        bio_complete(bio_ref, status);
    } else {
        // Tier 1 (post-promotion) -- cross-domain, use outbound ring.
        // Enqueue a T1CompletionEntry on the outbound KABI ring.
        // The Tier 0 block layer consumer processes this and calls
        // bio_complete() in Tier 0 context.
        let entry = T1CompletionEntry {
            cookie: bio as u64,  // Bio pointer as opaque cookie.
            status,
            result_len: 0,
            result_offset: 0,
            _reserved: [0u8; 44],
        };
        // outbound_ring is always Some in Tier 1 mode (initialized at promotion).
        let ring = queue.outbound_ring.as_ref()
            .expect("outbound_ring must be Some in Tier 1 mode");
        match ring.try_enqueue(&entry) {
            Ok(()) => {}
            Err(()) => {
                // Outbound ring full -- this should not happen in practice
                // because the ring is sized to match the queue depth. It
                // means completions are being generated faster than the
                // Tier 0 consumer can drain them. The FMA subsystem is
                // notified for diagnosis.
                klog_err!("NVMe: outbound completion ring full, bio {:p} lost", bio);
                // Cannot call bio_complete() from Tier 1 -- the bio may
                // already be referenced by the Tier 0 submitter. The Tier 0
                // block layer will time out the bio via its completion
                // timeout mechanism and fail it with EIO.
            }
        }
    }
}

/// Map NVMe status to errno.
fn nvme_status_to_errno(sct: u16, sc: u16) -> i32 {
    match (sct, sc) {
        (0, 0x02) => -EINVAL,   // Invalid Field in Command
        (0, 0x80) => -EREMOTEIO, // LBA Out of Range (addressing error, maps to BLK_STS_TARGET per Linux)
        (0, 0x81) => -ENOSPC,   // Capacity Exceeded
        (0, 0x82) => -EIO,      // Namespace Not Ready
        (2, 0x80) => -EIO,      // Write Fault
        (2, 0x81) => -EIO,      // Unrecovered Read Error
        (2, 0x82) => -ENODATA,  // End-to-End Guard Check Error
        (2, 0x83) => -ENODATA,  // End-to-End Application Tag Check Error
        (2, 0x84) => -ENODATA,  // End-to-End Reference Tag Check Error
        (2, 0x87) => -EIO,      // Deallocated or Unwritten Logical Block
        (3, 0x00) => -ENXIO,    // Internal Path Error
        (3, 0x01) => -ENXIO,    // Asymmetric Access Persistent Loss
        (3, 0x02) => -EAGAIN,   // Asymmetric Access Inaccessible
        (3, 0x03) => -EAGAIN,   // Asymmetric Access Transition
        _         => -EIO,      // All other errors
    }
}

15.19.9 Error Recovery

NVMe error recovery operates at three levels:

Command-level retry: On transient errors (DNR=0 in completion status), the driver re-submits the command up to 3 times. Transient errors include path errors (ANA transitions), abort due to SQ deletion, and internal controller errors. Permanent errors (DNR=1) are reported to the block layer immediately.

Controller-level reset: Triggered by controller fatal status (CSTS.CFS=1), command timeout (no completion within 30 seconds), or unrecoverable command errors:

  1. Set error_state = Recovering. New I/O submissions return EAGAIN.
  2. Disable controller: clear CC.EN. Poll CSTS.RDY = 0 (timeout = CAP.TO × 500ms). If timeout expires, perform PCI function-level reset (FLR) via PCIE_CAP + 0x08.
  3. Delete all I/O queues (the controller forgets them on reset).
  4. Rebuild admin queue: rewrite ASQ, ACQ, AQA. Set CC.EN = 1. Wait for CSTS.RDY = 1.
  5. Re-identify controller and namespaces (configuration may have changed).
  6. Re-create I/O queues via Create I/O CQ / Create I/O SQ admin commands.
  7. Replay in-flight commands: the block layer retains all bios that were submitted but not completed. After queue re-creation, these bios are re-submitted through the normal submit_bio() path.
  8. Set error_state = Normal. Resume accepting submissions.
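Step 2's readiness timeout is derived from the CAP register: TO occupies CAP bits 31:24, in units of 500 ms. A sketch (hypothetical helper name):

```rust
/// Worst-case time to wait for CSTS.RDY to change after toggling CC.EN.
/// CAP.TO (bits 31:24 of the 64-bit CAP register) is in 500 ms units.
pub fn ready_timeout_ms(cap: u64) -> u64 {
    ((cap >> 24) & 0xFF) * 500
}
```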

Async Event Notification (AEN): The driver maintains AERL+1 outstanding Async Event Requests with the controller. When the controller detects a noteworthy condition, it completes an AER with the event type:

| Event Type | Action |
| --- | --- |
| Error (0x00) — persistent internal error | Read Error Log (Log Page 0x01), report via FMA |
| SMART/Health (0x01) — threshold exceeded | Read SMART Log (Log Page 0x02), report temperature/wear via FMA |
| Notice (0x02) — namespace attribute changed | Re-identify affected namespace |
| Notice (0x02) — firmware activation starting | Quiesce I/O, wait for activation complete |
| I/O command set specific (0x06) — zone changed | Refresh zone descriptors for affected namespace |

After processing each AEN, the driver resubmits a replacement Async Event Request to maintain the outstanding AER count.
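The AER completion's Dword 0 encodes the event: type in bits 2:0, event information in bits 15:8, and the log page to read in bits 23:16. A decoding sketch (hypothetical names):

```rust
/// Decoded Async Event Request completion (completion queue entry DW0).
pub struct AerEvent {
    pub event_type: u8, // bits 2:0  (0=Error, 1=SMART, 2=Notice, ...)
    pub event_info: u8, // bits 15:8 (type-specific detail code)
    pub log_page: u8,   // bits 23:16 (log page to read for details)
}

pub fn decode_aer_dw0(dw0: u32) -> AerEvent {
    AerEvent {
        event_type: (dw0 & 0x07) as u8,
        event_info: ((dw0 >> 8) & 0xFF) as u8,
        log_page: ((dw0 >> 16) & 0xFF) as u8,
    }
}
```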

15.19.10 Namespace Management

Multi-namespace controllers expose multiple independent block devices. Each namespace has its own LBA space, block size, and capabilities.

Namespace attachment/detachment: Admin command NsAttach (opcode 0x15); CDW10 selects the action: 0x00 = attach, 0x01 = detach. The data buffer contains a controller list specifying which controllers the namespace is attached to. On detachment, the driver unregisters the corresponding BlockDevice from umka-block and fails pending bios with ENXIO.

Format NVM: Admin command FormatNvm (opcode 0x80) performs a low-level format on a namespace. CDW10 specifies the target LBAF index and secure erase setting. This is a destructive operation — all data in the namespace is lost. The driver blocks I/O to the namespace during format (which may take minutes for large devices), then re-identifies the namespace to pick up the new LBA format.

15.19.11 Power State Management

NVMe controllers define multiple power states (PS0 = highest performance, PS4+ = deepest idle). Each power state specifies maximum power consumption and entry/exit latencies.

/// NVMe power state descriptor — from Identify Controller (bytes 2048+).
/// 32 bytes per entry, up to 32 power states (NVMe 2.0 Figure 275).
/// Multi-byte fields are little-endian (DMA-returned from controller).
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
///
/// Field layout matches Linux `struct nvme_id_power_state` (include/linux/nvme.h)
/// and the NVMe Base Specification.
#[repr(C)]
pub struct NvmePowerStateDesc {
    /// Maximum power (bytes 0-1). Units depend on `flags` MXPS bit:
    /// MXPS=0 → centiwatts (0.01 W), MXPS=1 → 0.0001 W units.
    pub max_power: Le16,
    /// Byte 2: reserved.
    pub _rsvd2: u8,
    /// Byte 3: Flags.
    /// Bit 0 = MXPS (Max Power Scale: 0=0.01 W units, 1=0.0001 W units).
    /// Bit 1 = NOPS (Non-Operational State: 1=non-operational).
    pub flags: u8,
    /// Entry Latency in microseconds (bytes 4-7).
    pub entry_lat_us: Le32,
    /// Exit Latency in microseconds (bytes 8-11).
    pub exit_lat_us: Le32,
    /// Relative Read Throughput (byte 12, 0 = best within this power state).
    pub rrt: u8,
    /// Relative Read Latency (byte 13, 0 = best).
    pub rrl: u8,
    /// Relative Write Throughput (byte 14, 0 = best).
    pub rwt: u8,
    /// Relative Write Latency (byte 15, 0 = best).
    pub rwl: u8,
    /// Idle Power consumption (bytes 16-17). Units: see `idle_scale`.
    pub idle_power: Le16,
    /// Byte 18: Idle Power Scale (bits 1:0). 0=not reported, 1=0.0001W, 2=0.01W.
    pub idle_scale: u8,
    /// Byte 19: reserved.
    pub _rsvd19: u8,
    /// Active Power consumption (bytes 20-21). Units: see `active_work_scale`.
    pub active_power: Le16,
    /// Byte 22: Active Power Workload (bits 2:0) + Active Power Scale (bits 7:6).
    /// Workload: 0=not reported, 1=workload #1, 2=workload #2.
    /// Scale: 0=not reported, 1=0.0001W, 2=0.01W.
    pub active_work_scale: u8,
    /// Bytes 23-31: reserved.
    pub _rsvd23: [u8; 9],
    // Layout: 2+1+1+4+4+1+1+1+1+2+1+1+2+1+9 = 32 bytes.
}
// NVMe power state descriptor: 32 bytes per entry.
const_assert!(core::mem::size_of::<NvmePowerStateDesc>() == 32);

/// Runtime power state tracking.
pub struct NvmePowerState {
    /// Current operational power state (0-based index).
    pub current_ps: u8,
    /// Number of supported power states (from Identify Controller NPSS+1).
    pub num_states: u8,
    /// Power state descriptors (cached from Identify Controller).
    pub states: ArrayVec<NvmePowerStateDesc, 32>,
    /// APST (Autonomous Power State Transitions) enabled.
    pub apst_enabled: bool,
    /// APST transition table: for each idle threshold, target power state.
    pub apst_table: ArrayVec<NvmeApstEntry, 32>,
}

/// APST table entry — configures automatic idle power state transition.
pub struct NvmeApstEntry {
    /// Idle time threshold in milliseconds before transitioning to target_ps.
    pub idle_threshold_ms: u32,
    /// Target power state for this idle threshold.
    pub target_ps: u8,
}

APST (Autonomous Power State Transitions): When supported (Identify Controller APSTA bit), the controller autonomously transitions between power states based on idle time. The driver programs the APST table via Set Features (Feature ID 0x0C — Autonomous Power State Transition). UmkaOS programs a conservative table: 100ms idle → PS1, 500ms → PS2, 2s → PS3. Non-operational states (NOPS=1) are excluded from the APST table — these states halt I/O processing and require explicit host-initiated transition.
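Each entry in the APST table written by Set Features 0x0C is a 64-bit value with the target state (ITPS) in bits 7:3 and the idle time in milliseconds (ITPT) in bits 31:8. A packing sketch (hypothetical helper name):

```rust
/// Pack one 64-bit APST table entry: Idle Transition Power State (ITPS,
/// bits 7:3) and Idle Time Prior to Transition (ITPT, bits 31:8, in ms).
pub fn apst_entry(target_ps: u8, idle_ms: u32) -> u64 {
    (((target_ps & 0x1F) as u64) << 3) | (((idle_ms & 0x00FF_FFFF) as u64) << 8)
}
```

The conservative table from the paragraph above would thus emit, e.g., `apst_entry(1, 100)` for the 100ms → PS1 rule.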

Integration with runtime PM (Section 7.5): The NVMe driver registers with the runtime PM framework. On runtime_suspend(), the driver sets the deepest non-operational power state via Set Features (Feature ID 0x02 — Power Management, CDW11 = target power state). On runtime_resume(), the driver transitions back to PS0. The autosuspend delay defaults to 5 seconds for NVMe.

System suspend path: Flush volatile write cache (Flush command), then set shutdown notification (CC.SHN = 01 for normal shutdown). Poll CSTS.SHST until it reads 10b (shutdown complete). On resume: re-enable controller (CC.EN), wait for CSTS.RDY, re-create queues.
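The shutdown handshake reduces to a pair of bitfield operations on CC and CSTS: write SHN (CC bits 15:14) to 01b, then poll SHST (CSTS bits 3:2) for 10b. A sketch (hypothetical helper names):

```rust
/// Request a normal controller shutdown: CC.SHN (bits 15:14) = 01b.
pub fn cc_request_shutdown(cc: u32) -> u32 {
    (cc & !(0b11 << 14)) | (0b01 << 14)
}

/// Shutdown processing complete: CSTS.SHST (bits 3:2) == 10b.
pub fn shutdown_complete(csts: u32) -> bool {
    (csts >> 2) & 0b11 == 0b10
}
```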

15.19.12 Tier 1 Isolation Integration

The NVMe driver runs as a Tier 1 driver — Ring 0 execution with hardware memory domain isolation (MPK on x86-64, POE on AArch64 where available, page table isolation as fallback). See Section 11.9 for the complete Tier 1 recovery protocol.

DMA isolation: All DMA buffers (SQ, CQ, PRP lists, data buffers) are mapped through the IOMMU. The NVMe controller's PCI function is assigned a dedicated IOMMU domain. The IOMMU page table restricts the controller to accessing only memory regions explicitly mapped for NVMe I/O — it cannot read or write arbitrary physical memory. On architectures without IOMMU (rare for NVMe-capable systems), DMA buffers are allocated from physically contiguous regions and the bounce buffer (SWIOTLB) path is used.

Crash recovery sequence:

  1. Fault detection: Hardware memory domain fault (MPK/POE violation), null pointer dereference, kernel OOPS within the NVMe driver domain, or watchdog timeout.
  2. Domain isolation: The faulting Tier 1 domain is immediately isolated — its memory domain key is revoked. No other kernel subsystem is affected.
  3. Controller quiesce: Assert PCI FLR (Function-Level Reset) to halt all DMA. The IOMMU domain prevents any stale DMA from reaching memory after reset.
  4. Driver reload: The KABI framework loads a fresh copy of the NVMe driver into a new Tier 1 domain. The driver re-initializes following the full 8-step sequence.
  5. I/O replay: The block layer replays all in-flight bios that were submitted but not completed before the crash. The new driver instance processes them normally.
  6. Recovery time: ~50-150ms (dominated by controller reset + queue re-creation). The block layer's retry queue absorbs the gap — filesystems and applications see a brief latency spike, not an error.

15.19.13 Zoned Namespaces (ZNS)

Zoned Namespaces (NVMe ZNS, TP 4053) divide the namespace into sequential-write zones. Within each zone, writes must proceed sequentially from the zone write pointer. This aligns with the erase-block behavior of NAND flash, enabling the SSD controller to eliminate the Flash Translation Layer (FTL) and reduce write amplification.
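The write-pointer discipline above reduces to simple LBA arithmetic: a write to a Sequential Write Required zone must start exactly at the zone's write pointer and end within the zone's capacity. A sketch (hypothetical helpers; fixed zone size assumed):

```rust
/// Zone index for an LBA given a fixed zone size in logical blocks.
pub fn zone_index(lba: u64, zone_size_blocks: u64) -> u64 {
    lba / zone_size_blocks
}

/// A write to an SWR zone is valid only if it begins exactly at the
/// zone's write pointer and ends within the zone's capacity (which may
/// be smaller than the zone size).
pub fn seq_write_ok(slba: u64, nblocks: u64, zone_start: u64,
                    write_pointer: u64, zone_capacity: u64) -> bool {
    slba == write_pointer && slba + nblocks <= zone_start + zone_capacity
}
```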

/// ZNS namespace information (from Identify Namespace, Zoned fields).
pub struct NvmeZnsInfo {
    /// Zone size in logical blocks (fixed for all zones in the namespace).
    pub zone_size_blocks: u64,
    /// Maximum open zones allowed simultaneously. 0 = no limit.
    pub max_open_zones: u32,
    /// Maximum active zones. 0 = no limit.
    pub max_active_zones: u32,
    /// Zone append size limit in logical blocks (ZASL from Identify Controller ZNS).
    /// Maximum data size for a single Zone Append command.
    pub zone_append_size_limit: u32,
}

/// Zone descriptor — returned by Zone Management Receive (Report Zones).
/// Layout per ZNS Command Set Specification 1.1b, Figure 40.
/// Total: 64 bytes.
#[repr(C)]
pub struct NvmeZoneDescriptor {
    /// Byte 0: Zone type. 0x02 = Sequential Write Required (SWR).
    pub zone_type: u8,
    /// Byte 1: Zone state (bits 7:4); bits 3:0 are reserved.
    /// Zone state values (as raw byte values with the low nibble zero):
    ///   0x10=Empty, 0x20=ImplicitlyOpened, 0x30=ExplicitlyOpened,
    ///   0x40=Closed, 0xD0=ReadOnly, 0xE0=Full, 0xF0=Offline.
    pub zone_state: u8,
    /// Byte 2: Zone attributes.
    ///   bit 0 = Zone Finished by Controller (ZFC).
    ///   bit 1 = Finish Zone Recommended (FZR).
    ///   bit 2 = Reset Zone Recommended (RZR).
    ///   bit 7 = Zone Descriptor Extension Valid (ZDEV).
    pub zone_attrs: u8,
    /// Bytes 3-7: Reserved.
    pub _rsvd: [u8; 5],
    /// Bytes 8-15: Zone capacity in logical blocks (may be < zone_size).
    pub zone_capacity: Le64,
    /// Bytes 16-23: Zone Start LBA.
    pub zone_start_lba: Le64,
    /// Bytes 24-31: Write Pointer — next LBA for sequential writes.
    /// 0xFFFF_FFFF_FFFF_FFFF if invalid (zone in ReadOnly or Offline state).
    pub write_pointer: Le64,
    /// Bytes 32-63: Reserved.
    pub _rsvd2: [u8; 32],
}
const_assert!(core::mem::size_of::<NvmeZoneDescriptor>() == 64);
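
As an illustration, the state nibble can be decoded with a small helper (the `ZoneState` enum and `decode_zone_state` function are illustrative, not part of the driver ABI; the values follow the ZNS Command Set specification):

```rust
/// Zone states per the ZNS Command Set spec (upper nibble of the state byte).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ZoneState {
    Empty,
    ImplicitlyOpened,
    ExplicitlyOpened,
    Closed,
    ReadOnly,
    Full,
    Offline,
    /// Any value the spec does not define.
    Reserved(u8),
}

/// Extract the zone state from the raw descriptor byte (state in bits 7:4).
pub fn decode_zone_state(raw: u8) -> ZoneState {
    match raw >> 4 {
        0x1 => ZoneState::Empty,
        0x2 => ZoneState::ImplicitlyOpened,
        0x3 => ZoneState::ExplicitlyOpened,
        0x4 => ZoneState::Closed,
        0xD => ZoneState::ReadOnly,
        0xE => ZoneState::Full,
        0xF => ZoneState::Offline,
        other => ZoneState::Reserved(other),
    }
}
```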

Zone operations via Zone Management Send (opcode 0x79). The Zone Send Action (ZSA) is carried in CDW13 bits 7:0:

| ZSA | Operation | Description |
|------|--------------|-------------|
| 0x01 | Close Zone | Transition zone from Open to Closed. Frees active zone resources. |
| 0x02 | Finish Zone | Transition the zone to Full without writing the remaining capacity; unwritten blocks read back as deallocated. |
| 0x03 | Open Zone | Explicitly open a zone for writing. |
| 0x04 | Reset Zone | Reset zone write pointer to start. Zone becomes Empty. All data lost. |
| 0x05 | Offline Zone | Take zone offline (administrative action). |

Zone Append (opcode 0x7D): Write data to a zone without specifying an exact LBA. The controller appends data at the current write pointer and returns the actual written LBA in the completion entry (result field). This eliminates host-side write pointer tracking contention — multiple threads can zone-append concurrently, and the controller serializes them.
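
The append semantics can be modeled with a small host-side sketch (`ZoneModel` and `zone_append` are illustrative bookkeeping only; the real controller performs the assignment atomically and returns the LBA in the completion entry):

```rust
/// Minimal host-side model of a zone, for illustrating append semantics.
pub struct ZoneModel {
    pub start_lba: u64,
    /// Writable blocks in the zone (may be less than the zone size).
    pub capacity: u64,
    pub write_pointer: u64,
}

/// Simulate a Zone Append of `nblocks`: data lands at the current write
/// pointer, which then advances. Fails if the zone would overflow (the
/// controller reports a zone boundary error in that case).
pub fn zone_append(zone: &mut ZoneModel, nblocks: u64) -> Result<u64, ()> {
    let used = zone.write_pointer - zone.start_lba;
    if used + nblocks > zone.capacity {
        return Err(());
    }
    let assigned = zone.write_pointer; // LBA reported in the CQE for Zone Append
    zone.write_pointer += nblocks;
    Ok(assigned)
}
```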

Filesystem integration: ZNS namespaces register with umka-block as zoned block devices. Zone-aware filesystems (F2FS, btrfs zoned mode) issue zone commands through the BlockDeviceOps interface:

  • BioOp::ZoneAppend maps to NVMe Zone Append (opcode 0x7D).
  • Zone management (open/close/finish/reset) is exposed via a separate zone_mgmt(&self, zone_slba: u64, action: ZoneAction) -> Result<()> method on BlockDeviceOps.
  • Zone report queries are exposed via report_zones(&self, start_lba: u64, buf: &mut [ZoneDescriptor]) -> Result<usize>.
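
A sketch of the ZoneAction-to-ZSA mapping behind zone_mgmt() (the variant set of `ZoneAction` is an assumption; the ZSA values follow the ZNS Command Set specification):

```rust
/// Zone management actions exposed to filesystems via `zone_mgmt()`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ZoneAction {
    Close,
    Finish,
    Open,
    Reset,
    Offline,
}

/// Map a ZoneAction to the Zone Send Action value placed in CDW13 of a
/// Zone Management Send (opcode 0x79) command.
pub fn zone_action_to_zsa(action: ZoneAction) -> u8 {
    match action {
        ZoneAction::Close   => 0x01,
        ZoneAction::Finish  => 0x02,
        ZoneAction::Open    => 0x03,
        ZoneAction::Reset   => 0x04,
        ZoneAction::Offline => 0x05,
    }
}
```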

15.19.14 NVMe-oF Fabrics Bridge

The local NVMe driver and the NVMe-oF subsystem (Section 15.13) share core abstractions:

Shared types: NvmeCommand (64-byte submission entry) and NvmeCompletion (16-byte completion entry) are the identical wire format for both local PCIe and fabric transports. The NvmeIoOpcode enum is transport-agnostic. Namespace identification (nsid: u32) is common to both paths.

NVMe-oF Target passthrough: When the NVMe-oF target operates in passthrough mode (exporting a local NVMe namespace to remote hosts), it submits NVMe commands received from the fabric directly to the local NVMe controller's I/O queues — bypassing the block layer entirely. The NvmeCommand from the fabric capsule is validated (opcode whitelist, NSID check, LBA bounds) and then placed in the local SQ. Completions are forwarded back through the fabric transport.

Unified namespace model: Both local and remote NVMe namespaces appear as BlockDevice instances in umka-block. The block layer, volume manager, and filesystems are agnostic to whether a namespace is local (PCIe) or remote (NVMe-oF/TCP, NVMe-oF/RDMA). The BlockDeviceInfo returned by each path reflects the true capabilities — local NVMe reports hardware FUA support while NVMe-oF/TCP does not (flush is required).

15.19.15 BlockDeviceOps Implementation

/// Per-namespace block device wrapper. One `NvmeBlockDevice` is created per
/// NVMe namespace discovered during controller initialization. Registered
/// with the block layer via `register_block_device()`. The block layer
/// holds `Arc<dyn BlockDeviceOps>` which points to this struct.
pub struct NvmeBlockDevice {
    /// Reference to the parent NVMe controller (shared across all namespaces).
    pub ctrl: Arc<NvmeController>,
    /// Namespace metadata (NSID, capacity, format, features).
    pub ns: NvmeNamespace,
    /// NUMA node closest to this controller's PCIe slot (for allocation affinity).
    pub numa_node: u32,
}

impl BlockDeviceOps for NvmeBlockDevice {
    fn submit_bio(&self, bio: &mut Bio) -> Result<()> {
        if self.ctrl.error_state.load(Acquire) != 0 {
            return Err(Error::IO); // Controller in error recovery
        }
        let queue_idx = arch::current::cpu::id() % self.ctrl.io_queues.len();
        // Each NvmeQueuePair is wrapped in SpinLock for interior mutability
        // (submit_bio takes &self; queue mutation needs &mut through &self).
        // Uncontended in the per-CPU case (~5-10 ns).
        let mut queue = self.ctrl.io_queues[queue_idx].lock();
        match bio.op {
            BioOp::Read | BioOp::Write => {
                nvme_submit_io(queue, bio, &self.ns)
            }
            BioOp::Flush => {
                if self.ctrl.volatile_write_cache {
                    // Allocate an inflight slot for the flush command so
                    // that the CQE completion handler can map the CID back
                    // to this bio and signal completion. Without this, the
                    // flush bio's StackWaiter would never be woken.
                    let cid = queue.alloc_cid()?;
                    queue.inflight[cid as usize] = Some(NvmeInflightCmd {
                        bio: bio as *mut Bio,
                        dma_map: None,
                        prp_list: None,
                        opcode: NvmeIoOpcode::Flush as u8,
                        nsid: self.ns.nsid,
                        retries: 0,
                    });
                    queue.submit_flush_cmd(self.ns.nsid, cid)
                } else {
                    // No volatile cache -- flush is a no-op. Signal
                    // completion immediately via the tier-aware path.
                    nvme_signal_completion(&queue, bio as *mut Bio, 0);
                    Ok(())
                }
            }
            BioOp::Discard => {
                if self.ns.supports_deallocate {
                    queue.submit_dsm_deallocate(bio, self.ns.nsid)
                } else {
                    Err(Error::NOSYS)
                }
            }
            BioOp::WriteZeroes => {
                queue.submit_write_zeroes(bio, self.ns.nsid)
            }
            BioOp::ZoneAppend => {
                if self.ns.zns.is_some() {
                    queue.submit_zone_append(bio, self.ns.nsid)
                } else {
                    Err(Error::NOSYS)
                }
            }
        }
    }

    fn flush(&self) -> Result<()> {
        if !self.ctrl.volatile_write_cache { return Ok(()); }
        let queue_idx = arch::current::cpu::id() % self.ctrl.io_queues.len();
        self.ctrl.io_queues[queue_idx].lock().submit_flush_sync(self.ns.nsid)
    }

    fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()> {
        if !self.ns.supports_deallocate { return Err(Error::NOSYS); }
        let queue_idx = arch::current::cpu::id() % self.ctrl.io_queues.len();
        self.ctrl.io_queues[queue_idx].lock()
            .submit_dsm_range(self.ns.nsid, start_lba, len_sectors)
    }

    fn get_info(&self) -> BlockDeviceInfo {
        BlockDeviceInfo {
            logical_block_size: self.ns.block_size,
            physical_block_size: self.ns.block_size, // NVMe: logical == physical
            capacity_sectors: self.ns.capacity_blocks
                * (self.ns.block_size as u64 / 512),
            max_segments: if self.ctrl.max_transfer_size > 0 {
                (self.ctrl.max_transfer_size / self.ctrl.page_size) as u16
            } else {
                256 // Default: 256 pages = 1MB at 4K pages
            },
            max_bio_size: self.ctrl.max_transfer_size,
            flags: {
                let mut f = BlockDeviceFlags::empty();
                if self.ns.supports_deallocate { f |= BlockDeviceFlags::DISCARD; }
                if self.ctrl.volatile_write_cache { f |= BlockDeviceFlags::FLUSH; }
                f |= BlockDeviceFlags::FUA; // NVMe always supports FUA (CDW12 bit 30)
                f
            },
            optimal_io_size: if self.ns.optimal_io_boundary > 0 {
                self.ns.optimal_io_boundary as u32 * self.ns.block_size
            } else {
                self.ns.block_size
            },
            numa_node: self.ctrl.numa_node,
        }
    }

    fn shutdown(&self) -> Result<()> {
        // Flush volatile write cache.
        self.flush()?;
        // Normal shutdown notification.
        let cc = self.ctrl.regs.read32(0x14); // CC register
        self.ctrl.regs.write32(0x14, (cc & !0xC000) | 0x4000); // SHN=01
        // Poll CSTS.SHST for shutdown complete (10b).
        let timeout = self.ctrl.cap.timeout as u64 * 500;
        poll_until(timeout, || {
            let csts = self.ctrl.regs.read32(0x1C);
            (csts >> 2) & 0x3 == 0x2
        })
    }
}

15.19.16 KABI Driver Manifest

[driver]
name = "nvme"
version = "1.0.0"
tier = 1
bus-type = "pci"

[match]
pci-class = "01:08:02"  # Mass Storage / NVM Express / NVM Express I/O Controller

[capabilities]
dma = true
interrupts = "msi-x"    # One vector per I/O queue + 1 admin; falls back to MSI, then INTx
max-memory = "16MB"      # SQ/CQ buffers + PRP list pools (scales with queue count)

[recovery]
crash-action = "reload"
state-preservation = true  # Replay in-flight bios on reload
max-reload-time-ms = 200

15.19.17 Design Decisions

| Decision | Rationale |
|----------|-----------|
| Tier 1 (not Tier 2) | NVMe is the primary storage path. ~2-5 μs per I/O on fast SSDs. Ring 3 crossing adds ~5-15 μs — doubling or tripling latency is unacceptable. |
| One I/O queue per CPU | NVMe SQs have no locking — each CPU writes to its own SQ tail doorbell without contention. This is the design NVMe was built for. |
| PRP over SGL | PRP is mandatory in the NVMe base spec; SGL is optional. PRP is simpler (array of page-aligned addresses) and sufficient for block I/O. SGL is used only for NVMe-oF passthrough where the fabric provides SGLs. |
| Pre-allocated PRP list pool | Each queue pre-allocates a pool of PRP list pages at init time. No heap allocation on the I/O hot path. Pool size = queue depth (one PRP list per possible in-flight command). |
| Conservative APST table | Aggressive power transitions cause latency spikes on some controllers. UmkaOS defaults to conservative thresholds (100ms/500ms/2s) and lets userspace tune via sysfs if needed. |
| MSI-X per queue | Per-queue interrupt vectors eliminate the CQ polling fan-out that single-vector modes require. On a 32-queue controller, a single MSI vector would force scanning all 32 CQs on every interrupt. |
| 30-second command timeout | NVMe spec does not define a command timeout. 30 seconds is the Linux default and covers worst-case garbage collection stalls on consumer SSDs. Configurable via the nvme_io_timeout_ms module parameter (default 30000). For consumer TLC/QLC workloads with heavy GC, operators may increase to 120000 (120s). |
| IOMMU mandatory for DMA | NVMe DMA without IOMMU means the controller can write to any physical address — a firmware bug or malicious device could corrupt kernel memory. IOMMU containment is a Tier 1 requirement. |

15.19.18 Error Recovery

NVMe error recovery handles both transient command failures and catastrophic controller events. The recovery sequence is modeled as a state machine:

/// NVMe controller error recovery states.
///
/// State machine: Normal → ErrorDetected → ControllerReset → QueueReinit → ReplayIo → Normal
///                                       ↘ Fatal (if reset fails)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NvmeRecoveryState {
    /// Normal operation — no error recovery in progress.
    Normal,
    /// Error detected; new I/O submissions blocked. In-flight commands
    /// await timeout or controller-reported failure.
    ErrorDetected,
    /// Controller reset in progress (CC.EN = 0 → wait CSTS.RDY = 0 → CC.EN = 1).
    ControllerReset,
    /// Re-creating admin and I/O queue pairs after reset.
    QueueReinit,
    /// Replaying in-flight I/O that was interrupted by the reset.
    ReplayIo,
    /// Unrecoverable: controller failed to reset within `CAP.TO * 500ms`.
    /// Block device returns `-EIO` for all subsequent I/O.
    Fatal,
}

/// Error sources that trigger recovery.
pub enum NvmeErrorSource {
    /// Completion entry with non-zero status code.
    CompletionError {
        /// CQE status field: bits [15:1] = status code, bit 0 = phase tag.
        status: u16,
        /// Command ID that failed.
        cid: u16,
        /// Queue pair that reported the error.
        qid: u16,
    },
    /// CSTS.CFS (Controller Fatal Status) bit set — controller firmware crash.
    ControllerFatalStatus,
    /// Command timeout: no completion received within `nvme_io_timeout_ms`.
    CommandTimeout { cid: u16, qid: u16 },
    /// PCIe AER (Advanced Error Reporting) event forwarded by the PCIe subsystem.
    /// Typical events: Uncorrectable Internal Error, Completion Timeout, Data Link
    /// Protocol Error, Poisoned TLP.
    AerEvent { severity: AerSeverity, error_type: u32 },
}

/// AER severity levels.
#[derive(Clone, Copy)]
pub enum AerSeverity {
    /// Correctable: logged, no recovery needed.
    Correctable,
    /// Non-fatal uncorrectable: device may still respond; attempt reset.
    NonFatal,
    /// Fatal: device is non-responsive; attempt reset with escalation to
    /// PCIe Function Level Reset (FLR) if the standard NVMe reset fails.
    Fatal,
}
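
The dispatch from error source to recovery action can be sketched as a pure decision function (a self-contained sketch with simplified copies of the enums above; the `RecoveryAction` split, including `FailBio` for non-retryable completion errors, is illustrative policy, not the final driver logic):

```rust
/// Recovery action chosen for a detected error (illustrative).
#[derive(Debug, PartialEq, Eq)]
pub enum RecoveryAction {
    /// Retry the single failed command (transient completion error).
    RetryCommand,
    /// Fail the originating bio (controller says Do Not Retry).
    FailBio,
    /// Full controller reset (fatal status, timeout, uncorrectable AER).
    ControllerReset,
    /// Log only; no recovery needed.
    LogOnly,
}

#[derive(Clone, Copy)]
pub enum AerSeverity { Correctable, NonFatal, Fatal }

pub enum NvmeErrorSource {
    CompletionError { status: u16, cid: u16, qid: u16 },
    ControllerFatalStatus,
    CommandTimeout { cid: u16, qid: u16 },
    AerEvent { severity: AerSeverity, error_type: u32 },
}

/// Decide the recovery action for an error source. In the u16 status layout
/// used here (bit 0 = phase tag, bits 15:1 = status code), the DNR
/// (Do Not Retry) bit is the top bit of the field.
pub fn recovery_action(err: &NvmeErrorSource) -> RecoveryAction {
    match err {
        NvmeErrorSource::CompletionError { status, .. } => {
            if status & 0x8000 == 0 { RecoveryAction::RetryCommand }
            else { RecoveryAction::FailBio }
        }
        NvmeErrorSource::ControllerFatalStatus => RecoveryAction::ControllerReset,
        NvmeErrorSource::CommandTimeout { .. } => RecoveryAction::ControllerReset,
        NvmeErrorSource::AerEvent { severity, .. } => match severity {
            AerSeverity::Correctable => RecoveryAction::LogOnly,
            _ => RecoveryAction::ControllerReset,
        },
    }
}
```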

Controller reset sequence (runs in the NVMe driver's recovery workqueue):

nvme_controller_reset(ctrl):
  1. Set ctrl.state = ControllerReset
  2. Block new submissions: set NVME_CTRL_RESETTING flag, drain KABI ring
  3. CC.EN = 0 (disable controller)
  4. Poll CSTS.RDY == 0 with timeout = CAP.TO * 500ms
     - If timeout: escalate to PCI FLR (pci_reset_function())
     - If FLR fails: ctrl.state = Fatal, return
  5. Re-configure CC (MPS, AMS, IOSQES, IOCQES)
  6. CC.EN = 1 (re-enable controller)
  7. Poll CSTS.RDY == 1 with timeout = CAP.TO * 500ms
  8. Re-create admin queue pair, issue Identify Controller
  9. Re-create I/O queue pairs (one per CPU), re-register interrupt vectors
  10. ctrl.state = ReplayIo
  11. For each in-flight I/O (tracked in per-queue command ID bitmap):
      - Re-submit the command (bio is preserved in the driver's shadow ring)
      - Original completion callback fires when the replayed command completes
  12. ctrl.state = Normal

Asynchronous Event Reporting (AER/AEN): The driver posts one AEN (Asynchronous Event Request) admin command at init time. The controller sends a completion when an asynchronous event occurs (error, SMART threshold, namespace change, firmware activation, etc.). On receipt, the driver logs the event, handles it (e.g., triggers controller reset for critical errors, re-scans namespaces for NS_ATTR_CHANGED), and immediately re-posts a new AEN command to receive the next event.

15.19.19 Autonomous Power State Transitions (APST)

NVMe controllers support multiple power states (PS0=highest performance through PS5=deepest sleep). APST allows the controller to autonomously transition to lower power states after configurable idle periods.

/// APST table entry (one per power state transition).
/// Written to the controller via Set Features (Feature 0x0C) as DMA data.
/// Multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
///
/// UmkaOS writes up to 32 entries (one per supported non-operational
/// power state). The controller transitions to the target state after
/// the specified idle time and transitions back to operational on any
/// new command submission.
#[repr(C)]
pub struct ApstEntry {
    /// Idle time threshold in microseconds before transitioning.
    /// 0 = disable this transition.
    pub idle_transition_us: Le32,
    /// Idle Transition Power State (ITPS) — target power state index.
    pub idle_transition_ps: u8,
    pub _reserved: [u8; 3],
}
// NVMe APST entry: idle_transition_us(4) + idle_transition_ps(1) + _reserved(3) = 8 bytes.
const_assert!(core::mem::size_of::<ApstEntry>() == 8);

/// Default APST table. Conservative thresholds suitable for server workloads.
/// Desktop/laptop use cases may use more aggressive values via sysfs.
pub const DEFAULT_APST: &[(u32, u8)] = &[
    // (idle_ms, target_power_state)
    (100,  1),   // After 100ms idle → PS1 (slightly reduced performance)
    (500,  2),   // After 500ms idle → PS2 (moderate power saving)
    (2000, 3),   // After 2s idle → PS3 (deep idle, 50-500ms exit latency)
];

APST configuration sequence (during controller init, after Identify Power State descriptors are parsed):

  1. Read Identify Controller power state descriptors (PS0-PS31).
  2. Filter: only non-operational states with exit_latency_us < apst_max_latency_us (sysctl nvme.apst_max_latency_us, default 25000 = 25ms).
  3. Build APST table from DEFAULT_APST, mapping each target state to the highest-numbered non-operational state whose exit latency is within the threshold.
  4. Issue Set Features (Feature 0x0C, APSTE=1) with the computed table.
  5. If the controller rejects APST (Invalid Field in Command), log a warning and continue without power management (some consumer NVMe controllers have buggy APST firmware).
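
Steps 2 and 3 can be sketched as a pure table-construction function (`PowerStateDesc` and `build_apst_table` are illustrative names; the real driver parses the 32-byte power state descriptor format from Identify Controller):

```rust
/// Simplified power state descriptor (subset of the Identify Controller
/// power state descriptor; field names are illustrative).
pub struct PowerStateDesc {
    pub non_operational: bool,
    pub exit_latency_us: u32,
}

/// Build (idle_time_us, target_ps) APST entries from the default policy.
/// Each default target maps to the highest-numbered eligible non-operational
/// state at or below the nominal target; entries with no eligible state
/// are dropped.
pub fn build_apst_table(
    psds: &[PowerStateDesc],
    defaults: &[(u32, u8)], // (idle_ms, nominal target) as in DEFAULT_APST
    max_latency_us: u32,
) -> Vec<(u32, u8)> {
    defaults
        .iter()
        .filter_map(|&(idle_ms, nominal)| {
            let target = psds
                .iter()
                .enumerate()
                .take(nominal as usize + 1)
                .rev()
                .find(|(_, p)| p.non_operational && p.exit_latency_us < max_latency_us)
                .map(|(i, _)| i as u8)?;
            Some((idle_ms * 1000, target)) // ms -> us for the ApstEntry field
        })
        .collect()
}
```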

Sysfs interface (/sys/class/nvme/nvmeN/power/):

| File | Description |
|------|-------------|
| pm_policy | default (APST) or none (disabled) |
| apst_max_latency_us | Maximum acceptable exit latency for APST transitions |
| power_state | Current power state (read-only, queried via Get Features) |

15.20 fscrypt — File-Level Encryption

fscrypt is the Linux filesystem-level encryption subsystem. It encrypts file contents and filenames on a per-directory policy basis, transparently to applications: userspace reads and writes cleartext; the kernel encrypts on writeback and decrypts on read-in. Supported backing filesystems include ext4, f2fs, and ubifs (Btrfs and XFS do not implement fscrypt hooks). UmkaOS implements fscrypt for Linux ABI compatibility and because it is required by Android file-based encryption (FBE), Chromebook disk encryption, and enterprise per-directory encryption workflows.

Reference specification: Linux kernel Documentation/filesystems/fscrypt.rst (canonical), include/uapi/linux/fscrypt.h (UAPI header).

Tier: Tier 0 (in-kernel, part of the VFS/filesystem path). fscrypt hooks execute inside the page cache read/write path and cannot be isolated behind a domain boundary without unacceptable latency on every I/O.

15.20.1 Encryption Policies

A directory is encrypted by setting an encryption policy on it (via ioctl) before any files are created inside it. All files and subdirectories created within an encrypted directory inherit the parent's policy. Two policy versions exist; V2 is required for new deployments.

/// fscrypt policy version 2 (FSCRYPT_POLICY_V2).
///
/// V1 policies (`FscryptPolicyV1`) are supported for backward compatibility
/// with existing Android and Chrome OS volumes but are deprecated: V1 uses
/// an ad-hoc AES-128-ECB KDF that is non-standard and reversible. All new
/// encrypted directories must use V2.
///
/// Matches the Linux `struct fscrypt_policy_v2` layout exactly (UAPI ABI).
#[repr(C)]
pub struct FscryptPolicyV2 {
    /// Policy version: always `2`.
    pub version: u8,
    /// Contents encryption mode ([`FscryptMode`]).
    pub contents_encryption_mode: u8,
    /// Filenames encryption mode ([`FscryptMode`]).
    pub filenames_encryption_mode: u8,
    /// Policy flags (bitwise OR of `FSCRYPT_POLICY_FLAG_*`).
    pub flags: u8,
    /// Log2 of the data unit size for contents encryption.
    /// 0 means the filesystem block size (default). Non-zero values
    /// allow sub-block encryption granularity (Linux 6.7+).
    pub log2_data_unit_size: u8,
    /// Reserved, must be zero.
    pub reserved: [u8; 3],
    /// Master key identifier: first 16 bytes of
    /// `HKDF-SHA512(master_key, info="fscrypt\0" || 0x01)`.
    /// Computed by the kernel on `FS_IOC_ADD_ENCRYPTION_KEY` and matched
    /// against the policy when unlocking.
    pub master_key_identifier: [u8; FSCRYPT_KEY_IDENTIFIER_SIZE],
}
// UAPI ABI: version(1)+contents(1)+filenames(1)+flags(1)+log2_du(1)+reserved(3)+key_id(16) = 24 bytes.
const_assert!(core::mem::size_of::<FscryptPolicyV2>() == 24);

/// Size of the master key identifier (bytes).
pub const FSCRYPT_KEY_IDENTIFIER_SIZE: usize = 16;

/// Size of the per-file nonce (bytes).
/// Not in Linux UAPI header; kernel-internal constant from fs/crypto/fscrypt_private.h.
pub const FSCRYPT_FILE_NONCE_SIZE: usize = 16;

/// Encryption mode constants.
///
/// Values match Linux `FSCRYPT_MODE_*` exactly (UAPI ABI).
#[repr(u8)]
pub enum FscryptMode {
    /// AES-256-XTS — contents encryption (default, recommended).
    /// 64-byte key (two 256-bit AES keys: one for data, one for tweak).
    Aes256Xts   = 1,
    /// AES-256-CTS-CBC — filenames encryption (default, recommended).
    /// 32-byte key. CTS (ciphertext stealing) handles non-block-aligned names.
    Aes256Cts   = 4,
    /// AES-128-CBC-ESSIV — legacy contents mode. Not recommended for new use.
    Aes128Cbc   = 5,
    /// AES-128-CTS-CBC — legacy filenames mode. Not recommended for new use.
    Aes128Cts   = 6,
    /// SM4-XTS — contents encryption (Chinese national standard, GM/T 0002).
    Sm4Xts      = 7,
    /// SM4-CTS-CBC — filenames encryption (GM/T 0002).
    Sm4Cts      = 8,
    /// Adiantum — both contents and filenames. Wide-block cipher built on
    /// XChaCha12 + AES-256 + NH + Poly1305. Designed for devices without
    /// AES hardware acceleration (low-end ARM, older RISC-V).
    Adiantum    = 9,
    /// AES-256-HCTR2 — contents encryption (wide-block, Linux 6.0+).
    /// Hash-Counter-Hash construction over AES-256 + XCTR + POLYVAL.
    /// Better semantic security than XTS for small data units.
    Aes256Hctr2 = 10,
}

/// Policy flag constants (UAPI ABI).
pub const FSCRYPT_POLICY_FLAG_DIRECT_KEY:      u8 = 0x04;
pub const FSCRYPT_POLICY_FLAG_IV_INO_LBLK_64:  u8 = 0x08;
pub const FSCRYPT_POLICY_FLAG_IV_INO_LBLK_32:  u8 = 0x10;

Valid mode combinations (enforced at FS_IOC_SET_ENCRYPTION_POLICY time):

| Contents mode | Filenames mode | Notes |
|---------------|----------------|-------|
| Aes256Xts (1) | Aes256Cts (4) | Default. Recommended for all platforms with AES-NI / ARMv8 CE / AES ISA. |
| Aes128Cbc (5) | Aes128Cts (6) | Legacy. V1 policies only. |
| Adiantum (9) | Adiantum (9) | No-AES-hardware path. Required for ARMv7 without CE, some RISC-V. |
| Sm4Xts (7) | Sm4Cts (8) | Chinese regulatory compliance (GM/T). |
| Aes256Hctr2 (10) | Aes256Cts (4) | Wide-block contents with standard filenames. |
| Aes256Hctr2 (10) | Aes256Hctr2 (10) | Wide-block for both contents and filenames. |

The kernel rejects any mode combination not in this table with EINVAL.
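
The whitelist check performed at FS_IOC_SET_ENCRYPTION_POLICY time reduces to a small predicate (a sketch; the mode numbers are the FscryptMode discriminants above):

```rust
/// Return true iff (contents_mode, filenames_mode) is an allowed pairing.
/// Any other combination is rejected with EINVAL at policy-set time.
pub fn valid_mode_pair(contents: u8, filenames: u8) -> bool {
    matches!(
        (contents, filenames),
        (1, 4)      // AES-256-XTS + AES-256-CTS-CBC (default)
        | (5, 6)    // AES-128-CBC-ESSIV + AES-128-CTS-CBC (legacy, V1 only)
        | (7, 8)    // SM4-XTS + SM4-CTS-CBC
        | (9, 9)    // Adiantum + Adiantum
        | (10, 4)   // AES-256-HCTR2 + AES-256-CTS-CBC
        | (10, 10)  // AES-256-HCTR2 + AES-256-HCTR2
    )
}
```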

15.20.2 Key Derivation

fscrypt V2 uses HKDF-SHA512 (Section 10.1) for all key derivation. The master key is the HKDF input keying material (IKM); no salt is used. Different application-specific info strings (the HKDF info parameter) produce distinct derived keys:

Key identifier:
  info = "fscrypt\0" || 0x01
  → 16-byte identifier stored in FscryptPolicyV2.master_key_identifier

Per-file encryption key (default, no DIRECT_KEY flag):
  info = "fscrypt\0" || 0x02 || file_nonce[16]
  → One unique key per file. file_nonce is random, stored in xattr.

Per-mode encryption key (DIRECT_KEY flag set):
  info = "fscrypt\0" || 0x03 || mode_number[1]
  → One key per (master_key, mode) pair. File nonce mixed into IV instead.

Per-mode IV_INO_LBLK_64 key:
  info = "fscrypt\0" || 0x04 || mode_number[1]
  → Inode number and block index combined into a 64-bit IV.

Dirhash key (for case-insensitive/casefolded directories):
  info = "fscrypt\0" || 0x05 || file_nonce[16]
  → SipHash-2-4 key for directory entry hashing.

Per-mode IV_INO_LBLK_32 key:
  info = "fscrypt\0" || 0x06 || mode_number[1]
  → 32-bit inode+block IV for hardware with limited IV width.

All context bytes (0x01..0x06) are reserved by the fscrypt specification. UmkaOS must not redefine them.
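
The info strings can be assembled with a small helper (a sketch; the constants mirror the fscrypt HKDF context bytes listed above, and the result would be fed to HKDF-SHA512 from Section 10.1):

```rust
/// HKDF application-context bytes defined by the fscrypt specification.
pub const HKDF_CONTEXT_KEY_IDENTIFIER: u8 = 0x01;
pub const HKDF_CONTEXT_PER_FILE_ENC_KEY: u8 = 0x02;
pub const HKDF_CONTEXT_DIRECT_KEY: u8 = 0x03;
pub const HKDF_CONTEXT_IV_INO_LBLK_64_KEY: u8 = 0x04;
pub const HKDF_CONTEXT_DIRHASH_KEY: u8 = 0x05;
pub const HKDF_CONTEXT_IV_INO_LBLK_32_KEY: u8 = 0x06;

/// Build the HKDF `info` parameter: "fscrypt\0" || context || extra.
/// `extra` is the 16-byte file nonce or the 1-byte mode number depending
/// on the context (empty for the key identifier).
pub fn fscrypt_hkdf_info(context: u8, extra: &[u8]) -> Vec<u8> {
    let mut info = Vec::with_capacity(8 + 1 + extra.len());
    info.extend_from_slice(b"fscrypt\0"); // 8 bytes including the NUL
    info.push(context);
    info.extend_from_slice(extra);
    info
}
```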

15.20.2.1 Master Key Lifecycle

  1. Userspace provides the raw master key via FS_IOC_ADD_ENCRYPTION_KEY — either inline in the ioctl argument or by reference to a fscrypt-provisioning key previously added with sys_add_key() (see Section 10.2 for the formal syscall definition). The legacy V1 path instead places a logon key named fscrypt:<descriptor> in the session keyring via the generic keyring syscalls; V2 policies use the dedicated ioctl.
  2. The kernel derives the 16-byte key identifier via HKDF and stores the master key in the filesystem-level keyring (not the user session keyring — V2 improvement).
  3. When an encrypted inode is opened, the kernel matches the inode's master_key_identifier against the keyring, derives the per-file key, and caches the derived key in the in-core FscryptInfo attached to the inode.
  4. On FS_IOC_REMOVE_ENCRYPTION_KEY, derived keys are zeroized and evicted. Inodes with cached keys are marked stale; subsequent I/O returns ENOKEY.

15.20.2.2 Per-File Context (On-Disk)

/// Per-file fscrypt context stored in the inode's encryption xattr.
///
/// For ext4: xattr name `c` in the `system.` namespace (index 9).
/// For f2fs: stored in the inode's `i_extra` area.
/// For ubifs: stored as an extended attribute.
///
/// Matches Linux `struct fscrypt_context_v2` layout exactly (kernel-internal).
#[repr(C)]
pub struct FscryptContextV2 {
    /// Context version: `2`.
    pub version: u8,
    /// Contents encryption mode.
    pub contents_encryption_mode: u8,
    /// Filenames encryption mode.
    pub filenames_encryption_mode: u8,
    /// Policy flags.
    pub flags: u8,
    /// Log2 of data unit size (0 = filesystem block size).
    pub log2_data_unit_size: u8,
    /// Reserved, must be zero.
    pub reserved: [u8; 3],
    /// Master key identifier (16 bytes).
    pub master_key_identifier: [u8; FSCRYPT_KEY_IDENTIFIER_SIZE],
    /// Random per-file nonce generated at inode creation time.
    /// Used as HKDF input (per-file key mode) or as IV tweak (DIRECT_KEY mode).
    pub nonce: [u8; FSCRYPT_FILE_NONCE_SIZE],
}
// On-disk format: version(1)+contents(1)+filenames(1)+flags(1)+log2_du(1)+reserved(3)+key_id(16)+nonce(16) = 40 bytes.
const_assert!(core::mem::size_of::<FscryptContextV2>() == 40);

15.20.3 Ioctls

All ioctls use magic number 'f' (0x66). Values match Linux UAPI exactly.

| Ioctl | Direction | Nr | Arg type | Description |
|-------|-----------|----|----------|-------------|
| FS_IOC_SET_ENCRYPTION_POLICY | _IOR | 19 | fscrypt_policy_v1 | Set encryption policy on an empty directory. V2 policies are passed via the same ioctl with version=2 in the struct. Returns ENOTEMPTY if the directory is non-empty, EEXIST if a policy is already set. |
| FS_IOC_GET_ENCRYPTION_POLICY_EX | _IOWR | 22 | fscrypt_get_policy_ex_arg | Get encryption policy (V1 or V2) with version discrimination. |
| FS_IOC_ADD_ENCRYPTION_KEY | _IOWR | 23 | fscrypt_add_key_arg | Add master key to the filesystem keyring. Derives and stores the key identifier. Any user may add keys; the key is ref-counted per-user. |
| FS_IOC_REMOVE_ENCRYPTION_KEY | _IOWR | 24 | fscrypt_remove_key_arg | Remove the calling user's claim on a master key. When the last user removes, derived keys are wiped and inodes evicted. |
| FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS | _IOWR | 25 | fscrypt_remove_key_arg | Force-remove for all users. Requires CAP_SYS_ADMIN. |
| FS_IOC_GET_ENCRYPTION_KEY_STATUS | _IOWR | 26 | fscrypt_get_key_status_arg | Query whether a master key is present, absent, or incompletely removed. |
| FS_IOC_GET_ENCRYPTION_NONCE | _IOR | 27 | [u8; 16] | Retrieve the file's 16-byte encryption nonce (for backup/restore tooling). |

All ioctls can be issued on any file or directory on the target filesystem; the filesystem root directory is the conventional target. FS_IOC_SET_ENCRYPTION_POLICY must target the directory to be encrypted.

15.20.4 I/O Path Integration

15.20.4.1 Read Path

  1. VFS read() dispatches to the filesystem's readpage() / readahead().
  2. The filesystem reads ciphertext blocks from disk into the page cache.
  3. fscrypt_decrypt_pagecache_blocks() decrypts the page in-place using the per-file key cached in FscryptInfo. The decryption transform is obtained from Section 10.1 (skcipher for AES-XTS/AES-CTS, lskcipher for Adiantum/HCTR2).
  4. The cleartext page is returned to userspace.

If the master key is absent (not added or removed), the filesystem returns ENOKEY on open() for files requiring content decryption. Directory listing is still possible (filenames are shown as no-key names — see below).

15.20.4.2 Write Path

  1. VFS write() copies cleartext data into page cache pages.
  2. On writeback, fscrypt_encrypt_pagecache_blocks() allocates a bounce page, encrypts the cleartext page into the bounce page, and submits the bounce page to the block layer.
  3. The original cleartext page remains in the page cache for subsequent reads (no redundant decryption).
  4. The bounce page is freed after I/O completion.

Memory allocation: Bounce pages are drawn from a dedicated mempool (fscrypt_bounce_page_pool) to guarantee forward progress under memory pressure. The pool is sized at FSCRYPT_BOUNCE_POOL_SIZE (default: 32 pages, configurable via the boot parameter fscrypt.bounce_pool_size); the fixed size avoids CPU hotplug sensitivity. Allocation uses GFP_NOFS to avoid filesystem re-entry deadlock.

Backpressure on exhaustion: When all pool pages are in use, mempool_alloc() blocks the calling writeback thread until a bounce page is freed by I/O completion. This naturally throttles concurrent writebacks to the pool size. The mempool guarantee means allocation never fails — it may block indefinitely, but forward progress is assured because in-flight bounce pages are freed on I/O completion, which runs in softirq/workqueue context, independent of the blocked writeback thread.
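
The block/wake discipline can be sketched with standard synchronization primitives (an illustrative userspace model using std; the kernel uses its own mempool primitive, and `BouncePool` here tracks only slot counts, not actual pages):

```rust
use std::sync::{Condvar, Mutex};

/// Fixed-size blocking pool of bounce-page slots. `alloc` blocks until a
/// slot is free; `free` returns a slot and wakes one blocked waiter.
pub struct BouncePool {
    free_slots: Mutex<usize>,
    cv: Condvar,
}

impl BouncePool {
    pub fn new(size: usize) -> Self {
        BouncePool { free_slots: Mutex::new(size), cv: Condvar::new() }
    }

    /// Take one slot, blocking while the pool is exhausted. Never fails —
    /// this is the throttling behavior described above.
    pub fn alloc(&self) {
        let mut free = self.free_slots.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
    }

    /// Return a slot (called from I/O completion) and wake one waiter.
    pub fn free(&self) {
        let mut free = self.free_slots.lock().unwrap();
        *free += 1;
        self.cv.notify_one();
    }

    /// Slots currently available (for observation/testing).
    pub fn available(&self) -> usize {
        *self.free_slots.lock().unwrap()
    }
}
```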

15.20.4.3 Filename Encryption

Directory entries on disk store encrypted filenames. The kernel translates between encrypted and cleartext forms:

  • fscrypt_fname_disk_to_usr(): Decrypts an on-disk filename for readdir() and lookup(). When the key is present, the cleartext name is returned. When the key is absent, a no-key name is returned: the ciphertext encoded as base64url (RFC 4648 section 5, no padding), prefixed with _ if the name would otherwise start with . (to avoid hiding entries in directory listings).
  • fscrypt_fname_usr_to_disk(): Encrypts a cleartext filename for create(), rename(), link(), and unlink(). Requires the key to be present; returns ENOKEY otherwise.

Filename encryption uses AES-256-CTS-CBC (or Adiantum, or AES-256-HCTR2 depending on policy). CTS handles names whose length is not a multiple of the AES block size without padding, preserving the original name length in the directory entry.

15.20.4.4 IV Construction

The IV (initialisation vector) layout varies by policy flag (all fields little-endian):

  • Default (per-file key): data_unit_index[8] || zeros[8]
  • DIRECT_KEY: data_unit_index[8] || file_nonce[16] — 24 bytes total; AES-XTS/CTS use only the first 16 bytes, Adiantum/HCTR2 use all 24
  • IV_INO_LBLK_64: data_unit_index[4] || inode_number[4] || zeros[8]
  • IV_INO_LBLK_32: (hash(inode_number) + data_unit_index mod 2^32)[4] || zeros[12]
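The layouts above translate mechanically into code. A sketch, assuming a 24-byte IV buffer of which the 16-byte modes consume a prefix (the enum and `build_iv` are illustrative names):

```rust
/// Illustrative policy selector for IV construction.
enum IvPolicy {
    Default,
    DirectKey { file_nonce: [u8; 16] },
    IvInoLblk64 { inode_number: u32 },
    IvInoLblk32 { hashed_inode: u32 }, // hash(inode_number), precomputed
}

/// Build the IV for one data unit. Returns a 24-byte buffer; AES-XTS/CTS
/// use only the first 16 bytes, Adiantum/HCTR2 use all 24.
fn build_iv(policy: &IvPolicy, data_unit_index: u64) -> [u8; 24] {
    let mut iv = [0u8; 24]; // unused tail stays zero
    match policy {
        IvPolicy::Default => {
            // data_unit_index[8] || zeros[8]
            iv[..8].copy_from_slice(&data_unit_index.to_le_bytes());
        }
        IvPolicy::DirectKey { file_nonce } => {
            // data_unit_index[8] || file_nonce[16]
            iv[..8].copy_from_slice(&data_unit_index.to_le_bytes());
            iv[8..24].copy_from_slice(file_nonce);
        }
        IvPolicy::IvInoLblk64 { inode_number } => {
            // data_unit_index[4] || inode_number[4] || zeros[8]
            iv[..4].copy_from_slice(&(data_unit_index as u32).to_le_bytes());
            iv[4..8].copy_from_slice(&inode_number.to_le_bytes());
        }
        IvPolicy::IvInoLblk32 { hashed_inode } => {
            // (hash(inode_number) + data_unit_index mod 2^32)[4] || zeros[12]
            let v = hashed_inode.wrapping_add(data_unit_index as u32);
            iv[..4].copy_from_slice(&v.to_le_bytes());
        }
    }
    iv
}

fn main() {
    let iv = build_iv(&IvPolicy::Default, 5);
    assert_eq!(iv[0], 5);
}
```

Note how IV_INO_LBLK_32 deliberately collapses the IV to 32 bits: this allows hardware keyslots that only support 32-bit IVs (common on eMMC inline crypto) at the cost of IV reuse across files, which the policy accepts.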

15.20.5 Inline Crypto Engine (ICE) Support

Modern SoCs include inline encryption hardware that encrypts/decrypts data in the storage controller's DMA path, eliminating CPU-side crypto overhead entirely. UmkaOS integrates with this hardware through the blk-crypto framework (Section 15.2).

When inline crypto is available and supports the requested mode, fscrypt attaches a BlkCryptoKey to the Bio instead of performing software encryption. The block layer programs the key into a hardware keyslot and the storage controller encrypts/decrypts transparently. If the hardware does not support the requested mode (or no inline crypto hardware is present), blk-crypto falls back to software encryption automatically — no filesystem or fscrypt code change is needed.

Hardware-wrapped keys (Linux 6.15+): On SoCs that support it (Qualcomm ICE, Samsung FMP), the master key can be provided in hardware-wrapped form. The hardware unwraps the key internally and programs the derived inline encryption key into a keyslot without ever exposing it to software. This provides defense-in-depth: even a kernel compromise cannot extract the raw encryption key.

15.20.5.1 Per-Architecture Inline Crypto Availability

  • x86-64: ICE hardware rare (some Intel platforms with IBECC). Primarily software path; AES-NI provides fast software AES-XTS (~2 cycles/byte).
  • AArch64: ICE common (Qualcomm ICE, Samsung FMP, MediaTek UFS inline crypto). Standard on mobile/embedded SoCs; hardware-wrapped key support on Qualcomm SM8x50+.
  • ARMv7: limited (older Qualcomm ICE on 32-bit SoCs). Adiantum mode recommended when AES CE is absent.
  • RISC-V: no known ICE hardware (as of 2026). Software path only; Adiantum recommended for devices without scalar AES extensions.
  • PPC32: no ICE hardware. Software path only.
  • PPC64LE: no ICE hardware. Software path only; POWER9/10 AES instructions provide adequate software throughput.

15.20.6 Filesystem Integration Points

Each filesystem that supports fscrypt must implement a set of hooks. These are not a separate trait; they are woven into the existing InodeOps and FileOps implementations (Section 14.1).

  • Inode creation: generate a random 16-byte nonce, store FscryptContextV2 as an xattr.
  • Inode load: read FscryptContextV2 from the xattr, call fscrypt_get_encryption_info() to derive/cache keys.
  • readpage / readahead: read ciphertext from disk, call fscrypt_decrypt_pagecache_blocks().
  • Writeback: call fscrypt_encrypt_pagecache_blocks() to produce bounce pages.
  • lookup / readdir: decrypt filenames via fscrypt_fname_disk_to_usr().
  • create / rename / link: encrypt filenames via fscrypt_fname_usr_to_disk().
  • statfs: no change (encrypted and unencrypted data occupy the same space).

Per-filesystem xattr storage:

  • ext4: FscryptContextV2 stored as an xattr with the encryption xattr index (9) and name c. Retrieved during inode read-in from the inode's xattr area.
  • f2fs: Stored in the inode's i_extra inline area (not a separate xattr block), avoiding an extra disk read for key setup.
  • ubifs: Stored as a standard extended attribute on the inode.

15.20.7 Crypto Backend Integration

fscrypt uses the Section 10.1 algorithm registry exclusively. It never calls hardware crypto instructions directly.

Algorithm allocation (performed once per master key, cached):

  • Key derivation: hmac(sha512) -- Shash
  • AES-256-XTS content: xts(aes) -- Skcipher
  • AES-256-CTS-CBC filenames: cts(cbc(aes)) -- Skcipher
  • AES-128-CBC-ESSIV content: essiv(cbc(aes),sha256) -- Skcipher
  • Adiantum (both): adiantum(xchacha12,aes) -- Skcipher
  • AES-256-HCTR2 (both): hctr2(aes) -- Skcipher
  • SM4-XTS content: xts(sm4) -- Skcipher
  • SM4-CTS-CBC filenames: cts(cbc(sm4)) -- Skcipher
  • Dirhash: siphash24 -- Shash

Hardware-accelerated implementations (AES-NI on x86-64, ARMv8 CE on AArch64, etc.) are selected automatically by the crypto API's priority-based dispatch. No fscrypt-specific code is needed to prefer hardware acceleration.

Key zeroization: When a master key is removed (FS_IOC_REMOVE_ENCRYPTION_KEY), all derived keys cached in FscryptInfo structs are overwritten with zeros (memzero_explicit) before the memory is freed. The master key in the filesystem keyring is similarly zeroized. This limits the window during which key material is resident in kernel memory.

GFP flags: All crypto allocations within the fscrypt I/O path use GFP_NOFS to prevent deadlock from re-entrant filesystem calls during memory reclaim.

15.20.8 In-Core State

/// Per-inode fscrypt state, allocated when an encrypted inode is first accessed
/// with a valid master key. Attached to the in-core inode and freed when the
/// inode is evicted from the inode cache.
pub struct FscryptInfo {
    /// Encryption mode for file contents.
    pub contents_mode: FscryptMode,
    /// Encryption mode for filenames (meaningful only for directory inodes).
    pub filenames_mode: FscryptMode,
    /// Policy flags from the inode's `FscryptContextV2`.
    pub flags: u8,
    /// Derived per-file encryption key (zeroized on drop).
    /// For DIRECT_KEY mode: this is the per-mode key, shared across files.
    pub contents_key: ZeroizingKey,
    /// Derived filenames encryption key (directories only; zeroized on drop).
    pub filenames_key: Option<ZeroizingKey>,
    /// Allocated crypto transform for contents encryption.
    pub contents_tfm: SkcipherHandle,
    /// Allocated crypto transform for filenames encryption (directories only).
    pub filenames_tfm: Option<SkcipherHandle>,
    /// The file's 16-byte nonce (copied from `FscryptContextV2`).
    pub nonce: [u8; FSCRYPT_FILE_NONCE_SIZE],
    /// Reference to the master key entry in the filesystem keyring.
    /// Prevents the master key from being fully removed while this inode
    /// is still in use.
    pub master_key_ref: Arc<FscryptMasterKey>,
    /// For inline crypto: the `BlkCryptoKey` prepared for hardware keyslot
    /// programming. `None` if software encryption is used.
    pub blk_crypto_key: Option<BlkCryptoKey>,
}

ZeroizingKey is a wrapper around ArrayVec<u8, 64> that implements Drop by calling memzero_explicit on the key material. It must never implement Clone.
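The drop behavior can be sketched in std Rust. Volatile writes are what prevents the compiler from eliding a scrub of memory that is never read again — this sketch uses a fixed array in place of ArrayVec and `write_volatile` in place of memzero_explicit:

```rust
use std::ptr;

/// Sketch of ZeroizingKey: key material that is scrubbed before the
/// backing memory is released. Deliberately does NOT implement Clone,
/// so key bytes cannot be silently duplicated.
struct ZeroizingKey {
    key: [u8; 64], // stand-in for ArrayVec<u8, 64>
    len: usize,    // number of valid key bytes
}

impl ZeroizingKey {
    /// Overwrite all key bytes with zeros. write_volatile is not
    /// optimized away even though the value is never read afterwards
    /// (the property memzero_explicit provides in the kernel).
    fn scrub(&mut self) {
        for b in self.key.iter_mut() {
            unsafe { ptr::write_volatile(b, 0) };
        }
    }
}

impl Drop for ZeroizingKey {
    fn drop(&mut self) {
        self.scrub();
    }
}

fn main() {
    let mut k = ZeroizingKey { key: [0xAA; 64], len: 32 };
    assert_eq!(k.len, 32);
    k.scrub(); // normally implicit via Drop
    assert!(k.key.iter().all(|&b| b == 0));
}
```

A plain `self.key = [0; 64]` would be semantically identical but is legally removable by dead-store elimination; the volatile loop is the load-bearing detail.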

15.20.9 Security Considerations

  • Threat model: fscrypt protects data at rest. It does not protect against a running kernel compromise (the kernel holds derived keys in memory). For protection against a compromised kernel, use confidential computing (Section 9.7).
  • Key scrubbing: Derived keys are zeroized on removal, but the master key may persist in userspace memory (e.g., a PAM module or key agent). UmkaOS cannot control userspace key hygiene.
  • No authenticated encryption: fscrypt uses unauthenticated encryption modes (XTS, CTS-CBC). An attacker with physical disk access can modify ciphertext without detection (bit-flipping attacks). Filesystem metadata checksums (ext4 metadata_csum, Btrfs checksums) detect some corruption but do not provide cryptographic authentication. For authenticated at-rest protection, use dm-crypt with AEAD (dm-integrity + dm-crypt) or full-disk authenticated encryption at the block layer.
  • Filename length leakage: Encrypted filenames preserve the original length (CTS does not pad). An attacker can observe filename lengths on the encrypted volume. This is a known and accepted trade-off (padding would break directory entry size constraints).

15.20.10 Cross-References

  • Section 10.1 -- underlying algorithm registry and hardware dispatch
  • Section 15.6 -- implements fscrypt hooks (ext4_encrypt_page, ext4_decrypt_page)
  • Section 15.7, Section 15.8 -- do NOT implement fscrypt hooks (noted here for completeness; future integration is Phase 4+)
  • Section 14.16 -- fscrypt context stored as inode xattr
  • Section 9.7 -- complementary protection (fscrypt = at-rest, CC = in-use)
  • Section 15.2 -- blk-crypto inline encryption framework
  • Section 14.1 -- VFS read/write hooks, InodeOps, FileOps
  • Section 10.2 -- filesystem keyring integration

15.21 SMB Server (ksmbd)

ksmbd is UmkaOS's in-kernel SMB server, providing high-performance SMB file sharing without the overhead of running Samba as a userspace daemon. Originally merged into Linux 5.15, the ksmbd architecture splits work between the kernel (data path: read, write, directory enumeration, oplock/lease management) and a lightweight userspace helper (ksmbd.mountd, for authentication and configuration parsing). UmkaOS supports the SMB 2.1, 3.0, 3.0.2, and 3.1.1 dialects — sufficient for all modern Windows, macOS, and Linux CIFS clients.

Use cases: Windows interoperability (file sharing with unmodified Windows 10/11 clients), NAS appliances, Samba-compatible file servers, container-based file gateways.

Tier: Tier 1 (in-kernel, hardware domain isolated). The SMB data path executes entirely in Ring 0 within a Tier 1 isolation domain; the authentication and configuration plane is delegated to the ksmbd.mountd userspace daemon via a netlink IPC channel.

15.21.1 Server State

/// ksmbd server state -- one instance per SMB listener.
pub struct KsmbdServer {
    /// Listening TCP socket (port 445).
    pub listener: Arc<TcpListener>,
    /// Active sessions, keyed by session ID (u64).
    pub sessions: RwLock<XArray<Arc<SmbSession>>>,
    /// Share configuration (loaded from ksmbd.conf via ksmbd.mountd).
    /// Updated via RCU: readers (SMB request handlers) never block.
    pub shares: RcuVec<SmbShare>,
    /// Global server GUID (randomly generated at first start, persisted
    /// across restarts in `/etc/ksmbd/server_guid`).
    pub server_guid: [u8; 16],
    /// Supported dialects, ordered by preference (highest first).
    pub dialects: ArrayVec<SmbDialect, 4>,
    /// Server capabilities advertised in SMB2 NEGOTIATE response.
    pub capabilities: SmbServerCapabilities,
    /// Worker thread pool for request processing. One thread per
    /// concurrent SMB connection; threads are spawned on accept and
    /// exit when the connection closes.
    pub worker_pool: WorkerPool,
    /// IPC transport to ksmbd.mountd (userspace helper for auth + config).
    pub mountd_ipc: NetlinkSocket,
}

15.21.2 SMB Session

Each authenticated client connection produces one SmbSession. Sessions are independent: a client may establish multiple sessions (e.g., one per user credential) over the same or different TCP connections.

pub struct SmbSession {
    /// Session ID (unique per server, assigned at session setup).
    pub session_id: u64,
    /// Authenticated user credentials (resolved by ksmbd.mountd).
    pub user: SmbUser,
    /// Session key (derived from authentication exchange). 16 bytes per MS-SMB2.
    pub session_key: Zeroizing<[u8; 16]>,
    /// Signing key (KDF from session key per MS-SMB2 §3.1.4.2). 16 bytes.
    pub signing_key: Zeroizing<[u8; 16]>,
    /// Encryption keys (SMB 3.0+). `None` if encryption not negotiated.
    /// 32 bytes to support AES-256-CCM and AES-256-GCM (negotiated via
    /// `SMB2_ENCRYPTION_CAPABILITIES`). The KDF produces 32 bytes for
    /// AES-256; AES-128 ciphers use only the first 16 bytes of the buffer.
    pub encrypt_key: Option<Zeroizing<[u8; 32]>>,
    pub decrypt_key: Option<Zeroizing<[u8; 32]>>,
    /// Tree connections (mounted shares), keyed by tree ID (u32).
    pub tree_connects: SpinLock<XArray<Arc<SmbTreeConnect>>>,
    /// Open file handles, keyed by volatile file ID (u64).
    pub open_files: SpinLock<XArray<Arc<SmbOpenFile>>>,
    /// Connection transport (TCP or RDMA).
    pub transport: SmbTransport,
    /// Negotiated dialect for this session.
    pub dialect: SmbDialect,
    /// Session lifecycle state (stored as `SmbSessionState as u8`).
    /// Use `session_state()` / `set_session_state()` typed accessors
    /// instead of raw AtomicU8 operations to avoid u8-to-enum mismatch bugs.
    pub state: AtomicU8,
}

/// Session lifecycle states.
#[repr(u8)]
pub enum SmbSessionState {
    /// Negotiate complete, session setup in progress.
    InProgress = 0,
    /// Fully authenticated and active.
    Valid = 1,
    /// Session expired (idle timeout or explicit logoff).
    Expired = 2,
}

impl SmbSession {
    /// Read the current session state as a typed enum.
    /// Returns `SmbSessionState::Expired` for any unrecognized value
    /// (defensive — treats corruption as expired to prevent use of
    /// an invalid session).
    pub fn session_state(&self) -> SmbSessionState {
        match self.state.load(Acquire) {
            0 => SmbSessionState::InProgress,
            1 => SmbSessionState::Valid,
            _ => SmbSessionState::Expired,
        }
    }

    /// Set the session state atomically.
    pub fn set_session_state(&self, s: SmbSessionState) {
        self.state.store(s as u8, Release);
    }
}

15.21.3 Dialect Negotiation

/// SMB protocol dialects supported by UmkaOS ksmbd.
#[repr(u16)]
pub enum SmbDialect {
    /// SMB 2.1 (Windows 7 / Server 2008 R2).
    Smb21  = 0x0210,
    /// SMB 3.0 (Windows 8 / Server 2012). Adds multichannel, encryption.
    Smb30  = 0x0300,
    /// SMB 3.0.2 (Windows 8.1 / Server 2012 R2).
    Smb302 = 0x0302,
    /// SMB 3.1.1 (Windows 10+ / Server 2016+). Adds pre-auth integrity,
    /// AES-256, compression. Preferred dialect.
    Smb311 = 0x0311,
}

The client sends SMB2 NEGOTIATE with a list of supported dialects. The server selects the highest common dialect. If no common dialect exists, the server returns STATUS_NOT_SUPPORTED and closes the connection. SMB 1.0/CIFS is deliberately not supported — it has known security vulnerabilities (EternalBlue, MS17-010) and no modern client requires it.
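Because the dialect wire values are monotonically ordered (0x0210 < 0x0300 < 0x0302 < 0x0311), highest-common selection reduces to an intersection plus max. A sketch (`negotiate` is an illustrative helper operating on raw wire values):

```rust
/// Select the highest dialect revision present in both lists.
/// Returns None when no common dialect exists, in which case the server
/// responds STATUS_NOT_SUPPORTED and closes the connection.
fn negotiate(server: &[u16], client: &[u16]) -> Option<u16> {
    server
        .iter()
        .filter(|d| client.contains(d)) // dialects both sides speak
        .copied()
        .max() // wire values are ordered, so max() = newest revision
}

fn main() {
    // Server preference list as in the SmbDialect enum above.
    let server = [0x0210, 0x0300, 0x0302, 0x0311];
    // A Windows 10 client offering 3.1.1 and 3.0.2 gets 3.1.1.
    assert_eq!(negotiate(&server, &[0x0311, 0x0302]), Some(0x0311));
    // A client offering only SMB 2.0.2 (0x0202) has no common dialect.
    assert_eq!(negotiate(&server, &[0x0202]), None);
}
```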

15.21.3.1 SMB 3.1.1 Negotiate Context

SMB 3.1.1 extends NEGOTIATE with typed negotiate contexts:

  • Pre-authentication integrity (SMB2_PREAUTH_INTEGRITY_CAPABILITIES): SHA-512 hash chain of all negotiate messages. The pre-auth integrity hash is carried into the session setup exchange, binding the authenticated session to the specific negotiate sequence and preventing downgrade attacks.
  • Encryption (SMB2_ENCRYPTION_CAPABILITIES): ordered cipher preference list. UmkaOS supports AES-128-CCM (mandatory per MS-SMB2), AES-128-GCM, AES-256-CCM, and AES-256-GCM. The server selects the first client-offered cipher it supports.
  • Compression (SMB2_COMPRESSION_CAPABILITIES): LZ77, LZ77+Huffman, LZNT1, Pattern_V1. Compression is optional and negotiated per-connection.
  • Signing (SMB2_SIGNING_CAPABILITIES): AES-128-CMAC (SMB 3.0/3.0.2) or AES-128-GMAC (SMB 3.1.1, preferred for hardware acceleration via AES-NI).

15.21.4 Share Configuration

/// One exported SMB share.
pub struct SmbShare {
    /// Share name as visible to clients (e.g., "public", "homes").
    pub name: KString,
    /// Local filesystem path (must be an absolute path to a mounted directory).
    pub path: KString,
    /// Share type (disk, printer, or IPC).
    pub share_type: SmbShareType,
    /// Maximum access mask the server will grant on this share.
    pub max_access: SmbAccessMask,
    /// Read-only flag. When true, all write operations return STATUS_ACCESS_DENIED.
    pub read_only: bool,
    /// Allow guest (unauthenticated) access. Default: false.
    pub guest_ok: bool,
    /// Visible in SMB network neighborhood enumeration. Default: true.
    pub browseable: bool,
    /// Oplocks enabled on this share. Default: true.
    pub oplocks: bool,
    /// Per-share encryption required. When true, unencrypted sessions cannot
    /// access this share (returns STATUS_ACCESS_DENIED). SMB 3.0+ only.
    pub encrypt: bool,
}

#[repr(u32)]
pub enum SmbShareType {
    /// Disk share (file/directory access).
    Disk    = 0x00,
    /// Printer share.
    Printer = 0x01,
    /// IPC share (named pipes for inter-process communication).
    Ipc     = 0x02,
}

Share configuration is loaded from /etc/ksmbd/ksmbd.conf by ksmbd.mountd and pushed to the kernel via the netlink IPC channel. The kernel stores shares in an RcuVec so that SMB request handlers can look up shares without acquiring any lock. Configuration reloads (triggered by ksmbd.mountd --reload) replace the entire share list via an RCU update; in-progress tree connects on the old configuration complete normally.
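The reload semantics can be modeled in userspace with an Arc snapshot in place of real RCU. This is an illustrative stand-in — in the kernel the read side is genuinely lock-free; the Mutex here only models the single-pointer publish step, and `ShareTable` is a hypothetical name:

```rust
use std::sync::{Arc, Mutex};

/// Userspace model of RcuVec<SmbShare> reload semantics: readers take a
/// cheap snapshot; a reload publishes a whole new list. In-progress
/// requests holding the old snapshot keep it alive until they finish.
struct ShareTable {
    current: Mutex<Arc<Vec<String>>>, // Mutex guards only the pointer swap
}

impl ShareTable {
    fn new(shares: Vec<String>) -> Self {
        ShareTable { current: Mutex::new(Arc::new(shares)) }
    }

    /// Read path (per-request share lookup): clone the snapshot pointer.
    fn snapshot(&self) -> Arc<Vec<String>> {
        Arc::clone(&self.current.lock().unwrap())
    }

    /// Reload path (ksmbd.mountd --reload): replace the entire list.
    fn reload(&self, shares: Vec<String>) {
        *self.current.lock().unwrap() = Arc::new(shares);
    }
}

fn main() {
    let table = ShareTable::new(vec!["public".to_string()]);
    let in_flight = table.snapshot(); // tree connect started on old config
    table.reload(vec!["public".to_string(), "homes".to_string()]);
    // The in-flight request still completes against the old share list...
    assert_eq!(in_flight.len(), 1);
    // ...while new requests see the reloaded configuration.
    assert_eq!(table.snapshot().len(), 2);
}
```

The design point the model captures: a reload never mutates the list readers hold; it swaps in a fresh one, so "in-progress tree connects on the old configuration complete normally" falls out of the ownership structure rather than from locking discipline.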

15.21.5 Oplock and Lease Model

SMB oplocks (opportunistic locks) and leases control client-side caching. They are the SMB equivalent of NFSv4 delegations (Section 15.11).

/// SMB oplock level (per MS-SMB2 section 2.2.14).
#[repr(u8)]
pub enum SmbOplockLevel {
    /// No oplock granted.
    None      = 0x00,
    /// Level II: read caching only. Multiple clients may hold simultaneously.
    LevelII   = 0x01,
    /// Exclusive: read + write caching. Only one client may hold.
    Exclusive = 0x08,
    /// Batch: read + write + handle caching. Client may delay close.
    Batch     = 0x09,
    /// Lease (SMB 2.1+): fine-grained, per-file-name caching state.
    Lease     = 0xFF,
}

/// SMB2 Lease -- per-file-name (not per-handle) caching state.
/// Leases survive handle close and reopen, unlike oplocks.
pub struct SmbLease {
    /// Client-generated lease key (unique per client per file name).
    pub lease_key: [u8; 16],
    /// Current lease state (combination of R, W, H flags).
    pub lease_state: LeaseState,
    /// Lease epoch (incremented on each lease break/upgrade).
    /// Protocol-mandated u16 (MS-SMB2 wire format). Wrap is safe: SMB2
    /// uses modular comparison for epoch changes; absolute value is not meaningful.
    pub lease_epoch: u16,
    /// Parent lease key for directory leases (SMB 3.0+). Enables
    /// directory change caching: the client can cache readdir results
    /// until the parent lease is broken.
    pub parent_lease_key: Option<[u8; 16]>,
}

bitflags! {
    /// Lease state flags (MS-SMB2 section 2.2.13.2.8).
    pub struct LeaseState: u32 {
        /// Read caching: client may cache read data locally.
        const READ   = 0x01;
        /// Write caching: client may cache writes locally (flush on break).
        const WRITE  = 0x02;
        /// Handle caching: client may defer close operations.
        const HANDLE = 0x04;
    }
}

Lease break protocol: When a conflicting access arrives (e.g., another client opens a file for write while a read-write lease is held), the server sends a lease break notification. The client must acknowledge the break within 35 seconds (the oplock break timeout per MS-SMB2 section 3.3.5.22.1) or the lease is forcibly revoked. During the break period, the client flushes cached writes and downgrades its lease state. The VFS integration for conflict detection uses vfs_test_lock() and the file notification subsystem (Section 14.13).
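The downgrade target on a break can be sketched as a pure function of the held state and the conflict type. This is a simplified sketch (`break_to` is an illustrative helper; a real server also distinguishes handle-caching conflicts such as pending deletes):

```rust
// Lease state bits, matching the LeaseState values above
// (MS-SMB2 section 2.2.13.2.8).
const READ: u32 = 0x01;
const WRITE: u32 = 0x02;
const HANDLE: u32 = 0x04;

/// Compute the state a lease must be broken down to when a conflicting
/// open arrives. The client flushes cached writes while downgrading.
fn break_to(held: u32, conflicting_write: bool) -> u32 {
    if conflicting_write {
        // Another writer: no client-side caching of any kind remains safe.
        0
    } else {
        // Another reader: write and handle caching must be given up;
        // read caching may be retained (multiple readers can coexist).
        held & READ
    }
}

fn main() {
    // RWH lease, conflicting writer: break to none.
    assert_eq!(break_to(READ | WRITE | HANDLE, true), 0);
    // RW lease, conflicting reader: break to R.
    assert_eq!(break_to(READ | WRITE, false), READ);
}
```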

15.21.6 SMB Multichannel

SMB 3.0 and later support multichannel: multiple TCP connections per session for bandwidth aggregation and fault tolerance.

  • Interface discovery: The client issues FSCTL_QUERY_NETWORK_INTERFACE_INFO to discover the server's network interfaces (IP addresses, link speeds, RSS capability). The server populates the response by enumerating all network interfaces via the netdevice subsystem (Section 16.13), filtering to interfaces that are UP and have a routable address. Each entry includes: interface index, link speed (from ethtool_link_ksettings), RSS capability flag, and IPv4/IPv6 socket addresses. The response is bounded by the number of interfaces (typically <32).
  • Connection binding: Additional connections are bound to the existing session via SMB2 SESSION_SETUP with SMB2_SESSION_FLAG_BINDING. All connections in a session share the same session key and signing/encryption keys.
  • Request distribution: Requests are distributed across channels per-request (not per-session). The server processes requests from any channel interchangeably.
  • Failover: If one channel fails (TCP RST or timeout), in-flight requests on that channel are retried on a surviving channel. The session remains valid as long as at least one channel is active.
  • Channel limit: UmkaOS supports up to 32 channels per session. The limit is configurable via ksmbd.conf (max_channels = N).

15.21.7 SMB Direct (RDMA)

SMB Direct (MS-SMBD, SMB 3.0+) enables RDMA transport for zero-copy file transfer, eliminating TCP/IP overhead on RDMA-capable networks.

  • Transport: Uses iWARP, RoCE v2, or InfiniBand RDMA via the UmkaOS RDMA subsystem (Section 5.4).
  • Data transfer: Bulk data (read/write payloads) uses RDMA Read/Write operations; control messages (SMB headers, negotiate, session setup) use RDMA Send/Receive.
  • Buffer descriptors: Each RDMA data transfer is described by a SmbdBufferDescriptor { offset: u64, token: u32, length: u32 } that the peer uses for remote DMA addressing.
  • Credit-based flow control: The receiver advertises receive credits; the sender must not exceed the credit count. Credits are replenished in each response.
  • Negotiation: SMB Direct is negotiated at connection time. If both endpoints support RDMA, the connection transparently upgrades from TCP to RDMA. Applications and management tools see a standard SMB session.
  • Supported hardware: Mellanox/NVIDIA ConnectX series, Chelsio T6+, Intel E810 (iWARP). Any NIC exposing the UmkaOS RDMA verbs interface is usable.

15.21.8 ksmbd.mountd IPC Protocol

The kernel/userspace split follows the same model as Linux ksmbd: the kernel handles the fast data path while ksmbd.mountd handles authentication and configuration.

ksmbd.mountd responsibilities:

  • Parse /etc/ksmbd/ksmbd.conf (share definitions, global parameters).
  • Manage the user database (/etc/ksmbd/ksmbdpwd.db): NTLM password hashes.
  • Authenticate SMB session setup requests (NTLMv2, Kerberos via GSSAPI/SPNEGO).
  • Return session keys and user credentials to the kernel.

IPC channel: Generic Netlink family (KSMBD_GENL_NAME = "KSMBD_GENL"). ksmbd uses Generic Netlink with a dynamically registered family, not a fixed netlink protocol number. The protocol is a simple request/response framing:

  1. Kernel sends KSMBD_EVENT_LOGIN_REQUEST { account_name, domain_name } when an SMB2 SESSION_SETUP arrives.
  2. ksmbd.mountd validates credentials (NTLM challenge-response or Kerberos AP-REQ), sends KSMBD_EVENT_LOGIN_RESPONSE { session_key, uid, gid, status }.
  3. On share configuration reload: ksmbd.mountd sends KSMBD_EVENT_SHARE_CONFIG_REQUEST with the full share table; kernel replaces the RcuVec<SmbShare> atomically.

Failure mode: If ksmbd.mountd is not running, new session setup requests are rejected with STATUS_LOGON_FAILURE. Existing authenticated sessions continue to operate (the kernel has cached the session key). This allows ksmbd.mountd to be restarted without disrupting active file transfers.
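The login round-trip and its failure mode reduce to a small dispatch. A sketch — the enum and status names here are illustrative, not the KSMBD_GENL wire format:

```rust
/// Illustrative result of a KSMBD_EVENT_LOGIN_RESPONSE.
enum LoginResponse {
    Ok { session_key: [u8; 16], uid: u32, gid: u32 },
    Denied,
}

#[derive(Debug, PartialEq)]
enum Status {
    Success,
    LogonFailure, // maps to STATUS_LOGON_FAILURE on the wire
}

/// Kernel-side session setup. If the mountd channel is down (None),
/// new logins fail; already-authenticated sessions are unaffected
/// because the kernel has cached their session keys.
fn handle_session_setup(
    mountd: Option<&dyn Fn(&str, &str) -> LoginResponse>,
    account: &str,
    domain: &str,
) -> Status {
    match mountd {
        None => Status::LogonFailure, // ksmbd.mountd not running
        Some(authenticate) => match authenticate(account, domain) {
            LoginResponse::Ok { .. } => Status::Success,
            LoginResponse::Denied => Status::LogonFailure,
        },
    }
}

fn main() {
    // Helper down: reject new logins.
    assert_eq!(handle_session_setup(None, "alice", "WORKGROUP"), Status::LogonFailure);
    // Helper up and credentials valid: session established.
    let auth: &dyn Fn(&str, &str) -> LoginResponse =
        &|_, _| LoginResponse::Ok { session_key: [0; 16], uid: 1000, gid: 1000 };
    assert_eq!(handle_session_setup(Some(auth), "alice", "WORKGROUP"), Status::Success);
}
```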

15.21.9 VFS Integration

SMB operations map directly to UmkaOS VFS operations (Section 14.1):

  • SMB2_CREATE → vfs_open() / vfs_create() -- creates or opens a file/directory
  • SMB2_READ → vfs_read() / vfs_splice_read() -- splice path for zero-copy when possible
  • SMB2_WRITE → vfs_write() -- respects the share read_only flag
  • SMB2_CLOSE → vfs_close() -- releases the oplock/lease if last handle
  • SMB2_FLUSH → vfs_fsync() -- flushes to stable storage
  • SMB2_QUERY_INFO → vfs_getattr() / vfs_getxattr() -- file/FS/security info classes
  • SMB2_SET_INFO → vfs_setattr() / vfs_setxattr() -- includes timestamp, size, ACL updates
  • SMB2_QUERY_DIRECTORY → vfs_readdir() -- pattern matching (wildcards) in kernel
  • SMB2_CHANGE_NOTIFY → Section 14.13 -- maps to inotify/fanotify watchers
  • SMB2_LOCK → vfs_lock_file() -- byte-range locks (Section 14.14)
  • SMB2_IOCTL → FSCTL dispatch -- FSCTL_GET_REPARSE_POINT, FSCTL_PIPE_WAIT, etc.

Extended attributes and NT ACLs: Windows NT security descriptors (DACLs/SACLs) are stored as extended attributes in the security.NTACL xattr namespace (Section 14.16). When a Windows client sets file permissions via the Security tab, the server serializes the NT security descriptor into the xattr. On SMB2_QUERY_INFO with SMB2_0_INFO_SECURITY, the xattr is read and returned as a wire-format security descriptor. If no security.NTACL xattr exists, the server synthesizes a default ACL from POSIX mode bits.
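The default-ACL synthesis from POSIX mode bits can be sketched with the standard Windows generic-access constants. This is a hedged illustration of the mapping step only (`default_dacl_masks` is a hypothetical helper; the real server emits a full wire-format security descriptor with SIDs and ACE headers):

```rust
// Standard Windows composite access masks (winnt.h values).
const FILE_GENERIC_READ: u32 = 0x0012_0089;
const FILE_GENERIC_WRITE: u32 = 0x0012_0116;
const FILE_GENERIC_EXECUTE: u32 = 0x0012_00A0;

/// Map one rwx triplet (e.g. (mode >> 6) & 7 for the owner class)
/// to an NT access mask.
fn mask_from_rwx(rwx: u32) -> u32 {
    let mut mask = 0;
    if rwx & 4 != 0 { mask |= FILE_GENERIC_READ; }
    if rwx & 2 != 0 { mask |= FILE_GENERIC_WRITE; }
    if rwx & 1 != 0 { mask |= FILE_GENERIC_EXECUTE; }
    mask
}

/// Access masks for the three ACEs of a synthesized default DACL
/// (owner SID, group SID, Everyone), derived from the POSIX mode.
fn default_dacl_masks(mode: u32) -> (u32, u32, u32) {
    (
        mask_from_rwx((mode >> 6) & 7), // owner class
        mask_from_rwx((mode >> 3) & 7), // group class
        mask_from_rwx(mode & 7),        // other -> Everyone
    )
}

fn main() {
    // A 0644 file: owner gets read+write, group and Everyone get read.
    let (owner, group, everyone) = default_dacl_masks(0o644);
    assert_eq!(owner, FILE_GENERIC_READ | FILE_GENERIC_WRITE);
    assert_eq!(group, FILE_GENERIC_READ);
    assert_eq!(everyone, FILE_GENERIC_READ);
}
```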

Durable handles: SMB 2.x durable handles allow a client to reconnect to open files after a brief network disconnection (up to the durable handle timeout, default 60 seconds). The server retains the SmbOpenFile entry and its associated VFS state (OpenFile, byte-range locks, oplock/lease) across the disconnect. On reconnect, the client presents the durable file ID and the server resumes the open without re-executing vfs_open().

Persistent handles (SMB 3.0+, requires stable storage) survive server restart — the open file state is journaled to /var/lib/ksmbd/persistent_handles/. Constraint: The persistent handle journal directory MUST NOT be on the same exported share that the handles reference — this avoids a circular dependency where reconstructing a persistent handle requires mounting a share that itself has a persistent handle pending reconstruction. The journal directory must be on a local filesystem (ext4/XFS) that is mounted before ksmbd starts.

15.21.10 Security

Authentication: Delegated entirely to ksmbd.mountd:

  • NTLMv2: challenge-response using the NT password hash from ksmbdpwd.db. The kernel generates the 8-byte challenge; ksmbd.mountd validates the response.
  • Kerberos (SPNEGO): ksmbd.mountd accepts the Kerberos AP-REQ via GSSAPI/SPNEGO, validates it against the host keytab, and returns the session key.

Message signing: When negotiated (mandatory for SMB 3.1.1 when the client requests signing), all SMB2 messages carry an AES-CMAC or AES-GMAC signature computed over the message header and payload. The signing key is derived from the session key via KDF(SessionKey, label, context) per MS-SMB2 section 3.1.4.2. Signing prevents man-in-the-middle modification of SMB traffic.

Encryption: Per-session or per-share encryption (SMB 3.0+). When enabled, the entire SMB2 transform header and payload are encrypted using AES-128-CCM, AES-128-GCM, AES-256-CCM, or AES-256-GCM (negotiated during SMB2 NEGOTIATE). Encryption keys are derived from the session key. All cryptographic operations use the UmkaOS kernel crypto subsystem (Section 10.1).

Guest access: Configurable per-share via guest_ok. Disabled by default. When enabled, unauthenticated connections are granted the nobody credential (UID 65534). Guest sessions cannot use signing or encryption.

Capability requirement: Capability::NetAdmin is required to configure ksmbd (start/stop the server, modify shares). Standard users may connect as SMB clients without any special capability.

15.21.11 Cross-references

  • Section 14.1 -- VFS operations backing the SMB data path (InodeOps, FileOps)
  • Section 14.13 -- file notification subsystem (CHANGE_NOTIFY, lease break detection)
  • Section 14.14 -- byte-range locking (SMB2_LOCK)
  • Section 14.16 -- security.NTACL xattr storage for NT security descriptors
  • Section 15.11 -- NFSv4 delegations (analogous server-side caching model)
  • Section 16.13 -- netdevice enumeration for multichannel interface discovery
  • Section 5.4 -- RDMA subsystem underlying SMB Direct
  • Section 10.1 -- kernel crypto subsystem for signing and encryption

15.21.12 Design Decisions

  1. Kernel/userspace split: The data path (read, write, directory enumeration) runs entirely in-kernel for minimal latency. Authentication and configuration parsing run in userspace (ksmbd.mountd) where they can use standard libraries (MIT Kerberos, OpenSSL) without kernel-space constraints. This matches the Linux ksmbd architecture and is the right trade-off: authentication is per-session (infrequent), while data operations are per-request (high frequency).

  2. SMB 1.0 not supported: SMB 1.0/CIFS has critical security vulnerabilities (EternalBlue/WannaCry) and no legitimate modern use case. All supported clients (Windows 10+, macOS 10.12+, Linux CIFS) speak SMB 2.1 or later. Excluding SMB 1.0 eliminates a large attack surface.

  3. Tier 1 placement: ksmbd is a kernel-resident server that accesses the VFS via kabi_call! (resolves to ring dispatch since ksmbd and VFS are in different hardware isolation domains). Tier 2 (userspace) placement would add a privilege transition on top of the ring dispatch, further increasing latency. Tier 1 provides ring-based VFS access with hardware domain isolation for fault containment and implicit batching for throughput.

  4. RCU for share configuration: Share lookups happen on every SMB request (to verify access rights). RCU eliminates lock contention on the read path. Configuration reloads are rare (operator-initiated) and use the RCU update path.

  5. Oplock/lease integration via VFS file notification: Rather than implementing a separate conflict detection mechanism, ksmbd uses the VFS file notification subsystem (Section 14.13) to detect conflicting opens and trigger lease breaks. This ensures consistent behavior between local processes and remote SMB clients accessing the same files.

  6. Durable handles with VFS state retention: Durable handles keep the VFS OpenFile alive across client disconnects, avoiding the cost of re-opening and re-acquiring locks. The 60-second default timeout is short enough that server resources are not held indefinitely by crashed clients.