Chapter 15: Storage and Filesystems¶
Durability guarantees, block I/O, volume management, block storage networking, clustered filesystems, persistent memory, SATA/AHCI, ext4/XFS/Btrfs, ZFS
The storage subsystem spans block I/O, volume management (device-mapper), filesystem drivers (ext4, XFS, Btrfs, ZFS), NFS client/server, and persistent memory. Durability guarantees are explicit: every I/O path documents its crash-consistency model. The I/O scheduler is a replaceable policy layer, supporting 50-year uptime via live kernel evolution.
15.1 Durability Guarantees¶
Linux problem: Applications couldn't reliably know when data was on disk. The ext4
delayed-allocation data loss bugs (2008-2009) were a symptom. Worse, fsync() error
reporting was broken — errors could be silently lost between calls. Partially fixed with
errseq_t in kernel 4.13 (with subsequent refinements in 4.14 and 4.16), but the contract between applications and filesystems around
durability remains murky.
UmkaOS design:
- Error reporting: Every filesystem operation tracks errors via a per-file error
sequence counter. fsync() returns errors exactly once and never silently drops them.
The VFS layer enforces this — individual filesystem implementations cannot bypass it.
- Durability contract: Three explicit levels, documented and testable:
1. write() → data in page cache (may be lost on crash)
2. fsync() → data + metadata on stable storage (guaranteed)
3. O_SYNC / O_DSYNC → each write waits for stable storage
- Filesystem crash consistency: All filesystem implementations must declare their
consistency model (journal, COW, log-structured) and pass a crash-consistency test
suite as part of KABI certification.
- Error propagation: Writeback errors propagate to ALL file descriptors that have the
file open, not just the one that triggered writeback. No silent data loss.
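The error-reporting contract above can be modeled in miniature. The sketch below is illustrative only — ErrSeq and its methods are hypothetical stand-ins for the kernel's per-file error sequence counter, not the real KABI types — but it demonstrates the two observable guarantees: each descriptor's fsync() sees a new writeback error exactly once, and every open descriptor sees it.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified model of a per-file writeback error sequence counter
/// (hypothetical; the real mechanism packs a sequence and a "seen" bit).
pub struct ErrSeq {
    /// Incremented every time a writeback error is recorded.
    seq: AtomicU64,
}

impl ErrSeq {
    pub const fn new() -> Self {
        ErrSeq { seq: AtomicU64::new(0) }
    }

    /// Record a writeback error (called from the completion path).
    pub fn record(&self) {
        self.seq.fetch_add(1, Ordering::Release);
    }

    /// Sample the current sequence when a file descriptor is opened.
    pub fn sample(&self) -> u64 {
        self.seq.load(Ordering::Acquire)
    }

    /// fsync()-side check: returns true if an error occurred since the
    /// descriptor's cursor, and advances the cursor so the same error
    /// is never reported twice to this descriptor.
    pub fn check(&self, cursor: &mut u64) -> bool {
        let cur = self.seq.load(Ordering::Acquire);
        if cur != *cursor {
            *cursor = cur;
            true
        } else {
            false
        }
    }
}
```

Because each descriptor carries its own cursor, error propagation to all open descriptors falls out naturally: every cursor sampled before the error observes it.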
15.1.1 Boot Initialization Sequence¶
The storage subsystem initializes in dependency order during the canonical boot sequence (Section 2.3). Phase numbers below match the master table.
Phase 4.4a bus_enumerate() + Phase 4.5 block_init():
- Bus enumeration (Phase 4.4a) discovers NVMe/AHCI/VirtIO controllers in ACPI/DT namespace.
- Block layer registration (Phase 4.5): I/O scheduler, bio slab, request queues.
Phase 5.4 storage_probe() — NVMe/SCSI/AHCI/VirtIO/eMMC:
- Probe discovered controllers. Allocate NVMe submission/completion queues from slab.
Register block devices. This step requires Tier 1 driver loading (5.3) for
storage drivers behind the KABI boundary.
Phase 5.45 dm_init():
- Initialize device-mapper: register target types (linear, striped, crypt, verity,
thin, cache, mirror), then assemble dm devices specified on the kernel command line
or in initramfs. Depends on 5.4 because dm devices are built on top of physical
block devices that must already be probed.
Phase 5.5 mount_rootfs():
- Scan registered block devices for GPT/MBR partition tables. Identify root by
PARTUUID from kernel command line. If root= specifies a dm device
(root=/dev/dm-0 for LVM/LUKS root), the dm device was assembled in 5.45.
- Register filesystem types (ext4_init(), xfs_init(), btrfs_init()).
- Mount root filesystem. On failure, panic with diagnostic showing failed PARTUUID
and all detected block devices/partitions.
Phase 6.x — Post-root (on demand):
- fuse_init(): Register FUSE filesystem type. Daemon not yet running.
- nfs_init(): Register NFS. Network stack must be up for NFS root (initramfs handles).
- dlm_init(): Distributed Lock Manager for clustered filesystems.
Ordering constraints:
| Constraint | Canonical steps | Reason |
|---|---|---|
| storage_probe before dm_init | 5.4 → 5.45 | dm devices are layered on physical block devices |
| NVMe/VirtIO/SATA before root scan | 5.4 → 5.5 | Devices must be registered before root scan |
| Network stack before NFS root | 5.2 → 6.1 | NFS client requires TCP; initramfs manages |
| dlm_init() after nfs_init() if co-located | 6.1 → 6.3 | Standalone NFS does not need DLM; clustered filesystems (GFS2/OCFS2) need DLM. DLM uses its own TCP transport (not NFS). NFS inits first (Phase 6.1) for early NFS root; DLM inits later (Phase 6.3) for cluster locking. |
Error handling: If any phase 5.4 device init fails, that device is marked unavailable and an FMA event is raised; boot continues without it. If root mount fails in 5.5, the kernel panics with a diagnostic serial console message showing the failed PARTUUID and all detected block devices and partitions.
15.1.2 Filesystem Error Mode Selection by Error Code¶
Operators configure the error mode per mount via the errors=continue|remount-ro|panic mount option.
The table below defines the default error mode when no mount option is specified:
| Error | Default mode | Rationale |
|---|---|---|
| EIO | Continue (retry) | Transient device error, may recover |
| ENOSPC | Continue | Out of space is recoverable (free space, retry) |
| EROFS | RemountRo | Filesystem corruption detected |
| EUCLEAN | RemountRo | Metadata checksum failure |
| EREMOTEIO | Continue (retry) | Remote transport failure (NFS, iSCSI, cluster FS); transient network issue, retry after reconnect |
| ETIMEDOUT | Continue (retry) | I/O timeout; device or network may recover on retry |
| EUCLEAN (critical) | Panic | Critical filesystem corruption (superblock/journal/bitmap). Same errno as non-critical EUCLEAN above; the FMA severity (Critical vs Warning) determines escalation to Panic. Linux uses EUCLEAN (aliased as EFSCORRUPTED) for all corruption levels. |
The check_fs_error_mode() function (Section 15.2) consults
the superblock's error_mode field (set at mount time). If the operator has set
errors=, that overrides the per-error-code defaults above.
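As an illustration of how check_fs_error_mode() might consult these defaults, the sketch below encodes the table above. The type names (ErrorAction, FmaSeverity) and the function signature are assumptions for this example, not the actual Section 15.2 definitions; the errno constants match the I/O result code table in Section 15.1.3.

```rust
/// Possible filesystem responses to an I/O error (illustrative names).
#[derive(Debug, PartialEq)]
pub enum ErrorAction {
    Continue,
    RemountRo,
    Panic,
}

/// FMA event severity (illustrative; distinguishes critical corruption).
#[derive(Clone, Copy, PartialEq)]
pub enum FmaSeverity {
    Warning,
    Critical,
}

const EIO: i32 = 5;
const ENOSPC: i32 = 28;
const EROFS: i32 = 30;
const EBADMSG: i32 = 74;
const ETIMEDOUT: i32 = 110;
const EUCLEAN: i32 = 117;
const EREMOTEIO: i32 = 121;

/// Default per-errno mapping from the table above. `mount_override` is
/// Some(..) when the operator passed errors= at mount time — that setting
/// overrides the per-error-code defaults.
pub fn default_error_mode(
    errno: i32,
    severity: FmaSeverity,
    mount_override: Option<ErrorAction>,
) -> ErrorAction {
    if let Some(forced) = mount_override {
        return forced; // errors= mount option wins
    }
    match errno {
        EIO | ENOSPC | EREMOTEIO | ETIMEDOUT => ErrorAction::Continue,
        EROFS => ErrorAction::RemountRo,
        // Same errno for all corruption levels; FMA severity escalates.
        // EBADMSG (XFS CRC failures) is handled identically to EUCLEAN.
        EUCLEAN | EBADMSG if severity == FmaSeverity::Critical => ErrorAction::Panic,
        EUCLEAN | EBADMSG => ErrorAction::RemountRo,
        _ => ErrorAction::Continue,
    }
}
```

Note how the critical-corruption row needs the severity guard: the errno alone (EUCLEAN) cannot distinguish a bad directory block from a corrupt superblock.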
15.1.3 I/O Result Codes¶
IoResultCode is a type alias for i32 (negated errno), matching Linux bio completion
semantics. Every KABI vtable method that performs I/O returns IoResultCode. This
includes BlockDeviceOps::submit_bio() completion callbacks, filesystem
AddressSpaceOps::writepage() completions, and NVMe passthrough command results.
/// I/O completion result code. Negated errno value.
/// 0 = success, negative = error.
///
/// This is the same encoding used by Linux's `blk_status_to_errno()` and
/// bio completion callbacks — existing driver code and filesystem error
/// handling logic works without translation.
pub type IoResultCode = i32;
Common values:
| Value | Constant | Meaning |
|---|---|---|
| 0 | (success) | I/O completed successfully |
| -5 | -EIO | Device error (timeout, transport failure, uncorrectable media error) |
| -28 | -ENOSPC | No space left on device (filesystem or thin-provisioned volume) |
| -30 | -EROFS | Read-only filesystem (write attempted after remount-ro) |
| -117 | -EUCLEAN | Metadata checksum failure (block or filesystem layer detected corruption) |
| -74 | -EBADMSG | CRC verification failure (used by XFS for xfs_buf_verify() failures). Handled identically to EUCLEAN by check_fs_error_mode(). Note: Linux defines EFSCORRUPTED as alias for EUCLEAN (117), not 74. ext4 uses EUCLEAN for corruption; XFS uses both EUCLEAN and EBADMSG. |
| -121 | -EREMOTEIO | Remote I/O error (NFS, iSCSI, or cluster filesystem transport failure) |
| -110 | -ETIMEDOUT | I/O timeout (command did not complete within device timeout) |
The error mode mapping table above defines the default filesystem response to each
IoResultCode. The mapping from IoResultCode to filesystem action is:
bio_completion(IoResultCode) -> check_fs_error_mode(errno) -> ErrorAction.
15.2 Block I/O and Volume Management¶
Linux problem: LVM/mdadm are mature but fragile when a block device disappears momentarily — the volume layer panics or marks the device as failed. An NVMe driver reload that takes 50ms can cascade into a degraded RAID array and an unnecessary multi-hour resync.
UmkaOS design:
15.2.1 Evolvable/Nucleus Classification¶
The block I/O subsystem follows the UmkaOS Evolvable component model (Section 13.18). The table below classifies every major data structure, trait, and algorithm in this section.
Nucleus (non-replaceable, verified correctness, survives live evolution):
| Component | Rationale |
|---|---|
| Bio struct layout and lifetime | Correctness: every I/O path depends on Bio field semantics. Changing Bio layout requires full subsystem quiesce. |
| IoRequest struct and merge rules | Correctness: elevator merge correctness depends on request ordering invariants. A broken merge can corrupt data. |
| BlockDeviceOps trait signature | ABI contract: drivers implement this trait. Changing the signature breaks all compiled drivers. |
| Elevator merge algorithm correctness | Correctness: merge must never combine requests that cross partition or stripe boundaries. This is a safety invariant, not a policy choice. |
| Write barrier ordering guarantees | Correctness: barrier semantics are part of the durability contract (Section 15.1). Violating barrier ordering causes data loss. |
| Device-mapper target interface (DmTarget trait) | ABI contract: dm targets implement this trait. Must be stable across live evolution. |
Evolvable (replaceable policy, hot-swappable via EvolvableComponent):
| Component | Rationale |
|---|---|
| I/O scheduler algorithm (mq-deadline, BFQ, none) | Policy: which requests to dispatch first is a heuristic. Different workloads benefit from different schedulers. ML can tune or replace. |
| Writeback throttling policy | Policy: how aggressively to throttle dirty page generation is a tunable heuristic. Optimal policy depends on device speed and workload. |
| Readahead strategy | Policy: how many pages to prefetch is a heuristic. Sequential vs random detection and prefetch window sizing are ML-tunable. |
| Stripe log flush policy | Policy: when to flush the RAID write-hole journal is a latency/durability tradeoff. Tunable per workload. |
| I/O priority class mapping | Policy: how ioprio classes map to dispatch weights is a scheduling policy decision. |
| Device-mapper thin provisioning overcommit thresholds | Policy: when to warn or block on overcommitted thin pools is an operator-tunable policy. |
15.2.2 Storage Driver Isolation Tiers at Boot¶
Boot-critical storage drivers (NVMe, AHCI/SATA, virtio-blk) follow a two-phase isolation model that reconciles the requirements of fast boot with post-boot fault containment.
Phase 1 — Tier 0 at boot (Phases 5.1–5.4): During boot, storage drivers needed for root filesystem access load as Tier 0 (in-kernel, statically linked, no isolation domain). This is required because:
- The root filesystem must be mounted (Phase 5.5) before the full module loader infrastructure and IOMMU domain allocation are exercised under load.
- Boot-critical storage I/O runs a single code path with no concurrent untrusted work — the isolation overhead of Tier 1 domain switching has no security benefit during early boot when only kernel-authored code is executing.
- The canonical boot sequence (Section 2.3) loads Tier 0 drivers at Phase 5.1, then Tier 1 at Phase 5.3, followed by storage probe at Phase 5.4. Boot-critical storage drivers are loaded in Phase 5.1 (Tier 0) so they are ready for Phase 5.4 probing.
The boot command line identifies boot-critical storage drivers via the root device
specification (e.g., root=/dev/nvme0n1p2, root=UUID=...). The kernel's root
device resolver maps this to the required driver (NVMe, AHCI, or virtio-blk) and
ensures it loads in Phase 5.1 as Tier 0 rather than waiting for Phase 5.3 Tier 1
loading.
Phase 2 — Optional Tier 1 assignment (post-boot): After rootfs mount, the operator or system policy may set a boot-critical storage driver to Tier 1 to gain crash recovery and fault isolation:
1. System is booted, rootfs mounted, init running.
2. Operator (or systemd unit) sets tier to 1:
echo 1 > /ukfs/kernel/drivers/nvme0/tier
3. Registry initiates tier change for the target device:
a. Quiesce I/O: drain all in-flight bios for this device (flush + barrier).
b. Allocate isolation domain (MPK PKEY, POE overlay, or per-arch equivalent).
c. Remap driver pages into the new domain.
d. Resume I/O through the Tier 1 dispatch trampoline.
4. Device is now Tier 1: crashes cause driver reload (~50-150ms), not kernel panic.
Non-boot storage drivers (USB mass storage, SD/eMMC card readers, iSCSI initiator) always load as Tier 1 via the standard Phase 5.3 module loader path. They are never Tier 0 because they are not on the rootfs critical path.
Decision matrix:
| Driver | Boot role | Initial tier | Post-boot promotion | Rationale |
|---|---|---|---|---|
| NVMe | Root device | Tier 0 | Yes (recommended) | Root FS access; promote after init |
| AHCI/SATA | Root device | Tier 0 | Yes (recommended) | Root FS access on SATA systems |
| virtio-blk | Root device (VMs) | Tier 0 | Yes (recommended) | Root FS in virtual machines |
| USB mass storage | Never root | Tier 1 | N/A (already Tier 1) | Removable media, not boot-critical |
| SD/eMMC | Rarely root | Tier 0 if root, else Tier 1 | Yes if Tier 0 | Embedded systems may boot from eMMC |
| iSCSI | Network root | Tier 1 | N/A | Network boot uses initramfs pivot |
Cross-references: isolation tier model (Section 11.3), device registry boot sequence (Section 11.6), crash recovery for Tier 1 block drivers (Section 11.9).
15.2.3 Block Device Trait¶
/// Block device abstraction — the interface between the block I/O layer
/// and storage device drivers (NVMe, SATA, virtio-blk, eMMC, SD, dm-*).
///
/// Every storage driver registers a `BlockDevice` with umka-block.
/// The block I/O layer routes bio requests through this trait.
pub trait BlockDeviceOps: Send + Sync {
/// Submit a block I/O request. The request contains one or more
/// bio segments (contiguous LBA ranges with associated memory pages).
/// Returns immediately; completion is signaled via the bio's completion
/// callback. For synchronous I/O, the caller waits on the callback.
fn submit_bio(&self, bio: &mut Bio) -> Result<()>;
/// Flush volatile write cache to stable storage. Called by fsync(),
/// sync(), and journal commit paths. Must not return until all
/// previously submitted writes are on stable media.
fn flush(&self) -> Result<()>;
/// Discard (TRIM/UNMAP) the specified LBA range. The device may
/// deallocate the underlying storage. Not all devices support this;
/// return ENOSYS if unsupported.
fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()>;
/// Return device geometry and capabilities.
fn get_info(&self) -> BlockDeviceInfo;
/// Shut down the device. Flushes caches and releases hardware resources.
fn shutdown(&self) -> Result<()>;
}
bitflags! {
/// Block device capability flags. Replaces individual bool fields for
/// extensibility — new capabilities (zoned, write zeroes, zone append,
/// secure erase, copy offload) can be added as new bits without changing
/// the struct layout.
pub struct BlockDeviceFlags: u32 {
/// Device supports discard/TRIM (ATA TRIM, NVMe Deallocate, SCSI UNMAP).
const DISCARD = 1 << 0;
/// Device has a volatile write cache and supports flush commands.
const FLUSH = 1 << 1;
/// Device supports FUA (Force Unit Access) — write directly to media
/// without requiring a separate flush.
const FUA = 1 << 2;
/// Device is a rotational disk (HDD). If not set, assumed non-rotational (SSD/NVMe).
const ROTATIONAL = 1 << 3;
/// Device supports write zeroes command (NVMe Write Zeroes, SCSI Write Same).
const WRITE_ZEROES = 1 << 4;
/// Device is a zoned block device (ZNS NVMe, SMR HDD).
const ZONED = 1 << 5;
/// Device supports secure erase.
const SECURE_ERASE = 1 << 6;
}
}
/// Block device metadata and capabilities.
/// Kernel-internal, not KABI — populated within the same compilation unit
/// (Tier 0 block layer or Tier 1 driver via KABI ring serialization).
pub struct BlockDeviceInfo {
/// Logical sector size in bytes (typically 512 or 4096).
pub logical_block_size: u32,
/// Physical sector size in bytes (4096 for AF drives).
pub physical_block_size: u32,
/// Total device capacity in logical sectors.
pub capacity_sectors: u64,
/// Maximum segments per bio request.
pub max_segments: u16,
/// Maximum total bytes per bio request.
/// A value of 0 means "no explicit limit beyond segment count" — the block
/// layer uses `max_segments * PAGE_SIZE` as the effective limit. Drivers
/// that set this to 0 (e.g., VirtIO-blk) rely solely on segment limits.
pub max_bio_size: u32,
/// Device capability flags (discard, flush, FUA, rotational, etc.).
pub flags: BlockDeviceFlags,
/// Optimal I/O size in bytes (for alignment).
pub optimal_io_size: u32,
/// NUMA node affinity (for interrupt/queue placement).
pub numa_node: u16,
}
/// Cached immutable block device parameters. Populated once at device
/// registration from `BlockDeviceOps::get_info()` and stored in the
/// `BlockDevice` wrapper struct. Avoids vtable dispatch on the hot
/// bio-to-request conversion path — device geometry is immutable after
/// registration.
///
/// Fields are a subset of `BlockDeviceInfo` — only those needed on
/// the hot I/O path. Additional fields may be added as needed.
pub struct BlockDeviceCachedParams {
/// Logical sector size in bytes (typically 512 or 4096).
pub logical_block_size: u32,
/// Physical sector size in bytes.
pub physical_block_size: u32,
/// Maximum total bytes per bio request.
pub max_bio_size: u32,
/// Device capability flags (discard, flush, FUA, etc.).
pub flags: BlockDeviceFlags,
}
/// Concrete block device wrapper. Holds the driver's `BlockDeviceOps` vtable
/// together with cached geometry, I/O queues, and per-device accounting. Every
/// registered block device produces one `BlockDevice` instance stored in the
/// device registry XArray (keyed by `dev_t`). The `Bio.bdev` field holds
/// `Arc<BlockDevice>`, not `Arc<dyn BlockDeviceOps>` — this gives the block
/// layer access to both the ops vtable and the cached parameters without
/// double-indirection.
///
/// **Nucleus component**: The struct layout is Nucleus (field changes require
/// full subsystem quiesce). The device registration/teardown code is Evolvable.
pub struct BlockDevice {
/// Driver-provided operations vtable (submit_bio, flush, get_info, etc.).
pub ops: Arc<dyn BlockDeviceOps>,
/// I/O scheduler queues. `None` for devices using hardware multi-queue
/// dispatch (NVMe). `Some` for devices that benefit from software
/// scheduling (AHCI/SATA with single hardware queue).
pub io_queues: Option<DeviceIoQueues>,
/// Immutable geometry cached at registration time. Avoids vtable dispatch
/// on the hot bio-to-request conversion path.
pub cached_params: BlockDeviceCachedParams,
/// Device number (major:minor). Unique identifier in the device registry.
pub dev: DevT,
/// Human-readable device name (e.g., "nvme0n1", "sda").
pub name: ArrayVec<u8, 32>,
/// Per-device requeue list for bios returned with EAGAIN by the driver.
/// Bounded by `MAX_REQUEUE_DEPTH` (default 4096). Re-drained by the
/// device's completion IRQ handler via `blk_kick_requeue()`.
///
/// **Use-after-free prevention (BIO-09 fix)**: Each entry stores a
/// `(generation, *mut Bio)` tuple. The `generation` is a snapshot of
/// `bio.generation` at enqueue time. When `blk_kick_requeue()` dequeues
/// an entry, it first CAS's `bio.state` from `Inflight` to `Inflight`
/// (a no-op CAS that succeeds only if the bio is still inflight). If the
/// CAS fails, the bio was completed or timed out — the entry is stale.
/// Additionally, the generation check (`bio.generation == saved_gen`)
/// detects the ABA case where the slab recycled the memory for a new bio
/// that happens to be in `Inflight` state. The generation counter is
/// incremented on every `bio_alloc()`, so a recycled bio will have a
/// different generation than the saved snapshot.
///
/// **Why not Arc**: `Arc<Bio>` would prevent the slab from recycling the
/// bio until the requeue list drops its reference, defeating the purpose
/// of slab-based allocation (bounded pool, no heap growth). The generation
/// counter achieves the same safety guarantee without extending bio lifetime.
///
/// Uses `BoundedDeque` (fixed-capacity ring buffer with O(1) push/pop at
/// both ends) instead of `ArrayVec` because `blk_kick_requeue()` needs
/// FIFO semantics: drain from front, re-insert deferred bios at front.
/// `ArrayVec` has no `pop_front()`/`push_front()` methods.
pub requeue_list: SpinLock<BoundedDeque<RequeueEntry, 4096>>,
}
/// Entry in the per-device requeue list. Pairs a bio pointer with a
/// generation snapshot to detect use-after-free (BIO-09 fix).
pub struct RequeueEntry {
/// Raw pointer to the bio. Valid only if `generation` matches
/// `bio.generation` at dequeue time.
pub bio: *mut Bio,
/// Snapshot of `bio.generation` at enqueue time. If
/// `bio.generation != saved_generation` at dequeue, the slab
/// recycled the memory — the entry is stale and must be skipped.
pub generation: u64,
}
/// Bio lifecycle states. Resolves the completion/timeout race via CAS.
///
/// Both the device completion handler and the synchronous timeout path
/// attempt to CAS from `Inflight` to their target state. The winner
/// proceeds with its action; the loser observes a non-`Inflight` state
/// and bails out (no-op). This eliminates the `mem::replace` race that
/// existed before: `mem::replace` is NOT atomic and two concurrent
/// `mem::replace` calls corrupt the completion field.
///
/// State diagram:
/// ```
/// Inflight ──CAS──→ Completing   (terminal for bio_complete();
///     │                           the end_io callback may free the bio)
///     └──CAS──→ TimedOut ──store──→ Done
/// ```
///
/// **IoRequest.bio state tracking invariant** (BIO-12 resolution):
/// When a bio is wrapped in an IoRequest (scheduler path), the bio's
/// `state` field remains the single source of truth. The IoRequest does
/// NOT have its own state field. The completion path always goes through
/// the bio:
///
/// 1. Hardware signals completion → driver calls `bio_complete(req.bio, status)`.
/// 2. `bio_complete()` CAS's `bio.state` from `Inflight` to `Completing`.
/// 3. If CAS succeeds: invoke `bio.end_io` callback. `Completing` is
///    terminal for this path — `bio_complete()` performs no post-callback
///    store, because the callback may free the bio.
/// 4. If CAS fails: timeout handler already won — completion is a no-op.
///
/// The IoRequest is freed by the scheduler after `bio_complete()` returns
/// (whether the CAS succeeded or failed). The bio is freed by its `end_io`
/// callback (or by the synchronous waiter, depending on the submission
/// path). There is exactly one completion attempt per bio, regardless of
/// whether it went through the scheduler or not. The `*mut Bio` in
/// IoRequest is never dereferenced after `bio_complete()` transitions the
/// bio to `Done` — the IoRequest is dropped immediately after.
#[repr(u32)]
pub enum BioState {
/// Bio submitted, I/O in progress.
Inflight = 0,
/// Device completion handler won the CAS; executing callback.
Completing = 1,
/// Timeout handler won the CAS; executing timeout action.
TimedOut = 2,
/// Terminal state — bio lifecycle complete.
Done = 3,
}
/// Bio completion callback type. A function pointer set by the submitter
/// (filesystem, page cache, io_uring, synchronous waiter) before calling
/// `bio_submit()`. Called by `bio_complete()` when I/O finishes.
///
/// **Design rationale (Decision 4)**: The previous `BioCompletion` enum had 5
/// variants (`None`, `Callback`, `DeferredCallback`, `Waiter`, `StackWaiter`)
/// and required a bridging conversion (`IoCompletion::from_bio_completion()`)
/// to route scheduler completion back to the bio. That bridge was never
/// defined, creating a broken completion chain (BIO-01, BIO-05). The function
/// pointer replaces ALL variants: each submitter provides its own callback
/// that performs the appropriate completion action. The I/O scheduler wraps
/// the Bio in an IoRequest for merging/sorting, and on completion calls
/// `bio_complete()` which invokes this callback.
///
/// **Matches Linux's `bio->bi_end_io`**: Linux uses `void (*bi_end_io)(struct bio *)`
/// — a function pointer set by the submitter. UmkaOS adds a `status: i32`
/// parameter for direct error propagation (Linux uses `bio->bi_status` which
/// the callback reads separately; passing it avoids an extra atomic load).
///
/// **`*mut Bio` parameter**: Raw pointer because the callback runs in
/// interrupt/softirq context after the CAS-protected state transition in
/// `bio_complete()`. The CAS guarantees exclusive access — no aliasing.
/// The callback may free the bio (via `SlabBox::from_raw()`) or retain it
/// for retry. `&mut Bio` is unsuitable because the bio may already be
/// behind a raw pointer in the requeue list or I/O scheduler.
///
/// **Callback context constraints**: See "Bio Completion Callback Constraints"
/// below. Callbacks that need process context (page cache updates, sleeping
/// locks) must schedule work on the `blk-io` workqueue and return immediately.
///
/// **Common callback implementations**:
///
/// | Submitter | Callback | Action |
/// |-----------|----------|--------|
/// | `bio_submit_and_wait()` | `bio_sync_end_io` | Sets status, wakes stack waiter |
/// | Async writeback | `writeback_end_io_deferred` | Enqueues `blk-io` workqueue item for page cache updates |
/// | io_uring block ops | `io_uring_bio_end_io` | Posts CQE to completion ring |
/// | Direct I/O (no waiter) | `bio_noop_end_io` | Logs warning (catches double-signal bugs) |
///
/// **Status convention**: `status` is 0 on success, negative errno on error
/// (e.g., `-(EIO as i32)`). Matches the `bio.status` AtomicI32 encoding.
pub type BioEndIo = fn(bio: *mut Bio, status: i32);
/// Default (no-op) bio completion callback. Logs a warning if invoked —
/// catches double-signal bugs and bios submitted without a completion
/// callback set (programming error).
fn bio_noop_end_io(_bio: *mut Bio, _status: i32) {
klog_warn!("bio_complete: called on bio with default (noop) completion");
}
/// Synchronous bio completion callback. Used by `bio_submit_and_wait()`.
/// Sets `bio.status` and wakes the stack-allocated waiter. The waiter
/// pointer is stored in `bio.private` (set by `bio_submit_and_wait()`
/// before submission).
///
/// # Safety
/// - `bio` is valid (CAS-protected exclusive access in `bio_complete()`).
/// - `bio.private` is a valid `*const WaitQueueHead` pointing to the
/// caller's stack frame. The caller blocks until completion, so the
/// stack frame outlives this callback. If the timeout path wins the
/// CAS instead, this callback is never invoked.
fn bio_sync_end_io(bio: *mut Bio, status: i32) {
// SAFETY: bio pointer is valid (CAS-protected in bio_complete).
let bio = unsafe { &mut *bio };
bio.status.store(status, Ordering::Release);
// SAFETY: bio.private was set to a valid *const WaitQueueHead by
// bio_submit_and_wait(). The caller is blocked, so the stack frame
// (and thus the WaitQueueHead) is alive. The CAS in bio_complete()
// ensures this callback runs only if the timeout path did NOT win.
let wq = bio.private as *const WaitQueueHead;
unsafe { (*wq).wake_up(); }
}
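The submit-and-wait handshake that pairs with bio_sync_end_io can be modeled in userspace with a mutex and condition variable. SyncWaiter below is illustrative, not a kernel type — the real waiter is a stack-allocated WaitQueueHead reached through bio.private — but it shows the ordering the callback relies on: status is stored before the wakeup, so the waiter never reads a stale value.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::{Arc, Condvar, Mutex};

/// Userspace model of the bio_submit_and_wait() waiter (illustrative).
pub struct SyncWaiter {
    done: Mutex<bool>,
    cv: Condvar,
    /// Completion status: 0 = success, negative errno = error.
    pub status: AtomicI32,
}

impl SyncWaiter {
    pub fn new() -> Arc<Self> {
        Arc::new(SyncWaiter {
            done: Mutex::new(false),
            cv: Condvar::new(),
            status: AtomicI32::new(i32::MIN), // sentinel: not yet completed
        })
    }

    /// Completion side (mirrors bio_sync_end_io): store the status
    /// FIRST, then wake the waiter.
    pub fn complete(&self, status: i32) {
        self.status.store(status, Ordering::Release);
        *self.done.lock().unwrap() = true;
        self.cv.notify_one();
    }

    /// Submitter side: block until completion, then return the status.
    pub fn wait(&self) -> i32 {
        let mut done = self.done.lock().unwrap();
        while !*done {
            done = self.cv.wait(done).unwrap();
        }
        self.status.load(Ordering::Acquire)
    }
}
```

In the kernel the waiter lives on the submitter's stack, which is safe only because the submitter blocks until the callback has run (or the timeout path wins the CAS, in which case the callback never fires).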
/// Deferred writeback completion callback. Enqueues the actual page cache
/// update work (`writeback_end_io`) on the `blk-io` per-CPU workqueue.
/// This two-phase approach is required because page cache operations
/// (xa_lock, wait queue wake, `nr_dirty` decrement) are forbidden in
/// interrupt/softirq context where `bio_complete()` runs.
fn writeback_end_io_deferred(bio: *mut Bio, status: i32) {
// Schedule deferred execution on the `blk-io` per-CPU workqueue.
// The workqueue item captures the bio pointer and status. The
// bio remains valid until the deferred callback runs and either
// frees or recycles it.
workqueue_enqueue("blk-io", move || {
// SAFETY: bio is valid — ownership transferred from bio_complete()
// through the workqueue item. No other path accesses the bio
// between bio_complete() and this deferred execution.
writeback_end_io(unsafe { &mut *bio }, status);
});
}
/// Unified bio completion entry point. ALL completion paths MUST use this.
///
/// Performs CAS(Inflight -> Completing), stores the status, invokes the
/// bio's `end_io` callback, then transitions to Done. If CAS fails
/// (timeout or double-completion), does nothing — the timeout path or
/// prior completion already owns the bio.
///
/// **Why a function, not a method**: The CAS guarantees exclusive access
/// to the bio. Calling the `end_io` callback requires passing `bio` as
/// `*mut Bio` — the callback may free the bio, retry it, or chain it.
/// A `&mut self` method on Bio would be unsound because the callback
/// receives the same pointer. The free function takes `*mut Bio`
/// explicitly, and the CAS ensures no aliasing.
///
/// This eliminates the TOCTOU race (SF-373): the callback is a function
/// pointer (not an enum extracted via `mem::take`), so there is no
/// extraction step between CAS and invocation. The CAS guarantees
/// exclusive access; the callback is invoked directly.
///
/// Usage (in device IRQ handler, scheduler completion, or Tier 0 ring consumer):
/// ```
/// bio_complete(bio, 0); // success
/// bio_complete(bio, -(EIO as i32)); // error
/// ```
///
/// **Caller context**: May be called from interrupt/softirq context
/// (device IRQ handler), process context (timeout handler), or the
/// I/O scheduler completion path. The `end_io` callback must respect
/// the Bio Completion Callback Constraints documented below.
pub fn bio_complete(bio: *mut Bio, status: i32) {
// SAFETY: caller guarantees bio is a valid pointer to a live Bio.
// The CAS below establishes exclusive ownership before any mutation.
let bio_ref = unsafe { &*bio };
match bio_ref.state.compare_exchange(
BioState::Inflight as u32,
BioState::Completing as u32,
Ordering::AcqRel,
Ordering::Acquire,
) {
Ok(_) => {
// Won the CAS — exclusive ownership of the bio.
// Store status before invoking callback (callback may read it).
bio_ref.status.store(status, Ordering::Release);
// Invoke the submitter's completion callback. The callback
// receives `*mut Bio` and the status. It may:
// - Free the bio (SlabBox::from_raw)
// - Wake a waiter (bio_sync_end_io)
// - Schedule deferred work (writeback_end_io_deferred)
// - Post an io_uring CQE (io_uring_bio_end_io)
(bio_ref.end_io)(bio, status);
// No post-callback state store. After end_io returns, the
// bio may have been freed and its slab slot recycled by
// another CPU's bio_alloc(). Writing to freed memory would
// corrupt the new bio's state.
//
// The CAS to BioState::Completing is the terminal state for
// bio_complete()'s ownership. Sync waiters (bio_sync_end_io)
// wake on `bio.status != BIO_STATUS_PENDING`, which is set
// BEFORE the callback. eBPF/blktrace observers treat
// Completing as equivalent to Done.
}
Err(_) => {
// Lost the CAS — timeout path or another completion already
// claimed the bio. Do nothing.
}
}
}
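The exactly-one-winner property that bio_complete() and the timeout handler rely on reduces to a single compare-exchange from Inflight. The minimal model below (try_claim is a hypothetical helper, not kernel code) makes the contract testable: whichever path CASes first owns the bio; the loser observes a non-Inflight state and bails out.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// BioState values as u32 (matching the #[repr(u32)] enum above).
const INFLIGHT: u32 = 0;
const COMPLETING: u32 = 1;
const TIMED_OUT: u32 = 2;

/// Attempt to claim an inflight bio for either the completion path
/// (target = COMPLETING) or the timeout path (target = TIMED_OUT).
/// Returns true iff this caller won the race and owns the bio.
pub fn try_claim(state: &AtomicU32, target: u32) -> bool {
    state
        .compare_exchange(INFLIGHT, target, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```

This is why mem::replace was unsound here: two concurrent non-atomic swaps can both appear to succeed, whereas at most one compare_exchange from INFLIGHT can.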
Bio Completion Callback Constraints: The end_io callback function
executes in interrupt or softirq context
(Section 3.8) — specifically the
BLOCK_SOFTIRQ vector (index 4) for block I/O completion processing, or in the
I/O scheduler's completion path (which may also run in softirq context for
scheduler-attached devices). Callbacks MUST NOT:
- Acquire sleeping locks (Mutex, RwLock, Semaphore)
- Allocate memory with GFP_KERNEL (only pre-allocated objects or GFP_ATOMIC from the emergency reserve)
- Call filesystem operations or page cache methods
- Trigger KABI domain crossings
- Block or sleep for any reason
The end_io callback runs in interrupt or softirq context. Page cache state updates
(clearing PG_WRITEBACK, waking waiters, updating AddressSpace.wb_err) must be
deferred to the blk-io workqueue from within the callback (e.g.,
writeback_end_io_deferred schedules a workqueue item). This handoff adds ~1-5us
latency but is required for correctness — page cache operations may acquire sleeping
locks.
Clarification: end_page_writeback() (clearing PG_WRITEBACK and waking waiters)
is the operation deferred to the workqueue. Read-path completions that only clear
PageFlags::LOCKED and wake waiters run directly in softirq — no workqueue deferral
for page flag updates on the read path. The "workqueue deferral" described here applies
to writeback completion, where filesystem journal updates and AddressSpace.wb_err
manipulation may need sleeping locks.
For the performance impact of this deferral on fsync/O_SYNC paths, see Section 3.4.
Permitted operations:
- Atomic bitfield updates (set/clear `PageFlags`)
- Wake wait queues (`WaitQueueHead::wake_up`)
- Update per-CPU counters
- Schedule work on a workqueue for deferred processing
- Set `bio.status` and signal completion
/// Block I/O request — carries data between the block layer and device drivers.
///
/// A Bio represents a contiguous logical block range and its associated
/// memory pages. Multiple bios can be chained for scatter-gather I/O.
/// The bio is the fundamental unit of block I/O in UmkaOS, equivalent to
/// Linux's `struct bio`.
pub struct Bio {
/// Target block device (concrete wrapper, not trait object).
///
/// `Arc<BlockDevice>` provides access to both the driver's `BlockDeviceOps`
/// vtable (`bdev.ops`) and the cached device parameters (`bdev.cached_params`)
/// without double-indirection. The block layer uses `bdev.cached_params` on
/// the hot bio-to-request path and `bdev.ops.submit_bio()` for dispatch.
///
/// **Collection policy exemption**: `Arc<BlockDevice>` is used despite being
/// on the I/O hot path because the block device outlives all its bios.
/// The Arc refcount increment/decrement is a single atomic op (~5 ns)
/// amortized across the full bio lifecycle. The alternative (raw pointer +
/// manual lifetime tracking) would sacrifice Rust's use-after-free
/// prevention for negligible gain. Clone occurs at bio_alloc time (warm
/// path), not per-sector.
pub bdev: Arc<BlockDevice>,
/// Operation type.
pub op: BioOp,
/// Starting logical block address (in logical sectors).
pub start_lba: u64,
/// Scatter-gather list of memory segments.
pub segments: ArrayVec<BioSegment, 16>,
/// Extension segment list for bios with >16 segments.
///
/// **Hot-path allocation note**: Most I/O requests fit within the inline
/// 16-segment ArrayVec (filesystem block I/O, direct I/O up to 64 KB with
/// 4 KB pages). The `Box<[BioSegment]>` fallback is allocated only for
/// large scatter-gather lists (e.g., O_DIRECT reads >64 KB into a
/// fragmented user buffer). This allocation is from the `bio_slab` pool
/// (a dedicated slab cache with pre-allocated pages), NOT from the general
/// heap, ensuring bounded allocation latency on the I/O submit path. The
/// slab cache is sized at boot: `min(1024, nr_cpus * 64)` entries, each
/// holding up to `BIO_MAX_SEGMENTS - 16 = 240` BioSegments. If the slab
/// is exhausted, `bio_submit()` blocks on the slab mempool (same as Linux's
/// `bioset` mempool behaviour). The `Box` is freed to the slab on Bio
/// completion (drop path).
pub segments_ext: Option<Box<[BioSegment]>>,
/// Completion callback. Set by the submitter (filesystem, page cache,
/// io_uring, sync waiter) before calling `bio_submit()`. Invoked by
/// `bio_complete()` when I/O finishes. See `BioEndIo` type documentation
/// for callback constraints and common implementations.
///
/// **Default**: `bio_noop_end_io` (logs warning — catches bios submitted
/// without a callback set). Submitters MUST set this before `bio_submit()`.
pub end_io: BioEndIo,
/// Opaque per-submitter private data. Used by the completion callback to
/// locate submitter-specific state without an extra indirection. Common
/// uses:
/// - `bio_submit_and_wait()`: `*const WaitQueueHead` (stack waiter)
/// - io_uring: `*const IoRingBioPrivate` (ring + user_data)
/// - Writeback: `*const AddressSpace` (for wb_err update)
///
/// Stored as `usize` (pointer-sized opaque value). Each callback knows
/// how to interpret it. Initialized to 0 by `bio_alloc()`.
pub private: usize,
/// Atomic state machine for bio lifecycle. Resolves the completion/timeout
/// race: both the completion handler and timeout path CAS from INFLIGHT to
/// their target state. Winner proceeds, loser bails. No `mem::replace` race.
///
    /// ```text
    /// INFLIGHT ──CAS──→ COMPLETING ──store──→ DONE
    ///    │                                     ↑
    ///    └──CAS──→ TIMED_OUT ──store──────────┘
    /// ```
///
/// States:
/// - `Inflight` (0): bio submitted, I/O in progress.
/// - `Completing` (1): device completion handler won the CAS; executing callback.
/// - `TimedOut` (2): timeout handler won the CAS; executing timeout action.
/// - `Done` (3): terminal state, bio lifecycle complete.
pub state: AtomicU32,
/// Error status (set by the driver on completion).
pub status: AtomicI32,
/// Flags controlling I/O semantics and crash recovery behavior.
pub flags: BioFlags,
/// Originating cgroup ID for I/O accounting and throttling.
/// Set to 0 by default; populated by `bio_submit()` before dispatch.
/// This is the **global** cgroup ID (unique across all cgroup namespaces,
/// assigned monotonically by the cgroup core). Not namespace-scoped —
/// the block layer operates below namespace boundaries and uses the
/// global ID for blkcg throttling and accounting.
pub cgroup_id: u64,
/// Generation counter for use-after-free detection (BIO-09 fix).
/// Incremented by `bio_alloc()` on every allocation from the slab.
/// The requeue list stores a snapshot of this value; at dequeue time,
/// if `bio.generation != saved_generation`, the slab recycled the
/// memory for a new bio and the requeue entry is stale.
///
/// **Longevity**: u64 at 10 billion I/Os per second wraps after ~58
/// years. Exceeds the 50-year operational target.
pub generation: u64,
}
/// A single segment of a bio — a contiguous range of physical memory.
pub struct BioSegment {
/// Physical page containing the data. Raw pointer with manual refcount
/// management via `page_get()` / `page_put()` on `Page._refcount`.
///
/// # Why not `Arc<Page>`
///
/// `Page` already has an intrinsic atomic refcount (`_refcount: AtomicI32`).
/// Wrapping it in `Arc` adds a second refcount, doubling the atomic
/// operations on the hot I/O path and creating confusion about which
/// refcount is authoritative for page lifetime.
///
/// # Safety
///
/// - `page_get()` must be called before storing the pointer (bio_add_page).
/// - `page_put()` must be called when the segment is consumed (bio_endio).
/// - The page must not be freed while any BioSegment holds a pointer to it.
/// - Callers must not dereference the pointer after `page_put()`.
pub page: *const Page,
/// Offset within the page (bytes).
pub offset: u32,
/// Length of this segment (bytes).
pub len: u32,
}
// SAFETY: BioSegment is Send because the page pointer validity is
// maintained by page refcount (page_get/page_put). The page refcount
// ensures the page is not freed while any BioSegment holds a pointer.
// Cross-CPU completion (interrupt on different CPU) requires Send.
unsafe impl Send for BioSegment {}
// SAFETY: BioSegment is Sync because all fields are read-only after
// construction. The page pointer is dereferenced only for DMA address
// calculation, which is a pure read of Page.phys_addr.
unsafe impl Sync for BioSegment {}
/// Block I/O operation type. Values MUST match Linux `include/linux/blk_types.h`
/// `enum req_op` exactly — BioOp values are serialized across the KABI ring
/// boundary and observed by eBPF/blktrace tools. Gaps in the numbering
/// (4, 6, 8, 10-16) are reserved for future zone management ops matching Linux.
#[repr(u8)]
pub enum BioOp {
Read = 0, // REQ_OP_READ
Write = 1, // REQ_OP_WRITE
Flush = 2, // REQ_OP_FLUSH
Discard = 3, // REQ_OP_DISCARD
SecureErase = 5, // REQ_OP_SECURE_ERASE
ZoneAppend = 7, // REQ_OP_ZONE_APPEND
WriteZeroes = 9, // REQ_OP_WRITE_ZEROES
// Note: Read-ahead is signaled via `BioOp::Read` + `BioFlags::RAHEAD`,
// not as a separate BioOp variant. This matches Linux's design: read-ahead
// has identical device-level semantics to Read but different error handling
// (silently droppable on resource pressure). Keeping it as a flag avoids
// duplicating all Read handling in the block layer dispatch.
//
// Future zone ops (Phase 3+), matching Linux values: ZoneOpen=10,
// ZoneClose=11, ZoneFinish=12, ZoneReset=13, ZoneResetAll=15,
// DrvIn=34, DrvOut=35.
}
bitflags! {
/// Bio flags controlling I/O semantics and crash recovery behavior.
pub struct BioFlags: u32 {
/// Force Unit Access — bypass volatile write cache.
const FUA = 1 << 0;
/// Pre-flush — flush device write cache before this I/O.
const PREFLUSH = 1 << 1;
/// Metadata I/O (journal, superblock) — higher priority.
const META = 1 << 2;
/// Synchronous I/O — caller expects low latency.
const SYNC = 1 << 3;
/// Read-ahead — low priority, can be dropped under pressure.
const RAHEAD = 1 << 4;
/// Persistent bio — must be replayed after Tier 1 driver crash
/// recovery. Bios WITHOUT this flag are drained with `-EIO` on
/// crash. Used for filesystem journal commits, superblock writes,
/// and any I/O where silent loss causes data corruption.
/// See [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation--bio-crash-recovery).
const PERSISTENT = 1 << 5;
/// No-merge hint — do not merge with adjacent bios.
const NOMERGE = 1 << 6;
/// Marks bio as submitted from an async context (io_uring, AIO).
/// Completion uses async notification rather than blocking waiter.
const ASYNC = 1 << 7;
}
}
15.2.3.1 Bio Crash Recovery¶
When a Tier 1 block device driver crashes, in-flight bios are handled based on
the PERSISTENT flag:
- `BioFlags::PERSISTENT` set: The bio is preserved in the per-device bio retry list (allocated in umka-core memory, outside the driver's isolation domain). After driver reload, these bios are replayed automatically — cleared of `BIO_ERROR` flags and re-submitted to the new driver instance. Used for journal commits, superblock writes, and other I/O that cannot be silently lost.
Replay ordering: PERSISTENT bios are replayed in submission order
within each device (FIFO, matching the order they were originally submitted
to the driver). This preserves the filesystem's write-ordering assumptions
(e.g., journal commit block written after journal data blocks). For RAID
arrays, submission-order replay is critical: ascending-LBA replay could
reorder data and parity writes within a stripe, corrupting the parity. The
bio retry list is maintained as a FIFO queue (append on capture, replay
from head) during the crash recovery collection phase. Each captured bio
records a monotonically increasing capture_seq: u64 for tie-breaking if
bios are captured from multiple CPUs concurrently (sorted by capture_seq
after collection, before replay).
Replay set size bound: The maximum number of PERSISTENT bios retained per
device is bounded by MAX_PERSISTENT_BIOS_PER_DEVICE (default: 256). This
limit corresponds to the maximum number of concurrent journal commit + superblock
writes that can be in-flight for a single block device. If the retry list exceeds
this limit (indicating a pathological workload or a stuck driver), the oldest bios
are drained with -EIO to prevent unbounded memory consumption. The limit is
configurable via umka.max_persistent_bios=N.
Error handling during replay: If a replayed bio fails on the new driver
instance (the new driver returns an error for that I/O), the error is handled
as follows:
- The failed bio is logged via FMA with severity Degraded, including the
LBA range, bio flags, and error code.
- The bio is skipped (not retried again) — replay continues with the next bio
in the sorted list. The failed bio's completion callback fires with the error
status from the new driver.
- The block device is marked DEGRADED in the volume layer state machine.
If the device is part of a RAID array, the standard degraded-mode handling
applies (parity reconstruction for RAID5/6, mirror failover for RAID1).
- Replay does NOT abort on a single failure. All remaining PERSISTENT bios
are still replayed — a transient error on one LBA range should not prevent
replay of unrelated ranges.
- `BioFlags::PERSISTENT` not set: The bio is drained with `bio.status = -EIO`. The completion callback fires with error status. Applications retry via standard I/O error handling (fsync retry, read retry). This avoids replaying stale data bios whose contents may have been superseded.
Filesystem layers set PERSISTENT on critical I/O:
- ext4/XFS/Btrfs journal commits: BioFlags::FUA | BioFlags::PERSISTENT
- Superblock writes: BioFlags::PREFLUSH | BioFlags::FUA | BioFlags::PERSISTENT
- Regular data writes: no PERSISTENT — drained with -EIO on crash
15.2.3.2 Bio Lifecycle and Ownership¶
Bio objects are allocated from a dedicated permanent slab cache (bio_slab).
This slab is marked PERMANENT — it is never garbage-collected, ensuring that
bio allocation latency remains bounded even under sustained memory pressure.
Ownership rules:
- `bio_submit_and_wait()` (synchronous): The caller owns the bio for its entire lifetime. After `bio_submit_and_wait()` returns, the caller may reuse the bio for another I/O or drop it. The `Drop` impl frees `segments_ext` (if allocated) and returns the bio to the `bio_slab` cache.
- `bio_submit()` (asynchronous): After calling `bio_submit()`, ownership of the bio is logically transferred to the I/O completion path. The bio's `end_io` callback is responsible for either reusing the bio (e.g., for retry or chained I/O) or resuming RAII ownership (via `SlabBox::from_raw()`) so that dropping it returns the bio to the slab. The submitter must not access the bio after `bio_submit()` returns.
- Completion callback context: The completion callback fires in interrupt or softirq context (see Bio Completion Callback Constraints above). If the callback needs to perform work that may sleep (e.g., page cache updates), it must schedule a workqueue item and return immediately.
/// Allocate a bio from the bio slab cache.
///
/// Returns a `SlabBox<Bio>` — a slab-allocated owning pointer with RAII
/// `Drop`. `SlabBox<T>` wraps `NonNull<T>` + `&'static SlabCache<T>`;
/// the `Drop` impl calls `cache.free(ptr)`, making use-after-free
/// impossible by construction. No explicit `bio_free()` needed.
///
/// The previous `&'static mut Bio` return type was unsound: it claimed
/// the borrow had `'static` lifetime, but the bio may be freed (returned
/// to the slab) at any time — creating a dangling reference.
///
/// For the completion callback path (where ownership transfers to the
/// I/O stack), use `ManuallyDrop<SlabBox<Bio>>` or explicit consumption
/// via `SlabBox::into_raw()` / `SlabBox::from_raw()`.
///
/// `SlabBox<T>` is defined in [Section 4.2](04-memory.md#physical-memory-allocator--slabbox).
pub fn bio_alloc() -> SlabBox<Bio>;
Ownership model: SlabBox<Bio> provides type-safe slab lifetime management.
When dropped, the bio is returned to its originating SlabCache<Bio>. For bios
transferred to the completion path via bio_submit(), the submitter uses
ManuallyDrop::new(bio) to suppress the automatic drop; the completion callback
calls ManuallyDrop::into_inner() to resume RAII ownership, or uses
SlabBox::into_raw() / SlabBox::from_raw() for raw pointer interop with the
block layer's per-request state.
15.2.3.3 Bio-to-IoRequest Conversion¶
For block devices with an I/O scheduler attached (Section 15.18),
bios are converted to IoRequest objects before being dispatched to hardware
queues. This conversion bridges the filesystem-facing bio interface with the
scheduler-facing IoRequest interface.
/// Scheduler-facing I/O request. Created from a Bio by `bio_to_io_request()`.
/// The IoRequest wraps a Bio for the scheduler's merging/sorting/priority
/// logic. On completion, the scheduler calls `bio_complete()` on the
/// originating Bio, which invokes the Bio's `end_io` callback.
// Kernel-internal, not KABI.
pub struct IoRequest {
/// Starting logical block address.
pub lba: Lba,
/// Request length in **bytes** (not sectors). Corresponds to Linux's
/// `struct request.__data_len`. Named `len_bytes` (not `len`) to prevent
/// ambiguity between bytes and sectors.
pub len_bytes: u64,
/// I/O operation type. Uses BioOp directly (see [Section 15.2](#block-io-and-volume-management)).
pub op: BioOp,
/// Scheduling priority (derived from task + cgroup).
pub priority: IoPriority,
/// Submission timestamp (monotonic nanoseconds).
pub submit_ns: u64,
/// Deadline (set by the I/O scheduler on insertion). 0 = not yet assigned.
pub deadline_ns: u64,
/// PID of the submitting task.
pub pid: Pid,
/// Cgroup ID for accounting and throttling.
pub cgroup_id: u64,
/// Scatter-gather list of DMA-mapped segments.
pub sgl: DmaSgl,
/// Back-pointer to the originating Bio. The scheduler uses this to:
/// 1. Extract the Bio at dispatch time (`submit_bio` takes `&mut Bio`).
/// 2. Call `bio_complete()` on the Bio when hardware signals completion.
///
/// # Safety
/// The Bio is kept alive for the duration of the IoRequest's lifetime.
/// The submitter transfers ownership of the Bio to the I/O completion
/// path via `ManuallyDrop` / `SlabBox::into_raw()` at `bio_submit()`
/// time. The Bio is not freed until `bio_complete()` invokes `end_io`,
/// which may free it. The IoRequest must not outlive the Bio.
///
/// **Lifetime guarantee**: The Bio's slab allocation is pinned (the
/// `bio_slab` is PERMANENT — never garbage-collected). The raw pointer
/// remains valid until `bio_complete()` transitions the Bio to `Done`
/// and the `end_io` callback frees or recycles it.
pub bio: *mut Bio,
}
/// Convert a bio into an IoRequest for the I/O scheduler.
/// Called by `bio_submit()` when the target device has an I/O scheduler
/// attached (i.e., `bdev.io_queues` is `Some`).
///
/// The bio's LBA range, operation type, and memory segments are transferred
/// to the IoRequest. The bio pointer is stored as a back-reference for
/// dispatch (the driver's `submit_bio` takes `&mut Bio`) and completion
/// (the scheduler calls `bio_complete()` on the originating Bio).
///
/// Priority is derived from the submitting task's effective I/O priority.
///
/// For devices WITHOUT an I/O scheduler (e.g., NVMe with native multi-queue),
/// `bio_submit()` dispatches directly to `bdev.ops.submit_bio()` without
/// conversion.
/// **DmaSgl construction paths** (resolves BIO-07 — DmaSgl is NOT built twice):
///
/// - **Scheduler path** (`bio_to_io_request()`): DmaSgl is built HERE from
/// bio segments. The I/O scheduler stores it in `IoRequest.sgl`. At
/// dispatch time, the scheduler calls `bdev.ops.submit_request(req)`
/// which passes the pre-built SGL to the driver. The driver does NOT
/// rebuild it from the bio — it uses `req.sgl` directly.
///
/// - **Direct dispatch path** (NVMe with no scheduler): `bio_submit()`
/// calls `bdev.ops.submit_bio(bio)` directly. The NVMe driver builds
/// the DmaSgl inside its `submit_bio()` implementation. No IoRequest
/// is ever created — the DmaSgl is built exactly once.
///
/// In both paths, `DmaSgl::from_bio_segments()` is called exactly once
/// per bio. The two call sites are mutually exclusive (scheduler attached
/// vs no scheduler).
fn bio_to_io_request(bio: &mut Bio, task: &Task) -> IoRequest {
IoRequest {
lba: Lba(bio.start_lba),
// Device geometry is cached at registration time in the block device
// wrapper struct. No vtable dispatch per bio — BlockDeviceCachedParams
// holds immutable geometry discovered during device probe.
len_bytes: bio.total_sectors() as u64
* bio.bdev.cached_params.logical_block_size as u64,
op: bio.op,
priority: task.effective_io_priority(),
submit_ns: clock_monotonic_ns(),
deadline_ns: 0, // Populated by the I/O scheduler on insertion
pid: task.pid(),
cgroup_id: bio.cgroup_id,
sgl: DmaSgl::from_bio_segments(&bio.segments, bio.segments_ext.as_deref()),
// Store the bio pointer for dispatch and completion routing.
// The bio remains alive until bio_complete() is called.
bio: bio as *mut Bio,
}
}
15.2.3.4 Bio Submission Functions¶
The block I/O layer provides two free functions for submitting bios. All bio
submission flows through bio_submit(), which handles cgroup accounting and
throttling before dispatching to the device driver.
/// Submit a bio for asynchronous processing. Returns immediately.
/// The bio's `end_io` callback is invoked when I/O finishes.
///
/// **I/O scheduler path**: If the target block device has an I/O scheduler
/// attached (`bdev.io_queues.is_some()`), the bio is converted to an
/// `IoRequest` via `bio_to_io_request()` and submitted to the scheduler's
/// per-CPU queue ([Section 15.18](#io-priority-and-scheduling)). The scheduler merges,
/// reorders, and dispatches requests to hardware queues.
///
/// **Direct dispatch path**: If the device has no I/O scheduler (e.g.,
/// NVMe devices with hardware multi-queue), the bio is dispatched directly
/// to `bdev.ops.submit_bio()`.
pub fn bio_submit(bio: &mut Bio) {
// 1. Tag bio with originating cgroup for I/O accounting.
//
// **Readahead cgroup attribution**: Readahead bios are submitted by the
// readahead engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)) which may run in
// a kernel worker thread context (kworker), not the original task that
// triggered the readahead. Naively using `current_task().cgroup_id()`
// here would attribute readahead I/O to the root cgroup (the kworker's
// cgroup), starving the triggering task's cgroup of its I/O budget and
// allowing readahead to bypass cgroup I/O throttling entirely.
//
// To fix this, the readahead engine stamps `bio.cgroup_id` with the
// triggering task's cgroup context *before* calling `bio_submit()`.
// The readahead entry points (`page_cache_readahead()`,
// `page_cache_async_readahead()`) capture `current_task().cgroup_id()`
// at call time (when the triggering task is still current) and propagate
// it through the `ReadaheadControl` struct to all bios created during
// the readahead window. If `bio.cgroup_id` is already set (non-zero)
// when `bio_submit()` is entered, step 1 preserves it; only bios with
// an unset cgroup_id (zero) are tagged with `current_task().cgroup_id()`.
if bio.cgroup_id == 0 {
bio.cgroup_id = current_task().cgroup_id();
}
// 2. Check cgroup I/O throttling (Section 17.2.5).
if let Some(throttle) = cgroup_io_throttle(bio) {
throttle.wait_for_token(bio); // may sleep — caller must be in process context
}
// 3. Resolve backing device (concrete `BlockDevice` wrapper,
// gives access to both `ops` vtable and `cached_params`/`io_queues`).
let bdev = &bio.bdev;
// 4. Set bio state to Inflight. This must happen BEFORE dispatch so
// that bio_complete()'s CAS from Inflight→Completing succeeds.
// Without this, asynchronous callers would have state = uninitialized,
// and bio_complete's CAS from Inflight would silently fail.
bio.state.store(BioState::Inflight as u32, Release);
// 5. Dispatch: scheduler path or direct.
if let Some(ref queues) = bdev.io_queues {
let task = current_task();
let req = bio_to_io_request(bio, task);
// **Hot-path allocation note**: `IoRequest` is allocated from a
// dedicated per-CPU slab cache (`io_request_slab`), not from
// `Arc::new()` (general heap). The slab pool is sized at boot:
// `nr_cpus * 128` entries. `SlabArc::new(io_request_slab, req)`
// returns an `Arc<IoRequest>` backed by the slab cache, avoiding
// heap allocation on every I/O submission.
let arc_req = SlabArc::new(&queues.request_slab, req);
// See [Section 15.18](#io-priority-and-scheduling) for the `submit()` function
// (inserts into per-CPU scheduler queue and calls kick_dispatch).
submit(queues, arc_req, task);
} else {
// Direct dispatch — no scheduler.
// **Tier 1 domain crossing**: If the block device driver is Tier 1
// (post-boot NVMe/AHCI promotion), `bdev.ops.submit_bio()` is
// dispatched via `kabi_call!(bdev.block_handle, submit_bio, bio)`.
// The KABI transport serializes the Bio's relevant fields (sector,
// op, segments) into the T1CommandEntry argument buffer. The
// driver's consumer loop deserializes and programs DMA. Completion
// is signaled via the driver's KABI completion ring, which the
// Tier 0 block layer consumer converts back to `bio_complete()`.
// Domain crossing cost: ~23-46 cycles per bio (amortized with
// batched submission via plugging/unplug).
// Handle EAGAIN from the driver.
match kabi_call!(bdev.block_handle, submit_bio, bio) {
Ok(()) => {}
Err(e) if e == Error::AGAIN => {
// Descriptor ring full or device temporarily unavailable.
// Place on per-device requeue list. Bounded by
// `MAX_REQUEUE_DEPTH` (4096). The device's completion IRQ
// handler calls `blk_kick_requeue()` after freeing
// descriptors, which re-submits bios in FIFO order.
let mut requeue = bdev.requeue_list.lock();
if requeue.len() < requeue.capacity() {
requeue.push(RequeueEntry {
bio: bio as *mut Bio,
generation: bio.generation,
});
} else {
// Requeue list full — fail the bio immediately.
bio_complete(bio as *mut Bio, -(Error::NOSPC as i32));
}
}
Err(e) => {
// Permanent error — fail the bio.
bio_complete(bio as *mut Bio, -(e as i32));
}
}
}
}
/// Re-drain the per-device requeue list after the driver frees descriptors.
/// Called from the device's completion IRQ handler (or Tier 1 completion
/// ring consumer) after freeing DMA descriptors, to re-submit bios that
/// previously returned EAGAIN.
///
/// The requeue list is a bounded FIFO (`SpinLock<BoundedDeque<RequeueEntry, 4096>>`;
/// each entry stores the bio pointer plus a generation snapshot).
/// Bios are re-submitted in FIFO order. For each bio:
/// 1. Check bio state: if not `Inflight` (e.g., timeout handler already
/// transitioned to `TimedOut` or `Done`), skip — do not re-submit.
/// 2. Re-submit via `kabi_call!(bdev.block_handle, submit_bio, bio)`.
/// 3. On EAGAIN: push back to front of requeue list (will retry on next
/// completion). On permanent error: `bio_complete(bio, error)`.
pub fn blk_kick_requeue(bdev: &BlockDevice) {
    let mut list = bdev.requeue_list.lock();
    while let Some(entry) = list.pop_front() {
        // BIO-09 fix: Two-phase validation before dereferencing the bio pointer.
        //
        // Phase 1: Generation check. If the slab recycled the bio's memory
        // for a new allocation, the generation counter will differ. Reading
        // the generation field is safe because the slab does not return
        // pages to the page allocator (bio_slab is PERMANENT), so the
        // memory is always mapped and readable — we just may read a
        // different bio's generation value, which will not match.
        //
        // Phase 2: State check. Even if the generation matches, the timeout
        // handler may have transitioned the bio to TimedOut/Done. Only bios
        // still observed in Inflight are re-submitted.
        //
        // SAFETY: The slab memory backing entry.bio is always mapped
        // (PERMANENT slab). Reading generation and state is safe even if
        // the bio was freed and reallocated — we validate before any
        // mutation or callback invocation.
        let bio_ref = unsafe { &*entry.bio };
        // Phase 1: generation mismatch → slab recycled this slot.
        if bio_ref.generation != entry.generation {
            continue; // stale entry — skip
        }
        // Phase 2: state check — bio must still be Inflight.
        let state = bio_ref.state.load(Acquire);
        if state != BioState::Inflight as u32 {
            continue; // timeout handler already processed this bio
        }
        // SAFETY: generation matches AND state is Inflight → this is
        // the original bio, still alive and waiting for re-submission.
        let bio = unsafe { &mut *entry.bio };
        match kabi_call!(bdev.block_handle, submit_bio, bio) {
            Ok(()) => {}
            Err(e) if e == Error::AGAIN => {
                // Device is still busy — stop draining. Re-insert at the
                // front (preserving FIFO order); the next completion IRQ
                // will call blk_kick_requeue() again. Continuing the loop
                // would only collect further EAGAINs from the same device.
                list.push_front(entry);
                break;
            }
            Err(e) => {
                bio_complete(entry.bio, -(e as i32));
            }
        }
    }
}
/// Sentinel value indicating the bio has not yet completed. Set at submission;
/// cleared by completion handler with actual status (0 = success, negative = error).
const BIO_STATUS_PENDING: i32 = i32::MIN;
/// Default: 30 seconds. Linux has no single bio sync timeout; its
/// `BLK_DEFAULT_SG_TIMEOUT` (60s) applies to SG_IO passthrough, not
/// internal block I/O. UmkaOS uses 30 seconds for faster fault detection
/// on synchronous I/O paths (fsync, sync read). Tunable via sysctl
/// `block.sync_timeout_ms`.
const BIO_SYNC_TIMEOUT_MS: u64 = 30_000;
/// Synchronous bio submission: submits the bio and blocks until completion.
/// The caller must not hold any spinlocks — this function sleeps.
///
/// **Preferred pattern**: Uses a stack-allocated `StackBioWaiter` to avoid
/// the heap allocation of `Arc<WaitQueueHead>`. This is safe because the
/// caller's stack frame outlives the bio in this synchronous path — the
/// function blocks until completion or timeout. For high-fsync workloads
/// (databases doing thousands of fsync/sec), eliminating the ~50-100ns
/// Arc allocation per synchronous I/O is measurable.
///
/// ```
/// // StackBioWaiter: stack-allocated waiter for synchronous bio completion.
/// // The WaitQueueHead lives on the caller's stack and is borrowed by the bio.
/// // SAFETY: The caller blocks until completion, so the stack frame outlives
/// // the bio's reference to the waiter.
/// struct StackBioWaiter {
/// wq: WaitQueueHead,
/// }
/// ```
pub fn bio_submit_and_wait(bio: &mut Bio) -> Result<(), IoError> {
// Stack-allocated waiter — no heap allocation.
let waiter = StackBioWaiter { wq: WaitQueueHead::new() };
// Set the synchronous completion callback and store the waiter pointer
// in bio.private. The callback (bio_sync_end_io) reads bio.private
// to locate the stack-allocated WaitQueueHead and wakes it.
// SAFETY: waiter lives on this stack frame; we block below until
// the bio completes, so the waiter outlives the bio's reference.
bio.end_io = bio_sync_end_io;
bio.private = &waiter.wq as *const WaitQueueHead as usize;
bio.status.store(BIO_STATUS_PENDING, Ordering::Release);
bio.state.store(BioState::Inflight as u32, Ordering::Release);
bio_submit(bio);
let completed = waiter.wq.wait_event_timeout(
|| bio.status.load(Relaxed) != BIO_STATUS_PENDING,
BIO_SYNC_TIMEOUT_MS,
);
if !completed {
// I/O did not complete within the timeout. The bio is still
// in-flight in the device queue; it will complete eventually
// (or be aborted by the error handler).
//
// CRITICAL: Atomically claim the bio via CAS(Inflight→TimedOut).
// If we win the CAS, we own the bio — the completion handler will
// see TimedOut (not Inflight) and bail out without touching our
// stack-allocated waiter. If we lose the CAS, the device already
// transitioned to Completing/Done and the waiter was already signaled.
match bio.state.compare_exchange(
BioState::Inflight as u32,
BioState::TimedOut as u32,
Ordering::AcqRel,
Ordering::Acquire,
) {
Ok(_) => {
// We won: timeout path owns the bio. The device completion
// handler will see TimedOut and skip the callback. Safe to
// return (our stack frame is about to be destroyed, but the
// completion handler will not dereference the StackWaiter).
bio.state.store(BioState::Done as u32, Ordering::Release);
return Err(IoError::TimedOut);
}
Err(_) => {
// Device completed between the timeout and our CAS.
// The waiter was signaled; fall through to check status.
}
}
}
// Verify state reached Done (completion handler sets this).
debug_assert!(bio.state.load(Ordering::Relaxed) == BioState::Done as u32);
match bio.status.load(Acquire) {
0 => Ok(()),
err => Err(IoError::from_errno(err)),
}
}
15.2.3.5 Cgroup I/O Throttling¶
Where throttling occurs: In bio_submit(), before dispatch to bdev.ops.submit_bio().
This ensures every bio — whether from filesystems, raw block reads, or device-mapper —
passes through the cgroup I/O controller.
Throttling algorithm: Token-bucket rate limiter per device per cgroup:
pub struct IoThrottleState {
/// Tokens available (bytes or IOPS depending on limit type).
pub tokens: AtomicI64,
/// Refill rate (bytes/sec or IOPS from io.max).
pub rate: u64,
/// Last refill timestamp (nanoseconds).
pub last_refill_ns: AtomicU64,
/// Wait queue for throttled bios.
pub waiters: WaitQueueHead,
}
Integration with io.max: When io.max is set on a cgroup, an IoThrottleState
is created per (cgroup, device) pair. cgroup_io_throttle() looks up the throttle
state from the cgroup's IoController using the bio's cgroup_id and bdev. If the
token bucket has insufficient tokens, the calling task sleeps on waiters until tokens
are refilled — the refill rate is derived directly from the io.max limit values.
Bypass: If no io.max is set for the bio's cgroup (or the cgroup is the root cgroup),
cgroup_io_throttle() returns None — near-zero overhead on the unthrottled path: no atomic
operations, no lock acquisitions, just a single well-predicted branch.
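The refill-and-consume step of the token bucket can be sketched as follows. This is an illustrative stand-in for IoThrottleState — names, fields, and the boolean return are assumptions; the real implementation uses atomics and sleeps the task on waiters instead of returning false:

```rust
/// Simplified, non-atomic model of the per-(cgroup, device) token bucket.
struct TokenBucket {
    tokens: i64,        // available tokens (bytes)
    rate: u64,          // refill rate, bytes/sec (from io.max)
    burst: i64,         // cap: at most one second of tokens
    last_refill_ns: u64,
}

impl TokenBucket {
    fn new(rate: u64) -> Self {
        Self { tokens: rate as i64, rate, burst: rate as i64, last_refill_ns: 0 }
    }

    /// Refill based on elapsed time, then try to consume `cost` tokens.
    /// Returns true if the bio may dispatch now; false means the caller
    /// would sleep on the wait queue until the next refill.
    fn try_consume(&mut self, now_ns: u64, cost: i64) -> bool {
        let elapsed = now_ns.saturating_sub(self.last_refill_ns);
        let refill = (elapsed as u128 * self.rate as u128 / 1_000_000_000) as i64;
        self.tokens = (self.tokens + refill).min(self.burst);
        self.last_refill_ns = now_ns;
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}
```

Capping `tokens` at one second of budget (`burst`) bounds the backlog a briefly idle cgroup can dump on the device in one go.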
Cross-reference: See Section 17.2 for cgroup io controller configuration.
15.2.3.6 Block Device Page Cache (Buffer Cache)¶
Block device special files (/dev/sda, /dev/nvme0n1p1) are accessed through the
page cache just like regular files. Every block device has an associated Inode
(the "bdev inode") whose i_mapping (AddressSpace) caches raw block data. This
is the "buffer cache" — reads from /dev/sda via read(2) or dd go through
this page cache; O_DIRECT bypasses it.
bdev inode creation: When a block device is first opened, the VFS creates
(or reuses) a bdev inode in the bdevfs pseudo-filesystem. The bdev inode's
i_size is set to the device's capacity in bytes. Its i_mapping.ops is set
to BDEV_ADDRESS_SPACE_OPS.
/// AddressSpaceOps for block device special files.
/// These translate page cache operations into raw block I/O
/// without filesystem metadata interpretation.
pub static BDEV_ADDRESS_SPACE_OPS: &dyn AddressSpaceOps = &BdevAddressSpaceOps;
struct BdevAddressSpaceOps;
impl AddressSpaceOps for BdevAddressSpaceOps {
/// Read a single page of raw block data.
/// Builds a Bio targeting the page's byte offset ÷ logical_block_size,
/// submits it, and waits for completion.
/// The caller (`filemap_get_pages`) has already allocated a page, inserted
/// it into the page cache XArray via `try_store`, and set `PageFlags::LOCKED`.
/// This function fills the page with data from the block device. It must NOT
/// allocate a new page or overwrite the XArray slot — doing so would orphan
/// the original locked page and deadlock concurrent waiters.
fn read_page(
&self,
mapping: &AddressSpace,
index: u64,
page: &Arc<Page>,
) -> Result<(), IoError> {
let bdev = bdev_from_inode(mapping.host())?;
let lba = (index * PAGE_SIZE as u64)
/ bdev.cached_params.logical_block_size as u64;
let mut bio = Bio::new_read(bdev, lba, page);
bio_submit_and_wait(&mut bio)?;
Ok(())
}
/// Write a dirty page of raw block data back to the device.
/// For async writeback (sync_mode == Background), submits the bio and returns
/// immediately — errors are delivered via the bio completion callback
/// which sets `AS_EIO`/`AS_ENOSPC` on the address space. For sync
/// writeback, uses `bio_submit_and_wait()` to block until completion.
fn writepage(
&self,
mapping: &AddressSpace,
page: &Page,
wbc: &WritebackControl,
) -> Result<(), IoError> {
let bdev = bdev_from_inode(mapping.host())?;
// page.index is shorthand for Page::index_or_freelist (the page-cache
// file offset in page-sized units). Only valid for page-cache pages
// (not slab pages, where this field is unused). See Page struct in
// [Section 4.3](04-memory.md#slab-allocator--page-frame-descriptor).
let lba = (page.index as u64 * PAGE_SIZE as u64)
/ bdev.cached_params.logical_block_size as u64;
let mut bio = Bio::new_write(bdev, lba, page);
// WritebackSyncMode defined in [Section 4.6](04-memory.md#writeback-subsystem--writebacksyncmode).
if wbc.sync_mode == WritebackSyncMode::Background {
// Async writeback: fire-and-forget. Errors are reported via
// the bio completion callback → address_space error flags.
bio.flags |= BioFlags::ASYNC;
bio_submit(&mut bio);
Ok(())
} else {
// Sync writeback (fsync path): block until I/O completes.
bio_submit_and_wait(&mut bio)
}
}
/// No releasepage needed for raw block devices.
fn releasepage(&self, _page: &Page) -> bool { true }
}
Data flow for raw block reads (e.g., dd if=/dev/sda bs=4096 count=1):
read(fd, buf, 4096)
→ vfs_read()
→ generic_file_read_iter() // same as regular files
→ filemap_get_pages(mapping, pgoff=0) // check page cache
→ [cache miss] → BdevAddressSpaceOps::read_page()
→ Bio { op: Read, start_lba: 0, segments: [page] }
→ bio_submit() → BlockDeviceOps::submit_bio()
→ [I/O completion] → page installed in cache
→ copy_to_user(buf, page_data, 4096)
Subsequent reads of the same block hit the page cache directly (no I/O). The
bdev page cache is invalidated by blkdev_invalidate_pages() when a partition
is re-read or a device is closed with exclusive access.
15.2.3.7 Writeback I/O Completion Callback¶
Async writeback bios submitted by BdevAddressSpaceOps::writepage() (and filesystem
writepage implementations) use a dedicated completion callback to update page cache
state and propagate errors to fsync() waiters.
/// Writeback I/O completion callback. Called from the `blk-io` workqueue
/// (deferred from interrupt/softirq context — see Bio Completion Callback
/// Constraints above) when an async writeback bio completes.
///
/// This function bridges the block I/O completion path and the page cache
/// error reporting path. It is the sole point where writeback errors are
/// recorded on the `AddressSpace` — all writeback paths (bdev, ext4, XFS,
/// Btrfs) use this callback or a filesystem-specific variant that calls
/// the same `wb_err` update logic.
///
/// Handles page cache updates after writeback I/O completes: clears
/// WRITEBACK/DIRTY flags, decrements `nr_dirty`, and records errors on
/// the AddressSpace via ErrSeq.
///
/// **IRQ-safety**: This function is NOT called directly from
/// `bio_complete()`. Instead, `writeback_end_io_deferred` (the `end_io`
/// callback set by the writeback path) schedules this function on the
/// `blk-io` workqueue. This allows it to perform page cache
/// operations (xa_lock, wait queue wake) that are forbidden in IRQ context.
///
/// **Counter ownership**: This function is the SOLE owner of the
/// DIRTY→clean transition and `nr_dirty` decrement for Tier 0 (in-kernel)
/// filesystems. For Tier 1 filesystems, the Tier 0 `WritebackResponse`
/// handler (step 11 in [Section 4.6](04-memory.md#writeback-subsystem)) owns the transition
/// instead — `writeback_end_io` is NOT called for Tier 1 writeback bios.
fn writeback_end_io(bio: &mut Bio, status: i32) {
let errno = status;
// Iterate ALL segments in the bio. A single writeback bio may span multiple
// pages (bio_add_page() coalesces contiguous pages). Processing only
// segments[0] would silently drop error/completion handling for all
// subsequent pages, causing writeback hangs (tasks blocked on
// wait_on_page_writeback() for pages whose WRITEBACK flag is never cleared)
// and silent data loss (pages left in WRITEBACK state, never re-dirtied on error).
// Capture the mapping reference ONCE before the per-page loop. After the
// loop clears WRITEBACK, a concurrent truncate_inode_pages() can null out
// page.mapping — so post-loop access to page.mapping() is unsafe. This
// capture is safe because WRITEBACK is still set (pinning the mapping).
// SAFETY: bio.segments is non-empty (guaranteed by bio_submit validation).
let mapping_for_error = unsafe { (&*bio.segments[0].page).mapping() };
for seg in &bio.segments {
// SAFETY: page validity is guaranteed by page_get() in bio_add_page().
// The page is pinned for the bio's lifetime (page_put in bio_endio).
let page = unsafe { &*seg.page };
// Track whether this page should be re-dirtied for retry.
let mut should_redirty = false;
if errno == 0 {
// Success path: page is now clean on disk.
// **Counter maintenance**: decrement nr_dirty (balancing the increment
// in __set_page_dirty). If this decrement is omitted,
// balance_dirty_pages() will eventually throttle all writes to zero.
//
// **page.mapping()** resolves the `AtomicPtr<u8>` in `Page.mapping`
// to `&AddressSpace`:
// ```rust
// impl Page {
// /// Resolve the mapping pointer to a typed AddressSpace reference.
// /// SAFETY: The caller must ensure the page is still attached to
// /// an AddressSpace (i.e., not truncated). For pages in writeback,
// /// this is guaranteed: the WRITEBACK flag pins the mapping.
// pub unsafe fn mapping(&self) -> &AddressSpace {
// &*(self.mapping.load(Acquire) as *const AddressSpace)
// }
// /// Wake tasks waiting for this page's writeback/lock to complete.
// pub fn wake_waiters(&self) {
// self.waiters.wake_up_all();
// }
// }
// ```
page.flags.fetch_and(!PageFlags::DIRTY, Release);
// SAFETY: page is in writeback — mapping is pinned.
unsafe { page.mapping() }.page_cache.as_ref().unwrap().nr_dirty.fetch_sub(1, Relaxed);
} else {
// Error path: mark page for retry and record error on AddressSpace.
page.flags.fetch_or(PageFlags::ERROR, Release);
// Check consecutive failure count. After 3 failures, mark the page
// PERMANENT_ERROR and exclude from future writeback (matching the
// writeback subsystem's retry policy in [Section 4.6](04-memory.md#writeback-subsystem)).
let fail_count = page.wb_fail_count.fetch_add(1, Relaxed) + 1;
if fail_count >= 3 {
page.flags.fetch_or(PageFlags::PERMANENT_ERROR, Release);
// Page excluded from writeback. fsync() returns -EIO.
// FMA event emitted by the writeback subsystem.
} else {
should_redirty = true;
}
// Record the error on the AddressSpace for fsync() reporting
// via ErrSeq (errseq_t pattern). set_err() atomically increments
// the sequence counter and stores the errno. Concurrent fsync()
// callers on different fds each observe the error exactly once.
// The error is recorded once per AddressSpace (not once per page) —
// set_err() is idempotent within the same generation.
// SAFETY: page is in writeback — mapping is pinned.
let mapping = unsafe { page.mapping() };
mapping.wb_err.set_err(errno as i32);
// Set legacy AS_EIO / AS_ENOSPC flags on the AddressSpace for
// backward compatibility with callers that check flags directly
// (older filesystems, memory-mapped I/O error detection). These
// flags complement the ErrSeq mechanism — both must be set.
if errno == -(ENOSPC as i32) {
mapping.flags.fetch_or(AS_ENOSPC, Release);
} else {
mapping.flags.fetch_or(AS_EIO, Release);
}
// Re-dirty the page so the writeback subsystem will retry it on
// the next writeback cycle (only for non-permanent errors).
if should_redirty {
page.flags.fetch_or(PageFlags::DIRTY, Release);
// SAFETY: page is in writeback — mapping is pinned.
unsafe { page.mapping() }.page_cache.as_ref().unwrap().nr_dirty.fetch_add(1, Relaxed);
}
}
// Unified completion path for ALL outcomes (success, retryable error,
// permanent error). Order matters: decrement nrwriteback FIRST, then
// clear WRITEBACK flag. If we cleared WRITEBACK first, a concurrent
// fsync() could observe WRITEBACK cleared (page "done") while
// nrwriteback still counts it, leading to stale count or missed waiters.
// SAFETY: page is in writeback — mapping is pinned.
unsafe { page.mapping() }.nrwriteback.fetch_sub(1, Release);
page.flags.fetch_and(!PageFlags::WRITEBACK, Release);
// Wake any tasks blocked in fsync() or sync_page() waiting for this
// page's writeback to complete. The waiters check wb_err after waking
// to detect and propagate errors.
page.wake_waiters();
}
// Check filesystem error mode ONCE per bio (not per page). Multiple pages
// in the same bio share the same AddressSpace and superblock. The error mode
// action (continue/remount-ro/panic) is a per-filesystem decision, not per-page.
//
// IMPORTANT: `mapping_for_error` was captured BEFORE the per-page loop
// cleared WRITEBACK flags. After WRITEBACK is cleared, a concurrent
// truncate_inode_pages() could remove the page from the page cache and
// null out page.mapping — making a post-loop page.mapping() dereference
// unsafe (null pointer / use-after-free). The pre-loop capture avoids
// this race.
if errno != 0 {
check_fs_error_mode(mapping_for_error.host().superblock());
}
// Dirty extent protocol completion: if this page was part of a dirty
// extent reservation ([Section 14.1](14-vfs.md#virtual-filesystem-layer--copy-on-write-and-redirect-on-write-infrastructure)),
// the filesystem's own completion callback (registered in bio.private)
// calls vfs_flush_extent_complete(token) AFTER this function returns.
// writeback_end_io() handles only the page cache and errseq_t updates;
// the filesystem completion callback handles journal commit, extent tree
// updates, and dirty extent token release. For bdev (raw block) I/O,
// no dirty extent token exists — this step is a no-op.
}
/// Check the filesystem's error-handling policy and take the appropriate
/// action for a writeback I/O error. Called from `writeback_end_io()` after
/// the error has been recorded on the `AddressSpace` (wb_err, AS_EIO/AS_ENOSPC).
///
/// The superblock's `s_error_behavior` field (set at mount time via the `errors=`
/// mount option) determines the response:
///
/// - `FsErrorMode::Continue` — log the error via FMA
/// ([Section 20.1](20-observability.md#fault-management-architecture)), continue operation. The re-dirtied
/// page will be retried on the next writeback cycle.
/// - `FsErrorMode::RemountRo` — set `SB_RDONLY` on the superblock, log via FMA.
/// Subsequent write operations return `EROFS`. Read operations continue.
/// - `FsErrorMode::Panic` — kernel panic. Used by filesystems where data
/// integrity is critical and unrecoverable corruption is worse than downtime
/// (e.g., ext4 with `errors=panic`, XFS default on metadata error).
///
/// This function does not return a value — in the `Panic` case, it does not
/// return at all. In `Continue` and `RemountRo` cases, `writeback_end_io()`
/// proceeds to wake waiters.
fn check_fs_error_mode(sb: &SuperBlock) {
match sb.s_error_behavior {
FsErrorMode::Continue => {
fma_report(sb.device_handle, HealthEventClass::Storage,
FMA_WRITEBACK_IO_ERROR, HealthSeverity::Warning, &[]);
}
FsErrorMode::RemountRo => {
sb.flags.fetch_or(SB_RDONLY, Release);
fma_report(sb.device_handle, HealthEventClass::Storage,
FMA_WRITEBACK_IO_ERROR_REMOUNT_RO, HealthSeverity::Major, &[]);
}
FsErrorMode::Panic => {
panic!("writeback I/O error on {:?} with errors=panic", sb.dev_name);
}
}
}
Callback registration: The writeback path sets bio.end_io =
writeback_end_io_deferred before calling bio_submit(). The
writeback_end_io_deferred callback enqueues a workqueue item that
calls writeback_end_io() on the blk-io workqueue in process context,
not in IRQ/softirq context — this is required because the function
performs page cache operations (xa_lock, wait queue wake, nr_dirty
decrement) that are forbidden under IRQ-disabled spinlocks.
For filesystem-specific writeback (ext4 journal, XFS log), the filesystem
provides its own end_io callback that schedules filesystem-specific
deferred work (e.g., updating journal state) and then calls
writeback_end_io() for the common page cache update logic.
Error propagation to userspace: When fsync(fd) is called, the VFS
reads the file's AddressSpace.wb_err and compares it against the fd's
file.f_wb_err (stamped at open() time). If the generation has advanced
with a non-zero errno, fsync() returns that errno to the caller and
updates file.f_wb_err to the current generation (so the error is reported
exactly once per fd, matching Linux errseq_t semantics).
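The exactly-once contract can be modeled in a few lines. This is a non-atomic sketch with assumed names (ErrSeq, AddressSpaceErr, check_and_advance) — it captures the generation-comparison semantics, not the kernel's lock-free implementation:

```rust
/// (errno, generation) pair. The per-file cursor is a copy of this
/// stamped at open() time.
#[derive(Clone, Copy)]
struct ErrSeq { errno: i32, gen: u64 }

struct AddressSpaceErr { wb_err: ErrSeq }

impl AddressSpaceErr {
    /// Writeback error path: bump the generation and store the errno.
    fn set_err(&mut self, errno: i32) {
        self.wb_err = ErrSeq { errno, gen: self.wb_err.gen + 1 };
    }

    /// fsync() path: if the generation has advanced past this fd's
    /// cursor, report the errno once and advance the cursor so the
    /// same error is never reported twice on this fd.
    fn check_and_advance(&self, f_wb_err: &mut ErrSeq) -> i32 {
        if self.wb_err.gen != f_wb_err.gen {
            *f_wb_err = self.wb_err;
            self.wb_err.errno
        } else {
            0
        }
    }
}
```

Because each fd carries its own cursor, two processes fsync()ing the same file each observe one error, and neither masks the other — the property the pre-errseq_t Linux code lacked.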
Cross-references:
- AddressSpace and AddressSpaceOps: Section 14.1
- Bio and BlockDeviceOps: §15.2.3 (above)
- Page cache dirty tracking: Section 4.2
- Writeback subsystem and dirty page lifecycle: Section 4.6
15.2.4 Device-Mapper and Volume Management¶
Device-mapper framework — UmkaOS implements a device-mapper layer in umka-block with standard targets:
| Target | Description | Linux equivalent |
|---|---|---|
| dm-linear | Simple linear mapping | dm-linear |
| dm-striped | Stripe across N devices | dm-stripe |
| dm-mirror | Synchronous mirror (RAID-1) | dm-mirror |
| dm-crypt | Transparent encryption (AES-XTS) | dm-crypt |
| dm-verity | Read-only integrity verification | dm-verity |
| dm-snapshot | Copy-on-write snapshots | dm-snapshot |
| dm-thin | Thin provisioning with overcommit | dm-thin-pool |
LVM2 metadata compatibility — UmkaOS reads the LVM2 on-disk metadata format (PV
headers, VG descriptors, LV segment maps) and constructs logical volumes using
device-mapper targets. Existing LVM2 volume groups created under Linux are usable
without conversion. LVM2 userspace tools (lvm, pvs, vgs, lvs) work unmodified
via the standard device-mapper ioctl interface.
Software RAID — RAID levels 0/1/5/6/10 are implemented as device-mapper targets.
MD superblock formats (0.90, 1.0, 1.2) are read for compatibility with existing
Linux mdadm arrays. mdadm works unmodified. The RAID5/6 write hole is closed
by the stripe log mechanism (Section 15.2.5) — auto-enabled on import for md 1.1/1.2 arrays.
Recovery-aware volume layer — This is where UmkaOS diverges meaningfully from Linux. Block device temporary disappearance during Tier 1 driver reload (~50-150ms) does NOT mark the device as failed:
Volume Layer State Machine:
DEVICE_ACTIVE → Normal I/O flow
DEVICE_RECOVERING → Driver reload in progress, I/O queued
DEVICE_FAILED → Device permanently gone, failover/degrade
Transition rules:
ACTIVE → RECOVERING: When driver supervisor signals reload start
RECOVERING → ACTIVE: When new driver instance signals ready (typical: <100ms)
RECOVERING → FAILED: When recovery timeout expires (default: 5 seconds)
- During DEVICE_RECOVERING, the volume layer pauses I/O in its ring buffer. No requests are failed; they simply wait.
- RAID resync is NOT triggered for sub-100ms driver reloads — the array stays clean. The volume layer distinguishes "device temporarily gone for driver reload" from "device removed from bus" by checking the driver supervisor state.
- If the recovery window exceeds the configurable timeout (default 5s), the device transitions to DEVICE_FAILED and normal degraded-mode behavior applies (RAID rebuilds, error returns for non-redundant volumes).
- dm-verity for verified boot is already designed (Section 9.3).
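The transition rules above reduce to a pure function over (state, event). State and event names below mirror the table; the enum and function names themselves are illustrative:

```rust
/// Volume-layer device states, as in the state machine above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DevState { Active, Recovering, Failed }

/// Events delivered by the driver supervisor and the recovery timer.
#[derive(Clone, Copy)]
enum DevEvent { ReloadStart, DriverReady, RecoveryTimeout }

/// Apply one event to the state machine. Events that are meaningless
/// in the current state are ignored (state is returned unchanged).
fn transition(state: DevState, ev: DevEvent) -> DevState {
    match (state, ev) {
        (DevState::Active, DevEvent::ReloadStart) => DevState::Recovering,
        (DevState::Recovering, DevEvent::DriverReady) => DevState::Active,
        (DevState::Recovering, DevEvent::RecoveryTimeout) => DevState::Failed,
        (s, _) => s,
    }
}
```

Keeping the transition table total (every (state, event) pair handled) means a late DriverReady arriving after the timeout cannot resurrect a device already marked Failed.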
DmTarget trait — Every device-mapper target must implement this trait. All methods are called with preemption disabled; no sleeping is permitted.
/// Error conditions for device-mapper target operations.
pub enum DmError {
/// Underlying block device returned an I/O error.
IoError,
/// Target received a bio with an invalid sector range (outside target bounds).
InvalidMapping,
/// Target-specific error (e.g., integrity check failed for dm-verity).
TargetError,
/// Target is suspended and cannot process bios.
Suspended,
/// No space available on the underlying device (thin provisioning exhausted).
NoSpace,
/// Underlying device is busy (e.g., being removed).
DeviceBusy,
}
/// Result of mapping a bio to an underlying device.
pub enum DmMapResult {
/// Bio submitted to the underlying device; device-mapper is done.
Submitted,
/// Bio remapped in place (`bio.dev` and `bio.sector` updated); caller submits.
Remapped,
/// Bio must be requeued (target is suspending or temporarily unavailable).
Requeue,
/// Bio failed.
Error(DmError),
}
pub enum DmStatusType {
/// Return human-readable target status (I/O counts, health).
Status,
/// Return the target table string (as loaded by `dmsetup`).
Table,
}
/// Output buffer for `DmTarget::status()`.
pub struct DmStatusBuf<'a> {
pub buf: &'a mut [u8],
pub len: usize, // bytes written so far
}
impl<'a> DmStatusBuf<'a> {
pub fn write_fmt(&mut self, args: core::fmt::Arguments<'_>);
pub fn write_str(&mut self, s: &str);
}
/// Core trait every device-mapper target must implement.
pub trait DmTarget: Send + Sync {
/// Map a bio to the underlying device(s). May update `bio.dev` and `bio.sector`.
fn map(&self, bio: &mut Bio) -> DmMapResult;
/// Write human-readable status or table string into `result`.
/// Used for `/sys/block/dmN/dm/name` and the `DM_TABLE_STATUS` ioctl.
fn status(&self, type_: DmStatusType, result: &mut DmStatusBuf<'_>) -> Result<(), DmError>;
/// Iterate over all constituent block devices, calling `cb` for each with the
/// (device, start_sector, length_sectors) tuple. Used by sysfs topology, iostat,
/// and blk-integrity propagation.
fn iterate_devices(
&self,
cb: &mut dyn FnMut(&BlockDevice, u64, u64) -> i32, // (dev, start_sector, len_sectors)
) -> i32;
/// Handle a device-mapper message (from the `DM_MESSAGE` ioctl). Optional.
/// **Return convention**: 0 on success, negative errno on failure (e.g.,
/// `-libc::EINVAL` for unrecognized message, `-libc::EOPNOTSUPP` if the
/// target does not support messages). Matches Linux `dm_target_type::message`.
fn message(&self, _argc: u32, _argv: &[&str], _result: &mut DmStatusBuf<'_>) -> i32 { -libc::EOPNOTSUPP }
/// Called before target is suspended (e.g., for live resize or snapshot).
fn presuspend(&self) {}
fn postsuspend(&self) {}
/// Called when target resumes after suspension.
fn resume(&self) {}
/// Target type name (e.g., `"linear"`, `"crypt"`, `"verity"`).
fn name(&self) -> &'static str;
/// Target version tuple for `DM_LIST_VERSIONS`. Follows semver (major, minor, patch).
fn version(&self) -> (u32, u32, u32);
}
/// Registration record. Each target type registers at boot via `dm_register_target()`.
pub struct DmTargetType {
pub name: &'static str,
pub version: (u32, u32, u32),
/// Create a new target instance from a device-mapper table entry.
pub create: fn(
ti: &DmTableInfo,
argc: u32,
argv: &[&str],
) -> Result<Arc<dyn DmTarget>, DmError>,
}
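As a concrete example of the Remapped convention, a dm-linear map() reduces to a bounds check plus offset arithmetic. The sketch below uses simplified stand-in types (a Bio with only dev/sector, a two-variant outcome enum) rather than the real DmTarget machinery — only the sector arithmetic mirrors the actual target:

```rust
/// Stand-in for the kernel Bio: just the fields map() may update.
struct Bio { dev: u32, sector: u64 }

/// Stand-in for DmMapResult, reduced to the two cases dm-linear hits.
#[derive(PartialEq, Debug)]
enum MapOutcome { Remapped, InvalidMapping }

struct DmLinear {
    dest_dev: u32,   // underlying block device id
    dest_start: u64, // start sector on the underlying device
    len: u64,        // target length in sectors
}

impl DmLinear {
    /// Remap bio.sector (an offset within this target) onto the
    /// underlying device. Per the Remapped convention, the caller
    /// (device-mapper core) submits the updated bio.
    fn map(&self, bio: &mut Bio) -> MapOutcome {
        if bio.sector >= self.len {
            return MapOutcome::InvalidMapping;
        }
        bio.dev = self.dest_dev;
        bio.sector = self.dest_start + bio.sector;
        MapOutcome::Remapped
    }
}
```

A bio at target-relative sector 10 on a target mapped at dest_start 2048 is dispatched to sector 2058 of the underlying device.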
15.2.5 RAID Write Hole Mitigation¶
The RAID5/6 write hole is a fundamental problem: updating data and parity chunks in a stripe requires multiple disk writes. Power failure between writes leaves the stripe inconsistent — parity doesn't match data. On rebuild, the wrong data is reconstructed from the mismatched parity. Silent data corruption.
Linux's approaches: write-intent bitmap (knows which stripes are dirty but not which chunks completed — insufficient for reconstruction), PPL (Partial Parity Log, stores parity diffs in the metadata region — 30-40% write overhead, RAID5 only, max 64 disks), journal device (full stripe journal on a separate device — effective but requires extra hardware).
UmkaOS provides two solutions depending on whether the array was created by UmkaOS or imported from Linux.
15.2.5.1 New UmkaOS Arrays: Inline Per-Chunk Metadata¶
New RAID5/6 arrays created by UmkaOS use a native format with per-chunk metadata:
/// Stored at the start of each chunk in UmkaOS-native RAID arrays.
/// Cost: 16 bytes per chunk (0.02% for 64 KB chunks — negligible).
/// On-disk format: all multi-byte fields use Le types to ensure disks are
/// portable across architectures (PPC32/s390x big-endian ↔ x86-64 little-endian).
/// Le* types defined in [Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types).
#[repr(C)]
pub struct ChunkMeta {
/// Stripe write sequence number. Incremented on every stripe update.
/// All chunks in a consistent stripe have the same seq value.
pub seq: Le64,
/// CRC32C of the chunk data (excluding this header). Hardware-accelerated
/// on all 8 architectures (x86 SSE4.2, ARM CRC32, RISC-V Zbc, PPC vpmsum, s390x KIMD, LoongArch CRC32).
pub checksum: Le32,
pub _reserved: Le32,
}
// On-disk format: seq(8) + checksum(4) + _reserved(4) = 16 bytes.
const_assert!(core::mem::size_of::<ChunkMeta>() == 16);
Write path (zero extra I/O):
1. Read old data + old parity (standard RAID5 read-modify-write).
2. Compute new parity.
3. Write all modified chunks with seq = old_seq + 1 and updated CRC32C.
4. No journal. No extra I/O. Just 16 extra bytes per chunk write.
Recovery after crash (scans dirty stripes from write-intent bitmap):
1. For each dirty stripe: read all chunk seq values and checksum fields.
2. All chunks have same seq AND all checksums valid → stripe is consistent.
3. Mixed seq values → partial write detected:
- Chunks with the lower seq (old) form a mutually consistent set.
- Recompute parity from the old-seq chunks. The partial write is rolled back.
- Chunks with the higher seq whose checksum does NOT match their data
(seq written, data partially written) are also detected and treated as old.
4. The in-flight write is lost, but the filesystem journal replays the logical
operation. The application sees "write completed" or "write didn't happen" —
never corruption.
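Recovery steps 2-3 amount to a classification over the stripe's chunk headers. A hedged sketch, where Chunk is an illustrative stand-in for ChunkMeta plus a verified-checksum bit:

```rust
/// Stand-in: one entry per chunk of a dirty stripe, after reading its
/// ChunkMeta and verifying the CRC32C against the chunk data.
struct Chunk { seq: u64, checksum_ok: bool }

/// Step 2: a stripe is consistent iff every checksum verifies and all
/// chunks carry the same sequence number.
fn stripe_consistent(chunks: &[Chunk]) -> bool {
    chunks.iter().all(|c| c.checksum_ok)
        && chunks.windows(2).all(|w| w[0].seq == w[1].seq)
}

/// Step 3: chunks to treat as "old" when rolling back a partial write —
/// those carrying the lower seq, plus any higher-seq chunk whose
/// checksum fails (seq written but data torn).
fn stale_chunks(chunks: &[Chunk]) -> Vec<usize> {
    let max_seq = chunks.iter().map(|c| c.seq).max().unwrap_or(0);
    chunks.iter().enumerate()
        .filter(|(_, c)| c.seq < max_seq || !c.checksum_ok)
        .map(|(i, _)| i)
        .collect()
}
```

Parity is then recomputed treating the stale set as the pre-write contents, which is exactly the rollback described in step 3.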
Performance cost: effectively zero. 16 bytes per 64 KB+ chunk. CRC32C: ~1 ns per 4 KB on hardware-accelerated platforms. No journal device. No extra fsync.
Trade-off: the in-flight write is rolled back (not preserved). For workloads that need the write to survive power failure, combine with a journal device (see tiered approach below).
On-disk format: UmkaOS-native arrays are not mountable by Linux md. Chunk data starts at offset 16 instead of offset 0. This is a deliberate design choice for new arrays — same situation as Btrfs RAID or ZFS (different format, stronger guarantees).
15.2.5.2 Imported Linux Arrays: Auto-Enabled Batched Stripe Log¶
Existing Linux md arrays must be usable read-write without offline conversion. UmkaOS auto-enables a batched stripe log in the existing metadata region of each member drive.
Compatibility assessment on import:
| md superblock version | Metadata region | Auto-enable stripe log? |
|---|---|---|
| 1.1 (most common) | ~1020 KB between superblock and data_offset (typically 1 MiB) | Yes — sufficient for in-flight stripe log |
| 1.2 | ~4 KB between superblock and data_offset (typically 4 KB offset) | Depends on data_offset; if ≥64 KB free, yes |
| 1.0 (superblock at end) | Space between data end and superblock | Depends — check actual free space |
| 0.90 (legacy) | 64 KB reserved at end | No — too small; FMA warning issued |
Stripe log format (stored in metadata region, circular buffer):
/// Header for the stripe log region on each member drive.
/// Placed at a fixed offset in the metadata region (after md superblock + bitmap).
/// On-disk format: all multi-byte fields use Le types to ensure disks are
/// portable across architectures. Le* types defined in
/// [Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types).
#[repr(C)]
pub struct StripeLogHeader {
/// Magic: 0x554D_534C ("UMSL" = UmkaOS Stripe Log).
pub magic: Le32,
/// Log version (currently 1).
pub version: Le32,
/// Usable log region size in bytes (metadata_region_free - sizeof(StripeLogHeader)).
/// **Bounded**: metadata region is at most a few MB; u32 (4 GB) is sufficient.
pub log_size: Le32,
/// Current write position in the circular log (byte offset from log start).
/// **Bounded**: wraps within log_size (circular buffer); always < log_size.
/// **Wrap semantics**: when a batch would straddle the end of the circular
/// log (`write_pos + batch_size > log_size`), the remaining bytes are
/// zero-filled and `write_pos` is reset to 0. The batch is written at
/// offset 0. Recovery skips zero-filled tail regions by checking for a
/// valid `StripeLogEntry` header (non-zero `stripe_id`).
pub write_pos: Le32,
/// Sequence number of the last flushed batch.
/// u64: monotonic, never wraps in practice (at 1M flushes/sec, lasts 584K years).
/// **Invariant**: `flush_seq` is always equal to the highest `batch_seq` among
/// all log entries that have been durably written. During recovery, entries with
/// `batch_seq > flush_seq` are considered incomplete (in-flight at crash time)
/// and are replayed. Entries with `batch_seq <= flush_seq` are confirmed durable.
pub flush_seq: Le64,
}
// On-disk format: magic(4)+version(4)+log_size(4)+write_pos(4)+flush_seq(8) = 24 bytes.
const_assert!(core::mem::size_of::<StripeLogHeader>() == 24);
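The wrap rule from the `write_pos` doc comment can be captured in a small helper (function name assumed; the real code also zero-fills the skipped tail on disk so recovery sees an empty sentinel):

```rust
/// Place a batch of `batch_size` bytes in a circular log of `log_size`
/// bytes. Returns (offset the batch is written at, new write_pos).
/// A batch never straddles the end of the log: the tail is skipped
/// (zero-filled on disk) and the batch restarts at offset 0.
fn place_batch(write_pos: u32, batch_size: u32, log_size: u32) -> (u32, u32) {
    assert!(batch_size <= log_size);
    if write_pos + batch_size > log_size {
        (0, batch_size) // wrap: batch written at the start of the log
    } else {
        (write_pos, write_pos + batch_size)
    }
}
```

Never splitting a batch across the wrap point is what lets recovery treat any zero-filled region (stripe_id == 0) as "end of valid entries" without a separate tail pointer.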
/// One entry in the stripe log. Records the parity diff for a single stripe write.
/// The log stores parity diffs (not full chunk data) to fit within the ~1 MB region.
/// On-disk format: all multi-byte fields use Le types to ensure disks are
/// portable across architectures. Le* types defined in
/// [Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types).
#[repr(C)]
pub struct StripeLogEntry {
/// Stripe number: 1-based (physical stripe number + 1).
/// **Convention**: stripe_id 0 is reserved as the empty sentinel. The first
/// physical stripe of the array uses stripe_id=1. Recovery uses non-zero
/// stripe_id to distinguish valid log entries from zero-filled (empty) slots.
pub stripe_id: Le64,
/// Sequence number for this batch. All entries in the same batch share the
/// same `batch_seq` value. Monotonically increasing across batches.
pub batch_seq: Le64,
/// CRC32C of old parity XOR new parity (the parity diff).
pub parity_diff_checksum: Le32,
/// Length of the parity diff data following this header.
pub parity_diff_len: Le32,
// Followed by `parity_diff_len` bytes of (old_parity XOR new_parity).
}
// On-disk format: stripe_id(8)+batch_seq(8)+parity_diff_checksum(4)+parity_diff_len(4) = 24 bytes.
const_assert!(core::mem::size_of::<StripeLogEntry>() == 24);
Batched write path (key improvement over Linux PPL):
Linux PPL flushes one parity diff per stripe write — serializing every write through the log. This causes 30-40% write overhead. UmkaOS batches:
- Accumulate N stripe writes in RAM (default batch size: 16, configurable via /sys/block/mdN/md/stripe_log_batch).
- Flush one batched log entry covering all N stripes to the metadata region (single sequential write, ~16 × parity_diff_size ≈ 16-64 KB).
- Issue all N stripe writes (data + parity) in parallel.
- On completion of all N stripe writes: advance StripeLogHeader.write_pos. Old log entries are now reclaimable.
Overhead: ~5-8% write throughput reduction (amortized over batch). Compared to Linux PPL's 30-40%, this is a 4-6x improvement.
Batch flush triggers (whichever comes first):
- Batch reaches stripe_log_batch entries (default 16).
- Timer expires: stripe_log_flush_ms (default 5ms).
- fsync() from userspace: immediate flush of current batch.
Recovery after crash:
1. Read StripeLogHeader from each member drive.
2. Scan log entries from write_pos backward to find the last complete batch
(matching batch_seq on all members, valid CRC32C on parity diffs).
3. For each logged stripe: re-apply the parity diff to reconstruct correct parity.
4. Stripes NOT in the log were not in-flight — they are already consistent.
Tiered stripe log with optional journal device:
For workloads requiring even lower overhead or guaranteed in-flight write survival:
| Tier | Log location | Overhead | In-flight writes survive? |
|---|---|---|---|
| Auto (default) | Metadata region on member drives | ~5-8% | Yes (parity diffs logged) |
| PMEM journal | PMEM/NVDIMM device | ~0% (PMEM latency ≈ 100 ns) | Yes |
| NVMe journal | Dedicated NVMe partition | ~2-3% (fast sequential writes) | Yes |
| None (legacy compat) | Disabled (write-intent bitmap only) | 0% | No — same risk as Linux |
The "none" tier is available for users who explicitly accept the write hole risk
(e.g., arrays protected by UPS + filesystem journal, where the combined failure
probability is acceptable). Configured via:
echo none > /sys/block/mdN/md/stripe_log_policy
15.2.5.3 dm-raid and LVM Integration¶
dm-raid uses the same stripe mechanism as md — the batched stripe log applies identically. When LVM creates a RAID logical volume via dm-raid, the stripe log is auto-enabled using the same metadata region policy.
For dm-thin (thin provisioning): the write hole is metadata consistency (thin pool superblock + space maps), not data stripe consistency. dm-thin already uses a metadata journal (two-copy metadata with atomic swap). UmkaOS preserves this mechanism — no additional stripe log is needed for dm-thin metadata.
15.2.5.4 FMA Integration¶
| Event | FMA severity | Action |
|---|---|---|
| Stripe log auto-enabled on import | Info | Log: "Stripe write protection enabled for mdN" |
| Metadata region too small for stripe log | Warning | Log: "mdN has RAID write hole risk — metadata region insufficient. Run umka-md-upgrade to convert superblock to 1.2" |
| Stripe log recovery replayed entries | Warning | Log: "mdN: recovered N stripes from stripe log after unclean shutdown" |
| CRC32C mismatch during recovery | Degraded | Log: "mdN stripe S: checksum mismatch, parity recomputed from data" |
| Legacy mode (no stripe log, user-disabled) | Info | One-time log: "mdN: stripe log disabled by admin, write hole risk accepted" |
15.3 SATA/AHCI and Embedded Flash Storage¶
SATA and eMMC are general-purpose block storage buses present in servers, edge nodes, embedded systems, and consumer devices alike. They belong in the core storage architecture alongside NVMe.
15.3.1 SATA/AHCI¶
SATA (Serial ATA) remains widely deployed: HDDs in cold/warm storage tiers, SATA SSDs in cost-sensitive edge nodes, and legacy server hardware. AHCI (Advanced Host Controller Interface) is the standard host-side register interface for SATA controllers.
Full driver architecture: Section 15.4 defines the complete AHCI driver: HBA/port register maps, FIS formats, command header/table layouts, NCQ tag management, error recovery state machine, hot-plug, and ATAPI passthrough.
Driver tier: Tier 1. SATA is a block-latency-sensitive path.
AHCI register interface: The AHCI controller exposes a set of memory-mapped registers (HBA memory space, BAR5) and per-port command list / FIS receive areas. The driver:
- Discovers ports via `HBA_CAP.NP` (number of ports).
- For each implemented port: reads `PxSIG` to identify the device type (ATA, ATAPI, PM, SEMB).
- Issues IDENTIFY DEVICE (ATA command 0xEC) to retrieve geometry, capabilities, LBA48 support, NCQ depth.
- Allocates a per-port command list (up to 32 slots) and FIS receive buffer.
- Registers the device with umka-block as a `BlockDevice` with sector size 512 or 4096 (Advanced Format).
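The discovery step iterates the Ports Implemented bitmask, one bit per potential port. A minimal sketch of that iteration, assuming plain `u32` register values (the helper name is illustrative, not part of the driver API):

```rust
/// Yield the port numbers whose bits are set in the PI (Ports
/// Implemented) register. Pure bit iteration over the 32 possible ports.
fn implemented_ports(pi: u32) -> impl Iterator<Item = u8> {
    (0..32u8).filter(move |p| pi & (1u32 << p) != 0)
}
```

The probe loop then allocates per-port state only for the yielded port numbers, skipping unimplemented ports entirely.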
Command submission: AHCI uses a memory-based command list. Each command slot
contains a Command Table with a Physical Region Descriptor Table (PRDT) for
scatter-gather DMA. Native Command Queuing (NCQ, up to 32 outstanding commands)
is used when the device reports IDENTIFY.SATA_CAP.NCQ_SUPPORTED.
The canonical AhciPort struct (per-port driver state including command list,
FIS receive area, NCQ support, and in-flight tracking) is defined in
Section 15.4. This summary section
covers the integration points; see the detailed architecture for the full
field-level definition (port registers, NCQ depth, power state tracking, etc.).
Power management: AHCI supports three interface power states: Active, Partial
(~10 µs wake), and Slumber (~10 ms wake). The driver uses Aggressive Link Power
Management (ALPM) to enter Partial/Slumber when the port is idle. On system
suspend (Section 7.9), the driver flushes the write cache (FLUSH CACHE EXT, ATA 0xEA)
and issues STANDBY IMMEDIATE (ATA 0xE0) before the controller is powered down.
Integration with Section 15.2 Block I/O: AHCI ports register as BlockDevice instances
with umka-block. The volume layer (Section 15.2) treats SATA devices identically to NVMe
namespaces — RAID, dm-crypt, dm-verity, thin provisioning all work on SATA block
devices without modification.
15.3.2 eMMC (Embedded MultiMediaCard)¶
eMMC is a managed NAND flash storage interface used in embedded systems, edge servers with soldered storage, and cost-sensitive devices. The host interface is a parallel bus (up to 8-bit data width) with an MMC command set.
Driver tier: Tier 1 for the MMC host controller; device command processing follows the same ring buffer model as NVMe.
eMMC register interface: The eMMC host controller (typically SDHCI-compatible or vendor-specific) exposes MMIO registers for command/response, data FIFO, and interrupt status. The driver:
- Initializes the host controller and negotiates bus width (1/4/8-bit) and speed (HS200/HS400 where supported).
- Issues CMD8 (SEND_EXT_CSD) to retrieve the extended CSD register (512 bytes), which contains capacity, supported features, lifetime estimation, and write-protect status.
- Registers partitions (boot partitions BP1/BP2, RPMB, user area, general-purpose partitions) as separate `BlockDevice` instances with umka-block.
RPMB (Replay-Protected Memory Block): eMMC RPMB is a hardware-authenticated
storage area with replay protection, used for secure credential storage (e.g.,
TPM secrets, disk encryption keys). Access requires HMAC-SHA256-authenticated
commands using a device-specific key programmed once at manufacturing. The kernel
exposes RPMB as a capability-gated block device; only processes with the
CAP_RPMB_ACCESS capability (Section 9.1) can issue RPMB commands.
Lifetime and wear: The Extended CSD PRE_EOL_INFO and DEVICE_LIFE_TIME_EST
fields report device health. The kernel reads these periodically and exposes them
via sysfs (/sys/block/mmcblk0/device/life_time). No kernel policy is applied —
userspace storage daemons make retention/migration decisions.
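The health fields live at fixed byte offsets in the 512-byte Extended CSD (per JEDEC eMMC 5.0+: PRE_EOL_INFO at byte 267, DEVICE_LIFE_TIME_EST_TYP_A/B at bytes 268/269). A minimal decoding sketch — the struct and function names are illustrative, not KABI:

```rust
/// Decoded eMMC health snapshot from the Extended CSD.
pub struct EmmcHealth {
    /// 0x01 = normal, 0x02 = warning (80% of reserved blocks consumed),
    /// 0x03 = urgent (90% consumed).
    pub pre_eol_info: u8,
    /// Estimated lifetime used, in 10% steps: 0x01 = 0-10%, ..., 0x0B = exceeded.
    pub life_time_est_a: u8,
    pub life_time_est_b: u8,
}

/// Pull the three health bytes out of a raw 512-byte EXT_CSD buffer.
pub fn parse_emmc_health(ext_csd: &[u8; 512]) -> EmmcHealth {
    EmmcHealth {
        pre_eol_info: ext_csd[267],
        life_time_est_a: ext_csd[268],
        life_time_est_b: ext_csd[269],
    }
}
```

The kernel only transports these values to sysfs; interpreting them (e.g. migrating data off a device reporting 0x03) is left to userspace, per the policy split stated above.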
Integration with Section 15.2: eMMC user-area partitions register as BlockDevice
instances. All Section 15.2 volume management targets (dm-crypt, dm-mirror, dm-thin) work
on eMMC partitions identically to NVMe namespaces.
15.3.3 SD Card Reader (SDHCI)¶
SDHCI (SD Host Controller Interface) is the standard register interface for
built-in SD card slot controllers. SD cards register as BlockDevice instances
with umka-block.
Driver tier: Tier 1.
Speed mode negotiation: UHS-I (SDR104, 104 MB/s max), UHS-II (312 MB/s), and UHS-III (624 MB/s) are negotiated per the SD Association physical layer specification. The driver reads the SD card's OCR, CID, CSD, and SCR registers at initialization to determine supported speed modes and switches the bus to the highest mutually supported mode.
Presence detection: SD cards are hot-plug devices. The SDHCI controller raises
an interrupt on card insertion/removal. The driver posts a BlockDeviceChanged event
to the system event bus (Section 7.9, umka-core) on state change.
Consumer vs. embedded: SD cards appear in consumer laptops (built-in SD slot), embedded systems (primary boot/storage medium), and IoT devices. The SDHCI driver is general-purpose; consumer devices are its most common deployment.
15.4 AHCI/SATA Driver Architecture¶
Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules.
`&self` methods use interior mutability for mutation. Atomic fields use `.store()`/`.load()`. All `#[repr(C)]` structs have `const_assert!` size verification. See CLAUDE.md Spec Pseudocode Quality Gates.
The AHCI driver is a Tier 1 KABI driver that manages SATA storage devices through the AHCI (Advanced Host Controller Interface) register specification. This section defines the complete driver architecture: HBA register model, per-port state machines, FIS (Frame Information Structure) formats, command submission, NCQ (Native Command Queuing), error recovery, hot-plug, and ATAPI passthrough.
Reference specification: Serial ATA AHCI 1.3.1 (Intel, June 2011). SATA 3.5 (SATA-IO, 2024) for link-layer features.
15.4.1 HBA Global Registers¶
The AHCI HBA exposes a memory-mapped register set at PCI BAR5 (ABAR). The global registers (offsets 0x00-0x2B; the last register, BOHC, occupies bytes 0x28-0x2B) control HBA-wide behavior:
/// AHCI HBA global registers (ABAR + 0x00).
/// All registers are 32-bit. Access via MMIO (volatile read/write).
///
/// AHCI defines all multi-byte register fields as little-endian.
/// Le* types ensure correct byte order on big-endian architectures
/// (PPC32, s390x). PPC64LE is little-endian and needs no byte-swap.
#[repr(C)]
pub struct AhciHbaRegisters {
/// Host Capabilities (CAP) — read-only.
/// Bits: NP (4:0) number of ports - 1, SXS (5) external SATA,
/// EMS (6) enclosure management, CCCS (7) command completion coalescing,
/// NCS (12:8) number of command slots - 1, PSC (13) partial state capable,
/// SSC (14) slumber state capable, PMD (15) PIO multiple DRQ block,
/// FBSS (16) FIS-based switching, SPM (17) port multiplier,
/// SAM (18) AHCI-only (no legacy IDE), ISS (23:20) interface speed,
/// SCLO (24) command list override, SAL (25) activity LED,
/// SALP (26) aggressive link power mgmt, SSS (27) staggered spin-up,
/// SMPS (28) mechanical presence switch, SSNTF (29) SNotification,
/// SNCQ (30) NCQ support, S64A (31) 64-bit addressing.
pub cap: Le32,
/// Global HBA Control (GHC).
/// Bits: HR (0) HBA reset, IE (1) interrupt enable, MRSM (2) MSI revert,
/// AE (31) AHCI enable.
pub ghc: Le32,
/// Interrupt Status (IS) — one bit per port. Write-1-to-clear.
pub is: Le32,
/// Ports Implemented (PI) — bitmask of implemented ports.
pub pi: Le32,
/// AHCI Version (VS) — major (31:16), minor (15:0). E.g., 0x00010301 = 1.3.1.
pub vs: Le32,
/// Command Completion Coalescing Control (CCC_CTL).
pub ccc_ctl: Le32,
/// Command Completion Coalescing Ports (CCC_PORTS).
pub ccc_ports: Le32,
/// Enclosure Management Location (EM_LOC).
pub em_loc: Le32,
/// Enclosure Management Control (EM_CTL).
pub em_ctl: Le32,
/// Host Capabilities Extended (CAP2).
/// Bits: BOH (0) BIOS/OS handoff, NVMP (1) NVMHCI present,
/// APST (2) automatic partial-to-slumber, SDS (3) DevSleep,
/// SADM (4) aggressive DevSleep, DESO (5) DevSleep entrance from slumber only.
pub cap2: Le32,
/// BIOS/OS Handoff Control and Status (BOHC).
pub bohc: Le32,
}
// 11 × Le32 = 11 × 4 = 44 bytes. AHCI spec HBA registers: 0x00-0x2B = 44 bytes.
const_assert!(core::mem::size_of::<AhciHbaRegisters>() == 44);
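A probe-time sketch of decoding the CAP fields the driver needs, operating on the raw 32-bit register value (helper names are illustrative, not KABI; bit positions follow the doc comment on `cap` above):

```rust
/// NP (bits 4:0) encodes number of ports minus one.
fn cap_num_ports(cap: u32) -> u32 {
    (cap & 0x1F) + 1
}

/// NCS (bits 12:8) encodes number of command slots minus one.
fn cap_num_cmd_slots(cap: u32) -> u32 {
    ((cap >> 8) & 0x1F) + 1
}

/// SNCQ (bit 30): HBA supports Native Command Queuing.
fn cap_supports_ncq(cap: u32) -> bool {
    cap & (1u32 << 30) != 0
}

/// S64A (bit 31): HBA supports 64-bit DMA addressing.
fn cap_supports_64bit(cap: u32) -> bool {
    cap & (1u32 << 31) != 0
}
```

Note the minus-one encodings: a CAP value with NP = 5 means six ports, and NCS = 31 means the full 32 command slots.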
15.4.2 Per-Port Registers¶
Each port occupies 0x80 bytes at ABAR + 0x100 + (port × 0x80):
/// AHCI per-port register set.
///
/// AHCI defines all multi-byte register fields as little-endian.
/// Le* types ensure correct byte order on big-endian architectures
/// (PPC32, s390x). PPC64LE is little-endian and needs no byte-swap.
// kernel-internal, not KABI
#[repr(C)]
pub struct AhciPortRegisters {
/// Command List Base Address (PxCLB) — physical address of command list (1024-byte aligned).
pub clb: Le32,
/// Command List Base Address Upper 32-bits (PxCLBU) — for 64-bit addressing.
pub clbu: Le32,
/// FIS Base Address (PxFB) — physical address of received FIS area (256-byte aligned).
pub fb: Le32,
/// FIS Base Address Upper 32-bits (PxFBU).
pub fbu: Le32,
/// Interrupt Status (PxIS) — write-1-to-clear.
/// Bits: DHRS (0) D2H Register FIS, PSS (1) PIO Setup FIS,
/// DSS (2) DMA Setup FIS, SDBS (3) Set Device Bits FIS,
/// UFS (4) Unknown FIS, DPS (5) descriptor processed,
/// PCS (6) port connect change, DMPS (7) device mechanical presence,
/// PRCS (22) PhyRdy change, IPMS (23) incorrect port multiplier,
/// OFS (24) overflow, INFS (26) interface non-fatal error,
/// IFS (27) interface fatal error, HBDS (28) host bus data error,
/// HBFS (29) host bus fatal error, TFES (30) task file error.
pub is: Le32,
/// Interrupt Enable (PxIE) — same bit layout as PxIS.
pub ie: Le32,
/// Command and Status (PxCMD).
/// Bits: ST (0) start, SUD (1) spin-up device, POD (2) power on device,
/// CLO (3) command list override, FRE (4) FIS receive enable,
/// CCS (12:8) current command slot, MPSS (13) mechanical presence switch,
/// FR (14) FIS receive running (RO), CR (15) command list running (RO),
/// CPS (16) cold presence, PMA (17) port multiplier attached,
/// HPCP (18) hot-plug capable, MPSP (19) mechanical presence switch present,
/// CPD (20) cold presence detection, ESP (21) external SATA port,
/// FBSCP (22) FIS-based switching capable, APSTE (23) auto partial-to-slumber,
/// ATAPI (24) device is ATAPI, DLAE (25) drive LED on ATAPI enable,
/// ALPE (26) aggressive link power management enable,
/// ASP (27) aggressive slumber/partial.
/// ICC (31:28) interface communication control.
pub cmd: Le32,
/// Reserved.
pub _reserved0: Le32,
/// Task File Data (PxTFD) — read-only.
/// Bits (7:0): STS (status register — BSY, DRQ, ERR).
/// Bits (15:8): ERR (error register).
pub tfd: Le32,
/// Signature (PxSIG) — device signature from D2H Register FIS.
/// 0x00000101 = ATA device, 0xEB140101 = ATAPI device,
/// 0xC33C0101 = enclosure management bridge, 0x96690101 = port multiplier.
pub sig: Le32,
/// Serial ATA Status (PxSSTS) — read-only. SStatus register.
/// Bits: DET (3:0) device detection, SPD (7:4) interface speed,
/// IPM (11:8) interface power management.
pub ssts: Le32,
/// Serial ATA Control (PxSCTL) — SControl register.
/// Bits: DET (3:0) device detection init, SPD (7:4) speed allowed,
/// IPM (11:8) power management transitions allowed.
pub sctl: Le32,
/// Serial ATA Error (PxSERR) — write-1-to-clear. SError register.
pub serr: Le32,
/// Serial ATA Active (PxSACT) — one bit per NCQ tag. Set by SW before issuing
/// FPDMA commands; cleared by HW via Set Device Bits FIS on completion.
pub sact: Le32,
/// Command Issue (PxCI) — one bit per command slot. Set by SW to issue;
/// cleared by HW on command completion.
pub ci: Le32,
/// SNotification (PxSNTF) — SNotification register (port multiplier).
pub sntf: Le32,
/// FIS-based Switching Control (PxFBS).
pub fbs: Le32,
/// Device Sleep (PxDEVSLP).
pub devslp: Le32,
/// Reserved to 0x6F.
pub _reserved1: [Le32; 10],
/// Vendor-specific registers (0x70-0x7F).
pub vendor: [Le32; 4],
}
// 18 + [10] + [4] Le32 = 32 × 4 = 128 bytes. AHCI spec: 0x80 per port = 128 bytes.
const_assert!(core::mem::size_of::<AhciPortRegisters>() == 128);
15.4.3 FIS (Frame Information Structure) Types¶
All host-to-device and device-to-host communication uses FIS frames. The AHCI driver uses these FIS types:
/// FIS types used by AHCI.
#[repr(u8)]
pub enum FisType {
/// Register FIS — Host to Device (H2D). 20 bytes.
/// Used for all ATA commands (IDENTIFY, READ DMA EXT, WRITE DMA EXT, etc.).
RegH2D = 0x27,
/// Register FIS — Device to Host (D2H). 20 bytes.
/// Delivered to FIS receive area on command completion (non-NCQ).
RegD2H = 0x34,
/// DMA Activate FIS — Device to Host. 4 bytes.
/// Requests host to proceed with DMA transfer (legacy DMA, not used with NCQ).
DmaActivate = 0x39,
/// DMA Setup FIS — Bidirectional. 28 bytes.
/// Auto-activate for first-party DMA (NCQ). Contains DMA buffer offset + transfer count.
DmaSetup = 0x41,
/// Data FIS — Bidirectional. Variable length (up to 8K payload).
/// Carries actual read/write data.
Data = 0x46,
/// BIST Activate FIS. 12 bytes.
/// Built-In Self Test pattern generation.
BistActivate = 0x58,
/// PIO Setup FIS — Device to Host. 20 bytes.
/// Precedes PIO data transfer; contains byte count and new status.
PioSetup = 0x5F,
/// Set Device Bits FIS — Device to Host. 8 bytes.
/// Updates SActive register for NCQ completion notification; carries interrupt bit.
SetDevBits = 0xA1,
}
/// Register H2D FIS — the primary command FIS. 20 bytes (5 DWORDs).
/// This is what the driver writes into the Command Table CFIS area.
///
/// AHCI defines all multi-byte register fields as little-endian.
/// Le* types ensure correct byte order on big-endian architectures
/// (PPC32, s390x). PPC64LE is little-endian and needs no byte-swap.
#[repr(C)]
pub struct FisRegH2D {
/// FIS type (0x27).
pub fis_type: u8,
/// Bits: PM_PORT (3:0) port multiplier port, C (7) 1=Command, 0=Control.
pub flags: u8,
/// ATA command register (e.g., 0x25 = READ DMA EXT, 0x35 = WRITE DMA EXT,
/// 0x60 = READ FPDMA QUEUED, 0x61 = WRITE FPDMA QUEUED, 0xEC = IDENTIFY DEVICE,
/// 0xA1 = IDENTIFY PACKET DEVICE, 0xA0 = PACKET, 0xEA = FLUSH CACHE EXT,
/// 0xE0 = STANDBY IMMEDIATE).
pub command: u8,
/// Features register (7:0).
pub features_lo: u8,
/// LBA (23:0).
pub lba_lo: [u8; 3],
/// Device register. Bit 6 = LBA mode.
pub device: u8,
/// LBA (47:24).
pub lba_hi: [u8; 3],
/// Features register (15:8).
pub features_hi: u8,
/// Sector count (15:0).
pub count: Le16,
/// ICC (Isochronous Command Completion).
pub icc: u8,
/// Control register.
pub control: u8,
/// Reserved (auxiliary).
pub _reserved: [u8; 4],
}
// 1+1+1+1+3+1+3+1+2+1+1+4 = 20 bytes. ATA Register H2D FIS: 5 DWORDs = 20 bytes.
const_assert!(core::mem::size_of::<FisRegH2D>() == 20);
impl FisRegH2D {
/// Set 48-bit LBA across the split lba_lo[3] and lba_hi[3] fields.
pub fn set_lba48(&mut self, lba: u64) {
self.lba_lo[0] = (lba & 0xFF) as u8;
self.lba_lo[1] = ((lba >> 8) & 0xFF) as u8;
self.lba_lo[2] = ((lba >> 16) & 0xFF) as u8;
self.lba_hi[0] = ((lba >> 24) & 0xFF) as u8;
self.lba_hi[1] = ((lba >> 32) & 0xFF) as u8;
self.lba_hi[2] = ((lba >> 40) & 0xFF) as u8;
}
/// Clear both LBA fields to zero (for non-data commands like FLUSH).
pub fn clear_lba(&mut self) {
self.lba_lo = [0; 3];
self.lba_hi = [0; 3];
}
/// Set sector count in the `count: Le16` field.
pub fn set_sector_count(&mut self, count: u16) {
self.count = Le16::from_ne(count);
}
/// Zero the FIS to a clean state.
pub fn zeroed() -> Self {
// SAFETY: FisRegH2D is #[repr(C)] with all-integer fields; zero is valid.
unsafe { core::mem::zeroed() }
}
}
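As a worked example of the 20-byte layout above, a self-contained sketch that assembles a READ DMA EXT Register H2D FIS as raw bytes (the helper is illustrative, not driver API; byte offsets mirror the `FisRegH2D` field order):

```rust
/// Assemble a READ DMA EXT (0x25) command FIS for a 48-bit LBA read.
/// Byte offsets follow the ATA Register H2D FIS: type(0), flags(1),
/// command(2), features_lo(3), lba 23:0 (4-6), device(7),
/// lba 47:24 (8-10), features_hi(11), count (12-13, little-endian).
fn build_read_dma_ext_fis(lba: u64, sectors: u16) -> [u8; 20] {
    let mut fis = [0u8; 20];
    fis[0] = 0x27; // FIS type: Register H2D
    fis[1] = 0x80; // C bit set: this is a command, not a control update
    fis[2] = 0x25; // READ DMA EXT
    fis[4] = (lba & 0xFF) as u8;          // LBA 7:0
    fis[5] = ((lba >> 8) & 0xFF) as u8;   // LBA 15:8
    fis[6] = ((lba >> 16) & 0xFF) as u8;  // LBA 23:16
    fis[7] = 0x40;                        // device: LBA mode (bit 6)
    fis[8] = ((lba >> 24) & 0xFF) as u8;  // LBA 31:24
    fis[9] = ((lba >> 32) & 0xFF) as u8;  // LBA 39:32
    fis[10] = ((lba >> 40) & 0xFF) as u8; // LBA 47:40
    fis[12] = (sectors & 0xFF) as u8;     // count 7:0
    fis[13] = (sectors >> 8) as u8;       // count 15:8
    fis
}
```

The driver-side equivalent writes the same bytes through `FisRegH2D::set_lba48()` and `set_sector_count()` into the command table's CFIS area.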
15.4.4 Command Header and Command Table¶
Each port has a command list of up to 32 entries (determined by CAP.NCS). Each
entry is a 32-byte command header that points to a variable-length command table:
/// AHCI Command Header — 32 bytes. One per command slot (up to 32 per port).
/// The command list is a contiguous DMA buffer of 32 × AhciCmdHeader.
#[repr(C)]
pub struct AhciCmdHeader {
/// DW0: Flags.
/// CFL (4:0): Command FIS Length in DWORDs (2-16, typically 5 for Register H2D).
/// A (5): ATAPI command (1 if ACMD contains SCSI CDB).
/// W (6): Write direction (1 = host-to-device, 0 = device-to-host).
/// P (7): Prefetchable (hint — HBA may prefetch PRD entries).
/// R (8): Reset (1 = this command performs a device reset).
/// B (9): BIST FIS.
/// C (10): Clear Busy upon R_OK (for overlapped commands).
/// PMP (15:12): Port Multiplier Port.
/// PRDTL (31:16): Physical Region Descriptor Table Length (entries, 0-65535).
pub flags_prdtl: Le32,
/// DW1: Physical Region Descriptor Byte Count (PRDBC).
/// Updated by HBA on transfer completion — total bytes transferred.
pub prdbc: Le32,
/// DW2-3: Command Table Descriptor Base Address (CTBA, 128-byte aligned).
pub ctba: Le64,
/// DW4-7: Reserved.
pub _reserved: [Le32; 4],
}
// Le32(4) + Le32(4) + Le64(8) + [Le32;4](16) = 32 bytes. AHCI spec: 32-byte command header.
const_assert!(core::mem::size_of::<AhciCmdHeader>() == 32);
impl AhciCmdHeader {
/// Construct the DW0 value with CFL, write direction, and PRDTL,
/// then store it as Le32. This is the single method for populating
/// `flags_prdtl` — callers never manipulate the packed field directly.
///
/// `cfl`: Command FIS Length in DWORDs (typically 5 for Register H2D).
/// `write`: true if host-to-device (W bit 6).
/// `prdtl`: PRDT entry count (bits 31:16).
pub fn set_flags_prdtl(&mut self, cfl: u8, write: bool, prdtl: u16) {
let dw0 = (cfl as u32 & 0x1F)
| (if write { 1u32 << 6 } else { 0 })
| ((prdtl as u32) << 16);
self.flags_prdtl = Le32::from_ne(dw0);
}
}
/// AHCI Command Table — variable size. Contains CFIS, ACMD, and PRDT.
/// Minimum size: 128 bytes (CFIS) + 0 (ACMD) + N × 16 (PRDT entries).
/// The command table must be 128-byte aligned.
#[repr(C)]
pub struct AhciCmdTable {
/// Command FIS area — 64 bytes (only first CFL×4 bytes are valid).
pub cfis: [u8; 64],
/// ATAPI Command area — 16 bytes (12-byte SCSI CDB + 4 padding).
/// Only valid when AhciCmdHeader.flags.A = 1.
pub acmd: [u8; 16],
/// Reserved — 48 bytes.
pub _reserved: [u8; 48],
/// Physical Region Descriptor Table — up to 65535 entries.
/// Actual count is in AhciCmdHeader.flags_prdtl (PRDTL field).
/// For UmkaOS: capped at 248 entries per command (matching `max_segments`
/// from BlockDeviceInfo). Each PRDT entry can address up to 4MB (22-bit DBC
/// field), but the block layer's bio splitting caps practical transfers at ~1MB.
/// Rationale for 248: the command table header (CFIS + ACMD + reserved) is
/// 128 bytes; 128 + 248 × 16 = 4096 bytes = exactly one 4KB page. This
/// maximizes scatter-gather capacity within a single-page DMA allocation.
/// Linux uses `LIBATA_MAX_PRD` = 128 (half of `ATA_MAX_PRD` = 256);
/// UmkaOS uses 248 to fill the page without crossing a page boundary.
///
/// **Memory footprint**: 32 command slots × 4 KB per command table = 128 KB
/// per port. Most I/O uses 1-4 PRDT entries, leaving ~244 entries unused
/// per command. This is intentional: the AHCI spec requires the command
/// table to be a contiguous DMA allocation, and per-command dynamic sizing
/// would require separate DMA allocations per I/O (slower, more fragmentation).
/// The 128 KB/port cost is fixed and acceptable for SATA controllers.
pub prdt: [AhciPrdtEntry; 248],
}
// AhciCmdTable: cfis(64) + acmd(16) + _reserved(48) + prdt(248×16) = 4096 bytes.
const_assert!(core::mem::size_of::<AhciCmdTable>() == 4096);
/// AHCI PRDT Entry — 16 bytes. Describes one scatter-gather DMA region.
///
/// AHCI defines all multi-byte register fields as little-endian.
/// Le* types ensure correct byte order on big-endian architectures
/// (PPC32, s390x). PPC64LE is little-endian and needs no byte-swap.
#[repr(C)]
pub struct AhciPrdtEntry {
/// Data Base Address (DBA) — physical byte address of data buffer.
/// Must be word-aligned (bit 0 = 0).
pub dba: Le32,
/// Data Base Address Upper 32-bits (DBAU).
pub dbau: Le32,
/// Reserved.
pub _reserved: Le32,
/// Data Byte Count (DBC) and Interrupt flag.
/// DBC (21:0): byte count - 1 (0 = 1 byte, max 0x3FFFFF = 4MB).
/// I (31): Interrupt on completion of this PRD entry.
pub dbc_i: Le32,
}
// Le32(4) + Le32(4) + Le32(4) + Le32(4) = 16 bytes. AHCI PRDT entry: 16 bytes.
const_assert!(core::mem::size_of::<AhciPrdtEntry>() == 16);
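The DBC encoding is a common off-by-one trap: the field holds the byte count minus one, with the interrupt flag in bit 31. A minimal sketch (helper name illustrative):

```rust
/// Encode the PRDT dbc_i dword: DBC (bits 21:0) holds byte_count - 1
/// (so 0 means 1 byte, 0x3FFFFF means 4 MB); bit 31 requests an
/// interrupt when this PRD entry completes.
fn encode_dbc_i(byte_count: u32, interrupt: bool) -> u32 {
    // Valid range: 1 byte up to 4 MB per entry.
    debug_assert!(byte_count >= 1 && byte_count <= (1u32 << 22));
    ((byte_count - 1) & 0x3F_FFFF) | (if interrupt { 1u32 << 31 } else { 0 })
}
```

The submit path applies this per bio segment, setting the interrupt bit only on the final entry so the HBA raises one completion interrupt per command, not per segment.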
/// ATA DATA SET MANAGEMENT (TRIM) LBA Range Entry.
/// ACS-4 §7.10: the payload is an array of these 8-byte entries packed into
/// 512-byte blocks. Each entry specifies a contiguous range of LBAs to trim.
/// Command 0x06 (DATA SET MANAGEMENT) with feature register bit 0 = TRIM.
#[repr(C, packed)]
pub struct AtaTrimRangeEntry {
/// Starting LBA of the range to trim (48-bit).
/// Stored as 6 little-endian bytes: lba[0..6].
pub lba: [u8; 6],
/// Number of logical sectors to trim (16-bit, little-endian).
/// 0 = entry is unused (skip). Max 65535 sectors per entry.
pub count: Le16,
}
const_assert!(core::mem::size_of::<AtaTrimRangeEntry>() == 8);
// Each 512-byte block holds 64 entries. The DATA SET MANAGEMENT
// command transfers 1-65535 blocks (count register), each containing
// up to 64 range entries.
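Packing one range entry into its 8-byte on-wire form can be sketched as follows (helper name illustrative; layout matches `AtaTrimRangeEntry` above: 48-bit little-endian LBA, then a 16-bit little-endian sector count):

```rust
/// Pack one DATA SET MANAGEMENT (TRIM) range entry: 6 little-endian
/// LBA bytes followed by a 16-bit little-endian sector count.
fn pack_trim_entry(lba: u64, sectors: u16) -> [u8; 8] {
    debug_assert!(lba < (1u64 << 48)); // LBA must fit in 48 bits
    let mut e = [0u8; 8];
    e[..6].copy_from_slice(&lba.to_le_bytes()[..6]); // LBA 47:0
    e[6..].copy_from_slice(&sectors.to_le_bytes());  // sector count
    e
}
```

A discard bio covering more than 65535 sectors is split across multiple entries; unused entries in the final 512-byte block stay zeroed (count 0 = skip).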
15.4.5 FIS Receive Area¶
Each port has a dedicated 256-byte DMA buffer for received FIS frames:
/// AHCI received FIS area — 256 bytes per port.
/// The HBA writes incoming FIS frames to fixed offsets in this buffer.
#[repr(C, align(256))]
pub struct AhciFisRxArea {
/// DMA Setup FIS (offset 0x00, 28 bytes).
pub dma_setup: [u8; 28],
pub _pad0: [u8; 4],
/// PIO Setup FIS (offset 0x20, 20 bytes).
pub pio_setup: [u8; 20],
pub _pad1: [u8; 12],
/// D2H Register FIS (offset 0x40, 20 bytes).
pub d2h_reg: [u8; 20],
pub _pad2: [u8; 4],
/// Set Device Bits FIS (offset 0x58, 8 bytes).
pub set_dev_bits: [u8; 8],
/// Unknown FIS area (offset 0x60, 64 bytes).
pub unknown_fis: [u8; 64],
/// Reserved (offset 0xA0, 96 bytes).
pub _reserved: [u8; 96],
}
// 28+4+20+12+20+4+8+64+96 = 256 bytes. AHCI spec: 256-byte FIS receive area.
const_assert!(core::mem::size_of::<AhciFisRxArea>() == 256);
15.4.6 AhciPort Driver State¶
/// Per-port AHCI driver state — lives in the Tier 1 driver domain.
/// One instance per implemented port. Allocated at driver probe time.
pub struct AhciPort {
/// Port number (0-31).
pub port_num: u8,
/// MMIO accessor for this port's register set.
pub regs: PortedMmio,
/// DMA-coherent command list (32 entries × 32 bytes = 1024 bytes, 1K-aligned).
pub cmd_list: DmaBox<[AhciCmdHeader; 32]>,
/// DMA-coherent received FIS area (256 bytes, 256-byte aligned).
pub fis_rx: DmaBox<AhciFisRxArea>,
/// Per-slot command tables. Pre-allocated at probe time — no allocation on the I/O path.
/// Only `ncs` entries are valid (CAP.NCS + 1).
pub cmd_tables: ArrayVec<DmaBox<AhciCmdTable>, 32>,
/// Bitmask of in-flight command slots (mirrors PxCI for driver-side tracking).
pub inflight: AtomicU32,
/// Bitmask of in-flight NCQ tags (mirrors PxSACT for driver-side tracking).
pub ncq_inflight: AtomicU32,
/// Number of command slots supported (CAP.NCS + 1, max 32).
pub ncs: u8,
/// Maximum NCQ depth reported by IDENTIFY DEVICE word 75 (0-based, max 31 → 32 tags).
pub ncq_depth: u8,
/// True if the device supports NCQ (IDENTIFY word 76 bit 8).
pub ncq_capable: bool,
/// Device type detected from PxSIG.
pub device_type: AhciDeviceType,
/// Logical sector size (512 or 4096).
pub logical_sector_size: u32,
/// Physical sector size (512 or 4096 for Advanced Format).
pub physical_sector_size: u32,
/// Total capacity in logical sectors.
pub capacity_sectors: u64,
/// Device supports TRIM (IDENTIFY word 169 bit 0).
pub supports_trim: bool,
/// Device supports write cache (IDENTIFY word 82 bit 5).
pub write_cache_enabled: bool,
/// Device supports 48-bit LBA (IDENTIFY word 83 bit 10).
pub lba48: bool,
/// Device supports Force Unit Access (WRITE DMA FUA EXT).
/// Requires BOTH LBA48 AND IDENTIFY word 84 bit 6. Not all LBA48
/// devices support FUA — it is optional.
pub supports_fua: bool,
/// Device supports SANITIZE command (IDENTIFY word 59 bit 12).
pub supports_sanitize: bool,
/// Nominal media rotation rate from IDENTIFY word 217.
/// 0x0001 = non-rotating (SSD), any other non-zero = RPM.
/// Used by get_info() to set BlockDeviceFlags::ROTATIONAL.
pub nominal_rotation_rate: u16,
/// Per-slot bio pointer — maps completed command slot/NCQ tag back to
/// the originating Bio for `bio_complete()`. Analogous to NVMe's
/// `inflight: Box<[Option<NvmeInflightCmd>]>`.
///
/// Set by `submit_bio()` after claiming a slot. Taken (swap to null) by
/// the IRQ completion handler when processing a D2H FIS / SDB FIS. For
/// synchronous commands (IDENTIFY, non-bio flush), this is null — those
/// use `wait_for_completion()` instead.
///
/// Uses `AtomicPtr<Bio>` (null = no bio) for interior mutability:
/// submit paths and IRQ handler both access through `&self`/`&AhciPort`.
/// The `inflight` bitmask provides mutual exclusion (a slot is only
/// written by the submit path after claiming it, and only read/cleared
/// by the IRQ handler after the device signals completion).
///
/// **SAFETY**: Raw pointer to a Bio whose lifetime extends until
/// `bio_complete()` signals completion. The AHCI port state machine
/// ensures each slot processes exactly one bio at a time.
pub slot_bios: [AtomicPtr<Bio>; 32],
/// Per-slot completion error status — maps completed slot to the
/// errno value (0 = success, -EIO = error) for `bio_complete()`.
pub slot_status: [AtomicI32; 32],
/// Per-slot completion state — one per command slot.
/// Values: IDLE(0), PENDING(1), COMPLETED(2), ERROR(3).
/// Submit path sets `slot_completions[slot] = PENDING`.
/// IRQ handler sets `slot_completions[slot] = COMPLETED/ERROR`
/// and wakes the port's WaitQueue.
/// `wait_for_completion()` blocks on the port WaitQueue until
/// `slot_completions[slot] != PENDING`.
///
/// **Tag reuse constraint**: A completed slot MUST NOT be reused
/// (bit re-set in `inflight`/`ncq_inflight`) until the completion
/// handler has consumed the previous completion. The submit path
/// checks `slot_completions[slot] == IDLE` before claiming the slot.
pub slot_completions: [AtomicU8; 32],
/// Per-port WaitQueue for synchronous command completion (flush,
/// IDENTIFY). Woken by the IRQ handler after updating slot_completions.
/// See [Section 3.6](03-concurrency.md#lock-free-data-structures--completion-one-shot-or-multi-shot-signaling-primitive)
/// for the formal `Completion` primitive. AHCI uses `WaitQueue` directly
/// (not `Completion`) because multiple command slots share a single
/// per-port wait queue with per-slot state discrimination.
pub completion_waitq: WaitQueue,
/// Port error state — set by error recovery, checked by submit path.
pub error_state: AtomicU8,
/// Link power management state.
pub link_pm_state: AtomicU8,
}
#[repr(u8)]
pub enum AhciDeviceType {
/// Standard ATA disk (PxSIG = 0x00000101).
Ata = 0,
/// ATAPI device — optical drive, tape (PxSIG = 0xEB140101).
Atapi = 1,
/// Port multiplier (PxSIG = 0x96690101).
PortMultiplier = 2,
/// Enclosure management bridge (PxSIG = 0xC33C0101).
Semb = 3,
/// No device detected.
None = 0xFF,
}
/// Port error recovery state.
#[repr(u8)]
pub enum AhciPortErrorState {
/// Normal operation.
Normal = 0,
/// Error recovery in progress — new I/O submission blocked.
Recovering = 1,
/// Port disabled after unrecoverable error.
Disabled = 2,
}
/// Link power management state.
#[repr(u8)]
pub enum AhciLinkPmState {
/// Active (no power saving).
Active = 0,
/// AHCI Partial state (low-latency sleep, ~10us resume).
Partial = 1,
/// AHCI Slumber state (deeper sleep, ~10ms resume).
Slumber = 2,
/// SATA DevSleep (device-initiated deep sleep, ~20ms resume).
DevSleep = 3,
}
impl AhciPort {
/// Read the current error state as a typed enum.
/// Returns `AhciPortErrorState::Disabled` for any unrecognized value
/// (defensive — treats corruption as worst-case).
pub fn error_state(&self) -> AhciPortErrorState {
match self.error_state.load(Acquire) {
0 => AhciPortErrorState::Normal,
1 => AhciPortErrorState::Recovering,
_ => AhciPortErrorState::Disabled,
}
}
/// Set the error state atomically.
pub fn set_error_state(&self, s: AhciPortErrorState) {
self.error_state.store(s as u8, Release);
}
/// Read the current link power management state as a typed enum.
/// Returns `AhciLinkPmState::Active` for any unrecognized value.
pub fn link_pm_state(&self) -> AhciLinkPmState {
match self.link_pm_state.load(Acquire) {
0 => AhciLinkPmState::Active,
1 => AhciLinkPmState::Partial,
2 => AhciLinkPmState::Slumber,
3 => AhciLinkPmState::DevSleep,
_ => AhciLinkPmState::Active,
}
}
/// Set the link power management state atomically.
pub fn set_link_pm_state(&self, s: AhciLinkPmState) {
self.link_pm_state.store(s as u8, Release);
}
}
15.4.7 Initialization Sequence¶
- PCI probe: Match PCI class code 01:06:01 (Mass Storage → SATA → AHCI 1.0). Map BAR5 as uncacheable MMIO.
- BIOS/OS handoff: If `CAP2.BOH` is set, perform BIOS/OS handoff via the `BOHC` register (set OOS bit, wait for BOS clear, timeout 25ms per AHCI spec §11.6).
- Enable AHCI mode: Set `GHC.AE` (bit 31). If `CAP.SAM` is clear (legacy supported), the HBA may start in IDE mode; AE forces AHCI.
- Enumerate ports: Read the `PI` register. Extract `num_ports = CAP.NP + 1` (max 32). Bounds validation: verify `num_ports <= ports.capacity()` (the `AhciPort` array has a fixed capacity of 32 matching the AHCI spec maximum). If `CAP.NP + 1` exceeds the array capacity (hardware bug or MMIO corruption), log an FMA error event and clamp `num_ports` to the array capacity. The interrupt handler iterates `0..num_ports` and indexes `hba.ports[port_num]`; this bounds check ensures the loop never exceeds the array length. For each bit set in `PI`:
  a. Allocate `DmaBox<[AhciCmdHeader; 32]>` (command list, 1K-aligned).
  b. Allocate `DmaBox<AhciFisRxArea>` (FIS receive, 256-byte aligned).
  c. Write physical addresses to `PxCLB`/`PxCLBU` and `PxFB`/`PxFBU`.
  d. Pre-allocate `ncs` command tables (each 128-byte aligned).
  e. Clear `PxSERR` (write all-ones to clear).
  f. Set `PxCMD.FRE` (FIS Receive Enable).
  g. If `CAP.SSS` (staggered spin-up): set `PxCMD.SUD` to spin up the device.
  h. Wait for `PxSSTS.DET = 3` (device present and communication established), timeout 1 second.
  i. Read `PxSIG` → classify device type.
  j. Set `PxCMD.ST` (Start command processing).
- IDENTIFY DEVICE: For ATA devices, issue IDENTIFY DEVICE (0xEC). For ATAPI, issue IDENTIFY PACKET DEVICE (0xA1). Parse:
- Words 60-61: Total addressable sectors (28-bit LBA).
- Words 100-103: Total addressable sectors (48-bit LBA).
- Word 75: NCQ queue depth (0-based).
- Word 76 bit 8: NCQ supported.
- Word 82 bit 5: Write cache supported.
- Word 83 bit 10: 48-bit LBA supported.
- Word 84 bit 6: FUA (Force Unit Access) supported.
- Word 106: Logical/physical sector size.
- Word 169 bit 0: TRIM (DATA SET MANAGEMENT) supported.
- Word 217: Nominal media rotation rate (1 = non-rotating = SSD).
- Enable NCQ: If the device supports NCQ, set `ncq_capable = true`. NCQ depth = min(device depth from IDENTIFY word 75, HBA slots from `CAP.NCS`).
- Enable interrupts: Set `PxIE` to enable DHRS, SDBS, PCS, IFS, HBFS, HBDS, TFES. Set `GHC.IE` (global interrupt enable).
- Register with umka-block: Create a `BlockDevice` with sector size, capacity, `supports_flush = write_cache_enabled`, `supports_discard = supports_trim`, `supports_fua = lba48 && identify_word_84_bit_6`.
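The IDENTIFY capacity parsing above (words 60-61 vs. words 100-103, gated on word 83 bit 10) can be sketched over the raw 256-word IDENTIFY buffer (function name illustrative):

```rust
/// Extract total addressable sectors from an IDENTIFY DEVICE buffer
/// (256 little-endian 16-bit words, already byte-swapped to host order).
/// Words 100-103 hold the 48-bit count when word 83 bit 10 (LBA48) is
/// set; otherwise words 60-61 hold the 28-bit count.
fn identify_capacity_sectors(id: &[u16; 256]) -> u64 {
    if id[83] & (1 << 10) != 0 {
        (id[100] as u64)
            | (id[101] as u64) << 16
            | (id[102] as u64) << 32
            | (id[103] as u64) << 48
    } else {
        (id[60] as u64) | (id[61] as u64) << 16
    }
}
```

The result populates `AhciPort::capacity_sectors` and, scaled by the logical sector size, the capacity reported to umka-block.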
15.4.8 Command Submission (Non-NCQ)¶
For legacy (non-NCQ) commands — IDENTIFY, FLUSH, STANDBY, TRIM (non-queued):
- Find a free slot: atomically claim a clear bit in inflight via CAS loop:

loop {
    let current = inflight.load(Acquire);
    let free_bit = (!current).trailing_zeros();
    if free_bit >= 32 { return Err(Error::AGAIN); }
    let new = current | (1 << free_bit);
    if inflight.compare_exchange_weak(current, new, AcqRel, Acquire).is_ok() {
        return Ok(free_bit as u8);
    }
}

If all slots are busy, return EAGAIN (caller retries via block layer backpressure).
- Build FisRegH2D in cmd_tables[slot].cfis:
  - Set fis_type = 0x27, flags = 0x80 (C bit = command).
  - Fill command, lba_lo/hi, count, device (bit 6 = LBA mode).
- Build PRDT entries from Bio.segments — one AhciPrdtEntry per segment. Set dbc_i = (segment.len - 1). Set I bit on last entry.
- Write AhciCmdHeader: set CFL = 5 (20 bytes / 4), W bit if write, PRDTL = segment count.
- Set bit in inflight.
- Write slot bit to PxCI — HBA fetches command header and begins execution.
- Completion arrives via D2H Register FIS interrupt (PxIS.DHRS).
15.4.9 NCQ Command Submission¶
For READ/WRITE FPDMA QUEUED (commands 0x60/0x61) — the fast path:
- Find a free NCQ tag: scan ncq_inflight for a clear bit (max ncq_depth tags).
- Build FisRegH2D:
  - command = 0x60 (read) or 0x61 (write).
  - features_lo/features_hi = sector count (16-bit).
  - lba_lo/lba_hi = 48-bit LBA.
  - count bits (7:3) = NCQ tag number.
  - device = 0x40 (LBA mode); bit 7 = FUA if requested (per ACS-3, NCQ FUA lives in the DEVICE register, not COUNT).
- Build PRDT from bio segments (same as non-NCQ).
- Write AhciCmdHeader: CFL = 5, W bit, PRDTL.
- Initialize slot status: port.slot_status[tag as usize].store(0, Relaxed). This clears any stale error code from a previous command that used this slot. Without this, the completion handler reads a stale error for successful NCQ completions.
- Set tag bit in ncq_inflight. Write tag bit to PxSACT.
- Write slot bit to PxCI.
- Completion: device sends Set Device Bits FIS with SActive bits cleared. HBA sets PxIS.SDBS. Interrupt handler reads PxSACT to determine which tags completed (bits that transitioned from 1→0).
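The completion computation in the last step, and the COUNT-register tag encoding, each reduce to a one-line helper (a sketch with plain integers standing in for the driver's atomics and register reads):

```rust
/// Which NCQ tags completed: bits we issued (ncq_inflight) that the
/// device has since cleared in SActive (1 → 0 transition = done).
/// Plain u32s stand in for the driver's atomic field and the PxSACT
/// register read.
fn completed_tags(ncq_inflight: u32, px_sact: u32) -> u32 {
    ncq_inflight & !px_sact
}

/// NCQ COUNT register encoding: the tag occupies bits 7:3; the low
/// three bits stay zero (the sector count lives in FEATURES for NCQ).
fn ncq_count_field(tag: u8) -> u16 {
    (tag as u16) << 3
}
```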
Tag-to-slot mapping: UmkaOS uses identity mapping (tag N = slot N). This is the
simplest model and avoids the complexity of split tag/slot namespaces. Since
ncq_depth ≤ ncs, there are always enough slots.
15.4.9.1 Callable Function Bodies¶
impl AhciPort {
/// Allocate a free command slot from the port's `inflight` bitmask.
/// Returns the slot index (0..ncs-1). Returns `Err(Error::AGAIN)` if
/// all slots are busy (caller retries via block layer backpressure).
pub fn alloc_slot(&self) -> Result<u8> {
loop {
let current = self.inflight.load(Acquire);
let free_bit = (!current).trailing_zeros();
if free_bit >= self.ncs as u32 {
return Err(Error::AGAIN); // all slots busy
}
let new = current | (1 << free_bit);
if self.inflight.compare_exchange_weak(current, new, AcqRel, Acquire).is_ok() {
return Ok(free_bit as u8);
}
}
}
/// Submit an NCQ READ/WRITE FPDMA QUEUED command for a Bio.
/// Uses NCQ tag = slot index (identity mapping).
///
/// Steps: alloc NCQ tag, build FIS, build PRDT, set PxSACT, set PxCI.
/// Completion arrives via SDB FIS interrupt (PxIS.SDBS).
///
/// Takes `&self` — all mutable state uses interior mutability:
/// `slot_bios` is `[AtomicPtr<Bio>; 32]`, `cmd_tables` DMA memory
/// is mutated via unsafe pointer cast (safe because the `inflight`
/// bitmask guarantees exclusive access to the claimed slot).
pub fn submit_ncq(&self, bio: &mut Bio) -> Result<()> {
let tag = self.alloc_slot()?;
self.slot_bios[tag as usize].store(bio as *mut Bio, Release);
self.slot_status[tag as usize].store(0, Relaxed);
// Build FisRegH2D in cmd_tables[tag].cfis.
// SAFETY: DmaBox provides raw_ptr() returning *mut T (analogous to
// UnsafeCell — the DMA buffer is owned memory accessible via raw
// pointer even through &self). cfis is [u8; 64]; the first 20 bytes
// are the H2D FIS. The inflight bitmask guarantees exclusive access
// to this slot (no concurrent writer or reader for this tag).
let cmd_table_ptr = self.cmd_tables[tag as usize].raw_ptr();
let fis = unsafe {
&mut *((*cmd_table_ptr).cfis.as_mut_ptr() as *mut FisRegH2D)
};
*fis = FisRegH2D::zeroed();
fis.fis_type = 0x27;
fis.flags = 0x80; // C bit (command, not control)
fis.command = if bio.op == BioOp::Read { 0x60 } else { 0x61 };
fis.set_lba48(bio.start_lba);
// NCQ FUA: ACS-3 §7.63.6.4 — FUA is bit 7 of the DEVICE register,
// NOT the COUNT register. The COUNT register carries only the NCQ
// tag in bits [7:3]. Placing FUA in COUNT silently downgrades FUA
// writes to non-FUA, risking data loss on power failure.
let fua_bit: u8 = if bio.flags.contains(BioFlags::FUA) { 0x80 } else { 0 };
fis.device = 0x40 | fua_bit; // LBA mode | FUA (bit 7)
// NCQ: sector count goes in features_lo/features_hi (not count).
// Count register carries only the NCQ tag (bits 7:3).
let sector_count = bio.total_sectors();
fis.features_lo = (sector_count & 0xFF) as u8;
fis.features_hi = ((sector_count >> 8) & 0xFF) as u8;
fis.count = Le16::from_ne((tag as u16) << 3);
// Build PRDT from bio segments
let prdtl = self.build_prdt(tag, bio)?;
// Write AhciCmdHeader DW0: CFL=5 (20 bytes / 4), W=write, PRDTL=prdtl
self.cmd_list[tag as usize].set_flags_prdtl(
5, bio.op == BioOp::Write, prdtl,
);
// Set NCQ inflight and issue
self.ncq_inflight.fetch_or(1u32 << tag, Release);
self.regs.px_sact.write(1u32 << tag);
self.regs.px_ci.write(1u32 << tag);
Ok(())
}
/// Submit a legacy (non-NCQ) DMA READ/WRITE command for a Bio.
/// Used for devices that do not support NCQ (old SATA-I, ATAPI, etc.).
pub fn submit_legacy_dma(&self, bio: &mut Bio) -> Result<()> {
let slot = self.alloc_slot()?;
self.slot_bios[slot as usize].store(bio as *mut Bio, Release);
self.slot_status[slot as usize].store(0, Relaxed);
// SAFETY: same as submit_ncq — DmaBox::raw_ptr() + inflight bitmask exclusion.
let cmd_table_ptr = self.cmd_tables[slot as usize].raw_ptr();
let fis = unsafe {
&mut *((*cmd_table_ptr).cfis.as_mut_ptr() as *mut FisRegH2D)
};
*fis = FisRegH2D::zeroed();
fis.fis_type = 0x27;
fis.flags = 0x80;
fis.command = if bio.op == BioOp::Read { 0x25 } else { 0x35 }; // READ/WRITE DMA EXT
fis.set_lba48(bio.start_lba);
fis.set_sector_count(bio.total_sectors());
fis.device = 0x40;
let prdtl = self.build_prdt(slot, bio)?;
self.cmd_list[slot as usize].set_flags_prdtl(
5, bio.op == BioOp::Write, prdtl,
);
self.regs.px_ci.write(1u32 << slot);
Ok(())
}
/// Submit a non-data ATA command (FLUSH, STANDBY, IDENTIFY, etc.).
/// No PRDT needed — command has no data transfer phase.
pub fn submit_non_data_command(&self, command: u8) -> Result<()> {
let slot = self.alloc_slot()?;
// No bio for non-data commands — store null.
self.slot_bios[slot as usize].store(core::ptr::null_mut(), Release);
self.slot_status[slot as usize].store(0, Relaxed);
// SAFETY: same as submit_ncq — DmaBox::raw_ptr() + inflight bitmask exclusion.
let cmd_table_ptr = self.cmd_tables[slot as usize].raw_ptr();
let fis = unsafe {
&mut *((*cmd_table_ptr).cfis.as_mut_ptr() as *mut FisRegH2D)
};
*fis = FisRegH2D::zeroed();
fis.fis_type = 0x27;
fis.flags = 0x80;
fis.command = command;
fis.device = 0x00;
// LBA and count zeroed by FisRegH2D::zeroed().
self.cmd_list[slot as usize].set_flags_prdtl(5, false, 0);
self.regs.px_ci.write(1u32 << slot);
Ok(())
}
/// Submit FLUSH CACHE EXT using a pre-allocated command slot.
/// Called from BlockDeviceOps::submit_bio (BioOp::Flush) where the
/// caller has already allocated the slot and stored the bio pointer.
pub fn submit_flush_with_slot(&self, slot: u8) -> Result<()> {
let cmd = if self.lba48 { 0xEA } else { 0xE7 };
// SAFETY: same as submit_ncq — DmaBox::raw_ptr() + inflight bitmask exclusion.
let cmd_table_ptr = self.cmd_tables[slot as usize].raw_ptr();
let fis = unsafe {
&mut *((*cmd_table_ptr).cfis.as_mut_ptr() as *mut FisRegH2D)
};
*fis = FisRegH2D::zeroed();
fis.fis_type = 0x27;
fis.flags = 0x80;
fis.command = cmd;
fis.device = 0x00;
// LBA and count zeroed by FisRegH2D::zeroed().
self.cmd_list[slot as usize].set_flags_prdtl(5, false, 0);
self.regs.px_ci.write(1u32 << slot);
Ok(())
}
}
15.4.9.2 Non-NCQ Completion Handler¶
/// Process completion for non-NCQ commands. Called from the IRQ handler
/// when PxIS.DHRS (D2H Register FIS Received) is set.
///
/// For each completed slot: read status from D2H FIS, map slot to bio,
/// call bio_complete() with errno (i32: 0 = success, negative = error).
fn complete_non_ncq_slots(port: &AhciPort, completed: u32) {
for slot in 0..port.ncs {
if completed & (1 << slot) == 0 { continue; }
// Read completion status from D2H Register FIS.
// `fis_rx` is `DmaBox<AhciFisRxArea>`, `d2h_reg` is `[u8; 20]`.
// D2H FIS format: byte 0=FIS type (0x34), byte 1=flags,
// byte 2=status register, byte 3=error register.
let ata_status = port.fis_rx.d2h_reg[2];
let _ata_error = port.fis_rx.d2h_reg[3];
// Map ATA error to errno: ERR bit (status bit 0) = -EIO, else success.
// (A production driver may refine using _ata_error: ABRT, IDNF, UNC, etc.)
let errno: i32 = if ata_status & 0x01 != 0 {
-(EIO as i32)
} else {
0
};
// Clear inflight bit
port.inflight.fetch_and(!(1u32 << slot), Release);
// Complete the bio if one was associated with this slot.
// AtomicPtr::swap(null) atomically retrieves and clears the pointer.
let bio_ptr = port.slot_bios[slot as usize].swap(core::ptr::null_mut(), AcqRel);
if !bio_ptr.is_null() {
// SAFETY: bio_ptr was stored by submit_bio and is valid
// until bio_complete is called.
let bio = unsafe { &mut *bio_ptr };
bio_complete(bio, errno);
}
port.slot_completions[slot as usize].store(IDLE, Release);
}
}
15.4.10 Flush and Standby Submission¶
impl AhciPort {
/// Submit FLUSH CACHE EXT (command 0xEA) asynchronously.
/// Returns immediately; completion is signaled via the D2H FIS interrupt.
/// For non-48-bit devices, falls back to FLUSH CACHE (0xE7).
pub fn submit_flush(&self) -> Result<()> {
let cmd = if self.lba48 { 0xEA } else { 0xE7 };
self.submit_non_data_command(cmd)
}
/// Submit FLUSH CACHE EXT and wait for completion (blocking).
/// Used by the `flush()` block device method and shutdown path.
pub fn submit_flush_sync(&self) -> Result<()> {
self.submit_flush()?;
self.wait_for_completion()
}
/// Submit STANDBY IMMEDIATE (command 0xE0) — spins down the device.
/// Used during shutdown to ensure clean power-off.
pub fn submit_standby_immediate(&self) -> Result<()> {
self.submit_non_data_command(0xE0)?;
self.wait_for_completion()
}
}
15.4.11 Interrupt Handler¶
The AHCI interrupt handler runs in hardirq context (Tier 1 domain):
fn ahci_irq_handler(hba: &AhciHba) -> IrqReturn {
let global_is = hba.regs.read_is();
if global_is == 0 { return IrqReturn::None; }
for port_num in 0..hba.num_ports {
if global_is & (1 << port_num) == 0 { continue; }
let port = &hba.ports[port_num];
let port_is = port.regs.read_is();
// NCQ completions — Set Device Bits FIS received.
if port_is & AHCI_PxIS_SDBS != 0 {
let completed = port.ncq_inflight.load(Acquire)
& !port.regs.read_sact();
// For each completed tag: retrieve the associated Bio from the
// per-slot inflight table and call bio_complete() per the unified
// completion API ([Section 15.2](#block-io-and-volume-management--bio-completion)).
for tag in 0..32u8 {
if completed & (1 << tag) != 0 {
let bio_ptr = port.slot_bios[tag as usize].swap(
core::ptr::null_mut(), AcqRel,
);
if !bio_ptr.is_null() {
let bio = unsafe { &mut *bio_ptr };
let status = port.slot_status[tag as usize].load(Acquire);
bio_complete(bio, status);
}
port.ncq_inflight.fetch_and(!(1u32 << tag), Release);
port.slot_completions[tag as usize].store(IDLE, Release);
}
}
}
// Non-NCQ completion — D2H Register FIS received.
if port_is & AHCI_PxIS_DHRS != 0 {
let completed = port.inflight.load(Acquire)
& !port.regs.read_ci();
// Same pattern: retrieve Bio, call bio_complete(bio, status),
// clear inflight bit. Non-NCQ uses slot_completions tracking but
// MUST use bio_complete() for the BioState CAS state machine.
complete_non_ncq_slots(port, completed);
}
// Hot-plug event — port connect change.
if port_is & AHCI_PxIS_PCS != 0 {
handle_hotplug(port);
}
// Error conditions.
if port_is & AHCI_PxIS_ERROR_MASK != 0 {
handle_port_error(port, port_is);
}
// Clear handled interrupts.
port.regs.write_is(port_is);
}
// Clear global IS.
hba.regs.write_is(global_is);
IrqReturn::Handled
}
const AHCI_PxIS_ERROR_MASK: u32 =
    (1 << 27) | // IFS: interface fatal error
    (1 << 28) | // HBDS: host bus data error
    (1 << 29) | // HBFS: host bus fatal error
    (1 << 30); // TFES: task file error status
15.4.12 Error Recovery¶
AHCI defines three error classes, each with a different recovery procedure:
Non-fatal errors (PxIS.INFS — interface non-fatal): Log the error. Clear PxSERR. No command retry needed — the link layer recovered automatically.
Fatal errors (PxIS.IFS, HBFS, HBDS — interface fatal, host bus fatal/data):
- Set error_state = Recovering. New submissions are blocked.
- Clear PxCMD.ST (stop command engine). Wait for PxCMD.CR = 0 (timeout 500ms).
- If PxCMD.CR doesn't clear: set PxCMD.CLO (Command List Override) if CAP.SCLO is supported, then retry. If CLO fails: perform COMRESET (write PxSCTL.DET = 1, wait 1ms, write PxSCTL.DET = 0).
- Clear PxSERR (write all-ones). Clear PxIS (write all-ones).
- Set PxCMD.FRE, then PxCMD.ST — restart the port.
- Re-identify the device (IDENTIFY DEVICE) to confirm it's still responsive.
- Read the NCQ error log via READ LOG EXT (log page 10h) to identify the failed command tag and error reason. For non-NCQ commands, read PxTFD.ERR directly.
- Classify in-flight commands by type:
  - Write commands: Cancel with -EIO. Retrying writes after a fatal error risks data corruption — the device may have partially written the data, and a retry would produce a second write with potentially different data ordering. The filesystem layer (journal or COW) is responsible for replaying writes with proper ordering guarantees.
  - Read commands: Retry up to 3 times. Read retries are safe because reads are idempotent — the device returns the same data regardless of how many times the read is issued.
  - Non-data commands (FLUSH, IDENTIFY): Retry once.
- Set error_state = Normal. Resume accepting new submissions.
Task file errors (PxIS.TFES — device reported error in TFD.STS.ERR):
- For NCQ: the device error log must be read via READ LOG EXT (log page 10h, "NCQ Command Error"). This identifies which tag failed and the error reason. Steps: stop port, issue READ LOG EXT (non-queued, slot 0), read the failing tag + error code, retry or fail that specific bio, restart NCQ for remaining tags.
- For non-NCQ: read PxTFD.ERR directly. The error register indicates the cause (ABRT, IDNF, UNC, etc.). Map to appropriate errno:
- UNC (uncorrectable data error) → EIO
- IDNF (ID not found) → EIO (LBA out of range)
- ABRT (command aborted) → EIO (retry once, then fail)
- ICRC (interface CRC error) → retry (link issue)
Retry policy: Each bio gets up to 3 retries for transient errors (CRC, timeout). Permanent media errors (UNC) are reported immediately — no retry.
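The errno mapping and retry policy above condense into one classification function; the enum, constants, and bit positions below follow the ATA error register layout, but the names are illustrative rather than the driver's actual API:

```rust
// ATA error register bits (ATA/ACS bit positions).
const ERR_ABRT: u8 = 1 << 2; // command aborted
const ERR_IDNF: u8 = 1 << 4; // ID not found (LBA out of range)
const ERR_UNC: u8 = 1 << 6; // uncorrectable data error
const ERR_ICRC: u8 = 1 << 7; // interface CRC error

const EIO: i32 = 5;

#[derive(Debug, PartialEq)]
enum ErrAction {
    /// Report errno immediately; no retry.
    Fail(i32),
    /// Retry up to `max` times, then fail with `errno`.
    Retry { max: u8, errno: i32 },
}

/// Most-specific-first classification: UNC is permanent media damage,
/// so it wins even when a transient ICRC bit is set alongside it.
fn classify_ata_error(err: u8) -> ErrAction {
    if err & ERR_UNC != 0 {
        ErrAction::Fail(-EIO) // permanent media error, no retry
    } else if err & ERR_ICRC != 0 {
        ErrAction::Retry { max: 3, errno: -EIO } // transient link issue
    } else if err & ERR_ABRT != 0 {
        ErrAction::Retry { max: 1, errno: -EIO }
    } else if err & ERR_IDNF != 0 {
        ErrAction::Fail(-EIO) // LBA out of range
    } else {
        ErrAction::Fail(-EIO) // unrecognized error bit: fail safe
    }
}
```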
15.4.13 Hot-Plug¶
AHCI supports hot-plug detection via PxIS.PCS (Port Connect change) and PxIS.DMPS (Device Mechanical Presence):
- Device insertion: PxSSTS.DET transitions to 3 (device present + communication established). The driver allocates command list/FIS buffers (if not pre-allocated), issues IDENTIFY DEVICE, and registers a new BlockDevice.
- Device removal: PxSSTS.DET transitions to 0. The driver unregisters the BlockDevice, fails all in-flight bios with ENODEV, and releases DMA buffers. Active filesystem mounts on the device receive I/O errors — unmount is the user's responsibility.
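The insertion/removal decision keys off PxSSTS.DET edges; a minimal sketch of that edge detection (the enum and function names are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum HotplugEvent {
    Inserted,
    Removed,
    None,
}

/// Edge detection on PxSSTS.DET (bits 3:0): 0 = no device,
/// 3 = device present and phy communication established.
fn classify_det_change(old_det: u8, new_det: u8) -> HotplugEvent {
    match (old_det, new_det) {
        (old, 3) if old != 3 => HotplugEvent::Inserted,
        (3, 0) => HotplugEvent::Removed,
        _ => HotplugEvent::None,
    }
}
```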
15.4.14 ATAPI Passthrough¶
ATAPI devices (optical drives, tape) use the PACKET command (ATA 0xA0) to carry 12-byte or 16-byte SCSI CDBs:
- Build FisRegH2D with command = 0xA0.
- Write the SCSI CDB (e.g., READ(10), INQUIRY, TEST UNIT READY) into cmd_table.acmd[0..12].
- Set AhciCmdHeader.flags.A = 1 (ATAPI bit).
- PRDT carries data for data-in/data-out CDBs.
- The device responds with PIO Setup FIS (for PIO data) or D2H Register FIS (for non-data commands). Check sense data on error (REQUEST SENSE CDB).
ATAPI is exposed to userspace via the standard Linux SG_IO ioctl (Section 19.7) for CD/DVD burning tools (cdrecord, growisofs) and media players.
15.4.15 Power Management¶
The AHCI driver supports four link power states and two device power states:
| State | Wake Latency | Triggered By |
|---|---|---|
| Active | 0 | I/O activity |
| Partial | ~10 μs | ALPM: idle >5ms (configurable) |
| Slumber | ~10 ms | ALPM: idle >100ms (configurable) |
| DevSleep | ~20 ms | CAP2.SDS + PxDEVSLP: idle >1s |
| Standby (device) | ~5-15 s | System suspend or explicit STANDBY IMMEDIATE |
| Sleep (device) | full reset | System hibernate (not commonly used) |
Aggressive Link Power Management (ALPM): When PxCMD.ALPE is set, the HBA
automatically transitions the link to Partial or Slumber after inactivity. The
driver sets PxCMD.ASP = 1 for Slumber preference (deeper sleep). ALPM is
enabled by default on battery-powered systems and disabled on servers (latency
sensitivity). The policy is controlled by sysfs:
/sys/class/scsi_host/hostN/link_power_management_policy — values: min_power,
med_power_with_dipm, max_performance (Linux-compatible).
System suspend path (Section 7.9):
1. Flush write cache: FLUSH CACHE EXT (0xEA).
2. Standby: STANDBY IMMEDIATE (0xE0).
3. Stop port: clear PxCMD.ST, wait for PxCMD.CR = 0.
4. Disable FIS receive: clear PxCMD.FRE, wait for PxCMD.FR = 0.
System resume path: Reverse — enable FRE, start port (ST), re-identify device.
15.4.16 BlockDeviceOps Implementation¶
/// Per-port block device wrapper. One `AhciBlockDevice` is created per
/// AHCI port that has a device attached (detected during port probe).
/// Registered with the block layer via `register_block_device()`.
pub struct AhciBlockDevice {
/// Reference to the AHCI port state (command list, FIS receive, NCQ state).
pub port: Arc<AhciPort>,
/// NUMA node closest to this controller's PCIe slot (for allocation affinity).
/// u16 matches BlockDeviceInfo.numa_node width (supports 65535 nodes).
pub numa_node: u16,
}
impl BlockDeviceOps for AhciBlockDevice {
fn submit_bio(&self, bio: &mut Bio) -> Result<()> {
if self.port.error_state.load(Acquire) != AhciPortErrorState::Normal as u8 {
return Err(Error::IO); // Port in error recovery
}
match bio.op {
BioOp::Read | BioOp::Write => {
if self.port.ncq_capable {
self.port.submit_ncq(bio)
} else {
self.port.submit_legacy_dma(bio)
}
}
BioOp::Flush => {
// Allocate a command slot and store the bio pointer so the
// D2H FIS completion handler can map the slot back to this
// bio and call bio_complete(). Without this, the flush bio's
// StackWaiter would never be woken (same fix as NVMe F15-03).
let slot = self.port.alloc_slot()?;
self.port.slot_bios[slot as usize].store(bio as *mut Bio, Release);
self.port.slot_status[slot as usize].store(0, Relaxed);
self.port.submit_flush_with_slot(slot)
}
BioOp::Discard => {
if self.port.supports_trim {
self.port.submit_trim(bio)
} else {
Err(Error::NOSYS)
}
}
BioOp::SecureErase => {
if self.port.supports_sanitize {
self.port.submit_sanitize(bio)
} else {
Err(Error::NOSYS)
}
}
BioOp::WriteZeroes => Err(Error::NOSYS), // ATA has no write-zeroes
BioOp::ZoneAppend => Err(Error::NOSYS), // Not a zoned device
}
}
fn flush(&self) -> Result<()> {
self.port.submit_flush_sync()
}
fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()> {
if !self.port.supports_trim { return Err(Error::NOSYS); }
// DATA SET MANAGEMENT (0x06) with TRIM bit. Payload: array of
// (LBA, count) pairs in 512-byte LBA Range Entry format.
self.port.submit_trim_range(start_lba, len_sectors)
}
fn get_info(&self) -> BlockDeviceInfo {
BlockDeviceInfo {
logical_block_size: self.port.logical_sector_size,
physical_block_size: self.port.physical_sector_size,
capacity_sectors: self.port.capacity_sectors,
max_segments: 248, // PRDT entries per command table (fills one 4KB page)
// 1 MiB intentional constant (not PAGE_SIZE-dependent). AHCI PRDT entries
// support up to 4 MiB each, but 1 MiB is a practical limit that works well
// across all page sizes (4K/16K/64K) without exceeding DMA mapping budgets.
max_bio_size: 1024 * 1024, // 1 MiB
flags: {
let mut f = BlockDeviceFlags::empty();
if self.port.supports_trim { f |= BlockDeviceFlags::DISCARD; }
if self.port.write_cache_enabled { f |= BlockDeviceFlags::FLUSH; }
if self.port.supports_fua { f |= BlockDeviceFlags::FUA; } // LBA48 AND IDENTIFY word 84 bit 6
// IDENTIFY word 217: nominal media rotation rate.
// 0x0001 = SSD (non-rotating), any other non-zero value = RPM.
if self.port.nominal_rotation_rate != 1 {
f |= BlockDeviceFlags::ROTATIONAL;
}
f
},
optimal_io_size: self.port.physical_sector_size,
numa_node: self.numa_node,
}
}
fn shutdown(&self) -> Result<()> {
// Flush cache, standby, stop port.
self.port.submit_flush_sync()?;
self.port.submit_standby_immediate()?;
self.port.stop()
}
}
15.4.17 KABI Driver Manifest¶
[driver]
name = "ahci"
version = "1.0.0"
tier = 1
bus-type = "pci"
[match]
pci-class = "01:06:01" # Mass Storage / SATA / AHCI 1.0
[capabilities]
dma = true
interrupts = "msi-x" # Preferred; falls back to MSI, then legacy INTx
max-memory = "4MB" # Per-port: 1K cmd_list + 256 FIS + 32×cmd_tables
[recovery]
crash-action = "reload"
state-preservation = true # Replay in-flight bios on reload
max-reload-time-ms = 500
15.4.18 Design Decisions¶
| Decision | Rationale |
|---|---|
| Tier 1 (not Tier 2) | SATA is block-latency-sensitive; Ring 3 crossing adds ~5-15 μs per bio — unacceptable for HDD seek-bound workloads |
| Pre-allocated command tables | No heap allocation on the I/O hot path. All 32 command tables allocated at probe time. |
| Identity tag-to-slot mapping | Avoids tag/slot translation overhead. NCQ depth ≤ NCS is guaranteed by spec. |
| NCQ by default | NCQ (FPDMA) is strictly better than legacy DMA for multi-outstanding I/O. Fall back to legacy DMA only for IDENTIFY, FLUSH, and error recovery. |
| 248 PRDT entries | Fills one 4KB page (128B header + 248 × 16B). Each entry addresses up to 4MB (DBC field), but block layer bio splitting caps practical I/O at ~1MB. Linux uses LIBATA_MAX_PRD = 128; UmkaOS uses 248 to maximize scatter-gather capacity within one page. |
| COMRESET as last resort | Port reset is expensive (~1-2s with device spin-up). Used only when CLO fails. |
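The 248-entry figure in the table is straightforward layout arithmetic, which a few constants make checkable (a sketch; the constant names are illustrative):

```rust
// AHCI command table layout: 64-byte CFIS region + 16-byte ACMD +
// 48 reserved bytes = 128-byte header, followed by 16-byte PRDT entries.
const CMD_TABLE_HEADER: usize = 128;
const PRDT_ENTRY_SIZE: usize = 16;
const PAGE: usize = 4096;

/// How many PRDT entries fit in one 4KB page after the header.
fn max_prdt_entries_in_one_page() -> usize {
    (PAGE - CMD_TABLE_HEADER) / PRDT_ENTRY_SIZE
}
```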
15.5 VirtIO-blk Driver Architecture¶
Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules.
&self methods use interior mutability for mutation. Atomic fields use .store()/.load(). All #[repr(C)] structs have const_assert! size verification. See CLAUDE.md Spec Pseudocode Quality Gates.
The VirtIO-blk driver is a Tier 1 KABI driver that provides block storage access in virtualized environments. VirtIO-blk is the primary boot disk driver for QEMU/KVM, Firecracker, Cloud Hypervisor, and other VMMs using the VirtIO specification. This is a Phase 2 driver — required for the busybox boot demo on all 8 architectures.
Reference specification: VirtIO 1.2 (OASIS, July 2022), Section 5.2.
15.5.1 VirtIO Transport¶
The VirtIO-blk driver uses the common VirtIO transport layer defined in
Section 11.3 — VirtioTransport trait,
VirtqDesc/VirtqAvail/VirtqUsed ring structures, packed ring (VirtqPackedDesc),
feature negotiation protocol, and common feature bits (VIRTIO_F_*). This section
defines only the block-device-specific configuration, request format, and driver state.
VirtIO-blk is PCI device vendor 0x1AF4, device ID 0x1001 (transitional) or 0x1042 (modern, non-transitional). On MMIO transports (AArch64, ARMv7, RISC-V, PPC), the device type is 2.
15.5.2 Device Configuration Space¶
The VirtIO-blk device exposes a device-specific configuration structure:
/// VirtIO block device configuration (VirtIO 1.2 §5.2.4).
/// Read via VirtioTransport::read_config().
/// All multi-byte fields are little-endian per VirtIO 1.2 §2.4.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures including big-endian
/// PPC32 and s390x. Matches Linux's `__virtio_le16`/`__virtio_le32`/`__virtio_le64`.
///
/// `packed` matches Linux's `__attribute__((packed))` on `struct virtio_blk_config`.
/// Without it, compiler padding between fields with different alignment requirements
/// (e.g., the nested `VirtioBlkTopology` struct) could silently shift field offsets.
/// The current layout happens to be naturally aligned, but `packed` enforces this
/// invariant against future edits.
#[repr(C, packed)]
pub struct VirtioBlkConfig {
/// Device capacity in 512-byte sectors.
pub capacity: Le64,
/// Maximum size of any single segment (if VIRTIO_BLK_F_SIZE_MAX).
pub size_max: Le32,
/// Maximum number of segments in a request (if VIRTIO_BLK_F_SEG_MAX).
pub seg_max: Le32,
/// Device geometry (if VIRTIO_BLK_F_GEOMETRY).
pub geometry: VirtioBlkGeometry,
/// Logical block size in bytes (if VIRTIO_BLK_F_BLK_SIZE). Default 512.
pub blk_size: Le32,
/// Topology information (if VIRTIO_BLK_F_TOPOLOGY).
pub topology: VirtioBlkTopology,
/// Write Cache Enable (if VIRTIO_BLK_F_CONFIG_WCE). 1 = writeback, 0 = writethrough.
/// Linux/VirtIO canonical name: `wce`.
pub wce: u8,
/// Padding.
pub _unused0: u8,
/// Number of virtqueues (if VIRTIO_BLK_F_MQ). Default 1.
pub num_queues: Le16,
/// Maximum discard sectors (if VIRTIO_BLK_F_DISCARD).
pub max_discard_sectors: Le32,
/// Maximum discard segments (if VIRTIO_BLK_F_DISCARD).
pub max_discard_seg: Le32,
/// Discard sector alignment (if VIRTIO_BLK_F_DISCARD).
pub discard_sector_alignment: Le32,
/// Maximum write-zeroes sectors (if VIRTIO_BLK_F_WRITE_ZEROES).
pub max_write_zeroes_sectors: Le32,
/// Maximum write-zeroes segments (if VIRTIO_BLK_F_WRITE_ZEROES).
pub max_write_zeroes_seg: Le32,
/// Write-zeroes may unmap (if VIRTIO_BLK_F_WRITE_ZEROES).
pub write_zeroes_may_unmap: u8,
/// Padding (3 bytes to match Linux's `unused1[3]`).
pub _unused1: [u8; 3],
/// Maximum secure erase sectors (if VIRTIO_BLK_F_SECURE_ERASE, VirtIO 1.2+).
pub max_secure_erase_sectors: Le32,
/// Maximum secure erase segments (if VIRTIO_BLK_F_SECURE_ERASE).
pub max_secure_erase_seg: Le32,
/// Secure erase sector alignment (if VIRTIO_BLK_F_SECURE_ERASE).
pub secure_erase_sector_alignment: Le32,
/// Zoned device characteristics (if VIRTIO_BLK_F_ZONED).
/// VirtIO 1.2 §5.2.6.2: 5 × Le32 + 1 × u8 + 3 bytes padding = 24 bytes.
pub zoned: VirtioBlkZonedCharacteristics,
}
const_assert!(core::mem::size_of::<VirtioBlkConfig>() == 96);
/// Device geometry (VirtIO 1.2 §5.2.4).
/// Multi-byte fields are little-endian per VirtIO spec.
#[repr(C)]
pub struct VirtioBlkGeometry {
pub cylinders: Le16,
pub heads: u8,
pub sectors: u8,
}
const_assert!(core::mem::size_of::<VirtioBlkGeometry>() == 4);
/// Block topology hints for alignment (VirtIO 1.2 §5.2.4).
/// Multi-byte fields are little-endian per VirtIO spec.
#[repr(C)]
pub struct VirtioBlkTopology {
/// Number of logical blocks per physical block (log2).
pub physical_block_exp: u8,
/// Offset of first aligned logical block.
pub alignment_offset: u8,
/// Suggested minimum I/O size in logical blocks.
pub min_io_size: Le16,
/// Suggested optimal I/O size in logical blocks.
pub opt_io_size: Le32,
}
// VirtIO spec: physical_block_exp(1)+alignment_offset(1)+min_io_size(2)+opt_io_size(4) = 8 bytes.
const_assert!(core::mem::size_of::<VirtioBlkTopology>() == 8);
/// Zoned block device characteristics (VirtIO 1.2 §5.2.6.2).
/// Present when VIRTIO_BLK_F_ZONED is negotiated.
/// All multi-byte fields are little-endian per VirtIO spec.
#[repr(C)]
pub struct VirtioBlkZonedCharacteristics {
    /// Number of sectors per zone.
    pub zone_sectors: Le32,
    /// Maximum number of open zones.
    pub max_open_zones: Le32,
    /// Maximum number of active zones.
    pub max_active_zones: Le32,
/// Maximum append sectors.
pub max_append_sectors: Le32,
/// Write granularity.
pub write_granularity: Le32,
/// Zoned model: 0=none, 1=host-aware, 2=host-managed.
pub model: u8,
/// Padding to 24 bytes.
pub _pad: [u8; 3],
}
const_assert!(core::mem::size_of::<VirtioBlkZonedCharacteristics>() == 24);
15.5.3 Feature Negotiation¶
Feature negotiation follows the VirtIO standard 3-step process:
- Device offers features: Driver reads 64-bit feature bitmap.
- Driver accepts subset: Driver writes back only the features it supports.
- Driver sets FEATURES_OK: Device validates; if cleared, negotiation failed.
/// VirtIO block feature bits (VirtIO 1.2 §5.2.3).
pub mod virtio_blk_features {
/// Maximum size of any single segment is in `size_max`.
pub const VIRTIO_BLK_F_SIZE_MAX: u64 = 1 << 1;
/// Maximum number of segments in a request is in `seg_max`.
pub const VIRTIO_BLK_F_SEG_MAX: u64 = 1 << 2;
/// Disk-style geometry specified in `geometry`.
pub const VIRTIO_BLK_F_GEOMETRY: u64 = 1 << 4;
/// Device is read-only.
pub const VIRTIO_BLK_F_RO: u64 = 1 << 5;
/// Disk logical block size in `blk_size`.
pub const VIRTIO_BLK_F_BLK_SIZE: u64 = 1 << 6;
/// Cache flush command support (VIRTIO_BLK_T_FLUSH).
pub const VIRTIO_BLK_F_FLUSH: u64 = 1 << 9;
/// Device exports topology information in `topology`.
pub const VIRTIO_BLK_F_TOPOLOGY: u64 = 1 << 10;
/// Device can toggle its cache between writeback and writethrough.
pub const VIRTIO_BLK_F_CONFIG_WCE: u64 = 1 << 11;
/// Device supports multi-queue (num_queues virtqueues).
pub const VIRTIO_BLK_F_MQ: u64 = 1 << 12;
/// Device supports discard (VIRTIO_BLK_T_DISCARD).
pub const VIRTIO_BLK_F_DISCARD: u64 = 1 << 13;
/// Device supports write-zeroes (VIRTIO_BLK_T_WRITE_ZEROES).
pub const VIRTIO_BLK_F_WRITE_ZEROES: u64 = 1 << 14;
// Bit 15 is unassigned in both Linux UAPI and VirtIO 1.2/1.3 spec.
/// Device supports secure erase (VIRTIO_BLK_T_SECURE_ERASE).
pub const VIRTIO_BLK_F_SECURE_ERASE: u64 = 1 << 16;
/// Device reports zoned block device characteristics.
pub const VIRTIO_BLK_F_ZONED: u64 = 1 << 17;
// Transport-level common feature bits (VIRTIO_F_*) are defined in
// Section 11.4.3.1 — virtio_features module.
}
The UmkaOS driver always negotiates: VIRTIO_F_VERSION_1 (required for modern),
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_FLUSH (if offered),
VIRTIO_BLK_F_TOPOLOGY (if offered), VIRTIO_BLK_F_MQ (if offered),
VIRTIO_BLK_F_DISCARD (if offered), VIRTIO_BLK_F_WRITE_ZEROES (if offered),
VIRTIO_F_INDIRECT_DESC (if offered), VIRTIO_F_EVENT_IDX (if offered),
VIRTIO_F_RING_PACKED (if offered). Transport-level feature bits are defined in
Section 11.3.
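Step 2 of the negotiation — accepting only the intersection of offered and driver-supported bits, after checking the mandatory version bit — can be sketched as follows. The DRIVER_SUPPORTED mask here is a reduced illustration, not the full list the driver negotiates:

```rust
// Feature bit values for the sketch (VIRTIO_F_VERSION_1 is transport
// bit 32 per VirtIO 1.2; the block bits match §5.2.3).
const VIRTIO_F_VERSION_1: u64 = 1 << 32;
const VIRTIO_BLK_F_SEG_MAX: u64 = 1 << 2;
const VIRTIO_BLK_F_GEOMETRY: u64 = 1 << 4;
const VIRTIO_BLK_F_FLUSH: u64 = 1 << 9;
const VIRTIO_BLK_F_MQ: u64 = 1 << 12;

/// Optional features this sketch's driver knows how to use.
const DRIVER_SUPPORTED: u64 =
    VIRTIO_BLK_F_SEG_MAX | VIRTIO_BLK_F_FLUSH | VIRTIO_BLK_F_MQ;

/// Returns the feature bitmap to write back, or None if the mandatory
/// VIRTIO_F_VERSION_1 bit is missing (legacy-only device: refused).
fn negotiate(offered: u64) -> Option<u64> {
    if offered & VIRTIO_F_VERSION_1 == 0 {
        return None;
    }
    // Accept the intersection; anything the device offered but the
    // driver doesn't understand (e.g., GEOMETRY here) is dropped.
    Some(VIRTIO_F_VERSION_1 | (offered & DRIVER_SUPPORTED))
}
```

After writing the accepted set back, the driver sets FEATURES_OK and re-reads status to confirm the device accepted the subset (step 3).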
15.5.4 Virtqueue Usage¶
The VirtIO-blk driver uses the split ring (VirtqDesc/VirtqAvail/VirtqUsed)
or packed ring (VirtqPackedDesc) layouts defined in
Section 11.3. Split ring is the default;
packed ring (VIRTIO_F_RING_PACKED) is negotiated when the device offers it.
Virtqueue sizing: The driver queries max_queue_size() from the transport.
Typical values: 128, 256, or 512 entries. The UmkaOS driver uses the device's
maximum (no benefit to restricting it). All DMA regions are allocated as a single
contiguous buffer with the alignment requirements specified in §11.4.3.1.
15.5.5 Request Format¶
All VirtIO-blk requests use a three-part descriptor chain:
/// VirtIO block request header — 16 bytes (device-readable).
/// All multi-byte fields are little-endian per VirtIO 1.2 §5.2.6.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct VirtioBlkReqHeader {
/// Request type.
pub req_type: Le32,
/// I/O priority (class and priority, Linux `IOPRIO_PRIO_VALUE` encoding).
/// Only meaningful if device supports prioritized I/O (currently advisory).
pub ioprio: Le32,
/// Starting sector (512-byte units regardless of logical block size).
pub sector: Le64,
}
const_assert!(core::mem::size_of::<VirtioBlkReqHeader>() == 16);
/// VirtIO block request types.
#[repr(u32)]
pub enum VirtioBlkReqType {
/// Read from device to guest memory.
In = 0,
/// Write from guest memory to device.
Out = 1,
/// Flush volatile cache. sector field is ignored.
Flush = 4,
/// Discard sectors (if VIRTIO_BLK_F_DISCARD).
Discard = 11,
/// Write zeroes (if VIRTIO_BLK_F_WRITE_ZEROES).
WriteZeroes = 13,
/// Secure erase (if VIRTIO_BLK_F_SECURE_ERASE).
SecureErase = 14,
}
/// VirtIO block request status — 1 byte (device-writable).
/// Written by the device as the last byte of the used descriptor chain.
#[repr(u8)]
pub enum VirtioBlkStatus {
/// Success.
Ok = 0,
/// Device or driver error.
IoErr = 1,
/// Request unsupported by device.
Unsupp = 2,
}
Descriptor chain layout for a read/write request:
| Descriptor | Direction | Contents |
|---|---|---|
| 0 (head) | Device-readable | VirtioBlkReqHeader (16 bytes) |
| 1..N | Read: device-writable; Write: device-readable | Data segments (from bio) |
| N+1 (tail) | Device-writable | Status byte (VirtioBlkStatus, 1 byte) |
For flush requests: head (16 bytes, type=Flush) + tail (1 byte status). No data descriptors.
For discard/write-zeroes: head + discard/write-zeroes segment descriptors
(each VirtioBlkDiscardWriteZeroes, 16 bytes: sector, num_sectors, flags) + tail.
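The discard/write-zeroes segment referenced above can be sketched as follows (plain integer fields stand in for the Le* wire types of Section 6.1; layout per VirtIO 1.2 §5.2.6):

```rust
/// Discard/write-zeroes segment — 16 bytes (device-readable).
/// Sketch: plain integers here; the real driver would use Le64/Le32.
#[repr(C)]
pub struct VirtioBlkDiscardWriteZeroes {
    /// Starting sector (512-byte units).
    pub sector: u64,
    /// Number of sectors to discard or zero.
    pub num_sectors: u32,
    /// Bit 0 (unmap): for write-zeroes, hint that blocks may be deallocated.
    pub flags: u32,
}
```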
15.5.6 Multi-Queue Support¶
When VIRTIO_BLK_F_MQ is negotiated, the device provides num_queues virtqueues.
The driver creates one virtqueue per CPU (up to num_queues). I/O requests from
a CPU are submitted to that CPU's virtqueue, so the per-queue lock is
uncontended in the common case:
/// VirtIO-blk driver state.
pub struct VirtioBlkDevice {
/// VirtIO transport handle (PCI or MMIO).
/// `&'static dyn` instead of `Box<dyn>`: transport objects are registered
/// once at boot (PCI) or device probe (MMIO) and live for the device's
/// lifetime. No heap allocation needed on the hot path.
pub transport: &'static dyn VirtioTransport,
/// Negotiated features.
pub features: u64,
/// Per-queue state. One per virtqueue (1 for single-queue, up to num_queues for MQ).
/// Warm-path allocation at device probe. Size = negotiated num_queues
/// (from VirtIO config). Bounded by VirtIO spec: max 65535, practical QEMU 256.
/// Each queue wrapped in SpinLock for interior mutability from &self.
pub queues: Box<[SpinLock<VirtioBlkQueue>]>,
/// Device capacity in 512-byte sectors.
pub capacity: u64,
/// Logical block size in bytes.
pub blk_size: u32,
/// Physical block exponent (log2 of physical/logical ratio).
pub physical_block_exp: u8,
/// Device is read-only.
pub read_only: bool,
/// Writeback mode (if CONFIG_WCE negotiated).
pub writeback: bool,
}
/// Per-virtqueue state.
pub struct VirtioBlkQueue {
/// Queue index (0-based).
pub index: u16,
/// Queue size (number of descriptors, power of 2).
pub size: u16,
/// DMA-coherent descriptor table (VirtqDesc defined in §11.4.3.1).
pub desc_table: DmaBox<[VirtqDesc]>,
/// DMA-coherent available ring (VirtqAvail defined in §11.4.3.1).
pub avail_ring: DmaBox<VirtqAvail>,
/// DMA-coherent used ring (VirtqUsed defined in §11.4.3.1).
pub used_ring: DmaBox<VirtqUsed>,
/// Free descriptor list — indices of available descriptors.
/// Pre-populated at init time (all descriptors free).
///
/// Capacity is `queue_size` (negotiated with the device at probe time,
/// power-of-2, range 1-32768 per VirtIO spec 1.2 §2.7). The backing
/// storage is allocated from slab at queue init (warm path) with
/// `Box<[u16]>` of length `queue_size`. This avoids a fixed 512-entry
/// cap that would silently drop descriptors on devices with larger
/// queues (e.g., cloud VirtIO-blk with 1024 or 4096 queue depth).
pub free_list: Box<[u16]>,
/// Number of valid entries in `free_list` (0..=queue_size).
pub free_count: u16,
/// Last seen used ring index (for polling completions).
pub last_used_idx: u16,
/// In-flight request tracking: maps descriptor head index → Bio.
/// Allocated with `queue_size` entries at init.
///
/// # Safety
///
/// Each `*mut Bio` is a borrow of a Bio owned by the block layer.
/// Ownership contract: the block layer retains ownership; the driver
/// holds a mutable borrow from `submit_bio()` until the corresponding
/// used ring entry is consumed in `virtio_blk_complete()`. After
/// completion, the driver sets the slot to `None` and calls
/// `bio_end_io(bio, status)`, returning the borrow to the block layer.
/// The raw pointer (rather than `&mut Bio`) is required because the Bio
/// may be accessed from interrupt context (used ring polling) where
/// Rust's borrow checker cannot track the lifetime across the async
/// hardware boundary.
///
/// Invariants:
/// - Each `*mut Bio` is valid from `submit_bio()` until the
/// corresponding used ring entry is consumed and `bio_end_io()` is called.
/// - The block layer guarantees the Bio remains allocated while inflight.
/// - Only the owning VirtioBlkQueue may read/write a slot (per-queue lock
/// or per-CPU exclusivity ensures no data races).
/// - The driver MUST NOT dereference a `*mut Bio` after calling
/// `bio_end_io()`. The slot MUST be set to `None` before any
/// subsequent access.
pub inflight: Box<[Option<*mut Bio>]>,
}
Queue selection: submit_bio() selects the queue for the current CPU:
queue_index = cpu_id % num_queues. No lock is needed because each CPU has
its own queue. If num_queues < num_cpus, some CPUs share a queue (protected
by a per-queue SpinLock).
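The submission path in §15.5.7 allocates descriptors with `queue.alloc_desc()`. A minimal sketch of how that could work over the pre-populated `free_list`/`free_count` fields (hypothetical helper, shown standalone rather than as a VirtioBlkQueue method):

```rust
/// Descriptor allocation over the pre-populated free list. The submission
/// path checks `free_count >= num_descs` before allocating, so an empty
/// list here is a caller bug.
struct FreeList {
    free_list: Box<[u16]>, // indices of free descriptors
    free_count: u16,       // number of valid entries in free_list
}

impl FreeList {
    fn alloc_desc(&mut self) -> u16 {
        debug_assert!(self.free_count > 0, "caller must check free_count first");
        self.free_count -= 1;
        self.free_list[self.free_count as usize] // pop from the top
    }
    fn free_desc(&mut self, idx: u16) {
        self.free_list[self.free_count as usize] = idx; // push back
        self.free_count += 1;
    }
}
```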
15.5.7 I/O Submission¶
fn virtio_blk_submit(queue: &mut VirtioBlkQueue, bio: &mut Bio) -> Result<()> {
// 1. Allocate descriptors: 1 (header) + N (data segments) + 1 (status) = N+2.
let num_descs = 2 + bio.segments.len();
if (queue.free_count as usize) < num_descs {
// Queue full — block layer requeues: the bio is placed on the
// per-device dispatch list, and the device's completion IRQ kicks
// requeue processing when descriptor space becomes available.
// See [Section 15.2](#block-io-and-volume-management--eagain-requeue).
return Err(Error::AGAIN);
}
// 2. Build header descriptor.
let header_desc = queue.alloc_desc();
queue.desc_table[header_desc].addr = header_dma_addr;
queue.desc_table[header_desc].len = 16;
queue.desc_table[header_desc].flags = VIRTQ_DESC_F_NEXT;
// 3. Chain data descriptors.
let mut prev = header_desc;
for seg in &bio.segments {
let data_desc = queue.alloc_desc();
queue.desc_table[data_desc].addr = seg.page_phys + seg.offset as u64;
queue.desc_table[data_desc].len = seg.len;
queue.desc_table[data_desc].flags = if bio.op == BioOp::Read {
VIRTQ_DESC_F_WRITE | VIRTQ_DESC_F_NEXT // Device writes to buffer
} else {
VIRTQ_DESC_F_NEXT // Device reads from buffer
};
queue.desc_table[prev].next = data_desc;
prev = data_desc;
}
// 4. Chain status descriptor (1 byte, device-writable).
let status_desc = queue.alloc_desc();
queue.desc_table[status_desc].addr = status_dma_addr;
queue.desc_table[status_desc].len = 1;
// Final descriptor of the chain: device-writable, no NEXT flag.
// `prev` (header or last data descriptor) already carries
// VIRTQ_DESC_F_NEXT from steps 2-3, so it chains into status_desc.
queue.desc_table[status_desc].flags = VIRTQ_DESC_F_WRITE;
queue.desc_table[prev].next = status_desc;
// 5. Track in-flight.
queue.inflight[header_desc] = Some(bio as *mut Bio);
// 6. Add to available ring.
let avail_idx = queue.avail_ring.idx;
queue.avail_ring.ring[avail_idx as usize % queue.size as usize] = header_desc as u16;
// Write memory barrier — ensure descriptor writes are visible before idx update.
core::sync::atomic::fence(Release);
queue.avail_ring.idx = avail_idx.wrapping_add(1);
// 7. Notify device (doorbell kick).
// With EVENT_IDX: only notify if device needs it.
if needs_notification(queue) {
queue.transport.notify(queue.index);
}
Ok(())
}
15.5.8 I/O Completion¶
Completions are processed in the interrupt handler or by polling. The
bio_complete() free function (Section 15.2) performs
CAS + extraction + dispatch, using the formal Completion primitive defined in
Section 3.6:
fn virtio_blk_complete(queue: &mut VirtioBlkQueue) {
// Read memory barrier — ensure we see device's writes to used ring.
core::sync::atomic::fence(Acquire);
// used_ring.idx is written by the device via DMA. read_volatile() prevents
// the compiler from caching the value across loop iterations or reordering
// the read relative to the Acquire fence above.
while queue.last_used_idx != read_volatile(&queue.used_ring.idx) {
let used_elem = &queue.used_ring.ring[
queue.last_used_idx as usize % queue.size as usize
];
let head_idx = used_elem.id as usize;
// Read status byte from the last descriptor in the chain.
let status = read_status_byte(queue, head_idx);
// Complete the bio.
if let Some(bio_ptr) = queue.inflight[head_idx].take() {
let bio = unsafe { &mut *bio_ptr };
let errno = match status {
VirtioBlkStatus::Ok => 0,
VirtioBlkStatus::IoErr => -EIO,
VirtioBlkStatus::Unsupp => -ENOSYS,
};
// Unified completion API: CAS + extraction + dispatch in one call.
// This eliminates the TOCTOU race from the previous mem::take
// pattern where extraction preceded the CAS (SF-373).
bio_complete(bio, errno);
}
// Free all descriptors in the chain.
free_descriptor_chain(queue, head_idx);
queue.last_used_idx = queue.last_used_idx.wrapping_add(1);
}
// EVENT_IDX notification suppression: the driver writes used_event
// (the trailing field of the avail ring) so the device skips interrupts
// until completions advance past last_used_idx. (avail_event in the
// used ring is the device-written counterpart that gates driver kicks.)
if has_event_idx(queue) {
queue.avail_ring.used_event = queue.last_used_idx;
}
}
15.5.9 Initialization Sequence¶
Follows the common VirtIO initialization protocol defined in Section 11.3, with block-specific steps:
- Discovery: Match PCI vendor 0x1AF4, device 0x1001/0x1042 (block), or MMIO magic value 0x74726976 ("virt") with device type 2.
- Steps 1-6: Standard VirtIO init (reset → acknowledge → driver → feature negotiation → FEATURES_OK → queue setup) per §11.4.3.1. Block-specific features from §15.5.3 are selected in the negotiation step.
- Queue setup: For each virtqueue (1 for single-queue, `num_queues` for MQ):
  a. Select queue (write queue index to `queue_sel`).
  b. Read max queue size.
  c. Allocate DMA buffers for desc, avail, used rings.
  d. Pre-populate free descriptor list.
  e. Write addresses to the device via `setup_queue()`.
  f. Enable the queue.
- Read config: Read `VirtioBlkConfig` — capacity, blk_size, topology.
- DRIVER_OK: Set `DRIVER_OK` (bit 2) in status. Device is now live.
- Register with umka-block: Create `BlockDevice` with parsed config.
15.5.10 Crash Recovery¶
VirtIO-blk crash recovery follows the Tier 1 recovery protocol (Section 11.9):
- Fault detection: PCI error (SERR, AER), timeout (no completion within 30s), or device status `DEVICE_NEEDS_RESET` (bit 6).
- Quiesce: Block new submissions. Drain completion handler.
- Device reset: Write 0 to device status register. This resets all device state including virtqueues.
- Re-initialize: Repeat the post-reset initialization steps of §15.5.9 (acknowledge, driver, features, queues, config, DRIVER_OK).
- Replay in-flight I/O: The block layer retains bios that were submitted but not completed. After re-initialization, these bios are resubmitted to the new virtqueues. Replay is safe because sector writes are idempotent: writing the same data to the same sector twice produces the same result. Writes may have been partially or fully executed by the device before the crash; replaying them is correct regardless. Non-idempotent operations (discard, write-same with side effects) are not replayed — they are failed with `-EIO` and reported to the block layer for upper-layer recovery.
- Resume: Set `error_state = Normal`. Accept new submissions.
Recovery time: ~10-50ms (dominated by device reset + re-negotiation). No device firmware to reload — VirtIO is a software device.
15.5.11 BlockDeviceOps Implementation¶
impl BlockDeviceOps for VirtioBlkDevice {
fn submit_bio(&self, bio: &mut Bio) -> Result<()> {
if self.read_only && bio.op == BioOp::Write {
return Err(Error::ROFS);
}
let queue_idx = arch::current::cpu::id() % self.queues.len();
// Each VirtioBlkQueue is wrapped in SpinLock for interior mutability
// (submit_bio takes &self; queue mutation needs &mut through &self).
// Uncontended in the per-CPU case (~5-10 ns); required for shared
// queues when num_queues < num_cpus.
let queue = self.queues[queue_idx].lock();
match bio.op {
BioOp::Read => queue.submit_request(VirtioBlkReqType::In, bio),
BioOp::Write => queue.submit_request(VirtioBlkReqType::Out, bio),
BioOp::Flush => {
if self.features & VIRTIO_BLK_F_FLUSH != 0 {
// submit_flush builds a 3-descriptor chain
// (header + empty data + status), stores the bio in
// inflight[header_desc] for completion matching, and
// submits to the available ring. The completion handler
// calls bio_complete() on the flush bio when the device
// signals used[header_desc].
queue.submit_flush(bio)
} else {
// No volatile cache — flush is a no-op, complete immediately.
bio_complete(bio, 0);
Ok(())
}
}
BioOp::Discard => {
if self.features & VIRTIO_BLK_F_DISCARD != 0 {
queue.submit_discard(bio)
} else {
Err(Error::NOSYS)
}
}
BioOp::WriteZeroes => {
if self.features & VIRTIO_BLK_F_WRITE_ZEROES != 0 {
queue.submit_write_zeroes(bio)
} else {
Err(Error::NOSYS)
}
}
BioOp::ZoneAppend => Err(Error::NOSYS),
}
}
fn flush(&self) -> Result<()> {
if self.features & VIRTIO_BLK_F_FLUSH != 0 {
let queue_idx = arch::current::cpu::id() % self.queues.len();
// Queues are SpinLock-wrapped (see submit_bio); lock before use.
self.queues[queue_idx].lock().submit_flush_sync()
} else {
Ok(())
}
}
fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()> {
if self.features & VIRTIO_BLK_F_DISCARD == 0 {
return Err(Error::NOSYS);
}
let queue_idx = arch::current::cpu::id() % self.queues.len();
self.queues[queue_idx].lock().submit_discard_range(start_lba, len_sectors)
}
fn get_info(&self) -> BlockDeviceInfo {
BlockDeviceInfo {
logical_block_size: self.blk_size,
physical_block_size: self.blk_size << self.physical_block_exp,
capacity_sectors: self.capacity,
max_segments: if self.features & VIRTIO_BLK_F_SEG_MAX != 0 {
// seg_max (u32) from config offset 12, clamped to u16 range.
// `as u16` truncation for values >65535 produces silently wrong
// results (65536 truncates to 0). Use explicit clamping.
let seg_max_u32 = self.transport.read_config(12, 4);
let clamped = core::cmp::min(seg_max_u32, u16::MAX as u32) as u16;
core::cmp::min(
clamped,
self.queues[0].lock().size - 2,
)
} else {
self.queues[0].lock().size - 2
},
max_bio_size: 0, // Sentinel: 0 means "no explicit byte limit beyond
// segment count × page size". The block layer treats
// max_bio_size == 0 as unlimited (capped only by
// max_segments × PAGE_SIZE). See BlockDeviceInfo docs.
flags: {
let mut f = BlockDeviceFlags::empty();
if self.features & VIRTIO_BLK_F_DISCARD != 0 { f |= BlockDeviceFlags::DISCARD; }
if self.features & VIRTIO_BLK_F_FLUSH != 0 { f |= BlockDeviceFlags::FLUSH; }
// VirtIO-blk has no FUA — use flush
f
},
optimal_io_size: if self.features & VIRTIO_BLK_F_TOPOLOGY != 0 {
// opt_io_size (Le32) is at config offset 28, NOT 24.
// Offset 24 contains physical_block_exp(u8) + alignment_offset(u8)
// + min_io_size(Le16). See VirtIO 1.2 §5.2.4 struct virtio_blk_config.
self.blk_size * self.transport.read_config(28, 4) as u32
} else {
self.blk_size
},
numa_node: 0, // Virtual device — no NUMA affinity
}
}
fn shutdown(&self) -> Result<()> {
// Flush pending writes.
self.flush()?;
// Reset device.
self.transport.write_status(0);
Ok(())
}
}
15.5.12 KABI Driver Manifest¶
[driver]
name = "virtio-blk"
version = "1.0.0"
tier = 1
bus-type = "pci" # Also supports "platform" for MMIO transport
[match]
pci-vendor = "0x1AF4"
pci-device = ["0x1001", "0x1042"] # Legacy and modern transitional
[match.mmio]
virtio-device-type = 2 # Block device
[capabilities]
dma = true
interrupts = "msi-x" # Preferred; falls back to MSI, then INTx (PCI) or shared IRQ (MMIO)
max-memory = "2MB" # Per-queue: desc + avail + used rings
[recovery]
crash-action = "reload"
state-preservation = true # Replay in-flight bios on reload
max-reload-time-ms = 100
15.5.13 Design Decisions¶
| Decision | Rationale |
|---|---|
| Tier 1 (not Tier 2) | VirtIO-blk is the primary boot disk for all virtualized environments. Latency budget is tight (~5 μs per I/O round-trip in QEMU). Tier 2 ring-crossing adds ~5-15 μs — unacceptable. |
| Split ring as default, packed as upgrade | Split ring is universally supported. Packed ring (VIRTIO_F_RING_PACKED) is negotiated when available for better cache behavior. |
| One queue per CPU | Eliminates lock contention. Standard for modern VirtIO-blk (F_MQ). Falls back to single-queue with per-queue SpinLock. |
| Pre-allocated descriptor pools | All descriptors and tracking arrays allocated at probe time. Zero allocation on the I/O hot path. |
| No indirect descriptors by default | Indirect descriptors (VIRTIO_F_INDIRECT_DESC) add an extra DMA read. Only enabled for large bios (>8 segments) to save descriptor slots. |
| EVENT_IDX for notification suppression | Reduces VM exits (and thus I/O latency) by batching notifications. Essential for performance — reduces hypervisor round-trips by ~40%. |
| Both PCI and MMIO transports | PCI is the production path (x86, AArch64 servers). MMIO is required for embedded platforms (ARMv7, RISC-V, PPC) and QEMU -M virt. |
15.6 ext4 Filesystem Driver¶
Scope note: This section provides UmkaOS-specific ext4 filesystem driver specifications: journal modes, error handling, Linux compatibility constraints, and data structure layouts. The on-disk format specification for ext4 is defined by the upstream project and is not duplicated here — UmkaOS implements the same on-disk format bit-for-bit.
The ext4 driver implements the FileSystemOps and InodeOps traits defined in
Section 14.1 (VFS layer). ext4 is used in server, workstation,
embedded, and consumer contexts; it is not consumer-specific.
15.6.1.1 Evolvable/Nucleus Classification¶
| Component | Classification | Rationale |
|---|---|---|
| `JournalSuperblock`, `JournalHeader`, `JournalBlockTag`, `JournalCommitBlock` on-disk structs | Nucleus | On-disk format compatibility with Linux ext4. Any change breaks cross-mount. |
| `Journal` in-memory struct fields and state machine | Nucleus | Transaction state machine correctness is a crash-consistency invariant. |
| Transaction lifecycle (`T_RUNNING` through `T_FINISHED`) | Nucleus | Ordering guarantees are required for durability. |
| Handle API (`journal_start`/`journal_stop`/`journal_get_write_access`) | Nucleus | Correctness contract with filesystem operations. |
| Recovery and replay algorithm (3-pass) | Nucleus | Must match Linux JBD2 for cross-mount compatibility. |
| Revoke record semantics | Nucleus | Freed-block replay hazard prevention is a correctness invariant. |
| Adaptive commit interval algorithm | Evolvable | Heuristic for commit timing. ML-tunable without affecting correctness. |
| `commit_interval_ms` bounds (100-5000 ms) | Evolvable | Policy choice for recovery time vs I/O overhead tradeoff. |
| Checkpoint thread scheduling priority (nice 5) | Evolvable | Policy: how urgently to reclaim journal space. |
| `errors=` mode selection | Evolvable | Policy: operator-configurable response to filesystem errors. |
15.6.1.2 const_assert! Verification¶
All #[repr(C)] on-disk structs in this section have const_assert! size verification:
| Struct | Expected size | const_assert! present |
|---|---|---|
| `JournalSuperblock` | 1024 | Yes |
| `JournalBlockTag` | 16 | Yes |
| `JournalCommitBlock` | 60 | Yes |
15.6.2 ext4¶
Use cases: Default Linux filesystem. Ubiquitous across servers, containers (overlayfs on ext4), embedded roots, VM images, CI/CD storage nodes, and most existing Linux deployments. UmkaOS must read/write ext4 volumes from day one for bare-metal Linux migration compatibility.
Tier: Tier 1 (in-kernel driver; no privilege boundary makes sense for a root filesystem that must be available before any domain infrastructure is up).
Journal modes (selected at mount time via data= option):
| Mode | What is journalled | Durability on crash |
|---|---|---|
| `data=writeback` | Metadata only | Stale data may appear in reallocated blocks |
| `data=ordered` (default) | Metadata only; data flushed before metadata commit | No stale data |
| `data=journal` | Metadata and data | Strongest; ~2× write amplification |
UmkaOS exposes these as mount flags via the FileSystemOps::mount() options string,
consistent with Linux behaviour. The VFS durability contract (Section 15.1) requires
data=ordered or data=journal to satisfy O_SYNC/fsync guarantees; drivers
must reject data=writeback if the volume is mounted as a root or journalled
data store unless the operator explicitly overrides.
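The rejection policy described above can be sketched as a mount-time check (names like `is_root_or_data_store` and `operator_override` are hypothetical plumbing, not the actual mount API):

```rust
/// Journal data modes, per the table above.
#[derive(PartialEq, Clone, Copy)]
enum DataMode { Writeback, Ordered, Journal }

/// Mount-time durability policy: data=writeback is rejected for root or
/// journalled data stores unless the operator explicitly overrides.
fn check_data_mode(
    mode: DataMode,
    is_root_or_data_store: bool,
    operator_override: bool,
) -> Result<(), &'static str> {
    if mode == DataMode::Writeback && is_root_or_data_store && !operator_override {
        return Err("data=writeback rejected: violates the Section 15.1 durability contract");
    }
    Ok(())
}
```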
Key features the driver must implement:
- Extents (ext4_extent_tree): 48-bit logical-to-physical mapping via a
four-level B-tree embedded in the inode. Supports extents up to 128 MiB
contiguous. Replaces the older indirect-block scheme (must also be readable
for old volumes without the extents feature flag).
- HTree directory indexing: dir_index feature flag. Directories stored as
B-trees keyed by filename hash (half-MD4). Required for directories with more
than ~10,000 entries; without it readdir degrades to O(n).
- 64-bit support: 64bit feature flag extends block count from 32 to 48
bits, enabling volumes >16 TiB. Required for modern datacenter deployments;
the driver must handle both 32-bit and 64-bit superblocks.
- Inline data: Small files (≤60 bytes) stored directly in the inode body.
Important for filesystems hosting millions of tiny files (container layers,
npm caches).
- Fast commit (fast_commit feature, Linux 5.10+): Appends a small delta
journal entry instead of a full transaction commit for common operations
(rename, link, unlink). Reduces journal write amplification by 4–10× for
metadata-heavy workloads.
Error handling: ext4 supports the standard errors= mount option
(Section 14.1):
- errors=continue (default): Log the error, continue operation.
- errors=remount-ro: Remount the filesystem read-only (see FsErrorMode::RemountRo
in Section 14.1 for the procedure). This is the recommended setting
for data-critical volumes.
- errors=panic: Trigger a kernel panic. Only appropriate for root filesystems
with automatic reboot and fsck.
The errors= value is persisted in the ext4 on-disk superblock field s_errors
(__le16 at byte offset 0x3C). If not specified at mount time, the on-disk value is used.
If neither mount option nor on-disk value is set, Continue is the default.
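The three-level precedence described above (explicit mount option, then on-disk `s_errors`, then the Continue default) can be sketched as follows, using the ext4 on-disk encoding (1 = continue, 2 = remount-ro, 3 = panic, 0 = unset):

```rust
#[derive(PartialEq, Debug, Clone, Copy)]
enum FsErrorMode { Continue, RemountRo, Panic }

/// Resolve the effective errors= mode at mount time.
/// `mount_opt` is the parsed errors= mount option (if any);
/// `s_errors` is the on-disk superblock field at offset 0x3C.
fn resolve_error_mode(mount_opt: Option<FsErrorMode>, s_errors: u16) -> FsErrorMode {
    if let Some(m) = mount_opt {
        return m; // mount option wins
    }
    match s_errors {
        1 => FsErrorMode::Continue,
        2 => FsErrorMode::RemountRo,
        3 => FsErrorMode::Panic,
        _ => FsErrorMode::Continue, // unset or unknown: default
    }
}
```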
XFS does not use errors=; it has its own error handling configuration via
sysfs (/sys/fs/xfs/<device>/error/). Btrfs always remounts read-only on
metadata corruption (equivalent to RemountRo) and has no user-configurable
error mode.
Crash recovery: Replay the ext4 journal (jbd2 compatible format) on mount.
The VFS freeze/thaw interface (Section 14.1 freeze() / thaw()) is used for
consistent snapshots (LVM thin, VM live migration).
Journal writeback error handling: When a writeback error occurs during a
journal transaction: (1) The transaction is NOT committed — it remains open.
(2) All dirty pages in the transaction are re-dirtied (preserving data for
retry). (3) The error is propagated to fsync() callers via the ErrSeq
mechanism (Section 4.6). (4) The journal retries the transaction
on the next writeback cycle. (5) After 3 consecutive failures, the filesystem
enters the configured error mode (default: RemountRo for journal errors).
Linux compatibility: UmkaOS's ext4 driver is wire-compatible with Linux's
ext4. Volumes formatted with mkfs.ext4 on Linux are mountable by UmkaOS without
conversion. The tune2fs -l feature list (FEATURE_COMPAT, FEATURE_INCOMPAT,
FEATURE_RO_COMPAT) governs which features are required vs. optional; the
driver rejects mount if any INCOMPAT bit is set that it does not understand.
15.6.2.1 JBD2 Journaling Subsystem¶
The ext4 filesystem delegates all crash-consistency guarantees to the JBD2
(Journaling Block Device 2) subsystem. UmkaOS's JBD2 implementation is
on-disk format compatible with Linux's fs/jbd2/ — volumes journaled by
Linux are recoverable by UmkaOS and vice versa. Internal improvements
(adaptive commit interval, u64 transaction IDs) do not affect on-disk layout;
they change only runtime behavior.
15.6.2.1.1 Transaction State Machine¶
Every metadata mutation is grouped into a transaction. Exactly one
transaction is in T_RUNNING state at any time; a second may be in
T_COMMIT (being written to the journal). The state machine is:
| State | Description |
|---|---|
| `T_RUNNING` | Accepting new metadata modifications via journal handles. `journal_start()` attaches a handle to this transaction. Multiple handles may be active concurrently (one per in-flight filesystem operation). |
| `T_LOCKED` | No new handles accepted. The commit thread sets this state to drain active handles. Callers of `journal_start()` block until a new `T_RUNNING` transaction is created after the current one advances past `T_LOCKED`. |
| `T_FLUSH` | Data pages for all inodes in `inode_list` are being flushed to their final on-disk locations (ordered mode only). This ensures data is stable before metadata referencing it is committed. In writeback mode this state is a no-op pass-through; in journal mode, data blocks are written to the journal instead. |
| `T_COMMIT` | Journal descriptor blocks, metadata blocks, and the final commit block are being written to the journal device. The commit block carries a CRC32C checksum covering all descriptor and metadata blocks in the transaction. A FUA write (or FLUSH + write + FLUSH on devices without FUA) ensures the commit block is durable before the state advances. |
| `T_FINISHED` | Commit is complete and durable. The transaction is moved to the checkpoint list. `fsync()` callers waiting on `commit_wq` are woken. A new `T_RUNNING` transaction may now be created. |
State transitions are serialized by Journal::state_lock. The T_RUNNING →
T_LOCKED transition is triggered by: (a) the periodic commit timer firing,
(b) a synchronous fsync() forcing a commit, or (c) journal free space
falling below the reservation threshold.
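The three transition triggers can be sketched as a single predicate (all parameter names are hypothetical; the real commit thread reads these from `Journal` state):

```rust
/// Why a commit was triggered — mirrors triggers (a)-(c) above.
#[derive(Debug, PartialEq)]
enum CommitTrigger { Timer, Fsync, LowSpace }

/// Decide whether the running transaction should advance to T_LOCKED.
/// Synchronous fsync and space pressure take precedence over the timer.
fn commit_trigger(
    elapsed_ms: u32,
    commit_interval_ms: u32,
    fsync_requested: bool,
    free_blocks: u32,
    reservation_threshold: u32,
) -> Option<CommitTrigger> {
    if fsync_requested {
        Some(CommitTrigger::Fsync) // (b) synchronous fsync() forces a commit
    } else if free_blocks < reservation_threshold {
        Some(CommitTrigger::LowSpace) // (c) journal free space below threshold
    } else if elapsed_ms >= commit_interval_ms {
        Some(CommitTrigger::Timer) // (a) periodic commit timer fired
    } else {
        None
    }
}
```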
15.6.2.1.2 Core Data Structures¶
/// In-memory representation of a JBD2 journal instance.
///
/// One `Journal` exists per mounted ext4 filesystem. The journal may reside
/// on the same block device as the filesystem (internal journal, inode 8) or
/// on a separate block device (external journal, specified via `journal_dev=`
/// mount option).
pub struct Journal {
/// Block device backing the journal. For an internal journal this is the
/// same device as the filesystem; for an external journal it is a separate
/// `BlockDeviceHandle`.
pub dev: BlockDeviceHandle,
/// On-disk journal superblock (1024 bytes at journal block 0).
/// Cached in memory; written back on checkpoint advance and clean unmount.
pub sb: JournalSuperblock,
/// Journal block number of the first un-checkpointed transaction.
/// Advances when checkpoint frees journal space.
pub head: u32,
/// Journal block number of the next free block (write cursor).
/// Wraps circularly within the journal.
pub tail: u32,
/// Number of free blocks remaining in the journal.
/// `free = total_blocks - (tail - head)` modulo wrap.
pub free: u32,
/// Maximum metadata buffers a single transaction may accumulate before
/// the commit thread forces a commit. Derived from journal size:
/// `max_transaction_buffers = journal_blocks / 4` (same heuristic as Linux).
pub max_transaction_buffers: u32,
/// Adaptive commit interval in milliseconds. Range: 100–5000 ms.
/// Adjusted at each commit based on handle start rate (see §Adaptive
/// Commit Interval below).
pub commit_interval_ms: AtomicU32,
/// The currently active transaction accepting new handles.
/// `None` only during the brief window between one transaction entering
/// `T_LOCKED` and the next `T_RUNNING` transaction being created.
pub running_transaction: Option<Arc<Transaction>>,
/// The transaction currently being written to the journal.
/// At most one transaction is in the commit pipeline at any time.
pub committing_transaction: Option<Arc<Transaction>>,
/// Oldest-first list of committed transactions whose metadata blocks
/// have not yet been written to their final on-disk locations.
/// Checkpoint frees journal space by flushing these.
/// **Collection policy note**: IntrusiveList is acceptable here despite
/// the general preference for XArray. Transactions form a strict FIFO
/// order (checkpoint oldest-first), are never accessed by integer key,
/// and each Transaction already embeds a `checkpoint_link` node. The
/// list is O(1) insert (tail) and O(1) remove (head), matching the
/// checkpoint access pattern exactly.
pub checkpoint_transactions: IntrusiveList<Transaction>,
/// Protects transaction state transitions (`T_RUNNING` → `T_LOCKED` etc.)
/// and the `running_transaction` / `committing_transaction` fields.
pub state_lock: Mutex<()>,
/// Woken when a commit completes (`T_COMMIT` → `T_FINISHED`).
/// `fsync()` callers sleep here after requesting a commit.
pub commit_wq: WaitQueue,
/// Woken when journal free space increases (after checkpoint).
/// `journal_start()` callers sleep here when the journal is full.
pub checkpoint_wq: WaitQueue,
/// Journal block size in bytes (must equal filesystem block size).
pub block_size: u32,
/// Total number of usable journal blocks (excluding the superblock block).
pub total_blocks: u32,
/// Feature flags from the on-disk journal superblock.
pub features: JournalFeatureFlags,
}
/// A single journal transaction.
///
/// Transactions are the unit of atomicity: either all metadata changes in a
/// transaction are replayed on recovery, or none are.
pub struct Transaction {
/// Monotonically increasing transaction ID. u64 to avoid wrap within
/// any operational lifetime (at 10,000 commits/sec, wraps after 58
/// million years). The on-disk format stores only the low 32 bits
/// (`t_tid` in the commit block); the full u64 is internal only.
///
/// **Recovery disambiguation**: On journal replay, the kernel reads
/// u32 `t_tid` values from commit blocks. The full u64 is
/// reconstructed by tracking the high 32 bits across the recovery
/// scan: start with `epoch = superblock.s_last_tid >> 32`, then for
/// each commit block, if `t_tid < (prev_t_tid & 0xFFFF_FFFF)` the
/// u32 has wrapped and `epoch += 1`. The reconstructed tid is
/// `(epoch << 32) | t_tid`. This correctly handles up to one u32
/// wrap per recovery scan (at 10,000 commits/sec, u32 wraps every
/// ~4.97 days — well above the maximum replay window). The
/// superblock's `s_last_tid` stores the full u64 for cross-mount
/// continuity.
///
/// // LONGEVITY: u32 on-disk tid wraps at ~4.97 days at 10K
/// // commits/sec. Acceptable: recovery scans at most the journal
/// // size (~128 MB / 4 KB blocks = 32K transactions = ~3.2 sec
/// // of writes), far below one u32 period. The u64 internal counter
/// // never wraps in practice.
pub tid: u64,
/// Current state of this transaction.
/// Transitions: `T_RUNNING(0) → T_LOCKED(1) → T_FLUSH(2) → T_COMMIT(3) → T_FINISHED(4)`.
pub state: AtomicU8,
/// Number of active `JournalHandle` instances attached to this transaction.
/// Decremented by `journal_stop()`. When this reaches zero in `T_LOCKED`
/// state, the commit thread is woken to proceed to `T_FLUSH`.
pub handle_count: AtomicI32,
/// Total number of metadata buffers accumulated in this transaction.
/// Used to enforce `max_transaction_buffers`.
pub nr_buffers: u32,
/// Metadata blocks to be written to the journal during commit.
/// Each entry holds a reference to a kernel buffer and the on-disk
/// block number it maps to. Bounded by `max_transaction_buffers`.
pub metadata_list: ArrayVec<JournalBufferEntry, MAX_TRANSACTION_BUFFERS>,
/// After commit: metadata blocks that must still be written to their
/// final on-disk locations before this transaction's journal space can
/// be reclaimed. Drained by the checkpoint mechanism.
pub checkpoint_list: IntrusiveList<JournalBufferEntry>,
/// Inodes with dirty data pages that must be flushed before metadata
/// commit (ordered mode only). Populated by `journal_dirty_inode()`.
/// Bounded by the number of unique inodes touched in one transaction
/// (typically < 1000; uses a bounded Vec with documented maximum).
/// InodeId: u64 -- see [Section 14.1](14-vfs.md#virtual-filesystem-layer--core-vfs-data-structures).
pub inode_list: ArrayVec<InodeId, MAX_TRANSACTION_INODES>,
/// Block numbers revoked by this transaction. A revoked block must NOT
/// be replayed during recovery, even if an earlier transaction wrote it
/// to the journal. This prevents replaying freed-and-reallocated blocks.
///
/// Warm path (populated during truncate/unlink), bounded by the number
/// of metadata blocks freed in one transaction.
/// XArray per integer-key policy; presence of block number in the tree
/// means "revoked". XArray<()> acts as a set (key = block number,
/// value = unit type for presence-only semantics).
pub revoke_table: XArray<()>,
/// Intrusive list linkage for `Journal::checkpoint_transactions`.
pub checkpoint_link: IntrusiveListLink,
/// Wall-clock time of commit completion (for adaptive interval tuning).
pub commit_time_ns: u64,
/// Number of `journal_start()` calls during this transaction's
/// `T_RUNNING` lifetime. Used for adaptive commit interval calculation.
pub handle_starts: AtomicU64,
}
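The epoch-widening rule described in the `tid` field's comment can be sketched as follows. This is a minimal illustration, not the kernel code: `widen_tid` is a hypothetical helper, taking the last known full tid (e.g. the superblock's `s_last_tid`) and a 32-bit on-disk sequence number, and handling at most one u32 wrap per the recovery-window argument above.

```rust
/// Reconstruct a full 64-bit tid from the 32-bit on-disk sequence
/// number, given the last known full tid. Correct as long as at most
/// one u32 wrap occurred since `last_full_tid` was recorded.
fn widen_tid(last_full_tid: u64, on_disk: u32) -> u64 {
    let epoch = last_full_tid >> 32;
    let last_low = last_full_tid as u32;
    if on_disk >= last_low {
        (epoch << 32) | on_disk as u64
    } else {
        // The low 32 bits wrapped once since last_full_tid.
        ((epoch + 1) << 32) | on_disk as u64
    }
}
```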
/// ext4-specific per-inode info, stored via `Inode.i_private`.
///
/// In Linux, this is `struct ext4_inode_info`. UmkaOS stores it as a
/// separate slab-allocated struct pointed to by `Inode.i_private: *mut ()`.
/// Access: `unsafe { &*(inode.i_private as *const Ext4InodeInfo) }`.
/// SAFETY: `i_private` is set by ext4's `alloc_inode()` and valid for
/// the inode's lifetime.
/// `#[repr(C)]` ensures deterministic field layout for debugging
/// (consistent offsets in core dumps) and const_assert compatibility.
/// Kernel-internal — not KABI.
#[repr(C)]
pub struct Ext4InodeInfo {
/// Transaction ID of the last data-modifying operation on this inode.
/// Used by `ext4_fsync()` to determine which journal transaction to
/// force-commit. Updated in `ext4_write_end()` after marking pages dirty.
pub i_datasync_tid: AtomicU64,
/// Transaction ID of the last metadata-modifying operation.
/// Used for full `fsync()` (not `fdatasync()`). Updated when inode
/// metadata (timestamps, size, block pointers) changes.
pub i_sync_tid: AtomicU64,
/// ext4 inode flags (EXT4_EXTENTS_FL, EXT4_INLINE_DATA_FL, etc.).
pub i_flags: u32,
/// Extent tree depth (0 = inline extents in inode, >0 = B-tree).
pub i_extent_depth: u16,
/// Explicit padding.
pub _pad: [u8; 2],
}
/// Per-handle reservation for a filesystem operation.
///
/// A `JournalHandle` is acquired via `journal_start()` and released via
/// `journal_stop()`. It represents one logical filesystem operation (e.g.,
/// one `unlink`, one `write` metadata update) that may dirty multiple
/// metadata blocks. The `nr_credits` field reserves journal space upfront
/// so the operation cannot deadlock mid-way for lack of journal space.
pub struct JournalHandle {
/// The transaction this handle is attached to.
pub transaction: Arc<Transaction>,
/// Number of journal blocks reserved for this handle.
/// Set by the caller at `journal_start()` based on the worst-case
/// number of metadata blocks the operation may dirty. Common values:
/// - `EXT4_DATA_TRANS_BLOCKS` (8): simple file write metadata
/// - `EXT4_DELETE_TRANS_BLOCKS` (24): unlink/truncate
/// - `EXT4_RESERVE_TRANS_BLOCKS` (4): quota update
pub nr_credits: u32,
}
/// One metadata block tracked by a transaction.
pub struct JournalBufferEntry {
/// Filesystem block number this buffer maps to.
pub blocknr: u64,
/// Copy of the metadata block contents at the time of journaling.
/// This frozen copy is what gets written to the journal — not the
/// live buffer, which may have been modified by a subsequent
/// transaction.
///
/// **Hot-path allocation note**: `frozen_data` is allocated from a
/// dedicated `jbd2_frozen_slab` cache (block-size-aligned slab objects,
/// one pool per superblock). The allocation occurs in
/// `journal_get_write_access()` — the warm path (per metadata write,
/// not per syscall). The slab is pre-populated at journal init with
/// `min(journal.j_max_transaction_buffers, 1024)` entries. If the slab
/// is exhausted under heavy metadata load, the allocation blocks on
/// the slab mempool (same as Linux's `jbd2_slab_create` pools). The
/// `Box` is freed to the slab when the transaction commits and the
/// journal buffer is released.
pub frozen_data: Option<Box<[u8]>>,
/// Reference to the live kernel buffer (for checkpoint writeback).
pub bh: BufferRef,
/// Intrusive list linkage for `Transaction::checkpoint_list`.
pub checkpoint_link: IntrusiveListLink,
}
/// Maximum metadata buffers per transaction. Derived from journal size at
/// mount time: `journal_blocks / 4`. This constant is the upper bound for
/// `ArrayVec` capacity; the actual limit is `Journal::max_transaction_buffers`.
///
/// **Collection policy (warm path)**: ArrayVec chosen to embed metadata_list
/// in the Transaction allocation (one heap alloc via `Arc<Transaction>`,
/// contiguous layout for cache locality). Trade: ~1 MB per Transaction
/// (16384 entries x ~64 bytes each) even if sparsely filled. Typical fill
/// under production workloads: <2000 entries (~128 KB used). Vec alternative
/// would add a second heap allocation but reduce waste for small transactions.
/// Transaction creation is warm-path (once per commit cycle, ~100ms-5s), so
/// either choice satisfies the collection policy. ArrayVec is preferred for
/// single-allocation simplicity and contiguous memory.
const MAX_TRANSACTION_BUFFERS: usize = 16384;
/// Maximum unique inodes with dirty data per transaction (ordered mode).
/// Empirically, even extreme workloads rarely exceed 4096 unique inodes
/// in a single 5-second commit window. If exceeded, the transaction is
/// committed early and a new one started.
const MAX_TRANSACTION_INODES: usize = 4096;
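The mount-time derivation of the actual buffer limit (journal size divided by 4, clamped by the compile-time `ArrayVec` capacity) can be illustrated with a hypothetical helper:

```rust
// Compile-time capacity, matching the constant above.
const MAX_TRANSACTION_BUFFERS: usize = 16384;

/// Sketch of the mount-time derivation: `journal_blocks / 4`, capped by
/// the ArrayVec capacity. Helper name is illustrative, not from the spec.
fn max_transaction_buffers(journal_blocks: u64) -> usize {
    ((journal_blocks / 4) as usize).min(MAX_TRANSACTION_BUFFERS)
}
```

For example, a 128 MB journal with 4 KB blocks has 32768 journal blocks, giving a per-transaction limit of 8192 buffers; only very large journals hit the 16384 cap.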
15.6.2.1.3 Handle API¶
The handle API is the interface between the ext4 filesystem driver and the
JBD2 journal. Every metadata-modifying filesystem operation brackets its
work with journal_start() / journal_stop().
journal_start(journal, nr_credits) -> Result<JournalHandle>
Reserve nr_credits journal blocks and attach a new handle to the current
T_RUNNING transaction. If the journal has insufficient free space, the
caller blocks on checkpoint_wq until checkpoint frees space. If the
running transaction is in T_LOCKED (being committed), the caller blocks
until a new T_RUNNING transaction is created. Returns EROFS if the
journal has been aborted due to I/O errors.
Concurrency: multiple handles may be active on the same transaction
simultaneously — each filesystem operation (from different tasks) gets its
own handle. The handle_count atomic tracks the number of active handles.
journal_stop(handle)
Release the handle. Decrements handle_count. If this is the last handle
on a T_LOCKED transaction (handle_count reaches zero), wakes the commit
thread to proceed with T_FLUSH. If the transaction is still T_RUNNING
and the commit timer has not fired, no commit is triggered (the transaction
remains open for more operations).
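The `journal_stop()` wake-up rule can be sketched with a stripped-down transaction type (names hypothetical; the real `Transaction` is defined above):

```rust
use std::sync::atomic::{AtomicI32, AtomicU8, Ordering};

const T_LOCKED: u8 = 1;

// Minimal stand-in for Transaction: just the two fields this rule uses.
struct Txn {
    state: AtomicU8,
    handle_count: AtomicI32,
}

/// Returns true when the caller must wake the commit thread: this was
/// the last active handle on a transaction already locked for commit.
fn journal_stop(txn: &Txn) -> bool {
    let prev = txn.handle_count.fetch_sub(1, Ordering::AcqRel);
    prev == 1 && txn.state.load(Ordering::Acquire) == T_LOCKED
}
```

If the transaction is still `T_RUNNING`, the decrement happens but no wake-up is issued, matching the "transaction remains open" behavior described above.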
journal_get_write_access(handle, bh) -> Result<()>
Mark a buffer for journaling. Creates a frozen copy of the buffer's current contents (the "before" image for undo logging in journal mode, or simply the snapshot that will be written to the journal). Must be called before modifying the buffer. If the buffer is already tracked by this transaction, this is a no-op.
journal_dirty_metadata(handle, bh) -> Result<()>
Add the buffer to the transaction's metadata_list. Called after the
buffer has been modified. The buffer will be written to the journal during
the commit phase. If the buffer was not previously registered via
journal_get_write_access(), returns EINVAL (programming error in the
filesystem driver).
journal_revoke(handle, blocknr) -> Result<()>
Record that blocknr must not be replayed during recovery. Used when a
metadata block is freed (e.g., extent tree block freed during truncate).
Adds the block number to the transaction's revoke_table. During commit,
revoke records are written to the journal as revoke descriptor blocks.
journal_dirty_inode(handle, inode_id) -> Result<()>
Add an inode to the current transaction's inode_list for ordered-mode data
flushing. Called from ext4_write_end() (the AddressSpaceOps::write_end
implementation) after a data page is dirtied via the write path. The inode
is added only once per transaction (idempotent — checks a per-inode
I_DIRTY_DATASYNC flag against the current transaction ID).
fn journal_dirty_inode(
    handle: &JournalHandle,
    inode_id: InodeId,
) -> Result<()> {
    let txn = &handle.transaction;
    let inode = inode_lookup(inode_id);
    // Check if this inode is already registered for this transaction.
    // i_datasync_tid tracks the last transaction that dirtied this inode.
    let current_tid = txn.tid;
    if inode.i_datasync_tid.load(Acquire) == current_tid {
        return Ok(()); // Already in this transaction's inode_list.
    }
    // Set the datasync tid to mark this inode as dirty in this txn.
    inode.i_datasync_tid.store(current_tid, Release);
    // Add to the transaction's inode_list. If MAX_TRANSACTION_INODES is
    // reached, ENOSPC signals the caller to commit the transaction early
    // and retry against the new one.
    txn.inode_list.try_push(inode_id)
        .map_err(|_| IoError::new(Errno::ENOSPC))?;
    Ok(())
}
The call chain from write() to inode_list population:
1. generic_file_write_iter() calls mapping.ops.write_begin().
2. ext4_write_begin() calls journal_start() to open a handle.
3. generic_file_write_iter() copies user data into the page.
4. generic_file_write_iter() calls mapping.ops.write_end().
5. ext4_write_end() calls journal_dirty_inode(handle, inode_id).
6. ext4_write_end() calls journal_stop(handle).
At commit time, Transaction.inode_list contains all inodes with dirty data
pages that must be flushed before the metadata commit proceeds (step 3 of
the commit protocol below).
journal_force_commit(journal, tid) -> Result<()>
Force the transaction identified by tid through the full commit sequence
and wait for T_FINISHED. Called by fsync() to ensure durability. If the
requested transaction is already committed, returns immediately.
impl Journal {
    /// Force the specified transaction to reach T_FINISHED state.
    ///
    /// # Arguments
    /// - `tid`: Transaction ID to commit (from `inode.i_datasync_tid`).
    ///
    /// # Algorithm
    /// 1. If `tid` is older than the committing transaction: already done.
    /// 2. If `tid` matches the running transaction: trigger T_RUNNING -> T_LOCKED.
    /// 3. Wait on `commit_wq` until the transaction reaches T_FINISHED.
    ///
    /// # Locking
    /// Acquires `state_lock` briefly to read transaction state, then drops
    /// it before sleeping on `commit_wq`.
    pub fn force_commit(&self, tid: u64) -> Result<(), IoError> {
        let guard = self.state_lock.lock();
        // Check if the requested transaction is already committed.
        // running_transaction always has a higher tid than
        // committing_transaction (the committing one is older).
        if let Some(ref committing) = self.committing_transaction {
            if tid < committing.tid {
                // Already fully committed (tid is in the past).
                return Ok(());
            }
            if committing.tid == tid {
                // Currently being committed — wait for it.
                drop(guard);
                self.commit_wq.wait_event(|| {
                    let _g = self.state_lock.lock();
                    self.committing_transaction
                        .as_ref()
                        .map_or(true, |t| t.tid != tid
                            || t.state.load(Acquire) == T_FINISHED)
                });
                return Ok(());
            }
        }
        if let Some(ref running) = self.running_transaction {
            if running.tid == tid {
                // Trigger commit: T_RUNNING -> T_LOCKED.
                // Wake the commit thread to begin the commit sequence.
                running.state.store(T_LOCKED, Release);
                self.commit_thread_wq.wake_up();
                drop(guard);
                // Wait for the commit to complete.
                self.commit_wq.wait_event(|| {
                    let _g = self.state_lock.lock();
                    self.committing_transaction
                        .as_ref()
                        .map_or(true, |t| t.tid != tid
                            || t.state.load(Acquire) == T_FINISHED)
                });
                return Ok(());
            }
            if tid < running.tid {
                // Already committed in a previous round.
                return Ok(());
            }
        }
        // No matching transaction — journal is idle or tid is in the future.
        Ok(())
    }
}
15.6.2.1.4 Commit Protocol¶
The commit thread (jbd2/<device>) runs as a kernel thread, one per mounted
ext4 filesystem. It wakes on: (a) commit timer expiry, (b) explicit
journal_force_commit(), or (c) journal free space pressure.
Ordered mode (default, data=ordered):

1. **Lock transaction**: Atomically transition `T_RUNNING → T_LOCKED`.
   Create a new `T_RUNNING` transaction so new filesystem operations can
   proceed without blocking on the commit. New `journal_start()` callers
   attach to the new transaction.
2. **Drain handles**: Wait for `handle_count` on the locked transaction to
   reach zero. Each `journal_stop()` decrements the count; the last one
   wakes the commit thread.
3. **Flush data** (`T_LOCKED → T_FLUSH`): For each inode in `inode_list`,
   issue writeback for all dirty data pages. Wait for all data I/O to
   complete. This guarantees that data blocks referenced by the metadata
   being committed are already stable on disk — preventing the stale-data
   exposure that `data=writeback` permits.

   Note on fsync interaction: When the commit is triggered by fsync(),
   filemap_write_and_wait_range() (step 1 of the fsync flow in
   Section 14.4) has already flushed and waited for the target file's
   dirty data pages. The T_FLUSH step is therefore a no-op for the
   fsync-triggering inode — its data is already stable. However, T_FLUSH
   is still necessary for the periodic commit path (not fsync-triggered),
   where data pages for OTHER inodes in the same transaction may still be
   dirty and need flushing before their metadata can be committed.

4. **Write journal blocks** (`T_FLUSH → T_COMMIT`):
   - Write one or more descriptor blocks containing an array of
     `JournalBlockTag` entries. Each tag identifies a metadata block by
     its filesystem block number and flags.
   - Write the metadata blocks themselves, in the order described by the
     descriptor block tags.
   - Write revoke descriptor blocks containing all block numbers in
     `revoke_table`.
   - Write the commit block with a CRC32C checksum covering all
     descriptor, metadata, and revoke blocks in this transaction. The
     commit block uses sequence number `tid as u32` (low 32 bits).
5. **Flush commit block**: Issue the commit block write with
   `BioFlags::FUA | BioFlags::PERSISTENT` (Force Unit Access for
   durability, PERSISTENT for Tier 1 crash recovery preservation
   (Section 15.2)). On devices that do not support FUA, issue
   `BioFlags::PREFLUSH` before the commit block write and a full cache
   flush after it (FUA emulation). The commit block landing on stable
   storage is the atomicity point — if recovery sees a valid commit
   block, all metadata in the transaction is replayed; if the commit
   block is missing or has an invalid checksum, the entire transaction
   is discarded.
6. **Complete** (`T_COMMIT → T_FINISHED`): Move all metadata buffers from
   `metadata_list` to `checkpoint_list`. Append the transaction to
   `Journal::checkpoint_transactions`. Wake all waiters on `commit_wq`.
   Update `Journal::sb` with the new sequence number and tail position.
Writeback mode (data=writeback): Step 3 is skipped entirely. Data may
be written to disk in any order relative to metadata, which means a crash
can expose stale data in recently-allocated blocks.
Journal mode (data=journal): Step 3's data flush is replaced: data
blocks are written to the journal alongside metadata blocks in step 4
(each data block gets a descriptor tag with JBD2_FLAG_DATA). This
provides the strongest crash consistency (both data and metadata are
atomic) at the cost of approximately 2x write amplification: every data
block is written twice (once to the journal, once to its final location
during checkpoint).
15.6.2.1.5 Checkpoint Mechanism¶
Checkpointing reclaims journal space by writing committed metadata blocks to their final on-disk locations. Until a transaction is checkpointed, the journal blocks it occupies cannot be reused.
Background checkpoint: A kernel thread (jbd2/<device>-ckpt, SCHED_OTHER,
nice 5) runs periodically (every 5 seconds or when journal occupancy exceeds 50%).
It walks checkpoint_transactions oldest-first:
1. For each transaction in the checkpoint list:
   a. For each `JournalBufferEntry` in `checkpoint_list`:
      - If the buffer is still dirty (not yet written back by normal
        writeback), issue an async write to its final on-disk location.
      - If the buffer is clean (already written back), remove it from
        the checkpoint list.
   b. Wait for all issued writes to complete.
   c. If all buffers in the transaction are clean: remove the transaction
      from `checkpoint_transactions` and free its journal space.
2. Advance `Journal::head` to the journal block following the last
   checkpointed transaction.
3. Update the on-disk journal superblock with the new head position.
4. Wake `checkpoint_wq` (unblocking any `journal_start()` callers waiting
   for free space).
Foreground checkpoint: When journal_start() discovers that
Journal::free is less than the requested nr_credits, it triggers a
synchronous foreground checkpoint in the calling task's context. This walks
the same checkpoint list but blocks the caller until sufficient space is
freed. RT tasks that call fsync() and need journal space run the foreground
checkpoint at RT priority (inheriting the calling task's priority), preventing
priority inversion where an RT fsync() waits for the nice-5 background thread. If even after
checkpointing all eligible transactions the journal is still too full
(because in-flight commits occupy the space), the caller sleeps on
checkpoint_wq until the committing transaction completes.
Checkpoint ordering: Transactions must be checkpointed in order. A newer transaction cannot be checkpointed before an older one, because advancing the journal head past an uncheckpointed transaction would make that transaction unrecoverable after a crash.
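The in-order reclaim rule can be sketched as follows (types and names hypothetical): the head advances only past a prefix of fully checkpointed transactions, stopping at the first one that still has dirty buffers.

```rust
// Minimal stand-in for a transaction on the checkpoint list.
struct CkptTxn {
    start_block: u64, // first journal block occupied
    nr_blocks: u64,   // journal blocks occupied
    clean: bool,      // all buffers written to final locations?
}

/// Advance the journal head past the longest clean prefix of the
/// oldest-first checkpoint list. An uncheckpointed transaction can
/// never be skipped: its journal copy is the only recoverable copy.
fn advance_head(head: u64, oldest_first: &[CkptTxn]) -> u64 {
    let mut new_head = head;
    for t in oldest_first {
        if !t.clean {
            break;
        }
        new_head = t.start_block + t.nr_blocks;
    }
    new_head
}
```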
15.6.2.1.6 Recovery and Replay Algorithm¶
On mount after an unclean shutdown (journal superblock indicates s_start !=
0), the JBD2 recovery algorithm replays committed transactions to restore
filesystem consistency. The algorithm has three passes:
Pass 1 — Scan (discover transaction boundaries):
1. Read the journal superblock. Extract s_start (first block of the
oldest un-checkpointed transaction) and s_sequence (its expected
sequence number).
2. Scan forward from s_start, wrapping circularly:
a. Read each block. Check if it is a descriptor block (magic
JBD2_MAGIC_NUMBER and blocktype JBD2_DESCRIPTOR_BLOCK).
b. If it is a descriptor block: verify the sequence number matches
the expected value. If so, record the descriptor and skip over the
metadata blocks it describes.
c. If it is a commit block: verify the sequence number and CRC32C
checksum. If valid, the transaction is complete. Increment expected
sequence number.
d. If it is a revoke block: record it for Pass 2.
e. If the block is not a valid journal block (wrong magic or sequence
number) or the commit block checksum fails: stop scanning. All
transactions up to the last valid commit block are replayable.
Pass 2 — Revoke table construction:

1. Collect all revoke records from all complete transactions discovered
   in Pass 1 into a single hash table keyed by (block number, sequence
   number).
2. A revoke record for block B in transaction T means: "do not replay any
   write to block B from transaction T or any earlier transaction."
Pass 3 — Replay:
1. For each complete transaction (oldest to newest):
a. For each metadata block described in the transaction's descriptor
blocks:
- Look up the block number in the revoke table. If revoked by this
or a later transaction, skip it.
- Otherwise, read the journaled copy from the journal and write it
to the block's final on-disk location.
2. After all transactions are replayed:
a. Clear the journal by writing a new journal superblock with s_start = 0.
b. The filesystem is now consistent.
Recovery correctness invariant: Because the commit block is the atomicity
point (written with FUA), and because replay only processes transactions
with valid commit blocks, recovery never applies a partial transaction.
Revoke records prevent stale metadata from overwriting blocks that were freed
and reallocated in a later transaction — without revoke, truncating a file
and then creating a new file that reuses the same blocks could cause recovery
to overwrite the new file's metadata with the old file's freed metadata.
Fast commit replay: When JBD2_FEATURE_INCOMPAT_FAST_COMMIT is set, the
fast commit area occupies the last s_num_fc_blks blocks of the journal
(superblock field at offset 0x54). During recovery, after the standard
3-pass replay completes, the recovery algorithm scans the fast commit area
for delta records. Each delta encodes a single metadata change (inode update,
extent add/remove, directory entry link/unlink). Deltas are applied in
sequence order. If a delta's parent_tid does not match the last replayed
transaction's tid, the delta is skipped — it belongs to an incomplete fast
commit cycle whose parent transaction was not committed. After all valid
deltas are applied, the journal is cleared normally (s_start = 0).
15.6.2.1.7 Revoke Records¶
Revoke records solve the freed-block replay hazard:
- Transaction T1 writes metadata block B (e.g., an extent tree node).
- Transaction T2 frees block B (truncate) and allocates it for a different purpose (e.g., a data block for a new file). T2 records a revoke for B.
- Crash occurs after T2 commits but before T1 is checkpointed.
- Recovery sees T1's write to block B in the journal. Without revoke, it would replay T1's stale extent tree node over the new file's data block, silently corrupting the new file.
- With revoke: recovery checks the revoke table, finds B revoked by T2 (which is newer than T1), and skips the replay. The new file's data is preserved.
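The replay-time check in this example can be sketched with a simplified revoke table (keyed by block number only, storing the highest revoking sequence number; the real table is keyed by block number and sequence number):

```rust
use std::collections::HashMap;

/// Pass 3 revoke check (illustrative): a journaled write from
/// transaction `replay_tid` to `blocknr` is skipped if any transaction
/// with sequence number >= replay_tid revoked that block.
fn is_revoked(revoke: &HashMap<u64, u64>, blocknr: u64, replay_tid: u64) -> bool {
    revoke.get(&blocknr).map_or(false, |&rev_tid| rev_tid >= replay_tid)
}
```

In the scenario above, T2's revoke of block B means T1's stale write is skipped, while a hypothetical T3 write to the same block would still be replayed.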
On-disk format: Revoke records are written to the journal as revoke
descriptor blocks during commit. Each revoke block contains:
- A block header (JBD2_MAGIC_NUMBER, blocktype JBD2_REVOKE_BLOCK,
sequence number).
- A r_count field indicating the number of bytes of revoke data.
- An array of 8-byte block numbers (when JBD2_FEATURE_INCOMPAT_64BIT is
set) or 4-byte block numbers (legacy 32-bit journals).
The revoke table is transient — it exists only during recovery. Normal operation does not consult revoke records; they are written to the journal during commit and read back only during replay.
15.6.2.1.8 Adaptive Commit Interval (UmkaOS Improvement)¶
Linux uses a fixed 5-second commit interval (commit=5 mount option).
UmkaOS replaces this with an adaptive algorithm that bounds both recovery
time (by committing more frequently under load) and I/O overhead (by
deferring commits when idle):
| Condition | Commit interval | Rationale |
|---|---|---|
| High metadata rate (>100 `journal_start()` calls/sec) | 100 ms | Bound worst-case recovery replay to ~100 ms of transactions |
| Moderate rate (10–100 starts/sec) | Linear interpolation: 100–5000 ms | Smooth transition avoids oscillation |
| Low rate (<10 starts/sec) | 5000 ms | Minimize journal I/O for mostly-idle filesystems |
| Idle (0 handles for >1 second) | Immediate commit | Minimize window of dirty uncommitted metadata |
The algorithm samples Transaction::handle_starts at each commit and stores
the result in Journal::commit_interval_ms (an AtomicU32). The commit
timer re-arms itself with the new interval after each commit.
Override: The commit=N mount option forces a fixed interval (in
seconds), disabling adaptive behavior. This provides Linux-compatible
behavior for workloads that depend on a predictable commit cadence.
Recovery time bound: At the highest commit rate, worst-case recovery replays at most ~100 ms of transactions (bounded by commit interval × transaction size). At the default Linux interval of 5 seconds, recovery may need to replay up to 5 seconds of metadata mutations — on a busy database server this can mean gigabytes of journal replay.
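The table above maps to a small interval function. This is a sketch under the stated thresholds; the real implementation samples `Transaction::handle_starts` and stores the result in `Journal::commit_interval_ms`.

```rust
/// Adaptive commit interval in milliseconds, given the observed
/// journal_start() rate (calls per second) from the last commit window.
fn commit_interval_ms(starts_per_sec: u64) -> u32 {
    match starts_per_sec {
        // High metadata rate: bound recovery replay to ~100 ms.
        r if r > 100 => 100,
        // Moderate rate: linear interpolation (10 -> 5000 ms, 100 -> 100 ms).
        r if r >= 10 => {
            let t = (r - 10) as f64 / 90.0;
            (5000.0 + t * (100.0 - 5000.0)) as u32
        }
        // Low rate / idle: minimize journal I/O.
        _ => 5000,
    }
}
```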
15.6.2.1.9 On-Disk Journal Format¶
The on-disk format is byte-identical to Linux JBD2 for volume interoperability. UmkaOS must read journals written by Linux and vice versa.
Journal superblock (1024 bytes at journal block 0):
/// On-disk journal superblock. Layout matches Linux `journal_superblock_s`
/// exactly (1024 bytes).
///
/// **JBD2 on-disk format is big-endian** (defined by ext3 legacy on
/// SPARC/PA-RISC). All multi-byte integer fields use `Be32`/`Be64` types
/// ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) to enforce
/// correct byte-order conversion on all eight supported architectures.
/// On big-endian platforms (PPC32, s390x), `Be32::to_ne()` is a no-op;
/// on little-endian platforms (x86-64, AArch64, ARMv7, RISC-V, PPC64LE,
/// LoongArch64), it performs a byte-swap.
// kernel-internal, not KABI
#[repr(C)]
pub struct JournalSuperblock {
// --- Static information (set at journal creation) ---
/// Header: magic (0xC03B3998), blocktype (4 = superblock v2), sequence.
pub header: JournalHeader,
/// Journal device block size in bytes. Must equal filesystem block size.
pub s_blocksize: Be32,
/// Total number of blocks in the journal (including superblock block).
pub s_maxlen: Be32,
/// First usable block in the journal (usually 1, after superblock).
pub s_first: Be32,
// --- Dynamic information (updated on checkpoint / clean unmount) ---
/// Sequence number of the first transaction in the log.
/// 0 means the journal is clean (no recovery needed).
pub s_sequence: Be32,
/// Block number of the first transaction's first block in the log.
/// 0 when journal is clean.
pub s_start: Be32,
/// Error number from a previous abort (0 = no error).
pub s_errno: Be32,
// --- Feature flags (superblock v2 only) ---
/// Compatible feature flags (journal can be mounted even if unknown bits set).
pub s_feature_compat: Be32,
/// Incompatible feature flags (journal must not be mounted if unknown bits set).
pub s_feature_incompat: Be32,
/// Read-only compatible feature flags.
pub s_feature_ro_compat: Be32,
/// UUID of this journal (128-bit). Byte array — no endianness conversion.
pub s_uuid: [u8; 16],
/// Number of filesystems sharing this journal (0 or 1 for ext4).
pub s_nr_users: Be32,
/// Location of the dynamic superblock copy.
pub s_dynsuper: Be32,
/// Maximum number of blocks per transaction.
pub s_max_transaction: Be32,
/// Maximum number of data blocks per transaction.
pub s_max_trans_data: Be32,
/// Checksum type (1 = CRC32, 2 = MD5, 3 = SHA1, 4 = CRC32C).
/// ext4 uses CRC32C (4) exclusively since Linux 3.5. Single byte — no endianness.
pub s_checksum_type: u8,
pub s_padding2: [u8; 3],
/// Number of fast commit blocks (offset 0x54). Required by
/// JBD2_FEATURE_INCOMPAT_FAST_COMMIT to determine the fast commit
/// area boundaries.
pub s_num_fc_blks: Be32,
/// Block number of the head of the log (offset 0x58). Used for clean
/// unmount optimization — avoids full journal scan on mount.
pub s_head: Be32,
/// Padding to 1024 bytes.
pub s_padding: [Be32; 40],
/// CRC32C of this superblock (with this field set to 0 during computation).
pub s_checksum: Be32,
/// UUIDs of filesystems sharing this journal. Byte array — no endianness.
pub s_users: [u8; 768],
}
const_assert!(core::mem::size_of::<JournalSuperblock>() == 1024);
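A minimal sketch of the `Be32` semantics these structs assume (the real type is defined in Section 6.1): the wrapper stores the big-endian bit pattern in memory, so a `#[repr(C)]` struct of `Be32` fields matches the disk bytes exactly, and conversion happens only on access.

```rust
/// Big-endian 32-bit integer wrapper (illustrative reimplementation).
/// Stores the big-endian bit pattern; converts on access.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Be32(u32);

impl Be32 {
    fn from_ne(v: u32) -> Self {
        // Byte-swap on little-endian hosts, no-op on big-endian hosts.
        Be32(v.to_be())
    }
    fn to_ne(self) -> u32 {
        u32::from_be(self.0)
    }
}
```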
Journal block header (common to all journal block types):
/// Common header at the start of every journal metadata block.
/// All fields are big-endian on disk (JBD2 legacy format).
#[repr(C)]
pub struct JournalHeader {
/// Magic number: `JBD2_MAGIC_NUMBER` (0xC03B3998), stored big-endian.
pub h_magic: Be32,
/// Block type (see `JBD2_DESCRIPTOR_BLOCK` etc.).
pub h_blocktype: Be32,
/// Transaction sequence number (low 32 bits of `Transaction::tid`).
pub h_sequence: Be32,
}
// On-disk JBD2 format: h_magic(4) + h_blocktype(4) + h_sequence(4) = 12 bytes.
const_assert!(core::mem::size_of::<JournalHeader>() == 12);
/// Journal block types.
pub const JBD2_DESCRIPTOR_BLOCK: u32 = 1;
pub const JBD2_COMMIT_BLOCK: u32 = 2;
pub const JBD2_SUPERBLOCK_V1: u32 = 3;
pub const JBD2_SUPERBLOCK_V2: u32 = 4;
pub const JBD2_REVOKE_BLOCK: u32 = 5;
Descriptor block (precedes a sequence of metadata blocks):
/// Tag describing one metadata block in a descriptor block.
///
/// **V3 layout only** (`JBD2_FEATURE_INCOMPAT_CSUM_V3` +
/// `JBD2_FEATURE_INCOMPAT_64BIT`). Matches Linux's `struct journal_block_tag3_s`.
///
/// UmkaOS does not support the V2 tag layout (`journal_block_tag_s`, 12 bytes
/// with Be16 checksum and Be16 flags). V2 journals are rejected at mount time
/// with `EUCLEAN` — Linux forcibly upgrades V2→V3 since kernel 3.18 (2014).
/// Any ext4 volume mounted read-write by any Linux in the last 12 years already
/// has V3. If a V2-only volume is encountered:
/// `return Err(EUCLEAN)` with diagnostic:
/// "JBD2 checksum version 2 not supported; mount with Linux kernel to upgrade."
///
/// An additional 16 bytes for UUID if `JBD2_FLAG_SAME_UUID` is NOT
/// set (first tag only, appended after the tag).
///
/// `journal_tag_bytes()` always returns 16 (no V2 conditional).
///
/// All multi-byte fields are big-endian on disk (JBD2 legacy format).
#[repr(C)]
pub struct JournalBlockTag {
/// Filesystem block number (low 32 bits).
pub t_blocknr: Be32,
/// Flags (`JBD2_FLAG_ESCAPE`, `JBD2_FLAG_SAME_UUID`, etc.).
/// V3 widens this from Be16 (V2) to Be32. Upper 16 bits are reserved
/// and must be zero on write, ignored on read.
pub t_flags: Be32,
/// Filesystem block number (high 32 bits). Always present (UmkaOS
/// requires `JBD2_FEATURE_INCOMPAT_64BIT`).
pub t_blocknr_high: Be32,
/// Full CRC32C checksum of the journaled block. V3 widens from
/// 16-bit (CSUM_V2) to full 32-bit for stronger integrity.
pub t_checksum: Be32,
}
// JournalBlockTag V3: t_blocknr(4) + t_flags(4) + t_blocknr_high(4) +
// t_checksum(4) = 16 bytes.
const_assert!(core::mem::size_of::<JournalBlockTag>() == 16);
/// Tag flag bits. Values match Linux `include/linux/jbd2.h`.
/// Note: in V3 layout, `t_flags` is Be32 but only the low 16 bits
/// carry defined flags. Upper 16 bits are reserved (zero on write).
pub const JBD2_FLAG_ESCAPE: u32 = 0x01; // block content has JBD2_MAGIC at offset 0; escaped
pub const JBD2_FLAG_SAME_UUID: u32 = 0x02; // same UUID as previous tag (omit UUID field)
pub const JBD2_FLAG_DELETED: u32 = 0x04; // block deleted by this transaction
pub const JBD2_FLAG_LAST_TAG: u32 = 0x08; // last tag in this descriptor block
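The tag-size rule (16-byte V3 tag, plus a trailing 16-byte UUID when `JBD2_FLAG_SAME_UUID` is not set) can be illustrated with a hypothetical helper:

```rust
const JBD2_FLAG_SAME_UUID: u32 = 0x02;

/// Bytes consumed in a descriptor block by one V3 tag, including the
/// optional trailing UUID. Helper name is illustrative.
fn tag_bytes_on_disk(t_flags: u32) -> usize {
    let base = 16; // size_of::<JournalBlockTag>() in the V3 layout
    if t_flags & JBD2_FLAG_SAME_UUID != 0 {
        base
    } else {
        base + 16 // UUID appended after the tag
    }
}
```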
Commit block (marks the end of a transaction):
/// Commit record written as the final block of each transaction.
/// The CRC32C in this block covers all descriptor blocks, metadata blocks,
/// AND revoke blocks in the transaction. A valid commit block = atomic commit
/// point. The checksum is computed incrementally: each block (descriptor,
/// metadata, or revoke) is fed into the running CRC32C as it is written to
/// the journal. The final CRC32C is stored in `h_chksum[0]`. During recovery,
/// the journal replayer recomputes the CRC32C over all blocks between the
/// descriptor block and the commit block (inclusive of revoke blocks) and
/// compares against `h_chksum[0]`; a mismatch means the transaction is
/// incomplete and is discarded.
#[repr(C)]
pub struct JournalCommitBlock {
/// Standard header: magic, blocktype = JBD2_COMMIT_BLOCK, sequence.
pub header: JournalHeader,
/// Checksum type (matches `JournalSuperblock::s_checksum_type`).
/// Single byte — no endianness conversion.
pub h_chksum_type: u8,
/// Checksum size in bytes (4 for CRC32C). Single byte — no endianness.
pub h_chksum_size: u8,
pub h_padding: [u8; 2],
/// CRC32C checksum of all descriptor, metadata, and revoke blocks.
/// Array of big-endian u32 words.
pub h_chksum: [Be32; JBD2_CHECKSUM_ELEMENTS],
/// Commit timestamp (seconds since epoch). Written for debugging;
/// not used by recovery. Big-endian on disk.
pub h_commit_sec: Be64,
/// Commit timestamp (nanoseconds component). Big-endian on disk.
pub h_commit_nsec: Be32,
}
/// Checksum array element count. 8 elements × 4 bytes per Be32 = 32 bytes.
/// Matches Linux's `JBD2_CHECKSUM_BYTES = 8` (element count).
/// Named `_ELEMENTS` (not `_SIZE`) to prevent misuse as a byte count.
const JBD2_CHECKSUM_ELEMENTS: usize = 8;
// JournalCommitBlock: header(12) + chksum_type(1) + chksum_size(1) +
// h_padding(2) + h_chksum(32) + h_commit_sec(8) + h_commit_nsec(4) = 60.
const_assert!(core::mem::size_of::<JournalCommitBlock>() == 60);
Revoke descriptor block:
/// Revoke block header. Followed by an array of block numbers.
/// All multi-byte fields are big-endian on disk (JBD2 legacy format).
#[repr(C)]
pub struct JournalRevokeHeader {
/// Standard header: magic, blocktype = JBD2_REVOKE_BLOCK, sequence.
pub header: JournalHeader,
/// Number of bytes of revoke data following this header
/// (including this r_count field).
pub r_count: Be32,
}
// On-disk JBD2 format: header(12) + r_count(4) = 16 bytes.
const_assert!(core::mem::size_of::<JournalRevokeHeader>() == 16);
// Followed by: array of Be64 (with 64BIT feature) or Be32 block numbers.
// Number of entries = (r_count.to_ne() - sizeof(JournalRevokeHeader)) / sizeof(blocknr).
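The entry-count formula in the comment above can be written out directly; `revoke_entry_count` is an illustrative helper (the header size of 16 matches the `const_assert` above, and the block-number width depends on the 64BIT feature):

```rust
/// Size of `JournalRevokeHeader` on disk: header(12) + r_count(4).
const REVOKE_HEADER_SIZE: u32 = 16;

/// Number of revoked block numbers in one revoke block.
/// `r_count` counts all revoke bytes *including* the header itself.
fn revoke_entry_count(r_count: u32, feature_64bit: bool) -> usize {
    let blocknr_size = if feature_64bit { 8 } else { 4 }; // Be64 vs Be32 entries
    ((r_count - REVOKE_HEADER_SIZE) as usize) / blocknr_size
}
```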
Feature flags (from journal superblock s_feature_incompat):
| Flag | Value | Meaning |
|---|---|---|
| JBD2_FEATURE_INCOMPAT_REVOKE | 0x01 | Journal contains revoke records (always set for ext4) |
| JBD2_FEATURE_INCOMPAT_64BIT | 0x02 | Block tags use 64-bit block numbers |
| JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | 0x04 | Commit blocks may be written without a preceding cache flush |
| JBD2_FEATURE_INCOMPAT_CSUM_V2 | 0x08 | Descriptor tags carry per-block CRC16; commit block carries CRC32C of entire transaction |
| JBD2_FEATURE_INCOMPAT_CSUM_V3 | 0x10 | Extended tag format with full 32-bit checksums |
| JBD2_FEATURE_INCOMPAT_FAST_COMMIT | 0x20 | Fast commit area follows the main journal (Linux 5.10+) |
UmkaOS mount-time validation:
- CSUM_V3 set → accepted (normal path). journal_tag_bytes() = 16.
- CSUM_V2 set, CSUM_V3 NOT set → rejected with EUCLEAN. Diagnostic: "JBD2 checksum version 2 not supported; mount with Linux kernel to upgrade." Linux has auto-upgraded V2→V3 since kernel 3.18 (2014); carrying code for this obsolete 12-year-old format would add complexity for no benefit.
- Neither CSUM_V2 nor CSUM_V3 set → accepted in read-only mode only. A read-write mount logs a warning and is rejected with EROFS. Non-checksummed journals predate Linux 3.5 (2012) and should be upgraded by a Linux fsck.
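The mount-time decision above can be sketched as a small match over the two checksum feature bits. The `MountDecision` type and function name are illustrative, not the real driver API; the errno values are the standard Linux numbers:

```rust
const INCOMPAT_CSUM_V2: u32 = 0x08;
const INCOMPAT_CSUM_V3: u32 = 0x10;
const EUCLEAN: i32 = 117; // "Structure needs cleaning"
const EROFS: i32 = 30;    // "Read-only file system"

#[derive(Debug, PartialEq)]
enum MountDecision {
    ReadWrite,    // CSUM_V3: normal path
    ReadOnlyOnly, // no checksums: legacy pre-3.5 journal, read-only only
    Reject(i32),  // errno returned to the mount caller
}

fn validate_journal_checksums(incompat: u32, want_rw: bool) -> MountDecision {
    let v2 = incompat & INCOMPAT_CSUM_V2 != 0;
    let v3 = incompat & INCOMPAT_CSUM_V3 != 0;
    match (v2, v3) {
        (_, true) => MountDecision::ReadWrite,          // V3 is the accepted format
        (true, false) => MountDecision::Reject(EUCLEAN), // obsolete V2-only journal
        (false, false) if want_rw => MountDecision::Reject(EROFS),
        (false, false) => MountDecision::ReadOnlyOnly,
    }
}
```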
Magic number and endianness: All on-disk journal fields are
big-endian (network byte order), matching the original JBD design from
ext3. The magic number 0xC03B3998 is the first 4 bytes of every journal
descriptor, commit, and revoke block. If a metadata block being journaled
happens to start with 0xC03B3998 at offset 0, the JBD2_FLAG_ESCAPE tag
flag is set and the first 4 bytes of the journaled copy are zeroed (restored
on replay).
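The escape mechanism can be sketched as a pair of helpers; the function names are illustrative, and a real driver would operate on block buffers in place rather than returning a `Vec`:

```rust
/// JBD2 magic 0xC03B3998 as it appears on disk (big-endian byte order).
const JBD2_MAGIC: [u8; 4] = [0xC0, 0x3B, 0x39, 0x98];

/// Write path: if the metadata block starts with the journal magic, zero the
/// first 4 bytes of the journaled copy and report that JBD2_FLAG_ESCAPE must
/// be set in the block's descriptor tag.
fn escape_block(block: &[u8]) -> (bool, Vec<u8>) {
    let mut copy = block.to_vec();
    if copy.len() >= 4 && copy[..4] == JBD2_MAGIC {
        copy[..4].copy_from_slice(&[0; 4]);
        (true, copy)
    } else {
        (false, copy)
    }
}

/// Replay path: restore the magic if the tag carried JBD2_FLAG_ESCAPE.
fn unescape_block(escaped: bool, copy: &mut [u8]) {
    if escaped {
        copy[..4].copy_from_slice(&JBD2_MAGIC);
    }
}
```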
15.7 XFS Filesystem Driver¶
Scope note: This section provides UmkaOS-specific XFS filesystem driver specifications: allocation group design, log architecture, Linux compatibility constraints, and feature set. The on-disk format specification for XFS is defined by the upstream project and is not duplicated here — UmkaOS implements the same on-disk format bit-for-bit.
The XFS driver implements the FileSystemOps and InodeOps traits defined in
Section 14.1 (VFS layer). XFS is used in server, workstation,
HPC, and enterprise contexts; it is not consumer-specific.
15.7.1.1 Evolvable/Nucleus Classification¶
| Component | Classification | Rationale |
|---|---|---|
| Allocation group B-tree structures (bnobt, cntbt, inobt, rmapbt) | Nucleus | On-disk format compatibility with Linux XFS v5. |
| xlog write-ahead log format and replay | Nucleus | Crash-consistency invariant. Must match Linux for cross-mount. |
| CRC32C metadata checksums | Nucleus | Integrity verification is a correctness property, not a policy. |
| Reflink extent sharing semantics | Nucleus | CoW correctness invariant (see Section 14.4). |
| Delayed allocation heuristics | Evolvable | Policy: how long to defer allocation is a performance tuning choice. |
| Speculative preallocation strategy | Evolvable | Policy: how much to preallocate beyond EOF is workload-dependent. |
| AG selection for new allocations | Evolvable | Policy: which allocation group to prefer is a parallelism/fragmentation tradeoff. ML-tunable. |
15.7.2 XFS¶
Use cases: Default filesystem on RHEL, CentOS, Fedora, Rocky Linux, and Oracle Linux. Dominant in enterprise storage servers, HPC scratch filesystems, media production storage, and large-scale NFS servers. Designed for very large files and very large directories.
Tier: Tier 1 (same rationale as ext4).
Design:
XFS partitions the volume into allocation groups (AGs), each an independent
unit with its own free-space B-trees (bnobt, cntbt), inode B-tree (inobt),
and reverse-mapping B-tree (rmapbt, v5 only). Allocation groups enable
parallel allocation for multi-threaded workloads — different AGs are independent,
so concurrent file creation on different CPUs does not serialize.
Volume layout (simplified):
[ Superblock | AG 0 | AG 1 | ... | AG N ]
Each AG: [ AG header | free-space B-trees | inode B-tree | data blocks ]
Key features:
- Delayed allocation (delalloc): Blocks are not physically allocated until
writeback, allowing the allocator to choose large contiguous extents instead of
the first available fragment. Critical for streaming-write performance.
- Speculative preallocation: XFS preallocates beyond the current EOF during
sequential writes, then trims unused preallocation on close. Dramatically reduces
fragmentation for growing files (logs, databases, media files).
- Reflink (XFS v5, Linux 4.16+): Copy-on-write extent sharing for cheap
file copies (same semantic as Btrfs reflinks). Required for efficient container
image layering and cp --reflink. XFS declares WriteMode::CopyOnWrite and
implements ExtentSharingOps — see
Section 14.4
for the VFS CoW/RoW infrastructure.
- Reverse mapping B-tree (rmapbt, v5): Tracks which owner (inode or
B-tree structure) holds each physical block. Required for online scrub, online
repair, and reflink. Adds ~5% space overhead.
- Real-time device: XFS optionally uses a separate real-time device for
files tagged with XFS_XFLAG_REALTIME, guaranteeing allocation from a
contiguous extent region. Used in HPC and media production for deterministic
I/O latency. UmkaOS supports the real-time device as a second BlockDevice
passed in the mount option rtdev=.
- xattr namespaces: user., trusted., security., system.posix_acl_*.
The trusted. namespace is restricted to CAP_SYS_ADMIN; the kernel enforces
this via capability checks in setxattr(2).
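The xattr namespace gate described in the last bullet can be sketched as follows. The enum, function names, and the rejection of unknown namespaces with EOPNOTSUPP are illustrative assumptions, not the specified KABI:

```rust
enum XattrNamespace { User, Trusted, Security, SystemPosixAcl }

fn classify(name: &str) -> Option<XattrNamespace> {
    if name.starts_with("user.") { Some(XattrNamespace::User) }
    else if name.starts_with("trusted.") { Some(XattrNamespace::Trusted) }
    else if name.starts_with("security.") { Some(XattrNamespace::Security) }
    else if name.starts_with("system.posix_acl_") { Some(XattrNamespace::SystemPosixAcl) }
    else { None }
}

/// setxattr gate: trusted.* requires CAP_SYS_ADMIN; unknown namespaces
/// are rejected (hypothetical policy sketch).
fn may_setxattr(name: &str, has_cap_sys_admin: bool) -> Result<(), i32> {
    const EOPNOTSUPP: i32 = 95;
    const EPERM: i32 = 1;
    match classify(name) {
        None => Err(EOPNOTSUPP),
        Some(XattrNamespace::Trusted) if !has_cap_sys_admin => Err(EPERM),
        Some(_) => Ok(()),
    }
}
```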
Journal (xlog): XFS uses a write-ahead log (xlog) for all metadata
mutations. The log is circular; the driver replays from the last checkpoint on
mount after unclean shutdown. Log can be on the same device (default) or an
external device (logdev=) for better write isolation on HDD-based arrays.
Linux compatibility: XFS v5 (identified by XFS_SB_VERSION_5 in the superblock
version field; optional features are gated via sb_features_incompat bits such as
XFS_SB_FEAT_INCOMPAT_FTYPE) is required for all new volumes. v5 includes a
CRC checksum on every metadata block (CRC32C), catching silent corruption
that ext4 without metadata checksums would miss. UmkaOS rejects mounting v4
volumes unless a compatibility shim is provided (v4 is deprecated upstream as
of Linux 6.x and not worth supporting at launch).
15.8 Btrfs Filesystem Driver¶
Scope note: This section provides UmkaOS-specific Btrfs filesystem driver specifications: CoW design, RAID profiles, subvolumes, Linux compatibility constraints, and known limitations. The on-disk format specification for Btrfs is defined by the upstream project and is not duplicated here — UmkaOS implements the same on-disk format bit-for-bit.
The Btrfs driver implements the FileSystemOps and InodeOps traits defined in
Section 14.1 (VFS layer). Btrfs is used for workstations, snapshots,
and deployments requiring transparent compression or send/receive; it is not a
general-purpose default.
15.8.1.1 Evolvable/Nucleus Classification¶
| Component | Classification | Rationale |
|---|---|---|
| CoW B-tree structure and transaction commit semantics | Nucleus | On-disk format compatibility with Linux Btrfs. Correctness invariant for atomic snapshots. |
| Subvolume and snapshot tree relationships | Nucleus | Snapshot correctness depends on CoW tree sharing invariants. |
| Checksum verification (CRC32C, xxhash, sha256, blake2b) | Nucleus | Data integrity verification is a correctness property. |
| RAID 1/1C3/1C4/10 mirror placement | Nucleus | Mirror placement correctness ensures data survives device failure. |
| incompat_flags feature gating on mount | Nucleus | Must reject unknown INCOMPAT bits to prevent silent corruption. |
| Transparent compression algorithm selection (LZO, ZLIB, ZSTD) | Evolvable | Policy: which algorithm to use for new data is a space/CPU tradeoff. ML-tunable. |
| Free space cache management strategy (v2 B-tree) | Evolvable | Policy: how to organize free space metadata is a performance heuristic. |
| nodatacow decision for database subvolumes | Evolvable | Policy: operator-configurable CoW bypass per mount/subvolume. |
| Scrub scheduling and priority | Evolvable | Policy: when and how aggressively to run background verification. |
15.8.2 Btrfs¶
Use cases: Fedora workstations, Steam Deck, openSUSE. Used in enterprise for snapshot and send/receive capabilities (Proxmox, SUSE). Relevant at kernel level wherever atomic snapshots, compression, or multi-device volumes are needed. Not recommended as a default filesystem — ext4 (general purpose), XFS (enterprise/large files), and ZFS (data integrity/servers) are preferred defaults depending on workload. Btrfs is appropriate when its unique features (subvolume snapshots, transparent compression, send/receive) are specifically required and the operator accepts the limitations documented below.
Tier: Tier 1.
Design: Btrfs is a copy-on-write (CoW) B-tree filesystem. Every write
produces a new copy of the modified data/metadata; the old copy is retained
until freed. This is the foundation for snapshots (zero-cost at creation) and
atomic multi-file transactions. Btrfs declares WriteMode::RedirectOnWrite and
implements ExtentSharingOps — see
Section 14.4
for the VFS CoW/RoW infrastructure that Btrfs, ZFS, and future UPFS all build upon.
Key features:
| Feature | Kernel behaviour |
|---|---|
| Subvolumes | Independent CoW trees within a volume; each mountable separately. The kernel tracks the active subvolume ID per mount point. |
| Snapshots | Read-write or read-only clone of a subvolume at a point in time. Zero-cost creation (no data copied). Used by UmkaOS live update rollback (Section 13.18). |
| Reflinks | Shallow file copy (cp --reflink). Shares extent references until written. Critical for container runtimes and package managers. |
| Transparent compression | Per-file or per-subvolume, online. Algorithms: LZO (fast), ZLIB (balanced), ZSTD (best ratio, default for UmkaOS). Kernel compresses on writeback; decompresses on read. |
| RAID profiles | RAID 0 / 1 / 1C3 / 1C4 / 5 / 6 / 10 across multiple BlockDevice instances. RAID 5/6 write hole: Btrfs's CoW design significantly reduces but does not eliminate the write hole — partial stripe writes are atomic at the Btrfs extent level, but the parity update itself is not crash-atomic. UmkaOS provides no block-layer mitigation because Btrfs implements its own RAID layer above the block I/O interface, so the block layer has no visibility into Btrfs stripe operations; the block layer stripe log (Section 15.2) applies only to md-raid and dm-raid arrays. See the RAID 5/6 reliability limitation below. |
| Online scrub | Background verification of all data and metadata checksums. Driven by a kernel thread (btrfs-scrub); progress exposed via ioctl and sysfs. |
| Send/receive | Incremental snapshot delta serialisation. btrfs send produces a stream; btrfs receive applies it on another volume. Used for backup, replication, and container image distribution. |
| Free space tree | v2 free-space cache (b-tree based); replaces the v1 file-based cache. Required for large volumes (>1 TiB); UmkaOS always mounts with space_cache=v2. |
CoW and O_SYNC interaction: Because Btrfs delays the final tree root
update until transaction commit, fsync must trigger a full transaction commit
(not just a data flush) to satisfy durability. The driver calls
btrfs_commit_transaction() on fsync for non-nodatacow files. This is a
known latency source for databases; the architecture recommends nodatacow
mount option for database subvolumes (trades crash consistency for performance,
consistent with how PostgreSQL and MySQL recommend mounting their data
directories on any CoW filesystem).
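The fsync decision above can be sketched as a small policy function. This is a hedged illustration, not the real driver logic: the `metadata_dirty` refinement (a nodatacow file whose metadata changed in the open transaction still needs a commit) is an assumption of this sketch, and the names are hypothetical:

```rust
#[derive(Debug, PartialEq)]
enum FsyncPlan {
    /// Flush dirty data only (nodatacow fast path; no tree root update needed).
    DataFlush,
    /// Full btrfs_commit_transaction(): the new tree root reaches stable storage.
    TransactionCommit,
}

fn plan_fsync(nodatacow: bool, metadata_dirty: bool) -> FsyncPlan {
    // Even a nodatacow file needs a commit if its metadata (size, extent map)
    // changed within the currently open transaction.
    if nodatacow && !metadata_dirty {
        FsyncPlan::DataFlush
    } else {
        FsyncPlan::TransactionCommit
    }
}
```

The commit path is the latency source the surrounding text warns about: every fsync on a CoW file pays for the whole transaction, not just that file's data.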
Live update integration (Section 13.18): Btrfs subvolume snapshots can support snapshot-based atomic OS updates. A live update agent can create a read-only snapshot of the root subvolume before applying an update, making rollback trivial and zero-downtime. This makes Btrfs a natural fit for deployments that use snapshot-based atomic updates; on servers where ext4 or XFS is already in use, this advantage does not justify a migration.
Linux compatibility: Btrfs on-disk format is stable since Linux 3.14.
UmkaOS's Btrfs driver is wire-compatible with Linux's. Volumes created on Linux
are mountable by UmkaOS. Feature detection uses the incompat_flags superblock
field; the driver rejects mount if any unknown INCOMPAT bit is set.
Limitations documented (these are well-known, upstream-acknowledged problems):
- RAID 5/6 reliability: The Btrfs RAID 5/6 write hole remains an active
concern on LKML as of 2025 despite partial mitigations. Btrfs upstream
documentation still marks RAID 5/6 as "not recommended for production."
The block layer stripe log (Section 15.2)
applies only to md-raid and dm-raid arrays, NOT to Btrfs RAID 5/6 (Btrfs
implements its own RAID layer above the block I/O interface — the block
layer has no visibility into Btrfs stripe operations). Users requiring
parity RAID should use ZFS (RAIDZ) or md-raid + ext4/XFS instead of
Btrfs RAID 5/6. Btrfs RAID 1/1C3/1C4/10 are stable and recommended.
- fsync latency: CoW transaction commit on fsync is a known latency
source for database workloads. The nodatacow workaround trades crash
consistency for performance. Database servers should prefer ext4 or XFS.
- nodatacow files cannot have checksums. Applications that disable CoW for
performance must accept no data integrity checking on those files.
- Very large directories (>1M entries) have worse performance than XFS due to
CoW overhead on directory mutations.
- Less battle-tested than ext4/XFS: Btrfs has a shorter production track
record. ext4 has been the Linux default since 2008; XFS has been the RHEL
default since 2014. Btrfs became Fedora's desktop default in 2020 and
openSUSE's in 2014, but enterprise adoption remains limited outside
snapshot-centric workflows.
15.9 Removable Media, Interoperability Filesystems, and FUSE¶
15.9.1 Removable Media and Interoperability Filesystems¶
These filesystem drivers serve interoperability with Windows, macOS, and removable media standards. They are not consumer-specific — embedded systems, edge nodes, and industrial devices also use FAT/exFAT/NTFS for removable storage interoperability.
UmkaOS's strategy for these filesystems is native in-kernel drivers implemented as
Tier 1 drivers using the standard FileSystemOps / InodeOps / FileOps trait set
(Section 14.1). FUSE-backed userspace drivers are supported as a compatibility
mechanism for filesystems where a full native implementation is deferred; the FUSE
subsystem is specified in Section 15.9.1.4.
15.9.1.1 exFAT¶
Use case: SDXC (SD cards >32 GB) mandates exFAT per the SD Association's SD specification. USB flash drives commonly use exFAT. Required for read/write interop with Windows and macOS systems.
Tier: Tier 1 (in-kernel umka-exfat driver).
Implementation: Microsoft published the exFAT specification as an open
specification in 2019 (SPDX: LicenseRef-exFAT-Specification; no royalty or
patent encumbrance for implementors). The exFAT on-disk format is simpler than
ext4 or XFS: a flat cluster-chain FAT plus an Allocation Bitmap (contiguous,
unfragmented files bypass the FAT chain entirely), a root directory cluster
chain, and per-file directory entries using
UTF-16 with UpCase table normalization. UmkaOS's native umka-exfat driver
implements the full read/write path using the FileSystemOps trait.
Compatibility: Read/write. Cluster sizes from 512 B to 32 MB. Files up to
16 EiB (volume limit). Directory entries use UTF-16LE with the volume's UpCase
table. Timestamps include UTC offset field (Windows 10+). No journaling; power
loss can corrupt a directory entry mid-write. The driver issues a FLUSH CACHE
command to the underlying block device after each fsync to bound exposure.
Linux compatibility: exFAT volumes created on Linux (kernel exFAT driver, merged in 5.7) are mountable by UmkaOS and vice versa. The UpCase table format and cluster allocation bitmap are identical.
15.9.1.2 NTFS¶
Use case: External drives shared with Windows installations. Common on USB hard drives purchased pre-formatted. Required for read/write interop with Windows-hosted data volumes.
Tier: Tier 1 (in-kernel ntfs3 driver; based on the Paragon ntfs3
implementation merged into Linux 5.15).
Implementation: UmkaOS's ntfs3 driver is derived from the upstream Linux ntfs3
implementation by Paragon Software Group. It provides full read/write support
including NTFS compression (LZNT1, applied per 16-cluster compression unit), sparse files (sparse runs),
and hard links (multiple $FILE_NAME attributes per MFT record).
Features not supported (return EOPNOTSUPP on access):
- Alternate Data Streams exposed as separate mount namespace entries (ADS
content is preserved on read/write of the primary stream but not enumerable
via openat/readdir).
- Reparse points used as Windows junction points or symlinks (IO_REPARSE_TAG_SYMLINK,
IO_REPARSE_TAG_MOUNT_POINT) — accessed as regular files or returned as
DT_UNKNOWN in directory listings.
- Encrypted files ($EFS attribute) — opened successfully but content reads
return raw ciphertext with a warning in the kernel log.
Phase constraint: Full NTFS write support is present from Phase 2. The NTFS
log ($LogFile) is replayed on mount to ensure volume consistency after unclean
shutdown, matching Linux ntfs3 behavior ($UsnJrnl, the change journal, is
maintained but is not a consistency mechanism). No part of NTFS write support
is deferred; the complexity of NTFS journaling, compression, and sparse files
is handled by the derived ntfs3 implementation.
Linux compatibility: Wire-compatible with Linux ntfs3. Volumes created on Linux ntfs3 are mountable by UmkaOS and vice versa.
15.9.1.3 APFS (Read-Only)¶
Use case: External drives formatted by macOS. Required for data migration from macOS systems and for mounting Apple Silicon boot drives in dual-boot or forensic scenarios.
Tier: Tier 1 (in-kernel read-only driver, Phase 4+).
Phase constraint: APFS write support is permanently deferred. The APFS
on-disk format is not a public specification; Apple documents only enough for
APFS tooling on macOS. Reverse-engineered write support risks silent metadata
corruption when Apple makes undocumented changes between macOS releases. The
read-only constraint is therefore not a temporary limitation but a deliberate
design boundary: APFS volumes mounted by UmkaOS are always mounted read-only,
enforced in the FileSystemOps::mount() implementation by returning EROFS if
MountFlags::READ_WRITE is set.
Implementation: Read-only native kernel driver derived from the apfs-fuse
project's reverse-engineered format analysis (MIT licensed). Supported features:
- APFS container and volume superblock parsing.
- B-tree (object map, file system tree) traversal.
- Extent-based and inline file data.
- Compression (APFS_COMPRESS_ZLIB, APFS_COMPRESS_LZVN, APFS_COMPRESS_LZFSE).
- Symlinks, hard links (inode numbers via DREC_TYPE_HARDLINK).
- Extended attributes (xattr tree).
- Time Machine snapshot enumeration (read-only).
Phase ordering: Phase 3 delivers HFS+ read-only support (for older macOS volumes). Phase 4 delivers APFS read-only, layered on the HFS+ driver's infrastructure for Apple partition map and CoreStorage detection.
Until Phase 4, APFS volumes are accessible via the FUSE subsystem
(Section 15.9.1.4) using the apfs-fuse userspace daemon, which provides
a compatible FileDescriptor interface through FuseSession.
15.9.1.4 FUSE — Userspace Filesystem Framework¶
FUSE (Filesystem in Userspace) enables userspace daemons to implement filesystems
served through the kernel VFS. UmkaOS implements the FUSE kernel interface as a
Tier 2 bridge driver, compatible with the Linux /dev/fuse protocol (FUSE
protocol version 7.x; minimum negotiated minor version: 26, a floor which
guarantees FUSE_RENAME2 (added in protocol 7.23) and FUSE_LSEEK (7.24) are
always available).
Scope: FUSE is a compatibility and extensibility mechanism. Native in-kernel
drivers are preferred for performance-critical or widely-used filesystems.
FUSE is the appropriate path for:
- Filesystems with complex or proprietary on-disk formats where a native
kernel driver is not feasible (e.g., APFS before Phase 4).
- Userspace tools that already implement a filesystem (e.g., sshfs, s3fs,
custom FUSE daemons in container runtimes).
- Development and prototyping of new filesystem drivers before promotion
to Tier 1.
Protocol: The FUSE kernel↔daemon protocol uses /dev/fuse. The kernel writes
request messages (opcodes: FUSE_LOOKUP, FUSE_OPEN, FUSE_READ, FUSE_WRITE,
FUSE_READDIR, etc.) into the fd; the daemon reads them, processes them, and
writes reply messages back. Each request carries a unique identifier
matching it to its reply. The wire format is identical to Linux libfuse
protocol version 7.x, ensuring compatibility with all existing FUSE daemons
without recompilation.
FuseSession struct — kernel-side state for one mounted FUSE filesystem:
/// Kernel-side state for one active FUSE mount.
///
/// Created when the userspace daemon opens `/dev/fuse` and calls `mount(2)`
/// with `fstype = "fuse"`. Destroyed when the daemon closes the fd or the
/// mount is forcibly unmounted (`umount -f`).
pub struct FuseSession {
/// Negotiated FUSE protocol version (major, minor).
/// Major is always 7 for current FUSE protocol; minor is negotiated
/// during `FUSE_INIT` handshake. The kernel refuses to mount if the
/// daemon proposes major != 7.
pub proto_version: (u32, u32),
/// The `/dev/fuse` file descriptor held open by the daemon process.
/// Closing this fd triggers an implicit `FUSE_DESTROY` + unmount.
pub dev_fd: FileDescriptor,
/// Mount flags captured at mount time (read-only, no-exec, etc.).
/// Propagated to `InodeOps::permission()` checks within this session.
pub mount_flags: MountFlags,
/// Maximum write payload the daemon declared it can handle, in bytes.
/// Capped at `FUSE_MAX_MAX_PAGES * PAGE_SIZE` (128 × 4096 = 512 KiB).
/// The kernel splits `FUSE_WRITE` requests larger than this value.
pub max_write: u32,
/// Maximum `readahead` size the kernel will request, in bytes.
/// Negotiated during `FUSE_INIT`; 0 disables kernel readahead for
/// this mount.
pub max_readahead: u32,
/// Whether the daemon supports `FUSE_ASYNC_READ` (concurrent reads
/// on the same file handle without serialization). Declared by the
/// daemon in `FUSE_INIT` flags. When false, the kernel serializes
/// all reads per file handle.
pub async_read: bool,
/// Whether the daemon supports `FUSE_WRITEBACK_CACHE` mode.
/// When true, the kernel VFS page cache handles write coalescing and
/// fsync; individual 4 KB write-cache flushes are not sent per page.
/// When false, every `write(2)` generates a `FUSE_WRITE` request.
pub writeback_cache: bool,
/// Pending request queue. Requests generated by VFS operations are
/// enqueued here; the daemon's `read(2)` on `/dev/fuse` dequeues them.
/// Bounded to `FUSE_MAX_PENDING` (default: 4096) requests
/// to apply backpressure to VFS callers when the daemon is slow.
pub pending: FuseRequestQueue,
/// In-flight requests awaiting a reply from the daemon. Keyed by
/// `unique` identifier. On daemon close, all in-flight requests are
/// completed with `ENOTCONN`.
pub inflight: FuseInflightMap,
}
FuseRequestQueue and FuseInflightMap are internal kernel types; their
exact layout is not part of the KABI — only the FuseSession fields visible to
the Tier 2 FuseDriver are stable.
// Internal type aliases (not KABI-stable):
// FuseRequestQueue: bounded MPMC ring for pending requests. Capacity is
// FUSE_MAX_PENDING (default: 4096). VFS operations push, daemon read() pops.
type FuseRequestQueue = BoundedMpmcRing<Arc<FuseRequest>, FUSE_MAX_PENDING>;
// FuseInflightMap: integer-keyed XArray for in-flight requests. Keyed by
// `unique` (u64 monotonic request ID). O(1) lookup on daemon reply. RCU-safe
// reads for the abort-all-on-close path.
type FuseInflightMap = XArray<Arc<FuseRequest>>;
See Section 14.11 for the canonical FuseConn struct that
uses these types directly with full documentation.
FUSE_INIT handshake: On first read(2) from the daemon, the kernel sends
a FUSE_INIT request with major = 7, minor = UMKA_FUSE_MINOR (the maximum
minor the kernel supports). The daemon replies with its supported minor; the
negotiated minor is min(kernel_minor, daemon_minor). Capabilities (flags
field) are intersected: a capability is active only if both sides declare it.
The kernel stores the negotiated values in FuseSession::proto_version and the
derived async_read, writeback_cache, max_write, max_readahead fields.
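The negotiation rules above (minor = min of both sides, capabilities = intersection of both flag sets) can be sketched as follows. `UMKA_FUSE_MINOR` is a hypothetical constant for this sketch, and the errno choice for a failed handshake is an assumption:

```rust
const UMKA_FUSE_MINOR: u32 = 45; // assumed max minor the kernel supports (illustrative)
const FUSE_MIN_MINOR: u32 = 26;  // floor from the protocol requirement above

struct Negotiated { minor: u32, flags: u64 }

fn negotiate(daemon_major: u32, daemon_minor: u32,
             kernel_flags: u64, daemon_flags: u64) -> Result<Negotiated, i32> {
    const EPROTO: i32 = 71;
    if daemon_major != 7 {
        return Err(EPROTO); // kernel refuses to mount on non-7 major
    }
    let minor = daemon_minor.min(UMKA_FUSE_MINOR);
    if minor < FUSE_MIN_MINOR {
        return Err(EPROTO); // daemon too old for the negotiated floor
    }
    // A capability is active only if BOTH sides declare it.
    Ok(Negotiated { minor, flags: kernel_flags & daemon_flags })
}
```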
Error handling: If the daemon crashes or closes /dev/fuse with in-flight
requests, all pending VFS operations on the mount return ENOTCONN. The mount
remains in the VFS tree but is marked MS_DEAD; subsequent operations return
ENOTCONN until the mount is explicitly removed with umount. A daemon can
reconnect to a dead mount by opening /dev/fuse with O_RDWR | O_CLOEXEC
and the same mount cookie — this is the basis for daemon live-restart without
unmounting (supported when FUSE_CONN_INIT_WAIT is negotiated).
Security: The /dev/fuse fd is accessible only to the mounting user (or
root). Filesystem operations that arrive from processes outside the mounting
user's UID are checked against the allow_other mount option. Without
allow_other, FUSE_ACCESS is called only for processes with the mounting
UID/GID; others receive EACCES at the VFS permission check before the FUSE
request is even generated.
Phase: FUSE kernel infrastructure is delivered in Phase 3. FUSE daemons
such as apfs-fuse, sshfs, and custom drivers are usable from Phase 3
onward. The native APFS in-kernel driver (Phase 4) supersedes apfs-fuse for
performance-sensitive workloads but does not remove FUSE support.
15.9.1.4.1 FUSE KABI Ring Protocol¶
FUSE communication uses two BoundedMpmcRing buffers in a shared memory region
mapped into both the kernel and the Tier 2 daemon process:
- Request ring (kernel → daemon): kernel posts a FuseRequest; daemon pops and processes.
- Reply ring (daemon → kernel): daemon posts a FuseReply; kernel pops and unblocks
the waiting VFS caller.
Wire format (matches Linux FUSE ABI for daemon compatibility):
#[repr(C, align(8))]
pub struct FuseInHeader {
pub len: u32, // total message length including header
pub opcode: u32, // FuseOpcode
pub unique: u64, // request correlation ID; daemon must echo in reply
pub nodeid: u64, // inode number
pub uid: u32, // requesting process UID
pub gid: u32, // requesting process GID
pub pid: u32, // requesting process PID
pub _pad: u32,
}
const_assert!(core::mem::size_of::<FuseInHeader>() == 40);
#[repr(C, align(8))]
pub struct FuseOutHeader {
pub len: u32,
pub error: i32, // 0 on success; negative errno on error
pub unique: u64, // matches FuseInHeader::unique
}
const_assert!(core::mem::size_of::<FuseOutHeader>() == 16);
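A daemon reply is built by filling the header above: `len` covers the 16-byte header plus the payload, `error` is 0 or a negative errno, and `unique` echoes the request. The helper below is an illustrative sketch (named `OutHeader`/`build_reply` to avoid implying it is the real KABI):

```rust
#[repr(C, align(8))]
pub struct OutHeader {
    pub len: u32,    // total message length including this header
    pub error: i32,  // 0 on success; negative errno on error
    pub unique: u64, // echoes FuseInHeader::unique
}

/// Build a reply header for a payload of `payload_len` bytes.
fn build_reply(unique: u64, errno: i32, payload_len: usize) -> OutHeader {
    OutHeader {
        len: (core::mem::size_of::<OutHeader>() + payload_len) as u32,
        error: -errno.abs(), // wire convention: negative errno, 0 = success
        unique,
    }
}
```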
// Subset — see linux/fuse.h for complete list. UmkaOS implements all
// Linux FUSE opcodes (52 opcodes as of protocol version 7.45). Only the
// most commonly used opcodes are shown here for reference; the complete
// opcode table with all values is in the FUSE section (fuse-filesystem-in-userspace).
#[repr(u32)]
pub enum FuseOpcode {
Lookup = 1,
Forget = 2,
Getattr = 3,
Setattr = 4,
Readlink = 5,
Mknod = 8,
Mkdir = 9,
Unlink = 10,
Rmdir = 11,
Rename = 12,
Open = 14,
Read = 15,
Write = 16,
Release = 18,
Fsync = 20,
Flush = 25,
Init = 26,
Opendir = 27,
Readdir = 28,
Releasedir = 29,
Create = 35,
Rename2 = 45,
Lseek = 46,
// ... all remaining opcodes (Symlink=6, Link=13, Statfs=17,
// SetXattr=21..RemoveXattr=24, FsyncDir=30, GetLk=31..SetLkW=33,
// Access=34, Interrupt=36, Bmap=37, Destroy=38, Ioctl=39, Poll=40,
// NotifyReply=41, BatchForget=42, Fallocate=43, ReaddirPlus=44,
// CopyFileRange=47, SetupMapping=48, RemoveMapping=49, SyncFs=50,
// TmpFile=51, Statx=52) are implemented identically.
}
KABI vtable (registered by the in-kernel FUSE driver; called by the Tier 2 daemon):
#[repr(C)]
pub struct FuseKabiVTable {
pub vtable_size: u64,
/// Primary version discriminant: `KabiVersion::as_u64()`. See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
pub kabi_version: u64,
/// Called once by the daemon after opening `/dev/fuse` and mapping the shared rings.
/// `req_ring` and `rep_ring` are the two ring buffers in the shared memory region.
pub fuse_connect: unsafe extern "C" fn(
session: *mut FuseSession,
req_ring: *mut BoundedMpmcRing<FuseRequest, FUSE_RING_DEPTH>,
rep_ring: *mut BoundedMpmcRing<FuseReply, FUSE_RING_DEPTH>,
) -> i32,
/// Called by the daemon to signal it has drained all pending requests
/// and is ready for the mount to be torn down.
pub fuse_disconnect: unsafe extern "C" fn(session: *mut FuseSession) -> i32,
/// Called by the daemon to invalidate a cached inode or dentry.
pub fuse_notify: unsafe extern "C" fn(
session: *mut FuseSession,
nodeid: u64,
notify: FuseNotifyCode,
) -> i32,
}
// KABI vtable — pointer-width-dependent (contains fn pointers).
// 64-bit: vtable_size(8) + kabi_version(8) + 3 fn ptrs(24) = 40 bytes.
#[cfg(target_pointer_width = "64")]
const_assert!(core::mem::size_of::<FuseKabiVTable>() == 40);
#[cfg(target_pointer_width = "32")]
const_assert!(core::mem::size_of::<FuseKabiVTable>() == 28);
pub const FUSE_RING_DEPTH: usize = 256; // power-of-two; tunable via sysctl fuse.ring_depth
/// FUSE notify codes — daemon-to-kernel unsolicited notifications.
/// Values MUST match Linux `include/uapi/linux/fuse.h` `enum fuse_notify_code`
/// exactly — FUSE daemons send these numeric values on the wire.
#[repr(u32)]
pub enum FuseNotifyCode {
Poll = 1, // wake all pollers on the specified file handle
InvalInode = 2, // invalidate cached inode attributes
InvalEntry = 3, // invalidate a dentry in the parent directory
Store = 4, // push data into the kernel page cache
Retrieve = 5, // pull data from the kernel page cache
Delete = 6, // remove a dentry (daemon-side deletion)
Resend = 7, // resend a previously interrupted request
IncEpoch = 8, // increment kernel-side epoch counter
Prune = 9, // prune (evict) dentries from a directory
}
Session lifecycle:
1. Daemon opens /dev/fuse (major=10, minor=229, same as Linux).
2. Daemon maps the shared ring memory region via mmap(2) on the fd.
3. Daemon calls fuse_connect() via the KABI vtable, passing ring pointers.
4. Kernel posts a FUSE_INIT request; daemon replies with its capability flags.
5. After the FUSE_INIT handshake, the kernel dispatches VFS requests to the request ring.
6. Daemon pops requests, processes them, and pushes FuseReply entries to the reply ring.
7. On unmount: kernel posts FUSE_DESTROY; daemon responds and calls fuse_disconnect().
Blocking semantics: VFS calls block until the daemon posts the matching reply
(matched by unique ID). The blocked task waits on a per-request WaitQueue.
If the daemon exits before posting a reply, the kernel detects session teardown
and completes all pending operations with ENOTCONN, matching the abort-all
semantics of FuseInflightMap.
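A minimal user-space model of these blocking semantics: each request gets a one-shot channel keyed by its `unique` ID, the replier wakes exactly one waiter, and teardown fails every in-flight request with ENOTCONN (the abort-all path described for FuseInflightMap). In the kernel the channel is a per-request WaitQueue; the types here are an illustrative sketch:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

struct InflightMap {
    next_unique: u64,
    waiters: HashMap<u64, Sender<i32>>,
}

impl InflightMap {
    fn new() -> Self { InflightMap { next_unique: 1, waiters: HashMap::new() } }

    /// Issue a request: allocate a unique ID and a receiver the caller blocks on.
    fn issue(&mut self) -> (u64, Receiver<i32>) {
        let unique = self.next_unique;
        self.next_unique += 1;
        let (tx, rx) = channel();
        self.waiters.insert(unique, tx);
        (unique, rx)
    }

    /// Daemon reply: complete the matching request with its result.
    fn reply(&mut self, unique: u64, result: i32) -> bool {
        match self.waiters.remove(&unique) {
            Some(tx) => tx.send(result).is_ok(),
            None => false, // stale or duplicate reply: no waiter to wake
        }
    }

    /// Session teardown: fail all in-flight requests with -ENOTCONN.
    fn abort_all(&mut self) {
        const ENOTCONN: i32 = 107;
        for (_, tx) in self.waiters.drain() {
            let _ = tx.send(-ENOTCONN);
        }
    }
}
```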
15.9.2 Summary of Design Decisions¶
- Tier 1 placement: overlayfs runs in the VFS domain because it is a pure VFS stacking layer with moderate code complexity. Tier 2 would double domain-crossing overhead for every file operation in every container.
- xattr-based whiteouts as default: Avoids the CAP_MKNOD requirement for rootless containers. Character device 0:0 whiteouts are recognized on read for backward compatibility.
- Metacopy enabled by default: Matches modern Docker/containerd behavior. The security caveat (attacker-crafted xattrs) is mitigated by the trusted.* namespace restriction and container runtime control of layer provenance.
- Atomic copy-up via workdir rename: Uses the same-filesystem rename guarantee. The workdir must share a superblock with upperdir, which the mount validation enforces.
- Dentry invalidation on copy-up: Uses d_invalidate() on the parent directory's dentry for the affected name, forcing re-lookup through the overlay lookup() path, which will find the new upper entry.
- d_revalidate() for overlay dentries: Checks for copy-up state changes. This is the primary mechanism by which concurrent readers discover that a file has been copied up.
- Readdir merge with HashSet dedup: O(entries × layers) with hash-based dedup. The merged listing is cached per-opendir for consistency.
- xattr escaping for nested overlays: Supports overlayfs-on-overlayfs via the trusted.overlay.overlay.* prefix convention, matching Linux.
- Volatile sentinel directory: Prevents mounting on unclean upper layers. The sentinel is created on mount, removed on clean unmount.
- dm-verity + IMA dual coverage: Lower layers are protected by dm-verity (block-level, Section 9.3), the upper layer by IMA (file-level, Section 9.5).
15.10 ZFS Integration¶
15.10.1 Native ZFS and Filesystem Licensing¶
Linux problem: ZFS can't be merged due to CDDL vs GPL license incompatibility. Users rely on out-of-tree OpenZFS which breaks with kernel updates.
UmkaOS design:
- The kernel is licensed under UmkaOS's proposed OKLF v1.3 license framework (see Appendix A of 23-roadmap.md, Section 24.1 for the full specification — OKLF is a novel license being developed for UmkaOS, not a pre-existing published license): GPLv2 base with the
Approved Linking License Registry (ALLR), which explicitly includes CDDL as an
approved license. CDDL-licensed code (like OpenZFS) communicates with the kernel
via KABI IPC without license conflict (no in-kernel linking occurs).
- ZFS is a first-class Tier 1 filesystem driver, same tier as ext4, XFS, and Btrfs.
The KABI interface provides the license boundary: ZFS is dynamically loaded, has one
resolved symbol (__kabi_driver_entry), communicates exclusively through ring buffer
IPC and vtable dispatch — no linking, no shared symbols. This provides more
isolation than Linux's EXPORT_SYMBOL_GPL boundary (where modules ARE linked into
the kernel and share function calls). The license separation is provided by KABI,
not by the isolation tier — running a filesystem as Tier 2 (process isolation) for
licensing reasons would impose catastrophic I/O overhead (~200-500 cycles per VFS
operation) for zero additional legal benefit.
- NFSv4 ACLs are first-class (Section 9.2), so ZFS's native ACL model works natively.
- Filesystem KABI interface is rich enough to support ZFS's advanced features: snapshots,
send/receive, datasets, native encryption, dedup, special vdevs.
- ZFS benefits from the stable driver ABI, so it won't break with kernel updates —
eliminating the primary pain point of Linux's out-of-tree OpenZFS module.
15.10.2 ZFS Advanced Features¶
Section 15.10 establishes that ZFS is a first-class UmkaOS citizen via KABI (Tier 1 driver). This section covers advanced ZFS features that benefit from UmkaOS's architecture: capability-based dataset management, RDMA-accelerated replication, and cluster integration.
Dataset hierarchy as capability objects — ZFS datasets form a hierarchy (pool → dataset → child dataset → snapshot → clone). In UmkaOS, each dataset is a capability object (Section 9.2), and the capability token for a dataset encodes the specific operations permitted:
| Capability | Permits |
|---|---|
| `CAP_ZFS_MOUNT` | Mount the dataset as a filesystem |
| `CAP_ZFS_SNAPSHOT` | Create/destroy snapshots of the dataset |
| `CAP_ZFS_SEND` | Generate a send stream (for replication) |
| `CAP_ZFS_RECV` | Receive a send stream into this dataset |
| `CAP_ZFS_CREATE` | Create child datasets |
| `CAP_ZFS_DESTROY` | Destroy the dataset (highest privilege) |
Phase 4+ note: These `CAP_ZFS_*` capability names are conceptual placeholders. They are not yet defined in the capability model (Ch 9). Until the Phase 4+ ZFS-specific KABI is implemented, all ZFS administrative operations require `Capability::SysAdmin`. The capability names shown here illustrate the target delegation model.
Delegation means transferring a subset of your capabilities to another local entity
(a container, a user). A pool administrator holding all capabilities can delegate
CAP_ZFS_MOUNT + CAP_ZFS_SNAPSHOT + CAP_ZFS_CREATE for a subtree to a container —
the container can mount, snapshot, and create children within its subtree, but cannot
destroy the parent dataset or send replication streams. For shared storage across
hosts, use clustered filesystems (Section 15.14) backed by the DLM (Section 15.15) over
shared block devices (Section 15.13).
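The delegation invariant reduces to a bitmask subset check: a child may receive only capabilities its delegator already holds. The sketch below uses illustrative bit positions for the `CAP_ZFS_*` names, which (per the Phase 4+ note above) are conceptual placeholders:

```rust
// Sketch: delegation as monotone attenuation of a capability mask.
// Bit positions are illustrative placeholders, not a defined ABI.
const CAP_ZFS_MOUNT: u32 = 1 << 0;
const CAP_ZFS_SNAPSHOT: u32 = 1 << 1;
const CAP_ZFS_SEND: u32 = 1 << 2;
const CAP_ZFS_RECV: u32 = 1 << 3;
const CAP_ZFS_CREATE: u32 = 1 << 4;
const CAP_ZFS_DESTROY: u32 = 1 << 5;

/// A delegation is valid only if every requested bit is already held
/// by the delegating parent — no privilege escalation is possible.
fn delegate(parent_caps: u32, requested: u32) -> Result<u32, ()> {
    if requested & !parent_caps != 0 {
        Err(()) // requested a capability the delegator lacks
    } else {
        Ok(requested)
    }
}
```

A container delegated `CAP_ZFS_MOUNT | CAP_ZFS_SNAPSHOT | CAP_ZFS_CREATE` can further delegate any subset of those three, but any attempt to mint `CAP_ZFS_DESTROY` fails the subset check.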
zvol (ZFS volumes) — ZFS volumes are datasets that expose a block device interface instead of a POSIX filesystem. UmkaOS integrates zvols with umka-block's device-mapper framework — a zvol can serve as the backing store for dm-crypt, dm-mirror, or as an iSCSI LUN (Section 15.13). This enables ZFS's checksumming, compression, and snapshot capabilities for raw block storage consumers.
zfs send/recv over RDMA — ZFS replication streams (zfs send) are often used for
backup, disaster recovery, and dataset migration. In Linux, zfs send | ssh remote zfs
recv pushes the stream over TCP (typically SSH-encrypted). UmkaOS provides a native RDMA
transport option:
- Uses Section 5.4's RDMA infrastructure
- Kernel-to-kernel path: when both source and destination run UmkaOS, the send stream
bypasses userspace entirely — data moves directly from the source ZFS module through
RDMA to the destination ZFS module
- Zero-copy: send stream data is RDMA READ from source memory, written directly into
destination's transaction group
- Encryption: if the dataset uses ZFS native encryption, the stream is already encrypted
end-to-end. Otherwise, RDMA transport encryption (Section 5.4) protects data in
transit
Import/export compatibility — UmkaOS's ZFS implementation reads and writes the standard ZFS on-disk format (as defined by OpenZFS). Existing zpools created on Linux, FreeBSD, or illumos can be imported by UmkaOS without modification. Conversely, zpools created by UmkaOS can be exported and imported on any OpenZFS-compatible system.
ZFS-specific KABI extensions — ZFS uses the common filesystem KABI interface
(FileSystemOps, InodeOps, FileOps) defined in Section 14.1 for
standard POSIX operations. ZFS-specific administrative operations (dataset create/destroy,
snapshot management, zfs send/recv, pool scrub/resilver, ZFS_IOC_* ioctls) require
additional KABI definitions not yet specified. These ZFS-specific KABI extensions are
deferred to Phase 4+ implementation. The common filesystem KABI is sufficient for basic
ZFS functionality (mount, read, write, fsync, xattr, ACL). Dataset management operations
will be routed through the D-Bus bridge (Section 11.11) or ioctl passthrough
until dedicated KABI vtable extensions are defined.
15.11 NFS Client, SunRPC, and RPCSEC_GSS¶
NFS is UmkaOS's primary network filesystem. This section specifies the complete kernel-side stack:
- SunRPC transport layer: connection management, XDR encoding, RPC dispatch
- RPCSEC_GSS + Kerberos: Kerberos-authenticated NFS (NFSv4 + Kerberos = "krb5i/krb5p")
- NFSv4 client state machine: open/lock/delegation/lease
- Network filesystem cache (netfs layer): shared page cache for NFS, Ceph, and other network filesystems
15.11.1 SunRPC Transport Layer¶
SunRPC (RFC 5531) is the RPC framework underlying NFS, lockd, and the mount protocol.
RpcTransport trait — abstraction over TCP and UDP transports:
pub trait RpcTransport: Send + Sync {
    fn send_request(&self, req: &RpcMsg, xid: u32, timeout: Duration) -> Result<(), RpcError>;
    /// Boxed future rather than `impl Future` in return position: the
    /// trait must remain dyn-compatible because `XClnt` stores
    /// `Arc<dyn RpcTransport>`.
    fn recv_reply(&self, xid: u32) -> Pin<Box<dyn Future<Output = Result<RpcMsg, RpcError>> + Send + '_>>;
    fn close(&self);
    fn reconnect(&self) -> Result<(), RpcError>;
    fn max_payload_size(&self) -> usize;
}
15.11.1.1 RPC Error Taxonomy¶
The RpcError enum distinguishes transient from permanent failures, enabling
the retry logic in XClnt to make correct decisions without ambiguity.
/// RPC transport and protocol error taxonomy.
///
/// Variants are ordered by severity. The `XClnt` retry loop uses the
/// variant to decide:
/// - **Retry immediately**: `Timeout` (with exponential backoff up to
/// `XClnt.retries` attempts).
/// - **Reconnect, then retry**: `ConnReset` (calls `transport.reconnect()`
/// first; retries up to `XClnt.retries` after successful reconnect).
/// - **Refresh credentials, then retry**: `AuthFailed` (calls
/// `auth.refresh()` first; retry once).
/// - **Wait and retry**: `GracePeriod` (sleep for the server's grace
/// period duration, typically 90 seconds for NFSv4, then retry).
/// - **Permanent failure**: `ProgramMismatch`, `GarbageArgs`,
/// `SystemError` — return error to caller without retry.
pub enum RpcError {
/// RPC call timed out (server or network). Retryable with backoff.
/// Timeout duration is `XClnt.timeout`; the RPC layer retries up to
/// `XClnt.retries` times before surfacing the error.
Timeout,
/// TCP connection reset by peer or network failure. Requires
/// `transport.reconnect()` before retrying. If reconnect fails,
/// the error is surfaced to the caller as `EIO`.
ConnReset,
/// Authentication failed. For RPCSEC_GSS (Kerberos), this typically
/// means the ticket has expired. The RPC layer calls `auth.refresh()`
/// to obtain a new ticket and retries once. If refresh fails or the
/// retry also fails, the error is surfaced as `EACCES`.
AuthFailed,
/// Server does not support the requested RPC program or version.
/// Permanent failure — the client must negotiate a different version
/// or report the error to the caller.
ProgramMismatch {
/// The program version the client requested.
expected_ver: u32,
/// The highest version the server supports for this program.
server_ver: u32,
},
/// Server rejected the call arguments as malformed (RPC_GARBAGE_ARGS).
/// Permanent failure — indicates a serialization bug or protocol
/// mismatch. Surfaced as `EIO`.
GarbageArgs,
/// Server returned a system-level error (RPC_SYSTEM_ERR).
/// The `i32` is a POSIX errno from the server. Surfaced as-is
/// to the caller (mapped through the NFS error translation table
/// defined below).
SystemError(i32),
/// Server is in NFS grace period (NFSv4 `NFS4ERR_GRACE` or
/// `NFS4ERR_DELAY`). The client must wait and retry after the grace
/// period expires. The NFSv4 client state machine
/// ([Section 15.11](#nfs-client-sunrpc-and-rpcsecgss--nfsv4-client-state-machine))
/// handles grace period detection and automatic retry scheduling.
GracePeriod,
}
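The retry decision table from the doc comment above can be made explicit as a pure function. The sketch uses simplified local copies of the types (`RpcErrKind`, `Action` are illustrative names, not spec types):

```rust
// Sketch: the XClnt retry decision described in the RpcError doc
// comment, as a total mapping from error kind to recovery action.
#[derive(Debug, PartialEq)]
enum Action {
    RetryBackoff,   // Timeout: exponential backoff, up to XClnt.retries
    ReconnectRetry, // ConnReset: transport.reconnect() first
    RefreshRetry,   // AuthFailed: auth.refresh() first, retry once
    WaitGrace,      // GracePeriod: sleep out the server grace period
    Fail,           // permanent: surface to caller
}

#[derive(Debug)]
enum RpcErrKind {
    Timeout,
    ConnReset,
    AuthFailed,
    ProgramMismatch,
    GarbageArgs,
    SystemError,
    GracePeriod,
}

fn retry_action(e: &RpcErrKind) -> Action {
    match e {
        RpcErrKind::Timeout => Action::RetryBackoff,
        RpcErrKind::ConnReset => Action::ReconnectRetry,
        RpcErrKind::AuthFailed => Action::RefreshRetry,
        RpcErrKind::GracePeriod => Action::WaitGrace,
        RpcErrKind::ProgramMismatch
        | RpcErrKind::GarbageArgs
        | RpcErrKind::SystemError => Action::Fail,
    }
}
```

Keeping the mapping exhaustive (no `_` arm) means adding a new `RpcError` variant forces a conscious retry-policy decision at compile time.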
NFS error translation table — maps NFS protocol error codes to POSIX errno
values for the syscall return path. The NFS client applies this mapping when
translating an NFS reply status to a kernel Errno. NFSv3 uses nfsstat3
(RFC 1813 §2.6); NFSv4 uses nfsstat4 (RFC 7530 §13.1). The table covers
both:
| NFS Error | Value | Errno | Notes |
|---|---|---|---|
NFS3ERR_PERM / NFS4ERR_PERM |
1 | EPERM |
Operation not permitted |
NFS3ERR_NOENT / NFS4ERR_NOENT |
2 | ENOENT |
No such file or directory |
NFS3ERR_IO / NFS4ERR_IO |
5 | EIO |
I/O error |
NFS3ERR_NXIO / NFS4ERR_NXIO |
6 | ENXIO |
No such device or address |
NFS3ERR_ACCES / NFS4ERR_ACCESS |
13 | EACCES |
Permission denied |
NFS3ERR_EXIST / NFS4ERR_EXIST |
17 | EEXIST |
File exists |
NFS3ERR_XDEV / NFS4ERR_XDEV |
18 | EXDEV |
Cross-device link |
NFS3ERR_NODEV |
19 | ENODEV |
No such device |
NFS3ERR_NOTDIR / NFS4ERR_NOTDIR |
20 | ENOTDIR |
Not a directory |
NFS3ERR_ISDIR / NFS4ERR_ISDIR |
21 | EISDIR |
Is a directory |
NFS3ERR_INVAL / NFS4ERR_INVAL |
22 | EINVAL |
Invalid argument |
NFS3ERR_FBIG / NFS4ERR_FBIG |
27 | EFBIG |
File too large |
NFS3ERR_NOSPC / NFS4ERR_NOSPC |
28 | ENOSPC |
No space left on device |
NFS3ERR_ROFS / NFS4ERR_ROFS |
30 | EROFS |
Read-only filesystem |
NFS3ERR_MLINK |
31 | EMLINK |
Too many links |
NFS3ERR_NAMETOOLONG / NFS4ERR_NAMETOOLONG |
63 | ENAMETOOLONG |
Filename too long |
NFS3ERR_NOTEMPTY / NFS4ERR_NOTEMPTY |
66 | ENOTEMPTY |
Directory not empty |
NFS3ERR_DQUOT / NFS4ERR_DQUOT |
69 | EDQUOT |
Disk quota exceeded |
NFS3ERR_STALE / NFS4ERR_STALE |
70 | ESTALE |
Stale file handle |
NFS3ERR_BADHANDLE |
10001 | ESTALE |
Invalid NFS file handle (mapped to ESTALE per Linux nfs3_stat_to_errno) |
NFS3ERR_SERVERFAULT |
10006 | EIO |
Server internal error |
NFS4ERR_DENIED |
10010 | EAGAIN |
Lock denied (non-blocking) |
NFS4ERR_EXPIRED |
10011 | EIO |
Lease/delegation expired |
NFS4ERR_LOCKED |
10012 | EAGAIN |
File is locked |
NFS4ERR_GRACE |
10013 | EAGAIN |
Server in grace period (retry) |
NFS4ERR_DELAY |
10008 | EAGAIN |
Server busy (retry with backoff) |
NFS4ERR_WRONGSEC |
10016 | EPERM |
Wrong security flavor (EPERM triggers sec= fallback) |
NFS4ERR_MOVED |
10019 | EREMOTE |
Filesystem migrated |
| (unknown) | * | EIO |
Unmapped errors → EIO |
The mapping function is:
/// Translate an NFS protocol status code to a POSIX errno.
/// Handles both NFSv3 (nfsstat3) and NFSv4 (nfsstat4) error spaces.
///
/// This function is called only for errors NOT intercepted by the NFSv4
/// state recovery machine. `NFS4ERR_EXPIRED`, `NFS4ERR_STALE_CLIENTID`,
/// and similar lease/state errors are normally handled by the recovery
/// path before reaching this function. The `EIO` mapping applies only
/// when recovery itself has failed.
fn nfs_status_to_errno(status: i32) -> Errno {
match status {
0 => unreachable!("success should not reach error path"),
1 => Errno::EPERM,
2 => Errno::ENOENT,
5 => Errno::EIO,
6 => Errno::ENXIO,
13 => Errno::EACCES,
17 => Errno::EEXIST,
18 => Errno::EXDEV,
19 => Errno::ENODEV,
20 => Errno::ENOTDIR,
21 => Errno::EISDIR,
22 => Errno::EINVAL,
27 => Errno::EFBIG,
28 => Errno::ENOSPC,
30 => Errno::EROFS,
31 => Errno::EMLINK,
63 => Errno::ENAMETOOLONG,
66 => Errno::ENOTEMPTY,
69 => Errno::EDQUOT,
70 => Errno::ESTALE,
10001 => Errno::ESTALE,
10006 => Errno::EIO,
10008 | 10010 | 10012 | 10013 => Errno::EAGAIN,
10011 => Errno::EIO,
10016 => Errno::EPERM, // NFS4ERR_WRONGSEC: EPERM triggers sec= fallback
10019 => Errno::EREMOTE,
_ => Errno::EIO, // unmapped NFS errors → EIO
}
}
TCP transport — one persistent TCP connection per server per NFS client. Record marking (RFC 5531 §10): each RPC message prefixed with a 4-byte record mark (u32 with high bit set indicating last fragment, low 31 bits = fragment length). Multiple RPC messages may be pipelined on one TCP connection. Connection maintained as long as mounts are active; reconnect on ECONNRESET.
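The record-mark framing is small enough to show in full. This sketch encodes and decodes the 4-byte mark exactly as described (big-endian on the wire, high bit = last fragment, low 31 bits = length):

```rust
// Sketch: RFC 5531 §10 record marking for RPC-over-TCP.
const LAST_FRAGMENT: u32 = 0x8000_0000;

fn encode_record_mark(len: u32, last: bool) -> [u8; 4] {
    // Fragment length must fit in the low 31 bits.
    assert!(len < LAST_FRAGMENT, "fragment too large");
    let mark = if last { len | LAST_FRAGMENT } else { len };
    mark.to_be_bytes() // record marks are big-endian on the wire
}

fn decode_record_mark(bytes: [u8; 4]) -> (u32, bool) {
    let mark = u32::from_be_bytes(bytes);
    (mark & !LAST_FRAGMENT, mark & LAST_FRAGMENT != 0)
}
```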
Network namespace binding: RpcTransport captures the network namespace at
construction time: transport.net_ns = params.net_ns.clone(). All socket operations
use this captured namespace. XClnt holds the transport; the namespace is transitively
available via self.transport.net_ns.
XClnt (RPC client) struct:
pub struct XClnt {
pub server_addr: SockAddr,
/// Transport connections. Single-element for default (nconnect=1).
/// `nconnect=N` mount option creates N TCP connections to the server
/// for bandwidth aggregation. Each PendingRpc records which transport
/// index it was dispatched on (for reply routing). New RPCs are
/// dispatched round-robin weighted by `inflight_count` (least-loaded).
pub transports: ArrayVec<Arc<dyn RpcTransport>, 16>,
pub prog: u32, // RPC program number (NFS = 100003, mountd = 100005)
pub vers: u32, // Program version (NFSv4 = 4)
pub auth: Arc<dyn RpcAuth>,
// XID is a per-connection transaction correlation tag (RFC 5531).
// Wrapping is safe: stale XIDs are garbage-collected by RPC timeout
// (pending map entry removed after `timeout` elapses). The (client_addr,
// xid) tuple provides uniqueness; no 50-year longevity concern.
pub xid_counter: AtomicU32,
pub pending: XArray<PendingRpc>, // xid (u32) → waker; XArray internal lock replaces Mutex
pub timeout: Duration,
pub retries: u32,
}
pub struct PendingRpc {
pub xid: u32,
pub waker: Waker,
pub result: Option<Result<RpcMsg, RpcError>>,
/// Index into XClnt.transports — identifies which connection this RPC
/// was dispatched on, for reply routing.
pub transport_idx: u8,
}
XDR (External Data Representation) — RFC 4506. UmkaOS implements XDR as zero-copy where possible: XdrEncoder writes directly into a NetBuf chain; XdrDecoder reads from received NetBuf without copying. Fixed-size types (u32, u64, bool) are directly encoded; variable-length strings and arrays have a 4-byte length prefix followed by zero-padded data to a 4-byte boundary.
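The variable-length encoding rule (4-byte length prefix, data, zero padding to a 4-byte boundary) can be sketched without the zero-copy `NetBuf` machinery, using a plain `Vec<u8>` as the output buffer:

```rust
// Sketch: RFC 4506 variable-length opaque encoding.
// 4-byte big-endian length, payload, then zero padding so the
// total advances the stream by a multiple of 4 bytes.
fn xdr_encode_opaque(data: &[u8], out: &mut Vec<u8>) {
    out.extend_from_slice(&(data.len() as u32).to_be_bytes());
    out.extend_from_slice(data);
    let pad = (4 - data.len() % 4) % 4;
    for _ in 0..pad {
        out.push(0);
    }
}
```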
Async RPC dispatch — call_async(proc: u32, args: impl XdrEncode) -> impl Future<Output = Result<R, RpcError>>: builds RpcMsg { xid, call: RpcCall { rpc_version: 2, program, version, procedure, auth, verifier } }, encodes args via XDR, sends via transport, registers PendingRpc in the pending map, returns a future that resolves when the matching reply arrives. The reply receiver loop runs as a Tier 1 kernel task.
15.11.2 RPC Authentication (RpcAuth)¶
RpcAuth trait:
pub trait RpcAuth: Send + Sync {
fn auth_type(&self) -> RpcAuthFlavor;
fn marshal_cred(&self, encoder: &mut XdrEncoder) -> Result<()>;
fn verify_verf(&self, decoder: &mut XdrDecoder) -> Result<()>;
fn refresh(&self) -> Result<()>; // Re-fetch credentials if expired
}
Built-in auth flavors:
- AuthNone (flavor 0): null credentials. Used only for portmap/rpcbind.
- AuthUnix / AUTH_SYS (flavor 1): uid, gid, supplementary groups. Used for NFSv3, not secure.
Translate in-namespace UID/GID through user_ns.uid_map/gid_map before encoding on
wire (prevents container root from appearing as host root). The translated host-scope
UID/GID are placed into the XDR credential body; if no mapping exists for the caller's
in-namespace UID, the RPC fails with EOVERFLOW.
- RPCSEC_GSS (flavor 6): GSS-API based authentication. Described in Section 15.11.3.
15.11.3 RPCSEC_GSS and Kerberos¶
RPCSEC_GSS (RFC 2203) wraps any GSS-API mechanism. UmkaOS implements the Kerberos V5 mechanism (RFC 4121).
Service types (negotiated at mount time via sec= mount option):
- krb5: authentication only (integrity of RPC header)
- krb5i: authentication + integrity (checksum of entire RPC payload)
- krb5p: authentication + integrity + privacy (encryption of RPC payload)
GssContext struct — Per-server-per-credential GSS context. One context per
(client principal, server principal) pair, shared by all threads using the same
credentials on the same NFS server connection:
pub struct GssContext {
// --- Authentication state ---
/// GSS mechanism OID (1.2.840.113554.1.2.2 for Kerberos V5).
pub mech_oid: GssMechOid,
/// Opaque handle to the GSS security context (from gss_init_sec_context).
pub context_hdl: u64,
/// Negotiated service level: None / Integrity / Privacy.
pub service: GssService,
/// Monotonic sequence counter for anti-replay (see below).
/// Stored as u64 internally; truncated to u32 on the wire per RFC 2203.
/// At ~100K RPCs/sec, the u32 wire space wraps in ~12 hours. Before
/// wrap, the context must be re-established (Kerberos ticket renewal
/// typically triggers this well before wrap — default ticket lifetime
/// is 10 hours). The wire-wrap check examines the **low 32 bits**:
/// `if (seq_num.load(Relaxed) as u32) >= 0xFFFF_FF00 { force renewal }`.
/// Comparing the full u64 against 0xFFFF_FF00 would check ~4 billion
/// total RPCs, not the wire representation approaching wrap.
pub seq_num: AtomicU64,
/// AES-256 session key, zeroed on context destruction.
pub session_key: Zeroizing<[u8; 32]>,
/// User ID that established this context.
pub uid: UserId,
// --- Lifecycle state ---
/// Opaque GSS context token (from gss_init_sec_context). Variable length;
/// stored as a heap allocation updated atomically on renewal.
pub token: RwLock<Box<[u8]>>,
/// Absolute expiry time (nanoseconds since boot).
pub expiry_ns: AtomicU64,
/// Current lifecycle state (see `GssContextState` enum below).
pub state: AtomicU8, // GssContextState as u8
/// Number of RPCs currently in-flight using this context.
/// Grace-period teardown waits for this to reach zero before expiring.
pub in_flight: AtomicU32,
/// Upcall ID sent to gssd for renewal (0 = none pending).
pub renewal_upcall_id: AtomicU64,
}
RPCSEC_GSS credential exchange (happens automatically on first NFS connection):
1. Client sends RPCSEC_GSS_INIT call with a Kerberos AP_REQ (service ticket + authenticator) obtained from the kernel keyring (Section 10.2). The request_key("krb5", "nfs@server.example.com", NULL) lookup triggers gssd upcall if no ticket is cached.
2. Server responds with AP_REP (session key confirmation) and assigns a gss_proc_handle.
3. Subsequent RPCs carry the gss_proc_handle + sequence number + integrity/privacy checksum in the credential field.
RpcsecGssAuth struct — implements RpcAuth:
pub struct RpcsecGssAuth {
/// GSS context. All mutable fields in GssContext use interior mutability
/// (AtomicU64 for seq_num, Zeroizing for session_key). Context replacement
/// during renewal creates a new Arc<GssContext> and atomically swaps via
/// ArcSwap — eliminating the RwLock read-lock overhead (~15-20ns) on every RPC.
pub ctx: Arc<GssContext>,
pub handle: u32, // gss_proc_handle from server
pub service: GssService,
}
- marshal_cred(): writes RPCSEC_GSS credential with current seq_num.
- verify_verf(): checks server's GSS MIC (Message Integrity Code) over the reply XID.
- refresh(): if ctx.expiry < now, calls request_key() to fetch a new service ticket, re-runs RPCSEC_GSS_INIT.
Key retrieval integration with Section 10.2: Kerberos TGTs are cached as LogonKey entries in the kernel Key Retention Service. When refresh() needs a new service ticket, it calls request_key("krb5tgt", "REALM", NULL) to retrieve the cached TGT LogonKey, then calls request_key("krb5", "nfs@server.example.com", NULL) to obtain (or derive) a service ticket. If no TGT is present, the request_key upcall invokes userspace gssd, which performs the full Kerberos AS exchange, deposits the resulting TGT as a LogonKey, and provides the service ticket. This path requires Capability::SysAdmin only for initial keyring population; subsequent ticket requests use the session keyring of the process that triggered the mount.
Sequence number anti-replay: Each GssContext maintains a monotonic seq_num (AtomicU64). The server rejects any RPC with a sequence number more than 256 positions behind the current window (RFC 2203 §5.3.3). The client never reuses sequence numbers within a context lifetime.
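The server-side window check can be sketched as follows. Note this shows only the "too far behind" rejection from RFC 2203 §5.3.3; a real implementation additionally tracks a seen-bitmap within the window to reject exact duplicates:

```rust
// Sketch: RPCSEC_GSS anti-replay window check (RFC 2203 §5.3.3).
// `highest` is the largest sequence number the server has accepted.
const GSS_SEQ_WINDOW: u32 = 256;

fn seq_acceptable(highest: u32, incoming: u32) -> bool {
    if incoming > highest {
        true // advances the window
    } else {
        // within the window only if not 256 or more positions behind
        highest - incoming < GSS_SEQ_WINDOW
    }
}
```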
GSS Upcall Mechanism:
Kerberos authentication requires obtaining credentials from userspace (the gssd daemon), since the kernel cannot contact a KDC directly. UmkaOS uses an upcall mechanism:
Channel: A per-mount Unix domain socket (/run/umka/gss/{mount_id}) created when the NFS mount is established. The kernel writes requests and reads responses using a simple binary framing protocol.
Namespace scoping: The GSS upcall socket is created in the network namespace of the
NFS mount (i.e., NfsSessionParams::net_ns). The gssd daemon that answers upcalls must
be running in the same network namespace — a gssd in the host namespace cannot see
upcall sockets created inside a container's network namespace. This ensures that
container-scoped NFS mounts with Kerberos authentication use the container's own gssd
instance and Kerberos credential cache.
UID namespace awareness: RPCSEC_GSS credential mapping uses the user namespace of
the NFS mount for UID translation. When an NFS mount is established inside a user
namespace, the GssContext::uid field stores the host UID (translated via the mount's
user namespace uid_map). RPCs sent to the server carry the host UID in the GSS
credential, not the in-namespace UID. This ensures the NFS server sees consistent
identities regardless of the container's UID mapping.
Request format (GssUpcallRequest):
#[repr(C)]
pub struct GssUpcallRequest {
/// Protocol version (currently 1).
pub version: u32,
/// Unique upcall ID for matching responses to requests. Allocated from
/// a per-mount atomic counter. Required for concurrent upcall support
/// (up to 32 simultaneous upcalls).
pub upcall_id: u32,
/// Request type: 1=INIT_SEC_CONTEXT, 2=ACCEPT_SEC_CONTEXT, 3=GET_MIC, 4=VERIFY_MIC.
pub req_type: u32,
/// Client principal name (NUL-terminated, max 256 bytes).
pub client_principal: [u8; 256],
/// Target service name, e.g., "nfs@server.example.com" (NUL-terminated, max 256 bytes).
pub target: [u8; 256],
/// Input token length (0 for INIT, non-zero for mutual auth response).
pub input_token_len: u32,
/// Input token data (up to 65535 bytes; variable length follows this struct).
// (actual data follows at offset sizeof(GssUpcallRequest))
}
// Userspace boundary (upcall pipe): version(4)+upcall_id(4)+req_type(4)+client_principal(256)+target(256)+input_token_len(4) = 528 bytes.
// repr(C): all u32 fields naturally aligned, [u8;256] align 1. No padding.
const_assert!(core::mem::size_of::<GssUpcallRequest>() == 528);
Response format (GssUpcallResponse):
#[repr(C)]
pub struct GssUpcallResponse {
pub version: u32, // offset 0
/// Must match the `upcall_id` from the corresponding `GssUpcallRequest`.
pub upcall_id: u32, // offset 4
pub status: i32, // offset 8; 0 = success; negative = GSS error code
/// Explicit padding for u64 alignment of `context_id`. Zero-initialized
/// on construction to prevent kernel heap information disclosure.
pub _pad0: [u8; 4], // offset 12
/// GSS context handle (opaque; returned to kernel for subsequent calls).
pub context_id: u64, // offset 16
/// Output token length (for INIT_SEC_CONTEXT response token).
pub output_token_len: u32, // offset 24
/// Explicit trailing padding to struct alignment. Zero-initialized.
pub _pad1: [u8; 4], // offset 28
// output token data follows at offset sizeof(GssUpcallResponse)
}
// repr(C): u32(4)+u32(4)+i32(4)+pad0(4)+u64(8)+u32(4)+pad1(4) = 32 bytes.
const_assert!(core::mem::size_of::<GssUpcallResponse>() == 32);
Timeout: 30 seconds per upcall. If gssd does not respond within 30s:
- The kernel returns ETIMEDOUT to the NFS operation.
- The upcall socket is closed and re-created; a new connection attempt is made.
- After 3 consecutive timeouts, the mount is marked NFS_MOUNT_SECFLAVOUR_FORCE_NONE and falls back to AUTH_SYS (if configured) or returns EACCES permanently until the mount is remounted.
Concurrent upcalls: Multiple upcalls may be in flight simultaneously (one per in-progress authentication). Each upcall is tagged with a unique upcall_id: u32; responses match by upcall_id. A ring buffer of 32 concurrent upcalls is supported.
15.11.3.1 GSS Context Lifecycle and Proactive Renewal¶
Linux behavior (reference): Linux hard-fails all NFS RPCs with EKEYEXPIRED when
the GSS/Kerberos TGT or service ticket expires. The user sees I/O errors on NFS mounts
until they re-authenticate (kinit). This is a poor user experience for long-running
workloads.
UmkaOS improvement — proactive renewal + grace period:
UmkaOS's GSS context manager proactively renews credentials and provides a short grace period for in-flight RPCs, eliminating spurious I/O errors in well-managed environments.
/// Lifecycle state of a GSS security context.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GssContextState {
/// Context is valid and usable for RPC signing/encryption.
Valid,
/// Context expires within RENEWAL_LEAD_TIME_SEC (60 s); renewal upcall
/// has been sent to the gssd daemon. New RPCs may still use this context.
RenewPending,
/// Renewal failed or context has just expired; within the grace period
/// (GRACE_PERIOD_MS = 500 ms). In-flight RPCs are allowed to complete.
/// New RPCs are queued pending renewal or context replacement.
GracePeriod,
/// Grace period elapsed; all RPCs return EKEYEXPIRED until re-authentication.
Expired,
/// Context has been explicitly destroyed (session logout or server reset).
Destroyed,
}
// GssContext: defined above in Section 15.11.3 (the canonical definition
// includes both authentication state and lifecycle state fields).
/// Renewal timing constants.
/// Renewal is triggered this many seconds before expiry.
pub const GSS_RENEWAL_LEAD_TIME_SEC: u64 = 60;
/// After expiry, in-flight RPCs have this long to complete before the context
/// is torn down and new RPCs start returning EKEYEXPIRED.
pub const GSS_GRACE_PERIOD_MS: u64 = 500;
Renewal algorithm (runs in the `kthread/gss_renewer` background thread):
- Wake every 5 seconds (or when notified by an expiry timer).
- For each `GssContext` with `state == Valid`:
  - If `now_ns >= expiry_ns - GSS_RENEWAL_LEAD_TIME_SEC * NS_PER_SEC` or `(ctx.seq_num.load(Relaxed) as u32) >= 0xFFFF_FF00` (wire-wrap imminent):
    - Transition state to `RenewPending`.
    - Send upcall to gssd: `GssUpcallRequest { op: Renew, ... }`.
- If the renewal upcall succeeds (gssd responds within 30 s):
  - Update `token` and `expiry_ns` under `token.write()`.
  - Transition state back to `Valid`.
- If the renewal upcall fails or times out:
  - If `now_ns < expiry_ns`: retry after 10 s (transient failure).
  - If `now_ns >= expiry_ns`: transition to `GracePeriod`.
    - Start a 500 ms timer; on expiry: wait for `in_flight == 0`, then transition to `Expired`.
- New RPCs arriving while `state == GracePeriod` are queued (not failed); they proceed if renewal succeeds or fail with EKEYEXPIRED if the grace period expires.
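The two renewal triggers (expiry lead time and imminent wire wrap) reduce to a pure predicate, sketched below. Note the wrap check deliberately examines only the low 32 bits of the `u64` counter, matching the wire representation described for `GssContext::seq_num`:

```rust
// Sketch: the per-context renewal trigger evaluated by the renewer.
const GSS_RENEWAL_LEAD_TIME_SEC: u64 = 60;
const NS_PER_SEC: u64 = 1_000_000_000;
// Renew before the 32-bit wire sequence space wraps.
const SEQ_WRAP_THRESHOLD: u32 = 0xFFFF_FF00;

fn needs_renewal(now_ns: u64, expiry_ns: u64, seq_num: u64) -> bool {
    let near_expiry =
        now_ns >= expiry_ns.saturating_sub(GSS_RENEWAL_LEAD_TIME_SEC * NS_PER_SEC);
    // Low 32 bits only: this is what goes on the wire per RFC 2203.
    let wrap_imminent = (seq_num as u32) >= SEQ_WRAP_THRESHOLD;
    near_expiry || wrap_imminent
}
```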
15.11.4 NFSv4 Client State Machine¶
NFSv4 (RFC 7530 for v4.0, RFC 5661 for v4.1) is the primary NFS version. Key concepts:
- Leases: all NFSv4 state (open files, locks, delegations) is held under a time-limited lease. Client must renew its lease before it expires (default 90s) or all state is purged by the server.
- Client ID: a 64-bit clientid identifying the client, established via SETCLIENTID (v4.0) or EXCHANGE_ID (v4.1).
- Sessions (v4.1): connection-independent; RPCs can arrive on any TCP connection in the session. CREATE_SESSION establishes a session; SEQUENCE operation prefixes every compound.
- Compounds: NFSv4 operations are batched into compounds (multiple operations per RPC call). E.g., PUTFH + GETATTR in one RPC.
NfsClient struct:
pub struct NfsClient {
pub server_addr: SockAddr,
pub rpc_clnt: Arc<XClnt>,
pub clientid: AtomicU64, // NFSv4 client ID
pub verifier: [u8; 8], // Client verifier (random, per boot)
pub lease_time_s: u32, // Negotiated from server
pub lease_renewer: JoinHandle<()>, // Background task renewing the lease
/// Read-heavy session parameters (capabilities, addresses, negotiated
/// values). Read lock-free under RCU on the RPC dispatch hot path.
/// Updated only on session creation, renewal, or server migration (cold).
pub session_params: RcuCell<NfsSessionParams>,
/// Per-slot sequence counters for NFSv4.1 session slots.
/// Separated from `NfsSessionParams` because slot sequences are
/// shared mutable state (updated on every RPC via `fetch_add`) while
/// session parameters are read-mostly config (updated at renegotiation
/// ~90s). Session parameter renewal no longer forces reallocation of
/// 256 AtomicU32 counters. Updated independently: only when the server
/// changes the slot count (rare — typically at session creation only).
pub slot_table: RcuCell<Arc<SlotTable>>,
/// Write-heavy lease/recovery state. Acquired only on lease renewal,
/// state recovery, and session reset — all rare operations.
pub lease_state: SpinLock<NfsLeaseState>,
pub nfs_version: NfsVersion, // V4_0 or V4_1
// NFSv4.1 only:
pub session_id: Option<[u8; 16]>,
pub fore_channel: Option<SessionChannel>,
pub back_channel: Option<SessionChannel>,
}
/// Read-heavy session state — accessed under RCU read guard on every RPC
/// dispatch. No lock acquisition in the common path.
///
/// **Note**: Per-slot sequence counters are NOT in this struct. They live in
/// `SlotTable` (separate `RcuCell` on `NfsClient`). This separation ensures
/// that session parameter renegotiation (which replaces `NfsSessionParams`
/// via RCU) does not force reallocation of the 256 AtomicU32 slot counters.
pub struct NfsSessionParams {
/// Server-advertised capabilities (from EXCHANGE_ID / CREATE_SESSION).
pub server_caps: u64,
/// Negotiated maximum request/response sizes.
pub max_req_sz: u32,
pub max_resp_sz: u32,
/// Number of active fore-channel slots. Read from here for config;
/// `SlotTable::num_slots` is the authoritative count for RPC dispatch.
pub num_slots: u32,
/// Network namespace for all RPC socket creation. Set during mount from the
/// mounting process's network namespace. All TCP connections to the NFS server
/// use this namespace for routing, firewall rules, and address resolution.
/// Ensures container-scoped NFS mounts use the container's network stack.
pub net_ns: Arc<NetNamespace>,
}
/// Per-slot sequence counters for NFSv4.1 session slots.
///
/// Separated from `NfsSessionParams` to avoid RCU replacement churn.
/// Session parameters change on renegotiation (~90s); slot sequences
/// change on every RPC (`fetch_add`). Independent RcuCell updates mean
/// session renewal never forces reallocation of this table.
///
/// **RPC dispatch hot path**: `rcu_read_lock()` → read `slot_table`
/// RcuCell → `seqs[slot_idx].fetch_add(1, Relaxed)` → release RCU guard.
/// Cost: one RCU read + one Arc deref + one atomic increment.
pub struct SlotTable {
/// Sequence counters indexed by slot number. `Arc<[AtomicU32]>`
/// because the server may negotiate any number of slots (typically
/// 64-256). The Arc allows the array to outlive RCU replacement
/// (readers that hold a reference during RCU swap continue to use
/// the old table until they release).
pub seqs: Arc<[AtomicU32]>,
/// Number of active fore-channel slots. This is the authoritative
/// count for RPC dispatch (may differ from `NfsSessionParams::num_slots`
/// briefly during slot count renegotiation).
pub num_slots: u32,
}
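The RPC-dispatch hot path described in the `SlotTable` comments can be sketched in ordinary Rust. This is a simplified stand-in, not the kernel implementation: an `Arc` clone models the RCU read-side reference, and `RcuCell`, slot allocation, and slot-exhaustion waiting are elided.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

/// Simplified stand-in for the kernel `SlotTable` (no RcuCell here;
/// holding the Arc models the RCU read-side critical section).
struct SlotTable {
    seqs: Arc<[AtomicU32]>,
    num_slots: u32,
}

impl SlotTable {
    fn new(num_slots: u32) -> Self {
        let seqs: Arc<[AtomicU32]> =
            (0..num_slots).map(|_| AtomicU32::new(0)).collect();
        SlotTable { seqs, num_slots }
    }

    /// Hot path: bump the per-slot sequence and return the value to stamp
    /// into the SEQUENCE operation for this RPC. One atomic increment, no
    /// lock; Relaxed suffices because the RPC layer orders the value with
    /// the request it belongs to.
    fn next_seq(&self, slot_idx: u32) -> u32 {
        assert!(slot_idx < self.num_slots);
        self.seqs[slot_idx as usize].fetch_add(1, Ordering::Relaxed)
    }
}

fn main() {
    let table = SlotTable::new(4);
    assert_eq!(table.next_seq(0), 0);
    assert_eq!(table.next_seq(0), 1);
    // Slots advance independently: slot 3 is untouched by slot 0 traffic.
    assert_eq!(table.next_seq(3), 0);
    println!("per-slot sequences advance independently");
}
```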
NFS mounts inside containers: the `net_ns` field ensures NFS client TCP connections
use the container's network namespace. Set at mount time from
`current_task().nsproxy.net_ns`. Immutable after mount — namespace changes via
`setns()` do not affect existing NFS mounts. Remount (`MS_REMOUNT`) from a task
in a different network namespace returns `EINVAL` — NFS cannot switch network
context on a live mount because it would invalidate all TCP connections and state
IDs. A container with a private `net_ns` gets its own NFS connections (separate
TCP sockets, separate NFSv4 sessions, separate clientid). This means an NFS mount
established inside a container always routes through that container's network stack
(routing tables, firewall rules, DNS resolution), regardless of subsequent namespace
manipulation. If the container's network namespace is destroyed while the NFS mount
is still active, all RPC operations return `ENETUNREACH` and the mount enters the
recovery path (lease expiry → reclaim sequence).
/// Write-heavy lease and recovery state — protected by SpinLock, acquired
/// only during lease renewal, state recovery, and session reset.
pub struct NfsLeaseState {
/// Open owners: keyed by the 28-byte open_owner opaque identifier.
/// Bounded by MAX_NFS_OPEN_OWNERS (65536). Returns NFS4ERR_RESOURCE on overflow.
pub open_owners: BTreeMap<[u8; 28], Arc<NfsOpenState>>,
/// Current lease expiry (absolute time).
pub lease_expiry: Instant,
/// True while state recovery is in progress. AtomicBool allows the
/// RPC dispatch path to check `recovering.load(Relaxed)` without
/// acquiring the NfsLeaseState SpinLock — reduces contention during
/// recovery (the worst time for extra lock contention).
pub recovering: AtomicBool,
/// Delegation return queue (delegations pending DELEGRETURN).
pub pending_returns: ArrayVec<[u8; 16], 64>,
}
Open state machine — NfsOpenState per open file handle:
pub struct NfsOpenState {
pub open_stateid: [u8; 16], // 4-component stateid from server
pub seqid: u32, // Local sequence for state transitions
pub access: NfsOpenAccess, // Read / Write / Both
pub deny: NfsOpenDeny, // None / Read / Write / Both
pub delegation: Option<NfsDelegation>,
/// Per-file byte-range locks. Bounded by per-file lock limit
/// (MAX_NFS_LOCKS = 256, matching typical server-side limits).
/// Vec is acceptable: lock operations are warm-path (per-lock syscall),
/// and typical files have << 10 concurrent locks.
pub locks: Vec<NfsLockState>, // max MAX_NFS_LOCKS (256)
}
pub struct NfsDelegation {
pub stateid: [u8; 16],
pub type_: DelegationType, // Read or Write
pub recall_wq: WaitQueue, // Signaled when server sends CB_RECALL
}
Write delegation — when the server grants a write delegation, the client may write and cache locally without contacting the server for each operation. On recall (server sends CB_RECALL via the NFSv4 callback channel), the client must flush all dirty pages and send DELEGRETURN before the server can grant access to other clients. The callback channel (established in CREATE_SESSION for v4.1, or via SETCLIENTID for v4.0) is a reverse TCP connection: server connects to client. The back_channel in NfsClient tracks this connection.
Lease renewal — a background kernel task (running as a Tier 1 task) calls RENEW (v4.0) or sends a SEQUENCE-only compound (v4.1) every lease_time_s * 2 / 3 seconds (default: 60s for a 90s lease). The renewal check also examines the GSS context sequence number for wire-wrap proximity: if (ctx.seq_num.load(Relaxed) as u32) >= 0xFFFF_FF00 { force GSS context renewal }. On network partition: lease renewal fails; after lease_time_s the server purges all client state. Client must perform state recovery: sends SETCLIENTID / EXCHANGE_ID (to re-establish client identity), then CLAIM_PREVIOUS opens for each open file, and LOCK reclaims for each lock, concluding with RECLAIM_COMPLETE.
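The two checks the renewer performs are pure arithmetic. A minimal sketch using the constants quoted above (function names are illustrative):

```rust
/// Renewal interval: two-thirds of the negotiated lease period.
fn renewal_interval_s(lease_time_s: u32) -> u32 {
    lease_time_s * 2 / 3
}

/// GSS sequence-number wrap-proximity check, using the threshold from
/// the text (force context renewal within 256 RPCs of u32 wraparound).
fn gss_needs_renewal(seq_num: u32) -> bool {
    seq_num >= 0xFFFF_FF00
}

fn main() {
    // Default 90 s lease: renew every 60 s.
    assert_eq!(renewal_interval_s(90), 60);
    assert!(!gss_needs_renewal(1_000_000));
    assert!(gss_needs_renewal(0xFFFF_FF00));
}
```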
State recovery error paths:
- If the server returns NFS4ERR_STALE_CLIENTID during recovery, the client lost its lease entirely: all open-file state is gone, all in-progress writes that were not yet flushed are lost. The VFS layer returns EIO to all blocked file operations.
- If CLAIM_PREVIOUS returns NFS4ERR_RECLAIM_BAD, the server no longer has a record of the open: the file descriptor is invalidated, pending writes are dropped with EIO.
- Recovery is gated by a per-client recovering flag; new operations block (interruptibly if intr mount option is set) until recovery completes or fails.
Open file state machine — Each NfsOpenState (per open file handle) transitions
through the following states:
| State | Meaning |
|---|---|
| CLOSED | No open, no stateid |
| OPENING | OPEN RPC sent, awaiting server response |
| OPEN | File open, lease active, stateid valid |
| DELEGATED | Delegation granted (read or write) |
| RECALL_PENDING | Server sent CB_RECALL; grace period active (90s) |
| RETURNING | DELEGRETURN RPC sent, awaiting server acknowledgment |
| LEASE_EXPIRED | Lease timer fired; stateid may be invalid on server |
| RECLAIMING | Server restart detected; reclaim sequence in progress |
| RECLAIM_COMPLETE | All files reclaimed; resuming normal operation |
State transitions:
| From | Event | To | Action |
|---|---|---|---|
| CLOSED | open(2) called | OPENING | Send OPEN RPC |
| OPENING | OPEN response OK | OPEN | Store stateid; start lease timer |
| OPENING | OPEN response error | CLOSED | Return errno to caller |
| OPEN | Server grants delegation | DELEGATED | Store delegation stateid |
| OPEN | Lease timer fires | LEASE_EXPIRED | Attempt RENEW RPC |
| OPEN | NFS4ERR_STALE_CLIENTID | RECLAIMING | Re-establish client ID; pause I/O |
| DELEGATED | CB_RECALL received | RECALL_PENDING | Start 90s grace timer |
| RECALL_PENDING | DELEGRETURN sent | RETURNING | — |
| RECALL_PENDING | Grace timer expires | RETURNING | Force send DELEGRETURN |
| RETURNING | DELEGRETURN OK | OPEN | Delegation relinquished; normal I/O |
| LEASE_EXPIRED | RENEW RPC OK | OPEN | Lease refreshed |
| LEASE_EXPIRED | RENEW fails (NFS4ERR_EXPIRED) | RECLAIMING | Server evicted state; reclaim needed |
| RECLAIMING | All OPEN CLAIM_PREVIOUS sent | RECLAIM_COMPLETE | Send RECLAIM_COMPLETE RPC |
| RECLAIM_COMPLETE | RECLAIM_COMPLETE RPC OK | OPEN | Normal operation resumes |
| RECLAIMING | Grace period expires (60s) | CLOSED | All stateids invalidated; return EIO |
| Any | close(2) + CLOSE RPC OK | CLOSED | Stateid invalidated |
Lease renewal timer: fires at lease_time_s * 2 / 3 (default: 60s for a 90s lease).
Three consecutive renewal failures → LEASE_EXPIRED transition. Under NFSv4.1, renewal
is implicit via the SEQUENCE operation in every compound RPC.
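The transition table can be encoded as a partial function over (state, event) pairs, which is also how the crash-consistency test suite would exercise it. A sketch covering the table's rows (enum and event names here are illustrative, not the kernel's actual identifiers):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OpenSt {
    Closed, Opening, Open, Delegated, RecallPending,
    Returning, LeaseExpired, Reclaiming, ReclaimComplete,
}

#[derive(Debug, Clone, Copy)]
enum Ev {
    OpenCall, OpenOk, OpenErr, DelegGranted, LeaseTimer, StaleClientid,
    CbRecall, DelegreturnSent, DelegreturnOk, RenewOk, RenewExpired,
    ClaimsSent, ReclaimCompleteOk, GraceExpired, CloseOk,
}

/// One row per transition in the table above; `None` means the event is
/// invalid in the given state.
fn step(s: OpenSt, e: Ev) -> Option<OpenSt> {
    use Ev::*;
    use OpenSt::*;
    Some(match (s, e) {
        (_, CloseOk) => Closed, // "Any" row: close(2) + CLOSE RPC OK
        (Closed, OpenCall) => Opening,
        (Opening, OpenOk) => Open,
        (Opening, OpenErr) => Closed,
        (Open, DelegGranted) => Delegated,
        (Open, LeaseTimer) => LeaseExpired,
        (Open, StaleClientid) => Reclaiming,
        (Delegated, CbRecall) => RecallPending,
        (RecallPending, DelegreturnSent) | (RecallPending, GraceExpired) => Returning,
        (Returning, DelegreturnOk) => Open,
        (LeaseExpired, RenewOk) => Open,
        (LeaseExpired, RenewExpired) => Reclaiming,
        (Reclaiming, ClaimsSent) => ReclaimComplete,
        (Reclaiming, GraceExpired) => Closed,
        (ReclaimComplete, ReclaimCompleteOk) => Open,
        _ => return None,
    })
}

fn main() {
    // Delegation recall round trip: DELEGATED -> RECALL_PENDING -> RETURNING -> OPEN.
    let s = step(OpenSt::Delegated, Ev::CbRecall).unwrap();
    let s = step(s, Ev::DelegreturnSent).unwrap();
    assert_eq!(step(s, Ev::DelegreturnOk), Some(OpenSt::Open));
    // CB_RECALL is invalid without a delegation.
    assert_eq!(step(OpenSt::Open, Ev::CbRecall), None);
}
```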
RECLAIM phase (triggered by NFS4ERR_STALE_CLIENTID or server restart detection):
- Pause all pending I/O on this client (operations return `EINPROGRESS` internally). Note: `EINPROGRESS` is an NFS-internal status during RECLAIM — it is never returned to userspace `write(2)` callers. The VFS layer translates `EINPROGRESS` to `EIO` or retries transparently depending on the operation and mount flags.
- Send `SETCLIENTID` (v4.0) or `EXCHANGE_ID` (v4.1) to re-establish client identity.
- For each cached `NfsOpenState`: send `OPEN` with `CLAIM_PREVIOUS` to reclaim the open.
- For each cached `NfsLockState`: send `LOCK` with `reclaim = true`.
- Send `RECLAIM_COMPLETE`; resume paused I/O. If RECLAIM fails for a specific file (server returns `NFS4ERR_RECLAIM_BAD`): that file's state transitions to `CLOSED` and all pending operations on it return `ESTALE`.
15.11.5 netfs Page Cache Layer¶
The netfs layer provides a shared page cache infrastructure for network filesystems. UmkaOS implements it as the cache tier between NFS (and future Ceph/AFS) and the page allocator. It replaces ad-hoc per-filesystem readahead and writeback logic with a unified, testable implementation.
Core abstractions:
pub trait NetfsInode: Send + Sync {
/// Populate subrequests for a read covering [rreq.start, rreq.start + rreq.len).
fn init_read_request(&self, rreq: &mut NetfsReadRequest);
/// Issue a single subrequest to the server (or local cache).
fn issue_read(&self, subreq: &mut NetfsSubrequest);
/// Issue a write request to the server.
fn issue_write(&self, wreq: &mut NetfsWriteRequest);
/// Split a dirty range into write requests.
fn create_write_requests(&self, wreq: &mut NetfsWriteRequest, start: u64, len: u64);
}
pub struct NetfsReadRequest {
pub inode: Arc<dyn NetfsInode>,
pub start: u64, // Byte offset in file
pub len: usize,
pub subrequests: Vec<NetfsSubrequest>,
pub netfs_priv: u64, // Filesystem-private field
}
pub struct NetfsSubrequest {
pub rreq: Weak<NetfsReadRequest>,
pub start: u64,
pub len: usize,
pub source: NetfsSource, // Server, Cache, LocalXfer
pub state: AtomicU32, // Pending / InFlight / Completed / Failed
}
/// Write request covering a contiguous dirty range.
/// Created by `netfs_writeback()`, split into sub-ranges by
/// `create_write_requests()`, issued via `issue_write()`.
pub struct NetfsWriteRequest {
/// The inode being written to.
pub inode: Arc<dyn NetfsInode>,
/// Byte offset of the first dirty byte in the file.
pub offset: u64,
/// Total length of the dirty range in bytes.
pub len: u32,
/// Folios backing this write. Pinned for the duration of the write;
/// unpinned on completion. Fixed-size to avoid heap allocation in the
/// writeback path (maximum 16 folios = 64 KiB at 4 KiB pages, matching
/// the typical NFS `wsize`).
pub folios: ArrayVec<PageRef, 16>,
/// Write stability mode: Unstable (deferred COMMIT), FileSync (O_SYNC),
/// or DataSync (O_DSYNC).
pub stability: NetfsWriteStability,
/// Lifecycle state of this write request.
pub state: NetfsWriteState,
/// Filesystem-private field (e.g., NFS verifier for COMMIT correlation).
pub netfs_priv: u64,
}
/// Write request lifecycle states.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum NetfsWriteState {
/// Folios collected, not yet issued.
Pending,
/// RPC in flight to server.
InFlight,
/// Server acknowledged the write.
Complete,
/// Write failed; errno stored for propagation to `fsync()` callers.
Error(i32),
}
/// Write stability modes (matching NFSv4 stable_how).
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum NetfsWriteStability {
/// Deferred commit: data may be in server's volatile cache until
/// explicit `COMMIT` RPC (issued at `fsync()` time).
Unstable,
/// Write + fsync semantics: server flushes to stable storage before ACK.
FileSync,
/// Data-only sync: server flushes data (not metadata) before ACK.
DataSync,
}
Read path: On page fault or explicit read() hitting an NFS-backed folio not in the page cache, netfs_read_folio() creates a NetfsReadRequest, calls init_read_request() which the NFS implementation uses to split the range into subrequests (one per READ RPC, sized to rsize), issues them concurrently via async tasks, and waits for all subrequests to complete. If a local CacheFiles cache is configured, subsets of reads may be served from disk cache rather than issuing an RPC.
Write path: On writeback(), netfs_writeback() groups dirty folios into write requests sorted by file offset, calls create_write_requests() to split into WRITE RPC-sized chunks (sized to wsize), and issues them via issue_write(). Ordering within a single writeback is by offset to maximize sequential I/O on the server. NFSv4 WRITE with FILE_SYNC stability mode is used when O_SYNC is active; otherwise UNSTABLE writes are used followed by a COMMIT RPC at fsync() time.
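The stability-mode selection on the write path is a small pure function of the open flags. A sketch (the flag bit values below are illustrative placeholders, not the kernel's actual constants; as on Linux, `O_SYNC` is modeled as a superset of the `O_DSYNC` bit):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NetfsWriteStability { Unstable, FileSync, DataSync }

// Hypothetical flag bits for illustration only.
const O_DSYNC: u32 = 0x1000;
const O_SYNC: u32 = 0x10_1000; // includes the O_DSYNC bit, as on Linux

/// Pick the NFSv4 stable_how for a write request, per the policy above:
/// O_SYNC -> FILE_SYNC, O_DSYNC -> DATA_SYNC, else UNSTABLE (+ COMMIT at
/// fsync() time).
fn stability_for(open_flags: u32) -> NetfsWriteStability {
    if (open_flags & O_SYNC) == O_SYNC {
        NetfsWriteStability::FileSync
    } else if (open_flags & O_DSYNC) != 0 {
        NetfsWriteStability::DataSync
    } else {
        NetfsWriteStability::Unstable
    }
}

fn main() {
    assert_eq!(stability_for(0), NetfsWriteStability::Unstable);
    assert_eq!(stability_for(O_SYNC), NetfsWriteStability::FileSync);
    assert_eq!(stability_for(O_DSYNC), NetfsWriteStability::DataSync);
}
```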
Readahead: The NetfsReadaheadControl struct drives speculative prefetch. When sequential read access is detected (via pos tracking in the file's NetfsInode), the readahead window expands up to max_readahead pages (default: 128 pages = 512 KiB at 4 KiB page size, configurable via mount option readahead=N). Readahead requests are lower priority than demand reads and are cancelled if memory pressure rises.
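The window-expansion policy can be sketched as follows. The doubling growth rate is an assumption for illustration; the text specifies only that the window expands up to `max_readahead` pages:

```rust
/// Grow the readahead window on confirmed sequential access: double it,
/// capped at `max_readahead` pages (default 128 = 512 KiB at 4 KiB pages).
/// Doubling is an assumed policy, not a documented one.
fn grow_window(cur_pages: u32, max_readahead: u32) -> u32 {
    cur_pages.saturating_mul(2).min(max_readahead).max(1)
}

fn main() {
    let mut w = 4;
    let steps: Vec<u32> = (0..6).map(|_| { w = grow_window(w, 128); w }).collect();
    // Window doubles on each sequential hit, then saturates at the cap.
    assert_eq!(steps, vec![8, 16, 32, 64, 128, 128]);
}
```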
NFS dirty page backpressure: Network filesystems require writeback throttling beyond what local filesystems need, because NFS write RPCs can stall indefinitely when the server is unreachable. Without throttling, dirty pages accumulate in the page cache until memory pressure triggers the OOM killer — a catastrophic failure mode for NFS-heavy workloads.
/// Per-NFS-mount writeback throttle state. Embedded in the NFS superblock's
/// `NfsMountState` struct. Cooperates with the kernel's `balance_dirty_pages()`
/// infrastructure to throttle dirty page generation when NFS RPCs are stalled.
pub struct NfsWritebackThrottle {
/// Number of outstanding NFS WRITE RPCs for this mount.
/// Incremented when a WRITE RPC is dispatched; decremented on RPC
/// completion (success or failure). Read by the throttle check on
/// every balance_dirty_pages() callback.
pub outstanding_rpcs: AtomicU32,
/// Maximum outstanding WRITE RPCs before new writes begin blocking.
/// Default: 256. Configurable via mount option `nfs_max_writes=N`.
/// Upper bound: clamped to 4096 at mount time (prevents misconfiguration
/// from exhausting RPC slot table or consuming excessive memory for
/// in-flight buffers). At wsize=1048576 (1 MiB), 256 RPCs = 256 MiB
/// of in-flight dirty data, which is a reasonable default for most
/// NFS servers. The upper bound of 4096 caps at 4 GiB in-flight.
pub max_outstanding_rpcs: u32,
/// Set to true when the NFS transport detects server unreachable
/// (TCP connection reset without successful reconnect, or RPC timeout
/// exceeding 3 × timeo without response). When true, ALL new writes
/// block in balance_dirty_pages() until the flag is cleared (server
/// becomes reachable again and at least one RPC completes).
/// Relaxed ordering is acceptable: cache coherence guarantees propagation
/// within microseconds, and server congestion transitions are rare events
/// lasting seconds to minutes. The slight observation delay is irrelevant.
pub server_congested: AtomicBool,
/// Wait queue for writers blocked by `server_congested == true` on hard
/// mounts. When the SunRPC transport clears `server_congested` (after
/// successful reconnect + first RPC completion), it calls
/// `congestion_waitq.wake_up_all()` to unblock all waiting writers.
/// Without this, blocked writers would need to poll, wasting CPU.
pub congestion_waitq: WaitQueueHead,
}
Integration with balance_dirty_pages(): The kernel's writeback subsystem
calls balance_dirty_pages() on every write() path to enforce global and
per-BDI dirty page limits. NFS registers a BDI-specific dirty throttle callback
that adds NFS-aware checks:
- If `outstanding_rpcs.load(Relaxed) >= max_outstanding_rpcs`: the callback returns a throttle rate of zero, causing `balance_dirty_pages()` to block the writing process until RPCs drain below the threshold. This prevents unbounded dirty page accumulation.
- If `server_congested.load(Relaxed) == true`: the callback blocks unconditionally. No new dirty pages are generated for this mount until the server is reachable. On hard mounts, this blocks indefinitely (correct: the write will eventually complete when the server returns). On soft mounts, the RPC layer returns `EIO` after `retrans` timeouts, which clears the congestion and propagates the error to the writing process.
- Otherwise: the callback returns a proportional throttle rate based on `outstanding_rpcs / max_outstanding_rpcs`, smoothly reducing write rate as the RPC queue fills.
Congestion detection: The server_congested flag is set by the SunRPC
transport layer when ConnReset errors persist for longer than 3 × timeo
(default: 3 × 600 = 1800 deciseconds = 180 seconds). It is cleared when a
reconnect() succeeds and at least one subsequent RPC completes. This avoids
false congestion signals during brief network glitches (a single TCP reset
triggers reconnect, not congestion).
Soft-mount EIO and dirty page preservation: When a soft-mount NFS client
returns EIO due to server timeout: dirty pages are re-dirtied in the page
cache (NOT released). This preserves data — the application can retry after
server recovery. The server_congested flag prevents new writes from queueing.
When the server becomes reachable again, the writeback engine flushes the
re-dirtied pages normally. If the application calls fsync() during the outage,
it receives EIO. If the application does not call fsync(), the data is
silently re-flushed when the server recovers.
15.11.6 Mount Options and Integration¶
NFS mounts use the new mount API (fsopen("nfs4") + fsconfig() + fsmount(), as specified in Section 14.6):
fsconfig(fd, FSCONFIG_SET_STRING, "source", "server.example.com:/export")
fsconfig(fd, FSCONFIG_SET_STRING, "sec", "krb5p")
fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.1")
fsconfig(fd, FSCONFIG_SET_STRING, "rsize", "1048576")
fsconfig(fd, FSCONFIG_SET_STRING, "wsize", "1048576")
fsconfig(fd, FSCONFIG_SET_STRING, "timeo", "600") // 60 seconds (units: 1/10 s)
fsconfig(fd, FSCONFIG_SET_STRING, "retrans","2")
fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL) // Hard mount: retry indefinitely
Key mount options:
| Option | Values | Meaning |
|---|---|---|
| vers | 4.0, 4.1, 4.2 | NFSv4 minor version. When vers=4.2 is specified, the client negotiates NFSv4.2 (RFC 7862) with the server. v4.2 operations deferred to Phase 4: COPY (server-side copy), SEEK (hole/data), ALLOCATE/DEALLOCATE, CLONE. Until Phase 4, the client uses v4.1-equivalent fallbacks. |
| sec | sys, krb5, krb5i, krb5p | Security flavor |
| rsize | 4096–1048576 | Read buffer size (bytes); must be multiple of 4096 |
| wsize | 4096–1048576 | Write buffer size (bytes); must be multiple of 4096 |
| hard / soft | flag | Hard: retry indefinitely; soft: return error after retrans timeouts |
| intr | flag | Allow signals to interrupt hard-mount retries |
| timeo | integer (1/10 s) | Per-RPC timeout before retransmit |
| retrans | integer | Number of retransmits before soft-mount error |
| nconnect | 1–16 | Number of parallel TCP connections to the server |
| readahead | pages | Readahead window size (default 128) |
| ac / noac | flag | Attribute caching; noac disables client-side attribute cache |
| actimeo | seconds | Unified attribute cache timeout |
nconnect implementation: When nconnect=N is set, the XClnt maintains N TcpTransport instances. Each async RPC call is dispatched to the transport with the lowest in-flight queue depth (round-robin with depth tie-breaking). This spreads NFS traffic across multiple TCP flows, which improves throughput on high-bandwidth links where a single TCP flow is CPU- or window-limited. Connections in the RECONNECTING state are excluded from the dispatch pool. In-flight RPCs whose underlying TCP connection fails are re-dispatched to a surviving connection (selected by lowest queue depth). When the failed connection completes reconnection, it is re-added to the dispatch pool with queue depth 0.
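The dispatch rule (lowest in-flight queue depth, excluding transports mid-reconnect) can be sketched directly. Type and field names here are illustrative stand-ins for the `XClnt`/`TcpTransport` internals:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum XportState { Connected, Reconnecting }

/// Illustrative stand-in for one TcpTransport in the nconnect pool.
struct Transport {
    id: usize,
    state: XportState,
    in_flight: u32, // current queue depth
}

/// Pick the dispatch target: lowest in-flight depth among connected
/// transports. Transports in RECONNECTING are excluded from the pool.
/// Returns None if no transport is usable.
fn pick_transport(pool: &[Transport]) -> Option<usize> {
    pool.iter()
        .filter(|t| t.state == XportState::Connected)
        .min_by_key(|t| t.in_flight)
        .map(|t| t.id)
}

fn main() {
    let pool = vec![
        Transport { id: 0, state: XportState::Connected, in_flight: 7 },
        Transport { id: 1, state: XportState::Reconnecting, in_flight: 0 },
        Transport { id: 2, state: XportState::Connected, in_flight: 3 },
    ];
    // Transport 1 has the shortest queue but is reconnecting; pick 2.
    assert_eq!(pick_transport(&pool), Some(2));
}
```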
Capability requirements:
- Capability::SysAdmin: required to mount NFS (same as Linux). Enforced in nfs4_validate_mount_data() called from the fsconfig() implementation.
- Capability::NetAdmin: required to configure NFS server-side parameters (not client mounts).
- Rootless containers: NFS mounts inside a user namespace require that the filesystem server grants access to the mapped UID/GID range; the mount itself is permitted only if the user namespace has a mapping for UID 0 (i.e., is a privileged user namespace in context of the host).
sysfs interface — /sys/kernel/umka/nfs/:
- clients/: one directory per active NfsClient, containing:
- clientid: hex-encoded 64-bit client ID
- server: server address
- lease_time_s: negotiated lease period
- state: active / recovering / expired
- session_id (v4.1 only): hex-encoded 128-bit session ID
- servers/: per-server aggregate statistics:
- rtt_us: exponentially smoothed round-trip time (microseconds)
- retransmissions: total retransmitted RPCs since mount
- ops_per_sec: rolling 1-second average of completed RPCs
15.11.7 Locking: lockd and NFSv4 Built-in Locks¶
NFSv3 uses lockd (Network Lock Manager, NLM protocol, RFC 1813 appendix) for advisory file locking. NFSv4 has locking built into the compound protocol (LOCK / UNLOCK / LOCKT operations).
NfsLockState (NFSv4):
/// Sentinel value for byte-range locks extending to end of file.
/// Used in `NfsLockState.length` and NLM/NFSv4 LOCK operations.
/// Matches Linux NFS_LOCK_TO_EOF (0xFFFF_FFFF_FFFF_FFFF).
pub const NFS_LOCK_TO_EOF: u64 = u64::MAX;
pub struct NfsLockState {
pub stateid: [u8; 16],
pub type_: NfsLockType, // Read / Write
pub offset: u64,
pub length: u64, // NFS_LOCK_TO_EOF = to end of file
pub seqid: u32,
}
NFSv4 LOCK compound — SEQUENCE + PUTFH + LOCK { type_, reclaim, offset, length, locker: OpenToLockOwner { open_seqid, open_stateid, lock_seqid, lock_owner } }. On success returns lock_stateid used for subsequent LOCKU. On NFS4ERR_DENIED, returns the conflicting lock's owner, offset, and length so the caller can implement blocking via POSIX F_SETLKW semantics (client polls with exponential backoff up to timeo).
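The blocking-lock polling schedule can be sketched as follows. The base interval is a hypothetical choice; the text specifies only exponential backoff capped at `timeo`:

```rust
/// Poll intervals (deciseconds) for a blocked F_SETLKW after
/// NFS4ERR_DENIED: exponential backoff starting at `base_ds`, capped at
/// `timeo_ds`. The shift is clamped to avoid overflow on long waits.
fn backoff_schedule(base_ds: u64, timeo_ds: u64, attempts: usize) -> Vec<u64> {
    (0..attempts)
        .map(|i| (base_ds << i.min(16)).min(timeo_ds))
        .collect()
}

fn main() {
    // Hypothetical base of 10 ds (1 s) with timeo=600 ds (60 s):
    // doubles each retry, then saturates at timeo.
    assert_eq!(
        backoff_schedule(10, 600, 8),
        vec![10, 20, 40, 80, 160, 320, 600, 600]
    );
}
```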
lockd (NFSv3) — NLM protocol between kernel lockd threads. lockd starts automatically when the first NFSv3 mount is established (Capability::SysAdmin required). The NLM daemon:
1. Registers with portmap/rpcbind as program 100021 version 4.
2. Accepts NLM_LOCK, NLM_UNLOCK, NLM_TEST RPCs from clients (server role) and issues them to remote servers (client role).
3. Implements the grace period subsystem: after server reboot, accepts only NLM_LOCK with reclaim=true until all clients have re-claimed their locks or the grace period (default 45s) expires.
Interaction between NLM and the VFS lock layer: NLM calls vfs_lock_file() (which calls the filesystem's lock() inode operation) on behalf of remote clients. UmkaOS's lock layer tracks pending NLM locks in INode::nlm_locks: Vec<NlmLock>, serialized by the inode's lock_mutex. When a lock is granted to a remote client, the NlmLock entry records the remote host and lock owner opaque identifier so it can be released on client crash (detected via NSM — Network Status Monitor callbacks, registered via SM_NOTIFY).
15.11.8 Design Decisions¶
- NFSv4.1 as the default minor version: v4.1 sessions eliminate the need for the callback channel to traverse firewalls (server uses the established fore channel for callbacks in v4.1), simplify lease recovery (session semantics), and enable parallel slot usage. The client attempts v4.1 first and falls back to v4.0 only if the server rejects `EXCHANGE_ID`.
- RPCSEC_GSS in-kernel, not userspace: Keeping GSS context management in the kernel (with upcalls to `gssd` only for ticket acquisition) eliminates a round-trip to userspace per-RPC at `krb5i`/`krb5p` security levels. The integrity and privacy transforms (AES-256-CTS + HMAC-SHA-512/256 per RFC 8009) are performed in-kernel using the crypto subsystem.
- `nconnect` for throughput scaling: A single TCP connection is limited by the TCP window and per-CPU processing. Multiple connections allow the NFS client to drive higher server throughput without RDMA. This matches Linux behavior since kernel 5.3.
- Hard mounts as default: Soft mounts return `EIO` on transient network failures and can corrupt application data. Hard mounts block until the server is reachable again. Applications that need timeout behavior use `intr` + `SIGINT` handling or `O_NONBLOCK` at the VFS layer.
- Layered retry semantics: TCP retransmission and NFS RPC retry operate independently at different layers. TCP handles segment-level retransmission (typically 3 retransmits, ~60s total before connection reset). NFS RPC retry is above TCP: on a soft mount, the RPC layer returns `EIO` after `retrans` timeouts of `timeo` each (default for TCP: 2 × 600 = 1200 deciseconds = 120s total). On a hard mount, the RPC layer retries indefinitely, reconnecting the TCP transport if the connection drops. The total visible timeout is max(TCP retransmit window, NFS RPC timeout) — typically the NFS RPC layer dominates because it waits for the TCP connection to be re-established before retrying.
- netfs layer as shared infrastructure: Rather than NFS implementing its own readahead and writeback, the netfs layer provides a single tested implementation. Future addition of Ceph or AFS clients reuses the same infrastructure without duplicating logic.
- Zero-copy XDR via NetBuf chains: RPC payloads for large reads and writes avoid data copies by encoding directly into or decoding directly from the NetBuf chains used by the TCP transport (Section 12). The record-mark framing is prepended as a single 4-byte header NetBuf node; the data pages are appended as additional NetBuf nodes referencing page cache pages directly.
- Attribute caching (`ac` option): NFS attributes (size, mtime, ctime, nlinks) are cached for `actimeo` seconds (default: 3–60s, scaling with file size change frequency). `noac` disables caching entirely, providing close-to-open coherence at the cost of one `GETATTR` per VFS operation. The attribute cache is stored in the `NfsInode` overlaid on the `Inode` (as with all UmkaOS filesystem-specific inode data).
- NFS d_revalidate lock ordering: During ref-walk path resolution, NFS `d_revalidate` acquires `i_rwsem` shared on the parent inode before issuing a `GETATTR` RPC. Lock ordering: `mmap_lock` < `i_rwsem` < `socket_lock`. The RPC may block on network I/O; `i_rwsem` shared mode allows concurrent lookups on the same directory.
- Network namespace composition: Each NFS mount is bound to the network namespace active at mount time (`NfsSessionParams::net_ns`). The NFS client (`NfsClient`) holds an `Arc<NetNamespace>` reference captured from `current_task().nsproxy.net_ns` during `FileSystemOps::mount()`. All SunRPC connections for this mount are created within the captured network namespace: socket creation calls `sock_create_kern(net_ns, AF_INET/AF_INET6, SOCK_STREAM, 0)` with the stored `net_ns` reference, ensuring TCP connections use the namespace's routing table, firewall rules, and port space. When a container with its own network namespace mounts NFS, the mount's RPC connections are confined to the container's network stack. If the network namespace is destroyed while NFS mounts remain (container teardown without explicit unmount), all in-flight RPCs receive `ENETUNREACH` and the mount enters `NFS4ERR_STALE` recovery — the `hard` mount option blocks until the namespace is re-created or the admin force-unmounts with `umount -f`.
15.12 NFS Server (nfsd)¶
UmkaOS's NFS server (nfsd) enables exporting local filesystems to remote NFS clients over
NFSv3 (RFC 1813) and NFSv4.1 (RFC 5661). The server runs as a pool of kernel threads that
service SunRPC requests arriving on UDP and TCP port 2049. Configuration is via
/proc/fs/nfsd/ and the exportfs(8) utility, which parses /etc/exports and writes
export records into the kernel. NFSv4.1 is the default negotiated minor version; NFSv4.0
and NFSv3 clients are accepted by capability negotiation at connection time. The NFS server
integrates with:
- Section 13 (VFS) for all filesystem operations (lookup, read, write, getattr, setattr, readdir, lock, fsync).
- Section 15.11 (NFS Client) for the shared SunRPC transport and RPCSEC_GSS machinery (the same `RpcTransport` infrastructure is used in both client and server roles).
- Section 8 (Security) for Kerberos GSS context establishment and UID/GID credential validation.
15.12.1 Overview¶
The NFS server is structured into four layers:
1. Transport: `svc_recv()` — per-thread blocking receive over the shared `RpcSocket`.
2. Dispatch: `svc_dispatch()` — demultiplex by RPC program / version / procedure.
3. NFS handlers: per-procedure functions that validate export permissions, decode XDR arguments, call into the VFS, and encode XDR replies.
4. Stable state: the NFSv4 state machine (clients, sessions, opens, locks, delegations) and the stable-storage journal for crash recovery.
The Duplicate Request Cache (DRC) sits between layers 2 and 3 to suppress re-execution of non-idempotent operations on retransmitted requests.
15.12.2 VFS ExportOps Interface¶
The NFS server requires filesystems to implement ExportOperations to allow stable file
handles — handles that survive server restart and that the server can use to reconstruct a
dentry from opaque bytes alone, without a mounted path hierarchy.
/// Implemented by filesystems that support being NFS-exported.
///
/// Stable file handles survive server restarts. The server must be able to
/// reconstruct a `Dentry` from the opaque handle bytes alone. Filesystems
/// that do not implement this trait cannot be NFS-exported; attempting to do
/// so returns `EINVAL`.
///
/// # Safety invariant
/// `encode_fh` and `fh_to_dentry` must be inverses: for any inode `i`,
/// `fh_to_dentry(sb, buf, ty)` where `(buf, ty) = encode_fh(i, buf, None)`
/// must return a dentry pointing to the same inode.
pub trait ExportOperations: Send + Sync {
/// Encode `inode` (and optionally its `parent`) into `fh`.
///
/// Returns the handle-type byte stored in the on-wire NFS file handle.
/// Typical implementations encode `(ino, generation)` for `parent = None`
/// and `(ino, generation, parent_ino, parent_generation)` when a parent
/// is supplied.
fn encode_fh(
&self,
inode: &Inode,
fh: &mut [u8; 128],
parent: Option<&Inode>,
) -> u8;
/// Reconstruct a dentry from a file handle.
///
/// Called on every NFS operation that arrives with a file handle. The
/// implementation must locate the inode (by inode number + generation or
/// by UUID) and return an instantiated dentry. Returns `ESTALE` if the
/// inode no longer exists.
fn fh_to_dentry(
&self,
sb: &SuperBlock,
fh: &[u8],
fh_type: u8,
) -> Result<Arc<Dentry>, KernelError>;
/// Reconstruct the parent dentry from a file handle that contains parent
/// information (i.e., was encoded with `parent = Some(...)`).
///
/// Returns `ESTALE` if the parent inode no longer exists.
fn fh_to_parent(
&self,
sb: &SuperBlock,
fh: &[u8],
fh_type: u8,
) -> Result<Arc<Dentry>, KernelError>;
/// Return the filename of `child` within `parent`.
///
/// Used during NFSv4 `READDIR` to build parent-relative paths for
/// directory entries. Returns `ENOENT` if `child` is not in `parent`.
///
/// Returns `ArrayString<256>` (stack-allocated, no heap) because
/// filenames are bounded by `NAME_MAX` (255 bytes) on all supported
/// filesystems. This avoids a `String` heap allocation on a path
/// that can be called frequently during NFS stale-handle recovery
/// and LOOKUPP operations.
fn get_name(
&self,
parent: &Dentry,
child: &Dentry,
) -> Result<ArrayString<256>, KernelError>;
/// Return the parent dentry of `child`.
///
/// Used to walk upward toward the export root when the client traverses
/// beyond the export boundary. Returns `EXDEV` if `child` is already the
/// filesystem root.
fn get_parent(&self, child: &Dentry) -> Result<Arc<Dentry>, KernelError>;
}
Standard UmkaOS-supported filesystems (ext4, XFS, Btrfs, tmpfs) implement ExportOperations
using (inode_number, generation_number) as the file handle payload. The generation number
is incremented each time an inode number is reused, ensuring handles from before a delete
are correctly rejected as ESTALE rather than silently aliasing a new file.
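The (inode_number, generation_number) payload can be sketched as follows. This is a minimal illustration assuming a little-endian wire layout; the type-byte constants and function shapes are illustrative stand-ins for the `ExportOperations` methods, not taken from the spec.

```rust
// Illustrative handle-type bytes (not from the spec).
const FH_TYPE_INO_GEN: u8 = 1;
const FH_TYPE_INO_GEN_PARENT: u8 = 2;

/// Encode (ino, generation) — and optionally the parent pair — into the
/// 128-byte handle buffer. Returns the handle-type byte.
fn encode_fh(ino: u64, gen: u32, parent: Option<(u64, u32)>, fh: &mut [u8; 128]) -> u8 {
    fh[0..8].copy_from_slice(&ino.to_le_bytes());
    fh[8..12].copy_from_slice(&gen.to_le_bytes());
    match parent {
        None => FH_TYPE_INO_GEN,
        Some((pino, pgen)) => {
            fh[12..20].copy_from_slice(&pino.to_le_bytes());
            fh[20..24].copy_from_slice(&pgen.to_le_bytes());
            FH_TYPE_INO_GEN_PARENT
        }
    }
}

/// Inverse of `encode_fh` for the primary (ino, generation) pair.
/// Returns `None` for unknown handle types or short buffers; a real
/// implementation would map these to EINVAL/ESTALE.
fn decode_fh(fh: &[u8], fh_type: u8) -> Option<(u64, u32)> {
    if fh_type != FH_TYPE_INO_GEN && fh_type != FH_TYPE_INO_GEN_PARENT {
        return None;
    }
    if fh.len() < 12 {
        return None;
    }
    let ino = u64::from_le_bytes(fh[0..8].try_into().ok()?);
    let gen = u32::from_le_bytes(fh[8..12].try_into().ok()?);
    Some((ino, gen))
}
```

The generation check happens after decode: if the inode found at `ino` carries a different generation, the handle predates a delete/reuse and the server returns ESTALE.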
15.12.3 Exports Database¶
The exports table maps (host_pattern, local_path) to ExportOptions. It is loaded at
server startup and updated by exportfs -a writing binary records to
/proc/fs/nfsd/exports.
/// One row in the NFS exports table.
pub struct NfsExport {
/// Root dentry of the exported directory tree.
pub path: Arc<Dentry>,
/// Unique filesystem-ID for this export, embedded in NFSv3 `fsstat` and
/// NFSv4 `fs_locations`. Auto-assigned from `sb.dev` unless overridden
/// by `fsid=` option.
pub fsid: u64,
/// Host specifier: single IP (`192.168.1.5`), CIDR subnet
/// (`10.0.0.0/24`), DNS name (`host.example.com`), NIS netgroup
/// (`@cluster`), or wildcard (`*`).
pub client: NfsClientSpec,
/// Parsed export options.
pub options: ExportOptions,
/// Effective UID for unauthenticated or squashed access (default 65534,
/// the traditional `nfsnobody` UID).
pub anon_uid: u32,
/// Effective GID for unauthenticated or squashed access (default 65534).
pub anon_gid: u32,
}
/// Parsed export options from `/etc/exports`.
pub struct ExportOptions {
/// Allow write access. Default: `false` (read-only).
pub rw: bool,
/// Require that every `WRITE` is committed to stable storage before the
/// RPC reply is sent (`sync` option). Default: `true` (`sync`).
///
/// **Rationale**: Linux changed the default from `async` to `sync` in
/// kernel 2.6.33 (2010). The `async` default caused silent data loss
/// on server crash: the server acknowledged writes that had not yet
/// reached stable storage, and NFS clients (which trust the server's
/// response) discarded their cached copies. This violated the POSIX
/// `write()` durability contract that applications depend on. UmkaOS
/// follows the modern `sync` default. Administrators who accept the
/// data-loss risk for performance can explicitly set `async` in
/// `/etc/exports`.
pub sync: bool,
/// Map UID 0 to `anon_uid`. Default: `true`.
pub root_squash: bool,
/// Map all UIDs to `anon_uid`. Default: `false`.
pub all_squash: bool,
/// Verify that file handles refer to a file within the exported subtree
/// (not just the exported filesystem). Incurs a full path walk per
/// request. Default: `false` (disabled since Linux 2.6.x; the
/// performance cost is rarely worth the security benefit on modern
/// systems).
pub subtree_check: bool,
/// Accepted security flavors, in preference order. Default: `[Sys]`.
pub sec: ArrayVec<NfsSec, 4>,
/// Explicit `fsid=` override. Supersedes the auto-assigned value.
pub fsid: Option<u64>,
/// Automatically re-export submounts visible under this path. Default:
/// `false`.
pub crossmnt: bool,
/// Do not hide submounts from clients; clients must traverse them
/// explicitly via a separate mount. Default: `false`.
pub nohide: bool,
/// Skip AUTH_NLM authentication for NFSv3 lock requests. Default:
/// `false`.
pub no_auth_nlm: bool,
/// Require that the export is only activated when this path is an active
/// mountpoint (the `mp=` option). `None` = no requirement.
/// Bounded by PATH_MAX (4096 bytes).
pub mp: Option<String>,
}
/// Security flavor accepted on this export.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum NfsSec {
/// AUTH_SYS (UID/GID in RPC credential, no authentication).
Sys,
/// RPCSEC_GSS Kerberos 5: authentication only.
Krb5,
/// RPCSEC_GSS Kerberos 5: authentication + integrity.
Krb5i,
/// RPCSEC_GSS Kerberos 5: authentication + integrity + privacy.
Krb5p,
}
/// Global export table. RCU-protected hash table keyed on (path_hash, client_addr).
/// Hot-path lookup is O(1) with no lock acquisition. Writer path (exportfs -a)
/// holds a Mutex and publishes via RCU grace period.
pub struct NfsExportTable {
/// RCU-protected hash map: (path_hash, client_addr) → NfsExport.
/// Non-integer composite key; RcuHashMap is the correct collection
/// per collection usage policy.
pub entries: RcuHashMap<(u64, IpAddr), NfsExport>,
/// Writer lock for export add/remove operations.
pub write_lock: Mutex<()>,
}
On each NFS request the server calls export_table.lookup(dentry, peer_addr) — an O(1)
RCU read with no lock acquisition on the hot path. Updates (from exportfs -a writing
/proc/fs/nfsd/exports) take the writer lock, rebuild the affected bucket, and publish
via an RCU grace period.
15.12.4 Server Threads¶
/// The nfsd thread pool. One pool per NUMA node (optional; by default a
/// single pool is used for all CPUs).
pub struct NfsdPool {
/// Active kernel threads servicing RPC requests.
/// Bounded: max `NFSD_MAX_THREADS` (8192) per pool.
/// Collection policy: warm-path allocation (thread count changes are
/// admin operations via `/proc/fs/nfsd/threads`). Vec with documented
/// bound is acceptable per collection policy.
/// Memory budget at max: 8192 threads x ~48 KB (16 KB stack + 32 KB
/// request/reply buffers) = ~384 MB. Typical production: 32-512
/// threads (~1.5-24 MB). Probe-time validation rejects thread counts
/// exceeding `NFSD_MAX_THREADS`.
pub threads: Vec<KernelThread>,
/// Current configured thread count. Writable via
/// `/proc/fs/nfsd/threads`. Default: 8. Typical production: 32–512.
/// Hard upper bound: `NFSD_MAX_THREADS` (8192).
pub count: AtomicU32,
/// Shared RPC transport abstraction for port 2049. Despite the singular
/// name, `RpcSocket` internally multiplexes both UDP and TCP listeners
/// (and optionally RDMA). The singular name follows Linux's `svc_serv`
/// convention. See `RpcSocket` for multi-transport details.
pub socket: Arc<RpcSocket>,
/// Duplicate request cache shared across all threads in this pool.
pub drc: Arc<DuplicateRequestCache>,
/// Per-pool statistics (requests received, dispatched, dropped).
pub stats: NfsdPoolStats,
}
Thread lifecycle:
1. `rpc.nfsd(8)` opens `/proc/fs/nfsd/threads` and writes the desired thread count.
2. The kernel spawns that many `nfsd/<n>` kernel threads.
3. Each thread loops: `svc_recv(socket)` → `svc_authenticate(req)` → `svc_dispatch(req)` → `svc_send(reply)`.
4. `svc_recv()` blocks in `poll()`/`epoll_wait()` on the shared socket; threads compete for incoming requests (one request per wakeup).
5. Each thread owns a private 16 KB request buffer and a private 16 KB reply buffer. These buffers are stack-allocated within the kernel thread's stack; no per-request heap allocation is required in the common case.
6. Writing `0` to `/proc/fs/nfsd/threads` shuts down all threads after draining in-flight requests.
Because nfsd threads are kernel threads (not user processes), each VFS call from a
thread executes directly in kernel context with the caller's effective credential set — no
context switch to user space is required between RPC dispatch and filesystem operation.
15.12.5 Duplicate Request Cache (DRC)¶
The DRC prevents non-idempotent operations from being re-executed on retransmitted
requests. It is mandatory for correctness: a client that retransmits CREATE foo after
a network timeout would otherwise create foo a second time if the first succeeded.
Non-idempotent procedures covered: SETATTR, WRITE, CREATE, MKDIR, SYMLINK,
MKNOD, REMOVE, RMDIR, RENAME, LINK (NFSv3); OPEN, CLOSE, SETATTR,
WRITE, CREATE, REMOVE, RENAME, LINK, LOCK, LOCKU (NFSv4 — note: NFSv4.1
sessions provide their own exactly-once semantics via slot + sequence IDs, so the DRC is
used only for NFSv3 and NFSv4.0 in UmkaOS).
/// Number of DRC shards. Each RPC locks only its shard — 64x reduction
/// in contention compared to a single global lock.
pub const DRC_SHARD_COUNT: usize = 64;
/// Duplicate request cache: sharded by `hash(client_addr, xid)`.
/// Each shard has its own SpinLock and LRU cache, so concurrent RPCs
/// targeting different shards proceed without contention.
pub struct DuplicateRequestCache {
shards: [SpinLock<DrcShard>; DRC_SHARD_COUNT],
}
/// One shard of the DRC. Per-shard capacity =
/// `(1024 * nfsd_thread_count) / DRC_SHARD_COUNT`.
pub struct DrcShard {
entries: LruCache<DrcKey, DrcEntry>,
max_entries: usize,
}
/// Cache key: uniquely identifies one RPC call from one client.
#[derive(Hash, Eq, PartialEq, Clone)]
pub struct DrcKey {
/// IPv4 or IPv6 address of the originating client.
pub client_addr: IpAddr,
/// RPC transaction ID (XID) from the call header.
pub xid: u32,
}
/// Maximum NFS RPC reply size cached in the DRC (8 KiB covers all NFSv3/v4
/// non-idempotent replies including GETATTR post-op attributes).
const NFS_DRC_MAX_REPLY: usize = 8192;
/// Cached reply for a completed non-idempotent operation.
pub struct DrcEntry {
/// Serialized XDR reply bytes, ready to retransmit.
/// Bounded by `NFS_DRC_MAX_REPLY` — replies exceeding this are not cached
/// (the operation is re-executed on replay, which is safe because the
/// DRC only caches non-idempotent ops that already committed).
pub reply: ArrayVec<u8, NFS_DRC_MAX_REPLY>,
/// Wall-clock time the entry was inserted (for eviction policy).
pub timestamp: Instant,
/// Adler-32 of the full request body. Used to detect the degenerate case
/// where two different requests happen to share the same XID — in that
/// case the cached reply is discarded and the new request is executed.
pub checksum: u32,
}
Request processing for non-idempotent procedures:
1. Compute `DrcKey { client_addr, xid }` and `checksum = adler32(request_body)`.
2. Select shard: `shard_idx = hash(client_addr, xid) % DRC_SHARD_COUNT`.
3. Lock `shards[shard_idx]` (SpinLock) and look up the key.
4. Hit, checksum matches: return `entry.reply` directly; skip VFS execution.
5. Hit, checksum mismatch: evict the stale entry; release the shard lock; proceed to execute (a new request collided with an old XID).
6. Miss: release the shard lock, execute the VFS operation, re-lock the shard, insert `DrcEntry { reply, timestamp, checksum }`, release the shard lock, send the reply.
7. Entries are evicted LRU when per-shard capacity is exceeded, or after 120 seconds (hard TTL).
15.12.6 NFSv3 Protocol Dispatch¶
NFSv3 (RPC program 100003, version 3, RFC 1813) uses a stateless request/reply model.
All NFS file handles are opaque blobs of up to 64 bytes. The server reconstructs a dentry
from the file handle on every request via ExportOperations::fh_to_dentry().
| Procedure | Handler | Idempotent |
|---|---|---|
| NULL (0) | nfsd3_null() | yes |
| GETATTR (1) | nfsd3_getattr() | yes |
| SETATTR (2) | nfsd3_setattr() | no |
| LOOKUP (3) | nfsd3_lookup() | yes |
| ACCESS (4) | nfsd3_access() | yes |
| READLINK (5) | nfsd3_readlink() | yes |
| READ (6) | nfsd3_read() | yes |
| WRITE (7) | nfsd3_write() | no |
| CREATE (8) | nfsd3_create() | no |
| MKDIR (9) | nfsd3_mkdir() | no |
| SYMLINK (10) | nfsd3_symlink() | no |
| MKNOD (11) | nfsd3_mknod() | no |
| REMOVE (12) | nfsd3_remove() | no |
| RMDIR (13) | nfsd3_rmdir() | no |
| RENAME (14) | nfsd3_rename() | no |
| LINK (15) | nfsd3_link() | no |
| READDIR (16) | nfsd3_readdir() | yes |
| READDIRPLUS (17) | nfsd3_readdirplus() | yes |
| FSSTAT (18) | nfsd3_fsstat() | yes |
| FSINFO (19) | nfsd3_fsinfo() | yes |
| PATHCONF (20) | nfsd3_pathconf() | yes |
| COMMIT (21) | nfsd3_commit() | yes |
WRITE stability semantics: NFSv3 WRITE carries a stable_how field:
- `FILE_SYNC`: data and metadata must be written to stable storage before the reply. Implemented by calling `vfs_write()` followed by `vfs_fsync(file, 0, len, 1)`.
- `DATA_SYNC`: data must reach stable storage; the metadata update may be deferred. Implemented by `vfs_write()` + `vfs_fdatasync()`.
- `UNSTABLE`: data is written to the page cache only (no fsync). The server returns the current `write_verifier` (a 64-bit value, initialized to `ktime_get_boot_ns()` at server start and exposed via `/proc/fs/nfsd/write_verifier`). The client must issue a `COMMIT` RPC before treating `UNSTABLE` writes as durable.
COMMIT: nfsd3_commit() calls vfs_fsync_range(file, offset, offset + count - 1, 0)
and returns the write_verifier. If the verifier has changed since the client last
received it (indicating a server restart), the client must re-issue all UNSTABLE writes.
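The stability dispatch and the client-side verifier check can be sketched as follows. The enum and function names are illustrative stand-ins for the actual `vfs_fsync`/`vfs_fdatasync` paths, not spec identifiers.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StableHow {
    FileSync,
    DataSync,
    Unstable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyncAction {
    /// FILE_SYNC: fsync data + metadata before replying.
    FsyncDataAndMetadata,
    /// DATA_SYNC: fdatasync data only; metadata may be deferred.
    FdatasyncDataOnly,
    /// UNSTABLE: reply immediately with the current write_verifier.
    PageCacheOnly,
}

fn action_for(how: StableHow) -> SyncAction {
    match how {
        StableHow::FileSync => SyncAction::FsyncDataAndMetadata,
        StableHow::DataSync => SyncAction::FdatasyncDataOnly,
        StableHow::Unstable => SyncAction::PageCacheOnly,
    }
}

/// After COMMIT, the client compares the verifier it saw at WRITE time with
/// the verifier returned by COMMIT. A mismatch means the server restarted
/// and all UNSTABLE writes must be re-issued.
fn must_reissue_unstable(write_time_verifier: u64, commit_verifier: u64) -> bool {
    write_time_verifier != commit_verifier
}
```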
READDIRPLUS: returns both directory entry names and their attributes in a single RPC,
amortizing the per-entry GETATTR round trips. Implemented by iterating vfs_iterate_dir()
and calling vfs_getattr() on each child inode, packing results into a single XDR reply
up to the maxcount limit supplied by the client.
15.12.7 NFSv4.1 Compound Dispatch¶
NFSv4.1 (RPC program 100003, version 4, RFC 5661) replaces the per-procedure dispatch
model with a compound RPC: a single RPC carries a sequence of operations processed
left-to-right. If an operation fails with any status other than NFS4_OK, the server
stops processing and returns partial results — only the first failed operation's status
is returned along with the results of all preceding successful operations.
SEQUENCE must be the first operation in every compound (except BIND_CONN_TO_SESSION
and EXCHANGE_ID). It provides session ID, slot ID, sequence ID, and cache-this flag.
The server's slot table enforces exactly-once semantics: slot i may not carry a new
request until the previous request on slot i has been replied to. This replaces the
NFSv3/v4.0 DRC with a per-session, per-slot mechanism.
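The per-slot exactly-once rule can be sketched as follows. This is a simplified model of the SEQUENCE check under the assumption that each slot records the last seen sequence ID and an optional cached reply; the names and the wrapping behavior are illustrative, not lifted from the spec.

```rust
struct Slot {
    seq_id: u32,
    cached_reply: Option<Vec<u8>>,
}

enum SeqResult {
    /// seq_id advanced by one: a new request; accept it on this slot.
    NewRequest,
    /// Same seq_id as last time: retransmission; replay the cached reply.
    Replay(Vec<u8>),
    /// Anything else: misordered or false retry (NFS4ERR_SEQ_MISORDERED
    /// / NFS4ERR_SEQ_FALSE_RETRY analog).
    Misordered,
}

fn sequence_check(slot: &mut Slot, seq_id: u32) -> SeqResult {
    if seq_id == slot.seq_id.wrapping_add(1) {
        // Next request on this slot: accept, advance, drop the old reply.
        slot.seq_id = seq_id;
        slot.cached_reply = None;
        SeqResult::NewRequest
    } else if seq_id == slot.seq_id {
        match &slot.cached_reply {
            Some(r) => SeqResult::Replay(r.clone()),
            None => SeqResult::Misordered,
        }
    } else {
        SeqResult::Misordered
    }
}
```

After the server dispatches the compound and sends the reply, it stores the serialized reply into `cached_reply` for the slot (when the client set the cache-this flag), which is what makes the replay branch possible.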
Key operations and their VFS mappings:
| NFSv4.1 Operation | VFS call | Notes |
|---|---|---|
| EXCHANGE_ID | — | Client registration; returns clientid + capabilities |
| CREATE_SESSION | — | Establishes fore/back channels; negotiates slot counts and max RPC sizes |
| DESTROY_SESSION | — | Tears down session; releases slot table |
| DESTROY_CLIENTID | — | Releases all state for a clientid |
| SEQUENCE | — | Slot/sequence enforcement; lease renewal |
| PUTROOTFH | VFS root dentry | Sets current FH to the export root |
| PUTFH | fh_to_dentry() | Sets current FH from wire handle |
| GETFH | — | Returns current FH to client |
| SAVEFH / RESTOREFH | — | Push/pop FH onto per-compound stack |
| LOOKUP | path_lookup() | Walks one path component |
| LOOKUPP | path_lookup("..") | Walks to parent directory |
| OPEN | vfs_open() | Returns stateid + open flags |
| CLOSE | vfs_release() | Releases open stateid |
| READ | vfs_read() | Returns data + EOF flag |
| WRITE | vfs_write() | Returns bytes written + stability |
| COMMIT | vfs_fsync_range() | Flushes unstable writes |
| GETATTR | vfs_getattr() | Returns requested attribute bitmask |
| SETATTR | vfs_setattr() | Sets attributes; stateid required for size truncation |
| CREATE | vfs_mkdir() / vfs_symlink() / vfs_mknod() | Non-regular files only (regular files via OPEN) |
| REMOVE | vfs_unlink() / vfs_rmdir() | Inferred from inode type |
| RENAME | vfs_rename() | Atomic cross-directory rename |
| LINK | vfs_link() | Hard link |
| READDIR | vfs_iterate_dir() | Returns entries with requested attributes |
| READLINK | vfs_readlink() | Returns symlink target |
| LOCK | vfs_lock_file() | Byte-range lock; returns lock stateid |
| LOCKT | vfs_lock_file(F_GETLK) | Test for conflicting lock |
| LOCKU | vfs_lock_file(F_UNLCK) | Release byte-range lock |
| DELEGRETURN | — | Client returns a read or write delegation |
| LAYOUTGET | pNFS metadata | pNFS layout (optional; Tier 1 storage backends only) |
| LAYOUTRETURN | pNFS metadata | Client returns layout |
15.12.7.1 pNFS Data Server Interface¶
pNFS (parallel NFS, RFC 5661 Section 12 and RFC 8435) distributes file data across multiple data servers (DSes) while the metadata server (MDS) handles namespace operations and layout leases. The following trait must be implemented by any Tier 1 block driver that wishes to serve as a pNFS data server.
/// pNFS data server operations. A pNFS layout divides file data across one or more
/// data servers (DSes); the metadata server (MDS) provides layout leases.
/// Each data server implements this trait to provide layout-specific I/O.
///
/// Layouts defined by RFC 5661 (NFS 4.1): FILE, BLOCK, OBJECT, FLEX_FILE (RFC 8435).
/// UmkaOS implements FILE layout (direct NFS I/O to data servers) and FLEX_FILE layout
/// (mirrors/striping with per-DS error tolerance).
pub trait PnfsDataServer: Send + Sync {
/// Unique server identifier (IP:port or RDMA endpoint address).
fn server_addr(&self) -> &PnfsServerAddr;
/// Read `len` bytes from the data server at file offset `file_offset` into `buf`.
/// Uses the layout credential from `layout_stateid`.
///
/// Returns `Ok(bytes_read)` or an error. On `PNFS_NO_LAYOUT` error, the caller
/// must fall back to the metadata server (MDS) for I/O.
fn read(
&self,
layout_stateid: &LayoutStateId,
file_offset: u64,
len: u32,
buf: &mut [u8],
) -> Result<u32, PnfsError>;
/// Write `data` to the data server at file offset `file_offset`.
/// `stable` indicates whether stable (synchronous) or unstable write is requested.
///
/// Unstable writes are buffered in the data server; a subsequent `commit()`
/// flushes them to stable storage. Stable writes are immediately persistent.
fn write(
&self,
layout_stateid: &LayoutStateId,
file_offset: u64,
data: &[u8],
stable: WriteStability,
) -> Result<WriteResponse, PnfsError>;
/// Flush unstable writes to stable storage on the data server.
/// Returns the write verifier that can be compared with previous unstable writes.
fn commit(
&self,
layout_stateid: &LayoutStateId,
file_offset: u64,
count: u64,
) -> Result<WriteVerifier, PnfsError>;
/// Return the data server's capabilities (supported layout types, max I/O size).
fn capabilities(&self) -> PnfsDataServerCaps;
/// Called when the layout is recalled by the MDS or invalidated. The data server
/// must flush all pending writes and return the layout.
fn layout_recall(&self, layout_stateid: &LayoutStateId, recall_type: RecallType);
}
/// Write stability mode for pNFS data server writes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriteStability {
/// Data written to client-side buffer only (DATA_SYNC on server side).
Unstable,
/// Data written to server stable storage before reply (FILE_SYNC).
Stable,
}
/// Response from a pNFS data server write operation.
pub struct WriteResponse {
/// Number of bytes actually written.
pub count: u32,
/// Write stability achieved (may be higher than requested).
pub stability: WriteStability,
/// Write verifier: random value chosen by server at startup.
/// If verifier changes between write and commit, a server restart occurred
/// and uncommitted writes are lost (client must retry).
pub verifier: WriteVerifier,
}
/// Capabilities of a pNFS data server. Returned by `PnfsDataServer::capabilities()`,
/// which is implemented by Tier 1 drivers communicating across the KABI ring
/// boundary. `#[repr(C)]` is required for stable layout across compilation units.
#[repr(C)]
pub struct PnfsDataServerCaps {
/// Maximum I/O size for a single read or write RPC.
pub max_rw_size: u32,
/// Supported layout types (FILE, BLOCK, OBJECT, FLEX_FILE).
pub layout_types: u32,
/// 1 if RDMA transport is available for this data server, 0 otherwise.
/// u8 (not bool) avoids the bool validity invariant across KABI boundary.
pub rdma_available: u8,
/// Explicit padding to 4-byte alignment.
pub _pad: [u8; 3],
}
const_assert!(core::mem::size_of::<PnfsDataServerCaps>() == 12);
/// Opaque pNFS layout stateid (per RFC 5661 §14.5.2).
pub type LayoutStateId = [u8; 16];
/// pNFS write verifier (per RFC 5661 §17.3): 8-byte opaque value.
pub type WriteVerifier = [u8; 8];
/// Opaque server network address.
pub struct PnfsServerAddr { pub addr: [u8; 48], pub len: u8, pub _pad: [u8; 7] }
/// Errors specific to pNFS data server operations.
#[derive(Debug)]
pub enum PnfsError {
/// The layout stateid is no longer valid (server recalled or expired it).
/// Client must fetch a new layout from the MDS.
NoLayout,
/// Data server is temporarily unavailable. Client may retry or fall back to MDS.
Unavailable,
/// I/O error on the data server.
Io(KernelError),
/// Layout type not supported by this data server.
UnsupportedLayout,
}
15.12.8 NFSv4 State Management¶
NFSv4 introduces stateful file access. The server tracks client IDs, sessions, open owners, lock owners, and delegations. All state has an associated lease; state from clients whose leases expire is reclaimed by the server.
/// All per-client NFSv4 state. Protected by `NfsdStateTable::client_lock`.
pub struct NfsdClientState {
/// 64-bit client ID assigned at `EXCHANGE_ID`. Unique for the server's
/// lifetime.
pub clientid: u64,
/// 8-byte verifier supplied by the client at `EXCHANGE_ID`. Used to
/// detect client restarts (same IP, new verifier → client rebooted).
pub verifier: [u8; 8],
/// Confirmed IP address of the client (from the TCP connection that
/// issued `CREATE_SESSION`).
pub client_addr: IpAddr,
/// RPCSEC_GSS principal name if the client authenticated with Kerberos.
/// `None` for AUTH_SYS clients. Maximum 256 bytes (matching
/// `GssUpcallRequest.client_principal` size); EXCHANGE_ID rejects
/// overlong principals with `NFS4ERR_BADXDR`. Cold-path allocation
/// (one per client session).
pub principal: Option<String>,
/// Active sessions (fore + back channel pairs).
/// NFSv4.1 clients typically maintain 1-4 sessions; 16 is a generous upper bound.
pub sessions: ArrayVec<Arc<NfsdSession>, 16>,
/// Open owners: keyed by the 28-byte `open_owner` opaque identifier.
/// Bounded by MAX_NFS_OPEN_OWNERS (65536). Returns NFS4ERR_RESOURCE on overflow.
/// BTreeMap is correct per collection policy: [u8; 28] is a non-integer ordered
/// key. Ordered iteration is useful for crash recovery (deterministic state
/// replay). Matches Linux's rb-tree for state_owner lookup. At N=65536 with
/// 28-byte keys, BTreeMap lookup is ~16 comparisons — fast with BTreeMap's
/// cache-friendly node layout.
pub open_owners: BTreeMap<[u8; 28], Arc<OpenOwner>>,
/// Lock owners: keyed by the 28-byte `lock_owner` opaque identifier.
/// Bounded by MAX_NFS_LOCK_OWNERS (65536). Returns NFS4ERR_RESOURCE on overflow.
/// Same BTreeMap justification as `open_owners` above.
pub lock_owners: BTreeMap<[u8; 28], Arc<LockOwner>>,
/// Read and write delegations currently granted to this client.
/// Bounded by the server's per-client delegation limit
/// (`NFSD_MAX_DELEGATIONS_PER_CLIENT`, default 1024). Uses Vec instead of
/// ArrayVec<_, 1024> to avoid 8 KiB inline allocation per client — most
/// clients hold 0-10 delegations.
///
/// **Enforcement**: Checked in `nfsd_grant_delegation()` before inserting.
/// If `self.delegations.len() >= NFSD_MAX_DELEGATIONS_PER_CLIENT`, the
/// server declines the delegation (returns the OPEN response without a
/// delegation stateid). This is a UmkaOS improvement over Linux, which
/// enforces only a global delegation limit and allows a single client to
/// consume all global slots. The per-client limit prevents delegation
/// starvation across clients.
pub delegations: Vec<Arc<Delegation>>,
/// Absolute time at which this client's lease expires if not renewed.
/// Renewed on every `SEQUENCE` from this client.
pub lease_expiry: Instant,
}
/// Maximum fore-channel slots per session (RFC 8881 recommends up to 256).
const NFSD_MAX_SLOTS: usize = 256;
/// An NFSv4.1 session (one `CREATE_SESSION` creates one session).
pub struct NfsdSession {
pub session_id: [u8; 16],
/// Fore channel: client → server request slots.
/// Slot count negotiated at `CREATE_SESSION` time (max 256 per RFC 8881).
pub fore_slots: ArrayVec<NfsdSlot, NFSD_MAX_SLOTS>,
/// Back channel: server → client callback slots.
pub back_channel: Option<RpcBackChannel>,
/// Maximum request size negotiated at `CREATE_SESSION` (bytes).
pub max_req_sz: u32,
/// Maximum response size negotiated at `CREATE_SESSION` (bytes).
pub max_resp_sz: u32,
}
/// One slot in a session's fore channel.
pub struct NfsdSlot {
pub seq_id: u32,
/// Cached reply for the last compound on this slot (for replay detection).
/// Bounded by `NFS_DRC_MAX_REPLY` (same limit as DRC entries).
/// Heap-allocated to avoid ~8 KiB inline per slot — with 256 slots per
/// session and potentially thousands of sessions, inline would be ~2 MB/session.
pub cached_reply: Option<Box<[u8]>>,
pub in_use: AtomicBool,
}
/// An open-owner and the associated open stateid.
pub struct OpenOwner {
/// Current stateid (seqid increments on each OPEN/CLOSE/OPEN_DOWNGRADE).
pub stateid: StateId,
/// The opened file's dentry.
pub file: Arc<Dentry>,
/// Share access bits granted to this open (read, write, or both).
pub access: OpenAccess,
/// Share deny bits this open holds (deny read, deny write, or neither).
pub deny: OpenDeny,
/// Reference count: number of times the client has opened this
/// (owner, file) pair without a corresponding CLOSE.
pub open_count: u32,
}
/// An NFSv4 stateid: identifies one open, lock, or delegation instance.
pub struct StateId {
/// Sequence number, incremented on each state transition.
pub seqid: u32,
/// 12 opaque bytes unique within the server's lifetime.
pub other: [u8; 12],
}
/// A delegation granted to a client.
pub struct Delegation {
pub stateid: StateId,
pub dtype: DelegationType, // Read or Write
pub file: Arc<Dentry>,
pub client: u64, // clientid
/// Time at which a pending recall (CB_RECALL) was sent. `None` if no
/// recall is in progress.
pub recall_sent: Option<Instant>,
}
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum DelegationType {
/// Read delegation: client may cache reads without contacting server.
Read,
/// Write delegation: client has exclusive write access; all writes are
/// cached locally and flushed on DELEGRETURN or recall.
Write,
}
Lease renewal: Each NfsdClientState has a lease_expiry deadline. Any SEQUENCE
operation from the client resets the deadline to now + nfsd_lease_time (default: 90
seconds). The lease reaper task runs every 10 seconds and reclaims state for clients
whose lease_expiry is in the past: all OpenOwner entries are closed, byte-range
locks are released via vfs_lock_file(F_UNLCK), and delegations are revoked.
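The lease bookkeeping can be sketched as follows, under the default `nfsd_lease_time` of 90 seconds; the struct and function names are illustrative, and the real reaper also closes open owners, releases locks, and revokes delegations for each reclaimed client.

```rust
use std::time::{Duration, Instant};

// Default nfsd_lease_time from the text above.
const NFSD_LEASE_TIME: Duration = Duration::from_secs(90);

struct Lease {
    expiry: Instant,
}

/// Any SEQUENCE from the client pushes the deadline forward.
fn renew_on_sequence(lease: &mut Lease, now: Instant) {
    lease.expiry = now + NFSD_LEASE_TIME;
}

/// Reaper pass: drop every lease whose deadline has passed.
/// Returns the number of clients reclaimed.
fn reap_expired(leases: &mut Vec<Lease>, now: Instant) -> usize {
    let before = leases.len();
    leases.retain(|l| l.expiry > now);
    before - leases.len()
}
```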
Grace period: After nfsd starts (or restarts), the server enters a grace period
of nfsd_gracetime seconds (default: 90 seconds, equal to the lease time). During the
grace period, the server:
- Accepts OPEN with claim_type = CLAIM_PREVIOUS (state reclaim) from clients that
held opens or delegations before the restart.
- Rejects new OPEN with claim_type = CLAIM_NULL with NFS4ERR_GRACE.
- Reads the stable-storage journal (Section 15.12.10) to learn which clients had state
before the restart, populating the set of expected reclaimants.
Once the grace period expires (or all expected reclaimants have completed reclaim, whichever is first), the server transitions to normal operation.
Delegations and recalls: The server grants a Read delegation when a file is opened
for read and there are no write opens or write delegations outstanding. It grants a
Write delegation when a file is opened for write and there is exactly one open (the
requesting client's) and no conflicting opens or delegations. When a conflicting open
arrives for a delegated file, the server issues CB_RECALL on the back channel to the
delegating client and waits nfsd_lease_time / 2 seconds for DELEGRETURN before
forcibly revoking the delegation with NFS4ERR_DELEG_REVOKED.
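The grant decision described above can be sketched as a pure function over the file's current open/delegation state. The parameter and enum names are illustrative; the real server derives these counts from per-file state and folds in the per-client `NFSD_MAX_DELEGATIONS_PER_CLIENT` check from `nfsd_grant_delegation()`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Grant {
    Read,
    Write,
    NoDelegation,
}

fn delegation_for_open(
    open_for_write: bool,
    other_read_opens: u32,  // read opens besides the requesting one
    other_write_opens: u32, // write opens besides the requesting one
    write_delegations: u32, // outstanding write delegations on the file
    client_at_limit: bool,  // per-client delegation limit reached
) -> Grant {
    if client_at_limit {
        // Decline: OPEN succeeds but carries no delegation stateid.
        return Grant::NoDelegation;
    }
    if open_for_write {
        // Write delegation only if the requester's open is the sole open
        // and nothing conflicts.
        if other_read_opens == 0 && other_write_opens == 0 && write_delegations == 0 {
            Grant::Write
        } else {
            Grant::NoDelegation
        }
    } else {
        // Read delegation only if no write opens or write delegations exist.
        if other_write_opens == 0 && write_delegations == 0 {
            Grant::Read
        } else {
            Grant::NoDelegation
        }
    }
}
```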
Dirty page limit during server unreachability: When the NFS server is
unreachable, dirty pages accumulate in the client's page cache. The
accumulation is bounded by the cgroup's memory.max limit and the global
dirty_ratio / dirty_bytes thresholds. If memory pressure triggers
writeback to an unreachable server, the writeback thread blocks for up to
nfs_timeout seconds (default: 60s) before reporting EIO. Delegations are
returned on lease expiry (default: 90s) regardless of server reachability.
15.12.9 Authentication and Security¶
AUTH_SYS (auth_flavor = AUTH_UNIX): the RPC credential carries a plaintext UID,
GID, and supplementary GID list. When running inside a container (nfsd in a
non-init user namespace), the server translates incoming AUTH_SYS wire UIDs
through the container's user_ns.uid_map before constructing filesystem
credentials. This prevents a containerized NFS server from accessing files as
the host UID. Outside a container (init user namespace), the server uses the
wire credentials directly as the effective credential for VFS calls. No
cryptographic authentication is performed. AUTH_SYS is a
legacy mechanism for trusted private-network deployments only — credentials are
trivially forgeable by any host on the network segment. Use sec=krb5p (authentication
+ integrity + privacy) for production deployments, or at minimum sec=krb5i
(authentication + integrity). AUTH_SYS should be restricted to legacy appliances or
isolated lab networks where deploying a Kerberos KDC is not feasible. Rejected on
exports that specify sec=krb5 or stronger.
RPCSEC_GSS / Kerberos 5 (RFC 2203 + RFC 7861): three protection levels:
- `krb5`: authentication only. The RPC call header contains a GSS `MIC` token covering the XID and procedure number; the server verifies the MIC using the session key. The payload is transmitted in the clear.
- `krb5i`: authentication + integrity. The entire RPC body (arguments + results) is covered by a GSS `MIC` token. The payload is transmitted in the clear, but any tampering is detected.
- `krb5p`: authentication + integrity + privacy. The entire RPC body is wrapped with GSS `Wrap` (encrypt-then-MAC). The payload is opaque to network observers.
In all three cases the cryptographic transforms use AES-256-CTS-HMAC-SHA512-256 (enctypes
aes256-cts-hmac-sha512-256, RFC 8009) when negotiated with a Kerberos 5 KDC that
supports it, falling back to aes128-cts-hmac-sha256-128 (RFC 8009) or
aes256-cts-hmac-sha1-96 (RFC 3962) for older KDCs.
GSS context establishment flow:
1. The client sends `RPCSEC_GSS_INIT` with a `GSS_Init_sec_context` token (a Kerberos AP-REQ encapsulated in GSS-API).
2. The server calls `rpc_gss_svc_accept_sec_context()`, which makes a synchronous upcall to `gssd` via a kernel–user pipe. `gssd` calls `gss_accept_sec_context()` with the host's keytab (`/etc/krb5.keytab`) and returns the derived session key and client principal to the kernel.
3. The kernel stores the session key in `GssContext::session_key` (protected by a `Mutex<>`; the key is zeroed on context expiry via `Drop`). Subsequent RPCs perform `MIC`/`Wrap`/`Unwrap` in-kernel using the UmkaOS crypto subsystem (Section 8).
4. The `svcgssd` daemon (an alternative to `gssd`) is also supported; the upcall interface is identical.
UID mapping: applied after credential extraction, before any VFS call:
- root_squash (default on): UID 0 → anon_uid (65534), GID 0 → anon_gid (65534).
- all_squash: all UIDs/GIDs → anon_uid/anon_gid.
- Neither option: credentials passed through unchanged.
UID mapping is applied per-export, so the same file can be accessed with different effective credentials by clients matched to different export rows.
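The per-export squash mapping reduces to a small pure function. This sketch mirrors the defaults in `ExportOptions` (`root_squash = true`, `all_squash = false`, anon ID 65534); the function name is illustrative.

```rust
/// Map a wire UID or GID through the export's squash options.
/// Applied after credential extraction, before any VFS call.
fn squash_id(wire_id: u32, root_squash: bool, all_squash: bool, anon_id: u32) -> u32 {
    if all_squash {
        // all_squash: every identity collapses to the anonymous identity.
        anon_id
    } else if root_squash && wire_id == 0 {
        // root_squash: only UID/GID 0 is remapped.
        anon_id
    } else {
        // Neither option: credentials pass through unchanged.
        wire_id
    }
}
```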
15.12.10 /proc/fs/nfsd Interface¶
The /proc/fs/nfsd/ pseudo-filesystem is the control plane for the NFS server. It is
mounted at boot when the nfsd kernel module is loaded (or when the first export is
created, if nfsd is built-in).
/proc/fs/nfsd/
├── threads (rw): read = "N\n" current thread count; write N to spawn/trim threads
├── exports (rw): current exports table in exportfs format; write to update
├── clients/ (r-x): one subdirectory per active NFSv4 client
│ └── <clientid>/ clientid in lowercase hex (16 hex digits)
│ ├── info (r--): "addr: ...\nprincipal: ...\nlease_remaining: ...s\n"
│ ├── states (r--): one line per open stateid and delegation
│ └── ctl (-w-): write "expire\n" to immediately revoke this client's lease
├── pool_stats (r--): per-pool thread count, requests served, DRC hit rate
├── write_verifier (r--): current write verifier as 16-char lowercase hex
├── nfsv4leasetime (rw): NFSv4 lease duration in seconds (default 90, range 10–3600)
├── nfsv4gracetime (rw): grace period duration in seconds (default = nfsv4leasetime)
├── nfsv4minorversion (rw): highest NFSv4 minor version offered (0 or 1; default 1; v4.2 deferred to Phase 4)
└── stable_storage (rw): path to the stable-state journal directory
(default: /var/lib/nfs/v4recovery)
The stable_storage path points to a directory on a local persistent filesystem. The
server writes one file per client (named by clientid) containing serialized
NfsdClientState (open owners, lock owners, delegation stateids) using a binary format.
Each client file begins with an 8-byte header: UMKA magic (4 bytes) + version: Le32
(4 bytes, currently 1). Unknown versions cause the file to be ignored — the client must
re-establish state from scratch during the grace period. A CRC-32C checksum covers the
header and body. These files are read during the grace period to populate the
set of expected reclaimants. They are deleted when a client sends DESTROY_CLIENTID or
when its lease expires normally.
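The header check described above — `UMKA` magic, little-endian version, ignore-on-unknown — plus a CRC-32C are easy to express concretely. A host-side sketch (function names are illustrative; the bitwise CRC-32C here uses the standard Castagnoli reflected polynomial 0x82F63B78, the same algorithm as iSCSI/NVMe digests):

```rust
/// CRC-32C (Castagnoli), bitwise reflected form, polynomial 0x82F63B78.
/// A real implementation would be table-driven or use hardware CRC32 instructions.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Validate the 8-byte client-state header: "UMKA" magic + Le32 version.
/// Returns Some(version) for a known version; None means the file is ignored
/// and the client must re-establish state during the grace period.
fn parse_header(hdr: &[u8; 8]) -> Option<u32> {
    if &hdr[0..4] != b"UMKA" {
        return None;
    }
    let version = u32::from_le_bytes([hdr[4], hdr[5], hdr[6], hdr[7]]);
    if version == 1 { Some(version) } else { None }
}

fn main() {
    // Well-known CRC-32C check value over the ASCII digits "123456789".
    assert_eq!(crc32c(b"123456789"), 0xE306_9283);
    assert_eq!(parse_header(b"UMKA\x01\x00\x00\x00"), Some(1));
    assert_eq!(parse_header(b"UMKA\x02\x00\x00\x00"), None); // unknown version
}
```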
15.12.11 NLM (Network Lock Manager) Server¶
NFSv3 byte-range locking uses a separate RPC protocol: NLM (program 100021, version 4,
defined in the OpenGroup XNFS specification). The NLM server runs as part of lockd
alongside the NFS server.
NLM server procedures:
| Procedure | Handler | Notes |
|---|---|---|
| `NLM_TEST` | `nlm4_test()` | Test for conflicting lock (non-destructive) |
| `NLM_LOCK` | `nlm4_lock()` | Acquire byte-range lock; may block if `block=true` |
| `NLM_CANCEL` | `nlm4_cancel()` | Cancel a pending blocked lock request |
| `NLM_UNLOCK` | `nlm4_unlock()` | Release a byte-range lock |
| `NLM_GRANTED` | `nlm4_granted()` | Callback: server notifies client of granted blocked lock |
| `NLM_TEST_MSG` | async variant of `TEST` | One-way; reply via `NLM_TEST_RES` callback |
| `NLM_LOCK_MSG` | async variant of `LOCK` | One-way; reply via `NLM_LOCK_RES` callback |
| `NLM_UNLOCK_MSG` | async variant of `UNLOCK` | One-way; reply via `NLM_UNLOCK_RES` callback |
| `NLM_SHARE` | `nlm4_share()` | DOS-style share reservation (rarely used) |
| `NLM_UNSHARE` | `nlm4_unshare()` | Release share reservation |
| `NLM_NM_LOCK` | `nlm4_nm_lock()` | Non-monitored lock (NSM not involved) |
| `NLM_FREE_ALL` | `nlm4_free_all()` | Release all locks for a client (NSM reboot notification) |
VFS integration: nlm4_lock() calls vfs_lock_file(file, F_SETLKW, flock) with the
translated struct file_lock. Granted locks are recorded in INode::nlm_locks:
Vec<NlmLock>, protected by INode::lock_mutex. At most NLM_MAX_LOCKS_PER_INODE locks are held per inode
(default: 1024, configurable via sysctl nfs.nlm_max_locks_per_inode). Lock requests
exceeding this limit are rejected with NLM_DENIED_NOLOCKS. Each NlmLock entry
stores the remote host address and the NLM lock_owner opaque cookie so the lock
can be released if the client crashes.
NSM (Network Status Monitor) integration: rpc.statd (program 100024) runs in user
space and monitors client liveness. When lockd grants a lock to a remote client, it
calls nsm_monitor(client_addr) to register the client with rpc.statd. If the
client reboots, rpc.statd calls nsm_callback() which delivers SM_NOTIFY to the
kernel's nfsd_sm_notify() entry point. The kernel then calls nlm_host_rebooted(),
which iterates INode::nlm_locks for all inodes holding locks from that host and
calls vfs_lock_file(F_UNLCK) to release them, allowing other waiters to proceed.
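The reboot-notification cleanup is essentially a filtered sweep over the per-inode lock lists. A user-space sketch (the `NlmLock` struct here is a stand-in with illustrative fields; the kernel would call `vfs_lock_file(F_UNLCK)` for each removed entry rather than simply dropping it):

```rust
use std::net::IpAddr;

/// Minimal stand-in for an NLM lock record.
struct NlmLock {
    host: IpAddr, // remote host that owns the lock
    start: u64,   // byte-range start
    len: u64,     // byte-range length (0 = to EOF)
}

/// On SM_NOTIFY for `rebooted`, release every lock owned by that host.
/// Returns the number of locks released so blocked waiters can be woken.
fn release_locks_for_host(locks: &mut Vec<NlmLock>, rebooted: IpAddr) -> usize {
    let before = locks.len();
    // Kernel equivalent: vfs_lock_file(F_UNLCK) per matching entry.
    locks.retain(|l| l.host != rebooted);
    before - locks.len()
}

fn main() {
    let a: IpAddr = "10.0.0.1".parse().unwrap();
    let b: IpAddr = "10.0.0.2".parse().unwrap();
    let mut locks = vec![
        NlmLock { host: a, start: 0, len: 4096 },
        NlmLock { host: b, start: 0, len: 0 },
        NlmLock { host: a, start: 8192, len: 100 },
    ];
    assert_eq!(release_locks_for_host(&mut locks, a), 2); // both of a's locks go
    assert_eq!(locks.len(), 1); // b's lock survives
}
```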
Grace period: After lockd restarts (following a server crash), it enters a grace
period (default 45 seconds) during which it accepts only NLM_LOCK requests with
reclaim = true. This allows clients to re-acquire locks they held before the crash
before the server accepts new competing lock requests.
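The grace-period gate reduces to a single admission predicate. A sketch under the assumptions that denied non-reclaim requests get the standard `NLM4_DENIED_GRACE_PERIOD` wire status and that the names below are illustrative:

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum LockDecision {
    /// Proceed to vfs_lock_file().
    Admit,
    /// Reject with NLM4_DENIED_GRACE_PERIOD; the client retries later.
    DeniedGracePeriod,
}

/// Grace-period state carried by lockd after a restart.
struct GracePeriod {
    started: Instant,
    length: Duration, // default 45 s
}

impl GracePeriod {
    fn active(&self) -> bool {
        self.started.elapsed() < self.length
    }
}

/// Admission check applied to every incoming NLM_LOCK request.
fn admit_lock(grace: &GracePeriod, reclaim: bool) -> LockDecision {
    if grace.active() && !reclaim {
        LockDecision::DeniedGracePeriod
    } else {
        LockDecision::Admit
    }
}

fn main() {
    let grace = GracePeriod { started: Instant::now(), length: Duration::from_secs(45) };
    assert_eq!(admit_lock(&grace, true), LockDecision::Admit); // reclaim allowed
    assert_eq!(admit_lock(&grace, false), LockDecision::DeniedGracePeriod);
}
```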
15.12.12 Linux Compatibility¶
- `/etc/exports` format: identical to Linux nfsd, including all documented export options (`rw`, `ro`, `sync`, `async`, `root_squash`, `no_root_squash`, `all_squash`, `no_all_squash`, `subtree_check`, `no_subtree_check`, `sec=`, `fsid=`, `anonuid=`, `anongid=`, `crossmnt`, `nohide`, `no_auth_nlm`, `mp=`). Unrecognized options are rejected with a logged warning (not silently ignored).
- `exportfs(8)`, `showmount(8)`, `nfsstat(8)`, `rpc.nfsd(8)`, `rpc.mountd(8)` all operate without modification.
- `/proc/fs/nfsd/` layout matches Linux kernel 5.15+ nfsd. Fields that do not exist in older kernels (e.g., `nfsv4minorversion`) are additive and ignored by older tools.
- NFSv3 wire protocol: RFC 1813 compliant, interoperable with Linux, Solaris, macOS, FreeBSD, and Windows NFS clients.
- NFSv4.1 wire protocol: RFC 5661 compliant. pNFS metadata operations (`LAYOUTGET`, `LAYOUTRETURN`, `LAYOUTCOMMIT`, `GETDEVICEINFO`) are implemented; pNFS data-server operations require a Tier 1 block driver that exposes the `PnfsDataServer` interface (optional; falls back to MDS-only mode if unavailable).
- NFSv4.0 minor version: accepted (negotiated down from v4.1 if the client does not support `EXCHANGE_ID`). The DRC (Section 15.12.5) provides exactly-once semantics for v4.0. NFSv4.0 compatibility requires:
  - `SETCLIENTID` (opcode 35) / `SETCLIENTID_CONFIRM` (opcode 36) handlers for initial client identification (v4.0 does not use `EXCHANGE_ID`/`CREATE_SESSION`).
  - v4.0 callback channel established via reverse TCP connection to the client address from `SETCLIENTID`'s `r_netid`/`r_addr` fields.
  - v4.0 compounds go through the DRC directly (no session slots or `SEQUENCE`).
  - `RENEW` (opcode 30) for lease renewal (replaces `SEQUENCE` in v4.1).
- NFSv3 and NFSv4 servers can run concurrently; both are enabled by default. Writing `3` or `4` to a hypothetical `nfsv_versions` knob is not yet implemented; the standard mechanism (exportfs options + kernel compile flags) applies as on Linux.
15.12.13 Design Decisions¶
- NFSv4.1 sessions replace the DRC for v4.1 clients: The per-session, per-slot sequence-ID mechanism in NFSv4.1 (RFC 5661 §2.10) provides exactly-once semantics without the hash-table overhead of the DRC. The DRC is retained only for NFSv3 and NFSv4.0 clients. NFSv4.1 clients receive `NFS4ERR_SEQ_MISORDERED` on sequence violations rather than a cached reply.
- Stable storage journal for NFSv4 state: Writing client open/lock/delegation state synchronously to disk on every `OPEN`, `CLOSE`, `LOCK`, `LOCKU`, and `DELEGRETURN` allows the server to survive a crash and offer clients a grace period for state reclamation (RFC 5661 §9.4.2). Without stable storage, the server would be forced to return `NFS4ERR_NO_GRACE` to all clients, requiring them to re-open all files from scratch — disruptive for workloads with thousands of open files.
- Thread pool model over event-driven dispatch: Kernel threads (one thread per outstanding request, blocking on `svc_recv`) keep the code path from RPC arrival to VFS call entirely synchronous. An event-driven model (one thread multiplexing many connections via `epoll`) would require explicit continuation passing through VFS callbacks, adding complexity with negligible throughput benefit at the connection counts typical for NFS servers (100s–1000s of clients, not millions).
- `ExportOperations` as a required trait: Requiring filesystems to provide stable file handles (`encode_fh`/`fh_to_dentry`) makes the correctness contract explicit at the type level. Filesystems that cannot provide stable handles (e.g., a synthetic in-memory filesystem with no persistent inode allocation) simply do not implement the trait and cannot be exported — instead of being exported with silently broken `ESTALE` behavior.
- AUTH_SYS and Kerberos both in-kernel: The Kerberos per-RPC integrity and privacy transforms (AES-256-CTS + HMAC) are performance-critical at high RPC rates and belong in the kernel crypto subsystem. Only the initial GSS context negotiation (involving the KDC and the host keytab) uses a user-space upcall to `gssd`. This is identical to Linux nfsd's approach and ensures compatibility with existing `gssd`/`svcgssd` deployments.
- NLM co-located with nfsd: The NLM lock manager shares the `lockd` kernel threads and the per-inode `nlm_locks` list with the NFS server rather than running as a separate subsystem. This avoids a cross-subsystem RPC for every lock operation and allows lock grants and lock releases to be performed atomically with respect to VFS inode locking.
15.13 Block Storage Networking¶
Storage networking protocols that expose remote block devices as local storage. These integrate with UmkaOS's block layer (umka-block), RDMA infrastructure (Section 5.4), and driver recovery model.
15.13.1 Wire Format Validation¶
All #[repr(C)] structs in this section cross node boundaries via the peer protocol.
Every struct has:
- Explicit _pad fields for all implicit alignment padding
- Le16/Le32/Le64 for all multi-byte integers (never native endian)
- u8 (not bool) for boolean fields
- const_assert!(size_of::<T>() == N) after every struct definition
Verified structs in this section:
| Struct | Size | Le types | Pad fields | const_assert | bool-as-u8 | Notes |
|---|---|---|---|---|---|---|
| `NvmeDiscoveryLogEntry` | 1024 | Yes (Le16) | Yes (`_reserved0`, `_reserved1`) | Yes | N/A | NVMe spec struct |
| `BlockServiceRequest` | 64 | Yes (Le64, Le32, Le16) | Yes (`_pad1`) | Yes | N/A | |
| `SglEntry` | 12 | Yes (Le64, Le32) | N/A | N/A (trivially 12B) | N/A | |
| `BlockServiceCompletion` | 32 | Yes (Le64, Le32) | Yes (`_reserved`, `_pad`) | Yes | N/A | |
| `BlockServiceDeviceInfo` | 112 | Yes (Le64, Le32, Le16) | Yes (`_pad`) | Yes | Yes (7 bool fields) | const_assert present and correct |
| `DataIntegrityField` | 8 | Yes (Le16, Le32) | Yes | Yes | guard(2)+app_tag(2)+ref_tag(4)=8 | Fixed: converted to Le16/Le32 |

All struct deviations in this section have been fixed: `DataIntegrityField` now uses `Le16`/`Le32`, and `BlockServiceDeviceInfo` has an active `const_assert!`.
iSCSI Initiator
Tier 1 umka-block module implementing the iSCSI initiator role (RFC 7143):
- Session management: login, logout, connection multiplexing, session recovery
- SCSI command encapsulation over TCP
- CHAP authentication (unidirectional and mutual)
- Header and data digests (CRC32C) for integrity
- Multi-connection sessions (MC/S) for bandwidth aggregation
- Error recovery levels 0, 1, and 2
iSCSI Target
Tier 1 module exposing local block devices as iSCSI LUNs:
- LIO-compatible configuration interface (existing targetcli works via SysAPI layer)
- ACL-based access control (initiator IQN whitelist + CHAP)
- Multiple LUNs per target portal group
- SCSI Persistent Reservations (PR) support (required for clustered filesystems)
iSCSI CHAP Authentication
CHAP (Challenge-Handshake Authentication Protocol, RFC 3720 §12.1) is the standard authentication mechanism for iSCSI. Required for enterprise deployments; both initiator and target sides must support unidirectional and mutual CHAP.
/// CHAP authentication configuration for an iSCSI session.
/// RFC 3720 §12.1, updated by RFC 7143 §12.1.
pub struct IscsiChapAuth {
/// CHAP algorithm: MD5 (5, legacy) or SHA-256 (7, preferred).
pub algorithm: ChapAlgorithm,
/// Target authenticates initiator (one-way CHAP).
pub target_auth: ChapCredential,
/// Initiator authenticates target (mutual CHAP, recommended).
/// Mutual CHAP prevents rogue targets from harvesting initiator credentials.
pub mutual_auth: Option<ChapCredential>,
}
/// CHAP credential pair (name + shared secret).
pub struct ChapCredential {
/// CHAP name (typically the iSCSI qualified name of the peer).
pub name: KString,
/// CHAP secret (shared secret, 12-256 bytes per RFC 7143 §12.1.3).
/// Wrapped in `Zeroizing` to ensure memory is cleared on drop.
pub secret: Zeroizing<ArrayVec<u8, 256>>,
}
/// CHAP hash algorithm identifier (IANA "PPP Authentication Algorithms" registry).
pub enum ChapAlgorithm {
/// MD5 (legacy, for backward compatibility with older initiators only).
Md5 = 5,
/// SHA-1 (legacy, for interop with older initiators).
Sha1 = 6,
/// SHA-256 (recommended, RFC 7143 §13.11). Preferred for all new deployments.
Sha256 = 7,
/// SHA3-256. IANA-assigned (David_Black). FIPS 202 compliant.
/// Interoperability limited to implementations that support CHAP algorithm 8.
Sha3_256 = 8,
}
CHAP negotiation flow during iSCSI login:
- Initiator sends `LoginRequest` with `AuthMethod=CHAP` in the key-value text parameters.
- Target selects CHAP and responds with `CHAP_A` (algorithm list), `CHAP_I` (identifier, one byte), `CHAP_C` (challenge, random bytes).
- Initiator computes `response = Hash(CHAP_I || secret || CHAP_C)` using the selected algorithm, sends `CHAP_N` (name) and `CHAP_R` (response).
- Target verifies the response against its stored credential for the initiator's name.
- If mutual CHAP is negotiated: the target sends its own `CHAP_I` and `CHAP_C` in the same response. The initiator verifies the target's identity using the mutual secret. This prevents man-in-the-middle attacks where a rogue target impersonates the real one.
CHAP credentials are stored in the kernel key retention service
(Section 10.2) under the iscsi_chap keyring. The targetcli
configuration interface writes credentials via the configfs auth/ directory
(see targetcli configfs management below).
iSER (iSCSI Extensions for RDMA)
When RDMA fabric is available (InfiniBand, RoCE, iWARP — Section 5.4), iSCSI sessions transparently upgrade to RDMA transport:
- Zero-copy data transfer: RDMA READ/WRITE directly between initiator/target memory
- Kernel-bypass data path: data moves without CPU involvement
- Same iSCSI session management and authentication, different transport
- Transparent upgrade: if both ends advertise RDMA capability during login, iSER is negotiated automatically. Applications and management tools see a standard iSCSI session.
NVMe-oF Initiator (Host)
Tier 1 umka-block module implementing the NVMe over Fabrics host side (NVM Express 2.0, NVMe TCP Transport Specification TP 8000, NVMe/RDMA part of original NVMe-oF specification June 2016). For NVMe/TCP transport, the initiator creates kernel-internal TCP sockets via the standard networking stack (Section 16.1). All network-level operations (routing, congestion control, netfilter) apply to NVMe-oF TCP traffic as normal flows. For NVMe/RDMA transport, it uses the RDMA pool manager (Section 5.4) for zero-copy buffer registration.
- Discovery: NVMe-oF discovery protocol (well-known discovery NQN) — initiator queries a discovery controller to enumerate available subsystems and transport addresses. Supports Discovery Log Page, referrals, and persistent discovery connections, unique discovery controller identification (TP 8013a).
- NVMe/TCP transport: NVMe commands encapsulated in TCP (NVMe TCP Transport Specification, TP 8000, widely deployed). Lighter than iSCSI — no SCSI translation layer, native NVMe command set. Supports header and data digests (CRC32C), and TLS 1.3 for in-transit encryption (TP 8011). Uses the kernel-internal socket API (`SocketOps` trait, Section 16.3) for TCP connections — `connect()`, `sendmsg()` with `MSG_MORE` for PDU framing, `recvmsg()` for response parsing. Each NVMe I/O queue maps to one TCP connection. Zero-copy TX uses `NetBuf` scatter-gather (Section 16.5) to avoid copying NVMe command capsules and data payloads.
- NVMe/RDMA transport: NVMe commands over RDMA (InfiniBand, RoCE, iWARP). Capsule commands sent via RDMA SEND, data transferred via RDMA READ/WRITE — zero-copy, kernel-bypass. Lowest latency option (~3-5 μs network transport; ~10-20 μs end-to-end including NVMe target processing). NVMe-oF I/O SGLs (scatter-gather lists for data buffers) are allocated from `RdmaPoolManager::alloc("nvmeof", size)` (Section 5.4). On quota exhaustion (the NVMe-oF RDMA pool is depleted), the initiator returns `BLK_STS_RESOURCE` to the block layer, which applies backpressure by re-queuing the I/O request and throttling submission until pool capacity is recovered.
- Multipath: native NVMe multipath (ANA — Asymmetric Namespace Access). Multiple paths to the same namespace are managed by the NVMe driver itself (not dm-multipath). ANA groups indicate path optimality (optimized, non-optimized, inaccessible). UmkaOS's NVMe multipath integrates with the recovery-aware volume layer (Section 15.2) — if a path fails due to driver crash, the volume layer waits for recovery rather than immediately failing over.
- Namespace management: attach/detach namespaces, resize, format — full NVMe-oF namespace management command set.
- Zoned namespaces (ZNS): NVMe-oF supports zoned namespaces. UmkaOS exposes these through the block layer's zone interface, compatible with zonefs and f2fs.
NVMe-oF Target (Subsystem)
Tier 1 module exposing local NVMe devices (or any block device) as NVMe-oF subsystems:
- Subsystem management: create/destroy NVMe subsystems, each with one or more namespaces backed by local block devices (NVMe, zvol, dm device, or any umka-block device).
- Transport bindings: simultaneous TCP and RDMA listeners on the same subsystem. Clients connect via whichever transport is available.
- Access control: per-host NQN ACLs. Each allowed host can be restricted to specific namespaces within the subsystem.
- ANA groups: configure asymmetric namespace access for multipath. Allows active/passive and active/active configurations.
- Passthrough mode: for local NVMe devices, optionally pass NVMe commands directly to the hardware (no block layer translation). Provides the lowest-latency target implementation — remote host gets near-local NVMe performance.
- Configuration interface: `nvmetcli`-compatible JSON configuration (existing Linux NVMe target management tools work via SysAPI layer).
NVMe-oF TLS 1.3 Security
NVMe/TCP supports in-band TLS 1.3 encryption (NVMe TP 8011). This protects data in transit without requiring IPsec or network-level encryption, and is mandatory for deployments where storage traffic traverses untrusted network segments.
/// TLS 1.3 configuration for an NVMe-oF target port or initiator connection.
/// Implements NVMe TP 8011 (Secure Channel — TLS for NVMe/TCP).
pub struct NvmeofTlsConfig {
/// TLS mode for this port/connection.
pub mode: NvmeofTlsMode,
/// PSK identity hint (for PSK mode). Matches the identity configured
/// on the initiator via `nvme gen-tls-key` / `nvme set-key`.
pub psk_identity: Option<KString>,
/// Pre-shared key value (for PSK mode). TLS 1.3 PSK, up to 48 bytes
/// (SHA-384 output). Wrapped in `Zeroizing` for secure memory handling.
pub psk: Option<Zeroizing<[u8; 48]>>,
/// X.509 certificate for certificate-based TLS. The certificate and
/// private key are stored in the kernel key retention service
/// ([Section 10.2](10-security-extensions.md#kernel-key-retention-service)).
pub cert: Option<Arc<X509Cert>>,
/// Whether to require client (initiator) authentication.
/// When true, the target requests and verifies a client certificate
/// during the TLS handshake (mutual TLS).
pub require_client_auth: bool,
}
/// NVMe-oF TLS mode selection.
pub enum NvmeofTlsMode {
/// No TLS (plaintext NVMe/TCP). Default for backward compatibility.
None,
/// TLS 1.3 with Pre-Shared Key (NVMe TP 8011). The PSK is provisioned
/// out-of-band and identified by `psk_identity`. Simpler deployment
/// than certificates; suitable for static clusters.
Psk,
/// TLS 1.3 with X.509 certificates. Provides identity verification
/// via certificate chain validation. Required for multi-tenant or
/// cross-organizational deployments.
Certificate,
}
TLS offload: when the NIC supports kTLS offload (Section 16.15), TLS
record-layer encryption and decryption are performed in hardware, making encrypted
NVMe/TCP effectively zero-copy. The NVMe-oF initiator and target negotiate TLS during
the NVMe/TCP connection setup phase (before the NVMe Connect command). The TLS session
keys are installed into the kTLS socket via setsockopt(SOL_TLS, TLS_TX) /
setsockopt(SOL_TLS, TLS_RX).
DH-HMAC-CHAP (NVMe TP 8001) provides an alternative in-band authentication mechanism that does not require TLS infrastructure. It can be used standalone or as a pre-authentication step before the TLS handshake. The discovery controller also supports DH-HMAC-CHAP (see NVMe-oF Discovery Controller below).
NVMe-oF Discovery Controller
The NVMe-oF discovery controller provides interoperability with non-UmkaOS nodes (Linux, Windows, ESXi) that discover NVMe-oF targets using the standard NVMe-oF discovery protocol. Without a discovery controller, non-UmkaOS initiators cannot find UmkaOS NVMe-oF subsystems — they require out-of-band configuration of target addresses, which defeats the self-discovery model that NVMe-oF was designed for.
Implementation (Tier 1, part of the NVMe-oF target module):
- Well-known discovery NQN: listens as `nqn.2014-08.org.nvmexpress.discovery` (the standard NVMe-oF discovery NQN defined in NVMe Base Specification 2.0 §4.1). Initiators connect to this NQN to retrieve the discovery log page.
- Well-known port: listens on TCP port 8009 (the IANA-assigned NVMe-oF discovery port, also used by Linux nvmet and commercial NVMe-oF targets). Also listens for RDMA connections on the same port when RDMA transport is available.
- Dual transport: the discovery controller accepts connections over both TCP and RDMA transports simultaneously. Initiators connect via whichever transport they support. The discovery log page includes entries for both TCP and RDMA target addresses, allowing the initiator to select its preferred data transport.
- Discovery Log Page (NVMe Base Specification 2.0 §5.3): responds to the `Get Log Page` command (Log Identifier 0x70) with a standard discovery log page containing one entry per locally-exported NVMe subsystem + transport binding.
/// NVMe-oF Discovery Log Page Entry (NVMe Base Spec 2.0, Figure 292).
/// One entry per (subsystem, transport, address) tuple.
/// Size: 1024 bytes per entry (fixed, per NVMe spec).
/// Multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct NvmeDiscoveryLogEntry {
/// Transport type: 0x01 = RDMA, 0x03 = TCP.
pub trtype: u8,
/// Address family: 0x01 = IPv4, 0x02 = IPv6.
pub adrfam: u8,
/// Subsystem type: 0x01 = NVMe I/O subsystem, 0x02 = discovery.
pub subtype: u8,
/// Transport requirements (RDMA: RDMA_QPTYPE, RDMA_PRTYPE, RDMA_CMS).
pub treq: u8,
/// Port ID (unique per transport address on this target).
pub portid: Le16,
/// Controller ID (0xFFFF = dynamic, assigned at connect).
pub cntlid: Le16,
/// Admin max SQ size.
pub asqsz: Le16,
/// Extended discovery flags (NVMe 1.4+). Bit 0: EPCSD (explicit persistent
/// connection to discovery controller). Bit 1: DUPRETINFO (duplicate return info).
pub eflags: Le16,
/// Reserved padding.
pub _reserved0: [u8; 20],
/// Transport service identifier (port number as ASCII string,
/// e.g., "4420" for NVMe-oF I/O, "8009" for discovery).
pub trsvcid: [u8; 32],
/// Explicit padding.
pub _reserved1: [u8; 192],
/// NVMe subsystem qualified name (NQN) — null-terminated ASCII,
/// max 223 characters + NUL (NVMe spec §4.1).
pub subnqn: [u8; 256],
/// Transport address (IP address as ASCII string for TCP/RDMA,
/// e.g., "192.168.1.100" or "fe80::1").
pub traddr: [u8; 256],
/// Transport-specific address subtype (RDMA: partition key, TCP: unused).
pub tsas: [u8; 256],
}
// NVMe Base Spec 2.1: discovery log entry is exactly 1024 bytes.
// trtype(1) + adrfam(1) + subtype(1) + treq(1) + portid(2) + cntlid(2) +
// asqsz(2) + eflags(2) + _reserved0(20) + trsvcid(32) + _reserved1(192) +
// subnqn(256) + traddr(256) + tsas(256) = 1024.
const_assert!(core::mem::size_of::<NvmeDiscoveryLogEntry>() == 1024);
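The layout arithmetic in the comment above can be checked with a plain host-side replica of the struct (the `Le16` stand-in below is a `#[repr(transparent)]` newtype, as the kernel type is described; this is a verification sketch, not the kernel definition):

```rust
/// Little-endian u16 stand-in for the kernel's #[repr(transparent)] Le16.
#[repr(transparent)]
#[allow(dead_code)]
struct Le16(u16);

/// Host-side replica of NvmeDiscoveryLogEntry, used only to confirm that the
/// declared fields produce exactly 1024 bytes with no implicit padding.
#[repr(C)]
#[allow(dead_code)]
struct DiscoveryLogEntry {
    trtype: u8,
    adrfam: u8,
    subtype: u8,
    treq: u8,
    portid: Le16,
    cntlid: Le16,
    asqsz: Le16,
    eflags: Le16,
    _reserved0: [u8; 20],
    trsvcid: [u8; 32],
    _reserved1: [u8; 192],
    subnqn: [u8; 256],
    traddr: [u8; 256],
    tsas: [u8; 256],
}

fn main() {
    // 4×1 + 4×2 + 20 + 32 + 192 + 256 + 256 + 256 = 1024, alignment 2 (Le16),
    // every Le16 lands on an even offset — so repr(C) inserts no padding.
    assert_eq!(core::mem::size_of::<DiscoveryLogEntry>(), 1024);
    assert_eq!(core::mem::align_of::<DiscoveryLogEntry>(), 2);
}
```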
- Automatic enumeration: the discovery controller scans all locally-configured NVMe-oF subsystems (from the nvmet configuration) and generates discovery log page entries for each subsystem + transport combination. When subsystems are added or removed, the discovery log page generation counter is incremented, and initiators with persistent discovery connections receive an AEN (Asynchronous Event Notification) prompting them to re-read the log page.
- Persistent discovery connections (TP 8013a): initiators can maintain a long-lived connection to the discovery controller. The controller sends AENs when the discovery log changes (subsystem added/removed, transport address changed, ANA state changed). This eliminates periodic polling — the initiator is notified immediately of topology changes.
- Referrals: the discovery controller can include referral entries pointing to discovery controllers on other UmkaOS nodes. This enables distributed discovery: an initiator connects to any one UmkaOS node's discovery controller and learns about NVMe-oF subsystems across the entire cluster. Referral entries use `subtype = 0x02` (discovery subsystem) with the remote node's transport address.
- Security: discovery controller connections support DH-HMAC-CHAP authentication (NVMe TP 8001) and TLS 1.3 (NVMe TP 8011) when configured. Unauthenticated discovery is permitted by default for compatibility with existing initiators; operators can require authentication via the nvmet access control configuration.
- Mixed cluster interoperability: in a cluster containing both UmkaOS and non-UmkaOS nodes, UmkaOS nodes discover storage via the native PeerRegistry (Section 5.2), while non-UmkaOS nodes use the NVMe-oF discovery controller on TCP port 8009. Both paths expose the same NVMe subsystems — the discovery controller simply translates PeerRegistry storage advertisements into standard NVMe-oF discovery log entries.
targetcli configfs Management
Both iSCSI and NVMe-oF targets are configured via configfs
(Section 14.12). The configfs tree layout is
Linux-compatible so that existing user-space tools (targetcli, targetcli-fb,
rtslib-fb, nvmetcli) work without modification via the SysAPI layer.
iSCSI target configfs hierarchy (/sys/kernel/config/target/iscsi/):
| Path | Purpose |
|---|---|
| `<iqn>/` | mkdir: create an iSCSI target with the given IQN |
| `<iqn>/tpgt_<n>/` | mkdir: create target portal group N |
| `<iqn>/tpgt_<n>/enable` | `echo 1 >`: activate the portal group |
| `<iqn>/tpgt_<n>/lun/lun_<m>/` | mkdir + symlink to backstore: map a LUN |
| `<iqn>/tpgt_<n>/acls/<initiator_iqn>/` | mkdir: create ACL entry for an initiator |
| `<iqn>/tpgt_<n>/acls/<initiator_iqn>/auth/` | CHAP credentials: `userid`, `password`, `userid_mutual`, `password_mutual` |
| `<iqn>/tpgt_<n>/np/<ip>:<port>/` | mkdir: create a network portal (listen address) |
| `<iqn>/tpgt_<n>/param/` | iSCSI session parameters: MaxRecvDataSegmentLength, MaxBurstLength, FirstBurstLength, DefaultTime2Wait, DefaultTime2Retain, HeaderDigest, DataDigest |
NVMe-oF target configfs hierarchy (/sys/kernel/config/nvmet/):
| Path | Purpose |
|---|---|
| `subsystems/<nqn>/` | mkdir: create an NVMe subsystem |
| `subsystems/<nqn>/attr_allow_any_host` | `echo 1 >`: disable host NQN ACL checking |
| `subsystems/<nqn>/namespaces/<nsid>/` | mkdir: create a namespace |
| `subsystems/<nqn>/namespaces/<nsid>/device_path` | `echo /dev/nvme0n1 >`: set backing device |
| `subsystems/<nqn>/namespaces/<nsid>/enable` | `echo 1 >`: activate the namespace |
| `ports/<port_id>/` | mkdir: create a transport port |
| `ports/<port_id>/addr_trtype` | Transport type: `tcp`, `rdma` |
| `ports/<port_id>/addr_traddr` | Listen address (e.g., 192.0.2.1) |
| `ports/<port_id>/addr_trsvcid` | Listen port (e.g., 4420) |
| `ports/<port_id>/param_tls` | TLS mode: `none`, `psk`, `certificate` (see NVMe-oF TLS 1.3 above) |
| `ports/<port_id>/subsystems/<nqn>` | Symlink: bind a subsystem to this port |
| `hosts/<nqn>` | mkdir: register an allowed host NQN for ACL |
Configuration operations are serialized by the configfs group_mutex (one writer at a
time). Reads (e.g., cat param/MaxBurstLength) are lock-free via RCU-protected
snapshots of the parameter structures. Runtime parameter changes (e.g., adjusting
MaxRecvDataSegmentLength) take effect on new sessions only; existing sessions retain
the parameters negotiated at login time.
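The snapshot semantics — existing sessions keep the parameters they negotiated while new readers see the update — can be modeled with a publish-on-write store. A user-space stand-in (a `Mutex<Arc<_>>` here; real RCU readers take no lock at all, and all names are illustrative):

```rust
use std::sync::{Arc, Mutex};

/// Snapshot of negotiable session parameters (illustrative subset).
struct Params {
    max_burst_length: u32,
}

/// Publish-on-write store: writers swap in a fresh Arc'd snapshot under the
/// mutex; readers clone the Arc and then read without holding any lock.
struct ParamStore {
    current: Mutex<Arc<Params>>,
}

impl ParamStore {
    /// Take a snapshot. Sessions hold this Arc for their lifetime.
    fn read(&self) -> Arc<Params> {
        self.current.lock().unwrap().clone()
    }
    /// Publish new parameters; affects only snapshots taken afterwards.
    fn update(&self, p: Params) {
        *self.current.lock().unwrap() = Arc::new(p);
    }
}

fn main() {
    let store = ParamStore {
        current: Mutex::new(Arc::new(Params { max_burst_length: 262_144 })),
    };
    let session = store.read(); // "login-time" snapshot
    store.update(Params { max_burst_length: 524_288 });
    assert_eq!(session.max_burst_length, 262_144); // existing session unchanged
    assert_eq!(store.read().max_burst_length, 524_288); // new sessions see update
}
```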
NVMe-oF over Fabrics — Why It Matters
NVMe-oF is replacing iSCSI in new deployments because it eliminates the SCSI translation layer. iSCSI encapsulates SCSI commands (a protocol designed for parallel buses in 1986) over TCP. NVMe-oF speaks NVMe natively — the same command set used by local NVMe SSDs. This means:
- No SCSI CDB translation overhead
- Native support for NVMe features (multipath/ANA, zoned namespaces, NVMe reservations)
- Simpler protocol state machine (NVMe queue pairs vs iSCSI session/connection/task)
- Lower latency at every layer
UmkaOS supports both because iSCSI remains dominant in existing infrastructure (and iSER makes it competitive on RDMA fabrics), while NVMe-oF is the clear direction for new deployments.
Protocol comparison:
| Protocol | Transport | CPU overhead | Latency | Bandwidth |
|---|---|---|---|---|
| iSCSI | TCP | High (TCP stack + SCSI) | ~100μs | 10-25 Gbps |
| iSER | RDMA | Minimal (zero-copy) | ~15-25μs end-to-end (transport only: ~5-10μs) | Line rate (100+ Gbps) |
| NVMe-oF/TCP | TCP | Medium (no SCSI layer) | ~15-30μs | 25-100 Gbps |
| NVMe-oF/RDMA | RDMA | Minimal | ~10-20μs end-to-end¹ | Line rate |
¹ NVMe-oF/RDMA latency breakdown: ~3-5 μs network transport (RDMA) + NVMe target processing. The 3-5 μs figure commonly cited represents RDMA transport latency only; end-to-end I/O latency including NVMe device processing is typically ~10-20 μs.
Recovery advantage — Both iSCSI and NVMe-oF initiators run as Tier 1 drivers with state preservation (Section 11.9). If an initiator driver crashes:
1. Connection state is checkpointed to the state preservation buffer.
2. Driver reloads in ~50-150ms.
3. RDMA transports (iSER, NVMe-oF/RDMA): When a driver crashes, the local RNIC's Queue Pair enters Error state, and the remote side's QP also transitions to Error state from retransmission timeouts. QP state cannot be transparently restored from a checkpoint — the QP must be destroyed and re-created (Reset -> Init -> RTR -> RTS). UmkaOS performs a fast QP re-creation: checkpointed session parameters (remote QPN, GID, LID, PSN, MTU, RDMA capabilities) allow the new QP to be configured without full connection manager negotiation. The remote side detects the QP failure (via async error event or failed RDMA operation) and cooperates in re-establishing the QP pair. Total recovery: ~50-150ms (fast re-creation, not transparent restore), vs. 10-30 seconds for full re-discovery in Linux.
4. TCP transports (iSCSI/TCP, NVMe-oF/TCP): Full TCP connection state cannot be reliably restored after a crash (the remote peer's TCP state has advanced: retransmissions, window adjustments, etc.). Instead, UmkaOS performs a fast reconnect: the checkpointed session parameters (target portal, ISID, TSIH for iSCSI; NQN, controller ID for NVMe-oF) allow session re-establishment without full discovery. The target accepts the reconnect as a session continuation (iSCSI RFC 7143 Section 7.3.5 session reinstatement; NVMe-oF controller reconnect). I/O commands in flight are retried by the block layer. Total recovery: ~200-500ms (vs. 10-30 seconds for full re-discovery in Linux).
In Linux, an initiator crash requires full session re-establishment: TCP/RDMA reconnection, login/connect, LUN/namespace re-discovery, and filesystem remount. This can take 10-30 seconds and may cause I/O errors visible to applications.
Multipath — Two multipath models coexist:
- iSCSI: dm-multipath integration with the recovery-aware volume layer
(Section 15.2). Multiple iSCSI paths (via different network interfaces or through
different target portals) provide redundancy.
- NVMe-oF: native NVMe ANA multipath (managed by the NVMe driver, not dm-multipath).
ANA state changes are handled in-driver with recovery awareness.
Both models coordinate with the volume state machine — if a path fails due to driver crash (not network failure), the volume layer waits for driver recovery rather than immediately failing over.
15.13.2 NVMe-oF Reconnect Policy¶
The external NVMe-oF protocol is Linux-compatible (same wire format, same controller reconnect semantics). The reconnect strategy — when and how to retry — is UmkaOS's internal design space. Without backoff and jitter, all hosts in a cluster that lose fabric connectivity simultaneously will attempt to reconnect simultaneously, overloading the target's accept queue and prolonging the outage. UmkaOS uses exponential backoff with full jitter to spread reconnect attempts across the cluster.
Algorithm: exponential backoff with full jitter
When a fabric connection drops (TCP disconnect, QP error event) or an initial connect attempt fails:
attempt = 0
base_delay_ms = 100
max_delay_ms = 30_000   // 30 seconds
jitter_frac = 0.25      // ±25%

loop:
    delay  = min(base_delay_ms * 2^attempt, max_delay_ms)
    jitter = random_uniform(-delay * jitter_frac, +delay * jitter_frac)
    sleep(delay + jitter)
    attempt = attempt + 1
    try connect()
    if connected: reset attempt = 0, break
Delays without jitter (for reference): 100ms, 200ms, 400ms, 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 25.6s, 30s, 30s, ...
With jitter, the actual delay is uniformly random in [0.75×delay, 1.25×delay], so simultaneous reconnect attempts spread across the jitter window rather than clustering at the same instant. This bounded ±25% jitter keeps worst-case delays predictable; the AWS Architecture Blog post "Exponential Backoff And Jitter" (2015) analyzes the alternatives — its "full jitter" variant (uniform in [0, delay]) spreads attempts most aggressively in large clusters, at the cost of occasionally near-zero delays.
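The delay computation above can be sketched as a pure function. This is an illustration only — `backoff_delay_ms` and its `rand01` parameter are hypothetical names, with the uniform random sample in [0.0, 1.0] supplied by the caller (in the kernel it would come from the PRNG):

```rust
/// Backoff delay for one reconnect attempt: exponential growth from a
/// 100ms base, capped at the 30s ceiling, with ±25% uniform jitter.
fn backoff_delay_ms(attempt: u32, rand01: f64) -> u64 {
    const BASE_DELAY_MS: u64 = 100;
    const MAX_DELAY_MS: u64 = 30_000;
    const JITTER_FRAC: f64 = 0.25;

    // min(base * 2^attempt, max); saturate the shift so large attempt
    // counts cannot overflow the multiplication.
    let exp = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let delay = BASE_DELAY_MS.saturating_mul(exp).min(MAX_DELAY_MS) as f64;
    // Map rand01 onto a jitter in [-25%, +25%] of the delay.
    let jitter = (rand01 * 2.0 - 1.0) * delay * JITTER_FRAC;
    (delay + jitter) as u64
}
```

At the midpoint sample (`rand01 = 0.5`, zero jitter) the function reproduces the reference delay sequence above: 100ms at attempt 0, 25.6s at attempt 8, and the 30s ceiling from attempt 9 onward.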
ANA path failover — If a path transitions to ANAState::Inaccessible, UmkaOS
immediately tries the next available ANA-optimized path before entering the reconnect
loop for the failed path. The reconnect loop is only entered after all optimized paths
for a namespace are exhausted. This preserves I/O availability during single-path
failures without incurring any reconnect delay.
Fast-path reconnect (NVMe-oF/TCP only) — If the TCP connection drops but the NVMe-oF controller was previously established (implying a fabric-layer issue rather than a target reset or controller crash), the first reconnect attempt uses a fixed 10ms delay instead of the normal 100ms base delay. The rationale: the target controller is likely still healthy and ready to accept the reconnect immediately; the full backoff sequence is reserved for cases where the target itself is unavailable.
Maximum reconnect attempts — After 20 consecutive failed attempts (approximately
10 minutes at the 30s ceiling), the controller is marked NvmeControllerState::Offline
and I/O to namespaces served only by this controller fails with EIO. The controller
remains registered; operators can re-trigger connection attempts via sysfs or the umkafs
control interface at /ukfs/kernel/nvmeof/<nqn>/reconnect.
15.13.3 Block Service Provider¶
When a host has a storage device managed by a traditional KABI driver (NVMe, SCSI, virtio-blk), the block layer can provide that device as a cluster service via the peer protocol. Remote peers access the device through the standard block device interface — they do not know or care which driver manages it on the serving host.
This is the block-layer instantiation of the capability service provider model described in Section 5.7. In a uniform UmkaOS cluster, the block service provider delivers remote storage access without NVMe-oF targets, iSCSI daemons, or any external protocol stack.
15.13.3.1 Service Provider and Wire Protocol¶
Device-native providers (Tier M): When an NVMe drive's firmware implements
the umka peer protocol (Section 11.1),
the drive IS the block service provider — no host-side KABI NVMe driver is involved.
The drive advertises BLOCK_STORAGE via CapAdvertise, the host creates a
BlockServiceClient via the PeerServiceProxy bridge
(Section 5.11), and
I/O flows through the ring pair directly to drive hardware. The wire protocol
(BlockServiceRequest/Completion) is identical whether the provider is device
firmware, a host-proxy kernel module, or a remote host. Sharing model: multiple
hosts can ServiceBind to the same block device simultaneously with reservation
coordination (Reserve/Release/Preempt opcodes).
// umka-block/src/service_provider.rs
/// Registers a local block device for remote access by cluster peers.
/// The service provider listens for incoming block I/O requests on the
/// peer protocol and dispatches them to the local block layer.
pub struct BlockServiceProvider {
/// The local block device being served.
device: BlockDeviceHandle,
/// Unique service instance identifier. Used for reservation namespace
/// and multi-path target identification.
service_id: ServiceInstanceId,
/// Per-CPU I/O queues (see "Multi-Queue I/O" below). One queue pair per
/// connected client CPU, up to `max_queues` per client.
/// Bounded: max `max_queues_per_client × MAX_CONNECTED_CLIENTS` (32 × 1024 = 32768).
/// `MAX_CONNECTED_CLIENTS` is enforced in the connection accept path: new
/// connections beyond the limit are rejected with a protocol-level error.
/// Allocated at client connect time (warm path).
queues: Vec<BlockServiceQueue>,
/// Maximum I/O queues per client connection. Default: min(server_cpus, 32).
/// Each queue is a separate peer queue pair for full parallelism.
max_queues_per_client: u16,
/// Maximum concurrent I/O operations per queue (backpressure).
/// Default: 128. Total max inflight = max_queues × queue_depth.
queue_depth: u16,
/// Write-back cache for coalescing remote writes (optional).
/// Disabled by default for safety. Enabled via export configuration
/// when the remote consumer tolerates write-back semantics.
writeback_cache: Option<WritebackCache>,
/// Connected clients, tracked for reservation state and recovery.
/// Keyed by PeerId (u64). XArray provides O(1) lookup with native
/// RCU-protected reads and internal xa_lock for write serialization.
clients: XArray<BlockServiceClientState>,
}
/// Server-side write coalescing cache. Buffers remote writes in memory
/// before flushing to the backing block device, reducing small-write
/// amplification for workloads with temporal locality (e.g., metadata
/// updates). Disabled by default for safety — only enabled via explicit
/// export configuration when the remote consumer tolerates write-back
/// semantics (i.e., acknowledges that unflushed writes are lost on
/// server crash, same as a local volatile write cache).
pub struct WritebackCache {
/// Per-client dirty page tracking. XArray keyed by (offset / block_size).
/// Provides O(1) lookup for coalescing successive writes to the same block
/// and ordered iteration for sequential flush.
dirty_map: XArray<DirtyEntry>,
/// Maximum dirty bytes before flush (backpressure). Default: 64 MiB.
/// When `dirty_bytes` reaches this threshold, new writes block until
/// the periodic flush or an explicit Flush request drains enough data.
max_dirty_bytes: u64,
/// Current dirty bytes. Updated atomically on write (add) and flush (sub).
dirty_bytes: AtomicU64,
/// Flush interval in milliseconds. A periodic writeback timer fires at
/// this interval to flush aged dirty entries. Default: 5000.
flush_interval_ms: u32,
/// Write-through threshold: writes larger than this bypass the cache
/// and go directly to the block device. Default: 256 KiB. Large
/// sequential writes do not benefit from coalescing and would evict
/// useful cached small writes.
write_through_threshold: u32,
}
/// A single dirty block in the writeback cache.
struct DirtyEntry {
/// Client that wrote this block (for invalidation on disconnect).
client_id: PeerId,
/// Data buffer (slab-allocated, block_size bytes).
data: SlabRef<[u8]>,
/// Timestamp of last write (monotonic ns, for age-based flush).
last_write_ns: u64,
}
/// Per-client state tracked on the server side. One entry per connected
/// remote peer, stored in `BlockServiceProvider::clients` (XArray keyed
/// by PeerId).
pub struct BlockServiceClientState {
/// Remote peer identity.
peer_id: PeerId,
/// Number of queues established by this client.
nr_queues: u16,
/// Reservation state (if this client holds a reservation on the device).
reservation: Option<ReservationState>,
/// In-flight I/O count for this client (for fair scheduling across
/// clients sharing the same export).
inflight_count: AtomicU32,
/// Bandwidth consumed (bytes/sec, EWMA with α = 1/16). For QoS
/// enforcement — the server throttles clients exceeding their
/// per-client bandwidth limit.
bandwidth_ewma: AtomicU64,
/// Connection timestamp (monotonic ns). Used for diagnostics and
/// connection age reporting in sysfs.
connected_since_ns: u64,
/// Request ID deduplication window for reconnect. Tracks the last
/// N completed request_ids to reject duplicates after reconnection.
/// Ring buffer, size = 2 × queue_depth (256 at the default depth of
/// 128). On reconnect, the client may re-submit requests that already
/// completed on the server before the connection dropped. The server
/// checks incoming request_ids against this window and returns the
/// cached completion without re-executing the I/O.
dedup_window: ArrayVec<u64, 256>,
}
/// Per-client reservation state (server-side).
pub struct ReservationState {
/// Reservation type (exclusive write, shared read, etc.).
reservation_type: BlockReservationType,
/// Reservation key (client-chosen, used for preemption identification).
key: u64,
/// Generation counter for SCSI-3 PR compatibility. Incremented on
/// every reservation change for this client. Used by clustered
/// filesystems (GFS2, OCFS2) to detect stale reservations.
generation: u32,
}
/// A single I/O queue within an export. Each queue is serviced by a
/// dedicated kernel thread pinned to one CPU — no lock contention
/// between queues (same model as NVMe hardware queues).
pub struct BlockServiceQueue {
/// Peer protocol queue pair for this I/O queue. Established at
/// ServiceBind time; the concrete implementation depends on the
/// transport binding (RDMA RC QP, TCP socket, CXL doorbell, PCIe
/// BAR ring). Service providers use the ring pair abstraction
/// ([Section 5.1](05-distributed.md#distributed-kernel-architecture--peer-ring-entry-format)),
/// not raw transport operations.
qp: PeerQueuePair,
/// Submission ring: client writes requests here.
submit_ring: RingBuffer<BlockServiceRequest>,
/// Completion ring: server writes completions here.
completion_ring: RingBuffer<BlockServiceCompletion>,
/// CPU this queue is bound to on the server.
cpu: u32,
}
/// Block I/O request from a remote peer.
/// Size: 64 bytes (one cache line, fits in one transport send).
/// This struct crosses node boundaries via the peer protocol. Per the DSM
/// wire format policy ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)):
/// every `#[repr(C)]` struct that crosses a node boundary MUST use
/// `Le16`/`Le32`/`Le64` for all multi-byte integer fields. Single-byte
/// fields (`u8`) and byte arrays (`[u8; N]`) are endianness-neutral.
#[repr(C, align(64))]
pub struct BlockServiceRequest {
/// Client-assigned request ID. Echoed in completion.
pub request_id: Le64,
/// Operation code.
pub opcode: BlockServiceOpcode,
/// I/O priority class (see "I/O Priority and QoS" below). Higher = more urgent.
/// Default: BestEffortLow. Used by the server-side I/O scheduler
/// for QoS enforcement when multiple clients share an export.
pub priority: BlockServicePriority,
/// Flags: FUA, barrier, scatter-gather, data integrity.
pub flags: Le16,
/// Explicit padding: Le types have alignment 1, so no implicit padding
/// exists. This 4-byte pad ensures `offset` starts at byte 16 (8-byte
/// aligned), which is conventional for wire formats.
pub _pad1: [u8; 4],
/// Byte offset on the block device.
pub offset: Le64,
/// Length in bytes (for Read/Write/Discard/CompareAndWrite).
pub len: Le32,
/// Number of scatter-gather entries (see "Scatter-Gather I/O" below).
/// 0 = single contiguous buffer (data_region_offset).
/// 1-15 = scatter-gather list follows the request as inline SGL.
/// The inline SGL (sgl_count × 12 bytes) plus the 64-byte header must
/// fit within one transport send inline threshold. Standard ConnectX NICs
/// support 256-byte inline send → max 16 SGL entries inline
/// ((256 - 64) / 12 = 16, capped at 15 by sgl_count).
/// If the SGL exceeds the inline threshold, it is written into a
/// pre-registered server buffer via push_page() and
/// data_region_offset points to the SGL, not the data.
pub sgl_count: u8,
/// Reserved for alignment.
pub _reserved: [u8; 3],
/// Offset within the per-queue ServiceDataRegion established at
/// ServiceBind time. For Read: server writes data here; for Write:
/// server reads data from here. Zero for non-data ops.
/// When sgl_count > 0, this points to the first SglEntry.
pub data_region_offset: Le64,
/// For CompareAndWrite: offset within the ServiceDataRegion for the
/// compare buffer. The compare buffer contains the expected data;
/// data_region_offset contains the new data to write if comparison
/// succeeds.
pub compare_region_offset: Le64,
/// Explicit padding to fill the 64-byte cache line.
pub _pad: [u8; 16],
// Layout: request_id(8) + opcode(1) + priority(1) + flags(2) +
// _pad1(4) + offset(8) + len(4) + sgl_count(1) + _reserved(3)
// + data_region_offset(8) + compare_region_offset(8) + _pad(16) = 64.
}
const_assert!(core::mem::size_of::<BlockServiceRequest>() == 64);
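The layout comment can be cross-checked mechanically. The sketch below (illustrative only; field names copied from the struct) recomputes each field offset as a running sum — valid precisely because the Le* wire types have alignment 1, so no implicit padding is inserted:

```rust
/// (name, size-in-bytes) pairs in declaration order, from the layout
/// comment above.
const FIELDS: &[(&str, usize)] = &[
    ("request_id", 8), ("opcode", 1), ("priority", 1), ("flags", 2),
    ("_pad1", 4), ("offset", 8), ("len", 4), ("sgl_count", 1),
    ("_reserved", 3), ("data_region_offset", 8),
    ("compare_region_offset", 8), ("_pad", 16),
];

/// Byte offset of `field` within the wire struct: the sum of the
/// sizes of all fields declared before it.
fn wire_offset(field: &str) -> usize {
    let mut off = 0;
    for (name, size) in FIELDS {
        if *name == field {
            return off;
        }
        off += size;
    }
    off
}

/// Total wire size (sum of all field sizes).
fn wire_size() -> usize {
    FIELDS.iter().map(|(_, s)| s).sum()
}
```

This confirms the comments in the struct: `offset` begins at byte 16, `data_region_offset` at byte 32 (both 8-byte aligned), and the whole request occupies exactly one 64-byte cache line.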
/// Scatter-gather list entry for multi-segment I/O (see "Scatter-Gather I/O" below).
/// Size: 12 bytes (offset: Le64 + len: Le32). Cross-node wire format — all
/// multi-byte fields use Le types per DSM wire format policy
/// ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)).
#[repr(C)]
pub struct SglEntry {
/// Offset within the per-queue ServiceDataRegion.
pub region_offset: Le64,
/// Length of this segment in bytes.
pub len: Le32,
}
// Wire format: Le64(8) + Le32(4) = 12 bytes. Le types have alignment 1.
const_assert!(core::mem::size_of::<SglEntry>() == 12);
/// Block service wire protocol opcode. These values are INDEPENDENT of
/// `BioOp` values — the block service protocol has its own opcode space
/// because it includes operations (GetInfo, Abort, Reserve, CompareAndWrite,
/// etc.) that have no `BioOp` equivalent.
///
/// **Conversion**: The block service provider MUST convert between `BioOp`
/// and `BlockServiceOpcode` using explicit match arms, NOT numeric casting.
/// After the SF-192 fix, `BioOp` values match Linux's `req_op` (with gaps),
/// while `BlockServiceOpcode` uses sequential numbering. Numeric casting
/// (`bio.op as u8`) produces WRONG opcodes.
///
/// ```rust
/// fn bio_op_to_service_opcode(op: BioOp) -> BlockServiceOpcode {
/// match op {
/// BioOp::Read => BlockServiceOpcode::Read,
/// BioOp::Write => BlockServiceOpcode::Write,
/// BioOp::Flush => BlockServiceOpcode::Flush,
/// BioOp::Discard => BlockServiceOpcode::Discard,
/// BioOp::WriteZeroes => BlockServiceOpcode::WriteZeroes,
/// BioOp::SecureErase => BlockServiceOpcode::Discard, // mapped to discard on wire
/// BioOp::ZoneAppend => BlockServiceOpcode::Write, // treated as write on wire
/// }
/// }
/// ```
#[repr(u8)]
pub enum BlockServiceOpcode {
Read = 0,
Write = 1,
Flush = 2,
Discard = 3,
WriteZeroes = 4,
GetInfo = 5,
/// Abort a previously submitted request by request_id.
Abort = 6,
/// Reservation operations (see "Reservations for Shared Access" below).
Reserve = 7,
ReleaseReservation = 8,
Preempt = 9,
/// Atomic compare-and-write (see "Atomic Compare-and-Write" below).
/// Reads `len` bytes at `offset`, compares with `compare_region_offset`
/// buffer. If equal, writes `data_region_offset` buffer. If not equal,
/// fails with ECANCELED and returns the current data in
/// `compare_region_offset`.
CompareAndWrite = 10,
/// Reset the exported device (see "Error Recovery and Reconnection" below). Last-resort recovery
/// when Abort fails. Aborts all in-flight I/O, resets device state.
ResetDevice = 11,
}
/// I/O priority class. Maps to Linux I/O priority (ioprio) levels.
#[repr(u8)]
pub enum BlockServicePriority {
/// Background — lowest priority. Batch jobs, scrubbing.
Idle = 0,
/// Best-effort, low urgency (default for most workloads).
BestEffortLow = 1,
/// Best-effort, normal urgency.
BestEffort = 2,
/// Best-effort, high urgency.
BestEffortHigh = 3,
/// Real-time, low urgency. Latency-sensitive but not critical.
RealTimeLow = 4,
/// Real-time, normal urgency. Database journal commits.
RealTime = 5,
/// Real-time, high urgency. UPFS metadata operations.
RealTimeHigh = 6,
/// Real-time, critical. Fencing and reservation operations.
RealTimeCritical = 7,
}
bitflags! {
/// In-memory representation of block service flags.
///
/// The wire format in `BlockServiceRequest.flags` is `Le16`. Conversion:
/// - Deserialize: `BlockServiceFlags::from_bits_truncate(request.flags.to_ne())`
/// - Serialize: `Le16::from_ne(flags.bits())`
pub struct BlockServiceFlags: u16 {
/// Force Unit Access — bypass volatile write cache, ensure data
/// reaches persistent storage before completion. Maps to
/// `BioFlags::FUA` in the block layer.
const FUA = 1 << 0;
/// This request is part of a write barrier sequence
/// (see "Write Ordering and Barriers" below). Server must preserve ordering.
const BARRIER = 1 << 1;
/// Data integrity fields are present (see "Data Integrity" below).
/// Completion will include integrity verification result.
const DATA_INTEGRITY = 1 << 2;
}
}
/// Completion sent back to the requesting peer.
/// Size: 32 bytes (power-of-two for ring buffer slot alignment).
/// Layout: request_id(8) + status(4) + bytes_done(4) + info_len(4) +
/// integrity_status(1) + _reserved(3) + _pad(8) = 32.
/// Cross-node wire format — all multi-byte fields use Le types
/// per DSM wire format policy ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)).
#[repr(C, align(32))]
pub struct BlockServiceCompletion {
/// Matches the request_id from the original request.
pub request_id: Le64,
/// 0 on success, negative errno on failure.
/// CompareAndWrite: -ECANCELED if comparison failed.
/// Transmitted as `Le32` (unsigned wire representation of a signed i32).
/// Receiver converts: `status.to_ne() as i32`. This avoids introducing a
/// separate `Lei32` type — the Le* family covers unsigned integers only.
pub status: Le32,
/// Bytes transferred (for Read/Write). 0 for non-data ops.
pub bytes_done: Le32,
/// For GetInfo: serialized BlockServiceDeviceInfo follows as inline data.
/// For other ops: reserved, zero.
pub info_len: Le32,
/// Data integrity result (see "Data Integrity" below).
/// 0 = integrity check passed or not requested.
/// Non-zero = integrity error (DIF_GUARD_ERROR, DIF_REF_ERROR, DIF_APP_ERROR).
pub integrity_status: u8,
pub _reserved: [u8; 3],
/// Explicit padding to fill the 32-byte alignment boundary. Must be zeroed.
pub _pad: [u8; 8],
}
const_assert!(core::mem::size_of::<BlockServiceCompletion>() == 32);
Wire protocol: per-queue ring pairs on the peer transport. Each queue has a submission ring and a completion ring. Data transfers use remote write (server pushes read data into client's data region) and remote read (server fetches write data from client's data region) via the peer transport. The request/completion messages themselves are sent via ring pair entries.
Transport abstraction: All service provider wire structs use
transport-neutral addressing. Data references are region_offset: u64
values — offsets within the ServiceDataRegion established at
ServiceBind time (Section 5.1).
The peer transport layer maps these offsets to the concrete mechanism:
on RDMA, offset + bind-time rkey + base_addr form an RDMA Write/Read
target; on CXL, offset indexes into hardware-coherent shared memory;
on TCP, the sender transmits the data inline (remote memory access is
not available). Service providers never reference transport-specific
types (rkeys, RDMA work requests, etc.) — they use peer protocol ring
pairs and region offsets.
This is structurally identical to NVMe-oF over RDMA fabrics (submission queue + completion queue per CPU), but uses the native peer protocol instead of NVMe capsules.
15.13.3.2 Multi-Queue I/O¶
A single I/O queue is a bottleneck for high-IOPS devices. Modern NVMe SSDs deliver 1M+ IOPS; a single queue pair saturates at ~200-400K IOPS (limited by completion polling and doorbell overhead).
Client connection setup:
1. Client connects to server's BlockServiceProvider.
2. Server advertises max_queues_per_client and queue_depth.
3. Client creates N queue pairs (typically one per local CPU that will
issue I/O, up to max_queues_per_client).
4. Each queue pair is an independent reliable connected transport queue.
5. Client pins each queue to a local CPU. Server pins the corresponding
server-side queue to a server CPU.
I/O dispatch (client side):
cpu = smp_processor_id()
queue = export_queues[cpu % nr_queues]
queue.submit(request)
// No cross-CPU contention — each CPU uses its own queue.
I/O processing (server side):
// Each server queue thread is pinned to one CPU.
// Polls its submission ring, dispatches to local block layer,
// posts completions. No locks between queues.
Queue count negotiation: the client requests its preferred queue count
(typically min(nr_cpus, 32)). The server grants up to
max_queues_per_client. For a 32-core client talking to a 16-core server,
the server grants 16 queues. The client maps 2 CPUs per queue.
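The negotiation and dispatch rules above reduce to two small functions. This is a sketch of the arithmetic with hypothetical names, not the real connection path:

```rust
/// Queues granted to one client: the client's request, bounded by the
/// server's CPU count (one pinned queue thread per CPU) and by the
/// per-client cap. At least one queue is always granted.
fn grant_queues(requested: u16, server_cpus: u16, max_per_client: u16) -> u16 {
    requested.min(server_cpus).min(max_per_client).max(1)
}

/// Client-side dispatch: a given CPU always maps to the same queue,
/// so there is no cross-CPU contention on the submission ring.
fn queue_for_cpu(cpu: u32, nr_queues: u16) -> u16 {
    (cpu % u32::from(nr_queues)) as u16
}
```

For the 32-core client talking to a 16-core server in the example above, `grant_queues(32, 16, 32)` yields 16, and CPUs 0 and 16 both dispatch on queue 0 — two CPUs per queue.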
RDMA resource partitioning: NVMe-oF and DLM allocate QPs from separate
pools to prevent resource starvation. NVMe-oF allocates from the I/O QP
pool (budget: num_cpus × num_targets QPs). DLM allocates from the control
QP pool (budget: 2 × num_peers QPs). Both pools draw from the RDMA device's
total QP capacity. If either pool is exhausted, the requesting subsystem
queues allocation and retries on QP release — no cross-pool borrowing.
15.13.3.3 Write Ordering and Barriers¶
Filesystem journaling requires strict write ordering: journal data must
reach persistent storage before the commit record. The block layer expresses
this through write barriers (BioFlags::PREFLUSH, BioFlags::FUA).
Block service provider preserves write ordering within each queue:
Ordering guarantees:
1. WITHIN a single queue: requests complete in submission order.
Write A submitted before Write B → A completes before B.
This matches NVMe command ordering within a single SQ.
2. ACROSS queues: no ordering guarantee. Same as NVMe across
different SQs, same as local block layer across different CPUs.
3. FLUSH: drains all prior writes in ALL queues to persistent storage.
Server translates to blkdev_issue_flush() on the local device.
Flush completion means all prior writes are persistent.
4. FUA (Force Unit Access): this specific write bypasses volatile cache.
Server translates to `BioFlags::FUA` on the local block layer. The write
is persistent when the completion is returned.
5. BARRIER flag: server processes this request only after all prior
requests in the same queue have completed. Used by filesystem
journaling to sequence: writes → flush → commit_record(FUA).
Correctness argument: a filesystem on the client issues journal writes on one CPU (one queue), then a flush, then the commit record with FUA. All three go to the same queue (same CPU → same queue). Within-queue ordering guarantees the sequence: writes complete → flush drains to disk → commit record is FUA-written. This is the same guarantee that local NVMe provides.
15.13.3.4 Error Recovery and Reconnection¶
I/O timeout: each submitted request has a timeout (default: 30 seconds, configurable). If no completion arrives within the timeout:
- Client sends `Abort { request_id }` to the server.
- If the server responds with abort confirmation, the original request is failed with `ETIMEDOUT`. The client's block layer retries or fails upward depending on the filesystem's error handling.
- If the abort itself times out (server unreachable), the client transitions to reconnection.
Reconnection follows the same model as NVMe-oF reconnect (Section 15.13):
Reconnect protocol:
1. Client detects server unreachable (heartbeat Dead, or I/O + abort timeout).
2. Client enters RECONNECTING state. All new I/O is queued (not failed).
3. Client attempts to reconnect with exponential backoff:
initial=1s, max=30s, multiplier=2, jitter_frac=0.25.
Note: jitter prevents reconnect storms when many clients lose connectivity
simultaneously (same rationale as NVMe-oF reconnect in Section 15.13.2).
4. On successful reconnect:
a. Client re-creates queue pairs.
b. Client re-sends all in-flight (unacknowledged) requests.
c. Server detects duplicate request_ids and deduplicates.
d. Queued I/O is drained.
5. After 20 failed attempts (~10 minutes), client marks the export
OFFLINE. I/O fails with EIO. Manual reconnect via sysfs.
During RECONNECTING:
- Read I/O: queued (stalls the calling process).
- Write I/O: queued (filesystem journal stalls until reconnect).
- New opens of the block device: succeed (device is still registered).
- fsync: stalls until reconnect or OFFLINE.
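The client-side connection lifecycle above can be sketched as a small state machine (illustrative names; the real client also tracks queued I/O and timers):

```rust
#[derive(Debug, PartialEq)]
enum ExportState {
    Connected,
    /// All new I/O is queued, not failed, while in this state.
    Reconnecting { failed_attempts: u32 },
    /// I/O fails with EIO; manual reconnect via sysfs.
    Offline,
}

const MAX_RECONNECT_ATTEMPTS: u32 = 20;

/// State transition after one reconnect attempt while RECONNECTING.
fn after_attempt(state: ExportState, success: bool) -> ExportState {
    match state {
        ExportState::Reconnecting { failed_attempts } => {
            if success {
                ExportState::Connected
            } else if failed_attempts + 1 >= MAX_RECONNECT_ATTEMPTS {
                ExportState::Offline
            } else {
                ExportState::Reconnecting { failed_attempts: failed_attempts + 1 }
            }
        }
        // Connected and Offline do not change on attempt outcomes;
        // Offline is only left via an explicit operator action.
        other => other,
    }
}
```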
Server reboot recovery: client detects server reboot via PeerRegistry generation change. Client reconnects as above. Server-side volatile write cache (if enabled) is lost — client must assume unflushed writes are lost and rely on the filesystem's journal replay for consistency. This is the same guarantee as a local power loss: FUA writes survived, cached writes may not have.
In-flight I/O deduplication: the server maintains a sliding window of recently completed request IDs per client (size: 2 × queue_depth). On reconnect, if a retransmitted request_id matches a recently completed request, the server returns the cached completion without re-executing. This prevents duplicate writes after reconnect.
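In sketch form, the window is a bounded FIFO of (request_id, completion) pairs — here simplified to the status code, and `DedupWindow` is an illustrative name, not the in-kernel type:

```rust
use std::collections::VecDeque;

/// Sliding window of recently completed requests for one client.
/// Capacity = 2 × queue_depth; the oldest entry is evicted on overflow.
struct DedupWindow {
    cap: usize,
    entries: VecDeque<(u64, i32)>, // (request_id, cached completion status)
}

impl DedupWindow {
    fn new(queue_depth: usize) -> Self {
        DedupWindow { cap: 2 * queue_depth, entries: VecDeque::new() }
    }

    /// Record a completed request id with its completion status.
    fn record(&mut self, request_id: u64, status: i32) {
        if self.entries.len() == self.cap {
            self.entries.pop_front(); // evict oldest
        }
        self.entries.push_back((request_id, status));
    }

    /// A retransmitted id found here is answered from the cache
    /// instead of re-executing the I/O.
    fn cached_completion(&self, request_id: u64) -> Option<i32> {
        self.entries
            .iter()
            .find(|(id, _)| *id == request_id)
            .map(|(_, status)| *status)
    }
}
```

A duplicate write retransmitted after reconnect hits the window and receives the cached status; an id that has aged out of the window is treated as new, which is safe because the window outlives the in-flight set (2 × queue_depth ≥ maximum in-flight per queue).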
15.13.3.5 Reservations for Shared Access¶
When multiple peers need coordinated access to the same exported block device (e.g., for clustered filesystems — Section 15.14), they use block reservations managed through the DLM (Section 15.15).
/// Reservation type. Mirrors SCSI Persistent Reservation types
/// for compatibility with clustered filesystem expectations (GFS2, OCFS2).
#[repr(u8)]
pub enum BlockReservationType {
/// Write Exclusive — one peer can write, all can read.
WriteExclusive = 1,
/// Exclusive Access — one peer can read and write.
ExclusiveAccess = 2,
/// Write Exclusive, Registrants Only — registered peers can
/// write, all can read.
WriteExclusiveRegistrantsOnly = 3,
/// Exclusive Access, Registrants Only — only registered peers
/// can read or write.
ExclusiveAccessRegistrantsOnly = 4,
}
/// Reservation state for one export.
pub struct BlockReservationState {
/// Current reservation holder (None if unreserved).
holder: Option<PeerId>,
/// Reservation type.
res_type: BlockReservationType,
/// Registered peers (may access device under RegistrantsOnly types).
registrants: ArrayVec<PeerId, 16>,
/// DLM lock resource for this reservation.
dlm_resource: DlmLockResource,
/// Generation counter — incremented on every reservation change.
/// Used for fencing (stale reservations are rejected).
generation: u64,
}
Reservation flow (peer B reserves export on peer A):
- Peer B sends `Reserve { type: WriteExclusive }` to peer A.
- Peer A acquires a DLM lock on the reservation resource in exclusive mode. If another peer holds a conflicting reservation, the request blocks or fails with `EBUSY`.
- On DLM grant, peer A records peer B as the holder and responds success.
- Subsequent I/O from non-holders is rejected per the reservation type (e.g., writes from non-holders fail with `EACCES` under WriteExclusive).
Preemption: a peer with higher priority (or admin action) can preempt
an existing reservation via Preempt { request_id }. The DLM handles the
lock transfer; the preempted peer receives an asynchronous notification
and must cease I/O.
Fencing on peer failure: when a reservation-holding peer is declared Dead (heartbeat timeout), the DLM releases its locks. The export server clears the reservation and notifies remaining registrants. Clustered filesystems detect the reservation change and trigger journal replay for the failed peer.
SCSI PR compatibility: the reservation types map directly to SCSI Persistent Reservation types. Clustered filesystems (GFS2, OCFS2) that expect SCSI PR semantics work without modification — the block export translates reservation operations to DLM locks internally.
15.13.3.6 Multi-Path I/O¶
When a client has multiple network paths to the same export server (e.g., two transport devices, or a direct CXL link plus an RDMA link), block resource export supports multi-path I/O for both performance and high availability.
Multi-path model:
Client has two transport devices: NIC-A (port 1) and NIC-B (port 2).
Server exports block device with service_id=42.
Client creates two connections to the same export:
Connection 1: NIC-A → Server NIC-X (queues 0-7)
Connection 2: NIC-B → Server NIC-Y (queues 8-15)
I/O policy (configurable per-export):
round-robin: distribute I/O across all healthy paths.
active-standby: use path 1; failover to path 2 on failure.
min-latency: use the path with lowest measured RTT
(from topology graph edge weights, Section 5.2.9.5).
Failover:
1. Path failure detected (transport error or heartbeat miss on that link).
2. All queues on the failed path are drained (in-flight I/O retried
on surviving paths).
3. When the path recovers, queues are re-created and I/O is
rebalanced across all healthy paths.
Path identification: the client identifies paths by the pair
(local_nic, remote_nic). Multiple paths to the same service_id are
recognized as the same device. The client block device presents a single
/dev/umkaN device regardless of path count.
Relationship to topology graph: the topology graph (Section 5.2) models all links between peers. Multi-path I/O uses the same link information but operates at the block layer: the topology graph provides cost/latency for path selection; the block multi-path layer handles I/O distribution and failover. This is analogous to the separation between routing (L3) and link aggregation (L2) in networking.
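The three I/O policies above can be sketched as a single selection function. Names here are illustrative; `dispatch_seq` stands in for a per-export counter the real dispatch path would maintain, and RTT values would come from the topology graph edge weights:

```rust
/// Per-export multi-path I/O policy.
enum PathPolicy {
    RoundRobin,
    ActiveStandby,
    MinLatency,
}

/// One path = one (local_nic, remote_nic) pair.
struct IoPath {
    healthy: bool,
    rtt_us: u32,
}

/// Index of the path for the next I/O, or None if no path is healthy.
fn select_path(paths: &[IoPath], policy: &PathPolicy, dispatch_seq: u64) -> Option<usize> {
    let healthy: Vec<usize> = (0..paths.len()).filter(|&i| paths[i].healthy).collect();
    match policy {
        // Spread I/O evenly across every healthy path.
        PathPolicy::RoundRobin => {
            healthy.get(dispatch_seq as usize % healthy.len().max(1)).copied()
        }
        // First healthy path is active; the rest are standbys.
        PathPolicy::ActiveStandby => healthy.first().copied(),
        // Path with the lowest measured RTT.
        PathPolicy::MinLatency => healthy.into_iter().min_by_key(|&i| paths[i].rtt_us),
    }
}
```

Failed paths drop out of the healthy set, so failover is implicit: round-robin redistributes over the survivors, and active-standby promotes the next healthy path.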
15.13.3.7 Scope and Relationship to NVMe-oF/iSCSI¶
Block service provider is designed for uniform UmkaOS clusters. It provides the same functionality as NVMe-oF and iSCSI but integrated into the cluster infrastructure:
| Feature | Block Export | NVMe-oF/RDMA | iSCSI |
|---|---|---|---|
| Wire protocol | Native peer protocol | NVMe capsules | SCSI CDB over TCP/RDMA |
| Multi-queue | Per-CPU queue pairs | Per-CPU SQ/CQ | Per-session queues |
| Write ordering | In-queue ordering + FUA + Flush | NVMe ordering + FUA | SCSI ordering + FUA |
| Reservations | DLM-backed (SCSI PR compatible) | NVMe reservations | SCSI Persistent Reservations |
| Multi-path | Built-in (topology-aware) | ANA + dm-multipath | dm-multipath |
| Reconnection | Exponential backoff + dedup | NVMe-oF reconnect | iSCSI session recovery |
| Compare-and-write | Atomic CAS (Section 15.13) | NVMe Compare | SCSI COMPARE AND WRITE |
| Data integrity | T10-DIF compatible (Section 15.13) | NVMe PI | T10-DIF |
| I/O priority | 8-level priority (Section 15.13) | NVMe urgency | iSCSI task priority |
| Scatter-gather | Per-request SGL (Section 15.13) | NVMe SGL | iSCSI data segments |
| Max I/O negotiation | Connection setup (Section 15.13) | NVMe MDTS | iSCSI login params |
| Discovery | PeerRegistry (automatic) | Discovery controller | iSNS / SendTargets |
| Authentication | Peer capabilities | DH-HMAC-CHAP | CHAP |
| Configuration | Zero (auto-discovered) | nvmet-cli / configfs | tgtd / LIO configfs |
| Daemons required | None | nvmet kernel target | tgtd or LIO |
For non-UmkaOS initiators (Linux, Windows, ESXi), NVMe-oF (Section 15.13) and iSCSI targets remain available as compatibility protocols.
For device-native storage providers (firmware shim implementing the umka-protocol), a host-side block service provider is unnecessary. The device is directly addressable as a peer — remote hosts submit I/O via the peer protocol without any host-proxy layer.
15.13.3.8 Atomic Compare-and-Write¶
Atomic compare-and-write (CAS at block level) is essential for building high-performance clustered filesystems. It enables lock-free metadata updates: instead of acquiring a DLM lock, reading a metadata block, modifying it, and releasing the lock, the filesystem can read the block, prepare the update locally, and submit a single CompareAndWrite that atomically succeeds or fails.
CompareAndWrite flow:
1. Client reads metadata block at offset X (normal Read).
2. Client prepares updated metadata locally.
3. Client submits CompareAndWrite:
offset = X
len = block_size (typically 4096)
compare_region_offset → buffer containing the ORIGINAL data read in step 1
data_region_offset → buffer containing the UPDATED data from step 2
4. Server atomically:
a. Reads current data at offset X.
b. Compares with compare buffer (byte-for-byte).
c. If equal: writes data buffer to offset X. Returns success.
d. If not equal: does NOT write. Returns ECANCELED.
The current (conflicting) data is written into compare_region_offset
so the client can retry with the updated baseline.
5. Client on ECANCELED: re-read current data from compare_region_offset,
re-compute the update, retry from step 3.
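The flow above can be exercised as a small in-memory simulation. This is a hedged sketch, not the wire API: `compare_and_write` and `cas_update` are illustrative names standing in for the CompareAndWrite opcode (steps 4a-d) and the client retry loop (steps 1-5), and the "device" is a local buffer rather than a remote export.

```rust
// Hedged simulation of the CompareAndWrite protocol. A conflict (step 4d)
// returns the current data in the compare buffer so the client can rebase.

const BLOCK: usize = 8; // toy block size; real blocks are 512 B - 1 MB

/// Server side: atomic read + compare + conditional write (steps 4a-d).
/// On mismatch, overwrites `compare` with the current data and fails.
fn compare_and_write(
    dev: &mut [u8; BLOCK],
    compare: &mut [u8; BLOCK],
    data: &[u8; BLOCK],
) -> Result<(), ()> {
    if dev == compare {
        *dev = *data; // step 4c: equal, write succeeds
        Ok(())
    } else {
        *compare = *dev; // step 4d: conflict, return current baseline
        Err(())
    }
}

/// Client side: read-modify-CAS loop. `update` computes the new block
/// contents from the observed baseline (steps 1-3, retry per step 5).
fn cas_update(dev: &mut [u8; BLOCK], mut update: impl FnMut(&[u8; BLOCK]) -> [u8; BLOCK]) {
    let mut baseline = *dev; // step 1: read current data
    loop {
        let new = update(&baseline); // step 2: prepare update locally
        let mut cmp = baseline;
        match compare_and_write(dev, &mut cmp, &new) {
            // step 3
            Ok(()) => return,
            Err(()) => baseline = cmp, // step 5: rebase on returned data, retry
        }
    }
}
```

In a single-threaded simulation the CAS never conflicts; the conflict path is what a concurrent peer would trigger, forcing the rebase-and-retry in step 5.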
Atomicity guarantee:
The server executes CompareAndWrite under a per-range spinlock keyed by
(device, offset / max_compare_write_bytes). The lock covers read +
compare + conditional write as a single critical section. This spinlock
is separate from the device's I/O submission path — it only serializes
overlapping CAS operations. Non-CAS reads and writes proceed without the
lock (they are naturally ordered by the submission queue). The per-range
granularity ensures that CAS operations on non-overlapping ranges execute
in parallel with no contention.
Maximum CAS size: limited by max_compare_write_bytes negotiated at
connection setup (Section 15.13). Minimum: 512 bytes (one sector).
Typical: 4096 bytes (one filesystem block). Maximum: 1 MB (for large
metadata structures). Larger CAS increases the chance of conflicts; UPFS
metadata blocks are typically 4-64 KB.
FUA support: CompareAndWrite respects the FUA flag. With FUA set, the written data reaches persistent storage before the completion is returned. Essential for UPFS metadata integrity.
Interaction with reservations: CompareAndWrite is subject to reservation checks — a peer without the correct reservation type cannot perform CAS on a reserved device.
15.13.3.9 I/O Priority and QoS¶
When multiple clients share an exported block device (common in UPFS
deployments), the server must arbitrate I/O fairly. The priority field in
BlockServiceRequest enables server-side QoS enforcement.
Priority model:
8 priority levels (BlockServicePriority), mapped to the server's
local I/O scheduler:
Level 0 (Idle): Background scrub, RAID rebuild.
Level 1-3 (BestEffort): Normal application I/O.
Level 4-6 (RealTime): Latency-sensitive workloads, UPFS metadata.
Level 7 (RealTimeCritical): Fencing, reservation operations.
Server-side enforcement:
- Each priority level gets a token bucket (configurable rate + burst).
- Higher-priority I/O is dispatched first.
- Within the same priority: FIFO per queue.
- Starvation prevention: even Idle I/O gets a minimum share
(default: 5% of device bandwidth).
Client-side mapping:
- Process I/O priority (ioprio_set) maps to BlockServicePriority.
- UPFS metadata operations use RealTimeHigh (level 6).
- UPFS journal commits use RealTime (level 5) + FUA.
- Regular data I/O uses BestEffort (level 2).
Linux ioprio mapping: Linux encodes ioprio as (class << 13) | data. UmkaOS converts
BlockServicePriority to ioprio encoding at the syscall compatibility boundary
(Section 19.1).
| BlockServicePriority | Linux ioprio class | ioprio data |
|---|---|---|
| Idle = 0 | IOPRIO_CLASS_IDLE | 0 |
| BestEffortLow = 1 | IOPRIO_CLASS_BE | 7 |
| BestEffort = 2 | IOPRIO_CLASS_BE | 5 |
| BestEffortHigh = 3 (default) | IOPRIO_CLASS_BE | 4 |
| RealTimeLow = 4 | IOPRIO_CLASS_BE | 2 |
| RealTime = 5 | IOPRIO_CLASS_BE | 0 |
| RealTimeHigh = 6 | IOPRIO_CLASS_RT | 4 |
| RealTimeCritical = 7 | IOPRIO_CLASS_RT | 0 |
The mapping preserves relative ordering within each Linux class. UmkaOS's eight levels
provide finer granularity than Linux's three classes (IDLE, BE, RT) while remaining
fully compatible at the syscall boundary: ioprio_get() returns the mapped Linux value,
and ioprio_set() maps the Linux value back to the closest BlockServicePriority level.
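The mapping table translates directly into a conversion function. A minimal sketch, assuming illustrative constant and enum definitions (the real conversion lives at the Section 19.1 compatibility boundary):

```rust
// Hedged sketch: BlockServicePriority -> Linux ioprio encoding,
// per the table above. ioprio = (class << 13) | data.

const IOPRIO_CLASS_SHIFT: u16 = 13;
const IOPRIO_CLASS_RT: u16 = 1;
const IOPRIO_CLASS_BE: u16 = 2;
const IOPRIO_CLASS_IDLE: u16 = 3;

#[derive(Clone, Copy)]
#[repr(u8)]
enum BlockServicePriority {
    Idle = 0, BestEffortLow = 1, BestEffort = 2, BestEffortHigh = 3,
    RealTimeLow = 4, RealTime = 5, RealTimeHigh = 6, RealTimeCritical = 7,
}

fn to_ioprio(p: BlockServicePriority) -> u16 {
    // (class, data) pairs copied row-for-row from the mapping table.
    let (class, data) = match p {
        BlockServicePriority::Idle => (IOPRIO_CLASS_IDLE, 0),
        BlockServicePriority::BestEffortLow => (IOPRIO_CLASS_BE, 7),
        BlockServicePriority::BestEffort => (IOPRIO_CLASS_BE, 5),
        BlockServicePriority::BestEffortHigh => (IOPRIO_CLASS_BE, 4),
        BlockServicePriority::RealTimeLow => (IOPRIO_CLASS_BE, 2),
        BlockServicePriority::RealTime => (IOPRIO_CLASS_BE, 0),
        BlockServicePriority::RealTimeHigh => (IOPRIO_CLASS_RT, 4),
        BlockServicePriority::RealTimeCritical => (IOPRIO_CLASS_RT, 0),
    };
    (class << IOPRIO_CLASS_SHIFT) | data
}
```

The reverse direction (ioprio_set) would decode class and data the same way and pick the nearest BlockServicePriority row.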
Per-client bandwidth limits: the server can enforce per-client bandwidth and IOPS limits via export configuration (sysfs). This prevents one client from monopolizing the device. Limits are enforced by the token bucket independently of I/O priority.
15.13.3.10 Scatter-Gather I/O¶
Large I/O requests (1 MB+ stripe writes in UPFS) often span multiple non-contiguous memory regions on the client. Without scatter-gather support, the client must copy data into a contiguous buffer — defeating zero-copy.
Scatter-gather model:
Request with sgl_count = 0:
Single contiguous buffer. data_region_offset points to the data.
(This is the common case for small I/O.)
Request with sgl_count = N (1-15):
data_region_offset points to an array of N SglEntry structs.
Each SglEntry describes one data region segment: {region_offset, len}.
Total I/O length = sum of all segment lengths = request.len.
Server processes the SGL by issuing N remote read/write operations
(one per segment), coalesced into a single local block I/O.
Example: 1 MB striped write from UPFS
SGL: [
{ region_offset=0x1000, len=256KB }, // journal header
{ region_offset=0x5000, len=512KB }, // data block 1
{ region_offset=0xA000, len=256KB }, // data block 2
]
sgl_count = 3, len = 1MB
Server issues 3 remote reads, assembles into 1MB contiguous write
to the local block device.
Maximum SGL entries: 15 per request (sgl_count is u8, capped at 15
to keep the SGL within one transport send inline threshold). For I/O
requiring more segments, the client splits into multiple requests.
Maximum total I/O size per request: max_io_bytes negotiated at
connection setup (Section 15.13). The sum of all SGL segment lengths
must not exceed this.
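The SGL rules above (entry layout, 15-entry cap, length-sum invariant, negotiated size ceiling) can be sketched as the validation a server would run before processing a request. Struct and function names are illustrative:

```rust
// Hedged sketch of server-side SGL validation. SglEntry mirrors the
// {region_offset, len} description above; limits follow the spec text.

struct SglEntry {
    region_offset: u64, // offset into the client's registered data region
    len: u32,           // segment length in bytes
}

fn validate_sgl(
    entries: &[SglEntry],
    request_len: u64,  // request.len from the wire header
    max_io_bytes: u64, // negotiated at connection setup
) -> Result<(), &'static str> {
    if entries.len() > 15 {
        return Err("sgl_count exceeds 15");
    }
    let total: u64 = entries.iter().map(|e| e.len as u64).sum();
    if total != request_len {
        return Err("segment lengths do not sum to request.len");
    }
    if total > max_io_bytes {
        return Err("exceeds negotiated max_io_bytes");
    }
    Ok(())
}
```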
15.13.3.11 Data Integrity (T10-DIF Compatible)¶
For production clustered filesystems, silent data corruption must be detected. The block service provider supports end-to-end data integrity compatible with T10-DIF (Data Integrity Field), the industry standard used by both SCSI and NVMe.
/// Data integrity metadata. Appended after each protected block
/// (typically 512 or 4096 bytes) when DATA_INTEGRITY flag is set.
/// Size: 8 bytes per protected block (T10-DIF Type 1 layout).
#[repr(C)]
pub struct DataIntegrityField {
/// CRC-16 of the data block (T10-DIF guard tag).
/// Computed by the client before remote write; verified by the server
/// before writing to disk; re-verified on read before returning.
pub guard: Le16,
/// Application tag. UPFS uses this for inode number or metadata type.
/// Enables detection of misdirected writes (data written to wrong block).
pub app_tag: Le16,
/// Reference tag. Contains the expected LBA (lower 32 bits).
/// Detects misdirected writes where data is written to the correct
/// device but wrong offset.
pub ref_tag: Le32,
}
// Wire/on-disk format (T10-DIF): guard(Le16=2) + app_tag(Le16=2) + ref_tag(Le32=4) = 8 bytes.
// All fields little-endian per CLAUDE.md rule 12 (wire struct crossing node boundary).
const_assert!(core::mem::size_of::<DataIntegrityField>() == 8);
Protection path (end-to-end):
Client:
1. Compute CRC-16 guard for each data block.
2. Set app_tag (filesystem-assigned), ref_tag (LBA).
3. Append DIF metadata after each data block in the data region buffer.
4. Set DATA_INTEGRITY flag in request.
Network:
5. The transport layer adds its own integrity protection (RDMA iCRC,
TCP checksums, CXL link CRC), giving double protection in flight:
the DIF CRC-16 guard plus the transport checksum.
Server:
6. Verify guard, app_tag, ref_tag before writing to local device.
7. If local device supports T10-DIF (PI): pass DIF through to device.
Device verifies again on write (triple protection).
8. If local device does NOT support PI: server verifies DIF, strips it,
writes raw data. DIF is re-computed on read.
Server (on read):
9. Read data from device (with DIF if supported, without if not).
10. Compute/verify DIF.
11. Remote write data + DIF to client buffer (via peer transport).
Client:
12. Verify guard, app_tag, ref_tag on received data.
13. Strip DIF, return data to filesystem.
Integrity error handling: if any DIF check fails (guard mismatch, ref_tag
mismatch, app_tag mismatch), the server returns an error with
integrity_status set in the completion. The client retries from a different
path (if multi-path) or returns EIO to the filesystem. UPFS logs the
integrity violation via FMA (Section 20.1).
Negotiation: data integrity support is negotiated at connection setup (Section 15.13). Both client and server must support it. If the server's underlying device supports T10-DIF (PI), end-to-end protection covers the entire path including the physical media. If not, the server provides software DIF (covers network + server memory, not physical media).
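The guard tag is standard CRC-16/T10-DIF (polynomial 0x8BB7, zero initial value, no bit reflection, no final XOR). A bitwise reference sketch of the guard computation from steps 1 and 10; production code would use a table-driven or instruction-accelerated (PCLMULQDQ/CRC) variant:

```rust
// Hedged reference implementation of the T10-DIF guard tag (CRC-16,
// polynomial 0x8BB7, MSB-first). Computed by the client before a remote
// write, verified by the server, re-verified on read.

fn crc16_t10dif(data: &[u8]) -> u16 {
    let mut crc: u16 = 0;
    for &byte in data {
        crc ^= (byte as u16) << 8; // feed the next byte, MSB-first
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 {
                (crc << 1) ^ 0x8BB7 // reduce by the T10-DIF polynomial
            } else {
                crc << 1
            };
        }
    }
    crc
}
```

The standard check value for this CRC variant is 0xD0DB over the ASCII string "123456789"; an all-zero block yields a guard of 0 (zero init, no final XOR).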
15.13.3.12 Connection Setup and Capability Negotiation¶
When a client connects to a block export, the server and client negotiate capabilities and limits. This replaces the complex login phase of iSCSI and the NVMe-oF Connect command with a single exchange.
/// Server advertises these capabilities in the connection response.
/// The client uses the minimum of its own and the server's capabilities.
///
/// This struct crosses node boundaries in `ConnectResponse`. Per the DSM
/// wire format policy ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)):
/// all multi-byte integer fields use `Le*` types. Bool fields use `u8`
/// (0/1) to avoid Rust UB from non-0/1 bytes received from remote peers.
#[repr(C)]
pub struct BlockServiceDeviceInfo {
/// Export identifier.
pub service_id: ServiceInstanceId,
/// Block device name (human-readable, for diagnostics).
pub name: [u8; 64],
/// Total device capacity in bytes.
pub capacity_bytes: Le64,
/// Logical block size (typically 512 or 4096).
pub block_size: Le32,
/// Physical block size (alignment hint for optimal I/O).
pub physical_block_size: Le32,
/// Maximum I/O size in bytes per request. Client must not submit
/// requests with len > max_io_bytes. Typical: 1 MB - 4 MB.
/// Determined by: min(server_rdma_max_msg, device_max_transfer,
/// server_configured_limit).
pub max_io_bytes: Le32,
/// Maximum compare-and-write size in bytes. 0 = CAS not supported.
/// Typical: 4096 (one FS block) to 1 MB (large metadata).
pub max_compare_write_bytes: Le32,
/// Maximum I/O queues the server will grant to this client.
pub max_queues: Le16,
/// Maximum queue depth (outstanding requests per queue).
pub max_queue_depth: Le16,
/// Server supports data integrity (T10-DIF).
pub supports_integrity: u8, // 0 = false, 1 = true
/// Server supports scatter-gather.
pub supports_sgl: u8, // 0 = false, 1 = true
/// Maximum SGL entries per request (0 if !supports_sgl).
pub max_sgl_entries: u8,
/// Device supports discard (TRIM/UNMAP).
pub supports_discard: u8, // 0 = false, 1 = true
/// Device supports write zeroes.
pub supports_write_zeroes: u8, // 0 = false, 1 = true
/// Device is read-only.
pub read_only: u8, // 0 = false, 1 = true
/// Volatile write cache present (Flush is meaningful).
pub has_volatile_cache: u8, // 0 = false, 1 = true
/// Explicit padding (1 byte) for 4-byte alignment of next field.
pub _pad: u8,
/// Optimal I/O alignment in bytes (for best performance).
/// Client should align offsets and lengths to this boundary.
pub optimal_io_alignment: Le32,
}
// Le types are byte-array-backed (alignment 1). ServiceInstanceId is 8 bytes (Le64).
// Layout: 8 (service_id) + 64 (name) + 8 + 4 + 4 + 4 + 4 + 2 + 2 + 7×1 + 1 (pad) + 4 = 112.
const_assert!(size_of::<BlockServiceDeviceInfo>() == 112);
Connection handshake:
1. Client sends ConnectRequest:
{ service_id, requested_queues, requested_queue_depth,
want_integrity, want_sgl, client_max_io_bytes }
2. Server responds with ConnectResponse:
{ status, device_info: BlockServiceDeviceInfo }
The device_info contains negotiated values (min of client request
and server capability).
3. Client creates queue pairs based on negotiated queue count.
4. Client registers data regions with the peer transport, sized
based on max_io_bytes and max_queue_depth.
5. I/O can begin.
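Step 2's rule ("negotiated values = min of client request and server capability") can be sketched as follows. The request fields mirror the ConnectRequest shown above; the helper and the result struct are illustrative, not the wire format:

```rust
// Hedged sketch of capability negotiation. Boolean features require
// both sides to support them; numeric limits take the minimum.

struct ConnectRequest {
    requested_queues: u16,
    requested_queue_depth: u16,
    want_integrity: bool,
    want_sgl: bool,
    client_max_io_bytes: u32,
}

struct Negotiated {
    queues: u16,
    queue_depth: u16,
    integrity: bool,
    sgl: bool,
    max_io_bytes: u32,
}

fn negotiate(
    req: &ConnectRequest,
    srv_max_queues: u16,
    srv_max_depth: u16,
    srv_integrity: bool,
    srv_sgl: bool,
    srv_max_io: u32,
) -> Negotiated {
    Negotiated {
        // at least one queue of depth one is always granted
        queues: req.requested_queues.min(srv_max_queues).max(1),
        queue_depth: req.requested_queue_depth.min(srv_max_depth).max(1),
        integrity: req.want_integrity && srv_integrity,
        sgl: req.want_sgl && srv_sgl,
        max_io_bytes: req.client_max_io_bytes.min(srv_max_io),
    }
}
```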
Capability gating: remote block access requires CAP_BLOCK_REMOTE
(Section 22.5). Checked
once at connection setup, not per-I/O.
Data region authentication: The ConnectRequest carries a CapabilityToken
(signed by the capability subsystem,
Section 22.5) proving
the requester holds CAP_BLOCK_ACCESS for the target device. The responder
validates the token signature and scope before establishing data regions in the
ConnectResponse. This prevents unauthorized nodes from obtaining remote
memory access to device data buffers. The token is validated once at connection
setup; subsequent I/O operations on the established connection are authorized
by the transport binding established at bind time (revocable via region
re-registration if the capability is later revoked).
Discovery: hosts exporting block devices advertise BLOCK_STORAGE in
their PeerRegistry capabilities
(Section 5.2). Remote peers
discover available exports by querying
PeerRegistry::peers_with_cap(BLOCK_STORAGE), then sending GetInfo to
the exporting peer to retrieve the list of available exports with their
BlockServiceDeviceInfo.
Why two-phase discovery (PeerCapFlags + GetInfo RPC): block service uses a two-phase discovery model because block device properties (capacity, block size, cache status) change at runtime (online resize, cache mode switch). Inline properties (32 bytes, set at advertisement time in PeerCapFlags) cannot reflect runtime changes. The GetInfo RPC fetches current device state, ensuring the client sees accurate geometry before connecting. Other capability service providers (e.g., serial, USB) use inline properties because their advertised characteristics are static for the lifetime of the advertisement.
15.13.3.13 Client-Side Block Device (BlockServiceClient)¶
The server-side BlockServiceProvider and wire protocol are defined above. This
section specifies the client-side kernel module that turns a ServiceBindAck
into a usable local block device. The client registers as a standard
BlockDeviceOps implementation (Section 15.2),
so filesystems, dm-*, LVM, and every other block consumer work without modification.
Tier assignment: Tier 1 (Evolvable). The client is an umka-block module running in a hardware-isolated domain (MPK/POE/DACR on supported architectures, Tier 0 fallback on RISC-V/s390x/LoongArch64). A client crash triggers driver reload (~50-150ms); in-flight I/O is re-submitted from the block layer's retry queue. No kernel panic.
Phase: Phase 3 (requires RDMA stack, peer protocol, block layer, and block service provider to be functional).
15.13.3.13.1 BlockServiceClient Struct¶
// umka-block/src/service_client.rs
/// Memory region registered with the peer transport at ServiceBind time.
/// Backing depends on transport: RDMA MR, CXL shared-memory window,
/// TCP bounce buffer. Service providers access it via region_offset
/// values in wire structs.
pub struct ServiceDataRegion {
/// Local virtual address of the region base.
base: *mut u8,
/// Size of the region in bytes.
size: usize,
/// Opaque transport handle (RDMA lkey, CXL window ID, etc.).
transport_handle: u64,
}
/// Client-side module that creates a local block device backed by a remote
/// BlockServiceProvider. Holds connection state, per-CPU transport queues,
/// and adaptive polling state. One instance per remote block device.
///
/// Implements `BlockDeviceOps` — the block layer routes bios here exactly
/// as it would for a local NVMe device.
pub struct BlockServiceClient {
/// Remote peer hosting the BlockServiceProvider.
peer_id: PeerId,
/// ServiceInstanceId of the remote export (from ServiceBindAck).
service_id: ServiceInstanceId,
/// Negotiated device info (capacity, block size, limits).
/// Immutable after connection setup; replaced atomically on reconnect
/// if the remote device geometry changed (e.g., online resize).
device_info: RcuCell<BlockServiceDeviceInfo>,
/// Per-queue state. Array length = negotiated queue count.
/// Index = queue_id (0..nr_queues-1). Each CPU maps to one queue
/// via the `cpu_to_queue` table below.
queues: ArrayVec<ClientQueue, 64>,
/// CPU-to-queue mapping. Populated at connection setup based on
/// negotiated queue count: `cpu_to_queue[cpu] = cpu % nr_queues`.
/// Length = nr_possible_cpus (runtime-discovered). Allocated once
/// from slab at connection time (warm path).
cpu_to_queue: Box<[u16]>,
/// Connection state machine.
state: AtomicU8, // ClientState as u8
/// Pre-registered data regions for bulk transfer. One region per queue,
/// sized to hold `queue_depth × max_io_bytes` of concurrent I/O.
/// Registered once at connection setup with the peer transport; avoids
/// per-I/O registration overhead (saves ~1-3μs per I/O).
data_regions: ArrayVec<ServiceDataRegion, 64>,
/// Multipath state (None if single-path).
multipath: Option<MultipathState>,
/// Block device handle for deregistration on disconnect.
bdev_handle: Option<BlockDeviceHandle>,
/// Reconnection backoff state.
reconnect: ReconnectState,
/// Per-device I/O timeout in milliseconds. Default: 30_000.
/// Range: 1_000..=600_000 (1 second to 10 minutes).
/// Values outside this range are clamped on write.
/// Configurable via sysfs at `/sys/block/umkaXpYbZ/queue/io_timeout`.
io_timeout_ms: AtomicU32,
}
/// Per-queue client state. Each queue has its own peer queue pair,
/// request/completion rings, and polling thread. Queues are fully independent
/// — no locks between them on the I/O submission or completion paths.
pub struct ClientQueue {
/// Peer transport queue pair connected to the server's corresponding
/// BlockServiceQueue. Reliable connected mode.
qp: PeerQueuePair,
/// Request IDs for in-flight tracking. Pre-allocated bitmap +
/// request metadata array. Size = queue_depth.
inflight: InflightTracker,
/// Adaptive poll/interrupt mode for completions (see below).
poll_mode: AtomicU8, // PollMode as u8
/// Completion thread handle (one per queue).
completion_thread: Option<TaskHandle>,
/// Queue index (matches server-side queue index).
queue_id: u16,
/// Data region for this queue's bulk transfers.
data_region: ServiceDataRegion,
}
/// Tracks in-flight requests per queue. Fixed-size, no heap allocation
/// on the I/O path.
pub struct InflightTracker {
/// Bitmap of in-use request IDs (1 = in-flight).
/// Size: queue_depth bits, rounded up to u64 words.
bitmap: ArrayVec<AtomicU64, 16>, // supports up to 1024 queue depth (16*64=1024)
/// Per-slot metadata for in-flight requests.
slots: Box<[InflightSlot]>,
/// Queue depth (number of slots).
depth: u16,
}
/// Bitmap allocation algorithm (lock-free, O(1) amortized):
///
/// Allocate:
/// 1. `hint = per-CPU last_allocated_hint` (avoids contention across CPUs
/// scanning the same word).
/// 2. `word = bitmap[hint / 64]`
/// 3. If word has any zero bits:
/// `bit = ctz(!word.load(Acquire))` // count trailing zeros of inverted
/// attempt CAS: `word.compare_exchange(old, old | (1 << bit), AcqRel, Acquire)`
/// if CAS succeeds: update hint, return `hint_base + bit`
/// else: retry same word (contention — another CPU claimed this bit)
/// 4. If word is all 1s (full): advance hint to next word, wrap at `depth / 64`.
/// After scanning all words without finding a free bit:
/// return `Err(QueueFull)` → `BLK_STS_RESOURCE` (block layer re-queues the bio).
///
/// Free:
/// `bitmap[slot / 64].fetch_and(!(1 << (slot % 64)), Release)`
///
/// This is the standard lock-free bitmap allocator used in high-performance
/// I/O stacks (NVMe blk-mq tag allocator, SPDK). The per-CPU hint eliminates
/// false sharing: each CPU scans from where it last succeeded, so concurrent
/// CPUs naturally spread across different bitmap words.
/// Metadata for one in-flight I/O request.
/// kernel-internal, not KABI — pointer-width-dependent (contains *mut Bio).
#[repr(C, align(64))]
pub struct InflightSlot {
/// Original bio pointer (for completion callback).
///
/// SAFETY: The bio pointer is valid for the entire duration the slot is
/// marked as in-use (bitmap bit set). The block layer guarantees that a
/// bio is not freed until its completion callback has been invoked, and
/// BlockServiceClient only invokes the callback when clearing the bitmap
/// bit (in `process_completion`). Therefore, the pointer is always valid
/// when accessed through a set bitmap bit. The Release ordering on
/// bitmap clear ensures the bio completion is visible before the slot
/// is reused.
bio: *mut Bio,
/// Submission timestamp (nanoseconds, monotonic). For timeout detection.
submit_ns: u64,
/// Request ID assigned to this slot (= slot index, unique per queue).
request_id: u64,
/// Which multipath path this request was submitted on (0 if single-path).
path_index: u8,
/// Retry count for this request (0 = first attempt).
retries: u8,
_pad: [u8; 6],
}
/// Connection state machine.
#[repr(u8)]
pub enum ClientState {
/// Initial state. No connection to server.
Disconnected = 0,
/// ServiceBind sent, awaiting ServiceBindAck.
Connecting = 1,
/// Queue pairs being created and connected.
QueueSetup = 2,
/// Fully connected. I/O flows normally.
Active = 3,
/// Connection lost. I/O is queued, reconnection in progress.
Reconnecting = 4,
/// Max reconnect attempts exceeded. I/O fails with EIO.
Offline = 5,
/// Graceful disconnect in progress. Draining in-flight I/O.
Draining = 6,
}
/// Adaptive polling mode for completion processing.
#[repr(u8)]
pub enum PollMode {
/// Busy-poll completions. Used when I/O rate exceeds the poll
/// threshold (default: 10K IOPS per queue). Lowest latency,
/// highest CPU usage. SPDK-inspired.
Poll = 0,
/// Interrupt-driven. Completion queue event triggers wakeup.
/// Used when queue is idle or below the poll threshold.
/// Saves CPU at the cost of ~2-5μs interrupt latency.
Interrupt = 1,
/// Hybrid: poll for `poll_spin_us` microseconds after each
/// completion batch, then fall back to interrupt if no new
/// completions arrive. Default mode.
Hybrid = 2,
}
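The InflightTracker bitmap algorithm described above reduces to a runnable fragment using std atomics. A hedged sketch: it keeps the same ctz scan, CAS loop, and Acquire/AcqRel/Release orderings, but scans from word 0 instead of the per-CPU hint:

```rust
// Hedged sketch of the lock-free inflight bitmap (alloc = find zero bit,
// CAS it to 1; free = fetch_and clear with Release ordering).
use std::sync::atomic::{AtomicU64, Ordering};

struct Bitmap {
    words: Vec<AtomicU64>,
    depth: usize, // number of valid slots
}

impl Bitmap {
    fn new(depth: usize) -> Self {
        Bitmap {
            words: (0..(depth + 63) / 64).map(|_| AtomicU64::new(0)).collect(),
            depth,
        }
    }

    /// Allocate a free slot ID, or None if all `depth` slots are in flight
    /// (the caller maps None to BLK_STS_RESOURCE so the block layer re-queues).
    fn alloc(&self) -> Option<usize> {
        for (wi, word) in self.words.iter().enumerate() {
            loop {
                let old = word.load(Ordering::Acquire);
                let free = !old;
                if free == 0 {
                    break; // word full, try the next one
                }
                let bit = free.trailing_zeros() as usize; // ctz of inverted word
                let slot = wi * 64 + bit;
                if slot >= self.depth {
                    break; // only padding bits remain in this word
                }
                if word
                    .compare_exchange(old, old | (1u64 << bit), Ordering::AcqRel, Ordering::Acquire)
                    .is_ok()
                {
                    return Some(slot);
                }
                // CAS lost: another thread claimed a bit; retry this word
            }
        }
        None
    }

    /// Release a slot. Release ordering makes the completion visible
    /// before the slot can be reused.
    fn free(&self, slot: usize) {
        self.words[slot / 64].fetch_and(!(1u64 << (slot % 64)), Ordering::Release);
    }
}
```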
15.13.3.13.2 Device Registration¶
When BlockServiceClient connects successfully, it registers a block device
with the umka-block layer. The device appears as a standard block device
accessible to filesystems, dm-*, LVM, and all block consumers.
Device naming: /dev/umka/peer{N}_blk{M} where N is the PeerId (u64,
rendered as hex) and M is the service instance index on that peer. These
are NOT /dev/sd* (reserved for local SCSI) or /dev/nvme* (reserved for
local NVMe). The umka/ subdirectory groups all cluster block devices.
Symlinks in /dev/disk/by-id/ use the format umka-{service_id} for
stable identification across reconnects.
sysfs integration: the device appears in /sys/block/umkaXpYbZ/ with
standard block device attributes (size, queue/, stat). Additional
cluster-specific attributes:
| sysfs path | Content |
|---|---|
| device/peer_id | Remote PeerId (hex) |
| device/service_id | ServiceInstanceId (hex) |
| device/state | Current ClientState name |
| device/transport | rdma or tcp |
| queue/io_timeout | Per-request timeout in ms (r/w) |
| queue/nr_queues | Number of transport queues |
| queue/queue_depth | Depth per queue |
| queue/poll_mode | poll, interrupt, or hybrid (r/w) |
| device/multipath/policy | Multipath policy name (r/w, if multipath) |
| device/multipath/paths | Per-path state table |
BlockDeviceOps implementation:
impl BlockDeviceOps for BlockServiceClient {
/// Convert bio → BlockServiceRequest and submit to the transport queue
/// for the current CPU. No intermediate request queue, no I/O
/// scheduler between client and network — the server runs its own
/// scheduler ([Section 15.13](#block-storage-networking--io-priority-and-qos)).
///
/// This is the hot path. Zero heap allocation. The `data_region` is
/// pre-registered with the transport, so no per-I/O registration is
/// needed; bio pages are staged into it (see step 4 below).
fn submit_bio(&self, bio: &mut Bio) -> Result<()> {
// 1. Check state. If not Active, queue or fail.
// Reconnecting → queue bio in backlog (bounded, queue_depth × 4).
// Offline → return -EIO immediately.
// Other → return -ENODEV.
// 2. Select queue: cpu_to_queue[smp_processor_id()].
// If multipath: path_select() first, then queue on chosen path.
// 3. Allocate inflight slot (bitmap scan, O(1) amortized).
// If no slots → return BLK_STS_RESOURCE (block layer retries).
// 4. Build BlockServiceRequest from bio:
// - request_id = slot index (unique per queue).
// - opcode = bio.op → BlockServiceOpcode mapping:
// BioOp::Read → Read, BioOp::Write → Write,
// BioOp::Flush → Flush, BioOp::Discard → Discard,
// BioOp::WriteZeroes → WriteZeroes.
// - offset = bio.start_lba × device_info.block_size.
// - len = sum of bio segment lengths.
// - flags: FUA if bio.flags.contains(BioFlags::FUA),
// BARRIER if bio.flags.contains(BioFlags::PREFLUSH).
// - priority: bio.cgroup_id → ioprio → BlockServicePriority
// (table in "I/O Priority and QoS" above).
// - data_region_offset: points into pre-registered data_region.
// For writes: bio pages are COPIED into a slot in data_region
// (the bio's own page frames are not pre-registered; only
// data_region is registered with the transport). For reads:
// the server will remote-write into the same data_region slot.
// - sgl_count: if bio has >1 segment and server supports SGL,
// build inline SGL entries. Otherwise, bounce into contiguous
// buffer within data_region.
// 5. Record bio pointer and timestamp in inflight slot.
// 6. Send BlockServiceRequest via the queue's ring pair.
// 7. Return Ok(()). Completion is asynchronous.
}
fn flush(&self) -> Result<()> {
// Submit Flush opcode synchronously (wait on completion).
// Uses bio_submit_and_wait() which sets bio_sync_end_io callback.
}
fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()> {
// Submit Discard opcode. Returns ENOSYS if
// !device_info.supports_discard.
}
fn get_info(&self) -> BlockDeviceInfo {
// Read from RcuCell<BlockServiceDeviceInfo>, convert to
// BlockDeviceInfo. Fields map directly:
// logical_block_size = device_info.block_size
// physical_block_size = device_info.physical_block_size
// capacity_sectors = device_info.capacity_bytes / block_size
// max_segments = device_info.max_sgl_entries.max(1)
// max_bio_size = device_info.max_io_bytes
// supports_discard = device_info.supports_discard
// supports_flush = device_info.has_volatile_cache
// supports_fua = true (always supported by wire protocol)
// optimal_io_size = device_info.optimal_io_alignment
// numa_node = local transport device's NUMA node
}
fn shutdown(&self) -> Result<()> {
// Graceful disconnect: drain → ServiceUnbind → deregister.
}
}
No intermediate I/O scheduler: unlike iSCSI and NVMe-oF kernel initiators
which funnel through the full Linux block layer multi-queue infrastructure
(blk-mq → I/O scheduler → hw dispatch queue), BlockServiceClient submits
directly from submit_bio() to the transport queue. The server has its own I/O
scheduler (Section 15.13) — a client-side
scheduler would add latency without improving ordering. This saves ~1-3μs per
I/O compared to the Linux blk-mq path.
15.13.3.13.3 Connection Lifecycle¶
Phase 1 — Discovery:
1. Client queries PeerRegistry::peers_with_cap(BLOCK_STORAGE)
([Section 5.2](05-distributed.md#cluster-topology-model--peer-registry)).
2. For each peer with BLOCK_STORAGE: send GetInfo (BlockServiceOpcode::GetInfo)
to retrieve available exports and their BlockServiceDeviceInfo.
3. Client filters exports by policy (admin config, cgroup affinity, NUMA
proximity to local transport device).
Phase 2 — Connect:
4. Client validates it holds CAP_BLOCK_REMOTE
([Section 9.2](09-security.md#permission-and-acl-model)). Checked once here, not per-I/O.
5. Client sends ServiceBind ([Section 5.1](05-distributed.md#distributed-kernel-architecture--message-payload-structs))
for the selected export:
service_id = target export's ServiceInstanceId
ring_pair_index = 0 (first; more will follow)
requested_queue_depth = min(local_preference, 128)
requested_entry_size = 128 (minimum for BlockServiceRequest + alignment)
6. Server validates CapabilityToken, responds with ServiceBindAck:
granted_queue_depth, granted_entry_size, transport_params (transport-specific:
RDMA QP number + rkey + remote_addr; CXL doorbell offset; TCP port).
7. Client stores negotiated parameters in device_info.
Phase 3 — Queue Setup:
8. Client creates N peer queue pairs (N = negotiated queue count).
Each queue pair: reliable connected mode, linked to server's
corresponding queue.
9. Client registers data regions with the peer transport (one per queue):
region_size = queue_depth × max_io_bytes
These are pre-registered once; subsequent I/O uses region offsets.
10. Client populates cpu_to_queue mapping:
for cpu in 0..nr_cpus: cpu_to_queue[cpu] = cpu % nr_queues
11. Client spawns one completion thread per queue (see "I/O Completion" below).
12. State transitions: Connecting → QueueSetup → Active.
Phase 4 — Steady State:
13. Bios arrive via submit_bio(). Converted to BlockServiceRequest, posted
to the per-CPU queue's ring pair. Completions arrive on the same queue.
14. Adaptive polling adjusts per-queue poll mode based on I/O rate.
Phase 5 — Disconnect:
15. Trigger: admin request, or device removal on server, or peer departure.
16. State → Draining. Block layer is notified: no new bios accepted
(QUEUE_FLAG_QUIESCING set on the block device).
17. Wait for all in-flight requests to complete or timeout (io_timeout_ms).
18. Destroy transport queue pairs and deregister data regions.
19. Send ServiceUnbind ([Section 5.1](05-distributed.md#distributed-kernel-architecture)).
20. Deregister block device from umka-block. /dev/umka/peer{N}_blk{M}
disappears. Any open file descriptors see -ENODEV on subsequent I/O.
21. State → Disconnected.
Phase 6 — Reconnection (on transient failure):
22. Trigger: transport error, completion timeout, peer heartbeat Suspect/Dead.
23. State → Reconnecting. New bios are queued in a bounded backlog
(capacity: nr_queues × queue_depth × 4). Overflow → BLK_STS_RESOURCE.
24. Reconnect loop with exponential backoff + full jitter:
base_delay_ms = 100, max_delay_ms = 30_000, jitter = ±25%.
Same algorithm as NVMe-oF reconnect
([Section 15.13](#block-storage-networking--nvme-of-reconnect-policy)).
25. On successful reconnect:
a. Re-create QPs, re-register MRs.
b. Re-submit in-flight requests (server deduplicates by request_id —
see "In-flight I/O deduplication" in server spec above).
c. Drain backlog.
d. State → Active.
26. After 20 consecutive failures (~10 minutes): State → Offline.
I/O fails with EIO. Manual recovery via:
echo reconnect > /sys/block/umkaXpYbZ/device/state
Reconnect state:
/// Reconnection backoff state. Tracks consecutive failures and
/// computes the next retry delay.
pub struct ReconnectState {
/// Consecutive failed reconnect attempts.
attempt: u32,
/// Timestamp of last reconnect attempt (monotonic ns).
last_attempt_ns: u64,
/// Maximum attempts before transitioning to Offline.
max_attempts: u32, // default: 20
/// Base delay in milliseconds. Default: 100.
base_delay_ms: u32,
/// Maximum delay in milliseconds. Default: 30_000.
max_delay_ms: u32,
}
impl ReconnectState {
/// Compute next delay with exponential backoff + full jitter.
/// Returns delay in milliseconds.
pub fn next_delay(&mut self) -> u32 {
let delay = core::cmp::min(
self.base_delay_ms.saturating_mul(1u32 << self.attempt.min(20)),
self.max_delay_ms,
);
let jitter_range = delay / 4; // ±25%
let jitter = prng_uniform_u32(jitter_range * 2) as i32 - jitter_range as i32;
let result = (delay as i32 + jitter).max(1) as u32;
self.attempt = self.attempt.saturating_add(1);
result
}
/// Reset on successful reconnect.
pub fn reset(&mut self) {
self.attempt = 0;
}
}
15.13.3.13.4 Multipath¶
Multiple paths to the same remote export (via different transport devices or different
network fabrics) are managed directly inside BlockServiceClient. No
dm-multipath dependency — multipath is built-in.
/// Multipath state for a BlockServiceClient with multiple paths to
/// the same remote export.
pub struct MultipathState {
/// All known paths to this export. XArray keyed by path_id (u64).
/// Path IDs are assigned sequentially and never reused (u64 counter).
paths: XArray<PathInfo>,
/// Active path selection policy.
policy: AtomicU8, // MultipathPolicy as u8
/// Round-robin counter (used by RoundRobin policy).
rr_counter: AtomicU64,
/// Number of currently active paths.
active_count: AtomicU16,
}
/// Per-path connection state.
pub struct PathInfo {
/// Unique path identifier within this BlockServiceClient instance.
/// Monotonically increasing per-instance (starts at 0 on device creation,
/// never reset). Scoped to one BlockServiceClient — not shared across
/// devices. At 1 billion path events/sec, wraps after 584 years.
path_id: u64,
/// The underlying client connection for this path. Each path has
/// its own set of transport queue pairs and data regions.
queues: ArrayVec<ClientQueue, 64>,
/// CPU-to-queue mapping for this path.
cpu_to_queue: Box<[u16]>,
/// Local transport device used for this path (RDMA NIC, CXL port, etc.).
local_transport: TransportDeviceRef,
/// Remote transport endpoint for this path.
remote_endpoint: PeerEndpoint,
/// Current path state.
state: AtomicU8, // PathState as u8
/// NUMA node of the local transport device. Used by NUMA-aware path selection
/// to prefer paths whose device is on the same NUMA node as the submitting CPU.
numa_node: u16,
/// Exponentially weighted moving average of completion latency (ns).
/// Updated on each completion. Used by LeastLatency policy.
avg_latency_ns: AtomicU64,
/// Number of in-flight requests on this path. Used by LeastQueueDepth policy.
inflight_count: AtomicU32,
}
/// Path health state.
#[repr(u8)]
pub enum PathState {
/// Path is healthy and accepting I/O.
Active = 0,
/// Path is configured but not preferred (admin-designated standby).
Standby = 1,
/// Path has failed (transport error or timeout). Reconnection in progress.
/// I/O is redirected to other Active paths.
Failed = 2,
/// Path is being removed (admin disconnect or NIC removal).
Removing = 3,
}
/// Multipath I/O distribution policy.
#[repr(u8)]
pub enum MultipathPolicy {
/// Distribute I/O across active paths in round-robin order.
/// Simple, fair, good default for symmetric paths.
RoundRobin = 0,
/// Select the path with fewest in-flight requests.
/// Best for asymmetric paths (different bandwidths).
LeastQueueDepth = 1,
/// Select the path whose local transport device is on the same NUMA node
/// as the submitting CPU. Falls back to RoundRobin for CPUs
/// without a same-node path.
NumaAware = 2,
/// Select the path with lowest measured completion latency (EWMA).
/// Best for paths with different link speeds or hop counts.
LeastLatency = 3,
}
Path selection (hot path — no locks, no allocation):
path_select(bio) → PathInfo:
match policy:
RoundRobin:
idx = rr_counter.fetch_add(1, Relaxed) % active_count
return nth_active_path(idx)
LeastQueueDepth:
scan active paths, return one with lowest inflight_count.load(Relaxed)
(tie-break: lowest path_id)
NumaAware:
cpu_numa = cpu_to_numa_node(smp_processor_id())
scan active paths for one with matching numa_node
if found: return it
else: fall back to RoundRobin
LeastLatency:
scan active paths, return one with lowest avg_latency_ns.load(Relaxed)
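The selection pseudocode above can be sketched as safe Rust over a slice of active paths. This is an illustrative standalone model: `PathInfo` is reduced to the fields the policies actually consult, and the active-path set is a plain slice rather than the XArray.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering::Relaxed};

/// Reduced PathInfo: only the fields the selection policies read.
struct Path {
    path_id: u64,
    numa_node: u16,
    inflight_count: AtomicU32,
    avg_latency_ns: AtomicU64,
}

enum Policy {
    RoundRobin,
    LeastQueueDepth,
    NumaAware { cpu_numa: u16 },
    LeastLatency,
}

/// Lock-free selection over the active paths (assumed non-empty).
fn path_select<'a>(active: &'a [Path], rr_counter: &AtomicU64, policy: &Policy) -> &'a Path {
    match policy {
        Policy::RoundRobin => {
            let idx = rr_counter.fetch_add(1, Relaxed) as usize % active.len();
            &active[idx]
        }
        // Tie-break on lowest path_id, as in the spec.
        Policy::LeastQueueDepth => active
            .iter()
            .min_by_key(|p| (p.inflight_count.load(Relaxed), p.path_id))
            .unwrap(),
        // Prefer a same-NUMA-node path; fall back to round-robin.
        Policy::NumaAware { cpu_numa } => match active.iter().find(|p| p.numa_node == *cpu_numa) {
            Some(p) => p,
            None => path_select(active, rr_counter, &Policy::RoundRobin),
        },
        Policy::LeastLatency => active
            .iter()
            .min_by_key(|p| p.avg_latency_ns.load(Relaxed))
            .unwrap(),
    }
}

fn main() {
    let mk = |id, numa, inflight, lat| Path {
        path_id: id,
        numa_node: numa,
        inflight_count: AtomicU32::new(inflight),
        avg_latency_ns: AtomicU64::new(lat),
    };
    let paths = [mk(0, 0, 4, 9_000), mk(1, 1, 2, 5_000), mk(2, 0, 2, 7_000)];
    let rr = AtomicU64::new(0);

    assert_eq!(path_select(&paths, &rr, &Policy::RoundRobin).path_id, 0);
    assert_eq!(path_select(&paths, &rr, &Policy::RoundRobin).path_id, 1);
    // Paths 1 and 2 tie on inflight == 2; lowest path_id wins.
    assert_eq!(path_select(&paths, &rr, &Policy::LeastQueueDepth).path_id, 1);
    assert_eq!(path_select(&paths, &rr, &Policy::NumaAware { cpu_numa: 1 }).path_id, 1);
    assert_eq!(path_select(&paths, &rr, &Policy::LeastLatency).path_id, 1);
}
```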
Failover: automatic, completion-timeout-based. No dedicated heartbeat — the peer heartbeat (Section 5.8) already detects peer-level failures. Path-level failures are detected by I/O timeout:
- Request completes with transport error, or no completion within `io_timeout_ms` → mark the request's `path_index` path as `Failed`.
- Re-submit the failed request on the next Active path (up to 3 retries total across different paths).
- If all paths are `Failed` → enter `Reconnecting` state on all paths.
- Path recovery: when transport reconnect succeeds, transition `Failed` → `Active`. I/O rebalances automatically.
Path discovery: new paths are discovered when:
- A new transport device comes online (hotplug) that has connectivity to the server.
- The topology graph (Section 5.2) reports a
new link to the server peer.
- Admin explicitly adds a path via sysfs:
echo <transport_dev>:<remote_endpoint> > /sys/block/umkaXpYbZ/device/multipath/add_path
15.13.3.13.5 I/O Completion and Error Handling¶
Each queue has a dedicated completion thread. The thread uses adaptive polling to balance latency against CPU usage.
Adaptive poll/interrupt mode:
Completion thread main loop (per queue):
poll_spin_us = 10 // configurable, default 10μs
poll_threshold_iops = 10_000 // per-queue IOPS to enter poll mode
loop:
match poll_mode:
Poll:
// Busy-poll the completion queue. Lowest latency (~0.5-1μs).
// Burns one CPU core per queue — only used at high IOPS.
batch = poll_cq(qp.cq, max_batch=32)
if batch.is_empty():
spin_loop_hint() // pause instruction
continue
Interrupt:
// Arm the CQ for completion notification, then sleep.
arm_cq(qp.cq)
wait_for_cq_completion(qp.cq) // blocks until completion interrupt
batch = rdma_poll_cq(qp.cq, max_batch=32)
Hybrid:
// Poll for poll_spin_us, then fall back to interrupt.
deadline = now_ns() + poll_spin_us * 1000
loop:
batch = poll_cq(qp.cq, max_batch=32)
if !batch.is_empty(): break
if now_ns() >= deadline:
// No completions in poll window → switch to interrupt.
arm_cq(qp.cq)
wait_for_cq_completion(qp.cq)
batch = poll_cq(qp.cq, max_batch=32)
break
spin_loop_hint()
// Process batch of completions.
for wc in batch:
process_completion(wc)
// Adaptive mode switching (evaluated every 1000 completions):
// recent_iops > poll_threshold_iops → switch to Poll
// recent_iops < poll_threshold_iops / 2 → switch to Hybrid
// no completions for 100ms → switch to Interrupt
update_poll_mode()
Completion batching: completion queue poll returns up to 32 completions
at once (max_batch=32). Each completion is a BlockServiceCompletion
received via the ring pair. Processing multiple completions per poll avoids
per-completion overhead (doorbell, cache line bouncing).
Per-completion processing (process_completion):
process_completion(wc: TransportCompletion):
1. Extract request_id from the BlockServiceCompletion in the recv buffer.
2. Look up inflight slot: inflight.slots[request_id].
3. Validate: slot must be in-use (bitmap bit set). If not → log + discard
(stale completion from pre-reconnect).
4. Clear bitmap bit (atomic, release ordering).
5. Map completion status → bio status:
0 → BIO_OK (0)
-ECANCELED → -ECANCELED (CompareAndWrite mismatch)
-EIO → -EIO
-ENOSPC → -ENOSPC
-ENOMEM → -ENOMEM (server out of memory)
other negative → -EIO (unexpected server error)
6. If integrity_status != 0 and DATA_INTEGRITY was requested:
Log integrity violation via FMA ([Section 20.1](20-observability.md#fault-management-architecture)).
Set bio status to -EIO.
7. Set bio.status to mapped value.
8. Invoke bio completion:
bio_complete(bio, status)
This calls the bio's `end_io` callback, which wakes the waiting
filesystem/application or triggers the next stage in an async I/O pipeline.
9. Update path statistics:
path.avg_latency_ns EWMA update.
path.inflight_count.fetch_sub(1, Release).
10. Repost receive entry for the next completion (pre-posted recv pool).
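Step 5's status mapping is a straight match. A hedged sketch follows, with plain i32 errno values standing in for the kernel's error types (the function name is an invention of this sketch):

```rust
/// Illustrative status mapping from process_completion step 5.
/// Errno values are the conventional Linux numbers; BIO_OK is 0.
const EIO: i32 = 5;
const ENOMEM: i32 = 12;
const ENOSPC: i32 = 28;
const ECANCELED: i32 = 125;

fn map_completion_status(server_status: i32) -> i32 {
    match server_status {
        0 => 0,                             // BIO_OK
        s if s == -ECANCELED => -ECANCELED, // CompareAndWrite mismatch
        s if s == -EIO => -EIO,
        s if s == -ENOSPC => -ENOSPC,
        s if s == -ENOMEM => -ENOMEM,       // server out of memory
        _ => -EIO,                          // unexpected server error
    }
}

fn main() {
    assert_eq!(map_completion_status(0), 0);
    assert_eq!(map_completion_status(-ECANCELED), -ECANCELED);
    assert_eq!(map_completion_status(-ENOSPC), -ENOSPC);
    // Any unrecognized server error collapses to -EIO.
    assert_eq!(map_completion_status(-71), -EIO);
}
```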
Timeout detection: a per-queue timer fires every io_timeout_ms / 4
(default: 7.5 seconds). It scans the inflight bitmap for requests whose
submit_ns exceeds the timeout:
timeout_scan(queue):
now = monotonic_ns()
deadline = now - io_timeout_ms * 1_000_000
for slot in inflight.slots where bitmap bit is set:
if slot.submit_ns < deadline:
// Request timed out.
if multipath and other paths Active:
// Retry on different path (up to 3 total retries).
if slot.retries < 3:
slot.retries += 1
resubmit_on_different_path(slot)
continue
// No retry possible → fail the bio.
mark_path_failed(slot.path_index)
complete_bio_with_error(slot, -EIO)
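The scan above can be modeled without the bitmap machinery. In this simplified sketch, in-flight slots are `Option`s in a vector and the retry/fail decisions are returned instead of performed; names like `Slot` and `Action` are inventions of the sketch.

```rust
/// Simplified timeout scan: decide, per in-flight slot, whether to keep it,
/// retry it on another path, or fail it with -EIO.
#[derive(Debug, PartialEq)]
enum Action {
    Keep,
    RetryOtherPath,
    FailEio,
}

struct Slot {
    submit_ns: u64,
    retries: u8,
}

fn timeout_scan(
    slots: &mut [Option<Slot>],
    now_ns: u64,
    io_timeout_ms: u64,
    other_paths_active: bool,
) -> Vec<Action> {
    let deadline = now_ns.saturating_sub(io_timeout_ms * 1_000_000);
    slots
        .iter_mut()
        .map(|entry| match entry {
            Some(s) if s.submit_ns < deadline => {
                if other_paths_active && s.retries < 3 {
                    s.retries += 1; // up to 3 retries total
                    Action::RetryOtherPath
                } else {
                    Action::FailEio // mark path Failed, complete bio with -EIO
                }
            }
            _ => Action::Keep, // empty slot or still within timeout
        })
        .collect()
}

fn main() {
    let now = 100_000_000_000; // 100 s in ns
    let mut slots = vec![
        Some(Slot { submit_ns: now - 40_000_000_000, retries: 0 }), // 40 s old
        Some(Slot { submit_ns: now - 40_000_000_000, retries: 3 }), // retries exhausted
        Some(Slot { submit_ns: now - 1_000_000_000, retries: 0 }),  // 1 s old
        None,
    ];
    let actions = timeout_scan(&mut slots, now, 30_000, true); // 30 s timeout
    assert_eq!(
        actions,
        vec![Action::RetryOtherPath, Action::FailEio, Action::Keep, Action::Keep]
    );
}
```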
Error classification and retry policy:
| Completion status | Retry? | Action |
|---|---|---|
| 0 (success) | No | Complete bio successfully |
| `-ECANCELED` (CAS mismatch) | No | Pass to caller (not a transport error) |
| `-EIO` | Yes (different path) | Transient storage error — retry on another path |
| `-ENOMEM` | Yes (same path, after backoff) | Server memory pressure — back off 10ms, retry |
| `-ENOSPC` | No | Propagate to filesystem |
| Transport error | Yes (different path) | Path failure — failover |
| Timeout (no completion) | Yes (different path) | Path failure — failover |
| Transport fatal error | No (path dead) | Mark path Failed, trigger reconnect |
Maximum retries: 3 per I/O request across all paths. After 3 retries with
no success, the bio completes with -EIO. The filesystem or application handles
the error (journal replay, read retry, user notification).
15.13.3.13.6 Pre-Registered Memory Regions¶
Per-I/O transport memory registration costs ~1-3μs (kernel page pin, device
doorbell, PTE update). At 1M IOPS, that is 1-3 seconds of CPU time per second
— unacceptable. BlockServiceClient pre-registers data regions at connection
setup.
Pre-registration model:
Per queue:
region_size = queue_depth × max_io_bytes
Typical example: 32 slots × 128KB = 4MB per queue.
For 8 queues: 32MB total pre-registered.
High-performance example: 128 slots × 1MB = 128MB per queue.
For 16 queues: 2GB total pre-registered (requires dedicated RDMA NIC memory).
The region is a contiguous virtual allocation (vmalloc) backed by
physical pages. It is registered as a single ServiceDataRegion with
the peer transport, yielding a local handle and a remote-accessible
token (communicated to the server at connection setup via
ServiceBindTransportParams).
I/O submission:
For writes: bio pages are copied into a slot in the pre-registered
region (memcpy cost ~0.5μs for 4KB, amortized by zero registration
overhead). For large I/O with SGL: each SGL entry points into the
pre-registered region (no copy if bio pages happen to be within
the region — but this is not guaranteed, so copy is the common path).
For reads: the server remote-writes directly into the slot. On
completion, the client copies from the slot to the bio's target pages.
Trade-off: memory copy (~0.5μs/4KB) vs per-I/O registration (~1-3μs/op).
For 4KB I/O: copy wins. For 1MB I/O: copy cost (~30μs) approaches
registration cost — but registration has higher variance (device
contention). Pre-registration is uniformly better for mixed workloads.
Alternative (future optimization): on-demand region cache. Keep a pool
of pre-registered page-aligned regions. On submit, check if bio pages
are already registered. If yes, use directly (zero-copy). If no, fall
back to copy into pre-registered region. This is Phase 4 work.
Memory budget: the pre-registered region size is configurable via
sysfs (/sys/block/umkaXpYbZ/queue/mr_size_mb). Default is computed
from negotiated parameters. On memory-constrained systems, reducing
queue_depth or max_io_bytes proportionally reduces MR size.
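The sizing arithmetic above is simple enough to check directly. A sketch of the budget computation, with illustrative function names:

```rust
/// Pre-registered region sizing: region = queue_depth × max_io_bytes,
/// total = num_queues × region. Matches the worked examples in the text.
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

fn region_bytes(queue_depth: u64, max_io_bytes: u64) -> u64 {
    queue_depth * max_io_bytes
}

fn total_bytes(num_queues: u64, queue_depth: u64, max_io_bytes: u64) -> u64 {
    num_queues * region_bytes(queue_depth, max_io_bytes)
}

fn main() {
    // Typical: 32 slots × 128 KiB = 4 MiB per queue; 8 queues = 32 MiB total.
    assert_eq!(region_bytes(32, 128 * KIB), 4 * MIB);
    assert_eq!(total_bytes(8, 32, 128 * KIB), 32 * MIB);
    // High-performance: 128 slots × 1 MiB = 128 MiB per queue; 16 queues = 2 GiB.
    assert_eq!(region_bytes(128, MIB), 128 * MIB);
    assert_eq!(total_bytes(16, 128, MIB), 2 * 1024 * MIB);
}
```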
15.13.3.13.7 Performance Comparison¶
Why this is better than existing remote block protocols:
| Overhead source | iSCSI (Linux) | NVMe-oF/RDMA (Linux) | BlockServiceClient |
|---|---|---|---|
| Protocol translation | SCSI CDB encode/decode | NVMe capsule build | None — native wire format |
| I/O scheduler (client) | blk-mq + mq-deadline | blk-mq + none | None — direct submit |
| Request conversion | bio → scsi_cmnd → iSCSI PDU | bio → nvme_request → capsule | bio → BlockServiceRequest (1 step) |
| Region registration | Per-I/O (no pre-reg in Linux iSCSI) | Per-I/O or FMR pool | Pre-registered (zero per-I/O cost) |
| Completion model | Interrupt-only | Interrupt + poll (since 5.x) | Adaptive poll/hybrid/interrupt |
| Multipath | dm-multipath (separate layer) | NVMe ANA (in-driver) | Built-in (no layer crossing) |
| Connection setup | iSCSI login (multi-round) | NVMe Connect (2 rounds) | ServiceBind (1 round) |
Expected latency (4KB random read, RDMA/RoCE, single hop):
- Network RTT: ~3-5μs (RDMA RC one-sided)
- Client overhead (submit + completion): ~1-2μs
- submit_bio → build request + post to ring pair: ~0.5μs
- Poll CQ + process completion + bio callback: ~0.5-1μs
- Server overhead: ~2-4μs (receive + local NVMe submit + local completion + remote write back)
- Total: ~6-11μs end-to-end (vs ~15-25μs for NVMe-oF/RDMA in Linux, ~50-100μs for iSCSI/TCP)
15.14 Clustered Filesystems¶
Shared-disk filesystems where multiple nodes access the same block device simultaneously, coordinated by a distributed lock manager (DLM).
Linux problem — GFS2 and OCFS2 require a complex multi-daemon stack:
- Corosync: cluster membership and messaging
- Pacemaker: resource manager and fencing coordinator
- DLM: distributed lock manager (kernel module + userspace daemon)
- Fencing agent: STONITH (Shoot The Other Node In The Head) — kills unresponsive nodes to prevent split-brain corruption
These components are developed by different teams, have different configuration languages, and interact in subtle ways. Diagnosing failures requires understanding all four components and their interactions. A single daemon crash can fence the entire node.
UmkaOS design — The cluster infrastructure from Section 5.1 provides the foundation. UmkaOS integrates these components into a coherent architecture:
DLM over RDMA — The DLM (Section 15.15) uses Section 5.4's RDMA transport for lock operations. Lock grant/release round-trip is ~3-5μs over RDMA (vs ~30-50μs over TCP in Linux's DLM). This directly impacts filesystem performance — every metadata operation (create, rename, delete, stat) requires at least one DLM lock. At 3-5μs per lock, clustered filesystem metadata operations approach local filesystem performance. See Section 15.15 for the full DLM design, including RDMA-native lock protocols, lease-based extension, batch operations, and recovery.
Fencing — When a node becomes unresponsive, the cluster must fence it (prevent it from accessing shared storage) before allowing other nodes to recover its locks:
- IPMI/BMC fencing: power-cycle the node via out-of-band management
- SCSI-3 Persistent Reservations: revoke the node's reservation on the shared storage device — the storage controller itself blocks I/O from the fenced node
- Same mechanisms as Linux, but integrated into Section 5.8's cluster membership protocol rather than requiring a separate Pacemaker/STONITH stack
Quorum — Inherits from Section 5.8's split-brain handling. A partition with fewer than quorum nodes self-fences (stops accessing shared storage) to prevent data corruption.
GFS2 compatibility — Read the GFS2 on-disk format, implemented as an umka-vfs module:
- Resource groups, dinodes, journaled metadata
- GFS2 DLM lock types mapped to DLM lock modes (Section 15.15)
- Journal recovery for failed nodes
- Existing GFS2 volumes can be mounted by UmkaOS without reformatting
OCFS2 compatibility — Similar approach: read OCFS2 on-disk format, implement as an umka-vfs module. Lower priority than GFS2.
Recovery advantage — This is where UmkaOS's architecture fundamentally changes clustered filesystem behavior:
- Linux: if a node's storage driver crashes, the DLM loses heartbeat from that node. Fencing kicks in — the node is killed (power-cycled or SCSI-3 PR revoked). After reboot (~60s), the node must rejoin the cluster, replay its journal, and re-acquire locks. Other nodes are blocked on any locks held by the crashed node until fencing and recovery complete.
- UmkaOS: if a node's storage driver crashes, the driver recovers in ~50-150ms (Tier 1 reload). The cluster heartbeat continues throughout (heartbeat runs in umka-core, not the storage driver), so the node is never declared dead. The node stays in the cluster. Its locks remain valid. No fencing, no journal replay, no lock recovery. Other nodes never notice.
This transforms clustered filesystem reliability from "minutes of disruption per failure" to "50ms blip per failure." See Section 15.15 for detailed recovery comparison.
15.15 Distributed Lock Manager¶
The Distributed Lock Manager (DLM) is a first-class kernel subsystem in umka-core that provides cluster-wide lock coordination for shared-disk filesystems (Section 15.14), distributed applications, and any kernel subsystem requiring cross-node synchronization. It implements the VMS/DLM lock model — the same model used by Linux's DLM, GFS2, OCFS2, and VMS clustering.
The DLM lives in umka-core (not a separate daemon or Tier 1 driver). This is a deliberate architectural choice: lock state survives Tier 1 driver restarts, cluster heartbeat continues during storage driver reloads (keeping the node alive in the cluster), and there are zero kernel/userspace boundary crossings for lock operations.
15.15.1 Design Overview and Linux Problem Statement¶
Linux's DLM implementation suffers from seven systemic problems that limit clustered filesystem performance. Each problem stems from architectural decisions made when the Linux DLM was designed for 1 Gbps Ethernet and 4-node clusters in the early 2000s. UmkaOS's DLM addresses each problem by design:
| # | Linux Problem | Impact | UmkaOS Fix |
|---|---|---|---|
| 1 | Global recovery quiesce — DLM stops ALL lock activity cluster-wide during any node failure recovery | Seconds of cluster-wide stall; all nodes blocked, not just those sharing resources with the dead node | Per-resource recovery: only resources mastered on the dead node are affected; all other lock operations continue uninterrupted (Section 15.15) |
| 2 | TCP lock transport (~30-50 μs per lock operation) | Orders of magnitude slower than hardware allows; metadata-heavy workloads bottleneck on lock latency | RDMA-native: Atomic CAS for uncontested locks (~3-5 μs including confirmation, zero remote CPU on CAS path), RDMA Send for contested locks (~5-8 μs) (Section 15.15) |
| 3 | No lock batching — each lock request is a separate network round-trip | `rename()` requires 3 locks = 3 round-trips = ~90-150 μs on Linux DLM | Batch API: up to 64 locks grouped by master in a single RDMA Write (~5-10 μs total) (Section 15.15) |
| 4 | BAST (Blocking AST) callback storms — O(N) invalidation messages for N holders of a contended resource, including uncontended downgrades | Metadata-heavy workloads on large clusters see network saturation from invalidation traffic | Lease-based extension: holders extend cheaply via RDMA Write; minimal traffic for uncontended resources — only periodic one-sided RDMA lease renewals that bypass the remote CPU (zero CPU-consuming traffic, vs. Linux BASTs on every downgrade that require CPU processing); contended worst case is still O(K) for K active holders but K ≤ N because expired leases are reclaimed without messaging (Section 15.15) |
| 5 | Separate daemon architecture — corosync + pacemaker + dlm_controld with kernel/userspace boundary crossings | Every membership change requires multiple kernel↔userspace transitions; diagnosis requires understanding 4 separate components | Integrated in-kernel: membership events from Section 5.8 delivered directly to DLM; single heartbeat source; no userspace daemons (Section 15.15) |
| 6 | Lock holder must flush ALL dirty pages on lock downgrade | Dropping an EX lock on a 100 GB file flushes all dirty pages, even if only 4 KB was written | Targeted writeback: DLM tracks dirty page ranges per lock; only modified pages within the lock's range are flushed (Section 15.15) |
| 7 | No speculative multi-resource lock acquire | GFS2 rgrp allocation: each attempt to lock a resource group is a full round-trip; 8 attempts = 8 × 30-50 μs | lock_any_of(N) primitive: single message tries N resources, first available is granted (Section 15.15) |
15.15.2 Lock Modes and Compatibility Matrix¶
The DLM implements the six standard VMS/DLM lock modes. GFS2 uses all six modes — this is not a simplification, it is the minimum required for correct clustered filesystem operation.
/// DLM lock modes, ordered by exclusivity (lowest to highest).
/// Compatible with Linux DLM, GFS2, and OCFS2 expectations.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum LockMode {
/// Null Lock — placeholder, compatible with everything.
/// Used to hold a position in the lock queue without blocking others.
NL = 0,
/// Concurrent Read — read access, compatible with all except EX.
/// Used by GFS2 for inode lookup (reading inode from disk).
CR = 1,
/// Concurrent Write — write access, compatible with NL, CR, CW.
/// Used by GFS2 for writing to a file while others read metadata.
CW = 2,
/// Protected Read — read-only, blocks writers.
/// Used by GFS2 for operations requiring consistent metadata snapshot.
PR = 3,
/// Protected Write — write, compatible with NL and CR only.
/// Used by GFS2 for metadata modification (create, rename, unlink).
PW = 4,
/// Exclusive — sole access, incompatible with everything except NL.
/// Used by GFS2 for operations requiring exclusive inode access.
EX = 5,
}
Compatibility matrix — true means the two modes can be held concurrently by
different nodes:
| | NL | CR | CW | PR | PW | EX |
|---|---|---|---|---|---|---|
| NL | yes | yes | yes | yes | yes | yes |
| CR | yes | yes | yes | yes | yes | no |
| CW | yes | yes | yes | no | no | no |
| PR | yes | yes | no | yes | no | no |
| PW | yes | yes | no | no | no | no |
| EX | yes | no | no | no | no | no |
This matrix follows the standard VMS/DLM compatibility semantics (OpenVMS Programming
Concepts Manual, Red Hat DLM Programming Guide Table 2-2; Linux kernel
fs/dlm/lock.c __dlm_compat_matrix table). Key points: PW is compatible with NL and CR
only (PW is the "update lock" — allows one writer with concurrent readers); CW is
compatible with NL, CR, and CW (CW allows concurrent writers); PW and CW are mutually
incompatible (PW forbids other writers, including CW holders). The matrix is stored as
a compile-time constant lookup table for zero-cost compatibility checks on the lock
grant path.
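As a concrete check, the matrix can be written as the compile-time table the text describes. A sketch follows; the constant and helper names are illustrative, not the spec's.

```rust
/// VMS/DLM compatibility matrix as a const lookup table.
/// Row/column order follows LockMode discriminants: NL, CR, CW, PR, PW, EX.
const COMPAT: [[bool; 6]; 6] = [
    // NL     CR     CW     PR     PW     EX
    [true,  true,  true,  true,  true,  true ], // NL
    [true,  true,  true,  true,  true,  false], // CR
    [true,  true,  true,  false, false, false], // CW
    [true,  true,  false, true,  false, false], // PR
    [true,  true,  false, false, false, false], // PW
    [true,  false, false, false, false, false], // EX
];

fn compatible(a: usize, b: usize) -> bool {
    COMPAT[a][b] // zero-cost lookup on the grant path
}

fn main() {
    const NL: usize = 0;
    const CR: usize = 1;
    const CW: usize = 2;
    const PW: usize = 4;
    const EX: usize = 5;
    // Compatibility is symmetric.
    for a in 0..6 {
        for b in 0..6 {
            assert_eq!(COMPAT[a][b], COMPAT[b][a]);
        }
    }
    assert!(compatible(PW, CR));  // the "update lock": one writer + readers
    assert!(!compatible(PW, CW)); // PW forbids other writers, including CW
    assert!(compatible(CW, CW));  // CW allows concurrent writers
    assert!(compatible(EX, NL));  // EX compatible only with NL
    assert!(!compatible(EX, CR));
}
```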
15.15.3 Lock Value Blocks (LVBs)¶
Each lock resource carries a 64-byte Lock Value Block — a small metadata payload piggybacked on lock state. LVBs are the critical optimization that makes clustered filesystem metadata operations efficient.
/// Lock Value Block — 64 bytes of metadata attached to a lock resource.
/// Updated by the last EX/PW holder on downgrade or unlock.
/// Read by PR/CR holders on lock grant.
///
/// MUST be cache-line aligned (`align(64)`). On all target RDMA hardware
/// (ConnectX-5+, EFA, RoCEv2 NICs), a cache-line-aligned 64-byte RDMA Read
/// is performed as a single PCIe transaction, providing de facto atomicity.
/// The alignment is a correctness requirement for the double-read protocol;
/// see the "LVB read consistency" section below.
#[repr(C, align(64))]
pub struct LockValueBlock {
/// Application-defined data (e.g., inode size, mtime, block count).
pub data: [u8; 56],
/// Sequence counter — incremented on every LVB update.
/// Readers use this to detect stale LVBs after recovery.
///
/// Stored as u64 for alignment and RDMA atomic operation compatibility
/// (RDMA atomics require 8-byte aligned 8-byte values).
///
/// **Odd/even protocol**: Writers use FAA to increment the counter before
/// and after writing data. An odd value indicates mid-update (reader should
/// retry); an even value indicates stable data. The counter is initialized
/// to 0 (even) on LVB creation.
///
/// **Masking requirement**: Readers MUST mask with `LVB_SEQUENCE_MASK`
/// (0x0000_FFFF_FFFF_FFFF) before checking parity or comparing values.
/// The high 16 bits are used for special sentinel values (e.g., INVALID)
/// and should not be interpreted as part of the sequence counter.
///
/// The 48-bit counter wraps after ~8.9 years at 1M increments/sec
/// (2^48 / 10^6 ≈ 281 million seconds). The LVB rotation protocol
/// ensures the effective counter lifetime exceeds the 50-year uptime
/// target at any sustained write rate. At 1M increments/sec, rotation
/// triggers every ~8 years; the rotation is transparent to lock holders
/// (50-100 us pause). See "LVB sequence counter wrap limitation" below
/// for wrap-safety analysis and handling guidance.
pub sequence: u64,
}
// Wire/RDMA format: data(56) + sequence(8) = 64 bytes (cache-line aligned).
const_assert!(core::mem::size_of::<LockValueBlock>() == 64);
/// Mask to extract the 48-bit sequence counter from the u64 field.
/// MUST be applied before checking odd/even parity or comparing sequence values.
pub const LVB_SEQUENCE_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;
/// Sentinel value indicating an invalid LVB (after recovery from dead holder).
/// Uses high bits outside the 48-bit sequence space to avoid collision.
/// Readers observing this value must treat the LVB as invalid and refresh
/// from disk before use.
pub const LVB_SEQUENCE_INVALID: u64 = 0xFFFF_0000_0000_0000;
Why LVBs matter: Consider the common case of reading a file's size on a clustered filesystem:
Without LVB:
Node A holds inode EX lock → writes file → updates size on disk → releases EX
Node B acquires inode PR lock → reads inode FROM DISK → gets current size
Cost: 1 lock operation (~3-5 μs) + 1 disk read (~10-15 μs NVMe) = ~13-20 μs
With LVB:
Node A holds inode EX lock → writes file → writes size to LVB → releases EX
Node B acquires inode PR lock → reads size FROM LVB (in lock grant message)
Cost: 1 lock operation (~4-6 μs, LVB included) + 0 disk reads = ~4-6 μs
LVBs eliminate one disk read per metadata operation in the common case. GFS2 uses LVBs
to cache inode attributes (i_size, i_mtime, i_blocks, i_nlink) and resource
group statistics (free blocks, free dinodes). The VFS layer reads these attributes from
the LVB via Section 14.7's per-field inode validity mechanism.
Note: UmkaOS uses 64-byte LVBs (56 data + 8 sequence counter), vs Linux's 32 bytes, to accommodate extended metadata including the sequence counter and capability token. GFS2 on-disk format compatibility requires translating between 32-byte and 64-byte LVB formats at the filesystem layer: UmkaOS's GFS2 implementation packs the standard 32-byte GFS2 LVB fields into the first 32 bytes of the 56-byte data portion, using the remaining 24 bytes for UmkaOS-specific metadata. The layout:
/// UmkaOS LVB extension — the 24 bytes after the standard 32-byte LVB data.
/// Layout will be defined when DLM capability integration is implemented (Phase 3+).
#[repr(C)]
pub struct UmkaLvbExtension {
    pub _reserved: [u8; 24],
}
const_assert!(core::mem::size_of::<UmkaLvbExtension>() == 24);
When importing a GFS2 volume from Linux, the filesystem driver zero-extends Linux's 32-byte LVBs into the 64-byte format on first lock acquire.
LVB read consistency: RDMA does not provide atomic reads for 64-byte payloads (RDMA
atomics are limited to 8 bytes). When a node reads an LVB via RDMA Read, a concurrent
writer could update the LVB mid-read, producing a torn value. The protocol:
1. Reader performs RDMA Read of the full 64-byte LVB.
2. Reader checks sequence counter. If sequence is odd, the writer is mid-update
(writers set sequence to an odd value before writing data, then increment to even
after). Retry the read.
3. Reader performs a second RDMA Read of the full 64-byte LVB. If every byte (data +
sequence) matches the first read, the data is consistent. If any byte differs, retry
from step 1. The full-payload comparison (not just the sequence field) catches the
case where a writer completes two full updates between the reader's two reads: the
48-bit sequence counter (bits 47:0 of the sequence field) is monotonically increasing
(wraps after ~8.9 years at 500K writes/sec — two FAAs per write equals 1M increments/sec —
far exceeding practical deployment lifetimes; the correctness argument holds for any
deployment shorter than this), so it will differ after any update. The full-payload
comparison is a defense-in-depth measure that also detects torn reads where the
sequence counter itself was partially updated.
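The odd/even check and full-payload comparison can be demonstrated in isolation. In this single-threaded sketch the two "RDMA Reads" are local copies, so only the consistency test itself is modeled; the struct and function names are inventions of the sketch.

```rust
/// Sketch of the reader-side consistency check: a snapshot pair is
/// accepted only if the sequence is even (no writer mid-update) and the
/// two reads are byte-identical (no torn read between them).
const LVB_SEQUENCE_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;

#[derive(Clone, PartialEq)]
struct Lvb {
    data: [u8; 56],
    sequence: u64,
}

fn reads_consistent(first: &Lvb, second: &Lvb) -> bool {
    (first.sequence & LVB_SEQUENCE_MASK) % 2 == 0 && first == second
}

fn main() {
    let stable = Lvb { data: [7; 56], sequence: 4 };
    assert!(reads_consistent(&stable, &stable.clone()));

    // Odd sequence: writer has done its first FAA but not the second.
    let mid_update = Lvb { data: [7; 56], sequence: 5 };
    assert!(!reads_consistent(&mid_update, &mid_update.clone()));

    // Payload differs between the two reads: retry from step 1.
    let second = Lvb { data: [8; 56], sequence: 6 };
    assert!(!reads_consistent(&stable, &second));
}
```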
LVB sequence counter wrap limitation: The 48-bit sequence counter (bits 47:0 of the
sequence field, masked by LVB_SEQUENCE_MASK) wraps after 2^48 increments. At maximum
sustained write rate (500,000 writes/sec = 1,000,000 FAA operations/sec), wrap occurs in
approximately 290 million seconds (~9.2 years). During the wrap transition, a reader could
observe sequence=2^48-1 on the first read and sequence=0 on the second read, incorrectly concluding
that no write occurred between reads (ABA problem on the sequence field). This is an
acceptable limitation because: (1) the wrap interval far exceeds typical cluster deployment
lifetimes; (2) the full-payload comparison (data + sequence) still detects torn reads even
during wrap, since the writer's data changes between FAA operations; (3) production deployments
monitor LVB write rate and proactively replace LVB structures approaching the wrap threshold.
Clusters with write-intensive workloads exceeding ~50,000 writes/sec on critical LVBs may
configure periodic LVB rotation to avoid theoretical wrap scenarios in long-running
deployments.
LVB rotation protocol (for wrap avoidance in long-running clusters):
- The DLM master monitors each LVB's sequence counter. When `(current_seq & LVB_SEQUENCE_MASK) > LVB_ROTATION_THRESHOLD` (default: `0x0000_E000_0000_0000`, ~87.5% of the 48-bit space), the master initiates rotation.
- The master acquires an exclusive (EX) lock on the resource owning the LVB, blocking all other lock operations on that resource.
- Under the EX lock, the master zeros the embedded sequence counter (`lvb.sequence = 0`) and increments `rotation_epoch += 1`, preserving the existing data payload in place. No allocation or pointer swap is needed — the LVB is a by-value field in `DlmResourceInner`, and the rotation is simply a counter reset + epoch bump under exclusive access.
- The master releases the EX lock. Subsequent LVB writes start from sequence = 0 with a fresh 48-bit counter space.
Rotation failure handling: If the master cannot acquire the EX lock
within the rotation timeout (default 30 seconds — e.g., because the EX
holder has died and the lock is stuck in recovery), the rotation is
deferred. If FAA increments sequence past the 48-bit boundary
((value >> 48) != 0 and value != LVB_SEQUENCE_INVALID), the LVB
enters a degraded state: all readers fall back to two-sided read
(same as the LVB_SEQUENCE_INVALID path), which remains correct but
slower. The degraded state is cleared by the next successful rotation.
An FMA event (LVB_ROTATION_DEFERRED) is emitted at warning severity
to alert the cluster administrator.
This protocol is transport-agnostic: it operates on the master's local
DlmResource.inner.lvb field under the resource's SpinLock + an EX DLM lock
and does not involve any transport-specific operations. The EX lock alone
provides the necessary serialization — no RDMA fences or transport-specific
ordering mechanisms are involved in the rotation itself.
Post-rotation even/odd invariant: After rotation, lvb.sequence is 0
(even), indicating stable data. This is correct because the data IS stable
(the EX lock holder preserved the payload). The first subsequent writer will
FAA to 1 (odd = writing), update data, then FAA to 2 (even = stable) — the
even/odd protocol continues correctly from 0 without any special handling.
Visibility guarantees after rotation (per-transport):
- RDMA transports: The LVB resides in RDMA-registered memory. One-sided readers see the new counter value after the EX lock release because the release triggers an RDMA Send to the holder transitioning out of EX, which provides an ordering fence at the responder NIC. The one-sided double-read protocol continues to work correctly with the reset counter.
- TCP transports: LVB reads are always two-sided (see "Two-sided LVB read fallback" below). The master returns the fresh counter (and the updated `rotation_epoch`) in subsequent `DlmLvbReadResponsePayload` messages.
Rotation frequency: At maximum sustained rate (500K writes/sec), rotation
occurs approximately every 8 years (87.5% of the ~8.9-year wrap interval). The
rotation itself takes ~50-100 us (one EX lock cycle + counter reset) and
blocks the resource for the duration. The LVB_ROTATION_THRESHOLD is
configurable via sysctl cluster.dlm.lvb_rotation_threshold (valid range:
50%-99% of 2^48).
Wrap-safety for cache invalidation ordering: The LVB sequence counter is also used to
detect stale LVB data during cache invalidation. When a node receives an LVB update, it
compares the incoming sequence with its last-known sequence. This comparison is wrap-safe
because the comparison window (difference between two consecutive reads of the same LVB
by a single node) is always much smaller than 2^47 (half the 48-bit counter space).
Specifically, a node that reads an LVB will re-read it only upon the next lock acquire,
which happens at most milliseconds to seconds later — accumulating at most a few thousand
sequence increments. Since the wrap interval is ~9.2 years, the comparison window is
negligible relative to 2^47, and signed 48-bit comparison ((new_seq - old_seq) > 0 in
48-bit modular arithmetic) correctly determines ordering even near the wrap boundary.
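The signed 48-bit comparison can be sketched as follows (the helper name and constants are illustrative, not part of the wire protocol):

```rust
/// Wrap-safe "newer than" test for the 48-bit LVB sequence counter:
/// compute the forward distance from `old_seq` to `new_seq` in 48-bit
/// modular arithmetic and treat distances below half the counter
/// space (2^47) as "newer". Illustrative sketch, not KABI.
const SEQ_BITS: u32 = 48;
const SEQ_MASK: u64 = (1u64 << SEQ_BITS) - 1;
const SEQ_HALF: u64 = 1u64 << (SEQ_BITS - 1);

pub fn seq_newer(new_seq: u64, old_seq: u64) -> bool {
    // Forward distance modulo 2^48; a distance of 0 means "same".
    let diff = new_seq.wrapping_sub(old_seq) & SEQ_MASK;
    diff != 0 && diff < SEQ_HALF
}
```

Near the wrap boundary, e.g. old_seq = 2^48 - 2 and new_seq = 2, the forward distance is 4, so ordering is still determined correctly.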
Rotation safety for lockless read_lvb() callers: The rotation protocol (above) resets
the sequence counter from ~87.5% of 2^48 to 0 under an EX lock. Lock-holding readers are
notified of the reset via BAST/revocation (they re-acquire after rotation and see the new
counter). However, read_lvb() callers (TCP two-sided path) do NOT hold locks and have no
revocation channel. A lockless reader that cached sequence S near LVB_ROTATION_THRESHOLD
before rotation, then calls read_lvb() after rotation and gets sequence ~0, would compute
a massive negative difference and incorrectly conclude the LVB is stale.
To handle this, the DlmLvbReadResponsePayload includes a rotation_epoch field
(incremented on each rotation). Lockless readers MUST compare rotation_epoch values
before using sequence comparison for ordering:
- If new_rotation_epoch != cached_rotation_epoch: a rotation occurred. The reader MUST
discard its cached sequence and treat the new LVB as authoritative (no ordering comparison
is meaningful across rotation boundaries).
- If rotation_epoch values match: the standard signed 48-bit sequence comparison applies.
Callers that use read_lvb() solely for torn-read detection (checking even/odd parity) are
unaffected by rotation — the zeroed counter is even (stable), and the even/odd protocol
continues correctly from 0 (see "Post-rotation even/odd invariant" above).
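A lockless reader's staleness check, with the epoch guard applied before the sequence comparison, might look like this (names are illustrative, not KABI):

```rust
/// Decide whether a freshly read LVB supersedes a cached copy, per the
/// rotation-epoch rule for lockless read_lvb() callers. Sketch only.
const SEQ_MASK: u64 = (1u64 << 48) - 1;

pub fn lvb_supersedes(cached_epoch: u64, cached_seq: u64,
                      new_epoch: u64, new_seq: u64) -> bool {
    if new_epoch != cached_epoch {
        // A rotation occurred: the cached sequence is meaningless and
        // the new LVB is authoritative.
        return true;
    }
    // Same epoch: standard signed 48-bit modular comparison.
    let diff = new_seq.wrapping_sub(cached_seq) & SEQ_MASK;
    diff != 0 && diff < (1u64 << 47)
}
```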
The sequence counter detects torn reads: the reader retries if the sequence changed
during the read. This is a consistency mechanism, not an ABA prevention mechanism —
ABA is not applicable because the reader does not perform compare-and-swap on the
LVB data. The writer protocol uses RDMA Fetch-and-Add (FAA) for both
transitions: FAA(sequence, 1) (now odd = writing) → update data → FAA(sequence, 1)
(now even = stable). FAA is a standard RDMA atomic operation, ensuring visibility to
concurrent one-sided readers.
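A local-memory analogue of the protocol makes the reader/writer interaction concrete (AtomicU64 stands in for the RDMA FAA and one-sided Reads; this is a sketch, not the RDMA code path):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// 64-byte LVB model: an 8-byte sequence word plus 56 bytes of data.
/// Even sequence = stable, odd = write in progress.
pub struct Lvb {
    pub seq: AtomicU64,
    pub data: [AtomicU64; 7],
}

/// Double-read protocol: read seq, snapshot data, re-read seq.
/// Retry on odd (writer mid-update) or mismatch (torn snapshot).
pub fn read_lvb_snapshot(lvb: &Lvb) -> [u64; 7] {
    loop {
        let s1 = lvb.seq.load(Ordering::Acquire);
        if s1 & 1 == 1 {
            continue; // odd: writer mid-update
        }
        let snap: [u64; 7] =
            std::array::from_fn(|i| lvb.data[i].load(Ordering::Acquire));
        let s2 = lvb.seq.load(Ordering::Acquire);
        if s1 == s2 {
            return snap; // sequence unchanged: snapshot is consistent
        }
        // Sequence advanced between the reads: torn, retry.
    }
}

/// Writer protocol: FAA to odd, update data, FAA to even.
pub fn write_lvb(lvb: &Lvb, val: u64) {
    lvb.seq.fetch_add(1, Ordering::AcqRel); // FAA #1: odd = writing
    for d in &lvb.data {
        d.store(val, Ordering::Release);
    }
    lvb.seq.fetch_add(1, Ordering::AcqRel); // FAA #2: even = stable
}
```

The real implementation replaces the two atomic loads of `seq` with one-sided RDMA Reads and the two `fetch_add` calls with RDMA FAA operations; the retry logic is identical.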
LVB single-writer guarantee: The double-read protocol's correctness depends on there being at most one concurrent LVB writer for a given resource. This invariant is provided by the DLM lock itself: only a node holding an EX (Exclusive) or PW (Protected Write) lock on a resource may write to that resource's LVB (per the DLM compatibility matrix in Section 15.15). Because the DLM guarantees that at most one node holds EX or PW on a resource at any time, the single-writer invariant is guaranteed by the lock mode rules — no additional coordination is needed. During master failover, LVB writes are suspended until the new master is established and the lock state has been recovered, preventing interleaved writes from two nodes each believing they hold the lock.
RDMA ordering correctness argument: The writer updates the LVB via three RDMA operations posted to a single Reliable Connection (RC) Queue Pair: (1) FAA on sequence, (2) RDMA Write to data bytes, (3) FAA on sequence. Per the InfiniBand Architecture Specification (Vol 1, Section 11.5), operations within a single RC QP are processed at the responder (target NIC) in posting order. Therefore, when FAA #3 completes, the data Write #2 has already completed at the responder's memory. A reader on a DIFFERENT QP (QP_B) may see operations from QP_A interleaved with its own reads — this is the "no inter-QP ordering" property of RDMA. However, the double-read protocol handles this correctly: if QP_A's operations interleave with QP_B's first Read, the torn value will differ from QP_B's second Read (because the writer changed data and/or sequence between reads), causing a retry. The only remaining concern is whether QP_A's three operations can interleave with BOTH of QP_B's reads to produce identical torn values — this is impossible because the FAA operations on the sequence counter are 8-byte RDMA atomics (always observed atomically, no partial reads), and the sequence counter is monotonically increasing. If the reader's two RDMA Reads see the same sequence value (even), the writer either completed all three operations before both reads (data is consistent) or has not started (data is unchanged). If the sequence values differ between the two reads, the reader retries. The double-read protocol is therefore correct under RDMA's relaxed inter-QP ordering model without requiring explicit fencing between QPs.
RDMA Read atomicity and the SIGMOD 2023 analysis: The InfiniBand Architecture Specification does not formally guarantee that an RDMA Read larger than 8 bytes is delivered atomically. Ziegler et al. (SIGMOD 2023) investigated this question and found that in practice, cache-line-aligned 64-byte RDMA Reads are delivered atomically on all tested hardware — their experiments observed no torn reads for objects that fit within a single cache line. This empirical finding supports our cache-line-aligned LVB design. Nevertheless, the IB spec provides no formal guarantee, and future NICs or memory subsystems could behave differently. The double-read protocol provides defence-in-depth across three complementary layers:
- Cache-line alignment (de facto atomicity): The #[repr(C, align(64))] requirement ensures the 64-byte LVB is always cache-line aligned. On all shipping RDMA NICs (ConnectX-5+, AWS EFA, RoCEv2 adapters), the responder NIC reads from the last-level cache or memory controller, which operates at cache-line granularity. A cache-line-aligned 64-byte read therefore arrives from the responder as a single coherent unit — a single PCIe TLP — providing de facto atomicity even without formal IB spec guarantees. This is the primary defence.
Hardware qualification note: 64-byte RDMA read atomicity is a de-facto property of specific NICs, not guaranteed by the InfiniBand specification. It is confirmed on: Mellanox/NVIDIA ConnectX-5, ConnectX-6, ConnectX-7 (single cache-line reads are atomic in the NIC's memory subsystem), and AWS EFA (Elastic Fabric Adapter) NICs. It is NOT guaranteed on iWARP NICs (Chelsio T6, Intel X722) or InfiniBand HCAs without this property. UmkaOS's LVB implementation checks for the RDMA_ATOMIC_64B capability flag at device initialization and falls back to the double-read protocol (read → check sequence → read again if sequence changed) when the flag is absent. The double-read protocol is correct regardless of hardware atomicity; the single-read optimization is enabled only when the flag is present.
- Probabilistic defence via double-read: Even if a torn read occurs on a specific platform (e.g., under unusual NUMA topology or memory subsystem conditions), the double-read comparison provides a strong probabilistic defence. For both reads to produce identical torn values, the writer's in-progress modifications must create the EXACT same byte pattern in both torn snapshots — including the monotonically increasing sequence counter. Because the sequence counter changes by exactly 2 per complete write (odd during update, even after), reconstructing the same even sequence value twice from independent torn reads of two different write phases would require an astronomically unlikely alignment of byte delivery from two distinct PCIe transactions. In practice this is negligible.
- Two-sided fallback (absolute correctness): After 8 retries the reader falls back to a two-sided RDMA Send to the resource master, which reads the LVB under its local lock and returns a consistent snapshot. This path is unconditionally correct regardless of RDMA read atomicity guarantees or NIC implementation details.
Together these three layers ensure correctness: the first eliminates torn reads on all known hardware, the second provides defence-in-depth on any hypothetical future hardware, and the third guarantees forward progress regardless of RDMA semantics.
Livelock prevention: A continuously-updated LVB could cause a reader to retry indefinitely (the writer keeps changing the sequence counter between the reader's two RDMA Reads). To prevent this, the reader enforces a maximum of 8 retries with exponential backoff (1 μs, 2 μs, 4 μs, ..., 128 μs). If all retries are exhausted, the reader falls back to a two-sided RDMA Send to the resource master, requesting a consistent LVB snapshot. The master reads the LVB under its local lock (preventing concurrent writer updates during the read) and returns the consistent value. This fallback adds ~5-8 μs but guarantees forward progress. In practice, a single retry suffices in over 99% of cases — the 8-retry limit is a safety bound for pathological writer contention.
Typical case: two one-sided RDMA Reads (64 bytes each) = ~3-4 μs total.
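The retry bound and backoff schedule can be sketched as follows (the constant and helper names are illustrative):

```rust
/// Maximum double-read retries before falling back to the two-sided
/// path, per the livelock-prevention rule. Illustrative sketch.
pub const MAX_LVB_RETRIES: u32 = 8;

/// Backoff (in microseconds) before retry `attempt` (0-based).
/// Returns None when retries are exhausted and the caller must send
/// a two-sided LVB read request to the resource master instead.
pub fn retry_backoff_us(attempt: u32) -> Option<u64> {
    if attempt >= MAX_LVB_RETRIES {
        None // exhausted: fall back to the two-sided master read
    } else {
        Some(1u64 << attempt) // 1, 2, 4, ..., 128 μs
    }
}
```

Worst case, the reader sleeps 1 + 2 + ... + 128 = 255 μs across the 8 attempts before taking the two-sided fallback.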
After lock master recovery (Section 15.15), LVBs from dead holders are marked
INVALID (sequence counter set to u64::MAX). The next EX or PW holder must refresh
the LVB from disk before other nodes can trust it (both EX and PW are write modes that
can update the LVB, per the compatibility matrix above).
15.15.3.1 Two-Sided LVB Read Fallback¶
Applicability of the one-sided LVB read protocol above: The double-read/seqlock protocol described above (RDMA Read → check sequence → retry) applies ONLY when transport.supports_one_sided() == true (RDMA, CXL). For TCP peers, LVBs are read via the two-sided LvbReadRequest/LvbReadResponse path described in this section, or piggybacked on lock grant messages (DlmLockGrantPayload.lvb_len). The double-read protocol is never used on TCP.
For peers connected via TCP (where transport.supports_one_sided() == false), a node
that needs to read an LVB without acquiring a lock uses the two-sided LVB read path:
- The requester sends a DlmMessageType::LvbReadRequest message to the resource master, identifying the resource by name hash.
- The master receives the request, acquires the resource's inner SpinLock (preventing concurrent LVB writes), reads the current LVB data, sequence counter, and rotation_epoch, releases the lock, and sends a DlmMessageType::LvbReadResponse containing the 64-byte LVB content, sequence counter, and rotation epoch.
- The requester uses the received LVB data directly — no double-read or sequence checking is needed because the master serialized the read under its local lock.
Cost: ~50-200 μs on TCP (one round-trip) vs ~3-5 μs for RDMA one-sided double-read. Still far cheaper than a full lock acquire+release round-trip (~100-400 μs on TCP), because the LVB read does not modify lock state, does not enter the granted/waiting queues, and does not generate BAST callbacks.
The DlmResource::read_lvb() API dispatches transparently:
/// Read the LVB for a resource without acquiring a lock.
/// Returns the 64-byte LVB data and the sequence counter.
///
/// Transport dispatch:
/// - RDMA/CXL: one-sided double-read with seqlock protocol (fast path).
/// - TCP: two-sided LvbReadRequest/LvbReadResponse (message path).
pub fn read_lvb(&self, resource: &ResourceName) -> Result<LockValueBlock, DlmError> {
let master = self.hash_ring.master(resource);
let transport = self.peer_transport(master);
if transport.supports_one_sided() {
self.read_lvb_one_sided(master, resource)
} else {
self.read_lvb_two_sided(master, resource)
}
}
Wire message structs for the two-sided LVB read path are defined in the
"DLM Wire Protocol" section below (DlmLvbReadRequestPayload,
DlmLvbReadResponsePayload).
LVB write protocol — TCP alternative: The FAA+Write+FAA sequence described in the "RDMA ordering correctness argument" section above applies only to RDMA transports where one-sided writes are used. TCP peers do not use FAA or RDMA Write for LVB updates. Instead, LVB updates are carried in LockConvert or LockRelease messages: the LVB data is appended to the wire message per the lvb_len field in DlmLockConvertPayload/DlmLockReleasePayload. The master updates DlmResourceInner.lvb under the resource's SpinLock. No FAA or RDMA Write operations are involved — the seqlock protocol is unnecessary for message-based transports because the master serializes all updates.
LVB read direction — LockGrant: LVBs are also piggybacked on LockGrant messages (master-to-requester direction) via DlmLockGrantPayload.lvb_len. This is the read path: when a node acquires a PR/CR lock, the grant message carries the current LVB snapshot. The write direction (holder-to-master) uses LockConvert and LockRelease as described above.
15.15.4 Lock Resource Naming and Master Assignment¶
Lock resources are identified by hierarchical names that encode the filesystem, resource type, and specific object:
Format: <filesystem>:<uuid>:<type>:<id>[:<subresource>]
Examples:
gfs2:550e8400-e29b:inode:12345:data — data lock for inode 12345
gfs2:550e8400-e29b:inode:12345:meta — metadata lock for inode 12345
gfs2:550e8400-e29b:rgrp:42 — resource group 42 allocation lock
gfs2:550e8400-e29b:journal:3 — journal 3 ownership lock
gfs2:550e8400-e29b:dir:789:bucket:5 — directory 789 hash bucket 5
app:mydb:table:users:row:1001 — application-level row lock
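A formatting helper for these hierarchical names might look like this (an illustrative sketch, not the actual KABI builder):

```rust
/// Build a DLM resource name per the documented format:
/// <filesystem>:<uuid>:<type>:<id>[:<subresource>]
/// Hypothetical helper for illustration only.
pub fn resource_name(fs: &str, uuid: &str, ty: &str, id: u64,
                     sub: Option<&str>) -> String {
    match sub {
        Some(s) => format!("{fs}:{uuid}:{ty}:{id}:{s}"),
        None => format!("{fs}:{uuid}:{ty}:{id}"),
    }
}
```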
Master assignment: Each lock resource is assigned a master node responsible for
maintaining the granted/converting/waiting queues. The master is determined by
consistent hashing using a virtual-node ring (note: this is deliberately different
from DSM home-node assignment in Section 6.4, which uses modular hashing —
hash % cluster_size — for simpler O(1) lookups; DLM uses consistent hashing because
lock resources are more numerous and benefit from minimal redistribution on node changes):
// Each physical node has V virtual nodes on the ring (default V=64).
// The ring is a sorted array of (hash, physical_node_id) pairs.
ring = [(hash(node_0, vnode_0), 0), (hash(node_0, vnode_1), 0), ...,
(hash(node_N, vnode_V), N)]
master(resource_name) = ring.successor(hash(resource_name)).physical_node_id
When a node joins or leaves the cluster, only ~1/N of total resources are remapped
(the resources whose ring position falls between the departed node's virtual nodes
and their successors). This is the key property of consistent hashing — unlike
modular hashing (hash % cluster_size), which remaps nearly all resources on
membership change.
Design choice — consistent hashing vs. directory-based master assignment: Linux's
DLM uses modular hashing for lock resource mastering. UmkaOS uses consistent hashing with
virtual nodes because: (1) it is fully distributed with no single point of failure — any
node can compute any resource's master locally from the ring (O(log V×N) binary search);
(2) membership changes remap only ~1/N of resources instead of ~all. (As noted above,
DSM home-node assignment in Section 6.4 deliberately retains modular hashing; the two
are separate protocols with different tradeoffs, not a shared scheme.) The tradeoff is
that consistent hashing cannot
optimize for locality (a node that uses a resource heavily is not preferentially assigned
as its master). For workloads where locality matters (e.g., a single node accessing a
file exclusively), the DLM's lease mechanism (Section 15.15) compensates: the holder
simply extends its lease without contacting the master, so master location is irrelevant
on the fast path.
/// Consistent hash ring for DLM master assignment. Each physical node
/// contributes V virtual nodes (default 64) to the ring. The ring is a
/// sorted array of (hash_point, node_id) pairs; master lookup is an
/// O(log(N × VNODES_PER_NODE)) binary search for the successor of
/// hash(resource_name).
///
/// The ring is immutable between membership changes. On node join/departure,
/// a new ring is computed and swapped atomically via RCU. Lock operations in
/// flight see a consistent snapshot — either the old ring or the new one,
/// never a partially-updated ring.
pub struct DlmHashRing {
/// Sorted array of (hash_point, physical_node_id) pairs.
/// Length = N_nodes * VNODES_PER_NODE. Sorted by hash_point ascending.
/// Binary search finds the successor: the first entry with
/// hash_point >= hash(resource_name). If no such entry exists (wrap),
/// the first entry in the array is the successor (ring wraps around).
///
/// Hash function: SipHash-2-4(resource_name) → u64 for resource lookups;
/// SipHash-2-4(node_id || vnode_index) → u64 for ring point generation.
/// SipHash is chosen for DoS resistance (keyed hash prevents adversarial
/// resource name selection that skews master assignment).
pub points: ArrayVec<HashRingPoint, MAX_RING_POINTS>,
}
/// Maximum ring points = max cluster nodes * vnodes per node.
/// 256 nodes * 64 vnodes = 16384 points. Sufficient for the largest
/// supported cluster size.
pub const MAX_RING_POINTS: usize = 16384;
/// Virtual nodes per physical node on the consistent hash ring.
/// **Imbalance analysis**: With 64 vnodes per node and SipHash-2-4, the
/// expected load imbalance between the most- and least-loaded master nodes
/// is ~10-15% for clusters of 4-16 nodes, decreasing to ~5% at 64+ nodes
/// (standard deviation ~1/sqrt(vnodes)). 64 vnodes provides a practical
/// balance between ring size (memory: 16 bytes × 64 = 1 KiB per node)
/// and distribution uniformity. Increasing to 256 would reduce imbalance
/// to ~2-3% but quadruples per-node ring memory.
pub const VNODES_PER_NODE: u32 = 64;
/// A single point on the consistent hash ring.
/// Points are sorted by `(hash, node_id)` to break ties deterministically.
/// With SipHash-2-4 and 16384 points in a 64-bit hash space, collisions
/// have probability ~1.5e-11 per ring build, but the deterministic
/// tie-break ensures all nodes agree on the master even in the
/// degenerate case.
// kernel-internal, not KABI — local in-memory consistent hashing structure.
#[repr(C)]
pub struct HashRingPoint {
/// Hash value (SipHash-2-4 of node_id || vnode_index).
pub hash: u64,
/// Physical node ID that owns this ring point.
pub node_id: NodeId,
}
// kernel-internal: hash(8) + node_id(8) = 16 bytes.
const_assert!(core::mem::size_of::<HashRingPoint>() == 16);
impl DlmHashRing {
/// Look up the master node for a given resource name.
/// Returns the node_id of the first ring point whose hash >= hash(resource_name).
/// O(log(N * VNODES_PER_NODE)) binary search.
pub fn master(&self, resource_name: &ResourceName) -> NodeId {
let h = siphash_2_4(resource_name.as_bytes());
// Binary search for first point with hash >= h.
// If none found (h > all points), wrap to points[0].
match self.points.binary_search_by_key(&h, |p| p.hash) {
Ok(idx) => self.points[idx].node_id,
Err(idx) => {
if idx < self.points.len() {
self.points[idx].node_id
} else {
self.points[0].node_id // wrap around
}
}
}
}
/// Rebuild the ring after a membership change. Called when a node joins
/// or departs the cluster. The new ring is built from scratch from the
/// current member set and swapped in via RCU.
///
/// `members`: current cluster member set (post-join or post-departure).
/// `sip_key`: SipHash key for ring point generation (cluster-wide constant,
/// derived from the lockspace creation seed).
pub fn rebuild(members: &[NodeId], sip_key: &SipKey) -> Self {
let mut ring = DlmHashRing {
points: ArrayVec::new(),
};
for &node_id in members {
for vnode in 0..VNODES_PER_NODE {
let h = siphash_2_4_keyed(sip_key, &(node_id, vnode));
ring.points.push(HashRingPoint { hash: h, node_id });
}
}
ring.points.sort_unstable_by_key(|p| (p.hash, p.node_id));
ring
}
}
Master migration on membership change: When a node departs (crash or graceful
leave), the surviving nodes rebuild the hash ring. Resources whose master was the
departed node are now hashed to their successor in the new ring. The new master
broadcasts MasterMigration to all nodes that hold locks on affected resources.
Each node re-targets pending lock requests to the new master. Resources mastered on
surviving nodes are unaffected — their hash position and successor are unchanged.
When a node joins, the new ring is computed and ~1/N of resources shift to the new
node. The old master for each shifted resource sends the resource's granted/converting/
waiting queues to the new master via a MasterTransfer message. Lock operations on
shifted resources are briefly queued (not rejected) during the transfer window
(~1-5 ms typical).
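The master-stability property — a departure remaps only the resources the departed node mastered — can be demonstrated with a miniature ring. This sketch uses std's DefaultHasher in place of keyed SipHash-2-4, a small vnode count, and a linear scan instead of binary search, purely for brevity:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const VNODES: u64 = 16; // illustrative; the spec uses 64

fn h<T: Hash>(v: &T) -> u64 {
    let mut s = DefaultHasher::new(); // stand-in for keyed SipHash-2-4
    v.hash(&mut s);
    s.finish()
}

/// Build a sorted ring of (hash_point, node_id) pairs from the member set.
fn build_ring(members: &[u64]) -> Vec<(u64, u64)> {
    let mut ring: Vec<(u64, u64)> = members
        .iter()
        .flat_map(|&n| (0..VNODES).map(move |v| (h(&(n, v)), n)))
        .collect();
    ring.sort_unstable(); // sorted by (hash, node_id): deterministic tie-break
    ring
}

/// Successor lookup: first ring point with hash >= hash(resource),
/// wrapping to the first point. (Linear scan here; binary search in
/// the real DlmHashRing.)
fn master(ring: &[(u64, u64)], resource: &str) -> u64 {
    let target = h(&resource);
    match ring.iter().find(|&&(point, _)| point >= target) {
        Some(&(_, node)) => node,
        None => ring[0].1, // wrap around
    }
}
```

Because removing a node only deletes that node's ring points, a resource whose successor point belonged to a surviving node keeps exactly the same master after the rebuild.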
/// Intrusive doubly-linked list node. Embedded in structs that need to
/// be linked without heap allocation.
///
/// # Safety invariant
/// A node must be removed from all lists before its containing struct
/// is freed. Leaving a dangling node pointer causes use-after-free.
///
/// Fields are private to encapsulate unsafe pointer manipulation.
/// All modifications go through `IntrusiveList` methods that document
/// their safety contracts.
pub struct IntrusiveListNode {
prev: *mut IntrusiveListNode,
next: *mut IntrusiveListNode,
}
impl IntrusiveListNode {
/// Create a new unlinked node with null prev/next pointers.
///
/// After slab allocation places the node at its permanent address,
/// the caller must invoke [`init_at_final_address`] to establish the
/// self-referential "unlinked" state. Until then, `is_unlinked()`
/// returns `true` (null pointers are treated as unlinked).
pub const fn new() -> Self {
IntrusiveListNode {
prev: core::ptr::null_mut(),
next: core::ptr::null_mut(),
}
}
/// Initialise a node at its permanent (pinned) address so that
/// prev/next point to itself, establishing the "unlinked" sentinel
/// state.
///
/// # Safety
///
/// `this` must point to a valid, pinned `IntrusiveListNode` that
/// will not be moved for the lifetime of the containing allocation
/// (e.g., a slab object).
pub unsafe fn init_at_final_address(this: *mut IntrusiveListNode) {
// SAFETY: Caller guarantees `this` is valid and pinned.
unsafe {
(*this).prev = this;
(*this).next = this;
}
}
/// Returns true if this node is not currently linked in any list.
///
/// A node is unlinked if both pointers are null (freshly constructed)
/// or both point to self (initialised but not inserted).
pub fn is_unlinked(&self) -> bool {
let self_ptr = self as *const _ as *mut _;
(self.prev.is_null() && self.next.is_null())
|| (self.prev == self_ptr && self.next == self_ptr)
}
}
/// Head sentinel for an intrusive list. The `prev`/`next` pointers
/// form a circular doubly-linked list with the head acting as a
/// sentinel. An empty list has `head.prev == head.next == &head`.
pub struct IntrusiveListHead {
sentinel: IntrusiveListNode,
len: usize,
}
impl IntrusiveListHead {
/// Return the number of entries in this list.
pub fn len(&self) -> usize { self.len }
/// Return true if the list is empty.
pub fn is_empty(&self) -> bool { self.len == 0 }
}
/// Typed intrusive list. `T` must embed an `IntrusiveListNode` accessible
/// via the `node_offset` (computed by `field_offset!` at the call site).
///
/// All pointer manipulation is encapsulated in `insert_after()`,
/// `insert_before()`, and `remove()` methods. These are the only
/// entry points that modify `IntrusiveListNode` fields, ensuring
/// safety invariants are auditable in one location.
pub struct IntrusiveList<T> {
head: IntrusiveListHead,
_marker: PhantomData<T>,
}
impl<T> IntrusiveList<T> {
/// Insert `node` after the sentinel (at the front of the list).
///
/// # Safety
/// `node` must be a valid pointer to an `IntrusiveListNode` embedded
/// in a live `T`. The node must not be currently linked in any list.
/// The caller must hold the protecting lock (e.g., `DlmResource.inner`).
    pub unsafe fn insert_front(&mut self, node: *mut IntrusiveListNode) {
        // SAFETY: caller guarantees node validity and mutual exclusion.
        unsafe {
            (*node).next = self.head.sentinel.next;
            (*node).prev = &mut self.head.sentinel as *mut _;
            (*self.head.sentinel.next).prev = node;
            self.head.sentinel.next = node;
        }
        self.head.len += 1;
    }
    /// Insert `node` before the sentinel (at the back of the list).
    ///
    /// # Safety
    /// Same preconditions as `insert_front`.
    pub unsafe fn insert_back(&mut self, node: *mut IntrusiveListNode) {
        // SAFETY: caller guarantees node validity and mutual exclusion.
        unsafe {
            (*node).prev = self.head.sentinel.prev;
            (*node).next = &mut self.head.sentinel as *mut _;
            (*self.head.sentinel.prev).next = node;
            self.head.sentinel.prev = node;
        }
        self.head.len += 1;
    }
    /// Remove `node` from this list.
    ///
    /// # Safety
    /// `node` must be currently linked in THIS list (not another list).
    /// The caller must hold the protecting lock.
    pub unsafe fn remove(&mut self, node: *mut IntrusiveListNode) {
        // SAFETY: caller guarantees node is in this list and mutual exclusion.
        unsafe {
            (*(*node).prev).next = (*node).next;
            (*(*node).next).prev = (*node).prev;
            // Reset to self-referential (unlinked sentinel).
            (*node).prev = node;
            (*node).next = node;
        }
        self.head.len -= 1;
    }
}
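A self-contained miniature of the pattern shows the sentinel invariants end to end. This sketch boxes the nodes (so their addresses are stable, standing in for slab-pinned objects) and elides the typed wrapper and lock discipline:

```rust
use std::ptr;

// Simplified intrusive node: just the two link pointers.
pub struct Node {
    pub prev: *mut Node,
    pub next: *mut Node,
}

pub struct List {
    sentinel: Box<Node>, // boxed so the sentinel address is stable
    pub len: usize,
}

impl List {
    pub fn new() -> List {
        let mut s = Box::new(Node { prev: ptr::null_mut(), next: ptr::null_mut() });
        let p: *mut Node = &mut *s;
        s.prev = p; // empty list: sentinel points to itself
        s.next = p;
        List { sentinel: s, len: 0 }
    }

    /// # Safety
    /// `node` must be valid, pinned, and not linked in any list.
    pub unsafe fn insert_back(&mut self, node: *mut Node) {
        unsafe {
            let s: *mut Node = &mut *self.sentinel;
            (*node).prev = (*s).prev;
            (*node).next = s;
            (*(*s).prev).next = node;
            (*s).prev = node;
        }
        self.len += 1;
    }

    /// # Safety
    /// `node` must be currently linked in THIS list.
    pub unsafe fn remove(&mut self, node: *mut Node) {
        unsafe {
            (*(*node).prev).next = (*node).next;
            (*(*node).next).prev = (*node).prev;
            (*node).prev = node; // restore the unlinked sentinel state
            (*node).next = node;
        }
        self.len -= 1;
    }
}
```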
/// DLM resource name. Variable-length, hierarchical (e.g., "gfs2:fsid:inode:12345").
/// Maximum 256 bytes (matching the DLM protocol maximum resource name length).
/// Compared by byte equality for lock matching.
///
/// **Memory budget**: At 258 bytes per `ResourceName`, 100K lock resources
/// consume ~25 MB; 1M resources consume ~246 MB. For workloads exceeding
/// 500K concurrent lock resources, a compact representation (inline 64 bytes
/// + slab-allocated overflow) is recommended as a Phase 4+ optimization.
pub struct ResourceName {
/// Name bytes (NUL-terminated, max 256 bytes including NUL).
pub bytes: [u8; 256],
/// Length of the name (excluding NUL terminator).
pub len: u16,
}
/// Wait-for graph for distributed deadlock detection.
/// Nodes are lock holders (identified by (node_id, lock_id) pairs).
/// Edges represent "waits for" relationships. Cycle detection runs
/// periodically (default: every 100ms) using a DFS traversal.
pub struct WaitForGraph {
/// Adjacency list: waiter → set of holders it's waiting for.
/// Bounded to MAX_CONCURRENT_LOCKS (65536) entries.
///
/// **BTreeMap rationale**: Deadlock detection runs only after a lock
/// has been waiting >5 seconds (see Section 15.12.9) — this is off the
/// hot lock-grant path entirely. BTreeMap provides ordered iteration by
/// WaiterId (NodeId, lock_id), ensuring all cluster nodes evaluate
/// deadlock victim candidates in the same deterministic order. This
/// eliminates the need for an explicit sort before victim selection.
/// The DFS cycle detection itself traverses per-vertex adjacency lists
/// (ArrayVec), not the BTreeMap iteration order; the BTreeMap ordering
/// matters only for victim selection when multiple candidate victims
/// exist. The 65536-entry bound caps memory at roughly 10 MB
/// (65536 × (16-byte key + 8 × 16-byte ArrayVec slots + map overhead)),
/// acceptable for a background structure.
/// An alternative HashMap would give O(1) average but non-deterministic
/// iteration order would require an explicit sort before victim
/// selection.
pub edges: BTreeMap<WaiterId, ArrayVec<WaiterId, 8>>,
}
/// Deterministic ordering is required for consistent cycle detection
/// across all cluster nodes.
#[derive(Ord, PartialOrd, Eq, PartialEq, Clone, Copy)]
pub struct WaiterId {
pub node_id: NodeId,
pub lock_id: u64,
}
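The periodic cycle check over this graph can be sketched as follows (Vec stands in for the bounded ArrayVec and a plain u64 for WaiterId; this illustrates the deterministic-order traversal, not the production implementation):

```rust
use std::collections::BTreeMap;

/// DFS cycle detection over the wait-for graph. Vertices are visited
/// in BTreeMap key order, so every cluster node running this scan on
/// the same graph reports the same first cycle (and therefore agrees
/// on the victim candidate).
pub fn find_cycle(edges: &BTreeMap<u64, Vec<u64>>) -> Option<Vec<u64>> {
    fn dfs(v: u64, edges: &BTreeMap<u64, Vec<u64>>,
           path: &mut Vec<u64>) -> Option<Vec<u64>> {
        if let Some(pos) = path.iter().position(|&x| x == v) {
            return Some(path[pos..].to_vec()); // v closes a wait cycle
        }
        path.push(v);
        if let Some(succ) = edges.get(&v) {
            for &w in succ {
                if let Some(cycle) = dfs(w, edges, path) {
                    return Some(cycle);
                }
            }
        }
        path.pop();
        None
    }
    // Deterministic start order: BTreeMap iterates keys in sorted order.
    for &v in edges.keys() {
        if let Some(cycle) = dfs(v, edges, &mut Vec::new()) {
            return Some(cycle);
        }
    }
    None
}
```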
/// A lock resource managed by the DLM.
///
/// **Locking protocol**: The `inner` field wraps all mutable resource state
/// (LVB, lock queues, pending CAS) in a `SpinLock`. This lock is held for
/// O(1) operations only: queue manipulation, LVB read/write, pending CAS
/// update. It MUST NOT be held across any network message send or RDMA
/// operation — those are performed after releasing the lock with the
/// relevant data copied out.
///
/// Lock ordering: `DlmResource.inner` is below lockspace-level locks
/// (e.g., `DlmLockspace.shard_locks`) and above nothing — it is a leaf
/// lock. Acquiring two `DlmResource.inner` locks simultaneously is
/// forbidden (deadlock risk with lock conversion across resources).
pub struct DlmResource {
/// Resource name (hierarchical, variable-length).
/// Immutable after creation — no lock needed for reads.
pub name: ResourceName,
/// Node ID of the resource master.
/// Updated only during re-mastering (under lockspace shard lock).
pub master: NodeId,
/// Mutable resource state protected by a SpinLock.
pub inner: SpinLock<DlmResourceInner>,
/// Per-resource DSM dependency for recovery ordering. Tracks whether
/// this resource's CAS word page resides in a DSM region. Used during
/// node failure recovery to determine whether re-mastering must wait
/// for DSM home reconstruction
/// ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--cross-subsystem-recovery-ordering-dsm-and-dlm)).
/// Populated at resource creation time when the master allocates CAS
/// word arrays from RDMA-registered memory. Immutable after creation.
pub dsm_dep: DlmResourceDsmDep,
}
/// Mutable state within a `DlmResource`, protected by `DlmResource.inner`
/// SpinLock. All fields in this struct require holding the lock for access.
pub struct DlmResourceInner {
/// Lock Value Block for this resource.
pub lvb: LockValueBlock,
/// Rotation epoch — incremented each time the LVB sequence counter is
/// rotated (reset to 0). Returned in `DlmLvbReadResponsePayload` so
/// lockless `read_lvb()` callers can detect rotation discontinuities.
/// See "Rotation safety for lockless `read_lvb()` callers".
pub rotation_epoch: u64,
/// Granted queue — locks currently held.
/// Intrusive linked list: DlmLock nodes are allocated from a per-lockspace
/// slab allocator (fixed-size, no heap resizing on the lock grant path).
pub granted: IntrusiveList<DlmLock>,
/// Converting queue — locks being converted (upgrade/downgrade).
/// Processed in FIFO order before the waiting queue.
pub converting: IntrusiveList<DlmLock>,
/// Waiting queue — new lock requests waiting for compatibility.
pub waiting: IntrusiveList<DlmLock>,
/// Pending CAS confirmations ([Section 15.15](#distributed-lock-manager--rdma-atomic-cas-lock-fast-path)).
/// When remote nodes acquire a lock via RDMA CAS but have not yet sent
/// the confirmation RDMA Send, this field tracks the expected confirmations.
/// The master defers processing new incompatible-mode requests against this
/// resource until all confirmations arrive or time out. A bounded collection
/// is required — not Option<PendingCas> — because shared-mode CAS operations
/// (e.g., PR acquires) allow multiple peers to win concurrently (each
/// successive shared-mode CAS increments holder_count). For exclusive-mode
/// CAS (EX, PW), at most one entry exists. Cap of 64 is bounded by CAS
/// serialization within the 500 us confirmation timeout, not by cluster size
/// (a single CAS takes ~1 us round-trip; at most ~64 can complete within
/// 500 us under contention).
pub pending_cas: ArrayVec<PendingCas, MAX_PENDING_CAS>,
}
/// Region identifier for DSM (Distributed Shared Memory) regions.
/// Matches the DsmRegionId defined in the DSM subsystem ([Section 6.2](06-dsm.md#dsm-design-overview)).
/// `DsmRegionId` (u64 alias in DLM) is the unwrapped value from
/// `DsmRegionHandle(u64)` in the DSM subsystem ([Section 6.1](06-dsm.md#dsm-foundational-types)).
type DsmRegionId = u64;
/// Per-resource DSM dependency metadata for recovery ordering.
/// 16 bytes overhead per DlmResource.
pub struct DlmResourceDsmDep {
/// DSM region containing this resource's CAS word page.
/// 0 = no DSM region dependency (CAS word is in local, non-DSM RDMA
/// pool memory — the common case for lockspaces that do not use
/// DSM-backed state). DSM region IDs are always positive (assigned
/// by the region coordinator starting at 1).
pub region_id: u64,
/// Virtual address of the CAS word within the DSM region.
/// Used during recovery to check whether the specific page was homed
/// on the failed node: `home_node(region_id, cas_word_va) == failed_node`.
/// 0 when `region_id == 0` (kernel virtual addresses are never 0).
pub cas_word_va: u64,
}
const_assert!(core::mem::size_of::<DlmResourceDsmDep>() == 16);
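The recovery-ordering check described in the `cas_word_va` comment can be sketched as follows. This is a hedged illustration, not the DLM implementation: `home_node` stands in for the DSM subsystem's page-homing lookup (its real signature lives in the DSM subsystem), and the struct mirrors `DlmResourceDsmDep` above.

```rust
// Sketch only: `home_node` is a hypothetical stand-in for the DSM
// subsystem's page-homing lookup; the struct mirrors DlmResourceDsmDep above.
pub struct DlmResourceDsmDep {
    pub region_id: u64,   // 0 = no DSM region dependency (local RDMA pool memory)
    pub cas_word_va: u64, // 0 when region_id == 0
}

/// During DLM recovery: was this resource's CAS word on a page homed
/// on the failed node?
fn cas_word_on_failed_node(
    dep: &DlmResourceDsmDep,
    failed_node: u64,
    home_node: impl Fn(u64, u64) -> u64,
) -> bool {
    // region_id == 0 means the CAS word is in local, non-DSM RDMA pool
    // memory, which never depends on a remote node's DSM home.
    dep.region_id != 0 && home_node(dep.region_id, dep.cas_word_va) == failed_node
}
```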
pub const MAX_PENDING_CAS: usize = 64;
/// Tracks a pending CAS confirmation for a DlmResource.
pub struct PendingCas {
/// Peer that performed the CAS.
pub peer: PeerId,
/// Lock mode the node acquired.
pub mode: LockMode,
/// Sequence value in the CAS word after the acquire (for timeout reset).
pub post_cas_sequence: u64,
/// Timestamp when the CAS was detected (for 500 μs timeout).
pub detected_at_ns: u64,
}
// Note on allocation strategy: DlmLock nodes are allocated from a per-lockspace
// slab allocator (umka-core Section 4.3). The slab pre-allocates DlmLock-sized objects
// and grows in page-sized chunks, so individual lock grant/release operations
// never trigger the general-purpose heap allocator. This ensures bounded latency
// on the contested lock path. The intrusive list avoids the pointer indirection
// and dynamic resizing of VecDeque/Vec.
//
// Note on byte-range lock tracking: each DlmLock's associated LockDirtyTracker
// (Section 15.12.8) uses LargeRangeBitmap (not a flat SparseBitmap) to track dirty
// pages within the lock's byte range. This supports files of any practical size:
// ≤ 1 GiB files use the flat SparseBitmap path (zero overhead), while larger files
// use the two-level LargeRangeBitmap with lazily-allocated 1 GiB slots.
/// A single lock held or requested by a node.
pub struct DlmLock {
/// Node that owns this lock.
pub node: NodeId,
/// Requested/granted lock mode.
pub mode: LockMode,
/// Process ID on the owning node (for deadlock detection).
pub pid: u32,
/// Flags (NOQUEUE, CONVERT, CANCEL, etc.).
pub flags: LockFlags,
/// Timestamp for ordering and deadlock victim selection.
pub timestamp_ns: u64,
/// Revocation handler — called on the holder's node when the master
/// sends a revocation/downgrade request due to a conflicting lock.
///
/// The handler encapsulates all application-specific revocation logic:
/// - UPFS: flush dirty pages (targeted writeback), invalidate cache,
/// update LVB with latest metadata, then downgrade or release.
/// - VFS export: break client leases, flush data, release.
/// - Generic application: release immediately.
///
/// The DLM drives the entire flow: detect contention → send revocation
/// → handler runs on holder → handler calls dlm_convert() or
/// dlm_unlock() → DLM grants to the new requester. No separate
/// "token layer" needed — the handler IS the token behavior.
///
/// Set at lock acquire time. If None, the DLM uses a default handler
/// that releases the lock immediately on revocation.
pub revocation_handler: Option<&'static dyn DlmRevocationHandler>,
/// Intrusive list linkage for membership in `DlmResourceInner.granted`,
/// `.converting`, or `.waiting` queues. A DlmLock is in exactly one
/// queue at any time. Access requires holding `DlmResource.inner`.
pub queue_link: IntrusiveListNode,
}
/// **DlmLock intrusive list lifecycle**:
/// 1. **Allocation**: DlmLock is allocated from the per-lockspace slab on `dlm_lock()`.
/// 2. **Waiting**: Inserted into `DlmResourceInner.waiting` queue. Remains there until
/// compatibility check passes or the request is cancelled.
/// 3. **Granted**: Moved from `waiting` → `granted` when the lock mode is compatible
/// with all existing grants. The move is O(1) (unlink + relink).
/// 4. **Converting**: Moved from `granted` → `converting` on `dlm_convert()`.
/// Moved back to `granted` when the conversion is compatible.
/// 5. **Release**: Removed from whichever queue it occupies on `dlm_unlock()`.
/// The slab object is returned to the per-lockspace slab allocator.
/// A DlmLock is NEVER on two queues simultaneously.
/// Trait for lock revocation handlers. Implemented by subsystems that
/// need custom behavior on lock downgrade/revocation (UPFS, VFS export,
/// block export reservations).
///
/// The handler runs in a DLM worker thread on the lock holder's node.
/// It must complete within a bounded time (configurable per lockspace,
/// default: 5 seconds). If the handler exceeds the timeout, the DLM
/// forcibly releases the lock and logs an FMA event.
pub trait DlmRevocationHandler: Send + Sync {
/// Called when the DLM master requests downgrade from `current_mode`
/// to `requested_mode` (e.g., EX → PR when a reader arrives).
///
/// The handler MUST:
/// 1. Perform any application-specific cleanup (flush dirty data,
/// invalidate caches, update LVB).
/// 2. Call `lock.convert(requested_mode)` to complete the downgrade.
/// OR call `lock.unlock()` to release entirely.
///
/// `context` carries the conflicting requester's information (node,
/// requested mode) for handlers that need it (e.g., UPFS may choose
/// different flush strategies based on the requester's lock type).
fn on_revoke(&self, lock: &DlmLock, current_mode: LockMode,
requested_mode: LockMode, context: &RevocationContext);
}
/// Context passed to revocation handlers.
pub struct RevocationContext {
/// Node requesting the conflicting lock.
pub requester_node: NodeId,
/// Mode requested by the conflicting lock.
pub requester_mode: LockMode,
/// Urgency: Normal (best-effort timing) or Urgent (minimize delay,
/// used for fencing and reservation preemption).
pub urgency: RevocationUrgency,
}
#[repr(u8)]
pub enum RevocationUrgency {
/// Normal revocation. Handler has the full timeout to complete.
Normal = 0,
/// Urgent revocation (fencing, reservation preempt). Handler should
/// complete as quickly as possible. DLM reduces the timeout to 1 second.
Urgent = 1,
}
/// Opaque handle to an acquired DLM lock. Returned by lock_acquire()
/// and lock_any_of(), used by lock_release() and lock_convert().
/// Contains the lock identity (resource + mode) and a version counter
/// to detect stale handles after lock migration or failover.
pub struct DlmLockHandle {
/// Unique ID assigned by the DLM master at grant time.
pub lock_id: u64,
/// Name of the locked resource.
pub resource_name: ResourceName,
/// Granted lock mode (may differ from requested mode after convert).
pub mode: LockMode,
/// Version counter — incremented on each convert or migration.
/// Used by the DLM to reject operations on stale handles.
pub version: u64,
/// Optional causal consistency attachment for DSM-coordinated locks.
/// When a DLM lock protects DSM-managed pages, the lock release or
/// downgrade message carries the releasing node's CausalStampWire so
/// the next lock holder can verify causal ordering of DSM page updates.
/// Set by `dsm_bind_lock()` when the lock is bound to a DSM region;
/// `None` for non-DSM locks. On lock release/downgrade, if this field
/// is `Some`, the CausalStampWire is serialized into the LOCK_DOWNGRADE
/// or LOCK_RELEASE message payload sent to the lock master, which
/// forwards it to the next granted holder.
/// See [Section 6.6](06-dsm.md#dsm-coherence-protocol-moesi) §6.6 for the causal consistency
/// protocol and CausalStampWire wire format.
pub dsm_causal_stamp: Option<CausalStampWire>,
}
/// Handle binding a DLM lock to DSM dirty tracking.
/// Created by `dsm_bind_dirty_tracker()`, released on `lock_release()`.
/// While this binding is active, dirty page tracking on the locked
/// resource is forwarded to the DSM bitmap identified by `region_id`.
pub struct DsmLockBindingHandle {
/// The underlying DLM lock that owns this binding.
pub lock: DlmLockHandle,
/// DSM region whose dirty bitmap is bound to this lock.
pub region_id: DsmRegionId,
/// Offset within the DSM dirty bitmap where this lock's
/// dirty tracking begins. Used to partition a single DSM region
/// across multiple lock-protected sub-ranges.
pub bitmap_offset: u32,
}
15.15.5 Transport-Agnostic Lock Operations¶
The DLM uses the ClusterTransport trait
(Section 5.10) for all network
operations. Each peer's transport is obtained from PeerNode.transport
(Arc<dyn ClusterTransport>). On RDMA peers, lock operations use one-sided RDMA
atomics for lowest latency (~2-3 μs CAS round-trip, ~3-5 μs full acquire). On TCP
peers, lock operations use serialized request-response messages (~50-200 μs). On CXL
peers, hardware CAS provides the fastest path (~0.1-0.3 μs). The DLM protocol is
identical across all transports; only the per-peer latency differs.
Transport selection at lockspace initialization: When a DLM lockspace is created
or a node joins an existing lockspace, the DLM calls select_transport()
(Section 5.5) for each peer in the lockspace.
The selected transport is stored per-peer in DlmLockspace and used for all subsequent
lock operations to that peer. Transport selection follows the standard priority:
CXL shared memory > RDMA > TCP. If transport.supports_one_sided() returns true
(RDMA, CXL), the DLM enables the CAS fast path for uncontested acquires. If
supports_one_sided() returns false (TCP), all lock operations use the two-sided
transport.send_reliable() path (protocol 2 below), which is still 5-10x faster
than Linux's TCP-based DLM due to integrated kernel-to-kernel messaging without
userspace daemon involvement.
Four protocol flows cover the full lock lifecycle:
1. Uncontested acquire (transport.atomic_cas(), ~3-5 μs on RDMA, ~50-200 μs on TCP)
When a resource has no current holders or only compatible holders, and the transport
supports one-sided operations (transport.supports_one_sided() == true), the requesting
node can acquire the lock via transport.atomic_cas() on the master's lock state
word — a 64-bit value encoding the current lock state. On RDMA transports, this maps
to a NIC-side RDMA Atomic CAS (zero remote CPU involvement). On TCP transports, the
remote kernel thread performs the CAS locally and returns the old value:
/// 64-bit lock state word, laid out for RDMA Atomic CAS.
/// Stored in master's RDMA-accessible memory for each DlmResource.
///
/// bits [63:61] = current_mode (3 bits: 0=NL, 1=CR, 2=CW, 3=PR, 4=PW, 5=EX)
/// bits [60:48] = holder_count (13 bits: up to 8191 concurrent holders;
/// sufficient for clusters with hundreds of peers, with
/// margin for future expansion)
/// bits [47:0] = sequence (48 bits: monotonic counter for ABA prevention)
///
/// IMPORTANT: current_mode encodes a SINGLE lock mode. This means the CAS fast
/// path only works for HOMOGENEOUS holder sets — all holders must be in the same
/// mode. When holders have different compatible modes (e.g., CR + PR, or CR + PW),
/// the CAS word cannot represent the mixed state. These transitions MUST use the
/// two-sided RDMA Send path (protocol 2 below), where the master's control thread
/// maintains per-holder mode information in the full DlmResource granted queue.
///
/// This is a deliberate design tradeoff: the CAS fast path covers the most common
/// lock patterns in practice:
/// - EX for exclusive write access (single writer)
/// - PR for shared read access (multiple readers)
/// - CR for concurrent read (e.g., GFS2 inode attribute reads via LVB)
/// Mixed-mode combinations (CR+PR, CR+PW, CR+CW) are valid but uncommon in
/// GFS2 workloads — they arise primarily during mode transitions (one node
/// downgrades while another acquires). The two-sided path at ~5-8μs is still
/// 5-10x faster than Linux's TCP-based DLM.
///
/// ABA safety: 48-bit sequence counter. At 500,000 lock ops/sec on a single
/// resource (sustained maximum), wrap time = 2^48 / 500,000 = ~563 million
/// seconds (~17.8 years). This eliminates ABA as a practical concern.
/// (Note: this is the CAS lock-word sequence counter, which increments once
/// per lock acquisition. The LVB sequence counter in Section 15.12.3 wraps in ~9.2
/// years because it increments twice per write — once at begin_write, once
/// at end_write — giving 1M increments/sec at 500K writes/sec.)
///
/// The full granted/converting/waiting queues are maintained separately in the
/// master's local memory. The CAS word is a fast-path optimization — it
/// encodes enough state for common homogeneous transitions without remote CPU
/// involvement. The master's granted queue is the authoritative lock state;
/// the CAS word is a cache of that state for the fast path.
CAS fast path cases (homogeneous mode only):

| Transition | CAS expected | CAS desired | Ops | Notes |
|---|---|---|---|---|
| Unlocked → EX | NL\|0\|seq | EX\|1\|seq+1 | 1 CAS | First exclusive holder |
| Unlocked → PR | NL\|0\|seq | PR\|1\|seq+1 | 1 CAS | First protected reader |
| Unlocked → CR | NL\|0\|seq | CR\|1\|seq+1 | 1 CAS | First concurrent reader |
| PR → PR (add reader) | PR\|K\|seq | PR\|K+1\|seq+1 | Read + CAS | Add same-mode holder |
| CR → CR (add reader) | CR\|K\|seq | CR\|K+1\|seq+1 | Read + CAS | Add same-mode holder |
| EX → NL (unlock) | EX\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last holder releases |
| PR → NL (last reader) | PR\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last holder releases |
| CR → NL (last reader) | CR\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last holder releases |
| PR (remove reader) | PR\|K\|seq | PR\|K-1\|seq+1 | Read + CAS | K>1, decrement count |
| CR (remove reader) | CR\|K\|seq | CR\|K-1\|seq+1 | Read + CAS | K>1, decrement count |
| Unlocked → PW | NL\|0\|seq | PW\|1\|seq+1 | 1 CAS | Single PW holder (PW+PW incompatible) |
| Unlocked → CW | NL\|0\|seq | CW\|1\|seq+1 | 1 CAS | First concurrent writer |
| CW → CW (add writer) | CW\|K\|seq | CW\|K+1\|seq+1 | Read + CAS | CW is self-compatible (per Section 15.15 matrix) |
| CW → NL (last writer) | CW\|1\|seq | NL\|0\|seq+1 | 1 CAS | Last CW holder releases |
| CW (remove writer) | CW\|K\|seq | CW\|K-1\|seq+1 | Read + CAS | K>1, decrement count |
Transitions that CANNOT use CAS (require the two-sided path):

- Any mode conversion (e.g., PR→EX, EX→PR, CR→PW)
- Acquiring a mode different from current holders (e.g., CW when current_mode=CR, or PR when current_mode=CW)
- Adding a second PW holder (PW is not self-compatible)

These transitions require the master's control thread to evaluate the full compatibility matrix and update per-holder mode tracking in the granted queue.
Requester Master (remote memory)
| |
|--- transport.atomic_cas(master, ---->|
| lock_word_addr, |
| expected=UNLOCKED, |
| desired=EX|1|seq+1) |
|<-- old_value (CAS result) ------------------|
| |
If old_value matched expected: lock acquired.|
RDMA: zero remote CPU involvement, ~2-3 μs. |
TCP: server-side CAS, ~50-200 μs. |
Full acquire (CAS + confirmation): ~3-5 μs |
on RDMA, ~100-400 μs on TCP. |
For the Read+CAS path (adding a shared reader when holders exist), the requester first
reads the current state (transport.fetch_page() or transport.send_reliable() for
small reads), then calls transport.atomic_cas() to atomically increment the holder
count. Total: 2 transport operations (~3-5 μs on RDMA, ~100-400 μs on TCP). CAS failure
(due to concurrent modification) triggers retry with the returned value as the new
expected value.
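The Read+CAS retry loop described above can be sketched as follows. This is a hedged illustration: a local `AtomicU64` stands in for the master's lock state word, and `compare_exchange` stands in for `transport.atomic_cas()`; the bit layout matches the documented CAS word format.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the Read + CAS add-reader path. A local AtomicU64 stands in
// for the master's remote lock state word; compare_exchange stands in for
// transport.atomic_cas(). Layout: bits [63:61] mode, [60:48] count, [47:0] seq.
const PR: u64 = 3; // protected-read mode, per the documented encoding

fn try_add_reader(word: &AtomicU64) -> bool {
    let mut cur = word.load(Ordering::Acquire); // the initial "Read" step
    loop {
        let mode = cur >> 61;
        let count = (cur >> 48) & 0x1FFF;
        let seq = cur & ((1u64 << 48) - 1);
        // Not a homogeneous PR holder set (or counter saturated):
        // fall back to the two-sided path.
        if mode != PR || count == 0 || count == 0x1FFF {
            return false;
        }
        let desired = (PR << 61) | ((count + 1) << 48) | ((seq + 1) & ((1u64 << 48) - 1));
        match word.compare_exchange(cur, desired, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return true,
            // Concurrent modification: retry with the returned value as the
            // new expected value, exactly as described above.
            Err(observed) => cur = observed,
        }
    }
}
```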
Important: The CAS word is an optimization for the uncontested fast path. It does
NOT replace the full lock queues maintained in the master's local memory. When a CAS
succeeds, the acquiring node MUST send transport.send_reliable() to the master confirming
its identity (node ID) and the acquired lock mode. The master updates the full granted
queue upon receiving this confirmation. If the master does not receive confirmation
within ~500 μs (the confirmation timeout), it assumes the CAS winner crashed before
completing the acquire and resets the lock state word via its own CAS (restoring the
pre-acquire state). The CAS target word includes a generation counter (the 48-bit
sequence field) to prevent ABA issues during this reclamation — the master's
restoration CAS uses the post-acquire sequence value as the expected value, so a
concurrent legitimate acquire by another node will not be clobbered. This
confirmation step is a required correctness measure, not an optimization: without it,
if the CAS winner crashes before the master processes its queue entry, recovery would
iterate the granted queue and find no record of the holder, leaving the lock state word
permanently wedged.

When a CAS fails (contested lock, incompatible mode), the requester falls back to the
two-sided protocol below. The master's control thread is the sole authority for
complex operations (conversions, waiters, deadlock detection).
CAS outcome determination and transport failure recovery. RDMA Atomic CAS is a single round-trip operation: the RNIC performs the compare-and-swap on the remote memory and returns the previous value of the target word in the CAS completion. The requester determines the CAS outcome entirely from this return value — if the returned old value matches the expected value, the CAS succeeded and the lock is held. No separate "confirmation response" from the master's CPU is involved in determining CAS success or failure; the RDMA NIC hardware handles the entire operation atomically. This means the requester always knows whether it acquired the lock, as long as the RDMA completion is delivered.
If the RDMA transport itself fails during a CAS operation (e.g., the Queue Pair enters Error state due to a link failure, cable pull, or remote RNIC reset), the requester receives a Work Completion with an error status (not a successful CAS completion). In this case, the CAS may or may not have been applied to the master's memory — the requester cannot distinguish between "CAS was never sent", "CAS was sent but not executed", and "CAS succeeded but the response was lost in transit." The requester must handle this ambiguity:
- Assume the CAS may have succeeded. The requester must not retry the CAS blindly (doing so could double-acquire or corrupt the sequence counter).
- Query the master via a recovery path. The requester establishes a fresh RDMA connection (or uses a separate TCP fallback if the RDMA fabric is partitioned) and sends a two-sided lock state query to the master's control thread. The master reads its authoritative lock state — the CAS word in registered memory — and responds with the current lock state plus the sequence counter value.
- Master's lock word is ground truth. If the CAS word shows the requester's expected post-CAS value (matching mode, holder count, and sequence), the CAS succeeded and the requester proceeds with the confirmation RDMA Send (on the new connection). If the CAS word shows a different state, the CAS either was not applied or was already reclaimed by the master's confirmation timeout (the ~500 μs timeout described above). In either case, the requester starts a fresh lock acquisition attempt.
- Interaction with confirmation timeout. If the CAS succeeded but the requester takes longer than ~500 μs to query the master (due to connection re-establishment), the master may have already reclaimed the lock via its confirmation timeout logic. This is safe: the master's reclamation CAS uses the post-acquire sequence value, so if reclamation occurred, the lock word has been reset and the requester's recovery query will see the reset state. The requester then re-acquires normally.
This recovery path is exercised rarely (only on RDMA transport failures, not on normal CAS contention), so its higher latency (~1-5 ms for connection re-establishment + query) does not affect steady-state performance.
Pending CAS confirmation window: Between a successful CAS and the arrival of the confirmation Send, the CAS word and the master's granted queue are temporarily inconsistent — the CAS word shows a lock held, but the granted queue has no entry. During this window, if another node's CAS fails and it falls back to the two-sided path, the master must handle the discrepancy correctly:
- When the master receives a two-sided lock request, it checks BOTH the granted queue AND the CAS word state. If the granted queue is empty but the CAS word shows a held lock, the master knows a CAS confirmation is pending.
- The master enqueues the incoming request in the waiting queue and defers processing until either: (a) the CAS confirmation arrives (at which point the granted queue is updated and the waiting queue is processed normally), or (b) the confirmation timeout expires (at which point the master resets the CAS word and processes the waiting queue against the now-empty granted queue).
- If the pending CAS mode is compatible with the incoming request's mode (per the Section 15.15 compatibility matrix), the master grants the incoming request immediately without waiting for the CAS confirmation. The master also updates the CAS word via its own local CAS to reflect the new holder (incrementing holder_count in the CAS word to account for both the pending CAS winner and the newly granted node). The CAS winner's confirmation, when it arrives, simply adds the CAS winner to the already-updated granted queue. This eliminates the blocking window entirely for same-mode shared requests (e.g., multiple concurrent PR acquires), which are the most common contested case.
- For incompatible-mode requests, this deferred processing adds at most 500 μs of latency to the second node's request in the worst case (CAS winner crashed). In the normal case, the confirmation arrives within ~1-2 μs (one RDMA Send), so the deferred processing completes almost immediately. A crashed node's 500 μs delay is negligible compared to the 50-200 ms DLM recovery time.
- The master tracks pending CAS confirmations with a per-resource `pending_cas: ArrayVec<PendingCas, MAX_PENDING_CAS>` field (see the DlmResource struct in Section 15.15). A bounded collection is required — not `Option<PendingCas>` — because shared-mode CAS operations (e.g., PR acquires) allow multiple peers to win concurrently: each successive shared-mode CAS increments the `holder_count` field embedded in the CAS word and updates the sequence number, so two or more nodes can complete their CAS atomics before any confirmation arrives. The master must reconcile ALL concurrent CAS winners: it reads the final CAS word once all confirmations have arrived (or the polling timeout expires) and uses the `holder_count` to verify that the number of confirmations received matches the number of nodes that successfully CAS'd. Any node whose confirmation does not arrive within the timeout is treated as crashed and is excluded from the granted queue. For exclusive-mode CAS (EX, PW), at most one node can win — the CAS word format enforces mutual exclusion — so the collection contains at most one entry in that case. This field is set when the master observes a CAS word change via periodic polling of the CAS word in its registered memory region, and cleared when all confirmations arrive or the timeout expires. Note: the master does NOT receive RDMA completion queue notifications for remote CAS operations (one-sided RDMA is CPU-transparent at the responder). Detection relies on the master's targeted polling of CAS words with pending requests only — the master maintains a per-lockspace pending set of resources with outstanding CAS operations, and polls only those CAS words (poll interval: ~100 μs per pending resource). Resources with no pending CAS operations are not polled, so CPU overhead scales with O(pending), not O(total_resources). On a lockspace with 10,000 resources but only 50 with pending CAS operations, polling generates ~500K polls/second — manageable on a single core.
Optimization note: For workloads with consistently high pending-CAS counts (>100), an interrupt-driven notification path is available: the requesting node sends a two-sided RDMA Send to the master after completing its CAS, triggering a completion queue event instead of requiring polling. The master switches to interrupt-driven mode per-resource when the pending count exceeds a configurable threshold (default: 100). This trades higher per-lock latency (~1μs CQ processing vs ~0.1μs poll) for reduced CPU overhead.
Security: RDMA CAS access to the lock state word is controlled via RDMA memory
registration (Memory Regions / MRs). The master registers each lockspace's CAS word
array as a separate RDMA MR and distributes the Remote Key (rkey) only to nodes that
hold CAP_DLM_LOCK for that lockspace. Capability verification happens at lockspace
join time (a two-sided RDMA Send to the master, which checks CAP_DLM_LOCK via
umka-core's capability system before returning the rkey). Nodes that lose
CAP_DLM_LOCK have their rkey revoked via RDMA MR re-registration (which invalidates
the old rkey). This enforces the capability boundary at the RDMA transport layer — a
node without the rkey physically cannot issue CAS operations to the lock state words.
The rkey is per-lockspace, so CAP_DLM_LOCK scoping (Section 15.15) maps directly
to RDMA access control.
Rkey lifetime and TOCTOU safety: RDMA rkeys are registered for the lifetime of
the node's DLM membership in the lockspace, not per-operation. When a node joins a
lockspace, the master registers the RDMA Memory Region and returns the rkey; when the
node leaves (graceful or fenced), the MR is deregistered and the rkey is invalidated.
This eliminates TOCTOU (time-of-check-to-time-of-use) races: a node that passes the
capability check at join time retains a valid rkey for all subsequent lock operations
until membership ends. Rkey revocation (for CAP_DLM_LOCK loss) uses RDMA MR
re-registration, which atomically invalidates the old rkey -- any in-flight CAS using
the old rkey will fail with a remote access error (IBA v1.4 Section 14.6.7.2: deregistered
MR causes Remote Access Error completion), and the node must re-join the lockspace
(re-passing the capability check) to obtain a new rkey.
Revocation ordering: The MR re-registration is the authoritative enforcement
mechanism — it must complete before the capability is marked as revoked in the
local capability table. Sequence: (1) master calls dereg_mr() on the RNIC, which
invalidates the rkey in hardware; (2) master updates the lockspace membership record
(removes node); (3) capability revocation propagates to the evicted node. This
ordering ensures no window exists where the capability is revoked but the rkey is
still valid. If the evicted node races a CAS between steps (1) and (3), the RNIC
rejects it (rkey already invalid). Rkey revocation is hardware-enforced with < 1ms
latency from the dereg_mr() call — there is no exposure window. This is the same
eager dereg_mr() mechanism used for cluster membership revocation
(Section 5.8); the 180s rkey
rotation grace period described in Section 5.3
(Mitigation 2) is a separate defense-in-depth against rkey leakage to non-cluster
entities, not the revocation path for DLM membership loss.
2. Contested acquire (transport.send_reliable(), ~5-8 μs on RDMA, ~100-400 μs on TCP)
When the CAS fails (resource is already locked in an incompatible mode), or when the
transport does not support one-sided operations (transport.supports_one_sided() == false),
the requester uses a two-sided exchange via transport.send_reliable():
Requester Master
| |
|--- transport.send_reliable(master, ---->|
| lock_request_msg) |
| [enqueue in waiting list]
| [check compatibility]
| [if compatible: grant]
|<-- transport.send_reliable(requester, -----|
| lock_grant_msg + LVB) |
| |
RDMA: 2 RDMA Send round-trips (~5-8 μs). |
TCP: 2 TCP request-response (~100-400 μs). |
The master's kernel thread processes the request, checks compatibility against the granted queue, and either grants immediately or enqueues for later grant.
3. Lock conversion (upgrade/downgrade)
A node holding a lock can convert it to a different mode without releasing and reacquiring. Conversions use the same protocol as contested acquire (RDMA Send to master). The converting queue is processed before the waiting queue — a conversion request from an existing holder takes priority over new requests.
Common conversions:

- PR → EX: upgrade from read to write (e.g., before modifying an inode)
- EX → PR: downgrade from write to read (triggers targeted writeback, Section 15.15)
- EX → NL: release write lock but keep queue position (for future reacquire)
4. Batch request (up to 64 locks, ~5-10 μs on RDMA, ~150-500 μs on TCP)
Multiple lock requests destined for the same master are grouped into a single transport message:
Requester Master
| |
|--- transport.send_reliable(master, ---->|
| batch_msg: 8 lock requests) |
| [process all 8]
|<-- transport.send_reliable(requester, -----|
| batch_response: 8 grants/queued) |
| |
RDMA: ~5-10 μs for 8 locks. |
TCP: ~150-500 μs for 8 locks. |
Linux DLM: 8 × 30-50 μs = 240-400 μs. |
Batch requests are critical for operations that require multiple locks atomically.
A rename() requires locks on the source directory, destination directory, and the
file being renamed — three locks that can be batched into a single network operation
when they share the same master.
When batch locks span multiple masters, the requester sends one batch per master in parallel and waits for all grants. Worst case: N masters = N parallel RDMA operations completing in max(individual latencies) rather than sum(individual latencies).
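The per-master fan-out above can be sketched as a simple grouping step. This is a hedged illustration: `master_of` is a hypothetical stand-in for the DLM's resource-directory lookup, and the string resource names are placeholders for `ResourceName`.

```rust
use std::collections::HashMap;

// Sketch: group candidate lock requests by master node so each master
// receives a single batch message; the batches are then sent in parallel.
// `master_of` is a hypothetical stand-in for the resource directory lookup.
type NodeId = u32;

fn group_by_master<'a>(
    resources: &[&'a str],
    master_of: impl Fn(&str) -> NodeId,
) -> HashMap<NodeId, Vec<&'a str>> {
    let mut batches: HashMap<NodeId, Vec<&'a str>> = HashMap::new();
    for &r in resources {
        batches.entry(master_of(r)).or_default().push(r);
    }
    batches
}
```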
15.15.6 Lease-Based Lock Extension¶
Problem solved: Linux DLM's BAST (Blocking AST) callback storms.
In Linux, when a node requests a lock in a mode incompatible with current holders, the DLM sends a BAST callback to every holder. For a popular file with 100 readers (PR mode), a writer requesting EX mode triggers 100 BAST messages — O(N) network traffic per contention event. On large clusters (64+ nodes), this becomes a significant source of network overhead.
UmkaOS's lease-based approach:
- Every granted lock includes a lease duration (configurable per resource type):
- Metadata locks: 30 seconds default
- Data locks: 5 seconds default
-
Application locks: configurable (1-300 seconds)
-
Lease extension: Holders extend their lease cheaply via
transport.push_page()to update a timestamp in the master's lease table. On RDMA transports, this is a single one-sided RDMA Write (zero master CPU involvement, ~1-2 μs). On TCP transports, this is a request-response pair (~50-200 μs). Cost is amortized because renewals happen at 50% of lease duration (e.g., every 15s for 30s metadata leases). -
Revocation strategy:
- Uncontended resource: No revocation needed. Holders extend leases indefinitely. Minimal network traffic for uncontended locks — only periodic one-sided RDMA lease renewals, which do not interrupt the remote CPU (vs. Linux's periodic BAST heartbeats that require CPU processing on every node).
- Contended resource (incompatible request arrives): Master checks lease expiry for all incompatible holders. If all leases have expired, master grants to new requester immediately. If any leases are active, master sends revocation messages to those holders. For the worst case (EX request on a resource with K active CR/PR holders), this is O(K) revocations — the same as Linux's BAST count. The improvement over Linux is for the common case: uncontended resources have zero CPU-consuming traffic — only one-sided RDMA lease renewals that bypass the remote CPU (Linux BASTs are sent even for uncontended downgrade requests and require CPU processing on the receiving node), and resources where most holders' leases have naturally expired need only revoke the few remaining active holders.
-
Emergency revocation: For locks with
NOQUEUEflag (non-blocking), the master immediately checks compatibility and returnsEAGAINif blocked. No revocation attempted. -
Correctness guarantee: Lease expiry is a sufficient condition for revocation — if a holder fails to extend its lease, the master knows the lock can be safely reclaimed. For contended resources, the fallback to immediate revocation (single targeted message) preserves correctness identically to Linux's BAST mechanism.
-
Clock skew safety: Lease timing is master-clock-relative only. The master is the sole arbiter of lease validity. To handle clock skew between holder and master:
- Grant messages include the master's absolute expiry timestamp.
- Holders renew at 50% of lease duration (e.g., 15s for a 30s metadata lease), providing a safety margin larger than any reasonable clock skew (seconds).
- Holders track the master's clock offset from grant/renewal responses and adjust their renewal timing accordingly.
- If a holder discovers its lease was revoked (via a failed extension response), it must immediately stop using cached data and flush any dirty pages before reacquiring the lock. This is the hard correctness boundary: the holder's opinion of lease validity does not matter — only the master's.
- NTP or PTP synchronization is recommended but not required for correctness. The protocol is safe with unbounded clock skew — only the renewal safety margin shrinks, increasing the probability of unnecessary revocations (performance, not correctness).
- Network traffic reduction: From O(N) BASTs per contention event to O(1) for uncontended resources (no active holders — just clear the lease) and O(K) for contended resources with K active holders. Cluster-wide lock traffic is reduced by orders of magnitude on large clusters.
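The renewal-timing rule above (renew at 50% of lease duration, in the master's clock domain) can be sketched as follows. This is an illustrative sketch only — LeaseState and its fields are assumed names, not the documented structure:

```rust
/// Illustrative holder-side lease state; names and fields are assumptions
/// for this sketch, not part of the documented interface.
struct LeaseState {
    /// Master's absolute expiry timestamp from the grant message (ns).
    master_expiry_ns: u64,
    /// Lease duration in ns (e.g., 30s for metadata leases).
    duration_ns: u64,
    /// Tracked clock offset: master_clock - local_clock, in ns.
    master_offset_ns: i64,
}

impl LeaseState {
    /// Local-clock deadline for sending the renewal. Renewal fires at 50%
    /// of the lease duration before the master-side expiry, translated into
    /// the holder's clock domain via the tracked offset — lease validity
    /// stays master-clock-relative, as the protocol requires.
    fn renewal_deadline_local_ns(&self) -> u64 {
        let renew_at_master = self.master_expiry_ns - self.duration_ns / 2;
        (renew_at_master as i64 - self.master_offset_ns) as u64
    }
}
```

If the holder's clock runs behind the master's (positive offset), the local deadline moves earlier, preserving the safety margin.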
15.15.7 Speculative Multi-Resource Lock Acquire¶
Problem solved: GFS2 resource group contention.
GFS2 must find a resource group (rgrp) with free blocks before allocating file data. In Linux, this is sequential: try rgrp 0, if locked → full round-trip (~30-50 μs); try rgrp 1, if locked → another round-trip. On a busy cluster with 8 rgrps, worst case is 8 × 30-50 μs = 240-400 μs just to find a free rgrp.
UmkaOS's lock_any_of() primitive:
/// Request an exclusive lock on ANY ONE of the provided resources.
/// The DLM tries all resources and grants the first available one.
/// Returns the index of the granted resource and the lock handle.
pub fn lock_any_of(
resources: &[ResourceName],
mode: LockMode,
flags: LockFlags,
) -> Result<(usize, DlmLockHandle), DlmError>;
The requester sends a single message listing N candidate resources. The master (or masters, if resources span multiple masters) evaluates each candidate and grants the first one that is available in the requested mode.
Requester Master(s)
| |
|--- "Lock any of [rgrp0..rgrp7]" ---->|
| [try rgrp0: locked]
| [try rgrp1: locked]
| [try rgrp2: FREE → grant]
|<-- "Granted: rgrp2" ------------------|
| |
Total: ~5-10 μs (single round-trip).
Linux: up to 8 × 30-50 μs = 240-400 μs.
For resources spanning multiple masters, the requester sends parallel requests to each
master. The first grant received is accepted; the requester cancels remaining requests.
Cancel uses a two-phase protocol: (1) send CANCEL to all nodes where the lock was
requested, (2) wait for either CANCEL_ACK or GRANT from each. A GRANT that arrives
after CANCEL intent is released immediately via an unconditional UNLOCK message. This
prevents double-grant: at most one resource is held after lock_any_of() returns,
regardless of message reordering.
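The grant-after-cancel race can be made concrete with a small state machine. A sketch under assumed names — CandidateState and on_reply_after_cancel are illustrative, not part of the documented API:

```rust
/// Hypothetical per-candidate state for the two-phase cancel in
/// lock_any_of(); names are illustrative.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum CandidateState {
    /// Lock request sent; no reply yet.
    Requested,
    /// CANCEL sent; waiting for CANCEL_ACK or a racing GRANT.
    CancelSent,
    /// CANCEL_ACK received; the request is dead on this master.
    Cancelled,
    /// GRANT raced past the CANCEL; must be released via UNLOCK.
    GrantedLate,
}

/// Transition applied when a reply arrives after CANCEL was sent.
/// Returns true when an unconditional UNLOCK must be sent to release a
/// grant that raced past the cancellation — this is what guarantees at
/// most one resource is held after lock_any_of() returns.
fn on_reply_after_cancel(state: &mut CandidateState, reply_is_grant: bool) -> bool {
    assert_eq!(*state, CandidateState::CancelSent);
    if reply_is_grant {
        *state = CandidateState::GrantedLate;
        true
    } else {
        *state = CandidateState::Cancelled;
        false
    }
}
```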
15.15.8 Targeted Writeback on Lock Downgrade¶
Problem solved: Linux's "flush ALL pages" on lock drop.
In Linux, when a node holding an EX lock on a GFS2 inode downgrades to PR or releases to NL, the kernel must flush ALL dirty pages for that inode to disk. This is because Linux's page cache has no concept of which pages were dirtied under which lock — the dirty tracking is per-inode, not per-lock-range.
For a 100 GB file where only 4 KB was modified, Linux flushes ALL dirty pages (which could be the entire file if it was recently written). This turns a lock downgrade into a multi-second I/O operation.
UmkaOS's per-lock-range dirty tracking:
The DLM integrates with the VFS layer (Section 14.7) to track dirty pages per lock range:
/// A 512-byte chunk holding 64 consecutive u64 words of the bitmap.
/// Allocated from the slab allocator as a unit; freed when all 64 words
/// become zero. One chunk covers 64 × 64 = 4,096 bit positions.
pub struct SparseBitmapChunk {
/// 64 consecutive bitmap words. Index within the chunk is `(bit / 64) % 64`.
pub words: [u64; 64],
}
/// Sparse bitmap for tracking dirty page ranges.
///
/// Two-level structure:
/// - **Top level**: a 64-bit presence word per chunk. Bit `c` of `top` is set
/// whenever `chunks[c]` is allocated (i.e., has at least one set bit). This
/// allows O(1) skip of empty chunks during iteration.
/// - **Bottom level**: up to 64 chunk slots, each covering 64 u64 words.
///
/// A chunk is allocated on the first `set()` that falls within it and freed
/// when the last `clear()` empties all 64 words. Maximum coverage:
/// 64 chunks × 64 words × 64 bits = 262,144 tracked positions.
///
/// **Addressing**: bit `b` maps to chunk `b / 4096`, word-in-chunk
/// `(b / 64) % 64`, bit-in-word `b % 64`.
///
/// **Allocation cost**: O(set_chunks), not O(set_bits). A fully-dense
/// 262,144-bit bitmap requires 64 slab allocations of 512 bytes each,
/// versus 4,096 individual allocations under the old per-word scheme.
/// Cache locality: all 64 words of a chunk occupy 8 consecutive cache lines,
/// so sequential scans stay within L1 for the active chunk.
///
/// Used by DLM targeted writeback ([Section 15.15](#distributed-lock-manager--targeted-writeback-on-lock-downgrade)) to track dirty pages
/// within a lock range.
pub struct SparseBitmap {
/// Top-level presence map. Bit `c` is set iff `chunks[c]` is `Some(_)`.
/// Allows fast iteration: `leading_zeros()` / `trailing_zeros()` locate
/// the next non-empty chunk in one instruction.
pub top: u64,
/// Chunk slots. `None` means the chunk is all-zeros and not allocated.
/// 64 slots × 512 bytes/chunk = 32 KiB maximum live data.
pub chunks: [Option<Box<SparseBitmapChunk>>; 64],
/// Total number of set bits across all chunks. Maintained by `set()`
/// and `clear()`. Allows O(1) `is_empty()` and density checks.
pub popcount: u32,
}
/// Sparse bitmap for tracking locked byte ranges.
///
/// A flat `SparseBitmap` covers 262,144 bit positions. When each bit represents
/// a 4 KiB page, that covers 1 GiB — sufficient for most files. However, the
/// DLM must track byte-range locks on files that can be much larger (e.g.,
/// 100 GB NFS exports). `LargeRangeBitmap` provides a two-level fallback:
///
/// - **Files ≤ 1 GiB** (common case): uses a flat `SparseBitmap` directly.
/// Zero overhead versus the existing flat bitmap — `small` is `Some(bitmap)`,
/// `large` is `None`.
/// - **Files > 1 GiB**: uses a two-level structure where each top-level slot
/// covers a 1 GiB region and is lazily allocated as a `SparseBitmap` when
/// first needed.
pub struct LargeRangeBitmap {
/// For files ≤ 1 GiB (common case): flat bitmap.
small: Option<SparseBitmap>,
/// For files > 1 GiB: array of 1 GiB-covering SparseBitmaps, lazily allocated.
/// Index N covers byte range [N * 2^30, (N+1) * 2^30).
/// Maximum file size supported: 1 TiB (1024 slots × 1 GiB each).
/// Uses `Box<SparseBitmap>` per-slot to keep the top-level array small:
/// 1024 * 8 = 8 KiB (pointers only). Individual SparseBitmaps (~528 bytes
/// each) are heap-allocated only on first `set()` to that slot.
large: Option<Box<[Option<Box<SparseBitmap>>; 1024]>>,
/// Total file size in bytes (determines which level to use).
file_size: u64,
}
LargeRangeBitmap design notes:
- Lazy transition: The bitmap starts in small mode. On the first set() call targeting a bit position beyond the 1 GiB boundary (bit index ≥ 262,144), the small bitmap is moved into slot 0 of the newly-allocated large array, and small is set to None. Subsequent accesses compute the slot index as bit / 262_144 and the intra-slot bit index as bit % 262_144.
- Two levels of lazy allocation: (1) The large array itself (8 KiB of Option<Box<SparseBitmap>> pointers) is heap-allocated only when needed (files > 1 GiB that actually have locks past the 1 GiB boundary). (2) Within the large array, each slot's Box<SparseBitmap> is allocated on first set() to that slot — empty slots remain None (8-byte null pointer).
- Maximum coverage: 1 TiB (1024 slots × 1 GiB each). Files larger than 1 TiB use coarse-grained lock tracking: byte-range locks map to 1 GiB granules, with potential false conflicts for adjacent byte ranges within the same 1 GiB granule. This is acceptable because files > 1 TiB with fine-grained byte-range locking are extremely rare in practice; whole-file or large-region locks dominate.
- Performance: For files ≤ 1 GiB (the common case), zero overhead versus the existing flat SparseBitmap — one branch on small.is_some(). For large files, each access adds one pointer dereference (slot lookup) plus the existing SparseBitmap O(1) per-bit cost.
- range_coverage_bytes() -> u64: Returns the current maximum byte range the bitmap can track at full granularity. In small mode: 1 GiB (262,144 × 4 KiB). In large mode: 1 TiB (1024 × 1 GiB). For files beyond 1 TiB: returns file_size (coarse tracking covers the entire file, but at 1 GiB granularity beyond the 1 TiB fine-grained limit).
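The slot arithmetic in the notes above can be sketched directly. split_large_bit and fits_small_mode are illustrative helpers, not part of the documented API:

```rust
/// Bits tracked by one flat SparseBitmap: 64 chunks × 64 words × 64 bits.
const BITS_PER_SLOT: u64 = 262_144;

/// Large-mode address split: an absolute page-bit index maps to a 1 GiB
/// slot plus an intra-slot bit. Arithmetic only; the real LargeRangeBitmap
/// also handles the small-mode fast path and lazy slot allocation.
fn split_large_bit(bit: u64) -> (usize, u64) {
    ((bit / BITS_PER_SLOT) as usize, bit % BITS_PER_SLOT)
}

/// Whether a bit position still fits the flat small-mode bitmap.
fn fits_small_mode(bit: u64) -> bool {
    bit < BITS_PER_SLOT
}
```

The first set() for which fits_small_mode() returns false is what triggers the lazy transition described above.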
SparseBitmap method contracts:
- set(b: u64): Computes chunk index c = b / 4096, word index w = (b / 64) % 64, bit index k = b % 64. If chunks[c] is None, allocates a SparseBitmapChunk from the slab allocator and sets bit c in top. Sets bit k in chunks[c].words[w]. If the bit was previously clear, increments popcount.
- clear(b: u64): Computes (c, w, k) as above. Clears bit k in chunks[c].words[w]. If the bit was set, decrements popcount. If all 64 words in chunks[c] are now zero, frees the chunk and clears bit c in top.
- test(b: u64) -> bool: Computes (c, w, k). If chunks[c] is None, returns false. Otherwise returns (chunks[c].words[w] >> k) & 1 != 0.
- iter_set() -> impl Iterator<Item = u64>: Iterates over set chunk indices using top.trailing_zeros() / bit-clear loop. Within each chunk, iterates over non-zero words using words[w].trailing_zeros(). Yields absolute bit positions. Total cost: O(set_chunks + set_bits).
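A minimal sketch of these contracts, simplifying the chunk to a plain Box<[u64; 64]> (the real structure uses slab-allocated SparseBitmapChunks and also implements iter_set()):

```rust
/// Simplified sketch of the SparseBitmap set/test/clear contracts.
struct SparseBitmap {
    top: u64,
    chunks: [Option<Box<[u64; 64]>>; 64],
    popcount: u32,
}

impl SparseBitmap {
    fn new() -> Self {
        Self { top: 0, chunks: std::array::from_fn(|_| None), popcount: 0 }
    }

    /// Address split: bit b -> (chunk, word-in-chunk, bit-in-word).
    fn addr(b: u64) -> (usize, usize, u32) {
        ((b / 4096) as usize, ((b / 64) % 64) as usize, (b % 64) as u32)
    }

    fn set(&mut self, b: u64) {
        let (c, w, k) = Self::addr(b);
        // Allocate the chunk on first use and record it in the presence word.
        let chunk = self.chunks[c].get_or_insert_with(|| Box::new([0u64; 64]));
        self.top |= 1u64 << c;
        if chunk[w] & (1u64 << k) == 0 {
            chunk[w] |= 1u64 << k;
            self.popcount += 1;
        }
    }

    fn test(&self, b: u64) -> bool {
        let (c, w, k) = Self::addr(b);
        self.chunks[c].as_ref().map_or(false, |ch| (ch[w] >> k) & 1 != 0)
    }

    fn clear(&mut self, b: u64) {
        let (c, w, k) = Self::addr(b);
        if let Some(chunk) = self.chunks[c].as_mut() {
            if chunk[w] & (1u64 << k) != 0 {
                chunk[w] &= !(1u64 << k);
                self.popcount -= 1;
            }
            if chunk.iter().all(|&word| word == 0) {
                self.chunks[c] = None;    // free the now-empty chunk
                self.top &= !(1u64 << c); // clear the presence bit
            }
        }
    }
}
```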
/// Dirty page tracker associated with a DLM lock.
/// Tracks which pages were modified while this lock was held.
pub struct LockDirtyTracker {
/// Byte range covered by this lock (for range locks).
/// For whole-file locks: 0..u64::MAX.
pub range: core::ops::Range<u64>,
/// Bitmap of dirty pages within the lock's range.
/// Indexed by (page_offset - range.start) / PAGE_SIZE.
///
/// Uses `LargeRangeBitmap` to support files of any practical size:
/// - Files ≤ 1 GiB: flat `SparseBitmap` (O(1) per page, zero overhead).
/// - Files > 1 GiB: two-level structure with lazily-allocated 1 GiB slots.
/// - Files > 1 TiB: coarse 1 GiB granule tracking (rare in practice).
///
/// O(1) set/clear per page, O(dirty_chunks + dirty_bits) iteration.
/// Slab allocation is per-chunk (512 bytes), not per set bit, keeping
/// allocator pressure and fragmentation proportional to the number of
/// 256 KB dirty regions rather than the number of dirty pages.
pub dirty_pages: LargeRangeBitmap,
/// Optional delegation to a DSM dirty bitmap. When a `DsmLockBinding`
/// ([Section 6.12](06-dsm.md#dsm-subscriber-controlled-caching--dlm-token-binding)) is active
/// for this lock, the DLM delegates all dirty tracking to the binding's
/// `DsmDirtyBitmap` instead of maintaining its own `dirty_pages` bitmap.
/// This avoids double bookkeeping and ensures a single source of truth.
///
/// Set by `dsm_bind_lock()` at binding registration time; cleared by
/// `dsm_unbind_lock()` at binding teardown.
pub dsm_delegate: Option<DsmDirtyDelegate>,
}
/// Delegation handle connecting a DLM lock's dirty tracking to a
/// `DsmLockBinding`'s `DsmDirtyBitmap`. When present, all dirty page
/// tracking operations on `LockDirtyTracker` are forwarded to the
/// DSM bitmap.
pub struct DsmDirtyDelegate {
/// Handle to the active DsmLockBinding that owns the canonical
/// dirty bitmap. The DLM never reads or writes `dirty_pages`
/// while this handle is live.
pub binding_handle: DsmLockBindingHandle,
}
Dirty tracking delegation contract (DLM ↔ DSM):
When a DsmLockBinding is registered for a DLM lock
(Section 6.12), the DLM's
LockDirtyTracker and the DSM's DsmDirtyBitmap would both track the same
set of dirty pages — one driven by VFS page-fault write paths (setting PTE
dirty bits), the other by MOESI state transitions (Exclusive-dirty / SharedOwner).
Maintaining both bitmaps independently wastes memory, risks divergence if one
path marks a page dirty but the other does not, and complicates writeback (which
bitmap is authoritative?).
Setup sequencing: The binding must follow a strict ordering:
(1) DLM lock acquire completes (the lock is held in EX or PW mode),
(2) DSM region join (the region is locally mapped and the coherence protocol is active),
(3) dirty tracker bind to DLM lock via dsm_bind_lock().
This ordering ensures the lock is held before tracking begins — if the dirty tracker
were bound before the lock was granted, incoming MOESI invalidations could mark pages
dirty in a tracker that has no corresponding lock protection, violating the invariant
that every tracked dirty page is covered by a held DLM lock.
Resolution — single-owner delegation:
- Bind: When dsm_bind_lock() is called, the DSM subsystem sets lock.dirty_tracker.dsm_delegate = Some(DsmDirtyDelegate { binding_handle }) on the DLM lock's tracker. From this point:
  - LockDirtyTracker::mark_dirty(page) forwards to DsmDirtyBitmap::mark_dirty(page_idx) via the delegate handle.
  - LockDirtyTracker::iter_dirty() returns DsmDirtyBitmap::iter_dirty() via the delegate.
  - LockDirtyTracker::dirty_count() returns DsmDirtyBitmap::dirty_count().
  - The local dirty_pages: LargeRangeBitmap is not accessed; it remains in its last state (or empty if the binding was created before any writes).
- Writeback: On lock downgrade or release, the DLM's targeted writeback path (Section 15.15) calls lock.dirty_tracker.iter_dirty(). Because the delegate is active, this iterates the DsmDirtyBitmap, which reflects every MOESI M/O transition — including pages dirtied through DSM coherence protocol messages that the VFS write path would not have seen.
- Unbind: When dsm_unbind_lock() is called (or the DLM lock is released), the DSM subsystem clears lock.dirty_tracker.dsm_delegate. Any remaining dirty pages in the DsmDirtyBitmap are written back synchronously before the delegate is cleared (per the existing unbind contract). After unbind, LockDirtyTracker reverts to its local dirty_pages bitmap for any subsequent non-DSM use.
- Invariant: At no point are both dirty_pages and dsm_delegate actively tracking writes. The DLM checks dsm_delegate.is_some() on every mark_dirty / iter_dirty / dirty_count call (a single branch, predicted taken when DSM is active). This is a warm-path check (called per dirty page, not per instruction), so the branch cost is negligible.
VFS call site for mark_dirty(): The VFS set_page_dirty() path checks if
the page's inode has an active DLM lock with dirty tracking enabled. If so, it
calls lock.dirty_tracker.mark_dirty(page.index) to record the dirty page index
in the per-lock bitmap. This is the sole entry point for populating the
LockDirtyTracker during normal file writes — page-fault write paths and
buffered-write paths both converge on set_page_dirty(), ensuring no dirty page
is missed regardless of the I/O path taken.
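The single-owner delegation branch can be sketched as follows. The BTreeSet-backed bitmaps are deliberate stand-ins for LargeRangeBitmap and DsmDirtyBitmap, and the delegate holds the bitmap directly here instead of a binding handle:

```rust
use std::collections::BTreeSet;

/// Illustrative stand-ins for the real bitmap types.
struct DsmDirtyBitmap { dirty: BTreeSet<u64> }
struct LocalBitmap { dirty: BTreeSet<u64> }

struct LockDirtyTracker {
    dirty_pages: LocalBitmap,
    dsm_delegate: Option<DsmDirtyBitmap>,
}

impl LockDirtyTracker {
    /// Single branch on dsm_delegate.is_some(): exactly one tracker is
    /// live, so there is never double bookkeeping.
    fn mark_dirty(&mut self, page_idx: u64) {
        match self.dsm_delegate.as_mut() {
            Some(dsm) => { dsm.dirty.insert(page_idx); }
            None => { self.dirty_pages.dirty.insert(page_idx); }
        }
    }

    fn dirty_count(&self) -> usize {
        match self.dsm_delegate.as_ref() {
            Some(dsm) => dsm.dirty.len(),
            None => self.dirty_pages.dirty.len(),
        }
    }
}
```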
Downgrade behavior:
- EX/PW → PR (downgrade to read): Flush only pages in the dirty_pages bitmap. If 4 KB of a 100 GB file was modified, flush exactly 1 page (~10-15 μs for NVMe), not the entire file. PW (Protected Write) follows the same writeback rules as EX, since both are write modes that can dirty pages (per the compatibility matrix in Section 15.15).
- EX/PW → NL (release): Flush dirty pages, then invalidate only pages covered by this lock's range. Other cached pages (from other lock ranges or read-only access) remain valid.
- Range lock downgrade: When a byte-range lock is downgraded, only dirty pages within that specific byte range are flushed. Pages outside the range are untouched.
Page cache invalidation on lock release/downgrade: When a DLM lock is released
or downgraded, cached pages protected by that lock must be invalidated to prevent
stale reads by other nodes. The DLM calls dlm_invalidate_pages(resource, node) in
the lock release path after dirty page writeback completes. This callback invokes
invalidate_inode_pages2_range() on the inode's address space for the byte range
covered by the lock. Pages that are currently under writeback are waited on before
invalidation. The invalidation is synchronous — the lock release message is not
sent to the master until all local cached pages for the lock's range are evicted.
Cost reduction: From O(file_size) to O(dirty_pages_in_range). For the common case of small writes to large files, this reduces lock downgrade cost by orders of magnitude.
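The cost model reduces to filtering the dirty set by the lock's page range. A toy sketch — pages_to_flush is a hypothetical helper, taking the iter_dirty() output as a sorted slice of page indices:

```rust
use std::ops::Range;

/// Targeted-writeback selection: flush only dirty pages that fall inside
/// the lock's page range, so cost is O(dirty_pages_in_range) rather than
/// O(file_size).
fn pages_to_flush(dirty: &[u64], range: Range<u64>) -> Vec<u64> {
    dirty.iter().copied().filter(|p| range.contains(p)).collect()
}
```

For a lock covering pages 0..256 of a huge file with three dirty pages cluster-wide, only the two in-range pages are written back.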
15.15.9 Deadlock Detection¶
The DLM uses a distributed wait-for graph (WFG) with two detection tiers: immediate local cycle detection for same-node deadlocks, and a probe-based protocol for cross-node deadlocks that activates after a configurable wait threshold.
15.15.9.1 Local Wait-For Graph Construction¶
Each node maintains a local wait-for graph of lock dependencies. Vertices are
globally unique process identifiers (node_id, pid) — bare PIDs are insufficient
because PID 1234 on Node A and PID 1234 on Node B are different processes. Edges
represent lock dependencies: process (N1, P) holds lock A, process (N2, Q) waits
for lock A → edge (N2, Q) → (N1, P). The pid field always refers to the
initial (host) PID namespace, not a container-local PID namespace. Containers
that each have PID 1 are unambiguously distinguished this way. Container-local PIDs
are translated to initial-namespace PIDs at the DLM boundary before insertion into
the wait-for graph.
Edge management:
- Insertion: When a lock request blocks (enqueued on the waiting queue), an edge
is added from the requesting task to each current holder of the conflicting lock
mode. For mode conversions, the edge points from the converting task to each holder
of an incompatible mode.
- Removal: When the lock is granted or the request is cancelled (dlm_unlock()
or EDEADLK victim cancellation), all edges originating from that task for the
given resource are removed.
15.15.9.2 Local Cycle Detection (Immediate)¶
On each new edge insertion, the master node runs a depth-first search (DFS) starting from the newly blocked task. If the DFS visits a node already on the current traversal stack, a cycle is detected locally.
/// Perform local cycle detection starting from `waiter`.
///
/// Returns `Some(victim)` if a cycle is found, `None` otherwise.
/// Runs under the WFG lock (held for the duration of the DFS).
/// Worst case O(E) where E = number of edges in the local graph.
fn detect_local_cycle(
graph: &WaitForGraph,
waiter: WaiterId,
policy: VictimPolicy,
) -> Option<WaiterId>;
Algorithm:
1. Mark waiter as visiting (push onto DFS stack).
2. For each holder h that waiter is waiting for:
a. If h is already on the DFS stack → cycle found. Collect all nodes on the
cycle path from h back to h on the stack.
b. If h is waiting for other locks on this node, recurse into h.
c. If h is waiting for a lock mastered on a remote node, stop local DFS
for this branch — the dependency crosses a node boundary and requires the
distributed probe protocol (below).
3. If no cycle found locally, return None.
Victim selection: from the set of tasks in the detected cycle, the victim is chosen
by the configured VictimPolicy:
/// Policy for selecting the deadlock victim from a cycle.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VictimPolicy {
/// Cancel the youngest transaction (most recent lock_id / highest timestamp).
/// Default. Minimizes wasted work by aborting the task that has done the least.
Youngest,
/// Cancel the lowest-priority task (smallest `nice` value).
LowestPriority,
/// Cancel the task holding the fewest locks (smallest transaction footprint).
SmallestTransaction,
}
The victim's lock request is cancelled with EDEADLK. For same-node deadlocks, this
completes in O(edges) time without any network round-trips, typically within
microseconds of the blocking request.
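A compact sketch of the DFS above. It omits the cross-branch visited set, remote-edge cutoff, and victim-policy plumbing of the real detect_local_cycle, and returns the cycle members directly:

```rust
use std::collections::HashMap;

/// Waiter identity: (node_id, pid) — bare PIDs are ambiguous across nodes.
type WaiterId = (u32, u32);

/// DFS over the local wait-for graph starting from `start`. `edges` maps
/// each waiter to the holders it waits for. Returns the members of the
/// first cycle found, or None.
fn detect_local_cycle(
    edges: &HashMap<WaiterId, Vec<WaiterId>>,
    start: WaiterId,
) -> Option<Vec<WaiterId>> {
    fn dfs(
        edges: &HashMap<WaiterId, Vec<WaiterId>>,
        cur: WaiterId,
        stack: &mut Vec<WaiterId>,
    ) -> Option<Vec<WaiterId>> {
        // A node already on the traversal stack means a cycle: the cycle
        // is the stack suffix from its first occurrence.
        if let Some(pos) = stack.iter().position(|&w| w == cur) {
            return Some(stack[pos..].to_vec());
        }
        stack.push(cur);
        if let Some(next) = edges.get(&cur) {
            for &h in next {
                if let Some(cycle) = dfs(edges, h, stack) {
                    return Some(cycle);
                }
            }
        }
        stack.pop();
        None
    }
    dfs(edges, start, &mut Vec::new())
}
```

The victim would then be chosen from the returned members by the configured VictimPolicy.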
15.15.9.3 Distributed Probe Protocol¶
If local DFS reaches a task waiting on a lock mastered on a remote node, and the
lock request has been waiting longer than the activation threshold (default: 5
seconds, configurable via DlmLockspaceConfig.deadlock_timeout_ns), the node
initiates a distributed probe.
Probe message format:
/// Distributed deadlock detection probe message.
/// Sent between DLM nodes to trace cross-node wait-for chains.
/// kernel-internal, not KABI — contains ArrayVec (not wire-safe).
/// Serialized to wire format before transmission.
#[repr(C)]
pub struct DlmProbe {
/// Monotonically increasing probe ID (per initiator node).
/// Used for deduplication — nodes cache recent probe_ids to avoid
/// re-processing probes that have already been forwarded.
pub probe_id: u64,
/// Node that initiated this probe (the node where the blocked task resides).
pub initiator_node: NodeId,
/// The blocked task that triggered the probe (cycle target).
pub initiator_waiter: WaiterId,
    /// Probe path: WaiterId entries ((node_id, pid) pairs) traversed so far.
/// Bounded to MAX_PROBE_PATH_LEN (32) entries. If a probe exceeds this
/// depth, it is dropped — real deadlock cycles in practice involve <10 nodes.
pub path: ArrayVec<WaiterId, 32>,
}
/// Maximum probe path length. Probes exceeding this depth are dropped
/// as they are unlikely to represent real deadlock cycles.
pub const MAX_PROBE_PATH_LEN: usize = 32;
Protocol steps:
- Initiation: When local DFS reaches a remote dependency and the wait time exceeds the threshold, the initiator node constructs a DlmProbe with a fresh probe_id, sets initiator_waiter to the originally blocked task, and appends the local path traversed so far. The probe is sent to the remote node that masters the lock being waited on.
- Forwarding: The receiving node looks up the lock resource, identifies the current holders, and continues the DFS in its local WFG:
  - For each holder that is itself waiting on a local lock: extend the DFS locally.
  - For each holder waiting on a lock mastered on yet another remote node: append the local path segment to DlmProbe.path and forward the probe to that node.
  - If a holder is not waiting on anything (it holds the lock and is running): the probe terminates on this branch (no deadlock on this path).
- Cycle detection: If the probe arrives back at the initiator node (the receiving node's node_id matches DlmProbe.initiator_node) and the DFS reaches initiator_waiter, a distributed cycle is confirmed. The full cycle path is the concatenation of DlmProbe.path plus the local segment.
- Victim selection: The node that detects the cycle selects the victim using the same VictimPolicy applied to all tasks in the cycle path. The detecting node sends a DLM_MSG_CANCEL to the victim's home node, which cancels the victim's lock request with EDEADLK.
- Probe deduplication: Each node maintains a bounded LRU cache of recently seen (initiator_node, probe_id) pairs (capacity: 1024 entries). When a probe arrives whose ID is already in the cache, it is dropped without processing. This prevents probe storms in dense wait-for graphs where multiple paths lead to the same node.
15.15.9.4 Gossip-Based Edge Propagation¶
In addition to the probe protocol (which is demand-driven), nodes exchange wait-for graph edges with their neighbors via periodic gossip. This provides a secondary detection mechanism and accelerates probe convergence:
- Every 100 ms, each node selects ceil(log2(N)) random peers from the cluster membership list (Section 5.8) and sends its current local WFG edges (anti-entropy gossip). Random selection ensures convergence in O(log N) rounds with high probability.
- Each gossip message includes the (node_id, pid) tuples for both endpoints of each edge, ensuring no PID aliasing across nodes or containers.
- Edge removal: When a lock request is granted or cancelled, the node removes the corresponding edge from its local graph and propagates a tombstone (edge + deletion timestamp) in the next gossip round. Tombstones are garbage-collected after 2x the gossip interval (200 ms).
- Each node runs local cycle detection on its accumulated graph (local edges + edges received via gossip). If a cycle is found, the youngest transaction (highest timestamp) is selected as the victim and receives EDEADLK.
- After 3 * ceil(log2(N)) gossip rounds without detecting a complete cycle, the detector falls back to a centralized query to the DLM coordinator (lowest live node-id), adding one extra RTT but guaranteeing termination regardless of gossip convergence.
Victim selection is configurable per lockspace: youngest (default), lowest priority, or smallest transaction (fewest locks held).
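The fanout and fallback-round arithmetic can be sketched as follows; gossip_fanout and fallback_round are illustrative helper names:

```rust
/// Gossip fanout: ceil(log2(N)) peers per 100 ms round.
/// A single-node cluster gossips to no one.
fn gossip_fanout(cluster_size: u32) -> u32 {
    if cluster_size <= 1 {
        0
    } else {
        // ceil(log2(n)) for n >= 2, via the bit width of (n - 1).
        32 - (cluster_size - 1).leading_zeros()
    }
}

/// Round after which the detector falls back to a centralized query
/// to the DLM coordinator: 3 * ceil(log2(N)).
fn fallback_round(cluster_size: u32) -> u32 {
    3 * gossip_fanout(cluster_size)
}
```

For a 16-node cluster this gives a fanout of 4 peers per round and a centralized fallback after 12 rounds (~1.2 s at the 100 ms gossip interval).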
15.15.9.5 Performance Characteristics¶
Zero overhead on fast path: Deadlock detection only activates when a lock request
has been waiting for longer than the configurable threshold (default: 5 seconds). Short
waits (the common case for contended locks) complete before deadlock detection engages.
The gossip protocol runs on a low-priority background thread and uses minimal bandwidth.
Per-message bound: Each gossip message carries at most MAX_GOSSIP_EDGES (128)
WFG edges. If a node has more local edges than the limit, edges are sent across
multiple gossip rounds (round-robin). At 16 bytes per edge (two (node_id, pid) pairs
+ mode + timestamp), 128 edges = 2 KiB per message, well within a single RDMA inline
send or UDP datagram.
Latency tradeoff justification: The 5-second activation threshold means a true deadlock waits ~5 seconds before detection begins, which is 1,000,000x the typical lock latency (~5 μs). This is acceptable because: (1) deadlocks are rare in practice — most lock waits resolve within milliseconds; (2) the alternative (immediate distributed cycle detection on every wait) would add gossip overhead to every contended lock operation, degrading the common-case latency that the DLM is optimized for; (3) the 5-second threshold matches Linux DLM's deadlock detection timeout and is well within application tolerance for the rare deadlock case.
Local fast-path detection: For locks mastered on the same node, the master performs immediate local cycle detection when enqueueing a new waiter — if the waiter and all holders in the cycle are on the same node, the deadlock is detected in O(edges) time without any network round-trips, typically within microseconds. The 5-second probe-based detection is only needed for cross-node deadlock cycles, where the wait-for graph edges span multiple nodes and must be traced via the probe protocol.
15.15.10 Integration with Cluster Membership (Section 5.8)¶
The DLM receives cluster membership events directly from Section 5.8's cluster membership protocol:
- NodeJoined: New node added to consistent hash ring. Some lock resources are remapped to the new master (~1/N of resources). The new node receives resource state from the old masters.
- NodeSuspect: Heartbeat missed. DLM begins preparing for potential recovery but does NOT stop lock operations. Current lock holders continue normally.
- NodeDead: Confirmed node failure. DLM initiates recovery for resources mastered
on or held by the dead node (Section 15.15).
Ordering constraint: The DLM lock reclaim timer (DLM_LOCK_RECLAIM_DELAY_NS, 200 ms) starts ONLY after the membership layer has delivered the NodeDead event. The DLM MUST NOT initiate lock reclaim based solely on its own liveness probe (DLM_MONITOR_INTERVAL_NS, 500 ms) — the monitor is advisory and may pre-stage recovery preparation, but actual lock removal from granted queues requires authoritative NodeDead confirmation from the membership layer. The callback:
/// Called by the membership layer ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery))
/// when a node is authoritatively confirmed dead (10 missed heartbeats, 1000ms).
/// This is the ONLY entry point for DLM lock reclaim initiation.
fn on_node_dead(node_id: PeerId) {
// 1. Cancel any in-flight RDMA operations to the dead node.
dlm_cancel_rdma_to(node_id);
// 2. Start the reclaim delay timer. Lock reclaim begins when
// this timer fires, allowing the dead node's RDMA NIC to
// drain any in-flight operations (200ms grace).
dlm_recovery.start_reclaim_timer(node_id, DLM_LOCK_RECLAIM_DELAY_NS);
}
Total worst-case recovery latency (lock holder failure):
- Heartbeat detection: 1000 ms (10 missed heartbeats at 100 ms interval)
- NodeDead delivery to DLM: < 1 ms (in-kernel function call)
- Reclaim delay: 200 ms (DLM_LOCK_RECLAIM_DELAY_NS)
- Lock queue processing: < 10 ms (per-resource, not global)
- Total: ~1210 ms from crash to lock availability for new requesters.
- NodeLeaving: Graceful departure. Node transfers mastered resources to their new owners before leaving. Zero disruption.
Single membership source: The DLM does NOT make its own authoritative membership
decisions. It relies on the cluster membership layer
(Section 5.8) as the single source of
truth for node liveness. The DLM does run its own lightweight monitor thread
(DLM_MONITOR_INTERVAL_NS, 500 ms) to pre-stage lock reclaim before the membership
layer confirms failure, but this monitor cannot unilaterally declare a node dead or trigger
fencing. This eliminates the Linux problem where DLM and corosync can disagree on
node liveness — in UmkaOS, there is exactly one authority for cluster membership.
15.15.11 Recovery Protocol¶
Cross-subsystem ordering: When both DSM and DLM require recovery after a node
failure, DSM home reconstruction runs first for pages that the DLM depends on.
DLM re-mastering proceeds per-resource: resources whose CAS word pages have no
DSM dependency (or whose pages are homed on surviving nodes) re-master immediately;
only resources whose CAS word pages were homed on the failed node wait for
dsm_recovery_complete on the affected DSM region (~1% of resources in typical
deployments). See
Section 5.8
for the full per-resource ordering protocol, DlmResourceDsmDep tracking structure,
and rationale.
Four failure scenarios, each with a targeted recovery flow:
1. Lock holder failure (a node holding locks crashes)
Timeline:
t=0: Node B crashes while holding locks on resources R1, R2, R3
t=300ms: Cluster membership heartbeat ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery),
100ms interval) detects NodeSuspect(B) (3 missed heartbeats).
Note: the DLM's own monitor thread (DLM_MONITOR_INTERVAL_NS=500ms)
may have pre-staged lock reclaim by this point but does NOT trigger
failure autonomously — it only sends an advisory hint to the membership layer.
t=1000ms: NodeDead(B) confirmed by membership layer (10 missed heartbeats)
Recovery (per-resource, NOT global):
For each resource where B held a lock:
1. Master removes B's lock from granted queue
2. If B held EX with dirty LVB: mark LVB as INVALID (sequence = LVB_SEQUENCE_INVALID)
3. Process converting queue, then waiting queue (grant compatible waiters)
4. If B held journal lock: trigger journal recovery for B's journal
Resources NOT involving B: completely unaffected. Zero disruption.
Lease expiry race handling: NodeSuspect is detected at 300ms (3 missed heartbeats),
but leases may not expire until their full timeout (metadata: 30s, data: 5s). If the
master attempts to send revocation messages to B during recovery and B is already
dead (RDMA Send fails), the master does not block indefinitely waiting for B to
acknowledge revocation. Instead, the master records B as "revocation pending" and
proceeds with resource recovery immediately — the lease timeout will naturally
invalidate B's access rights when it expires. For data locks (5s timeout), the
recovery completes within the lease window; for metadata locks (30s timeout), the
master may grant new locks on the resource before B's lease expires. This is
correct because B is confirmed dead at t=1000ms and cannot access the resource.
The lease timeout provides a safety net in the corner case where NodeDead
confirmation is delayed beyond the lease duration — if the master cannot confirm
B's death, B retains access until lease expiry, preserving correctness at the cost
of temporary unavailability for incompatible lock requests.
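The per-resource cleanup in steps 1–3 can be sketched as a small userspace model. `Mode`, `Lock`, and `Resource` here are simplified stand-ins for the real `DlmResource` (no converting queue, a reduced compatibility matrix), and the concrete value of the `LVB_SEQUENCE_INVALID` sentinel is an assumption:

```rust
const LVB_SEQUENCE_INVALID: u64 = u64::MAX; // sentinel value is an assumption

#[derive(Clone, Copy, PartialEq)]
enum Mode { Cr, Pr, Ex }

struct Lock { node: u64, mode: Mode, lvb_dirty: bool }

struct Resource {
    granted: Vec<Lock>,
    waiting: Vec<Lock>,
    lvb_sequence: u64,
}

/// Reclaim locks held by a dead node on ONE resource: remove its granted
/// entries, invalidate the LVB if it held EX with a dirty LVB, then grant
/// compatible waiters. Resources not involving the dead node are untouched.
fn reclaim_dead_holder(res: &mut Resource, dead: u64) {
    // Step 2: a dead EX holder with a dirty LVB means the LVB content is lost.
    if res.granted.iter().any(|l| l.node == dead && l.mode == Mode::Ex && l.lvb_dirty) {
        res.lvb_sequence = LVB_SEQUENCE_INVALID;
    }
    // Step 1: drop the dead node's granted locks.
    res.granted.retain(|l| l.node != dead);
    // Step 3: grant waiters that are now compatible (simplified matrix:
    // EX requires an empty granted queue; PR/CR conflict only with EX).
    let mut i = 0;
    while i < res.waiting.len() {
        let compatible = match res.waiting[i].mode {
            Mode::Ex => res.granted.is_empty(),
            Mode::Pr | Mode::Cr => res.granted.iter().all(|l| l.mode != Mode::Ex),
        };
        if compatible {
            let l = res.waiting.remove(i);
            res.granted.push(l);
        } else {
            i += 1;
        }
    }
}
```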
2. Lock master failure (the node responsible for a resource's lock queues crashes)
Timeline:
t=0: Node M crashes (was master for resources hashing to M)
t=1000ms: NodeDead(M) confirmed (10 missed heartbeats per Section 5.8.2.2)
Recovery:
1. Consistent hashing reassigns M's resources to surviving nodes.
(~1/N resources move, distributed across all survivors.)
2. Each survivor that held locks on M's resources reports its lock
state to the new master via RDMA Send.
3. New master rebuilds granted/converting/waiting queues from
survivor reports.
4. Lock operations resume for affected resources.
Timeline: ~50-200ms for affected resources.
All other resources: unaffected (their masters are alive).
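Step 1's reassignment can be illustrated with rendezvous (highest-random-weight) hashing — one consistent-hashing scheme with exactly the stated property that only the dead node's resources move, spread across survivors. The text does not specify which scheme UmkaOS uses, so treat this as a sketch:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick the master for a resource among live nodes by rendezvous hashing:
/// each (resource, node) pair gets a pseudo-random weight, and the node
/// with the highest weight wins. When a node dies, only the resources it
/// mastered (~1/N of the total) move — every other resource keeps the
/// same winner, since the dead node was not its maximum.
fn master_for(resource: &str, live_nodes: &[u64]) -> Option<u64> {
    live_nodes.iter().copied().max_by_key(|node| {
        let mut h = DefaultHasher::new();
        (resource, node).hash(&mut h);
        h.finish()
    })
}
```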
3. Split-brain (network partition divides cluster)
Inherits Section 5.8's quorum protocol:
- Majority partition: Continues normal DLM operation. Resources mastered on nodes in the minority partition are remapped.
- Minority partition: Blocks new EX/PW lock acquisitions to prevent conflicting writes. Existing EX/PW locks are downgraded to PR — the holder retains the lock (avoiding re-acquisition on partition heal) but cannot write. Dirty pages held under the downgraded lock are flushed before the downgrade completes (targeted writeback, Section 15.15). Existing PR and CR locks remain valid for local cached reads.
How nodes learn they are in the minority partition:
The cluster membership subsystem (Section 5.8)
calls dlm_partition_event(PartitionRole::Minority) on the DLM when quorum is
lost. This is the single notification entry point — the DLM does not independently
monitor heartbeats or quorum; it relies entirely on the membership layer's event.
The event is delivered on a dedicated kernel thread and holds the DLM partition_lock
during processing to serialize with ongoing lock grant decisions.
In-flight write handling:
An in-flight write is any operation where a write() syscall has returned to userspace
but the dirtied pages have not yet been included in the LockDirtyTracker for the
covering EX lock. Two sub-cases:
Case A — write() completed before partition detected:
Pages are already in the dirty page cache and tracked by LockDirtyTracker.
The downgrade flushes them via targeted writeback (normal path).
Case B — write() in progress (PTE dirty bit set, LockDirtyTracker not yet updated):
The VFS page-fault path sets the dirty bit before returning to userspace.
DLM's partition handler waits for the write_seq counter to stabilize
(spin at most 1ms — write() syscall cannot hold a page lock indefinitely)
then calls sync_file_range(ALL) on all files covered by EX locks.
This forces any PTE-dirty pages into tracked writeback before downgrade.
Atomic writeback-then-downgrade sequence:
For each EX or PW lock held by this node:
1. Set lock.state = LOCK_CONVERTING (blocks new writers via KABI fence).
2. Flush in-flight writes: sync_file_range(file, lock.range.start, lock.range.end).
This is synchronous: returns only when all dirty pages in the range
are submitted to the block layer (not necessarily persisted to disk).
3. Call targeted_writeback_flush(lock) (Section 15.12.8):
Walk LockDirtyTracker, submit writeback for each dirty page.
Wait for writeback completion (submit + await journal commit).
4. Only after step 3 completes: change lock mode from EX/PW → PR.
This is the atomic downgrade: no window where lock is PR but pages are dirty.
5. Send LOCK_DOWNGRADE message to lock master (majority partition).
If the lock has a DSM binding (`dsm_causal_stamp.is_some()`), the
LOCK_DOWNGRADE message includes the CausalStampWire payload
([Section 6.6](06-dsm.md#dsm-coherence-protocol-moesi)). The master forwards this
stamp to the next granted holder so it can verify causal ordering
of DSM page updates made under the previous lock tenure.
Master updates granted queue: replaces EX entry with PR entry.
The "atomic" guarantee is node-local: steps 3→4 are serialized by
partition_lock. Concurrent readers (PR/CR holders) may read stale data from
the page cache during the flush window (step 2-3), but they cannot read
partially-flushed state because each page is either fully clean or fully dirty
at page cache granularity. No intermediate state is visible.
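The ordering invariant of steps 1–5 can be captured in a few lines. `flush` stands in for the `sync_file_range` + `targeted_writeback_flush` pair, and the types are illustrative — the point is only that the mode change cannot happen while tracked pages are dirty:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum LockState { Granted, Converting, Downgraded }

struct ExLock {
    state: LockState,
    mode_ex: bool,
    dirty_pages: u32, // pages tracked by the LockDirtyTracker for this lock
}

/// Atomic writeback-then-downgrade, mirroring steps 1-5 from the text.
/// The flush must complete before the mode change, so there is no window
/// where the lock is PR while covered pages are dirty.
fn downgrade_ex_to_pr(lock: &mut ExLock, flush: impl Fn(&mut u32)) -> &'static str {
    lock.state = LockState::Converting;  // step 1: fence new writers
    flush(&mut lock.dirty_pages);        // steps 2-3: in-flight + tracked dirty pages
    assert_eq!(lock.dirty_pages, 0, "downgrade before flush completion is forbidden");
    lock.mode_ex = false;                // step 4: EX -> PR only after the flush
    lock.state = LockState::Downgraded;
    "LOCK_DOWNGRADE"                     // step 5: message to the lock master
}
```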
Lease enforcement is suspended in the minority partition: since masters in the
majority partition cannot be reached for lease renewal, lease expiry cannot be used
to revoke locks. No new writes are permitted. No data corruption is possible
because the minority cannot acquire or hold write locks, and read-only access
to stale data is explicitly safe for PR/CR modes at the filesystem level (no
on-disk corruption or metadata structure damage, though application-visible staleness
is possible (e.g., readdir may return deleted entries or miss new files created on
the majority partition)). Applications requiring linearizable reads
(e.g., databases with ACID guarantees) may see stale values during the partition;
this is inherent to any system that allows minority-partition reads (CAP theorem).
DSM integration: The DLM's write-lock downgrade is consistent with the DSM's
SUSPECT page mechanism (Section 5.8): DSM write-protects SUSPECT pages while
allowing reads. Both subsystems independently block writes in the minority partition,
providing defense-in-depth.
- Partition heals: Minority nodes rejoin. Lock state is reconciled:
1. Minority nodes report their held lock state to the (majority-elected) masters.
2. Masters compare against current granted queues (majority wins for conflicts).
3. Any minority-held locks that conflict with locks granted during the partition
are forcibly revoked on the minority nodes (cached data invalidated).
4. Non-conflicting locks are re-validated and lease timers restarted.
4. Simultaneous holder + master failure (the node holding locks is also the master for those resources, or both the holder and master crash at the same time)
Timeline:
t=0: Node B crashes. B held EX on resources R1, R2 (with dirty LVBs).
B was also the master for R1 (self-mastered). Node M was the master
for R2 and also crashes at t=0 (e.g., rack power failure).
t=1000ms: NodeDead(B) and NodeDead(M) confirmed.
Recovery (composes scenarios 1 + 2, master rebuild first):
Phase 1 — Master rebuild (scenario 2):
1. Consistent hashing reassigns R1 (was mastered on B) and R2 (was
mastered on M) to surviving nodes. New master N1 gets R1, new master
N2 gets R2.
2. Surviving nodes report their lock state to N1 and N2:
- For R1: Node C reports "I have PR on R1", Node D reports "I am
waiting for EX on R1." No node reports holding EX on R1.
- For R2: Node C reports "I have PR on R2." No node reports holding
EX on R2.
3. N1 and N2 rebuild granted/converting/waiting queues from survivor
reports. Dead node B's locks are absent (B cannot report).
Phase 2 — Dead holder cleanup (scenario 1, applied by new masters):
4. N1 examines R1's rebuilt state: PR holders exist (C), but no EX
holder. A waiting EX request exists (D). N1 infers that the dead
node B held the missing EX lock:
- INFERENCE RULE: If a resource has waiters for an incompatible mode
but no granted lock blocking them, the dead node(s) held the
blocking lock. The new master does not need to know WHICH dead
node — the lock is simply gone.
5. N1 marks R1's LVB as INVALID (LVB_SEQUENCE_INVALID) because the
dead EX holder may have written a dirty LVB that no survivor has.
6. N1 processes the waiting queue: grants D's EX request on R1.
7. N2 performs the same for R2: marks LVB INVALID, grants waiters.
Phase 3 — Journal recovery:
8. If B held journal locks, journal recovery runs against B's journal
slot (same as scenario 1 step 4). The new master coordinates this.
Timeline: same as scenario 2 (~50-200ms for affected resources).
The holder cleanup (phase 2) adds negligible time — it is local queue
manipulation on the new master, no network round-trips.
The key insight is ordering: master rebuild (phase 1) must complete before dead holder cleanup (phase 2), because the new master needs the rebuilt queue state to infer which locks the dead node held. An implementer must NOT attempt scenario 1 cleanup before scenario 2 rebuild — the old master is dead and cannot execute holder cleanup steps.
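The phase-2 inference rule reduces to a small queue scan on the new master. This sketch uses a deliberately simplified compatibility matrix (EX conflicts with everything; PR/CR are mutually compatible) and an assumed `LVB_SEQUENCE_INVALID` sentinel:

```rust
const LVB_SEQUENCE_INVALID: u64 = u64::MAX; // sentinel value is an assumption

#[derive(Clone, Copy)]
enum Mode { Cr, Pr, Ex }

/// Simplified compatibility: EX conflicts with everything, CR/PR coexist.
fn compatible(a: Mode, b: Mode) -> bool {
    !matches!((a, b), (Mode::Ex, _) | (_, Mode::Ex))
}

struct Rebuilt {
    granted: Vec<Mode>, // survivor-reported granted modes
    waiting: Vec<Mode>, // survivor-reported waiters
    lvb_sequence: u64,
}

/// Phase-2 cleanup on the new master: if a waiter is compatible with every
/// SURVIVING granted lock yet was blocked, the blocker it waited on died
/// with the failed node(s) — the master need not know which one. Invalidate
/// the LVB conservatively (the dead holder may have had a dirty LVB no
/// survivor saw) and grant the waiter.
fn dead_holder_cleanup(r: &mut Rebuilt) {
    let mut i = 0;
    while i < r.waiting.len() {
        let w = r.waiting[i];
        if r.granted.iter().all(|&g| compatible(g, w)) {
            r.lvb_sequence = LVB_SEQUENCE_INVALID;
            r.granted.push(r.waiting.remove(i));
        } else {
            i += 1;
        }
    }
}
```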
Key difference from Linux: NO global recovery quiesce. Linux's DLM stops ALL lock activity cluster-wide while recovering from ANY node failure. This is because Linux's DLM recovery protocol requires a globally consistent view of all lock state before it can proceed — every node must acknowledge the recovery, and no new lock operations can be processed until all nodes agree.
UmkaOS's DLM recovers per-resource: only resources mastered on or held by the dead node require recovery. The remaining (typically 90%+) of lock resources continue operating without any pause.
15.15.12 UmkaOS Recovery Advantage¶
The combination of umka-core's architecture and the per-resource DLM recovery protocol creates a fundamentally different failure experience:
Linux path (storage driver crash on Node B):
t=0: Driver crash
t=0-30s: Fencing: cluster must confirm B is dead (IPMI/BMC power-cycle
or SCSI-3 PR revocation). Conservative timeout.
t=30-90s: Reboot: Node B reboots, OS loads, cluster stack starts.
t=90-120s: Rejoin: B rejoins cluster. DLM recovery begins.
GLOBAL QUIESCE: ALL nodes stop ALL lock operations.
t=120-130s: DLM recovery: all nodes exchange lock state, rebuild queues.
t=130s: Normal operation resumes.
Total: 80-130 seconds of disruption. ALL nodes affected.
UmkaOS path (storage driver crash on Node B):
t=0: Driver crash in Tier 1 storage driver.
t=0: Cluster heartbeat CONTINUES (heartbeat runs in umka-core, not
the storage driver). Cluster does NOT detect a node failure.
t=50-150ms: Driver reloads (Tier 1 recovery, Section 11.7). State restored
from checkpoint.
t=150ms: Driver operational. Lock state was never lost (DLM is in
umka-core). No fencing needed. No recovery needed.
Total: 50-150ms I/O pause on Node B only. Zero lock disruption.
Zero impact on other nodes.
The difference is architectural: in Linux, the DLM runs in the same failure domain as storage drivers (all are kernel modules that crash together). In UmkaOS, the DLM is in umka-core — it survives driver crashes. The DLM only needs recovery when umka-core itself fails (which means the entire node is down).
DLM-driver supervisor integration: The DLM and cluster heartbeat run in umka-core,
independent of any driver. When the driver supervisor detects a Tier 1 crash, it notifies the DLM
via dlm_driver_recovering(driver_id). The DLM suspends lock grant callbacks for that
driver's lockspaces until the driver signals ready via dlm_driver_ready(driver_id).
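A minimal model of this handshake follows; the two function names come from the text, while the set-based bookkeeping and `DlmCallbackGate` type are illustrative assumptions:

```rust
use std::collections::HashSet;

/// Supervisor -> DLM notification gate. Lock STATE is never touched here —
/// it lives in umka-core and survives the driver crash; only grant
/// callbacks into the recovering driver are paused.
struct DlmCallbackGate {
    recovering: HashSet<u64>, // driver_ids whose lockspace callbacks are paused
}

impl DlmCallbackGate {
    fn new() -> Self {
        Self { recovering: HashSet::new() }
    }

    /// Supervisor detected a Tier 1 crash: suspend grant callbacks
    /// for this driver's lockspaces.
    fn dlm_driver_recovering(&mut self, driver_id: u64) {
        self.recovering.insert(driver_id);
    }

    /// Driver restored from checkpoint and signaled ready: resume callbacks.
    fn dlm_driver_ready(&mut self, driver_id: u64) {
        self.recovering.remove(&driver_id);
    }

    fn callbacks_enabled(&self, driver_id: u64) -> bool {
        !self.recovering.contains(&driver_id)
    }
}
```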
15.15.13 Application-Level Distributed Locking¶
The DLM provides application-visible locking interfaces:
- `flock()` on clustered filesystems → transparently maps to DLM lock operations. Applications using `flock()` for coordination get cluster-wide locking without code changes.
- `fcntl(F_SETLK)` byte-range locks → DLM range lock resources. POSIX byte-range locks on clustered filesystems provide true cluster-wide exclusion.
- Explicit DLM API via `/dev/dlm` → compatible with Linux's `dlm_controld` interface. Applications that use `libdlm` for explicit distributed locking work without modification.
- `flock2()` system call (new, UmkaOS extension) — enhanced distributed lock with:
  - Lease semantics: caller specifies desired lease duration
  - Failure callback: notification when lock is lost due to node failure
  - Partition behavior: configurable (block, release, or fence)
  - Batch support: lock multiple files in a single system call
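The text defines `flock2()`'s features but not its ABI, so the userspace-side sketch below is entirely hypothetical in its field names and validation rules — it only shows how the four features listed above could compose into one request:

```rust
#[derive(Clone, Copy)]
enum PartitionBehavior { Block, Release, Fence } // configurable partition behavior

/// Hypothetical flock2() request (names are assumptions, not the real ABI).
struct Flock2Request<'a> {
    fds: &'a [i32],                  // batch: lock multiple files in one call
    lease_ns: u64,                   // caller-specified lease duration
    on_partition: PartitionBehavior, // behavior when quorum is lost
    want_loss_notification: bool,    // failure callback on lock loss
}

fn validate(req: &Flock2Request) -> Result<(), &'static str> {
    if req.fds.is_empty() {
        return Err("batch must contain at least one fd");
    }
    if req.lease_ns == 0 {
        return Err("lease duration must be nonzero");
    }
    Ok(())
}
```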
15.15.14 Capability Model¶
DLM operations are gated by capabilities (Section 9.1):
| Capability | Permits |
|---|---|
| `CAP_DLM_LOCK` | Acquire, convert, and release locks on resources in permitted lockspaces |
| `CAP_DLM_ADMIN` | Create and destroy lockspaces, configure parameters, view lock state |
| `CAP_DLM_CREATE` | Create new lock resources (for application-level locking via `/dev/dlm`) |
Lockspaces provide namespace isolation — a container with CAP_DLM_LOCK scoped to its
own lockspace cannot interfere with locks in other lockspaces. Each lockspace is fully
isolated: lock names in one lockspace have no relationship to identically named locks in
another. Lockspaces provide both namespace isolation (containers) and domain separation
(filesystem vs. application locks). GFS2 creates a lockspace per filesystem; applications
create lockspaces via /dev/dlm.
New node provisioning: When a node joins the cluster, it does not initially hold any
CAP_DLM_LOCK capabilities. The cluster coordinator (Raft leader) provisions DLM
capabilities via capability delegation (Section 5.7): the
coordinator creates a scoped CAP_DLM_LOCK for each lockspace the node is authorized
to join, signs it, and sends it as part of the membership acknowledgment. The node
presents this capability when joining a lockspace (the two-sided RDMA Send that returns
the rkey). Nodes not delegated CAP_DLM_LOCK for a lockspace cannot join it.
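The join-time check reduces to matching a delegated, lockspace-scoped capability. The plain-struct representation below is an illustrative assumption — real capabilities are signed by the coordinator (Section 5.7):

```rust
/// Simplified scoped capability: a capability name bound to one lockspace.
/// Real capabilities carry a signature verified at join time.
struct ScopedCap {
    cap: &'static str,
    lockspace: String,
}

/// A node may join a lockspace only if the cluster coordinator delegated
/// a CAP_DLM_LOCK scoped to that exact lockspace. Scoping is the isolation
/// mechanism: a capability for one lockspace grants nothing in another.
fn may_join(delegated: &[ScopedCap], lockspace: &str) -> bool {
    delegated
        .iter()
        .any(|c| c.cap == "CAP_DLM_LOCK" && c.lockspace == lockspace)
}
```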
15.15.15 Lockspace Lifecycle API¶
/// Create a new DLM lockspace.
///
/// The caller must hold `CAP_DLM_ADMIN`. The lockspace name must be unique
/// cluster-wide; if a lockspace with the same name already exists, returns
/// `DlmError::AlreadyExists`. The creating node becomes the initial member
/// and master assignment seed.
///
/// `config`: lockspace-level parameters (lease durations, slab pre-allocation
/// capacity, deadlock detection policy). Defaults are used for any field set
/// to zero.
///
/// On success, the lockspace is broadcast to all cluster peers via the Raft
/// state machine ([Section 5.1](05-distributed.md#distributed-kernel-architecture--raft-consensus)). Peers
/// that hold `CAP_DLM_LOCK` for the new lockspace may join immediately.
pub fn dlm_lockspace_create(
name: &LockspaceName,
config: &LockspaceConfig,
) -> Result<DlmLockspaceHandle, DlmError>;
/// Destroy a DLM lockspace.
///
/// The caller must hold `CAP_DLM_ADMIN`. All locks in the lockspace must be
/// released before destruction; if any locks remain, returns
/// `DlmError::LockspaceBusy`. Destruction is propagated to all peers via
/// Raft — peers that are still members receive a `LockspaceDestroyed` event
/// and drop their local state.
///
/// After destruction, the lockspace name may be reused by a subsequent
/// `dlm_lockspace_create()`. The old lockspace's generation is retained
/// to prevent stale `DlmLockspaceHandle` reuse.
pub fn dlm_lockspace_destroy(
handle: DlmLockspaceHandle,
) -> Result<(), DlmError>;
/// Join an existing DLM lockspace on this node.
///
/// The caller must hold `CAP_DLM_LOCK` scoped to the target lockspace.
/// This node registers with the lockspace master, receives the current
/// RDMA rkey for CAS word arrays ([Section 15.15](#distributed-lock-manager--transport-agnostic-lock-operations)),
/// and begins participating in lock operations. Transport selection
/// ([Section 5.5](05-distributed.md#distributed-ipc--transparent-transport-selection)) is performed
/// for each existing peer in the lockspace during join.
///
/// If this node is already a member, returns `DlmError::AlreadyJoined`.
pub fn dlm_lockspace_join(
name: &LockspaceName,
) -> Result<DlmLockspaceHandle, DlmError>;
/// Leave a DLM lockspace on this node.
///
/// Graceful departure: all locks held by this node in the lockspace are
/// released (with LVB writeback for EX/PW holders). The node's RDMA rkey
/// is deregistered. Remaining members re-master any resources that were
/// mastered on this node using consistent hashing
/// ([Section 15.15](#distributed-lock-manager--recovery-protocol)).
///
/// Unlike `dlm_lockspace_destroy()`, the lockspace continues to exist for
/// other members. The leaving node's local `DlmLockspaceHandle` is
/// invalidated.
pub fn dlm_lockspace_leave(
handle: DlmLockspaceHandle,
) -> Result<(), DlmError>;
/// Lockspace configuration parameters.
pub struct LockspaceConfig {
/// Lease configuration (metadata, data, application lease durations
/// and grace period). Zero values select defaults.
pub lease: LeaseConfig,
/// Pre-allocated slab capacity for DlmResource entries.
/// The ShardedMap grows in page-sized chunks; this sets the initial
/// allocation. Default: 1024 resources per shard (256K total).
pub initial_resource_capacity: u32,
/// Deadlock detection mode.
/// `true`: enable wait-for graph cycle detection (adds ~2-5 μs to
/// contested lock path). `false`: disable (caller responsible for
/// deadlock avoidance, e.g., lock ordering discipline).
/// Default: `true`.
pub deadlock_detection: bool,
/// Transport preference override. If `None`, standard priority
/// (CXL > RDMA > TCP) is used. If `Some`, the specified transport
/// is forced for all peers in this lockspace.
pub transport_override: Option<TransportType>,
}
/// Opaque handle to a joined lockspace on this node.
/// Carries an internal generation counter to prevent use-after-destroy.
pub struct DlmLockspaceHandle {
/// Index into the node-local lockspace table.
index: u32,
/// Generation at creation time. Compared on every operation to detect
/// stale handles after lockspace destroy + name reuse.
generation: u64,
}
Lockspace lifecycle state machine:
Created ──(dlm_lockspace_join() by any peer)──► Active (≥1 member)
Active ──(dlm_lockspace_leave() by last member)──► Empty (no members, state preserved)
Active ──(dlm_lockspace_destroy() with no locks)──► Destroyed
Empty ──(dlm_lockspace_join() by any peer)──► Active
Empty ──(dlm_lockspace_destroy())──► Destroyed
Destroyed: lockspace name freed, generation counter retained.
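The diagram above can be written as a transition function. The `locks_held` flag models the no-locks precondition on destroy (`DlmError::LockspaceBusy` otherwise); the enum names are local stand-ins:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum LsState { Created, Active, Empty, Destroyed }

#[derive(Clone, Copy)]
enum LsEvent { Join, LastLeave, Destroy }

/// Lockspace lifecycle transitions. Destroy from Active is legal only when
/// no locks remain; any other combination is rejected.
fn step(s: LsState, e: LsEvent, locks_held: bool) -> Result<LsState, &'static str> {
    use LsEvent::*;
    use LsState::*;
    match (s, e) {
        (Created, Join) | (Empty, Join) => Ok(Active),
        (Active, LastLeave) => Ok(Empty),
        (Active, Destroy) if !locks_held => Ok(Destroyed),
        (Active, Destroy) => Err("LockspaceBusy"),
        (Empty, Destroy) => Ok(Destroyed),
        _ => Err("invalid transition"),
    }
}
```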
Kernel-internal vs. application lockspaces: GFS2 and other kernel subsystems call
dlm_lockspace_create() / dlm_lockspace_join() directly from kernel context.
Application-level locking (Section 15.15)
uses the /dev/dlm ioctl interface, which maps to the same lifecycle functions with
capability checks against the calling process's credential.
15.15.16 Performance Summary¶
| Operation | UmkaOS (RDMA) | UmkaOS (TCP fallback) | Linux DLM | vs Linux (RDMA) |
|---|---|---|---|---|
| Uncontested acquire | ~3-5 μs (CAS + confirmation) | ~100-400 μs (two-sided) | ~30-50 μs (TCP) | ~10-15x |
| Uncontested acquire + LVB read | ~4-6 μs | ~150-500 μs | ~100 μs | ~20x |
| Contested acquire (same master) | ~5-8 μs | ~100-400 μs | ~100-200 μs (TCP) | ~20-30x |
| Batch N locks (same master) | ~5-10 μs | ~150-500 μs | N x 30-50 μs | ~Nx8x |
| Lock any of N resources | ~5-10 μs | ~150-500 μs | N x 30-50 μs (sequential) | ~Nx8x |
| Lease extension | ~1-2 μs (push_page) | ~50-200 μs | N/A (no leases) | -- |
| Lock holder recovery | ~50-200 ms (affected only) | ~50-200 ms | 5-10 s (global quiesce) | ~50x |
| Lock master recovery | ~200-500 ms (affected only) | ~200-500 ms | 5-10 s (global quiesce) | ~20x |
TCP fallback note: On TCP transports, the DLM CAS fast path is unavailable
(transport.supports_one_sided() == false). All lock operations use the two-sided
transport.send_reliable() path. TCP latency (~100-400 μs per lock) is comparable to
Linux DLM latency but benefits from integrated kernel-to-kernel messaging (no
kernel/userspace transitions, no separate daemon processes). Recovery times are
identical across transports because recovery is dominated by heartbeat timeouts and
queue rebuilds, not transport latency.
Arithmetic basis: RDMA CAS latency is measured at 1.5-2.5 μs on InfiniBand HDR
(200 Gb/s) and RoCEv2 (100 Gb/s) in published benchmarks. The full uncontested
acquire includes the raw CAS (~2-3 μs) plus the mandatory confirmation
transport.send_reliable() (~1-2 μs on RDMA), totaling ~3-5 μs. Contested locks
add ~1-2 μs for receive-side processing on RDMA. On TCP, each
transport.send_reliable() call incurs kernel TCP stack processing (~15-20 μs per
direction) plus cluster message framing, totaling ~50-200 μs per round-trip. Linux
DLM TCP latency includes TCP stack processing (~15-20 μs round-trip), DLM lock manager
processing (~10-15 μs), and completion notification (~5-10 μs), totaling ~30-50 μs in
published GFS2 benchmarks.
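The component sums can be checked mechanically; the ranges below are the figures quoted above, in microseconds:

```rust
/// Component latency ranges (low, high) in microseconds, taken from the
/// arithmetic basis above.
const RDMA_CAS_US: (f64, f64) = (2.0, 3.0);     // raw RDMA CAS
const RDMA_CONFIRM_US: (f64, f64) = (1.0, 2.0); // mandatory send_reliable confirmation

/// Sum two (low, high) ranges component-wise.
fn range_sum(a: (f64, f64), b: (f64, f64)) -> (f64, f64) {
    (a.0 + b.0, a.1 + b.1)
}
```

Summing CAS and confirmation reproduces the table's ~3–5 μs uncontested-acquire figure.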
Note: The Linux DLM runs entirely in-kernel since kernel 2.6; dlm_controld handles
only membership events, not lock operations.
15.15.17 Data Structures¶
/// Fixed-capacity open-addressing hash table shard with slab-backed storage.
/// Capacity is chosen at construction time and never changes — no rehashing,
/// no heap allocation on the insert hot path, no spinlock hold during allocation.
///
/// Slots store `SlabHandle` indices (8 bytes each) rather than inline `(K, V)`
/// pairs. The actual `(K, V)` entries are allocated from a per-lockspace slab
/// allocator, keeping the per-shard inline array small: 4096 × 8 = 32 KiB
/// per shard. Without this indirection, inline `(K, V)` pairs for
/// `(ResourceName, DlmResource)` would be ~2,760 bytes each, producing
/// ~11 MiB per shard and ~2.83 GiB per lockspace — catastrophically unusable.
pub struct ShardedMapShard<K, V, const CAP: usize> {
/// Open-addressing table. Each slot stores a slab handle pointing to
/// the actual `(K, V)` entry. CAP must be a power of 2. Load factor
/// kept <= 0.75 by construction.
slots: [Option<SlabHandle<(K, V)>>; CAP],
count: usize,
}
/// Opaque handle into the per-lockspace slab allocator.
/// 8 bytes (u64 index), allowing O(1) entry access via slab_get(handle).
/// SlabHandle is Copy + Clone for hash table operations.
#[derive(Copy, Clone)]
pub struct SlabHandle<T> {
index: u64,
_marker: core::marker::PhantomData<T>,
}
/// Sharded lock table for DLM. Each shard has its own spinlock to minimize contention.
///
/// ShardedMap uses fixed-capacity open-addressing with slab-backed entries to ensure
/// spinlock hold times are bounded and O(1). The DLM must pre-allocate sufficient
/// slab capacity based on expected concurrent lock count; capacity exhaustion returns
/// `DlmError::TableFull` rather than blocking. Insertion returns `Err` if the load
/// factor would exceed 75%.
/// `insert_or_update` and `remove` complete in bounded time under the spinlock —
/// there is no rehashing, no inline entry allocation, and no unbounded iteration.
/// Default SHARD_CAP of 4096 with 256 shards gives 256 * 4096 * 0.75 = ~786K
/// resources at 75% load factor. Per-shard memory: 4096 * 8 = 32 KiB;
/// 256 shards total: ~8 MiB (slot arrays only; slab memory is separate).
/// GFS2 workloads may have millions of locked inodes;
/// `initial_resource_capacity` in `LockspaceConfig` allows tuning the
/// per-lockspace capacity at creation time.
pub struct ShardedMap<K: Hash + Eq, V, const SHARDS: usize = 256, const SHARD_CAP: usize = 4096> {
shards: [SpinLock<ShardedMapShard<K, V, SHARD_CAP>>; SHARDS],
/// Per-lockspace slab allocator for `(K, V)` entries. Grows in
/// page-sized chunks; individual entries are O(1) alloc/free.
slab: SlabAllocator<(K, V)>,
}
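The two invariants the comments call out — fixed capacity (no rehash under the spinlock) and refusal above 75% load — look like this in a minimal shard. `u64` stands in for `SlabHandle`, probing is plain linear, and deletion/tombstones are omitted; this is a sketch, not the kernel structure:

```rust
/// Minimal fixed-capacity open-addressing shard. CAP must be a power of
/// two so index wrap is `& (CAP - 1)`. Slots hold small (hash, handle)
/// pairs rather than inline entries, keeping the array compact.
struct Shard<const CAP: usize> {
    slots: [Option<(u64, u64)>; CAP],
    count: usize,
}

impl<const CAP: usize> Shard<CAP> {
    fn new() -> Self {
        Self { slots: [None; CAP], count: 0 }
    }

    /// Insert or update. Refused (TableFull) if the insert would push the
    /// load factor above 75%, so probe chains stay bounded and the caller
    /// never blocks on allocation under the shard spinlock.
    fn insert(&mut self, key_hash: u64, handle: u64) -> Result<(), &'static str> {
        if (self.count + 1) * 4 > CAP * 3 {
            return Err("TableFull");
        }
        let mut i = (key_hash as usize) & (CAP - 1);
        loop {
            match self.slots[i] {
                None => {
                    self.slots[i] = Some((key_hash, handle));
                    self.count += 1;
                    return Ok(());
                }
                Some((k, _)) if k == key_hash => {
                    self.slots[i] = Some((key_hash, handle)); // update in place
                    return Ok(());
                }
                // Load factor < 1 guarantees a free slot, so this terminates.
                _ => i = (i + 1) & (CAP - 1),
            }
        }
    }
}
```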
/// DLM lockspace — namespace for a set of related lock resources.
pub struct DlmLockspace {
/// Lockspace name (e.g., "gfs2:550e8400-e29b" for a GFS2 filesystem).
pub name: LockspaceName,
/// Lock resources in this lockspace.
/// Sharded concurrent hash map: 256 shards, each with its own SpinLock.
/// Shard = hash(resource_name) & 0xFF. This reduces lock contention from
/// a single global bottleneck to per-shard contention. Individual lock
/// operations only hold their shard's SpinLock, allowing concurrent access
/// to resources in different shards. DlmResource entries are allocated
/// from a per-lockspace slab allocator.
pub resources: ShardedMap<ResourceName, DlmResource, 256>,
/// Lease configuration for this lockspace.
pub lease_config: LeaseConfig,
/// Deadlock detection state.
pub wait_for_graph: Mutex<WaitForGraph>,
/// Statistics counters.
pub stats: DlmStats,
}
/// Per-lockspace lease configuration.
pub struct LeaseConfig {
/// Default lease duration for metadata locks.
pub metadata_lease_ns: u64,
/// Default lease duration for data locks.
pub data_lease_ns: u64,
/// Default lease duration for application locks.
pub app_lease_ns: u64,
/// Grace period after lease expiry before forced revocation.
pub grace_period_ns: u64,
}
/// DLM statistics (per-lockspace, exposed via umkafs unified-object-namespace).
pub struct DlmStats {
/// Total lock operations (acquire + convert + release).
pub lock_ops: AtomicU64,
/// Operations served by RDMA CAS fast path (uncontested).
pub fast_path_ops: AtomicU64,
/// Operations requiring RDMA Send (contested).
pub slow_path_ops: AtomicU64,
/// Batch operations.
pub batch_ops: AtomicU64,
/// Lock-any-of operations.
pub lock_any_ops: AtomicU64,
/// Deadlocks detected.
pub deadlocks_detected: AtomicU64,
/// Recovery events (holder + master).
pub recovery_events: AtomicU64,
}
/// DLM error type returned by lock/unlock operations.
/// Maps to standard errno values for Linux compatibility.
pub enum DlmError {
/// Deadlock detected by the wait-for graph (Section 15.12.9).
Deadlock,
/// Trylock failed — lock is held in an incompatible mode.
Again,
/// Lockspace does not exist.
NoEntry,
/// Resource exhaustion (slab allocator, ShardedMap capacity).
NoMemory,
/// Wait interrupted by signal delivery.
Interrupted,
/// Timeout expired before lock was granted.
TimedOut,
/// Lock not held by caller (unlock of unheld lock).
NotHeld,
/// Invalid lock handle (already released or corrupted).
InvalidHandle,
/// ShardedMap shard is at capacity (load factor > 75%).
TableFull,
}
/// Per-node DLM recovery state machine. Each entry in the DLM's node table
/// carries one of these states. The state transitions enforce the ordering
/// constraint that actual lock reclaim (removal from granted queues) MUST NOT
/// proceed until the membership layer has delivered authoritative `NodeDead`.
///
/// State transitions:
/// Normal → AwaitingNodeDead: DLM's own liveness probe detects suspect node.
/// AwaitingNodeDead → Recovering: membership layer delivers `NodeDead`.
/// Recovering → Normal: all affected resources have been re-mastered and
/// lock queues unfrozen.
/// AwaitingNodeDead → Normal: membership layer declares the node alive
/// (false positive from the DLM probe — network partition healed).
pub enum DlmRecoveryState {
/// No recovery in progress for this node.
Normal,
/// DLM probe detected a suspect node but the membership layer has NOT yet
/// confirmed `NodeDead`. During this phase, the DLM pre-stages recovery:
/// freezes local lock queues mentioning the suspect node, pre-computes
/// new master assignments, and cancels in-flight RDMA operations. No locks
/// are removed from granted queues — the suspect node may still be alive
/// (e.g., transient network partition).
AwaitingNodeDead {
/// The node under suspicion.
failed_node: NodeId,
/// Monotonic instant when the DLM probe first detected the failure.
detected_at: Instant,
},
/// Membership layer has confirmed `NodeDead`. Actual lock reclaim is in
/// progress: dead node's lock entries are removed from granted queues,
/// waiting/converting queues are re-evaluated, and blocked lock requests
/// are granted where the dead node's hold was the sole blocker. Resources
/// whose CAS-word pages have DSM dependency on the dead node wait for
/// `dsm_recovery_complete` before re-mastering
/// ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--cross-subsystem-recovery-ordering-dsm-and-dlm)).
Recovering {
/// The confirmed-dead node.
dead_node: NodeId,
/// Resources whose lock queues are currently frozen pending re-mastering.
/// Bounded by the number of resources mastered on or held by the dead
/// node — typically O(R/N) where R is total resources and N is cluster
/// size. Stored as an XArray keyed by `ResourceId` for O(1) lookup.
frozen_resources: XArray<ResourceId>,
},
}
DLM error codes: All lock and unlock operations return Result<_, DlmError>.
The following table defines the complete error space:
| Operation | Error | errno | Meaning |
|---|---|---|---|
| `lock()` | — | 0 | Success |
| `lock()` | `Deadlock` | `EDEADLK` | Deadlock detected by wait-for graph (Section 15.15) |
| `lock()` | `Again` | `EAGAIN` | Trylock failed (lock held in incompatible mode) |
| `lock()` | `NoEntry` | `ENOENT` | Lockspace does not exist |
| `lock()` | `NoMemory` | `ENOMEM` | Resource exhaustion (slab or lock table full) |
| `lock()` | `Interrupted` | `EINTR` | Wait interrupted by signal |
| `lock()` | `TimedOut` | `ETIMEDOUT` | Timeout expired before grant |
| `lock()` | `TableFull` | `ENOSPC` | ShardedMap shard at capacity |
| `unlock()` | — | 0 | Success |
| `unlock()` | `NotHeld` | `ENOENT` | Lock not held by caller |
| `unlock()` | `InvalidHandle` | `EINVAL` | Invalid lock handle (already released or corrupted) |
The errno mapping is applied at the syscall compatibility boundary
(Section 19.1) for application-level DLM operations via
/dev/dlm (Section 15.15)
and flock()/fcntl() on clustered filesystems. Kernel-internal callers
(e.g., GFS2, UPFS) receive the typed DlmError enum directly.
DlmError → LockError conversion: The ClusterLockAdapter trait
(Section 15.15) returns LockError, not
DlmError. The conversion is defined as:
impl From<DlmError> for LockError {
fn from(e: DlmError) -> Self {
match e {
DlmError::Deadlock => LockError::Deadlock,
DlmError::TimedOut => LockError::Timeout,
DlmError::Interrupted => LockError::Cancelled,
DlmError::Again => LockError::WouldBlock,
DlmError::NoEntry => LockError::ResourceDestroyed,
DlmError::NoMemory => LockError::WouldBlock, // transient slab exhaustion; VFS retries
DlmError::NotHeld => LockError::InvalidMode,
DlmError::InvalidHandle => LockError::ResourceDestroyed,
DlmError::TableFull => LockError::WouldBlock, // capacity limit; VFS retries with backoff
}
}
}
This mapping is intentionally lossy — LockError is the VFS-facing error type
with coarser granularity than the DLM's internal DlmError. Filesystem drivers
that need finer-grained error handling (e.g., distinguishing NoMemory from
TableFull for retry backoff) should call the DLM API directly and receive
DlmError.
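A hedged sketch of the VFS-side retry that the two `WouldBlock` comments anticipate, using a local copy of the `LockError` variants (the retry policy and helper name are assumptions; real code would sleep with exponential backoff between attempts):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum LockError { Deadlock, Timeout, Cancelled, WouldBlock, ResourceDestroyed, InvalidMode }

/// Retry policy: WouldBlock (transient slab exhaustion or TableFull) is
/// retried up to a cap; every other error is surfaced to the caller.
/// Returns the number of retries performed on success.
fn acquire_with_retry(
    mut attempt: impl FnMut() -> Result<(), LockError>,
    max_retries: u32,
) -> Result<u32, LockError> {
    let mut tries = 0;
    loop {
        match attempt() {
            Ok(()) => return Ok(tries),
            Err(LockError::WouldBlock) if tries < max_retries => {
                tries += 1; // real code: backoff sleep here
            }
            Err(e) => return Err(e),
        }
    }
}
```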
15.15.18 Licensing¶
The VMS/DLM lock model is published academic work (VAX/VMS Internals and Data Structures, Digital Press, 1984). The six-mode compatibility matrix, Lock Value Block concept, and granted/converting/waiting queue model are well-documented in public literature and implemented by multiple independent projects (Linux DLM, Oracle DLM, HP OpenVMS DLM). No patent or proprietary IP concerns.
RDMA Atomic CAS and Send/Receive operations are standard InfiniBand/RoCE verbs defined by the IBTA (InfiniBand Trade Association) specification, which is publicly available.
15.15.19 DLM Master Election and Liveness Integration¶
The DLM uses a deterministic master election based on node ranking rather than a Paxos/Raft round to minimize election latency in the common case (no failures).
Master selection rule: The node with the lowest node_id among currently healthy cluster members is the DLM master. On membership change (join/leave), all nodes independently compute the new master from the updated membership view — no election protocol needed. This requires consistent failure detection.
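The rule is a one-line fold over the membership view; the `(node_id, healthy)` pair encoding below is an assumption for illustration. Because every node evaluates the same pure function on the same view, agreement on the view implies agreement on the master:

```rust
/// Deterministic master selection: lowest node_id among healthy members.
/// Returns None when no healthy member exists (no master; election in
/// progress from the DlmMaster perspective).
fn elect_master(members: &[(u64, bool)]) -> Option<u64> {
    members
        .iter()
        .filter(|&&(_, healthy)| healthy)
        .map(|&(id, _)| id)
        .min()
}
```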
/// DLM master state. One instance per DLM domain (per filesystem/cluster).
pub struct DlmMaster {
/// Node ID of the current master (determined by lowest-node-id rule).
/// Atomically updated on membership changes. Zero = no master (election in progress).
/// u64 matches NodeId width — no truncation on large clusters.
pub master_node_id: AtomicU64,
/// True if this node is the current DLM master.
pub is_master: AtomicBool,
/// Monotonic epoch counter. Incremented on each master transition.
/// Used to detect stale messages from a previous master.
pub epoch: AtomicU64,
/// Per-peer liveness tracking. Keyed by PeerId (u64). XArray provides
/// O(1) lookup with native RCU-protected reads and ordered iteration.
/// Updated from cluster heartbeat callbacks and DLM message receipt.
pub peers: XArray<DlmPeerState>,
/// Last time this node received any DLM message from each peer (nanoseconds).
/// Keyed by PeerId (u64). XArray native RCU reads replace RcuHashMap.
/// Updated on receipt of ANY DLM message (lock request, grant, convert, etc.)
/// — every DLM message is implicit proof of liveness. Also updated when the
/// cluster heartbeat layer notifies the DLM of a received heartbeat from a
/// peer that participates in this lockspace.
pub last_heard_ns: XArray<AtomicU64>,
}
/// Liveness tracking state for one peer node within this DLM domain.
/// Updated from two sources: (1) cluster heartbeat callbacks
/// ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--heartbeat-protocol)), and
/// (2) receipt of any DLM wire message from the peer.
pub struct DlmPeerState {
/// Node is considered live by the DLM's local view. Set to `false` when
/// the DLM monitor detects prolonged silence (no DLM messages AND no
/// cluster heartbeats forwarded for this peer). This is an advisory
/// signal only — authoritative failure is determined by the cluster
/// membership layer's `NodeDead` event.
pub alive: AtomicBool,
/// Number of consecutive DLM monitor wakeups with no activity from this peer.
pub missed_intervals: AtomicU32,
/// Cluster heartbeat sequence from the last forwarded heartbeat.
/// Used to correlate DLM liveness with the cluster-level heartbeat
/// sequence (avoids re-processing stale forwarded heartbeats).
pub last_hb_seq: AtomicU64,
}
/// DLM monitor wakeup interval. The DLM monitor thread wakes at this interval
/// to check `last_heard_ns` for each peer and pre-stage lock reclaim if a
/// peer appears unresponsive. The DLM does NOT send its own heartbeat messages
/// — liveness information comes from the cluster-level heartbeat protocol
/// ([Section 5.8](05-distributed.md#failure-handling-and-distributed-recovery--heartbeat-protocol)), which
/// uses neighbor-only topology and scales O(neighbors) not O(peers).
pub const DLM_MONITOR_INTERVAL_NS: u64 = 500_000_000; // 500 ms
/// A peer is considered suspect by the DLM if no activity (DLM messages or
/// forwarded cluster heartbeats) has been observed for this duration.
/// 3x monitor interval provides tolerance for transient delays.
pub const DLM_SUSPECT_TIMEOUT_NS: u64 = 1_500_000_000; // 1.5 s (3 x 500 ms)
/// After the cluster membership layer delivers a `NodeDead` event, wait this
/// long before reclaiming the dead node's locks. Allows the failed node's
/// RDMA NIC to drain in-flight operations.
pub const DLM_LOCK_RECLAIM_DELAY_NS: u64 = 200_000_000; // 200 ms
Liveness model — no DLM-specific heartbeat: The DLM does NOT run its own heartbeat protocol. Sending DLM-specific heartbeats to all peers in a lockspace would create O(N^2) traffic on TCP clusters (N nodes each sending to N-1 peers), making clusters of >100 nodes impractical. Instead, the DLM derives liveness from two existing sources:
- Cluster-level heartbeat (Section 5.8): The cluster membership layer heartbeats only direct neighbors in the topology graph (O(neighbors) per node, typically 2-6). When a cluster heartbeat is received from a neighbor that participates in this DLM lockspace, the heartbeat layer forwards a callback to the DLM, which updates `last_heard_ns[sender]`. This piggyback is free — no additional network traffic.
- DLM message receipt: Every DLM wire message (lock request, grant, convert, release, revocation, etc.) is implicit proof of liveness. The DLM updates `last_heard_ns[sender]` on receipt of any message. For active lockspaces with ongoing lock traffic, this provides sub-millisecond failure detection without any heartbeat at all.
Failure detection algorithm (runs in `kthread/dlm_monitor`):
- Every `DLM_MONITOR_INTERVAL_NS` (500 ms): the monitor thread wakes and checks `last_heard_ns` for each peer in the lockspace.
- For each peer, if `now_ns - last_heard_ns[peer] > DLM_SUSPECT_TIMEOUT_NS` and `peer.alive == true`:
  - Set `peer.alive = false`, increment `missed_intervals`.
  - Send an advisory liveness-suspect hint to the membership layer (not an authoritative failure declaration — the membership layer independently confirms or rejects the failure via its own quorum-protected protocol).
  - Pre-stage lock reclaim preparation (freeze local lock queues mentioning the suspect node, pre-compute new master assignments). Actual lock removal from granted queues does NOT occur until the membership layer delivers authoritative `NodeDead` confirmation — see the ordering constraint in the Integration with Cluster Membership section above.
- On receipt of any DLM message or forwarded cluster heartbeat from a peer: update `last_heard_ns[sender]`, reset `missed_intervals[sender]`, set `peer.alive = true` if previously suspect.
- Master recomputation: after any membership change (authoritative `NodeDead` or `NodeJoined` from the membership layer), all nodes compute `new_master = min(alive_node_ids)`. If `new_master != master_node_id`, atomically swap and increment epoch.
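The per-wakeup suspect check reduces to a timestamp comparison per peer. A sketch with plain types (`suspect_peers` is a hypothetical name; the kernel version iterates the `last_heard_ns` XArray under RCU):

```rust
/// 3x the 500 ms monitor interval, matching DLM_SUSPECT_TIMEOUT_NS.
const SUSPECT_TIMEOUT_NS: u64 = 1_500_000_000;

/// Return the peer IDs that have shown no activity (DLM messages or
/// forwarded heartbeats) for longer than the suspect timeout.
/// `peers` is a list of (peer_id, last_heard_ns) pairs.
fn suspect_peers(now_ns: u64, peers: &[(u64, u64)]) -> Vec<u64> {
    peers
        .iter()
        .filter(|&&(_, last)| now_ns.saturating_sub(last) > SUSPECT_TIMEOUT_NS)
        .map(|&(peer, _)| peer)
        .collect()
}

fn main() {
    let now = 10_000_000_000; // 10 s into uptime
    // Peer 2 was last heard 2 s ago (> 1.5 s timeout); peer 1 is fresh.
    let peers = [(1, now - 100_000_000), (2, now - 2_000_000_000)];
    assert_eq!(suspect_peers(now, &peers), vec![2]);
}
```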
Relationship to cluster heartbeat: The DLM is a consumer of the cluster heartbeat,
not a producer. The cluster membership layer (Section 5.8)
provides neighbor-only heartbeat with O(neighbors) traffic, failure detection via
Suspect/Dead state transitions, and authoritative NodeDead events protected by
quorum consensus. The DLM's monitor thread provides DLM-scoped early warning
(pre-staging lock reclaim before the membership layer confirms failure) using locally
observed liveness signals. This two-tier approach eliminates the Linux DLM problem
where DLM and corosync can disagree on node liveness during partial-failure scenarios,
while avoiding the O(N^2) traffic that a DLM-specific heartbeat would produce.
15.15.20 VFS Lock Integration (ClusterLockAdapter)¶
Clustered filesystems (GFS2, OCFS2, future UPFS) must translate POSIX file locking
semantics (flock(), fcntl(F_SETLK)) into DLM lock operations. Rather than
each filesystem implementing ad-hoc DLM integration, UmkaOS defines a standard
adapter trait that the VFS file locking layer calls directly.
/// A byte range for file-level locking. Used by the ClusterLockAdapter to
/// translate POSIX/flock byte-range locks into DLM lock scope. The range
/// is inclusive on both ends: `[start, end]`. A range of `[0, u64::MAX]`
/// represents a whole-file lock (equivalent to flock semantics).
pub struct LockRange {
/// Start byte offset (inclusive).
pub start: u64,
/// End byte offset (inclusive). u64::MAX = end-of-file.
pub end: u64,
}
/// Errors returned by DLM lock operations (lock, unlock, convert).
#[derive(Debug)]
pub enum LockError {
/// The lock request would cause a deadlock (detected by the distributed
/// wait-for graph). The caller should abort the operation and retry.
Deadlock,
/// The lock request timed out waiting for grant (exceeded the caller's
/// specified timeout or the default 30-second DLM timeout).
Timeout,
/// The non-blocking trylock failed because the lock is held in an
/// incompatible mode. The caller should retry after a brief backoff.
/// Maps from `DlmError::Again` (errno `EAGAIN`). Distinct from `Timeout`
/// which indicates an expired wait duration.
WouldBlock,
/// The lock request was cancelled by the caller or by cluster membership
/// change (the requesting node was evicted during the wait).
Cancelled,
/// The requested lock mode is invalid for this operation (e.g., converting
/// from EX to a mode incompatible with the resource's current state).
InvalidMode,
/// The target lock resource has been destroyed (e.g., the filesystem was
/// unmounted or the lockspace was released while the lock was pending).
ResourceDestroyed,
}
/// A lock resource entry in the DLM master's resource directory. Tracks the
/// resource name, current master node, and the set of waiters for deadlock
/// detection. This is the master-side view; each node also maintains a local
/// resource cache (`DlmResource`) for resources it has active locks on.
pub struct DlmLockResource {
/// Resource name (hierarchical, variable-length, max 256 bytes).
/// Format: "fsname:type:object" (e.g., "gfs2:inode:0x1234").
/// Uses `ResourceName` (byte array, not UTF-8-enforced `ArrayString`)
/// because DLM resource names may contain non-UTF-8 bytes (e.g.,
/// binary inode numbers in GFS2/OCFS2 lock naming conventions).
pub name: ResourceName,
/// Node ID of the current master for this resource. Updated during
/// re-mastering after node failure recovery.
pub master_node: NodeId,
/// Queue of waiters for this resource, used by the distributed deadlock
/// detector to construct the wait-for graph. Each entry represents a
/// pending lock request that is blocked by an incompatible holder.
    /// Bounded at 64 concurrent waiters per resource — if exceeded, new lock
    /// requests receive `DlmError::Again` (backpressure). In practice,
    /// clustered filesystems rarely exceed 8-16 waiters per inode lock.
pub lock_queue: SpinLock<ArrayVec<DlmWaitEdge, 64>>,
}
/// Adapter trait for translating VFS file locking operations to DLM operations.
/// Implemented by clustered filesystems (GFS2, OCFS2) in their filesystem driver.
pub trait ClusterLockAdapter {
/// Translate a POSIX/flock lock request to a DLM lock acquisition.
/// mode mapping: LOCK_SH → DLM_LOCK_PR, LOCK_EX → DLM_LOCK_EX
/// LOCK_UN → dlm_unlock(). Blocking locks: DLM_LKF_WAIT.
fn lock_file(
&self,
inode_id: InodeId,
range: LockRange,
mode: LockMode,
wait: bool,
) -> Result<DlmLockHandle, LockError>;
/// Release a DLM lock when the corresponding VFS lock is dropped.
fn unlock_file(&self, handle: DlmLockHandle) -> Result<(), LockError>;
/// Integrate with deadlock detection. Returns the DLM's global wait-for
/// graph entries for this filesystem. VFS merges these with its local
/// WaitForGraph ([Section 14.14](14-vfs.md#local-file-locking)) for cross-node deadlock detection.
/// Returns at most `MAX_DLM_WAIT_EDGES` (64) remote wait-for edges for deadlock
    /// detection. 64 matches the per-resource `lock_queue` bound (see `DlmLockResource`
/// above) — there can never be more than 64 waiters per resource, so 64 edges
/// suffices. If multiple resources are queried, the deadlock detector calls
/// `get_remote_waiters()` per inode and merges results.
fn get_remote_waiters(&self, inode_id: InodeId) -> ArrayVec<DlmWaitEdge, 64>;
}
/// An edge in the distributed wait-for graph. Maps a DLM lock waiter to
/// the VFS thread space so that the VFS deadlock detector can merge DLM
/// (cross-node) and local (single-node) wait-for relationships.
pub struct DlmWaitEdge {
/// Thread ID of the waiter (translated from DLM owner ID to local
/// ThreadId space via the cluster membership table).
pub waiter: ThreadId,
/// Thread ID of the holder that the waiter is blocked on.
/// For remote holders, this is a synthetic ThreadId allocated from
/// a per-node range to avoid collision with local threads.
pub holder: ThreadId,
/// The DLM lock mode requested by the waiter.
pub requested_mode: LockMode,
/// The DLM lock mode currently held by the holder.
pub held_mode: LockMode,
/// Node ID of the waiter (for diagnostic logging and cycle reporting).
pub waiter_node: u64,
/// Node ID of the holder.
pub holder_node: u64,
}
VFS integration point:
- `FileSystemOps::cluster_lock_ops() -> Option<&dyn ClusterLockAdapter>` — returns the adapter for clustered filesystems, `None` for local filesystems.
- `vfs_lock_file()` (the kernel's central file locking entry point, Section 14.14): checks `sb.fs_ops.cluster_lock_ops()`. If `Some` → DLM path via the adapter. If `None` → local VFS locking (Section 14.14).
Lock mode mapping:
| VFS Lock | DLM Mode | Semantics |
|---|---|---|
| `LOCK_SH` / `F_RDLCK` | `DLM_LOCK_PR` (Protected Read) | Multiple readers allowed, writers excluded |
| `LOCK_EX` / `F_WRLCK` | `DLM_LOCK_EX` (Exclusive) | Single writer, all others excluded |
| `LOCK_UN` / `F_UNLCK` | `dlm_unlock()` | Release the DLM lock |
Blocking semantics: When `wait == true` (corresponding to `F_SETLKW` or
`flock()` without `LOCK_NB`), the DLM lock request is submitted with
`DLM_LKF_WAIT`. The calling task sleeps interruptibly until the lock is granted
or a signal is delivered (returning `EINTR`). The DLM's deadlock detector
(Section 15.15) may return `EDEADLK` if a
cross-node deadlock cycle is detected.
Deadlock detection: The VFS maintains a local WaitForGraph for single-node
deadlock detection (Section 14.14). For
clustered filesystems, this graph must be extended with cross-node wait edges.
The adapter's get_remote_waiters() returns DLM-tracked wait edges for a given
inode. The VFS deadlock detector merges these remote edges with its local graph
during cycle detection, providing unified cross-node deadlock detection without
requiring the VFS to understand DLM internals.
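The merge-and-detect step can be sketched as a DFS over the combined edge set (a standalone sketch; the real detector walks the VFS WaitForGraph structure directly and translates `DlmWaitEdge` entries into the same ThreadId space first):

```rust
use std::collections::{HashMap, HashSet};

/// Detect a cycle in the merged wait-for graph. Each edge is a
/// (waiter, holder) ThreadId pair; once merged, local edges and
/// remote (DlmWaitEdge) edges are indistinguishable.
fn has_cycle(edges: &[(u64, u64)]) -> bool {
    let mut adj: HashMap<u64, Vec<u64>> = HashMap::new();
    for &(w, h) in edges {
        adj.entry(w).or_default().push(h);
    }
    // Classic three-color DFS: `visiting` = gray, `done` = black.
    fn dfs(
        n: u64,
        adj: &HashMap<u64, Vec<u64>>,
        visiting: &mut HashSet<u64>,
        done: &mut HashSet<u64>,
    ) -> bool {
        if done.contains(&n) {
            return false;
        }
        if !visiting.insert(n) {
            return true; // back edge: cycle found
        }
        for &m in adj.get(&n).into_iter().flatten() {
            if dfs(m, adj, visiting, done) {
                return true;
            }
        }
        visiting.remove(&n);
        done.insert(n);
        false
    }
    let (mut visiting, mut done) = (HashSet::new(), HashSet::new());
    let nodes: Vec<u64> = adj.keys().copied().collect();
    nodes.into_iter().any(|n| dfs(n, &adj, &mut visiting, &mut done))
}

fn main() {
    // Local edge 1→2 merged with remote edge 2→1: cross-node deadlock.
    assert!(has_cycle(&[(1, 2), (2, 1)]));
    // A simple wait chain has no cycle.
    assert!(!has_cycle(&[(1, 2), (2, 3)]));
}
```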
DLM resource naming: The adapter translates inode IDs and byte ranges into
DLM resource names. The recommended convention is
"FL:<fsid>:<inode_id>:<start>:<end>" for byte-range locks and
"FL:<fsid>:<inode_id>" for whole-file flock() locks. Each filesystem's
adapter implementation defines its own naming scheme.
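Under the recommended convention, name construction is a formatting exercise. A sketch using `String` for readability (real names are `ResourceName` byte arrays, and each filesystem may substitute its own scheme):

```rust
/// Byte-range lock resource name: "FL:<fsid>:<inode_id>:<start>:<end>".
fn byte_range_lock_name(fsid: u64, inode_id: u64, start: u64, end: u64) -> String {
    format!("FL:{fsid}:{inode_id}:{start}:{end}")
}

/// Whole-file flock() resource name: "FL:<fsid>:<inode_id>".
fn whole_file_lock_name(fsid: u64, inode_id: u64) -> String {
    format!("FL:{fsid}:{inode_id}")
}

fn main() {
    assert_eq!(byte_range_lock_name(7, 4660, 0, 4095), "FL:7:4660:0:4095");
    assert_eq!(whole_file_lock_name(7, 4660), "FL:7:4660");
}
```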
15.15.21 DLM as Foundation for UPFS Token Management¶
The DLM is designed to serve as the foundation for UPFS — UmkaOS's own GPFS-class clustered filesystem. GPFS's "token manager" — its core coordination mechanism — maps naturally onto the DLM's existing primitives:
| GPFS Token Concept | DLM Equivalent |
|---|---|
| Token types (data, metadata, layout, quota) | Resource name prefix: "D:", "M:", "L:", "Q:" + inode |
| Byte-range tokens | Byte-range lock resources: "D:inode:start:len" |
| Token revocation callback | Lease-based revocation (Section 15.15) — master sends targeted revocation to active holders |
| Downgrade callback (EX→PR) | Lock conversion (Section 15.15) with targeted writeback (Section 15.15) |
| Token batching (multi-resource) | Speculative multi-resource acquire (Section 15.15) |
| Lock Value Block for metadata piggybacking | LVB (Section 15.15) — 64-byte inline data attached to lock |
What the DLM provides natively that GPFS needs:
- Downgrade with targeted writeback. When a UPFS metadata server revokes a write token, the DLM's `EX → PR` conversion triggers `LockDirtyTracker`-based writeback of only modified pages within the lock's range (Section 15.15). This is directly equivalent to GPFS's "flush dirty data, then downgrade token" flow — and better than Linux DLM's BAST storms (Section 15.15).
- Lock Value Blocks for metadata piggybacking. GPFS piggybacks small metadata updates (inode timestamps, file sizes) on token grant/revoke messages to avoid separate metadata RPCs. The DLM's 64-byte LVB (Section 15.15) provides exactly this: the last EX holder writes updated metadata into the LVB on downgrade; the next PR holder reads it from the LVB without contacting the metadata server.
- Speculative multi-resource acquire. UPFS allocators need to lock a resource group (block allocation bitmap). With hundreds of resource groups, contention is common. The DLM's `lock_any_of()` primitive (Section 15.15) tries multiple resource groups in a single round-trip — same optimization GPFS uses for allocation.
- RDMA-native fast path. Uncontested token acquire via RDMA CAS (Section 15.15) at ~2 μs is competitive with GPFS's token manager latency. Contested path at ~5-8 μs (RDMA Send/Recv) matches GPFS's two-sided token path.
What UPFS builds on top (minimal — naming conventions only):
- Token type semantics. The DLM provides generic lock modes (NL, CR, CW, PR, PW, EX). UPFS defines which modes map to its token types: e.g., "data write token = EX on `D:inode:range`", "metadata read token = PR on `M:inode`". This is a resource name prefix convention, not a DLM change.
- Revocation handlers. UPFS registers `DlmRevocationHandler` implementations for each token type at lock acquire time. The DLM calls these handlers directly on revocation — no intermediate token layer needed:
  - Data token handler: targeted writeback → DSM invalidation → convert/release
  - Metadata token handler: update LVB with latest inode attrs → convert/release
  - Layout token handler: flush stripe map changes → release
  - Quota token handler: flush quota deltas to LVB → convert/release

  The DLM drives the entire revocation flow. The UPFS handlers are stateless functions that operate on the lock's associated state (`LockDirtyTracker`, `LockValueBlock`). No "token manager" object, no token state machine, no separate revocation protocol.
- Stripe-group coordination. When UPFS stripes a file across N storage servers, each stripe has independent data tokens. The UPFS client holds N byte-range data tokens (one per stripe) and submits I/O to N block exports in parallel. Coordination between stripes (e.g., extending a file across a stripe boundary) uses metadata tokens on the stripe map inode.
- Quota tokens. GPFS uses tokens for quota enforcement (user/group quota fragments cached on each node). Maps to DLM locks on quota resources with LVB carrying the cached quota values.
The "token layer" is essentially zero code. UPFS's token management
consists of:
1. A set of resource name prefix conventions ("D:", "M:", "L:", "Q:").
2. A set of DlmRevocationHandler implementations (one per token type,
~50-100 lines each).
3. Helper functions that call dlm_lock() with the right resource name
and handler.
There is no token manager object, no token state machine, no separate protocol, and no separate recovery mechanism. The DLM IS the token manager. This is the design goal: the DLM is not a "bolt-on lock service" that an UPFS wraps — it is the native token infrastructure.
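As a concrete illustration of point 1, the prefix conventions amount to a few formatting helpers (hypothetical names; UPFS's real helpers would also attach the matching revocation handler when calling `dlm_lock()`):

```rust
/// Data token: byte-range lock on "D:<inode>:<start>:<len>".
fn data_token_name(inode: u64, start: u64, len: u64) -> String {
    format!("D:{inode}:{start}:{len}")
}

/// Metadata token: whole-inode lock on "M:<inode>".
fn metadata_token_name(inode: u64) -> String {
    format!("M:{inode}")
}

/// Quota token: per-user quota fragment on "Q:<uid>".
fn quota_token_name(uid: u64) -> String {
    format!("Q:{uid}")
}

fn main() {
    assert_eq!(data_token_name(42, 0, 1048576), "D:42:0:1048576");
    assert_eq!(metadata_token_name(42), "M:42");
    assert_eq!(quota_token_name(1000), "Q:1000");
}
```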
15.15.22 DLM Wire Protocol¶
All DLM messages are carried over ClusterTransport
(Section 5.10) using the
standard ClusterMessageHeader (40 bytes, defined in
Section 5.2). DLM messages use
PeerMessageType::DlmOp as the message_type. The payload after
ClusterMessageHeader is a DlmMessageHeader followed by a message-type-specific
payload struct.
All multi-byte integers in wire structs use Le types (Section 6.1) for correct operation on mixed-endian clusters (PPC32, s390x are big-endian).
/// DLM message types. Carried in DlmMessageHeader.msg_type.
#[repr(u16)]
pub enum DlmMessageType {
/// Request a lock on a resource. Sent by requester to resource master.
LockRequest = 0x0001,
/// Grant a lock to a requester. Sent by master to requester.
LockGrant = 0x0002,
/// Convert an existing lock to a different mode. Sent to master.
LockConvert = 0x0003,
/// Confirm a conversion. Sent by master to holder.
LockConvertGrant = 0x0004,
/// Release a lock. Sent by holder to master.
LockRelease = 0x0005,
/// Revoke a lock (lease expiry or contention). Sent by master to holder.
LockRevocation = 0x0006,
/// Deadlock detection probe. Forwarded along wait-for edges.
DeadlockProbe = 0x0007,
/// Deadlock victim notification. Sent by detector to victim.
DeadlockVictim = 0x0008,
/// Look up the master node for a resource. Sent when the local hash
/// ring indicates a different master than expected (post-migration).
MasterLookup = 0x0009,
/// Reply to MasterLookup with the authoritative master node.
MasterLookupReply = 0x000A,
/// Transfer resource master state (granted/converting/waiting queues)
/// to a new master during membership change.
MasterTransfer = 0x000B,
/// Notify nodes that a resource's master has migrated.
MasterMigration = 0x000C,
/// Lease renewal (one-sided RDMA write of lease timestamp; this type
/// is used only on the TCP fallback path where one-sided is unavailable).
LeaseRenew = 0x000D,
/// Batch lock request (up to 64 locks in a single message).
LockBatch = 0x000E,
/// Read an LVB without acquiring a lock (TCP two-sided fallback).
/// Sent by requester to resource master when `supports_one_sided() == false`.
/// See [Section 15.15](#distributed-lock-manager--two-sided-lvb-read-fallback).
LvbReadRequest = 0x000F,
/// Response to LvbReadRequest. Contains the 64-byte LVB data + sequence.
LvbReadResponse = 0x0010,
}
/// DLM message header. Follows ClusterMessageHeader in every DLM message.
/// Total: 24 bytes.
///
/// Note: Le64 fields have alignment 1 (they are `#[repr(transparent)]` over
/// `[u8; 8]`), so `#[repr(C)]` layout would pack `lockspace_id` at offset 4
/// with no implicit padding, producing a 20-byte struct. We add explicit
/// `_pad` to ensure 8-byte alignment of the lockspace_id field on the wire,
/// which prevents deserialization bugs on architectures that trap on unaligned
/// access and makes the wire layout conventional (all 8-byte fields at 8-byte
/// offsets).
#[repr(C)]
pub struct DlmMessageHeader {
/// DLM message type (DlmMessageType as Le16).
pub msg_type: Le16,
/// Flags (reserved for future use). Must be zero on send.
pub flags: Le16,
/// Explicit padding to align lockspace_id at offset 8. Must be zeroed
/// on send to prevent information disclosure.
pub _pad: [u8; 4],
/// Lockspace ID. Identifies the lockspace context for this message.
/// Each lockspace has a unique 64-bit ID assigned at creation time.
pub lockspace_id: Le64,
/// Resource name hash (SipHash-2-4 of the full resource name).
/// Used for routing and shard selection on the receiver. The full
/// resource name is carried in the payload when needed (LockRequest,
/// MasterLookup) but omitted from compact messages (LockGrant,
/// LockRelease) where the hash suffices for lookup.
pub resource_hash: Le64,
}
const_assert!(size_of::<DlmMessageHeader>() == 24);
// Layout: msg_type(2) + flags(2) + _pad(4) + lockspace_id(8) + resource_hash(8) = 24 bytes.
// Total wire message: ClusterMessageHeader (40) + DlmMessageHeader (24) + payload.
/// LockRequest payload. Sent by requester to resource master.
/// Total: 48 bytes.
#[repr(C)]
pub struct DlmLockRequestPayload {
/// Requester's node-local lock ID. Unique within the requester node.
/// The master returns this ID in LockGrant for correlation.
pub lock_id: Le64,
/// Requested lock mode (LockMode as u8).
pub mode: u8,
/// Lock request flags.
pub flags: u8,
/// Padding to 8-byte boundary for requester_node alignment.
pub _pad: [u8; 6],
/// Requester's node ID (redundant with ClusterMessageHeader.node_id
/// but included for self-contained payload parsing).
pub requester_node: Le64,
/// Resource name length (bytes). The full resource name follows this
/// struct as a variable-length tail (max 256 bytes).
pub name_len: Le16,
/// Padding.
pub _pad2: [u8; 6],
/// Lease duration requested (nanoseconds). 0 = use lockspace default.
pub lease_ns: Le64,
/// DSM causal epoch for lock requests that coordinate with DSM
/// coherence. 0 = no DSM dependency. Non-zero values carry the
/// epoch component of the CausalStampWire, which is sufficient for
/// causal ordering between lock acquire and DSM page access. The
/// full CausalStampWire (epoch + dirty page bitmap) is variable-length
/// and cannot fit in a fixed-size field; the dirty bitmap is only
/// needed for page reconstruction and is sent in a separate
/// DsmReconstructRequest message when the lock holder needs to
/// reconstruct pages.
pub dsm_causal_epoch: Le64,
// Followed by: resource_name: [u8; name_len]
}
const_assert!(size_of::<DlmLockRequestPayload>() == 48);
/// LockGrant payload. Sent by master to requester.
/// Total: 24 bytes.
#[repr(C)]
pub struct DlmLockGrantPayload {
/// Lock ID from the original LockRequest.
pub lock_id: Le64,
/// Granted lock mode (may differ from requested if a conversion was
/// applied by the master's compatibility check).
pub granted_mode: u8,
/// Grant status: 0 = success, non-zero = error code.
pub status: u8,
/// Padding.
pub _pad: [u8; 2],
/// Master-assigned sequence number for this lock instance. Used for
/// CAS-word ABA prevention and lease tracking.
pub master_seq: Le32,
/// LVB data length attached to this grant (0-64 bytes). If non-zero,
/// the LVB data follows this struct.
pub lvb_len: Le32,
/// Padding.
pub _pad2: [u8; 4],
// Followed by: lvb_data: [u8; lvb_len] (0-64 bytes)
}
const_assert!(size_of::<DlmLockGrantPayload>() == 24);
/// LockConvert payload. Sent by holder to master.
/// Total: 28 bytes.
#[repr(C)]
pub struct DlmLockConvertPayload {
/// Lock ID of the lock to convert.
pub lock_id: Le64,
/// Current lock mode (for validation).
pub current_mode: u8,
/// Requested new mode.
pub new_mode: u8,
/// Padding.
pub _pad: [u8; 2],
/// LVB data length to write on downgrade (0-64). If converting from
/// EX/PW to a lower mode, the holder writes updated LVB data here.
pub lvb_len: Le32,
/// Padding.
pub _pad2: [u8; 4],
/// Causal consistency epoch from the converting node's CausalStampWire.
/// Le64 stores the epoch component only; full CausalStampWire
/// reconstruction is handled by DSM.
pub dsm_causal_epoch: Le64,
// Followed by: lvb_data: [u8; lvb_len] (0-64 bytes)
}
const_assert!(size_of::<DlmLockConvertPayload>() == 28);
/// LockRelease payload. Sent by holder to master.
/// Total: 24 bytes.
#[repr(C)]
pub struct DlmLockReleasePayload {
/// Lock ID to release.
pub lock_id: Le64,
/// LVB data length to write on release (0-64). EX/PW holders write
/// the final LVB on release so the next acquirer sees updated data.
pub lvb_len: Le32,
/// Padding.
pub _pad: [u8; 4],
/// Causal consistency epoch from the releasing node's CausalStampWire.
/// Le64 stores the epoch component only; full CausalStampWire
/// reconstruction is handled by DSM.
pub dsm_causal_epoch: Le64,
// Followed by: lvb_data: [u8; lvb_len] (0-64 bytes)
}
const_assert!(size_of::<DlmLockReleasePayload>() == 24);
/// LockRevocation payload. Sent by master to holder.
/// Total: 16 bytes.
#[repr(C)]
pub struct DlmLockRevocationPayload {
/// Lock ID to revoke.
pub lock_id: Le64,
    /// Requested downgrade mode. The holder should convert to this mode
    /// (or release entirely if mode == NL). The holder has until
    /// `deadline_ns` to comply before the master forcibly revokes.
    pub requested_mode: u8,
/// Padding.
pub _pad: [u8; 3],
/// Deadline (nanoseconds from message timestamp) by which the holder
/// must comply. After this deadline, the master force-revokes.
pub deadline_ns: Le32,
}
const_assert!(size_of::<DlmLockRevocationPayload>() == 16);
/// DeadlockProbe payload. Forwarded along wait-for graph edges.
/// Total: 40 bytes.
#[repr(C)]
pub struct DlmDeadlockProbePayload {
/// Originator of the probe (the node that started cycle detection).
pub origin_node: Le64,
/// Lock ID that initiated the probe (the waiting lock).
pub origin_lock_id: Le64,
/// Current probe depth (incremented at each hop). If depth exceeds
/// MAX_DEADLOCK_DEPTH (16), the probe is dropped (prevents infinite
/// forwarding in pathological wait-for graphs).
pub depth: Le32,
/// Padding.
pub _pad: [u8; 4],
/// Probe generation (from the origin node's monotonic counter).
/// Duplicate probes with the same (origin_node, probe_gen) are dropped.
pub probe_gen: Le64,
/// Number of WaiterId entries in the variable-length path that follows
/// this fixed payload. Each hop appends its local waiter to the path
/// before forwarding, enabling cycle reconstruction on detection.
pub path_len: Le32,
/// Padding.
pub _pad2: [u8; 4],
// Variable-length path follows the fixed payload: path_len × Le64
// (WaiterId values). Per-hop reconstruction: each node appends its
// local waiter to the path and forwards. When the probe returns to
// the origin_node, the complete cycle is the path array.
}
const_assert!(size_of::<DlmDeadlockProbePayload>() == 40);
/// Maximum deadlock probe depth before dropping. 16 hops covers any
/// realistic deadlock cycle in clustered filesystem workloads.
/// **Relationship to MAX_PROBE_PATH_LEN**: MAX_DEADLOCK_DEPTH (16) is the
/// wire protocol hop limit for `DlmDeadlockProbePayload.depth` — probes
/// exceeding this are dropped by the receiver. MAX_PROBE_PATH_LEN (32) is
/// the in-memory path array capacity for `DlmProbe.path`, which is used
/// in the gossip-based protocol. The wire limit is stricter because each
/// wire hop has RDMA latency cost; the in-memory limit is more generous
/// to accommodate fan-out in the wait-for graph reconstruction.
pub const MAX_DEADLOCK_DEPTH: u32 = 16;
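The receiver-side probe handling reduces to a depth check, a path append, and an origin comparison. A sketch with a hypothetical `ProbeAction` type (the real code additionally dedups on `(origin_node, probe_gen)` and appends WaiterIds rather than node IDs):

```rust
const MAX_DEADLOCK_DEPTH: u32 = 16;

#[derive(Debug, PartialEq)]
enum ProbeAction {
    Drop,                    // depth limit exceeded: stop forwarding
    CycleDetected(Vec<u64>), // probe returned to origin; path = cycle
    Forward(Vec<u64>),       // append local entry, forward along edge
}

/// Handle an incoming deadlock probe at one hop.
fn handle_probe(origin_node: u64, local_node: u64, depth: u32, mut path: Vec<u64>) -> ProbeAction {
    if depth >= MAX_DEADLOCK_DEPTH {
        return ProbeAction::Drop; // pathological graph: give up
    }
    path.push(local_node);
    if local_node == origin_node {
        ProbeAction::CycleDetected(path) // cycle reconstructed from path
    } else {
        ProbeAction::Forward(path)
    }
}

fn main() {
    // A probe from node 1 arrives back at node 1 after visiting 2 and 3.
    assert_eq!(
        handle_probe(1, 1, 3, vec![2, 3]),
        ProbeAction::CycleDetected(vec![2, 3, 1])
    );
    // Depth limit reached: the probe is dropped.
    assert_eq!(handle_probe(1, 4, 16, vec![]), ProbeAction::Drop);
}
```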
/// MasterLookup payload. Sent when a node needs to confirm or discover
/// the current master for a resource (e.g., after membership change).
/// Total: 8 bytes + variable name.
#[repr(C)]
pub struct DlmMasterLookupPayload {
/// Resource name length.
pub name_len: Le16,
/// Padding.
pub _pad: [u8; 6],
// Followed by: resource_name: [u8; name_len]
}
const_assert!(size_of::<DlmMasterLookupPayload>() == 8);
/// MasterLookupReply payload.
/// Total: 24 bytes.
#[repr(C)]
pub struct DlmMasterLookupReplyPayload {
/// Resource name hash (from the original lookup).
pub resource_hash: Le64,
/// Authoritative master node for this resource.
pub master_node: Le64,
/// Status: 0 = known master, 1 = resource not found (no active locks).
pub status: u8,
/// Padding.
pub _pad: [u8; 7],
}
const_assert!(size_of::<DlmMasterLookupReplyPayload>() == 24);
/// MasterTransfer payload. Sent from old master to new master during
/// membership change. Carries the full lock state for a resource.
/// Total: 16 bytes header + variable queues.
#[repr(C)]
pub struct DlmMasterTransferPayload {
/// Resource name hash.
pub resource_hash: Le64,
/// Resource name length.
pub name_len: Le16,
/// Number of entries in the granted queue (following the name).
pub granted_count: Le16,
/// Number of entries in the converting queue.
pub converting_count: Le16,
/// Number of entries in the waiting queue.
pub waiting_count: Le16,
// Followed by:
// resource_name: [u8; name_len]
// granted: [DlmQueueEntry; granted_count]
// converting: [DlmQueueEntry; converting_count]
// waiting: [DlmQueueEntry; waiting_count]
}
const_assert!(size_of::<DlmMasterTransferPayload>() == 16);
/// A lock queue entry for MasterTransfer wire format.
/// Total: 24 bytes.
#[repr(C)]
pub struct DlmQueueEntry {
/// Node ID of the lock holder/waiter.
pub node_id: Le64,
/// Node-local lock ID.
pub lock_id: Le64,
/// Lock mode (held or requested).
pub mode: u8,
/// Padding.
pub _pad: [u8; 7],
}
const_assert!(size_of::<DlmQueueEntry>() == 24);
/// LvbReadRequest payload. Sent by a node that wants to read an LVB without
/// acquiring a lock when `transport.supports_one_sided() == false` (TCP).
/// The master reads the LVB under its internal resource lock and responds
/// with `DlmLvbReadResponsePayload`.
/// Total: 8 bytes + variable name.
/// See [Section 15.15](#distributed-lock-manager--two-sided-lvb-read-fallback).
#[repr(C)]
pub struct DlmLvbReadRequestPayload {
/// Resource name length.
pub name_len: Le16,
/// Padding.
pub _pad: [u8; 6],
// Followed by: resource_name: [u8; name_len]
}
const_assert!(size_of::<DlmLvbReadRequestPayload>() == 8);
/// LvbReadResponse payload. Sent by the resource master in reply to
/// LvbReadRequest. Contains the full 64-byte LVB snapshot read under
/// the resource's `inner` SpinLock (guaranteed consistent — no double-read
/// needed by the receiver).
/// Total: 88 bytes.
/// Layout: resource_hash(8) + status(1) + _pad(3) + lvb_len(4) +
/// rotation_epoch(8) + lvb_data(64) = 88.
#[repr(C)]
pub struct DlmLvbReadResponsePayload {
/// Resource name hash (for correlation with the request).
pub resource_hash: Le64,
/// Status: 0 = success (LVB data valid), 1 = resource not found,
/// 2 = LVB invalid (INVALID sentinel, needs disk refresh).
pub status: u8,
/// Padding.
pub _pad: [u8; 3],
/// LVB data length (0-64). 0 if the resource has no LVB or the
/// LVB is in INVALID state (status == 2).
pub lvb_len: Le32,
/// Rotation epoch — monotonically increasing counter incremented each
/// time the LVB sequence counter is rotated (reset to 0). Lockless
/// `read_lvb()` callers MUST compare this with their cached
/// `rotation_epoch` before using sequence comparison for ordering.
/// If epochs differ, the cached sequence is stale across a rotation
/// boundary and must be discarded. See "Rotation safety for lockless
/// `read_lvb()` callers" above.
pub rotation_epoch: Le64,
/// LVB data (56 bytes of application data + 8 bytes sequence counter).
/// Only the first `lvb_len` bytes are meaningful.
pub lvb_data: [u8; 64],
}
const_assert!(size_of::<DlmLvbReadResponsePayload>() == 88);
Wire message size summary:
| Message Type | Header | Payload (fixed) | Variable | Typical Total |
|---|---|---|---|---|
| LockRequest | 64 | 48 | 0-256 (name) | ~112-368 bytes |
| LockGrant | 64 | 24 | 0-64 (LVB) | ~88-152 bytes |
| LockConvert | 64 | 28 | 0-64 (LVB) | ~92-156 bytes |
| LockRelease | 64 | 24 | 0-64 (LVB) | ~88-152 bytes |
| LockRevocation | 64 | 16 | 0 | 80 bytes |
| DeadlockProbe | 64 | 40 | N*8 (path) | ~104+ bytes |
| MasterLookup | 64 | 8 | 0-256 (name) | ~72-328 bytes |
| MasterTransfer | 64 | 16 | name + queues | variable |
| LockBatch | 64 | 8 (count) | N * 48 | ~120-3144 bytes |
| LvbReadRequest | 64 | 8 | 0-256 (name) | ~72-328 bytes |
| LvbReadResponse | 64 | 88 | 0 | 152 bytes |
Header = ClusterMessageHeader (40) + DlmMessageHeader (24) = 64 bytes.
Batch messages: LockBatch carries up to 64 DlmLockRequestPayload entries in a
single wire message, grouped by destination master. The master processes all entries
atomically and returns a single batch reply with per-lock grant/reject status. This
reduces RDMA round-trips for operations like rename() (3 locks) and GFS2 resource
group allocation (8+ locks). The batch reply uses DlmMessageType::LockGrant with
the batch flag set in DlmMessageHeader.flags.
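The grouping step described above (coalesce pending lock requests by destination master, then split into wire batches of at most 64 entries) can be sketched as follows. This is an illustrative sketch, not the actual implementation: LockRequest, batch_by_master, and the use of a node-ID key are stand-ins for the real DLM types in this chapter.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for DlmLockRequestPayload plus its resource name.
#[derive(Debug, Clone, PartialEq)]
struct LockRequest {
    resource: String,
    mode: u8,
}

/// LockBatch carries up to 64 entries per wire message (per the text above).
const MAX_BATCH_ENTRIES: usize = 64;

/// Group pending lock requests by destination master node, then split each
/// group into batches of at most MAX_BATCH_ENTRIES. A rename() (3 locks) or
/// a GFS2 resource-group allocation (8+ locks) thus needs one wire message
/// per distinct master instead of one per lock.
fn batch_by_master(
    pending: Vec<(u32 /* master node id */, LockRequest)>,
) -> BTreeMap<u32, Vec<Vec<LockRequest>>> {
    let mut grouped: BTreeMap<u32, Vec<LockRequest>> = BTreeMap::new();
    for (master, req) in pending {
        grouped.entry(master).or_default().push(req);
    }
    grouped
        .into_iter()
        .map(|(master, reqs)| {
            // Split oversized groups across multiple LockBatch messages.
            let batches = reqs
                .chunks(MAX_BATCH_ENTRIES)
                .map(|c| c.to_vec())
                .collect();
            (master, batches)
        })
        .collect()
}
```

With 70 requests targeting one master, this yields two batches (64 + 6) for that master, so the 64-entry wire limit is respected while still minimizing round-trips.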
15.16 Persistent Memory¶
15.16.1 The Hardware¶
CXL-attached persistent memory is coming (Samsung CMM-H with NAND-backed persistence via CXL GPF, SK Hynix). Also: battery-backed DRAM (NVDIMM-N) for enterprise storage. The model: byte-addressable memory that survives power loss.
15.16.2 Design: DAX (Direct Access) Integration¶
// umka-core/src/mem/persistent.rs
/// Persistent memory region descriptor.
pub struct PersistentMemoryRegion {
/// Physical address range.
pub base: PhysAddr,
pub size: u64,
/// NUMA node this persistent memory is attached to.
pub numa_node: u16,
/// Technology type (affects performance characteristics).
pub tech: PmemTechnology,
/// Is this region backed by a filesystem (DAX mode)?
pub dax_device: Option<DeviceNodeId>,
}
#[repr(u32)]
pub enum PmemTechnology {
/// Intel Optane / 3D XPoint (legacy, for existing deployments).
Optane = 0,
/// CXL-attached persistent memory.
CxlPersistent = 1,
/// Battery-backed DRAM (NVDIMM-N).
BatteryBacked = 2,
}
15.16.3 Memory-Mapped Persistent Storage¶
When a filesystem on persistent memory is mounted with DAX:
Standard file I/O (non-DAX):
read() → VFS → page cache → memcpy to userspace
write() → VFS → page cache → writeback → storage device
DAX file I/O:
read() → VFS → mmap directly to persistent memory → load instruction
write() → VFS → store instruction → persistent memory
No page cache. No copies. No writeback.
CPU load/store talks directly to persistent media.
AS_DAX inode initialization: The AS_DAX flag is set per-inode during inode
initialization when: (1) the filesystem is mounted with -o dax=always (all regular
file inodes), or (2) the filesystem is mounted with -o dax=inode and the inode has
the FS_DAX_FL persistent attribute (set via ioctl(FS_IOC_SETFLAGS) or chattr +x).
The flag is set in the filesystem's iget()/alloc_inode() path by calling
mapping_set_dax(inode.i_mapping), which sets bit AS_DAX in AddressSpace.flags.
Once set, the flag is immutable for the lifetime of the inode in memory.
The memory manager must handle persistent pages differently:
- Persistent pages are NOT evictable (they ARE the storage)
- fsync() → CPU cache flush (CLWB/CLFLUSH) not block I/O
- MAP_SYNC flag ensures metadata (file size, timestamps) is also persistent
- Crash consistency: partial writes are visible after reboot (see Section 15.16.4)
15.16.4 Crash Consistency Protocol¶
Persistent memory stores survive power loss, but CPU caches do not. Without explicit cache flushing, writes to persistent memory may be reordered or lost in the CPU write-back cache. The kernel must enforce a strict persistence protocol:
Persistence primitives (x86):
CLWB addr — Write-back cache line, leave line CLEAN but VALID in cache.
(Preferred: no performance penalty on subsequent reads.)
CLFLUSHOPT addr — Flush cache line, INVALIDATE from cache.
(Legacy: forces re-fetch on next read.)
SFENCE — Store fence. Guarantees all preceding CLWB/CLFLUSHOPT
have reached the persistence domain (ADR/eADR boundary).
Correct write sequence for persistent data:
1. Store data to persistent memory region (mov/memcpy)
2. CLWB for each modified cache line (64 bytes each)
3. SFENCE ← data is now durable
4. Store metadata update (e.g., committed flag, log tail pointer)
5. CLWB for metadata cache line(s)
6. SFENCE ← metadata is now durable (atomically marks data as committed)
ARM equivalent:
DC CVAP addr — Clean data cache to Point of Persistence (ARMv8.2+)
DSB — Data Synchronization Barrier
fsync() on a DAX-mounted filesystem translates to cache writeback + store fence
(not block I/O). msync(MS_SYNC) on DAX mappings follows the same path. The
kernel provides pmem_flush() and pmem_drain() helpers that abstract the
architecture-specific instructions.
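The six-step write sequence above can be sketched in Rust with the pmem_flush()/pmem_drain() helpers mentioned in the text stubbed out as no-ops (their real bodies issue CLWB/SFENCE on x86 or DC CVAP/DSB on ARM). LogRecord and commit_record are illustrative names, not part of the specified API; the point is the ordering: data flush + fence first, then the commit flag flush + fence.

```rust
const CACHE_LINE: usize = 64;

/// Stub: real implementation issues CLWB (x86) / DC CVAP (ARM)
/// for each cache line in [addr, addr + len).
fn pmem_flush(_addr: *const u8, _len: usize) {}

/// Stub: real implementation issues SFENCE (x86) / DSB (ARM).
fn pmem_drain() {}

/// A log record: payload plus a committed flag. The flag is written only
/// after the payload is durable, so a crash can never expose a committed
/// record with torn payload data.
#[repr(C)]
struct LogRecord {
    payload: [u8; 56],
    committed: u8,
}

fn commit_record(rec: &mut LogRecord, data: &[u8; 56]) {
    // (1) Store data to the persistent region.
    rec.payload = *data;
    // (2) Write back the payload's cache lines, (3) fence: data is durable.
    pmem_flush(rec.payload.as_ptr(), rec.payload.len());
    pmem_drain();
    // (4) Store the metadata update (commit flag) only after the fence.
    rec.committed = 1;
    // (5) Write back the flag's cache line, (6) fence: record is committed.
    pmem_flush(&rec.committed as *const u8, CACHE_LINE);
    pmem_drain();
}
```

The second fence is what makes the commit flag an atomic commit point: recovery code trusts the payload only if committed is set.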
15.16.4.1 DAX fsync Path¶
When fsync(fd) is called on a DAX file (AS_DAX set in
AddressSpace.flags, Section 14.1), the
standard page-cache writeback path is bypassed entirely. Instead, the VFS
dispatches to dax_fsync():
fsync(fd)
→ sys_fsync()
→ vfs_fsync_range(file, 0, LLONG_MAX, datasync=false)
→ file.f_op.fsync(file, start, end, datasync)
→ dax_fsync(file, start, end) // DAX-specific path
dax_fsync() implementation:
/// Flush dirty DAX mappings to persistent media.
/// Called instead of the standard filemap_write_and_wait_range() path
/// for DAX files. There are no page cache pages — data lives directly
/// in persistent memory. The only action needed is to flush CPU caches
/// for any dirty ranges.
fn dax_fsync(
file: &File,
start: i64,
end: i64,
) -> Result<(), IoError> {
let mapping = &file.inode.i_mapping;
// (1) Check for DAX hardware errors (MCE/SEA).
// Compare file.f_dax_err (AtomicU32, snapshot at open time) with
// mapping.dax_err. If different, report -EIO.
// f_dax_err is AtomicU32 for interior mutability (dax_fsync takes &File).
if file.f_dax_err.load(Acquire) != mapping.dax_err.load(Acquire) {
file.f_dax_err.store(mapping.dax_err.load(Acquire), Release);
return Err(IoError::EIO);
}
// (2) Walk dirty DAX mappings in [start, end] and flush
// CPU caches to the persistence domain.
dax_writeback_range(mapping, start, end)?;
Ok(())
}
/// Walk all DAX mappings in the given range that have been dirtied
/// (PTE dirty bit set) and issue architecture-specific cache writeback
/// instructions to push data from CPU caches to persistent media.
///
/// This is the DAX equivalent of filemap_write_and_wait_range() for
/// page-cache-backed files.
fn dax_writeback_range(
mapping: &AddressSpace,
start: i64,
end: i64,
) -> Result<(), IoError> {
// Defensive guard: the VFS should never pass negative values, but
// a programming error upstream would cause catastrophic behavior
// (wrapping u64 creating an infinite-length range).
if start < 0 || end < 0 || start > end {
return Err(IoError::EINVAL);
}
// Walk the filesystem's iomap to find the physical addresses
// backing the dirty range.
let mut pos = start as u64;
while pos <= end as u64 {
let iomap = mapping.inode.i_op.iomap_begin(
pos,
(end as u64 + 1).saturating_sub(pos),
AccessType::Read,
)?;
match iomap.kind {
IomapKind::Mapped { phys_addr } => {
let len = core::cmp::min(iomap.length, (end as u64 + 1) - pos);
// Flush CPU caches for this physical range.
arch_wb_cache_pmem(phys_addr, len);
}
IomapKind::Hole | IomapKind::Unwritten => {
// No data to flush.
}
}
pos += iomap.length;
}
// Issue a store fence to ensure all cache writebacks have reached
// the persistence domain before returning.
arch_pmem_drain();
Ok(())
}
arch_wb_cache_pmem(addr, len) — per-architecture cache writeback:
| Architecture | Instructions | Notes |
|---|---|---|
| x86-64 | CLWB for each 64-byte cache line in [addr, addr+len) | CLWB preferred (leaves line clean but valid in cache). Falls back to CLFLUSHOPT on CPUs without CLWB (pre-Skylake). Trailing fence is exclusively arch_pmem_drain(). |
| AArch64 | DC CVAP for each 64-byte cache line in [addr, addr+len) | DC CVAP = Clean to Point of Persistence (ARMv8.2-A DPB feature). Falls back to DC CVAC (clean to Point of Coherency) on CPUs without DPB. |
| RISC-V | cbo.flush for each cache line (Zicbom extension) | Without Zicbom, RISC-V has no cache management instructions — persistent memory requires explicit fence only (assumes platform guarantees writeback on fence). |
| PPC64LE | dcbst for each cache line | dcbst forces writeback of the specified cache block. Trailing fence deferred to arch_pmem_drain() to avoid redundant fences when flushing multiple extents. |
| ARMv7 | DCCMVAC (MCR p15, 0, Rd, c7, c10, 1) for each cache line | Clean D-cache by MVA to Point of Coherency. No Point of Persistence concept in ARMv7; relies on platform-level battery backup. |
| PPC32 | dcbst for each cache line | Same as PPC64LE. |
| s390x | PFP (Perform Frame Management Function) or store + BCR serialization | z/Architecture uses channel I/O for persistent storage; CPU cache is write-through to SCM via firmware-managed paths. |
| LoongArch64 | cacop 0x19 (D-cache writeback + invalidate) for each cache line | LoongArch cache operations use cacop instructions with type/level encoding. |
arch_pmem_drain() — per-architecture store fence:
| Architecture | Instruction | Semantics |
|---|---|---|
| x86-64 | SFENCE | Guarantees all preceding CLWB/CLFLUSHOPT have reached the persistence domain (ADR/eADR boundary). |
| AArch64 | DSB SY | Data Synchronization Barrier — all prior DC CVAP operations complete before subsequent memory accesses. |
| RISC-V | fence w, w | Write-write ordering fence. Ensures all prior stores (including cbo.flush) are visible. |
| PPC64LE | sync | Heavyweight barrier. All prior dcbst operations reach the persistence domain. |
| ARMv7 | DSB | Data Synchronization Barrier — all prior cache maintenance operations complete. |
| PPC32 | sync | Same as PPC64LE. |
| s390x | BCR 14,0 | Serialization instruction. All prior store operations reach the persistence domain. |
| LoongArch64 | dbar 0 | Full barrier (dbar 0x00). All prior cache operations complete before subsequent accesses. |
Why the page cache path does not apply: DAX files have page_cache: None
in their AddressSpace (Section 14.1). There are no dirty page
cache pages to write back. Data was written directly to persistent memory via
CPU store instructions (through the DAX mapping). The only thing that might
not be persistent is data sitting in CPU write-back caches. dax_fsync() +
arch_wb_cache_pmem() flushes exactly those cache lines.
15.16.5 PMEM Error Handling¶
Persistent memory is physical media and can develop errors (bit rot, wear-out,
manufacturing defects). The error model mirrors Linux badblocks:
Error sources:
1. UCE (Uncorrectable Error) — MCE (Machine Check Exception) on x86,
SEA (Synchronous External Abort) on ARM.
CPU receives #MC / abort when reading a poisoned cache line.
2. ARS (Address Range Scrub) — ACPI background scan discovers latent
errors before they're read. Results reported via ACPI NFIT.
3. CXL Media Error — CXL 3.0 devices report media errors via CXL
event log (Get Event Records command).
Kernel response:
MCE/SEA on PMEM page:
1. Mark physical page as HWPoison (same as DRAM MCE path).
2. Add to per-region badblocks list.
3. If a process has the page mapped:
a. DAX mapping → deliver SIGBUS (BUS_MCEERR_AR) with fault address.
b. Process can handle SIGBUS and skip/retry the corrupted region.
4. Filesystem (ext4/xfs DAX) is notified via dax_notify_failure().
Filesystem marks affected file range as damaged.
ARS/CXL background error:
1. ACPI notification or CXL event interrupt.
2. Add to badblocks list.
3. If mapped: deliver SIGBUS (BUS_MCEERR_AO — action optional).
4. Userspace can query badblocks via /sys/block/pmemN/badblocks.
15.16.6 Integration with Memory Tiers¶
Persistent memory becomes another level in the memory hierarchy. Note: the "Memory Level" numbering below refers to the memory distance hierarchy, NOT the driver isolation tiers (Tier 0/1/2) used elsewhere in this architecture.
Existing memory levels (see numa-topology-and-policy, dsm-global-memory-pool):
Level 0: Per-CPU caches
Level 1: Local DRAM
Level 2: Remote DRAM (cross-socket)
Level 3: CXL pooled memory
...
Extended:
Level N: Persistent memory (CXL-attached or NVDIMM)
Properties:
- Byte-addressable (like DRAM)
- Survives power loss (like storage)
- Higher latency than DRAM (~200-500ns vs ~80ns)
- Lower bandwidth than DRAM
- Cannot be evicted (it IS the backing store)
15.16.7 Linux Compatibility¶
Linux persistent memory interfaces are preserved:
/dev/pmem0, /dev/pmem1: Block device interface (libnvdimm)
/dev/dax0.0, /dev/dax1.0: Character DAX device (devdax)
mount -o dax /dev/pmem0 /mnt: DAX-mounted filesystem
mmap() with MAP_SYNC: Guaranteed persistence of metadata
Optane Discontinuation Note:
Intel discontinued Optane persistent memory products in 2022. The persistent memory
design in this section is hardware-agnostic — it applies to any byte-addressable
persistent medium. CXL 3.0 Type 3 devices with persistence (battery-backed or
inherently persistent media) are the expected successor. The PmemTechnology enum
includes CxlPersistent for this reason. The DAX path, cache flush protocol, and
error handling are technology-independent.
PMEM Namespace Discovery:
Persistent memory regions are discovered via:
- ACPI NFIT (NVDIMM Firmware Interface Table): For NVDIMM-N and legacy Optane. The NFIT describes each PMEM region's physical address range, interleave set, and health status.
- CXL DVSEC (Designated Vendor-Specific Extended Capability): For CXL-attached
persistent memory. CXL devices advertise memory regions via PCIe DVSEC structures.
The kernel's CXL driver enumerates regions and creates
/dev/daxN.M device nodes.
- Namespace management: Regions are partitioned into namespaces via ndctl (userspace tool) using the Linux-compatible namespace management ioctl interface. UmkaOS implements the same ioctls via umka-sysapi.
15.16.8 Performance Impact¶
Zero overhead for systems without persistent memory. When persistent memory is present, DAX I/O is faster than standard I/O because it eliminates page cache copies and writeback entirely.
15.16.9 Filesystem Repair and Consistency Checking¶
Filesystem repair (fsck, xfs_repair, btrfs check) is handled by existing Linux
userspace utilities running against UmkaOS's block device interface. UmkaOS does not
implement in-kernel repair paths — the standard Linux repair tools are unmodified
userspace binaries that interact with block devices via standard syscalls (open,
read, write, ioctl). Since UmkaOS implements the complete block device interface
(Section 15.2) and the relevant filesystem syscalls (Section 19.1), these tools work
unchanged:
- e2fsck / fsck.ext4 for ext4 repair
- xfs_repair for XFS repair
- btrfs check / btrfs scrub for btrfs repair (btrfs scrub runs online)
- ZFS self-heals via block-level checksums (Section 15.10); zpool scrub is the equivalent of fsck for ZFS
No kernel-side changes are needed to support these tools. The only UmkaOS-specific
consideration is that filesystem drivers should expose consistent BLKFLSBUF and
BLKRRPART ioctl behavior matching Linux, as some repair tools use these to
synchronize cache state.
15.16.10 SCSI-3 Persistent Reservations¶
SCSI-3 Persistent Reservations (PR) are required for shared-storage cluster fencing (Section 15.14). UmkaOS's block I/O layer implements the following PR commands as ioctls on block devices:
- PR_REGISTER / PR_REGISTER_AND_IGNORE: register a reservation key with the storage target. Each node registers a unique key (derived from node ID).
- PR_RESERVE: acquire a reservation (Write Exclusive, Exclusive Access, or their "Registrants Only" variants).
- PR_RELEASE: release a held reservation.
- PR_CLEAR: clear all registrations and reservations.
- PR_PREEMPT / PR_PREEMPT_AND_ABORT: preempt another node's reservation (used for fencing — a surviving node preempts the fenced node's key).
These map to SCSI PR IN / PR OUT commands (SPC-4) for SCSI/SAS devices and to NVMe
Reservation Register/Acquire/Release/Report commands for NVMe devices. The block
layer translates between the common ioctl interface and the device-specific command
set. The fencing integration with Section 5.8's membership protocol uses
PR_PREEMPT_AND_ABORT to revoke a dead node's storage access before recovering its
DLM locks.
15.17 Computational Storage¶
15.17.1 Problem¶
NVMe Computational Storage Devices (CSDs) can run compute on the storage device: filter, aggregate, search, compress — without moving data to the host CPU.
15.17.2 Design: CSD as AccelBase Device¶
A CSD naturally fits the accelerator framework (Section 22.1). It's a device with local memory (flash) and compute capability (embedded processor):
// Extends AccelDeviceClass (Section 22.1)
#[repr(u32)]
pub enum AccelDeviceClass {
Gpu = 0,
GpuCompute = 1,
Npu = 2,
Tpu = 3,
Fpga = 4,
Dsp = 5,
MediaProcessor = 6,
/// Computational Storage Device.
/// "Local memory" = flash storage on the device.
/// "Compute" = embedded processor running submitted programs.
ComputeStorage = 7,
Other = 255,
}
Note: The AccelDeviceClass enum is canonically defined in Section 22.1 (11-accelerators.md). The ComputeStorage variant (value 7) must be added to the canonical definition to support computational storage devices.
15.17.3 CSD Command Submission¶
Standard NVMe read (move data to compute):
Host CPU ← 1 TB data ← NVMe SSD
Host CPU processes 1 TB → produces 1 MB result
Total data moved: 1 TB
CSD compute (move compute to data):
Host CPU → submit "grep pattern" → CSD
CSD processes 1 TB internally → produces 1 MB result
Host CPU ← 1 MB ← CSD
Total data moved: 1 MB (1000x reduction)
The CSD accepts commands via the AccelBase vtable:
- create_context: allocate CSD execution context
- submit_commands: submit a compute program (filter, aggregate, map, etc.)
- poll_completion: check if computation is done
- Results returned via DMA to host memory
15.17.4 CSD Security Model¶
CSDs run arbitrary compute programs on the device's embedded processor. The kernel must enforce access boundaries:
Capability-gated namespace access:
1. Each NVMe namespace has an owner (cgroup or capability).
2. CSD compute programs can ONLY access namespaces granted to
the submitting process's capability set.
3. Cross-namespace access (e.g., join across two datasets on
different namespaces) requires capabilities for BOTH namespaces.
4. The CSD driver enforces this BEFORE submitting to hardware
via the NVMe Computational Storage command set.
Program validation:
- CSD programs are opaque to the kernel (device-specific bytecode).
- The kernel does NOT inspect or validate program contents.
- Trust boundary: the NVMe device enforces isolation between
namespaces at the hardware level (NVMe namespace isolation).
- If the CSD hardware lacks namespace isolation, the kernel
treats the device as single-tenant (only one cgroup at a time).
DMA buffer isolation:
- Result DMA buffers are allocated from the submitting process's
address space (via IOMMU-mapped regions, same as GPU DMA).
- CSD cannot DMA to arbitrary host memory — IOMMU enforces this.
CSD Program Validation and IOMMU Enforcement:
Before submitting a CSD program to a device, the kernel performs:
1. IOMMU domain restriction: The CSD device is placed in an isolated IOMMU domain (one per process/namespace submitting CSD work). The IOMMU mapping for the CSD domain is restricted to:
- The input data region(s) specified in the submission descriptor.
- The output data region(s) specified in the submission descriptor.
- The program binary itself (if stored in a device-accessible region).
Any attempt by the CSD device to DMA outside these regions raises an IOMMU fault, which terminates the CSD operation and returns EPERM to the submitting process.
2. Capability check: CSD program submission requires CAP_ACCEL_SUBMIT (Section 9.2) on the CSD device's capability object. Programs submitted via a cgroup with storage quota enforcement additionally require that the submission's estimated compute units do not exceed the cgroup's CSD budget.
3. Program opaqueness vs. DMA opaqueness: The program logic is opaque to the kernel (vendor-specific bytecode). However, the DMA access pattern is NOT opaque: the IOMMU enforces that the device can only DMA to the addresses explicitly listed in the submission. The program cannot expand its DMA scope at runtime.
4. Namespace isolation: Each process namespace maps to a distinct IOMMU domain. Programs from process A cannot access data mapped into process B's CSD domain. Shared CSD regions (for cooperative workloads) require an explicit capability grant from process B to process A (Section 9.1 capability delegation) and a corresponding IOMMU mapping shared between the two domains.
5. Program signing (optional policy): Operators can configure CSD device policies to reject programs without a valid signature (csd_policy: require_signed = true). The signature is checked against the system's IMA policy (Section 9.5). Unsigned programs return EKEYREJECTED.
15.17.5 CSD Error Handling¶
Error scenarios and kernel response:
Timeout (program runs too long):
1. CSD command timeout (default: 300s, configurable via AccelBase).
300s default accommodates long-running CSD programs (e.g., full-scan
compression, dedup over multi-TB datasets). Short timeouts can be set
per-submission via AccelSubmitParams.timeout_ns for latency-sensitive ops.
2. Kernel sends NVMe Abort command for the specific command ID.
3. Returns -ETIMEDOUT to the submitting process.
4. If abort fails: NVMe controller reset (same path as NVMe I/O timeout).
Hardware error (device reports failure):
1. CSD returns NVMe status code (e.g., Internal Error, Data Transfer Error).
2. Kernel maps to errno: -EIO for hardware faults, -ENOMEM for device
memory exhaustion, -EINVAL for malformed programs.
3. Error counter incremented in /sys/class/accel/csdN/errors.
4. If error rate exceeds threshold: driver marks device degraded,
stops accepting new submissions, notifies userspace via udev event.
Device reset:
1. NVMe controller reset via PCIe FLR (Function Level Reset).
2. All in-flight CSD commands are failed with -EIO.
3. Contexts are invalidated; processes must re-create them.
4. Same recovery path as standard NVMe timeout handling in Linux.
15.17.6 Linux Compatibility¶
NVMe Computational Storage is defined in separate NVMe technical proposals — primarily
TP 4091 (Computational Programs) and TP 4131 (Subsystem Local Memory) — not in
the NVMe 2.0 base specification. These TPs define the Computational Programs I/O
command set and the Subsystem Local Memory I/O command set as independent command sets
within the NVMe 2.0 specification library architecture (which separates base spec,
command set specs, and transport specs into distinct documents). Linux support is
emerging (/dev/ngXnY namespace devices). UmkaOS supports the same device files and
NVMe ioctls through umka-sysapi.
CSD Programming Model:
CSD programs are opaque command buffers — the kernel does not interpret or compile them. The programming model:
- Vendor SDK in userspace: Each CSD vendor provides a userspace SDK that compiles programs for their embedded processor (e.g., Samsung SmartSSD SDK, ScaleFlux CSD SDK).
- NVMe TP 4091 (Computational Programs): The NVMe technical proposal defines a standard command set for managing computational programs on CSDs. Programs are uploaded via NVMe admin commands and executed via NVMe I/O commands.
- Kernel role: The kernel manages namespace access (capability-gated), DMA buffer allocation (IOMMU-protected), command timeout enforcement, and error reporting. The kernel does NOT validate program correctness — that is the vendor SDK's responsibility.
CSD Data Affinity:
For workloads that benefit from computational storage, data should be placed on the CSD's local namespaces:
- Filesystem-level routing: Mount a CSD-backed filesystem and place data files on it. CSD compute programs access data locally (no PCIe transfer).
- Cgroup hint: the csd.preferred_device cgroup knob suggests which CSD device should be preferred for new file allocations within that cgroup's processes. Advisory only — the filesystem makes the final placement decision.
- Explicit placement: Applications using O_DIRECT + the NVMe passthrough interface can target specific CSD namespaces directly.
15.17.7 Performance Impact¶
CSD offload reduces host CPU usage and PCIe bandwidth consumption. Performance improves for data-heavy workloads. Zero overhead when CSDs are not present.
15.18 I/O Priority and Scheduling¶
UmkaOS implements per-task I/O priority with full Linux ioprio_set/ioprio_get syscall
compatibility. The UmkaOS I/O scheduler (MQPA — Multi-Queue Priority-Aware) is a unified
implementation that replaces the Linux family of pluggable schedulers (CFQ, mq-deadline,
BFQ, kyber) with a single, purpose-built scheduler that is correct, composable, and
integrates natively with NVMe multi-queue hardware.
15.18.1 Syscall Interface¶
ioprio_set(which: i32, who: i32, ioprio: i32) -> 0 | -EINVAL | -EPERM | -ESRCH
ioprio_get(which: i32, who: i32) -> ioprio: i32 | -EINVAL | -EPERM | -ESRCH
Syscall numbers (x86-64): ioprio_set = 251, ioprio_get = 252.
Syscall numbers (i386 compat): ioprio_set = 289, ioprio_get = 290.
Syscall numbers (AArch64): ioprio_set = 30, ioprio_get = 31.
which argument — target scope:
| Constant | Value | Meaning |
|---|---|---|
| IOPRIO_WHO_PROCESS | 1 | Single process or thread identified by who (PID/TID). If who = 0, the calling thread. |
| IOPRIO_WHO_PGRP | 2 | All processes in the process group identified by who. If who = 0, the caller's process group. |
| IOPRIO_WHO_USER | 3 | All processes whose real UID matches who. |
ioprio_get with PGRP/USER: When multiple processes match, returns the highest priority
found: RT > BE > Idle, and within the same class, the numerically lowest level (0 = highest).
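This aggregation rule can be sketched directly: because the class values rank RT (1) < BE (2) < Idle (3) and lower levels mean higher priority, a plain lexicographic minimum over (class, level) pairs implements it. ioprio_best is an illustrative helper name, and the sketch assumes class None has already been resolved to an effective BE level (Section 15.18.3) before aggregation.

```rust
/// Pick the highest effective I/O priority among matched tasks, per the
/// rule above: RT > BE > Idle, and within a class the numerically lowest
/// level wins. Assumes (class, level) pairs with class in {1 = RT,
/// 2 = BE, 3 = Idle}; None-class tasks must be resolved to an effective
/// BE level first.
fn ioprio_best(candidates: &[(u8 /* class */, u8 /* level */)]) -> Option<(u8, u8)> {
    // Class encoding already orders RT(1) < BE(2) < Idle(3), so the
    // lexicographic tuple minimum is exactly the stated rule.
    candidates.iter().copied().min()
}
```

For example, among a BE-level-4 task, an RT-level-7 task, and an Idle task, the RT task's priority is returned even though its level number is the largest.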
Error conditions:
| Error | Condition |
|---|---|
| EINVAL | which is not one of the three valid values; ioprio encodes an invalid class (> 3) or level (> 7); the level is non-zero for IoSchedClass::Idle. |
| EPERM | Caller lacks CAP_SYS_ADMIN when setting the RT class; caller lacks CAP_SYS_NICE when setting another user's tasks. |
| ESRCH | No process matching the given which/who combination was found. |
15.18.2 IoPriority Encoding¶
The ioprio value is a 16-bit quantity passed as a 32-bit int (upper 16 bits must be zero).
The bit layout is identical to Linux's <linux/ioprio.h>:
bits 15-13: I/O scheduling class (3 bits)
bits 12-3: Priority hint (10 bits; used for SCSI command duration limits)
bits 2-0: Priority level within the class (3 bits; values 0-7)
The 13-bit "data" field (bits 12-0, accessed via IOPRIO_PRIO_DATA()) is further
split into a 10-bit hint and a 3-bit level. Linux added the hint sub-field in 6.0
for SCSI command duration limit hints (IOPRIO_HINT_DEV_DURATION_LIMIT_*).
/// Per-task I/O priority. Wire-compatible with Linux `ioprio` values.
///
/// Bit layout (little-endian u16):
/// [15:13] = IoSchedClass (3 bits)
/// [12:3] = hint (10 bits; IOPRIO_HINT_* values)
/// [2:0] = level (3 bits; values 0–7; 0 = highest priority)
///
/// The `IOPRIO_PRIO_DATA(ioprio)` macro returns bits [12:0] (hint + level combined).
/// UmkaOS exposes separate `hint()` and `level()` accessors for clarity.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct IoPriority(u16);
impl IoPriority {
/// Construct an `IoPriority` from class, hint, and level.
///
/// `level` must be in 0..=7. `hint` must be in 0..=0x3ff (10 bits).
/// Callers should validate before constructing.
pub const fn new(class: IoSchedClass, hint: u16, level: u8) -> Self {
IoPriority(
((class as u16) << 13)
| ((hint & 0x3ff) << 3)
| (level as u16 & 0x7)
)
}
/// Decode the scheduling class from the encoded value.
pub fn class(self) -> IoSchedClass {
match (self.0 >> 13) & 0x7 {
0 => IoSchedClass::None,
1 => IoSchedClass::RealTime,
2 => IoSchedClass::BestEffort,
3 => IoSchedClass::Idle,
_ => IoSchedClass::None, // bits 4-7 are invalid; treat as None
}
}
/// Decode the priority hint (10 bits). Used for SCSI command duration
/// limit hints (`IOPRIO_HINT_DEV_DURATION_LIMIT_*`).
pub fn hint(self) -> u16 {
(self.0 >> 3) & 0x3ff
}
/// Decode the priority level (3 bits, 0 = highest within the class).
pub fn level(self) -> u8 {
(self.0 & 0x7) as u8
}
/// Return the combined 13-bit data field (hint + level), matching
/// Linux's `IOPRIO_PRIO_DATA(ioprio)` = `ioprio & 0x1fff`.
pub fn data(self) -> u16 {
self.0 & 0x1fff
}
/// Round-trip to/from the raw `i32` syscall argument.
pub fn from_raw(raw: i32) -> Option<Self> {
if raw < 0 || raw > 0xffff { return None; }
let class = (raw >> 13) & 0x7;
if class > 3 { return None; } // Invalid class (4-7 reserved)
Some(IoPriority(raw as u16))
}
pub fn to_raw(self) -> i32 {
self.0 as i32
}
/// The zero value: class = None, hint = 0, level = 0.
/// Semantics: inherit priority from CPU nice value.
pub const NONE: IoPriority = IoPriority(0);
}
/// SCSI command duration limit hint values (bits [12:3] of ioprio).
/// Linux `include/uapi/linux/ioprio.h` defines these since kernel 6.0.
pub const IOPRIO_HINT_NONE: u16 = 0;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_1: u16 = 1;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_2: u16 = 2;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_3: u16 = 3;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_4: u16 = 4;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_5: u16 = 5;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_6: u16 = 6;
pub const IOPRIO_HINT_DEV_DURATION_LIMIT_7: u16 = 7;
/// I/O scheduling class. Numeric values are identical to Linux
/// `IOPRIO_CLASS_*` constants — do not renumber.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum IoSchedClass {
/// Class not set. I/O priority is derived from CPU nice (see Section 15.18.3).
None = 0,
/// Real-time. Levels 0–7, 0 = highest. Preempts all BestEffort and Idle I/O.
RealTime = 1,
/// Best-effort. Levels 0–7, 0 = highest. Default class for all tasks.
BestEffort = 2,
/// Idle. Served only when no RT or BE I/O is pending.
/// The level field is ignored; all Idle I/O is equal.
Idle = 3,
}
Validation rules (enforced by ioprio_set before storing):
- Class must be 0–3 (values 4–7 are reserved; return EINVAL).
- For RT and BE: level must be 0–7 (return EINVAL if level > 7).
Hint bits are NOT validated — arbitrary hint values are passed through
to the block layer (Linux behavior). Unknown hints are ignored by drivers.
- For Idle: level must be 0 (any non-zero level is EINVAL). Hint is ignored.
- For None: level must be 0, hint must be 0.
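The validation rules above can be condensed into a single check on the raw syscall argument. A minimal sketch (validate_ioprio is an illustrative name; the real check lives in the ioprio_set entry path):

```rust
const EINVAL: i32 = 22;

/// Validate a raw ioprio syscall argument against the rules above.
/// Returns Ok(()) for storable values, Err(EINVAL) otherwise.
fn validate_ioprio(raw: i32) -> Result<(), i32> {
    // Upper 16 bits must be zero.
    if raw < 0 || raw > 0xffff {
        return Err(EINVAL);
    }
    let class = (raw >> 13) & 0x7;
    let hint = (raw >> 3) & 0x3ff;
    let level = raw & 0x7;
    match class {
        // RT and BE: all 3-bit levels (0-7) are valid; hint bits are
        // passed through unvalidated (Linux behavior).
        1 | 2 => Ok(()),
        // Idle: level must be 0; hint is ignored.
        3 if level == 0 => Ok(()),
        // None: level and hint must both be 0.
        0 if level == 0 && hint == 0 => Ok(()),
        // Classes 4-7 are reserved; any other combination is invalid.
        _ => Err(EINVAL),
    }
}
```

Note that with the 3-bit level field, "level > 7" cannot occur once the upper-16-bit check has passed; the reserved-class check is what actually rejects malformed values.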
15.18.3 Priority Inheritance from CPU Nice¶
When a task has IoPriority::NONE (class = IoSchedClass::None), its effective I/O
priority is derived from its CPU nice value at dispatch time. This matches Linux behavior:
The effective level is computed as (nice + 20) / 5, which maps the nice range −20..+19 onto BE levels 0..7:
| nice | effective BE level |
|---|---|
| −20 | 0 (highest) |
| −15 | 1 |
| −10 | 2 |
| −5 | 3 |
| 0 | 4 (default) |
| 5 | 5 |
| 10 | 6 |
| 19 | 7 (lowest) |
The derivation happens in the dispatch path, not at ioprio_set time, so that a subsequent
setpriority(2) call continues to influence I/O priority as expected.
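The derivation in the table reduces to one line; effective_be_level is an illustrative helper name for the dispatch-time computation (matching the Linux task_nice_ioprio() formula):

```rust
/// Effective BE level for a task whose io_priority is IoPriority::NONE,
/// derived from its CPU nice value at dispatch time.
/// nice is in -20..=19; (nice + 20) / 5 maps onto BE levels 0..=7.
fn effective_be_level(nice: i32) -> u8 {
    debug_assert!((-20..=19).contains(&nice));
    ((nice + 20) / 5) as u8
}
```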
15.18.4 Task Storage and Inheritance¶
/// Fields added to the Task structure (see Chapter 8).
pub struct Task {
// ... existing fields ...
/// Explicitly set I/O priority. `IoPriority::NONE` means "derive from nice".
pub io_priority: IoPriority,
}
Fork semantics: On fork(2) and clone(2) without CLONE_IO, the child inherits the
parent's io_priority value verbatim. If the parent had IoPriority::NONE, the child also
starts with IoPriority::NONE and its effective priority is derived from its own nice value
(which it also inherits from the parent, but may be changed independently).
CLONE_IO: When CLONE_IO is set, the child shares the parent's I/O context (same
io_context pointer). In this case the io_priority is also shared — a write by either
task is visible to the other. This is the same as Linux.
Thread groups: POSIX threads within the same process do NOT share io_priority by default
(consistent with Linux). Each thread has an independent io_priority. Tools that wish to set
the priority for all threads of a process must call ioprio_set(IOPRIO_WHO_PROCESS, tid, ioprio)
once per thread, using TIDs from /proc/<pid>/task/.
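As an illustration, a tool can enumerate those TIDs by listing /proc/&lt;pid&gt;/task. A hedged userspace sketch (the per-TID ioprio_set(2) call itself is elided):

```rust
use std::fs;

/// Sketch: collect the TIDs of a process by listing /proc/<pid>/task, as a
/// tool would before issuing ioprio_set(IOPRIO_WHO_PROCESS, tid, ioprio)
/// once per TID. The syscall itself is elided here.
fn thread_ids(pid: u32) -> std::io::Result<Vec<u32>> {
    let mut tids = Vec::new();
    for entry in fs::read_dir(format!("/proc/{pid}/task"))? {
        // Each directory entry name is a TID; skip anything non-numeric.
        if let Ok(tid) = entry?.file_name().to_string_lossy().parse::<u32>() {
            tids.push(tid);
        }
    }
    Ok(tids)
}
```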
15.18.5 Permission Model¶
UmkaOS enforces the same permission rules as Linux:
| Operation | Required capability |
|---|---|
| Set `IoSchedClass::RealTime` for any task | `CAP_SYS_ADMIN` |
| Set `IoSchedClass::BestEffort` or `IoSchedClass::Idle` for own tasks | None |
| Set `IoSchedClass::BestEffort` or `IoSchedClass::Idle` for another user's tasks | `CAP_SYS_NICE` |
| Set priority for a process group or all processes of a UID | Same as for individual processes |
| Read priority of any task | None (always permitted) |
"Own tasks" means: tasks whose real or effective UID matches the caller's real UID, or tasks
in the caller's process group when which = IOPRIO_WHO_PGRP. Setting a higher-than-current
BE level (lower priority number) for one's own tasks is always permitted.
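The capability rules above reduce to a pure predicate. A simplified sketch (the enum is redeclared locally for illustration; the real check also consults the credential structures):

```rust
/// Illustrative redeclaration; see the IoSchedClass definition earlier in
/// this chapter for the real type.
#[derive(Clone, Copy)]
enum IoSchedClass { None, RealTime, BestEffort, Idle }

/// Pure-predicate sketch of the capability table above. `own_task` means the
/// target's real or effective UID matches the caller's real UID (or the
/// target is in the caller's process group for IOPRIO_WHO_PGRP).
fn ioprio_set_allowed(class: IoSchedClass, own_task: bool,
                      cap_sys_admin: bool, cap_sys_nice: bool) -> bool {
    match class {
        // RT requires CAP_SYS_ADMIN for any task, including the caller's own.
        IoSchedClass::RealTime => cap_sys_admin,
        // BE/Idle/None: free for own tasks, CAP_SYS_NICE for others.
        IoSchedClass::BestEffort | IoSchedClass::Idle | IoSchedClass::None =>
            own_task || cap_sys_nice,
    }
}
```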
15.18.6 UmkaOS I/O Scheduler: Multi-Queue Priority-Aware (MQPA)¶
UmkaOS does not implement CFQ, BFQ, mq-deadline, or kyber as separate pluggable schedulers. Instead, UmkaOS implements a single unified scheduler — MQPA — that provides the correct behavior for all workloads without the configuration complexity of Linux's scheduler selection knob.
Design rationale vs Linux schedulers:
- CFQ: deprecated in Linux 5.0, removed in 5.3. Had a global elevator lock, per-process queues with O(n) dispatch, and poor NVMe multi-queue support.
- BFQ: per-process B-WF2Q+ scheduling with budget tracking. Good fairness, but complex, and a single per-device lock limits scaling on high-queue-depth SSDs.
- mq-deadline: simple, fast, low overhead, but only provides read/write starvation prevention — no per-class prioritization beyond that.
- kyber: good SSD latency targeting, but no class-based priority support.
MQPA provides class-based strict priority (RT > BE > Idle), weighted round-robin within BE levels, per-CPU queues for lock-free submission, elevator merge optimization, and NVMe hardware queue integration — without any of the above limitations.
15.18.6.1 Scheduler Data Structures¶
Note on state ownership: The canonical per-device queue container is
DeviceIoQueues, owned by the BlockDevice (not by the scheduler algorithm).
The dispatch algorithm (dispatch_one) becomes the IoSchedOps::pick_next()
implementation. See Section 15.15.10 for the full ownership model.
/// A single block I/O request, created by the submission path and dispatched
/// through the DeviceIoQueues to the hardware queue. Each IoRequest corresponds
/// to one contiguous LBA range (possibly merged from adjacent submissions).
pub struct IoRequest {
/// Logical block address of the first sector.
pub lba: Lba,
/// Length in bytes (aligned to sector size). Named `len_bytes` (not `len`)
/// to prevent ambiguity between bytes and sectors. u64: UmkaOS rule — no u32
/// for sizes. u32 caps at 4 GiB, which is too restrictive for NVMe 128 KiB+
/// scatter-gather and CXL memory-mapped storage.
pub len_bytes: u64,
/// Operation type. Uses BioOp directly (no redundant IoOp enum).
/// See [Section 15.2](#block-io-and-volume-management) for BioOp values.
pub op: BioOp,
/// Resolved priority from the submitting task (Section 15.15.2).
pub priority: IoPriority,
/// Monotonic timestamp of submission (ns). Used for latency accounting
/// in `/proc/PID/io` (Section 15.15.9) and deadline starvation detection.
pub submit_ns: u64,
/// Absolute deadline (ns). For RT class: `submit_ns + rt_deadline_ns`.
/// For BE/Idle: `submit_ns + be_deadline_ns`. If the scheduler has not
/// dispatched this request by `deadline_ns`, it is promoted to the head
/// of its queue (starvation prevention).
pub deadline_ns: u64,
/// PID of the submitting task (for per-process I/O accounting).
pub pid: Pid,
/// cgroup ID of the submitting task (for cgroup I/O throttling, Section 15.15.8).
pub cgroup_id: u64,
/// Scatter-gather list of physical pages backing this request.
/// Pinned for the duration of the I/O. See [Section 4.14](04-memory.md#dma-subsystem).
pub sgl: DmaSgl,
/// Back-pointer to the originating Bio. The scheduler uses this to:
/// 1. Extract the Bio at dispatch time — `BlockDeviceOps::submit_bio()`
/// takes `&mut Bio`, so the scheduler unwraps the IoRequest back to
/// its originating Bio for driver dispatch.
/// 2. Call `bio_complete(req.bio, status)` when hardware signals completion,
/// routing the result through the Bio's `end_io` callback back to the
/// submitter (filesystem, page cache, io_uring, sync waiter).
///
/// # Safety
/// The Bio is kept alive for the duration of the IoRequest's lifetime.
/// The submitter transfers ownership to the completion path at
/// `bio_submit()` time. The Bio is not freed until `bio_complete()`
/// invokes `end_io`. See [Section 15.2](#block-io-and-volume-management--bio-lifecycle-and-ownership).
pub bio: *mut Bio,
}
// IoOp removed — use BioOp ([Section 15.2](#block-io-and-volume-management)) directly.
// IoOp was a redundant subset of BioOp that lacked SecureErase and ZoneAppend.
// The I/O scheduler classifies operations for WRR dispatch and merge eligibility
// using BioOp directly via IoRequest.bio.op.
Design note (Decision 4 — IoCompletion removal): The previous IoCompletion
enum had three variants (TaskWake, IoUringCqe, None) and required a bridging
conversion (IoCompletion::from_bio_completion()) to route completion from IoRequest
back to Bio. That bridge was never defined, creating a broken completion chain
(BIO-01, BIO-05). The fix: IoRequest carries a *mut Bio back-pointer. On
completion, the scheduler calls bio_complete(req.bio, status), which invokes the
Bio's end_io callback — the callback set by the original submitter. No bridging
conversion needed. The submitter (filesystem, io_uring, sync waiter) controls
completion routing by setting Bio.end_io before bio_submit().
15.18.6.2 State Ownership for Live Evolution¶
The I/O scheduler follows the state spill avoidance pattern (see
Section 13.18): per-device I/O
queues are owned by the BlockDevice, not by the scheduler component. The
scheduler is a stateless dispatch function that reads queue state and selects
the next request to issue. This enables:
- Scheduler swap without queue drain: replacing the MQPA algorithm (or swapping to a
  future alternative) replaces only the `pick_next` function. All queued requests,
  in-flight counters, and deadline tracking survive the swap untouched.
- Driver crash recovery: when a Tier 1 storage driver crashes and reloads
  (Section 11.9), the I/O queues survive in the `BlockDevice` (Tier 0 kernel memory).
  The new driver instance sees all pending requests — no I/O is lost.
- Zero-overhead steady state: the `IoSchedOps` trait is a `&'static dyn` pointer
  resolved once at device init. No vtable indirection per-request beyond the single
  dispatch call.
/// Scheduler-private metadata embedded in each IoRequest.
/// The scheduler may use this to store per-request scheduling state
/// (e.g., virtual time, WRR credits, BFQ budget slice) without heap
/// allocation. The format is opaque to the block layer — only the
/// active scheduler implementation reads/writes these bytes.
///
/// 64 bytes = one cache line. Sufficient for all known scheduling
/// algorithms (BFQ budget: 24 bytes, mq-deadline: 8 bytes, MQPA WRR: 4 bytes).
pub const SCHED_DATA_SIZE: usize = 64;
Additional IoRequest field for scheduler state:
/// Scheduler-private per-request metadata. Opaque to the block layer.
/// Written by `IoSchedOps::on_submit()`, read by `IoSchedOps::pick_next()`.
/// Zeroed on request allocation; the scheduler initializes it during submission.
pub sched_data: [u8; SCHED_DATA_SIZE],
/// I/O scheduler algorithm interface (stateless pattern).
///
/// The scheduler does NOT own the I/O queues — they are owned by the
/// BlockDevice (via `DeviceIoQueues`). The scheduler provides stateless
/// decision functions that operate on queue references. This enables
/// live scheduler replacement without draining queues.
///
/// **Steady-state cost**: one `&'static dyn IoSchedOps` pointer dereference
/// per dispatch call. No additional indirection.
pub trait IoSchedOps: Send + Sync {
/// Algorithm name (for `/sys/block/<dev>/queue/scheduler`).
fn name(&self) -> &'static str;
/// Called when a new request enters a queue. The scheduler may
/// initialize `req.sched_data` (e.g., compute virtual time, assign
/// WRR credits). The request is already inserted into the appropriate
/// `IoQueue` by the block layer based on its `IoPriority` class/level.
fn on_submit(&self, queues: &DeviceIoQueues, req: &mut IoRequest);
/// Select the next request to dispatch to hardware. Returns `None` if
/// all queues are empty or rate-limited. The scheduler reads queue state
/// and `sched_data` but does NOT modify the queues — the block layer
/// calls `IoQueue::pop_front()` on the selected queue after `pick_next`
/// returns.
fn pick_next(&self, queues: &DeviceIoQueues, cpu: CpuId) -> Option<PickResult>;
/// Notification that a request completed. Used for accounting
/// (e.g., decrement inflight counters, update WRR round state).
fn on_complete(&self, queues: &DeviceIoQueues, req: &IoRequest);
}
/// Result of `pick_next`: identifies which queue and (for sorted queues)
/// which request to dispatch.
pub struct PickResult {
/// Class of the selected queue (RT, BE, or Idle).
pub class: IoSchedClass,
/// Level within the class (0-7 for RT/BE, 0 for Idle).
pub level: u8,
/// CPU whose per-CPU queue to dequeue from.
pub cpu: CpuId,
}
/// Per-device I/O queue set. Owned by BlockDevice (Tier 0 kernel memory),
/// NOT by the scheduler. Survives both scheduler swap and driver crash.
///
/// Created once during `BlockDevice` registration. Destroyed only when the
/// device is permanently removed.
///
/// **Memory budget**: 8 RT + 8 BE + 1 idle = 17 PerCpu<IoQueue> arrays per
/// device. On a 256-CPU system: 17 * 256 = 4352 IoQueue instances per device.
/// Each IoQueue is ~64 bytes (backing + oldest_enqueue_time + dispatched_this_round),
/// total ~272 KiB per device. For a server with 24 NVMe devices: ~6.4 MiB.
/// This is warm-path allocation (device registration), bounded per device.
pub struct DeviceIoQueues {
/// Per-CPU dispatch queues for RT class, indexed by level (0 = highest).
pub rt_queues: [PerCpu<IoQueue>; 8],
/// Per-CPU dispatch queues for BE class, indexed by level (0 = highest).
pub be_queues: [PerCpu<IoQueue>; 8],
/// Single per-CPU idle queue.
pub idle_queue: PerCpu<IoQueue>,
/// Count of in-flight requests across all classes (incremented at dispatch,
/// decremented at completion).
pub inflight: AtomicU32,
/// In-flight RT requests.
pub inflight_rt: AtomicU32,
/// Maximum queue depth supported by the device.
pub queue_depth: u32,
/// Block device identifier for per-cgroup-per-device io.latency budget lookup.
/// Set at BlockDevice registration time.
pub device_id: DeviceId,
/// KABI service handle for the block device driver. Used by
/// `dispatch_pending()` to call `submit_bio()` on the driver via
/// `kabi_call!(block_handle, submit_bio, bio)`. The handle encodes
/// the transport decision (direct call for Tier 0, ring for Tier 1)
/// cached at device registration time.
pub block_handle: KabiServiceHandle,
/// Per-device slab cache for `IoRequest` allocation. Sized at boot:
/// `nr_cpus * 128` entries. Used by `bio_to_io_request()` via
/// `SlabArc::new(&request_slab, req)` to avoid heap allocation on
/// the I/O submission hot path.
pub request_slab: SlabCache<IoRequest>,
/// Active scheduler algorithm. Swapped atomically during live evolution.
///
/// Uses `RcuCell<&'static dyn IoSchedOps>` instead of `AtomicPtr` because
/// `dyn IoSchedOps` is an unsized trait object with a fat pointer (data
/// pointer + vtable pointer = 2 words). `AtomicPtr` only handles thin
/// (single-word) pointers; `AtomicPtr<&dyn Trait>` is not valid Rust.
/// `RcuCell` stores both words atomically and provides RCU-protected
/// reads (lock-free on the dispatch hot path) with safe writer-side swap.
pub sched_ops: RcuCell<&'static dyn IoSchedOps>,
}
Scheduler evolution protocol:
- Prep: new `IoSchedOps` implementation loaded and verified.
- Atomic swap: `DeviceIoQueues::sched_ops` is replaced via `sched_ops.rcu_replace(new_ops)`.
  The old reference is freed after an RCU grace period. No queue quiescence needed — the
  queues are untouched.
- Post-swap: the new scheduler's `on_submit()` is called for newly arriving requests.
  Existing requests in queues retain their old `sched_data`; the new scheduler's
  `pick_next()` must handle `sched_data` written by the old scheduler (or treat unknown
  data as default priority). This is safe because `pick_next()` can always fall back to
  FIFO order within each priority queue.

Swap latency: ~1 us (RCU pointer swap + release fence). No stop-the-world IPI. No queue drain. Compare with the general component evolution path (Section 13.18), which requires 1-10 us stop-the-world.
/// Backing storage for an `IoQueue`, parameterised by media type.
///
/// - **Sorted** (rotational media — HDD): requests ordered by LBA for elevator
/// merge and seek-distance minimisation. `BTreeMap` provides O(log N) insert,
/// O(log N) predecessor/successor lookup for merge checks, and O(log N)
/// `pop_first()` for dispatch. Allocation per insert is negligible vs HDD
/// access latency (~4 ms for a 7200 RPM drive).
/// - **Fifo** (non-rotational media — NVMe, SSD, PMEM): no seek penalty; FIFO
/// preserves submission order and is optimal for deep hardware queues. Uses
/// `BoundedRing` (O(1) push/pop, pre-allocated at device init, no per-element allocation).
///
/// The variant is set once at `DeviceIoQueues` creation from `blk_queue_flag_set(QUEUE_FLAG_NONROT)`
/// and never changes at runtime. All `IoQueue` instances within one `DeviceIoQueues`
/// use the same variant.
pub enum IoQueueBacking {
/// **Arc overhead tradeoff**: `Arc<IoRequest>` adds 16 bytes (refcount +
/// allocation header) and one atomic increment/decrement per submit/complete.
/// This is acceptable because: (1) `IoRequest` must be shared between the
/// submitter, the scheduler, and the completion path (three owners); (2) the
/// atomic refcount cost (~5-15 ns) is negligible compared to device I/O
/// latency (~2-10 us for NVMe, ~4 ms for HDD); (3) the alternative (raw
/// pointers) would require unsafe lifetime tracking across the async I/O
/// boundary with no performance benefit.
Sorted(BTreeMap<Lba, Arc<IoRequest>>),
/// Bounded ring buffer for non-rotational media. Capacity is set to the
/// device's hardware queue depth (NVMe MQES) at `DeviceIoQueues` creation
/// time — no heap allocation on the I/O submission hot path.
Fifo(BoundedRing<Arc<IoRequest>>),
}
/// A single priority-level dispatch queue.
pub struct IoQueue {
/// Backing storage, parameterised by media type. See `IoQueueBacking`.
pub backing: IoQueueBacking,
/// Timestamp when the oldest request in this queue was enqueued.
/// Used for starvation detection: if `Instant::now() - oldest_enqueue_time`
/// exceeds the starvation threshold, the request is promoted.
/// `None` if the queue is empty.
oldest_enqueue_time: Option<Instant>,
/// Number of requests dispatched from this queue in the current WRR round
/// (BE queues only; unused for RT and Idle).
dispatched_this_round: u32,
}
impl IoQueue {
/// Dequeue the highest-priority request from this queue.
/// For rotational media (BTreeMap), this pops the lowest-LBA entry.
/// For non-rotational media (BoundedRing/Fifo), this pops from the front.
/// Returns `None` if the queue is empty.
pub fn pop_front(&mut self) -> Option<Arc<IoRequest>> { /* ... */ }
/// Insert a request, attempting back-merge or front-merge with adjacent
/// requests sharing the same block device and contiguous LBA range.
/// If no merge is possible, the request is appended. Merged requests
/// are capped at `MERGE_SIZE_LIMIT` (64 KiB) to bound latency.
pub fn insert_merged(&mut self, req: Arc<IoRequest>) { /* ... */ }
}
15.18.6.3 Dispatch Algorithm¶
The dispatch loop runs when the device signals readiness for more commands (doorbell
ring, completion interrupt, or explicit dispatch_pending() call from the submit path).
fn dispatch_one(sched: &DeviceIoQueues, cpu: CpuId) -> Option<Arc<IoRequest>> {
// IRQ-disable scope: held for the full scan+pop to prevent a completion
// IRQ on the same CPU from modifying queue state between the starvation
// check and the pop_front(). If IRQs were re-enabled between the check
// and the pop, a completion IRQ could drain the queue, invalidating the
// starvation check result.
//
// Worst-case IRQ-disabled window: ~5-10 us on cold cache (8 RT levels +
// 8 BE levels with starvation checks + cgroup budget lookups). This is
// below hard-RT deadlines but is a known overhead. Future optimization:
// split into preempt-disabled scan (select target queue) + IRQ-disabled
// pop (narrow critical section), retrying if the pop returns None.
let irq_guard = IrqDisabledGuard::acquire();
// IrqDisabledGuard implies preemption disabled. Obtain a PreemptGuard
// from the disabled-IRQ context for the PerCpu API.
let mut preempt_guard = PreemptGuard::from_irq_disabled(&irq_guard);
// All PerCpu accesses below use .get_mut_nosave(&mut preempt_guard, &irq_guard).
// Step 1: RT always wins. Scan RT levels 0..7, take first non-empty queue.
for level in 0..8 {
if let Some(req) = sched.rt_queues[level]
.get_mut_nosave(&mut preempt_guard, &irq_guard).pop_front()
{
sched.inflight_rt.fetch_add(1, Release);
sched.inflight.fetch_add(1, Release);
return Some(req);
}
}
// Step 2: Starvation promotion (BE). If any BE request has waited beyond
// the starvation threshold (500ms since enqueue), treat it as RT-priority
// for one dispatch.
for level in 0..8 {
let q = sched.be_queues[level]
.get_mut_nosave(&mut preempt_guard, &irq_guard);
if q.oldest_enqueue_time.map_or(false, |t| t.elapsed() > Duration::from_millis(500)) {
if let Some(req) = q.pop_front() {
sched.inflight.fetch_add(1, Release);
return Some(req);
}
}
}
// Step 3: BE weighted round-robin with io.latency enforcement.
// Weights: level 0 = 8, level 1 = 4, level 2 = 2, levels 3-7 = 1.
//
// io.latency enforcement (cgroup `io.latency` target):
// For each cgroup with an active lat_target_us, the block layer tracks
// per-cgroup I/O completion latency as an EMA (7/8 decay). When a
// cgroup's EMA exceeds its target, sibling cgroups' dispatch budgets
// are reduced proportionally:
// sibling_budget = max(1, normal_budget * target_us / sibling_ema)
// This throttles siblings to give the latency-sensitive cgroup more
// I/O bandwidth. The budget reduction is per-device, recalculated on
// each completion. See [Section 17.2](17-containers.md#control-groups) for the full specification.
let be_weights: [u32; 8] = [8, 4, 2, 1, 1, 1, 1, 1];
for level in 0..8 {
let q = sched.be_queues[level]
.get_mut_nosave(&mut preempt_guard, &irq_guard);
if q.dispatched_this_round < be_weights[level] {
if let Some(req) = q.pop_front() {
// io.latency check: if the dequeued request's cgroup has
// exhausted its dispatch budget for this device, re-queue
// and try the next level.
if let Some(cg) = cgroup_for_bio(&req) {
if cg.io_dispatch_budget(sched.device_id).load(Relaxed) == 0 {
q.push_front(req);
continue;
}
}
q.dispatched_this_round += 1;
sched.inflight.fetch_add(1, Release);
return Some(req);
}
}
}
// End of WRR round: reset counters and retry from level 0.
for level in 0..8 {
sched.be_queues[level]
.get_mut_nosave(&mut preempt_guard, &irq_guard)
.dispatched_this_round = 0;
}
for level in 0..8 {
let q = sched.be_queues[level]
.get_mut_nosave(&mut preempt_guard, &irq_guard);
if let Some(req) = q.pop_front() {
q.dispatched_this_round = 1;
sched.inflight.fetch_add(1, Release);
return Some(req);
}
}
// Step 4: Starvation promotion (Idle). 5s threshold since enqueue.
{
let iq = sched.idle_queue
.get_nosave(&preempt_guard, &irq_guard);
if iq.oldest_enqueue_time.map_or(false, |t| t.elapsed() > Duration::from_secs(5)) {
let iq_mut = sched.idle_queue
.get_mut_nosave(&mut preempt_guard, &irq_guard);
if let Some(req) = iq_mut.pop_front() {
sched.inflight.fetch_add(1, Release);
return Some(req);
}
}
}
// Step 5: Idle — only when RT and BE are empty.
sched.idle_queue
.get_mut_nosave(&mut preempt_guard, &irq_guard)
.pop_front()
.map(|req| {
sched.inflight.fetch_add(1, Release);
req
})
}
Starvation prevention:
- BE requests that wait longer than 500ms are promoted once (dispatched as if RT, then return to normal BE accounting afterward).
- Idle requests that wait longer than 5s are promoted once (dispatched regardless of pending BE I/O).
- Promotion is per-request, not per-queue: only the single oldest request in a queue is promoted at a time, preserving ordering within the queue.
15.18.6.4 Elevator Merge Optimization¶
For rotational media (IoQueueBacking::Sorted), requests are sorted by starting LBA.
When a new request arrives:
- Back-merge check: Look up the predecessor entry via `BTreeMap::range(..lba).next_back()`.
  If the predecessor's end LBA + 1 == new request's start LBA, and the combined bio size
  is ≤ 64 KiB (the merge size limit), extend the predecessor's `IoRequest` to cover the
  new range and discard the new request object.
- Front-merge check: Look up the successor via `BTreeMap::range(lba..).next()`. If the
  successor's start LBA == new request's end LBA + 1, and combined size ≤ 64 KiB, extend
  the new request and replace the successor.
- No merge: Insert the new request into the `BTreeMap` keyed by its start LBA.
Each merge check is O(log N). There is no global elevator lock: the per-CPU IoQueue is
accessed only while holding the per-CPU scheduler lock (preempt-disable critical section on
the submitting CPU).
For non-rotational media (IoQueueBacking::Fifo), back/front merge checks are still
attempted (same logic, but searching by LBA in the BoundedRing is O(N)); dispatch pops from
the front of the ring rather than the lowest-LBA entry. Default-off for NVMe: On devices
with rotational=0 and native NVMe multi-queue, merge is disabled by default
(/sys/block/<dev>/queue/nomerges=2) because NVMe controllers handle coalescing internally
and the O(N) scan cost exceeds any merge benefit. Merge can be re-enabled via sysfs for
devices where software merge is beneficial (e.g., SATA SSDs behind AHCI with single HW queue).
The 64 KiB merge limit is chosen to match a typical NVMe preferred transfer size and to
bound the latency spike of a merged request. This can be adjusted per-device at
initialization time by querying the device's MDTS (Maximum Data Transfer Size) field
in the NVMe identify controller data structure.
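The back-merge check above can be illustrated with a simplified model in which each queue entry is a (start LBA, byte length) pair (a sketch, not the real `IoRequest` path):

```rust
use std::collections::BTreeMap;

const MERGE_SIZE_LIMIT: u64 = 64 * 1024; // 64 KiB cap from the text
const SECTOR: u64 = 512;

/// Simplified queue model: start LBA -> length in bytes.
type Queue = BTreeMap<u64, u64>;

/// Back-merge check: if the predecessor ends exactly where the new request
/// begins and the combined size stays within the merge limit, extend the
/// predecessor in place and report success.
fn try_back_merge(q: &mut Queue, lba: u64, len: u64) -> bool {
    // Copy the predecessor out first so the immutable borrow ends before
    // the mutable insert below.
    let pred = q.range(..lba).next_back().map(|(&s, &l)| (s, l));
    if let Some((plba, plen)) = pred {
        if plba + plen / SECTOR == lba && plen + len <= MERGE_SIZE_LIMIT {
            q.insert(plba, plen + len);
            return true;
        }
    }
    false
}
```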
15.18.6.5 Submission Path¶
pub fn submit(sched: &DeviceIoQueues, req: Arc<IoRequest>, task: &Task) {
    let priority = task.effective_io_priority(); // derives from nice if NONE
    let cpu = current_cpu();
    let irq_guard = IrqDisabledGuard::acquire();
    // Same guard pattern as dispatch_one(): IrqDisabledGuard implies
    // preemption disabled; derive the PreemptGuard for the PerCpu API.
    let mut preempt_guard = PreemptGuard::from_irq_disabled(&irq_guard);
    match priority.class() {
        IoSchedClass::RealTime => {
            sched.rt_queues[priority.level() as usize]
                .get_mut_nosave(&mut preempt_guard, &irq_guard)
                .insert_merged(req);
        }
        IoSchedClass::BestEffort | IoSchedClass::None => {
            let level = match priority.class() {
                IoSchedClass::None => task.nice_to_be_level(),
                _ => priority.level() as usize,
            };
            sched.be_queues[level]
                .get_mut_nosave(&mut preempt_guard, &irq_guard)
                .insert_merged(req);
        }
        IoSchedClass::Idle => {
            sched.idle_queue
                .get_mut_nosave(&mut preempt_guard, &irq_guard)
                .insert_merged(req);
        }
    }
    dispatch_pending(sched, cpu);
}
15.18.6.6 dispatch_pending() — Submit-to-hardware bridge¶
/// Drains the I/O scheduler queue and submits the originating Bios to the
/// block device via `kabi_call!`. This is the critical bridge between
/// "request inserted into scheduler queue" and "request dispatched to hardware."
///
/// Called from:
/// - `submit()` after inserting a new request (submit path kickoff)
/// - NVMe/AHCI completion handler after freeing a hardware slot (refill)
///
/// **BIO-06 fix**: The driver's `BlockDeviceOps::submit_bio()` accepts
/// `&mut Bio`, not `&IoRequest`. The scheduler extracts the originating Bio
/// from `req.bio` (the `*mut Bio` back-pointer stored by `bio_to_io_request()`)
/// and dispatches the Bio directly to the driver. The IoRequest is a
/// scheduler-internal wrapper for priority/merging/deadline tracking — the
/// driver never sees it.
///
/// The tier awareness lives in `kabi_call!` — the handle knows whether the
/// driver is Tier 0 (direct vtable call) or Tier 1 (ring submission). No
/// `DispatchMode` enum needed. The dispatch loop is transport-agnostic.
///
/// # Algorithm
///
/// ```text
/// fn dispatch_pending(queues: &DeviceIoQueues, cpu: usize) {
/// let irq_guard = disable_irqs();
/// let sched_ops = queues.sched_ops.rcu_read();
/// loop {
/// let pick = match sched_ops.pick_next(queues, cpu) {
/// Some(p) => p,
/// None => break, // all queues empty
/// };
/// let req = queues.dequeue(pick);
/// // Extract the originating Bio from the IoRequest.
/// // SAFETY: req.bio was set by bio_to_io_request() and the Bio is
/// // alive (owned by the completion path via ManuallyDrop).
/// let bio = unsafe { &mut *req.bio };
/// // Submit the Bio to the driver via KABI transport.
/// // The handle cached at device registration time determines
/// // direct call (Tier 0) vs ring submission (Tier 1).
/// match kabi_call!(queues.block_handle, submit_bio, bio) {
/// Ok(()) => {
/// // Request accepted by hardware. Track in-flight count.
/// queues.inflight.fetch_add(1, Relaxed);
/// if req.priority.class() == IoSchedClass::RealTime {
/// queues.inflight_rt.fetch_add(1, Relaxed);
/// }
/// }
/// Err(e) if e == Error::BUSY => {
/// // Hardware queue full. Requeue the IoRequest at the front
/// // of the scheduler's dispatch queue for the next attempt.
/// // The completion handler will call dispatch_pending()
/// // again when a slot frees up.
/// queues.requeue_front(req, pick);
/// break;
/// }
/// Err(e) => {
/// // Permanent error — complete the originating Bio.
/// bio_complete(req.bio, -(e as i32));
/// }
/// }
/// }
/// }
/// ```
///
/// **Completion path (Decision 4)**: When hardware signals completion (via
/// IRQ ring for Tier 1, or direct callback for Tier 0), the completion
/// handler calls `bio_complete(req.bio, status)`. This invokes the Bio's
/// `end_io` callback — the function pointer set by the original submitter
/// (filesystem, io_uring, sync waiter). No `IoCompletion` bridging needed.
///
/// The scheduler's `on_complete()` hook is called for accounting (decrement
/// inflight counters, update WRR round state) before or after `bio_complete()`,
/// depending on whether the callback may free the Bio:
///
/// ```text
/// fn complete_request(queues: &DeviceIoQueues, req: &IoRequest, status: i32) {
/// // 1. Notify scheduler for accounting.
/// let sched_ops = queues.sched_ops.rcu_read();
/// sched_ops.on_complete(queues, req);
/// // 2. Decrement in-flight counters.
/// queues.inflight.fetch_sub(1, Relaxed);
/// if req.priority.class() == IoSchedClass::RealTime {
/// queues.inflight_rt.fetch_sub(1, Relaxed);
/// }
/// // 3. Complete the originating Bio (invokes end_io callback).
/// bio_complete(req.bio, status);
/// // 4. Free the IoRequest back to the slab.
/// // (Arc<IoRequest> refcount drops to zero here.)
/// // 5. Kick dispatch to refill the freed hardware slot.
/// dispatch_pending(queues, current_cpu());
/// }
/// ```
15.18.7 NVMe Multi-Queue Integration¶
NVMe hardware supports multiple independent submission/completion queue pairs. UmkaOS maps the MQPA scheduler to NVMe hardware queues as follows:
Queue layout per NVMe controller:
- One hardware queue pair per online CPU (as Linux does with blk-mq).
- Each hardware queue has its own DeviceIoQueues instance — no cross-queue locking.
- Tasks submit requests to the DeviceIoQueues associated with their current CPU. The
dispatcher drains that scheduler's queues into the hardware submission queue doorbell.
NVMe queue priority (QPRIO): When the NVMe controller supports the Weighted Round Robin
with Urgent Priority Class arbitration mechanism (reported in CAP.AMS), UmkaOS creates
dedicated submission queue tiers:
| NVMe QPRIO | Value (CDW11[2:1]) | Used for |
|---|---|---|
| Urgent | 00b | RT I/O class (all levels 0-7) |
| High | 01b | BE levels 0-1 |
| Medium | 10b | BE levels 2-4 |
| Low | 11b | BE levels 5-7 and Idle |
Queue priority is set at queue creation time via the QPRIO field in CDW11 of the
Create I/O Submission Queue admin command. This maps UmkaOS's software priority classes
to NVMe hardware arbitration, so that the drive's internal scheduler also respects
UmkaOS priorities — not just the host-side MQPA scheduler.
If the controller does not support CAP.AMS priority, all queues are created at the
default (equal) priority and MQPA's software dispatch order is the sole priority mechanism.
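The QPRIO table above reduces to a small mapping function. A hedged sketch (the enum is redeclared locally for illustration; the real encoding is written into CDW11[2:1] at queue creation):

```rust
/// Illustrative redeclaration; see the IoSchedClass definition earlier in
/// this chapter for the real type.
enum IoSchedClass { None, RealTime, BestEffort, Idle }

/// Sketch of the class/level -> QPRIO mapping from the table above.
/// Returns the two-bit CDW11[2:1] encoding: Urgent=0b00, High=0b01,
/// Medium=0b10, Low=0b11.
fn qprio_for(class: IoSchedClass, level: u8) -> u8 {
    match class {
        // RT always rides the Urgent queues, regardless of level.
        IoSchedClass::RealTime => 0b00,
        // BE (and None, which resolves to BE at dispatch) tiers by level.
        IoSchedClass::BestEffort | IoSchedClass::None => match level {
            0..=1 => 0b01, // High
            2..=4 => 0b10, // Medium
            _ => 0b11,     // Low
        },
        IoSchedClass::Idle => 0b11, // Low
    }
}
```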
RT fast path: RT requests are eligible for direct hardware queue submission without
going through the sorted BTreeMap, provided the hardware queue has available slots.
This reduces the RT dispatch latency to approximately one PCIe round trip (2–4 μs on
Gen4/Gen5 NVMe) without waiting for a dispatch tick.
Completion handling: NVMe completions arrive per-queue. Each completion decrements
inflight and inflight_rt (if RT), then calls dispatch_one to fill the freed slot.
This keeps queue depth at the device's preferred level for maximum throughput.
CPU hotplug handling: When a CPU goes offline, the I/O scheduler must drain or
migrate requests from the dead CPU's per-CPU IoQueue. The hotplug sequence:
1. CPU_DEAD notifier fires for the going-offline CPU.
2. For each block device: acquire the dead CPU's IoQueue lock.
3. Drain all pending requests from the dead CPU's queues (RT, BE[0..7], Idle).
4. Re-submit drained requests via submit() on the current (live) CPU, which
inserts them into the live CPU's queues at their original priority.
5. In-flight requests (already submitted to hardware) complete normally on any
CPU via interrupt steering — no migration needed.
6. When a CPU comes online (CPU_ONLINE), a fresh per-CPU IoQueue set is
allocated and registered. No request migration is needed for online events.
15.18.8 cgroup Integration¶
UmkaOS's io cgroup v2 controller and blkio cgroup v1 controller interact with MQPA:
cgroup v2 io controller:
Integer weight 1–10000 (default 100). Maps to an effective BE level multiplier:
effective_weight = io.weight // 1-10000, default 100
be_dispatch_quota = be_weights[level] * effective_weight / 100
Tasks in a cgroup with io.weight=500 (5× default) get 5× the per-round dispatch quota
at their BE level. Tasks in a cgroup with io.weight=10 get 0.1× quota (rounded up to
1 dispatch per round to avoid starvation).
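The quota formula above can be sketched as follows (`be_dispatch_quota` is an illustrative name, not a spec'd symbol):

```rust
/// Per-round WRR quota at a BE level, scaled by cgroup io.weight and rounded
/// up to at least one dispatch per round (avoids starvation at low weights).
fn be_dispatch_quota(level: usize, io_weight: u32) -> u32 {
    // WRR base weights from the dispatch algorithm: level 0 = 8, 1 = 4,
    // 2 = 2, levels 3-7 = 1.
    let be_weights: [u32; 8] = [8, 4, 2, 1, 1, 1, 1, 1];
    (be_weights[level] * io_weight / 100).max(1)
}
```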
/// Read the I/O weight for a cgroup. Called by the I/O scheduler at dispatch
/// time to determine the cgroup's proportional share.
pub fn cgroup_io_weight(cgroup: &Cgroup) -> u32 {
cgroup.io.as_ref().map_or(100, |io| io.weight.load(Ordering::Relaxed))
}
The per-cgroup weight applies within the same BE priority level. A task at BE level 0
with io.weight=10 still preempts a task at BE level 1 with io.weight=10000 — class
and level take strict priority; cgroup weight only affects relative bandwidth within the
same level.
cgroup v2 io.max — hard rate limits:
Format (Linux compatible): MAJ:MIN rbps=N wbps=N riops=N wiops=N
Implemented as a token bucket per cgroup per device. Tokens refill at the configured rate; requests that arrive when the bucket is empty are held in a per-cgroup delay queue and released when tokens become available. Rate-limited requests retain their original MQPA priority and are inserted into the normal dispatch queue when released from the delay queue.
Token bucket parameters:
- Bucket capacity: 4× the per-second rate limit (allows burst up to 4 seconds of quota).
- Refill granularity: every 1ms tick (avoids thundering herd on 1-second boundaries).
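A minimal token-bucket sketch matching these parameters (names are illustrative):

```rust
/// Token bucket with capacity 4x the per-second rate (4 s of burst),
/// refilled on a 1 ms tick.
struct TokenBucket {
    tokens: u64,
    capacity: u64,
    rate_per_sec: u64,
}

impl TokenBucket {
    fn new(rate_per_sec: u64) -> Self {
        let capacity = rate_per_sec * 4;
        Self { tokens: capacity, capacity, rate_per_sec }
    }

    /// Called from the 1 ms refill tick.
    fn refill_1ms(&mut self) {
        self.tokens = (self.tokens + self.rate_per_sec / 1000).min(self.capacity);
    }

    /// Charge `n` tokens; a false return means the request is parked on the
    /// per-cgroup delay queue until tokens become available.
    fn try_consume(&mut self, n: u64) -> bool {
        if self.tokens >= n { self.tokens -= n; true } else { false }
    }
}
```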
cgroup v1 blkio controller:
Supported knobs and their v2 equivalents:
| v1 knob | v2 equivalent | Notes |
|---|---|---|
| `blkio.weight` | `io.weight` | Per-cgroup default weight |
| `blkio.weight_device` | `io.weight` (per-device) | Per-device weight override |
| `blkio.throttle.read_bps_device` | `io.max rbps=` | Hard rate limit |
| `blkio.throttle.write_bps_device` | `io.max wbps=` | Hard rate limit |
| `blkio.throttle.read_iops_device` | `io.max riops=` | Hard rate limit |
| `blkio.throttle.write_iops_device` | `io.max wiops=` | Hard rate limit |
v1 blkio.bfq.* knobs are accepted but ignored with a logged warning (BFQ is not
implemented; MQPA provides equivalent or better behavior).
cgroup v2 io.stat — I/O accounting:
Format (Linux 4.16+ compatible): one line per device, `MAJ:MIN rbytes=N wbytes=N rios=N wios=N dbytes=N dios=N`
Fields:
- rbytes / wbytes: bytes read/written from storage (not page cache hits)
- rios / wios: number of completed read/write I/O operations
- dbytes / dios: bytes/ops issued as discard (TRIM/UNMAP) commands
Counters are updated on I/O completion, not on submission. Accounted per-task first, then aggregated to the cgroup hierarchy on read.
15.18.9 /proc/PID/io Accounting¶
Each task accumulates I/O counters in its RusageAccum structure (defined in Chapter 8).
These are exposed in /proc/<pid>/io with the following format (Linux compatible):
rchar: <N>
wchar: <N>
syscr: <N>
syscw: <N>
read_bytes: <N>
write_bytes: <N>
cancelled_write_bytes: <N>
Field definitions:
| Field | Type | Description |
|---|---|---|
| rchar | u64 | Bytes passed to read(2) and similar calls. Includes page cache hits. Does not represent physical I/O. |
| wchar | u64 | Bytes passed to write(2) and similar calls. Includes writes to page cache. Does not represent physical I/O. |
| syscr | u64 | Number of read-class syscalls (read, pread64, readv, preadv, preadv2, sendfile, copy_file_range). |
| syscw | u64 | Number of write-class syscalls (write, pwrite64, writev, pwritev, pwritev2, sendfile, copy_file_range). |
| read_bytes | u64 | Bytes actually fetched from storage (cache misses that triggered block I/O). Updated at I/O completion. |
| write_bytes | u64 | Bytes actually written to storage (writeback completions). Updated at writeback completion. |
| cancelled_write_bytes | u64 | Bytes charged to write_bytes that were subsequently cancelled because the page was truncated before writeback. |
Implementation:
- rchar and wchar are incremented in the VFS read/write path before checking the page cache.
- syscr and syscw are incremented at syscall entry.
- read_bytes is incremented in the block I/O completion handler when the originating
task can be attributed (via IoRequest::pid).
- write_bytes is incremented in the writeback completion handler. Writeback is attributed
to the task that dirtied the page (recorded in the page's DirtyAccountable field).
- cancelled_write_bytes is incremented in truncate_inode_pages when a dirty page
is discarded before writeback.
Thread aggregation: /proc/<pid>/io reports the sum across all threads in the process.
Per-thread values are available at /proc/<pid>/task/<tid>/io.
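Since the format is line-oriented `key: value` pairs, a userspace consumer can parse it trivially. The sketch below is illustrative (a hypothetical helper, not part of the kernel or any shipped tool); it assumes the 7-field Linux-compatible layout shown above:

```rust
use std::collections::HashMap;

/// Parse the /proc/<pid>/io text format into a field -> value map.
/// Lines that do not match `key: number` are skipped.
pub fn parse_proc_io(text: &str) -> HashMap<String, u64> {
    text.lines()
        .filter_map(|line| {
            let (key, value) = line.split_once(':')?;
            Some((key.trim().to_string(), value.trim().parse().ok()?))
        })
        .collect()
}
```

In practice the input would come from `std::fs::read_to_string("/proc/self/io")`; thread-level counters use the `task/<tid>/io` path instead.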
15.18.10 sysfs Interface¶
/sys/block/<dev>/queue/scheduler:
UmkaOS presents the MQPA scheduler under the name umka-mqpa. For compatibility with tools
that check this file (e.g., fio, tuned, irqbalance), the file also accepts none,
mq-deadline, bfq, and kyber as writes — all are silently mapped to umka-mqpa.
The read value always shows [umka-mqpa] in the list of available schedulers.
/sys/block/<dev>/queue/iosched/:
UmkaOS presents as mq-deadline for maximum tool compatibility (iostat, blktrace, fio all
detect the scheduler name and adjust output accordingly). The following tunables are
honored:
| Tunable | Default | Meaning in UmkaOS |
|---|---|---|
| read_expire | 500ms | Starvation deadline for BE read requests |
| write_expire | 5000ms | Starvation deadline for BE write requests |
| writes_starved | 2 | (ignored; MQPA WRR handles fairness) |
| front_merges | 1 | 0 = disable front-merge check; 1 = enable (default) |
| fifo_batch | 16 | (ignored; MQPA dispatches one request per call) |
All other mq-deadline tunables (async_depth, prio_aging_expire, etc.) are accepted via
sysfs write but have no effect. A single-line message is logged at info level when an
ignored tunable is written: umka-mqpa: tunable '<name>' accepted but has no effect.
MQPA-native tunables (exposed under /sys/block/<dev>/queue/iosched/):
| Tunable | Default | Description |
|---|---|---|
| wrr_quantum_us | 100 | WRR time quantum per BE level (microseconds) |
| rt_starve_limit | 64 | Max RT requests dispatched before one BE is served |
| idle_batch | 4 | Max idle-class requests dispatched per round |
| merge_max_kb | 64 | Maximum merged request size (KiB) |
/sys/block/<dev>/queue/ common knobs honored by MQPA:
| Knob | Description |
|---|---|
| nr_requests | Maximum queue depth. UmkaOS clamps to the device's reported NVMe MQES. |
| rq_affinity | 0 = complete on any CPU, 1 = complete on submitting CPU's socket, 2 = complete on exact submitting CPU. |
| add_random | 0 = do not contribute to /dev/random entropy pool on I/O completion. |
| rotational | 0 = SSD/NVMe (disable elevator C-scan; use FIFO-within-level order instead of LBA order). |
When rotational=0, each IoQueue is created with backing: IoQueueBacking::Fifo
(a BoundedRing<Arc<IoRequest>> pre-allocated to the device's hardware queue depth).
Back/front merge checks are still performed but dispatch pops from the front of the
ring rather than the lowest-LBA entry. This avoids
unnecessary seek-optimisation work on random-access media.
15.18.11 Linux Compatibility Notes¶
| Item | Detail |
|---|---|
| Syscall numbers (x86-64) | ioprio_set = 251, ioprio_get = 252 |
| Syscall numbers (i386 compat) | ioprio_set = 289, ioprio_get = 290 |
| Syscall numbers (AArch64) | ioprio_set = 30, ioprio_get = 31 |
| IOPRIO_CLASS_NONE | 0 |
| IOPRIO_CLASS_RT | 1 |
| IOPRIO_CLASS_BE | 2 |
| IOPRIO_CLASS_IDLE | 3 |
| IOPRIO_PRIO_CLASS(ioprio) | (ioprio >> 13) & 0x7 |
| IOPRIO_PRIO_DATA(ioprio) | ioprio & 0x1fff (13-bit combined hint+level) |
| IOPRIO_PRIO_HINT(ioprio) | (ioprio >> 3) & 0x3ff (10-bit hint, Linux 6.0+) |
| IOPRIO_PRIO_LEVEL(ioprio) | ioprio & 0x7 (3-bit level, Linux 6.0+) |
| IOPRIO_PRIO_VALUE(class, level) | ((class) << 13) \| (level) (hint=0 compat) |
| ionice(1) (util-linux) | Works without modification |
| iopriority field in /proc/<pid>/status | Not exposed; use ioprio_get(2) |
| taskset / chrt | Unaffected; these set CPU/RT scheduler priority, not I/O priority |
| cgroup v2 io.stat format | Compatible with Linux 4.16+ |
| cgroup v2 io.weight range | 1–10000, default 100 (Linux compatible) |
| blkio.weight v1 range | 10–1000, mapped to v2 weight via weight * 10 |
| /proc/<pid>/io format | Identical to Linux (all 7 fields, same names) |
ionice(1) tool compatibility: The ionice utility from util-linux calls
ioprio_set(2) and ioprio_get(2) directly via syscall(2) (no glibc wrapper exists).
No modification is required.
Tools that query /sys/block/<dev>/queue/scheduler: Tools like fio, tuned, and
storage benchmarks that read or write the scheduler knob will see [umka-mqpa] and accept
writes of mq-deadline without error. The fio io_uring and libaio engines are
unaffected by scheduler selection — they bypass the scheduler for direct I/O
(O_DIRECT).
O_DIRECT and io_uring with fixed buffers: Requests submitted via io_uring with
IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED on O_DIRECT file descriptors are
still subject to MQPA priority. The submitting task's io_priority is sampled at
io_uring_enter(2) time and embedded in each IoRequest generated from the
submission ring.
15.19 NVMe Host Controller Driver Architecture¶
Pseudocode convention: Code in this section uses Rust syntax and follows Rust ownership, borrowing, and type rules.
`&self` methods use interior mutability for mutation. Atomic fields use `.store()`/`.load()`. All `#[repr(C)]` structs have `const_assert!` size verification. See CLAUDE.md Spec Pseudocode Quality Gates.
The NVMe driver is a Tier 1 KABI driver that manages local PCIe-attached NVMe solid-state drives through the NVM Express register and command interface. This is the primary high-performance block storage driver for UmkaOS — NVMe SSDs are the default boot and data disk on modern servers, workstations, and laptops.
Reference specification: NVM Express Base Specification 2.1 (NVM Express, Inc., August 2024). NVM Express Zoned Namespace Command Set Specification 1.1b (August 2022).
NVMe-oF (over Fabrics) is a separate subsystem. The NVMe-oF initiator and target are defined in Section 15.13. This section covers the local PCIe NVMe host controller driver only. The two share the
`NvmeCommand` format and namespace abstraction — see the definitions in Section 15.19.2 below.
15.19.1 Controller Memory Space (CMS) Registers¶
The NVMe controller exposes a memory-mapped register set at PCI BAR0. All registers
are little-endian. The first 0x40 bytes are controller-wide registers; doorbell
registers start at offset 0x1000 (configurable via CAP.DSTRD).
/// NVMe controller registers (BAR0 MMIO, offsets 0x00-0x3F).
/// All registers are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures including big-endian
/// PPC32 and s390x. Matches Linux `struct nvme_bar` which uses `__le64`/`__le32`.
#[repr(C)]
pub struct NvmeRegisters {
/// Controller Capabilities (CAP) — read-only, 64-bit.
/// Bits: MQES (15:0) maximum queue entries supported (0-based),
/// CQR (16) contiguous queues required,
/// AMS (18:17) arbitration mechanism supported (0=round-robin, 1=WRR+urgent),
/// TO (31:24) timeout in 500ms units (worst-case time for CSTS.RDY transitions),
/// DSTRD (35:32) doorbell stride (2^(2+DSTRD) bytes between doorbells),
/// NSSRS (36) NVM subsystem reset supported,
/// CSS (44:37) command set supported (bit 0=NVM, bit 6=I/O command sets, bit 7=admin only),
/// BPS (45) boot partition support,
/// CPS (47:46) controller power scope,
/// MPSMIN (51:48) memory page size minimum (2^(12+MPSMIN) bytes),
/// MPSMAX (55:52) memory page size maximum (2^(12+MPSMAX) bytes),
/// PMRS (56) persistent memory region supported,
/// CMBS (57) controller memory buffer supported,
/// NSSS (58) NVM subsystem shutdown supported,
/// CRMS (60:59) controller ready modes supported.
pub cap: Le64,
/// Version (VS) — read-only, 32-bit.
/// Major (31:16), minor (15:8), tertiary (7:0). E.g., 0x00020100 = 2.1.0.
pub vs: Le32,
/// Interrupt Mask Set (INTMS) — write-only, 32-bit.
/// Set bits to mask corresponding interrupt vectors.
pub intms: Le32,
/// Interrupt Mask Clear (INTMC) — write-only, 32-bit.
/// Set bits to unmask corresponding interrupt vectors.
pub intmc: Le32,
/// Controller Configuration (CC) — read-write, 32-bit.
/// Bits: EN (0) enable,
/// CSS (6:4) I/O command set selected,
/// MPS (10:7) memory page size (2^(12+MPS) bytes, must be within MPSMIN..MPSMAX),
/// AMS (13:11) arbitration mechanism selected,
/// SHN (15:14) shutdown notification (00=none, 01=normal, 10=abrupt),
/// IOSQES (19:16) I/O submission queue entry size (2^N bytes, must be 6 for 64-byte),
/// IOCQES (23:20) I/O completion queue entry size (2^N bytes, must be 4 for 16-byte),
/// CRIME (24) controller ready independent of media enable.
pub cc: Le32,
/// Reserved (0x18).
pub _reserved0: Le32,
/// Controller Status (CSTS) — read-only, 32-bit.
/// Bits: RDY (0) ready,
/// CFS (1) controller fatal status,
/// SHST (3:2) shutdown status (00=normal, 01=in-progress, 10=complete),
/// NSSRO (4) NVM subsystem reset occurred,
/// PP (5) processing paused.
pub csts: Le32,
/// NVM Subsystem Reset (NSSR) — read-write, 32-bit.
/// Write 0x4E564D65 ("NVMe") to initiate subsystem reset (if CAP.NSSRS=1).
pub nssr: Le32,
/// Admin Queue Attributes (AQA) — read-write, 32-bit.
/// ASQS (11:0) admin submission queue size (0-based, max 4095 entries),
/// ACQS (27:16) admin completion queue size (0-based, max 4095 entries).
pub aqa: Le32,
/// Admin Submission Queue Base Address (ASQ) — read-write, 64-bit.
/// Physical address, page-aligned (bits 11:0 must be zero).
pub asq: Le64,
/// Admin Completion Queue Base Address (ACQ) — read-write, 64-bit.
/// Physical address, page-aligned (bits 11:0 must be zero).
pub acq: Le64,
/// Controller Memory Buffer Location (CMBLOC) — offset 0x38, 32-bit.
/// Indicates the location and access parameters of the CMB if CAP.CMBS=1.
/// If CMB is not supported, this register is reserved.
pub cmbloc: Le32,
/// Controller Memory Buffer Size (CMBSZ) — offset 0x3C, 32-bit.
/// Indicates the size and capabilities of the CMB if CAP.CMBS=1.
/// SZU (3:0) size units, SZ (31:4) size.
pub cmbsz: Le32,
}
// NVMe Base Spec 2.1: registers 0x00-0x3F = 64 bytes.
const_assert!(core::mem::size_of::<NvmeRegisters>() == 64);
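The CAP bit layout documented above decodes into the cached capability fields with plain shifts and masks. This is a sketch of the field extraction only (hypothetical `decode_cap` helper and `CapFields` struct); the real probe path reads the register through the `MmioRegion` accessor and converts from `Le64` first:

```rust
/// Decoded CAP fields (subset used by the driver; names illustrative).
pub struct CapFields {
    pub mqes: u16,          // bits 15:0 — max queue entries, 0-based
    pub cqr: bool,          // bit 16 — contiguous queues required
    pub timeout_500ms: u8,  // bits 31:24 — CSTS.RDY timeout in 500ms units
    pub dstrd: u8,          // bits 35:32 — doorbell stride exponent
    pub mpsmin: u8,         // bits 51:48 — min page size = 2^(12+mpsmin)
    pub mpsmax: u8,         // bits 55:52 — max page size = 2^(12+mpsmax)
}

pub fn decode_cap(cap: u64) -> CapFields {
    CapFields {
        mqes: (cap & 0xFFFF) as u16,
        cqr: (cap >> 16) & 1 == 1,
        timeout_500ms: ((cap >> 24) & 0xFF) as u8,
        dstrd: ((cap >> 32) & 0xF) as u8,
        mpsmin: ((cap >> 48) & 0xF) as u8,
        mpsmax: ((cap >> 52) & 0xF) as u8,
    }
}
```

Derived values follow directly: the doorbell stride is `4 << dstrd` bytes and the usable queue depth is `mqes + 1` (hence the u32 `max_queue_entries` field in the driver state).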
15.19.2 Submission/Completion Queue Pair Model¶
NVMe uses paired ring buffers for command submission and completion. Each pair consists of a Submission Queue (SQ) and a Completion Queue (CQ). The admin queue pair (QID 0) handles controller management; I/O queue pairs (QID 1+) handle data transfer.
Submission Queue (SQ): Circular buffer of 64-byte command entries. The host writes commands and advances the SQ tail doorbell. The controller fetches commands from the SQ head (tracked internally by the controller).
Completion Queue (CQ): Circular buffer of 16-byte completion entries. The controller writes completions and the host detects new entries via the phase bit (bit 16 of DW3). The phase bit toggles on each CQ wraparound, allowing the host to distinguish new completions from stale entries without reading a head register. The host advances the CQ head doorbell after processing completions.
Doorbell registers start at BAR0 + 0x1000, spaced by 4 << CAP.DSTRD bytes:
- SQ Y Tail Doorbell: offset `0x1000 + (2 * Y) * (4 << DSTRD)`
- CQ Y Head Doorbell: offset `0x1000 + (2 * Y + 1) * (4 << DSTRD)`
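The two offset formulas above reduce to a pair of one-line helpers (illustrative names, not driver symbols):

```rust
/// SQ tail doorbell offset for queue `qid`, given CAP.DSTRD.
/// Even doorbell slots are SQ tails.
pub fn sq_tail_doorbell(qid: u32, dstrd: u8) -> u32 {
    0x1000 + (2 * qid) * (4 << dstrd)
}

/// CQ head doorbell offset for queue `qid`, given CAP.DSTRD.
/// Odd doorbell slots are CQ heads.
pub fn cq_head_doorbell(qid: u32, dstrd: u8) -> u32 {
    0x1000 + (2 * qid + 1) * (4 << dstrd)
}
```

With the minimum stride (DSTRD=0, 4-byte doorbells) the admin queue uses 0x1000/0x1004 and I/O queue 1 uses 0x1008/0x100C; a larger stride spreads doorbells onto separate cache lines.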
/// NVMe Submission Queue Entry — 64 bytes. All NVMe commands use this format.
/// The first 16 bytes are common; CDW10-CDW15 are command-specific.
/// All multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct NvmeCommand {
/// Command Dword 0 (CDW0).
/// OPC (7:0) opcode,
/// FUSE (9:8) fused operation (00=normal, 01=first, 10=second),
/// PSDT (15:14) PRP or SGL for data transfer (00=PRP, 01/10=SGL),
/// CID (31:16) command identifier (unique per SQ, used to correlate completions).
pub cdw0: Le32,
/// Namespace Identifier (NSID). 0xFFFFFFFF for controller-wide commands.
pub nsid: Le32,
/// Command Dword 2-3 — reserved for most commands.
pub cdw2: Le32,
pub cdw3: Le32,
/// Metadata Pointer (MPTR) — physical address of metadata buffer (if applicable).
pub mptr: Le64,
/// Data Pointer — two PRP entries (PRP1, PRP2) or one SGL descriptor.
/// For PRP mode: PRP1 = first page, PRP2 = second page or PRP list address.
pub dptr: [Le64; 2],
/// Command Dwords 10-15 — command-specific parameters.
pub cdw10: Le32,
pub cdw11: Le32,
pub cdw12: Le32,
pub cdw13: Le32,
pub cdw14: Le32,
pub cdw15: Le32,
}
// NVMe Base Spec: cdw0(4)+nsid(4)+cdw2(4)+cdw3(4)+mptr(8)+dptr(16)+cdw10-15(24) = 64 bytes.
const_assert!(core::mem::size_of::<NvmeCommand>() == 64);
/// NVMe Completion Queue Entry — 16 bytes.
/// All multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct NvmeCompletion {
/// Command-specific result (DW0).
pub result: Le32,
/// Command-specific result (DW1). Zero for NVM I/O commands; carries
/// additional data for some admin commands (e.g., Identify, Create I/O
/// Queue). See NVMe Base Specification 2.0+ Figure 89.
pub result_hi: Le32,
/// SQ Head Pointer — controller's current SQ head position.
/// The host uses this to reclaim SQ entries.
pub sq_head: Le16,
/// SQ Identifier — identifies which SQ this completion is for.
pub sq_id: Le16,
/// Command Identifier — matches the CID from the submitted NvmeCommand.
pub cid: Le16,
/// Status Field (NVMe Base Spec 2.0, Figure 89).
/// P (bit 0) phase bit — toggled on each CQ wraparound.
/// SC (bits 8:1) status code — 8-bit field.
/// SCT (bits 11:9) status code type (0=generic, 1=command-specific, 2=media, 3=path, 6-7=vendor).
/// CRD (bits 13:12) command retry delay.
/// M (bit 14) more — more status available (via Error Info log page).
/// DNR (bit 15) do not retry — 1 means permanent error, 0 means transient (may retry).
pub status: Le16,
}
// NVMe Base Spec: result(4)+result_hi(4)+sq_head(2)+sq_id(2)+cid(2)+status(2) = 16 bytes.
const_assert!(core::mem::size_of::<NvmeCompletion>() == 16);
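Decoding the 16-bit status field and the phase-bit "is this entry new?" check follow mechanically from the layout documented on the `status` field. A sketch (helper names illustrative; the driver operates on the `Le16` field after native-endian conversion):

```rust
/// P (bit 0): phase bit, toggled by the controller on each CQ wraparound.
pub fn cqe_phase(status: u16) -> bool { status & 1 == 1 }

/// SC (bits 8:1): 8-bit status code.
pub fn cqe_status_code(status: u16) -> u8 { ((status >> 1) & 0xFF) as u8 }

/// SCT (bits 11:9): status code type.
pub fn cqe_status_type(status: u16) -> u8 { ((status >> 9) & 0x7) as u8 }

/// DNR (bit 15): 1 = permanent error, 0 = transient (may retry).
pub fn cqe_do_not_retry(status: u16) -> bool { status & (1 << 15) != 0 }

/// A completion at cq_head is new when its phase bit matches the host's
/// expected phase. The host flips its expected phase each time cq_head
/// wraps, so stale entries from the previous lap never match.
pub fn cqe_is_new(status: u16, expected_phase: bool) -> bool {
    cqe_phase(status) == expected_phase
}
```

This is why NVMe needs no CQ tail register: the host spins on `cqe_is_new` at `cq_head`, processes matching entries, and writes the CQ head doorbell afterward.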
15.19.3 NVMe Command Opcodes¶
/// Admin command opcodes (Opcode field in CDW0, used on QID 0).
#[repr(u8)]
pub enum NvmeAdminOpcode {
/// Delete I/O Submission Queue.
DeleteIoSq = 0x00,
/// Create I/O Submission Queue.
CreateIoSq = 0x01,
/// Get Log Page (error log, SMART, firmware slot, AEN config, etc.).
GetLogPage = 0x02,
/// Delete I/O Completion Queue.
DeleteIoCq = 0x04,
/// Create I/O Completion Queue.
CreateIoCq = 0x05,
/// Identify — returns controller or namespace data structures.
Identify = 0x06,
/// Abort — request cancellation of a previously submitted command.
Abort = 0x08,
/// Set Features — configure controller parameters.
SetFeatures = 0x09,
/// Get Features — read controller parameters.
GetFeatures = 0x0A,
/// Asynchronous Event Request — register for async notifications.
AsyncEventReq = 0x0C,
/// Namespace Management — create/delete namespaces.
NsMgmt = 0x0D,
/// Firmware Commit — activate firmware image.
FwCommit = 0x10,
/// Firmware Image Download — transfer firmware to controller.
FwDownload = 0x11,
/// Namespace Attachment — attach/detach namespace to controller.
NsAttach = 0x15,
/// Format NVM — low-level format a namespace.
FormatNvm = 0x80,
}
/// NVM I/O command opcodes (used on I/O queues, QID 1+).
#[repr(u8)]
pub enum NvmeIoOpcode {
/// Flush — commit volatile write cache to non-volatile media.
Flush = 0x00,
/// Write — transfer data from host to namespace.
Write = 0x01,
/// Read — transfer data from namespace to host.
Read = 0x02,
/// Write Uncorrectable — mark LBA range as invalid (read returns error).
WriteUncor = 0x04,
/// Compare — compare data in namespace with host buffer.
Compare = 0x05,
/// Write Zeroes — set LBA range to zero without transferring data.
WriteZeroes = 0x08,
/// Dataset Management — TRIM/deallocate, volatile write cache hints.
Dsm = 0x09,
/// Verify — verify data integrity without transferring data.
Verify = 0x0C,
/// Reservation Register — register/unregister reservation keys.
ResrvRegister = 0x0D,
/// Reservation Report — report current reservations.
ResrvReport = 0x0E,
/// Reservation Acquire — acquire/preempt reservations.
ResrvAcquire = 0x11,
/// Reservation Release — release reservations.
ResrvRelease = 0x15,
/// Zone Append (ZNS) — write data to zone write pointer.
ZoneAppend = 0x7D,
/// Zone Management Send (ZNS) — open/close/finish/reset zone.
ZoneMgmtSend = 0x79,
/// Zone Management Receive (ZNS) — report zone descriptors.
ZoneMgmtRecv = 0x7A,
}
15.19.4 Driver State¶
/// NVMe controller driver state — lives in the Tier 1 driver domain.
/// One instance per NVMe controller (PCI function).
pub struct NvmeController {
/// PCI BAR0 MMIO accessor for controller registers.
pub regs: MmioRegion,
/// Controller capabilities (cached from CAP register at init).
pub cap: NvmeCapabilities,
/// Maximum queue entries supported (CAP.MQES + 1).
/// u32 because MQES is 16 bits: when MQES = 65535, the actual entry count
/// is 65536, which overflows u16.
pub max_queue_entries: u32,
/// Doorbell stride in bytes (4 << CAP.DSTRD).
pub doorbell_stride: u32,
/// Host memory page size configured in CC.MPS (bytes, power of 2).
pub page_size: u32,
/// Admin queue pair (QID 0). Always present after initialization.
pub admin_queue: NvmeQueuePair,
/// I/O queue pairs (QID 1+). One per CPU, up to controller maximum.
/// ArrayVec capacity 256: compile-time upper bound for the number of
/// I/O queues. Actual count is min(nr_cpu_ids, CAP.MQES, 256) at init.
/// 256 is sufficient for current NVMe controllers (most support <=128
/// queues). Systems with >256 CPUs share queues (queue_idx = cpu % N).
/// If future controllers support >256 queues, this constant must be
/// increased or replaced with a slab-allocated slice.
pub io_queues: ArrayVec<NvmeQueuePair, 256>,
/// Active namespaces discovered via Identify. One NvmeNamespace per NSID.
/// XArray keyed by NSID (u32) — integer-keyed mapping per collection policy.
/// NVMe allows up to 2^32-1 namespaces (NN field from Identify Controller);
/// XArray provides runtime-sized, O(log₆₄ N) lookup without hardcoded limits.
/// Populated at probe time (warm-path), accessed on I/O submission (hot-path
/// via cached queue→namespace binding, not repeated XArray lookup).
pub namespaces: XArray<NvmeNamespace>,
/// Number of MSI-X vectors allocated.
pub msix_vectors: u16,
/// Controller serial number (20 ASCII bytes from Identify Controller).
pub serial: [u8; 20],
/// Controller model number (40 ASCII bytes from Identify Controller).
pub model: [u8; 40],
/// Firmware revision (8 ASCII bytes from Identify Controller).
pub firmware_rev: [u8; 8],
/// Maximum Data Transfer Size in bytes. Derived from controller MDTS field.
/// 0 means no limit (use host page size × max PRP list length).
pub max_transfer_size: u32,
/// Number of outstanding Async Event Requests (AER) the controller supports.
pub aerl: u8,
/// Controller supports volatile write cache (Identify Controller, VWC bit 0).
pub volatile_write_cache: bool,
/// Controller power state management.
pub power_state: NvmePowerState,
/// Error recovery state.
pub error_state: AtomicU8,
/// NUMA node of the PCI device (for queue/interrupt affinity).
pub numa_node: u16,
}
/// Cached controller capabilities from the CAP register.
pub struct NvmeCapabilities {
/// Maximum Queue Entries Supported (0-based). Actual max = mqes + 1.
pub mqes: u16,
/// Contiguous Queues Required.
pub cqr: bool,
/// Timeout in 500ms units (for CSTS.RDY transitions).
pub timeout: u8,
/// Doorbell Stride (2^(2+dstrd) bytes).
pub dstrd: u8,
/// Minimum host memory page size (2^(12+mpsmin) bytes).
pub mpsmin: u8,
/// Maximum host memory page size (2^(12+mpsmax) bytes).
pub mpsmax: u8,
}
/// NVMe submission/completion queue pair.
pub struct NvmeQueuePair {
/// Queue identifier (0 = admin, 1+ = I/O).
pub qid: u16,
/// Queue depth (number of entries, power of 2, max CAP.MQES+1).
pub depth: u16,
/// DMA-coherent submission queue buffer.
pub sq: DmaBox<[NvmeCommand]>,
/// DMA-coherent completion queue buffer.
pub cq: DmaBox<[NvmeCompletion]>,
/// SQ tail index — next slot to write a command. Advanced by the host.
pub sq_tail: u16,
/// CQ head index — next slot to read a completion. Advanced by the host.
pub cq_head: u16,
/// Current CQ phase bit. Starts at 1; toggles on each CQ wraparound.
pub cq_phase: bool,
/// Doorbell offset for SQ tail (BAR0 + 0x1000 + qid*2*stride).
pub sq_doorbell_offset: u32,
/// Doorbell offset for CQ head (BAR0 + 0x1000 + (qid*2+1)*stride).
pub cq_doorbell_offset: u32,
/// In-flight command tracking: maps CID → pending Bio.
/// Allocated at queue creation with length = actual queue depth
/// (discovered from controller CAP.MQES, typically 64-1024).
/// Warm-path allocation (driver init only).
pub inflight: Box<[Option<NvmeInflightCmd>]>,
/// Next command identifier. Wraps at queue depth.
pub next_cid: u16,
/// MSI-X vector assigned to this queue's CQ.
pub irq_vector: u16,
/// Number of commands posted since the last doorbell write.
/// Used by `nvme_ring_doorbell()` to skip redundant MMIO writes.
pub pending_doorbells: u16,
/// Batch mode flag. When true, `nvme_submit_io()` defers doorbell
/// writes. Set by `nvme_submit_batch()`, cleared after the batch
/// doorbell write.
pub batch_mode: bool,
/// DMA device handle for IOMMU/SWIOTLB address translation.
pub dma_device: DmaDevice,
/// Reference to the controller's BAR0 MMIO region (shared across all
/// queues). Used by `nvme_ring_doorbell()` to write the SQ tail and
/// CQ head doorbells at the queue-specific offsets.
pub regs: MmioRegion,
/// Per-CID flush waiter. `submit_flush_sync` stores a slab-allocated
/// `Completion` handle here; the IRQ handler wakes it on flush
/// completion. Allocated at queue creation with length = actual queue
/// depth. `None` for CIDs not used by synchronous flush.
///
/// Uses slab-allocated `Completion` (not stack references) to avoid
/// dangling pointers if the submit path returns early via `?`.
pub flush_waiters: Box<[Option<SlabBox<Completion>>]>,
/// Pre-allocated PRP list page pool. Each entry is a DMA-coherent
/// page (4096 bytes, 512 Le64 entries) for multi-page I/O commands.
/// Pool size = queue depth (one PRP list per in-flight command).
/// Allocated at queue creation; no allocation on the I/O hot path.
pub prp_pool: PrpPool,
/// Domain ID of the NVMe driver's isolation domain. Set at driver init
/// from the `DomainService.domain_id` passed during module registration.
/// `CORE_DOMAIN_ID` (0) during early boot (Tier 0); updated to the Tier 1
/// domain ID after promotion. Used by `nvme_signal_completion()` to select
/// direct (`bio_complete()`) vs ring-based completion path.
pub domain_id: DomainId,
/// Outbound KABI completion ring for Tier 1 mode. Initialized during
/// module registration when the driver binds to the block layer service.
/// The Tier 0 block layer consumer drains this ring and calls
/// `bio_complete()` for each entry. `None` in Tier 0 (boot) mode.
pub outbound_ring: Option<CrossDomainRing>,
}
/// In-flight command context — tracks a submitted command until completion.
/// Stores all information needed to rebuild the SQ entry on retry
/// (opcode, nsid, DMA mapping, PRP list).
pub struct NvmeInflightCmd {
/// Pointer to the originating Bio (for completion callback).
pub bio: *mut Bio,
/// PRP list page (if the command required a PRP list for >2 segments).
/// None if the command fit in the two inline PRP pointers. All PRP
/// entries are little-endian per NVMe spec — Le64, not native u64.
pub prp_list: Option<DmaBox<[Le64; 512]>>,
/// DMA mapping for this command's data transfer. Unmapped in the
/// completion handler after bio signaling to prevent IOMMU leak.
pub dma_map: Option<DmaMapping>,
/// Command opcode (for retry classification on error).
pub opcode: u8,
/// Namespace identifier (needed for command rebuild on retry).
pub nsid: u32,
/// Retry count (incremented on transient error, max 3).
pub retries: u8,
}
15.19.5 Namespace State¶
/// NVMe namespace — one per active NSID on the controller.
pub struct NvmeNamespace {
/// Namespace Identifier (1-based).
pub nsid: u32,
/// Namespace capacity in logical blocks.
pub capacity_blocks: u64,
/// Logical Block Address (LBA) format: sector size in bytes.
/// Derived from Identify Namespace LBAF[FLBAS].LBADS: size = 2^lbads.
pub block_size: u32,
/// Metadata size per block (from LBAF[FLBAS].MS). 0 if no metadata.
pub metadata_size: u16,
/// Namespace supports thin provisioning (NSFEAT bit 0).
pub thin_provisioned: bool,
/// Namespace supports deallocate (DSM, Dataset Management command).
pub supports_deallocate: bool,
/// Maximum number of LBA ranges per DSM command (from Identify Namespace DMRL).
/// 0 means no limit reported; driver uses 256 (spec maximum).
pub dsm_range_limit: u16,
/// Namespace is a Zoned Namespace (ZNS). See [Section 15.19](#nvme-driver-architecture--zoned-namespaces-zns).
pub zns: Option<NvmeZnsInfo>,
/// Optimal I/O boundary in logical blocks (NOIOB from Identify Namespace).
/// Straddling this boundary may degrade performance. 0 = no boundary.
pub optimal_io_boundary: u16,
/// Preferred write granularity in logical blocks (NPWG from Identify Namespace).
pub preferred_write_granularity: u16,
/// Preferred write alignment in logical blocks (NPWA from Identify Namespace).
pub preferred_write_alignment: u16,
/// End-to-end data protection type (0=none, 1=Type1, 2=Type2, 3=Type3).
pub pi_type: u8,
}
15.19.6 Initialization Sequence¶
Eight steps, from PCI probe to ready:
1. PCI probe and BAR mapping: Match PCI class 01:08:02 (Mass Storage, NVM Express, NVM Express I/O Controller). Map BAR0 as uncacheable MMIO. Read the `CAP` and `VS` registers. Validate spec version (minimum 1.0 for basic operation).
2. Controller reset: Clear `CC.EN` (bit 0). Poll `CSTS.RDY` until it clears. Timeout = `CAP.TO × 500ms`. If `CSTS.RDY` does not clear, the controller is non-functional — abort initialization with `EIO`.
3. Configure controller: Set `CC.MPS` to match the host page size (must be within `CAP.MPSMIN..CAP.MPSMAX`). Set `CC.IOSQES = 6` (64-byte SQ entries). Set `CC.IOCQES = 4` (16-byte CQ entries). Set `CC.AMS = 1` (WRR with urgent) if `CAP.AMS` bit 0 is set; otherwise `CC.AMS = 0` (round-robin). Set `CC.CSS = 0` (NVM command set).
4. Admin queue setup: Determine admin queue depth (min of 4096 and `CAP.MQES + 1`). DMA memory tradeoff: a 4096-entry admin queue consumes 256 KiB (4096 × 64B SQ entries) + 64 KiB (4096 × 16B CQ entries) = 320 KiB of DMA-coherent memory per controller. This is generous for admin commands (typically <100 concurrent), but simplifies firmware update, namespace management, and device self-test flows that can submit many commands concurrently. For memory-constrained systems, the default can be reduced via boot parameter `nvme.admin_queue_depth=256`. Allocate DMA-coherent buffers for admin SQ and CQ. Write physical addresses to `ASQ` and `ACQ`. Write queue sizes to `AQA` (both ASQS and ACQS fields). Set `CC.EN = 1`. Poll `CSTS.RDY` until it sets (timeout = `CAP.TO × 500ms`). If `CSTS.CFS` (Controller Fatal Status) is set, reset and retry once before aborting.
5. Identify Controller: Submit Identify command (opcode 0x06, CNS=0x01) on the admin queue. Parse the 4096-byte Identify Controller data structure:
   - Bytes 24-63: Serial Number (SN), Model Number (MN).
   - Byte 77: MDTS — Maximum Data Transfer Size as a power of 2 in units of `CAP.MPSMIN`. If 0, no limit. Otherwise, max transfer = `(1 << MDTS) × page_size`.
   - Bytes 257-258: OACS — Optional Admin Command Support (namespace management, firmware commands, format NVM).
   - Byte 259: ACL — Abort Command Limit (max outstanding Abort commands).
   - Byte 260: AERL — Async Event Request Limit.
   - Byte 525: VWC — Volatile Write Cache (bit 0: present).
   - Bytes 514-515: NVSCC — NVM Vendor Specific Command Configuration.
6. Identify Namespaces: Submit Identify command (CNS=0x02) to get the active namespace list. For each NSID in the list, submit Identify Namespace (CNS=0x00) to discover:
   - NSZE (bytes 0-7): namespace size in logical blocks.
   - NCAP (bytes 8-15): namespace capacity.
   - FLBAS (byte 26): formatted LBA size index (selects entry from LBAF array).
   - LBAF[0..63] (bytes 128-191): LBA format descriptors (LBADS = data size log2, MS = metadata size, RP = relative performance).
   - DPS (byte 29): data protection settings.
   - NSFEAT (byte 24): namespace features (thin provisioning, deallocate support).
   - NOIOB (bytes 72-73): namespace optimal I/O boundary.
   - NPWG/NPWA (bytes 74-77): preferred write granularity and alignment.
7. I/O queue creation: Allocate MSI-X vectors — one per I/O queue plus one for the admin queue. Determine I/O queue count: `min(online_cpus, controller_max_queues)`. Set Features (feature ID 0x07 — Number of Queues) to request the desired count. The controller may grant fewer. For each I/O queue pair:
   a. Allocate DMA-coherent CQ buffer. Submit Create I/O CQ (opcode 0x05) on admin queue: CDW10 = `(size-1) << 16 | QID`, CDW11 = `irq_vector << 16 | IEN | PC` (physically contiguous, interrupts enabled).
   b. Allocate DMA-coherent SQ buffer. Submit Create I/O SQ (opcode 0x01) on admin queue: CDW10 = `(size-1) << 16 | QID`, CDW11 = `CQID << 16 | QPRIO | PC`.
   c. Assign the queue pair to a specific CPU for interrupt affinity.
8. Ready: Register Async Event Requests (up to AERL+1 outstanding). Enable interrupt coalescing via Set Features (feature ID 0x08) if the workload benefits from batching (tunable: threshold count + aggregation time). Register each namespace with umka-block as a `BlockDevice`.
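The reset polling in step 2 can be sketched as a pure helper. Register access and delay are abstracted behind closures so the logic is shown (and testable) without MMIO; the helper name and closure-based shape are illustrative, while the timeout derivation (`CAP.TO × 500ms`) and the `CSTS.RDY` bit position follow the register documentation above:

```rust
/// Wait for CSTS.RDY (bit 0) to clear after clearing CC.EN.
/// `cap_to` is CAP.TO in 500ms units; returns an error string standing
/// in for EIO if the controller never deasserts ready.
pub fn wait_csts_rdy_clear(
    cap_to: u8,
    mut read_csts: impl FnMut() -> u32,
    mut sleep_ms: impl FnMut(u32),
) -> Result<(), &'static str> {
    let timeout_ms = cap_to as u32 * 500;
    let mut waited_ms = 0;
    while read_csts() & 1 != 0 {
        if waited_ms >= timeout_ms {
            // Controller non-functional: abort initialization.
            return Err("EIO: CSTS.RDY did not clear");
        }
        sleep_ms(1);
        waited_ms += 1;
    }
    Ok(())
}
```

The same shape, polling for the bit to set rather than clear, serves step 4 after `CC.EN = 1` is written.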
15.19.7 I/O Path¶
Bio-to-NVMe command translation: The block layer submits a Bio containing an
LBA range and a scatter-gather list of memory segments. The NVMe driver translates
this into an NvmeCommand with PRP (Physical Region Page) data pointers.
PRP construction: NVMe uses two PRP pointers in each command (PRP1 and PRP2):
- 1 segment (data fits in one page): PRP1 = physical address of the data buffer. PRP2 = 0 (unused).
- 2 segments (data spans two pages): PRP1 = first page physical address. PRP2 = second page physical address.
- >2 segments: PRP1 = first page physical address. PRP2 = physical address of a PRP list — a page-aligned buffer of u64 physical addresses for the remaining pages. Each PRP list page holds up to 512 entries (4096 / 8). If more than 512 additional pages are needed, the last entry in the PRP list points to the next PRP list page (chained).
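The rules above fix how many PRP list pages a transfer consumes. A sketch of the accounting, assuming 4 KiB pages and the 512-entry list pages described above (`prp_list_pages` is an illustrative name, not a driver function):

```rust
/// Number of PRP list pages needed for a transfer spanning `n_pages`
/// 4 KiB data pages. One or two pages fit in PRP1/PRP2 directly; beyond
/// that, PRP2 points at a chain of list pages, where every full list
/// page spends its last entry (of 512) linking to the next page.
fn prp_list_pages(n_pages: usize) -> usize {
    const ENTRIES: usize = 4096 / 8; // 512 u64 entries per list page
    if n_pages <= 2 {
        return 0; // PRP1 (and PRP2) address the data directly
    }
    let e = n_pages - 1; // PRP1 covers the first data page
    if e <= ENTRIES {
        return 1; // single list page, no chaining needed
    }
    // Each non-final page holds ENTRIES-1 payload entries + 1 chain link;
    // the final page uses all ENTRIES for payload.
    1 + (e - ENTRIES).div_ceil(ENTRIES - 1)
}
```

For example, a 2 MiB transfer (512 data pages) needs exactly one list page; 513 data pages spill into a second, chained page.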
fn nvme_submit_io(queue: &mut NvmeQueuePair, bio: &mut Bio,
ns: &NvmeNamespace) -> Result<()> {
// Find a free CID by scanning forward from next_cid.
// Each slot in `inflight` is Some(_) while in-flight, None when free.
let start = queue.next_cid as usize;
let depth = queue.depth as usize;
let mut cid = None;
for i in 0..depth {
let idx = (start + i) % depth;
if queue.inflight[idx].is_none() {
cid = Some(idx as u16);
queue.next_cid = ((idx + 1) % depth) as u16;
break;
}
}
let cid = cid.ok_or(Error::BUSY)?; // All slots in-flight — queue full.
// Build NvmeCommand at SQ tail.
let cmd = &mut queue.sq[queue.sq_tail as usize];
*cmd = NvmeCommand::zeroed();
// All NvmeCommand fields are Le32/Le64 — explicit conversion via
// Le32::from_ne() required on big-endian architectures (PPC32, s390x).
cmd.cdw0 = Le32::from_ne(match bio.op {
BioOp::Read => NvmeIoOpcode::Read as u32,
BioOp::Write => NvmeIoOpcode::Write as u32,
_ => unreachable!(), // Flush/Discard handled separately
} | ((cid as u32) << 16));
cmd.nsid = Le32::from_ne(ns.nsid);
// CDW10-11: Starting LBA (64-bit).
let slba = bio.start_lba;
cmd.cdw10 = Le32::from_ne(slba as u32);
cmd.cdw11 = Le32::from_ne((slba >> 32) as u32);
// CDW12: Number of logical blocks (0-based) | FUA bit 30.
let total_bytes: u64 = bio.segments.iter().map(|s| s.len as u64).sum::<u64>()
+ bio.segments_ext.as_deref().map_or(0u64, |ext| ext.iter().map(|s| s.len as u64).sum());
let nlb = (total_bytes / ns.block_size as u64) - 1;
let fua = if bio.flags.contains(BioFlags::FUA) { 1u32 << 30 } else { 0u32 };
cmd.cdw12 = Le32::from_ne(nlb as u32 | fua);
// Build PRP entries from bio segments.
let opcode = match bio.op {
BioOp::Read => NvmeIoOpcode::Read as u8,
BioOp::Write => NvmeIoOpcode::Write as u8,
_ => unreachable!(),
};
let mut inflight = NvmeInflightCmd {
bio: bio as *mut Bio,
prp_list: None,
dma_map: None,
opcode,
nsid: ns.nsid,
retries: 0,
};
// Map bio segments to DMA addresses via IOMMU/SWIOTLB translation.
// BioSegment contains (page: Arc<Page>, offset: u32, len: u32) —
// physical/bus addresses are obtained by calling dma_map_sgl(),
// not by direct field access. See §4.11 DMA Subsystem.
let sgl = DmaSgl::from_bio_segments(&bio.segments, bio.segments_ext.as_deref());
let dma_map = queue.dma_device.dma_map_sgl(
&sgl, DmaDirection::from_bio_op(bio.op),
)?;
// Extract DMA addresses BEFORE moving dma_map into inflight.
// Rust move semantics: accessing dma_map after move is a compilation error.
let dma_addrs = dma_map.addresses();
inflight.dma_map = Some(dma_map);
// DmaAddr is u64 (native-endian); NvmeCommand.dptr is [Le64; 2]
// and PRP list entries are Le64 — NVMe is a little-endian wire
// protocol. Le64::from_ne() is a no-op on LE architectures (x86-64,
// AArch64 LE, RISC-V LE) and a byte-swap on BE (PPC32, PPC64 BE,
// s390x). See also the Le32::from_ne() conversions for cdw0/nsid above.
match dma_addrs.len() {
0 => {} // No data (should not happen for read/write)
1 => {
cmd.dptr[0] = Le64::from_ne(dma_addrs[0]);
}
2 => {
cmd.dptr[0] = Le64::from_ne(dma_addrs[0]);
cmd.dptr[1] = Le64::from_ne(dma_addrs[1]);
}
n => {
cmd.dptr[0] = Le64::from_ne(dma_addrs[0]);
// Allocate PRP list from per-queue pre-allocated pool.
// PRP list entries are Le64 per NVMe spec.
let prp_list = queue.alloc_prp_list()?;
for i in 1..n {
prp_list[i - 1] = Le64::from_ne(dma_addrs[i]);
}
cmd.dptr[1] = Le64::from_ne(prp_list.phys_addr());
inflight.prp_list = Some(prp_list);
}
}
queue.inflight[cid as usize] = Some(inflight);
// Advance SQ tail. Doorbell write is deferred for batch submission.
queue.sq_tail = (queue.sq_tail + 1) % queue.depth;
queue.pending_doorbells += 1;
// Ring doorbell immediately only for single-command submission.
// Batch callers use `nvme_submit_batch()` which defers the doorbell
// write until all commands in the batch are posted. This reduces
// MMIO writes from N to 1 per batch (MMIO writes are ~100-500ns each
// due to uncacheable PCIe BAR access). For single-command submission
// (the common case for fsync/flush), ring immediately.
if !queue.batch_mode {
nvme_ring_doorbell(queue);
}
Ok(())
}
/// Ring the NVMe submission queue doorbell. Writes the current SQ tail
/// to the controller's doorbell register, notifying hardware of new commands.
/// Issues a release fence first so all command writes are visible to the
/// device before the doorbell lands.
fn nvme_ring_doorbell(queue: &mut NvmeQueuePair) {
if queue.pending_doorbells == 0 {
return;
}
// Write memory barrier — ensure commands are visible before doorbell write.
core::sync::atomic::fence(Release);
queue.regs.write32(queue.sq_doorbell_offset, queue.sq_tail as u32);
queue.pending_doorbells = 0;
}
/// Submit a batch of bios with a single deferred doorbell write.
/// Each bio is individually placed into the SQ; the doorbell is rung
/// once after all commands are posted. Reduces MMIO overhead from N
/// writes to 1 per batch. Used by the block layer's request merging
/// and plugging infrastructure.
fn nvme_submit_batch(queue: &mut NvmeQueuePair, bios: &mut [&mut Bio],
ns: &NvmeNamespace) -> Result<()> {
queue.batch_mode = true;
let mut result = Ok(());
for bio in bios.iter_mut() {
if let Err(e) = nvme_submit_io(queue, bio, ns) {
// Stop on the first failure, but fall through: commands already
// posted to the SQ still need their doorbell write.
result = Err(e);
break;
}
}
queue.batch_mode = false;
nvme_ring_doorbell(queue);
result
}
15.19.7.1 NVMe Helper Functions¶
impl NvmeQueuePair {
/// Allocate a free Command ID by scanning forward from `next_cid`.
/// Returns the CID index. Returns `Err(Error::BUSY)` if all slots in-flight.
pub fn alloc_cid(&mut self) -> Result<u16> {
let start = self.next_cid as usize;
let depth = self.depth as usize;
for i in 0..depth {
let idx = (start + i) % depth;
if self.inflight[idx].is_none() {
self.next_cid = ((idx + 1) % depth) as u16;
return Ok(idx as u16);
}
}
Err(Error::BUSY)
}
/// Allocate a PRP list page from the per-queue pre-allocated PRP pool.
/// Each NvmeQueuePair has a slab of pre-allocated PRP list pages (one
/// per queue depth entry). The PRP list is page-aligned (4096 bytes)
/// and holds up to 512 Le64 entries. The returned PrpList handle pairs
/// the virtual mapping (indexable for entry writes) with the list's
/// DMA address, exposed via phys_addr().
/// Returns Err(Error::NOMEM) if the pool is exhausted.
pub fn alloc_prp_list(&mut self) -> Result<PrpList> {
self.prp_pool.alloc().ok_or(Error::NOMEM)
}
/// Return a PRP list page to the per-queue pool.
pub fn free_prp_list(&mut self, list: PrpList) {
self.prp_pool.free(list);
}
/// Build and submit a flush command (NvmeIoOpcode::Flush, opcode 0x00).
/// The inflight entry must already be stored by the caller.
/// Uses the packed CDW0 API: OPC(7:0) | CID(31:16).
pub fn submit_flush_cmd(&mut self, nsid: u32, cid: u16) -> Result<()> {
let cmd = &mut self.sq[self.sq_tail as usize];
*cmd = NvmeCommand::zeroed();
cmd.cdw0 = Le32::from_ne(
NvmeIoOpcode::Flush as u32 | ((cid as u32) << 16),
);
cmd.nsid = Le32::from_ne(nsid);
// No data pointers, no CDW10-15 for flush.
self.sq_tail = (self.sq_tail + 1) % self.depth;
// Account the posted command so nvme_ring_doorbell does not early-return.
self.pending_doorbells += 1;
nvme_ring_doorbell(self);
Ok(())
}
/// Submit a flush command and block until completion.
/// Used by BlockDeviceOps::flush() for synchronous flush.
///
/// Uses a slab-allocated `Completion` (not a stack-local reference)
/// so that early return via `?` cannot leave a dangling pointer in
/// `flush_waiters`. Cleanup of `flush_waiters[cid]` and `inflight[cid]`
/// is performed explicitly on all exit paths.
pub fn submit_flush_sync(&mut self, nsid: u32) -> Result<()> {
let cid = self.alloc_cid()?;
let completion = SlabBox::new(Completion::new());
self.inflight[cid as usize] = Some(NvmeInflightCmd {
bio: core::ptr::null_mut(), // no bio — completion wakes waiter
prp_list: None,
dma_map: None,
opcode: NvmeIoOpcode::Flush as u8,
nsid,
retries: 0,
});
self.flush_waiters[cid as usize] = Some(completion);
// Submit the flush command. On error, clean up both slots.
if let Err(e) = self.submit_flush_cmd(nsid, cid) {
self.flush_waiters[cid as usize] = None;
self.inflight[cid as usize] = None;
return Err(e);
}
// Block until completion handler signals (TASK_KILLABLE).
self.flush_waiters[cid as usize].as_ref().unwrap().wait_killable();
// Cleanup: the completion handler has already processed the inflight
// entry (via .take()). Clear the waiter slot.
self.flush_waiters[cid as usize] = None;
Ok(())
}
}
/// Requeue a command after transient NVMe error (SC != 0, DNR == 0).
/// Re-submits the same command with the same CID. Increments retries,
/// rebuilds the SQ entry from the inflight state and the original bio,
/// and re-rings the doorbell. The inflight entry is re-inserted into the
/// tracking array (the caller must NOT have consumed the DMA mapping or
/// PRP list).
///
/// Takes owned `NvmeInflightCmd` because the caller `.take()`d it from
/// the inflight array. After rebuilding the SQ entry, the inflight is
/// stored back into `queue.inflight[cid]` so the completion handler
/// can find it when the retried command completes.
fn requeue_command(queue: &mut NvmeQueuePair, mut inflight: NvmeInflightCmd, cid: u16) {
inflight.retries += 1;
// Rebuild SQ entry (same packed CDW0 API as nvme_submit_io).
let cmd = &mut queue.sq[queue.sq_tail as usize];
*cmd = NvmeCommand::zeroed();
cmd.cdw0 = Le32::from_ne(
inflight.opcode as u32 | ((cid as u32) << 16),
);
cmd.nsid = Le32::from_ne(inflight.nsid);
// Read/write commands also need CDW10-12 (SLBA, NLB, FUA) rebuilt from
// the original bio; flush carries no bio and leaves CDW10-15 zero.
// SAFETY: the bio stays owned by the submitter until completion; the
// taken inflight entry grants exclusive access.
if !inflight.bio.is_null() {
let bio = unsafe { &*inflight.bio };
let slba = bio.start_lba;
cmd.cdw10 = Le32::from_ne(slba as u32);
cmd.cdw11 = Le32::from_ne((slba >> 32) as u32);
let total_bytes: u64 = bio.segments.iter().map(|s| s.len as u64).sum::<u64>()
+ bio.segments_ext.as_deref().map_or(0u64, |ext| ext.iter().map(|s| s.len as u64).sum());
// Block size comes from the controller's namespace table — the
// &NvmeNamespace passed to nvme_submit_io is not in scope here.
let block_size = queue.block_size_for(inflight.nsid);
let nlb = (total_bytes / block_size as u64) - 1;
let fua = if bio.flags.contains(BioFlags::FUA) { 1u32 << 30 } else { 0u32 };
cmd.cdw12 = Le32::from_ne(nlb as u32 | fua);
}
// Re-use existing DMA mapping and PRP list (still valid — NOT consumed).
if let Some(ref dma_map) = inflight.dma_map {
let addrs = dma_map.addresses();
if !addrs.is_empty() {
cmd.dptr[0] = Le64::from_ne(addrs[0]);
}
if addrs.len() == 2 {
cmd.dptr[1] = Le64::from_ne(addrs[1]);
} else if let Some(ref prp_list) = inflight.prp_list {
cmd.dptr[1] = Le64::from_ne(prp_list.phys_addr());
}
}
// Re-insert inflight entry so the completion handler finds it.
queue.inflight[cid as usize] = Some(inflight);
queue.sq_tail = (queue.sq_tail + 1) % queue.depth;
queue.pending_doorbells += 1;
nvme_ring_doorbell(queue);
}
Flush command: NvmeIoOpcode::Flush (opcode 0x00), no data pointers, NSID set.
CDW10-15 all zero. The controller commits volatile write cache to non-volatile media.
Flush inflight construction (must include opcode and nsid fields):
queue.inflight[cid as usize] = Some(NvmeInflightCmd {
bio: bio as *mut Bio,
prp_list: None,
dma_map: None,
opcode: NvmeIoOpcode::Flush as u8,
nsid: ns.nsid,
retries: 0,
});
Discard (DSM/Deallocate): NvmeIoOpcode::Dsm (opcode 0x09). CDW10 = number of
ranges - 1. CDW11 = attribute bits (bit 2 = Deallocate). The data buffer contains
an array of NvmeDsmRange entries:
/// Dataset Management range descriptor — 16 bytes per range.
/// All multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
#[repr(C)]
pub struct NvmeDsmRange {
/// Context attributes (optional hints).
pub attributes: Le32,
/// Number of logical blocks in this range.
pub length: Le32,
/// Starting LBA of this range.
pub slba: Le64,
}
// NVMe DSM range: attributes(4) + length(4) + slba(8) = 16 bytes.
const_assert!(core::mem::size_of::<NvmeDsmRange>() == 16);
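Assembling the payload from a discard bio's extents can be sketched as follows (a minimal sketch: plain host-endian integers stand in for the kernel's Le* wire types, and `build_deallocate` is an illustrative name; the 256-range cap follows from the 8-bit, 0-based NR field in CDW10):

```rust
/// Host-endian stand-in for NvmeDsmRange — the in-kernel struct uses
/// Le32/Le64 wire types; plain integers keep this sketch portable.
#[derive(Debug, Clone, PartialEq)]
struct DsmRange {
    attributes: u32,
    length: u32, // number of logical blocks in this range
    slba: u64,   // starting LBA of this range
}

/// Build the Deallocate payload plus CDW10 (number of ranges, 0-based)
/// from (slba, nlb) extents. The DSM NR field is 8 bits wide, so one
/// command covers at most 256 ranges; larger discards must be split.
fn build_deallocate(extents: &[(u64, u32)]) -> Option<(Vec<DsmRange>, u32)> {
    if extents.is_empty() || extents.len() > 256 {
        return None;
    }
    let ranges = extents
        .iter()
        .map(|&(slba, nlb)| DsmRange { attributes: 0, length: nlb, slba })
        .collect();
    Some((ranges, (extents.len() - 1) as u32)) // NR is 0-based
}
```

CDW11 bit 2 (Deallocate) is set separately, as described above.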
15.19.8 Interrupt Handling¶
MSI-X preferred: The driver requests one MSI-X vector per I/O queue plus one for the admin queue. This provides per-queue interrupt isolation — each I/O queue's completions are delivered to the CPU that owns that queue, avoiding cross-CPU interrupt migration. If MSI-X is unavailable, fall back to MSI (single vector, shared across all queues) then to INTx legacy (pin-based).
Interrupt coalescing: Configured via Set Features (Feature ID 0x08 — Interrupt Coalescing). Parameters: aggregation threshold (number of completions before an interrupt) and aggregation time (in 100μs units). Default: threshold = 8 completions, aggregation time = 10 ticks (1ms). Tunable per workload — latency-sensitive workloads disable coalescing; throughput workloads increase the threshold.
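The CDW11 encoding for Feature 0x08 can be sketched as follows (field positions per the NVMe Base Specification — THR in bits 7:0 is a 0's-based count, TIME in bits 15:8 counts 100 μs increments with 0 disabling the timer; `coalescing_cdw11` is an illustrative name):

```rust
/// Encode CDW11 for Set Features 0x08 (Interrupt Coalescing).
/// Bits 7:0  = THR, minimum completions before an interrupt (0's-based).
/// Bits 15:8 = TIME, maximum delay in 100 µs increments (0 = no timer).
fn coalescing_cdw11(threshold: u8, time_100us: u8) -> u32 {
    let thr = threshold.saturating_sub(1) as u32; // 0's-based field
    thr | ((time_100us as u32) << 8)
}
```

With the defaults above (8 completions, 10 ticks), this yields 0x0A07.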
Completion processing (see Section 3.6 for the formal Completion primitive).
Tier boundary: The NVMe driver runs in a Tier 1 domain. bio_complete()
is a Tier 0 function (Section 15.2). Per the Unified
Domain Model (Section 12.8), the driver cannot call bio_complete()
directly -- that would be a cross-domain direct call violating isolation.
Instead, the driver enqueues completion events on its outbound KABI completion
ring targeting the Tier 0 block layer. The Tier 0 block layer consumer dequeues
these events and calls bio_complete() in Tier 0 context. The bio pointer is
passed as an opaque cookie: u64 (cast from *mut Bio); the Tier 0 consumer
recovers the Bio reference and invokes bio_complete(bio, status).
Tier 0 boot path exception: During early boot (before Tier 1 promotion),
the NVMe driver runs in Tier 0 (Domain 0). In this mode, bio_complete() is
in the same domain and can be called directly. The completion path checks
self.domain_id == CORE_DOMAIN_ID to select the direct or ring path. After
promotion to Tier 1, all completions go through the outbound ring.
fn nvme_irq_handler(queue: &mut NvmeQueuePair) -> IrqReturn {
let mut completed = 0u32;
loop {
let cqe = &queue.cq[queue.cq_head as usize];
// Check phase bit -- if it doesn't match our expected phase,
// there are no more new completions.
let phase = (cqe.status.to_ne() & 1) != 0;
if phase != queue.cq_phase {
break;
}
// Read memory barrier -- ensure CQE fields are visible after phase check.
core::sync::atomic::fence(Acquire);
// Le16 fields must be converted to native before bit extraction.
let cid = cqe.cid.to_ne();
let status = cqe.status.to_ne();
let status_code = (status >> 1) & 0xFF; // SC: bits 8:1, 8-bit field
let status_type = (status >> 9) & 0x07; // SCT: bits 11:9, 3-bit field
let dnr = (status >> 15) & 1;
if let Some(inflight) = queue.inflight[cid as usize].take() {
// Success requires both SCT == 0 (generic status) and SC == 0 —
// SC 0x00 also appears within other status code types.
if status_type == 0 && status_code == 0 {
// Success path -- unmap DMA BEFORE signaling completion.
// After completion, the waiter may free the bio and reuse
// the data pages. The IOMMU mapping must be torn down first
// to prevent stale mappings.
if let Some(dma_map) = inflight.dma_map {
queue.dma_device.dma_unmap(dma_map);
}
if let Some(prp_list) = inflight.prp_list {
queue.free_prp_list(prp_list);
}
// Flush commands carry no bio (inflight.bio is null); their
// waiters are signaled below via flush_waiters.
if !inflight.bio.is_null() {
nvme_signal_completion(queue, inflight.bio, 0);
}
} else if dnr == 0 && inflight.retries < 3 {
// Transient error -- retry. Do NOT consume DMA mapping
// or PRP list: they are reused by the retried command.
// requeue_command takes ownership and re-inserts into
// queue.inflight[cid].
requeue_command(queue, inflight, cid);
} else {
// Permanent error or retries exhausted -- unmap DMA, complete.
if let Some(dma_map) = inflight.dma_map {
queue.dma_device.dma_unmap(dma_map);
}
if let Some(prp_list) = inflight.prp_list {
queue.free_prp_list(prp_list);
}
let errno = nvme_status_to_errno(status_type, status_code);
if !inflight.bio.is_null() {
nvme_signal_completion(queue, inflight.bio, errno);
}
}
// Check if there is a flush waiter for this CID.
if let Some(ref completion) = queue.flush_waiters[cid as usize] {
completion.signal();
}
}
// Advance CQ head. Toggle phase on wraparound.
queue.cq_head = (queue.cq_head + 1) % queue.depth;
if queue.cq_head == 0 {
queue.cq_phase = !queue.cq_phase;
}
completed += 1;
}
if completed > 0 {
// Update CQ head doorbell -- tells controller it can reuse CQ entries.
queue.regs.write32(queue.cq_doorbell_offset, queue.cq_head as u32);
IrqReturn::Handled
} else {
IrqReturn::None
}
}
/// Signal bio completion respecting the Tier 0/Tier 1 boundary.
///
/// In Tier 0 (boot, before promotion): calls `bio_complete()` directly.
/// In Tier 1 (post-promotion): enqueues a completion event on the
/// outbound KABI ring targeting the Tier 0 block layer consumer.
///
/// The Tier 0 block layer consumer ([Section 15.2](#block-io-and-volume-management))
/// dequeues the completion and calls `bio_complete(bio, status)`.
///
/// # Arguments
///
/// - `queue`: The NVMe queue pair (provides access to the outbound ring
/// handle and domain_id).
/// - `bio`: Raw pointer to the Bio being completed.
/// - `status`: 0 = success, negative = -errno.
fn nvme_signal_completion(
queue: &NvmeQueuePair,
bio: *mut Bio,
status: i32,
) {
if queue.domain_id == CORE_DOMAIN_ID {
// Tier 0 (boot path) -- same domain, direct call is safe.
// SAFETY: bio pointer was validated at submit_bio() time and
// stored in the inflight table. The inflight entry was consumed
// by take() above, so we have exclusive access.
let bio_ref = unsafe { &mut *bio };
bio_complete(bio_ref, status);
} else {
// Tier 1 (post-promotion) -- cross-domain, use outbound ring.
// Enqueue a T1CompletionEntry on the outbound KABI ring.
// The Tier 0 block layer consumer processes this and calls
// bio_complete() in Tier 0 context.
let entry = T1CompletionEntry {
cookie: bio as u64, // Bio pointer as opaque cookie.
status,
result_len: 0,
result_offset: 0,
_reserved: [0u8; 44],
};
// outbound_ring is always Some in Tier 1 mode (initialized at promotion).
let ring = queue.outbound_ring.as_ref()
.expect("outbound_ring must be Some in Tier 1 mode");
match ring.try_enqueue(&entry) {
Ok(()) => {}
Err(()) => {
// Outbound ring full -- this should not happen in practice
// because the ring is sized to match the queue depth. Log
// and mark the bio as failed (EIO). The Tier 0 block layer
// consumer will pick this up on the next drain cycle.
//
// This is a serious error -- it means completions are being
// generated faster than the Tier 0 consumer can drain them.
// The FMA subsystem is notified for diagnosis.
klog_err!("NVMe: outbound completion ring full, bio {:p} lost", bio);
// Cannot call bio_complete() from Tier 1 -- the bio may
// already be referenced by the Tier 0 submitter. The Tier 0
// block layer will time out the bio via its completion
// timeout mechanism.
}
}
}
}
/// Map NVMe status to errno.
fn nvme_status_to_errno(sct: u16, sc: u16) -> i32 {
match (sct, sc) {
(0, 0x02) => -EINVAL, // Invalid Field in Command
(0, 0x80) => -EREMOTEIO, // LBA Out of Range (addressing error, maps to BLK_STS_TARGET per Linux)
(0, 0x81) => -ENOSPC, // Capacity Exceeded
(0, 0x82) => -EIO, // Namespace Not Ready
(2, 0x81) => -EIO, // Unrecovered Read Error
(2, 0x82) => -EIO, // Write Fault
(2, 0x83) => -EIO, // Deallocated/Unwritten Logical Block
(2, 0x84) => -ENODATA, // End-to-End Guard Check Error
(2, 0x85) => -ENODATA, // End-to-End Application Tag Check Error
(2, 0x86) => -ENODATA, // End-to-End Reference Tag Check Error
(3, 0x00) => -ENXIO, // Internal Path Error
(3, 0x01) => -ENXIO, // Asymmetric Access Persistent Loss
(3, 0x02) => -EAGAIN, // Asymmetric Access Inaccessible
(3, 0x03) => -EAGAIN, // Asymmetric Access Transition
_ => -EIO, // All other errors
}
}
15.19.9 Error Recovery¶
NVMe error recovery operates at three levels:
Command-level retry: On transient errors (DNR=0 in completion status), the driver
re-submits the command up to 3 times. Transient errors include path errors
(ANA transitions), abort due to SQ deletion, and internal controller errors.
Permanent errors (DNR=1) are reported to the block layer immediately.
Controller-level reset: Triggered by controller fatal status (CSTS.CFS=1),
command timeout (no completion within 30 seconds), or unrecoverable command errors:
1. Set error_state = Recovering. New I/O submissions return EAGAIN.
2. Disable controller: clear CC.EN. Poll for CSTS.RDY = 0 (timeout = CAP.TO × 500ms). If the timeout expires, perform a PCI function-level reset (FLR) via PCIE_CAP + 0x08.
3. Delete all I/O queues (the controller forgets them on reset).
4. Rebuild the admin queue: rewrite ASQ, ACQ, AQA. Set CC.EN = 1. Wait for CSTS.RDY = 1.
5. Re-identify the controller and namespaces (configuration may have changed).
6. Re-create I/O queues via Create I/O CQ / Create I/O SQ admin commands.
7. Replay in-flight commands: the block layer retains all bios that were submitted but not completed. After queue re-creation, these bios are re-submitted through the normal submit_bio() path.
8. Set error_state = Normal. Resume accepting submissions.
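The CSTS.RDY polling above is bounded by CAP.TO (CAP register bits 31:24, in 500ms units). A minimal sketch of the timeout derivation and poll loop (helper names illustrative; the iteration counter stands in for the kernel's delay/clock primitives):

```rust
/// Worst-case controller enable/disable timeout derived from the CAP
/// register: CAP.TO (bits 31:24) is expressed in 500 ms units.
fn ready_timeout_ms(cap: u64) -> u64 {
    ((cap >> 24) & 0xFF) * 500
}

/// Poll a readiness predicate until it holds or the deadline passes.
/// `elapsed_ms` counts polling iterations at a 1 ms cadence (a stand-in
/// for sleeping 1 ms between CSTS reads in the kernel).
fn poll_ready(mut ready: impl FnMut() -> bool, timeout_ms: u64) -> bool {
    let mut elapsed_ms = 0;
    while elapsed_ms <= timeout_ms {
        if ready() {
            return true;
        }
        elapsed_ms += 1; // in-kernel: sleep 1 ms between register reads
    }
    false
}
```

A controller reporting CAP.TO = 0x20 thus gets up to 16 seconds to clear (or set) RDY before the FLR fallback fires.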
Async Event Notification (AEN): The driver maintains AERL+1 outstanding
Async Event Requests with the controller. When the controller detects a noteworthy
condition, it completes an AER with the event type:
| Event Type | Action |
|---|---|
| Error (0x00) — persistent internal error | Read Error Log (Log Page 0x01), report via FMA |
| SMART/Health (0x01) — threshold exceeded | Read SMART Log (Log Page 0x02), report temperature/wear via FMA |
| Notice (0x02) — namespace attribute changed | Re-identify affected namespace |
| Notice (0x02) — firmware activation starting | Quiesce I/O, wait for activation complete |
| I/O command set specific (0x06) — zone changed | Refresh zone descriptors for affected namespace |
After processing each AEN, the driver resubmits a replacement Async Event Request to maintain the outstanding AER count.
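Dispatch begins by unpacking Dword 0 of the AER completion. A minimal sketch (field positions per the NVMe Base Specification; `decode_aen` is an illustrative name):

```rust
/// Unpack Dword 0 of an Async Event Request completion:
/// bits 2:0 = event type, bits 15:8 = event info,
/// bits 23:16 = log page to read for details.
fn decode_aen(dw0: u32) -> (u8, u8, u8) {
    (
        (dw0 & 0x07) as u8,
        ((dw0 >> 8) & 0xFF) as u8,
        ((dw0 >> 16) & 0xFF) as u8,
    )
}
```

A SMART/Health event (type 0x01) pointing at the SMART log (0x02) decodes to (1, info, 2), which drives the dispatch table above.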
15.19.10 Namespace Management¶
Multi-namespace controllers expose multiple independent block devices. Each namespace has its own LBA space, block size, and capabilities.
Namespace attachment/detachment: Admin command NsAttach (opcode 0x15), with CDW10
selecting the action: 0x00 = attach, 0x01 = detach. The data buffer contains a controller list
specifying which controllers the namespace is attached to. On detachment, the driver
unregisters the corresponding BlockDevice from umka-block and fails pending bios
with ENXIO.
Format NVM: Admin command FormatNvm (opcode 0x80) performs a low-level format on
a namespace. CDW10 specifies the target LBAF index and secure erase setting. This is
a destructive operation — all data in the namespace is lost. The driver blocks I/O to
the namespace during format (which may take minutes for large devices), then
re-identifies the namespace to pick up the new LBA format.
15.19.11 Power State Management¶
NVMe controllers define multiple power states (PS0 = highest performance, PS4+ = deepest idle). Each power state specifies maximum power consumption and entry/exit latencies.
/// NVMe power state descriptor — from Identify Controller (bytes 2048+).
/// 32 bytes per entry, up to 32 power states (NVMe 2.0 Figure 275).
/// Multi-byte fields are little-endian (DMA-returned from controller).
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
///
/// Field layout matches Linux `struct nvme_id_power_state` (include/linux/nvme.h)
/// and the NVMe Base Specification.
#[repr(C)]
pub struct NvmePowerStateDesc {
/// Maximum power (bytes 0-1). Units depend on `flags` MPS bit:
/// MPS=0 → centiwatts (0.01 W), MPS=1 → milliwatts (0.001 W).
pub max_power: Le16,
/// Byte 2: reserved.
pub _rsvd2: u8,
/// Byte 3: Flags.
/// Bit 0 = MPS (Max Power Scale: 0=centiwatts, 1=milliwatts).
/// Bit 1 = NOPS (Non-Operational State: 1=non-operational).
pub flags: u8,
/// Entry Latency in microseconds (bytes 4-7).
pub entry_lat_us: Le32,
/// Exit Latency in microseconds (bytes 8-11).
pub exit_lat_us: Le32,
/// Relative Read Throughput (byte 12, 0 = best within this power state).
pub rrt: u8,
/// Relative Read Latency (byte 13, 0 = best).
pub rrl: u8,
/// Relative Write Throughput (byte 14, 0 = best).
pub rwt: u8,
/// Relative Write Latency (byte 15, 0 = best).
pub rwl: u8,
/// Idle Power consumption (bytes 16-17). Units: see `idle_scale`.
pub idle_power: Le16,
/// Byte 18: Idle Power Scale (bits 1:0). 0=not reported, 1=0.0001W, 2=0.01W.
pub idle_scale: u8,
/// Byte 19: reserved.
pub _rsvd19: u8,
/// Active Power consumption (bytes 20-21). Units: see `active_work_scale`.
pub active_power: Le16,
/// Byte 22: Active Power Workload (bits 2:0) + Active Power Scale (bits 7:6).
/// Workload: 0=not reported, 1=workload #1, 2=workload #2.
/// Scale: 0=not reported, 1=0.0001W, 2=0.01W.
pub active_work_scale: u8,
/// Bytes 23-31: reserved.
pub _rsvd23: [u8; 9],
// Layout: 2+1+1+4+4+1+1+1+1+2+1+1+2+1+9 = 32 bytes.
}
// NVMe power state descriptor: 32 bytes per entry.
const_assert!(core::mem::size_of::<NvmePowerStateDesc>() == 32);
/// Runtime power state tracking.
pub struct NvmePowerState {
/// Current operational power state (0-based index).
pub current_ps: u8,
/// Number of supported power states (from Identify Controller NPSS+1).
pub num_states: u8,
/// Power state descriptors (cached from Identify Controller).
pub states: ArrayVec<NvmePowerStateDesc, 32>,
/// APST (Autonomous Power State Transitions) enabled.
pub apst_enabled: bool,
/// APST transition table: for each idle threshold, target power state.
pub apst_table: ArrayVec<NvmeApstEntry, 32>,
}
/// APST table entry — configures automatic idle power state transition.
pub struct NvmeApstEntry {
/// Idle time threshold in milliseconds before transitioning to target_ps.
pub idle_threshold_ms: u32,
/// Target power state for this idle threshold.
pub target_ps: u8,
}
APST (Autonomous Power State Transitions): When supported (Identify Controller APSTA bit), the controller autonomously transitions between power states based on idle time. The driver programs the APST table via Set Features (Feature ID 0x0C — Autonomous Power State Transition). UmkaOS programs a conservative table: 100ms idle → PS1, 500ms → PS2, 2s → PS3. Non-operational states (NOPS=1) are excluded from the APST table — these states halt I/O processing and require explicit host-initiated transition.
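The table handed to Set Features 0x0C packs one 64-bit entry per transition. A sketch of the encoding (per the NVMe Base Specification's APST data structure — bits 7:3 = ITPS, the target power state; bits 31:8 = ITPT, the idle time in milliseconds; `apst_entry` is an illustrative name):

```rust
/// Encode one 64-bit APST table entry: bits 7:3 = ITPS (target power
/// state), bits 31:8 = ITPT (idle time prior to transition, in
/// milliseconds). All other bits are reserved and left zero.
fn apst_entry(target_ps: u8, idle_ms: u32) -> u64 {
    (((target_ps & 0x1F) as u64) << 3) | (((idle_ms & 0x00FF_FFFF) as u64) << 8)
}
```

The conservative policy above becomes the entries apst_entry(1, 100), apst_entry(2, 500), apst_entry(3, 2000), written into the feature's data buffer.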
Integration with runtime PM (Section 7.5):
The NVMe driver registers with the runtime PM framework. On runtime_suspend(), the
driver sets the deepest non-operational power state via Set Features (Feature ID 0x02 —
Power Management, CDW11 = target power state). On runtime_resume(), the driver
transitions back to PS0. The autosuspend delay defaults to 5 seconds for NVMe.
System suspend path: Flush volatile write cache (Flush command), then set shutdown
notification (CC.SHN = 01b for normal shutdown). Poll CSTS.SHST until it reads
10b (shutdown complete). On resume: re-enable controller (CC.EN), wait for CSTS.RDY,
re-create queues.
15.19.12 Tier 1 Isolation Integration¶
The NVMe driver runs as a Tier 1 driver — Ring 0 execution with hardware memory domain isolation (MPK on x86-64, POE on AArch64 where available, page table isolation as fallback). See Section 11.9 for the complete Tier 1 recovery protocol.
DMA isolation: All DMA buffers (SQ, CQ, PRP lists, data buffers) are mapped through the IOMMU. The NVMe controller's PCI function is assigned a dedicated IOMMU domain. The IOMMU page table restricts the controller to accessing only memory regions explicitly mapped for NVMe I/O — it cannot read or write arbitrary physical memory. On architectures without IOMMU (rare for NVMe-capable systems), DMA buffers are allocated from physically contiguous regions and the bounce buffer (SWIOTLB) path is used.
Crash recovery sequence:
- Fault detection: Hardware memory domain fault (MPK/POE violation), null pointer dereference, kernel OOPS within the NVMe driver domain, or watchdog timeout.
- Domain isolation: The faulting Tier 1 domain is immediately isolated — its memory domain key is revoked. No other kernel subsystem is affected.
- Controller quiesce: Assert PCI FLR (Function-Level Reset) to halt all DMA. The IOMMU domain prevents any stale DMA from reaching memory after reset.
- Driver reload: The KABI framework loads a fresh copy of the NVMe driver into a new Tier 1 domain. The driver re-initializes following the full 8-step sequence.
- I/O replay: The block layer replays all in-flight bios that were submitted but not completed before the crash. The new driver instance processes them normally.
- Recovery time: ~50-150ms (dominated by controller reset + queue re-creation). The block layer's retry queue absorbs the gap — filesystems and applications see a brief latency spike, not an error.
15.19.13 Zoned Namespaces (ZNS)¶
Zoned Namespaces (NVMe ZNS, TP 4053) divide the namespace into sequential-write zones. Within each zone, writes must proceed sequentially from the zone write pointer. This aligns with the erase-block behavior of NAND flash, enabling the SSD controller to eliminate the Flash Translation Layer (FTL) and reduce write amplification.
/// ZNS namespace information (from Identify Namespace, Zoned fields).
pub struct NvmeZnsInfo {
/// Zone size in logical blocks (fixed for all zones in the namespace).
pub zone_size_blocks: u64,
/// Maximum open zones allowed simultaneously. 0 = no limit.
pub max_open_zones: u32,
/// Maximum active zones. 0 = no limit.
pub max_active_zones: u32,
/// Zone append size limit in logical blocks (ZASL from Identify Controller ZNS).
/// Maximum data size for a single Zone Append command.
pub zone_append_size_limit: u32,
}
/// Zone descriptor — returned by Zone Management Receive (Report Zones).
/// Layout per ZNS Command Set Specification 1.1b, Figure 40.
/// Total: 64 bytes.
#[repr(C)]
pub struct NvmeZoneDescriptor {
/// Byte 0: Zone type. 0x02 = Sequential Write Required (SWR).
pub zone_type: u8,
/// Byte 1: Zone state (bits 7:4; bits 3:0 are reserved).
/// Zone state values (upper nibble): 0x1=Empty, 0x2=ImplicitlyOpened,
/// 0x3=ExplicitlyOpened, 0x4=Closed, 0xD=ReadOnly, 0xE=Full,
/// 0xF=Offline (0x0 is reserved).
pub zone_state: u8,
/// Byte 2: Zone attributes.
/// bit 0 = Zone Finished by Controller (ZFC).
/// bit 1 = Finish Zone Recommended (FZR).
/// bit 2 = Reset Zone Recommended (RZR).
/// bit 7 = Zone Descriptor Extension Valid (ZDEV).
pub zone_attrs: u8,
/// Bytes 3-7: Reserved.
pub _rsvd: [u8; 5],
/// Bytes 8-15: Zone capacity in logical blocks (may be < zone_size).
pub zone_capacity: Le64,
/// Bytes 16-23: Zone Start LBA.
pub zone_start_lba: Le64,
/// Bytes 24-31: Write Pointer — next LBA for sequential writes.
/// 0xFFFF_FFFF_FFFF_FFFF if invalid (zone in ReadOnly or Offline state).
pub write_pointer: Le64,
/// Bytes 32-63: Reserved.
pub _rsvd2: [u8; 32],
}
const_assert!(core::mem::size_of::<NvmeZoneDescriptor>() == 64);
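Decoding the state nibble of a returned descriptor can be sketched as follows (a minimal sketch; note that the ZNS spec reserves 0h in the state nibble, so Empty is 1h — names here are illustrative):

```rust
/// Zone states, per the ZNS Command Set Specification (upper nibble of
/// descriptor byte 1; value 0h is reserved).
#[derive(Debug, PartialEq)]
enum ZoneState {
    Empty,
    ImplicitlyOpened,
    ExplicitlyOpened,
    Closed,
    ReadOnly,
    Full,
    Offline,
    Reserved,
}

/// Extract the zone state from descriptor byte 1 (bits 7:4).
fn zone_state(byte1: u8) -> ZoneState {
    match byte1 >> 4 {
        0x1 => ZoneState::Empty,
        0x2 => ZoneState::ImplicitlyOpened,
        0x3 => ZoneState::ExplicitlyOpened,
        0x4 => ZoneState::Closed,
        0xD => ZoneState::ReadOnly,
        0xE => ZoneState::Full,
        0xF => ZoneState::Offline,
        _ => ZoneState::Reserved,
    }
}
```

The attribute bits in byte 2 (ZFC, FZR, RZR, ZDEV) are extracted independently of the state nibble.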
Zone operations via Zone Management Send (opcode 0x79):
| CDW13 Action | Operation | Description |
|---|---|---|
| 0x01 | Close Zone | Transition zone from Open to Closed. Frees active zone resources. |
| 0x02 | Finish Zone | Transition zone to Full without writing the remaining capacity; unwritten blocks read back as deallocated. |
| 0x03 | Open Zone | Explicitly open a zone for writing. |
| 0x04 | Reset Zone | Reset zone write pointer to start. Zone becomes Empty. All data lost. |
| 0x05 | Offline Zone | Take zone offline (administrative action). |
Zone Append (opcode 0x7D): Write data to a zone without specifying an exact LBA.
The controller appends data at the current write pointer and returns the actual
written LBA in the completion entry (result field). This eliminates host-side write
pointer tracking contention — multiple threads can zone-append concurrently, and the
controller serializes them.
Filesystem integration: ZNS namespaces register with umka-block as zoned block
devices. Zone-aware filesystems (F2FS, btrfs zoned mode) issue zone commands through
the BlockDeviceOps interface:
- BioOp::ZoneAppend maps to NVMe Zone Append (opcode 0x7D).
- Zone management (open/close/finish/reset) is exposed via a separate
zone_mgmt(&self, zone_slba: u64, action: ZoneAction) -> Result<()> method on
BlockDeviceOps.
- Zone report queries are exposed via
report_zones(&self, start_lba: u64, buf: &mut [ZoneDescriptor]) -> Result<usize>.
15.19.14 NVMe-oF Fabrics Bridge¶
The local NVMe driver and the NVMe-oF subsystem (Section 15.13) share core abstractions:
Shared types: NvmeCommand (64-byte submission entry) and NvmeCompletion
(16-byte completion entry) are the identical wire format for both local PCIe and
fabric transports. The NvmeIoOpcode enum is transport-agnostic. Namespace
identification (nsid: u32) is common to both paths.
NVMe-oF Target passthrough: When the NVMe-oF target operates in passthrough mode
(exporting a local NVMe namespace to remote hosts), it submits NVMe commands received
from the fabric directly to the local NVMe controller's I/O queues — bypassing the
block layer entirely. The NvmeCommand from the fabric capsule is validated (opcode
whitelist, NSID check, LBA bounds) and then placed in the local SQ. Completions are
forwarded back through the fabric transport.
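The three validation gates named above (opcode whitelist, NSID check, LBA bounds) can be sketched as a pure function. The struct, field names, and the exact whitelist contents are illustrative assumptions; the opcode values themselves (Flush 0x00, Write 0x01, Read 0x02, Write Zeroes 0x08, DSM 0x09) are the standard NVMe I/O opcodes:

```rust
/// Sketch of NVMe-oF passthrough capsule validation. Hypothetical types;
/// the real path validates the full 64-byte NvmeCommand in place.
pub struct FabricCapsule {
    pub opcode: u8,
    pub nsid: u32,
    pub slba: u64,
    pub nlb: u16, // zero-based block count, as in NVMe read/write commands
}

pub fn validate_passthrough(
    cmd: &FabricCapsule,
    exported_nsid: u32,
    ns_capacity_blocks: u64,
) -> Result<(), &'static str> {
    // 1. Opcode whitelist: only plain I/O opcodes cross the fabric boundary.
    const ALLOWED: &[u8] = &[0x00, 0x01, 0x02, 0x08, 0x09];
    if !ALLOWED.contains(&cmd.opcode) {
        return Err("opcode not whitelisted");
    }
    // 2. NSID check: the capsule may only address the exported namespace.
    if cmd.nsid != exported_nsid {
        return Err("nsid mismatch");
    }
    // 3. LBA bounds: slba + (nlb + 1) must not exceed namespace capacity.
    let blocks = cmd.nlb as u64 + 1;
    if cmd.slba.checked_add(blocks).map_or(true, |end| end > ns_capacity_blocks) {
        return Err("lba out of bounds");
    }
    Ok(())
}
```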
Unified namespace model: Both local and remote NVMe namespaces appear as
BlockDevice instances in umka-block. The block layer, volume manager, and
filesystems are agnostic to whether a namespace is local (PCIe) or remote
(NVMe-oF/TCP, NVMe-oF/RDMA). The BlockDeviceInfo returned by each path
reflects the true capabilities — local NVMe reports hardware FUA support while
NVMe-oF/TCP does not (flush is required).
15.19.15 BlockDeviceOps Implementation¶
/// Per-namespace block device wrapper. One `NvmeBlockDevice` is created per
/// NVMe namespace discovered during controller initialization. Registered
/// with the block layer via `register_block_device()`. The block layer
/// holds `Arc<dyn BlockDeviceOps>` which points to this struct.
pub struct NvmeBlockDevice {
/// Reference to the parent NVMe controller (shared across all namespaces).
pub ctrl: Arc<NvmeController>,
/// Namespace metadata (NSID, capacity, format, features).
pub ns: NvmeNamespace,
/// NUMA node closest to this controller's PCIe slot (for allocation affinity).
pub numa_node: u32,
}
impl BlockDeviceOps for NvmeBlockDevice {
fn submit_bio(&self, bio: &mut Bio) -> Result<()> {
if self.ctrl.error_state.load(Acquire) != 0 {
return Err(Error::IO); // Controller in error recovery
}
let queue_idx = arch::current::cpu::id() % self.ctrl.io_queues.len();
// Each NvmeQueuePair is wrapped in SpinLock for interior mutability
// (submit_bio takes &self; queue mutation needs &mut through &self).
// Uncontended in the per-CPU case (~5-10 ns).
let mut queue = self.ctrl.io_queues[queue_idx].lock();
match bio.op {
BioOp::Read | BioOp::Write => {
nvme_submit_io(queue, bio, &self.ns)
}
BioOp::Flush => {
if self.ctrl.volatile_write_cache {
// Allocate an inflight slot for the flush command so
// that the CQE completion handler can map the CID back
// to this bio and signal completion. Without this, the
// flush bio's StackWaiter would never be woken.
let cid = queue.alloc_cid()?;
queue.inflight[cid as usize] = Some(NvmeInflightCmd {
bio: bio as *mut Bio,
dma_map: None,
prp_list: None,
opcode: NvmeIoOpcode::Flush as u8,
nsid: self.ns.nsid,
retries: 0,
});
queue.submit_flush_cmd(self.ns.nsid, cid)
} else {
// No volatile cache -- flush is a no-op. Signal
// completion immediately via the tier-aware path.
nvme_signal_completion(&queue, bio as *mut Bio, 0);
Ok(())
}
}
BioOp::Discard => {
if self.ns.supports_deallocate {
queue.submit_dsm_deallocate(bio, self.ns.nsid)
} else {
Err(Error::NOSYS)
}
}
BioOp::WriteZeroes => {
queue.submit_write_zeroes(bio, self.ns.nsid)
}
BioOp::ZoneAppend => {
if self.ns.zns.is_some() {
queue.submit_zone_append(bio, self.ns.nsid)
} else {
Err(Error::NOSYS)
}
}
}
}
fn flush(&self) -> Result<()> {
if !self.ctrl.volatile_write_cache { return Ok(()); }
let queue_idx = arch::current::cpu::id() % self.ctrl.io_queues.len();
self.ctrl.io_queues[queue_idx].lock().submit_flush_sync(self.ns.nsid)
}
fn discard(&self, start_lba: u64, len_sectors: u64) -> Result<()> {
if !self.ns.supports_deallocate { return Err(Error::NOSYS); }
let queue_idx = arch::current::cpu::id() % self.ctrl.io_queues.len();
self.ctrl.io_queues[queue_idx].lock()
.submit_dsm_range(self.ns.nsid, start_lba, len_sectors)
}
fn get_info(&self) -> BlockDeviceInfo {
BlockDeviceInfo {
logical_block_size: self.ns.block_size,
physical_block_size: self.ns.block_size, // NVMe: logical == physical
capacity_sectors: self.ns.capacity_blocks
* (self.ns.block_size as u64 / 512),
max_segments: if self.ctrl.max_transfer_size > 0 {
(self.ctrl.max_transfer_size / self.ctrl.page_size) as u16
} else {
256 // Default: 256 pages = 1MB at 4K pages
},
max_bio_size: self.ctrl.max_transfer_size,
flags: {
let mut f = BlockDeviceFlags::empty();
if self.ns.supports_deallocate { f |= BlockDeviceFlags::DISCARD; }
if self.ctrl.volatile_write_cache { f |= BlockDeviceFlags::FLUSH; }
f |= BlockDeviceFlags::FUA; // NVMe always supports FUA (CDW12 bit 30)
f
},
optimal_io_size: if self.ns.optimal_io_boundary > 0 {
self.ns.optimal_io_boundary as u32 * self.ns.block_size
} else {
self.ns.block_size
},
numa_node: self.ctrl.numa_node,
}
}
fn shutdown(&self) -> Result<()> {
// Flush volatile write cache.
self.flush()?;
// Normal shutdown notification.
let cc = self.ctrl.regs.read32(0x14); // CC register
self.ctrl.regs.write32(0x14, (cc & !0xC000) | 0x4000); // SHN=01
// Poll CSTS.SHST for shutdown complete (10b).
let timeout = self.ctrl.cap.timeout as u64 * 500;
poll_until(timeout, || {
let csts = self.ctrl.regs.read32(0x1C);
((csts >> 2) & 0x3) == 0x2 // CSTS.SHST == 10b (shutdown complete)
})
}
}
15.19.16 KABI Driver Manifest¶
[driver]
name = "nvme"
version = "1.0.0"
tier = 1
bus-type = "pci"
[match]
pci-class = "01:08:02" # Mass Storage / NVM Express / NVM Express I/O Controller
[capabilities]
dma = true
interrupts = "msi-x" # One vector per I/O queue + 1 admin; falls back to MSI, then INTx
max-memory = "16MB" # SQ/CQ buffers + PRP list pools (scales with queue count)
[recovery]
crash-action = "reload"
state-preservation = true # Replay in-flight bios on reload
max-reload-time-ms = 200
15.19.17 Design Decisions¶
| Decision | Rationale |
|---|---|
| Tier 1 (not Tier 2) | NVMe is the primary storage path. ~2-5 μs per I/O on fast SSDs. Ring 3 crossing adds ~5-15 μs — doubling or tripling latency is unacceptable. |
| One I/O queue per CPU | NVMe SQs have no locking — each CPU writes to its own SQ tail doorbell without contention. This is the design NVMe was built for. |
| PRP over SGL | PRP is mandatory in the NVMe base spec; SGL is optional. PRP is simpler (array of page-aligned addresses) and sufficient for block I/O. SGL is used only for NVMe-oF passthrough where the fabric provides SGLs. |
| Pre-allocated PRP list pool | Each queue pre-allocates a pool of PRP list pages at init time. No heap allocation on the I/O hot path. Pool size = queue depth (one PRP list per possible in-flight command). |
| Conservative APST table | Aggressive power transitions cause latency spikes on some controllers. UmkaOS defaults to conservative thresholds (100ms/500ms/2s) and lets userspace tune via sysfs if needed. |
| MSI-X per queue | Per-queue interrupt vectors eliminate the CQ polling fan-out that single-vector modes require. On a 32-queue controller, a single MSI vector would force scanning all 32 CQs on every interrupt. |
| 30-second command timeout | NVMe spec does not define a command timeout. 30 seconds is the Linux default and covers worst-case garbage collection stalls on consumer SSDs. Configurable via the nvme_io_timeout_ms module parameter (default 30000). For consumer TLC/QLC workloads with heavy GC, operators may increase to 120000 (120s). |
| IOMMU mandatory for DMA | NVMe DMA without IOMMU means the controller can write to any physical address — a firmware bug or malicious device could corrupt kernel memory. IOMMU containment is a Tier 1 requirement. |
15.19.18 Error Recovery¶
NVMe error recovery handles both transient command failures and catastrophic controller events. The recovery sequence is modeled as a state machine:
/// NVMe controller error recovery states.
///
/// State machine: Normal → ErrorDetected → ControllerReset → QueueReinit → ReplayIo → Normal
/// ↘ Fatal (if reset fails)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NvmeRecoveryState {
/// Normal operation — no error recovery in progress.
Normal,
/// Error detected; new I/O submissions blocked. In-flight commands
/// await timeout or controller-reported failure.
ErrorDetected,
/// Controller reset in progress (CC.EN = 0 → wait CSTS.RDY = 0 → CC.EN = 1).
ControllerReset,
/// Re-creating admin and I/O queue pairs after reset.
QueueReinit,
/// Replaying in-flight I/O that was interrupted by the reset.
ReplayIo,
/// Unrecoverable: controller failed to reset within `CAP.TO * 500ms`.
/// Block device returns `-EIO` for all subsequent I/O.
Fatal,
}
/// Error sources that trigger recovery.
pub enum NvmeErrorSource {
/// Completion entry with non-zero status code.
CompletionError {
/// CQE status field: bits [15:1] = status code, bit 0 = phase tag.
status: u16,
/// Command ID that failed.
cid: u16,
/// Queue pair that reported the error.
qid: u16,
},
/// CSTS.CFS (Controller Fatal Status) bit set — controller firmware crash.
ControllerFatalStatus,
/// Command timeout: no completion received within `nvme_io_timeout_ms`.
CommandTimeout { cid: u16, qid: u16 },
/// PCIe AER (Advanced Error Reporting) event forwarded by the PCIe subsystem.
/// Typical events: Uncorrectable Internal Error, Completion Timeout, Data Link
/// Protocol Error, Poisoned TLP.
AerEvent { severity: AerSeverity, error_type: u32 },
}
/// AER severity levels.
#[derive(Clone, Copy)]
pub enum AerSeverity {
/// Correctable: logged, no recovery needed.
Correctable,
/// Non-fatal uncorrectable: device may still respond; attempt reset.
NonFatal,
/// Fatal: device is non-responsive; attempt reset with escalation to
/// PCIe Function Level Reset (FLR) if the standard NVMe reset fails.
Fatal,
}
Controller reset sequence (runs in the NVMe driver's recovery workqueue):
nvme_controller_reset(ctrl):
1. Set ctrl.state = ControllerReset
2. Block new submissions: set NVME_CTRL_RESETTING flag, drain KABI ring
3. CC.EN = 0 (disable controller)
4. Poll CSTS.RDY == 0 with timeout = CAP.TO * 500ms
- If timeout: escalate to PCI FLR (pci_reset_function())
- If FLR fails: ctrl.state = Fatal, return
5. Re-configure CC (MPS, AMS, IOSQES, IOCQES)
6. CC.EN = 1 (re-enable controller)
7. Poll CSTS.RDY == 1 with timeout = CAP.TO * 500ms
8. Re-create admin queue pair, issue Identify Controller
9. Re-create I/O queue pairs (one per CPU), re-register interrupt vectors
10. ctrl.state = ReplayIo
11. For each in-flight I/O (tracked in per-queue command ID bitmap):
- Re-submit the command (bio is preserved in the driver's shadow ring)
- Original completion callback fires when the replayed command completes
12. ctrl.state = Normal
Asynchronous Event Reporting (AER/AEN): The driver posts one AEN (Asynchronous
Event Request) admin command at init time. The controller sends a completion when an
asynchronous event occurs (error, SMART threshold, namespace change, firmware
activation, etc.). On receipt, the driver logs the event, handles it (e.g., triggers
controller reset for critical errors, re-scans namespaces for NS_ATTR_CHANGED), and
immediately re-posts a new AEN command to receive the next event.
15.19.19 Autonomous Power State Transitions (APST)¶
NVMe controllers support multiple power states (up to 32; by convention PS0 is highest performance, with higher-numbered states progressively deeper sleep). APST allows the controller to autonomously transition to lower power states after configurable idle periods.
/// APST table entry (one per power state transition).
/// Written to the controller via Set Features (Feature 0x0C) as DMA data.
/// Multi-byte fields are little-endian per NVMe Base Specification 2.1.
/// Le* types ([Section 6.1](06-dsm.md#dsm-foundational-types--wire-format-integer-types)) enforce
/// correct byte order on all eight supported architectures.
///
/// UmkaOS writes up to 32 entries (one per supported non-operational
/// power state). The controller transitions to the target state after
/// the specified idle time and transitions back to operational on any
/// new command submission.
#[repr(C)]
pub struct ApstEntry {
/// Idle time threshold in microseconds before transitioning.
/// 0 = disable this transition.
pub idle_transition_us: Le32,
/// Idle Transition Power State (ITPS) — target power state index.
pub idle_transition_ps: u8,
pub _reserved: [u8; 3],
}
// NVMe APST entry: idle_transition_us(4) + idle_transition_ps(1) + _reserved(3) = 8 bytes.
const_assert!(core::mem::size_of::<ApstEntry>() == 8);
/// Default APST table. Conservative thresholds suitable for server workloads.
/// Desktop/laptop use cases may use more aggressive values via sysfs.
pub const DEFAULT_APST: &[(u32, u8)] = &[
// (idle_ms, target_power_state)
(100, 1), // After 100ms idle → PS1 (slightly reduced performance)
(500, 2), // After 500ms idle → PS2 (moderate power saving)
(2000, 3), // After 2s idle → PS3 (deep idle, 50-500ms exit latency)
];
APST configuration sequence (during controller init, after Identify Power State descriptors are parsed):
- Read Identify Controller power state descriptors (PS0-PS31).
- Filter: keep only non-operational states with `exit_latency_us < apst_max_latency_us` (sysctl `nvme.apst_max_latency_us`, default 25000 = 25 ms).
- Build the APST table from `DEFAULT_APST`, mapping each target state to the highest-numbered non-operational state whose exit latency is within the threshold.
- Issue Set Features (Feature 0x0C, APSTE=1) with the computed table.
- If the controller rejects APST (Invalid Field in Command), log a warning and continue without power management (some consumer NVMe controllers have buggy APST firmware).
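The filter-and-map step above can be sketched as a pure function. `PowerStateDesc` and the function name are illustrative assumptions; the real driver parses these descriptors out of the Identify Controller data structure:

```rust
/// Sketch: power state descriptor as parsed from Identify Controller.
pub struct PowerStateDesc {
    pub index: u8,
    pub non_operational: bool,
    pub exit_latency_us: u32,
}

/// Build (idle_transition_us, target_power_state) pairs from the defaults,
/// clamping each nominal target to the deepest (highest-numbered)
/// non-operational state whose exit latency is within the budget.
/// Returns an empty table if no state passes the latency filter.
pub fn build_apst_table(
    states: &[PowerStateDesc],
    max_latency_us: u32,
    defaults: &[(u32, u8)], // (idle_ms, nominal target) as in DEFAULT_APST
) -> Vec<(u32, u8)> {
    let deepest = states
        .iter()
        .filter(|ps| ps.non_operational && ps.exit_latency_us < max_latency_us)
        .map(|ps| ps.index)
        .max();
    let Some(deepest) = deepest else { return Vec::new() };
    defaults
        .iter()
        .map(|&(idle_ms, nominal)| (idle_ms * 1000, nominal.min(deepest)))
        .collect()
}
```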
Sysfs interface (/sys/class/nvme/nvmeN/power/):
| File | Description |
|---|---|
| `pm_policy` | `default` (APST) or `none` (disabled) |
| `apst_max_latency_us` | Maximum acceptable exit latency for APST transitions |
| `power_state` | Current power state (read-only, queried via Get Features) |
15.20 fscrypt — File-Level Encryption¶
fscrypt is the Linux filesystem-level encryption subsystem. It encrypts file contents and filenames on a per-directory policy basis, transparently to applications: userspace reads and writes cleartext; the kernel encrypts on writeback and decrypts on read-in. Supported backing filesystems include ext4, f2fs, and ubifs (Btrfs and XFS do not implement fscrypt hooks). UmkaOS implements fscrypt for Linux ABI compatibility and because it is required by Android file-based encryption (FBE), Chromebook disk encryption, and enterprise per-directory encryption workflows.
Reference specification: Linux kernel Documentation/filesystems/fscrypt.rst
(canonical), include/uapi/linux/fscrypt.h (UAPI header).
Tier: Tier 0 (in-kernel, part of the VFS/filesystem path). fscrypt hooks execute inside the page cache read/write path and cannot be isolated behind a domain boundary without unacceptable latency on every I/O.
15.20.1 Encryption Policies¶
A directory is encrypted by setting an encryption policy on it (via ioctl) before any files are created inside it. All files and subdirectories created within an encrypted directory inherit the parent's policy. Two policy versions exist; V2 is required for new deployments.
/// fscrypt policy version 2 (FSCRYPT_POLICY_V2).
///
/// V1 policies (`FscryptPolicyV1`) are supported for backward compatibility
/// with existing Android and Chrome OS volumes but are deprecated: V1 uses
/// an ad-hoc AES-128-ECB KDF that is non-standard and reversible. All new
/// encrypted directories must use V2.
///
/// Matches the Linux `struct fscrypt_policy_v2` layout exactly (UAPI ABI).
#[repr(C)]
pub struct FscryptPolicyV2 {
/// Policy version: always `2`.
pub version: u8,
/// Contents encryption mode ([`FscryptMode`]).
pub contents_encryption_mode: u8,
/// Filenames encryption mode ([`FscryptMode`]).
pub filenames_encryption_mode: u8,
/// Policy flags (bitwise OR of `FSCRYPT_POLICY_FLAG_*`).
pub flags: u8,
/// Log2 of the data unit size for contents encryption.
/// 0 means the filesystem block size (default). Non-zero values
/// allow sub-block encryption granularity (Linux 6.7+).
pub log2_data_unit_size: u8,
/// Reserved, must be zero.
pub reserved: [u8; 3],
/// Master key identifier: first 16 bytes of
/// `HKDF-SHA512(master_key, info="fscrypt\0" || 0x01)`.
/// Computed by the kernel on `FS_IOC_ADD_ENCRYPTION_KEY` and matched
/// against the policy when unlocking.
pub master_key_identifier: [u8; FSCRYPT_KEY_IDENTIFIER_SIZE],
}
// UAPI ABI: version(1)+contents(1)+filenames(1)+flags(1)+log2_du(1)+reserved(3)+key_id(16) = 24 bytes.
const_assert!(core::mem::size_of::<FscryptPolicyV2>() == 24);
/// Size of the master key identifier (bytes).
pub const FSCRYPT_KEY_IDENTIFIER_SIZE: usize = 16;
/// Size of the per-file nonce (bytes).
/// Not in Linux UAPI header; kernel-internal constant from fs/crypto/fscrypt_private.h.
pub const FSCRYPT_FILE_NONCE_SIZE: usize = 16;
/// Encryption mode constants.
///
/// Values match Linux `FSCRYPT_MODE_*` exactly (UAPI ABI).
#[repr(u8)]
pub enum FscryptMode {
/// AES-256-XTS — contents encryption (default, recommended).
/// 64-byte key (two 256-bit AES keys: one for data, one for tweak).
Aes256Xts = 1,
/// AES-256-CTS-CBC — filenames encryption (default, recommended).
/// 32-byte key. CTS (ciphertext stealing) handles non-block-aligned names.
Aes256Cts = 4,
/// AES-128-CBC-ESSIV — legacy contents mode. Not recommended for new use.
Aes128Cbc = 5,
/// AES-128-CTS-CBC — legacy filenames mode. Not recommended for new use.
Aes128Cts = 6,
/// SM4-XTS — contents encryption (Chinese national standard, GM/T 0002).
Sm4Xts = 7,
/// SM4-CTS-CBC — filenames encryption (GM/T 0002).
Sm4Cts = 8,
/// Adiantum — both contents and filenames. Wide-block cipher built on
/// XChaCha12 + AES-256 + NH + Poly1305. Designed for devices without
/// AES hardware acceleration (low-end ARM, older RISC-V).
Adiantum = 9,
/// AES-256-HCTR2 — contents encryption (wide-block, Linux 6.7+).
/// Hash-Counter-Hash construction over AES-256 + XCTR + POLYVAL.
/// Better semantic security than XTS for small data units.
Aes256Hctr2 = 10,
}
/// Policy flag constants (UAPI ABI).
pub const FSCRYPT_POLICY_FLAG_DIRECT_KEY: u8 = 0x04;
pub const FSCRYPT_POLICY_FLAG_IV_INO_LBLK_64: u8 = 0x08;
pub const FSCRYPT_POLICY_FLAG_IV_INO_LBLK_32: u8 = 0x10;
Valid mode combinations (enforced at FS_IOC_SET_ENCRYPTION_POLICY time):
| Contents mode | Filenames mode | Notes |
|---|---|---|
| `Aes256Xts` (1) | `Aes256Cts` (4) | Default. Recommended for all platforms with AES-NI / ARMv8 CE / AES ISA. |
| `Aes128Cbc` (5) | `Aes128Cts` (6) | Legacy. V1 policies only. |
| `Adiantum` (9) | `Adiantum` (9) | No-AES-hardware path. Required for ARMv7 without CE, some RISC-V. |
| `Sm4Xts` (7) | `Sm4Cts` (8) | Chinese regulatory compliance (GM/T). |
| `Aes256Hctr2` (10) | `Aes256Cts` (4) | Wide-block contents with standard filenames. |
| `Aes256Hctr2` (10) | `Aes256Hctr2` (10) | Wide-block for both contents and filenames. |
The kernel rejects any mode combination not in this table with EINVAL.
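The combination check applied at `FS_IOC_SET_ENCRYPTION_POLICY` time is a direct transcription of the table above. A minimal sketch (the function name is illustrative; the mode numbers are the `FscryptMode` values defined in this section):

```rust
/// Sketch: reject any (contents, filenames) mode pair not in the
/// valid-combinations table. The kernel returns EINVAL on `false`.
pub fn mode_combination_allowed(contents: u8, filenames: u8) -> bool {
    matches!(
        (contents, filenames),
        (1, 4)     // AES-256-XTS + AES-256-CTS-CBC (default)
        | (5, 6)   // AES-128-CBC-ESSIV + AES-128-CTS-CBC (legacy, V1 only)
        | (9, 9)   // Adiantum + Adiantum
        | (7, 8)   // SM4-XTS + SM4-CTS-CBC
        | (10, 4)  // AES-256-HCTR2 + AES-256-CTS-CBC
        | (10, 10) // AES-256-HCTR2 + AES-256-HCTR2
    )
}
```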
15.20.2 Key Derivation¶
fscrypt V2 uses HKDF-SHA512 (Section 10.1) for all key derivation. The
master key is the HKDF input keying material (IKM); no salt is used. Different
application-specific info strings (the HKDF info parameter) produce distinct derived
keys:
Key identifier:
info = "fscrypt\0" || 0x01
→ 16-byte identifier stored in FscryptPolicyV2.master_key_identifier
Per-file encryption key (default, no DIRECT_KEY flag):
info = "fscrypt\0" || 0x02 || file_nonce[16]
→ One unique key per file. file_nonce is random, stored in xattr.
Per-mode encryption key (DIRECT_KEY flag set):
info = "fscrypt\0" || 0x03 || mode_number[1]
→ One key per (master_key, mode) pair. File nonce mixed into IV instead.
Per-mode IV_INO_LBLK_64 key:
info = "fscrypt\0" || 0x04 || mode_number[1]
→ Inode number and block index combined into a 64-bit IV.
Dirhash key (for case-insensitive/casefolded directories):
info = "fscrypt\0" || 0x05 || file_nonce[16]
→ SipHash-2-4 key for directory entry hashing.
Per-mode IV_INO_LBLK_32 key:
info = "fscrypt\0" || 0x06 || mode_number[1]
→ 32-bit inode+block IV for hardware with limited IV width.
All context bytes (0x01..0x06) are reserved by the fscrypt specification. UmkaOS
must not redefine them.
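All six derivation contexts share the same `info` shape: the fixed 8-byte prefix `"fscrypt\0"`, one context byte, and an optional context-specific suffix (the 16-byte file nonce or the 1-byte mode number). A minimal sketch, with an illustrative function name:

```rust
/// Sketch: build the HKDF-SHA512 `info` parameter for an fscrypt
/// derivation context. The caller supplies the context byte (0x01..=0x06,
/// reserved by the fscrypt specification) and any suffix bytes.
pub fn fscrypt_hkdf_info(context: u8, suffix: &[u8]) -> Vec<u8> {
    let mut info = Vec::with_capacity(9 + suffix.len());
    info.extend_from_slice(b"fscrypt\0"); // 8-byte fixed prefix ("fscrypt" + NUL)
    info.push(context);                   // application-specific context byte
    info.extend_from_slice(suffix);       // file_nonce[16] or mode_number[1], or empty
    info
}
```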
15.20.2.1 Master Key Lifecycle¶
- Userspace provides the raw master key via `FS_IOC_ADD_ENCRYPTION_KEY`. (For V1 policies, the legacy path uses `sys_add_key()` with the `fscrypt-provisioning` key type — see Section 10.2 for the formal syscall definition. V2 policies use the dedicated ioctl instead of the generic keyring syscalls.)
- The kernel derives the 16-byte key identifier via HKDF and stores the master key in the filesystem-level keyring (not the user session keyring — a V2 improvement).
- When an encrypted inode is opened, the kernel matches the inode's `master_key_identifier` against the keyring, derives the per-file key, and caches the derived key in the in-core `FscryptInfo` attached to the inode.
- On `FS_IOC_REMOVE_ENCRYPTION_KEY`, derived keys are zeroized and evicted. Inodes with cached keys are marked stale; subsequent I/O returns `ENOKEY`.
15.20.2.2 Per-File Context (On-Disk)¶
/// Per-file fscrypt context stored in the inode's encryption xattr.
///
/// For ext4: xattr name `c` in the `system.` namespace (index 9).
/// For f2fs: stored in the inode's `i_extra` area.
/// For ubifs: stored as an extended attribute.
///
/// Matches Linux `struct fscrypt_context_v2` layout exactly (kernel-internal).
#[repr(C)]
pub struct FscryptContextV2 {
/// Context version: `2`.
pub version: u8,
/// Contents encryption mode.
pub contents_encryption_mode: u8,
/// Filenames encryption mode.
pub filenames_encryption_mode: u8,
/// Policy flags.
pub flags: u8,
/// Log2 of data unit size (0 = filesystem block size).
pub log2_data_unit_size: u8,
/// Reserved, must be zero.
pub reserved: [u8; 3],
/// Master key identifier (16 bytes).
pub master_key_identifier: [u8; FSCRYPT_KEY_IDENTIFIER_SIZE],
/// Random per-file nonce generated at inode creation time.
/// Used as HKDF input (per-file key mode) or as IV tweak (DIRECT_KEY mode).
pub nonce: [u8; FSCRYPT_FILE_NONCE_SIZE],
}
// On-disk format: version(1)+contents(1)+filenames(1)+flags(1)+log2_du(1)+reserved(3)+key_id(16)+nonce(16) = 40 bytes.
const_assert!(core::mem::size_of::<FscryptContextV2>() == 40);
15.20.3 Ioctls¶
All ioctls use magic number 'f' (0x66). Values match Linux UAPI exactly.
| Ioctl | Direction | Nr | Arg type | Description |
|---|---|---|---|---|
| `FS_IOC_SET_ENCRYPTION_POLICY` | `_IOR` | 19 | `fscrypt_policy_v1` | Set encryption policy on an empty directory. V2 policies are passed via the same ioctl with `version = 2` in the struct. Returns `ENOTEMPTY` if the directory is non-empty, `EEXIST` if a policy is already set. |
| `FS_IOC_GET_ENCRYPTION_POLICY_EX` | `_IOWR` | 22 | `fscrypt_get_policy_ex_arg` | Get encryption policy (V1 or V2) with version discrimination. |
| `FS_IOC_ADD_ENCRYPTION_KEY` | `_IOWR` | 23 | `fscrypt_add_key_arg` | Add a master key to the filesystem keyring. Derives and stores the key identifier. Any user may add keys; the key is ref-counted per user. |
| `FS_IOC_REMOVE_ENCRYPTION_KEY` | `_IOWR` | 24 | `fscrypt_remove_key_arg` | Remove the calling user's claim on a master key. When the last user removes it, derived keys are wiped and inodes evicted. |
| `FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS` | `_IOWR` | 25 | `fscrypt_remove_key_arg` | Force-remove for all users. Requires `CAP_SYS_ADMIN`. |
| `FS_IOC_GET_ENCRYPTION_KEY_STATUS` | `_IOWR` | 26 | `fscrypt_get_key_status_arg` | Query whether a master key is present, absent, or incompletely removed. |
| `FS_IOC_GET_ENCRYPTION_NONCE` | `_IOR` | 27 | `[u8; 16]` | Retrieve the file's 16-byte encryption nonce (for backup/restore tooling). |
All ioctls can be issued on any file or directory on the target filesystem; the
filesystem root directory is the conventional target. FS_IOC_SET_ENCRYPTION_POLICY
must target the directory to be encrypted.
15.20.4 I/O Path Integration¶
15.20.4.1 Read Path¶
- VFS `read()` dispatches to the filesystem's `readpage()` / `readahead()`.
- The filesystem reads ciphertext blocks from disk into the page cache.
- `fscrypt_decrypt_pagecache_blocks()` decrypts the page in place using the per-file key cached in `FscryptInfo`. The decryption transform is obtained from Section 10.1 (`skcipher` for AES-XTS/AES-CTS, `lskcipher` for Adiantum/HCTR2).
- The cleartext page is returned to userspace.
If the master key is absent (not added or removed), the filesystem returns ENOKEY
on open() for files requiring content decryption. Directory listing is still possible
(filenames are shown as no-key names — see below).
15.20.4.2 Write Path¶
- VFS `write()` copies cleartext data into page cache pages.
- On writeback, `fscrypt_encrypt_pagecache_blocks()` allocates a bounce page, encrypts the cleartext page into the bounce page, and submits the bounce page to the block layer.
- The original cleartext page remains in the page cache for subsequent reads (no redundant decryption).
- The bounce page is freed after I/O completion.
Memory allocation: Bounce pages are drawn from a dedicated mempool
(fscrypt_bounce_page_pool) to guarantee forward progress under memory pressure.
The pool is sized at FSCRYPT_BOUNCE_POOL_SIZE (default: 32 pages, configurable via
boot parameter fscrypt.bounce_pool_size). This fixed size avoids CPU hotplug
sensitivity. Allocation uses GFP_NOFS to avoid filesystem re-entry deadlock.
Backpressure on exhaustion: When all pool pages are in use, mempool_alloc()
blocks the calling writeback thread until a bounce page is freed by I/O completion.
This naturally throttles concurrent writebacks to the pool size. The mempool
guarantee means allocation never fails — it may block indefinitely, but forward progress
is assured because in-flight bounce pages are freed on I/O completion (I/O completion
runs in softirq/workqueue context, independent of the blocked writeback thread).
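The blocking-allocation semantics described above can be sketched with a fixed pool guarded by a mutex and condition variable. This is an illustrative userspace model, not the kernel mempool: names are hypothetical, and the real pool hands out page frames rather than heap buffers:

```rust
use std::sync::{Condvar, Mutex};

/// Sketch of mempool-style backpressure: `alloc` never fails, it blocks
/// until `release` (called from I/O completion) returns a page.
pub struct BouncePool {
    free: Mutex<Vec<Vec<u8>>>,
    cv: Condvar,
}

impl BouncePool {
    pub fn new(pages: usize, page_size: usize) -> Self {
        Self {
            free: Mutex::new((0..pages).map(|_| vec![0u8; page_size]).collect()),
            cv: Condvar::new(),
        }
    }

    /// Blocks the calling writeback thread while the pool is exhausted,
    /// naturally throttling concurrent writebacks to the pool size.
    pub fn alloc(&self) -> Vec<u8> {
        let mut free = self.free.lock().unwrap();
        loop {
            if let Some(page) = free.pop() {
                return page;
            }
            free = self.cv.wait(free).unwrap();
        }
    }

    /// Called from I/O completion: return the page and wake one waiter.
    pub fn release(&self, page: Vec<u8>) {
        self.free.lock().unwrap().push(page);
        self.cv.notify_one();
    }
}
```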
15.20.4.3 Filename Encryption¶
Directory entries on disk store encrypted filenames. The kernel translates between encrypted and cleartext forms:
- `fscrypt_fname_disk_to_usr()`: Decrypts an on-disk filename for `readdir()` and `lookup()`. When the key is present, the cleartext name is returned. When the key is absent, a no-key name is returned: the ciphertext encoded as base64url (RFC 4648 section 5, no padding), prefixed with `_` if the name would otherwise start with `.` (to avoid hiding entries in directory listings).
- `fscrypt_fname_usr_to_disk()`: Encrypts a cleartext filename for `create()`, `rename()`, `link()`, and `unlink()`. Requires the key to be present; returns `ENOKEY` otherwise.
Filename encryption uses AES-256-CTS-CBC (or Adiantum, or AES-256-HCTR2 depending on policy). CTS handles names whose length is not a multiple of the AES block size without padding, preserving the original name length in the directory entry.
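The no-key presentation step can be sketched as plain unpadded base64url encoding plus the defensive `_` prefix described above. This is an illustrative model only: Linux's actual no-key names additionally embed a directory hash for long names, which is omitted here:

```rust
/// RFC 4648 section 5 alphabet (base64url).
const B64URL: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/// Sketch: encode ciphertext as base64url without '=' padding, prefixing
/// '_' if the result would start with '.' (per the policy stated above).
pub fn nokey_name(ciphertext: &[u8]) -> String {
    let mut out = String::new();
    for chunk in ciphertext.chunks(3) {
        let b = [chunk[0], *chunk.get(1).unwrap_or(&0), *chunk.get(2).unwrap_or(&0)];
        let n = ((b[0] as u32) << 16) | ((b[1] as u32) << 8) | b[2] as u32;
        let sym = [n >> 18, (n >> 12) & 63, (n >> 6) & 63, n & 63];
        // Emit only as many symbols as the input bytes justify (no padding).
        let keep = match chunk.len() { 1 => 2, 2 => 3, _ => 4 };
        for &c in &sym[..keep] {
            out.push(B64URL[c as usize] as char);
        }
    }
    if out.starts_with('.') { format!("_{out}") } else { out }
}
```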
15.20.4.4 IV Construction¶
The IV (initialisation vector) varies by policy flag:
| Policy mode | IV layout (little-endian) |
|---|---|
| Default (per-file key) | `data_unit_index[8] \|\| zeros[8]` |
| `DIRECT_KEY` | `data_unit_index[8] \|\| file_nonce[16]` (24 bytes total; AES-XTS/CTS use only the first 16 bytes; Adiantum/HCTR2 use all 24) |
| `IV_INO_LBLK_64` | `data_unit_index[4] \|\| inode_number[4] \|\| zeros[8]` |
| `IV_INO_LBLK_32` | `(hash(inode_number) + data_unit_index) mod 2^32 [4] \|\| zeros[12]` |
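The first and third layouts can be sketched directly; fields are serialized little-endian into a 16-byte IV (the width AES-XTS consumes). Function names are illustrative:

```rust
/// Sketch: default (per-file key) IV — data_unit_index[8] || zeros[8].
pub fn iv_default(data_unit_index: u64) -> [u8; 16] {
    let mut iv = [0u8; 16];
    iv[..8].copy_from_slice(&data_unit_index.to_le_bytes()); // zeros follow
    iv
}

/// Sketch: IV_INO_LBLK_64 IV — data_unit_index[4] || inode_number[4] || zeros[8].
pub fn iv_ino_lblk_64(data_unit_index: u32, inode_number: u32) -> [u8; 16] {
    let mut iv = [0u8; 16];
    iv[..4].copy_from_slice(&data_unit_index.to_le_bytes());
    iv[4..8].copy_from_slice(&inode_number.to_le_bytes()); // zeros follow
    iv
}
```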
15.20.5 Inline Crypto Engine (ICE) Support¶
Modern SoCs include inline encryption hardware that encrypts/decrypts data in the
storage controller's DMA path, eliminating CPU-side crypto overhead entirely. UmkaOS
integrates with this hardware through the blk-crypto framework
(Section 15.2).
When inline crypto is available and supports the requested mode, fscrypt attaches a
BlkCryptoKey to the Bio instead of performing software encryption. The block layer
programs the key into a hardware keyslot and the storage controller encrypts/decrypts
transparently. If the hardware does not support the requested mode (or no inline crypto
hardware is present), blk-crypto falls back to software encryption automatically — no
filesystem or fscrypt code change is needed.
Hardware-wrapped keys (Linux 6.15+): On SoCs that support it (Qualcomm ICE, Samsung FMP), the master key can be provided in hardware-wrapped form. The hardware unwraps the key internally and programs the derived inline encryption key into a keyslot without ever exposing it to software. This provides defense-in-depth: even a kernel compromise cannot extract the raw encryption key.
15.20.5.1 Per-Architecture Inline Crypto Availability¶
| Arch | ICE hardware | Notes |
|---|---|---|
| x86-64 | Rare (some Intel platforms with IBECC) | Primarily software path. AES-NI provides fast software AES-XTS (~2 cycles/byte). |
| AArch64 | Common: Qualcomm ICE, Samsung FMP, MediaTek UFS inline crypto | Standard on mobile/embedded SoCs. Hardware-wrapped key support on Qualcomm SM8x50+. |
| ARMv7 | Limited: older Qualcomm ICE on 32-bit SoCs | Adiantum mode recommended when AES CE is absent. |
| RISC-V | No known ICE hardware (as of 2026) | Software path only. Adiantum recommended for devices without scalar AES extensions. |
| PPC32 | No ICE hardware | Software path only. |
| PPC64LE | No ICE hardware | Software path only. POWER9/10 AES instructions provide adequate software throughput. |
15.20.6 Filesystem Integration Points¶
Each filesystem that supports fscrypt must implement a set of hooks. These are not
a separate trait; they are woven into the existing InodeOps and FileOps
implementations (Section 14.1).
| Hook | Filesystem responsibility |
|---|---|
| Inode creation | Generate random 16-byte nonce, store FscryptContextV2 as xattr. |
| Inode load | Read FscryptContextV2 from xattr, call fscrypt_get_encryption_info() to derive/cache keys. |
| `readpage` / `readahead` | Read ciphertext from disk, call `fscrypt_decrypt_pagecache_blocks()`. |
| Writeback | Call `fscrypt_encrypt_pagecache_blocks()` to produce bounce pages. |
| `lookup` / `readdir` | Decrypt filenames via `fscrypt_fname_disk_to_usr()`. |
| `create` / `rename` / `link` | Encrypt filenames via `fscrypt_fname_usr_to_disk()`. |
| `statfs` | No change (encrypted and unencrypted data occupy the same space). |
Per-filesystem xattr storage:
- ext4: `FscryptContextV2` stored in the `system.` xattr namespace (index 9, name `c`). Retrieved during inode read-in from the inode's xattr area.
- f2fs: Stored in the inode's `i_extra` inline area (not a separate xattr block), avoiding an extra disk read for key setup.
- ubifs: Stored as a standard extended attribute on the inode.
15.20.7 Crypto Backend Integration¶
fscrypt uses the Section 10.1 algorithm registry exclusively. It never calls hardware crypto instructions directly.
Algorithm allocation (performed once per master key, cached):
| Purpose | Crypto API algorithm name | Transform type |
|---|---|---|
| Key derivation | `hmac(sha512)` | `Shash` |
| AES-256-XTS content | `xts(aes)` | `Skcipher` |
| AES-256-CTS-CBC filenames | `cts(cbc(aes))` | `Skcipher` |
| AES-128-CBC-ESSIV content | `essiv(cbc(aes),sha256)` | `Skcipher` |
| Adiantum both | `adiantum(xchacha12,aes)` | `Skcipher` |
| AES-256-HCTR2 both | `hctr2(aes)` | `Skcipher` |
| SM4-XTS content | `xts(sm4)` | `Skcipher` |
| SM4-CTS-CBC filenames | `cts(cbc(sm4))` | `Skcipher` |
| Dirhash | `siphash24` | `Shash` |
Hardware-accelerated implementations (AES-NI on x86-64, ARMv8 CE on AArch64, etc.) are selected automatically by the crypto API's priority-based dispatch. No fscrypt-specific code is needed to prefer hardware acceleration.
Key zeroization: When a master key is removed (FS_IOC_REMOVE_ENCRYPTION_KEY),
all derived keys cached in FscryptInfo structs are overwritten with zeros
(memzero_explicit) before the memory is freed. The master key in the filesystem
keyring is similarly zeroized. This limits the window during which key material
is resident in kernel memory.
GFP flags: All crypto allocations within the fscrypt I/O path use GFP_NOFS to
prevent deadlock from re-entrant filesystem calls during memory reclaim.
15.20.8 In-Core State¶
/// Per-inode fscrypt state, allocated when an encrypted inode is first accessed
/// with a valid master key. Attached to the in-core inode and freed when the
/// inode is evicted from the inode cache.
pub struct FscryptInfo {
/// Encryption mode for file contents.
pub contents_mode: FscryptMode,
/// Encryption mode for filenames (meaningful only for directory inodes).
pub filenames_mode: FscryptMode,
/// Policy flags from the inode's `FscryptContextV2`.
pub flags: u8,
/// Derived per-file encryption key (zeroized on drop).
/// For DIRECT_KEY mode: this is the per-mode key, shared across files.
pub contents_key: ZeroizingKey,
/// Derived filenames encryption key (directories only; zeroized on drop).
pub filenames_key: Option<ZeroizingKey>,
/// Allocated crypto transform for contents encryption.
pub contents_tfm: SkcipherHandle,
/// Allocated crypto transform for filenames encryption (directories only).
pub filenames_tfm: Option<SkcipherHandle>,
/// The file's 16-byte nonce (copied from `FscryptContextV2`).
pub nonce: [u8; FSCRYPT_FILE_NONCE_SIZE],
/// Reference to the master key entry in the filesystem keyring.
/// Prevents the master key from being fully removed while this inode
/// is still in use.
pub master_key_ref: Arc<FscryptMasterKey>,
/// For inline crypto: the `BlkCryptoKey` prepared for hardware keyslot
/// programming. `None` if software encryption is used.
pub blk_crypto_key: Option<BlkCryptoKey>,
}
ZeroizingKey is a wrapper around ArrayVec<u8, 64> that implements Drop by
calling memzero_explicit on the key material. It must never implement Clone.
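The zeroize-on-drop behavior can be sketched in user-space Rust. This is a minimal illustration, not the kernel implementation: the real type wraps ArrayVec and calls memzero_explicit, while the sketch below uses a fixed buffer and volatile writes as a stand-in.

```rust
use core::ptr::write_volatile;

/// User-space sketch of a key buffer that wipes itself on drop.
/// Volatile writes stand in for memzero_explicit: the compiler may
/// not elide them even though the buffer is about to be freed.
pub struct ZeroizingKey {
    buf: [u8; 64],
    len: usize,
}

impl ZeroizingKey {
    pub fn new(key: &[u8]) -> Self {
        assert!(key.len() <= 64);
        let mut buf = [0u8; 64];
        buf[..key.len()].copy_from_slice(key);
        Self { buf, len: key.len() }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.buf[..self.len]
    }
}

// Deliberately no Clone impl: a copy of the key material would
// escape the wipe performed here.
impl Drop for ZeroizingKey {
    fn drop(&mut self) {
        for byte in self.buf.iter_mut() {
            unsafe { write_volatile(byte, 0) };
        }
    }
}
```

The missing Clone impl is the point of the type: every copy of key material is a copy the drop handler cannot reach.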
15.20.9 Security Considerations¶
- Threat model: fscrypt protects data at rest. It does not protect against a running kernel compromise (the kernel holds derived keys in memory). For protection against a compromised kernel, use confidential computing (Section 9.7).
- Key scrubbing: Derived keys are zeroized on removal, but the master key may persist in userspace memory (e.g., a PAM module or key agent). UmkaOS cannot control userspace key hygiene.
- No authenticated encryption: fscrypt uses unauthenticated encryption modes (XTS, CTS-CBC). An attacker with physical disk access can modify ciphertext without detection (bit-flipping attacks). Filesystem metadata checksums (ext4 metadata_csum, Btrfs checksums) detect some corruption but do not provide cryptographic authentication. For authenticated at-rest protection, use dm-crypt with AEAD (dm-integrity + dm-crypt) or full-disk authenticated encryption at the block layer.
- Filename length leakage: Encrypted filenames preserve the original length (CTS does not pad). An attacker can observe filename lengths on the encrypted volume. This is a known and accepted trade-off (padding would break directory entry size constraints).
15.20.10 Cross-References¶
- Section 10.1 -- underlying algorithm registry and hardware dispatch
- Section 15.6 -- implements fscrypt hooks (ext4_encrypt_page, ext4_decrypt_page)
- Section 15.7, Section 15.8 -- do NOT implement fscrypt hooks (noted here for completeness; future integration is Phase 4+)
- Section 14.16 -- fscrypt context stored as inode xattr
- Section 9.7 -- complementary protection (fscrypt = at-rest, CC = in-use)
- Section 15.2 -- blk-crypto inline encryption framework
- Section 14.1 -- VFS read/write hooks, InodeOps, FileOps
- Section 10.2 -- filesystem keyring integration
15.21 SMB Server (ksmbd)¶
ksmbd is UmkaOS's in-kernel SMB server, providing high-performance SMB file sharing
without the overhead of running Samba as a userspace daemon. Originally merged into
Linux 5.15, the ksmbd architecture splits work between the kernel (data path: read,
write, directory enumeration, oplock/lease management) and a lightweight userspace
helper (ksmbd.mountd, for authentication and configuration parsing). UmkaOS supports
SMB 2.1, 3.0, 3.0.2, and 3.1.1 protocols — sufficient for all modern Windows, macOS,
and Linux CIFS clients.
Use cases: Windows interoperability (file sharing with unmodified Windows 10/11 clients), NAS appliances, Samba-compatible file servers, container-based file gateways.
Tier: Tier 1 (in-kernel, hardware domain isolated). The SMB data path executes
entirely in Ring 0 within a Tier 1 isolation domain; the authentication and
configuration plane is delegated to the ksmbd.mountd userspace daemon via a netlink
IPC channel.
15.21.1 Server State¶
/// ksmbd server state -- one instance per SMB listener.
pub struct KsmbdServer {
/// Listening TCP socket (port 445).
pub listener: Arc<TcpListener>,
/// Active sessions, keyed by session ID (u64).
pub sessions: RwLock<XArray<Arc<SmbSession>>>,
/// Share configuration (loaded from ksmbd.conf via ksmbd.mountd).
/// Updated via RCU: readers (SMB request handlers) never block.
pub shares: RcuVec<SmbShare>,
/// Global server GUID (randomly generated at first start, persisted
/// across restarts in `/etc/ksmbd/server_guid`).
pub server_guid: [u8; 16],
/// Supported dialects, ordered by preference (highest first).
pub dialects: ArrayVec<SmbDialect, 4>,
/// Server capabilities advertised in SMB2 NEGOTIATE response.
pub capabilities: SmbServerCapabilities,
/// Worker thread pool for request processing. One thread per
/// concurrent SMB connection; threads are spawned on accept and
/// exit when the connection closes.
pub worker_pool: WorkerPool,
/// IPC transport to ksmbd.mountd (userspace helper for auth + config).
pub mountd_ipc: NetlinkSocket,
}
15.21.2 SMB Session¶
Each authenticated client connection produces one SmbSession. Sessions are independent:
a client may establish multiple sessions (e.g., one per user credential) over the same
or different TCP connections.
pub struct SmbSession {
/// Session ID (unique per server, assigned at session setup).
pub session_id: u64,
/// Authenticated user credentials (resolved by ksmbd.mountd).
pub user: SmbUser,
/// Session key (derived from authentication exchange). 16 bytes per MS-SMB2.
pub session_key: Zeroizing<[u8; 16]>,
/// Signing key (KDF from session key per MS-SMB2 §3.1.4.2). 16 bytes.
pub signing_key: Zeroizing<[u8; 16]>,
/// Encryption keys (SMB 3.0+). `None` if encryption not negotiated.
/// 32 bytes to support AES-256-CCM and AES-256-GCM (negotiated via
/// `SMB2_ENCRYPTION_CAPABILITIES`). The KDF produces 32 bytes for
/// AES-256; AES-128 ciphers use only the first 16 bytes of the buffer.
pub encrypt_key: Option<Zeroizing<[u8; 32]>>,
pub decrypt_key: Option<Zeroizing<[u8; 32]>>,
/// Tree connections (mounted shares), keyed by tree ID (u32).
pub tree_connects: SpinLock<XArray<Arc<SmbTreeConnect>>>,
/// Open file handles, keyed by volatile file ID (u64).
pub open_files: SpinLock<XArray<Arc<SmbOpenFile>>>,
/// Connection transport (TCP or RDMA).
pub transport: SmbTransport,
/// Negotiated dialect for this session.
pub dialect: SmbDialect,
/// Session lifecycle state (stored as `SmbSessionState as u8`).
/// Use `session_state()` / `set_session_state()` typed accessors
/// instead of raw AtomicU8 operations to avoid u8-to-enum mismatch bugs.
pub state: AtomicU8,
}
/// Session lifecycle states.
#[repr(u8)]
pub enum SmbSessionState {
/// Negotiate complete, session setup in progress.
InProgress = 0,
/// Fully authenticated and active.
Valid = 1,
/// Session expired (idle timeout or explicit logoff).
Expired = 2,
}
impl SmbSession {
/// Read the current session state as a typed enum.
/// Returns `SmbSessionState::Expired` for any unrecognized value
/// (defensive — treats corruption as expired to prevent use of
/// an invalid session).
pub fn session_state(&self) -> SmbSessionState {
match self.state.load(Acquire) {
0 => SmbSessionState::InProgress,
1 => SmbSessionState::Valid,
_ => SmbSessionState::Expired,
}
}
/// Set the session state atomically.
pub fn set_session_state(&self, s: SmbSessionState) {
self.state.store(s as u8, Release);
}
}
15.21.3 Dialect Negotiation¶
/// SMB protocol dialects supported by UmkaOS ksmbd.
#[repr(u16)]
pub enum SmbDialect {
/// SMB 2.1 (Windows 7 / Server 2008 R2).
Smb21 = 0x0210,
/// SMB 3.0 (Windows 8 / Server 2012). Adds multichannel, encryption.
Smb30 = 0x0300,
/// SMB 3.0.2 (Windows 8.1 / Server 2012 R2).
Smb302 = 0x0302,
/// SMB 3.1.1 (Windows 10+ / Server 2016+). Adds pre-auth integrity,
/// AES-256, compression. Preferred dialect.
Smb311 = 0x0311,
}
The client sends SMB2 NEGOTIATE with a list of supported dialects. The server selects
the highest common dialect. If no common dialect exists, the server returns
STATUS_NOT_SUPPORTED and closes the connection. SMB 1.0/CIFS is deliberately not
supported — it has known security vulnerabilities (EternalBlue, MS17-010) and no modern
client requires it.
15.21.3.1 SMB 3.1.1 Negotiate Context¶
SMB 3.1.1 extends NEGOTIATE with typed negotiate contexts:
- Pre-authentication integrity (SMB2_PREAUTH_INTEGRITY_CAPABILITIES): SHA-512 hash chain of all negotiate messages. The pre-auth integrity hash is carried into the session setup exchange, binding the authenticated session to the specific negotiate sequence and preventing downgrade attacks.
- Encryption (SMB2_ENCRYPTION_CAPABILITIES): ordered cipher preference list. UmkaOS supports AES-128-CCM (mandatory per MS-SMB2), AES-128-GCM, AES-256-CCM, and AES-256-GCM. The server selects the first client-offered cipher it supports.
- Compression (SMB2_COMPRESSION_CAPABILITIES): LZ77, LZ77+Huffman, LZNT1, Pattern_V1. Compression is optional and negotiated per-connection.
- Signing (SMB2_SIGNING_CAPABILITIES): AES-128-CMAC (SMB 3.0/3.0.2) or AES-128-GMAC (SMB 3.1.1, preferred for hardware acceleration via AES-NI).
15.21.4 Share Configuration¶
/// One exported SMB share.
pub struct SmbShare {
/// Share name as visible to clients (e.g., "public", "homes").
pub name: KString,
/// Local filesystem path (must be an absolute path to a mounted directory).
pub path: KString,
/// Share type (disk, printer, or IPC).
pub share_type: SmbShareType,
/// Maximum access mask the server will grant on this share.
pub max_access: SmbAccessMask,
/// Read-only flag. When true, all write operations return STATUS_ACCESS_DENIED.
pub read_only: bool,
/// Allow guest (unauthenticated) access. Default: false.
pub guest_ok: bool,
/// Visible in SMB network neighborhood enumeration. Default: true.
pub browseable: bool,
/// Oplocks enabled on this share. Default: true.
pub oplocks: bool,
/// Per-share encryption required. When true, unencrypted sessions cannot
/// access this share (returns STATUS_ACCESS_DENIED). SMB 3.0+ only.
pub encrypt: bool,
}
#[repr(u32)]
pub enum SmbShareType {
/// Disk share (file/directory access).
Disk = 0x00,
/// Printer share.
Printer = 0x01,
/// IPC share (named pipes for inter-process communication).
Ipc = 0x02,
}
Share configuration is loaded from /etc/ksmbd/ksmbd.conf by ksmbd.mountd and pushed
to the kernel via the netlink IPC channel. The kernel stores shares in an RcuVec so
that SMB request handlers can look up shares without acquiring any lock. Configuration
reloads (triggered by ksmbd.mountd --reload) replace the entire share list via an RCU
update; in-progress tree connects on the old configuration complete normally.
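The snapshot semantics of the RcuVec reload can be approximated in user-space Rust. This sketch is an illustration only (the kernel RcuVec gives truly lock-free readers; here a briefly held RwLock hands out Arc snapshots): readers keep working against the old configuration while a reload publishes a whole new list.

```rust
use std::sync::{Arc, RwLock};

/// Simplified share entry (fields trimmed for the sketch).
#[derive(Clone, Debug)]
pub struct SmbShare {
    pub name: String,
    pub read_only: bool,
}

/// RCU-like share table: readers take an Arc snapshot; a reload
/// replaces the whole list, and in-flight readers keep the old
/// snapshot until they drop it.
pub struct ShareTable {
    current: RwLock<Arc<Vec<SmbShare>>>,
}

impl ShareTable {
    pub fn new(shares: Vec<SmbShare>) -> Self {
        Self { current: RwLock::new(Arc::new(shares)) }
    }

    /// Read side: grab a stable snapshot of the share list.
    pub fn snapshot(&self) -> Arc<Vec<SmbShare>> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Update side: replace the entire list (ksmbd.mountd --reload).
    pub fn reload(&self, shares: Vec<SmbShare>) {
        *self.current.write().unwrap() = Arc::new(shares);
    }
}
```

The "in-progress tree connects complete normally" behavior falls out directly: an SMB request handler that took a snapshot before the reload still holds a valid Arc to the old list.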
15.21.5 Oplock and Lease Model¶
SMB oplocks (opportunistic locks) and leases control client-side caching. They are the SMB equivalent of NFSv4 delegations (Section 15.11).
/// SMB oplock level (per MS-SMB2 section 2.2.14).
#[repr(u8)]
pub enum SmbOplockLevel {
/// No oplock granted.
None = 0x00,
/// Level II: read caching only. Multiple clients may hold simultaneously.
LevelII = 0x01,
/// Exclusive: read + write caching. Only one client may hold.
Exclusive = 0x08,
/// Batch: read + write + handle caching. Client may delay close.
Batch = 0x09,
/// Lease (SMB 2.1+): fine-grained, per-file-name caching state.
Lease = 0xFF,
}
/// SMB2 Lease -- per-file-name (not per-handle) caching state.
/// Leases survive handle close and reopen, unlike oplocks.
pub struct SmbLease {
/// Client-generated lease key (unique per client per file name).
pub lease_key: [u8; 16],
/// Current lease state (combination of R, W, H flags).
pub lease_state: LeaseState,
/// Lease epoch (incremented on each lease break/upgrade).
/// Protocol-mandated u16 (MS-SMB2 wire format). Wrap is safe: SMB2
/// uses modular comparison for epoch changes; absolute value is not meaningful.
pub lease_epoch: u16,
/// Parent lease key for directory leases (SMB 3.0+). Enables
/// directory change caching: the client can cache readdir results
/// until the parent lease is broken.
pub parent_lease_key: Option<[u8; 16]>,
}
bitflags! {
/// Lease state flags (MS-SMB2 section 2.2.13.2.8).
pub struct LeaseState: u32 {
/// Read caching: client may cache read data locally.
const READ = 0x01;
/// Write caching: client may cache writes locally (flush on break).
const WRITE = 0x02;
/// Handle caching: client may defer close operations.
const HANDLE = 0x04;
}
}
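The wrap-safe modular comparison noted for lease_epoch can be sketched as follows. The helper name is hypothetical; the point is that the u16 counter may wrap freely because only the wrapping difference, interpreted as signed, decides which epoch is newer.

```rust
/// Returns true if `received` is newer than `current` under modular
/// (wrapping) u16 arithmetic. A wrapping difference below 0x8000 is
/// treated as "ahead"; zero means equal.
pub fn epoch_is_newer(current: u16, received: u16) -> bool {
    let diff = received.wrapping_sub(current);
    diff != 0 && diff < 0x8000
}
```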
Lease break protocol: When a conflicting access arrives (e.g., another client opens
a file for write while a read-write lease is held), the server sends a lease break
notification. The client must acknowledge the break within 35 seconds (the oplock break
timeout per MS-SMB2 section 3.3.5.22.1) or the lease is forcibly revoked. During the
break period, the client flushes cached writes and downgrades its lease state. The VFS
integration for conflict detection uses vfs_test_lock() and the file notification
subsystem (Section 14.13).
15.21.6 SMB Multichannel¶
SMB 3.0 and later support multichannel: multiple TCP connections per session for bandwidth aggregation and fault tolerance.
- Interface discovery: The client issues FSCTL_QUERY_NETWORK_INTERFACE_INFO to discover the server's network interfaces (IP addresses, link speeds, RSS capability). The server populates the response by enumerating all network interfaces via the netdevice subsystem (Section 16.13), filtering to interfaces that are UP and have a routable address. Each entry includes: interface index, link speed (from ethtool_link_ksettings), RSS capability flag, and IPv4/IPv6 socket addresses. The response is bounded by the number of interfaces (typically <32).
- Connection binding: Additional connections are bound to the existing session via SMB2 SESSION_SETUP with SMB2_SESSION_FLAG_BINDING. All connections in a session share the same session key and signing/encryption keys.
- Request distribution: Requests are distributed across channels per-request (not per-session). The server processes requests from any channel interchangeably.
- Failover: If one channel fails (TCP RST or timeout), in-flight requests on that channel are retried on a surviving channel. The session remains valid as long as at least one channel is active.
- Channel limit: UmkaOS supports up to 32 channels per session. The limit is configurable via ksmbd.conf (max_channels = N).
15.21.7 SMB Direct (RDMA)¶
SMB Direct (MS-SMBD, SMB 3.0+) enables RDMA transport for zero-copy file transfer, eliminating TCP/IP overhead on RDMA-capable networks.
- Transport: Uses iWARP, RoCE v2, or InfiniBand RDMA via the UmkaOS RDMA subsystem (Section 5.4).
- Data transfer: Bulk data (read/write payloads) uses RDMA Read/Write operations; control messages (SMB headers, negotiate, session setup) use RDMA Send/Receive.
- Buffer descriptors: Each RDMA data transfer is described by a SmbdBufferDescriptor { offset: u64, token: u32, length: u32 } that the peer uses for remote DMA addressing.
- Credit-based flow control: The receiver advertises receive credits; the sender must not exceed the credit count. Credits are replenished in each response.
- Negotiation: SMB Direct is negotiated at connection time. If both endpoints support RDMA, the connection transparently upgrades from TCP to RDMA. Applications and management tools see a standard SMB session.
- Supported hardware: Mellanox/NVIDIA ConnectX series, Chelsio T6+, Intel E810 (iWARP). Any NIC exposing the UmkaOS RDMA verbs interface is usable.
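The credit-based flow control above can be sketched as a small state machine. The type and method names are assumptions for illustration; the invariant is the protocol's: one credit is consumed per message sent, the sender stalls at zero, and responses from the receiver replenish the pool.

```rust
/// Sender-side view of SMB Direct credits.
pub struct CreditState {
    available: u16,
}

impl CreditState {
    pub fn new(initial: u16) -> Self {
        Self { available: initial }
    }

    /// Try to send one message: consume a credit, or report
    /// back-pressure (caller must queue until replenished).
    pub fn try_consume(&mut self) -> bool {
        if self.available == 0 {
            return false;
        }
        self.available -= 1;
        true
    }

    /// The receiver granted more credits in a response message.
    pub fn grant(&mut self, credits: u16) {
        self.available = self.available.saturating_add(credits);
    }
}
```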
15.21.8 ksmbd.mountd IPC Protocol¶
The kernel/userspace split follows the same model as Linux ksmbd: the kernel handles the
fast data path while ksmbd.mountd handles authentication and configuration.
ksmbd.mountd responsibilities:
- Parse /etc/ksmbd/ksmbd.conf (share definitions, global parameters).
- Manage the user database (/etc/ksmbd/ksmbdpwd.db): NTLM password hashes.
- Authenticate SMB session setup requests (NTLMv2, Kerberos via GSSAPI/SPNEGO).
- Return session keys and user credentials to the kernel.
IPC channel: Generic Netlink family (KSMBD_GENL_NAME = "KSMBD_GENL"). ksmbd uses
Generic Netlink with a dynamically registered family, not a fixed netlink protocol
number. The protocol is a simple request/response framing:
- Kernel sends KSMBD_EVENT_LOGIN_REQUEST { account_name, domain_name } when an SMB2 SESSION_SETUP arrives. ksmbd.mountd validates credentials (NTLM challenge-response or Kerberos AP-REQ) and sends KSMBD_EVENT_LOGIN_RESPONSE { session_key, uid, gid, status }.
- On share configuration reload: ksmbd.mountd sends KSMBD_EVENT_SHARE_CONFIG_REQUEST with the full share table; the kernel replaces the RcuVec<SmbShare> atomically.
Failure mode: If ksmbd.mountd is not running, new session setup requests are
rejected with STATUS_LOGON_FAILURE. Existing authenticated sessions continue to
operate (the kernel has cached the session key). This allows ksmbd.mountd to be
restarted without disrupting active file transfers.
15.21.9 VFS Integration¶
SMB operations map directly to UmkaOS VFS operations (Section 14.1):
| SMB2 Command | VFS Call | Notes |
|---|---|---|
| SMB2_CREATE | vfs_open() / vfs_create() | Creates or opens a file/directory |
| SMB2_READ | vfs_read() / vfs_splice_read() | Splice path for zero-copy when possible |
| SMB2_WRITE | vfs_write() | Respects share read_only flag |
| SMB2_CLOSE | vfs_close() | Releases oplock/lease if last handle |
| SMB2_FLUSH | vfs_fsync() | Flushes to stable storage |
| SMB2_QUERY_INFO | vfs_getattr() / vfs_getxattr() | File/FS/security info classes |
| SMB2_SET_INFO | vfs_setattr() / vfs_setxattr() | Includes timestamp, size, ACL updates |
| SMB2_QUERY_DIRECTORY | vfs_readdir() | Pattern matching (wildcards) in kernel |
| SMB2_CHANGE_NOTIFY | Section 14.13 | Maps to inotify/fanotify watchers |
| SMB2_LOCK | vfs_lock_file() | Byte-range locks (Section 14.14) |
| SMB2_IOCTL | FSCTL dispatch | FSCTL_GET_REPARSE_POINT, FSCTL_PIPE_WAIT, etc. |
Extended attributes and NT ACLs: Windows NT security descriptors (DACLs/SACLs) are
stored as extended attributes in the security.NTACL xattr namespace
(Section 14.16). When a Windows client sets file permissions via the
Security tab, the server serializes the NT security descriptor into the xattr. On
SMB2_QUERY_INFO with SMB2_0_INFO_SECURITY, the xattr is read and returned as a
wire-format security descriptor. If no security.NTACL xattr exists, the server
synthesizes a default ACL from POSIX mode bits.
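The fallback synthesis from POSIX mode bits can be sketched as a per-ACE access-mask mapping. The helper is hypothetical (the real serializer must also emit SIDs and the full security descriptor); the GENERIC_* constants are the standard Windows generic access rights from winnt.h.

```rust
/// Windows generic access rights (winnt.h values).
pub const GENERIC_READ: u32 = 0x8000_0000;
pub const GENERIC_WRITE: u32 = 0x4000_0000;
pub const GENERIC_EXECUTE: u32 = 0x2000_0000;

/// Derive an ACE access mask from one rwx triplet of the POSIX mode.
/// `shift` selects the triplet: 6 = owner, 3 = group, 0 = other.
pub fn mode_to_access_mask(mode: u32, shift: u32) -> u32 {
    let bits = (mode >> shift) & 0o7;
    let mut mask = 0;
    if bits & 0o4 != 0 { mask |= GENERIC_READ; }
    if bits & 0o2 != 0 { mask |= GENERIC_WRITE; }
    if bits & 0o1 != 0 { mask |= GENERIC_EXECUTE; }
    mask
}
```

A mode of 0o640 would thus yield read+write for the owner ACE, read-only for the group ACE, and an empty mask for the everyone ACE.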
Durable handles: SMB 2.x durable handles allow a client to reconnect to open files
after a brief network disconnection (up to the durable handle timeout, default 60
seconds). The server retains the SmbOpenFile entry and its associated VFS state
(OpenFile, byte-range locks, oplock/lease) across the disconnect. On reconnect, the
client presents the durable file ID and the server resumes the open without re-executing
vfs_open(). Persistent handles (SMB 3.0+, requires stable storage) survive server
restart — the open file state is journaled to /var/lib/ksmbd/persistent_handles/.
Constraint: The persistent handle journal directory MUST NOT be on the same
exported share that the handles reference — this avoids a circular dependency where
reconstructing a persistent handle requires mounting a share that itself has a
persistent handle pending reconstruction. The journal directory must be on a local
filesystem (ext4/XFS) that is mounted before ksmbd starts.
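Durable handle retention reduces to a parked-state table with deadlines. This is a sketch under assumed types (OpenFileState stands in for the retained SmbOpenFile/VFS state): a disconnect parks the open with an expiry, a reconnect within the timeout reclaims it, and anything else falls back to a normal re-open.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Placeholder for the retained VFS open-file state
/// (OpenFile, byte-range locks, oplock/lease).
pub struct OpenFileState;

/// Parked durable handles, keyed by durable file ID.
pub struct DurableTable {
    parked: HashMap<u64, (OpenFileState, Instant)>,
    timeout: Duration, // default 60 seconds
}

impl DurableTable {
    pub fn new(timeout: Duration) -> Self {
        Self { parked: HashMap::new(), timeout }
    }

    /// Client disconnected: park the open with a reclaim deadline.
    pub fn park(&mut self, file_id: u64, state: OpenFileState, now: Instant) {
        self.parked.insert(file_id, (state, now + self.timeout));
    }

    /// Client reconnected: resume the open if the deadline has not
    /// passed; otherwise the client must re-open via vfs_open().
    pub fn reclaim(&mut self, file_id: u64, now: Instant) -> Option<OpenFileState> {
        match self.parked.remove(&file_id) {
            Some((state, deadline)) if now <= deadline => Some(state),
            _ => None,
        }
    }
}
```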
15.21.10 Security¶
Authentication: Delegated entirely to ksmbd.mountd:
- NTLMv2: Challenge-response using NT password hash from ksmbdpwd.db. The kernel
generates the 8-byte challenge; ksmbd.mountd validates the response.
- Kerberos (SPNEGO): ksmbd.mountd accepts the Kerberos AP-REQ via GSSAPI/SPNEGO,
validates it against the host keytab, and returns the session key.
Message signing: When negotiated (mandatory for SMB 3.1.1 when the client requests signing), all SMB2 messages carry an AES-CMAC or AES-GMAC signature computed over the message header and payload. The signing key is derived from the session key via KDF(SessionKey, label, context) per MS-SMB2 section 3.1.4.2. Signing prevents man-in-the-middle modification of SMB traffic.
Encryption: Per-session or per-share encryption (SMB 3.0+). When enabled, the entire
SMB2 transform header and payload are encrypted using AES-128-CCM, AES-128-GCM,
AES-256-CCM, or AES-256-GCM (negotiated during SMB2 NEGOTIATE). Encryption keys are
derived from the session key. All cryptographic operations use the UmkaOS kernel crypto
subsystem (Section 10.1).
Guest access: Configurable per-share via guest_ok. Disabled by default. When
enabled, unauthenticated connections are granted the nobody credential (UID 65534).
Guest sessions cannot use signing or encryption.
Capability requirement: Capability::NetAdmin is required to configure ksmbd
(start/stop the server, modify shares). Standard users may connect as SMB clients
without any special capability.
15.21.11 Cross-references¶
- Section 14.1 -- VFS operations for all file serving
- Section 15.11 -- complementary network filesystem (NFS client)
- Section 15.12 -- complementary network filesystem (NFS server)
- Section 5.4 -- SMB Direct RDMA transport
- Section 14.13 -- change notification (SMB2_CHANGE_NOTIFY)
- Section 14.16 -- NT ACL xattr storage
- Section 10.1 -- AES encryption, signing, and KDF primitives
- Section 14.14 -- byte-range locks mapped from SMB2_LOCK
- Section 16.2 -- TCP transport layer
15.21.12 Design Decisions¶
- Kernel/userspace split: The data path (read, write, directory enumeration) runs entirely in-kernel for minimal latency. Authentication and configuration parsing run in userspace (ksmbd.mountd), where they can use standard libraries (MIT Kerberos, OpenSSL) without kernel-space constraints. This matches the Linux ksmbd architecture and is the right trade-off: authentication is per-session (infrequent), while data operations are per-request (high frequency).
- SMB 1.0 not supported: SMB 1.0/CIFS has critical security vulnerabilities (EternalBlue/WannaCry) and no legitimate modern use case. All supported clients (Windows 10+, macOS 10.12+, Linux CIFS) speak SMB 2.1 or later. Excluding SMB 1.0 eliminates a large attack surface.
- Tier 1 placement: ksmbd is a kernel-resident server that accesses the VFS via kabi_call! (which resolves to ring dispatch, since ksmbd and VFS are in different hardware isolation domains). Tier 2 (userspace) placement would add a privilege transition on top of the ring dispatch, further increasing latency. Tier 1 provides ring-based VFS access with hardware domain isolation for fault containment and implicit batching for throughput.
- RCU for share configuration: Share lookups happen on every SMB request (to verify access rights). RCU eliminates lock contention on the read path. Configuration reloads are rare (operator-initiated) and use the RCU update path.
- Oplock/lease integration via VFS file notification: Rather than implementing a separate conflict-detection mechanism, ksmbd uses the VFS file notification subsystem (Section 14.13) to detect conflicting opens and trigger lease breaks. This ensures consistent behavior between local processes and remote SMB clients accessing the same files.
- Durable handles with VFS state retention: Durable handles keep the VFS OpenFile alive across client disconnects, avoiding the cost of re-opening and re-acquiring locks. The 60-second default timeout is short enough that server resources are not held indefinitely by crashed clients.