
Chapter 14: Virtual Filesystem Layer

VFS architecture, dentry cache, mount tree, path resolution, overlayfs, mount namespace operations


The Virtual Filesystem Layer provides a unified interface over all filesystem implementations. Dentry caching, inode management, mount tree operations, and path resolution are kernel-internal — filesystems plug in via well-defined traits. FUSE, overlayfs, configfs, and autofs are first-class citizens, not afterthoughts.

14.1 Virtual Filesystem Layer

The VFS (umka-vfs) provides a unified interface over all filesystem types. It is a Tier 1 component that shares a hardware isolation domain with filesystem drivers (see Section 11.2 for platform-specific isolation mechanisms). This shared domain provides crash containment from umka-core but does not provide mutual isolation between VFS and the filesystem drivers within it.

Why VFS is Tier 1 (not Tier 0):

The VFS handles complex, security-sensitive operations: path resolution (symlink loops, mount point crossing), permission checks, and filesystem driver coordination. Isolating VFS from Core provides:

  1. Attack surface reduction: Path resolution bugs (symlink attacks, directory traversal) are confined to the VFS domain and cannot corrupt Core memory.

  2. Domain boundary: Core → VFS+FS domain (Tier 1) → individual FS drivers (Tier 1/2). A compromised VFS+FS domain cannot corrupt Core memory. However, VFS and filesystem drivers share a domain by default, so a filesystem driver bug can corrupt VFS metadata within the shared domain. On platforms with sufficient isolation domains and few active Tier 1 drivers (e.g., x86-64 with PKU and <12 Tier 1 drivers), VFS and filesystem drivers may be placed in separate domains, providing inter-driver hardware isolation. Rust memory safety — not hardware isolation — is the primary defense against filesystem driver bugs within the shared domain. The hard isolation boundary is between Core and the VFS+FS domain, not between VFS and individual filesystem drivers.

  3. Crash containment: A VFS panic (e.g., corrupted dentry cache) is recoverable without rebooting the entire kernel. The recovery protocol:

a. Detection: umka-core detects VFS domain death (MPK exception, panic handler, or watchdog timeout on the VFS heartbeat ring).

b. Freeze: All syscalls that enter VFS (open, stat, read, write, close, etc.) are blocked at the umka-core domain boundary. Callers receive -ERESTARTSYS and the VFS ring is drained.

c. Dirty page cache flush: Dirty pages in umka-core's page cache are flushed to their backing block devices. The page cache is in umka-core memory (not VFS memory), so it survives the VFS crash. Flush uses the block layer ring directly.

d. Dentry/inode cache rebuild: The new VFS instance starts with an empty dentry cache. Dentries are lazily re-populated on the next path lookup (cache miss triggers disk read). Inode cache is similarly rebuilt on demand.

e. Mount tree reconstruction: umka-core maintains a shadow mount registry — a Tier 0 table recording (mount_id, device, fstype, mountpoint_path, flags) for every active mount. The registry is updated atomically by the syscall layer on every mount()/umount() call (BEFORE dispatching to VFS). After VFS restart, the new VFS instance iterates the shadow registry and re-mounts each filesystem in depth-sorted order (root first, then children):

  ```rust
  /// Shadow mount registry in Tier 0 Core. Survives VFS Tier 1 crash.
  /// Updated on mount/umount syscalls before VFS dispatch.
  pub struct ShadowMountRegistry {
      /// Active mounts, keyed by mount_id (u64).
      mounts: XArray<ShadowMountEntry>,
  }

  /// Stable layout for crash recovery: Tier 0 Core persists this struct
  /// and reads it back after a VFS Tier 1 crash. `#[repr(C)]` ensures
  /// deterministic field ordering across compiler versions.
  #[repr(C)]
  pub struct ShadowMountEntry {
      pub mount_id: u64,            // 8 bytes  (offset 0)
      /// Block device backing this mount (`DevId`, or `DevId::ZERO` for
      /// pseudo-fs). Uses `DevId` (u32) — the same type as
      /// `SuperBlock.sb_dev` — for consistency with all other VFS device
      /// ID fields. KABI `DeviceId` (u64) is not used here because
      /// ShadowMountEntry is a Tier 0 crash recovery struct, not a KABI
      /// wire type.
      pub device: DevId,             // 4 bytes  (offset 8)
      /// Explicit padding for u8-array alignment of `fstype`.
      pub _pad0: [u8; 4],            // 4 bytes  (offset 12)
      /// Filesystem type ("ext4", "proc", "sysfs", "tmpfs", "overlay",
      /// "fuse.sshfs"). 32 bytes: covers all standard types and FUSE names.
      pub fstype: [u8; 32],          // 32 bytes (offset 16)
      /// Mountpoint path ("/", "/proc", "/sys", "/dev", etc.).
      /// 512 bytes: covers deeply nested container mount paths
      /// (Kubernetes pod volumes, Docker overlay lower directories).
      pub path: [u8; 512],           // 512 bytes (offset 48)
      /// Mount flags (MS_RDONLY, MS_NOSUID, etc.). u64 for consistency
      /// with MountFlags (highest defined bit is 29, but u64 prevents
      /// silent truncation if future flags use bits 32+).
      pub flags: u64,                // 8 bytes  (offset 560)
      /// Depth in the mount tree (root=0, /proc=1, /proc/sys=2, ...).
      /// Used for depth-sorted re-mount ordering.
      pub depth: u16,                // 2 bytes  (offset 568)
      /// Filesystem-specific mount data (e.g., overlayfs lower/upper/work).
      /// Opaque bytes, copied from the original mount() call.
      /// **Truncation**: If the original mount data exceeds 256 bytes, only
      /// the first 256 bytes are stored and `fs_data_len` is set to 256.
      /// The `truncated` flag below is set to indicate data loss. During
      /// crash recovery, truncated mounts fall back to re-reading mount
      /// options from /etc/fstab or fail with -EINVAL if unavailable.
      pub fs_data: [u8; 256],        // 256 bytes (offset 570)
      pub fs_data_len: u16,          // 2 bytes  (offset 826)
      /// Set to 1 if the original fs_data was longer than 256 bytes.
      pub truncated: u8,             // 1 byte   (offset 828)
      /// Explicit trailing padding to u64 alignment boundary.
      pub _pad: [u8; 3],             // 3 bytes  (offset 829)
      // Total: 8+4+4+32+512+8+2+256+2+1+3 = 832 bytes.
  }
  const_assert!(size_of::<ShadowMountEntry>() == 832);
  ```

  Pseudo-filesystems (/proc, /sys, /dev/devtmpfs) are re-mounted from kernel
  state — they have no on-disk backing. Overlayfs is re-constituted from the
  recorded `fs_data` (lower/upper/work paths). FUSE mounts that require a
  userspace daemon connection receive `-ENOTCONN` until the daemon reconnects.

  Memory cost: ~832 bytes per mount (with alignment) × ~20 typical mounts = ~16.6 KB in Core.

f. Open file descriptor recovery: umka-core's FdTable (in Tier 0 Core, per-task via Task.files: Arc<FdTable>, Section 8.1) survives the VFS crash. After mount tree reconstruction, umka-core re-opens each fd by inode number using the re-mounted filesystem. File descriptors that pointed to deleted files (unlinked but still open) receive -EIO on next access.

g. Resume: The VFS ring is reopened and blocked syscalls are retried.

Recovery time: ~100-500ms depending on the number of open file descriptors. Limitation: In-flight writes that had not yet reached the page cache are lost (the application receives -EIO and must retry).
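Step e's depth-sorted re-mount can be sketched as follows. This is an illustrative model, not the umka-core implementation: `ShadowEntry` is a pared-down, hypothetical stand-in for `ShadowMountEntry` (the real struct carries fstype, flags, and fs_data as fixed-size byte arrays), and `remount_order` computes only the ordering.

```rust
// Hypothetical sketch of step e: depth-sorted re-mount after VFS restart.
#[derive(Clone)]
struct ShadowEntry {
    mount_id: u64,
    path: String,
    depth: u16,
}

/// Re-mount order: root (depth 0) first, then children, so that every
/// mountpoint path already exists in the new VFS before a child is
/// mounted on it.
fn remount_order(mut entries: Vec<ShadowEntry>) -> Vec<ShadowEntry> {
    // Stable sort: siblings at the same depth keep registry order.
    entries.sort_by_key(|e| e.depth);
    entries
}

fn main() {
    let registry = vec![
        ShadowEntry { mount_id: 2, path: "/proc".into(), depth: 1 },
        ShadowEntry { mount_id: 1, path: "/".into(), depth: 0 },
        ShadowEntry { mount_id: 3, path: "/proc/sys".into(), depth: 2 },
    ];
    let ordered = remount_order(registry);
    let paths: Vec<&str> = ordered.iter().map(|e| e.path.as_str()).collect();
    assert_eq!(paths, ["/", "/proc", "/proc/sys"]);
}
```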

Domain grouping limitation: The crash recovery protocol above is most effective when the crash originates from VFS logic itself (e.g., a bug in path resolution or dentry management). Because VFS and filesystem drivers share an isolation domain by default, a filesystem driver bug can corrupt VFS metadata (dentry cache, mount tree, inode state) before detection. In this case, the corrupted VFS state may have already produced incorrect I/O (wrong block mappings, stale metadata replies) before the domain crash is detected by Core. Recovery restores VFS to a clean state, but data written to disk under corrupted VFS guidance may be silently wrong. This is a known limitation of domain grouping — the hardware fault boundary catches the crash, but cannot retroactively undo I/O performed with corrupted in-domain state. Rust memory safety mitigates this risk by preventing most classes of memory corruption bugs, but unsafe code within the shared domain remains a vector.

In-flight write definition: In-flight writes are writes that have entered the VFS write path (passed the syscall boundary) but whose data has not yet been inserted into the page cache. This includes: (1) writes buffered in the VFS ring command queue awaiting processing, (2) writes being copied from user buffer to a page that has not yet been marked dirty. Writes that have reached the page cache (page marked PG_DIRTY with a committed dirty extent) are NOT in-flight — they survive VFS crash via the dirty extent protocol (Section 14.4). Error reporting for lost in-flight writes: the write() syscall returns -EIO if the VFS crashes during the write. If write() had already returned success, the data is in the page cache and is safe. Applications should retry write() calls that returned -EIO after VFS recovery completes (the ring is reopened in step f of the recovery protocol).
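The retry guidance above can be wrapped in a small userspace helper. A minimal sketch, assuming POSIX errno numbering (EIO = 5); `retry_on_eio` is a hypothetical name, not part of any UmkaOS library:

```rust
use std::io::Error;

/// Retry a write-like operation that may fail with EIO while the VFS
/// domain is being restarted. `op` is any closure performing the write;
/// the caller bounds the retries. Hypothetical helper, not a real API.
fn retry_on_eio<T>(
    mut op: impl FnMut() -> Result<T, Error>,
    max_retries: u32,
) -> Result<T, Error> {
    let mut attempts = 0;
    loop {
        match op() {
            // EIO during VFS recovery: the write never reached the page
            // cache, so retrying is safe (no double-write hazard).
            Err(e) if e.raw_os_error() == Some(5) && attempts < max_retries => {
                attempts += 1;
                // Real code would back off until recovery completes (~100-500ms).
            }
            other => return other,
        }
    }
}

fn main() {
    // Simulate: first call fails with EIO (VFS restarting), second succeeds.
    let mut calls = 0;
    let result = retry_on_eio(
        || {
            calls += 1;
            if calls == 1 {
                Err(Error::from_raw_os_error(5)) // EIO
            } else {
                Ok(4096usize) // bytes written
            }
        },
        3,
    );
    assert_eq!(result.unwrap(), 4096);
    assert_eq!(calls, 2);
}
```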

14.1.1.1.1 Dirty Page Handling on VFS Crash


When a Tier 1 VFS driver crashes, UmkaOS Core cannot safely flush dirty pages using the crashed driver's block mapping (the file-offset → block-address translation lives in the now-destroyed VFS domain).

UmkaOS's design: two-phase dirty extent protocol.

The dirty extent protocol must accommodate two fundamentally different filesystem write models:

  • In-place / pre-allocated filesystems (ext4, XFS non-reflink, FAT): The block address is known at page-dirty time (the extent tree maps file offsets to fixed block addresses). Both the logical reservation and physical commit can happen in one step.

  • Copy-on-Write filesystems (Btrfs, XFS reflink, bcachefs): The physical block address is not known at page-dirty time. CoW filesystems allocate new blocks at writeback time, not when the page is first dirtied. A protocol that requires block_addr at dirty time is incompatible with CoW.

To support both models, UmkaOS uses a two-phase dirty extent protocol: Phase 1 (reserve) records the logical intent at dirty time; Phase 2 (commit) binds the physical block address after writeback allocation.

```rust
/// Phase 1: Reserve a dirty extent at page-dirty time.
///
/// Called by VFS drivers when a page is first marked dirty (from the
/// `AddressSpaceOps::dirty_extent()` callback). Records a **logical**
/// writeback intent — the file offset and length that will need to be
/// written back. No physical block address is required at this point.
///
/// The intent is stored in a per-inode **writeback intent list** maintained
/// by UmkaOS Core. The intent list is protected by `i_rwsem` (held at
/// least shared by the caller, since `dirty_extent()` is called from
/// `__set_page_dirty()` which holds the page lock, which nests inside
/// `i_rwsem`).
///
/// # Parameters
/// - `inode_id`: Stable inode identifier (survives VFS crash).
/// - `file_offset`: Byte offset of the dirty range start.
/// - `len`: Length of the dirty range in bytes.
///
/// # Errors
/// Returns `VfsDirtyError::IntentListFull` when the per-inode intent list
/// reaches its capacity (8192 entries). The caller must trigger writeback
/// for this inode to drain completed intents before retrying.
pub fn vfs_dirty_extent_reserve(
    inode_id: InodeId,
    file_offset: u64,
    len: u64,
) -> Result<DirtyExtentToken, VfsDirtyError>;

/// Phase 2: Commit a dirty extent with its physical block address.
///
/// Called by the filesystem's writeback path **after** it has allocated
/// the physical blocks for a dirty range (CoW allocation for Btrfs/XFS
/// reflink, or extent-tree lookup for ext4/XFS non-reflink). Binds the
/// physical block address to the previously reserved logical intent.
///
/// After `vfs_dirty_extent_commit()` returns, UmkaOS Core has a complete
/// record of the dirty extent's physical location. If the VFS crashes
/// between commit and actual I/O completion, Core can flush the extent
/// directly via the block layer.
///
/// # Parameters
/// - `token`: The token returned by `vfs_dirty_extent_reserve()` for
///   this extent. Tokens are single-use; committing the same token
///   twice is a kernel bug (caught by debug-mode assertion).
/// - `block_addr`: Physical block address assigned by the filesystem's
///   writeback allocator.
/// - `block_len`: Length of the physical block range in bytes. May
///   differ from the logical `len` if the filesystem compresses or
///   coalesces extents.
///
/// # Errors
/// Returns `VfsDirtyError::InvalidToken` if the token has already been
/// committed or was never issued.
pub fn vfs_dirty_extent_commit(
    token: DirtyExtentToken,
    block_addr: PhysBlockAddr,
    block_len: u64,
) -> Result<(), VfsDirtyError>;

/// Atomic reserve+commit for non-CoW filesystems.
///
/// Convenience function for filesystems that know the block address at
/// dirty time (ext4 non-delayed-alloc, FAT, exFAT). Equivalent to
/// calling `vfs_dirty_extent_reserve()` followed immediately by
/// `vfs_dirty_extent_commit()`, but avoids the overhead of a separate
/// token round-trip.
///
/// CoW filesystems MUST NOT use this function — they must use the
/// two-phase protocol because the block address is not available at
/// dirty time.
pub fn vfs_dirty_extent_reserve_and_commit(
    inode_id: InodeId,
    file_offset: u64,
    len: u64,
    block_addr: PhysBlockAddr,
    block_len: u64,
) -> Result<(), VfsDirtyError>;

/// Abort a previously reserved dirty extent. Called when writeback fails
/// after `vfs_dirty_extent_reserve()` but before
/// `vfs_dirty_extent_commit()` — e.g., block allocation failure, I/O
/// error during journal write, or filesystem shutdown. Releases the
/// reserved intent list entry, decrementing `nr_reserved` and freeing
/// the token. The token is consumed (single-use, same as commit).
///
/// **Retry policy**: After 3 consecutive writeback failures for the same
/// extent (tracked per-inode by a `(file_offset, len)` → `fail_count`
/// map in the intent list), the filesystem marks the extent permanently
/// dirty and logs an FMA error:
/// `"writeback abort: extent [offset, offset+len) on inode {id} failed 3 times"`.
/// The permanently-dirty extent remains in the intent list until the
/// inode is evicted or the filesystem is unmounted, ensuring crash
/// recovery can still identify it.
pub fn vfs_dirty_extent_abort(
    token: DirtyExtentToken,
) -> Result<(), VfsDirtyError>;

/// Acknowledge that a dirty extent has been successfully flushed to
/// stable storage. Removes the extent from Core's dirty extent log.
/// Called by the VFS driver after receiving I/O completion for the
/// writeback.
pub fn vfs_flush_extent_complete(
    inode_id: InodeId,
    file_offset: u64,
    len: u64,
) -> Result<(), VfsDirtyError>;

/// Opaque token binding a reserved dirty extent to its commit.
/// Issued by `vfs_dirty_extent_reserve()`, consumed by
/// `vfs_dirty_extent_commit()`. Internally encodes the inode ID,
/// file offset, length, and a monotonic sequence number for
/// double-commit detection.
///
/// Size: 24 bytes (inode_id: u64 + file_offset: u64 + seq: u64).
/// Passed by value through the VFS ring buffer.
#[derive(Clone, Copy)]
pub struct DirtyExtentToken {
    pub inode_id: InodeId,
    pub file_offset: u64,
    pub seq: u64,
}

/// Error type for dirty extent operations.
pub enum VfsDirtyError {
    /// The per-inode intent list is full (8192 entries). Trigger writeback
    /// to drain completed intents before retrying. Equivalent to EBUSY.
    IntentListFull,
    /// The token has already been committed or was never issued.
    InvalidToken,
    /// Other VFS error (invalid inode ID, etc.).
    Other(VfsError),
}
```

Dirty extent intent list — UmkaOS Core maintains a per-inode writeback intent list in core memory (not in VFS domain memory):

```rust
/// Per-inode writeback intent list. Maintained by UmkaOS Core in its own
/// memory domain, surviving VFS crashes.
///
/// Each entry tracks a dirty file range through its lifecycle:
/// Reserved (logical intent only) → Committed (physical address bound) →
/// Complete (flushed to stable storage, entry removed).
///
/// Protected by `i_rwsem` for structural modifications (insert, remove).
/// The writeback thread reads the list under `i_rwsem` shared to collect
/// committed extents for I/O submission.
pub struct DirtyIntentList {
    /// Ring buffer of intent entries. Capacity: 8192 per inode.
    /// At 80 bytes per entry (inode_id(8) + file_offset(8) + len(8) +
    /// block_addr(16) + block_len(8) + seq(8) + sb_dev(4) + pad(4) +
    /// block_dev(16) = 80), worst case is ~640 KB per heavily-dirtied
    /// inode — acceptable for a production server. Most inodes have
    /// <100 entries at any given time.
    ///
    /// **Allocation**: `DirtyIntentList` is allocated from the
    /// `dirty_intent_slab` (a dedicated slab cache, object size = sizeof
    /// `DirtyIntentList`, created at boot Phase 2.4). Allocation occurs
    /// lazily on the first `vfs_dirty_extent_reserve()` for each inode.
    /// The slab is GC'd when idle inodes are evicted from the inode cache.
    pub entries: BoundedRing<DirtyIntentEntry, 8192>,
    /// Monotonic sequence counter for token generation.
    pub next_seq: u64,
}

/// Physical block address on a block device. Newtype around `u64` (plain u64,
/// no `NonZero` — `Option<PhysBlockAddr>` is 16 bytes: 8-byte discriminant +
/// 8-byte payload, with no niche optimization).
pub struct PhysBlockAddr(pub u64);

/// A single dirty extent intent entry.
// Kernel-internal, not KABI: contains Option<Arc<dyn>>, never crosses a compilation
// boundary. #[repr(C)] is required to make the const_assert deterministic across
// compiler versions — without it the compiler may reorder fields.
#[repr(C)]
pub struct DirtyIntentEntry {
    /// Stable inode identifier. Redundant during normal operation (the list
    /// is per-inode) but needed during crash recovery log replay, where entries
    /// may be iterated without per-inode context.
    pub inode_id: InodeId,
    /// File offset of the dirty range (bytes).
    pub file_offset: u64,
    /// Length of the dirty range (bytes).
    pub len: u64,
    /// Physical block address. `None` if Phase 1 only (reserved but not
    /// yet committed). `Some(addr)` after Phase 2 commit.
    pub block_addr: Option<PhysBlockAddr>,
    /// Physical block length. Only valid when `block_addr` is `Some`.
    pub block_len: u64,
    /// Sequence number (matches `DirtyExtentToken.seq`).
    pub seq: u64,
    /// Device ID of the superblock that owns this inode.
    ///
    /// This is the key that Core (Tier 0) uses to look up the block device
    /// during crash recovery, via the global device registry
    /// (`DEVICE_REGISTRY: XArray<Arc<dyn BlockDeviceOps>>`, keyed by `DevId`).
    /// When a Tier 1 VFS driver crashes, Core iterates dirty intent entries
    /// and uses `sb_dev` to resolve the target block device — without needing
    /// any VFS-domain state. This is strictly more reliable than using
    /// `block_dev` alone because `block_dev` is `None` for Phase 1 entries
    /// (deferred allocation) and for NFS, whereas `sb_dev` is always set for
    /// local filesystems and enables Core to match intent entries to the
    /// correct `SUPER_BLOCK_MAP` entry for filesystem journal replay.
    ///
    /// Set to `DevId::ZERO` for network filesystems (NFS, CIFS) where there
    /// is no local block device — crash recovery for network filesystems uses
    /// the network reconnection path instead.
    pub sb_dev: DevId,
    /// Reference to the block device that owns the physical blocks.
    /// Required by crash recovery: when a Tier 1 VFS driver crashes, Core
    /// must issue direct block writes via the block layer (bypassing VFS)
    /// for all committed extents. Without this reference, Core would need
    /// to resolve the block device from the superblock — but the VFS driver
    /// holding the superblock may be the one that crashed.
    ///
    /// Set to `None` for network filesystems (NFS) where `block_addr` is an
    /// opaque server-side commit token, not a local block address.
    ///
    /// **Redundancy with `sb_dev`**: `block_dev` provides the fast path —
    /// Core can issue I/O immediately without a registry lookup. `sb_dev`
    /// provides the fallback — if `block_dev` is `None` (Phase 1 entry),
    /// Core uses `sb_dev` to locate the block device via `DEVICE_REGISTRY`
    /// and then uses the filesystem's block allocator (from the superblock)
    /// to determine whether the extent can be committed or must be discarded.
    /// The `block_dev` reference is valid during crash recovery because block
    /// device drivers are in a separate Tier 1 domain from the VFS driver. If
    /// the block device driver crashes simultaneously (a different, independently
    /// handled crash event), dirty intent entries referencing that block device
    /// are skipped with an FMA warning.
    pub block_dev: Option<Arc<dyn BlockDeviceOps>>,
}
// Verify DirtyIntentEntry size matches the capacity analysis (80 bytes per entry,
// 8192 entries = ~640 KB worst case per heavily-dirtied inode). If this assertion
// fails, update the capacity analysis in DirtyIntentList.entries comment above.
//
// Note: DirtyIntentEntry is kernel-internal (never crosses a compilation boundary —
// only Core crash-recovery code touches it). The `#[repr(C)]` attribute ensures
// deterministic layout for the const_assert below, NOT for KABI compatibility.
// `Option<Arc<dyn BlockDeviceOps>>` relies on Rust's niche optimization for
// `Arc<T>` (null pointer → None). This is guaranteed by the language for `Arc`/`Box`
// and verified at compile time by this const_assert. If a future rustc version
// changes the representation, the const_assert will fail and the struct must be
// reworked (e.g., raw pointer + validity flag).
//
// Field breakdown (64-bit target):
//   inode_id: InodeId(u64) = 8
//   file_offset: u64 = 8
//   len: u64 = 8
//   block_addr: Option<PhysBlockAddr> = 16 (u64 discriminant + u64 payload; no niche — PhysBlockAddr wraps plain u64)
//   block_len: u64 = 8
//   seq: u64 = 8
//   sb_dev: DevId(u32) = 4
//   block_dev: Option<Arc<dyn BlockDeviceOps>> = 16 (fat pointer: data_ptr + vtable_ptr)
//   padding for alignment = 4 (after sb_dev, before block_dev's 8-byte alignment)
//   TOTAL = 80 bytes
const_assert!(core::mem::size_of::<DirtyIntentEntry>() == 80);
```

Intent list overflow policy: vfs_dirty_extent_reserve() returns IntentListFull when the per-inode intent list reaches 8192 entries. The VFS driver must not proceed with the write operation when IntentListFull is returned; it must first trigger writeback for this inode (via writeback_single_inode()) to flush committed extents and free intent list slots, then retry vfs_dirty_extent_reserve().

This is a deliberate design choice that differs from Linux's approach: UmkaOS never silently discards safety information. The backpressure ensures that on any VFS crash, umka-core has a complete record of all outstanding dirty extents and can accurately flag inconsistent data — no dirty extent is ever "forgotten."
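The driver-side reaction to `IntentListFull` can be sketched as a reserve-writeback-retry loop. Hypothetical sketch: the closures stand in for `vfs_dirty_extent_reserve()` and `writeback_single_inode()`, and string errors stand in for `VfsDirtyError`.

```rust
use std::cell::Cell;

/// On IntentListFull, drain completed intents via writeback, then retry
/// the reservation once. Hypothetical sketch of the driver-side policy.
fn reserve_with_backpressure(
    mut reserve: impl FnMut() -> Result<u64, &'static str>,
    mut writeback: impl FnMut(),
) -> Result<u64, &'static str> {
    match reserve() {
        // Intent list full: the write must not proceed. Flush committed
        // extents to free slots, then retry the reservation.
        Err("IntentListFull") => {
            writeback();
            reserve()
        }
        other => other,
    }
}

fn main() {
    let full = Cell::new(true);
    let mut wb_calls = 0;
    let tok = reserve_with_backpressure(
        || if full.get() { Err("IntentListFull") } else { Ok(7) },
        || { full.set(false); wb_calls += 1; },
    );
    assert_eq!(tok, Ok(7));
    assert_eq!(wb_calls, 1); // exactly one writeback pass was needed
}
```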

If the VFS driver is unresponsive (not calling vfs_flush_extent_complete() for 5 seconds), umka-core treats all entries in the intent list as dirty and initiates VFS driver restart — the backpressure prevents intent list overflow from masking a stuck VFS driver.

Per-filesystem type usage patterns:

| Filesystem | Write model | Phase 1 (reserve) | Phase 2 (commit) | Notes |
|---|---|---|---|---|
| ext4 (non-delayed-alloc) | In-place | `reserve_and_commit()` | N/A (atomic) | Block address known from extent tree at dirty time |
| ext4 (delayed-alloc) | Deferred | `reserve()` at dirty time | `commit()` during writeback after `ext4_map_blocks()` allocates | Delayed allocation defers block assignment |
| XFS (non-reflink) | In-place | `reserve_and_commit()` | N/A (atomic) | BMBT lookup gives block address at dirty time |
| XFS (reflink) | CoW | `reserve()` at dirty time | `commit()` during writeback after CoW fork allocates new blocks | Shared extents require new block allocation |
| Btrfs | Redirect-on-Write | `reserve()` at dirty time | `commit()` during writeback after extent allocator assigns new tree location | All writes redirect; old blocks freed at transaction commit. Crash semantics: reserved-only intents (Phase 1 without Phase 2) are volatile — on VFS crash they are discarded because no physical blocks were allocated. The on-disk tree remains consistent because uncommitted writes never modified it. |
| FAT/exFAT | In-place | `reserve_and_commit()` | N/A (atomic) | Cluster chain gives block address at dirty time |
| NFS | Network | `reserve()` at dirty time | `commit()` after NFS WRITE RPC completes with server-assigned stable storage | Block address is the server's opaque commit token |

Core-owned superblock registry for crash recovery bypass:

```rust
/// Global superblock map, owned by Core (Tier 0). Maps device IDs to
/// superblock references so that crash recovery can locate the block
/// device and filesystem metadata without going through the crashed VFS
/// driver.
///
/// Keyed by `DevId` (integer key → XArray per collection policy).
/// Populated during `mount()`: Core registers the superblock before handing
/// control to the VFS driver. Removed during `umount()` after the VFS driver
/// has cleanly shut down.
///
/// This map is **not** used on the normal I/O path — it exists solely for
/// crash recovery. Normal path resolution goes through the VFS mount tree.
pub static SUPER_BLOCK_MAP: OnceCell<XArray<Arc<SuperBlock>>> = OnceCell::new();
```

On crash recovery, Core uses SUPER_BLOCK_MAP to resolve the superblock for the crashed filesystem, then iterates its inodes' dirty intent lists. Each DirtyIntentEntry carries its own block_dev reference, enabling Core to issue direct block writes without the VFS driver's cooperation.

Crash recovery sequence:

Dirty page flush during VFS crash uses the committed dirty extent records stored in Core memory (via vfs_dirty_extent_commit()). These records contain physical block addresses that survive the VFS crash. The file-to-inode-to-superblock-to-device mapping is NOT needed — the dirty extent protocol captures the physical location at commit time specifically to enable crash recovery without VFS state.

When a Tier 1 VFS driver crashes and UmkaOS Core detects pending dirty intents:

  1. Iterate the dirty intent lists for all inodes on the crashed filesystem. Core looks up the superblock via SUPER_BLOCK_MAP.get(dev_id) and walks its inode cache.
  2. Committed extents (Phase 2 complete — block_addr is Some): for each committed extent (newest first, to preserve journal ordering), issue a direct block write via the block layer (bypassing VFS), using entry.block_dev for the device handle and block_addr/block_len from the intent entry. Wait for write completion.
  3. Reserved-only extents (Phase 1 only — block_addr is None): these extents have dirty pages in the page cache but no physical block assignment, so Core cannot flush them without knowing the block address. Flag these pages as "potentially inconsistent." The filesystem's own journal/log handles recovery on next mount (same as a hard power-off scenario where dirty pages had not reached disk).
  4. After all committed extents are flushed: mark the filesystem as "crash-flushed" and continue with driver reload.
  5. Any dirty pages NOT covered by any intent entry (neither reserved nor committed) are also flagged as "potentially inconsistent."
Design rationale: This is better than Linux's approach (which silently loses dirty pages when a kernel module crashes) while being simpler than running a full WAL in UmkaOS Core. The pre-registration overhead is one lightweight ring-buffer push per dirtied file region — negligible for writeback-dominated workloads.

Performance implications and mitigation:

The Core → VFS domain switch costs ~23 cycles for the bare WRPKRU instruction (x86-64 MPK). The full domain crossing — including argument marshaling via the inter-domain ring buffer and cache effects — is ~30-35 cycles per crossing. This overhead is amortized by:

  1. Page Cache in Core: The Page Cache (Section 4.4) lives in Core, not VFS. Cached file reads/writes hit the Page Cache directly with zero domain switches. Only cache misses (actual I/O) cross into VFS.

  2. Batching: Multiple file operations within a single syscall (e.g., readv, io_uring batches) amortize the domain switch over many operations.

  3. Dentry cache hit rate: The dentry cache (in VFS) has >99% hit rate for typical workloads. Path resolution is fast, and the domain switch cost is dominated by the actual I/O latency (microseconds vs nanoseconds).

Measured overhead: For a 4KB NVMe read (~10μs device latency), the additional domain switches (Core → VFS → FS driver) add ~70 cycles (~30ns total), which is 0.3% overhead. This is well within the "<5% overhead" target.

Metadata-heavy workloads: Individual metadata syscalls (stat, readdir, open/close) pay higher per-call overhead because the operation base cost is lower (~200-500ns vs ~10μs for I/O). A single stat() on x86-64 incurs ~46 cycles (~18ns) for the Core → VFS → Core round-trip, which is ~3.6-9% per call. This is the design tradeoff for VFS crash containment: the dentry/inode cache lives in the VFS domain, enabling cache rebuild on VFS crash recovery. Two amortization mechanisms reduce this cost to ~0.3-0.5% effective overhead for the dominant readdir+stat access pattern and ~0.05% per stat for io_uring batch workloads; see Section 14.1.1.2 below. For per-architecture raw overhead figures, see Section 3.4.
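The per-call percentages above follow from simple arithmetic. A sketch, assuming a 2.5 GHz clock purely for illustration (the real cycle-to-nanosecond conversion is per-architecture; see Section 3.4):

```rust
// Worked overhead arithmetic for the stat() round-trip figures above.
// The 2.5 GHz clock is an assumption for illustration only.
fn main() {
    let clock_ghz = 2.5;
    let round_trip_cycles = 46.0;           // Core → VFS → Core
    let ns = round_trip_cycles / clock_ghz; // ≈ 18.4 ns
    // Metadata op base cost: ~200-500 ns.
    let pct_fast = ns / 500.0 * 100.0;      // ≈ 3.7% on a 500 ns op
    let pct_slow = ns / 200.0 * 100.0;      // ≈ 9.2% on a 200 ns op
    assert!((18.0..19.0).contains(&ns));
    assert!(pct_fast > 3.0 && pct_slow < 10.0);
    println!("{:.1} ns, {:.1}%-{:.1}% per stat()", ns, pct_fast, pct_slow);
}
```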

14.1.1.2 Metadata Access Amortization

Metadata-heavy workloads (find, package managers, ls -la, container image unpacking) are dominated by the readdir+stat pattern: the application reads a directory, then immediately stats every entry. Without amortization, each stat() incurs a full Core-to-VFS domain crossing (~18ns on x86-64 MPK, ~32-64ns on AArch64 POE). This section specifies two complementary mechanisms that eliminate most of those crossings, plus a per-filesystem policy framework that controls when prefetch is safe.

14.1.1.2.1 Mechanism 1: Readdir-Plus Prefetch with Per-Task Buffer

When a filesystem's readdir implementation returns directory entries, the VFS also collects statx metadata for each returned entry. The inode is already resolved from the dentry lookup during readdir — fetching its attributes is essentially free for local filesystems (the inode struct is already cache-hot). The VFS writes the metadata into a per-task prefetch buffer allocated in Core memory.

Subsequent stat() / statx() calls on the same directory's entries check the per-task buffer before crossing into the VFS domain. On hit, stat() returns immediately with zero domain crossings.

Key design constraints:

  • Per-task buffer, not a global cache. The buffer is private to each task. No locking, no cross-task contention, no cache coherence traffic. Allocated on first readdir() for a given directory fd, freed when the directory fd is closed or the task exits.
  • Bounded size: 128 entries × ~320 bytes = ~40KB per task. This exceeds L1 data cache capacity on some architectures (ARMv7 Cortex-A15: 32KB L1D; Cortex-A72/A76: 32KB-48KB L1D), so on those targets the scan spills to L2 (~10-15 cycles per access vs ~4 cycles for L1), adding ~640-1280 cycles worst-case on a full miss path. On x86-64 (32-48KB L1D typical) the buffer may fit, depending on the core implementation. The sequential access pattern means the hardware prefetcher keeps up regardless — L2 prefetch on ARM delivers ~6-8 cycles per line, acceptable for the miss path, which is cold: a readdir+stat pattern that misses has already paid a domain crossing. Miss path cost: 128 iterations × (DevId compare + InodeId compare + Relaxed atomic load) = ~384-640 cycles on L1 hits, ~640-1280 cycles with L2 spills. This cost is paid only when the target inode is NOT in the prefetch buffer; after a readdir, the common case is a hit. 128 entries covers the vast majority of directories (the median directory size in real workloads is 20-60 entries). For directories larger than 128 entries, only the most recently returned batch is buffered; earlier entries that were already stat'd remain valid, and later entries fall through to the normal VFS path.
  • Keyed by (sb_dev, ino) — stable identifiers that survive VFS crash/evolution. No path strings, no dentry pointers, no VFS-internal state.
  • Generation counter per entry. VFS increments the inode's generation counter on any metadata mutation (setattr, truncate, write that changes mtime, etc.). Core checks the generation before returning a prefetch hit. Stale entries produce a miss and fall through to the normal VFS path.
  • VFS epoch counter for crash/evolution invalidation. The buffer records the VFS epoch at fill time. If the VFS is replaced (crash recovery or live evolution), the global VFS_EPOCH counter is incremented and all buffers become stale in O(1).
  • Core memory (Tier 0) — the buffer itself is not in the VFS domain. It survives VFS crash without corruption.

Data structures:

/// Single prefetch entry. Aligned to cache line to avoid false sharing
/// when the VFS writer and the Core reader access adjacent entries.
///
/// Total size: 320 bytes (64-byte aligned).
/// - DevId (4 bytes) + InodeId (8 bytes) + generation (8 bytes)
///   + StatxBuf (256 bytes) + valid (1 byte) + padding (43 bytes) = 320.
#[repr(C, align(64))]
pub struct PrefetchEntry {
    /// Superblock device ID. Together with `ino`, forms the unique key.
    pub sb_dev: DevId,
    /// Explicit padding: DevId is 4 bytes, InodeId requires 8-byte alignment.
    pub _pad0: [u8; 4],
    /// Inode number within the filesystem identified by `sb_dev`.
    pub ino: InodeId,
    /// Inode generation counter at the time this entry was filled.
    /// VFS increments the inode's generation on any metadata mutation.
    /// Core compares this against the current inode generation before
    /// returning the entry. Mismatch → stale → fall through to VFS.
    ///
    /// Memory ordering: stored with Release by VFS (during readdir fill),
    /// loaded with Acquire by Core (during stat fast path).
    pub generation: AtomicU64,
    /// Cached statx result. Layout matches `struct statx` from Linux
    /// (256 bytes, binary compatible with the userspace ABI).
    pub stx: StatxBuf,
    /// Entry validity flag. 0 = invalid, 1 = valid. AtomicU8 instead of
    /// AtomicBool: this is cross-domain shared memory (VFS Tier 1 writes,
    /// Core Tier 0 reads). A non-0/1 value from a corrupted Tier 1 domain
    /// would cause UB with AtomicBool's validity invariant.
    pub valid: AtomicU8,
    /// Explicit trailing padding: AtomicU8 ends at offset 281; align(64)
    /// rounds struct size to 320 (next multiple of 64). 320 - 281 = 39.
    pub _pad_tail: [u8; 39],
}
// Layout with align(64): DevId(4) + _pad0(4) + InodeId(8) + AtomicU64(8) +
// StatxBuf(256) + AtomicU8(1) + _pad_tail(39) = 320 bytes (5 × 64-byte cache lines).
const_assert!(size_of::<PrefetchEntry>() == 320);

/// Per-task readdir prefetch buffer. Allocated in Core memory (Tier 0).
///
/// One buffer exists per open directory fd per task. The buffer is
/// populated during `readdir()` and consumed during subsequent
/// `stat()` / `statx()` calls. It is freed when the directory fd is
/// closed or the task exits.
///
/// **Collection policy**: Hot path (per-syscall lookup). `entries` is
/// a fixed-capacity `ArrayVec` — no heap allocation on the hot path.
/// The 128-entry limit bounds memory to ~40KB per task per open
/// directory, which is acceptable for the readdir+stat pattern.
pub struct TaskPrefetchBuf {
    /// Prefetch entries, indexed by insertion order. Lookup is a linear
    /// scan over at most 128 entries (~40KB — exceeds L1 on ARMv7/some
    /// AArch64; spills to L2 at ~6-11 extra cycles per touched line on
    /// those targets). For the readdir+stat pattern, entries are accessed
    /// in insertion order (sequential scan), so linear search has optimal
    /// prefetch behavior regardless of L1/L2 residency.
    entries: ArrayVec<PrefetchEntry, 128>,
    /// Directory fd this buffer was filled for. Used to associate the
    /// buffer with the correct directory on subsequent stat() calls.
    /// When the fd is closed, the buffer is freed.
    dir_fd: i32,
    /// VFS epoch at fill time. If the global `VFS_EPOCH` has advanced
    /// (crash or live evolution), the entire buffer is stale and must
    /// be discarded. This is an O(1) invalidation mechanism.
    vfs_epoch: u64,
}

/// Global VFS epoch counter. Incremented on VFS crash recovery or
/// live evolution. All `TaskPrefetchBuf` instances whose `vfs_epoch`
/// differs from this value are stale.
///
/// Stored as AtomicU64 in Core memory. Incremented with Release
/// ordering; read with Acquire ordering in the stat fast path.
///
/// **Longevity**: u64 counter incremented only on VFS crash or
/// evolution events. At one event per second (vastly exceeding any
/// realistic crash rate), this counter lasts ~584 billion years.
pub static VFS_EPOCH: AtomicU64 = AtomicU64::new(0);

stat() fast path (Core-side, before any domain crossing):

/// Attempt to serve a statx() call from the per-task prefetch buffer.
/// Returns `Some(stx)` on hit (zero domain crossings), `None` on miss
/// (caller falls through to the normal VFS domain crossing path).
///
/// This function runs entirely in Core (Tier 0). No locks, no domain
/// crossings, no ring buffer interaction. The only synchronization is
/// atomic loads on the VFS epoch and per-entry generation counters.
///
/// # Hot path classification
///
/// This is called on every `stat()` / `statx()` / `fstat()` /
/// `newfstatat()` syscall when the task has an active prefetch buffer.
/// Must be O(1) amortized with no heap allocation.
fn sys_statx_fast_path(task: &Task, dentry_dev: DevId, dentry_ino: InodeId) -> Option<StatxBuf> {
    let buf = task.prefetch_buf.as_ref()?;
    // Check VFS epoch — if VFS was replaced, entire buffer is stale.
    if buf.vfs_epoch != VFS_EPOCH.load(Acquire) {
        return None;
    }
    // Linear scan over at most 128 entries. Sequential access pattern
    // means the prefetcher keeps up; worst case is 128 × 320B = 40KB,
    // which exceeds L1 on some architectures (ARMv7 32KB L1D, some
    // AArch64 32-48KB L1D) — the L2 spill adds ~6-11 cycles per touched
    // line on those targets. Acceptable: the miss path is cold (it already
    // pays a domain crossing on fallthrough).
    let entry = buf.entries.iter().find(|e| {
        e.valid.load(Relaxed) == 1 && e.sb_dev == dentry_dev && e.ino == dentry_ino
    })?;
    // Check generation — if inode was mutated since readdir, entry is stale.
    let gen = entry.generation.load(Acquire);
    if gen != inode_current_generation(dentry_dev, dentry_ino) {
        return None;
    }
    Some(entry.stx)
}

Generation counter update path (VFS-side):

When the VFS processes any inode-mutating operation (SetAttr, Truncate, Write that updates mtime/ctime, Link, Unlink, Rename), it increments the inode's generation counter. This is a single AtomicU64::fetch_add(1, Release) on the inode struct — the inode is already locked for the mutation, so this adds zero contention. The generation counter is stored in the inode struct itself (in VFS memory), and the Core stat fast path reads it via a shared-memory mapping (the inode's generation field is in a page mapped read-only into Core's domain). No ring buffer crossing is needed for the generation check.

Cross-domain memory ordering invariant: This is a shared-memory cross-domain access pattern — the VFS (Tier 1) writes the generation counter with Release, and Core (Tier 0) reads it with Acquire. On x86-64 (TSO), these translate to plain loads/stores with no additional fences. On ARM/AArch64 and RISC-V (weak memory models), the Acquire load emits the appropriate barrier instruction (LDAR on AArch64, fence r,rw on RISC-V) to ensure the stat fields are observed consistently with the generation counter. This ordering MUST NOT be downgraded to Relaxed in future maintenance — doing so would allow Core to observe a new generation counter but stale stat fields, returning incorrect metadata to userspace.
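This pairing can be demonstrated in isolation. The sketch below is a userspace analogue (std atomics standing in for the cross-domain shared page; `SharedMeta` and its field names are illustrative): the writer updates a stat field, then publishes the generation with Release; the reader Acquire-loads the generation and is guaranteed to observe the field update.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Stand-in for the inode's shared metadata: one statx field plus the
// generation counter. In the kernel these live in a page mapped
// read-only into Core's domain; here they are ordinary process memory.
struct SharedMeta {
    size: AtomicU64,       // stand-in for a statx field (e.g. stx_size)
    generation: AtomicU64,
}

/// Writer (VFS role) mutates the field, then publishes the generation
/// with Release. Reader (Core role) spins until it Acquire-loads the
/// new generation, then reads the field — the Release/Acquire pair
/// guarantees it observes the updated value, even on weak memory models.
fn publish_and_read() -> u64 {
    let meta = Arc::new(SharedMeta {
        size: AtomicU64::new(0),
        generation: AtomicU64::new(0),
    });

    let writer = {
        let m = Arc::clone(&meta);
        thread::spawn(move || {
            m.size.store(4096, Ordering::Relaxed);    // mutate stat field
            m.generation.store(1, Ordering::Release); // then publish
        })
    };
    let reader = {
        let m = Arc::clone(&meta);
        thread::spawn(move || {
            while m.generation.load(Ordering::Acquire) != 1 {
                std::hint::spin_loop();
            }
            m.size.load(Ordering::Relaxed) // guaranteed to see 4096
        })
    };

    writer.join().unwrap();
    reader.join().unwrap()
}

fn main() {
    assert_eq!(publish_and_read(), 4096);
}
```

Downgrading either side to Relaxed removes the happens-before edge: the reader could then observe generation 1 with a stale size — exactly the maintenance hazard the invariant above warns about.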

Readdir fill path (VFS-side):

During readdir() processing, after the filesystem driver returns each batch of directory entries (via the VfsResponse for ReadDir), the VFS iterates over the returned entries and for each one:

  1. Looks up the inode in the inode cache (already resolved during readdir).
  2. Copies the inode's current statx attributes into a PrefetchEntry.
  3. Stores the current inode generation counter.
  4. Writes the entry into the task's TaskPrefetchBuf via the shared-memory mapping.

This piggybacks on work already being done — the inode is cache-hot from the readdir lookup. The additional cost is ~50-80 cycles per entry (one memcpy of 256 bytes for the StatxBuf + two atomic stores). For a typical directory of 50 entries, this is ~2,500-4,000 cycles total — less than the cost of a single domain crossing.
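The four fill steps can be sketched as follows. This is a simplified host-side model, not kernel code: `StatxBuf` is shrunk to two fields, the atomics and shared-memory mapping are elided, and a `Vec` stands in for the fixed-capacity buffer.

```rust
// Hypothetical, simplified stand-ins for the kernel types in this section
// (the real PrefetchEntry uses atomics and a 256-byte StatxBuf).
#[allow(dead_code)]
#[derive(Clone, Copy, Default)]
struct StatxBuf { size: u64, mtime_ns: u64 }

#[allow(dead_code)]
#[derive(Clone, Copy)]
struct PrefetchEntry { sb_dev: u32, ino: u64, generation: u64, stx: StatxBuf }

struct Inode { ino: u64, generation: u64, stx: StatxBuf }

const PREFETCH_CAP: usize = 128;

/// Fill-path sketch: each ReadDir batch replaces the buffer contents
/// (matching the "only the most recent batch is buffered" policy), and
/// each entry snapshots the cache-hot inode's attributes + generation.
fn fill_prefetch(sb_dev: u32, batch: &[Inode], buf: &mut Vec<PrefetchEntry>) {
    buf.clear();
    for inode in batch.iter().take(PREFETCH_CAP) {
        buf.push(PrefetchEntry {
            sb_dev,
            ino: inode.ino,
            generation: inode.generation, // step 3: snapshot generation
            stx: inode.stx,               // step 2: copy statx attributes
        });
    }
}

fn main() {
    let batch = vec![
        Inode { ino: 2, generation: 7, stx: StatxBuf { size: 10, mtime_ns: 1 } },
        Inode { ino: 3, generation: 1, stx: StatxBuf { size: 20, mtime_ns: 2 } },
    ];
    let mut buf = Vec::new();
    fill_prefetch(8, &batch, &mut buf);
    assert_eq!(buf.len(), 2);
    assert_eq!(buf[1].stx.size, 20);
}
```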

14.1.1.2.2 Mechanism 2: io_uring Statx Coalescing

When the io_uring submission queue contains multiple consecutive IORING_OP_STATX entries, the VFS dispatcher coalesces them into a single domain crossing. Instead of N crossings for N stat requests, one crossing processes all N.

Detection and dispatch:

The io_uring dispatch loop (Section 19.3) already processes SQEs in batches. When the dispatcher encounters an IORING_OP_STATX SQE, it peeks ahead in the submission queue for consecutive IORING_OP_STATX entries, collecting up to 64 into a single batch. The batch is sent to the VFS as a single VfsRequest::StatxBatch over the ring buffer.
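The peek-ahead loop can be sketched under assumed simplifications: the SQE is reduced to the two fields the coalescer inspects, and the ring is a plain slice (21 is the Linux opcode value for IORING_OP_STATX).

```rust
/// Linux opcode value for IORING_OP_STATX.
const IORING_OP_STATX: u8 = 21;
const MAX_BATCH: usize = 64;

// Simplified SQE: only the fields the coalescer needs.
#[derive(Clone, Copy)]
struct Sqe { opcode: u8, user_data: u64 }

/// Peek ahead from `start`, collecting up to 64 consecutive STATX SQEs.
/// Returns the batch plus the index of the first SQE not consumed; the
/// dispatcher resumes normal per-SQE processing from there. Each SQE in
/// the batch keeps its user_data so it still gets its own CQE.
fn collect_statx_batch(sq: &[Sqe], start: usize) -> (Vec<Sqe>, usize) {
    let mut batch = Vec::new();
    let mut i = start;
    while i < sq.len() && batch.len() < MAX_BATCH && sq[i].opcode == IORING_OP_STATX {
        batch.push(sq[i]);
        i += 1;
    }
    (batch, i)
}

fn main() {
    let sq = [
        Sqe { opcode: IORING_OP_STATX, user_data: 1 },
        Sqe { opcode: IORING_OP_STATX, user_data: 2 },
        Sqe { opcode: 1 /* IORING_OP_READV */, user_data: 3 },
    ];
    let (batch, next) = collect_statx_batch(&sq, 0);
    assert_eq!(batch.len(), 2); // two STATX SQEs coalesced
    assert_eq!(next, 2);        // dispatcher resumes at the READV
    assert_eq!(batch[1].user_data, 2);
}
```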

/// Batched statx request. Sent as a single VfsRequest when the io_uring
/// dispatcher detects consecutive IORING_OP_STATX SQEs.
///
/// The VFS resolves all paths in a single domain stay and writes all
/// results back in a single VfsResponse::StatxBatchResult.
pub struct StatxBatchArgs {
    /// Number of statx requests in this batch (1..=64).
    pub count: u8,
    /// DMA buffer handle containing an array of `StatxBatchEntry` structs.
    /// The buffer is allocated from the io_uring's pre-registered buffer
    /// pool when available, or from the shared DMA pool otherwise.
    pub entries_buf: DmaBufferHandle,
}

/// Single entry within a StatxBatch request.
#[repr(C)]
pub struct StatxBatchEntry {
    /// Directory fd for path resolution (AT_FDCWD or an open directory).
    pub dirfd: i32,
    /// AT_* flags (AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH, etc.).
    pub flags: u32,
    /// STATX_* request mask.
    pub mask: u32,
    /// Path string offset within the DMA buffer's string region.
    pub path_offset: u32,
    /// Path string length in bytes.
    pub path_len: u16,
    /// Padding to 4-byte alignment for array element stride. Without this
    /// pad, `path_len: u16` ends the fields at offset 18 and the compiler
    /// inserts 2 bytes of implicit tail padding to keep array elements
    /// `u32`-aligned (size 20 either way). Making the padding explicit
    /// ensures no uninitialized bytes leak to userspace.
    pub _pad: [u8; 2],
}
const_assert!(size_of::<StatxBatchEntry>() == 20);

/// Batched statx response. One result per entry in the request.
/// C-compatible layout: fixed array with explicit count. Neither
/// `ArrayVec` nor `Result<T, E>` have stable repr(C) layout.
#[repr(C)]
pub struct StatxBatchResult {
    /// Number of valid entries in `results`.
    pub count: u8,
    /// Explicit padding: `count` (u8) ends at offset 1; `results[0]` needs
    /// 8-byte alignment (inherited from StatxBuf), so 7 pad bytes place it
    /// at offset 8.
    pub _pad: [u8; 7],
    /// Per-entry results. Index corresponds to the request entry index.
    pub results: [StatxBatchResultEntry; 64],
}

/// Single entry in a batched statx response.
#[repr(C)]
pub struct StatxBatchResultEntry {
    /// 0 = success (stx is valid), negative = errno (stx is zeroed).
    pub error: i32,
    /// Explicit padding: `error` (i32) ends at offset 4; `stx` needs 8-byte
    /// alignment (inherited from StatxBuf), so 4 pad bytes place it at
    /// offset 8.
    pub _pad: [u8; 4],
    /// Valid only when `error == 0`.
    pub stx: StatxBuf,
}
// Layout: error(4) + _pad(4) + StatxBuf(256) = 264 bytes. All padding explicit.
const_assert!(size_of::<StatxBatchResultEntry>() == 264);
// StatxBatchResult: count(1) + _pad(7) + 64 × 264 = 16904 bytes. All padding explicit.
const_assert!(size_of::<StatxBatchResult>() == 16904);

Key design properties:

  • Transparent to userspace. Applications submit individual IORING_OP_STATX SQEs as usual. The coalescing is entirely internal to the kernel's io_uring dispatch path. Each SQE still gets its own CQE with the correct user_data and result code.
  • No additional memory overhead. The batch uses the existing ring buffer and DMA buffer infrastructure. The StatxBatchEntry array is written into a DMA buffer that is already allocated for the io_uring ring.
  • Works with all filesystem types. Each stat within the batch goes through normal VFS path resolution and filesystem locking. The coalescing only eliminates the domain crossing overhead, not any per-file locking. Filesystems with Never prefetch policy still benefit from crossing amortization.
  • Interaction with readdir-plus prefetch. Before sending a StatxBatch to the VFS, the dispatcher checks each entry against the task's TaskPrefetchBuf. Entries that hit the prefetch buffer are resolved immediately and removed from the batch. Only cache-miss entries cross into the VFS domain. In the best case (all 64 entries hit the prefetch buffer), zero domain crossings occur.
  • Ring protocol extension. StatxBatch is added as VfsOpcode::StatxBatch = 70 in the VFS ring protocol (Section 14.2). The response uses VfsOpcode::StatxBatchResult = 71. These opcodes are only generated by the io_uring coalescing path; they are never exposed to filesystem drivers directly (the VFS dispatches individual Getattr calls internally for each entry in the batch).
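The prefetch-check step can be sketched as a partition over batch entries. Simplifying assumption: each request has already been reduced to its (sb_dev, ino) key — the dirfd/path resolution that precedes this in the real dispatcher is elided, and plain tuples stand in for the buffer entries.

```rust
/// Partition a statx batch into prefetch hits (answered immediately,
/// zero crossings) and misses (forwarded to the VFS as a StatxBatch).
/// `requests` and `prefetch` hold (sb_dev, ino) keys; returned vectors
/// hold request indices so each SQE keeps its own completion slot.
fn partition_batch(
    requests: &[(u32, u64)],
    prefetch: &[(u32, u64)],
) -> (Vec<usize>, Vec<usize>) {
    let mut hits = Vec::new();
    let mut misses = Vec::new();
    for (i, key) in requests.iter().enumerate() {
        if prefetch.contains(key) {
            hits.push(i);
        } else {
            misses.push(i);
        }
    }
    (hits, misses)
}

fn main() {
    let prefetch = [(8, 2), (8, 3)];
    let requests = [(8, 2), (8, 9), (8, 3)];
    let (hits, misses) = partition_batch(&requests, &prefetch);
    assert_eq!(hits, vec![0, 2]); // served from the prefetch buffer
    assert_eq!(misses, vec![1]);  // the only entry that crosses domains
}
```

In the best case (`misses` empty), the dispatcher sends nothing to the VFS at all.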
14.1.1.2.3 Prefetch Policy Framework

Each filesystem declares its readdir prefetch policy via a method on the FileSystemOps trait (Section 14.1):

/// Policy controlling whether the VFS prefetches statx metadata during
/// readdir for this filesystem.
///
/// The default implementation returns `Always`, which is correct for all
/// single-node local filesystems. Network and cluster filesystems must
/// override this to return the appropriate policy.
pub enum ReaddirPrefetchPolicy {
    /// Always prefetch. Data is authoritative — no external consistency
    /// concerns. The VFS fills the per-task prefetch buffer on every
    /// readdir, and stat() uses the buffer unconditionally (subject to
    /// generation counter freshness).
    ///
    /// Appropriate for: ext4, XFS, btrfs, tmpfs, procfs, sysfs, debugfs.
    Always,

    /// Prefetch, but layer on top of the filesystem's existing attribute
    /// cache. The prefetch buffer entries are valid only as long as the
    /// filesystem's own cache considers them valid. When the filesystem
    /// invalidates its cache (e.g., NFS delegation recall, CIFS oplock
    /// break, FUSE attr_timeout expiry), it calls
    /// `vfs_invalidate_prefetch(sb_dev, ino)` which bumps the generation
    /// counter on any matching prefetch entry. No additional locking is
    /// needed — the prefetch mechanism layers on top of whatever
    /// consistency protocol the filesystem already implements.
    ///
    /// Appropriate for: NFS, CIFS/SMB, FUSE, UmkaOS peerfs.
    CacheAware,

    /// Never prefetch. Each stat() acquires its own consistency token
    /// (e.g., DLM glock) for linearizability. The domain crossing cost
    /// (~18ns on x86-64) is negligible compared to the distributed lock
    /// round-trip (~50-500us), so prefetch elimination provides no
    /// measurable benefit and would violate the consistency model.
    ///
    /// Appropriate for: GFS2, OCFS2, UmkaOS DLM-based cluster FS.
    Never,
}
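A sketch of the declaration hook. The chapter specifies a method on FileSystemOps with an `Always` default but does not name it, so `readdir_prefetch_policy` and the trimmed trait below are illustrative:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ReaddirPrefetchPolicy { Always, CacheAware, Never }

/// Trait trimmed to the single policy hook for illustration.
pub trait FileSystemOps {
    /// Default is correct for single-node local filesystems; network
    /// and cluster filesystems must override.
    fn readdir_prefetch_policy(&self) -> ReaddirPrefetchPolicy {
        ReaddirPrefetchPolicy::Always
    }
}

struct Ext4; // local FS: inherits the Always default
struct Nfs;  // network FS: must override

impl FileSystemOps for Ext4 {}
impl FileSystemOps for Nfs {
    fn readdir_prefetch_policy(&self) -> ReaddirPrefetchPolicy {
        ReaddirPrefetchPolicy::CacheAware
    }
}

fn main() {
    assert_eq!(Ext4.readdir_prefetch_policy(), ReaddirPrefetchPolicy::Always);
    assert_eq!(Nfs.readdir_prefetch_policy(), ReaddirPrefetchPolicy::CacheAware);
}
```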

Per-filesystem policy table:

| Filesystem Type | Policy | Rationale |
|---|---|---|
| ext4, XFS, btrfs, tmpfs | Always | Single-node, no external consistency concerns. Inode data is authoritative. |
| procfs, sysfs, debugfs | Always | Synthetic FS. Metadata is kernel-generated and stable within a readdir window. |
| NFS (v3/v4) | CacheAware | Layers on the NFS actimeo/delegation cache. CB_RECALL bumps the generation counter. |
| CIFS/SMB | CacheAware | Layers on the oplock/lease cache. A lease break bumps the generation counter. |
| FUSE | CacheAware | Layers on FUSE entry_timeout/attr_timeout. Unified behavior regardless of whether the daemon supports FUSE_READDIRPLUS. |
| UmkaOS peerfs (distributed) | CacheAware | Peer protocol metadata push notifications bump the generation counter (Section 5.1). |
| GFS2, OCFS2 | Never | DLM linearizability required. The ~18ns crossing is 0.004-0.036% of a ~50-500us glock round-trip. |
| UmkaOS DLM-based cluster FS | Never | Same as GFS2/OCFS2 — the DLM consistency model requires per-stat lock acquisition. |
| Overlayfs | Inherits | Upper layer: Always (local, mutable). Lower layers: read-only, so Always (immutable data is trivially consistent). |
14.1.1.2.4 Network/Cluster Filesystem Invalidation Integration

The CacheAware policy integrates with each filesystem's existing cache invalidation mechanism through a single Core-side callback:

/// Invalidate prefetch entries for the given (sb_dev, ino) pair.
/// Called by filesystem cache invalidation handlers (NFS CB_RECALL,
/// CIFS lease break, FUSE NOTIFY_INVAL_INODE, peerfs METADATA_INVALIDATE).
///
/// No buffer iteration is needed: bumping the inode's generation counter
/// lazily invalidates every prefetch entry that snapshotted the old value.
/// Each such entry fails the freshness check on its next stat() fast path
/// and falls through to the VFS.
///
/// # Performance
///
/// Cold path — called only on cache invalidation events, which are
/// infrequent relative to stat() calls. Cost is a single atomic
/// fetch_add on the inode's generation counter.
pub fn vfs_invalidate_prefetch(sb_dev: DevId, ino: InodeId) {
    // Bump the generation counter on the inode. Any prefetch entry
    // holding the old generation will fail the freshness check on the
    // next stat() fast path and fall through to the VFS.
    inode_bump_generation(sb_dev, ino);
}
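What the invalidation amounts to can be sketched with std atomics. `bump_generation` stands in for `inode_bump_generation` (the (sb_dev, ino) → inode lookup is elided), and `entry_is_fresh` mirrors the stat() fast path's freshness check:

```rust
use std::sync::atomic::{AtomicU64, Ordering::{Acquire, Release}};

/// One fetch_add with Release ordering on the inode's generation
/// counter — the same operation the mutation path performs.
fn bump_generation(inode_generation: &AtomicU64) {
    inode_generation.fetch_add(1, Release);
}

/// A prefetch entry is usable only while its snapshotted generation
/// still matches the inode's current generation.
fn entry_is_fresh(entry_generation: u64, inode_generation: &AtomicU64) -> bool {
    entry_generation == inode_generation.load(Acquire)
}

fn main() {
    let gen = AtomicU64::new(7);
    let snapshot = 7; // taken at readdir fill time
    assert!(entry_is_fresh(snapshot, &gen));

    bump_generation(&gen); // e.g. CB_RECALL / lease break / attr_timeout

    // The entry is now stale: the next stat() falls through to the VFS.
    assert!(!entry_is_fresh(snapshot, &gen));
}
```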

Per-filesystem invalidation triggers:

  • NFS: When the NFS client receives CB_RECALL (NFSv4 delegation return) or detects actimeo expiry (NFSv3/v4 attribute timeout), it calls vfs_invalidate_prefetch(sb_dev, ino). Next stat() sees generation mismatch, crosses to VFS, and the NFS client re-fetches attributes from the server.

  • CIFS/SMB: When the CIFS client receives an oplock break or lease break notification from the SMB server, it calls vfs_invalidate_prefetch(sb_dev, ino).

  • FUSE: When attr_timeout expires or the FUSE daemon sends FUSE_NOTIFY_INVAL_INODE, the FUSE client calls vfs_invalidate_prefetch(sb_dev, ino).

  • UmkaOS peerfs: The peer protocol METADATA_INVALIDATE message (Section 5.1) triggers vfs_invalidate_prefetch(). This is tighter than NFS because the peer node pushes invalidations proactively (not just on delegation recall), reducing the stale-data window.

Why this is better than existing approaches:

  1. Eliminates the domain crossing. NFS READDIRPLUS only eliminates the network round-trip for attribute fetches; the kernel-side VFS domain crossing still occurs for each stat(). Our readdir-plus prefetch eliminates both.
  2. Unified across all remote filesystems. NFS, FUSE, CIFS, and peerfs all use the same Core-side prefetch buffer with the same generation-counter invalidation. No per-filesystem prefetch implementation is needed.
  3. Generation-counter invalidation is more precise than time-based expiry. NFS actimeo is a blunt timeout; our generation counter reflects actual inode mutations. The result is fewer false invalidations and a higher effective hit rate.
14.1.1.2.5 Live Evolution Interaction
  • The prefetch buffer is in Core memory (Tier 0) and survives VFS live evolution (Section 13.18) unchanged.
  • On VFS evolution: Core increments VFS_EPOCH (single fetch_add(1, Release)). All prefetch buffers become stale in O(1) — no per-task or per-entry iteration.
  • The new VFS instance exports fresh inode generation counters. The first readdir() after evolution refills the buffer with current data.
  • No data from the old VFS instance leaks through — the epoch check catches everything before any stale entry is returned to userspace.
14.1.1.2.6 Crash Recovery Interaction
  • On VFS crash: Core increments VFS_EPOCH (same mechanism as evolution). All prefetch buffers are invalidated atomically.
  • Buffer memory is in Core (Tier 0) and cannot be corrupted by a VFS crash.
  • Subsequent stat() calls miss the buffer, fall through to the VFS domain crossing, and trigger VFS restart via the normal crash recovery path (Section 11.9).
  • After VFS recovery completes, the next readdir() refills the buffer. The transient period between crash and buffer refill uses the unoptimized path (full domain crossing per stat), which is correct but slower.
14.1.1.2.7 Amortized Performance Budget

With both mechanisms active, the effective metadata overhead for common access patterns:

| Access Pattern | Mechanism | Effective Overhead (x86-64 MPK) | Domain Crossings |
|---|---|---|---|
| Single stat() (no prefetch) | None | ~3.6-9% per call (~18ns / 200-500ns base) | 1 round-trip |
| readdir + stat (Always/CacheAware FS) | Readdir-plus prefetch | ~0.3-0.5% effective | 1 crossing for readdir, 0 for stat hits (~95% hit rate) |
| io_uring batch of 64 IORING_OP_STATX | Statx coalescing | ~0.05% per stat | 1 crossing for 64 stats |
| io_uring batch + prefetch buffer | Both | ~0.01% per stat (best case) | 0 crossings on full prefetch hit |
| readdir + stat (Never-policy FS) | None (DLM overhead dominates) | ~3.6-9% (negligible vs ~50-500us DLM) | 1 round-trip per stat |

Assumptions: 95% prefetch buffer hit rate for readdir+stat pattern (based on: median directory has <128 entries, stat() calls follow readdir in program order, inode mutation between readdir and stat is rare). Hit rate degrades for directories >128 entries (only the last batch is buffered) and for workloads that interleave mutations with stat.
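The readdir+stat row can be reproduced from the stated figures. The formula below is an illustrative model built only from the assumptions above (~18ns crossing, 95% hit rate, one readdir crossing amortized over a 50-entry directory), not a measurement:

```rust
/// Back-of-envelope effective overhead for the readdir+stat pattern:
/// each stat pays the crossing only on a miss, plus its amortized share
/// of the single readdir crossing.
fn effective_overhead_pct(base_stat_ns: f64) -> f64 {
    let crossing_ns = 18.0; // x86-64 MPK domain crossing
    let miss_rate = 0.05;   // 95% prefetch hit rate
    let dir_entries = 50.0; // amortizes the one readdir crossing
    let per_stat_ns = miss_rate * crossing_ns + crossing_ns / dir_entries;
    100.0 * per_stat_ns / base_stat_ns
}

fn main() {
    // ~0.25% at a 500ns base stat, ~0.63% at 200ns — bracketing the
    // table's ~0.3-0.5% "effective" figure.
    assert!((0.2..0.3).contains(&effective_overhead_pct(500.0)));
    assert!((0.6..0.7).contains(&effective_overhead_pct(200.0)));
}
```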

Phase assignment: Readdir-plus prefetch is Phase 2 (required for metadata-heavy workload performance targets). io_uring statx coalescing is Phase 3 (optimization; the system is correct without it).

14.1.2 VFS Architecture

Responsibilities: path resolution, dentry caching, inode management, mount tree traversal, and permission checks (delegated to umka-core's capability system via the inter-domain ring buffer).

14.1.2.1 Nucleus / Evolvable Classification

Every VFS component is explicitly classified per the replaceability model (Section 13.18). Nucleus components are non-replaceable data structures whose invariants are verified and whose layout survives live evolution. Evolvable components are replaceable policy modules that can be hot-swapped via EvolvableComponent without rebooting.

| Component | Classification | Rationale |
|---|---|---|
| Dentry cache (dcache hash table, Dentry struct, LRU list) | Nucleus | Correctness-critical: path resolution depends on dentry integrity. The RCU-walk protocol embeds ordering invariants into the data structure. Corrupted dcache = silent wrong-file access. Cannot be swapped while RCU readers hold references. |
| Inode cache (per-superblock XArray, Inode struct, AddressSpace) | Nucleus | Correctness-critical: inode metadata (permissions, size, link count) governs security and data integrity. The inode generation counter protocol for prefetch invalidation depends on immutable layout. |
| Mount table (MountNamespace, Mount struct, mount hash table) | Nucleus | Correctness-critical: mount tree integrity governs which filesystem serves each path. Corrupted mount table = namespace escape. The RCU-protected mount hash and propagation graph encode safety invariants. |
| SuperBlock (per-mount filesystem state, SbWriters, freeze FSM) | Nucleus | Correctness-critical: the freeze state machine, writer tracking counters, and error behavior mode are safety-critical invariants that must not change during operation. |
| VFS ring protocol (VfsRingSet, VfsRingPair, request/response format) | Nucleus | Correctness-critical: the ring is the isolation boundary. Ring layout, opcode encoding, and response matching must be immutable across live evolution. A new VFS Evolvable inherits the existing ring set (see Section 14.3). |
| Dirty extent protocol (DirtyIntentList, reserve/commit/abort API) | Nucleus | Correctness-critical: crash recovery depends on the intent list being complete and uncorrupted. The two-phase protocol's invariants (token single-use, overflow backpressure) are safety properties. |
| ErrSeq (writeback error tracking) | Nucleus | Correctness-critical: one-shot error reporting to userspace is a POSIX contract. The atomic packing of errno + counter is a verified invariant. |
| Path resolution algorithm (RCU-walk, ref-walk fallback, symlink loop detection) | Nucleus | Correctness-critical: symlink loop detection (depth limit, visited set) and mount-crossing logic are security invariants. TOCTOU resistance depends on the algorithm, not tunable parameters. |
| Readahead window sizing (sequential detection, window growth/shrink) | Evolvable | Policy decision: the heuristic for when to grow or shrink the readahead window is a tuning knob, not a correctness property. ML can improve it. The readahead engine (page pre-allocation, I/O submission) is Nucleus; the window sizing policy is Evolvable. |
| Writeback scheduling (BDI dirty page selection, inode writeback ordering) | Evolvable | Policy decision: which dirty inodes to write back first, how to interleave sequential and random I/O, and when to trigger background writeback are heuristic choices. The writeback infrastructure (bio submission, completion tracking, writeback_lock) is Nucleus. |
| Dirty page throttling (balance_dirty_pages pause duration, dirty ratio) | Evolvable | Policy decision: the bandwidth-proportional throttling algorithm and dirty ratio thresholds are tunable via sysctl and ML. The throttling mechanism (task sleep, PerCpuCounter for dirty page counts) is Nucleus. |
| Dentry LRU eviction policy (which unused dentries to reclaim first) | Evolvable | Policy decision: LRU ordering and shrinker batch size are heuristics. The LRU list data structure is Nucleus. |
| Readdir-plus prefetch policy (ReaddirPrefetchPolicy per-filesystem) | Evolvable | Policy decision: whether to prefetch statx metadata during readdir is a per-filesystem heuristic. The prefetch buffer infrastructure (TaskPrefetchBuf, PrefetchEntry) is Nucleus. |
| Doorbell coalescing policy (batch size, timeout thresholds) | Evolvable | Policy decision: the coalescing batch size and timeout are ML-tunable parameters. The coalescing mechanism (CoalescedDoorbell, atomic bitmask) is Nucleus. |

Swap mechanics: When the VFS Evolvable is live-replaced (Section 13.18), the Nucleus data structures (dcache, inode cache, mount table, ring set, dirty intent lists) are preserved in-place. The new Evolvable inherits them and resumes operation. Only the policy vtables (readahead sizing, writeback scheduling, dirty throttling, LRU eviction) are swapped. This is why the Nucleus/Evolvable boundary is drawn at the data-structure / policy-algorithm line: data survives the swap, policy is replaced.

Filesystem drivers register as VFS backends. The VFS never interprets on-disk format directly — it delegates all storage operations through three trait interfaces:

Foundational VFS types (used throughout this chapter):

/// Opaque filesystem inode identifier. Unique within a single SuperBlock.
///
/// Inode 0 is never valid (used as the null sentinel in `AtomicOption`).
/// Inode 1 is conventionally the root directory inode.
/// The u64 width accommodates all known filesystem inode spaces (ext4 uses
/// u32 internally but promotes to u64 for future-proofing; Btrfs and ZFS
/// use u64 natively).
///
/// `InodeId` is filesystem-private: the same u64 value in two different
/// `SuperBlock` instances refers to different inodes.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct InodeId(pub u64);

impl From<u64> for InodeId { fn from(v: u64) -> Self { InodeId(v) } }
impl From<InodeId> for u64  { fn from(id: InodeId) -> u64 { id.0 } }

/// Opaque VFS pipe identifier. Each `pipe(2)` / `pipe2(2)` call produces a
/// unique `PipeId` for internal tracking (waitqueue association, splice
/// routing, and PipeBuffer lifetime management). Not visible to userspace.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct PipeId(pub u64);

/// Memory protection flags for `FileOps::mmap()`.
///
/// Bitfield matching Linux `PROT_*` constants from `<sys/mman.h>`.
/// Passed by the VMM to the filesystem's mmap callback so it can validate
/// or adjust protections (e.g., deny PROT_WRITE for read-only mounts,
/// deny PROT_EXEC for noexec mounts).
///
/// These are the userspace-facing PROT_* values, NOT the kernel-internal
/// VM_* flags. The VMM converts between MmapProt and VmFlags via
/// `prot_flags_to_vm_flags()` ([Section 4.8](04-memory.md#virtual-memory-manager)).
pub struct MmapProt(u32);

impl MmapProt {
    pub const NONE:  MmapProt = MmapProt(0x0);
    pub const READ:  MmapProt = MmapProt(0x1); // PROT_READ
    pub const WRITE: MmapProt = MmapProt(0x2); // PROT_WRITE
    pub const EXEC:  MmapProt = MmapProt(0x4); // PROT_EXEC

    pub fn contains(&self, flag: MmapProt) -> bool {
        self.0 & flag.0 == flag.0
    }
}
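A hypothetical caller-side use of `MmapProt::contains`, in the spirit of the doc comment's examples (deny PROT_WRITE on read-only mounts, deny PROT_EXEC on noexec mounts). `validate_mmap` and the mount-flag booleans are illustrative stand-ins for real MountFlags queries; `MmapProt` is reproduced from above with `Copy` derived for convenience:

```rust
#[derive(Clone, Copy)]
pub struct MmapProt(u32);

impl MmapProt {
    pub const READ:  MmapProt = MmapProt(0x1);
    pub const WRITE: MmapProt = MmapProt(0x2);
    pub const EXEC:  MmapProt = MmapProt(0x4);
    pub fn contains(&self, flag: MmapProt) -> bool { self.0 & flag.0 == flag.0 }
}

const EACCES: i32 = 13;

/// Reject mappings whose protections conflict with the mount flags,
/// returning a negated errno as the rest of the chapter does.
fn validate_mmap(prot: MmapProt, noexec: bool, readonly: bool) -> Result<(), i32> {
    if noexec && prot.contains(MmapProt::EXEC) {
        return Err(-EACCES);
    }
    if readonly && prot.contains(MmapProt::WRITE) {
        return Err(-EACCES);
    }
    Ok(())
}

fn main() {
    assert_eq!(validate_mmap(MmapProt::EXEC, true, false), Err(-13));
    assert_eq!(validate_mmap(MmapProt::WRITE, false, true), Err(-13));
    assert!(validate_mmap(MmapProt::READ, true, true).is_ok());
}
```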

/// Result type returned by `FileOps::mmap()`.
///
/// On success, the filesystem returns `MmapResult` describing any
/// adjustments it made to the mapping. The VMM applies these adjustments
/// to the VMA after the callback returns.
///
/// For the Tier 0 in-kernel direct-call path (where `f_op.mmap(f, &mut vma)`
/// modifies the VMA directly), `MmapResult::Ok` is returned after the VMA
/// has been modified in place. For the KABI ring transport (Tier 1/2), the
/// decomposed return struct carries the adjusted fields back to the VMM.
pub struct MmapResult {
    /// Adjusted vm_flags (the filesystem may set VM_IO, clear VM_MAYWRITE, etc.).
    /// If the filesystem did not modify flags, this equals the input vm_flags.
    pub vm_flags: u64,
    /// Filesystem-specific VmOperations handle (opaque u64 for KABI transport).
    /// The VMM sets `vma.vm_ops` from this value. Zero means no custom vm_ops.
    pub vm_ops_handle: u64,
}

/// Response envelope for cross-domain VFS ring buffer calls.
///
/// This is the kernel-internal typed representation. The wire-level
/// representation on the ring buffer is `VfsResponseWire`
/// ([Section 14.2](#vfs-ring-buffer-protocol)), which uses a single `i64 status`
/// field for compact encoding. The VFS dispatch layer converts between
/// the two: `status >= 0` → `Ok(status)`, `status < 0 && status !=
/// i64::MIN` → `Err(status as i32)`, `status == i64::MIN` → `Pending`.
#[derive(Debug)]
pub enum VfsResponse {
    /// Success, possibly with a return value (e.g., byte count for read/write).
    Ok(i64),
    /// Error code (negated Linux errno, e.g., `-ENOENT`).
    Err(i32),
    /// Asynchronous completion pending; caller must wait on the completion ring.
    Pending,
}
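The wire-to-typed conversion described in the envelope's doc comment can be written directly; `decode_status` is an illustrative name for the dispatch-layer helper:

```rust
#[derive(Debug, PartialEq)]
pub enum VfsResponse {
    Ok(i64),
    Err(i32),
    Pending,
}

/// Decode the wire-level `i64 status` field: `status >= 0` is success,
/// `i64::MIN` is the Pending sentinel, and any other negative value is
/// a negated errno.
pub fn decode_status(status: i64) -> VfsResponse {
    if status >= 0 {
        VfsResponse::Ok(status)
    } else if status == i64::MIN {
        VfsResponse::Pending
    } else {
        VfsResponse::Err(status as i32)
    }
}

fn main() {
    assert_eq!(decode_status(4096), VfsResponse::Ok(4096));    // byte count
    assert_eq!(decode_status(-2), VfsResponse::Err(-2));       // -ENOENT
    assert_eq!(decode_status(i64::MIN), VfsResponse::Pending); // async
}
```

Reserving `i64::MIN` as the Pending sentinel is safe because no errno reaches that value; every other negative i64 round-trips through `i32` unchanged for valid errno magnitudes.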
/// Filesystem-level operations (mount, unmount, statfs).
/// Implemented once per filesystem type (ext4, XFS, btrfs, ZFS, tmpfs, etc.).
pub trait FileSystemOps: Send + Sync {
    /// Mount a filesystem from the given source device with flags and options.
    fn mount(&self, source: &str, flags: MountFlags, data: &[u8]) -> Result<SuperBlock>;

    /// Unmount a previously mounted filesystem.
    fn unmount(&self, sb: &SuperBlock) -> Result<()>;

    /// Force-unmount: abort in-flight I/O with EIO. Called when umount2()
    /// is invoked with MNT_FORCE. Not all filesystems support this — return
    /// ENOSYS if unsupported. NFS uses this for stale server recovery.
    fn force_umount(&self, sb: &SuperBlock) -> Result<()>;

    /// Return filesystem statistics (total/free/available blocks and inodes).
    fn statfs(&self, sb: &SuperBlock) -> Result<StatFs>;

    /// Flush all dirty data and metadata for this filesystem to stable storage.
    /// Backend for syncfs(2) and the filesystem-level portion of sync(2).
    fn sync_fs(&self, sb: &SuperBlock, wait: bool) -> Result<()>;

    /// Remount with changed flags/options (e.g., `mount -o remount,ro`).
    fn remount(&self, sb: &SuperBlock, flags: MountFlags, data: &[u8]) -> Result<()>;

    /// Freeze the filesystem for a consistent snapshot. All pending writes are
    /// flushed and new writes block until thaw. Used by LVM snapshots, device-mapper,
    /// and backup tools via FIFREEZE ioctl.
    fn freeze(&self, sb: &SuperBlock) -> Result<()>;

    /// Thaw a previously frozen filesystem, allowing writes to resume.
    fn thaw(&self, sb: &SuperBlock) -> Result<()>;

    /// Format filesystem-specific mount options for /proc/mounts output.
    fn show_options(&self, sb: &SuperBlock, buf: &mut [u8]) -> Result<usize>;

    /// Declare the filesystem's write mode. Called once at mount time and cached
    /// by the VFS in `SuperBlock.write_mode`. Informs writeback scheduling, page
    /// cache sharing, and free space accounting.
    /// See [Section 14.4](#vfs-fsync-and-cow--copy-on-write-and-redirect-on-write-infrastructure)
    /// for the `WriteMode` enum and design rationale.
    /// Default: `WriteMode::InPlace` (traditional overwrite semantics).
    fn write_mode(&self) -> WriteMode {
        WriteMode::InPlace
    }
}

/// Inode (directory structure) operations.
/// Handles namespace operations: lookup, create, link, unlink, rename.
///
/// Note: `OsStr` is a kernel-defined type (NOT `std::ffi::OsStr`, which is
/// unavailable in `no_std`). It is a dynamically-sized type (DST) wrapping
/// `[u8]`, representing filenames that may contain arbitrary non-UTF-8 bytes
/// (Linux filenames are byte strings, not Unicode). Defined in
/// `umka-vfs/src/types.rs`:
///   `pub struct OsStr([u8]);`
/// As a DST, `OsStr` cannot be used by value — it is always behind a
/// reference (`&OsStr`) or `Box<OsStr>`. `&OsStr` is a fat pointer
/// (pointer + length), analogous to `&[u8]` but carrying the semantic
/// intent of "filesystem name component." Conversion from `&str` is
/// infallible (UTF-8 is a valid byte sequence); conversion TO `&str`
/// returns `Result` (may fail on non-UTF-8 filenames).
pub trait InodeOps: Send + Sync {
    /// Look up a child entry by name within a parent directory.
    fn lookup(&self, parent: InodeId, name: &OsStr) -> Result<InodeId>;

    /// Create a regular file in the given directory.
    fn create(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;

    /// Create a subdirectory.
    fn mkdir(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;

    /// Create a hard link: new entry `new_name` in `new_parent` pointing to `inode`.
    fn link(&self, inode: InodeId, new_parent: InodeId, new_name: &OsStr) -> Result<()>;

    /// Create a symbolic link containing `target` at `parent/name`.
    fn symlink(&self, parent: InodeId, name: &OsStr, target: &OsStr) -> Result<InodeId>;

    /// Read the target of a symbolic link.
    fn readlink(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;

    /// Create a device special file (block/char device, FIFO, or socket).
    fn mknod(&self, parent: InodeId, name: &OsStr, mode: FileMode, dev: DevId) -> Result<InodeId>;

    /// Remove a directory entry (unlink for files, rmdir for empty directories).
    fn unlink(&self, parent: InodeId, name: &OsStr) -> Result<()>;

    /// Remove an empty directory. Separate from unlink for POSIX semantics:
    /// `unlink()` on a directory returns EISDIR; `rmdir()` on a file returns ENOTDIR.
    fn rmdir(&self, parent: InodeId, name: &OsStr) -> Result<()>;

    /// Rename/move a directory entry, possibly across directories.
    /// `flags` supports RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT
    /// (Linux renameat2 semantics, required for overlayfs).
    fn rename(
        &self,
        old_parent: InodeId, old_name: &OsStr,
        new_parent: InodeId, new_name: &OsStr,
        flags: RenameFlags,
    ) -> Result<()>;

    /// Get inode attributes (size, mode, timestamps, link count).
    fn getattr(&self, inode: InodeId) -> Result<InodeAttr>;

    /// Set inode attributes (chmod, chown, utimes).
    fn setattr(&self, inode: InodeId, attr: &SetAttr) -> Result<()>;

    /// Truncate a byte range within a file, deallocating the corresponding
    /// on-disk blocks (extent tree updates, journal entries, COW handling).
    /// Used by hole-punch (`FALLOC_FL_PUNCH_HOLE`) and range-discard
    /// operations. The VFS calls this after evicting the affected pages
    /// from the page cache; the filesystem is responsible only for the
    /// on-disk state. `start` and `end` are byte offsets (inclusive start,
    /// exclusive end; `end == u64::MAX` means "to end of file").
    fn truncate_range(&self, inode: InodeId, start: u64, end: u64) -> Result<(), IoError>;

    /// List extended attributes on an inode.
    fn listxattr(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;

    /// Get an extended attribute value.
    fn getxattr(&self, inode: InodeId, name: &OsStr, buf: &mut [u8]) -> Result<usize>;

    /// Set an extended attribute value.
    fn setxattr(&self, inode: InodeId, name: &OsStr, value: &[u8], flags: XattrFlags)
        -> Result<()>;

    /// Remove an extended attribute.
    fn removexattr(&self, inode: InodeId, name: &OsStr) -> Result<()>;

    /// Flush inode metadata to stable storage. Called by
    /// `vfs_fsync_metadata()` for O_SYNC/O_DSYNC writes when the inode's
    /// on-disk metadata must be updated (timestamps, size, block map).
    ///
    /// `sync_mode`: `WB_SYNC_ALL` (wait for I/O completion) or
    /// `WB_SYNC_NONE` (schedule I/O but do not wait). O_SYNC always uses
    /// `WB_SYNC_ALL`.
    fn write_inode(&self, ino: InodeId, sync_mode: WriteSyncMode) -> Result<()>;
}

/// Validated userspace pointer wrapper for writing data to userspace.
///
/// `UserSliceMut` represents a region of userspace memory that the kernel has
/// validated for write access. It ensures that:
/// 1. The pointer range `[ptr, ptr + len)` lies entirely within the task's
///    user address space (below `TASK_SIZE`, not in kernel address space).
/// 2. The pages are mapped writable (or will be demand-faulted on copy).
///
/// **Construction**: Created by `UserSliceMut::new(ptr, len)` which performs
/// the address range validation. This is called early in the syscall path
/// (before any I/O) so that an invalid buffer is rejected with `EFAULT`
/// before work is done.
///
/// **Copy path**: `copy_to_user(dst: &mut UserSliceMut, src: &[u8])` copies
/// kernel data into the validated userspace region. The copy handles:
/// - Page faults: if a destination page is not resident, the fault handler
///   allocates and maps it (demand paging), then retries the copy.
/// - Partial copies: if a fault cannot be resolved (e.g., SIGBUS on a
///   mapped-but-uncommittable page), the copy returns the number of bytes
///   successfully copied. The caller (VFS read dispatch) returns a short
///   read to userspace.
/// - SMAP/PAN enforcement: on architectures with Supervisor Mode Access
///   Prevention (x86 SMAP, ARM PAN), the copy temporarily enables user
///   access via `stac`/`clac` (x86) or `uaccess_enable`/`uaccess_disable`
///   (ARM). The access window is scoped to the copy operation.
///
/// **Advance semantics**: After each `copy_to_user()` call, the internal
/// pointer advances by the number of bytes written and `remaining()` decreases
/// accordingly. This allows iterative filling (e.g., page-by-page copy from
/// the page cache in `generic_file_read_iter()`).
///
/// **Thread safety**: `UserSliceMut` is `!Send` and `!Sync` — it is valid
/// only for the current task's address space on the current CPU. It must not
/// be stored beyond the syscall lifetime.
pub struct UserSliceMut {
    /// Validated userspace destination pointer. Guaranteed to be below
    /// `TASK_SIZE` at construction time.
    ptr: *mut u8,
    /// Remaining bytes available for writing.
    len: usize,
}

impl UserSliceMut {
    /// Create a validated userspace write buffer.
    ///
    /// Returns `EFAULT` if `ptr + len` overflows or exceeds `TASK_SIZE`.
    pub fn new(ptr: *mut u8, len: usize) -> Result<Self, Errno>;

    /// Number of bytes remaining in the buffer.
    pub fn remaining(&self) -> usize;

    /// Copy `src` into the userspace buffer, advancing the internal pointer.
    /// Returns the number of bytes actually copied (may be less than
    /// `src.len()` if a page fault cannot be resolved).
    pub fn write(&mut self, src: &[u8]) -> Result<usize, Errno>;
}

/// Validated userspace pointer wrapper for reading data from userspace.
///
/// Analogous to `UserSliceMut` but for kernel reads from user memory.
/// `copy_from_user(dst: &mut [u8], src: &UserSlice)` copies userspace data
/// into a kernel buffer with the same fault-handling and SMAP/PAN semantics
/// as `UserSliceMut`.
pub struct UserSlice {
    /// Validated userspace source pointer. Guaranteed to be below
    /// `TASK_SIZE` at construction time.
    ptr: *const u8,
    /// Remaining bytes available for reading.
    len: usize,
}

impl UserSlice {
    /// Create a validated userspace read buffer.
    ///
    /// Returns `EFAULT` if `ptr + len` overflows or exceeds `TASK_SIZE`.
    pub fn new(ptr: *const u8, len: usize) -> Result<Self, Errno>;

    /// Number of bytes remaining in the buffer.
    pub fn remaining(&self) -> usize;

    /// Copy data from the userspace buffer into `dst`, advancing the
    /// internal pointer. Returns the number of bytes actually copied.
    pub fn read(&mut self, dst: &mut [u8]) -> Result<usize, Errno>;
}

/// File data operations (open, read, write, sync, allocate, close).
pub trait FileOps: Send + Sync {
    /// Called when a file is opened. Allows the filesystem to initialize per-open
    /// state (NFS delegation, device state, lock state). Returns a filesystem-private
    /// context value stored in the file descriptor.
    fn open(&self, inode: InodeId, flags: OpenFlags) -> Result<u64>;

    /// Called when the last file descriptor referencing this open file is closed.
    /// Filesystem releases per-open state (flock release-on-close, NFS delegation
    /// return, device cleanup). `private` is the value returned by `open()`.
    fn release(&self, inode: InodeId, private: u64) -> Result<()>;

    /// Read data from a file. `file` provides the OpenFile context (f_pos,
    /// f_flags, filesystem-private state). `offset` is read-write: the
    /// implementation advances it by the number of bytes read (supporting
    /// both pread with caller-supplied offset and read with f_pos).
    /// `buf` is a user-space slice descriptor for safe copy-to-user.
    fn read(
        &self,
        file: &OpenFile,
        buf: &mut UserSliceMut,
        offset: &mut i64,
    ) -> Result<usize, IoError>;

    /// Write data to a file. Same conventions as `read()`: `offset` is
    /// advanced by the number of bytes written.
    fn write(
        &self,
        file: &OpenFile,
        buf: &UserSlice,
        offset: &mut i64,
    ) -> Result<usize, IoError>;

    /// Truncate a file to the specified size. This is separate from setattr
    /// because truncation is a complex operation on many filesystems: it must
    /// free blocks/extents, update extent trees, handle COW (ZFS/btrfs),
    /// interact with snapshots, and flush in-progress writes beyond the new
    /// size. The VFS calls truncate after updating the in-memory inode size.
    /// `private` is the filesystem-private context value returned by `open()`.
    fn truncate(&self, inode: InodeId, private: u64, new_size: u64) -> Result<()>;

    /// Flush file data (and optionally metadata) to stable storage.
    /// `private` is the filesystem-private context value returned by `open()`.
    /// For DSM-managed pages: fsync waits for both local writeback completion
    /// AND DSM PutAck receipt ([Section 6.12](06-dsm.md#dsm-subscriber-controlled-caching--fsync-semantics)).
    /// The VFS fsync path calls `dsm_sync_pages(inode)` after
    /// `filemap_write_and_wait_range()` to ensure DSM coherence.
    fn fsync(&self, inode: InodeId, private: u64, start: u64, end: u64, datasync: u8) -> Result<()>;

    /// Pre-allocate or punch holes in file storage. `private` is the
    /// filesystem-private context value returned by `open()`.
    fn fallocate(&self, inode: InodeId, private: u64, offset: u64, len: u64, mode: FallocateMode) -> Result<()>;

    /// Read directory entries. Returns entries starting from `offset` (an opaque
    /// cookie, not a byte position). The callback is invoked for each entry; it
    /// returns `false` to stop iteration (buffer full). This is the backend for
    /// `getdents64(2)`. `private` is the filesystem-private context value
    /// returned by `open()`.
    fn readdir(
        &self,
        inode: InodeId,
        private: u64,
        offset: u64,
        emit: &mut dyn FnMut(InodeId, u64, FileType, &OsStr) -> bool,
    ) -> Result<()>;

    /// Seek to a data or hole region (SEEK_DATA / SEEK_HOLE, lseek(2)).
    /// Filesystems that do not support sparse files treat the whole file as
    /// data: SEEK_DATA returns the offset unchanged, SEEK_HOLE returns the
    /// file size, and both return ENXIO for offsets at or beyond end of file.
    /// `private` is the filesystem-private context value returned by `open()`.
    fn llseek(&self, inode: InodeId, private: u64, offset: i64, whence: SeekWhence) -> Result<u64>;

    /// Map a file region into a process address space. The VFS calls this to
    /// obtain the page frame list; the actual page table manipulation is done
    /// by umka-core (Section 4.1). Filesystems that do not support mmap (e.g.,
    /// procfs, sysfs) return ENODEV. `private` is the filesystem-private
    /// context value returned by `open()`.
    fn mmap(&self, inode: InodeId, private: u64, offset: u64, len: usize, prot: MmapProt) -> Result<MmapResult>;

    /// Handle a filesystem-specific ioctl. The VFS dispatches generic ioctls
    /// (FIOCLEX, FIONREAD, etc.) itself; only unrecognized ioctls reach the
    /// filesystem driver. Returns ENOTTY for unsupported ioctls. `private` is
    /// the filesystem-private context value returned by `open()`.
    fn ioctl(&self, inode: InodeId, private: u64, cmd: u32, arg: u64) -> Result<i64>;

    /// Splice data between a file and a pipe without copying through userspace.
    /// Backend for splice(2), sendfile(2), and copy_file_range(2). Filesystems
    /// that do not implement this get a generic page-cache-based fallback
    /// provided by the VFS. `private` is the filesystem-private context value
    /// returned by `open()`.
    fn splice_read(
        &self,
        inode: InodeId,
        private: u64,
        offset: u64,
        pipe: PipeId,
        len: usize,
    ) -> Result<usize>;

    /// Splice data from a pipe into a file without copying through userspace.
    /// Reverse direction of splice_read: pipe is the data source, file is the
    /// destination. Backend for splice(2) write direction and vmsplice(2).
    /// Filesystems that do not implement this get a generic page-cache-based
    /// fallback provided by the VFS. `private` is the filesystem-private
    /// context value returned by `open()`.
    fn splice_write(
        &self,
        pipe: PipeId,
        inode: InodeId,
        private: u64,
        offset: u64,
        len: usize,
    ) -> Result<usize>;

    /// Remap a file range: create shared extent references between files.
    /// Backend for FICLONE, FICLONERANGE, and FIDEDUPERANGE ioctls, and the
    /// server-side copy path of copy_file_range(2). Source and destination
    /// must be on the same filesystem.
    ///
    /// `flags` controls behavior (see `RemapFlags` in
    /// [Section 14.4](#vfs-fsync-and-cow--copy-on-write-and-redirect-on-write-infrastructure)):
    /// - `REMAP_FILE_DEDUP`: only remap if source and destination byte ranges
    ///   are identical (deduplication mode; byte-by-byte comparison first).
    /// - `REMAP_FILE_CAN_SHORTEN`: caller accepts a shorter remap than
    ///   requested (e.g., if source extent ends before `len` bytes).
    ///
    /// Returns the number of bytes actually remapped. Filesystems that do not
    /// support reflinks return `EOPNOTSUPP`. The VFS generic layer handles
    /// permission checks, file size validation, and lock ordering before
    /// dispatching to this method.
    fn remap_file_range(
        &self,
        src_inode: InodeId,
        src_private: u64,
        src_offset: u64,
        dst_inode: InodeId,
        dst_private: u64,
        dst_offset: u64,
        len: u64,
        flags: RemapFlags,
    ) -> Result<u64> {
        Err(Errno::EOPNOTSUPP)
    }

    /// Poll for readiness events (POLLIN, POLLOUT, POLLERR, etc.).
    ///
    /// Called by `poll(2)`, `select(2)`, and `epoll_ctl(EPOLL_CTL_ADD)` to:
    /// 1. Register the caller's wait entry on the file's internal WaitQueue(s)
    ///    via `poll_wait()`, so the caller is woken when readiness changes.
    /// 2. Return the current readiness mask (which events are ready *right now*).
    ///
    /// `pt` is `Some(&mut PollTable)` on the first call (registration pass) and
    /// `None` on subsequent re-polls after wakeup (just check readiness, don't
    /// re-register). Regular files always return `EPOLLIN | EPOLLOUT | EPOLLRDNORM
    /// | EPOLLWRNORM` — they are always ready. Special files (pipes, sockets,
    /// eventfd, signalfd, timerfd, pidfd) check their internal state and call
    /// `poll_wait()` on the appropriate WaitQueue(s).
    ///
    /// `private` is the filesystem-private context value returned by `open()`.
    fn poll(
        &self,
        inode: InodeId,
        private: u64,
        events: PollEvents,
        pt: Option<&mut PollTable>,
    ) -> Result<PollEvents>;
}
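The "regular files are always ready" rule from `poll()` can be made concrete. The bit values below are the standard Linux epoll event constants (assumed here; they are not defined in this chapter):

```rust
// Standard Linux epoll bit values (assumed; not defined in this chapter).
const EPOLLIN: u32 = 0x001;
const EPOLLOUT: u32 = 0x004;
const EPOLLRDNORM: u32 = 0x040;
const EPOLLWRNORM: u32 = 0x100;

/// The mask a regular file's `poll()` returns unconditionally: regular
/// files are always readable and writable, so no wait queue registration
/// is needed and `pt` can be ignored entirely.
fn regular_file_poll_mask() -> u32 {
    EPOLLIN | EPOLLOUT | EPOLLRDNORM | EPOLLWRNORM
}

fn main() {
    assert_eq!(regular_file_poll_mask(), 0x145);
}
```

Only special files (pipes, sockets, eventfd, and friends) need the full two-pass protocol of registering on wait queues and re-checking readiness after wakeup.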

/// Poll callback registration table.
///
/// Passed to `FileOps::poll()` on the first call. The file implementation calls
/// `poll_wait(wq, pt)` for each WaitQueue that can change the file's readiness.
/// The `PollTable` records which wait queues were registered so that the polling
/// infrastructure (epoll, poll, select) can install wakeup callbacks.
///
/// **Lifecycle**: allocated on the caller's stack (for poll/select) or embedded
/// in the `EpollItem` (for epoll). The `queue_proc` function pointer is the
/// mechanism that installs the actual `WaitQueueEntry`:
/// - For `poll(2)` / `select(2)`: installs a one-shot entry that wakes the
///   calling task.
/// - For `epoll_ctl(EPOLL_CTL_ADD)`: installs a persistent entry whose wakeup
///   function is `ep_poll_callback` ([Section 19.1](19-sysapi.md#syscall-interface--epoll-primary)).
pub struct PollTable {
    /// Callback invoked by `poll_wait()`. Installs a `WaitQueueEntry` on the
    /// given `WaitQueueHead`. The `key` parameter carries the events mask so
    /// the wakeup callback can filter spurious wakes.
    pub queue_proc: fn(wq: &WaitQueueHead, pt: &mut PollTable, key: PollEvents),

    /// Opaque pointer to the polling infrastructure's private state.
    /// For epoll: points to the `EpollItem` that owns this poll table entry.
    /// For poll/select: points to the per-fd poll state on the caller's stack.
    /// SAFETY: For poll/select: points to caller-stack-allocated poll state,
    /// valid for the duration of the poll syscall. For epoll: points to the
    /// owning EpollItem, valid for the EpollItem's lifetime. The queue_proc
    /// callback must cast to the correct type.
    pub private: *mut (),

    /// Events the caller is interested in. Set by the polling infrastructure
    /// before calling `FileOps::poll()`. The file implementation may use this
    /// to avoid registering on wait queues that cannot produce requested events.
    pub events: PollEvents,
}

/// Register a wait queue with the poll table.
///
/// Called by `FileOps::poll()` implementations to tell the polling infrastructure
/// "wake me when this wait queue fires." The `PollTable` installs a
/// `WaitQueueEntry` on `wq` with the appropriate wakeup function.
///
/// If `pt` is `None` (re-poll after wakeup), this is a no-op — the entry is
/// already installed from the first call.
///
/// **Cost**: one `WaitQueueEntry` insertion per wait queue per monitored fd.
/// Most files have one wait queue; sockets may have two (read + write).
///
/// ```rust
/// fn poll_wait(wq: &WaitQueueHead, pt: Option<&mut PollTable>) {
///     if let Some(pt) = pt {
///         let events = pt.events; // read before the &mut reborrow below
///         (pt.queue_proc)(wq, pt, events);
///     }
/// }
/// ```
pub fn poll_wait(wq: &WaitQueueHead, pt: Option<&mut PollTable>);

/// Dentry (directory entry) lifecycle operations.
/// Most filesystems use the default VFS implementations. Only network and
/// clustered filesystems need custom implementations (primarily d_revalidate).
pub trait DentryOps: Send + Sync {
    /// Revalidate a cached dentry. Called before using a cached dentry to verify
    /// it is still valid. Returns true if the dentry is still valid, false if
    /// the VFS should discard it and perform a fresh lookup.
    /// Default: always returns true (local filesystems).
    /// Network FS: checks with the server. Clustered FS: checks DLM lease (Section 15.12.6).
    fn d_revalidate(&self, parent: InodeId, name: &OsStr) -> Result<bool> {
        Ok(true)
    }

    /// Custom name comparison. Called during lookup to compare a dentry name
    /// with a search name. Used by case-insensitive filesystems (e.g., VFAT,
    /// CIFS with case folding, ext4 with casefold feature).
    /// Default: byte-exact comparison.
    fn d_compare(&self, name: &OsStr, search: &OsStr) -> bool {
        name == search
    }

    /// Returns a custom hash for this dentry name, or `None` to use the
    /// VFS default (SipHash-1-3 with per-superblock key from `SuperBlock.hash_key`).
    /// Must be consistent with d_compare: if two names are equal per d_compare,
    /// they must produce the same hash.
    ///
    /// The VFS lookup layer calls `d_hash()` and checks the return value.
    /// If `None`, the VFS uses its own SipHash-1-3 with the per-superblock
    /// random key directly, without requiring filesystem involvement. This
    /// matches Linux's pattern where `d_hash` is only invoked when
    /// `dentry->d_op->d_hash` is non-NULL.
    ///
    /// Filesystems with custom hash requirements (e.g., case-insensitive)
    /// override this to return `Some(hash_value)` using their own algorithm —
    /// they never see the SipHash key. The per-superblock key is managed by
    /// the VFS, not exposed to filesystem implementations.
    fn d_hash(&self, name: &OsStr) -> Option<u64> {
        None
    }

    /// Called when a dentry's reference count drops to zero (dentry enters
    /// the unused LRU list). Filesystem can veto caching by returning false.
    fn d_delete(&self, inode: InodeId, name: &OsStr) -> bool {
        true // default: allow LRU caching
    }

    /// Called when a dentry is finally freed from the cache.
    fn d_release(&self, inode: InodeId, name: &OsStr) {}
}
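The `d_compare`/`d_hash` consistency rule (names equal under `d_compare` must hash equal) can be illustrated with an ASCII-only case-folding pair. Real casefold filesystems use Unicode folding tables, and FNV-1a is chosen here only to keep the sketch dependency-free — it stands in for whatever algorithm the driver picks:

```rust
/// Case-insensitive comparison in the style a VFAT-like driver might use.
/// ASCII-only folding, for illustration.
fn d_compare_ascii_ci(name: &[u8], search: &[u8]) -> bool {
    name.len() == search.len()
        && name.iter().zip(search).all(|(a, b)| a.eq_ignore_ascii_case(b))
}

/// Matching custom hash: FNV-1a over case-folded bytes. Because both
/// functions fold case identically, names equal under `d_compare_ascii_ci`
/// always produce the same hash — the consistency rule stated above.
fn d_hash_ascii_ci(name: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in name {
        h ^= b.to_ascii_lowercase() as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

fn main() {
    assert!(d_compare_ascii_ci(b"README", b"ReadMe"));
    assert!(!d_compare_ascii_ci(b"README", b"README1"));
    assert_eq!(d_hash_ascii_ci(b"README"), d_hash_ascii_ci(b"readme"));
}
```

A driver that overrode only `d_compare` would break lookups: two names it considers equal could land in different dentry hash buckets and never be compared at all.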

/// Kernel-internal inode attribute structure. Contains all fields exposed by
/// Linux statx(2). The SysAPI layer translates to the userspace struct statx
/// layout (different field ordering, padding, and encoding).
pub struct InodeAttr {
    /// Bitmask of valid fields (STATX_* flags). Filesystems set only
    /// the bits for fields they actually populate.
    pub mask: u32,

    pub mode: u32,        // File type and permissions. u32 for internal storage
                          // convenience and future extensibility. Only bits [15:0]
                          // are defined (identical to Linux umode_t). Bits [31:16]
                          // are reserved and must be zero. The SysAPI translation
                          // to userspace statx truncates to u16.
    pub nlink: u32,       // Hard link count
    pub uid: u32,         // Owner UID
    pub gid: u32,         // Group GID
    pub ino: u64,         // Inode number
    pub size: u64,        // File size in bytes
    pub blocks: u64,      // 512-byte blocks allocated
    pub blksize: u32,     // Preferred I/O block size

    // Timestamps with nanosecond precision
    pub atime_sec: i64,   // Last access
    pub atime_nsec: u32,
    pub mtime_sec: i64,   // Last modification
    pub mtime_nsec: u32,
    pub ctime_sec: i64,   // Last status change
    pub ctime_nsec: u32,
    pub btime_sec: i64,   // Creation time (birth time)
    pub btime_nsec: u32,

    /// Device ID (for device special files: char/block). Uses the `DevId` type
    /// ([Section 14.5](#device-node-framework)) with Linux-compatible MKDEV encoding:
    /// `(major << 20) | (minor & 0xFFFFF)`. Major occupies bits 31:20 (12 bits,
    /// 0-4095), minor occupies bits 19:0 (20 bits, 0-1048575). The SysAPI layer
    /// splits `DevId` into separate
    /// `stx_rdev_major`/`stx_rdev_minor` u32 fields for `statx()` responses using
    /// `dev_id.major()` and `dev_id.minor()`.
    pub rdev: DevId,
    /// Device ID of the filesystem containing this inode. Same `DevId` encoding
    /// as `rdev`. The SysAPI layer splits into `stx_dev_major`/`stx_dev_minor`
    /// for `statx()` responses.
    pub dev: DevId,
    pub mount_id: u64,    // Mount identifier (STATX_MNT_ID, since Linux 5.8)
    pub attributes: u64,  // File attributes (STATX_ATTR_* flags)
    pub attributes_mask: u64, // Supported attributes mask

    // Direct I/O alignment (STATX_DIOALIGN, since Linux 6.1)
    pub dio_mem_align: u32,    // Required alignment for DIO memory buffers
    pub dio_offset_align: u32, // Required alignment for DIO file offsets

    // Subvolume identifier (STATX_SUBVOL, since Linux 6.10; btrfs, bcachefs)
    pub subvol: u64,

    // Atomic write limits (STATX_WRITE_ATOMIC, since Linux 6.11)
    pub atomic_write_unit_min: u32,  // Min atomic write size (power-of-2)
    pub atomic_write_unit_max: u32,  // Max atomic write size (power-of-2)
    pub atomic_write_segments_max: u32, // Max segments in atomic write
    pub atomic_write_unit_max_opt: u32, // Optimal max atomic write size (STATX_WRITE_ATOMIC, since Linux 6.13)

    // Direct I/O read alignment (STATX_DIO_READ_ALIGN, since Linux 6.14)
    pub dio_read_offset_align: u32,  // DIO read offset alignment (0 = use dio_offset_align)
}
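The MKDEV encoding documented for `rdev` and `dev` above, written out as plain functions over a raw `u32` (the real `DevId` newtype in Section 14.5 wraps exactly this arithmetic):

```rust
/// Linux-compatible MKDEV encoding: major in bits 31:20 (12 bits),
/// minor in bits 19:0 (20 bits).
fn mkdev(major: u32, minor: u32) -> u32 {
    (major << 20) | (minor & 0xFFFFF)
}

fn dev_major(dev: u32) -> u32 {
    dev >> 20
}

fn dev_minor(dev: u32) -> u32 {
    dev & 0xFFFFF
}

fn main() {
    let dev = mkdev(8, 1); // a (major 8, minor 1) block device
    assert_eq!(dev, 0x0080_0001);
    assert_eq!(dev_major(dev), 8);
    assert_eq!(dev_minor(dev), 1);
}
```

The split accessors are what the SysAPI layer uses to populate the separate `stx_dev_major`/`stx_dev_minor` and `stx_rdev_major`/`stx_rdev_minor` fields of userspace `struct statx`.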

Linux comparison: Linux's VFS uses struct super_operations, struct inode_operations, struct file_operations, and struct dentry_operations — C structs of function pointers (Linux's file_operations alone has 30+ methods). UmkaOS's trait-based design serves the same purpose but with Rust's safety guarantees: a filesystem that forgets to implement fsync is a compile-time error, not a null pointer dereference at runtime. The trait methods above cover the operations needed for POSIX compatibility, including remap_file_range() for reflink/clone/dedup (see Section 14.4). Rarely-used operations (e.g., fiemap) are handled by generic VFS fallback code that calls the core read/write/fallocate methods.

14.1.2.2 File Handle Export (ExportOps)

The ExportOps trait is implemented by filesystems that support persistent file handles — opaque tokens that identify an inode across server reboots and path renames. Required for:

  • NFS server (clients hold file handles that survive server restart)
  • CRIU checkpoint/restore (open_by_handle_at reopens files by handle)
  • Backup software (rsync --no-implied-dirs, backup agents)

/// File system export operations. Optional — implement only if the filesystem
/// supports persistent, path-independent file handles.
///
/// A file handle is a short opaque byte string (max 128 bytes) that uniquely
/// identifies an inode within a filesystem instance. The handle must survive:
/// - Server reboots (handle encodes stable inode ID + generation counter)
/// - Directory renames (handle does not encode path)
/// - Mount point changes (handle is filesystem-relative, not global)
pub trait ExportOps: Send + Sync {
    /// Encode an inode into a file handle.
    ///
    /// Returns the handle bytes written and a filesystem-defined `fh_type` code
    /// (passed back to `decode_fh`; used to distinguish handle formats).
    ///
    /// # Typical encoding
    /// ext4:  [ inode_number: u32, generation: u32 ] → 8 bytes, fh_type=1
    /// XFS:   [ ino: u64, gen: u32, parent_ino: u64, parent_gen: u32 ] → 24 bytes, fh_type=1
    /// Btrfs: [ objectid: u64, root_objectid: u64, gen: u64 ] → 24 bytes, fh_type=1
    ///
    /// If `connectable` is true, include parent inode info so the NFS server
    /// can reconnect the dentry tree after a reboot.
    ///
    /// Returns `Err(EOVERFLOW)` if `max_bytes` is too small for this filesystem's handle.
    fn encode_fh(
        &self,
        inode: &Inode,
        handle: &mut [u8; 128],
        max_bytes: usize,
        connectable: bool,
    ) -> Result<(usize, u8), VfsError>; // (bytes_written, fh_type)

    /// Decode a file handle back to an inode reference.
    ///
    /// Called by `open_by_handle_at`. Must look up the inode using the filesystem's
    /// internal handle format without path traversal.
    ///
    /// Returns `Err(ESTALE)` if the inode no longer exists or the generation counter
    /// does not match (inode number reused after deletion).
    fn decode_fh(
        &self,
        handle: &[u8],
        fh_type: u8,
    ) -> Result<Arc<Inode>, VfsError>;

    /// Get the parent directory inode of an inode (for NFS reconnect after reboot).
    ///
    /// Returns `Err(EACCES)` if the filesystem cannot determine the parent without a
    /// full tree walk (e.g., hardlinks with multiple parents).
    fn get_parent(&self, inode: &Inode) -> Result<Arc<Inode>, VfsError>;

    /// Get the directory entry name for `child` within `parent`.
    ///
    /// Used by the NFS server to reconstruct paths for client caches.
    /// Returns the byte length of the name written into `name_buf`.
    /// Returns `Err(ENOENT)` if no entry for `child` is found in `parent`.
    fn get_name(
        &self,
        parent: &Inode,
        child: &Inode,
        name_buf: &mut [u8; 256],
    ) -> Result<usize, VfsError>;
}
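The ext4-style handle layout from the `encode_fh` table can be sketched as a standalone encode/decode pair. Little-endian byte order here is an illustration choice, not specified by the chapter — the encoding is opaque to everything except the filesystem that produced it:

```rust
/// Ext4-style 8-byte handle: [ inode_number: u32, generation: u32 ].
fn encode_fh_ext4_style(ino: u32, gen: u32, out: &mut [u8]) -> Result<(usize, u8), ()> {
    if out.len() < 8 {
        return Err(()); // EOVERFLOW in the real trait
    }
    out[..4].copy_from_slice(&ino.to_le_bytes());
    out[4..8].copy_from_slice(&gen.to_le_bytes());
    Ok((8, 1)) // (bytes_written, fh_type)
}

fn decode_fh_ext4_style(handle: &[u8]) -> Result<(u32, u32), ()> {
    if handle.len() < 8 {
        return Err(()); // malformed handle -> ESTALE in the real trait
    }
    let ino = u32::from_le_bytes(handle[..4].try_into().unwrap());
    let gen = u32::from_le_bytes(handle[4..8].try_into().unwrap());
    Ok((ino, gen))
}

fn main() {
    let mut buf = [0u8; 128];
    let (len, fh_type) = encode_fh_ext4_style(12345, 7, &mut buf).unwrap();
    assert_eq!((len, fh_type), (8, 1));
    assert_eq!(decode_fh_ext4_style(&buf[..len]), Ok((12345, 7)));
}
```

The generation counter is what makes the handle safe across inode-number reuse: after decode, the filesystem compares the embedded generation against the live inode's and returns `ESTALE` on mismatch.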

/// Kernel-side file handle: wraps the opaque handle bytes with metadata.
/// Matches the layout of Linux's `struct file_handle` for syscall ABI compatibility.
#[repr(C)]
pub struct FileHandle {
    /// Byte length of the handle data (the populated prefix of `f_handle`).
    pub handle_bytes: u32,
    /// Filesystem-defined type code (passed back verbatim to `ExportOps::decode_fh`).
    pub handle_type: i32,
    /// Opaque handle data (filesystem-defined encoding, up to 128 bytes).
    pub f_handle: [u8; 128],
}
const_assert!(size_of::<FileHandle>() == 136);

name_to_handle_at(2) implementation:

name_to_handle_at(dirfd, pathname, handle, mount_id, flags):

1. Resolve pathname to an inode (using normal path resolution with dirfd as the base;
   AT_EMPTY_PATH allows operating on dirfd itself without a pathname component).
2. Retrieve the inode's superblock.
3. Check that the superblock implements ExportOps. Return ENOTSUP if not.
4. Call superblock.export_ops.encode_fh(inode, handle.f_handle, handle.handle_bytes,
   connectable=true).
5. Write back handle_bytes and handle_type into the userspace handle struct.
6. Write the mount's numeric ID to *mount_id. Mount IDs are assigned at mount time
   via a monotonic counter (Section 14.2.3 MountNode.mnt_id).
7. Return 0 on success; EOVERFLOW if the handle buffer is too small.

open_by_handle_at(2) implementation:

open_by_handle_at(mount_fd, handle, flags):

1. Requires CAP_DAC_READ_SEARCH. This syscall bypasses normal path-based access checks
   by design — it is intended for root-equivalent processes such as NFS servers and
   backup agents. Return EPERM if the capability is absent.
2. Resolve mount_fd to identify which filesystem the handle belongs to:
   fdget(mount_fd) → extract the file's MountDentry → use that mount's superblock.
   mount_fd must be an open fd on any file or directory within the target filesystem
   (typically the mountpoint itself, e.g., `fd = open("/mnt")`). If mount_fd is
   AT_FDCWD, the current working directory's mount is used.
3. Retrieve the mount's superblock (from the MountDentry resolved in step 2).
4. Check that the superblock implements ExportOps. Return ENOTSUP if not.
5. Call superblock.export_ops.decode_fh(handle.f_handle, handle.handle_type) → Arc<Inode>.
6. If Err(ESTALE): the inode was deleted or the generation counter does not match
   (inode number reused). Return ESTALE.
7. Perform a DAC check and LSM check on the inode using the caller's credentials.
8. Allocate a new OpenFile wrapping the inode. The open file description does not
   carry a path — the inode is accessed directly without directory traversal.
9. Return the new file descriptor number.

Security note: open_by_handle_at intentionally skips directory execute-permission
checks along the path to the inode (the path is not known at this point). This is
the documented and expected behavior for NFS server use. CAP_DAC_READ_SEARCH is the
required guard.
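Step 6's generation check is what makes handles safe against inode-number reuse. A userspace sketch, assuming a hypothetical handle layout of an 8-byte inode number followed by a 4-byte generation; `InodeRec` and the flat table stand in for the filesystem's real inode lookup:

```rust
use std::convert::TryInto;

/// Illustrative errno value (matches the conventional Linux number).
const ESTALE: i32 = 116;

/// Hypothetical inode-table entry: number, current generation, liveness.
struct InodeRec { ino: u64, generation: u32, alive: bool }

/// Decode counterpart to the handle encoding: look the inode up by number
/// and reject the handle with ESTALE if the inode is gone or its
/// generation counter no longer matches (inode number reused), as in
/// step 6 of open_by_handle_at.
fn decode_fh(f_handle: &[u8], table: &[InodeRec]) -> Result<u64, i32> {
    let ino = u64::from_le_bytes(f_handle[..8].try_into().unwrap());
    let generation = u32::from_le_bytes(f_handle[8..12].try_into().unwrap());
    match table.iter().find(|r| r.ino == ino) {
        Some(r) if r.alive && r.generation == generation => Ok(r.ino),
        _ => Err(ESTALE),
    }
}
```

The generation mismatch path is why filesystems bump a per-inode generation on every inode allocation: a handle minted before deletion can never silently resolve to an unrelated file that reused the same inode number.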

14.1.2.3 Core VFS Data Structures

The VFS layer operates on four fundamental data structures: dentries (directory entries), inodes (index nodes), superblocks (mounted filesystem state), and open files (open file descriptions). All four are defined in this section.

14.1.2.3.1.1 OpenFile (Open File Description)
/// An open file description — the kernel-internal object backing one or more
/// file descriptors. Created by `open(2)`, `openat(2)`, `socket(2)`, `pipe(2)`,
/// `accept(2)`, etc. Multiple file descriptors can reference the same `OpenFile`
/// via `dup(2)` or `fork(2)`.
///
/// **Lifecycle**: Allocated at open time. Reference-counted (`Arc<OpenFile>`).
/// The `FdTable` holds `Arc<OpenFile>` entries. When the last fd referencing
/// this open file is closed (refcount drops to zero), `FileOps::release()` is
/// called and the `OpenFile` is freed.
///
/// **Concurrency**: Most fields are immutable after creation (`inode`, `dentry`,
/// `mount`, `f_ops`, `f_cred`, `f_mode`). Mutable fields use atomic operations:
/// - `f_pos`: `AtomicI64` — updated by `read()`/`write()`/`lseek()`. `pread()`
///   and `pwrite()` do not touch `f_pos`. Access is mediated by `fdget_pos()`:
///
///   **`fdget_pos()` protocol** (f_pos serialization):
///   The VFS read/write dispatch path calls `fdget_pos(fd)` instead of plain
///   `fdget(fd)`. This function returns an `FdPos` guard that provides
///   exclusive `&mut i64` access to the file position:
///
///   - **Single-user fast path**: If the `OpenFile` has exactly one `Arc`
///     reference (refcount == 1, meaning no `dup(2)` or `fork(2)` sharing),
///     `fdget_pos()` loads `f_pos` into a local `i64`, returns `&mut` to it,
///     and stores it back on drop. No mutex, no contention. This is the
///     common case for most file descriptors.
///
///   - **Multi-user slow path**: If the `OpenFile` has multiple references
///     (shared via `dup(2)` or `fork(2)` — detected by `Arc::strong_count() > 1`),
///     `fdget_pos()` acquires `f_pos_lock` (a per-OpenFile `Mutex<()>`)
///     before returning `&mut` access to a local copy. This serializes
///     concurrent `read()`/`write()` calls that share the same open file
///     description, matching POSIX requirements for atomic position updates.
///     The mutex is released when the `FdPos` guard is dropped.
///
///   - **`pread()`/`pwrite()` bypass**: These syscalls use a caller-supplied
///     offset and never call `fdget_pos()` — they call `fdget()` directly.
///     No f_pos serialization is needed because the caller-supplied offset
///     is on the stack.
///
///   This design matches Linux's `fdget_pos()` / `__fdget_pos()` protocol
///   exactly (see `fs/file.c`), ensuring identical concurrency semantics.
/// - `f_flags`: `AtomicU32` — modified by `fcntl(F_SETFL)` for `O_APPEND`,
///   `O_NONBLOCK`, `O_ASYNC`, `O_DIRECT`. Read-only flags (`O_RDONLY`,
///   `O_RDWR`, `O_CREAT`, `O_EXCL`) are set at open time and never change.
/// - `f_wb_err`: `u64` — writeback error snapshot (plain value, not atomic).
///   Initialized from `AddressSpace::wb_err.sample()` at open time. Compared
///   at `fsync()` time against `AddressSpace::wb_err` via
///   `check_and_advance(&mut self.f_wb_err)` to detect new errors.
/// - `private_data`: `AtomicPtr` — set once by `FileOps::open()` and read by
///   subsequent operations. Typically not modified after initialization.
///
/// **Relationship to FdTable**: The `FdTable` (in [Section 8.1](08-process.md#process-and-task-management))
/// maps integer file descriptors (0, 1, 2, ...) to `Arc<OpenFile>`. `dup(2)`
/// creates a new fd pointing to the same `Arc<OpenFile>`. `fork()` copies the
/// `FdTable`, incrementing the `Arc` refcount for each entry.
pub struct OpenFile {
    /// Inode backing this open file. For regular files, directories, symlinks,
    /// and device nodes, this is the filesystem inode. For pipes and sockets,
    /// this is a synthetic inode from the pipefs/sockfs pseudo-filesystem.
    pub inode: Arc<Inode>,

    /// Dentry that was used to open this file. Pinned for the lifetime of the
    /// open file — this prevents the dentry from being evicted while the file
    /// is open, which is necessary for `/proc/[pid]/fd/N` readlink (returns
    /// the path via `d_path()` on this dentry).
    pub dentry: DentryRef,

    /// Mount instance through which this file was opened. Pinned for the
    /// lifetime of the open file — this prevents `umount` from proceeding
    /// while files are open on the filesystem (umount checks `mnt_count`).
    pub mount: Arc<Mount>,

    /// File operations vtable. Set at open time from the inode's `i_fop`
    /// (regular files, directories) or the device driver's registered
    /// `FileOps` (character/block devices). Immutable after creation.
    pub f_ops: &'static dyn FileOps,

    /// Current file position (seek offset). Updated by `read()`, `write()`,
    /// and `lseek()`. Not used by `pread()`/`pwrite()` (which take an
    /// explicit offset). Initialized to 0 for regular opens, to the file
    /// size for `O_APPEND` opens (the kernel re-seeks to EOF before each
    /// `write()` regardless of the stored position).
    pub f_pos: AtomicI64,

    /// Mutex protecting `f_pos` for shared open file descriptions.
    /// Only acquired by `fdget_pos()` when `Arc::strong_count() > 1`
    /// (i.e., the open file is shared via `dup(2)` or `fork(2)`).
    /// Single-user file descriptors (the common case) never touch this
    /// mutex — `fdget_pos()` skips it entirely. This matches Linux's
    /// `struct file::f_pos_lock` mutex.
    pub f_pos_lock: Mutex<()>,

    /// Open flags. Lower bits contain the access mode (O_RDONLY=0, O_WRONLY=1,
    /// O_RDWR=2). Upper bits contain status flags (O_APPEND, O_NONBLOCK,
    /// O_ASYNC, O_DIRECT, O_NOATIME, O_CLOEXEC). Status flags may be modified
    /// by `fcntl(F_SETFL)`; access mode bits are immutable after open.
    pub f_flags: AtomicU32,

    /// File mode derived from open flags. Bitflags indicating which operations
    /// are permitted on this open file. Set at open time and immutable.
    /// Checked by the VFS before dispatching to `FileOps` methods.
    pub f_mode: FileMode,

    /// Credentials captured at open time. Used for permission checks that
    /// occur after open (e.g., writeback, async I/O completion) where the
    /// original opener's credentials must be used, not the current task's.
    /// Immutable after creation.
    pub f_cred: Arc<Credentials>,

    /// Writeback error snapshot (plain `u64`, not atomic `ErrSeq`). Initialized
    /// from `AddressSpace::wb_err.sample()` at open time. At `fsync()` time,
    /// compared against the current `AddressSpace::wb_err` via
    /// `check_and_advance(&mut self.f_wb_err)` — if a new error occurred since
    /// this fd was opened (or since the last `fsync()`), `fsync()` returns the
    /// error. After reporting, the snapshot is advanced so the error is reported
    /// exactly once per fd. The snapshot is a non-atomic `u64` because only the
    /// owning fd thread accesses it (no concurrent readers), unlike
    /// `AddressSpace::wb_err` which is the atomic source.
    pub f_wb_err: u64,

    /// Readahead state for this open file. Tracks sequential access detection,
    /// the current readahead window size, and the last readahead position.
    /// Used by `filemap_get_pages()` and the readahead engine
    /// ([Section 4.4](04-memory.md#page-cache--readahead-engine)) to decide how many pages to
    /// prefetch. Each open file has independent readahead state — two
    /// processes reading the same file at different positions maintain
    /// separate readahead windows.
    pub ra_state: Mutex<FileRaState>,

    /// Filesystem-private data. Set by `FileOps::open()` to store per-open
    /// state (e.g., ext4 journal handle, NFS delegation ID, device driver
    /// context). The VFS passes this value (as `private: u64`) to all
    /// subsequent `FileOps` method calls. Cleared by `FileOps::release()`.
    pub private_data: AtomicPtr<()>,

    /// Driver generation at the time this file was opened. Set to
    /// `sb.driver_generation.load(Acquire)` during `open()`. Compared
    /// against `sb.driver_generation` on every VFS operation; mismatch
    /// returns `ENOTCONN` (the file handle is stale from a pre-crash
    /// driver instance). Not atomic — set once at open time, read-only
    /// thereafter.
    ///
    /// The generation check is in the VFS dispatch path (before
    /// `select_ring()`):
    /// ```rust
    /// if file.open_generation != file.inode.i_sb.driver_generation.load(Acquire) {
    ///     return Err(ENOTCONN);
    /// }
    /// ```
    pub open_generation: u64,
}
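The single-user/multi-user split in `fdget_pos()` can be simulated in userspace with `std` primitives. This sketch keeps only the decision logic (take the mutex only when `Arc::strong_count() > 1`); the real guard returns `&mut i64` and writes the position back on drop, which is elided here:

```rust
use std::sync::{Arc, Mutex};
use std::sync::atomic::{AtomicI64, Ordering};

/// Minimal stand-in for OpenFile: just the position and its lock.
struct OpenFileSim {
    f_pos: AtomicI64,
    f_pos_lock: Mutex<()>,
}

/// Sketch of the fdget_pos() decision: lock only when the open file
/// description is shared. Advances the position by `delta`, as a
/// read()/write() of `delta` bytes would, and returns the new position.
fn advance_pos(file: &Arc<OpenFileSim>, delta: i64) -> i64 {
    if Arc::strong_count(file) > 1 {
        // Multi-user slow path: the description is shared via dup/fork,
        // so concurrent position updates must be serialized.
        let _guard = file.f_pos_lock.lock().unwrap();
        let new = file.f_pos.load(Ordering::Relaxed) + delta;
        file.f_pos.store(new, Ordering::Relaxed);
        new
    } else {
        // Single-user fast path: no other Arc holder exists, so no other
        // thread can race on f_pos through this description.
        let new = file.f_pos.load(Ordering::Relaxed) + delta;
        file.f_pos.store(new, Ordering::Relaxed);
        new
    }
}
```

The point of the split is that the common case (an fd never shared by `dup(2)` or `fork(2)`) pays zero locking cost while still honoring the POSIX atomic-position requirement for shared descriptions.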

bitflags! {
    /// File mode flags — derived from open flags at open time. These indicate
    /// which operations the VFS permits on this open file description.
    /// Immutable after open.
    ///
    /// These are internal VFS flags (not directly visible to userspace). They
    /// are derived from the `O_*` flags passed to `open(2)`:
    /// - `O_RDONLY` (0) → `FMODE_READ`
    /// - `O_WRONLY` (1) → `FMODE_WRITE`
    /// - `O_RDWR` (2) → `FMODE_READ | FMODE_WRITE`
    ///
    /// Additional flags are set based on the file type and filesystem
    /// capabilities.
    pub struct FileMode: u32 {
        /// Read operations permitted (`read`, `pread`, `readv`, `mmap PROT_READ`).
        const FMODE_READ    = 0x0001;
        /// Write operations permitted (`write`, `pwrite`, `writev`, `mmap PROT_WRITE`).
        const FMODE_WRITE   = 0x0002;
        /// `lseek` is meaningful. Set for regular files and block devices.
        /// Not set for pipes, sockets, and some character devices.
        const FMODE_LSEEK   = 0x0004;
        /// `pread` is supported (implies the file has a stable notion of offset).
        /// Set for regular files and block devices. Not set for pipes or sockets.
        const FMODE_PREAD   = 0x0008;
        /// `pwrite` is supported.
        const FMODE_PWRITE  = 0x0010;
        /// Execute permission was checked at open time (implies `O_PATH` was not
        /// used and the file's execute bit was verified). Used by `execveat(2)`
        /// with `AT_EMPTY_PATH` to avoid a redundant permission check.
        const FMODE_EXEC    = 0x0020;
        /// File does not contribute to filesystem busy state. Set for files
        /// opened with `O_PATH` (which are just path references, not real opens).
        const FMODE_PATH    = 0x0040;
        /// Direct I/O mode. Set when `O_DIRECT` is in effect and the filesystem
        /// supports it. The VFS bypasses the page cache for read/write.
        const FMODE_DIRECT  = 0x0080;
    }
}
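The access-mode derivation described above is mechanical. A sketch using plain `u32` constants in place of the `bitflags!` types (the constant values match the definitions above; `fmode_from_flags` is an illustrative helper, not the kernel's open path):

```rust
// O_* access-mode encoding (lower two bits of the open flags).
const O_ACCMODE: u32 = 0x3;
const O_WRONLY: u32 = 1;
const O_RDWR: u32 = 2;

// FileMode bits from the bitflags definition.
const FMODE_READ: u32 = 0x0001;
const FMODE_WRITE: u32 = 0x0002;

/// Derive the access-mode portion of FileMode from open(2) flags:
/// O_RDONLY(0) -> READ, O_WRONLY(1) -> WRITE, O_RDWR(2) -> READ|WRITE.
/// Status flags in the upper bits (O_APPEND, O_NONBLOCK, ...) do not
/// affect the access mode.
fn fmode_from_flags(flags: u32) -> u32 {
    match flags & O_ACCMODE {
        O_WRONLY => FMODE_WRITE,
        O_RDWR => FMODE_READ | FMODE_WRITE,
        _ => FMODE_READ, // O_RDONLY == 0
    }
}
```

Because `O_RDONLY` is zero rather than a distinct bit, the access mode must be decoded by masking and comparing, never by bit-testing; this is why the derived `FMODE_*` bits exist at all.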
14.1.2.3.1.2 Dentry (Directory Cache Entry)
/// Directory cache entry — represents a single component in a pathname.
///
/// Dentries form a tree that mirrors the filesystem namespace. Each dentry
/// caches the result of a directory lookup: the mapping from a name to an
/// inode. The dentry cache (dcache) is the primary mechanism for avoiding
/// repeated directory lookups on hot paths.
///
/// **Lifecycle**: Created by `InodeOps::lookup()` on first access. Cached
/// in the dcache hash table (keyed by parent + name). Freed when the
/// reference count drops to zero AND the dentry is evicted from the LRU.
/// Negative dentries (name exists but no inode) are also cached to avoid
/// repeated failed lookups.
///
/// **Concurrency**: Dentries are RCU-protected for lockless path resolution
/// (RCU-walk mode, Section 14.1.3). Mutations (create, unlink, rename)
/// acquire the parent dentry's `d_lock` spinlock.
///
/// `#[repr(C)]` on `Dentry` is for deterministic field ordering (cache line
/// layout control), not for cross-compilation-unit ABI stability. Inner types
/// (`DentryName`, `RcuCell<..>`, `IntrusiveList<..>`) retain Rust-default layout.
/// Tier 1 drivers never receive raw `Dentry` pointers — all access is through
/// the VFS ring protocol by inode number.
// kernel-internal, not KABI — no const_assert (contains Rust-layout inner types).
#[repr(C)]
pub struct Dentry {
    /// The name of this directory entry (the final component, not the full path).
    /// Inline for short names (<=32 bytes); heap-allocated for longer names.
    /// Immutable after creation (renames create a new dentry).
    pub d_name: DentryName,

    /// Inode that this dentry points to. `None` for negative dentries
    /// (cached "does not exist" results). Set once by `d_instantiate()`
    /// after a successful lookup or create. Protected by RCU for readers;
    /// `d_lock` for writers.
    pub d_inode: RcuCell<Option<Arc<Inode>>>,

    /// Parent dentry. The root dentry's parent is itself.
    /// Protected by RCU (for RCU-walk path resolution).
    pub d_parent: RcuCell<Arc<Dentry>>,

    /// Hash table linkage for dcache lookup (keyed by parent + name hash).
    pub d_hash: HashListNode,

    /// Children list (subdirectories and files in this directory).
    /// Only meaningful for directory dentries. Protected by `d_lock`.
    pub d_children: IntrusiveList<Dentry>,

    /// Sibling linkage (entry in parent's `d_children` list).
    pub d_sibling: IntrusiveListNode,

    /// Per-dentry spinlock. Protects `d_children`, `d_inode` mutations,
    /// and `d_flags` updates. Lock level: DENTRY_LOCK (level 16).
    pub d_lock: SpinLock<(), DENTRY_LOCK>,

    /// Dentry flags (DCACHE_MOUNTED, DCACHE_NEGATIVE, etc.).
    pub d_flags: AtomicU32,

    /// Cross-namespace mount refcount. Counts how many mount namespaces have
    /// a mount at this dentry. Incremented in `do_mount()`, decremented in
    /// `do_umount()`. `DCACHE_MOUNTED` is cleared only when this reaches 0.
    ///
    /// **Why needed**: A single dentry can be a mount point in multiple
    /// namespaces simultaneously (e.g., "/" is mounted in every namespace
    /// that cloned the mount tree). Without this refcount, `do_umount()` in
    /// one namespace would clear `DCACHE_MOUNTED` and break path resolution
    /// in all other namespaces that still have mounts at this dentry.
    ///
    /// **Protocol**:
    /// - `do_mount()` step 6f: `dentry.d_mount_refcount.fetch_add(1, Relaxed)`
    ///   THEN `dentry.d_flags.fetch_or(DCACHE_MOUNTED, Release)`.
    /// - `do_umount()` step 9:
    ///   `if dentry.d_mount_refcount.fetch_sub(1, AcqRel) == 1 {`
    ///   `    dentry.d_flags.fetch_and(!DCACHE_MOUNTED, Release);`
    ///   `}`
    /// - Same pattern in `do_umount_tree()` step 3d, `do_move_mount()` step 5c.
    ///
    /// u32 sizing: the worst case is mount_max × max_namespaces, but the
    /// count at any single dentry is in practice bounded by the namespace
    /// count (~100K), well within u32 range.
    pub d_mount_refcount: AtomicU32,

    /// Reference count. Dentries with refcount > 0 are pinned (in use).
    /// Dentries with refcount == 0 are on the LRU and may be evicted
    /// under memory pressure.
    /// u32: bounded by max_files sysctl (default 8M). At max_files=8M
    /// concurrent references to a single dentry, u32 provides ~536x
    /// headroom. AtomicU64 rejected: hot-path refcount, 2x width penalty
    /// on ILP32 architectures (ARMv7, PPC32).
    pub d_refcount: AtomicU32,

    /// Cached permission bits for fast path resolution (Section 14.1.3).
    pub cached_perm: AtomicU32,

    /// Superblock this dentry belongs to.
    pub d_sb: Arc<SuperBlock>,

    /// Filesystem-specific dentry operations (d_revalidate, d_release, etc.).
    /// Set by the filesystem during lookup. NULL for simple filesystems.
    pub d_ops: Option<&'static dyn DentryOps>,

    /// RCU head for deferred freeing.
    pub d_rcu: RcuHead,

    /// LRU list linkage for dcache reclaim.
    pub d_lru: IntrusiveListNode,

    /// Mount point generation counter. Incremented when a filesystem is
    /// mounted or unmounted on this dentry. Used by RCU-walk to detect
    /// mount table changes during lockless traversal. This is a generation
    /// counter protocol, not a Linux-style seqcount (no even/odd semantics).
    ///
    /// Reader protocol: (1) sample d_mount_seq with Acquire, (2) lookup in
    /// mount hash table, (3) sample d_mount_seq again with Acquire, (4) if
    /// values differ, retry from step 1.
    pub d_mount_seq: AtomicU32,
}
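The `d_mount_seq` reader protocol (sample, look up, resample, retry) can be expressed as a small retry loop. A userspace sketch, with a closure standing in for the mount hash table query:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Sketch of the d_mount_seq reader protocol: sample the generation with
/// Acquire, perform the mount-hash lookup, resample, and retry if the
/// counter moved (a mount or umount raced with the lookup).
fn mount_lookup_stable<T>(seq: &AtomicU32, mut lookup: impl FnMut() -> T) -> T {
    loop {
        let before = seq.load(Ordering::Acquire); // step 1
        let result = lookup();                    // step 2
        let after = seq.load(Ordering::Acquire);  // step 3
        if before == after {
            return result; // no mount table change during the lookup
        }
        // step 4: generation changed underneath us; discard and retry
    }
}
```

Unlike a Linux seqcount there is no even/odd "writer in progress" encoding: any change in the counter value between the two samples forces a retry, which is sufficient because writers bump the counter exactly once per mount-table mutation.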

/// Short name inline buffer size. Names <=32 bytes are stored inline
/// in the dentry (no heap allocation). Covers >99% of real filenames.
pub const DENTRY_INLINE_NAME_LEN: usize = 32;

/// Maximum dentry name length (POSIX NAME_MAX).
pub const DENTRY_MAX_NAME_LEN: usize = 255;

/// Dentry name: inline for short names, heap-allocated for long names.
/// The Heap variant stores names up to DENTRY_MAX_NAME_LEN bytes; the
/// bound is enforced by d_alloc() which validates name.len() <= NAME_MAX
/// before construction. debug_assert!(name.len() <= DENTRY_MAX_NAME_LEN)
/// in the Heap constructor provides defense-in-depth.
pub enum DentryName {
    Inline { buf: [u8; DENTRY_INLINE_NAME_LEN], len: u8 },
    Heap { ptr: Box<[u8]> },
}
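A sketch of the inline/heap decision as `d_alloc()` would apply it, with the NAME_MAX validation up front (`make_name`, the simulated enum, and the `ENAMETOOLONG` constant are illustrative, not the kernel definitions):

```rust
const DENTRY_INLINE_NAME_LEN: usize = 32;
const DENTRY_MAX_NAME_LEN: usize = 255;

/// Userspace stand-in for DentryName.
enum DentryNameSim {
    Inline { buf: [u8; DENTRY_INLINE_NAME_LEN], len: u8 },
    Heap { ptr: Box<[u8]> },
}

/// Validate against NAME_MAX first, then store short names inline
/// (no heap allocation) and long names on the heap.
fn make_name(name: &[u8]) -> Result<DentryNameSim, i32> {
    const ENAMETOOLONG: i32 = 36; // conventional errno number
    if name.len() > DENTRY_MAX_NAME_LEN {
        return Err(ENAMETOOLONG);
    }
    if name.len() <= DENTRY_INLINE_NAME_LEN {
        let mut buf = [0u8; DENTRY_INLINE_NAME_LEN];
        buf[..name.len()].copy_from_slice(name);
        Ok(DentryNameSim::Inline { buf, len: name.len() as u8 })
    } else {
        Ok(DentryNameSim::Heap { ptr: name.to_vec().into_boxed_slice() })
    }
}
```

Since more than 99% of real filenames fit in 32 bytes, the common case allocates nothing beyond the dentry itself; the `Heap` variant only exists for the long tail up to NAME_MAX.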
14.1.2.3.1.3 AddressSpace (Page Cache Mapping)
/// VFS-layer page cache wrapper for one inode. Wraps a `PageCache`
/// ([Section 4.4](04-memory.md#page-cache)) with VFS-layer writeback
/// coordination, error tracking, and filesystem-specific operations.
///
/// Each inode for a regular file or block device has exactly one
/// `AddressSpace`. Directories and symlinks typically do not use
/// `AddressSpace` unless the filesystem maps their data through the page
/// cache (e.g., directories in ext4 are page-cache-backed).
///
/// **Storage**: `AddressSpace` is embedded directly inside `Inode`
/// (field `i_mapping`). No separate allocation is needed on the fast
/// path.
///
/// **Concurrency**:
/// - `page_cache`: `Option<PageCache>` — `Some` for normal files, `None` for
///   DAX files (AS_DAX set). When `Some`, the inner XArray provides RCU-safe
///   lock-free reads and per-instance `xa_lock` for writers. See
///   [Section 4.4](04-memory.md#page-cache) for the full concurrency model.
///   All code paths that access `page_cache` must check `is_some()` first;
///   DAX paths bypass the page cache entirely.
/// - `page_cache.nr_pages`, `page_cache.nr_dirty`, `nrwriteback`: independent
///   atomic counters; no lock needed for individual increments/decrements
///   (`nr_pages` and `nr_dirty` only exist when `page_cache` is `Some`).
/// - `writeback_lock`: `Mutex` serializing concurrent writeback of
///   this inode's pages. At most one writeback agent runs per inode
///   at any time.
/// - `writeback_in_progress`: `AtomicBool` lightweight sentinel checked
///   by the reclaim path without acquiring `writeback_lock`.
pub struct AddressSpace {
    /// Back-pointer to the owning inode. `Weak` to avoid a reference
    /// cycle (Inode → AddressSpace → Inode).
    pub host: Weak<Inode>,

    /// Page storage backend — XArray with RCU-safe lock-free reads and
    /// per-instance `xa_lock` for writers. Defined in [Section 4.4](04-memory.md#page-cache).
    /// `page_cache.nr_pages` and `page_cache.nr_dirty` are the canonical
    /// page/dirty counters (no separate copies here — use accessors).
    /// None for DAX-capable filesystems that map persistent memory directly.
    pub page_cache: Option<PageCache>,

    /// Number of pages currently under active writeback I/O. A page is
    /// counted here from the moment writeback I/O is submitted until the
    /// I/O completion handler clears the `PG_WRITEBACK` flag.
    pub nrwriteback: AtomicU64,

    /// True while writeback I/O is in progress for this inode. Lightweight
    /// sentinel for the memory reclaim path: reclaim checks this flag
    /// without acquiring `writeback_lock` to skip inodes already being
    /// flushed. The writeback thread sets this AFTER acquiring
    /// `writeback_lock` and clears it BEFORE releasing the lock.
    /// Ordering: `writeback_lock` acquisition → set flag → writeback I/O →
    /// clear flag → release `writeback_lock`.
    /// Intra-domain (VFS Tier 1). Not accessed from Core directly.
    /// AtomicBool validity invariant maintained by Rust type safety
    /// within the compilation unit.
    pub writeback_in_progress: AtomicBool,

    /// Writeback error sequence counter. Updated on I/O errors via
    /// `ErrSeq::set_err(errno)`. Each open file descriptor snapshots
    /// `wb_err` at open time (`file.f_wb_err`); `fsync()` compares the
    /// snapshot to detect new errors. See [Section 14.4](#vfs-fsync-and-cow).
    pub wb_err: ErrSeq,

    /// Writeback serialization state. At most one concurrent writeback
    /// agent is permitted per `AddressSpace` to avoid seek amplification
    /// on rotational storage and to simplify error propagation.
    ///
    /// `writeback_lock` serializes writeback *within* a single inode's
    /// `AddressSpace`. Multiple inodes on the same backing device can
    /// writeback concurrently — `BdiWriteback` ([Section 4.6](04-memory.md#writeback-subsystem--writeback-thread-organization))
    /// coordinates device-level I/O scheduling across all inodes, not
    /// per-inode serialization. Two threads holding their respective
    /// inode writeback_locks may both submit bios to the same block
    /// device — this is correct and desirable for throughput.
    pub writeback_lock: Mutex<WritebackState>,

    /// Sequence counter for truncation-fault coordination.
    ///
    /// Replaces Linux's `mapping->invalidate_lock` (rwsem, added v5.15,
    /// commit 730633f0b7f9) with a lockless seqcount. Writers (truncate,
    /// hole-punch, collapse-range) bracket page cache mutations with
    /// `invalidate_begin()` / `invalidate_end()` while holding
    /// `I_RWSEM(write)`. Readers (page fault) call `read_begin()` before
    /// page cache lookup and `read_check()` after PTE installation — two
    /// atomic loads, no lock acquired.
    ///
    /// The seqcount eliminates the ONLY lock ordering exception that was
    /// previously required in the page fault path (`VMA_LOCK(105)` →
    /// `INVALIDATE_LOCK(90)` violated descending-level order). With
    /// `InvalidateSeq`, the fault path lock chain is strictly ascending:
    /// `VMA_LOCK(105, read)` → `PAGE_LOCK(180)` → `PTL(185)`.
    ///
    /// See [Section 4.8](04-memory.md#virtual-memory-manager--invalidateseq-lockless-truncation-fault-coordination)
    /// for the full struct definition, memory ordering table, and edge
    /// case analysis.
    pub invalidate_seq: InvalidateSeq,

    /// Filesystem-provided callbacks for page cache operations.
    /// Statically known at inode creation time; never changes.
    pub ops: &'static dyn AddressSpaceOps,

    /// Flags controlling eviction and special page semantics.
    ///
    /// - `AS_UNEVICTABLE` (bit 0): pages must not be reclaimed under
    ///   memory pressure (e.g., ramfs, tmpfs locked pages).
    /// - `AS_BALLOON_PAGE` (bit 1): pages are balloon-inflated and may
    ///   be reclaimed by the balloon driver at any time.
    /// - `AS_EIO` (bit 2): a writeback error occurred; subsequent
    ///   `fsync` calls must return `-EIO` until the flag is cleared.
    /// - `AS_ENOSPC` (bit 3): a writeback error occurred due to no
    ///   space remaining on device.
    /// - `AS_DAX` (bit 4): this mapping is DAX (Direct Access) — file data
    ///   lives in persistent memory and is mapped directly into user page
    ///   tables without page cache copies. When set, the page fault handler
    ///   calls `dax_iomap_fault()` instead of `filemap_fault()`, and
    ///   `writepages`/`writepage` are never called (no page cache to write
    ///   back). Set at mount time for filesystems on persistent memory
    ///   mounted with `-o dax`. When `AS_DAX` is set, `page_cache` is
    ///   `None` — no `PageCache` is allocated for DAX files. Direct-access
    ///   files use CPU load/store through the DAX mapping
    ///   ([Section 15.16](15-storage.md#persistent-memory--design-dax-direct-access-integration)),
    ///   bypassing the page cache entirely. This saves ~256 bytes per DAX
    ///   inode (the `PageCache` struct including its embedded XArray root,
    ///   counters, and xa_lock).
    ///
    /// **Dual error reporting**: `AS_EIO`/`AS_ENOSPC` flags and `wb_err`
    /// (errseq_t) serve complementary purposes. The flags provide a quick
    /// boolean "any error occurred?" check used by `sync_file_range()` and
    /// the writeback scanner. The errseq_t counter provides per-fd error
    /// tracking so that multiple concurrent `fsync()` callers each see the
    /// error exactly once. Both are set atomically in `writeback_end_io()`.
    /// This dual mechanism matches Linux 4.13+ semantics (commit 5660e13d).
    pub flags: AtomicU32,

    /// DAX error generation counter. Only meaningful when `flags` contains
    /// `AS_DAX`. DAX files bypass the page cache, so the standard `wb_err`
    /// mechanism (which tracks writeback I/O errors on page cache pages)
    /// does not apply. Instead, hardware-detected errors on persistent
    /// memory (MCE on x86, SEA on ARM64) are recorded here.
    ///
    /// Error propagation for DAX files:
    /// - MCE/SEA → `SIGBUS` to the accessing process (immediate, via the
    ///   page fault / machine-check handler).
    /// - MCE/SEA → increment `dax_err` generation (for deferred `fsync`
    ///   reporting).
    /// - `fsync()` on a DAX file: compare `file.f_dax_err` with
    ///   `mapping.dax_err`. If generations differ, return `-EIO`. This is
    ///   the same generation-counter protocol used by `wb_err` for non-DAX
    ///   files ([Section 14.15](#disk-quota-subsystem--writeback-error-propagation-errseq)),
    ///   but applied to DAX hardware errors instead of writeback I/O errors.
    /// - `f_dax_err` is snapshotted at `open()` time, identical to `f_wb_err`.
    ///
    /// For non-DAX files (`AS_DAX` not set), this field is unused (reads as 0).
    pub dax_err: AtomicU32,

    /// Interval tree of file-backed VMAs mapping this file. Used for
    /// reverse mapping: truncation, writeback, page migration, and KSM
    /// need to find all VMAs mapping a given file offset range. This is
    /// the UmkaOS equivalent of Linux's `address_space.i_mmap` (`rb_root_cached`
    /// interval tree) protected by `i_mmap_rwsem`.
    ///
    /// The `RwLock` protects concurrent insert/remove during mmap/munmap
    /// (writers) vs. read during truncation/writeback/rmap walks (readers).
    /// Lock level: follows `mmap_lock` — callers hold `mmap_lock.write()`
    /// before acquiring `i_mmap.write()`. Readers (rmap walks) acquire
    /// `i_mmap.read()` independently.
    ///
    /// `IntervalTree<VmaRef>` stores `(start_pgoff, end_pgoff, VmaRef)` tuples.
    /// Lookup: `i_mmap.tree.query(pgoff_start, pgoff_end)` returns all VMAs
    /// whose file offset range overlaps `[pgoff_start, pgoff_end)`.
    pub i_mmap: RwLock<IntervalTree<VmaRef>>,
}
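The `wb_err` snapshot/compare/advance protocol can be illustrated with a stripped-down generation counter. This sketch keeps only the exactly-once-per-fd property; the real `ErrSeq` additionally packs the errno value and a "seen" bit into the counter:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified stand-in for ErrSeq: a bare generation counter bumped on
/// each writeback error.
struct ErrGen(AtomicU64);

impl ErrGen {
    fn new() -> Self { ErrGen(AtomicU64::new(0)) }
    /// writeback_end_io(): record a new writeback error.
    fn set_err(&self) { self.0.fetch_add(1, Ordering::Release); }
    /// open(): snapshot the current generation into file.f_wb_err.
    fn sample(&self) -> u64 { self.0.load(Ordering::Acquire) }
    /// fsync(): report an error iff one occurred since the snapshot, then
    /// advance the snapshot so each fd sees the error exactly once.
    fn check_and_advance(&self, snapshot: &mut u64) -> Result<(), i32> {
        let cur = self.0.load(Ordering::Acquire);
        if cur != *snapshot {
            *snapshot = cur;
            return Err(5); // illustrative -EIO
        }
        Ok(())
    }
}
```

Because each fd carries its own snapshot, two processes that both `fsync()` after the same failed writeback each get the error once, and neither masks it from the other.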

/// DAX File Handling
///
/// DAX (Direct Access) files on persistent memory bypass the page cache
/// entirely. When a filesystem is mounted with `-o dax` on a persistent
/// memory device, every inode's `AddressSpace` has `AS_DAX` set and
/// `page_cache` is `None`.
///
/// **Memory savings**: Skipping `PageCache` allocation saves ~256 bytes per
/// DAX inode (XArray root node, `nr_pages`/`nr_dirty` counters, `xa_lock`,
/// internal bookkeeping). On a persistent memory filesystem with millions
/// of small files, this is significant.
///
/// **Error tracking**: DAX files cannot use the standard `wb_err` writeback
/// error mechanism because there are no page cache pages and no writeback
/// I/O. Instead, hardware memory errors (MCE on x86-64, Synchronous
/// External Abort on AArch64) are tracked via `AddressSpace::dax_err`:
///
///   1. Hardware detects uncorrectable error on persistent memory address.
///   2. MCE/SEA handler delivers `SIGBUS` (`BUS_MCEERR_AR` for synchronous,
///      `BUS_MCEERR_AO` for asynchronous) to the process whose access
///      triggered the fault. This is immediate — the process is notified
///      before `fsync` is ever called.
///   3. MCE/SEA handler increments `mapping.dax_err` (AtomicU32 generation
///      counter, same wrap-around protocol as `ErrSeq`).
///   4. On `fsync()`: the VFS compares `file.f_dax_err` (snapshotted at
///      `open()`) with `mapping.dax_err`. If they differ, `fsync` returns
///      `-EIO` and advances `file.f_dax_err` to the current generation
///      (so the error is reported exactly once per fd, matching `wb_err`
///      semantics).
///
/// **Dirty page throttling**: `balance_dirty_pages()` excludes DAX files.
/// DAX writes go directly to persistent memory via CPU store instructions
/// — there are no dirty page cache pages to throttle. Write bandwidth is
/// bounded by the persistent memory device's write throughput, not by the
/// kernel's dirty page ratio. The `writeback_lock`, `nrwriteback`, and
/// `writeback_in_progress` fields are unused for DAX inodes.
///
/// **Page fault path**: When a DAX file is faulted, the VFS calls
/// `dax_iomap_fault()` (not `filemap_fault()`). This maps the persistent
/// memory physical address directly into the process's page table — no
/// page allocation, no page cache insertion, no copy. For huge page faults
/// (PMD-level, 2 MiB on x86-64), `dax_iomap_pmd_fault()` maps a single
/// PMD entry covering the entire 2 MiB region.

/// Serialized writeback state embedded inside `AddressSpace::writeback_lock`.
///
/// Protected by `AddressSpace::writeback_lock`. The `Mutex` ensures only
/// one writeback agent runs at a time; the fields inside track progress
/// so that a new agent can resume where the previous one left off.
pub struct WritebackState {
    /// Next page index to examine during writeback. The writeback agent
    /// advances this forward as pages are submitted for I/O. Wraps to 0
    /// after reaching the last page, implementing a cyclic scan
    /// consistent with the kernel's "kupdate" writeback policy.
    pub writeback_index: u64,

    /// Accumulated bytes of dirty data at the time writeback started.
    /// Used to limit how much data a single writeback pass writes, so
    /// that a continuous dirty stream does not starve readers.
    pub dirty_bytes: u64,
}
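The cyclic scan driven by `writeback_index` can be sketched as a pure function over a sorted list of dirty page indices (`cyclic_scan` is illustrative; the real agent walks the page cache XArray and submits bios):

```rust
/// Sketch of the cyclic writeback scan: resume at `start`
/// (writeback_index), walk forward over dirty page indices, wrap to 0
/// at end of file, and stop after `batch` pages so a continuous dirty
/// stream cannot starve readers. Returns the pages submitted this pass
/// and the updated resume index. `dirty` must be sorted ascending.
fn cyclic_scan(dirty: &[u64], nr_pages_total: u64, start: u64, batch: usize) -> (Vec<u64>, u64) {
    // First leg: start..EOF; second leg: 0..start (the wrap-around).
    let order = dirty.iter().copied().filter(|&i| i >= start)
        .chain(dirty.iter().copied().filter(|&i| i < start));
    let submitted: Vec<u64> = order.take(batch).collect();
    let next = match submitted.last() {
        Some(&last) => (last + 1) % nr_pages_total, // resume after last page
        None => start,                              // nothing dirty: stay put
    };
    (submitted, next)
}
```

The wrap-around is what gives every page range a fair share of writeback over repeated passes, matching the "kupdate"-style policy the field documentation describes.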

/// Filesystem callbacks invoked by the VFS page cache layer.
///
/// Each filesystem that participates in the page cache provides a
/// static `AddressSpaceOps` implementation. The VFS calls these methods
/// when it needs to populate the cache (read miss), flush dirty pages
/// (writeback), or decide whether a page can be dropped (reclaim).
///
/// **Object safety**: all methods take `&self` on the ops vtable plus
/// explicit `AddressSpace`/`Page` references. The vtable itself is
/// `'static`, `Send`, and `Sync`.
pub trait AddressSpaceOps: Send + Sync {
    /// Read one page (identified by `index`, the file offset in units of
    /// `PAGE_SIZE`) from the backing store into the page cache. The page
    /// has already been allocated, locked (`PageFlags::LOCKED`), and
    /// inserted into the page cache XArray by the caller
    /// (`filemap_get_pages`); the filesystem must initiate the I/O that
    /// populates the page contents.
    ///
    /// **Contract**: Implementations MUST NOT allocate a new page or
    /// overwrite the page cache XArray slot. The caller owns the slot;
    /// overwriting it orphans the locked page and deadlocks concurrent
    /// readers waiting on `PageFlags::LOCKED`.
    ///
    /// Called with no locks held. The implementation may block.
    fn read_page(
        &self,
        mapping: &AddressSpace,
        index: u64,
        page: &Arc<Page>,
    ) -> Result<(), IoError>;

    /// **Example: ext4 read_page() flow**
    ///
    /// When a file-backed page fault triggers `read_page()` on an ext4 file:
    ///
    /// 1. `ext4_read_page(mapping, pgoff, page)`:
    ///    a. Map logical block: `ext4_map_blocks(inode, pgoff)` → translates file offset
    ///       to physical block number via the extent tree.
    ///    b. Build Bio: `Bio::new_read(bdev, phys_block, page)`.
    ///    c. Submit: `bio_submit(bio)` → dispatches to block device driver.
    ///    d. Wait: page is unlocked by bio completion callback when I/O finishes.
    ///    e. Return `Ok(())` — page now contains file data.
    ///
    /// The readahead engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)) may
    /// batch multiple pages into a single Bio with scatter-gather, submitting them
    /// via `AddressSpaceOps::readahead()` instead of individual `read_page()` calls.

    /// Read multiple pages as a batch for readahead. Receives the readahead
    /// window from the readahead engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)).
    /// Implementations should submit I/O for all requested pages in a single
    /// Bio batch. Filesystems that do not implement this method fall back to
    /// sequential `read_page()` calls.
    /// Default: returns `EOPNOTSUPP` (use `read_page` fallback).
    fn readahead(
        &self,
        _mapping: &AddressSpace,
        _ra: &ReadaheadControl,
    ) -> Result<(), IoError> {
        Err(IoError::new(Errno::EOPNOTSUPP))
    }

    /// Write a single dirty page to the backing store. `wbc` carries
    /// writeback control parameters (sync mode, range limits, number
    /// of pages already written in this pass). The implementation must
    /// set `PG_WRITEBACK` for the duration of the I/O. The implementation
    /// MUST NOT clear `PG_DIRTY` — the `DIRTY → clean` transition and
    /// `nr_dirty` decrement are owned exclusively by the completion callback
    /// (`writeback_end_io()` for Tier 0, or the Tier 0 `WritebackResponse`
    /// handler for Tier 1). Clearing DIRTY here would cause a double-decrement
    /// of `nr_dirty` when the completion callback also clears it.
    ///
    /// Called with no locks held. The implementation may block.
    fn writepage(
        &self,
        mapping: &AddressSpace,
        page: &Page,
        wbc: &WritebackControl,
    ) -> Result<(), IoError>;

    /// Write multiple dirty pages to the backing store in a single batch.
    /// Called by the writeback subsystem ([Section 4.6](04-memory.md#writeback-subsystem)) instead of
    /// iterating `writepage()` one page at a time. The filesystem should submit
    /// all dirty pages in the address space (subject to `wbc` constraints) as
    /// coalesced Bio requests for maximum throughput.
    ///
    /// # Returns
    /// - `Ok(n)`: Number of pages successfully submitted for writeback.
    /// - `Err(IoError)`: Fatal error; writeback aborted for this inode.
    ///
    /// Default: returns `EOPNOTSUPP` (writeback layer falls back to per-page
    /// `writepage()` calls). Filesystems that support extent-based I/O (ext4,
    /// XFS, btrfs) should implement this for 5-10x writeback throughput vs.
    /// per-page writepage on rotational media.
    fn writepages(
        &self,
        _mapping: &AddressSpace,
        _wbc: &WritebackControl,
    ) -> Result<u64, IoError> {
        Err(IoError::new(Errno::EOPNOTSUPP))
    }

    /// Verify data integrity of a page populated through a non-standard
    /// path (RDMA fetch, DSM migration, decompression). Filesystems that
    /// store per-page checksums (btrfs, ext4 metadata, ZFS) implement this
    /// to catch silent corruption from paths that bypass the standard block
    /// I/O checksum pipeline.
    ///
    /// Called by the DSM page fetch path ([Section 6.11](06-dsm.md#dsm-distributed-page-cache))
    /// after RDMA-fetching a page from a remote peer, before setting
    /// `PageFlags::UPTODATE`. If verification fails, the fetched page is
    /// discarded and the DSM falls back to storage I/O.
    ///
    /// Default: returns `Ok(true)` — the page is accepted without
    /// verification (appropriate for filesystems without per-page
    /// checksums, e.g., tmpfs, ext2, NFS).
    fn verify_page(
        &self,
        _mapping: &AddressSpace,
        _index: u64,
        _page: &Page,
    ) -> Result<bool, IoError> {
        Ok(true)
    }

    /// Called by the page reclaimer immediately before a clean page is
    /// removed from the cache. The filesystem may decline eviction by
    /// returning `false` (e.g., because it has pinned the page for
    /// journalling). Returning `true` grants permission to evict.
    ///
    /// Must not block; must not acquire locks that might sleep.
    fn releasepage(&self, page: &Page) -> bool;

    /// Called by `generic_file_write_iter()` before writing user data into
    /// a page. The filesystem prepares the page for writing:
    ///
    /// - **ext4**: starts a JBD2 journal handle (`journal_start()`), allocates
    ///   blocks for delayed allocation, reads the page from disk if the write
    ///   is partial (does not cover the entire page).
    /// - **XFS**: creates a delayed allocation extent reservation.
    /// - **tmpfs**: allocates a swap-backed page.
    /// - **Default (simple filesystems)**: allocates a clean page from the
    ///   page cache if not already present, zeroing unwritten portions.
    ///
    /// The returned page reference is locked (`PageFlags::LOCKED` set).
    /// The caller (`generic_file_write_iter`) copies user data into the
    /// page between `write_begin` and `write_end`.
    ///
    /// On error (e.g., `ENOSPC` from block allocation), the write is aborted
    /// and the page is released without modification.
    ///
    /// See [Section 15.6](15-storage.md#filesystem-ext4) for ext4's implementation.
    fn write_begin(
        &self,
        mapping: &AddressSpace,
        pos: u64,
        len: usize,
        flags: u32,
    ) -> Result<PageRef, IoError>;

    /// Called by `generic_file_write_iter()` after writing user data into
    /// the page returned by `write_begin()`. The filesystem commits the
    /// write:
    ///
    /// - **ext4**: marks buffer heads dirty, stops the JBD2 journal handle
    ///   (`journal_stop()`), updates `i_size` if the write extended the file.
    /// - **XFS**: marks the page dirty, updates extent state.
    /// - **Default (simple filesystems)**: marks the page dirty via
    ///   `set_page_dirty()`.
    ///
    /// `copied` is the number of bytes actually copied by the write (may be
    /// less than `len` for a short copy from user memory). The filesystem
    /// must handle partial writes correctly (e.g., by not advancing `i_size`
    /// past the last successfully written byte).
    ///
    /// The page is still locked on entry; the filesystem may unlock it
    /// before returning.
    ///
    /// **Tier boundary for `set_page_dirty()`**: For Tier 1 filesystems
    /// (ext4, XFS, Btrfs), `write_end()` is invoked via the KABI ring:
    /// the Tier 0 VFS dispatches a `WriteEnd` command to the filesystem's
    /// domain, the filesystem processes it and returns a response. The
    /// response includes a `dirty: bool` flag indicating whether the page
    /// should be marked dirty. The **Tier 0 VFS ring consumer** -- not the
    /// Tier 1 filesystem -- calls `set_page_dirty()` upon receiving a
    /// response with `dirty == true`. This keeps all page cache metadata
    /// operations (`set_page_dirty()`, `nr_dirty` counters, BDI dirty
    /// list) in Tier 0, avoiding cross-domain direct calls from Tier 1.
    ///
    /// For Tier 0 filesystems (tmpfs, ramfs -- statically linked), the
    /// `write_end()` callback runs in the same domain and calls
    /// `set_page_dirty()` directly. No ring dispatch is needed.
    ///
    /// This is the same pattern used for block I/O completion (Tier 1
    /// NVMe driver signals via outbound ring, Tier 0 consumer calls
    /// `bio_complete()`) -- see [Section 12.8](12-kabi.md#kabi-domain-runtime) and
    /// [Section 15.19](15-storage.md#nvme-driver-architecture).
    fn write_end(
        &self,
        mapping: &AddressSpace,
        pos: u64,
        len: usize,
        copied: usize,
        page: PageRef,
    ) -> Result<usize, IoError>;

    /// Called by the page cache when a page is first dirtied. Allows the
    /// filesystem to register the affected block extent for crash recovery
    /// journaling BEFORE the page is modified.
    ///
    /// Filesystems with journaling (ext4, btrfs, XFS) implement this to
    /// record dirty extents in their journal. Filesystems without journaling
    /// (tmpfs, ramfs, NFS) leave this as the default no-op.
    ///
    /// Called from `__set_page_dirty()` with the page locked. The `offset`
    /// and `len` arguments describe the byte range within the file that
    /// will be dirtied (typically `page_offset` and `PAGE_SIZE`, but
    /// sub-page dirty tracking for large folios may pass smaller ranges).
    ///
    /// **Interaction with two-phase dirty extent protocol**: For Tier 1
    /// VFS drivers running in an isolated domain, the `dirty_extent()`
    /// callback calls `vfs_dirty_extent_reserve()` to register the
    /// logical intent in Core's dirty intent list
    /// ([Section 14.1](#virtual-filesystem-layer)). For in-place filesystems that
    /// know the block address at dirty time (ext4 non-delayed-alloc),
    /// the callback may use `vfs_dirty_extent_reserve_and_commit()` to
    /// atomically reserve and bind the physical address. CoW filesystems
    /// (Btrfs, XFS reflink) call only `vfs_dirty_extent_reserve()` here
    /// and defer `vfs_dirty_extent_commit()` to the writeback path after
    /// block allocation. For Tier 0 filesystems (statically linked, e.g.,
    /// tmpfs), the callback can directly update internal journal
    /// structures without crossing a domain boundary.
    ///
    /// **In-place filesystem implementation pattern** (ext4 example):
    /// 1. `dirty_extent()` is called with the byte range `[offset, offset+len)`.
    /// 2. The filesystem maps the byte range to physical block extents via
    ///    its extent tree.
    /// 3. Calls `vfs_dirty_extent_reserve_and_commit(inode_id, offset, len,
    ///    block_addr, block_len)` — atomic reserve+commit since the block
    ///    address is known.
    /// 4. The filesystem writes a journal descriptor block recording the
    ///    physical extents that are about to be modified.
    /// 5. Only after the journal descriptor is committed (or at least
    ///    queued for commit) does `dirty_extent()` return `Ok(())`.
    /// 6. The caller (`__set_page_dirty()`) then sets `PageFlags::DIRTY`
    ///    on the page.
    ///
    /// **CoW filesystem implementation pattern** (Btrfs example):
    /// 1. `dirty_extent()` is called with the byte range `[offset, offset+len)`.
    /// 2. Calls `vfs_dirty_extent_reserve(inode_id, offset, len)` — Phase 1
    ///    only. No block address is available yet (CoW allocates at writeback).
    /// 3. Returns `Ok(())` with the `DirtyExtentToken` stored in the inode's
    ///    per-extent pending-commit table (filesystem-private state).
    /// 4. During writeback, the filesystem allocates new blocks via its
    ///    extent allocator, then calls `vfs_dirty_extent_commit(token,
    ///    block_addr, block_len)` — Phase 2.
    /// 5. After I/O completion, calls `vfs_flush_extent_complete()`.
    ///
    /// This ordering guarantee ensures that on crash recovery, Core has a
    /// record of every dirty extent — no data modification happens without
    /// a corresponding intent entry.
    fn dirty_extent(
        &self,
        _mapping: &AddressSpace,
        _offset: u64,
        _len: u64,
    ) -> Result<(), IoError> {
        // Default: no-op. Appropriate for in-memory filesystems (tmpfs,
        // ramfs) and network filesystems (NFS, which has its own write
        // delegation protocol).
        Ok(())
    }

    /// Returns the direct-I/O implementation for this address space,
    /// if the filesystem supports bypassing the page cache (e.g., for
    /// `O_DIRECT` opens). Returns `None` if direct I/O is not supported;
    /// the VFS will then fall back to the page-cache path.
    fn direct_io(&self) -> Option<&dyn DirectIoOps> {
        None
    }
}
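To make the vtable contract concrete, here is a minimal userspace sketch of a filesystem plugging into a simplified stand-in for `AddressSpaceOps`. The trait, types, and `RamFs` below are illustrative inventions (no `AddressSpace`/`Page` types, pages are plain byte slices), not the spec's definitions — the point is only the shape: implement the required read path, inherit the defaults.

```rust
// Simplified stand-in for the ops vtable: required read_page, defaulted
// releasepage. Real code would take &AddressSpace and &Arc<Page>.
trait AddressSpaceOpsLite: Send + Sync {
    fn read_page(&self, index: u64, page: &mut [u8]) -> Result<(), ()>;
    fn releasepage(&self, _index: u64) -> bool {
        true // default: page is always evictable
    }
}

/// tmpfs-like toy filesystem: the "backing store" is a byte vector.
struct RamFs {
    data: Vec<u8>,
}

impl AddressSpaceOpsLite for RamFs {
    fn read_page(&self, index: u64, page: &mut [u8]) -> Result<(), ()> {
        let start = (index as usize) * page.len();
        if start >= self.data.len() {
            return Err(()); // read past end of backing store
        }
        let end = (start + page.len()).min(self.data.len());
        // Copy what exists; the tail of a partial last page stays zeroed,
        // mirroring the zero-fill past EOF behaviour of the page cache.
        page[..end - start].copy_from_slice(&self.data[start..end]);
        Ok(())
    }
}

fn main() {
    let fs = RamFs { data: vec![7u8; 6000] };
    let mut page = [0u8; 4096];
    // Page 1 covers bytes 4096..6000; the rest of the page stays zero.
    fs.read_page(1, &mut page).unwrap();
    assert_eq!(page[0], 7);
    assert_eq!(page[2000], 0);
    assert!(fs.releasepage(1));
}
```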
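The throughput rationale behind `writepages()` — submitting contiguous dirty page runs as single coalesced requests instead of one Bio per page — can be sketched with a toy helper (`coalesce` is hypothetical, not part of the spec):

```rust
/// Group sorted dirty page indices into contiguous (start, len) runs.
/// Each run would become one coalesced Bio submission; per-page
/// writepage() would issue one Bio per index instead.
fn coalesce(indices: &[u64]) -> Vec<(u64, u64)> {
    let mut runs: Vec<(u64, u64)> = Vec::new();
    for &idx in indices {
        match runs.last_mut() {
            // Extends the current run: same extent, one more page.
            Some((start, len)) if *start + *len == idx => *len += 1,
            // Gap in the index sequence: start a new run.
            _ => runs.push((idx, 1)),
        }
    }
    runs
}

fn main() {
    // Six dirty pages collapse into three submissions.
    let runs = coalesce(&[4, 5, 6, 9, 12, 13]);
    assert_eq!(runs, vec![(4, 3), (9, 1), (12, 2)]);
}
```

On rotational media this reduction in submission count (plus the sequential layout of each run) is where the 5-10x writeback throughput figure comes from.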
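The two-phase reserve/commit flow described for `dirty_extent()` can be sketched as follows. The token and the reserve/commit split mirror the spec's `vfs_dirty_extent_reserve()`/`vfs_dirty_extent_commit()` calls, but the types and bodies here are toy stand-ins that track intent state only:

```rust
#[derive(Debug, PartialEq)]
enum Intent {
    /// Phase 1: logical intent recorded, no block address yet (CoW path).
    Reserved { offset: u64, len: u64 },
    /// Phase 2: physical address bound after block allocation.
    Committed { offset: u64, len: u64, block_addr: u64 },
}

/// Opaque handle returned by reserve, consumed by commit.
struct DirtyExtentToken(usize);

#[derive(Default)]
struct Core {
    intents: Vec<Intent>, // Core's dirty intent list (toy version)
}

impl Core {
    fn reserve(&mut self, offset: u64, len: u64) -> DirtyExtentToken {
        self.intents.push(Intent::Reserved { offset, len });
        DirtyExtentToken(self.intents.len() - 1)
    }

    fn commit(&mut self, tok: DirtyExtentToken, block_addr: u64) {
        if let Intent::Reserved { offset, len } = self.intents[tok.0] {
            self.intents[tok.0] = Intent::Committed { offset, len, block_addr };
        }
    }
}

fn main() {
    let mut core = Core::default();
    // CoW pattern: reserve at dirty time (no block address known)...
    let tok = core.reserve(8192, 4096);
    // ...commit at writeback time, after the extent allocator runs.
    core.commit(tok, 0xbeef);
    assert_eq!(
        core.intents[0],
        Intent::Committed { offset: 8192, len: 4096, block_addr: 0xbeef }
    );
}
```

An in-place filesystem would collapse both phases into one call at dirty time, since the block address is already known; the crash-recovery guarantee in both cases is that an intent entry exists before any data is modified.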
14.1.2.3.1.4 Generic File Operations (VFS-to-Page-Cache Bridge)

The VFS provides generic implementations of file read/write that bridge FileOps calls to the page cache. Most filesystem types delegate their FileOps::read() and FileOps::write() to these generic functions, only providing the AddressSpaceOps callbacks for cache miss I/O.

Isolation domain: filemap_get_pages() runs in Tier 0 (Core domain). The page cache XArray is Core memory. VFS dispatches the read request to Core via the KABI ring; Core's filemap_get_pages() accesses the page cache directly. The ~23-46 cycle domain crossing happens at the VFS-to-Core ring boundary.
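The tier boundary for `write_end()` — the Tier 1 filesystem returns a `dirty` flag in its ring response, and only the Tier 0 ring consumer touches page cache metadata — can be sketched as follows. All types here (`WriteEndResponse`, `PageState`) are toy stand-ins for the KABI ring types, not the spec's definitions:

```rust
/// Ring response posted by the Tier 1 filesystem after processing
/// a WriteEnd command. It never touches Tier 0 metadata directly.
struct WriteEndResponse {
    copied: usize,
    dirty: bool, // "please mark the page dirty on my behalf"
}

/// Tier 0-owned page cache metadata (stand-in for the page flags,
/// nr_dirty counter, and BDI dirty list).
#[derive(Default)]
struct PageState {
    dirty: bool,
    nr_dirty: u64,
}

impl PageState {
    /// Tier 0 ring consumer: the only code path that sets the dirty
    /// flag, so the nr_dirty accounting has a single owner.
    fn handle_write_end_response(&mut self, resp: &WriteEndResponse) -> usize {
        if resp.dirty && !self.dirty {
            self.dirty = true;
            self.nr_dirty += 1;
        }
        resp.copied
    }
}

fn main() {
    let mut ps = PageState::default();
    let resp = WriteEndResponse { copied: 4096, dirty: true };
    assert_eq!(ps.handle_write_end_response(&resp), 4096);
    // A second response for an already-dirty page must not double-count.
    ps.handle_write_end_response(&resp);
    assert_eq!(ps.nr_dirty, 1);
}
```

The same single-owner discipline is why `writepage()` implementations must never clear `PG_DIRTY` themselves: the clean transition belongs to the completion path.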

/// Maximum number of pages fetched in a single readahead or
/// `filemap_get_pages()` call. Matches Linux's `PAGEPOOL_SIZE` (32).
/// A bounded `ArrayVec` is used to avoid heap allocation on the hot path.
pub const MAX_READAHEAD_PAGES: usize = 32;

/// Read pages from the page cache for a file read operation.
/// This is the generic implementation used by most filesystem types.
/// Equivalent to Linux's `filemap_get_pages()` + `generic_file_read_iter()`.
///
/// Returns pages covering the requested range `[pgoff, pgoff + nr_pages)`.
/// Pages not in cache are fetched via `mapping.ops.read_page()`.
///
/// **Concurrency**: Called with no inode locks held. Multiple threads may
/// call this concurrently on the same `AddressSpace`; the page cache XArray
/// ([Section 4.4](04-memory.md#page-cache)) provides internal synchronization.
///
/// **Concurrent reader deduplication (lock-or-find protocol)**:
/// When a cache miss occurs, multiple threads may race to populate the same
/// page index. The following protocol ensures exactly one thread performs
/// I/O, while all others wait for the result:
///
/// 1. **Cache probe**: `pc.pages.load(idx)` — if found, return (cache hit).
/// 2. **Allocate**: Allocate a new page, set `PageFlags::LOCKED` atomically.
/// 3. **Atomic insert**: `pc.pages.try_store(idx, page)` — attempts a
///    compare-and-swap insertion into the XArray slot.
/// 4. **Lost race**: If `try_store` returns an existing page (a concurrent
///    reader won the race and inserted first), drop our freshly allocated
///    page, then wait for the existing page's `PageFlags::LOCKED` to be
///    cleared (the winner is performing I/O). Once unlocked, the existing
///    page contains valid data — return it.
/// 5. **Won race**: If `try_store` succeeds (our page is now in the cache),
///    call `read_page()` to fill the page from the backing store. On I/O
///    completion, clear `PageFlags::LOCKED` and wake all waiters sleeping
///    on this page's lock (step 4 above). Return the filled page.
///
/// This protocol prevents duplicate I/O: at most one `read_page()` call is
/// issued per page index, regardless of the number of concurrent readers.
/// The cost of the losing path is one wasted page allocation (returned to
/// the buddy allocator immediately) plus a sleep on the page lock — no I/O.
///
/// **Readahead integration**: Before the lock-or-find path, this function
/// checks the readahead state (`FileRaState` on the `OpenFile`) and may
/// trigger `AddressSpaceOps::readahead()` to batch-fetch a window of pages
/// in a single I/O. The readahead engine ([Section 4.4](04-memory.md#page-cache--readahead-engine))
/// determines the window size based on sequential access detection.
/// Readahead-populated pages are inserted via the same `try_store` protocol,
/// so concurrent readahead and fault-driven reads do not duplicate I/O.
///
/// **Error handling (short-read semantics)**: If `read_page()` fails for
/// any page in the range, the function clears `PageFlags::LOCKED` on the
/// failed page (waking waiters), sets `PageFlags::ERROR` to signal the
/// failure, and removes the failed page from the cache via
/// `pc.pages.erase(idx)`. If pages were successfully fetched in earlier
/// iterations, they are returned as a short read (the caller receives
/// fewer pages than requested — not an error). Only if *no* pages were
/// successfully fetched does the function return `Err`. This matches
/// POSIX read semantics: a successful partial transfer is reported as a
/// short read, not an error. Waiters sleeping on a page that fails I/O
/// are woken and observe `PageFlags::ERROR`, causing them to return `EIO`.
pub fn filemap_get_pages(
    mapping: &AddressSpace,
    pgoff: u64,
    nr_pages: u32,
    ra_state: &mut FileRaState,
) -> Result<ArrayVec<PageRef, MAX_READAHEAD_PAGES>, IoError> {
    let mut pages = ArrayVec::new();
    let pc = mapping.page_cache.as_ref().ok_or(IoError::new(Errno::EINVAL))?;
    for i in 0..nr_pages as u64 {
        let idx = pgoff + i;

        // Step 1: Cache probe with RCU + speculative refcount.
        // The XArray load returns a PageRef valid only under RCU read lock.
        // We must bump the refcount before releasing RCU to prevent the
        // page reclaimer from freeing the page between lookup and use.
        // This matches Linux's folio_try_get_rcu() pattern in filemap_get_pages().
        {
            let _rcu = rcu_read_lock();
            if let Some(page) = pc.pages.load(idx) {
                // Speculative refcount bump under RCU protection. If the page
                // was concurrently freed (refcount already 0), try_get_ref()
                // returns false and we fall through to the miss path.
                if page.try_get_ref() {
                    drop(_rcu); // Release RCU after refcount is stable.
                    // Cache hit — mark referenced for LRU aging.
                    page.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
                    pages.push(page);
                    continue;
                }
                // Refcount bump failed — page is being freed. Fall through
                // to the miss path (will allocate a new page or find a
                // replacement after reclaim completes).
            }
        }

        // Cache miss — DSM cooperative cache check (three-stage filter).
        //
        // Design: never add latency to the common case. Most misses are
        // local-only (no remote node has the page). The three stages
        // progressively filter out unnecessary RDMA lookups:
        //
        //   Stage 1: Bloom filter (~15-30ns, 3-5 cache lines).
        //            Per-peer counting Bloom filters ([Section 6.11](06-dsm.md#dsm-distributed-page-cache))
        //            are exchanged lazily via DSM heartbeat piggyback. A negative
        //            result means "definitely not cached remotely" — skip RDMA.
        //            Eliminates ~90-95% of remote lookups with zero I/O.
        //
        //   Stage 2: Sequential access rejection (~5ns, single branch).
        //            Sequential readahead streams rarely benefit from cooperative
        //            caching — the same sequential stream is unlikely to be cached
        //            on a peer. If FileRaState indicates sequential pattern (the
        //            readahead engine already tracks this), skip RDMA even if
        //            bloom says "maybe". Eliminates another ~3-5% of lookups.
        //
        //   Stage 3: Speculative parallel issue (RDMA + NVMe simultaneously).
        //            For the remaining ~2-5% of random-access misses where bloom
        //            says "maybe", fire BOTH the RDMA cooperative lookup AND the
        //            local NVMe readahead in parallel. First completion wins;
        //            the loser is cancelled. RDMA (~2-5μs) typically beats NVMe
        //            (~10-100μs) when the remote cache is warm, so we get a real
        //            speedup. When the remote cache is cold, NVMe completes
        //            normally — no added latency.
        //
        // Net effect: zero overhead for ~95% of misses; ~5-10μs speedup for
        // the ~2-5% where remote cache is warm; never slower than local-only.
        let mut ra_submitted = false;
        if mapping.host.i_sb.s_flags.load(Relaxed) & MS_DSM_COOPERATIVE != 0 {
            let file_id = DsmFileId::from_inode(&mapping.host);

            // Stage 1: Bloom filter — fast local rejection.
            let bloom_hit = dsm_bloom_probe(&file_id, idx);

            if bloom_hit {
                // Stage 2: Sequential access rejection.
                let is_sequential = ra_state.prev_pos != 0
                    && idx == ra_state.prev_pos + 1;

                if !is_sequential {
                    // Stage 3: Speculative parallel issue.
                    // Fire RDMA cooperative lookup. Simultaneously, submit
                    // readahead (the NVMe path). The RDMA result is checked
                    // after readahead submission — if RDMA completed first,
                    // use the remote page and cancel the local I/O.
                    let rdma_fut = dsm_cooperative_cache_lookup_async(
                        &file_id, idx,
                    );

                    // NVMe path starts here. Record the submission so the
                    // common-path readahead below is not issued a second
                    // time for the same miss.
                    ra_state.start = idx;
                    page_cache_readahead(mapping, ra_state, nr_pages - i as u32);
                    ra_submitted = true;

                    // Check RDMA result — did the remote cache beat NVMe?
                    if let Some(page) = rdma_fut.try_complete() {
                        // Remote cache won. Cancel local I/O for this page
                        // (readahead may have submitted it; the page cache
                        // deduplicates — the readahead page is simply evicted
                        // on next reclaim pass if unused).
                        page.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
                        pages.push(page);
                        continue;
                    }

                    // RDMA didn't complete yet or missed. NVMe readahead is
                    // already in flight — re-check cache below (normal path).
                    // The RDMA future is dropped (cancelled on drop).
                }
            }
        }

        // Cache miss — trigger readahead before attempting I/O, unless the
        // DSM path above already submitted it for this miss. The readahead
        // engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)) may
        // submit a larger I/O batch here to prefetch upcoming pages.
        // Pass the remaining page count (nr_pages - pages already fetched),
        // not the page index — page_cache_readahead uses FileRaState.start
        // (set by the caller) to determine *which* pages to read, and
        // nr_pages to bound the readahead window size.
        if !ra_submitted {
            ra_state.start = idx;
            page_cache_readahead(mapping, ra_state, nr_pages - i as u32);
        }

        // Re-check cache (readahead may have populated this page).
        // Same RCU + speculative refcount pattern as Step 1.
        {
            let _rcu = rcu_read_lock();
            if let Some(page) = pc.pages.load(idx) {
                if page.try_get_ref() {
                    drop(_rcu);
                    page.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
                    pages.push(page);
                    continue;
                }
            }
        }

        // Step 2: Allocate a new page with LOCKED flag. Relaxed ordering
        // suffices because the page is freshly allocated and not yet shared.
        let new_page = alloc_page(GFP_KERNEL)?;
        new_page.flags.fetch_or(PageFlags::LOCKED, Relaxed);

        // Step 3: Atomic insert — race with concurrent readers.
        match pc.pages.try_store(idx, new_page.clone()) {
            Ok(()) => {
                // Step 5: Won race — we own this slot. Fill via I/O.
                //
                // **Ring dispatch integration (VFS-BUG-2/4 fix)**:
                // For Tier 0 filesystems, `read_page` is a direct synchronous
                // call. For Tier 1 filesystems, the VFS ring protocol is used:
                //   (a) Construct VfsRequest { opcode: ReadPage, page_index: idx,
                //       buf: DmaBufferHandle from the page, offset, count }.
                //   (b) Submit via reserve_slot() -> complete_slot() on the
                //       per-superblock VfsRingPair.
                //   (c) The submitting thread sleeps on the page's wait queue
                //       (wait_on_page_locked), NOT on the ring completion queue.
                //   (d) The Tier 1 driver processes the request, fills the DMA
                //       buffer, and posts a VfsResponse. The ring completion
                //       handler copies data to the page cache page, calls
                //       set_page_uptodate(), clears LOCKED, and wakes the
                //       page's wait queue.
                // This ensures the read path works identically for both tiers:
                // the caller always sleeps on wait_on_page_locked, and the
                // I/O completion path always unlocks the page.
                match mapping.ops.read_page(mapping, idx, &new_page) {
                    Ok(()) => {
                        // I/O initiated. For Tier 0, this returns after the
                        // page is filled. For Tier 1, this returns after the
                        // ring request is submitted (async). In both cases,
                        // LOCKED is cleared by the completion path.
                        wait_on_page_locked(&new_page);
                        new_page.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
                        pages.push(new_page.clone());
                    }
                    Err(e) => {
                        // I/O failed. Mark error, unlock, remove, wake.
                        new_page.flags.fetch_or(PageFlags::ERROR, Release);
                        new_page.flags.fetch_and(!PageFlags::LOCKED, Release);
                        wake_page_waiters(&new_page);
                        pc.pages.erase(idx);
                        // Short-read semantics: if we already have pages,
                        // return them (partial success). Only error if no
                        // pages were fetched at all.
                        if !pages.is_empty() {
                            return Ok(pages);
                        }
                        return Err(e);
                    }
                }
            }
            Err(existing) => {
                // Step 4: Lost race — another thread inserted first.
                // Drop our page (returns to buddy allocator).
                drop(new_page);
                // Wait for the winner to finish I/O (LOCKED cleared).
                wait_on_page_locked(&existing);
                // Check if the winner's I/O failed.
                if existing.flags.load(Acquire) & PageFlags::ERROR != 0 {
                    // Short-read: return pages collected so far, or error
                    // if none were successfully fetched.
                    if !pages.is_empty() {
                        return Ok(pages);
                    }
                    return Err(IoError::new(Errno::EIO));
                }
                existing.flags.fetch_or(PageFlags::ACCESSED, Relaxed);
                pages.push(existing);
            }
        }
    }
    Ok(pages)
}
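The lock-or-find protocol above can be exercised in userspace with a `Mutex` + `Condvar` standing in for `PageFlags::LOCKED` and the page wait queue (all names here are illustrative). Eight threads race for the same index; exactly one performs the simulated I/O, the rest sleep and reuse its result:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

struct Page {
    state: Mutex<Option<Vec<u8>>>, // None == "LOCKED, I/O in flight"
    ready: Condvar,                // stands in for the page wait queue
}

type Cache = Mutex<HashMap<u64, Arc<Page>>>;

fn get_page(cache: &Cache, idx: u64, io_count: &AtomicUsize) -> Vec<u8> {
    // Atomic probe-or-insert stands in for the XArray try_store CAS.
    let (page, winner) = {
        let mut map = cache.lock().unwrap();
        match map.get(&idx) {
            Some(p) => (Arc::clone(p), false), // lost the race
            None => {
                let p = Arc::new(Page {
                    state: Mutex::new(None),
                    ready: Condvar::new(),
                });
                map.insert(idx, Arc::clone(&p));
                (p, true)
            }
        }
    };
    if winner {
        // Exactly-once "read_page": fill the page, then wake waiters.
        io_count.fetch_add(1, Ordering::SeqCst);
        *page.state.lock().unwrap() = Some(vec![idx as u8; 8]);
        page.ready.notify_all();
        return vec![idx as u8; 8];
    }
    // Loser: sleep until the winner's I/O completes (LOCKED cleared).
    let mut st = page.state.lock().unwrap();
    while st.is_none() {
        st = page.ready.wait(st).unwrap();
    }
    st.clone().unwrap()
}

fn main() {
    let cache: Arc<Cache> = Arc::new(Mutex::new(HashMap::new()));
    let io_count = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let (c, n) = (Arc::clone(&cache), Arc::clone(&io_count));
            thread::spawn(move || get_page(&c, 3, &n))
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), vec![3u8; 8]);
    }
    // Despite 8 concurrent readers, "read_page" ran exactly once.
    assert_eq!(io_count.load(Ordering::SeqCst), 1);
}
```

The losing path here mirrors the kernel one: no I/O, just a wasted allocation (the `Arc` the loser never created) and a sleep on the page lock.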

/// Generic file read iterator. Used by most filesystem `FileOps::read()`
/// implementations. Reads from the page cache, triggering readahead and
/// I/O as needed.
///
/// **Flow**:
/// 1. Compute the starting page offset and intra-page offset from `*offset`.
/// 2. Call `filemap_get_pages()` for the required page range.
/// 3. Copy data from the returned pages into `buf` via `copy_to_user()`.
/// 4. Advance `*offset` by the number of bytes read.
/// 5. Return the total bytes copied, or an error if no bytes were read.
///
/// **Short reads**: If the file ends mid-page (offset + len > i_size),
/// only the valid bytes are copied. This is not an error — the return
/// value reflects the actual bytes read, and `*offset` is advanced
/// accordingly.
///
/// **DAX bypass**: If `mapping.flags` has `AS_DAX` set, this function
/// is never called — DAX files use `dax_iomap_rw()` instead, which
/// maps persistent memory directly into the user's address space.
pub fn generic_file_read_iter(
    file: &OpenFile,
    buf: &mut UserSliceMut,
    offset: &mut i64,
) -> Result<usize, IoError> {
    let mapping = &file.inode.i_mapping;
    let mut total = 0usize;

    // VFS-BUG-5 fix: i_size check before read loop. Without this, a read
    // past EOF would copy uninitialized page data to userspace (information
    // disclosure). POSIX: read() past EOF returns 0 bytes, not an error.
    let i_size = file.inode.i_size.load(Acquire) as i64;
    if *offset >= i_size {
        return Ok(0);
    }
    // Clamp the effective read length to not exceed i_size. This ensures
    // we never read beyond the file's logical end, even if page cache pages
    // exist beyond i_size (e.g., from a concurrent truncate race — the
    // truncate path clears those pages asynchronously).
    let max_readable = (i_size - *offset) as usize;
    let effective_remaining = min(buf.remaining(), max_readable);

    while total < effective_remaining {
        let pgoff = (*offset as u64) / PAGE_SIZE as u64;
        let intra = (*offset as usize) % PAGE_SIZE;
        let remaining = effective_remaining - total;
        let nr = min((remaining + PAGE_SIZE - 1) / PAGE_SIZE, MAX_READAHEAD_PAGES as usize);
        let pages = filemap_get_pages(mapping, pgoff, nr as u32, &mut file.ra_state.lock())?;
        for (i, page) in pages.iter().enumerate() {
            // intra-page offset: non-zero only for the first page of each
            // filemap_get_pages batch (the read may start mid-page). All
            // subsequent pages are read from offset 0.
            let page_intra = if i == 0 { intra } else { 0 };
            // Clamp to i_size: never copy beyond the file's logical end.
            let avail = min(
                min(PAGE_SIZE - page_intra, effective_remaining - total),
                buf.remaining(),
            );
            if avail == 0 { break; }
            copy_to_user(buf, unsafe { page.data_ptr().add(page_intra) }, avail)?;
            *offset += avail as i64;
            total += avail;
        }
        if pages.len() < nr { break; } // short read or EOF
    }
    Ok(total)
}
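The EOF clamp at the top of `generic_file_read_iter()` reduces to a small arithmetic rule: the effective length is `min(requested, i_size - offset)`, and a read starting at or past `i_size` is a 0-byte success. A self-contained sketch (`clamp_read_len` is a hypothetical helper, not part of the spec):

```rust
/// Clamp a read request against the file's logical end.
fn clamp_read_len(offset: i64, requested: usize, i_size: i64) -> usize {
    if offset >= i_size {
        return 0; // POSIX: read at/past EOF returns 0 bytes, not an error
    }
    // Never copy page cache bytes beyond i_size, even if cached pages
    // exist past it (e.g., from a concurrent truncate race).
    requested.min((i_size - offset) as usize)
}

fn main() {
    // 100-byte file, 4096-byte read from offset 90 -> 10 bytes.
    assert_eq!(clamp_read_len(90, 4096, 100), 10);
    // Read at EOF -> 0 bytes, never uninitialized page cache data.
    assert_eq!(clamp_read_len(100, 4096, 100), 0);
    // A read fully inside the file is not clamped.
    assert_eq!(clamp_read_len(0, 50, 100), 50);
}
```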

Tier isolation note: The VFS read path dispatches copy_to_user via a KABI return-buffer mechanism when the filesystem runs in Tier 1 (hardware memory domain isolated). The Tier 1 filesystem populates a shared bounce buffer (mapped into the driver's isolation domain as writable), and umka-core (Tier 0) performs the actual copy_to_user() after the KABI ring response returns. This ensures the Tier 1 driver never directly accesses userspace memory — all user memory writes go through Tier 0 copy_to_user(), which validates the destination address against the task's address space. For Tier 0 filesystems (e.g., the root filesystem driver), the copy_to_user() call is inlined directly — no bounce buffer needed.

Read path walkthrough (numbered trace, analogous to write path):

  1. sys_read() — extract fd, buf, count from SyscallContext; resolve OpenFile.
  2. VFS dispatch — call file.ops.read_iter() (or generic_file_read_iter for regular files).
  3. filemap_get_pages() — compute page offset, probe page cache XArray.
  4. Page cache hit — mark PageFlags::ACCESSED, return cached page.
  5. Page cache miss — trigger readahead (AddressSpaceOps::readahead()).
  6. wait_on_page_locked() — sleep until I/O completion clears PageFlags::LOCKED.
  7. copy_to_user() — copy page data to userspace buffer (bounce buffer if Tier 1).
  8. Return total bytes read to userspace via syscall return register.

Cross-references:

- Page cache structure and XArray: Section 4.4
- Readahead engine: Section 4.4
- DAX direct access path: Section 15.16

14.1.2.3.1.5 generic_file_write_iter() — Buffered Write Path
/// Generic buffered write implementation. Called from `FileOps::write_iter()`
/// for regular file writes. Iterates over the write range page by page, using
/// the filesystem's `write_begin()`/`write_end()` callbacks for per-page
/// preparation and commit.
///
/// Equivalent to Linux's `generic_file_write_iter()` + `generic_perform_write()`.
///
/// # Steps
///
/// 1. RLIMIT_FSIZE check (truncate write to limit).
/// 2. O_APPEND: acquire i_rwsem exclusive, seek to i_size.
/// 3. For each page in [pos, pos+count):
///    a. `write_begin(mapping, pos, len, flags)` — filesystem prepares page
///       (journal reservation, delayed allocation, partial page read-in).
///       Returns locked page reference.
///    b. `copy_from_user(page_addr + offset, buf, bytes)` — copy user data
///       into the page. Handles fault (short copy) via `bytes_copied`.
///    c. `write_end(mapping, pos, len, bytes_copied, page)` — filesystem
///       commits the write (mark dirty, update journal, update i_size).
///    d. `balance_dirty_pages_ratelimited(mapping)` — throttle the writer
///       if dirty page count exceeds the threshold, preventing memory
///       pressure from unbounded dirtying.
/// 4. Update `mtime` and `ctime` timestamps.
/// 5. Return total bytes written.
///
/// # Error handling
///
/// If `write_begin()` fails (e.g., ENOSPC from block allocation), the write
/// returns a short count or error. Pages already committed via `write_end()`
/// remain dirty and will be written back asynchronously. This matches Linux:
/// a partial write is not rolled back.
///
/// If `copy_from_user()` returns a short count (user page not present),
/// `write_end()` is called with the short `bytes_copied`. The filesystem
/// handles the partial page correctly (e.g., ext4 does not advance i_size
/// past the last successfully written byte).
fn generic_file_write_iter(
    file: &OpenFile,
    buf: &UserSlice,
    pos: &mut i64,
) -> Result<usize, IoError> {
    let inode = &*file.inode; // file.inode is a field (Arc<Inode>)
    let mapping = &inode.i_mapping;

    // Step 1: RLIMIT_FSIZE — truncate write to file size limit.
    let rlimit_fsize = current_task().process.rlimits.limits[RLIMIT_FSIZE].soft;
    let count = if *pos as u64 + buf.len() as u64 > rlimit_fsize && rlimit_fsize != u64::MAX {
        if *pos as u64 >= rlimit_fsize {
            // Nothing can be written at all: signal and fail. SIGXFSZ is
            // raised only in the EFBIG case, matching Linux
            // generic_write_check_limits() — a truncated partial write
            // does not signal.
            signal_send(current_task(), SIGXFSZ);
            return Err(IoError::EFBIG);
        }
        (rlimit_fsize - *pos as u64) as usize
    } else {
        buf.len()
    };

    // Step 2: O_APPEND — atomically seek to end of file.
    if file.f_flags.load(Relaxed) & O_APPEND != 0 {
        *pos = inode.i_size.load(Acquire) as i64;
    }

    // Step 3: Page-by-page write loop.
    let mut written: usize = 0;
    while written < count {
        let offset_in_page = (*pos as usize) % PAGE_SIZE;
        let bytes = core::cmp::min(PAGE_SIZE - offset_in_page, count - written);

        // 3a: Filesystem prepares page (alloc, journal, partial read-in).
        let page = mapping.ops.write_begin(mapping, *pos as u64, bytes, 0)?;

        // 3b: Copy user data from UserSlice into page.
        // UserSlice::read_at() performs copy_from_user internally,
        // handling SMAP/PAN page faults and returning the number of
        // bytes actually copied (a short count on user page fault).
        let page_kaddr = page_address(&page) + offset_in_page;
        let bytes_copied = buf.read_at(written, page_kaddr, bytes);

        // 3c: Filesystem commits write (dirty, journal, i_size update).
        // Use the return value — the filesystem may commit fewer bytes
        // than copied (block-aligned partial commit).
        let committed = mapping.ops.write_end(
            mapping, *pos as u64, bytes, bytes_copied, page,
        )?;

        written += committed;
        *pos += committed as i64;

        if committed < bytes {
            break; // Short write — stop.
        }

        // 3d: Dirty page throttling.
        balance_dirty_pages_ratelimited(mapping);
    }

    // Step 4: Update timestamps.
    inode.i_mtime.store(current_time_ns(), Release);
    inode.i_ctime.store(current_time_ns(), Release);

    // Step 5: Return bytes written.
    Ok(written)
}
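The step 1 limit check can be isolated as a pure function for clarity. This is a hypothetical helper (`clamp_to_fsize_limit` is not part of the VFS API as specified): it returns the clamped byte count, or an error when the write starts at or beyond the limit — the caller then raises SIGXFSZ and fails with EFBIG.

```rust
/// RLIMIT_FSIZE clamp: how many of `len` bytes starting at `pos`
/// may be written under `limit`? Err(()) means nothing is writable
/// (caller signals SIGXFSZ and returns EFBIG).
fn clamp_to_fsize_limit(pos: u64, len: u64, limit: u64) -> Result<u64, ()> {
    if limit == u64::MAX || pos.saturating_add(len) <= limit {
        return Ok(len); // wholly within the limit
    }
    if pos >= limit {
        return Err(()); // nothing writable: SIGXFSZ + EFBIG
    }
    Ok(limit - pos) // partial write up to the limit
}

fn main() {
    assert_eq!(clamp_to_fsize_limit(0, 100, u64::MAX), Ok(100)); // no limit
    assert_eq!(clamp_to_fsize_limit(90, 100, 128), Ok(38));      // truncated
    assert_eq!(clamp_to_fsize_limit(128, 1, 128), Err(()));      // EFBIG
}
```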
14.1.2.3.1.6 Inode (Index Node)
/// In-memory representation of a filesystem object (file, directory,
/// symlink, device, pipe, socket).
///
/// Each inode has a unique (superblock, inode_number) pair. The VFS
/// maintains an inode cache (icache) keyed by this pair to avoid
/// repeated disk reads.
///
/// **Lifecycle**: Created by `FileSystemOps::mount()` (root inode) or
/// `InodeOps::lookup()`/`InodeOps::create()` for other entries. Cached
/// in the icache. Freed when the last dentry referencing it is evicted
/// AND the on-disk link count drops to zero (unlinked).
///
/// **Concurrency**: Inode metadata is protected by `i_lock` (spinlock).
/// File data is protected by `i_rwsem` (read-write semaphore) — readers
/// (read, readdir) take shared; writers (write, truncate) take exclusive.
// kernel-internal, not KABI — no const_assert (contains Arc, RwSemaphore, dyn traits).
#[repr(C)]
pub struct Inode {
    /// Inode number. Unique within a superblock. Assigned by the filesystem.
    pub i_ino: u64,

    /// File type and permission mode (S_IFREG, S_IFDIR, etc. | rwxrwxrwx).
    pub i_mode: u32,

    /// Owner UID (kernel-internal representation, namespace-agnostic).
    /// Permission checks compare against `from_kuid(mnt_userns, i_uid)` to
    /// translate between the filesystem's user namespace and the calling
    /// process's user namespace ([Section 17.1](17-containers.md#namespace-architecture)).
    pub i_uid: u32,

    /// Owner GID (kernel-internal representation, namespace-agnostic).
    /// Permission checks use `from_kgid(mnt_userns, i_gid)` analogously.
    pub i_gid: u32,

    /// Hard link count. When this reaches 0 and no open file descriptors
    /// remain, the inode is freed (both in-memory and on-disk).
    pub i_nlink: AtomicU32,

    /// File size in bytes. AtomicI64 for compatibility with Linux loff_t
    /// semantics. For regular files and directories, i_size is always >= 0.
    /// For symlinks: length of the target path. Updated under `i_rwsem`.
    /// Consumers cast via `i_size.load(Acquire) as u64` after asserting
    /// non-negative: `debug_assert!(self.i_size.load(Acquire) >= 0)`.
    pub i_size: AtomicI64,

    /// Timestamps in nanoseconds since the epoch. AtomicU64 so hot
    /// paths can update them without taking i_lock — matching the
    /// `i_mtime.store(current_time_ns(), Release)` updates in
    /// generic_file_write_iter() step 4.
    pub i_atime: AtomicU64,
    pub i_mtime: AtomicU64,
    pub i_ctime: AtomicU64,

    /// Block size for this inode's filesystem (typically 4096).
    pub i_blksize: u32,

    /// Number of 512-byte blocks allocated on disk.
    pub i_blocks: u64,

    /// Device number (major:minor) for device special files (S_IFBLK/S_IFCHR).
    /// Uses `DevId` type with Linux MKDEV encoding: `(major << 20) | minor`.
    /// See [Section 14.5](#device-node-framework) for encoding details.
    /// `DevId { raw: 0 }` for regular files.
    pub i_rdev: DevId,

    /// Generation number. Incremented when an inode is recycled (same i_ino
    /// reused for a new file). Used by NFS file handles to detect stale handles.
    /// Constrained to u32 by NFS file handle wire format (nfs_fh generation
    /// field). At 10K inode recycled/sec, wraps after ~5 days — but
    /// stale-handle collision requires matching SAME i_ino AND i_generation
    /// (1-in-4B chance). NFS clients detect wrap mismatch via ESTALE.
    /// Matches Linux i_generation behavior.
    pub i_generation: u32,

    /// Per-inode spinlock. Protects metadata updates (mode, uid, gid, timestamps,
    /// nlink). Lock level: INODE_LOCK (level 15).
    pub i_lock: SpinLock<(), INODE_LOCK>,

    /// Read-write semaphore for file data. read()/readdir() take shared;
    /// write()/truncate() take exclusive.
    pub i_rwsem: RwSemaphore,

    /// Superblock this inode belongs to.
    pub i_sb: Arc<SuperBlock>,

    /// Inode operations (lookup, create, link, unlink, etc.).
    /// Set by the filesystem when the inode is created.
    pub i_op: &'static dyn InodeOps,

    /// File operations (read, write, mmap, ioctl, etc.).
    /// Set by the filesystem; used when opening this inode as a file.
    pub i_fop: &'static dyn FileOps,

    /// Filesystem-private data. Opaque pointer used by the filesystem
    /// driver to attach its own per-inode state (e.g., ext4_inode_info).
    /// SAFETY: Set to a filesystem-specific type (e.g., *mut Ext4InodeInfo)
    /// during inode initialization under I_NEW flag. The filesystem's
    /// evict_inode() method must cast back to the original type and free.
    /// Type safety is NOT enforced — callers must maintain the type
    /// invariant. Set once during inode init, read-only thereafter.
    /// Concurrent access is safe because i_private is immutable after
    /// I_NEW is cleared.
    ///
    /// `unsafe impl Send for Inode {}` — SAFETY: i_private is set once
    /// during inode initialization (under I_NEW). After I_NEW is cleared,
    /// i_private is read-only. All other fields of Inode are either atomic
    /// or protected by documented locks.
    /// `unsafe impl Sync for Inode {}` — same safety argument applies.
    pub i_private: *mut (),

    /// Per-inode LSM security blob. Allocated by `security_inode_alloc()`
    /// during `iget()`; freed by `security_inode_free()` during
    /// `evict_inode()`. See [Section 9.8](09-security.md#linux-security-module-framework) for
    /// LsmBlob definition and lifecycle.
    pub i_security: Option<NonNull<LsmBlob>>,

    /// Page cache address space for this inode's data.
    /// Contains the `PageCache` storage backend, writeback coordination,
    /// and error tracking. See the `AddressSpace` struct defined above.
    pub i_mapping: AddressSpace,

    /// Reference count. Managed by dentry references and open file handles.
    /// u32: bounded by concurrent references (max_files sysctl, default 8M).
    /// Same rationale as Dentry.d_refcount — hot-path, ILP32 penalty.
    pub i_refcount: AtomicU32,

    /// RCU callback head for deferred inode slab free. The inode struct
    /// cannot be freed immediately when the last reference is dropped
    /// because concurrent RCU readers (path lookup via `rcu_walk`,
    /// `find_inode()` under `rcu_read_lock()`) may still hold pointers
    /// to this inode. Instead, eviction step 5 calls
    /// `call_rcu(&inode.i_rcu, inode_free_rcu)` which defers the slab
    /// free until all RCU readers have exited their critical sections.
    /// `inode_free_rcu` calls `inode_slab.free(inode)` to return the
    /// object to the inode slab cache.
    pub i_rcu: RcuHead,

    /// Inode state bitflags (I_NEW, I_DIRTY_*, I_FREEING, ...). See
    /// `InodeStateFlags` below. Updated via atomic CAS/OR — no
    /// separate lock needed for flag manipulation.
    pub i_state: AtomicU32,

    // Inode cache membership flag is tracked in `i_state` (`I_HASHED` bit).
    // Lookup is via per-superblock XArray (`SuperBlock.inode_cache`).
    // No hash-table linkage node — XArray manages its own internal nodes.
    // (Regular comments, not doc comments: this note attaches to no field.)

    /// Superblock dirty inode list linkage.
    pub i_sb_list: IntrusiveListNode,

    // ---- Writeback integration ----

    /// Nanosecond timestamp when this inode was first dirtied (metadata or data).
    /// Set by `mark_inode_dirty()` on the first dirty transition (I_DIRTY_*
    /// flags going from 0 → non-zero). Used by the writeback subsystem to
    /// implement `dirty_expire_centisecs`: inodes dirtied longer than the
    /// threshold are prioritized for writeback. Zero when clean.
    pub dirtied_when: AtomicU64,

    /// Writeback association. Links this inode to a specific `BdiWriteback`
    /// instance (backing device writeback context). Set when the inode is
    /// first dirtied; cleared when writeback completes and the inode returns
    /// to clean state. `None` for inodes on pseudo-filesystems (procfs,
    /// sysfs) that have no backing device.
    pub i_wb: Option<Arc<BdiWriteback>>,

    /// BDI dirty inode list linkage. Links this inode into the per-BDI
    /// dirty list (`BdiWriteback.b_dirty`, `b_io`, or `b_more_io`)
    /// maintained by the writeback subsystem. The writeback thread walks
    /// these lists to find inodes that need flushing. The node is unlinked
    /// when the inode is no longer dirty.
    pub wb_link: IntrusiveListNode,
}
14.1.2.3.1.7 Inode Lifecycle and Page Cache Teardown

This subsection specifies the interaction between inode reference counting, page cache lifetime, and the eviction sequence. These paths are critical for correctness: a missed writeback silently loses data; a missed page free leaks memory; a race between eviction and page fault corrupts the page cache.

Inode state flags (i_state: AtomicU32):

/// Inode state bitflags stored in `Inode::i_state`.
///
/// Multiple flags may be set simultaneously. All updates use atomic
/// CAS (compare-and-swap) on `i_state` — no separate lock is required
/// for flag manipulation, but metadata fields protected by `i_lock`
/// must still be accessed under that lock.
pub mod InodeStateFlags {
    /// Inode is newly allocated; filesystem has not yet filled its
    /// on-disk fields. `unlock_new_inode()` clears this flag and
    /// wakes waiters.
    pub const I_NEW: u32         = 1 << 0;

    /// Inode metadata (mode, uid, timestamps, etc.) is dirty. Set by
    /// `mark_inode_dirty(I_DIRTY_SYNC)`.
    pub const I_DIRTY_SYNC: u32  = 1 << 1;

    /// Inode has dirty data-bearing metadata (file size, block
    /// mappings) that `fdatasync` must flush. Set by
    /// `mark_inode_dirty(I_DIRTY_DATASYNC)`.
    pub const I_DIRTY_DATASYNC: u32 = 1 << 2;

    /// Inode has dirty pages in its AddressSpace. Set when the first
    /// page is dirtied; cleared when writeback drains all dirty pages.
    pub const I_DIRTY_PAGES: u32 = 1 << 3;

    /// Inode is being freed by `evict_inode()`. While set,
    /// `mark_inode_dirty()` is a no-op (does not re-add the inode to
    /// dirty lists), and `find_inode()` skips this inode (returns
    /// `None`). Set atomically before eviction step 1; never cleared
    /// (the inode is freed).
    pub const I_FREEING: u32     = 1 << 4;

    /// Writeback of dirty timestamp fields is pending but not yet
    /// submitted. Used to coalesce frequent `atime` updates into a
    /// single writeback pass. `fdatasync` skips metadata flush when
    /// only this flag is set (timestamps are not data-relevant).
    pub const I_DIRTY_TIME: u32  = 1 << 5;

    /// Inode will be freed as soon as its reference count reaches
    /// zero. Set when `i_nlink` drops to 0 while file descriptors
    /// still hold references.
    pub const I_WILL_FREE: u32   = 1 << 6;

    /// Writeback in progress. Set by the writeback thread before issuing
    /// I/O for this inode's pages, cleared on writeback completion.
    /// Prevents concurrent writeback of the same inode (a second writeback
    /// request skips the inode while this flag is set). Also prevents the
    /// inode from being evicted from the inode cache during writeback.
    pub const I_WRITEBACK: u32   = 1 << 7;

    /// Inode is present in its superblock's inode cache XArray.
    /// Set by `inode_cache_insert()`, cleared when removed from the XArray
    /// (during eviction or `inode_cache_evict()`). Used by
    /// debug assertions to verify cache consistency.
    pub const I_HASHED: u32      = 1 << 8;

    /// Page cache of this inode was corrupted during a driver crash
    /// recovery cycle. Set by VFS ring crash recovery
    /// ([Section 14.3](#vfs-per-cpu-ring-extension)) when coherence checks fail on
    /// the inode's cached pages. While set, writeback skips this inode
    /// (writing corrupt data to disk would propagate the corruption).
    /// Applications reading from this inode receive `EIO`. Cleared
    /// only by unmount + fsck + remount, or by inode eviction.
    pub const I_PAGE_CACHE_CORRUPT: u32 = 1 << 9;

    /// Combined mask: any data or metadata is dirty (not including timestamps).
    pub const I_DIRTY: u32 = I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES;

    /// Combined mask: any dirty state including timestamps.
    pub const I_DIRTY_ALL: u32 = I_DIRTY | I_DIRTY_TIME;
}
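The CAS-based flag discipline can be sketched with the first-dirty-transition check that `mark_inode_dirty()` performs (it must stamp `dirtied_when` on the 0 → non-zero transition of the I_DIRTY_* bits and be a no-op under I_FREEING). A minimal sketch using std atomics in place of kernel types; the helper signature is illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const I_DIRTY_SYNC: u32 = 1 << 1;
const I_DIRTY_DATASYNC: u32 = 1 << 2;
const I_DIRTY_PAGES: u32 = 1 << 3;
const I_FREEING: u32 = 1 << 4;
const I_DIRTY: u32 = I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES;

/// Set `flags` in i_state. Returns true iff this call performed the
/// first dirty transition (caller then records dirtied_when and
/// queues the inode on its BDI dirty list). No-op during eviction.
fn mark_inode_dirty(i_state: &AtomicU32, flags: u32) -> bool {
    if i_state.load(Ordering::Acquire) & I_FREEING != 0 {
        return false; // eviction in progress — never re-dirty
    }
    // fetch_or returns the PREVIOUS value: the transition is "first"
    // exactly when no I_DIRTY_* bit was set before this call.
    let prev = i_state.fetch_or(flags, Ordering::AcqRel);
    prev & I_DIRTY == 0
}

fn main() {
    let st = AtomicU32::new(0);
    assert!(mark_inode_dirty(&st, I_DIRTY_SYNC));   // first transition
    assert!(!mark_inode_dirty(&st, I_DIRTY_PAGES)); // already dirty
    st.fetch_or(I_FREEING, Ordering::AcqRel);
    assert!(!mark_inode_dirty(&st, I_DIRTY_SYNC));  // no-op when freeing
}
```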

Inode reference counting:

// Inode is reference-counted via Inode::i_refcount (AtomicU32).
//
// References are held by:
//   1. Dentry cache       — each dentry pointing to this inode holds one ref.
//   2. Open file descriptors — each struct File holds one ref via its dentry.
//   3. Page cache (implicit) — the page cache does NOT hold an explicit
//      i_refcount reference. Instead, the page cache is torn down as part
//      of the eviction sequence (step 3) which runs only after all explicit
//      references are released. Pages hold a Weak<Inode> back-pointer via
//      AddressSpace::host.
//   4. Writeback thread   — holds a temporary ref during flush (acquired
//      when the inode is picked for writeback, released when writeback
//      completes or the inode is skipped).
//
// Eviction eligibility:
//   An inode is eligible for eviction when ALL of the following hold:
//     (a) i_refcount == 0 (no dentry, fd, or writeback refs), AND
//     (b) i_nlink == 0 (no on-disk hard links remain).
//   If i_refcount == 0 but i_nlink > 0, the inode is merely removed from
//   the active inode cache and placed on the LRU list for potential reuse.
//   It is only freed (and its pages torn down) when memory pressure evicts
//   it from the LRU, or when i_nlink later drops to 0 via unlink().

Eviction sequence (evict_inode()):

The eviction sequence is the sole path through which an inode and its page cache pages are freed. It is called by iput_final() when the last reference is released on an inode with i_nlink == 0.

evict_inode(inode):

  Step 0: Set I_FREEING in inode.i_state (atomic OR).
          From this point:
          - mark_inode_dirty(inode) is a no-op.
          - find_inode(sb, ino) skips this inode (returns None).
          No new references can be acquired.

  Step 1: Remove inode from the per-superblock XArray (icache).
          sb.inode_cache.erase(inode.i_ino)
          After this, no new lookups can find the inode. Concurrent
          lookups that raced with step 0 will see I_FREEING and retry.

  Step 2: If the inode has dirty pages or dirty metadata:
          writeback_single_inode(inode, WB_SYNC_ALL)
          This flushes ALL dirty pages via AddressSpaceOps::writepage()
          and waits for every writeback I/O to complete (blocks until
          inode.i_mapping.nrwriteback == 0).
          If any writeback I/O fails, the error is recorded in
          AddressSpace::wb_err (ErrSeq). The page is still freed in
          step 3 — errors are recorded, not retried.

  Step 3: truncate_inode_pages_final(&inode.i_mapping)
          If page_cache is None (DAX inode), skip this step entirely —
          there are no cached pages to tear down.
          This is the page cache teardown:
          a. Walk AddressSpace.page_cache (unwrapped XArray), removing all entries.
          b. For each page:
             - If PG_WRITEBACK is set: wait for I/O completion
               (spin on the page's wait queue — this should be rare
               after step 2 drained writeback, but handles races with
               async I/O completion).
             - If PG_DIRTY is set: BUG — step 2 should have flushed
               all dirty pages. In release builds, log a warning and
               proceed (the page will be freed without writeback).
             - Remove the page from the LRU list if present.
             - Release the page frame back to the buddy allocator
               (Section 4.2).
          c. Set AddressSpace.page_cache.nr_pages = 0.
          d. Set AddressSpace.page_cache.nr_dirty = 0.
          Ordering: step 3 MUST NOT begin until step 2 has fully
          completed (all writeback I/O acknowledged).

  Step 4: Call the filesystem's evict_inode() for on-disk cleanup:
          inode.i_op.evict_inode(inode.i_ino)
          The filesystem driver:
          - Frees disk blocks (updates block bitmap / extent tree).
          - Removes the inode from its on-disk inode table.
          - Commits a journal transaction if journaled.
          For KABI Tier 1/2 drivers, this is dispatched as
          VfsOpcode::EvictInode through the VFS ring buffer.

  Step 5: Remove inode from superblock's inode list:
          sb.s_inode_list_lock.lock()
          sb.s_inodes.remove(&inode.i_sb_list)
          sb.s_inode_list_lock.unlock()
          Defer inode slab free via RCU:
          call_rcu(&inode.i_rcu, inode_free_rcu)
          where inode_free_rcu returns the slab object to the
          inode slab cache. This is necessary because concurrent
          RCU readers (rcu_walk path lookup, find_inode under
          rcu_read_lock) may still hold pointers to this inode.
          Direct slab free would cause use-after-free.

InodeOps::evict_inode() method (extends the InodeOps trait defined above):

    /// Called during inode eviction (step 4) after the VFS has flushed
    /// all dirty pages and torn down the page cache. The filesystem
    /// must free on-disk resources (blocks, extent tree entries, inode
    /// bitmap bit) and commit any necessary journal transactions.
    ///
    /// `i_nlink` is guaranteed to be 0 when this is called for a real
    /// eviction (unlinked file). For pseudo-filesystems (tmpfs, procfs)
    /// that have no on-disk state, this method is a no-op.
    ///
    /// Must not fail — on-disk cleanup is best-effort. If the journal
    /// commit fails, the filesystem marks itself as needing fsck
    /// (sets the error flag in the superblock) and returns.
    fn evict_inode(&self, ino: InodeId);

VfsOpcode::EvictInode (extends the VfsOpcode enum):

    /// `InodeOps::evict_inode`. Inode eviction — free on-disk resources.
    /// Sent after the VFS has completed page cache teardown (step 3).
    EvictInode = 38,

With a corresponding VfsRequestArgs variant:

    /// `InodeOps::evict_inode`. No extra arguments — the inode number
    /// is in `VfsRequest::ino`.
    EvictInode {},

Truncate path (truncate_inode_pages_range(mapping, lstart, lend)):

Called by ftruncate(2) (shrinking a file), unlink(2) (via eviction), fallocate(FALLOC_FL_PUNCH_HOLE), and fallocate(FALLOC_FL_COLLAPSE_RANGE). This is a partial teardown — only pages in the specified range are removed, unlike truncate_inode_pages_final() which removes all pages.

truncate_inode_pages_range(mapping, lstart, lend):

  Precondition: caller holds inode.i_rwsem exclusive (write lock).
  This prevents concurrent page faults, reads, and writes from
  populating the range being truncated.

  1. Compute page-aligned range:
     start_index = lstart / PAGE_SIZE
     end_index   = lend / PAGE_SIZE  (or u64::MAX for "to end of file")

  2. If mapping.page_cache is None (DAX inode), skip steps 2–3 — DAX
     files have no cached pages. Proceed directly to step 4 (on-disk
     block deallocation).
     Otherwise, for each fully-covered page in
     mapping.page_cache[start_index..=end_index] (partial pages at the
     range boundaries are NOT removed here — they stay cached and are
     zeroed in steps 5–6):

     a. Acquire page lock (set `PageFlags::LOCKED` via atomic CAS; sleep if
        already locked by another thread — e.g., readahead).

     b. If `PageFlags::DIRTY` is set:
        cancel_dirty_page(page):
        - Clear `PageFlags::DIRTY` on the page.
        - Decrement mapping.page_cache.nr_dirty.
        - Decrement the BDI (backing device info) dirty page counter.
        - The page is NOT written back — truncated data is discarded.

     c. If `PageFlags::WRITEBACK` is set:
        wait_on_page_writeback(page):
        - Sleep until the bio completion handler clears `PageFlags::WRITEBACK`.
        - This handles the race where writeback was submitted before
          truncate acquired i_rwsem but has not yet completed.

     d. Invalidate shared futex keys on this page:
        futex_key_invalidate(page.phys_frame())
        ([Section 19.4](19-sysapi.md#futex-and-userspace-synchronization--physical-page-stability-for-shared-futex-keys)).
        Wakes all FUTEX_WAIT callers keyed on this physical frame
        with -EINVAL, since the backing page is about to be freed.

     e. Remove the page from mapping.page_cache (XArray delete).

     f. Remove the page from the LRU list (if present).

     g. Unlock the page (clear `PageFlags::LOCKED`).

     h. Release the page frame reference. If this is the last
        reference, the page is freed back to the buddy allocator
        (Section 4.2). If another mapping holds a reference (e.g.,
        a shared mmap), the page survives until that reference is
        dropped.

  3. Decrement mapping.page_cache.nr_pages by the count of removed
     pages (atomic subtract).

  4. Notify the filesystem for on-disk block deallocation:
     inode.i_op.truncate_range(ino, lstart, lend)
     The filesystem frees the corresponding disk blocks, updates
     extent trees, and journals the change. For KABI Tier 1/2
     drivers this is dispatched as the existing VfsOpcode::Truncate
     with the range encoded in the size field.

  5. If lstart is not page-aligned (partial page at the start of the
     range): zero the tail of the partial page from lstart to the
     next page boundary. The page remains in the cache with its
     leading portion intact. Mark it dirty so the zeroed region is
     written back.

  6. If lend is not page-aligned and lend != u64::MAX (partial page
     at the end): zero the head of the partial page from the page
     start to lend. Mark it dirty. (This case arises only with
     FALLOC_FL_PUNCH_HOLE; ftruncate always has lend = u64::MAX.)
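The partial-page arithmetic in steps 5–6 is easy to get wrong by one. A sketch of the step 5 case as a hypothetical pure helper (`head_partial_zero` is illustrative, not a specified VFS function), returning the (in-page offset, length) to zero in the page containing lstart:

```rust
const PAGE_SIZE: u64 = 4096;

/// Zero range inside the page containing `lstart`: from lstart to the
/// next page boundary, clamped to `lend` when the hole ends in the
/// same page. None when lstart is page-aligned (page fully covered by
/// the removal loop in step 2, nothing to zero in place).
fn head_partial_zero(lstart: u64, lend: u64) -> Option<(u64, u64)> {
    let intra = lstart % PAGE_SIZE;
    if intra == 0 {
        return None;
    }
    let page_end = lstart - intra + PAGE_SIZE;
    let zero_end = lend.min(page_end);
    Some((intra, zero_end - lstart))
}

fn main() {
    // ftruncate to 5000: zero bytes 904..4096 of the page (3192 bytes).
    assert_eq!(head_partial_zero(5000, u64::MAX), Some((904, 3192)));
    // Page-aligned truncate: no partial page to zero.
    assert_eq!(head_partial_zero(8192, u64::MAX), None);
    // Small PUNCH_HOLE inside one page: zero only 100 bytes.
    assert_eq!(head_partial_zero(4196, 4296), Some((100, 100)));
}
```

The step 6 (head of the lend page) case is symmetric: zero from the page start up to `lend % PAGE_SIZE`.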

Dirty page handling before eviction:

Dirty pages are NEVER silently discarded during eviction. The eviction sequence guarantees data integrity through the following invariants:

  1. writeback_single_inode() (eviction step 2) is called with WB_SYNC_ALL, which means:
     a. ALL dirty pages are submitted for writeback via AddressSpaceOps::writepage().
     b. The caller blocks until every submitted bio has completed (waits on each page's PG_WRITEBACK flag).
     c. If the inode is already being written back by the periodic writeback thread, WB_SYNC_ALL waits for that in-progress writeback to finish, then re-scans for any pages dirtied in the interim.

  2. If writeback I/O fails (disk error, transport error):
     a. The error code is recorded in AddressSpace::wb_err (ErrSeq counter) so that any concurrent fsync() on another fd for this inode will observe the error.
     b. The AS_EIO or AS_ENOSPC flag is set in AddressSpace::flags.
     c. The page is still freed in step 3 — there is no retry loop. The data is lost, but the error is recorded. This matches POSIX semantics: a subsequent fsync() returns -EIO exactly once per file descriptor.
     d. A kernel log message is emitted at KERN_ERR level: "VFS: writeback error during eviction of inode {sb}:{ino}: {errno}".

  3. truncate_inode_pages_final() (eviction step 3) asserts that no dirty pages remain (PG_DIRTY must be clear after step 2). In debug builds, a dirty page at this point triggers a BUG (logic error in the writeback path). In release builds, it is logged as a warning and the page is freed without writeback.
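The report-exactly-once-per-fd property of ErrSeq can be modeled compactly. This is a simplified sketch: each fd snapshots the error sequence, and check-and-advance reports an error iff the sequence moved since that fd's last check. The real errseq_t also packs the errno into the low bits of the counter; this model tracks only the sequence, and the names (`WbErr`, `check_and_advance`) are illustrative:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct WbErr(AtomicU64);

impl WbErr {
    fn new() -> Self {
        WbErr(AtomicU64::new(0))
    }

    /// Record a writeback error (called from bio completion).
    fn set(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }

    /// Per-fd check: returns true (report EIO) iff an error was
    /// recorded since this fd's snapshot, then advances the snapshot
    /// so the same error is reported only once on this fd.
    fn check_and_advance(&self, since: &mut u64) -> bool {
        let cur = self.0.load(Ordering::Acquire);
        let hit = cur != *since;
        *since = cur;
        hit
    }
}

fn main() {
    let wb = WbErr::new();
    let mut fd_a = 0u64; // snapshot taken at open()
    let mut fd_b = 0u64;
    wb.set(); // writeback error during eviction
    assert!(wb.check_and_advance(&mut fd_a));  // fsync on fd A: -EIO
    assert!(!wb.check_and_advance(&mut fd_a)); // second fsync: clean
    assert!(wb.check_and_advance(&mut fd_b));  // fd B still sees it once
}
```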

Race prevention:

The eviction sequence must be safe against concurrent operations:

| Race scenario | Prevention mechanism |
|---|---|
| find_inode() during eviction | I_FREEING flag checked by find_inode() — returns None, causing the caller to read a fresh inode from disk (or get ENOENT if unlinked). |
| mark_inode_dirty() during eviction | I_FREEING flag checked — mark_inode_dirty() is a no-op when I_FREEING is set. |
| Page fault on evicting inode | find_inode() returns None (inode removed from XArray in step 1). The fault handler reads a fresh inode from disk. If the file is unlinked (i_nlink == 0), the on-disk inode is already marked free and the read fails with ESTALE. |
| Writeback thread picks inode during eviction | The writeback thread checks I_FREEING before acquiring the inode ref. If the flag is set, the inode is skipped. If writeback is already in progress when I_FREEING is set, step 2 (WB_SYNC_ALL) waits for it to complete. |
| iget() racing with iput_final() | iget() loads the inode from sb.inode_cache XArray under RCU and increments i_refcount atomically. iput_final() only proceeds if the CAS i_refcount: 1 -> 0 succeeds. If iget() increments first, the CAS fails and eviction is aborted. |
| Truncate racing with readahead | truncate_inode_pages_range() holds i_rwsem exclusive. Readahead acquires i_rwsem shared. The rwsem serializes them. |
| Truncate racing with mmap read fault | The page fault handler acquires i_rwsem shared (via filemap_fault()). Truncate holds it exclusive. The fault retries after truncate completes and finds no page (returns SIGBUS if the fault address is beyond the new EOF). |
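The iget() side of the iget()/iput_final() race row can be sketched as follows. Under (modeled) RCU, iget() skips inodes with I_FREEING set and bumps the refcount only on live inodes, so a successful iget() makes iput_final()'s CAS 1 → 0 fail and eviction abort. Types are simplified stand-ins; `InodeStub` and `iget_ref` are illustrative names:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const I_FREEING: u32 = 1 << 4;

struct InodeStub {
    i_state: AtomicU32,
    i_refcount: AtomicU32,
}

/// Returns true if a reference was taken (caller now owns one ref);
/// false if the inode is dying — the caller must re-read from disk,
/// exactly as find_inode() returning None would force.
fn iget_ref(inode: &InodeStub) -> bool {
    if inode.i_state.load(Ordering::Acquire) & I_FREEING != 0 {
        return false; // dying — skip, do not resurrect
    }
    inode.i_refcount.fetch_add(1, Ordering::AcqRel);
    true
}

fn main() {
    let live = InodeStub {
        i_state: AtomicU32::new(0),
        i_refcount: AtomicU32::new(0),
    };
    assert!(iget_ref(&live));
    assert_eq!(live.i_refcount.load(Ordering::Acquire), 1);

    let dying = InodeStub {
        i_state: AtomicU32::new(I_FREEING),
        i_refcount: AtomicU32::new(0),
    };
    assert!(!iget_ref(&dying)); // skipped: caller re-reads from disk
}
```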

Cross-references:

- Writeback thread organization and writeback_single_inode(): Section 4.6
- Buddy allocator (page frame release): Section 4.2
- fsync end-to-end flow and ErrSeq semantics: this section (fsync / fdatasync End-to-End Flow)
- VFS ring buffer protocol (EvictInode dispatch): this section (VFS Ring Buffer Protocol)
- Page cache XArray structure: Section 4.4
- LRU lists and page reclaim: Section 4.2
- DLM-aware page cache invalidation on lock release/downgrade: Section 15.15

14.1.2.3.1.8 Inode Cache (icache)

All in-memory inodes are registered in a per-superblock inode cache (icache). The cache serves two purposes: (1) deduplication — ensuring that only one Inode instance exists for any given (superblock, inode_number) pair via per-superblock XArray lookup, and (2) memory management — tracking unreferenced inodes on a global LRU list for eviction under memory pressure.

/// Global inode cache LRU and shrinker state. Inode lookup is per-superblock
/// (via `SuperBlock.inode_cache: XArray<u64, Arc<Inode>>`), but the LRU list
/// for memory pressure eviction is global — the shrinker needs a single list
/// to scan across all filesystems.
///
/// **Design rationale**: Per-superblock XArray eliminates hash computation
/// (~15-25 cycles saved), provides O(1) guaranteed lookup (no collision
/// chains), and improves cache locality (per-filesystem working set stays
/// in its own radix tree). The caller always has the superblock from path
/// resolution (`dentry.d_sb`), so no extra lookup is needed.
///
/// **Singleton**: one global instance, initialized during VFS subsystem
/// init. Accessed via `inode_cache()` which returns `&'static InodeCache`.
pub struct InodeCache {
    /// LRU list of unreferenced inodes (i_refcount == 0, i_nlink > 0).
    ///
    /// Head = least recently used (oldest unreferenced inode).
    /// Tail = most recently unreferenced inode.
    ///
    /// An inode is added to the LRU tail when its `i_refcount` drops
    /// to 0 (via `iput()`) and `i_nlink > 0` (still has on-disk links).
    /// It is removed from the LRU when:
    ///   - A new reference is acquired (`iget()` / `find_inode()`).
    ///   - Memory pressure triggers LRU eviction.
    ///   - The inode is evicted due to `i_nlink` dropping to 0.
    ///
    /// Protected by the surrounding `SpinLock` (referred to as `lru_lock`
    /// throughout this section). Lock ordering: `lru_lock` is acquired
    /// AFTER any per-superblock XArray lock (never the reverse).
    pub lru: SpinLock<IntrusiveList<Inode>>,

    /// Number of inodes currently on the LRU list. Updated atomically
    /// on LRU insert/remove. Used by the shrinker to estimate
    /// reclaimable memory without taking `lru_lock`.
    pub lru_count: AtomicU64,

    /// High watermark — when `lru_count` exceeds this value, the
    /// background reclaim kthread is woken to proactively evict cold
    /// inodes. Set during VFS init based on total system memory:
    ///   reclaim_watermark = max(1024, total_pages / 256)
    /// This keeps the LRU from growing unboundedly on large-memory
    /// systems while ensuring small systems still cache a useful
    /// number of inodes.
    pub reclaim_watermark: u64,
}
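The watermark formula can be sanity-checked with a small sketch (the helper name is illustrative, not part of the VFS API):

```rust
/// Illustrative helper mirroring the formula in the field comment:
/// reclaim_watermark = max(1024, total_pages / 256).
fn compute_reclaim_watermark(total_pages: u64) -> u64 {
    (total_pages / 256).max(1024)
}
```

A 16 GiB system with 4 KiB pages (4,194,304 pages) gets a watermark of 16,384 inodes; a 256 MiB system (65,536 pages) falls below the floor and clamps to 1,024.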

Inode cache operations:

/// Look up an inode in the per-superblock XArray by inode number.
///
/// Uses RCU read-side protection — no locks, no atomic increments on
/// the fast path. The returned `Arc<Inode>` has its reference count
/// incremented (the caller owns one reference).
///
/// Returns `None` if:
///   - No inode with this ino exists in the superblock's inode cache.
///   - The inode has `I_FREEING` set (eviction in progress) — the
///     caller should re-read from disk or retry.
///
/// **Hot path**: called on every `open()`, `stat()`, and path lookup
/// that misses the dentry cache. O(1) XArray lookup with no hash
/// computation and no collision chains.
pub fn inode_cache_lookup(sb: &SuperBlock, ino: u64) -> Option<Arc<Inode>> {
    // rcu_read_lock() (implicit in XArray::load)
    // sb.inode_cache.load(ino)
    //   → if found and !(i_state & I_FREEING): increment i_refcount, return Some
    //   → if found and (i_state & I_FREEING): return None (skip dying inode)
    //   → if not found: return None
    // If the inode was on the LRU (i_refcount was 0), remove it from
    // the global LRU list (acquire lru_lock, unlink, decrement lru_count).
    // rcu_read_unlock() (implicit)
}

/// Insert an inode into the global cache.
///
/// Called after a filesystem driver has allocated and filled a new inode
/// (from `InodeOps::lookup()`, `InodeOps::create()`, or `read_inode()`).
/// The inode must have `I_NEW` set in `i_state` — this flag is cleared
/// by `unlock_new_inode()` after the filesystem finishes initialization.
///
/// **Preconditions**:
///   - `inode.i_ino` and `inode.i_sb` are set (valid superblock + ino).
///   - `inode.i_refcount >= 1` (the caller holds a reference).
///   - No existing entry for `inode.i_ino` in `inode.i_sb.inode_cache`
///     — callers must check `inode_cache_lookup()` first.
///
/// **Panics** in debug builds if a duplicate entry exists (logic error
/// in the filesystem driver). In release builds, the existing entry is
/// kept and the new `Arc<Inode>` is dropped.
pub fn inode_cache_insert(inode: Arc<Inode>) {
    // inode.i_sb.inode_cache.store(inode.i_ino, inode.clone())
    // Set i_state |= I_HASHED (indicates presence in icache).
}

/// Evict unreferenced inodes from the LRU list to reclaim memory.
///
/// Called by the memory shrinker when the slab allocator or page reclaim
/// needs to free memory. Evicts up to `count` inodes from the LRU head
/// (least recently used first).
///
/// For each evicted inode:
///   1. Remove from LRU list (under `lru_lock`).
///   2. Set `I_FREEING` in `i_state` (atomic OR).
///   3. Remove from the superblock's XArray (`sb.inode_cache.erase(ino)`).
///   4. If the inode has dirty pages or metadata: call
///      `writeback_single_inode(inode, WB_SYNC_NONE)` — best-effort
///      writeback. Dirty inodes at the LRU head are skipped and moved
///      to the tail (they will be written back by the periodic writeback
///      thread before being eligible for eviction again).
///   5. Call `evict_inode()` for page cache teardown and on-disk cleanup.
///   6. Defer inode slab free via `call_rcu(&inode.i_rcu, inode_free_rcu)`
///      (RCU readers may still hold pointers — see `Inode::i_rcu` field).
///
/// **Concurrency**: `iget()` racing with eviction is safe — `iget()`
/// checks `I_FREEING` after incrementing `i_refcount`. If `I_FREEING`
/// is set, `iget()` decrements `i_refcount` back and returns `None`.
///
/// **Scope**: `sb` constrains eviction to a single superblock. Pass
/// `None` to evict across all superblocks (global memory pressure).
pub fn inode_cache_evict(sb: Option<&SuperBlock>, count: usize) -> usize {
    // Returns the number of inodes actually evicted (may be < count
    // if fewer reclaimable inodes exist or dirty inodes are skipped).
}
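The `iget()`-vs-eviction race described in the concurrency note can be modeled in miniature — a simplified single-threaded sketch where `MiniInode` and `try_iget` are illustrative names:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const I_FREEING: u32 = 1 << 0;

struct MiniInode {
    state: AtomicU32,
    refcount: AtomicU32,
}

/// Sketch of the iget-vs-eviction race check: increment the refcount
/// first, then check I_FREEING; back out if eviction won the race.
fn try_iget(inode: &MiniInode) -> bool {
    inode.refcount.fetch_add(1, Ordering::AcqRel);
    if inode.state.load(Ordering::Acquire) & I_FREEING != 0 {
        // Eviction is in progress — undo our reference and report failure.
        inode.refcount.fetch_sub(1, Ordering::AcqRel);
        return false;
    }
    true
}
```

The increment-then-check ordering is what makes the race safe: once `I_FREEING` is set, no new reference can survive the check, so the evictor can proceed without re-scanning for late arrivals.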

Shrinker integration:

The inode cache registers a memory shrinker callback during VFS init. The shrinker is invoked by the page reclaim path (Section 4.2) when free memory falls below the low watermark.

/// Inode cache shrinker. Registered as a global shrinker during VFS init.
///
/// `count_objects()`: returns `inode_cache().lru_count` (fast — no lock).
/// `scan_objects()`: calls `inode_cache_evict(None, nr_to_scan)`.
///
/// Priority: shrinker priority is set to `SHRINKER_DEFAULT_PRIORITY` (0).
/// The inode cache shrinker runs alongside the dentry cache shrinker and
/// slab shrinkers. The reclaim path distributes scan pressure across all
/// registered shrinkers proportionally to their `count_objects()` return
/// value — larger caches receive more scan pressure.
pub static INODE_CACHE_SHRINKER: Shrinker = Shrinker {
    count_objects: inode_cache_count,
    scan_objects: inode_cache_scan,
    seeks: DEFAULT_SEEKS,  // 2 — moderate cost to recreate (disk read)
    flags: 0,
};
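The proportional pressure distribution can be sketched as follows (hypothetical helper; the real reclaim path also applies the `seeks` weighting and batches scan work):

```rust
/// Sketch: each shrinker receives scan pressure in proportion to its
/// count_objects() return value relative to the total across all
/// registered shrinkers. Rounds up so small caches still get scanned.
fn scan_share(shrinker_count: u64, total_count: u64, nr_to_scan: u64) -> u64 {
    if total_count == 0 {
        return 0;
    }
    (nr_to_scan * shrinker_count + total_count - 1) / total_count
}
```

A cache holding 1,000 of 4,000 total objects receives a quarter of the scan budget; a near-empty cache still receives at least one scan unit.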

Invariants:
- Every in-memory Inode with I_HASHED set is present in exactly one SuperBlock.inode_cache XArray entry. Removing from the XArray clears I_HASHED.
- An inode is on the LRU list if and only if i_refcount == 0 AND i_nlink > 0 AND I_FREEING is not set.
- lru_count always equals the number of nodes in lru (maintained atomically — incremented on LRU insert, decremented on LRU remove).
- Lock ordering: XArray internal lock -> lru_lock. Never acquire an XArray lock while holding lru_lock.

Cross-references:
- Inode struct and lifecycle: this section (Inode above)
- Eviction sequence: this section (Inode Lifecycle and Page Cache Teardown above)
- Page reclaim and shrinker framework: Section 4.2
- Dentry cache (parallel deduplication cache for path components): this section (Dentry above)
- Crash recovery inode iteration: this section (Dirty Page Handling on VFS Crash above)

14.1.2.3.1.9 SuperBlock
/// In-memory representation of a mounted filesystem.
///
/// Each mount creates one SuperBlock instance. The superblock holds
/// filesystem-level metadata (block size, feature flags, root inode)
/// and provides the interface between the VFS and the filesystem driver.
///
/// **Lifecycle**: Created by `FileSystemOps::mount()`. Destroyed by
/// `FileSystemOps::unmount()` after all references are released.
pub struct SuperBlock {
    /// Filesystem type identifier (e.g., "ext4", "xfs", "tmpfs").
    pub s_type: &'static str,

    /// Block size in bytes (typically 1024, 2048, or 4096).
    pub s_blocksize: u32,

    /// Log2 of block size (for bit-shift division).
    pub s_blocksize_bits: u8,

    /// Maximum file size supported by this filesystem.
    pub s_maxbytes: i64,

    /// Root dentry of the mounted filesystem.
    pub s_root: Arc<Dentry>,

    /// Filesystem operations (mount, unmount, statfs, sync).
    pub s_op: &'static dyn FileSystemOps,

    /// Mount flags (MS_RDONLY, MS_NOSUID, MS_NODEV, MS_DSM_COOPERATIVE, etc.).
    ///
    /// **`MS_DSM_COOPERATIVE` (bit 30)**: Set at mount time to indicate this
    /// filesystem participates in DSM cooperative caching. When set, the VFS
    /// page cache miss path (`filemap_get_pages`) uses the three-stage
    /// speculative protocol ([Section 6.11](06-dsm.md#dsm-distributed-page-cache)): (1) Bloom
    /// filter fast rejection (~15ns), (2) sequential access skip, (3)
    /// speculative parallel RDMA + NVMe for random misses. Zero overhead
    /// for ~95% of misses; ~5-10μs speedup for the rest. Enables
    /// cluster-wide page cache sharing for distributed filesystems.
    /// Only meaningful when DSM is enabled ([Section 6.11](06-dsm.md#dsm-distributed-page-cache)).
    /// Filesystems that do not set this flag use the standard local-only
    /// page cache lookup (no DSM overhead on the read path).
    /// AtomicU32: all defined flags fit in bits 0-30. If more than 31
    /// flags are needed, widen to AtomicU64.
    pub s_flags: AtomicU32,

    /// Filesystem-specific data. Opaque pointer used by the filesystem
    /// driver to attach its own per-superblock state (e.g., ext4_sb_info,
    /// xfs_mount).
    /// SAFETY: Set to a filesystem-specific type during mount(). The
    /// filesystem's kill_sb()/unmount() must cast back and free. Valid
    /// for the lifetime of the superblock. Set once during mount,
    /// read-only thereafter.
    pub s_fs_info: *mut (),

    /// UUID of the filesystem (if supported). Used for persistent mount
    /// identification and `/proc/mounts` output.
    pub s_uuid: [u8; 16],

    /// Per-superblock inode cache. Provides O(1) lookup by inode number
    /// via XArray (radix tree). RCU-protected reads on the hot path.
    ///
    /// Per-superblock XArray eliminates hash computation (~15-25 cycles
    /// saved vs. global RcuHashMap), provides O(1) guaranteed lookup
    /// (no collision chains), and improves cache locality (per-filesystem
    /// working set stays in its own radix tree). The caller always has the
    /// superblock from path resolution (`dentry.d_sb`), so no extra lookup
    /// is needed.
    /// On 32-bit architectures (ARMv7, PPC32), XArray stores u64 keys via
    /// a synthetic two-level index. Performance is O(1) for keys <= u32::MAX
    /// and O(log64(N)) for larger keys.
    pub inode_cache: XArray<u64, Arc<Inode>>,

    /// List of all inodes belonging to this superblock.
    /// Protected by `s_inode_list_lock`.
    pub s_inodes: IntrusiveList<Inode>,

    // Dirty inode tracking is handled exclusively by BdiWriteback.b_dirty
    // (see [Section 4.6](04-memory.md#writeback-subsystem--bdiwriteback)). No per-superblock dirty list.

    /// Per-superblock lock for inode list management.
    pub s_inode_list_lock: SpinLock<()>,

    /// Block device backing this filesystem (None for pseudo-filesystems
    /// like tmpfs, procfs, sysfs).
    pub s_bdev: Option<Arc<BlockDevice>>,

    /// Backing device info — controls writeback rate limiting, readahead
    /// window, and per-device dirty page accounting. Set during mount:
    ///
    /// - **Disk-backed filesystems** (ext4, XFS, Btrfs): points to the
    ///   `BackingDevInfo` owned by the underlying `BlockDevice`. The BDI
    ///   is shared if multiple mounts use the same block device (e.g.,
    ///   bind mounts). The writeback thread
    ///   ([Section 4.6](04-memory.md#writeback-subsystem--writeback-thread-organization)) uses `s_bdi` to
    ///   locate the inode dirty lists (`BdiWriteback.b_dirty`) and to
    ///   enforce per-device dirty page throttling via
    ///   `balance_dirty_pages()`.
    ///
    /// - **Network filesystems** (NFS, CIFS, 9P): allocate a dedicated
    ///   `BackingDevInfo` per superblock during mount. The BDI's `bdev`
    ///   field is `None`; writeback goes through the filesystem's own
    ///   network I/O path rather than the block layer.
    ///
    /// - **Pseudo-filesystems** (tmpfs, procfs, sysfs, devtmpfs): `None`.
    ///   These filesystems have no backing store and never produce dirty
    ///   pages that require writeback. The VFS skips all writeback and
    ///   dirty-throttling code paths when `s_bdi` is `None`.
    ///
    /// **Writeback chain**: inode → `i_sb` (`SuperBlock`) → `s_bdi`
    /// (`BackingDevInfo`) → `wb` (`BdiWriteback`). This chain is how the
    /// writeback subsystem discovers which device an inode's dirty pages
    /// should be flushed to. Breaking this chain (e.g., a disk-backed
    /// filesystem with `s_bdi = None`) would silently prevent writeback
    /// and leak dirty pages indefinitely.
    ///
    /// **Cross-reference**: `BackingDevInfo` struct definition and
    /// `BdiWriteback` internals are in [Section 4.6](04-memory.md#writeback-subsystem--writeback-thread-organization).
    pub s_bdi: Option<Arc<BackingDevInfo>>,

    /// Reference count. Held by Mount nodes and open file handles.
    pub s_refcount: AtomicU32,

    /// Freeze count. >0 means filesystem is frozen (FIFREEZE).
    pub s_freeze_count: AtomicU32,

    /// Per-superblock writer tracking for the freeze state machine.
    /// Implements the `SB_FREEZE_WRITE -> SB_FREEZE_PAGEFAULT -> SB_FREEZE_FS
    /// -> SB_FREEZE_COMPLETE` progression used by FIFREEZE/FITHAW ioctls and
    /// `do_remount()`.
    pub s_writers: SbWriters,

    /// Error handling behavior for this filesystem. Set at mount time from
    /// the `errors=` mount option (e.g., `errors=remount-ro`). Defaults to
    /// `FsErrorMode::Continue` unless the filesystem specifies otherwise.
    /// Consulted by the VFS error path when a filesystem reports an I/O or
    /// metadata corruption error.
    pub s_error_behavior: FsErrorMode,

    /// Per-mount VFS ring set for Tier 1 filesystem driver communication.
    /// Contains N ring pairs (request + response), where N is negotiated
    /// at mount time. `None` for pseudo-filesystems (tmpfs, procfs, sysfs)
    /// that run in Tier 0 and do not use ring-based dispatch.
    ///
    /// The ring_set is allocated at mount time and persists across driver
    /// crashes (rings are drained and reset, not recreated). The replacement
    /// driver re-binds to the existing ring_set during crash recovery
    /// Step U14 ([Section 14.3](#vfs-per-cpu-ring-extension--crash-recovery)).
    pub ring_set: Option<Box<VfsRingSet>>,

    /// Driver generation counter. Incremented each time the filesystem
    /// driver is (re)loaded after a crash (Step U16 of the unified VFS
    /// crash recovery sequence in [Section 14.3](#vfs-per-cpu-ring-extension--crash-recovery)).
    /// Initialized to 0 at first mount.
    ///
    /// Used for:
    /// - **Stale response detection**: VFS response consumer checks
    ///   `response.driver_generation == sb.driver_generation.load(Acquire)`
    ///   before processing any completion. Mismatched responses (from the
    ///   pre-crash driver instance) are discarded.
    /// - **Stale file handle detection**: `OpenFile.open_generation` is
    ///   compared against this field; mismatch returns `ENOTCONN`.
    ///
    /// One counter per superblock (not per ring) — all rings on a mount
    /// share the same generation. Persisted in Core (Tier 0) memory, so
    /// it survives Tier 1 driver crashes.
    ///
    /// **Longevity**: u64 at crash-per-minute rate (extreme) wraps after
    /// ~35 trillion years. No wrap handling needed.
    pub driver_generation: AtomicU64,
}

/// Freeze level for the superblock writer tracking state machine.
///
/// The freeze process advances through levels in order:
/// `Unfrozen -> Write -> PageFault -> Fs -> Complete`.
/// Each level blocks a broader category of operations. Thaw reverses
/// the progression. Used by FIFREEZE/FITHAW ioctls and `do_remount()`.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SbFreezeLevel {
    /// Not frozen. All operations permitted.
    Unfrozen = 0,
    /// Block new writes (`write`, `truncate`, `fallocate`).
    Write = 1,
    /// Block page faults (prevent new page-cache population via mmap writes).
    PageFault = 2,
    /// Block filesystem operations (metadata updates, journal commits).
    Fs = 3,
    /// Fully frozen. No filesystem activity. Safe for snapshots.
    Complete = 4,
}

/// Per-superblock writer tracking for the freeze state machine.
///
/// The `frozen` field records the current freeze level. The `writers`
/// array tracks the number of active writers at each of the three
/// blockable levels (Write, PageFault, Fs). Each counter uses
/// `PerCpuCounter` for scalable per-CPU increment/decrement on the
/// write path, with a global `sum()` used only during freeze/thaw
/// transitions to wait for all writers to drain.
///
/// **Freeze protocol**:
/// 1. Set `frozen` to the target level (e.g., `SbFreezeLevel::Write`).
/// 2. Wait for `writers[level - 1].sum() == 0` (all active writers at
///    that level have completed).
/// 3. Advance to the next level. Repeat until `Complete`.
///
/// **Thaw protocol**: Set `frozen` back to `Unfrozen` and wake all
/// waiters blocked by the freeze.
pub struct SbWriters {
    /// Current freeze level. Read with `Acquire`, written with `Release`.
    pub frozen: AtomicU8,
    /// Per-level writer counts: `[0]` = Write level, `[1]` = PageFault
    /// level, `[2]` = Fs level. `PerCpuCounter` for scalable hot-path
    /// increment (no cross-CPU contention on the write path).
    pub writers: [PerCpuCounter; 3],
    /// WaitQueue for threads blocked by a freeze. Woken on thaw.
    pub freeze_wait: WaitQueue,
}

Writer entry/exit protocol (sb_start_write / sb_end_write):

Every VFS operation that modifies the filesystem must bracket its work with the sb_start_write()/sb_end_write() protocol. This allows the freeze state machine to wait for all in-flight writers to drain before advancing to the next freeze level.

/// Attempt to enter the filesystem for a write-class operation.
/// Returns a guard that decrements the writer count on drop.
///
/// If the filesystem is frozen at or beyond the requested level,
/// the caller blocks on `sb.s_writers.freeze_wait` until thaw.
///
/// **Interruptibility**: If the calling task receives a fatal signal
/// while blocked, `sb_start_write` returns `Err(EINTR)`. The VFS
/// write path converts this to `-EINTR` for the syscall.
///
/// # Arguments
/// * `sb` — The superblock of the filesystem being written to.
/// * `level` — The freeze level this operation belongs to:
///   - `SbFreezeLevel::Write` (1): data writes (`write`, `truncate`, `fallocate`).
///   - `SbFreezeLevel::PageFault` (2): page fault writes (mmap dirty page).
///   - `SbFreezeLevel::Fs` (3): filesystem metadata updates (journal commit).
///
/// # Hot path
/// The common case (filesystem not frozen) is a single `Acquire` load
/// on `frozen` + one `PerCpuCounter::inc()` (~5-10 cycles total, no
/// contention). The slow path (freeze in progress) blocks.
pub fn sb_start_write(sb: &SuperBlock, level: SbFreezeLevel) -> Result<SbWriteGuard, Errno> {
    loop {
        let frozen = sb.s_writers.frozen.load(Ordering::Acquire);
        if frozen >= level as u8 {
            // Filesystem is frozen at or beyond our level. Block.
            sb.s_writers.freeze_wait.wait_interruptible()?;
            continue;
        }
        sb.s_writers.writers[(level as u8 - 1) as usize].inc();
        // Re-check after increment (the freeze may have advanced between
        // our check and increment — same race as Linux's percpu_rwsem).
        let frozen_after = sb.s_writers.frozen.load(Ordering::Acquire);
        if frozen_after >= level as u8 {
            sb.s_writers.writers[(level as u8 - 1) as usize].dec();
            sb.s_writers.freeze_wait.wait_interruptible()?;
            continue;
        }
        return Ok(SbWriteGuard { sb, level });
    }
}

/// RAII guard that decrements the writer count on drop.
pub struct SbWriteGuard<'a> {
    sb: &'a SuperBlock,
    level: SbFreezeLevel,
}

impl Drop for SbWriteGuard<'_> {
    fn drop(&mut self) {
        self.sb.s_writers.writers[(self.level as u8 - 1) as usize].dec();
    }
}
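The check / increment / re-check sequence in `sb_start_write()` can be modeled with plain atomics — a simplified sketch with no `PerCpuCounter`, no blocking, and a single writer level:

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

struct MiniWriters {
    frozen: AtomicU8,
    writers: AtomicU64,
}

/// Returns true if the writer entered; false if the caller would block
/// (filesystem frozen at or beyond its level). Mirrors the
/// check / increment / re-check sequence in sb_start_write().
fn try_enter(w: &MiniWriters, level: u8) -> bool {
    if w.frozen.load(Ordering::Acquire) >= level {
        return false;
    }
    w.writers.fetch_add(1, Ordering::AcqRel);
    // Re-check: the freeze may have advanced between check and increment.
    if w.frozen.load(Ordering::Acquire) >= level {
        w.writers.fetch_sub(1, Ordering::AcqRel);
        return false;
    }
    true
}
```

The re-check is the essential step: without it, a writer that raced past the first check could keep the counter elevated forever while the freezer waits for it to drain — the same race Linux's percpu_rwsem closes.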

/// Freeze a filesystem to the specified level.
///
/// Called by `FIFREEZE` ioctl and `do_remount()` (remount read-only).
/// Advances through freeze levels sequentially:
/// `Write -> PageFault -> Fs -> Complete`.
///
/// At each level:
/// 1. Set `frozen` to the target level (Release store).
/// 2. Wait for `writers[level-1].sum() == 0` (all writers at that level drained).
///
/// After reaching `Complete`, the filesystem is fully quiesced: no pending
/// writes, no page faults, no metadata updates. Safe for LVM snapshots,
/// device-mapper operations, and backup tools.
///
/// # Errors
/// Returns `Err(EBUSY)` if the filesystem is already frozen.
/// Returns `Err(EINTR)` if the wait is interrupted by a fatal signal.
pub fn freeze_super(sb: &SuperBlock) -> Result<(), Errno>;

/// Thaw a frozen filesystem.
///
/// Called by `FITHAW` ioctl. Sets `frozen` back to `Unfrozen` and
/// wakes all threads blocked on `freeze_wait`.
pub fn thaw_super(sb: &SuperBlock) -> Result<(), Errno>;

Freeze/thaw interaction with VFS ring protocol: When a filesystem is frozen at SbFreezeLevel::Write or beyond, the VFS ring dispatch path (Section 14.2) returns -EROFS for write-class operations (Write, Truncate, Fallocate, Create, Mkdir, Symlink, Link, Unlink, Rmdir, Rename, Mknod, SetAttr, SetXattr, RemoveXattr) without enqueuing them on the ring. Read-class operations (Read, Lookup, Getattr, Readdir, ReadPage, Readahead) continue to function during freeze. The Freeze and Thaw opcodes in the ring protocol (Section 14.2) are used to notify the filesystem driver to quiesce/resume its own internal state (journal, allocator).

/// Filesystem error handling behavior (set via mount option `errors=`).
///
/// When a filesystem encounters an I/O error or metadata corruption,
/// the VFS consults `SuperBlock.s_error_behavior` to determine the
/// system-level response. This is separate from the error returned to
/// the calling application (which always gets an appropriate errno).
///
/// **Linux compatibility**: The `errors=` mount option values and their
/// semantics match Linux exactly. The numeric values match the ext4
/// `s_errors` on-disk field for ext4 compatibility.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum FsErrorMode {
    /// Continue operation after error (default for ext4).
    /// The error is reported to the application via errno, but the
    /// filesystem remains mounted read-write. Suitable for non-critical
    /// filesystems where availability is preferred over safety.
    Continue = 0,

    /// Remount filesystem read-only on error.
    /// This is the safest non-destructive option: it prevents further
    /// data corruption while keeping existing data readable.
    ///
    /// **Remount-ro procedure**:
    /// 1. Set `SuperBlock.s_flags |= MS_RDONLY` (atomic OR).
    /// 2. Flush all dirty pages via `sync_fs(sb, wait=true)`. Pages that
    ///    fail to flush are marked with `PG_ERROR` and left in the cache
    ///    (they cannot be written back to a read-only filesystem).
    /// 3. Reject all future write operations (`write`, `truncate`,
    ///    `fallocate`, `rename`, `unlink`, `mkdir`, etc.) with `EROFS`.
    /// 4. Log the error and the remount event to the kernel log and
    ///    the fault management subsystem ([Section 20.1](20-observability.md#fault-management-architecture)).
    /// 5. Existing read-only file descriptors continue to work.
    ///    Existing read-write file descriptors remain open but all
    ///    write operations return `EROFS`.
    RemountRo = 1,

    /// Kernel panic on filesystem error.
    /// Used for critical root filesystems where continuing with a
    /// corrupted filesystem is worse than rebooting. This should only
    /// be set on the root filesystem in environments with automatic
    /// reboot and fsck (e.g., servers with watchdog timers).
    Panic = 2,
}

Writeback error integration: The writeback subsystem calls check_fs_error_mode(sb) when a page writeback I/O completes with an error. This function inspects sb.s_error_behavior and takes the configured action (log, remount-ro, or panic). Without this hook, errors=remount-ro would be meaningless for asynchronous writeback errors — the writeback subsystem would mark pages PG_ERROR but never trigger the VFS-level error policy. See Section 4.6 for the writeback I/O completion path that invokes this check.
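The dispatch from error mode to system-level action might look like the following sketch (the mode enum mirrors the spec's `FsErrorMode`; the action enum and the already-read-only short-circuit are illustrative assumptions):

```rust
/// Mirrors the spec's FsErrorMode (errors= mount option values).
#[derive(Clone, Copy)]
enum FsErrorMode {
    Continue,
    RemountRo,
    Panic,
}

/// Illustrative system-level action chosen by check_fs_error_mode().
#[derive(Debug, PartialEq)]
enum ErrorAction {
    LogOnly,
    RemountReadOnly,
    KernelPanic,
}

fn error_action(mode: FsErrorMode, already_readonly: bool) -> ErrorAction {
    match mode {
        FsErrorMode::Continue => ErrorAction::LogOnly,
        // Assumption: a repeat error on an already read-only fs only logs.
        FsErrorMode::RemountRo if already_readonly => ErrorAction::LogOnly,
        FsErrorMode::RemountRo => ErrorAction::RemountReadOnly,
        FsErrorMode::Panic => ErrorAction::KernelPanic,
    }
}
```

In all cases the calling application still receives its errno; this dispatch decides only the system-level response (log, remount-ro, or panic), matching the separation described above.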

See Section 14.2 for the VFS ring buffer cross-domain dispatch protocol (request/response ring pairs, opcodes, marshaling, timeout, cancellation, crash recovery).

See Section 14.4 for the fsync/fdatasync end-to-end flow and Copy-on-Write / Redirect-on-Write infrastructure (WriteMode, ExtentSharingOps, shared-extent page cache, reflink ioctls, CoW-aware writeback, free space accounting).

14.1.2.4 End-to-End Write Path: Userspace to Hardware

This walkthrough traces a single buffered write(2) from a userspace application through every kernel layer to stable media. It serves as a cross-reference map connecting the VFS, page cache, writeback, block layer, and device driver specifications.

1. USERSPACE: write(fd, buf, len)
   → Syscall entry ([Section 19.1](19-sysapi.md#syscall-interface))
   → umka-sysapi resolves fd to OpenFile

2. VFS DISPATCH: vfs_write(file, buf, len, &pos)
   → File operations dispatch via file.f_op.write_iter
   → fdget_pos() acquires position lock (§14.5 above)
   → Calls generic_file_write_iter() for regular files

3. PAGE CACHE WRITE: generic_file_write_iter()
   → pagecache_get_page(mapping, pgoff, FGP_WRITEBEGIN)
     → Page cache lookup via XArray ([Section 4.4](04-memory.md#page-cache))
     → On miss: allocate page, insert into XArray
   → copy_from_user(page_addr + offset, buf, len)
     → Data copied from userspace buffer to page cache page
   → set_page_dirty(page) → marks PG_DIRTY
   → vfs_dirty_extent_reserve() ([Section 14.4](#vfs-fsync-and-cow))
     → Reserves writeback intent in the dirty extent tracker

4. FILESYSTEM NOTIFICATION: .write_iter() callback
   → For ext4: ext4_write_begin() / ext4_write_end()
     → Journal reservation (JBD2) for metadata
     → Delayed allocation: logical blocks reserved, physical not yet assigned
   → For XFS: xfs_file_write_iter() → iomap framework
   → For Btrfs: CoW reservation via extent tree
   **Tier boundary**: For Tier 1 filesystems, write_begin/write_end are
   dispatched via KABI ring. The Tier 1 filesystem's write_end() response
   includes `dirty: bool`. The Tier 0 VFS ring consumer calls
   set_page_dirty() on behalf of the filesystem; the filesystem never
   calls set_page_dirty() directly across the domain boundary.
   ([Section 12.8](12-kabi.md#kabi-domain-runtime))

5. WRITEBACK (ASYNCHRONOUS): triggered by dirty ratio threshold,
   periodic writeback timer (default 5s), or explicit fsync()
   → Writeback thread ([Section 4.6](04-memory.md#writeback-subsystem--writeback-thread-organization))
     picks dirty inode from per-bdi writeback list
   → writeback_single_inode() → .writepages() or .writepage()
   → Filesystem allocates physical blocks (delayed allocation commit):
     - ext4: ext4_writepages() → ext4_map_blocks() assigns physical extents
     - XFS: xfs_vm_writepages() → xfs_bmapi_write()
     - Btrfs: extent_writepages() → CoW extent allocation
   → vfs_dirty_extent_commit() binds physical block address to intent

6. BIO CONSTRUCTION: filesystem builds Bio from dirty pages
   → Bio { op: Write, start_lba, segments: [page, ...] }
   → Sets BioFlags: FUA for journal commits, PERSISTENT for critical I/O
   → ([Section 15.2](15-storage.md#block-io-and-volume-management--bio-crash-recovery))

7. BLOCK LAYER: bio_submit()
   → Cgroup I/O throttling check ([Section 15.2](15-storage.md#block-io-and-volume-management--cgroup-io-throttling))
   → I/O scheduler path (if attached):
     bio_to_io_request() → scheduler merges, reorders
     ([Section 15.18](15-storage.md#io-priority-and-scheduling))
   → Direct dispatch path (NVMe multi-queue): bypass scheduler

8. DEVICE DRIVER: BlockDeviceOps::submit_bio()
   → Tier 0: direct function call in kernel context
   → Tier 1: KABI ring dispatch through DomainRingBuffer
     ([Section 12.3](12-kabi.md#kabi-bilateral-capability-exchange))
   → Tier 2: IPC message to userspace driver process

9. HARDWARE DMA: driver programs NVMe SQ / AHCI command slot / virtio desc
   → DMA from page cache page to device
   → DmaDevice::dma_map_sgl() creates IOMMU mapping
     ([Section 4.14](04-memory.md#dma-subsystem))
   → Device writes data to stable media

10. COMPLETION: device signals IRQ → driver processes CQ entry
    → bio_complete() invokes bio.end_io callback (interrupt context)
    → Deferred to blk-io workqueue for page cache updates:
      - Clear PG_WRITEBACK on the page
      - Wake fsync() waiters if applicable
      - Update AddressSpace.wb_err on error

Design note — write() and async writeback visibility: write() returns success as soon as data is in the page cache (step 3 above). Asynchronous writeback failures (step 10) are NOT visible to write() — they are visible only to fsync() via the ErrSeq mechanism (Section 15.1). This is intentional and Linux-compatible: write() is a buffer-fill operation, not a durability guarantee. Applications that need durability must call fsync() or use O_SYNC/O_DSYNC. The ErrSeq mechanism ensures each open file descriptor sees each writeback error exactly once on the next fsync() call — the fd snapshots AddressSpace::wb_err at open() time (file.f_wb_err), and fsync() compares the snapshot to the current wb_err generation to detect new errors. If the application never calls fsync(), writeback errors are silently absorbed (the data is lost, but the application was not requesting durability guarantees). This matches POSIX semantics and Linux 4.13+ behavior (errseq_t).
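The snapshot-and-compare behavior of the ErrSeq mechanism can be modeled in miniature. This is a hedged sketch: the real errseq_t packs an errno value and a "seen" bit into one word; this model tracks only the generation counter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified stand-in for AddressSpace::wb_err (generation only).
struct WbErr(AtomicU64);

impl WbErr {
    /// Writeback completion path: record a new error generation.
    fn record_error(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }

    /// fsync()-style check: reports true exactly once per new error,
    /// advancing the fd's snapshot to the current generation.
    fn check_and_advance(&self, snapshot: &mut u64) -> bool {
        let cur = self.0.load(Ordering::Acquire);
        let new_error = cur != *snapshot;
        *snapshot = cur;
        new_error
    }
}
```

Each open file descriptor carries its own snapshot (taken at `open()`), so two fds on the same file each see the same writeback error exactly once on their next `fsync()`.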

14.1.2.5 O_SYNC / O_DSYNC Write Path

When a file is opened with O_SYNC or O_DSYNC, write() must not return until the data (and possibly metadata) is on stable storage. This guarantee is enforced after step 3 (page cache write) completes, before returning to userspace.

The synchronous write path reuses the normal page-cache write path above — data is still copied to a page cache page and the page is marked dirty. The difference is that the caller blocks on writeback before returning, instead of deferring to the asynchronous writeback thread. This design keeps the page cache as the single source of truth for dirty tracking and avoids duplicating writeback logic.

O_SYNC/O_DSYNC branch (inserted between steps 3 and 4 above):

3a. SYNC CHECK: after set_page_dirty() and vfs_dirty_extent_reserve():
    if file.f_flags & (O_SYNC | O_DSYNC) != 0:
        // Flush the dirty range we just wrote to stable storage.
        err = filemap_write_and_wait_range(
            mapping,
            offset,           // start of the write
            offset + len - 1, // end of the write (inclusive)
        )
        // filemap_write_and_wait_range():
        //   1. Calls writeback_range(mapping, start, end) which triggers
        //      AddressSpaceOps::writepages() for the dirty pages in [start, end].
        //   2. Waits for PG_WRITEBACK to clear on all pages in the range
        //      (blocks until device DMA + completion for those pages).
        //   3. Returns the first error from wb_err in the range, if any.
        if err != 0:
            // Writeback failed — propagate error to write() caller.
            // The page remains in the page cache (still dirty or errored).
            // AddressSpace.wb_err records the error for subsequent fsync().
            return Err(err)

        // O_SYNC: data + ALL metadata must be stable.
        // O_DSYNC: data must be stable; metadata only if file size changed.
        if file.f_flags & O_SYNC != 0:
            // Full sync: flush inode metadata (timestamps, size, blocks).
            err = vfs_fsync_metadata(inode)
            if err != 0:
                return Err(err)
        else:
            // O_DSYNC: flush metadata only if i_size changed (data integrity).
            // File size changes affect data recoverability — a crash after
            // extending the file but before updating i_size on disk would lose
            // the new data (it would be beyond the on-disk EOF). Timestamp
            // updates (mtime, ctime) are NOT required for data integrity.
            if offset + len > old_i_size:
                err = vfs_fsync_metadata(inode)
                if err != 0:
                    return Err(err)

vfs_fsync_metadata(inode) calls InodeOps::write_inode(inode, WB_SYNC_ALL) to flush the inode's on-disk metadata. For journaling filesystems (ext4, XFS), this commits the journal transaction containing the inode update. For non-journaling filesystems, this writes the inode block and issues a cache flush.

Performance: O_SYNC adds the device write latency to every write() call (~10-15 us on NVMe, ~3-8 ms on SATA). This is inherent — the user requested durability. The page cache write (step 3) remains ~1-5 us; the additional cost is entirely device I/O.

Interaction with writeback: The synchronous flush in step 3a writes back the same dirty pages that the asynchronous writeback thread (step 5) would eventually process. After filemap_write_and_wait_range() completes, the pages are clean (PG_DIRTY cleared), so the writeback thread skips them. No double-write occurs. Dirty page accounting (AddressSpace.page_cache.nr_dirty, BDI dirty counters) is correctly decremented by the writeback completion path, regardless of whether writeback was triggered synchronously or asynchronously.

Error semantics: If the device reports a write error, the error is:

1. Stored in AddressSpace.wb_err (for subsequent fsync() error reporting).
2. Returned from write() to the caller (the write "failed" from the durability perspective, even though the data is in the page cache).
3. The page may remain dirty in the cache (for retry on next writeback attempt).
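The "report once per fsync observer" behavior of wb_err can be modeled as a sequence counter, similar in spirit to Linux's errseq_t. The following standalone sketch uses hypothetical names (`WbErr`, `record`, `check_and_advance`); the real structure is the AddressSpace.wb_err field described above.

```rust
// Toy model (illustrative only) of AddressSpace.wb_err error retention:
// a writeback error is recorded once and reported to each subsequent
// fsync() caller that has not yet observed it, then acknowledged.

#[derive(Default)]
struct WbErr {
    seq: u64, // increments each time a new error is recorded
    err: i32, // last error code (0 = none)
}

impl WbErr {
    /// Record a writeback error (called from the bio completion path).
    fn record(&mut self, err: i32) {
        self.err = err;
        self.seq += 1;
    }

    /// fsync() path: report the error if this caller's cursor has not
    /// seen it yet, and advance the cursor so it is reported only once.
    fn check_and_advance(&self, cursor: &mut u64) -> i32 {
        if *cursor < self.seq {
            *cursor = self.seq;
            self.err
        } else {
            0
        }
    }
}

fn main() {
    let mut wb = WbErr::default();
    let mut fd_cursor = 0u64; // per-open-file error cursor
    assert_eq!(wb.check_and_advance(&mut fd_cursor), 0); // no error yet
    wb.record(-5); // device reported EIO during writeback
    assert_eq!(wb.check_and_advance(&mut fd_cursor), -5); // first fsync sees it
    assert_eq!(wb.check_and_advance(&mut fd_cursor), 0);  // second fsync: clean
}
```

The per-file cursor is what lets two processes with independent open files each observe the same writeback error exactly once.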

14.1.2.6 O_DIRECT (Direct I/O) Path

O_DIRECT bypasses the page cache entirely: data is transferred via DMA directly between the user buffer and the block device. This eliminates double-copying (user buffer to page cache to device) and avoids polluting the page cache with streaming I/O data that will never be re-read.

Alignment requirements: O_DIRECT requires sector-aligned file offset and transfer length. The required alignment is filesystem-dependent and reported via statx() (dio_offset_align, dio_mem_align fields in UmkaStatx). Typical values:

  • ext4/XFS on NVMe: 512 bytes (sector size)
  • ext4 with bigalloc: filesystem block size (e.g., 4096)
  • Btrfs: sector size (4096)

Unaligned offset or length returns EINVAL from write() / read().

The user buffer must also be aligned to dio_mem_align (typically 512 bytes). This ensures the DMA controller can transfer directly to/from the buffer without bounce buffering.
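The validation above amounts to three modulus checks. A minimal standalone sketch, assuming power-of-two alignments (`check_dio_alignment` and the errno constant are hypothetical names; the real path would take the alignments from the filesystem's statx data):

```rust
// Illustrative O_DIRECT alignment validation, as described above.
const EINVAL: i32 = 22;

/// Validate the file offset/length against dio_offset_align and the
/// user buffer address against dio_mem_align.
fn check_dio_alignment(
    offset: u64,
    len: u64,
    buf_addr: u64,
    dio_offset_align: u64,
    dio_mem_align: u64,
) -> Result<(), i32> {
    if offset % dio_offset_align != 0 || len % dio_offset_align != 0 {
        return Err(EINVAL); // unaligned file offset or transfer length
    }
    if buf_addr % dio_mem_align != 0 {
        return Err(EINVAL); // user buffer cannot be a direct DMA target
    }
    Ok(())
}

fn main() {
    // 4 KiB write at offset 0 from a 512-byte-aligned buffer: OK.
    assert_eq!(check_dio_alignment(0, 4096, 0x1000, 512, 512), Ok(()));
    // Offset 100 is not sector-aligned: EINVAL.
    assert_eq!(check_dio_alignment(100, 4096, 0x1000, 512, 512), Err(EINVAL));
}
```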

/// Direct I/O operations. Returned by `AddressSpaceOps::direct_io()` for
/// filesystems that support O_DIRECT. The VFS calls these methods instead
/// of the page-cache path when `FMODE_DIRECT` is set on the file.
pub trait DirectIoOps: Send + Sync {
    /// Perform a direct read from the block device into the user buffer.
    ///
    /// `file`: the open file (provides inode, block mapping).
    /// `buf`: user-space destination buffer (must be dio_mem_align-aligned).
    /// `offset`: file offset (must be dio_offset_align-aligned).
    /// `len`: number of bytes to read.
    ///
    /// Returns the number of bytes actually read (may be less than `len`
    /// on EOF or partial DMA completion).
    ///
    /// The implementation must:
    /// 1. Map the file offset range to block device LBAs via the filesystem's
    ///    extent/block map.
    /// 2. Pin the user buffer pages via `get_user_pages(buf, len, WRITE)`.
    /// 3. Build a Bio with the pinned user pages as DMA targets.
    /// 4. Submit the Bio and wait for completion.
    /// 5. Unpin the user pages on completion.
    fn direct_read(
        &self,
        file: &OpenFile,
        buf: UserSliceMut,
        offset: u64,
        len: u64,
    ) -> Result<u64, IoError>;

    /// Perform a direct write from the user buffer to the block device.
    ///
    /// `file`: the open file.
    /// `buf`: user-space source buffer (must be dio_mem_align-aligned).
    /// `offset`: file offset (must be dio_offset_align-aligned).
    /// `len`: number of bytes to write.
    ///
    /// Returns the number of bytes actually written (short write on error).
    ///
    /// The implementation must:
    /// 1. Allocate blocks if writing beyond current extents (fallocate or
    ///    delayed allocation commit).
    /// 2. Pin the user buffer pages via `get_user_pages(buf, len, READ)`.
    /// 3. Build a Bio with the pinned user pages as DMA sources.
    /// 4. Submit the Bio and wait for completion.
    /// 5. Update i_size if the write extended the file.
    /// 6. Unpin the user pages on completion.
    fn direct_write(
        &self,
        file: &OpenFile,
        buf: UserSlice,
        offset: u64,
        len: u64,
    ) -> Result<u64, IoError>;
}

Cache coherence protocol: O_DIRECT and buffered I/O on the same file must not produce stale reads or lost writes. The VFS enforces coherence through serialization and invalidation:

Direct I/O cache coherence:

1. SERIALIZATION via i_rwsem:
   - Buffered read/write: acquires i_rwsem SHARED.
   - Direct I/O read/write: acquires i_rwsem EXCLUSIVE.
   This prevents concurrent DIO and buffered I/O on the same inode.
   A DIO write cannot race with a buffered read that might see stale
   page cache data, and a DIO read cannot race with a buffered write
   that might dirty a page cache page covering the same range.

2. BEFORE DIO READ — invalidate cached pages in the range:
   invalidate_inode_pages2_range(mapping, offset, offset + len - 1)
   → Removes all page cache pages covering [offset, offset+len).
   → If any page is dirty, it is written back first (to avoid data loss),
     then removed from the cache.
   → Subsequent buffered reads will re-fetch from disk (seeing DIO writes).

3. BEFORE DIO WRITE — flush + invalidate:
   filemap_write_and_wait_range(mapping, offset, offset + len - 1)
   → Writes back any dirty pages in the range to disk.
   invalidate_inode_pages2_range(mapping, offset, offset + len - 1)
   → Removes the pages from the cache.
   → This ensures the DIO write does not race with dirty page writeback
     (which would overwrite the DIO data with stale page cache contents).

4. AFTER DIO WRITE — no page cache update:
   The written data is on disk. Page cache pages for this range were
   invalidated in step 3 and are not re-populated. Subsequent buffered
   reads will fetch the new data from disk (cache miss → readpage).
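The flush-then-invalidate ordering of steps 2-3 can be modeled with a toy page cache. All names here (`Model`, `flush_and_invalidate`, `dio_write`) are illustrative, not kernel types; the point is that a dirty page is written back before it is removed, and the DIO write lands afterwards, so stale cache contents can never overwrite DIO data.

```rust
// Toy model of the DIO-write coherence steps above.
use std::collections::HashMap;

const PAGE: u64 = 4096;

struct Model {
    cache: HashMap<u64, (Vec<u8>, bool)>, // page index -> (data, dirty)
    disk: HashMap<u64, Vec<u8>>,          // page index -> data on "disk"
}

impl Model {
    /// Steps 2+3: write back dirty cached pages covering [start, end),
    /// then drop them from the cache so later writeback cannot clobber
    /// the DIO data and later buffered reads must go to disk.
    fn flush_and_invalidate(&mut self, start: u64, end: u64) {
        for idx in start / PAGE..(end + PAGE - 1) / PAGE {
            if let Some((data, dirty)) = self.cache.remove(&idx) {
                if dirty {
                    self.disk.insert(idx, data); // writeback before invalidate
                }
            }
        }
    }

    /// DIO write: flush+invalidate first, then bypass the cache entirely
    /// (step 4: no page cache re-population).
    fn dio_write(&mut self, offset: u64, data: Vec<u8>) {
        let end = offset + data.len() as u64;
        self.flush_and_invalidate(offset, end);
        self.disk.insert(offset / PAGE, data);
    }
}

fn main() {
    let mut m = Model { cache: HashMap::new(), disk: HashMap::new() };
    m.cache.insert(0, (b"old".to_vec(), true)); // dirty page over the range
    m.dio_write(0, b"new".to_vec());
    assert_eq!(m.disk.get(&0).unwrap(), b"new"); // DIO data wins
    assert!(m.cache.is_empty()); // no stale page left to serve reads
}
```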

VMA integration: O_DIRECT does NOT allocate page cache pages. The user buffer pages are pinned in physical memory via get_user_pages() for the duration of the DMA transfer and unpinned on completion. This is fundamentally different from buffered I/O, where data passes through kernel-owned page cache pages.

Fallback: If AddressSpaceOps::direct_io() returns None (filesystem does not support DIO), the VFS silently falls back to the buffered I/O path. The FMODE_DIRECT flag is not set on the OpenFile, and all reads/writes use the page cache. This matches Linux behavior for filesystems without DIO support (e.g., some network filesystems).

Error handling: If DMA fails mid-transfer (device error, IOMMU fault), the Bio completion callback reports the error. The DIO path returns the number of bytes successfully transferred (short read/write). If zero bytes were transferred, the error code from the Bio is returned directly (e.g., EIO).

Key latency contributors (approximate, NVMe on x86-64):

  • Steps 1-4 (VFS + page cache): ~1-5 us (CPU-bound, no I/O)
  • Step 3a (O_SYNC/O_DSYNC): +10-15 us NVMe, +3-8 ms SATA (device write latency)
  • Step 5 (writeback trigger): 0-5 s delay (async) or 0 (fsync path)
  • Steps 6-7 (bio construction + block layer): ~1-3 us
  • Steps 8-9 (driver + DMA): ~1-2 us (Tier 0/1) or ~5-10 us (Tier 2)
  • Step 10 (hardware): 10-100 us (NVMe) / 1-10 ms (SATA)
  • O_DIRECT path (bypasses steps 3-5): ~15-120 us total (DMA + device)

Cross-references for each step:

  • Syscall dispatch: Section 19.1
  • Page cache: Section 4.4
  • Dirty extent tracking: Section 14.4
  • Writeback subsystem: Section 4.6
  • Bio and block device trait: Section 15.2
  • I/O scheduling: Section 15.18
  • KABI ring dispatch: Section 12.3
  • DMA subsystem: Section 4.14
  • Tier 1 crash recovery for in-flight writes: Section 11.9

The following sections (Pipe Subsystem, Inode Cache, Dentry Cache, Path Resolution, Mount Namespace) remain in this file.

14.1.3 Pipe Subsystem

For pipe implementation, see Section 14.17.

14.1.4 Inode Cache (icache)

The inode cache uses per-superblock XArray lookup (SuperBlock.inode_cache) with a global LRU list (InodeCache) for eviction. See the Core VFS Data Structures section above for struct definitions. It provides:

  • inode_cache_lookup(sb, ino) — O(1) per-superblock XArray lookup under RCU (hot path).
  • inode_cache_insert(inode) — inserts into the inode's superblock XArray.
  • inode_cache_evict(sb, count) — LRU eviction for memory pressure.
  • INODE_CACHE_SHRINKER — registered shrinker for integration with the page reclaim subsystem.

Each superblock holds an XArray<u64, Arc<Inode>> keyed by inode number for O(1) lookup with no hash computation. Unreferenced inodes (i_refcount == 0, i_nlink > 0) are placed on a global LRU list for eviction under memory pressure.

Memory pressure integration: The inode cache registers a shrinker with umka-core's memory reclaim subsystem (Section 4.2). When the page allocator signals pressure, inode_cache_evict() walks the LRU list and evicts up to nr_to_scan inodes. Each eviction frees the inode struct, its associated page cache pages (via truncate_inode_pages()), and its LSM blob. The dentry cache shrinker runs first (evicting dentries drops inode refcounts, making more inodes eligible for LRU eviction).
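The shrinker's eviction policy can be sketched as an LRU of unreferenced inodes. The names below (`IcacheModel`, `put_last_ref`, `evict`) are illustrative stand-ins for the InodeCache structures described above.

```rust
// Toy model of icache eviction: unreferenced inodes sit on an LRU list;
// under memory pressure, the shrinker evicts up to nr_to_scan from the
// cold end of the list.
use std::collections::{HashMap, VecDeque};

struct IcacheModel {
    by_ino: HashMap<u64, u32>, // ino -> refcount (all 0 here for brevity)
    lru: VecDeque<u64>,        // front = coldest unreferenced inode
}

impl IcacheModel {
    /// Last reference dropped: the inode becomes eviction-eligible.
    fn put_last_ref(&mut self, ino: u64) {
        self.by_ino.insert(ino, 0);
        self.lru.push_back(ino);
    }

    /// Shrinker callback: evict up to nr_to_scan unreferenced inodes,
    /// returning how many were freed.
    fn evict(&mut self, nr_to_scan: usize) -> usize {
        let mut freed = 0;
        while freed < nr_to_scan {
            match self.lru.pop_front() {
                Some(ino) => {
                    // Real eviction would also truncate the inode's page
                    // cache pages and free its LSM blob.
                    self.by_ino.remove(&ino);
                    freed += 1;
                }
                None => break, // LRU exhausted
            }
        }
        freed
    }
}

fn main() {
    let mut ic = IcacheModel { by_ino: HashMap::new(), lru: VecDeque::new() };
    for ino in 1..=5 {
        ic.put_last_ref(ino);
    }
    assert_eq!(ic.evict(3), 3);          // coldest three evicted
    assert!(!ic.by_ino.contains_key(&1));
    assert!(ic.by_ino.contains_key(&5)); // most recently released survives
}
```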

14.1.5 Dentry Cache

The dentry (directory entry) cache is the performance-critical data structure of the VFS. It maps (parent_inode, name) pairs to child inodes, eliminating repeated disk lookups for path resolution.

Data structure: RCU-protected hash table. Read-side lookups are lock-free — no atomic operations on the read path, only a memory barrier on RCU read lock entry/exit. This matches Linux's dentry cache design, which is similarly RCU-protected for the same performance reasons.

Negative dentries: When a lookup() returns ENOENT, the VFS caches a negative dentry for that (parent, name) pair. Subsequent lookups for the same nonexistent path component return ENOENT immediately without calling into the filesystem driver. This is critical for workloads like $PATH searches where the shell looks for an executable in 5-10 directories, finding it only in one. Without negative dentries, every command invocation would perform 4-9 unnecessary disk lookups.
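The effect of negative dentries can be shown with a toy cache that stores `Option<u64>` per (parent, name) pair: `Some(ino)` for a positive entry, `None` for a cached ENOENT. The names and the `disk_lookups` counter are illustrative, not kernel API.

```rust
// Toy model of negative dentries: a cached miss never reaches the
// filesystem driver a second time.
use std::collections::HashMap;

struct DcacheModel {
    entries: HashMap<(u64, String), Option<u64>>, // None = negative dentry
    disk_lookups: u32, // how many times we had to call the driver
}

impl DcacheModel {
    fn lookup(&mut self, parent: u64, name: &str) -> Option<u64> {
        if let Some(cached) = self.entries.get(&(parent, name.to_string())) {
            return *cached; // hit: positive or negative, no driver call
        }
        self.disk_lookups += 1; // miss: ask the filesystem driver
        let found = None;       // pretend the driver returned ENOENT
        self.entries.insert((parent, name.to_string()), found);
        found
    }
}

fn main() {
    let mut dc = DcacheModel { entries: HashMap::new(), disk_lookups: 0 };
    // $PATH-style search: the same missing name probed twice in one dir.
    assert_eq!(dc.lookup(2, "gcc"), None);
    assert_eq!(dc.lookup(2, "gcc"), None);
    assert_eq!(dc.disk_lookups, 1); // second probe served from the cache
}
```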

Eviction: LRU eviction under memory pressure. The dentry cache integrates with umka-core's memory reclaim (Section 4.12 — Memory Compression Tier, in 04-memory.md) — when the page allocator signals memory pressure, the dentry cache shrinker evicts least-recently-used entries. Negative dentries are evicted preferentially (they are cheaper to re-create than positive dentries).

14.1.6 Path Resolution

Path resolution walks the dentry cache component by component. For example, /usr/lib/libfoo.so resolves as: root dentry -> lookup("usr") -> lookup("lib") -> lookup("libfoo.so").

RCU path walk (fast path): The entire resolution is attempted under an RCU read-side critical section. No dentry reference counts are taken, no locks are acquired. If every component is in the dentry cache and no concurrent renames or unmounts are in progress, the entire path resolves with zero atomic operations.

Ref-walk fallback (slow path): If any component is not cached, or if a concurrent mount/rename is detected (via sequence counters), the RCU walk aborts and restarts in ref-walk mode. Ref-walk takes dentry reference counts and inode locks as needed. This two-phase approach mirrors Linux's rcu-walk -> ref-walk fallback (the LOOKUP_RCU path in fs/namei.c).
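The sequence-counter validation behind the fallback can be sketched as follows. This is a deliberately simplified model (a single global rename counter, hypothetical names); the real walk validates per-dentry sequence counts at every component.

```rust
// Toy model of the two-phase walk: the RCU attempt samples a sequence
// counter before the lockless lookups and aborts to ref-walk if a
// concurrent rename bumped it during the walk.
use std::sync::atomic::{AtomicU64, Ordering};

static RENAME_SEQ: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, PartialEq)]
enum WalkMode {
    RcuWalk, // fast path completed
    RefWalk, // validation failed: restart with refcounts and locks
}

fn resolve(concurrent_rename: bool) -> WalkMode {
    let seq_before = RENAME_SEQ.load(Ordering::Acquire);
    // ... lockless component lookups would happen here ...
    if concurrent_rename {
        // A rename on another CPU invalidates the walk in progress.
        RENAME_SEQ.fetch_add(1, Ordering::Release);
    }
    if RENAME_SEQ.load(Ordering::Acquire) != seq_before {
        return WalkMode::RefWalk;
    }
    WalkMode::RcuWalk
}

fn main() {
    assert_eq!(resolve(false), WalkMode::RcuWalk); // fast path holds
    assert_eq!(resolve(true), WalkMode::RefWalk);  // fallback on mutation
}
```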

Mount point traversal: When a dentry is flagged as a mount point, resolution crosses into the mounted filesystem's root dentry. The mount table is consulted via RCU lookup (no lock) in the fast path.

Symlink resolution: The VFS follows up to 40 nested symlinks before returning ELOOP. This matches the Linux limit and prevents infinite symlink loops.
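A minimal sketch of the depth limit, using a hypothetical `Node` graph in place of real dentries. The key property is that even a self-referencing symlink terminates with ELOOP rather than looping forever.

```rust
// Illustrative nested-symlink limit: resolution tracks depth and fails
// with ELOOP past 40 link traversals.
const MAX_SYMLINK_DEPTH: u32 = 40;
const ELOOP: i32 = 40; // errno value; the numeric coincidence is accidental

#[derive(Clone, Copy)]
enum Node {
    File,
    Symlink(usize), // index of the target node
}

fn resolve(nodes: &[Node], mut idx: usize) -> Result<usize, i32> {
    let mut depth = 0;
    loop {
        match nodes[idx] {
            Node::File => return Ok(idx),
            Node::Symlink(target) => {
                depth += 1;
                if depth > MAX_SYMLINK_DEPTH {
                    return Err(ELOOP); // too many nested symlinks
                }
                idx = target;
            }
        }
    }
}

fn main() {
    // node 0 -> node 1 -> regular file at node 2
    let chain = [Node::Symlink(1), Node::Symlink(2), Node::File];
    assert_eq!(resolve(&chain, 0), Ok(2));
    // self-loop: must terminate with ELOOP, not hang
    let looped = [Node::Symlink(0)];
    assert_eq!(resolve(&looped, 0), Err(ELOOP));
}
```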

Symlink namespace semantics: Symlink targets are always resolved relative to the current task's mount namespace, not the symlink inode's namespace. Absolute symlink targets (/foo/bar) start from task.fs.root (the task's chroot/pivot_root). Relative symlink targets are resolved from the symlink's parent directory. The AT_SYMLINK_NOFOLLOW flag prevents resolution entirely (returns the symlink inode). This matches Linux behavior and ensures that symlinks do not become cross-namespace escape vectors — a symlink created in one mount namespace cannot force resolution through a different namespace's mount tree.

Capability checks: Traverse permission is checked at each path component, but not via an inter-domain ring call on every component. Instead, the dentry cache stores a cached_perm: AtomicU32 field containing the permission bits resolved on the last successful access by the current UID. During RCU-walk, the VFS reads cached_perm from the dentry (same domain, no ring call) and compares against the requesting process's UID and requested permission. If the cached permission matches (common case — same user accessing the same path), no domain crossing occurs and the check costs only a single atomic load (~1-3 cycles). The permission cache is invalidated on chmod(), chown(), ACL changes, and capability revocation (all infrequent operations).

Permission cache encoding: The 32-bit cached_perm field is divided into:

  • Bits [31:16]: Truncated UID hash (upper 16 bits of a fast hash of the accessor's UID). This is NOT a full UID — it is a probabilistic match filter.
  • Bits [15:12]: Reserved (zero).
  • Bits [11:9]: Permission result for owner (rwx).
  • Bits [8:6]: Permission result for group (rwx).
  • Bits [5:3]: Permission result for other (rwx).
  • Bits [2:0]: Access mode that was checked (rwx).

On a cache hit (UID hash matches AND requested permission bits are a subset of the cached grant), the VFS skips the domain crossing. On a cache miss (UID hash mismatch or permission bits not cached), the VFS performs a full capability check via the inter-domain ring and updates the cache. The 16-bit UID hash has a ~1/65536 false positive rate — a different user may produce the same truncated UID hash as the cached entry. A colliding hit is accepted only when the requested permission bits are also a subset of the cached grant, and that grant was itself produced by an authoritative slow-path check, so the cache can never serve permission bits that umka-core did not previously authorize for this dentry.

On a UID hash mismatch, or when the requested bits are not covered by the cached grant, the VFS falls back to the slow-path inter-domain capability check, which always produces a correct result. The permission cache is purely advisory: it can only short-circuit a check the slow path has already answered for this dentry, never widen a grant. Fail-safe direction: take the slow path when unknown, never grant unknown.
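The cached_perm bit layout and the fast-path hit test can be sketched directly from the encoding above. `hash_uid` here is an arbitrary multiplicative stand-in for the kernel's fast UID hash, and `encode`/`cache_hit` are hypothetical helper names.

```rust
// Illustrative encoder/checker for the 32-bit cached_perm layout:
// [31:16] UID hash | [15:12] reserved | [11:9] owner | [8:6] group
// | [5:3] other | [2:0] access mode that was checked.

fn hash_uid(uid: u32) -> u32 {
    // Upper 16 bits of a fast multiplicative hash (stand-in only).
    (uid.wrapping_mul(0x9E37_79B9) >> 16) & 0xFFFF
}

fn encode(uid: u32, owner: u32, group: u32, other: u32, checked: u32) -> u32 {
    (hash_uid(uid) << 16) | (owner << 9) | (group << 6) | (other << 3) | checked
}

/// Fast-path check: a hit requires the UID hash to match AND the
/// requested bits to be a subset of the access mode already checked.
fn cache_hit(cached: u32, uid: u32, requested: u32) -> bool {
    let checked_bits = cached & 0x7;
    (cached >> 16) == hash_uid(uid) && (requested & checked_bits) == requested
}

fn main() {
    const R: u32 = 4;
    const W: u32 = 2;
    const X: u32 = 1;
    // Previous authoritative check verified r+x for UID 1000.
    let cached = encode(1000, R | W | X, R | X, X, R | X);
    assert!(cache_hit(cached, 1000, X));  // subset of the cached check: hit
    assert!(!cache_hit(cached, 1000, W)); // W never checked: miss -> slow path
    assert!(!cache_hit(cached, 1001, X)); // UID hash mismatch: miss -> slow path
}
```

Any `false` result simply routes the lookup to the authoritative inter-domain check, so a wrong answer from this function costs latency, never correctness.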

Multi-user ping-pong: On shared directory trees accessed by multiple UIDs concurrently, the single-entry permission cache experiences ping-pong (alternating misses as different UIDs overwrite each other's cached hash). This is acceptable for the common case (single-user access patterns dominate) and bounded — each miss costs one domain crossing, identical to the no-cache case. A multi-entry cache was considered but rejected due to per-dentry memory overhead (each additional entry adds 4 bytes per dentry, multiplied by millions of dentries in active caches).

This design is correct because:

1. A cache hit is accepted only when the UID hash AND the requested permission bits match the stored grant. Any mismatch in either field causes a cache miss and a full slow-path check.
2. The cache is invalidated on ALL permission-changing operations (chmod, chown, ACL changes, capability revocation), ensuring stale grants are never served after the underlying permission state changes.
3. Only the slow-path inter-domain ring call is authoritative. umka-vfs cannot grant access that umka-core's capability tables do not authorize.

Only on a cache miss (first access, different UID, or invalidated entry) does the VFS call umka-core via the inter-domain ring to perform a full capability check and update the dentry's cached permissions. This amortized design preserves the security guarantee (umka-vfs cannot bypass capability checks — it has no access to capability tables, per Section 11.2 and Section 11.3) while keeping the hot-path overhead to a single atomic load per component, comparable to Linux's inode->i_mode check.

14.1.6.1 MountDentry — VFS Location Pair

A MountDentry is the fundamental VFS location type: a (mount, dentry) pair that uniquely identifies a point in the mount tree. It is the result of every path resolution and the primary reference type passed between VFS operations.

/// A reference to a location in the mount tree: the specific mount and the
/// dentry within that mount's filesystem. Two dentries with the same inode
/// in different mounts are different `MountDentry` values.
///
/// `MountDentry` holds Arc references to both the `Mount` and the `Dentry`,
/// keeping both alive for the duration of the reference. Dropping a
/// `MountDentry` decrements both refcounts.
pub struct MountDentry {
    /// The mount containing this dentry.
    pub mnt: Arc<Mount>,
    /// The dentry within the mount's filesystem.
    pub dentry: Arc<Dentry>,
}

/// Resolve an open file descriptor to its `MountDentry`.
///
/// Used by `open_by_handle_at()`, `fstatat(AT_EMPTY_PATH)`, and io_uring
/// `AT_FDCWD` resolution. The returned `MountDentry` identifies the mount
/// and dentry that the file descriptor was opened on.
///
/// # Errors
/// - `EBADF`: `fd` is not a valid open file descriptor.
/// - `ENOENT`: The file descriptor's dentry has been unlinked (deleted)
///   and `FMODE_PATH` is not set.
pub fn fd_to_mount_dentry(fd: i32) -> Result<MountDentry, Errno> {
    let file = fget(fd)?;
    Ok(MountDentry {
        mnt: Arc::clone(&file.mnt),
        dentry: Arc::clone(&file.dentry),
    })
}

14.1.6.2 Path Lookup Entry Point

The path_lookup function is the primary entry point for all VFS path resolution. Every syscall that accepts a pathname (open, stat, access, mkdir, unlink, mount, execve, etc.) calls path_lookup to translate the user-provided path string into a MountDentry pair identifying the target location in the mount tree.

bitflags! {
    /// Flags controlling path resolution behavior. Passed to `path_lookup()`
    /// by syscall handlers to customize resolution semantics.
    ///
    /// These flags correspond to Linux's internal `LOOKUP_*` flags (not
    /// directly visible to userspace, but indirectly controlled by syscall
    /// flags like `O_NOFOLLOW`, `O_DIRECTORY`, `O_CREAT`, `AT_SYMLINK_NOFOLLOW`,
    /// `AT_EMPTY_PATH`, `RESOLVE_BENEATH`, `RESOLVE_NO_XDEV`, etc.).
    pub struct LookupFlags: u32 {
        /// Follow the terminal symlink. If the final path component is a
        /// symlink and this flag is set, resolution follows it to the target.
        /// If not set, resolution returns the symlink inode itself.
        ///
        /// Default for most syscalls (`open`, `stat`). Cleared by `O_NOFOLLOW`
        /// and `AT_SYMLINK_NOFOLLOW`. `lstat()` clears this flag.
        ///
        /// Note: intermediate symlinks (non-terminal components) are ALWAYS
        /// followed regardless of this flag — only the final component is
        /// affected. This matches POSIX behavior.
        const FOLLOW        = 0x0001;

        /// The final component must be a directory. If it resolves to a
        /// non-directory inode, return `ENOTDIR`.
        ///
        /// Set by `O_DIRECTORY` (openat2), `mkdir` (parent lookup), and
        /// `rmdir` (target validation). Also implicitly set when the path
        /// ends with a trailing `/` (POSIX: trailing slash implies directory).
        const DIRECTORY     = 0x0002;

        /// Resolve the parent directory of the final component, not the
        /// final component itself. The final component name is returned
        /// separately (not resolved to an inode). Used by syscalls that
        /// create or remove entries: `mkdir`, `mknod`, `unlink`, `rmdir`,
        /// `rename`, `link`, `symlink`.
        ///
        /// When set, `path_lookup` returns the `MountDentry` of the parent
        /// directory, and the final component name is stored in a separate
        /// output parameter (not shown in this simplified signature).
        const PARENT        = 0x0004;

        /// The syscall is creating a new entry (`O_CREAT`). This flag is
        /// informational — it does not change resolution behavior, but it
        /// is checked by audit/LSM hooks to distinguish "create" from
        /// "open existing".
        const CREATE        = 0x0008;

        /// The syscall requires exclusive creation (`O_EXCL`). Combined
        /// with `CREATE`. If the final component already exists, return
        /// `EEXIST`. The VFS checks this after resolution completes.
        const EXCL          = 0x0010;

        /// The resolution is part of an `open()` operation. This flag
        /// enables open-intent optimizations: the VFS can pass an open
        /// intent to the filesystem's `lookup()` so that NFS can perform
        /// an atomic lookup-and-open in a single RPC, avoiding a TOCTOU
        /// race between lookup and open.
        const OPEN          = 0x0020;

        /// Resolution must not cross the `root` boundary upward. If the
        /// path contains `..` components that would ascend above `root`,
        /// return `EXDEV` instead of silently clamping to `root`.
        ///
        /// Maps to `RESOLVE_BENEATH` (openat2). Provides a stronger
        /// security guarantee than chroot: even a privileged process
        /// cannot escape the `root` boundary via `..` traversal.
        const BENEATH       = 0x0040;

        /// Resolution must not cross mount point boundaries. If the path
        /// traverses a mount point (in either direction — into a mounted
        /// filesystem or back out via `..`), return `EXDEV`.
        ///
        /// Maps to `RESOLVE_NO_XDEV` (openat2). Used by sandboxed
        /// processes and container runtimes that want to confine path
        /// resolution to a single filesystem.
        const NO_XDEV       = 0x0080;

        /// Do not trigger automounts during resolution. If a path
        /// component is an autofs trigger point, return `ENOENT` instead
        /// of mounting the remote filesystem.
        ///
        /// Maps to `AT_NO_AUTOMOUNT`. This is separate from
        /// `RESOLVE_NO_MAGICLINKS` — automount suppression and magic-link
        /// suppression are independent concepts.
        const NO_AUTOMOUNT  = 0x0100;

        /// Fail if any path component (including the terminal) is a
        /// symlink. Stricter than clearing `FOLLOW` (which only affects
        /// the terminal component).
        ///
        /// Maps to `RESOLVE_NO_SYMLINKS` (openat2, 0x04).
        const NO_SYMLINKS   = 0x0200;

        /// Fail on `/proc/[pid]/fd/*` style magic symlinks (procfs magic
        /// links that jump to arbitrary filesystem locations). Regular
        /// symlinks are still followed unless `NO_SYMLINKS` is also set.
        ///
        /// Maps to `RESOLVE_NO_MAGICLINKS` (openat2, 0x02).
        const NO_MAGICLINKS = 0x0400;

        /// Treat `dirfd` as the filesystem root. `..` at `dirfd` stays
        /// at `dirfd` (like chroot but per-syscall). Combined with
        /// `BENEATH`, provides a complete sandboxed lookup.
        ///
        /// Maps to `RESOLVE_IN_ROOT` (openat2, 0x10).
        const IN_ROOT       = 0x0800;

        /// Non-blocking lookup. If the resolution would block (uncached
        /// dentry, lazy NFS lookup, autofs trigger), return `EAGAIN`
        /// instead of blocking. Used by io_uring for async path ops.
        ///
        /// Maps to `RESOLVE_CACHED` (openat2, 0x20).
        const CACHED        = 0x1000;

        /// Empty path resolution. When set with an empty path string,
        /// the resolution returns the `MountDentry` of the `dirfd`
        /// itself (or `pwd` if `dirfd` is `AT_FDCWD`). Used by
        /// `AT_EMPTY_PATH` (fstatat, linkat, etc.) and `fexecve`.
        const EMPTY_PATH    = 0x2000;
    }
}

/// VFS path resolution entry point. Called from syscall handlers to resolve
/// a user-provided path string to a `MountDentry` pair
/// ([Section 8.1](08-process.md#process-and-task-management--fsstruct)).
///
/// This function implements the two-phase resolution protocol described above:
/// first attempts RCU-walk (lockless, zero atomic operations on hit), then
/// falls back to ref-walk on miss or concurrent modification.
///
/// # Arguments
///
/// - `mnt_ns`: Mount namespace for mount point traversal. Determines which
///   mounts are visible during resolution. Obtained from
///   `task.nsproxy.mount_ns` ([Section 17.1](17-containers.md#namespace-architecture)).
/// - `root`: Chroot root boundary from `task.fs.read().root`. Path resolution
///   never ascends above this point via `..`. If `LOOKUP_BENEATH` is set,
///   attempting to ascend above `root` returns `EXDEV` instead of clamping.
/// - `pwd`: Current working directory from `task.fs.read().pwd`. Used as the
///   starting point for relative path resolution. Ignored for absolute paths
///   (paths starting with `/`).
/// - `path`: Path string (absolute or relative). Kernel-space byte slice —
///   the syscall layer has already copied this from userspace via
///   `copy_from_user`. Must be null-terminated or bounded by `PATH_MAX`
///   (4096 bytes). An empty `path` is valid only if `LOOKUP_EMPTY_PATH` is
///   set in `flags`.
/// - `flags`: `LookupFlags` bitflags controlling resolution behavior (see
///   the `LookupFlags` definition above).
///
/// # Returns
///
/// On success, returns a `MountDentry` identifying the resolved location
/// in the mount tree. The returned `MountDentry` holds references to both
/// the mount and the dentry (refcounts incremented). The caller is
/// responsible for releasing these references when done.
///
/// # Errors
///
/// | Error | Condition |
/// |-------|-----------|
/// | `ENOENT` | A path component does not exist (and `LOOKUP_CREATE` is not set) |
/// | `EACCES` | Traverse (execute) permission denied on a directory component |
/// | `ENOTDIR` | A non-terminal component is not a directory, or `LOOKUP_DIRECTORY` is set and the final component is not a directory |
/// | `ELOOP` | More than 40 nested symlinks encountered during resolution |
/// | `ENAMETOOLONG` | A path component exceeds `NAME_MAX` (255 bytes) or the total path exceeds `PATH_MAX` (4096 bytes) |
/// | `EXDEV` | `LOOKUP_BENEATH`: path escapes `root` via `..`. `LOOKUP_NO_XDEV`: path crosses a mount boundary |
/// | `EINVAL` | Empty path without `LOOKUP_EMPTY_PATH` |
///
/// # Concurrency
///
/// Thread-safe. Multiple threads may call `path_lookup` concurrently. The
/// RCU-walk phase is fully lockless. The ref-walk fallback acquires per-dentry
/// spinlocks and inode `i_rwsem` (shared) as needed.
///
/// # Performance
///
/// Hot path (all components cached, no concurrent mutations): O(n) where n is
/// the number of path components. Each component costs one dentry hash lookup
/// (~5-10ns) plus one `cached_perm` check (~1-3ns). No domain crossings, no
/// locks, no atomic RMW operations.
pub fn path_lookup(
    mnt_ns: &MountNamespace,
    root: &MountDentry,
    pwd: &MountDentry,
    path: &[u8],
    flags: LookupFlags,
) -> Result<MountDentry, Errno>;

Credential resolution: path_lookup() accesses the calling task's credentials via current_task().cred (RCU-protected read) for permission checks at each path component. This is valid because VFS runs as Tier 1 (Ring 0, shared per-CPU state with Core). The CpuLocal current_task pointer lives in Nucleus memory, readable by all Tier 1 domains. No domain crossing is needed to read task credentials.

RESOLVE_IN_ROOT capture timing: The root boundary is captured at the start of path_lookup() from task.fs.root. If the task's namespace changes between syscall entry and path_lookup(), the root captured at path_lookup entry is authoritative. This is consistent with Linux openat2() behavior.

dirfd validity across unshare(CLONE_NEWNS): After unshare(CLONE_NEWNS), existing file descriptors (including dirfd values) remain valid. The dentry referenced by a dirfd is in the mount tree — after unshare, the new mount namespace is a copy of the old, and existing dentries are shared (copy-on-write mount points). Operations using AT_FDCWD or an explicit dirfd resolve in the calling task's current mount namespace. The dirfd does not become invalid.

pwd after unshare(CLONE_NEWNS): If the current working directory is unreachable from the new namespace's root (e.g., the mount point was not copied), the pwd becomes a "floating" dentry. File operations relative to pwd succeed (the dentry is still valid). getcwd() returns ENOENT. This matches Linux behavior.

Syscall-to-VFS bridge: Syscall handlers construct the path_lookup call from the current task's state:

// Example: openat(dirfd, pathname, flags, mode) syscall handler sketch.
// Shows how SyscallContext fields feed into path_lookup.
fn sys_openat(ctx: &mut SyscallContext) -> Result<i64, Errno> {
    let dirfd = ctx.args[0] as i32;
    let pathname = copy_path_from_user(ctx.args[1] as *const u8, PATH_MAX)?;
    let flags = ctx.args[2] as u32;
    let mode = ctx.args[3] as u32; // used later by the O_CREAT path

    let fs = ctx.task.fs.read();
    let nsproxy = ctx.task.nsproxy.load();
    let mnt_ns = &nsproxy.mount_ns;

    // Determine the base directory for relative paths. The MountDentry for
    // an explicit dirfd must outlive the borrow, hence the outer binding.
    let dirfd_base;
    let base = if dirfd == AT_FDCWD {
        &fs.pwd
    } else {
        dirfd_base = fd_to_mount_dentry(dirfd)?;
        &dirfd_base
    };

    let lookup_flags = open_flags_to_lookup_flags(flags);
    let target = path_lookup(mnt_ns, &fs.root, base, &pathname, lookup_flags)?;

    // ... proceed with open using the resolved MountDentry ...
}

Flag translation functions:

/// Translate `open(2)` / `openat(2)` O_* flags to internal LookupFlags.
/// Used by sys_open, sys_openat, and legacy open paths.
fn open_flags_to_lookup_flags(o_flags: u32) -> LookupFlags {
    let mut lf = LookupFlags::FOLLOW; // default: follow terminal symlinks
    if o_flags & O_NOFOLLOW != 0 {
        lf.remove(LookupFlags::FOLLOW);
    }
    if o_flags & O_DIRECTORY != 0 {
        lf |= LookupFlags::DIRECTORY;
    }
    if o_flags & O_CREAT != 0 {
        lf |= LookupFlags::CREATE;
    }
    // Note: O_PATH does not map to a lookup flag here. EMPTY_PATH
    // corresponds to AT_EMPTY_PATH (empty pathname resolution), not
    // O_PATH; O_PATH is handled at OpenFile construction time (a
    // lightweight fd-only reference with no file data access).
    lf
}

/// Translate `openat2(2)` resolve flags (from `struct open_how.resolve`) to
/// internal LookupFlags. Called by sys_openat2 AFTER `open_flags_to_lookup_flags()`
/// to layer the RESOLVE_* restrictions on top of the O_* translations.
///
/// Linux `openat2(2)` resolve flag values (from `include/uapi/linux/openat2.h`):
///   RESOLVE_NO_XDEV       = 0x01
///   RESOLVE_NO_MAGICLINKS = 0x02
///   RESOLVE_NO_SYMLINKS   = 0x04
///   RESOLVE_BENEATH       = 0x08
///   RESOLVE_IN_ROOT       = 0x10
///   RESOLVE_CACHED        = 0x20
///
/// Returns `Err(EINVAL)` if unknown resolve bits are set (forward
/// compatibility, matching openat2's EINVAL on unrecognized flags).
/// RESOLVE_BENEATH | RESOLVE_IN_ROOT is accepted (as in Linux 5.12+);
/// when both are set, RESOLVE_IN_ROOT semantics dominate.
fn resolve_flags_to_lookup_flags(
    base: LookupFlags,
    resolve: u64,
) -> Result<LookupFlags, Errno> {
    let mut lf = base;
    if resolve & RESOLVE_NO_XDEV != 0 {
        lf |= LookupFlags::NO_XDEV;
    }
    if resolve & RESOLVE_NO_MAGICLINKS != 0 {
        lf |= LookupFlags::NO_MAGICLINKS;
    }
    if resolve & RESOLVE_NO_SYMLINKS != 0 {
        lf |= LookupFlags::NO_SYMLINKS;
        lf.remove(LookupFlags::FOLLOW); // NO_SYMLINKS implies no terminal follow
    }
    if resolve & RESOLVE_BENEATH != 0 {
        lf |= LookupFlags::BENEATH;
    }
    if resolve & RESOLVE_IN_ROOT != 0 {
        lf |= LookupFlags::IN_ROOT;
    }
    if resolve & RESOLVE_CACHED != 0 {
        lf |= LookupFlags::CACHED;
    }
    // Reject unknown bits (forward compatibility).
    let known_bits = RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS
        | RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_CACHED;
    if resolve & !known_bits != 0 {
        return Err(Errno::EINVAL);
    }
    Ok(lf)
}
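The two-stage translation can be exercised in isolation. The sketch below is a simplified, self-contained model: plain `u32` bitmasks stand in for the internal `LookupFlags` bitflags type (the flag values here are illustrative, not the kernel's), and only a subset of the O_* and RESOLVE_* flags is modeled.

```rust
// Illustrative stand-ins for the internal LookupFlags bits.
pub const FOLLOW: u32      = 1 << 0; // follow terminal symlinks
pub const DIRECTORY: u32   = 1 << 1;
pub const NO_SYMLINKS: u32 = 1 << 2;
pub const IN_ROOT: u32     = 1 << 3;

// Linux openat2(2) resolve-flag values (include/uapi/linux/openat2.h).
pub const RESOLVE_NO_SYMLINKS: u64 = 0x04;
pub const RESOLVE_IN_ROOT: u64     = 0x10;
const KNOWN_RESOLVE_BITS: u64      = 0x3F;

// O_* subset (x86-64 octal values).
pub const O_NOFOLLOW: u32  = 0o400000;
pub const O_DIRECTORY: u32 = 0o200000;

/// Stage 1: O_* flags set the default lookup policy.
pub fn open_to_lookup(o_flags: u32) -> u32 {
    let mut lf = FOLLOW; // default: follow terminal symlinks
    if o_flags & O_NOFOLLOW != 0 { lf &= !FOLLOW; }
    if o_flags & O_DIRECTORY != 0 { lf |= DIRECTORY; }
    lf
}

/// Stage 2: RESOLVE_* restrictions layer on top; unknown bits are rejected.
pub fn resolve_to_lookup(base: u32, resolve: u64) -> Result<u32, i32> {
    if resolve & !KNOWN_RESOLVE_BITS != 0 {
        return Err(-22); // EINVAL: forward compatibility
    }
    let mut lf = base;
    if resolve & RESOLVE_NO_SYMLINKS != 0 {
        lf = (lf | NO_SYMLINKS) & !FOLLOW; // NO_SYMLINKS implies no terminal follow
    }
    if resolve & RESOLVE_IN_ROOT != 0 { lf |= IN_ROOT; }
    Ok(lf)
}
```

The layering mirrors sys_openat2: stage 1 computes the baseline from the O_* flags, then stage 2 only ever adds restrictions (or removes FOLLOW), never relaxes them.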

14.1.7 Mount Namespace and Capability-Gated Mounting

Each process belongs to a mount namespace containing its own mount tree.

Mount operations are capability-gated:

| Operation | Required Capability | Scope |
|---|---|---|
| mount | CAP_MOUNT | Mount namespace |
| bind mount | CAP_MOUNT + read access to source | Mount namespace + source |
| remount | CAP_MOUNT | Mount namespace |
| umount | CAP_MOUNT | Mount namespace |
| pivot_root | CAP_SYS_ADMIN | Mount namespace |

CAP_MOUNT is scoped to the calling process's mount namespace — it does not grant mount authority in other namespaces. A container with its own mount namespace can mount filesystems within that namespace without affecting the host.
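Namespace-scoped capability gating can be modeled as membership in a per-namespace grant set. This is a hypothetical sketch (the `MntNsId` and `TaskCaps` types are illustrative, not the kernel's credential structures):

```rust
use std::collections::HashSet;

/// Illustrative mount-namespace identifier.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct MntNsId(pub u64);

/// Hypothetical model: CAP_MOUNT is held per (task, namespace) pair,
/// so holding it in one mount namespace grants nothing in another.
pub struct TaskCaps {
    cap_mount_in: HashSet<MntNsId>,
}

impl TaskCaps {
    pub fn new() -> Self { Self { cap_mount_in: HashSet::new() } }
    pub fn grant_cap_mount(&mut self, ns: MntNsId) { self.cap_mount_in.insert(ns); }
    /// mount/remount/umount are permitted only in a namespace where the
    /// task holds CAP_MOUNT.
    pub fn may_mount(&self, target_ns: MntNsId) -> bool {
        self.cap_mount_in.contains(&target_ns)
    }
}
```

A container granted CAP_MOUNT in its own namespace thus passes `may_mount` there but fails it for the host namespace.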

Mount propagation: Shared, private, slave, and unbindable propagation types, with the same semantics as Linux (MS_SHARED, MS_PRIVATE, MS_SLAVE, MS_UNBINDABLE). This is essential for container runtimes that rely on mount propagation for volume mounts.

Filesystem type registration: Only umka-core can register new filesystem types with the VFS. Filesystem drivers request registration via the inter-domain ring, and umka-core verifies the driver's identity and KABI certification before granting registration.

14.1.7.1 Mount Lifecycle

The mount(2) syscall drives a multi-step flow that creates or reuses a SuperBlock, allocates a MountPoint node, and inserts it into the calling process's mount tree. Each step has a defined rollback on failure, ensuring no resource leaks.

Mount flow (do_mount) (summary; the canonical do_mount algorithm with full step ordering is defined in Section 14.6):

  1. Lookup filesystem type. Search the global FS_TYPE_TABLE (XArray keyed by filesystem name hash) for the requested fs_type string (e.g., "ext4", "tmpfs"). If not found, return ENODEV.

  2. Cgroup device controller check. For block-backed filesystems (source resolves to a block device), check the calling task's cgroup device controller allowlist (Section 17.2). If the device's (major, minor) is not in devices.allow for the task's cgroup, return EPERM. This prevents containers from mounting arbitrary block devices. Pseudo-filesystems (tmpfs, procfs, sysfs) skip this check.

  3. Resolve or allocate SuperBlock. For block-backed filesystems, hash the (fs_type, device) pair and search the active superblock table:
     - Existing superblock found: increment s_refcount. Verify mount flags are compatible (e.g., cannot mount the same device MS_RDONLY and read-write simultaneously). If incompatible, return EBUSY.
     - No existing superblock: allocate a new SuperBlock from the VFS slab cache. Initialize s_type, s_blocksize, s_flags, s_bdev, and s_uuid with defaults. Set s_refcount = 1.

  4. Call FileSystemOps::mount(sb, source, flags, data). The filesystem driver reads the on-disk superblock (for block-backed filesystems), fills the SuperBlock fields (s_blocksize, s_maxbytes, s_root, s_fs_info), performs journal replay if needed (ext4, XFS), and returns. For pseudo-filesystems (tmpfs, procfs), this step populates the root inode and dentry without any block I/O.
     - On error: release the SuperBlock (decrement refcount; if zero, free it). Return the filesystem's error code.

  5. Create MountPoint node. Allocate a MountPoint linking:
     - parent: the dentry where this mount is attached (e.g., /mnt/data).
     - source: the device path or source string (e.g., /dev/sda1).
     - sb: the SuperBlock from step 3.
     - mount_id: a globally unique monotonic u64 mount identifier (exposed to userspace via STATX_MNT_ID).
     - flags: mount flags (MS_RDONLY, MS_NOSUID, MS_NODEV, etc.).
     - propagation: propagation type (MS_SHARED, MS_PRIVATE, etc.), defaulting to MS_PRIVATE.
     - On error: call FileSystemOps::unmount(sb), release SuperBlock.

  6. Bind BackingDevInfo. For block-backed filesystems, associate the BackingDevInfo (BDI) from the block device with the superblock. The BDI controls writeback rate limiting, readahead window defaults, and dirty page accounting per backing device (Section 4.6). For pseudo-filesystems (tmpfs, procfs), a default BDI with no writeback is used.

  7. Insert into mount tree. Acquire the mount namespace write lock. Attach the MountPoint as a child of the parent dentry in the namespace's mount tree. Set d_mount_seq on the parent dentry (incremented to invalidate any in-flight RCU-walk lookups that cached the old state). Apply mount propagation rules: if the parent mount is MS_SHARED, replicate the new mount into all peer mount namespaces. Release the mount namespace lock.
     - On error: deallocate MountPoint, call FileSystemOps::unmount(sb), release SuperBlock.

  8. Return success. The filesystem is now accessible at the mount point.

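The rollback discipline in the flow above (each step undoes prior acquisitions on failure) can be sketched with mock resources. This is a minimal illustration, not the kernel's do_mount; the types and the single modeled failure point are assumptions for the example:

```rust
/// Mock superblock: only the refcount matters for this sketch.
pub struct SuperBlock { pub refcount: u32 }

pub enum MountError { DriverFail }

/// Stand-in for FileSystemOps::mount crossing the ring to the driver.
fn fs_driver_mount(driver_ok: bool) -> Result<(), MountError> {
    if driver_ok { Ok(()) } else { Err(MountError::DriverFail) }
}

/// Returns Ok(mount_id) on success. On failure of a later step, every
/// earlier acquisition is released in reverse order before the error
/// propagates, so no refcount leaks.
pub fn do_mount_sketch(sb: &mut SuperBlock, driver_ok: bool) -> Result<u64, MountError> {
    sb.refcount += 1;                             // step 3: resolve/allocate SuperBlock
    if let Err(e) = fs_driver_mount(driver_ok) {  // step 4: FileSystemOps::mount
        sb.refcount -= 1;                         // rollback step 3
        return Err(e);
    }
    // steps 5-7: MountPoint allocation, BDI binding, tree insertion (elided)
    Ok(1) // illustrative mount_id
}
```

The same pattern extends to the later steps: a failure in tree insertion unwinds the MountPoint, the driver mount, and the superblock reference, in that order.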
Unmount flow (do_umount):

  1. Check reference count. If the mount has active open files, child mounts, or CWD references, return EBUSY (unless MNT_FORCE or MNT_DETACH is specified).

  2. Detach from mount tree. Acquire the mount namespace write lock. Remove the MountPoint from its parent's child list. Increment d_mount_seq on the parent dentry. For MNT_DETACH (lazy umount), the mount is detached from the namespace tree immediately but the SuperBlock is kept alive until all references are released.

  3. Sync dirty data. Call FileSystemOps::sync_fs(sb, wait=true) to flush all dirty pages and metadata. This invokes the writeback thread (Section 4.6) for the superblock's BDI. For MNT_FORCE, skip the sync and proceed with best-effort teardown (in-flight I/O is drained with -EIO).

  4. Tear down SuperBlock. Decrement s_refcount. If the refcount reaches zero (no other mounts share this superblock):
     a. Evict all inodes: walk s_inodes, call the inode eviction sequence (writeback dirty pages, InodeOps::evict_inode, remove from sb.inode_cache XArray).
     b. Call FileSystemOps::unmount(sb) — the filesystem flushes its journal, writes the clean-unmount marker, and releases s_fs_info.
     c. Release the s_bdev reference (if block-backed).
     d. Free the SuperBlock slab object.

  5. Deallocate MountPoint. Free the MountPoint slab object.
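The busy check in step 1 can be sketched as a pure predicate (a simplified model; the flag values and error encoding are illustrative):

```rust
pub const MNT_FORCE: u32  = 0x1;
pub const MNT_DETACH: u32 = 0x2;

/// A mount with open files, child mounts, or CWD references is busy.
/// Busy mounts return EBUSY unless MNT_FORCE or MNT_DETACH is given.
pub fn umount_busy_check(
    open_files: u32,
    child_mounts: u32,
    cwd_refs: u32,
    flags: u32,
) -> Result<(), i32> {
    let busy = open_files > 0 || child_mounts > 0 || cwd_refs > 0;
    if busy && flags & (MNT_FORCE | MNT_DETACH) == 0 {
        return Err(-16); // EBUSY
    }
    Ok(())
}
```

With MNT_DETACH the check passes even when busy: the mount is detached from the tree immediately and the superblock teardown is deferred until the last reference drops.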

Force unmount (umount2 with MNT_FORCE): Calls FileSystemOps::force_umount(sb), which aborts in-flight I/O with -EIO and skips journal commit. Used when a network filesystem server is unreachable or a device has been physically removed. Data loss may occur for unflushed dirty pages.

Remount (mount -o remount): Does not create a new MountPoint. Instead, calls FileSystemOps::remount(sb, new_flags, data) to update mount options on the existing superblock. The VFS validates flag transitions (e.g., MS_RDONLY → read-write requires CAP_MOUNT and a journal replay check).
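The remount flag-transition validation can be sketched as follows. This is an assumption-laden model: the capability and journal-replay checks are reduced to booleans, and only the MS_RDONLY transition is modeled:

```rust
pub const MS_RDONLY: u32 = 0x1;

/// A remount from read-only to read-write is accepted only when the
/// caller holds CAP_MOUNT and the journal replay check passed. Other
/// transitions (including rw -> ro) are accepted here unconditionally.
pub fn validate_remount(
    old_flags: u32,
    new_flags: u32,
    has_cap_mount: bool,
    journal_clean: bool,
) -> Result<(), i32> {
    let ro_to_rw = old_flags & MS_RDONLY != 0 && new_flags & MS_RDONLY == 0;
    if ro_to_rw && !(has_cap_mount && journal_clean) {
        return Err(-1); // EPERM
    }
    Ok(())
}
```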

See Section 14.5 for the character/block device node framework (chrdev/blkdev registration, major number table, devtmpfs automatic /dev node lifecycle).

14.1.7.2 ML Policy Integration for VFS

The VFS subsystem emits observations and exposes tunable parameters through the ML policy framework (Section 23.1). This enables closed-loop optimization of readahead, writeback scheduling, and dirty page throttling.

Observation hooks: The following observe_kernel! call sites are placed in VFS hot/warm paths. Each call is zero-cost (NOP) when no policy service consumer is attached (static key patching; see Section 23.1).

| Call site | Subsystem | Observation type | Path class | Data emitted |
|---|---|---|---|---|
| filemap_get_pages() cache miss | VfsLayer | VfsObs::PageCacheMiss | Hot | (ino, file_offset, ra_window_size, sequential: bool) |
| filemap_get_pages() cache hit | VfsLayer | VfsObs::PageCacheHit | Hot | (ino, file_offset) — sampled at 1/64 rate to bound overhead |
| generic_file_write_iter() | VfsLayer | VfsObs::BufferedWrite | Hot | (ino, bytes_written, dirty_pages_after) — sampled at 1/16 |
| writeback_single_inode() completion | VfsLayer | VfsObs::WritebackComplete | Warm | (ino, pages_written, elapsed_us, sequential_ratio) |
| balance_dirty_pages() throttle | VfsLayer | VfsObs::DirtyThrottle | Warm | (bdi_id, dirty_pages, dirty_limit, throttle_ms) |
| page_cache_readahead() trigger | VfsLayer | VfsObs::ReadaheadTrigger | Warm | (ino, start_offset, window_pages, sequential: bool) |
| Dentry cache miss in path_lookup() | VfsLayer | VfsObs::DentryCacheMiss | Hot | (parent_ino, name_hash) — sampled at 1/32 |
| VFS ring request enqueue | VfsLayer | VfsObs::RingEnqueue | Hot | (mount_id, opcode, ring_index) — sampled at 1/128 |
| path_lookup() completion | VfsLayer | VfsObs::PathLookupLatency | Hot | (path_components, elapsed_ns, rcu_walk_success: bool) — sampled at 1/64. Measures end-to-end path resolution latency including mount crossings and symlink follows. RCU-walk success rate is a key metric: low success rate indicates contention forcing ref-walk fallbacks. |
| select_ring() → response dequeue | VfsLayer | VfsObs::RingUtilization | Warm | (mount_id, ring_index, ring_depth, pending_slots, response_latency_ns) — emitted on every response dequeue. Measures ring fill level and round-trip latency. High pending_slots/ring_depth ratio signals the ring is saturated and ring count should be increased (or ring depth enlarged). |
| Readahead completion audit | VfsLayer | VfsObs::ReadaheadHitRate | Warm | (bdi_id, window_pages, pages_used_before_eviction, hit_ratio_pct) — emitted when a readahead window is fully consumed or evicted. Tracks how many prefetched pages were actually accessed before eviction. Low hit rate means the readahead window is oversized (wasting memory and I/O bandwidth). |
/// VFS-specific observation types for the ML policy framework.
/// Used as the `obs_type` field in `observe_kernel!` calls.
#[repr(u16)]
pub enum VfsObs {
    /// Page cache miss — readahead evaluation opportunity.
    PageCacheMiss       = 0,
    /// Page cache hit — confirms readahead effectiveness.
    PageCacheHit        = 1,
    /// Buffered write completion — dirty page accumulation signal.
    BufferedWrite       = 2,
    /// Writeback completion for a single inode.
    WritebackComplete   = 3,
    /// Dirty page throttling engaged — backpressure signal.
    DirtyThrottle       = 4,
    /// Readahead triggered — window sizing feedback.
    ReadaheadTrigger    = 5,
    /// Dentry cache miss — path resolution pressure signal.
    DentryCacheMiss     = 6,
    /// VFS ring enqueue — cross-domain I/O pressure signal.
    RingEnqueue         = 7,
    /// Path resolution end-to-end latency — RCU-walk success rate signal.
    PathLookupLatency   = 8,
    /// Ring utilization — fill level and response latency signal.
    RingUtilization     = 9,
    /// Readahead hit rate — window sizing effectiveness feedback.
    ReadaheadHitRate    = 10,
}
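The "sampled at 1/N" annotations in the table above correspond to a cheap power-of-two counter mask on the hot path. A minimal sketch (the static-key NOP patching when no consumer is attached is not modeled here):

```rust
/// Emit an observation only when the low `rate_log2` bits of a per-CPU
/// event counter are zero, i.e., one sample per 2^rate_log2 events.
/// A single AND plus compare, cheap enough for hot paths.
pub fn should_sample(counter: u64, rate_log2: u32) -> bool {
    counter & ((1u64 << rate_log2) - 1) == 0
}
```

For example, the PageCacheHit hook at 1/64 would call `should_sample(counter, 6)` with a counter incremented on every hit.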

Tunable parameters: The following VFS parameters are registered in the Kernel Tunable Parameter Store (Section 23.1). ParamId values are allocated in the I/O Scheduler range (0x0300-0x03FF) since VFS readahead and writeback are I/O-adjacent. Each parameter has a default, bounds, and a cooldown period to prevent oscillation.

| ParamId | Name | Default | Min | Max | Cooldown | Description |
|---|---|---|---|---|---|---|
| 0x0300 (IoReadaheadPages) | readahead_pages | 32 | 1 | 512 | 30s | Per-BDI max readahead window in pages |
| 0x0303 | vfs_dirty_ratio_pct | 20 | 5 | 80 | 60s | vm.dirty_ratio equivalent — percentage of total memory that can be dirty before synchronous writeback |
| 0x0304 | vfs_dirty_bg_ratio_pct | 10 | 1 | 50 | 60s | vm.dirty_background_ratio — background writeback trigger threshold |
| 0x0305 | vfs_writeback_interval_cs | 500 | 100 | 6000 | 30s | Writeback timer interval in centiseconds (default 5s = 500cs) |
| 0x0306 | vfs_ra_sequential_threshold | 4 | 1 | 32 | 30s | Number of sequential page accesses before readahead window doubles |
| 0x0307 | vfs_ring_coalesce_batch | 8 | 1 | 64 | 10s | Default VFS ring doorbell coalescing batch size for regular I/O |
| 0x0308 | vfs_ring_coalesce_timeout_us | 20 | 1 | 200 | 10s | Default VFS ring doorbell coalescing timeout in microseconds |
| 0x0309 | vfs_completion_coalesce_batch | 8 | 1 | 32 | 10s | Response-direction completion coalescing batch size (Section 14.3). Number of completions batched before waking the VFS consumer. |
| 0x030A | vfs_completion_coalesce_timeout_us | 10 | 1 | 100 | 10s | Response-direction completion coalescing timeout in microseconds. Bounds worst-case latency for sparse completion streams. |

Closed-loop example — readahead window auto-tuning:

  1. Policy service observes VfsObs::PageCacheMiss and VfsObs::ReadaheadTrigger on a per-BDI basis. High miss rate after readahead suggests the window is too small.
  2. Policy service computes the optimal readahead_pages using the PID controller (Section 23.1). Target metric: page cache hit rate > 95% for sequential workloads.
  3. Policy service sends a ParamAdjust { param_id: IoReadaheadPages, value: N } message.
  4. The VFS readahead engine reads the updated value via PARAM_STORE.get(IoReadaheadPages) on the next readahead evaluation (warm path, no hot-path overhead).
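The adjustment step can be sketched with a simple grow-and-clamp rule. This is an illustrative controller, not the Section 23.1 PID controller; the doubling heuristic and target value are assumptions for the example:

```rust
/// One tuning step for readahead_pages: below the target hit rate, grow
/// the window; otherwise hold. The result is always clamped to the
/// registered bounds for the parameter (1..=512 pages).
pub fn tune_readahead(current_pages: u32, hit_rate_pct: u32) -> u32 {
    const TARGET_PCT: u32 = 95; // target: >95% hit rate for sequential workloads
    let next = if hit_rate_pct < TARGET_PCT {
        current_pages.saturating_mul(2) // misses dominate: grow the window
    } else {
        current_pages                   // on target: hold steady
    };
    next.clamp(1, 512)                  // enforce registered min/max bounds
}
```

The cooldown period (30s for IoReadaheadPages) would rate-limit how often this step runs, preventing oscillation between consecutive adjustments.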

Phase assignment: VFS observation hooks are Phase 3 (functional without ML; ML provides optimization). Parameter registration is Phase 2 (parameters are readable by sysctl even without a policy service).

14.2 VFS Ring Buffer Protocol (Cross-Domain Dispatch)

The tier model (Section 11.3) requires ALL cross-domain communication to use ring buffer IPC. However, the FileSystemOps, InodeOps, and FileOps traits defined in Section 14.1 use direct Rust function call signatures. This section specifies how trait method calls are marshaled across the isolation domain boundary between umka-core (VFS layer) and Tier 1 filesystem drivers.

Architecture: Each mounted filesystem has a dedicated request/response ring pair:

/// Maximum inline I/O data size in bytes. Reads/writes at or below this
/// threshold carry data inline in the ring entry, avoiding DMA buffer
/// allocation and IOMMU mapping. Covers >90% of procfs/sysfs reads.
///
/// 192 bytes fits within the ring entry without bloating large-I/O variants
/// (the `VfsRequestArgs` union is already dominated by `SetXattr` at 288 bytes).
/// Saves ~150-300ns per small I/O by eliminating DMA alloc/free + IOMMU map/unmap.
pub const INLINE_IO_MAX: usize = 192;

/// Sentinel value for `DmaBufferHandle` indicating no DMA buffer is
/// allocated. Used by the inline small I/O path: when `buf == ZERO`,
/// data is carried inline in the ring entry (request's `inline_data` for
/// writes, response's `inline_data` for reads). The driver checks
/// `buf == DmaBufferHandle::ZERO` to select the inline path.
impl DmaBufferHandle {
    /// All-zero sentinel indicating no DMA buffer is allocated.
    /// Pool ID 0 is reserved as invalid and never allocated by the DMA
    /// buffer pool allocator ([Section 4.14](04-memory.md#dma-subsystem)). Combined with
    /// `iova_base = 0` (page 0 is never mapped by any IOMMU implementation),
    /// `ZERO` is guaranteed to never match any valid handle.
    pub const ZERO: Self = DmaBufferHandle { pool_id: 0, generation: 0, offset: 0, iova_base: 0 };
}

/// VFS-specific ring buffer. Extends `DomainRingBuffer`
/// ([Section 11.8](11-drivers.md#ipc-architecture-and-message-passing)) with per-slot state tracking
/// for the split reservation protocol (`EMPTY -> RESERVED -> FILLED ->
/// CONSUMED -> EMPTY`) and a per-ring in-flight operation counter for
/// crash recovery quiescence.
///
/// The `DomainRingBuffer` header occupies 128 bytes (2 cache lines).
/// The `slot_states` array is allocated contiguously after the ring data
/// region. The `inflight_ops` counter is used by crash recovery
/// (`drain_all_vfs_rings()`) and live evolution quiescence to wait for
/// producers to complete their current operations before draining.
///
/// **Relationship to DomainRingBuffer**: `RingBuffer<T>` composes (not
/// inherits) `DomainRingBuffer`. All ring pointer fields (`head`, `tail`,
/// `published`, `state`) are accessed through `inner`. The generic
/// parameter `T` is the entry type (`VfsRequest` or `VfsResponseWire`)
/// for type-safe entry access via `read_entry()`.
// kernel-internal, not KABI
pub struct RingBuffer<T> {
    /// Base ring buffer header (128 bytes) + data region.
    /// Contains `head`, `tail`, `published`, `state`, `size`, `entry_size`.
    pub inner: DomainRingBuffer,

    /// Per-slot state for the split reservation protocol.
    /// Length: `inner.size` entries. Allocated contiguously after the
    /// ring data region. Each entry is an `AtomicU8` holding one of
    /// the `RingSlotState` values (`Empty`, `Reserved`, `Filled`,
    /// `Consumed`).
    ///
    /// SAFETY: Pointer is valid for `inner.size` elements. Allocated
    /// from kernel slab at mount time, valid for the lifetime of the
    /// mount. Freed during umount after all rings are drained.
    pub slot_states: *const AtomicU8,

    /// Number of in-flight producer operations on this ring. Incremented
    /// in `reserve_slot()` after successful slot reservation (after CAS
    /// on `inner.head`). Decremented in `complete_slot()` after marking
    /// the slot `FILLED` and advancing `inner.published`.
    ///
    /// Used by crash recovery (`drain_all_vfs_rings()`) and live evolution
    /// quiescence to wait for all producers to complete before draining.
    /// The counter reaching zero guarantees no producer is between
    /// `reserve_slot()` and `complete_slot()` (i.e., no producer is
    /// mid-`copy_from_user` with a RESERVED slot).
    ///
    /// `AtomicU32`: maximum concurrent producers per ring is bounded by
    /// CPU count (≤256 for MAX_VFS_RINGS). `u32` is sufficient.
    pub inflight_ops: AtomicU32,

    /// Phantom type for entry type safety.
    _marker: core::marker::PhantomData<T>,
}

impl<T> RingBuffer<T> {
    /// Read a typed entry at the given slot index.
    ///
    /// SAFETY: `idx` must be < `self.inner.size`. The caller must ensure
    /// the slot contains valid data (slot_state == FILLED or the ring is
    /// being drained during crash recovery with all producers quiesced).
    pub unsafe fn read_entry(&self, idx: usize) -> &T {
        debug_assert!(idx < self.inner.size as usize);
        let data_base = (&self.inner as *const DomainRingBuffer as *const u8)
            .add(size_of::<DomainRingBuffer>());
        &*(data_base.add(idx * self.inner.entry_size as usize) as *const T)
    }
}

/// Per-mount ring buffer pair for VFS <-> filesystem driver communication.
///
/// The VFS (in umka-core) enqueues requests on `request_ring`; the filesystem
/// driver dequeues, processes, and enqueues responses on `response_ring`.
/// Both rings are in shared memory (PKEY 1 on x86-64 — read-only for both
/// domains; actual data in PKEY 14 shared DMA pool).
pub struct VfsRingPair {
    /// Request ring: VFS -> filesystem driver. Ring size: 256 entries
    /// (configurable per-mount via mount options).
    ///
    /// **Producer model**: Under PerCpu granularity, each ring has exactly
    /// one producer (pure SPSC). Under PerNuma/PerLlc/Fixed granularity,
    /// multiple CPUs may share a ring; the producer side uses a CAS loop
    /// on `head` for atomic slot reservation (see `reserve_slot()` in
    /// [Section 14.3](#vfs-per-cpu-ring-extension)). The consumer side is always single-
    /// threaded per ring (driver consumer thread).
    pub request_ring: RingBuffer<VfsRequest>,

    /// Whether this ring is shared by multiple CPUs. Set at mount time
    /// based on the ring granularity and CPU-to-ring mapping. When `true`,
    /// `reserve_slot()` uses a CAS loop on `head` for atomic slot
    /// allocation. When `false` (PerCpu mode), `reserve_slot()` uses a
    /// simple load/store on `head` (no contention possible).
    ///
    /// Invariant: `shared_ring == false` implies exactly one CPU maps to
    /// this ring in the `cpu_to_ring` table. This is verified at mount time.
    pub shared_ring: bool, // Kernel-internal, not KABI.

    /// Response ring: filesystem driver -> VFS. SPSC (driver produces, VFS
    /// consumes). Same size as request ring.
    pub response_ring: RingBuffer<VfsResponseWire>,

    /// Doorbell: filesystem driver writes to signal request availability.
    /// Uses the doorbell coalescing mechanism (Section 11.5.1.1) to batch
    /// notifications when multiple requests are enqueued.
    pub doorbell: DoorbellRegister,

    /// Completion WaitQueue: VFS threads wait here when a synchronous
    /// operation needs a response. Multiple threads may be blocked on
    /// the same WaitQueue simultaneously (one per in-flight request).
    ///
    /// **Response matching protocol** (request_id -> waiting thread):
    ///
    /// 1. **Submit**: The VFS caller allocates a `request_id` from
    ///    `VfsRingPair.next_request_id` (AtomicU64, fetch_add(1, Relaxed)),
    ///    stores it in `VfsRequest.request_id`, enqueues the request on
    ///    the request ring, and parks on `completion` via `wait_event!`.
    ///
    /// 2. **Wait condition**: The caller's `wait_event!` condition checks:
    ///    ```rust
    ///    wait_event!(ring.completion, {
    ///        // Check for crash recovery (ring set entered RECOVERING state).
    ///        if ring_set.state.load(Acquire) == VFSRS_RECOVERING {
    ///            return Err(EIO);
    ///        }
    ///        // Check if our response has been deposited in the response table.
    ///        ring.response_table.contains(our_request_id)
    ///    });
    ///    ```
    ///
    /// 3. **Completion**: The VFS response consumer thread (running in
    ///    Tier 0) drains the response ring, reads each `VfsResponseWire`,
    ///    and deposits it into `response_table` keyed by `request_id`.
    ///    After depositing, it calls `completion.wake_up_all()`. Each
    ///    woken thread re-evaluates its wait_event condition: only the
    ///    thread whose `request_id` is in `response_table` proceeds.
    ///    Others re-sleep. This is the standard "thundering herd with
    ///    condition recheck" pattern — acceptable because VFS rings are
    ///    per-CPU (PerCpu mode: one thread at a time) or per-NUMA
    ///    (bounded contention).
    ///
    /// 4. **Retrieval**: The woken thread calls
    ///    `response_table.remove(our_request_id)` to extract its
    ///    `VfsResponseWire`, processes the result, and returns.
    pub completion: WaitQueue,

    /// Per-ring response table: maps request_id -> VfsResponseWire.
    /// Used by the response matching protocol to deliver responses to
    /// the correct waiting thread. XArray keyed by request_id (u64).
    ///
    /// Populated by the VFS response consumer thread (Tier 0) as it
    /// drains the response ring. Consumed by waiting threads after
    /// `wake_up_all()` signals availability.
    ///
    /// Bounded size: at most `request_ring.inner.size` entries (one per
    /// in-flight request slot). Entries are removed by the waiting thread
    /// after retrieval, so the table does not grow unboundedly.
    pub response_table: XArray<VfsResponseWire>,

    /// Monotonically increasing request ID allocator. Each VFS caller
    /// gets a unique request_id for response matching.
    /// u64: at 10^9 requests/sec, wraps in 584 years.
    pub next_request_id: AtomicU64,
}

/// VFS request message. Serialized representation of a trait method call.
/// Fixed-size header + variable-length payload.
///
/// **Layout**: The header fields (`request_id`, `opcode`, `ino`, `fh`) are
/// followed by the tagged-union `args` payload. The `opcode` field is `u32`
/// (from `VfsOpcode`); `_pad_opcode` provides explicit padding to maintain
/// natural alignment of the subsequent `u64` fields. This prevents
/// information disclosure via implicit compiler-inserted padding bytes.
///
/// **Size**: Header is 32 bytes + `VfsRequestArgs` (largest variant is
/// `SetXattr` at ~280 bytes). Total entry size ~320 bytes (see per-CPU ring
/// extension memory analysis). `const_assert!` below verifies the header.
#[repr(C)]
pub struct VfsRequest {
    /// Unique request ID for matching responses. Globally unique per mount
    /// (allocated from `VfsRingSet::next_request_id`). IDs are unique but
    /// not necessarily monotonic within a single ring — two CPUs sharing a
    /// ring may allocate IDs before ring slot assignment, so a lower ID can
    /// appear in a later slot. The protocol uses IDs for response matching
    /// only, not ordering. u64 counter: at 10M ops/sec, wraps after
    /// ~58,000 years (well beyond the 50-year uptime target). No wrap
    /// handling needed.
    pub request_id: u64,

    /// Operation code identifying the trait method.
    pub opcode: VfsOpcode,

    /// Explicit padding after the u32 `opcode` to align `ino` to 8 bytes.
    /// Must be zero. Prevents information disclosure from implicit padding.
    pub _pad_opcode: u32,

    /// Inode number (for InodeOps/FileOps calls). 0 for FileSystemOps calls.
    pub ino: u64,

    /// File handle (for FileOps calls). u64::MAX for non-file operations.
    pub fh: u64,

    /// Operation-specific arguments. The variant must match `opcode`.
    /// Variable-length data (filenames, xattr values, write data) is
    /// passed via shared DMA buffer references embedded in the variant,
    /// not stored inline in the ring entry.
    ///
    /// The VFS dispatcher validates that the `args` variant matches
    /// `opcode` before dispatching; a mismatch is a kernel bug and
    /// triggers a panic in debug builds, a silent no-op error response
    /// in release builds.
    pub args: VfsRequestArgs,
}
// Verify header layout: request_id(8) + opcode(4) + pad(4) + ino(8) + fh(8) = 32.
const_assert!(core::mem::offset_of!(VfsRequest, args) == 32);
// Verify total size: header(32) + VfsRequestArgs(288, largest variant SetXattr) = 320.
// VfsRequestArgs = 4 (discriminant) + 256 (KernelString) + 4 (padding) + 16 (DmaBufferHandle)
//                + 4 (value_len) + 4 (flags) = 288 bytes, aligned to 8.
//
// Memory tradeoff: 320 bytes per entry × 256 entries × N rings = ~80 KiB per ring.
// At 64 CPUs × 256 entries = ~5 MiB per mount. The inline_data optimization
// (192 bytes embedded in the Write variant) eliminates DMA alloc + IOMMU mapping
// for small I/O (saving ~150-300 ns per operation). Ring memory is pinned DMA
// pages pre-allocated from the shared DMA pool — these pages are committed
// regardless of entry size and cannot be used for other purposes.
const_assert!(core::mem::size_of::<VfsRequest>() == 320);

/// Per-opcode argument payload for a `VfsRequest`.
///
/// `#[repr(C, u32)]` tagged union: the discriminant is a `u32` matching
/// `VfsOpcode`, and each variant is an independent `#[repr(C)]` struct.
/// This ensures a stable ABI across the Tier 0 / Tier 1 KABI boundary
/// (zero-copy ring, matching the io_uring SQE pattern). The `VfsRequest.opcode`
/// field in the header serves as the authoritative discriminant; the
/// in-union discriminant is redundant but guarantees Rust's safety
/// invariant (no invalid discriminant UB).
///
/// Every `VfsOpcode` variant has a corresponding `VfsRequestArgs` variant
/// with the exact parameters that the trait method requires. Variants
/// that carry no extra data beyond what is already in the `VfsRequest`
/// header (opcode, ino, fh) use an empty body `{}`.
///
/// **Inline string limits**: `KernelString` holds up to 255 bytes. Names
/// longer than 255 bytes (possible on some exotic filesystems) must be
/// passed via a `DmaBufferHandle` placed in the `buf` field of the
/// relevant variant; the VFS sets the string `len` to 0 as a sentinel in
/// that case.
///
/// **Caller contract**: The caller fills `VfsRequest { opcode, args, .. }`
/// and enqueues it on `request_ring`. The VFS dispatcher validates that
/// the `args` variant matches `opcode` before dispatching to the
/// filesystem driver.
#[repr(C, u32)]
pub enum VfsRequestArgs {
    // ---------------------------------------------------------------
    // FileSystemOps
    // ---------------------------------------------------------------

    /// `FileSystemOps::mount`. No extra args; mount options are passed
    /// via a separate `DmaBufferHandle` in the ring header.
    Mount {},
    /// `FileSystemOps::unmount`. Graceful unmount; all dirty data must
    /// be flushed before the response is sent.
    Unmount {},
    /// `FileSystemOps::force_unmount`. Best-effort: abandon in-flight
    /// I/O and free resources.
    ForceUnmount {},
    /// `FileSystemOps::statfs`. No per-call arguments.
    Statfs {},
    /// `FileSystemOps::sync_fs`. `wait` controls whether the driver
    /// must block until all I/O is complete (`true`) or may return once
    /// I/O is queued (`false`).
    SyncFs { wait: u8 }, // 0 = no-wait, 1 = wait. u8 for cross-domain safety.
    /// `FileSystemOps::remount`. New flags; updated option string is in
    /// a `DmaBufferHandle` in the ring header.
    Remount { flags: u32 },
    /// `FileSystemOps::freeze`. Quiesce all writes for snapshotting.
    Freeze {},
    /// `FileSystemOps::thaw`. Resume writes after a freeze.
    Thaw {},

    // ---------------------------------------------------------------
    // InodeOps
    // ---------------------------------------------------------------

    /// `InodeOps::lookup`. Look up `name` in the directory identified
    /// by `VfsRequest::ino`.
    Lookup { name: KernelString },
    /// `InodeOps::create`. Create a regular file named by the dentry
    /// already allocated by the VFS. `mode` is the combined file-type
    /// and permission bits.
    Create { mode: FileMode },
    /// `InodeOps::link`. Create a hard link whose new name is
    /// `new_name` inside the directory inode of the request.
    Link { src_ino: u64, new_name: KernelString },
    /// `InodeOps::unlink`. Remove a directory entry. The inode is freed
    /// when its link count reaches zero and all file descriptors are
    /// closed.
    Unlink { name: KernelString },
    /// `InodeOps::mkdir`. Create a directory with the given permission
    /// bits.
    Mkdir { mode: FileMode },
    /// `InodeOps::rmdir`. Remove an empty directory.
    Rmdir { name: KernelString },
    /// `InodeOps::rename`. Move or rename a directory entry.
    /// `new_dir_ino` is the inode number of the destination directory.
    /// `new_name` is the destination name. `flags` carries `RENAME_*`
    /// constants (e.g., `RENAME_NOREPLACE`, `RENAME_EXCHANGE`).
    Rename { new_dir_ino: u64, new_name: KernelString, flags: u32 },
    /// `InodeOps::symlink`. Create a symbolic link whose target path is
    /// `target`. The created inode is named by the dentry pre-allocated
    /// by the VFS.
    Symlink { target: KernelString },
    /// `InodeOps::readlink`. Resolve the symlink target into
    /// `buf`. The driver writes the target string into the DMA buffer
    /// identified by `buf`.
    Readlink { buf: DmaBufferHandle },
    /// `InodeOps::mknod`. Create a special file (block device, character
    /// device, FIFO, or socket). `dev` carries the (major, minor) pair
    /// using the `DevId` type with Linux MKDEV encoding: `(major << 20) | minor`.
    /// See [Section 14.5](#device-node-framework) for encoding details.
    Mknod { mode: FileMode, dev: DevId },
    /// `InodeOps::getattr`. Retrieve inode attributes into an
    /// `InodeAttr`. `request_mask` is a bitmask of `STATX_*` fields the
    /// caller wants. `flags` is `AT_*` flags from `statx(2)`.
    GetAttr { request_mask: u32, flags: u32 },
    /// `InodeOps::setattr`. Modify inode attributes. `valid` is a
    /// bitmask of `ATTR_*` flags indicating which fields in `attr` the
    /// driver must update.
    SetAttr { attr: InodeAttr, valid: u32 },
    /// `InodeOps::truncate`. Set the file size to `size` bytes,
    /// releasing or zero-extending as needed.
    Truncate { size: u64 },
    /// `InodeOps::getxattr`. Retrieve the extended attribute `name` into
    /// `buf`. On return, the response `status` field (>= 0) carries the
    /// attribute value length.
    GetXattr { name: KernelString, buf: DmaBufferHandle },
    /// `InodeOps::setxattr`. Set extended attribute `name` to `value`.
    /// `flags` is `XATTR_CREATE`, `XATTR_REPLACE`, or 0.
    SetXattr { name: KernelString, value: DmaBufferHandle, value_len: u32, flags: u32 },
    /// `InodeOps::listxattr`. Enumerate all extended attribute names into
    /// `buf` as a sequence of NUL-terminated strings. On return, the
    /// response `status` field (>= 0) carries the total length written.
    ListXattr { buf: DmaBufferHandle },
    /// `InodeOps::removexattr`. Delete the extended attribute `name`.
    RemoveXattr { name: KernelString },
    /// `FileSystemOps::show_options`. Write the filesystem-specific
    /// mount options (as they would appear in `/proc/mounts`) into
    /// `buf`.
    ShowOptions { buf: DmaBufferHandle },

    // ---------------------------------------------------------------
    // AddressSpaceOps (page cache → filesystem driver)
    // ---------------------------------------------------------------

    /// `AddressSpaceOps::read_page`. Populate one page from backing
    /// store on a page cache miss. `page_index` is the page-aligned
    /// file offset divided by `PAGE_SIZE`. The driver reads data from
    /// the backing block device and writes it into the DMA buffer
    /// identified by `buf` (exactly `PAGE_SIZE` bytes). The page has
    /// already been allocated and inserted into the page cache by the
    /// caller; the driver only needs to fill it.
    ///
    /// `page_cache_id`: Core-resident handle identifying the
    /// `AddressSpace`/`PageCache` that owns this page. Set by the VFS
    /// dispatch path (in Core, Tier 0) before enqueuing. Used by crash
    /// recovery to resolve orphaned pages WITHOUT traversing VFS-domain
    /// state (the inode cache is in VFS/Tier 1 and may be corrupted).
    /// The handle is the `AddressSpace` pointer cast to `u64` — valid
    /// because `AddressSpace` is in Core memory and pinned for the
    /// lifetime of the superblock.
    ReadPage { page_index: u64, page_cache_id: u64, buf: DmaBufferHandle },
    /// `AddressSpaceOps::readahead`. Batch read for the readahead
    /// engine ([Section 4.4](04-memory.md#page-cache--readahead-engine)). `start_index` is
    /// the first page index; `nr_pages` is the count. The driver
    /// should submit I/O for the entire range in a single Bio batch.
    /// `buf` is a DMA buffer large enough for `nr_pages * PAGE_SIZE`
    /// bytes. Pages have been pre-allocated and cache-inserted by the
    /// readahead engine; the driver fills them sequentially.
    /// Filesystems that do not implement batched readahead return
    /// `EOPNOTSUPP`; the VFS falls back to per-page `ReadPage` calls.
    ///
    /// `page_cache_id`: Same semantics as `ReadPage::page_cache_id`.
    /// All pages in the readahead batch belong to the same
    /// `AddressSpace`.
    Readahead { start_index: u64, nr_pages: u32, page_cache_id: u64, buf: DmaBufferHandle },
    /// `AddressSpaceOps::writepage`. Write a single dirty page to
    /// the backing store. Used by the page reclaimer when it needs to
    /// evict a dirty page. For normal writeback, the `WritebackRequest`
    /// ring ([Section 4.6](04-memory.md#writeback-subsystem--writeback-domain-crossing-tier-0---tier-1))
    /// is used instead (batched, higher throughput). `writepage` is
    /// the single-page fallback for reclaim pressure.
    WritePage { page_index: u64, buf: DmaBufferHandle, sync_mode: u8 },
    /// `AddressSpaceOps::dirty_extent`. Notify the filesystem that a
    /// page range is about to be dirtied. The filesystem records the
    /// affected extent for crash-recovery journaling. `offset` and
    /// `len` are byte offsets within the file.
    DirtyExtent { offset: u64, len: u64 },
    /// `AddressSpaceOps::releasepage`. Ask the filesystem whether a
    /// clean page may be evicted from the cache. The driver responds
    /// with `ok = true` (permit eviction) or `ok = false` (page is
    /// pinned for journaling or other reasons). Must not block.
    ReleasePage { page_index: u64 },

    // ---------------------------------------------------------------
    // FileOps
    // ---------------------------------------------------------------

    /// `FileOps::open`. Open the file. `flags` are the `O_*` open
    /// flags from `open(2)`/`openat(2)`. `mode` is relevant only when
    /// `O_CREAT` is set.
    Open { flags: u32, mode: FileMode },
    /// `FileOps::release`. The last reference to this open file
    /// descriptor has been closed. The driver must flush any cached
    /// state for `fh`.
    Release {},
    /// `FileOps::read`. Read up to `count` bytes starting at `offset`
    /// from the file into `buf`. The driver writes data into the DMA
    /// buffer identified by `buf`. On return, `VfsResponseWire::status`
    /// (>= 0) carries the number of bytes actually read.
    ///
    /// **Inline small I/O path**: If `count <= INLINE_IO_MAX` (192 bytes),
    /// the VFS sets `buf` to `DmaBufferHandle::ZERO` (sentinel: no DMA
    /// buffer allocated). The driver writes read data into the response's
    /// `inline_data` field instead of a DMA buffer. This eliminates DMA
    /// alloc/free + IOMMU map/unmap for small reads (procfs, sysfs, small
    /// config files). Saves ~150-300ns per small read. Covers >90% of
    /// procfs/sysfs reads. See `VfsResponseWire::inline_data`.
    Read { buf: DmaBufferHandle, offset: u64, count: u32 },
    /// `FileOps::write`. Write `count` bytes from `buf` into the file
    /// starting at `offset`. `buf` points to a DMA buffer the VFS has
    /// already filled with the data to be written.
    ///
    /// **Inline small I/O path**: If `count <= INLINE_IO_MAX` (192 bytes),
    /// the VFS places write data inline in `inline_data` and sets `buf` to
    /// `DmaBufferHandle::ZERO` (sentinel). The driver reads from
    /// `inline_data` instead of a DMA buffer. The `copy_from_user()` that
    /// fills `inline_data` happens after slot reservation but before
    /// `complete_slot()` — enabled by the split reservation/completion
    /// protocol ([Section 14.3](#vfs-per-cpu-ring-extension)).
    Write { buf: DmaBufferHandle, offset: u64, count: u32,
            inline_data: [u8; INLINE_IO_MAX] },
    /// `FileOps::fsync`. Flush dirty data and metadata to stable
    /// storage. If `datasync` is 1, only data blocks need to be
    /// flushed (equivalent to `fdatasync(2)`). `start`..`end` is the
    /// byte range to sync; `end == u64::MAX` means "to end of file".
    Fsync { datasync: u8, start: u64, end: u64 }, // 0 = fsync, 1 = fdatasync. u8 for cross-domain safety.
    /// `FileOps::readdir`. Enumerate directory entries into `buf`
    /// starting after the position identified by `cookie`. A `cookie` of
    /// 0 means start from the beginning. The driver fills `buf` with
    /// `linux_dirent64` records. `VfsResponseWire::status` (>= 0) carries
    /// the number of bytes written.
    ReadDir { buf: DmaBufferHandle, cookie: u64 },
    /// `FileOps::ioctl`. Pass a device-specific command to the
    /// filesystem driver. `cmd` is the ioctl number; `arg` is the raw
    /// usize argument (may be a user pointer, a small integer, or a
    /// `DmaBufferHandle` depending on the command).
    Ioctl { cmd: u32, arg: usize },
    /// `FileOps::mmap`. Establish a memory mapping. `vma_token` is an
    /// opaque handle the VFS passes to the driver to identify the
    /// virtual memory area; the driver uses it to call back into the
    /// VFS to install PTEs via the KABI page-fault callback.
    Mmap { vma_token: u64, prot: u32, flags: u32 },
    /// `FileOps::fallocate`. Pre-allocate or manipulate storage for the
    /// given byte range. `mode` carries `FALLOC_FL_*` flags.
    Fallocate { mode: u32, offset: u64, len: u64 },
    /// `FileOps::seek_data`. Find the next byte range containing data
    /// at or after `offset` (implements `SEEK_DATA` from `lseek(2)`).
    SeekData { offset: u64 },
    /// `FileOps::seek_hole`. Find the next hole (unallocated range) at
    /// or after `offset` (implements `SEEK_HOLE` from `lseek(2)`).
    SeekHole { offset: u64 },
    /// `FileOps::poll`. Query which I/O events are ready. `events` is
    /// a bitmask of `POLLIN`, `POLLOUT`, `POLLERR`, etc. The driver
    /// responds immediately with the currently ready events; the VFS
    /// handles `epoll`/`select` wait registration separately.
    Poll { events: u32 },
    /// `FileOps::splice_read`. Transfer up to `len` bytes from the file
    /// at `offset` into an in-kernel pipe identified by `pipe_ino`,
    /// without copying through userspace. `flags` carries `SPLICE_F_*`
    /// flags.
    SpliceRead { pipe_ino: u64, offset: u64, len: u32, flags: u32 },
    /// `FileOps::splice_write`. Transfer up to `len` bytes from the
    /// in-kernel pipe identified by `pipe_ino` into the file at
    /// `offset`. `flags` carries `SPLICE_F_*` flags.
    SpliceWrite { pipe_ino: u64, offset: u64, len: u32, flags: u32 },

    // ---------------------------------------------------------------
    // Inode lifecycle operations
    // ---------------------------------------------------------------

    /// `InodeOps::evict_inode`. Inode eviction — free on-disk resources.
    /// Sent after the VFS has completed page cache teardown. The inode
    /// number is in `VfsRequest::ino`; no extra arguments are needed.
    /// The driver MUST release all on-disk resources (extent tree entries,
    /// block allocations, journal reservations) associated with the inode.
    /// The response is `VfsResponse::Ok(0)` on success or
    /// `VfsResponse::Err(-errno)` on failure (which is logged but does not
    /// prevent inode freeing — the VFS continues eviction regardless).
    EvictInode {},

    /// `InodeOps::truncate_range`. Deallocate blocks within
    /// `[offset, offset+len)` without changing file size.
    /// Used by `FALLOC_FL_PUNCH_HOLE`, `FALLOC_FL_ZERO_RANGE`, and
    /// `FALLOC_FL_COLLAPSE_RANGE`. Separate from `Truncate` (which sets
    /// `i_size` via `setattr`). The VFS evicts page cache pages in the
    /// affected range before sending this request.
    TruncateRange { offset: u64, len: u64 },

    /// `InodeOps::write_inode`. Flush inode metadata to stable storage.
    /// Called by `vfs_fsync_metadata()` for O_SYNC/O_DSYNC writes when the
    /// inode's on-disk metadata must be updated (timestamps, size, block map).
    /// `sync_mode`: 0 = `WB_SYNC_NONE` (schedule I/O, don't wait),
    /// 1 = `WB_SYNC_ALL` (wait for I/O completion). u8 for cross-domain safety.
    WriteInode { sync_mode: u8 },

    // ---------------------------------------------------------------
    // Batched metadata operations (io_uring coalescing)
    // ---------------------------------------------------------------

    /// Batched `statx()` request. Generated by the io_uring dispatch path
    /// when consecutive `IORING_OP_STATX` SQEs are detected. The VFS
    /// resolves all paths in a single domain stay. Never sent by
    /// filesystem drivers.
    /// See [Section 14.1](#virtual-filesystem-layer--mechanism-2-io_uring-statx-coalescing).
    StatxBatch { count: u8, entries_buf: DmaBufferHandle },
}

/// Bounded kernel-internal string. Avoids heap allocation for the common
/// case of short names (directory entries, xattr names, symlink targets
/// ≤ 255 bytes).
///
/// For strings longer than 255 bytes the caller must use a
/// `DmaBufferHandle` instead and set `len = 0` as a sentinel.
#[repr(C)]
pub struct KernelString {
    /// Byte length of the string, not including any NUL terminator.
    /// Range: 0 (sentinel for "use DMA buffer") to 255.
    pub len: u8,
    /// Inline storage. Valid bytes are `data[..len]`. The remainder
    /// is zero-padded. Not NUL-terminated; callers must use `len`.
    pub data: [u8; 255],
}
// Layout: 1 + 255 = 256 bytes.
const_assert!(size_of::<KernelString>() == 256);

/// VFS operation codes. One-to-one mapping to trait methods.
#[repr(u32)]
pub enum VfsOpcode {
    // FileSystemOps
    Mount = 1,
    Unmount = 2,
    ForceUnmount = 3,
    Statfs = 4,
    SyncFs = 5,
    Remount = 6,
    Freeze = 7,
    Thaw = 8,
    ShowOptions = 37,    // → FileSystemOps::show_options; called by /proc/mounts, mount(8)

    // InodeOps
    Lookup = 20,
    Create = 21,
    Link = 22,
    Unlink = 23,
    Mkdir = 24,
    Rmdir = 25,
    Rename = 26,
    Symlink = 27,
    Readlink = 28,
    Getattr = 29,
    Setattr = 30,
    Truncate = 35,
    Getxattr = 31,
    Setxattr = 32,
    Listxattr = 33,
    Removexattr = 34,
    Mknod = 36,          // → InodeOps::mknod; called by mknod(2) for device nodes
    EvictInode = 38,     // → InodeOps::evict_inode; called when inode's last reference drops
    TruncateRange = 39,  // → InodeOps::truncate_range; FALLOC_FL_PUNCH_HOLE/ZERO_RANGE
                         //   Separate from Truncate (35) which sets i_size via setattr.
                         //   TruncateRange deallocates blocks within [offset, offset+len)
                         //   without changing file size.
    WriteInode = 55,     // → InodeOps::write_inode; flush inode metadata to stable storage
                         //   Called by vfs_fsync_metadata() for O_SYNC/O_DSYNC writes.

    // AddressSpaceOps (page cache ↔ filesystem)
    ReadPage = 60,       // → AddressSpaceOps::read_page; page cache miss
    Readahead = 61,      // → AddressSpaceOps::readahead; batched readahead
    WritePage = 62,      // → AddressSpaceOps::writepage; reclaim single-page writeback
    DirtyExtent = 63,    // → AddressSpaceOps::dirty_extent; journal pre-registration
    ReleasePage = 64,    // → AddressSpaceOps::releasepage; reclaim eviction check

    // FileOps
    Open = 40,
    Release = 41,
    Read = 42,
    Write = 43,
    Fsync = 44,
    Readdir = 45,
    Ioctl = 46,
    Mmap = 47,
    Fallocate = 48,
    SeekData = 49,
    SeekHole = 50,
    Poll = 51,
    SpliceRead = 52,     // → FileOps::splice_read; called by splice(2), sendfile(2)
    SpliceWrite = 53,    // → FileOps::splice_write; called by splice(2) write side

    // Batched metadata operations (io_uring coalescing)
    // These opcodes are generated only by the io_uring statx coalescing path
    // ([Section 14.1](#virtual-filesystem-layer--mechanism-2-io_uring-statx-coalescing)).
    // They are never exposed to filesystem drivers directly — the VFS
    // dispatches individual Getattr calls internally for each batch entry.
    StatxBatch = 70,       // → Batched statx; args in DmaBufferHandle as StatxBatchEntry[]
    StatxBatchResult = 71, // → Response-only opcode (no VfsRequestArgs variant). Carries
                         //   per-entry StatxBuf or error as a batched response payload.
                         //   Used only by VFS internal response routing; never sent on
                         //   the request ring.
}

/// VFS response message — wire-level representation on the response ring.
///
/// Every request placed on the `request_ring` eventually produces exactly one
/// `VfsResponseWire` on the paired `response_ring`. The `request_id` field
/// matches the request it completes, enabling out-of-order completion.
///
/// **Status encoding**: `status` is a signed 64-bit value.
/// - `status >= 0`: Success. For data-transfer operations (`Read`, `Write`,
///   `Readdir`, `ReadPage`, `Readahead`, `SpliceRead`, `SpliceWrite`), the
///   value is the byte count transferred. For operations that return a new
///   inode (`Lookup`, `Create`, `Mkdir`, `Symlink`, `Mknod`), the value
///   is the new inode number. For all other operations, the value is 0.
/// - `status == -4095..-1`: Error. The negated Linux errno (e.g., `-2` for
///   `ENOENT`). Matches the kernel's standard error encoding.
/// - `status == i64::MIN` (`0x8000_0000_0000_0000`): Pending — the driver
///   has acknowledged the request but not yet completed it. The VFS must
///   continue waiting for the final response. At most one `Pending` response
///   per request is permitted.
///
/// **Size**: Header is 40 bytes (8 + 8 + 8 + 8 + 4 + 4). For responses
/// carrying inline read data (small I/O path), `inline_data` adds up to
/// `INLINE_IO_MAX` (192) bytes. Total response entry: 256 bytes
/// (40 header + 192 inline_data + 24 padding, aligned to 256 for cache
/// efficiency on response ring). For responses without inline data
/// (large I/O, non-read operations), `inline_data_len` is 0 and the
/// consumer can skip the inline data region.
#[repr(C, align(256))]
pub struct VfsResponseWire {
    /// Request ID this response completes. Matches `VfsRequest::request_id`.
    pub request_id: u64,

    /// Driver generation counter at the time this response was produced.
    /// The VFS discards responses whose generation does not match the
    /// current `sb.driver_generation` (stale responses from a pre-crash
    /// driver instance). See Step 5.5 below.
    pub driver_generation: u64,

    /// Status code: >= 0 for success (byte count or inode number),
    /// negative for error (negated errno), `i64::MIN` for Pending.
    pub status: i64,

    /// Operation-specific supplementary data. Currently used by:
    /// - `Lookup`/`Create`/`Mkdir`/`Symlink`/`Mknod`: inode generation
    ///   counter in `aux[0]` (u32, for NFS file handle staleness detection).
    /// - `Open`: filesystem-private file handle in `status` (u64, stored
    ///   by VFS in `OpenFile::private_data`).
    /// - `GetAttr`: `STATX_*` result mask in `aux[0]`.
    /// - `ReleasePage`: `aux[0]` = 1 if eviction is permitted, 0 if denied.
    /// - All other operations: `aux` is zero.
    pub aux: [u32; 2],

    /// Number of valid bytes in `inline_data`. 0 for non-inline responses.
    /// Range: 0..=INLINE_IO_MAX (192). When > 0, the VFS copies
    /// `inline_data[..inline_data_len]` directly to userspace, bypassing
    /// the DMA buffer entirely.
    pub inline_data_len: u32,

    /// Padding after inline_data_len to maintain 8-byte alignment.
    pub _pad_len: u32,

    /// Inline read data for small I/O responses. Used when the original
    /// `Read` request had `count <= INLINE_IO_MAX` and `buf == ZERO`.
    /// The filesystem driver writes read data here instead of into a DMA
    /// buffer. Eliminates DMA alloc/free + IOMMU map/unmap for small reads.
    /// For non-inline responses, this region is unused (content undefined).
    pub inline_data: [u8; INLINE_IO_MAX],

    /// Padding to fill the 256-byte struct size mandated by `#[repr(C, align(256))]`.
    /// 8 + 8 + 8 + 8 + 4 + 4 + 192 = 232 bytes of fields. 256 - 232 = 24 bytes pad.
    /// The `align(256)` attribute ensures each response entry is cache-line-aligned
    /// and power-of-two sized for efficient ring indexing (index × 256 = byte offset).
    pub _pad: [u8; 24],
}
const_assert!(core::mem::size_of::<VfsResponseWire>() == 256);

Dispatch flow (read syscall example):

  1. Userspace calls read(fd, buf, len).
  2. Syscall entry point resolves fd to a ValidatedCap (Section 9.1).
  3. VFS checks the page cache (Section 4.4). On cache HIT: data is served from core memory with zero domain crossings. On cache MISS: continue.
  4. VFS constructs a VfsRequest:
     • Large I/O (count > INLINE_IO_MAX): { opcode: Read, buf: DmaBufferHandle, offset, count }. The buf is a DmaBufferHandle pointing to a shared-memory region where the driver will write the read data (zero-copy).
     • Small I/O (count <= INLINE_IO_MAX): { opcode: Read, buf: DmaBufferHandle::ZERO, offset, count }. No DMA buffer is allocated. The driver writes data into VfsResponseWire::inline_data.
  5. VFS enqueues the request on request_ring and rings the doorbell.
  6. The filesystem driver (in its Tier 1 domain) dequeues the request. It checks buf == DmaBufferHandle::ZERO to select the path:
     • DMA path: reads via BlockDevice, writes data to the shared DMA buffer.
     • Inline path: reads from page cache or block device, writes data into VfsResponseWire::inline_data[..inline_data_len] and sets inline_data_len.
  7. Driver enqueues a VfsResponseWire { request_id, status, inline_data_len, ... } on response_ring.
  8. VFS dequeues the response:
     • DMA path: populates the page cache, copies data from DMA buffer to userspace.
     • Inline path: copies inline_data[..inline_data_len] directly to userspace. No DMA buffer to free, no IOMMU unmap. Saves ~150-300ns per small read.
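A driver-side sketch of the inline-versus-DMA path selection above. The types here (`DmaBufferHandle`, `ReadReq`, `RespWire`, `handle_read`) are simplified stand-ins for the wire structs, not the real definitions; slices model the file contents and the shared DMA pool:

```rust
// Stand-in constants and types; the real wire structs are defined above.
const INLINE_IO_MAX: usize = 192;

#[derive(PartialEq, Clone, Copy)]
struct DmaBufferHandle(u64);
impl DmaBufferHandle {
    const ZERO: DmaBufferHandle = DmaBufferHandle(0);
}

struct ReadReq {
    buf: DmaBufferHandle, // ZERO sentinel selects the inline path
    offset: u64,
    count: u32, // VFS guarantees count <= INLINE_IO_MAX on the inline path
}

struct RespWire {
    status: i64, // >= 0: bytes read
    inline_data_len: u32,
    inline_data: [u8; INLINE_IO_MAX],
}

/// Illustrative driver read handler: inline path when the VFS sent no
/// DMA buffer (buf == ZERO), otherwise the zero-copy DMA path.
fn handle_read(req: &ReadReq, file_data: &[u8], dma_pool: &mut [u8]) -> RespWire {
    let start = (req.offset as usize).min(file_data.len());
    let end = (start + req.count as usize).min(file_data.len());
    let src = &file_data[start..end];

    let mut resp = RespWire {
        status: src.len() as i64,
        inline_data_len: 0,
        inline_data: [0; INLINE_IO_MAX],
    };
    if req.buf == DmaBufferHandle::ZERO {
        // Inline small-I/O path: write directly into the response entry.
        resp.inline_data[..src.len()].copy_from_slice(src);
        resp.inline_data_len = src.len() as u32;
    } else {
        // DMA path: write into the shared buffer identified by req.buf
        // (modeled here as a plain mutable slice).
        dma_pool[..src.len()].copy_from_slice(src);
    }
    resp
}
```

The single `buf == ZERO` branch is the entire protocol switch; the driver never allocates or frees the DMA buffer itself.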

Key design properties:

  • Page cache absorbs most I/O: Only cache misses cross the domain boundary. On a warm cache (common for frequently accessed files), read() has zero domain crossings — data is served directly from core memory. This is why the page cache lives in umka-core, not in the filesystem driver.
  • Zero-copy data path: Read/write data is transferred via shared DMA buffer handles, not copied into the ring buffer. The ring carries only the metadata (opcode, offsets, lengths, buffer handles). Data pages are in the shared DMA pool (PKEY 14 / domain 2).
  • Batching: The doorbell coalescing mechanism (Section 11.5.1.1) batches multiple requests into a single domain switch. readahead() enqueues multiple read requests before ringing the doorbell once.
  • Trait interface as specification: The FileSystemOps, InodeOps, FileOps, and AddressSpaceOps traits defined in Section 14.1 serve as the SPECIFICATION of the ring protocol. Each trait method maps to exactly one VfsOpcode. The trait signatures define the arguments; the ring protocol serializes them into VfsRequestArgs. Filesystem driver developers implement the traits; the KABI code generator (Section 12.1) produces the serialization/deserialization stubs.
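The status encoding of `VfsResponseWire` can be decoded mechanically. A minimal illustrative sketch (the `VfsStatus` enum and `decode_status` helper are not part of the specified ABI, just one way a consumer could classify the field):

```rust
/// Decoded view of VfsResponseWire::status per the encoding rules:
/// >= 0 success, -4095..=-1 negated errno, i64::MIN pending.
#[derive(Debug, PartialEq)]
enum VfsStatus {
    Ok(u64),   // byte count or new inode number, operation-dependent
    Err(i32),  // positive errno (e.g., 2 for ENOENT)
    Pending,   // driver acknowledged but has not completed the request
    Malformed, // value outside all specified ranges
}

fn decode_status(status: i64) -> VfsStatus {
    match status {
        s if s >= 0 => VfsStatus::Ok(s as u64),
        s if (-4095..=-1).contains(&s) => VfsStatus::Err((-s) as i32),
        s if s == i64::MIN => VfsStatus::Pending,
        _ => VfsStatus::Malformed,
    }
}
```

Treating out-of-range values as `Malformed` rather than as errors matters across a trust boundary: a compromised driver must not be able to smuggle an unclassifiable status through the consumer.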

VFS Ring Error Handling and Cancellation:

Every cross-domain VFS request is subject to timeout, cancellation, and driver crash handling. This section specifies the complete lifecycle of a request that does not complete normally.

1. Timeout: Every VFS request has a per-operation timeout based on the expected latency class of the operation:

| Timeout class | Operations | Default timeout |
| ------------- | ---------- | --------------- |
| Regular | Read, Write, Stat, Lookup, Create, Open, Release, Getattr, Setattr, Readdir, Readlink, Link, Unlink, Mkdir, Rmdir, Rename, Symlink, Getxattr, Setxattr, Listxattr, Removexattr, Mmap, SeekData, SeekHole, Poll, Ioctl | 30 seconds |
| Slow | Fsync, Truncate, Fallocate | 120 seconds |
| Mount | Mount, Unmount, ForceUnmount, Remount, Statfs, SyncFs, Freeze, Thaw | 300 seconds |

The kernel VFS layer starts a per-request timer when the request is enqueued on the request_ring. If the timer fires before a VfsResponse::Ok or VfsResponse::Err arrives on the response_ring, the kernel performs the following steps:

  a. Sets request.state to Cancelled in the shared ring metadata.
  b. Returns ETIMEDOUT to the waiting syscall (waking the blocked thread via the VfsRingPair::completion wait queue).
  c. Enqueues a CancelToken { request_id, reason: CancelReason::Timeout } on a dedicated cancellation side-channel in the ring so the filesystem driver can detect the cancellation and avoid processing a stale request. The driver is expected to check the cancellation channel before beginning I/O for each dequeued request.

Timeouts are per-mount configurable via mount options (vfs_timeout_regular=<secs>, vfs_timeout_slow=<secs>, vfs_timeout_mount=<secs>). The values above are defaults.
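A sketch of how the per-operation timeout class could be derived from the opcode. The mapping function and names below are illustrative, not a specified API; the opcode values follow the `VfsOpcode` listing above, and operations not listed in the Slow or Mount rows fall through to Regular:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum TimeoutClass {
    Regular,
    Slow,
    Mount,
}

/// Illustrative opcode -> timeout-class mapping per the table above.
fn timeout_class(opcode: u32) -> TimeoutClass {
    match opcode {
        // FileSystemOps administrative ops: Mount=1 through Thaw=8.
        1..=8 => TimeoutClass::Mount,
        // Truncate=35, Fsync=44, Fallocate=48.
        35 | 44 | 48 => TimeoutClass::Slow,
        // Everything else (Read, Write, Lookup, xattr ops, ...).
        _ => TimeoutClass::Regular,
    }
}

/// Built-in defaults; overridable per mount via vfs_timeout_* options.
fn default_timeout_secs(class: TimeoutClass) -> u64 {
    match class {
        TimeoutClass::Regular => 30,
        TimeoutClass::Slow => 120,
        TimeoutClass::Mount => 300,
    }
}
```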

2. Crash handling (filesystem driver crashes): When a Tier 1 filesystem driver crashes (detected by the isolation recovery mechanism described in Section 11.6), the kernel VFS layer performs the following recovery sequence:

  a. All pending requests for the crashed filesystem driver are immediately failed with EIO. Every thread blocked on VfsRingPair::completion for that mount is woken with VfsResponse::Err(-EIO).
  b. The VFS ring is closed: the kernel unmaps the shared ring pages and marks the VfsRingPair as defunct. No new requests are accepted.
  c. Any subsequent access to files on that filesystem (open files, cached dentries, inode operations) returns ENOTCONN until the driver is restarted and the filesystem is remounted.
  d. For Tier 1 filesystem drivers: the crash recovery mechanism reloads the driver module and replays the mount sequence (using the stored mount arguments from SuperBlock). Pending request state is lost — applications whose requests were failed with EIO must retry. Open file descriptors pointing to the crashed filesystem become invalid and return ENOTCONN on any operation; applications must close and reopen them after remount completes.

Crash Recovery Algorithm — Complete Specification:

VFS crash recovery runs when a Tier 1 filesystem driver (e.g., ext4, XFS) crashes and is reloaded (Section 11.9).

Synchronization during recovery (no lock-based ordering — uses atomics):

The unified VFS crash recovery sequence (Section 14.3) uses VfsRingSet.state atomics (VFSRS_RECOVERING) to block new operations, NOT explicit locks. The previous lock-based model (vfs_global_lock, sb.recovery_lock) was replaced by the atomic state machine approach:

  • ring_set.state.store(VFSRS_RECOVERING, Release) blocks all new select_ring() calls.
  • Per-ring inflight_ops counters provide the quiescence barrier.
  • Per-inode inode.lock (level 185) is acquired only if individual inodes need repair (e.g., truncate-on-recovery for partially-written files).

This eliminates lock ordering complexity and avoids adding a global lock to the recovery path. See the unified U1-U18 sequence in Section 14.3 for the authoritative step ordering.

Step 1: Quiesce in-flight operations

  • Set ring_set.state = VFSRS_RECOVERING (atomic store, Release ordering). Note: the recovery state is tracked on the VfsRingSet, not on the SuperBlock. See Section 14.3 for the unified U1-U18 recovery sequence.
  • All new VFS operations on this superblock return ENXIO immediately (no-op check at syscall entry).
  • Wait for all per-ring inflight_ops counters to reach zero: sum(ring.request_ring.inflight_ops for all rings in ring_set) == 0 (spin with a 5s timeout; if not drained after 5s, send SIGKILL to processes with operations stuck in the crashed driver's domain). SIGKILL is the escalation path of last resort: it is used only when a process cannot be unblocked by returning EIO on its stuck syscall (i.e., the process is in TASK_UNINTERRUPTIBLE waiting on a ring response that will never arrive). SIGTERM cannot wake an uninterruptible process. Only processes with operations stuck in the crashed driver's domain are affected.
  • The inflight_ops counter (defined in RingBuffer<T>) is incremented in reserve_slot() after successful slot reservation and decremented in complete_slot() after marking the slot FILLED. Per-ring counters avoid false sharing between CPUs on different rings. See Section 14.3 for the per-CPU ring extension that defines these counters.
  • Why per-ring, not per-sb: A single per-sb AtomicU32 would be a cache line contention point for N concurrent producers. Per-ring counters eliminate cross-ring false sharing. The crash recovery path sums all N counters (cold path, O(N) where N <= 256).
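The quiescence barrier in Step 1 can be sketched as follows. This is an illustrative model using userspace std types in place of kernel primitives; `wait_quiesced` is not a specified function, and the real path escalates to SIGKILL rather than returning a bool:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::{Duration, Instant};

/// Illustrative quiescence barrier: spin until every per-ring
/// inflight_ops counter reads zero, or the drain deadline passes.
/// Returns true if quiesced; false means escalation is required
/// (in the real kernel: SIGKILL processes stuck in the domain).
fn wait_quiesced(inflight: &[AtomicU32], deadline: Duration) -> bool {
    let start = Instant::now();
    loop {
        // Acquire pairs with the Release decrement in complete_slot().
        let total: u64 = inflight
            .iter()
            .map(|c| c.load(Ordering::Acquire) as u64)
            .sum();
        if total == 0 {
            return true;
        }
        if start.elapsed() >= deadline {
            return false;
        }
        std::hint::spin_loop();
    }
}
```

Summing N per-ring counters on this cold path is the price paid for keeping the hot-path increment/decrement free of cross-CPU cache line contention.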

Step 2: Extract ring data, drain, and wake orphaned page waiters

This step has three phases that MUST execute in order. The extraction phase reads ring entries while ring pointers are still valid; the drain phase resets ring pointers; the wake phase unlocks orphaned pages using the extracted data. The unified recovery sequence in Section 14.3 specifies the exact interleaving with the general crash recovery steps from Section 11.9.

Phase 2a: EXTRACT (ring pointers still valid)

  • Walk each ring's request entries from tail to published. For each entry:
     • If the opcode is ReadPage or Readahead: extract the page_cache_id: u64 and page_index: u64 from the request args (these identify the page in Core memory via the page cache, NOT via the VFS-domain inode cache). Collect into an ArrayVec<OrphanedPageEntry, MAX_ORPHANED_PAGES>.
     • If the entry holds a DMA buffer handle (buf != DmaBufferHandle::ZERO): collect the handle for deferred freeing.
  • Inode cache independence: The page cache lives in Core (Tier 0). During crash recovery, the VFS domain (Tier 1) may be corrupted. The ring entry must carry enough information to resolve the page WITHOUT traversing VFS-domain state. Specifically, VfsRequest carries page_cache_id (a u64 handle to the Core-resident AddressSpace/PageCache) set by the VFS dispatch path BEFORE enqueuing the request. This handle is read directly from the ring entry during extraction — no inode lookup needed.

Phase 2b: DRAIN (reset rings)

  • The driver-to-kernel ring buffer (Section 12.1) may have pending completion events from operations submitted before the crash.
  • Call ring_drain_completions(sb.driver_ring): process all pending completions (call the registered callback for each entry). Completions after a crash return EIO.
  • Discard all pending submission-side entries by waking blocked threads with EIO.
  • Free all collected DMA buffer handles (free_request_dma_handles()).
  • Reset all slot_states to EMPTY and ring pointers to 0. See drain_all_vfs_rings() in Section 14.3.

Phase 2c: WAKE (orphaned pages)

Orphaned page wake: Crash recovery must handle threads sleeping on page wait queues (wait_on_page_locked), not just the ring completion WaitQueue. When a cache-miss read is in progress, the requesting thread sleeps on the PAGE's wait queue, not on ring.completion. If the driver crashes, these pages remain LOCKED forever, causing indefinite hangs. For each collected OrphanedPageEntry:

  a. Resolve the target page from page_cache_id + page_index using Core-resident page cache lookups (no VFS-domain data needed).
  b. Set PageFlags::ERROR on the page.
  c. Clear PageFlags::LOCKED via unlock_page().
  d. This wakes all threads sleeping on wait_on_page_locked() for that page. They observe PageFlags::ERROR and return EIO to userspace.

This ensures no orphaned LOCKED pages survive a driver crash.
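The Phase 2c wake loop can be modeled compactly. In this illustrative sketch a `HashMap` keyed by (page_cache_id, page_index) stands in for the Core-resident page cache, a `u8` bitfield stands in for the per-page atomic flags, and the wait-queue wakeup itself is elided; none of these names are part of the specified interface:

```rust
use std::collections::HashMap;

// Stand-in page flag bits (the real kernel uses atomic per-page flags).
const PG_LOCKED: u8 = 1 << 0;
const PG_ERROR: u8 = 1 << 1;

struct OrphanedPageEntry {
    page_cache_id: u64,
    page_index: u64,
}

/// Illustrative Phase 2c: for each orphaned entry, resolve the page via
/// a Core-resident lookup only (no VFS-domain state), mark it ERROR,
/// and clear LOCKED so wait_on_page_locked() sleepers wake and observe
/// the error. Returns how many pages were actually unlocked.
fn wake_orphaned_pages(
    page_cache: &mut HashMap<(u64, u64), u8>,
    orphans: &[OrphanedPageEntry],
) -> usize {
    let mut woken = 0;
    for e in orphans {
        if let Some(flags) = page_cache.get_mut(&(e.page_cache_id, e.page_index)) {
            *flags |= PG_ERROR; // readers will see the error...
            if *flags & PG_LOCKED != 0 {
                *flags &= !PG_LOCKED; // ...and unlock_page() wakes them
                woken += 1; // (wait-queue wakeup elided in this model)
            }
        }
    }
    woken
}
```

Entries that resolve to no page (already evicted) are skipped silently, matching the idempotent flavor the recovery path needs.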

Step 2.5: Page cache integrity verification

  • Before inspecting page cache state, verify page cache metadata integrity. A crashing Tier 1 driver may have corrupted XArray tree nodes (if the driver had write access to page cache metadata via the shared memory domain).
  • For each inode with a non-empty page cache: walk the XArray tree and verify: (a) all slot pointers are within valid slab regions, (b) the xa_node.count field matches the actual non-null slot count, (c) no cycles exist (bounded walk depth = XArray max height = 6 for 64-bit).

Synchronization protocol for the XArray walk:

  • Read-only validation phase: Acquire rcu_read_lock(). The XArray walk reads node pointers and slot entries under RCU protection. kswapd may run concurrently (it removes pages from the XArray via xa_erase()), but RCU protects node lifetimes — freed XArray nodes are not reused until the grace period ends.
  • Mutation phase (corruption repair): If corruption is detected, drop the RCU lock, acquire the per-AddressSpace i_pages lock (xa_lock_irq(&mapping->i_pages)), then call truncate_inode_pages() to drop the entire page cache for that inode. The i_pages lock prevents concurrent page cache mutations during the truncation.
  • The walk is bounded: the maximum XArray height is 6 levels of 64-way fanout, addressing 64^6 ≈ 68 billion pages. For a 1TB file with 4KB pages, the tree has ~256M entries across ~4M nodes, requiring ~100ms to walk. This is acceptable for a cold crash recovery path.

  • If corruption is detected: mark the inode as I_PAGE_CACHE_CORRUPT, drop the entire page cache for that inode (truncate_inode_pages()), and log an FMA event. The data will be re-read from disk after remount.
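The bounded-depth validation walk from Step 2.5 can be modeled on a simplified tree. This sketch checks only properties (b) and (c); the slab-range pointer check (a) is elided, and the 64-way XArray node is replaced by a variable-width stand-in. Because the walk refuses to descend past MAX_HEIGHT, a cyclic (corrupted) structure can never trap it:

```rust
// Deeper than any legal 64-bit XArray; also bounds cyclic structures.
const MAX_HEIGHT: usize = 6;

// Stand-in radix-tree node (real XArray nodes have fixed 64-way fanout).
struct Node {
    count: usize,                  // claimed number of populated slots
    slots: Vec<Option<Box<Node>>>, // child pointers, possibly empty
}

/// Illustrative integrity walk: false if the count field disagrees with
/// the populated slots at any node, or if the tree exceeds MAX_HEIGHT.
fn validate(node: &Node, depth: usize) -> bool {
    if depth > MAX_HEIGHT {
        return false; // impossible depth: corruption or a cycle
    }
    let live = node.slots.iter().filter(|s| s.is_some()).count();
    if live != node.count {
        return false; // xa_node.count-style mismatch
    }
    node.slots
        .iter()
        .flatten()
        .all(|child| validate(child, depth + 1))
}
```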

Step 3: Dirty page detection and writeback - Walk each inode's page cache for all dirty pages (pages with PageFlags::Dirty set), skipping DAX inodes (where page_cache is None) and inodes marked I_PAGE_CACHE_CORRUPT.

  • For journaled filesystems (e.g., ext4/JBD2), the filesystem's journal tracks the relationship between journal transactions and page state via Transaction.tid — not via per-page LSN fields. The VFS layer checks whether each dirty page's owning transaction has been committed:
    - If the page's transaction has committed: mark the page clean (the journal will replay it during recovery).
    - If the page's transaction has NOT committed: writeback must be deferred until the filesystem is repaired.
  • Dirty pages beyond the last commit are kept in memory (pinned) until the filesystem is fsck'd and remounted, at which point a forced writeback is issued.
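The per-page commit check can be sketched as below. `Journal` and the tid comparison are illustrative stand-ins for the JBD2-style bookkeeping described above, not the actual driver interface.

```rust
struct Journal {
    /// tid of the most recently committed transaction.
    last_committed_tid: u64,
}

#[derive(Debug, PartialEq)]
enum DirtyPageAction {
    /// The owning transaction committed: journal replay covers this page.
    MarkClean,
    /// Not yet committed: pin in memory until fsck + remount, then flush.
    DeferWriteback,
}

fn classify_dirty_page(journal: &Journal, owning_tid: u64) -> DirtyPageAction {
    if owning_tid <= journal.last_committed_tid {
        DirtyPageAction::MarkClean
    } else {
        DirtyPageAction::DeferWriteback
    }
}
```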

Step 4: Reload driver and remount

  • Load the new driver image (Section 11.3 reload protocol).
  • Call driver.mount(sb.device, sb.flags) with MS_RDONLY first (safe mode).
  • Run the filesystem's built-in consistency check (ext4 journal replay; XFS log recovery; Btrfs tree walk) via driver.fsck_fast().
  • If fsck_fast() returns Ok(()): remount read-write; resume normal operations.
  • If fsck_fast() returns Err: emit an FMA fault event, keep the mount read-only, and require manual intervention.

Step 5: Flush deferred dirty pages - After successful RW remount, call writeback_deferred_dirty(sb) to flush the dirty pages held since Step 3.
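Steps 4-5 can be condensed into one sketch. The driver trait, error types, and the flush callback are stand-ins for the real interfaces; the real path also emits an FMA fault event on fsck failure.

```rust
trait FsDriver {
    fn mount_readonly(&mut self) -> Result<(), ()>; // MS_RDONLY safe mode
    fn fsck_fast(&mut self) -> Result<(), ()>;      // journal replay / log recovery
    fn remount_rw(&mut self) -> Result<(), ()>;
}

/// Returns true if the mount is back read-write and deferred dirty pages
/// were flushed; false leaves the mount read-only for manual intervention.
fn recover_mount<D: FsDriver>(drv: &mut D, flush_deferred: &mut dyn FnMut()) -> bool {
    if drv.mount_readonly().is_err() {
        return false;
    }
    if drv.fsck_fast().is_err() {
        return false; // emit FMA fault event, stay read-only
    }
    if drv.remount_rw().is_err() {
        return false;
    }
    flush_deferred(); // Step 5: writeback_deferred_dirty(sb)
    true
}
```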

Step 3a: Generation counter bump (before driver reload) - The SuperBlock has a driver_generation: AtomicU64 counter that is incremented each time the driver is (re)loaded. The VFS sets sb.driver_generation = old_generation + 1 BEFORE the driver is reloaded (Step 4) and BEFORE dirty page writeback (Step 5). This ordering is critical: the replacement driver's responses (including writeback completions) must carry the NEW generation so the VFS consumer does not discard them. (Previously numbered Step 5.5 and placed after Step 5 — moved to avoid silent data loss on the writeback flush path. See Section 14.3 Step U13a for the unified sequence rationale.) This ensures:

  • Stale responses from the pre-crash ring (if any were in-flight) are detected and discarded: the VFS checks response.driver_generation == sb.driver_generation.load(Acquire) before processing any completion.
  • Open file handles acquired before the crash carry the old generation (stored in OpenFile.open_generation, set at open time). Any VFS operation using an old-generation file handle returns ENOTCONN:

if file.open_generation != file.inode.i_sb.driver_generation.load(Acquire) {
    return Err(ENOTCONN);
}
  • The generation counter is persisted in the SuperBlock struct (not in driver-owned memory), so it survives driver crashes. One counter per mount (not per ring) — all rings on a mount share the same generation.
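The completion-side check described above amounts to the following sketch; `Response` is a stand-in carrying the generation stamped by the driver.

```rust
struct Response {
    driver_generation: u64,
}

/// Process only completions from the current driver instance; anything
/// stamped with an older generation came from the pre-crash ring and is
/// silently discarded.
fn accept_completion(current_generation: u64, resp: &Response) -> bool {
    resp.driver_generation == current_generation
}
```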

Recovery latency target: ≤500ms for ≤1 million in-flight operations and ≤10 million dirty pages.

3. Cancellation protocol: A caller (or the kernel on behalf of a caller) can cancel a pending request through the following protocol:

  a. The caller invokes vfs_cancel(request_id) (internal kernel API, not exposed as a syscall — cancellation is triggered by signal delivery, thread exit, or timeout).
  b. The kernel sets the CANCEL bit in the RingSlotFlags of the target request's ring slot header. This is an atomic bitwise-OR on the slot's flags field (Relaxed ordering — the flag is advisory; correctness depends on the completion protocol, not on ordering of the flag write itself).
  c. The kernel enqueues a CancelToken on the cancellation side-channel of the VfsRingPair (belt-and-suspenders: the side-channel ensures the driver is notified even if it has already dequeued the request but not yet checked the slot flags).
  d. The filesystem driver checks RingSlotFlags::CANCEL in the slot header before starting I/O for each dequeued request. If the flag is set, the driver writes a VfsResponse::Err(-ECANCELED) to the response ring and moves to the next request. The driver MUST NOT begin any side-effecting I/O (block reads, metadata updates) for a cancelled request.
  e. If the driver has already started processing the request (e.g., issued a block I/O read before the CANCEL flag was set), the driver completes the operation normally and writes the result to the response ring. The kernel discards the response silently, since the request is already resolved from the caller's perspective.
  f. The caller (kernel-side) always waits for a completion slot from the driver — either VfsResponse::Err(-ECANCELED) (if the driver saw the flag) or a normal VfsResponse::Ok/VfsResponse::Err (if the driver had already started I/O). The per-request timeout still applies; if neither response arrives within the timeout, the request enters the crash-recovery path (step 2 above).
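Steps (b)-(c), the kernel side of vfs_cancel, can be sketched as follows. A toy VecDeque stands in for the real side-channel, and the single atomic byte per slot follows the bit 7 layout used for the CANCEL flag.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU8, Ordering};

const CANCEL: u8 = 1 << 7; // bit 7 of the slot_states byte

/// Kernel side of vfs_cancel(): set the advisory CANCEL bit (step b),
/// then enqueue a notification on the side-channel (step c).
fn vfs_cancel(slot_state: &AtomicU8, side_channel: &mut VecDeque<u64>, request_id: u64) {
    // Relaxed is sufficient: the flag is advisory; correctness comes from
    // the completion protocol, not from ordering of this write.
    slot_state.fetch_or(CANCEL, Ordering::Relaxed);
    side_channel.push_back(request_id); // CancelToken stand-in
}
```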

bitflags::bitflags! {
    /// Per-slot flags for the cancellation protocol.
    ///
    /// **Storage**: These flags are stored in the `slot_states` array
    /// alongside the `RingSlotState` values (see
    /// [Section 14.3](#vfs-per-cpu-ring-extension--ring-topology)), NOT in the ring
    /// slot entry itself. The `VfsRequest` struct starts with
    /// `request_id: u64` — there is no flags header in the slot data.
    ///
    /// The `slot_states[idx]` array uses `AtomicU8` where the lower 2 bits
    /// encode `RingSlotState` (EMPTY=0, RESERVED=1, FILLED=2, CONSUMED=3)
    /// and bit 7 encodes the CANCEL flag. This allows both state transitions
    /// and cancellation to be managed through a single atomic byte per slot,
    /// avoiding a separate flags array.
    ///
    /// The `OCCUPIED` flag from the previous design is replaced by the
    /// `RingSlotState::Filled` value — a slot is "occupied" when its state
    /// is FILLED (consumer has not yet consumed it).
    #[repr(transparent)]
    pub struct RingSlotFlags: u8 {
        /// Request is cancelled. The writer (VFS side) sets this bit to
        /// signal that the driver should skip processing. The driver checks
        /// this bit before beginning I/O. If the driver has already started
        /// I/O, the flag is ignored and the operation completes normally.
        ///
        /// Bit 7 of the slot_states[idx] AtomicU8. Set via
        /// `slot_states[idx].fetch_or(CANCEL.bits(), Relaxed)`.
        const CANCEL = 1 << 7;

        /// Bits 2-6 reserved for future use.
    }
}
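The packed byte described in the doc comment (RingSlotState in bits 0-1, CANCEL in bit 7) can be exercised with a few helpers. Mask names here are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const STATE_MASK: u8 = 0b0000_0011; // RingSlotState in the low 2 bits
const CANCEL: u8 = 1 << 7;          // RingSlotFlags::CANCEL in bit 7

fn slot_state(byte: &AtomicU8) -> u8 {
    byte.load(Ordering::Acquire) & STATE_MASK
}

fn is_cancelled(byte: &AtomicU8) -> bool {
    byte.load(Ordering::Acquire) & CANCEL != 0
}

/// Setting CANCEL leaves the state bits untouched: state transitions and
/// cancellation share one atomic byte per slot without interfering.
fn set_cancel(byte: &AtomicU8) {
    byte.fetch_or(CANCEL, Ordering::Relaxed);
}
```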

/// Token placed on the cancellation side-channel of a VfsRingPair to notify
/// the filesystem driver that a previously enqueued request should be skipped.
#[repr(C)]
pub struct CancelToken {
    /// The `request_id` of the cancelled request. Matches `VfsRequest::request_id`.
    pub request_id: u64,
    /// Why the request was cancelled.
    pub reason: CancelReason,
    /// Explicit trailing padding (struct alignment = 8 from u64 field).
    _pad: [u8; 4],
}
const_assert!(size_of::<CancelToken>() == 16);

/// Reason for request cancellation.
#[repr(u32)]
pub enum CancelReason {
    /// The per-operation timeout expired before the driver responded.
    Timeout = 1,
    /// The calling thread was interrupted (signal delivery or thread exit).
    CallerCancelled = 2,
    /// The filesystem driver crashed; all pending requests are being flushed.
    DriverCrash = 3,
}

Cancellation state machine — the lifecycle of a cancelled request from the driver's perspective:

Driver dequeues request from ring:
  1. Read slot.flags with Acquire ordering.
  2. If flags & CANCEL:
       → Write VfsResponse::Err(-ECANCELED) to response ring.
       → Increment consumer pointer. Done.
  3. If !(flags & CANCEL):
       → Begin I/O processing.
       → Re-check flags & CANCEL periodically for long operations
         (optional optimization — not required for correctness).
       → Complete I/O. Write VfsResponse::Ok/Err to response ring.
       → If CANCEL was set between steps 3 and completion:
         kernel discards the response (request already resolved).

This protocol guarantees: (1) the caller always receives exactly one resolution per request (either -ECANCELED or the normal completion, never zero or two), (2) no I/O is wasted on requests that were cancelled before the driver began processing, (3) I/O that has already started is never aborted mid-flight (which would risk filesystem inconsistency).

4. VfsResponse::Pending semantics: A VfsResponse::Pending response from the filesystem driver means the request has been accepted and acknowledged but not yet completed (for example, the driver has issued a block I/O request and is waiting for device completion). The contract is:

  • The caller must poll the response_ring or sleep on VfsRingPair::completion for the final VfsResponse::Ok or VfsResponse::Err.
  • Pending does NOT reset the per-request timeout timer. The maximum time in Pending state is bounded by the operation timeout defined above. If the final response does not arrive within the timeout, the request is cancelled using the standard cancellation protocol (step 3).
  • A driver may send at most one Pending response per request. Sending multiple Pending responses for the same request_id is a protocol violation; the kernel logs a warning and ignores duplicate Pending responses.
  • Pending is optional: a driver may respond directly with Ok or Err without ever sending Pending. It exists to allow the VFS layer to distinguish "driver has seen the request" from "request is still sitting in the ring unprocessed" for diagnostic and health-monitoring purposes (Section 20.1).
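The at-most-one-Pending rule and the no-timer-reset rule can be captured in a small sketch; the HashSet stands in for per-request kernel bookkeeping.

```rust
use std::collections::HashSet;

/// Handle a VfsResponse::Pending for `request_id`. Returns the deadline
/// unchanged: Pending never resets the per-request timeout. A duplicate
/// Pending for the same request_id is a protocol violation and is
/// ignored (the kernel would log a warning here).
fn on_pending(seen_pending: &mut HashSet<u64>, request_id: u64, deadline_ns: u64) -> u64 {
    let first = seen_pending.insert(request_id);
    if !first {
        // duplicate Pending: log warning, ignore
    }
    deadline_ns
}
```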

5. Filesystem Driver Concurrency Requirements:

Driver concurrency model: The SPSC ring serializes delivery of requests to the filesystem driver, not processing. Filesystem driver implementations MUST process dequeued requests concurrently using internal work queues. A driver that processes requests sequentially will exhibit head-of-line blocking (e.g., a stat() waiting behind a fsync()).

The request_id-based response matching already supports out-of-order completion. A compliant driver implementation:

  1. Dequeues requests in batches (up to ring depth).
  2. Dispatches each request to an appropriate internal work queue (e.g., metadata operations to a fast-path thread, journal commits to a dedicated journal thread).
  3. Responds via VfsResponse with the matching request_id as each operation completes -- responses may arrive in any order.

The VfsResponse::Pending mechanism (see above) provides acknowledgment for long-running operations, allowing the VFS layer to distinguish "driver is processing" from "driver is stuck."
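The three-step compliant driver loop can be sketched as follows. The op names and queue split are illustrative, not a mandated taxonomy.

```rust
use std::collections::VecDeque;

enum Op { Stat, Read, Fsync }

struct Request { request_id: u64, op: Op }

/// Internal work queues: per-class dispatch avoids head-of-line blocking
/// (a stat() never waits behind an fsync()).
struct WorkQueues {
    fast_path: VecDeque<Request>, // cheap metadata ops
    journal: VecDeque<Request>,   // ordering-sensitive commits
    io: VecDeque<Request>,        // data reads/writes
}

/// Step 2 of the compliant loop: classify a dequeued batch onto queues.
/// Workers drain each queue concurrently and respond with the matching
/// request_id as operations complete, in any order (step 3).
fn dispatch_batch(batch: Vec<Request>, q: &mut WorkQueues) {
    for req in batch {
        match req.op {
            Op::Stat => q.fast_path.push_back(req),
            Op::Fsync => q.journal.push_back(req),
            Op::Read => q.io.push_back(req),
        }
    }
}
```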

Rationale: Unlike Linux, where multiple threads enter the filesystem driver concurrently on different CPUs (with filesystem-internal locking providing serialization at a finer granularity), UmkaOS's SPSC ring serializes delivery. The page cache absorbs >95% of read/stat operations with zero domain crossings (see "Key design properties" above), so the ring is crossed primarily for cache-miss reads, writes, and metadata mutations. For these operations, concurrent driver-internal processing is essential for matching Linux's multi-threaded I/O throughput on a single filesystem.

Scaling beyond single-ring: For workloads with high write concurrency on a single mount (e.g., PostgreSQL checkpoint with 64 backends issuing concurrent fsyncs), the single SPSC ring becomes a producer-side contention point. The per-CPU VFS ring extension (Section 14.3) replaces the single ring with N SPSC rings (one per CPU or per CPU group), eliminating cross-CPU cache line bouncing while preserving all SPSC lock-free invariants per ring.

14.3 Per-CPU VFS Ring Extension

14.3.1 Motivation

The baseline VFS ring protocol (Section 14.2) allocates a single SPSC VfsRingPair per mounted filesystem. The VFS (umka-core) is the sole producer on the request ring; the filesystem driver is the sole consumer. This design is correct and efficient for single-threaded workloads, but creates a producer-side serialization bottleneck when many CPUs issue concurrent filesystem operations on the same mount.

PostgreSQL checkpoint scenario (the motivating workload): PostgreSQL's checkpoint process (and since PG 15, the checkpointer plus background writer) issues fsync() on hundreds to thousands of relation files concurrently. With 64 backends each doing fsync() on different files, all fsync requests serialize through the single request ring. The ring depth is 256 entries by default; under contention, producers spin waiting for the ring to drain. The filesystem driver dequeues and dispatches to internal work queues, but the single-ring bottleneck means:

  1. Producer serialization: The SPSC ring protocol requires a single producer. Since the VFS runs on any CPU, slot reservation must be serialized. Under 64-way contention with a single ring, this serialization becomes the bottleneck. Each CPU contends for slot reservation, waiting for the lock, and the lock holder's cache line bounces between CPUs via the coherence protocol. At ~50-70 cycles per bounce on x86-64, the lock acquisition alone costs ~3.2-4.5 us under 64-way contention.

  2. Head-of-line blocking: A slow fsync that fills the ring blocks all subsequent producers from enqueueing any request (reads, stats, lookups) on the same mount. The single ring's depth (256 entries by default) is shared across all CPUs.

  3. Doorbell storm: Each CPU that reserves a slot and enqueues a request rings the doorbell independently. With 64 concurrent producers, the driver receives up to 64 doorbells for requests that could have been coalesced.

This extension replaces the single SPSC ring per mount with N rings per mount, one per CPU or per CPU group. Under PerCpu granularity (the primary mode), each ring is pure SPSC — no CAS, no contention between producers. Under shared-ring modes (PerNuma/PerLlc/Fixed), multiple CPUs share a ring with atomic (CAS) head allocation. The CPU that issues a VFS operation uses its assigned ring, eliminating or greatly reducing cross-CPU cache line bouncing.

14.3.2 Design Principles

  1. PerCpu rings are pure SPSC — the primary mode. Under PerCpu granularity (the default for >= 65 CPUs), each ring has exactly one producer. All SPSC invariants, memory ordering guarantees, and backpressure semantics are preserved per ring. No CAS on the ring head — just relaxed load/store.

  2. Shared-ring modes use CAS on head — safe but slower. Under PerNuma, PerLlc, and Fixed granularity, multiple CPUs share a ring. The reserve_slot() function uses a CAS loop on head for atomic slot allocation. This is lock-free (no spinlock) but has O(N) expected CAS retries under N-way contention. The CAS loop is bounded by ring size.

  3. The driver side multiplexes N rings. The filesystem driver's consumer thread(s) poll all N rings. This is the only new complexity — the driver must handle multiple input rings instead of one.

  4. Backward compatible. Drivers that only support single-ring mode (ring_count_max = 1 in their KABI manifest) continue to work unchanged. The VFS falls back to single-ring mode for such drivers.

  5. No new lock types. Ring selection is determined by cpu_id at VFS entry — no lock, no CAS, no arbitration. The mapping is a static array lookup. The shared-ring CAS operates on the existing head atomic.


14.3.3 Ring Topology

14.3.3.1 VfsRingSet: Per-Mount Ring Collection

The single VfsRingPair is replaced by a VfsRingSet that contains 1..N ring pairs, where N is negotiated at mount time.

/// VfsRingSet state constants. These are the mount-level recovery states,
/// distinct from the per-ring DomainRingBuffer.state values (Active=0,
/// Disconnected=1).
pub const VFSRS_ACTIVE: u8 = 0;
pub const VFSRS_RECOVERING: u8 = 1;
pub const VFSRS_QUIESCING: u8 = 2;

/// Per-mount collection of VFS ring pairs. Replaces the single `VfsRingPair`
/// from [Section 14.2](#vfs-ring-buffer-protocol).
///
/// Each ring pair is a full SPSC channel (request + response + doorbell +
/// completion wait queue). Rings are indexed by CPU group — each CPU is
/// statically assigned to exactly one ring at mount time.
///
/// Placement: Tier 0 (Core). The VfsRingSet is owned by the superblock
/// and lives in umka-core memory. The ring data regions are in shared
/// memory (PKEY 14 shared DMA pool on x86-64).
// Kernel-allocated, kernel-owned. Layout is depended upon by Tier 1
// drivers via `&VfsRingSet` reference passed at vfs_init() time.
// `#[repr(C)]` ensures stable field offsets across compilation units.
#[repr(C)]
pub struct VfsRingSet {
    /// Array of ring pairs. Length is `ring_count`. Index 0 is always valid.
    /// Allocated from the kernel slab at mount time (warm path, bounded N).
    /// Maximum N = MAX_VFS_RINGS (256, sufficient for 256-core systems
    /// with 1:1 CPU-to-ring mapping).
    ///
    /// SAFETY: Allocated from kernel slab at mount time. Valid for the
    /// lifetime of the VfsRingSet (which is the lifetime of the mount).
    /// `ring_count` is the element count. Freed in umount after all rings
    /// are drained. During crash recovery, `drain_all_vfs_rings()` accesses
    /// rings under `VfsRingSet.state == RECOVERING`, which prevents
    /// concurrent `select_ring()` access. Raw pointer required because
    /// VfsRingSet is `#[repr(C)]` for KABI transport.
    /// `debug_assert!(ring_count <= MAX_VFS_RINGS)` at mount-time init.
    ///
    /// VfsRingSet implements Send + Sync because `rings` points to
    /// slab-allocated memory that outlives all users; ring_count is
    /// immutable after mount.
    ///
    /// `*const` because VfsRingPair's mutable fields (via `RingBuffer<T>`:
    /// `inner.head`, `inner.tail`, `inner.published`, `slot_states`,
    /// `inflight_ops`) are all atomics — shared access is safe via atomic
    /// operations. The pointer itself is immutable after mount (the array
    /// base never changes); interior mutability is provided by the atomic
    /// fields within each VfsRingPair's `RingBuffer<T>` members.
    ///
    /// Raw pointer required because `#[repr(C)]` layout is depended upon
    /// by Tier 1 VFS drivers that receive `&VfsRingSet` via `vfs_init()`.
    /// Although VfsRingSet is kernel-allocated and kernel-owned, its field
    /// offsets are ABI-visible to the driver through the reference.
    pub rings: *const VfsRingPair,

    /// Number of active ring pairs. Range: 1..=MAX_VFS_RINGS.
    /// Set at mount time after negotiation with the driver.
    /// Invariant: ring_count >= 1 (single-ring mode is the minimum).
    pub ring_count: u16,

    /// CPU-to-ring mapping table. Index: CPU ID (0..nr_cpu_ids).
    /// Value: ring index (0..ring_count). Populated at mount time.
    /// Updated atomically on CPU hotplug events.
    ///
    /// This is a read-only lookup table on the hot path — no lock, no CAS.
    /// The table is allocated once at mount time with capacity for
    /// `nr_cpu_ids` entries (runtime-discovered, not hardcoded).
    ///
    /// For CPU IDs beyond the table size (should not happen — table is
    /// sized to nr_cpu_ids at mount time), the fallback is ring index 0.
    ///
    /// SAFETY: Same lifetime as `rings` — allocated at mount, freed at
    /// umount. `cpu_to_ring_len` is the element count. Raw pointer for
    /// `#[repr(C)]` KABI transport compatibility. `*mut` because
    /// `cpu_to_ring` entries are updated by CPU hotplug
    /// (`vfs_rings_cpu_online`) via `AtomicU16::store()`.
    pub cpu_to_ring: *mut AtomicU16,

    /// Number of entries in the cpu_to_ring table (== nr_cpu_ids at mount time).
    /// u32: supports systems with >65535 CPUs (large HPC / datacenter nodes).
    pub cpu_to_ring_len: u32,

    /// Ring allocation granularity used at mount time.
    pub granularity: RingGranularity,

    /// Per-ring doorbell coalescing state for cross-ring coalesced doorbells.
    /// See "Doorbell Coalescing" section below.
    pub coalesced_doorbell: CoalescedDoorbell,

    /// Global request_id generator for this mount. All rings draw from this
    /// single atomic counter to ensure mount-wide unique request IDs.
    /// See "Request ID Generation" below.
    ///
    /// **Cache line isolation**: This field is hot-path (atomically incremented
    /// on every VFS operation). It must NOT share a cache line with other
    /// frequently-written fields. The `state` field below is cold-path (written
    /// only during crash recovery), so false sharing with `state` is acceptable.
    /// The `CacheLinePadded` wrapper ensures `next_request_id` starts on its
    /// own cache line when VfsRingSet is cache-line aligned.
    pub next_request_id: CacheLinePadded<AtomicU64>,

    /// Mount-level state for crash recovery coordination.
    /// Set to RECOVERING during crash recovery to block all rings.
    /// Cold path only — written during crash recovery, read (Relaxed) on
    /// ring selection hot path for early-exit check.
    ///
    /// Constants:
    /// - `VFSRS_ACTIVE = 0u8`: Normal operation. select_ring() proceeds.
    /// - `VFSRS_RECOVERING = 1u8`: Crash recovery in progress. select_ring()
    ///   returns ENXIO. Set at Step U3, cleared at Step U17.
    /// - `VFSRS_QUIESCING = 2u8`: Live evolution quiescence in progress.
    ///   select_ring() returns ENXIO. Set at evolution initiation, cleared
    ///   when the replacement driver's consumer loops are ready.
    ///
    /// These are DISTINCT from the per-ring `DomainRingBuffer.state` values
    /// (0 = Active, 1 = Disconnected). The VfsRingSet state gates ALL rings
    /// in the mount; the per-ring state controls individual ring operation.
    pub state: AtomicU8,

    /// Padding to fill remaining bytes in the struct layout after state,
    /// preventing adjacent slab allocations from sharing this cache line.
    /// (Not cache-line alignment of `state` itself — for that, add
    /// `#[repr(C, align(64))]` to the struct.)
    _pad: [u8; 5],
}
// VfsRingSet is kernel-allocated but its layout is depended upon by Tier 1
// drivers that receive &VfsRingSet via vfs_init(). Contains raw pointers and
// atomics with platform-dependent sizes — no const_assert (size varies between
// 32-bit and 64-bit architectures).

// SAFETY: `rings` and `cpu_to_ring` point to slab-allocated memory that
// outlives all users of VfsRingSet (freed at umount after all rings are
// drained). All mutable state within VfsRingPair uses atomics. ring_count
// and cpu_to_ring_len are immutable after mount-time initialization.
unsafe impl Send for VfsRingSet {}
unsafe impl Sync for VfsRingSet {}

/// Maximum number of VFS rings per mount. Sized for 256-core systems
/// with 1:1 CPU-to-ring mapping. Systems with >256 CPUs use CPU-group
/// mapping (multiple CPUs share one ring). The value is a compile-time
/// upper bound for array sizing; the actual ring count is negotiated
/// at mount time and is typically much smaller.
pub const MAX_VFS_RINGS: usize = 256;

/// Ring allocation granularity — how CPUs are mapped to rings.
#[repr(u8)]
pub enum RingGranularity {
    /// One ring per CPU. Maximum parallelism, maximum memory usage.
    /// Best for high-IOPS workloads (databases, storage servers).
    PerCpu = 0,

    /// One ring per NUMA node. CPUs on the same NUMA node share one ring.
    /// Good balance of parallelism and memory. Reduces cross-NUMA cache
    /// bouncing while keeping ring count manageable.
    PerNuma = 1,

    /// One ring per LLC (Last-Level Cache) group. CPUs sharing an L3 cache
    /// share one ring. Finer than PerNuma on multi-CCX/chiplet designs
    /// (AMD EPYC, Intel Sapphire Rapids). Within an LLC group, cache line
    /// bouncing for the ring head is L3-local (~10-15 cycles) rather than
    /// cross-socket (~50-70 cycles).
    PerLlc = 2,

    /// Fixed number of rings (specified via mount option). CPUs are
    /// distributed round-robin across rings. Used when the operator
    /// wants explicit control (e.g., `vfs_ring_count=4`).
    Fixed = 3,

    /// Single ring (legacy mode). Equivalent to the baseline protocol.
    /// Used when the driver reports `ring_count_max = 1`.
    Single = 4,
}
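Ring-count negotiation itself is not spelled out in code above. A plausible policy, consistent with the stated constraints (the driver's `ring_count_max`, `MAX_VFS_RINGS`, and the group count implied by the granularity), might look like the following sketch. This is purely an illustrative assumption, not the specified algorithm.

```rust
const MAX_VFS_RINGS: usize = 256; // compile-time upper bound, per the text

/// Pick the effective ring count at mount time. `group_count` is the
/// number of CPU groups implied by the granularity: nr_cpu_ids for PerCpu,
/// NUMA node count for PerNuma, LLC group count for PerLlc, or the
/// operator-supplied N for Fixed.
fn negotiate_ring_count(driver_ring_count_max: usize, group_count: usize) -> usize {
    if driver_ring_count_max <= 1 {
        return 1; // legacy driver: fall back to single-ring mode
    }
    group_count.min(driver_ring_count_max).min(MAX_VFS_RINGS).max(1)
}
```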

14.3.3.2 CPU-to-Ring Assignment

At mount time, the VFS builds the cpu_to_ring mapping table based on the negotiated ring_count and granularity:

/// Build the CPU-to-ring mapping table at mount time.
///
/// # Arguments
/// * `ring_count` — Negotiated number of rings (1..=MAX_VFS_RINGS).
/// * `granularity` — How CPUs are grouped into rings.
///
/// # Returns
/// Slab-allocated mapping table of length `nr_cpu_ids`.
///
/// Hot path access: `cpu_to_ring[smp_processor_id()].load(Relaxed)` — one
/// atomic load (~1 cycle on x86-64 TSO, ~1-3 cycles on ARM/RISC-V).
fn build_cpu_to_ring_map(
    ring_count: u16,
    granularity: RingGranularity,
) -> &'static [AtomicU16] {
    let nr_cpus = arch::current::cpu::nr_cpu_ids();
    let table = slab_alloc_zeroed::<AtomicU16>(nr_cpus);

    match granularity {
        RingGranularity::PerCpu => {
            // 1:1 mapping: CPU i → ring (i % ring_count).
            // If nr_cpus > ring_count, CPUs wrap around and share rings.
            for cpu in 0..nr_cpus {
                table[cpu].store((cpu % ring_count as usize) as u16, Relaxed);
            }
        }
        RingGranularity::PerNuma => {
            // One ring per NUMA node (up to ring_count nodes).
            // NUMA node IDs are discovered at boot via ACPI SRAT / device tree.
            for cpu in 0..nr_cpus {
                let node = arch::current::cpu::cpu_to_node(cpu);
                table[cpu].store((node % ring_count as usize) as u16, Relaxed);
            }
        }
        RingGranularity::PerLlc => {
            // One ring per LLC group. LLC group IDs are discovered at boot
            // via CPUID (x86), CLIDR_EL1 (AArch64), or device tree.
            for cpu in 0..nr_cpus {
                let llc_id = arch::current::cpu::cpu_to_llc_group(cpu);
                table[cpu].store((llc_id % ring_count as usize) as u16, Relaxed);
            }
        }
        RingGranularity::Fixed => {
            // Round-robin distribution.
            for cpu in 0..nr_cpus {
                table[cpu].store((cpu % ring_count as usize) as u16, Relaxed);
            }
        }
        RingGranularity::Single => {
            // All CPUs map to ring 0.
            // Table is already zero-initialized.
        }
    }

    table
}
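Stripped of slab allocation and atomics, the PerCpu and Fixed arms above reduce to a modulo; a quick runnable check of the wrap-around behavior:

```rust
/// Toy version of the PerCpu/Fixed mapping: CPU i -> ring (i % ring_count).
fn ring_for_cpu(cpu: usize, ring_count: usize) -> usize {
    cpu % ring_count
}
```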

Hot path ring selection — the VFS dispatch path (step 4 in the dispatch flow from Section 14.2) selects the ring as follows:

/// Ring slot states for the split reservation/completion protocol.
///
/// This protocol separates slot reservation (which requires producer ordering)
/// from data fill (which may fault on `copy_from_user`). The key insight:
/// `preempt_disable` is needed only around `select_ring()` + `reserve_slot()`
/// (~3 instructions, ~nanoseconds), NOT around the entire ring operation.
/// After reservation, the slot is owned by the reserving task regardless of
/// which CPU it runs on. This enables:
/// - Inline small I/O: `copy_from_user` during ring fill
/// - FUSE passthrough: userspace access during ring submission
/// - Any path needing page faults during data fill
///
/// ```text
/// EMPTY → RESERVED → FILLED → CONSUMED → EMPTY
///   ↑                                       |
///   +---------------------------------------+
/// ```
///
/// - `EMPTY`: Slot is available for reservation. Producer may CAS to RESERVED.
///   Consumer skips EMPTY slots.
/// - `RESERVED`: Slot is claimed by a producer (CAS from EMPTY). The producer
///   owns the slot exclusively. Consumer stops at RESERVED slots (head-of-line
///   blocking — the slot is not yet ready). The consumer does NOT skip past
///   RESERVED slots; it returns and retries later.
/// - `FILLED`: Producer has written data and marked the slot complete. Consumer
///   may process this slot. Transition: consumer sets CONSUMED after processing.
/// - `CONSUMED`: Consumer has processed the slot. The consumer transitions
///   CONSUMED → EMPTY immediately after processing (single owner; no race).
///   `tail` advances past contiguous CONSUMED→EMPTY slots.
///
/// **State ownership**: Only the producer writes EMPTY→RESERVED and
/// RESERVED→FILLED. Only the consumer writes FILLED→CONSUMED→EMPTY and
/// advances `tail`. The `head` is producer-owned. This two-party protocol
/// has no ABA risk because each slot has exactly one owner at a time.
#[repr(u8)]
pub enum RingSlotState {
    Empty    = 0,
    Reserved = 1,
    Filled   = 2,
    Consumed = 3,
}

/// Select the VFS ring for the current CPU and reserve a slot.
///
/// This is the hot-path entry point — called on every VFS operation that
/// crosses the domain boundary. The full protocol:
///
/// ```text
/// preempt_disable();
/// let ring = select_ring(ring_set)?;           // ENXIO if RECOVERING/QUIESCING
/// let (slot_idx, seq) = reserve_slot(ring)?;   // CAS on head (shared) or store (per-CPU)
/// preempt_enable();
/// // --- preemption safe zone: fill data, may fault ---
/// fill_slot_data(slot_idx, &request);           // copy_from_user OK here
/// complete_slot(ring, slot_idx, seq);            // store(FILLED, Release) + advance published
/// ```
///
/// **Preemption note**: `preempt_disable()` is held only around
/// `select_ring()` + `reserve_slot()` (~3-8 instructions, ~4-20 cycles).
/// After reservation, `preempt_enable()` — the slot is owned by the
/// reserving task regardless of which CPU it runs on. Data fill and
/// `complete_slot()` (mark slot as FILLED) can happen from any CPU,
/// including after migration or page fault. The preempt_disable window
/// is ~nanoseconds (slot reservation only), not the entire ring operation.
///
/// **Producer model**: Under `PerCpu` granularity (the primary mode), each
/// ring has exactly one producer — the `preempt_disable` window guarantees
/// no other task on this CPU can interleave. This is pure SPSC: no CAS on
/// `head`, just a relaxed load + store. Under `PerNuma`/`PerLlc`/`Fixed`
/// granularity, multiple CPUs share a ring. The `reserve_slot()` function
/// uses a CAS loop on `head` to provide atomic slot allocation. See
/// `reserve_slot()` below for both paths.
///
/// **Inline write advisory**: For shared-ring modes (`PerNuma`/`PerLlc`/
/// `Fixed`), inline writes that may trigger page faults during
/// `copy_from_user()` can hold a RESERVED slot for the duration of the
/// fault (~1-50 us minor, ~1-10 ms major). This causes head-of-line
/// blocking for the consumer on that ring. For shared rings, the VFS
/// dispatch path SHOULD prefer the DMA buffer path (pre-copy before
/// reservation) for inline-eligible writes if the ring's pending count
/// exceeds `ring.size / 2`. This is a performance heuristic, not a
/// correctness requirement — the consumer handles RESERVED slots correctly
/// by waiting (see consumer algorithm below).
///
/// Cost (PerCpu): 1 atomic load (Relaxed) + bounds check + 1 store. ~3-5 cycles.
/// Cost (shared): 1 CAS loop (~5-20 cycles under contention) + 1 CAS on slot state.
/// Check open_generation before dispatching any VFS operation.
/// This is the VFS dispatch entry point referenced in the OpenFile
/// struct doc comment. Must be called before `select_ring()`.
///
/// Returns `Err(ENOTCONN)` if the file was opened with a different
/// driver generation (the driver has crashed and been reloaded since
/// this file was opened). Userspace must close and re-open the file.
#[inline(always)]
fn vfs_check_open_generation(file: &File) -> Result<(), KernelError> {
    let current_gen = file.inode.i_sb.driver_generation.load(Ordering::Acquire);
    if file.open_generation != current_gen {
        return Err(KernelError::ENOTCONN);
    }
    Ok(())
}

#[inline(always)]
fn select_ring(ring_set: &VfsRingSet) -> Result<&VfsRingPair, KernelError> {
    // Fast-path rejection during crash recovery or live evolution.
    // Relaxed ordering is sufficient for two reasons:
    // 1. **Downstream barriers provide correctness**: operations that
    //    pass this check go on to `reserve_slot()`, which increments
    //    `inflight_ops`. The recovery path's Acquire loads on
    //    `inflight_ops` during quiescence are the true synchronization
    //    point — any operation that slips past this Relaxed check is
    //    still counted and waited for during drain.
    // 2. **False negatives are bounded**: reading ACTIVE when the state
    //    is actually RECOVERING is bounded by store propagation delay
    //    (nanoseconds on TSO, microseconds on weak-memory architectures).
    //    The operation will be caught by the inflight_ops drain.
    //
    // False positives during ACTIVE -> RECOVERING transition cannot happen
    // because the inflight_ops barrier in wait_for_producers_quiesced()
    // ensures all producers have completed before drain begins.
    //
    // False positives during RECOVERING -> ACTIVE transition (U17) CAN
    // occur on weak-memory architectures (ARM, RISC-V): a producer on
    // another CPU may load Relaxed and see RECOVERING after U17 stores
    // ACTIVE with Release. This is harmless — the producer gets ENXIO and
    // retries on the next attempt (the window is nanoseconds to microseconds).
    if ring_set.state.load(Ordering::Relaxed) != VFSRS_ACTIVE {
        return Err(KernelError::ENXIO);
    }
    let cpu = arch::current::cpu::smp_processor_id();
    let ring_idx = if cpu < ring_set.cpu_to_ring_len as usize {
        // SAFETY: cpu < cpu_to_ring_len, validated above. cpu_to_ring points
        // to a slab-allocated array of cpu_to_ring_len AtomicU16 elements,
        // valid for the lifetime of the mount.
        unsafe { (*ring_set.cpu_to_ring.add(cpu)).load(Ordering::Relaxed) } as usize
    } else {
        0 // Fallback for CPUs beyond the table (should not happen).
    };
    // SAFETY: ring_idx is in bounds — cpu_to_ring values are validated
    // at mount time to be < ring_count. Bounds check is redundant but
    // present for defense-in-depth.
    debug_assert!(ring_idx < ring_set.ring_count as usize);
    let idx = if ring_idx < ring_set.ring_count as usize {
        ring_idx
    } else {
        0
    };
    // SAFETY: rings is a valid, non-null pointer to ring_count VfsRingPair
    // elements, allocated from kernel slab at mount time and valid for the
    // lifetime of the mount. The pointer is set during mount initialization
    // and never modified afterward.
    debug_assert!(!ring_set.rings.is_null());
    Ok(unsafe { &*ring_set.rings.add(idx) })
}

/// Reserve a slot on the given ring. Returns `(slot_index, sequence_number)`.
/// The `sequence_number` is the `head` value at reservation time — passed to
/// `complete_slot()` so it can correctly advance the `published` watermark
/// without reading a potentially stale `head`.
///
/// Must be called with preemption disabled (caller holds PreemptGuard).
///
/// **PerCpu mode** (single producer per ring): No CAS on `head`. The caller
/// is the sole producer under `preempt_disable`, so `head` load + store is
/// safe. Cost: ~3-5 cycles.
///
/// **Shared-ring mode** (PerNuma/PerLlc/Fixed — multiple CPUs share a ring):
/// Uses a CAS loop on `head` to atomically claim the next slot. If the CAS
/// fails, the loop retries with the updated `head` (no false RingFull).
/// Cost: ~5-20 cycles depending on contention.
///
/// After `head` is successfully advanced, the producer swaps the slot state
/// to RESERVED and debug-asserts the previous state was EMPTY as
/// defense-in-depth (the swap should always find EMPTY because the consumer
/// transitions CONSUMED → EMPTY before `tail` advances past the slot, and
/// `head - tail < size` was checked).
///
/// **Inflight tracking**: On successful reservation, increments
/// `ring.request_ring.inflight_ops` (AtomicU32, Relaxed). This counter is
/// decremented by `complete_slot()`. Crash recovery waits for all per-ring
/// `inflight_ops` to reach zero before draining, ensuring no producer is
/// mid-`copy_from_user` with a RESERVED slot.
#[inline]
fn reserve_slot(ring: &VfsRingPair) -> Result<(u32, u64), RingFull> {
    if ring.shared_ring {
        // --- Shared-ring path: CAS loop on head ---
        loop {
            let head = ring.request_ring.inner.head.load(Ordering::Acquire);
            let tail = ring.request_ring.inner.tail.load(Ordering::Acquire);
            if head.wrapping_sub(tail) >= ring.request_ring.inner.size as u64 {
                return Err(RingFull);
            }
            // Attempt to advance head atomically.
            match ring.request_ring.inner.head.compare_exchange_weak(
                head,
                head.wrapping_add(1),
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => {
                    // We own slot `head`. CAS slot state as defense-in-depth.
                    let idx = (head & (ring.request_ring.inner.size as u64 - 1)) as u32;
                    let slot_state = &ring.request_ring.slot_states[idx as usize];
                    let prev = slot_state.swap(
                        RingSlotState::Reserved as u8,
                        Ordering::AcqRel,
                    );
                    debug_assert_eq!(prev, RingSlotState::Empty as u8,
                        "Slot {} should be EMPTY after head CAS, was {}",
                        idx, prev);
                    // Track in-flight operation for crash recovery quiescence.
                    ring.request_ring.inflight_ops.fetch_add(1, Ordering::Relaxed);
                    return Ok((idx, head));
                }
                Err(_) => {
                    // Another CPU advanced head. Retry with updated value.
                    core::hint::spin_loop();
                }
            }
        }
    } else {
        // --- PerCpu path: single producer, no CAS on head ---
        let head = ring.request_ring.inner.head.load(Ordering::Relaxed);
        let tail = ring.request_ring.inner.tail.load(Ordering::Acquire);
        if head.wrapping_sub(tail) >= ring.request_ring.inner.size as u64 {
            return Err(RingFull);
        }
        let idx = (head & (ring.request_ring.inner.size as u64 - 1)) as u32;
        // Unconditional swap(RESERVED). Under preempt_disable, we are the
        // sole producer on this per-CPU ring. If head - tail < size, the
        // slot at `head` MUST be EMPTY — any other state indicates a broken
        // invariant (a bug), not ring congestion. Using swap instead of CAS
        // avoids silently returning RingFull on invariant violations.
        let slot_state = &ring.request_ring.slot_states[idx as usize];
        let prev = slot_state.swap(
            RingSlotState::Reserved as u8,
            Ordering::AcqRel,
        );
        debug_assert_eq!(prev, RingSlotState::Empty as u8,
            "PerCpu ring invariant violation: slot {} should be EMPTY, was {}",
            idx, prev);
        ring.request_ring.inner.head.store(
            head.wrapping_add(1), Ordering::Release,
        );
        // Track in-flight operation for crash recovery quiescence.
        ring.request_ring.inflight_ops.fetch_add(1, Ordering::Relaxed);
        Ok((idx, head))
    }
}

/// Mark a reserved slot as filled (data is ready for consumer).
/// Called after data fill is complete. May be called from any CPU —
/// the slot is owned by the reserving task, not bound to a CPU.
///
/// `seq` is the sequence number (head value) returned by `reserve_slot()`.
/// It is used to advance the `published` watermark: `published` tracks
/// `seq + 1` (i.e., the slot just filled is now visible). The consumer
/// uses `published` as a doorbell hint — it tells the consumer that new
/// slots may be ready up to `published`, but the consumer still checks
/// per-slot state to determine which slots are actually FILLED.
///
/// **Out-of-order completion**: When slots are completed out of order
/// (e.g., slot 6 fills before slot 5 because slot 5's copy_from_user
/// faulted), `published` may temporarily lag behind the highest FILLED
/// slot. This is correct: `published` is a *lower bound* on the highest
/// filled slot. The consumer scans from `tail` to `published` and
/// processes only FILLED slots. A RESERVED slot at position `tail` blocks
/// `tail` advancement (head-of-line blocking), but the consumer can
/// process FILLED slots ahead of it (scan-and-process model, see
/// consumer algorithm below).
///
/// **Memory ordering**: The `Release` store on `slot_state` (FILLED)
/// happens before the `Release` in `fetch_max` on `published`. The
/// consumer loads `published` with `Acquire`, establishing a happens-before
/// relationship: the consumer is guaranteed to see the FILLED state and
/// all data written by the producer.
#[inline]
fn complete_slot(ring: &VfsRingPair, idx: u32, seq: u64) {
    let slot_state = &ring.request_ring.slot_states[idx as usize];
    slot_state.store(RingSlotState::Filled as u8, Ordering::Release);
    // Advance published watermark. Uses the reservation-time sequence
    // number (seq + 1), NOT the current head. This prevents advertising
    // slots that were reserved after this one but not yet filled.
    ring.request_ring.inner.published.fetch_max(
        seq.wrapping_add(1),
        Ordering::Release,
    );
    // Decrement in-flight counter. Crash recovery waits for this to
    // reach zero before draining (ensures no producer is mid-fill).
    ring.request_ring.inflight_ops.fetch_sub(1, Ordering::Release);
}

14.3.3.3 Consumer-Side Algorithm

The consumer (filesystem driver) processes slots from tail towards published. The algorithm handles out-of-order completion (RESERVED slots between FILLED slots) by using a scan-and-process model with strict tail advancement rules.

/// Consumer-side ring drain algorithm. Called by the driver's consumer
/// thread(s) when the doorbell fires or on polling wakeup.
///
/// **Invariants maintained by this function**:
/// - `tail` advances only past contiguous slots that have been processed
///   (CONSUMED → EMPTY transition). `tail` never skips a RESERVED slot.
/// - FILLED slots ahead of a RESERVED slot ARE processed (dispatched to
///   the driver's internal work queue), but `tail` does not advance past
///   the RESERVED blocker until it becomes FILLED and is processed.
/// - After processing a FILLED slot, the consumer transitions it to
///   CONSUMED and then immediately to EMPTY (single owner — no race).
///   This two-step transition maintains the documented state machine
///   and allows diagnostic observation of the CONSUMED state.
///
/// **Head-of-line blocking**: A slow RESERVED slot (e.g., producer stuck
/// in a page fault during copy_from_user) blocks `tail` advancement but
/// does NOT block processing of later FILLED slots. The ring's effective
/// capacity is reduced by the number of RESERVED slots between `tail` and
/// the first unprocessed FILLED slot. Under PerCpu mode (the primary
/// mode), the producer IS the blocked task, so no other slots can be
/// reserved on this ring during the fault — head-of-line blocking is
/// moot. Under shared-ring modes, other CPUs can reserve and fill slots
/// past the blocked one; those slots are processed but `tail` stays
/// pinned until the blocked slot completes.
///
/// **Livelock prevention**: The scan window is bounded by `published`.
/// The consumer does not scan beyond `published` (which is updated
/// atomically by producers). Each scan pass has bounded work: at most
/// `published - tail` slots. If no FILLED slots are found in a pass,
/// the consumer returns and waits for the next doorbell.
fn drain_ring(ring: &VfsRingPair) {
    let mask = ring.request_ring.inner.size as u64 - 1;

    loop {
        let tail = ring.request_ring.inner.tail.load(Ordering::Acquire);
        let published = ring.request_ring.inner.published.load(Ordering::Acquire);

        if tail == published {
            return; // Ring is empty (no new slots to process).
        }

        let mut processed_any = false;

        // Scan from tail to published, processing FILLED slots.
        // Use `!=` instead of `<` for wrapping-safe comparison: pos starts
        // at tail and increments toward published; the ring size guarantees
        // published - tail <= ring.size, so the scan always terminates.
        let mut pos = tail;
        while pos != published {
            let idx = (pos & mask) as usize;
            let state = ring.request_ring.slot_states[idx]
                .load(Ordering::Acquire);

            match state {
                s if s == RingSlotState::Filled as u8 => {
                    // Process this slot.
                    // SAFETY: slot is FILLED and we are the sole consumer.
                    let request = unsafe { ring.request_ring.read_entry(idx) };
                    dispatch_to_work_queue(request);

                    // Transition: FILLED → CONSUMED → EMPTY.
                    // Single owner (consumer), so plain stores suffice.
                    ring.request_ring.slot_states[idx]
                        .store(RingSlotState::Consumed as u8, Ordering::Relaxed);
                    ring.request_ring.slot_states[idx]
                        .store(RingSlotState::Empty as u8, Ordering::Release);
                    processed_any = true;
                }
                s if s == RingSlotState::Reserved as u8 => {
                    // Slot not yet filled by its producer (e.g., blocked
                    // in copy_from_user page fault). Skip this slot and
                    // continue scanning — FILLED slots ahead of a RESERVED
                    // slot ARE processed (dispatched to the driver's work
                    // queue). `tail` cannot advance past this RESERVED
                    // blocker until it becomes FILLED and is processed,
                    // but processing later FILLED slots reduces latency
                    // for those requests. Under PerCpu mode, only one
                    // producer exists per ring, so a RESERVED slot means
                    // the producer is mid-fill and no later FILLED slots
                    // exist — the continue is effectively a no-op. Under
                    // shared-ring mode, other CPUs may have filled later
                    // slots that can be processed now.
                    pos = pos.wrapping_add(1);
                    continue;
                }
                s if s == RingSlotState::Empty as u8 => {
                    // Already processed on an earlier pass: a FILLED slot
                    // ahead of a RESERVED blocker was consumed and reset
                    // to EMPTY, but `tail` is still pinned behind the
                    // blocker. Skip it.
                }
                _ => {
                    // CONSUMED should never be observable here — the
                    // consumer resets a slot to EMPTY before moving on.
                    // Debug assert for invariant violation.
                    debug_assert!(false,
                        "Unexpected slot state {} at pos {} (tail={}, published={})",
                        state, pos, tail, published);
                    break;
                }
            }
            pos = pos.wrapping_add(1);
        }

        // Advance tail past contiguous EMPTY slots from the old tail.
        // Slots we processed above were set to EMPTY, so tail advances
        // past all of them (up to the first non-EMPTY slot).
        let mut new_tail = tail;
        while new_tail != published {
            let idx = (new_tail & mask) as usize;
            let state = ring.request_ring.slot_states[idx]
                .load(Ordering::Acquire);
            if state != RingSlotState::Empty as u8 {
                break;
            }
            new_tail = new_tail.wrapping_add(1);
        }
        if new_tail != tail {
            ring.request_ring.inner.tail.store(new_tail, Ordering::Release);
        }

        if !processed_any {
            return; // No progress this pass — wait for next doorbell.
        }
        // Loop to check if new slots were published while we were processing.
    }
}

Tail advancement guarantee: tail advances strictly monotonically and only past slots in EMPTY state (meaning they were FILLED, processed, and reset to EMPTY by the consumer). A RESERVED slot at position tail pins tail until that slot completes its lifecycle (RESERVED → FILLED → process → EMPTY). This ensures no slot is ever skipped or double-processed.

Ring capacity under head-of-line blocking: If one RESERVED slot pins tail, the ring's usable capacity is reduced by the number of EMPTY slots between tail and the RESERVED blocker. In the worst case (one RESERVED slot at tail, all other slots EMPTY), the ring has size - 1 usable slots — functionally identical to a normal SPSC ring. Under shared-ring modes with high contention, multiple RESERVED slots can accumulate, temporarily reducing effective capacity. The negotiation protocol's auto-selection heuristic (Section 14.3) accounts for this by allocating more slots per ring in shared modes (default depth 256 for PerNuma/PerLlc vs 64-128 for PerCpu with many rings).


14.3.4 Request ID Generation

Request IDs must be unique within a mount to support response matching. The baseline protocol uses per-ring monotonic IDs. With N rings, two strategies are possible:

Chosen strategy: Global atomic counter per mount.

/// Generate a mount-globally unique request ID.
///
/// All rings on this mount share a single AtomicU64 counter. This ensures
/// that request IDs are unique across all rings without per-ring namespacing.
///
/// Cost: one atomic fetch_add(1, Relaxed) per VFS operation. On x86-64,
/// this is a LOCK XADD (~15-20 cycles under contention). On AArch64
/// with LSE atomics (ARMv8.1+), LDADD is ~10-30 cycles uncontended,
/// ~40-80 cycles under 64-core contention. On ARMv8.0 without LSE or
/// RISC-V without Zacas, the LL/SC retry loop costs ~10-20 cycles per
/// attempt with ~O(N) expected attempts under N-way contention. Under
/// 64-core saturation (worst case): ~640-1280 cycles per request_id
/// allocation. Phase 3 optimization: per-CPU pre-allocated ID ranges
/// can eliminate this contention for non-LSE ARM targets.
/// Under 64-core contention, the cache line bounces — but this is a
/// DIFFERENT cache line from the ring's head/published, so it does not
/// compound with ring contention.
///
/// The alternative (per-ring monotonic + ring_index prefix) was rejected
/// because it complicates response matching in the driver and breaks the
/// existing assumption that request_id is a simple monotonic u64.
#[inline]
fn alloc_request_id(ring_set: &VfsRingSet) -> u64 {
    ring_set.next_request_id.fetch_add(1, Ordering::Relaxed)
}

Longevity analysis: At 10 billion requests per second (far beyond any conceivable filesystem workload; 100 GB/s of NVMe bandwidth at 4 KB per request is only ~25M IOPS), a u64 counter wraps after ~58 years. At realistic rates (1M IOPS sustained), wrap time is ~584,000 years. Safe for 50-year uptime.
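The wrap-time arithmetic can be checked mechanically. A standalone sketch, not kernel code:

```rust
/// Years until a u64 request-ID counter wraps at a given allocation rate.
fn wrap_years(ops_per_sec: f64) -> f64 {
    const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0; // Julian year
    (u64::MAX as f64) / ops_per_sec / SECONDS_PER_YEAR
}

fn main() {
    // 10 billion requests/second: wraps after roughly 58 years.
    assert!((wrap_years(1e10) - 58.45).abs() < 0.1);
    // 1M IOPS sustained: roughly 584,000 years.
    assert!((wrap_years(1e6) - 584_542.0).abs() < 100.0);
}
```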

Why not per-ring monotonic IDs? Per-ring IDs would avoid the global atomic but were rejected for three reasons: (1) the cancellation side-channel uses request_id to identify requests, so cancel tokens would need a (ring_index, request_id) pair, breaking the existing CancelToken struct layout (Section 14.2); (2) the driver's response matching path would become ring-aware, since a response ID would be unique only within its ring; (3) crash recovery log replay would have to reconstruct per-ring ordering from ring sequence numbers rather than using a single global ID sequence. The global counter adds ~15-80 cycles under contention, a small fraction of even the fastest VFS operation. Relaxed ordering on the counter is sufficient: the ring's Release/Acquire on published provides the happens-before guarantee between the writer (who wrote the request, including its ID) and the reader. The CacheLinePadded wrapping ensures no false sharing.


14.3.5 Doorbell Coalescing Across N Rings

With N rings, naively ringing each ring's doorbell independently would cause N doorbell interrupts per batch of operations. The coalesced doorbell mechanism aggregates notifications across all rings in a mount.

/// Coalesced doorbell for a VfsRingSet. Instead of each ring having an
/// independent doorbell, a single coalesced doorbell aggregates pending
/// work across all rings.
///
/// The producer (VFS) sets a per-ring "pending" bit and then decides
/// whether to ring the shared doorbell based on coalescing policy.
/// The consumer (driver) checks all rings with pending bits set.
#[repr(C, align(64))]
pub struct CoalescedDoorbell {
    /// Bitmask of rings that have new entries since the last doorbell.
    /// Bit i is set when ring i has new entries. The driver clears bits
    /// as it drains rings.
    ///
    /// Rust provides no 256-bit atomic type, so for MAX_VFS_RINGS = 256
    /// the mask is implemented as an array of 4 AtomicU64 values, each
    /// covering 64 rings.
    ///
    /// **32-bit architecture note** (ARMv7, PPC32): 64-bit atomics require
    /// LDREXD/STREXD (ARMv7) or lwarx/stwcx pairs (PPC32), which are slower
    /// than native-width atomics (~10-20 cycles vs ~3-5 cycles). This is
    /// acceptable: called per-VFS-operation (hot path), not per-mount. On
    /// 32-bit architectures, the ~10-20 cycle overhead for LDREXD/STREXD is
    /// <8% of minimum VFS metadata operation latency (~200-500 ns).
    /// Functional correctness is maintained on all architectures.
    pub pending_mask: [AtomicU64; 4],

    /// The actual doorbell register. Writing any non-zero value wakes
    /// the driver's consumer thread(s).
    pub doorbell: DoorbellRegister,

    /// Coalescing state (producer-side). Tracks entries since last doorbell
    /// for adaptive coalescing.
    pub coalescer: DoorbellCoalescer,
}
// kernel-internal, not KABI — CoalescedDoorbell contains DoorbellRegister and
// DoorbellCoalescer which have platform-dependent layout. Accessed only within
// Tier 0 Core via VfsRingSet.

impl CoalescedDoorbell {
    /// Mark a ring as having pending entries and optionally ring the doorbell.
    ///
    /// Called by the VFS after enqueueing a request on a specific ring.
    /// The doorbell is rung when:
    /// (a) the coalescer's pending_count reaches max_batch, OR
    /// (b) the coalescer's timeout expires (first entry in batch is older
    ///     than coalesce_timeout_us), OR
    /// (c) the request is a synchronous high-priority operation (fsync,
    ///     mount, unmount) — these bypass coalescing entirely.
    ///
    /// Cost: one atomic OR (~5-10 cycles) + conditional doorbell write.
    #[inline]
    pub fn notify(&self, ring_index: u16, force: bool) {
        let word = ring_index as usize / 64;
        let bit = ring_index as u64 % 64;
        self.pending_mask[word].fetch_or(1u64 << bit, Ordering::Release);

        if force || self.coalescer.should_ring() {
            self.doorbell.ring();
            self.coalescer.reset();
        }
    }

    /// Read and clear pending ring mask. Called by the driver consumer.
    ///
    /// Returns a snapshot of which rings have pending entries, then clears
    /// those bits. The driver iterates the set bits and drains each ring.
    #[inline]
    pub fn take_pending(&self) -> [u64; 4] {
        let mut result = [0u64; 4];
        for i in 0..4 {
            result[i] = self.pending_mask[i].swap(0, Ordering::AcqRel);
        }
        result
    }
}

Coalescing policy for VFS operations:

| Operation class | Coalescing behavior | Rationale |
|---|---|---|
| Synchronous metadata (Fsync, Mount, Unmount, Freeze, Thaw, SyncFs) | Force doorbell immediately (`force = true`) | Caller is blocked waiting for completion. Coalescing adds latency with no throughput benefit. |
| Readahead (Readahead, ReadPage) | Coalesce up to batch-32 or 50 us timeout | Readahead is speculative; latency tolerance is high. Batching amortizes doorbell cost. |
| Regular I/O (Read, Write, Lookup, Create, etc.) | Coalesce up to batch-8 or 20 us timeout | Balance between latency and throughput. |
| Batched metadata (StatxBatch) | Coalesce entire batch into one doorbell | Already batched by the io_uring path. |
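The batch/timeout policy can be sketched as follows. The `DoorbellCoalescer` internals are not specified in this section, so the field names (`pending_count`, `batch_start_ns`, `max_batch`, `coalesce_timeout_ns`) are illustrative assumptions, not the actual layout; the real `should_ring()` takes no arguments, and `now_ns` is injected here only to keep the sketch deterministic and testable:

```rust
use core::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Illustrative coalescer implementing the batch/timeout policy above.
pub struct DoorbellCoalescerSketch {
    pending_count: AtomicU32,  // entries enqueued since the last doorbell
    batch_start_ns: AtomicU64, // timestamp of the first entry in the batch
    max_batch: u32,            // policy: ring after this many entries
    coalesce_timeout_ns: u64,  // policy: ring if the batch is older than this
}

impl DoorbellCoalescerSketch {
    pub fn new(max_batch: u32, coalesce_timeout_ns: u64) -> Self {
        Self {
            pending_count: AtomicU32::new(0),
            batch_start_ns: AtomicU64::new(0),
            max_batch,
            coalesce_timeout_ns,
        }
    }

    /// Record one enqueued entry; returns true if the doorbell should ring.
    pub fn should_ring(&self, now_ns: u64) -> bool {
        let prev = self.pending_count.fetch_add(1, Ordering::Relaxed);
        if prev == 0 {
            // First entry of a new batch: start the timeout clock.
            self.batch_start_ns.store(now_ns, Ordering::Relaxed);
        }
        prev + 1 >= self.max_batch
            || now_ns.wrapping_sub(self.batch_start_ns.load(Ordering::Relaxed))
                >= self.coalesce_timeout_ns
    }

    /// Called after the doorbell is rung: start a fresh batch.
    pub fn reset(&self) {
        self.pending_count.store(0, Ordering::Relaxed);
    }
}

fn main() {
    // Regular I/O policy from the table: batch-8 or 20 us timeout.
    let c = DoorbellCoalescerSketch::new(8, 20_000);
    for i in 0..7 {
        assert!(!c.should_ring(i)); // 7 entries inside the window: hold
    }
    assert!(c.should_ring(7)); // 8th entry reaches max_batch: ring
    c.reset();
    assert!(!c.should_ring(100)); // new batch, 1 entry
    assert!(c.should_ring(100 + 25_000)); // 25 us elapsed: timeout fires
}
```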

Performance impact: With N=64 rings and coalesced doorbells, the PostgreSQL checkpoint scenario goes from 64 independent doorbells per batch to 1 coalesced doorbell. This saves ~63 * doorbell_cost (~5-150 cycles depending on whether the doorbell is an MMIO write or a memory write with interrupt). Combined with the elimination of cache line bouncing on the ring head, the total saving per fsync batch is ~200-4500 cycles.


14.3.6 Mount-Time Negotiation

Ring count is negotiated between the VFS and the filesystem driver during the mount() sequence. The driver advertises its capability; the VFS selects the actual count based on system topology and mount options.

14.3.6.1 Driver Capability Advertisement

The KABI driver manifest (Section 12.6) is extended with a VFS ring count field:

/// Extension to KabiDriverManifest for VFS drivers.
/// Placed in section `.kabi_vfs_caps` adjacent to `.kabi_manifest`.
#[repr(C)]
pub struct KabiVfsCapabilities {
    /// Magic: 0x56465343 ("VFSC") — identifies a valid VFS capability block.
    pub magic: u32,
    /// Structure version (currently 1).
    pub version: u32,

    /// Maximum number of VFS request rings this driver can consume.
    /// 1 = legacy single-ring mode (backward compatible).
    /// N > 1 = driver supports multi-ring mode with up to N rings.
    ///
    /// The driver must be prepared to handle any ring_count in [1, ring_count_max].
    /// The VFS selects the actual count and communicates it during mount.
    pub ring_count_max: u16,

    /// Driver's preferred ring granularity hint. The VFS may override this
    /// based on system topology and mount options.
    pub preferred_granularity: RingGranularity,

    /// Reserved for future use. Must be zero.
    pub _reserved: [u8; 5],
}

const_assert!(size_of::<KabiVfsCapabilities>() == 16);

/// Default VFS capabilities for drivers that do not include a `.kabi_vfs_caps`
/// section (backward compatibility).
pub const KABI_VFS_CAPS_DEFAULT: KabiVfsCapabilities = KabiVfsCapabilities {
    magic: 0x56465343,
    version: 1,
    ring_count_max: 1,         // Single-ring mode (legacy).
    preferred_granularity: RingGranularity::Single,
    _reserved: [0; 5],
};
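A load-time validation pass over this block might look like the following sketch. The helper name `validate_vfs_caps` is hypothetical, and the `RingGranularity` variants beyond `Single` are assumed from the granularity modes listed in Section 14.3:

```rust
#[repr(u8)]
#[derive(Clone, Copy, PartialEq)]
pub enum RingGranularity { Single = 0, PerCpu = 1, PerNuma = 2, PerLlc = 3, Fixed = 4 }

#[repr(C)]
pub struct KabiVfsCapabilities {
    pub magic: u32,
    pub version: u32,
    pub ring_count_max: u16,
    pub preferred_granularity: RingGranularity,
    pub _reserved: [u8; 5],
}

const KABI_VFS_CAPS_MAGIC: u32 = 0x5646_5343; // "VFSC"

/// Hypothetical helper: reject a malformed `.kabi_vfs_caps` block at
/// driver load time rather than failing later at mount time.
fn validate_vfs_caps(caps: &KabiVfsCapabilities) -> bool {
    caps.magic == KABI_VFS_CAPS_MAGIC
        && caps.version == 1
        && caps.ring_count_max >= 1   // zero rings is meaningless
        && caps._reserved == [0u8; 5] // forward-compat: must be zero
}

fn main() {
    let legacy = KabiVfsCapabilities {
        magic: KABI_VFS_CAPS_MAGIC,
        version: 1,
        ring_count_max: 1,
        preferred_granularity: RingGranularity::Single,
        _reserved: [0; 5],
    };
    assert!(validate_vfs_caps(&legacy));
    // Layout check mirrors the const_assert above.
    assert_eq!(core::mem::size_of::<KabiVfsCapabilities>(), 16);
}
```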

14.3.6.2 Negotiation Protocol

Mount sequence (extended from [Section 14.2](#vfs-ring-buffer-protocol)):

1. VFS reads driver's KabiVfsCapabilities from the .kabi_vfs_caps ELF section.
   If absent, use KABI_VFS_CAPS_DEFAULT (ring_count_max = 1).

2. VFS determines the desired ring count:
   a. If mount option `vfs_ring_count=N` is specified: use min(N, driver.ring_count_max).
   b. If mount option `vfs_ring_granularity=<mode>` is specified: compute ring count
      from topology (per-cpu, per-numa, per-llc).
   c. If no mount options: auto-select based on online CPU count:
      - 1-4 CPUs: 1 ring (single-ring mode, no overhead).
      - 5-16 CPUs: min(nr_numa_nodes, driver.ring_count_max) rings (PerNuma).
      - 17-64 CPUs: min(nr_llc_groups, driver.ring_count_max) rings (PerLlc).
      - 65+ CPUs: min(nr_cpus, driver.ring_count_max) rings (PerCpu).

3. VFS allocates ring_count VfsRingPair structures in shared memory.
   Each ring has independent head/tail/published/size fields.
   Ring entry size and depth are uniform across all rings (depth is set
   by the per-mount option `vfs_ring_depth=N`).

4. VFS builds the cpu_to_ring mapping table.

5. VFS passes the ring set to the driver during mount initialization.
   For T1 transport: the KabiT1EntryFn signature is extended to accept
   a ring array:
     entry_ring(ksvc, rings: *mut RingBuffer, ring_count: u16) -> u32
   For backward compatibility: if ring_count == 1, this is identical to
   the existing single-ring entry point.

6. Driver initializes its consumer side for all ring_count rings.
   The driver may spawn multiple consumer threads or use a single thread
   with round-robin polling — this is an internal driver decision.
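The auto-selection rule in step 2c can be written as a pure function of system topology. A sketch; the parameter names are illustrative (the real code reads these values from the topology subsystem):

```rust
/// Step 2c: topology-based ring count selection, clamped to the
/// driver's advertised `ring_count_max`.
fn auto_select_ring_count(
    nr_cpus: u16,
    nr_numa_nodes: u16,
    nr_llc_groups: u16,
    driver_ring_count_max: u16,
) -> u16 {
    let desired = match nr_cpus {
        0..=4 => 1,              // single-ring mode, no overhead
        5..=16 => nr_numa_nodes, // PerNuma
        17..=64 => nr_llc_groups,// PerLlc
        _ => nr_cpus,            // PerCpu
    };
    desired.min(driver_ring_count_max).max(1)
}

fn main() {
    assert_eq!(auto_select_ring_count(4, 1, 1, 64), 1);
    assert_eq!(auto_select_ring_count(16, 2, 4, 64), 2);    // PerNuma
    assert_eq!(auto_select_ring_count(64, 2, 8, 64), 8);    // PerLlc
    assert_eq!(auto_select_ring_count(128, 2, 16, 64), 64); // capped by driver
}
```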

14.3.6.3 Mount Options

New mount options for per-CPU rings:

| Mount option | Type | Default | Description |
|---|---|---|---|
| `vfs_ring_count=N` | u16 | auto | Number of VFS rings. 0 = auto (topology-based). 1 = force single-ring. |
| `vfs_ring_granularity=<mode>` | string | auto | One of: `auto`, `per-cpu`, `per-numa`, `per-llc`, `fixed`. `auto` selects based on CPU count (step 2c above). |
| `vfs_ring_depth=N` | u32 | 256 | Depth of each individual ring. Same as the existing mount option; applies uniformly to all rings. |

14.3.7 Driver-Side Multiplexing

The filesystem driver must consume requests from N rings instead of one. Three consumer strategies are supported:

14.3.7.1 Strategy 1: Single-Thread Round-Robin (Simple Drivers)

/// Simple round-robin consumer for multi-ring VFS.
/// Suitable for filesystem drivers with a single consumer thread.
///
/// The thread iterates all rings in round-robin order, draining each ring
/// before moving to the next. The pending_mask from the coalesced doorbell
/// guides which rings to check, avoiding wasted iteration over empty rings.
fn vfs_consumer_round_robin(
    rings: &[VfsRingPair],
    doorbell: &CoalescedDoorbell,
) {
    loop {
        // Wait for doorbell notification.
        doorbell.doorbell.wait();

        // Read which rings have pending work.
        let pending = doorbell.take_pending();

        // Iterate set bits — each bit corresponds to a ring with work.
        for word_idx in 0..4 {
            let mut bits = pending[word_idx];
            while bits != 0 {
                let bit = bits.trailing_zeros() as u16;
                let ring_idx = (word_idx as u16 * 64) + bit;
                bits &= bits - 1; // Clear lowest set bit.

                // Drain this ring completely before moving to the next.
                drain_ring(&rings[ring_idx as usize]);
            }
        }
    }
}

14.3.7.2 Strategy 2: Per-Ring Consumer Threads (High-Performance Drivers)

/// High-performance consumer model: one kthread per ring.
///
/// Each kthread is affinity-bound to the NUMA node (or LLC group) that
/// the ring serves. This maximizes cache locality — the ring's head/published
/// cache lines stay in the consumer thread's L1/L2.
///
/// Suitable for high-IOPS filesystem drivers (ext4, XFS, btrfs) that can
/// process requests independently per ring.
///
/// The per-ring doorbell is embedded in each VfsRingPair (the existing
/// doorbell field). The coalesced doorbell is used only when strategy 1
/// is active. In strategy 2, each ring uses its own doorbell independently.
fn vfs_consumer_per_ring(
    ring: &VfsRingPair,
    ring_index: u16,
) {
    loop {
        ring.doorbell.wait();
        drain_ring(ring);
    }
}

Drivers detect the ring count at mount time and choose:

- `ring_count == 1`: Use the existing single-consumer model (no change).
- `ring_count <= 4`: Use round-robin (one thread, minimal overhead).
- `ring_count > 4`: Use per-ring consumer threads (maximum parallelism).

The threshold (4) is a heuristic — with 4 rings, one thread can drain all rings without significant latency. Above 4, the round-robin cycle time exceeds the typical operation latency and per-ring threads become worthwhile.
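This mount-time choice reduces to a small selection function. A sketch; the enum and helper names are hypothetical, since the spec leaves the consumer model as an internal driver decision:

```rust
#[derive(Debug, PartialEq)]
enum ConsumerStrategy {
    SingleRing, // legacy single-consumer model
    RoundRobin, // one thread drains all rings
    PerRing,    // one consumer thread per ring
}

/// Mount-time strategy choice from the negotiated ring count.
/// The threshold of 4 mirrors the heuristic described above.
fn choose_consumer_strategy(ring_count: u16) -> ConsumerStrategy {
    match ring_count {
        0 | 1 => ConsumerStrategy::SingleRing, // 0 should not occur
        2..=4 => ConsumerStrategy::RoundRobin,
        _ => ConsumerStrategy::PerRing,
    }
}

fn main() {
    assert_eq!(choose_consumer_strategy(1), ConsumerStrategy::SingleRing);
    assert_eq!(choose_consumer_strategy(4), ConsumerStrategy::RoundRobin);
    assert_eq!(choose_consumer_strategy(64), ConsumerStrategy::PerRing);
}
```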

14.3.7.4 Response Routing

With N request rings, the driver sends responses on the corresponding response ring — ring index i's request ring has a paired response ring i. The VFS consumer for responses is already per-ring (each VfsRingPair has its own response_ring and completion WaitQueue). A thread blocked on a synchronous VFS operation sleeps on the specific ring's WaitQueue, not a global one. When the response arrives on ring i's response_ring, only threads waiting on ring i are woken.

Request flow:
  CPU 7 → cpu_to_ring[7] = ring 2 → ring_set.rings[2].request_ring → enqueue

Response flow:
  Driver dequeues from ring 2's request_ring
  Driver enqueues response on ring 2's response_ring
  VFS consumer for ring 2 wakes threads on ring 2's completion WaitQueue

This ensures that the response arrives on the same ring where the request was submitted. The thread that submitted the request is sleeping on that ring's WaitQueue and is woken directly — no global WaitQueue scanning.


14.3.8 Driver-Side Ring Entry Prefetch

When a filesystem driver's consumer thread dequeues a request from the ring, the ring protocol specifies that the consumer prefetches the next N ring entries into CPU cache while processing the current request. This hides the memory latency of reading ring entries behind the computation/I/O latency of processing the current request.

/// Prefetch the next `prefetch_count` ring entries after the current dequeue
/// position. Called by the driver's consumer loop immediately after dequeuing
/// a request, before beginning I/O processing for that request.
///
/// The prefetch is a cache hint (software prefetch instruction); it does not
/// modify the ring state or advance the consumer pointer. If the prefetched
/// entries are not yet published (tail has not advanced that far), the
/// prefetch is a harmless no-op (prefetching an unpublished slot reads
/// stale data that will be overwritten before the consumer reaches it).
///
/// **Architecture mapping**:
/// - x86-64: `_mm_prefetch(ptr, _MM_HINT_T0)` — prefetch into L1.
/// - AArch64: `PRFM PLDL1KEEP, [ptr]` — prefetch for load, keep in L1.
/// - ARMv7: `PLD [ptr]` — data prefetch.
/// - RISC-V: no standard prefetch instruction (Zicbop extension adds
///   `prefetch.r`; fallback is no-op on cores without Zicbop).
/// - PPC32/PPC64LE: `dcbt` — data cache block touch.
/// - s390x: no user-accessible prefetch; no-op.
/// - LoongArch64: `preld` — prefetch for load.
///
/// **Prefetch count**: 4 entries is the default. This covers the typical
/// pipeline depth where the driver has dispatched the current request to
/// a work queue and is about to dequeue the next. At ~320 bytes per
/// `VfsRequest` entry, 4 entries = 1,280 bytes = ~20 cache lines. This
/// fits comfortably in L1 on all architectures (minimum L1 = 16 KiB on
/// ARMv7 Cortex-A15). Configurable per-driver via the KABI manifest
/// field `ring_prefetch_count` (default 4, range 0-16, 0 = disabled).
///
/// **Cost**: ~1-4 cycles per prefetch instruction (overlapped with
/// current request processing — effectively zero additional latency).
///
/// **Benefit**: Eliminates ~50-100 cycles of L2/L3 read latency per
/// ring entry dequeue (the entry is already in L1 when the consumer
/// reaches it). Over a batch of 8 dequeues, this saves ~400-800 cycles.
#[inline(always)]
fn prefetch_ring_entries(
    ring: &RingBuffer<VfsRequest>,
    current_tail: u64,
    prefetch_count: u32,
) {
    let mask = ring.size as u64 - 1; // ring size is power-of-2
    for i in 1..=prefetch_count {
        let idx = (current_tail.wrapping_add(i as u64)) & mask;
        let entry_ptr = ring.data_ptr(idx as usize);
        // SAFETY: entry_ptr is within the ring's allocated data region
        // (bounded by mask). The prefetch is a hint and does not dereference
        // the pointer; no memory safety violation even if the slot is
        // unpublished.
        unsafe { arch::current::cpu::prefetch_read(entry_ptr as *const u8); }
    }
}

Integration point: The prefetch call is inserted into the consumer loop between "dequeue current entry" and "dispatch to work queue":

loop {
    let entry = ring.dequeue();           // Read current request
    prefetch_ring_entries(ring, tail, 4); // Prefetch next 4 while processing
    dispatch_to_work_queue(entry);        // Dispatch (may involve I/O)
    tail = tail.wrapping_add(1);
}

14.3.9 Completion Coalescing (Response Direction)

The doorbell coalescing mechanism (above) batches request-direction notifications (Tier 0 VFS → Tier 1 driver). An analogous mechanism is needed for the response direction (Tier 1 driver → Tier 0 VFS) to avoid waking the VFS consumer on every individual completion.

Problem: Without completion coalescing, a driver that completes 8 requests in rapid succession (e.g., 8 readahead pages arriving from NVMe in a single interrupt) generates 8 separate WaitQueue wakeups on the VFS side. Each wakeup involves an IPI to the waiting CPU (~200-500 cycles on x86-64 cross-core) plus the WaitQueue wake protocol (~50-100 cycles). For 8 completions: ~2,000-4,800 cycles of wakeup overhead.

Solution: The driver batches completions on the response ring and signals the VFS with a single completion doorbell after writing N responses (or after a configurable timeout).

/// Completion coalescing state, embedded in each VfsRingPair.
/// The driver side accumulates completions and signals the VFS consumer
/// only after the batch threshold is reached or the coalescing timeout
/// expires.
///
/// **Classification**: Nucleus (data structure), Evolvable (threshold/timeout
/// parameters are ML-tunable via ParamId 0x0309 and 0x030A).
// SAFETY: All fields are accessed exclusively by the driver's consumer
// thread via `&mut self`. No cross-domain or cross-thread reads. Adding
// a diagnostic read path requires converting to AtomicU32/AtomicU64.
pub struct CompletionCoalescer {
    /// Number of responses written since the last VFS wakeup.
    /// Incremented by the driver on each response_ring enqueue.
    /// Reset to 0 after signaling the VFS.
    pub pending_completions: u32,

    /// Batch threshold: signal VFS after this many completions.
    /// Default: 8 for regular I/O, 1 for synchronous operations
    /// (Fsync, Mount — these bypass coalescing because the caller
    /// is blocked waiting for exactly one response).
    pub batch_threshold: u32,

    /// Timestamp (in TSC or arch-equivalent monotonic cycles) of the
    /// first unsignaled completion. If the time since first unsignaled
    /// completion exceeds `coalesce_timeout_cycles`, signal the VFS
    /// regardless of batch size. This bounds worst-case latency for
    /// sparse completion streams.
    ///
    /// **CPU migration note**: If the consumer thread migrates between
    /// CPUs, TSC on the new CPU may differ from the old CPU (pre-Zen3
    /// AMD, non-constant-TSC platforms). The `wrapping_sub` check in
    /// `should_signal()` treats a negative delta as a very large positive
    /// value, causing immediate signal — a false positive (extra wakeup),
    /// not a missed wakeup. This is benign: one extra wakeup per
/// migration event (~once per second at most). On AArch64 (CNTVCT_EL0
    /// is globally synchronized) and x86 with constant_tsc + nonstop_tsc,
    /// migration has no effect on cycle counter monotonicity.
    ///
    /// **Recommendation**: Driver consumer threads SHOULD be affinity-pinned
    /// to a NUMA node or LLC group (see Strategy 2: Per-Ring Consumer Threads).
    /// Pinned threads avoid this edge case entirely.
    pub first_unsignaled_cycles: u64,

    /// Coalescing timeout in cycles. Default: ~10 us worth of cycles
    /// (e.g., ~30,000 cycles at 3 GHz). Converted from the ML-tunable
    /// parameter `vfs_completion_coalesce_timeout_us` (ParamId 0x030A)
    /// at mount time and on parameter update.
    pub coalesce_timeout_cycles: u64,
}

impl CompletionCoalescer {
    /// Called by the driver after writing a response to the response ring.
    /// Returns `true` if the VFS should be signaled (wake the completion
    /// WaitQueue), `false` if the completion should be coalesced.
    ///
    /// **Synchronous bypass**: If the response is for a synchronous
    /// operation (Fsync, Mount, Unmount, Freeze, Thaw, SyncFs), this
    /// function always returns `true` — the caller is blocked and must
    /// be woken immediately.
    #[inline]
    pub fn should_signal(&mut self, is_sync_op: bool) -> bool {
        if is_sync_op {
            self.pending_completions = 0;
            return true;
        }

        self.pending_completions += 1;

        if self.pending_completions == 1 {
            self.first_unsignaled_cycles = arch::current::cpu::read_cycle_counter();
        }

        if self.pending_completions >= self.batch_threshold {
            self.pending_completions = 0;
            return true;
        }

        let now = arch::current::cpu::read_cycle_counter();
        if now.wrapping_sub(self.first_unsignaled_cycles) >= self.coalesce_timeout_cycles {
            self.pending_completions = 0;
            return true;
        }

        false
    }
}

Driver integration: After writing each VfsResponseWire to the response ring:

ring.response_ring.enqueue(response);
ring.response_ring.inner.published.store(new_published, Release);
if ring.completion_coalescer.should_signal(is_sync_op) {
    ring.completion.wake_all(); // Signal VFS consumer
}

Performance impact: With completion coalescing at batch=8, the 8-readahead scenario generates 1 wakeup instead of 8. Savings: ~1,750-4,200 cycles per readahead batch. Combined with request-side doorbell coalescing, a full readahead cycle (8 requests batched into 1 doorbell + 8 responses batched into 1 wakeup) saves ~2,000-5,000 cycles total vs. the uncoalesced baseline.
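The batching and timeout behavior can be exercised deterministically in a host-side model by injecting the cycle counter value instead of reading the TSC. A sketch under that assumption — the logic mirrors `should_signal()` above, but `CoalescerModel` and the injected `now` parameter are test scaffolding, not the spec's API:

```rust
/// Host-testable variant of the coalescer: same logic as should_signal(),
/// with the cycle counter passed in rather than read from the arch layer.
struct CoalescerModel {
    pending: u32,
    batch_threshold: u32,
    first_unsignaled: u64,
    timeout_cycles: u64,
}

impl CoalescerModel {
    fn should_signal(&mut self, is_sync_op: bool, now: u64) -> bool {
        if is_sync_op {
            self.pending = 0;
            return true; // synchronous bypass: caller is blocked
        }
        self.pending += 1;
        if self.pending == 1 {
            self.first_unsignaled = now; // first completion in this batch
        }
        if self.pending >= self.batch_threshold {
            self.pending = 0;
            return true; // batch threshold reached
        }
        if now.wrapping_sub(self.first_unsignaled) >= self.timeout_cycles {
            self.pending = 0;
            return true; // coalescing window expired
        }
        false
    }
}

fn main() {
    let mut c = CoalescerModel {
        pending: 0, batch_threshold: 8, first_unsignaled: 0, timeout_cycles: 30_000,
    };
    // 8 rapid completions (no time passing): exactly one signal, on the 8th.
    let signals: u32 = (0..8).map(|_| c.should_signal(false, 1_000) as u32).sum();
    assert_eq!(signals, 1);
    // A lone completion, then a long gap: the timeout fires on the next
    // completion after the ~10 us window has elapsed.
    assert!(!c.should_signal(false, 2_000));
    assert!(c.should_signal(false, 2_000 + 40_000));
    println!("coalescer model ok");
}
```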


14.3.10 Crash Recovery

Crash recovery (Section 11.9) must drain ALL N rings when a filesystem driver crashes.

14.3.10.1 Unified VFS Driver Crash Recovery Sequence

The base protocol (Section 14.2) defines VFS-specific Steps 1-5.5. The general recovery protocol (Section 11.9) defines Steps 1-9. This section specifies the canonical merged sequence — the single authoritative ordering of all steps. Both base protocol and general recovery descriptions are normative for their individual step content, but THIS section defines the step ordering and interleaving. An implementing agent follows this sequence top-to-bottom.

UNIFIED VFS DRIVER CRASH RECOVERY SEQUENCE

Step U1.  [General 1]    FAULT DETECTED
          Hardware exception / watchdog / ring corruption in Tier 1 domain.

Step U1a. [General 1a]   TIER CHECK
          If effective_tier() == Tier::Zero: panic (no isolation).
          If Tier::One or Tier::Two: proceed.

Step U2.  [General 2]    ISOLATE
          Revoke domain (PKRU AD bit / POR_EL0 / DACR). Mask interrupts.

Step U3.  [General 2']   SET RING STATE
          ring_set.state = RECOVERING (blocks select_ring()).
          Per-ring inner.state = RING_STATE_DISCONNECTED.
          See set_all_rings_disconnected() below.

Step U4.  [General 2a]   NMI EJECTION
          NMI IPI to eject CPUs still in the crashed domain.

Step U5.  [VFS Step 1]   QUIESCE PRODUCERS
          Wait for all per-ring inflight_ops == 0 (5s timeout).
          This ensures no producer is mid-copy_from_user with a
          RESERVED slot. If timeout: SIGKILL stuck processes.
          See wait_for_producers_quiesced() below.

Step U6.  [VFS Step 2a]  EXTRACT RING DATA (ring pointers still valid)
          Walk each ring's request entries (tail..published).
          Collect orphaned ReadPage/Readahead entries (page_cache_id +
          page_index) for later page unlock.
          Collect DMA buffer handles for deferred freeing.
          See collect_orphaned_page_entries() and
          collect_dma_handles() in drain_all_vfs_rings() below.

Step U7.  [VFS Step 2b + General 3]  DRAIN AND RESET RINGS
          For each ring:
            (a) Drain response ring (driver -> VFS completions as EIO).
            (b) Fail pending requests (wake threads with EIO).
            (c) Free collected DMA buffer handles.
            (d) Reset all slot_states to EMPTY.
            (e) Reset ring pointers (head=tail=published=0).
            (f) Reset per-ring inner.state to RING_STATE_ACTIVE.
                NOTE: Between U7(f) and U17, individual ring states are ACTIVE
                while ring_set.state remains RECOVERING. This mixed state is
                safe and intentional:
                - Producers: blocked by ring_set.state == RECOVERING at
                  select_ring() — no new requests can be enqueued.
                - Consumer: the new driver (loaded in U14) needs ACTIVE
                  per-ring state to start its consumer loop. If ring.state
                  were still RING_STATE_DISCONNECTED, the consumer loop's Phase 1.5
                  state check would immediately break out.
          See drain_all_vfs_rings() below.

Step U8.  [VFS Step 2c]  WAKE ORPHANED PAGES
          For each collected OrphanedPageEntry:
            Set PageFlags::ERROR, unlock_page().
            Wakes wait_on_page_locked() waiters -> EIO.

Step U9.  [VFS Step 2.5] PAGE CACHE INTEGRITY CHECK
          Walk XArray trees for corruption detection.

Step U10. [VFS Step 3]   DIRTY PAGE DETECTION
          Identify dirty pages for deferred writeback.

Step U11. [General 4]    DRAIN PENDING I/O
          Complete all remaining user requests with EIO.
          Post io_uring CQEs with error status.

Step U12. [General 4a]   EMIT FMA EVENT
          fma_emit(FaultEvent::DriverCrash { ... })

Step U13. [General 5-7 + DMA]  DEVICE RESET + RELEASE LOCKS + UNLOAD
          FLR, KABI lock release, driver memory free.
          DMA quiescence (FLR + IOTLB invalidation + wait_dma_quiesce) was
          initiated between U2-U4 as part of the unified interleaving
          specified in [Section 11.9](11-drivers.md#crash-recovery-and-state-preservation--dma-quiescence-during-crash-recovery).
          By this step, DMA is fully quiesced and IOMMU entries are revoked.

Step U13a. [VFS Step 5.5] GENERATION COUNTER BUMP (moved before U14)
          sb.driver_generation.fetch_add(1, Release)
          **This MUST happen BEFORE driver reload (U14)** so that the
          new driver instance and all its responses (including writeback
          completions in U15) carry the NEW generation. If the bump were
          after U15 (the old U16 position), writeback responses from U15
          would carry the OLD generation and be discarded by the VFS
          consumer's `response.driver_generation == sb.driver_generation`
          check, causing **silent data loss** (dirty pages completed by
          the driver but not marked clean).

Step U14. [VFS Step 4 + General 8]  RELOAD DRIVER AND REMOUNT
          Load new driver binary from CrashRecoveryPool or buddy allocator.
          The new driver instance goes through the standard KABI module Hello
          protocol ([Section 12.8](12-kabi.md#kabi-domain-runtime--module-hello-protocol)):
          (a) Register with the domain service.
          (b) Declare dependencies (block device, DMA allocator, etc.).
          (c) Domain service resolves dependencies and hands out handles.
              **KABI→VFS ring handoff**: The Hello protocol creates
              `CrossDomainRing` objects for generic KABI service bindings.
              However, VFS rings use a different ring type (`VfsRingPair`
              with 320-byte `VfsRequest` entries and `VfsOpcode`-based
              dispatch). The handoff works as follows:
              - The generic KABI Hello protocol creates `CrossDomainRing`
                objects for the driver's non-VFS dependencies (DMA, crypto,
                etc.) — these use the standard 64-byte `T1CommandEntry`.
              - For the VFS-specific ring, the domain service does NOT
                create a `CrossDomainRing`. Instead, it passes the existing
                `VfsRingSet` pointer (which survived the crash — ring
                memory is kernel-owned, not domain-owned) directly to the
                driver's `vfs_init()` entry point. The `VfsRingSet` was
                reset in Steps U7-U10 and its rings are ready for reuse.
              - The driver's `vfs_init()` receives both the generic KABI
                handles (from Hello) and the VFS-specific `VfsRingSet`
                pointer (passed separately by the domain service).

          **VFS initialization KABI interface**: The `vfs_init()` function
          is declared in the filesystem KABI `.kabi` definition as an
          optional initialization method (present only for filesystem
          drivers, not for all Tier 1 drivers):
          ```rust
          /// Filesystem-specific initialization. Called by the domain
          /// service after the Hello protocol completes and generic KABI
          /// handles are resolved. Receives the VfsRingSet for this mount.
          ///
          /// The driver creates consumer threads (one per ring pair) using
          /// `kernel_services.create_kthread()` — a KABI kernel-services
          /// method, NOT a direct kthread_create() syscall. Tier 1 drivers
          /// cannot create kthreads directly; they request creation via
          /// the kernel-services KABI handle obtained during Hello.
          ///
          /// Ring memory permissions: VfsRingSet and its ring data regions
          /// are in shared memory mapped with the driver's domain key
          /// (read-write for the driver's PKEY/POE/DACR domain). The ring
          /// control structures (head/tail/published) are AtomicU64 —
          /// interior mutability through shared references is safe.
          fn vfs_init(
              &self,
              ring_set: &VfsRingSet,
              kernel_services: &KernelServicesHandle,
          ) -> Result<(), KabiError>;
          ```
          (d) The new driver inherits the existing sb.ring_set: the domain
              service passes the VfsRingSet pointer to the driver's init
              function. The driver starts N consumer threads (one per ring
              in ring_set), each bound to the corresponding VfsRingPair.
              The per-ring inner.state was reset to ACTIVE in U7(f), so
              the consumer loop's Phase 1.5 state check passes.
          (e) Mount RO, run fsck_fast() (fast metadata consistency check),
              remount RW if fsck_fast passes.

Step U15. [VFS Step 5]   FLUSH DEFERRED DIRTY PAGES
          writeback_deferred_dirty(sb)

Step U16. (REMOVED — generation bump moved to Step U13a, before driver reload.
          See U13a rationale above. This step is now a no-op placeholder
          to preserve step numbering.)

Step U17. [Per-CPU ext]  RING SET ACTIVE
          ring_set.state.store(VFSRS_ACTIVE, Release)
          This is the LAST step. The Release ordering ensures all
          prior ring resets, slot_states clears, and driver init are
          visible to producers before they observe ACTIVE.

Step U18. [General 9]    DRIVER READY
          Driver announces readiness to domain service.

VFS/Block Recovery Interleaving (Same-Domain Crash)

When a Tier 1 filesystem driver crashes and it shares a domain with the block driver (common on platforms with limited isolation domains), TWO crash recovery sequences are triggered for the SAME domain crash event:

  1. Block I/O recovery (Section 15.2): Drains block request queues, completes in-flight bios with EIO, resets the block device's hardware queues.
  2. VFS recovery (this section): The unified sequence U1-U18 above.

Ordering constraint: Block I/O recovery MUST complete before VFS Step U14 (RELOAD DRIVER AND REMOUNT). The filesystem driver's vfs_init() submits block I/O (for fsck_fast metadata reads), which requires the block device to be operational. The crash recovery worker (Section 12.8) serializes these via the domain-level recovery_mutex — block recovery runs first (it is faster: ~10-50ms for queue drain), then VFS recovery starts.
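The serialization can be sketched as a single recovery worker holding the domain-level mutex and running the two sequences in order. All names below (`DomainRecovery`, `handle_domain_crash`, the log strings) are illustrative stand-ins for the crash recovery worker of Section 12.8:

```rust
use std::sync::Mutex;

/// Illustrative sketch of the crash-recovery worker for a shared
/// VFS+block domain: block recovery runs first (fast queue drain),
/// then the unified VFS sequence, whose Step U14 (driver reload +
/// fsck_fast) may then submit block I/O safely.
struct DomainRecovery {
    recovery_mutex: Mutex<()>,
    log: Mutex<Vec<&'static str>>,
}

impl DomainRecovery {
    fn block_recovery(&self) {
        // Section 15.2: drain request queues, EIO in-flight bios,
        // reset hardware queues. Typically ~10-50ms.
        self.log.lock().unwrap().push("block: drain queues, fail bios, reset hw");
    }
    fn vfs_recovery(&self) {
        self.log.lock().unwrap().push("vfs: U1-U13a");
        // U14 submits block I/O (fsck_fast metadata reads) — legal only
        // because block_recovery() has already completed.
        self.log.lock().unwrap().push("vfs: U14 reload driver + fsck_fast");
        self.log.lock().unwrap().push("vfs: U15-U18");
    }
    fn handle_domain_crash(&self) {
        let _guard = self.recovery_mutex.lock().unwrap();
        self.block_recovery(); // must finish before VFS Step U14
        self.vfs_recovery();
    }
}

fn main() {
    let r = DomainRecovery { recovery_mutex: Mutex::new(()), log: Mutex::new(Vec::new()) };
    r.handle_domain_crash();
    let log = r.log.lock().unwrap();
    assert!(log[0].starts_with("block:"));
    assert!(log[1].starts_with("vfs:"));
    println!("recovery order: {:?}", *log);
}
```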

Bio completion callback domain: When the VFS submits a bio to the block layer, the bio completion callback (bio.bi_end_io) executes in the context of the block device's interrupt handler — which runs in Tier 0 (Core domain), not in the VFS driver domain. This is by design: bio completions update Core-resident page cache state (PG_writeback, PG_uptodate, PG_error flags) and wake Tier 0 waitqueues. The bio completion callback NEVER enters the VFS Tier 1 domain — it only touches Core data structures. This ensures bio completions continue to work even during VFS driver recovery.

Recovery Functions

Step U3: Set ring state

fn set_all_rings_disconnected(ring_set: &VfsRingSet) {
    // Block new select_ring() calls.
    ring_set.state.store(VFSRS_RECOVERING, Ordering::Release);

    for i in 0..ring_set.ring_count as usize {
        // SAFETY: rings is valid for ring_count elements.
        let ring = unsafe { &*ring_set.rings.add(i) };
        ring.request_ring.inner.state.store(RING_STATE_DISCONNECTED, Ordering::Release);
        ring.response_ring.inner.state.store(RING_STATE_DISCONNECTED, Ordering::Release);
    }
}

The mount-level ring_set.state is checked at the top of select_ring() — if not ACTIVE, the VFS operation returns ENXIO immediately without touching any individual ring. This provides a fast-path rejection of new operations during recovery, avoiding the need to check each ring's state individually.
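The fast-path gate is a single Acquire load at the top of select_ring(). A minimal host-side sketch — `RingSetStateModel` and the error enum are illustrative, and the real mapping from CPU to ring is more elaborate than the modulo shown here:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const VFSRS_ACTIVE: u32 = 0;
const VFSRS_RECOVERING: u32 = 1;

/// Sketch of the select_ring() fast-path: one Acquire load of the
/// mount-level state gates all per-ring work, so no individual ring
/// state needs checking during recovery.
struct RingSetStateModel {
    state: AtomicU32,
    ring_count: u16,
}

#[derive(Debug, PartialEq)]
enum VfsError { Enxio }

impl RingSetStateModel {
    fn select_ring(&self, cpu: usize) -> Result<usize, VfsError> {
        // Fast-path rejection: anything other than ACTIVE (set by
        // Step U3, cleared by Step U17) bounces the operation.
        if self.state.load(Ordering::Acquire) != VFSRS_ACTIVE {
            return Err(VfsError::Enxio);
        }
        Ok(cpu % self.ring_count as usize) // illustrative mapping
    }
}

fn main() {
    let set = RingSetStateModel { state: AtomicU32::new(VFSRS_ACTIVE), ring_count: 4 };
    assert!(set.select_ring(7).is_ok());
    // During recovery, all new operations are rejected at the gate.
    set.state.store(VFSRS_RECOVERING, Ordering::Release);
    assert_eq!(set.select_ring(7), Err(VfsError::Enxio));
    println!("fast-path rejection ok");
}
```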

Step U5: Wait for producer quiescence

/// Wait for all in-flight producers to complete their current operations.
///
/// After set_all_rings_disconnected(), no NEW operations can enter via
/// select_ring() (returns ENXIO). But producers that already passed
/// select_ring() and reserve_slot() may be mid-copy_from_user with a
/// RESERVED slot. This function waits for those producers to finish.
///
/// The inflight_ops counter is incremented in reserve_slot() and
/// decremented in complete_slot(). When it reaches zero for all rings,
/// no producer is between reserve and complete — safe to drain.
///
/// Timeout: 5 seconds (matches the per-sb quiescence timeout in the
/// base protocol). After timeout: SIGKILL processes with stuck operations.
///
/// **Busy-wait justification**: This is a cold path (crash recovery only,
/// not normal operation). The spin_loop uses `core::hint::spin_loop()`
/// which issues a PAUSE instruction on x86 (reducing power and yielding
/// the pipeline to other hyperthreads). The 5-second window is the maximum;
/// typical quiescence completes in microseconds (the stuck producer only
/// needs to finish its `copy_from_user` + `complete_slot()` sequence).
/// A WaitQueue-based approach was considered but rejected because the
/// producers are not aware of the recovery and thus cannot signal a
/// waitqueue — polling is the only option.
fn wait_for_producers_quiesced(ring_set: &VfsRingSet) {
    let deadline = ktime_get() + Duration::from_secs(5);

    loop {
        let mut all_quiesced = true;
        for i in 0..ring_set.ring_count as usize {
            // SAFETY: rings pointer valid for ring_count elements.
            let ring = unsafe { &*ring_set.rings.add(i) };
            if ring.request_ring.inflight_ops.load(Ordering::Acquire) != 0 {
                all_quiesced = false;
                break;
            }
        }
        if all_quiesced {
            return;
        }
        if ktime_get() > deadline {
            // Escalation: SIGKILL stuck processes (same as base protocol).
            sigkill_stuck_producers(ring_set);
            return;
        }
        core::hint::spin_loop();
    }
}

/// Identify and SIGKILL processes that have operations stuck in the VFS ring.
///
/// Called when `wait_for_producers_quiesced()` times out after 5 seconds.
/// A producer is "stuck" if it called `reserve_slot()` (incrementing
/// `inflight_ops`) but never called `complete_slot()` (decrementing it).
/// This can happen if the producer's thread is blocked in an uninterruptible
/// sleep between reserve and complete (e.g., page fault during `copy_from_user`
/// that blocks on I/O to the now-crashed filesystem — a deadlock).
///
/// **Mechanism**: Each ring maintains a per-ring `WaitQueue` that producers
/// sleep on when the ring is full (`reserve_slot()` calls `wq.wait_event`).
/// After the ring state is set to RING_STATE_DISCONNECTED, producers woken from this
/// waitqueue check the state and return ENXIO. For producers stuck in
/// `copy_from_user()` (not sleeping on the waitqueue), SIGKILL is the only
/// option — it interrupts the page fault handler and causes the thread to
/// enter `do_exit()`.
///
/// The function iterates all tasks in the system and sends SIGKILL to any
/// task that has a pending VFS operation on a ring belonging to this ring_set.
/// This is identified by checking if the task's `current_vfs_ring` pointer
/// (set in `reserve_slot()`, cleared in `complete_slot()`) points to a ring
/// in this ring_set.
fn sigkill_stuck_producers(ring_set: &VfsRingSet) {
    // Iterate the task table (RCU-protected read) looking for tasks
    // with current_vfs_ring pointing into this ring_set's ring array.
    let ring_base = ring_set.rings as usize;
    let ring_end = ring_base + ring_set.ring_count as usize
        * core::mem::size_of::<VfsRingPair>();

    rcu_read_lock();
    for_each_task(|task| {
        let ring_ptr = task.current_vfs_ring.load(Ordering::Relaxed) as usize;
        if ring_ptr >= ring_base && ring_ptr < ring_end {
            // This task has an in-flight VFS operation on one of our rings.
            // Send SIGKILL to force it out of whatever blocking state it's in.
            signal_wake_up(task, /* fatal */ true);
        }
    });
    rcu_read_unlock();
}

Steps U6-U8: Drain all VFS rings (unified 4-phase function)

/// Maximum orphaned page entries to collect during crash recovery.
/// With 256 rings * 256 depth = 65536 total entries, but only a
/// fraction are ReadPage/Readahead. 4096 covers the worst case
/// for a single mount under heavy read load.
const MAX_ORPHANED_PAGES: usize = 4096;

/// Orphaned page entry collected from ring during crash recovery.
/// Contains Core-resident data only (no VFS-domain pointers).
struct OrphanedPageEntry {
    /// Core-resident AddressSpace handle (pointer cast to u64).
    /// Valid because AddressSpace is in Core memory, pinned for
    /// the lifetime of the superblock.
    page_cache_id: u64,
    /// Page index within the AddressSpace.
    page_index: u64,
}

/// Drain all VFS rings during crash recovery.
///
/// This function implements Steps U6 (EXTRACT), U7 (DRAIN+RESET), and
/// collects data for U8 (WAKE). The caller invokes wake_orphaned_pages()
/// after this function returns.
///
/// **Four phases per ring** (executed sequentially per ring, rings
/// processed sequentially):
///
/// Phase 1: EXTRACT — read ring entries while pointers are valid.
///   Collect orphaned page entries (ReadPage/Readahead requests with
///   page_cache_id for Core-resident page lookup).
///   Collect DMA buffer handles for deferred freeing.
///
/// Phase 2: DRAIN — process completions and fail pending requests.
///   Drain response ring (EIO for all entries).
///   Wake blocked threads with EIO.
///   Free DMA buffer handles collected in Phase 1.
///
/// Phase 3: RESET — clear ring state for replacement driver.
///   Reset all slot_states to EMPTY.
///   Reset ring pointers (head = tail = published = 0).
///   Reset per-ring inner.state to RING_STATE_ACTIVE.
///
/// Phase 4 (after all rings): WAKE orphaned pages.
///   Done by caller using the returned orphaned_pages collection.
///
/// Order: rings are drained sequentially (ring 0, ring 1, ..., ring N-1).
/// No parallelism needed — crash recovery is a cold path with a 500ms
/// latency target.
fn drain_all_vfs_rings(
    ring_set: &VfsRingSet,
) -> ArrayVec<OrphanedPageEntry, MAX_ORPHANED_PAGES> {
    let mut orphaned_pages: ArrayVec<OrphanedPageEntry, MAX_ORPHANED_PAGES> =
        ArrayVec::new();

    for i in 0..ring_set.ring_count as usize {
        // SAFETY: rings pointer valid for ring_count elements.
        let ring = unsafe { &*ring_set.rings.add(i) };

        // Phase 1: EXTRACT — read ring entries while pointers are valid.
        let tail = ring.request_ring.inner.tail.load(Ordering::Acquire);
        let published = ring.request_ring.inner.published.load(Ordering::Acquire);
        let mask = ring.request_ring.inner.size as u64 - 1;
        let mut pos = tail;
        // Use `!=` (not `<`) for wrapping-safe comparison — consistent
        // with drain_ring() at the hot-path consumer (see its comment).
        // At u64 scale wrapping is unreachable (~58,000 years at 10M ops/sec),
        // but the correct idiom avoids self-contradiction in the spec.
        while pos != published {
            let idx = (pos & mask) as usize;
            // SAFETY: pos is between tail and published; all producers are
            // quiesced (Step U5 waited for inflight_ops == 0), so all slots
            // in [tail..published) were written by producers before they
            // decremented inflight_ops. Slots in this range are FILLED (not
            // RESERVED or EMPTY) because: (1) the producer's complete_slot()
            // call advances `published` only AFTER storing FILLED state, and
            // (2) quiescence guarantees no producer is mid-fill. A RESERVED
            // slot would imply an incomplete producer, contradicting the
            // inflight_ops == 0 quiescence condition.
            // The state is NOT verified at runtime — the quiescence guarantee
            // makes the check unnecessary, and adding one would add overhead
            // to a cold crash-recovery path for no correctness benefit.
            let entry: &VfsRequest = unsafe { ring.request_ring.read_entry(idx) };

            // Collect orphaned page entries for ReadPage/Readahead.
            match entry.opcode {
                VfsOpcode::ReadPage => {
                    if let VfsRequestArgs::ReadPage { page_index, page_cache_id, .. } = &entry.args {
                        if !orphaned_pages.is_full() {
                            orphaned_pages.push(OrphanedPageEntry {
                                page_cache_id: *page_cache_id,
                                page_index: *page_index,
                            });
                        }
                        // If full: log FMA warning. Pages will remain locked
                        // until oom_reaper or manual intervention.
                    }
                }
                VfsOpcode::Readahead => {
                    if let VfsRequestArgs::Readahead { start_index, nr_pages, page_cache_id, .. } = &entry.args {
                        for pg in 0..*nr_pages as u64 {
                            if !orphaned_pages.is_full() {
                                orphaned_pages.push(OrphanedPageEntry {
                                    page_cache_id: *page_cache_id,
                                    page_index: start_index + pg,
                                });
                            }
                        }
                    }
                }
                _ => {}
            }

            // Free DMA buffer handles carried by this request. (The spec's
            // "deferred freeing" in Phase 2 is folded into this extraction
            // walk, since each entry is visited exactly once.)
            free_request_dma_handles(entry);

            pos = pos.wrapping_add(1);
        }

        // Phase 2: DRAIN — process completions and fail pending.
        // NOTE: These operate on DIFFERENT rings within the same VfsRingPair:
        //   - drain_spsc_response_ring: advances the RESPONSE ring tail to
        //     discard driver completions (driver → VFS direction).
        //   - fail_pending_requests: checks the REQUEST ring pending_count
        //     and wakes blocked submitters (VFS → driver direction).
        // The response ring drain does NOT affect request ring state.
        // fail_pending_requests correctly sees the original pending_count
        // because it reads from a separate ring object.
        drain_spsc_response_ring(&ring.response_ring);
        fail_pending_requests(&ring.request_ring, &ring.completion);

        // Phase 3: RESET — clear ring state for replacement driver.
        // Reset all slot_states to EMPTY FIRST (before ring pointers).
        // Relaxed ordering is safe here because the downstream barrier
        // chain guarantees visibility: the ring_set.state store to
        // RING_STATE_ACTIVE (Step U17) uses Release ordering. The
        // replacement driver's consumer thread observes RING_STATE_ACTIVE
        // via Acquire load. This Release/Acquire pair establishes a
        // happens-before relationship: all Relaxed stores to slot_states
        // (done before the Release store) are visible to the consumer
        // thread (which reads after the Acquire load). No intermediate
        // Release is needed on the individual slot_state stores.
        for slot_idx in 0..ring.request_ring.inner.size as usize {
            // SAFETY: slot_states points to an array of `inner.size`
            // AtomicU8 elements, allocated at mount time. `slot_idx` is
            // bounded by `inner.size`. Raw pointer arithmetic is required
            // because `*const AtomicU8` does not support `[]` indexing.
            unsafe {
                (*ring.request_ring.slot_states.add(slot_idx)).store(
                    RingSlotState::Empty as u8,
                    Ordering::Relaxed,
                );
            }
        }
        for slot_idx in 0..ring.response_ring.inner.size as usize {
            // SAFETY: same invariant as request_ring slot_states above.
            unsafe {
                (*ring.response_ring.slot_states.add(slot_idx)).store(
                    RingSlotState::Empty as u8,
                    Ordering::Relaxed,
                );
            }
        }
        // Reset inflight_ops to zero (should already be zero after
        // quiescence, but defensive reset for correctness).
        ring.request_ring.inflight_ops.store(0, Ordering::Relaxed);
        ring.response_ring.inflight_ops.store(0, Ordering::Relaxed);
        // Reset ring pointers.
        ring.request_ring.inner.head.store(0, Ordering::Release);
        ring.request_ring.inner.published.store(0, Ordering::Release);
        ring.request_ring.inner.tail.store(0, Ordering::Release);
        ring.response_ring.inner.head.store(0, Ordering::Release);
        ring.response_ring.inner.published.store(0, Ordering::Release);
        ring.response_ring.inner.tail.store(0, Ordering::Release);
        // Reset per-ring state to Active for the replacement driver.
        ring.request_ring.inner.state.store(RING_STATE_ACTIVE, Ordering::Release);
        ring.response_ring.inner.state.store(RING_STATE_ACTIVE, Ordering::Release);
    }

    orphaned_pages
}

/// Free DMA buffer handles referenced by a pending VFS request.
/// Called during Phase 1 extraction to prevent DMA pool memory leaks.
///
/// Over 50-year uptime with periodic driver crashes, unfreed DMA handles
/// are a slow memory leak. Each crash could leak up to ring_depth *
/// ring_count DMA buffer handles if not properly freed.
fn free_request_dma_handles(entry: &VfsRequest) {
    // Free any DMA buffer handles carried by VfsRequestArgs variants.
    // Every variant with a `buf: DmaBufferHandle` or `entries_buf: DmaBufferHandle`
    // field must be handled. A missing arm leaks DMA pool memory on every crash —
    // over 50-year uptime, this is a slow memory leak.
    //
    // Note: For Read and Write, the inline small I/O path uses DmaBufferHandle::ZERO
    // (no DMA buffer allocated for inline payloads). The != ZERO check correctly
    // skips these. Large I/O paths DO allocate DMA buffers and their handles are
    // freed here.
    match &entry.args {
        // Page cache operations: buf is the DMA-mapped page buffer.
        VfsRequestArgs::ReadPage { buf, .. }
        | VfsRequestArgs::WritePage { buf, .. }
        | VfsRequestArgs::Readahead { buf, .. } => {
            if *buf != DmaBufferHandle::ZERO {
                dma_pool_free(*buf);
            }
        }
        // Byte-range I/O operations: buf is the DMA-mapped user data buffer.
        // DmaBufferHandle::ZERO for inline small I/O (no DMA buffer allocated).
        VfsRequestArgs::Read { buf, .. }
        | VfsRequestArgs::Write { buf, .. }
        | VfsRequestArgs::Readlink { buf, .. } => {
            if *buf != DmaBufferHandle::ZERO {
                dma_pool_free(*buf);
            }
        }
        // Extended attribute operations: value/buf is the DMA-mapped xattr data.
        VfsRequestArgs::SetXattr { value, .. } => {
            if *value != DmaBufferHandle::ZERO {
                dma_pool_free(*value);
            }
        }
        VfsRequestArgs::GetXattr { buf, .. }
        | VfsRequestArgs::ListXattr { buf, .. } => {
            if *buf != DmaBufferHandle::ZERO {
                dma_pool_free(*buf);
            }
        }
        // Directory and info operations: buf is the DMA-mapped result buffer.
        VfsRequestArgs::ShowOptions { buf, .. }
        | VfsRequestArgs::ReadDir { buf, .. } => {
            if *buf != DmaBufferHandle::ZERO {
                dma_pool_free(*buf);
            }
        }
        // Batched statx: entries_buf is the DMA-mapped result array.
        VfsRequestArgs::StatxBatch { entries_buf, .. } => {
            if *entries_buf != DmaBufferHandle::ZERO {
                dma_pool_free(*entries_buf);
            }
        }
        // Variants without DMA buffer handles — listed exhaustively
        // so the compiler catches new variants with DMA buffers.
        // Adding a VfsRequestArgs variant with a DmaBufferHandle field
        // WITHOUT adding an arm here is a DMA pool memory leak.
        VfsRequestArgs::Mount { .. }
        | VfsRequestArgs::Unmount { .. }
        | VfsRequestArgs::ForceUnmount { .. }
        | VfsRequestArgs::Statfs { .. }
        | VfsRequestArgs::SyncFs { .. }
        | VfsRequestArgs::Remount { .. }
        | VfsRequestArgs::Freeze { .. }
        | VfsRequestArgs::Thaw { .. }
        | VfsRequestArgs::Lookup { .. }
        | VfsRequestArgs::Create { .. }
        | VfsRequestArgs::Link { .. }
        | VfsRequestArgs::Unlink { .. }
        | VfsRequestArgs::Mkdir { .. }
        | VfsRequestArgs::Rmdir { .. }
        | VfsRequestArgs::Rename { .. }
        | VfsRequestArgs::Symlink { .. }
        | VfsRequestArgs::Mknod { .. }
        | VfsRequestArgs::GetAttr { .. }
        | VfsRequestArgs::SetAttr { .. }
        | VfsRequestArgs::Truncate { .. }
        | VfsRequestArgs::RemoveXattr { .. }
        | VfsRequestArgs::DirtyExtent { .. }
        | VfsRequestArgs::ReleasePage { .. }
        | VfsRequestArgs::Open { .. }
        | VfsRequestArgs::Release { .. }
        | VfsRequestArgs::Fsync { .. }
        | VfsRequestArgs::Ioctl { .. }
        | VfsRequestArgs::Mmap { .. }
        | VfsRequestArgs::Fallocate { .. }
        | VfsRequestArgs::SeekData { .. }
        | VfsRequestArgs::SeekHole { .. }
        | VfsRequestArgs::Poll { .. }
        | VfsRequestArgs::SpliceRead { .. }
        | VfsRequestArgs::SpliceWrite { .. }
        | VfsRequestArgs::EvictInode { .. }
        | VfsRequestArgs::TruncateRange { .. }
        | VfsRequestArgs::WriteInode { .. } => {}
    }
}

/// Drain the response ring (driver -> VFS completions) during crash recovery.
/// Any completions that the driver had posted but the VFS consumer had not
/// yet consumed are processed here. Since the driver crashed, these
/// completions may contain partial or corrupted data — they are treated
/// as EIO errors regardless of the completion status.
///
/// Walks from `response_ring.inner.tail` to `response_ring.inner.published`,
/// reading each completion entry and intentionally discarding it.
fn drain_spsc_response_ring(response_ring: &RingBuffer<VfsResponseWire>) {
    let mut tail = response_ring.inner.tail.load(Ordering::Acquire);
    let published = response_ring.inner.published.load(Ordering::Acquire);

    // Use `!=` (not `<`) for wrapping-safe comparison — consistent
    // with drain_ring() and drain_all_vfs_rings() (see SF-169).
    while tail != published {
        let idx = (tail % response_ring.inner.size as u64) as usize;
        // SAFETY: idx is computed modulo the ring size, so it is always
        // within the entry array bounds regardless of counter wrapping.
        let completion: &VfsResponseWire = unsafe {
            response_ring.read_entry(idx)
        };
        // Each response carries a request_id that maps to a waiting
        // thread's completion token. All completions, including
        // "successful" ones, are intentionally discarded: the driver
        // crashed, so its state is unknown and even success statuses
        // may reflect corruption. The waiting threads are woken by
        // fail_pending_requests() and observe VFSRS_RECOVERING in
        // their wait_event condition, which they translate into an
        // -EIO return to userspace. Discarding everything and letting
        // the application retry after the replacement driver loads is
        // the correct crash recovery policy.
        let _ = completion;
        tail = tail.wrapping_add(1);
    }
    response_ring.inner.tail.store(tail, Ordering::Release);
}

/// Fail all pending requests on a single request ring.
/// Walks from tail to published, waking each blocked thread with EIO.
///
/// Uses `wake_up_all()` (not a hypothetical `wake_all_with_error(EIO)`) because
/// `WaitQueueHead` does not have an error-passing wake method. Woken threads
/// check the ring/superblock state in their `wait_event` condition loop and
/// detect the RECOVERING state, translating it into an EIO return to userspace.
/// This is the standard Linux pattern: `wake_up_all()` + condition check in
/// the waiter loop.
fn fail_pending_requests(
    request_ring: &RingBuffer<VfsRequest>,
    completion: &WaitQueue,
) {
    let tail = request_ring.inner.tail.load(Ordering::Acquire);
    let published = request_ring.inner.published.load(Ordering::Acquire);
    let pending_count = published.wrapping_sub(tail);

    // Wake all threads on this ring's completion queue.
    // The woken threads check ring_set.state in their wait_event loop
    // condition and return EIO when they observe RECOVERING.
    if pending_count > 0 {
        completion.wake_up_all();
    }
}
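The waiter side of this pattern can be sketched with std::sync primitives standing in for the kernel WaitQueueHead. This is an illustrative userspace model, not the kernel API: the VFSRS_* constants are simplified, and a Mutex/Condvar wait replaces wait_event():

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

const VFSRS_ACTIVE: u32 = 0;
const VFSRS_RECOVERING: u32 = 2;
const EIO: i32 = 5;

// Stand-in for the kernel WaitQueueHead: a Mutex + Condvar pair.
struct WaitQueue {
    lock: Mutex<()>,
    cond: Condvar,
}

struct RingSet {
    state: AtomicU32,
    completion: WaitQueue,
}

/// Waiter side of the wake_up_all() + condition-check pattern: block
/// until either the completion fires or the ring set enters
/// RECOVERING, which the waiter itself translates into -EIO.
fn wait_for_completion(rs: &RingSet, done: &AtomicU32) -> Result<(), i32> {
    let mut guard = rs.completion.lock.lock().unwrap();
    loop {
        if done.load(Ordering::Acquire) != 0 {
            return Ok(()); // normal completion
        }
        if rs.state.load(Ordering::Acquire) == VFSRS_RECOVERING {
            return Err(-EIO); // recovery in progress: fail the request
        }
        guard = rs.completion.cond.wait(guard).unwrap();
    }
}

/// Recovery side: mark the ring set RECOVERING, then wake every
/// waiter. No error value travels through the wake itself.
fn fail_pending(rs: &RingSet) {
    rs.state.store(VFSRS_RECOVERING, Ordering::Release);
    let _g = rs.completion.lock.lock().unwrap();
    rs.completion.cond.notify_all(); // wake_up_all()
}

/// End-to-end scenario: a waiter blocks with no completion pending,
/// recovery wakes it, and it observes RECOVERING and returns Err(-EIO).
fn waiter_outcome_on_recovering() -> Result<(), i32> {
    let rs = Arc::new(RingSet {
        state: AtomicU32::new(VFSRS_ACTIVE),
        completion: WaitQueue { lock: Mutex::new(()), cond: Condvar::new() },
    });
    let done = Arc::new(AtomicU32::new(0));
    let (rs2, d2) = (Arc::clone(&rs), Arc::clone(&done));
    let waiter = thread::spawn(move || wait_for_completion(&rs2, &d2));
    thread::sleep(Duration::from_millis(20)); // let the waiter block
    fail_pending(&rs);
    waiter.join().unwrap()
}
```

Because the state store happens before the notify (which is taken under the waiter's lock), there is no lost-wakeup window: a waiter that has not yet blocked re-checks the state before sleeping.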

/// Wake orphaned pages after ring drain (Step U8).
///
/// Uses Core-resident page_cache_id to resolve pages WITHOUT traversing
/// VFS-domain state (which may be corrupted after the crash). The page
/// cache is in Core (Tier 0); the inode cache is in VFS (Tier 1).
fn wake_orphaned_pages(orphaned_pages: &[OrphanedPageEntry]) {
    // RCU read lock is mandatory: the page could be concurrently evicted
    // by memory reclaim between the XArray lookup and the flags update.
    // RCU protection ensures the page reference remains valid for the
    // duration of the lookup + flag-set + unlock sequence.
    let _rcu = rcu_read_lock();
    for entry in orphaned_pages {
        // SAFETY: page_cache_id is a pointer to a Core-resident
        // AddressSpace, cast to u64 at enqueue time. The AddressSpace
        // is pinned for the lifetime of the superblock.
        //
        // **Driver corruption mitigation**: page_cache_id was read from
        // the VFS request ring, which is in shared memory accessible to
        // the (now crashed) driver. A corrupted driver could have written
        // arbitrary values to request ring entries before crashing.
        // Validate that the pointer falls within Core (Tier 0) memory
        // before dereferencing. This prevents a corrupted page_cache_id
        // from causing the recovery path to dereference a driver-domain
        // or arbitrary address.
        if !is_core_memory_range(entry.page_cache_id as usize, core::mem::size_of::<AddressSpace>()) {
            log_fma_warning!("wake_orphaned_pages: page_cache_id {:#x} not in Core memory, skipping",
                entry.page_cache_id);
            continue;
        }
        let address_space = unsafe {
            &*(entry.page_cache_id as *const AddressSpace)
        };
        // Look up the page in the XArray-backed page cache under RCU.
        // PageCache.pages is an XArray<PageEntry>; load() returns
        // Option<&PageEntry> under the current RCU read lock.
        if let Some(page_entry) = address_space.page_cache.pages.load(entry.page_index) {
            let page = &page_entry.page;
            page.flags.fetch_or(PageFlags::ERROR.bits(), Ordering::Release);
            unlock_page(page);
        }
        // If page not found: it was already evicted or never inserted.
        // No action needed — no thread can be waiting on a non-existent page.
    }
    // _rcu dropped here — end of RCU read-side critical section.
}

writeback_deferred_dirty() Definition (Step U15)

/// Flush dirty pages that were deferred during crash recovery.
///
/// During the driver outage (Steps U1-U14), dirty pages accumulated in
/// the page cache with no backing driver to write them to disk. After
/// the replacement driver remounts (Step U14), this function flushes
/// those dirty pages via the standard writeback path.
///
/// The function iterates the superblock's dirty inode list
/// (`sb.s_dirty` / `sb.s_io`) and submits writeback work items for
/// each dirty inode. The writeback is synchronous: this function
/// blocks until all deferred dirty pages are written (or error).
///
/// # Arguments
///
/// - `sb`: The superblock of the remounted filesystem. The replacement
///   driver is already loaded and accepting writeback requests.
///
/// # Errors
///
/// Individual page writeback errors are logged via FMA but do not abort
/// the recovery. Pages that fail writeback retain the DIRTY flag and
/// are retried on the next periodic writeback cycle. The function
/// returns the count of failed pages for diagnostic purposes.
///
/// # Performance
///
/// This is a cold path (runs once per crash recovery). The writeback
/// rate is bounded by the replacement driver's throughput. On a typical
/// NVMe device, flushing 100 MB of deferred dirty pages takes ~20-50ms.
pub fn writeback_deferred_dirty(sb: &SuperBlock) -> u64 {
    let mut failed_count: u64 = 0;

    // Phase 1: Collect dirty inodes under RCU read lock.
    // We MUST NOT perform blocking I/O (WB_SYNC_ALL) inside an RCU
    // read-side critical section — blocking with tree-RCU prevents
    // grace period completion, causing RCU stalls and potential deadlock.
    // Instead, collect inode references into a bounded ArrayVec, drop
    // the RCU lock, then writeback outside RCU.
    //
    // Capacity 1024: sufficient for most crash recovery scenarios.
    // If the superblock has more than 1024 dirty inodes, the function
    // iterates in batches (collect 1024, drop RCU, writeback, re-acquire
    // RCU for the next batch). This is correct because new dirty inodes
    // cannot be created during crash recovery (ring_set.state == RECOVERING,
    // so no new I/O is accepted). The dirty inode list only shrinks
    // (as writeback completes) or stays the same.
    const BATCH_SIZE: usize = 1024;
    let mut batch: ArrayVec<InodeRef, BATCH_SIZE> = ArrayVec::new();
    let mut resume_ino: u64 = 0;

    loop {
        batch.clear();
        // Phase 1: Collect dirty inodes under RCU protection.
        {
            let _rcu = rcu_read_lock();
            for inode in sb.dirty_inodes_iter_from(resume_ino) {
                if batch.is_full() {
                    break;
                }
                // Advance the resume cursor unconditionally: inodes whose
                // writeback fails stay on the dirty list, and without this
                // the partial-batch path would re-collect them forever.
                // (Failed inodes are retried on the next periodic
                // writeback cycle, per the Errors section above.)
                resume_ino = inode.ino + 1;
                // Acquire a reference to the inode (pin it against eviction).
                batch.push(InodeRef::from(&inode));
            }
        }
        // _rcu dropped here — RCU lock released before blocking I/O.

        if batch.is_empty() {
            break; // No more dirty inodes.
        }

        // Phase 2: Writeback each dirty inode OUTSIDE RCU.
        for inode_ref in batch.iter() {
            let mapping = &inode_ref.address_space;
            let wbc = WritebackControl {
                sync_mode: WB_SYNC_ALL,
                nr_to_write: i64::MAX, // Write all dirty pages
                range_start: 0,
                range_end: i64::MAX,
            };
            if let Err(_e) = mapping.writeback_range(&wbc) {
                failed_count += 1;
                fma_emit(FaultEvent::WritebackError {
                    inode: inode_ref.ino,
                    sb_dev: sb.s_dev,
                });
            }
        }
        // InodeRef drops release the references.
    }
    failed_count
}

Step U17: Restore ring set state to ACTIVE

After the generation counter is bumped (Step U13a), the replacement driver remounts (Step U14), and dirty pages are flushed (Step U15), the ring set is re-activated:

// Step U13a already bumped generation (before driver reload).
// Step U16 is now a no-op (generation bump moved to U13a).

// Step U17: re-activate the ring set. This is the LAST step.
// The Release ordering ensures all prior ring resets, slot_states clears,
// driver initialization, and generation bump are visible to producers
// before they observe ACTIVE and begin enqueuing new requests.
ring_set.state.store(VFSRS_ACTIVE, Ordering::Release);

This transition from RECOVERING to ACTIVE is the final gate. Without it, the mount remains permanently stuck in RECOVERING state and all VFS operations return ENXIO indefinitely.

Recovery latency impact: Draining N rings sequentially adds O(N) to recovery time. Each ring drain is O(ring_depth) — with depth 256, each drain is ~256 iterations of entry extraction + pointer arithmetic. For N=64 rings: 64 * 256 = 16,384 iterations, each ~50-200 ns (includes DMA handle freeing and orphaned page collection) = ~0.8-3.3 ms. Well within the 500 ms recovery latency target. The producer quiescence wait (Step U5) adds at most 5 seconds in the worst case (copy_from_user on a major page fault), but typically completes in microseconds (most copy_from_user operations are cache-hot).
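The drain-cost estimate above reduces to a single product. A minimal sketch of the arithmetic, using the values from the paragraph:

```rust
/// Sequential drain cost estimate: N rings, each O(ring_depth)
/// iterations of entry extraction, at a per-entry cost in nanoseconds.
fn drain_cost_ns(rings: u64, depth: u64, per_entry_ns: u64) -> u64 {
    rings * depth * per_entry_ns
}
```

drain_cost_ns(64, 256, 50) is 819,200 ns (~0.8 ms) and drain_cost_ns(64, 256, 200) is 3,276,800 ns (~3.3 ms), matching the 0.8-3.3 ms range quoted above.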


14.3.11 Live Evolution

Live kernel evolution (Section 13.18) replaces a running filesystem driver with a new version. The evolution protocol interacts with per-CPU rings as follows:

Phase A' (Quiescence) — extended for N rings:

  1. Set ring_set.state = QUIESCING — new VFS operations return EAGAIN (callers retry after evolution completes).
  2. Wait for all N rings' inflight_ops counters to reach zero. Each RingBuffer<T> has an independent inflight_ops: AtomicU32 counter (incremented in reserve_slot(), decremented in complete_slot()). The same counter is used by crash recovery (Step U5 above).
  3. Drain all N response rings to process any final completions from the old driver.
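Steps 1-2 can be sketched as follows. This is an illustrative model, not the kernel implementation: the state constants are simplified, and a spin loop stands in for the kernel's sleep/wake quiescence wait:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const RING_STATE_ACTIVE: u32 = 0;
const RING_STATE_QUIESCING: u32 = 1;

struct Ring {
    inflight_ops: AtomicU32,
}

struct RingSet {
    state: AtomicU32,
    rings: Vec<Ring>,
}

/// Phase A' steps 1-2: flip the set to QUIESCING (new submissions now
/// bounce with EAGAIN), then poll until every ring's inflight_ops
/// counter reaches zero. Returns the number of polling iterations,
/// standing in for the kernel's bounded wait.
fn quiesce_all_rings(rs: &RingSet) -> u64 {
    rs.state.store(RING_STATE_QUIESCING, Ordering::Release);
    let mut iters = 0u64;
    loop {
        let inflight: u32 = rs
            .rings
            .iter()
            .map(|r| r.inflight_ops.load(Ordering::Acquire))
            .sum();
        if inflight == 0 {
            return iters;
        }
        iters += 1;
        std::hint::spin_loop(); // kernel code sleeps/yields instead
    }
}
```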

Phase B (Atomic Swap) — the ring pointers are unchanged during vtable swap. The new driver inherits the same ring set. Ring count does not change during live evolution (changing ring count requires unmount/remount).

Phase C (Post-Swap Cleanup) — the new driver re-initializes its consumer threads for all N rings. If the new driver supports a different ring_count_max than the old driver, the ring count remains unchanged until the next remount.


14.3.12 CPU Hotplug

When a CPU comes online or goes offline, the cpu_to_ring mapping must be updated.

14.3.12.1 CPU Online

/// Called by the CPU hotplug framework when a new CPU comes online.
/// Updates the cpu_to_ring mapping for all mounted filesystems.
fn vfs_rings_cpu_online(cpu: CpuId) {
    for sb in all_superblocks() {
        let ring_set = &sb.ring_set;
        if cpu < ring_set.cpu_to_ring_len as usize {
            // Assign the new CPU to a ring based on the mount's granularity.
            let ring_idx = compute_ring_for_cpu(cpu, ring_set);
            // SAFETY: cpu < cpu_to_ring_len, validated above. cpu_to_ring
            // points to a slab-allocated array valid for the mount's lifetime.
            unsafe { (*ring_set.cpu_to_ring.add(cpu)).store(ring_idx, Ordering::Release) };
        }
        // If cpu >= cpu_to_ring_len (CPU ID exceeds table allocated at mount
        // time), the fallback in select_ring() routes to ring 0. This can
        // occur if CPUs are hot-added beyond the boot-time nr_cpu_ids.
        // A remount would rebuild the table with the new nr_cpu_ids.
    }
}
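compute_ring_for_cpu() is referenced above but not defined in this section. A minimal sketch, under the assumption that ring granularity is encoded per mount (the RingGranularity enum and its fields here are illustrative, not the kernel types):

```rust
/// Ring granularity per mount; the variants mirror the PerCpu /
/// PerNuma / PerLlc modes described in this section.
#[derive(Clone, Copy)]
enum RingGranularity {
    Single,
    PerNuma { cpus_per_node: u32 },
    PerLlc { cpus_per_llc: u32 },
    PerCpu,
}

/// Map a CPU ID to a ring index for the mount's granularity, with the
/// same ring-0 fallback that select_ring() uses for out-of-range CPUs
/// (e.g., hot-added CPUs beyond the table allocated at mount time).
fn compute_ring_for_cpu(cpu: u32, gran: RingGranularity, ring_count: u32) -> u32 {
    let idx = match gran {
        RingGranularity::Single => 0,
        RingGranularity::PerNuma { cpus_per_node } => cpu / cpus_per_node,
        RingGranularity::PerLlc { cpus_per_llc } => cpu / cpus_per_llc,
        RingGranularity::PerCpu => cpu,
    };
    if idx < ring_count { idx } else { 0 } // fallback to ring 0
}
```

For example, on a 4-node machine with 16 CPUs per node, CPU 37 maps to ring 2 in PerNuma mode.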

14.3.12.2 CPU Offline

/// Called by the CPU hotplug framework when a CPU goes offline.
/// The cpu_to_ring entry for the offline CPU is NOT cleared — it becomes
/// stale but harmless (any thread migrated off the offline CPU will call
/// select_ring() on its new CPU and get the correct ring). No ring is
/// removed or deallocated on CPU offline — ring count is fixed for the
/// lifetime of the mount.
fn vfs_rings_cpu_offline(cpu: CpuId) {
    // No action required. The ring assigned to this CPU continues to exist
    // and may still have in-flight operations. The driver's consumer thread
    // for this ring continues to drain it.
    //
    // If the offline CPU was the ONLY CPU assigned to a particular ring,
    // that ring becomes idle — the driver's consumer thread for it will
    // find no new work. This is benign.
}

Ring count is immutable for the lifetime of a mount. Rings are allocated at mount time and freed at unmount. CPU hotplug changes the CPU-to-ring mapping but never adds or removes rings. Adding rings would require the driver to reinitialize its consumer side (equivalent to a mini-remount); removing rings would orphan in-flight requests. Both are too disruptive for a hotplug event. If the operator wants to adjust ring count after a topology change, they must unmount and remount.


14.3.13 Performance Analysis

14.3.13.1 Cache Line Contention Elimination

The primary performance gain is eliminating cross-CPU cache line contention on the request ring's head/published cache line.

Before (single ring, 64-CPU PostgreSQL checkpoint):

| Metric | Value |
|---|---|
| Producer reservation contention per fsync | ~63 contenders on single-ring CAS |
| Reservation CAS cost (x86-64, 64-way contention) | ~3,150-4,410 cycles (~1.3-1.8 us) |
| Ring head cache line bounces per produce | ~1 (lock holder writes, lock release bounces) |
| Total contention overhead per fsync | ~3,200-4,500 cycles |
| Total for 1000-file checkpoint | ~1.3-1.8 ms contention overhead |

After (per-CPU rings, 64-CPU PostgreSQL checkpoint):

| Metric | Value |
|---|---|
| Producer reservation contention | 0 (each CPU is sole SPSC producer on its ring) |
| Ring head cache line bounces per produce | 0 (ring head is CPU-local) |
| Global request_id bounce | ~1 per fsync (~15-20 cycles) |
| Total contention overhead per fsync | ~15-20 cycles (~6-8 ns) |
| Total for 1000-file checkpoint | ~6-8 us contention overhead |

Speedup on contention (PerCpu mode): ~200x reduction in per-fsync contention overhead. The single-ring CAS contention is eliminated entirely — each CPU owns its ring and reserves slots without cross-CPU contention.

Shared-ring mode (PerNuma with 16 CPUs/node): The CAS loop on head has O(N) expected retries under N-way contention, where N is the number of CPUs sharing the ring. With 16 CPUs per NUMA node: ~15 contenders on head CAS, ~750-1,050 cycles per reservation (vs ~3,150-4,410 for the global single-ring case). This is a ~4x improvement over single-ring, but ~50x worse than PerCpu. PerNuma is appropriate when the memory overhead of PerCpu is unacceptable but some contention reduction is needed.
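The two reservation disciplines can be contrasted in a sketch. This is an illustrative model of the head/tail protocol, not the kernel's RingBuffer code: shared-ring mode pays a CAS retry loop on head, while per-CPU SPSC mode needs only a load/store pair:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Shared-ring (PerNuma/PerLlc) slot reservation: multiple producers
/// race on `head` with a CAS loop, so expected retries grow with the
/// number of CPUs sharing the ring. Returns the reserved slot index.
fn reserve_slot_mpsc(head: &AtomicU64, tail: &AtomicU64, size: u64) -> Option<u64> {
    loop {
        let h = head.load(Ordering::Acquire);
        // Ring full: head has lapped tail by `size` slots.
        if h.wrapping_sub(tail.load(Ordering::Acquire)) >= size {
            return None;
        }
        // The CAS is the contended step: under N-way contention it
        // retries O(N) times in expectation.
        if head
            .compare_exchange_weak(h, h.wrapping_add(1), Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return Some(h % size);
        }
    }
}

/// Per-CPU (SPSC) reservation: the sole producer owns `head`, so a
/// plain load/store pair suffices. No CAS, no retry loop.
fn reserve_slot_spsc(head: &AtomicU64, tail: &AtomicU64, size: u64) -> Option<u64> {
    let h = head.load(Ordering::Relaxed); // producer-private counter
    if h.wrapping_sub(tail.load(Ordering::Acquire)) >= size {
        return None;
    }
    head.store(h.wrapping_add(1), Ordering::Release);
    Some(h % size)
}
```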

14.3.13.2 Memory Overhead

Each VfsRingPair occupies:

- Ring headers: 2 * 128 bytes = 256 bytes (two cache lines per ring, request + response)
- Ring data: 2 * (ring_depth * entry_size). VfsRequest is a Rust enum whose size is dominated by the largest VfsRequestArgs variant (SetXattr contains a KernelString at 256 bytes plus DmaBufferHandle, value_len, and flags). With the enum discriminant, alignment padding, and the VfsRequest header fields (request_id, opcode, ino, fh), the actual entry size is ~320 bytes. VfsCompletion is smaller (~64 bytes). With default depth 256: request ring = 256 * 320 = 80 KB, response ring = 256 * 64 = 16 KB, total ring data ≈ 96 KB per ring pair.
- Doorbell + WaitQueue: ~128 bytes
- Total per ring pair: ~96.4 KB

| Ring count | Total memory per mount | Notes |
|---|---|---|
| 1 (legacy) | ~96 KB | Baseline |
| 4 (per-NUMA, 4-socket) | ~386 KB | Typical server |
| 16 (per-LLC, AMD EPYC) | ~1.5 MB | High-core-count server |
| 64 (per-CPU) | ~6.2 MB | Maximum parallelism |
| 256 (per-CPU, 256-core) | ~24.7 MB | Extreme case |

Note: If the per-ring memory for high ring counts is excessive, the ring depth can be reduced proportionally. With 64 per-CPU rings at depth 64 (instead of 256), total per-mount memory drops to ~1.5 MB while still providing sufficient queue depth for typical VFS workloads.

Memory is allocated from the kernel slab at mount time (warm path). For the common case (4-16 rings), memory overhead is modest. The 64-ring and 256-ring cases are opt-in via explicit mount options and appropriate only for high-IOPS storage servers.
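The table values follow from a short computation, using the sizes quoted above (~320-byte request entries, ~64-byte response entries, 128-byte ring headers, ~128 bytes of doorbell + wait queue state):

```rust
/// Per-ring-pair footprint from the sizes quoted above.
fn ring_pair_bytes(depth: u64) -> u64 {
    let headers = 2 * 128; // two cache-line headers (request + response)
    let request_data = depth * 320; // ~320-byte VfsRequest entries
    let response_data = depth * 64; // ~64-byte VfsCompletion entries
    let doorbell_waitqueue = 128;
    headers + request_data + response_data + doorbell_waitqueue
}

/// Total per-mount ring memory for a given ring count and depth.
fn mount_ring_bytes(ring_count: u64, depth: u64) -> u64 {
    ring_count * ring_pair_bytes(depth)
}
```

With these assumptions, ring_pair_bytes(256) is 98,688 bytes (~96.4 KB), and mount_ring_bytes(64, 64) is 1,597,440 bytes (~1.5 MB), matching the depth-reduction note above.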

14.3.13.3 Per-CPU Ring Depth Reduction

With N rings, each ring serves fewer CPUs and therefore needs fewer slots. The effective queue depth per CPU remains the same or better:

| Configuration | Rings | Depth/ring | Effective depth/CPU | Total slots |
|---|---|---|---|---|
| Single ring, 64 CPUs | 1 | 256 | 4 | 256 |
| Per-NUMA (4 nodes), 64 CPUs | 4 | 256 | 16 | 1024 |
| Per-LLC (16 groups), 64 CPUs | 16 | 128 | 32 | 2048 |
| Per-CPU, 64 CPUs | 64 | 64 | 64 | 4096 |

For per-CPU mode, the per-ring depth can be reduced (via vfs_ring_depth mount option) without reducing effective per-CPU capacity. A depth of 64 per ring with 64 rings provides 64 in-flight operations per CPU — far more than any single CPU can sustain.

14.3.13.4 Impact on Performance Budget

The per-CPU ring extension does NOT increase the per-I/O domain crossing cost. The ring protocol (SPSC produce, domain switch, consume) is identical per operation. The changes affect only:

| Cost component | Change | Impact |
|---|---|---|
| Ring selection (select_ring()) | +1-3 cycles (atomic load + bounds check) | +0.001% on 100 us op |
| Request ID generation | +15-20 cycles (global atomic fetch_add) | +0.006-0.008% on 100 us op |
| Doorbell coalescing mask update | +5-10 cycles (atomic OR) | +0.002-0.004% on 100 us op |
| Cache line contention elimination | -3,150-4,410 cycles (under 64-CPU contention) | -1.3-1.8% saved |
| Net impact under contention | -3,100-4,370 cycles saved | -1.26-1.78% improvement |
| Net impact without contention | +21-33 cycles added | +0.009-0.013% overhead |

The extension is a net performance win under any multi-CPU workload and negligibly more expensive (~0.01%) for single-CPU workloads. The overhead is well within the existing 2.5% headroom under the 5% budget (Section 3.4).

14.3.13.5 Amortization Math: Negative Overhead Analysis

The design target is NEGATIVE overhead — UmkaOS filesystem I/O must be FASTER than Linux despite the Tier 1 domain switch (Section 1.1). This section presents the amortization math against a production Linux kernel baseline (CONFIG_PROVE_LOCKING=n, CONFIG_LOCK_STAT=n).

Linux baseline for a read() cache miss (measured path, no isolation):

| Linux path component | Cost (x86-64, cycles) | Notes |
|---|---|---|
| Syscall entry (SYSCALL + kernel stack setup) | ~40 | Shared with UmkaOS |
| VFS vfs_read() dispatch (function call, no vtable) | ~10-15 | Direct call |
| Filesystem ext4_file_read_iter() (indirect call through f_op) | ~5-8 | Indirect call + branch predictor |
| filemap_get_pages() (page cache XArray lookup) | ~30-50 | XArray walk, cache miss case |
| ext4_readahead() (extent tree lookup + bio build) | ~100-300 | Varies with extent depth |
| bio_submit() (block layer dispatch) | ~50-100 | Request queue + scheduling |
| Total Linux in-kernel overhead | ~235-513 | Before device I/O |

UmkaOS path for the same read() cache miss (with per-CPU ring, N=16 batch):

| UmkaOS path component | Cost (x86-64, cycles) | Notes |
|---|---|---|
| Syscall entry (SYSCALL + kernel stack setup) | ~40 | Identical to Linux |
| VFS ring enqueue (write entry + advance published) | ~15-20 | SPSC produce, no lock |
| Ring selection (select_ring()) | ~1-3 | Atomic load + bounds check |
| Request ID generation | ~15-20 | Atomic fetch_add |
| Doorbell coalescing mask update | ~5-10 | Atomic OR (amortized) |
| Doorbell (domain switch) — amortized over N | ~23/N | WRPKRU, amortized |
| Driver dequeue (ring entry already prefetched in L1) | ~5-10 | L1 hit from prefetch |
| Filesystem processing (same as Linux) | ~100-300 | Extent tree + bio |
| Response enqueue + completion coalescing | ~10-15 | SPSC produce + coalesce check |
| Completion wakeup — amortized over N | ~300/N | IPI + WaitQueue, amortized (see note) |
| Total UmkaOS in-kernel overhead (N=16) | ~220-433 | Before device I/O |

Completion wakeup cost (300 cycles) derivation: The 300-cycle figure assumes same-NUMA-node IPI (~200 cycles on x86-64 Intel Xeon Scalable, measured via rdtsc across APIC_ICR write to handler entry) plus WaitQueue wake overhead (~50-100 cycles for priority-ordered wake + rescheduling check). Cross-NUMA IPI costs ~500-1000 cycles; if the VFS consumer runs on a different NUMA node than the filesystem driver, completion wakeup rises to ~600-1100 cycles. The 300-cycle figure applies to PerLlc and PerNuma ring granularities where producer and consumer share a NUMA node. Cross-NUMA worst case is documented in the per-architecture table below.

Per-operation domain crossing cost — production Linux baseline:

The per-operation overhead that UmkaOS must amortize is compared against the Linux function-call chain overhead that UmkaOS eliminates by replacing indirect calls with a ring protocol.

Linux production per-operation function-call overhead eliminated by UmkaOS:
  vfs_read() dispatch:  ~10-15 cycles (direct call)
  f_op->read_iter:      ~5-8 cycles  (indirect call, x86-64 retpoline)
  Lock stat accounting: ~5-10 cycles (CONFIG_LOCK_STAT=n: 0 cycles;
                                      production kernels vary, ~5 cycles
                                      for inline static key check residual)
  TOTAL (production):   ~20-28 cycles

Note: Debug kernels with CONFIG_PROVE_LOCKING=y add ~25 cycles of lockdep
checking per lock acquisition. The baseline above uses PRODUCTION builds only.

Total domain crossing overhead = doorbell_cost + completion_cost
                               = 23 + 300 = 323 cycles (uncoalesced, x86-64)

Breakeven batch size = 323 / 28 = ~12 operations (production Linux)

Why UmkaOS achieves savings despite the domain switch:

  1. No lockdep overhead: Linux's lockdep (lock dependency checker) adds ~20-30 cycles to every lock acquisition on debug kernels. Production kernels with CONFIG_PROVE_LOCKING=n and CONFIG_LOCK_STAT=n have zero lockdep overhead, but still pay ~5 cycles for inline static key checks on lock-stat-capable paths. UmkaOS's compile-time lock ordering eliminates all runtime lock validation: 0 cycles on all builds.

  2. No indirect call overhead: Linux's VFS dispatches through f_op->read_iter — an indirect call through a function pointer. On x86-64 with Spectre v2 mitigations (retpoline/IBRS), indirect calls cost ~15-25 cycles. UmkaOS's ring protocol avoids indirect calls on the hot path — the opcode is a match on a u32, which the compiler converts to a jump table (direct branch).

  3. Cache-friendlier ring layout: The ring buffer is a contiguous array with predictable access pattern (sequential consume). Linux's VFS path walks multiple non-contiguous data structures (file -> dentry -> inode -> superblock -> f_op -> address_space -> page tree). The ring's sequential layout produces fewer L1 cache misses (~2-3 vs ~5-8 for the pointer-chasing VFS path).

  4. Prefetch hides latency: The driver-side ring entry prefetch (see "Driver-Side Ring Entry Prefetch" above) loads the next 4 entries into L1 while processing the current request. Linux has no equivalent — each VFS function call must load its arguments from wherever they happen to reside in the cache hierarchy.

Summary table — per-operation overhead at different batch sizes (x86-64, production Linux):

| Batch size (N) | Domain crossing per-op | Linux prod. overhead saved | Net | Verdict |
|---|---|---|---|---|
| 1 (uncoalesced) | 323 cycles | 28 cycles | +295 cycles | 1.2% overhead on 10 us op |
| 2 | 162 cycles | 28 cycles | +134 cycles | 0.54% overhead |
| 4 | 81 cycles | 28 cycles | +53 cycles | 0.21% overhead |
| 8 | 40 cycles | 28 cycles | +12 cycles | 0.048% overhead |
| 12 | 27 cycles | 28 cycles | -1 cycle | Breakeven |
| 16 | 20 cycles | 28 cycles | -8 cycles | NEGATIVE overhead |
| 32 | 10 cycles | 28 cycles | -18 cycles | Negative (bonus) |

Breakeven is at N=12 against production Linux. At N>=12, UmkaOS per-operation cost is less than Linux's equivalent function-call path. At N=16 (typical io_uring depth), the saving is ~8 cycles/op. At N=32 (common for high-IOPS NVMe workloads), the saving is ~18 cycles/op.
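The breakeven arithmetic can be reproduced directly (constants taken from the derivation above; rounding to the nearest cycle reproduces the table rows):

```rust
/// Domain crossing cost per op, amortized over a batch of N and
/// rounded to the nearest cycle (23-cycle doorbell + 300-cycle
/// completion wakeup, per the derivation above).
fn crossing_cycles_per_op(batch: u64) -> u64 {
    const CROSSING_CYCLES: u64 = 23 + 300;
    (CROSSING_CYCLES + batch / 2) / batch // rounded integer division
}

/// Net per-op cost vs production Linux: amortized crossing cost minus
/// the ~28 cycles of dispatch overhead the ring protocol eliminates.
fn net_cycles_per_op(batch: u64) -> i64 {
    const LINUX_SAVED_CYCLES: i64 = 28;
    crossing_cycles_per_op(batch) as i64 - LINUX_SAVED_CYCLES
}

/// Smallest batch size with negative net overhead.
fn breakeven_batch() -> u64 {
    (1u64..).find(|&n| net_cycles_per_op(n) < 0).unwrap()
}
```

net_cycles_per_op(8) gives +12, net_cycles_per_op(16) gives -8, and breakeven_batch() gives 12, matching the table.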

N=8 (the default regular I/O coalescing batch) is NOT negative-overhead against production Linux — it adds ~12 cycles/op (+0.048% on a 10us operation). This is well within the 2.5% headroom under the 5% budget. The negative-overhead threshold requires N>=12, which is achieved by:

- io_uring workloads (typical depth 32-128): always negative overhead.
- PostgreSQL checkpoint (fsync storm on 64 backends): N>>12.
- Batched readahead (sequential reads trigger 4-32 page prefetch): N>=16.

When N < 12: For single-threaded sequential reads (effective batch size 1-4), the domain crossing adds ~0.2-1.2% overhead — well within the 5% budget. The page cache absorbs >95% of reads without any domain crossing (cache hits are served entirely in Tier 0), so the effective overhead across all operations is much lower than the per-miss figure.

Inline small I/O path (N=1, negative overhead): For reads/writes where count <= INLINE_IO_MAX (192 bytes) — covering >90% of procfs/sysfs accesses — data is carried inline in the ring entry (Section 14.2). No DMA buffer allocation, no IOMMU map/unmap. This eliminates ~150-300ns per small I/O:

| Path | Linux cost | UmkaOS inline cost | Delta |
|---|---|---|---|
| read("/proc/self/status", buf, 128) | ~280-400 cycles (VFS + seq_file + copy_to_user) | ~180-260 cycles (ring + inline_data + copy_to_user) | -100 to -140 cycles |
| read("/sys/class/net/eth0/mtu", buf, 8) | ~250-380 cycles | ~160-240 cycles | -90 to -140 cycles |

For metadata-heavy workloads (container startup reading hundreds of small files), this is measurable throughput improvement. The inline path achieves negative overhead at N=1 — no batching required. This is the strongest negative-overhead argument for procfs/sysfs workloads that dominate container and monitoring scenarios.
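The inline dispatch decision itself is a single comparison against the threshold; a minimal sketch (the function name is illustrative; INLINE_IO_MAX is the 192-byte limit from Section 14.2):

```rust
/// Maximum payload that travels inline in a ring entry (Section 14.2).
const INLINE_IO_MAX: usize = 192;

/// Whether a read/write of `count` bytes takes the inline ring path:
/// payload carried in the ring entry, no DMA buffer allocation and no
/// IOMMU map/unmap.
fn takes_inline_path(count: usize) -> bool {
    count <= INLINE_IO_MAX
}
```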


14.3.14 Backward Compatibility

14.3.14.1 Single-Ring Drivers

Drivers that do not include a .kabi_vfs_caps ELF section (or set ring_count_max = 1) operate in single-ring mode. The VfsRingSet is allocated with ring_count = 1 and the cpu_to_ring table maps all CPUs to ring 0. This is functionally identical to the baseline protocol — no behavioral change.
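Ring-count negotiation at mount time can be sketched as follows. This is a hypothetical helper: the `KabiVfsCaps` struct shows only the one field the text describes, and the clamping policy is an assumption.

```rust
/// Subset of the .kabi_vfs_caps ELF section relevant to ring count
/// (illustrative; the full capability layout is negotiated at load time).
struct KabiVfsCaps {
    ring_count_max: u32,
}

/// Resolve the effective ring count for a mount. Drivers without a
/// .kabi_vfs_caps section (None) fall back to single-ring mode, which
/// is functionally identical to the baseline protocol.
fn effective_ring_count(caps: Option<&KabiVfsCaps>, requested: u32) -> u32 {
    match caps {
        None => 1, // no capability section: baseline single-ring mode
        Some(c) => requested.clamp(1, c.ring_count_max.max(1)),
    }
}
```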

14.3.14.2 VfsRingPair Preservation

The VfsRingPair struct is unchanged. The extension wraps it in VfsRingSet without modifying the per-ring structure. The request/response ring layout, entry format, opcodes, and all existing fields remain identical.

14.3.14.3 Cancellation Protocol

The cancellation protocol (Section 14.2) is unchanged per ring. CancelToken.request_id uses the mount-global ID, so the driver can match it against any ring's pending requests. The cancellation side-channel is per ring — each ring has its own cancellation channel, and the cancel token is enqueued on the ring where the original request was submitted (the VFS tracks which ring each request went to via the per-task last_vfs_ring_index field, set during select_ring()).

14.3.14.4 Timeout Handling

Per-request timeouts (Section 14.2) are unchanged. Each request has an independent timer regardless of which ring it was submitted on. The timer callback cancels the request on the specific ring where it was submitted.


14.3.15 Cross-References


14.3.16 Phase Assignment

| Component | Phase | Rationale |
|---|---|---|
| VfsRingSet struct and single-ring allocation | Phase 2 | Replaces VfsRingPair allocation at mount; backward compatible with ring_count=1. |
| Mount option parsing (vfs_ring_count, vfs_ring_granularity) | Phase 2 | Mount option infrastructure exists; adding new options is incremental. |
| CPU-to-ring mapping table and select_ring() | Phase 2 | Core hot-path change; must be correct from first multi-ring mount. |
| Global request_id counter | Phase 2 | Replaces per-ring counter; simple atomic. |
| Coalesced doorbell | Phase 2 | Required for multi-ring to avoid doorbell storms. |
| Driver-side round-robin consumer (Strategy 1) | Phase 2 | Minimum viable multi-ring consumer. |
| KabiVfsCapabilities ELF section and negotiation | Phase 2 | Must exist before any driver can advertise multi-ring support. |
| Crash recovery for N rings | Phase 2 | Must be correct before any production use of multi-ring mode. |
| Driver-side ring entry prefetch | Phase 2 | Trivial to implement (one prefetch intrinsic per dequeue); significant L1 cache benefit. |
| Completion coalescing (response direction) | Phase 2 | Required for negative-overhead target; mirrors request-side doorbell coalescing. |
| Per-ring consumer threads (Strategy 2) | Phase 3 | Optimization; round-robin is sufficient for Phase 2. |
| CPU hotplug integration | Phase 3 | Hotplug is uncommon in production; Phase 2 mapping is static. |
| Live evolution for N rings | Phase 3 | Live evolution is a Phase 3 feature. |
| Adaptive granularity auto-selection | Phase 3 | Requires topology discovery infrastructure. |

14.4 fsync / fdatasync End-to-End Flow

The fsync(2) and fdatasync(2) syscalls guarantee that file data (and optionally metadata) reach stable storage. This section documents the complete call path from syscall entry to disk write completion, crossing VFS, page cache, filesystem, and block layer boundaries.

Syscall entry → VFS dispatch:

fsync(fd) / fdatasync(fd)
  → sys_fsync() / sys_fdatasync()
  → vfs_fsync_range(file, start=0, end=LLONG_MAX, datasync)

vfs_fsync_range(file, start, end, datasync) is the canonical entry point. The sync_file_range(2) syscall also calls it with a sub-range.

Step 1 — Writeback dirty pages:

/// VFS-level fsync implementation.
fn vfs_fsync_range(
    file: &File,
    start: i64,
    end: i64,
    datasync: bool,
) -> Result<(), IoError> {
    let mapping = &file.inode.i_mapping;

    // (1) Flush all dirty pages in [start, end] to the block layer.
    //     The writeback engine iterates dirty pages, calling
    //     AddressSpaceOps::writepage() for each one, which builds bios
    //     and submits them. Does NOT wait for I/O completion yet.
    filemap_write_and_wait_range(mapping, start, end)?;

    // (1b) DSM-aware writeback: for MS_DSM_COOPERATIVE superblocks, after
    //      local writeback completes, wait for DSM home node acknowledgment.
    //      dsm_sync_pages() sends dirty DSM pages via RDMA and blocks until
    //      PutAck is received from each home node, ensuring data durability
    //      on the remote node before fsync returns.
    //      See [Section 6.12](06-dsm.md#dsm-subscriber-controlled-caching--fsync-semantics).
    if mapping.host.i_sb.s_flags.load(Relaxed) & MS_DSM_COOPERATIVE != 0 {
        dsm_sync_pages(&file.inode, start, end)?;
    }

    // (2) Dispatch to filesystem-specific fsync (journal commit, etc).
    //     For KABI Tier 1/2 drivers: sends Fsync through VFS ring buffer.
    //     For in-kernel filesystems: calls FileOps::fsync() directly.
    //     The inode and private values are extracted from the OpenFile.
    let inode_id = file.inode.id;
    let private = file.private_data.load(Ordering::Relaxed) as u64;
    file.f_op.fsync(inode_id, private, start as u64, end as u64, datasync as u8)?;

    // (3) Check for writeback errors that arrived asynchronously since
    //     this fd was opened (or since the last successful fsync).
    //     ErrSeq::check_and_advance() compares the fd's snapshot against
    //     the mapping's current wb_err generation. If they differ, a
    //     writeback error occurred — return it and advance the snapshot.
    //     Without this step, writeback errors that arrive between
    //     write() and fsync() would be silently lost.
    if let Some(errno) = mapping.wb_err.check_and_advance(&mut file.f_wb_err) {
        return Err(IoError::from_raw(errno));
    }

    Ok(())
}

filemap_write_and_wait_range(mapping, start, end) performs two phases:

  1. Write phase: Walk the page cache (XArray range scan) for pages in [start, end] with PageFlags::DIRTY set. For each dirty page, call AddressSpaceOps::writepage(mapping, page, wbc). The filesystem maps the page to physical block(s), builds a Bio, and submits it via bio_submit() (Section 15.2). Set WRITEBACK flag on the page.

DIRTY flag ownership: filemap_write_and_wait_range does NOT clear DIRTY — it only sets WRITEBACK. The DIRTY → clean transition and nr_dirty decrement are owned by the completion callback:

  - Tier 0 filesystems: writeback_end_io() (deferred via the writeback_end_io_deferred callback on the blk-io workqueue) clears DIRTY and decrements nr_dirty on success.
  - Tier 1 filesystems: the Tier 0 WritebackResponse handler (step 11 in Section 4.6) clears DIRTY and decrements nr_dirty.

This single-owner design prevents the double-decrement bug that would occur if both the write phase and the completion handler cleared DIRTY.

  2. Wait phase: Walk the same range again. For each page with WRITEBACK set, block until the bio completion callback clears the flag. If any page has AS_EIO / AS_ENOSPC error state, return the error and clear it (one-shot error reporting — see below).

Dual error reporting (AS flags + ErrSeq): The AS_EIO/AS_ENOSPC AddressSpace flags are a legacy mechanism. ErrSeq (wb_err) is the primary error reporting mechanism: it provides per-fd error visibility (each open fd sees the error exactly once via check_and_advance()). Both mechanisms are set together by writeback_end_io() for consistency, but ErrSeq is authoritative for fsync() error returns. The AS flags are used by filemap_write_and_wait_range for backward compatibility with callers that check AddressSpace state directly.

/// Flush all dirty pages in [start, end] to the block layer and wait
/// for completion. This is the core implementation behind fsync step 1.
///
/// # Algorithm
/// Phase 1 (write): iterate dirty pages and submit writeback bios.
/// Phase 2 (wait): block until all submitted bios complete.
///
/// # Locking
/// Does NOT hold i_rwsem (already held by caller for write paths).
/// Acquires page locks individually via write_begin/end protocol.
///
/// # Error handling
/// Collects errors from both phases. Returns the first error encountered.
/// All pages are processed even after an error (no short-circuit) to
/// maximize data written to disk before returning failure.
fn filemap_write_and_wait_range(
    mapping: &AddressSpace,
    start: i64,
    end: i64,
) -> Result<(), IoError> {
    let mut first_err: Result<(), IoError> = Ok(());

    // --- Phase 1: Write dirty pages ---
    let start_idx = (start as u64) >> PAGE_SHIFT;
    let end_idx = (end as u64) >> PAGE_SHIFT;
    let mut wbc = WritebackControl {
        sync_mode: WbSyncMode::All,
        range_start: start as u64,
        range_end: end as u64,
        nr_to_write: i64::MAX,
    };

    // Prefer writepages() for batch I/O if the filesystem supports it.
    if mapping.ops.writepages(mapping, &wbc).is_err() {
        // Fall back to per-page writepage() iteration.
        let mut idx = start_idx;
        while idx <= end_idx {
            if let Some(page) = mapping.pages.load(idx) {
                if page.flags.load(Acquire) & PageFlags::DIRTY != 0 {
                    // Set WRITEBACK before submitting. Do NOT clear DIRTY —
                    // the completion callback owns that transition.
                    // set_page_writeback: atomically set WRITEBACK flag
                    // AND increment nrwriteback. The matching decrement
                    // is in writeback_end_io(). Without the increment,
                    // the decrement in writeback_end_io causes underflow
                    // (u64::MAX), permanently corrupting writeback scheduling.
                    page.flags.fetch_or(PageFlags::WRITEBACK, Release);
                    mapping.nrwriteback.fetch_add(1, Relaxed);
                    if let Err(e) = mapping.ops.writepage(mapping, &page, &wbc) {
                        page.flags.fetch_and(!PageFlags::WRITEBACK, Release);
                        mapping.nrwriteback.fetch_sub(1, Relaxed);
                        if first_err.is_ok() { first_err = Err(e); }
                    }
                }
            }
            idx += 1;
        }
    }

    // --- Phase 2: Wait for WRITEBACK completion ---
    let mut idx = start_idx;
    while idx <= end_idx {
        if let Some(page) = mapping.pages.load(idx) {
            // Spin/sleep until the completion callback clears WRITEBACK.
            wait_on_page_writeback(&page);
            // Check for per-page error (set by writeback_end_io on failure).
            if page.flags.load(Acquire) & PageFlags::ERROR != 0 {
                if first_err.is_ok() {
                    first_err = Err(IoError::new(Errno::EIO));
                }
            }
        }
        idx += 1;
    }

    // Check AS-level error flags (set by mapping_set_error on I/O error).
    if mapping.flags.load(Acquire) & (AS_EIO | AS_ENOSPC) != 0 {
        let flags = mapping.flags.fetch_and(!(AS_EIO | AS_ENOSPC), Release);
        if first_err.is_ok() {
            if flags & AS_ENOSPC != 0 {
                first_err = Err(IoError::new(Errno::ENOSPC));
            } else {
                first_err = Err(IoError::new(Errno::EIO));
            }
        }
    }

    first_err
}

Page Wait Queue Infrastructure (used by wait_on_page_writeback and wait_on_page_locked):

/// Global page wait hash table. Hashed by page address to reduce memory
/// overhead (one WaitQueueHead per hash bucket, not per page).
/// Size: 256 buckets (matches Linux's PAGE_WAIT_TABLE_BITS = 8).
/// Warm path: accessed on every fsync and every page fault that waits
/// for I/O completion.
static PAGE_WAIT_TABLE: [WaitQueueHead; 256] = [WaitQueueHead::new(); 256];

/// Map a Page reference to its hash bucket in the page wait table.
fn page_waitqueue(page: &Page) -> &'static WaitQueueHead {
    // Hash by page struct address (NOT physical address) — the page struct
    // is pinned in MEMMAP and has a stable address for the kernel lifetime.
    let hash = (page as *const Page as usize >> PAGE_SHIFT) & 0xFF;
    &PAGE_WAIT_TABLE[hash]
}

/// Sleep until the WRITEBACK flag is cleared on the page.
/// Called by fsync Phase 2 and by the page fault path when a page is
/// undergoing writeback. The matching wake is in `Page::wake_waiters()`,
/// called from `writeback_end_io()` after clearing WRITEBACK.
fn wait_on_page_writeback(page: &Page) {
    // Fast check: if WRITEBACK is already clear, return immediately.
    if page.flags.load(Acquire) & PageFlags::WRITEBACK.bits() == 0 {
        return;
    }
    let wq = page_waitqueue(page);
    wq.wait_event(|| page.flags.load(Acquire) & PageFlags::WRITEBACK.bits() == 0);
}

impl Page {
    /// Wake all waiters sleeping on this page's hash bucket.
    /// Called from writeback completion (after clearing WRITEBACK) and
    /// from unlock_page (after clearing PG_LOCKED).
    pub fn wake_waiters(&self) {
        page_waitqueue(self).wake_up_all();
    }
}

/// Set writeback error on an AddressSpace. Called from writeback completion
/// paths (writeback_end_io, end_page_writeback) when a bio completes with
/// an I/O error.
///
/// Wraps ErrSeq::set_err() and sets the legacy AS_EIO/AS_ENOSPC flags.
/// Both mechanisms are updated together for consistency — ErrSeq is
/// authoritative for fsync() error reporting, AS flags are for legacy
/// callers that check AddressSpace state directly.
fn mapping_set_error(mapping: &AddressSpace, errno: Errno) {
    // Increment ErrSeq generation and store the error code.
    mapping.wb_err.set_err(errno);
    // Set legacy AS flags for backward compatibility.
    if errno == Errno::ENOSPC {
        mapping.flags.fetch_or(AS_ENOSPC, Release);
    } else {
        mapping.flags.fetch_or(AS_EIO, Release);
    }
}

/// Add an inode to the BDI's dirty list if not already present.
/// Called from set_page_dirty() when a page first transitions to dirty.
/// Equivalent to mark_inode_dirty(inode, I_DIRTY_PAGES) — checks the
/// I_DIRTY_PAGES flag to avoid duplicate list insertions.
fn bdi_dirty_inode(bdi: &BackingDevInfo, inode: InodeId) {
    // Check I_DIRTY_PAGES — idempotent, no action if already set.
    let inode_ref = inode_lookup(inode);
    if inode_ref.state.fetch_or(I_DIRTY_PAGES, AcqRel) & I_DIRTY_PAGES != 0 {
        return; // Already on the dirty list.
    }
    // Add to BDI's b_dirty list under writeback_lock.
    bdi.wb.push_dirty_inode(inode_ref);
}
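The bucket-sharing behavior of the 256-entry wait table above can be seen with plain arithmetic. A self-contained sketch, assuming PAGE_SHIFT = 12 (4 KiB pages):

```rust
/// Page shift assumed for this sketch (4 KiB pages).
const PAGE_SHIFT: usize = 12;

/// Map a page-struct address to one of the 256 wait-queue buckets,
/// mirroring the hash in page_waitqueue() above.
fn wait_bucket(page_addr: usize) -> usize {
    (page_addr >> PAGE_SHIFT) & 0xFF
}
```

Two page structs whose addresses differ by exactly `256 << PAGE_SHIFT` share a bucket, which is why the waiter's predicate closure re-checks the page flag after every wake: a wake may be for an unrelated page hashing to the same bucket.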

Step 2 — Filesystem-specific sync (journaled filesystems):

For journaled filesystems (ext4, XFS, btrfs), fsync() does more than flush pages:

| Filesystem | fsync action after writeback |
|---|---|
| ext4 (data=ordered) | jbd2_journal_force_commit() — flush journal to disk, issue cache flush |
| ext4 (data=journal) | Commit journal transaction containing both data and metadata |
| XFS | xfs_log_force_lsn() — force log to LSN that covers the inode's metadata |
| btrfs | btrfs_sync_log() — flush the per-root log tree, then superblock |
| tmpfs | No-op (no backing store) |
| NFS | nfs_file_fsync() — send COMMIT RPC to server |
| FUSE (Section 14.11) | Send FUSE_FSYNC opcode through /dev/fuse |
| KABI Tier 1/2 | Send VfsRequest::Fsync { datasync, start, end } through ring buffer |

/// ext4 fsync implementation. Called via FileOps::fsync() dispatch.
/// Ensures all data and metadata for the inode reach stable storage
/// by forcing the JBD2 journal to commit.
///
/// Linux equivalent: ext4_sync_file() in fs/ext4/fsync.c.
fn ext4_fsync(
    inode_id: InodeId,
    private: u64,
    start: u64,
    end: u64,
    datasync: u8,
) -> Result<()> {
    let inode = inode_lookup(inode_id);
    let sbi = ext4_sb_info(&inode.i_sb);
    let journal = &sbi.journal;

    // For data=journal mode, all data is already in the journal.
    // Force the transaction containing this inode's data to commit.
    //
    // For data=ordered mode, data pages were already flushed by
    // filemap_write_and_wait_range() (step 1 in vfs_fsync_range).
    // We only need to commit the metadata transaction.

    // If fdatasync and only timestamps changed (I_DIRTY_TIME without
    // I_DIRTY_DATASYNC), skip the journal commit entirely.
    if datasync != 0
        && inode.state.load(Acquire) & I_DIRTY_DATASYNC == 0
        && inode.state.load(Acquire) & I_DIRTY_TIME != 0
    {
        return Ok(());
    }

    // Force the journal transaction containing this inode's metadata
    // to disk. journal_force_commit() waits for the commit I/O to
    // complete, including the commit block written with BioFlags::FUA.
    // The FUA flag ensures the commit block reaches stable storage
    // without needing a separate cache flush bio.
    // i_datasync_tid is in the ext4-specific inode info, not the generic Inode.
    // Access via i_private cast (see Ext4InodeInfo in [Section 15.6](15-storage.md#filesystem-ext4)).
    // SAFETY: i_private was set by ext4's alloc_inode() and is valid for the
    // inode's lifetime. The Ext4InodeInfo is slab-allocated and immovable.
    let ext4_info = unsafe { &*(inode.i_private as *const Ext4InodeInfo) };
    let tid = ext4_info.i_datasync_tid.load(Acquire);
    journal.force_commit(tid)?;

    // On ext4, the journal commit with FUA on the commit block serves
    // as the device cache flush. No separate REQ_PREFLUSH bio is needed
    // because the FUA commit block guarantees ordering: all data written
    // before the commit block is on stable media.

    Ok(())
}

Step 3 — Block layer flush (inside filesystem-specific fsync):

The device cache flush described below is performed inside the filesystem-specific fsync() implementation (step 2), not as a separate VFS-level post-fsync action. For ext4, the BioFlags::FUA on the journal commit block serves this purpose. For other filesystems:

After the filesystem commits its journal/log, it issues a cache flush to the storage device to ensure write-back caches are drained:

bio_submit(bio with REQ_PREFLUSH | REQ_FUA)
  → block device request queue
  → NVMe: FLUSH command (opcode 0x00) / SATA: FLUSH CACHE EXT (0xEA)
  → completion interrupt → bio_endio() → wake waiters

Tier 1 crash recovery: Journal commit bios MUST set BioFlags::PERSISTENT (Section 15.2) so they are preserved across Tier 1 storage driver crash recovery. The block layer's pending bio list retains PERSISTENT bios during domain teardown and replays them to the new driver instance after reload. Without this flag, a Tier 1 driver crash between journal write submission and completion would lose the journal commit — corrupting the filesystem on the next mount (journal replay would be incomplete).

For fdatasync: metadata-only changes (atime, mtime) are NOT flushed. The filesystem skips journal commit if only timestamps changed (I_DIRTY_TIME flag without I_DIRTY_DATASYNC).

Error reporting — one-shot semantics:

/// Error state per AddressSpace. Encapsulates a monotonic sequence counter
/// and the most recent errno. On writeback I/O failure, call `set_err(errno)`.
/// On fsync(), compare the file's snapshot with `sample()` to detect new errors.
///
/// Provides the same "each error seen exactly once per fd" semantics as
/// Linux's errseq_t. Errno and counter are packed into a single atomic word
/// to prevent torn reads between errno and sequence counter.
///
/// **64-bit architectures** (x86-64, AArch64, RISC-V 64, PPC64LE, s390x,
/// LoongArch64): uses `AtomicU64` — errno in low 16 bits, seen flag in
/// bit 16, counter in the high 47 bits. No counter wrap concern.
///
/// **32-bit architectures** (ARMv7, PPC32): uses a packed `AtomicU32`
/// matching Linux's errseq_t layout (ABI-constrained) — bits [11:0] =
/// errno, bit [12] = seen flag, bits [31:13] = counter.
/// Longevity: the 19-bit counter wraps after 524,288 errors; at 1
/// error/sec (extreme), that is ~6 days. 64-bit targets instead use a
/// 47-bit counter (~140T values, safe for 50-year uptime at any
/// realistic error rate).
/// Wrap behavior: a wrapped counter can yield a false "no new error"
/// from check_and_advance — acceptable because (a) the errno field still
/// carries the last error code, and (b) a filesystem that has produced
/// 500K+ writeback errors has long since been marked for fsck.
///
/// Single atomic operation for both `set_err()` and `check_and_advance()` —
/// no torn reads between errno and counter.
#[cfg(target_pointer_width = "64")]
pub struct ErrSeq {
    /// Packed: bits [15:0] = errno (unsigned, max 4095),
    /// bit [16] = seen flag, bits [63:17] = counter.
    /// Single AtomicU64 prevents torn reads.
    inner: AtomicU64,
}

#[cfg(target_pointer_width = "32")]
pub struct ErrSeq {
    /// Packed: bits [11:0] = errno, bit [12] = seen flag,
    /// bits [31:13] = counter. Matches Linux errseq_t layout.
    inner: AtomicU32,
}

#[cfg(target_pointer_width = "64")]
impl ErrSeq {
    // 64-bit ERRNO_BITS is 16 (not 12) to simplify the bit layout; values > 4095
    // are kernel bugs caught by debug_assert. 47-bit counter provides ample headroom.
    const ERRNO_BITS: u32 = 16;
    const SEEN_BIT: u64 = 1 << Self::ERRNO_BITS;
    const CTR_INC: u64 = 1 << (Self::ERRNO_BITS + 1);
    const ERRNO_MASK: u64 = Self::SEEN_BIT - 1;

    pub const fn new() -> Self { Self { inner: AtomicU64::new(0) } }

    /// Record a new writeback error. Atomically packs errno + incremented
    /// counter into a single word. No backoff needed: writeback_lock
    /// serializes concurrent writeback per inode. The only contention
    /// source is a rare race between set_err (writeback error completion)
    /// and check_and_advance (fsync), which retries at most once.
    pub fn set_err(&self, errno: i32) {
        debug_assert!(errno.unsigned_abs() <= 4095, "errno exceeds MAX_ERRNO");
        let errno_val = (errno.unsigned_abs() as u64) & Self::ERRNO_MASK;
        loop {
            let old = self.inner.load(Acquire);
            let new_val = ((old & !Self::ERRNO_MASK & !Self::SEEN_BIT)
                .wrapping_add(Self::CTR_INC)) | errno_val;
            match self.inner.compare_exchange_weak(old, new_val, AcqRel, Acquire) {
                Ok(_) => break,
                Err(_) => continue,
            }
        }
    }

    /// Snapshot current value with "seen" bit set.
    pub fn sample(&self) -> u64 { self.inner.load(Acquire) | Self::SEEN_BIT }

    /// Check for new errors since `since`. Returns errno if changed.
    /// The returned errno is always POSITIVE (e.g., 5 for EIO, not -5).
    /// `set_err()` accepts both positive and negative errnos (via
    /// `unsigned_abs()`), but the stored and returned value is always the
    /// absolute (positive) errno. Callers that need a negative errno for
    /// syscall returns must negate: `Err(-(errno as i32))`.
    pub fn check_and_advance(&self, since: &mut u64) -> Option<i32> {
        let current = self.inner.load(Acquire);
        if current == *since { return None; }
        // Mark the current generation as seen in the shared word so that
        // the next check against the advanced snapshot compares equal.
        // Without this, `current` never gains the seen bit and the same
        // error would be re-reported on every fsync. Best-effort CAS: if
        // it races with set_err(), the new generation supersedes it and
        // is reported on the next check.
        if current & Self::SEEN_BIT == 0 {
            let _ = self.inner.compare_exchange(
                current, current | Self::SEEN_BIT, AcqRel, Acquire);
        }
        *since = current | Self::SEEN_BIT;
        let errno = (current & Self::ERRNO_MASK) as i32;
        if errno == 0 { None } else { Some(errno) }
    }
}

#[cfg(target_pointer_width = "32")]
impl ErrSeq {
    const ERRNO_BITS: u32 = 12; // ilog2(MAX_ERRNO=4095) + 1
    const SEEN_BIT: u32 = 1 << Self::ERRNO_BITS;
    const CTR_INC: u32 = 1 << (Self::ERRNO_BITS + 1);
    const ERRNO_MASK: u32 = Self::SEEN_BIT - 1;

    pub const fn new() -> Self { Self { inner: AtomicU32::new(0) } }

    pub fn set_err(&self, errno: i32) {
        debug_assert!(errno.unsigned_abs() <= 4095, "errno exceeds MAX_ERRNO");
        let errno_val = (errno.unsigned_abs()) & Self::ERRNO_MASK;
        loop {
            let old = self.inner.load(Acquire);
            let new_val = ((old & !Self::ERRNO_MASK & !Self::SEEN_BIT)
                .wrapping_add(Self::CTR_INC)) | errno_val;
            match self.inner.compare_exchange_weak(old, new_val, AcqRel, Acquire) {
                Ok(_) => break,
                Err(_) => continue,
            }
        }
    }

    pub fn sample(&self) -> u32 { self.inner.load(Acquire) | Self::SEEN_BIT }

    /// Returns positive errno (see 64-bit variant doc comment).
    pub fn check_and_advance(&self, since: &mut u32) -> Option<i32> {
        let current = self.inner.load(Acquire);
        if current == *since { return None; }
        // Propagate the seen bit to the shared word (see 64-bit variant).
        if current & Self::SEEN_BIT == 0 {
            let _ = self.inner.compare_exchange(
                current, current | Self::SEEN_BIT, AcqRel, Acquire);
        }
        *since = current | Self::SEEN_BIT;
        let errno = (current & Self::ERRNO_MASK) as i32;
        if errno == 0 { None } else { Some(errno) }
    }
}
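The per-fd lifecycle (snapshot at open, error recorded by writeback completion, exactly one report per fd at fsync) can be exercised with a condensed, self-contained rendering of the 64-bit layout. In this sketch, check_and_advance propagates the seen bit into the shared word, which is what makes a second check on the same fd return None:

```rust
use std::sync::atomic::{
    AtomicU64,
    Ordering::{AcqRel, Acquire},
};

/// Condensed 64-bit ErrSeq: errno in bits [15:0], seen flag in bit 16,
/// counter in bits [63:17].
pub struct ErrSeq {
    inner: AtomicU64,
}

impl ErrSeq {
    const SEEN_BIT: u64 = 1 << 16;
    const CTR_INC: u64 = 1 << 17;
    const ERRNO_MASK: u64 = Self::SEEN_BIT - 1;

    pub const fn new() -> Self {
        Self { inner: AtomicU64::new(0) }
    }

    /// Record a writeback error: replace errno, clear seen, bump counter.
    pub fn set_err(&self, errno: i32) {
        let errno_val = (errno.unsigned_abs() as u64) & Self::ERRNO_MASK;
        loop {
            let old = self.inner.load(Acquire);
            let new_val = ((old & !Self::ERRNO_MASK & !Self::SEEN_BIT)
                .wrapping_add(Self::CTR_INC))
                | errno_val;
            if self.inner.compare_exchange_weak(old, new_val, AcqRel, Acquire).is_ok() {
                break;
            }
        }
    }

    /// Snapshot taken when an fd is opened (stored in File.f_wb_err).
    pub fn sample(&self) -> u64 {
        self.inner.load(Acquire) | Self::SEEN_BIT
    }

    /// Report a new error at most once per fd, marking the generation
    /// as seen in the shared word so the next check compares equal.
    pub fn check_and_advance(&self, since: &mut u64) -> Option<i32> {
        let current = self.inner.load(Acquire);
        if current == *since {
            return None;
        }
        if current & Self::SEEN_BIT == 0 {
            let _ = self
                .inner
                .compare_exchange(current, current | Self::SEEN_BIT, AcqRel, Acquire);
        }
        *since = current | Self::SEEN_BIT;
        let errno = (current & Self::ERRNO_MASK) as i32;
        if errno == 0 { None } else { Some(errno) }
    }
}
```

A typical sequence: two fds open and snapshot; a writeback completion records EIO (5); the first fsync on each fd reports it, a repeat fsync on the same fd does not.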

fdatasync vs fsync decision matrix:

| Condition | fdatasync action | fsync action |
|---|---|---|
| Dirty data pages exist | Writeback + wait | Writeback + wait |
| File size changed (truncate/append) | Metadata flush (size is data-relevant) | Metadata flush |
| Only timestamps changed | Skip metadata flush | Metadata flush |
| Permissions/ownership changed | Skip metadata flush | Metadata flush |
| Journal commit needed | Yes (for data blocks) | Yes (for all) |
| Device cache flush | Yes | Yes |
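The timestamp-only skip reduces to a small predicate over the inode dirty-state flags, mirroring the check in ext4_fsync above. A sketch; the bit values are illustrative, only the flag names come from the text:

```rust
// Illustrative bit values; the names follow the text above.
const I_DIRTY_DATASYNC: u32 = 1 << 0; // data-relevant metadata (e.g., size) changed
const I_DIRTY_TIME: u32 = 1 << 1;     // only timestamps changed

/// Whether step 2 (filesystem metadata commit) is required after the
/// data pages have been flushed. fsync always commits; fdatasync skips
/// the commit only when nothing but timestamps is dirty.
fn needs_metadata_commit(inode_state: u32, datasync: bool) -> bool {
    if !datasync {
        return true; // fsync: always commit metadata
    }
    !(inode_state & I_DIRTY_DATASYNC == 0 && inode_state & I_DIRTY_TIME != 0)
}
```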

Cross-references:

  - AddressSpaceOps::writepage(): Section 14.1
  - Page cache dirty tracking: Section 4.2
  - Block I/O layer and bio_submit(): Section 15.2
  - Journal write barrier: Section 15.5
  - VFS ring buffer protocol (Tier 1/2 fsync dispatch): Section 14.2
  - Writeback thread organization: Section 4.6
  - Copy-on-Write / Redirect-on-Write infrastructure: below

14.4.1 Copy-on-Write and Redirect-on-Write Infrastructure

Modern filesystems fall into three write models. Linux treats all three identically at the VFS level — each filesystem independently manages its own write path, extent sharing, and snapshot interaction. This means the VFS cannot optimize writeback scheduling, cannot share page cache pages between reflinked files, and cannot accurately predict free space costs for dirty page flushes.

UmkaOS's VFS distinguishes these models explicitly, enabling generic optimizations that benefit all CoW/RoW filesystems without filesystem-specific code in the VFS.

14.4.1.1 Write Mode Declaration

/// Write mode declared by each filesystem via `FileSystemOps::write_mode()`.
/// Cached in `SuperBlock.write_mode` at mount time. Informs the VFS writeback
/// path, page cache sharing strategy, and free space accounting.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum WriteMode {
    /// Traditional in-place overwrite (ext4 without reflinks, tmpfs, ramfs).
    /// Writeback reuses the same block address. No extent sharing awareness
    /// needed. Free space cost of flushing dirty pages: zero (no new blocks).
    InPlace,

    /// Copy-on-Write for shared extents (XFS with reflinks, ext4 with reflinks).
    /// Non-shared extents are overwritten in place. Shared extents (refcount > 1
    /// due to reflinks, snapshots, or dedup) require new block allocation on
    /// write. The writeback path queries `ExtentSharingOps::is_extent_shared()`
    /// to decide: shared → allocate new block; unshared → overwrite in place.
    CopyOnWrite,

    /// Redirect-on-Write: the filesystem NEVER overwrites data blocks (Btrfs,
    /// ZFS, bcachefs, UPFS). All writes allocate new blocks; old blocks are
    /// freed only when no snapshot, clone, or active reference retains them.
    /// Consistency is achieved by atomic metadata root pointer updates (e.g.,
    /// ZFS uberblock, Btrfs tree root, UPFS checkpoint record) rather than
    /// journaling.
    ///
    /// Writeback always requests a new block address from the filesystem.
    /// Free space accounting must reserve space for pending redirections.
    RedirectOnWrite,
}
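How the writeback path consumes the declared mode can be sketched as a single dispatch. This is a self-contained sketch: `needs_new_block` is an illustrative helper, with the sharing query already resolved to a bool for brevity:

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum WriteMode {
    InPlace,
    CopyOnWrite,
    RedirectOnWrite,
}

/// Must writeback allocate a fresh physical block for this dirty block?
/// `extent_is_shared` is the (already-resolved) answer from an
/// is_extent_shared()-style query; it only matters in CopyOnWrite mode.
fn needs_new_block(mode: WriteMode, extent_is_shared: bool) -> bool {
    match mode {
        WriteMode::InPlace => false,                // reuse the existing block
        WriteMode::CopyOnWrite => extent_is_shared, // shared extent: new block
        WriteMode::RedirectOnWrite => true,         // never overwrites
    }
}
```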

Why three modes matter:

| Aspect | InPlace | CopyOnWrite | RedirectOnWrite |
|---|---|---|---|
| Writeback block allocation | Reuse existing | Conditional (check sharing) | Always new |
| Free space cost of dirty flush | Zero | Zero (unshared) or one block (shared) | One new block per dirty block |
| Sequential writeback batching | Not useful (scattered overwrites) | Not useful (scattered) | Highly useful (batch new allocations into sequential runs) |
| Page cache sharing for reflinks | N/A (no reflinks) | Yes (shared extents) | Yes (all data may be snapshot-shared) |
| Journal needed | Typically yes | Depends on FS | No (atomic root update suffices) |
| Snapshot integration | External (LVM/dm-snapshot) | Per-extent refcount | Native (tree root versioning) |

14.4.1.2 Extent Sharing Operations

/// Trait implemented by filesystems that support extent sharing (reflinks,
/// snapshots, clones, dedup). Optional — only `CopyOnWrite` and
/// `RedirectOnWrite` filesystems implement this.
///
/// The VFS queries these methods during:
/// - Writeback: to decide CoW vs in-place for `CopyOnWrite` filesystems.
/// - Page cache lookup: to enable shared-extent page cache.
/// - Free space accounting: to estimate true cost of flushing dirty pages.
///
/// All methods are called from the writeback workqueue context (not the page
/// fault hot path). Implementations may acquire filesystem-internal locks.
pub trait ExtentSharingOps: Send + Sync {
    /// Returns true if the extent covering `[file_offset, file_offset + len)`
    /// in the given inode is shared (refcount > 1 due to reflinks, snapshots,
    /// or dedup). Returns false for holes, unallocated ranges, and unshared
    /// extents.
    ///
    /// For `CopyOnWrite` filesystems: determines whether writeback must
    /// allocate a new block or can overwrite in place.
    /// For `RedirectOnWrite` filesystems: always returns true conceptually
    /// (all blocks are "shared" with the previous tree version), but the
    /// implementation may optimize by returning false for blocks that are
    /// guaranteed unshared (e.g., newly allocated since the last checkpoint).
    fn is_extent_shared(&self, inode: InodeId, file_offset: u64, len: u64) -> bool;

    /// Returns the physical location of the extent backing `[file_offset..]`
    /// in the given inode. Used by the shared-extent page cache to index
    /// pages by physical address. Returns `None` for holes, unallocated
    /// ranges, and inline data.
    fn extent_phys_addr(&self, inode: InodeId, file_offset: u64) -> Option<PhysExtent>;

    /// Allocate a new block for a CoW write. Called by the writeback path when
    /// `is_extent_shared()` returns true (CopyOnWrite mode) or unconditionally
    /// (RedirectOnWrite mode). The filesystem allocates a new physical extent,
    /// updates its internal mapping, and returns the new extent descriptor.
    ///
    /// The old extent's refcount is decremented. If it drops to zero and no
    /// snapshot retains it, the filesystem may free it (deferred to the
    /// filesystem's own garbage collection or checkpoint cycle).
    fn cow_allocate(
        &self,
        inode: InodeId,
        file_offset: u64,
        len: u64,
    ) -> Result<PhysExtent>;
}

/// A physical extent descriptor — identifies a contiguous range on a block device.
pub struct PhysExtent {
    /// Block device that owns this extent.
    pub bdev: BlockDeviceId,
    /// Physical byte offset on the block device.
    pub phys_offset: u64,
    /// Length of the extent in bytes.
    pub len: u64,
}
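Putting the trait to use, the CopyOnWrite writeback decision from Section 14.4.1.1 looks roughly like this. A self-contained sketch with pared-down types (bdev omitted from PhysExtent) and a toy filesystem standing in for a real ExtentSharingOps implementation:

```rust
type InodeId = u64;

/// Pared-down physical extent for this sketch (bdev omitted).
#[derive(Debug, PartialEq)]
struct PhysExtent {
    phys_offset: u64,
    len: u64,
}

trait ExtentSharingOps {
    fn is_extent_shared(&self, inode: InodeId, file_offset: u64, len: u64) -> bool;
    fn cow_allocate(&self, inode: InodeId, file_offset: u64, len: u64)
        -> Result<PhysExtent, ()>;
}

/// Where should writeback send this dirty range? Shared extents must be
/// redirected via cow_allocate(); unshared extents are overwritten in place.
fn writeback_target<F: ExtentSharingOps>(
    fs: &F,
    inode: InodeId,
    file_offset: u64,
    len: u64,
    existing: PhysExtent,
) -> Result<PhysExtent, ()> {
    if fs.is_extent_shared(inode, file_offset, len) {
        fs.cow_allocate(inode, file_offset, len) // refcount > 1: do not overwrite
    } else {
        Ok(existing) // sole owner: in-place overwrite is safe
    }
}

/// Toy filesystem: offsets below 64 KiB are reflink-shared; CoW
/// allocations land at a fixed 1 MiB region (illustrative only).
struct ToyFs;

impl ExtentSharingOps for ToyFs {
    fn is_extent_shared(&self, _inode: InodeId, file_offset: u64, _len: u64) -> bool {
        file_offset < 64 * 1024
    }
    fn cow_allocate(&self, _inode: InodeId, _file_offset: u64, len: u64)
        -> Result<PhysExtent, ()> {
        Ok(PhysExtent { phys_offset: 1 << 20, len })
    }
}
```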

14.4.1.3 Shared-Extent Page Cache

Problem (Linux limitation): Linux's page cache is indexed solely by (address_space, file_offset). When two files share a physical extent via reflink, each file gets a separate page cache entry for the same data — doubling memory consumption. This is a known limitation acknowledged by Linux developers, unfixed because the assumption "one page = one mapping pointer" is deeply embedded in the Linux MM.

UmkaOS design: dual-indexed page cache with physical extent awareness.

UmkaOS has no such legacy constraint. The page cache uses a two-level lookup:

  1. Primary index (unchanged): per-inode PageCache keyed by file_offset — the standard per-inode page tree used for all I/O operations (Section 4.4).

  2. Secondary index (new): PhysExtentCache — a global RcuHashMap keyed by (BlockDeviceId, phys_offset) that maps to Page references. Populated only for filesystems with WriteMode::CopyOnWrite or WriteMode::RedirectOnWrite.

/// Global cache of pages indexed by physical extent location.
/// Enables page sharing between files that reference the same physical blocks
/// (reflinks, snapshots, clones). RCU-protected for lock-free read-side lookups.
///
/// Lookup path (read): RCU read lock → hash lookup → Page refcount increment.
/// Insert path (miss): filesystem provides PhysExtent via ExtentSharingOps →
///   read from disk → insert into PhysExtentCache → insert into per-inode PageCache.
/// Eviction: when a Page's refcount (across all address_spaces) drops to zero,
///   the page is removed from PhysExtentCache and per-inode PageCaches.
pub static PHYS_EXTENT_CACHE: OnceCell<RcuHashMap<PhysExtentKey, Page>> = OnceCell::new();

/// Key for the physical extent cache: (block device, physical byte offset).
/// Uses the page-aligned offset (physical offset rounded down to page size).
#[derive(Hash, Eq, PartialEq, Clone, Copy)]
pub struct PhysExtentKey {
    pub bdev: BlockDeviceId,
    pub phys_offset: u64, // page-aligned
}

Read path (for CoW/RoW filesystems):

read(inode_A, file_offset_X):
  1. Look up (inode_A.address_space, file_offset_X) in per-inode PageCache.
     → Hit: return page (standard fast path, no change from InPlace mode).
     → Miss: continue to step 2.
  2. Call extent_phys_addr(inode_A, file_offset_X) → PhysExtent { bdev, phys_offset }.
  3. Look up (bdev, phys_offset) in PHYS_EXTENT_CACHE.
     → Hit: page is already cached (another file sharing this extent loaded it).
       Insert a reference into inode_A's PageCache. Return page.
     → Miss: read from disk. Insert into PHYS_EXTENT_CACHE and inode_A's PageCache.

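The two-level lookup can be sketched in ordinary Rust using HashMaps in place of the per-inode PageCache and the RCU-protected PhysExtentCache, and Rc in place of the kernel's refcounted Page. All names here (PageCaches, reflink_shares_pages) are illustrative stand-ins, not the kernel API:

```rust
use std::collections::HashMap;
use std::rc::Rc;

type InodeId = u64;
type PhysKey = (u32 /* bdev */, u64 /* phys_offset, page-aligned */);
type Page = Rc<Vec<u8>>; // refcounted page payload

struct PageCaches {
    per_inode: HashMap<(InodeId, u64), Page>, // primary: (inode, file_offset)
    phys_extent: HashMap<PhysKey, Page>,      // secondary: (bdev, phys_offset)
}

impl PageCaches {
    fn read(
        &mut self,
        inode: InodeId,
        file_offset: u64,
        // stand-ins for ExtentSharingOps::extent_phys_addr + block I/O
        phys_addr: impl Fn(InodeId, u64) -> PhysKey,
        disk_read: impl Fn(PhysKey) -> Vec<u8>,
    ) -> Page {
        // 1. Primary index: per-inode lookup (hot path, unchanged).
        if let Some(page) = self.per_inode.get(&(inode, file_offset)) {
            return page.clone();
        }
        // 2. Miss: ask the filesystem for the backing physical extent.
        let key = phys_addr(inode, file_offset);
        // 3. Secondary index: a file sharing this extent may have loaded it;
        //    otherwise read from disk and populate the physical index.
        let page = self
            .phys_extent
            .entry(key)
            .or_insert_with(|| Rc::new(disk_read(key)))
            .clone();
        // Insert a reference into this inode's primary index.
        self.per_inode.insert((inode, file_offset), page.clone());
        page
    }
}

/// Two reflinked files resolving to the same physical extent end up
/// holding the SAME page, not two copies.
fn reflink_shares_pages() -> bool {
    let mut cache = PageCaches {
        per_inode: HashMap::new(),
        phys_extent: HashMap::new(),
    };
    let phys = |_ino: InodeId, off: u64| (0u32, off); // both files map offset 0
    let disk = |k: PhysKey| vec![k.1 as u8; 4096];
    let a = cache.read(1, 0, phys, disk);
    let b = cache.read(2, 0, phys, disk);
    Rc::ptr_eq(&a, &b)
}

fn main() {
    assert!(reflink_shares_pages());
    println!("reflinked files share one in-memory page");
}
```

The second read hits the physical index (step 3, hit) and merely adds a reference, which is exactly how the 2x memory penalty is avoided.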
Write path (CoW-on-write for shared pages):

write(inode_A, file_offset_X, data):
  1. Look up page in inode_A's PageCache.
  2. If page is in PHYS_EXTENT_CACHE AND has references from multiple address_spaces:
     → Allocate a new private page for inode_A.
     → Copy old page contents to new page.
     → Apply the write to the new page.
     → Remove inode_A's reference from the old page in PHYS_EXTENT_CACHE.
     → Insert new page into inode_A's PageCache (not into PHYS_EXTENT_CACHE —
       it's now private to inode_A until the filesystem assigns it a new
       physical location during writeback).
     → Mark new page dirty.
  3. If page is NOT shared (single reference):
     → Modify in place, mark dirty (standard path).

This design eliminates the 2x memory penalty for reflinked files. The cost is one additional hash lookup on the cache-miss path (steps 2-3), which is cold (disk I/O dominates). Hot-path reads (step 1 hit) are unchanged.
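The shared-vs-private decision in the write path is the same copy-on-write discipline that Rust's Rc::make_mut applies to refcounted data: clone when the refcount exceeds one, mutate in place when unique. A minimal illustration of the two branches (an analogy, not kernel code):

```rust
use std::rc::Rc;

/// Model a page shared between two files via the physical-extent cache,
/// then write through one of them. Returns (forked, in_place).
fn cow_write_forks_shared_page() -> (bool, bool) {
    let mut file_a: Rc<Vec<u8>> = Rc::new(vec![0u8; 4]);
    let file_b = Rc::clone(&file_a); // reflink: second reference, no copy

    // Write through file_a: refcount == 2, so make_mut clones first
    // (the "allocate a new private page" branch), leaving file_b intact.
    Rc::make_mut(&mut file_a)[0] = 0xAA;
    let forked = !Rc::ptr_eq(&file_a, &file_b) && file_b[0] == 0;

    // Write again: file_a is now the sole owner, so make_mut mutates
    // in place (the "NOT shared" branch) without another copy.
    let before = Rc::as_ptr(&file_a);
    Rc::make_mut(&mut file_a)[1] = 0xBB;
    let in_place = Rc::as_ptr(&file_a) == before;

    (forked, in_place)
}

fn main() {
    let (forked, in_place) = cow_write_forks_shared_page();
    assert!(forked && in_place);
    println!("shared page forked on write; private page modified in place");
}
```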

Memory savings: For a 10 GB dataset reflinked to 5 containers, Linux uses 50 GB of page cache (5 copies). UmkaOS uses 10 GB (one copy, 5 references). This is particularly significant for container-dense workloads where the same base image is reflinked across hundreds of containers.

The remap_file_range() method in FileOps (Section 14.1) is the filesystem-level backend. The VFS generic layer performs validation and exposes reflink/dedup to userspace through ioctls:

/// Flags for remap_file_range().
pub struct RemapFlags(u32);
impl RemapFlags {
    /// Only remap if source and destination byte ranges are identical.
    /// Used by FIDEDUPERANGE ioctl for deduplication.
    pub const REMAP_FILE_DEDUP: Self = Self(1 << 0);

    /// Caller accepts a shorter remap than requested. The filesystem may
    /// return fewer bytes than `len` if the source extent ends early.
    /// Used by FICLONE/FICLONERANGE (always set) and copy_file_range.
    pub const REMAP_FILE_CAN_SHORTEN: Self = Self(1 << 1);
}

/// Clone range descriptor for FICLONERANGE ioctl.
/// Matches Linux's `struct file_clone_range` layout for binary compatibility.
#[repr(C)]
pub struct FileCloneRange {
    /// Source file descriptor.
    pub src_fd: i64,
    /// Source file offset.
    pub src_offset: u64,
    /// Length to clone (0 = clone to EOF).
    pub src_length: u64,
    /// Destination file offset.
    pub dest_offset: u64,
}
// Layout: 8 + 8 + 8 + 8 = 32 bytes.
const_assert!(size_of::<FileCloneRange>() == 32);

Ioctl definitions (Linux ABI-compatible):

Ioctl Number (x86-64) Argument Semantics
FICLONE 0x40049409 (_IOW(0x94, 9, i32)) Source fd Clone entire file into destination fd
FICLONERANGE 0x4020940d (_IOW(0x94, 13, FileCloneRange)) FileCloneRange Clone specified byte range
FIDEDUPERANGE 0xC0189436 (_IOWR(0x94, 54, FileDeduperangeHdr)) Variable-length Dedup if content matches
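These numbers follow the standard Linux _IOC encoding: direction in bits 31:30, argument size in bits 29:16, type in bits 15:8, and command number in bits 7:0. A standalone check that the table's values are internally consistent (the `ioc` helper is a local sketch of the _IOW/_IOWR macros):

```rust
// Linux ioctl number encoding: dir<<30 | size<<16 | type<<8 | nr.
const IOC_WRITE: u32 = 1; // _IOW direction bit
const IOC_READ: u32 = 2;  // _IOR direction bit

const fn ioc(dir: u32, ty: u32, nr: u32, size: u32) -> u32 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}

// _IOW(0x94, 9, i32): FICLONE takes the source fd as a 4-byte argument.
const FICLONE: u32 = ioc(IOC_WRITE, 0x94, 9, 4);
// _IOW(0x94, 13, FileCloneRange): 32-byte struct argument (see const_assert).
const FICLONERANGE: u32 = ioc(IOC_WRITE, 0x94, 13, 32);
// _IOWR(0x94, 54, ...): 24-byte fixed header, read+write direction.
const FIDEDUPERANGE: u32 = ioc(IOC_WRITE | IOC_READ, 0x94, 54, 24);

fn main() {
    assert_eq!(FICLONE, 0x4004_9409);
    assert_eq!(FICLONERANGE, 0x4020_940d);
    assert_eq!(FIDEDUPERANGE, 0xC018_9436);
    println!("ioctl encodings match the table");
}
```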

VFS ioctl dispatch (generic, not per-filesystem):

ioctl(dst_fd, FICLONE, src_fd):
  1. Validate: both fds open, dst writable, same superblock, same filesystem type.
  2. Lock ordering: inode_lock(src) then inode_lock(dst) (lower ino first to prevent
     deadlock when src and dst are swapped in concurrent calls).
  3. Call dst.file_ops.remap_file_range(src, 0, dst, 0, src.size,
     REMAP_FILE_CAN_SHORTEN).
  4. Invalidate dst's affected range in PHYS_EXTENT_CACHE (new shared extents will
     be populated lazily on next read).
  5. Return 0 on success, -errno on failure.

14.4.1.5 copy_file_range() VFS Dispatch

copy_file_range(2) (syscall 326 on x86-64) is the general-purpose server-side copy interface. The VFS dispatch prioritizes zero-copy where possible:

copy_file_range(fd_in, off_in, fd_out, off_out, len, flags=0):
  1. If same filesystem AND filesystem implements remap_file_range():
     → Try reflink (remap_file_range with REMAP_FILE_CAN_SHORTEN).
     → If EOPNOTSUPP: fall through to step 2.
  2. If same filesystem AND filesystem implements a dedicated copy_file_range handler:
     → Use filesystem-specific server-side copy (e.g., NFS server-side copy,
       CIFS CopyChunk).
  3. Fallback: splice-based copy through page cache (generic, works cross-filesystem).
     This reads source pages into the page cache, then writes them to the
     destination — no userspace round-trip, but does consume page cache memory.

Note: UmkaOS uses the same syscall number as Linux (326 on x86-64) for ABI compatibility. The flags parameter is reserved (must be 0).

14.4.1.6 CoW/RoW-Aware Writeback

The writeback thread (Section 4.6) uses SuperBlock.write_mode to adapt its behavior:

Write mode Writeback behavior
InPlace Standard: for each dirty page, issue write to the page's existing block address. The block address is already known (stored in the iomap).
CopyOnWrite Before writing each dirty page, call is_extent_shared(). If shared: call cow_allocate() to get a new block address, write to the new address, update the filesystem's extent mapping. If unshared: write in place (same as InPlace).
RedirectOnWrite For ALL dirty pages, call cow_allocate() to get new block addresses. The filesystem's allocator can batch these requests to produce sequential physical layouts, reducing seek overhead on rotational media and improving flash write amplification. The old block addresses are not reused until the filesystem's checkpoint/commit cycle confirms the new tree root.

RoW writeback batching optimization: For RedirectOnWrite filesystems, the writeback thread collects all dirty pages for a given inode before requesting block allocations. This allows the filesystem allocator to assign a contiguous physical range (one large extent) rather than many scattered single-block allocations. The batch size is bounded by BDI_MAX_WRITEBACK_BATCH (default: 1024 pages = 4 MB). This produces sequential I/O patterns even when dirty pages were written at random file offsets — a significant advantage for RoW filesystems on both rotational and flash storage.
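The batching step amounts to sorting the inode's dirty page indices and coalescing adjacent runs, so the allocator sees one request per contiguous range instead of one per page. A sketch of the coalescing pass (illustrative helper, not the writeback code):

```rust
/// Coalesce dirty page indices into contiguous runs of
/// (first_page, page_count), capping each run at `max_batch` pages
/// (BDI_MAX_WRITEBACK_BATCH in the text: 1024 pages = 4 MB at 4 KiB/page).
fn coalesce(mut dirty: Vec<u64>, max_batch: u64) -> Vec<(u64, u64)> {
    dirty.sort_unstable();
    dirty.dedup();
    let mut runs: Vec<(u64, u64)> = Vec::new();
    for page in dirty {
        match runs.last_mut() {
            // Extend the current run if this page is adjacent and the
            // batch cap has not been reached.
            Some((start, len)) if page == *start + *len && *len < max_batch => {
                *len += 1
            }
            _ => runs.push((page, 1)),
        }
    }
    runs
}

fn main() {
    // Pages dirtied at random file offsets...
    let dirty = vec![9, 2, 3, 4, 100, 101, 2];
    // ...become three contiguous allocation requests.
    let runs = coalesce(dirty, 1024);
    assert_eq!(runs, vec![(2, 3), (9, 1), (100, 2)]);
    println!("{runs:?}");
}
```

The filesystem allocator can then satisfy each run with one extent, producing sequential I/O even though the dirtying order was random.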

14.4.1.7 Free Space Accounting for CoW/RoW Filesystems

Traditional statfs() reports free blocks = total - used. For CoW/RoW filesystems, this is misleading because:

  • Pending CoW: Dirty shared pages will consume new blocks on writeback. The "true" free space is lower than statfs() reports.
  • Snapshot overhead: Deleting a file doesn't free its blocks if snapshots reference them.
  • RoW garbage: Old blocks from previous tree versions occupy space until garbage collection reclaims them.

UmkaOS adds an extended space accounting interface:

/// Extended filesystem space information, reported by CoW/RoW-aware
/// filesystems in addition to the standard StatFs. Optional — InPlace
/// filesystems return None.
pub struct ExtendedSpaceInfo {
    /// Bytes reserved for pending CoW allocations (dirty shared pages that
    /// will need new blocks on writeback).
    pub cow_reserved_bytes: u64,
    /// Bytes reclaimable by snapshot deletion (blocks held only by snapshots,
    /// not by live files).
    pub snapshot_reclaimable_bytes: u64,
    /// Bytes occupied by stale RoW tree versions pending garbage collection.
    pub gc_pending_bytes: u64,
    /// Effective free bytes = statfs.free - cow_reserved - gc_pending.
    /// This is the "true" free space available for new writes.
    pub effective_free_bytes: u64,
}
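The effective_free_bytes relationship is plain arithmetic; a sketch using the struct above shows the derivation, with saturating subtraction guarding against reserved totals transiently exceeding the statfs free count (the space_info helper is illustrative, not the kernel API):

```rust
pub struct ExtendedSpaceInfo {
    pub cow_reserved_bytes: u64,
    pub snapshot_reclaimable_bytes: u64,
    pub gc_pending_bytes: u64,
    pub effective_free_bytes: u64,
}

/// Derive effective free space from the raw statfs free-byte count.
/// Note that snapshot_reclaimable is NOT subtracted: those blocks are
/// already counted as used; deleting snapshots would add them back.
fn space_info(
    statfs_free: u64,
    cow_reserved: u64,
    snap_reclaim: u64,
    gc_pending: u64,
) -> ExtendedSpaceInfo {
    ExtendedSpaceInfo {
        cow_reserved_bytes: cow_reserved,
        snapshot_reclaimable_bytes: snap_reclaim,
        gc_pending_bytes: gc_pending,
        // effective_free = statfs.free - cow_reserved - gc_pending
        effective_free_bytes: statfs_free
            .saturating_sub(cow_reserved)
            .saturating_sub(gc_pending),
    }
}

fn main() {
    // 100 GiB nominally free, 4 GiB pending CoW, 6 GiB awaiting GC.
    let gib = 1u64 << 30;
    let info = space_info(100 * gib, 4 * gib, 20 * gib, 6 * gib);
    assert_eq!(info.effective_free_bytes, 90 * gib);
    println!("effective free: {} GiB", info.effective_free_bytes / gib);
}
```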

This information is exposed via statx extended attributes (STATX_ATTR_* flags) and through the UmkaOS-specific /ukfs/kernel/fs/<mount>/space_info umkafs interface.

Cross-references:

  • WriteMode declaration: FileSystemOps::write_mode() (Section 14.1)
  • remap_file_range(): FileOps trait (Section 14.1)
  • Writeback thread organization: Section 4.6
  • Page cache and AddressSpace: Section 4.4
  • Block I/O submission: Section 15.2
  • Btrfs (RedirectOnWrite): Section 15.8
  • XFS (CopyOnWrite with reflinks): Section 15.7
  • ZFS (RedirectOnWrite): Section 15.10
  • FICLONE/FICLONERANGE Linux compat: Section 19.1
  • Dirty extent pre-registration: Section 14.1 (VFS crash recovery)

14.5 Character and Block Device Node Framework

All device classes that expose character or block device files under /dev register through a unified device node framework. This framework manages major/minor number allocation, the global device registry, and automatic /dev node lifecycle via devtmpfs.

14.5.1 Character Device Region Registration

/// Character device region registration. All device classes (TTY, evdev, ALSA,
/// DRM, watchdog, SPI, RTC, etc.) register through this unified interface.
/// A region reserves a contiguous range of minor numbers under a single major.
pub struct ChrdevRegion {
    /// Major device number. Either a well-known major from the allocation
    /// table below, or dynamically allocated via `alloc_chrdev_region()`.
    /// Valid range: 1-4095 (0 is reserved). Linux ABI uses 12-bit major (MKDEV
    /// encoding: bits 31:20), so the hard limit is 4095. Dynamic allocation
    /// uses the range 234-254, then 384-511.
    pub major: u16,

    /// First minor number in this region.
    pub minor_base: u32,

    /// Number of minor numbers reserved (contiguous from `minor_base`).
    /// Must be >= 1. The range `minor_base..minor_base+minor_count` must
    /// not overlap with any other registered region under the same major.
    ///
    /// **Overflow check**: `register_chrdev_region()` validates:
    /// ```
    /// let end = minor_base.checked_add(minor_count).ok_or(EINVAL)?;
    /// if end > MINORMASK + 1 { return Err(EINVAL); }
    /// ```
    /// Without this check, a `minor_base + minor_count` overflow wraps
    /// to a valid range, causing silent overlap with unrelated device
    /// regions under the same major. `MINORMASK` is `0xFFFFF` (20 bits,
    /// matching Linux's `MINORBITS = 20`).
    pub minor_count: u32,

    /// File operations for all devices in this region. Called by the VFS
    /// when userspace opens, reads, writes, ioctls, or closes a device node
    /// with a matching major:minor pair.
    pub fops: &'static dyn FileOps,

    /// Human-readable name for this region (e.g., "ttyS", "input/event",
    /// "snd/pcmC"). Used in `/proc/devices` output and diagnostic logging.
    /// Max 31 bytes (null-terminated).
    pub name: &'static str,
}
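The overflow hazard described in the minor_count doc comment is easy to reproduce: a wrapping u32 sum maps a huge region back into the valid minor space, so it would pass a naive range check. A standalone sketch of the validation (validate_minor_range and RegionError are illustrative names, MINORMASK as in the text):

```rust
const MINORMASK: u32 = 0xFFFFF; // 20-bit minors, matching MINORBITS = 20

#[derive(Debug, PartialEq)]
enum RegionError {
    Invalid,
}

/// Validate a (minor_base, minor_count) range as register_chrdev_region()
/// must: reject empty ranges, arithmetic overflow, and ends past MINORMASK.
fn validate_minor_range(minor_base: u32, minor_count: u32) -> Result<(), RegionError> {
    // checked_add catches the wrap; without it, 0xFFFF_FF00 + 0x200
    // wraps to 0x100, which is <= MINORMASK + 1 and would "validate".
    let end = minor_base
        .checked_add(minor_count)
        .ok_or(RegionError::Invalid)?;
    if minor_count == 0 || end > MINORMASK + 1 {
        return Err(RegionError::Invalid);
    }
    Ok(())
}

fn main() {
    assert_eq!(validate_minor_range(64, 192), Ok(())); // e.g. ttyS minors 64..256
    // End one past the 20-bit minor space:
    assert_eq!(validate_minor_range(MINORMASK, 2), Err(RegionError::Invalid));
    // Wrapping sum: without checked_add this would pass the range check
    // and silently overlap an unrelated region's minors 0..0x100.
    assert_eq!(
        validate_minor_range(0xFFFF_FF00, 0x200),
        Err(RegionError::Invalid)
    );
    println!("overflow and out-of-range regions rejected");
}
```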

/// Global character device registry. Indexed by a composite key of
/// `(major << 20 | minor_base)` for O(1) lookup during `open()`.
///
/// `RcuXArray` provides:
/// - O(1) lookup by composite key on the `open()` hot path.
/// - RCU-protected reads: `open()` does not acquire any lock — readers
///   call `rcu_read_lock()` + `xa_load()` for lockless lookup.
/// - Ordered iteration for `/proc/devices` enumeration.
///
/// Writers (register/unregister) acquire `CHRDEV_WRITE_LOCK` to serialize
/// mutations, then modify the XArray under its internal lock.
/// Registrations happen at subsystem init and driver probe (warm path).
static CHRDEV_TABLE: RcuXArray<Arc<ChrdevRegion>> = RcuXArray::new();

/// Writer-side serialization for `CHRDEV_TABLE`. Readers never touch this lock.
/// Held only during `register_chrdev_region` / `unregister_chrdev_region`.
static CHRDEV_WRITE_LOCK: SpinLock<()> = SpinLock::new(());

/// Register a character device region. Called by subsystems during init
/// (e.g., TTY layer registers major 4 for serial, input layer registers
/// major 13 for evdev).
///
/// Returns `Ok(())` on success. Returns `Err(DeviceError::RegionConflict)`
/// if the requested major:minor range overlaps with an existing registration.
/// Returns `Err(DeviceError::MajorExhausted)` if dynamic allocation is
/// requested (`major == 0`) and no free major numbers remain.
pub fn register_chrdev_region(region: ChrdevRegion) -> Result<(), DeviceError>;

/// Dynamically allocate a major number and register a region. Used by
/// device classes that do not have a well-known major (UIO, RTC, etc.).
/// The kernel selects the lowest available major in the dynamic range
/// (234-254, then 384-511).
///
/// Returns the allocated major number on success.
pub fn alloc_chrdev_region(
    name: &'static str,
    minor_base: u32,
    minor_count: u32,
    fops: &'static dyn FileOps,
) -> Result<u16, DeviceError>;

/// Unregister a character device region. Called during driver unload or
/// crash recovery. After this call, `open()` on device nodes with matching
/// major:minor returns `ENODEV`.
///
/// Does NOT remove `/dev` nodes — that is handled by `devtmpfs_remove_node()`.
/// The two operations are decoupled because a crash recovery sequence may
/// unregister the old region before registering the replacement.
pub fn unregister_chrdev_region(major: u16, minor_base: u32);

open() dispatch: When userspace calls open("/dev/foo", ...):

  1. VFS resolves the path through the dentry cache to a device inode.
  2. The inode's i_rdev field contains the DevId (major:minor).
  3. Cgroup device access check: The VFS calls cgroup_bpf_run(BPF_CGROUP_DEVICE, &ctx) where ctx is a BpfCgroupDevCtx { access_type, dev_type, major, minor } populated from the inode's device type and the requested access flags (BPF_DEVCG_ACC_READ, BPF_DEVCG_ACC_WRITE, or both depending on O_RDONLY/O_WRONLY/O_RDWR). The BPF program is evaluated bottom-up from the task's cgroup to the root — access is allowed only if every ancestor's program (if any) returns 1. If any program returns 0, open() returns -EPERM immediately. If no BPF program is attached to any ancestor, access is allowed by default. See Section 17.2 for the full enforcement model and the v1 devices.allow/devices.deny translation.
  4. The VFS extracts the actual major and minor from the inode's i_rdev (DevId). It looks up the ChrdevRegion in CHRDEV_TABLE (for character devices) or BlkdevRegion in BLKDEV_TABLE (for block devices) under RCU read lock. The lookup iterates entries for the given major to find the region whose range [minor_base, minor_base + minor_count) contains the inode's minor number. The XArray is keyed by (major << 20 | minor_base), so for a major with multiple regions, the lookup walks entries at keys (major << 20 | 0) through (major << 20 | inode_minor) to find the containing range (XArray ordered iteration, typically 1-2 entries per major).
  5. If found, the region's fops is attached to the new OpenFile.
  6. fops.open() is called with the inode's actual minor number (not the region's minor_base), allowing the driver to compute the device instance index as minor - region.minor_base.

The cgroup check (step 3) applies identically to both chrdev_open() and blkdev_open() — the BPF context distinguishes the two via the dev_type field (BPF_DEVCG_DEV_CHAR=2 vs BPF_DEVCG_DEV_BLOCK=1). The mknod() syscall also calls the same hook with access_type = BPF_DEVCG_ACC_MKNOD before creating a device node in the filesystem.

14.5.2 Block Device Registration

Block devices follow an analogous pattern with register_blkdev() and a separate BLKDEV_TABLE: XArray<Arc<BlkdevRegion>>. The block layer (Section 15.2) adds additional registration state (request queue, disk geometry, partition table) that character devices do not need.

14.5.3 Major Number Allocation Table

Well-known major numbers are assigned to match Linux for userspace compatibility. Tools like ls -l, stat, udev rules, and container runtimes rely on these values being identical to Linux.

Major Device Class Minor Range Notes
1 mem (null, zero, random, urandom, full) 0-15 /dev/null=1,3; /dev/zero=1,5; /dev/full=1,7; /dev/random=1,8; /dev/urandom=1,9
4 ttyS (serial terminals) 64-255 /dev/ttyS0=4,64; legacy range for 16550-compatible UARTs
5 tty, console, ptmx 0-2 /dev/tty=5,0; /dev/console=5,1; /dev/ptmx=5,2
10 misc (miscellaneous character devices) varies /dev/fuse=10,229; /dev/rfkill=10,242; /dev/watchdog=10,130; /dev/loop-control=10,237
13 input (evdev, joydev, mousedev) 0-1023 /dev/input/event0=13,64; mousedev 13,32-63; joydev 13,0-31; evdev 13,64-95; extended evdev 13,256+ (Linux 2.6+)
29 fb (framebuffer) 0-31 /dev/fb0=29,0; legacy interface, DRM preferred
31 mtdblock (MTD block translation) 0-31 /dev/mtdblock0=31,0
90 mtd (raw MTD character access) 0-31 /dev/mtd0=90,0
116 ALSA (snd) 0-255 /dev/snd/pcmC0D0p=116,16; /dev/snd/controlC0=116,0
136 pts (PTY slave devices) 0-1048575 /dev/pts/0=136,0; devpts filesystem allocates minors dynamically
226 DRM (dri) 0-255 /dev/dri/card0=226,0; /dev/dri/renderD128=226,128
239 IPMI device interface 0-31 /dev/ipmi0=239,0
dynamic UIO, RTC, hwmon, etc. allocated at registration Major assigned by alloc_chrdev_region()

14.5.4 Devtmpfs: Automatic /dev Node Lifecycle

Devtmpfs is a kernel-managed tmpfs instance mounted on /dev that automatically creates and removes device nodes in response to device registration and unregistration events. It eliminates the boot-time race between device discovery and userspace udev — device nodes exist before any userspace process runs.

/// Devtmpfs entry describing a device node to create under /dev.
/// Passed to `devtmpfs_create_node()` by the device registry when a
/// device is registered, and to `devtmpfs_remove_node()` on unregistration
/// or crash recovery.
pub struct DevtmpfsEntry {
    /// Path relative to /dev (e.g., "ttyS0", "input/event3", "snd/pcmC0D0p").
    /// Intermediate directories (e.g., "input/", "snd/") are created
    /// automatically if they do not exist. Max 63 bytes.
    pub path: ArrayString<64>,

    /// Device type: character or block.
    pub dev_type: DevType,

    /// Major:minor device identifier.
    pub dev_id: DevId,

    /// File permissions (e.g., 0o666 for /dev/null, 0o620 for TTY devices,
    /// 0o660 for block devices). The owner is always root:root; udev rules
    /// can adjust ownership after boot.
    pub mode: u16,
}

/// Device type discriminant for device nodes.
#[repr(u8)]
pub enum DevType {
    /// Character device (S_IFCHR).
    Char  = 0,
    /// Block device (S_IFBLK).
    Block = 1,
}

/// Major:minor device identifier. Encoded as a single u32 for storage
/// efficiency (matches Linux's `MKDEV(major, minor)` encoding).
///
/// Linux `dev_t` is u32 with MAJOR = top 12 bits (0–4095) and MINOR =
/// bottom 20 bits (0–1048575). The `new()` constructor validates that
/// `major` fits in 12 bits and panics otherwise; without that check,
/// truncation would silently produce wrong device numbers.
pub struct DevId {
    /// Encoded as `(major << 20) | (minor & 0xFFFFF)`.
    /// Major occupies bits 31:20 (12 bits, 0–4095).
    /// Minor occupies bits 19:0  (20 bits, 0–1048575).
    pub raw: u32,
}

impl DevId {
    /// Create a `DevId` from separate major and minor numbers.
    ///
    /// # Panics
    /// Panics if `major > 4095` — Linux ABI reserves only 12 bits for major.
    pub fn new(major: u16, minor: u32) -> Self {
        assert!(major <= 0x0FFF, "DevId: major {} exceeds 12-bit Linux ABI limit (max 4095)", major);
        assert!(minor <= 0xFFFFF, "DevId: minor {} exceeds 20-bit limit (max 1048575)", minor);
        DevId { raw: (major as u32) << 20 | (minor & 0x000F_FFFF) }
    }
    pub fn major(&self) -> u16 { (self.raw >> 20) as u16 }
    pub fn minor(&self) -> u32 { self.raw & 0x000F_FFFF }

    /// Encode for stat()/fstat()/newfstatat() `st_dev` and `st_rdev` fields.
    /// This encoding differs from the kernel-internal MKDEV layout.
    /// Matches Linux `new_encode_dev()` in include/linux/kdev_t.h.
    /// The SysAPI layer calls this when filling `struct stat` responses.
    /// For statx() responses, use `major()` and `minor()` directly (statx
    /// has separate `stx_rdev_major`/`stx_rdev_minor` u32 fields).
    pub fn new_encode_dev(&self) -> u32 {
        let major = self.major() as u32;
        let minor = self.minor();
        (minor & 0xff) | ((major & 0xfff) << 8) | ((minor & !0xffu32) << 12)
    }

    /// Decode a stat()/fstat() encoded device number back to DevId.
    /// Inverse of `new_encode_dev()`. Matches Linux `new_decode_dev()`.
    pub fn new_decode_dev(encoded: u32) -> DevId {
        let major = ((encoded & 0xfff00) >> 8) as u16;
        let minor = (encoded & 0xff) | ((encoded >> 12) & 0xfff00);
        DevId::new(major, minor)
    }
}
// Round-trip verification: encode and decode must be inverses.
// Test with boundary values: major 0..4095 (12 bits), minor 0..1048575 (20 bits).
const_assert!({
    let d = DevId::new(0, 0);
    let enc = d.new_encode_dev();
    let dec = DevId::new_decode_dev(enc);
    dec.major() == 0 && dec.minor() == 0
});
const_assert!({
    let d = DevId::new(4095, 1048575);
    let enc = d.new_encode_dev();
    let dec = DevId::new_decode_dev(enc);
    dec.major() == 4095 && dec.minor() == 1048575
});
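A standalone, runnable version of the round-trip check uses the same arithmetic as new_encode_dev/new_decode_dev above, outside any const context; the encode/decode helpers here are free-function restatements for testing:

```rust
/// Linux new_encode_dev(): pack (major, minor) into the stat() layout.
fn encode(major: u32, minor: u32) -> u32 {
    (minor & 0xff) | ((major & 0xfff) << 8) | ((minor & !0xffu32) << 12)
}

/// Linux new_decode_dev(): inverse of encode().
fn decode(enc: u32) -> (u32, u32) {
    let major = (enc & 0xfff00) >> 8;
    let minor = (enc & 0xff) | ((enc >> 12) & 0xfff00);
    (major, minor)
}

fn main() {
    // Sweeping all 4096 majors against boundary minors is cheap.
    for major in 0..=0xFFFu32 {
        for minor in [0u32, 1, 0xFF, 0x100, 0xFFFF, 0xF_FFFF] {
            assert_eq!(decode(encode(major, minor)), (major, minor));
        }
    }
    // Spot-check a known value: /dev/null is (1, 3) -> 0x0103
    // (low byte of minor, then major in the next 12 bits).
    assert_eq!(encode(1, 3), 0x0103);
    println!("new_encode_dev/new_decode_dev round-trip verified");
}
```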

Lifecycle hooks:

/// Create a device node under /dev. Called by `DeviceRegistry::register()`
/// after a device and its chrdev/blkdev region are successfully registered.
///
/// Creates the inode in the devtmpfs superblock with the specified
/// major:minor, type, and permissions. If intermediate path components
/// do not exist (e.g., "input/" for "input/event3"), they are created as
/// directories with mode 0o755.
///
/// This function is idempotent: if the node already exists with the same
/// major:minor, it is a no-op. If it exists with a different major:minor,
/// the old node is replaced (stale node from a previous driver instance).
pub fn devtmpfs_create_node(entry: &DevtmpfsEntry) -> Result<(), IoError>;

/// Remove a device node from /dev. Called by `DeviceRegistry::unregister()`
/// and by the crash recovery manager when a driver's device is being
/// cleaned up.
///
/// Removes the inode from devtmpfs. Empty parent directories are NOT
/// removed (they may be needed by other devices in the same class).
///
/// Idempotent: removing a non-existent node is a no-op (returns `Ok(())`).
pub fn devtmpfs_remove_node(path: &str) -> Result<(), IoError>;

Boot sequence:

Devtmpfs is mounted during boot Phase 5 (after the physical memory allocator, slab allocator, and VFS are initialized, but before the root filesystem is mounted):

Boot Phase 5: devtmpfs initialization
  1. Create an in-kernel tmpfs instance for devtmpfs.
  2. Mount it internally (not yet visible to userspace).
  3. Create standard device nodes:
     - /dev/null     (1, 3)   mode 0o666   — discard sink
     - /dev/zero     (1, 5)   mode 0o666   — zero source
     - /dev/full     (1, 7)   mode 0o666   — always-full sink
     - /dev/random   (1, 8)   mode 0o666   — blocking entropy source
     - /dev/urandom  (1, 9)   mode 0o666   — non-blocking entropy source
     - /dev/console  (5, 1)   mode 0o600   — kernel console
     - /dev/tty      (5, 0)   mode 0o666   — controlling terminal alias
     - /dev/ptmx     (5, 2)   mode 0o666   — PTY master multiplexer
  4. Device discovery (PCI enumeration, platform devices, DT/ACPI) probes
     drivers, which register devices → devtmpfs_create_node() populates
     /dev with hardware-specific nodes.
  5. After rootfs mount: bind-mount devtmpfs onto /dev in the real root.
     Userspace udev starts and may adjust permissions, create symlinks
     (e.g., /dev/disk/by-uuid/...), and apply udev rules.

Crash recovery interaction: When a Tier 1 driver crashes and its device is being recovered (Section 11.9), the crash recovery manager calls devtmpfs_remove_node() for all device nodes owned by the crashed driver. After the replacement driver loads and re-registers its devices, devtmpfs_create_node() recreates the nodes. Userspace processes that had open file descriptors to the old nodes receive EIO on subsequent I/O; they must reopen the device to get a file descriptor backed by the new driver instance.

14.5.4.1 Crash Recovery and Hotplug Event Interaction

When a driver crash overlaps with hotplug events (e.g., a USB hub driver crashes while devices are being enumerated), the following ordering guarantees apply:

  1. Event queue freeze: The crash recovery manager acquires the hotplug workqueue's drain lock before beginning recovery. New HotplugEvent::DeviceArrival events for the crashed driver's bus subtree are enqueued but not processed until recovery completes. Events for unrelated bus subtrees continue processing normally.

  2. Device node cleanup: devtmpfs_remove_node() is called for each device owned by the crashed driver. The removal is atomic per-node: either the inode is fully removed or the operation has no effect (idempotent).

  3. Pending event replay: After the replacement driver loads and its init() returns ProbeResult::Ok, the hotplug workqueue drain lock is released. Queued arrival events for the recovered subtree are replayed in FIFO order. The new driver instance receives DeviceArrival events for any devices that appeared during the recovery window.

  4. Stale removal events: DeviceRemoval events for devices that were already cleaned up during crash recovery are silently dropped (the device handle no longer exists in the registry). This is safe because devtmpfs_remove_node() is idempotent.

  5. Uevent replay to userspace: After recovery, the netlink translation layer (Section 19.5) emits a synthetic change uevent for each recovered device. This notifies udev/systemd-udevd to re-apply rules (permissions, symlinks) without requiring a full udevadm trigger.

/// Crash recovery hotplug coordination.
///
/// Acquires the drain lock on the hotplug workqueue for the specified bus
/// subtree, preventing event processing until `release_hotplug_drain()`.
/// Events continue to be enqueued — they are replayed on release.
pub fn acquire_hotplug_drain(subtree_root: DeviceHandle) -> HotplugDrainGuard;

/// RAII guard that releases the hotplug drain lock on drop.
/// Queued events for the frozen subtree are replayed in FIFO order.
pub struct HotplugDrainGuard {
    subtree_root: DeviceHandle,
}

impl Drop for HotplugDrainGuard {
    fn drop(&mut self) {
        // Release drain lock. The hotplug workqueue processes all queued
        // events for subtree_root's descendants in FIFO order.
    }
}

14.5.5 Initial Device Naming

The kernel assigns initial device names following Linux conventions for userspace compatibility. Userspace udev may later create persistent symlinks (/dev/disk/by-uuid/, /dev/disk/by-id/, etc.) but the kernel-assigned names must match what Linux tools expect.

Naming rules by device class:

Class Pattern Algorithm Examples
Block (SCSI/NVMe) sd[a-z]+ / nvme[N]n[M] SCSI: alphabetic sequence by probe order. NVMe: controller N, namespace M. sda, sdb, nvme0n1
Block partitions <disk>N Partition number from GPT/MBR table. sda1, nvme0n1p1
Network eth[N] / wlan[N] Sequential index per subsystem (Ethernet vs WiFi). udev's predictable naming (ens3, enp0s25) is applied by userspace rules, not the kernel. eth0, wlan0
TTY serial ttyS[N] Port index from UART enumeration (PCI BAR order, DT aliases, ACPI UID). ttyS0, ttyS1
TTY USB serial ttyUSB[N] Sequential index by USB probe order. ttyUSB0
Input (evdev) input/event[N] Sequential index by registration order. input/event0
ALSA snd/pcmC[N]D[M]p Card N (probe order), device M (codec order), p=playback / c=capture. snd/pcmC0D0p
DRM dri/card[N] / dri/renderD[128+N] Sequential by GPU probe order. Render nodes start at minor 128. dri/card0, dri/renderD128
Framebuffer fb[N] Legacy. Sequential by registration. fb0
Watchdog watchdog[N] Sequential by registration. watchdog0
Loop loop[N] Fixed pool of max_loop devices (default 256). Created at boot. loop0

Implementation: Each device class maintains its own index counter (typically a static AtomicU32). The counter is incremented atomically at register_chrdev() or add_disk() time. The generated name is passed to devtmpfs_create_node() and stored in the DeviceNode.dev_name field.
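The per-class counter is a one-liner with fetch_add: each registration claims the next index exactly once, even under concurrent probes. A minimal sketch (next_ttyusb_name is a hypothetical helper, not the registry API):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Per-class index counter, as each device class maintains.
static TTYUSB_INDEX: AtomicU32 = AtomicU32::new(0);

/// Claim the next ttyUSB index and produce the /dev-relative name.
/// fetch_add is atomic, so two USB serial adapters probed concurrently
/// cannot receive the same index.
fn next_ttyusb_name() -> String {
    let n = TTYUSB_INDEX.fetch_add(1, Ordering::Relaxed);
    format!("ttyUSB{n}")
}

fn main() {
    let first = next_ttyusb_name();
    let second = next_ttyusb_name();
    assert_ne!(first, second); // distinct indices per registration
    println!("{first}, {second}");
}
```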

Stability caveat: Kernel-assigned names like sda/sdb depend on probe order, which can vary across boots. This is a known Linux behavior. Persistent naming (/dev/disk/by-uuid/, /dev/disk/by-path/, /dev/disk/by-id/) is handled entirely by userspace udev rules that read device attributes from the uevent/sysfs interface and create stable symlinks. The kernel provides all necessary attributes (serial number, WWN, partition UUID) via the uevent mechanism (Section 19.5).

14.5.6 File Operations Replacement (replace_fops)

Some device classes use a single /dev entry as a multiplexer that switches to a specialized FileOps vtable after open(). Examples:

  • ALSA: /dev/snd/controlC0 opens with a generic ALSA control FileOps. PCM device files (/dev/snd/pcmC0D0p) open with PCM-specific FileOps from the start and do NOT use replace_fops. The ALSA replace_fops use case is the control device switching to a specialized monitoring mode via SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS.
  • TTY: A TTY file descriptor switches its line discipline (e.g., from N_TTY to N_SLIP) via TIOCSETD, which replaces the FileOps to reflect the new discipline's read/write/ioctl behavior.
  • evdev: EVIOCGRAB transitions an input device to exclusive-grab mode with a specialized FileOps that filters events to the grabbing client.

The OpenFile.f_ops field is declared as &'static dyn FileOps and is normally immutable after creation. replace_fops provides a controlled mechanism to swap it:

/// Atomically replace the FileOps vtable on an open file descriptor.
///
/// This is the mechanism for device classes that multiplex multiple
/// operational modes through a single device node. The caller (device
/// subsystem code, NOT the driver directly) must hold the file's position
/// lock (`fdget_pos()` guard) to prevent concurrent read/write operations
/// from observing a partially-switched state.
///
/// # Safety
///
/// * `new_ops` must be `&'static` — it must outlive the `OpenFile`. In practice
///   this means the new FileOps must be a static vtable defined in the subsystem
///   module (e.g., `static PCM_PLAYBACK_OPS: FileOps = ...`), not a dynamically
///   constructed object.
/// * The caller must ensure no I/O operations are in-flight on the file at the
///   time of the swap. The `fdget_pos()` guard serializes with `read()`/`write()`;
///   `ioctl()` is serialized by the subsystem's own locking (e.g., ALSA's
///   `pcm_stream_lock`, TTY's `tty_lock`).
/// * The old `FileOps` is not freed (it is `&'static`). No cleanup callback is
///   needed.
///
/// # Implementation
///
/// Uses `AtomicPtr::store(new_ops, Release)` on the internal representation of
/// `f_ops`. Subsequent `read()`/`write()`/`ioctl()` calls load with `Acquire`
/// ordering and dispatch through the new vtable. The `Release`/`Acquire` pair
/// ensures all state mutations made by the caller before calling `replace_fops()`
/// (e.g., initializing PCM hardware parameters, setting up the line discipline
/// buffer) are visible to the next I/O operation through the new vtable.
pub fn replace_fops(
    file: &OpenFile,
    new_ops: &'static dyn FileOps,
    _guard: &FdPosGuard,
) {
    // `_guard` enforces at the type level that the caller holds the
    // fdget_pos() guard, serializing concurrent replace_fops() calls
    // on the same OpenFile. Without this parameter, the UnsafeCell
    // write to f_ops_vtable relies on caller discipline alone.
    // Decompose the fat pointer into data + vtable, store data atomically.
    let (data_ptr, vtable_ptr) = (new_ops as *const dyn FileOps).to_raw_parts();
    // (`to_raw_parts()` is the nightly `ptr_metadata` API; the metadata
    // component is erased to `*const ()` before storage in `f_ops_vtable`.)
    // Write vtable FIRST (plain store, ordered by the subsequent Release).
    // SAFETY: all FileOps impl types have 'static vtables. The plain store
    // is safe because the subsequent Release on f_ops_data orders this write
    // relative to any reader's Acquire load.
    unsafe { *file.f_ops_vtable.get() = vtable_ptr; }
    // THEN publish the data pointer with Release. A reader's Acquire load
    // on f_ops_data guarantees visibility of the vtable write above.
    file.f_ops_data.store(data_ptr as *mut (), Release);
}

OpenFile.f_ops representation: To support replace_fops, the internal representation uses two fields rather than a single &'static dyn FileOps:

  • f_ops_data: AtomicPtr<()> — the data-pointer component of the fat pointer, swapped atomically with Release ordering on write, Acquire on read.
  • f_ops_vtable: UnsafeCell<*const ()> — the vtable-pointer component, written BEFORE the f_ops_data Release store.

The Release/Acquire pair on f_ops_data guarantees that a reader observing the new data pointer also observes the new vtable pointer. This ordering is correct on all architectures, including weakly ordered ones (AArch64, RISC-V), because Release orders ALL prior writes.

AtomicPtr<dyn FileOps> is NOT valid Rust (dyn FileOps is !Sized, and AtomicPtr<T> requires T: Sized). The two-field decomposition avoids this limitation. The public f_ops() accessor reconstructs the fat pointer from the two components. For the common case (no replacement), this adds zero overhead on x86-64 (Acquire is free under TSO) and a single ldar instruction on AArch64 (~1 cycle). The f_ops field shown in OpenFile (Section 14.1) is the accessor return type, not the storage type.
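The two-field decomposition and the reconstructing accessor can be sketched in standalone Rust. This is an illustration, not the kernel code: `OpenFileSketch`, the trait, and its impls are hypothetical, and the sketch splits the fat pointer by `transmute` into two words rather than using the nightly `ptr_metadata` API. The data/vtable labels assume rustc's de facto fat-pointer layout, but the round-trip is correct regardless of word order because decomposition and reassembly use the same transmute.

```rust
use std::cell::UnsafeCell;
use std::mem::transmute;
use std::sync::atomic::{AtomicPtr, Ordering};

trait FileOps {
    fn read(&self) -> &'static str;
}

struct GenericOps;
impl FileOps for GenericOps {
    fn read(&self) -> &'static str { "generic" }
}

struct PcmOps;
impl FileOps for PcmOps {
    fn read(&self) -> &'static str { "pcm" }
}

// Static vtables, as required by replace_fops' &'static bound.
static GENERIC: GenericOps = GenericOps;
static PCM: PcmOps = PcmOps;

/// Two-word storage for a `&'static dyn FileOps` fat pointer.
struct OpenFileSketch {
    f_ops_data: AtomicPtr<()>,
    f_ops_vtable: UnsafeCell<*const ()>,
}

impl OpenFileSketch {
    fn new(ops: &'static dyn FileOps) -> Self {
        // Split the fat pointer into two pointer-sized words.
        let [data, vtable]: [*const (); 2] = unsafe { transmute(ops) };
        OpenFileSketch {
            f_ops_data: AtomicPtr::new(data as *mut ()),
            f_ops_vtable: UnsafeCell::new(vtable),
        }
    }

    /// Accessor: reassemble the fat pointer from the two components.
    fn f_ops(&self) -> &'static dyn FileOps {
        let data = self.f_ops_data.load(Ordering::Acquire) as *const ();
        // SAFETY: the vtable word is written before the Release store on
        // f_ops_data, so the Acquire load above makes it visible.
        let vtable = unsafe { *self.f_ops_vtable.get() };
        unsafe { transmute([data, vtable]) }
    }

    /// Vtable first (plain store), then publish the data word (Release).
    fn replace_fops(&self, new_ops: &'static dyn FileOps) {
        let [data, vtable]: [*const (); 2] = unsafe { transmute(new_ops) };
        unsafe { *self.f_ops_vtable.get() = vtable; }
        self.f_ops_data.store(data as *mut (), Ordering::Release);
    }
}

fn main() {
    let file = OpenFileSketch::new(&GENERIC);
    assert_eq!(file.f_ops().read(), "generic");
    file.replace_fops(&PCM);
    assert_eq!(file.f_ops().read(), "pcm");
}
```

The sketch omits the `FdPosGuard` parameter; a concurrent caller would need the same serialization the kernel version demands.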

Subsystem usage constraints: replace_fops is callable only from kernel subsystem code (ALSA core, TTY layer, input core), not from KABI driver callbacks. Tier 1/Tier 2 drivers that need mode-switching behavior must request the switch through their subsystem's control interface (e.g., ALSA snd_pcm_hw_params(), TTY tty_set_ldisc()), which validates the request and calls replace_fops internally.

Cross-references:

  • Device registry and bus management: Section 11.4
  • Crash recovery node cleanup: Section 11.9
  • TTY/PTY device nodes: Section 21.1
  • ALSA device nodes: Section 21.4
  • DRM device nodes: Section 22.1
  • Input (evdev) device nodes: Section 21.3

14.6 Mount Tree Data Structures and Operations

The mount tree is the central data structure of the VFS layer that tracks all mounted filesystems, their hierarchical relationships, and their propagation properties. Every path resolution operation traverses the mount tree (via the mount hash table) to cross mount boundaries. This section defines the complete data structures, algorithms, and namespace operations that were previously referenced but left unspecified by Section 14.1 and Section 17.1.

Design principles:

  1. RCU for the read path: Mount hash table lookups happen on every path resolution (every open(), stat(), readlink(), execve()). The read path must be completely lock-free. Writers (mount/unmount) serialize through the per-namespace mount_lock and publish changes via RCU.

  2. Per-namespace scoping: Unlike Linux, which uses a single global mount_hashtable, UmkaOS scopes the mount hash table per mount namespace. This eliminates contention between namespaces in container-heavy workloads (thousands of namespaces with independent mount trees) and allows mount operations in different namespaces to proceed in parallel with no shared lock. The trade-off is additional memory per namespace; this is acceptable because each namespace already has an independent mount tree and the hash table overhead is proportional to the number of mounts (typically 30-100 per container, well under 1 KiB of hash table memory).

  3. Arc-based lifetime management: Mount nodes are reference-counted via Arc<Mount>. Parent, master, and peer references use Arc (strong) or Weak (where appropriate to break cycles). RCU protects the hash chains and list traversals; Arc protects the Mount node lifetime beyond the RCU grace period.

  4. Capability gating: All mount tree modifications check CAP_MOUNT or CAP_SYS_ADMIN as specified in Section 14.1. The data structures below enforce this at the entry point of each operation, not deep inside the algorithm.

  5. 64-bit mount IDs: Per-namespace monotonic counter, never wrapping on any realistic system. Mount IDs are unique within a namespace and are the stable identifier used by statx() (STATX_MNT_ID), the new statmount()/listmount() syscalls, and /proc/PID/mountinfo.

14.6.1 Mount Flags

bitflags! {
    /// Per-mount flags controlling security and access behavior.
    ///
    /// These are distinct from per-superblock options (which control the
    /// filesystem driver's behavior). A single superblock can be mounted
    /// at multiple locations with different per-mount flags (e.g., one
    /// mount point read-write, another read-only via bind mount + remount).
    ///
    /// Bit assignments match Linux's `MNT_*` internal flags
    /// (`include/linux/mount.h`, stable since Linux 2.6.x). These are
    /// NOT the userspace `MS_*` flags (`include/uapi/linux/mount.h`) —
    /// the `mount(2)` and `mount_setattr(2)` compat shims translate
    /// `MS_*`/`MOUNT_ATTR_*` to `MountFlags` at syscall entry.
    #[repr(transparent)]
    pub struct MountFlags: u64 {
        // --- Userspace-visible flags (set via mount/remount/mount_setattr) ---
        //
        // Bit assignments match Linux `include/linux/mount.h` exactly.
        // Verified against torvalds/linux master (2026-03-25).

        /// Do not honor set-user-ID and set-group-ID bits on executables.
        const MNT_NOSUID       = 0x01;       // Linux: MNT_NOSUID = 0x01
        /// Do not allow access to device special files on this mount.
        const MNT_NODEV        = 0x02;       // Linux: MNT_NODEV = 0x02
        /// Do not allow execution of programs on this mount.
        const MNT_NOEXEC       = 0x04;       // Linux: MNT_NOEXEC = 0x04
        /// Do not update access times on this mount.
        const MNT_NOATIME      = 0x08;       // Linux: MNT_NOATIME = 0x08
        /// Do not update directory access times on this mount.
        const MNT_NODIRATIME   = 0x10;       // Linux: MNT_NODIRATIME = 0x10
        /// Update atime only if atime <= mtime or atime <= ctime, or if
        /// the previous atime is more than 24 hours old. Default for most
        /// mounts since Linux 2.6.30 and UmkaOS.
        const MNT_RELATIME     = 0x20;       // Linux: MNT_RELATIME = 0x20
        /// Mount is read-only. Writes return EROFS.
        const MNT_READONLY     = 0x40;       // Linux: MNT_READONLY = 0x40
        /// Do not follow symlinks on this mount. Used by container runtimes
        /// to prevent symlink-based escapes from bind-mounted directories.
        const MNT_NOSYMFOLLOW  = 0x80;       // Linux: MNT_NOSYMFOLLOW = 0x80

        // --- Internal flags (kernel-managed, not settable by userspace) ---

        /// Mount can be expired and automatically unmounted under memory
        /// pressure or after an idle timeout. Used by autofs. The VFS
        /// checks `mnt_count == 0` before expiring a shrinkable mount.
        const MNT_SHRINKABLE   = 0x100;      // Linux: MNT_SHRINKABLE = 0x100
        /// Internal mount (not exposed to userspace). Used for kernel-
        /// internal mounts (pipefs, sockfs, etc.).
        const MNT_INTERNAL     = 0x4000;     // Linux: MNT_INTERNAL = 0x4000

        // --- Container namespace lock flags (MNT_LOCK_*) ---
        //
        // These flags prevent unprivileged users in child mount namespaces
        // from changing mount attributes inherited from the parent namespace.
        // Set by the kernel when creating a user namespace or copying a
        // mount namespace. Critical for container security — without these,
        // a container could remount a read-only host path as read-write.

        /// Atime setting is locked (NOATIME/RELATIME/NODIRATIME cannot
        /// be changed by unprivileged mount_setattr in child namespace).
        const MNT_LOCK_ATIME     = 0x040000; // Linux: MNT_LOCK_ATIME = 0x040000
        /// NOEXEC flag is locked.
        const MNT_LOCK_NOEXEC    = 0x080000; // Linux: MNT_LOCK_NOEXEC = 0x080000
        /// NOSUID flag is locked.
        const MNT_LOCK_NOSUID    = 0x100000; // Linux: MNT_LOCK_NOSUID = 0x100000
        /// NODEV flag is locked.
        const MNT_LOCK_NODEV     = 0x200000; // Linux: MNT_LOCK_NODEV = 0x200000
        /// READONLY flag is locked (cannot be remounted read-write by
        /// unprivileged users in child namespace).
        const MNT_LOCK_READONLY  = 0x400000; // Linux: MNT_LOCK_READONLY = 0x400000

        /// Mount is locked and cannot be unmounted by unprivileged
        /// processes. Set on mounts visible in child mount namespaces
        /// created by unprivileged users — prevents a child namespace
        /// from unmounting a mount inherited from the parent. Cleared
        /// only by a process with `CAP_SYS_ADMIN` in the mount's owning
        /// user namespace.
        const MNT_LOCKED         = 0x800000; // Linux: MNT_LOCKED = 0x800000

        /// Mount is in the process of being unmounted. Set by `umount()`
        /// before removing the mount from the hash table. Prevents new
        /// path lookups from entering this mount. Once set, never cleared
        /// (the mount node is freed after the RCU grace period).
        const MNT_DOOMED         = 0x1000000;  // Linux: MNT_DOOMED = 0x1000000
        /// Synchronous unmount requested. Set when MNT_DETACH was NOT
        /// specified and the kernel must wait for all references to drain.
        const MNT_SYNC_UMOUNT    = 0x2000000;  // Linux: MNT_SYNC_UMOUNT = 0x2000000
        /// Mount is being torn down by the umount process.
        const MNT_UMOUNT         = 0x8000000;  // Linux: MNT_UMOUNT = 0x8000000

        // --- UmkaOS extension flags (bits 28+) ---
        //
        // These flags are UmkaOS-original extensions NOT present in Linux's
        // mnt_flags. They occupy high bit positions (28+) to avoid collision
        // with future Linux MNT_* additions. Both are intentional design
        // improvements over Linux.

        /// **UmkaOS extension — not present in Linux mnt_flags.**
        /// Per-mount lazytime: buffer atime updates in memory and flush
        /// lazily. Reduces write I/O for atime-heavy workloads (mail servers).
        /// This is a genuine improvement over Linux's per-superblock
        /// `SB_LAZYTIME`: different bind mounts of the same filesystem can
        /// have different lazytime policies (e.g., `/var/mail` with lazytime,
        /// `/var/log` without, on the same ext4 volume).
        const MNT_LAZYTIME       = 1 << 28;   // UmkaOS extension (bit 28)
        /// **UmkaOS extension — not present in Linux mnt_flags.**
        /// Explicit detached-mount state flag. Set by `fsmount()` before
        /// `move_mount()` attaches the mount to the namespace tree. Detached
        /// mounts are invisible to path resolution and /proc/PID/mountinfo.
        /// Linux tracks this implicitly through namespace tree membership;
        /// UmkaOS makes it an explicit flag used in 10+ places in
        /// fsmount/move_mount/open_tree flows for clarity and correctness.
        const MNT_DETACHED       = 1 << 29;   // UmkaOS extension (bit 29)
    }
}
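The MNT_LOCK_* semantics can be illustrated with a small validation check in the style of the mount_setattr() path. This is a hypothetical helper: plain u64 constants mirror the bit values above, whereas the real code operates on `MountFlags` and additionally handles the atime-mode lock (which locks the atime mode entirely, not just one bit). The rule sketched here is that a locked attribute may be kept or newly set, but never cleared by an unprivileged caller in a child namespace.

```rust
const MNT_NOSUID: u64 = 0x01;
const MNT_NODEV: u64 = 0x02;
const MNT_NOEXEC: u64 = 0x04;
const MNT_READONLY: u64 = 0x40;
const MNT_LOCK_NOEXEC: u64 = 0x080000;
const MNT_LOCK_NOSUID: u64 = 0x100000;
const MNT_LOCK_NODEV: u64 = 0x200000;
const MNT_LOCK_READONLY: u64 = 0x400000;

const EPERM: i32 = 1;

/// A locked flag may be kept or strengthened, but never cleared.
fn check_locked_flags(current: u64, requested: u64) -> Result<(), i32> {
    const PAIRS: [(u64, u64); 4] = [
        (MNT_READONLY, MNT_LOCK_READONLY),
        (MNT_NOSUID, MNT_LOCK_NOSUID),
        (MNT_NODEV, MNT_LOCK_NODEV),
        (MNT_NOEXEC, MNT_LOCK_NOEXEC),
    ];
    for (flag, lock) in PAIRS {
        let locked = current & lock != 0;
        let was_set = current & flag != 0;
        let now_set = requested & flag != 0;
        if locked && was_set && !now_set {
            return Err(EPERM); // clearing a locked attribute is denied
        }
    }
    Ok(())
}

fn main() {
    // A container inherits a read-only bind mount with the lock bit set:
    let cur = MNT_READONLY | MNT_LOCK_READONLY | MNT_NOSUID;
    // Attempting to remount read-write (dropping MNT_READONLY) fails...
    assert_eq!(check_locked_flags(cur, MNT_NOSUID), Err(EPERM));
    // ...but adding noexec while keeping readonly is allowed.
    assert!(check_locked_flags(cur, cur | MNT_NOEXEC).is_ok());
}
```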

14.6.2 Propagation Type

/// Mount propagation type. Controls whether mount/unmount events at this
/// mount point are propagated to other mount points, and in which direction.
///
/// Propagation is fundamental to container runtimes: Docker sets the rootfs
/// to MS_PRIVATE by default, Kubernetes uses MS_SHARED for volume mounts
/// that must be visible across pod containers.
///
/// See: Linux kernel Documentation/filesystems/sharedsubtree.rst
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum PropagationType {
    /// Mount events propagate bidirectionally within the peer group.
    /// All mounts in the same peer group see each other's mount/unmount
    /// events. This is the Linux default for the initial namespace root.
    Shared = 0,

    /// Mount events are not propagated to or from this mount. This is
    /// the default for new mount namespaces (container isolation).
    Private = 1,

    /// Mount events propagate unidirectionally from the master to this
    /// mount, but not in the reverse direction. Used when a container
    /// should see new mounts from the host but not expose its own mounts
    /// to the host.
    Slave = 2,

    /// Like Private, but additionally prevents this mount from being
    /// used as the source of a bind mount. Used for security-sensitive
    /// mount points that should never be replicated.
    Unbindable = 3,
}
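The direction of event flow for each propagation type can be summarized in two predicates. A hypothetical, self-contained sketch (it re-declares the enum locally; the real propagation walk also follows the `mnt_share` peer ring and `mnt_slave_list`):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PropagationType { Shared, Private, Slave, Unbindable }

/// Does a mount of this type receive events propagated from elsewhere?
fn receives_events(p: PropagationType) -> bool {
    // Shared receives from its peers; Slave receives from its master.
    matches!(p, PropagationType::Shared | PropagationType::Slave)
}

/// Does a mount of this type send its own events to related mounts?
fn sends_events(p: PropagationType) -> bool {
    // Only Shared propagates outward; Slave is receive-only.
    matches!(p, PropagationType::Shared)
}

/// May this mount be used as the source of a bind mount?
fn bindable(p: PropagationType) -> bool {
    !matches!(p, PropagationType::Unbindable)
}

fn main() {
    use PropagationType::*;
    assert!(receives_events(Shared) && sends_events(Shared));
    assert!(receives_events(Slave) && !sends_events(Slave));
    assert!(!receives_events(Private) && !sends_events(Private));
    assert!(!bindable(Unbindable) && bindable(Private));
}
```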

14.6.3 Mount Node

/// A single mount instance in the mount tree.
///
/// Equivalent to Linux's `struct mount` (not `struct vfsmount` — the latter
/// is the subset exposed to filesystem drivers; `struct mount` is the full
/// internal structure). Each `Mount` represents one attachment of a
/// filesystem at a specific point in the directory tree.
///
/// **Lifetime**: `Mount` nodes are allocated via `Arc<Mount>`. References
/// are held by:
/// - The mount hash table (via RCU-protected hash chain)
/// - The parent mount's `children` list
/// - The peer group's `mnt_share` ring
/// - The master mount's `mnt_slave_list`
/// - Any open file descriptor whose path traversed this mount
///   (via `mnt_count` reference count)
/// - The `MountNamespace.mount_list`
///
/// A mount node is freed when all strong references are dropped, which
/// happens after: (a) removal from the hash table, (b) removal from the
/// parent's child list, (c) RCU grace period completion, and (d) all
/// path-resolution references (`mnt_count`) have been released.
pub struct Mount {
    // --- Identity ---

    /// Unique mount identifier within the owning namespace. Monotonically
    /// increasing, 64-bit, never reused. This is the value returned by
    /// `statx()` in `stx_mnt_id` (STATX_MNT_ID) and reported in
    /// `/proc/PID/mountinfo` field 1.
    pub mount_id: u64,

    /// Device name string (e.g., "/dev/sda1", "tmpfs", "overlay").
    /// Displayed in `/proc/PID/mountinfo` field 10 (mount source).
    /// Heap-allocated, immutable after mount creation.
    pub device_name: Box<[u8]>,

    // --- Tree structure ---

    /// Parent mount. `None` for the root of the mount namespace.
    /// Uses `Weak` to prevent reference cycles in the mount tree:
    /// parent -> children -> parent would create a cycle with `Arc`.
    /// The parent is always alive while any child exists (the child
    /// holds a position in the parent's hash chain), so the `Weak`
    /// can always be upgraded during normal operation. It fails only
    /// during the teardown of a doomed mount tree, which is expected.
    pub parent: Option<Weak<Mount>>,

    /// Cached parent mount ID. Set at mount time, updated on `move_mount`.
    /// Avoids `Weak::upgrade()` during RCU-walk lookups (the upgrade may
    /// fail during concurrent umount). Helper: `fn mount_id_of_parent(&self)
    /// -> u64 { self.parent_mount_id }`.
    pub parent_mount_id: u64,

    /// The dentry in the parent mount's filesystem where this mount is
    /// attached. For the root mount of a namespace, this is the root
    /// dentry of the parent mount (which is itself).
    ///
    /// Together with `parent`, this pair `(parent_mount, mountpoint_dentry)`
    /// is the key in the mount hash table. Path resolution uses this to
    /// detect mount crossings: when a dentry has `DCACHE_MOUNTED` set,
    /// the VFS calls `lookup_mnt(current_mount, dentry)` to find the
    /// child mount.
    pub mountpoint: DentryRef,

    /// Root dentry of the mounted filesystem. When path resolution
    /// crosses into this mount, it continues from this dentry.
    pub root: DentryRef,

    /// The superblock of the mounted filesystem. Shared across all
    /// mounts of the same filesystem instance (e.g., bind mounts share
    /// the superblock). The superblock holds the filesystem-specific
    /// state and the `FileSystemOps`/`InodeOps`/`FileOps` trait objects.
    pub superblock: Arc<SuperBlock>,

    /// Children of this mount — sub-mounts attached at dentries within
    /// this mount's filesystem. Intrusive doubly-linked list for O(1)
    /// insertion and removal. Protected by the namespace's `mount_lock`
    /// for writes; RCU-protected for reads during path resolution.
    pub children: IntrusiveList<Arc<Mount>>,

    /// Link entry for this mount in its parent's `children` list.
    /// Embedded in the `Mount` node to avoid per-child heap allocation.
    pub child_link: IntrusiveListNode,

    // --- Mount flags ---

    /// Per-mount flags (nosuid, nodev, noexec, readonly, noatime, etc.).
    /// Atomically readable for the path-resolution hot path (no lock
    /// needed to check MNT_READONLY or MNT_NOSUID). Modified only under
    /// `mount_lock` via atomic store with Release ordering.
    pub flags: AtomicU64,

    // --- Propagation ---

    /// Propagation type for this mount (Shared, Private, Slave, Unbindable).
    /// Determines how mount/unmount events are forwarded to related mounts.
    /// Modified only under `mount_lock`.
    pub propagation: PropagationType,

    /// Peer group ID for shared mounts. All mounts in the same peer group
    /// have the same `group_id`. Private and unbindable mounts have
    /// `group_id == 0`. Slave mounts retain the `group_id` of their
    /// former peer group (for /proc/PID/mountinfo optional fields).
    ///
    /// Allocated from the namespace's `group_id_allocator`. Unique within
    /// a namespace.
    pub group_id: u64,

    /// Circular linked list of peer mounts (shared propagation).
    /// All mounts in a peer group are linked through `mnt_share`.
    /// When a mount/unmount event occurs on any peer, it is propagated
    /// to all other peers in the ring. For Private/Unbindable mounts,
    /// this list contains only the mount itself (self-loop).
    pub mnt_share: IntrusiveListNode,

    /// Master mount for slave propagation. When this mount is a slave,
    /// `mnt_master` points to the shared mount from which this mount
    /// receives (but does not send) propagation events.
    /// `None` for shared, private, and unbindable mounts.
    pub mnt_master: Option<Weak<Mount>>,

    /// List head for slave mounts of this mount. When this mount is
    /// shared (or was shared), slave mounts derived from it are linked
    /// through `mnt_slave_list`. Each slave's `mnt_slave` node is an
    /// entry in this list.
    pub mnt_slave_list: IntrusiveList<Arc<Mount>>,

    /// Link entry for this mount in its master's `mnt_slave_list`.
    pub mnt_slave: IntrusiveListNode,

    // --- Namespace membership ---

    /// The mount namespace that owns this mount. `Weak` because the
    /// namespace may be destroyed (all processes exited) while detached
    /// mounts or lazy-unmount remnants still exist.
    pub ns: Weak<MountNamespace>,

    /// Link entry in the namespace's `mount_list`. Used for ordered
    /// iteration (e.g., /proc/PID/mountinfo output, umount ordering).
    pub ns_list_link: IntrusiveListNode,

    // --- Reference counting ---

    /// Active reference count. Incremented when path resolution enters
    /// this mount (ref-walk mode) or when an open file descriptor
    /// references a path within this mount. `umount()` checks this
    /// before removing the mount: if `mnt_count > 0`, the mount is
    /// busy and umount returns `EBUSY` (unless `MNT_DETACH` is used).
    ///
    /// Note: this is separate from the `Arc` reference count. `Arc`
    /// tracks the lifetime of the `Mount` struct itself. `mnt_count`
    /// tracks whether the mount is actively *in use* by path lookups
    /// and open files. A mount can have `mnt_count == 0` (not busy)
    /// while still having `Arc` strong count > 0 (struct not yet freed
    /// because it's still in the hash table or child list).
    pub mnt_count: AtomicU64,

    // --- Mount hash chain ---

    /// Link entry in the mount hash table bucket chain. RCU-protected:
    /// readers traverse the chain under `rcu_read_lock()` without any
    /// lock; writers modify the chain under `mount_lock` and publish
    /// via RCU. Uses intrusive linking for zero-allocation hash insertion.
    pub hash_link: IntrusiveListNode,
}

impl Mount {
    /// Cached parent mount ID, avoids Weak::upgrade() during RCU-walk.
    #[inline]
    pub fn mount_id_of_parent(&self) -> u64 {
        self.parent_mount_id
    }

    /// Inode ID of the mountpoint dentry (the dentry in the parent mount
    /// where this mount is attached). Used as the secondary key in the
    /// mount hash table: lookup is `(parent_mount_id, mountpoint_inode_id)`.
    #[inline]
    pub fn mountpoint_inode(&self) -> InodeId {
        self.mountpoint.inode
    }
}
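The `mnt_count` vs. `Arc` distinction drives the umount() busy check. A hypothetical sketch of just that decision (the real path also sets MNT_DOOMED, removes the mount from the hash table, and walks propagation):

```rust
const EBUSY: i32 = 16;

/// umount() busy decision. `mnt_count_now` is a snapshot of the active
/// reference count taken under mount_lock. With MNT_DETACH the mount is
/// lazily detached regardless of use; without it, any active reference
/// (open file, in-flight lookup) makes the mount busy.
fn umount_allowed(mnt_count_now: u64, mnt_detach: bool) -> Result<(), i32> {
    if mnt_detach {
        // Lazy unmount: detach from the tree now; the Mount node is
        // freed only when the last mnt_count reference drains.
        return Ok(());
    }
    if mnt_count_now > 0 {
        return Err(EBUSY);
    }
    Ok(())
}

fn main() {
    assert_eq!(umount_allowed(3, false), Err(EBUSY)); // busy, no detach
    assert!(umount_allowed(3, true).is_ok());         // MNT_DETACH
    assert!(umount_allowed(0, false).is_ok());        // idle mount
}
```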

/// Reference to a dentry. Wraps the dentry's parent inode ID, name hash,
/// and inode ID; the `(parent_inode, name_hash)` pair uniquely identifies
/// a dentry in the dentry cache (Section 13.1.2). The VFS resolves this
/// to a cached dentry entry on access.
///
/// This avoids holding a direct pointer into the dentry cache (which is
/// RCU-managed and may be evicted), while still providing O(1) lookup via
/// the dentry hash table.
pub struct DentryRef {
    /// Inode ID of the parent directory containing this dentry.
    pub parent_inode: InodeId,
    /// Name hash of this dentry. For filesystems with a custom
    /// `DentryOps::d_hash()` (case-insensitive filesystems), `name_hash`
    /// stores the result of `d_hash()`, not the default hash. For
    /// filesystems without custom hashing, the default SipHash-1-3 of the
    /// name component is used. Used for O(1) dentry cache lookup without
    /// storing the full name.
    pub name_hash: u64,
    /// Inode ID of the dentry itself (for positive dentries).
    pub inode: InodeId,
}
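Constructing a `DentryRef` from a parent inode and a name component can be sketched as follows. This is a hypothetical userspace illustration: the stdlib's `DefaultHasher` stands in for the kernel's keyed SipHash-1-3, and no `d_hash()` override is modeled.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type InodeId = u64;

struct DentryRef {
    parent_inode: InodeId,
    name_hash: u64,
    inode: InodeId,
}

/// Default name hash — stand-in for the kernel's keyed SipHash-1-3.
fn name_hash(name: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    name.hash(&mut h);
    h.finish()
}

fn dentry_ref(parent_inode: InodeId, name: &[u8], inode: InodeId) -> DentryRef {
    DentryRef { parent_inode, name_hash: name_hash(name), inode }
}

fn main() {
    let a = dentry_ref(2, b"etc", 100);
    let b = dentry_ref(2, b"etc", 100);
    // The (parent_inode, name_hash) pair is the cache lookup key.
    assert_eq!((a.parent_inode, a.name_hash), (b.parent_inode, b.name_hash));
    assert_eq!(a.inode, 100);
    // Different name components hash to different buckets (w.h.p.).
    assert_ne!(name_hash(b"etc"), name_hash(b"usr"));
}
```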

14.6.4 Mount Hash Table

/// Per-namespace mount hash table. Maps `(parent_mount_id, mountpoint_dentry)`
/// pairs to child `Mount` nodes. This is the data structure consulted on
/// every mount-point crossing during path resolution.
///
/// **Why per-namespace**: Linux uses a single global `mount_hashtable` with
/// ~2048 buckets, protected by a per-bucket spinlock for writes and RCU for
/// reads. In container-heavy environments (thousands of namespaces, each with
/// 30-100 mounts), this creates false sharing on hash buckets and limits
/// scalability of concurrent mount operations across namespaces. UmkaOS's
/// per-namespace hash table eliminates cross-namespace contention entirely.
///
/// **Sizing**: The hash table is sized to the number of mounts in the
/// namespace, with a minimum of 32 buckets and a maximum of 1024. The table
/// is resized (doubled) when the load factor exceeds 2.0, and shrunk
/// (halved) when the load factor drops below 0.25. Resizing allocates a
/// new bucket array, rehashes under `mount_lock`, and publishes via RCU.
///
/// **Hash function**: SipHash-1-3 of `(parent_mount_id, mountpoint_inode_id)`.
/// The SipHash key is per-namespace, generated from a CSPRNG at namespace
/// creation. This prevents hash-flooding attacks where an adversary crafts
/// mount points that collide in the hash table.
pub struct MountHashTable {
    /// RCU-protected bucket array. Wrapped in `Arc<BucketArray>` because
    /// `RcuCell` requires an atomically-swappable thin pointer — `Box<[T]>`
    /// is a fat pointer (data + length) that cannot be atomically swapped
    /// on any current architecture. `Arc<BucketArray>` is a single thin
    /// pointer that `RcuCell` can swap atomically.
    /// Readers traverse under `rcu_read_lock()`; writers modify under
    /// the namespace's `mount_lock`.
    buckets: RcuCell<Arc<BucketArray>>,

    /// Number of entries in the hash table. Used for load-factor
    /// computation during resize decisions. Modified only under `mount_lock`.
    /// **Bounded**: u32 supports ~4 billion mounts per namespace. Linux's
    /// default `sysctl fs.mount-max` is 100,000; even extreme container
    /// workloads rarely exceed 1 million. u32 is sufficient.
    /// At mount_max=100K, u32 provides ~42,949x headroom. This is a hash
    /// table entry count, not an identifier — the 50-year u64 policy does
    /// not apply.
    count: u32,

    /// SipHash key for this hash table. Per-namespace, generated at
    /// namespace creation from the kernel CSPRNG.
    hash_key: [u64; 2],
}

/// Thin-pointer wrapper for the dynamically-sized bucket array.
/// `Box<[MountHashBucket]>` is a fat pointer (data + length) that cannot be
/// atomically swapped by `RcuCell`. This wrapper provides a thin `Arc` pointer.
// Kernel-internal, not KABI.
struct BucketArray {
    buckets: Box<[MountHashBucket]>,
}

/// A single bucket in the mount hash table. Contains the head pointer
/// of an RCU-protected chain of Mount nodes.
struct MountHashBucket {
    /// Head of the intrusive linked list of Mount nodes hashing to this
    /// bucket. Null if the bucket is empty. Readers follow this chain
    /// under RCU; writers modify under `mount_lock`.
    ///
    /// **Lifecycle**: Hash chain insertion calls `Arc::into_raw()` to obtain
    /// the raw pointer (incrementing the strong count); hash chain removal
    /// under `mount_lock` uses RCU to defer `Arc::from_raw()` (which
    /// decrements the count) until after the grace period. This ensures
    /// RCU readers never access freed memory.
    head: AtomicPtr<Mount>,
}

impl MountHashTable {
    /// Look up a child mount at the given `(parent, dentry)` pair.
    ///
    /// Called during path resolution when a dentry has the `DCACHE_MOUNTED`
    /// flag set. Must be called under `rcu_read_lock()`.
    ///
    /// Returns `Some(&Mount)` if a mount is found at this point, or
    /// `None` if the dentry is not a mount point (stale `DCACHE_MOUNTED`
    /// flag — possible after lazy unmount).
    ///
    /// **Performance**: O(1) expected, O(n) worst-case where n is the
    /// chain length (bounded by load factor < 2.0). No locks, no atomics
    /// beyond the initial `Acquire` load of the bucket head pointer.
    pub fn lookup<'a>(
        &'a self,
        parent_mount_id: u64,
        mountpoint_inode: InodeId,
        _rcu: &'a RcuReadGuard,
    ) -> Option<&'a Mount> {
        let hash = siphash_1_3(
            self.hash_key,
            parent_mount_id,
            mountpoint_inode,
        );
        // Readers must obtain the bucket array pointer and compute
        // bucket_count from the same RCU-protected snapshot to avoid
        // OOB access during a concurrent resize.
        let buckets = self.buckets.read(_rcu);
        let bucket_idx = hash as usize % buckets.buckets.len();
        let bucket = &buckets.buckets[bucket_idx];

        let mut current = bucket.head.load(Ordering::Acquire);
        while !current.is_null() {
            // SAFETY: `current` is a valid Mount pointer within an RCU
            // read-side critical section. The Mount node is not freed
            // until after the RCU grace period.
            let mnt = unsafe { &*current };
            if mnt.mount_id_of_parent() == parent_mount_id
                && mnt.mountpoint_inode() == mountpoint_inode
                && !mnt.is_doomed()
            {
                return Some(mnt);
            }
            current = mnt.hash_link.next.load(Ordering::Acquire);
        }
        None
    }

    /// Transition from RCU-protected `&Mount` to a long-lived reference.
    ///
    /// **Ref-walk mode**: After `lookup()` returns `Some(&Mount)`, the
    /// caller must increment `mnt_count` before dropping the `RcuReadGuard`:
    /// ```
    /// let rcu = rcu_read_lock();
    /// if let Some(mnt) = mount_hash.lookup(parent_id, ino, &rcu) {
    ///     mnt.mnt_count.fetch_add(1, Acquire);
    ///     drop(rcu);
    ///     // `mnt` is now safe to use without RCU protection.
    ///     // Caller must call mnt.mnt_count.fetch_sub(1, Release)
    ///     // when the reference is no longer needed.
    /// }
    /// ```
    ///
    /// **RCU-walk mode**: The caller stays within the RCU critical section
    /// for the entire path resolution and never increments `mnt_count`.
    /// If RCU-walk fails (e.g., dentry seqlock mismatch), the path
    /// resolution restarts in ref-walk mode.
    ///
    /// The `Acquire` on `fetch_add` pairs with the `Release` on
    /// `fetch_sub` to ensure visibility of all mount state modifications
    /// made before the reference was taken.
    pub fn get_counted_ref(mnt: &Mount) {
        mnt.mnt_count.fetch_add(1, Ordering::Acquire);
    }
}
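The sizing policy described for `MountHashTable` (grow above load factor 2.0, shrink below 0.25, clamped to [32, 1024] buckets) can be expressed as a pure function. A hypothetical sketch — the names are illustrative, and the real resize additionally rehashes under `mount_lock` and publishes the new bucket array via RCU:

```rust
const MIN_BUCKETS: usize = 32;
const MAX_BUCKETS: usize = 1024;

/// Decide the new bucket count from the load factor. Returns None when
/// no resize is needed (including when already at the clamp boundary).
fn resize_target(entries: u32, buckets: usize) -> Option<usize> {
    let load = entries as f64 / buckets as f64;
    let new = if load > 2.0 {
        (buckets * 2).min(MAX_BUCKETS)   // double on growth
    } else if load < 0.25 {
        (buckets / 2).max(MIN_BUCKETS)   // halve on shrink
    } else {
        return None;
    };
    if new == buckets { None } else { Some(new) }
}

fn main() {
    assert_eq!(resize_target(100, 32), Some(64)); // load 3.1 -> double
    assert_eq!(resize_target(4, 64), Some(32));   // load 0.06 -> halve
    assert_eq!(resize_target(40, 32), None);      // load 1.25 -> stable
    assert_eq!(resize_target(5000, 1024), None);  // capped at MAX
}
```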

14.6.5 Mount Namespace

/// A mount namespace. Contains an independent mount tree with its own root
/// mount, hash table, and mount list. Created by `clone(CLONE_NEWNS)` or
/// `unshare(CLONE_NEWNS)`.
///
/// The `vfs_root: Capability<VfsNode>` field in `NamespaceSet` (Section 17.1.2)
/// is replaced by a reference to this namespace (whose `root` field yields
/// the root mount):
///
/// ```rust
/// // Updated NamespaceSet field (replaces the previous Capability<VfsNode>):
/// pub mount_ns: Arc<MountNamespace>,
/// ```
///
/// **Relationship to NamespaceSet**: Each task's `NamespaceSet` holds
/// an `Arc<MountNamespace>`. Multiple tasks in the same mount namespace
/// share the same `Arc<MountNamespace>`. When `clone(CLONE_NEWNS)` is called,
/// a new `MountNamespace` is created by cloning the parent's mount tree
/// (via `copy_tree()`).
pub struct MountNamespace {
    /// Unique namespace identifier. Used for `/proc/PID/ns/mnt` inode
    /// number and `setns()` namespace comparison.
    pub ns_id: u64,

    /// Root mount of this namespace's mount tree. This is the mount
    /// that corresponds to "/" for all processes in this namespace.
    /// Updated atomically by `pivot_root()`.
    pub root: RcuCell<Arc<Mount>>,

    /// Ordered list of all mounts in this namespace. The ordering is
    /// topological: parent mounts appear before their children. This
    /// ordering is used by:
    /// - `/proc/PID/mountinfo`: output follows this order
    /// - `umount -a`: unmounts in reverse order (leaves before parents)
    /// - Namespace teardown: unmounts in reverse topological order
    pub mount_list: IntrusiveList<Arc<Mount>>,

    /// Number of mounts in this namespace. Used to enforce the
    /// per-namespace mount count limit (default: 100,000 — matching
    /// Linux's `sysctl fs.mount-max`). Prevents mount-storm DoS attacks
    /// where a compromised container creates millions of mounts.
    /// Current-state count bounded by mount_max (~100K). u64 used for
    /// consistency with other AtomicU64 counters in the namespace; u32
    /// would suffice. On ILP32 architectures (ARMv7, PPC32), AtomicU64
    /// requires a CAS loop (no native 64-bit atomics), adding ~5-10ns
    /// per increment. Acceptable: mount/unmount is a warm path.
    pub mount_count: AtomicU64,

    /// Event counter. Incremented on every mount/unmount/remount
    /// operation. Used by `poll()` on `/proc/PID/mountinfo` to detect
    /// mount tree changes. Container runtimes and systemd use this
    /// to react to mount events without periodic scanning.
    pub event_seq: AtomicU64,

    /// Per-namespace mount hash table. Maps `(parent_mount, dentry)` to
    /// child mount for path resolution mount-point crossings.
    pub hash_table: MountHashTable,

    /// Mutex serializing mount tree modifications (mount, unmount,
    /// remount, pivot_root, bind mount, move mount). Readers (path
    /// resolution) do not acquire this lock — they use RCU.
    /// Lock hierarchy level 20 (MOUNT_LOCK): above DENTRY_LOCK (19),
    /// below EVM_LOCK (22). See [Section 3.5](03-concurrency.md#locking-strategy--lock-hierarchy-summary).
    pub mount_lock: Mutex<()>,

    /// Mount ID allocator. Monotonically increasing 64-bit counter.
    /// IDs are never reused within a namespace. At 1 mount/second
    /// sustained, a 64-bit counter would not wrap for ~584 billion years.
    pub id_allocator: AtomicU64,

    /// Peer group ID allocator. Like mount IDs, monotonically increasing
    /// and never reused. Separate from mount IDs because group IDs are
    /// shared across mounts and have a different lifecycle.
    pub group_id_allocator: AtomicU64,

    /// User namespace that owns this mount namespace. Determines
    /// capability checks for mount operations. A process must have
    /// `CAP_MOUNT` in this user namespace (or an ancestor) to modify
    /// the mount tree.
    pub user_ns: Arc<UserNamespace>,
}
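The `mount_count` limit check can be sketched as follows. This is a minimal model, not the kernel implementation: `MOUNT_MAX` mirrors the default described above, the errno value is illustrative, and the function assumes `mount_lock` is held as the field documentation requires.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Default per-namespace limit, matching `sysctl fs.mount-max`.
const MOUNT_MAX: u64 = 100_000;
const ENOSPC: i32 = 28;

/// Charge one mount against the namespace counter. Called with mount_lock
/// held, so the load + fetch_add pair cannot race with other writers;
/// Relaxed ordering suffices because the lock provides the synchronization.
fn charge_mount(mount_count: &AtomicU64) -> Result<(), i32> {
    if mount_count.load(Relaxed) >= MOUNT_MAX {
        return Err(ENOSPC); // mount-storm DoS protection
    }
    mount_count.fetch_add(1, Relaxed);
    Ok(())
}
```

The matching decrement on unmount needs no limit check and is likewise serialized by `mount_lock`.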

14.6.6 DCACHE_MOUNTED Integration

The dentry cache (Section 14.1) must track which dentries are mount points. When a filesystem is mounted at a dentry, the VFS sets the DCACHE_MOUNTED flag on that dentry. During path resolution (Section 14.1), when the VFS encounters a dentry with DCACHE_MOUNTED set, it calls mnt_ns.hash_table.lookup() (where mnt_ns is the current task's mount namespace) to find the child mount and continues resolution from the child mount's root dentry.

/// Dentry cache entry flags. Stored in the dentry's `flags: AtomicU32` field.
/// Extended to include DCACHE_MOUNTED for mount-point detection.
bitflags! {
    #[repr(transparent)]
    pub struct DcacheFlags: u32 {
        /// This dentry is a mount point — a filesystem is mounted on it.
        /// Set by `do_mount()` when attaching a mount. Cleared by
        /// `do_umount()` when the last mount at this dentry is removed.
        ///
        /// Path resolution checks this flag on every path component.
        /// When set, `mnt_ns.hash_table.lookup(current_mount.mount_id, dentry)`
        /// is called to find the child mount. This check is a single atomic
        /// load (~1 cycle) — the flag exists specifically to avoid a hash
        /// table lookup on every path component (only mount points need
        /// the lookup).
        const DCACHE_MOUNTED       = 1 << 0;

        /// Dentry has been disconnected from the tree (e.g., NFS stale
        /// handle, deleted directory that is still open).
        const DCACHE_DISCONNECTED  = 1 << 1;

        /// Dentry is a negative dentry (caches a failed lookup).
        const DCACHE_NEGATIVE      = 1 << 2;

        /// Dentry has filesystem-specific operations (d_revalidate, etc.).
        const DCACHE_OP_MASK       = 1 << 3;
    }
}
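The mount-crossing fast path described above can be modeled in miniature. In this sketch a `HashMap` stands in for the per-namespace mount hash table and plain indices stand in for dentry and mount pointers; all names are illustrative.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering::Acquire};

const DCACHE_MOUNTED: u32 = 1 << 0;

struct Dentry {
    flags: AtomicU32,
}

/// After resolving a path component to `dentry_idx` under `mount_id`,
/// follow stacked mounts until no mount covers the current dentry.
/// The flag test is a single atomic load; the hash lookup runs only
/// for actual mount points.
fn cross_mounts(
    hash: &HashMap<(u64, usize), (u64, usize)>, // (mount, dentry) -> (child mount, child root)
    dentries: &[Dentry],
    mut mount_id: u64,
    mut dentry_idx: usize,
) -> (u64, usize) {
    while dentries[dentry_idx].flags.load(Acquire) & DCACHE_MOUNTED != 0 {
        match hash.get(&(mount_id, dentry_idx)) {
            Some(&(child_mount, child_root)) => {
                mount_id = child_mount;
                dentry_idx = child_root; // resolution continues from the child's root
            }
            // Flag set but no mount in this namespace: stop following.
            None => break,
        }
    }
    (mount_id, dentry_idx)
}
```

The loop handles stacked mounts (a mount whose root dentry is itself a mount point) by iterating until the flag test fails.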

14.6.7 Filesystem Context (New Mount API)

The new mount API (Linux 5.2+, used increasingly by container runtimes and systemd) separates mount operations into discrete steps: context creation, configuration, superblock creation, and attachment. This provides better error reporting (errors at each step, not a single mount(2) errno) and supports atomic mount configuration changes.

/// Maximum mount options (key-value pairs) across `options` and
/// `binary_options` combined. `fsconfig()` returns `ENOSPC` when
/// `options.len() + binary_options.len() >= FS_CONTEXT_MAX_OPTIONS`.
pub const FS_CONTEXT_MAX_OPTIONS: usize = 256;

/// Maximum error log size in bytes (matches Linux `FC_LOG_SIZE`).
/// The `fc_log_write()` function checks
/// `log.len() + msg.len() <= FC_LOG_SIZE` before appending; excess
/// bytes are silently truncated.
pub const FC_LOG_SIZE: usize = 4096;

/// Filesystem context for the new mount API.
///
/// Created by `fsopen()`, configured by `fsconfig()`, and consumed by
/// `fsmount()`. The context holds all the state needed to create a new
/// superblock and mount, accumulated through multiple `fsconfig()` calls.
///
/// This is equivalent to Linux's `struct fs_context`.
///
/// **Lifetime**: The context is reference-counted via a file descriptor
/// returned by `fsopen()`. It is destroyed when the file descriptor is
/// closed. If `fsmount()` has not been called, the context is simply
/// freed (no mount created). If `fsmount()` was called, the context's
/// state has been consumed and the mount exists independently.
pub struct FsContext {
    /// Filesystem type (e.g., "ext4", "tmpfs", "overlay"). Set at
    /// `fsopen()` time and immutable thereafter.
    pub fs_type: Arc<dyn FileSystemOps>,

    /// Filesystem type name (for diagnostics and /proc/mounts).
    pub fs_type_name: Box<[u8]>,

    /// Source device or path (equivalent to mount(2) `source` parameter).
    /// Set via `fsconfig(FSCONFIG_SET_STRING, "source", ...)`.
    pub source: Option<Box<[u8]>>,

    /// Accumulated mount options as key-value pairs. Each `fsconfig()`
    /// call adds or modifies an entry. The filesystem driver validates
    /// options at `fsconfig(FSCONFIG_CMD_CREATE)` time.
    /// Bounded by FS_CONTEXT_MAX_OPTIONS (256 total across `options` and
    /// `binary_options`). `fsconfig()` returns `ENOSPC` when the combined
    /// count reaches the limit. Cold-path allocation (mount/remount only).
    pub options: Vec<(Box<[u8]>, Box<[u8]>)>,

    /// Binary data options (for filesystems that accept binary mount data).
    /// Set via `fsconfig(FSCONFIG_SET_BINARY, ...)`.
    /// Shares the `FS_CONTEXT_MAX_OPTIONS` limit with `options`.
    pub binary_options: Vec<(Box<[u8]>, Box<[u8]>)>,

    /// Mount flags to apply to the created mount.
    pub mount_flags: MountFlags,

    /// The created superblock. Set by `fsconfig(FSCONFIG_CMD_CREATE)`,
    /// consumed by `fsmount()`.
    pub superblock: Option<Arc<SuperBlock>>,

    /// Error log. Filesystem drivers write diagnostic messages here
    /// during context creation and configuration. Readable by userspace
    /// via `read()` on the fscontext file descriptor.
    /// Bounded to FC_LOG_SIZE (4096) bytes. Truncated silently when full.
    /// Cold-path allocation (mount error reporting only).
    pub log: Vec<u8>,

    /// Purpose of this context: new mount, reconfiguration, or submount.
    pub purpose: FsContextPurpose,

    /// Lifecycle state of this context. Transitions: New → Configured →
    /// Consumed (by `fsmount()`). Further `fsconfig()` calls on a Consumed
    /// context return `EBUSY`.
    pub state: FsContextState,

    /// User namespace for permission checks. Set at `fsopen()` time
    /// to the caller's user namespace.
    pub user_ns: Arc<UserNamespace>,
}

/// Purpose of a filesystem context, controlling which operations are valid.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum FsContextPurpose {
    /// Creating a new mount (from `fsopen()`).
    NewMount = 0,
    /// Reconfiguring an existing mount (from `fspick()`).
    Reconfig = 1,
    /// Internal: creating a submount (e.g., automount).
    Submount = 2,
}

/// Lifecycle state of an `FsContext`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum FsContextState {
    /// Freshly created by `fsopen()` or `fspick()`. Accepting `fsconfig()` calls.
    New         = 0,
    /// Options have been set via `fsconfig()`, but `FSCONFIG_CMD_CREATE`/
    /// `FSCONFIG_CMD_RECONFIGURE` has not yet been called.
    Configuring = 1,
    /// `FSCONFIG_CMD_CREATE` succeeded; superblock is ready. Awaiting `fsmount()`.
    Created     = 2,
    /// `fsmount()` has consumed the superblock. The fsopen fd is still open
    /// for error log retrieval but cannot create another mount.
    Consumed    = 3,
    /// An error occurred during creation. The error log is readable.
    /// Further `fsconfig()` calls return `EBUSY`.
    Failed      = 4,
}
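The transition rules implied by these enums can be sketched as a pair of state checks. Errno values are illustrative; the rule that a `Created` context rejects further `fsconfig()` option calls is an assumption (the text above only pins down `Consumed` and `Failed`).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FsContextState { New, Configuring, Created, Consumed, Failed }

const EBUSY: i32 = 16;
const EINVAL: i32 = 22;

/// State check for fsconfig(FSCONFIG_SET_*): only pre-creation states
/// accept new options; the first call moves New -> Configuring.
fn fsconfig_set(state: FsContextState) -> Result<FsContextState, i32> {
    use FsContextState::*;
    match state {
        New | Configuring => Ok(Configuring),
        Created | Consumed | Failed => Err(EBUSY),
    }
}

/// State check for fsmount(): only a Created context holds a superblock
/// ready to be consumed.
fn fsmount(state: FsContextState) -> Result<FsContextState, i32> {
    use FsContextState::*;
    match state {
        Created => Ok(Consumed),
        Consumed => Err(EBUSY),
        New | Configuring | Failed => Err(EINVAL),
    }
}
```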

14.6.7.1 FsContext Lifecycle and Error Channel

The new mount API separates mount configuration into discrete, verifiable steps. Each step either advances the context state or returns a structured error. The full lifecycle:

Step 1: fd = fsopen("ext4", FSOPEN_CLOEXEC)
  → Validates "ext4" against the filesystem type registry.
  → Allocates FsContext { fs_type: ext4_ops, purpose: NewMount, state: New, ... }.
  → Returns an O_RDWR file descriptor backed by the FsContext.
  → FsContext state: New.

Step 2: fsconfig(fd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0)
        fsconfig(fd, FSCONFIG_SET_STRING, "errors",  "remount-ro",  0)
        fsconfig(fd, FSCONFIG_SET_FLAG,   "noatime", NULL,          0)
  → Each call appends to FsContext.options: [("source", "/dev/sda1"), ("errors", "remount-ro"), ...].
  → Returns 0 on success; EINVAL if the key is not recognized by the filesystem type.
  → First `fsconfig()` call transitions state: New → Configuring.
  → FsContext state: Configuring (still accumulating options).

Step 3: fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
  → Calls FileSystemOps::mount(source, flags, options) on the configured filesystem type.
  → On success: FsContext.superblock = Some(sb); state → Created.
  → On failure: diagnostic message is written to FsContext.log; state → Failed.
    Caller can read the error via read(fd, buf, len) — see Error Channel below.
  → Returns 0 on success; -errno on failure.

Step 4: mnt_fd = fsmount(fd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOATIME)
  → Consumes FsContext.superblock (state must be Created; returns EBUSY if
    Consumed, EINVAL if New, Configuring, or Failed).
  → Allocates a `Mount` node with the MNT_DETACHED flag set.
  → Returns an O_PATH fd referencing the detached mount.
  → FsContext state: Consumed (further fsconfig/fsmount calls return EBUSY).

Step 5: move_mount(mnt_fd, "", AT_FDCWD, "/mnt/data", MOVE_MOUNT_F_EMPTY_PATH)
  → Attaches the detached mount to the namespace mount tree at /mnt/data.
  → Clears MNT_DETACHED from the `Mount` node.
  → Triggers mount propagation to peer/slave mounts (Section 14.6.10).

open_tree(2) — clone or open a mount:

fd = open_tree(dirfd, path, OPEN_TREE_CLONE | AT_RECURSIVE)
  → Resolves path to a mount.
  → OPEN_TREE_CLONE: creates a detached copy of the mount tree rooted at path,
    identical to a recursive bind mount but without modifying the namespace.
    AT_RECURSIVE: the clone includes all submounts below path.
  → The returned O_PATH fd can be passed to move_mount() to attach elsewhere.
  → Without OPEN_TREE_CLONE: returns an O_PATH fd referencing the existing mount
    without cloning (useful for passing a mount reference across namespaces).

mount_setattr(2) — bulk-modify mount tree flags:

mount_setattr(dirfd, path, AT_RECURSIVE, &mount_attr { attr_set, attr_clr }, sizeof)
  → Resolves path to a mount.
  → AT_RECURSIVE: applies to all mounts in the subtree rooted at path.
  → attr_clr: clears these flags from each mount (applied first).
  → attr_set: sets these flags on each mount (applied after attr_clr).
  → The operation is atomic within the subtree: if validation fails for any mount
    (e.g., clearing MNT_READONLY on a superblock-level read-only filesystem), no
    flags are changed on any mount.
  → Requires CAP_MOUNT.

FsContext Error Channel:

When fsconfig(FSCONFIG_CMD_CREATE) or fsmount() encounters a filesystem-level error (e.g., superblock checksum mismatch, missing required option, device I/O error), the error is not conveyed solely via errno. The filesystem driver writes a human-readable diagnostic string to FsContext.log. The caller retrieves it via read(fd, buf, len) on the FsContext file descriptor:

read(fs_context_fd, buf, len):
  if FsContext.log is empty: return 0 (EOF — no error message pending)
  n = min(len, FsContext.log.len())
  copy_to_user(buf, FsContext.log[..n])
  FsContext.log.drain(..n)
  return n

Example error message (readable by system administrators):

ext4: superblock checksum mismatch at block 0: expected 0xdeadbeef, got 0xcafebabe

This approach is superior to the traditional single-errno response: it gives system administrators and container runtimes actionable diagnostic information without requiring a separate diagnostics ioctl or /proc file.
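A minimal sketch of the truncating writer and draining reader described above, with a plain slice copy standing in for the user-space copy; the type and method names are illustrative.

```rust
const FC_LOG_SIZE: usize = 4096;

struct FcLog {
    log: Vec<u8>,
}

impl FcLog {
    /// Append a diagnostic message; bytes beyond FC_LOG_SIZE are
    /// silently truncated, as described for `fc_log_write()`.
    fn write(&mut self, msg: &[u8]) {
        let room = FC_LOG_SIZE - self.log.len();
        let n = msg.len().min(room);
        self.log.extend_from_slice(&msg[..n]);
    }

    /// read(fd, buf, len) on the fscontext fd: drain up to `buf.len()`
    /// bytes; a return of 0 means no error message is pending (EOF).
    fn read(&mut self, buf: &mut [u8]) -> usize {
        let n = buf.len().min(self.log.len());
        buf[..n].copy_from_slice(&self.log[..n]);
        self.log.drain(..n);
        n
    }
}
```

Because `read()` drains what it returns, a short userspace buffer retrieves the message across multiple calls until EOF.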

14.6.8 Mount Attribute Structure (mount_setattr)

/// User-visible mount attribute structure for `mount_setattr(2)`.
/// Matches Linux's `struct mount_attr` exactly for ABI compatibility.
///
/// `mount_setattr()` atomically modifies mount properties on a single
/// mount or recursively on an entire mount tree (when `AT_RECURSIVE`
/// is passed). Container runtimes use this for recursive read-only
/// mounts (`MOUNT_ATTR_RDONLY` + `AT_RECURSIVE`).
#[repr(C)]
pub struct MountAttr {
    /// Flags to set on the mount(s). Bits correspond to `MOUNT_ATTR_*`
    /// constants. Applied after `attr_clr` (clear first, then set).
    pub attr_set: u64,

    /// Flags to clear from the mount(s). Applied before `attr_set`.
    pub attr_clr: u64,

    /// Propagation type to set. One of `MS_SHARED`, `MS_PRIVATE`,
    /// `MS_SLAVE`, `MS_UNBINDABLE`, or 0 (no change). Only one
    /// propagation flag may be set; combining them returns `EINVAL`.
    /// The mount_setattr handler validates that `attr.propagation` is
    /// exactly one of these values; any other value returns `EINVAL`.
    pub propagation: u64,

    /// File descriptor of the user namespace to associate with the
    /// mount (for ID-mapped mounts). Set to 0 if not changing the
    /// mount's user namespace mapping.
    pub userns_fd: u64,
}
// Layout: 4 × u64 = 32 bytes.
const_assert!(size_of::<MountAttr>() == 32);

/// MOUNT_ATTR_* flag constants for mount_setattr(2).
/// These map to MountFlags but use a separate constant space matching
/// Linux's UAPI.
pub const MOUNT_ATTR_RDONLY: u64      = 0x00000001;
pub const MOUNT_ATTR_NOSUID: u64      = 0x00000002;
pub const MOUNT_ATTR_NODEV: u64       = 0x00000004;
pub const MOUNT_ATTR_NOEXEC: u64      = 0x00000008;
pub const MOUNT_ATTR_NOATIME: u64     = 0x00000010;
pub const MOUNT_ATTR_STRICTATIME: u64 = 0x00000020;
pub const MOUNT_ATTR_NODIRATIME: u64  = 0x00000080;
pub const MOUNT_ATTR_NOSYMFOLLOW: u64 = 0x00200000;
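The clear-first-then-set rule, together with the all-or-nothing subtree semantics described for `mount_setattr(2)`, can be sketched as a validate-then-commit pass. The superblock-read-only rule and the errno value are illustrative stand-ins for the full validation.

```rust
const MOUNT_ATTR_RDONLY: u64 = 0x00000001;
const EPERM: i32 = 1;

/// attr_clr is applied first, then attr_set, per mount_setattr(2).
fn apply_attr(flags: u64, attr_set: u64, attr_clr: u64) -> u64 {
    (flags & !attr_clr) | attr_set
}

/// Apply attributes to every mount in a subtree, atomically: a first
/// pass validates all mounts; flags change only if every mount passes.
fn setattr_subtree(
    flags: &mut [u64],
    sb_readonly: &[bool], // superblock-level read-only, per mount
    attr_set: u64,
    attr_clr: u64,
) -> Result<(), i32> {
    for (i, &f) in flags.iter().enumerate() {
        let new = apply_attr(f, attr_set, attr_clr);
        // Cannot make a mount read-write when its superblock is read-only.
        if sb_readonly[i] && new & MOUNT_ATTR_RDONLY == 0 {
            return Err(EPERM); // no mount has been modified yet
        }
    }
    for f in flags.iter_mut() {
        *f = apply_attr(*f, attr_set, attr_clr);
    }
    Ok(())
}
```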

14.6.9 Mount Operations — Algorithms

All mount tree modification algorithms require holding the namespace's mount_lock (lock hierarchy level 20, Section 3.5). Path resolution (read path) uses only RCU and never acquires mount_lock. The algorithms below describe the kernel-internal implementation; the syscall entry points (mount(2), umount2(2), and the new mount API) perform argument validation and capability checks before calling these internal functions.

14.6.9.1 do_mount — Mount a Filesystem

do_mount(source, target_path, fs_type, flags, data) -> Result<()>

  0a. Capability check: verify caller holds CAP_MOUNT ([Section 9.1](09-security.md#capability-based-foundation))
      in the target mount namespace. Return EPERM if not held.
  0b. LSM hook: `lsm_call_superblock_security(Mount, cred, sb, &SbOpContext { ... })`.
      If the LSM denies the mount request, return EPERM. This hook fires
      before any path resolution to allow early rejection of unauthorized
      mount operations (e.g., SELinux `mount` permission check against the
      caller's security context and the target path label).
  0c. Cgroup device controller check. If the calling task's cgroup has a device
      controller with `BPF_CGROUP_DEVICE` program attached: call
      `cgroup_bpf_run(BPF_CGROUP_DEVICE, &DeviceAccessCtx { dev: source_dev,
      access: BLK_OPEN_READ | BLK_OPEN_WRITE })`. If denied, return EPERM.
      This check runs before filesystem lookup because the source device may
      not be accessible to the container.

  1. Resolve `target_path` to (mount, dentry) via path resolution (Section 14.1.3).
  2. If `flags` contains MS_REMOUNT, delegate to do_remount() (Section 14.6.9.4).
  3. If `flags` contains MS_BIND, delegate to do_bind_mount() (Section 14.6.9.5).
  4. If `flags` contains MS_MOVE, delegate to do_move_mount() (Section 14.6.9.6).
  5. If `flags` contains MS_SHARED|MS_PRIVATE|MS_SLAVE|MS_UNBINDABLE,
     delegate to do_change_propagation() (Section 14.6.9.7).
  6. Otherwise, this is a new filesystem mount:
     a. Look up the filesystem type by name in the filesystem registry.
        If not registered, return ENODEV.
     b. Call `FileSystemOps::mount(source, flags, data)` on the filesystem
        driver. This creates and returns a `SuperBlock`. On failure, return
        the error from the driver.
     b2. LSM hook: `lsm_call_superblock_security(superblock, source, flags, data)`.
         If the LSM denies the mount (returns non-zero), drop the superblock
         and return EPERM. This hook allows SELinux/AppArmor to enforce mount
         restrictions based on the filesystem type, source device, and
         mount options.
     c. Check namespace mount count against `mount_max` limit. If exceeded,
        drop the superblock and return ENOSPC.
     d. Allocate a new `Mount` node:
        - `mount_id` from `namespace.id_allocator.fetch_add(1)`
        - `parent` = resolved mount from step 1
        - `mountpoint` = resolved dentry from step 1
        - `root` = superblock's root dentry
        - `superblock` = the SuperBlock from step 6b
        - `flags` = translate MS_* to MountFlags
        - `propagation` = Private (default for new mounts)
        - `group_id` = 0 (private mount has no peer group)
        - `mnt_count` = 0
     e. Acquire `mount_lock`.
     f. Increment the mountpoint dentry's mount refcount and set the
        `DCACHE_MOUNTED` flag:
        ```
        mountpoint_dentry.d_mount_refcount.fetch_add(1, Relaxed);
        mountpoint_dentry.d_flags.fetch_or(DCACHE_MOUNTED, Release);
        ```
        The refcount increment uses Relaxed ordering because it is
        protected by `mount_lock` (held since step 6e). The flag set
        uses Release so RCU readers see it on path resolution.
        `d_mount_refcount` tracks how many mounts reference this dentry
        as their mountpoint (see [Section 14.1](#virtual-filesystem-layer)); it is
        decremented by `do_umount()` and only when it reaches zero is
        `DCACHE_MOUNTED` cleared. This avoids a lock ordering violation:
        `mount_lock` (level 20) is held, and acquiring `d_lock`
        (level 19) would violate the lower-first ordering rule.
        **Writer-writer safety**: both mount and umount hold the
        namespace's `mount_lock` before modifying `DCACHE_MOUNTED`.
        The atomic `fetch_or`/`fetch_and` are for reader visibility
        under RCU, not for writer synchronization.
     g. Insert the Mount into the mount hash table at
        bucket(parent_mount_id, mountpoint_inode_id).
     h. Add the Mount to the parent's `children` list.
     i. Add the Mount to the namespace's `mount_list` (after its parent
        in topological order).
     j. Increment `namespace.mount_count`.
     k. Propagate: if the parent mount is shared, call
        `propagate_mount()` (Section 14.6.10.1) to replicate this mount
        on all peers and slaves of the parent.
     l. Increment `namespace.event_seq`.
     m. Release `mount_lock`.
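Step 6f and its counterpart, do_umount step 9, form a small protocol on the mountpoint dentry. Isolated from the surrounding tree manipulation, it can be sketched as follows; both functions assume `mount_lock` is held, as the algorithms require.

```rust
use std::sync::atomic::{AtomicU32, Ordering::{AcqRel, Relaxed, Release}};

const DCACHE_MOUNTED: u32 = 1 << 0;

struct Dentry {
    d_flags: AtomicU32,
    d_mount_refcount: AtomicU32,
}

/// do_mount step 6f: count one more mount stacked on this dentry and
/// make the flag visible to RCU readers.
fn attach_mountpoint(d: &Dentry) {
    d.d_mount_refcount.fetch_add(1, Relaxed); // writer-serialized by mount_lock
    d.d_flags.fetch_or(DCACHE_MOUNTED, Release);
}

/// do_umount step 9: clear DCACHE_MOUNTED only when the last stacked
/// mount (cross-namespace or bind) is removed.
fn detach_mountpoint(d: &Dentry) {
    if d.d_mount_refcount.fetch_sub(1, AcqRel) == 1 {
        d.d_flags.fetch_and(!DCACHE_MOUNTED, Release);
    }
}
```

The refcount, not the flag, is the source of truth: two mounts stacked on one dentry survive the removal of either one.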

14.6.9.2 do_umount — Unmount a Filesystem

do_umount(target_mount, flags) -> Result<()>

  Capability check: CAP_MOUNT in caller's mount namespace.

  1. If `target_mount` is the namespace root and flags does not contain
     MNT_DETACH, return EBUSY (cannot unmount root).
  2. If `target_mount.flags` has MNT_LOCKED and the caller lacks
     CAP_SYS_ADMIN in the mount's owning user namespace, return EPERM.
  3. If `flags` does not contain MNT_DETACH (not lazy):
     a. Check `target_mount.mnt_count`. If > 0, return EBUSY.
     b. Check that `target_mount.children` is empty. If not, return EBUSY
        (sub-mounts must be unmounted first, unless MNT_DETACH is used).
  4. If `flags` contains MNT_FORCE:
     a. Call `FileSystemOps::force_umount()` if the filesystem supports it.
        This causes in-flight I/O to fail with EIO. NFS uses this for stale
        server recovery.
  5. Acquire `mount_lock`.
  6. Set `MNT_DOOMED` on `target_mount.flags` (atomic OR).
     This prevents new path lookups from entering the mount.
  7. Remove `target_mount` from the mount hash table.
  8. Remove `target_mount` from the parent's `children` list.
  9. Decrement the mountpoint dentry's mount refcount and conditionally
     clear `DCACHE_MOUNTED`:
     ```
     let mountpoint_dentry = target_mount.mountpoint;
     if mountpoint_dentry.d_mount_refcount.fetch_sub(1, AcqRel) == 1 {
         mountpoint_dentry.d_flags.fetch_and(!DCACHE_MOUNTED, Release);
     }
     ```
     AcqRel on the decrement: Acquire ensures we see all prior increments
     from other mount operations; Release ensures the flag clear is visible
     to RCU readers only after the refcount reaches zero. Multiple mounts
     can be stacked on the same dentry (cross-namespace or bind mounts);
     `DCACHE_MOUNTED` is cleared only when the last one is removed.
  10. Propagate: if the parent mount is shared, call `propagate_umount()`
      (Section 14.6.10.2) to remove corresponding mounts from peers and slaves.
  11. Remove from `namespace.mount_list`.
  12. Decrement `namespace.mount_count`.
  13. Increment `namespace.event_seq`.
  14. Release `mount_lock`.
  15. If `flags` contains MNT_DETACH (lazy unmount):
      a. The mount is now disconnected from the tree but may still be
         referenced by open file descriptors (mnt_count > 0). It will be
         fully freed when the last reference is dropped.
      b. Open files continue to work on the disconnected mount. New path
         lookups cannot reach it.
  16. If not lazy: call `FileSystemOps::unmount()` synchronously.
      If lazy: schedule `FileSystemOps::unmount()` to run when `mnt_count`
      drops to 0 (via a callback registered on the final `Arc::drop`).

14.6.9.3 do_umount_tree — Recursive Unmount

do_umount_tree(root_mount, flags) -> Result<()>

  Used by MNT_DETACH on a mount with sub-mounts, and by namespace teardown.

  1. Acquire `mount_lock`.
  2. Collect all mounts in the subtree rooted at `root_mount` by traversing
     `root_mount.children` recursively. Collect in reverse topological order
     (leaves first, root last).
  3. For each mount in the collected list:
     a. Set MNT_DOOMED.
     b. Remove from hash table.
     c. Remove from parent's children list.
     d. Decrement the mountpoint dentry's `d_mount_refcount` and
        conditionally clear `DCACHE_MOUNTED` (same protocol as `do_umount`
        step 9: `d_mount_refcount.fetch_sub(1, AcqRel)`; clear flag only
        when refcount reaches 0).
     e. Remove from namespace.mount_list.
     f. Decrement namespace.mount_count.
  4. Propagate umount for each removed mount.
  5. Increment namespace.event_seq.
  6. Release `mount_lock`.
  7. For each collected mount: schedule filesystem unmount (immediate
     if mnt_count == 0, deferred if lazy).
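Step 2's leaves-first collection can be sketched without recursion, in the same stack-safe style the propagation walker uses. Indices stand in for `Arc<Mount>` and an adjacency list for the per-mount `children` lists.

```rust
/// Collect the subtree rooted at `root` in reverse topological order
/// (every mount appears before its parent). A pre-order walk with an
/// explicit stack visits each parent before its descendants; reversing
/// that order puts leaves first.
fn collect_reverse_topological(root: usize, children: &[Vec<usize>]) -> Vec<usize> {
    let mut pre = Vec::new();
    let mut stack = vec![root];
    while let Some(m) = stack.pop() {
        pre.push(m);
        stack.extend(children[m].iter().copied());
    }
    pre.reverse();
    pre
}
```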

14.6.9.4 do_remount — Change Mount Flags/Options

do_remount(target_mount, flags, data) -> Result<()>

  Capability check: CAP_MOUNT in caller's mount namespace.

  1. Translate new `flags` to `MountFlags`.
  2. Extract per-superblock options from `data`.
  3. **RW→RO transition: flush dirty pages before flag change.**
     If the remount transitions from read-write to read-only
     (`!(old_flags & MS_RDONLY) && (new_flags & MS_RDONLY)`):
     a. Call `sync_filesystem(sb)` to flush all dirty pages and metadata.
        This triggers `writeback_inodes_sb(sb, WB_SYNC_ALL)` which
        writes all dirty pages for this superblock to stable storage.
     b. If any dirty inode cannot be flushed (device error), retry up to
        `REMOUNT_RO_FLUSH_RETRIES` (3) times with a 100ms delay between
        retries. If all retries fail:
        - If `flags & MS_FORCE`: proceed with remount-ro anyway. Dirty
          pages for failed inodes are discarded (data loss accepted —
          the admin explicitly requested force). Log FMA warning.
        - If no MS_FORCE: return `Err(EBUSY)` — cannot remount read-only
          while dirty pages exist that cannot be flushed. The caller
          must either fix the device error or use `mount -o remount,ro,force`.
     c. After successful flush, verify no new dirty pages appeared during
        the flush (a writer may have dirtied pages concurrently). If
        `sb.nr_dirty_inodes > 0`, retry from step 3a (bounded by the
        same 3-retry limit — total retries across all sub-attempts).
     d. Set `sb.s_writers.frozen` to `SbFreezeLevel::Write` to prevent
        new writers from dirtying pages between the final flush and the
        flag change in step 5. Wait for `sb.s_writers.writers[0].sum() == 0`
        (all active writers drain). Released in step 7.
  4. Acquire `mount_lock`.
  5. Update `target_mount.flags` atomically with `Release` ordering.
     Concurrent readers (path walk, statfs) load mount flags with `Acquire`
     ordering to pair with this `Release`, ensuring the flag change is
     visible before subsequent filesystem operations on this mount.
     Note: a remount can change per-mount flags (readonly, nosuid, etc.)
     independently of superblock options. For example, `mount -o remount,ro`
     on a bind mount makes that mount point read-only without affecting
     other mount points of the same filesystem.
  6. If per-superblock options changed, call
     `FileSystemOps::remount(sb, flags, data)`. On failure, restore the
     old flags, release freeze if held, and return the error.
  7. Release freeze: set `sb.s_writers.frozen` to `SbFreezeLevel::Unfrozen`
     and wake blocked writers (step 3d).
  8. Increment `namespace.event_seq`.
  9. Release `mount_lock`.
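The bounded flush loop of step 3 can be sketched with closures standing in for `sync_filesystem()` and the dirty-inode counter; the retry limit and errno come from the text, and the 100ms sleep between retries is elided.

```rust
const REMOUNT_RO_FLUSH_RETRIES: u32 = 3;
const EBUSY: i32 = 16;

/// Returns Ok(()) once the superblock is clean (or when MS_FORCE accepts
/// the loss of unflushable pages); Err(EBUSY) when retries are exhausted
/// without force.
fn flush_for_remount_ro(
    mut flush: impl FnMut() -> bool,     // sync_filesystem(): true on success
    mut nr_dirty: impl FnMut() -> usize, // sb.nr_dirty_inodes after the flush
    force: bool,                         // MS_FORCE
) -> Result<(), i32> {
    for _ in 0..REMOUNT_RO_FLUSH_RETRIES {
        // Re-check dirtiness after each flush: a concurrent writer may
        // have dirtied pages during the flush itself (step 3c).
        if flush() && nr_dirty() == 0 {
            return Ok(());
        }
        // Real code sleeps ~100ms between retries.
    }
    if force { Ok(()) } else { Err(EBUSY) }
}
```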

14.6.9.5 do_bind_mount — Bind Mount (MS_BIND)

do_bind_mount(source_path, target_path, flags) -> Result<()>

  Capability check: CAP_MOUNT + read access to source path.

  1. Resolve `source_path` to (source_mount, source_dentry).
  2. Resolve `target_path` to (target_mount, target_dentry).
  3. If `source_mount.propagation == Unbindable`, return EINVAL.
  4. Clone the source mount:
     a. Allocate a new `Mount` node.
     b. `superblock` = `source_mount.superblock` (shared — same filesystem
        instance, same data pages).
     c. `root` = `source_dentry` (bind mount's root is the source path,
        not necessarily the source mount's root — this is how bind mounts
        of subdirectories work).
     d. `flags` = copy from source, then apply any new flags from `flags`.
     e. `propagation` = Private (new bind mounts default to Private).
  5. If `flags` contains MS_REC (recursive bind):
     a. For each sub-mount under `source_mount` (descendants of
        `source_dentry`), clone the mount and attach it at the
        corresponding dentry under the new bind mount.
     b. Skip unbindable mounts.
  6. Acquire `mount_lock`.
  7. Attach the cloned mount(s) at target_path (same steps as
     do_mount steps 6f-6m).
  8. Release `mount_lock`.

14.6.9.6 do_move_mount — Move a Mount (MS_MOVE)

do_move_mount(source_mount, target_path) -> Result<()>

  Capability check: CAP_MOUNT in caller's mount namespace.

  1. Resolve `target_path` to (target_parent_mount, target_dentry).
  2. Verify `target_dentry` is not a descendant of `source_mount`
     (moving a mount underneath itself would create a cycle). Return
     EINVAL if it is.
  3. Verify `source_mount` is not the namespace root. Return EINVAL.
  4. Acquire `mount_lock`.
  5. Remove `source_mount` from the old location:
     a. Remove from hash table at old (parent, dentry) key.
     b. Remove from old parent's children list.
     c. Decrement old mountpoint dentry's `d_mount_refcount` and
        conditionally clear `DCACHE_MOUNTED` (same protocol as
        `do_umount` step 9).
  6. Attach at new location:
     a. Update `source_mount.parent` to `target_parent_mount`.
     b. Update `source_mount.mountpoint` to `target_dentry`.
     c. Insert into hash table at new (parent, dentry) key.
     d. Add to new parent's children list.
     e. Increment `target_dentry.d_mount_refcount` and set
        `DCACHE_MOUNTED` (same protocol as `do_mount` step 6f).
  7. Propagation: moving a mount does not trigger propagation
     (matches Linux behavior).
  8. Increment `namespace.event_seq`.
  9. Release `mount_lock`.
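Step 2's cycle check is a walk up the parent chain. In this sketch, indices stand in for mount pointers and `None` marks the namespace root.

```rust
/// Returns true if `mnt` lies in the subtree rooted at `ancestor`,
/// i.e. moving `ancestor` to a mountpoint on `mnt` would create a cycle.
fn is_descendant(parent: &[Option<usize>], mut mnt: usize, ancestor: usize) -> bool {
    loop {
        if mnt == ancestor {
            return true;
        }
        match parent[mnt] {
            Some(p) => mnt = p,   // walk one level up the mount tree
            None => return false, // reached the namespace root
        }
    }
}
```

In do_move_mount, the check runs as `is_descendant(parents, target_parent, source)`: if the target's mount chain passes through the mount being moved, the move is rejected with EINVAL.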

14.6.9.7 do_change_propagation — Set Propagation Type

do_change_propagation(target_mount, type, flags) -> Result<()>

  Capability check: CAP_MOUNT in caller's mount namespace.

  1. Determine the target mount(s):
     - If `flags` contains MS_REC: target mount and all descendants.
     - Otherwise: target mount only.
  2. Acquire `mount_lock`.
  3. For each target mount:
     a. If changing to Shared:
        - Allocate a new `group_id` from `namespace.group_id_allocator`.
        - Set `mount.group_id = new_id`.
        - If the mount was previously a slave, it becomes shared+slave
          (receives from master AND propagates to peers).
     b. If changing to Private:
        - Remove from peer group ring (`mnt_share`).
        - Remove from master's slave list (if slave).
        - Set `mount.group_id = 0`.
        - Set `mount.mnt_master = None`.
     c. If changing to Slave:
        - If the mount is currently shared, it becomes a slave of its
          former peer group. The first remaining peer becomes the master.
        - Remove from peer group ring.
        - Add to master's `mnt_slave_list`.
        - Set `mount.mnt_master` to the former peer group leader.
        - Mount retains its `group_id` (for mountinfo optional fields).
     d. If changing to Unbindable:
        - Same as Private, plus prevents bind mount of this mount.
     e. Update `mount.propagation`.
  4. Increment `namespace.event_seq`.
  5. Release `mount_lock`.

14.6.10 Mount Propagation Algorithms

Mount propagation ensures that mount/unmount events on shared mount points are replicated across all related mount points. This is essential for container volume mounts: when a volume is mounted on a shared host path, all containers that have a slave relationship to that path see the new mount.

14.6.10.1 propagate_mount

propagate_mount(source_mount, new_child_mount) -> Result<()>

  Called under mount_lock when a mount is added to a shared mount point.

  Lock ordering: when the propagation walk must acquire per-mount locks
  (e.g., for mnt_count, children list, or mountpoint hash updates),
  locks are acquired in ascending mnt_id order. This prevents ABBA
  deadlocks when two concurrent propagation walks traverse overlapping
  peer groups. If a lock cannot be acquired in order (e.g., a lower
  mnt_id mount is discovered after a higher one is already locked),
  the higher lock is released and re-acquired after the lower one.

  1. Walk the peer group ring of `source_mount` (via `mnt_share` links).
     For each peer mount (excluding `source_mount` itself):
     a. Clone `new_child_mount` with the peer as parent.
        The clone's mountpoint is the dentry in the peer's filesystem
        that corresponds to `new_child_mount.mountpoint` in the source.
     b. Attach the clone at the peer (insert into hash table, set
        DCACHE_MOUNTED, add to children list, add to mount_list).
     c. If the clone's parent is shared, iteratively propagate to
        that peer group using the tree walk algorithm (matching Linux's
        `propagate_mnt()` iterative walker). Visited groups are tracked
        via a marker flag on each mount to prevent infinite loops. The
        iterative walker processes the propagation tree in a single loop
        without stack recursion, preventing stack overflow regardless of
        propagation chain depth.
  2. Walk the slave list of `source_mount` (via `mnt_slave_list`).
     For each slave mount:
     a. Clone `new_child_mount` with the slave as parent.
     b. Attach the clone at the slave.
     c. If the slave is also shared (shared+slave), propagate to the
        slave's peer group (step 1 applied to the slave's peers).
  3. If cloning fails in any propagation step (e.g., ENOMEM, or the
     per-namespace mount count limit is exceeded), roll back: remove all
     clones created in this propagation pass and return the error.
     Propagation is all-or-nothing within a single mount operation.
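The peer-group walk in step 1 and the all-or-nothing rollback in step 3 can be sketched with deliberately simplified types. `MountNode`, the flat `peers` slice, and the `clone_budget` stand-in for the mount count limit are illustrative, not the real umka-vfs definitions:

```rust
/// Minimal model of step 1 of propagate_mount: clone the new child
/// onto every peer in the source's peer group, all-or-nothing.
#[derive(Debug, Clone, PartialEq)]
struct MountNode {
    id: u64,
    parent: Option<u64>,
}

/// Walk the peer ring (here: a slice of peer ids including the source),
/// producing one clone per peer. On any failure, no clones are published.
fn propagate_mount(
    peers: &[u64],
    source: u64,
    new_child_id: u64,
    clone_budget: usize, // stand-in for the per-namespace mount count limit
) -> Result<Vec<MountNode>, &'static str> {
    let mut clones = Vec::new();
    for &peer in peers.iter().filter(|&&p| p != source) {
        if clones.len() >= clone_budget {
            // Step 3: roll back by dropping all clones created in this pass.
            return Err("mount count limit exceeded");
        }
        clones.push(MountNode {
            id: new_child_id + 1 + clones.len() as u64,
            parent: Some(peer),
        });
    }
    Ok(clones)
}
```

The rollback falls out naturally here because the clones are not published to any shared structure until the whole pass succeeds, mirroring the "all-or-nothing within a single mount operation" rule.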

14.6.10.2 propagate_umount

propagate_umount(source_mount) -> Result<()>

  Called under mount_lock when a mount is removed from a shared mount point.

  1. Walk the peer group ring of `source_mount.parent` (the parent must
     be shared for propagation to occur).
     For each peer of the parent:
     a. Look up a child mount at the corresponding mountpoint dentry
        in the peer's mount hash table.
     b. If found and the child's superblock matches `source_mount`'s
        superblock (same filesystem), unmount it (do_umount steps 6-12).
     c. If the child mount has its own children, recursively unmount
        the subtree (do_umount_tree).
  2. Walk the slave list of the parent.
     For each slave:
     a. Same as step 1a-1c, applied to the slave.
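A minimal sketch of the per-peer matching in steps 1a-1b, using illustrative flat types (`Child`, `sb_id`) rather than the real Mount and SuperBlock structures:

```rust
/// Simplified model of propagate_umount: at each peer of the parent,
/// unmount the child whose superblock matches the source mount's.
#[derive(Debug, Clone, PartialEq)]
struct Child {
    parent: u64,
    mountpoint: &'static str,
    sb_id: u64, // superblock identity
}

/// Returns the children that propagation would unmount.
fn propagate_umount(
    children: &[Child],
    parent_peers: &[u64],
    mountpoint: &'static str,
    source_sb: u64,
) -> Vec<Child> {
    children
        .iter()
        .filter(|c| {
            parent_peers.contains(&c.parent)
                && c.mountpoint == mountpoint
                && c.sb_id == source_sb // step 1b: same filesystem only
        })
        .cloned()
        .collect()
}
```

The superblock comparison is the important filter: a different filesystem that happens to be mounted at the same dentry in a peer is not touched.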

14.6.11 Namespace Operations

14.6.11.1 copy_tree — Clone Mount Tree for CLONE_NEWNS

copy_tree(source_root_mount, source_root_dentry) -> Result<Arc<MountNamespace>>

  Called by clone(CLONE_NEWNS) and unshare(CLONE_NEWNS).

  1. Allocate a new `MountNamespace` with fresh `ns_id`, empty hash table,
     and a new `mount_lock`.
  2. The new namespace inherits the parent's `user_ns`.
  3. Clone the source root mount:
     a. Allocate new `Mount` with the same superblock and root dentry.
     b. Flags are copied. **Propagation is INHERITED from the source root mount**
        (not forced to Private). If the source root is shared, the clone is added
        to the same peer group. If the source root is private, the clone is private.
        This matches Linux's `copy_tree()` behavior and is consistent with step 4e
        logic. The statement "child's mounts are private unless marked shared"
        means shared propagation is PRESERVED, not overridden.
     c. Record old-to-new mount mapping: `mount_map[source_root] = cloned_root`.
        `mount_map` type: `HashMap<*const Mount, Arc<Mount>>` — cold path
        (runs once per `clone(CLONE_NEWNS)` or `unshare(CLONE_NEWNS)`).
        Keys are raw pointers for identity comparison (Arc pointer values
        are not integers, so XArray cannot be used per collection policy).
        HashMap is acceptable on this cold path.
  4. For each mount in the source namespace's mount_list (topological order):
     a. Skip unbindable mounts.
     b. Clone the mount into the new namespace.
     c. Preserve the parent-child relationship (the cloned child's parent
        is the clone of the original child's parent).
     d. Insert into the new namespace's hash table and mount_list.
     e. Record old-to-new mapping: `mount_map[source_mount] = cloned_mount`.
     f. Set propagation:
        - If the source mount is shared: the clone is added to the same
          peer group (shared propagation preserved across CLONE_NEWNS).
          This is critical for container runtimes that rely on propagation.
        - If the source mount is private/slave/unbindable: the clone is
          Private.
     **Error handling**: If mount cloning fails at step 4b (e.g., ENOMEM from
     Mount allocation), drop the partially-constructed MountNamespace. All
     previously cloned mounts are freed via their Arc destructors. The task's
     `fs.root` and `fs.pwd` are unchanged because step 6 was not reached.
     Return the error to the caller (`ENOMEM`).
  5. Set the new namespace's root to the clone of `source_root_mount`.
  6. **Update calling task's fs.root and fs.pwd**: Using the `mount_map`, find the
     cloned counterpart of the task's current root mount and pwd mount:
     ```
     let mut fs = task.fs.write();
     if let Some(new_root) = mount_map.get(&fs.root.mount) {
         fs.root = PathRef { mount: Arc::clone(new_root), dentry: fs.root.dentry.clone() };
     }
     if let Some(new_pwd) = mount_map.get(&fs.pwd.mount) {
         fs.pwd = PathRef { mount: Arc::clone(new_pwd), dentry: fs.pwd.dentry.clone() };
     }
     ```
     Without this step, the task would still resolve paths against the old namespace's
     mounts, defeating the purpose of CLONE_NEWNS.
  7. Return the new namespace.
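The identity-keyed `mount_map` from steps 3c and 4c-4e can be sketched as follows. `Mount` here is reduced to an id and a parent link, and the function assumes the topological ordering guaranteed by `mount_list`:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified Mount: only the parent link matters for this sketch.
struct Mount {
    id: u64,
    parent: Option<Arc<Mount>>,
}

/// Clone a topologically ordered mount list, preserving parent-child
/// relationships via an identity-keyed map (steps 4c and 4e).
fn copy_tree(source: &[Arc<Mount>]) -> Vec<Arc<Mount>> {
    // Keys are raw pointers for identity comparison, as in step 3c.
    let mut mount_map: HashMap<*const Mount, Arc<Mount>> = HashMap::new();
    let mut clones = Vec::new();
    for m in source {
        // Topological order guarantees the parent was already cloned.
        let new_parent = m
            .parent
            .as_ref()
            .map(|p| Arc::clone(&mount_map[&Arc::as_ptr(p)]));
        let clone = Arc::new(Mount { id: m.id, parent: new_parent });
        mount_map.insert(Arc::as_ptr(m), Arc::clone(&clone));
        clones.push(clone);
    }
    clones
}
```

Note that the raw-pointer keys are only valid while the source `Arc`s are alive, which holds here because the source namespace is pinned for the duration of `copy_tree`.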

14.6.11.2 pivot_root Integration

The pivot_root(2) algorithm specified in Section 17.1 is updated to use the Mount data structure:

pivot_root(new_root_path, put_old_path) -> Result<()>

  Capability check: CAP_SYS_ADMIN in caller's user namespace.
  The caller must be in a mount namespace (not the initial namespace).

  1. Resolve `new_root_path` to (new_root_mount, new_root_dentry).
     Verify `new_root_dentry` is the root of `new_root_mount` (i.e.,
     new_root is a mount point, not just a directory).
  2. Resolve `put_old_path` to (put_old_mount, put_old_dentry).
     Verify `put_old` is at or under `new_root`.
  3. Verify `new_root_mount` is not the current namespace root
     (i.e., `new_root_mount != namespace.root`). Pivoting the root
     onto itself is rejected with `EINVAL`.
  4. Verify `put_old` is reachable from `new_root` by walking the
     mount tree upward. This ensures `put_old` is a valid location
     within the new root's subtree for mounting the old root.
  5. Acquire `mount_lock`.
  6. Let `old_root_mount` = namespace's current root mount.
  7. Detach `new_root_mount` from its current position:
     a. Remove from hash table.
     b. Remove from parent's children.
     c. Clear DCACHE_MOUNTED on its old mountpoint.
  8. Reattach `old_root_mount` at `put_old`:
     a. Set `old_root_mount.parent` = `new_root_mount`.
     b. Set `old_root_mount.mountpoint` = the dentry corresponding to
        `put_old` within `new_root_mount`'s filesystem.
     c. Insert `old_root_mount` into hash table at new position.
     d. Set DCACHE_MOUNTED on the put_old dentry.
  9. Set `new_root_mount` as the namespace root:
     a. `new_root_mount.parent` = None (it is now the root).
     b. `new_root_mount.mountpoint` = `new_root_mount.root` (self-referential
        for the root mount).
     c. `namespace.root.update(new_root_mount, &mount_lock_guard)` (RCU
        publish via RcuCell::update).
  10. Update the CURRENT TASK's fs.root and fs.pwd if they reference the old root:
      ```
      let mut fs = current_task().fs.write();
      if fs.root.mount == old_root_mount {
          fs.root = PathRef { mount: Arc::clone(&new_root_mount), dentry: new_root_dentry };
      }
      if fs.pwd.mount == old_root_mount {
          fs.pwd = PathRef { mount: Arc::clone(&new_root_mount), dentry: new_root_dentry };
      }
      ```
      NOTE: Linux does NOT iterate all tasks in the namespace. Other tasks sharing
      the same `fs_struct` see the update via the shared reference. Tasks with
      different `fs_struct` instances that reference the old root will see the old
      root moved to `put_old` on their next path resolution — this is correct
      behavior (they can then chdir to the new root if desired).
  11. Increment `namespace.event_seq`.
  12. Release `mount_lock`.

  Note: Steps 7-9 are the atomic state change. In-flight path lookups
  that started before step 9 see the old root via RCU (the old
  `RcuCell` value remains valid until the grace period). New lookups
  after step 9 see the new root. This matches the atomicity guarantee
  specified in Section 17.1.3.
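Steps 7-9 reduce to two re-parenting writes; a sketch over a flat mount table follows. `Mnt` and the string mountpoints are illustrative stand-ins for the real Mount/dentry structures:

```rust
/// Simplified mount node for the pivot_root reattachment (steps 7-9).
#[derive(Debug)]
struct Mnt {
    id: u64,
    parent: Option<u64>,      // parent mount id
    mountpoint: &'static str, // dentry path of the mountpoint
}

/// Perform the reattachment: new_root becomes the parentless namespace
/// root (step 9); old_root is re-parented under new_root at put_old (step 8).
fn pivot(mounts: &mut [Mnt], old_root: u64, new_root: u64, put_old: &'static str) {
    for m in mounts.iter_mut() {
        if m.id == new_root {
            // Step 9: the new root has no parent; its mountpoint is
            // self-referential (represented here as "/").
            m.parent = None;
            m.mountpoint = "/";
        } else if m.id == old_root {
            // Step 8: the old root hangs off new_root at put_old.
            m.parent = Some(new_root);
            m.mountpoint = put_old;
        }
    }
}
```

In the real implementation both writes happen under `mount_lock` and the namespace root is republished via `RcuCell::update`, which is what gives in-flight lookups the atomicity described above.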

14.6.11.3 Namespace Teardown

When a mount namespace is destroyed (all processes exited, all /proc/PID/ns/mnt file descriptors closed, all bind mounts of the namespace file unmounted):

destroy_mount_namespace(ns) -> ()

  1. Acquire `mount_lock`.
  2. Iterate `ns.mount_list` in reverse topological order (leaves first).
  3. For each mount:
     a. Set MNT_DOOMED.
     b. Remove from hash table.
     c. Remove from parent's children.
     d. Remove from peer group and slave lists.
  4. Release `mount_lock`.
  5. For each removed mount (in reverse order):
     a. If `mnt_count == 0`, call `FileSystemOps::unmount()`.
     b. If `mnt_count > 0` (lazy unmount remnants still referenced by
        open file descriptors), defer unmount to final reference drop.
  6. Drop the hash table and mount list.

14.6.12 New Mount API Syscalls

UmkaOS implements the Linux 5.2+ mount API syscalls for compatibility with modern container runtimes (containerd, CRI-O) and systemd. These are thin wrappers around the internal mount operations described above.

| Syscall | Purpose | Capability |
| --- | --- | --- |
| fsopen(fs_type, flags) | Create a filesystem context | CAP_MOUNT |
| fspick(dirfd, path, flags) | Create a reconfiguration context for an existing mount | CAP_MOUNT |
| fsconfig(fd, cmd, key, value, aux) | Configure a filesystem context | CAP_MOUNT |
| fsmount(fs_fd, flags, mount_attr) | Create a detached mount from a configured context | CAP_MOUNT |
| move_mount(from_dirfd, from_path, to_dirfd, to_path, flags) | Attach a detached mount or move an existing mount | CAP_MOUNT |
| open_tree(dirfd, path, flags) | Open or clone a mount point as a file descriptor | CAP_MOUNT (if OPEN_TREE_CLONE) |
| mount_setattr(dirfd, path, flags, attr, size) | Modify mount attributes, optionally recursively | CAP_MOUNT |

fsopen flow:

  1. Validate fs_type against the filesystem registry.
  2. Allocate FsContext with purpose = NewMount.
  3. Return a file descriptor referencing the context.

fsconfig flow (selected commands):

  - FSCONFIG_SET_STRING: set a key-value option string.
  - FSCONFIG_SET_BINARY: set a binary option blob.
  - FSCONFIG_SET_FD: set an option to a file descriptor (e.g., source device).
  - FSCONFIG_CMD_CREATE: validate all options and create the superblock by calling FileSystemOps::mount(). On success, the superblock is stored in FsContext.superblock. On failure, diagnostic messages are written to the context's error log.
  - FSCONFIG_CMD_RECONFIGURE: for fspick contexts, apply new options to the existing superblock via FileSystemOps::remount().

fsmount flow:

  1. Consume the superblock from the FsContext. The FsContext is marked consumed (state = FsContextState::Consumed); further fsconfig() calls on this fd return EBUSY. The fsopen fd remains open for error log retrieval but cannot be used to create another mount. Closing the fsopen fd releases the FsContext; double-release is prevented by the consumed state flag.
  2. Allocate a Mount node with MNT_DETACHED flag set.
  3. The mount is not yet attached to any namespace or visible to path resolution. It exists only as a detached object referenced by the returned file descriptor.
  4. Return an O_PATH file descriptor referencing the detached mount.
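The consumption rule in step 1 of the fsmount flow can be sketched as a small state machine. `FsContextState::Consumed` is named in the text; the `Configuring` variant and the errno constant are illustrative assumptions:

```rust
/// Sketch of the FsContext consumption rule from the fsmount flow.
#[derive(Debug, PartialEq)]
enum FsContextState {
    Configuring, // illustrative pre-consumption state
    Consumed,
}

const EBUSY: i32 = 16;

struct FsContext {
    state: FsContextState,
    superblock: Option<u64>, // stand-in for Arc<SuperBlock>
}

impl FsContext {
    /// fsconfig() on a consumed context fails with EBUSY.
    fn fsconfig(&mut self) -> Result<(), i32> {
        if self.state == FsContextState::Consumed {
            return Err(EBUSY);
        }
        Ok(())
    }

    /// fsmount() consumes the superblock exactly once; the consumed
    /// flag prevents double-release of the context.
    fn fsmount(&mut self) -> Result<u64, i32> {
        match self.superblock.take() {
            Some(sb) if self.state != FsContextState::Consumed => {
                self.state = FsContextState::Consumed;
                Ok(sb) // becomes a detached Mount (MNT_DETACHED)
            }
            _ => Err(EBUSY),
        }
    }
}
```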

move_mount flow:

  1. Resolve the source (detached mount fd or existing mount path).
  2. Resolve the target path.
  3. If the source is detached (MNT_DETACHED):
     a. Clear MNT_DETACHED.
     b. Attach to the namespace via do_mount steps 6e-6m.
  4. If the source is an existing mount:
     a. Delegate to do_move_mount() (Section 14.6).

open_tree flow:

  1. Resolve the path to a mount.
  2. If OPEN_TREE_CLONE:
     a. Clone the mount (like do_bind_mount without attaching).
     b. The clone is detached (MNT_DETACHED).
     c. If OPEN_TREE_CLONE | AT_RECURSIVE: recursively clone the subtree.
  3. Return an O_PATH file descriptor.

mount_setattr flow:

  1. Resolve the path to a mount.
  2. Validate attr_set and attr_clr do not conflict.
  3. Acquire mount_lock.
  4. If AT_RECURSIVE:
     a. Collect all mounts in the subtree.
     b. Validate the changes are valid for all mounts (e.g., clearing MNT_READONLY on a mount whose superblock is read-only is invalid).
     c. If validation fails for any mount, return the error (no partial changes).
     d. Apply attr_clr then attr_set to all mounts atomically.
  5. If not recursive: apply to the single mount.
  6. If attr.propagation != 0: change the propagation type (Section 14.6).
  7. Increment namespace.event_seq.
  8. Release mount_lock.

14.6.13 Mount Introspection Syscalls

Linux 6.8 introduced statmount(2) and listmount(2) as structured replacements for parsing /proc/PID/mountinfo. UmkaOS implements both for container introspection tools and future-compatible userspace.

| Syscall | Purpose | Capability |
| --- | --- | --- |
| statmount(req, buf, bufsize, flags) | Query detailed mount information by mount ID | None (own namespace) |
| listmount(req, buf, bufsize, flags) | List child mount IDs of a given mount | None (own namespace) |

statmount: Returns a struct statmount containing the mount's ID, parent ID, mount flags, propagation type, peer group ID, master mount ID, filesystem type, mount source, mount point path, and superblock options. The request specifies which fields to populate via a bitmask, avoiding unnecessary work (e.g., path resolution for mount point is skipped if STATMOUNT_MNT_POINT is not requested).

listmount: Returns an array of 64-bit mount IDs for the child mounts of a given mount. Supports cursor-based iteration: the caller passes the last seen mount ID, and listmount returns mount IDs after that cursor. This handles concurrent mount/unmount gracefully (mounts added after the cursor are seen; mounts removed are skipped).
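The cursor semantics can be sketched directly; this sketch assumes child mount IDs are iterated in ascending order:

```rust
/// Cursor-based child-mount listing as described for listmount(2):
/// return up to `cap` mount IDs strictly greater than `cursor`.
fn listmount(children: &[u64], cursor: u64, cap: usize) -> Vec<u64> {
    children
        .iter()
        .copied()
        .filter(|&id| id > cursor)
        .take(cap)
        .collect()
}
```

Because the cursor is the last *seen* ID rather than an index, a mount removed between calls is simply skipped and a mount added after the cursor is picked up, which is the graceful-concurrency behavior described above.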

14.6.14 /proc/PID/mountinfo Format

Each process exposes its mount namespace's mount tree through /proc/PID/mountinfo and /proc/PID/mounts. These files are read by systemd, Docker, findmnt, df, mountpoint, and other tools.

mountinfo line format (one line per mount, matching Linux exactly):

<mount_id> <parent_id> <major>:<minor> <root> <mount_point> <mount_options> <optional_fields> - <fs_type> <mount_source> <super_options>
| Field | Source | Example |
| --- | --- | --- |
| mount_id | Mount.mount_id | 36 |
| parent_id | Mount.parent.mount_id (self for root) | 35 |
| major:minor | SuperBlock.dev major:minor | 98:0 |
| root | Path of mount root within the filesystem | / or /subdir |
| mount_point | Path of mount point relative to process root | /mnt/data |
| mount_options | Per-mount flags as comma-separated options | rw,noatime,nosuid |
| optional_fields | Propagation: shared:N, master:N, propagate_from:N | shared:1 master:2 |
| separator | Literal hyphen | - |
| fs_type | Filesystem type name | ext4 |
| mount_source | Mount.device_name | /dev/sda1 |
| super_options | From FileSystemOps::show_options() | rw,errors=continue |

Implementation: The VFS iterates the namespace's mount_list under rcu_read_lock() and formats each line. The mount_list's topological ordering ensures that parent mounts appear before children (matching Linux's output order).

/proc/PID/mounts: A simplified view matching the old /etc/mtab format: <device> <mount_point> <fs_type> <options> 0 0. Generated from the same mount_list, omitting mount IDs and propagation fields.
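Assembled from the example values in the table above, one complete mountinfo line reads:

```
36 35 98:0 / /mnt/data rw,noatime,nosuid shared:1 - ext4 /dev/sda1 rw,errors=continue
```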

14.6.15 Path Resolution Integration

This section details how the mount tree integrates with the path resolution algorithm described in Section 14.1.

Mount crossing in RCU-walk (fast path):

resolve_component_rcu(current_mount, current_dentry, name):
  1. Look up `name` in the dentry cache: dentry = dcache_lookup(current_dentry, name).
  2. If dentry is not found: fall through to ref-walk (cache miss).
  3. If dentry.flags has DCACHE_MOUNTED:
     a. Call mnt_ns.hash_table.lookup(current_mount.mount_id, dentry.inode, &rcu_guard).
     b. If a child mount is found:
        - current_mount = child_mount
        - dentry = child_mount.root
        - If dentry also has DCACHE_MOUNTED, repeat step 3
          (stacked mounts — rare but legal).
     c. If no child mount found: DCACHE_MOUNTED is stale (race with
        umount). Clear the flag lazily and continue with the dentry.
  4. Return (current_mount, dentry).

Mount crossing in ref-walk (slow path):

resolve_component_ref(current_mount, current_dentry, name):
  1. Same as RCU-walk step 1, but takes a dentry reference count.
  2. Same DCACHE_MOUNTED check.
  3. If mount crossing:
     a. Call mnt_ns.hash_table.lookup() under rcu_read_lock().
     b. If found: increment child_mount.mnt_count (atomic add).
     c. Decrement current_mount.mnt_count.
     d. current_mount = child_mount; dentry = child_mount.root.
  4. Return (current_mount, dentry).

".." traversal across mount boundaries:

resolve_dotdot(current_mount, current_dentry):
  1. Chroot boundary check: if current_dentry == task.fs.root.dentry
     AND current_mount == task.fs.root.mount, return (current_mount,
     current_dentry). The process is at its chroot root — ".." must
     not escape the jail.
  2. If current_dentry == current_mount.root:
     - We are at the root of this mount. ".." should cross into the parent
       mount.
     - If current_mount.parent is None: we are at the namespace root.
       ".." resolves to the root itself (cannot go above /).
     - Otherwise: current_dentry = current_mount.mountpoint, THEN
       current_mount = current_mount.parent. (The mountpoint dentry lives
       in the parent mount's filesystem, so it must be read before the
       mount pointer is advanced. Then continue from step 1: the
       mountpoint may itself be the root of a stacked mount.)
  3. If current_dentry != current_mount.root:
     - Normal ".." within the mount's filesystem.
     - current_dentry = current_dentry.parent.
  4. Return (current_mount, current_dentry).
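The climb loop can be sketched as follows (omitting the chroot boundary check of step 1). All names are illustrative; the key point is that the mountpoint dentry is read before the mount pointer advances:

```rust
/// Simplified ".." resolution across mount boundaries. A mount is
/// (root dentry id, parent mount index, mountpoint dentry id);
/// dentry parentage is supplied as a function.
#[derive(Clone, Copy)]
struct M { root: u32, parent: Option<usize>, mountpoint: u32 }

fn resolve_dotdot(
    mounts: &[M],
    dentry_parent: &dyn Fn(u32) -> u32,
    mut mnt: usize,
    mut dentry: u32,
) -> (usize, u32) {
    loop {
        if dentry != mounts[mnt].root {
            // Normal ".." within the mount's filesystem.
            return (mnt, dentry_parent(dentry));
        }
        match mounts[mnt].parent {
            // Namespace root: ".." resolves to "/" itself.
            None => return (mnt, dentry),
            Some(parent) => {
                // Climb: take the mountpoint dentry BEFORE switching
                // mounts; it lives in the parent's filesystem.
                dentry = mounts[mnt].mountpoint;
                mnt = parent;
                // Loop: the mountpoint may itself be the parent's root
                // (stacked mounts), in which case we climb again.
            }
        }
    }
}
```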

14.6.16 Performance Characteristics

| Operation | Cost | Notes |
| --- | --- | --- |
| Mount hash lookup (RCU read) | ~5-15 ns | SipHash + 1-2 pointer chases, no locks, no atomics. Occurs on every mount-point crossing during path resolution. |
| DCACHE_MOUNTED check | ~1 ns | Single atomic load of dentry flags. Occurs on every path component — the gate that avoids hash lookup on non-mount-point dentries. |
| Mount (new filesystem) | ~1-10 us | Dominated by filesystem driver's mount() (superblock creation). Mount tree insertion is ~200 ns under lock. |
| Unmount | ~500 ns - 5 us | Hash removal + propagation. Filesystem unmount() cost varies (ext4 journal flush vs. tmpfs instant). |
| Bind mount | ~300 ns | Mount node clone + hash insertion. No filesystem I/O. |
| Bind mount (recursive, N sub-mounts) | ~300*N ns | Linear in subtree size. |
| Propagation (mount, M peers) | ~300*M ns | One clone per peer. Propagation to slaves adds per-slave overhead. |
| /proc/PID/mountinfo generation | ~50 ns/mount | One line per mount. 100-mount namespace: ~5 us total. |
| copy_tree (CLONE_NEWNS, N mounts) | ~500*N ns | Clone all mounts. 100-mount namespace: ~50 us. |
| pivot_root | ~1 us | Two hash table mutations + RCU publish. |

Memory overhead per mount: ~320 bytes for the Mount struct (including all intrusive list nodes and propagation fields) plus ~16 bytes for the hash table entry. A container with 100 mounts consumes ~33 KiB of mount tree metadata. A system with 10,000 containers (1 million mounts total) consumes ~320 MiB — proportional to the actual number of mounts, not pre-allocated.

14.6.17 Cross-References

  • Section 3.5 (Lock Hierarchy): MOUNT_LOCK at level 20, between DENTRY_LOCK (19) and EVM_LOCK (22).
  • Section 9.1 (Capabilities): CAP_MOUNT (bit 70) gates all mount operations. CAP_SYS_ADMIN (bit 21) required for pivot_root and MNT_LOCKED override.
  • Section 14.1 (VFS Architecture): FileSystemOps::mount() creates the superblock consumed by do_mount(). FileSystemOps::unmount() is called by do_umount() after tree removal.
  • Section 14.1 (Dentry Cache): DCACHE_MOUNTED flag triggers mount hash table lookup during path resolution.
  • Section 14.1 (Path Resolution): RCU-walk and ref-walk mount crossing detailed in Section 14.6.
  • Section 14.1 (Mount Namespace and Capability-Gated Mounting): The capability table and propagation type summary specified there are implemented by the data structures in this section.
  • Section 14.8 (overlayfs): OverlayFs::mount() creates an OverlaySuperBlock consumed via the standard do_mount() path.
  • Section 17.1 (Namespace Implementation): NamespaceSet.mount_ns: Arc<MountNamespace> provides access to the full mount tree rather than just a capability handle to the root VFS node. The NamespaceSet is per-task (Task.nsproxy), not per-process.
  • Section 17.1 (pivot_root): The step-by-step algorithm there is superseded by the precise Mount-struct-based algorithm in Section 14.6.
  • Section 17.1 (Namespace Inheritance): CLONE_NEWNS triggers copy_tree() (Section 14.6).

14.7 Distribution-Aware VFS Extensions

When filesystems are shared across cluster nodes (Section 15.14), the VFS must handle cache validity, locking granularity, and metadata coherence across node boundaries. Linux's VFS was designed for local filesystems with network filesystem support bolted on afterward, resulting in several systemic performance problems. UmkaOS's VFS addresses these by integrating with the Distributed Lock Manager (Section 15.15).

| Linux Problem | Impact | UmkaOS Fix |
| --- | --- | --- |
| Dentry cache assumes local validity | Remote rename/unlink leaves stale dentries on other nodes | Callback-based invalidation: DLM lock downgrade (Section 15.15) triggers targeted dentry invalidation for affected directory entries only |
| d_revalidate() on every lookup for network FS | Extra round-trip per path component on NFS/CIFS/GFS2 | Lease-attached dentries: dentry is valid while parent directory DLM lock is held (Section 15.15); zero revalidation cost during lease period |
| Inode-level locking forces false sharing | Two nodes writing to different byte ranges of the same file serialize on the inode lock | Range locks in VFS: DLM byte-range lock resources (Section 15.15) allow concurrent operations on different ranges of the same file |
| No concurrent directory operations | mkdir and create in the same directory serialize globally | Per-bucket directory locks: hash-based directory formats (ext4 htree, GFS2 leaf blocks) use separate DLM resources per hash bucket |
| readdir() + stat() = 2N round-trips for N files | ls -l on a 1000-file remote directory requires 2001 operations | getdents_plus() returning attributes with directory entries (analogous to NFS READDIRPLUS but in-kernel, avoiding the userspace/kernel boundary per entry). getdents_plus() is an UmkaOS VFS-internal operation (not a new syscall): the VFS's readdir implementation populates both the directory entry and its InodeAttr in a single filesystem callback, caching the attributes for immediate use by a subsequent getattr() / stat() call. Userspace accesses this via the standard getdents64(2) + statx(2) syscalls — the optimization is transparent, eliminating redundant disk or DLM round-trips inside the kernel. |
| Full inode cache invalidation on lock drop | Dropping a DLM lock on an inode discards all cached metadata, even fields that haven't changed | Per-field inode validity: mtime/size read from DLM Lock Value Block (Section 15.15); permissions and ownership from local capability cache; only stale fields refreshed on lock reacquire |

Integration with Section 15.15 DLM:

  • Dentry lease binding: When the VFS caches a dentry for a clustered filesystem, it records the DLM lock resource that protects the parent directory. The dentry remains valid as long as that lock is held at CR (Concurrent Read) mode or stronger. When the DLM downgrades or releases the lock (due to contention from another node), the VFS receives a callback and invalidates only the affected dentries — not the entire dentry subtree.
/// Per-dentry lease tracking for distributed VFS.
/// Stored in `Dentry::d_fsdata` for clustered filesystems.
pub struct DentryLeaseInfo {
    /// DLM lock resource ID protecting the parent directory of this dentry.
    pub dlm_resource: DlmResourceId,
    /// Lease sequence counter. Incremented by the DLM callback when the
    /// lease is invalidated (lock downgrade or release). VFS path walk
    /// compares the dentry's cached `lease_seq` against the current
    /// directory DLM lock's `lease_seq`: if they differ, the dentry is
    /// treated as stale and re-validated.
    ///
    /// Type: u64 (50-year rule: at 10M invalidations/sec, wraps in ~58K years).
    pub lease_seq: u64,
    /// The DLM lock mode at which this dentry was validated.
    pub validated_at_mode: DlmLockMode,
}
  • Range-aware writeback: When a process holds a DLM byte-range lock and writes to pages within that range, the VFS tracks dirty pages per lock range (not per inode). On lock downgrade, only dirty pages within the lock's range are flushed (Section 15.15). This eliminates the Linux problem where dropping a lock on a 100 GB file requires flushing all dirty pages, even if only 4 KB was modified.

  • Attribute caching via LVB: The VFS reads frequently-accessed inode attributes (i_size, i_mtime, i_blocks) from the DLM Lock Value Block (Section 15.15) rather than performing a disk read on every lock acquire. The LVB is updated by the last writer on lock release, so readers always get current values at the cost of a single RDMA operation (~3-4 μs) instead of a disk I/O (~10-15 μs for NVMe).
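The `lease_seq` comparison described in the `DentryLeaseInfo` comments can be sketched as follows (`DirLock` is an illustrative stand-in for the directory DLM lock state):

```rust
/// Lease check: a dentry validated under a directory DLM lock is
/// valid only while its cached lease_seq matches the lock's current
/// sequence. Mirrors the DentryLeaseInfo fields defined above.
struct DirLock { lease_seq: u64 }

struct Lease { lease_seq: u64 }

fn dentry_is_valid(lease: &Lease, lock: &DirLock) -> bool {
    lease.lease_seq == lock.lease_seq
}

/// DLM blocking callback path: bump the sequence, invalidating every
/// dentry validated under the previous grant in O(1).
fn on_lock_downgrade(lock: &mut DirLock) {
    lock.lease_seq += 1;
}
```

A single counter increment invalidates all dentries under the lock without walking them; each dentry discovers its staleness lazily on the next path walk.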

14.7.1.1 Lease Invalidation and In-Flight I/O Synchronization

When the DLM downgrades a dentry or byte-range lock (due to contention from another node), in-flight I/O operations that depend on the lease must be coordinated to prevent data corruption. The synchronization protocol:

  1. DLM blocking callback received: The VFS receives a dlm_ast_blocking() callback indicating that another node requests the lock at a conflicting mode.

  2. In-flight I/O barrier: The VFS increments the per-inode invalidation_seq: AtomicU64 counter (Release ordering; readers load it with Acquire). All new VFS operations targeting this inode check invalidation_seq before proceeding; if it has changed since the operation began, the operation must re-validate its cached dentry/inode state after re-acquiring the lock.

  3. Drain in-flight operations: The VFS waits for all in-flight operations that hold a reference to the current lock grant to complete. This uses a per-lock-resource inflight_count: AtomicU32 reference counter:

     a. Each VFS operation that depends on a DLM lock increments inflight_count (Acquire) at operation start and decrements it (Release) at completion.

     b. The invalidation path waits on a per-lock-resource WaitQueue with a 30-second timeout (matching GFS2's gfs2_glock_wait_for_demote timeout). The WaitQueue is signaled by each VFS operation upon completion (after decrementing inflight_count). If the timeout expires, the lock is downgraded forcibly (the remote node's request takes priority to avoid cluster-wide deadlocks).

  4. Flush dirty data: For byte-range locks, dirty pages within the lock's range are flushed to disk (Section 15.15) before the lock is downgraded.

  5. Invalidate caches: Dentry cache entries protected by the lock are invalidated. Page cache pages within the byte range are invalidated (discarded if clean, flushed then discarded if dirty).

  6. Downgrade/release the lock: The DLM lock is downgraded to the requested mode (or released entirely). The dlm_ast_completion() callback notifies the requesting node that the lock is available.

Ordering guarantee: Steps 2-5 are atomic with respect to the lock: no new operation can acquire the lock between the barrier (step 2) and the downgrade (step 6) because the lock's grant state is set to LOCK_INVALIDATING during this window.
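The barrier and drain counters from the protocol above can be sketched as follows (a single-threaded sketch; the real implementation pairs drained() with the per-resource WaitQueue and timeout):

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// In-flight I/O barrier: per-inode invalidation sequence plus a
/// per-lock-resource in-flight operation counter.
struct LockResource {
    invalidation_seq: AtomicU64,
    inflight_count: AtomicU32,
}

impl LockResource {
    /// VFS operation prologue: register as in-flight and snapshot
    /// the invalidation sequence to detect a concurrent downgrade.
    fn op_start(&self) -> u64 {
        self.inflight_count.fetch_add(1, Ordering::Acquire);
        self.invalidation_seq.load(Ordering::Acquire)
    }

    /// VFS operation epilogue. Real code signals the WaitQueue here.
    fn op_end(&self) {
        self.inflight_count.fetch_sub(1, Ordering::Release);
    }

    /// Invalidation path: publish the barrier (bump the sequence).
    fn begin_invalidation(&self) {
        self.invalidation_seq.fetch_add(1, Ordering::Release);
    }

    /// Drain condition the invalidation path waits on.
    fn drained(&self) -> bool {
        self.inflight_count.load(Ordering::Acquire) == 0
    }
}
```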


14.8 overlayfs: Union Filesystem for Containers

Use case: Container image layering. Docker, containerd, Podman, and Kubernetes all use overlayfs as their primary storage driver. A container image is a stack of read-only filesystem layers; overlayfs merges them with a writable upper layer to present a unified view. Without overlayfs, container runtimes fall back to copy-the-entire-layer approaches (VFS copy, naive snapshots), which are orders of magnitude slower for image pull and container startup.

Tier: Tier 1 (runs in the VFS isolation domain alongside umka-vfs).

Rationale for Tier 1 (not Tier 2): overlayfs is a stacking filesystem — it sits between the VFS and the underlying filesystem drivers (ext4, XFS, btrfs, tmpfs). Every path lookup, readdir, and file open in a container traverses overlayfs. Placing it in Tier 2 (Ring 3, process boundary) would add two domain crossings per VFS operation inside every container, roughly doubling the path resolution overhead. Since overlayfs delegates all storage I/O to the underlying filesystem (which is itself a Tier 1 driver), overlayfs never touches hardware directly — it is a pure VFS client. Its code complexity is moderate (~3,000 SLOC in Linux) and auditable. The crash containment boundary is the VFS domain: if overlayfs panics, the VFS recovery protocol (Section 14.1) handles it.

Container setup ordering: During container creation, the overlayfs mount must complete before pivot_root() changes the container's root filesystem. The sequence is: (1) mount overlayfs at the target path, (2) mount pseudo-filesystems (/proc, /sys, /dev) on top, (3) pivot_root() to switch the container root to the overlayfs mount. Reversing steps (1) and (3) would leave the container with no root filesystem. This ordering matches the OCI runtime specification and is enforced by the umka-sysapi container setup helpers.

pivot_root() namespace validation: pivot_root() validates that new_root is a mount point in the calling task's mount namespace. If new_root was mounted in a different namespace (e.g., parent), pivot_root() returns EINVAL. This prevents namespace-crossing pivots that would create incoherent mount state.

Design: overlayfs implements FileSystemOps, InodeOps, FileOps, and DentryOps from the VFS trait system (Section 14.1). It does not introduce new VFS abstractions — it composes existing ones.

14.8.1 Mount Options and Configuration

/// Mount options parsed from the `data` parameter of `FileSystemOps::mount()`.
/// Encoded as comma-separated key=value pairs in the `data: &[u8]` slice,
/// matching Linux's overlayfs mount option syntax exactly.
///
/// Example mount command:
/// ```
/// mount -t overlay overlay \
///   -o lowerdir=/lower2:/lower1,upperdir=/upper,workdir=/work \
///   /merged
/// ```
///
/// For read-only overlays (no upperdir/workdir), only lowerdir is required.
/// This is used for container image inspection without a writable layer.
pub struct OverlayMountOptions {
    /// Colon-separated list of lower layer paths, ordered from topmost to
    /// bottommost. At least one lower layer is required. Maximum 500 layers
    /// (matching Linux's limit, which Docker/containerd never approach —
    /// typical images have 5-20 layers).
    ///
    /// Each path must be an existing directory on a mounted filesystem.
    /// The VFS resolves each path to an `InodeId` at mount time and holds
    /// a reference to the underlying superblock for the mount's lifetime.
    ///
    /// Heap-allocated rather than inline (`ArrayVec<_, 500>` would be up to
    /// 4000 bytes on the stack). The 500-layer maximum is enforced at mount
    /// validation time. Mount processing is a rare, non-hot-path operation
    /// where heap allocation is acceptable.
    pub lower_dirs: Box<[InodeId]>,

    /// Upper layer directory (read-write). `None` for read-only overlays.
    /// Must reside on a filesystem that supports: xattr (for whiteouts and
    /// metacopy markers), rename with RENAME_WHITEOUT, and mknod (for
    /// character-device whiteouts). The upper filesystem must be writable.
    pub upper_dir: Option<InodeId>,

    /// Work directory for atomic copy-up staging. Required if `upper_dir`
    /// is set. Must be on the **same filesystem** as `upper_dir` (same
    /// superblock) — copy-up uses rename(2) from workdir to upperdir,
    /// which requires same-device semantics. The VFS verifies this at
    /// mount time by comparing `SuperBlock` identity.
    ///
    /// The workdir must be empty at mount time. overlayfs creates a `work/`
    /// subdirectory inside it for staging, and an `index/` subdirectory
    /// for NFS export handles (if enabled).
    pub work_dir: Option<InodeId>,

    /// Enable metadata-only copy-up. When true, operations that modify
    /// only metadata (chmod, chown, utimes, setxattr) copy only the
    /// inode metadata to the upper layer, deferring data copy until the
    /// first write. Dramatically reduces container startup I/O: a
    /// `chmod` on a 200 MB binary copies ~4 KB of metadata instead of
    /// 200 MB of data.
    ///
    /// Default: true (matches Docker/containerd default since Linux 5.11+
    /// with kernel config `OVERLAY_FS_METACOPY=y`).
    ///
    /// Security restriction: this option is silently forced to `false`
    /// when the mount is user-namespace-influenced (i.e., when the caller
    /// does not hold `CAP_SYS_ADMIN` in the initial user namespace). In
    /// such mounts the upper layer uses `user.overlay.*` xattrs, which
    /// are writable by the file owner without privilege; a forged
    /// metacopy xattr could redirect reads to arbitrary lower-layer files.
    /// See [Section 14.8](#overlayfs-union-filesystem-for-containers--metacopy-trust-model-and-security-constraints)
    /// for the complete trust model and enforcement mechanism.
    pub metacopy: bool,

    /// Directory rename/redirect handling.
    ///
    /// - `On`: Enable redirect xattrs for directory renames. Required
    ///   for rename(2) on merged directories to succeed (without this,
    ///   rename of a directory that exists in a lower layer returns EXDEV).
    /// - `Follow`: Follow existing redirect xattrs but do not create new
    ///   ones. Safe for mounting layers created by a trusted system.
    /// - `NoFollow`: Ignore redirect xattrs entirely. Most restrictive.
    /// - `Off`: Disable redirect handling; directory renames return EXDEV.
    ///
    /// Default: `On` (required by Docker/containerd for correct semantics).
    pub redirect_dir: RedirectDirMode,

    /// Volatile mode. When enabled, overlayfs skips all fsync/sync_fs calls
    /// to the upper filesystem. A crash or power loss may leave the upper
    /// layer in an inconsistent state (workdir staging artifacts, partial
    /// copy-ups). The overlay refuses to remount if it detects a previous
    /// volatile session that was not cleanly unmounted.
    ///
    /// Docker uses volatile mode for ephemeral containers where persistence
    /// is not needed (CI runners, build containers, test environments).
    ///
    /// Default: false.
    pub volatile: bool,

    /// Use `user.overlay.*` xattr namespace instead of `trusted.overlay.*`.
    /// Required for unprivileged (rootless) overlayfs mounts where the
    /// calling process lacks CAP_SYS_ADMIN in the initial user namespace.
    /// The `user.*` xattr namespace is writable by the file owner without
    /// special capabilities.
    ///
    /// Default: false (use `trusted.overlay.*`).
    pub userxattr: bool,

    /// Extended inode number mode. Controls how overlayfs composes inode
    /// numbers to guarantee uniqueness across layers.
    ///
    /// - `On`: Compose inode numbers using upper bits for layer index.
    ///   Requires underlying filesystems to use <32-bit inode numbers
    ///   (ext4, XFS with `inode32` mount option).
    /// - `Off`: Use raw underlying inode numbers. Risk of collisions
    ///   across layers (two files on different layers may share an ino).
    /// - `Auto`: Enable if all underlying filesystems have small enough
    ///   inode numbers; disable otherwise.
    ///
    /// Default: `Auto`.
    pub xino: XinoMode,

    /// NFS export support. When enabled, overlayfs maintains an index
    /// directory (inside workdir) that maps NFS file handles to overlay
    /// dentries. Required if the overlay mount will be exported via NFS.
    ///
    /// Default: false (NFS export of container filesystems is uncommon).
    pub nfs_export: bool,

    /// fs-verity digest validation for lower layer files. When enabled,
    /// overlayfs verifies that lower-layer files have valid fs-verity
    /// digests matching the expected values stored in the upper layer's
    /// metacopy xattr. Provides content integrity for container image
    /// layers without requiring dm-verity on the entire block device.
    ///
    /// - `Off`: No verity checking.
    /// - `On`: Verify if digest is present; allow files without digest.
    /// - `Require`: Reject files that lack a valid fs-verity digest.
    ///
    /// Default: `Off`.
    pub verity: VerityMode,
}

/// Redirect directory mode.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum RedirectDirMode {
    /// Create and follow redirect xattrs.
    On,
    /// Follow existing redirect xattrs but do not create new ones.
    Follow,
    /// Do not follow redirect xattrs.
    NoFollow,
    /// Disable redirect handling; directory renames return EXDEV.
    Off,
}

/// Extended inode number composition mode.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum XinoMode {
    /// Always compose inode numbers.
    On,
    /// Never compose inode numbers.
    Off,
    /// Compose if underlying inode numbers fit.
    Auto,
}
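The composition rule can be sketched in a few lines of userspace Rust. This is an illustrative model only: the `xino_compose` name and the exact 32-bit split are assumptions for the sketch, matching the "<32-bit inode numbers" requirement stated for `On` mode.

```rust
/// Compose an overlay inode number from a layer index and a raw
/// underlying inode number. Returns None when the raw number does not
/// fit in 32 bits, which is the case where `XinoMode::Auto` falls back
/// to raw (potentially colliding) inode numbers.
fn xino_compose(layer_index: u16, raw_ino: u64) -> Option<u64> {
    if raw_ino >> 32 != 0 {
        return None; // raw inode number too large to compose
    }
    Some(((layer_index as u64) << 32) | raw_ino)
}

fn main() {
    // The same raw ino on two different layers yields distinct overlay inos.
    assert_ne!(xino_compose(0, 42), xino_compose(1, 42));
    // An oversized raw ino cannot be composed.
    assert_eq!(xino_compose(0, 1u64 << 32), None);
}
```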

/// fs-verity enforcement mode for lower layer files.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum VerityMode {
    /// No verity checking.
    Off,
    /// Verify if digest present; allow files without digest.
    On,
    /// Reject lower files without valid fs-verity digest.
    Require,
}
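The option combinations above imply a small set of validation rules (at least one lower layer, at most 500, upperdir and workdir only together). A minimal sketch of that validation, assuming a hypothetical `RawOverlayOptions` pre-resolution form holding raw paths before the VFS resolves them to `InodeId`s and compares superblocks:

```rust
/// Hypothetical pre-resolution form of the mount options: raw paths,
/// before InodeId resolution and superblock identity checks.
struct RawOverlayOptions {
    lower_dirs: Vec<String>,
    upper_dir: Option<String>,
    work_dir: Option<String>,
}

const MAX_LOWER_LAYERS: usize = 500;

/// lowerdir is colon-separated, topmost layer first.
fn parse_lowerdir(spec: &str) -> Vec<String> {
    spec.split(':').map(str::to_string).collect()
}

fn validate(opts: &RawOverlayOptions) -> Result<(), &'static str> {
    if opts.lower_dirs.is_empty() {
        return Err("at least one lower layer is required");
    }
    if opts.lower_dirs.len() > MAX_LOWER_LAYERS {
        return Err("too many lower layers");
    }
    // upperdir and workdir must be given together: copy-up stages files
    // in workdir and renames them into upperdir.
    match (&opts.upper_dir, &opts.work_dir) {
        (Some(_), None) => Err("upperdir requires workdir"),
        (None, Some(_)) => Err("workdir given without upperdir"),
        _ => Ok(()),
    }
}

fn main() {
    let lower = parse_lowerdir("/lower2:/lower1");
    assert_eq!(lower, vec!["/lower2".to_string(), "/lower1".to_string()]);
    // Read-only overlay: lowerdir alone is valid.
    let ro = RawOverlayOptions { lower_dirs: lower, upper_dir: None, work_dir: None };
    assert!(validate(&ro).is_ok());
}
```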

14.8.2 Core Data Structures

/// Overlay filesystem superblock state. One instance per overlay mount.
/// Created by `OverlayFs::mount()` and stored in the `SuperBlock`'s
/// filesystem-private field.
pub struct OverlaySuperBlock {
    /// Lower layer inodes (topmost first). Index 0 is the highest-priority
    /// lower layer (searched first after upper). These are directory inodes
    /// on the underlying filesystems, held for the mount's lifetime.
    ///
    /// Heap-allocated rather than inline (`ArrayVec<_, 500>` would exceed
    /// the safe stack frame budget — each `OverlayLayer` contains an
    /// `InodeId`, a `SuperBlock` reference, and a `u16` index). The
    /// 500-layer maximum is enforced at mount validation time. Mount
    /// processing is a rare, non-hot-path operation where heap allocation
    /// is acceptable.
    pub lower_layers: Box<[OverlayLayer]>,

    /// Upper layer state. `None` for read-only overlay mounts.
    pub upper_layer: Option<OverlayLayer>,

    /// Work directory inode on the upper filesystem. Used as a staging
    /// area for atomic copy-up operations.
    pub work_dir: Option<InodeId>,

    /// Index directory inode (inside workdir). Used for NFS export file
    /// handle resolution and hard link tracking across copy-up.
    pub index_dir: Option<InodeId>,

    /// Parsed mount options (immutable after mount).
    pub config: OverlayMountOptions,

    /// The xattr prefix used for overlay-private xattrs. Either
    /// `"trusted.overlay."` (privileged) or `"user.overlay."` (userxattr
    /// mode). Stored once to avoid branching on every xattr operation.
    pub xattr_prefix: &'static [u8],

    /// Volatile session marker. If volatile mode is enabled, this is set
    /// to true after creating the `$workdir/work/incompat/volatile`
    /// sentinel directory. On mount, if the sentinel exists from a
    /// previous unclean session, mount fails with EINVAL.
    pub volatile_active: bool,

    /// True if this overlay was mounted from within a user namespace or
    /// if the upper layer's filesystem mount is owned by a non-initial
    /// user namespace. When true, `metacopy` and `redirect_dir=on` are
    /// disabled regardless of mount options, `userxattr` mode is
    /// mandatory, and data-only lower layers are rejected.
    ///
    /// Set once at `OverlayFs::mount()` time by checking whether the
    /// calling process's user namespace is the initial user namespace
    /// (`current_user_ns() == &init_user_ns`). Immutable thereafter.
    ///
    /// See Section 14.4.6.1 for the full security model.
    pub userns_influenced: bool,
}

/// A single layer in the overlay stack.
pub struct OverlayLayer {
    /// Root directory inode of this layer on its underlying filesystem.
    pub root: InodeId,

    /// Superblock of the underlying filesystem. Arc reference held for
    /// the overlay mount's lifetime to prevent the underlying FS from
    /// being unmounted while the overlay is active.
    pub sb: Arc<SuperBlock>,

    /// Layer index (0 = upper or topmost lower; increases downward).
    /// Used for xino composition and for identifying which layer an
    /// overlay inode's data resides on.
    pub index: u16,
}

/// Atomic optional value using a sentinel for the `None` state.
/// `InodeId` of 0 represents `None` (inode 0 is never valid in any filesystem).
/// Provides lock-free read access via `Acquire` load and one-time write
/// via `compare_exchange` (for copy-up transitions from None -> Some).
pub struct AtomicOption<T: Into<u64> + From<u64>> {
    value: AtomicU64,  // 0 = None, non-zero = Some(T)
    _marker: PhantomData<T>,  // T appears only in the API, not in the layout
}

impl AtomicOption<InodeId> {
    pub fn none() -> Self {
        Self { value: AtomicU64::new(0), _marker: PhantomData }
    }
    pub fn load(&self) -> Option<InodeId> {
        match self.value.load(Ordering::Acquire) {
            0 => None,
            v => Some(InodeId(v)),
        }
    }
    /// Atomically transition from None to Some. Returns Err if already set.
    pub fn set_once(&self, val: InodeId) -> Result<(), InodeId> {
        self.value.compare_exchange(0, val.0, Ordering::AcqRel, Ordering::Acquire)
            .map(|_| ())
            .map_err(|v| InodeId(v))
    }
}
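The None -> Some transition can be exercised in a userspace model, with a plain `u64` standing in for `InodeId` (a sketch of the semantics above, not the kernel type):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Userspace model of AtomicOption: 0 is the None sentinel, matching the
/// invariant that inode 0 is never valid.
struct AtomicOptionU64 {
    value: AtomicU64,
}

impl AtomicOptionU64 {
    fn none() -> Self { Self { value: AtomicU64::new(0) } }

    fn load(&self) -> Option<u64> {
        match self.value.load(Ordering::Acquire) {
            0 => None,
            v => Some(v),
        }
    }

    /// None -> Some exactly once; Err carries the value already stored.
    fn set_once(&self, val: u64) -> Result<(), u64> {
        self.value
            .compare_exchange(0, val, Ordering::AcqRel, Ordering::Acquire)
            .map(|_| ())
    }
}

fn main() {
    let upper = AtomicOptionU64::none();
    assert_eq!(upper.load(), None);
    assert_eq!(upper.set_once(7), Ok(()));  // copy-up commits
    assert_eq!(upper.load(), Some(7));
    assert_eq!(upper.set_once(9), Err(7));  // second commit is rejected
}
```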

/// Per-inode overlay state. Tracks which layers contribute to a merged
/// view of this inode.
///
/// An `OverlayInode` is created on first lookup and cached in the VFS
/// inode cache. It is the filesystem-private data attached to the VFS
/// inode via `InodeId`.
pub struct OverlayInode {
    /// Inode in the upper layer. `Some` if the entry exists in upper
    /// (either originally or after copy-up). `None` if the entry exists
    /// only in lower layers.
    ///
    /// Protected by `copy_up_lock`: transitions from `None` to `Some`
    /// exactly once during copy-up. Once set, never changes back.
    /// Reads after copy-up are lock-free (Acquire load on the Option
    /// discriminant).
    pub upper: AtomicOption<InodeId>,

    /// Inode in the topmost lower layer that contains this entry.
    /// `None` if the entry exists only in upper (newly created file).
    pub lower: Option<LowerInodeRef>,

    /// 1 if this inode is a metacopy-only upper entry (metadata
    /// copied, data still in lower layer). Cleared to 0 after full
    /// data copy-up completes. Uses AtomicU8 (not AtomicBool) to avoid
    /// the bool validity invariant — Tier 1 intra-domain memory
    /// corruption from a co-domain module could write a non-0/1 value,
    /// which would be undefined behavior for AtomicBool.
    /// 0 = no metacopy, 1 = metacopy.
    pub metacopy: AtomicU8,

    /// True if this is an opaque directory. An opaque directory hides
    /// all entries from lower layers — readdir and lookup do not
    /// descend into lower layers below this point.
    pub opaque: bool,

    /// Redirect path for directory renames. When a merged directory is
    /// renamed in the upper layer, this field stores the original lower
    /// path so that lookups can find the renamed directory's lower
    /// contents. `None` for non-redirected entries.
    pub redirect: Option<Box<OsStr>>,

    /// Lock serializing copy-up operations on this inode. Only one
    /// thread may copy-up a given inode at a time. Other threads
    /// attempting to modify the same lower-layer file block on this
    /// lock until copy-up completes, then proceed against the upper copy.
    ///
    /// This is a `Mutex`, not an `RwLock`, because copy-up is an
    /// exclusive state transition (None -> Some). Read paths check
    /// `upper` with an Acquire load and only take the lock if they
    /// need to trigger copy-up.
    pub copy_up_lock: Mutex<()>,

    /// Overlay inode type. Needed because the overlay may present a
    /// different view than the underlying filesystem (e.g., a whiteout
    /// character device appears as "entry does not exist").
    pub inode_type: OverlayInodeType,
}

/// Reference to a lower-layer inode.
pub struct LowerInodeRef {
    /// Inode ID on the lower layer's filesystem.
    pub inode: InodeId,
    /// Which lower layer this inode resides on (index into
    /// `OverlaySuperBlock::lower_layers`).
    pub layer_index: u16,
}

/// Overlay inode type classification.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum OverlayInodeType {
    /// Regular file (may be metacopy).
    Regular,
    /// Directory (may be merged or opaque).
    Directory,
    /// Symbolic link.
    Symlink,
    /// Character device, block device, FIFO, or socket.
    Special,
    /// Whiteout entry (exists in upper layer to mark deletion of a
    /// lower-layer entry). Not visible to userspace — lookups return
    /// ENOENT. Internally represented as either a character device
    /// with major:minor 0:0 or a zero-size file with the
    /// `trusted.overlay.whiteout` xattr.
    Whiteout,
}

14.8.3 Overlay Dentry Operations

overlayfs requires custom DentryOps to handle the dynamic nature of the merged filesystem view. Copy-up changes which layer serves a file, so cached dentries must be revalidated.

/// overlayfs dentry operations.
impl DentryOps for OverlayDentryOps {
    /// Revalidate a cached overlay dentry.
    ///
    /// Returns `false` (forcing re-lookup) in these cases:
    /// 1. The overlay inode has been copied up since the dentry was cached
    ///    (detected by checking if `OverlayInode::upper` transitioned from
    ///    None to Some since the last lookup).
    /// 2. The underlying filesystem's dentry has been invalidated (delegates
    ///    to the underlying filesystem's `d_revalidate` if it implements one,
    ///    e.g., for NFS lower layers).
    /// 3. A whiteout has been created or removed in the upper layer for this
    ///    name (detected by checking upper-layer lookup result against cached
    ///    overlay state).
    ///
    /// Returns `true` (dentry is still valid) in all other cases.
    fn d_revalidate(&self, parent: InodeId, name: &OsStr) -> Result<bool>;

    /// Overlay dentries use the default VFS hash (SipHash-1-3).
    fn d_hash(&self, _name: &OsStr) -> Option<u64> {
        None
    }

    /// Overlay dentries are always eligible for LRU caching.
    fn d_delete(&self, _inode: InodeId, _name: &OsStr) -> bool {
        true
    }

    /// On dentry release, drop the overlay inode's references to
    /// underlying filesystem inodes.
    fn d_release(&self, inode: InodeId, name: &OsStr);
}

Dentry cache interaction: When a copy-up occurs, the overlay must invalidate the affected dentry in the VFS dentry cache (Section 14.1) so that subsequent lookups see the upper-layer inode instead of the stale lower-layer reference. The invalidation sequence:

  1. Copy-up completes (new file exists in upper layer).
  2. OverlayInode::upper is set via an atomic Release store.
  3. The overlay calls d_invalidate() on the parent directory's dentry for the affected name. This removes the dentry from the hash table and marks it for re-lookup.
  4. The next lookup for this name calls OverlayInodeOps::lookup(), which now finds the upper-layer entry and returns the updated OverlayInode.

Negative dentry handling: Negative dentries (cached ENOENT results) in the overlay dentry cache are invalidated when:

  - A new file is created in the upper layer (the negative dentry for that name must be purged).
  - A whiteout is removed (the previously-hidden lower-layer entry becomes visible again).

14.8.4 Lookup Algorithm

overlayfs lookup implements the layer search order:

OverlayInodeOps::lookup(parent: InodeId, name: &OsStr) -> Result<InodeId>:

  let overlay_parent = get_overlay_inode(parent)

  // Step 1: Search upper layer (if writable overlay).
  if let Some(upper_dir) = overlay_parent.upper.load() {
      match underlying_lookup(upper_dir, name) {
          Ok(upper_inode) => {
              // Check if this is a whiteout.
              if is_whiteout(upper_inode) {
                  // Entry was deleted. Do NOT search lower layers.
                  // Cache a negative dentry.
                  return Err(ENOENT)
              }
              // Check if this is an opaque directory.
              let opaque = is_opaque_dir(upper_inode)
              // Found in upper. If directory and not opaque, may need
              // to merge with lower layers.
              if is_directory(upper_inode) && !opaque {
                  // Merged directory: upper exists, also search lower
                  // for the merge view.
                  let lower = find_in_lower_layers(overlay_parent, name)
                  return create_overlay_inode(Some(upper_inode), lower, ...)
              }
              // Non-directory or opaque directory: upper is authoritative.
              return create_overlay_inode(Some(upper_inode), None, ...)
          }
          Err(ENOENT) => {
              // Not in upper, fall through to lower layers.
          }
          Err(e) => return Err(e),  // Propagate I/O errors.
      }
  }

  // Step 2: Search lower layers (topmost first).
  // If parent directory has a redirect, follow it.
  for (layer_idx, lower_layer) in lower_layers_for(overlay_parent) {
      match underlying_lookup(lower_dir_at(lower_layer, overlay_parent), name) {
          Ok(lower_inode) => {
              if is_whiteout(lower_inode) {
                  // Whiteout in this lower layer. Stop searching.
                  return Err(ENOENT)
              }
              return create_overlay_inode(None, Some(LowerInodeRef {
                  inode: lower_inode,
                  layer_index: layer_idx,
              }), ...)
          }
          Err(ENOENT) => continue,  // Try next lower layer.
          Err(e) => return Err(e),
      }
  }

  // Not found in any layer.
  Err(ENOENT)

Whiteout detection: An upper-layer entry is a whiteout if either:

  - It is a character device with major:minor 0:0 (traditional format), OR
  - It is a zero-size regular file with the trusted.overlay.whiteout (or user.overlay.whiteout in userxattr mode) xattr set.

Both formats are supported for compatibility with existing container images. UmkaOS creates whiteouts using the xattr format by default (avoids requiring mknod capability for character device creation in unprivileged containers).

Opaque directory detection: A directory is opaque if it has the xattr trusted.overlay.opaque (or user.overlay.opaque) set to "y". An opaque directory hides all entries from lower layers — lookups do not descend past it. This is used when an entire directory is deleted and recreated in the upper layer.
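The two whiteout formats can be captured in a small predicate. This is a sketch over a hypothetical `UpperEntry` attribute struct; the real check reads the underlying inode's mode/rdev and its xattr list:

```rust
/// Hypothetical minimal view of an upper-layer entry's attributes.
struct UpperEntry {
    is_char_dev: bool,
    rdev: (u32, u32), // (major, minor)
    is_regular: bool,
    size: u64,
    xattrs: Vec<String>,
}

/// Whiteout test implementing both formats: char device 0:0, or a
/// zero-size regular file carrying the whiteout xattr (the prefix is
/// "trusted.overlay." or "user.overlay." depending on userxattr mode).
fn is_whiteout(e: &UpperEntry, xattr_prefix: &str) -> bool {
    let legacy = e.is_char_dev && e.rdev == (0, 0);
    let whiteout_xattr = format!("{xattr_prefix}whiteout");
    let xattr_form =
        e.is_regular && e.size == 0 && e.xattrs.iter().any(|x| *x == whiteout_xattr);
    legacy || xattr_form
}

fn main() {
    let dev = UpperEntry { is_char_dev: true, rdev: (0, 0), is_regular: false,
                           size: 0, xattrs: vec![] };
    assert!(is_whiteout(&dev, "trusted.overlay."));

    let marked = UpperEntry { is_char_dev: false, rdev: (0, 0), is_regular: true,
                              size: 0, xattrs: vec!["user.overlay.whiteout".into()] };
    assert!(is_whiteout(&marked, "user.overlay."));   // userxattr mode
    assert!(!is_whiteout(&marked, "trusted.overlay.")); // wrong namespace
}
```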

14.8.5 Copy-Up Protocol

Copy-up is the central operation of overlayfs. When a lower-layer file must be modified, its contents (and/or metadata) are first copied to the upper layer. The copy-up must be atomic from the perspective of concurrent readers: at no point should a reader see a partially-copied file.

Full copy-up algorithm (for regular files when metacopy is disabled, or on first write to a metacopy-only file):

copy_up(overlay_inode: &OverlayInode) -> Result<InodeId>:

  // Fast path: already copied up.
  if let Some(upper) = overlay_inode.upper.load(Acquire) {
      if overlay_inode.metacopy.load(Acquire) == 0 {
          return Ok(upper)  // Fully copied up already.
      }
      // Metacopy exists but needs full data copy. Fall through.
  }

  // Slow path: take copy-up lock.
  let _guard = overlay_inode.copy_up_lock.lock()

  // Double-check after acquiring lock (another thread may have completed
  // copy-up while we waited).
  if let Some(upper) = overlay_inode.upper.load(Acquire) {
      if overlay_inode.metacopy.load(Acquire) == 0 {
          return Ok(upper)
      }
  }

  let lower = overlay_inode.lower.as_ref().expect("copy-up requires lower");
  let sb = overlay_super_block()

  // Step 1: Ensure parent directory exists in upper layer.
  // Recursively copy-up parent directories if needed.
  let upper_parent = ensure_upper_parent(overlay_inode)

  // Step 2: Create temporary file in workdir (same filesystem as upper).
  // The workdir is on the same device as upperdir, enabling atomic rename.
  let tmp_name = generate_temp_name()  // e.g., "#overlay.XXXXXXXX"
  let tmp_inode = underlying_create(sb.work_dir, tmp_name, lower_mode)

  // Step 3: Copy metadata from lower to tmp.
  let lower_attr = underlying_getattr(lower.inode)

  // CVE-2023-0386 mitigation: Verify that the source file's UID/GID are valid
  // in the overlay mount's user namespace. A setuid file in the lower layer
  // whose UID has no mapping in the overlay's userns must NOT be copied up
  // with elevated privileges. Reject with EOVERFLOW if unmappable.
  if !from_kuid_munged(overlay_mnt_userns, lower_attr.uid).is_valid()
      || !from_kgid_munged(overlay_mnt_userns, lower_attr.gid).is_valid() {
      underlying_unlink(sb.work_dir, tmp_name)  // clean up temp file
      return Err(EOVERFLOW)
  }

  underlying_setattr(tmp_inode, &lower_attr)  // owner, mode, timestamps

  // Step 4: Copy xattrs from lower to tmp.
  // Filter out overlay-private xattrs (trusted.overlay.*).
  copy_xattrs_filtered(lower.inode, tmp_inode, sb.xattr_prefix)

  // Step 5: Copy file data (skip if metacopy mode and this is a
  // metadata-only copy-up triggered by chmod/chown/utimes).
  if !metacopy_only {
      copy_file_data_chunked(lower.inode, tmp_inode, &overlay_inode.copy_up_lock)
      // Uses chunked I/O with periodic lock release. See "Chunked Copy-Up
      // and Cgroup I/O Throttling" below for the algorithm.
  } else {
      // Set metacopy xattr on the tmp file. This marks it as containing
      // metadata only — data will be copied on first write.
      underlying_setxattr(tmp_inode,
          concat(sb.xattr_prefix, "metacopy"), b"", 0)

      // If the lower file is itself a metacopy (nested overlay), follow
      // the redirect chain to find the actual data source.
      if let Some(origin) = get_metacopy_origin(lower.inode) {
          underlying_setxattr(tmp_inode,
              concat(sb.xattr_prefix, "origin"), &encode_fh(origin), 0)
      }
  }

  // Step 6: Set security context on tmp file.
  // Copy security.* xattrs that the security framework requires.

  // Step 7: Atomic rename from workdir to upperdir.
  // This is the commit point. Before this rename, the copy-up is invisible
  // to other processes. After this rename, the upper-layer file is live.
  underlying_rename(sb.work_dir, tmp_name, upper_parent, target_name,
                    RenameFlags::RENAME_NOREPLACE)

  // Step 8: Update overlay inode state.
  let upper_inode = underlying_lookup(upper_parent, target_name)
  // set_once(): CAS from None to Some. The copy_up_lock guarantees a single
  // writer, so set_once always succeeds; the expect() documents that invariant.
  overlay_inode.upper.set_once(upper_inode)
      .expect("copy_up_lock guarantees single writer");
  if metacopy_only {
      overlay_inode.metacopy.store(1, Release)
  }

  // Step 9: Invalidate the dentry cache entry for this name.
  // Forces subsequent lookups to see the upper-layer version.
  d_invalidate(upper_parent, target_name)

  Ok(upper_inode)

Atomicity guarantee: The rename in Step 7 is the single atomic commit point. If the system crashes before Step 7, the temporary file in workdir is orphaned and cleaned up on next mount (overlayfs scans workdir for stale temporaries during mount() and removes them). If the system crashes after Step 7, the upper-layer file is complete and consistent.
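The mount-time orphan scan reduces to filtering workdir entries by the temporary-name prefix. A sketch, assuming the `#overlay.` prefix from the `generate_temp_name()` example above (illustrative; any reserved prefix that committed files can never carry works, since the final rename gives committed files their real names):

```rust
/// Names in workdir that are stale copy-up temporaries from a crashed
/// session and should be unlinked at mount time.
fn stale_temporaries(workdir_entries: &[&str]) -> Vec<String> {
    workdir_entries
        .iter()
        .filter(|name| name.starts_with("#overlay."))
        .map(|name| name.to_string())
        .collect()
}

fn main() {
    // "work" and "index" are overlay-managed subdirectories; only the
    // orphaned temporary is selected for removal.
    let entries = ["work", "index", "#overlay.a1b2c3d4"];
    assert_eq!(stale_temporaries(&entries), vec!["#overlay.a1b2c3d4"]);
}
```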

Error recovery (runtime failures): Each step that can fail must clean up all prior steps before returning an error to the caller. The copy-up state machine tracks progress through four states:

/// Copy-up state machine. Tracks the current phase of a copy-up operation
/// for error recovery. Stored on the stack (not persistent — crash recovery
/// uses workdir scan, not state machine replay).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum CopyUpState {
    /// Step 2: Creating the temporary file in workdir.
    /// Cleanup on failure: none (nothing created yet).
    Creating,

    /// Steps 3-6: Copying metadata, xattrs, data, and security context
    /// to the temporary file.
    /// Cleanup on failure: unlink temporary file from workdir.
    Copying,

    /// Step 7: Atomic rename from workdir to upperdir.
    /// Cleanup on failure: unlink temporary file from workdir.
    Renaming,

    /// Steps 8-9: Rename succeeded. Updating overlay inode state and
    /// invalidating dentry cache.
    /// No cleanup needed — the upper file is committed and will be
    /// found on retry if post-rename steps fail.
    Complete,
}

Error recovery protocol (driven by CopyUpState):

| CopyUpState when error occurs | Cleanup action | Returned error | Lower file status |
|---|---|---|---|
| `Creating` (Step 2 fails) | None (nothing was created) | EIO / ENOSPC | Unchanged |
| `Copying` (Steps 3-6 fail) | `underlying_unlink(workdir, tmp_name)` removes the partial temp file | EIO / ENOSPC | Unchanged |
| `Renaming` (Step 7 fails) | `underlying_unlink(workdir, tmp_name)` removes the complete-but-uncommitted temp file | EIO | Unchanged |
| `Complete` (Steps 8-9 fail) | None (rename already committed; upper file is live and will be discovered on retry) | EIO (rare) | Now shadowed by upper |

The error recovery path:

copy_up_with_recovery(overlay_inode) -> Result<InodeId>:
  let mut state = CopyUpState::Creating;

  let result = (|| -> Result<InodeId> {
      // Step 2: Create temp file.
      let tmp = underlying_create(sb.work_dir, tmp_name, mode)?;
      state = CopyUpState::Copying;

      // Steps 3-6: Copy metadata, xattrs, data, security context.
      copy_metadata(lower, tmp)?;
      copy_xattrs_filtered(lower, tmp, prefix)?;
      if !metacopy_only {
          match copy_file_data_chunked(lower, tmp, &copy_up_lock) {
              Ok(()) => {}
              Err(EALREADY) => {
                  // Another thread completed copy-up while we released
                  // the lock between chunks. Clean up our temp file and
                  // return the already-committed upper inode.
                  let _ = underlying_unlink(sb.work_dir, tmp_name);
                  return Ok(overlay_inode.upper.load(Acquire).unwrap())
              }
              Err(e) => return Err(e),
          }
      }
      copy_security_context(lower, tmp)?;
      state = CopyUpState::Renaming;

      // Step 7: Atomic rename.
      underlying_rename(sb.work_dir, tmp_name, upper_parent, name,
                        RENAME_NOREPLACE)?;
      state = CopyUpState::Complete;

      // Steps 8-9: Update overlay state.
      overlay_inode.upper.set_once(upper)
          .expect("copy_up_lock guarantees single writer");
      d_invalidate(upper_parent, name);
      Ok(upper)
  })();

  if let Err(e) = &result {
      match state {
          CopyUpState::Creating => {
              // Nothing to clean up.
          }
          CopyUpState::Copying | CopyUpState::Renaming => {
              // Best-effort cleanup: remove the temporary file.
              if let Err(unlink_err) = underlying_unlink(sb.work_dir, tmp_name) {
                  // Unlink failed — orphan left in workdir.
                  // mount() workdir scan will remove it.
                  log_warn!("copy-up cleanup failed: {:?}", unlink_err);
              }
          }
          CopyUpState::Complete => {
              // Rename committed. Upper file is live. No cleanup.
          }
      }
  }
  result

Key invariant: The original lower-layer file is never modified or removed during copy-up. If any step fails before CopyUpState::Complete, the lower file remains intact and serves subsequent reads. The temporary file in workdir is either successfully cleaned up or left as an orphan for the next mount scan.

If cleanup of the temporary file itself fails (i.e., underlying_unlink() returns an error during recovery), the orphaned temporary is left in workdir and will be removed by the next mount() scan. The original copy-up failure is still returned to the caller as an error. The orphaned file does not affect correctness because the rename (Step 7) did not complete.

Parent directory copy-up: Directories are copied up recursively. When copying up /a/b/c/file.txt, if /a/b/c/ does not exist in upper, overlayfs creates /a/, then /a/b/, then /a/b/c/ in upper (each with appropriate metadata and the trusted.overlay.origin xattr pointing to the lower original). Only then does the file copy-up proceed. Each directory copy-up is itself atomic (created in workdir, renamed to upper).
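The ancestor walk can be modeled in userspace with a `HashSet` of paths standing in for the upper filesystem (a sketch; the real code creates each directory via the workdir-and-rename sequence and sets the origin xattr):

```rust
use std::collections::HashSet;

/// Walk the ancestors of the file being copied up, topmost first,
/// creating every directory missing from the upper layer. For
/// "/a/b/c/file.txt" this visits "/a", "/a/b", "/a/b/c" in order.
fn ensure_upper_parents(upper: &mut HashSet<String>, file_path: &str) {
    let components: Vec<&str> = file_path.trim_matches('/').split('/').collect();
    let mut dir = String::new();
    // All components except the last (the file itself) are directories.
    for comp in &components[..components.len().saturating_sub(1)] {
        dir.push('/');
        dir.push_str(comp);
        if !upper.contains(&dir) {
            upper.insert(dir.clone()); // mkdir in upper (atomic in real code)
        }
    }
}

fn main() {
    let mut upper = HashSet::new();
    ensure_upper_parents(&mut upper, "/a/b/c/file.txt");
    assert!(upper.contains("/a") && upper.contains("/a/b") && upper.contains("/a/b/c"));
    assert_eq!(upper.len(), 3);
}
```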

Hard link handling on copy-up: If a lower-layer file has multiple hard links (nlink > 1), all names referencing the same lower inode must resolve to the same upper inode after copy-up. The overlay maintains an index directory (inside workdir) that maps lower file handles to upper inodes. On copy-up, the overlay checks the index first:

  - If an index entry exists, the file was already copied up via another name. Create a hard link in upper rather than copying data again.
  - If no index entry exists, perform a full copy-up and record the mapping.

This index is also used for NFS export (mapping file handles across copy-up).
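The check-then-link decision can be sketched with a map from lower file handle to upper inode. Types here are simplified stand-ins (`u64` for both the file handle and the inode number); the real index lives in the workdir as directory entries keyed by encoded file handles:

```rust
use std::collections::HashMap;

/// Outcome of consulting the hard-link index during copy-up.
#[derive(Debug, PartialEq)]
enum CopyUpAction {
    /// Already copied up under another name: hard-link the existing
    /// upper inode instead of copying data again.
    LinkExisting(u64),
    /// First name to be copied up: perform the full copy.
    FullCopyUp,
}

/// `index` maps a lower-layer file handle to the upper inode produced by
/// an earlier copy-up of the same lower inode.
fn index_lookup_or_record(
    index: &mut HashMap<u64, u64>,
    lower_handle: u64,
    new_upper_ino: u64,
) -> CopyUpAction {
    if let Some(&upper) = index.get(&lower_handle) {
        CopyUpAction::LinkExisting(upper)
    } else {
        index.insert(lower_handle, new_upper_ino);
        CopyUpAction::FullCopyUp
    }
}

fn main() {
    let mut index = HashMap::new();
    // First name referencing lower handle 0xabc: full copy-up to ino 900.
    assert_eq!(index_lookup_or_record(&mut index, 0xabc, 900),
               CopyUpAction::FullCopyUp);
    // Second hard link to the same lower inode: link, do not copy again.
    assert_eq!(index_lookup_or_record(&mut index, 0xabc, 901),
               CopyUpAction::LinkExisting(900));
}
```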

14.8.5.1 Chunked Copy-Up and Cgroup I/O Throttling

Copy-up data transfer (Step 5) can be arbitrarily large — container images routinely contain multi-gigabyte database files, ML model weights, or log archives. A naive single-pass copy_file_data() holds copy_up_lock for the entire transfer duration. Under cgroup io.max throttling (Section 17.2), this duration is further extended because each write is subject to the cgroup's byte-rate and IOPS limits. The result: all other threads attempting any metadata or write operation on the same file block on copy_up_lock for the full throttled transfer time — potentially minutes for a large file under tight io.max limits.

Solution: Chunked copy-up with periodic lock release. The data copy is split into fixed-size chunks. Between chunks, the copy_up_lock is released, allowing blocked threads to observe the in-progress copy-up and either wait briefly or proceed (if the copy-up completed during their wait). The temporary file in workdir is not visible to other overlay operations until the final atomic rename (Step 7), so releasing the lock during data copy does not expose partial state.

/// Chunk size for copy-up data transfer. 2 MiB balances:
/// - Throughput: large enough to amortize splice/sendfile setup overhead.
/// - Latency: small enough that lock-hold time per chunk is bounded (~2ms
///   at 1 GB/s disk throughput, longer under io.max throttling).
/// - Memory: the chunk is transferred in-place (splice) or via a bounded
///   kernel buffer, not allocated as a contiguous 2 MiB region.
const COPY_UP_CHUNK_SIZE: u64 = 2 * 1024 * 1024;  // 2 MiB

copy_file_data_chunked(
    lower: InodeId,
    tmp: InodeId,
    lock: &Mutex<()>,
) -> Result<()>:
  let file_size = underlying_getattr(lower).size
  let mut offset: u64 = 0

  while offset < file_size {
      let chunk_len = min(COPY_UP_CHUNK_SIZE, file_size - offset)

      // Copy one chunk. Uses splice/sendfile for zero-copy where the
      // underlying filesystem supports it; falls back to read+write.
      // The write side is subject to the calling task's cgroup io.max
      // throttling — cgroup_io_throttle() is called inside the block
      // layer's submit_bio() path, which may sleep if the cgroup's
      // byte-rate or IOPS budget is exhausted.
      copy_file_range(lower, tmp, offset, chunk_len)?

      offset += chunk_len

      if offset < file_size {
          // Release the copy-up lock between chunks. This allows other
          // threads blocked on copy_up_lock to wake and re-check state.
          // They will find upper still None (the rename has not happened)
          // and re-acquire the lock. The first thread to re-acquire
          // continues the copy from where it left off.
          //
          // SAFETY of releasing mid-copy: the temporary file is in workdir,
          // invisible to overlay lookups. No concurrent thread can observe
          // partial data. The only visible state change is the lock
          // becoming briefly available, which lets waiters check for
          // completion and yields the CPU to higher-priority tasks.
          lock.unlock()

          // Yield point: allow the scheduler to run higher-priority tasks
          // (especially relevant when this copy-up is in a low-priority
          // cgroup). Also allows signal delivery (SIGKILL check).
          if signal_pending(current_task()) {
              // Copy-up interrupted by fatal signal. The temporary file
              // will be cleaned up by the error recovery path (CopyUpState::Copying).
              return Err(EINTR)
          }

          // Re-acquire the lock before the next chunk.
          lock.lock()

          // Double-check: another thread may have completed the copy-up
          // while we released the lock (race between two writers on the
          // same metacopy file). If so, abandon our temp file — the
          // error recovery path will unlink it.
          if overlay_inode.upper.load(Acquire).is_some()
              && !overlay_inode.metacopy.load(Acquire) {
              return Err(EALREADY)  // caller detects this and returns Ok
          }
      }
  }

  Ok(())

Interaction with io.max throttling: Each chunk's write passes through the block layer's submit_bio() path. If the calling task's cgroup has io.max limits, the throttle check (cgroup_io_throttle() in Section 17.2) sleeps until the cgroup's token bucket replenishes. This sleep occurs while holding the copy-up lock for the current chunk only — at most COPY_UP_CHUNK_SIZE / wbps seconds per chunk hold. Between chunks, the lock is released, so other threads are unblocked.

Worst-case lock-hold time per chunk: COPY_UP_CHUNK_SIZE / min(disk_throughput, io.max.wbps). At the minimum practical io.max setting of 1 MB/s, a 2 MiB chunk holds the lock for ~2 seconds. This is acceptable because:

  1. Threads waiting on copy_up_lock are already blocked on a slow-path mutation.
  2. The 2-second hold is bounded and predictable, unlike the unbounded hold of a single-pass copy of a multi-gigabyte file.
  3. Under normal (unthrottled) I/O, the hold time per chunk is ~2 ms at 1 GB/s.
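
The bound can be checked with a back-of-the-envelope helper (illustrative only, not part of the kernel; the function name is hypothetical):

```rust
// Worst-case copy-up lock-hold time per chunk, in milliseconds.
// The effective write rate is the slower of raw disk throughput and the
// cgroup's io.max write byte limit (wbps), both in bytes per second.
const COPY_UP_CHUNK_SIZE: u64 = 2 * 1024 * 1024; // 2 MiB, as above

fn worst_case_hold_ms(disk_bps: u64, wbps: u64) -> u64 {
    let effective = disk_bps.min(wbps);
    COPY_UP_CHUNK_SIZE * 1000 / effective
}
```

At 1 GB/s unthrottled this yields ~2 ms per chunk; throttled to 1 MB/s it yields ~2 seconds, matching the figures above.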

EALREADY sentinel: When copy_file_data_chunked returns Err(EALREADY), the caller (copy_up_with_recovery) recognizes this as a benign race — another thread completed the copy-up. The caller cleans up its temporary file and returns the already-committed upper inode. This is handled in the CopyUpState::Copying error recovery branch (temp file unlink).

Signal handling: The signal_pending() check between chunks allows SIGKILL to abort a long-running copy-up promptly (within one chunk transfer time) instead of only after the entire file is copied. The error recovery path cleans up the partial temporary file.

14.8.6 Metacopy Mode

Metacopy is the performance-critical optimization for container startup. Without metacopy, any metadata operation (chmod, chown, utimes) on a lower-layer file triggers a full data copy. With metacopy enabled, only metadata is copied, and data copy is deferred until the file is opened for writing.

Metacopy lifecycle:

State transitions for a file in metacopy mode:

  [Lower-only]
      │ chmod/chown/utimes/setxattr
  [Metacopy in upper]   ← metadata copied, data in lower
      │                    upper has trusted.overlay.metacopy xattr
      │ open(O_WRONLY/O_RDWR) or truncate
  [Full copy-up]         ← data + metadata in upper
                           trusted.overlay.metacopy xattr removed
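
The transitions above can be sketched as a small state machine (an illustrative userspace sketch; the enum and function names are not the kernel's actual types):

```rust
/// The three metacopy states from the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MetacopyState {
    LowerOnly,     // file exists only in a lower layer
    MetacopyUpper, // metadata in upper (metacopy xattr set), data in lower
    FullUpper,     // data + metadata in upper, metacopy xattr removed
}

#[derive(Debug, Clone, Copy)]
enum Event {
    MetadataChange, // chmod / chown / utimes / setxattr
    WriteOpen,      // open(O_WRONLY | O_RDWR) or truncate
}

fn transition(state: MetacopyState, ev: Event) -> MetacopyState {
    use MetacopyState::*;
    match (state, ev) {
        (LowerOnly, Event::MetadataChange) => MetacopyUpper, // metadata-only copy-up
        (LowerOnly, Event::WriteOpen) => FullUpper,          // direct full copy-up
        (MetacopyUpper, Event::WriteOpen) => FullUpper,      // deferred data copy
        (s, _) => s,                                         // all other events: no change
    }
}
```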

Realfile mechanism: When an overlayfs file is opened, the OpenFile's f_mapping (AddressSpace pointer) is redirected to the underlying real inode's AddressSpace (upper if it exists, else lower). This ensures page cache operations go directly to the real filesystem. On copy-up, f_mapping is redirected from lower to upper. Overlayfs has no page cache of its own; it delegates entirely to the underlying filesystems.

Read path for metacopy files: When a metacopy file is opened for reading (O_RDONLY), data is served from the lower layer. The OverlayFileOps::read() implementation checks overlay_inode.metacopy and dispatches to the lower-layer FileOps::read() with the lower inode. No data copy occurs.

Concurrent metacopy copy-up: If two tasks trigger copy-up for the same metacopy inode simultaneously, the first task to acquire the inode's copy_up_lock mutex (see OverlayInode::copy_up_lock) performs the copy-up. The second task waits on copy_up_lock and, after acquiring it, checks whether copy-up already completed (oi.upper is now Some). If so, it redirects f_mapping to the upper inode and proceeds without repeating the copy-up.
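
The "lock, then re-check" pattern can be demonstrated in userspace with ordinary threads (a sketch, not kernel code; the struct and the inode id 42 are illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Many threads race to populate `upper`; only the first to take the
/// lock performs the (expensive) copy-up. Later threads observe `Some`
/// after acquiring the lock and skip the copy. `copies` counts how many
/// times the copy actually ran.
struct OverlayInodeSketch {
    upper: Mutex<Option<u64>>, // Some(upper_inode_id) once copied up
    copies: AtomicUsize,
}

fn copy_up(oi: &OverlayInodeSketch) -> u64 {
    let mut upper = oi.upper.lock().unwrap();
    if let Some(id) = *upper {
        return id; // another thread completed the copy-up while we waited
    }
    oi.copies.fetch_add(1, Ordering::Relaxed); // the one real copy
    *upper = Some(42); // pretend 42 is the new upper inode id
    42
}

fn race(n_threads: usize) -> (u64, usize) {
    let oi = Arc::new(OverlayInodeSketch {
        upper: Mutex::new(None),
        copies: AtomicUsize::new(0),
    });
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let oi = Arc::clone(&oi);
            thread::spawn(move || copy_up(&oi))
        })
        .collect();
    let ids: Vec<u64> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    (ids[0], oi.copies.load(Ordering::Relaxed))
}
```

However many threads race, exactly one copy runs and every caller sees the same upper inode.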

Write trigger: When a metacopy file is opened for writing (O_WRONLY, O_RDWR) or truncated, the overlay triggers a full data copy-up before allowing the write:

impl FileOps for OverlayFileOps {
    fn open(&self, inode: InodeId, flags: OpenFlags) -> Result<u64> {
        let oi = get_overlay_inode(inode);

        // If opening for write and file is metacopy-only, trigger
        // full data copy-up before returning the fd.
        if flags.is_writable() && oi.metacopy.load(Acquire) {
            copy_up_data(oi)?;
            // copy_up_data() copies file data from lower to upper,
            // removes the metacopy xattr, and clears oi.metacopy.
        }

        // Delegate open to the appropriate underlying filesystem.
        if let Some(upper) = oi.upper.load(Acquire) {
            underlying_open(upper, flags)
        } else {
            // Read-only open on a lower-only file. No copy-up needed.
            underlying_open(oi.lower.unwrap().inode, flags)
        }
    }
}

14.8.6.1 Metacopy Trust Model and Security Constraints

The metacopy mechanism is only safe when the kernel can trust that trusted.overlay.metacopy (or user.overlay.metacopy in userxattr mode) was written by the overlay itself during a copy-up, not forged by a process with write access to the upper layer. If forged, an attacker could create a file whose upper stub has a redirect xattr pointing to an arbitrary path in a lower layer, then set the metacopy xattr to tell the kernel to serve lower-layer data through the stub — exposing files the attacker would not otherwise be able to read via the overlay's merged view.

Xattr namespace privilege boundary

The trusted. xattr namespace is the primary safeguard. The kernel checks CAP_SYS_ADMIN via capable() — which verifies the capability against the initial user namespace — not via ns_capable() (which would accept a user namespace root). This means:

A process that holds CAP_SYS_ADMIN only within a user namespace (i.e., container root mapped to an unprivileged host UID) cannot set or read trusted.* xattrs on the host filesystem. Only a process with CAP_SYS_ADMIN in the initial user namespace can write trusted.overlay.* xattrs.

This provides complete protection for overlayfs mounts created in the initial user namespace: container processes cannot forge trusted.overlay.metacopy or trusted.overlay.redirect xattrs because they lack the required capability on the host filesystem.

Privileged container caveat: A container that runs with CAP_SYS_ADMIN in the initial user namespace (not just the container's user namespace) CAN write trusted.* xattrs and could forge metacopy stubs. This is a known trust boundary: granting CAP_SYS_ADMIN in the initial namespace to a container is equivalent to granting root on the host. Operators who grant this should not rely on overlayfs metacopy for isolation.

User-namespace-influenced mounts: the attack surface

Since Linux 5.11, overlayfs can be mounted from within a user namespace (CAP_SYS_ADMIN in the user namespace that owns the mount namespace suffices to call mount("overlay", ...)). Such mounts are required to use userxattr mode (-o userxattr), which substitutes the user.overlay.* xattr namespace for trusted.overlay.*. Unlike trusted.*, the user.* namespace is writable by the file owner without any privilege — specifically, the unprivileged host UID that the container root maps to can set user.overlay.metacopy and user.overlay.redirect xattrs on files in the upper layer.

A user-namespace-influenced mount is defined as any overlayfs mount where either:

  1. The overlayfs mount() call was made from within a user namespace (the calling process's user namespace is not the initial user namespace), or
  2. The upper directory's owning user namespace differs from the initial user namespace (detected by comparing the user namespace of the mount namespace that created the upper directory's filesystem mount against init_user_ns).

Enforcement: metacopy disabled for user-namespace-influenced mounts

UmkaOS enforces the following rule at mount time and at metacopy lookup time:

Mount-time enforcement: When OverlayFs::mount() is called from a process not in the initial user namespace, the metacopy and redirect_dir options are forced to off regardless of what the caller requested. The mount proceeds with these features disabled. The kernel logs:

overlayfs: metacopy and redirect_dir disabled for user-namespace mount (CVE mitigation, Section 14.8.6.1)

This matches Linux's behaviour (since kernel 5.11, user-namespace overlayfs mounts are restricted to userxattr mode and metacopy is not permitted unless the caller has CAP_SYS_ADMIN in the initial user namespace).

The OverlaySuperBlock records whether the mount is user-namespace-influenced:

pub struct OverlaySuperBlock {
    // ... existing fields ...

    /// True if this overlay was mounted from within a user namespace (the
    /// mounting process's user namespace is not the initial user namespace)
    /// or if the upper layer's filesystem mount is owned by a non-initial
    /// user namespace. When true, metacopy and redirect_dir are disabled
    /// regardless of mount options, and userxattr mode is mandatory.
    ///
    /// Set once at mount time; immutable thereafter.
    pub userns_influenced: bool,
}

Lookup-time enforcement: Even if metacopy is enabled in the mount options, the metacopy lookup path checks userns_influenced before reading or acting on any metacopy xattr:

/// Attempt to read a metacopy stub from the given upper-layer dentry.
/// Returns `None` (treat as a regular upper file) if:
///   - The mount is user-namespace-influenced, or
///   - No metacopy xattr is present, or
///   - The xattr value fails validation.
fn ovl_lookup_metacopy(dentry: &Dentry, sb: &OverlaySuperBlock) -> Option<OverlayMetacopy> {
    // Never trust metacopy xattrs from user-namespace-influenced mounts.
    // The xattr namespace used by such mounts (user.overlay.*) is writable
    // by the file owner without privilege, so any metacopy xattr present
    // must be treated as potentially forged.
    if sb.userns_influenced {
        return None;
    }

    // Read the metacopy xattr from the upper-layer file.
    let xattr_name = concat_static(sb.xattr_prefix, "metacopy");
    let xattr = dentry.get_xattr(xattr_name)?;

    // Validate xattr value. The Linux-compatible format is either empty
    // (legacy, no digest) or a 4+N byte structure: 4-byte header followed
    // by an optional fs-verity SHA-256 digest (32 bytes). Reject anything
    // that does not match either form.
    validate_metacopy_xattr(xattr)
}
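
A minimal sketch of the length check that validate_metacopy_xattr performs, based on the format described in the comment above (the two accepted shapes are an empty legacy value, a bare 4-byte header, or a 4-byte header plus a 32-byte fs-verity SHA-256 digest; header field semantics are not checked in this sketch):

```rust
/// Accept only the Linux-compatible metacopy xattr shapes: empty
/// (legacy), a 4-byte header, or a 4-byte header + 32-byte digest.
/// Anything else is treated as forged or corrupt and rejected.
fn validate_metacopy_xattr(value: &[u8]) -> bool {
    matches!(value.len(), 0 | 4 | 36)
}
```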

The lookup-time check is defence-in-depth: the mount-time enforcement already prevents metacopy=on from reaching OverlaySuperBlock::config on user-namespace mounts, so ovl_lookup_metacopy would not be called. The redundant check in ovl_lookup_metacopy protects against future code paths that might bypass the mount-time gate.

Userxattr mode and data-only layers

When userxattr=on is set (required for user-namespace mounts), user.overlay.* xattrs are used throughout. The user.overlay.redirect xattr controls directory rename semantics and, in data-only layer configurations, points metacopy stubs to their data sources. Because user.* xattrs are writable by the file owner, and because data-only layer configurations allow a metacopy file in one lower layer to redirect to a file in a data-only lower layer via user.overlay.redirect:

  • redirect_dir=on is disallowed for user-namespace-influenced mounts (forced to off at mount time).
  • Data-only lower layers are disallowed for user-namespace-influenced mounts: OverlayFs::mount() returns EPERM if any lower layer path is specified with the :: data-only separator syntax when userns_influenced is true.

These restrictions prevent the user.overlay.redirect xattr from being used to point a metacopy stub in one layer at a file in another layer that the container would not otherwise be able to access.

Summary of security invariants

  • Initial-user-namespace mount, metacopy=on: trusted.overlay.* metacopy is trusted (forging requires host CAP_SYS_ADMIN). user.overlay.* metacopy is not applicable (userxattr is not used in privileged mounts by default).
  • User-namespace mount: trusted.overlay.* metacopy is not applicable (trusted.* is inaccessible from the user namespace). user.overlay.* metacopy is disabled (forced off at mount time; ovl_lookup_metacopy returns None).
  • User-namespace mount, userxattr=on, data-only layers: rejected at mount time (EPERM).

14.8.7 Directory Operations

Readdir merge: Reading a merged directory (one that exists in both upper and lower layers) requires combining entries from all layers, excluding whiteouts and applying opaque directory semantics.

OverlayFileOps::readdir(inode, private, offset, emit) -> Result<()>:

  let oi = get_overlay_inode(inode)

  // Phase 1: Collect entries from upper layer.
  let mut seen: HashSet<OsString> = HashSet::new()
  if let Some(upper) = oi.upper.load(Acquire) {
      underlying_readdir(upper, |entry_inode, entry_off, ftype, name| {
          // Skip whiteout entries — they indicate deleted lower entries.
          if is_whiteout_entry(entry_inode) {
              seen.insert(name.to_owned())  // Track for lower suppression.
              return true  // Continue iteration.
          }
          seen.insert(name.to_owned())
          emit(overlay_inode_for(entry_inode), entry_off, ftype, name)
      })
  }

  // Phase 2: If directory is opaque, stop here. Lower entries are hidden.
  if oi.opaque {
      return Ok(())
  }

  // Phase 3: Collect entries from lower layers, skipping duplicates.
  for lower_ref in lower_dirs_for(oi) {
      underlying_readdir(lower_ref.inode, |entry_inode, entry_off, ftype, name| {
          // Skip entries already seen in upper or higher lower layers.
          if seen.contains(name) {
              return true
          }
          // Skip whiteout entries from lower layers too.
          if is_whiteout_entry(entry_inode) {
              seen.insert(name.to_owned())
              return true
          }
          seen.insert(name.to_owned())
          emit(overlay_inode_for(entry_inode), entry_off, ftype, name)
      })
  }

  Ok(())

Deduplication uses byte-exact filename comparison. Case-insensitive upper/lower layer combinations may produce duplicate entries with different casing. This matches Linux overlayfs behavior.
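
The merge-and-dedup logic above can be exercised in isolation (a userspace sketch over in-memory entry lists; opaque-directory handling is omitted for brevity):

```rust
use std::collections::HashSet;

/// Each layer is a list of (name, is_whiteout) pairs, ordered
/// upper-first. A whiteout claims the name without emitting it, which
/// suppresses the same name in every lower layer; otherwise the first
/// occurrence of a name wins (byte-exact comparison, as above).
fn merge_readdir(layers: &[&[(&str, bool)]]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut out = Vec::new();
    for layer in layers {
        for &(name, is_whiteout) in layer.iter() {
            if !seen.insert(name) {
                continue; // already emitted or whited-out by a higher layer
            }
            if !is_whiteout {
                out.push(name.to_owned());
            }
        }
    }
    out
}
```

A whiteout in the upper layer hides the matching lower entry, while names unique to the lower layer still appear in the merged view.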

Readdir caching: The merged directory listing is cached in the overlay file's private state (returned by open()) for the lifetime of the open directory file descriptor. This matches Linux's behavior: the merge is computed once per opendir() and subsequent readdir() calls return entries from the cache. The cache is invalidated on rewinddir() (seek to offset 0).

Performance note on seen HashSet: The HashSet<OsString> in the pseudocode above is allocated once per opendir() call (during the initial merge), not once per readdir() call. The cache stores the deduplicated entry list; subsequent readdir() calls walk the already-merged cache without re-allocating or re-hashing. For large directories (>10,000 entries), the initial opendir() merge is O(N) with one allocation per distinct entry name (stored in the HashSet during merge, then released when the merge completes and entries are stored in a flat Vec in the file private state). The hot path — repeated readdir() calls iterating through the cached Vec — is O(entries) with zero heap allocations. Bound: The HashSet is bounded by the sum of directory entries across all layers for this single directory (typically <10,000 in container images; capped by the filesystem's max directory entries). This is a warm-path allocation (once per opendir(), bounded by directory size) and is acceptable per the collection usage policy (Section 3.13).

Directory rename (redirect_dir=on): When a merged directory is renamed, overlayfs cannot rename the lower-layer directory (it is read-only). Instead:

  1. Create the new directory name in the upper layer.
  2. Set the trusted.overlay.redirect xattr on the new upper directory, containing the absolute path (from the overlay root) of the original lower directory. Maximum redirect path: 256 bytes. Encoding: raw bytes (the underlying filesystem's filename encoding — typically UTF-8). No escaping; path components are separated by /. Paths exceeding 256 bytes cause the rename to fall back to full copy-up of the directory tree (no redirect xattr is set; the renamed directory becomes an opaque copy). This 256-byte limit is an UmkaOS implementation choice (Linux has no specific limit). The fallback to full directory copy-up preserves correctness.
  3. Lookups for the renamed directory follow the redirect: when searching lower layers, use the redirect path instead of the current name.
  4. Create a whiteout at the old name to hide the lower-layer original.

Opaque directory creation (rmdir + mkdir of same name):

  1. Create whiteout or opaque directory in upper layer.
  2. Set trusted.overlay.opaque xattr to "y" on the new upper directory.
  3. All lower-layer entries under this path are hidden.

14.8.8 Whiteout and Deletion

When a file or directory is deleted from a merged view, overlayfs must hide the lower-layer entry without modifying the lower layer:

File deletion (unlink on a merged file):

  1. If the file exists in upper: remove the upper entry via underlying_unlink().
  2. If the file exists in any lower layer: create a whiteout in the upper layer at the same path.
  3. Invalidate the dentry cache entry.

Directory deletion (rmdir on a merged directory):

  1. Verify the merged view of the directory is empty (no entries from any layer that are not whiteouts). Return ENOTEMPTY if non-empty.
  2. If the directory exists in upper: remove it.
  3. If the directory exists in lower: create an opaque whiteout in upper.

Whiteout creation:

/// Create a whiteout entry in the upper layer.
///
/// UmkaOS uses the xattr-based whiteout format by default: a zero-size
/// regular file with the overlay whiteout xattr set. This avoids
/// requiring mknod(2) capability (character device 0:0 creation
/// requires CAP_MKNOD in the filesystem's user namespace).
///
/// For compatibility, the character-device whiteout format is also
/// recognized on read (lookup).
fn create_whiteout(upper_parent: InodeId, name: &OsStr) -> Result<()> {
    let sb = overlay_super_block();

    // Create zero-size regular file.
    let whiteout = underlying_create(upper_parent, name,
        FileMode::regular(0o000))?;

    // Set the whiteout xattr.
    underlying_setxattr(whiteout,
        concat(sb.xattr_prefix, "whiteout"), b"y", XattrFlags::CREATE)?;

    Ok(())
}
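
The lookup side must recognize both whiteout formats mentioned in the comment above. A sketch of that check, over an illustrative stat-like struct (the field names are not the kernel's real structures):

```rust
/// Minimal stat summary for whiteout detection.
struct StatLite {
    is_char_dev: bool,
    rdev_major: u32,
    rdev_minor: u32,
    is_regular: bool,
    size: u64,
    has_whiteout_xattr: bool,
}

fn is_whiteout(st: &StatLite) -> bool {
    // Linux-compatible form: character device with device number 0:0.
    let chardev_form = st.is_char_dev && st.rdev_major == 0 && st.rdev_minor == 0;
    // UmkaOS default form: zero-size regular file carrying the whiteout xattr.
    let xattr_form = st.is_regular && st.size == 0 && st.has_whiteout_xattr;
    chardev_form || xattr_form
}
```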

RENAME_WHITEOUT integration: The VFS rename() with RENAME_WHITEOUT flag (already supported in InodeOps::rename(), Section 14.1) atomically renames a file and creates a whiteout at the old name. overlayfs uses this during copy-up of directory entries: when a file is copied from lower to upper, the old lower path is hidden by a whiteout created atomically with the rename.

14.8.9 Volatile Mode

Volatile mode disables all durability guarantees for the upper layer. This is a deliberate trade-off for ephemeral container workloads.

Behavior:

  • fsync(), fdatasync(), and sync_fs() on overlay files are no-ops (they return success without calling the underlying filesystem's sync).
  • On mount with volatile=true, create the sentinel directory $workdir/work/incompat/volatile/.
  • On unmount, remove the sentinel directory (clean shutdown).
  • On next mount, if the sentinel exists, return EINVAL with a diagnostic message: the previous volatile session was not cleanly unmounted, and the upper/work directories may be inconsistent. The operator must delete the upper and work directories and recreate them.
  • After any writeback error on the upper filesystem, subsequent fsync() calls on overlay files return EIO persistently (matching Linux's error stickiness behavior from Section 15.1).
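
The sentinel protocol can be mimicked with ordinary directory operations (a userspace sketch under the assumption that the sentinel path layout matches the description above; the function names are illustrative):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Clean mount: create the sentinel. Dirty previous session (sentinel
/// already present): refuse, mirroring the EINVAL the mount path returns.
fn volatile_mount(workdir: &Path) -> io::Result<()> {
    let sentinel = workdir.join("work/incompat/volatile");
    if sentinel.exists() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "previous volatile session was not cleanly unmounted",
        ));
    }
    fs::create_dir_all(&sentinel)
}

/// Clean shutdown: remove the sentinel so the next mount succeeds.
fn volatile_unmount(workdir: &Path) -> io::Result<()> {
    fs::remove_dir(workdir.join("work/incompat/volatile"))
}
```

Mounting twice without an intervening unmount fails; mount, unmount, mount succeeds.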

Container runtime usage: Docker enables volatile mode for containers started with --storage-opt overlay2.volatile=true. This is common for CI/CD runners, build containers, and test environments where container state is discarded after each run.

14.8.10 Extended Attribute Handling

overlayfs must handle xattrs carefully because it uses private xattrs for internal bookkeeping (whiteouts, metacopy, redirects, opaque markers) and must pass through user-visible xattrs correctly.

Xattr namespace partitioning:

  • trusted.overlay.* (or user.overlay.* in userxattr mode): Internal, overlay-private. Not visible to userspace via listxattr()/getxattr(). Used for whiteout, opaque, metacopy, redirect, and origin markers.
  • security.*: Pass-through with copy-up. Copied from lower to upper during copy-up; setxattr() triggers copy-up. Includes security.selinux, security.capability (file caps), and security.ima.
  • system.posix_acl_access, system.posix_acl_default: Pass-through with copy-up. POSIX ACLs are copied during copy-up; setfacl triggers copy-up.
  • user.* (excluding user.overlay.* in userxattr mode): Pass-through with copy-up. User-defined xattrs, copied during copy-up.
  • trusted.* (excluding trusted.overlay.*): Pass-through with copy-up. Only accessible to CAP_SYS_ADMIN processes; copied during copy-up.

getxattr/setxattr dispatch:

OverlayInodeOps::getxattr(inode, name, buf) -> Result<usize>:
  // Block access to overlay-private xattrs.
  if name.starts_with(overlay_xattr_prefix()) {
      return Err(ENODATA)
  }
  // Serve from upper if available, otherwise from lower.
  let target = upper_or_lower(inode)
  underlying_getxattr(target, name, buf)

OverlayInodeOps::setxattr(inode, name, value, flags) -> Result<()>:
  // Block writes to overlay-private xattrs.
  if name.starts_with(overlay_xattr_prefix()) {
      return Err(EPERM)
  }
  // setxattr triggers copy-up (xattr must be set on upper).
  let upper = copy_up(inode)?
  underlying_setxattr(upper, name, value, flags)

OverlayInodeOps::listxattr(inode, buf) -> Result<usize>:
  // List xattrs from upper (if exists) or lower.
  // Filter out overlay-private xattrs from the result.
  let target = upper_or_lower(inode)
  let raw = underlying_listxattr(target, buf)?
  filter_out_overlay_xattrs(buf, raw)
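
The filtering step at the end of listxattr can be sketched over plain string lists (illustrative; the real code filters a packed name buffer in place):

```rust
/// Drop overlay-private names (those under the mount's overlay xattr
/// prefix) from the list returned by the underlying filesystem; pass
/// every other name through unchanged.
fn filter_out_overlay_xattrs(names: Vec<String>, overlay_prefix: &str) -> Vec<String> {
    names
        .into_iter()
        .filter(|n| !n.starts_with(overlay_prefix))
        .collect()
}
```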

Nested overlayfs: When overlayfs is mounted on top of another overlayfs (nested container images, uncommon but valid), the inner overlay's xattrs must not collide with the outer overlay's. Linux handles this via "xattr escaping": the inner overlay stores its xattrs under trusted.overlay.overlay.* instead of trusted.overlay.*. UmkaOS implements the same escaping mechanism. This is transparent to the filesystem — the inner overlay simply uses a longer prefix.
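
The escaping is a pure prefix rewrite. A sketch of the transformation in one direction (illustrative; the kernel applies this when copying up files that carry the outer overlay's prefix):

```rust
/// Push an overlay xattr one nesting level deeper: a name under the
/// overlay prefix gains an extra "overlay." component, e.g.
/// "trusted.overlay.opaque" -> "trusted.overlay.overlay.opaque".
/// Names outside the prefix are returned unchanged.
fn escape_overlay_xattr(name: &str, prefix: &str) -> String {
    match name.strip_prefix(prefix) {
        Some(rest) => format!("{}overlay.{}", prefix, rest),
        None => name.to_owned(),
    }
}
```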

14.8.11 statfs Behavior

OverlayFs::statfs() returns statistics from the upper layer's filesystem (if present). For read-only overlays (no upper), statistics from the topmost lower layer are returned. This matches Linux behavior and ensures that df on a container's root filesystem shows the available space on the writable layer.

14.8.12 Inode Number Composition (xino)

To guarantee unique inode numbers across the merged view, overlayfs composes inode numbers from the underlying filesystem's inode number and the layer index:

composed_ino = (layer_index << xino_bits) | underlying_ino

Where xino_bits is the number of bits available for the underlying inode (typically 32 for ext4 with default inode sizes). This ensures that stat() returns unique inode numbers for files from different layers that happen to share the same underlying inode number (common when layers are on the same filesystem).

When xino=off or when underlying inode numbers exceed the available bit width, overlayfs falls back to using the underlying inode numbers directly. In this mode, st_dev differs between upper and lower files (the VFS assigns a unique device number per overlay mount), but st_ino may collide across layers. Applications that rely on (st_dev, st_ino) pairs for file identity (e.g., tar, rsync, find -inum) may exhibit incorrect behavior. xino=auto avoids this by enabling composition only when it is safe.
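
The composition and its fallback condition can be expressed directly (a sketch of the formula above; returning None models the fall-back to raw underlying inode numbers):

```rust
/// Compose a merged-view inode number: the layer index occupies the
/// high bits, the underlying inode number the low `xino_bits` bits.
/// Returns None when the underlying inode number does not fit, which
/// corresponds to the xino=auto fallback described above.
fn compose_ino(layer_index: u64, underlying_ino: u64, xino_bits: u32) -> Option<u64> {
    if underlying_ino >> xino_bits != 0 {
        return None; // underlying ino too wide for the reserved bits
    }
    Some((layer_index << xino_bits) | underlying_ino)
}
```

With xino_bits = 32, the same underlying inode number 7 on layer 0 and layer 1 yields two distinct composed numbers, which is exactly the collision the mechanism prevents.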

14.8.13 Mount and Unmount Flow

Mount:

OverlayFs::mount(source, flags, data) -> Result<SuperBlock>:

  1. Parse mount options from `data` into `OverlayMountOptions`.

  2. Determine user-namespace influence (security policy, Section 14.8.6.1):
     userns_influenced = (current_user_ns() != &init_user_ns)

     If userns_influenced:
       a. Force options.metacopy = false.
          Force options.redirect_dir = RedirectDirMode::Off.
          Log: "overlayfs: metacopy and redirect_dir disabled for
                 user-namespace mount (Section 14.8.6.1)"
       b. Require options.userxattr == true. If not set, return EPERM.
          (User-namespace mounts cannot use trusted.overlay.* xattrs.)
       c. If any lower_dir entry uses the data-only '::' separator syntax:
          return EPERM. (Data-only layers with userxattr are disallowed
          because user.overlay.redirect is owner-writable.)

  3. Resolve each lower_dir path to an InodeId via VFS path lookup.
     Verify each is a directory. Hold references for mount lifetime.

  4. If upper_dir is set:
     a. Resolve upper_dir to InodeId. Verify it is a writable directory.
     b. Resolve work_dir to InodeId. Verify same superblock as upper_dir.
     c. Check work_dir is empty.
     d. Create `$workdir/work/` subdirectory if it does not exist.
     e. If volatile mode:
        - Check for `$workdir/work/incompat/volatile/` sentinel.
          If exists: return EINVAL ("previous volatile session unclean").
        - Create the sentinel directory.
     f. If nfs_export: create `$workdir/index/` subdirectory.
     g. Clean stale temporary files from workdir (names starting with
        `#overlay.`). These are remnants of interrupted copy-ups.

  5. Verify upper filesystem supports required operations:
     - xattr support (getxattr/setxattr succeed with overlay prefix).
     - rename with RENAME_WHITEOUT (test with a dummy file in workdir).

  6. Construct `OverlaySuperBlock` with userns_influenced as determined
     in step 2, and `SuperBlock`.

  7. Register overlay dentry ops with the VFS.

  8. Emit mount options for /proc/mounts via show_options().

Unmount:

OverlayFs::unmount(sb) -> Result<()>:

  1. If volatile mode: remove sentinel directory
     `$workdir/work/incompat/volatile/`.

  2. Flush and release upper layer (must happen FIRST):
     a. Sync all dirty pages and metadata in the upper filesystem's
        writeback queue. This ensures that any copy-up data, whiteouts,
        and metadata changes written to the upper layer are on stable
        storage before the upper SuperBlock reference is released.
        Uses `sync_filesystem(upper_sb)` which issues a full barrier.
     b. Release the upper directory InodeId reference. This decrements
        the upper SuperBlock's mount reference count.
     c. Release the workdir InodeId reference (same SuperBlock as upper).

     The upper layer MUST be flushed and released before the lower layers
     because:
     - Dirty data in the upper layer may reference inodes from lower
       layers (metacopy files whose data still resides on a lower layer).
       If a lower SuperBlock were dropped first, its block device could
       be detached, making the lower data unreachable and causing I/O
       errors during upper flush.
     - The upper filesystem's journal commit may reference lower-layer
       block addresses (in filesystems like ext4 where the journal
       records physical block numbers). Releasing the lower device
       before journal commit would corrupt the journal.

  3. Release lower layer references (in reverse stacking order,
     topmost first):
     a. For each lower layer (from layer N down to layer 1):
        - Release the lower directory InodeId reference.
        - Decrement the lower SuperBlock's mount reference count.
     b. Reverse order ensures that if multiple lower layers share a
        SuperBlock (uncommon but valid), the last reference is released
        on the final iteration, not mid-traversal.
     c. Lower layers are read-only — no flush is needed. Their data
        is immutable for the lifetime of the overlay mount.

  4. Drop the OverlaySuperBlock (overlay's own VFS superblock metadata).
     At this point, all underlying filesystem references have been
     released. If this was the last mount referencing an underlying
     filesystem, that filesystem's kill_sb() is triggered, which
     flushes its own metadata and releases the block device.

Race with concurrent unmount of underlying filesystems: The VFS mount reference counting prevents an underlying filesystem from being unmounted while the overlay holds references to its inodes. An umount of the lower or upper filesystem while the overlay is mounted returns EBUSY (the overlay's InodeId references pin the underlying SuperBlock). This is identical to Linux behavior.

MountNamespace teardown cleanup order: During MountNamespace teardown, overlayfs cleanup follows the same ordering as explicit unmount: (1) flush pending copy-ups, (2) remove workdir temporary files, (3) unmount upper layer, (4) unmount lower layers. Workdir cleanup MUST precede upper unmount — the workdir is on the upper filesystem. If the workdir is cleaned after upper unmount, the workdir files become inaccessible and leak storage until the next fsck on the underlying filesystem.

14.8.14 Performance Characteristics

  Operation                      Overhead vs. direct access     Notes
  Path lookup (cached)           +1 hash lookup per component   Overlay dentry points to underlying dentry
  Read (lower-only file)         ~0%                            Direct delegation to lower filesystem
  Read (upper file)              ~0%                            Direct delegation to upper filesystem
  Read (metacopy file)           ~0%                            Reads from lower, same as lower-only
  Write (upper file)             ~0%                            Direct delegation to upper filesystem
  Write (first write, copy-up)   O(file_size), one-time         Sequential read+write of file data
  Write (metacopy first write)   O(file_size), one-time         Deferred from container startup
  chmod/chown (metacopy)         O(1), ~10 μs                   Metadata-only copy-up (no data copy)
  chmod/chown (no metacopy)      O(file_size)                   Full copy-up triggered
  readdir (merged)               O(entries × layers)            Hash-based dedup over all layers
  stat (cached)                  ~0%                            Overlay inode cached in VFS

Container startup optimization: With metacopy enabled, pulling and starting a container image avoids copying any file data during the initial setup phase (only metadata operations occur: chmod, chown, symlink creation for the container's init process). Data is copied lazily on first write. For typical container images (200-500 MB of layers), this reduces container start time from seconds to tens of milliseconds for the filesystem setup phase.

14.8.15 dm-verity Integration for Container Image Layers

Read-only lower layers in a container overlay can be protected by dm-verity (Section 9.3). The container runtime mounts each image layer's block device with dm-verity verification, then stacks them as overlayfs lower layers:

Container image mount sequence:
  1. Pull image layers: layer1.img, layer2.img, ..., layerN.img
  2. For each layer:
     a. Set up dm-verity on the layer's block device (Merkle tree
        verification, Section 9.2.6)
     b. Mount the verified block device read-only (ext4/XFS)
  3. Mount overlayfs:
     mount -t overlay overlay \
       -o lowerdir=/mnt/layerN:...:/mnt/layer1,upperdir=...,workdir=...
       /container/rootfs

This provides block-level integrity verification for all read-only container layers. The writable upper layer is covered by IMA (Section 9.5) for runtime integrity measurement of modified files. Together, dm-verity (lower layers) + IMA (upper layer) provide complete integrity coverage for container filesystems.

The optional verity=require mount option (Section 14.8) provides an additional layer of verification at the overlayfs level using fs-verity digests, independent of dm-verity block device verification.

14.8.16 Linux Compatibility

overlayfs is compatible with Linux's overlayfs at the mount interface and xattr format level:

  • Upper and lower directories created by Linux overlayfs are mountable by UmkaOS and vice versa. The xattr format (trusted.overlay.* names and values) is identical.
  • Mount option syntax matches Linux exactly (-o lowerdir=...,upperdir=..., workdir=...).
  • Whiteout formats (both character device 0:0 and xattr-based) are recognized.
  • Metacopy xattr format is compatible: layers created with metacopy=on on Linux work on UmkaOS.
  • redirect_dir xattr format and path encoding match Linux.
  • /proc/mounts output format matches Linux for container introspection tools.
  • /sys/module/overlay/parameters/* is not emulated (UmkaOS does not use kernel modules); per-mount options in the mount command are the sole configuration mechanism.

Docker/containerd/Podman compatibility: These runtimes interact with overlayfs exclusively through the mount(2) syscall and standard file operations. They do not use any overlayfs-specific ioctls or sysfs interfaces. UmkaOS's implementation of mount("overlay", ...) with the standard option string is sufficient for full compatibility. The overlay2 storage driver in Docker and the overlayfs snapshotter in containerd are fully supported.


14.9 binfmt_misc — Arbitrary Binary Format Registration

binfmt_misc is a VFS-level mechanism that allows userspace to register handlers for arbitrary binary formats, identified by magic bytes or file extension. When the kernel's exec path attempts to execute a file and neither the native ELF handler nor the #! script handler matches, the kernel delegates to a registered binfmt_misc interpreter. The registered interpreter binary is invoked with the original file path as an additional argument.

Critical use cases:

  • Multi-architecture containers: qemu-aarch64-static is registered as the interpreter for AArch64 ELF binaries, identified by the AArch64 ELF magic header. This allows running unmodified ARM64 Docker images on an x86-64 host without hardware virtualisation.
  • Java: .jar files executed as if they were executables via a registration that maps the .jar extension to /usr/bin/java -jar.
  • .NET: PE32+ executables identified by the MZ magic bytes are mapped to dotnet exec.
  • Wine: 16-bit and 32-bit Windows PE files mapped to wine.

14.9.1 Data Structures

/// A single registered binfmt_misc entry.
/// Kernel-internal, not KABI or wire format. `Option<[u8; N]>` fields use
/// the discriminant to distinguish "magic match" from "extension match"
/// without requiring sentinel values in the array.
pub struct BinfmtMiscEntry {
    /// Registration name. Shown as the filename under the binfmt_misc mount.
    /// Alphanumeric, hyphen, and underscore only. NUL-terminated.
    pub name:         [u8; 64],
    /// Matching strategy: magic bytes or file extension.
    pub match_type:   BinfmtMatch,
    /// Magic bytes to compare against file content (BinfmtMatch::Magic only).
    /// Maximum 128 bytes. Length of `magic` and `mask` must be equal.
    /// Comparison is byte-by-byte (no endianness interpretation) — each
    /// byte in the file at `magic_offset + i` is ANDed with `mask[i]` and
    /// compared to `magic[i]`. Multi-byte values embedded in magic patterns
    /// must be specified in the byte order they appear in the file.
    pub magic:        Option<[u8; 128]>,
    /// Length of the valid portion of `magic` and `mask` arrays.
    pub magic_len:    u8,
    /// Bitmask applied to each file byte before comparison with `magic`.
    /// A mask byte of `0xff` means "match exactly"; `0x00` means "ignore".
    pub mask:         Option<[u8; 128]>,
    /// Byte offset within the file at which `magic` is compared.
    pub magic_offset: u16,
    /// File extension string (BinfmtMatch::Extension only).
    /// Case-sensitive. Does not include the leading `.`. NUL-terminated.
    /// 32 bytes: 31 chars + NUL. Covers all reasonable extensions.
    pub extension:    Option<[u8; 32]>,
    /// Absolute path to the interpreter binary.
    pub interpreter:  [u8; PATH_MAX],
    /// Behavioural flags.
    pub flags:        BinfmtFlags,
    /// Whether this entry participates in exec matching.
    pub enabled:      AtomicBool,
}

/// How the entry identifies matching binaries.
pub enum BinfmtMatch {
    /// Match by magic bytes at a fixed offset within the file.
    Magic,
    /// Match by the file extension of the executed path.
    Extension,
}

bitflags! {
    /// Behavioural flags for a binfmt_misc entry.
    pub struct BinfmtFlags: u32 {
        /// Pass the original filename as argv[0] to the interpreter instead
        /// of substituting the interpreter path.
        const PRESERVE_ARGV0 = 0x01;
        /// Open the binary file and pass it to the interpreter as an open fd
        /// (via `/proc/self/fd/N`). Required when the binary is not
        /// world-readable and the interpreter runs without elevated privilege.
        const OPEN_BINARY    = 0x02;
        /// Use the credentials (uid, gid, capabilities) of the interpreter
        /// binary rather than those of the executed file. Equivalent to
        /// setuid execution for the interpreter.
        const CREDENTIALS    = 0x04;
        /// Fix binary: the interpreter is not itself subject to further
        /// binfmt_misc or personality transformation. Prevents recursion.
        const FIX_BINARY     = 0x08;
        /// Secure: do not grant elevated credentials even when the interpreter
        /// binary is setuid. Overrides CREDENTIALS for privilege de-escalation.
        const SECURE         = 0x10;
    }
}
/// Maximum registered binfmt_misc entries. Real systems have fewer than 64;
/// this bound makes the table fixed-size and avoids heap allocation on the
/// exec hot path.
pub const MAX_BINFMT_MISC: usize = 64;

The global entry table is an RcuCell<ArrayVec<Arc<BinfmtMiscEntry>, MAX_BINFMT_MISC>>. The exec path reads the table under an RCU read guard (lock-free) and performs a bounded scan (at most MAX_BINFMT_MISC entries). Registration, enable/disable, and removal are cold-path operations: the writer clones the current ArrayVec, applies the modification, and publishes the new version via RcuCell::update() (RCU grace period). The list is short in practice (fewer than 64 entries on any real system), so O(N) scan cost is negligible relative to exec overhead.

14.9.2 Registration Interface

The binfmt_misc filesystem is mounted at /proc/sys/fs/binfmt_misc (also accessible at /sys/kernel/umka/binfmt_misc/ via the umkafs namespace — see Section 20.5). It exposes:

| Path | Type | Description |
|------|------|-------------|
| register | write-only file | Register a new entry |
| status | read/write file | 1 = all entries active; 0 = all disabled globally |
| <name>/enabled | read/write file | 1 enable, 0 disable, -1 remove this entry |
| <name> | read-only file | Shows entry details (flags, interpreter, magic/extension) |

Writing to register or any <name>/enabled file requires Capability::SysAdmin in the caller's capability set.

Registration format (written as a single line to register):

:name:type:offset:magic:mask:interpreter:flags

The line's first character defines the field delimiter (: by convention). Any printable non-alphanumeric character may be used, which allows interpreter paths that themselves contain colons.

| Field | Description |
|-------|-------------|
| name | Identifier: alphanumeric, -, _. Maximum 63 characters. |
| type | M for magic-byte match; E for extension match. |
| offset | Decimal byte offset for magic comparison (type M). 0 for most formats. |
| magic | Hex-escaped bytes for type M (e.g., \x7fELF). Extension string for type E. |
| mask | Hex-escaped bitmask for type M; same length as magic. Empty for type E. |
| interpreter | Absolute path to the interpreter binary. Must exist at registration time. |
| flags | Subset of POCFS: P = PRESERVE_ARGV0, O = OPEN_BINARY, C = CREDENTIALS, F = FIX_BINARY, S = SECURE. |

Example — registering QEMU user-mode for AArch64 ELF binaries on an x86-64 host:

:qemu-aarch64:M:0:\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00::qemu-aarch64-static:OC
  • Type M, offset 0: compare 20 magic bytes starting at file byte 0.
  • No mask: all bytes compared exactly (\xff mask is implied).
  • O (OPEN_BINARY): interpreter receives file as fd, not path, for cross-uid access.
  • C (CREDENTIALS): interpreter's credentials govern setuid semantics.
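
The 20 magic bytes in this registration are exactly the fixed prefix of a little-endian AArch64 ELF executable's header. The sketch below (illustrative only, not kernel code; field names follow the ELF specification) reconstructs them field by field:

```rust
// Sketch: rebuild the qemu-aarch64 registration's 20 magic bytes from the
// ELF header fields they correspond to. Not kernel code.
pub fn aarch64_elf_magic() -> [u8; 20] {
    let mut m = [0u8; 20];
    m[0..4].copy_from_slice(b"\x7fELF"); // e_ident magic
    m[4] = 2; // EI_CLASS: ELFCLASS64
    m[5] = 1; // EI_DATA: ELFDATA2LSB (little-endian)
    m[6] = 1; // EI_VERSION: EV_CURRENT
    // m[7..16] remain zero: EI_OSABI (System V), EI_ABIVERSION, padding.
    m[16..18].copy_from_slice(&2u16.to_le_bytes());   // e_type: ET_EXEC
    m[18..20].copy_from_slice(&183u16.to_le_bytes()); // e_machine: EM_AARCH64 = 0xb7
    m
}
```

The last field illustrates the byte-order rule from Section 14.9.1: e_machine = 0xb7 appears in the pattern as \xb7\x00 because that is its little-endian order in the file.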

Parsing algorithm:

parse_registration(line: &[u8]) -> Result<BinfmtMiscEntry>:
  1. delimiter = line[0]
  2. Split line on delimiter into fields: [name, type, offset, magic_or_ext,
     mask, interpreter, flags_str].
  3. Validate name: alphanumeric + '-' + '_', length 1–63.
  4. Parse type: 'M' → BinfmtMatch::Magic, 'E' → BinfmtMatch::Extension.
  5. For type M:
     a. Parse offset as decimal u16.
     b. Decode hex-escaped bytes into magic array (max 128 bytes).
     c. If mask non-empty: decode hex-escaped bytes; must equal magic.len().
     d. If mask empty: fill mask with 0xff bytes (exact match).
  6. For type E:
     a. Validate extension: printable ASCII, no '/', no '.', max 31 chars.
     b. Store extension without leading '.'.
  7. Validate interpreter: starts with '/', exists in VFS (path lookup),
     is a regular file with execute permission for at least one uid.
  8. Parse flags_str: accept 'P', 'O', 'C', 'F', 'S' in any order.
  9. Construct BinfmtMiscEntry with enabled = AtomicBool::new(true).
  10. Clone current ArrayVec from RcuCell; reject if name already exists.
  11. Push Arc<BinfmtMiscEntry> to cloned table; RCU-publish via RcuCell::update().
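
Steps 1-2 and 5b can be sketched in isolation. This is a userspace sketch with hypothetical helper names (split_fields, decode_hex_escapes), not the kernel parser; it omits the length, flag, and interpreter validation from the remaining steps:

```rust
/// Step 1-2: the first character is the delimiter; the rest of the line
/// splits into exactly seven fields (name:type:offset:magic:mask:interp:flags).
fn split_fields(line: &str) -> Option<Vec<&str>> {
    let delim = line.chars().next()?;
    let fields: Vec<&str> = line[delim.len_utf8()..].split(delim).collect();
    (fields.len() == 7).then_some(fields)
}

/// Step 5b: decode hex escapes like `\x7f` into raw bytes; any other
/// character is a literal byte (e.g. the 'E', 'L', 'F' in `\x7fELF`).
fn decode_hex_escapes(s: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut bytes = s.bytes().peekable();
    while let Some(b) = bytes.next() {
        if b == b'\\' && bytes.peek() == Some(&b'x') {
            bytes.next(); // consume the 'x'
            let hi = bytes.next()?;
            let lo = bytes.next()?;
            out.push(u8::from_str_radix(&format!("{}{}", hi as char, lo as char), 16).ok()?);
        } else {
            out.push(b);
        }
    }
    Some(out)
}
```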

14.9.3 Exec Path Integration

During do_execve (Section 8.1), after the ELF handler and the #! script handler both decline the binary (return ENOEXEC), the kernel calls binfmt_misc_load_binary(file, argv, envp).

Matching algorithm:

binfmt_misc_load_binary(file, argv, envp) -> Result<()>:
  1. Acquire RCU read guard on global entry table (lock-free).
  2. If global status is disabled: return ENOEXEC.
  3. Read a probe buffer of min(128 + max_magic_offset, 256) bytes from
     offset 0 of `file`. This single read covers all registered magic ranges.
  4. For each entry in table order (bounded by MAX_BINFMT_MISC = 64):
     a. If !entry.enabled.load(Relaxed): skip.
     b. If entry.match_type == Magic:
        i.  end = entry.magic_offset as usize + entry.magic_len as usize.
        ii. If end > probe_buffer.len(): skip (file too short).
        iii.For each byte i in 0..magic_len:
              file_byte = probe[magic_offset + i] & mask[i]
              if file_byte != magic[i] & mask[i]: break → no match
        iv. If all bytes matched: entry is selected.
     c. If entry.match_type == Extension:
        i.  Extract filename from argv[0] (last path component).
        ii. If filename ends with '.' + extension (case-sensitive): entry is selected.
  5. If no entry matched: drop RCU guard; return ENOEXEC.
  6. Clone the matched entry (Arc clone, no copy of byte arrays).
  7. Drop RCU read guard.
  8. Build new argv:
     a. If PRESERVE_ARGV0 set: new_argv = [interpreter, argv[0], argv[1..]]
     b. Else:                  new_argv = [interpreter, original_file_path, argv[1..]]
     c. If OPEN_BINARY set:    pass file as open fd; prepend "/proc/self/fd/<N>"
        in place of original_file_path.
  9. If CREDENTIALS set: use interpreter binary's uid/gid/caps for the new exec.
  10. If SECURE set: clear any setuid bits that CREDENTIALS would have applied.
  11. Invoke do_execve recursively with interpreter path and new_argv.
      If FIX_BINARY set: skip binfmt_misc matching in the recursive exec
      (set a per-exec flag to prevent re-entry into this function).

Step 11's recursive do_execve processes the interpreter itself through the normal ELF handler. QEMU user-mode binaries are statically linked ELF executables, so the recursion terminates in one level.
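
The masked comparison in step 4.b.iii is the heart of the matcher. A standalone sketch (hypothetical function name, not a kernel symbol):

```rust
/// Masked magic comparison: each probe byte is ANDed with the mask before
/// comparing against the (also masked) magic byte. A mask byte of 0xff
/// means "match exactly"; 0x00 means "ignore this position".
fn magic_matches(probe: &[u8], offset: usize, magic: &[u8], mask: &[u8]) -> bool {
    debug_assert_eq!(magic.len(), mask.len());
    let end = match offset.checked_add(magic.len()) {
        Some(e) if e <= probe.len() => e,
        _ => return false, // file shorter than the magic window (step 4.b.ii)
    };
    probe[offset..end]
        .iter()
        .zip(magic.iter().zip(mask.iter()))
        .all(|(&f, (&m, &k))| f & k == m & k)
}
```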

14.9.4 The binfmt_misc Filesystem

binfmt_misc_fs is a minimal VFS filesystem type (FsType::BinfmtMisc) with the following FsOps implementation:

impl FsOps for BinfmtMiscFs {
    fn mount(&self, flags: MountFlags, _data: &[u8]) -> Result<Arc<SuperBlock>>;
    fn statfs(&self, sb: &SuperBlock) -> Result<StatFs>;
}

impl InodeOps for BinfmtMiscDir {
    fn lookup(&self, name: &OsStr) -> Result<Arc<Dentry>>;
    fn iterate_dir(&self, ctx: &mut DirContext) -> Result<()>;
}

impl FileOps for BinfmtMiscRegister {
    fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>; // parse_registration
}

impl FileOps for BinfmtMiscStatus {
    fn read(&self, buf: &mut [u8], _offset: u64) -> Result<usize>; // "enabled\n" or "disabled\n"
    fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>;    // "1" / "0"
}

impl FileOps for BinfmtMiscEntryFile {
    fn read(&self, buf: &mut [u8], _offset: u64) -> Result<usize>; // entry details
    fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>;    // "1" / "0" / "-1"
}

The filesystem has no on-disk backing store. All state lives in the in-kernel RcuCell<ArrayVec<Arc<BinfmtMiscEntry>, MAX_BINFMT_MISC>>. Directory inodes are synthesised dynamically: lookup reads the entry table under an RCU guard, scans for a matching name, and returns a synthetic inode. iterate_dir emits register, status, and all current entry names.

RCU synchronization on handler removal: When an entry is removed (write -1 to the entry file), the removal sequence is:

  1. Acquire the global binfmt_misc_lock (spinlock, serializes writers).
  2. Create a new ArrayVec with the entry removed.
  3. Publish via RcuCell::update() (rcu_assign_pointer semantics).
  4. Release binfmt_misc_lock.
  5. Call synchronize_rcu() to wait for all readers to complete.
  6. Drop the old ArrayVec (releases the Arc<BinfmtMiscEntry>).

Step 5 is critical: without it, a concurrent execve() holding an RCU read lock could still be matching against the removed entry. The synchronize_rcu() ensures that by the time the Arc is dropped (and potentially the interpreter binary's file reference released), all in-flight search_binary_handler() calls have either completed or moved past the entry table read. For FIX_BINARY entries (where the interpreter file is pinned at registration time), the pinned file reference is released only after the RCU grace period completes.
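
The grace-period reasoning has a userspace analogue. In this std-only sketch, Arc reference counting plays the role RCU plays in the kernel: the old table is freed only after the last reader drops its snapshot, which is the guarantee synchronize_rcu() provides before the final drop. The types are illustrative stand-ins, not the kernel's RcuCell/ArrayVec:

```rust
use std::sync::{Arc, RwLock};

/// Copy-on-write entry table: readers take cheap snapshots, writers
/// clone-modify-publish. Arc drop of the old table models the end of
/// the RCU grace period.
struct EntryTable {
    current: RwLock<Arc<Vec<String>>>, // analogue of RcuCell<ArrayVec<...>>
}

impl EntryTable {
    fn read_snapshot(&self) -> Arc<Vec<String>> {
        // Analogue of the RCU read guard: an in-flight reader keeps the
        // table it started with alive, even across a concurrent removal.
        Arc::clone(&self.current.read().unwrap())
    }

    fn remove(&self, name: &str) {
        let mut slot = self.current.write().unwrap(); // writer serialization
        let new_table: Vec<String> =
            slot.iter().filter(|e| e.as_str() != name).cloned().collect();
        *slot = Arc::new(new_table); // publish; old table freed when readers finish
    }
}
```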

Multiple mounts of the binfmt_misc filesystem share the same global entry table (identical to Linux semantics). Unmounting does not clear registrations; entries persist until explicitly removed via echo -1 > /proc/sys/fs/binfmt_misc/<name>/enabled or until the kernel reboots.

Mount point: The standard location is /proc/sys/fs/binfmt_misc, mounted by systemd-binfmt.service at early boot before loading entries from /etc/binfmt.d/*.conf and /usr/lib/binfmt.d/*.conf.

14.9.5 Persistence and systemd Integration

The kernel holds registrations only in memory. Registrations are lost on reboot. The systemd-binfmt.service unit re-registers all entries at each boot by reading configuration files with the format:

# /etc/binfmt.d/qemu-aarch64.conf
:qemu-aarch64:M:0:\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00::qemu-aarch64-static:OC

Each non-comment, non-empty line is written verbatim to /proc/sys/fs/binfmt_misc/register. Drop-in files in /usr/lib/binfmt.d/ are processed first, then /etc/binfmt.d/ (higher priority). Conflicting entries with the same name are rejected by the kernel (duplicate-name check in parse_registration).

14.9.6 Security Model

  • Privilege: Writing to register or any enabled file requires Capability::SysAdmin. Unprivileged processes cannot add or modify entries.
  • Interpreter credentials: By default (no CREDENTIALS flag), the interpreter runs with the calling process's credentials. The setuid bits of the interpreter binary are ignored. This prevents privilege escalation via a crafted binary whose magic bytes happen to match a setuid interpreter's registration.
  • CREDENTIALS flag: Explicitly opts in to interpreter-binary credential inheritance. Should only be set for fully trusted interpreters.
  • SECURE flag: When set alongside CREDENTIALS, strips any elevated privilege that would have been inherited. Useful for sandboxed interpreters.
  • OPEN_BINARY flag: The kernel opens the binary file before constructing the new argv, so the interpreter receives an already-open fd. This allows the interpreter to read the file even when the binary is not world-readable (e.g., chmod 700 user-owned binaries run through QEMU on a shared host). The fd is passed as a /proc/self/fd/N path to remain compatible with interpreters that accept a file path argument.
  • Recursion guard: The FIX_BINARY flag, combined with the per-exec recursion flag set in step 11 of Section 14.9.3, prevents pathological interpreter chains where an interpreter is itself a binfmt_misc-dispatched binary.

14.10 autofs — Kernel Automount Trigger

autofs is the kernel side of the automount subsystem. Its role is narrow: detect access to a path that has not yet been mounted, suspend the filesystem lookup, notify a userspace daemon, and resume the lookup after the daemon has performed the mount. The kernel does not decide what to mount or where it comes from — that is entirely the daemon's responsibility.

Used extensively by systemd through .automount units: lazy NFS home directories (/home/$user), removable media (/media/disk), and network shares that should only connect on demand.

14.10.1 Architecture

autofs registers a VFS filesystem type (FsType::Autofs). An autofs filesystem instance covers a single mount point. Inside that mount point, the kernel may see directory entries that are not yet backed by a real mount. When path resolution (Section 14.1) traverses one of these directories and finds DCACHE_NEED_AUTOMOUNT set on its dentry, it calls the dentry's d_automount operation.

The two fundamental mount modes are:

| Mode | Description |
|------|-------------|
| indirect | autofs mount covers a directory; lookups of subdirectories trigger mounts. /nfs is autofs; accessing /nfs/fileserver triggers a mount of fileserver:/export onto /nfs/fileserver. |
| direct | The autofs mount point IS the trigger. Accessing the exact path (e.g., /mnt/backup) triggers the mount. |

14.10.2 Data Structures

/// State for one autofs filesystem instance (one mount point).
pub struct AutofsMount {
    /// Pipe to the automount daemon. Kernel writes AutofsPacket messages here.
    pub pipe:             Arc<Pipe>,
    /// Protocol version negotiated with the daemon (UmkaOS implements v5).
    /// The daemon declares its version via `AUTOFS_IOC_PROTOVER` ioctl on
    /// the autofs mount fd. If the daemon's version is < 5, the kernel
    /// responds with v4 compatibility packets (no UID/GID/PID fields).
    /// If the daemon's version is > 5, the kernel uses v5 (the kernel
    /// never speaks a protocol newer than it implements). Version mismatch
    /// logging: "autofs: daemon v{N}, kernel v5 — using v{min(N,5)}".
    pub proto_version:    u32,
    /// Whether the daemon has declared itself gone (catatonic state).
    pub catatonic:        AtomicBool,
    /// Idle timeout in seconds after which expire packets are sent.
    pub timeout_secs:     AtomicU32,
    /// All outstanding lookup requests waiting for daemon response.
    /// Keyed by token (u64). XArray provides O(1) lookup with internal
    /// xa_lock for write serialization, replacing the external Mutex.
    pub pending:          XArray<Arc<AutofsPendingRequest>>,
    /// Monotonically increasing token counter. Internal counter is u64
    /// (exhaustion-proof: at 100 tokens/sec, wraps in 5.8 billion years).
    /// The Linux ABI wire protocol (`AutofsPacketMissing::wait_queue_token`)
    /// carries the low 32 bits only. The XArray is keyed by the wire token
    /// (u32, zero-extended to u64 for XArray indexing). Only the low 32 bits
    /// of the counter are used as XArray keys and wire tokens. Lookup on daemon
    /// response is O(1) via `pending.get(wire_token as u64)`. Collision is
    /// impossible: at 100 tokens/sec,
    /// the u32 space covers 49 days of tokens, but pending requests time out
    /// within `timeout_secs` (typically 30-300 seconds).
    pub next_token:       AtomicU64,
    /// Mount type: indirect or direct.
    pub mount_type:       AutofsMountType,
}
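
The token-width split described in the next_token field can be sketched with std atomics (illustrative; the XArray itself is not modeled):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Internal counter is u64; only the low 32 bits cross the wire.
static NEXT_TOKEN: AtomicU64 = AtomicU64::new(0);

/// Allocate a wire token: truncating cast keeps the low 32 bits.
fn alloc_wire_token() -> u32 {
    NEXT_TOKEN.fetch_add(1, Ordering::Relaxed) as u32
}

/// XArray keys are the wire token zero-extended back to u64.
fn xarray_key(wire_token: u32) -> u64 {
    wire_token as u64
}
```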

pub enum AutofsMountType {
    Indirect,
    Direct,
    Offset, // Internal: used for sub-mounts within a multi-mount map.
}

/// One outstanding automount request.
pub struct AutofsPendingRequest {
    /// Token echoed back in the daemon's IOC_READY / IOC_FAIL ioctl.
    pub token:   u32,
    /// Path component that triggered the lookup (indirect) or full path (direct).
    pub name:    CString,
    /// Sleeping callers blocked on this mount.
    pub waitq:   WaitQueue,
    /// Result set by the daemon: Ok(()) on success, Err(errno) on failure.
    pub result:  OnceLock<Result<()>>,
}

/// Packet written to the daemon pipe for a missing mount (protocol v5).
/// Layout matches Linux `struct autofs_v5_packet`. 304 bytes on all
/// UmkaOS-supported architectures: the `ino: u64` field has 8-byte alignment
/// on all targets (ARMv7 AAPCS, PPC32 System V ABI, and all 64-bit ABIs),
/// so trailing padding is always 4 bytes (300 named → 304 aligned).
/// The daemon reads `mem::size_of::<AutofsPacketMissing>()` bytes from the pipe.
#[repr(C)]
pub struct AutofsPacketMissing {
    pub hdr:              AutofsPacketHdr,       // offset  0, size  8
    /// Token for AUTOFS_IOC_READY / AUTOFS_IOC_FAIL.
    pub wait_queue_token: u32,                   // offset  8, size  4
    /// Device number of the autofs mount.
    pub dev:              u32,                    // offset 12, size  4
    /// Inode number of the trigger dentry.
    pub ino:              u64,                    // offset 16, size  8
    /// UID of the process that triggered the lookup.
    pub uid:              u32,                    // offset 24, size  4
    /// GID of the process that triggered the lookup.
    pub gid:              u32,                    // offset 28, size  4
    /// PID (thread group leader) of the process that triggered the lookup.
    /// Autofs wire protocol uses __u32 (not pid_t).
    pub pid:              u32,                    // offset 32, size  4
    /// TGID of the triggering process. Autofs wire protocol uses __u32.
    pub tgid:             u32,                    // offset 36, size  4
    /// Length of `name` (not including NUL). Autofs wire: __u32.
    pub len:              u32,                    // offset 40, size  4
    /// Name of the missing directory component (NUL-terminated).
    pub name:             [u8; NAME_MAX + 1],     // offset 44, size 256
    // Named fields: 8+4+4+8+4+4+4+4+4+256 = 300 bytes.
    // u64 alignment on all UmkaOS targets → 4 bytes trailing padding → 304.
}
// 304 on all UmkaOS-supported architectures: u64 has 8-byte alignment on
// ARMv7 (AAPCS), PPC32 (System V ABI), and all 64-bit ABIs. The 32-bit
// value of 300 would only apply on i386 (4-byte u64 alignment), which
// UmkaOS does not support.
const_assert!(size_of::<AutofsPacketMissing>() == 304);
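
The layout claim is easy to check in userspace with std types on x86-64, where u64 alignment is also 8. This is a mirror of the struct for illustration, not the kernel definition:

```rust
// Userspace mirror of the v5 missing packet. Named fields total 300 bytes;
// 8-byte u64 alignment rounds the struct size up to 304.
const NAME_MAX: usize = 255;

#[allow(dead_code)]
#[repr(C)]
struct AutofsPacketHdr {
    proto_version: i32,
    packet_type: i32,
}

#[allow(dead_code)]
#[repr(C)]
struct AutofsPacketMissing {
    hdr: AutofsPacketHdr,         // offset  0, size  8
    wait_queue_token: u32,        // offset  8
    dev: u32,                     // offset 12
    ino: u64,                     // offset 16 (8-byte aligned)
    uid: u32,                     // offset 24
    gid: u32,                     // offset 28
    pid: u32,                     // offset 32
    tgid: u32,                    // offset 36
    len: u32,                     // offset 40
    name: [u8; NAME_MAX + 1],     // offset 44, size 256 → 300 named bytes
}
```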

/// Packet written to the daemon pipe requesting expiry of an idle mount.
/// Layout matches Linux `struct autofs_v5_packet_expire` (304 bytes on all
/// UmkaOS-supported architectures — identical to AutofsPacketMissing). In
/// Linux, `autofs_packet_expire_direct_t` is a typedef alias for `autofs_v5_packet`.
#[repr(C)]
pub struct AutofsPacketExpire {
    pub hdr:              AutofsPacketHdr,
    pub wait_queue_token: u32,
    pub dev:              u32,
    pub ino:              u64,
    pub uid:              u32,
    pub gid:              u32,
    pub pid:              u32,
    pub tgid:             u32,
    pub len:              u32, // Autofs wire: __u32.
    pub name:             [u8; NAME_MAX + 1],
}
// Same reasoning as AutofsPacketMissing: 304 on all UmkaOS targets.
const_assert!(size_of::<AutofsPacketExpire>() == 304);

/// Common packet header.
/// Field types match Linux's `struct autofs_packet_hdr` exactly:
/// both `proto_version` and `type` are `int` (i32) in the Linux C struct.
#[repr(C)]
pub struct AutofsPacketHdr {
    pub proto_version: i32,
    pub packet_type:   i32,
}
const_assert!(size_of::<AutofsPacketHdr>() == 8);

/// Autofs packet type constants. Values match Linux `auto_fs.h`.
/// The header's `packet_type` field is `i32` (not enum repr) for C ABI
/// compatibility. These constants are used for matching:
///
/// | Value | Name | Protocol | Description |
/// |-------|------|----------|-------------|
/// | 0 | Missing | v1 | Legacy missing (v1/v2 only) |
/// | 1 | Expire | v1 | Legacy expire (v1/v2 only) |
/// | 2 | ExpireMulti | v4 | Multi-mount expire |
/// | 3 | MissingIndirect | v5 | Indirect mount trigger |
/// | 4 | ExpireIndirect | v5 | Indirect mount expiry |
/// | 5 | MissingDirect | v5 | Direct mount trigger |
/// | 6 | ExpireDirect | v5 | Direct mount expiry |
///
/// Types 3-6 are required for v5 protocol. systemd dispatches on these values.
/// UmkaOS uses types 3-6 for v5 operation (types 0-1 only for v4 compat).
#[repr(i32)]
pub enum AutofsPacketType {
    Missing         = 0,
    Expire          = 1,
    ExpireMulti     = 2,
    MissingIndirect = 3,
    ExpireIndirect  = 4,
    MissingDirect   = 5,
    ExpireDirect    = 6,
}

14.10.3 Packetized Pipe Protocol

The autofs kernel-to-daemon communication channel is a packetized pipe: each write() from the kernel writes exactly one complete packet (mem::size_of::<AutofsPacketMissing>() bytes — 304 on all UmkaOS-supported architectures), and each read() from the daemon must read exactly that many bytes to consume one packet. The pipe is opened with O_DIRECT semantics (Linux pipe O_DIRECT flag, since Linux 3.4) to ensure atomic packet-sized writes — a partial write never occurs as long as the packet size (304 bytes) is less than PIPE_BUF (4096 bytes, POSIX-guaranteed atomicity threshold). See Section 14.17 for the UmkaOS pipe implementation.

If the pipe buffer is full (all slots occupied), the kernel's write() returns -EAGAIN (the pipe fd is set to non-blocking mode by the daemon at setup). The autofs trigger path converts this to -ENOMEM and returns to the caller — the daemon is overloaded and cannot accept new mount requests.

14.10.4 Automount Protocol

Trigger sequence (the fast path through VFS path resolution):

autofs_d_automount(dentry, path) -> Result<Option<Arc<Mount>>>:
  Precondition: called from REF-walk (never RCU-walk; see Section 14.6.6).

  1. Obtain the AutofsMount for this dentry's superblock.
  2. If catatonic: return Err(ENOENT) immediately.
  3. Check if `dentry` is already a mount point (DCACHE_MOUNTED set):
     return Ok(None) — another thread raced and completed the mount.
  4. Allocate token = next_token.fetch_add(1, Relaxed).
  5. Construct AutofsPacketMissing with v5 packet type:
     - indirect mode: `packet_type = AutofsPacketType::MissingIndirect`
     - direct mode: `packet_type = AutofsPacketType::MissingDirect`
     Set `{ hdr: { proto_version: 5, packet_type }, token, name = dentry.name or full path }`.
  6. Insert Arc<AutofsPendingRequest> into pending table under token.
  7. Write packet to pipe (non-blocking; if pipe is full, return ENOMEM —
     the daemon is overloaded).
  8. Sleep on the request's waitq with timeout = timeout_secs seconds.
  9. On wake:
     a. Remove request from pending table.
     b. If result is Ok(()):
        - Verify dentry is now a mount point (DCACHE_MOUNTED).
        - Return Ok(None) (VFS follow_mount() will handle the new mount).
     c. If result is Err(e): return Err(e).
  10. On timeout:
     a. Remove request from pending table.
     b. Return Err(ETIMEDOUT).

Daemon response (via ioctl on the autofs pipe fd or mount point fd):

AUTOFS_IOC_READY(token: u32):
  1. Acquire pending lock; look up token.
  2. If not found: return ENXIO (stale token; request already timed out).
  3. Set request.result = Ok(()).
  4. Wake all waiters on request.waitq.
  5. Remove from pending table.

AUTOFS_IOC_FAIL(token: u32):
  1. Acquire pending lock; look up token.
  2. If not found: return ENXIO.
  3. Set request.result = Err(ENOENT).
  4. Wake all waiters.
  5. Remove from pending table.

Multiple callers may race to access the same missing path simultaneously. All of them find the same AutofsPendingRequest in the pending table (inserted by the first caller) and sleep on the same waitq. When the daemon responds, all waiters wake together.
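
A userspace analogue of this shared-wait behaviour, with Condvar and Mutex standing in for the kernel WaitQueue and OnceLock (the errno value is illustrative):

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// One outstanding automount request: all racing callers block on the same
/// wait queue and wake together when the daemon responds.
struct PendingRequest {
    result: Mutex<Option<Result<(), i32>>>, // Err(errno) on daemon failure
    waitq: Condvar,
}

impl PendingRequest {
    fn wait(&self, timeout: Duration) -> Result<(), i32> {
        let guard = self.result.lock().unwrap();
        let (guard, timeout_result) = self
            .waitq
            .wait_timeout_while(guard, timeout, |r| r.is_none())
            .unwrap();
        if timeout_result.timed_out() {
            return Err(110); // ETIMEDOUT, step 10 of the trigger sequence
        }
        guard.clone().unwrap()
    }

    fn complete(&self, res: Result<(), i32>) {
        *self.result.lock().unwrap() = Some(res);
        self.waitq.notify_all(); // wake every racing caller together
    }
}
```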

14.10.5 Control Interface

All autofs control operations are performed via ioctl(2) on the file descriptor of the autofs pipe (passed to the kernel at mount time via the fd=N mount option) or on a file descriptor opened on the autofs mount point itself.

| ioctl | Direction | Description |
|-------|-----------|-------------|
| AUTOFS_IOC_READY | daemon→kernel | Mount succeeded for token. |
| AUTOFS_IOC_FAIL | daemon→kernel | Mount failed for token. |
| AUTOFS_IOC_CATATONIC | daemon→kernel | Daemon is exiting; all future lookups fail with ENOENT. |
| AUTOFS_IOC_PROTOVER | kernel→daemon | Returns protocol version (5 for UmkaOS). |
| AUTOFS_IOC_SETTIMEOUT | daemon→kernel | Sets idle expiry timeout in seconds. |
| AUTOFS_IOC_EXPIRE | kernel→daemon | Requests daemon to expire (unmount) one idle subtree. |
| AUTOFS_IOC_EXPIRE_MULTI | kernel→daemon | Requests daemon to expire up to N idle subtrees. |
| AUTOFS_IOC_EXPIRE_INDIRECT | kernel→daemon | Like EXPIRE but limited to indirect-mode subtrees. |
| AUTOFS_IOC_EXPIRE_DIRECT | kernel→daemon | Like EXPIRE but limited to direct-mode mount points. |
| AUTOFS_IOC_PROTOSUBVER | kernel→daemon | Returns protocol sub-version (UmkaOS: 6, matching Linux 5.4+). |
| AUTOFS_IOC_ASKUMOUNT | daemon→kernel | Query whether the autofs mount point can be unmounted. |

14.10.6 Expiry

After an autofs-triggered mount has been idle for timeout_secs seconds, the kernel initiates expiry. Expiry is cooperative: the kernel asks the daemon to consider unmounting; the daemon decides whether conditions are met (no processes have open files under the mount, no active chdir into it) and issues umount(2) if appropriate.

autofs_expire_run(mount: &AutofsMount):
  Executed from a kernel timer callback at intervals of timeout_secs / 4.

  1. Walk all mounts that are children of this autofs mount point.
  2. For each child mount M:
     a. Compute idle_time = now - M.last_access_time.
     b. If idle_time < timeout_secs: skip.
     c. If any process has an open fd into M's subtree (check mount's
        open-file reference count): skip.
     d. Allocate token = next_token.fetch_add(1, Relaxed).
     e. Write AutofsPacketExpire { hdr: { proto_version: 5,
        packet_type: ExpireIndirect (indirect) or ExpireDirect (direct) },
        token, name = M.mountpoint_name } to pipe.
     f. Insert AutofsPendingRequest into pending table.
     g. Daemon calls AUTOFS_IOC_READY(token) after umount(2) succeeds, or
        AUTOFS_IOC_FAIL(token) if the mount is still busy.
  3. The timer reschedules itself unless the mount is in catatonic state.

The expiry path does not sleep in the kernel; it is fire-and-forget from the kernel's perspective. The daemon drives the actual unmount.

14.10.7 VFS Integration

autofs inserts itself into the VFS path walk at the d_automount dentry operation hook, which is called by follow_automount() inside the path resolution loop (Section 14.1):

follow_automount(path, nd) -> Result<()>:
  1. Verify nd.flags does not include LOOKUP_NO_AUTOMOUNT.
  2. Call dentry.ops.d_automount(dentry, path) → new_mnt (may be None).
  3. If new_mnt is Some(mnt): call do_add_mount(mnt, path).
  4. Continue path walk over the now-mounted subtree.

RCU-walk downgrade: d_automount cannot sleep, and sleeping is required to wait for the daemon response. Therefore, if the path walk is in RCU mode (the optimistic lockless fast path), it is downgraded to REF-walk before d_automount is called. The downgrade is performed by unlazy_walk(), which acquires reference counts on the path components traversed so far. Once in REF-walk, the kernel can sleep safely in autofs_d_automount.

LOOKUP_NO_AUTOMOUNT: Certain operations (stat, openat with O_NOFOLLOW | O_PATH, utimensat with AT_SYMLINK_NOFOLLOW) set this flag to avoid triggering automounts on stat-only access. This matches Linux semantics.

14.10.8 Mount Options

autofs is mounted by the daemon at startup with options passed via the data argument to mount(2):

Option Description
fd=N File descriptor of the daemon-side pipe end. Required.
uid=N UID of the daemon process. Used for permission checks on expire.
gid=N GID of the daemon process.
minproto=N Minimum acceptable protocol version (daemon's minimum).
maxproto=N Maximum acceptable protocol version (daemon's maximum).
indirect Mount in indirect mode (default).
direct Mount in direct mode.
offset Mount in offset mode (internal; used by the daemon for sub-mounts).

UmkaOS implements autofs protocol version 5, sub-version 6 (AUTOFS_PROTO_SUBVERSION = 6), matching the version supported by Linux kernel 5.4+ and systemd's automount daemon v252+. The protocol version is negotiated at mount time: the kernel picks min(maxproto, UMKA_PROTO_VERSION) and returns it via AUTOFS_IOC_PROTOVER.
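The option parsing and version negotiation can be sketched as follows. This is a simplified host-side model under stated assumptions: the struct, helper names, and error strings are illustrative, and uid=/gid= handling is elided.

```rust
// Sketch of autofs mount-option parsing and protocol negotiation.
// The kernel picks min(maxproto, UMKA_PROTO_VERSION) and rejects the
// mount if that falls below the daemon's minproto.

const UMKA_PROTO_VERSION: u32 = 5;

#[derive(Default)]
struct AutofsOpts {
    fd: Option<i32>,
    minproto: u32,
    maxproto: u32,
    direct: bool,
}

fn parse_autofs_opts(data: &str) -> Result<AutofsOpts, &'static str> {
    let mut o = AutofsOpts { minproto: 5, maxproto: 5, ..Default::default() };
    for tok in data.split(',') {
        match tok.split_once('=') {
            Some(("fd", v)) => o.fd = Some(v.parse().map_err(|_| "bad fd")?),
            Some(("minproto", v)) => o.minproto = v.parse().map_err(|_| "bad minproto")?,
            Some(("maxproto", v)) => o.maxproto = v.parse().map_err(|_| "bad maxproto")?,
            Some(_) => {} // uid=, gid=, ... elided in this sketch
            None => match tok {
                "direct" => o.direct = true,
                "indirect" | "offset" | "" => {}
                _ => return Err("unknown option"),
            },
        }
    }
    o.fd.ok_or("fd= is required")?;
    Ok(o)
}

/// Negotiated version, as returned via AUTOFS_IOC_PROTOVER.
fn negotiate(o: &AutofsOpts) -> Result<u32, &'static str> {
    let v = o.maxproto.min(UMKA_PROTO_VERSION);
    if v < o.minproto { Err("EINVAL: no common protocol version") } else { Ok(v) }
}

fn main() {
    let o = parse_autofs_opts("fd=7,uid=0,minproto=5,maxproto=5,indirect").unwrap();
    assert_eq!(negotiate(&o).unwrap(), 5);
    assert!(parse_autofs_opts("minproto=5,maxproto=5").is_err()); // fd= missing
}
```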

14.10.9 systemd Integration

A systemd .automount unit creates an autofs mount point at the path specified by Where=, paired with a .mount unit of the same name. systemd acts as the automount daemon:

  1. At unit activation, systemd calls mount("autofs", Where, "autofs", 0, "fd=N,...").
  2. When AutofsPacketMissing arrives on the pipe, systemd activates the corresponding .mount unit (which runs mount(2) for the real filesystem).
  3. On success, systemd calls AUTOFS_IOC_READY(token); on failure, AUTOFS_IOC_FAIL(token).
  4. TimeoutIdleSec= in the .automount unit maps directly to AUTOFS_IOC_SETTIMEOUT.
  5. After the idle timeout, systemd receives AutofsPacketExpire and issues umount(2) if the mount is not busy, then calls AUTOFS_IOC_READY(token).

Example unit (/etc/systemd/system/home.automount):

[Unit]
Description=Automount /home via NFS

[Automount]
Where=/home
TimeoutIdleSec=300

[Install]
WantedBy=multi-user.target

Paired with /etc/systemd/system/home.mount which specifies the NFS source and options. systemd creates the autofs mount point when the .automount unit starts and tears it down when the unit stops.

14.10.10 Linux Compatibility

UmkaOS's autofs implementation is wire-compatible with Linux autofs4:

  • Protocol version 5, sub-version 6 — matches Linux kernel 5.4+.
  • All ioctl numbers are identical to Linux (AUTOFS_IOC_* from <linux/auto_fs.h>).
  • AutofsPacketMissing and AutofsPacketExpire structs are #[repr(C)] and match the Linux kernel ABI exactly.
  • Mount option string format (fd=N,uid=N,...) matches Linux.
  • systemd's automount daemon, autofs(5) userspace tools, and mount.autofs all operate without modification against UmkaOS's autofs implementation.

14.11 FUSE — Filesystem in Userspace

FUSE allows user-space processes to implement complete filesystems. A FUSE filesystem daemon opens /dev/fuse (character device, major 10, minor 229), mounts a filesystem of type fuse (superblock magic FUSE_SUPER_MAGIC), and serves kernel VFS calls by reading and writing structured FUSE messages over the device fd. Any FUSE protocol-compliant daemon runs without modification on UmkaOS.

14.11.1 Architecture

User Process (e.g., sshfs, rclone, glusterfs-fuse)
       │  write(fuse_fd, fuse_out_header + reply)
       │  read(fuse_fd, fuse_in_header + args)
  /dev/fuse  (character device, major 10 minor 229)
  ┌────┴────────────────────────────────────────┐
  │  FuseConn: pending request queue            │
  │  FuseInode: nodeid → dentry mapping         │
  └────┬────────────────────────────────────────┘
       │  VFS callbacks → fuse_request dispatch
  UmkaOS VFS layer (lookup, read, write, open, ...)
  POSIX application

The FUSE connection object (FuseConn) is the central coordination point. It maintains two queues: pending (requests waiting for the daemon to pick up) and processing (requests sent to the daemon, awaiting reply). Each VFS thread that triggers a FUSE operation enqueues a request and blocks until the daemon writes the corresponding reply.

14.11.2 Core Data Structures

/// Maximum pending FUSE requests per connection. Prevents unbounded kernel
/// memory growth from a slow or misbehaving FUSE daemon.
const FUSE_MAX_PENDING: usize = 4096;

/// One FUSE connection — shared between all fds opened on this mount.
pub struct FuseConn {
    /// Pending requests waiting for the daemon to read.
    /// Lock-free bounded MPMC ring (defined in Section 3.1.11). VFS
    /// operations push requests (producer side); the FUSE daemon reads
    /// from the ring via `/dev/fuse` (consumer side). `try_push()`
    /// returns `Err(Full)` for backpressure — foreground callers block
    /// on `waitq` until the daemon drains entries; background callers
    /// receive `EAGAIN`.
    /// Per-request overhead: ~60-80 cycles for Arc refcount operations
    /// (4 atomic ops across pending ring and processing XArray). Acceptable:
    /// each FUSE request involves a user-kernel round-trip (~2-10 us),
    /// making the ~30-40 ns refcount overhead <2%.
    pub pending:      BoundedMpmcRing<Arc<FuseRequest>, FUSE_MAX_PENDING>,
    /// Number of currently outstanding background (async) requests.
    /// Incremented when a background request is submitted; decremented on reply.
    pub num_background: AtomicU32,
    /// Maximum background requests before blocking new submissions.
    /// Default: 12 (matching Linux `FUSE_DEFAULT_MAX_BACKGROUND`).
    /// Negotiated via FUSE_INIT: the daemon may set `max_background` in
    /// `FuseInitOut` to override the default.
    pub max_background: u32,
    /// When `num_background >= congestion_threshold`, the VFS marks the
    /// backing device as congested, causing writeback to throttle.
    /// Default: 9 (matching Linux `FUSE_DEFAULT_CONGESTION_THRESHOLD`,
    /// which is `max_background * 3 / 4`).
    ///
    /// **Units note**: Both `max_background` and `congestion_threshold` are
    /// measured in **request count** (not pages or bytes). Each background
    /// FUSE request may transfer a variable number of pages (e.g., a single
    /// WRITE request carries up to `max_write` bytes, default 128 KiB = 32
    /// pages). The request-count limit provides coarse backpressure; memory
    /// consumption is bounded by `max_background * max_write`.
    pub congestion_threshold: u32,
    /// Wait queue for tasks blocked due to backpressure (background request
    /// count exceeding `max_background`).
    pub bg_waitq:     WaitQueue,
    /// Requests sent to the daemon, awaiting reply. Keyed by monotonic
    /// request ID (u64). XArray provides O(1) lookup with native RCU reads
    /// and internal xa_lock for write serialization, eliminating the need
    /// for an external Mutex on the lookup structure.
    pub processing:   XArray<Arc<FuseRequest>>,
    /// Wait queue: the daemon blocks here in read() when no requests are
    /// pending; foreground submitters block here when the pending ring is full.
    pub waitq:        WaitQueue,
    /// Connection options negotiated via FUSE_INIT.
    pub opts:         FuseConnOpts,
    /// Next unique request ID (monotonically increasing).
    pub next_unique:  AtomicU64,
    /// True after the daemon has exchanged FUSE_INIT.
    /// Intra-domain (FuseConn lives entirely within umka-vfs Tier 1).
    /// AtomicBool validity maintained by Rust type safety.
    pub initialized:  AtomicBool,
    /// True when the connection is shutting down. Intra-domain.
    pub destroyed:    AtomicBool,
    /// Maximum write size negotiated (from FUSE_INIT reply).
    pub max_write:    u32,
    /// Maximum read size.
    pub max_read:     u32,
}

/// A single FUSE request/reply pair.
pub struct FuseRequest {
    /// Monotonic ID — matches `FuseInHeader.unique` and `FuseOutHeader.unique`.
    pub unique:  u64,
    pub opcode:  FuseOpcode,
    /// Serialized FUSE input args (everything after the `FuseInHeader`).
    /// **Collection policy exception**: Vec<u8> on a warm/hot path. FUSE input
    /// args are variable-length (path names up to PATH_MAX, write data up to
    /// max_write). A fixed-size buffer would waste memory for small ops or
    /// truncate large ones. Allocation is bounded by max_write (negotiated
    /// at FUSE_INIT, typically 128 KiB) and occurs once per FUSE operation.
    pub in_args: Vec<u8>,
    pub reply:   Mutex<FuseReply>,
    /// Woken when `reply` transitions to `Done`.
    pub waker:   WaitEntry,
}

/// State of a request's reply.
pub enum FuseReply {
    /// Not yet answered by the daemon.
    Pending,
    /// Reply bytes, or a negative errno on error.
    /// Collection policy exception: Vec<u8> on warm/hot path. FUSE replies
    /// are variable-length (stat: ~100 bytes, read data: up to max_read,
    /// readdir: variable). Allocation bounded by max_read (negotiated at
    /// FUSE_INIT, typically 128 KiB). The FUSE userspace round-trip (~2-10 us)
    /// dominates; Vec allocation (~50-100 ns) is <5% overhead.
    Done(Result<Vec<u8>, i32>),
}

/// FUSE connection options negotiated during FUSE_INIT.
pub struct FuseConnOpts {
    pub max_write:           u32,
    pub max_read:            u32,
    pub max_pages:           u16,
    /// Capabilities declared by the daemon (server side).
    pub capable:             FuseInitFlags,
    /// Capabilities the kernel requests (client side).
    pub want:                FuseInitFlags,
    /// Timestamp granularity in nanoseconds (0 = 1 ns, i.e., full precision).
    pub time_gran:           u32,
    pub writeback_cache:     bool,
    pub parallel_dirops:     bool,
    pub async_dio:           bool,
    pub posix_acl:           bool,
    pub default_permissions: bool,
    pub allow_other:         bool,
}

FuseConn is reference-counted via Arc and held by:

  • The superblock of the mounted filesystem.
  • Every open file descriptor on /dev/fuse belonging to that mount.

When the last daemon fd is closed, FuseConn.destroyed is set and all further VFS operations return EIO. The mount point must then be explicitly unmounted with fusermount -u or umount.

14.11.2.1 Request Backpressure

FUSE distinguishes foreground requests (synchronous VFS operations: lookup, open, read, write) from background requests (async writeback, readahead, background FUSE_NOTIFY replies). Backpressure is applied to background requests to prevent a slow daemon from causing unbounded kernel memory growth:

fuse_submit_background(conn, request):
  loop:
    n = conn.num_background.load(Acquire)
    if n < conn.max_background:
      if conn.num_background.compare_exchange(n, n + 1, AcqRel, Acquire).is_ok():
        break
    else:
      // Block until the daemon processes a reply and decrements num_background.
      // Non-blocking callers (e.g., writeback from kthread) get EAGAIN instead.
      if request.is_nonblocking():
        return Err(EAGAIN)
      conn.bg_waitq.wait_until(|| conn.num_background.load(Acquire) < conn.max_background)

  // Congestion marking: when background requests exceed the threshold,
  // inform the VFS writeback layer so it throttles dirty page generation.
  if conn.num_background.load(Acquire) >= conn.congestion_threshold:
    set_bdi_congested(conn.backing_dev_info)

  conn.pending.try_push(request)  // lock-free; Err(Full) maps to EAGAIN for background requests
  conn.waitq.wake_one()  // wake daemon blocked in read(/dev/fuse)

fuse_complete_background(conn):
  prev = conn.num_background.fetch_sub(1, AcqRel)
  if prev <= conn.congestion_threshold:
    clear_bdi_congested(conn.backing_dev_info)
  if prev <= conn.max_background:
    conn.bg_waitq.wake_one()

Foreground requests are not subject to max_background — they always enter the pending ring (bounded by FUSE_MAX_PENDING = 4096). If try_push() returns Err(Full), the foreground caller blocks on conn.waitq until the daemon drains entries. This matches Linux semantics where synchronous FUSE operations never return EAGAIN (except with O_NONBLOCK on the file, which is handled at the VFS layer above FUSE).
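The atomic admission logic in fuse_submit_background can be written as compilable Rust. This sketch models only the CAS loop on num_background; WaitQueue blocking, congestion marking, and the pending ring are elided, and the non-blocking path returns EAGAIN directly as described above.

```rust
// Compilable sketch of background-request admission. Blocking callers
// would wait on bg_waitq where this sketch returns Err(EAGAIN).
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, AcqRel}};

const EAGAIN: i32 = 11;

struct Conn {
    num_background: AtomicU32,
    max_background: u32,
}

/// Try to admit one background request without blocking.
fn try_submit_background(conn: &Conn) -> Result<(), i32> {
    loop {
        let n = conn.num_background.load(Acquire);
        if n >= conn.max_background {
            return Err(EAGAIN); // blocking callers wait on bg_waitq instead
        }
        if conn.num_background
            .compare_exchange(n, n + 1, AcqRel, Acquire)
            .is_ok()
        {
            return Ok(());
        }
        // CAS lost a race with a concurrent submitter; retry with fresh value.
    }
}

fn complete_background(conn: &Conn) {
    conn.num_background.fetch_sub(1, AcqRel);
    // Real code also clears bdi congestion and wakes bg_waitq here.
}

fn main() {
    let conn = Conn { num_background: AtomicU32::new(0), max_background: 2 };
    assert!(try_submit_background(&conn).is_ok());
    assert!(try_submit_background(&conn).is_ok());
    assert_eq!(try_submit_background(&conn), Err(EAGAIN)); // limit reached
    complete_background(&conn);
    assert!(try_submit_background(&conn).is_ok()); // slot freed by completion
}
```

The compare_exchange loop (rather than a plain fetch_add) ensures the counter never overshoots max_background when many submitters race.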

14.11.3 Wire Protocol

All FUSE communication is framed with fixed headers. The kernel writes a request header followed by opcode-specific arguments; the daemon writes a reply header followed by opcode-specific data.

/// Fixed header preceding every FUSE request (kernel → daemon).
#[repr(C)]
pub struct FuseInHeader {
    /// Total request length (this header + opcode args).
    pub len:     u32,
    /// Opcode (FuseOpcode value).
    pub opcode:  u32,
    /// Unique request ID; matched by the reply.
    pub unique:  u64,
    /// Target inode number (0 for FUSE_INIT / FUSE_STATFS).
    pub nodeid:  u64,
    /// Effective UID of the calling process.
    pub uid:     u32,
    /// Effective GID of the calling process.
    pub gid:     u32,
    /// PID of the calling process.
    pub pid:     u32,
    /// Length of extended request data appended after the standard opcode
    /// arguments (protocol 7.36+). Zero when no extensions are present.
    /// Used by FUSE_SECURITY_CTX, FUSE_CREATE_SUPP_GROUP.
    pub total_extlen: u16,
    pub padding: u16,
}
const_assert!(size_of::<FuseInHeader>() == 40);

/// Fixed header preceding every FUSE reply (daemon → kernel).
#[repr(C)]
pub struct FuseOutHeader {
    /// Total reply length (this header + reply data).
    pub len:    u32,
    /// 0 on success; negative errno on error (e.g., -ENOENT = -2).
    pub error:  i32,
    /// Matches the `unique` field from the corresponding `FuseInHeader`.
    pub unique: u64,
}
const_assert!(size_of::<FuseOutHeader>() == 16);

Requests and replies are variable-length. The daemon must read exactly FuseInHeader.len bytes per request and must write exactly FuseOutHeader.len bytes per reply. A short read or write is a protocol error and terminates the connection.

FUSE_FORGET and FUSE_BATCH_FORGET are the only opcodes that carry no reply; the daemon must not write a reply for them.
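The framing rule — len covers the header plus all opcode arguments — can be checked with a small host-side model. The struct layout mirrors FuseInHeader above; the frame_len helper is illustrative, not the in-kernel serializer.

```rust
// Host-side model of FUSE request framing: `len` is computed over the
// 40-byte FuseInHeader plus the opcode-specific argument bytes.
use std::mem::size_of;

#[repr(C)]
struct FuseInHeader {
    len: u32, opcode: u32, unique: u64, nodeid: u64,
    uid: u32, gid: u32, pid: u32, total_extlen: u16, padding: u16,
}

const FUSE_LOOKUP: u32 = 1;

fn frame_len(args: &[u8]) -> u32 {
    (size_of::<FuseInHeader>() + args.len()) as u32
}

fn main() {
    assert_eq!(size_of::<FuseInHeader>(), 40);
    // A FUSE_LOOKUP for "etc": the argument is the NUL-terminated name.
    let args = b"etc\0";
    let hdr = FuseInHeader {
        len: frame_len(args), opcode: FUSE_LOOKUP, unique: 1, nodeid: 1,
        uid: 0, gid: 0, pid: 42, total_extlen: 0, padding: 0,
    };
    assert_eq!(hdr.len, 44); // 40-byte header + 4 bytes of name
}
```

A daemon that reads fewer than hdr.len bytes (or writes a reply shorter than its FuseOutHeader.len) violates the framing rule and the connection is torn down.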

14.11.4 FUSE Opcodes

The direction column records who initiates the message: K→D = kernel to daemon (a VFS call from a user process), D→K = daemon to kernel (a notify or retrieve reply with no corresponding VFS initiator).

Opcode Value Direction Description
FUSE_LOOKUP 1 K→D Lookup a name within a directory
FUSE_FORGET 2 K→D Decrement inode reference count (no reply)
FUSE_GETATTR 3 K→D Fetch inode attributes
FUSE_SETATTR 4 K→D Modify inode attributes
FUSE_READLINK 5 K→D Read the target of a symbolic link
FUSE_SYMLINK 6 K→D Create a symbolic link
(reserved) 7 Reserved (unused in FUSE protocol; sequence intentionally skips from 6 to 8)
FUSE_MKNOD 8 K→D Create a special or regular file
FUSE_MKDIR 9 K→D Create a directory
FUSE_UNLINK 10 K→D Remove a file
FUSE_RMDIR 11 K→D Remove a directory
FUSE_RENAME 12 K→D Rename a file (v1; same mount)
FUSE_LINK 13 K→D Create a hard link
FUSE_OPEN 14 K→D Open a file
FUSE_READ 15 K→D Read file data
FUSE_WRITE 16 K→D Write file data
FUSE_STATFS 17 K→D Query filesystem statistics
FUSE_RELEASE 18 K→D Close file (last close releases the handle)
(reserved) 19 Unassigned in FUSE protocol (intentionally skipped)
FUSE_FSYNC 20 K→D Sync file data to stable storage
FUSE_SETXATTR 21 K→D Set an extended attribute
FUSE_GETXATTR 22 K→D Get an extended attribute value
FUSE_LISTXATTR 23 K→D List all extended attribute names
FUSE_REMOVEXATTR 24 K→D Remove an extended attribute
FUSE_FLUSH 25 K→D Flush on close (sent before FUSE_RELEASE)
FUSE_INIT 26 K→D Initialize connection (first message exchanged)
FUSE_OPENDIR 27 K→D Open a directory
FUSE_READDIR 28 K→D Read directory entries
FUSE_RELEASEDIR 29 K→D Close a directory
FUSE_FSYNCDIR 30 K→D Sync directory metadata to stable storage
FUSE_GETLK 31 K→D Test a POSIX byte-range lock
FUSE_SETLK 32 K→D Acquire or release a POSIX lock (non-blocking)
FUSE_SETLKW 33 K→D Acquire a POSIX lock (blocking)
FUSE_ACCESS 34 K→D Check access (used only when default_permissions is false)
FUSE_CREATE 35 K→D Atomically create and open a file
FUSE_INTERRUPT 36 K→D Cancel a pending request
FUSE_BMAP 37 K→D Map logical file block to device block
FUSE_DESTROY 38 K→D Tear down the connection
FUSE_IOCTL 39 K→D Forward an ioctl to the userspace filesystem
FUSE_POLL 40 K→D Poll a file for readiness events
FUSE_NOTIFY_REPLY 41 D→K Deliver data in response to FUSE_NOTIFY_RETRIEVE
FUSE_BATCH_FORGET 42 K→D Drop references for multiple inodes at once
FUSE_FALLOCATE 43 K→D Pre-allocate or de-allocate file space
FUSE_READDIRPLUS 44 K→D Read directory entries together with their attributes
FUSE_RENAME2 45 K→D Rename with RENAME_EXCHANGE or RENAME_NOREPLACE
FUSE_LSEEK 46 K→D Seek with SEEK_DATA or SEEK_HOLE
FUSE_COPY_FILE_RANGE 47 K→D Server-side copy (copy_file_range)
FUSE_SETUPMAPPING 48 K→D Set up a DAX direct memory mapping
FUSE_REMOVEMAPPING 49 K→D Remove a DAX mapping
FUSE_SYNCFS 50 K→D Sync the entire filesystem
FUSE_TMPFILE 51 K→D Create an unnamed temporary file (O_TMPFILE)
FUSE_STATX 52 K→D Extended stat (statx(2))
FUSE_COPY_FILE_RANGE_64 53 K→D Server-side copy (64-bit variant, returns bytes_copied via fuse_copy_file_range_out). Added in FUSE protocol 7.45.

Notify messages (daemon → kernel, unsolicited; no reply is sent by the kernel except for FUSE_NOTIFY_RETRIEVE which expects FUSE_NOTIFY_REPLY):

Notify code Value Description
FUSE_NOTIFY_POLL 1 Wake all pollers on the specified file handle
FUSE_NOTIFY_INVAL_INODE 2 Invalidate cached attributes and, optionally, a byte range of page cache
FUSE_NOTIFY_INVAL_ENTRY 3 Invalidate a specific dentry in a parent directory
FUSE_NOTIFY_STORE 4 Pre-populate a byte range of the page cache
FUSE_NOTIFY_RETRIEVE 5 Request the kernel to send page-cache contents back to the daemon
FUSE_NOTIFY_DELETE 6 Remove a dentry without a round-trip FUSE_LOOKUP failure
FUSE_NOTIFY_RESEND 7 Daemon notification that a previously interrupted request should be resent. Paired with HAS_RESEND capability flag (bit 39). Protocol 7.41+, Linux 6.12+
FUSE_NOTIFY_INC_EPOCH 8 Increment the kernel-side epoch counter for cache invalidation coordination
FUSE_NOTIFY_PRUNE 9 Request the kernel to prune (evict) dentries from a directory

14.11.5 FUSE_INIT Handshake

FUSE_INIT is always the first message exchanged. The kernel sends FuseInitIn and the daemon replies with FuseInitOut. The two sides negotiate protocol version and capability flags; the connection uses the minimum agreed minor version.

/// FUSE_INIT request body (kernel → daemon).
#[repr(C)]
pub struct FuseInitIn {
    /// FUSE major protocol version (kernel sends 7).
    pub major:         u32,
    /// FUSE minor protocol version (kernel sends 45 for Linux 6.14+ equivalent).
    pub minor:         u32,
    pub max_readahead: u32,
    /// Capability bitmask the kernel supports (low 32 bits of FuseInitFlags).
    /// Wire format: flags = FuseInitFlags bits 0-31 (low 32 bits).
    pub flags:         u32,
    /// Extended capability flags (protocol minor ≥ 36, FUSE_INIT_EXT must be set in flags).
    /// Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
    /// This matches the FUSE protocol extension for large flag sets (kernel 5.13+).
    pub flags2:        u32,
    pub unused:        [u32; 11],
}
// Layout: 5 × u32 + 11 × u32 = 16 × 4 = 64 bytes.
const_assert!(size_of::<FuseInitIn>() == 64);

/// FUSE_INIT reply body (daemon → kernel).
#[repr(C)]
pub struct FuseInitOut {
    pub major:               u32,
    pub minor:               u32,
    pub max_readahead:       u32,
    /// Capabilities the daemon acknowledges and enables (low 32 bits of FuseInitFlags).
    /// Wire format: flags = FuseInitFlags bits 0-31 (low 32 bits).
    pub flags:               u32,
    /// Maximum number of outstanding background requests.
    pub max_background:      u16,
    /// Congestion threshold: kernel slows down at this many background requests.
    pub congestion_threshold: u16,
    /// Maximum bytes per WRITE request.
    pub max_write:           u32,
    /// Timestamp granularity in nanoseconds (0 = 1 ns, i.e., full precision).
    pub time_gran:           u32,
    /// Maximum scatter-gather page count per request.
    pub max_pages:           u16,
    /// Alignment required for DAX mappings.
    pub map_alignment:       u16,
    /// Extended flags (protocol minor ≥ 36, requires FUSE_INIT_EXT set in flags).
    /// Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
    /// This matches the FUSE protocol extension for large flag sets (kernel 5.13+).
    pub flags2:              u32,
    pub max_stack_depth:     u32,
    /// Negotiated request timeout in seconds. Valid when `FUSE_REQUEST_TIMEOUT`
    /// (bit 42) is set in the negotiated flags. 0 = no timeout. Matches Linux
    /// `include/uapi/linux/fuse.h` field `request_timeout`.
    pub request_timeout:     u16,
    pub unused:              [u16; 11],
}
// Layout: 8 × u32 (32 bytes) + 5 × u16 (10 bytes) + unused [u16; 11] (22 bytes) = 64 bytes.
const_assert!(size_of::<FuseInitOut>() == 64);

bitflags! {
    /// Capability flags exchanged during FUSE_INIT.
    pub struct FuseInitFlags: u64 {
        /// Daemon supports asynchronous read requests.
        const ASYNC_READ          = 1 << 0;
        /// Daemon handles POSIX advisory byte-range locks.
        const POSIX_LOCKS         = 1 << 1;
        /// Daemon uses file handles returned in open replies.
        const FILE_OPS            = 1 << 2;
        /// Daemon handles O_TRUNC atomically in open.
        const ATOMIC_O_TRUNC      = 1 << 3;
        /// Filesystem supports NFS export (node IDs are stable across reboots).
        const EXPORT_SUPPORT      = 1 << 4;
        /// Daemon supports writes larger than 4 KiB.
        const BIG_WRITES          = 1 << 5;
        /// Kernel should not apply the process umask to create operations.
        const DONT_MASK           = 1 << 6;
        /// Daemon supports splice(2)-based writes.
        const SPLICE_WRITE        = 1 << 7;
        /// Daemon supports splice(2)-based moves.
        const SPLICE_MOVE         = 1 << 8;
        /// Daemon supports splice(2)-based reads.
        const SPLICE_READ         = 1 << 9;
        /// Daemon handles BSD flock() locking.
        const FLOCK_LOCKS         = 1 << 10;
        /// Daemon supports ioctl on directories.
        const HAS_IOCTL_DIR       = 1 << 11;
        /// Kernel auto-invalidates cached data on attribute changes.
        const AUTO_INVAL_DATA     = 1 << 12;
        /// Kernel uses FUSE_READDIRPLUS instead of FUSE_READDIR.
        const DO_READDIRPLUS      = 1 << 13;
        /// Kernel switches adaptively between READDIRPLUS and READDIR.
        const READDIRPLUS_AUTO    = 1 << 14;
        /// Daemon supports asynchronous direct I/O.
        const ASYNC_DIO           = 1 << 15;
        /// Daemon supports writeback caching (batched dirty page writeback).
        const WRITEBACK_CACHE     = 1 << 16;
        /// Daemon does not need FUSE_OPEN (open is a no-op).
        const NO_OPEN_SUPPORT     = 1 << 17;
        /// Parallel directory operations are safe (no serialization needed).
        const PARALLEL_DIROPS     = 1 << 18;
        /// Kernel clears setuid/setgid bits on write (v1).
        const HANDLE_KILLPRIV     = 1 << 19;
        /// Daemon supports POSIX ACLs.
        const POSIX_ACL           = 1 << 20;
        /// Daemon sets error on abort rather than returning EIO.
        const ABORT_ERROR         = 1 << 21;
        /// `max_pages` field in FuseInitOut is valid.
        const MAX_PAGES           = 1 << 22;
        /// Daemon caches symlink targets.
        const CACHE_SYMLINKS      = 1 << 23;
        /// Daemon does not need FUSE_OPENDIR.
        const NO_OPENDIR_SUPPORT  = 1 << 24;
        /// Daemon explicitly invalidates data (FUSE_NOTIFY_INVAL_INODE).
        const EXPLICIT_INVAL_DATA = 1 << 25;
        /// `map_alignment` field in FuseInitOut is valid.
        const MAP_ALIGNMENT       = 1 << 26;
        /// Daemon is aware of submount semantics.
        const SUBMOUNTS           = 1 << 27;
        /// Kernel clears setuid/setgid bits on write (v2, extended semantics).
        const HANDLE_KILLPRIV_V2  = 1 << 28;
        /// Extended setxattr arguments (flags field present).
        const SETXATTR_EXT        = 1 << 29;
        /// `flags2` fields in FuseInitIn/Out are valid.
        const INIT_EXT            = 1 << 30;
        const INIT_RESERVED       = 1 << 31;

        // --- flags2 bits (require INIT_EXT set in flags) ---
        // Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
        // All bit positions match Linux `include/uapi/linux/fuse.h` (torvalds/linux master).

        /// Security context support (protocol 7.36+). Linux 6.0+.
        /// Daemon can receive security context (e.g., SELinux label) with
        /// create/mkdir/mknod requests via extended headers (total_extlen).
        const SECURITY_CTX           = 1 << 32;
        /// Per-inode DAX hint (protocol 7.36+). Linux 6.0+.
        /// Daemon can set per-inode DAX mode via FUSE_ATTR_DAX.
        const HAS_INODE_DAX          = 1 << 33;
        /// Supplementary group support in create (protocol 7.38+). Linux 6.6+.
        const CREATE_SUPP_GROUP      = 1 << 34;
        /// Expire-only entry invalidation (protocol 7.38+). Linux 6.6+.
        const HAS_EXPIRE_ONLY        = 1 << 35;
        /// Allow mmap on direct-I/O files (protocol 7.39+). Linux 6.6+.
        const DIRECT_IO_ALLOW_MMAP   = 1 << 36;
        /// I/O passthrough to backing file (protocol 7.40+). Linux 6.9+.
        const PASSTHROUGH            = 1 << 37;
        /// Opt out of NFS export support (protocol 7.40+). Linux 6.9+.
        const NO_EXPORT_SUPPORT      = 1 << 38;
        /// Daemon supports request resend on interrupted operations.
        /// When set, the kernel may resend a FUSE request that was interrupted
        /// (e.g., by a signal) if the daemon has not yet replied. The daemon must
        /// handle duplicate `unique` IDs idempotently (protocol 7.41+). Linux 6.12+.
        const HAS_RESEND             = 1 << 39;
        /// ID-mapped FUSE mounts (protocol 7.42+). Linux 6.12+.
        const ALLOW_IDMAP            = 1 << 40;
        /// io_uring-based FUSE request transport (protocol 7.43+). Linux 6.14+.
        /// When negotiated, requests are submitted and completed via io_uring
        /// SQEs/CQEs instead of read()/write() on `/dev/fuse`, eliminating
        /// two syscalls per FUSE operation.
        const OVER_IO_URING          = 1 << 41;
        /// Per-request timeout (protocol 7.45+). Linux 6.14+.
        const REQUEST_TIMEOUT        = 1 << 42;
    }
}

If the daemon returns a minor version lower than what the kernel sent, the kernel downconverts: fields that did not exist in the older protocol minor are ignored. If the daemon sends a major version other than 7, the kernel closes the connection.
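The flags/flags2 wire split described in the struct comments can be sketched directly: the 64-bit capability set travels as two u32 fields, and flags2 is meaningful only when INIT_EXT was negotiated in the low word. This is an illustrative model, not the in-kernel code.

```rust
// Sketch of the FUSE_INIT wire split: FuseInitFlags bits 0-31 go into
// `flags`, bits 32-63 into `flags2` (valid only when INIT_EXT is set).
const INIT_EXT: u64 = 1 << 30;

fn split_flags(caps: u64) -> (u32, u32) {
    ((caps & 0xffff_ffff) as u32, (caps >> 32) as u32)
}

fn join_flags(flags: u32, flags2: u32) -> u64 {
    if (flags as u64) & INIT_EXT != 0 {
        (flags as u64) | ((flags2 as u64) << 32)
    } else {
        // Pre-INIT_EXT daemons: flags2 carries no meaning and is ignored.
        flags as u64
    }
}

fn main() {
    let caps = INIT_EXT | (1 << 0) | (1 << 39); // INIT_EXT + ASYNC_READ + HAS_RESEND
    let (f, f2) = split_flags(caps);
    assert_eq!(join_flags(f, f2), caps); // round-trips when INIT_EXT is set
    assert_eq!(join_flags(1 << 0, 0xdead), 1 << 0); // flags2 ignored without INIT_EXT
}
```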

14.11.6 VFS Integration

FUSE registers filesystem type "fuse" with superblock magic FUSE_SUPER_MAGIC = 0x65735546. Mounting proceeds as follows:

mount(2) path

  1. User invokes mount -t fuse -o fd=N,... or uses the fusermount3 helper.
  2. The kernel parses the fd=N mount option and resolves the fd to an open /dev/fuse file.
  3. A FuseConn is allocated and attached to the fd and the new superblock.
  4. The kernel sends FUSE_INIT and waits for the daemon's reply; on success, FuseConn.initialized is set and the mount completes.

VFS → FUSE dispatch

For every VFS operation on a FUSE mount (lookup, read, write, getattr, etc.) the kernel:

  1. Allocates a FuseRequest with a fresh unique ID.
  2. Serializes the opcode-specific arguments into in_args.
  3. Appends the request to FuseConn.pending and wakes the daemon's wait queue.
  4. Blocks on FuseRequest.waker until the daemon writes a reply.
  5. Deserializes the reply from FuseRequest.reply and returns to the VFS caller.

The daemon loop is simply:

loop {
    bytes = read(fuse_fd, buf)          // blocks until a request is pending
    handle_opcode(parse(buf))
    write(fuse_fd, reply_bytes)         // unblocks the kernel thread
}

Interrupt handling

If the calling thread receives a fatal signal while waiting for a FUSE reply, the kernel enqueues a FUSE_INTERRUPT message targeting the original request's unique ID. It then waits a short grace period (default 20 milliseconds). If the daemon does not abort the request and send a reply within that window, the kernel forcibly removes the request from FuseConn.processing and returns EINTR to the caller. Any reply the daemon later writes for the interrupted unique is discarded by the kernel.
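The stale-reply rule falls out of the processing-table discipline, which can be sketched with a plain map (standing in for the XArray). The struct and method names here are illustrative.

```rust
// Sketch of the stale-reply rule: once an interrupt retires a request
// from the processing table, a late daemon reply for that `unique` finds
// no entry and is silently dropped rather than treated as an error.
use std::collections::HashMap;

struct Processing {
    table: HashMap<u64, &'static str>, // value stands in for Arc<FuseRequest>
}

impl Processing {
    /// Daemon wrote a reply for `unique`. Some(req) if still outstanding;
    /// None means the reply is stale and must be ignored.
    fn take_for_reply(&mut self, unique: u64) -> Option<&'static str> {
        self.table.remove(&unique)
    }

    /// Grace period expired after FUSE_INTERRUPT: forcibly retire the request.
    fn interrupt(&mut self, unique: u64) {
        self.table.remove(&unique);
    }
}

fn main() {
    let mut p = Processing { table: HashMap::new() };
    p.table.insert(7, "FUSE_READ");
    p.interrupt(7);                        // caller already got EINTR
    assert_eq!(p.take_for_reply(7), None); // late reply: dropped
}
```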

Writeback cache (WRITEBACK_CACHE flag)

When this capability is negotiated, dirty pages accumulate in the kernel page cache and are written to the daemon in larger batches via FUSE_WRITE. Without it, every write(2) to a FUSE file generates an immediate, synchronous FUSE_WRITE to the daemon, serializing all write traffic. Most performance-sensitive FUSE filesystems negotiate WRITEBACK_CACHE.

Connection death

When the last daemon fd is closed (daemon exits, crashes, or explicitly calls FUSE_DESTROY):

  1. FuseConn.destroyed is set atomically.
  2. All requests in FuseConn.processing are completed with error ENOTCONN.
  3. All requests in FuseConn.pending are discarded.
  4. Subsequent VFS operations on the mount return EIO.
  5. The mount point persists in the namespace; an explicit umount or fusermount -u is required to remove it.

14.11.7 Security Model

Mount-owner restriction (default)

Unless the allow_other mount option is passed, only the UID that opened /dev/fuse and performed the mount may access the filesystem. All other UIDs receive EACCES from the UmkaOS VFS layer before the request reaches the daemon, regardless of the file mode bits the daemon returns.

allow_other option

Permits any UID to access the filesystem subject to normal Unix permission checks. Because allow_other exposes the daemon process to arbitrary user requests, it requires either:

  • The SysAdmin capability in the mount namespace, or
  • The /proc/sys/fs/fuse/user_allow_other sysctl set to 1 (off by default).

default_permissions option

When set, the kernel enforces standard Unix permission checks (owner, group, other; st_mode, st_uid, st_gid) against the attributes the daemon returns in FUSE_GETATTR. The kernel never sends FUSE_ACCESS in this mode. Without default_permissions, the daemon is responsible for its own access control and receives FUSE_ACCESS for every access check.
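The kernel-side check performed under default_permissions is the classic owner/group/other mode-bit test against the attributes returned by FUSE_GETATTR. A minimal sketch (function and constant names are illustrative; supplementary groups and capability overrides are elided):

```rust
// Sketch of the Unix permission check under `default_permissions`.
// `mask` is the requested access (read/write/execute bits).

const R_OK: u32 = 4;
const W_OK: u32 = 2;
const X_OK: u32 = 1;

fn default_permissions_check(
    st_mode: u32, st_uid: u32, st_gid: u32,
    caller_uid: u32, caller_gid: u32, mask: u32,
) -> bool {
    // Exactly one permission class applies, chosen in owner-group-other order.
    let class_bits = if caller_uid == st_uid {
        (st_mode >> 6) & 0o7 // owner class
    } else if caller_gid == st_gid {
        (st_mode >> 3) & 0o7 // group class
    } else {
        st_mode & 0o7        // other class
    };
    class_bits & mask == mask
}

fn main() {
    let mode = 0o640; // rw-r----- owned by uid 1000, gid 100
    assert!(default_permissions_check(mode, 1000, 100, 1000, 100, R_OK | W_OK));
    assert!(default_permissions_check(mode, 1000, 100, 2000, 100, R_OK));  // group read
    assert!(!default_permissions_check(mode, 1000, 100, 2000, 200, R_OK)); // other: denied
    assert!(!default_permissions_check(mode, 1000, 100, 1000, 100, X_OK)); // no execute bit
}
```

Because this check runs entirely in the kernel against cached attributes, no FUSE_ACCESS round-trip to the daemon is needed in this mode.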

Privilege requirement for mounting

Unprivileged FUSE mounts (without SysAdmin) are permitted only through fusermount3, which is installed setuid-root and validates that the user owns the target mountpoint. Direct mount(2) requires SysAdmin in the current user namespace.

14.11.8 io_uring FUSE

UmkaOS supports the io_uring-based FUSE I/O path (OVER_IO_URING feature, equivalent to Linux 6.14+). The daemon opts in by negotiating the OVER_IO_URING capability during FUSE_INIT and then submitting SQEs of type IORING_OP_URING_CMD to the /dev/fuse fd rather than using blocking read/write.

Benefits over the classic blocking I/O path:

  • Asynchronous request handling — the daemon can have many requests in flight simultaneously without blocking threads.
  • Reduced syscall overhead — requests are batched via io_uring_submit; one syscall drains or fills multiple queue slots.
  • CPU affinity — the daemon can pin io_uring workers to specific CPUs, reducing cross-socket latency for NUMA-aware FUSE filesystems.

The FUSE daemon registers a fixed buffer pool at startup. The kernel delivers requests into pre-registered buffers, and the daemon submits replies via the same ring. The wire format (FuseInHeader, FuseOutHeader, opcode bodies) is unchanged; only the transport mechanism differs.

Capability requirement: OVER_IO_URING negotiation requires the daemon process to hold CAP_IPC_LOCK (needed for the io_uring fixed buffer registration, which pins user pages via IORING_REGISTER_BUFFERS). If the daemon lacks CAP_IPC_LOCK, the OVER_IO_URING capability is silently cleared from the FUSE_INIT response and the connection falls back to the classic blocking I/O path.
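The fallback rule can be sketched as a pure function over the negotiated flag set. The bit position below is an illustrative assumption, not the real FUSE_INIT wire encoding:

```rust
/// Illustrative stand-in for the OVER_IO_URING capability bit; the actual
/// FUSE_INIT flag encoding is defined by the wire protocol, not here.
pub const OVER_IO_URING: u64 = 1 << 41;

/// Sketch of the negotiation rule described above: the capability is
/// granted only if both daemon and kernel offer it AND the daemon holds
/// CAP_IPC_LOCK (needed to pin pages via IORING_REGISTER_BUFFERS).
/// Otherwise the bit is silently cleared and the connection falls back
/// to the classic blocking read/write path.
pub fn negotiate_init_flags(
    daemon_flags: u64,
    kernel_flags: u64,
    daemon_has_ipc_lock: bool,
) -> u64 {
    let mut granted = daemon_flags & kernel_flags;
    if !daemon_has_ipc_lock {
        granted &= !OVER_IO_URING;
    }
    granted
}
```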

14.11.9 Linux Compatibility

  • /dev/fuse device node (major 10, minor 229): identical to Linux.
  • FUSE protocol version 7.45 (Linux 6.14+ equivalent) is the maximum negotiated kernel version. Daemons advertising higher minors receive 7.45 in the reply.
  • libfuse3 (3.x series): works without modification.
  • fusermount3 and the fuse.ko-equivalent path: built into the UmkaOS VFS layer; no kernel module is required.
  • All widely deployed FUSE filesystems run without modification: sshfs, rclone mount, glusterfs-fuse, ceph-fuse, bindfs, s3fs-fuse, encfs, gvfs, ntfs-3g.
  • DAX (FUSE_SETUPMAPPING / FUSE_REMOVEMAPPING) is supported on persistent memory-backed FUSE mounts, providing zero-copy access to file data.

14.11.10 VFS Service Provider

Provider model: VFS service is always a host-native provider (the host's kernel manages the filesystem). Device-native Tier M providers do not apply here — storage devices provide BLOCK_STORAGE, not FILESYSTEM. The VFS service provider runs on the host that mounts the filesystem and serves it to remote peers. Sharing model: multiple remote peers mount the same export simultaneously (close-to-open consistency, DLM-coordinated locking).

A host can provide mounted filesystems as cluster services via the peer protocol. Remote peers mount the export as a local filesystem and perform file operations transparently — the VFS dispatches operations to the remote host, which executes them against its local filesystem.

This is the VFS instantiation of the capability service provider model described in Section 5.7. In a uniform UmkaOS cluster, the VFS service provider delivers file sharing without NFS, nfsd, portmapper, idmapd, or any other external daemon. The cluster infrastructure (peer protocol, DLM, PeerRegistry, heartbeat) provides everything needed.

// umka-vfs/src/service_provider.rs

/// Provides a local mount point as a cluster service to remote peers.
pub struct VfsServiceProvider {
    /// The local mount point being served (e.g., "/data").
    mount: MountHandle,
    /// Unique service identifier. Used as the DLM lock namespace for all
    /// file locks on this service (Section 14.7.10.3).
    service_id: ExportId,
    /// Transport endpoint for receiving remote VFS operations.
    endpoint: PeerEndpoint,
    /// Lease duration for metadata caching (default: 30 seconds).
    /// Remote peers cache metadata (stat, readdir) for this duration.
    /// On expiry, they must re-validate with the server.
    lease_duration_ms: u32,
    /// Maximum concurrent remote operations.
    max_inflight: u32,
    /// Connected clients, tracked for lease invalidation and recovery.
    /// Keyed by PeerId (u64). XArray provides O(1) lookup with native
    /// RCU-protected reads (no read-side locking) and ordered iteration.
    clients: XArray<ServiceClientState>,
}

/// Per-client state on the server side. Tracks leases and open files
/// for recovery after client disconnect/reconnect.
pub struct ServiceClientState {
    /// Peer ID of the connected client.
    peer_id: PeerId,
    /// PeerRegistry generation at last sync. Used to detect stale clients.
    last_registry_gen: u64,
    /// Active inode leases held by this client. Keyed by InodeId (u64);
    /// value is `()` (presence-only tracking). XArray per collection policy
    /// (integer-keyed mapping, warm path — lease grant/revoke).
    leases: XArray<()>,
    /// Open file handles (for recovery after server reboot). Keyed by
    /// FileHandle (u64). XArray per collection policy (integer-keyed mapping).
    /// Maximum 4096 open files per client (enforced at Open time; server
    /// returns -EMFILE if exceeded).
    open_files: XArray<OpenFileRecord>,
}

/// Recovery metadata for one open file on a remote client.
/// Used during server reboot recovery (grace period) to validate
/// client reclaim requests. Stored in `ServiceClientState::open_files`,
/// keyed by `FileHandle` (u64). The server populates this on every
/// successful `Open` and removes it on `Release`.
pub struct OpenFileRecord {
    /// Server-assigned file handle.
    handle: FileHandle,
    /// Inode this file handle refers to.
    inode_id: InodeId,
    /// Open flags (O_RDONLY, O_WRONLY, O_RDWR, etc.).
    flags: u32,
    /// Client's UID at open time (for permission re-verification on reclaim).
    uid: u32,
    /// Client's GID at open time.
    gid: u32,
}

/// VFS operation forwarded from a remote peer. Modeled after FUSE opcodes
/// but uses native UmkaOS VFS types, not FUSE wire format.
///
/// Every mutating operation carries the caller's `uid` and `gid` for
/// permission checking on the server (Section 14.7.10.2).
/// Fixed-size filename for wire protocol. NUL-padded, max 255 bytes.
#[repr(C)]
pub struct FileName {
    /// Actual name length in bytes (excluding NUL).
    pub len: u8,
    /// NUL-padded name bytes (only first `len` bytes are significant).
    pub bytes: [u8; 255],
}
// Layout: 1 + 255 = 256 bytes.
const_assert!(size_of::<FileName>() == 256);

/// Fixed-size xattr name for wire protocol (same layout as FileName).
pub type XattrName = FileName;

/// Xattr value descriptor. Values <= 224 bytes are inlined; larger values
/// use bulk transfer via a shared memory region.
#[repr(C)]
pub struct XattrValue {
    /// Actual value length in bytes.
    pub len: u32,
    /// Inline data (valid for first `min(len, 224)` bytes).
    pub inline_data: [u8; 224],
    /// Explicit padding after inline_data (offset 228) to align bulk_offset
    /// (u64, align 8). 228 % 8 = 4, so 4 bytes of explicit padding.
    pub _pad: [u8; 4],
    /// Non-zero if value was transferred via bulk region (offset into
    /// the shared data region). Zero if fully inlined.
    pub bulk_offset: u64,
}
// Layout: len(4) + inline_data(224) + _pad(4) + bulk_offset(8) = 240 bytes.
// All padding explicit.
const_assert!(size_of::<XattrValue>() == 240);

/// Wire protocol discriminant. Append-only: new operations are added at
/// the end with incrementing values. Do not reorder or remove variants.
#[repr(C, u16)]
pub enum VfsServiceOp {
    Lookup { parent: InodeId, name: FileName, uid: u32, gid: u32 },
    Getattr { inode: InodeId },
    /// `attrs` is a `SetAttrMask` bitmask specifying which attributes to set.
    Setattr { inode: InodeId, attrs: SetAttrMask, uid: u32, gid: u32 },
    Open { inode: InodeId, flags: u32, uid: u32, gid: u32 },
    Read { handle: FileHandle, offset: u64, len: u32 },
    Write { handle: FileHandle, offset: u64, data_region_offset: u64, data_len: u32 },
    Release { handle: FileHandle },
    /// `offset` is an opaque server-assigned cookie (NOT a byte offset or
    /// entry index). Value 0 starts from the beginning of the directory.
    /// Each Readdir response includes the cookie for the next batch.
    /// This matches the NFS cookie model and avoids issues with
    /// concurrent directory mutations invalidating positional offsets.
    Readdir { inode: InodeId, offset: u64, uid: u32, gid: u32 },
    Create { parent: InodeId, name: FileName, mode: u32, flags: u32, uid: u32, gid: u32 },
    Unlink { parent: InodeId, name: FileName, uid: u32, gid: u32 },
    Mkdir { parent: InodeId, name: FileName, mode: u32, uid: u32, gid: u32 },
    Rmdir { parent: InodeId, name: FileName, uid: u32, gid: u32 },
    Rename { src_parent: InodeId, src_name: FileName,
             dst_parent: InodeId, dst_name: FileName, uid: u32, gid: u32 },
    Fsync { handle: FileHandle, datasync: u8 }, // 0 = fsync, 1 = fdatasync. u8 for wire safety.
    Statfs,
    /// File locking operations. Lock state is managed by the DLM
    /// (Section 14.7.10.3); these ops coordinate with the server's
    /// local filesystem lock state.
    Lock { handle: FileHandle, lock_type: LockType, start: u64, len: u64, uid: u32 },
    Unlock { handle: FileHandle, start: u64, len: u64 },
    /// Create a symbolic link. `target` is the symlink destination path.
    Symlink { parent: InodeId, name: FileName, target: FileName, uid: u32, gid: u32 },
    /// Read the target of a symbolic link.
    Readlink { inode: InodeId },
    /// Create a hard link. `inode` is the existing file; `new_parent`/`new_name`
    /// specify the new directory entry pointing to it.
    Link { inode: InodeId, new_parent: InodeId, new_name: FileName, uid: u32, gid: u32 },
    /// Get an extended attribute value.
    Getxattr { inode: InodeId, name: XattrName, uid: u32, gid: u32 },
    /// Set an extended attribute. `flags` follows Linux semantics:
    /// `XATTR_CREATE` (1) = fail if exists, `XATTR_REPLACE` (2) = fail if absent.
    Setxattr { inode: InodeId, name: XattrName, value: XattrValue, flags: u32, uid: u32, gid: u32 },
    /// List all extended attribute names on an inode.
    Listxattr { inode: InodeId, uid: u32, gid: u32 },
    /// Remove an extended attribute.
    Removexattr { inode: InodeId, name: XattrName, uid: u32, gid: u32 },
}
// VfsServiceOp is #[repr(C, u16)]: overall alignment = 8 (from u64 fields).
// Discriminant u16 at offset 0 (2 bytes), 6 bytes padding to offset 8 for
// first field alignment.
//
// Largest variant: Rename { src_parent: InodeId(8), src_name: FileName(256),
//   dst_parent: InodeId(8), dst_name: FileName(256), uid: u32(4), gid: u32(4) }
// Layout: offset 8..16(InodeId) + 16..272(FileName) + 272..280(InodeId)
//   + 280..536(FileName) + 536..540(u32) + 540..544(u32) = 544 bytes total.
//
// Runner-up: Symlink { parent(8), name(256), target(256), uid(4), gid(4) }
//   = 528 + 8 (discriminant + pad) = 536 bytes.
// Setxattr { inode(8), name(256), value(240), flags(4), uid(4), gid(4) }
//   = 516 + 8 (discriminant + pad) + 4 (trailing align) = 528 bytes.
//
// EVERY variant on the wire takes 544 bytes, even a Getattr (16 bytes of
// actual data). Acceptable for a KABI ring (not a network wire protocol);
// consider a header+opcode+payload redesign if ring bandwidth is a concern.
const_assert!(size_of::<VfsServiceOp>() == 544);

/// Xattr encoding rules:
///
/// - `XattrName`: up to 255 bytes (`XATTR_NAME_MAX`, matching Linux). Sent
///   inline in the `ServiceMessage` payload.
/// - `XattrValue`: up to 65536 bytes (`XATTR_SIZE_MAX`, matching Linux).
///   Values <= 224 bytes are sent inline in the `ServiceMessage`. Values
///   > 224 bytes use bulk transfer via the peer transport: the client
///   writes the value into a bounce buffer and sends `Setxattr` with
///   `data_region_offset` pointing to the buffer; the server fetches the
///   value via remote fetch through the peer transport.
/// - `Listxattr` returns a null-separated list of attribute names. If the
///   total list exceeds 224 bytes, it is transferred via bulk push from
///   the server to the client's bounce buffer.
///
/// The 224-byte inline threshold is chosen to fit within a single
/// `ServiceMessage` payload (256 bytes minus header overhead), avoiding
/// a separate bulk transfer for small xattr values (the common case for
/// security labels, POSIX ACLs, and user attributes).

/// SetAttrMask bitmask specifying which attributes to set. Matches Linux
/// `ATTR_*` values for compatibility with `fuse_setattr_in`.
bitflags! {
    pub struct SetAttrMask: u32 {
        /// Set file mode (permissions).
        const MODE      = 1 << 0;
        /// Set owner UID.
        const UID       = 1 << 1;
        /// Set owner GID.
        const GID       = 1 << 2;
        /// Set file size (truncate).
        const SIZE      = 1 << 3;
        /// Set access time to a specific value.
        const ATIME     = 1 << 4;
        /// Set modification time to a specific value.
        const MTIME     = 1 << 5;
        /// Set change time (server updates ctime automatically on any change;
        /// this flag is for explicit ctime override when restoring backups).
        const CTIME     = 1 << 6;
        /// Set access time to current server time.
        const ATIME_NOW = 1 << 7;
        /// Set modification time to current server time.
        const MTIME_NOW = 1 << 8;
    }
}

/// Server-assigned file handle. Opaque to the client. Maps to an open
/// file descriptor on the server's VFS. Valid for the lifetime of the
/// client connection (or until Release).
pub type FileHandle = u64;

InodeId scope: InodeId values are scoped to a single VFS service export (one server mount point). They are NOT globally unique across the cluster. The combination (server_peer_id, service_id, inode_id) is globally unique. Clients must not compare InodeId values across different peerfs mounts.

Wire protocol: operation forwarding over the peer protocol. Data transfers (Read, Write) use remote write/read via the peer transport for zero-copy. Metadata and control operations use ring pair send/recv.

Server side: the VFS service provider receives operations, dispatches them to the local VFS layer (which invokes the local filesystem — ext4, XFS, etc.), and returns results. The server is entirely in-kernel — no userspace daemon involved (unlike NFS's nfsd or FUSE daemons). Permission checks use the caller's UID/GID against the local filesystem's ownership and mode bits.

Client side: remote peers mount the export using a dedicated filesystem type (mount -t peerfs host_peer_id:/data /mnt/remote). The client filesystem translates local VFS operations into VfsServiceOp messages and sends them to the server peer.

14.11.10.1 Scope and Relationship to NFS

VFS service provider is designed for uniform UmkaOS clusters — all nodes run UmkaOS, share a consistent UID/GID namespace, and trust each other at the kernel level (mutual peer authentication via the capability system). In this environment, it replaces NFS entirely: no RPC/XDR, no portmapper, no separate daemon, no exports file. File sharing is a native cluster capability.

VFS service provider does not cover the full NFS feature set:

| Feature | VFS Export | NFS v4.2 |
|---|---|---|
| Wire protocol | Native peer protocol (RDMA) | RPC/XDR (+ optional RDMA) |
| Authentication | Peer capability + UID pass-through | Kerberos/GSSAPI, AUTH_SYS |
| UID mapping | None (consistent namespace assumed) | idmapd, Kerberos principal |
| File locking | DLM (Section 15.15) | Built-in NLM/v4 locks |
| Consistency | Close-to-open + leases | Close-to-open + delegations |
| Server recovery | DLM lock recovery + heartbeat | Grace period + reclaim |
| Parallel data | Not supported | pNFS |
| Referrals | Not supported | v4.1 referrals |
| ACL model | POSIX ACLs (from underlying FS) | NFSv4 ACLs |
| Configuration | Zero (auto-discovered via PeerRegistry) | exports file, mount options |
| Daemons required | None | nfsd, mountd, idmapd, gssproxy |

For mixed environments (Linux clients, Windows clients, NAS appliances, or clusters where per-user Kerberos authentication is required), NFS (Section 15.14) remains the right choice.

14.11.10.2 Identity and Permission Model

VFS service provider uses UID/GID pass-through: the client sends the calling process's UID and GID with every operation that requires permission checking. The server applies standard POSIX permission checks (owner, group, other, POSIX ACLs) against the local filesystem using the received UID/GID.

Assumptions (non-negotiable for VFS service provider):
1. All peers in the cluster share a consistent UID/GID namespace.
   (Same /etc/passwd, LDAP, or equivalent directory service on all nodes.)
2. The client peer is authenticated via the capability system (Section 9.1).
   UID/GID are trusted because the client kernel is trusted.
3. Root squash: optional, configurable per-export. When enabled, uid 0 from
   remote peers is mapped to nobody (65534). Default: enabled.

This is intentionally simple. UmkaOS clusters are managed as a single system with a single identity domain. Cross-domain authentication (Kerberos, GSSAPI) is the job of NFS, not the VFS service provider.
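The root-squash rule reduces to a one-line mapping, sketched below (the function name is illustrative):

```rust
/// UID of `nobody`, the root-squash target.
pub const NOBODY_UID: u32 = 65534;

/// Sketch of the per-export root-squash mapping applied by the server
/// before any permission check: with squashing enabled (the default),
/// remote uid 0 is demoted to nobody; all other UIDs pass through.
pub fn squash_uid(remote_uid: u32, root_squash: bool) -> u32 {
    if root_squash && remote_uid == 0 { NOBODY_UID } else { remote_uid }
}
```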

14.11.10.3 File Locking via DLM

File locks on exported filesystems are managed by the cluster's DLM (Section 15.15), which already provides distributed lock coordination, deadlock detection, and lock recovery on node failure.

/// DLM lock resource name for a file lock on an exported filesystem.
/// Composed from export ID and inode number — globally unique within
/// the cluster.
pub struct VfsLockResource {
    pub service_id: ExportId,
    pub inode_id: InodeId,
}

/// Lock type for VFS service provider locks. Maps directly to POSIX lock types.
#[repr(u8)]
pub enum LockType {
    /// flock() shared lock or fcntl() F_RDLCK.
    Shared = 0,
    /// flock() exclusive lock or fcntl() F_WRLCK.
    Exclusive = 1,
}

Lock flow (process on Host B locks file on Host A's export):

  1. Process calls flock(fd, LOCK_EX) on a file mounted via peerfs.
  2. Client VFS sends Lock { handle, Exclusive, 0, WHOLE_FILE, uid } to the server peer.
  3. Server acquires a DLM lock on VfsLockResource { service_id, inode_id } in exclusive mode. If the DLM lock is already held by another peer, the request blocks (or returns EWOULDBLOCK for LOCK_NB).
  4. After DLM grant, server also acquires the local filesystem lock (so local processes and remote clients see consistent lock state).
  5. Server responds with success. Client flock() returns.

fcntl() byte-range locks: supported. For byte-range locks, the DLM resource name is extended with the range — conceptually VfsLockResource { service_id, inode_id, start, len } — and the DLM handles range overlap and splitting.
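One possible encoding of such a resource name — a fixed-width big-endian concatenation, so names compare bytewise — is sketched below. The layout is an illustrative assumption; Section 15.15 defines the actual DLM resource-name format:

```rust
/// Sketch: encode a byte-range lock resource as a fixed 32-byte name:
/// service_id | inode_id | start | len, each as a big-endian u64.
/// Whole-file locks use start = 0, len = u64::MAX. Illustrative only.
pub fn lock_resource_name(service_id: u64, inode_id: u64, start: u64, len: u64) -> [u8; 32] {
    let mut name = [0u8; 32];
    name[0..8].copy_from_slice(&service_id.to_be_bytes());
    name[8..16].copy_from_slice(&inode_id.to_be_bytes());
    name[16..24].copy_from_slice(&start.to_be_bytes());
    name[24..32].copy_from_slice(&len.to_be_bytes());
    name
}
```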

Deadlock detection: the DLM's WaitForGraph (Section 15.15) detects cross-node deadlocks. If a deadlock cycle is found, one holder receives EDEADLK.

Lock recovery on client failure: when a client peer is declared Dead (heartbeat timeout, Section 5.8), the DLM releases all locks held by that peer. Server-side open file state for the dead client is cleaned up.

14.11.10.4 Consistency Model: Close-to-Open

VFS service provider uses close-to-open consistency, the same model as NFS. This is simple, well-understood, and sufficient for the vast majority of workloads.

Rules:
1. CLOSE flushes: when a client closes a file (Release op), all dirty data
   and metadata are flushed to the server before Release returns. The server
   commits to the underlying filesystem (fsync if O_SYNC, writeback otherwise).

2. OPEN validates: when a client opens a file (Open op), the client
   invalidates all cached attributes for that inode and fetches fresh
   metadata from the server. If the file was modified by another client
   since the last open, the new data is visible.

3. Between open and close: the client may cache read data and buffer
   writes. Concurrent access to the same file from multiple clients
   without locking has undefined ordering (same as NFS). Use flock()
   or fcntl() for coordination.

4. Lease-assisted invalidation: the server sends lease invalidation to
   clients holding metadata leases when an inode is modified. This
   provides best-effort visibility between open/close cycles —
   not guaranteed, but usually works within 1-2 lease durations.

fsync() semantics: Fsync operation is forwarded to the server, which calls fsync() on the underlying filesystem. Returns only after data is persistent on the server's storage.

14.11.10.5 Metadata Caching and Leases

The client caches stat and readdir results for the lease duration (default: 30 seconds, configurable per-export). Leases are per-inode, granted implicitly on Lookup and Getattr responses.

Lease invalidation: when the server modifies an inode (local write, unlink, rename, chmod, etc.), it sends invalidation messages to all clients holding leases for that inode. Clients receiving invalidation drop their cached attributes; the next access re-fetches from the server.

Unreachable client: if a client is unreachable during invalidation, the lease expires naturally after the lease duration. The server does NOT block on client acknowledgment — invalidation is best-effort, and close-to-open consistency provides the correctness backstop.

Read caching: file data may be cached on the client for the lease duration. On lease invalidation, cached data for the invalidated inode is also dropped. This provides reasonable read performance for read-heavy workloads without complex delegation machinery.

Write buffering: dirty writes are buffered on the client and flushed on close(), fsync(), or when the write buffer exceeds a threshold (default: 1 MB per file). The server may send an early flush request if another client opens the same file (to make close-to-open consistency work without waiting for the first client's close).
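The two client-side cache policies above — lease-bounded attribute validity and threshold-triggered writeback — can be sketched as small predicates (field and function names are illustrative):

```rust
/// Cached metadata entry on the client (illustrative fields).
pub struct CachedAttrs {
    pub fetched_at_ms: u64,
    pub lease_duration_ms: u64,
}

/// Sketch: cached attributes are trusted only while the lease is live.
/// Open always revalidates regardless (close-to-open rule 2).
pub fn lease_valid(attrs: &CachedAttrs, now_ms: u64) -> bool {
    now_ms.saturating_sub(attrs.fetched_at_ms) < attrs.lease_duration_ms
}

/// Sketch: per-file writeback trigger — flush once buffered dirty bytes
/// exceed the threshold (default 1 MiB), without waiting for close().
pub fn should_flush(dirty_bytes: u64, threshold_bytes: u64) -> bool {
    dirty_bytes > threshold_bytes
}
```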

14.11.10.5.1 Lease Invalidation Wire Protocol

The server sends a ServiceMessage with opcode LEASE_INVALIDATE (0x0100) containing a batch of InodeId values to invalidate (up to 32 per message, batched to reduce round trips).

/// Lease invalidation message payload. Sent from server to client when
/// inodes held in the client's lease set are modified on the server.
#[repr(C)]
pub struct LeaseInvalidation {
    /// Number of inodes in this batch (1-32).
    count: u16,
    _pad: [u8; 6],
    /// Array of inode IDs to invalidate. Only the first `count` entries
    /// are valid; remaining entries are undefined.
    inodes: [InodeId; 32],
}
// Layout: 2 + 6(pad) + 32×8 = 264 bytes.
const_assert!(size_of::<LeaseInvalidation>() == 264);

Delivery: fire-and-forget (no ACK required). The server does NOT block on delivery — invalidation is best-effort. If the ring pair send fails (client unreachable), the message is discarded. Close-to-open consistency (Section 14.11) provides the correctness backstop: any client that opens a file will always see fresh data regardless of whether invalidation was delivered.

Server-side invalidation trigger: any local operation that modifies an inode (write, truncate, chmod, chown, rename, unlink, link, setxattr) scans the clients XArray for leases containing that inode_id and batches invalidation messages. The scan uses RCU-protected reads on the per-client leases XArray — no locking required on the read side. Batching collects up to 32 inodes per message before sending, with a 1 ms flush timer to bound latency when fewer than 32 inodes are pending.
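The batching rule — at most 32 inodes per LEASE_INVALIDATE message — can be sketched as a chunking function producing the `count`/`inodes` pairs of `LeaseInvalidation` (a simplified sketch; the real path batches incrementally under the 1 ms flush timer rather than from a slice):

```rust
/// Sketch of batching pending inode invalidations into LEASE_INVALIDATE
/// payloads: each message carries up to 32 InodeIds plus a count,
/// matching the `LeaseInvalidation` layout above.
pub fn batch_invalidations(pending: &[u64]) -> Vec<([u64; 32], u16)> {
    pending
        .chunks(32)
        .map(|chunk| {
            let mut inodes = [0u64; 32];
            inodes[..chunk.len()].copy_from_slice(chunk);
            (inodes, chunk.len() as u16)
        })
        .collect()
}
```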

14.11.10.6 Server Reboot Recovery

When the exporting host reboots, connected clients detect the reboot via the heartbeat protocol (generation change in PeerRegistry, Section 5.2).

Recovery protocol:

  1. Server reboots, re-joins the cluster with an incremented generation number. Re-exports its filesystems.
  2. Server enters a grace period (default: 45 seconds). During the grace period, only lock reclaim operations are accepted — no new opens or mutations. This prevents new clients from acquiring locks that conflict with locks held by recovering clients.
  3. Clients detect the generation change, reconnect to the server, and reclaim their open file state:
     - Re-send Open for each file that was open before the reboot.
     - Re-acquire DLM locks (the DLM recovery protocol handles this automatically — Section 15.15).
  4. After the grace period, normal operations resume. Clients that did not reclaim within the grace period have their open files and locks invalidated.

Grace period operation filtering: during the grace period, the server classifies each incoming VfsServiceOp:

RECLAIM operations (allowed during grace):
  - Open with RECLAIM flag set (client re-opening a file it had open
    before reboot). Server validates against OpenFileRecord from the
    client's pre-reboot state (persisted in the recovery log or re-sent
    by the client in the reclaim Open message).
  - Lock with RECLAIM flag (DLM lock reclaim — handled by DLM subsystem,
    see [Section 15.15](15-storage.md#distributed-lock-manager)).

NON-RECLAIM operations (rejected during grace with -EAGAIN):
  - Any Open without RECLAIM flag.
  - Create, Mkdir, Unlink, Rmdir, Rename, Setattr, Setxattr.
  - Write (new writes, not reclaim of buffered data).
  - Readdir, Getattr, Lookup (metadata reads also deferred — server
    state may be inconsistent during recovery).

RECLAIM flag encoding:
  const OPEN_RECLAIM: u32 = 1 << 31;
  Set in the VfsServiceOp::Open.flags field. The server masks this bit
  before passing flags to the local filesystem open.

After the grace period expires (default 45 seconds, configurable via
per-export grace_period_s mount option), the RECLAIM flag is ignored
and all operations proceed normally. Clients that did not reclaim within
the window have their handles invalidated — subsequent operations on
stale handles return -ESTALE.
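The flag handling above reduces to two small functions, sketched here with the OPEN_RECLAIM value from the encoding block (return values are illustrative):

```rust
/// RECLAIM flag, as defined in the encoding block above.
pub const OPEN_RECLAIM: u32 = 1 << 31;

/// Sketch of grace-period admission for Open: reclaim Opens are admitted,
/// all other Opens are rejected with -EAGAIN.
pub fn admit_open_during_grace(flags: u32) -> Result<(), i32> {
    if flags & OPEN_RECLAIM != 0 { Ok(()) } else { Err(-11) /* EAGAIN */ }
}

/// The server masks the RECLAIM bit before handing flags to the local
/// filesystem open, so the bit never reaches the underlying FS.
pub fn local_open_flags(flags: u32) -> u32 {
    flags & !OPEN_RECLAIM
}
```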

Client-side handling: processes with open files on the rebooting server experience a brief stall (grace period duration) followed by normal operation. No EIO unless the server remains unreachable beyond the heartbeat dead threshold.

14.11.10.7 Capability Gating and Discovery

Remote filesystem access requires CAP_FS_REMOTE (Section 9.1). The capability is checked per-connection (at mount time), not per-operation.

Discovery: hosts exporting filesystems advertise FILESYSTEM in their PeerRegistry capabilities (Section 5.2). Remote peers discover available exports by querying PeerRegistry::peers_with_cap(FILESYSTEM), then requesting an export list from the serving peer. No /etc/exports file, no showmount — discovery is automatic via the cluster membership protocol.
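Discovery reduces to a capability-flag filter over a registry snapshot. The sketch below uses stand-in types and an assumed bit value; the real `PeerRegistry::peers_with_cap` API and `PeerCapFlags` encoding are defined in Section 5.2:

```rust
/// Stand-in capability bit for FILESYSTEM (illustrative value only).
pub const CAP_FILESYSTEM: u32 = 1 << 3;

/// Sketch of peers_with_cap over a snapshot of (peer_id, cap_flags)
/// entries: return every peer advertising the requested capability.
pub fn peers_with_cap(snapshot: &[(u64, u32)], cap: u32) -> Vec<u64> {
    snapshot
        .iter()
        .filter(|&&(_, caps)| caps & cap != 0)
        .map(|&(peer_id, _)| peer_id)
        .collect()
}
```

A client would then request the export list from each returned peer before mounting.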

14.11.10.8 PeerFS Client Filesystem Implementation

PeerFS is the client-side kernel filesystem that mounts a remote VFS service export as a local filesystem. It translates VFS operations into VfsServiceOp messages, manages a local inode cache, integrates with the page cache for data caching, and handles server reconnection transparently.

Tier assignment: Tier 1 (hardware memory domain isolated). PeerFS runs in the VFS isolation domain alongside other filesystem drivers. Network I/O is delegated to the peer protocol transport in Core.

Phase: 2 (core cluster filesystem, required for multi-node operation).

14.11.10.8.1 PeerFs Struct
/// Client-side filesystem for mounting remote VFS service exports.
///
/// Registered with VFS as filesystem type `"peerfs"`. Each mount creates
/// one `PeerFs` instance attached to the superblock via `s_fs_info`.
///
/// **Superblock magic**: `PEERFS_SUPER_MAGIC = 0x50454552` (`"PEER"` in ASCII).
///
/// **Lifecycle**: Created during `fill_super`. Destroyed on unmount after
/// all open files are released and dirty data flushed to the server.
pub struct PeerFs {
    /// Peer ID of the serving host.
    pub server_peer_id: PeerId,

    /// Export path on the server (e.g., "/data").
    pub export_path: ArrayString<256>,

    /// Service connection established via ServiceBind
    /// ([Section 5.7](05-distributed.md#network-portable-capabilities--capability-service-providers)).
    /// Provides the peer queue pair and control channel to the server.
    pub conn: PeerFsConn,

    /// Local cache of remote inodes. Keyed by server-assigned `InodeId` (u64).
    /// XArray provides O(1) lookup with RCU-protected reads on the hot path.
    pub inode_cache: XArray<Arc<PeerFsInode>>,

    /// Number of cached inodes. Used for LRU eviction decisions.
    pub nr_cached_inodes: AtomicU64,

    /// Maximum cached inodes before LRU eviction begins (default: 65536).
    pub max_cached_inodes: u64,

    /// LRU list for inode cache eviction. Head = most recently used.
    /// Protected by a dedicated spinlock (not the inode cache XArray lock)
    /// to avoid contention between lookups (read-side RCU on XArray) and
    /// LRU reordering. Eviction walks from tail (least recently used).
    pub inode_lru: SpinLock<IntrusiveList<PeerFsInode>>,

    /// Lease duration in milliseconds. Cached attributes and data are valid
    /// for this duration after the server grants the lease. Default: 30000 (30s).
    pub lease_duration_ms: u32,

    /// Write buffer flush threshold per file in bytes. When buffered dirty
    /// data for a single file exceeds this, writeback is triggered without
    /// waiting for close. Default: 1 MiB (1_048_576).
    pub writeback_threshold: u32,

    /// Maximum concurrent in-flight operations to the server. Backpressure:
    /// new operations block when this limit is reached. Default: 256.
    pub max_inflight: u32,

    /// Current in-flight operation count.
    pub inflight: AtomicU32,

    /// Wait queue for tasks blocked on inflight limit.
    pub inflight_waitq: WaitQueue,

    /// Retry timeout for server-unreachable in milliseconds. Operations
    /// block for this duration before returning -EIO. Default: 60000 (60s).
    pub retry_timeout_ms: u32,

    /// Server generation (from PeerRegistry). Used to detect server reboots.
    pub server_generation: AtomicU64,

    /// True when the server is in grace period (lock reclaim only).
    pub in_grace_period: AtomicBool,

    /// Root squash: when true, uid 0 from this client is mapped to
    /// nobody (65534) by the server. Default: true.
    pub root_squash: bool,

    /// Read-only mount. When true, all mutating operations return -EROFS.
    pub read_only: bool,
}

/// Connection state to the remote VFS service provider.
pub struct PeerFsConn {
    /// Peer protocol endpoint for control messages (ServiceBind channel).
    pub endpoint: PeerEndpoint,

    /// Data region registered with the peer transport at ServiceBind time.
    /// Covers the client's bounce buffer pool for zero-copy bulk transfers.
    pub data_region: ServiceDataRegion,

    /// Bounce buffer pool for bulk data transfers. Pre-allocated at mount
    /// time. Size: `max_inflight * 128 KiB` (covers max concurrent I/O).
    /// Slab-backed, no hot-path allocation.
    pub bounce_pool: SlabPool<BounceBuffer>,

    /// Connection state.
    pub state: AtomicU8, // PeerFsConnState discriminant

    /// Sequence number for request/response matching.
    pub next_seq: AtomicU64,
}

/// Pre-allocated bounce buffer for bulk data transfers. Fixed-size,
/// slab-allocated from `PeerFsConn::bounce_pool`. One buffer per in-flight
/// I/O operation. Never heap-allocated on the hot path — the pool is
/// sized at mount time to `max_inflight` entries. Bounce buffer pool uses
/// vmalloc-backed allocation (not buddy allocator) to avoid high-order
/// page allocation failures. Each buffer is page-aligned within the
/// vmalloc region.
#[repr(C, align(4096))]
pub struct BounceBuffer {
    /// Data area. Size: 128 KiB (covers the maximum single read/write
    /// transfer size). Page-aligned for transport registration requirements.
    pub data: [u8; 131072],
    /// Transport-local key for this buffer (from the registered data region).
    pub local_key: u32,
    /// Explicit padding after local_key (u32, offset 131076) to align
    /// region_offset (u64, align 8). 131076 % 8 = 4, so 4 bytes of
    /// explicit padding.
    pub _pad: [u8; 4],
    /// Offset of this buffer within the registered data region. Used to
    /// compute the remote-accessible address for bulk transfers.
    pub region_offset: u64,
}
// Layout: data(131072) + local_key(4) + _pad(4) + region_offset(8) = 131088 bytes.
// align(4096) pads to 135168. All padding explicit.
const_assert!(size_of::<BounceBuffer>() == 135168);

/// Connection state machine.
#[repr(u8)]
pub enum PeerFsConnState {
    /// ServiceBind in progress.
    Connecting = 0,
    /// Normal operation.
    Connected = 1,
    /// Server unreachable, retrying.
    Reconnecting = 2,
    /// Server rebooted, in grace period (reclaim only).
    GracePeriod = 3,
    /// Unmounting, draining in-flight operations.
    Draining = 4,
    /// Terminal: connection dead, all ops return -EIO.
    Dead = 5,
}

14.11.10.8.2 FileSystemOps Implementation

PeerFS implements the FileSystemOps trait (Section 14.1) to register as filesystem type "peerfs". It does not require a block device (FS_REQUIRES_DEV is not set).

mount / fill_super:

peerfs_mount(source: &str, flags: MountFlags, data: &[u8]) -> Result<SuperBlock>:
  1. Parse `source` as "<peer_id>:<export_path>".
     - peer_id: decimal u64 or hostname resolved via PeerRegistry.
     - export_path: absolute path on the server.
     If parse fails: return -EINVAL.

  2. Parse mount options from `data`:
     - lease_duration=N (seconds, default 30, range 1-3600)
     - writeback_threshold=N (bytes, default 1048576, range 4096-67108864)
     - max_inflight=N (default 256, range 16-4096)
     - retry_timeout=N (seconds, default 60, range 5-600)
     - ro (read-only)
     - norootsquash (disable root squash)

  3. Resolve server_peer_id via PeerRegistry
     ([Section 5.2](05-distributed.md#cluster-topology-model--peer-registry)).
     Check FILESYSTEM flag in server's PeerCapFlags. If absent: return -ENOENT.

  4. Check CAP_FS_REMOTE capability on calling task
     ([Section 9.1](09-security.md#capability-based-foundation)). If absent: return -EPERM.

  5. ServiceBind to the server's VFS service
     ([Section 5.7](05-distributed.md#network-portable-capabilities--capability-service-providers)).
     Payload includes export_path and client's lease_duration preference.
     Server may adjust lease_duration downward. On failure: return -ECONNREFUSED.

  6. Register data region with peer transport for bounce buffer pool.

  7. Allocate PeerFs struct, populate all fields.

  8. Send VfsServiceOp::Lookup { parent: ROOT_INODE, name: "." } to
     fetch root inode attributes from server.

  9. Allocate SuperBlock:
     - s_type = "peerfs"
     - s_blocksize = server-reported block size (from Statfs)
     - s_maxbytes = i64::MAX
     - s_flags = flags | MS_NOSUID | MS_NODEV
     - s_fs_info = PeerFs pointer
     - s_root = dentry for root inode
     - s_magic = PEERFS_SUPER_MAGIC (0x50454552)
     - s_bdev = None (no local block device)
     - s_bdi = None (no local backing device; writeback managed by peerfs)

  10. Register heartbeat callback for server_peer_id to detect reboots.
      Return SuperBlock.
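Steps 1 and 2 above can be sketched in standalone Rust. The helper names are illustrative, and hostname resolution via PeerRegistry is omitted; whether out-of-range option values are clamped or rejected with -EINVAL is not specified above, so the clamping shown here is an assumption:

```rust
/// Parse `source` as "<peer_id>:<export_path>" (step 1). Returns None on
/// malformed input; the caller maps that to -EINVAL.
fn parse_source(source: &str) -> Option<(u64, &str)> {
    let (peer, path) = source.split_once(':')?;
    let peer_id: u64 = peer.parse().ok()?;
    if !path.starts_with('/') {
        return None; // export_path must be absolute
    }
    Some((peer_id, path))
}

/// Force a numeric mount option into its documented range (step 2),
/// e.g. lease_duration into 1-3600 seconds.
fn clamp_opt(value: u64, min: u64, max: u64) -> u64 {
    value.max(min).min(max)
}

fn main() {
    assert_eq!(parse_source("42:/data"), Some((42, "/data")));
    assert_eq!(parse_source("42:data"), None); // not absolute
    assert_eq!(parse_source("noseparator"), None);
    assert_eq!(clamp_opt(0, 1, 3600), 1); // lease_duration below range
}
```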

statfs: Forwards VfsServiceOp::Statfs to the server. Returns the server's StatFs values directly (total/free/available blocks and inodes). Cached for lease_duration_ms to avoid round-trips on repeated df calls.

sync_fs: Flushes all dirty pages for all open files on this mount to the server, then sends VfsServiceOp::Fsync for each dirty inode. Blocks until all server acknowledgments arrive.

unmount: Flushes dirty data (sync_fs), releases all DLM locks, sends VfsServiceOp::Release for all open files, tears down the data region, and disconnects from the server.

show_options: Emits peer=<peer_id>,export=<path>,lease=<N> for /proc/mounts.

14.11.10.8.3 InodeOps and FileOps

All VFS inode and file operations are translated to VfsServiceOp messages and forwarded to the server. The mapping is direct:

| VFS Operation | VfsServiceOp | Notes |
|---|---|---|
| lookup | Lookup | Populates local PeerFsInode cache on hit |
| getattr | Getattr | Served from cache if lease valid |
| setattr | Setattr | Invalidates cached attrs on success |
| create | Create | Returns new inode + open handle |
| mkdir | Mkdir | |
| unlink | Unlink | Invalidates parent dir cache |
| rmdir | Rmdir | Invalidates parent dir cache |
| rename | Rename | Invalidates both parent dir caches |
| readdir | Readdir | Populates inode cache for returned entries |
| open | Open | Invalidates cached attrs (close-to-open) |
| read | Read (remote fetch) | Zero-copy from server's page cache |
| write | Write (remote push) | Buffered; flushed on close/fsync/threshold |
| release | Release | Flush dirty pages first |
| fsync | Fsync | Flush dirty pages, then forward |
| symlink | Symlink | Creates symlink; returns new inode |
| readlink | Readlink | Returns symlink target path |
| link | Link | Creates hard link; invalidates parent dir cache |
| getxattr | Getxattr | Phase 2 — required for POSIX ACLs |
| setxattr | Setxattr | Phase 2 — required for POSIX ACLs |
| listxattr | Listxattr | Phase 2 |
| removexattr | Removexattr | Phase 2 |

Read path (hot):

peerfs_read(file, buf, offset, len):
  inode = file.inode
  pi = inode.i_private as &PeerFsInode

  // Check page cache first (lease must be valid).
  if pi.lease_valid():
    pages = page_cache_lookup(inode, offset, len)
    if pages.is_complete():
      copy_to_user(buf, pages)
      return len

  // Cache miss or lease expired: fetch from server via peer transport.
  bounce = conn.bounce_pool.alloc()  // pre-allocated, no heap alloc
  send VfsServiceOp::Read { handle, offset, len } to server
  // Server responds with region offset for the data.
  // Client fetches data from remote region into bounce buffer.
  transport_fetch(bounce, server_region_offset, len)
  insert_into_page_cache(inode, offset, bounce.data, len)
  copy_to_user(buf, bounce.data, len)
  conn.bounce_pool.free(bounce)
  return len

Write path (warm — buffered, flushed asynchronously):

peerfs_write(file, buf, offset, len):
  if self.read_only: return -EROFS
  pi = file.inode.i_private as &PeerFsInode

  // Buffer in page cache. Mark pages dirty.
  copy_from_user_to_page_cache(file.inode, offset, buf, len)
  // Relaxed ordering: dirty_bytes is a best-effort counter for writeback
  // triggering, not a synchronization mechanism. Races between concurrent
  // writers may cause slight over- or under-counting, but writeback
  // correctness does not depend on exact counts — the page cache dirty
  // flags are the source of truth. Relaxed avoids unnecessary fence cost
  // on the write hot path (~2-5ns saved per write on weakly-ordered archs).
  pi.dirty_bytes.fetch_add(len, Relaxed)

  // Trigger writeback if threshold exceeded.
  if pi.dirty_bytes.load(Relaxed) >= self.writeback_threshold:
    peerfs_writeback(file.inode)

  return len

peerfs_writeback(inode):
  // Collect dirty pages, push to server via peer transport.
  for each dirty page range (offset, data, len):
    bounce = conn.bounce_pool.alloc()
    copy_pages_to_bounce(bounce, data, len)
    send VfsServiceOp::Write { handle, offset, data_region_offset: bounce.region_offset, data_len: len } to server
    // Server fetches data from client's bounce buffer via remote read.
    wait_for_server_ack()
    conn.bounce_pool.free(bounce)
    clear_page_dirty(inode, offset, len)
  pi.dirty_bytes.store(0, Relaxed)
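The Relaxed dirty-byte accounting from `peerfs_write` / `peerfs_writeback` can be isolated as a small sketch. `DirtyCounter` and `record_write` are illustrative names; the point is that the counter only decides *when* to trigger writeback, while the page cache dirty flags remain the source of truth:

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Best-effort dirty-byte counter. Races between concurrent writers may
/// over- or under-count slightly; correctness does not depend on it.
struct DirtyCounter {
    dirty_bytes: AtomicU64,
}

impl DirtyCounter {
    /// Record `len` newly dirtied bytes; returns true if the caller
    /// should kick writeback (threshold crossed).
    fn record_write(&self, len: u64, threshold: u64) -> bool {
        // Relaxed: no synchronization needed, just a heuristic trigger.
        self.dirty_bytes.fetch_add(len, Relaxed);
        self.dirty_bytes.load(Relaxed) >= threshold
    }

    /// Called by writeback after all dirty pages reach the server.
    fn reset(&self) {
        self.dirty_bytes.store(0, Relaxed);
    }
}

fn main() {
    let c = DirtyCounter { dirty_bytes: AtomicU64::new(0) };
    assert!(!c.record_write(4096, 1_048_576)); // below threshold
    assert!(c.record_write(2_000_000, 1_048_576)); // threshold crossed
    c.reset();
    assert_eq!(c.dirty_bytes.load(Relaxed), 0);
}
```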

Release (close) path: enforces close-to-open consistency.

peerfs_release(file):
  // Flush all dirty pages to server before releasing handle.
  peerfs_writeback(file.inode)
  send VfsServiceOp::Release { handle }
  wait_for_server_ack()
  // Do NOT invalidate cached attrs here — other opens may hold leases.

Open path: enforces close-to-open consistency (revalidation side).

peerfs_open(inode, flags):
  pi = inode.i_private as &PeerFsInode

  // Close-to-open: invalidate cached attrs and page cache on open.
  pi.invalidate_attrs()
  invalidate_inode_pages(inode)  // drop cached read data

  send VfsServiceOp::Open { inode: pi.remote_inode_id, flags, uid, gid }
  handle = wait_for_reply()
  // Store server-assigned handle in file->private_data.
  file.private_data = handle
  // Refresh cached attrs from Open reply.
  pi.update_attrs(reply.attrs, current_monotonic_ms() + self.lease_duration_ms)

14.11.10.8.4 PeerFsInode: Remote Inode Cache

/// Local representation of a remote inode. Attached to the VFS `Inode`
/// via `i_private`. Caches attributes received from the server to avoid
/// round-trips on `stat()` / `getattr()` within the lease window.
pub struct PeerFsInode {
    /// Server-assigned inode ID. Matches `InodeId` on the server's filesystem.
    pub remote_inode_id: InodeId,

    /// Cached attributes (mode, uid, gid, size, mtime, ctime, nlink).
    pub attrs: SpinLock<PeerFsInodeAttrs>,

    /// Absolute time (monotonic ms) when the cached attrs expire.
    /// Operations after this time must re-fetch from the server.
    pub lease_expiry_ms: AtomicU64,

    /// Server generation at the time this inode was cached. If the server
    /// reboots (generation changes), all cached inodes are stale.
    pub server_generation: u64,

    /// Dirty bytes buffered in the page cache for this inode. Used to
    /// trigger writeback when exceeding `PeerFs::writeback_threshold`.
    pub dirty_bytes: AtomicU64,

    /// Server-assigned file handle for the current open. Zero if not open.
    /// Multiple opens share the same PeerFsInode but get distinct handles
    /// (handles are stored in `OpenFile::private_data`, not here).
    /// This field tracks the *last known* handle for recovery purposes.
    pub last_handle: AtomicU64,

    /// LRU linkage for inode cache eviction.
    pub lru_node: IntrusiveListNode,
}

/// Cached attributes for a remote inode.
pub struct PeerFsInodeAttrs {
    pub mode: u32,
    pub uid: u32,
    pub gid: u32,
    pub size: u64,
    pub nlink: u32,
    pub mtime_sec: u64,
    pub mtime_nsec: u32,
    pub ctime_sec: u64,
    pub ctime_nsec: u32,
    pub blocks: u64,
    pub blksize: u32,
}

impl PeerFsInode {
    /// Returns true if cached attributes are still within the lease window.
    pub fn lease_valid(&self) -> bool {
        current_monotonic_ms() < self.lease_expiry_ms.load(Acquire)
    }

    /// Invalidate cached attributes. Next getattr will fetch from server.
    pub fn invalidate_attrs(&self) {
        self.lease_expiry_ms.store(0, Release);
    }

    /// Update cached attributes from a server response.
    pub fn update_attrs(&self, attrs: PeerFsInodeAttrs, new_expiry_ms: u64) {
        *self.attrs.lock() = attrs;
        self.lease_expiry_ms.store(new_expiry_ms, Release);
    }
}

Cache eviction: triggered when nr_cached_inodes exceeds max_cached_inodes (default 65536). Eviction runs in a background kthread (peerfs_evictor), not on the hot lookup path.

peerfs_evict_inodes(pfs: &PeerFs):
  1. Acquire inode_lru spinlock.
  2. Walk the LRU list from tail (least recently used).
  3. For each candidate inode:
     a. Skip if VFS reference count > 0 (inode has active opens).
     b. Skip if dirty_bytes > 0 (must writeback first — schedule
        async writeback via peerfs_writeback() and re-check on next
        eviction pass).
     c. Remove from LRU list and inode_cache XArray.
     d. Drop the Arc<PeerFsInode> (deallocates if refcount reaches 0).
     e. Decrement nr_cached_inodes.
  4. Stop when nr_cached_inodes <= max_cached_inodes * 7/8 (hysteresis
     to avoid thrashing — evict 12.5% below threshold before stopping).
  5. Release inode_lru spinlock.

Memory budget: 65536 inodes x ~128 bytes per PeerFsInode = 8 MiB. Configurable via mount option max_inodes=N.
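The 7/8 hysteresis and the memory budget above reduce to simple arithmetic; a sketch (the helper name is illustrative):

```rust
/// Eviction stops once the cache shrinks to 7/8 of the limit (step 4),
/// so each pass reclaims at least 12.5% of capacity before stopping.
fn eviction_target(max_cached: u64) -> u64 {
    max_cached * 7 / 8
}

fn main() {
    let max = 65536u64;
    assert_eq!(eviction_target(max), 57344); // stop point at default limit
    // Memory budget: ~128 bytes per PeerFsInode => 8 MiB at the default.
    assert_eq!(max * 128 / (1024 * 1024), 8);
}
```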

Lease-driven invalidation: The server sends lease invalidation messages when a remote inode is modified by another client. On receipt:

on_lease_invalidation(inode_id: InodeId):
  pi = inode_cache.load(inode_id)
  if pi.is_some():
    pi.invalidate_attrs()
    invalidate_inode_pages(pi.vfs_inode)  // drop cached read data

This provides best-effort visibility between open/close cycles. The close-to-open protocol is the correctness backstop — invalidation is an optimization that improves freshness for long-lived opens.

Revalidation: When a VFS operation accesses an inode with an expired lease, the client sends VfsServiceOp::Getattr and updates the cache. This is transparent to the caller.

14.11.10.8.5 Page Cache Integration

PeerFS uses the standard VFS page cache (Section 14.1) for read and write caching. The caching policy is deliberately simple — no delegations, no complex cache consistency state machines.

Design principles (lessons from NFS client bugs):

  • No "delegation" concept. Leases are the only cache validity mechanism.
  • Page cache validity is tied to the inode's lease. When the lease expires or is invalidated, all cached pages for that inode are dropped.
  • No speculative cache retention after lease loss. This wastes some bandwidth on re-reads but eliminates stale data bugs.
  • Dirty pages are ALWAYS flushed before releasing a file handle. No deferred or lazy flush that could lose data.

Read caching:

  • On read(), check the page cache first. If the page is present AND the inode's lease is valid, serve from cache (zero network I/O).
  • If the page is absent or the lease has expired, fetch from the server via remote fetch (peer transport) and insert into the page cache.
  • Readahead: the VFS readahead infrastructure (Section 14.1) drives prefetch. PeerFS implements AddressSpaceOps::readahead() to batch multiple pages into a single remote fetch.

Write caching:

  • Writes go to the page cache. Pages are marked dirty.
  • Dirty pages are flushed to the server on:
    1. close() (mandatory — close-to-open consistency).
    2. fsync() (explicit sync request).
    3. Per-file dirty bytes exceeding writeback_threshold.
    4. Server sends an early-flush request (another client opened the file).
    5. Memory pressure (VFS shrinker callback).

mmap support:

  • mmap() is supported. Page faults trigger the read path (fetch from server if not cached). Dirty mmap pages are tracked via the page cache dirty mechanism and flushed on msync(), munmap(), or close().

No AddressSpaceOps::direct_IO: PeerFS does not support O_DIRECT. All I/O goes through the page cache. This simplifies the implementation and avoids the NFS O_DIRECT coherency bugs. Applications requiring direct server access should use the DLM-based locking path.

Readdir caching:

  • Directory entries are cached as a serialized readdir buffer in the page cache, using the same page cache machinery as regular file data. Pages are keyed by (directory inode, byte offset into the readdir stream).
  • Cached readdir data is valid for lease_duration_ms. After expiry, the next readdir() re-fetches from the server.
  • Any directory mutation (Create, Unlink, Mkdir, Rmdir, or Rename affecting the directory as source or destination) immediately invalidates the directory's cached readdir pages.
  • No negative dentry caching. Caching "file not found" results is error-prone when multiple clients modify the same directory concurrently. A lookup miss always goes to the server.
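The (directory inode, byte offset) keying can be sketched with a HashMap standing in for the page cache. `ReaddirCache` and its methods are illustrative stand-ins, not the kernel types:

```rust
use std::collections::HashMap;

/// Stand-in for page-cache keying of serialized readdir data:
/// key = (directory inode id, byte offset into the readdir stream).
struct ReaddirCache {
    pages: HashMap<(u64, u64), Vec<u8>>,
}

impl ReaddirCache {
    fn insert(&mut self, dir_ino: u64, offset: u64, data: Vec<u8>) {
        self.pages.insert((dir_ino, offset), data);
    }

    fn lookup(&self, dir_ino: u64, offset: u64) -> Option<&Vec<u8>> {
        self.pages.get(&(dir_ino, offset))
    }

    /// Any mutation of the directory drops all of its cached readdir
    /// pages, regardless of offset; other directories are untouched.
    fn invalidate_dir(&mut self, dir_ino: u64) {
        self.pages.retain(|&(ino, _), _| ino != dir_ino);
    }
}

fn main() {
    let mut c = ReaddirCache { pages: HashMap::new() };
    c.insert(7, 0, vec![1, 2, 3]);
    c.insert(9, 0, vec![4]);
    c.invalidate_dir(7); // e.g. Unlink in directory 7
    assert!(c.lookup(7, 0).is_none());
    assert!(c.lookup(9, 0).is_some()); // unrelated directory kept
}
```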

14.11.10.8.6 Mount Syntax

mount -t peerfs <peer_id>:<export_path> <mountpoint> [-o <options>]

| Option | Type | Default | Range | Description |
|---|---|---|---|---|
| lease_duration | seconds | 30 | 1-3600 | Metadata/data cache validity period |
| writeback_threshold | bytes | 1048576 | 4096-64M | Per-file dirty data flush threshold |
| max_inflight | count | 256 | 16-4096 | Max concurrent server operations |
| retry_timeout | seconds | 60 | 5-600 | Timeout before -EIO on server unreachable |
| ro | flag | off | | Read-only mount |
| norootsquash | flag | off | | Disable root squash (uid 0 not mapped) |
| max_inodes | count | 65536 | 1024-1048576 | Maximum cached inodes before LRU eviction |

Examples:

mount -t peerfs 42:/data /mnt/remote
mount -t peerfs 42:/home /mnt/home -o lease_duration=60,ro
mount -t peerfs 7:/scratch /mnt/scratch -o writeback_threshold=4194304,max_inflight=512

Peer ID can also be a hostname if DNS/mDNS resolution is available in the cluster. The PeerRegistry resolves hostnames to PeerId values.

14.11.10.8.7 Error Handling

| Condition | Behavior |
|---|---|
| Server unreachable | Operations block for retry_timeout (default 60 s), then return -EIO. Connection transitions to Reconnecting. |
| Server rebooted | Detected via PeerRegistry generation change. Connection transitions to GracePeriod. Client reclaims open files and DLM locks. Normal operations resume after the grace period. See Section 14.11. |
| Stale file handle | Server returns -ESTALE. Client re-lookups the inode from the parent directory: walk up to the nearest cached valid parent, then re-lookup each path component. If re-lookup succeeds, retry the operation with the new handle. If the file was deleted, return -ESTALE to the application. |
| Server returns error | Passed through to the application as-is (e.g., -EACCES, -ENOSPC, -ENOENT). |
| Transport failure | Connection transitions to Reconnecting. The data region is re-registered after reconnect. In-flight operations receive -EIO. |
| Client-side memory pressure | The VFS shrinker evicts clean cached inodes (LRU). Dirty inodes are written back first. |
| Inflight limit reached | New operations block on inflight_waitq until an in-flight operation completes. |

Stale inode recovery detail:

peerfs_handle_estale(inode, parent_path_components):
  // Walk up the cached path to find the nearest valid ancestor.
  for component in parent_path_components.reverse():
    ancestor = lookup_cached(component)
    if ancestor.is_valid():
      // Re-lookup from this ancestor downward.
      for child in path_from(ancestor, inode):
        result = send VfsServiceOp::Lookup { parent: ancestor.id, name: child }
        if result.is_err():
          return result  // file was deleted or renamed
        ancestor = result.inode
      // Update local inode cache with fresh server state.
      update_inode_cache(inode, ancestor)
      return Ok(())
  return Err(ESTALE)  // entire path is stale

14.11.10.8.8 Advantages Over NFS

| Aspect | PeerFS | NFS v4.2 |
|---|---|---|
| Wire overhead | Native structs over peer transport — no RPC/XDR marshaling | Sun RPC + XDR encoding/decoding on every operation |
| Daemons | Zero (pure in-kernel) | nfsd, mountd, idmapd, gssproxy, rpc.statd, rpcbind |
| Configuration | mount -t peerfs peer:/path /mnt | /etc/exports, /etc/fstab, Kerberos keytabs, idmapd.conf |
| Discovery | Automatic via PeerRegistry (Section 5.2) | Manual: must know server IP and export path |
| Locking | DLM (Section 15.15) — already running for DFS | NLM (v3) or built-in (v4) — separate protocol, separate recovery |
| Data transfer | Zero-copy via peer transport (bounce buffer pool, no kernel copy on large I/O; ~3-5 μs on RDMA, ~50-200 μs on TCP) | TCP or RPC-over-RDMA (still has XDR framing overhead) |
| Caching model | Leases only — simple, predictable, few bugs | Delegations — complex state machine, known source of client bugs |
| Recovery | DLM lock recovery + PeerRegistry heartbeat | Grace period + reclaim + edge cases around delegation return |

14.11.10.8.9 Intentional Non-Goals

The following NFS features are not implemented in PeerFS. Each omission is deliberate, not a gap:

  • pNFS (parallel NFS): UmkaOS clusters use the block service provider (Section 15.14) for parallel storage access. PeerFS serves the simpler "one server, one export" use case.
  • Referrals: PeerRegistry discovery handles service location. No need for filesystem-level redirect.
  • NFSv4 ACLs: PeerFS uses POSIX ACLs from the underlying filesystem. The server enforces them using the passed UID/GID.
  • Client-side deduplication: The server's local filesystem handles this.
  • Kerberos/GSSAPI: PeerFS assumes a uniform trust domain (all peers mutually authenticated via Section 9.1). Cross-domain authentication requires NFS.
  • O_DIRECT: All I/O goes through the page cache for simplicity. Applications needing direct access use DLM locks and RDMA directly.

14.12 configfs — Kernel Object Configuration Filesystem

configfs is a RAM-resident pseudo-filesystem (similar to sysfs) that allows user-space to create, configure, and destroy kernel objects by manipulating directories and files under a single mount point. The key distinction from sysfs is direction of control: sysfs exports kernel-managed objects to user-space, while configfs gives user-space the power to instantiate new kernel objects via mkdir.

configfs is used by:

  • LIO iSCSI / NVMe-oF target (/sys/kernel/config/target/, /sys/kernel/config/nvmet/) — see Section 11 for the block-layer and NVMe-oF protocol details.
  • USB gadget framework (/sys/kernel/config/usb_gadget/)
  • 9pnet and netconsole subsystems

14.12.1 Architecture

                 User Space
         mkdir / rmdir / cat / echo
              /sys/kernel/config/
                     │  (VFS operations)
        ┌────────────┴────────────────────────┐
        │          configfs VFS layer          │
        │  ConfigfsSubsystem → ConfigGroup     │
        │  ConfigItem → ConfigAttribute        │
        └────────────┬────────────────────────┘
                     │  callbacks
              Kernel subsystem
         (LIO, nvmet, USB gadget, ...)

User-space operates exclusively with POSIX filesystem primitives. No ioctl or dedicated syscall is needed. The kernel subsystem registers callback functions that the configfs VFS layer invokes in response to standard filesystem operations.

14.12.2 Data Structures

/// A configfs subsystem, registered by a kernel module at init time.
pub struct ConfigfsSubsystem {
    /// Directory name created under /sys/kernel/config/.
    pub name: &'static str,
    /// Root group of this subsystem.
    pub root: Arc<ConfigGroup>,
}

/// A configfs group — a directory that may contain items, subgroups, and
/// attributes. Groups may also carry a set of default child groups that are
/// created automatically when the group itself is created.
pub struct ConfigGroup {
    pub item:           ConfigItem,
    /// Active children (items and subgroups) keyed by name.
    /// Bounded by the subsystem's make_item/make_group callbacks (return
    /// ENOSPC when subsystem-specific limits are exceeded). All configfs
    /// mutation operations require CAP_SYS_ADMIN.
    ///
    /// **Lock ordering**: Parent group's `children` lock MUST be acquired
    /// before any child group's `children` lock (top-down ordering). This
    /// is deadlock-free by the tree structure: no cycle can form because
    /// locks are always acquired in root-to-leaf order. Concurrent mkdir
    /// operations on sibling groups do not conflict (different RwLock
    /// instances). Lock level: LOCK_LEVEL_CONFIGFS_CHILDREN (within the
    /// VFS lock ordering hierarchy, after VFS inode lock, before
    /// attribute file I/O locks).
    pub children:       RwLock<BTreeMap<String, ConfigChild>>,
    /// Type descriptor controlling allowed operations on this group.
    pub item_type:      Arc<ConfigItemType>,
    /// Subgroups automatically created alongside this group (not user-removable).
    /// Bounded: typically 1-4 default groups per subsystem (e.g., target_core_mod
    /// creates "alua" and "statistics"). Cold path (group creation).
    /// Enforced: registration fails with ENOSPC if len >= CONFIGFS_MAX_DEFAULT_GROUPS.
    pub default_groups: Vec<Arc<ConfigGroup>>,
}

const CONFIGFS_MAX_DEFAULT_GROUPS: usize = 16;

/// Discriminated union of group children.
pub enum ConfigChild {
    Item(Arc<ConfigItem>),
    Group(Arc<ConfigGroup>),
}

/// A configfs item — the leaf directory representing one kernel object.
pub struct ConfigItem {
    /// Item name within its parent group. Set at creation time (mkdir);
    /// immutable thereafter — no lock required. Inline storage avoids
    /// heap allocation for typical names (container IDs, device names).
    pub name:      ArrayString<256>,
    /// Reference count; item is dropped when it reaches zero.
    /// AtomicU64 for consistent width across 32-bit and 64-bit platforms
    /// (per project policy — avoids usize width variation).
    pub kref:      AtomicU64,
    pub parent:    Weak<ConfigGroup>,
    pub item_type: Arc<ConfigItemType>,
}

/// Type descriptor: defines the callbacks and attributes for an item or group.
pub struct ConfigItemType {
    pub name: &'static str,
    /// Called when the item's reference count drops to zero.
    pub release:    fn(&ConfigItem),
    /// Attribute files exposed in every instance of this item type.
    pub attrs:      &'static [&'static dyn ConfigAttribute],
    /// Returns additional child groups (used for complex multi-level objects).
    pub groups:     Option<fn(&ConfigItem) -> Vec<Arc<ConfigGroup>>>,
    /// Create a new leaf item inside this group (triggered by mkdir).
    pub make_item:  Option<fn(group: &ConfigGroup, name: &str)
                              -> Result<Arc<ConfigItem>, KernelError>>,
    /// Create a new subgroup inside this group (triggered by mkdir).
    pub make_group: Option<fn(group: &ConfigGroup, name: &str)
                               -> Result<Arc<ConfigGroup>, KernelError>>,
    /// Notify the subsystem before an item is removed (triggered by rmdir).
    pub drop_item:  Option<fn(group: &ConfigGroup, item: &ConfigItem)>,
}

/// A single configfs attribute — a regular file in the item directory.
pub trait ConfigAttribute: Send + Sync {
    /// File name within the item directory.
    fn name(&self) -> &str;
    /// Unix permission bits (typically 0644 for read-write, 0444 for read-only).
    fn mode(&self) -> u32;
    /// Populate `buf` with a text representation of the attribute value.
    /// Returns the number of bytes written.
    fn show(&self, item: &ConfigItem, buf: &mut [u8]) -> Result<usize, KernelError>;
    /// Parse `buf` and apply the new attribute value.
    /// Returns the number of bytes consumed.
    fn store(&self, item: &ConfigItem, buf: &[u8]) -> Result<usize, KernelError>;
}

Lifetimes and reference counting mirror those of the objects the subsystem manages. A ConfigItem is kept alive as long as the directory exists in the configfs namespace. Removal (rmdir) calls drop_item, decrements the kref, and invokes release when the count reaches zero.

14.12.3 Mount Point and Directory Layout

configfs is mounted at boot by configfs_init() and exposed at /sys/kernel/config. User-space may also mount it manually:

mount -t configfs configfs /sys/kernel/config

Illustrative layout showing the NVMe-oF and iSCSI target subsystems (see Section 11 for full protocol details):

/sys/kernel/config/
├── target/                              ← LIO iSCSI / generic target subsystem
│   ├── core/
│   │   └── iblock_0/                   ← mkdir: create iblock backstore group
│   │       └── lio_disk0/              ← mkdir: create a new block device object
│   │           ├── dev                 ← echo /dev/sda > dev
│   │           ├── udev_path           ← echo /dev/sda > udev_path
│   │           └── enable              ← echo 1 > enable
│   └── iscsi/
│       └── iqn.2024-01.com.example:storage/   ← mkdir: create iSCSI target IQN
│           └── tpgt_1/                         ← mkdir: create target portal group
│               ├── enable
│               ├── lun/
│               │   └── lun_0 → ../../core/iblock_0/lio_disk0   ← symlink
│               ├── acls/
│               │   └── iqn.2024-01.com.client:host1/
│               │       ├── auth/
│               │       └── mapped_lun0/
│               └── fabric_statistics/
├── nvmet/                               ← NVMe-oF target subsystem
│   ├── subsystems/
│   │   └── nqn.2024-01.com.example:nvme-ssd/  ← mkdir: create NVMe subsystem NQN
│   │       ├── attr_allow_any_host
│   │       └── namespaces/
│   │           └── 1/                          ← mkdir: create namespace ID 1
│   │               ├── device_path             ← echo /dev/nvme0n1 > device_path
│   │               └── enable                  ← echo 1 > enable
│   └── ports/
│       └── 1/                                  ← mkdir: create NVMe-oF port
│           ├── addr_trtype                     ← echo tcp > addr_trtype
│           ├── addr_traddr                     ← echo 192.0.2.1 > addr_traddr
│           ├── addr_trsvcid                    ← echo 4420 > addr_trsvcid
│           └── subsystems/
│               └── nqn.2024-01.com.example:nvme-ssd  ← symlink
└── usb_gadget/                          ← USB gadget framework
    └── g1/                             ← mkdir: create a gadget instance
        ├── idVendor
        ├── idProduct
        └── functions/
            └── mass_storage.0/
                └── lun.0/
                    └── file            ← echo /dev/sdb > file

The directory hierarchy encodes object relationships. Symlinks express associations between independently-created objects (e.g., linking a LUN to its backing store, or attaching a subsystem to a port).

14.12.4 VFS Operations

configfs maps six fundamental filesystem operations onto subsystem callbacks:

mkdir(path) The parent directory's ConfigItemType is consulted. If make_group is defined, a new ConfigGroup is allocated and returned as a subdirectory dentry. If make_item is defined, a new ConfigItem is allocated and returned. Only one of the two may be non-null for a given group type; attempting mkdir on a group that defines neither returns EPERM. Default child groups are created automatically alongside any new group.

rmdir(path) The directory must be empty (no user-created children; default children are exempt from this check and are removed automatically). drop_item is invoked on the parent's ConfigItemType, then the item's kref is decremented. If the kref reaches zero, release is called. Attempting to remove a non-empty directory returns ENOTEMPTY.

open(attr_path) / read(attr_fd) The fd is associated with the specific ConfigAttribute. read(2) invokes ConfigAttribute::show(), which populates the kernel buffer with a text representation. The output is always \n-terminated for shell compatibility.

open(attr_path) / write(attr_fd) write(2) invokes ConfigAttribute::store() with the user-supplied buffer. The subsystem parses and validates the value; on error it returns a negative errno. Writes larger than PAGE_SIZE (4 KiB) are rejected with EINVAL to prevent unbounded allocations.

symlink(src, dst) Used to express dependencies between items: for example, associating a LUN directory with a backstore object, or adding a subsystem to a port's subscriber list. configfs validates that both the source and destination are within the same configfs mount before creating the link. The subsystem's ConfigItemType may reject symlinks by returning EPERM from an optional allow_link callback.

readdir Returns all children of a group: items, subgroups, attribute files, and symlinks. Attribute names are synthesized from ConfigItemType.attrs; no inode backing store is needed.

14.12.5 Linux Compatibility

  • /sys/kernel/config/ mount point and directory layout: byte-for-byte identical to Linux configfs (kernel 5.0+).
  • The ConfigAttribute read/write text format (newline-terminated strings, echo value > file idiom) matches Linux.
  • LIO iSCSI target tools (targetcli, targetcli-fb, rtslib-fb) work without modification.
  • NVMe-oF target tools (nvmetcli) work without modification; see Section 11 for NVMe-oF transport configuration details.
  • USB gadget framework (configfs-gadget, libusbgx) works without modification.
  • Symlink semantics (cross-item dependencies) are identical to Linux: both source and destination must reside within the same configfs mount.

14.13 File Notification System

UmkaOS implements inotify and fanotify with full Linux syscall and wire-format compatibility. Internal delivery uses typed structured channels rather than raw fd-write protocols; the external syscall interfaces are byte-for-byte identical to Linux.

Two interfaces are provided:

  • inotify: informational events (IN_CREATE, IN_MODIFY, etc.), delivered asynchronously via a file descriptor readable with read(2).
  • fanotify: superset of inotify, plus permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM) that block the originating syscall until userspace responds with allow or deny. Used by malware scanners, file integrity monitors, and backup software.

Both are implemented in umka-vfs. Event delivery hooks are called from within the VFS operation dispatch paths — after permission checks pass, before returning to userspace.

14.13.1 inotify

14.13.1.1 In-Kernel Objects

/// Per-inotify-instance state. Created by inotify_init() / inotify_init1().
/// Exposed to userspace as a file descriptor (the fd is backed by a synthetic
/// inode in the anonymous inode filesystem; read(2) on it drains the event queue).
pub struct InotifyInstance {
    /// Watch descriptors: maps wd → InotifyWatch.
    /// WatchDescriptor is the Linux `wd` (i32 cast to u32 index). XArray
    /// provides O(1) lookup with native RCU-protected reads (no read-side
    /// locking) and internal xa_lock for write serialization, replacing the
    /// external RwLock. Watch addition/removal is infrequent (warm path).
    pub watches: XArray<Arc<InotifyWatch>>,

    /// Monotonically increasing allocator for watch descriptors.
    /// WDs are 1-based positive integers per inotify_add_watch(2) contract.
    ///
    /// **Longevity**: i32 allows ~2.1 billion watch additions per inotify fd.
    /// At 1000 add/remove cycles per second, wraps in ~24.8 days. In practice,
    /// applications rarely exceed thousands of watches. If a long-lived daemon
    /// exhausts the space, the application must close and re-open the inotify fd
    /// (matches Linux behavior — Linux also does not recycle WDs).
    /// WD recycling (reusing released WDs via a free list) would extend
    /// the practical lifetime but is deferred: Linux does not recycle WDs,
    /// and changing this would be a behavioral difference.
    pub next_wd: AtomicI32,

    /// Per-instance event queue. Pre-allocated at `inotify_init()` time with
    /// capacity `max_queued_events` (sysctl, default 16384). Fixed capacity avoids
    /// heap allocation under spinlock. Overflow policy: **newest events are dropped**
    /// when the queue is full (the newest event that would be enqueued is discarded,
    /// not an existing queued event). This matches Linux inotify behavior: the
    /// `overflow` flag is set and a synthetic `IN_Q_OVERFLOW` event is prepended
    /// to the next `read(2)` response.
    pub event_queue: SpinLock<BoundedRing<InotifyEventBuf>>,

    /// Set when the event queue overflowed since the last read(2). A synthetic
    /// `IN_Q_OVERFLOW` event is prepended to the next read response and this flag
    /// is cleared. Separate from the queue to avoid occupying a queue slot.
    pub overflow: AtomicBool,

    /// Wait queue for poll()/select()/epoll() on this instance.
    pub wait_queue: WaitQueueHead,

    /// Flags from inotify_init1() (IN_CLOEXEC, IN_NONBLOCK).
    pub flags: u32,
}

/// One inotify watch: a single inode being monitored for specific events.
pub struct InotifyWatch {
    /// Watch descriptor (the value returned to userspace by inotify_add_watch).
    pub wd: WatchDescriptor,

    /// The inode being watched. Holds an Arc reference to prevent premature eviction
    /// while the watch is active.
    pub inode: Arc<Inode>,

    /// Bitmask of watched events (IN_CREATE | IN_MODIFY | IN_CLOSE_WRITE | etc.).
    pub mask: u32,

    /// Back-reference to the owning InotifyInstance (weak to avoid cycles).
    pub instance: Weak<InotifyInstance>,
}

/// Event delivered to userspace via read(2) on the inotify fd.
/// Matches the Linux inotify_event ABI exactly.
#[repr(C)]
pub struct InotifyEvent {
    /// Watch descriptor that fired.
    pub wd: i32,
    /// Event type (IN_CREATE, IN_MODIFY, IN_DELETE, etc.).
    pub mask: u32,
    /// Links related IN_MOVED_FROM and IN_MOVED_TO events (same cookie = same rename).
    pub cookie: u32,
    /// Length of the name[] field in bytes, including the null terminator and any
    /// trailing padding bytes. 0 if no filename is associated with this event
    /// (e.g., IN_ATTRIB on a non-directory inode).
    pub len: u32,
    // Followed immediately by name[len]: null-terminated filename, valid only for
    // events on directory inodes. Padded to a 4-byte boundary.
}
const_assert!(size_of::<InotifyEvent>() == 16);

/// Internal buffer holding a complete inotify event + filename bytes.
/// Uses a fixed-size array instead of `Vec<u8>` to avoid heap allocation
/// inside the spinlock-protected event queue. NAME_MAX is 255; with a null
/// terminator the maximum name is 256 bytes. Padded to 4-byte alignment,
/// the worst case is 256 bytes (255 + 1 null, already 4-byte aligned).
pub struct InotifyEventBuf {
    pub header: InotifyEvent,
    /// The filename, null-terminated and padded to a 4-byte boundary.
    /// Only the first `header.len` bytes are valid; the rest is unused.
    /// Fixed capacity avoids heap allocation under the event_queue SpinLock.
    ///
    /// **Memory budget**: 272 * max_queued_events bytes per instance
    /// (default ~4.25 MiB). With max_user_instances=128, worst-case per-user
    /// kernel memory: ~544 MiB. This is bounded by per-user rlimits and
    /// is comparable to Linux's inotify memory consumption for the same
    /// event depth. Phase 3 optimization: variable-length event entries
    /// reduce typical memory by 10-20x.
    pub name: [u8; 256],
}

14.13.1.2 VFS Integration Hooks

inotify events are generated from dentry/inode operation call sites within the VFS dispatch layer. The fast path check costs a single pointer load:

  • create, mkdir, mknod, symlink → IN_CREATE on parent dir inode
  • unlink, rmdir → IN_DELETE on parent dir; IN_DELETE_SELF on the target inode
  • rename (source side) → IN_MOVED_FROM on old parent + cookie
  • rename (destination side) → IN_MOVED_TO on new parent + same cookie
  • open → IN_OPEN on the inode
  • read, readdir → IN_ACCESS on the inode
  • write, truncate, fallocate → IN_MODIFY on the inode
  • setattr (chmod/chown/utimes) → IN_ATTRIB on the inode
  • close (file was written) → IN_CLOSE_WRITE on the inode
  • close (read-only open) → IN_CLOSE_NOWRITE on the inode
  • inotify watch removed (inode evicted or inotify_rm_watch) → IN_IGNORED on the watch descriptor
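The mapping above is pure dispatch logic. A minimal sketch in Rust: the `IN_*` values are the Linux ABI constants, while `VfsOp` and `event_mask_for` are illustrative names, not the kernel's actual dispatch types.

```rust
// Linux inotify event bit constants (subset).
pub const IN_ACCESS: u32        = 0x0000_0001;
pub const IN_MODIFY: u32        = 0x0000_0002;
pub const IN_ATTRIB: u32        = 0x0000_0004;
pub const IN_CLOSE_WRITE: u32   = 0x0000_0008;
pub const IN_CLOSE_NOWRITE: u32 = 0x0000_0010;
pub const IN_OPEN: u32          = 0x0000_0020;
pub const IN_MOVED_FROM: u32    = 0x0000_0040;
pub const IN_MOVED_TO: u32      = 0x0000_0080;
pub const IN_CREATE: u32        = 0x0000_0100;
pub const IN_DELETE: u32        = 0x0000_0200;

/// Illustrative stand-in for the VFS dispatch call sites.
pub enum VfsOp {
    Create, Unlink, RenameFrom, RenameTo, Open,
    Read, Write, SetAttr, CloseWrite, CloseNoWrite,
}

/// Event mask generated for each VFS operation (per the table above).
pub fn event_mask_for(op: &VfsOp) -> u32 {
    match op {
        VfsOp::Create       => IN_CREATE,
        VfsOp::Unlink       => IN_DELETE,
        VfsOp::RenameFrom   => IN_MOVED_FROM,
        VfsOp::RenameTo     => IN_MOVED_TO,
        VfsOp::Open         => IN_OPEN,
        VfsOp::Read         => IN_ACCESS,
        VfsOp::Write        => IN_MODIFY,
        VfsOp::SetAttr      => IN_ATTRIB,
        VfsOp::CloseWrite   => IN_CLOSE_WRITE,
        VfsOp::CloseNoWrite => IN_CLOSE_NOWRITE,
    }
}
```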

Each Inode carries an inotify_watches field:

/// Maximum number of inotify watches that can be attached to a single inode.
/// Bounded to avoid unbounded heap allocation on the per-inode watch list,
/// which is scanned under a SpinLock on every VFS event delivery. 128 is
/// generous — even heavily-monitored inodes rarely exceed a handful of
/// watchers (one per inotify instance). The system-wide per-user limit
/// (`max_user_watches` sysctl, default 8192) bounds the total; this
/// per-inode cap prevents pathological concentration on a single inode.
const MAX_WATCHES_PER_INODE: usize = 128;

/// Per-inode inotify watch list. Null when no watches are active (the common case).
/// This field is checked on every relevant VFS operation; a null pointer load
/// has zero overhead (no branch misprediction for the vast majority of inodes).
///
/// Uses `OnceLock` for the `None` → `Some` transition: the first `inotify_add_watch`
/// calls `inotify_watches.get_or_init(|| SpinLock::new(ArrayVec::new()))`.
/// `OnceLock` provides internal synchronization for the initialization race —
/// if two threads add the first watch concurrently, only one performs the init.
/// Subsequent accesses are a simple pointer load (no locking overhead).
/// Reverting to "no watches" does NOT clear the OnceLock (the empty SpinLock
/// persists, consuming only the lock + ArrayVec header — ~24 bytes); this avoids
/// an ABA race on the pointer.
pub inotify_watches: OnceLock<SpinLock<ArrayVec<Arc<InotifyWatch>, MAX_WATCHES_PER_INODE>>>,

When the field is None (no watches active), the check is a single null pointer comparison — zero overhead on the fast path for the vast majority of inodes.

14.13.1.3 Event Delivery Algorithm

fsnotify_inode_event(inode, event_mask, name, cookie):
  watches_opt = inode.inotify_watches.get()  // single load, no locking
  if watches_opt is None: return             // fast path: no watches on this inode

  watches = watches_opt.lock()
  for watch in watches.iter():
    fired_mask = watch.mask & event_mask
    if fired_mask == 0: continue
    if let Some(instance) = watch.instance.upgrade():
      buf = InotifyEventBuf {
        header: InotifyEvent { wd: watch.wd, mask: fired_mask, cookie, len: name.len() + padding },
        name: name_bytes_padded_to_4_bytes,
      }
      queue = instance.event_queue.lock()
      if !queue.is_full():
        queue.push(buf)
      else:
        // Queue overflow: set the overflow flag so that the next read(2) prepends
        // a synthetic IN_Q_OVERFLOW event. The AtomicBool lives outside the spinlock;
        // store is done while still holding the lock to ensure the writer side sees
        // the flag before any reader drains the queue.
        instance.overflow.store(true, Ordering::Release)
      drop(queue)
      instance.wait_queue.wake_up_one()  // unblock read()/poll()

14.13.1.4 Syscall Implementations

inotify_add_watch(fd, path, mask) → wd:

  1. Resolve path → inode using normal path resolution.
  2. Look up fd → InotifyInstance.
  3. Scan instance.watches for an existing watch on this inode. If found: update watch.mask = mask (OR behavior if IN_MASK_ADD flag is set; replace otherwise). Return the existing wd.
  4. Enforce max_user_watches: count total watches across all inotify instances for the calling user's real UID. If total >= sysctl.max_user_watches, return ENOSPC (matches Linux errno for this limit). The per-user watch count is tracked via an AtomicU32 in the per-user credential structure for O(1) checking.
  5. Allocate a new WatchDescriptor from instance.next_wd.fetch_add(1). If the result is negative (wrapped past i32::MAX), return ENOSPC — the WD space is exhausted. The application must close and re-open the inotify fd. (Linux also wraps without checking; UmkaOS adds the guard for 50-year uptime correctness at negligible cost.)
  6. Construct InotifyWatch { wd, inode: inode.clone(), mask, instance: Arc::downgrade(&instance) }.
  7. Initialize inode.inotify_watches if it was None.
  8. Insert the watch into both inode.inotify_watches and instance.watches.
  9. Increment the per-user watch count.
  10. Return wd.
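Step 5's wrap guard is simple atomic arithmetic. A minimal sketch using std atomics (`alloc_wd` is an illustrative helper, not the kernel's function name):

```rust
use std::sync::atomic::{AtomicI32, Ordering};

pub const ENOSPC: i32 = 28;

/// Allocate the next watch descriptor. WDs are 1-based positive integers;
/// once fetch_add wraps past i32::MAX the result goes non-positive and the
/// WD space is treated as exhausted.
pub fn alloc_wd(next_wd: &AtomicI32) -> Result<i32, i32> {
    let wd = next_wd.fetch_add(1, Ordering::Relaxed);
    if wd <= 0 { Err(ENOSPC) } else { Ok(wd) }
}
```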

inotify_rm_watch(fd, wd) → 0:

  1. Look up fd → InotifyInstance.
  2. Remove the watch from instance.watches by wd. Return EINVAL if not found.
  3. Remove the corresponding entry from inode.inotify_watches.
  4. If inode.inotify_watches is now empty, the OnceLock persists with an empty ArrayVec (not cleared — see the OnceLock design note above). The ~24-byte overhead avoids ABA races on the pointer.
  5. Deliver an IN_IGNORED event to the instance.
  6. Drop the Arc<InotifyWatch>.

14.13.1.5 Mandatory Event Coalescing

Coalescing rule (mandatory): Before enqueuing a new event, the delivery path checks whether the tail of the instance's EventQueue is an identical event. If so, the new event is discarded (coalesced) rather than enqueued. Two events are identical if and only if:

fn events_are_identical(a: &InotifyEventBuf, b: &InotifyEventBuf) -> bool {
    a.header.wd     == b.header.wd     &&
    a.header.mask   == b.header.mask   &&
    a.header.cookie == b.header.cookie &&
    a.header.len    == b.header.len    &&
    // Byte-for-byte name comparison (only the valid prefix is compared;
    // InotifyEvent alone carries no name bytes, so the buffer type is used).
    a.name[..a.header.len as usize] == b.name[..b.header.len as usize]
}

The check is against the tail only (O(1)), not the entire queue. Events are coalesced only when consecutive and identical — non-consecutive duplicates are not coalesced (ordering is preserved for different events between duplicates).

IN_MOVED_FROM / IN_MOVED_TO cookie pairing: Cookie values are assigned by a per-VFS-instance AtomicU32 cookie_counter. Consecutive rename operations get consecutive cookie values. Coalescing does NOT apply to cookie-bearing events (mask has IN_MOVED_FROM or IN_MOVED_TO set) — rename pairs must always be delivered in full.
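Both rules together can be sketched as follows (a `VecDeque` stands in for the `BoundedRing`, and `Ev` is a simplified illustrative event tuple):

```rust
use std::collections::VecDeque;

pub const IN_MOVED_FROM: u32 = 0x40;
pub const IN_MOVED_TO: u32 = 0x80;

/// Simplified stand-in for InotifyEventBuf.
#[derive(Clone, PartialEq, Eq)]
pub struct Ev {
    pub wd: i32,
    pub mask: u32,
    pub cookie: u32,
    pub name: Vec<u8>,
}

/// Enqueue `ev`, discarding it if identical to the queue tail (O(1) check).
/// Cookie-bearing rename events are never coalesced.
pub fn enqueue(q: &mut VecDeque<Ev>, ev: Ev) {
    let coalescible = ev.mask & (IN_MOVED_FROM | IN_MOVED_TO) == 0;
    if coalescible {
        if let Some(tail) = q.back() {
            if *tail == ev {
                return; // identical consecutive event: coalesce (drop new one)
            }
        }
    }
    q.push_back(ev);
}
```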

IN_Q_OVERFLOW: When the fixed-capacity BoundedRing is full and a new event cannot be enqueued (even after attempting coalescing), the InotifyInstance.overflow AtomicBool is set to true. On the next read(2), the read path checks this flag first: if set, it clears the flag and prepends a synthetic IN_Q_OVERFLOW event (wd=-1, mask=IN_Q_OVERFLOW, cookie=0, name="") before draining normal events. This keeps the overflow sentinel out of the ring buffer itself, preserving all max_queued_events slots for real events. The queue is never silently dropped without this sentinel.

Performance: Under cargo build workloads (10k+ file writes), inotify watchers on the build directory receive IN_MODIFY storms. Coalescing reduces queue pressure by 10-100x for write-heavy workloads where the application re-reads the file on any change (editor reload, build system).

Linux compatibility: Linux inotify performs the same tail-coalescing. UmkaOS mandates it (Linux specifies it informally). The IN_Q_OVERFLOW sentinel behaviour is identical to Linux.

14.13.1.6 inotify Sysctls (/proc/sys/fs/inotify/)

  • max_user_instances (default 128): enforced at inotify_init() / inotify_init1(). Maximum inotify file descriptors per real UID. Returns EMFILE when exceeded.
  • max_user_watches (default 8192): enforced at inotify_add_watch(). Maximum watches across all inotify instances per real UID. Returns ENOSPC when exceeded.
  • max_queued_events (default 16384): enforced at event enqueue (§14.13.1.3). Maximum pending events per inotify instance before IN_Q_OVERFLOW. Set at inotify_init() time.

Note on max_user_watches default: Linux kernels 5.11+ dynamically increase this limit based on available memory (up to 1048576). UmkaOS uses the static default of 8192 (matching the historical Linux default) but allows runtime tuning via the sysctl. The event queue capacity per instance is max_queued_events (not the compile-time generic parameter — the BoundedRing is allocated with capacity max_queued_events at inotify_init() time).

Enforcement:

  • inotify_init() / inotify_init1(): check the per-user instance count against max_user_instances. If exceeded, return EMFILE.
  • inotify_add_watch(): check the per-user watch count against max_user_watches (step 4 in §14.13.1.4). If exceeded, return ENOSPC.
  • Event enqueue: when the per-instance event queue reaches max_queued_events, new events are dropped and the overflow flag is set (§14.13.1.5).

14.13.1.7 read(2) Serialization Protocol

Events are packed contiguously in the user buffer provided to read(2):

  1. Each event consists of struct inotify_event (16 bytes: wd i32 + mask u32 + cookie u32 + len u32) followed by len bytes of filename data.
  2. The filename is null-terminated and padded with additional null bytes to align the total event size (sizeof(inotify_event) + len) to the next 4-byte boundary. The len field includes all null bytes (terminator + padding).
  3. For events without a filename (e.g., IN_ATTRIB on a non-directory), len is 0 and no name bytes follow the header.
  4. If the user buffer is smaller than sizeof(inotify_event) (16 bytes), read(2) returns EINVAL (matching Linux ≥ 2.6.21 behavior).
  5. Partial events are never returned: if the next event in the queue does not fit in the remaining buffer space, read(2) stops and returns the number of bytes written so far. If no events have been written yet (first event does not fit), return EINVAL.
  6. If the overflow flag is set, a synthetic IN_Q_OVERFLOW event (wd=-1, mask=IN_Q_OVERFLOW, cookie=0, len=0, total 16 bytes) is prepended before draining normal events. The flag is cleared after prepending.
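The len and padding arithmetic above can be sketched in Rust (`padded_len` and `pack_event` are illustrative helpers; the 16-byte header and 4-byte alignment rule are as specified):

```rust
pub const HDR: usize = 16; // sizeof(struct inotify_event)

/// Value of the `len` field for a given filename: name bytes + NUL
/// terminator, rounded up to a 4-byte boundary (0 for nameless events).
pub fn padded_len(name: &[u8]) -> u32 {
    if name.is_empty() {
        return 0;
    }
    (((name.len() + 1 + 3) / 4) * 4) as u32
}

/// Serialize one event into `buf`; returns total bytes appended.
pub fn pack_event(buf: &mut Vec<u8>, wd: i32, mask: u32, cookie: u32, name: &[u8]) -> usize {
    let len = padded_len(name);
    buf.extend_from_slice(&wd.to_ne_bytes());
    buf.extend_from_slice(&mask.to_ne_bytes());
    buf.extend_from_slice(&cookie.to_ne_bytes());
    buf.extend_from_slice(&len.to_ne_bytes());
    buf.extend_from_slice(name);
    // NUL terminator plus padding bytes, all counted in `len`.
    buf.extend(std::iter::repeat(0u8).take(len as usize - name.len()));
    HDR + len as usize
}
```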

14.13.2 fanotify

fanotify extends inotify with:

  1. Filesystem-wide and mount-wide marks (not just per-inode): a single mark can cover an entire mount point or filesystem, eliminating the need to add per-inode watches for directories being monitored for new file creation.
  2. Permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM): the originating syscall blocks until the fanotify daemon responds with allow or deny, subject to a mandatory per-group timeout (default 5000ms) to prevent system-wide I/O stalls.

14.13.2.1 Data Structures

/// Per-fanotify-instance state. Created by fanotify_init().
pub struct FanotifyInstance {
    /// Mark tables: one XArray per mark type, keyed by the object's u64 ID.
    /// Three separate XArrays (matching the QuotaCache precedent in
    /// disk-quota-subsystem.md) instead of a single `BTreeMap<FanotifyMarkKey>`:
    /// (1) XArray provides O(1) lookup with RCU-compatible reads,
    /// (2) avoids BTreeMap with enum-wrapped integer keys (collection policy),
    /// (3) allows independent locking per mark type.
    /// Mark management (fanotify_mark() syscall) is warm-path. Event delivery
    /// traverses per-inode/mount/sb mark lists (attached when marks are added),
    /// not these central tables.
    pub inode_marks: XArray<Arc<FanotifyMark>>,
    pub mount_marks: XArray<Arc<FanotifyMark>>,
    pub sb_marks: XArray<Arc<FanotifyMark>>,

    /// Informational event queue (non-permission events).
    /// Uses a pre-allocated fixed-capacity ring buffer (`BoundedRing`) to avoid
    /// heap allocation under spinlock. Capacity is set at `fanotify_init()` time
    /// (default 16384, matching Linux `FANOTIFY_DEFAULT_MAX_EVENTS`).
    /// Events beyond capacity are dropped and a `FAN_Q_OVERFLOW` synthetic event
    /// is generated (matching Linux behavior).
    pub event_queue: SpinLock<BoundedRing<FanotifyEvent>>,

    /// Set when the event queue overflowed since the last read(2). A synthetic
    /// `FAN_Q_OVERFLOW` event is prepended to the next read response and this
    /// flag is cleared. Kept outside the ring to avoid occupying an event slot.
    pub overflow: AtomicBool,

    /// Pending permission requests: keyed by a unique request ID (u64)
    /// assigned at creation. Entries are removed when the daemon writes a
    /// response. XArray provides O(1) lookup with internal xa_lock for write
    /// serialization, replacing the external SpinLock.
    pub perm_queue: XArray<Arc<FanotifyPermRequest>>,

    /// Next permission request ID (monotonically increasing).
    pub next_perm_id: AtomicU64,

    /// Wait queue for poll()/select()/epoll() on this instance.
    pub wait_queue: WaitQueueHead,

    /// Notification class: determines permission event delivery order when multiple
    /// fanotify instances watch the same inode.
    /// FAN_CLASS_NOTIF=0x00000000: informational only.
    /// FAN_CLASS_CONTENT=0x00000004: content scanners (see file after open).
    /// FAN_CLASS_PRE_CONTENT=0x00000008: DLP / integrity monitors (see file before open).
    /// Higher class is notified first. Within the same class, order is unspecified.
    pub class: FanotifyClass,

    /// Flags from fanotify_init() (FAN_CLOEXEC, FAN_NONBLOCK, FAN_REPORT_FID, etc.).
    pub flags: u32,

    /// Maximum time to wait for a permission event response.
    /// Default: 5000ms. Configurable per group at fanotify_init() time via
    /// FANOTIFY_INIT_PERM_TIMEOUT_MS (UmkaOS extension, not in Linux).
    /// A value of 0 means: use the system default from
    /// /proc/sys/fs/fanotify/perm_timeout_ms.
    pub perm_timeout: Duration,

    /// Action taken when a permission request times out:
    /// - PermTimeoutAction::Deny: return EPERM to the originating syscall (safe default)
    /// - PermTimeoutAction::Allow: allow the operation (permissive mode for monitoring-only daemons)
    pub perm_timeout_action: PermTimeoutAction,

    /// Count of permission request timeouts in this group since creation.
    /// Incremented by the timeout handler; exported via
    /// /proc/PID/fdinfo/<fafd> as `perm_timeout_count`.
    pub timeout_count: AtomicU64,
}

pub enum PermTimeoutAction {
    Deny,   // Return EPERM to originating syscall on timeout (default)
    Allow,  // Allow the operation on timeout (for monitoring daemons that tolerate loss)
}

/// Mark type discriminant for fanotify_mark() dispatch. Determines which
/// XArray (`inode_marks`, `mount_marks`, or `sb_marks`) to use for the
/// mark operation. The u64 ID is extracted from the discriminant and used
/// as the XArray key directly.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum FanotifyMarkKey {
    /// Inode mark: watches a specific file or directory.
    /// `inode_id` is the filesystem-wide inode number.
    Inode { inode_id: u64 },
    /// Mount mark: watches all files under a mount point.
    /// `mount_id` is the unique mount ID from the mount tree.
    Mount { mount_id: u64 },
    /// Filesystem mark: watches all files on a filesystem (superblock scope).
    /// `sb_id` is a unique identifier for the superblock (device number).
    Filesystem { sb_id: u64 },
}

/// A single fanotify mark: attaches event interest to an inode, mount, or superblock.
pub struct FanotifyMark {
    pub mark_type: FanotifyMarkType,  // FAN_MARK_INODE, FAN_MARK_MOUNT, FAN_MARK_FILESYSTEM
    /// Object identifier: inode_id (for inode marks), mount_id (for mount marks),
    /// or superblock pointer (for filesystem marks).
    pub object_id: u64,
    /// Event mask this mark is listening for.
    pub mask: u64,
    /// Ignore mask: events matching this mask are suppressed even if mask is set.
    pub ignored_mask: u64,
    pub instance: Weak<FanotifyInstance>,
}

/// A pending permission request: holds the event plus the response channel.
pub struct FanotifyPermRequest {
    /// The event as delivered to userspace via read(2) on the fanotify fd.
    pub event: FanotifyEvent,
    /// Unique request ID (matches the fd-based identification in the response).
    pub request_id: u64,
    /// Set to FAN_ALLOW or FAN_DENY by the daemon's write(2) response.
    /// Uses OnceLock<u32> (first-writer-wins via `.set()`): the first
    /// writer (daemon's write() or timeout handler) wins; the loser's
    /// `.set()` returns Err. This enforces the semantic at the type level.
    pub response: OnceLock<u32>,
    /// Wakes the blocked originating syscall when response becomes Some.
    pub waker: WaitQueueHead,
}

/// Event delivered to userspace via read(2) on the fanotify fd.
/// Matches Linux's fanotify_event_metadata ABI.
#[repr(C)]
pub struct FanotifyEvent {
    pub event_len: u32,    // Total length of this event record (including variable info records)
    pub vers: u8,          // FANOTIFY_METADATA_VERSION (always 3)
    pub reserved: u8,
    pub metadata_len: u16, // sizeof(FanotifyEvent)
    pub mask: u64,         // Event type bitmask
    pub fd: i32,           // Opened fd for the file (or -1 with FAN_REPORT_FID)
    pub pid: i32,          // PID of the process that triggered the event
}
// mask field matches Linux `__aligned_u64` — naturally 8-byte aligned at offset 8.
const_assert!(size_of::<FanotifyEvent>() == 24);

/// Variable-length event information record header.
/// Appended after `FanotifyEvent` when `FAN_REPORT_FID` is set in the
/// fanotify group flags. Multiple info records may follow a single event
/// (e.g., `FAN_REPORT_FID | FAN_REPORT_DFID_NAME` produces two records).
/// Linux ABI: `struct fanotify_event_info_header` (linux/fanotify.h).
#[repr(C)]
pub struct FanotifyEventInfoHeader {
    /// Info record type. Determines the layout of the data following
    /// this header. Values:
    /// - `FAN_EVENT_INFO_TYPE_FID` (1): file handle info.
    /// - `FAN_EVENT_INFO_TYPE_DFID_NAME` (2): directory + name.
    /// - `FAN_EVENT_INFO_TYPE_DFID` (3): directory file handle only.
    /// - `FAN_EVENT_INFO_TYPE_PIDFD` (4): pidfd info (Linux 5.15+).
    /// - `FAN_EVENT_INFO_TYPE_ERROR` (5): filesystem error info (Linux 5.16+).
    pub info_type: u8,
    /// Padding for alignment.
    pub pad: u8,
    /// Total length of this info record (header + payload), in bytes.
    /// Must be a multiple of 4 (aligned to u32 boundary).
    pub len: u16,
}

/// File identifier info record. Follows `FanotifyEventInfoHeader` when
/// `info_type == FAN_EVENT_INFO_TYPE_FID` (or DFID/DFID_NAME).
/// Linux ABI: `struct fanotify_event_info_fid` (linux/fanotify.h).
///
/// The file handle bytes follow immediately after this struct. The total
/// record size is `sizeof(FanotifyEventInfoFid)` (which already includes the
/// info header) plus the `struct file_handle` bytes, padded to a 4-byte boundary.
#[repr(C)]
pub struct FanotifyEventInfoFid {
    /// Header identifying this record type and total length.
    pub hdr: FanotifyEventInfoHeader,
    /// Filesystem identifier (same as `statfs.f_fsid`). Allows userspace
    /// to correlate the file handle with a specific mounted filesystem.
    pub fsid: FsId,
    /// Variable-length file handle. The first 4 bytes are the handle
    /// length (matching `struct file_handle.handle_bytes`), followed by
    /// the handle type (4 bytes) and the handle data. The total length
    /// including padding is recorded in `hdr.len`.
    // Followed by: struct file_handle { u32 handle_bytes; i32 handle_type; u8 f_handle[]; }
}

/// Filesystem identifier (matches `__kernel_fsid_t` / `statfs.f_fsid`).
#[repr(C)]
pub struct FsId {
    pub val: [i32; 2],
}
const_assert!(size_of::<FanotifyEventInfoHeader>() == 4);
const_assert!(size_of::<FsId>() == 8);
const_assert!(size_of::<FanotifyEventInfoFid>() == 12);

/// fanotify event info type constants. Match Linux `FAN_EVENT_INFO_TYPE_*`.
pub const FAN_EVENT_INFO_TYPE_FID: u8 = 1;
pub const FAN_EVENT_INFO_TYPE_DFID_NAME: u8 = 2;
pub const FAN_EVENT_INFO_TYPE_DFID: u8 = 3;
pub const FAN_EVENT_INFO_TYPE_PIDFD: u8 = 4;
pub const FAN_EVENT_INFO_TYPE_ERROR: u8 = 5;
/// Byte-range info for pre-content events (fanotify pre-content scanning).
/// Linux 6.12+.
pub const FAN_EVENT_INFO_TYPE_RANGE: u8 = 6;
/// Mount ID info for mount-aware fanotify. Linux 6.12+.
pub const FAN_EVENT_INFO_TYPE_MNT: u8 = 7;
// Types 8, 9 reserved by Linux.
/// Source directory+name for rename events. Linux 6.6+.
pub const FAN_EVENT_INFO_TYPE_OLD_DFID_NAME: u8 = 10;
// Type 11 reserved by Linux.
/// Destination directory+name for rename events. Linux 6.6+.
pub const FAN_EVENT_INFO_TYPE_NEW_DFID_NAME: u8 = 12;

pub enum FanotifyMarkType { Inode, Mount, Filesystem }

pub enum FanotifyClass {
    Notif = 0x0000_0000,      // FAN_CLASS_NOTIF
    Content = 0x0000_0004,    // FAN_CLASS_CONTENT
    PreContent = 0x0000_0008, // FAN_CLASS_PRE_CONTENT
}
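The FID record layout above implies a simple length computation for `hdr.len`: the 12-byte `FanotifyEventInfoFid` (info header + fsid), the 8-byte `file_handle` header (`handle_bytes` + `handle_type`), and the handle data, rounded up to 4 bytes. A sketch (`fid_record_len` is an illustrative helper):

```rust
pub const FID_STRUCT: usize = 12; // FanotifyEventInfoFid: hdr (4) + fsid (8)
pub const FH_HDR: usize = 8;      // file_handle: handle_bytes u32 + handle_type i32

/// hdr.len for a FAN_EVENT_INFO_TYPE_FID record carrying `handle_data`
/// bytes of opaque handle payload, padded to a 4-byte boundary.
pub fn fid_record_len(handle_data: usize) -> u16 {
    let raw = FID_STRUCT + FH_HDR + handle_data;
    (((raw + 3) / 4) * 4) as u16
}
```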

14.13.2.2 Permission Event Flow

When a VFS operation triggers a permission-event mask bit (e.g., FAN_OPEN_PERM on open(2)):

fanotify_perm_event(inode, event_type, opener_pid):
  // Collect all matching fanotify instances in class order (PreContent first).
  matching = collect_matching_marks(inode, event_type)
  if matching is empty: return Ok(())  // fast path

  for instance in matching sorted by class descending:
    id = instance.next_perm_id.fetch_add(1)
    event_fd = open_file_for_fanotify(inode)  // opens fd for daemon to inspect
    event = FanotifyEvent { mask: event_type, fd: event_fd, pid: opener_pid, ... }
    req = Arc::new(FanotifyPermRequest { event, request_id: id, response: OnceLock::new(), waker })

    instance.perm_queue.insert(id, req.clone())  // XArray: internal xa_lock, no external lock
    queue = instance.event_queue.lock()
    if !queue.is_full():
      queue.push(event)
    else:
      // Queue overflow: drop event, set FAN_Q_OVERFLOW flag (matching Linux)
      instance.overflow.store(true, Ordering::Release)
    instance.wait_queue.wake_up_one()

    // Block with mandatory timeout — never block indefinitely. The waker is
    // signalled once req.response is set (first-writer-wins OnceLock).
    match req.waker.wait_timeout(instance.perm_timeout, || req.response.get()):
      Ok(FAN_ALLOW): close(event_fd); continue  // allow: close fd, check next instance
      Ok(FAN_DENY):  close(event_fd); return Err(EPERM)
      Err(Timeout):
        // Log timeout: fanotify daemon too slow
        log_warn!("fanotify: perm request timed out after {:?}, action={:?}",
                  instance.perm_timeout, instance.perm_timeout_action)
        // Increment per-group timeout counter (visible in /proc/PID/fdinfo/<fafd>)
        instance.timeout_count.fetch_add(1, Ordering::Relaxed)
        match instance.perm_timeout_action:
          PermTimeoutAction::Deny  → close(event_fd); return Err(EPERM)
          PermTimeoutAction::Allow → close(event_fd); continue  // allow on timeout

  return Ok(())  // all instances allowed

Timeout vs late response race: If the daemon responds after the timeout fires but before the requesting thread fully unblocks, the response is discarded. The req.response uses an OnceLock<u32> first-writer-wins pattern: the first writer (either the daemon's write() or the timeout handler) wins. The loser's write is a no-op. This prevents both double-free of the event fd and contradictory allow-then-deny sequences. The daemon's late response is logged at DEBUG level for diagnostic purposes.
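The first-writer-wins rule maps directly onto `std::sync::OnceLock` semantics: the second `.set()` returns `Err` and the stored value is unchanged. A minimal demonstration (`race` is an illustrative helper; `FAN_ALLOW`/`FAN_DENY` use the Linux values):

```rust
use std::sync::OnceLock;

pub const FAN_ALLOW: u32 = 0x01;
pub const FAN_DENY: u32 = 0x02;

/// Simulate two writers racing on the response slot: the first .set() wins,
/// the second fails. Returns (race_resolved_correctly, final_value).
pub fn race(first: u32, second: u32) -> (bool, u32) {
    let response: OnceLock<u32> = OnceLock::new();
    let won = response.set(first).is_ok();    // e.g. the timeout handler
    let lost = response.set(second).is_err(); // e.g. the daemon's late write()
    (won && lost, *response.get().unwrap())
}
```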

Mandatory permission event timeout: Permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM) have a mandatory response timeout to prevent system-wide I/O stalls.

System-wide timeout knob: /proc/sys/fs/fanotify/perm_timeout_ms (default: 5000). Can be set to 0 to disable timeout (not recommended; requires CAP_SYS_ADMIN).

Monitoring: /proc/sys/fs/fanotify/perm_timeout_count — system-wide count of permission request timeouts (monotonic counter, reset on boot). Per-group count in /proc/PID/fdinfo/<fafd> as perm_timeout_count: N.

Linux compatibility note: Linux fanotify has no timeout on permission events (daemon death causes permanent block — requires daemon restart or fanotify fd close). UmkaOS's timeout is an improvement over Linux; existing fanotify daemons work unchanged (they don't set FANOTIFY_INIT_PERM_TIMEOUT_MS, so they get the 5s default with Deny on timeout). Tools like systemd-oomd, CrowdStrike Falcon, and audit daemons that use fanotify will benefit automatically from the safety timeout.
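The userspace-visible effect of the timeout policy can be sketched with `std::sync::mpsc::recv_timeout` standing in for the in-kernel wait: a missing daemon response resolves to the group's configured timeout action (all names here are illustrative).

```rust
use std::sync::mpsc;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum Decision { Allow, Deny }

/// Wait for the daemon's decision; fall back to the group's
/// PermTimeoutAction if no response arrives in time.
pub fn wait_for_response(
    rx: &mpsc::Receiver<Decision>,
    timeout: Duration,
    on_timeout: Decision,
) -> Decision {
    match rx.recv_timeout(timeout) {
        Ok(decision) => decision,
        Err(_) => on_timeout, // daemon too slow: apply the timeout action
    }
}
```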

Userspace daemon writes FAN_ALLOW / FAN_DENY:

write(fanotify_fd, &fanotify_response { fd: event_fd, response: FAN_ALLOW_or_DENY }):
  // Match the response to a pending request by event_fd.
  // Linux ABI compatibility: the userspace `fanotify_response` struct uses `fd` as
  // the matching key. Internally, the kernel maps fd → request using the per-group
  // fd-to-request XArray (O(1) lookup). An internal request_id is used only for
  // kernel-side tracking and logging; it is never exposed to userspace.
  req = find_perm_request_by_fd(instance.perm_queue, event_fd)
  if req is None: return Err(EINVAL)  // stale or already answered
  let _ = req.response.set(FAN_ALLOW_or_DENY)  // first-writer-wins; Err if the timeout already decided
  req.waker.wake_up_one()  // unblock the blocked syscall

UmkaOS improvement over Linux fanotify: Linux matches responses to pending permission requests by the fd number inside the fanotify_response struct, which becomes ambiguous if the daemon closes and reopens fds in the event window. UmkaOS keeps the fd as the daemon-visible matching key for ABI compatibility, but wraps each pending request in a typed FanotifyPermRequest tracked by a monotonically increasing internal request_id. The Arc<FanotifyPermRequest> lifetime guarantees the blocked syscall's stack is valid until the response (or timeout) arrives, eliminating the lifetime ambiguity of the raw fd-matching approach.

14.13.3 UmkaOS-Native File Watch Capabilities

UmkaOS provides a capability-based file watching API as a modern alternative to inotify. Unlike inotify (global watch descriptor namespace, process-scoped), FileWatchCap watches are:

  • Capability-scoped: unforgeable, revocable, auditable
  • Memory-bounded: each watch is a capability slot (no global state)
  • Automatically revoked: when the capability is dropped or the process exits
  • Ring-delivered: events go to a typed UmkaOS ring buffer, not a read() queue
  • Composable: multiple watches can share one ring

inotify remains fully supported for Linux compatibility. FileWatchCap is the recommended API for new UmkaOS code.

/// A capability granting the holder the right to watch a specific inode for
/// specific events. Cannot be forged; issued by the kernel only.
/// Revocable via the standard capability revocation path (Section 9.1).
pub struct FileWatchCap {
    /// The inode to watch. Kernel-internal reference — not a path (immune to rename).
    inode: Arc<Inode>,
    /// Events to deliver (subset of InotifyMask).
    mask: InotifyMask,
    /// Watch children of this directory (if inode is a directory).
    watch_children: bool,
    /// Watch children recursively (deep watch — UmkaOS extension, not in inotify).
    watch_recursive: bool,
}

/// Subscribe to inode events via a capability.
/// Events are delivered to `ring` as typed `FileWatchEvent` structs.
///
/// Returns a `WatchHandle` — dropping the handle unregisters the watch.
pub fn inode_watch(
    cap: FileWatchCap,
    ring: Arc<EventRing<FileWatchEvent>>,
) -> Result<WatchHandle, WatchError>;

/// A single file watch event, delivered to the ring.
/// C-compatible layout: uses explicit length + fixed array instead of
/// `Option<ArrayString<255>>` (which has Rust-internal layout).
// kernel-internal, not KABI
#[repr(C)]
pub struct FileWatchEvent {
    pub event_type: FileWatchEventType, // enum (see below)
    pub cookie: u32,                    // for rename pairs (FROM/TO share cookie)
    pub inode_id: u64,                  // stable inode number
    pub name_len: u8,                   // 0 = no name; >0 = first `name_len` bytes valid
    pub name: [u8; 255],                // filename (for directory events), NUL-padded
    pub timestamp: MonotonicInstant,    // UmkaOS extension: not in inotify
}
// FileWatchEvent layout: event_type(u32=4) + cookie(u32=4) + inode_id(u64=8) +
// name_len(u8=1) + name([u8;255]=255) + timestamp(MonotonicInstant(u64)=8).
// After name_len+name: offset = 4+4+8+1+255 = 272. 272 % 8 = 0, no padding.
// Total: 272 + 8 = 280 bytes.
const_assert!(core::mem::size_of::<FileWatchEvent>() == 280);

/// Monotonic timestamp (nanoseconds since boot, from CLOCK_MONOTONIC).
/// Used for UmkaOS extensions where wall-clock time is not needed.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct MonotonicInstant(pub u64);

#[repr(u32)]
pub enum FileWatchEventType {
    Access,       // File was read
    Modify,       // File was written
    Attrib,       // Metadata changed (chmod, chown, timestamps)
    CloseWrite,   // File opened for writing was closed
    CloseNoWrite, // File opened read-only was closed
    Open,         // File was opened
    MovedFrom,    // File moved out (cookie matches MovedTo)
    MovedTo,      // File moved in (cookie matches MovedFrom)
    Create,       // File created in watched directory
    Delete,       // File deleted from watched directory
    DeleteSelf,   // Watched file itself was deleted
    MoveSelf,     // Watched file itself was moved
    Unmount,      // Filesystem containing watched file was unmounted
}
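A consumer draining the ring recovers the filename from the name_len/name pair. A minimal sketch of that decoding, using a struct pared down to just those two fields (the EventName type and name_bytes helper are illustrative, not part of the KABI):

```rust
/// Sketch: recovering the filename from a fixed-size, NUL-padded name field,
/// mirroring the `name_len`/`name` fields of `FileWatchEvent` above.
pub struct EventName {
    pub name_len: u8,    // 0 = no name; >0 = first `name_len` bytes valid
    pub name: [u8; 255], // filename bytes, NUL-padded
}

impl EventName {
    /// Returns the valid filename bytes, or None for events with no name
    /// (e.g. Access/Modify on the watched file itself).
    pub fn name_bytes(&self) -> Option<&[u8]> {
        if self.name_len == 0 {
            None
        } else {
            Some(&self.name[..self.name_len as usize])
        }
    }
}
```

Keeping the length explicit (rather than scanning for the NUL terminator) means a filename that legitimately contains fewer than 255 bytes is never confused with its padding.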

Deep watch (watch_recursive: true): watches a directory tree recursively. UmkaOS maintains a kernel-side tree of watch registrations, automatically adding watches for new subdirectories as they are created (IN_CREATE on a directory). inotify has no recursive watch; tools like inotifywait -r simulate it with userspace polling, which has TOCTOU races. UmkaOS's deep watch is race-free.

Deep watch resource limits: Recursive watches can consume significant kernel memory on deep directory trees (e.g., a deep watch on / with millions of directories). Limits are enforced per-user to prevent denial-of-service:

  • max_deep_watches_per_user: default 128 (sysctl fs.inotify.max_deep_watches). Exceeding this returns ENOSPC.
  • max_watch_entries_per_deep_watch: default 65536. If the directory tree contains more subdirectories than this, the kernel stops adding new watches beyond the limit and delivers IN_Q_OVERFLOW to signal the user that coverage is incomplete.
  • Memory accounting: each internal watch node costs ~128 bytes. A deep watch on a tree with 65536 directories consumes ~8 MB of kernel memory, charged to the user's RLIMIT_MEMLOCK limit.
  • CAP_SYS_ADMIN can override max_deep_watches_per_user up to the system-wide hard limit of 1024.

Obtaining a FileWatchCap: the capability is issued via:

/// Open a FileWatchCap for a path (requires read permission on the path).
pub fn open_watch_cap(
    dirfd: DirFd,
    path: &Path,
    mask: InotifyMask,
    watch_children: bool,
    watch_recursive: bool,
) -> Result<FileWatchCap, WatchError>;

Revocation: WatchHandle::drop() unregisters the watch. When the process exits, all WatchHandles are dropped automatically — no cleanup required. Capability revocation (Section 9.1) also revokes all file watches derived from the revoked capability.

Linux compatibility: FileWatchCap is an UmkaOS-only API. inotify_init(), inotify_add_watch(), inotify_rm_watch() work identically to Linux. FileWatchCap is intended for new UmkaOS applications; existing Linux software uses inotify unchanged.

14.13.4 Cross-References

  • Section 14.1 (VFS Traits): inotify/fanotify hooks are inserted at the VFS operation dispatch layer, after InodeOps/FileOps call sites complete successfully.
  • Section 17.1 (Namespace Implementation): fanotify marks survive CLONE_NEWNS and remain attached to the underlying inode/mount, not to a specific mount namespace. Marks set in a parent namespace remain visible in child namespaces for the same underlying mount.
  • Section 9.1 (Security): fanotify_init(FAN_CLASS_CONTENT) and fanotify_init(FAN_CLASS_PRE_CONTENT) require CAP_SYS_ADMIN. Informational fanotify (FAN_CLASS_NOTIF) requires no capability (Linux 5.13+, unprivileged fanotify); UmkaOS follows the same requirement for compatibility.

14.14 Local File Locking (flock / fcntl POSIX Locks / OFD Locks)

UmkaOS provides three advisory file locking interfaces, each with distinct semantics:

The three interfaces compare as follows:

  • flock(2): whole-file granularity; per open-file-description scope; inherited on fork (the child shares the open file description, so the same lock is shared and either process can release it); released on the last close of the description.
  • fcntl F_SETLK: byte-range (POSIX) granularity; per-process (PID) scope; not inherited on fork; released on process exit OR on any close of the file.
  • fcntl F_OFD_SETLK: byte-range (OFD) granularity; per open-file-description scope; inherited on fork; released on the last close of the description.

All three are advisory: a process can read and write a file regardless of locks held by other processes. Locks only prevent other processes from acquiring conflicting locks. Mandatory locking (Linux MS_MANDLOCK) is deliberately not implemented — it was removed in Linux 5.15 and is incompatible with modern VFS semantics.

14.14.1 Data Structures

/// A single file lock entry. Stored in the per-inode `FileLockTree`.
pub struct FileLock {
    /// Lock type: read (shared) or write (exclusive).
    pub lock_type: FileLockType,

    /// Byte range: [start, end] inclusive. 0..=u64::MAX represents the whole file.
    /// For flock locks, start=0 and end=u64::MAX always.
    pub start: u64,
    pub end: u64,

    /// For POSIX locks: the PID of the owning process.
    /// All POSIX locks held by a process are released when it exits OR when
    /// any file descriptor for the file is closed (POSIX semantics).
    /// For OFD locks: None. The lock is owned by the open-file-description.
    /// For flock locks: None. The lock is owned by the open-file-description.
    pub owner_pid: Option<Pid>,

    /// The open-file-description that created this lock.
    /// Weak reference: if the description is dropped (last fd closed), the lock
    /// is released. For POSIX locks, `owner_pid` is the primary ownership token
    /// and `owner_fd` is advisory for conflict matching.
    pub owner_fd: Weak<OpenFile>,

    /// Wait queue: tasks blocked waiting for this lock to be released sleep here.
    pub wait_queue: WaitQueueHead,
}

pub enum FileLockType {
    /// Shared (read) lock. Multiple readers can hold simultaneously.
    Read,
    /// Exclusive (write) lock. No other lock may be held concurrently.
    Write,
}

/// Per-inode lock state. Present only on inodes that have had locks acquired;
/// None on inodes that have never been locked (zero overhead on the fast path).
pub struct InodeLocks {
    /// Augmented interval tree of active locks (POSIX, flock, and OFD locks).
    /// Sorted by `l_start`; each node carries `subtree_max: u64` = maximum
    /// `l_end` in its subtree. This enables O(log n) range overlap queries.
    /// See Section 14.10.3 for the full algorithm specification.
    pub locks: FileLockTree,
    /// Protects the lock tree. Operations must be atomic with respect to each other.
    pub lock: SpinLock<()>,
}

/// Augmented interval tree for file lock conflict detection.
/// Red-black tree sorted by `l_start`, augmented with `subtree_max` for
/// O(log n) range overlap queries.
pub struct FileLockTree {
    /// Root of the red-black tree. None when no locks are held.
    root: Option<Box<FileLockNode>>,
    /// Number of locks currently in the tree.
    count: usize,
}

/// FileLockNode allocation uses a dedicated slab cache (`file_lock_slab`)
/// with per-CPU magazines, matching Linux's `file_lock_cache`. This provides
/// bounded warm-path allocation without general-heap contention.
pub struct FileLockNode {
    pub lock: FileLock,
    /// Maximum `l_end` value in this node's subtree (including this node).
    /// Updated on every insert/delete along the path to the root.
    pub subtree_max: u64,
    pub left: Option<Box<FileLockNode>>,
    pub right: Option<Box<FileLockNode>>,
    pub color: RbColor,
}

pub enum RbColor { Red, Black }

14.14.2 Conflict Detection

Two locks conflict if all of the following hold:

  1. At least one is a write lock (FileLockType::Write).
  2. Their byte ranges overlap: !(lock_a.end < lock_b.start || lock_b.end < lock_a.start).
  3. They have different owners:
     • For POSIX locks: different PIDs.
     • For OFD/flock locks: different Weak<OpenFile> pointers.
     • A POSIX lock can upgrade/replace an existing POSIX lock from the same PID without conflict.
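The conflict predicate can be sketched as a pure function. The Lock struct here is a simplification: the owner identity is flattened to a single integer (PID for POSIX locks, an open-file-description identity for OFD/flock locks), and the names are illustrative:

```rust
/// Simplified lock record: `owner` stands in for either a PID (POSIX) or
/// an open-file-description identity (OFD/flock).
#[derive(Clone, Copy, PartialEq)]
pub enum LockKind { Read, Write }

pub struct Lock {
    pub kind: LockKind,
    pub start: u64,
    pub end: u64,   // inclusive, as in the FileLock struct above
    pub owner: u64, // PID or open-file-description identity
}

/// Two locks conflict iff at least one is a write lock, the ranges
/// overlap, and the owners differ.
pub fn conflicts(a: &Lock, b: &Lock) -> bool {
    let some_writer = a.kind == LockKind::Write || b.kind == LockKind::Write;
    let overlap = !(a.end < b.start || b.end < a.start);
    let different_owner = a.owner != b.owner;
    some_writer && overlap && different_owner
}
```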

14.14.3 Locking Algorithm

UmkaOS uses an augmented interval tree (red-black tree with subtree_max augmentation) for O(log n) file lock conflict detection. This is the correct data structure; there is no O(n) fallback. Linux used an O(n) linked-list scan for decades before adding interval trees in Linux 3.13; UmkaOS starts with the correct design.

FileLockTree structure:

  • Sorted by l_start (range start).
  • Each node carries subtree_max: u64 = maximum l_end in its subtree.
  • This augmentation enables O(log n) range overlap queries.

Conflict query for range [req_start, req_end): Walk the tree: at each node, if node.subtree_max < req_start, the entire subtree has no overlapping locks — prune. Otherwise check the node itself and recurse into both children. O(log n + k) where k = number of conflicts found.

Insert/delete: O(log n) standard red-black tree operations, plus O(log n) subtree_max recomputation on the path to root. During red-black tree rotations (left-rotate, right-rotate), subtree_max is recomputed for the two rotated nodes: node.subtree_max = max(node.lock.l_end, left_child_max(node), right_child_max(node)). This is the standard augmented red-black tree technique (CLRS §14.2).
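The subtree_max pruning rule can be demonstrated on a simplified tree. This sketch uses an unbalanced BST keyed by start (the real FileLockTree is red-black balanced), but the augmentation maintenance and the pruned overlap query are the same technique; all names here are illustrative:

```rust
/// Interval tree node: BST keyed by `start`, augmented with `subtree_max`
/// (maximum `end` in this node's subtree, including the node itself).
pub struct Node {
    start: u64,
    end: u64, // inclusive
    subtree_max: u64,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

/// Insert [start, end], updating `subtree_max` along the descent path.
/// (No rebalancing here — a simplification over the red-black version.)
pub fn insert(root: Option<Box<Node>>, start: u64, end: u64) -> Box<Node> {
    match root {
        None => Box::new(Node { start, end, subtree_max: end, left: None, right: None }),
        Some(mut n) => {
            if start < n.start {
                n.left = Some(insert(n.left.take(), start, end));
            } else {
                n.right = Some(insert(n.right.take(), start, end));
            }
            n.subtree_max = n.subtree_max.max(end);
            n
        }
    }
}

/// Collect all ranges overlapping [qs, qe], pruning any subtree whose
/// `subtree_max` < qs — no range in it can reach the query.
pub fn query(node: &Option<Box<Node>>, qs: u64, qe: u64, out: &mut Vec<(u64, u64)>) {
    let Some(n) = node else { return };
    if n.subtree_max < qs {
        return; // entire subtree ends before the query starts: prune
    }
    query(&n.left, qs, qe, out);
    if n.start <= qe && qs <= n.end {
        out.push((n.start, n.end));
    }
    if n.start <= qe {
        // keys in the right subtree are >= n.start; descend only if they can overlap
        query(&n.right, qs, qe, out);
    }
}
```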

fcntl_setlk(fd, lock_type, start, end, wait: bool) → Result:
  inode = fd.inode()
  ensure inode.locks is initialized

  inode.locks.lock.lock()

  loop:
    // O(log n + k) interval tree query for conflicting locks in [start, end).
    for existing in inode.locks.locks.query_conflicts(start, end, lock_type, &fd):
      if !wait:
        inode.locks.lock.unlock()
        return Err(EAGAIN)           // F_SETLK: fail immediately

      // F_SETLKW: deadlock detection before sleeping
      if would_deadlock(current_pid, existing.owner_pid):
        inode.locks.lock.unlock()
        return Err(EDEADLK)

      inode.locks.lock.unlock()
      existing.wait_queue.wait_event(|| !lock_conflicts_anymore(...))
      inode.locks.lock.lock()
      continue loop                  // re-check after wakeup (spurious wakeup safe)

    // No conflict: coalesce adjacent/overlapping locks of the same type and owner,
    // then insert the new lock. O((k+1) log n).
    coalesce_and_insert(inode, fd, lock_type, start, end)
    inode.locks.lock.unlock()
    return Ok(())

Lock Coalescing Algorithm (Greedy Interval Merge)

The following batch coalescing algorithm is used during lock migration and crash recovery (when multiple lock requests are replayed). The per-call coalescing path is coalesce_and_insert() below, which operates on the interval tree directly with no Vec allocation.

Input: a set of pending lock requests. Output: a minimal set of merged lock requests covering the same byte ranges.

Data structure:

struct PendingLockRequest {
    offset: u64,
    len: u64,
    op: LockOp,  // Shared or Exclusive
}

Algorithm (O(n log n) for n requests):

  1. Collect all pending requests into Vec<PendingLockRequest>.
  2. Sort by offset (ascending), then by len (descending) as tiebreaker.
  3. Sweep left to right:
     • Start with current = requests[0].
     • For each subsequent request r:
       – If r.offset <= current.offset + current.len (overlapping or adjacent) AND r.op == current.op (same lock type): current.len = max(current.offset + current.len, r.offset + r.len) - current.offset.
       – Otherwise: emit current, set current = r.
  4. Emit final current.

Rationale: coalescing reduces the number of kernel lock table entries for byte-range locking (POSIX fcntl(F_SETLK)), avoiding fragmentation in the per-file lock list.
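The sweep above can be written as a small pure function. This is a sketch assuming offset + len does not overflow; LockOp mirrors the PendingLockRequest definition:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum LockOp { Shared, Exclusive }

#[derive(Clone, Copy, PartialEq, Debug)]
pub struct PendingLockRequest {
    pub offset: u64,
    pub len: u64,
    pub op: LockOp,
}

/// Greedy interval merge: sort, then sweep left to right, merging runs of
/// overlapping-or-adjacent requests of the same lock type.
pub fn coalesce(mut reqs: Vec<PendingLockRequest>) -> Vec<PendingLockRequest> {
    if reqs.is_empty() {
        return reqs;
    }
    // Sort by offset ascending, then len descending as tiebreaker.
    reqs.sort_by(|a, b| a.offset.cmp(&b.offset).then(b.len.cmp(&a.len)));
    let mut out = Vec::new();
    let mut cur = reqs[0];
    for r in &reqs[1..] {
        // Overlapping or adjacent, and same lock type: extend `cur`.
        if r.offset <= cur.offset + cur.len && r.op == cur.op {
            cur.len = (cur.offset + cur.len).max(r.offset + r.len) - cur.offset;
        } else {
            out.push(cur);
            cur = *r;
        }
    }
    out.push(cur);
    out
}
```

Note that a Shared request adjacent to an Exclusive request is never merged: the two runs are emitted as separate entries even though their byte ranges touch.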

coalesce_and_insert(new_lock) — called after conflict check passes:

  1. Query the interval tree for all locks owned by new_lock.pid that are adjacent to or overlapping new_lock's range [l_start, l_end) (adjacent = existing.l_end == new_lock.l_start or vice versa)
  2. Compute the union range: min(all.l_start) to max(all.l_end)
  3. Remove all found locks from the interval tree (O(k log n))
  4. Insert a single merged lock covering the union range (O(log n))

Complexity: O((k+1) log n) where k = number of locks merged. Coalescing reduces tree size over time for processes that acquire many adjacent byte-range locks (common in database file locking patterns).

14.14.4 Deadlock Detection

Wait-For Graph data structure:

/// Directed graph of lock-wait relationships between threads.
/// Edge (A → B) means thread A is currently blocked waiting for a byte-range
/// lock held by thread B. The graph is maintained in the lock manager and
/// updated on each lock acquisition attempt that would block.
///
/// Bounded at compile time; exceeding `LOCK_GRAPH_MAX_THREADS` causes
/// `detect_deadlock` to abort with `LockError::DeadlockDetectionOverflow`.
pub struct WaitForGraph {
    /// Sparse adjacency list: (waiting_thread, [holder_threads]).
    /// Each entry represents one blocked thread and the set of threads
    /// it is waiting on (typically 1, but can be multiple for range locks).
    edges: ArrayVec<(ThreadId, ArrayVec<ThreadId, LOCK_GRAPH_MAX_HOLDERS>), LOCK_GRAPH_MAX_THREADS>,
}

// WaitForGraph is ~20 KiB inline (256 * 80 bytes), which exceeds safe kernel
// stack depth (8-16 KiB). It MUST NOT be allocated on the kernel stack.
// Allocation strategy: per-CPU pre-allocated buffer. Only one deadlock
// detection runs per CPU at a time because InodeLocks.lock is a SpinLock,
// which disables preemption. No other thread on this CPU can enter a locking
// path while the SpinLock is held.
static LOCK_DEADLOCK_GRAPH: PerCpu<WaitForGraph> = PerCpu::new(WaitForGraph::new);

impl WaitForGraph {
    pub const fn new() -> Self { Self { edges: ArrayVec::new() } }

    /// Record that `waiter` is blocked on a lock held by `holder`.
    pub fn add_edge(&mut self, waiter: ThreadId, holder: ThreadId) { /* ... */ }

    /// Remove all outgoing edges from `waiter` (called on lock release or
    /// wakeup so stale edges don't pollute subsequent detections).
    pub fn remove_waiter(&mut self, waiter: ThreadId) { /* ... */ }

    /// Return an iterator over all threads that currently hold locks that
    /// `waiter` is blocked on. O(n) scan over the edge list.
    pub fn holders_of(&self, waiter: ThreadId) -> impl Iterator<Item = ThreadId> + '_ {
        self.edges.iter()
            .find(|(tid, _)| *tid == waiter)
            .map(|(_, holders)| holders.iter().copied())
            .into_iter()
            .flatten()
    }
}

/// Maximum number of threads tracked simultaneously in the wait-for graph.
/// This is the per-inode concurrent contention limit, not a system-wide
/// thread limit. 256 concurrent threads blocked on the same inode's lock
/// tree is well beyond any realistic workload. Overflow is treated
/// conservatively (EDEADLK) — correctness is preserved at the cost of
/// a false deadlock report.
pub const LOCK_GRAPH_MAX_THREADS: usize = 256;
/// Maximum number of holders per waiting thread (range locks can be split).
pub const LOCK_GRAPH_MAX_HOLDERS: usize = 8;

Deadlock Detection: Wait-For Graph DFS (3-Color)

Each lock holder is a node; each blocked waiter is a directed edge (waiter → holder).

Node state per thread:

  • WHITE: not yet visited in the current DFS.
  • GRAY: currently on the DFS recursion stack (potential cycle node).
  • BLACK: fully explored, no cycle reachable from here.

Constants:

const VFS_LOCK_MAX_DEPTH: usize = 64;  // Max wait-chain depth before abort

Algorithm (invoked before blocking on a contested lock):

fn detect_deadlock(start: ThreadId, graph: &WaitForGraph) -> bool:
  // Stack-allocated color map sized to LOCK_GRAPH_MAX_THREADS: a wide DFS can
  // visit every tracked thread even at shallow depth, so sizing by the depth
  // limit alone would overflow. Linear scan on ≤256 entries is still faster
  // than HashMap heap allocation.
  color = ArrayVec<(ThreadId, Color), LOCK_GRAPH_MAX_THREADS>::new()
  return dfs(start, &mut color, graph, depth=0)

fn dfs(node: ThreadId, color: &mut ArrayVec, graph: &WaitForGraph, depth: usize) -> bool:
  if depth > VFS_LOCK_MAX_DEPTH:
    return true   // treat as deadlock (conservative)
  color[node] = GRAY
  for each holder in graph.holders_of(node):
    match color.get(holder):
      GRAY  => return true   // back-edge: cycle detected
      BLACK => continue      // already explored, safe
      WHITE | None =>
        if dfs(holder, color, graph, depth+1): return true
  color[node] = BLACK
  return false

On true return: the blocking call returns Err(LockError::Deadlock) / EDEADLK. The caller must release all currently held locks and retry with a backoff.

The graph is constructed on-demand per lock request and is not persisted. Returning true on depth overflow is safe: it causes the lock request to fail with EDEADLK, which is better than silently allowing a potential deadlock. The depth limit prevents deadlock detection from becoming a denial-of-service vector in pathological chains.
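The 3-color DFS can be exercised in plain Rust. This sketch replaces the fixed-capacity WaitForGraph with a Vec-based adjacency list and uses usize thread ids; the coloring and depth-limit logic match the pseudocode above:

```rust
const MAX_DEPTH: usize = 64; // mirrors VFS_LOCK_MAX_DEPTH

#[derive(Clone, Copy, PartialEq)]
enum Color { White, Gray, Black }

/// `edges[t]` = the set of threads that thread `t` is blocked waiting on.
/// Returns true if a wait cycle (deadlock) is reachable from `start`.
pub fn detect_deadlock(start: usize, edges: &[Vec<usize>]) -> bool {
    let mut color = vec![Color::White; edges.len()];
    dfs(start, edges, &mut color, 0)
}

fn dfs(node: usize, edges: &[Vec<usize>], color: &mut [Color], depth: usize) -> bool {
    if depth > MAX_DEPTH {
        return true; // conservative: treat an over-deep chain as deadlock
    }
    color[node] = Color::Gray;
    for &holder in &edges[node] {
        match color[holder] {
            Color::Gray => return true, // back-edge: cycle detected
            Color::Black => continue,   // already fully explored
            Color::White => {
                if dfs(holder, edges, color, depth + 1) {
                    return true;
                }
            }
        }
    }
    color[node] = Color::Black;
    false
}
```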

14.14.5 Lock Release on File Description Close

When an OpenFile's reference count drops to zero (the last file descriptor pointing to it is closed):

  • OFD locks: all locks where owner_fd matches this description are removed.
  • flock locks: the flock lock associated with this description (if any) is removed.
  • POSIX locks: all locks where owner_pid == current_process.pid are removed. This is the POSIX-mandated behavior: closing any file descriptor for a file releases all POSIX locks the process holds on that file, regardless of which fd was used to acquire them.

After removing locks, wake all tasks in the wait_queue of each removed lock so they can retry acquisition.

14.14.6 memfd Sealing (F_ADD_SEALS / F_GET_SEALS)

memfd_create(2) returns an anonymous file (backed by tmpfs, with no pathname). Seals are write-once restrictions placed on the file's mutation capabilities:

/// Seal flags for memfd files. Once set, seals cannot be removed.
/// SEAL_SEAL prevents any further seals from being added.
bitflags! {
    pub struct SealFlags: u32 {
        /// Prevent any further seals from being added.
        const SEAL_SEAL         = 0x0001;
        /// Prevent the file from shrinking (ftruncate to a smaller size returns EPERM).
        const SEAL_SHRINK       = 0x0002;
        /// Prevent the file from growing (writes past EOF, ftruncate to larger size return EPERM).
        const SEAL_GROW         = 0x0004;
        /// Prevent all writes: write(2) returns EPERM, mmap(PROT_WRITE) returns EPERM.
        const SEAL_WRITE        = 0x0008;
        /// Prevent future mmap(PROT_WRITE) but allow existing writable mappings to remain.
        const SEAL_FUTURE_WRITE = 0x0010;
    }
}

fcntl(fd, F_ADD_SEALS, seals): add the specified seals atomically via a compare_exchange on the inode's AtomicU32 seal field. Fails with EPERM if SEAL_SEAL is already set. Fails with EBUSY if SEAL_WRITE is being added while a writable mmap exists on the file.

fcntl(fd, F_GET_SEALS): return the current seal set (atomic load, lock-free).

Seal enforcement in VFS paths:

  • write(2) and pwrite64(2): check SEAL_WRITE.
  • ftruncate(2) to a smaller size: check SEAL_SHRINK.
  • ftruncate(2) to a larger size: check SEAL_GROW.
  • mmap(PROT_WRITE): check SEAL_WRITE | SEAL_FUTURE_WRITE.

UmkaOS improvement: seals are stored as an AtomicU32 in the memfd's inode — seal reads are lock-free (a single atomic load), which is important because the seal check appears on every write(2) and mmap(2) call for sealed fds.
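The per-path checks reduce to a single predicate over the seal word. A sketch with the seal bits as plain u32 constants (values from the SealFlags definition above; the Op enum and check_seals function are illustrative, not the kernel's API):

```rust
// Seal bit values, matching the SealFlags definition above.
pub const SEAL_SEAL: u32 = 0x0001;
pub const SEAL_SHRINK: u32 = 0x0002;
pub const SEAL_GROW: u32 = 0x0004;
pub const SEAL_WRITE: u32 = 0x0008;
pub const SEAL_FUTURE_WRITE: u32 = 0x0010;

pub enum Op { Write, TruncateShrink, TruncateGrow, MmapWrite }

/// Returns Err(()) — standing in for EPERM — when a seal blocks the operation.
/// `seals` is the value of a single atomic load of the inode's seal field.
pub fn check_seals(seals: u32, op: Op) -> Result<(), ()> {
    let blocked = match op {
        Op::Write => seals & SEAL_WRITE != 0,
        Op::TruncateShrink => seals & SEAL_SHRINK != 0,
        Op::TruncateGrow => seals & SEAL_GROW != 0,
        // New writable mappings are blocked by either write seal.
        Op::MmapWrite => seals & (SEAL_WRITE | SEAL_FUTURE_WRITE) != 0,
    };
    if blocked { Err(()) } else { Ok(()) }
}
```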

14.14.7 Cross-References

  • Section 15.15 (Distributed Lock Manager): the DLM provides cluster-wide advisory locks that extend the local flock/POSIX lock semantics across nodes. Local file locks (this section) are node-local only.
  • Section 14.1 (VFS Architecture): FileOps::release() is the call site where OFD and flock locks are released when the last fd to a file description is closed.
  • Section 17.1 (Containers): POSIX lock ownership is per-PID-namespace-PID. Within a container's PID namespace, lock ownership semantics are unchanged.

14.14.8 Lock Semantics Mode (POSIX Default / OFD Opt-in)

UmkaOS keeps POSIX semantics as the default for F_SETLK to preserve full Linux binary compatibility. Applications and deployments that want the correct OFD semantics as default can opt in at three levels, with the highest-priority source winning:

Priority order (highest first):

  1. Per-call explicit constant
  2. Per-process prctl
  3. Per-user-namespace sysctl
  4. System global default: POSIX


14.14.8.1.1 Per-call explicit (always available, no mode setting needed)
F_OFD_SETLK    // Always OFD semantics (Linux 3.15+, UmkaOS supported)
F_OFD_SETLKW   // Always OFD semantics, blocking
F_SETLK_POSIX  // UmkaOS extension: always POSIX semantics, explicit
F_SETLKW_POSIX // UmkaOS extension: always POSIX semantics, blocking

F_SETLK_POSIX exists so code inside an OFD-default process can still request POSIX semantics for specific locks (e.g., a bundled library that requires process-death lock release for crash detection).


14.14.8.1.2 Per-process opt-in
prctl(PR_SET_LOCK_SEMANTICS, LOCK_SEM_OFD)    // F_SETLK means OFD for this process
prctl(PR_SET_LOCK_SEMANTICS, LOCK_SEM_POSIX)  // Explicit POSIX (escape hatch)
prctl(PR_GET_LOCK_SEMANTICS, 0, 0, 0, 0)      // Query current mode
pub const LOCK_SEM_POSIX: u64 = 0;  // default
pub const LOCK_SEM_OFD:   u64 = 1;

Stored in Task.lock_semantics: LockSemanticsMode (per-thread but inherited from the process — all threads in a process share the same mode via Process.lock_semantics).

Inheritance rules:

  • fork(): child inherits parent's lock_semantics.
  • exec(): inherited (sticky) — a container runtime sets it once; all descendant processes inherit.
  • exec() of a setuid/setgid binary: reset to the user-namespace sysctl default (security: a privilege-elevating binary must not blindly inherit).


14.14.8.1.3 Per-user-namespace sysctl
/proc/sys/fs/file_lock_default

Values: posix (default) | ofd

This sysctl is per-user-namespace, not global. Each container has its own user namespace and therefore its own file_lock_default. The container runtime sets it at container creation:

# Inside an UmkaOS-native container's user namespace:
echo ofd > /proc/sys/fs/file_lock_default
/// Per-user-namespace lock semantics default.
/// Stored in UserNamespace.file_lock_default.
pub enum LockSemanticsMode {
    Posix = 0,  // F_SETLK uses POSIX semantics (default)
    Ofd   = 1,  // F_SETLK uses OFD semantics
    Unset = 2,  // Not explicitly configured; falls through to namespace/global default
}

Requires CAP_SYS_ADMIN in the target user namespace to change. Affects new processes only — running processes keep their current mode.


14.14.8.1.4 Deployment model
Recommended configuration per scenario:

  • Host with legacy software: sysctl = posix (default), no change needed.
  • UmkaOS-native container: runtime sets sysctl = ofd in the container's user namespace.
  • Mixed container (some legacy binaries): sysctl = posix, UmkaOS-native apps use prctl.
  • Wine / NFS lockd / old SQLite: prctl(LOCK_SEM_POSIX) in launch wrapper.

14.14.8.1.5 Internal resolution
fn effective_lock_semantics(
    task: &Task,
    cmd: FcntlCmd,
) -> LockSemanticsMode {
    match cmd {
        FcntlCmd::OfdSetLk | FcntlCmd::OfdSetLkW     => LockSemanticsMode::Ofd,
        FcntlCmd::SetLkPosix | FcntlCmd::SetLkWPosix  => LockSemanticsMode::Posix,
        FcntlCmd::SetLk | FcntlCmd::SetLkW => {
            // Resolve: per-process > per-namespace sysctl > global POSIX
            if task.process.lock_semantics != LockSemanticsMode::Unset {
                task.process.lock_semantics
            } else {
                task.user_namespace.file_lock_default
            }
        }
        _ => LockSemanticsMode::Posix,
    }
}

Linux compatibility: existing binaries calling F_SETLK on a system where no mode is set get identical POSIX behaviour to Linux. F_OFD_SETLK was added in Linux 3.15 and is already supported. F_SETLK_POSIX and PR_SET_LOCK_SEMANTICS are UmkaOS extensions with no Linux equivalent.


14.15 Disk Quota Subsystem (quotactl)

Disk quotas enforce per-user, per-group, and per-project limits on filesystem space and inode usage. Required for multi-tenant storage environments and Linux compatibility.

14.15.1 Data Structures

/// Internal kernel quota accounting structure. NOT the UAPI struct — see
/// `IfDqblk` below for the Linux-compatible quotactl(2) wire format.
/// This struct extends the UAPI layout with `bgrace` and `igrace` fields
/// for in-kernel grace period tracking (not exposed to userspace directly).
pub struct DiskQuota {
    /// Hard block limit (bytes). 0 = no limit. Writes that would exceed this
    /// are rejected with EDQUOT immediately, regardless of grace period.
    pub bhardlimit: u64,

    /// Soft block limit (bytes). Exceeding this triggers a grace period timer.
    /// Once the grace period expires, further writes are rejected with EDQUOT.
    pub bsoftlimit: u64,

    /// Current block usage (bytes). Updated on every successful write and truncate.
    pub bcurrent: u64,

    /// Hard inode limit. 0 = no limit. File creation that would exceed this
    /// is rejected with EDQUOT.
    pub ihardlimit: u64,

    /// Soft inode limit. Exceeding this triggers an inode grace period.
    pub isoftlimit: u64,

    /// Current inode count (files + directories + symlinks owned by this subject).
    pub icurrent: u64,

    /// Quota grace period expiry deadline for blocks: the absolute timestamp
    /// (seconds since epoch) at which the block soft limit grace period expires.
    /// Set to `now + bgrace` when the soft block limit is first exceeded.
    /// 0 if the soft limit has not been exceeded.
    /// After this deadline, writes that would keep usage above `bsoftlimit`
    /// are rejected with EDQUOT (same enforcement as the hard limit).
    /// Matches Linux `dqb_btime` semantics ("time limit for excessive disk use").
    /// Type is u64, matching the Linux UAPI `struct if_dqblk.dqb_btime` (`__u64`).
    pub btime: u64,

    /// Quota grace period expiry deadline for inodes: the absolute timestamp
    /// (seconds since epoch) at which the inode soft limit grace period expires.
    /// Semantics mirror `btime` but for inode counts instead of block usage.
    /// 0 if the soft inode limit has not been exceeded.
    /// Type is u64, matching the Linux UAPI `struct if_dqblk.dqb_itime` (`__u64`).
    pub itime: u64,

    /// Grace period for the block soft limit, in seconds. Default: 7 days (604800).
    pub bgrace: u32,

    /// Grace period for the inode soft limit, in seconds. Default: 7 days (604800).
    pub igrace: u32,
}

/// Linux UAPI quota structure for quotactl(2). Matches `struct if_dqblk`
/// from `<linux/quota.h>` exactly — this is the struct that userspace tools
/// (quota, repquota, edquota) read and write via Q_GETQUOTA / Q_SETQUOTA.
///
/// Field order and sizes must match the Linux definition exactly:
///   __u64 dqb_bhardlimit, dqb_bsoftlimit, dqb_curspace,
///   __u64 dqb_ihardlimit, dqb_isoftlimit, dqb_curinodes,
///   __u64 dqb_btime, dqb_itime,
///   __u32 dqb_valid
#[repr(C)]
pub struct IfDqblk {
    pub dqb_bhardlimit: u64,
    pub dqb_bsoftlimit: u64,
    pub dqb_curspace:    u64,
    pub dqb_ihardlimit:  u64,
    pub dqb_isoftlimit:  u64,
    pub dqb_curinodes:   u64,
    /// Grace period expiry deadline for blocks (seconds since epoch).
    /// 0 if the soft block limit has not been exceeded.
    pub dqb_btime:       u64,
    /// Grace period expiry deadline for inodes (seconds since epoch).
    pub dqb_itime:       u64,
    /// Bitmask of QIF_* flags indicating which fields are valid.
    /// QIF_BLIMITS=1, QIF_SPACE=2, QIF_ILIMITS=4, QIF_INODES=8,
    /// QIF_BTIME=16, QIF_ITIME=32, QIF_ALL=0x3F.
    pub dqb_valid:       u32,
    // repr(C) adds 4 bytes implicit trailing padding for u64 alignment,
    // matching Linux's `struct if_dqblk` exactly (9 fields, 72 bytes).
    // No explicit `_pad` field — Linux has 9 fields, not 10. The implicit
    // padding is zero-initialized by the kernel before copy_to_user().
}
// Layout: 8×u64 + u32 + 4(implicit pad) = 64 + 4 + 4 = 72 bytes.
const_assert!(size_of::<IfDqblk>() == 72);

/// Conversion between internal `DiskQuota` and UAPI `IfDqblk`:
/// - Q_GETQUOTA: kernel reads `DiskQuota` from cache, converts to `IfDqblk`,
///   copies to userspace. `dqb_valid` is set to `QIF_ALL` (all fields valid).
/// - Q_SETQUOTA: kernel copies `IfDqblk` from userspace, updates only the
///   `DiskQuota` fields indicated by `dqb_valid` in the cache.

/// Quota subject type.
pub enum QuotaType {
    User    = 0,  // USRQUOTA
    Group   = 1,  // GRPQUOTA
    Project = 2,  // PRJQUOTA
}

/// Quota operations implemented by filesystems that support quotas.
/// Optional — filesystems without quota support omit this and quotactl(2) returns ENOSYS.
pub trait QuotaOps: Send + Sync {
    /// Enable quota enforcement for the given type, reading limits from `quota_file`.
    fn quota_on(&self, quota_type: QuotaType, quota_file: &str) -> Result<(), VfsError>;

    /// Disable quota enforcement for the given type.
    fn quota_off(&self, quota_type: QuotaType) -> Result<(), VfsError>;

    /// Read the quota entry for subject `id` (UID, GID, or project ID).
    fn get_quota(&self, quota_type: QuotaType, id: u32) -> Result<DiskQuota, VfsError>;

    /// Set limits and accounting for subject `id`. Requires CAP_SYS_ADMIN.
    fn set_quota(&self, quota_type: QuotaType, id: u32, quota: &DiskQuota) -> Result<(), VfsError>;

    /// Read global quota state (grace periods, flags) for the given type.
    fn get_info(&self, quota_type: QuotaType) -> Result<QuotaInfo, VfsError>;

    /// Set global quota state (grace periods). Requires CAP_SYS_ADMIN.
    fn set_info(&self, quota_type: QuotaType, info: &QuotaInfo) -> Result<(), VfsError>;

    /// Flush in-memory quota accounting to the quota database file.
    fn sync_quota(&self, quota_type: QuotaType) -> Result<(), VfsError>;
}

/// Global quota state (grace periods and enabled flags) for a single quota type.
pub struct QuotaInfo {
    /// Block grace period in seconds.
    pub bgrace: u32,
    /// Inode grace period in seconds.
    pub igrace: u32,
    /// Quota flags (QIF_FLAGS: quota enabled, quota accounting-only, etc.).
    pub flags: u32,
}

14.15.2 quotactl(2) Dispatch

The quotactl(2) syscall encodes both the quota command and the quota type in a single 32-bit cmd argument: bits [31:8] are the command (Q_QUOTAON=0x800002, Q_QUOTAOFF=0x800003, Q_GETQUOTA=0x800007, Q_SETQUOTA=0x800008, Q_GETINFO=0x800005, Q_SETINFO=0x800006, Q_SYNC=0x800001) and bits [7:0] are the quota type (USRQUOTA=0, GRPQUOTA=1, PRJQUOTA=2). This matches the Linux QCMD(cmd, type) = (cmd << 8) | type macro. Subcmd values range up to 0x800009 (Q_GETNEXTQUOTA).
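The encoding and its inverse can be shown directly (a sketch; constant values are the Linux UAPI values quoted above):

```rust
// Quota types (bits [7:0] of cmd).
pub const USRQUOTA: u32 = 0;
pub const GRPQUOTA: u32 = 1;
pub const PRJQUOTA: u32 = 2;

// Subcommands (bits [31:8] of cmd) — values from <linux/quota.h>.
pub const Q_SYNC: u32 = 0x800001;
pub const Q_GETQUOTA: u32 = 0x800007;

/// Linux QCMD(cmd, type) = (cmd << 8) | type.
pub fn qcmd(cmd: u32, qtype: u32) -> u32 {
    (cmd << 8) | (qtype & 0xff)
}

/// Split a quotactl `cmd` argument back into (subcommand, quota type),
/// as the dispatch pseudocode below does with `cmd >> 8` and `cmd & 0xff`.
pub fn split_qcmd(cmd: u32) -> (u32, u32) {
    (cmd >> 8, cmd & 0xff)
}
```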

quotactl(cmd, dev, id, addr):
  qt_cmd  = cmd >> 8
  qt_type = QuotaType::from(cmd & 0xff)  // USRQUOTA/GRPQUOTA/PRJQUOTA

  sb = resolve_superblock_from_device_path(dev)
  if sb.quota_ops is None: return Err(ENOSYS)

  // Capability check for mutating operations
  if qt_cmd in [Q_QUOTAON, Q_QUOTAOFF, Q_SETQUOTA, Q_SETINFO]:
    check_capability(CAP_SYS_ADMIN)?

  match qt_cmd:
    Q_QUOTAON   → sb.quota_ops.quota_on(qt_type, addr_as_path)
    Q_QUOTAOFF  → sb.quota_ops.quota_off(qt_type)
    Q_GETQUOTA  → quota = sb.quota_ops.get_quota(qt_type, id)?; uapi = quota.to_if_dqblk(); copy_to_user(addr, uapi)
    Q_SETQUOTA  → uapi = copy_from_user::<IfDqblk>(addr)?; quota = DiskQuota::from_if_dqblk(&uapi); sb.quota_ops.set_quota(qt_type, id, &quota)
    Q_GETINFO   → info = sb.quota_ops.get_info(qt_type)?; copy_to_user(addr, info)
    Q_SETINFO   → info = copy_from_user(addr)?; sb.quota_ops.set_info(qt_type, &info)
    Q_SYNC      → sb.quota_ops.sync_quota(qt_type)
    _           → return Err(EINVAL)

14.15.3 VFS Enforcement Hooks

On every write(2), fallocate, create, mkdir, mknod, and symlink call, the VFS checks quotas for all three subject types:

vfs_quota_check_blocks(inode, bytes_requested) → Result:
  creds = current_task().creds
  for qt in [QuotaType::User, QuotaType::Group, QuotaType::Project]:
    id = match qt:
      User    → creds.fsuid
      Group   → creds.fsgid
      Project → inode.project_id  // stored in the inode's native i_projid field (set via FS_IOC_FSSETXATTR)
    quota = inode.sb.quota_ops.get_quota(qt, id)?  // from in-memory quota cache
    new_usage = quota.bcurrent + bytes_requested
    if new_usage > quota.bhardlimit && quota.bhardlimit != 0:
      return Err(EDQUOT)  // hard limit exceeded: reject immediately
    if new_usage > quota.bsoftlimit && quota.bsoftlimit != 0:
      now = current_time_secs()
      // The get_quota() → check btime → update_quota_cache() sequence
      // must be serialized per (qt, id) to prevent a TOCTOU race: two
      // concurrent writers could both read btime == 0 and both set btime,
      // with the second overwriting the first. Serialization is provided
      // by the per-quota-entry SpinLock in the quota cache (acquired by
      // get_quota() and held until update_quota_cache() completes).
      if quota.btime == 0:
        quota.btime = now + quota.bgrace as u64  // start grace period timer
        update_quota_cache(qt, id, &quota)
      elif now > quota.btime:
        return Err(EDQUOT)  // grace period expired: reject
      // else: within grace period, allow the write
  return Ok(())
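
The grace-period decision above can be modeled as a pure function for testing. A sketch (names and the QuotaDecision enum are illustrative; zero limits mean "no limit", matching the pseudocode):

```rust
/// Hypothetical model of the block-limit decision in vfs_quota_check_blocks.
#[derive(Debug, PartialEq)]
enum QuotaDecision {
    Allow,
    /// Soft limit newly exceeded: arm the grace timer at this absolute time.
    StartGrace(u64),
    /// EDQUOT: hard limit breached or grace period expired.
    Deny,
}

fn check_blocks(bcurrent: u64, requested: u64, bhard: u64, bsoft: u64,
                btime: u64, bgrace: u64, now: u64) -> QuotaDecision {
    let new_usage = bcurrent + requested;
    if bhard != 0 && new_usage > bhard {
        return QuotaDecision::Deny; // hard limit: reject immediately
    }
    if bsoft != 0 && new_usage > bsoft {
        if btime == 0 {
            return QuotaDecision::StartGrace(now + bgrace); // start timer
        }
        if now > btime {
            return QuotaDecision::Deny; // grace period expired
        }
    }
    QuotaDecision::Allow // under soft limit, or still within grace period
}

fn main() {
    // Under both limits: allowed.
    assert_eq!(check_blocks(100, 10, 1000, 500, 0, 604800, 1000), QuotaDecision::Allow);
    // Over soft limit with no timer yet: arm the 7-day grace period.
    assert_eq!(check_blocks(495, 10, 1000, 500, 0, 604800, 1000),
               QuotaDecision::StartGrace(605800));
    // Over soft limit, grace already expired: denied.
    assert_eq!(check_blocks(495, 10, 1000, 500, 900, 604800, 1000), QuotaDecision::Deny);
    // Over hard limit: denied regardless of grace state.
    assert_eq!(check_blocks(995, 10, 1000, 500, 0, 604800, 1000), QuotaDecision::Deny);
}
```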

vfs_quota_check_inodes(inode, count) → Result:
  // Identical structure to vfs_quota_check_blocks but uses icurrent/isoftlimit/ihardlimit.

14.15.4 In-Memory Quota Cache

Quota accounting state is kept in a per-filesystem in-memory cache to avoid hitting the quota database file on every write. The cache structure mirrors DiskQuota with an additional dirty: bool field. Cache entries are written back to the quota file asynchronously via sync_quota(), which is called:

  • Periodically by the writeback daemon (default interval: 30 seconds).
  • On quotactl(Q_SYNC).
  • On filesystem unmount.
  • On sync(2) / syncfs(2) when the filesystem's quota is dirty.

The cache uses three per-filesystem XArrays — one per QuotaType — keyed by subject ID (u32 UID, GID, or project ID). Quota checks on the write(2) hot path use RCU read guards (lock-free, no contention between concurrent writers). Updates (usage accounting, limit changes via quotactl(Q_SETQUOTA)) acquire the XArray's internal lock on the affected entry only.

/// Per-filesystem in-memory quota cache.
///
/// Three XArrays partition by quota type so that user, group, and project
/// lookups are fully independent (no false contention). XArray provides
/// O(1) lookup by integer key with native RCU read support.
pub struct QuotaCache {
    /// User quota cache, keyed by UID.
    pub user: XArray<QuotaCacheEntry>,
    /// Group quota cache, keyed by GID.
    pub group: XArray<QuotaCacheEntry>,
    /// Project quota cache, keyed by project ID.
    pub project: XArray<QuotaCacheEntry>,
}

pub struct QuotaCacheEntry {
    pub quota: DiskQuota,
    /// True if this entry has been modified since the last writeback.
    pub dirty: bool,
}

impl QuotaCache {
    /// Look up a quota entry. RCU read — no lock, no allocation.
    /// Called on every write(2), fallocate, create, mkdir — hot path.
    pub fn get(&self, qt: QuotaType, id: u32) -> Option<RcuRef<QuotaCacheEntry>> {
        self.array_for(qt).get_rcu(id as u64)
    }

    /// Insert or update a quota entry. Acquires the XArray's internal lock
    /// for the affected slot only — does not block concurrent reads.
    pub fn set(&self, qt: QuotaType, id: u32, entry: QuotaCacheEntry) {
        self.array_for(qt).store(id as u64, entry);
    }

    fn array_for(&self, qt: QuotaType) -> &XArray<QuotaCacheEntry> {
        match qt {
            QuotaType::User    => &self.user,
            QuotaType::Group   => &self.group,
            QuotaType::Project => &self.project,
        }
    }
}

This replaces the previous RwLock<HashMap<(QuotaType, u32), DiskQuota>> design, which had three problems: (1) HashMap with integer keys violates collection policy (§3.1.13); (2) the global RwLock serialises all quota checks across all subjects; (3) the composite (QuotaType, u32) key prevents independent access by quota type. The XArray design gives O(1) lookup, lock-free RCU reads, and natural partitioning.

14.15.5 Linux Compatibility

  • quotactl(2) with all seven commands (Q_QUOTAON, Q_QUOTAOFF, Q_GETQUOTA, Q_SETQUOTA, Q_GETINFO, Q_SETINFO, Q_SYNC) is fully implemented.
  • The UAPI IfDqblk structure matches the Linux struct if_dqblk layout exactly (9 fields: dqb_bhardlimit through dqb_valid). The internal DiskQuota struct extends this with bgrace/igrace fields for in-kernel grace period tracking.
  • quota tools (quota, quotacheck, repquota, edquota) work without modification.
  • ext4, XFS, and tmpfs quota implementations are in scope for the initial release.
  • Project quotas (PRJQUOTA) are supported; project IDs are stored in the inode's i_projid field (set via FS_IOC_FSSETXATTR).

14.15.6 Cross-References

  • Section 14.1 (VFS Architecture): quota checks are inserted into the VFS dispatch layer at write, create, mkdir, mknod, and fallocate call sites.
  • Section 17.1 (Containers): cgroup v2 io.max and memory.max provide resource controls complementary to quota; quota enforces per-UID/GID storage limits while cgroups enforce per-container I/O and memory limits.
  • Section 15.1 (Storage): ext4, XFS, and btrfs filesystem drivers implement QuotaOps as part of their SuperBlock initialization.

14.16 Extended Attributes (xattr)

Extended attributes are name-value pairs associated with inodes, providing metadata beyond the standard POSIX file attributes (owner, group, mode, timestamps). They are the storage mechanism for POSIX ACLs (Section 9.2), SELinux labels, IMA hashes (Section 9.5), overlayfs whiteouts (Section 14.8), file capabilities (Section 9.9), and user-defined metadata.

UmkaOS implements the complete Linux xattr ABI: identical syscall numbers, identical namespace rules, identical size limits, and identical wire format for POSIX ACLs stored in system.posix_acl_access / system.posix_acl_default.

14.16.1 Syscall Interface

Twelve syscalls implement four operations across three path resolution variants:

Operation Path-based (follows symlinks) Link-based (no follow) FD-based
Get getxattr(path, name, value, size) lgetxattr(path, name, value, size) fgetxattr(fd, name, value, size)
Set setxattr(path, name, value, size, flags) lsetxattr(path, name, value, size, flags) fsetxattr(fd, name, value, size, flags)
List listxattr(path, list, size) llistxattr(path, list, size) flistxattr(fd, list, size)
Remove removexattr(path, name) lremovexattr(path, name) fremovexattr(fd, name)

Return values: getxattr returns the number of bytes written to value (or the required buffer size if size == 0). listxattr returns the total length of the null-separated name list (or required size if size == 0). setxattr and removexattr return 0 on success.

Error codes: ENODATA (attribute not found), EEXIST (CREATE flag, attribute already exists), ERANGE (buffer too small), EPERM (namespace permission denied), ENOTSUP (filesystem does not support xattrs or namespace not valid for this inode type).

The l-variants operate on the symlink inode itself rather than following the symlink target. The f-variants use an open file descriptor, bypassing path resolution entirely.
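
The size-probe convention can be modeled as a standalone sketch (getxattr_result is an illustrative name; the real syscall path also copies the value through user memory):

```rust
/// Hypothetical model of the getxattr return-value convention:
/// size == 0 probes the required buffer size, a too-small buffer
/// fails with ERANGE, and success returns the value length.
const ERANGE: i32 = 34;

fn getxattr_result(value_len: usize, buf_size: usize) -> Result<usize, i32> {
    if buf_size == 0 {
        return Ok(value_len); // probe: report required size, copy nothing
    }
    if buf_size < value_len {
        return Err(ERANGE); // caller's buffer cannot hold the value
    }
    Ok(value_len) // value copied; report bytes written
}

fn main() {
    // Typical two-call pattern: probe with size 0, then fetch.
    let need = getxattr_result(17, 0).unwrap();
    assert_eq!(need, 17);
    assert_eq!(getxattr_result(17, need), Ok(17));
    assert_eq!(getxattr_result(17, 8), Err(ERANGE));
}
```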

14.16.2 XattrFlags

bitflags! {
    /// Flags for setxattr / lsetxattr / fsetxattr. Matches Linux XATTR_CREATE
    /// and XATTR_REPLACE from <linux/xattr.h>. Kernel-internal, not KABI.
    #[repr(C)]
    pub struct XattrFlags: u32 {
        /// Fail with EEXIST if the attribute already exists.
        const CREATE  = 0x1;
        /// Fail with ENODATA if the attribute does not exist.
        const REPLACE = 0x2;
        // 0 (no flags) = create or replace unconditionally.
    }
}

Setting both CREATE | REPLACE simultaneously is invalid and returns EINVAL.

14.16.3 Namespace Prefixes

Extended attribute names are partitioned into four namespaces by their prefix string. Each namespace has independent permission semantics:

14.16.3.1 user.*

User-defined attributes. No capability required.

Operation Requirement
Get Read permission on the file
Set Write permission on the file

Inode type restriction: user.* xattrs are permitted only on regular files and directories. Attempts to set user.* on symlinks, device nodes, pipes, or sockets return EPERM. Rationale: symlinks must be transparent (a symlink's xattrs should not be confused with those of its target); device node xattrs would create ambiguity between the device file and the underlying device.

14.16.3.2 trusted.*

Trusted attributes for kernel subsystems and privileged daemons.

Operation Requirement
Get CAP_SYS_ADMIN
Set CAP_SYS_ADMIN

Stored on disk and persistent across reboots. Examples:

  • trusted.overlay.opaque — overlayfs opaque directory marker (Section 14.8)
  • trusted.overlay.redirect — overlayfs rename redirect

14.16.3.3 security.*

Security labels written by LSMs and integrity subsystems.

Operation Requirement
Set Delegated to LSM hooks. Default (commoncap): CAP_SYS_ADMIN for generic security.* attributes; security.capability requires CAP_SETFCAP (checked in cap_convert_nscap()). SELinux/AppArmor may impose additional type enforcement rules via lsm_call_inode_security(Setxattr, ...).
Get Varies by LSM; SELinux allows read by any process with appropriate type enforcement

Examples:

  • security.selinux — SELinux security context (Section 9.8)
  • security.ima — IMA measurement hash (Section 9.5)
  • security.capability — file capabilities (VFS_CAP_REVISION_3) (Section 9.9)
  • security.evm — EVM HMAC over protected xattrs (Section 9.5)

14.16.3.4 system.*

System attributes for kernel-managed metadata. Two attributes are defined:

  • system.posix_acl_access — POSIX access ACL (Section 9.2)
  • system.posix_acl_default — POSIX default ACL (directories only)

Permission model: read follows normal file permission checks; set requires write permission plus ownership (uid == i_uid) or CAP_FOWNER.

14.16.4 Size Limits

/// Maximum length of an extended attribute name, including the namespace prefix
/// (e.g., "user." is 5 bytes of the 255-byte budget). Matches Linux XATTR_NAME_MAX.
pub const XATTR_NAME_MAX: usize = 255;

/// Maximum size of an extended attribute value in bytes (64 KiB).
/// Matches Linux XATTR_SIZE_MAX.
pub const XATTR_SIZE_MAX: usize = 65536;

/// Maximum total size of a listxattr() output buffer in bytes (64 KiB).
/// Matches Linux XATTR_LIST_MAX.
pub const XATTR_LIST_MAX: usize = 65536;

These are hard limits enforced by the VFS layer before dispatching to filesystem code. Individual filesystems may impose smaller limits (e.g., ext4 inline xattr space is limited by the inode size minus i_extra_isize).

14.16.5 VFS Dispatch Pipeline

All xattr syscalls route through InodeOps methods defined in Section 14.1. The VFS layer performs namespace permission checks and LSM hooks before dispatching to the filesystem:

  1. Parse namespace prefix — extract "user.", "trusted.", "security.", or "system." from the attribute name. Unknown prefixes return EOPNOTSUPP.
  2. Validate name length — reject if name.len() > XATTR_NAME_MAX.
  3. Validate value size — reject if value.len() > XATTR_SIZE_MAX (set operations).
  4. Check namespace permissions — verify the caller holds the required capability for the namespace (see tables above). Check inode type restriction for user.*.
  5. Call LSM hookslsm_call_inode_security(Setxattr | Getxattr | Removexattr | Listxattr, ...) (Section 9.8). LSMs may deny the operation (e.g., SELinux type enforcement) or intercept security.* writes.
  6. Dispatch to filesystem — call the InodeOps::getxattr / setxattr / listxattr / removexattr method on the filesystem driver.
  7. EVM re-computation (set/remove of security.* xattrs only) — after the filesystem write succeeds, trigger EVM HMAC re-computation (Section 9.5).
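
Steps 1 through 3 of the pipeline can be sketched as a pure validation pass (XattrNamespace, parse_xattr_name, and validate_set are illustrative names, not from the spec):

```rust
/// Hypothetical sketch of dispatch steps 1-3: namespace parsing and
/// length validation, before any capability or LSM checks.
const XATTR_NAME_MAX: usize = 255;
const XATTR_SIZE_MAX: usize = 65536;

#[derive(Debug, PartialEq)]
enum XattrNamespace { User, Trusted, Security, System }

#[derive(Debug, PartialEq)]
enum XattrError { NameTooLong, ValueTooLarge, OpNotSupp }

fn parse_xattr_name(name: &str) -> Result<XattrNamespace, XattrError> {
    if name.len() > XATTR_NAME_MAX {
        return Err(XattrError::NameTooLong); // step 2
    }
    // Step 1: partition by prefix; unknown prefixes are EOPNOTSUPP.
    if name.starts_with("user.") { Ok(XattrNamespace::User) }
    else if name.starts_with("trusted.") { Ok(XattrNamespace::Trusted) }
    else if name.starts_with("security.") { Ok(XattrNamespace::Security) }
    else if name.starts_with("system.") { Ok(XattrNamespace::System) }
    else { Err(XattrError::OpNotSupp) }
}

fn validate_set(name: &str, value_len: usize) -> Result<XattrNamespace, XattrError> {
    let ns = parse_xattr_name(name)?;
    if value_len > XATTR_SIZE_MAX {
        return Err(XattrError::ValueTooLarge); // step 3, set operations only
    }
    Ok(ns)
}

fn main() {
    assert_eq!(parse_xattr_name("user.mime_type"), Ok(XattrNamespace::User));
    assert_eq!(parse_xattr_name("foo.bar"), Err(XattrError::OpNotSupp));
    assert_eq!(validate_set("security.selinux", 64), Ok(XattrNamespace::Security));
    assert_eq!(validate_set("user.big", 70000), Err(XattrError::ValueTooLarge));
}
```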

14.16.6 Per-Filesystem Storage

Each filesystem implements xattr storage according to its on-disk format. The VFS layer is agnostic to the storage mechanism; it delegates entirely to InodeOps.

Filesystem Storage mechanism Inline capacity Overflow strategy
ext4 Inode body (after i_extra_isize) or external xattr block ~100 bytes (256-byte inode default) Separate 4 KiB block, shared across inodes via block refcount
XFS Inode attribute fork (shortform, leaf, or B-tree) ~256 bytes (shortform) B-tree of 4 KiB attr leaf blocks
Btrfs Xattr items in the filesystem B-tree (same tree as data extent refs) ~3900 bytes (single leaf item) Additional B-tree items (no single-xattr limit, tree grows)
tmpfs XArray per-inode, keyed by FNV-1a hash of xattr name Memory-only, no disk limit Bounded by tmpfs size limit and system memory
ZFS System Attributes (SA) in dnode bonus buffer or ZAP objects ~48 KiB (bonus buffer) Fat ZAP (on-disk hash table)

tmpfs xattr storage: tmpfs has no backing disk, so xattrs are stored in memory. Each inode with xattrs carries an XArray<XattrEntry> keyed by fnv1a(name) as u64 with open-addressing collision resolution (same triangular probing scheme as Section 14.18). On collision, the probe sequence h, h+1, h+3, h+6, … is followed, comparing XattrEntry.name at each occupied slot. This gives O(1) lookup for the common case (no collisions) with bounded worst-case O(k) where k is the number of collisions for a given hash.

/// tmpfs xattr entry. Stored in the per-inode XArray.
pub struct TmpfsXattrEntry {
    /// Full attribute name including namespace prefix (e.g., "user.mime_type").
    /// Heap-allocated because xattr names are variable-length.
    pub name: Box<[u8]>,
    /// Attribute value. Heap-allocated, up to XATTR_SIZE_MAX bytes.
    pub value: Box<[u8]>,
}
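
The hashing and probing scheme described above can be sketched in isolation: 64-bit FNV-1a keys with triangular probe offsets 0, 1, 3, 6, 10, … (offset k = k*(k+1)/2). The constants are the standard FNV-1a parameters; probe_key is an illustrative name.

```rust
/// 64-bit FNV-1a, as used to key the per-inode tmpfs xattr XArray.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

fn fnv1a(data: &[u8]) -> u64 {
    let mut h = FNV_OFFSET_BASIS;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// XArray key probed on the k-th attempt for a name hashing to `h`.
fn probe_key(h: u64, k: u64) -> u64 {
    h.wrapping_add(k * (k + 1) / 2) // triangular probing
}

fn main() {
    // Empty input hashes to the offset basis by definition.
    assert_eq!(fnv1a(b""), FNV_OFFSET_BASIS);
    // Distinct names produce distinct 64-bit hashes (collisions negligible).
    assert_ne!(fnv1a(b"user.a"), fnv1a(b"user.b"));
    // Probe sequence h, h+1, h+3, h+6, h+10, ...
    let h = fnv1a(b"user.mime_type");
    let offsets: Vec<u64> = (0..5).map(|k| probe_key(h, k).wrapping_sub(h)).collect();
    assert_eq!(offsets, vec![0, 1, 3, 6, 10]);
}
```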

14.16.7 POSIX ACL Wire Format

The POSIX draft ACL (Section 9.2) is stored on disk as the value of system.posix_acl_access (access ACL) and system.posix_acl_default (default ACL, directories only). The wire format is identical to Linux <linux/posix_acl_xattr.h>:

/// POSIX ACL xattr header. Appears once at the start of the xattr value.
/// All fields are little-endian on disk (Le32/Le16 wrappers enforce
/// explicit conversion at read/write boundaries).
#[repr(C, packed)]
pub struct PosixAclXattrHeader {
    /// ACL format version. Must be POSIX_ACL_XATTR_VERSION (0x0002).
    pub version: Le32,
}
// Packed layout: 4 bytes.
const_assert!(size_of::<PosixAclXattrHeader>() == 4);

/// POSIX_ACL_XATTR_VERSION — the only version defined by the POSIX draft standard.
pub const POSIX_ACL_XATTR_VERSION: u32 = 0x0002;

/// A single ACL entry. Follows the header; repeated N times.
/// All fields are little-endian on disk.
#[repr(C, packed)]
pub struct PosixAclXattrEntry {
    /// ACL entry tag identifying the entry type.
    pub tag: Le16,
    /// Permission bits: ACL_READ (0x04) | ACL_WRITE (0x02) | ACL_EXECUTE (0x01).
    pub perm: Le16,
    /// Qualifier: uid for ACL_USER, gid for ACL_GROUP.
    /// ACL_UNDEFINED_ID (0xFFFFFFFF) for USER_OBJ, GROUP_OBJ, MASK, OTHER.
    pub id: Le32,
}
// Packed layout: 2 + 2 + 4 = 8 bytes.
const_assert!(size_of::<PosixAclXattrEntry>() == 8);

/// ACL entry tag values.
pub const ACL_USER_OBJ:  u16 = 0x01;
pub const ACL_USER:      u16 = 0x02;
pub const ACL_GROUP_OBJ: u16 = 0x04;
pub const ACL_GROUP:     u16 = 0x08;
pub const ACL_MASK:      u16 = 0x10;
pub const ACL_OTHER:     u16 = 0x20;

/// Sentinel value for entries that do not reference a specific uid/gid.
pub const ACL_UNDEFINED_ID: u32 = 0xFFFF_FFFF;

Wire layout: 4-byte header followed by N 8-byte entries. Total size = 4 + 8 * N bytes.

Minimum ACL: 3 entries (USER_OBJ, GROUP_OBJ, OTHER) = 28 bytes. This is the "minimal ACL" equivalent to standard POSIX mode bits.

Extended ACL: When named users or groups are present, a MASK entry is mandatory. The mask defines the maximum permissions for ACL_USER, ACL_GROUP, and ACL_GROUP_OBJ entries (the "effective permissions" are entry.perm & mask.perm).
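
The layout arithmetic above can be checked with a small sketch (helper names are illustrative):

```rust
/// Wire layout from the structs above: 4-byte header + 8-byte entries.
const ACL_HEADER_SIZE: usize = 4;
const ACL_ENTRY_SIZE: usize = 8;

fn acl_xattr_size(entries: usize) -> usize {
    ACL_HEADER_SIZE + ACL_ENTRY_SIZE * entries
}

/// Inverse: entry count from an xattr value size, rejecting malformed lengths
/// (anything that is not 4 + 8*N).
fn acl_entry_count(value_size: usize) -> Option<usize> {
    if value_size < ACL_HEADER_SIZE {
        return None;
    }
    let body = value_size - ACL_HEADER_SIZE;
    if body % ACL_ENTRY_SIZE != 0 {
        return None;
    }
    Some(body / ACL_ENTRY_SIZE)
}

fn main() {
    // Minimal ACL: USER_OBJ + GROUP_OBJ + OTHER = 28 bytes.
    assert_eq!(acl_xattr_size(3), 28);
    // One named user forces a MASK entry: 5 entries = 44 bytes.
    assert_eq!(acl_xattr_size(5), 44);
    assert_eq!(acl_entry_count(28), Some(3));
    assert_eq!(acl_entry_count(27), None); // not 4 + 8*N
}
```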

14.16.8 chmod / ACL Mask Interaction

When chmod() is called on a file that has a POSIX access ACL, the ACL must be updated to reflect the new mode bits. The POSIX draft standard defines this mapping:

Mode bits ACL entry updated
Owner bits (mode >> 6) & 0o7 ACL_USER_OBJ.perm
Group bits (mode >> 3) & 0o7 ACL_MASK.perm (NOT ACL_GROUP_OBJ)
Other bits (mode) & 0o7 ACL_OTHER.perm

The group bits of the file mode correspond to ACL_MASK, not ACL_GROUP_OBJ, whenever a MASK entry is present (a minimal ACL has no mask, and its group bits map to ACL_GROUP_OBJ as usual). This is a common source of confusion but is required by POSIX.1e: the mask entry is the upper bound on group-class permissions, and ls -l displays the mask as the group permission bits.

Conversely, when an ACL is set via setxattr("system.posix_acl_access", ...), the file's mode bits are updated to match: owner bits from ACL_USER_OBJ.perm, group bits from ACL_MASK.perm, other bits from ACL_OTHER.perm.
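
The two-way mapping can be sketched for an extended ACL (one that carries a MASK entry); the AclPerms struct and function names are illustrative:

```rust
/// Group-class permissions of an extended ACL, per the mapping table above.
struct AclPerms {
    user_obj: u16, // owner-class permissions
    mask: u16,     // group-class upper bound (what ls -l shows as group bits)
    other: u16,    // other-class permissions
}

/// chmod() on a file with an extended ACL: group bits go to MASK.
fn apply_chmod(acl: &mut AclPerms, mode: u32) {
    acl.user_obj = ((mode >> 6) & 0o7) as u16; // owner bits
    acl.mask     = ((mode >> 3) & 0o7) as u16; // group bits -> MASK, not GROUP_OBJ
    acl.other    = (mode & 0o7) as u16;        // other bits
}

/// setxattr("system.posix_acl_access", ...): mode bits recomputed from the ACL.
fn mode_from_acl(acl: &AclPerms) -> u32 {
    ((acl.user_obj as u32) << 6) | ((acl.mask as u32) << 3) | acl.other as u32
}

fn main() {
    let mut acl = AclPerms { user_obj: 0, mask: 0, other: 0 };
    apply_chmod(&mut acl, 0o640);
    assert_eq!((acl.user_obj, acl.mask, acl.other), (0o6, 0o4, 0o0));
    // Round trip: setting the ACL back yields the same mode bits.
    assert_eq!(mode_from_acl(&acl), 0o640);
}
```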

14.16.9 Default ACL Inheritance

When creating a new inode in a directory that has system.posix_acl_default set:

New file creation:

  1. The directory's default ACL becomes the new file's access ACL.
  2. The USER_OBJ, ACL_MASK (if present), and OTHER entries are ANDed with the corresponding bits of the requested creation mode to produce the file's effective permissions. The umask is ignored when a default ACL is present (POSIX.1e).
  3. The file's mode bits are set from the resulting ACL (owner from USER_OBJ, group from MASK, other from OTHER).
  4. The new file does NOT receive a default ACL (only directories inherit defaults).

New directory creation:

  1. Same as file creation for the access ACL.
  2. Additionally, the parent's default ACL is copied as the new directory's own default ACL, ensuring recursive inheritance for all future children.

No default ACL: If the parent directory has no system.posix_acl_default xattr, standard umask-based permission inheritance applies and no ACL is created on the new inode.
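
The inheritance rule can be sketched as a pure function, assuming the parent's default ACL carries a MASK entry (DefaultAcl and inherit_mode are illustrative names). Per POSIX.1e the umask is not applied when a default ACL exists; the requested creation mode bounds the inherited entries instead.

```rust
/// Group-class entries of a parent directory's default ACL.
struct DefaultAcl {
    user_obj: u32, // owner permissions
    mask: u32,     // group-class upper bound
    other: u32,    // other permissions
}

/// New file's mode bits: owner from USER_OBJ, group from MASK, other from
/// OTHER, each ANDed with the corresponding bits of the requested mode.
fn inherit_mode(default_acl: &DefaultAcl, requested_mode: u32) -> u32 {
    let owner = default_acl.user_obj & ((requested_mode >> 6) & 0o7);
    let group = default_acl.mask     & ((requested_mode >> 3) & 0o7);
    let other = default_acl.other    & (requested_mode & 0o7);
    (owner << 6) | (group << 3) | other
}

fn main() {
    // Default ACL rwxr-x--- with open(..., 0o666): no execute bit is
    // requested, so the inherited mode is rw-r----- (0o640).
    let acl = DefaultAcl { user_obj: 0o7, mask: 0o5, other: 0o0 };
    assert_eq!(inherit_mode(&acl, 0o666), 0o640);
    // With mode 0o777 the full default ACL shines through: rwxr-x---.
    assert_eq!(inherit_mode(&acl, 0o777), 0o750);
}
```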

14.16.10 EVM Integration

EVM (Extended Verification Module) protects security-critical xattrs against offline tampering (Section 9.5).

Protected xattr set: security.selinux, security.ima, security.capability, and any other security.* xattr registered with EVM at boot.

Flow on security.* xattr modification:

  1. VFS calls InodeOps::setxattr() to persist the new value.
  2. On success, VFS acquires the per-inode evm_lock (spinlock).
  3. VFS concatenates the inode number and all protected xattr values in a deterministic order.
  4. VFS computes HMAC-SHA3-256 over the concatenation using the boot-derived EVM key.
  5. VFS writes the resulting HMAC as the value of security.evm.
  6. VFS releases evm_lock.

Lock ordering: The evm_lock is acquired AFTER the inode's i_rwsem (which protects xattr storage). The ordering is: i_rwsem (exclusive, acquired by setxattr() VFS path) → evm_lock (spinlock, acquired in step 2). Reversing this order would deadlock: evm_lock protects only the HMAC recomputation (steps 3-5), not the underlying xattr storage write (step 1). Concurrent getxattr("security.evm") reads do NOT acquire evm_lock — they read the stored HMAC value directly. This is safe because setxattr holds i_rwsem exclusive, which prevents concurrent setxattr (but not concurrent getxattr, which takes i_rwsem shared). A getxattr concurrent with step 5 may see either the old or new HMAC — both are valid (the old HMAC matches the old xattr set; the new HMAC matches the new xattr set).

Appraisal on file open: When a file is opened, EVM re-computes the HMAC and compares it against the stored security.evm value. Mismatch returns EINTEGRITY (or EPERM when evm_mode is set to enforce).

14.16.11 Performance Budget

Operation Path class Typical cost Notes
getxattr (inline, ext4) Warm ~500 ns Inode already in page cache; inline scan of i_extra_isize region
getxattr (external block, ext4) Cold ~5 us Requires reading the shared xattr block from disk
setxattr (inline, ext4) Warm ~1 us Journal transaction for inode update
listxattr Cold ~2 us Iterates all xattr entries in inode + overflow
LSM hook overhead per xattr op Hot ~20 ns Static dispatch through LSM hook array (Section 9.8)
EVM HMAC re-computation Warm ~3 us HMAC-SHA3-256 over protected xattr set; dominated by hash computation
tmpfs getxattr (XArray lookup) Warm ~100 ns In-memory XArray traversal, no disk I/O

Hot-path note: xattr operations are not on the per-packet or per-syscall hot path. The most frequent xattr consumer is LSM label checks during file_open, which cache the resolved label in the inode's LSM blob and do not re-read the xattr on every access. The performance budget above reflects the actual xattr syscall cost, not the cached LSM check cost (which is ~5 ns via the blob pointer).

14.17 Pipes and FIFOs

Pipes (pipe(2), pipe2(2)) are anonymous unidirectional byte streams; named FIFOs (mkfifo(2)) expose the same mechanism through a filesystem pathname. Together they are the oldest and most widely used IPC primitives in UNIX.

14.17.1 Pipe Data Buffer

The pipe data buffer uses the page-array model defined in Section 17.3. Each pipe holds an array of page references (PipePage), supporting partial-page writes for small messages and zero-copy page gifting for vmsplice(SPLICE_F_GIFT). The PipeBuffer struct provides:

  • Default 65536 bytes (16 pages), matching Linux default
  • Lock-free single-writer fast path with active_writer CAS
  • Mutex-protected multi-writer slow path for POSIX atomicity
  • Inline storage for the common case (16 pages), heap fallback for fcntl(F_SETPIPE_SZ) beyond 64 KB
  • Seqlock-based resize safety for concurrent fcntl(F_SETPIPE_SZ)

Key PipeBuffer fields (summary; full definition in Section 17.3):

  • pages: ArrayVec<PipePage, 16> — page ring (inline for default 64 KiB)
  • r_idx: u32, w_idx: u32 — read/write indices into the page ring
  • capacity: u32 — current capacity in bytes
  • active_writer: AtomicU64 — CAS-based single-writer detection

See Section 17.3 for the complete struct definition, write/read algorithms, memory ordering rationale, and resize protocol.

14.17.2 Capacity and fcntl(F_SETPIPE_SZ)

Default pipe capacity: 65536 bytes (matches Linux default).

fcntl(F_SETPIPE_SZ, size) resizes the pipe:

  • Rounds up to the next power of 2 (minimum 4096 bytes)
  • Maximum: /proc/sys/fs/pipe-max-size (default 1 MB, same as Linux)
  • Requires CAP_SYS_RESOURCE to exceed /proc/sys/fs/pipe-max-size
  • Data currently in the pipe is preserved (pages migrated to the new array)
  • If the new size is smaller than the current content: EBUSY

fcntl(F_GETPIPE_SZ) returns the current capacity.
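
The rounding policy can be sketched in a few lines (round_pipe_size is an illustrative name; the cap check against pipe-max-size and the EBUSY content check are elided):

```rust
/// F_SETPIPE_SZ size policy: round up to a power of two, 4096-byte floor.
const PIPE_MIN_SIZE: usize = 4096;

fn round_pipe_size(requested: usize) -> usize {
    requested.max(PIPE_MIN_SIZE).next_power_of_two()
}

fn main() {
    assert_eq!(round_pipe_size(1), 4096);       // floor at one page
    assert_eq!(round_pipe_size(4096), 4096);    // already a power of two
    assert_eq!(round_pipe_size(65537), 131072); // round up past 64 KiB
}
```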

14.17.3 MPSC Pipes (Multiple Writers)

When more than one process/thread writes to the same pipe (e.g., shell { cmd1; cmd2; } | cmd3), the single-writer fast path cannot be used. UmkaOS detects multiple writers via PipeBuffer.writer_count: AtomicU32:

  • writer_count == 1: lock-free single-writer fast path (CAS on active_writer)
  • writer_count > 1: writer acquires PipeBuffer.ring_lock: Mutex<()> before writing

Writes <= PIPE_BUF (4096 bytes) are always atomic (no interleaving with other writers) -- same guarantee as POSIX.

14.17.4 O_DIRECT Pipe Mode

pipe2(O_DIRECT): each write() is a discrete message; read() returns exactly one message. Implemented by prepending a 4-byte length header before each message in the page array:

/// O_DIRECT pipe message header (4 bytes, native endian).
/// Followed immediately by `len` bytes of payload.
/// Alignment: none required (page data is byte-addressable).
/// packed is defensive: ensures no trailing padding if fields are added later.
#[repr(C, packed)]
pub struct PipeMessageHdr {
    pub len: u32,
}
const_assert!(size_of::<PipeMessageHdr>() == 4);

Maximum message size: PIPE_BUF (4096 bytes) for atomic writes.
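
The framing can be sketched with a flat byte buffer standing in for the page ring (write_message / read_message are illustrative names; the real implementation operates on PipePage entries):

```rust
/// Sketch of O_DIRECT framing: a 4-byte native-endian length header
/// followed by the payload, one header per message.
const PIPE_BUF: usize = 4096;

fn write_message(ring: &mut Vec<u8>, msg: &[u8]) -> Result<(), ()> {
    if msg.len() > PIPE_BUF {
        return Err(()); // exceeds the atomic-write bound
    }
    ring.extend_from_slice(&(msg.len() as u32).to_ne_bytes());
    ring.extend_from_slice(msg);
    Ok(())
}

/// Consume exactly one message, as read() does in O_DIRECT mode.
fn read_message(ring: &mut Vec<u8>) -> Option<Vec<u8>> {
    if ring.len() < 4 {
        return None;
    }
    let len = u32::from_ne_bytes([ring[0], ring[1], ring[2], ring[3]]) as usize;
    if ring.len() < 4 + len {
        return None; // incomplete message
    }
    let msg = ring[4..4 + len].to_vec();
    ring.drain(..4 + len);
    Some(msg)
}

fn main() {
    let mut ring = Vec::new();
    write_message(&mut ring, b"hello").unwrap();
    write_message(&mut ring, b"world!").unwrap();
    // Each read returns exactly one message, preserving boundaries.
    assert_eq!(read_message(&mut ring).unwrap(), b"hello".to_vec());
    assert_eq!(read_message(&mut ring).unwrap(), b"world!".to_vec());
    assert!(read_message(&mut ring).is_none());
}
```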

14.17.5 Named FIFOs (mkfifo)

Named FIFOs use the same PipeBuffer struct, but with a VFS inode for pathname lookup:

  • mkfifo(path, mode): creates a VFS inode of type InodeKind::Fifo
  • open(path, O_RDONLY): blocks until a writer opens (unless O_NONBLOCK)
  • open(path, O_WRONLY): blocks until a reader opens (unless O_NONBLOCK)
  • Once both ends are open: identical semantics to anonymous pipe

A FIFO is a VFS node (VfsNode) that, when opened, attaches to an existing PipeBuffer or allocates a new one. Multiple readers and writers can open a FIFO; the reader_count and writer_count fields track opens and closes. Writers use the multi-writer slow path when concurrent writes are detected. When the last reader and last writer close, the buffer is freed.

14.17.6 Splice and Zero-Copy

UmkaOS's page-array pipe model (PipeBuffer) uses the same fundamental design as Linux's struct pipe_buffer array: each pipe page is a reference to a physical page with offset and length. This enables true zero-copy splice operations via page-reference transfer.

Pipe-to-pipe splice (splice(pipe_fd_in, pipe_fd_out)): Transfers page references from the source pipe to the destination pipe. The source pipe's PipePage entries are moved (not copied) to the destination, incrementing the underlying page refcount. No data copy occurs -- only metadata (page pointer, offset, length) is transferred. This matches Linux's pipe-to-pipe splice semantics exactly.

File-to-pipe splice (splice(file_fd, pipe_fd)): The filesystem's FileOps::splice_read populates pipe pages directly from page cache pages, transferring page references (incrementing refcount) without copying data. The pipe's PipePage entries point directly into the page cache.

Pipe-to-socket splice (splice(pipe_fd, socket_fd)): The network stack receives page references from the pipe and uses scatter-gather DMA to transmit directly from the pipe's pages (zero kernel-side copy). Each PipePage maps to a scatter-gather entry for the NIC.

Pipe-to-file splice (splice(pipe_fd, file_fd)): The filesystem's FileOps::splice_write transfers page references from the pipe into the page cache (for filesystems that support it) or copies data from pipe pages to page cache pages.

vmsplice zero-copy (vmsplice(pipe_fd, iov, SPLICE_F_GIFT)): When SPLICE_F_GIFT is set, the user pages described by the iovec are unmapped from the sender's address space and gifted to the pipe as PipePage entries with is_gifted == true. The reader can then access the data without any copy. Without SPLICE_F_GIFT, data is copied from user pages into pipe pages.

14.17.7 Linux Compatibility

  • Default capacity 65536 bytes: identical to Linux
  • F_SETPIPE_SZ / F_GETPIPE_SZ: identical semantics
  • PIPE_BUF = 4096 bytes: POSIX required, identical
  • O_DIRECT pipe mode: identical to Linux 3.4+
  • pipe2(O_CLOEXEC | O_NONBLOCK | O_DIRECT): all flags supported
  • Splice semantics: identical to Linux (page-reference transfer)
  • /proc/sys/fs/pipe-max-size: identical default (1MB), same permission model
  • Signal on broken pipe: SIGPIPE + EPIPE on write to pipe with no readers
  • select()/poll()/epoll(): EPOLLIN when data available, EPOLLOUT when space available, EPOLLHUP on last writer close

14.18 Pseudo-Filesystems

Pseudo-filesystems are RAM-resident virtual filesystems that expose kernel state to userspace. Unlike disk-backed filesystems, they have no persistent storage — all content is generated dynamically by the kernel on read and consumed on write. UmkaOS provides a common registration framework and six standard pseudo-filesystems required for Linux compatibility.

procfs and sysfs are specified in Section 14.19. tmpfs and devtmpfs are built into the VFS core (Section 14.1). cgroupfs is specified in Section 17.2, and configfs in Section 14.12. This section covers the remaining pseudo-filesystems needed for complete Linux workload support: debugfs, tracefs, hugetlbfs, bpffs, securityfs, and efivarfs.

14.18.1 Common Registration Framework

All pseudo-filesystems share a uniform registration path through the VFS layer (Section 14.1). Each pseudo-fs defines a static PseudoFsType descriptor and calls register_filesystem() during kernel init (Section 2.3).

/// Registration descriptor for a pseudo-filesystem type.
///
/// Each pseudo-fs defines exactly one static instance. The VFS layer stores
/// registered types in an XArray keyed by a monotonic u64 registration
/// sequence number. Name-based lookup (mount -t <name>) walks the XArray
/// linearly — acceptable because the total number of filesystem types is
/// small (<50) and registration/mount are cold-path operations.
pub struct PseudoFsType {
    /// Filesystem type name as it appears in mount(2) and /proc/filesystems.
    /// e.g., "debugfs", "tracefs", "hugetlbfs", "bpf", "securityfs", "efivarfs".
    pub name: &'static str,

    /// Filesystem flags. Pseudo-filesystems typically set NODEV | NOEXEC | NOSUID
    /// to prevent device node creation, executable mapping, and setuid escalation.
    pub fs_flags: FsFlags,

    /// Filesystem magic number returned by statfs(2). Each pseudo-fs has a
    /// unique magic defined by the Linux UAPI (include/uapi/linux/magic.h).
    pub magic: u32,

    /// Populate the superblock and create the root inode. Called once per mount.
    /// The implementation creates the root directory inode with appropriate mode
    /// and ownership, then populates any initial directory structure.
    pub fill_super: fn(&mut SuperBlock) -> Result<(), VfsError>,

    /// Accepted mount options. The VFS parses `-o key=value` pairs from mount(2)
    /// and validates them against this table before calling `fill_super`.
    pub mount_opts: &'static [MountOptDesc],
}

bitflags! {
    /// Filesystem-level flags applied at mount time.
    pub struct FsFlags: u32 {
        /// No device special files may be created or accessed on this filesystem.
        const NODEV  = 1 << 0;
        /// No files on this filesystem may be executed (mmap PROT_EXEC denied).
        const NOEXEC = 1 << 1;
        /// Setuid and setgid bits are ignored for all files on this filesystem.
        const NOSUID = 1 << 2;
        /// This filesystem may be mounted inside a non-initial user namespace.
        /// Only set for filesystems that are safe for unprivileged mounting
        /// (e.g., tmpfs, procfs subset). None of the six pseudo-filesystems
        /// in this section set this flag — they all require initial namespace.
        const USERNS_MOUNT = 1 << 3;
    }
}

/// Descriptor for a single mount option accepted by a pseudo-filesystem.
pub struct MountOptDesc {
    /// Option name (e.g., "pagesize", "mode", "uid").
    pub name: &'static str,
    /// Type of the option value.
    pub kind: MountOptKind,
}

/// Value type for a mount option.
pub enum MountOptKind {
    /// Boolean flag (present = true, absent = false). Example: "noexec".
    Flag,
    /// Unsigned 64-bit integer. Example: "size=1073741824".
    U64,
    /// Unsigned 32-bit integer. Example: "uid=1000", "mode=0700".
    U32,
    /// String value. Example: (currently unused by pseudo-fs, reserved).
    Str,
}

Filesystem type names must be unique — calling register_filesystem() with a name that is already registered fails with EBUSY. Unregistration is not supported for built-in pseudo-filesystems (they live for the kernel's lifetime).
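
A minimal sketch of the duplicate-name check (a Vec models the registry here; the real VFS stores types in an XArray keyed by registration sequence number and walks it linearly for name lookup):

```rust
/// Hypothetical model of register_filesystem()'s uniqueness rule.
const EBUSY: i32 = 16;

struct Registry {
    names: Vec<&'static str>,
}

impl Registry {
    /// Reject a second registration under an existing name with EBUSY.
    fn register_filesystem(&mut self, name: &'static str) -> Result<(), i32> {
        if self.names.iter().any(|&n| n == name) {
            return Err(EBUSY);
        }
        self.names.push(name);
        Ok(())
    }
}

fn main() {
    let mut reg = Registry { names: Vec::new() };
    assert_eq!(reg.register_filesystem("debugfs"), Ok(()));
    assert_eq!(reg.register_filesystem("tracefs"), Ok(()));
    assert_eq!(reg.register_filesystem("debugfs"), Err(EBUSY));
}
```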

14.18.1.1 PseudoInode and File Operations

All pseudo-filesystems share a simplified inode representation for RAM-backed directory trees. Unlike disk-backed inodes (Section 14.1), pseudo-inodes carry no block mappings, no page cache association, and no filesystem-specific opaque data.

/// Simplified inode for RAM-backed pseudo-filesystems.
///
/// Pseudo-inodes are allocated on first access and freed when the
/// dentry is removed. They are never written to disk.
pub struct PseudoInode {
    /// Inode number. Allocated from a per-superblock `AtomicU64`
    /// counter starting at 2 (inode 1 is reserved for the root).
    /// u64 counters never wrap within the kernel's operational
    /// lifetime (50+ years at billions of allocations per second).
    pub ino: u64,
    /// POSIX file mode (type + permission bits).
    pub mode: u32,
    /// Owner UID.
    pub uid: Uid,
    /// Owner GID.
    pub gid: Gid,
    /// Access time.
    pub atime: Timespec64,
    /// Modification time.
    pub mtime: Timespec64,
    /// Status change time.
    pub ctime: Timespec64,
    /// Inode content, determined by the file type.
    pub data: PseudoInodeData,
}

/// Content discriminant for pseudo-inodes.
pub enum PseudoInodeData {
    /// Directory: children stored in `HashedXArray<DirEntry>`, a wrapper
    /// that encapsulates FNV-1a hashing + triangular probing + tombstone
    /// protocol on top of XArray. This is necessary because XArray's built-in
    /// `xa_store(key, value)` overwrites existing entries at the same key —
    /// using raw `xa_store(fnv1a(name), entry)` would silently lose entries
    /// on hash collision.
    ///
    /// `HashedXArray<V>` provides:
    ///   - `insert(name: &[u8], value: V) -> Result<(), Exists>`
    ///   - `lookup(name: &[u8]) -> Option<&V>`
    ///   - `remove(name: &[u8]) -> Option<V>`
    ///
    /// Implementation: on insert, hash `h = fnv1a(name) as u64`. If the slot
    /// at `h` is occupied by a different name, probe `h+1, h+3, h+6, ...`
    /// (triangular probing: the k-th offset is k*(k+1)/2). Lookup uses the
    /// same probe sequence, comparing `DirEntry.name` at each occupied slot.
    /// Remove stores a tombstone sentinel (a DirEntry with empty name) so
    /// the probe chain is not broken. Periodic compaction removes tombstones
    /// when the tombstone ratio exceeds 25%.
    ///
    /// Common-case (no collision): single XArray lookup, O(1). With 64-bit
    /// FNV-1a, collisions are negligible for directories under ~2^32 entries.
    Directory(HashedXArray<DirEntry>),
    /// Regular file: read/write behavior defined by the `PseudoFileOps`
    /// implementation provided at file creation time.
    RegularFile(&'static dyn PseudoFileOps),
    /// Symbolic link: target path stored inline.
    Symlink(Box<[u8]>),
}
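The hash-and-probe scheme can be sketched in isolation. The following self-contained userspace Rust models the `HashedXArray` probe sequence with a plain `Vec` as the slot store (the real structure sits on XArray); the FNV-1a constants are the standard 64-bit parameters. Tombstones are omitted here to keep the insert/lookup paths minimal.

```rust
// Userspace sketch of the HashedXArray probe scheme: FNV-1a 64-bit hash
// plus triangular probing (offsets 1, 3, 6, 10, ... = k*(k+1)/2).
// A Vec stands in for the XArray slot store.

const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x00000100000001b3;

fn fnv1a64(data: &[u8]) -> u64 {
    data.iter()
        .fold(FNV_OFFSET, |h, &b| (h ^ b as u64).wrapping_mul(FNV_PRIME))
}

struct Table {
    slots: Vec<Option<(Vec<u8>, u64)>>, // (name, ino); None = empty
}

impl Table {
    fn new(cap: usize) -> Self {
        Table { slots: vec![None; cap] }
    }
    /// Slot index for the k-th probe: (h + k*(k+1)/2) mod capacity.
    fn slot_at(&self, name: &[u8], k: u64) -> usize {
        let h = fnv1a64(name);
        (h.wrapping_add(k * (k + 1) / 2) % self.slots.len() as u64) as usize
    }
    fn insert(&mut self, name: &[u8], ino: u64) -> Result<(), ()> {
        for k in 0..self.slots.len() as u64 {
            let i = self.slot_at(name, k);
            match &self.slots[i] {
                Some((n, _)) if n == name => return Err(()), // already exists
                Some(_) => continue, // collision: keep probing
                None => {
                    self.slots[i] = Some((name.to_vec(), ino));
                    return Ok(());
                }
            }
        }
        Err(()) // table full
    }
    fn lookup(&self, name: &[u8]) -> Option<u64> {
        for k in 0..self.slots.len() as u64 {
            let i = self.slot_at(name, k);
            match &self.slots[i] {
                Some((n, ino)) if n == name => return Some(*ino),
                Some(_) => continue,             // occupied by another name
                None => return None,             // empty slot ends the chain
            }
        }
        None
    }
}
```

For power-of-two capacities, the triangular offsets visit every slot exactly once before repeating, so probing terminates as soon as an empty slot is reached.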

/// Directory entry within a pseudo-filesystem directory.
pub struct DirEntry {
    /// Entry name (variable-length, heap-allocated).
    pub name: Box<[u8]>,
    /// Inode number of the target.
    pub ino: u64,
    /// File type for getdents64 optimization. Values match Linux UAPI
    /// `include/uapi/linux/dirent.h`:
    ///   DT_UNKNOWN = 0, DT_FIFO = 1, DT_CHR = 2, DT_DIR = 4,
    ///   DT_BLK = 6, DT_REG = 8, DT_LNK = 10, DT_SOCK = 12.
    pub d_type: u8,
}

/// Callback trait for pseudo-filesystem regular files.
///
/// Each file in a pseudo-filesystem implements this trait to define its
/// read/write behavior. Implementations are typically stateless — they
/// read from or write to kernel data structures referenced through the
/// inode's subsystem-specific context.
pub trait PseudoFileOps: Send + Sync {
    /// Read data from this pseudo-file into `buf` starting at `offset`.
    /// Returns the number of bytes written to `buf`.
    fn read(
        &self,
        inode: &PseudoInode,
        buf: &mut [u8],
        offset: u64,
    ) -> Result<usize, Errno>;

    /// Write data from `buf` to this pseudo-file at `offset`.
    /// Returns the number of bytes consumed from `buf`.
    fn write(
        &self,
        inode: &PseudoInode,
        buf: &[u8],
        offset: u64,
    ) -> Result<usize, Errno>;

    /// Called when the file is opened. Optional initialization.
    /// Default: no-op (returns Ok).
    fn open(&self, _inode: &PseudoInode) -> Result<(), Errno> {
        Ok(())
    }

    /// Called when the last fd referencing this file is closed.
    /// Default: no-op.
    fn release(&self, _inode: &PseudoInode) {}
}
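A typical implementation is a stateless ops object that renders kernel state on read. The following self-contained restatement uses simplified local stand-ins (`MiniInode`, `MiniFileOps`) for `PseudoInode`/`PseudoFileOps`, with a read-only counter file as the example — the offset handling shown (serve from `offset`, short read at EOF) is the standard pseudo-file read pattern.

```rust
// Simplified, self-contained restatement of the PseudoFileOps pattern.
// MiniInode/MiniFileOps are local stand-ins for the kernel types above.
use std::sync::atomic::{AtomicU64, Ordering};

struct MiniInode {
    ino: u64,
}

trait MiniFileOps: Send + Sync {
    fn read(&self, inode: &MiniInode, buf: &mut [u8], offset: u64) -> Result<usize, i32>;
}

static EVENT_COUNT: AtomicU64 = AtomicU64::new(0);

/// Read-only file exposing EVENT_COUNT as decimal ASCII, like a stats file.
struct EventCountFile;

impl MiniFileOps for EventCountFile {
    fn read(&self, _inode: &MiniInode, buf: &mut [u8], offset: u64) -> Result<usize, i32> {
        // Render the value fresh on every read: the file has no stored content.
        let text = format!("{}\n", EVENT_COUNT.load(Ordering::Relaxed));
        let bytes = text.as_bytes();
        // Serve from `offset`; a read past EOF returns a short (zero-byte) read.
        let start = (offset as usize).min(bytes.len());
        let n = (bytes.len() - start).min(buf.len());
        buf[..n].copy_from_slice(&bytes[start..start + n]);
        Ok(n)
    }
}
```

Because the ops object carries no per-open state, a single `&'static` instance can back every file of the same kind, matching the `&'static dyn PseudoFileOps` storage in `PseudoInodeData::RegularFile`.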

14.18.1.2 Helper Functions

Convenience functions for creating and removing entries in pseudo-filesystem directories. Used internally by debugfs, tracefs, securityfs, and bpffs.

/// Create a regular file as a child of `parent`.
///
/// Allocates a new `PseudoInode` with `PseudoInodeData::RegularFile(ops)`,
/// inserts a `DirEntry` into the parent's `HashedXArray`, and creates a VFS
/// dentry linking the two.
///
/// # Errors
/// - `EEXIST`: a child with the same name already exists.
/// - `ENOMEM`: inode or dentry allocation failed.
pub fn pseudo_create_file(
    parent: &PseudoInode,
    name: &[u8],
    mode: u16,
    ops: &'static dyn PseudoFileOps,
) -> Result<Arc<PseudoInode>, VfsError>;

/// Create a subdirectory as a child of `parent`.
///
/// Allocates a new `PseudoInode` with `PseudoInodeData::Directory(HashedXArray::new())`.
/// The new directory starts empty.
pub fn pseudo_create_dir(
    parent: &PseudoInode,
    name: &[u8],
    mode: u16,
) -> Result<Arc<PseudoInode>, VfsError>;

/// Remove a child entry from `parent` by name.
///
/// Looks up the child in the parent's directory `HashedXArray`. If the
/// target is a directory, it must be empty (returns `ENOTEMPTY` otherwise).
///
/// **Tombstone protocol**: Because the directory uses open-addressing
/// collision resolution, naive deletion (clearing a slot to empty) would
/// break probe chains for entries inserted after the deleted entry via
/// collision. On removal, the slot is set to a tombstone sentinel
/// (`DirEntry::TOMBSTONE`) that:
/// - Is treated as "occupied" during probing (probe chains remain intact)
/// - Is skipped during name matching (not returned by lookup)
/// - Is overwritten by new inserts (reclaimed lazily)
///
/// When the directory's tombstone count exceeds 25% of occupied slots,
/// the directory is compacted: a new `HashedXArray` is allocated, all
/// live entries are inserted into it, the directory's `HashedXArray`
/// pointer is atomically swapped via `rcu_assign_pointer`, and the old
/// `HashedXArray` is freed after an RCU grace period. Concurrent RCU
/// readers always see a consistent directory (either old or new, never
/// partially rebuilt). Cold path, under the parent inode's directory lock.
pub fn pseudo_remove(
    parent: &PseudoInode,
    name: &[u8],
) -> Result<(), VfsError>;
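The tombstone and compaction behavior can be demonstrated on a toy open-addressed table. To force a collision deterministically, this sketch hashes every name to slot 0 (the real table uses FNV-1a, so collisions are rare); the RCU pointer swap is replaced by returning a freshly built table. Illustration only.

```rust
// Toy open-addressed table demonstrating the tombstone protocol:
// every name "hashes" to slot 0, probed with triangular offsets.

#[derive(Clone, PartialEq, Debug)]
enum Slot {
    Empty,
    Tombstone,
    Live(&'static str),
}

/// Find the slot holding `name`, following the probe chain through
/// tombstones. A true Empty slot terminates the chain.
fn probe(slots: &[Slot], name: &str) -> Option<usize> {
    for k in 0..slots.len() {
        let i = (k * (k + 1) / 2) % slots.len();
        match &slots[i] {
            Slot::Live(n) if *n == name => return Some(i),
            Slot::Empty => return None,
            _ => continue, // tombstone or other name: keep probing
        }
    }
    None
}

/// Insert into the first Empty or Tombstone slot (tombstones are
/// reclaimed lazily by new inserts).
fn insert(slots: &mut [Slot], name: &'static str) -> bool {
    for k in 0..slots.len() {
        let i = (k * (k + 1) / 2) % slots.len();
        if matches!(&slots[i], Slot::Live(n) if *n == name) {
            return false; // already present
        }
        if matches!(slots[i], Slot::Empty | Slot::Tombstone) {
            slots[i] = Slot::Live(name);
            return true;
        }
    }
    false
}

/// Remove leaves a tombstone, not Empty, so entries placed later in the
/// same probe chain remain reachable.
fn remove(slots: &mut [Slot], name: &str) -> bool {
    match probe(slots, name) {
        Some(i) => {
            slots[i] = Slot::Tombstone;
            true
        }
        None => false,
    }
}

/// Compaction: rebuild without tombstones. The kernel swaps the rebuilt
/// table in via rcu_assign_pointer; here we simply return it.
fn compact(slots: &[Slot]) -> Vec<Slot> {
    let mut fresh = vec![Slot::Empty; slots.len()];
    for s in slots {
        if let Slot::Live(n) = s {
            for k in 0..fresh.len() {
                let i = (k * (k + 1) / 2) % fresh.len();
                if fresh[i] == Slot::Empty {
                    fresh[i] = Slot::Live(*n);
                    break;
                }
            }
        }
    }
    fresh
}
```

The key invariant: after removing an entry that sits earlier in a probe chain, lookups of later entries in the same chain still succeed, because the tombstone is treated as occupied during probing.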

14.18.2 debugfs — Kernel Debug Filesystem

Mount point: /sys/kernel/debug (mount -t debugfs debugfs /sys/kernel/debug)
Magic: 0x64626720 (DEBUGFS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID

debugfs exposes kernel-internal debugging data. It carries no stable ABI guarantee: files may appear, disappear, or change format between kernel versions. Userspace tools must handle missing files gracefully.

Access control: mounted with mode=0700 by default, restricting access to CAP_SYS_ADMIN (Section 9.9). Distributions may remount with mode=0755 for read-only debug access, but this is a policy decision outside kernel scope.

14.18.2.1 debugfs Registration

static DEBUGFS_TYPE: PseudoFsType = PseudoFsType {
    name: "debugfs",
    fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
    magic: 0x64626720,
    fill_super: debugfs_fill_super,
    mount_opts: &[
        MountOptDesc { name: "uid",  kind: MountOptKind::U32 },
        MountOptDesc { name: "gid",  kind: MountOptKind::U32 },
        MountOptDesc { name: "mode", kind: MountOptKind::U32 },
    ],
};

14.18.2.2 debugfs Kernel API

Kernel subsystems create debugfs entries during their initialization. All functions are no-ops (returning a dummy handle) if debugfs is not mounted, ensuring subsystem init never fails due to debugfs unavailability.

/// Handle to a debugfs directory. Opaque to callers.
/// Internally holds the dentry reference for the directory inode.
pub struct DebugfsDir {
    dentry: Arc<Dentry>,
}

/// Handle to a debugfs file or value entry. Opaque to callers.
pub struct DebugfsEntry {
    dentry: Arc<Dentry>,
}

/// Create a directory under the debugfs root (or under `parent`).
/// Returns `DebugfsDir` used as the parent for subsequent entries.
pub fn debugfs_create_dir(
    name: &str,
    parent: Option<&DebugfsDir>,
) -> Result<DebugfsDir, VfsError>;

/// Create a file with custom read/write operations.
/// `mode` is the POSIX permission bits (e.g., 0o444 for read-only).
/// `fops` provides the read/write/open/release callbacks.
pub fn debugfs_create_file(
    name: &str,
    parent: &DebugfsDir,
    mode: u16,
    fops: &'static FileOps,
) -> Result<DebugfsEntry, VfsError>;

/// Create a file that reads/writes a single `AtomicU32` value.
/// Read returns the decimal ASCII representation; write parses decimal ASCII.
pub fn debugfs_create_u32(
    name: &str,
    parent: &DebugfsDir,
    mode: u16,
    value: &'static AtomicU32,
) -> Result<DebugfsEntry, VfsError>;

/// Create a file that reads/writes a single `AtomicU64` value.
pub fn debugfs_create_u64(
    name: &str,
    parent: &DebugfsDir,
    mode: u16,
    value: &'static AtomicU64,
) -> Result<DebugfsEntry, VfsError>;

/// Create a file that reads/writes a single `AtomicBool` value.
/// Read returns "Y\n" or "N\n"; write accepts "1"/"Y"/"y" or "0"/"N"/"n".
pub fn debugfs_create_bool(
    name: &str,
    parent: &DebugfsDir,
    mode: u16,
    value: &'static AtomicBool,
) -> Result<DebugfsEntry, VfsError>;

/// Remove a single debugfs entry (file or empty directory).
pub fn debugfs_remove(entry: DebugfsEntry);

/// Remove a directory and all entries beneath it recursively.
/// Safe to call from module teardown — removes all entries created
/// by the subsystem in one call.
pub fn debugfs_remove_recursive(dir: DebugfsDir);
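The value-file wire formats documented above (decimal ASCII for u32/u64, "Y\n"/"N\n" for bool) are simple enough to state as code. A userspace sketch of just the formatting rules — plain values stand in for the `AtomicU32`/`AtomicBool` backing and the `FileOps` plumbing:

```rust
// Userspace sketch of the debugfs value-file wire formats described above.

/// debugfs_create_u32 read side: decimal ASCII plus trailing newline.
fn u32_read(value: u32) -> String {
    format!("{value}\n")
}

/// debugfs_create_u32 write side: parse decimal ASCII (trailing newline ok).
fn u32_write(buf: &str) -> Result<u32, ()> {
    buf.trim_end_matches('\n').trim().parse().map_err(|_| ())
}

/// debugfs_create_bool read side: "Y\n" or "N\n".
fn bool_read(value: bool) -> &'static str {
    if value { "Y\n" } else { "N\n" }
}

/// debugfs_create_bool write side: accepts "1"/"Y"/"y" or "0"/"N"/"n".
fn bool_write(buf: &str) -> Result<bool, ()> {
    match buf.trim_end_matches('\n').trim() {
        "1" | "Y" | "y" => Ok(true),
        "0" | "N" | "n" => Ok(false),
        _ => Err(()),
    }
}
```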

14.18.2.3 Lockdown Integration

When kernel lockdown (Section 9.3) is active, debugfs access is restricted based on the lockdown level:

| Lockdown Level | debugfs Behavior |
|---|---|
| none | Full read/write access (subject to mount permissions) |
| integrity | Read-only: writes to all debugfs files return EPERM |
| confidentiality | Fully disabled: mount returns EPERM; all reads/writes return EPERM |

The debugfs=off boot parameter disables debugfs entirely (equivalent to confidentiality lockdown for debugfs). When disabled, debugfs_create_* functions return dummy handles and all file operations are no-ops, ensuring subsystem initialization never fails due to debugfs unavailability.

14.18.2.4 Standard debugfs Directories

Created at boot by their respective subsystems:

| Directory | Subsystem | Content |
|---|---|---|
| /sys/kernel/debug/block/ | Block layer (Section 15.2) | Per-device I/O stats, request queue state |
| /sys/kernel/debug/dma_buf/ | DMA subsystem (Section 4.14) | DMA-buf allocation tracking |
| /sys/kernel/debug/clk/ | Clock framework (Section 2.24) | Clock tree rates, enable counts |
| /sys/kernel/debug/regulator/ | Regulator framework (Section 13.27) | Voltage/current state per regulator |
| /sys/kernel/debug/ieee80211/ | WiFi subsystem (Section 13.15) | Per-PHY/per-STA debug counters |
| /sys/kernel/debug/bluetooth/ | Bluetooth (Section 13.14) | HCI trace data |

14.18.3 tracefs — Tracing Filesystem

Mount point: /sys/kernel/tracing (historically /sys/kernel/debug/tracing; UmkaOS creates a compatibility symlink at the legacy path)
Magic: 0x74726163 (TRACEFS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID

tracefs exposes the tracepoint event catalog and ftrace ring buffers to userspace tracing tools (perf, bpftrace, trace-cmd). It is the primary interface for Section 20.2.

Access control: CAP_SYS_ADMIN or CAP_PERFMON for most operations. Reading available_events and event format files requires only read permission on the tracefs mount (allows unprivileged discovery of available tracepoints without enabling them).

14.18.3.1 tracefs Registration

static TRACEFS_TYPE: PseudoFsType = PseudoFsType {
    name: "tracefs",
    fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
    magic: 0x74726163,
    fill_super: tracefs_fill_super,
    mount_opts: &[
        MountOptDesc { name: "uid",  kind: MountOptKind::U32 },
        MountOptDesc { name: "gid",  kind: MountOptKind::U32 },
        MountOptDesc { name: "mode", kind: MountOptKind::U32 },
    ],
};

14.18.3.2 tracefs Directory Structure

/sys/kernel/tracing/
    available_events           # one "subsystem:event" per line
    available_tracers          # "nop function function_graph"
    current_tracer             # write tracer name to activate
    trace                      # human-readable trace output (snapshot)
    trace_pipe                 # streaming trace output (blocks on read)
    tracing_on                 # "1"=enabled, "0"=disabled (write to toggle)
    buffer_size_kb             # per-CPU ring buffer size (write to resize)
    events/                    # per-subsystem event directories
        sched/                 # scheduler events
            sched_switch/
                enable         # "1"=trace, "0"=disable
                filter         # BPF-style filter expression
                format         # printf-style field description
                id             # tracepoint numeric ID (u32)
        syscalls/              # syscall enter/exit events
        net/                   # networking events
        block/                 # block I/O events
        irq/                   # interrupt events
    per_cpu/                   # per-CPU trace data
        cpu0/
            trace              # CPU-specific trace snapshot
            trace_pipe         # CPU-specific streaming trace
            stats              # entries, overrun, commit overrun, bytes
    instances/                 # named trace instances (independent buffers)

14.18.3.3 tracefs Ring Buffer

Each CPU has a dedicated ring buffer (per-CPU, no lock contention). The buffer size defaults to 1408 KB per CPU (matching Linux default) and is configurable via buffer_size_kb. Ring buffer allocation uses the page allocator (Section 4.2); each buffer is a set of linked pages, not a single contiguous allocation (avoids high-order allocation failures on fragmented systems).

/// Per-CPU trace ring buffer. One instance per CPU per trace instance.
pub struct TraceRingBuffer {
    /// Per-CPU buffer pages. Each entry is a page-sized ring segment.
    /// Pages are linked in a circular list for wraparound.
    pub pages: PerCpu<TraceBufferPages>,
    /// Buffer size in KB (per CPU). Default: 1408.
    pub size_kb: AtomicU32,
    /// Overrun counter: events dropped due to full buffer (per CPU, u64).
    pub overrun: PerCpu<AtomicU64>,
    /// Total events written (per CPU, u64).
    pub entries: PerCpu<AtomicU64>,
}
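The overrun/entries accounting can be modeled with a toy single-CPU ring: fixed capacity, and when the buffer is full, new events are dropped and counted (matching the "events dropped due to full buffer" semantics of the `overrun` counter). A `VecDeque` stands in for the circular page list; this is a model, not the kernel buffer.

```rust
// Toy model of one per-CPU trace ring with drop-and-count overrun semantics.
use std::collections::VecDeque;

struct ToyRing {
    buf: VecDeque<u64>, // event payloads
    cap: usize,         // fixed capacity (buffer_size_kb analogue)
    entries: u64,       // total events written
    overrun: u64,       // events dropped because the buffer was full
}

impl ToyRing {
    fn new(cap: usize) -> Self {
        ToyRing { buf: VecDeque::with_capacity(cap), cap, entries: 0, overrun: 0 }
    }
    fn write(&mut self, ev: u64) {
        if self.buf.len() == self.cap {
            self.overrun += 1; // reader too slow: drop the event, count it
        } else {
            self.buf.push_back(ev);
            self.entries += 1;
        }
    }
    fn read(&mut self) -> Option<u64> {
        self.buf.pop_front() // consuming read, like trace_pipe
    }
}
```

The per-CPU `stats` file in tracefs reports exactly these counters (entries, overrun) for each CPU's buffer.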

14.18.3.4 Tracepoint Integration

Each tracepoint registered via the DECLARE_TRACEPOINT! macro (Section 20.2) automatically gets an events/<subsystem>/<name>/ directory in tracefs. The id file provides a stable u32 identifier that perf_event_open() (Section 20.8) uses to attach to tracepoints programmatically.

14.18.3.5 Trace Instances

Named trace instances (mkdir /sys/kernel/tracing/instances/mytracer) create independent ring buffers with their own events/, trace, trace_pipe, and per_cpu/ directories. This allows multiple concurrent tracing sessions (e.g., one for system-wide scheduler tracing, another for application-specific I/O tracing) without interference.

14.18.4 hugetlbfs — Huge Page Filesystem

Mount point: /dev/hugepages (or any user-chosen mount point)
Magic: 0x958458f6 (HUGETLBFS_MAGIC)
Flags: NODEV | NOSUID

hugetlbfs provides huge page-backed file mappings for applications requiring large contiguous physical pages: databases (Oracle, PostgreSQL shared buffers), DPDK, HPC/AI workloads, and GPU pinned memory.

14.18.4.1 hugetlbfs Registration

static HUGETLBFS_TYPE: PseudoFsType = PseudoFsType {
    name: "hugetlbfs",
    fs_flags: FsFlags::NODEV | FsFlags::NOSUID,
    magic: 0x958458f6,
    fill_super: hugetlbfs_fill_super,
    mount_opts: &[
        MountOptDesc { name: "pagesize", kind: MountOptKind::U64 },
        MountOptDesc { name: "size",     kind: MountOptKind::U64 },
        MountOptDesc { name: "min_size", kind: MountOptKind::U64 },
        MountOptDesc { name: "nr_inodes", kind: MountOptKind::U64 },
        MountOptDesc { name: "uid",  kind: MountOptKind::U32 },
        MountOptDesc { name: "gid",  kind: MountOptKind::U32 },
        MountOptDesc { name: "mode", kind: MountOptKind::U32 },
    ],
};

14.18.4.2 Mount Options

| Option | Type | Default | Description |
|---|---|---|---|
| pagesize | bytes | architecture default | Huge page size. Platform-dependent (see table below). |
| size | bytes | all available | Maximum total size of files on this mount. |
| min_size | bytes | 0 | Guaranteed reservation: this many bytes of huge pages are reserved at mount time and cannot be stolen by other mounts. |
| nr_inodes | count | unlimited | Maximum number of inodes (files + directories). |
| uid | uid_t | 0 | UID of the root directory. |
| gid | gid_t | 0 | GID of the root directory. |
| mode | octal | 01777 | Permissions of the root directory. |

14.18.4.3 Supported Huge Page Sizes

| Architecture | Default | Available Sizes |
|---|---|---|
| x86-64 | 2 MiB | 2 MiB (PMD), 1 GiB (PUD) |
| AArch64 | 2 MiB | 64 KiB (cont PTE), 2 MiB (PMD), 32 MiB (cont PMD), 1 GiB (PUD) |
| ARMv7 | 2 MiB | 2 MiB (section) |
| RISC-V 64 | 2 MiB | 2 MiB (PMD), 1 GiB (PUD) |
| PPC32 | 4 MiB | 4 MiB (depends on MMU variant) |
| PPC64LE | 2 MiB | 2 MiB, 1 GiB (radix); 16 MiB (HPT mode) |
| s390x | 1 MiB | 1 MiB (segment table large page) |
| LoongArch64 | 2 MiB | 2 MiB (PMD), 1 GiB (PUD) |

Sizes are discovered at boot from the hardware page table capabilities and reported in /proc/meminfo (Hugepagesize) and /sys/kernel/mm/hugepages/.

14.18.4.4 File Operations

| Syscall | Behavior |
|---|---|
| open() / creat() | Creates a file backed by huge pages. No physical pages allocated yet. |
| mmap() | Maps huge pages into the process address space. Each VMA page fault allocates a single huge page from the pool. MAP_POPULATE pre-faults all pages. |
| read() / write() | Returns EINVAL. hugetlbfs files are mmap-only. |
| unlink() | Removes the directory entry. Huge pages are returned to the pool when the last mapping is removed (reference-counted). |
| fallocate() | mode=0: pre-allocate huge pages without mapping. FALLOC_FL_PUNCH_HOLE: release allocated pages for the given range. |
| ftruncate() | Resize the file. Shrinking releases pages beyond the new size. |

14.18.4.5 Huge Page Pool Management

The system-wide huge page pool is managed via:

  • /proc/sys/vm/nr_hugepages — persistent huge pages (survive memory pressure)
  • /proc/sys/vm/nr_overcommit_hugepages — surplus pages (reclaimed under pressure)
  • Per-NUMA node: /sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/nr_hugepages

The hugetlbfs pool is independent of THP (Section 4.7): hugetlbfs uses an explicit reservation pool while THP uses buddy allocator promotion. They do not compete for the same pages.

14.18.4.6 memfd_create Integration

memfd_create() with MFD_HUGETLB (Section 4.15) creates an anonymous file descriptor backed by hugetlbfs. The optional MFD_HUGE_2MB / MFD_HUGE_1GB flags select the page size. This is the preferred mechanism for applications that need huge pages without a visible filesystem mount.

14.18.5 bpffs — BPF Filesystem

Mount point: /sys/fs/bpf (mount -t bpf bpffs /sys/fs/bpf)
Magic: 0xcafe4a11 (BPF_FS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID

bpffs persists BPF objects (programs, maps, links) beyond the lifetime of the loading process. Required by Cilium (Kubernetes CNI), systemd, bpftool, and any infrastructure that loads BPF programs at boot and expects them to survive across process restarts.

14.18.5.1 bpffs Registration

static BPFFS_TYPE: PseudoFsType = PseudoFsType {
    name: "bpf",
    fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
    magic: 0xcafe4a11,
    fill_super: bpffs_fill_super,
    mount_opts: &[
        MountOptDesc { name: "mode",            kind: MountOptKind::U32 },
        MountOptDesc { name: "delegate_cmds",   kind: MountOptKind::U64 },
        MountOptDesc { name: "delegate_maps",   kind: MountOptKind::U64 },
        MountOptDesc { name: "delegate_progs",  kind: MountOptKind::U64 },
        MountOptDesc { name: "delegate_attachs", kind: MountOptKind::U64 },
    ],
};

14.18.5.2 BPF Object Pinning

/// Pin a BPF object (program, map, or link) to a path in bpffs.
/// The object's kernel reference count is incremented. The object
/// remains alive as long as at least one pin or fd references it.
///
/// Called via bpf(BPF_OBJ_PIN, { fd, pathname }).
///
/// # Errors
/// - `EEXIST`: path already exists.
/// - `EINVAL`: fd does not refer to a BPF object.
/// - `EACCES`: caller lacks CAP_BPF or write permission on parent directory.
/// - `ENOSPC`: bpffs inode limit reached (if configured).
pub fn bpf_obj_pin(fd: BpfFd, pathname: &Path) -> Result<(), SyscallError>;

/// Retrieve a previously pinned BPF object by path, returning a new fd.
/// The caller receives a new file descriptor referencing the pinned object.
///
/// Called via bpf(BPF_OBJ_GET, { pathname }).
///
/// # Errors
/// - `ENOENT`: path does not exist.
/// - `EACCES`: caller lacks CAP_BPF or read permission on the path.
pub fn bpf_obj_get(pathname: &Path) -> Result<BpfFd, SyscallError>;

14.18.5.3 Object Lifecycle

A BPF object is freed when all references are removed:

  1. All userspace file descriptors closed.
  2. All bpffs pins removed (via unlink()).
  3. All kernel-internal references released (e.g., a BPF program attached to a network hook holds an internal reference; detaching releases it).

Only when the reference count reaches zero does the kernel free the BPF program bytecode and map memory.
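This lifecycle maps directly onto reference counting: each userspace fd, each bpffs pin, and each kernel-side attachment holds one strong reference. A userspace model using `Arc`, with a small pin table mirroring the `bpf_obj_pin`/`bpf_obj_get` semantics above (the `Bpffs` type here is an illustrative stand-in, not kernel API):

```rust
// Userspace model of bpffs pinning: one strong reference per pin,
// another per fd. The object is freed only when all references drop.
use std::collections::HashMap;
use std::sync::Arc;

struct BpfObject {
    id: u32,
}

#[derive(Default)]
struct Bpffs {
    pins: HashMap<String, Arc<BpfObject>>,
}

impl Bpffs {
    /// BPF_OBJ_PIN analogue: fails if the path is already taken (EEXIST).
    fn pin(&mut self, path: &str, obj: &Arc<BpfObject>) -> Result<(), &'static str> {
        if self.pins.contains_key(path) {
            return Err("EEXIST");
        }
        self.pins.insert(path.to_string(), Arc::clone(obj));
        Ok(())
    }
    /// BPF_OBJ_GET analogue: hands out a new strong reference ("new fd").
    fn get(&self, path: &str) -> Result<Arc<BpfObject>, &'static str> {
        self.pins.get(path).cloned().ok_or("ENOENT")
    }
    /// unlink() analogue: drops the pin's reference.
    fn unlink(&mut self, path: &str) -> Result<(), &'static str> {
        self.pins.remove(path).map(|_| ()).ok_or("ENOENT")
    }
}
```

The loader can exit (dropping its fd) without the object dying, because the pin still holds a reference — exactly the persistence property bpffs exists to provide.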

14.18.5.4 Directory Structure

bpffs supports arbitrary directory hierarchies via mkdir(2). Conventions used by standard tools:

| Path | Creator | Content |
|---|---|---|
| /sys/fs/bpf/tc/globals/ | iproute2 | Shared maps for TC BPF programs |
| /sys/fs/bpf/cilium/ | Cilium | Datapath programs and maps |
| /sys/fs/bpf/xdp/ | xdp-tools | XDP programs |
| /sys/fs/bpf/ip/ | iproute2 | BPF programs for ip rule |

Standard VFS operations (mkdir(), rmdir(), unlink(), readdir()) manage the bpffs namespace. unlink() on a pinned object file removes the pin; rmdir() requires the directory to be empty.

14.18.5.5 BPF Token Delegation

BPF tokens (Linux 6.9+) allow unprivileged processes to perform specific BPF operations within the scope of a bpffs mount. A privileged process creates a token by calling bpf(BPF_TOKEN_CREATE) on a bpffs file descriptor; the token inherits delegation rights from the mount options.

/// A BPF token granting scoped BPF permissions to an unprivileged process.
/// Created via bpf(BPF_TOKEN_CREATE, { bpffs_fd }).
///
/// The token inherits its allowed operations from the mount-time
/// delegation options of the bpffs instance.
pub struct BpfToken {
    /// BPF commands this token permits (e.g., BPF_PROG_LOAD, BPF_MAP_CREATE).
    pub allowed_cmds: BpfCmdSet,
    /// Map types this token permits creating.
    pub allowed_map_types: BpfMapTypeSet,
    /// Program types this token permits loading.
    pub allowed_prog_types: BpfProgTypeMask,
    /// Attach types this token permits.
    pub allowed_attach_types: BpfAttachTypeSet,
}

Mount-time delegation options control what a token created on this mount may grant:

| Mount Option | Effect |
|---|---|
| delegate_cmds=0x1f | Bitmask of BPF commands the token may delegate |
| delegate_maps=0xff | Bitmask of map types the token may delegate |
| delegate_progs=0x3f | Bitmask of program types the token may delegate |
| delegate_attachs=0x7f | Bitmask of attach types the token may delegate |

Without delegation mount options, BPF_TOKEN_CREATE returns ENOENT — the mount does not support token creation.

14.18.5.6 Access Control

Standard POSIX permissions on directories and files control visibility. Creating or retrieving pinned objects additionally requires CAP_BPF (Section 9.9). Programs operating within a BPF token scope (Section 19.2) may pin objects without CAP_BPF if the token grants the appropriate permissions.

14.18.6 securityfs — Security Module Filesystem

Mount point: /sys/kernel/security
Magic: 0x73636673 (SECURITYFS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID

securityfs provides per-LSM configuration and status interfaces. The overall LSM framework and the content exposed under securityfs are specified in Section 9.8. This section formalizes the filesystem registration and the kernel API for creating securityfs entries.

14.18.6.1 securityfs Registration

static SECURITYFS_TYPE: PseudoFsType = PseudoFsType {
    name: "securityfs",
    fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
    magic: 0x73636673,
    fill_super: securityfs_fill_super,
    mount_opts: &[],
};

14.18.6.2 securityfs Kernel API

/// Handle to a securityfs directory. Opaque to callers.
pub struct SecurityfsDir {
    dentry: Arc<Dentry>,
}

/// Handle to a securityfs file. Opaque to callers.
pub struct SecurityfsEntry {
    dentry: Arc<Dentry>,
}

/// Create a directory under the securityfs root (or under `parent`).
/// Each LSM creates its top-level directory during LSM init.
pub fn securityfs_create_dir(
    name: &str,
    parent: Option<&SecurityfsDir>,
) -> Result<SecurityfsDir, VfsError>;

/// Create a file with custom read/write callbacks.
/// `mode` is POSIX permission bits. LSMs typically use 0o444 for
/// status files and 0o600 or 0o200 for policy write interfaces.
pub fn securityfs_create_file(
    name: &str,
    parent: &SecurityfsDir,
    mode: u16,
    fops: &'static FileOps,
) -> Result<SecurityfsEntry, VfsError>;

/// Remove a securityfs entry. Called during LSM teardown (if supported)
/// or during live kernel evolution ([Section 13.18](13-device-classes.md#live-kernel-evolution)).
pub fn securityfs_remove(entry: SecurityfsEntry);

14.18.6.3 Standard securityfs Layout

| Path | LSM | Content |
|---|---|---|
| /sys/kernel/security/lsm | Core | Comma-separated list of active LSMs (read-only) |
| /sys/kernel/security/apparmor/ | AppArmor | Profile management, policy load |
| /sys/kernel/security/selinux/ | SELinux | Enforce mode, policy, booleans, AVC stats |
| /sys/kernel/security/ima/ | IMA | Measurement log, policy (Section 9.5) |
| /sys/kernel/security/evm/ | EVM | EVM mode, status |
| /sys/kernel/security/landlock/ | Landlock | ABI version (Section 9.8) |

Access control varies by LSM: reading status files is typically unrestricted, while policy writes require CAP_MAC_ADMIN (Section 9.9).

14.18.7 efivarfs — EFI Variable Filesystem

Mount point: /sys/firmware/efi/efivars
Magic: 0xde5e81e4 (EFIVARFS_MAGIC)
Flags: NODEV | NOEXEC | NOSUID

efivarfs exposes UEFI firmware variables to userspace for reading and writing. Required for boot manager configuration (efibootmgr), Secure Boot key management, and firmware diagnostics.

Availability: UEFI systems only. On non-UEFI platforms (most ARMv7, PPC32, PPC64LE), the filesystem is not registered and mount -t efivarfs returns ENODEV.

| Architecture | EFI Support | efivarfs Available |
|---|---|---|
| x86-64 | Yes (UEFI standard) | Yes |
| AArch64 | Yes (UEFI standard) | Yes |
| ARMv7 | Rare (U-Boot) | Only if EFI runtime services present |
| RISC-V 64 | Emerging (UEFI spec) | When EFI runtime services present |
| PPC32 | No (Open Firmware / DTB) | No |
| PPC64LE | No (OPAL / SLOF) | No |
| s390x | No (z/VM / LPAR IPL) | No |
| LoongArch64 | Emerging (UEFI standard on Loongson 3A5000+) | When EFI runtime services present |

14.18.7.1 efivarfs Registration

static EFIVARFS_TYPE: PseudoFsType = PseudoFsType {
    name: "efivarfs",
    fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
    magic: 0xde5e81e4,
    fill_super: efivarfs_fill_super,
    mount_opts: &[],
};

14.18.7.2 File Naming Convention

Each file in efivarfs represents one UEFI variable, named as {VariableName}-{VendorGUID} where the GUID is in standard xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx format:

  • Boot0001-8be4df61-93ca-11d2-aa0d-00e098032b8c — Boot entry 1 (EFI Global Variable GUID)
  • BootOrder-8be4df61-93ca-11d2-aa0d-00e098032b8c — Boot order sequence
  • SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c — Secure Boot state
  • dbx-d719b2cb-3d3a-4596-a3bc-dad00e67656f — Secure Boot forbidden signature DB

14.18.7.3 File Format

/// Wire format for efivarfs file content (userspace ABI). The first 4 bytes
/// are the EFI variable attributes as a little-endian u32; the remainder is
/// the variable value. Total size is attributes(4) + value(N). This matches
/// the Linux efivarfs file format exactly.
///
/// Read: returns attributes (4 bytes LE) + value
/// Write: caller provides attributes (4 bytes LE) + new value

/// Parse an efivarfs file buffer into (attributes, value_slice).
/// Returns `Err(EINVAL)` if the buffer is too short (< 4 bytes).
fn parse_efivar(buf: &[u8]) -> Result<(EfiVariableAttributes, &[u8]), Errno> {
    if buf.len() < 4 {
        return Err(EINVAL);
    }
    let attrs = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    Ok((EfiVariableAttributes::from_bits_truncate(attrs), &buf[4..]))
}

/// Build an efivarfs file buffer from (attributes, value).
/// Writes attributes as little-endian u32 prefix followed by value bytes.
/// The caller must supply `out` with at least `4 + value.len()` bytes.
fn build_efivar(attrs: EfiVariableAttributes, value: &[u8], out: &mut [u8]) -> usize {
    let total = 4 + value.len();
    assert!(out.len() >= total, "efivar output buffer too small");
    out[..4].copy_from_slice(&attrs.bits().to_le_bytes());
    out[4..total].copy_from_slice(value);
    total
}

bitflags! {
    /// EFI variable attributes. Matches the UEFI specification (Table 14).
    pub struct EfiVariableAttributes: u32 {
        /// Variable persists across resets.
        const NON_VOLATILE                       = 0x0000_0001;
        /// Variable accessible during boot services.
        const BOOTSERVICE_ACCESS                 = 0x0000_0002;
        /// Variable accessible at OS runtime.
        const RUNTIME_ACCESS                     = 0x0000_0004;
        /// Hardware error record (separate NVRAM region on some firmware).
        const HARDWARE_ERROR_RECORD              = 0x0000_0008;
        /// Only authenticated writes accepted (deprecated by UEFI 2.8+).
        const AUTHENTICATED_WRITE_ACCESS         = 0x0000_0010;
        /// Time-based authenticated variable (used by Secure Boot db/dbx).
        const TIME_BASED_AUTHENTICATED_WRITE_ACCESS = 0x0000_0020;
        /// Append-only writes: new data is appended, existing data unchanged.
        const APPEND_WRITE                       = 0x0000_0040;
    }
}
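A self-contained round trip of the wire format, with a plain u32 standing in for the kernel's attribute bitflags type:

```rust
// Standalone round trip of the efivarfs wire format: a 4-byte little-endian
// attributes prefix followed by the raw variable value.

fn parse(buf: &[u8]) -> Result<(u32, &[u8]), &'static str> {
    if buf.len() < 4 {
        return Err("EINVAL"); // must contain at least the attribute word
    }
    let attrs = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    Ok((attrs, &buf[4..]))
}

fn build(attrs: u32, value: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + value.len());
    out.extend_from_slice(&attrs.to_le_bytes());
    out.extend_from_slice(value);
    out
}
```

For example, NON_VOLATILE | BOOTSERVICE_ACCESS | RUNTIME_ACCESS = 0x7 is the usual attribute word for a persistent variable readable at OS runtime, such as BootOrder.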

14.18.7.4 Operations

| Syscall | Behavior |
|---|---|
| read() | Returns attributes (4 bytes LE) followed by the variable value. Calls EFI GetVariable() runtime service (Section 2.20). |
| write() | Caller provides attributes (4 bytes LE) + new value. Calls EFI SetVariable(). Returns EIO on firmware error, ENOSPC if NVRAM is full. |
| creat() | Creates a new EFI variable. File name must follow the {Name}-{GUID} convention. Calls EFI SetVariable() with the new name. |
| unlink() | Deletes the EFI variable by calling SetVariable() with DataSize=0. Returns EPERM if the variable is immutable. |
| readdir() | Enumerates all EFI variables via GetNextVariableName(). Results are cached in memory after first enumeration; invalidated on any write. |

14.18.7.5 Immutable Variable Protection

Certain variables are critical to system boot and must not be accidentally deleted:

/// Variables marked immutable (FS_IMMUTABLE_FL) by the kernel.
/// Users cannot unlink or write to these without first clearing
/// the immutable flag (requires CAP_LINUX_IMMUTABLE).
const IMMUTABLE_VARS: &[&str] = &[
    "SecureBoot",
    "SetupMode",
    "PK",        // Platform Key
    "KEK",       // Key Exchange Key
    "AuditMode",
    "DeployedMode",
];

The kernel sets FS_IMMUTABLE_FL on these files at mount time. Modifying them requires chattr -i first (which requires CAP_LINUX_IMMUTABLE), providing a two-step safeguard against accidental firmware corruption.

14.18.7.6 NVRAM Wear Protection

EFI NVRAM has limited write endurance (typically 100K-1M cycles per flash block). The kernel rate-limits writes to prevent userspace from wearing out NVRAM:

/// NVRAM write rate limiter. Shared across all efivarfs writes.
/// Uses a token bucket algorithm: one token per write, refilled at
/// `REFILL_RATE` tokens per second, maximum burst of `BUCKET_SIZE`.
pub struct EfiNvramRateLimiter {
    /// Current token count. Bounded gauge: range [0, NVRAM_BUCKET_SIZE].
    /// AtomicU32 is sufficient: max value is 64 (NVRAM_BUCKET_SIZE),
    /// never incremented beyond the bucket ceiling by the refill logic.
    pub tokens: AtomicU32,
    /// Last refill timestamp (nanoseconds, monotonic clock).
    pub last_refill_ns: AtomicU64,
}

/// Maximum burst writes before throttling.
const NVRAM_BUCKET_SIZE: u32 = 64;

/// Sustained write rate: 1 write per 100ms (10 writes/sec).
/// At this rate, a single 100K-cycle flash block written continuously
/// wears out in roughly 2.8 hours; firmware wear leveling across blocks
/// multiplies the effective lifetime. Practical workloads (boot manager
/// changes, key rotations) are many orders of magnitude below this rate.
const NVRAM_REFILL_INTERVAL_NS: u64 = 100_000_000;

When the bucket is empty, write() returns EBUSY. The caller (typically efibootmgr or mokutil) retries after a short delay.
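The token-bucket logic can be sketched standalone. This sketch folds refill and consume into a single try_acquire() (a hypothetical name; the spec only defines the struct above), with a CAS on the refill timestamp so that concurrent writers refill at most once per interval:

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

const NVRAM_BUCKET_SIZE: u32 = 64;
const NVRAM_REFILL_INTERVAL_NS: u64 = 100_000_000;

pub struct EfiNvramRateLimiter {
    tokens: AtomicU32,
    last_refill_ns: AtomicU64,
}

impl EfiNvramRateLimiter {
    pub fn new(now_ns: u64) -> Self {
        Self {
            tokens: AtomicU32::new(NVRAM_BUCKET_SIZE),
            last_refill_ns: AtomicU64::new(now_ns),
        }
    }

    /// Consume one token; `now_ns` comes from the monotonic clock.
    /// Returns false when the bucket is empty (write() maps this to EBUSY).
    pub fn try_acquire(&self, now_ns: u64) -> bool {
        // Refill: one token per elapsed 100ms interval, capped at the bucket size.
        let last = self.last_refill_ns.load(Ordering::Acquire);
        let refill = (now_ns.saturating_sub(last) / NVRAM_REFILL_INTERVAL_NS) as u32;
        if refill > 0
            && self
                .last_refill_ns
                .compare_exchange(last, now_ns, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
        {
            let _ = self.tokens.fetch_update(Ordering::AcqRel, Ordering::Acquire, |t| {
                Some(t.saturating_add(refill).min(NVRAM_BUCKET_SIZE))
            });
        }
        // Consume: decrement unless empty (None from checked_sub aborts the update).
        self.tokens
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |t| t.checked_sub(1))
            .is_ok()
    }
}
```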

14.18.8 Boot Initialization Order

All pseudo-filesystems register during Phase 5 (VFS initialization) of the boot sequence (Section 2.3). The ordering reflects dependency constraints:

Order Filesystem Dependency Registration Guard
1 debugfs VFS core initialized None
2 tracefs debugfs mounted (for legacy symlink) debugfs mount point exists
3 hugetlbfs Physical memory allocator + huge page pool (Section 4.2) Huge page pool initialized
4 bpffs eBPF verifier initialized (Section 19.2) BPF subsystem ready
5 securityfs LSM framework initialized (Section 9.8) At least one LSM registered
6 efivarfs EFI runtime services available (Section 2.20) efi.runtime_services != null (skipped on non-UEFI)

After registration, each filesystem is mounted at its standard mount point by the init process (PID 1). The kernel does not auto-mount pseudo-filesystems — mount commands come from userspace init (systemd .mount units or /etc/fstab). The exceptions are the root filesystem and devtmpfs, which the kernel mounts before init executes.

14.19 procfs and sysfs

procfs and sysfs are the two primary pseudo-filesystems through which the kernel exposes runtime state to userspace. procfs (/proc) is process-oriented: per-PID directories, global memory/CPU statistics, and writable sysctl tunables. sysfs (/sys) is device-oriented: it mirrors the kernel's device model as a directory hierarchy with one-value-per-file attributes. Both are required for glibc, systemd, udev, ps, top, htop, lscpu, lsblk, and virtually every Linux system management tool.

Both filesystems build on the common pseudo-filesystem registration framework defined in Section 14.18.


14.19.1 procfs — Process Information Filesystem

Mount point: /proc (type proc) Magic: 0x9fa0 (PROC_SUPER_MAGIC) Flags: NODEV | NOEXEC | NOSUID | USERNS_MOUNT

procfs is the kernel's primary process-to-userspace information channel. It is mounted automatically during early init (before PID 1 executes) and is required for correct glibc operation (/proc/self/), systemd (cgroup discovery, mount enumeration, process introspection), and standard POSIX process tools.

14.19.1.1 procfs Registration

static PROCFS_TYPE: PseudoFsType = PseudoFsType {
    name: "proc",
    fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID
        | FsFlags::USERNS_MOUNT,
    magic: 0x9fa0,
    fill_super: proc_fill_super,
    mount_opts: &[
        MountOptDesc { name: "hidepid", kind: MountOptKind::U32 },
        MountOptDesc { name: "gid",     kind: MountOptKind::U32 },
        MountOptDesc { name: "subset",  kind: MountOptKind::Str },
    ],
};

Mount options:

Option Values Default Description
hidepid 0, 1, 2, 4 0 Process visibility. 0 = world-readable. 1 = hide cmdline/status for other users' PIDs. 2 = invisible /proc/PID/ for non-owned processes. 4 = same as 2 plus hide thread IDs.
gid GID 0 GID that bypasses hidepid restrictions. Allows monitoring daemons (e.g., monit, Prometheus node_exporter) to see all processes without running as root.
subset pid none Mount only per-PID entries (no global files). Used for container /proc mounts that only need process information.

14.19.1.2 ProcEntry Trait

Every file or directory in procfs is backed by an implementation of ProcEntry. Subsystems register entries during init; per-PID entries are instantiated lazily on first lookup.

/// Trait for procfs file content generation.
///
/// Implementations are typically zero-sized types that read kernel state
/// on demand. No per-file heap allocation occurs — the `ProcEntry` is a
/// `&'static dyn` reference.
pub trait ProcEntry: Send + Sync {
    /// Read content into `buf` starting at byte `offset`.
    /// Returns the number of bytes written to `buf`.
    ///
    /// For fixed-format files (e.g., `/proc/PID/stat`), the implementation
    /// generates the entire content into an internal buffer on the first
    /// read (offset=0) and serves subsequent reads from that snapshot.
    /// This ensures a consistent view even if the process state changes
    /// between read() calls.
    fn read(&self, ctx: &ProcReadCtx, buf: &mut [u8], offset: u64) -> Result<usize, Errno>;

    /// Write data from `buf`. Returns bytes consumed.
    /// Most procfs files are read-only and return `EACCES`.
    fn write(&self, ctx: &ProcWriteCtx, buf: &[u8]) -> Result<usize, Errno> {
        Err(Errno::EACCES)
    }

    /// Poll for readability/writability/events. Returns the events that are
    /// currently ready. Default: not pollable (empty events).
    ///
    /// This is required for `/proc/sys/` files: programs that `poll()` on
    /// sysctl files (e.g., `poll(/proc/sys/vm/overcommit_memory, POLLPRI)`)
    /// need notification when the value changes. The sysctl implementation
    /// calls `proc_sys_notify()` on write, which wakes pollers.
    fn poll(&self, _ctx: &ProcReadCtx, _events: PollEvents) -> PollEvents {
        PollEvents::empty()
    }
}
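The snapshot-then-serve contract described for fixed-format files reduces to bounded slice copying once the snapshot exists. A minimal standalone sketch (helper name hypothetical):

```rust
/// Serve a read of up to `buf.len()` bytes at `offset` from a
/// pre-generated snapshot, as a fixed-format ProcEntry implementation
/// would after its first read() at offset 0. Returns bytes copied;
/// 0 signals EOF to the caller.
fn serve_from_snapshot(snapshot: &[u8], buf: &mut [u8], offset: u64) -> usize {
    // Clamp the offset so reads past EOF return 0 instead of panicking.
    let off = offset.min(snapshot.len() as u64) as usize;
    let n = (snapshot.len() - off).min(buf.len());
    buf[..n].copy_from_slice(&snapshot[off..off + n]);
    n
}
```

Because every read() after the first serves the same snapshot, a process that reads /proc/PID/stat in 4-byte chunks sees one consistent view even if the task's counters change mid-sequence.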

/// Notify pollers of a procfs sysctl entry that the value has changed.
/// Called by the sysctl write path after updating the kernel parameter.
/// Wakes all processes polling the file with `POLLPRI`.
fn proc_sys_notify(entry: &dyn ProcEntry) {
    // Implementation: the sysctl ProcEntry stores a WaitQueue. poll()
    // registers the caller on this WaitQueue. proc_sys_notify() does
    // wake_up_all(&entry.waitqueue, POLLPRI). Specific sysctl entries
    // that support notification override poll() to register on the waitqueue.
}

/// Context passed to ProcEntry::read().
pub struct ProcReadCtx<'a> {
    /// The PID this entry belongs to (None for global entries).
    pub pid: Option<Pid>,
    /// Credentials of the reading process (for permission checks).
    /// RCU-protected reference: credentials can change during a task's
    /// lifetime (setuid, capset), so a static reference would be unsound.
    pub cred: RcuRef<'a, Credentials>,
}

/// Context passed to ProcEntry::write().
pub struct ProcWriteCtx<'a> {
    /// The PID this entry belongs to (None for global entries).
    pub pid: Option<Pid>,
    /// Credentials of the writing process.
    /// RCU-protected reference — see `ProcReadCtx::cred` for rationale.
    pub cred: RcuRef<'a, Credentials>,
}

14.19.1.3 SeqFile Protocol

Large procfs files that enumerate variable-length records (e.g., /proc/net/tcp, /proc/mounts) use the SeqFile protocol to handle partial reads across multiple read() syscalls without missing or duplicating entries.

/// Sequential file generator. Each call to `show()` produces one logical
/// record. The SeqFile infrastructure handles buffering, partial reads,
/// and seek.
pub trait SeqFileOps: Send + Sync {
    /// Position the iterator at the beginning.
    /// `pos` is the logical position (0 on first read, or a saved position
    /// from a previous lseek). Returns an opaque iterator state, or None
    /// if `pos` is beyond the last record.
    fn start(&self, ctx: &ProcReadCtx, pos: u64) -> Option<SeqIterState>;

    /// Emit one record into `buf`. Returns bytes written.
    /// The SeqFile core calls `show()` repeatedly, appending output to an
    /// internal page-sized buffer. When the buffer is full, it is returned
    /// to the `read()` caller and the remaining records are served on the
    /// next `read()`.
    fn show(&self, state: &SeqIterState, buf: &mut [u8]) -> Result<usize, Errno>;

    /// Advance to the next record. Returns the updated iterator state,
    /// or None if iteration is complete.
    fn next(&self, state: SeqIterState) -> Option<SeqIterState>;

    /// Release any resources held by the iterator.
    /// Called when the file is closed or the read sequence is abandoned.
    fn stop(&self, state: SeqIterState);
}

/// Opaque iterator state for SeqFile traversal.
/// Subsystems typically store an index, a pointer to the current object,
/// and an RCU read-lock guard or reference count.
pub struct SeqIterState {
    /// Logical position counter (incremented by `next()`).
    /// Stored across read() calls for correct resume-after-partial-read.
    pub index: u64,
    /// Bytes emitted so far in the current read() call.
    pub count: usize,
    /// Subsystem-private state (cast from a concrete type).
    pub private: u64,
}
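The core buffering behavior — emit only whole records, resume at the saved position — can be sketched without the trait machinery. Here the record source is a plain slice and the helper name is illustrative:

```rust
/// Minimal stand-in for the SeqFile read loop: fill `buf` with as many
/// complete records as fit, starting at logical position `pos`.
/// Returns (bytes_written, next_pos); the caller saves next_pos for
/// the following read(). A record larger than the whole buffer is left
/// to the real implementation's buffer-growth path.
fn seq_read(records: &[&str], pos: usize, buf: &mut [u8]) -> (usize, usize) {
    let mut written = 0;
    let mut idx = pos;
    while idx < records.len() {
        let rec = records[idx].as_bytes();
        if written + rec.len() > buf.len() {
            break; // record doesn't fit: serve it on the next read()
        }
        buf[written..written + rec.len()].copy_from_slice(rec);
        written += rec.len();
        idx += 1;
    }
    (written, idx)
}
```

Because the boundary always falls between records, a reader issuing small read() calls never sees a record split, duplicated, or skipped.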

14.19.1.4 procfs Registration API

/// Register a global procfs entry (e.g., `/proc/meminfo`).
///
/// `name`: entry name (e.g., "meminfo").
/// `mode`: POSIX permission bits (e.g., 0o444 for read-only).
/// `parent`: parent directory (None = procfs root).
/// `ops`: implementation providing read/write behavior.
///
/// Returns a handle used for removal (live kernel evolution only;
/// built-in entries are never removed).
///
/// # Errors
/// - `EEXIST`: name already registered under parent.
/// - `ENOMEM`: allocation failure.
pub fn proc_create(
    name: &'static str,
    mode: u16,
    parent: Option<&ProcDir>,
    ops: &'static dyn ProcEntry,
) -> Result<ProcHandle, VfsError>;

/// Register a global procfs entry backed by SeqFile.
/// Convenience wrapper: creates a ProcEntry that delegates to SeqFileOps.
pub fn proc_create_seq(
    name: &'static str,
    mode: u16,
    parent: Option<&ProcDir>,
    ops: &'static dyn SeqFileOps,
) -> Result<ProcHandle, VfsError>;

/// Create a subdirectory under the procfs root (e.g., `/proc/net/`).
pub fn proc_mkdir(
    name: &'static str,
    parent: Option<&ProcDir>,
) -> Result<ProcDir, VfsError>;

/// Opaque handle to a registered procfs directory.
pub struct ProcDir {
    dentry: Arc<Dentry>,
}

/// Opaque handle to a registered procfs entry (file or directory).
/// Dropping this handle does NOT remove the entry — entries persist
/// for the kernel's lifetime. Explicit removal via `proc_remove()`
/// is for live kernel evolution only.
pub struct ProcHandle {
    dentry: Arc<Dentry>,
}

/// Remove a previously registered procfs entry.
pub fn proc_remove(handle: ProcHandle);

14.19.1.5 Per-PID Directory Lifecycle

A /proc/PID/ directory is created when a task is allocated (Section 8.1) and removed when the task is reaped (after wait() collects the zombie). The directory is lazily populated — subdirectory dentries are instantiated on first lookup, not at task creation time. This avoids per-fork allocation overhead for the ~20 entries per PID. The lookup callback (pid_dir_lookup()) resolves the PID to a Task, creates a dentry backed by a ProcInode, and populates on-demand. No dentries exist for PIDs that have not been accessed.

For thread-group leaders, /proc/PID/task/ contains subdirectories for each thread in the group. Each thread subdirectory mirrors the per-PID layout.

14.19.1.6 Mandatory /proc/PID/ Entries

These entries are required for glibc, systemd, ps, top, htop, and container runtimes. Format descriptions are normative — field order, separator characters, and field types must match Linux exactly for binary compatibility.

Entry Mode Format Description
status 0o444 Key:\tValue\n lines Human-readable status. Fields: Name, Umask, State, Tgid, Ngid, Pid, PPid, TracerPid, Uid (4 fields), Gid (4 fields), FDSize, Groups, NStgid, NSpid, NSpgid, NSsid, Kthread, VmPeak, VmSize, VmLck, VmPin, VmHWM, VmRSS, RssAnon, RssFile, RssShmem, VmData, VmStk, VmExe, VmLib, VmPTE, VmSwap, HugetlbPages, CoreDumping, THP_enabled, Threads, SigQ, SigPnd, ShdPnd, SigBlk, SigIgn, SigCgt, CapInh, CapPrm, CapEff, CapBnd, CapAmb, NoNewPrivs, Seccomp, Speculation_Store_Bypass, SpeculationIndirectBranch, Cpus_allowed, Cpus_allowed_list, Mems_allowed, Mems_allowed_list, voluntary_ctxt_switches, nonvoluntary_ctxt_switches.
stat 0o444 Single line, space-separated 52 fields: pid (comm) state ppid pgrp session tty_nr tpgid flags minflt cminflt majflt cmajflt utime stime cutime cstime priority nice num_threads itrealvalue starttime vsize rss rsslim startcode endcode startstack kstkesp kstkeip signal blocked sigignore sigcatch wchan nswap cnswap exit_signal processor rt_priority policy delayacct_blkio_ticks guest_time cguest_time start_data end_data start_brk arg_start arg_end env_start env_end exit_code. Matches Linux fs/proc/array.c do_task_stat() — field 52 is exit_code. Note: core_dumping and thp_enabled are in /proc/PID/status (Key:Value format), not stat.
statm 0o444 Single line, 7 space-separated page counts size resident shared text lib data dt
cmdline 0o444 NUL-separated argv bytes Empty for zombie/kernel threads.
environ 0o400 NUL-separated envp bytes Requires ptrace_may_access() or same UID. Returns empty for kernel threads.
maps 0o444 One line per VMA start-end perms offset dev inode pathname. Hex addresses, rwxp perms.
smaps 0o444 Multi-line per VMA Same header as maps, followed by Size, KernelPageSize, MMUPageSize, Rss, Pss, Pss_Dirty, Shared_Clean, Shared_Dirty, Private_Clean, Private_Dirty, Referenced, Anonymous, LazyFree, AnonHugePages, ShmemPmdMapped, FilePmdMapped, Shared_Hugetlb, Private_Hugetlb, Swap, SwapPss, Locked, THPeligible, VmFlags lines.
fd/ 0o500 Directory of symlinks Each entry is a decimal fd number; readlink returns the path of the open file.
fdinfo/ 0o500 One file per fd Lines: pos:, flags:, mnt_id:. Additional fields for epoll, eventfd, inotify, fanotify, timerfd, signalfd fds.
cgroup 0o444 hierarchy-ID:controller-list:cgroup-path per line For cgroups v2: single line 0::/path.
mountinfo 0o444 SeqFile, one line per mount Fields: mount_id, parent_id, major:minor, root, mount_point, mount_options, optional_fields, separator(-), fs_type, mount_source, super_options. See Section 14.6.
ns/ 0o500 Directory of symlinks Entries: cgroup, ipc, mnt, net, pid, pid_for_children, time, time_for_children, user, uts, ima. Readlink returns <type>:[<inode>]. The ima entry links to the task's IMA namespace; IMA measurement log output (e.g., /sys/kernel/security/ima/ascii_runtime_measurements) is scoped to the reading task's IMA namespace — a process in container namespace A sees only namespace A's measurement log entries. See Section 17.1, Section 9.5.
oom_score 0o444 Single integer (0-2000) OOM badness score. See Section 4.5.
oom_score_adj 0o644 Single integer (-1000 to 1000) OOM adjustment. -1000 = never kill. Writing requires CAP_SYS_RESOURCE for values < 0.
limits 0o444 Table format Columns: Limit, Soft Limit, Hard Limit, Units. Rows: Max cpu time, Max file size, Max data size, Max stack size, Max core file size, Max resident set, Max processes, Max open files, Max locked memory, Max address space, Max file locks, Max pending signals, Max msgqueue size, Max nice priority, Max realtime priority, Max realtime timeout. See Section 8.7.
io 0o400 key: value lines Fields: rchar, wchar, syscr, syscw, read_bytes, write_bytes, cancelled_write_bytes. Requires ptrace_may_access() or same UID.
task/ 0o555 Directory of per-thread subdirectories Each subdirectory is named by TID and contains the same entries as the parent /proc/PID/ (stat, status, maps, etc.) scoped to that thread.
exe Symlink Points to the executable file. If the binary has been unlinked, readlink succeeds and the link text reads <path> (deleted); opening through the symlink still reaches the original inode. Returns ENOENT for kernel threads, which have no executable.
root Symlink Points to the process's root directory (as set by chroot()).
cwd Symlink Points to the process's current working directory.
comm 0o644 Single line, max 16 bytes Executable name (truncated to TASK_COMM_LEN). Writable: echo newname > /proc/PID/comm sets the task comm (requires same UID or CAP_SYS_PTRACE).
wchan 0o444 Single symbol name Wait channel: the kernel function the task is sleeping in, or "0" if running.
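The parenthesized comm field in stat exists so that consumers can split the line on the last ')' — comm itself may contain spaces and parentheses. A sketch of the consumer-side parse for the first three fields (helper name hypothetical):

```rust
/// Parse the leading `pid (comm) state` fields of a /proc/PID/stat line.
/// Splitting on the LAST ')' is the only correct strategy: comm is
/// attacker-controlled (via /proc/PID/comm) and may contain ')' itself.
fn parse_stat_head(line: &str) -> Option<(i32, &str, char)> {
    let open = line.find('(')?;
    let close = line.rfind(')')?;
    let pid: i32 = line[..open].trim().parse().ok()?;
    let comm = &line[open + 1..close];
    let state = line[close + 1..].trim_start().chars().next()?;
    Some((pid, comm, state))
}
```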

14.19.1.7 Mandatory Global /proc Entries

These entries are required for system monitoring tools, container runtimes, and standard POSIX utilities.

Entry Mode Format Primary Consumers
meminfo 0o444 Key: value kB lines free, top, htop, systemd, Prometheus node_exporter. Fields: MemTotal, MemFree, MemAvailable, Buffers, Cached, SwapCached, Active, Inactive, Active(anon), Inactive(anon), Active(file), Inactive(file), Unevictable, Mlocked, SwapTotal, SwapFree, Zswap, Zswapped, Dirty, Writeback, AnonPages, Mapped, Shmem, KReclaimable, Slab, SReclaimable, SUnreclaim, KernelStack, PageTables, SecPageTables, NFS_Unstable, Bounce, WritebackTmp, CommitLimit, Committed_AS, VmallocTotal, VmallocUsed, VmallocChunk, Percpu, HardwareCorrupted, AnonHugePages, ShmemHugePages, ShmemPmdMapped, FileHugePages, FilePmdMapped, CmaTotal, CmaFree, HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb, DirectMap4k, DirectMap2M, DirectMap1G.
cpuinfo 0o444 Per-arch key-value blocks lscpu, nproc, /proc/cpuinfo parsers. Per-CPU block separated by blank line. Format is architecture-specific (x86: processor/vendor_id/model name/flags; ARM: processor/BogoMIPS/Features; RISC-V: hart/isa/mmu).
stat 0o444 Multi-line top, mpstat, vmstat. Lines: cpu (aggregate), cpu0..cpuN (per-CPU: user nice system idle iowait irq softirq steal guest guest_nice), intr (per-IRQ counts), ctxt (context switches), btime (boot time epoch), processes (forks since boot), procs_running, procs_blocked, softirq (per-softirq counts).
loadavg 0o444 Single line uptime, w, shell prompts. Format: 1min 5min 15min running/total last_pid.
uptime 0o444 Single line uptime. Format: seconds_since_boot idle_seconds (both with centisecond precision).
version 0o444 Single line uname, build identification. Format: UmkaOS version <version> (<build>) (<compiler>) #<build_number> <config> <date>.
filesystems 0o444 One per line mount auto-detection. Format: [nodev]\tfstype. nodev prefix for pseudo-filesystems that have no backing device.
self Symlink Points to /proc/[getpid()]. Resolved per-access to the calling task's TGID. Required by glibc (/proc/self/exe, /proc/self/fd/).
thread-self Symlink Points to /proc/[getpid()]/task/[gettid()]. Resolved per-access to the calling thread's TID. Required for per-thread procfs access (e.g., /proc/thread-self/attr/current for SELinux).
mounts Symlink Points to self/mounts.
partitions 0o444 Table lsblk, fdisk. Columns: major, minor, #blocks, name.
diskstats 0o444 One line per device iostat, sar. 20 fields per line: major minor name reads_completed reads_merged sectors_read ms_reading writes_completed writes_merged sectors_written ms_writing ios_in_progress ms_io weighted_ms_io discards_completed discards_merged sectors_discarded ms_discarding flush_count ms_flushing.
net/ 0o555 Directory Network pseudo-files (per-net-namespace). Entries: dev (interface stats), tcp (TCP sockets), tcp6, udp, udp6, unix (UNIX sockets), route (IPv4 routing), ipv6_route, if_inet6, arp, snmp, snmp6, netstat, sockstat, sockstat6, raw, raw6, packet, protocols, wireless. Each file uses SeqFile.
sys/ 0o555 Directory tree Sysctl interface. Writable tunables organized as kernel/, vm/, fs/, net/, dev/. Each leaf file reads/writes one value. See Section 20.9 for the sysctl registration framework.
interrupts 0o444 Table Per-CPU IRQ counts. Columns: IRQ number, per-CPU counts, IRQ chip name, hardware IRQ, action name.
softirqs 0o444 Table Per-CPU softirq counts. One row per softirq type (HI, TIMER, NET_TX, NET_RX, BLOCK, IRQ_POLL, TASKLET, SCHED, HRTIMER, RCU).
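Fixed single-line formats such as loadavg are generated verbatim; a sketch of the formatter follows (the helper name is hypothetical, and real kernel code would use fixed-point load accounting rather than f64):

```rust
/// Render /proc/loadavg: "1min 5min 15min running/total last_pid".
/// Illustrative only — in-kernel code computes the averages in
/// fixed point and formats into a caller-supplied buffer.
fn format_loadavg(load: [f64; 3], running: u32, total: u32, last_pid: u32) -> String {
    format!(
        "{:.2} {:.2} {:.2} {}/{} {}\n",
        load[0], load[1], load[2], running, total, last_pid
    )
}
```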

14.19.1.8 procfs Namespace Awareness

Each PID namespace (Section 17.1) has its own view of /proc: processes in a child PID namespace see only PIDs visible within that namespace. When procfs is mounted inside a container, fill_super binds the superblock to the caller's PID namespace. /proc/1/ inside the container refers to the container's init process, not the host PID 1.

14.19.1.9 procfs Permission Model

hidepid Effect
0 All /proc/PID/ directories visible to all users (Linux default)
1 /proc/PID/{cmdline,sched,status} restricted to owner; directory is visible
2 /proc/PID/ directory invisible to non-owners (opendir returns ENOENT)
4 Same as 2, plus thread IDs hidden from /proc/PID/task/

The gid mount option exempts a group from hidepid restrictions. Processes whose supplementary groups include the specified GID see all PIDs regardless of hidepid. This allows monitoring tools to run without CAP_SYS_PTRACE.
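The visibility decision combines the hidepid level, ownership, and the gid exemption. A condensed sketch — names and signature are illustrative, and the real check would also consult capabilities and ptrace_may_access():

```rust
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum HidePid {
    Off,                // hidepid=0
    Restrict,           // hidepid=1: directory visible, some files restricted
    Invisible,          // hidepid=2
    InvisibleNoThreads, // hidepid=4
}

/// Can `reader` see the /proc/PID/ directory owned by `owner_uid`?
fn pid_dir_visible(
    hidepid: HidePid,
    reader_uid: u32,
    reader_groups: &[u32],
    exempt_gid: Option<u32>, // the gid= mount option
    owner_uid: u32,
) -> bool {
    if reader_uid == 0 || reader_uid == owner_uid {
        return true;
    }
    if let Some(gid) = exempt_gid {
        if reader_groups.contains(&gid) {
            return true; // monitoring group bypasses hidepid
        }
    }
    // hidepid=1 restricts file contents, not directory visibility.
    !matches!(hidepid, HidePid::Invisible | HidePid::InvisibleNoThreads)
}
```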


14.19.2 sysfs — Device Model Filesystem

Mount point: /sys (type sysfs) Magic: 0x62656572 (SYSFS_MAGIC) Flags: NODEV | NOEXEC | NOSUID

sysfs mirrors the kernel's device model hierarchy as a directory tree. Every registered bus, device, driver, and class gets a directory. Attributes are files that expose exactly one value each (the "sysfs one-value-per-file rule"). This filesystem is required for udev device discovery, systemd device management, and all /sys-reading tools (lspci, lsusb, lsblk, ip link, etc.).

14.19.2.1 sysfs Registration

static SYSFS_TYPE: PseudoFsType = PseudoFsType {
    name: "sysfs",
    fs_flags: FsFlags::NODEV | FsFlags::NOEXEC | FsFlags::NOSUID,
    magic: 0x62656572,
    fill_super: sysfs_fill_super,
    mount_opts: &[],
};

14.19.2.2 Kobject Model

Every kernel object that participates in sysfs inherits from Kobject. A kobject represents one directory in the sysfs tree. The parent-child relationship between kobjects defines the directory hierarchy.

/// Kernel object — the unit of representation in sysfs.
///
/// Every kobject corresponds to exactly one directory under `/sys`.
/// Kobjects form a tree: each kobject has at most one parent.
/// The root kobjects (no parent) appear directly under `/sys`.
/// **Cycle detection**: `sysfs_create_kobject()` walks the parent chain
/// (max depth 32) to verify the new parent is not a descendant of the
/// new kobject. If a cycle is detected, creation fails with `ELOOP` and
/// an FMA warning identifies the offending driver.
pub struct Kobject {
    /// Name of this object (= directory name in sysfs).
    pub name: Box<[u8]>,
    /// Parent kobject. None for top-level directories.
    pub parent: Option<Arc<Kobject>>,
    /// Attribute groups attached to this kobject.
    /// Each group's attributes appear as files in this directory.
    pub attr_groups: ArrayVec<&'static SysfsGroup, 8>,
    /// Kset membership (if any). A kset is a collection of related
    /// kobjects that share a uevent domain.
    pub kset: Option<Arc<Kset>>,
    /// Reference count. Kobject is freed when refcount reaches zero.
    pub refcount: AtomicU64,
    /// Uevent state: whether this kobject has been announced to userspace.
    pub uevent_sent: AtomicBool,
}

/// A kset groups related kobjects and provides the uevent emission
/// context. Bus, class, and device collections are ksets.
pub struct Kset {
    /// The kset's own kobject (its sysfs directory).
    pub kobj: Kobject,
    /// Uevent filter: if set, called before emitting uevents for
    /// member kobjects. Returns false to suppress the event.
    pub uevent_filter: Option<fn(&Kobject) -> bool>,
}
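The bounded parent-chain walk mentioned for sysfs_create_kobject() can be sketched with reference-counted stand-in types (all names here are hypothetical):

```rust
use std::rc::Rc;

struct Kobj {
    parent: Option<Rc<Kobj>>,
}

/// Bounded parent-chain walk (32 hops in the spec): reject attachment
/// if `forbidden` — the kobject being inserted — already appears among
/// `new_parent`'s ancestors, or if the chain exceeds the depth budget.
/// Either failure maps to ELOOP in the real creation path.
fn check_ancestry(
    new_parent: &Rc<Kobj>,
    forbidden: *const Kobj,
    max_depth: usize,
) -> Result<(), ()> {
    let mut cur = Some(new_parent.clone());
    for _ in 0..=max_depth {
        match cur {
            None => return Ok(()), // reached a root kobject: no cycle
            Some(node) => {
                if Rc::as_ptr(&node) == forbidden {
                    return Err(()); // attaching here would close a cycle
                }
                cur = node.parent.clone();
            }
        }
    }
    Err(()) // chain longer than the depth budget: treat as ELOOP
}
```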

14.19.2.3 SysfsAttribute Trait

/// Trait for sysfs file attributes.
///
/// Each attribute is a single file in a kobject's directory.
/// Implementations MUST follow the one-value-per-file rule:
/// `show()` returns exactly one scalar, string, or enumeration value.
/// `store()` parses exactly one value.
pub trait SysfsAttribute: Send + Sync {
    /// Attribute file name.
    fn name(&self) -> &'static str;

    /// POSIX permission bits (e.g., 0o444 read-only, 0o644 read-write).
    fn mode(&self) -> u16;

    /// Read the attribute value into `buf`. Returns bytes written.
    /// The output MUST end with a newline (`\n`) for shell compatibility.
    /// Maximum output: PAGE_SIZE - 1 bytes (4095 with 4 KiB pages).
    fn show(&self, kobj: &Kobject, buf: &mut [u8]) -> Result<usize, Errno>;

    /// Write a new value from `buf`. Returns bytes consumed.
    /// Returns `EACCES` for read-only attributes (mode without write bits).
    /// Returns `EINVAL` if the value cannot be parsed.
    fn store(&self, kobj: &Kobject, buf: &[u8]) -> Result<usize, Errno> {
        Err(Errno::EACCES)
    }
}

/// A named collection of attributes applied to a kobject together.
///
/// Groups provide atomic attachment: all attributes in a group are
/// created or destroyed as a unit. A kobject may have multiple groups;
/// each group can optionally create a subdirectory.
pub struct SysfsGroup {
    /// Subdirectory name. If Some, attributes appear in a named
    /// subdirectory of the kobject's directory. If None, attributes
    /// appear directly in the kobject's directory.
    pub name: Option<&'static str>,
    /// Attributes in this group.
    pub attrs: &'static [&'static dyn SysfsAttribute],
    /// Visibility filter. Called once per attribute at group creation
    /// time. Returns the effective mode (0 = skip this attribute).
    /// Allows conditional attributes based on hardware capabilities.
    pub is_visible: Option<fn(&Kobject, &dyn SysfsAttribute) -> u16>,
}
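A concrete attribute following the one-value-per-file rule might look as follows. The stand-in Kobject and Errno types make the sketch self-contained, the attribute itself is hypothetical, and format! stands in for in-kernel code that would write directly into the caller's buffer:

```rust
// Minimal stand-ins so the example compiles outside the kernel tree.
struct Kobject;
#[derive(Debug, PartialEq)]
enum Errno {
    EACCES,
    EINVAL,
}

trait SysfsAttribute {
    fn name(&self) -> &'static str;
    fn mode(&self) -> u16;
    fn show(&self, kobj: &Kobject, buf: &mut [u8]) -> Result<usize, Errno>;
    fn store(&self, _kobj: &Kobject, _buf: &[u8]) -> Result<usize, Errno> {
        Err(Errno::EACCES)
    }
}

/// Hypothetical read-only attribute exposing a single u64.
struct NrRequestsAttr {
    value: u64,
}

impl SysfsAttribute for NrRequestsAttr {
    fn name(&self) -> &'static str { "nr_requests" }
    fn mode(&self) -> u16 { 0o444 }
    fn show(&self, _kobj: &Kobject, buf: &mut [u8]) -> Result<usize, Errno> {
        // Exactly one value, newline-terminated.
        let s = format!("{}\n", self.value);
        let bytes = s.as_bytes();
        if bytes.len() > buf.len() {
            return Err(Errno::EINVAL); // would exceed the page-size cap
        }
        buf[..bytes.len()].copy_from_slice(bytes);
        Ok(bytes.len())
    }
    // store() inherits the default: EACCES for this read-only attribute.
}
```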

14.19.2.4 sysfs Kernel API

/// Create a kobject directory in sysfs and attach its attribute groups.
///
/// The directory is created under the parent's directory (or the sysfs
/// root if parent is None). All attribute groups are instantiated
/// atomically: if any attribute creation fails, the entire directory
/// is rolled back.
///
/// Typically called indirectly through `device_register()`, `bus_register()`,
/// or `class_register()` rather than directly by subsystem code.
pub fn sysfs_create_kobject(kobj: &Arc<Kobject>) -> Result<(), VfsError>;

/// Attach an additional attribute group to an existing kobject.
/// Used when subsystems add attributes after initial registration
/// (e.g., driver-specific attributes added at probe time).
pub fn sysfs_create_group(
    kobj: &Arc<Kobject>,
    group: &'static SysfsGroup,
) -> Result<(), VfsError>;

/// Remove an attribute group from a kobject.
pub fn sysfs_remove_group(
    kobj: &Arc<Kobject>,
    group: &'static SysfsGroup,
);

/// Create a symbolic link in sysfs. Used for cross-references:
/// e.g., `/sys/class/net/eth0/device` → `/sys/devices/pci0000:00/...`.
///
/// `kobj`: the directory where the symlink is created.
/// `target`: the kobject the symlink points to.
/// `name`: the symlink file name.
pub fn sysfs_create_link(
    kobj: &Arc<Kobject>,
    target: &Arc<Kobject>,
    name: &'static str,
) -> Result<(), VfsError>;

/// Remove a symbolic link.
pub fn sysfs_remove_link(kobj: &Arc<Kobject>, name: &'static str);

/// Notify userspace that an attribute value has changed.
/// Wakes any process blocked in `poll()` / `select()` on the
/// attribute file. Used by the thermal subsystem, battery driver,
/// and power management to notify udev/systemd of state changes.
pub fn sysfs_notify(kobj: &Arc<Kobject>, attr_name: &'static str);

/// Remove a kobject and all its attribute groups from sysfs.
/// Called during device removal or driver unbind.
pub fn sysfs_remove_kobject(kobj: &Arc<Kobject>);

14.19.2.5 Mandatory /sys Hierarchies

These top-level directories are required for udev, systemd, and standard Linux device management tools.

14.19.2.5.1 /sys/devices/ — Physical Device Tree

The canonical device hierarchy. Every device registered via Section 11.4 appears here, organized by physical topology: system/cpu/, pci0000:00/, platform/, virtual/. Each device directory contains:

  • Standard attributes: uevent, power/ (runtime PM state), driver (symlink), subsystem (symlink to bus/class).
  • Device-specific attributes added by the device driver at probe time.
  • Child device directories for hierarchical devices (e.g., PCI bridge children).
14.19.2.5.2 /sys/bus/ — Bus Type Directories

One directory per registered bus type (pci, usb, platform, i2c, spi, etc.). Each bus directory contains:

Subdirectory Content
devices/ Symlinks to /sys/devices/... for every device on this bus.
drivers/ One subdirectory per registered driver. Each driver directory contains bind (write device name to force-bind), unbind (write device name to force-unbind), and new_id (write vendor:device to add dynamic ID).
drivers_probe Write a device name to trigger re-probe.
drivers_autoprobe 1 = auto-probe new devices (default), 0 = manual binding only.
14.19.2.5.3 /sys/class/ — Device Class Directories

One directory per device class. Each class directory contains symlinks to the device directories in /sys/devices/. Classes group devices by function rather than bus topology:

Class Content Primary Tool
net/ Network interfaces (eth0, wlan0, lo) ip link, NetworkManager
block/ Block devices (sda, nvme0n1, dm-0) lsblk
tty/ Terminal devices (ttyS0, ttyUSB0, pts/0) stty, getty
input/ Input devices (event0, mice) libinput, evtest
hwmon/ Hardware monitoring (temperatures, fans, voltages) sensors, lm-sensors
thermal/ Thermal zones and cooling devices thermald
power_supply/ Battery and AC adapter status upower
backlight/ Display backlight xrandr, brightnessctl
leds/ LED devices ledctl
sound/ ALSA sound devices aplay, alsamixer
drm/ DRM/KMS display devices modetest, Xorg
misc/ Miscellaneous devices Various
14.19.2.5.4 /sys/block/ — Block Devices

Symlinks into /sys/devices/ for all block devices. Each block device directory contains queue/ (scheduler, nr_requests, read_ahead_kb), stat (I/O counters), size (sectors), and partition subdirectories.

14.19.2.5.5 /sys/fs/ — Filesystem-Specific Controls
Directory Content
cgroup/ cgroup filesystem mount controls (Section 17.2)
ext4/ Per-mount ext4 tuning (when ext4 driver is active)
fuse/ FUSE connection controls (Section 14.11)
selinux/ SELinux policy interface (duplicate of securityfs for compat)
14.19.2.5.6 /sys/kernel/ — Kernel Parameters
Directory Content
mm/ Memory management controls (transparent_hugepage/, hugepages/, ksm/)
debug/ debugfs mount point (when debugfs is mounted here)
security/ securityfs mount point. security/ima/ascii_runtime_measurements output is scoped to the reading task's IMA namespace (Section 9.5).
tracing/ tracefs mount point
irq/ Default IRQ affinity
uevent_seqnum Monotonic uevent sequence number (u64, for udev ordering)
kexec_loaded Always 0 in UmkaOS. Live kernel evolution (Section 13.18) replaces kexec; the traditional kexec path is not implemented. Retained for compatibility with tools that read this file.
kexec_crash_loaded Always 0 in UmkaOS. Crash recovery uses the live evolution infrastructure, not kexec-based kdump. Retained for compatibility.
vmcoreinfo Crash dump layout information
14.19.2.5.7 /sys/module/ — Loaded Modules and Parameters

One directory per loaded module (including built-in modules that have parameters). Each module directory contains:

Entry Content
parameters/ One file per module parameter. Read returns current value; write changes it (if the parameter is writable).
refcnt Module reference count.
coresize Size of the module's core section in bytes.
initsize Size of the module's init section (0 after init completes).
holders/ Symlinks to modules that depend on this one.
14.19.2.5.8 /sys/power/ — Power Management
Entry Mode Description
state 0o644 Write mem, disk, freeze to trigger system suspend. Read returns available states. See Section 7.5.
wakeup_count 0o644 Wakeup event counter for suspend synchronization.
mem_sleep 0o644 Preferred suspend-to-RAM variant (s2idle, shallow, deep).
disk 0o644 Hibernate method (platform, shutdown, reboot, suspend).
pm_async 0o644 1 = async device suspend/resume (default), 0 = sequential.
image_size 0o644 Maximum hibernate image size in bytes.
resume 0o200 Write MAJOR:MINOR to set the resume device for hibernation.

14.19.2.6 uevent Mechanism

Kobject state changes generate uevents that are delivered to userspace via two channels: a netlink multicast socket (NETLINK_KOBJECT_UEVENT, group 1) and the /sys/*/uevent file.

/// Uevent actions. Each action generates a NETLINK_KOBJECT_UEVENT
/// message and updates the kobject's `uevent` sysfs file.
#[derive(Clone, Copy, Debug)]
pub enum KobjAction {
    /// Device/kobject added. udev creates /dev nodes and runs rules.
    Add,
    /// Device/kobject removed. udev removes /dev nodes.
    Remove,
    /// Device state changed (e.g., firmware loaded, link state changed).
    Change,
    /// Kobject moved to a different parent (renamed).
    Move,
    /// Device brought online (e.g., CPU, memory block).
    Online,
    /// Device taken offline.
    Offline,
    /// Device binding to a driver.
    Bind,
    /// Device unbound from a driver.
    Unbind,
}

/// Environment variables included in every uevent message.
pub struct UeventEnv {
    /// Key-value pairs. Standard keys:
    /// - `ACTION`: "add", "remove", "change", "move", "online",
    ///   "offline", "bind", "unbind"
    /// - `DEVPATH`: sysfs path relative to /sys (e.g., "/devices/pci0000:00/...")
    /// - `SUBSYSTEM`: bus or class name (e.g., "pci", "net", "block")
    /// - `SEQNUM`: monotonic u64 sequence number (never wraps in 50+ years
    ///   at 10 billion events/sec)
    /// - `DEVTYPE`: device type within subsystem (e.g., "disk", "partition")
    /// - `DRIVER`: driver name (present for bind/unbind)
    /// - `MAJOR`, `MINOR`: device numbers (present for char/block devices)
    /// - `DEVNAME`: device name for /dev (e.g., "sda", "ttyS0")
    ///
    /// Additional subsystem-specific keys are appended by the device's
    /// `uevent()` callback.
    pub vars: ArrayVec<UeventVar, 32>,
}

/// Single uevent environment variable.
pub struct UeventVar {
    pub key: &'static str,
    pub value: ArrayVec<u8, 256>,
}

/// Emit a uevent for a kobject.
///
/// 1. Increment the global uevent sequence number (AtomicU64).
/// 2. Build the UeventEnv: standard keys + subsystem-specific keys
///    from the kobject's `uevent()` callback.
/// 3. Format the netlink message: `ACTION@DEVPATH\0KEY=VALUE\0...`
/// 4. Multicast via NETLINK_KOBJECT_UEVENT to all listeners (udevd).
/// 5. Write the formatted uevent to `/sys/<devpath>/uevent` for
///    manual re-trigger (`echo add > /sys/devices/.../uevent`).
pub fn kobject_uevent(kobj: &Arc<Kobject>, action: KobjAction) -> Result<(), Errno>;

/// Emit a uevent with additional environment variables.
/// Used when the subsystem needs to include extra key-value pairs
/// beyond what the standard `uevent()` callback provides.
pub fn kobject_uevent_env(
    kobj: &Arc<Kobject>,
    action: KobjAction,
    extra_env: &[UeventVar],
) -> Result<(), Errno>;

14.19.2.7 uevent Delivery Path

  1. Driver or subsystem calls kobject_uevent(kobj, action).
  2. Kernel increments global UEVENT_SEQNUM (AtomicU64, visible at /sys/kernel/uevent_seqnum).
  3. UeventEnv is built: ACTION, DEVPATH, SUBSYSTEM, SEQNUM, plus subsystem-specific variables from the kobject's uevent() method.
  4. Netlink multicast: the formatted message is sent to NETLINK_KOBJECT_UEVENT group 1. udevd receives the message, matches it against udev rules, and creates/removes /dev nodes, loads firmware, sets permissions, runs RUN commands, etc.
  5. The uevent file in the kobject's sysfs directory is updated. Writing an action name to this file re-triggers the uevent (e.g., echo change > /sys/devices/.../uevent forces udev to re-process the device).
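The message layout used in steps 3 and 4 can be sketched as a standalone helper. This is an illustrative formatter, not the kernel's actual implementation; it takes plain `(&str, &str)` pairs instead of `UeventVar` to stay self-contained.

```rust
/// Format a uevent netlink payload: `ACTION@DEVPATH\0KEY=VALUE\0...`
/// Hypothetical sketch of the wire format described above.
fn format_uevent(action: &str, devpath: &str, vars: &[(&str, &str)]) -> Vec<u8> {
    // Header: "ACTION@DEVPATH\0"
    let mut msg = Vec::new();
    msg.extend_from_slice(action.as_bytes());
    msg.push(b'@');
    msg.extend_from_slice(devpath.as_bytes());
    msg.push(0);
    // Body: one NUL-terminated "KEY=VALUE" entry per environment variable.
    for (key, value) in vars {
        msg.extend_from_slice(key.as_bytes());
        msg.push(b'=');
        msg.extend_from_slice(value.as_bytes());
        msg.push(0);
    }
    msg
}
```

udevd parses this payload by splitting on NUL bytes: the first segment carries the action and devpath, and each subsequent segment is one environment variable.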

14.19.2.8 sysfs Namespace Awareness

sysfs is network-namespace-aware for /sys/class/net/: each network namespace (Section 17.1) sees only its own interfaces. Other sysfs hierarchies are shared across all namespaces (devices, buses, and classes other than net/ are global). This matches Linux behavior. Device entries under /sys/class/net/ are tagged with their owning network namespace; readdir() filters entries to show only devices in the caller's net namespace.
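The readdir() filtering described above can be sketched as follows. The type names (`NetNsId`, `NetClassEntry`) are illustrative assumptions, not the kernel's actual types; the point is only that each /sys/class/net entry carries an owning-namespace tag that is compared against the caller's namespace.

```rust
/// Illustrative namespace tag (assumed shape, not the kernel's type).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct NetNsId(u64);

/// Illustrative /sys/class/net entry tagged with its owning namespace.
struct NetClassEntry {
    name: &'static str,
    owner_ns: NetNsId,
}

/// Return only the entries visible to the calling task's net namespace,
/// as readdir() on /sys/class/net would.
fn filter_readdir<'a>(entries: &'a [NetClassEntry], caller_ns: NetNsId) -> Vec<&'a str> {
    entries
        .iter()
        .filter(|e| e.owner_ns == caller_ns)
        .map(|e| e.name)
        .collect()
}
```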

14.19.2.9 sysfs Binary Attributes

For attributes that are not human-readable text (firmware blobs, ACPI tables, PCIe config space), sysfs provides binary attributes:

/// Binary attribute: arbitrary-length read/write with offset support.
/// Used for firmware upload, PCI config space, ACPI tables, etc.
pub trait SysfsBinAttribute: Send + Sync {
    /// Attribute file name.
    fn name(&self) -> &'static str;
    /// POSIX permission bits.
    fn mode(&self) -> u16;
    /// Maximum file size (for `stat()` reporting and write bounds checking).
    fn size(&self) -> usize;
    /// Read into `buf` at `offset`. Returns bytes read.
    fn read(&self, kobj: &Kobject, buf: &mut [u8], offset: u64) -> Result<usize, Errno>;
    /// Write from `buf` at `offset`. Returns bytes written.
    fn write(&self, kobj: &Kobject, buf: &[u8], offset: u64) -> Result<usize, Errno> {
        Err(Errno::EACCES)
    }
}
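A typical `read()` implementation performs an offset-windowed copy out of a backing blob (e.g. a cached ACPI table). The helper below is a minimal sketch of that windowing logic under those assumptions, written as a free function so it needs no `Kobject`/`Errno` stubs.

```rust
/// Copy up to `buf.len()` bytes of `blob` starting at `offset`, returning
/// the number of bytes copied. Reads at or past the end return 0 (EOF),
/// matching the short-read semantics a binary attribute would expose.
fn bin_read(blob: &[u8], buf: &mut [u8], offset: u64) -> usize {
    let off = offset as usize;
    if off >= blob.len() {
        return 0; // EOF: offset beyond the attribute's size
    }
    let n = buf.len().min(blob.len() - off);
    buf[..n].copy_from_slice(&blob[off..off + n]);
    n
}
```

Userspace relies on these semantics: `read(2)` on a binary attribute loops until a 0 return, so an implementation must return short counts near the end rather than erroring.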

14.19.3 Boot Initialization Order

procfs and sysfs are among the earliest pseudo-filesystems mounted, as many subsequent boot steps depend on them:

Order Filesystem Dependency Notes
1 sysfs VFS core initialized Mounted before device probing begins; bus/class directories must exist for device registration.
2 procfs VFS core + PID allocator Mounted before PID 1 executes; glibc requires /proc/self/.
3 devtmpfs sysfs Device nodes reference sysfs kobjects.

Both filesystems are kernel-mounted (not user-mounted): the kernel mounts them during early init before transferring control to PID 1. This is unlike debugfs, tracefs, and other pseudo-filesystems, which are mounted by userspace init.
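The ordering constraints in the table above can be sketched as a sequence of early kernel mounts. `kernel_mount` is a hypothetical stand-in for the internal mount entry point, not the kernel's actual API; the sketch exists only to make the dependency order concrete.

```rust
/// Hypothetical internal mount entry point (illustrative only).
fn kernel_mount(fstype: &str, target: &str) -> Result<(), &'static str> {
    // In the real kernel this resolves the registered filesystem type and
    // attaches a new mount to the mount tree; here it just records intent.
    println!("mounting {fstype} on {target}");
    Ok(())
}

/// Early-init mount sequence, in the dependency order from the table:
/// sysfs before device probing, procfs before PID 1, devtmpfs last
/// because device nodes reference sysfs kobjects.
fn early_init_mounts() -> Result<(), &'static str> {
    kernel_mount("sysfs", "/sys")?;
    kernel_mount("proc", "/proc")?;
    kernel_mount("devtmpfs", "/dev")?;
    Ok(())
}
```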