Chapter 13: Virtual Filesystem Layer

VFS architecture, dentry cache, mount tree, path resolution, overlayfs, mount namespace operations


13.1 Virtual Filesystem Layer

The VFS (umka-vfs) provides a unified interface over all filesystem types. It is a Tier 1 component running in its own hardware isolation domain (see Section 10.2 for platform-specific isolation mechanisms), isolated from both umka-core and the individual filesystem drivers it manages.

Why VFS is Tier 1 (not Tier 0):

The VFS handles complex, security-sensitive operations: path resolution (symlink loops, mount point crossing), permission checks, and filesystem driver coordination. Isolating VFS from Core provides:

  1. Attack surface reduction: Path resolution bugs (symlink attacks, directory traversal) are confined to the VFS domain and cannot corrupt Core memory.

  2. Driver isolation chain: Core → VFS (Tier 1) → Filesystem driver (Tier 1/2). A compromised filesystem driver cannot corrupt VFS metadata, and a compromised VFS cannot corrupt Core memory.

  3. Crash containment: A VFS panic (e.g., corrupted dentry cache) is recoverable without rebooting the entire kernel. The recovery protocol:

     a. Detection: umka-core detects VFS domain death (MPK exception, panic handler, or watchdog timeout on the VFS heartbeat ring).

     b. Freeze: All syscalls that enter VFS (open, stat, read, write, close, etc.) are blocked at the umka-core domain boundary. Callers receive -ERESTARTSYS and the VFS ring is drained.

     c. Dirty page cache flush: Dirty pages in umka-core's page cache are flushed to their backing block devices. The page cache is in umka-core memory (not VFS memory), so it survives the VFS crash. Flush uses the block layer ring directly.

     d. Dentry/inode cache rebuild: The new VFS instance starts with an empty dentry cache. Dentries are lazily re-populated on the next path lookup (cache miss triggers disk read). Inode cache is similarly rebuilt on demand.

     e. Open file descriptor recovery: umka-core maintains a table of open file descriptors with their inode numbers and seek positions. After VFS restart, umka-core re-opens each fd by inode number. File descriptors that pointed to deleted files (unlinked but still open) receive -EIO on next access.

     f. Resume: The VFS ring is reopened and blocked syscalls are retried.

Recovery time: ~100-500ms depending on the number of open file descriptors. Limitation: In-flight writes that had not yet reached the page cache are lost (the application receives -EIO and must retry).
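The open-descriptor recovery step (e) amounts to a walk over a core-side fd table. The following is a minimal sketch, not the real API: `OpenFdRecord`, `FdState`, and the `reopen_by_inode` callback are invented stand-ins for umka-core's fd table and its "open by inode number" ring call.

```rust
// Hypothetical sketch of umka-core's fd-recovery walk after a VFS restart.
// Names and layout are illustrative, not the actual UmkaOS structures.

#[derive(Debug, Clone, Copy, PartialEq)]
enum FdState {
    Live,  // successfully re-opened against the new VFS instance
    Stale, // unlinked-but-open file: next access returns -EIO
}

#[derive(Debug)]
struct OpenFdRecord {
    fd: i32,
    inode_id: u64, // stable inode number; survives the VFS crash
    offset: u64,   // saved seek position, restored on re-open
    state: FdState,
}

/// Re-open every recorded fd against the restarted VFS instance.
/// `reopen_by_inode` stands in for the real "open by inode number" ring
/// call; it returns false for inodes that were unlinked while open.
/// Returns the number of descriptors successfully recovered.
fn recover_fd_table(
    table: &mut [OpenFdRecord],
    reopen_by_inode: impl Fn(u64) -> bool,
) -> usize {
    let mut recovered = 0;
    for rec in table.iter_mut() {
        if reopen_by_inode(rec.inode_id) {
            rec.state = FdState::Live;
            recovered += 1;
        } else {
            rec.state = FdState::Stale; // -EIO on next access
        }
    }
    recovered
}
```

Recovery time scales with the number of entries in this table, which is why the ~100-500ms figure above depends on the open-fd count.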

Dirty Page Handling on VFS Crash

When a Tier 1 VFS driver crashes, UmkaOS Core cannot safely flush dirty pages using the crashed driver's block mapping (the file-offset → block-address translation lives in the now-destroyed VFS domain).

UmkaOS's design: pre-registration of dirty extents.

Before modifying any file pages, the VFS driver must register the affected block extents with UmkaOS Core:

/// Register a dirty file extent before modification.
/// Called by VFS drivers before dirtying page cache pages.
/// UmkaOS Core stores a compact log of registered extents for use
/// during crash recovery.
///
/// # Parameters
/// - `inode_id`: Stable inode identifier (survives VFS crash).
/// - `file_offset`: Byte offset of the dirty range start.
/// - `len`: Length of the dirty range in bytes.
/// - `block_addr`: Physical block address for this range.
/// - `block_len`: Length of the block range in bytes.
///
/// # Errors
/// Returns `VfsDirtyError::RingFull` (equivalent to `EBUSY`) when the
/// dirty extent ring is full. The caller must free ring slots by calling
/// `vfs_flush_extent_complete()` before retrying.
pub fn vfs_register_dirty_extent(
    inode_id: InodeId,
    file_offset: u64,
    len: u64,
    block_addr: PhysBlockAddr,
    block_len: u64,
) -> Result<(), VfsDirtyError>;

/// Error type for dirty extent registration.
pub enum VfsDirtyError {
    /// The dirty extent ring is full (all 4096 slots occupied with
    /// unacknowledged extents). Free slots with `vfs_flush_extent_complete()`
    /// before retrying. Equivalent to EBUSY.
    RingFull,
    /// Other VFS error (invalid inode ID, etc.).
    Other(VfsError),
}

UmkaOS Core maintains a dirty extent log in core memory (not in VFS domain memory):

  • Ring buffer of up to 4096 DirtyExtentRecord entries per filesystem instance.
  • Each record: { inode_id, file_offset, len, block_addr, block_len, seq: u64 }.
  • Entries are cleared when the VFS driver calls vfs_flush_extent_complete().

Ring overflow policy: vfs_register_dirty_extent() returns EBUSY when the ring is full (all 4096 slots occupied with unacknowledged extents). The VFS driver must not proceed with the write operation when EBUSY is returned; it must first call vfs_flush_extent_complete() for one or more completed extents to free ring slots, then retry vfs_register_dirty_extent().

This is a deliberate design choice that differs from Linux's approach: UmkaOS never silently discards safety information. The EBUSY backpressure ensures that on any VFS crash, umka-core has a complete record of all outstanding dirty extents and can accurately flag inconsistent data — no dirty extent is ever "forgotten."

Filesystem drivers should use flow control: pre-register extents in batches of ≤64, and flush completed extents at every fsync/barrier point. Under normal operation, the 4096-slot ring provides ample buffering for burst writes; EBUSY is only encountered if the VFS driver fails to acknowledge completions promptly.

If the VFS driver is unresponsive (not calling vfs_flush_extent_complete() for >5 seconds), umka-core treats all unacknowledged extents as dirty and initiates VFS driver restart — the EBUSY backpressure prevents ring overflow from masking a stuck VFS driver.
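The EBUSY backpressure loop that a filesystem driver must run before dirtying pages can be sketched as follows. The ring is reduced to a slot counter for illustration; `DirtyRing`, `register`, and `complete` are stand-ins for the real `vfs_register_dirty_extent()` and `vfs_flush_extent_complete()` calls.

```rust
// Minimal sketch of the RingFull/EBUSY backpressure protocol. The driver
// must never proceed with a write on RingFull; it drains completed extents
// and retries. Names are illustrative, not the UmkaOS API.

const RING_SLOTS: usize = 4096; // per-filesystem-instance ring capacity

struct DirtyRing {
    in_flight: usize, // registered but not yet acknowledged extents
}

#[derive(Debug, PartialEq)]
enum VfsDirtyError {
    RingFull, // equivalent to EBUSY
}

impl DirtyRing {
    /// vfs_register_dirty_extent() stand-in: claim one ring slot.
    fn register(&mut self) -> Result<(), VfsDirtyError> {
        if self.in_flight == RING_SLOTS {
            return Err(VfsDirtyError::RingFull);
        }
        self.in_flight += 1;
        Ok(())
    }

    /// vfs_flush_extent_complete() stand-in: acknowledge a flushed extent.
    fn complete(&mut self) {
        self.in_flight -= 1;
    }
}

/// Register one extent, draining completed extents on RingFull as the
/// overflow policy requires, then retrying.
fn register_with_backpressure(ring: &mut DirtyRing, drain: impl Fn(&mut DirtyRing)) {
    loop {
        match ring.register() {
            Ok(()) => return,
            Err(VfsDirtyError::RingFull) => drain(ring), // free slots, retry
        }
    }
}
```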

Crash recovery sequence:

When a Tier 1 VFS driver crashes and UmkaOS Core detects pending dirty extents:

  1. Iterate the dirty extent log for the crashed filesystem instance.
  2. For each registered extent (newest first, to preserve ordering for journals):
     a. Issue a direct block write via the block layer (bypassing VFS).
     b. Use block_addr and block_len from the pre-registered extent record.
     c. Wait for write completion.
  3. After all registered extents are flushed: mark the filesystem instance as "crash-flushed" and continue with driver reload.
  4. Any dirty pages NOT covered by registered extents are flagged as "potentially inconsistent". The filesystem's own journal/log handles recovery on next mount (same as a hard power-off scenario).
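The recovery sequence reduces to a sort-and-flush loop over the snapshotted extent log. A minimal sketch, with `write_block` standing in for the direct block-layer ring write and a trimmed-down `DirtyExtentRecord`:

```rust
// Hypothetical crash-flush loop run by umka-core after VFS domain death.
// Flush order follows the text: newest first (descending seq).

#[derive(Clone, Copy)]
struct DirtyExtentRecord {
    inode_id: u64,
    block_addr: u64, // physical block address, pre-registered by the driver
    block_len: u64,
    seq: u64,        // monotonic registration sequence number
}

/// Flush all registered extents for one filesystem instance, newest first,
/// then clear the log (the "crash-flushed" mark). Returns the number of
/// block writes issued.
fn crash_flush(
    log: &mut Vec<DirtyExtentRecord>,
    mut write_block: impl FnMut(u64, u64),
) -> usize {
    log.sort_by(|a, b| b.seq.cmp(&a.seq)); // newest first
    let mut issued = 0;
    for rec in log.iter() {
        // Direct block write via the block layer ring; the VFS is bypassed
        // entirely because its domain no longer exists.
        write_block(rec.block_addr, rec.block_len);
        issued += 1;
    }
    log.clear(); // instance is now "crash-flushed"
    issued
}
```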

Filesystem requirements:

  • Filesystems with stable pre-allocated extents (ext4, XFS): can register extents at file creation/truncation time. The block address is stable.
  • Filesystems with copy-on-write (Btrfs): must register the NEW block address before the CoW write begins (not the old block address).
  • FAT/exFAT: no journaling — if not using registered extents, crash = data loss. The VFS driver must register all dirty cluster chains.

Design rationale: This is better than Linux's approach (which silently loses dirty pages when a kernel module crashes) while being simpler than running a full WAL in UmkaOS Core. The pre-registration overhead is one lightweight ring-buffer push per dirtied file region — negligible for writeback-dominated workloads.

Performance implications and mitigation:

The Core → VFS domain switch costs ~23 cycles for the bare WRPKRU instruction (x86-64 MPK). The full domain crossing — including argument marshaling via the inter-domain ring buffer and cache effects — is ~30-35 cycles per crossing. This overhead is amortized by:

  1. Page Cache in Core: The Page Cache (Section 4.1.3) lives in Core, not VFS. Cached file reads/writes hit the Page Cache directly with zero domain switches. Only cache misses (actual I/O) cross into VFS.

  2. Batching: Multiple file operations within a single syscall (e.g., readv, io_uring batches) amortize the domain switch over many operations.

  3. Dentry cache hit rate: The dentry cache (in VFS) has >99% hit rate for typical workloads. Path resolution is fast, and the domain switch cost is dominated by the actual I/O latency (microseconds vs nanoseconds).

Measured overhead: For a 4KB NVMe read (~10μs device latency), the additional domain switches (Core → VFS → FS driver) add ~70 cycles (~30ns total), which is 0.3% overhead. This is well within the "<5% overhead" target.

13.1.1 VFS Architecture

Responsibilities: path resolution, dentry caching, inode management, mount tree traversal, and permission checks (delegated to umka-core's capability system via the inter-domain ring buffer).

Filesystem drivers register as VFS backends. The VFS never interprets on-disk format directly — it delegates all storage operations through four trait interfaces:

Foundational VFS types (used throughout this chapter):

/// Opaque filesystem inode identifier. Unique within a single SuperBlock.
///
/// Inode 0 is never valid (used as the null sentinel in `AtomicOption`).
/// Inode 1 is conventionally the root directory inode.
/// The u64 width accommodates all known filesystem inode spaces (ext4 uses
/// u32 internally but promotes to u64 for future-proofing; Btrfs and ZFS
/// use u64 natively).
///
/// `InodeId` is filesystem-private: the same u64 value in two different
/// `SuperBlock` instances refers to different inodes.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct InodeId(pub u64);

impl From<u64> for InodeId { fn from(v: u64) -> Self { InodeId(v) } }
impl From<InodeId> for u64  { fn from(id: InodeId) -> u64 { id.0 } }

/// Opaque VFS pipe identifier. Each `pipe(2)` / `pipe2(2)` call produces a
/// unique `PipeId` for internal tracking (waitqueue association, splice
/// routing, and PipeBuffer lifetime management). Not visible to userspace.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct PipeId(pub u64);

/// Response envelope for cross-domain VFS ring buffer calls.
/// Matches the ring protocol described in [Section 10.7](10-drivers.md#107-ipc-architecture-and-message-passing).
#[derive(Debug)]
pub enum VfsResponse {
    /// Success, possibly with a return value (e.g., byte count for read/write).
    Ok(i64),
    /// Error code (negated Linux errno, e.g., `-ENOENT`).
    Err(i32),
    /// Asynchronous completion pending; caller must wait on the completion ring.
    Pending,
}
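A core-side caller might unwrap a `VfsResponse` along these lines. The enum is re-declared here (with derives) so the sketch stands alone; `RingError` and `into_result` are illustrative names, and the sign handling follows the negated-errno convention noted on `Err` above.

```rust
// Hypothetical caller-side decoding of a VFS ring response. Not the real
// umka-core code; it only illustrates the envelope's three cases.

#[derive(Debug, PartialEq)]
enum VfsResponse {
    Ok(i64),  // success with return value (e.g., byte count)
    Err(i32), // negated Linux errno, e.g., -2 for -ENOENT
    Pending,  // async completion; wait on the completion ring
}

#[derive(Debug, PartialEq)]
enum RingError {
    Errno(i32), // positive errno value
    WouldBlock, // Pending: caller must park on the completion ring
}

fn into_result(r: VfsResponse) -> Result<i64, RingError> {
    match r {
        VfsResponse::Ok(n) => Ok(n),
        // The ring carries negated errnos; store the positive value.
        VfsResponse::Err(e) => Err(RingError::Errno(-e)),
        VfsResponse::Pending => Err(RingError::WouldBlock),
    }
}
```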
/// Filesystem-level operations (mount, unmount, statfs).
/// Implemented once per filesystem type (ext4, XFS, btrfs, ZFS, tmpfs, etc.).
pub trait FileSystemOps: Send + Sync {
    /// Mount a filesystem from the given source device with flags and options.
    fn mount(&self, source: &str, flags: MountFlags, data: &[u8]) -> Result<SuperBlock>;

    /// Unmount a previously mounted filesystem.
    fn unmount(&self, sb: &SuperBlock) -> Result<()>;

    /// Force-unmount: abort in-flight I/O with EIO. Called when umount2()
    /// is invoked with MNT_FORCE. Not all filesystems support this — return
    /// ENOSYS if unsupported. NFS uses this for stale server recovery.
    fn force_umount(&self, sb: &SuperBlock) -> Result<()>;

    /// Return filesystem statistics (total/free/available blocks and inodes).
    fn statfs(&self, sb: &SuperBlock) -> Result<StatFs>;

    /// Flush all dirty data and metadata for this filesystem to stable storage.
    /// Backend for syncfs(2) and the filesystem-level portion of sync(2).
    fn sync_fs(&self, sb: &SuperBlock, wait: bool) -> Result<()>;

    /// Remount with changed flags/options (e.g., `mount -o remount,ro`).
    fn remount(&self, sb: &SuperBlock, flags: MountFlags, data: &[u8]) -> Result<()>;

    /// Freeze the filesystem for a consistent snapshot. All pending writes are
    /// flushed and new writes block until thaw. Used by LVM snapshots, device-mapper,
    /// and backup tools via FIFREEZE ioctl.
    fn freeze(&self, sb: &SuperBlock) -> Result<()>;

    /// Thaw a previously frozen filesystem, allowing writes to resume.
    fn thaw(&self, sb: &SuperBlock) -> Result<()>;

    /// Format filesystem-specific mount options for /proc/mounts output.
    fn show_options(&self, sb: &SuperBlock, buf: &mut [u8]) -> Result<usize>;
}

/// Inode (directory structure) operations.
/// Handles namespace operations: lookup, create, link, unlink, rename.
///
/// Note: `OsStr` is a kernel-defined type (NOT `std::ffi::OsStr`, which is
/// unavailable in `no_std`). It is a dynamically-sized type (DST) wrapping
/// `[u8]`, representing filenames that may contain arbitrary non-UTF-8 bytes
/// (Linux filenames are byte strings, not Unicode). Defined in
/// `umka-vfs/src/types.rs`:
///   `pub struct OsStr([u8]);`
/// As a DST, `OsStr` cannot be used by value — it is always behind a
/// reference (`&OsStr`) or `Box<OsStr>`. `&OsStr` is a fat pointer
/// (pointer + length), analogous to `&[u8]` but carrying the semantic
/// intent of "filesystem name component." Conversion from `&str` is
/// infallible (UTF-8 is a valid byte sequence); conversion TO `&str`
/// returns `Result` (may fail on non-UTF-8 filenames).
pub trait InodeOps: Send + Sync {
    /// Look up a child entry by name within a parent directory.
    fn lookup(&self, parent: InodeId, name: &OsStr) -> Result<InodeId>;

    /// Create a regular file in the given directory.
    fn create(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;

    /// Create a subdirectory.
    fn mkdir(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;

    /// Create a hard link: new entry `new_name` in `new_parent` pointing to `inode`.
    fn link(&self, inode: InodeId, new_parent: InodeId, new_name: &OsStr) -> Result<()>;

    /// Create a symbolic link containing `target` at `parent/name`.
    fn symlink(&self, parent: InodeId, name: &OsStr, target: &OsStr) -> Result<InodeId>;

    /// Read the target of a symbolic link.
    fn readlink(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;

    /// Create a device special file (block/char device, FIFO, or socket).
    fn mknod(&self, parent: InodeId, name: &OsStr, mode: FileMode, dev: DevId) -> Result<InodeId>;

    /// Remove a directory entry (unlink for files, rmdir for empty directories).
    fn unlink(&self, parent: InodeId, name: &OsStr) -> Result<()>;

    /// Remove an empty directory. Separate from unlink for POSIX semantics:
    /// `unlink()` on a directory returns EISDIR; `rmdir()` on a file returns ENOTDIR.
    fn rmdir(&self, parent: InodeId, name: &OsStr) -> Result<()>;

    /// Rename/move a directory entry, possibly across directories.
    /// `flags` supports RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT
    /// (Linux renameat2 semantics, required for overlayfs).
    fn rename(
        &self,
        old_parent: InodeId, old_name: &OsStr,
        new_parent: InodeId, new_name: &OsStr,
        flags: RenameFlags,
    ) -> Result<()>;

    /// Get inode attributes (size, mode, timestamps, link count).
    fn getattr(&self, inode: InodeId) -> Result<InodeAttr>;

    /// Set inode attributes (chmod, chown, utimes).
    fn setattr(&self, inode: InodeId, attr: &SetAttr) -> Result<()>;

    /// List extended attributes on an inode.
    fn listxattr(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;

    /// Get an extended attribute value.
    fn getxattr(&self, inode: InodeId, name: &OsStr, buf: &mut [u8]) -> Result<usize>;

    /// Set an extended attribute value.
    fn setxattr(&self, inode: InodeId, name: &OsStr, value: &[u8], flags: XattrFlags)
        -> Result<()>;

    /// Remove an extended attribute.
    fn removexattr(&self, inode: InodeId, name: &OsStr) -> Result<()>;
}

/// File data operations (open, read, write, sync, allocate, close).
pub trait FileOps: Send + Sync {
    /// Called when a file is opened. Allows the filesystem to initialize per-open
    /// state (NFS delegation, device state, lock state). Returns a filesystem-private
    /// context value stored in the file descriptor.
    fn open(&self, inode: InodeId, flags: OpenFlags) -> Result<u64>;

    /// Called when the last file descriptor referencing this open file is closed.
    /// Filesystem releases per-open state (flock release-on-close, NFS delegation
    /// return, device cleanup). `private` is the value returned by `open()`.
    fn release(&self, inode: InodeId, private: u64) -> Result<()>;

    /// Read data from a file at the given offset. `private` is the
    /// filesystem-private context value returned by `open()`.
    fn read(&self, inode: InodeId, private: u64, offset: u64, buf: &mut [u8]) -> Result<usize>;

    /// Write data to a file at the given offset. `private` is the
    /// filesystem-private context value returned by `open()`.
    fn write(&self, inode: InodeId, private: u64, offset: u64, buf: &[u8]) -> Result<usize>;

    /// Truncate a file to the specified size. This is separate from setattr
    /// because truncation is a complex operation on many filesystems: it must
    /// free blocks/extents, update extent trees, handle COW (ZFS/btrfs),
    /// interact with snapshots, and flush in-progress writes beyond the new
    /// size. The VFS calls truncate after updating the in-memory inode size.
    /// `private` is the filesystem-private context value returned by `open()`.
    fn truncate(&self, inode: InodeId, private: u64, new_size: u64) -> Result<()>;

    /// Flush file data (and optionally metadata) to stable storage.
    /// `private` is the filesystem-private context value returned by `open()`.
    fn fsync(&self, inode: InodeId, private: u64, datasync: bool) -> Result<()>;

    /// Pre-allocate or punch holes in file storage. `private` is the
    /// filesystem-private context value returned by `open()`.
    fn fallocate(&self, inode: InodeId, private: u64, offset: u64, len: u64, mode: FallocateMode) -> Result<()>;

    /// Read directory entries. Returns entries starting from `offset` (an opaque
    /// cookie, not a byte position). The callback is invoked for each entry; it
    /// returns `false` to stop iteration (buffer full). This is the backend for
    /// `getdents64(2)`. `private` is the filesystem-private context value
    /// returned by `open()`.
    fn readdir(
        &self,
        inode: InodeId,
        private: u64,
        offset: u64,
        emit: &mut dyn FnMut(InodeId, u64, FileType, &OsStr) -> bool,
    ) -> Result<()>;

    /// Seek to a data or hole region (SEEK_DATA / SEEK_HOLE, lseek(2)).
    /// Filesystems that do not support sparse files return the file size for
    /// SEEK_DATA at any offset, and ENXIO for SEEK_HOLE at any offset.
    /// `private` is the filesystem-private context value returned by `open()`.
    fn llseek(&self, inode: InodeId, private: u64, offset: i64, whence: SeekWhence) -> Result<u64>;

    /// Map a file region into a process address space. The VFS calls this to
    /// obtain the page frame list; the actual page table manipulation is done
    /// by umka-core (Section 4.1). Filesystems that do not support mmap (e.g.,
    /// procfs, sysfs) return ENODEV. `private` is the filesystem-private
    /// context value returned by `open()`.
    fn mmap(&self, inode: InodeId, private: u64, offset: u64, len: usize, prot: MmapProt) -> Result<MmapResult>;

    /// Handle a filesystem-specific ioctl. The VFS dispatches generic ioctls
    /// (FIOCLEX, FIONREAD, etc.) itself; only unrecognized ioctls reach the
    /// filesystem driver. Returns ENOTTY for unsupported ioctls. `private` is
    /// the filesystem-private context value returned by `open()`.
    fn ioctl(&self, inode: InodeId, private: u64, cmd: u32, arg: u64) -> Result<i64>;

    /// Splice data between a file and a pipe without copying through userspace.
    /// Backend for splice(2), sendfile(2), and copy_file_range(2). Filesystems
    /// that do not implement this get a generic page-cache-based fallback
    /// provided by the VFS. `private` is the filesystem-private context value
    /// returned by `open()`.
    fn splice_read(
        &self,
        inode: InodeId,
        private: u64,
        offset: u64,
        pipe: PipeId,
        len: usize,
    ) -> Result<usize>;

    /// Splice data from a pipe into a file without copying through userspace.
    /// Reverse direction of splice_read: pipe is the data source, file is the
    /// destination. Backend for splice(2) write direction and vmsplice(2).
    /// Filesystems that do not implement this get a generic page-cache-based
    /// fallback provided by the VFS. `private` is the filesystem-private
    /// context value returned by `open()`.
    fn splice_write(
        &self,
        pipe: PipeId,
        inode: InodeId,
        private: u64,
        offset: u64,
        len: usize,
    ) -> Result<usize>;

    /// Poll for readiness events (POLLIN, POLLOUT, POLLERR). Regular files
    /// always return ready; special files (pipes, device nodes, eventfd)
    /// implement blocking semantics. `private` is the filesystem-private
    /// context value returned by `open()`.
    fn poll(&self, inode: InodeId, private: u64, events: PollEvents) -> Result<PollEvents>;
}
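The `readdir` emit-callback contract (return `false` when the caller's buffer is full, stopping iteration) can be illustrated with a simplified getdents-style filler. Types here are stand-ins: names are `&str` rather than `&OsStr`, the opaque cookie is a plain index, and the 24-byte per-entry header size is invented.

```rust
// Sketch of a VFS-side getdents64 buffer fill driving an emit callback.
// Simplified types; not the actual UmkaOS implementation.

/// Pack directory entries into a buffer of `buf_cap` bytes, starting at
/// resume cookie `start`. Returns the emitted names and the cookie for the
/// next getdents call.
fn fill_dirent_buf(
    entries: &[(u64, &str)], // (inode, name) pairs produced by the FS driver
    start: u64,
    buf_cap: usize,
) -> (Vec<String>, u64) {
    let mut out = Vec::new();
    let mut used = 0usize;
    let mut next = start;
    // Per-entry emit closure: returns false once the buffer is full,
    // which stops iteration — mirroring FileOps::readdir's contract.
    let mut emit = |name: &str| -> bool {
        let need = name.len() + 24; // name + fixed dirent header (illustrative)
        if used + need > buf_cap {
            return false;
        }
        used += need;
        out.push(name.to_string());
        true
    };
    for (i, (_ino, name)) in entries.iter().enumerate().skip(start as usize) {
        if !emit(*name) {
            break; // buffer full; resume later from cookie `next`
        }
        next = i as u64 + 1;
    }
    (out, next)
}
```

A second call with the returned cookie continues exactly where the full buffer stopped, which is why the cookie is opaque to userspace but stable across calls.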

/// Dentry (directory entry) lifecycle operations.
/// Most filesystems use the default VFS implementations. Only network and
/// clustered filesystems need custom implementations (primarily d_revalidate).
pub trait DentryOps: Send + Sync {
    /// Revalidate a cached dentry. Called before using a cached dentry to verify
    /// it is still valid. Returns true if the dentry is still valid, false if
    /// the VFS should discard it and perform a fresh lookup.
    /// Default: always returns true (local filesystems).
    /// Network FS: checks with the server. Clustered FS: checks DLM lease (Section 14.6.6).
    fn d_revalidate(&self, parent: InodeId, name: &OsStr) -> Result<bool> {
        Ok(true)
    }

    /// Custom name comparison. Called during lookup to compare a dentry name
    /// with a search name. Used by case-insensitive filesystems (e.g., VFAT,
    /// CIFS with case folding, ext4 with casefold feature).
    /// Default: byte-exact comparison.
    fn d_compare(&self, name: &OsStr, search: &OsStr) -> bool {
        name == search
    }

    /// Returns a custom hash for this dentry name, or `None` to use the
    /// VFS default (SipHash-1-3 with per-superblock key from `SuperBlock.hash_key`).
    /// Must be consistent with d_compare: if two names are equal per d_compare,
    /// they must produce the same hash.
    ///
    /// The VFS lookup layer calls `d_hash()` and checks the return value.
    /// If `None`, the VFS uses its own SipHash-1-3 with the per-superblock
    /// random key directly, without requiring filesystem involvement. This
    /// matches Linux's pattern where `d_hash` is only invoked when
    /// `dentry->d_op->d_hash` is non-NULL.
    ///
    /// Filesystems with custom hash requirements (e.g., case-insensitive)
    /// override this to return `Some(hash_value)` using their own algorithm —
    /// they never see the SipHash key. The per-superblock key is managed by
    /// the VFS, not exposed to filesystem implementations.
    fn d_hash(&self, name: &OsStr) -> Option<u64> {
        None
    }

    /// Called when a dentry's reference count drops to zero (dentry enters
    /// the unused LRU list). Filesystem can veto caching by returning false.
    fn d_delete(&self, inode: InodeId, name: &OsStr) -> bool {
        true // default: allow LRU caching
    }

    /// Called when a dentry is finally freed from the cache.
    fn d_release(&self, inode: InodeId, name: &OsStr) {}
}
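The consistency requirement between `d_compare` and `d_hash` can be shown with an ASCII case-insensitive pair: the hash is computed over case-folded bytes, so any two names equal under `d_compare` necessarily hash identically. FNV-1a is an arbitrary stand-in for the filesystem's own algorithm (which, as noted, never sees the VFS SipHash key), and `&[u8]` stands in for `&OsStr`.

```rust
// Sketch of a d_compare/d_hash pair for a case-insensitive filesystem.
// The invariant: d_compare_ci(a, b) == true implies
// d_hash_ci(a) == d_hash_ci(b), achieved by hashing folded bytes.

fn fold(b: u8) -> u8 {
    b.to_ascii_lowercase()
}

/// Case-insensitive name comparison (DentryOps::d_compare override).
fn d_compare_ci(name: &[u8], search: &[u8]) -> bool {
    name.len() == search.len()
        && name.iter().zip(search).all(|(a, b)| fold(*a) == fold(*b))
}

/// Matching hash (DentryOps::d_hash override): FNV-1a over folded bytes.
fn d_hash_ci(name: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
    for &b in name {
        h ^= fold(b) as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV-1a 64-bit prime
    }
    h
}
```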

/// Inode attribute structure — returned by getattr(), compatible with
/// Linux statx(2) for full metadata exposure.
pub struct InodeAttr {
    /// Bitmask of valid fields (STATX_* flags). Filesystems set only
    /// the bits for fields they actually populate.
    pub mask: u32,

    pub mode: u32,        // File type and permissions; u32 to accommodate extended
                          // permission bits — lower 16 bits match Linux umode_t format.
    pub nlink: u32,       // Hard link count
    pub uid: u32,         // Owner UID
    pub gid: u32,         // Group GID
    pub ino: u64,         // Inode number
    pub size: u64,        // File size in bytes
    pub blocks: u64,      // 512-byte blocks allocated
    pub blksize: u32,     // Preferred I/O block size

    // Timestamps with nanosecond precision
    pub atime_sec: i64,   // Last access
    pub atime_nsec: u32,
    pub mtime_sec: i64,   // Last modification
    pub mtime_nsec: u32,
    pub ctime_sec: i64,   // Last status change
    pub ctime_nsec: u32,
    pub btime_sec: i64,   // Creation time (birth time)
    pub btime_nsec: u32,

    pub rdev: u64,        // Device ID (for device special files). Encodes major:minor as (major << 32) | minor. The Linux compat layer (Section 18.1) splits these into separate u32 major/minor fields for statx() responses.
    pub dev: u64,         // Device ID of containing filesystem. Encodes major:minor as (major << 32) | minor. The Linux compat layer (Section 18.1) splits these into separate u32 major/minor fields for statx() responses.
    pub mount_id: u64,    // Mount identifier (STATX_MNT_ID, since Linux 5.8)
    pub attributes: u64,  // File attributes (STATX_ATTR_* flags)
    pub attributes_mask: u64, // Supported attributes mask

    // Direct I/O alignment (STATX_DIOALIGN, since Linux 6.1)
    pub dio_mem_align: u32,    // Required alignment for DIO memory buffers
    pub dio_offset_align: u32, // Required alignment for DIO file offsets

    // Subvolume identifier (STATX_SUBVOL, since Linux 6.10; btrfs, bcachefs)
    pub subvol: u64,

    // Atomic write limits (STATX_WRITE_ATOMIC, since Linux 6.11)
    pub atomic_write_unit_min: u32,  // Min atomic write size (power-of-2)
    pub atomic_write_unit_max: u32,  // Max atomic write size (power-of-2)
    pub atomic_write_segments_max: u32, // Max segments in atomic write
    pub atomic_write_unit_max_opt: u32, // Optimal max atomic write size (STATX_WRITE_ATOMIC, since Linux 6.13)

    // Direct I/O read alignment (STATX_DIO_READ_ALIGN, since Linux 6.14)
    pub dio_read_offset_align: u32,  // DIO read offset alignment (0 = use dio_offset_align)
}
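The major:minor packing described in the `rdev` and `dev` field comments is simple to pin down in code. These helpers are illustrative (the names are not part of the UmkaOS API):

```rust
// Hypothetical helpers for the (major << 32) | minor encoding used by
// InodeAttr.rdev and InodeAttr.dev. The Linux compat layer splits the
// halves back into separate u32 fields for statx() responses.

pub fn pack_dev(major: u32, minor: u32) -> u64 {
    ((major as u64) << 32) | minor as u64
}

pub fn unpack_dev(dev: u64) -> (u32, u32) {
    ((dev >> 32) as u32, (dev & 0xffff_ffff) as u32)
}
```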

Linux comparison: Linux's VFS uses struct super_operations, struct inode_operations, struct file_operations, and struct dentry_operations — C structs of function pointers (Linux's file_operations alone has 30+ methods). UmkaOS's trait-based design serves the same purpose but with Rust's safety guarantees: a filesystem that forgets to implement fsync is a compile-time error, not a null pointer dereference at runtime. The trait methods above cover the operations needed for POSIX compatibility; rarely-used operations (e.g., fiemap, copy_file_range with cross-filesystem support) are handled by generic VFS fallback code that calls the core read/write/fallocate methods.

13.1.1.1 File Handle Export (ExportOps)

The ExportOps trait is implemented by filesystems that support persistent file handles — opaque tokens that identify an inode across server reboots and path renames. Required for:

  • NFS server (clients hold file handles that survive server restart)
  • CRIU checkpoint/restore (open_by_handle_at reopens files by handle)
  • Backup software (rsync --no-implied-dirs, backup agents)
/// File system export operations. Optional — implement only if the filesystem
/// supports persistent, path-independent file handles.
///
/// A file handle is a short opaque byte string (max 128 bytes) that uniquely
/// identifies an inode within a filesystem instance. The handle must survive:
/// - Server reboots (handle encodes stable inode ID + generation counter)
/// - Directory renames (handle does not encode path)
/// - Mount point changes (handle is filesystem-relative, not global)
pub trait ExportOps: Send + Sync {
    /// Encode an inode into a file handle.
    ///
    /// Returns the handle bytes written and a filesystem-defined `fh_type` code
    /// (passed back to `decode_fh`; used to distinguish handle formats).
    ///
    /// # Typical encoding
    /// ext4:  [ inode_number: u32, generation: u32 ] → 8 bytes, fh_type=1
    /// XFS:   [ ino: u64, gen: u32, parent_ino: u64, parent_gen: u32 ] → 24 bytes, fh_type=1
    /// Btrfs: [ objectid: u64, root_objectid: u64, gen: u64 ] → 24 bytes, fh_type=1
    ///
    /// Returns `Err(EOVERFLOW)` if `max_bytes` is too small for this filesystem's handle.
    fn encode_fh(
        &self,
        inode: &Inode,
        handle: &mut [u8; 128],
        max_bytes: usize,
        // If true, include parent inode info to enable NFS reconnect after server reboot.
        connectable: bool,
    ) -> Result<(usize, u8), VfsError>; // (bytes_written, fh_type)

    /// Decode a file handle back to an inode reference.
    ///
    /// Called by `open_by_handle_at`. Must look up the inode using the filesystem's
    /// internal handle format without path traversal.
    ///
    /// Returns `Err(ESTALE)` if the inode no longer exists or the generation counter
    /// does not match (inode number reused after deletion).
    fn decode_fh(
        &self,
        handle: &[u8],
        fh_type: u8,
    ) -> Result<Arc<Inode>, VfsError>;

    /// Get the parent directory inode of an inode (for NFS reconnect after reboot).
    ///
    /// Returns `Err(EACCES)` if the filesystem cannot determine the parent without a
    /// full tree walk (e.g., hardlinks with multiple parents).
    fn get_parent(&self, inode: &Inode) -> Result<Arc<Inode>, VfsError>;

    /// Get the directory entry name for `child` within `parent`.
    ///
    /// Used by the NFS server to reconstruct paths for client caches.
    /// Returns the byte length of the name written into `name_buf`.
    /// Returns `Err(ENOENT)` if no entry for `child` is found in `parent`.
    fn get_name(
        &self,
        parent: &Inode,
        child: &Inode,
        name_buf: &mut [u8; 256],
    ) -> Result<usize, VfsError>;
}

/// Kernel-side file handle: wraps the opaque handle bytes with metadata.
/// Matches the layout of Linux's `struct file_handle` for syscall ABI compatibility.
#[repr(C)]
pub struct FileHandle {
    /// Byte length of the handle data (the populated prefix of `f_handle`).
    pub handle_bytes: u32,
    /// Filesystem-defined type code (passed back verbatim to `ExportOps::decode_fh`).
    pub handle_type: i32,
    /// Opaque handle data (filesystem-defined encoding, up to 128 bytes).
    pub f_handle: [u8; 128],
}
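The ext4-style encoding from the `encode_fh` documentation ([ inode_number: u32, generation: u32 ] → 8 bytes, fh_type = 1) and the matching generation check on decode can be sketched as free functions. `lookup_gen` and the string error codes are illustrative, and little-endian byte order is an assumption; the point is that the generation comparison is what turns a reused inode number into ESTALE.

```rust
// Hypothetical ext4-style file handle codec, following the encoding shown
// in the encode_fh doc comment. Simplified: no Inode type, errno as &str.

const FH_TYPE_EXT4: u8 = 1;

/// Pack (inode_number, generation) into the handle buffer.
/// Returns (bytes_written, fh_type), as ExportOps::encode_fh does.
fn encode_fh_ext4(ino: u32, gen: u32, out: &mut [u8; 128]) -> (usize, u8) {
    out[0..4].copy_from_slice(&ino.to_le_bytes());
    out[4..8].copy_from_slice(&gen.to_le_bytes());
    (8, FH_TYPE_EXT4)
}

/// Decode a handle back to an inode number. `lookup_gen` maps inode number
/// to its current generation (None = inode deleted). A stale generation
/// means the inode number was reused after deletion.
fn decode_fh_ext4(
    handle: &[u8],
    fh_type: u8,
    lookup_gen: impl Fn(u32) -> Option<u32>,
) -> Result<u32, &'static str> {
    if fh_type != FH_TYPE_EXT4 || handle.len() < 8 {
        return Err("EINVAL");
    }
    let ino = u32::from_le_bytes(handle[0..4].try_into().unwrap());
    let gen = u32::from_le_bytes(handle[4..8].try_into().unwrap());
    match lookup_gen(ino) {
        Some(g) if g == gen => Ok(ino),
        _ => Err("ESTALE"), // deleted, or inode number reused (new generation)
    }
}
```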

name_to_handle_at(2) implementation:

name_to_handle_at(dirfd, pathname, handle, mount_id, flags):

1. Resolve pathname to an inode (using normal path resolution with dirfd as the base;
   AT_EMPTY_PATH allows operating on dirfd itself without a pathname component).
2. Retrieve the inode's superblock.
3. Check that the superblock implements ExportOps. Return ENOTSUP if not.
4. Call superblock.export_ops.encode_fh(inode, handle.f_handle, handle.handle_bytes,
   connectable=true).
5. Write back handle_bytes and handle_type into the userspace handle struct.
6. Write the mount's numeric ID to *mount_id. Mount IDs are assigned at mount time
   via a monotonic counter (Section 13.2.3 MountNode.mnt_id).
7. Return 0 on success; EOVERFLOW if the handle buffer is too small.

open_by_handle_at(2) implementation:

open_by_handle_at(mount_fd, handle, flags):

1. Requires CAP_DAC_READ_SEARCH. This syscall bypasses normal path-based access checks
   by design — it is intended for root-equivalent processes such as NFS servers and
   backup agents. Return EPERM if the capability is absent.
2. Resolve mount_fd to a MountNamespace and the corresponding mount point.
3. Retrieve the mount's superblock.
4. Check that the superblock implements ExportOps. Return ENOTSUP if not.
5. Call superblock.export_ops.decode_fh(handle.f_handle, handle.handle_type) → Arc<Inode>.
6. If Err(ESTALE): the inode was deleted or the generation counter does not match
   (inode number reused). Return ESTALE.
7. Perform a DAC check and LSM check on the inode using the caller's credentials.
8. Allocate a new FileDescription wrapping the inode. The file description does not
   carry a path — the inode is accessed directly without directory traversal.
9. Return the new file descriptor number.

Security note: open_by_handle_at intentionally skips directory execute-permission
checks along the path to the inode (the path is not known at this point). This is
the documented and expected behavior for NFS server use. CAP_DAC_READ_SEARCH is the
required guard.

13.1.1.2 Core VFS Data Structures

The VFS layer operates on four fundamental data structures: dentries (directory entries), inodes (index nodes), superblocks (mounted filesystem state), and files (open file handles). This section defines the first three, along with the per-inode `AddressSpace` page-cache mapping; file handles are defined in Section 7.1 (process model) as part of the file descriptor table.

Dentry (Directory Cache Entry)
/// Directory cache entry — represents a single component in a pathname.
///
/// Dentries form a tree that mirrors the filesystem namespace. Each dentry
/// caches the result of a directory lookup: the mapping from a name to an
/// inode. The dentry cache (dcache) is the primary mechanism for avoiding
/// repeated directory lookups on hot paths.
///
/// **Lifecycle**: Created by `InodeOps::lookup()` on first access. Cached
/// in the dcache hash table (keyed by parent + name). Freed when the
/// reference count drops to zero AND the dentry is evicted from the LRU.
/// Negative dentries (name exists but no inode) are also cached to avoid
/// repeated failed lookups.
///
/// **Concurrency**: Dentries are RCU-protected for lockless path resolution
/// (RCU-walk mode, Section 13.1.3). Mutations (create, unlink, rename)
/// acquire the parent dentry's `d_lock` spinlock.
#[repr(C)]
pub struct Dentry {
    /// The name of this directory entry (the final component, not the full path).
    /// Inline for short names (<=32 bytes); heap-allocated for longer names.
    /// Immutable after creation (renames create a new dentry).
    pub d_name: DentryName,

    /// Inode that this dentry points to. `None` for negative dentries
    /// (cached "does not exist" results). Set once by `d_instantiate()`
    /// after a successful lookup or create. Protected by RCU for readers;
    /// `d_lock` for writers.
    pub d_inode: RcuCell<Option<Arc<Inode>>>,

    /// Parent dentry. The root dentry's parent is itself.
    /// Protected by RCU (for RCU-walk path resolution).
    pub d_parent: RcuCell<Arc<Dentry>>,

    /// Hash table linkage for dcache lookup (keyed by parent + name hash).
    pub d_hash: HashListNode,

    /// Children list (subdirectories and files in this directory).
    /// Only meaningful for directory dentries. Protected by `d_lock`.
    pub d_children: IntrusiveList<Dentry>,

    /// Sibling linkage (entry in parent's `d_children` list).
    pub d_sibling: IntrusiveListNode,

    /// Per-dentry spinlock. Protects `d_children`, `d_inode` mutations,
    /// and `d_flags` updates. Lock level: DENTRY_LOCK (level 8).
    pub d_lock: SpinLock<(), DENTRY_LOCK>,

    /// Dentry flags (DCACHE_MOUNTED, DCACHE_NEGATIVE, etc.).
    pub d_flags: AtomicU32,

    /// Reference count. Dentries with refcount > 0 are pinned (in use).
    /// Dentries with refcount == 0 are on the LRU and may be evicted
    /// under memory pressure.
    pub d_refcount: AtomicU32,

    /// Cached permission bits for fast path resolution (Section 13.1.3).
    pub cached_perm: AtomicU32,

    /// Superblock this dentry belongs to.
    pub d_sb: Arc<SuperBlock>,

    /// Filesystem-specific dentry operations (d_revalidate, d_release, etc.).
    /// Set by the filesystem during lookup. NULL for simple filesystems.
    pub d_ops: Option<&'static dyn DentryOps>,

    /// RCU head for deferred freeing.
    pub d_rcu: RcuHead,

    /// LRU list linkage for dcache reclaim.
    pub d_lru: IntrusiveListNode,

    /// Mount point sequence counter. Incremented when a filesystem is
    /// mounted or unmounted on this dentry. Used by RCU-walk to detect
    /// mount table changes during lockless traversal.
    pub d_mount_seq: AtomicU32,
}

/// Short name inline buffer size. Names <=32 bytes are stored inline
/// in the dentry (no heap allocation). Covers >99% of real filenames.
pub const DENTRY_INLINE_NAME_LEN: usize = 32;

/// Dentry name: inline for short names, heap-allocated for long names.
pub enum DentryName {
    Inline { buf: [u8; DENTRY_INLINE_NAME_LEN], len: u8 },
    Heap { ptr: Box<[u8]> },
}
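A minimal sketch of how the inline/heap split might be applied; the enum mirrors the definition above, while `new` and `as_bytes` are illustrative helpers not defined by this chapter.

```rust
// Illustrative constructor/accessor for DentryName. The enum matches the
// spec; the helper methods are hypothetical.

pub const DENTRY_INLINE_NAME_LEN: usize = 32;

pub enum DentryName {
    Inline { buf: [u8; DENTRY_INLINE_NAME_LEN], len: u8 },
    Heap { ptr: Box<[u8]> },
}

impl DentryName {
    /// Store short names inline (no heap allocation); spill long names
    /// to a heap-allocated boxed slice.
    pub fn new(name: &[u8]) -> DentryName {
        if name.len() <= DENTRY_INLINE_NAME_LEN {
            let mut buf = [0u8; DENTRY_INLINE_NAME_LEN];
            buf[..name.len()].copy_from_slice(name);
            DentryName::Inline { buf, len: name.len() as u8 }
        } else {
            DentryName::Heap { ptr: name.to_vec().into_boxed_slice() }
        }
    }

    /// View the name bytes regardless of representation.
    pub fn as_bytes(&self) -> &[u8] {
        match self {
            DentryName::Inline { buf, len } => &buf[..*len as usize],
            DentryName::Heap { ptr } => ptr,
        }
    }
}
```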
AddressSpace (Page Cache Mapping)
/// Per-inode page cache. Maps file byte offsets (at page granularity)
/// to physical page frames held in memory.
///
/// Each inode for a regular file or block device has exactly one
/// `AddressSpace`. Directories and symlinks typically do not use
/// `AddressSpace` unless the filesystem maps their data through the page
/// cache (e.g., directories in ext4 are page-cache-backed).
///
/// **Storage**: `AddressSpace` is embedded directly inside `Inode`
/// (field `i_mapping`). No separate allocation is needed on the fast
/// path.
///
/// **Concurrency**:
/// - `pages` (XArray): RCU-safe for readers; writers hold `xa_lock`
///   (a fine-grained spinlock embedded in the XArray).
/// - `nrpages`, `nrdirty`, `nrwriteback`: independent atomic counters;
///   no lock needed for individual increments/decrements.
/// - `writeback_lock`: `Mutex` serializing concurrent writeback of
///   this inode's pages. At most one writeback agent runs per inode
///   at any time.
///
/// **XArray**: Generic ordered sparse array implemented as a tree of
/// `XNode` slots. Provides RCU-safe concurrent reads with fine-grained
/// locking for writes. Equivalent to Linux's `struct xarray`. See
/// Section 13.2 (Dentry Cache) for the XArray lock ordering rules.
pub struct AddressSpace {
    /// Back-pointer to the owning inode. `Weak` to avoid a reference
    /// cycle (Inode → AddressSpace → Inode).
    pub host: Weak<Inode>,

    /// Page cache: file page index (u64) → `Arc<PageFrame>`.
    ///
    /// Key: `file_offset >> PAGE_SHIFT` (page index, not byte offset).
    /// Value: an `Arc`-wrapped physical page frame holding one page of
    /// file data. Absent entries indicate the page is not cached; the
    /// filesystem must populate it via `AddressSpaceOps::read_page`.
    ///
    /// The XArray provides near-constant-time lookup (a shallow radix
    /// tree) and RCU-safe reads: readers never take a lock; writers
    /// take `xa_lock` only for the affected slot.
    pub pages: XArray<Arc<PageFrame>>,

    /// Total number of page frames currently present in the cache.
    /// Incremented when a page is inserted; decremented when evicted.
    pub nrpages: AtomicU64,

    /// Number of pages that are dirty (modified in memory, not yet
    /// flushed to the backing store). The writeback path decrements this
    /// as pages are written out.
    pub nrdirty: AtomicU64,

    /// Number of pages currently under active writeback I/O. A page is
    /// counted here from the moment writeback I/O is submitted until the
    /// I/O completion handler clears the `PG_WRITEBACK` flag.
    pub nrwriteback: AtomicU64,

    /// Writeback serialisation state. At most one concurrent writeback
    /// agent is permitted per `AddressSpace` to avoid seek amplification
    /// on rotational storage and to simplify error propagation.
    pub writeback_lock: Mutex<WritebackState>,

    /// Filesystem-provided callbacks for page cache operations.
    /// Statically known at inode creation time; never changes.
    pub ops: &'static dyn AddressSpaceOps,

    /// Flags controlling eviction and special page semantics.
    ///
    /// - `AS_UNEVICTABLE` (bit 0): pages must not be reclaimed under
    ///   memory pressure (e.g., ramfs, tmpfs locked pages).
    /// - `AS_BALLOON_PAGE` (bit 1): pages are balloon-inflated and may
    ///   be reclaimed by the balloon driver at any time.
    /// - `AS_EIO` (bit 2): a writeback error occurred; subsequent
    ///   `fsync` calls must return `-EIO` until the flag is cleared.
    /// - `AS_ENOSPC` (bit 3): a writeback error occurred due to no
    ///   space remaining on device.
    pub flags: AtomicU32,
}

/// Serialised writeback state embedded inside `AddressSpace::writeback_lock`.
///
/// Protected by `AddressSpace::writeback_lock`. The `Mutex` ensures only
/// one writeback agent runs at a time; the fields inside track progress
/// so that a new agent can resume where the previous one left off.
pub struct WritebackState {
    /// Next page index to examine during writeback. The writeback agent
    /// advances this forward as pages are submitted for I/O. Wraps to 0
    /// after reaching the last page, implementing a cyclic scan
    /// consistent with the kernel's "kupdate" writeback policy.
    pub writeback_index: u64,

    /// Accumulated bytes of dirty data at the time writeback started.
    /// Used to limit how much data a single writeback pass writes, so
    /// that a continuous dirty stream does not starve readers.
    pub dirty_bytes: u64,
}

/// Filesystem callbacks invoked by the VFS page cache layer.
///
/// Each filesystem that participates in the page cache provides a
/// static `AddressSpaceOps` implementation. The VFS calls these methods
/// when it needs to populate the cache (read miss), flush dirty pages
/// (writeback), or decide whether a page can be dropped (reclaim).
///
/// **Object safety**: all methods take `&self` on the ops vtable plus
/// explicit `AddressSpace`/`PageFrame` references. The vtable itself is
/// `'static`, `Send`, and `Sync`.
pub trait AddressSpaceOps: Send + Sync {
    /// Read one page (identified by `index`, a page-aligned file offset
    /// divided by `PAGE_SIZE`) from the backing store into the page
    /// cache. The implementation must allocate a `PageFrame`, fill it,
    /// insert it into `mapping.pages`, and return an `Arc` to it.
    ///
    /// Called with no locks held. The implementation may block.
    fn read_page(
        &self,
        mapping: &AddressSpace,
        index: u64,
    ) -> Result<Arc<PageFrame>, IoError>;

    /// Write a single dirty page to the backing store. `wbc` carries
    /// writeback control parameters (sync mode, range limits, number
    /// of pages already written in this pass). The implementation must
    /// clear `PG_DIRTY` on the page before starting I/O and set
    /// `PG_WRITEBACK` for the duration of the I/O.
    ///
    /// Called with no locks held. The implementation may block.
    fn writepage(
        &self,
        mapping: &AddressSpace,
        page: &PageFrame,
        wbc: &WritebackControl,
    ) -> Result<(), IoError>;

    /// Called by the page reclaimer immediately before a clean page is
    /// removed from the cache. The filesystem may decline eviction by
    /// returning `false` (e.g., because it has pinned the page for
    /// journalling). Returning `true` grants permission to evict.
    ///
    /// Must not block; must not acquire locks that might sleep.
    fn releasepage(&self, page: &PageFrame) -> bool;

    /// Returns the direct-I/O implementation for this address space,
    /// if the filesystem supports bypassing the page cache (e.g., for
    /// `O_DIRECT` opens). Returns `None` if direct I/O is not supported;
    /// the VFS will then fall back to the page-cache path.
    fn direct_io(&self) -> Option<&dyn DirectIoOps> {
        None
    }
}
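The read-miss fill path that `read_page` serves can be illustrated with a toy page cache. A `HashMap` stands in for the XArray, the trait is reduced to the single method the example needs, and all locking/RCU is omitted.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Toy stand-ins: a HashMap replaces the XArray; AddressSpaceOps is
// reduced to read_page. Real concurrency control is omitted.

pub struct PageFrame {
    pub data: [u8; 4096],
}

pub trait AddressSpaceOps {
    fn read_page(&self, index: u64) -> Result<Arc<PageFrame>, i32>;
}

pub struct AddressSpace<'a> {
    pages: HashMap<u64, Arc<PageFrame>>,
    ops: &'a dyn AddressSpaceOps,
    pub misses: u64,
}

impl<'a> AddressSpace<'a> {
    pub fn new(ops: &'a dyn AddressSpaceOps) -> Self {
        AddressSpace { pages: HashMap::new(), ops, misses: 0 }
    }

    /// Cache lookup with read-miss fill: hits are served from the map;
    /// misses call into the filesystem's read_page and populate it.
    pub fn find_or_read(&mut self, index: u64) -> Result<Arc<PageFrame>, i32> {
        if let Some(page) = self.pages.get(&index) {
            return Ok(Arc::clone(page));
        }
        self.misses += 1;
        let page = self.ops.read_page(index)?;
        self.pages.insert(index, Arc::clone(&page));
        Ok(page)
    }
}

/// Hypothetical filesystem: every page is filled with its own index byte.
pub struct DummyFs;
impl AddressSpaceOps for DummyFs {
    fn read_page(&self, index: u64) -> Result<Arc<PageFrame>, i32> {
        Ok(Arc::new(PageFrame { data: [index as u8; 4096] }))
    }
}
```

The second lookup of the same index is served without calling the filesystem, which is the property the surrounding text relies on when it says a warm cache avoids domain crossings.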
Inode (Index Node)
/// In-memory representation of a filesystem object (file, directory,
/// symlink, device, pipe, socket).
///
/// Each inode has a unique (superblock, inode_number) pair. The VFS
/// maintains an inode cache (icache) keyed by this pair to avoid
/// repeated disk reads.
///
/// **Lifecycle**: Created by `FileSystemOps::mount()` (root inode) or
/// `InodeOps::lookup()`/`InodeOps::create()` for other entries. Cached
/// in the icache. Freed when the last dentry referencing it is evicted
/// AND the on-disk link count drops to zero (unlinked).
///
/// **Concurrency**: Inode metadata is protected by `i_lock` (spinlock).
/// File data is protected by `i_rwsem` (read-write semaphore) — readers
/// (read, readdir) take shared; writers (write, truncate) take exclusive.
#[repr(C)]
pub struct Inode {
    /// Inode number. Unique within a superblock. Assigned by the filesystem.
    pub i_ino: u64,

    /// File type and permission mode (S_IFREG, S_IFDIR, etc. | rwxrwxrwx).
    pub i_mode: u32,

    /// Owner UID.
    pub i_uid: u32,

    /// Owner GID.
    pub i_gid: u32,

    /// Hard link count. When this reaches 0 and no open file descriptors
    /// remain, the inode is freed (both in-memory and on-disk).
    pub i_nlink: AtomicU32,

    /// File size in bytes. For regular files: data size. For directories:
    /// implementation-defined (often the size of the directory data).
    /// For symlinks: length of the target path. Updated under `i_rwsem`.
    pub i_size: AtomicI64,

    /// Timestamps (seconds + nanoseconds since epoch).
    pub i_atime: Timespec,
    pub i_mtime: Timespec,
    pub i_ctime: Timespec,

    /// Block size for this inode's filesystem (typically 4096).
    pub i_blksize: u32,

    /// Number of 512-byte blocks allocated on disk.
    pub i_blocks: u64,

    /// Device number (major:minor) for device special files (S_IFBLK/S_IFCHR).
    /// Encoding: `(major << 32) | minor`. Zero for regular files.
    pub i_rdev: u64,

    /// Generation number. Incremented when an inode is recycled (same i_ino
    /// reused for a new file). Used by NFS file handles to detect stale handles.
    pub i_generation: u32,

    /// Per-inode spinlock. Protects metadata updates (mode, uid, gid, timestamps,
    /// nlink). Lock level: INODE_LOCK (level 7).
    pub i_lock: SpinLock<(), INODE_LOCK>,

    /// Read-write semaphore for file data. read()/readdir() take shared;
    /// write()/truncate() take exclusive.
    pub i_rwsem: RwSemaphore,

    /// Superblock this inode belongs to.
    pub i_sb: Arc<SuperBlock>,

    /// Inode operations (lookup, create, link, unlink, etc.).
    /// Set by the filesystem when the inode is created.
    pub i_op: &'static dyn InodeOps,

    /// File operations (read, write, mmap, ioctl, etc.).
    /// Set by the filesystem; used when opening this inode as a file.
    pub i_fop: &'static dyn FileOps,

    /// Filesystem-private data. Opaque pointer used by the filesystem
    /// driver to attach its own per-inode state (e.g., ext4_inode_info).
    pub i_private: *mut (),

    /// Page cache address space for this inode's data.
    /// Contains the XArray of cached pages, dirty page counters, and
    /// writeback state. See the `AddressSpace` struct defined above.
    pub i_mapping: AddressSpace,

    /// Reference count. Managed by dentry references and open file handles.
    pub i_refcount: AtomicU32,

    /// Dirty flag. Set when inode metadata has been modified in memory
    /// but not yet written to disk.
    pub i_state: AtomicU32,

    /// Hash table linkage for icache lookup (keyed by sb + i_ino).
    pub i_hash: HashListNode,

    /// Superblock dirty inode list linkage.
    pub i_sb_list: IntrusiveListNode,
}
SuperBlock
/// In-memory representation of a mounted filesystem.
///
/// Each mount creates one SuperBlock instance. The superblock holds
/// filesystem-level metadata (block size, feature flags, root inode)
/// and provides the interface between the VFS and the filesystem driver.
///
/// **Lifecycle**: Created by `FileSystemOps::mount()`. Destroyed by
/// `FileSystemOps::unmount()` after all references are released.
pub struct SuperBlock {
    /// Filesystem type identifier (e.g., "ext4", "xfs", "tmpfs").
    pub s_type: &'static str,

    /// Block size in bytes (typically 1024, 2048, or 4096).
    pub s_blocksize: u32,

    /// Log2 of block size (for bit-shift division).
    pub s_blocksize_bits: u8,

    /// Maximum file size supported by this filesystem.
    pub s_maxbytes: i64,

    /// Root dentry of the mounted filesystem.
    pub s_root: Arc<Dentry>,

    /// Filesystem operations (mount, unmount, statfs, sync).
    pub s_op: &'static dyn FileSystemOps,

    /// Mount flags (MS_RDONLY, MS_NOSUID, MS_NODEV, etc.).
    pub s_flags: AtomicU32,

    /// Filesystem-specific data. Opaque pointer used by the filesystem
    /// driver to attach its own per-superblock state (e.g., ext4_sb_info,
    /// xfs_mount).
    pub s_fs_info: *mut (),

    /// UUID of the filesystem (if supported). Used for persistent mount
    /// identification and `/proc/mounts` output.
    pub s_uuid: [u8; 16],

    /// List of all inodes belonging to this superblock.
    /// Protected by `s_inode_list_lock`.
    pub s_inodes: IntrusiveList<Inode>,

    /// List of dirty inodes that need writeback.
    pub s_dirty: IntrusiveList<Inode>,

    /// Per-superblock lock for inode list management.
    pub s_inode_list_lock: SpinLock<()>,

    /// Block device backing this filesystem (None for pseudo-filesystems
    /// like tmpfs, procfs, sysfs).
    pub s_bdev: Option<Arc<BlockDevice>>,

    /// Reference count. Held by Mount nodes and open file handles.
    pub s_refcount: AtomicU32,

    /// Freeze count. >0 means filesystem is frozen (FIFREEZE).
    pub s_freeze_count: AtomicU32,
}

13.1.1.3 VFS Ring Buffer Protocol (Cross-Domain Dispatch)

The tier model (Section 10.4) requires ALL cross-domain communication to use ring buffer IPC. However, the FileSystemOps, InodeOps, and FileOps traits defined above use direct Rust function call signatures. This section specifies how trait method calls are marshaled across the isolation domain boundary between the VFS domain and the Tier 1 filesystem drivers it manages.

Architecture: Each mounted filesystem has a dedicated request/response ring pair:

/// Per-mount ring buffer pair for VFS <-> filesystem driver communication.
///
/// The VFS enqueues requests on `request_ring`; the filesystem driver
/// dequeues, processes, and enqueues responses on `response_ring`.
/// Both rings are in shared memory mapped read-write for both domains
/// (PKEY 1 on x86-64); bulk data lives in the PKEY 14 shared DMA pool.
pub struct VfsRingPair {
    /// Request ring: VFS -> filesystem driver. SPSC (VFS is the sole producer;
    /// the driver is the sole consumer). Ring size: 256 entries (configurable
    /// per-mount via mount options).
    pub request_ring: RingBuffer<VfsRequest>,

    /// Response ring: filesystem driver -> VFS. SPSC (driver produces, VFS
    /// consumes). Same size as request ring.
    pub response_ring: RingBuffer<VfsResponse>,

    /// Doorbell: the VFS writes to signal request availability to the
    /// filesystem driver. Uses the doorbell coalescing mechanism
    /// (Section 10.6.1.1) to batch notifications when multiple requests
    /// are enqueued.
    pub doorbell: DoorbellRegister,

    /// Completion event: VFS waits on this when a synchronous operation
    /// needs a response. Uses `WaitQueue` for blocking callers.
    pub completion: WaitQueue,
}
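The shape of such a ring can be sketched with a fixed-size buffer and two wrapping indices. This toy version is single-threaded; the real SPSC rings add atomic head/tail indices, cache-line padding, and the doorbell/wait-queue signalling described above.

```rust
// Minimal ring sketch. Capacity must be a power of two so indices wrap
// with a mask; a length counter replaces the usual reserved-slot trick
// for simplicity.

pub struct RingBuffer<T> {
    slots: Vec<Option<T>>,
    head: usize, // next slot to dequeue
    tail: usize, // next slot to enqueue
    len: usize,
}

impl<T> RingBuffer<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        RingBuffer { slots: (0..capacity).map(|_| None).collect(), head: 0, tail: 0, len: 0 }
    }

    /// Enqueue; returns Err(item) when the ring is full (the producer
    /// would then back-pressure the caller rather than drop the request).
    pub fn enqueue(&mut self, item: T) -> Result<(), T> {
        if self.len == self.slots.len() {
            return Err(item);
        }
        let mask = self.slots.len() - 1;
        self.slots[self.tail & mask] = Some(item);
        self.tail = self.tail.wrapping_add(1);
        self.len += 1;
        Ok(())
    }

    pub fn dequeue(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        let mask = self.slots.len() - 1;
        let item = self.slots[self.head & mask].take();
        self.head = self.head.wrapping_add(1);
        self.len -= 1;
        item
    }
}
```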

/// VFS request message. Serialized representation of a trait method call.
/// Fixed-size header + variable-length payload.
#[repr(C)]
pub struct VfsRequest {
    /// Unique request ID for matching responses. Monotonically increasing
    /// per-ring. Wraps at u64::MAX.
    pub request_id: u64,

    /// Operation code identifying the trait method.
    pub opcode: VfsOpcode,

    /// Inode number (for InodeOps/FileOps calls). 0 for FileSystemOps calls.
    pub ino: u64,

    /// File handle (for FileOps calls). u64::MAX for non-file operations.
    pub fh: u64,

    /// Operation-specific arguments. The variant must match `opcode`.
    /// Variable-length data (filenames, xattr values, write data) is
    /// passed via shared DMA buffer references embedded in the variant,
    /// not stored inline in the ring entry.
    ///
    /// The VFS dispatcher validates that the `args` variant matches
    /// `opcode` before dispatching; a mismatch is a kernel bug that
    /// panics in debug builds and is rejected with an error response
    /// in release builds.
    pub args: VfsRequestArgs,
}

/// Per-opcode argument payload for a `VfsRequest`.
///
/// This is a Rust enum (tagged union) rather than a C-style `union` for
/// memory safety. Every `VfsOpcode` variant has a corresponding
/// `VfsRequestArgs` variant with the exact parameters that the trait
/// method requires. Variants that carry no extra data beyond what is
/// already in the `VfsRequest` header (opcode, ino, fh) use an empty
/// body `{}`.
///
/// **Inline string limits**: `KernelString` holds up to 255 bytes. Names
/// longer than 255 bytes (possible on some exotic filesystems) must be
/// passed via a `DmaBufferHandle` placed in the `buf` field of the
/// relevant variant; the VFS sets the string `len` to 0 as a sentinel in
/// that case.
///
/// **Caller contract**: The caller fills `VfsRequest { opcode, args, .. }`
/// and enqueues it on `request_ring`. The VFS dispatcher validates that
/// the `args` variant matches `opcode` before dispatching to the
/// filesystem driver.
pub enum VfsRequestArgs {
    // ---------------------------------------------------------------
    // FileSystemOps
    // ---------------------------------------------------------------

    /// `FileSystemOps::mount`. No extra args; mount options are passed
    /// via a separate `DmaBufferHandle` in the ring header.
    Mount {},
    /// `FileSystemOps::unmount`. Graceful unmount; all dirty data must
    /// be flushed before the response is sent.
    Unmount {},
    /// `FileSystemOps::force_unmount`. Best-effort: abandon in-flight
    /// I/O and free resources.
    ForceUnmount {},
    /// `FileSystemOps::statfs`. No per-call arguments.
    Statfs {},
    /// `FileSystemOps::sync_fs`. `wait` controls whether the driver
    /// must block until all I/O is complete (`true`) or may return once
    /// I/O is queued (`false`).
    SyncFs { wait: bool },
    /// `FileSystemOps::remount`. New flags; updated option string is in
    /// a `DmaBufferHandle` in the ring header.
    Remount { flags: u32 },
    /// `FileSystemOps::freeze`. Quiesce all writes for snapshotting.
    Freeze {},
    /// `FileSystemOps::thaw`. Resume writes after a freeze.
    Thaw {},

    // ---------------------------------------------------------------
    // InodeOps
    // ---------------------------------------------------------------

    /// `InodeOps::lookup`. Look up `name` in the directory identified
    /// by `VfsRequest::ino`.
    Lookup { name: KernelString },
    /// `InodeOps::create`. Create a regular file named by the dentry
    /// already allocated by the VFS. `mode` is the combined file-type
    /// and permission bits.
    Create { mode: FileMode },
    /// `InodeOps::link`. Create a hard link whose new name is
    /// `new_name` inside the directory inode of the request.
    Link { new_name: KernelString },
    /// `InodeOps::unlink`. Remove a directory entry. The inode is freed
    /// when its link count reaches zero and all file descriptors are
    /// closed.
    Unlink {},
    /// `InodeOps::mkdir`. Create a directory with the given permission
    /// bits.
    Mkdir { mode: FileMode },
    /// `InodeOps::rmdir`. Remove an empty directory.
    Rmdir {},
    /// `InodeOps::rename`. Move or rename a directory entry.
    /// `new_dir_ino` is the inode number of the destination directory.
    /// `new_name` is the destination name. `flags` carries `RENAME_*`
    /// constants (e.g., `RENAME_NOREPLACE`, `RENAME_EXCHANGE`).
    Rename { new_dir_ino: u64, new_name: KernelString, flags: u32 },
    /// `InodeOps::symlink`. Create a symbolic link whose target path is
    /// `target`. The created inode is named by the dentry pre-allocated
    /// by the VFS.
    Symlink { target: KernelString },
    /// `InodeOps::readlink`. Resolve the symlink target into
    /// `buf`. The driver writes the target string into the DMA buffer
    /// identified by `buf`.
    Readlink { buf: DmaBufferHandle },
    /// `InodeOps::mknod`. Create a special file (block device, character
    /// device, FIFO, or socket). `dev` carries the (major, minor) pair
    /// encoded as `(major << 32) | minor`.
    Mknod { mode: FileMode, dev: DeviceNumber },
    /// `InodeOps::getattr`. Retrieve inode attributes into an
    /// `InodeAttr`. `request_mask` is a bitmask of `STATX_*` fields the
    /// caller wants. `flags` is `AT_*` flags from `statx(2)`.
    GetAttr { request_mask: u32, flags: u32 },
    /// `InodeOps::setattr`. Modify inode attributes. `valid` is a
    /// bitmask of `ATTR_*` flags indicating which fields in `attr` the
    /// driver must update.
    SetAttr { attr: InodeAttr, valid: u32 },
    /// `InodeOps::truncate`. Set the file size to `size` bytes,
    /// releasing or zero-extending as needed.
    Truncate { size: u64 },
    /// `InodeOps::getxattr`. Retrieve the extended attribute `name` into
    /// `buf`. On return, the response `bytes_read` field carries the
    /// attribute value length.
    GetXattr { name: KernelString, buf: DmaBufferHandle },
    /// `InodeOps::setxattr`. Set extended attribute `name` to `value`.
    /// `flags` is `XATTR_CREATE`, `XATTR_REPLACE`, or 0.
    SetXattr { name: KernelString, value: DmaBufferHandle, value_len: u32, flags: u32 },
    /// `InodeOps::listxattr`. Enumerate all extended attribute names into
    /// `buf` as a sequence of NUL-terminated strings. On return, the
    /// response `bytes_read` field carries the total length written.
    ListXattr { buf: DmaBufferHandle },
    /// `InodeOps::removexattr`. Delete the extended attribute `name`.
    RemoveXattr { name: KernelString },
    /// `FileSystemOps::show_options`. Write the filesystem-specific
    /// mount options (as they would appear in `/proc/mounts`) into
    /// `buf`.
    ShowOptions { buf: DmaBufferHandle },

    // ---------------------------------------------------------------
    // FileOps
    // ---------------------------------------------------------------

    /// `FileOps::open`. Open the file. `flags` are the `O_*` open
    /// flags from `open(2)`/`openat(2)`. `mode` is relevant only when
    /// `O_CREAT` is set.
    Open { flags: u32, mode: FileMode },
    /// `FileOps::release`. The last reference to this open file
    /// descriptor has been closed. The driver must flush any cached
    /// state for `fh`.
    Release {},
    /// `FileOps::read`. Read up to `count` bytes starting at `offset`
    /// from the file into `buf`. The driver writes data into the DMA
    /// buffer identified by `buf`. On return, `VfsResponse::bytes_read`
    /// carries the number of bytes actually written.
    Read { buf: DmaBufferHandle, offset: u64, count: u32 },
    /// `FileOps::write`. Write `count` bytes from `buf` into the file
    /// starting at `offset`. `buf` points to a DMA buffer the VFS has
    /// already filled with the data to be written.
    Write { buf: DmaBufferHandle, offset: u64, count: u32 },
    /// `FileOps::fsync`. Flush dirty data and metadata to stable
    /// storage. If `datasync` is `true`, only data blocks need to be
    /// flushed (equivalent to `fdatasync(2)`). `start`..`end` is the
    /// byte range to sync; `end == u64::MAX` means "to end of file".
    Fsync { datasync: bool, start: u64, end: u64 },
    /// `FileOps::readdir`. Enumerate directory entries into `buf`
    /// starting after the position identified by `cookie`. A `cookie` of
    /// 0 means start from the beginning. The driver fills `buf` with
    /// `linux_dirent64` records and sets `VfsResponse::bytes_read` to
    /// the number of bytes written.
    ReadDir { buf: DmaBufferHandle, cookie: u64 },
    /// `FileOps::ioctl`. Pass a device-specific command to the
    /// filesystem driver. `cmd` is the ioctl number; `arg` is the raw
    /// usize argument (may be a user pointer, a small integer, or a
    /// `DmaBufferHandle` depending on the command).
    Ioctl { cmd: u32, arg: usize },
    /// `FileOps::mmap`. Establish a memory mapping. `vma_token` is an
    /// opaque handle the VFS passes to the driver to identify the
    /// virtual memory area; the driver uses it to call back into the
    /// VFS to install PTEs via the KABI page-fault callback.
    Mmap { vma_token: u64, prot: u32, flags: u32 },
    /// `FileOps::fallocate`. Pre-allocate or manipulate storage for the
    /// given byte range. `mode` carries `FALLOC_FL_*` flags.
    Fallocate { mode: u32, offset: u64, len: u64 },
    /// `FileOps::seek_data`. Find the next byte range containing data
    /// at or after `offset` (implements `SEEK_DATA` from `lseek(2)`).
    SeekData { offset: u64 },
    /// `FileOps::seek_hole`. Find the next hole (unallocated range) at
    /// or after `offset` (implements `SEEK_HOLE` from `lseek(2)`).
    SeekHole { offset: u64 },
    /// `FileOps::poll`. Query which I/O events are ready. `events` is
    /// a bitmask of `POLLIN`, `POLLOUT`, `POLLERR`, etc. The driver
    /// responds immediately with the currently ready events; the VFS
    /// handles `epoll`/`select` wait registration separately.
    Poll { events: u32 },
    /// `FileOps::splice_read`. Transfer up to `len` bytes from the file
    /// at `offset` into an in-kernel pipe identified by `pipe_ino`,
    /// without copying through userspace. `flags` carries `SPLICE_F_*`
    /// flags.
    SpliceRead { pipe_ino: u64, offset: u64, len: u32, flags: u32 },
    /// `FileOps::splice_write`. Transfer up to `len` bytes from the
    /// in-kernel pipe identified by `pipe_ino` into the file at
    /// `offset`. `flags` carries `SPLICE_F_*` flags.
    SpliceWrite { pipe_ino: u64, offset: u64, len: u32, flags: u32 },
}

/// Bounded kernel-internal string. Avoids heap allocation for the common
/// case of short names (directory entries, xattr names, symlink targets
/// ≤ 255 bytes).
///
/// For strings longer than 255 bytes the caller must use a
/// `DmaBufferHandle` instead and set `len = 0` as a sentinel.
pub struct KernelString {
    /// Byte length of the string, not including any NUL terminator.
    /// Range: 0 (sentinel for "use DMA buffer") to 255.
    pub len: u8,
    /// Inline storage. Valid bytes are `data[..len]`. The remainder
    /// is zero-padded. Not NUL-terminated; callers must use `len`.
    pub data: [u8; 255],
}
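A sketch of the inline-or-sentinel contract; `from_bytes` and `as_bytes` are hypothetical helpers (the spec defines only the struct layout), and note that in this layout an empty name is indistinguishable from the sentinel.

```rust
// Illustrative constructor for KernelString. Names up to 255 bytes are
// stored inline; longer names yield the len == 0 sentinel, telling the
// receiver to look for a DmaBufferHandle instead.

pub struct KernelString {
    pub len: u8,
    pub data: [u8; 255],
}

impl KernelString {
    /// Returns the inline form, or the sentinel (len = 0) when the name
    /// does not fit inline. An empty input also produces len = 0, a
    /// quirk inherited from the layout above.
    pub fn from_bytes(name: &[u8]) -> KernelString {
        let mut s = KernelString { len: 0, data: [0u8; 255] };
        if !name.is_empty() && name.len() <= 255 {
            s.data[..name.len()].copy_from_slice(name);
            s.len = name.len() as u8;
        }
        s
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.data[..self.len as usize]
    }
}
```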

/// VFS operation codes. One-to-one mapping to trait methods.
#[repr(u32)]
pub enum VfsOpcode {
    // FileSystemOps
    Mount = 1,
    Unmount = 2,
    ForceUnmount = 3,
    Statfs = 4,
    SyncFs = 5,
    Remount = 6,
    Freeze = 7,
    Thaw = 8,

    // InodeOps
    Lookup = 20,
    Create = 21,
    Link = 22,
    Unlink = 23,
    Mkdir = 24,
    Rmdir = 25,
    Rename = 26,
    Symlink = 27,
    Readlink = 28,
    Getattr = 29,
    Setattr = 30,
    Getxattr = 31,
    Setxattr = 32,
    Listxattr = 33,
    Removexattr = 34,
    Truncate = 35,

    // InodeOps / FileSystemOps (continued)
    Mknod = 36,          // → InodeOps::mknod; called by mknod(2) for device nodes
    ShowOptions = 37,    // → FileSystemOps::show_options; called by /proc/mounts, mount(8)

    // FileOps
    Open = 40,
    Release = 41,
    Read = 42,
    Write = 43,
    Fsync = 44,
    Readdir = 45,
    Ioctl = 46,
    Mmap = 47,
    Fallocate = 48,
    SeekData = 49,
    SeekHole = 50,
    Poll = 51,
    SpliceRead = 52,     // → FileOps::splice_read; called by splice(2), sendfile(2)
    SpliceWrite = 53,    // → FileOps::splice_write; called by splice(2) write side
}
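The opcode/args consistency check the dispatcher performs (see the `VfsRequest::args` documentation) can be sketched over a reduced three-variant subset of the two enums:

```rust
// Reduced subset of VfsOpcode / VfsRequestArgs, just to illustrate the
// "args variant must match opcode" validation.

pub enum VfsOpcode { Lookup, Read, SyncFs }

pub enum VfsRequestArgs {
    Lookup { name: String },
    Read { offset: u64, count: u32 },
    SyncFs { wait: bool },
}

/// Returns true when the args variant corresponds to the opcode. In the
/// real dispatcher a mismatch panics in debug builds and is rejected
/// with an error response in release builds.
pub fn args_match(opcode: &VfsOpcode, args: &VfsRequestArgs) -> bool {
    matches!(
        (opcode, args),
        (VfsOpcode::Lookup, VfsRequestArgs::Lookup { .. })
            | (VfsOpcode::Read, VfsRequestArgs::Read { .. })
            | (VfsOpcode::SyncFs, VfsRequestArgs::SyncFs { .. })
    )
}
```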

Dispatch flow (read syscall example):

  1. Userspace calls read(fd, buf, len).
  2. Syscall entry point resolves fd to a ValidatedCap (Section 8.1.1).
  3. VFS checks the page cache (Section 4.1.3). On cache HIT: data is served from core memory with zero domain crossings. On cache MISS: continue.
  4. VFS constructs a VfsRequest { opcode: Read, ino, fh, args: Read { buf, offset, count } }. The buf field is a DmaBufferHandle pointing to a shared-memory region where the driver will write the read data (zero-copy).
  5. VFS enqueues the request on request_ring and rings the doorbell.
  6. The filesystem driver (in its Tier 1 domain) dequeues the request, performs the I/O (issuing block reads via BlockDevice, Section 14.3), and writes data to the shared buffer.
  7. Driver enqueues a VfsResponse { request_id, status, bytes_read } on response_ring.
  8. VFS dequeues the response, populates the page cache, and copies data to userspace.
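The hit/miss split in this flow can be modeled in a few lines. This is a user-space sketch with illustrative types — the `Vfs`, `VfsRequest`, and `VfsResponse` shapes here are simplified stand-ins for the real KABI structures, and the ring is modeled as a plain queue. It shows only steps 3-5 (cache check, request construction, enqueue) and step 8 (cache population):

```rust
// User-space model of the VFS read dispatch path. All types here are
// simplified stand-ins; the real ring is shared memory with a doorbell.
use std::collections::{HashMap, VecDeque};

struct VfsRequest { request_id: u64, ino: u64, offset: u64, len: usize }
struct VfsResponse { request_id: u64, data: Vec<u8> }

struct Vfs {
    page_cache: HashMap<(u64, u64), Vec<u8>>, // (ino, offset) -> page data
    request_ring: VecDeque<VfsRequest>,
    next_id: u64,
}

impl Vfs {
    /// Ok(data) on a cache hit; Err(request_id) on a miss, meaning the
    /// caller must wait for that id on the response ring.
    fn read(&mut self, ino: u64, offset: u64, len: usize) -> Result<Vec<u8>, u64> {
        // Cache HIT: served from core memory, zero domain crossings.
        if let Some(data) = self.page_cache.get(&(ino, offset)) {
            return Ok(data[..len.min(data.len())].to_vec());
        }
        // Cache MISS: enqueue a request for the filesystem driver.
        let id = self.next_id;
        self.next_id += 1;
        self.request_ring.push_back(VfsRequest { request_id: id, ino, offset, len });
        Err(id)
    }

    /// Step 8: a response arrives; populate the page cache.
    fn complete(&mut self, ino: u64, offset: u64, resp: VfsResponse) {
        let _ = resp.request_id;
        self.page_cache.insert((ino, offset), resp.data);
    }
}
```

On a warm cache the read returns immediately; only the miss path touches the ring, which is what makes the warm-cache read() zero-crossing.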

Key design properties:

  • Page cache absorbs most I/O: Only cache misses cross the domain boundary. On a warm cache (common for frequently accessed files), read() has zero domain crossings — data is served directly from core memory. This is why the page cache lives in umka-core, not in the filesystem driver.
  • Zero-copy data path: Read/write data is transferred via shared DMA buffer handles, not copied into the ring buffer. The ring carries only the metadata (opcode, offsets, lengths, buffer handles). Data pages are in the shared DMA pool (PKEY 14 / domain 2).
  • Batching: The doorbell coalescing mechanism (Section 10.6.1.1) batches multiple requests into a single domain switch. readahead() enqueues multiple read requests before ringing the doorbell once.
  • Trait interface as specification: The FileSystemOps, InodeOps, and FileOps traits defined above serve as the SPECIFICATION of the ring protocol. Each trait method maps to exactly one VfsOpcode. The trait signatures define the arguments; the ring protocol serializes them into VfsRequestArgs. Filesystem driver developers implement the traits; the KABI code generator (Section 11.1) produces the serialization/deserialization stubs.

VFS Ring Error Handling and Cancellation:

Every cross-domain VFS request is subject to timeout, cancellation, and driver crash handling. This section specifies the complete lifecycle of a request that does not complete normally.

1. Timeout: Every VFS request has a per-operation timeout based on the expected latency class of the operation:

Timeout class | Operations | Default timeout
Regular | Read, Write, Lookup, Create, Open, Release, Getattr, Setattr, Readdir, Readlink, Link, Unlink, Mkdir, Rmdir, Rename, Symlink, Getxattr, Setxattr, Listxattr, Removexattr, Mmap, SeekData, SeekHole, Poll, Ioctl | 30 seconds
Slow | Fsync, Truncate, Fallocate | 120 seconds
Mount | Mount, Unmount, ForceUnmount, Remount, Statfs, SyncFs, Freeze, Thaw | 300 seconds

The kernel VFS layer starts a per-request timer when the request is enqueued on the request_ring. If the timer fires before a VfsResponse::Ok or VfsResponse::Err arrives on the response_ring, the kernel performs the following steps:

a. Sets request.state to Cancelled in the shared ring metadata.
b. Returns ETIMEDOUT to the waiting syscall (waking the blocked thread via the VfsRingPair::completion wait queue).
c. Enqueues a CancelToken { request_id, reason: CancelReason::Timeout } on a dedicated cancellation side-channel in the ring so the filesystem driver can detect the cancellation and avoid processing a stale request.

The driver is expected to check the cancellation channel before beginning I/O for each dequeued request.

Timeouts are per-mount configurable via mount options (vfs_timeout_regular=<secs>, vfs_timeout_slow=<secs>, vfs_timeout_mount=<secs>). The values above are defaults.
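The classification reduces to a pure mapping from opcode value to duration, with the per-mount options overriding the defaults. A minimal sketch — `TimeoutConfig` and the helper names are illustrative, not part of the specified KABI; the opcode values are those of VfsOpcode above:

```rust
use std::time::Duration;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TimeoutClass { Regular, Slow, Mount }

/// Per-mount timeout configuration; fields mirror the mount options
/// vfs_timeout_regular / vfs_timeout_slow / vfs_timeout_mount.
struct TimeoutConfig { regular: Duration, slow: Duration, mount: Duration }

impl Default for TimeoutConfig {
    fn default() -> Self {
        TimeoutConfig {
            regular: Duration::from_secs(30),
            slow: Duration::from_secs(120),
            mount: Duration::from_secs(300),
        }
    }
}

/// Map an opcode (numeric value from VfsOpcode) to its timeout class.
fn timeout_class(opcode: u32) -> TimeoutClass {
    match opcode {
        // Truncate = 35, Fsync = 44, Fallocate = 48
        35 | 44 | 48 => TimeoutClass::Slow,
        // Mount-class operations occupy opcode values 1..=8
        1..=8 => TimeoutClass::Mount,
        _ => TimeoutClass::Regular,
    }
}

/// The per-request timer is armed with this duration at enqueue time.
fn timeout_for(opcode: u32, cfg: &TimeoutConfig) -> Duration {
    match timeout_class(opcode) {
        TimeoutClass::Regular => cfg.regular,
        TimeoutClass::Slow => cfg.slow,
        TimeoutClass::Mount => cfg.mount,
    }
}
```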

2. Crash handling (filesystem driver crashes): When a Tier 1 filesystem driver crashes (detected by the isolation recovery mechanism described in Section 10.4.3), the kernel VFS layer performs the following recovery sequence:

a. All pending requests for the crashed filesystem driver are immediately failed with EIO. Every thread blocked on VfsRingPair::completion for that mount is woken with VfsResponse::Err(-EIO).
b. The VFS ring is closed: the kernel unmaps the shared ring pages and marks the VfsRingPair as defunct. No new requests are accepted.
c. Any subsequent access to files on that filesystem (open files, cached dentries, inode operations) returns ENOTCONN until the driver is restarted and the filesystem is remounted.
d. For Tier 1 filesystem drivers: the crash recovery mechanism reloads the driver module and replays the mount sequence (using the stored mount arguments from SuperBlock). Pending request state is lost — applications whose requests were failed with EIO must retry. Open file descriptors pointing to the crashed filesystem become invalid and return ENOTCONN on any operation; applications must close and reopen them after remount completes.

Crash Recovery Algorithm — Complete Specification:

VFS crash recovery runs when a Tier 1 VFS driver (e.g., ext4, XFS) crashes and is reloaded (Section 10.8).

Lock ordering during recovery (must acquire in this order to prevent deadlock):

  1. vfs_global_lock (prevents new VFS operations from starting)
  2. Per-superblock sb.recovery_lock (one at a time, in mounting-order sequence)
  3. Per-inode inode.lock (only if individual inodes need repair)

Never hold an inode lock while acquiring sb.recovery_lock.

Step 1: Quiesce in-flight operations

  - Set sb.state = SuperblockState::Recovering (atomic store, Release ordering).
  - All new VFS operations on this superblock return ENXIO immediately (no-op check at syscall entry).
  - Wait for the per-sb operation counter sb.inflight_ops to reach zero (spin with a 5s timeout; if not drained after 5s, send SIGKILL to processes with operations in flight).
  - The in-flight counter is incremented at VFS entry (vfs_op_enter()) and decremented at exit (vfs_op_exit()), both under the per-task RCU read lock to prevent grace-period racing.

Step 2: Drain the ring buffer

  - The driver-to-kernel ring buffer (Section 11.1) may have pending completion events from operations submitted before the crash.
  - Call ring_drain_completions(sb.driver_ring): process all pending completions (call the registered callback for each entry). Completions after a crash return EIO.
  - Discard all pending submission-side entries (operations that were submitted but not yet seen by the driver) by marking them complete(EIO) without forwarding.

Step 3: Dirty page detection and writeback

  - Walk the superblock's page cache (sb.page_cache) for all dirty pages: pages with PageFlags::Dirty set.
  - For each dirty page, check page.last_written_by_lsn against sb.last_committed_lsn.
  - If page.lsn <= sb.last_committed_lsn: the page was committed to the journal; mark it clean (journal replay on fsck covers it).
  - If page.lsn > sb.last_committed_lsn: the page is beyond the last journal commit; writeback must be deferred until the filesystem is repaired.
  - Dirty pages beyond the last commit are kept in memory (pinned) until the filesystem is fsck'd and remounted, at which point a forced writeback is issued.
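The LSN comparison in Step 3 reduces to a small classifier. `DirtyPage`, `PageDisposition`, and the helper names below are illustrative stand-ins for the real page cache structures:

```rust
/// Minimal model of the Step 3 decision: compare each dirty page's LSN
/// against the last committed journal LSN.
#[derive(Debug, PartialEq)]
enum PageDisposition {
    /// Committed to the journal: safe to mark clean; replay covers it.
    MarkClean,
    /// Beyond the last commit: pin in memory until fsck + remount.
    DeferWriteback,
}

struct DirtyPage { lsn: u64 }

fn classify(page: &DirtyPage, last_committed_lsn: u64) -> PageDisposition {
    if page.lsn <= last_committed_lsn {
        PageDisposition::MarkClean
    } else {
        PageDisposition::DeferWriteback
    }
}

/// Partition a dirty list into (clean-able, deferred) counts, as the
/// recovery walk would.
fn partition(pages: &[DirtyPage], last_committed_lsn: u64) -> (usize, usize) {
    let clean = pages
        .iter()
        .filter(|p| classify(p, last_committed_lsn) == PageDisposition::MarkClean)
        .count();
    (clean, pages.len() - clean)
}
```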

Step 4: Reload driver and remount

  - Load the new driver image (Section 10.2 reload protocol).
  - Call driver.mount(sb.device, sb.flags) with MS_RDONLY first (safe mode).
  - Run the filesystem's built-in consistency check (ext4 journal replay; XFS log recovery; Btrfs tree walk) via driver.fsck_fast().
  - If fsck_fast() returns Ok(()): remount read-write; resume normal operations.
  - If fsck_fast() returns Err: emit an FMA fault event, keep the filesystem read-only, and require manual intervention.

Step 5: Flush deferred dirty pages

  - After a successful RW remount, call writeback_deferred_dirty(sb) to flush the dirty pages held since Step 3.

Recovery latency target: ≤500ms for ≤1 million in-flight operations and ≤10 million dirty pages.

3. Cancellation protocol: A caller (or the kernel on behalf of a caller) can cancel a pending request through the following protocol:

a. The caller invokes vfs_cancel(request_id) (internal kernel API, not exposed as a syscall — cancellation is triggered by signal delivery, thread exit, or timeout).
b. The kernel sets request.state = Cancelled in the shared ring metadata for the target request.
c. The kernel enqueues a CancelToken on the cancellation side-channel of the VfsRingPair.
d. The filesystem driver checks the cancellation channel before processing each dequeued request. If the request is marked Cancelled, the driver skips the operation and sends no response (the kernel has already returned an error to the caller).
e. If the driver has already started processing the request (e.g., issued a block I/O read), it may complete the operation — the result is silently discarded by the kernel since the request is already resolved.

/// Token placed on the cancellation side-channel of a VfsRingPair to notify
/// the filesystem driver that a previously enqueued request should be skipped.
#[repr(C)]
pub struct CancelToken {
    /// The `request_id` of the cancelled request. Matches `VfsRequest::request_id`.
    pub request_id: u64,
    /// Why the request was cancelled.
    pub reason: CancelReason,
}

/// Reason for request cancellation.
#[repr(u32)]
pub enum CancelReason {
    /// The per-operation timeout expired before the driver responded.
    Timeout = 1,
    /// The calling thread was interrupted (signal delivery or thread exit).
    CallerCancelled = 2,
    /// The filesystem driver crashed; all pending requests are being flushed.
    DriverCrash = 3,
}
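Steps (c) and (d) of the cancellation protocol, as seen from the driver side, amount to draining the side-channel into a cancelled-ID set before dequeuing work. A minimal sketch with simplified ring types (the real rings are shared-memory structures, and CancelToken carries a reason, elided here):

```rust
use std::collections::{HashSet, VecDeque};

// Illustrative mirrors of the spec's types; reason field elided.
struct CancelToken { request_id: u64 }
struct VfsRequest { request_id: u64 }

/// Driver-side dequeue loop body: drain the cancellation side-channel,
/// then skip any request whose id has been cancelled. Returns the ids
/// of requests the driver should actually process.
fn dequeue_batch(
    requests: &mut VecDeque<VfsRequest>,
    cancel_channel: &mut VecDeque<CancelToken>,
    cancelled: &mut HashSet<u64>,
) -> Vec<u64> {
    // Step (d): check the cancellation channel before beginning I/O.
    while let Some(tok) = cancel_channel.pop_front() {
        cancelled.insert(tok.request_id);
    }
    let mut to_process = Vec::new();
    while let Some(req) = requests.pop_front() {
        if cancelled.contains(&req.request_id) {
            // Cancelled: skip the operation and send no response;
            // the kernel already returned an error to the caller.
            continue;
        }
        to_process.push(req.request_id);
    }
    to_process
}
```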

4. VfsResponse::Pending semantics: A VfsResponse::Pending response from the filesystem driver means the request has been accepted and acknowledged but not yet completed (for example, the driver has issued a block I/O request and is waiting for device completion). The contract is:

  • The caller must poll the response_ring or sleep on VfsRingPair::completion for the final VfsResponse::Ok or VfsResponse::Err.
  • Pending does NOT reset the per-request timeout timer. The maximum time in Pending state is bounded by the operation timeout defined above. If the final response does not arrive within the timeout, the request is cancelled using the standard cancellation protocol (step 3).
  • A driver may send at most one Pending response per request. Sending multiple Pending responses for the same request_id is a protocol violation; the kernel logs a warning and ignores duplicate Pending responses.
  • Pending is optional: a driver may respond directly with Ok or Err without ever sending Pending. It exists to allow the VFS layer to distinguish "driver has seen the request" from "request is still sitting in the ring unprocessed" for diagnostic and health-monitoring purposes (Section 19.1).
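The at-most-one-Pending rule can be expressed as a tiny per-request state machine. In this sketch, `violations` is a stand-in for the kernel's warning log; the state names are illustrative:

```rust
#[derive(Debug, PartialEq)]
enum ReqState { Queued, Pending, Done }

/// Apply a driver's Pending response to per-request state, enforcing
/// "at most one Pending per request": duplicates (or Pending after
/// completion) are counted as protocol violations and ignored.
fn apply_pending(state: &mut ReqState, violations: &mut u32) {
    match state {
        ReqState::Queued => *state = ReqState::Pending,
        ReqState::Pending => *violations += 1, // duplicate Pending: ignore
        ReqState::Done => *violations += 1,    // Pending after final response: ignore
    }
}
```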

13.1.2 Dentry Cache

The dentry (directory entry) cache is the performance-critical data structure of the VFS. It maps (parent_inode, name) pairs to child inodes, eliminating repeated disk lookups for path resolution.

Data structure: RCU-protected hash table. Read-side lookups are lock-free — no atomic operations on the read path, only a memory barrier on RCU read lock entry/exit. This matches Linux's dentry cache design, which is similarly RCU-protected for the same performance reasons.

Negative dentries: When a lookup() returns ENOENT, the VFS caches a negative dentry for that (parent, name) pair. Subsequent lookups for the same nonexistent path component return ENOENT immediately without calling into the filesystem driver. This is critical for workloads like $PATH searches where the shell looks for an executable in 5-10 directories, finding it only in one. Without negative dentries, every command invocation would perform 4-9 unnecessary disk lookups.
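The benefit is easy to see in a toy model, assuming a simplified map-backed cache (`DentryCache` and its fields are illustrative, not the real RCU hash table): the second lookup of a missing name is answered from the negative dentry without a driver call.

```rust
use std::collections::HashMap;

type InodeId = u64;

/// Toy dentry cache: maps (parent_inode, name) to Some(inode) for a
/// positive dentry, or None for a cached ENOENT (negative dentry).
struct DentryCache {
    map: HashMap<(InodeId, String), Option<InodeId>>,
    driver_lookups: u32, // counts simulated calls into the fs driver
}

impl DentryCache {
    /// Ok(inode) on success, Err(()) standing in for ENOENT.
    fn lookup(
        &mut self,
        parent: InodeId,
        name: &str,
        on_disk: &HashMap<(InodeId, String), InodeId>,
    ) -> Result<InodeId, ()> {
        let key = (parent, name.to_string());
        if let Some(cached) = self.map.get(&key) {
            // Hit — positive or negative — without touching the driver.
            return (*cached).ok_or(());
        }
        // Miss: call into the filesystem driver and cache the result,
        // including a negative dentry when the name does not exist.
        self.driver_lookups += 1;
        let found = on_disk.get(&key).copied();
        self.map.insert(key, found);
        found.ok_or(())
    }
}
```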

Eviction: LRU eviction under memory pressure. The dentry cache integrates with umka-core's memory reclaim (Section 4.2 — Memory Compression Tier, in 04-memory.md) — when the page allocator signals memory pressure, the dentry cache shrinker evicts least-recently-used entries. Negative dentries are evicted preferentially (they are cheaper to re-create than positive dentries).

13.1.3 Path Resolution

Path resolution walks the dentry cache component by component. For example, /usr/lib/libfoo.so resolves as: root dentry -> lookup("usr") -> lookup("lib") -> lookup("libfoo.so").

RCU path walk (fast path): The entire resolution is attempted under an RCU read-side critical section. No dentry reference counts are taken, no locks are acquired. If every component is in the dentry cache and no concurrent renames or unmounts are in progress, the entire path resolves with zero atomic operations.

Ref-walk fallback (slow path): If any component is not cached, or if a concurrent mount/rename is detected (via sequence counters), the RCU walk aborts and restarts in ref-walk mode. Ref-walk takes dentry reference counts and inode locks as needed. This two-phase approach mirrors Linux's rcu-walk to ref-walk fallback (LOOKUP_RCU).

Mount point traversal: When a dentry is flagged as a mount point, resolution crosses into the mounted filesystem's root dentry. The mount table is consulted via RCU lookup (no lock) in the fast path.

Symlink resolution: The VFS follows up to 40 nested symlinks before returning ELOOP. This matches the Linux limit and prevents infinite symlink loops.
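The limit is a simple depth counter on the follow loop. A toy resolver over a map of link targets (not the real component-by-component walk; `Err(())` stands in for ELOOP):

```rust
use std::collections::HashMap;

const MAX_SYMLINK_DEPTH: u32 = 40;

/// Toy symlink chaser: `links` maps a path to its symlink target; any
/// path absent from the map is treated as a regular file. Fails once
/// more than 40 links have been followed.
fn resolve_symlinks(start: &str, links: &HashMap<&str, &str>) -> Result<String, ()> {
    let mut current = start.to_string();
    let mut depth = 0u32;
    while let Some(target) = links.get(current.as_str()) {
        depth += 1;
        if depth > MAX_SYMLINK_DEPTH {
            return Err(()); // ELOOP: nesting limit exceeded
        }
        current = target.to_string();
    }
    Ok(current)
}
```

A symlink cycle (`/x -> /y -> /x`) exhausts the depth budget and fails instead of looping forever.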

Capability checks: Traverse permission is checked at each path component, but not via an inter-domain ring call on every component. Instead, the dentry cache stores a cached_perm: AtomicU32 field containing the permission bits resolved on the last successful access by the current UID. During RCU-walk, the VFS reads cached_perm from the dentry (same domain, no ring call) and compares against the requesting process's UID and requested permission. If the cached permission matches (common case — same user accessing the same path), no domain crossing occurs and the check costs only a single atomic load (~1-3 cycles). The permission cache is invalidated on chmod(), chown(), ACL changes, and capability revocation (all infrequent operations).

Permission cache encoding: The 32-bit cached_perm field is divided into:

  - Bits [31:16]: Truncated UID hash (upper 16 bits of a fast hash of the accessor's UID). This is NOT a full UID — it is a probabilistic match filter.
  - Bits [15:12]: Reserved (zero).
  - Bits [11:9]: Permission result for owner (rwx).
  - Bits [8:6]: Permission result for group (rwx).
  - Bits [5:3]: Permission result for other (rwx).
  - Bits [2:0]: Access mode that was checked (rwx).
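A sketch of the pack/check logic for this layout. The bit positions follow the encoding above; the hash mixer is a placeholder (the spec only requires a fast UID hash), and the function names are illustrative:

```rust
/// Placeholder fast hash; only the upper 16 bits of the mix are kept,
/// matching the "truncated UID hash" field.
fn uid_hash16(uid: u32) -> u32 {
    let h = uid.wrapping_mul(0x9E37_79B9);
    (h >> 16) & 0xFFFF
}

/// Pack a cached_perm word: [31:16] UID hash, [15:12] reserved (zero),
/// [11:9] owner rwx, [8:6] group rwx, [5:3] other rwx, [2:0] checked mode.
fn encode_cached_perm(uid: u32, owner: u32, group: u32, other: u32, mode: u32) -> u32 {
    (uid_hash16(uid) << 16)
        | ((owner & 0b111) << 9)
        | ((group & 0b111) << 6)
        | ((other & 0b111) << 3)
        | (mode & 0b111)
}

/// Cache-hit test: the UID hash must match AND the requested permission
/// bits must be a subset of the mode that was checked when the entry
/// was filled. Anything else is a miss and takes the slow path.
fn perm_cache_hit(cached: u32, uid: u32, requested_mode: u32) -> bool {
    let hash_matches = (cached >> 16) == uid_hash16(uid);
    let mode_cached = cached & 0b111;
    hash_matches && (requested_mode & !mode_cached) == 0
}
```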

On a cache hit (UID hash matches AND requested permission bits are a subset of the cached grant), the VFS skips the domain crossing. On a cache miss (UID hash mismatch or permission bits not cached), the VFS performs a full capability check via the inter-domain ring and updates the cache. The 16-bit UID hash is a probabilistic filter with a ~1/65536 false-positive rate per lookup, so collisions between different users must be handled fail-safe.

On hash collision (false positive rate ~1/65536 per lookup): access is denied — the VFS falls back to the slow-path inter-domain capability check, which always produces a correct result. The permission cache is purely advisory; a collision always causes a cache miss, never a permission elevation. Fail-safe direction: deny unknown, never grant unknown.

This design is correct because: 1. A cache hit is only accepted when the UID hash AND the requested permission bits match the stored grant exactly. The probability of a different user with a different permission set matching both fields is ~1/65536 per lookup — and that case results in a cache miss and full slow-path check anyway. 2. The cache is invalidated on ALL permission-changing operations (chmod, chown, ACL changes, capability revocation), ensuring stale grants are never served after the underlying permission state changes. 3. Only the slow-path inter-domain ring call is authoritative. umka-vfs cannot grant access that umka-core's capability tables do not authorize.

Only on a cache miss (first access, different UID, or invalidated entry) does the VFS call umka-core via the inter-domain ring to perform a full capability check and update the dentry's cached permissions. This amortized design preserves the security guarantee (umka-vfs cannot bypass capability checks — it has no access to capability tables, per Section 10.2 and Section 10.4) while keeping the hot-path overhead to a single atomic load per component, comparable to Linux's inode->i_mode check.

13.1.4 Mount Namespace and Capability-Gated Mounting

Each process belongs to a mount namespace containing its own mount tree.

Mount operations are capability-gated:

Operation | Required Capability | Scope
mount | CAP_MOUNT | Mount namespace
bind mount | CAP_MOUNT + read access to source | Mount namespace + source
remount | CAP_MOUNT | Mount namespace
umount | CAP_MOUNT | Mount namespace
pivot_root | CAP_SYS_ADMIN | Mount namespace

CAP_MOUNT is scoped to the calling process's mount namespace — it does not grant mount authority in other namespaces. A container with its own mount namespace can mount filesystems within that namespace without affecting the host.

Mount propagation: Shared, private, slave, and unbindable propagation types, with the same semantics as Linux (MS_SHARED, MS_PRIVATE, MS_SLAVE, MS_UNBINDABLE). This is essential for container runtimes that rely on mount propagation for volume mounts.

Filesystem type registration: Only umka-core can register new filesystem types with the VFS. Filesystem drivers request registration via the inter-domain ring, and umka-core verifies the driver's identity and KABI certification before granting registration.

13.2 Mount Tree Data Structures and Operations

The mount tree is the central data structure of the VFS layer that tracks all mounted filesystems, their hierarchical relationships, and their propagation properties. Every path resolution operation traverses the mount tree (via the mount hash table) to cross mount boundaries. This section defines the complete data structures, algorithms, and namespace operations referenced by Sections 13.1.3, 10.1.4, and 16.1.3 but previously left unspecified.

Design principles:

  1. RCU for the read path: Mount hash table lookups happen on every path resolution (every open(), stat(), readlink(), execve()). The read path must be completely lock-free. Writers (mount/unmount) serialize through the per-namespace mount_lock and publish changes via RCU.

  2. Per-namespace scoping: Unlike Linux, which uses a single global mount_hashtable, UmkaOS scopes the mount hash table per mount namespace. This eliminates contention between namespaces in container-heavy workloads (thousands of namespaces with independent mount trees) and allows mount operations in different namespaces to proceed in parallel with no shared lock. The trade-off is additional memory per namespace; this is acceptable because each namespace already has an independent mount tree and the hash table overhead is proportional to the number of mounts (typically 30-100 per container, well under 1 KiB of hash table memory).

  3. Arc-based lifetime management: Mount nodes are reference-counted via Arc<Mount>. Parent, master, and peer references use Arc (strong) or Weak (where appropriate to break cycles). RCU protects the hash chains and list traversals; Arc protects the Mount node lifetime beyond the RCU grace period.

  4. Capability gating: All mount tree modifications check CAP_MOUNT or CAP_SYS_ADMIN as specified in Section 13.1.4. The data structures below enforce this at the entry point of each operation, not deep inside the algorithm.

  5. 64-bit mount IDs: Per-namespace monotonic counter, never wrapping on any realistic system. Mount IDs are unique within a namespace and are the stable identifier used by statx() (STATX_MNT_ID), the new statmount()/listmount() syscalls, and /proc/PID/mountinfo.

13.2.1 Mount Flags

bitflags! {
    /// Per-mount flags controlling security and access behavior.
    ///
    /// These are distinct from per-superblock options (which control the
    /// filesystem driver's behavior). A single superblock can be mounted
    /// at multiple locations with different per-mount flags (e.g., one
    /// mount point read-write, another read-only via bind mount + remount).
    ///
    /// Bit assignments match Linux's `MNT_*` internal flags for
    /// straightforward compat-layer translation. The `mount(2)` compat
    /// shim translates `MS_*` userspace flags to `MountFlags` at syscall
    /// entry; the new mount API (`mount_setattr(2)`) translates
    /// `MOUNT_ATTR_*` flags similarly.
    #[repr(transparent)]
    pub struct MountFlags: u64 {
        // --- Userspace-visible flags (set via mount/remount/mount_setattr) ---

        /// Do not honor set-user-ID and set-group-ID bits on executables.
        const MNT_NOSUID       = 1 << 0;
        /// Do not allow access to device special files on this mount.
        const MNT_NODEV        = 1 << 1;
        /// Do not allow execution of programs on this mount.
        const MNT_NOEXEC       = 1 << 2;
        /// Mount is read-only. Writes return EROFS.
        const MNT_READONLY     = 1 << 3;
        /// Do not update access times on this mount.
        const MNT_NOATIME      = 1 << 4;
        /// Do not update directory access times on this mount.
        const MNT_NODIRATIME   = 1 << 5;
        /// Update atime only if atime <= mtime or atime <= ctime, or if
        /// the previous atime is more than 24 hours old. Default for most
        /// mounts since Linux 2.6.30 and UmkaOS.
        const MNT_RELATIME     = 1 << 6;
        /// Buffer atime updates in memory and flush lazily. Reduces write
        /// I/O for atime-heavy workloads (e.g., mail servers).
        const MNT_LAZYTIME     = 1 << 7;
        /// Do not follow symlinks on this mount. Used by container runtimes
        /// to prevent symlink-based escapes from bind-mounted directories.
        const MNT_NOSYMFOLLOW  = 1 << 8;

        // --- Internal flags (kernel-managed, not settable by userspace) ---

        /// Mount is in the process of being unmounted. Set by `umount()`
        /// before removing the mount from the hash table. Prevents new
        /// path lookups from entering this mount. Once set, never cleared
        /// (the mount node is freed after the RCU grace period).
        const MNT_DOOMED       = 1 << 16;
        /// Mount is locked and cannot be unmounted by unprivileged
        /// processes. Set on mounts visible in child mount namespaces
        /// created by unprivileged users — prevents a child namespace
        /// from unmounting a mount inherited from the parent. Cleared
        /// only by a process with `CAP_SYS_ADMIN` in the mount's owning
        /// user namespace.
        const MNT_LOCKED       = 1 << 17;
        /// Mount can be expired and automatically unmounted under memory
        /// pressure or after an idle timeout. Used by autofs. The VFS
        /// checks `mnt_count == 0` before expiring a shrinkable mount.
        const MNT_SHRINKABLE   = 1 << 18;
        /// Mount was created by the new mount API (fsopen/fsmount) and
        /// has not yet been attached to the mount tree via move_mount().
        /// Detached mounts are invisible to path resolution and
        /// /proc/PID/mountinfo. They become visible only after
        /// move_mount() attaches them.
        const MNT_DETACHED     = 1 << 19;
    }
}

13.2.2 Propagation Type

/// Mount propagation type. Controls whether mount/unmount events at this
/// mount point are propagated to other mount points, and in which direction.
///
/// Propagation is fundamental to container runtimes: Docker sets the rootfs
/// to MS_PRIVATE by default, Kubernetes uses MS_SHARED for volume mounts
/// that must be visible across pod containers.
///
/// See: Linux kernel Documentation/filesystems/sharedsubtree.rst
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum PropagationType {
    /// Mount events propagate bidirectionally within the peer group.
    /// All mounts in the same peer group see each other's mount/unmount
    /// events. This is the Linux default for the initial namespace root.
    Shared = 0,

    /// Mount events are not propagated to or from this mount. This is
    /// the default for new mount namespaces (container isolation).
    Private = 1,

    /// Mount events propagate unidirectionally from the master to this
    /// mount, but not in the reverse direction. Used when a container
    /// should see new mounts from the host but not expose its own mounts
    /// to the host.
    Slave = 2,

    /// Like Private, but additionally prevents this mount from being
    /// used as the source of a bind mount. Used for security-sensitive
    /// mount points that should never be replicated.
    Unbindable = 3,
}
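A simplified model of who receives a mount event under these propagation types, flattening the intrusive peer/slave lists into a vector. `MountModel` and `receivers` are illustrative only; the real propagation walk traverses the mnt_share ring and mnt_slave_list (Section 13.2.3), and slaves that are themselves peers of further groups are not modeled here:

```rust
#[derive(Clone, Copy, PartialEq)]
enum PropagationType { Shared, Private, Slave, Unbindable }

struct MountModel {
    id: u64,
    propagation: PropagationType,
    group_id: u64,             // peer group id; 0 for Private/Unbindable
    master_group: Option<u64>, // for slaves: the group they receive from
}

/// Which mounts receive a mount event originating at `source`?
/// Shared: all peers in the same group (bidirectional) plus slaves of
/// that group (unidirectional). Slave/Private/Unbindable: no one, since
/// slaves do not send events and private mounts do not propagate.
fn receivers(source: &MountModel, all: &[MountModel]) -> Vec<u64> {
    match source.propagation {
        PropagationType::Private | PropagationType::Unbindable => Vec::new(),
        PropagationType::Slave => Vec::new(),
        PropagationType::Shared => all
            .iter()
            .filter(|m| m.id != source.id)
            .filter(|m| {
                (m.group_id == source.group_id
                    && m.propagation == PropagationType::Shared)
                    || m.master_group == Some(source.group_id)
            })
            .map(|m| m.id)
            .collect(),
    }
}
```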

13.2.3 Mount Node

/// A single mount instance in the mount tree.
///
/// Equivalent to Linux's `struct mount` (not `struct vfsmount` — the latter
/// is the subset exposed to filesystem drivers; `struct mount` is the full
/// internal structure). Each `Mount` represents one attachment of a
/// filesystem at a specific point in the directory tree.
///
/// **Lifetime**: `Mount` nodes are allocated via `Arc<Mount>`. References
/// are held by:
/// - The mount hash table (via RCU-protected hash chain)
/// - The parent mount's `children` list
/// - The peer group's `mnt_share` ring
/// - The master mount's `mnt_slave_list`
/// - Any open file descriptor whose path traversed this mount
///   (via `mnt_count` reference count)
/// - The `MountNamespace.mount_list`
///
/// A mount node is freed when all strong references are dropped, which
/// happens after: (a) removal from the hash table, (b) removal from the
/// parent's child list, (c) RCU grace period completion, and (d) all
/// path-resolution references (`mnt_count`) have been released.
pub struct Mount {
    // --- Identity ---

    /// Unique mount identifier within the owning namespace. Monotonically
    /// increasing, 64-bit, never reused. This is the value returned by
    /// `statx()` in `stx_mnt_id` (STATX_MNT_ID) and reported in
    /// `/proc/PID/mountinfo` field 1.
    pub mount_id: u64,

    /// Device name string (e.g., "/dev/sda1", "tmpfs", "overlay").
    /// Displayed in `/proc/PID/mountinfo` field 10 (mount source).
    /// Heap-allocated, immutable after mount creation.
    pub device_name: Box<[u8]>,

    // --- Tree structure ---

    /// Parent mount. `None` for the root of the mount namespace.
    /// Uses `Weak` to prevent reference cycles in the mount tree:
    /// parent -> children -> parent would create a cycle with `Arc`.
    /// The parent is always alive while any child exists (the child
    /// holds a position in the parent's hash chain), so the `Weak`
    /// can always be upgraded during normal operation. It fails only
    /// during the teardown of a doomed mount tree, which is expected.
    pub parent: Option<Weak<Mount>>,

    /// The dentry in the parent mount's filesystem where this mount is
    /// attached. For the root mount of a namespace, this is the root
    /// dentry of the parent mount (which is itself).
    ///
    /// Together with `parent`, this pair `(parent_mount, mountpoint_dentry)`
    /// is the key in the mount hash table. Path resolution uses this to
    /// detect mount crossings: when a dentry has `DCACHE_MOUNTED` set,
    /// the VFS calls `lookup_mnt(current_mount, dentry)` to find the
    /// child mount.
    pub mountpoint: DentryRef,

    /// Root dentry of the mounted filesystem. When path resolution
    /// crosses into this mount, it continues from this dentry.
    pub root: DentryRef,

    /// The superblock of the mounted filesystem. Shared across all
    /// mounts of the same filesystem instance (e.g., bind mounts share
    /// the superblock). The superblock holds the filesystem-specific
    /// state and the `FileSystemOps`/`InodeOps`/`FileOps` trait objects.
    pub superblock: Arc<SuperBlock>,

    /// Children of this mount — sub-mounts attached at dentries within
    /// this mount's filesystem. Intrusive doubly-linked list for O(1)
    /// insertion and removal. Protected by the namespace's `mount_lock`
    /// for writes; RCU-protected for reads during path resolution.
    pub children: IntrusiveList<Arc<Mount>>,

    /// Link entry for this mount in its parent's `children` list.
    /// Embedded in the `Mount` node to avoid per-child heap allocation.
    pub child_link: IntrusiveListNode,

    // --- Mount flags ---

    /// Per-mount flags (nosuid, nodev, noexec, readonly, noatime, etc.).
    /// Atomically readable for the path-resolution hot path (no lock
    /// needed to check MNT_READONLY or MNT_NOSUID). Modified only under
    /// `mount_lock` via atomic store with Release ordering.
    pub flags: AtomicU64,

    // --- Propagation ---

    /// Propagation type for this mount (Shared, Private, Slave, Unbindable).
    /// Determines how mount/unmount events are forwarded to related mounts.
    /// Modified only under `mount_lock`.
    pub propagation: PropagationType,

    /// Peer group ID for shared mounts. All mounts in the same peer group
    /// have the same `group_id`. Private and unbindable mounts have
    /// `group_id == 0`. Slave mounts retain the `group_id` of their
    /// former peer group (for /proc/PID/mountinfo optional fields).
    ///
    /// Allocated from the namespace's `group_id_allocator`. Unique within
    /// a namespace.
    pub group_id: u64,

    /// Circular linked list of peer mounts (shared propagation).
    /// All mounts in a peer group are linked through `mnt_share`.
    /// When a mount/unmount event occurs on any peer, it is propagated
    /// to all other peers in the ring. For Private/Unbindable mounts,
    /// this list contains only the mount itself (self-loop).
    pub mnt_share: IntrusiveListNode,

    /// Master mount for slave propagation. When this mount is a slave,
    /// `mnt_master` points to the shared mount from which this mount
    /// receives (but does not send) propagation events.
    /// `None` for shared, private, and unbindable mounts.
    pub mnt_master: Option<Weak<Mount>>,

    /// List head for slave mounts of this mount. When this mount is
    /// shared (or was shared), slave mounts derived from it are linked
    /// through `mnt_slave_list`. Each slave's `mnt_slave` node is an
    /// entry in this list.
    pub mnt_slave_list: IntrusiveList<Arc<Mount>>,

    /// Link entry for this mount in its master's `mnt_slave_list`.
    pub mnt_slave: IntrusiveListNode,

    // --- Namespace membership ---

    /// The mount namespace that owns this mount. `Weak` because the
    /// namespace may be destroyed (all processes exited) while detached
    /// mounts or lazy-unmount remnants still exist.
    pub ns: Weak<MountNamespace>,

    /// Link entry in the namespace's `mount_list`. Used for ordered
    /// iteration (e.g., /proc/PID/mountinfo output, umount ordering).
    pub ns_list_link: IntrusiveListNode,

    // --- Reference counting ---

    /// Active reference count. Incremented when path resolution enters
    /// this mount (ref-walk mode) or when an open file descriptor
    /// references a path within this mount. `umount()` checks this
    /// before removing the mount: if `mnt_count > 0`, the mount is
    /// busy and umount returns `EBUSY` (unless `MNT_DETACH` is used).
    ///
    /// Note: this is separate from the `Arc` reference count. `Arc`
    /// tracks the lifetime of the `Mount` struct itself. `mnt_count`
    /// tracks whether the mount is actively *in use* by path lookups
    /// and open files. A mount can have `mnt_count == 0` (not busy)
    /// while still having `Arc` strong count > 0 (struct not yet freed
    /// because it's still in the hash table or child list).
    pub mnt_count: AtomicU64,

    // --- Mount hash chain ---

    /// Link entry in the mount hash table bucket chain. RCU-protected:
    /// readers traverse the chain under `rcu_read_lock()` without any
    /// lock; writers modify the chain under `mount_lock` and publish
    /// via RCU. Uses intrusive linking for zero-allocation hash insertion.
    pub hash_link: IntrusiveListNode,
}

/// Reference to a dentry. Wraps the parent directory's inode ID and the
/// dentry's name hash, which together uniquely identify a dentry in the
/// dentry cache (Section 27a.2), plus the dentry's own inode ID. The VFS
/// resolves this to a cached dentry entry on access.
///
/// This avoids holding a direct pointer into the dentry cache (which is
/// RCU-managed and may be evicted), while still providing O(1) lookup via
/// the dentry hash table.
pub struct DentryRef {
    /// Inode ID of the parent directory containing this dentry.
    pub parent_inode: InodeId,
    /// Name hash of this dentry (SipHash-1-3 of the name component).
    /// Used for O(1) dentry cache lookup without storing the full name.
    pub name_hash: u64,
    /// Inode ID of the dentry itself (for positive dentries).
    pub inode: InodeId,
}
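How the two-part cache key behaves can be sketched in userspace. This is a minimal illustration, not the kernel code: `std`'s `DefaultHasher` stands in for the keyed SipHash-1-3, and the `InodeId` alias, `name_hash`, and `dentry_ref` helpers are hypothetical names introduced here.

```rust
// Userspace sketch of deriving a DentryRef lookup key. DefaultHasher
// stands in for the kernel's keyed SipHash-1-3.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type InodeId = u64;

fn name_hash(name: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    name.hash(&mut h);
    h.finish()
}

#[derive(Debug, PartialEq)]
struct DentryRef {
    parent_inode: InodeId,
    name_hash: u64,
    inode: InodeId,
}

fn dentry_ref(parent: InodeId, name: &[u8], inode: InodeId) -> DentryRef {
    DentryRef { parent_inode: parent, name_hash: name_hash(name), inode }
}

fn main() {
    // Two lookups of the same (parent, name) produce identical cache keys.
    let a = dentry_ref(42, b"etc", 100);
    let b = dentry_ref(42, b"etc", 100);
    assert_eq!(a, b);
    // A different name under the same parent hashes differently
    // (with overwhelming probability for a 64-bit hash).
    let c = dentry_ref(42, b"usr", 101);
    assert_ne!(a.name_hash, c.name_hash);
}
```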

13.2.4 Mount Hash Table

/// Per-namespace mount hash table. Maps `(parent_mount_id, mountpoint_dentry)`
/// pairs to child `Mount` nodes. This is the data structure consulted on
/// every mount-point crossing during path resolution.
///
/// **Why per-namespace**: Linux uses a single global `mount_hashtable` with
/// ~2048 buckets, protected by a per-bucket spinlock for writes and RCU for
/// reads. In container-heavy environments (thousands of namespaces, each with
/// 30-100 mounts), this creates false sharing on hash buckets and limits
/// scalability of concurrent mount operations across namespaces. UmkaOS's
/// per-namespace hash table eliminates cross-namespace contention entirely.
///
/// **Sizing**: The hash table is sized to the number of mounts in the
/// namespace, with a minimum of 32 buckets and a maximum of 1024. The table
/// is resized (doubled) when the load factor exceeds 2.0, and shrunk
/// (halved) when the load factor drops below 0.25. Resizing allocates a
/// new bucket array, rehashes under `mount_lock`, and publishes via RCU.
///
/// **Hash function**: SipHash-1-3 of `(parent_mount_id, mountpoint_inode_id)`.
/// The SipHash key is per-namespace, generated from a CSPRNG at namespace
/// creation. This prevents hash-flooding attacks where an adversary crafts
/// mount points that collide in the hash table.
pub struct MountHashTable {
    /// RCU-protected bucket array. Each bucket is the head of an intrusive
    /// singly-linked list of `Mount` nodes (via `Mount.hash_link`).
    /// Readers traverse under `rcu_read_lock()`; writers modify under
    /// the namespace's `mount_lock`.
    buckets: RcuCell<Box<[MountHashBucket]>>,

    /// Number of entries in the hash table. Used for load-factor
    /// computation during resize decisions. Modified only under `mount_lock`.
    count: u32,

    /// SipHash key for this hash table. Per-namespace, generated at
    /// namespace creation from the kernel CSPRNG.
    hash_key: [u64; 2],
}

/// A single bucket in the mount hash table. Contains the head pointer
/// of an RCU-protected chain of Mount nodes.
struct MountHashBucket {
    /// Head of the intrusive linked list of Mount nodes hashing to this
    /// bucket. Null if the bucket is empty. Readers follow this chain
    /// under RCU; writers modify under `mount_lock`.
    head: AtomicPtr<Mount>,
}

impl MountHashTable {
    /// Look up a child mount at the given `(parent, dentry)` pair.
    ///
    /// Called during path resolution when a dentry has the `DCACHE_MOUNTED`
    /// flag set. Must be called under `rcu_read_lock()`.
    ///
    /// Returns `Some(&Mount)` if a mount is found at this point, or
    /// `None` if the dentry is not a mount point (stale `DCACHE_MOUNTED`
    /// flag — possible after lazy unmount).
    ///
    /// **Performance**: O(1) expected, O(n) worst-case where n is the
    /// chain length (bounded by the load factor < 2.0). No locks — only
    /// `Acquire` loads of the bucket head and chain `next` pointers.
    pub fn lookup<'a>(
        &'a self,
        parent_mount_id: u64,
        mountpoint_inode: InodeId,
        rcu: &'a RcuReadGuard,
    ) -> Option<&'a Mount> {
        let hash = siphash_1_3(
            self.hash_key,
            parent_mount_id,
            mountpoint_inode,
        );
        let bucket_idx = hash as usize % self.bucket_count();
        let bucket = &self.buckets.read(rcu)[bucket_idx];

        let mut current = bucket.head.load(Ordering::Acquire);
        while !current.is_null() {
            // SAFETY: `current` is a valid Mount pointer within an RCU
            // read-side critical section. The Mount node is not freed
            // until after the RCU grace period.
            let mnt = unsafe { &*current };
            if mnt.mount_id_of_parent() == parent_mount_id
                && mnt.mountpoint_inode() == mountpoint_inode
                && !mnt.is_doomed()
            {
                return Some(mnt);
            }
            current = mnt.hash_link.next.load(Ordering::Acquire);
        }
        None
    }
}
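The sizing policy from the doc comment above (grow past load factor 2.0, shrink below 0.25, clamp to [32, 1024] buckets) can be sketched as a pure function. The name `next_bucket_count` and the constants are illustrative, matching the values stated in the comment:

```rust
// Sketch of the MountHashTable resize decision: double when the load
// factor exceeds 2.0, halve when it drops below 0.25, clamped to the
// stated bucket-count bounds.
const MIN_BUCKETS: usize = 32;
const MAX_BUCKETS: usize = 1024;

fn next_bucket_count(buckets: usize, entries: usize) -> usize {
    let load = entries as f64 / buckets as f64;
    if load > 2.0 {
        (buckets * 2).min(MAX_BUCKETS)
    } else if load < 0.25 {
        (buckets / 2).max(MIN_BUCKETS)
    } else {
        buckets
    }
}

fn main() {
    assert_eq!(next_bucket_count(32, 100), 64);        // 100/32 > 2.0 → double
    assert_eq!(next_bucket_count(64, 64), 64);         // load 1.0 → unchanged
    assert_eq!(next_bucket_count(64, 10), 32);         // 10/64 < 0.25 → halve
    assert_eq!(next_bucket_count(32, 1), 32);          // already at minimum
    assert_eq!(next_bucket_count(1024, 10_000), 1024); // capped at maximum
}
```

In the kernel, the actual resize additionally rehashes under `mount_lock` and publishes the new bucket array via RCU; this sketch covers only the decision.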

13.2.5 Mount Namespace

/// A mount namespace. Contains an independent mount tree with its own root
/// mount, hash table, and mount list. Created by `clone(CLONE_NEWNS)` or
/// `unshare(CLONE_NEWNS)`.
///
/// The `vfs_root: Capability<VfsNode>` field in `NamespaceSet` (Section 16.1.2)
/// is updated to point to this namespace's root mount:
///
/// ```rust
/// // Updated NamespaceSet field (replaces the previous Capability<VfsNode>):
/// pub mount_ns: Arc<MountNamespace>,
/// ```
///
/// **Relationship to NamespaceSet**: Each process's `NamespaceSet` holds
/// an `Arc<MountNamespace>`. Multiple processes in the same mount namespace
/// share the same `Arc<MountNamespace>`. When `clone(CLONE_NEWNS)` is called,
/// a new `MountNamespace` is created by cloning the parent's mount tree
/// (via `copy_tree()`).
pub struct MountNamespace {
    /// Unique namespace identifier. Used for `/proc/PID/ns/mnt` inode
    /// number and `setns()` namespace comparison.
    pub ns_id: u64,

    /// Root mount of this namespace's mount tree. This is the mount
    /// that corresponds to "/" for all processes in this namespace.
    /// Updated atomically by `pivot_root()`.
    pub root: RcuCell<Arc<Mount>>,

    /// Ordered list of all mounts in this namespace. The ordering is
    /// topological: parent mounts appear before their children. This
    /// ordering is used by:
    /// - `/proc/PID/mountinfo`: output follows this order
    /// - `umount -a`: unmounts in reverse order (leaves before parents)
    /// - Namespace teardown: unmounts in reverse topological order
    pub mount_list: IntrusiveList<Arc<Mount>>,

    /// Number of mounts in this namespace. Used to enforce the
    /// per-namespace mount count limit (default: 100,000 — matching
    /// Linux's `sysctl fs.mount-max`). Prevents mount-storm DoS attacks
    /// where a compromised container creates millions of mounts.
    pub mount_count: AtomicU64,

    /// Event counter. Incremented on every mount/unmount/remount
    /// operation. Used by `poll()` on `/proc/PID/mountinfo` to detect
    /// mount tree changes. Container runtimes and systemd use this
    /// to react to mount events without periodic scanning.
    pub event_seq: AtomicU64,

    /// Per-namespace mount hash table. Maps `(parent_mount, dentry)` to
    /// child mount for path resolution mount-point crossings.
    pub hash_table: MountHashTable,

    /// Mutex serializing mount tree modifications (mount, unmount,
    /// remount, pivot_root, bind mount, move mount). Readers (path
    /// resolution) do not acquire this lock — they use RCU.
    /// Lock hierarchy level 9 (MOUNT_LOCK): above DENTRY_LOCK (8),
    /// below NET (10). See Section 3.1.5 lock hierarchy table.
    pub mount_lock: Mutex<()>,

    /// Mount ID allocator. Monotonically increasing 64-bit counter.
    /// IDs are never reused within a namespace. At 1 mount/second
    /// sustained, a 64-bit counter would not wrap for ~584 billion years.
    pub id_allocator: AtomicU64,

    /// Peer group ID allocator. Like mount IDs, monotonically increasing
    /// and never reused. Separate from mount IDs because group IDs are
    /// shared across mounts and have a different lifecycle.
    pub group_id_allocator: AtomicU64,

    /// User namespace that owns this mount namespace. Determines
    /// capability checks for mount operations. A process must have
    /// `CAP_MOUNT` in this user namespace (or an ancestor) to modify
    /// the mount tree.
    pub user_ns: Arc<UserNamespace>,
}

13.2.6 DCACHE_MOUNTED Integration

The dentry cache (Section 13.1.2) must track which dentries are mount points. When a filesystem is mounted at a dentry, the VFS sets the DCACHE_MOUNTED flag on that dentry. During path resolution (Section 13.1.3), when the VFS encounters a dentry with DCACHE_MOUNTED set, it calls MountHashTable::lookup() to find the child mount and continues resolution from the child mount's root dentry.

/// Dentry cache entry flags. Stored in the dentry's `flags: AtomicU32` field.
/// Extended to include DCACHE_MOUNTED for mount-point detection.
bitflags! {
    #[repr(transparent)]
    pub struct DcacheFlags: u32 {
        /// This dentry is a mount point — a filesystem is mounted on it.
        /// Set by `do_mount()` when attaching a mount. Cleared by
        /// `do_umount()` when the last mount at this dentry is removed.
        ///
        /// Path resolution checks this flag on every path component.
        /// When set, `MountHashTable::lookup(current_mount, dentry)` is
        /// called to find the child mount. This check is a single atomic
        /// load (~1 cycle) — the flag exists specifically to avoid a hash
        /// table lookup on every path component (only mount points need
        /// the lookup).
        const DCACHE_MOUNTED       = 1 << 0;

        /// Dentry has been disconnected from the tree (e.g., NFS stale
        /// handle, deleted directory that is still open).
        const DCACHE_DISCONNECTED  = 1 << 1;

        /// Dentry is a negative dentry (caches a failed lookup).
        const DCACHE_NEGATIVE      = 1 << 2;

        /// Dentry has filesystem-specific operations (d_revalidate, etc.).
        const DCACHE_OP_MASK       = 1 << 3;
    }
}
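The flag manipulation described above can be sketched with plain atomics. This is a userspace illustration of the set/test/clear pattern that `do_mount()`, path resolution, and `do_umount()` would use on the dentry's `flags: AtomicU32` field; the helper function names are hypothetical.

```rust
// Sketch of DCACHE_MOUNTED handling on a dentry's atomic flags word.
// The constant value matches the bitflags definition above.
use std::sync::atomic::{AtomicU32, Ordering};

const DCACHE_MOUNTED: u32 = 1 << 0;

fn set_mounted(flags: &AtomicU32) {
    // do_mount(): publish the mount-point marker.
    flags.fetch_or(DCACHE_MOUNTED, Ordering::Release);
}

fn clear_mounted(flags: &AtomicU32) {
    // do_umount(): clear only when the last stacked mount is removed.
    flags.fetch_and(!DCACHE_MOUNTED, Ordering::Release);
}

fn is_mounted(flags: &AtomicU32) -> bool {
    // The single load path resolution performs per path component.
    flags.load(Ordering::Acquire) & DCACHE_MOUNTED != 0
}

fn main() {
    let flags = AtomicU32::new(0);
    assert!(!is_mounted(&flags));
    set_mounted(&flags);
    assert!(is_mounted(&flags));
    clear_mounted(&flags);
    assert!(!is_mounted(&flags));
}
```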

13.2.7 Filesystem Context (New Mount API)

The new mount API (Linux 5.2+, used increasingly by container runtimes and systemd) separates mount operations into discrete steps: context creation, configuration, superblock creation, and attachment. This provides better error reporting (errors at each step, not a single mount(2) errno) and supports atomic mount configuration changes.

/// Filesystem context for the new mount API.
///
/// Created by `fsopen()`, configured by `fsconfig()`, and consumed by
/// `fsmount()`. The context holds all the state needed to create a new
/// superblock and mount, accumulated through multiple `fsconfig()` calls.
///
/// This is equivalent to Linux's `struct fs_context`.
///
/// **Lifetime**: The context is reference-counted via a file descriptor
/// returned by `fsopen()`. It is destroyed when the file descriptor is
/// closed. If `fsmount()` has not been called, the context is simply
/// freed (no mount created). If `fsmount()` was called, the context's
/// state has been consumed and the mount exists independently.
pub struct FsContext {
    /// Filesystem type (e.g., "ext4", "tmpfs", "overlay"). Set at
    /// `fsopen()` time and immutable thereafter.
    pub fs_type: Arc<dyn FileSystemOps>,

    /// Filesystem type name (for diagnostics and /proc/mounts).
    pub fs_type_name: Box<[u8]>,

    /// Source device or path (equivalent to mount(2) `source` parameter).
    /// Set via `fsconfig(FSCONFIG_SET_STRING, "source", ...)`.
    pub source: Option<Box<[u8]>>,

    /// Accumulated mount options as key-value pairs. Each `fsconfig()`
    /// call adds or modifies an entry. The filesystem driver validates
    /// options at `fsconfig(FSCONFIG_CMD_CREATE)` time.
    pub options: Vec<(Box<[u8]>, Box<[u8]>)>,

    /// Binary data options (for filesystems that accept binary mount data).
    /// Set via `fsconfig(FSCONFIG_SET_BINARY, ...)`.
    pub binary_options: Vec<(Box<[u8]>, Box<[u8]>)>,

    /// Mount flags to apply to the created mount.
    pub mount_flags: MountFlags,

    /// The created superblock. Set by `fsconfig(FSCONFIG_CMD_CREATE)`,
    /// consumed by `fsmount()`.
    pub superblock: Option<Arc<SuperBlock>>,

    /// Error log. Filesystem drivers write diagnostic messages here
    /// during context creation and configuration. Readable by userspace
    /// via `read()` on the fscontext file descriptor.
    pub log: Vec<u8>,

    /// Purpose of this context: new mount, reconfiguration, or submount.
    pub purpose: FsContextPurpose,

    /// User namespace for permission checks. Set at `fsopen()` time
    /// to the caller's user namespace.
    pub user_ns: Arc<UserNamespace>,
}

/// Purpose of a filesystem context, controlling which operations are valid.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum FsContextPurpose {
    /// Creating a new mount (from `fsopen()`).
    NewMount = 0,
    /// Reconfiguring an existing mount (from `fspick()`).
    Reconfig = 1,
    /// Internal: creating a submount (e.g., automount).
    Submount = 2,
}

13.2.7.1 FsContext Lifecycle and Error Channel

The new mount API separates mount configuration into discrete, verifiable steps. Each step either advances the context state or returns a structured error. The full lifecycle:

Step 1: fd = fsopen("ext4", FSOPEN_CLOEXEC)
  → Validates "ext4" against the filesystem type registry.
  → Allocates FsContext { fs_type: ext4_ops, purpose: NewMount, state: Blank, ... }.
  → Returns an O_RDWR file descriptor backed by the FsContext.
  → FsContext state: Blank.

Step 2: fsconfig(fd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0)
        fsconfig(fd, FSCONFIG_SET_STRING, "errors",  "remount-ro",  0)
        fsconfig(fd, FSCONFIG_SET_FLAG,   "noatime", NULL,          0)
  → Each call appends to FsContext.options: [("source", "/dev/sda1"), ("errors", "remount-ro"), ...].
  → Returns 0 on success; EINVAL if the key is not recognized by the filesystem type.
  → FsContext state: Blank (still accumulating options).

Step 3: fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
  → Calls FileSystemOps::mount(source, flags, options) on the configured filesystem type.
  → On success: FsContext.superblock = Some(sb); state → Ready.
  → On failure: diagnostic message is written to FsContext.log; state → Failed.
    Caller can read the error via read(fd, buf, len) — see Error Channel below.
  → Returns 0 on success; -errno on failure.

Step 4: mnt_fd = fsmount(fd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOATIME)
  → Consumes FsContext.superblock (state must be Ready; returns EBUSY if Mounted,
    EINVAL if Blank or Failed).
  → Allocates a MountNode with MNT_DETACHED flag set.
  → Returns an O_PATH fd referencing the detached mount.
  → FsContext state: Mounted (further fsconfig/fsmount calls return EBUSY).

Step 5: move_mount(mnt_fd, "", AT_FDCWD, "/mnt/data", MOVE_MOUNT_F_EMPTY_PATH)
  → Attaches the detached mount to the namespace mount tree at /mnt/data.
  → Clears MNT_DETACHED from the MountNode.
  → Triggers mount propagation to peer/slave mounts (Section 13.2.10).
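The state machine implied by Steps 1-4 (Blank → Ready/Failed on CMD_CREATE, Ready → Mounted on fsmount) can be sketched as follows. The `State` enum, `Ctx` struct, and raw errno values are illustrative stand-ins for the real FsContext plumbing:

```rust
// Sketch of the FsContext state transitions: options accumulate while
// Blank, FSCONFIG_CMD_CREATE moves to Ready or Failed, and fsmount()
// consumes a Ready context exactly once.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State { Blank, Ready, Failed, Mounted }

struct Ctx { state: State }

impl Ctx {
    fn new() -> Self { Ctx { state: State::Blank } }

    fn cmd_create(&mut self, ok: bool) -> Result<(), i32> {
        if self.state != State::Blank { return Err(-16); } // EBUSY
        self.state = if ok { State::Ready } else { State::Failed };
        if ok { Ok(()) } else { Err(-22) }                 // EINVAL on failure
    }

    fn fsmount(&mut self) -> Result<(), i32> {
        match self.state {
            State::Ready => { self.state = State::Mounted; Ok(()) }
            State::Mounted => Err(-16),                    // EBUSY
            State::Blank | State::Failed => Err(-22),      // EINVAL
        }
    }
}

fn main() {
    let mut ctx = Ctx::new();
    assert_eq!(ctx.fsmount(), Err(-22));   // fsmount before CMD_CREATE: EINVAL
    assert_eq!(ctx.cmd_create(true), Ok(()));
    assert_eq!(ctx.fsmount(), Ok(()));
    assert_eq!(ctx.fsmount(), Err(-16));   // second fsmount: EBUSY
}
```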

open_tree(2) — clone or open a mount:

fd = open_tree(dirfd, path, OPEN_TREE_CLONE | AT_RECURSIVE)
  → Resolves path to a mount.
  → OPEN_TREE_CLONE: creates a detached copy of the mount tree rooted at path,
    identical to a recursive bind mount but without modifying the namespace.
    AT_RECURSIVE: the clone includes all submounts below path.
  → The returned O_PATH fd can be passed to move_mount() to attach elsewhere.
  → Without OPEN_TREE_CLONE: returns an O_PATH fd referencing the existing mount
    without cloning (useful for passing a mount reference across namespaces).

mount_setattr(2) — bulk-modify mount tree flags:

mount_setattr(dirfd, path, AT_RECURSIVE, &mount_attr { attr_set, attr_clr }, sizeof)
  → Resolves path to a mount.
  → AT_RECURSIVE: applies to all mounts in the subtree rooted at path.
  → attr_clr: clears these flags from each mount (applied first).
  → attr_set: sets these flags on each mount (applied after attr_clr).
  → The operation is atomic within the subtree: if validation fails for any mount
    (e.g., clearing MNT_READONLY on a superblock-level read-only filesystem), no
    flags are changed on any mount.
  → Requires CAP_MOUNT.

FsContext Error Channel:

When fsconfig(FSCONFIG_CMD_CREATE) or fsmount() encounters a filesystem-level error (e.g., superblock checksum mismatch, missing required option, device I/O error), the error is not conveyed solely via errno. The filesystem driver writes a human-readable diagnostic string to FsContext.log. The caller retrieves it via read(fd, buf, len) on the FsContext file descriptor:

read(fs_context_fd, buf, len):
  if FsContext.log is empty: return 0 (EOF — no error message pending)
  n = min(len, FsContext.log.len())
  copy_to_user(buf, FsContext.log[..n])
  FsContext.log.drain(..n)
  return n
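The drain semantics of that pseudocode can be sketched directly in Rust. The function name `read_fs_context` is illustrative, and `&mut Vec<u8>` stands in for the context's log buffer:

```rust
// Sketch of the error-channel read: each read() drains up to buf.len()
// bytes from the front of FsContext.log, returning 0 (EOF) once empty.
fn read_fs_context(log: &mut Vec<u8>, buf: &mut [u8]) -> usize {
    if log.is_empty() {
        return 0; // EOF — no error message pending
    }
    let n = buf.len().min(log.len());
    buf[..n].copy_from_slice(&log[..n]);
    log.drain(..n); // consumed bytes are not returned again
    n
}

fn main() {
    let mut log = b"ext4: bad superblock".to_vec();
    let mut buf = [0u8; 8];
    assert_eq!(read_fs_context(&mut log, &mut buf), 8);
    assert_eq!(&buf, b"ext4: ba");
    assert_eq!(read_fs_context(&mut log, &mut buf), 8);
    assert_eq!(&buf, b"d superb");
    assert_eq!(read_fs_context(&mut log, &mut buf), 4); // "lock"
    assert_eq!(read_fs_context(&mut log, &mut buf), 0); // EOF
}
```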

Example error message (readable by system administrators):

ext4: superblock checksum mismatch at block 0: expected 0xdeadbeef, got 0xcafebabe

This approach is superior to the traditional single-errno response: it gives system administrators and container runtimes actionable diagnostic information without requiring a separate diagnostics ioctl or /proc file.

13.2.8 Mount Attribute Structure (mount_setattr)

/// User-visible mount attribute structure for `mount_setattr(2)`.
/// Matches Linux's `struct mount_attr` exactly for ABI compatibility.
///
/// `mount_setattr()` atomically modifies mount properties on a single
/// mount or recursively on an entire mount tree (when `AT_RECURSIVE`
/// is passed). Container runtimes use this for recursive read-only
/// mounts (`MOUNT_ATTR_RDONLY` + `AT_RECURSIVE`).
#[repr(C)]
pub struct MountAttr {
    /// Flags to set on the mount(s). Bits correspond to `MOUNT_ATTR_*`
    /// constants. Applied after `attr_clr` (clear first, then set).
    pub attr_set: u64,

    /// Flags to clear from the mount(s). Applied before `attr_set`.
    pub attr_clr: u64,

    /// Propagation type to set. One of `MS_SHARED`, `MS_PRIVATE`,
    /// `MS_SLAVE`, `MS_UNBINDABLE`, or 0 (no change). Only one
    /// propagation flag may be set; combining them returns `EINVAL`.
    pub propagation: u64,

    /// File descriptor of the user namespace to associate with the
    /// mount (for ID-mapped mounts). Set to 0 if not changing the
    /// mount's user namespace mapping.
    pub userns_fd: u64,
}

/// MOUNT_ATTR_* flag constants for mount_setattr(2).
/// These map to MountFlags but use a separate constant space matching
/// Linux's UAPI.
pub const MOUNT_ATTR_RDONLY: u64      = 0x00000001;
pub const MOUNT_ATTR_NOSUID: u64      = 0x00000002;
pub const MOUNT_ATTR_NODEV: u64       = 0x00000004;
pub const MOUNT_ATTR_NOEXEC: u64      = 0x00000008;
pub const MOUNT_ATTR_NOATIME: u64     = 0x00000010;
pub const MOUNT_ATTR_STRICTATIME: u64 = 0x00000020;
pub const MOUNT_ATTR_NODIRATIME: u64  = 0x00000080;
pub const MOUNT_ATTR_NOSYMFOLLOW: u64 = 0x00200000;
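The clear-first-then-set order documented for `attr_set`/`attr_clr` reduces to a one-line bit expression. A minimal sketch, with `apply_attr` as an illustrative name:

```rust
// Sketch of mount_setattr's flag application order: attr_clr is applied
// first, then attr_set, so a flag named in both ends up set.
const MOUNT_ATTR_RDONLY: u64 = 0x00000001;
const MOUNT_ATTR_NOEXEC: u64 = 0x00000008;
const MOUNT_ATTR_NOATIME: u64 = 0x00000010;

fn apply_attr(current: u64, attr_clr: u64, attr_set: u64) -> u64 {
    (current & !attr_clr) | attr_set
}

fn main() {
    let flags = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOATIME;
    // Clear noatime, set noexec.
    let out = apply_attr(flags, MOUNT_ATTR_NOATIME, MOUNT_ATTR_NOEXEC);
    assert_eq!(out, MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC);
    // A flag in both attr_clr and attr_set ends up set (clear first).
    let out = apply_attr(flags, MOUNT_ATTR_RDONLY, MOUNT_ATTR_RDONLY);
    assert_eq!(out & MOUNT_ATTR_RDONLY, MOUNT_ATTR_RDONLY);
}
```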

13.2.9 Mount Operations — Algorithms

All mount tree modification algorithms require holding the namespace's mount_lock (lock hierarchy level 9, Section 3.1.5). Path resolution (read path) uses only RCU and never acquires mount_lock. The algorithms below describe the kernel-internal implementation; the syscall entry points (mount(2), umount2(2), and the new mount API) perform argument validation and capability checks before calling these internal functions.

13.2.9.1 do_mount — Mount a Filesystem

do_mount(source, target_path, fs_type, flags, data) -> Result<()>

  Capability check: CAP_MOUNT in caller's mount namespace.

  1. Resolve `target_path` to (mount, dentry) via path resolution (Section 13.1.3).
  2. If `flags` contains MS_REMOUNT, delegate to do_remount() (Section 13.2.9.4).
  3. If `flags` contains MS_BIND, delegate to do_bind_mount() (Section 13.2.9.5).
  4. If `flags` contains MS_MOVE, delegate to do_move_mount() (Section 13.2.9.6).
  5. If `flags` contains MS_SHARED|MS_PRIVATE|MS_SLAVE|MS_UNBINDABLE,
     delegate to do_change_propagation() (Section 13.2.9.7).
  6. Otherwise, this is a new filesystem mount:
     a. Look up the filesystem type by name in the filesystem registry.
        If not registered, return ENODEV.
     b. Call `FileSystemOps::mount(source, flags, data)` on the filesystem
        driver. This creates and returns a `SuperBlock`. On failure, return
        the error from the driver.
     c. Check namespace mount count against `mount_max` limit. If exceeded,
        drop the superblock and return ENOSPC.
     d. Allocate a new `Mount` node:
        - `mount_id` from `namespace.id_allocator.fetch_add(1)`
        - `parent` = resolved mount from step 1
        - `mountpoint` = resolved dentry from step 1
        - `root` = superblock's root dentry
        - `superblock` = the SuperBlock from step 6b
        - `flags` = translate MS_* to MountFlags
        - `propagation` = Private (default for new mounts)
        - `group_id` = 0 (private mount has no peer group)
        - `mnt_count` = 0
     e. Acquire `mount_lock`.
     f. Set `DCACHE_MOUNTED` on the target dentry.
     g. Insert the Mount into the mount hash table at
        bucket(parent_mount_id, mountpoint_inode_id).
     h. Add the Mount to the parent's `children` list.
     i. Add the Mount to the namespace's `mount_list` (after its parent
        in topological order).
     j. Increment `namespace.mount_count`.
     k. Propagate: if the parent mount is shared, call
        `propagate_mount()` (Section 13.2.10.1) to replicate this mount
        on all peers and slaves of the parent.
     l. Increment `namespace.event_seq`.
     m. Release `mount_lock`.

13.2.9.2 do_umount — Unmount a Filesystem

do_umount(target_mount, flags) -> Result<()>

  Capability check: CAP_MOUNT in caller's mount namespace.

  1. If `target_mount` is the namespace root and flags does not contain
     MNT_DETACH, return EBUSY (cannot unmount root).
  2. If `target_mount.flags` has MNT_LOCKED and the caller lacks
     CAP_SYS_ADMIN in the mount's owning user namespace, return EPERM.
  3. If `flags` does not contain MNT_DETACH (not lazy):
     a. Check `target_mount.mnt_count`. If > 0, return EBUSY.
     b. Check that `target_mount.children` is empty. If not, return EBUSY
        (sub-mounts must be unmounted first, unless MNT_DETACH is used).
  4. If `flags` contains MNT_FORCE:
     a. Call `FileSystemOps::force_umount()` if the filesystem supports it.
        This causes in-flight I/O to fail with EIO. NFS uses this for stale
        server recovery.
  5. Acquire `mount_lock`.
  6. Set `MNT_DOOMED` on `target_mount.flags` (atomic OR).
     This prevents new path lookups from entering the mount.
  7. Remove `target_mount` from the mount hash table.
  8. Remove `target_mount` from the parent's `children` list.
  9. If the target dentry no longer has any mounts, clear `DCACHE_MOUNTED`
     on the mountpoint dentry. (Multiple mounts can be stacked on the same
     dentry; only clear when the last one is removed.)
  10. Propagate: if the parent mount is shared, call `propagate_umount()`
      (Section 13.2.10.2) to remove corresponding mounts from peers and slaves.
  11. Remove from `namespace.mount_list`.
  12. Decrement `namespace.mount_count`.
  13. Increment `namespace.event_seq`.
  14. Release `mount_lock`.
  15. If `flags` contains MNT_DETACH (lazy unmount):
      a. The mount is now disconnected from the tree but may still be
         referenced by open file descriptors (mnt_count > 0). It will be
         fully freed when the last reference is dropped.
      b. Open files continue to work on the disconnected mount. New path
         lookups cannot reach it.
  16. If not lazy: call `FileSystemOps::unmount()` synchronously.
      If lazy: schedule `FileSystemOps::unmount()` to run when `mnt_count`
      drops to 0 (via a callback registered on the final `Arc::drop`).

13.2.9.3 do_umount_tree — Recursive Unmount

do_umount_tree(root_mount, flags) -> Result<()>

  Used by MNT_DETACH on a mount with sub-mounts, and by namespace teardown.

  1. Acquire `mount_lock`.
  2. Collect all mounts in the subtree rooted at `root_mount` by traversing
     `root_mount.children` recursively. Collect in reverse topological order
     (leaves first, root last).
  3. For each mount in the collected list:
     a. Set MNT_DOOMED.
     b. Remove from hash table.
     c. Remove from parent's children list.
     d. Clear DCACHE_MOUNTED if no other mount remains at that dentry.
     e. Remove from namespace.mount_list.
     f. Decrement namespace.mount_count.
  4. Propagate umount for each removed mount.
  5. Increment namespace.event_seq.
  6. Release `mount_lock`.
  7. For each collected mount: schedule filesystem unmount (immediate
     if mnt_count == 0, deferred if lazy).
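Step 2's reverse-topological collection is a post-order traversal of the mount tree. A toy sketch using indices in place of `Mount` pointers (the function name and tree encoding are illustrative):

```rust
// Sketch of do_umount_tree step 2: collect the subtree leaves-first,
// root last, via post-order traversal over a toy children table.
fn collect_reverse_topo(children: &[Vec<usize>], root: usize, out: &mut Vec<usize>) {
    for &child in &children[root] {
        collect_reverse_topo(children, child, out);
    }
    out.push(root); // a mount is emitted only after all of its children
}

fn main() {
    // 0 = "/", with children 1 = "/proc" and 2 = "/mnt"; 3 = "/mnt/data".
    let children = vec![vec![1, 2], vec![], vec![3], vec![]];
    let mut order = Vec::new();
    collect_reverse_topo(&children, 0, &mut order);
    assert_eq!(order, vec![1, 3, 2, 0]); // every leaf precedes its parent
}
```

Unmounting in this order guarantees that step 3 never detaches a mount that still has attached children.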

13.2.9.4 do_remount — Change Mount Flags/Options

do_remount(target_mount, flags, data) -> Result<()>

  Capability check: CAP_MOUNT in caller's mount namespace.

  1. Translate new `flags` to `MountFlags`.
  2. Extract per-superblock options from `data`.
  3. Acquire `mount_lock`.
  4. Update `target_mount.flags` atomically.
     Note: a remount can change per-mount flags (readonly, nosuid, etc.)
     independently of superblock options. For example, `mount -o remount,ro`
     on a bind mount makes that mount point read-only without affecting
     other mount points of the same filesystem.
  5. If per-superblock options changed, call
     `FileSystemOps::remount(sb, flags, data)`. On failure, restore the
     old flags and return the error.
  6. Increment `namespace.event_seq`.
  7. Release `mount_lock`.

13.2.9.5 do_bind_mount — Bind Mount (MS_BIND)

do_bind_mount(source_path, target_path, flags) -> Result<()>

  Capability check: CAP_MOUNT + read access to source path.

  1. Resolve `source_path` to (source_mount, source_dentry).
  2. Resolve `target_path` to (target_mount, target_dentry).
  3. If `source_mount.propagation == Unbindable`, return EINVAL.
  4. Clone the source mount:
     a. Allocate a new `Mount` node.
     b. `superblock` = `source_mount.superblock` (shared — same filesystem
        instance, same data pages).
     c. `root` = `source_dentry` (bind mount's root is the source path,
        not necessarily the source mount's root — this is how bind mounts
        of subdirectories work).
     d. `flags` = copy from source, then apply any new flags from `flags`.
     e. `propagation` = Private (new bind mounts default to Private).
  5. If `flags` contains MS_REC (recursive bind):
     a. For each sub-mount under `source_mount` (descendants of
        `source_dentry`), clone the mount and attach it at the
        corresponding dentry under the new bind mount.
     b. Skip unbindable mounts.
  6. Acquire `mount_lock`.
  7. Attach the cloned mount(s) at target_path (same steps as
     do_mount steps 6f-6m).
  8. Release `mount_lock`.

13.2.9.6 do_move_mount — Move a Mount (MS_MOVE)

do_move_mount(source_mount, target_path) -> Result<()>

  Capability check: CAP_MOUNT in caller's mount namespace.

  1. Resolve `target_path` to (target_parent_mount, target_dentry).
  2. Verify `target_dentry` is not a descendant of `source_mount`
     (moving a mount underneath itself would create a cycle). Return
     EINVAL if it is.
  3. Verify `source_mount` is not the namespace root. Return EINVAL if it is.
  4. Acquire `mount_lock`.
  5. Remove `source_mount` from the old location:
     a. Remove from hash table at old (parent, dentry) key.
     b. Remove from old parent's children list.
     c. Clear DCACHE_MOUNTED on old mountpoint dentry (if no other
        mount remains).
  6. Attach at new location:
     a. Update `source_mount.parent` to `target_parent_mount`.
     b. Update `source_mount.mountpoint` to `target_dentry`.
     c. Insert into hash table at new (parent, dentry) key.
     d. Add to new parent's children list.
     e. Set DCACHE_MOUNTED on `target_dentry`.
  7. Propagation: moving a mount does not trigger propagation
     (matches Linux behavior).
  8. Increment `namespace.event_seq`.
  9. Release `mount_lock`.
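The cycle check in step 2 is a walk up the parent chain from the target. A toy sketch with a parent-index table standing in for `Mount.parent` links (names are illustrative):

```rust
// Sketch of do_move_mount step 2: walk parent links upward from the
// target; reaching the mount being moved means the move would place
// the mount underneath itself (a cycle) and must return EINVAL.
fn is_descendant(parent: &[Option<usize>], target: usize, source: usize) -> bool {
    let mut cur = Some(target);
    while let Some(m) = cur {
        if m == source {
            return true;
        }
        cur = parent[m]; // None once the namespace root is passed
    }
    false
}

fn main() {
    // 0 is the root; 1 is mounted under 0; 2 is mounted under 1.
    let parent = vec![None, Some(0), Some(1)];
    assert!(is_descendant(&parent, 2, 1));  // moving 1 onto 2 → cycle → EINVAL
    assert!(!is_descendant(&parent, 1, 2)); // moving 2 onto a path in 1 is fine
}
```

The walk is O(depth) and runs under `mount_lock`, so the parent chain cannot change underneath it.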

13.2.9.7 do_change_propagation — Set Propagation Type

do_change_propagation(target_mount, type, flags) -> Result<()>

  Capability check: CAP_MOUNT in caller's mount namespace.

  1. Determine the target mount(s):
     - If `flags` contains MS_REC: target mount and all descendants.
     - Otherwise: target mount only.
  2. Acquire `mount_lock`.
  3. For each target mount:
     a. If changing to Shared:
        - Allocate a new `group_id` from `namespace.group_id_allocator`.
        - Set `mount.group_id = new_id`.
        - If the mount was previously a slave, it becomes shared+slave
          (receives from master AND propagates to peers).
     b. If changing to Private:
        - Remove from peer group ring (`mnt_share`).
        - Remove from master's slave list (if slave).
        - Set `mount.group_id = 0`.
        - Set `mount.mnt_master = None`.
     c. If changing to Slave:
        - If the mount is currently shared, it becomes a slave of its
          former peer group. The first remaining peer becomes the master.
        - Remove from peer group ring.
        - Add to master's `mnt_slave_list`.
        - Set `mount.mnt_master` to the former peer group leader.
        - Mount retains its `group_id` (for mountinfo optional fields).
     d. If changing to Unbindable:
        - Same as Private, plus prevents bind mount of this mount.
     e. Update `mount.propagation`.
  4. Increment `namespace.event_seq`.
  5. Release `mount_lock`.

13.2.10 Mount Propagation Algorithms

Mount propagation ensures that mount/unmount events on shared mount points are replicated across all related mount points. This is essential for container volume mounts: when a volume is mounted on a shared host path, all containers that have a slave relationship to that path see the new mount.

13.2.10.1 propagate_mount

propagate_mount(source_mount, new_child_mount) -> Result<()>

  Called under mount_lock when a mount is added to a shared mount point.

  1. Walk the peer group ring of `source_mount` (via `mnt_share` links).
     For each peer mount (excluding `source_mount` itself):
     a. Clone `new_child_mount` with the peer as parent.
        The clone's mountpoint is the dentry in the peer's filesystem
        that corresponds to `new_child_mount.mountpoint` in the source.
     b. Attach the clone at the peer (insert into hash table, set
        DCACHE_MOUNTED, add to children list, add to mount_list).
     c. If the clone's parent is shared, recursively propagate to
        that peer group (but track visited groups to avoid infinite loops).
  2. Walk the slave list of `source_mount` (via `mnt_slave_list`).
     For each slave mount:
     a. Clone `new_child_mount` with the slave as parent.
     b. Attach the clone at the slave.
     c. If the slave is also shared (shared+slave), propagate to the
        slave's peer group (step 1 applied to the slave's peers).
  3. If the cloning in any propagation step fails (e.g., ENOMEM on
     allocation failure, or ENOSPC when a namespace's mount count limit
     is hit), roll back: remove all clones created in this propagation
     pass and return the error. Propagation is all-or-nothing within a
     single mount operation.
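The visited-group tracking from step 1c can be sketched as a standard graph traversal. The group graph encoding and function name are illustrative; the point is that each peer group is propagated to at most once even when groups reference each other:

```rust
// Sketch of propagate_mount's loop prevention: peer groups reachable
// through shared parents are visited at most once, so mutually-linked
// groups cannot cause infinite propagation.
use std::collections::HashSet;

fn propagate(groups: &[Vec<usize>], start: usize) -> Vec<usize> {
    let mut visited = HashSet::new();
    let mut stack = vec![start];
    let mut order = Vec::new();
    while let Some(g) = stack.pop() {
        if !visited.insert(g) {
            continue; // already propagated to this group
        }
        order.push(g);
        for &next in &groups[g] {
            stack.push(next);
        }
    }
    order
}

fn main() {
    // Group 0 propagates to 1; 1 links back to 0 and on to 2: a cycle.
    let groups = vec![vec![1], vec![0, 2], vec![]];
    let order = propagate(&groups, 0);
    assert_eq!(order.len(), 3); // each group visited exactly once
    assert_eq!(order[0], 0);    // source group is handled first
}
```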

13.2.10.2 propagate_umount

propagate_umount(source_mount) -> Result<()>

  Called under mount_lock when a mount is removed from a shared mount point.

  1. Walk the peer group ring of `source_mount.parent` (the parent must
     be shared for propagation to occur).
     For each peer of the parent:
     a. Look up a child mount at the corresponding mountpoint dentry
        in the peer's mount hash table.
     b. If found and the child's superblock matches `source_mount`'s
        superblock (same filesystem), unmount it (do_umount steps 6-12).
     c. If the child mount has its own children, recursively unmount
        the subtree (do_umount_tree).
  2. Walk the slave list of the parent.
     For each slave:
     a. Same as step 1a-1c, applied to the slave.

13.2.11 Namespace Operations

13.2.11.1 copy_tree — Clone Mount Tree for CLONE_NEWNS

copy_tree(source_root_mount, source_root_dentry) -> Result<Arc<MountNamespace>>

  Called by clone(CLONE_NEWNS) and unshare(CLONE_NEWNS).

  1. Allocate a new `MountNamespace` with fresh `ns_id`, empty hash table,
     and a new `mount_lock`.
  2. The new namespace inherits the parent's `user_ns`.
  3. Clone the source root mount:
     a. Allocate new `Mount` with the same superblock and root dentry.
     b. Flags are copied. Propagation is set to Private (default for
        new namespace — Section 16.1.2 states "CLONE_NEWNS: child's mounts
        are private unless marked shared").
  4. For each mount in the source namespace's mount_list (topological order):
     a. Skip unbindable mounts.
     b. Clone the mount into the new namespace.
      c. Preserve the parent-child relationship (the clone's parent is
         the clone of the original mount's parent).
     d. Insert into the new namespace's hash table and mount_list.
     e. Set propagation:
        - If the source mount is shared: the clone is added to the same
          peer group (shared propagation preserved across CLONE_NEWNS).
          This is critical for container runtimes that rely on propagation.
        - If the source mount is private/slave/unbindable: the clone is
          Private.
  5. Set the new namespace's root to the clone of `source_root_mount`.
  6. Return the new namespace.
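
Step 4c's parent remapping can be sketched with a toy mount list (`Mnt` and `copy_tree` here are illustrative reductions to bare IDs, not the spec's types):

```rust
use std::collections::HashMap;

/// Toy mount: (id, parent id). The root has parent == id.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Mnt { id: u64, parent: u64 }

/// Clone a mount list given in topological order (parents before
/// children), remapping each clone's parent to the clone of the original
/// mount's parent. New ids are allocated from `next_id` upward.
fn copy_tree(mounts: &[Mnt], mut next_id: u64) -> Vec<Mnt> {
    let mut old_to_new: HashMap<u64, u64> = HashMap::new();
    let mut out = Vec::with_capacity(mounts.len());
    for m in mounts {
        let new_id = next_id;
        next_id += 1;
        old_to_new.insert(m.id, new_id);
        // Topological order guarantees the parent was already cloned
        // (the root resolves to its own freshly inserted entry).
        let new_parent = old_to_new[&m.parent];
        out.push(Mnt { id: new_id, parent: new_parent });
    }
    out
}
```

The topological ordering of `mount_list` is what makes the single forward pass sufficient: a child's parent is always already present in the old-to-new map.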

13.2.11.2 pivot_root Integration

The pivot_root(2) algorithm specified in Section 16.1.3 is updated to use the Mount data structure:

pivot_root(new_root_path, put_old_path) -> Result<()>

  Capability check: CAP_SYS_ADMIN in caller's user namespace.
  The caller must not be in the initial mount namespace.

  1. Resolve `new_root_path` to (new_root_mount, new_root_dentry).
     Verify `new_root_dentry` is the root of `new_root_mount` (i.e.,
     new_root is a mount point, not just a directory).
  2. Resolve `put_old_path` to (put_old_mount, put_old_dentry).
     Verify `put_old` is at or under `new_root`.
  3. Verify `new_root_mount` is attached in this namespace (it has a
     parent mount; a detached mount cannot become the root).
  4. Verify `new_root_mount` is not already the namespace root.
  5. Acquire `mount_lock`.
  6. Let `old_root_mount` = namespace's current root mount.
  7. Detach `new_root_mount` from its current position:
     a. Remove from hash table.
     b. Remove from parent's children.
     c. Clear DCACHE_MOUNTED on its old mountpoint.
  8. Reattach `old_root_mount` at `put_old`:
     a. Set `old_root_mount.parent` = `new_root_mount`.
     b. Set `old_root_mount.mountpoint` = the dentry corresponding to
        `put_old` within `new_root_mount`'s filesystem.
     c. Insert `old_root_mount` into hash table at new position.
     d. Set DCACHE_MOUNTED on the put_old dentry.
  9. Set `new_root_mount` as the namespace root:
     a. `new_root_mount.parent` = None (it is now the root).
     b. `new_root_mount.mountpoint` = `new_root_mount.root` (self-referential
        for the root mount).
     c. `namespace.root.update(new_root_mount, &mount_lock_guard)` (RCU
        publish via RcuCell::update).
  10. Update all processes in this namespace whose root or cwd was
      under the old root to point to the new root.
  11. Increment `namespace.event_seq`.
  12. Release `mount_lock`.

  Note: Steps 7-9 are the atomic state change. In-flight path lookups
  that started before step 9 see the old root via RCU (the old
  `RcuCell` value remains valid until the grace period). New lookups
  after step 9 see the new root. This matches the atomicity guarantee
  specified in Section 16.1.3.

13.2.11.3 Namespace Teardown

When a mount namespace is destroyed (all processes exited, all /proc/PID/ns/mnt file descriptors closed, all bind mounts of the namespace file unmounted):

destroy_mount_namespace(ns) -> ()

  1. Acquire `mount_lock`.
  2. Iterate `ns.mount_list` in reverse topological order (leaves first).
  3. For each mount:
     a. Set MNT_DOOMED.
     b. Remove from hash table.
     c. Remove from parent's children.
     d. Remove from peer group and slave lists.
  4. Release `mount_lock`.
  5. For each removed mount (in reverse order):
     a. If `mnt_count == 0`, call `FileSystemOps::unmount()`.
     b. If `mnt_count > 0` (lazy unmount remnants still referenced by
        open file descriptors), defer unmount to final reference drop.
  6. Drop the hash table and mount list.

13.2.12 New Mount API Syscalls

UmkaOS implements the Linux 5.2+ mount API syscalls for compatibility with modern container runtimes (containerd, CRI-O) and systemd. These are thin wrappers around the internal mount operations described above.

Syscall Purpose Capability
fsopen(fs_type, flags) Create a filesystem context CAP_MOUNT
fspick(dirfd, path, flags) Create a reconfiguration context for an existing mount CAP_MOUNT
fsconfig(fd, cmd, key, value, aux) Configure a filesystem context CAP_MOUNT
fsmount(fs_fd, flags, mount_attr) Create a detached mount from a configured context CAP_MOUNT
move_mount(from_dirfd, from_path, to_dirfd, to_path, flags) Attach a detached mount or move an existing mount CAP_MOUNT
open_tree(dirfd, path, flags) Open or clone a mount point as a file descriptor CAP_MOUNT (if OPEN_TREE_CLONE)
mount_setattr(dirfd, path, flags, attr, size) Modify mount attributes, optionally recursively CAP_MOUNT

fsopen flow:
  1. Validate `fs_type` against the filesystem registry.
  2. Allocate an `FsContext` with `purpose = NewMount`.
  3. Return a file descriptor referencing the context.

fsconfig flow (selected commands):
  - FSCONFIG_SET_STRING: set a key-value option string.
  - FSCONFIG_SET_BINARY: set a binary option blob.
  - FSCONFIG_SET_FD: set an option to a file descriptor (e.g., source device).
  - FSCONFIG_CMD_CREATE: validate all options and create the superblock by
    calling FileSystemOps::mount(). On success, the superblock is stored
    in FsContext.superblock. On failure, diagnostic messages are written
    to the context's error log.
  - FSCONFIG_CMD_RECONFIGURE: for fspick contexts, apply new options to
    the existing superblock via FileSystemOps::remount().

fsmount flow:
  1. Consume the superblock from the FsContext.
  2. Allocate a Mount node with the MNT_DETACHED flag set.
  3. The mount is not yet attached to any namespace or visible to path
     resolution. It exists only as a detached object referenced by the
     returned file descriptor.
  4. Return an O_PATH file descriptor referencing the detached mount.

move_mount flow:
  1. Resolve the source (detached mount fd or existing mount path).
  2. Resolve the target path.
  3. If the source is detached (MNT_DETACHED):
     a. Clear MNT_DETACHED.
     b. Attach to the namespace via do_mount steps 6e-6m.
  4. If the source is an existing mount:
     a. Delegate to do_move_mount() (Section 13.2.9.6).

open_tree flow:
  1. Resolve the path to a mount.
  2. If OPEN_TREE_CLONE:
     a. Clone the mount (like do_bind_mount without attaching).
     b. The clone is detached (MNT_DETACHED).
     c. If OPEN_TREE_CLONE | AT_RECURSIVE: recursively clone the subtree.
  3. Return an O_PATH file descriptor.

mount_setattr flow:
  1. Resolve the path to a mount.
  2. Validate attr_set and attr_clr do not conflict.
  3. Acquire mount_lock.
  4. If AT_RECURSIVE:
     a. Collect all mounts in the subtree.
     b. Validate the changes are valid for all mounts (e.g., clearing
        MNT_READONLY on a mount whose superblock is read-only is invalid).
     c. If validation fails for any mount, return error (no partial
        changes).
     d. Apply attr_clr then attr_set to all mounts atomically.
  5. If not recursive: apply to the single mount.
  6. If attr.propagation != 0: change propagation type (Section 13.2.9.7).
  7. Increment namespace.event_seq.
  8. Release mount_lock.
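
The fsopen → fsconfig → fsmount lifecycle behaves as a small state machine. A minimal sketch, with `FsContext` reduced to toy fields and the superblock stood in by a string (all names here are illustrative, not the kernel's actual types):

```rust
/// Toy FsContext: accumulates options until FSCONFIG_CMD_CREATE freezes
/// them into a "superblock", which fsmount then consumes exactly once.
struct FsContext {
    fs_type: String,
    options: Vec<(String, String)>,
    superblock: Option<String>, // stand-in for the created superblock
}

impl FsContext {
    /// fsopen(2): allocate a fresh configuration context.
    fn fsopen(fs_type: &str) -> FsContext {
        FsContext { fs_type: fs_type.into(), options: Vec::new(), superblock: None }
    }

    /// fsconfig(2) FSCONFIG_SET_STRING: rejected once the superblock exists.
    fn set_string(&mut self, key: &str, val: &str) -> Result<(), &'static str> {
        if self.superblock.is_some() {
            return Err("EBUSY: context already created");
        }
        self.options.push((key.into(), val.into()));
        Ok(())
    }

    /// fsconfig(2) FSCONFIG_CMD_CREATE: validate options, create the superblock.
    fn cmd_create(&mut self) -> Result<(), &'static str> {
        if self.superblock.is_some() {
            return Err("EBUSY: superblock already created");
        }
        self.superblock = Some(format!("{}:{} opts", self.fs_type, self.options.len()));
        Ok(())
    }

    /// fsmount(2): consume the superblock into a detached mount.
    fn fsmount(&mut self) -> Result<String, &'static str> {
        self.superblock.take().ok_or("EINVAL: no superblock created yet")
    }
}
```

The ordering constraints fall out of the `Option`: fsmount before CMD_CREATE fails, reconfiguring after CMD_CREATE fails, and a second fsmount fails because the superblock was consumed.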

13.2.13 Mount Introspection Syscalls

Linux 6.8 introduced statmount(2) and listmount(2) as structured replacements for parsing /proc/PID/mountinfo. UmkaOS implements both for container introspection tools and future-compatible userspace.

Syscall Purpose Capability
statmount(req, buf, bufsize, flags) Query detailed mount information by mount ID None (own namespace)
listmount(req, buf, bufsize, flags) List child mount IDs of a given mount None (own namespace)

statmount: Returns a struct statmount containing the mount's ID, parent ID, mount flags, propagation type, peer group ID, master mount ID, filesystem type, mount source, mount point path, and superblock options. The request specifies which fields to populate via a bitmask, avoiding unnecessary work (e.g., path resolution for mount point is skipped if STATMOUNT_MNT_POINT is not requested).

listmount: Returns an array of 64-bit mount IDs for the child mounts of a given mount. Supports cursor-based iteration: the caller passes the last seen mount ID, and listmount returns mount IDs after that cursor. This handles concurrent mount/unmount gracefully (mounts added after the cursor are seen; mounts removed are skipped).
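
The cursor contract can be sketched over a sorted slice of mount IDs (helper names are illustrative; a real caller would drive listmount(2) with the same loop):

```rust
/// One listmount call: return up to `buf` IDs strictly greater than
/// `cursor`. `ids` must be sorted ascending; mount IDs are nonzero,
/// so a cursor of 0 means "start from the beginning".
fn list_after(ids: &[u64], cursor: u64, buf: usize) -> Vec<u64> {
    ids.iter().copied().filter(|&id| id > cursor).take(buf).collect()
}

/// Drain all children by advancing the cursor to the last ID seen,
/// exactly as a userspace caller iterates listmount(2).
fn list_all(ids: &[u64], batch: usize) -> Vec<u64> {
    let mut out = Vec::new();
    let mut cursor = 0;
    loop {
        let chunk = list_after(ids, cursor, batch);
        match chunk.last() {
            Some(&last) => cursor = last,
            None => break,
        }
        out.extend(chunk);
    }
    out
}
```

Because the cursor is an ID rather than an offset, a mount unmounted between calls is simply skipped and a mount added above the cursor is picked up, which is the graceful-concurrency property described above.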

13.2.14 /proc/PID/mountinfo Format

Each process exposes its mount namespace's mount tree through /proc/PID/mountinfo and /proc/PID/mounts. These files are read by systemd, Docker, findmnt, df, mountpoint, and other tools.

mountinfo line format (one line per mount, matching Linux exactly):

<mount_id> <parent_id> <major>:<minor> <root> <mount_point> <mount_options> <optional_fields> - <fs_type> <mount_source> <super_options>
Field Source Example
mount_id Mount.mount_id 36
parent_id Mount.parent.mount_id (self for root) 35
major:minor SuperBlock.dev major:minor 98:0
root Path of mount root within the filesystem / or /subdir
mount_point Path of mount point relative to process root /mnt/data
mount_options Per-mount flags as comma-separated options rw,noatime,nosuid
optional fields Propagation: shared:N, master:N, propagate_from:N shared:1 master:2
separator Literal hyphen -
fs_type Filesystem type name ext4
mount_source Mount.device_name /dev/sda1
super_options From FileSystemOps::show_options() rw,errors=continue

Implementation: The VFS iterates the namespace's mount_list under rcu_read_lock() and formats each line. The mount_list's topological ordering ensures that parent mounts appear before children (matching Linux's output order).

/proc/PID/mounts: A simplified view matching the old /etc/mtab format: <device> <mount_point> <fs_type> <options> 0 0. Generated from the same mount_list, omitting mount IDs and propagation fields.
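
A minimal formatter for the line layout above, using the example values from the field table (the helper name is illustrative):

```rust
/// Format one /proc/PID/mountinfo line:
/// <mount_id> <parent_id> <major>:<minor> <root> <mount_point>
/// <mount_options> <optional_fields> - <fs_type> <mount_source> <super_options>
/// The optional-fields block may be empty, in which case the separator
/// hyphen directly follows the mount options.
fn mountinfo_line(
    mount_id: u64, parent_id: u64, major: u32, minor: u32,
    root: &str, mount_point: &str, options: &str,
    optional: &[String], fs_type: &str, source: &str, super_opts: &str,
) -> String {
    let opt = if optional.is_empty() {
        String::new()
    } else {
        format!("{} ", optional.join(" "))
    };
    format!(
        "{} {} {}:{} {} {} {} {}- {} {} {}",
        mount_id, parent_id, major, minor, root, mount_point, options,
        opt, fs_type, source, super_opts,
    )
}
```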

13.2.15 Path Resolution Integration

This section details how the mount tree integrates with the path resolution algorithm described in Section 13.1.3.

Mount crossing in RCU-walk (fast path):

resolve_component_rcu(current_mount, current_dentry, name):
  1. Look up `name` in the dentry cache: dentry = dcache_lookup(current_dentry, name).
  2. If dentry is not found: fall through to ref-walk (cache miss).
  3. If dentry.flags has DCACHE_MOUNTED:
     a. Call MountHashTable::lookup(current_mount.mount_id, dentry.inode, &rcu_guard).
     b. If a child mount is found:
        - current_mount = child_mount
        - dentry = child_mount.root
        - If child_mount.root also has DCACHE_MOUNTED, repeat step 3
          (stacked mounts — rare but legal).
     c. If no child mount found: DCACHE_MOUNTED is stale (race with
        umount). Clear the flag lazily and continue with the dentry.
  4. Return (current_mount, dentry).

Mount crossing in ref-walk (slow path):

resolve_component_ref(current_mount, current_dentry, name):
  1. Same as RCU-walk step 1, but takes a dentry reference count.
  2. Same DCACHE_MOUNTED check.
  3. If mount crossing:
     a. Call MountHashTable::lookup() under rcu_read_lock().
     b. If found: increment child_mount.mnt_count (atomic add).
     c. Decrement current_mount.mnt_count.
     d. current_mount = child_mount; dentry = child_mount.root.
  4. Return (current_mount, dentry).

".." traversal across mount boundaries:

resolve_dotdot(current_mount, current_dentry):
  1. If current_dentry == current_mount.root:
     - We are at the root of this mount. ".." should cross into the parent
       mount.
     - If current_mount.parent is None: we are at the namespace root.
       ".." resolves to the root itself (cannot go above /).
     - Otherwise: current_mount = current_mount.parent.
       current_dentry = current_mount.mountpoint.
       (Continue resolving ".." from the parent mount's mountpoint.)
  2. If current_dentry != current_mount.root:
     - Normal ".." within the mount's filesystem.
     - current_dentry = current_dentry.parent.
  3. Return (current_mount, current_dentry).
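
The traversal above can be exercised on a toy mount tree: mounts are indices, `parents[m]` gives (parent mount, mountpoint dentry in the parent), and dentry 0 is each mount's root (all names are illustrative):

```rust
/// ".." traversal with mount crossing. `dparents[m][d]` is the parent
/// dentry of dentry `d` within mount `m` (the root's parent is itself).
fn resolve_dotdot(
    parents: &[Option<(usize, usize)>],
    dparents: &[Vec<usize>],
    mut m: usize,
    mut d: usize,
) -> (usize, usize) {
    // While standing at a mount root, cross into the parent mount and
    // continue from the mountpoint dentry. The loop (rather than a
    // single step) handles stacked mounts whose mountpoint is itself
    // another mount's root.
    while d == 0 {
        match parents[m] {
            None => return (m, 0), // namespace root: ".." stays at /
            Some((pm, mp)) => { m = pm; d = mp; }
        }
    }
    // Normal ".." within the mount's filesystem.
    (m, dparents[m][d])
}
```

In the test below, mount 1 is mounted on `/a/b` of mount 0; ".." from mount 1's root lands on `/a`, matching step 1's cross-then-ascend behavior.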

13.2.16 Performance Characteristics

Operation Cost Notes
Mount hash lookup (RCU read) ~5-15 ns SipHash + 1-2 pointer chases, no locks, no atomics. Occurs on every mount-point crossing during path resolution.
DCACHE_MOUNTED check ~1 ns Single atomic load of dentry flags. Occurs on every path component — the gate that avoids hash lookup on non-mount-point dentries.
Mount (new filesystem) ~1-10 us Dominated by filesystem driver's mount() (superblock creation). Mount tree insertion is ~200 ns under lock.
Unmount ~500 ns - 5 us Hash removal + propagation. Filesystem unmount() cost varies (ext4 journal flush vs. tmpfs instant).
Bind mount ~300 ns Mount node clone + hash insertion. No filesystem I/O.
Bind mount (recursive, N sub-mounts) ~300*N ns Linear in subtree size.
Propagation (mount, M peers) ~300*M ns One clone per peer. Propagation to slaves adds per-slave overhead.
/proc/PID/mountinfo generation ~50 ns/mount One line per mount. 100-mount namespace: ~5 us total.
copy_tree (CLONE_NEWNS, N mounts) ~500*N ns Clone all mounts. 100-mount namespace: ~50 us.
pivot_root ~1 us Two hash table mutations + RCU publish.

Memory overhead per mount: ~320 bytes for the Mount struct (including all intrusive list nodes and propagation fields) plus ~16 bytes for the hash table entry, ~336 bytes in total. A container with 100 mounts consumes ~33 KiB of mount tree metadata. A system with 10,000 containers (1 million mounts total) consumes ~320 MiB — proportional to the actual number of mounts, not pre-allocated.
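
The overhead figures compose as plain arithmetic (constants taken from the estimates above):

```rust
/// Per-mount metadata cost from the figures above.
const MOUNT_STRUCT_BYTES: u64 = 320; // Mount struct, list nodes, propagation
const HASH_ENTRY_BYTES: u64 = 16;    // mount hash table entry
const PER_MOUNT_BYTES: u64 = MOUNT_STRUCT_BYTES + HASH_ENTRY_BYTES; // 336

/// Total mount-tree metadata for `mounts` mounts. Linear in the number
/// of mounts; nothing is pre-allocated.
fn mount_tree_bytes(mounts: u64) -> u64 {
    mounts * PER_MOUNT_BYTES
}
```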

13.2.17 Cross-References

  • Section 3.1.5 (Lock Hierarchy): MOUNT_LOCK at level 9, between DENTRY_LOCK (8) and SOCK_LOCK (10).
  • Section 8.1.1 (Capabilities): CAP_MOUNT (bit 70) gates all mount operations. CAP_SYS_ADMIN (bit 21) required for pivot_root and MNT_LOCKED override.
  • Section 13.1.1 (VFS Architecture): FileSystemOps::mount() creates the superblock consumed by do_mount(). FileSystemOps::unmount() is called by do_umount() after tree removal.
  • Section 13.1.2 (Dentry Cache): DCACHE_MOUNTED flag triggers mount hash table lookup during path resolution.
  • Section 13.1.3 (Path Resolution): RCU-walk and ref-walk mount crossing detailed in Section 13.2.15.
  • Section 13.1.4 (Mount Namespace and Capability-Gated Mounting): The capability table and propagation type summary specified there are implemented by the data structures in this section.
  • Section 13.4 (overlayfs): OverlayFs::mount() creates an OverlaySuperBlock consumed via the standard do_mount() path.
  • Section 16.1.2 (Namespace Implementation): NamespaceSet.vfs_root is updated to NamespaceSet.mount_ns: Arc<MountNamespace>, providing access to the full mount tree rather than just a capability handle to the root VFS node.
  • Section 16.1.3 (pivot_root): The step-by-step algorithm there is superseded by the precise Mount-struct-based algorithm in Section 13.2.11.2.
  • Section 16.1.5 (Namespace Inheritance): CLONE_NEWNS triggers copy_tree() (Section 13.2.11.1).

13.3 Distribution-Aware VFS Extensions

When filesystems are shared across cluster nodes (Section 14.5), the VFS must handle cache validity, locking granularity, and metadata coherence across node boundaries. Linux's VFS was designed for local filesystems with network filesystem support bolted on afterward, resulting in several systemic performance problems. UmkaOS's VFS addresses these by integrating with the Distributed Lock Manager (Section 14.6).

Linux Problem Impact UmkaOS Fix
Dentry cache assumes local validity Remote rename/unlink leaves stale dentries on other nodes Callback-based invalidation: DLM lock downgrade (Section 14.6.8) triggers targeted dentry invalidation for affected directory entries only
d_revalidate() on every lookup for network FS Extra round-trip per path component on NFS/CIFS/GFS2 Lease-attached dentries: dentry is valid while parent directory DLM lock is held (Section 14.6.6); zero revalidation cost during lease period
Inode-level locking forces false sharing Two nodes writing to different byte ranges of the same file serialize on the inode lock Range locks in VFS: DLM byte-range lock resources (Section 14.6.4) allow concurrent operations on different ranges of the same file
No concurrent directory operations mkdir and create in the same directory serialize globally Per-bucket directory locks: hash-based directory formats (ext4 htree, GFS2 leaf blocks) use separate DLM resources per hash bucket
readdir() + stat() = 2N round-trips for N files ls -l on a 1000-file remote directory requires 2001 operations getdents_plus() returning attributes with directory entries (analogous to NFS READDIRPLUS but in-kernel, avoiding the userspace/kernel boundary per entry). getdents_plus() is an UmkaOS VFS-internal operation (not a new syscall): the VFS's readdir implementation populates both the directory entry and its InodeAttr in a single filesystem callback, caching the attributes for immediate use by a subsequent getattr() / stat() call. Userspace accesses this via the standard getdents64(2) + statx(2) syscalls — the optimization is transparent, eliminating redundant disk or DLM round-trips inside the kernel.
Full inode cache invalidation on lock drop Dropping a DLM lock on an inode discards all cached metadata, even fields that haven't changed Per-field inode validity: mtime/size read from DLM Lock Value Block (Section 14.6.3); permissions and ownership from local capability cache; only stale fields refreshed on lock reacquire

Integration with Section 14.6 DLM:

  • Dentry lease binding: When the VFS caches a dentry for a clustered filesystem, it records the DLM lock resource that protects the parent directory. The dentry remains valid as long as that lock is held at CR (Concurrent Read) mode or stronger. When the DLM downgrades or releases the lock (due to contention from another node), the VFS receives a callback and invalidates only the affected dentries — not the entire dentry subtree.

  • Range-aware writeback: When a process holds a DLM byte-range lock and writes to pages within that range, the VFS tracks dirty pages per lock range (not per inode). On lock downgrade, only dirty pages within the lock's range are flushed (Section 14.6.8). This eliminates the Linux problem where dropping a lock on a 100 GB file requires flushing all dirty pages, even if only 4 KB was modified.

  • Attribute caching via LVB: The VFS reads frequently-accessed inode attributes (i_size, i_mtime, i_blocks) from the DLM Lock Value Block (Section 14.6.3) rather than performing a disk read on every lock acquire. The LVB is updated by the last writer on lock release, so readers always get current values at the cost of a single RDMA operation (~3-4 μs) instead of a disk I/O (~10-15 μs for NVMe).


13.4 overlayfs: Union Filesystem for Containers

Use case: Container image layering. Docker, containerd, Podman, and Kubernetes all use overlayfs as their primary storage driver. A container image is a stack of read-only filesystem layers; overlayfs merges them with a writable upper layer to present a unified view. Without overlayfs, container runtimes fall back to copy-the-entire-layer approaches (VFS copy, naive snapshots), which are orders of magnitude slower for image pull and container startup.

Tier: Tier 1 (runs in the VFS isolation domain alongside umka-vfs).

Rationale for Tier 1 (not Tier 2): overlayfs is a stacking filesystem — it sits between the VFS and the underlying filesystem drivers (ext4, XFS, btrfs, tmpfs). Every path lookup, readdir, and file open in a container traverses overlayfs. Placing it in Tier 2 (Ring 3, process boundary) would add two domain crossings per VFS operation inside every container, roughly doubling the path resolution overhead. Since overlayfs delegates all storage I/O to the underlying filesystem (which is itself a Tier 1 driver), overlayfs never touches hardware directly — it is a pure VFS client. Its code complexity is moderate (~3,000 SLOC in Linux) and auditable. The crash containment boundary is the VFS domain: if overlayfs panics, the VFS recovery protocol (Section 13.1) handles it.

Design: overlayfs implements FileSystemOps, InodeOps, FileOps, and DentryOps from the VFS trait system (Section 13.1.1). It does not introduce new VFS abstractions — it composes existing ones.

13.4.1 Mount Options and Configuration

/// Mount options parsed from the `data` parameter of `FileSystemOps::mount()`.
/// Encoded as comma-separated key=value pairs in the `data: &[u8]` slice,
/// matching Linux's overlayfs mount option syntax exactly.
///
/// Example mount command:
/// ```text
/// mount -t overlay overlay \
///   -o lowerdir=/lower2:/lower1,upperdir=/upper,workdir=/work \
///   /merged
/// ```
///
/// For read-only overlays (no upperdir/workdir), only lowerdir is required.
/// This is used for container image inspection without a writable layer.
pub struct OverlayMountOptions {
    /// Colon-separated list of lower layer paths, ordered from topmost to
    /// bottommost. At least one lower layer is required. Maximum 500 layers
    /// (matching Linux's limit, which Docker/containerd never approach —
    /// typical images have 5-20 layers).
    ///
    /// Each path must be an existing directory on a mounted filesystem.
    /// The VFS resolves each path to an `InodeId` at mount time and holds
    /// a reference to the underlying superblock for the mount's lifetime.
    ///
    /// Heap-allocated rather than inline (`ArrayVec<_, 500>` would be up to
    /// 4000 bytes on the stack). The 500-layer maximum is enforced at mount
    /// validation time. Mount processing is a rare, non-hot-path operation
    /// where heap allocation is acceptable.
    pub lower_dirs: Box<[InodeId]>,

    /// Upper layer directory (read-write). `None` for read-only overlays.
    /// Must reside on a filesystem that supports: xattr (for whiteouts and
    /// metacopy markers), rename with RENAME_WHITEOUT, and mknod (for
    /// character-device whiteouts). The upper filesystem must be writable.
    pub upper_dir: Option<InodeId>,

    /// Work directory for atomic copy-up staging. Required if `upper_dir`
    /// is set. Must be on the **same filesystem** as `upper_dir` (same
    /// superblock) — copy-up uses rename(2) from workdir to upperdir,
    /// which requires same-device semantics. The VFS verifies this at
    /// mount time by comparing `SuperBlock` identity.
    ///
    /// The workdir must be empty at mount time. overlayfs creates a `work/`
    /// subdirectory inside it for staging, and an `index/` subdirectory
    /// for NFS export handles (if enabled).
    pub work_dir: Option<InodeId>,

    /// Enable metadata-only copy-up. When true, operations that modify
    /// only metadata (chmod, chown, utimes, setxattr) copy only the
    /// inode metadata to the upper layer, deferring data copy until the
    /// first write. Dramatically reduces container startup I/O: a
    /// `chmod` on a 200 MB binary copies ~4 KB of metadata instead of
    /// 200 MB of data.
    ///
    /// Default: true (matches Docker/containerd default since Linux 5.11+
    /// with kernel config `OVERLAY_FS_METACOPY=y`).
    ///
    /// Security restriction: this option is silently forced to `false`
    /// when the mount is user-namespace-influenced (i.e., when the caller
    /// does not hold `CAP_SYS_ADMIN` in the initial user namespace). In
    /// such mounts the upper layer uses `user.overlay.*` xattrs, which
    /// are writable by the file owner without privilege; a forged
    /// metacopy xattr could redirect reads to arbitrary lower-layer files.
    /// See [Section 13.4.6.1](#13461-metacopy-trust-model-and-security-constraints)
    /// for the complete trust model and enforcement mechanism.
    pub metacopy: bool,

    /// Directory rename/redirect handling.
    ///
    /// - `On`: Enable redirect xattrs for directory renames. Required
    ///   for rename(2) on merged directories to succeed (without this,
    ///   rename of a directory that exists in a lower layer returns EXDEV).
    /// - `Follow`: Follow existing redirect xattrs but do not create new
    ///   ones. Safe for mounting layers created by a trusted system.
    /// - `NoFollow`: Ignore redirect xattrs entirely. Most restrictive.
    /// - `Off`: Disable redirect handling; directory renames return EXDEV.
    ///
    /// Default: `On` (required by Docker/containerd for correct semantics).
    pub redirect_dir: RedirectDirMode,

    /// Volatile mode. When enabled, overlayfs skips all fsync/sync_fs calls
    /// to the upper filesystem. A crash or power loss may leave the upper
    /// layer in an inconsistent state (workdir staging artifacts, partial
    /// copy-ups). The overlay refuses to remount if it detects a previous
    /// volatile session that was not cleanly unmounted.
    ///
    /// Docker uses volatile mode for ephemeral containers where persistence
    /// is not needed (CI runners, build containers, test environments).
    ///
    /// Default: false.
    pub volatile: bool,

    /// Use `user.overlay.*` xattr namespace instead of `trusted.overlay.*`.
    /// Required for unprivileged (rootless) overlayfs mounts where the
    /// calling process lacks CAP_SYS_ADMIN in the initial user namespace.
    /// The `user.*` xattr namespace is writable by the file owner without
    /// special capabilities.
    ///
    /// Default: false (use `trusted.overlay.*`).
    pub userxattr: bool,

    /// Extended inode number mode. Controls how overlayfs composes inode
    /// numbers to guarantee uniqueness across layers.
    ///
    /// - `On`: Compose inode numbers using upper bits for layer index.
    ///   Requires underlying filesystems to use <32-bit inode numbers
    ///   (ext4, XFS with `inode32` mount option).
    /// - `Off`: Use raw underlying inode numbers. Risk of collisions
    ///   across layers (two files on different layers may share an ino).
    /// - `Auto`: Enable if all underlying filesystems have small enough
    ///   inode numbers; disable otherwise.
    ///
    /// Default: `Auto`.
    pub xino: XinoMode,

    /// NFS export support. When enabled, overlayfs maintains an index
    /// directory (inside workdir) that maps NFS file handles to overlay
    /// dentries. Required if the overlay mount will be exported via NFS.
    ///
    /// Default: false (NFS export of container filesystems is uncommon).
    pub nfs_export: bool,

    /// fs-verity digest validation for lower layer files. When enabled,
    /// overlayfs verifies that lower-layer files have valid fs-verity
    /// digests matching the expected values stored in the upper layer's
    /// metacopy xattr. Provides content integrity for container image
    /// layers without requiring dm-verity on the entire block device.
    ///
    /// - `Off`: No verity checking.
    /// - `On`: Verify if digest is present; allow files without digest.
    /// - `Require`: Reject files that lack a valid fs-verity digest.
    ///
    /// Default: `Off`.
    pub verity: VerityMode,
}

/// Redirect directory mode.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum RedirectDirMode {
    /// Create and follow redirect xattrs.
    On,
    /// Follow existing redirect xattrs but do not create new ones.
    Follow,
    /// Do not follow redirect xattrs.
    NoFollow,
    /// Disable redirect handling; directory renames return EXDEV.
    Off,
}

/// Extended inode number composition mode.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum XinoMode {
    /// Always compose inode numbers.
    On,
    /// Never compose inode numbers.
    Off,
    /// Compose if underlying inode numbers fit.
    Auto,
}
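
One plausible composition for XinoMode::On, assuming the real inode number occupies the low 32 bits and the layer index the bits above (the exact bit split here is an illustration, not the spec's layout):

```rust
/// Compose a unique overlay inode number from (layer index, real ino).
/// Returns None when the underlying ino does not fit in the low 32 bits,
/// which is the condition under which XinoMode::Auto falls back to Off.
fn xino_compose(layer: u16, real_ino: u64) -> Option<u64> {
    if real_ino >> 32 != 0 {
        return None; // underlying ino too large to compose
    }
    Some(((layer as u64) << 32) | real_ino)
}

/// Recover (layer index, real ino) from a composed number.
fn xino_split(xino: u64) -> (u16, u64) {
    ((xino >> 32) as u16, xino & 0xFFFF_FFFF)
}
```

Because the layer index lives in disjoint bits, two files with the same raw ino on different layers can never collide, which is the uniqueness guarantee XinoMode::On provides.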

/// fs-verity enforcement mode for lower layer files.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum VerityMode {
    /// No verity checking.
    Off,
    /// Verify if digest present; allow files without digest.
    On,
    /// Reject lower files without valid fs-verity digest.
    Require,
}
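
A minimal sketch of the option-string split (plain strings only; the real parser resolves paths to `InodeId`s, handles escaped separators, and validates every key):

```rust
/// Split "lowerdir=/l2:/l1,upperdir=/u,workdir=/w" into (key, value)
/// pairs. Bare flags such as "volatile" get an empty value.
fn parse_overlay_options(data: &str) -> Vec<(String, String)> {
    let mut opts = Vec::new();
    for item in data.split(',').filter(|s| !s.is_empty()) {
        match item.split_once('=') {
            Some((k, v)) => opts.push((k.to_string(), v.to_string())),
            None => opts.push((item.to_string(), String::new())),
        }
    }
    opts
}

/// Expand lowerdir into its colon-separated layers, topmost first,
/// enforcing the 500-layer maximum at validation time.
fn lower_layers(opts: &[(String, String)]) -> Result<Vec<String>, String> {
    let v = opts.iter().find(|(k, _)| k == "lowerdir")
        .map(|(_, v)| v)
        .ok_or("lowerdir is required")?;
    let layers: Vec<String> = v.split(':').map(str::to_string).collect();
    if layers.len() > 500 {
        return Err("at most 500 lower layers".into());
    }
    Ok(layers)
}
```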

13.4.2 Core Data Structures

/// Overlay filesystem superblock state. One instance per overlay mount.
/// Created by `OverlayFs::mount()` and stored in the `SuperBlock`'s
/// filesystem-private field.
pub struct OverlaySuperBlock {
    /// Lower layer inodes (topmost first). Index 0 is the highest-priority
    /// lower layer (searched first after upper). These are directory inodes
    /// on the underlying filesystems, held for the mount's lifetime.
    ///
    /// Heap-allocated rather than inline (`ArrayVec<_, 500>` would exceed
    /// the safe stack frame budget — each `OverlayLayer` contains an
    /// `InodeId`, a `SuperBlock` reference, and a `u16` index). The
    /// 500-layer maximum is enforced at mount validation time. Mount
    /// processing is a rare, non-hot-path operation where heap allocation
    /// is acceptable.
    pub lower_layers: Box<[OverlayLayer]>,

    /// Upper layer state. `None` for read-only overlay mounts.
    pub upper_layer: Option<OverlayLayer>,

    /// Work directory inode on the upper filesystem. Used as a staging
    /// area for atomic copy-up operations.
    pub work_dir: Option<InodeId>,

    /// Index directory inode (inside workdir). Used for NFS export file
    /// handle resolution and hard link tracking across copy-up.
    pub index_dir: Option<InodeId>,

    /// Parsed mount options (immutable after mount).
    pub config: OverlayMountOptions,

    /// The xattr prefix used for overlay-private xattrs. Either
    /// `"trusted.overlay."` (privileged) or `"user.overlay."` (userxattr
    /// mode). Stored once to avoid branching on every xattr operation.
    pub xattr_prefix: &'static [u8],

    /// Volatile session marker. If volatile mode is enabled, this is set
    /// to true after creating the `$workdir/work/incompat/volatile`
    /// sentinel directory. On mount, if the sentinel exists from a
    /// previous unclean session, mount fails with EINVAL.
    pub volatile_active: bool,

    /// True if this overlay was mounted from within a user namespace or
    /// if the upper layer's filesystem mount is owned by a non-initial
    /// user namespace. When true, `metacopy` and `redirect_dir=on` are
    /// disabled regardless of mount options, `userxattr` mode is
    /// mandatory, and data-only lower layers are rejected.
    ///
    /// Set once at `OverlayFs::mount()` time by checking whether the
    /// calling process's user namespace is the initial user namespace
    /// (`current_user_ns() == &init_user_ns`). Immutable thereafter.
    ///
    /// See Section 13.4.6.1 for the full security model.
    pub userns_influenced: bool,
}

/// A single layer in the overlay stack.
pub struct OverlayLayer {
    /// Root directory inode of this layer on its underlying filesystem.
    pub root: InodeId,

    /// Superblock of the underlying filesystem. Held as a reference
    /// for the overlay mount's lifetime.
    pub sb: SuperBlock,

    /// Layer index (0 = upper or topmost lower; increases downward).
    /// Used for xino composition and for identifying which layer an
    /// overlay inode's data resides on.
    pub index: u16,
}

/// Atomic optional value using a sentinel for the `None` state.
/// `InodeId` of 0 represents `None` (inode 0 is never valid in any filesystem).
/// Provides lock-free read access via `Acquire` load and one-time write
/// via `compare_exchange` (for copy-up transitions from None -> Some).
pub struct AtomicOption<T: Into<u64> + From<u64>> {
    value: AtomicU64,  // 0 = None, non-zero = Some(T)
    _marker: PhantomData<T>,  // T appears only in the API, not in storage
}

impl AtomicOption<InodeId> {
    pub fn none() -> Self { Self { value: AtomicU64::new(0), _marker: PhantomData } }
    pub fn load(&self) -> Option<InodeId> {
        match self.value.load(Ordering::Acquire) {
            0 => None,
            v => Some(InodeId(v)),
        }
    }
    /// Atomically transition from None to Some. Returns Err if already set.
    pub fn set_once(&self, val: InodeId) -> Result<(), InodeId> {
        self.value.compare_exchange(0, val.0, Ordering::AcqRel, Ordering::Acquire)
            .map(|_| ())
            .map_err(|v| InodeId(v))
    }
}
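The None-to-Some race can be exercised in isolation. A self-contained sketch, re-declaring minimal `InodeId` and `AtomicOptionInode` stand-ins (the kernel's generic `AtomicOption<T>` is monomorphized here for brevity):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal stand-ins for the kernel's InodeId and AtomicOption<InodeId>.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct InodeId(pub u64);

pub struct AtomicOptionInode {
    value: AtomicU64, // 0 = None, non-zero = Some(InodeId)
}

impl AtomicOptionInode {
    pub fn none() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    pub fn load(&self) -> Option<InodeId> {
        match self.value.load(Ordering::Acquire) {
            0 => None,
            v => Some(InodeId(v)),
        }
    }

    /// Atomically transition None -> Some. The loser of a race receives
    /// Err carrying the value the winner installed, and can proceed
    /// against that inode instead of retrying.
    pub fn set_once(&self, val: InodeId) -> Result<(), InodeId> {
        self.value
            .compare_exchange(0, val.0, Ordering::AcqRel, Ordering::Acquire)
            .map(|_| ())
            .map_err(InodeId)
    }
}
```

Two threads racing a copy-up both call `set_once`; exactly one succeeds, and the other learns the winning inode from the `Err` value rather than observing a torn state.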

/// Per-inode overlay state. Tracks which layers contribute to a merged
/// view of this inode.
///
/// An `OverlayInode` is created on first lookup and cached in the VFS
/// inode cache. It is the filesystem-private data attached to the VFS
/// inode via `InodeId`.
pub struct OverlayInode {
    /// Inode in the upper layer. `Some` if the entry exists in upper
    /// (either originally or after copy-up). `None` if the entry exists
    /// only in lower layers.
    ///
    /// Protected by `copy_up_lock`: transitions from `None` to `Some`
    /// exactly once during copy-up. Once set, never changes back.
    /// Reads after copy-up are lock-free (Acquire load on the Option
    /// discriminant).
    pub upper: AtomicOption<InodeId>,

    /// Inode in the topmost lower layer that contains this entry.
    /// `None` if the entry exists only in upper (newly created file).
    pub lower: Option<LowerInodeRef>,

    /// True if this inode is a metacopy-only upper entry (metadata
    /// copied, data still in lower layer). Cleared to false after full
    /// data copy-up completes.
    pub metacopy: AtomicBool,

    /// True if this is an opaque directory. An opaque directory hides
    /// all entries from lower layers — readdir and lookup do not
    /// descend into lower layers below this point.
    pub opaque: bool,

    /// Redirect path for directory renames. When a merged directory is
    /// renamed in the upper layer, this field stores the original lower
    /// path so that lookups can find the renamed directory's lower
    /// contents. `None` for non-redirected entries.
    pub redirect: Option<Box<OsStr>>,

    /// Lock serializing copy-up operations on this inode. Only one
    /// thread may copy-up a given inode at a time. Other threads
    /// attempting to modify the same lower-layer file block on this
    /// lock until copy-up completes, then proceed against the upper copy.
    ///
    /// This is a `Mutex`, not an `RwLock`, because copy-up is an
    /// exclusive state transition (None -> Some). Read paths check
    /// `upper` with an Acquire load and only take the lock if they
    /// need to trigger copy-up.
    pub copy_up_lock: Mutex<()>,

    /// Overlay inode type. Needed because the overlay may present a
    /// different view than the underlying filesystem (e.g., a whiteout
    /// character device appears as "entry does not exist").
    pub inode_type: OverlayInodeType,
}

/// Reference to a lower-layer inode.
pub struct LowerInodeRef {
    /// Inode ID on the lower layer's filesystem.
    pub inode: InodeId,
    /// Which lower layer this inode resides on (index into
    /// `OverlaySuperBlock::lower_layers`).
    pub layer_index: u16,
}

/// Overlay inode type classification.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum OverlayInodeType {
    /// Regular file (may be metacopy).
    Regular,
    /// Directory (may be merged or opaque).
    Directory,
    /// Symbolic link.
    Symlink,
    /// Character device, block device, FIFO, or socket.
    Special,
    /// Whiteout entry (exists in upper layer to mark deletion of a
    /// lower-layer entry). Not visible to userspace — lookups return
    /// ENOENT. Internally represented as either a character device
    /// with major:minor 0:0 or a zero-size file with the
    /// `trusted.overlay.whiteout` xattr.
    Whiteout,
}

13.4.3 Overlay Dentry Operations

overlayfs requires custom DentryOps to handle the dynamic nature of the merged filesystem view. Copy-up changes which layer serves a file, so cached dentries must be revalidated.

/// overlayfs dentry operations.
impl DentryOps for OverlayDentryOps {
    /// Revalidate a cached overlay dentry.
    ///
    /// Returns `false` (forcing re-lookup) in these cases:
    /// 1. The overlay inode has been copied up since the dentry was cached
    ///    (detected by checking if `OverlayInode::upper` transitioned from
    ///    None to Some since the last lookup).
    /// 2. The underlying filesystem's dentry has been invalidated (delegates
    ///    to the underlying filesystem's `d_revalidate` if it implements one,
    ///    e.g., for NFS lower layers).
    /// 3. A whiteout has been created or removed in the upper layer for this
    ///    name (detected by checking upper-layer lookup result against cached
    ///    overlay state).
    ///
    /// Returns `true` (dentry is still valid) in all other cases.
    fn d_revalidate(&self, parent: InodeId, name: &OsStr) -> Result<bool>;

    /// Overlay dentries use the default VFS hash (SipHash-1-3).
    fn d_hash(&self, _name: &OsStr) -> Option<u64> {
        None
    }

    /// Overlay dentries are always eligible for LRU caching.
    fn d_delete(&self, _inode: InodeId, _name: &OsStr) -> bool {
        true
    }

    /// On dentry release, drop the overlay inode's references to
    /// underlying filesystem inodes.
    fn d_release(&self, inode: InodeId, name: &OsStr);
}

Dentry cache interaction: When a copy-up occurs, the overlay must invalidate the affected dentry in the VFS dentry cache (Section 13.1.2) so that subsequent lookups see the upper-layer inode instead of the stale lower-layer reference. The invalidation sequence:

  1. Copy-up completes (new file exists in upper layer).
  2. OverlayInode::upper is set via an atomic Release store.
  3. The overlay calls d_invalidate() on the parent directory's dentry for the affected name. This removes the dentry from the hash table and marks it for re-lookup.
  4. The next lookup for this name calls OverlayInodeOps::lookup(), which now finds the upper-layer entry and returns the updated OverlayInode.

Negative dentry handling: Negative dentries (cached ENOENT results) in the overlay dentry cache are invalidated when:

  • A new file is created in the upper layer (the negative dentry for that name must be purged).
  • A whiteout is removed (the previously-hidden lower-layer entry becomes visible again).
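The two invalidation triggers above can be modeled with a small per-directory cache. A self-contained sketch; `DirDcache`, `CachedLookup`, and `purge_negative` are illustrative names, not the VFS dentry cache API:

```rust
use std::collections::HashMap;

/// Minimal model of per-directory cached lookup results.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum CachedLookup {
    Positive(u64), // resolved inode number
    Negative,      // cached ENOENT
}

pub struct DirDcache {
    entries: HashMap<String, CachedLookup>,
}

impl DirDcache {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    pub fn insert(&mut self, name: &str, e: CachedLookup) {
        self.entries.insert(name.to_string(), e);
    }

    pub fn get(&self, name: &str) -> Option<CachedLookup> {
        self.entries.get(name).copied()
    }

    /// Called on upper-layer create and on whiteout removal: a stale
    /// cached ENOENT must not mask the now-visible entry. Positive
    /// entries are left alone.
    pub fn purge_negative(&mut self, name: &str) {
        if self.get(name) == Some(CachedLookup::Negative) {
            self.entries.remove(name);
        }
    }
}
```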

13.4.4 Lookup Algorithm

overlayfs lookup implements the layer search order:

OverlayInodeOps::lookup(parent: InodeId, name: &OsStr) -> Result<InodeId>:

  let overlay_parent = get_overlay_inode(parent)

  // Step 1: Search upper layer (if writable overlay).
  if let Some(upper_dir) = overlay_parent.upper {
      match underlying_lookup(upper_dir, name) {
          Ok(upper_inode) => {
              // Check if this is a whiteout.
              if is_whiteout(upper_inode) {
                  // Entry was deleted. Do NOT search lower layers.
                  // Cache a negative dentry.
                  return Err(ENOENT)
              }
              // Check if this is an opaque directory.
              let opaque = is_opaque_dir(upper_inode)
              // Found in upper. If directory and not opaque, may need
              // to merge with lower layers.
              if is_directory(upper_inode) && !opaque {
                  // Merged directory: upper exists, also search lower
                  // for the merge view.
                  let lower = find_in_lower_layers(overlay_parent, name)
                  return create_overlay_inode(Some(upper_inode), lower, ...)
              }
              // Non-directory or opaque directory: upper is authoritative.
              return create_overlay_inode(Some(upper_inode), None, ...)
          }
          Err(ENOENT) => {
              // Not in upper, fall through to lower layers.
          }
          Err(e) => return Err(e),  // Propagate I/O errors.
      }
  }

  // Step 2: Search lower layers (topmost first).
  // If parent directory has a redirect, follow it.
  for (layer_idx, lower_layer) in lower_layers_for(overlay_parent) {
      match underlying_lookup(lower_dir_at(lower_layer, overlay_parent), name) {
          Ok(lower_inode) => {
              if is_whiteout(lower_inode) {
                  // Whiteout in this lower layer. Stop searching.
                  return Err(ENOENT)
              }
              return create_overlay_inode(None, Some(LowerInodeRef {
                  inode: lower_inode,
                  layer_index: layer_idx,
              }), ...)
          }
          Err(ENOENT) => continue,  // Try next lower layer.
          Err(e) => return Err(e),
      }
  }

  // Not found in any layer.
  Err(ENOENT)

Whiteout detection: An upper-layer entry is a whiteout if either:

  • It is a character device with major:minor 0:0 (traditional format), OR
  • It is a zero-size regular file with the trusted.overlay.whiteout (or user.overlay.whiteout in userxattr mode) xattr set.

Both formats are supported for compatibility with existing container images. UmkaOS creates whiteouts using the xattr format by default (avoids requiring mknod capability for character device creation in unprivileged containers).

Opaque directory detection: A directory is opaque if it has the xattr trusted.overlay.opaque (or user.overlay.opaque) set to "y". An opaque directory hides all entries from lower layers — lookups do not descend past it. This is used when an entire directory is deleted and recreated in the upper layer.
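The two detection rules can be sketched as predicates over the attributes the lookup path reads from the underlying filesystem. `EntryAttr` is a hypothetical stand-in for those attributes; the real code queries the underlying driver:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the attributes the overlay inspects
/// during lookup (type bits, device numbers, size, xattrs).
pub struct EntryAttr {
    pub is_chardev: bool,
    pub rdev_major: u32,
    pub rdev_minor: u32,
    pub is_regular: bool,
    pub is_dir: bool,
    pub size: u64,
    pub xattrs: HashMap<String, Vec<u8>>,
}

/// Whiteout check: character device 0:0 (traditional format) or a
/// zero-size regular file carrying the overlay whiteout xattr.
/// `prefix` is "trusted.overlay." or "user.overlay." (userxattr mode).
pub fn is_whiteout(e: &EntryAttr, prefix: &str) -> bool {
    let chardev_00 = e.is_chardev && e.rdev_major == 0 && e.rdev_minor == 0;
    let xattr_wh = e.is_regular
        && e.size == 0
        && e.xattrs.contains_key(&format!("{prefix}whiteout"));
    chardev_00 || xattr_wh
}

/// Opaque check: a directory with the opaque xattr set to "y".
pub fn is_opaque_dir(e: &EntryAttr, prefix: &str) -> bool {
    e.is_dir
        && e.xattrs.get(&format!("{prefix}opaque")).map(|v| v.as_slice())
            == Some(b"y".as_slice())
}
```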

13.4.5 Copy-Up Protocol

Copy-up is the central operation of overlayfs. When a lower-layer file must be modified, its contents (and/or metadata) are first copied to the upper layer. The copy-up must be atomic from the perspective of concurrent readers: at no point should a reader see a partially-copied file.

Full copy-up algorithm (for regular files when metacopy is disabled, or on first write to a metacopy-only file):

copy_up(overlay_inode: &OverlayInode) -> Result<InodeId>:

  // Fast path: already copied up.
  if let Some(upper) = overlay_inode.upper.load(Acquire) {
      if !overlay_inode.metacopy.load(Acquire) {
          return Ok(upper)  // Fully copied up already.
      }
      // Metacopy exists but needs full data copy. Fall through.
  }

  // Slow path: take copy-up lock.
  let _guard = overlay_inode.copy_up_lock.lock()

  // Double-check after acquiring lock (another thread may have completed
  // copy-up while we waited).
  if let Some(upper) = overlay_inode.upper.load(Acquire) {
      if !overlay_inode.metacopy.load(Acquire) {
          return Ok(upper)
      }
  }

  let lower = overlay_inode.lower.as_ref().expect("copy-up requires lower");
  let sb = overlay_super_block()

  // Step 1: Ensure parent directory exists in upper layer.
  // Recursively copy-up parent directories if needed.
  let upper_parent = ensure_upper_parent(overlay_inode)

  // Step 2: Create temporary file in workdir (same filesystem as upper).
  // The workdir is on the same device as upperdir, enabling atomic rename.
  let tmp_name = generate_temp_name()  // e.g., "#overlay.XXXXXXXX"
  let tmp_inode = underlying_create(sb.work_dir, tmp_name, lower_mode)

  // Step 3: Copy metadata from lower to tmp.
  let lower_attr = underlying_getattr(lower.inode)
  underlying_setattr(tmp_inode, &lower_attr)  // owner, mode, timestamps

  // Step 4: Copy xattrs from lower to tmp.
  // Filter out overlay-private xattrs (trusted.overlay.*).
  copy_xattrs_filtered(lower.inode, tmp_inode, sb.xattr_prefix)

  // Step 5: Copy file data (skip if metacopy mode and this is a
  // metadata-only copy-up triggered by chmod/chown/utimes).
  if !metacopy_only {
      copy_file_data(lower.inode, tmp_inode)
      // Uses splice/sendfile internally for zero-copy where possible.
      // Falls back to read+write for filesystems that don't support splice.
  } else {
      // Set metacopy xattr on the tmp file. This marks it as containing
      // metadata only — data will be copied on first write.
      underlying_setxattr(tmp_inode,
          concat(sb.xattr_prefix, "metacopy"), b"", 0)

      // If the lower file is itself a metacopy (nested overlay), follow
      // the redirect chain to find the actual data source.
      if let Some(origin) = get_metacopy_origin(lower.inode) {
          underlying_setxattr(tmp_inode,
              concat(sb.xattr_prefix, "origin"), &encode_fh(origin), 0)
      }
  }

  // Step 6: Set security context on tmp file.
  // Copy security.* xattrs that the security framework requires.

  // Step 7: Atomic rename from workdir to upperdir.
  // This is the commit point. Before this rename, the copy-up is invisible
  // to other processes. After this rename, the upper-layer file is live.
  underlying_rename(sb.work_dir, tmp_name, upper_parent, target_name,
                    RenameFlags::RENAME_NOREPLACE)

  // Step 8: Update overlay inode state.
  let upper_inode = underlying_lookup(upper_parent, target_name)
  overlay_inode.upper.store(Some(upper_inode), Release)
  if metacopy_only {
      overlay_inode.metacopy.store(true, Release)
  } else {
      // Full data copy-up: clear any metacopy marker so reads are
      // served from the upper layer from now on.
      overlay_inode.metacopy.store(false, Release)
  }

  // Step 9: Invalidate the dentry cache entry for this name.
  // Forces subsequent lookups to see the upper-layer version.
  d_invalidate(upper_parent, target_name)

  Ok(upper_inode)

Atomicity guarantee: The rename in Step 7 is the single atomic commit point. If the system crashes before Step 7, the temporary file in workdir is orphaned and cleaned up on next mount (overlayfs scans workdir for stale temporaries during mount() and removes them). If the system crashes after Step 7, the upper-layer file is complete and consistent.

Error recovery (runtime failures): Each step that can fail must clean up all prior steps before returning an error to the caller. The protocol:

| Step that fails | Cleanup required | Returned error |
| --- | --- | --- |
| underlying_create() (Step 2) | None (nothing created yet) | EIO / ENOSPC |
| underlying_setattr() (Step 3) | Unlink tmp_name from workdir | EIO |
| copy_file_data() (Step 5) | Unlink tmp_name from workdir | EIO / ENOSPC |
| underlying_rename() (Step 7) | Unlink tmp_name from workdir | EIO |
| overlay_inode.upper.store() (Step 8) | Rename committed; upper file is live. Do NOT clean up — return EIO only if the store itself fails (hardware error). The upper file is kept and will be found on retry. | EIO (rare) |

If cleanup of the temporary file itself fails (i.e., underlying_unlink() returns an error during recovery), the orphaned temporary is left in workdir and will be removed by the next mount() scan. The original copy-up failure is still returned to the caller as an error. The orphaned file does not affect correctness because the rename (Step 7) did not complete.
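The unlink-on-failure rows of the table can be centralized in a drop guard so no fallible step can forget its cleanup. A self-contained sketch in which a `Vec<String>` stands in for the workdir contents; `TmpCleanup` and `commit` are illustrative names, not the kernel's API:

```rust
/// Drop guard for the workdir temporary: unlinks the file unless the
/// copy-up reached the Step 7 rename commit point and called `commit()`.
pub struct TmpCleanup<'a> {
    workdir: &'a mut Vec<String>, // stand-in for the workdir directory
    name: String,
    committed: bool,
}

impl<'a> TmpCleanup<'a> {
    /// Corresponds to Step 2: the temporary is created in workdir.
    pub fn new(workdir: &'a mut Vec<String>, name: &str) -> Self {
        workdir.push(name.to_string());
        Self { workdir, name: name.to_string(), committed: false }
    }

    /// Called after the Step 7 rename succeeds: the tmp name has moved
    /// out of workdir, so the drop path must not unlink anything.
    pub fn commit(mut self) {
        self.workdir.retain(|n| n != &self.name); // rename moved it out
        self.committed = true;
    }
}

impl Drop for TmpCleanup<'_> {
    fn drop(&mut self) {
        if !self.committed {
            // Error path: unlink the orphan. If this unlink itself fails
            // in the real kernel, the next mount() scan removes it.
            self.workdir.retain(|n| n != &self.name);
        }
    }
}
```

Any early `return Err(...)` between Steps 2 and 7 drops the guard and removes the temporary; the success path calls `commit()` exactly once, after the rename.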

Parent directory copy-up: Directories are copied up recursively. When copying up /a/b/c/file.txt, if /a/b/c/ does not exist in upper, overlayfs creates /a/, then /a/b/, then /a/b/c/ in upper (each with appropriate metadata and the trusted.overlay.origin xattr pointing to the lower original). Only then does the file copy-up proceed. Each directory copy-up is itself atomic (created in workdir, renamed to upper).

Hard link handling on copy-up: If a lower-layer file has multiple hard links (nlink > 1), all names referencing the same lower inode must resolve to the same upper inode after copy-up. The overlay maintains an index directory (inside workdir) that maps lower file handles to upper inodes. On copy-up, the overlay checks the index first: - If an index entry exists, the file was already copied up via another name. Create a hard link in upper rather than copying data again. - If no index entry exists, perform a full copy-up and record the mapping.

This index is also used for NFS export (mapping file handles across copy-up).
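The check-then-record discipline for hard links can be sketched with a plain map. `LowerHandle`, `UpperInode`, and `CopyUpIndex` are illustrative stand-ins; the real index lives in a directory inside workdir and keys on exportable file handles:

```rust
use std::collections::HashMap;

/// Stand-in for an exportable lower-layer file handle.
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct LowerHandle(pub u64);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct UpperInode(pub u64);

/// The copy-up index: lower file handle -> upper inode.
pub struct CopyUpIndex {
    map: HashMap<LowerHandle, UpperInode>,
}

pub enum CopyUpAction {
    /// Already copied up via another name: hard-link the existing upper
    /// inode instead of copying data again.
    LinkExisting(UpperInode),
    /// First copy-up of this lower inode: copy data, then record it.
    FullCopyUp,
}

impl CopyUpIndex {
    pub fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Consulted before copying data (only meaningful when the lower
    /// file has nlink > 1).
    pub fn check(&self, h: &LowerHandle) -> CopyUpAction {
        match self.map.get(h) {
            Some(&up) => CopyUpAction::LinkExisting(up),
            None => CopyUpAction::FullCopyUp,
        }
    }

    /// Record the mapping after a successful full copy-up, so sibling
    /// names resolve to the same upper inode.
    pub fn record(&mut self, h: LowerHandle, up: UpperInode) {
        self.map.insert(h, up);
    }
}
```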

13.4.6 Metacopy Mode

Metacopy is the performance-critical optimization for container startup. Without metacopy, any metadata operation (chmod, chown, utimes) on a lower-layer file triggers a full data copy. With metacopy enabled, only metadata is copied, and data copy is deferred until the file is opened for writing.

Metacopy lifecycle:

State transitions for a file in metacopy mode:

  [Lower-only]
      │
      │ chmod/chown/utimes/setxattr
      ▼
  [Metacopy in upper]   ← metadata copied, data in lower
      │                    upper has trusted.overlay.metacopy xattr
      │ open(O_WRONLY/O_RDWR) or truncate
      ▼
  [Full copy-up]         ← data + metadata in upper
                           trusted.overlay.metacopy xattr removed

Read path for metacopy files: When a metacopy file is opened for reading (O_RDONLY), data is served from the lower layer. The OverlayFileOps::read() implementation checks overlay_inode.metacopy and dispatches to the lower-layer FileOps::read() with the lower inode. No data copy occurs.

Write trigger: When a metacopy file is opened for writing (O_WRONLY, O_RDWR) or truncated, the overlay triggers a full data copy-up before allowing the write:

impl FileOps for OverlayFileOps {
    fn open(&self, inode: InodeId, flags: OpenFlags) -> Result<u64> {
        let oi = get_overlay_inode(inode);

        // If opening for write and file is metacopy-only, trigger
        // full data copy-up before returning the fd.
        if flags.is_writable() && oi.metacopy.load(Acquire) {
            copy_up_data(oi)?;
            // copy_up_data() copies file data from lower to upper,
            // removes the metacopy xattr, and clears oi.metacopy.
        }

        // Delegate open to the appropriate underlying filesystem.
        if let Some(upper) = oi.upper.load(Acquire) {
            underlying_open(upper, flags)
        } else {
            // Read-only open on a lower-only file. No copy-up needed.
            underlying_open(oi.lower.unwrap().inode, flags)
        }
    }
}

13.4.6.1 Metacopy Trust Model and Security Constraints

The metacopy mechanism is only safe when the kernel can trust that trusted.overlay.metacopy (or user.overlay.metacopy in userxattr mode) was written by the overlay itself during a copy-up, not forged by a process with write access to the upper layer. If forged, an attacker could create a file whose upper stub has a redirect xattr pointing to an arbitrary path in a lower layer, then set the metacopy xattr to tell the kernel to serve lower-layer data through the stub — exposing files the attacker would not otherwise be able to read via the overlay's merged view.

Xattr namespace privilege boundary

The trusted. xattr namespace is the primary safeguard. The kernel checks CAP_SYS_ADMIN via capable() — which verifies the capability against the initial user namespace — not via ns_capable() (which would accept a user namespace root). This means:

A process that holds CAP_SYS_ADMIN only within a user namespace (i.e., container root mapped to an unprivileged host UID) cannot set or read trusted.* xattrs on the host filesystem. Only a process with CAP_SYS_ADMIN in the initial user namespace can write trusted.overlay.* xattrs.

This provides complete protection for overlayfs mounts created in the initial user namespace: container processes cannot forge trusted.overlay.metacopy or trusted.overlay.redirect xattrs because they lack the required capability on the host filesystem.
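The capable()/ns_capable() distinction can be reduced to a small model. A self-contained sketch under stated assumptions: `UserNs`, `Cred`, and the helper functions are illustrative stand-ins for the kernel's credential machinery, not its API:

```rust
/// Illustrative user-namespace identifier; UserNs(0) models init_user_ns.
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct UserNs(pub u32);

pub const INIT_USER_NS: UserNs = UserNs(0);

/// Minimal credential model: the namespace a task belongs to, and
/// whether it holds CAP_SYS_ADMIN within that namespace.
pub struct Cred {
    pub userns: UserNs,
    pub has_cap_sys_admin: bool,
}

/// capable(CAP_SYS_ADMIN): the capability must be held in the
/// *initial* user namespace.
pub fn capable_sys_admin(c: &Cred) -> bool {
    c.has_cap_sys_admin && c.userns == INIT_USER_NS
}

/// ns_capable(ns, CAP_SYS_ADMIN): the capability held in the given
/// namespace suffices.
pub fn ns_capable_sys_admin(c: &Cred, ns: UserNs) -> bool {
    c.has_cap_sys_admin && c.userns == ns
}

/// The trusted.* xattr gate uses capable(), so container root
/// (CAP_SYS_ADMIN only inside its own namespace) is rejected.
pub fn may_write_trusted_xattr(c: &Cred) -> bool {
    capable_sys_admin(c)
}
```

Container root passes `ns_capable` within its own namespace but fails the `capable` check, which is exactly why it can mount an overlay in userxattr mode yet cannot forge trusted.overlay.* xattrs.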

User-namespace-influenced mounts: the attack surface

Since Linux 5.11, overlayfs can be mounted from within a user namespace (CAP_SYS_ADMIN in the user namespace that owns the mount namespace suffices to call mount("overlay", ...)). Such mounts are required to use userxattr mode (-o userxattr), which substitutes the user.overlay.* xattr namespace for trusted.overlay.*. Unlike trusted.*, the user.* namespace is writable by the file owner without any privilege — specifically, the unprivileged host UID that the container root maps to can set user.overlay.metacopy and user.overlay.redirect xattrs on files in the upper layer.

A user-namespace-influenced mount is defined as any overlayfs mount where either:

  1. The overlayfs mount() call was made from within a user namespace (the calling process's user namespace is not the initial user namespace), or
  2. The upper directory's owning user namespace differs from the initial user namespace (detected by comparing the user namespace of the mount namespace that created the upper directory's filesystem mount against init_user_ns).

Enforcement: metacopy disabled for user-namespace-influenced mounts

UmkaOS enforces the following rule at mount time and at metacopy lookup time:

Mount-time enforcement: When OverlayFs::mount() is called from a process not in the initial user namespace, the metacopy and redirect_dir options are forced to off regardless of what the caller requested. The mount proceeds with these features disabled. The kernel logs:

overlayfs: metacopy and redirect_dir disabled for user-namespace mount (CVE mitigation, Section 13.4.6.1)

This matches Linux's behaviour (since kernel 5.11, user-namespace overlayfs mounts are restricted to userxattr mode and metacopy is not permitted unless the caller has CAP_SYS_ADMIN in the initial user namespace).
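The mount-time forcing rule can be sketched as a pure function over the requested options. `OverlayOptions` and `enforce_userns_rules` are hypothetical helper names; the real gate lives inside `OverlayFs::mount()`:

```rust
/// Requested overlay mount options (subset relevant to this section).
#[derive(Clone, Copy)]
pub struct OverlayOptions {
    pub metacopy: bool,
    pub redirect_dir: bool,
    pub userxattr: bool,
}

/// Apply the Section 13.4.6.1 mount-time rules, given whether the
/// calling process sits in the initial user namespace.
pub fn enforce_userns_rules(
    requested: OverlayOptions,
    in_init_userns: bool,
) -> OverlayOptions {
    if in_init_userns {
        return requested; // privileged mount: honour the caller's options
    }
    OverlayOptions {
        metacopy: false,     // forced off: user.* metacopy xattrs are forgeable
        redirect_dir: false, // forced off: redirect forgery
        userxattr: true,     // mandatory for user-namespace mounts
    }
}
```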

The OverlaySuperBlock records whether the mount is user-namespace-influenced:

pub struct OverlaySuperBlock {
    // ... existing fields ...

    /// True if this overlay was mounted from within a user namespace (the
    /// mounting process's user namespace is not the initial user namespace)
    /// or if the upper layer's filesystem mount is owned by a non-initial
    /// user namespace. When true, metacopy and redirect_dir are disabled
    /// regardless of mount options, and userxattr mode is mandatory.
    ///
    /// Set once at mount time; immutable thereafter.
    pub userns_influenced: bool,
}

Lookup-time enforcement: Even if metacopy is enabled in the mount options, the metacopy lookup path checks userns_influenced before reading or acting on any metacopy xattr:

/// Attempt to read a metacopy stub from the given upper-layer dentry.
/// Returns `None` (treat as a regular upper file) if:
///   - The mount is user-namespace-influenced, or
///   - No metacopy xattr is present, or
///   - The xattr value fails validation.
fn ovl_lookup_metacopy(dentry: &Dentry, sb: &OverlaySuperBlock) -> Option<OverlayMetacopy> {
    // Never trust metacopy xattrs from user-namespace-influenced mounts.
    // The xattr namespace used by such mounts (user.overlay.*) is writable
    // by the file owner without privilege, so any metacopy xattr present
    // must be treated as potentially forged.
    if sb.userns_influenced {
        return None;
    }

    // Read the metacopy xattr from the upper-layer file.
    let xattr_name = concat_static(sb.xattr_prefix, "metacopy");
    let xattr = dentry.get_xattr(xattr_name)?;

    // Validate xattr value. The Linux-compatible format is either empty
    // (legacy, no digest) or a 4+N byte structure: 4-byte header followed
    // by an optional fs-verity SHA-256 digest (32 bytes). Reject anything
    // that does not match either form.
    validate_metacopy_xattr(xattr)
}

The lookup-time check is defence-in-depth: the mount-time enforcement already prevents metacopy=on from reaching OverlaySuperBlock::config on user-namespace mounts, so ovl_lookup_metacopy would not be called. The redundant check in ovl_lookup_metacopy protects against future code paths that might bypass the mount-time gate.

Userxattr mode and data-only layers

When userxattr=on is set (required for user-namespace mounts), user.overlay.* xattrs are used throughout. The user.overlay.redirect xattr controls directory rename semantics and, in data-only layer configurations, points metacopy stubs to their data sources. Because user.* xattrs are writable by the file owner, and because data-only layer configurations allow a metacopy file in one lower layer to redirect to a file in a data-only lower layer via user.overlay.redirect:

  • redirect_dir=on is disallowed for user-namespace-influenced mounts (forced to off at mount time).
  • Data-only lower layers are disallowed for user-namespace-influenced mounts: OverlayFs::mount() returns EPERM if any lower layer path is specified with the :: data-only separator syntax when userns_influenced is true.

These restrictions prevent the user.overlay.redirect xattr from being used to point a metacopy stub in one layer at a file in another layer that the container would not otherwise be able to access.

Summary of security invariants

| Condition | trusted.overlay.* metacopy | user.overlay.* metacopy |
| --- | --- | --- |
| Initial user namespace mount, metacopy=on | Trusted (forging requires host CAP_SYS_ADMIN) | N/A (userxattr not used in privileged mounts by default) |
| User-namespace mount | N/A (trusted.* inaccessible from user NS) | Disabled (forced off at mount time; ovl_lookup_metacopy returns None) |
| User-namespace mount, userxattr=on, data-only layers | N/A | Rejected at mount time (EPERM) |

13.4.7 Directory Operations

Readdir merge: Reading a merged directory (one that exists in both upper and lower layers) requires combining entries from all layers, excluding whiteouts and applying opaque directory semantics.

OverlayFileOps::readdir(inode, private, offset, emit) -> Result<()>:

  let oi = get_overlay_inode(inode)

  // Phase 1: Collect entries from upper layer.
  let mut seen: HashSet<OsStr> = HashSet::new()
  if let Some(upper) = oi.upper.load(Acquire) {
      underlying_readdir(upper, |entry_inode, entry_off, ftype, name| {
          // Skip whiteout entries — they indicate deleted lower entries.
          if is_whiteout_entry(entry_inode) {
              seen.insert(name.to_owned())  // Track for lower suppression.
              return true  // Continue iteration.
          }
          seen.insert(name.to_owned())
          emit(overlay_inode_for(entry_inode), entry_off, ftype, name)
      })
  }

  // Phase 2: If directory is opaque, stop here. Lower entries are hidden.
  if oi.opaque {
      return Ok(())
  }

  // Phase 3: Collect entries from lower layers, skipping duplicates.
  for lower_ref in lower_dirs_for(oi) {
      underlying_readdir(lower_ref.inode, |entry_inode, entry_off, ftype, name| {
          // Skip entries already seen in upper or higher lower layers.
          if seen.contains(name) {
              return true
          }
          // Skip whiteout entries from lower layers too.
          if is_whiteout_entry(entry_inode) {
              seen.insert(name.to_owned())
              return true
          }
          seen.insert(name.to_owned())
          emit(overlay_inode_for(entry_inode), entry_off, ftype, name)
      })
  }

  Ok(())

Readdir caching: The merged directory listing is cached in the overlay file's private state (returned by open()) for the lifetime of the open directory file descriptor. This matches Linux's behavior: the merge is computed once per opendir() and subsequent readdir() calls return entries from the cache. The cache is invalidated on rewinddir() (seek to offset 0).

Performance note on seen HashSet: The HashSet<OsStr> in the pseudocode above is allocated once per opendir() call (during the initial merge), not once per readdir() call. The cache stores the deduplicated entry list; subsequent readdir() calls walk the already-merged cache without re-allocating or re-hashing. For large directories (>10,000 entries), the initial opendir() merge is O(N) with one allocation per distinct entry name (stored in the HashSet during merge, then released when the merge completes and entries are stored in a flat Vec in the file private state). The hot path — repeated readdir() calls iterating through the cached Vec — is O(entries) with zero heap allocations.
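The merge-once, iterate-many pattern described above can be sketched as the file's private state. A self-contained sketch; `MergedDirCache` is an illustrative name, and whiteout filtering is omitted for brevity:

```rust
use std::collections::HashSet;

/// Per-open-directory private state: the merged, deduplicated listing
/// computed once at opendir() time.
pub struct MergedDirCache {
    entries: Vec<String>, // flat list, in merge order (upper first)
}

impl MergedDirCache {
    /// Merge layer listings top-down with the dedup rule from the
    /// pseudocode: the first occurrence of a name wins (upper shadows
    /// lower; higher lower layers shadow deeper ones).
    pub fn build(layers: &[Vec<&str>]) -> Self {
        let mut seen: HashSet<&str> = HashSet::new();
        let mut entries = Vec::new();
        for layer in layers {
            for &name in layer {
                if seen.insert(name) {
                    entries.push(name.to_string());
                }
            }
        }
        // `seen` is dropped here: the readdir() hot path below touches
        // only the flat Vec and performs no hashing or allocation.
        Self { entries }
    }

    /// readdir() at a given offset indexes the cached listing.
    pub fn at(&self, offset: usize) -> Option<&str> {
        self.entries.get(offset).map(|s| s.as_str())
    }
}
```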

Directory rename (redirect_dir=on): When a merged directory is renamed, overlayfs cannot rename the lower-layer directory (it is read-only). Instead:

  1. Create the new directory name in the upper layer.
  2. Set the trusted.overlay.redirect xattr on the new upper directory, containing the absolute path (from the overlay root) of the original lower directory. Maximum redirect path: 256 bytes.
  3. Lookups for the renamed directory follow the redirect: when searching lower layers, use the redirect path instead of the current name.
  4. Create a whiteout at the old name to hide the lower-layer original.

Opaque directory creation (rmdir + mkdir of same name):

  1. Create whiteout or opaque directory in upper layer.
  2. Set trusted.overlay.opaque xattr to "y" on the new upper directory.
  3. All lower-layer entries under this path are hidden.

13.4.8 Whiteout and Deletion

When a file or directory is deleted from a merged view, overlayfs must hide the lower-layer entry without modifying the lower layer:

File deletion (unlink on a merged file):

  1. If the file exists in upper: remove the upper entry via underlying_unlink().
  2. If the file exists in any lower layer: create a whiteout in the upper layer at the same path.
  3. Invalidate the dentry cache entry.

Directory deletion (rmdir on a merged directory):

  1. Verify the merged view of the directory is empty (no entries from any layer that are not whiteouts). Return ENOTEMPTY if non-empty.
  2. If the directory exists in upper: remove it.
  3. If the directory exists in lower: create an opaque whiteout in upper.

Whiteout creation:

/// Create a whiteout entry in the upper layer.
///
/// UmkaOS uses the xattr-based whiteout format by default: a zero-size
/// regular file with the overlay whiteout xattr set. This avoids
/// requiring mknod(2) capability (character device 0:0 creation
/// requires CAP_MKNOD in the filesystem's user namespace).
///
/// For compatibility, the character-device whiteout format is also
/// recognized on read (lookup).
fn create_whiteout(upper_parent: InodeId, name: &OsStr) -> Result<()> {
    let sb = overlay_super_block();

    // Create zero-size regular file.
    let whiteout = underlying_create(upper_parent, name,
        FileMode::regular(0o000))?;

    // Set the whiteout xattr.
    underlying_setxattr(whiteout,
        concat(sb.xattr_prefix, "whiteout"), b"y", XattrFlags::CREATE)?;

    Ok(())
}

RENAME_WHITEOUT integration: The VFS rename() with RENAME_WHITEOUT flag (already supported in InodeOps::rename(), Section 13.1.1) atomically renames a file and creates a whiteout at the old name. overlayfs uses this during copy-up of directory entries: when a file is copied from lower to upper, the old lower path is hidden by a whiteout created atomically with the rename.

13.4.9 Volatile Mode

Volatile mode disables all durability guarantees for the upper layer. This is a deliberate trade-off for ephemeral container workloads.

Behavior:

  • fsync(), fdatasync(), and sync_fs() on overlay files are no-ops (return success without calling the underlying filesystem's sync).
  • On mount with volatile=true, create the sentinel directory $workdir/work/incompat/volatile/.
  • On unmount, remove the sentinel directory (clean shutdown).
  • On next mount, if the sentinel exists, return EINVAL with a diagnostic message: the previous volatile session was not cleanly unmounted, and the upper/work directories may be inconsistent. The operator must delete the upper and work directories and recreate them.
  • After any writeback error on the upper filesystem, subsequent fsync() calls on overlay files return EIO persistently (matching Linux's error stickiness behavior from Section 14.1).
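The sentinel protocol can be sketched directly against a real directory tree. A minimal sketch using `std::fs`; `check_and_create_sentinel` and `remove_sentinel` are hypothetical helper names for logic that would live in the overlay mount/unmount paths:

```rust
use std::io;
use std::path::Path;

/// Volatile-mode sentinel check, run during mount(). If the sentinel
/// survives from a previous session, the prior volatile overlay was not
/// cleanly unmounted and the upper/work directories must be discarded.
/// Path layout follows the text: $workdir/work/incompat/volatile/.
pub fn check_and_create_sentinel(workdir: &Path) -> io::Result<()> {
    let sentinel = workdir.join("work/incompat/volatile");
    if sentinel.exists() {
        // Previous unclean volatile session: refuse the mount (EINVAL).
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "volatile sentinel present: delete upper/work and remount",
        ));
    }
    std::fs::create_dir_all(&sentinel)
}

/// Clean unmount removes the sentinel, permitting the next volatile mount.
pub fn remove_sentinel(workdir: &Path) -> io::Result<()> {
    std::fs::remove_dir(workdir.join("work/incompat/volatile"))
}
```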

Container runtime usage: Docker enables volatile mode for containers started with --storage-opt overlay2.volatile=true. This is common for CI/CD runners, build containers, and test environments where container state is discarded after each run.

13.4.10 Extended Attribute Handling

overlayfs must handle xattrs carefully because it uses private xattrs for internal bookkeeping (whiteouts, metacopy, redirects, opaque markers) and must pass through user-visible xattrs correctly.

Xattr namespace partitioning:

| Namespace | Behavior |
| --- | --- |
| trusted.overlay.* (or user.overlay.* in userxattr mode) | Internal: overlay-private. Not visible to userspace via listxattr()/getxattr(). Used for whiteout, opaque, metacopy, redirect, origin markers. |
| security.* | Pass-through with copy-up: copied from lower to upper during copy-up. setxattr() triggers copy-up. Includes security.selinux, security.capability (file caps), security.ima. |
| system.posix_acl_access, system.posix_acl_default | Pass-through with copy-up: POSIX ACLs are copied during copy-up. setfacl triggers copy-up. |
| user.* (excluding user.overlay.* in userxattr mode) | Pass-through with copy-up: user-defined xattrs, copied during copy-up. |
| trusted.* (excluding trusted.overlay.*) | Pass-through with copy-up: only accessible to CAP_SYS_ADMIN processes. Copied during copy-up. |

getxattr/setxattr dispatch:

OverlayInodeOps::getxattr(inode, name, buf) -> Result<usize>:
  // Block access to overlay-private xattrs.
  if name.starts_with(overlay_xattr_prefix()) {
      return Err(ENODATA)
  }
  // Serve from upper if available, otherwise from lower.
  let target = upper_or_lower(inode)
  underlying_getxattr(target, name, buf)

OverlayInodeOps::setxattr(inode, name, value, flags) -> Result<()>:
  // Block writes to overlay-private xattrs.
  if name.starts_with(overlay_xattr_prefix()) {
      return Err(EPERM)
  }
  // setxattr triggers copy-up (xattr must be set on upper).
  let upper = copy_up(inode)?
  underlying_setxattr(upper, name, value, flags)

OverlayInodeOps::listxattr(inode, buf) -> Result<usize>:
  // List xattrs from upper (if exists) or lower.
  // Filter out overlay-private xattrs from the result.
  let target = upper_or_lower(inode)
  let raw = underlying_listxattr(target, buf)?
  filter_out_overlay_xattrs(buf, raw)

Nested overlayfs: When overlayfs is mounted on top of another overlayfs (nested container images, uncommon but valid), the inner overlay's xattrs must not collide with the outer overlay's. Linux handles this via "xattr escaping": the inner overlay stores its xattrs under trusted.overlay.overlay.* instead of trusted.overlay.*. UmkaOS implements the same escaping mechanism. This is transparent to the filesystem — the inner overlay simply uses a longer prefix.
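The filter step used by listxattr above can be sketched in Rust — a hedged illustration in which the concrete signature is an assumption; the real code operates on the caller's buffer in place:

```rust
/// Hedged sketch of filter_out_overlay_xattrs: `raw` is the NUL-separated
/// name list returned by the underlying filesystem's listxattr, and
/// overlay-private names are dropped. The prefix depends on userxattr mode.
pub fn filter_out_overlay_xattrs(raw: &[u8], userxattr: bool) -> Vec<u8> {
    let prefix: &[u8] = if userxattr {
        b"user.overlay."
    } else {
        b"trusted.overlay."
    };
    let mut out = Vec::with_capacity(raw.len());
    for name in raw.split(|&b| b == 0).filter(|n| !n.is_empty()) {
        if !name.starts_with(prefix) {
            out.extend_from_slice(name);
            out.push(0); // keep the NUL terminator of each surviving name
        }
    }
    out
}
```

Note that a nested overlay's escaped names (trusted.overlay.overlay.*) still begin with the outer prefix, so they are filtered by the outer overlay like any other private xattr.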

13.4.11 statfs Behavior

OverlayFs::statfs() returns statistics from the upper layer's filesystem (if present). For read-only overlays (no upper), statistics from the topmost lower layer are returned. This matches Linux behavior and ensures that df on a container's root filesystem shows the available space on the writable layer.

13.4.12 Inode Number Composition (xino)

To guarantee unique inode numbers across the merged view, overlayfs composes inode numbers from the underlying filesystem's inode number and the layer index:

composed_ino = (layer_index << xino_bits) | underlying_ino

Where xino_bits is the number of bits available for the underlying inode (typically 32 for ext4 with default inode sizes). This ensures that stat() returns unique inode numbers for files from different layers that happen to share the same underlying inode number (common when layers are on the same filesystem).

When xino=off or when underlying inode numbers exceed the available bit width, overlayfs falls back to using the underlying inode numbers directly. In this mode, st_dev differs between upper and lower files (the VFS assigns a unique device number per overlay mount), but st_ino may collide across layers. Applications that rely on (st_dev, st_ino) pairs for file identity (e.g., tar, rsync, find -inum) may exhibit incorrect behavior. xino=auto avoids this by enabling composition only when it is safe.
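The composition and its xino=auto fallback can be sketched as follows — helper names are illustrative, not the UmkaOS API:

```rust
/// Sketch of xino inode number composition: layer index in the high bits,
/// underlying inode in the low bits. XINO_BITS = 32 mirrors the ext4
/// default mentioned above.
const XINO_BITS: u32 = 32;

/// Compose a merged-view inode number, or None when the underlying inode
/// does not fit — the xino=auto fallback described in the text.
pub fn compose_ino(layer_index: u64, underlying_ino: u64) -> Option<u64> {
    if underlying_ino >> XINO_BITS != 0 {
        return None; // underlying ino too wide: fall back to raw inos
    }
    Some((layer_index << XINO_BITS) | underlying_ino)
}

/// Recover (layer_index, underlying_ino) from a composed number.
pub fn decompose_ino(composed: u64) -> (u64, u64) {
    (composed >> XINO_BITS, composed & ((1u64 << XINO_BITS) - 1))
}
```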

13.4.13 Mount and Unmount Flow

Mount:

OverlayFs::mount(source, flags, data) -> Result<SuperBlock>:

  1. Parse mount options from `data` into `OverlayMountOptions`.

  2. Determine user-namespace influence (security policy, Section 13.4.6.1):
     userns_influenced = (current_user_ns() != &init_user_ns)

     If userns_influenced:
       a. Force options.metacopy = false.
          Force options.redirect_dir = RedirectDirMode::Off.
          Log: "overlayfs: metacopy and redirect_dir disabled for
                user-namespace mount (Section 13.4.6.1)"
       b. Require options.userxattr == true. If not set, return EPERM.
          (User-namespace mounts cannot use trusted.overlay.* xattrs.)
       c. If any lower_dir entry uses the data-only '::' separator syntax:
          return EPERM. (Data-only layers with userxattr are disallowed
          because user.overlay.redirect is owner-writable.)

  3. Resolve each lower_dir path to an InodeId via VFS path lookup.
     Verify each is a directory. Hold references for mount lifetime.

  4. If upper_dir is set:
     a. Resolve upper_dir to InodeId. Verify it is a writable directory.
     b. Resolve work_dir to InodeId. Verify same superblock as upper_dir.
     c. Check work_dir is empty.
     d. Create `$workdir/work/` subdirectory if it does not exist.
     e. If volatile mode:
        - Check for `$workdir/work/incompat/volatile/` sentinel.
          If exists: return EINVAL ("previous volatile session unclean").
        - Create the sentinel directory.
     f. If nfs_export: create `$workdir/index/` subdirectory.
     g. Clean stale temporary files from workdir (names starting with
        `#overlay.`). These are remnants of interrupted copy-ups.

  5. Verify upper filesystem supports required operations:
     - xattr support (getxattr/setxattr succeed with overlay prefix).
     - rename with RENAME_WHITEOUT (test with a dummy file in workdir).

  6. Construct `OverlaySuperBlock` with userns_influenced as determined
     in step 2, and `SuperBlock`.

  7. Register overlay dentry ops with the VFS.

  8. Emit mount options for /proc/mounts via show_options().
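Step 2's option sanitization can be sketched as follows — a hedged sketch with illustrative field and type names (OverlayMountOptions is simplified from the structure named in step 1):

```rust
// Hedged sketch of step 2. Field and type names are illustrative,
// not the UmkaOS API; redirect_dir is simplified to a bool.

#[derive(Default)]
pub struct OverlayMountOptions {
    pub metacopy: bool,
    pub redirect_dir: bool,
    pub userxattr: bool,
    pub has_data_only_layers: bool, // a lower_dir entry used the '::' separator
}

#[derive(Debug, PartialEq)]
pub enum Errno {
    EPerm,
}

pub fn sanitize_for_userns(
    opts: &mut OverlayMountOptions,
    userns_influenced: bool,
) -> Result<(), Errno> {
    if !userns_influenced {
        return Ok(()); // init_user_ns mount: all options permitted
    }
    // (a) Features relying on trusted.* xattr integrity are forced off.
    opts.metacopy = false;
    opts.redirect_dir = false;
    // (b) user.overlay.* must be used instead of trusted.overlay.*.
    if !opts.userxattr {
        return Err(Errno::EPerm);
    }
    // (c) Data-only layers are incompatible with owner-writable redirects.
    if opts.has_data_only_layers {
        return Err(Errno::EPerm);
    }
    Ok(())
}
```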

Unmount:

OverlayFs::unmount(sb) -> Result<()>:

  1. If volatile mode: remove sentinel directory
     `$workdir/work/incompat/volatile/`.
  2. Release all layer references (InodeId references to underlying
     filesystem directories).
  3. Drop the OverlaySuperBlock.

13.4.14 Performance Characteristics

| Operation | Overhead vs. direct filesystem access | Notes |
| --- | --- | --- |
| Path lookup (cached) | +1 hash lookup per component | Overlay dentry points to underlying dentry |
| Read (lower-only file) | ~0% | Direct delegation to lower filesystem |
| Read (upper file) | ~0% | Direct delegation to upper filesystem |
| Read (metacopy file) | ~0% | Reads from lower, same as lower-only |
| Write (upper file) | ~0% | Direct delegation to upper filesystem |
| Write (first write, copy-up) | O(file_size), one-time | Sequential read+write of file data |
| Write (metacopy first write) | O(file_size), one-time | Deferred from container startup |
| chmod/chown (metacopy) | O(1), ~10μs | Metadata-only copy-up (no data copy) |
| chmod/chown (no metacopy) | O(file_size) | Full copy-up triggered |
| readdir (merged) | O(entries × layers) | Hash-based dedup over all layers |
| stat (cached) | ~0% | Overlay inode cached in VFS |

Container startup optimization: With metacopy enabled, pulling and starting a container image avoids copying any file data during the initial setup phase (only metadata operations occur: chmod, chown, symlink creation for the container's init process). Data is copied lazily on first write. For typical container images (200-500 MB of layers), this reduces container start time from seconds to tens of milliseconds for the filesystem setup phase.

13.4.15 dm-verity Integration for Container Image Layers

Read-only lower layers in a container overlay can be protected by dm-verity (Section 8.2.6). The container runtime mounts each image layer's block device with dm-verity verification, then stacks them as overlayfs lower layers:

Container image mount sequence:
  1. Pull image layers: layer1.img, layer2.img, ..., layerN.img
  2. For each layer:
     a. Set up dm-verity on the layer's block device (Merkle tree
        verification, Section 8.2.6)
     b. Mount the verified block device read-only (ext4/XFS)
  3. Mount overlayfs:
     mount -t overlay overlay \
       -o lowerdir=/mnt/layerN:...:/mnt/layer1,upperdir=...,workdir=...
       /container/rootfs

This provides block-level integrity verification for all read-only container layers. The writable upper layer is covered by IMA (Section 8.4) for runtime integrity measurement of modified files. Together, dm-verity (lower layers) + IMA (upper layer) provide complete integrity coverage for container filesystems.

The optional verity=require mount option provides an additional layer of verification at the overlayfs level using fs-verity digests, independent of the dm-verity block device verification.

13.4.16 Linux Compatibility

overlayfs is compatible with Linux's overlayfs at the mount interface and xattr format level:

  • Upper and lower directories created by Linux overlayfs are mountable by UmkaOS and vice versa. The xattr format (trusted.overlay.* names and values) is identical.
  • Mount option syntax matches Linux exactly (-o lowerdir=...,upperdir=..., workdir=...).
  • Whiteout formats (both character device 0:0 and xattr-based) are recognized.
  • Metacopy xattr format is compatible: layers created with metacopy=on on Linux work on UmkaOS.
  • redirect_dir xattr format and path encoding match Linux.
  • /proc/mounts output format matches Linux for container introspection tools.
  • /sys/module/overlay/parameters/* is not emulated (UmkaOS does not use kernel modules); per-mount options in the mount command are the sole configuration mechanism.

Docker/containerd/Podman compatibility: These runtimes interact with overlayfs exclusively through the mount(2) syscall and standard file operations. They do not use any overlayfs-specific ioctls or sysfs interfaces. UmkaOS's implementation of mount("overlay", ...) with the standard option string is sufficient for full compatibility. The overlay2 storage driver in Docker and the overlayfs snapshotter in containerd are fully supported.


---

## 13.5 binfmt_misc — Arbitrary Binary Format Registration

`binfmt_misc` is a VFS-level mechanism that allows userspace to register handlers
for arbitrary binary formats, identified by magic bytes or file extension. When the
kernel's exec path attempts to start a file and neither the native ELF handler nor
the `#!` script handler matches, the kernel delegates to a registered `binfmt_misc`
interpreter. The registered interpreter binary is invoked with the original file
path as an additional argument.

Critical use cases:

- **Multi-architecture containers**: `qemu-aarch64-static` is registered as the
  interpreter for AArch64 ELF binaries, identified by the AArch64 ELF magic header.
  This allows running unmodified ARM64 Docker images on an x86-64 host without
  hardware virtualisation.
- **Java**: `.jar` files executed as if they were executables via a registration
  that maps the `.jar` extension to `/usr/bin/java -jar`.
- **.NET**: PE32+ executables identified by the `MZ` magic bytes are mapped to
  `dotnet exec`.
- **Wine**: 16-bit and 32-bit Windows PE files mapped to `wine`.

### 13.5.1 Data Structures

```rust
/// A single registered binfmt_misc entry.
pub struct BinfmtMiscEntry {
    /// Registration name. Shown as the filename under the binfmt_misc mount.
    /// Alphanumeric, hyphen, and underscore only. NUL-terminated.
    pub name:         [u8; 64],
    /// Matching strategy: magic bytes or file extension.
    pub match_type:   BinfmtMatch,
    /// Magic bytes to compare against file content (BinfmtMatch::Magic only).
    /// Maximum 128 bytes. Length of `magic` and `mask` must be equal.
    pub magic:        Option<[u8; 128]>,
    /// Length of the valid portion of `magic` and `mask` arrays.
    pub magic_len:    u8,
    /// Bitmask applied to each file byte before comparison with `magic`.
    /// A mask byte of `0xff` means "match exactly"; `0x00` means "ignore".
    pub mask:         Option<[u8; 128]>,
    /// Byte offset within the file at which `magic` is compared.
    pub magic_offset: u16,
    /// File extension string (BinfmtMatch::Extension only).
    /// Case-sensitive. Does not include the leading `.`. NUL-terminated.
    pub extension:    Option<[u8; 8]>,
    /// Absolute path to the interpreter binary.
    pub interpreter:  [u8; PATH_MAX],
    /// Behavioural flags.
    pub flags:        BinfmtFlags,
    /// Whether this entry participates in exec matching.
    pub enabled:      AtomicBool,
}

/// How the entry identifies matching binaries.
pub enum BinfmtMatch {
    /// Match by magic bytes at a fixed offset within the file.
    Magic,
    /// Match by the file extension of the executed path.
    Extension,
}

bitflags! {
    /// Behavioural flags for a binfmt_misc entry.
    pub struct BinfmtFlags: u32 {
        /// Pass the original filename as argv[0] to the interpreter instead
        /// of substituting the interpreter path.
        const PRESERVE_ARGV0 = 0x01;
        /// Open the binary file and pass it to the interpreter as an open fd
        /// (via `/proc/self/fd/N`). Required when the binary is not
        /// world-readable and the interpreter runs without elevated privilege.
        const OPEN_BINARY    = 0x02;
        /// Use the credentials (uid, gid, capabilities) of the interpreter
        /// binary rather than those of the executed file. Equivalent to
        /// setuid execution for the interpreter.
        const CREDENTIALS    = 0x04;
        /// Fix binary: the interpreter is not itself subject to further
        /// binfmt_misc or personality transformation. Prevents recursion.
        const FIX_BINARY     = 0x08;
        /// Secure: do not grant elevated credentials even when the interpreter
        /// binary is setuid. Overrides CREDENTIALS for privilege de-escalation.
        const SECURE         = 0x10;
    }
}
```

The global entry table is a `RwLock<Vec<Arc<BinfmtMiscEntry>>>`. Reads (the exec path) take a read lock for a bounded scan; writes (registration, enable/disable, removal) take the write lock. The list is short in practice (fewer than 64 entries on any real system), so the O(N) scan cost is negligible relative to exec overhead.

### 13.5.2 Registration Interface

The binfmt_misc filesystem is mounted at /proc/sys/fs/binfmt_misc (also accessible at /sys/kernel/umka/binfmt_misc/ via the umkafs namespace — see Section 19.4). It exposes:

| Path | Type | Description |
| --- | --- | --- |
| `register` | write-only file | Register a new entry |
| `status` | read/write file | `1` = all entries active; `0` = all disabled globally |
| `<name>/enabled` | read/write file | `1` = enable, `0` = disable, `-1` = remove this entry |
| `<name>` | read-only file | Shows entry details (flags, interpreter, magic/extension) |

Writing to register or any <name>/enabled file requires Capability::SysAdmin in the caller's capability set.

Registration format (written as a single line to register):

:name:type:offset:magic:mask:interpreter:flags

The first character of the line defines the field delimiter (conventionally `:`). Any printable non-alphanumeric character may be used, which allows interpreter paths that themselves contain colons.

| Field | Description |
| --- | --- |
| name | Identifier: alphanumeric, `-`, `_`. Maximum 63 characters. |
| type | `M` for magic-byte match; `E` for extension match. |
| offset | Decimal byte offset for magic comparison (type `M`). 0 for most formats. |
| magic | Hex-escaped bytes for type `M` (e.g., `\x7fELF`). Extension string for type `E`. |
| mask | Hex-escaped bitmask for type `M`; same length as magic. Empty for type `E`. |
| interpreter | Absolute path to the interpreter binary. Must exist at registration time. |
| flags | Any subset of the characters `POCFS`: `P` = PRESERVE_ARGV0, `O` = OPEN_BINARY, `C` = CREDENTIALS, `F` = FIX_BINARY, `S` = SECURE. |

Example — registering QEMU user-mode for AArch64 ELF binaries on an x86-64 host:

:qemu-aarch64:M:0:\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00::qemu-aarch64-static:OC
  • Type M, offset 0: compare 20 magic bytes starting at file byte 0.
  • No mask: all bytes compared exactly (\xff mask is implied).
  • O (OPEN_BINARY): interpreter receives file as fd, not path, for cross-uid access.
  • C (CREDENTIALS): interpreter's credentials govern setuid semantics.

Parsing algorithm:

parse_registration(line: &[u8]) -> Result<BinfmtMiscEntry>:
  1. delimiter = line[0]
  2. Split line on delimiter into fields: [name, type, offset, magic_or_ext,
     mask, interpreter, flags_str].
  3. Validate name: alphanumeric + '-' + '_', length 1–63.
  4. Parse type: 'M' → BinfmtMatch::Magic, 'E' → BinfmtMatch::Extension.
  5. For type M:
     a. Parse offset as decimal u16.
     b. Decode hex-escaped bytes into magic array (max 128 bytes).
     c. If mask non-empty: decode hex-escaped bytes; must equal magic.len().
     d. If mask empty: fill mask with 0xff bytes (exact match).
  6. For type E:
     a. Validate extension: printable ASCII, no '/', no '.', max 7 chars.
     b. Store extension without leading '.'.
  7. Validate interpreter: starts with '/', exists in VFS (path lookup),
     is a regular file with execute permission for at least one uid.
  8. Parse flags_str: accept 'P', 'O', 'C', 'F', 'S' in any order.
  9. Construct BinfmtMiscEntry with enabled = AtomicBool::new(true).
  10. Acquire write lock on global table; reject if name already exists.
  11. Push Arc<BinfmtMiscEntry> to table.
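The `\xNN` decoding in steps 5b/5c can be sketched as follows — a hedged sketch; `decode_hex_escapes` is an illustrative name, not the UmkaOS symbol:

```rust
/// Hedged sketch of hex-escape decoding for the magic and mask fields.
/// Plain bytes pass through; `\xNN` pairs decode to one byte. Returns
/// None on a malformed or truncated escape.
pub fn decode_hex_escapes(field: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < field.len() {
        if field[i] == b'\\' && field.get(i + 1) == Some(&b'x') {
            // Exactly two hex digits must follow "\x"; reject anything shorter.
            let digits = field.get(i + 2..i + 4)?;
            let hex = std::str::from_utf8(digits).ok()?;
            out.push(u8::from_str_radix(hex, 16).ok()?);
            i += 4;
        } else {
            out.push(field[i]); // literal byte, e.g. the "ELF" in \x7fELF
            i += 1;
        }
    }
    Some(out)
}
```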

### 13.5.3 Exec Path Integration

During do_execve (Section 7.3), after the ELF handler and the #! script handler both decline the binary (return ENOEXEC), the kernel calls binfmt_misc_load_binary(file, argv, envp).

Matching algorithm:

binfmt_misc_load_binary(file, argv, envp) -> Result<()>:
  1. Acquire read lock on global entry table.
  2. If global status is disabled: return ENOEXEC.
  3. Read a probe buffer of min(128 + max_magic_offset, 256) bytes from
     offset 0 of `file`. This single read covers all registered magic ranges.
  4. For each entry in table order:
     a. If !entry.enabled.load(Relaxed): skip.
     b. If entry.match_type == Magic:
        i.  end = entry.magic_offset as usize + entry.magic_len as usize.
        ii. If end > probe_buffer.len(): skip (file too short).
        iii.For each byte i in 0..magic_len:
              file_byte = probe[magic_offset + i] & mask[i]
              if file_byte != magic[i] & mask[i]: break → no match
        iv. If all bytes matched: entry is selected.
     c. If entry.match_type == Extension:
        i.  Extract the last component of the executed file's path
            (the path passed to exec, not argv[0]).
        ii. If it ends with '.' + extension (case-sensitive): entry is selected.
  5. If no entry matched: release lock; return ENOEXEC.
  6. Clone the matched entry (Arc clone, no copy of byte arrays).
  7. Release read lock.
  8. Build new argv:
     a. If PRESERVE_ARGV0 set: new_argv = [interpreter, argv[0], argv[1..]]
     b. Else:                  new_argv = [interpreter, original_file_path, argv[1..]]
     c. If OPEN_BINARY set:    keep `file` open across the exec and
        substitute "/proc/self/fd/<N>" for original_file_path.
  9. If CREDENTIALS set: use interpreter binary's uid/gid/caps for the new exec.
  10. If SECURE set: clear any setuid bits that CREDENTIALS would have applied.
  11. Invoke do_execve recursively with interpreter path and new_argv.
      If FIX_BINARY set: skip binfmt_misc matching in the recursive exec
      (set a per-exec flag to prevent re-entry into this function).

Step 11's recursive do_execve processes the interpreter itself through the normal ELF handler. QEMU user-mode binaries are statically linked ELF executables, so the recursion terminates in one level.
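The masked comparison in step 4b can be sketched as follows — `magic_matches` is an illustrative name for this sketch:

```rust
/// Sketch of the masked magic comparison: each probe byte is ANDed with
/// the mask byte before comparing against the masked magic. A mask byte
/// of 0xff means "match exactly"; 0x00 means "ignore".
pub fn magic_matches(probe: &[u8], offset: usize, magic: &[u8], mask: &[u8]) -> bool {
    debug_assert_eq!(magic.len(), mask.len());
    let end = match offset.checked_add(magic.len()) {
        Some(e) => e,
        None => return false,
    };
    if end > probe.len() {
        return false; // probe (file) too short for this magic range: skip
    }
    magic
        .iter()
        .zip(mask)
        .enumerate()
        .all(|(i, (&m, &k))| probe[offset + i] & k == (m & k))
}
```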

### 13.5.4 The binfmt_misc Filesystem

binfmt_misc_fs is a minimal VFS filesystem type (FsType::BinfmtMisc) with the following FsOps implementation:

```rust
impl FsOps for BinfmtMiscFs {
    fn mount(&self, flags: MountFlags, _data: &[u8]) -> Result<Arc<SuperBlock>>;
    fn statfs(&self, sb: &SuperBlock) -> Result<StatFs>;
}

impl InodeOps for BinfmtMiscDir {
    fn lookup(&self, name: &OsStr) -> Result<Arc<Dentry>>;
    fn iterate_dir(&self, ctx: &mut DirContext) -> Result<()>;
}

impl FileOps for BinfmtMiscRegister {
    fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>; // parse_registration
}

impl FileOps for BinfmtMiscStatus {
    fn read(&self, buf: &mut [u8], _offset: u64) -> Result<usize>; // "enabled\n" or "disabled\n"
    fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>;    // "1" / "0"
}

impl FileOps for BinfmtMiscEntryFile {
    fn read(&self, buf: &mut [u8], _offset: u64) -> Result<usize>; // entry details
    fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>;    // "1" / "0" / "-1"
}
```

The filesystem has no on-disk backing store. All state lives in the in-kernel Vec<Arc<BinfmtMiscEntry>>. Directory inodes are synthesised dynamically: lookup scans the entry table for a matching name and returns a synthetic inode. iterate_dir emits register, status, and all current entry names.

Multiple mounts of the binfmt_misc filesystem share the same global entry table (identical to Linux semantics). Unmounting does not clear registrations; entries persist until explicitly removed via echo -1 > /proc/sys/fs/binfmt_misc/<name>/enabled or until the kernel reboots.

Mount point: The standard location is /proc/sys/fs/binfmt_misc, mounted by systemd-binfmt.service at early boot before loading entries from /etc/binfmt.d/*.conf and /usr/lib/binfmt.d/*.conf.

### 13.5.5 Persistence and systemd Integration

The kernel holds registrations only in memory. Registrations are lost on reboot. The systemd-binfmt.service unit re-registers all entries at each boot by reading configuration files with the format:

# /etc/binfmt.d/qemu-aarch64.conf
:qemu-aarch64:M:0:\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00::qemu-aarch64-static:OC

Each non-comment, non-empty line is written verbatim to /proc/sys/fs/binfmt_misc/register. Drop-in files in /usr/lib/binfmt.d/ are processed first, then /etc/binfmt.d/ (higher priority). Conflicting entries with the same name are rejected by the kernel (duplicate-name check in parse_registration).

### 13.5.6 Security Model

  • Privilege: Writing to register or any enabled file requires Capability::SysAdmin. Unprivileged processes cannot add or modify entries.
  • Interpreter credentials: By default (no CREDENTIALS flag), the interpreter runs with the calling process's credentials. The setuid bits of the interpreter binary are ignored. This prevents privilege escalation via a crafted binary whose magic bytes happen to match a setuid interpreter's registration.
  • CREDENTIALS flag: Explicitly opts in to interpreter-binary credential inheritance. Should only be set for fully trusted interpreters.
  • SECURE flag: When set alongside CREDENTIALS, strips any elevated privilege that would have been inherited. Useful for sandboxed interpreters.
  • OPEN_BINARY flag: The kernel opens the binary file before constructing the new argv, so the interpreter receives an already-open fd. This allows the interpreter to read the file even when the binary is not world-readable (e.g., chmod 700 user-owned binaries run through QEMU on a shared host). The fd is passed as a /proc/self/fd/N path to remain compatible with interpreters that accept a file path argument.
  • Recursion guard: The FIX_BINARY flag, combined with the per-exec recursion flag set in step 11 of Section 13.5.3, prevents pathological interpreter chains where an interpreter is itself a binfmt_misc-dispatched binary.

13.6 autofs — Kernel Automount Trigger

autofs is the kernel side of the automount subsystem. Its role is narrow: detect access to a path that has not yet been mounted, suspend the filesystem lookup, notify a userspace daemon, and resume the lookup after the daemon has performed the mount. The kernel does not decide what to mount or where it comes from — that is entirely the daemon's responsibility.

Used extensively by systemd through .automount units: lazy NFS home directories (/home/$user), removable media (/media/disk), and network shares that should only connect on demand.

13.6.1 Architecture

autofs registers a VFS filesystem type (FsType::Autofs). An autofs filesystem instance covers a single mount point. Inside that mount point, the kernel may see directory entries that are not yet backed by a real mount. When path resolution (Section 13.1.3) traverses one of these directories and finds DCACHE_NEED_AUTOMOUNT set on its dentry, it calls the dentry's d_automount operation.

The two fundamental mount modes are:

| Mode | Description |
| --- | --- |
| indirect | The autofs mount covers a directory; lookups of subdirectories trigger mounts. Example: /nfs is autofs; accessing /nfs/fileserver triggers a mount of fileserver:/export onto /nfs/fileserver. |
| direct | The autofs mount point itself is the trigger. Accessing the exact path (e.g., /mnt/backup) triggers the mount. |

13.6.2 Data Structures

/// State for one autofs filesystem instance (one mount point).
pub struct AutofsMount {
    /// Pipe to the automount daemon. Kernel writes AutofsPacket messages here.
    pub pipe:             Arc<Pipe>,
    /// Protocol version negotiated with the daemon (UmkaOS implements v5).
    pub proto_version:    u32,
    /// Whether the daemon has declared itself gone (catatonic state).
    pub catatonic:        AtomicBool,
    /// Idle timeout in seconds after which expire packets are sent.
    pub timeout_secs:     AtomicU32,
    /// All outstanding lookup requests waiting for daemon response.
    pub pending:          Mutex<HashMap<u32, Arc<AutofsPendingRequest>>>,
    /// Monotonically increasing token counter (wraps at u32::MAX).
    pub next_token:       AtomicU32,
    /// Mount type: indirect or direct.
    pub mount_type:       AutofsMountType,
}

pub enum AutofsMountType {
    Indirect,
    Direct,
    Offset, // Internal: used for sub-mounts within a multi-mount map.
}

/// One outstanding automount request.
pub struct AutofsPendingRequest {
    /// Token echoed back in the daemon's IOC_READY / IOC_FAIL ioctl.
    pub token:   u32,
    /// Path component that triggered the lookup (indirect) or full path (direct).
    pub name:    CString,
    /// Sleeping callers blocked on this mount.
    pub waitq:   WaitQueue,
    /// Result set by the daemon: Ok(()) on success, Err(errno) on failure.
    pub result:  Once<Result<()>>,
}

/// Packet written to the daemon pipe for a missing mount (protocol v5).
#[repr(C)]
pub struct AutofsPacketMissing {
    pub hdr:              AutofsPacketHdr,
    /// Token for AUTOFS_IOC_READY / AUTOFS_IOC_FAIL.
    pub wait_queue_token: u32,
    /// Length of `name` (not including NUL).
    pub len:              i32,
    /// Name of the missing directory component (NUL-terminated).
    pub name:             [u8; NAME_MAX + 1],
}

/// Packet written to the daemon pipe requesting expiry of an idle mount.
#[repr(C)]
pub struct AutofsPacketExpire {
    pub hdr:              AutofsPacketHdr,
    pub wait_queue_token: u32,
    pub len:              i32,
    pub name:             [u8; NAME_MAX + 1],
}

/// Common packet header.
#[repr(C)]
pub struct AutofsPacketHdr {
    pub proto_version: u32,
    pub packet_type:   AutofsPacketType,
}

#[repr(u32)]
pub enum AutofsPacketType {
    Missing = 0,
    Expire  = 1,
}

13.6.3 Automount Protocol

Trigger sequence (the fast path through VFS path resolution):

autofs_d_automount(dentry, path) -> Result<Option<Arc<VfsMount>>>:
  Precondition: called from REF-walk (never RCU-walk; see Section 13.6.6).

  1. Obtain the AutofsMount for this dentry's superblock.
  2. If catatonic: return Err(ENOENT) immediately.
  3. Check if `dentry` is already a mount point (DCACHE_MOUNTED set):
     return Ok(None) — another thread raced and completed the mount.
  4. Allocate token = next_token.fetch_add(1, Relaxed).
  5. Construct AutofsPacketMissing { token, name = dentry.name or full path }.
  6. Insert Arc<AutofsPendingRequest> into pending table under token.
  7. Write packet to pipe (non-blocking; if pipe is full, return ENOMEM —
     the daemon is overloaded).
  8. Sleep on pending.waitq with timeout = timeout_secs seconds.
  9. On wake:
     a. Remove request from pending table.
     b. If result is Ok(()):
        - Verify dentry is now a mount point (DCACHE_MOUNTED).
        - Return Ok(None) (VFS follow_mount() will handle the new mount).
     c. If result is Err(e): return Err(e).
  10. On timeout:
     a. Remove request from pending table.
     b. Return Err(ETIMEDOUT).

Daemon response (via ioctl on the autofs pipe fd or mount point fd):

AUTOFS_IOC_READY(token: u32):
  1. Acquire pending lock; look up token.
  2. If not found: return ENXIO (stale token; request already timed out).
  3. Set request.result = Ok(()).
  4. Wake all waiters on request.waitq.
  5. Remove from pending table.

AUTOFS_IOC_FAIL(token: u32):
  1. Acquire pending lock; look up token.
  2. If not found: return ENXIO.
  3. Set request.result = Err(ENOENT).
  4. Wake all waiters.
  5. Remove from pending table.

Multiple callers may race to access the same missing path simultaneously. All of them find the same AutofsPendingRequest in the pending table (inserted by the first caller) and sleep on the same waitq. When the daemon responds, all waiters wake together.
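The shared-request behaviour can be sketched with a name-keyed table. This is a simplification — the kernel's pending table above is token-indexed, and WaitQueue sleeping is elided — with illustrative names:

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hedged sketch of "racing callers share one request": only the first
/// caller for a given missing name allocates a request (and would emit
/// an AutofsPacketMissing); later callers join the existing request.
pub struct Pending {
    pub token: u32,
}

/// Returns the shared request and whether this caller must send the
/// packet to the daemon (true only for the first caller).
pub fn find_or_create(
    table: &mut HashMap<String, Arc<Pending>>,
    name: &str,
    next_token: &mut u32,
) -> (Arc<Pending>, bool) {
    if let Some(req) = table.get(name) {
        return (Arc::clone(req), false); // join existing request; sleep on its waitq
    }
    let token = *next_token;
    *next_token = next_token.wrapping_add(1); // wraps at u32::MAX, as above
    let req = Arc::new(Pending { token });
    table.insert(name.to_string(), Arc::clone(&req));
    (req, true) // first caller: write AutofsPacketMissing to the pipe
}
```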

13.6.4 Control Interface

All autofs control operations are performed via ioctl(2) on the file descriptor of the autofs pipe (passed to the kernel at mount time via the fd=N mount option) or on a file descriptor opened on the autofs mount point itself.

| ioctl | Direction | Description |
| --- | --- | --- |
| AUTOFS_IOC_READY | daemon→kernel | Mount succeeded for token. |
| AUTOFS_IOC_FAIL | daemon→kernel | Mount failed for token. |
| AUTOFS_IOC_CATATONIC | daemon→kernel | Daemon is exiting; all future lookups fail with ENOENT. |
| AUTOFS_IOC_PROTOVER | kernel→daemon | Returns protocol version (5 for UmkaOS). |
| AUTOFS_IOC_SETTIMEOUT | daemon→kernel | Sets idle expiry timeout in seconds. |
| AUTOFS_IOC_EXPIRE | kernel→daemon | Requests daemon to expire (unmount) one idle subtree. |
| AUTOFS_IOC_EXPIRE_MULTI | kernel→daemon | Requests daemon to expire up to N idle subtrees. |
| AUTOFS_IOC_EXPIRE_INDIRECT | kernel→daemon | Like EXPIRE but limited to indirect-mode subtrees. |
| AUTOFS_IOC_EXPIRE_DIRECT | kernel→daemon | Like EXPIRE but limited to direct-mode mount points. |
| AUTOFS_IOC_PROTOSUBVER | kernel→daemon | Returns protocol sub-version (UmkaOS: 2). |
| AUTOFS_IOC_ASKUMOUNT | daemon→kernel | Query whether the autofs mount point can be unmounted. |

13.6.5 Expiry

After an autofs-triggered mount has been idle for timeout_secs seconds, the kernel initiates expiry. Expiry is cooperative: the kernel asks the daemon to consider unmounting; the daemon decides whether conditions are met (no processes have open files under the mount, no active chdir into it) and issues umount(2) if appropriate.

autofs_expire_run(mount: &AutofsMount):
  Executed from a kernel timer callback at intervals of timeout_secs / 4.

  1. Walk all mounts that are children of this autofs mount point.
  2. For each child mount M:
     a. Compute idle_time = now - M.last_access_time.
     b. If idle_time < timeout_secs: skip.
     c. If any process has an open fd into M's subtree (check mount's
        open-file reference count): skip.
     d. Allocate token = next_token.fetch_add(1, Relaxed).
     e. Write AutofsPacketExpire { token, name = M.mountpoint_name } to pipe.
     f. Insert AutofsPendingRequest into pending table.
     g. Daemon calls AUTOFS_IOC_READY(token) after umount(2) succeeds, or
        AUTOFS_IOC_FAIL(token) if the mount is still busy.
  3. The timer reschedules itself unless the mount is in catatonic state.

The expiry path does not sleep in the kernel; it is fire-and-forget from the kernel's perspective. The daemon drives the actual unmount.

13.6.6 VFS Integration

autofs inserts itself into the VFS path walk at the d_automount dentry operation hook, which is called by follow_automount() inside the path resolution loop (Section 13.1.3):

follow_automount(path, nd) -> Result<()>:
  1. Verify nd.flags does not include LOOKUP_NO_AUTOMOUNT.
  2. Call dentry.ops.d_automount(dentry, path) → new_mnt (may be None).
  3. If new_mnt is Some(mnt): call do_add_mount(mnt, path).
  4. Continue path walk over the now-mounted subtree.

RCU-walk downgrade: d_automount cannot sleep, and sleeping is required to wait for the daemon response. Therefore, if the path walk is in RCU mode (the optimistic lockless fast path), it is downgraded to REF-walk before d_automount is called. The downgrade is performed by unlazy_walk(), which acquires reference counts on the path components traversed so far. Once in REF-walk, the kernel can sleep safely in autofs_d_automount.

LOOKUP_NO_AUTOMOUNT: Certain operations (stat, openat with O_NOFOLLOW | O_PATH, utimensat with AT_SYMLINK_NOFOLLOW) set this flag to avoid triggering automounts on stat-only access. This matches Linux semantics.

13.6.7 Mount Options

autofs is mounted by the daemon at startup with options passed via the data argument to mount(2):

Option Description
fd=N File descriptor of the daemon-side pipe end. Required.
uid=N UID of the daemon process. Used for permission checks on expire.
gid=N GID of the daemon process.
minproto=N Minimum acceptable protocol version (daemon's minimum).
maxproto=N Maximum acceptable protocol version (daemon's maximum).
indirect Mount in indirect mode (default).
direct Mount in direct mode.
offset Mount in offset mode (internal; used by the daemon for sub-mounts).

UmkaOS implements autofs protocol version 5, sub-version 2, matching the version supported by systemd's automount daemon as of systemd v252+. The protocol version is negotiated at mount time: the kernel picks min(maxproto, UMKA_PROTO_VERSION) and returns it via AUTOFS_IOC_PROTOVER.
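The mount-time negotiation described above reduces to a clamp. A minimal sketch (the constant name is illustrative, taken from the formula in the text rather than the UmkaOS source):

```rust
/// Kernel-implemented protocol version (constant name assumed for illustration).
const UMKA_PROTO_VERSION: u32 = 5;

/// Mount-time negotiation: the kernel picks min(maxproto, UMKA_PROTO_VERSION)
/// and fails the mount if that falls below the daemon's minproto.
fn negotiate(minproto: u32, maxproto: u32) -> Option<u32> {
    let v = UMKA_PROTO_VERSION.min(maxproto);
    if v < minproto { None } else { Some(v) }
}

fn main() {
    assert_eq!(negotiate(5, 5), Some(5)); // systemd: exactly version 5
    assert_eq!(negotiate(3, 4), Some(4)); // older daemon: kernel speaks 4
    assert_eq!(negotiate(6, 7), None);    // daemon requires newer than the kernel
}
```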

13.6.8 systemd Integration

A systemd .automount unit creates an autofs mount point at the path specified by Where=, paired with a .mount unit of the same name. systemd acts as the automount daemon:

  1. At unit activation, systemd calls mount("autofs", Where, "autofs", 0, "fd=N,...").
  2. When AutofsPacketMissing arrives on the pipe, systemd activates the corresponding .mount unit (which runs mount(2) for the real filesystem).
  3. On success, systemd calls AUTOFS_IOC_READY(token); on failure, AUTOFS_IOC_FAIL(token).
  4. TimeoutIdleSec= in the .automount unit maps directly to AUTOFS_IOC_SETTIMEOUT.
  5. After the idle timeout, systemd receives AutofsPacketExpire and issues umount(2) if the mount is not busy, then calls AUTOFS_IOC_READY(token).

Example unit (/etc/systemd/system/home.automount):

[Unit]
Description=Automount /home via NFS

[Automount]
Where=/home
TimeoutIdleSec=300

[Install]
WantedBy=multi-user.target

Paired with /etc/systemd/system/home.mount which specifies the NFS source and options. systemd creates the autofs mount point when the .automount unit starts and tears it down when the unit stops.


13.6.9 Linux Compatibility

UmkaOS's autofs implementation is wire-compatible with Linux autofs4:

  • Protocol version 5, sub-version 2 — matches Linux kernel 5.0+.
  • All ioctl numbers are identical to Linux (AUTOFS_IOC_* from <linux/auto_fs.h>).
  • AutofsPacketMissing and AutofsPacketExpire structs are #[repr(C)] and match the Linux kernel ABI exactly.
  • Mount option string format (fd=N,uid=N,...) matches Linux.
  • systemd's automount daemon, autofs(5) userspace tools, and mount.autofs all operate without modification against UmkaOS's autofs implementation.

13.7 FUSE — Filesystem in Userspace

FUSE allows user-space processes to implement complete filesystems. A FUSE filesystem daemon opens /dev/fuse (character device, major 10, minor 229), mounts a filesystem of type fuse (superblock magic FUSE_SUPER_MAGIC), and serves kernel VFS calls by reading and writing structured FUSE messages over the device fd. Any protocol-compliant FUSE daemon runs without modification on UmkaOS.

13.7.1 Architecture

User Process (e.g., sshfs, rclone, glusterfs-fuse)
       │  write(fuse_fd, fuse_out_header + reply)
       │  read(fuse_fd, fuse_in_header + args)
       ▼
  /dev/fuse  (character device, major 10 minor 229)
       │
  ┌────┴────────────────────────────────────────┐
  │  FuseConn: pending request queue            │
  │  FuseInode: nodeid → dentry mapping         │
  └────┬────────────────────────────────────────┘
       │  VFS callbacks → fuse_request dispatch
       ▼
  UmkaOS VFS layer (lookup, read, write, open, ...)
       │
  POSIX application

The FUSE connection object (FuseConn) is the central coordination point. It maintains two queues: pending (requests waiting for the daemon to pick up) and processing (requests sent to the daemon, awaiting reply). Each VFS thread that triggers a FUSE operation enqueues a request and blocks until the daemon writes the corresponding reply.

13.7.2 Core Data Structures

/// One FUSE connection — shared between all fds opened on this mount.
pub struct FuseConn {
    /// Pending requests waiting for the daemon to read.
    pub pending:      Mutex<VecDeque<Arc<FuseRequest>>>,
    /// Requests sent to the daemon, awaiting reply.
    pub processing:   Mutex<BTreeMap<u64, Arc<FuseRequest>>>,
    /// Wait queue: daemon blocked in read() waiting for new requests.
    pub waitq:        WaitQueue,
    /// Connection options negotiated via FUSE_INIT.
    pub opts:         FuseConnOpts,
    /// Next unique request ID (monotonically increasing).
    pub next_unique:  AtomicU64,
    /// True after the daemon has exchanged FUSE_INIT.
    pub initialized:  AtomicBool,
    /// True when the connection is shutting down.
    pub destroyed:    AtomicBool,
    /// Maximum write size negotiated (from FUSE_INIT reply).
    pub max_write:    u32,
    /// Maximum read size.
    pub max_read:     u32,
}

/// A single FUSE request/reply pair.
pub struct FuseRequest {
    /// Monotonic ID — matches `FuseInHeader.unique` and `FuseOutHeader.unique`.
    pub unique:  u64,
    pub opcode:  FuseOpcode,
    /// Serialized FUSE input args (everything after the `FuseInHeader`).
    pub in_args: Vec<u8>,
    pub reply:   Mutex<FuseReply>,
    /// Woken when `reply` transitions to `Done`.
    pub waker:   WaitEntry,
}

/// State of a request's reply.
pub enum FuseReply {
    /// Not yet answered by the daemon.
    Pending,
    /// Reply bytes, or a negative errno on error.
    Done(Result<Vec<u8>, i32>),
}

/// FUSE connection options negotiated during FUSE_INIT.
pub struct FuseConnOpts {
    pub max_write:           u32,
    pub max_read:            u32,
    pub max_pages:           u16,
    /// Capabilities declared by the daemon (server side).
    pub capable:             FuseInitFlags,
    /// Capabilities the kernel requests (client side).
    pub want:                FuseInitFlags,
    /// Timestamp granularity in nanoseconds (0 = 1 ns, i.e., full precision).
    pub time_gran:           u32,
    pub writeback_cache:     bool,
    pub parallel_dirops:     bool,
    pub async_dio:           bool,
    pub posix_acl:           bool,
    pub default_permissions: bool,
    pub allow_other:         bool,
}

FuseConn is reference-counted via Arc and held by:

  • The superblock of the mounted filesystem.
  • Every open file descriptor on /dev/fuse belonging to that mount.

When the last daemon fd is closed, FuseConn.destroyed is set and all further VFS operations return EIO. The mount point must then be explicitly unmounted with fusermount -u or umount.

13.7.3 Wire Protocol

All FUSE communication is framed with fixed headers. The kernel writes a request header followed by opcode-specific arguments; the daemon writes a reply header followed by opcode-specific data.

/// Fixed header preceding every FUSE request (kernel → daemon).
#[repr(C)]
pub struct FuseInHeader {
    /// Total request length (this header + opcode args).
    pub len:     u32,
    /// Opcode (FuseOpcode value).
    pub opcode:  u32,
    /// Unique request ID; matched by the reply.
    pub unique:  u64,
    /// Target inode number (0 for FUSE_INIT / FUSE_STATFS).
    pub nodeid:  u64,
    /// Effective UID of the calling process.
    pub uid:     u32,
    /// Effective GID of the calling process.
    pub gid:     u32,
    /// PID of the calling process.
    pub pid:     u32,
    pub padding: u32,
}

/// Fixed header preceding every FUSE reply (daemon → kernel).
#[repr(C)]
pub struct FuseOutHeader {
    /// Total reply length (this header + reply data).
    pub len:    u32,
    /// 0 on success; negative errno on error (e.g., -ENOENT = -2).
    pub error:  i32,
    /// Matches the `unique` field from the corresponding `FuseInHeader`.
    pub unique: u64,
}

Requests and replies are variable-length. The daemon must read exactly FuseInHeader.len bytes per request and must write exactly FuseOutHeader.len bytes per reply. A short read or write is a protocol error and terminates the connection.

FUSE_FORGET and FUSE_BATCH_FORGET are the only opcodes that carry no reply; the daemon must not write a reply for them.
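A daemon-side parser for the fixed request header might look like this. It is a self-contained sketch that mirrors the FuseInHeader layout above (40 bytes), hand-decoding fields and rejecting the short reads the text defines as protocol errors; it assumes a little-endian host, since FUSE uses native byte order:

```rust
/// Decoded FUSE request header (mirrors FuseInHeader; 40 bytes on the wire).
struct InHeader {
    len: u32,
    opcode: u32,
    unique: u64,
    nodeid: u64,
    uid: u32,
    gid: u32,
    pid: u32,
}

fn u32_at(b: &[u8], off: usize) -> u32 {
    u32::from_le_bytes(b[off..off + 4].try_into().unwrap())
}
fn u64_at(b: &[u8], off: usize) -> u64 {
    u64::from_le_bytes(b[off..off + 8].try_into().unwrap())
}

/// Parse the 40-byte header. A buffer shorter than 40 bytes, or one whose
/// length disagrees with the declared `len`, is a protocol error.
fn parse_in_header(buf: &[u8]) -> Result<InHeader, &'static str> {
    if buf.len() < 40 {
        return Err("short read");
    }
    let len = u32_at(buf, 0);
    if len as usize != buf.len() {
        return Err("len mismatch");
    }
    Ok(InHeader {
        len,
        opcode: u32_at(buf, 4),
        unique: u64_at(buf, 8),
        nodeid: u64_at(buf, 16),
        uid: u32_at(buf, 24),
        gid: u32_at(buf, 28),
        pid: u32_at(buf, 32),
        // bytes 36..40 are the padding field
    })
}

fn main() {
    // Build a header-only request: FUSE_GETATTR (opcode 3) on the root nodeid.
    let mut buf = vec![0u8; 40];
    buf[0..4].copy_from_slice(&40u32.to_le_bytes());  // len = header only
    buf[4..8].copy_from_slice(&3u32.to_le_bytes());   // opcode
    buf[8..16].copy_from_slice(&7u64.to_le_bytes());  // unique
    buf[16..24].copy_from_slice(&1u64.to_le_bytes()); // nodeid
    let h = parse_in_header(&buf).unwrap();
    assert_eq!((h.opcode, h.unique, h.nodeid), (3, 7, 1));
    assert!(parse_in_header(&buf[..20]).is_err()); // short read: protocol error
}
```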

13.7.4 FUSE Opcodes

The direction column records who initiates the message: K→D = kernel to daemon (a VFS call from a user process), D→K = daemon to kernel (a notify or retrieve reply with no corresponding VFS initiator).

Opcode Value Direction Description
FUSE_LOOKUP 1 K→D Lookup a name within a directory
FUSE_FORGET 2 K→D Decrement inode reference count (no reply)
FUSE_GETATTR 3 K→D Fetch inode attributes
FUSE_SETATTR 4 K→D Modify inode attributes
FUSE_READLINK 5 K→D Read the target of a symbolic link
FUSE_SYMLINK 6 K→D Create a symbolic link
FUSE_MKNOD 8 K→D Create a special or regular file
FUSE_MKDIR 9 K→D Create a directory
FUSE_UNLINK 10 K→D Remove a file
FUSE_RMDIR 11 K→D Remove a directory
FUSE_RENAME 12 K→D Rename a file (v1; same mount)
FUSE_LINK 13 K→D Create a hard link
FUSE_OPEN 14 K→D Open a file
FUSE_READ 15 K→D Read file data
FUSE_WRITE 16 K→D Write file data
FUSE_STATFS 17 K→D Query filesystem statistics
FUSE_RELEASE 18 K→D Close file (last close releases the handle)
FUSE_FSYNC 20 K→D Sync file data to stable storage
FUSE_SETXATTR 21 K→D Set an extended attribute
FUSE_GETXATTR 22 K→D Get an extended attribute value
FUSE_LISTXATTR 23 K→D List all extended attribute names
FUSE_REMOVEXATTR 24 K→D Remove an extended attribute
FUSE_FLUSH 25 K→D Flush on close (sent before FUSE_RELEASE)
FUSE_INIT 26 K→D Initialize connection (first message exchanged)
FUSE_OPENDIR 27 K→D Open a directory
FUSE_READDIR 28 K→D Read directory entries
FUSE_RELEASEDIR 29 K→D Close a directory
FUSE_FSYNCDIR 30 K→D Sync directory metadata to stable storage
FUSE_GETLK 31 K→D Test a POSIX byte-range lock
FUSE_SETLK 32 K→D Acquire or release a POSIX lock (non-blocking)
FUSE_SETLKW 33 K→D Acquire a POSIX lock (blocking)
FUSE_ACCESS 34 K→D Check access (used only when default_permissions is false)
FUSE_CREATE 35 K→D Atomically create and open a file
FUSE_INTERRUPT 36 K→D Cancel a pending request
FUSE_BMAP 37 K→D Map logical file block to device block
FUSE_DESTROY 38 K→D Tear down the connection
FUSE_IOCTL 39 K→D Forward an ioctl to the userspace filesystem
FUSE_POLL 40 K→D Poll a file for readiness events
FUSE_NOTIFY_REPLY 41 D→K Deliver data in response to FUSE_NOTIFY_RETRIEVE
FUSE_BATCH_FORGET 42 K→D Drop references for multiple inodes at once
FUSE_FALLOCATE 43 K→D Pre-allocate or de-allocate file space
FUSE_READDIRPLUS 44 K→D Read directory entries together with their attributes
FUSE_RENAME2 45 K→D Rename with RENAME_EXCHANGE or RENAME_NOREPLACE
FUSE_LSEEK 46 K→D Seek with SEEK_DATA or SEEK_HOLE
FUSE_COPY_FILE_RANGE 47 K→D Server-side copy (copy_file_range)
FUSE_SETUPMAPPING 48 K→D Set up a DAX direct memory mapping
FUSE_REMOVEMAPPING 49 K→D Remove a DAX mapping
FUSE_SYNCFS 50 K→D Sync the entire filesystem
FUSE_TMPFILE 51 K→D Create an unnamed temporary file (O_TMPFILE)
FUSE_STATX 52 K→D Extended stat (statx(2))

Notify messages (daemon → kernel, unsolicited; no reply is sent by the kernel except for FUSE_NOTIFY_RETRIEVE which expects FUSE_NOTIFY_REPLY):

Notify code Description
FUSE_NOTIFY_POLL Wake all pollers on the specified file handle
FUSE_NOTIFY_INVAL_INODE Invalidate cached attributes and, optionally, a byte range of page cache
FUSE_NOTIFY_INVAL_ENTRY Invalidate a specific dentry in a parent directory
FUSE_NOTIFY_STORE Pre-populate a byte range of the page cache
FUSE_NOTIFY_RETRIEVE Request the kernel to send page-cache contents back to the daemon
FUSE_NOTIFY_DELETE Remove a dentry without a round-trip FUSE_LOOKUP failure

13.7.5 FUSE_INIT Handshake

FUSE_INIT is always the first message exchanged. The kernel sends FuseInitIn and the daemon replies with FuseInitOut. The two sides negotiate protocol version and capability flags; the connection uses the minimum agreed minor version.

/// FUSE_INIT request body (kernel → daemon).
#[repr(C)]
pub struct FuseInitIn {
    /// FUSE major protocol version (kernel sends 7).
    pub major:         u32,
    /// FUSE minor protocol version (kernel sends 40 for Linux 6.10 equivalent).
    pub minor:         u32,
    pub max_readahead: u32,
    /// Capability bitmask the kernel supports (low 32 bits of FuseInitFlags).
    /// Wire format: flags = FuseInitFlags bits 0-31 (low 32 bits).
    pub flags:         u32,
    /// Extended capability flags (protocol minor ≥ 36, FUSE_INIT_EXT must be set in flags).
    /// Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
    /// This matches the FUSE protocol extension for large flag sets (kernel 5.13+).
    pub flags2:        u32,
    pub unused:        [u32; 11],
}

/// FUSE_INIT reply body (daemon → kernel).
#[repr(C)]
pub struct FuseInitOut {
    pub major:               u32,
    pub minor:               u32,
    pub max_readahead:       u32,
    /// Capabilities the daemon acknowledges and enables (low 32 bits of FuseInitFlags).
    /// Wire format: flags = FuseInitFlags bits 0-31 (low 32 bits).
    pub flags:               u32,
    /// Maximum number of outstanding background requests.
    pub max_background:      u16,
    /// Congestion threshold: kernel slows down at this many background requests.
    pub congestion_threshold: u16,
    /// Maximum bytes per WRITE request.
    pub max_write:           u32,
    /// Timestamp granularity in nanoseconds (0 = 1 ns, i.e., full precision).
    pub time_gran:           u32,
    /// Maximum scatter-gather page count per request.
    pub max_pages:           u16,
    /// Alignment required for DAX mappings.
    pub map_alignment:       u16,
    /// Extended flags (protocol minor ≥ 36, requires FUSE_INIT_EXT set in flags).
    /// Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
    /// This matches the FUSE protocol extension for large flag sets (kernel 5.13+).
    pub flags2:              u32,
    pub max_stack_depth:     u32,
    pub unused:              [u32; 6],
}

bitflags! {
    /// Capability flags exchanged during FUSE_INIT.
    pub struct FuseInitFlags: u64 {
        /// Daemon supports asynchronous read requests.
        const ASYNC_READ          = 1 << 0;
        /// Daemon handles POSIX advisory byte-range locks.
        const POSIX_LOCKS         = 1 << 1;
        /// Daemon uses file handles returned in open replies.
        const FILE_OPS            = 1 << 2;
        /// Daemon handles O_TRUNC atomically in open.
        const ATOMIC_O_TRUNC      = 1 << 3;
        /// Filesystem supports NFS export (node IDs are stable across reboots).
        const EXPORT_SUPPORT      = 1 << 4;
        /// Daemon supports writes larger than 4 KiB.
        const BIG_WRITES          = 1 << 5;
        /// Kernel should not apply the process umask to create operations.
        const DONT_MASK           = 1 << 6;
        /// Daemon supports splice(2)-based writes.
        const SPLICE_WRITE        = 1 << 7;
        /// Daemon supports splice(2)-based moves.
        const SPLICE_MOVE         = 1 << 8;
        /// Daemon supports splice(2)-based reads.
        const SPLICE_READ         = 1 << 9;
        /// Daemon handles BSD flock() locking.
        const FLOCK_LOCKS         = 1 << 10;
        /// Daemon supports ioctl on directories.
        const HAS_IOCTL_DIR       = 1 << 11;
        /// Kernel auto-invalidates cached data on attribute changes.
        const AUTO_INVAL_DATA     = 1 << 12;
        /// Kernel uses FUSE_READDIRPLUS instead of FUSE_READDIR.
        const DO_READDIRPLUS      = 1 << 13;
        /// Kernel switches adaptively between READDIRPLUS and READDIR.
        const READDIRPLUS_AUTO    = 1 << 14;
        /// Daemon supports asynchronous direct I/O.
        const ASYNC_DIO           = 1 << 15;
        /// Daemon supports writeback caching (batched dirty page writeback).
        const WRITEBACK_CACHE     = 1 << 16;
        /// Daemon does not need FUSE_OPEN (open is a no-op).
        const NO_OPEN_SUPPORT     = 1 << 17;
        /// Parallel directory operations are safe (no serialization needed).
        const PARALLEL_DIROPS     = 1 << 18;
        /// Kernel clears setuid/setgid bits on write (v1).
        const HANDLE_KILLPRIV     = 1 << 19;
        /// Daemon supports POSIX ACLs.
        const POSIX_ACL           = 1 << 20;
        /// Daemon sets error on abort rather than returning EIO.
        const ABORT_ERROR         = 1 << 21;
        /// `max_pages` field in FuseInitOut is valid.
        const MAX_PAGES           = 1 << 22;
        /// Daemon caches symlink targets.
        const CACHE_SYMLINKS      = 1 << 23;
        /// Daemon does not need FUSE_OPENDIR.
        const NO_OPENDIR_SUPPORT  = 1 << 24;
        /// Daemon explicitly invalidates data (FUSE_NOTIFY_INVAL_INODE).
        const EXPLICIT_INVAL_DATA = 1 << 25;
        /// `map_alignment` field in FuseInitOut is valid.
        const MAP_ALIGNMENT       = 1 << 26;
        /// Daemon is aware of submount semantics.
        const SUBMOUNTS           = 1 << 27;
        /// Kernel clears setuid/setgid bits on write (v2, extended semantics).
        const HANDLE_KILLPRIV_V2  = 1 << 28;
        /// Extended setxattr arguments (flags field present).
        const SETXATTR_EXT        = 1 << 29;
        /// `flags2` fields in FuseInitIn/Out are valid.
        const INIT_EXT            = 1 << 30;
        const INIT_RESERVED       = 1 << 31;
    }
}

If the daemon returns a minor version lower than what the kernel sent, the kernel downconverts: fields that did not exist in the older protocol minor are ignored. If the daemon sends a major version other than 7, the kernel closes the connection.
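The split of the 64-bit FuseInitFlags set across the wire-format flags and flags2 fields reduces to shifts and masks. A sketch of the packing described in the struct comments above:

```rust
/// Split a 64-bit capability set into the wire-format (`flags`, `flags2`)
/// pair: low 32 bits in `flags`, bits 32-63 shifted down into `flags2`.
fn split_flags(caps: u64) -> (u32, u32) {
    (caps as u32, (caps >> 32) as u32)
}

/// Reassemble the 64-bit set from the two wire fields.
fn join_flags(flags: u32, flags2: u32) -> u64 {
    (flags as u64) | ((flags2 as u64) << 32)
}

fn main() {
    const INIT_EXT: u64 = 1 << 30;   // from FuseInitFlags above
    let caps = INIT_EXT | (1 << 35); // one low flag, one extended flag
    let (f, f2) = split_flags(caps);
    assert_eq!(f, 1 << 30);
    assert_eq!(f2, 1 << 3);          // bit 35 lands on bit 3 of flags2
    assert_eq!(join_flags(f, f2), caps);
    // A pre-36 daemon never reads flags2: only the low 32 bits survive.
    assert_eq!(join_flags(f, 0), INIT_EXT);
}
```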

13.7.6 VFS Integration

FUSE registers filesystem type "fuse" with superblock magic FUSE_SUPER_MAGIC = 0x65735546. Mounting proceeds as follows:

mount(2) path

  1. User invokes mount -t fuse -o fd=N,... or uses the fusermount3 helper.
  2. The kernel parses the fd=N mount option and resolves the fd to an open /dev/fuse file.
  3. A FuseConn is allocated and attached to the fd and the new superblock.
  4. The kernel sends FUSE_INIT and waits for the daemon's reply; on success, FuseConn.initialized is set and the mount completes.

VFS → FUSE dispatch

For every VFS operation on a FUSE mount (lookup, read, write, getattr, etc.) the kernel:

  1. Allocates a FuseRequest with a fresh unique ID.
  2. Serializes the opcode-specific arguments into in_args.
  3. Appends the request to FuseConn.pending and wakes the daemon's wait queue.
  4. Blocks on FuseRequest.waker until the daemon writes a reply.
  5. Deserializes the reply from FuseRequest.reply and returns to the VFS caller.

The daemon loop is simply:

loop {
    bytes = read(fuse_fd, buf)          // blocks until a request is pending
    handle_opcode(parse(buf))
    write(fuse_fd, reply_bytes)         // unblocks the kernel thread
}
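The five-step dispatch sequence and the daemon loop can be modeled in miniature with std channels, with the pending channel standing in for FuseConn.pending and a per-request reply channel standing in for FuseRequest.waker. This is a userspace analogy, not the kernel implementation:

```rust
use std::sync::mpsc;
use std::thread;

/// A request in flight: opcode plus a dedicated reply channel
/// (the analogue of FuseRequest.waker / FuseReply::Done).
struct Req {
    opcode: u32,
    reply_tx: mpsc::Sender<Result<Vec<u8>, i32>>,
}

/// One round-trip: enqueue on the pending channel, let a "daemon" thread
/// serve it, then block on the reply channel like the VFS caller does.
fn dispatch(opcode: u32) -> Result<Vec<u8>, i32> {
    let (pending_tx, pending_rx) = mpsc::channel::<Req>();
    let daemon = thread::spawn(move || {
        for req in pending_rx {
            let reply = match req.opcode {
                3 => Ok(vec![0u8; 16]), // FUSE_GETATTR → attribute bytes
                _ => Err(-38),          // -ENOSYS for unhandled opcodes
            };
            req.reply_tx.send(reply).unwrap();
        }
    });
    let (tx, rx) = mpsc::channel();
    pending_tx.send(Req { opcode, reply_tx: tx }).unwrap();
    let reply = rx.recv().unwrap(); // VFS thread blocks here
    drop(pending_tx);               // last daemon fd closed → loop ends
    daemon.join().unwrap();
    reply
}

fn main() {
    assert_eq!(dispatch(3).unwrap().len(), 16);
    assert_eq!(dispatch(99), Err(-38));
}
```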

Interrupt handling

If the calling thread receives a fatal signal while waiting for a FUSE reply, the kernel enqueues a FUSE_INTERRUPT message targeting the original request's unique ID. It then waits a short grace period (default 20 milliseconds). If the daemon does not abort the request and send a reply within that window, the kernel forcibly removes the request from FuseConn.processing and returns EINTR to the caller. Any reply the daemon subsequently writes for the interrupted unique is discarded by the kernel.

Writeback cache (WRITEBACK_CACHE flag)

When this capability is negotiated, dirty pages accumulate in the kernel page cache and are written to the daemon in larger batches via FUSE_WRITE. Without it, every write(2) to a FUSE file generates an immediate, synchronous FUSE_WRITE to the daemon, serializing all write traffic. Most performance-sensitive FUSE filesystems negotiate WRITEBACK_CACHE.

Connection death

When the last daemon fd is closed (daemon exits, crashes, or explicitly calls FUSE_DESTROY):

  1. FuseConn.destroyed is set atomically.
  2. All requests in FuseConn.processing are completed with error ENOTCONN.
  3. All requests in FuseConn.pending are discarded.
  4. Subsequent VFS operations on the mount return EIO.
  5. The mount point persists in the namespace; an explicit umount or fusermount -u is required to remove it.

13.7.7 Security Model

Mount-owner restriction (default)

Unless the allow_other mount option is passed, only the UID that opened /dev/fuse and performed the mount may access the filesystem. All other UIDs receive EACCES from the UmkaOS VFS layer before the request reaches the daemon, regardless of the file mode bits the daemon returns.

allow_other option

Permits any UID to access the filesystem subject to normal Unix permission checks. Because allow_other exposes the daemon process to arbitrary user requests, it requires either:

  • The SysAdmin capability in the mount namespace, or
  • The /proc/sys/fs/fuse/user_allow_other sysctl set to 1 (off by default).

default_permissions option

When set, the kernel enforces standard Unix permission checks (owner, group, other; st_mode, st_uid, st_gid) against the attributes the daemon returns in FUSE_GETATTR. The kernel never sends FUSE_ACCESS in this mode. Without default_permissions, the daemon is responsible for its own access control and receives FUSE_ACCESS for every access check.
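The kernel-side check under default_permissions is the classic owner/group/other mode-bit test against the daemon-reported st_mode, st_uid, and st_gid. A simplified sketch (capabilities, supplementary groups, and POSIX ACLs are deliberately omitted):

```rust
/// Access mask bits, as in access(2).
const R_OK: u32 = 4;
const W_OK: u32 = 2;
const X_OK: u32 = 1;

/// Classic Unix check: select the owner, group, or other permission triplet
/// from st_mode and require every requested bit. The classes do not fall
/// through: a matching owner is judged by the owner bits alone.
fn may_access(mode: u32, file_uid: u32, file_gid: u32,
              uid: u32, gid: u32, mask: u32) -> bool {
    let shift = if uid == file_uid {
        6
    } else if gid == file_gid {
        3
    } else {
        0
    };
    (mode >> shift) & mask == mask
}

fn main() {
    // rw-r----- file owned by uid 1000, gid 100
    let mode = 0o640;
    assert!(may_access(mode, 1000, 100, 1000, 100, R_OK | W_OK)); // owner rw
    assert!(may_access(mode, 1000, 100, 2000, 100, R_OK));        // group read
    assert!(!may_access(mode, 1000, 100, 2000, 100, W_OK));       // group write denied
    assert!(!may_access(mode, 1000, 100, 2000, 200, R_OK));       // other denied
    let _ = X_OK;
}
```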

Privilege requirement for mounting

Unprivileged FUSE mounts (without SysAdmin) are permitted only through fusermount3, which is installed setuid-root and validates that the user owns the target mountpoint. Direct mount(2) requires SysAdmin in the current user namespace.

13.7.8 io_uring FUSE

UmkaOS supports the io_uring-based FUSE I/O path (FUSE_URING feature, equivalent to Linux 6.14+). The daemon opts in by negotiating the FUSE_URING capability during FUSE_INIT and then submitting SQEs of type IORING_OP_URING_CMD to the /dev/fuse fd rather than using blocking read/write.

Benefits over the classic blocking I/O path:

  • Asynchronous request handling — the daemon can have many requests in flight simultaneously without blocking threads.
  • Reduced syscall overhead — requests are batched via io_uring_submit; one syscall drains or fills multiple queue slots.
  • CPU affinity — the daemon can pin io_uring workers to specific CPUs, reducing cross-socket latency for NUMA-aware FUSE filesystems.

The FUSE daemon registers a fixed buffer pool at startup. The kernel delivers requests into pre-registered buffers, and the daemon submits replies via the same ring. The wire format (FuseInHeader, FuseOutHeader, opcode bodies) is unchanged; only the transport mechanism differs.

13.7.9 Linux Compatibility

  • /dev/fuse device node (major 10, minor 229): identical to Linux.
  • FUSE protocol version 7.40 (Linux 6.10 equivalent) is the maximum negotiated kernel version. Daemons advertising higher minors receive 7.40 in the reply.
  • libfuse3 (3.x series): works without modification.
  • fusermount3 and the fuse.ko-equivalent path: built into the UmkaOS VFS layer; no kernel module is required.
  • All widely deployed FUSE filesystems run without modification: sshfs, rclone mount, glusterfs-fuse, ceph-fuse, bindfs, s3fs-fuse, encfs, gvfs, ntfs-3g.
  • DAX (FUSE_SETUPMAPPING / FUSE_REMOVEMAPPING) is supported on persistent memory-backed FUSE mounts, providing zero-copy access to file data.

13.8 configfs — Kernel Object Configuration Filesystem

configfs is a RAM-resident pseudo-filesystem (similar to sysfs) that allows user-space to create, configure, and destroy kernel objects by manipulating directories and files under a single mount point. The key distinction from sysfs is direction of control: sysfs exports kernel-managed objects to user-space, while configfs gives user-space the power to instantiate new kernel objects via mkdir.

configfs is used by:

  • LIO iSCSI / NVMe-oF target (/sys/kernel/config/target/, /sys/kernel/config/nvmet/) — see Section 11 for the block-layer and NVMe-oF protocol details.
  • USB gadget framework (/sys/kernel/config/usb_gadget/)
  • 9pnet and netconsole subsystems

13.8.1 Architecture

                 User Space
         mkdir / rmdir / cat / echo
                     │
              /sys/kernel/config/
                     │  (VFS operations)
        ┌────────────┴─────────────────────────┐
        │          configfs VFS layer          │
        │  ConfigfsSubsystem → ConfigGroup     │
        │  ConfigItem → ConfigAttribute        │
        └────────────┬─────────────────────────┘
                     │  callbacks
              Kernel subsystem
         (LIO, nvmet, USB gadget, ...)

User-space operates exclusively with POSIX filesystem primitives. No ioctl or dedicated syscall is needed. The kernel subsystem registers callback functions that the configfs VFS layer invokes in response to standard filesystem operations.

13.8.2 Data Structures

/// A configfs subsystem, registered by a kernel module at init time.
pub struct ConfigfsSubsystem {
    /// Directory name created under /sys/kernel/config/.
    pub name: &'static str,
    /// Root group of this subsystem.
    pub root: Arc<ConfigGroup>,
}

/// A configfs group — a directory that may contain items, subgroups, and
/// attributes. Groups may also carry a set of default child groups that are
/// created automatically when the group itself is created.
pub struct ConfigGroup {
    pub item:           ConfigItem,
    /// Active children (items and subgroups) keyed by name.
    pub children:       RwLock<BTreeMap<String, ConfigChild>>,
    /// Type descriptor controlling allowed operations on this group.
    pub item_type:      Arc<ConfigItemType>,
    /// Subgroups automatically created alongside this group (not user-removable).
    pub default_groups: Vec<Arc<ConfigGroup>>,
}

/// Discriminated union of group children.
pub enum ConfigChild {
    Item(Arc<ConfigItem>),
    Group(Arc<ConfigGroup>),
}

/// A configfs item — the leaf directory representing one kernel object.
pub struct ConfigItem {
    pub name:      Mutex<String>,
    /// Reference count; item is dropped when it reaches zero.
    pub kref:      AtomicUsize,
    pub parent:    Weak<ConfigGroup>,
    pub item_type: Arc<ConfigItemType>,
}

/// Type descriptor: defines the callbacks and attributes for an item or group.
pub struct ConfigItemType {
    pub name: &'static str,
    /// Called when the item's reference count drops to zero.
    pub release:    fn(&ConfigItem),
    /// Attribute files exposed in every instance of this item type.
    pub attrs:      &'static [&'static dyn ConfigAttribute],
    /// Returns additional child groups (used for complex multi-level objects).
    pub groups:     Option<fn(&ConfigItem) -> Vec<Arc<ConfigGroup>>>,
    /// Create a new leaf item inside this group (triggered by mkdir).
    pub make_item:  Option<fn(group: &ConfigGroup, name: &str)
                              -> Result<Arc<ConfigItem>, KernelError>>,
    /// Create a new subgroup inside this group (triggered by mkdir).
    pub make_group: Option<fn(group: &ConfigGroup, name: &str)
                               -> Result<Arc<ConfigGroup>, KernelError>>,
    /// Notify the subsystem before an item is removed (triggered by rmdir).
    pub drop_item:  Option<fn(group: &ConfigGroup, item: &ConfigItem)>,
}

/// A single configfs attribute — a regular file in the item directory.
pub trait ConfigAttribute: Send + Sync {
    /// File name within the item directory.
    fn name(&self) -> &str;
    /// Unix permission bits (typically 0644 for read-write, 0444 for read-only).
    fn mode(&self) -> u32;
    /// Populate `buf` with a text representation of the attribute value.
    /// Returns the number of bytes written.
    fn show(&self, item: &ConfigItem, buf: &mut [u8]) -> Result<usize, KernelError>;
    /// Parse `buf` and apply the new attribute value.
    /// Returns the number of bytes consumed.
    fn store(&self, item: &ConfigItem, buf: &[u8]) -> Result<usize, KernelError>;
}

Lifetimes and reference counting mirror those of the objects the subsystem manages. A ConfigItem is kept alive as long as the directory exists in the configfs namespace. Removal (rmdir) calls drop_item, decrements the kref, and invokes release when the count reaches zero.

13.8.3 Mount Point and Directory Layout

configfs is mounted at boot by configfs_init() and exposed at /sys/kernel/config. User-space may also mount it manually:

mount -t configfs configfs /sys/kernel/config

Illustrative layout showing the NVMe-oF and iSCSI target subsystems (see Section 11 for full protocol details):

/sys/kernel/config/
├── target/                              ← LIO iSCSI / generic target subsystem
│   ├── core/
│   │   └── iblock_0/                   ← mkdir: create iblock backstore group
│   │       └── lio_disk0/              ← mkdir: create a new block device object
│   │           ├── dev                 ← echo /dev/sda > dev
│   │           ├── udev_path           ← echo /dev/sda > udev_path
│   │           └── enable              ← echo 1 > enable
│   └── iscsi/
│       └── iqn.2024-01.com.example:storage/   ← mkdir: create iSCSI target IQN
│           └── tpgt_1/                         ← mkdir: create target portal group
│               ├── enable
│               ├── lun/
│               │   └── lun_0 → ../../core/iblock_0/lio_disk0   ← symlink
│               ├── acls/
│               │   └── iqn.2024-01.com.client:host1/
│               │       ├── auth/
│               │       └── mapped_lun0/
│               └── fabric_statistics/
├── nvmet/                               ← NVMe-oF target subsystem
│   ├── subsystems/
│   │   └── nqn.2024-01.com.example:nvme-ssd/  ← mkdir: create NVMe subsystem NQN
│   │       ├── attr_allow_any_host
│   │       └── namespaces/
│   │           └── 1/                          ← mkdir: create namespace ID 1
│   │               ├── device_path             ← echo /dev/nvme0n1 > device_path
│   │               └── enable                  ← echo 1 > enable
│   └── ports/
│       └── 1/                                  ← mkdir: create NVMe-oF port
│           ├── addr_trtype                     ← echo tcp > addr_trtype
│           ├── addr_traddr                     ← echo 192.0.2.1 > addr_traddr
│           ├── addr_trsvcid                    ← echo 4420 > addr_trsvcid
│           └── subsystems/
│               └── nqn.2024-01.com.example:nvme-ssd  ← symlink
└── usb_gadget/                          ← USB gadget framework
    └── g1/                             ← mkdir: create a gadget instance
        ├── idVendor
        ├── idProduct
        └── functions/
            └── mass_storage.0/
                └── lun.0/
                    └── file            ← echo /dev/sdb > file

The directory hierarchy encodes object relationships. Symlinks express associations between independently-created objects (e.g., linking a LUN to its backing store, or attaching a subsystem to a port).

13.8.4 VFS Operations

configfs maps the five fundamental filesystem operations onto subsystem callbacks:

mkdir(path) The parent directory's ConfigItemType is consulted. If make_group is defined, a new ConfigGroup is allocated and returned as a subdirectory dentry. If make_item is defined, a new ConfigItem is allocated and returned. At most one of the two may be Some for a given group type; mkdir on a group that defines neither returns EPERM. Default child groups are created automatically alongside any new group.

rmdir(path) The directory must be empty (no user-created children; default children are exempt from this check and are removed automatically). drop_item is invoked on the parent's ConfigItemType, then the item's kref is decremented. If the kref reaches zero, release is called. Attempting to remove a non-empty directory returns ENOTEMPTY.

open(attr_path) / read(attr_fd) The fd is associated with the specific ConfigAttribute. read(2) invokes ConfigAttribute::show(), which populates the kernel buffer with a text representation. The output is always \n-terminated for shell compatibility.

open(attr_path) / write(attr_fd) write(2) invokes ConfigAttribute::store() with the user-supplied buffer. The subsystem parses and validates the value; on error it returns a negative errno. Writes larger than PAGE_SIZE (4 KiB) are rejected with EINVAL to prevent unbounded allocations.

symlink(src, dst) Used to express dependencies between items: for example, associating a LUN directory with a backstore object, or adding a subsystem to a port's subscriber list. configfs validates that both the source and destination are within the same configfs mount before creating the link. The subsystem's ConfigItemType may reject symlinks by returning EPERM from an optional allow_link callback.

readdir Returns all children of a group: items, subgroups, attribute files, and symlinks. Attribute names are synthesized from ConfigItemType.attrs; no inode backing store is needed.
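
The mkdir dispatch rule can be sketched in a few lines. This is an illustrative model only: `dispatch_mkdir`, the simplified `ConfigItemType`, and the errno constant are stand-ins for the real UmkaOS types, which carry many more fields.

```rust
// Sketch of the mkdir(2) dispatch rule: make_group takes precedence, then
// make_item; a group type defining neither forbids mkdir with EPERM.
const EPERM: i32 = 1;

struct ConfigItem { name: String }
struct ConfigGroup { name: String }

/// Per-group-type callbacks. At most one of `make_group` / `make_item`
/// may be Some for a given group type.
struct ConfigItemType {
    make_group: Option<fn(&str) -> ConfigGroup>,
    make_item: Option<fn(&str) -> ConfigItem>,
}

enum MkdirResult { Group(ConfigGroup), Item(ConfigItem) }

fn dispatch_mkdir(parent_type: &ConfigItemType, name: &str) -> Result<MkdirResult, i32> {
    if let Some(mg) = parent_type.make_group {
        Ok(MkdirResult::Group(mg(name)))
    } else if let Some(mi) = parent_type.make_item {
        Ok(MkdirResult::Item(mi(name)))
    } else {
        Err(EPERM) // group type defines neither callback: mkdir is forbidden
    }
}
```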

13.8.5 Linux Compatibility

  • /sys/kernel/config/ mount point and directory layout: byte-for-byte identical to Linux configfs (kernel 5.0+).
  • The ConfigAttribute read/write text format (newline-terminated strings, echo value > file idiom) matches Linux.
  • LIO iSCSI target tools (targetcli, targetcli-fb, rtslib-fb) work without modification.
  • NVMe-oF target tools (nvmetcli) work without modification; see Section 11 for NVMe-oF transport configuration details.
  • USB gadget framework (configfs-gadget, libusbgx) works without modification.
  • Symlink semantics (cross-item dependencies) are identical to Linux: both source and destination must reside within the same configfs mount.

13.9 File Notification System

UmkaOS implements inotify and fanotify with full Linux syscall and wire-format compatibility. Internal delivery uses typed structured channels rather than raw fd-write protocols; the external syscall interfaces are byte-for-byte identical to Linux.

Two interfaces are provided:

  • inotify: informational events (IN_CREATE, IN_MODIFY, etc.), delivered asynchronously via a file descriptor readable with read(2).
  • fanotify: superset of inotify, plus permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM) that block the originating syscall until userspace responds with allow or deny. Used by malware scanners, file integrity monitors, and backup software.

Both are implemented in umka-vfs. Event delivery hooks are called from within the VFS operation dispatch paths — after permission checks pass, before returning to userspace.

13.9.1 inotify

13.9.1.1 In-Kernel Objects

/// Per-inotify-instance state. Created by inotify_init() / inotify_init1().
/// Exposed to userspace as a file descriptor (the fd is backed by a synthetic
/// inode in the anonymous inode filesystem; read(2) on it drains the event queue).
pub struct InotifyInstance {
    /// Watch descriptors: maps wd → InotifyWatch.
    /// Protected by an RwLock: concurrent watchers on disjoint inodes do not
    /// contend, and watch addition/removal is infrequent.
    pub watches: RwLock<BTreeMap<WatchDescriptor, Arc<InotifyWatch>>>,

    /// Monotonically increasing allocator for watch descriptors.
    /// WDs are 1-based positive integers per inotify_add_watch(2) contract.
    pub next_wd: AtomicI32,

    /// Per-instance event queue. Fixed capacity avoids heap allocation under spinlock.
    /// Overflow policy: when the queue is full, the new event is dropped and the
    /// `overflow` flag is set; a synthetic IN_Q_OVERFLOW event is prepended to the
    /// next successful read(2) (matches Linux inotify behavior). Capacity of 256
    /// events per instance is sufficient for typical usage.
    pub event_queue: SpinLock<RingBuffer<InotifyEventBuf, 256>>,

    /// Set when the event queue overflowed since the last read(2). A synthetic
    /// `IN_Q_OVERFLOW` event is prepended to the next read response and this flag
    /// is cleared. Separate from the queue to avoid occupying a queue slot.
    pub overflow: AtomicBool,

    /// Wait queue for poll()/select()/epoll() on this instance.
    pub wait_queue: WaitQueueHead,

    /// Flags from inotify_init1() (IN_CLOEXEC, IN_NONBLOCK).
    pub flags: u32,
}

/// One inotify watch: a single inode being monitored for specific events.
pub struct InotifyWatch {
    /// Watch descriptor (the value returned to userspace by inotify_add_watch).
    pub wd: WatchDescriptor,

    /// The inode being watched. Holds an Arc reference to prevent premature eviction
    /// while the watch is active.
    pub inode: Arc<Inode>,

    /// Bitmask of watched events (IN_CREATE | IN_MODIFY | IN_CLOSE_WRITE | etc.).
    pub mask: u32,

    /// Back-reference to the owning InotifyInstance (weak to avoid cycles).
    pub instance: Weak<InotifyInstance>,
}

/// Event delivered to userspace via read(2) on the inotify fd.
/// Matches the Linux inotify_event ABI exactly.
#[repr(C)]
pub struct InotifyEvent {
    /// Watch descriptor that fired.
    pub wd: i32,
    /// Event type (IN_CREATE, IN_MODIFY, IN_DELETE, etc.).
    pub mask: u32,
    /// Links related IN_MOVED_FROM and IN_MOVED_TO events (same cookie = same rename).
    pub cookie: u32,
    /// Length of the name[] field in bytes, including the null terminator and any
    /// trailing padding bytes. 0 if no filename is associated with this event
    /// (e.g., IN_ATTRIB on a non-directory inode).
    pub len: u32,
    // Followed immediately by name[len]: null-terminated filename, valid only for
    // events on directory inodes. Padded to a 4-byte boundary.
}

/// Internal buffer holding a complete inotify event + filename bytes.
pub struct InotifyEventBuf {
    pub header: InotifyEvent,
    /// The filename, null-padded to a multiple of 4 bytes.
    pub name: Vec<u8>,
}
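
Since the struct matches the Linux ABI exactly, the byte stream that read(2) returns can be decoded by any Linux-compatible consumer. The following hedged sketch shows such a userspace-side parser — a 16-byte little-endian header (wd, mask, cookie, len) followed by `len` NUL-padded name bytes; `parse_inotify_stream` is a hypothetical helper, not part of the UmkaOS API, and assumes a little-endian host.

```rust
use std::convert::TryInto;

/// Parse a buffer returned by read(2) on an inotify fd into
/// (wd, mask, cookie, name) tuples. NUL padding is trimmed from names.
fn parse_inotify_stream(buf: &[u8]) -> Vec<(i32, u32, u32, String)> {
    let mut events = Vec::new();
    let mut off = 0;
    while off + 16 <= buf.len() {
        let wd = i32::from_le_bytes(buf[off..off + 4].try_into().unwrap());
        let mask = u32::from_le_bytes(buf[off + 4..off + 8].try_into().unwrap());
        let cookie = u32::from_le_bytes(buf[off + 8..off + 12].try_into().unwrap());
        let len = u32::from_le_bytes(buf[off + 12..off + 16].try_into().unwrap()) as usize;
        off += 16;
        // name[] occupies exactly `len` bytes, NUL-padded to a 4-byte boundary;
        // trim the padding to recover the filename.
        let name: String = buf[off..off + len]
            .iter().take_while(|&&b| b != 0).map(|&b| b as char).collect();
        off += len;
        events.push((wd, mask, cookie, name));
    }
    events
}
```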

13.9.1.2 VFS Integration Hooks

inotify events are generated from dentry/inode operation call sites within the VFS dispatch layer. The fast path check costs a single pointer load:

VFS operation                                      → Event(s) generated
create, mkdir, mknod, symlink                      → IN_CREATE on parent dir inode
unlink, rmdir                                      → IN_DELETE on parent dir; IN_DELETE_SELF on the target inode
rename (source side)                               → IN_MOVED_FROM on old parent + cookie
rename (destination side)                          → IN_MOVED_TO on new parent + same cookie
open                                               → IN_OPEN on the inode
read, readdir                                      → IN_ACCESS on the inode
write, truncate, fallocate                         → IN_MODIFY on the inode
setattr (chmod/chown/utimes)                       → IN_ATTRIB on the inode
close (file was written)                           → IN_CLOSE_WRITE on the inode
close (read-only open)                             → IN_CLOSE_NOWRITE on the inode
watch removed (inode evicted or inotify_rm_watch)  → IN_IGNORED on the watch descriptor

Each Inode carries an inotify_watches field:

/// Per-inode inotify watch list. None when no watches are active (the common case).
/// This field is checked on every relevant VFS operation; the None case costs a
/// single load and a well-predicted branch.
pub inotify_watches: Option<SpinLock<Vec<Arc<InotifyWatch>>>>,

When the field is None (no watches active), the check is a single pointer comparison and a well-predicted branch — negligible overhead on the fast path for the vast majority of inodes.

13.9.1.3 Event Delivery Algorithm

fsnotify_inode_event(inode, event_mask, name, cookie):
  watches_opt = inode.inotify_watches  // single load
  if watches_opt is None: return       // fast path: no watches on this inode

  watches = watches_opt.as_ref().lock()
  for watch in watches.iter():
    fired_mask = watch.mask & event_mask
    if fired_mask == 0: continue
    if let Some(instance) = watch.instance.upgrade():
      buf = InotifyEventBuf {
        header: InotifyEvent { wd: watch.wd, mask: fired_mask, cookie, len: name.len() + padding },
        name: name_bytes_padded_to_4_bytes,
      }
      queue = instance.event_queue.lock()
      if !queue.is_full():
        queue.push(buf)
      else:
        // Queue overflow: set the overflow flag so that the next read(2) prepends
        // a synthetic IN_Q_OVERFLOW event. The AtomicBool lives outside the spinlock;
        // store is done while still holding the lock to ensure the writer side sees
        // the flag before any reader drains the queue.
        instance.overflow.store(true, Ordering::Release)
      drop(queue)
      instance.wait_queue.wake_up_one()  // unblock read()/poll()

13.9.1.4 Syscall Implementations

inotify_add_watch(fd, path, mask) → wd:

  1. Resolve path → inode using normal path resolution.
  2. Look up fd → InotifyInstance.
  3. Scan instance.watches for an existing watch on this inode. If found: update watch.mask = mask (OR behavior if the IN_MASK_ADD flag is set; replace otherwise) and return the existing wd.
  4. Allocate a new WatchDescriptor from instance.next_wd.fetch_add(1).
  5. Construct InotifyWatch { wd, inode: inode.clone(), mask, instance: Arc::downgrade(&instance) }.
  6. Initialize inode.inotify_watches if it was None.
  7. Insert the watch into both inode.inotify_watches and instance.watches.
  8. Return wd.

inotify_rm_watch(fd, wd) → 0:

  1. Look up fd → InotifyInstance.
  2. Remove the watch from instance.watches by wd. Return EINVAL if not found.
  3. Remove the corresponding entry from inode.inotify_watches.
  4. If inode.inotify_watches is now empty, set it to None.
  5. Deliver an IN_IGNORED event to the instance.
  6. Drop the Arc<InotifyWatch>.

13.9.1.5 Mandatory Event Coalescing

Coalescing rule (mandatory): Before enqueuing a new event, the delivery path checks whether the tail of the instance's EventQueue is an identical event. If so, the new event is discarded (coalesced) rather than enqueued. Two events are identical if and only if:

fn events_are_identical(a: &InotifyEventBuf, b: &InotifyEventBuf) -> bool {
    a.header.wd     == b.header.wd     &&
    a.header.mask   == b.header.mask   &&
    a.header.cookie == b.header.cookie &&
    a.name          == b.name           // byte-for-byte name comparison
}

The check is against the tail only (O(1)), not the entire queue. Events are coalesced only when consecutive and identical — non-consecutive duplicates are not coalesced (ordering is preserved for different events between duplicates).

IN_MOVED_FROM / IN_MOVED_TO cookie pairing: Cookie values are assigned by a per-VFS-instance AtomicU32 cookie_counter. Consecutive rename operations get consecutive cookie values. Coalescing does NOT apply to cookie-bearing events (mask has IN_MOVED_FROM or IN_MOVED_TO set) — rename pairs must always be delivered in full.

IN_Q_OVERFLOW: When the fixed-capacity RingBuffer is full and a new event cannot be enqueued (even after attempting coalescing), the InotifyInstance.overflow AtomicBool is set to true. On the next read(2), the read path checks this flag first: if set, it clears the flag and prepends a synthetic IN_Q_OVERFLOW event (wd=-1, mask=IN_Q_OVERFLOW, cookie=0, name="") before draining normal events. This keeps the overflow sentinel out of the ring buffer itself, preserving all 256 slots for real events. The queue is never silently dropped without this sentinel.
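
The queueing rules of this subsection — mandatory tail coalescing, drop-newest on overflow, the rename-event exemption, and the prepended IN_Q_OVERFLOW sentinel — can be condensed into a small model. This is an illustrative sketch using a VecDeque and a tiny capacity, not the in-kernel fixed RingBuffer:

```rust
use std::collections::VecDeque;

const IN_Q_OVERFLOW: u32 = 0x4000;
const IN_MOVED_FROM: u32 = 0x40;
const IN_MOVED_TO: u32 = 0x80;
const CAPACITY: usize = 4; // 256 in the real instance; small here for illustration

#[derive(Clone, PartialEq, Debug)]
struct Event { wd: i32, mask: u32, cookie: u32, name: String }

struct Queue { q: VecDeque<Event>, overflow: bool }

impl Queue {
    fn new() -> Self { Queue { q: VecDeque::new(), overflow: false } }

    fn push(&mut self, ev: Event) {
        // Mandatory tail coalescing: identical consecutive events collapse.
        // Cookie-bearing rename events are never coalesced.
        let rename = ev.mask & (IN_MOVED_FROM | IN_MOVED_TO) != 0;
        if !rename && self.q.back() == Some(&ev) { return; }
        // Full queue: drop the newest event, remember the overflow.
        if self.q.len() == CAPACITY { self.overflow = true; return; }
        self.q.push_back(ev);
    }

    fn drain(&mut self) -> Vec<Event> {
        let mut out = Vec::new();
        if self.overflow {
            self.overflow = false;
            // Sentinel lives outside the ring: prepend it on read.
            out.push(Event { wd: -1, mask: IN_Q_OVERFLOW, cookie: 0, name: String::new() });
        }
        out.extend(self.q.drain(..));
        out
    }
}
```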

Performance: Under cargo build workloads (10k+ file writes), inotify watchers on the build directory receive IN_MODIFY storms. Coalescing reduces queue pressure by 10-100x for write-heavy workloads where the application re-reads the file on any change (editor reload, build system).

Linux compatibility: Linux inotify performs the same tail-coalescing. UmkaOS mandates it (Linux specifies it informally). The IN_Q_OVERFLOW sentinel behaviour is identical to Linux.

13.9.2 fanotify

fanotify extends inotify with:

  1. Filesystem-wide and mount-wide marks (not just per-inode): a single mark can cover an entire mount point or filesystem, eliminating the need to add per-inode watches for directories being monitored for new file creation.
  2. Permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM): the originating syscall blocks until the fanotify daemon responds with allow or deny, subject to a mandatory per-group timeout (default 5000ms) to prevent system-wide I/O stalls.

13.9.2.1 Data Structures

/// Per-fanotify-instance state. Created by fanotify_init().
pub struct FanotifyInstance {
    /// Mark table: key is (mark_type, object_id) where mark_type is inode/mount/sb.
    pub marks: RwLock<BTreeMap<FanotifyMarkKey, Arc<FanotifyMark>>>,

    /// Informational event queue (non-permission events).
    pub event_queue: SpinLock<VecDeque<FanotifyEvent>>,

    /// Pending permission requests: keyed by a unique request ID assigned at creation.
    /// Entries are removed when the daemon writes a response.
    pub perm_queue: SpinLock<BTreeMap<u64, Arc<FanotifyPermRequest>>>,

    /// Next permission request ID (monotonically increasing).
    pub next_perm_id: AtomicU64,

    /// Wait queue for poll()/select()/epoll() on this instance.
    pub wait_queue: WaitQueueHead,

    /// Notification class: determines permission event delivery order when multiple
    /// fanotify instances watch the same inode.
    /// FAN_CLASS_NOTIF=0: informational only.
    /// FAN_CLASS_CONTENT=1: content scanners (see file after open).
    /// FAN_CLASS_PRE_CONTENT=2: DLP / integrity monitors (see file before open).
    /// Higher class is notified first. Within the same class, order is unspecified.
    pub class: FanotifyClass,

    /// Flags from fanotify_init() (FAN_CLOEXEC, FAN_NONBLOCK, FAN_REPORT_FID, etc.).
    pub flags: u32,

    /// Maximum time to wait for a permission event response.
    /// Default: 5000ms. Configurable per group at fanotify_init() time via
    /// FANOTIFY_INIT_PERM_TIMEOUT_MS (UmkaOS extension, not in Linux).
    /// A value of 0 means: use the system default from
    /// /proc/sys/fs/fanotify/perm_timeout_ms.
    pub perm_timeout: Duration,

    /// Action taken when a permission request times out:
    /// - PermTimeoutAction::Deny: return EPERM to the originating syscall (safe default)
    /// - PermTimeoutAction::Allow: allow the operation (permissive mode for monitoring-only daemons)
    pub perm_timeout_action: PermTimeoutAction,
    /// Count of permission-request timeouts for this group. Incremented on each
    /// timeout; exposed in /proc/PID/fdinfo/<fafd> as perm_timeout_count.
    pub timeout_count: AtomicU64,
}

pub enum PermTimeoutAction {
    Deny,   // Return EPERM to originating syscall on timeout (default)
    Allow,  // Allow the operation on timeout (for monitoring daemons that tolerate loss)
}

/// A single fanotify mark: attaches event interest to an inode, mount, or superblock.
pub struct FanotifyMark {
    pub mark_type: FanotifyMarkType,  // FAN_MARK_INODE, FAN_MARK_MOUNT, FAN_MARK_FILESYSTEM
    /// Object identifier: inode_id (for inode marks), mount_id (for mount marks),
    /// or superblock pointer (for filesystem marks).
    pub object_id: u64,
    /// Event mask this mark is listening for.
    pub mask: u64,
    /// Ignore mask: events matching this mask are suppressed even if mask is set.
    pub ignored_mask: u64,
    pub instance: Weak<FanotifyInstance>,
}

/// A pending permission request: holds the event plus the response channel.
pub struct FanotifyPermRequest {
    /// The event as delivered to userspace via read(2) on the fanotify fd.
    pub event: FanotifyEvent,
    /// Unique request ID (matches the fd-based identification in the response).
    pub request_id: u64,
    /// Set to FAN_ALLOW or FAN_DENY by the daemon's write(2) response.
    /// Protected by the Mutex; None while pending.
    pub response: Mutex<Option<u32>>,
    /// Wakes the blocked originating syscall when response becomes Some.
    pub waker: WaitQueueHead,
}

/// Event delivered to userspace via read(2) on the fanotify fd.
/// Matches Linux's fanotify_event_metadata ABI.
#[repr(C)]
pub struct FanotifyEvent {
    pub event_len: u32,    // Total length of this event record (including variable info records)
    pub vers: u8,          // FANOTIFY_METADATA_VERSION (always 3)
    pub reserved: u8,
    pub metadata_len: u16, // sizeof(FanotifyEvent)
    pub mask: u64,         // Event type bitmask
    pub fd: i32,           // Opened fd for the file (or -1 with FAN_REPORT_FID)
    pub pid: i32,          // PID of the process that triggered the event
}

pub enum FanotifyMarkType { Inode, Mount, Filesystem }

pub enum FanotifyClass {
    Notif = 0,      // FAN_CLASS_NOTIF
    Content = 1,    // FAN_CLASS_CONTENT
    PreContent = 2, // FAN_CLASS_PRE_CONTENT
}

13.9.2.2 Permission Event Flow

When a VFS operation triggers a permission-event mask bit (e.g., FAN_OPEN_PERM on open(2)):

fanotify_perm_event(inode, event_type, opener_pid):
  // Collect all matching fanotify instances in class order (PreContent first).
  matching = collect_matching_marks(inode, event_type)
  if matching is empty: return Ok(())  // fast path

  for instance in matching sorted by class descending:
    id = instance.next_perm_id.fetch_add(1)
    event_fd = open_file_for_fanotify(inode)  // opens fd for daemon to inspect
    event = FanotifyEvent { mask: event_type, fd: event_fd, pid: opener_pid, ... }
    req = Arc::new(FanotifyPermRequest { event, request_id: id, response: Mutex::new(None), waker: WaitQueueHead::new() })

    instance.perm_queue.lock().insert(id, req.clone())
    instance.event_queue.lock().push_back(event)
    instance.wait_queue.wake_up_one()

    // Block with mandatory timeout — never block indefinitely.
    // Wait on req.waker until req.response becomes Some or the timeout expires.
    match req.waker.wait_timeout(|| req.response.lock().take(), instance.perm_timeout):
      Ok(response):
        close(event_fd)  // kernel-side reference; the daemon holds its own fd
        if response == FAN_ALLOW: continue  // allowed; check next instance
        else: return Err(EPERM)
      Err(Timeout):
        // Log timeout: fanotify daemon too slow
        log_warn!("fanotify: perm request timed out after {:?}, action={:?}",
                  instance.perm_timeout, instance.perm_timeout_action)
        // Increment per-group timeout counter (visible in /proc/PID/fdinfo/<fafd>)
        instance.timeout_count.fetch_add(1, Ordering::Relaxed)
        match instance.perm_timeout_action:
          PermTimeoutAction::Deny  → close(event_fd); return Err(EPERM)
          PermTimeoutAction::Allow → close(event_fd); continue  // allow on timeout

  return Ok(())  // all instances allowed

Mandatory permission event timeout: Permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM) have a mandatory response timeout to prevent system-wide I/O stalls.

System-wide timeout knob: /proc/sys/fs/fanotify/perm_timeout_ms (default: 5000). Can be set to 0 to disable timeout (not recommended; requires CAP_SYS_ADMIN).

Monitoring: /proc/sys/fs/fanotify/perm_timeout_count — system-wide count of permission request timeouts (monotonic counter, reset on boot). Per-group count in /proc/PID/fdinfo/<fafd> as perm_timeout_count: N.

Linux compatibility note: Linux fanotify has no timeout on permission events (daemon death causes permanent block — requires daemon restart or fanotify fd close). UmkaOS's timeout is an improvement over Linux; existing fanotify daemons work unchanged (they don't set FANOTIFY_INIT_PERM_TIMEOUT_MS, so they get the 5s default with Deny on timeout). Tools like systemd-oomd, CrowdStrike Falcon, and audit daemons that use fanotify will benefit automatically from the safety timeout.

Userspace daemon writes FAN_ALLOW / FAN_DENY:

write(fanotify_fd, &fanotify_response { fd: event_fd, response: FAN_ALLOW_or_DENY }):
  // The Linux ABI carries only the event fd; map it to the internal
  // request_id, then answer the pending request.
  req = find_perm_request_by_fd(instance.perm_queue, event_fd)
  if req is None: return Err(EINVAL)  // stale or already answered
  *req.response.lock() = Some(FAN_ALLOW or FAN_DENY)
  req.waker.wake_up_one()  // unblock the blocked syscall

UmkaOS improvement over Linux fanotify: Linux matches responses to pending permission requests by the fd number inside the fanotify_response struct, which becomes ambiguous if the daemon closes and reopens fds in the event window. UmkaOS uses a typed FanotifyPermRequest with a structured response channel keyed by a monotonically increasing request_id. The Arc<FanotifyPermRequest> lifetime guarantees the blocked syscall's stack is valid until the response arrives, eliminating the lifetime ambiguity in the fd-matching approach.
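
Keying responses by a monotonically increasing id makes stale or duplicate answers unambiguous. The following sketch models that property; `PermQueue` and its methods are hypothetical simplifications of the in-kernel perm_queue, not the real API:

```rust
use std::collections::BTreeMap;

struct PermRequest { request_id: u64, response: Option<u32> }

struct PermQueue { pending: BTreeMap<u64, PermRequest>, next_id: u64 }

impl PermQueue {
    fn new() -> Self { PermQueue { pending: BTreeMap::new(), next_id: 0 } }

    /// Originating-syscall side: register a pending permission request.
    fn submit(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.pending.insert(id, PermRequest { request_id: id, response: None });
        id
    }

    /// Daemon response path: unambiguous even if fds were closed and reused.
    fn respond(&mut self, id: u64, verdict: u32) -> Result<u32, ()> {
        match self.pending.remove(&id) {
            Some(mut req) => {
                req.response = Some(verdict); // would wake the blocked syscall here
                Ok(verdict)
            }
            None => Err(()), // stale or already answered → EINVAL
        }
    }
}
```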

13.9.3 UmkaOS-Native File Watch Capabilities

UmkaOS provides a capability-based file watching API as a modern alternative to inotify. Unlike inotify (global watch descriptor namespace, process-scoped), FileWatchCap watches are:

  • Capability-scoped: unforgeable, revocable, auditable
  • Memory-bounded: each watch is a capability slot (no global state)
  • Automatically revoked: when the capability is dropped or the process exits
  • Ring-delivered: events go to a typed UmkaOS ring buffer, not a read() queue
  • Composable: multiple watches can share one ring

inotify remains fully supported for Linux compatibility. FileWatchCap is the recommended API for new UmkaOS code.

/// A capability granting the holder the right to watch a specific inode for
/// specific events. Cannot be forged; issued by the kernel only.
/// Revocable via the standard capability revocation path (Section 8.1).
pub struct FileWatchCap {
    /// The inode to watch. Kernel-internal reference — not a path (immune to rename).
    inode: Arc<Inode>,
    /// Events to deliver (subset of InotifyMask).
    mask: InotifyMask,
    /// Watch children of this directory (if inode is a directory).
    watch_children: bool,
    /// Watch children recursively (deep watch — UmkaOS extension, not in inotify).
    watch_recursive: bool,
}

/// Subscribe to inode events via a capability.
/// Events are delivered to `ring` as typed `FileWatchEvent` structs.
///
/// Returns a `WatchHandle` — dropping the handle unregisters the watch.
pub fn inode_watch(
    cap: FileWatchCap,
    ring: Arc<EventRing<FileWatchEvent>>,
) -> Result<WatchHandle, WatchError>;

/// A single file watch event, delivered to the ring.
#[repr(C)]
pub struct FileWatchEvent {
    pub event_type: FileWatchEventType, // enum (see below)
    pub cookie: u32,                    // for rename pairs (FROM/TO share cookie)
    pub inode_id: u64,                  // stable inode number
    pub name: Option<ArrayString<255>>, // filename (for directory events)
    pub timestamp: MonotonicInstant,    // UmkaOS extension: not in inotify
}

pub enum FileWatchEventType {
    Access,       // File was read
    Modify,       // File was written
    Attrib,       // Metadata changed (chmod, chown, timestamps)
    CloseWrite,   // File opened for writing was closed
    CloseNoWrite, // File opened read-only was closed
    Open,         // File was opened
    MovedFrom,    // File moved out (cookie matches MovedTo)
    MovedTo,      // File moved in (cookie matches MovedFrom)
    Create,       // File created in watched directory
    Delete,       // File deleted from watched directory
    DeleteSelf,   // Watched file itself was deleted
    MoveSelf,     // Watched file itself was moved
    Unmount,      // Filesystem containing watched file was unmounted
}

Deep watch (watch_recursive: true): watches a directory tree recursively. UmkaOS maintains a kernel-side tree of watch registrations, automatically adding watches for new subdirectories as they are created (IN_CREATE on a directory). inotify has no recursive watch; tools like inotifywait -r simulate it with userspace polling, which has TOCTOU races. UmkaOS's deep watch is race-free.

Obtaining a FileWatchCap: capability is issued via:

/// Open a FileWatchCap for a path (requires read permission on the path).
pub fn open_watch_cap(
    dirfd: DirFd,
    path: &Path,
    mask: InotifyMask,
    watch_children: bool,
    watch_recursive: bool,
) -> Result<FileWatchCap, WatchError>;

Revocation: WatchHandle::drop() unregisters the watch. When the process exits, all WatchHandles are dropped automatically — no cleanup required. Capability revocation (Section 8.1) also revokes all file watches derived from the revoked capability.

Linux compatibility: FileWatchCap is an UmkaOS-only API. inotify_init(), inotify_add_watch(), inotify_rm_watch() work identically to Linux. FileWatchCap is intended for new UmkaOS applications; existing Linux software uses inotify unchanged.

13.9.4 Cross-References

  • Section 13.1.1 (VFS Traits): inotify/fanotify hooks are inserted at the VFS operation dispatch layer, after InodeOps/FileOps call sites complete successfully.
  • Section 16.1.2 (Namespace Implementation): fanotify marks survive CLONE_NEWNS and remain attached to the underlying inode/mount, not to a specific mount namespace. Marks set in a parent namespace remain visible in child namespaces for the same underlying mount.
  • Section 8.1 (Security): fanotify_init(FAN_CLASS_CONTENT) and fanotify_init(FAN_CLASS_PRE_CONTENT) require CAP_SYS_ADMIN. Informational fanotify (FAN_CLASS_NOTIF) requires only CAP_FOWNER on Linux; UmkaOS follows the same capability requirement for compatibility.

13.10 Local File Locking (flock / fcntl POSIX Locks / OFD Locks)

UmkaOS provides three advisory file locking interfaces, each with distinct semantics:

Interface          Granularity         Lock scope                  Inherited on fork                 Released on
flock(2)           Whole file          Per open-file-description   Yes (child fd shares description) Last close of the description
fcntl F_SETLK      Byte-range (POSIX)  Per process (PID)           No                                Process exit OR any close of the file
fcntl F_OFD_SETLK  Byte-range (OFD)    Per open-file-description   Yes (child fd shares description) Last close of the description

All three are advisory: a process can read and write a file regardless of locks held by other processes. Locks only prevent other processes from acquiring conflicting locks. Mandatory locking (Linux MS_MANDLOCK) is deliberately not implemented — it was deprecated in Linux 5.15 and is incompatible with modern VFS semantics.

13.10.1 Data Structures

/// A single file lock entry. Stored in the per-inode `FileLockTree`.
pub struct FileLock {
    /// Lock type: read (shared) or write (exclusive).
    pub lock_type: FileLockType,

    /// Byte range: [start, end] inclusive. 0..=u64::MAX represents the whole file.
    /// For flock locks, start=0 and end=u64::MAX always.
    pub start: u64,
    pub end: u64,

    /// For POSIX locks: the PID of the owning process.
    /// All POSIX locks held by a process are released when it exits OR when
    /// any file descriptor for the file is closed (POSIX semantics).
    /// For OFD locks: None. The lock is owned by the open-file-description.
    /// For flock locks: None. The lock is owned by the open-file-description.
    pub owner_pid: Option<Pid>,

    /// The open-file-description that created this lock.
    /// Weak reference: if the description is dropped (last fd closed), the lock
    /// is released. For POSIX locks, `owner_pid` is the primary ownership token
    /// and `owner_fd` is advisory for conflict matching.
    pub owner_fd: Weak<FileDescription>,

    /// Wait queue: tasks blocked waiting for this lock to be released sleep here.
    pub wait_queue: WaitQueueHead,
}

pub enum FileLockType {
    /// Shared (read) lock. Multiple readers can hold simultaneously.
    Read,
    /// Exclusive (write) lock. No other lock may be held concurrently.
    Write,
}

/// Per-inode lock state. Present only on inodes that have had locks acquired;
/// None on inodes that have never been locked (zero overhead on the fast path).
pub struct InodeLocks {
    /// Augmented interval tree of active locks (POSIX, flock, and OFD locks).
    /// Sorted by `l_start`; each node carries `subtree_max: u64` = maximum
    /// `l_end` in its subtree. This enables O(log n) range overlap queries.
    /// See Section 13.10.3 for the full algorithm specification.
    pub locks: FileLockTree,
    /// Protects the lock tree; operations must be atomic with respect to each
    /// other. (Shown as a separate guard to match the pseudocode in 13.10.3;
    /// idiomatic Rust would wrap the tree as SpinLock<FileLockTree>.)
    pub lock: SpinLock<()>,
}

/// Augmented interval tree for file lock conflict detection.
/// Red-black tree sorted by `l_start`, augmented with `subtree_max` for
/// O(log n) range overlap queries.
pub struct FileLockTree {
    /// Root of the red-black tree. None when no locks are held.
    root: Option<Box<FileLockNode>>,
    /// Number of locks currently in the tree.
    count: usize,
}

pub struct FileLockNode {
    pub lock: FileLock,
    /// Maximum `l_end` value in this node's subtree (including this node).
    /// Updated on every insert/delete along the path to the root.
    pub subtree_max: u64,
    pub left: Option<Box<FileLockNode>>,
    pub right: Option<Box<FileLockNode>>,
    pub color: RbColor,
}

pub enum RbColor { Red, Black }

13.10.2 Conflict Detection

Two locks conflict if all of the following hold:

  1. At least one is a write lock (FileLockType::Write).
  2. Their byte ranges overlap: !(lock_a.end < lock_b.start || lock_b.end < lock_a.start).
  3. They have different owners:
     - For POSIX locks: different PIDs.
     - For OFD/flock locks: different Weak<FileDescription> pointers.
     - A POSIX lock can upgrade/replace an existing POSIX lock from the same PID without conflict.
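
The three rules translate directly into a predicate. In this sketch, owners are reduced to an illustrative enum (PID for POSIX locks, a description identity for OFD/flock locks); `conflicts` is a model of the rules, not the kernel's implementation:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Owner {
    Pid(u32),     // POSIX lock owner: process id
    Descr(usize), // OFD/flock owner: open-file-description identity
}

#[derive(Clone, Copy, PartialEq)]
enum LockType { Read, Write }

struct Lock { ty: LockType, start: u64, end: u64, owner: Owner }

fn conflicts(a: &Lock, b: &Lock) -> bool {
    // Rule 1: at least one write lock.
    let write_involved = a.ty == LockType::Write || b.ty == LockType::Write;
    // Rule 2: inclusive byte ranges overlap.
    let ranges_overlap = !(a.end < b.start || b.end < a.start);
    // Rule 3: different owners (same owner may upgrade/replace).
    let different_owner = a.owner != b.owner;
    write_involved && ranges_overlap && different_owner
}
```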

13.10.3 Locking Algorithm

UmkaOS uses an augmented interval tree (red-black tree with subtree_max augmentation) for O(log n) file lock conflict detection. There is no O(n) fallback: Linux has historically scanned a per-inode linked list of locks in O(n); UmkaOS starts with the asymptotically correct design.

FileLockTree structure:

  • Sorted by l_start (range start)
  • Each node carries subtree_max: u64 = maximum l_end in its subtree
  • This augmentation enables O(log n) range overlap queries

Conflict query for range [req_start, req_end]: Walk the tree: at each node, if node.subtree_max < req_start, the entire subtree has no overlapping locks — prune. Otherwise check the node itself and recurse into both children. O(log n + k) where k = number of conflicts found.

Insert/delete: O(log n) standard red-black tree operations, plus O(log n) subtree_max recomputation on the path to root.
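
The pruning rule can be demonstrated on a plain (unbalanced) BST: the red-black rebalancing is elided here, but the subtree_max maintenance and the overlap query have the same shape as in FileLockTree. This is a sketch, not the kernel structure:

```rust
struct Node {
    start: u64,
    end: u64,                // inclusive
    subtree_max: u64,        // max `end` in this node's subtree
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

/// Insert an interval, updating subtree_max along the descent path.
fn insert(node: &mut Option<Box<Node>>, start: u64, end: u64) {
    match node {
        None => *node = Some(Box::new(Node {
            start, end, subtree_max: end, left: None, right: None,
        })),
        Some(n) => {
            if start < n.start { insert(&mut n.left, start, end); }
            else { insert(&mut n.right, start, end); }
            n.subtree_max = n.subtree_max.max(end);
        }
    }
}

/// Collect all intervals overlapping [q_start, q_end], pruning dead subtrees.
fn query(node: &Option<Box<Node>>, q_start: u64, q_end: u64, out: &mut Vec<(u64, u64)>) {
    let Some(n) = node else { return };
    if n.subtree_max < q_start { return; } // prune: nothing here can overlap
    query(&n.left, q_start, q_end, out);
    if n.start <= q_end && q_start <= n.end { out.push((n.start, n.end)); }
    // Right-subtree keys are >= n.start; skip it entirely past the query end.
    if n.start <= q_end { query(&n.right, q_start, q_end, out); }
}
```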

fcntl_setlk(fd, lock_type, start, end, wait: bool) → Result:
  inode = fd.inode()
  ensure inode.locks is initialized

  inode.locks.lock.lock()

  loop:
    // O(log n + k) interval tree query for conflicting locks in [start, end].
    for existing in inode.locks.locks.query_conflicts(start, end, lock_type, &fd):
      if !wait:
        inode.locks.lock.unlock()
        return Err(EAGAIN)           // F_SETLK: fail immediately

      // F_SETLKW: deadlock detection before sleeping
      if would_deadlock(current_pid, existing.owner_pid):
        inode.locks.lock.unlock()
        return Err(EDEADLK)

      inode.locks.lock.unlock()
      existing.wait_queue.wait_event(|| !lock_conflicts_anymore(...))
      inode.locks.lock.lock()
      continue loop                  // re-check after wakeup (spurious wakeup safe)

    // No conflict: coalesce adjacent/overlapping locks of the same type and owner,
    // then insert the new lock. O((k+1) log n).
    coalesce_and_insert(inode, fd, lock_type, start, end)
    inode.locks.lock.unlock()
    return Ok(())

Lock Coalescing Algorithm (Greedy Interval Merge)

Input: a set of pending lock requests sorted by (offset, len). Output: a minimal set of merged lock requests covering the same byte ranges.

Data structure:

struct PendingLockRequest {
    offset: u64,
    len: u64,
    op: LockOp,  // Shared or Exclusive
}

Algorithm (O(n log n) for n requests):

  1. Collect all pending requests into Vec<PendingLockRequest>.
  2. Sort by offset (ascending), then by len (descending) as tiebreaker.
  3. Sweep left to right:
     - Start with current = requests[0].
     - For each subsequent request r:
       - If r.offset <= current.offset + current.len (overlapping or adjacent) AND r.op == current.op (same lock type):
         current.len = max(current.offset + current.len, r.offset + r.len) - current.offset
       - Otherwise: emit current, set current = r.
  4. Emit final current.

Rationale: coalescing reduces the number of kernel lock table entries for byte-range locking (POSIX fcntl(F_SETLK)), avoiding fragmentation in the per-file lock list.
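
The sweep can be written out directly against the PendingLockRequest shape given above; this is a sketch of the merge step only, with derives added so the result is testable:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum LockOp { Shared, Exclusive }

#[derive(Clone, Copy, PartialEq, Debug)]
struct PendingLockRequest { offset: u64, len: u64, op: LockOp }

/// Greedy interval merge: collapse overlapping or adjacent same-type requests.
fn merge(mut reqs: Vec<PendingLockRequest>) -> Vec<PendingLockRequest> {
    if reqs.is_empty() { return reqs; }
    // Sort by offset ascending, then len descending as tiebreaker.
    reqs.sort_by(|a, b| a.offset.cmp(&b.offset).then(b.len.cmp(&a.len)));
    let mut out = Vec::new();
    let mut cur = reqs[0];
    for r in &reqs[1..] {
        if r.offset <= cur.offset + cur.len && r.op == cur.op {
            // Overlapping or adjacent, same type: extend the current range.
            cur.len = (cur.offset + cur.len).max(r.offset + r.len) - cur.offset;
        } else {
            out.push(cur);
            cur = *r;
        }
    }
    out.push(cur);
    out
}
```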

coalesce_and_insert(new_lock) — called after conflict check passes:

  1. Query the interval tree for all locks owned by new_lock's owner that overlap or are adjacent to new_lock's inclusive range [start, end] (adjacent = existing.end + 1 == new_lock.start, or new_lock.end + 1 == existing.start)
  2. Compute the union range: min(all.l_start) to max(all.l_end)
  3. Remove all found locks from the interval tree (O(k log n))
  4. Insert a single merged lock covering the union range (O(log n))

Complexity: O((k+1) log n) where k = number of locks merged. Coalescing reduces tree size over time for processes that acquire many adjacent byte-range locks (common in database file locking patterns).

13.10.4 Deadlock Detection

Deadlock Detection: Wait-For Graph DFS (3-Color)

Each lock holder is a node; each blocked waiter is a directed edge (waiter → holder).

Node state per thread:

  • WHITE: not yet visited in current DFS
  • GRAY: currently in the DFS recursion stack (potential cycle node)
  • BLACK: fully explored, no cycle reachable from here

Constants:

const VFS_LOCK_MAX_DEPTH: usize = 64;  // Max wait-chain depth before abort

Algorithm (invoked before blocking on a contested lock):

fn detect_deadlock(start: ThreadId, graph: &WaitForGraph) -> bool:
  color = HashMap<ThreadId, Color>::new()
  return dfs(start, &mut color, graph, depth=0)

fn dfs(node: ThreadId, color: &mut HashMap, graph: &WaitForGraph, depth: usize) -> bool:
  if depth > VFS_LOCK_MAX_DEPTH:
    return true   // treat as deadlock (conservative)
  color[node] = GRAY
  for each holder in graph.holders_of(node):
    match color.get(holder):
      GRAY  => return true   // back-edge: cycle detected
      BLACK => continue      // already explored, safe
      WHITE | None:
        color[holder] = WHITE
        if dfs(holder, color, graph, depth+1): return true
  color[node] = BLACK
  return false

On true return: the blocking call returns Err(LockError::Deadlock) / EDEADLK. The caller must release all currently held locks and retry with a backoff.

The graph is constructed on-demand per lock request and is not persisted. Returning true on depth overflow is safe: it causes the lock request to fail with EDEADLK, which is better than silently allowing a potential deadlock. The depth limit prevents deadlock detection from becoming a denial-of-service vector in pathological chains.
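A runnable sketch of the detector follows. The WaitForGraph here is a plain adjacency map (an assumption for illustration; the kernel builds it from lock wait queues), and the absence of a color entry plays the role of WHITE:

```rust
use std::collections::HashMap;

type ThreadId = u64;

#[derive(Clone, Copy, PartialEq)]
enum Color { Gray, Black } // no entry in the map = WHITE

/// waiter -> the lock holders it is blocked on.
struct WaitForGraph { edges: HashMap<ThreadId, Vec<ThreadId>> }

const VFS_LOCK_MAX_DEPTH: usize = 64;

fn detect_deadlock(start: ThreadId, g: &WaitForGraph) -> bool {
    let mut color: HashMap<ThreadId, Color> = HashMap::new();
    dfs(start, &mut color, g, 0)
}

fn dfs(node: ThreadId, color: &mut HashMap<ThreadId, Color>,
       g: &WaitForGraph, depth: usize) -> bool {
    if depth > VFS_LOCK_MAX_DEPTH {
        return true; // conservative: treat an over-deep chain as deadlock
    }
    color.insert(node, Color::Gray);
    for &holder in g.edges.get(&node).map(|v| v.as_slice()).unwrap_or(&[]) {
        match color.get(&holder) {
            Some(Color::Gray) => return true, // back-edge: cycle detected
            Some(Color::Black) => continue,   // already explored, safe
            None => { if dfs(holder, color, g, depth + 1) { return true; } }
        }
    }
    color.insert(node, Color::Black);
    false
}
```

A two-thread cycle (A waits on B, B waits on A) is reported as a deadlock; a simple wait chain is not.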

13.10.5 Lock Release on File Description Close

When a FileDescription's reference count drops to zero (the last file descriptor pointing to it is closed):

  • OFD locks: all locks where owner_fd matches this description are removed.
  • flock locks: the flock lock associated with this description (if any) is removed.
  • POSIX locks: all locks where owner_pid == current_process.pid are removed. This is the POSIX-mandated behavior: closing any file descriptor for a file releases all POSIX locks the process holds on that file, regardless of which fd was used to acquire them.

After removing locks, wake all tasks in the wait_queue of each removed lock so they can retry acquisition.

13.10.6 memfd Sealing (F_ADD_SEALS / F_GET_SEALS)

memfd_create(2) returns an anonymous file (backed by tmpfs, with no pathname). Seals are write-once restrictions placed on the file's mutation capabilities:

/// Seal flags for memfd files. Once set, seals cannot be removed.
/// SEAL_SEAL prevents any further seals from being added.
pub struct SealFlags: u32 {
    /// Prevent any further seals from being added.
    const SEAL_SEAL         = 0x0001;
    /// Prevent the file from shrinking (ftruncate to a smaller size returns EPERM).
    const SEAL_SHRINK       = 0x0002;
    /// Prevent the file from growing (writes past EOF, ftruncate to larger size return EPERM).
    const SEAL_GROW         = 0x0004;
    /// Prevent all writes: write(2) returns EPERM, mmap(PROT_WRITE) returns EPERM.
    const SEAL_WRITE        = 0x0008;
    /// Prevent future mmap(PROT_WRITE) but allow existing writable mappings to remain.
    const SEAL_FUTURE_WRITE = 0x0010;
}

fcntl(fd, F_ADD_SEALS, seals): add the specified seals atomically via a compare_exchange on the inode's AtomicU32 seal field. Fails with EPERM if SEAL_SEAL is already set. Fails with EBUSY if SEAL_WRITE is being added while a writable mmap exists on the file.

fcntl(fd, F_GET_SEALS): return the current seal set (atomic load, lock-free).

Seal enforcement in VFS paths:

  • write(2) and pwrite64(2): check SEAL_WRITE.
  • ftruncate(2) to smaller size: check SEAL_SHRINK.
  • ftruncate(2) to larger size: check SEAL_GROW.
  • mmap(PROT_WRITE): check SEAL_WRITE | SEAL_FUTURE_WRITE.

UmkaOS improvement: seals are stored as an AtomicU32 in the memfd's inode — seal reads are lock-free (a single atomic load), which is important because the seal check appears on every write(2) and mmap(2) call for sealed fds.
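The atomic seal update can be sketched as a compare_exchange loop. This is a userspace sketch: the string error codes stand in for errno values, and the writable_maps parameter stands in for the inode's writable-mapping count, which the real implementation reads from the inode:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const SEAL_SEAL: u32 = 0x0001;
const SEAL_WRITE: u32 = 0x0008;

/// Add seals to an inode's seal word. Fails with "EPERM" if SEAL_SEAL is
/// already set, and with "EBUSY" if SEAL_WRITE is requested while writable
/// mappings exist. Seals are write-once: bits are only ever OR-ed in.
fn add_seals(seals: &AtomicU32, new: u32, writable_maps: u32)
             -> Result<(), &'static str> {
    loop {
        let cur = seals.load(Ordering::Acquire);
        if cur & SEAL_SEAL != 0 {
            return Err("EPERM"); // sealing itself has been sealed
        }
        if new & SEAL_WRITE != 0 && writable_maps > 0 {
            return Err("EBUSY"); // cannot seal writes under a writable mmap
        }
        match seals.compare_exchange(cur, cur | new,
                                     Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return Ok(()),
            Err(_) => continue, // raced with a concurrent sealer; retry
        }
    }
}
```

The retry loop is what makes the fcntl(F_ADD_SEALS) path safe without taking any lock: a concurrent sealer only forces one extra iteration.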

13.10.7 Cross-References

  • Section 14.6 (Distributed Lock Manager): the DLM provides cluster-wide advisory locks that extend the local flock/POSIX lock semantics across nodes. Local file locks (this section) are node-local only.
  • Section 13.1.1 (VFS Architecture): FileOps::release() is the call site where OFD and flock locks are released when the last fd to a file description is closed.
  • Section 13 (Containers): POSIX lock ownership is per-PID-namespace-PID. Within a container's PID namespace, lock ownership semantics are unchanged.

13.10.8 Lock Semantics Mode (POSIX Default / OFD Opt-in)

UmkaOS keeps POSIX semantics as the default for F_SETLK to preserve full Linux binary compatibility. Applications and deployments that want the correct OFD semantics as default can opt in at three levels, with the highest-priority source winning:

Priority order (highest first):

  1. Per-call explicit constant
  2. Per-process prctl
  3. Per-user-namespace sysctl
  4. System global default: POSIX


Per-call explicit (always available, no mode setting needed)

F_OFD_SETLK    // Always OFD semantics (Linux 3.15+, UmkaOS supported)
F_OFD_SETLKW   // Always OFD semantics, blocking
F_SETLK_POSIX  // UmkaOS extension: always POSIX semantics, explicit
F_SETLKW_POSIX // UmkaOS extension: always POSIX semantics, blocking

F_SETLK_POSIX exists so code inside an OFD-default process can still request POSIX semantics for specific locks (e.g., a bundled library that requires process-death lock release for crash detection).


Per-process opt-in

prctl(PR_SET_LOCK_SEMANTICS, LOCK_SEM_OFD)    // F_SETLK means OFD for this process
prctl(PR_SET_LOCK_SEMANTICS, LOCK_SEM_POSIX)  // Explicit POSIX (escape hatch)
prctl(PR_GET_LOCK_SEMANTICS, 0, 0, 0, 0)      // Query current mode
pub const LOCK_SEM_POSIX: u64 = 0;  // default
pub const LOCK_SEM_OFD:   u64 = 1;

Stored in Task.lock_semantics (per-thread but inherited from the process — all threads in a process share the same mode via Process.lock_semantics). The per-process field is three-valued: Unset (the default, deferring to the user-namespace sysctl), Posix, or Ofd; Unset is the value tested by effective_lock_semantics() in Section 13.10.8's internal resolution.

Inheritance rules:

  • fork(): child inherits parent's lock_semantics
  • exec(): inherited (sticky) — a container runtime sets it once; all descendant processes inherit
  • exec() of a setuid/setgid binary: reset to the user-namespace sysctl default (security: a privilege-elevating binary must not blindly inherit)


Per-user-namespace sysctl

/proc/sys/fs/file_lock_default

Values: posix (default) | ofd

This sysctl is per-user-namespace, not global. Each container has its own user namespace and therefore its own file_lock_default. The container runtime sets it at container creation:

# Inside an UmkaOS-native container's user namespace:
echo ofd > /proc/sys/fs/file_lock_default

/// Per-user-namespace lock semantics default.
/// Stored in UserNamespace.file_lock_default.
pub enum LockSemanticsMode {
    Posix = 0,  // F_SETLK uses POSIX semantics (default)
    Ofd   = 1,  // F_SETLK uses OFD semantics
}

Requires CAP_SYS_ADMIN in the target user namespace to change. Affects new processes only — running processes keep their current mode.


Deployment model

Scenario                                   Recommended config
Host with legacy software                  sysctl = posix (default), no change needed
UmkaOS-native container                    runtime sets sysctl = ofd in container's user namespace
Mixed container (some legacy binaries)     sysctl = posix, UmkaOS-native apps use prctl
Wine / NFS lockd / old SQLite              prctl(LOCK_SEM_POSIX) in launch wrapper

Internal resolution

fn effective_lock_semantics(
    task: &Task,
    cmd: FcntlCmd,
) -> LockSemanticsMode {
    match cmd {
        FcntlCmd::OfdSetLk | FcntlCmd::OfdSetLkW     => LockSemanticsMode::Ofd,
        FcntlCmd::SetLkPosix | FcntlCmd::SetLkWPosix  => LockSemanticsMode::Posix,
        FcntlCmd::SetLk | FcntlCmd::SetLkW => {
            // Resolve: per-process > per-namespace sysctl > global POSIX
            if task.process.lock_semantics != LockSemanticsMode::Unset {
                task.process.lock_semantics
            } else {
                task.user_namespace.file_lock_default
            }
        }
        _ => LockSemanticsMode::Posix,
    }
}

Linux compatibility: existing binaries calling F_SETLK on a system where no mode is set get identical POSIX behaviour to Linux. F_OFD_SETLK was added in Linux 3.15 and is already supported. F_SETLK_POSIX and PR_SET_LOCK_SEMANTICS are UmkaOS extensions with no Linux equivalent.


13.11 Disk Quota Subsystem (quotactl)

Disk quotas enforce per-user, per-group, and per-project limits on filesystem space and inode usage. Required for multi-tenant storage environments and Linux compatibility.

13.11.1 Data Structures

/// Per-subject (user, group, or project) quota accounting and limits.
/// Matches the Linux `struct dqblk` layout for quotactl(2) ABI compatibility.
#[repr(C)]
pub struct DiskQuota {
    /// Hard block limit (bytes). 0 = no limit. Writes that would exceed this
    /// are rejected with EDQUOT immediately, regardless of grace period.
    pub bhardlimit: u64,

    /// Soft block limit (bytes). Exceeding this triggers a grace period timer.
    /// Once the grace period expires, further writes are rejected with EDQUOT.
    pub bsoftlimit: u64,

    /// Current block usage (bytes). Updated on every successful write and truncate.
    pub bcurrent: u64,

    /// Hard inode limit. 0 = no limit. File creation that would exceed this
    /// is rejected with EDQUOT.
    pub ihardlimit: u64,

    /// Soft inode limit. Exceeding this triggers an inode grace period.
    pub isoftlimit: u64,

    /// Current inode count (files + directories + symlinks owned by this subject).
    pub icurrent: u64,

    /// Grace period expiry for the block soft limit: set to now + bgrace when
    /// the soft limit is first exceeded; 0 if the limit is not exceeded.
    /// Writes after this time are rejected with EDQUOT.
    pub btime: i64,

    /// Grace period expiry for the inode soft limit (0 if not exceeded).
    pub itime: i64,

    /// Grace period for the block soft limit, in seconds. Default: 7 days (604800).
    pub bgrace: u32,

    /// Grace period for the inode soft limit, in seconds. Default: 7 days (604800).
    pub igrace: u32,
}

/// Quota subject type.
pub enum QuotaType {
    User    = 0,  // USRQUOTA
    Group   = 1,  // GRPQUOTA
    Project = 2,  // PRJQUOTA
}

/// Quota operations implemented by filesystems that support quotas.
/// Optional — filesystems without quota support omit this and quotactl(2) returns ENOSYS.
pub trait QuotaOps: Send + Sync {
    /// Enable quota enforcement for the given type, reading limits from `quota_file`.
    fn quota_on(&self, quota_type: QuotaType, quota_file: &str) -> Result<(), VfsError>;

    /// Disable quota enforcement for the given type.
    fn quota_off(&self, quota_type: QuotaType) -> Result<(), VfsError>;

    /// Read the quota entry for subject `id` (UID, GID, or project ID).
    fn get_quota(&self, quota_type: QuotaType, id: u32) -> Result<DiskQuota, VfsError>;

    /// Set limits and accounting for subject `id`. Requires CAP_SYS_ADMIN.
    fn set_quota(&self, quota_type: QuotaType, id: u32, quota: &DiskQuota) -> Result<(), VfsError>;

    /// Read global quota state (grace periods, flags) for the given type.
    fn get_info(&self, quota_type: QuotaType) -> Result<QuotaInfo, VfsError>;

    /// Set global quota state (grace periods). Requires CAP_SYS_ADMIN.
    fn set_info(&self, quota_type: QuotaType, info: &QuotaInfo) -> Result<(), VfsError>;

    /// Flush in-memory quota accounting to the quota database file.
    fn sync_quota(&self, quota_type: QuotaType) -> Result<(), VfsError>;
}

/// Global quota state (grace periods and enabled flags) for a single quota type.
pub struct QuotaInfo {
    /// Block grace period in seconds.
    pub bgrace: u32,
    /// Inode grace period in seconds.
    pub igrace: u32,
    /// Quota flags (QIF_FLAGS: quota enabled, quota accounting-only, etc.).
    pub flags: u32,
}

13.11.2 quotactl(2) Dispatch

The quotactl(2) syscall encodes both the quota command and the quota type in a single 32-bit cmd argument, following the Linux QCMD() macro: the command (Q_QUOTAON, Q_QUOTAOFF, Q_GETQUOTA, Q_SETQUOTA, Q_GETINFO, Q_SETINFO, Q_SYNC) occupies the bits above bit 8 (SUBCMDSHIFT = 8) and the quota type is the low 8 bits (USRQUOTA=0, GRPQUOTA=1, PRJQUOTA=2).

quotactl(cmd, dev, id, addr):
  qt_cmd  = cmd >> 8                     // SUBCMDSHIFT = 8, matches Linux QCMD()
  qt_type = QuotaType::from(cmd & 0xff)  // USRQUOTA/GRPQUOTA/PRJQUOTA

  sb = resolve_superblock_from_device_path(dev)
  if sb.quota_ops is None: return Err(ENOSYS)

  // Capability check for mutating operations
  if qt_cmd in [Q_QUOTAON, Q_QUOTAOFF, Q_SETQUOTA, Q_SETINFO]:
    check_capability(CAP_SYS_ADMIN)?

  match qt_cmd:
    Q_QUOTAON   → sb.quota_ops.quota_on(qt_type, addr_as_path)
    Q_QUOTAOFF  → sb.quota_ops.quota_off(qt_type)
    Q_GETQUOTA  → quota = sb.quota_ops.get_quota(qt_type, id)?; copy_to_user(addr, quota)
    Q_SETQUOTA  → quota = copy_from_user(addr)?; sb.quota_ops.set_quota(qt_type, id, &quota)
    Q_GETINFO   → info = sb.quota_ops.get_info(qt_type)?; copy_to_user(addr, info)
    Q_SETINFO   → info = copy_from_user(addr)?; sb.quota_ops.set_info(qt_type, &info)
    Q_SYNC      → sb.quota_ops.sync_quota(qt_type)
    _           → return Err(EINVAL)
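The cmd-word packing can be shown in a few lines mirroring the Linux UAPI QCMD() macro; the Q_GETQUOTA value below is the Linux quota.h constant, included here for concreteness:

```rust
const SUBCMDSHIFT: u32 = 8;
const SUBCMDMASK: u32 = 0x00ff;

const Q_GETQUOTA: u32 = 0x800007; // Linux UAPI constant
const USRQUOTA: u32 = 0;

/// Pack a quota command and quota type into the quotactl cmd word
/// (the Linux QCMD() macro).
fn qcmd(cmd: u32, qtype: u32) -> u32 {
    (cmd << SUBCMDSHIFT) | (qtype & SUBCMDMASK)
}

/// Unpack the cmd word back into (command, type), as the dispatch above does.
fn qcmd_split(word: u32) -> (u32, u32) {
    (word >> SUBCMDSHIFT, word & SUBCMDMASK)
}
```

A round trip through qcmd() and qcmd_split() recovers the original (command, type) pair, which is the invariant the dispatch path relies on.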

13.11.3 VFS Enforcement Hooks

On every write(2), fallocate, create, mkdir, mknod, and symlink call, the VFS checks quotas for all three subject types:

vfs_quota_check_blocks(inode, bytes_requested) → Result:
  creds = current_task().creds
  for qt in [QuotaType::User, QuotaType::Group, QuotaType::Project]:
    id = match qt:
      User    → creds.fsuid
      Group   → creds.fsgid
      Project → inode.project_id  // from the inode's i_projid field (set via FS_IOC_FSSETXATTR)
    quota = inode.sb.quota_ops.get_quota(qt, id)?  // from in-memory quota cache
    new_usage = quota.bcurrent + bytes_requested
    if new_usage > quota.bhardlimit && quota.bhardlimit != 0:
      return Err(EDQUOT)  // hard limit exceeded: reject immediately
    if new_usage > quota.bsoftlimit && quota.bsoftlimit != 0:
      now = current_time_secs()
      if quota.btime == 0:
        quota.btime = now + quota.bgrace as i64  // start grace period timer
        update_quota_cache(qt, id, &quota)
      elif now > quota.btime:
        return Err(EDQUOT)  // grace period expired: reject
      // else: within grace period, allow the write
  return Ok(())

vfs_quota_check_inodes(inode, count) → Result:
  // Identical structure to vfs_quota_check_blocks but uses icurrent/isoftlimit/ihardlimit.
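The per-subject soft/hard limit decision can be isolated as a pure function. This sketch follows the enforcement hook above (btime holds the grace expiry, 0 meaning the soft limit is not currently exceeded); the enum and scalar parameters are illustration choices standing in for the DiskQuota fields and errno values:

```rust
/// Outcome of a block-quota check for one subject.
#[derive(Debug, PartialEq)]
enum QuotaVerdict {
    Allow,
    /// Soft limit newly exceeded: caller must store this expiry in btime.
    AllowStartGrace { btime: i64 },
    /// EDQUOT: hard limit hit, or grace period expired.
    Deny,
}

fn check_blocks(bcurrent: u64, req: u64, bhard: u64, bsoft: u64,
                btime: i64, bgrace: u32, now: i64) -> QuotaVerdict {
    let new_usage = bcurrent + req;
    if bhard != 0 && new_usage > bhard {
        return QuotaVerdict::Deny; // hard limit: reject immediately
    }
    if bsoft != 0 && new_usage > bsoft {
        if btime == 0 {
            // First time over the soft limit: start the grace timer.
            return QuotaVerdict::AllowStartGrace { btime: now + bgrace as i64 };
        }
        if now > btime {
            return QuotaVerdict::Deny; // grace period expired
        }
        // else: within the grace period, allow the write
    }
    QuotaVerdict::Allow
}
```

The inode-count check is the same function with icurrent/isoftlimit/ihardlimit substituted for the block fields.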

13.11.4 In-Memory Quota Cache

Quota accounting state is kept in a per-filesystem in-memory cache to avoid hitting the quota database file on every write. The cache structure mirrors DiskQuota with an additional dirty: bool field. Cache entries are written back to the quota file asynchronously via sync_quota(), which is called:

  • Periodically by the writeback daemon (default interval: 30 seconds).
  • On quotactl(Q_SYNC).
  • On filesystem unmount.
  • On sync(2) / syncfs(2) when the filesystem's quota is dirty.

Cache lookups use a per-filesystem RwLock<HashMap<(QuotaType, u32), DiskQuota>>. The read lock is taken for quota checks (common case); write lock only for updates. This allows concurrent quota checks across different subjects with no contention.
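The cache pattern can be sketched with std primitives. The QuotaType and entry shapes are simplified here to show only the locking discipline (read lock on the check path, write lock plus dirty flag on the update path):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum QuotaType { User, Group, Project }

#[derive(Clone, Default)]
struct CachedQuota { bcurrent: u64, dirty: bool }

struct QuotaCache {
    entries: RwLock<HashMap<(QuotaType, u32), CachedQuota>>,
}

impl QuotaCache {
    /// Common case: read lock only, so concurrent checks do not contend.
    fn usage(&self, qt: QuotaType, id: u32) -> u64 {
        self.entries.read().unwrap()
            .get(&(qt, id)).map(|q| q.bcurrent).unwrap_or(0)
    }

    /// Update path: write lock; mark the entry dirty for async writeback.
    fn charge(&self, qt: QuotaType, id: u32, bytes: u64) {
        let mut map = self.entries.write().unwrap();
        let e = map.entry((qt, id)).or_default();
        e.bcurrent += bytes;
        e.dirty = true;
    }
}
```

The dirty flag is what sync_quota() consults when flushing entries back to the quota database file.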

13.11.5 Linux Compatibility

  • quotactl(2) with all seven commands (Q_QUOTAON, Q_QUOTAOFF, Q_GETQUOTA, Q_SETQUOTA, Q_GETINFO, Q_SETINFO, Q_SYNC) is fully implemented.
  • The dqblk structure layout matches the Linux UAPI definition exactly.
  • quota tools (quota, quotacheck, repquota, edquota) work without modification.
  • ext4, XFS, and tmpfs quota implementations are in scope for the initial release.
  • Project quotas (PRJQUOTA) are supported; project IDs are stored in the inode's i_projid field (set via FS_IOC_FSSETXATTR).

13.11.6 Cross-References

  • Section 13.1.1 (VFS Architecture): quota checks are inserted into the VFS dispatch layer at write, create, mkdir, mknod, and fallocate call sites.
  • Section 13 (Containers): cgroup v2 io.max and memory.max provide resource controls complementary to quota; quota enforces per-UID/GID storage limits while cgroups enforce per-container I/O and memory limits.
  • Section 11 (Storage): ext4, XFS, and btrfs filesystem drivers implement QuotaOps as part of their SuperBlock initialization.

13.12 Pipes and FIFOs

Pipes (pipe(2), pipe2(2)) are anonymous unidirectional byte streams; named FIFOs (mkfifo(2)) provide the same stream semantics through a filesystem pathname. Together they form the oldest and most widely used IPC primitive in UNIX.

13.12.1 Design: Fixed SPSC Ring Buffer

UmkaOS implements pipe data buffering as a fixed-size lock-free SPSC ring rather than Linux's dynamically-allocated page list. Linux pipes allocate and free individual 4KB pages as data fills and drains, requiring a pipe spinlock on every read and write. UmkaOS's ring buffer is:

  • Allocated once at pipe creation (default 65536 bytes = 16 pages, matching Linux default)
  • Lock-free for the common SPSC case (one writer, one reader — the overwhelmingly dominant use: cmd | cmd)
  • Cache-friendly: contiguous memory, no pointer chasing between pages
  • Zero dynamic allocation in the data path

/// Pipe data buffer — a lock-free single-producer single-consumer ring.
pub struct PipeRing {
    /// Contiguous backing buffer. Size is always a power of 2.
    buf: Box<[u8]>,
    /// Free-running writer byte counter; the buffer index is
    /// write_pos & (buf.len() - 1). Written by writer, read by reader.
    write_pos: AtomicUsize,
    /// Free-running reader byte counter. Written by reader, read by writer.
    read_pos: AtomicUsize,
}

impl PipeRing {
    /// Available bytes for reading.
    pub fn available(&self) -> usize {
        let w = self.write_pos.load(Ordering::Acquire);
        let r = self.read_pos.load(Ordering::Relaxed);
        w.wrapping_sub(r)
    }

    /// Free space for writing.
    pub fn free(&self) -> usize {
        self.buf.len() - self.available()
    }
}
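The counter arithmetic can be exercised in a userspace sketch. The free-running counters and power-of-2 mask follow PipeRing above; the push() method is an illustrative single-threaded stand-in for the writer path (the real writer uses acquire/release pairs across threads):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct Ring {
    buf: Box<[u8]>,          // power-of-2 capacity
    write_pos: AtomicUsize,  // free-running writer counter
    read_pos: AtomicUsize,   // free-running reader counter
}

impl Ring {
    fn new(cap: usize) -> Ring {
        assert!(cap.is_power_of_two());
        Ring { buf: vec![0u8; cap].into_boxed_slice(),
               write_pos: AtomicUsize::new(0), read_pos: AtomicUsize::new(0) }
    }
    fn available(&self) -> usize {
        // Correct even after the counters wrap around usize::MAX.
        self.write_pos.load(Ordering::Acquire)
            .wrapping_sub(self.read_pos.load(Ordering::Relaxed))
    }
    fn free(&self) -> usize { self.buf.len() - self.available() }

    /// Copy as much of `src` as fits; returns the number of bytes written.
    fn push(&mut self, src: &[u8]) -> usize {
        let n = src.len().min(self.free());
        let mask = self.buf.len() - 1;
        let w = self.write_pos.load(Ordering::Relaxed);
        for (i, &b) in src[..n].iter().enumerate() {
            self.buf[(w + i) & mask] = b; // wrap via mask, not modulo
        }
        self.write_pos.store(w + n, Ordering::Release);
        n
    }
}
```

Because the counters never reset, available() is a plain subtraction and the buffer index is a single AND, which is the cache-friendly property the design relies on.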

/// Per-pipe state shared between reader and writer endpoints.
pub struct Pipe {
    ring: PipeRing,
    /// Writer end open (false = EOF on read when ring drained).
    write_open: AtomicBool,
    /// Reader end open (false = SIGPIPE/EPIPE on write).
    read_open: AtomicBool,
    /// Tasks sleeping waiting for data (reader blocks).
    read_waiters: WaitQueue,
    /// Tasks sleeping waiting for space (writer blocks).
    write_waiters: WaitQueue,
}

13.12.2 Capacity and fcntl(F_SETPIPE_SZ)

Default pipe capacity: 65536 bytes (matches Linux default).

fcntl(F_SETPIPE_SZ, size) resizes the ring:

  • Rounds up to the next power of 2 (minimum 4096 bytes)
  • Maximum: /proc/sys/fs/pipe-max-size (default 1MB, same as Linux)
  • Requires CAP_SYS_RESOURCE to exceed /proc/sys/fs/pipe-max-size
  • Data currently in the pipe is preserved (ring resized via realloc + data copy)
  • If the new size is smaller than current content: EBUSY

fcntl(F_GETPIPE_SZ) returns the current capacity.
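The size validation reduces to a small pure function. A sketch under the rules above, with string error codes standing in for errno values and the 1MB cap standing in for the live sysctl:

```rust
const PIPE_MIN_SIZE: usize = 4096;
const PIPE_MAX_SIZE: usize = 1 << 20; // /proc/sys/fs/pipe-max-size default

/// Validate and round a requested pipe capacity.
/// `content` is the number of bytes currently buffered in the ring.
fn pipe_set_size(requested: usize, content: usize, cap_sys_resource: bool)
                 -> Result<usize, &'static str> {
    // Round up to a power of 2, clamped to the 4096-byte minimum.
    let size = requested.max(PIPE_MIN_SIZE).next_power_of_two();
    if size > PIPE_MAX_SIZE && !cap_sys_resource {
        return Err("EPERM"); // unprivileged callers stay under pipe-max-size
    }
    if size < content {
        return Err("EBUSY"); // shrinking below buffered data would drop bytes
    }
    Ok(size)
}
```

For instance, a request for 5000 bytes on an empty pipe yields an 8192-byte ring.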

13.12.3 MPSC Pipes (Multiple Writers)

When more than one process/thread writes to the same pipe (e.g., shell { cmd1; cmd2; } | cmd3), the SPSC assumption breaks. UmkaOS detects multiple writers via Pipe.writer_count: AtomicU32:

  • writer_count == 1: lock-free SPSC path
  • writer_count > 1: writer acquires Pipe.write_lock: Mutex<()> before writing

Writes <= PIPE_BUF (4096 bytes) are always atomic (no interleaving with other writers) — same guarantee as POSIX.
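The fast/slow path split can be sketched as below. The writer_count and write_lock fields are the ones named above; the closure stands in for the actual ring write, and this sketch deliberately ignores the race where writer_count changes mid-write, which the real implementation must handle at fd open/close time:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

struct PipeWrite {
    writer_count: AtomicU32,
    write_lock: Mutex<()>,
}

/// Choose the lock-free SPSC path when there is exactly one writer,
/// otherwise serialize writers through the mutex.
fn pipe_write(p: &PipeWrite, do_write: impl FnOnce()) {
    if p.writer_count.load(Ordering::Acquire) == 1 {
        do_write(); // SPSC fast path: no lock
    } else {
        let _g = p.write_lock.lock().unwrap(); // MPSC: serialize writers
        do_write();
    }
}
```

The common shell pipeline (one writer, one reader) never touches the mutex; only genuinely shared write ends pay for serialization.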

13.12.4 O_DIRECT Pipe Mode

pipe2(O_DIRECT): each write() is a discrete message; read() returns exactly one message. Implemented by prepending a 4-byte length header in the ring:

/// O_DIRECT pipe message header (4 bytes, little-endian).
/// Followed immediately by `len` bytes of payload.
/// Alignment: none required (ring is byte-addressable).
#[repr(C, packed)]
pub struct PipeMessageHdr {
    pub len: u32,
}

Maximum message size: PIPE_BUF (4096 bytes) for atomic writes.
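The framing can be sketched on a plain byte stream. A Vec stands in for the ring here, and the EINVAL on oversized messages is an illustration of the PIPE_BUF bound rather than a claim about the exact error path:

```rust
const PIPE_BUF: usize = 4096;

/// Append one O_DIRECT message: 4-byte little-endian length header + payload.
fn write_message(ring: &mut Vec<u8>, payload: &[u8]) -> Result<(), &'static str> {
    if payload.len() > PIPE_BUF {
        return Err("EINVAL"); // messages above PIPE_BUF lose atomicity
    }
    ring.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    ring.extend_from_slice(payload);
    Ok(())
}

/// Pop exactly one message, as read(2) does in O_DIRECT mode.
fn read_message(ring: &mut Vec<u8>) -> Option<Vec<u8>> {
    if ring.len() < 4 { return None; }
    let len = u32::from_le_bytes([ring[0], ring[1], ring[2], ring[3]]) as usize;
    if ring.len() < 4 + len { return None; }
    let msg = ring[4..4 + len].to_vec();
    ring.drain(..4 + len);
    Some(msg)
}
```

Each read returns exactly one write's payload, preserving message boundaries that a plain byte-stream pipe would merge.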

13.12.5 Named FIFOs (mkfifo)

Named FIFOs use the same Pipe struct, but with a VFS inode for pathname lookup:

  • mkfifo(path, mode): creates a VFS inode of type InodeKind::Fifo
  • open(path, O_RDONLY): blocks until a writer opens (unless O_NONBLOCK)
  • open(path, O_WRONLY): blocks until a reader opens (unless O_NONBLOCK)
  • Once both ends are open: identical semantics to anonymous pipe

13.12.6 Splice and Zero-Copy

splice(2) between two pipes (or pipe + socket) operates by transferring ring buffer segments rather than copying data. UmkaOS implements splice as:

  1. Identify contiguous segment in source ring
  2. Map that segment into destination ring (pointer-level transfer for pipe-to-pipe)
  3. Advance source read_pos, destination write_pos

For pipe-to-socket: uses sendmsg with the ring segment as an iov, letting the network stack DMA directly from the pipe buffer (zero kernel-side copy).

13.12.7 Linux Compatibility

  • Default capacity 65536 bytes: identical to Linux
  • F_SETPIPE_SZ / F_GETPIPE_SZ: identical semantics
  • PIPE_BUF = 4096 bytes: POSIX required, identical
  • O_DIRECT pipe mode: identical to Linux 3.4+
  • pipe2(O_CLOEXEC | O_NONBLOCK | O_DIRECT): all flags supported
  • Splice semantics: identical to Linux
  • /proc/sys/fs/pipe-max-size: identical default (1MB), same permission model
  • Signal on broken pipe: SIGPIPE + EPIPE on write to pipe with no readers
  • select()/poll()/epoll(): EPOLLIN when data available, EPOLLOUT when space available, EPOLLHUP on last writer close