Chapter 13: Virtual Filesystem Layer
VFS architecture, dentry cache, mount tree, path resolution, overlayfs, mount namespace operations
13.1 Virtual Filesystem Layer
The VFS (umka-vfs) provides a unified interface over all filesystem types. It is a Tier 1 component running in its own hardware isolation domain (see Section 10.2 for platform-specific isolation mechanisms), isolated from both umka-core and the individual filesystem drivers it manages.
Why VFS is Tier 1 (not Tier 0):
The VFS handles complex, security-sensitive operations: path resolution (symlink loops, mount point crossing), permission checks, and filesystem driver coordination. Isolating VFS from Core provides:
- Attack surface reduction: Path resolution bugs (symlink attacks, directory traversal) are confined to the VFS domain and cannot corrupt Core memory.
- Driver isolation chain: Core → VFS (Tier 1) → Filesystem driver (Tier 1/2). A compromised filesystem driver cannot corrupt VFS metadata, and a compromised VFS cannot corrupt Core memory.
- Crash containment: A VFS panic (e.g., corrupted dentry cache) is recoverable without rebooting the entire kernel. The recovery protocol:
a. Detection: umka-core detects VFS domain death (MPK exception, panic handler, or watchdog timeout on the VFS heartbeat ring).
b. Freeze: All syscalls that enter VFS (open, stat, read, write, close, etc.) are blocked at the umka-core domain boundary. Callers receive -ERESTARTSYS and the VFS ring is drained.
c. Dirty page cache flush: Dirty pages in umka-core's page cache are flushed to their backing block devices. The page cache is in umka-core memory (not VFS memory), so it survives the VFS crash. Flush uses the block layer ring directly.
d. Dentry/inode cache rebuild: The new VFS instance starts with an empty dentry cache. Dentries are lazily re-populated on the next path lookup (a cache miss triggers a disk read). The inode cache is similarly rebuilt on demand.
e. Open file descriptor recovery: umka-core maintains a table of open file descriptors with their inode numbers and seek positions. After VFS restart, umka-core re-opens each fd by inode number. File descriptors that pointed to deleted files (unlinked but still open) receive -EIO on next access.
f. Resume: The VFS ring is reopened and blocked syscalls are retried.
Recovery time: ~100-500ms depending on the number of open file descriptors.
Limitation: In-flight writes that had not yet reached the page cache are lost
(the application receives -EIO and must retry).
Dirty Page Handling on VFS Crash
When a Tier 1 VFS driver crashes, UmkaOS Core cannot safely flush dirty pages using the crashed driver's block mapping (the file-offset → block-address translation lives in the now-destroyed VFS domain).
UmkaOS's design: pre-registration of dirty extents.
Before modifying any file pages, the VFS driver must register the affected block extents with UmkaOS Core:
/// Register a dirty file extent before modification.
/// Called by VFS drivers before dirtying page cache pages.
/// UmkaOS Core stores a compact log of registered extents for use
/// during crash recovery.
///
/// # Parameters
/// - `inode_id`: Stable inode identifier (survives VFS crash).
/// - `file_offset`: Byte offset of the dirty range start.
/// - `len`: Length of the dirty range in bytes.
/// - `block_addr`: Physical block address for this range.
/// - `block_len`: Length of the block range in bytes.
///
/// # Errors
/// Returns `VfsDirtyError::RingFull` (equivalent to `EBUSY`) when the
/// dirty extent ring is full. The caller must free ring slots by calling
/// `vfs_flush_extent_complete()` before retrying.
pub fn vfs_register_dirty_extent(
inode_id: InodeId,
file_offset: u64,
len: u64,
block_addr: PhysBlockAddr,
block_len: u64,
) -> Result<(), VfsDirtyError>;
/// Error type for dirty extent registration.
pub enum VfsDirtyError {
/// The dirty extent ring is full (all 4096 slots occupied with
/// unacknowledged extents). Free slots with `vfs_flush_extent_complete()`
/// before retrying. Equivalent to EBUSY.
RingFull,
/// Other VFS error (invalid inode ID, etc.).
Other(VfsError),
}
UmkaOS Core maintains a dirty extent log in core memory (not in VFS domain memory):
- Ring buffer of up to 4096 DirtyExtentRecord entries per filesystem instance.
- Each record: { inode_id, file_offset, len, block_addr, block_len, seq: u64 }.
- Entries are cleared when the VFS driver calls vfs_flush_extent_complete().
Ring overflow policy: vfs_register_dirty_extent() returns EBUSY when the ring is full (all 4096 slots occupied with unacknowledged extents). The VFS driver must not proceed with the write operation when EBUSY is returned; it must first call vfs_flush_extent_complete() for one or more completed extents to free ring slots, then retry vfs_register_dirty_extent().
This is a deliberate design choice that differs from Linux's approach: UmkaOS never silently discards safety information. The EBUSY backpressure ensures that on any VFS crash, umka-core has a complete record of all outstanding dirty extents and can accurately flag inconsistent data — no dirty extent is ever "forgotten."
Filesystem drivers should use flow control: pre-register extents in batches of ≤64, and flush completed extents at every fsync/barrier point. Under normal operation, the 4096-slot ring provides ample buffering for burst writes; EBUSY is only encountered if the VFS driver fails to acknowledge completions promptly.
If the VFS driver is unresponsive (not calling vfs_flush_extent_complete() for >5 seconds), umka-core treats all unacknowledged extents as dirty and initiates VFS driver restart — the EBUSY backpressure prevents ring overflow from masking a stuck VFS driver.
Crash recovery sequence:
When a Tier 1 VFS driver crashes while UmkaOS Core detects pending dirty extents:
- Iterate the dirty extent log for the crashed filesystem instance.
- For each registered extent (in `seq` order, oldest first, preserving write ordering for journals):
- Issue a direct block write via the block layer (bypassing VFS).
- Use `block_addr` and `block_len` from the pre-registered extent record.
- Wait for write completion.
- After all registered extents are flushed: mark as "crash-flushed" and continue with driver reload.
- Any dirty pages NOT covered by registered extents are flagged as "potentially inconsistent". The filesystem's own journal/log handles recovery on next mount (same as a hard power-off scenario).
Filesystem requirements:
- Filesystems with stable pre-allocated extents (ext4, XFS): can register extents at file creation/truncation time. The block address is stable.
- Copy-on-write filesystems (Btrfs): must register the NEW block address before the CoW write begins (not the old block address).
- FAT/exFAT: no journaling — if not using registered extents, a crash means data loss. The VFS driver must register all dirty cluster chains.
Design rationale: This is better than Linux's approach (which silently loses dirty pages when a kernel module crashes) while being simpler than running a full WAL in UmkaOS Core. The pre-registration overhead is one lightweight ring-buffer push per dirtied file region — negligible for writeback-dominated workloads.
Performance implications and mitigation:
The Core → VFS domain switch costs ~23 cycles for the bare WRPKRU instruction
(x86-64 MPK). The full domain crossing — including argument marshaling via the
inter-domain ring buffer and cache effects — is ~30-35 cycles per crossing.
This overhead is amortized by:
- Page Cache in Core: The Page Cache (Section 4.1.3) lives in Core, not VFS. Cached file reads/writes hit the Page Cache directly with zero domain switches. Only cache misses (actual I/O) cross into VFS.
- Batching: Multiple file operations within a single syscall (e.g., `readv`, `io_uring` batches) amortize the domain switch over many operations.
- Dentry cache hit rate: The dentry cache (in VFS) has >99% hit rate for typical workloads. Path resolution is fast, and the domain switch cost is dominated by the actual I/O latency (microseconds vs nanoseconds).
Measured overhead: For a 4KB NVMe read (~10μs device latency), the additional domain switches (Core → VFS → FS driver) add ~70 cycles (~30ns total), which is 0.3% overhead. This is well within the "<5% overhead" target.
13.1.1 VFS Architecture
Responsibilities: path resolution, dentry caching, inode management, mount tree traversal, and permission checks (delegated to umka-core's capability system via the inter-domain ring buffer).
Filesystem drivers register as VFS backends. The VFS never interprets on-disk format directly — it delegates all storage operations through three trait interfaces:
Foundational VFS types (used throughout this chapter):
/// Opaque filesystem inode identifier. Unique within a single SuperBlock.
///
/// Inode 0 is never valid (used as the null sentinel in `AtomicOption`).
/// Inode 1 is conventionally the root directory inode.
/// The u64 width accommodates all known filesystem inode spaces (ext4 uses
/// u32 internally but promotes to u64 for future-proofing; Btrfs and ZFS
/// use u64 natively).
///
/// `InodeId` is filesystem-private: the same u64 value in two different
/// `SuperBlock` instances refers to different inodes.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
#[repr(transparent)]
pub struct InodeId(pub u64);
impl From<u64> for InodeId { fn from(v: u64) -> Self { InodeId(v) } }
impl From<InodeId> for u64 { fn from(id: InodeId) -> u64 { id.0 } }
/// Opaque VFS pipe identifier. Each `pipe(2)` / `pipe2(2)` call produces a
/// unique `PipeId` for internal tracking (waitqueue association, splice
/// routing, and PipeBuffer lifetime management). Not visible to userspace.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct PipeId(pub u64);
/// Response envelope for cross-domain VFS ring buffer calls.
/// Matches the ring protocol described in [Section 10.7](10-drivers.md#107-ipc-architecture-and-message-passing).
#[derive(Debug)]
pub enum VfsResponse {
/// Success, possibly with a return value (e.g., byte count for read/write).
Ok(i64),
/// Error code (negated Linux errno, e.g., `-ENOENT`).
Err(i32),
/// Asynchronous completion pending; caller must wait on the completion ring.
Pending,
}
/// Filesystem-level operations (mount, unmount, statfs).
/// Implemented once per filesystem type (ext4, XFS, btrfs, ZFS, tmpfs, etc.).
pub trait FileSystemOps: Send + Sync {
/// Mount a filesystem from the given source device with flags and options.
fn mount(&self, source: &str, flags: MountFlags, data: &[u8]) -> Result<SuperBlock>;
/// Unmount a previously mounted filesystem.
fn unmount(&self, sb: &SuperBlock) -> Result<()>;
/// Force-unmount: abort in-flight I/O with EIO. Called when umount2()
/// is invoked with MNT_FORCE. Not all filesystems support this — return
/// ENOSYS if unsupported. NFS uses this for stale server recovery.
fn force_umount(&self, sb: &SuperBlock) -> Result<()>;
/// Return filesystem statistics (total/free/available blocks and inodes).
fn statfs(&self, sb: &SuperBlock) -> Result<StatFs>;
/// Flush all dirty data and metadata for this filesystem to stable storage.
/// Backend for syncfs(2) and the filesystem-level portion of sync(2).
fn sync_fs(&self, sb: &SuperBlock, wait: bool) -> Result<()>;
/// Remount with changed flags/options (e.g., `mount -o remount,ro`).
fn remount(&self, sb: &SuperBlock, flags: MountFlags, data: &[u8]) -> Result<()>;
/// Freeze the filesystem for a consistent snapshot. All pending writes are
/// flushed and new writes block until thaw. Used by LVM snapshots, device-mapper,
/// and backup tools via FIFREEZE ioctl.
fn freeze(&self, sb: &SuperBlock) -> Result<()>;
/// Thaw a previously frozen filesystem, allowing writes to resume.
fn thaw(&self, sb: &SuperBlock) -> Result<()>;
/// Format filesystem-specific mount options for /proc/mounts output.
fn show_options(&self, sb: &SuperBlock, buf: &mut [u8]) -> Result<usize>;
}
/// Inode (directory structure) operations.
/// Handles namespace operations: lookup, create, link, unlink, rename.
///
/// Note: `OsStr` is a kernel-defined type (NOT `std::ffi::OsStr`, which is
/// unavailable in `no_std`). It is a dynamically-sized type (DST) wrapping
/// `[u8]`, representing filenames that may contain arbitrary non-UTF-8 bytes
/// (Linux filenames are byte strings, not Unicode). Defined in
/// `umka-vfs/src/types.rs`:
/// `pub struct OsStr([u8]);`
/// As a DST, `OsStr` cannot be used by value — it is always behind a
/// reference (`&OsStr`) or `Box<OsStr>`. `&OsStr` is a fat pointer
/// (pointer + length), analogous to `&[u8]` but carrying the semantic
/// intent of "filesystem name component." Conversion from `&str` is
/// infallible (UTF-8 is a valid byte sequence); conversion TO `&str`
/// returns `Result` (may fail on non-UTF-8 filenames).
pub trait InodeOps: Send + Sync {
/// Look up a child entry by name within a parent directory.
fn lookup(&self, parent: InodeId, name: &OsStr) -> Result<InodeId>;
/// Create a regular file in the given directory.
fn create(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;
/// Create a subdirectory.
fn mkdir(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;
/// Create a hard link: new entry `new_name` in `new_parent` pointing to `inode`.
fn link(&self, inode: InodeId, new_parent: InodeId, new_name: &OsStr) -> Result<()>;
/// Create a symbolic link containing `target` at `parent/name`.
fn symlink(&self, parent: InodeId, name: &OsStr, target: &OsStr) -> Result<InodeId>;
/// Read the target of a symbolic link.
fn readlink(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;
/// Create a device special file (block/char device, FIFO, or socket).
fn mknod(&self, parent: InodeId, name: &OsStr, mode: FileMode, dev: DevId) -> Result<InodeId>;
/// Remove a directory entry (unlink for files, rmdir for empty directories).
fn unlink(&self, parent: InodeId, name: &OsStr) -> Result<()>;
/// Remove an empty directory. Separate from unlink for POSIX semantics:
/// `unlink()` on a directory returns EISDIR; `rmdir()` on a file returns ENOTDIR.
fn rmdir(&self, parent: InodeId, name: &OsStr) -> Result<()>;
/// Rename/move a directory entry, possibly across directories.
/// `flags` supports RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT
/// (Linux renameat2 semantics, required for overlayfs).
fn rename(
&self,
old_parent: InodeId, old_name: &OsStr,
new_parent: InodeId, new_name: &OsStr,
flags: RenameFlags,
) -> Result<()>;
/// Get inode attributes (size, mode, timestamps, link count).
fn getattr(&self, inode: InodeId) -> Result<InodeAttr>;
/// Set inode attributes (chmod, chown, utimes).
fn setattr(&self, inode: InodeId, attr: &SetAttr) -> Result<()>;
/// List extended attributes on an inode.
fn listxattr(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;
/// Get an extended attribute value.
fn getxattr(&self, inode: InodeId, name: &OsStr, buf: &mut [u8]) -> Result<usize>;
/// Set an extended attribute value.
fn setxattr(&self, inode: InodeId, name: &OsStr, value: &[u8], flags: XattrFlags)
-> Result<()>;
/// Remove an extended attribute.
fn removexattr(&self, inode: InodeId, name: &OsStr) -> Result<()>;
}
/// File data operations (open, read, write, sync, allocate, close).
pub trait FileOps: Send + Sync {
/// Called when a file is opened. Allows the filesystem to initialize per-open
/// state (NFS delegation, device state, lock state). Returns a filesystem-private
/// context value stored in the file descriptor.
fn open(&self, inode: InodeId, flags: OpenFlags) -> Result<u64>;
/// Called when the last file descriptor referencing this open file is closed.
/// Filesystem releases per-open state (flock release-on-close, NFS delegation
/// return, device cleanup). `private` is the value returned by `open()`.
fn release(&self, inode: InodeId, private: u64) -> Result<()>;
/// Read data from a file at the given offset. `private` is the
/// filesystem-private context value returned by `open()`.
fn read(&self, inode: InodeId, private: u64, offset: u64, buf: &mut [u8]) -> Result<usize>;
/// Write data to a file at the given offset. `private` is the
/// filesystem-private context value returned by `open()`.
fn write(&self, inode: InodeId, private: u64, offset: u64, buf: &[u8]) -> Result<usize>;
/// Truncate a file to the specified size. This is separate from setattr
/// because truncation is a complex operation on many filesystems: it must
/// free blocks/extents, update extent trees, handle COW (ZFS/btrfs),
/// interact with snapshots, and flush in-progress writes beyond the new
/// size. The VFS calls truncate after updating the in-memory inode size.
/// `private` is the filesystem-private context value returned by `open()`.
fn truncate(&self, inode: InodeId, private: u64, new_size: u64) -> Result<()>;
/// Flush file data (and optionally metadata) to stable storage.
/// `private` is the filesystem-private context value returned by `open()`.
fn fsync(&self, inode: InodeId, private: u64, datasync: bool) -> Result<()>;
/// Pre-allocate or punch holes in file storage. `private` is the
/// filesystem-private context value returned by `open()`.
fn fallocate(&self, inode: InodeId, private: u64, offset: u64, len: u64, mode: FallocateMode) -> Result<()>;
/// Read directory entries. Returns entries starting from `offset` (an opaque
/// cookie, not a byte position). The callback is invoked for each entry; it
/// returns `false` to stop iteration (buffer full). This is the backend for
/// `getdents64(2)`. `private` is the filesystem-private context value
/// returned by `open()`.
fn readdir(
&self,
inode: InodeId,
private: u64,
offset: u64,
emit: &mut dyn FnMut(InodeId, u64, FileType, &OsStr) -> bool,
) -> Result<()>;
/// Seek to a data or hole region (SEEK_DATA / SEEK_HOLE, lseek(2)).
/// Filesystems that do not support sparse files return the file size for
/// SEEK_DATA at any offset, and ENXIO for SEEK_HOLE at any offset.
/// `private` is the filesystem-private context value returned by `open()`.
fn llseek(&self, inode: InodeId, private: u64, offset: i64, whence: SeekWhence) -> Result<u64>;
/// Map a file region into a process address space. The VFS calls this to
/// obtain the page frame list; the actual page table manipulation is done
/// by umka-core (Section 4.1). Filesystems that do not support mmap (e.g.,
/// procfs, sysfs) return ENODEV. `private` is the filesystem-private
/// context value returned by `open()`.
fn mmap(&self, inode: InodeId, private: u64, offset: u64, len: usize, prot: MmapProt) -> Result<MmapResult>;
/// Handle a filesystem-specific ioctl. The VFS dispatches generic ioctls
/// (FIOCLEX, FIONREAD, etc.) itself; only unrecognized ioctls reach the
/// filesystem driver. Returns ENOTTY for unsupported ioctls. `private` is
/// the filesystem-private context value returned by `open()`.
fn ioctl(&self, inode: InodeId, private: u64, cmd: u32, arg: u64) -> Result<i64>;
/// Splice data between a file and a pipe without copying through userspace.
/// Backend for splice(2), sendfile(2), and copy_file_range(2). Filesystems
/// that do not implement this get a generic page-cache-based fallback
/// provided by the VFS. `private` is the filesystem-private context value
/// returned by `open()`.
fn splice_read(
&self,
inode: InodeId,
private: u64,
offset: u64,
pipe: PipeId,
len: usize,
) -> Result<usize>;
/// Splice data from a pipe into a file without copying through userspace.
/// Reverse direction of splice_read: pipe is the data source, file is the
/// destination. Backend for splice(2) write direction and vmsplice(2).
/// Filesystems that do not implement this get a generic page-cache-based
/// fallback provided by the VFS. `private` is the filesystem-private
/// context value returned by `open()`.
fn splice_write(
&self,
pipe: PipeId,
inode: InodeId,
private: u64,
offset: u64,
len: usize,
) -> Result<usize>;
/// Poll for readiness events (POLLIN, POLLOUT, POLLERR). Regular files
/// always return ready; special files (pipes, device nodes, eventfd)
/// implement blocking semantics. `private` is the filesystem-private
/// context value returned by `open()`.
fn poll(&self, inode: InodeId, private: u64, events: PollEvents) -> Result<PollEvents>;
}
/// Dentry (directory entry) lifecycle operations.
/// Most filesystems use the default VFS implementations. Only network and
/// clustered filesystems need custom implementations (primarily d_revalidate).
pub trait DentryOps: Send + Sync {
/// Revalidate a cached dentry. Called before using a cached dentry to verify
/// it is still valid. Returns true if the dentry is still valid, false if
/// the VFS should discard it and perform a fresh lookup.
/// Default: always returns true (local filesystems).
/// Network FS: checks with the server. Clustered FS: checks DLM lease (Section 14.6.6).
fn d_revalidate(&self, parent: InodeId, name: &OsStr) -> Result<bool> {
Ok(true)
}
/// Custom name comparison. Called during lookup to compare a dentry name
/// with a search name. Used by case-insensitive filesystems (e.g., VFAT,
/// CIFS with case folding, ext4 with casefold feature).
/// Default: byte-exact comparison.
fn d_compare(&self, name: &OsStr, search: &OsStr) -> bool {
name == search
}
/// Returns a custom hash for this dentry name, or `None` to use the
/// VFS default (SipHash-1-3 with per-superblock key from `SuperBlock.hash_key`).
/// Must be consistent with d_compare: if two names are equal per d_compare,
/// they must produce the same hash.
///
/// The VFS lookup layer calls `d_hash()` and checks the return value.
/// If `None`, the VFS uses its own SipHash-1-3 with the per-superblock
/// random key directly, without requiring filesystem involvement. This
/// matches Linux's pattern where `d_hash` is only invoked when
/// `dentry->d_op->d_hash` is non-NULL.
///
/// Filesystems with custom hash requirements (e.g., case-insensitive)
/// override this to return `Some(hash_value)` using their own algorithm —
/// they never see the SipHash key. The per-superblock key is managed by
/// the VFS, not exposed to filesystem implementations.
fn d_hash(&self, name: &OsStr) -> Option<u64> {
None
}
/// Called when a dentry's reference count drops to zero (dentry enters
/// the unused LRU list). Filesystem can veto caching by returning false.
fn d_delete(&self, inode: InodeId, name: &OsStr) -> bool {
true // default: allow LRU caching
}
/// Called when a dentry is finally freed from the cache.
fn d_release(&self, inode: InodeId, name: &OsStr) {}
}
/// Inode attribute structure — returned by getattr(), compatible with
/// Linux statx(2) for full metadata exposure.
pub struct InodeAttr {
/// Bitmask of valid fields (STATX_* flags). Filesystems set only
/// the bits for fields they actually populate.
pub mask: u32,
pub mode: u32, // File type and permissions; u32 to accommodate extended
// permission bits — lower 16 bits match Linux umode_t format.
pub nlink: u32, // Hard link count
pub uid: u32, // Owner UID
pub gid: u32, // Group GID
pub ino: u64, // Inode number
pub size: u64, // File size in bytes
pub blocks: u64, // 512-byte blocks allocated
pub blksize: u32, // Preferred I/O block size
// Timestamps with nanosecond precision
pub atime_sec: i64, // Last access
pub atime_nsec: u32,
pub mtime_sec: i64, // Last modification
pub mtime_nsec: u32,
pub ctime_sec: i64, // Last status change
pub ctime_nsec: u32,
pub btime_sec: i64, // Creation time (birth time)
pub btime_nsec: u32,
pub rdev: u64, // Device ID (for device special files). Encodes major:minor as (major << 32) | minor. The Linux compat layer (Section 18.1) splits these into separate u32 major/minor fields for statx() responses.
pub dev: u64, // Device ID of containing filesystem. Encodes major:minor as (major << 32) | minor. The Linux compat layer (Section 18.1) splits these into separate u32 major/minor fields for statx() responses.
pub mount_id: u64, // Mount identifier (STATX_MNT_ID, since Linux 5.8)
pub attributes: u64, // File attributes (STATX_ATTR_* flags)
pub attributes_mask: u64, // Supported attributes mask
// Direct I/O alignment (STATX_DIOALIGN, since Linux 6.1)
pub dio_mem_align: u32, // Required alignment for DIO memory buffers
pub dio_offset_align: u32, // Required alignment for DIO file offsets
// Subvolume identifier (STATX_SUBVOL, since Linux 6.10; btrfs, bcachefs)
pub subvol: u64,
// Atomic write limits (STATX_WRITE_ATOMIC, since Linux 6.11)
pub atomic_write_unit_min: u32, // Min atomic write size (power-of-2)
pub atomic_write_unit_max: u32, // Max atomic write size (power-of-2)
pub atomic_write_segments_max: u32, // Max segments in atomic write
pub atomic_write_unit_max_opt: u32, // Optimal max atomic write size (STATX_WRITE_ATOMIC, since Linux 6.13)
// Direct I/O read alignment (STATX_DIO_READ_ALIGN, since Linux 6.14)
pub dio_read_offset_align: u32, // DIO read offset alignment (0 = use dio_offset_align)
}
Linux comparison: Linux's VFS uses struct super_operations, struct inode_operations,
struct file_operations, and struct dentry_operations — C structs of function pointers
(Linux's file_operations alone has 30+ methods). UmkaOS's trait-based design serves the
same purpose but with Rust's safety guarantees: a filesystem that forgets to implement
fsync is a compile-time error, not a null pointer dereference at runtime. The trait
methods above cover the operations needed for POSIX compatibility; rarely-used operations
(e.g., fiemap, copy_file_range with cross-filesystem support) are handled by generic
VFS fallback code that calls the core read/write/fallocate methods.
13.1.1.1 File Handle Export (ExportOps)
The ExportOps trait is implemented by filesystems that support persistent file handles —
opaque tokens that identify an inode across server reboots and path renames. Required for:
- NFS server (clients hold file handles that survive server restart)
- CRIU checkpoint/restore (`open_by_handle_at` reopens files by handle)
- Backup software (`rsync --no-implied-dirs`, backup agents)
/// File system export operations. Optional — implement only if the filesystem
/// supports persistent, path-independent file handles.
///
/// A file handle is a short opaque byte string (max 128 bytes) that uniquely
/// identifies an inode within a filesystem instance. The handle must survive:
/// - Server reboots (handle encodes stable inode ID + generation counter)
/// - Directory renames (handle does not encode path)
/// - Mount point changes (handle is filesystem-relative, not global)
pub trait ExportOps: Send + Sync {
/// Encode an inode into a file handle.
///
/// Returns the handle bytes written and a filesystem-defined `fh_type` code
/// (passed back to `decode_fh`; used to distinguish handle formats).
///
/// # Typical encoding
/// ext4: [ inode_number: u32, generation: u32 ] → 8 bytes, fh_type=1
/// XFS: [ ino: u64, gen: u32, parent_ino: u64, parent_gen: u32 ] → 24 bytes, fh_type=1
/// Btrfs: [ objectid: u64, root_objectid: u64, gen: u64 ] → 24 bytes, fh_type=1
///
/// Returns `Err(EOVERFLOW)` if `max_bytes` is too small for this filesystem's handle.
fn encode_fh(
&self,
inode: &Inode,
handle: &mut [u8; 128],
max_bytes: usize,
// If true, include parent inode info to enable NFS reconnect after server reboot.
connectable: bool,
) -> Result<(usize, u8), VfsError>; // (bytes_written, fh_type)
/// Decode a file handle back to an inode reference.
///
/// Called by `open_by_handle_at`. Must look up the inode using the filesystem's
/// internal handle format without path traversal.
///
/// Returns `Err(ESTALE)` if the inode no longer exists or the generation counter
/// does not match (inode number reused after deletion).
fn decode_fh(
&self,
handle: &[u8],
fh_type: u8,
) -> Result<Arc<Inode>, VfsError>;
/// Get the parent directory inode of an inode (for NFS reconnect after reboot).
///
/// Returns `Err(EACCES)` if the filesystem cannot determine the parent without a
/// full tree walk (e.g., hardlinks with multiple parents).
fn get_parent(&self, inode: &Inode) -> Result<Arc<Inode>, VfsError>;
/// Get the directory entry name for `child` within `parent`.
///
/// Used by the NFS server to reconstruct paths for client caches.
/// Returns the byte length of the name written into `name_buf`.
/// Returns `Err(ENOENT)` if no entry for `child` is found in `parent`.
fn get_name(
&self,
parent: &Inode,
child: &Inode,
name_buf: &mut [u8; 256],
) -> Result<usize, VfsError>;
}
/// Kernel-side file handle: wraps the opaque handle bytes with metadata.
/// Matches the layout of Linux's `struct file_handle` for syscall ABI compatibility.
#[repr(C)]
pub struct FileHandle {
/// Byte length of the handle data (the populated prefix of `f_handle`).
pub handle_bytes: u32,
/// Filesystem-defined type code (passed back verbatim to `ExportOps::decode_fh`).
pub handle_type: i32,
/// Opaque handle data (filesystem-defined encoding, up to 128 bytes).
pub f_handle: [u8; 128],
}
name_to_handle_at(2) implementation:
name_to_handle_at(dirfd, pathname, handle, mount_id, flags):
1. Resolve pathname to an inode (using normal path resolution with dirfd as the base;
AT_EMPTY_PATH allows operating on dirfd itself without a pathname component).
2. Retrieve the inode's superblock.
3. Check that the superblock implements ExportOps. Return ENOTSUP if not.
4. Call superblock.export_ops.encode_fh(inode, handle.f_handle, handle.handle_bytes,
connectable=true).
5. Write back handle_bytes and handle_type into the userspace handle struct.
6. Write the mount's numeric ID to *mount_id. Mount IDs are assigned at mount time
via a monotonic counter (Section 13.2.3 MountNode.mnt_id).
7. Return 0 on success; EOVERFLOW if the handle buffer is too small.
open_by_handle_at(2) implementation:
open_by_handle_at(mount_fd, handle, flags):
1. Requires CAP_DAC_READ_SEARCH. This syscall bypasses normal path-based access checks
by design — it is intended for root-equivalent processes such as NFS servers and
backup agents. Return EPERM if the capability is absent.
2. Resolve mount_fd to a MountNamespace and the corresponding mount point.
3. Retrieve the mount's superblock.
4. Check that the superblock implements ExportOps. Return ENOTSUP if not.
5. Call superblock.export_ops.decode_fh(handle.f_handle, handle.handle_type) → Arc<Inode>.
6. If Err(ESTALE): the inode was deleted or the generation counter does not match
(inode number reused). Return ESTALE.
7. Perform a DAC check and LSM check on the inode using the caller's credentials.
8. Allocate a new FileDescription wrapping the inode. The file description does not
carry a path — the inode is accessed directly without directory traversal.
9. Return the new file descriptor number.
Security note: open_by_handle_at intentionally skips directory execute-permission
checks along the path to the inode (the path is not known at this point). This is
the documented and expected behavior for NFS server use. CAP_DAC_READ_SEARCH is the
required guard.
13.1.1.2 Core VFS Data Structures
The VFS layer operates on four fundamental data structures: dentries (directory entries), inodes (index nodes), superblocks (mounted filesystem state), and files (open file handles). This section defines the first three; file handles are defined in Section 7.1 (process model) as part of the file descriptor table.
Dentry (Directory Cache Entry)
/// Directory cache entry — represents a single component in a pathname.
///
/// Dentries form a tree that mirrors the filesystem namespace. Each dentry
/// caches the result of a directory lookup: the mapping from a name to an
/// inode. The dentry cache (dcache) is the primary mechanism for avoiding
/// repeated directory lookups on hot paths.
///
/// **Lifecycle**: Created by `InodeOps::lookup()` on first access. Cached
/// in the dcache hash table (keyed by parent + name). Freed when the
/// reference count drops to zero AND the dentry is evicted from the LRU.
/// Negative dentries (name exists but no inode) are also cached to avoid
/// repeated failed lookups.
///
/// **Concurrency**: Dentries are RCU-protected for lockless path resolution
/// (RCU-walk mode, Section 13.1.3). Mutations (create, unlink, rename)
/// acquire the parent dentry's `d_lock` spinlock.
#[repr(C)]
pub struct Dentry {
/// The name of this directory entry (the final component, not the full path).
/// Inline for short names (<=32 bytes); heap-allocated for longer names.
/// Immutable after creation (renames create a new dentry).
pub d_name: DentryName,
/// Inode that this dentry points to. `None` for negative dentries
/// (cached "does not exist" results). Set once by `d_instantiate()`
/// after a successful lookup or create. Protected by RCU for readers;
/// `d_lock` for writers.
pub d_inode: RcuCell<Option<Arc<Inode>>>,
/// Parent dentry. The root dentry's parent is itself.
/// Protected by RCU (for RCU-walk path resolution).
pub d_parent: RcuCell<Arc<Dentry>>,
/// Hash table linkage for dcache lookup (keyed by parent + name hash).
pub d_hash: HashListNode,
/// Children list (subdirectories and files in this directory).
/// Only meaningful for directory dentries. Protected by `d_lock`.
pub d_children: IntrusiveList<Dentry>,
/// Sibling linkage (entry in parent's `d_children` list).
pub d_sibling: IntrusiveListNode,
/// Per-dentry spinlock. Protects `d_children`, `d_inode` mutations,
/// and `d_flags` updates. Lock level: DENTRY_LOCK (level 8).
pub d_lock: SpinLock<(), DENTRY_LOCK>,
/// Dentry flags (DCACHE_MOUNTED, DCACHE_NEGATIVE, etc.).
pub d_flags: AtomicU32,
/// Reference count. Dentries with refcount > 0 are pinned (in use).
/// Dentries with refcount == 0 are on the LRU and may be evicted
/// under memory pressure.
pub d_refcount: AtomicU32,
/// Cached permission bits for fast path resolution (Section 13.1.3).
pub cached_perm: AtomicU32,
/// Superblock this dentry belongs to.
pub d_sb: Arc<SuperBlock>,
/// Filesystem-specific dentry operations (d_revalidate, d_release, etc.).
/// Set by the filesystem during lookup. NULL for simple filesystems.
pub d_ops: Option<&'static dyn DentryOps>,
/// RCU head for deferred freeing.
pub d_rcu: RcuHead,
/// LRU list linkage for dcache reclaim.
pub d_lru: IntrusiveListNode,
/// Mount point sequence counter. Incremented when a filesystem is
/// mounted or unmounted on this dentry. Used by RCU-walk to detect
/// mount table changes during lockless traversal.
pub d_mount_seq: AtomicU32,
}
/// Short name inline buffer size. Names <=32 bytes are stored inline
/// in the dentry (no heap allocation). Covers >99% of real filenames.
pub const DENTRY_INLINE_NAME_LEN: usize = 32;
/// Dentry name: inline for short names, heap-allocated for long names.
pub enum DentryName {
Inline { buf: [u8; DENTRY_INLINE_NAME_LEN], len: u8 },
Heap { ptr: Box<[u8]> },
}
AddressSpace (Page Cache Mapping)
/// Per-inode page cache. Maps file byte offsets (at page granularity)
/// to physical page frames held in memory.
///
/// Each inode for a regular file or block device has exactly one
/// `AddressSpace`. Directories and symlinks typically do not use
/// `AddressSpace` unless the filesystem maps their data through the page
/// cache (e.g., directories in ext4 are page-cache-backed).
///
/// **Storage**: `AddressSpace` is embedded directly inside `Inode`
/// (field `i_mapping`). No separate allocation is needed on the fast
/// path.
///
/// **Concurrency**:
/// - `pages` (XArray): RCU-safe for readers; writers hold `xa_lock`
/// (a fine-grained spinlock embedded in the XArray).
/// - `nrpages`, `nrdirty`, `nrwriteback`: independent atomic counters;
/// no lock needed for individual increments/decrements.
/// - `writeback_lock`: `Mutex` serializing concurrent writeback of
/// this inode's pages. At most one writeback agent runs per inode
/// at any time.
///
/// **XArray**: Generic ordered sparse array implemented as a tree of
/// `XNode` slots. Provides RCU-safe concurrent reads with fine-grained
/// locking for writes. Equivalent to Linux's `struct xarray`. See
/// Section 13.2 (Dentry Cache) for the XArray lock ordering rules.
pub struct AddressSpace {
/// Back-pointer to the owning inode. `Weak` to avoid a reference
/// cycle (Inode → AddressSpace → Inode).
pub host: Weak<Inode>,
/// Page cache: file page index (u64) → `Arc<PageFrame>`.
///
/// Key: `file_offset >> PAGE_SHIFT` (page index, not byte offset).
/// Value: an `Arc`-wrapped physical page frame holding one page of
/// file data. Absent entries indicate the page is not cached; the
/// filesystem must populate it via `AddressSpaceOps::read_page`.
///
/// The XArray provides O(1) amortised lookup and RCU-safe reads:
/// readers never take a lock; writers take `xa_lock` only for the
/// affected slot.
pub pages: XArray<Arc<PageFrame>>,
/// Total number of page frames currently present in the cache.
/// Incremented when a page is inserted; decremented when evicted.
pub nrpages: AtomicU64,
/// Number of pages that are dirty (modified in memory, not yet
/// flushed to the backing store). The writeback path decrements this
/// as pages are written out.
pub nrdirty: AtomicU64,
/// Number of pages currently under active writeback I/O. A page is
/// counted here from the moment writeback I/O is submitted until the
/// I/O completion handler clears the `PG_WRITEBACK` flag.
pub nrwriteback: AtomicU64,
/// Writeback serialisation state. At most one concurrent writeback
/// agent is permitted per `AddressSpace` to avoid seek amplification
/// on rotational storage and to simplify error propagation.
pub writeback_lock: Mutex<WritebackState>,
/// Filesystem-provided callbacks for page cache operations.
/// Statically known at inode creation time; never changes.
pub ops: &'static dyn AddressSpaceOps,
/// Flags controlling eviction and special page semantics.
///
/// - `AS_UNEVICTABLE` (bit 0): pages must not be reclaimed under
/// memory pressure (e.g., ramfs, tmpfs locked pages).
/// - `AS_BALLOON_PAGE` (bit 1): pages are balloon-inflated and may
/// be reclaimed by the balloon driver at any time.
/// - `AS_EIO` (bit 2): a writeback error occurred; subsequent
/// `fsync` calls must return `-EIO` until the flag is cleared.
/// - `AS_ENOSPC` (bit 3): a writeback error occurred due to no
/// space remaining on device.
pub flags: AtomicU32,
}
/// Serialised writeback state embedded inside `AddressSpace::writeback_lock`.
///
/// Protected by `AddressSpace::writeback_lock`. The `Mutex` ensures only
/// one writeback agent runs at a time; the fields inside track progress
/// so that a new agent can resume where the previous one left off.
pub struct WritebackState {
/// Next page index to examine during writeback. The writeback agent
/// advances this forward as pages are submitted for I/O. Wraps to 0
/// after reaching the last page, implementing a cyclic scan
/// consistent with the kernel's "kupdate" writeback policy.
pub writeback_index: u64,
/// Accumulated bytes of dirty data at the time writeback started.
/// Used to limit how much data a single writeback pass writes, so
/// that a continuous dirty stream does not starve readers.
pub dirty_bytes: u64,
}
/// Filesystem callbacks invoked by the VFS page cache layer.
///
/// Each filesystem that participates in the page cache provides a
/// static `AddressSpaceOps` implementation. The VFS calls these methods
/// when it needs to populate the cache (read miss), flush dirty pages
/// (writeback), or decide whether a page can be dropped (reclaim).
///
/// **Object safety**: all methods take `&self` on the ops vtable plus
/// explicit `AddressSpace`/`PageFrame` references. The vtable itself is
/// `'static`, `Send`, and `Sync`.
pub trait AddressSpaceOps: Send + Sync {
/// Read one page (identified by `index`, a page-aligned file offset
/// divided by `PAGE_SIZE`) from the backing store into the page
/// cache. The implementation must allocate a `PageFrame`, fill it,
/// insert it into `mapping.pages`, and return an `Arc` to it.
///
/// Called with no locks held. The implementation may block.
fn read_page(
&self,
mapping: &AddressSpace,
index: u64,
) -> Result<Arc<PageFrame>, IoError>;
/// Write a single dirty page to the backing store. `wbc` carries
/// writeback control parameters (sync mode, range limits, number
/// of pages already written in this pass). The implementation must
/// clear `PG_DIRTY` on the page before starting I/O and set
/// `PG_WRITEBACK` for the duration of the I/O.
///
/// Called with no locks held. The implementation may block.
fn writepage(
&self,
mapping: &AddressSpace,
page: &PageFrame,
wbc: &WritebackControl,
) -> Result<(), IoError>;
/// Called by the page reclaimer immediately before a clean page is
/// removed from the cache. The filesystem may decline eviction by
/// returning `false` (e.g., because it has pinned the page for
/// journalling). Returning `true` grants permission to evict.
///
/// Must not block; must not acquire locks that might sleep.
fn releasepage(&self, page: &PageFrame) -> bool;
/// Returns the direct-I/O implementation for this address space,
/// if the filesystem supports bypassing the page cache (e.g., for
/// `O_DIRECT` opens). Returns `None` if direct I/O is not supported;
/// the VFS will then fall back to the page-cache path.
fn direct_io(&self) -> Option<&dyn DirectIoOps> {
None
}
}
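The read-miss contract can be sketched as below: the VFS consults `pages` first and calls `read_page` only on a miss, then inserts the result and bumps `nrpages`. This is a hedged sketch: a `HashMap` stands in for the XArray, the trait is reduced to the one method exercised, and all locking is elided.

```rust
use std::collections::HashMap;
use std::sync::Arc;

type PageFrame = [u8; 4096]; // stand-in for the real frame type

trait AddressSpaceOps {
    fn read_page(&self, index: u64) -> Result<Arc<PageFrame>, ()>;
}

struct AddressSpace<'a> {
    pages: HashMap<u64, Arc<PageFrame>>, // page index -> cached frame
    ops: &'a dyn AddressSpaceOps,
    nrpages: u64,
}

impl<'a> AddressSpace<'a> {
    /// Return the cached page for `index`, populating on a miss.
    fn find_or_read(&mut self, index: u64) -> Result<Arc<PageFrame>, ()> {
        if let Some(p) = self.pages.get(&index) {
            return Ok(p.clone()); // cache hit: no driver call
        }
        let page = self.ops.read_page(index)?; // cache miss: ask the fs
        self.pages.insert(index, page.clone());
        self.nrpages += 1;
        Ok(page)
    }
}
```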
Inode (Index Node)
/// In-memory representation of a filesystem object (file, directory,
/// symlink, device, pipe, socket).
///
/// Each inode has a unique (superblock, inode_number) pair. The VFS
/// maintains an inode cache (icache) keyed by this pair to avoid
/// repeated disk reads.
///
/// **Lifecycle**: Created by `FileSystemOps::mount()` (root inode) or
/// `InodeOps::lookup()`/`InodeOps::create()` for other entries. Cached
/// in the icache. Freed when the last dentry referencing it is evicted
/// AND the on-disk link count drops to zero (unlinked).
///
/// **Concurrency**: Inode metadata is protected by `i_lock` (spinlock).
/// File data is protected by `i_rwsem` (read-write semaphore) — readers
/// (read, readdir) take shared; writers (write, truncate) take exclusive.
#[repr(C)]
pub struct Inode {
/// Inode number. Unique within a superblock. Assigned by the filesystem.
pub i_ino: u64,
/// File type and permission mode (S_IFREG, S_IFDIR, etc. | rwxrwxrwx).
pub i_mode: u32,
/// Owner UID.
pub i_uid: u32,
/// Owner GID.
pub i_gid: u32,
/// Hard link count. When this reaches 0 and no open file descriptors
/// remain, the inode is freed (both in-memory and on-disk).
pub i_nlink: AtomicU32,
/// File size in bytes. For regular files: data size. For directories:
/// implementation-defined (often the size of the directory data).
/// For symlinks: length of the target path. Updated under `i_rwsem`.
pub i_size: AtomicI64,
/// Timestamps (seconds + nanoseconds since epoch).
pub i_atime: Timespec,
pub i_mtime: Timespec,
pub i_ctime: Timespec,
/// Block size for this inode's filesystem (typically 4096).
pub i_blksize: u32,
/// Number of 512-byte blocks allocated on disk.
pub i_blocks: u64,
/// Device number (major:minor) for device special files (S_IFBLK/S_IFCHR).
/// Encoding: `(major << 32) | minor`. Zero for regular files.
pub i_rdev: u64,
/// Generation number. Incremented when an inode is recycled (same i_ino
/// reused for a new file). Used by NFS file handles to detect stale handles.
pub i_generation: u32,
/// Per-inode spinlock. Protects metadata updates (mode, uid, gid, timestamps,
/// nlink). Lock level: INODE_LOCK (level 7).
pub i_lock: SpinLock<(), INODE_LOCK>,
/// Read-write semaphore for file data. read()/readdir() take shared;
/// write()/truncate() take exclusive.
pub i_rwsem: RwSemaphore,
/// Superblock this inode belongs to.
pub i_sb: Arc<SuperBlock>,
/// Inode operations (lookup, create, link, unlink, etc.).
/// Set by the filesystem when the inode is created.
pub i_op: &'static dyn InodeOps,
/// File operations (read, write, mmap, ioctl, etc.).
/// Set by the filesystem; used when opening this inode as a file.
pub i_fop: &'static dyn FileOps,
/// Filesystem-private data. Opaque pointer used by the filesystem
/// driver to attach its own per-inode state (e.g., ext4_inode_info).
pub i_private: *mut (),
/// Page cache address space for this inode's data.
/// Contains the XArray of cached pages, dirty page counters, and
/// writeback state. See the `AddressSpace` struct defined above.
pub i_mapping: AddressSpace,
/// Reference count. Managed by dentry references and open file handles.
pub i_refcount: AtomicU32,
/// Dirty flag. Set when inode metadata has been modified in memory
/// but not yet written to disk.
pub i_state: AtomicU32,
/// Hash table linkage for icache lookup (keyed by sb + i_ino).
pub i_hash: HashListNode,
/// Superblock dirty inode list linkage.
pub i_sb_list: IntrusiveListNode,
}
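The icache keying rule ("unique (superblock, inode_number) pair") can be sketched with a map. `sb_id`, `Icache`, and `read_from_disk` are illustrative stand-ins for superblock identity, the real hash table, and the slow path.

```rust
use std::collections::HashMap;
use std::sync::Arc;

struct Inode { i_ino: u64 }

struct Icache {
    map: HashMap<(u64, u64), Arc<Inode>>, // (sb_id, i_ino) -> inode
}

impl Icache {
    /// Look up (sb_id, ino); populate from the backing store on a miss.
    fn iget(
        &mut self,
        sb_id: u64,
        ino: u64,
        read_from_disk: impl FnOnce() -> Inode,
    ) -> Arc<Inode> {
        self.map
            .entry((sb_id, ino))
            .or_insert_with(|| Arc::new(read_from_disk()))
            .clone()
    }
}
```

The same inode number on two different superblocks yields two distinct cache entries, which is why the key must include the superblock.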
SuperBlock
/// In-memory representation of a mounted filesystem.
///
/// Each mount creates one SuperBlock instance. The superblock holds
/// filesystem-level metadata (block size, feature flags, root inode)
/// and provides the interface between the VFS and the filesystem driver.
///
/// **Lifecycle**: Created by `FileSystemOps::mount()`. Destroyed by
/// `FileSystemOps::unmount()` after all references are released.
pub struct SuperBlock {
/// Filesystem type identifier (e.g., "ext4", "xfs", "tmpfs").
pub s_type: &'static str,
/// Block size in bytes (typically 1024, 2048, or 4096).
pub s_blocksize: u32,
/// Log2 of block size (for bit-shift division).
pub s_blocksize_bits: u8,
/// Maximum file size supported by this filesystem.
pub s_maxbytes: i64,
/// Root dentry of the mounted filesystem.
pub s_root: Arc<Dentry>,
/// Filesystem operations (mount, unmount, statfs, sync).
pub s_op: &'static dyn FileSystemOps,
/// Mount flags (MS_RDONLY, MS_NOSUID, MS_NODEV, etc.).
pub s_flags: AtomicU32,
/// Filesystem-specific data. Opaque pointer used by the filesystem
/// driver to attach its own per-superblock state (e.g., ext4_sb_info,
/// xfs_mount).
pub s_fs_info: *mut (),
/// UUID of the filesystem (if supported). Used for persistent mount
/// identification and `/proc/mounts` output.
pub s_uuid: [u8; 16],
/// List of all inodes belonging to this superblock.
/// Protected by `s_inode_list_lock`.
pub s_inodes: IntrusiveList<Inode>,
/// List of dirty inodes that need writeback.
pub s_dirty: IntrusiveList<Inode>,
/// Per-superblock lock for inode list management.
pub s_inode_list_lock: SpinLock<()>,
/// Block device backing this filesystem (None for pseudo-filesystems
/// like tmpfs, procfs, sysfs).
pub s_bdev: Option<Arc<BlockDevice>>,
/// Reference count. Held by Mount nodes and open file handles.
pub s_refcount: AtomicU32,
/// Freeze count. >0 means filesystem is frozen (FIFREEZE).
pub s_freeze_count: AtomicU32,
}
13.1.1.3 VFS Ring Buffer Protocol (Cross-Domain Dispatch)
The tier model (Section 10.4) requires ALL cross-domain communication to use ring
buffer IPC. However, the FileSystemOps, InodeOps, and FileOps traits defined
above use direct Rust function call signatures. This section specifies how trait
method calls are marshaled across the isolation domain boundary between umka-core
(VFS layer) and Tier 1 filesystem drivers.
Architecture: Each mounted filesystem has a dedicated request/response ring pair:
/// Per-mount ring buffer pair for VFS <-> filesystem driver communication.
///
/// The VFS enqueues requests on `request_ring`; the filesystem driver
/// dequeues, processes, and enqueues responses on `response_ring`.
/// Both rings are in shared memory (PKEY 1 on x86-64 — read-write for both
/// domains; actual data in the PKEY 14 shared DMA pool).
pub struct VfsRingPair {
/// Request ring: VFS -> filesystem driver. SPSC (VFS is the sole producer;
/// the driver is the sole consumer). Ring size: 256 entries (configurable
/// per-mount via mount options).
pub request_ring: RingBuffer<VfsRequest>,
/// Response ring: filesystem driver -> VFS. SPSC (driver produces, VFS
/// consumes). Same size as request ring.
pub response_ring: RingBuffer<VfsResponse>,
/// Doorbell: the VFS writes to signal request availability to the
/// filesystem driver. Uses the doorbell coalescing mechanism
/// (Section 10.6.1.1) to batch notifications when multiple requests
/// are enqueued.
/// Completion event: VFS waits on this when a synchronous operation
/// needs a response. Uses `WaitQueue` for blocking callers.
pub completion: WaitQueue,
}
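Synchronous dispatch over the pair can be sketched as below: assign a request ID, enqueue, ring the doorbell, and wait for the matching response. This is a single-threaded sketch; `VecDeque` stands in for the SPSC rings, and the driver runs inline where the real system would block on `completion` across a domain switch.

```rust
use std::collections::VecDeque;

struct VfsRequest { request_id: u64, payload: u32 }
struct VfsResponse { request_id: u64, status: i32 }

struct VfsRingPair {
    request_ring: VecDeque<VfsRequest>,
    response_ring: VecDeque<VfsResponse>,
    next_id: u64,
}

impl VfsRingPair {
    fn submit_sync(
        &mut self,
        payload: u32,
        driver: impl Fn(&VfsRequest) -> VfsResponse,
    ) -> i32 {
        let id = self.next_id;
        self.next_id = self.next_id.wrapping_add(1); // wraps at u64::MAX
        self.request_ring.push_back(VfsRequest { request_id: id, payload });
        // "Doorbell": in the real system this triggers a domain switch;
        // here the driver consumes the request ring inline.
        while let Some(req) = self.request_ring.pop_front() {
            self.response_ring.push_back(driver(&req));
        }
        // Wait for the response matching our request_id.
        loop {
            match self.response_ring.pop_front() {
                Some(r) if r.request_id == id => return r.status,
                Some(_) => continue, // response for an earlier request
                None => unreachable!("driver completed inline"),
            }
        }
    }
}
```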
/// VFS request message. Serialized representation of a trait method call.
/// Each ring entry is fixed-size; variable-length data travels in shared
/// DMA buffers referenced from `args`.
#[repr(C)]
pub struct VfsRequest {
/// Unique request ID for matching responses. Monotonically increasing
/// per-ring. Wraps at u64::MAX.
pub request_id: u64,
/// Operation code identifying the trait method.
pub opcode: VfsOpcode,
/// Inode number (for InodeOps/FileOps calls). 0 for FileSystemOps calls.
pub ino: u64,
/// File handle (for FileOps calls). u64::MAX for non-file operations.
pub fh: u64,
/// Operation-specific arguments. The variant must match `opcode`.
/// Variable-length data (filenames, xattr values, write data) is
/// passed via shared DMA buffer references embedded in the variant,
/// not stored inline in the ring entry.
///
/// The VFS dispatcher validates that the `args` variant matches
/// `opcode` before dispatching; a mismatch is a kernel bug and
/// triggers a panic in debug builds; in release builds the request is
/// not dispatched and an error response is returned.
pub args: VfsRequestArgs,
}
/// Per-opcode argument payload for a `VfsRequest`.
///
/// This is a Rust enum (tagged union) rather than a C-style `union` for
/// memory safety. Every `VfsOpcode` variant has a corresponding
/// `VfsRequestArgs` variant with the exact parameters that the trait
/// method requires. Variants that carry no extra data beyond what is
/// already in the `VfsRequest` header (opcode, ino, fh) use an empty
/// body `{}`.
///
/// **Inline string limits**: `KernelString` holds up to 255 bytes. Names
/// longer than 255 bytes (possible on some exotic filesystems) must be
/// passed via a `DmaBufferHandle` placed in the `buf` field of the
/// relevant variant; the VFS sets the string `len` to 0 as a sentinel in
/// that case.
///
/// **Caller contract**: The caller fills `VfsRequest { opcode, args, .. }`
/// and enqueues it on `request_ring`. The VFS dispatcher validates that
/// the `args` variant matches `opcode` before dispatching to the
/// filesystem driver.
pub enum VfsRequestArgs {
// ---------------------------------------------------------------
// FileSystemOps
// ---------------------------------------------------------------
/// `FileSystemOps::mount`. No extra args; mount options are passed
/// via a separate `DmaBufferHandle` in the ring header.
Mount {},
/// `FileSystemOps::unmount`. Graceful unmount; all dirty data must
/// be flushed before the response is sent.
Unmount {},
/// `FileSystemOps::force_unmount`. Best-effort: abandon in-flight
/// I/O and free resources.
ForceUnmount {},
/// `FileSystemOps::statfs`. No per-call arguments.
Statfs {},
/// `FileSystemOps::sync_fs`. `wait` controls whether the driver
/// must block until all I/O is complete (`true`) or may return once
/// I/O is queued (`false`).
SyncFs { wait: bool },
/// `FileSystemOps::remount`. New flags; updated option string is in
/// a `DmaBufferHandle` in the ring header.
Remount { flags: u32 },
/// `FileSystemOps::freeze`. Quiesce all writes for snapshotting.
Freeze {},
/// `FileSystemOps::thaw`. Resume writes after a freeze.
Thaw {},
// ---------------------------------------------------------------
// InodeOps
// ---------------------------------------------------------------
/// `InodeOps::lookup`. Look up `name` in the directory identified
/// by `VfsRequest::ino`.
Lookup { name: KernelString },
/// `InodeOps::create`. Create a regular file named by the dentry
/// already allocated by the VFS. `mode` is the combined file-type
/// and permission bits.
Create { mode: FileMode },
/// `InodeOps::link`. Create a hard link whose new name is
/// `new_name` inside the directory inode of the request.
Link { new_name: KernelString },
/// `InodeOps::unlink`. Remove a directory entry. The inode is freed
/// when its link count reaches zero and all file descriptors are
/// closed.
Unlink {},
/// `InodeOps::mkdir`. Create a directory with the given permission
/// bits.
Mkdir { mode: FileMode },
/// `InodeOps::rmdir`. Remove an empty directory.
Rmdir {},
/// `InodeOps::rename`. Move or rename a directory entry.
/// `new_dir_ino` is the inode number of the destination directory.
/// `new_name` is the destination name. `flags` carries `RENAME_*`
/// constants (e.g., `RENAME_NOREPLACE`, `RENAME_EXCHANGE`).
Rename { new_dir_ino: u64, new_name: KernelString, flags: u32 },
/// `InodeOps::symlink`. Create a symbolic link whose target path is
/// `target`. The created inode is named by the dentry pre-allocated
/// by the VFS.
Symlink { target: KernelString },
/// `InodeOps::readlink`. Resolve the symlink target into
/// `buf`. The driver writes the target string into the DMA buffer
/// identified by `buf`.
Readlink { buf: DmaBufferHandle },
/// `InodeOps::mknod`. Create a special file (block device, character
/// device, FIFO, or socket). `dev` carries the (major, minor) pair
/// encoded as `(major << 32) | minor`.
Mknod { mode: FileMode, dev: DeviceNumber },
/// `InodeOps::getattr`. Retrieve inode attributes into an
/// `InodeAttr`. `request_mask` is a bitmask of `STATX_*` fields the
/// caller wants. `flags` is `AT_*` flags from `statx(2)`.
GetAttr { request_mask: u32, flags: u32 },
/// `InodeOps::setattr`. Modify inode attributes. `valid` is a
/// bitmask of `ATTR_*` flags indicating which fields in `attr` the
/// driver must update.
SetAttr { attr: InodeAttr, valid: u32 },
/// `InodeOps::truncate`. Set the file size to `size` bytes,
/// releasing or zero-extending as needed.
Truncate { size: u64 },
/// `InodeOps::getxattr`. Retrieve the extended attribute `name` into
/// `buf`. On return, the response `bytes_read` field carries the
/// attribute value length.
GetXattr { name: KernelString, buf: DmaBufferHandle },
/// `InodeOps::setxattr`. Set extended attribute `name` to `value`.
/// `flags` is `XATTR_CREATE`, `XATTR_REPLACE`, or 0.
SetXattr { name: KernelString, value: DmaBufferHandle, value_len: u32, flags: u32 },
/// `InodeOps::listxattr`. Enumerate all extended attribute names into
/// `buf` as a sequence of NUL-terminated strings. On return, the
/// response `bytes_read` field carries the total length written.
ListXattr { buf: DmaBufferHandle },
/// `InodeOps::removexattr`. Delete the extended attribute `name`.
RemoveXattr { name: KernelString },
/// `FileSystemOps::show_options`. Write the filesystem-specific
/// mount options (as they would appear in `/proc/mounts`) into
/// `buf`.
ShowOptions { buf: DmaBufferHandle },
// ---------------------------------------------------------------
// FileOps
// ---------------------------------------------------------------
/// `FileOps::open`. Open the file. `flags` are the `O_*` open
/// flags from `open(2)`/`openat(2)`. `mode` is relevant only when
/// `O_CREAT` is set.
Open { flags: u32, mode: FileMode },
/// `FileOps::release`. The last reference to this open file
/// descriptor has been closed. The driver must flush any cached
/// state for `fh`.
Release {},
/// `FileOps::read`. Read up to `count` bytes starting at `offset`
/// from the file into `buf`. The driver writes data into the DMA
/// buffer identified by `buf`. On return, `VfsResponse::bytes_read`
/// carries the number of bytes actually written.
Read { buf: DmaBufferHandle, offset: u64, count: u32 },
/// `FileOps::write`. Write `count` bytes from `buf` into the file
/// starting at `offset`. `buf` points to a DMA buffer the VFS has
/// already filled with the data to be written.
Write { buf: DmaBufferHandle, offset: u64, count: u32 },
/// `FileOps::fsync`. Flush dirty data and metadata to stable
/// storage. If `datasync` is `true`, only data blocks need to be
/// flushed (equivalent to `fdatasync(2)`). `start`..`end` is the
/// byte range to sync; `end == u64::MAX` means "to end of file".
Fsync { datasync: bool, start: u64, end: u64 },
/// `FileOps::readdir`. Enumerate directory entries into `buf`
/// starting after the position identified by `cookie`. A `cookie` of
/// 0 means start from the beginning. The driver fills `buf` with
/// `linux_dirent64` records and sets `VfsResponse::bytes_read` to
/// the number of bytes written.
ReadDir { buf: DmaBufferHandle, cookie: u64 },
/// `FileOps::ioctl`. Pass a device-specific command to the
/// filesystem driver. `cmd` is the ioctl number; `arg` is the raw
/// usize argument (may be a user pointer, a small integer, or a
/// `DmaBufferHandle` depending on the command).
Ioctl { cmd: u32, arg: usize },
/// `FileOps::mmap`. Establish a memory mapping. `vma_token` is an
/// opaque handle the VFS passes to the driver to identify the
/// virtual memory area; the driver uses it to call back into the
/// VFS to install PTEs via the KABI page-fault callback.
Mmap { vma_token: u64, prot: u32, flags: u32 },
/// `FileOps::fallocate`. Pre-allocate or manipulate storage for the
/// given byte range. `mode` carries `FALLOC_FL_*` flags.
Fallocate { mode: u32, offset: u64, len: u64 },
/// `FileOps::seek_data`. Find the next byte range containing data
/// at or after `offset` (implements `SEEK_DATA` from `lseek(2)`).
SeekData { offset: u64 },
/// `FileOps::seek_hole`. Find the next hole (unallocated range) at
/// or after `offset` (implements `SEEK_HOLE` from `lseek(2)`).
SeekHole { offset: u64 },
/// `FileOps::poll`. Query which I/O events are ready. `events` is
/// a bitmask of `POLLIN`, `POLLOUT`, `POLLERR`, etc. The driver
/// responds immediately with the currently ready events; the VFS
/// handles `epoll`/`select` wait registration separately.
Poll { events: u32 },
/// `FileOps::splice_read`. Transfer up to `len` bytes from the file
/// at `offset` into an in-kernel pipe identified by `pipe_ino`,
/// without copying through userspace. `flags` carries `SPLICE_F_*`
/// flags.
SpliceRead { pipe_ino: u64, offset: u64, len: u32, flags: u32 },
/// `FileOps::splice_write`. Transfer up to `len` bytes from the
/// in-kernel pipe identified by `pipe_ino` into the file at
/// `offset`. `flags` carries `SPLICE_F_*` flags.
SpliceWrite { pipe_ino: u64, offset: u64, len: u32, flags: u32 },
}
/// Bounded kernel-internal string. Avoids heap allocation for the common
/// case of short names (directory entries, xattr names, symlink targets
/// ≤ 255 bytes).
///
/// For strings longer than 255 bytes the caller must use a
/// `DmaBufferHandle` instead and set `len = 0` as a sentinel.
pub struct KernelString {
/// Byte length of the string, not including any NUL terminator.
/// Range: 0 (sentinel for "use DMA buffer") to 255.
pub len: u8,
/// Inline storage. Valid bytes are `data[..len]`. The remainder
/// is zero-padded. Not NUL-terminated; callers must use `len`.
pub data: [u8; 255],
}
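The length contract can be sketched with an assumed helper, `KernelString::encode` (not part of the spec), which returns `None` when the caller must fall back to a `DmaBufferHandle` and set `len = 0` in the message:

```rust
pub struct KernelString {
    pub len: u8,
    pub data: [u8; 255],
}

impl KernelString {
    /// Encode a name inline if it fits. Returns None for empty names
    /// (len == 0 is reserved as the DMA sentinel) and for names longer
    /// than 255 bytes (the caller must use a DmaBufferHandle instead).
    pub fn encode(name: &[u8]) -> Option<KernelString> {
        if name.is_empty() || name.len() > 255 {
            return None;
        }
        let mut data = [0u8; 255];
        data[..name.len()].copy_from_slice(name);
        Some(KernelString { len: name.len() as u8, data })
    }
}
```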
/// VFS operation codes. One-to-one mapping to trait methods.
#[repr(u32)]
pub enum VfsOpcode {
// FileSystemOps
Mount = 1,
Unmount = 2,
ForceUnmount = 3,
Statfs = 4,
SyncFs = 5,
Remount = 6,
Freeze = 7,
Thaw = 8,
// InodeOps
Lookup = 20,
Create = 21,
Link = 22,
Unlink = 23,
Mkdir = 24,
Rmdir = 25,
Rename = 26,
Symlink = 27,
Readlink = 28,
Getattr = 29,
Setattr = 30,
Getxattr = 31,
Setxattr = 32,
Listxattr = 33,
Removexattr = 34,
Truncate = 35,
Mknod = 36, // → InodeOps::mknod; called by mknod(2) for device nodes
// FileSystemOps (continued)
ShowOptions = 37, // → FileSystemOps::show_options; called by /proc/mounts, mount(8)
// FileOps
Open = 40,
Release = 41,
Read = 42,
Write = 43,
Fsync = 44,
Readdir = 45,
Ioctl = 46,
Mmap = 47,
Fallocate = 48,
SeekData = 49,
SeekHole = 50,
Poll = 51,
SpliceRead = 52, // → FileOps::splice_read; called by splice(2), sendfile(2)
SpliceWrite = 53, // → FileOps::splice_write; called by splice(2) write side
}
Dispatch flow (read syscall example):
- Userspace calls read(fd, buf, len).
- The syscall entry point resolves fd to a ValidatedCap (Section 8.1.1).
- The VFS checks the page cache (Section 4.1.3). On a cache HIT, data is served from core memory with zero domain crossings. On a cache MISS, continue.
- The VFS constructs a VfsRequest { opcode: Read, ino, fh, args: { offset, len, buf_handle } }. The buf_handle is a DmaBufferHandle pointing to a shared-memory region where the driver will write the read data (zero-copy).
- The VFS enqueues the request on request_ring and rings the doorbell.
- The filesystem driver (in its Tier 1 domain) dequeues the request, performs the I/O (issuing block reads via BlockDevice, Section 14.3), and writes the data to the shared buffer.
- The driver enqueues a VfsResponse { request_id, status, bytes_read } on response_ring.
- The VFS dequeues the response, populates the page cache, and copies the data to userspace.
Key design properties:
- Page cache absorbs most I/O: Only cache misses cross the domain boundary. On a warm cache (common for frequently accessed files), read() has zero domain crossings — data is served directly from core memory. This is why the page cache lives in umka-core, not in the filesystem driver.
- Zero-copy data path: Read/write data is transferred via shared DMA buffer handles, not copied into the ring buffer. The ring carries only the metadata (opcode, offsets, lengths, buffer handles). Data pages are in the shared DMA pool (PKEY 14 / domain 2).
- Batching: The doorbell coalescing mechanism (Section 10.6.1.1) batches multiple requests into a single domain switch. readahead() enqueues multiple read requests before ringing the doorbell once.
- Trait interface as specification: The FileSystemOps, InodeOps, and FileOps traits defined above serve as the SPECIFICATION of the ring protocol. Each trait method maps to exactly one VfsOpcode. The trait signatures define the arguments; the ring protocol serializes them into VfsRequestArgs. Filesystem driver developers implement the traits; the KABI code generator (Section 11.1) produces the serialization/deserialization stubs.
VFS Ring Error Handling and Cancellation:
Every cross-domain VFS request is subject to timeout, cancellation, and driver crash handling. This section specifies the complete lifecycle of a request that does not complete normally.
1. Timeout: Every VFS request has a per-operation timeout based on the expected latency class of the operation:
| Timeout class | Operations | Default timeout |
|---|---|---|
| Regular | Read, Write, Lookup, Create, Open, Release, Getattr, Setattr, Readdir, Readlink, Link, Unlink, Mkdir, Rmdir, Rename, Symlink, Getxattr, Setxattr, Listxattr, Removexattr, Mmap, SeekData, SeekHole, Poll, Ioctl | 30 seconds |
| Slow | Fsync, Truncate, Fallocate | 120 seconds |
| Mount | Mount, Unmount, ForceUnmount, Remount, Statfs, SyncFs, Freeze, Thaw | 300 seconds |
The kernel VFS layer starts a per-request timer when the request is enqueued on the
request_ring. If the timer fires before a VfsResponse::Ok or VfsResponse::Err
arrives on the response_ring, the kernel performs the following steps:
a. Sets request.state to Cancelled in the shared ring metadata.
b. Returns ETIMEDOUT to the waiting syscall (waking the blocked thread via
the VfsRingPair::completion wait queue).
c. Enqueues a CancelToken { request_id, reason: CancelReason::Timeout } on a
dedicated cancellation side-channel in the ring so the filesystem driver can
detect the cancellation and avoid processing a stale request. The driver is
expected to check the cancellation channel before beginning I/O for each
dequeued request.
Timeouts are per-mount configurable via mount options (vfs_timeout_regular=<secs>,
vfs_timeout_slow=<secs>, vfs_timeout_mount=<secs>). The values above are defaults.
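The class lookup with per-mount overrides can be sketched as below. The opcode subset and the `MountTimeouts` struct are illustrative; the grouping and defaults follow the table above.

```rust
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum VfsOpcode { Read, Write, Lookup, Fsync, Truncate, Fallocate, Mount, Unmount, SyncFs }

/// Per-mount timeout configuration (vfs_timeout_* mount options).
struct MountTimeouts { regular: u64, slow: u64, mount: u64 }

impl Default for MountTimeouts {
    fn default() -> Self {
        // Defaults from the timeout-class table.
        MountTimeouts { regular: 30, slow: 120, mount: 300 }
    }
}

fn request_timeout_secs(op: VfsOpcode, t: &MountTimeouts) -> u64 {
    use VfsOpcode::*;
    match op {
        Fsync | Truncate | Fallocate => t.slow,   // Slow class
        Mount | Unmount | SyncFs => t.mount,      // Mount class
        Read | Write | Lookup => t.regular,       // Regular class
    }
}
```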
2. Crash handling (filesystem driver crashes): When a Tier 1 filesystem driver crashes (detected by the isolation recovery mechanism described in Section 10.4.3), the kernel VFS layer performs the following recovery sequence:
a. All pending requests for the crashed filesystem driver are immediately failed
with EIO. Every thread blocked on VfsRingPair::completion for that mount is
woken with VfsResponse::Err(-EIO).
b. The VFS ring is closed: the kernel unmaps the shared ring pages and marks the
VfsRingPair as defunct. No new requests are accepted.
c. Any subsequent access to files on that filesystem (open files, cached dentries,
inode operations) returns ENOTCONN until the driver is restarted and the
filesystem is remounted.
d. For Tier 1 filesystem drivers: the crash recovery mechanism reloads the driver
module and replays the mount sequence (using the stored mount arguments from
SuperBlock). Pending request state is lost — applications whose requests
were failed with EIO must retry. Open file descriptors pointing to the crashed
filesystem become invalid and return ENOTCONN on any operation; applications
must close and reopen them after remount completes.
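Steps (a)-(c) amount to draining a pending-request table and marking the ring defunct. A minimal user-space model (the `RingState` type and its methods are illustrative, not the kernel's actual API):

```rust
use std::collections::HashMap;

const EIO: i32 = 5;
const ENOTCONN: i32 = 107;

/// Minimal model of a per-mount ring during driver-crash recovery.
pub struct RingState {
    pub defunct: bool,
    /// request_id -> completion slot; None while still pending.
    pub pending: HashMap<u64, Option<Result<(), i32>>>,
}

impl RingState {
    pub fn submit(&mut self, id: u64) -> Result<(), i32> {
        if self.defunct {
            return Err(ENOTCONN); // after the crash, no new requests are accepted
        }
        self.pending.insert(id, None);
        Ok(())
    }

    /// Steps (a)+(b): fail every pending request with EIO and close the ring.
    /// Returns how many blocked waiters were woken.
    pub fn on_driver_crash(&mut self) -> usize {
        let mut woken = 0;
        for slot in self.pending.values_mut() {
            if slot.is_none() {
                *slot = Some(Err(EIO)); // wake the waiter with VfsResponse::Err(-EIO)
                woken += 1;
            }
        }
        self.defunct = true; // mark the VfsRingPair defunct
        woken
    }
}
```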
Crash Recovery Algorithm — Complete Specification:
VFS crash recovery runs when a Tier 1 VFS driver (e.g., ext4, XFS) crashes and is reloaded (Section 10.8).
Lock ordering during recovery (must acquire in this order to prevent deadlock):
1. vfs_global_lock (prevents new VFS operations from starting)
2. Per-superblock sb.recovery_lock (one at a time, in mounting-order sequence)
3. Per-inode inode.lock (only if individual inodes need repair)
Never hold an inode lock while acquiring sb.recovery_lock.
Step 1: Quiesce in-flight operations
- Set sb.state = SuperblockState::Recovering (atomic store, Release ordering).
- All new VFS operations on this superblock fail fast with ENXIO (a cheap superblock-state check at syscall entry).
- Wait for the per-sb operation counter sb.inflight_ops to reach zero (spin with a 5s timeout; if not drained after 5s, send SIGKILL to processes with operations in flight).
- The in-flight counter is incremented at VFS entry (vfs_op_enter()) and decremented at exit (vfs_op_exit()), both under the per-task RCU read lock to prevent grace-period racing.
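The quiesce gate can be modeled with two atomics. This sketch substitutes a double-check for the per-task RCU read lock described above (RCU is not available in user space); `SbGate` and its method names are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

const ENXIO: i32 = 6;

/// Minimal model of the Step 1 quiesce gate.
pub struct SbGate {
    pub recovering: AtomicBool,
    pub inflight_ops: AtomicU64,
}

impl SbGate {
    pub fn new() -> Self {
        SbGate {
            recovering: AtomicBool::new(false),
            inflight_ops: AtomicU64::new(0),
        }
    }

    /// vfs_op_enter(): refuse new operations once recovery has begun.
    pub fn op_enter(&self) -> Result<(), i32> {
        if self.recovering.load(Ordering::Acquire) {
            return Err(ENXIO);
        }
        self.inflight_ops.fetch_add(1, Ordering::AcqRel);
        // Close the window against begin_recovery(): re-check and back out.
        if self.recovering.load(Ordering::Acquire) {
            self.inflight_ops.fetch_sub(1, Ordering::AcqRel);
            return Err(ENXIO);
        }
        Ok(())
    }

    /// vfs_op_exit().
    pub fn op_exit(&self) {
        self.inflight_ops.fetch_sub(1, Ordering::AcqRel);
    }

    /// Flip the gate; the kernel then spins (5 s bound) until drained().
    pub fn begin_recovery(&self) {
        self.recovering.store(true, Ordering::Release);
    }

    pub fn drained(&self) -> bool {
        self.inflight_ops.load(Ordering::Acquire) == 0
    }
}
```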
Step 2: Drain the ring buffer
- The driver-to-kernel ring buffer (Section 11.1) may have pending completion events from operations submitted before the crash.
- Call ring_drain_completions(sb.driver_ring): process all pending completions (call the registered callback for each entry). Completions after a crash return EIO.
- Discard all pending submission-side entries (operations that were submitted but not yet seen by the driver) by marking them complete(EIO) without forwarding.
Step 3: Dirty page detection and writeback
- Walk the superblock's page cache (sb.page_cache) for all dirty pages: pages with PageFlags::Dirty set.
- For each dirty page, check page.last_written_by_lsn against sb.last_committed_lsn.
- If page.lsn <= sb.last_committed_lsn: page was committed to the journal; mark clean (journal will replay it on fsck).
- If page.lsn > sb.last_committed_lsn: page is beyond the last journal commit; writeback must be deferred until the filesystem is repaired.
- Dirty pages beyond the last commit are kept in memory (pinned) until the filesystem is fsck'd and remounted, at which point a forced writeback is issued.
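The LSN comparison in Step 3 reduces to a pure function; `classify_dirty_page` and `DirtyPageAction` are illustrative names:

```rust
/// Step 3 decision for a dirty page, relative to the last journal commit.
#[derive(Debug, PartialEq, Eq)]
pub enum DirtyPageAction {
    /// page LSN <= last committed LSN: journal replay covers it; mark clean.
    MarkClean,
    /// page LSN > last committed LSN: pin in memory until fsck + remount.
    DeferWriteback,
}

pub fn classify_dirty_page(page_lsn: u64, last_committed_lsn: u64) -> DirtyPageAction {
    if page_lsn <= last_committed_lsn {
        DirtyPageAction::MarkClean
    } else {
        DirtyPageAction::DeferWriteback
    }
}
```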
Step 4: Reload driver and remount
- Load the new driver image (Section 10.2 reload protocol).
- Call driver.mount(sb.device, sb.flags) with MS_RDONLY first (safe mode).
- Run the filesystem's built-in consistency check (ext4 replay journal; XFS log recovery; Btrfs tree walk) via driver.fsck_fast().
- If fsck_fast() returns Ok(()): remount read-write; resume normal operations.
- If fsck_fast() returns Err: emit FMA fault event, keep read-only, require manual intervention.
Step 5: Flush deferred dirty pages
- After successful RW remount, call writeback_deferred_dirty(sb) to flush the dirty pages held since Step 3.
Recovery latency target: ≤500ms for ≤1 million in-flight operations and ≤10 million dirty pages.
3. Cancellation protocol: A caller (or the kernel on behalf of a caller) can cancel a pending request through the following protocol:
a. The caller invokes vfs_cancel(request_id) (internal kernel API, not exposed
as a syscall — cancellation is triggered by signal delivery, thread exit, or
timeout).
b. The kernel sets request.state = Cancelled in the shared ring metadata for the
target request.
c. The kernel enqueues a CancelToken on the cancellation side-channel of the
VfsRingPair.
d. The filesystem driver checks the cancellation channel before processing each
dequeued request. If the request is marked Cancelled, the driver skips the
operation and sends no response (the kernel has already returned an error to
the caller).
e. If the driver has already started processing the request (e.g., issued a block
I/O read), it may complete the operation — the result is silently discarded by
the kernel since the request is already resolved.
/// Token placed on the cancellation side-channel of a VfsRingPair to notify
/// the filesystem driver that a previously enqueued request should be skipped.
#[repr(C)]
pub struct CancelToken {
/// The `request_id` of the cancelled request. Matches `VfsRequest::request_id`.
pub request_id: u64,
/// Why the request was cancelled.
pub reason: CancelReason,
}
/// Reason for request cancellation.
#[repr(u32)]
pub enum CancelReason {
/// The per-operation timeout expired before the driver responded.
Timeout = 1,
/// The calling thread was interrupted (signal delivery or thread exit).
CallerCancelled = 2,
/// The filesystem driver crashed; all pending requests are being flushed.
DriverCrash = 3,
}
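A driver's dequeue loop applies step (d) of the protocol by draining the cancellation side-channel before touching the request batch. A minimal sketch (the `process_batch` helper is hypothetical; a real driver would consume ring descriptors, not slices):

```rust
use std::collections::HashSet;

/// Step (d) of the cancellation protocol, as seen from the driver side:
/// drain the cancellation side-channel, then skip cancelled requests.
/// Returns the request_ids the driver actually processes (in order).
pub fn process_batch(requests: &[u64], cancel_channel: &[u64]) -> Vec<u64> {
    let cancelled: HashSet<u64> = cancel_channel.iter().copied().collect();
    requests
        .iter()
        .copied()
        // Cancelled requests are skipped and receive no response; the
        // kernel has already returned an error to the caller.
        .filter(|id| !cancelled.contains(id))
        .collect()
}
```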
4. VfsResponse::Pending semantics: A VfsResponse::Pending response from the
filesystem driver means the request has been accepted and acknowledged but not yet
completed (for example, the driver has issued a block I/O request and is waiting for
device completion). The contract is:
- The caller must poll the `response_ring` or sleep on `VfsRingPair::completion` for the final `VfsResponse::Ok` or `VfsResponse::Err`.
- `Pending` does NOT reset the per-request timeout timer. The maximum time in the `Pending` state is bounded by the operation timeout defined above. If the final response does not arrive within the timeout, the request is cancelled using the standard cancellation protocol (step 3).
- A driver may send at most one `Pending` response per request. Sending multiple `Pending` responses for the same `request_id` is a protocol violation; the kernel logs a warning and ignores the duplicates.
- `Pending` is optional: a driver may respond directly with `Ok` or `Err` without ever sending `Pending`. It exists to let the VFS layer distinguish "driver has seen the request" from "request is still sitting in the ring unprocessed" for diagnostic and health-monitoring purposes (Section 19.1).
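The at-most-one-`Pending` rule can be enforced with a per-mount set of request IDs that have already sent `Pending`. A sketch with hypothetical names (`note_pending`, `PendingVerdict`):

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
pub enum PendingVerdict {
    /// First Pending for this request: record it for health monitoring.
    Accepted,
    /// Protocol violation: log a warning and ignore the duplicate.
    DuplicateIgnored,
}

/// Track which request_ids have already sent a Pending response.
pub fn note_pending(seen: &mut HashSet<u64>, request_id: u64) -> PendingVerdict {
    if seen.insert(request_id) {
        PendingVerdict::Accepted
    } else {
        PendingVerdict::DuplicateIgnored
    }
}
```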
13.1.2 Dentry Cache
The dentry (directory entry) cache is the performance-critical data structure of the VFS.
It maps (parent_inode, name) pairs to child inodes, eliminating repeated disk lookups
for path resolution.
Data structure: RCU-protected hash table. Read-side lookups are lock-free — no atomic operations on the read path, only a memory barrier on RCU read lock entry/exit. This matches Linux's dentry cache design, which is similarly RCU-protected for the same performance reasons.
Negative dentries: When a lookup() returns ENOENT, the VFS caches a negative
dentry for that (parent, name) pair. Subsequent lookups for the same nonexistent path
component return ENOENT immediately without calling into the filesystem driver. This
is critical for workloads like $PATH searches where the shell looks for an executable
in 5-10 directories, finding it only in one. Without negative dentries, every command
invocation would perform 4-9 unnecessary disk lookups.
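The effect of negative dentries shows up in a toy cache where `None` marks a cached ENOENT: only the first lookup of a missing name reaches the driver. All names here are illustrative, not the real VFS API:

```rust
use std::collections::HashMap;

const ENOENT: i32 = 2;

/// Toy dentry cache: (parent_inode, name) -> Some(child_inode) for a
/// positive dentry, None for a cached negative dentry.
pub struct ToyDentryCache {
    pub map: HashMap<(u64, String), Option<u64>>,
    pub driver_lookups: u32, // how many times we called into the driver
}

impl ToyDentryCache {
    pub fn lookup(&mut self, parent: u64, name: &str) -> Result<u64, i32> {
        if let Some(entry) = self.map.get(&(parent, name.to_string())) {
            // Hit: a negative dentry fails fast with no driver round-trip.
            return (*entry).ok_or(ENOENT);
        }
        // Miss: consult the driver (simulated as "nothing exists"), then
        // cache the negative result for subsequent lookups.
        self.driver_lookups += 1;
        self.map.insert((parent, name.to_string()), None);
        Err(ENOENT)
    }
}
```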
Eviction: LRU eviction under memory pressure. The dentry cache integrates with umka-core's memory reclaim (Section 4.2 — Memory Compression Tier, in 04-memory.md) — when the page allocator signals memory pressure, the dentry cache shrinker evicts least-recently-used entries. Negative dentries are evicted preferentially (they are cheaper to re-create than positive dentries).
13.1.3 Path Resolution
Path resolution walks the dentry cache component by component. For example,
/usr/lib/libfoo.so resolves as: root dentry -> lookup("usr") -> lookup("lib")
-> lookup("libfoo.so").
RCU path walk (fast path): The entire resolution is attempted under an RCU read-side critical section. No dentry reference counts are taken, no locks are acquired. If every component is in the dentry cache and no concurrent renames or unmounts are in progress, the entire path resolves with zero atomic operations.
Ref-walk fallback (slow path): If any component is not cached, or if a concurrent
mount/rename is detected (via sequence counters), the RCU walk aborts and restarts in
ref-walk mode. Ref-walk takes dentry reference counts and inode locks as needed. This
two-phase approach mirrors Linux's RCU-walk to ref-walk fallback (`LOOKUP_RCU`).
Mount point traversal: When a dentry is flagged as a mount point, resolution crosses into the mounted filesystem's root dentry. The mount table is consulted via RCU lookup (no lock) in the fast path.
Symlink resolution: The VFS follows up to 40 nested symlinks before returning
ELOOP. This matches the Linux limit and prevents infinite symlink loops.
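The nesting limit reduces to a depth counter on the follow loop. A toy resolver over a map of link targets (illustrative, not the VFS walk itself; 40 also happens to be the numeric value of ELOOP on Linux x86-64, used here as the error code):

```rust
use std::collections::HashMap;

const ELOOP: i32 = 40; // errno value on Linux x86-64
const MAX_SYMLINK_DEPTH: u32 = 40;

/// Toy resolver: `links` maps a name to its symlink target; a name absent
/// from the map is a regular file. Follows at most 40 nested links.
pub fn resolve(links: &HashMap<&str, &str>, start: &str) -> Result<String, i32> {
    let mut current = start.to_string();
    let mut depth = 0;
    while let Some(target) = links.get(current.as_str()) {
        depth += 1;
        if depth > MAX_SYMLINK_DEPTH {
            return Err(ELOOP); // more than 40 nested symlinks
        }
        current = target.to_string();
    }
    Ok(current)
}
```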
Capability checks: Traverse permission is checked at each path component, but
not via an inter-domain ring call on every component. Instead, the dentry cache
stores a cached_perm: AtomicU32 field containing the permission bits resolved on the
last successful access by the current UID. During RCU-walk, the VFS reads
cached_perm from the dentry (same domain, no ring call) and compares against the
requesting process's UID and requested permission. If the cached permission matches
(common case — same user accessing the same path), no domain crossing occurs and the
check costs only a single atomic load (~1-3 cycles). The permission cache is
invalidated on chmod(), chown(), ACL changes, and capability revocation (all
infrequent operations).
Permission cache encoding: The 32-bit cached_perm field is divided into:
- Bits [31:16]: Truncated UID hash (upper 16 bits of a fast hash of the accessor's
UID). This is NOT a full UID — it is a probabilistic match filter.
- Bits [15:12]: Reserved (zero).
- Bits [11:9]: Permission result for owner (rwx).
- Bits [8:6]: Permission result for group (rwx).
- Bits [5:3]: Permission result for other (rwx).
- Bits [2:0]: Access mode that was checked (rwx).
On a cache hit (UID hash matches AND the requested permission bits are a subset of the cached grant), the VFS skips the domain crossing. On a cache miss (UID hash mismatch, or requested bits not covered by the cached grant), the VFS performs a full capability check via the inter-domain ring and updates the cache. The slow path is the only authoritative source of permission decisions; the cache is purely advisory, and its fail-safe direction is deny unknown, never grant unknown — any mismatch in either field is routed to the authoritative slow path rather than answered from the cache.
This design is sound for three reasons: 1. A cache hit can only replay a grant that the authoritative slow path previously issued and stored; umka-vfs cannot mint authority that umka-core's capability tables do not authorize. 2. The cache is invalidated on ALL permission-changing operations (chmod, chown, ACL changes, capability revocation), so a stale grant is never served after the underlying permission state changes. 3. The 16-bit UID hash is a probabilistic filter with a ~1/65536 collision rate per lookup; in the worst case a colliding user is served the grant previously issued for the colliding UID, a residual risk bounded by points 1 and 2.
Only on a cache miss (first access, different UID, or invalidated entry) does the
VFS call umka-core via the inter-domain ring to perform a full capability check and
update the dentry's cached permissions. This amortized design preserves the security
guarantee (umka-vfs cannot bypass capability checks — it has no access to capability
tables, per Section 10.2 and Section 10.4) while keeping the hot-path overhead to a single
atomic load per component, comparable to Linux's inode->i_mode check.
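The `cached_perm` layout and the hit test follow directly from the bit assignments above. `encode_cached_perm` and `perm_cache_hit` are illustrative helpers (not the kernel's API); `class_shift` selects the owner (9), group (6), or other (3) field:

```rust
/// Pack a `cached_perm` word per the bit layout in the text:
/// [31:16] UID hash filter, [11:9] owner rwx, [8:6] group rwx,
/// [5:3] other rwx, [2:0] access mode checked. Bits [15:12] stay zero.
pub fn encode_cached_perm(uid_hash16: u16, owner: u8, group: u8, other: u8, mode: u8) -> u32 {
    ((uid_hash16 as u32) << 16)
        | (((owner & 0b111) as u32) << 9)
        | (((group & 0b111) as u32) << 6)
        | (((other & 0b111) as u32) << 3)
        | ((mode & 0b111) as u32)
}

/// Fast-path test: hit only if the UID hash matches AND the requested bits
/// are a subset of the cached grant. `class_shift` is 9 (owner), 6 (group),
/// or 3 (other). Any mismatch means "miss" -> authoritative slow path.
pub fn perm_cache_hit(cached: u32, uid_hash16: u16, class_shift: u32, requested: u8) -> bool {
    if (cached >> 16) as u16 != uid_hash16 {
        return false; // different user (or filter miss): go to slow path
    }
    let granted = ((cached >> class_shift) & 0b111) as u8;
    requested & granted == requested // subset check: never grant unknown bits
}
```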
13.1.4 Mount Namespace and Capability-Gated Mounting
Each process belongs to a mount namespace containing its own mount tree.
Mount operations are capability-gated:
| Operation | Required Capability | Scope |
|---|---|---|
| mount | CAP_MOUNT | Mount namespace |
| bind mount | CAP_MOUNT + read access to source | Mount namespace + source |
| remount | CAP_MOUNT | Mount namespace |
| umount | CAP_MOUNT | Mount namespace |
| pivot_root | CAP_SYS_ADMIN | Mount namespace |
CAP_MOUNT is scoped to the calling process's mount namespace — it does not grant
mount authority in other namespaces. A container with its own mount namespace can mount
filesystems within that namespace without affecting the host.
Mount propagation: Shared, private, slave, and unbindable propagation types, with
the same semantics as Linux (MS_SHARED, MS_PRIVATE, MS_SLAVE, MS_UNBINDABLE).
This is essential for container runtimes that rely on mount propagation for volume mounts.
Filesystem type registration: Only umka-core can register new filesystem types with the VFS. Filesystem drivers request registration via the inter-domain ring, and umka-core verifies the driver's identity and KABI certification before granting registration.
13.2 Mount Tree Data Structures and Operations
The mount tree is the central data structure of the VFS layer that tracks all mounted filesystems, their hierarchical relationships, and their propagation properties. Every path resolution operation traverses the mount tree (via the mount hash table) to cross mount boundaries. This section defines the complete data structures, algorithms, and namespace operations that Sections 13.1.3, 10.1.4, and 16.1.3 reference but leave unspecified.
Design principles:
- RCU for the read path: Mount hash table lookups happen on every path resolution (every `open()`, `stat()`, `readlink()`, `execve()`). The read path must be completely lock-free. Writers (mount/unmount) serialize through the per-namespace `mount_lock` and publish changes via RCU.
- Per-namespace scoping: Unlike Linux, which uses a single global `mount_hashtable`, UmkaOS scopes the mount hash table per mount namespace. This eliminates contention between namespaces in container-heavy workloads (thousands of namespaces with independent mount trees) and allows mount operations in different namespaces to proceed in parallel with no shared lock. The trade-off is additional memory per namespace; this is acceptable because each namespace already has an independent mount tree and the hash table overhead is proportional to the number of mounts (typically 30-100 per container, well under 1 KiB of hash table memory).
- Arc-based lifetime management: Mount nodes are reference-counted via `Arc<Mount>`. Parent, master, and peer references use `Arc` (strong) or `Weak` (where appropriate to break cycles). RCU protects the hash chains and list traversals; `Arc` protects the `Mount` node lifetime beyond the RCU grace period.
- Capability gating: All mount tree modifications check `CAP_MOUNT` or `CAP_SYS_ADMIN` as specified in Section 13.1.4. The data structures below enforce this at the entry point of each operation, not deep inside the algorithm.
- 64-bit mount IDs: Per-namespace monotonic counter, never wrapping on any realistic system. Mount IDs are unique within a namespace and are the stable identifier used by `statx()` (`STATX_MNT_ID`), the new `statmount()`/`listmount()` syscalls, and `/proc/PID/mountinfo`.
13.2.1 Mount Flags
bitflags! {
/// Per-mount flags controlling security and access behavior.
///
/// These are distinct from per-superblock options (which control the
/// filesystem driver's behavior). A single superblock can be mounted
/// at multiple locations with different per-mount flags (e.g., one
/// mount point read-write, another read-only via bind mount + remount).
///
/// Bit assignments match Linux's `MNT_*` internal flags for
/// straightforward compat-layer translation. The `mount(2)` compat
/// shim translates `MS_*` userspace flags to `MountFlags` at syscall
/// entry; the new mount API (`mount_setattr(2)`) translates
/// `MOUNT_ATTR_*` flags similarly.
#[repr(transparent)]
pub struct MountFlags: u64 {
// --- Userspace-visible flags (set via mount/remount/mount_setattr) ---
/// Do not honor set-user-ID and set-group-ID bits on executables.
const MNT_NOSUID = 1 << 0;
/// Do not allow access to device special files on this mount.
const MNT_NODEV = 1 << 1;
/// Do not allow execution of programs on this mount.
const MNT_NOEXEC = 1 << 2;
/// Mount is read-only. Writes return EROFS.
const MNT_READONLY = 1 << 3;
/// Do not update access times on this mount.
const MNT_NOATIME = 1 << 4;
/// Do not update directory access times on this mount.
const MNT_NODIRATIME = 1 << 5;
/// Update atime only if atime <= mtime or atime <= ctime, or if
/// the previous atime is more than 24 hours old. Default for most
/// mounts since Linux 2.6.30 and UmkaOS.
const MNT_RELATIME = 1 << 6;
/// Buffer atime updates in memory and flush lazily. Reduces write
/// I/O for atime-heavy workloads (e.g., mail servers).
const MNT_LAZYTIME = 1 << 7;
/// Do not follow symlinks on this mount. Used by container runtimes
/// to prevent symlink-based escapes from bind-mounted directories.
const MNT_NOSYMFOLLOW = 1 << 8;
// --- Internal flags (kernel-managed, not settable by userspace) ---
/// Mount is in the process of being unmounted. Set by `umount()`
/// before removing the mount from the hash table. Prevents new
/// path lookups from entering this mount. Once set, never cleared
/// (the mount node is freed after the RCU grace period).
const MNT_DOOMED = 1 << 16;
/// Mount is locked and cannot be unmounted by unprivileged
/// processes. Set on mounts visible in child mount namespaces
/// created by unprivileged users — prevents a child namespace
/// from unmounting a mount inherited from the parent. Cleared
/// only by a process with `CAP_SYS_ADMIN` in the mount's owning
/// user namespace.
const MNT_LOCKED = 1 << 17;
/// Mount can be expired and automatically unmounted under memory
/// pressure or after an idle timeout. Used by autofs. The VFS
/// checks `mnt_count == 0` before expiring a shrinkable mount.
const MNT_SHRINKABLE = 1 << 18;
/// Mount was created by the new mount API (fsopen/fsmount) and
/// has not yet been attached to the mount tree via move_mount().
/// Detached mounts are invisible to path resolution and
/// /proc/PID/mountinfo. They become visible only after
/// move_mount() attaches them.
const MNT_DETACHED = 1 << 19;
}
}
13.2.2 Propagation Type
/// Mount propagation type. Controls whether mount/unmount events at this
/// mount point are propagated to other mount points, and in which direction.
///
/// Propagation is fundamental to container runtimes: Docker sets the rootfs
/// to MS_PRIVATE by default, Kubernetes uses MS_SHARED for volume mounts
/// that must be visible across pod containers.
///
/// See: Linux kernel Documentation/filesystems/sharedsubtree.rst
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum PropagationType {
/// Mount events propagate bidirectionally within the peer group.
/// All mounts in the same peer group see each other's mount/unmount
/// events. This is the Linux default for the initial namespace root.
Shared = 0,
/// Mount events are not propagated to or from this mount. This is
/// the default for new mount namespaces (container isolation).
Private = 1,
/// Mount events propagate unidirectionally from the master to this
/// mount, but not in the reverse direction. Used when a container
/// should see new mounts from the host but not expose its own mounts
/// to the host.
Slave = 2,
/// Like Private, but additionally prevents this mount from being
/// used as the source of a bind mount. Used for security-sensitive
/// mount points that should never be replicated.
Unbindable = 3,
}
13.2.3 Mount Node
/// A single mount instance in the mount tree.
///
/// Equivalent to Linux's `struct mount` (not `struct vfsmount` — the latter
/// is the subset exposed to filesystem drivers; `struct mount` is the full
/// internal structure). Each `Mount` represents one attachment of a
/// filesystem at a specific point in the directory tree.
///
/// **Lifetime**: `Mount` nodes are allocated via `Arc<Mount>`. References
/// are held by:
/// - The mount hash table (via RCU-protected hash chain)
/// - The parent mount's `children` list
/// - The peer group's `mnt_share` ring
/// - The master mount's `mnt_slave_list`
/// - Any open file descriptor whose path traversed this mount
/// (via `mnt_count` reference count)
/// - The `MountNamespace.mount_list`
///
/// A mount node is freed when all strong references are dropped, which
/// happens after: (a) removal from the hash table, (b) removal from the
/// parent's child list, (c) RCU grace period completion, and (d) all
/// path-resolution references (`mnt_count`) have been released.
pub struct Mount {
// --- Identity ---
/// Unique mount identifier within the owning namespace. Monotonically
/// increasing, 64-bit, never reused. This is the value returned by
/// `statx()` in `stx_mnt_id` (STATX_MNT_ID) and reported in
/// `/proc/PID/mountinfo` field 1.
pub mount_id: u64,
/// Device name string (e.g., "/dev/sda1", "tmpfs", "overlay").
/// Displayed in `/proc/PID/mountinfo` field 10 (mount source).
/// Heap-allocated, immutable after mount creation.
pub device_name: Box<[u8]>,
// --- Tree structure ---
/// Parent mount. `None` for the root of the mount namespace.
/// Uses `Weak` to prevent reference cycles in the mount tree:
/// parent -> children -> parent would create a cycle with `Arc`.
/// The parent is always alive while any child exists (the child
/// holds a position in the parent's hash chain), so the `Weak`
/// can always be upgraded during normal operation. It fails only
/// during the teardown of a doomed mount tree, which is expected.
pub parent: Option<Weak<Mount>>,
/// The dentry in the parent mount's filesystem where this mount is
/// attached. For the root mount of a namespace, this is the root
/// dentry of the parent mount (which is itself).
///
/// Together with `parent`, this pair `(parent_mount, mountpoint_dentry)`
/// is the key in the mount hash table. Path resolution uses this to
/// detect mount crossings: when a dentry has `DCACHE_MOUNTED` set,
/// the VFS calls `lookup_mnt(current_mount, dentry)` to find the
/// child mount.
pub mountpoint: DentryRef,
/// Root dentry of the mounted filesystem. When path resolution
/// crosses into this mount, it continues from this dentry.
pub root: DentryRef,
/// The superblock of the mounted filesystem. Shared across all
/// mounts of the same filesystem instance (e.g., bind mounts share
/// the superblock). The superblock holds the filesystem-specific
/// state and the `FileSystemOps`/`InodeOps`/`FileOps` trait objects.
pub superblock: Arc<SuperBlock>,
/// Children of this mount — sub-mounts attached at dentries within
/// this mount's filesystem. Intrusive doubly-linked list for O(1)
/// insertion and removal. Protected by the namespace's `mount_lock`
/// for writes; RCU-protected for reads during path resolution.
pub children: IntrusiveList<Arc<Mount>>,
/// Link entry for this mount in its parent's `children` list.
/// Embedded in the `Mount` node to avoid per-child heap allocation.
pub child_link: IntrusiveListNode,
// --- Mount flags ---
/// Per-mount flags (nosuid, nodev, noexec, readonly, noatime, etc.).
/// Atomically readable for the path-resolution hot path (no lock
/// needed to check MNT_READONLY or MNT_NOSUID). Modified only under
/// `mount_lock` via atomic store with Release ordering.
pub flags: AtomicU64,
// --- Propagation ---
/// Propagation type for this mount (Shared, Private, Slave, Unbindable).
/// Determines how mount/unmount events are forwarded to related mounts.
/// Modified only under `mount_lock`.
pub propagation: PropagationType,
/// Peer group ID for shared mounts. All mounts in the same peer group
/// have the same `group_id`. Private and unbindable mounts have
/// `group_id == 0`. Slave mounts retain the `group_id` of their
/// former peer group (for /proc/PID/mountinfo optional fields).
///
/// Allocated from the namespace's `group_id_allocator`. Unique within
/// a namespace.
pub group_id: u64,
/// Circular linked list of peer mounts (shared propagation).
/// All mounts in a peer group are linked through `mnt_share`.
/// When a mount/unmount event occurs on any peer, it is propagated
/// to all other peers in the ring. For Private/Unbindable mounts,
/// this list contains only the mount itself (self-loop).
pub mnt_share: IntrusiveListNode,
/// Master mount for slave propagation. When this mount is a slave,
/// `mnt_master` points to the shared mount from which this mount
/// receives (but does not send) propagation events.
/// `None` for shared, private, and unbindable mounts.
pub mnt_master: Option<Weak<Mount>>,
/// List head for slave mounts of this mount. When this mount is
/// shared (or was shared), slave mounts derived from it are linked
/// through `mnt_slave_list`. Each slave's `mnt_slave` node is an
/// entry in this list.
pub mnt_slave_list: IntrusiveList<Arc<Mount>>,
/// Link entry for this mount in its master's `mnt_slave_list`.
pub mnt_slave: IntrusiveListNode,
// --- Namespace membership ---
/// The mount namespace that owns this mount. `Weak` because the
/// namespace may be destroyed (all processes exited) while detached
/// mounts or lazy-unmount remnants still exist.
pub ns: Weak<MountNamespace>,
/// Link entry in the namespace's `mount_list`. Used for ordered
/// iteration (e.g., /proc/PID/mountinfo output, umount ordering).
pub ns_list_link: IntrusiveListNode,
// --- Reference counting ---
/// Active reference count. Incremented when path resolution enters
/// this mount (ref-walk mode) or when an open file descriptor
/// references a path within this mount. `umount()` checks this
/// before removing the mount: if `mnt_count > 0`, the mount is
/// busy and umount returns `EBUSY` (unless `MNT_DETACH` is used).
///
/// Note: this is separate from the `Arc` reference count. `Arc`
/// tracks the lifetime of the `Mount` struct itself. `mnt_count`
/// tracks whether the mount is actively *in use* by path lookups
/// and open files. A mount can have `mnt_count == 0` (not busy)
/// while still having `Arc` strong count > 0 (struct not yet freed
/// because it's still in the hash table or child list).
pub mnt_count: AtomicU64,
// --- Mount hash chain ---
/// Link entry in the mount hash table bucket chain. RCU-protected:
/// readers traverse the chain under `rcu_read_lock()` without any
/// lock; writers modify the chain under `mount_lock` and publish
/// via RCU. Uses intrusive linking for zero-allocation hash insertion.
pub hash_link: IntrusiveListNode,
}
/// Reference to a dentry. Wraps the dentry's inode ID and parent inode ID,
/// which together uniquely identify a dentry in the dentry cache (Section
/// 27a.2). The VFS resolves this to a cached dentry entry on access.
///
/// This avoids holding a direct pointer into the dentry cache (which is
/// RCU-managed and may be evicted), while still providing O(1) lookup via
/// the dentry hash table.
pub struct DentryRef {
/// Inode ID of the parent directory containing this dentry.
pub parent_inode: InodeId,
/// Name hash of this dentry (SipHash-1-3 of the name component).
/// Used for O(1) dentry cache lookup without storing the full name.
pub name_hash: u64,
/// Inode ID of the dentry itself (for positive dentries).
pub inode: InodeId,
}
13.2.4 Mount Hash Table
/// Per-namespace mount hash table. Maps `(parent_mount_id, mountpoint_dentry)`
/// pairs to child `Mount` nodes. This is the data structure consulted on
/// every mount-point crossing during path resolution.
///
/// **Why per-namespace**: Linux uses a single global `mount_hashtable` with
/// ~2048 buckets, protected by a per-bucket spinlock for writes and RCU for
/// reads. In container-heavy environments (thousands of namespaces, each with
/// 30-100 mounts), this creates false sharing on hash buckets and limits
/// scalability of concurrent mount operations across namespaces. UmkaOS's
/// per-namespace hash table eliminates cross-namespace contention entirely.
///
/// **Sizing**: The hash table is sized to the number of mounts in the
/// namespace, with a minimum of 32 buckets and a maximum of 1024. The table
/// is resized (doubled) when the load factor exceeds 2.0, and shrunk
/// (halved) when the load factor drops below 0.25. Resizing allocates a
/// new bucket array, rehashes under `mount_lock`, and publishes via RCU.
///
/// **Hash function**: SipHash-1-3 of `(parent_mount_id, mountpoint_inode_id)`.
/// The SipHash key is per-namespace, generated from a CSPRNG at namespace
/// creation. This prevents hash-flooding attacks where an adversary crafts
/// mount points that collide in the hash table.
pub struct MountHashTable {
/// RCU-protected bucket array. Each bucket is the head of an intrusive
/// singly-linked list of `Mount` nodes (via `Mount.hash_link`).
/// Readers traverse under `rcu_read_lock()`; writers modify under
/// the namespace's `mount_lock`.
buckets: RcuCell<Box<[MountHashBucket]>>,
/// Number of entries in the hash table. Used for load-factor
/// computation during resize decisions. Modified only under `mount_lock`.
count: u32,
/// SipHash key for this hash table. Per-namespace, generated at
/// namespace creation from the kernel CSPRNG.
hash_key: [u64; 2],
}
/// A single bucket in the mount hash table. Contains the head pointer
/// of an RCU-protected chain of Mount nodes.
struct MountHashBucket {
/// Head of the intrusive linked list of Mount nodes hashing to this
/// bucket. Null if the bucket is empty. Readers follow this chain
/// under RCU; writers modify under `mount_lock`.
head: AtomicPtr<Mount>,
}
impl MountHashTable {
/// Look up a child mount at the given `(parent, dentry)` pair.
///
/// Called during path resolution when a dentry has the `DCACHE_MOUNTED`
/// flag set. Must be called under `rcu_read_lock()`.
///
/// Returns `Some(&Mount)` if a mount is found at this point, or
/// `None` if the dentry is not a mount point (stale `DCACHE_MOUNTED`
/// flag — possible after lazy unmount).
///
/// **Performance**: O(1) expected, O(n) worst-case where n is the
/// chain length (bounded by load factor < 2.0). No locks, no atomics
/// beyond the initial `Acquire` load of the bucket head pointer.
pub fn lookup<'a>(
&'a self,
parent_mount_id: u64,
mountpoint_inode: InodeId,
_rcu: &'a RcuReadGuard,
) -> Option<&'a Mount> {
let hash = siphash_1_3(
self.hash_key,
parent_mount_id,
mountpoint_inode,
);
let bucket_idx = hash as usize % self.bucket_count();
let bucket = &self.buckets.read(_rcu)[bucket_idx];
let mut current = bucket.head.load(Ordering::Acquire);
while !current.is_null() {
// SAFETY: `current` is a valid Mount pointer within an RCU
// read-side critical section. The Mount node is not freed
// until after the RCU grace period.
let mnt = unsafe { &*current };
if mnt.mount_id_of_parent() == parent_mount_id
&& mnt.mountpoint_inode() == mountpoint_inode
&& !mnt.is_doomed()
{
return Some(mnt);
}
current = mnt.hash_link.next.load(Ordering::Acquire);
}
None
}
}
13.2.5 Mount Namespace
/// A mount namespace. Contains an independent mount tree with its own root
/// mount, hash table, and mount list. Created by `clone(CLONE_NEWNS)` or
/// `unshare(CLONE_NEWNS)`.
///
/// The `vfs_root: Capability<VfsNode>` field in `NamespaceSet` (Section 16.1.2)
/// is updated to point to this namespace's root mount:
///
/// ```rust
/// // Updated NamespaceSet field (replaces the previous Capability<VfsNode>):
/// pub mount_ns: Arc<MountNamespace>,
/// ```
///
/// **Relationship to NamespaceSet**: Each process's `NamespaceSet` holds
/// an `Arc<MountNamespace>`. Multiple processes in the same mount namespace
/// share the same `Arc<MountNamespace>`. When `clone(CLONE_NEWNS)` is called,
/// a new `MountNamespace` is created by cloning the parent's mount tree
/// (via `copy_tree()`).
pub struct MountNamespace {
/// Unique namespace identifier. Used for `/proc/PID/ns/mnt` inode
/// number and `setns()` namespace comparison.
pub ns_id: u64,
/// Root mount of this namespace's mount tree. This is the mount
/// that corresponds to "/" for all processes in this namespace.
/// Updated atomically by `pivot_root()`.
pub root: RcuCell<Arc<Mount>>,
/// Ordered list of all mounts in this namespace. The ordering is
/// topological: parent mounts appear before their children. This
/// ordering is used by:
/// - `/proc/PID/mountinfo`: output follows this order
/// - `umount -a`: unmounts in reverse order (leaves before parents)
/// - Namespace teardown: unmounts in reverse topological order
pub mount_list: IntrusiveList<Arc<Mount>>,
/// Number of mounts in this namespace. Used to enforce the
/// per-namespace mount count limit (default: 100,000 — matching
/// Linux's `sysctl fs.mount-max`). Prevents mount-storm DoS attacks
/// where a compromised container creates millions of mounts.
pub mount_count: AtomicU64,
/// Event counter. Incremented on every mount/unmount/remount
/// operation. Used by `poll()` on `/proc/PID/mountinfo` to detect
/// mount tree changes. Container runtimes and systemd use this
/// to react to mount events without periodic scanning.
pub event_seq: AtomicU64,
/// Per-namespace mount hash table. Maps `(parent_mount, dentry)` to
/// child mount for path resolution mount-point crossings.
pub hash_table: MountHashTable,
/// Mutex serializing mount tree modifications (mount, unmount,
/// remount, pivot_root, bind mount, move mount). Readers (path
/// resolution) do not acquire this lock — they use RCU.
/// Lock hierarchy level 9 (MOUNT_LOCK): above DENTRY_LOCK (8),
/// below NET (10). See Section 3.1.5 lock hierarchy table.
pub mount_lock: Mutex<()>,
/// Mount ID allocator. Monotonically increasing 64-bit counter.
/// IDs are never reused within a namespace. At 1 mount/second
/// sustained, a 64-bit counter would not wrap for ~584 billion years.
pub id_allocator: AtomicU64,
/// Peer group ID allocator. Like mount IDs, monotonically increasing
/// and never reused. Separate from mount IDs because group IDs are
/// shared across mounts and have a different lifecycle.
pub group_id_allocator: AtomicU64,
/// User namespace that owns this mount namespace. Determines
/// capability checks for mount operations. A process must have
/// `CAP_MOUNT` in this user namespace (or an ancestor) to modify
/// the mount tree.
pub user_ns: Arc<UserNamespace>,
}
13.2.6 DCACHE_MOUNTED Integration
The dentry cache (Section 13.1.2) must track which dentries are mount points.
When a filesystem is mounted at a dentry, the VFS sets the DCACHE_MOUNTED
flag on that dentry. During path resolution (Section 13.1.3), when the VFS
encounters a dentry with DCACHE_MOUNTED set, it calls
MountHashTable::lookup() to find the child mount and continues resolution
from the child mount's root dentry.
/// Dentry cache entry flags. Stored in the dentry's `flags: AtomicU32` field.
/// Extended to include DCACHE_MOUNTED for mount-point detection.
bitflags! {
#[repr(transparent)]
pub struct DcacheFlags: u32 {
/// This dentry is a mount point — a filesystem is mounted on it.
/// Set by `do_mount()` when attaching a mount. Cleared by
/// `do_umount()` when the last mount at this dentry is removed.
///
/// Path resolution checks this flag on every path component.
/// When set, `MountHashTable::lookup(current_mount, dentry)` is
/// called to find the child mount. This check is a single atomic
/// load (~1 cycle) — the flag exists specifically to avoid a hash
/// table lookup on every path component (only mount points need
/// the lookup).
const DCACHE_MOUNTED = 1 << 0;
/// Dentry has been disconnected from the tree (e.g., NFS stale
/// handle, deleted directory that is still open).
const DCACHE_DISCONNECTED = 1 << 1;
/// Dentry is a negative dentry (caches a failed lookup).
const DCACHE_NEGATIVE = 1 << 2;
/// Dentry has filesystem-specific operations (d_revalidate, etc.).
/// A single summary bit here, in contrast to Linux's per-operation
/// DCACHE_OP_* flags.
const DCACHE_OP_MASK = 1 << 3;
}
}
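The fast-path interaction between the flag and the hash table can be sketched as below. This is a minimal, self-contained model (the `Dentry`, `MountTable`, and `cross_mount` names are illustrative, not the kernel's actual types): ordinary path components pay only one atomic load, and the hash lookup runs solely when `DCACHE_MOUNTED` is set. A lookup miss with the flag set is treated as a stale flag, matching the lazy-unmount caveat above.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};

const DCACHE_MOUNTED: u32 = 1 << 0;

struct Dentry {
    flags: AtomicU32,
}

/// Toy stand-in for MountHashTable: (parent_mount_id, inode) -> child mount id.
struct MountTable {
    map: HashMap<(u64, u64), u64>,
}

/// Mount-crossing check during path resolution. Returns the child mount id
/// if resolution should continue from the child's root dentry.
fn cross_mount(table: &MountTable, cur_mount: u64, inode: u64, d: &Dentry) -> Option<u64> {
    if d.flags.load(Ordering::Acquire) & DCACHE_MOUNTED == 0 {
        return None; // not a mount point: no hash lookup needed
    }
    // Flag is set: consult the hash table. A miss means the flag is stale
    // (possible after lazy unmount) and resolution continues in place.
    table.map.get(&(cur_mount, inode)).copied()
}
```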
13.2.7 Filesystem Context (New Mount API)
The new mount API (Linux 5.2+, used increasingly by container runtimes and
systemd) separates mount operations into discrete steps: context creation,
configuration, superblock creation, and attachment. This provides better
error reporting (errors at each step, not a single mount(2) errno) and
supports atomic mount configuration changes.
/// Filesystem context for the new mount API.
///
/// Created by `fsopen()`, configured by `fsconfig()`, and consumed by
/// `fsmount()`. The context holds all the state needed to create a new
/// superblock and mount, accumulated through multiple `fsconfig()` calls.
///
/// This is equivalent to Linux's `struct fs_context`.
///
/// **Lifetime**: The context is reference-counted via a file descriptor
/// returned by `fsopen()`. It is destroyed when the file descriptor is
/// closed. If `fsmount()` has not been called, the context is simply
/// freed (no mount created). If `fsmount()` was called, the context's
/// state has been consumed and the mount exists independently.
pub struct FsContext {
/// Filesystem type (e.g., "ext4", "tmpfs", "overlay"). Set at
/// `fsopen()` time and immutable thereafter.
pub fs_type: Arc<dyn FileSystemOps>,
/// Filesystem type name (for diagnostics and /proc/mounts).
pub fs_type_name: Box<[u8]>,
/// Source device or path (equivalent to mount(2) `source` parameter).
/// Set via `fsconfig(FSCONFIG_SET_STRING, "source", ...)`.
pub source: Option<Box<[u8]>>,
/// Accumulated mount options as key-value pairs. Each `fsconfig()`
/// call adds or modifies an entry. The filesystem driver validates
/// options at `fsconfig(FSCONFIG_CMD_CREATE)` time.
pub options: Vec<(Box<[u8]>, Box<[u8]>)>,
/// Binary data options (for filesystems that accept binary mount data).
/// Set via `fsconfig(FSCONFIG_SET_BINARY, ...)`.
pub binary_options: Vec<(Box<[u8]>, Box<[u8]>)>,
/// Mount flags to apply to the created mount.
pub mount_flags: MountFlags,
/// The created superblock. Set by `fsconfig(FSCONFIG_CMD_CREATE)`,
/// consumed by `fsmount()`.
pub superblock: Option<Arc<SuperBlock>>,
/// Error log. Filesystem drivers write diagnostic messages here
/// during context creation and configuration. Readable by userspace
/// via `read()` on the fscontext file descriptor.
pub log: Vec<u8>,
/// Purpose of this context: new mount, reconfiguration, or submount.
pub purpose: FsContextPurpose,
/// User namespace for permission checks. Set at `fsopen()` time
/// to the caller's user namespace.
pub user_ns: Arc<UserNamespace>,
}
/// Purpose of a filesystem context, controlling which operations are valid.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum FsContextPurpose {
/// Creating a new mount (from `fsopen()`).
NewMount = 0,
/// Reconfiguring an existing mount (from `fspick()`).
Reconfig = 1,
/// Internal: creating a submount (e.g., automount).
Submount = 2,
}
13.2.7.1 FsContext Lifecycle and Error Channel
The new mount API separates mount configuration into discrete, verifiable steps. Each step either advances the context state or returns a structured error. The full lifecycle:
Step 1: fd = fsopen("ext4", FSOPEN_CLOEXEC)
→ Validates "ext4" against the filesystem type registry.
→ Allocates FsContext { fs_type: ext4_ops, purpose: NewMount, state: Blank, ... }.
→ Returns an O_RDWR file descriptor backed by the FsContext.
→ FsContext state: Blank.
Step 2: fsconfig(fd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0)
fsconfig(fd, FSCONFIG_SET_STRING, "errors", "remount-ro", 0)
fsconfig(fd, FSCONFIG_SET_FLAG, "noatime", NULL, 0)
→ Each call appends to FsContext.options: [("source", "/dev/sda1"), ("errors", "remount-ro"), ...].
→ Returns 0 on success; EINVAL if the key is not recognized by the filesystem type.
→ FsContext state: Blank (still accumulating options).
Step 3: fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
→ Calls FileSystemOps::mount(source, flags, options) on the configured filesystem type.
→ On success: FsContext.superblock = Some(sb); state → Ready.
→ On failure: diagnostic message is written to FsContext.log; state → Failed.
Caller can read the error via read(fd, buf, len) — see Error Channel below.
→ Returns 0 on success; -errno on failure.
Step 4: mnt_fd = fsmount(fd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOATIME)
→ Consumes FsContext.superblock (state must be Ready; returns EBUSY if Mounted,
EINVAL if Blank or Failed).
→ Allocates a MountNode with MNT_DETACHED flag set.
→ Returns an O_PATH fd referencing the detached mount.
→ FsContext state: Mounted (further fsconfig/fsmount calls return EBUSY).
Step 5: move_mount(mnt_fd, "", AT_FDCWD, "/mnt/data", MOVE_MOUNT_F_EMPTY_PATH)
→ Attaches the detached mount to the namespace mount tree at /mnt/data.
→ Clears MNT_DETACHED from the MountNode.
→ Triggers mount propagation to peer/slave mounts (Section 13.2.10).
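The state transitions in Steps 1-5 can be condensed into a small state machine. This is a sketch of the transition rules only (the `CtxState` and `CtxError` names are illustrative): `FSCONFIG_CMD_CREATE` is valid only from `Blank`, and `fsmount()` consumes a `Ready` context, returning `EBUSY` once mounted and `EINVAL` before creation or after failure.

```rust
/// Illustrative FsContext lifecycle states from the steps above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CtxState { Blank, Ready, Failed, Mounted }

#[derive(Debug, PartialEq, Eq)]
enum CtxError { Einval, Ebusy }

/// fsconfig(FSCONFIG_CMD_CREATE): only valid from Blank. The driver's
/// mount() result decides whether the context becomes Ready or Failed.
fn cmd_create(state: CtxState, driver_ok: bool) -> Result<CtxState, CtxError> {
    match state {
        CtxState::Blank => Ok(if driver_ok { CtxState::Ready } else { CtxState::Failed }),
        _ => Err(CtxError::Ebusy),
    }
}

/// fsmount(): consumes a Ready context. Mounted -> EBUSY; Blank/Failed -> EINVAL.
fn fsmount(state: CtxState) -> Result<CtxState, CtxError> {
    match state {
        CtxState::Ready => Ok(CtxState::Mounted),
        CtxState::Mounted => Err(CtxError::Ebusy),
        CtxState::Blank | CtxState::Failed => Err(CtxError::Einval),
    }
}
```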
open_tree(2) — clone or open a mount:
fd = open_tree(dirfd, path, OPEN_TREE_CLONE | AT_RECURSIVE)
→ Resolves path to a mount.
→ OPEN_TREE_CLONE: creates a detached copy of the mount tree rooted at path,
identical to a recursive bind mount but without modifying the namespace.
AT_RECURSIVE: the clone includes all submounts below path.
→ The returned O_PATH fd can be passed to move_mount() to attach elsewhere.
→ Without OPEN_TREE_CLONE: returns an O_PATH fd referencing the existing mount
without cloning (useful for passing a mount reference across namespaces).
mount_setattr(2) — bulk-modify mount tree flags:
mount_setattr(dirfd, path, AT_RECURSIVE, &mount_attr { attr_set, attr_clr }, sizeof)
→ Resolves path to a mount.
→ AT_RECURSIVE: applies to all mounts in the subtree rooted at path.
→ attr_clr: clears these flags from each mount (applied first).
→ attr_set: sets these flags on each mount (applied after attr_clr).
→ The operation is atomic within the subtree: if validation fails for any mount
(e.g., clearing MNT_READONLY on a superblock-level read-only filesystem), no
flags are changed on any mount.
→ Requires CAP_MOUNT.
FsContext Error Channel:
When fsconfig(FSCONFIG_CMD_CREATE) or fsmount() encounter a filesystem-level error
(e.g., superblock checksum mismatch, missing required option, device I/O error), the error
is not conveyed solely via errno. The filesystem driver writes a human-readable diagnostic
string to FsContext.log. The caller retrieves it via read(fd, buf, len) on the FsContext
file descriptor:
read(fs_context_fd, buf, len):
if FsContext.log is empty: return 0 (EOF — no error message pending)
n = min(len, FsContext.log.len())
copy_to_user(buf, FsContext.log[..n])
FsContext.log.drain(..n)
return n
Example error message (readable by system administrators):
ext4: superblock checksum mismatch at block 0: expected 0xdeadbeef, got 0xcafebabe
This approach is superior to the traditional single-errno response: it gives system
administrators and container runtimes actionable diagnostic information without requiring
a separate diagnostics ioctl or /proc file.
13.2.8 Mount Attribute Structure (mount_setattr)
/// User-visible mount attribute structure for `mount_setattr(2)`.
/// Matches Linux's `struct mount_attr` exactly for ABI compatibility.
///
/// `mount_setattr()` atomically modifies mount properties on a single
/// mount or recursively on an entire mount tree (when `AT_RECURSIVE`
/// is passed). Container runtimes use this for recursive read-only
/// mounts (`MOUNT_ATTR_RDONLY` + `AT_RECURSIVE`).
#[repr(C)]
pub struct MountAttr {
/// Flags to set on the mount(s). Bits correspond to `MOUNT_ATTR_*`
/// constants. Applied after `attr_clr` (clear first, then set).
pub attr_set: u64,
/// Flags to clear from the mount(s). Applied before `attr_set`.
pub attr_clr: u64,
/// Propagation type to set. One of `MS_SHARED`, `MS_PRIVATE`,
/// `MS_SLAVE`, `MS_UNBINDABLE`, or 0 (no change). Only one
/// propagation flag may be set; combining them returns `EINVAL`.
pub propagation: u64,
/// File descriptor of the user namespace to associate with the
/// mount (for ID-mapped mounts). Ignored unless the attribute change
/// requests an ID-mapped mount; set to 0 when not changing the
/// mount's user namespace mapping.
pub userns_fd: u64,
}
/// MOUNT_ATTR_* flag constants for mount_setattr(2).
/// These map to MountFlags but use a separate constant space matching
/// Linux's UAPI.
pub const MOUNT_ATTR_RDONLY: u64 = 0x00000001;
pub const MOUNT_ATTR_NOSUID: u64 = 0x00000002;
pub const MOUNT_ATTR_NODEV: u64 = 0x00000004;
pub const MOUNT_ATTR_NOEXEC: u64 = 0x00000008;
pub const MOUNT_ATTR_NOATIME: u64 = 0x00000010;
pub const MOUNT_ATTR_STRICTATIME: u64 = 0x00000020;
pub const MOUNT_ATTR_NODIRATIME: u64 = 0x00000080;
pub const MOUNT_ATTR_NOSYMFOLLOW: u64 = 0x00200000;
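The clear-then-set ordering documented on `attr_set`/`attr_clr` reduces to one expression. A sketch (the `apply_mount_attr` helper name is illustrative):

```rust
/// mount_setattr(2) flag application: attr_clr is applied first,
/// then attr_set, per the MountAttr field documentation above.
fn apply_mount_attr(flags: u64, attr_set: u64, attr_clr: u64) -> u64 {
    (flags & !attr_clr) | attr_set
}
```

Because set runs after clear, a flag named in both `attr_clr` and `attr_set` ends up set.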
13.2.9 Mount Operations — Algorithms
All mount tree modification algorithms require holding the namespace's
mount_lock (lock hierarchy level 9, Section 3.1.5). Path resolution (read
path) uses only RCU and never acquires mount_lock. The algorithms below
describe the kernel-internal implementation; the syscall entry points
(mount(2), umount2(2), and the new mount API) perform argument
validation and capability checks before calling these internal functions.
13.2.9.1 do_mount — Mount a Filesystem
do_mount(source, target_path, fs_type, flags, data) -> Result<()>
Capability check: CAP_MOUNT in caller's mount namespace.
1. Resolve `target_path` to (mount, dentry) via path resolution (Section 13.1.3).
2. If `flags` contains MS_REMOUNT, delegate to do_remount() (Section 13.2.9.4).
3. If `flags` contains MS_BIND, delegate to do_bind_mount() (Section 13.2.9.5).
4. If `flags` contains MS_MOVE, delegate to do_move_mount() (Section 13.2.9.6).
5. If `flags` contains MS_SHARED|MS_PRIVATE|MS_SLAVE|MS_UNBINDABLE,
delegate to do_change_propagation() (Section 13.2.9.7).
6. Otherwise, this is a new filesystem mount:
a. Look up the filesystem type by name in the filesystem registry.
If not registered, return ENODEV.
b. Call `FileSystemOps::mount(source, flags, data)` on the filesystem
driver. This creates and returns a `SuperBlock`. On failure, return
the error from the driver.
c. Check namespace mount count against `mount_max` limit. If exceeded,
drop the superblock and return ENOSPC.
d. Allocate a new `Mount` node:
- `mount_id` from `namespace.id_allocator.fetch_add(1)`
- `parent` = resolved mount from step 1
- `mountpoint` = resolved dentry from step 1
- `root` = superblock's root dentry
- `superblock` = the SuperBlock from step 6b
- `flags` = translate MS_* to MountFlags
- `propagation` = Private (default for new mounts)
- `group_id` = 0 (private mount has no peer group)
- `mnt_count` = 0
e. Acquire `mount_lock`.
f. Set `DCACHE_MOUNTED` on the target dentry.
g. Insert the Mount into the mount hash table at
bucket(parent_mount_id, mountpoint_inode_id).
h. Add the Mount to the parent's `children` list.
i. Add the Mount to the namespace's `mount_list` (after its parent
in topological order).
j. Increment `namespace.mount_count`.
k. Propagate: if the parent mount is shared, call
`propagate_mount()` (Section 13.2.10.1) to replicate this mount
on all peers and slaves of the parent.
l. Increment `namespace.event_seq`.
m. Release `mount_lock`.
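Steps 6c-6d can be sketched in isolation: enforce the per-namespace mount limit before allocating a never-reused mount ID. The `Ns` struct and `alloc_mount_id` helper below are illustrative stand-ins for the `MountNamespace` fields described in Section 13.2.5.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const MOUNT_MAX: u64 = 100_000; // default fs.mount-max analogue

struct Ns {
    id_allocator: AtomicU64,
    mount_count: AtomicU64,
}

#[derive(Debug, PartialEq, Eq)]
struct Enospc;

/// do_mount steps 6c-6d sketch: check the mount-storm limit, then hand
/// out a monotonically increasing, never-reused mount ID.
fn alloc_mount_id(ns: &Ns) -> Result<u64, Enospc> {
    if ns.mount_count.load(Ordering::Relaxed) >= MOUNT_MAX {
        return Err(Enospc); // caller drops the superblock, returns ENOSPC
    }
    ns.mount_count.fetch_add(1, Ordering::Relaxed);
    Ok(ns.id_allocator.fetch_add(1, Ordering::Relaxed))
}
```

In the kernel the check and increment happen under `mount_lock`, so the check-then-increment race sketched here with relaxed atomics does not arise.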
13.2.9.2 do_umount — Unmount a Filesystem
do_umount(target_mount, flags) -> Result<()>
Capability check: CAP_MOUNT in caller's mount namespace.
1. If `target_mount` is the namespace root and flags does not contain
MNT_DETACH, return EBUSY (cannot unmount root).
2. If `target_mount.flags` has MNT_LOCKED and the caller lacks
CAP_SYS_ADMIN in the mount's owning user namespace, return EPERM.
3. If `flags` does not contain MNT_DETACH (not lazy):
a. Check `target_mount.mnt_count`. If > 0, return EBUSY.
b. Check that `target_mount.children` is empty. If not, return EBUSY
(sub-mounts must be unmounted first, unless MNT_DETACH is used).
4. If `flags` contains MNT_FORCE:
a. Call `FileSystemOps::force_umount()` if the filesystem supports it.
This causes in-flight I/O to fail with EIO. NFS uses this for stale
server recovery.
5. Acquire `mount_lock`.
6. Set `MNT_DOOMED` on `target_mount.flags` (atomic OR).
This prevents new path lookups from entering the mount.
7. Remove `target_mount` from the mount hash table.
8. Remove `target_mount` from the parent's `children` list.
9. If the target dentry no longer has any mounts, clear `DCACHE_MOUNTED`
on the mountpoint dentry. (Multiple mounts can be stacked on the same
dentry; only clear when the last one is removed.)
10. Propagate: if the parent mount is shared, call `propagate_umount()`
(Section 13.2.10.2) to remove corresponding mounts from peers and slaves.
11. Remove from `namespace.mount_list`.
12. Decrement `namespace.mount_count`.
13. Increment `namespace.event_seq`.
14. Release `mount_lock`.
15. If `flags` contains MNT_DETACH (lazy unmount):
a. The mount is now disconnected from the tree but may still be
referenced by open file descriptors (mnt_count > 0). It will be
fully freed when the last reference is dropped.
b. Open files continue to work on the disconnected mount. New path
lookups cannot reach it.
16. If not lazy: call `FileSystemOps::unmount()` synchronously.
If lazy: schedule `FileSystemOps::unmount()` to run when `mnt_count`
drops to 0 (via a callback registered on the final `Arc::drop`).
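The busy checks in steps 1-3 can be condensed into a single predicate. This sketch (the `umount_precheck` name is illustrative) shows how MNT_DETACH bypasses both the reference-count and sub-mount checks:

```rust
#[derive(Debug, PartialEq, Eq)]
enum UmountErr { Ebusy }

/// do_umount steps 1-3 sketch: a non-lazy unmount fails with EBUSY while
/// references or sub-mounts remain; MNT_DETACH (lazy) skips both checks
/// but still cannot detach the namespace root... actually per step 1 the
/// root is only unmountable with MNT_DETACH.
fn umount_precheck(mnt_count: u64, has_children: bool, is_ns_root: bool, detach: bool)
    -> Result<(), UmountErr>
{
    if is_ns_root && !detach {
        return Err(UmountErr::Ebusy); // step 1: cannot unmount root
    }
    if !detach && (mnt_count > 0 || has_children) {
        return Err(UmountErr::Ebusy); // steps 3a-3b
    }
    Ok(())
}
```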
13.2.9.3 do_umount_tree — Recursive Unmount
do_umount_tree(root_mount, flags) -> Result<()>
Used by MNT_DETACH on a mount with sub-mounts, and by namespace teardown.
1. Acquire `mount_lock`.
2. Collect all mounts in the subtree rooted at `root_mount` by traversing
`root_mount.children` recursively. Collect in reverse topological order
(leaves first, root last).
3. For each mount in the collected list:
a. Set MNT_DOOMED.
b. Remove from hash table.
c. Remove from parent's children list.
d. Clear DCACHE_MOUNTED if no other mount remains at that dentry.
e. Remove from namespace.mount_list.
f. Decrement namespace.mount_count.
4. Propagate umount for each removed mount.
5. Increment namespace.event_seq.
6. Release `mount_lock`.
7. For each collected mount: schedule filesystem unmount (immediate
if mnt_count == 0, deferred if lazy).
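Step 2's reverse-topological collection is a post-order tree walk: every descendant is emitted before its parent, so unmounting the collected list front-to-back always removes leaves first. A sketch over a toy id-based tree (the `collect_subtree` helper is illustrative):

```rust
use std::collections::HashMap;

/// Post-order walk of a toy mount tree (id -> child ids), producing the
/// reverse topological order do_umount_tree step 2 requires.
fn collect_subtree(children: &HashMap<u64, Vec<u64>>, root: u64, out: &mut Vec<u64>) {
    for &c in children.get(&root).map(|v| v.as_slice()).unwrap_or(&[]) {
        collect_subtree(children, c, out);
    }
    out.push(root); // a mount is emitted only after all of its descendants
}
```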
13.2.9.4 do_remount — Change Mount Flags/Options
do_remount(target_mount, flags, data) -> Result<()>
Capability check: CAP_MOUNT in caller's mount namespace.
1. Translate new `flags` to `MountFlags`.
2. Extract per-superblock options from `data`.
3. Acquire `mount_lock`.
4. Update `target_mount.flags` atomically.
Note: a remount can change per-mount flags (readonly, nosuid, etc.)
independently of superblock options. For example, `mount -o remount,ro`
on a bind mount makes that mount point read-only without affecting
other mount points of the same filesystem.
5. If per-superblock options changed, call
`FileSystemOps::remount(sb, flags, data)`. On failure, restore the
old flags and return the error.
6. Increment `namespace.event_seq`.
7. Release `mount_lock`.
13.2.9.5 do_bind_mount — Bind Mount (MS_BIND)
do_bind_mount(source_path, target_path, flags) -> Result<()>
Capability check: CAP_MOUNT + read access to source path.
1. Resolve `source_path` to (source_mount, source_dentry).
2. Resolve `target_path` to (target_mount, target_dentry).
3. If `source_mount.propagation == Unbindable`, return EINVAL.
4. Clone the source mount:
a. Allocate a new `Mount` node.
b. `superblock` = `source_mount.superblock` (shared — same filesystem
instance, same data pages).
c. `root` = `source_dentry` (bind mount's root is the source path,
not necessarily the source mount's root — this is how bind mounts
of subdirectories work).
d. `flags` = copy from source, then apply any new flags from `flags`.
e. `propagation` = Private (new bind mounts default to Private).
5. If `flags` contains MS_REC (recursive bind):
a. For each sub-mount under `source_mount` (descendants of
`source_dentry`), clone the mount and attach it at the
corresponding dentry under the new bind mount.
b. Skip unbindable mounts.
6. Acquire `mount_lock`.
7. Attach the cloned mount(s) at target_path (same steps as
do_mount steps 6f-6l; the lock is already held).
8. Release `mount_lock`.
13.2.9.6 do_move_mount — Move a Mount (MS_MOVE)
do_move_mount(source_mount, target_path) -> Result<()>
Capability check: CAP_MOUNT in caller's mount namespace.
1. Resolve `target_path` to (target_parent_mount, target_dentry).
2. Verify `target_dentry` is not a descendant of `source_mount`
(moving a mount underneath itself would create a cycle). Return
EINVAL if it is.
3. Verify `source_mount` is not the namespace root; if it is, return EINVAL.
4. Acquire `mount_lock`.
5. Remove `source_mount` from the old location:
a. Remove from hash table at old (parent, dentry) key.
b. Remove from old parent's children list.
c. Clear DCACHE_MOUNTED on old mountpoint dentry (if no other
mount remains).
6. Attach at new location:
a. Update `source_mount.parent` to `target_parent_mount`.
b. Update `source_mount.mountpoint` to `target_dentry`.
c. Insert into hash table at new (parent, dentry) key.
d. Add to new parent's children list.
e. Set DCACHE_MOUNTED on `target_dentry`.
7. Propagation: moving a mount does not trigger propagation
(matches Linux behavior).
8. Increment `namespace.event_seq`.
9. Release `mount_lock`.
13.2.9.7 do_change_propagation — Set Propagation Type
do_change_propagation(target_mount, type, flags) -> Result<()>
Capability check: CAP_MOUNT in caller's mount namespace.
1. Determine the target mount(s):
- If `flags` contains MS_REC: target mount and all descendants.
- Otherwise: target mount only.
2. Acquire `mount_lock`.
3. For each target mount:
a. If changing to Shared:
- Allocate a new `group_id` from `namespace.group_id_allocator`.
- Set `mount.group_id = new_id`.
- If the mount was previously a slave, it becomes shared+slave
(receives from master AND propagates to peers).
b. If changing to Private:
- Remove from peer group ring (`mnt_share`).
- Remove from master's slave list (if slave).
- Set `mount.group_id = 0`.
- Set `mount.mnt_master = None`.
c. If changing to Slave:
- If the mount is currently shared, it becomes a slave of its
former peer group. The first remaining peer becomes the master.
- Remove from peer group ring.
- Add to master's `mnt_slave_list`.
- Set `mount.mnt_master` to the former peer group leader.
- Mount retains its `group_id` (for mountinfo optional fields).
d. If changing to Unbindable:
- Same as Private, plus prevents bind mount of this mount.
e. Update `mount.propagation`.
4. Increment `namespace.event_seq`.
5. Release `mount_lock`.
13.2.10 Mount Propagation Algorithms
Mount propagation ensures that mount/unmount events on shared mount points are replicated across all related mount points. This is essential for container volume mounts: when a volume is mounted on a shared host path, all containers that have a slave relationship to that path see the new mount.
13.2.10.1 propagate_mount
propagate_mount(source_mount, new_child_mount) -> Result<()>
Called under mount_lock when a mount is added to a shared mount point.
1. Walk the peer group ring of `source_mount` (via `mnt_share` links).
For each peer mount (excluding `source_mount` itself):
a. Clone `new_child_mount` with the peer as parent.
The clone's mountpoint is the dentry in the peer's filesystem
that corresponds to `new_child_mount.mountpoint` in the source.
b. Attach the clone at the peer (insert into hash table, set
DCACHE_MOUNTED, add to children list, add to mount_list).
c. If the clone's parent is shared, recursively propagate to
that peer group (but track visited groups to avoid infinite loops).
2. Walk the slave list of `source_mount` (via `mnt_slave_list`).
For each slave mount:
a. Clone `new_child_mount` with the slave as parent.
b. Attach the clone at the slave.
c. If the slave is also shared (shared+slave), propagate to the
slave's peer group (step 1 applied to the slave's peers).
3. If cloning fails at any propagation step (e.g., ENOMEM, or ENOSPC
when the per-namespace mount count limit is exceeded), roll back:
remove all clones created in this propagation pass and return the
error. Propagation is all-or-nothing within a single mount operation.
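The all-or-nothing behavior of step 3 can be sketched abstractly. In this toy model (the `propagate` function and its `budget` parameter are illustrative; the budget stands in for whatever resource limit the real attach can hit), either every peer receives a clone or none does:

```rust
/// All-or-nothing propagation sketch: attempt to attach a clone at each
/// peer; if any attach would fail, discard every clone made in this pass
/// and report the error to the caller.
fn propagate(peers: &[u64], budget: usize) -> Result<Vec<u64>, ()> {
    let mut clones = Vec::new();
    for &peer in peers {
        if clones.len() >= budget {
            clones.clear(); // roll back all clones from this pass
            return Err(());
        }
        clones.push(peer); // stand-in for "clone attached at this peer"
    }
    Ok(clones)
}
```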
13.2.10.2 propagate_umount
propagate_umount(source_mount) -> Result<()>
Called under mount_lock when a mount is removed from a shared mount point.
1. Walk the peer group ring of `source_mount.parent` (the parent must
be shared for propagation to occur).
For each peer of the parent:
a. Look up a child mount at the corresponding mountpoint dentry
in the peer's mount hash table.
b. If found and the child's superblock matches `source_mount`'s
superblock (same filesystem), unmount it (do_umount steps 6-12).
c. If the child mount has its own children, recursively unmount
the subtree (do_umount_tree).
2. Walk the slave list of the parent.
For each slave:
a. Same as step 1a-1c, applied to the slave.
13.2.11 Namespace Operations
13.2.11.1 copy_tree — Clone Mount Tree for CLONE_NEWNS
copy_tree(source_root_mount, source_root_dentry) -> Result<Arc<MountNamespace>>
Called by clone(CLONE_NEWNS) and unshare(CLONE_NEWNS).
1. Allocate a new `MountNamespace` with fresh `ns_id`, empty hash table,
and a new `mount_lock`.
2. The new namespace inherits the parent's `user_ns`.
3. Clone the source root mount:
a. Allocate new `Mount` with the same superblock and root dentry.
b. Flags are copied. Propagation is set to Private (default for
new namespace — Section 16.1.2 states "CLONE_NEWNS: child's mounts
are private unless marked shared").
4. For each mount in the source namespace's mount_list (topological order):
a. Skip unbindable mounts.
b. Clone the mount into the new namespace.
c. Preserve the parent-child relationship (the cloned child's parent
is the clone of the original child's parent).
d. Insert into the new namespace's hash table and mount_list.
e. Set propagation:
- If the source mount is shared: the clone is added to the same
peer group (shared propagation preserved across CLONE_NEWNS).
This is critical for container runtimes that rely on propagation.
- If the source mount is private/slave/unbindable: the clone is
Private.
5. Set the new namespace's root to the clone of `source_root_mount`.
6. Return the new namespace.
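Step 4c relies on the topological iteration order: a parent's clone always exists before any of its children are cloned, so a single old-id to new-id map suffices to rebuild the parent links. A sketch over a flat list of `(mount_id, parent_id)` pairs (the `clone_tree` helper is illustrative):

```rust
use std::collections::HashMap;

/// copy_tree step 4c sketch: clone mounts in topological order, remapping
/// each clone's parent through an old-id -> new-id table. Returns
/// old_id -> (new_id, new_parent_id).
fn clone_tree(mounts: &[(u64, Option<u64>)]) -> HashMap<u64, (u64, Option<u64>)> {
    let mut old_to_new: HashMap<u64, u64> = HashMap::new();
    let mut clones = HashMap::new();
    let mut next_id = 100; // fresh IDs in the new namespace
    for &(old_id, old_parent) in mounts {
        let new_id = next_id;
        next_id += 1;
        old_to_new.insert(old_id, new_id);
        // Topological order guarantees the parent was already cloned.
        let new_parent = old_parent.map(|p| old_to_new[&p]);
        clones.insert(old_id, (new_id, new_parent));
    }
    clones
}
```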
13.2.11.2 pivot_root Integration
The pivot_root(2) algorithm specified in Section 16.1.3 is updated to
use the Mount data structure:
pivot_root(new_root_path, put_old_path) -> Result<()>
Capability check: CAP_SYS_ADMIN in caller's user namespace.
The caller must be in a mount namespace (not the initial namespace).
1. Resolve `new_root_path` to (new_root_mount, new_root_dentry).
Verify `new_root_dentry` is the root of `new_root_mount` (i.e.,
new_root is a mount point, not just a directory).
2. Resolve `put_old_path` to (put_old_mount, put_old_dentry).
Verify `put_old` is at or under `new_root`.
3. Verify `new_root_mount` is not the current namespace root; pivoting
to the existing root returns EINVAL.
4. Verify neither `new_root_mount` nor the current root mount has
shared propagation (a shared mount here would propagate the pivot
outside this namespace). Return EINVAL if either does.
5. Acquire `mount_lock`.
6. Let `old_root_mount` = namespace's current root mount.
7. Detach `new_root_mount` from its current position:
a. Remove from hash table.
b. Remove from parent's children.
c. Clear DCACHE_MOUNTED on its old mountpoint.
8. Reattach `old_root_mount` at `put_old`:
a. Set `old_root_mount.parent` = `new_root_mount`.
b. Set `old_root_mount.mountpoint` = the dentry corresponding to
`put_old` within `new_root_mount`'s filesystem.
c. Insert `old_root_mount` into hash table at new position.
d. Set DCACHE_MOUNTED on the put_old dentry.
9. Set `new_root_mount` as the namespace root:
a. `new_root_mount.parent` = None (it is now the root).
b. `new_root_mount.mountpoint` = `new_root_mount.root` (self-referential
for the root mount).
c. `namespace.root.update(new_root_mount, &mount_lock_guard)` (RCU
publish via RcuCell::update).
10. Update all processes in this namespace whose root or cwd was
under the old root to point to the new root.
11. Increment `namespace.event_seq`.
12. Release `mount_lock`.
Note: Steps 7-9 are the atomic state change. In-flight path lookups
that started before step 9 see the old root via RCU (the old
`RcuCell` value remains valid until the grace period). New lookups
after step 9 see the new root. This matches the atomicity guarantee
specified in Section 16.1.3.
13.2.11.3 Namespace Teardown
When a mount namespace is destroyed (all processes exited, all
/proc/PID/ns/mnt file descriptors closed, all bind mounts of the
namespace file unmounted):
destroy_mount_namespace(ns) -> ()
1. Acquire `mount_lock`.
2. Iterate `ns.mount_list` in reverse topological order (leaves first).
3. For each mount:
a. Set MNT_DOOMED.
b. Remove from hash table.
c. Remove from parent's children.
d. Remove from peer group and slave lists.
4. Release `mount_lock`.
5. For each removed mount (in reverse order):
a. If `mnt_count == 0`, call `FileSystemOps::unmount()`.
b. If `mnt_count > 0` (lazy unmount remnants still referenced by
open file descriptors), defer unmount to final reference drop.
6. Drop the hash table and mount list.
13.2.12 New Mount API Syscalls
UmkaOS implements the Linux 5.2+ mount API syscalls for compatibility with modern container runtimes (containerd, CRI-O) and systemd. These are thin wrappers around the internal mount operations described above.
| Syscall | Purpose | Capability |
|---|---|---|
| fsopen(fs_type, flags) | Create a filesystem context | CAP_MOUNT |
| fspick(dirfd, path, flags) | Create a reconfiguration context for an existing mount | CAP_MOUNT |
| fsconfig(fd, cmd, key, value, aux) | Configure a filesystem context | CAP_MOUNT |
| fsmount(fs_fd, flags, mount_attr) | Create a detached mount from a configured context | CAP_MOUNT |
| move_mount(from_dirfd, from_path, to_dirfd, to_path, flags) | Attach a detached mount or move an existing mount | CAP_MOUNT |
| open_tree(dirfd, path, flags) | Open or clone a mount point as a file descriptor | CAP_MOUNT (if OPEN_TREE_CLONE) |
| mount_setattr(dirfd, path, flags, attr, size) | Modify mount attributes, optionally recursively | CAP_MOUNT |
fsopen flow:
1. Validate fs_type against the filesystem registry.
2. Allocate FsContext with purpose = NewMount.
3. Return a file descriptor referencing the context.
fsconfig flow (selected commands):
- FSCONFIG_SET_STRING: set a key-value option string.
- FSCONFIG_SET_BINARY: set a binary option blob.
- FSCONFIG_SET_FD: set an option to a file descriptor (e.g., source device).
- FSCONFIG_CMD_CREATE: validate all options and create the superblock
by calling FileSystemOps::mount(). On success, the superblock is stored
in FsContext.superblock. On failure, diagnostic messages are written
to the context's error log.
- FSCONFIG_CMD_RECONFIGURE: for fspick contexts, apply new options
to the existing superblock via FileSystemOps::remount().
fsmount flow:
1. Consume the superblock from the FsContext.
2. Allocate a Mount node with MNT_DETACHED flag set.
3. The mount is not yet attached to any namespace or visible to path
resolution. It exists only as a detached object referenced by the
returned file descriptor.
4. Return an O_PATH file descriptor referencing the detached mount.
move_mount flow:
1. Resolve the source (detached mount fd or existing mount path).
2. Resolve the target path.
3. If the source is detached (MNT_DETACHED):
a. Clear MNT_DETACHED.
b. Attach to the namespace via do_mount steps 6e-6m.
4. If the source is an existing mount:
a. Delegate to do_move_mount() (Section 13.2.9.6).
open_tree flow:
1. Resolve the path to a mount.
2. If OPEN_TREE_CLONE:
a. Clone the mount (like do_bind_mount without attaching).
b. The clone is detached (MNT_DETACHED).
c. If OPEN_TREE_CLONE | AT_RECURSIVE: recursively clone the subtree.
3. Return an O_PATH file descriptor.
mount_setattr flow:
1. Resolve the path to a mount.
2. Validate attr_set and attr_clr do not conflict.
3. Acquire mount_lock.
4. If AT_RECURSIVE:
a. Collect all mounts in the subtree.
b. Validate the changes are valid for all mounts (e.g., clearing
MNT_READONLY on a mount whose superblock is read-only is invalid).
c. If validation fails for any mount, return error (no partial changes).
d. Apply attr_clr then attr_set to all mounts atomically.
5. If not recursive: apply to the single mount.
6. If attr.propagation != 0: change propagation type (Section 13.2.9.7).
7. Increment namespace.event_seq.
8. Release mount_lock.
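Taken together, the fsmount and move_mount flows form a small state machine over the MNT_DETACHED flag: fsmount produces a detached, invisible mount, and move_mount attaches it exactly once. The sketch below models only that transition; the type and function names are ours, and the real implementation also updates the mount hash table and namespace lists.

```rust
// Simplified model of the detached-mount lifecycle. `Mount` here is a toy
// stand-in: real attachment also inserts into the mount hash table and the
// namespace's mount list (do_mount steps 6e-6m).
const MNT_DETACHED: u32 = 1 << 0;

struct Mount {
    flags: u32,
    mountpoint: Option<String>, // None while detached (not visible to path resolution)
}

/// fsmount: produce a detached mount, reachable only through its fd.
fn fsmount_model() -> Mount {
    Mount { flags: MNT_DETACHED, mountpoint: None }
}

/// move_mount with a detached source: clear MNT_DETACHED and attach.
/// An already-attached source would instead go through do_move_mount().
fn move_mount_model(m: &mut Mount, target: &str) -> Result<(), &'static str> {
    if m.flags & MNT_DETACHED == 0 {
        return Err("source already attached; delegate to do_move_mount()");
    }
    m.flags &= !MNT_DETACHED;
    m.mountpoint = Some(target.to_string());
    Ok(())
}
```

The key property the model captures: a detached mount is attachable exactly once, after which the detached-source path is no longer taken.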
13.2.13 Mount Introspection Syscalls
Linux 6.8 introduced statmount(2) and listmount(2) as structured
replacements for parsing /proc/PID/mountinfo. UmkaOS implements both for
container introspection tools and future-compatible userspace.
| Syscall | Purpose | Capability |
|---|---|---|
| statmount(req, buf, bufsize, flags) | Query detailed mount information by mount ID | None (own namespace) |
| listmount(req, buf, bufsize, flags) | List child mount IDs of a given mount | None (own namespace) |
statmount: Returns a struct statmount containing the mount's ID,
parent ID, mount flags, propagation type, peer group ID, master mount ID,
filesystem type, mount source, mount point path, and superblock options.
The request specifies which fields to populate via a bitmask, avoiding
unnecessary work (e.g., path resolution for mount point is skipped if
STATMOUNT_MNT_POINT is not requested).
listmount: Returns an array of 64-bit mount IDs for the child mounts
of a given mount. Supports cursor-based iteration: the caller passes the
last seen mount ID, and listmount returns mount IDs after that cursor.
This handles concurrent mount/unmount gracefully (mounts added after the
cursor are seen; mounts removed are skipped).
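The cursor contract can be illustrated in a few lines of Rust. `listmount_model` is an illustrative stand-in for the syscall, operating on a sorted list of child mount IDs:

```rust
// Cursor-based iteration sketch: return up to `buf_len` child mount IDs
// strictly after `last_seen`. Because children are ordered by mount ID,
// mounts added after the cursor are picked up on the next call and
// removed mounts are simply skipped.
fn listmount_model(children: &[u64], last_seen: u64, buf_len: usize) -> Vec<u64> {
    children
        .iter()
        .copied()
        .filter(|&id| id > last_seen)
        .take(buf_len)
        .collect()
}
```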
13.2.14 /proc/PID/mountinfo Format
Each process exposes its mount namespace's mount tree through
/proc/PID/mountinfo and /proc/PID/mounts. These files are read by
systemd, Docker, findmnt, df, mountpoint, and other tools.
mountinfo line format (one line per mount, matching Linux exactly):
<mount_id> <parent_id> <major>:<minor> <root> <mount_point> <mount_options> <optional_fields> - <fs_type> <mount_source> <super_options>
| Field | Source | Example |
|---|---|---|
| mount_id | Mount.mount_id | 36 |
| parent_id | Mount.parent.mount_id (self for root) | 35 |
| major:minor | SuperBlock.dev major:minor | 98:0 |
| root | Path of mount root within the filesystem | / or /subdir |
| mount_point | Path of mount point relative to process root | /mnt/data |
| mount_options | Per-mount flags as comma-separated options | rw,noatime,nosuid |
| optional fields | Propagation: shared:N, master:N, propagate_from:N | shared:1 master:2 |
| separator | Literal hyphen | - |
| fs_type | Filesystem type name | ext4 |
| mount_source | Mount.device_name | /dev/sda1 |
| super_options | From FileSystemOps::show_options() | rw,errors=continue |
Implementation: The VFS iterates the namespace's mount_list under
rcu_read_lock() and formats each line. The mount_list's topological
ordering ensures that parent mounts appear before children (matching
Linux's output order).
/proc/PID/mounts: A simplified view matching the old /etc/mtab
format: <device> <mount_point> <fs_type> <options> 0 0. Generated
from the same mount_list, omitting mount IDs and propagation fields.
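As a sketch, the mountinfo line format above can be reproduced with a simple formatter. `MountInfoLine` is an illustrative flattened view (not the actual Mount struct); the example values come from the field table above.

```rust
// Illustrative flattened view of one mount, holding exactly the fields
// that appear in a /proc/PID/mountinfo line.
struct MountInfoLine<'a> {
    mount_id: u64,
    parent_id: u64,
    major: u32,
    minor: u32,
    root: &'a str,
    mount_point: &'a str,
    mount_options: &'a str,
    optional_fields: &'a [&'a str], // e.g. ["shared:1", "master:2"]; may be empty
    fs_type: &'a str,
    mount_source: &'a str,
    super_options: &'a str,
}

fn format_mountinfo(l: &MountInfoLine) -> String {
    let mut s = format!(
        "{} {} {}:{} {} {} {}",
        l.mount_id, l.parent_id, l.major, l.minor, l.root, l.mount_point, l.mount_options
    );
    // Zero or more optional fields, then the literal hyphen separator.
    for f in l.optional_fields {
        s.push(' ');
        s.push_str(f);
    }
    s.push_str(&format!(" - {} {} {}", l.fs_type, l.mount_source, l.super_options));
    s
}
```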
13.2.15 Path Resolution Integration
This section details how the mount tree integrates with the path resolution algorithm described in Section 13.1.3.
Mount crossing in RCU-walk (fast path):
resolve_component_rcu(current_mount, current_dentry, name):
1. Look up `name` in the dentry cache: dentry = dcache_lookup(current_dentry, name).
2. If dentry is not found: fall through to ref-walk (cache miss).
3. If dentry.flags has DCACHE_MOUNTED:
a. Call MountHashTable::lookup(current_mount.mount_id, dentry.inode, &rcu_guard).
b. If a child mount is found:
- current_mount = child_mount
- dentry = child_mount.root
- If dentry also has DCACHE_MOUNTED, repeat step 3
(stacked mounts — rare but legal).
c. If no child mount found: DCACHE_MOUNTED is stale (race with
umount). Clear the flag lazily and continue with the dentry.
4. Return (current_mount, dentry).
Mount crossing in ref-walk (slow path):
resolve_component_ref(current_mount, current_dentry, name):
1. Same as RCU-walk step 1, but acquires a reference count on the dentry.
2. Same DCACHE_MOUNTED check.
3. If mount crossing:
a. Call MountHashTable::lookup() under rcu_read_lock().
b. If found: increment child_mount.mnt_count (atomic add).
c. Decrement current_mount.mnt_count.
d. current_mount = child_mount; dentry = child_mount.root.
4. Return (current_mount, dentry).
".." traversal across mount boundaries:
resolve_dotdot(current_mount, current_dentry):
1. If current_dentry == current_mount.root:
- We are at the root of this mount. ".." should cross into the parent
mount.
- If current_mount.parent is None: we are at the namespace root.
".." resolves to the root itself (cannot go above /).
- Otherwise: current_mount = current_mount.parent.
current_dentry = current_mount.mountpoint.
(Continue resolving ".." from the parent mount's mountpoint.)
2. If current_dentry != current_mount.root:
- Normal ".." within the mount's filesystem.
- current_dentry = current_dentry.parent.
3. Return (current_mount, current_dentry).
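The resolve_dotdot steps above can be sketched as a minimal executable model, assuming a toy mount tree where mounts and dentries are plain indices (the real code walks Mount and Dentry structures under RCU):

```rust
// Toy model of ".." traversal across mount boundaries. All types and names
// are illustrative stand-ins for the real Mount/Dentry structures.
#[derive(Clone, Copy, PartialEq, Debug)]
struct DentryId(usize);

struct Dentry { parent: DentryId }

struct Mnt {
    parent: Option<usize>, // index into `mounts`; None = namespace root mount
    root: DentryId,        // root dentry of this mount
    mountpoint: DentryId,  // dentry in the parent mount that this mount covers
}

fn resolve_dotdot(
    mounts: &[Mnt],
    dentries: &[Dentry],
    mut mnt: usize,
    mut d: DentryId,
) -> (usize, DentryId) {
    loop {
        if d != mounts[mnt].root {
            // Normal ".." within the mount's filesystem.
            return (mnt, dentries[d.0].parent);
        }
        match mounts[mnt].parent {
            // At the namespace root: ".." resolves to the root itself.
            None => return (mnt, d),
            Some(p) => {
                // Cross into the parent mount and continue resolving ".."
                // from the mountpoint dentry.
                d = mounts[mnt].mountpoint;
                mnt = p;
            }
        }
    }
}
```

With a root mount and one child mounted at /mnt, ".." from the child's root lands on / in the root mount, and ".." at the namespace root stays put.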
13.2.16 Performance Characteristics
| Operation | Cost | Notes |
|---|---|---|
| Mount hash lookup (RCU read) | ~5-15 ns | SipHash + 1-2 pointer chases, no locks, no atomics. Occurs on every mount-point crossing during path resolution. |
| DCACHE_MOUNTED check | ~1 ns | Single atomic load of dentry flags. Occurs on every path component — the gate that avoids hash lookup on non-mount-point dentries. |
| Mount (new filesystem) | ~1-10 us | Dominated by filesystem driver's mount() (superblock creation). Mount tree insertion is ~200 ns under lock. |
| Unmount | ~500 ns - 5 us | Hash removal + propagation. Filesystem unmount() cost varies (ext4 journal flush vs. tmpfs instant). |
| Bind mount | ~300 ns | Mount node clone + hash insertion. No filesystem I/O. |
| Bind mount (recursive, N sub-mounts) | ~300*N ns | Linear in subtree size. |
| Propagation (mount, M peers) | ~300*M ns | One clone per peer. Propagation to slaves adds per-slave overhead. |
| /proc/PID/mountinfo generation | ~50 ns/mount | One line per mount. 100-mount namespace: ~5 us total. |
| copy_tree (CLONE_NEWNS, N mounts) | ~500*N ns | Clone all mounts. 100-mount namespace: ~50 us. |
| pivot_root | ~1 us | Two hash table mutations + RCU publish. |
Memory overhead per mount: ~320 bytes for the Mount struct (including
all intrusive list nodes and propagation fields) plus ~16 bytes for the hash
table entry. A container with 100 mounts consumes ~33 KiB of mount tree
metadata. A system with 10,000 containers (1 million mounts total) consumes
~320 MiB — proportional to the actual number of mounts, not pre-allocated.
13.2.17 Cross-References
- Section 3.1.5 (Lock Hierarchy): MOUNT_LOCK at level 9, between DENTRY_LOCK (8) and SOCK_LOCK (10).
- Section 8.1.1 (Capabilities): CAP_MOUNT (bit 70) gates all mount operations. CAP_SYS_ADMIN (bit 21) is required for pivot_root and MNT_LOCKED override.
- Section 13.1.1 (VFS Architecture): FileSystemOps::mount() creates the superblock consumed by do_mount(). FileSystemOps::unmount() is called by do_umount() after tree removal.
- Section 13.1.2 (Dentry Cache): The DCACHE_MOUNTED flag triggers the mount hash table lookup during path resolution.
- Section 13.1.3 (Path Resolution): RCU-walk and ref-walk mount crossing are detailed in Section 13.2.15.
- Section 13.1.4 (Mount Namespace and Capability-Gated Mounting): The capability table and propagation type summary specified there are implemented by the data structures in this section.
- Section 13.4 (overlayfs): OverlayFs::mount() creates an OverlaySuperBlock consumed via the standard do_mount() path.
- Section 16.1.2 (Namespace Implementation): NamespaceSet.vfs_root is updated to NamespaceSet.mount_ns: Arc<MountNamespace>, providing access to the full mount tree rather than just a capability handle to the root VFS node.
- Section 16.1.3 (pivot_root): The step-by-step algorithm there is superseded by the precise Mount-struct-based algorithm in Section 13.2.11.2.
- Section 16.1.5 (Namespace Inheritance): CLONE_NEWNS triggers copy_tree() (Section 13.2.11.1).
13.3 Distribution-Aware VFS Extensions
When filesystems are shared across cluster nodes (Section 14.5), the VFS must handle cache validity, locking granularity, and metadata coherence across node boundaries. Linux's VFS was designed for local filesystems with network filesystem support bolted on afterward, resulting in several systemic performance problems. UmkaOS's VFS addresses these by integrating with the Distributed Lock Manager (Section 14.6).
| Linux Problem | Impact | UmkaOS Fix |
|---|---|---|
| Dentry cache assumes local validity | Remote rename/unlink leaves stale dentries on other nodes | Callback-based invalidation: DLM lock downgrade (Section 14.6.8) triggers targeted dentry invalidation for affected directory entries only |
| d_revalidate() on every lookup for network FS | Extra round-trip per path component on NFS/CIFS/GFS2 | Lease-attached dentries: dentry is valid while parent directory DLM lock is held (Section 14.6.6); zero revalidation cost during lease period |
| Inode-level locking forces false sharing | Two nodes writing to different byte ranges of the same file serialize on the inode lock | Range locks in VFS: DLM byte-range lock resources (Section 14.6.4) allow concurrent operations on different ranges of the same file |
| No concurrent directory operations | mkdir and create in the same directory serialize globally | Per-bucket directory locks: hash-based directory formats (ext4 htree, GFS2 leaf blocks) use separate DLM resources per hash bucket |
| readdir() + stat() = 2N round-trips for N files | ls -l on a 1000-file remote directory requires 2001 operations | getdents_plus() returning attributes with directory entries (analogous to NFS READDIRPLUS but in-kernel, avoiding the userspace/kernel boundary per entry). getdents_plus() is an UmkaOS VFS-internal operation (not a new syscall): the VFS's readdir implementation populates both the directory entry and its InodeAttr in a single filesystem callback, caching the attributes for immediate use by a subsequent getattr() / stat() call. Userspace accesses this via the standard getdents64(2) + statx(2) syscalls — the optimization is transparent, eliminating redundant disk or DLM round-trips inside the kernel. |
| Full inode cache invalidation on lock drop | Dropping a DLM lock on an inode discards all cached metadata, even fields that haven't changed | Per-field inode validity: mtime/size read from DLM Lock Value Block (Section 14.6.3); permissions and ownership from local capability cache; only stale fields refreshed on lock reacquire |
Integration with Section 14.6 DLM:
- Dentry lease binding: When the VFS caches a dentry for a clustered filesystem, it records the DLM lock resource that protects the parent directory. The dentry remains valid as long as that lock is held at CR (Concurrent Read) mode or stronger. When the DLM downgrades or releases the lock (due to contention from another node), the VFS receives a callback and invalidates only the affected dentries — not the entire dentry subtree.
- Range-aware writeback: When a process holds a DLM byte-range lock and writes to pages within that range, the VFS tracks dirty pages per lock range (not per inode). On lock downgrade, only dirty pages within the lock's range are flushed (Section 14.6.8). This eliminates the Linux problem where dropping a lock on a 100 GB file requires flushing all dirty pages, even if only 4 KB was modified.
- Attribute caching via LVB: The VFS reads frequently-accessed inode attributes (i_size, i_mtime, i_blocks) from the DLM Lock Value Block (Section 14.6.3) rather than performing a disk read on every lock acquire. The LVB is updated by the last writer on lock release, so readers always get current values at the cost of a single RDMA operation (~3-4 μs) instead of a disk I/O (~10-15 μs for NVMe).
13.4 overlayfs: Union Filesystem for Containers
Use case: Container image layering. Docker, containerd, Podman, and Kubernetes all use overlayfs as their primary storage driver. A container image is a stack of read-only filesystem layers; overlayfs merges them with a writable upper layer to present a unified view. Without overlayfs, container runtimes fall back to copy-the-entire-layer approaches (VFS copy, naive snapshots), which are orders of magnitude slower for image pull and container startup.
Tier: Tier 1 (runs in the VFS isolation domain alongside umka-vfs).
Rationale for Tier 1 (not Tier 2): overlayfs is a stacking filesystem — it sits between the VFS and the underlying filesystem drivers (ext4, XFS, btrfs, tmpfs). Every path lookup, readdir, and file open in a container traverses overlayfs. Placing it in Tier 2 (Ring 3, process boundary) would add two domain crossings per VFS operation inside every container, roughly doubling the path resolution overhead. Since overlayfs delegates all storage I/O to the underlying filesystem (which is itself a Tier 1 driver), overlayfs never touches hardware directly — it is a pure VFS client. Its code complexity is moderate (~3,000 SLOC in Linux) and auditable. The crash containment boundary is the VFS domain: if overlayfs panics, the VFS recovery protocol (Section 13.1) handles it.
Design: overlayfs implements FileSystemOps, InodeOps, FileOps, and
DentryOps from the VFS trait system (Section 13.1.1). It does not introduce new
VFS abstractions — it composes existing ones.
13.4.1 Mount Options and Configuration
/// Mount options parsed from the `data` parameter of `FileSystemOps::mount()`.
/// Encoded as comma-separated key=value pairs in the `data: &[u8]` slice,
/// matching Linux's overlayfs mount option syntax exactly.
///
/// Example mount command:
/// ```text
/// mount -t overlay overlay \
/// -o lowerdir=/lower2:/lower1,upperdir=/upper,workdir=/work \
/// /merged
/// ```
///
/// For read-only overlays (no upperdir/workdir), only lowerdir is required.
/// This is used for container image inspection without a writable layer.
pub struct OverlayMountOptions {
/// Colon-separated list of lower layer paths, ordered from topmost to
/// bottommost. At least one lower layer is required. Maximum 500 layers
/// (matching Linux's limit, which Docker/containerd never approach —
/// typical images have 5-20 layers).
///
/// Each path must be an existing directory on a mounted filesystem.
/// The VFS resolves each path to an `InodeId` at mount time and holds
/// a reference to the underlying superblock for the mount's lifetime.
///
/// Heap-allocated rather than inline (`ArrayVec<_, 500>` would be up to
/// 4000 bytes on the stack). The 500-layer maximum is enforced at mount
/// validation time. Mount processing is a rare, non-hot-path operation
/// where heap allocation is acceptable.
pub lower_dirs: Box<[InodeId]>,
/// Upper layer directory (read-write). `None` for read-only overlays.
/// Must reside on a filesystem that supports: xattr (for whiteouts and
/// metacopy markers), rename with RENAME_WHITEOUT, and mknod (for
/// character-device whiteouts). The upper filesystem must be writable.
pub upper_dir: Option<InodeId>,
/// Work directory for atomic copy-up staging. Required if `upper_dir`
/// is set. Must be on the **same filesystem** as `upper_dir` (same
/// superblock) — copy-up uses rename(2) from workdir to upperdir,
/// which requires same-device semantics. The VFS verifies this at
/// mount time by comparing `SuperBlock` identity.
///
/// The workdir must be empty at mount time. overlayfs creates a `work/`
/// subdirectory inside it for staging, and an `index/` subdirectory
/// for NFS export handles (if enabled).
pub work_dir: Option<InodeId>,
/// Enable metadata-only copy-up. When true, operations that modify
/// only metadata (chmod, chown, utimes, setxattr) copy only the
/// inode metadata to the upper layer, deferring data copy until the
/// first write. Dramatically reduces container startup I/O: a
/// `chmod` on a 200 MB binary copies ~4 KB of metadata instead of
/// 200 MB of data.
///
/// Default: true (matches Docker/containerd default since Linux 5.11+
/// with kernel config `OVERLAY_FS_METACOPY=y`).
///
/// Security restriction: this option is silently forced to `false`
/// when the mount is user-namespace-influenced (i.e., when the caller
/// does not hold `CAP_SYS_ADMIN` in the initial user namespace). In
/// such mounts the upper layer uses `user.overlay.*` xattrs, which
/// are writable by the file owner without privilege; a forged
/// metacopy xattr could redirect reads to arbitrary lower-layer files.
/// See [Section 13.4.6.1](#13461-metacopy-trust-model-and-security-constraints)
/// for the complete trust model and enforcement mechanism.
pub metacopy: bool,
/// Directory rename/redirect handling.
///
/// - `On`: Enable redirect xattrs for directory renames. Required
/// for rename(2) on merged directories to succeed (without this,
/// rename of a directory that exists in a lower layer returns EXDEV).
/// - `Follow`: Follow existing redirect xattrs but do not create new
/// ones. Safe for mounting layers created by a trusted system.
/// - `NoFollow`: Ignore redirect xattrs entirely. Most restrictive.
/// - `Off`: Disable redirect handling; directory renames return EXDEV.
///
/// Default: `On` (required by Docker/containerd for correct semantics).
pub redirect_dir: RedirectDirMode,
/// Volatile mode. When enabled, overlayfs skips all fsync/sync_fs calls
/// to the upper filesystem. A crash or power loss may leave the upper
/// layer in an inconsistent state (workdir staging artifacts, partial
/// copy-ups). The overlay refuses to remount if it detects a previous
/// volatile session that was not cleanly unmounted.
///
/// Docker uses volatile mode for ephemeral containers where persistence
/// is not needed (CI runners, build containers, test environments).
///
/// Default: false.
pub volatile: bool,
/// Use `user.overlay.*` xattr namespace instead of `trusted.overlay.*`.
/// Required for unprivileged (rootless) overlayfs mounts where the
/// calling process lacks CAP_SYS_ADMIN in the initial user namespace.
/// The `user.*` xattr namespace is writable by the file owner without
/// special capabilities.
///
/// Default: false (use `trusted.overlay.*`).
pub userxattr: bool,
/// Extended inode number mode. Controls how overlayfs composes inode
/// numbers to guarantee uniqueness across layers.
///
/// - `On`: Compose inode numbers using upper bits for layer index.
/// Requires underlying filesystems to use <32-bit inode numbers
/// (ext4, XFS with `inode32` mount option).
/// - `Off`: Use raw underlying inode numbers. Risk of collisions
/// across layers (two files on different layers may share an ino).
/// - `Auto`: Enable if all underlying filesystems have small enough
/// inode numbers; disable otherwise.
///
/// Default: `Auto`.
pub xino: XinoMode,
/// NFS export support. When enabled, overlayfs maintains an index
/// directory (inside workdir) that maps NFS file handles to overlay
/// dentries. Required if the overlay mount will be exported via NFS.
///
/// Default: false (NFS export of container filesystems is uncommon).
pub nfs_export: bool,
/// fs-verity digest validation for lower layer files. When enabled,
/// overlayfs verifies that lower-layer files have valid fs-verity
/// digests matching the expected values stored in the upper layer's
/// metacopy xattr. Provides content integrity for container image
/// layers without requiring dm-verity on the entire block device.
///
/// - `Off`: No verity checking.
/// - `On`: Verify if digest is present; allow files without digest.
/// - `Require`: Reject files that lack a valid fs-verity digest.
///
/// Default: `Off`.
pub verity: VerityMode,
}
/// Redirect directory mode.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum RedirectDirMode {
/// Create and follow redirect xattrs.
On,
/// Follow existing redirect xattrs but do not create new ones.
Follow,
/// Do not follow redirect xattrs.
NoFollow,
/// Disable redirect handling; directory renames return EXDEV.
Off,
}
/// Extended inode number composition mode.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum XinoMode {
/// Always compose inode numbers.
On,
/// Never compose inode numbers.
Off,
/// Compose if underlying inode numbers fit.
Auto,
}
/// fs-verity enforcement mode for lower layer files.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum VerityMode {
/// No verity checking.
Off,
/// Verify if digest present; allow files without digest.
On,
/// Reject lower files without valid fs-verity digest.
Require,
}
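A plausible sketch of inode number composition under XinoMode::On, assuming a fixed 32/32 bit split: the layer index in the high bits, the raw inode number in the low 32 bits. (Linux sizes the high-bit field dynamically based on the number of layers; the fixed split here is a simplification for illustration.)

```rust
// Illustrative xino composition: layer index in the top 32 bits, raw inode
// number in the low 32 bits. Returns None when the underlying inode number
// does not fit, which is the condition under which xino=auto disables
// composition for the mount.
fn xino_compose(layer_index: u16, raw_ino: u64) -> Option<u64> {
    if raw_ino >> 32 != 0 {
        return None; // underlying ino too large for the low-bit field
    }
    Some(((layer_index as u64) << 32) | raw_ino)
}
```

The property this buys: two files with the same raw inode number on different layers receive distinct overlay inode numbers, so userspace tools that key on (st_dev, st_ino) never see collisions.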
13.4.2 Core Data Structures
/// Overlay filesystem superblock state. One instance per overlay mount.
/// Created by `OverlayFs::mount()` and stored in the `SuperBlock`'s
/// filesystem-private field.
pub struct OverlaySuperBlock {
/// Lower layer inodes (topmost first). Index 0 is the highest-priority
/// lower layer (searched first after upper). These are directory inodes
/// on the underlying filesystems, held for the mount's lifetime.
///
/// Heap-allocated rather than inline (`ArrayVec<_, 500>` would exceed
/// the safe stack frame budget — each `OverlayLayer` contains an
/// `InodeId`, a `SuperBlock` reference, and a `u16` index). The
/// 500-layer maximum is enforced at mount validation time. Mount
/// processing is a rare, non-hot-path operation where heap allocation
/// is acceptable.
pub lower_layers: Box<[OverlayLayer]>,
/// Upper layer state. `None` for read-only overlay mounts.
pub upper_layer: Option<OverlayLayer>,
/// Work directory inode on the upper filesystem. Used as a staging
/// area for atomic copy-up operations.
pub work_dir: Option<InodeId>,
/// Index directory inode (inside workdir). Used for NFS export file
/// handle resolution and hard link tracking across copy-up.
pub index_dir: Option<InodeId>,
/// Parsed mount options (immutable after mount).
pub config: OverlayMountOptions,
/// The xattr prefix used for overlay-private xattrs. Either
/// `"trusted.overlay."` (privileged) or `"user.overlay."` (userxattr
/// mode). Stored once to avoid branching on every xattr operation.
pub xattr_prefix: &'static [u8],
/// Volatile session marker. If volatile mode is enabled, this is set
/// to true after creating the `$workdir/work/incompat/volatile`
/// sentinel directory. On mount, if the sentinel exists from a
/// previous unclean session, mount fails with EINVAL.
pub volatile_active: bool,
/// True if this overlay was mounted from within a user namespace or
/// if the upper layer's filesystem mount is owned by a non-initial
/// user namespace. When true, `metacopy` and `redirect_dir=on` are
/// disabled regardless of mount options, `userxattr` mode is
/// mandatory, and data-only lower layers are rejected.
///
/// Set once at `OverlayFs::mount()` time by checking whether the
/// calling process's user namespace is the initial user namespace
/// (`current_user_ns() == &init_user_ns`). Immutable thereafter.
///
/// See Section 13.4.6.1 for the full security model.
pub userns_influenced: bool,
}
/// A single layer in the overlay stack.
pub struct OverlayLayer {
/// Root directory inode of this layer on its underlying filesystem.
pub root: InodeId,
/// Superblock of the underlying filesystem. Held as a reference
/// for the overlay mount's lifetime.
pub sb: SuperBlock,
/// Layer index (0 = upper or topmost lower; increases downward).
/// Used for xino composition and for identifying which layer an
/// overlay inode's data resides on.
pub index: u16,
}
/// Atomic optional value using a sentinel for the `None` state.
/// `InodeId` of 0 represents `None` (inode 0 is never valid in any filesystem).
/// Provides lock-free read access via `Acquire` load and one-time write
/// via `compare_exchange` (for copy-up transitions from None -> Some).
pub struct AtomicOption<T: Into<u64> + From<u64>> {
    value: AtomicU64, // 0 = None, non-zero = Some(T)
    /// `T` appears only in the API, not the storage, so a `PhantomData`
    /// marker is required for the type parameter to be well-formed.
    _marker: PhantomData<T>,
}
impl AtomicOption<InodeId> {
    pub fn none() -> Self { Self { value: AtomicU64::new(0), _marker: PhantomData } }
pub fn load(&self) -> Option<InodeId> {
match self.value.load(Ordering::Acquire) {
0 => None,
v => Some(InodeId(v)),
}
}
/// Atomically transition from None to Some. Returns Err if already set.
pub fn set_once(&self, val: InodeId) -> Result<(), InodeId> {
self.value.compare_exchange(0, val.0, Ordering::AcqRel, Ordering::Acquire)
.map(|_| ())
.map_err(|v| InodeId(v))
}
}
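A standalone, compilable version of the same idea (with a local InodeId stand-in) demonstrates the one-shot None -> Some transition that copy-up relies on: the first writer wins, and a losing writer observes the winner's value.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Self-contained sketch of AtomicOption. `InodeId` here is a local stand-in
// for the VFS type; inode 0 is reserved as the None sentinel.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct InodeId(pub u64);

pub struct AtomicOption {
    value: AtomicU64, // 0 = None, non-zero = Some(InodeId)
}

impl AtomicOption {
    pub fn none() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    /// Lock-free read: Acquire pairs with the Release in `set_once`.
    pub fn load(&self) -> Option<InodeId> {
        match self.value.load(Ordering::Acquire) {
            0 => None,
            v => Some(InodeId(v)),
        }
    }

    /// Atomically transition from None to Some exactly once.
    /// On failure, Err carries the value the winning writer installed.
    pub fn set_once(&self, val: InodeId) -> Result<(), InodeId> {
        self.value
            .compare_exchange(0, val.0, Ordering::AcqRel, Ordering::Acquire)
            .map(|_| ())
            .map_err(InodeId)
    }
}
```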
/// Per-inode overlay state. Tracks which layers contribute to a merged
/// view of this inode.
///
/// An `OverlayInode` is created on first lookup and cached in the VFS
/// inode cache. It is the filesystem-private data attached to the VFS
/// inode via `InodeId`.
pub struct OverlayInode {
/// Inode in the upper layer. `Some` if the entry exists in upper
/// (either originally or after copy-up). `None` if the entry exists
/// only in lower layers.
///
/// Protected by `copy_up_lock`: transitions from `None` to `Some`
/// exactly once during copy-up. Once set, never changes back.
/// Reads after copy-up are lock-free (Acquire load on the Option
/// discriminant).
pub upper: AtomicOption<InodeId>,
/// Inode in the topmost lower layer that contains this entry.
/// `None` if the entry exists only in upper (newly created file).
pub lower: Option<LowerInodeRef>,
/// True if this inode is a metacopy-only upper entry (metadata
/// copied, data still in lower layer). Cleared to false after full
/// data copy-up completes.
pub metacopy: AtomicBool,
/// True if this is an opaque directory. An opaque directory hides
/// all entries from lower layers — readdir and lookup do not
/// descend into lower layers below this point.
pub opaque: bool,
/// Redirect path for directory renames. When a merged directory is
/// renamed in the upper layer, this field stores the original lower
/// path so that lookups can find the renamed directory's lower
/// contents. `None` for non-redirected entries.
pub redirect: Option<Box<OsStr>>,
/// Lock serializing copy-up operations on this inode. Only one
/// thread may copy-up a given inode at a time. Other threads
/// attempting to modify the same lower-layer file block on this
/// lock until copy-up completes, then proceed against the upper copy.
///
/// This is a `Mutex`, not an `RwLock`, because copy-up is an
/// exclusive state transition (None -> Some). Read paths check
/// `upper` with an Acquire load and only take the lock if they
/// need to trigger copy-up.
pub copy_up_lock: Mutex<()>,
/// Overlay inode type. Needed because the overlay may present a
/// different view than the underlying filesystem (e.g., a whiteout
/// character device appears as "entry does not exist").
pub inode_type: OverlayInodeType,
}
/// Reference to a lower-layer inode.
pub struct LowerInodeRef {
/// Inode ID on the lower layer's filesystem.
pub inode: InodeId,
/// Which lower layer this inode resides on (index into
/// `OverlaySuperBlock::lower_layers`).
pub layer_index: u16,
}
/// Overlay inode type classification.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum OverlayInodeType {
/// Regular file (may be metacopy).
Regular,
/// Directory (may be merged or opaque).
Directory,
/// Symbolic link.
Symlink,
/// Character device, block device, FIFO, or socket.
Special,
/// Whiteout entry (exists in upper layer to mark deletion of a
/// lower-layer entry). Not visible to userspace — lookups return
/// ENOENT. Internally represented as either a character device
/// with major:minor 0:0 or a zero-size file with the
/// `trusted.overlay.whiteout` xattr.
Whiteout,
}
13.4.3 Overlay Dentry Operations
overlayfs requires custom DentryOps to handle the dynamic nature of the
merged filesystem view. Copy-up changes which layer serves a file, so cached
dentries must be revalidated.
/// overlayfs dentry operations.
impl DentryOps for OverlayDentryOps {
/// Revalidate a cached overlay dentry.
///
/// Returns `false` (forcing re-lookup) in these cases:
/// 1. The overlay inode has been copied up since the dentry was cached
/// (detected by checking if `OverlayInode::upper` transitioned from
/// None to Some since the last lookup).
/// 2. The underlying filesystem's dentry has been invalidated (delegates
/// to the underlying filesystem's `d_revalidate` if it implements one,
/// e.g., for NFS lower layers).
/// 3. A whiteout has been created or removed in the upper layer for this
/// name (detected by checking upper-layer lookup result against cached
/// overlay state).
///
/// Returns `true` (dentry is still valid) in all other cases.
fn d_revalidate(&self, parent: InodeId, name: &OsStr) -> Result<bool>;
/// Overlay dentries use the default VFS hash (SipHash-1-3).
fn d_hash(&self, _name: &OsStr) -> Option<u64> {
None
}
/// Overlay dentries are always eligible for LRU caching.
fn d_delete(&self, _inode: InodeId, _name: &OsStr) -> bool {
true
}
/// On dentry release, drop the overlay inode's references to
/// underlying filesystem inodes.
fn d_release(&self, inode: InodeId, name: &OsStr);
}
Dentry cache interaction: When a copy-up occurs, the overlay must invalidate the affected dentry in the VFS dentry cache (Section 13.1.2) so that subsequent lookups see the upper-layer inode instead of the stale lower-layer reference. The invalidation sequence:
1. Copy-up completes (new file exists in upper layer).
2. OverlayInode::upper is set via an atomic Release store.
3. The overlay calls d_invalidate() on the parent directory's dentry for the affected name. This removes the dentry from the hash table and marks it for re-lookup.
4. The next lookup for this name calls OverlayInodeOps::lookup(), which now finds the upper-layer entry and returns the updated OverlayInode.
Negative dentry handling: Negative dentries (cached ENOENT results) in the overlay dentry cache are invalidated when:
- A new file is created in the upper layer (the negative dentry for that name must be purged).
- A whiteout is removed (the previously-hidden lower-layer entry becomes visible again).
13.4.4 Lookup Algorithm
overlayfs lookup implements the layer search order:
OverlayInodeOps::lookup(parent: InodeId, name: &OsStr) -> Result<InodeId>:
let overlay_parent = get_overlay_inode(parent)
// Step 1: Search upper layer (if writable overlay).
if let Some(upper_dir) = overlay_parent.upper {
match underlying_lookup(upper_dir, name) {
Ok(upper_inode) => {
// Check if this is a whiteout.
if is_whiteout(upper_inode) {
// Entry was deleted. Do NOT search lower layers.
// Cache a negative dentry.
return Err(ENOENT)
}
// Check if this is an opaque directory.
let opaque = is_opaque_dir(upper_inode)
// Found in upper. If directory and not opaque, may need
// to merge with lower layers.
if is_directory(upper_inode) && !opaque {
// Merged directory: upper exists, also search lower
// for the merge view.
let lower = find_in_lower_layers(overlay_parent, name)
return create_overlay_inode(Some(upper_inode), lower, ...)
}
// Non-directory or opaque directory: upper is authoritative.
return create_overlay_inode(Some(upper_inode), None, ...)
}
Err(ENOENT) => {
// Not in upper, fall through to lower layers.
}
Err(e) => return Err(e), // Propagate I/O errors.
}
}
// Step 2: Search lower layers (topmost first).
// If parent directory has a redirect, follow it.
for (layer_idx, lower_layer) in lower_layers_for(overlay_parent) {
match underlying_lookup(lower_dir_at(lower_layer, overlay_parent), name) {
Ok(lower_inode) => {
if is_whiteout(lower_inode) {
// Whiteout in this lower layer. Stop searching.
return Err(ENOENT)
}
return create_overlay_inode(None, Some(LowerInodeRef {
inode: lower_inode,
layer_index: layer_idx,
}), ...)
}
Err(ENOENT) => continue, // Try next lower layer.
Err(e) => return Err(e),
}
}
// Not found in any layer.
Err(ENOENT)
Whiteout detection: An upper-layer entry is a whiteout if either:
- It is a character device with major:minor 0:0 (traditional format), OR
- It is a zero-size regular file with the trusted.overlay.whiteout (or
user.overlay.whiteout in userxattr mode) xattr set.
Both formats are supported for compatibility with existing container images.
UmkaOS creates whiteouts using the xattr format by default (avoids requiring
mknod capability for character device creation in unprivileged containers).
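The two encodings above can be sketched as a predicate over a simplified stat/xattr view. `UpperStat` and its fields are illustrative stand-ins for the real getattr/getxattr results:

```rust
// Simplified view of an upper-layer entry's attributes, sufficient to
// decide whiteout status. Field names are illustrative.
struct UpperStat<'a> {
    is_chardev: bool,
    rdev_major: u32,
    rdev_minor: u32,
    is_regular: bool,
    size: u64,
    xattrs: &'a [&'a str], // xattr names present on the entry
}

fn is_whiteout(st: &UpperStat, userxattr: bool) -> bool {
    // Traditional format: character device with major:minor 0:0.
    if st.is_chardev && st.rdev_major == 0 && st.rdev_minor == 0 {
        return true;
    }
    // xattr format: zero-size regular file carrying the whiteout marker
    // in whichever xattr namespace the mount uses.
    let marker = if userxattr { "user.overlay.whiteout" } else { "trusted.overlay.whiteout" };
    st.is_regular && st.size == 0 && st.xattrs.contains(&marker)
}
```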
Opaque directory detection: A directory is opaque if it has the xattr
trusted.overlay.opaque (or user.overlay.opaque) set to "y". An opaque
directory hides all entries from lower layers — lookups do not descend past it.
This is used when an entire directory is deleted and recreated in the upper layer.
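The dual-format whiteout check above can be sketched as a predicate over a simplified inode view. The `InodeMeta` struct here is illustrative (an assumption, not the real VFS inode type), and `prefix` is the overlay's xattr prefix ("trusted.overlay." or "user.overlay." in userxattr mode):

```rust
/// Simplified view of an underlying inode, for illustration only.
#[derive(Clone)]
pub struct InodeMeta {
    pub is_char_dev: bool,
    pub rdev_major: u32,
    pub rdev_minor: u32,
    pub is_regular: bool,
    pub size: u64,
    /// Names of xattrs present on the inode.
    pub xattrs: Vec<String>,
}

/// An entry is a whiteout if it matches either supported format.
pub fn is_whiteout(meta: &InodeMeta, prefix: &str) -> bool {
    // Traditional format: character device with major:minor 0:0.
    if meta.is_char_dev && meta.rdev_major == 0 && meta.rdev_minor == 0 {
        return true;
    }
    // Xattr format: zero-size regular file carrying the whiteout xattr.
    let name = format!("{}whiteout", prefix);
    meta.is_regular && meta.size == 0 && meta.xattrs.iter().any(|x| x == &name)
}
```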
13.4.5 Copy-Up Protocol
Copy-up is the central operation of overlayfs. When a lower-layer file must be modified, its contents (and/or metadata) are first copied to the upper layer. The copy-up must be atomic from the perspective of concurrent readers: at no point should a reader see a partially-copied file.
Full copy-up algorithm (for regular files when metacopy is disabled, or on first write to a metacopy-only file):
copy_up(overlay_inode: &OverlayInode) -> Result<InodeId>:
// Fast path: already copied up.
if let Some(upper) = overlay_inode.upper.load(Acquire) {
if !overlay_inode.metacopy.load(Acquire) {
return Ok(upper) // Fully copied up already.
}
// Metacopy exists but needs full data copy. Fall through.
}
// Slow path: take copy-up lock.
let _guard = overlay_inode.copy_up_lock.lock()
// Double-check after acquiring lock (another thread may have completed
// copy-up while we waited).
if let Some(upper) = overlay_inode.upper.load(Acquire) {
if !overlay_inode.metacopy.load(Acquire) {
return Ok(upper)
}
}
let lower = overlay_inode.lower.as_ref().expect("copy-up requires lower");
let sb = overlay_super_block()
// Step 1: Ensure parent directory exists in upper layer.
// Recursively copy-up parent directories if needed.
let upper_parent = ensure_upper_parent(overlay_inode)
// Step 2: Create temporary file in workdir (same filesystem as upper).
// The workdir is on the same device as upperdir, enabling atomic rename.
let tmp_name = generate_temp_name() // e.g., "#overlay.XXXXXXXX"
let tmp_inode = underlying_create(sb.work_dir, tmp_name, lower_mode)
// Step 3: Copy metadata from lower to tmp.
let lower_attr = underlying_getattr(lower.inode)
underlying_setattr(tmp_inode, &lower_attr) // owner, mode, timestamps
// Step 4: Copy xattrs from lower to tmp.
// Filter out overlay-private xattrs (trusted.overlay.*).
copy_xattrs_filtered(lower.inode, tmp_inode, sb.xattr_prefix)
// Step 5: Copy file data (skip if metacopy mode and this is a
// metadata-only copy-up triggered by chmod/chown/utimes).
if !metacopy_only {
copy_file_data(lower.inode, tmp_inode)
// Uses splice/sendfile internally for zero-copy where possible.
// Falls back to read+write for filesystems that don't support splice.
} else {
// Set metacopy xattr on the tmp file. This marks it as containing
// metadata only — data will be copied on first write.
underlying_setxattr(tmp_inode,
concat(sb.xattr_prefix, "metacopy"), b"", 0)
// If the lower file is itself a metacopy (nested overlay), follow
// the redirect chain to find the actual data source.
if let Some(origin) = get_metacopy_origin(lower.inode) {
underlying_setxattr(tmp_inode,
concat(sb.xattr_prefix, "origin"), &encode_fh(origin), 0)
}
}
// Step 6: Set security context on tmp file.
// Copy security.* xattrs that the security framework requires.
// Step 7: Atomic rename from workdir to upperdir.
// This is the commit point. Before this rename, the copy-up is invisible
// to other processes. After this rename, the upper-layer file is live.
underlying_rename(sb.work_dir, tmp_name, upper_parent, target_name,
RenameFlags::RENAME_NOREPLACE)
// Step 8: Update overlay inode state.
let upper_inode = underlying_lookup(upper_parent, target_name)
overlay_inode.upper.store(Some(upper_inode), Release)
if metacopy_only {
overlay_inode.metacopy.store(true, Release)
}
// Step 9: Invalidate the dentry cache entry for this name.
// Forces subsequent lookups to see the upper-layer version.
d_invalidate(upper_parent, target_name)
Ok(upper_inode)
Atomicity guarantee: The rename in Step 7 is the single atomic commit point.
If the system crashes before Step 7, the temporary file in workdir is orphaned and
cleaned up on next mount (overlayfs scans workdir for stale temporaries during
mount() and removes them). If the system crashes after Step 7, the upper-layer
file is complete and consistent.
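The mount-time cleanup of orphaned temporaries can be sketched as a filter over workdir entry names. The toy string listing is an assumption; the real scan walks workdir via the underlying filesystem's readdir and unlinks each match:

```rust
/// Return the workdir entries that are stale copy-up temporaries,
/// identified by the "#overlay." temp-name prefix used during copy-up.
pub fn stale_temporaries<'a>(workdir_entries: &'a [&'a str]) -> Vec<&'a str> {
    workdir_entries
        .iter()
        .filter(|name| name.starts_with("#overlay."))
        .copied()
        .collect()
}
```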
Error recovery (runtime failures): Each step that can fail must clean up all prior steps before returning an error to the caller. The protocol:
| Step that fails | Cleanup required | Returned error |
|---|---|---|
| `underlying_create()` (Step 2) | None (nothing created yet) | EIO / ENOSPC |
| `underlying_setattr()` (Step 3) | Unlink `tmp_name` from workdir | EIO |
| `copy_file_data()` (Step 5) | Unlink `tmp_name` from workdir | EIO / ENOSPC |
| `underlying_rename()` (Step 7) | Unlink `tmp_name` from workdir | EIO |
| `overlay_inode.upper.store()` (Step 8) | Rename committed; upper file is live. Do NOT clean up; return EIO only if the store itself fails (hardware error). The upper file is kept and will be found on retry. | EIO (rare) |
If cleanup of the temporary file itself fails (i.e., underlying_unlink() returns
an error during recovery), the orphaned temporary is left in workdir and will be
removed by the next mount() scan. The original copy-up failure is still returned
to the caller as an error. The orphaned file does not affect correctness because
the rename (Step 7) did not complete.
Parent directory copy-up: Directories are copied up recursively. When copying
up /a/b/c/file.txt, if /a/b/c/ does not exist in upper, overlayfs creates
/a/, then /a/b/, then /a/b/c/ in upper (each with appropriate metadata and
the trusted.overlay.origin xattr pointing to the lower original). Only then does
the file copy-up proceed. Each directory copy-up is itself atomic (created in
workdir, renamed to upper).
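The top-down recursion can be sketched over a toy in-memory model. Here `upper` stands in for the set of directory paths already present in the upper layer, and each insertion stands in for a create-in-workdir plus atomic rename (the real implementation also copies metadata and sets the origin xattr):

```rust
use std::collections::BTreeSet;

/// Ensure every ancestor directory of `file_path` exists in `upper`,
/// creating missing ones top-down. Returns the paths created, in order.
pub fn ensure_upper_parents(upper: &mut BTreeSet<String>, file_path: &str) -> Vec<String> {
    let mut created = Vec::new();
    let parts: Vec<&str> = file_path.trim_matches('/').split('/').collect();
    let mut cur = String::new();
    // Walk every ancestor of the file (excluding the file itself):
    // /a, then /a/b, then /a/b/c.
    for dir in &parts[..parts.len().saturating_sub(1)] {
        cur.push('/');
        cur.push_str(dir);
        if !upper.contains(&cur) {
            upper.insert(cur.clone()); // stand-in for workdir create + rename
            created.push(cur.clone());
        }
    }
    created
}
```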
Hard link handling on copy-up: If a lower-layer file has multiple hard links
(nlink > 1), all names referencing the same lower inode must resolve to the
same upper inode after copy-up. The overlay maintains an index directory
(inside workdir) that maps lower file handles to upper inodes. On copy-up,
the overlay checks the index first:
- If an index entry exists, the file was already copied up via another name.
  Create a hard link in upper rather than copying data again.
- If no index entry exists, perform a full copy-up and record the mapping.
This index is also used for NFS export (mapping file handles across copy-up).
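The index-first decision can be sketched with toy `FileHandle` and `InodeId` newtypes (hypothetical names; the real index lives as a directory of encoded file handles inside workdir):

```rust
use std::collections::HashMap;

#[derive(PartialEq, Eq, Hash, Clone, Copy)]
pub struct FileHandle(pub u64);
#[derive(PartialEq, Eq, Clone, Copy, Debug)]
pub struct InodeId(pub u64);

pub enum CopyUpAction {
    /// Already copied up via another name: hard-link to this upper inode.
    LinkTo(InodeId),
    /// First copy-up of this lower inode: copy data, then record mapping.
    FullCopy,
}

/// Maps lower-layer file handles to their upper-layer inodes.
pub struct LinkIndex {
    map: HashMap<FileHandle, InodeId>,
}

impl LinkIndex {
    pub fn new() -> Self { Self { map: HashMap::new() } }

    /// Consulted before copying data for a multi-link lower file.
    pub fn check(&self, lower: FileHandle) -> CopyUpAction {
        match self.map.get(&lower) {
            Some(&upper) => CopyUpAction::LinkTo(upper),
            None => CopyUpAction::FullCopy,
        }
    }

    /// Recorded after a full copy-up completes.
    pub fn record(&mut self, lower: FileHandle, upper: InodeId) {
        self.map.insert(lower, upper);
    }
}
```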
13.4.6 Metacopy Mode
Metacopy is the performance-critical optimization for container startup. Without metacopy, any metadata operation (chmod, chown, utimes) on a lower-layer file triggers a full data copy. With metacopy enabled, only metadata is copied, and data copy is deferred until the file is opened for writing.
Metacopy lifecycle:
State transitions for a file in metacopy mode:
[Lower-only]
│
│ chmod/chown/utimes/setxattr
▼
[Metacopy in upper] ← metadata copied, data in lower
│ upper has trusted.overlay.metacopy xattr
│ open(O_WRONLY/O_RDWR) or truncate
▼
[Full copy-up] ← data + metadata in upper
trusted.overlay.metacopy xattr removed
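The transitions above can be expressed as a pure function over a simplified three-state enum (the names here are illustrative, not the actual overlay inode fields):

```rust
#[derive(PartialEq, Eq, Debug, Clone, Copy)]
pub enum CopyState {
    LowerOnly,
    Metacopy,  // metadata in upper, data still served from lower
    FullUpper, // data + metadata in upper
}

#[derive(Clone, Copy)]
pub enum FileOp {
    MetadataChange, // chmod/chown/utimes/setxattr
    OpenWrite,      // open(O_WRONLY/O_RDWR) or truncate
    OpenRead,       // open(O_RDONLY)
}

pub fn next_state(state: CopyState, op: FileOp) -> CopyState {
    match (state, op) {
        // Metadata change on a lower-only file copies metadata only.
        (CopyState::LowerOnly, FileOp::MetadataChange) => CopyState::Metacopy,
        // Any write-intent open forces the full data copy-up.
        (CopyState::LowerOnly, FileOp::OpenWrite)
        | (CopyState::Metacopy, FileOp::OpenWrite) => CopyState::FullUpper,
        // Reads never change state; further metadata changes are absorbed.
        (s, _) => s,
    }
}
```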
Read path for metacopy files: When a metacopy file is opened for reading
(O_RDONLY), data is served from the lower layer. The OverlayFileOps::read()
implementation checks overlay_inode.metacopy and dispatches to the lower-layer
FileOps::read() with the lower inode. No data copy occurs.
Write trigger: When a metacopy file is opened for writing (O_WRONLY,
O_RDWR) or truncated, the overlay triggers a full data copy-up before allowing
the write:
impl FileOps for OverlayFileOps {
fn open(&self, inode: InodeId, flags: OpenFlags) -> Result<u64> {
let oi = get_overlay_inode(inode);
// If opening for write and file is metacopy-only, trigger
// full data copy-up before returning the fd.
if flags.is_writable() && oi.metacopy.load(Acquire) {
copy_up_data(oi)?;
// copy_up_data() copies file data from lower to upper,
// removes the metacopy xattr, and clears oi.metacopy.
}
// Delegate open to the appropriate underlying filesystem.
if let Some(upper) = oi.upper.load(Acquire) {
underlying_open(upper, flags)
} else {
// Read-only open on a lower-only file. No copy-up needed.
underlying_open(oi.lower.unwrap().inode, flags)
}
}
}
13.4.6.1 Metacopy Trust Model and Security Constraints
The metacopy mechanism is only safe when the kernel can trust that
trusted.overlay.metacopy (or user.overlay.metacopy in userxattr mode) was
written by the overlay itself during a copy-up, not forged by a process with
write access to the upper layer. If forged, an attacker could create a file
whose upper stub has a redirect xattr pointing to an arbitrary path in a lower
layer, then set the metacopy xattr to tell the kernel to serve lower-layer data
through the stub — exposing files the attacker would not otherwise be able to read
via the overlay's merged view.
Xattr namespace privilege boundary
The trusted. xattr namespace is the primary safeguard. The kernel checks
CAP_SYS_ADMIN via capable() — which verifies the capability against the
initial user namespace — not via ns_capable() (which would accept a user
namespace root). This means:
A process that holds CAP_SYS_ADMIN only within a user namespace (i.e.,
container root mapped to an unprivileged host UID) cannot set or read
trusted.* xattrs on the host filesystem. Only a process with CAP_SYS_ADMIN
in the initial user namespace can write trusted.overlay.* xattrs.
This provides complete protection for overlayfs mounts created in the initial
user namespace: container processes cannot forge trusted.overlay.metacopy or
trusted.overlay.redirect xattrs because they lack the required capability on
the host filesystem.
User-namespace-influenced mounts: the attack surface
Since Linux 5.11, overlayfs can be mounted from within a user namespace
(CAP_SYS_ADMIN in the user namespace that owns the mount namespace suffices to
call mount("overlay", ...)). Such mounts are required to use userxattr mode
(-o userxattr), which substitutes the user.overlay.* xattr namespace for
trusted.overlay.*. Unlike trusted.*, the user.* namespace is writable by
the file owner without any privilege — specifically, the unprivileged host UID
that the container root maps to can set user.overlay.metacopy and
user.overlay.redirect xattrs on files in the upper layer.
A user-namespace-influenced mount is defined as any overlayfs mount where either:
- The overlayfs mount() call was made from within a user namespace (the
  calling process's user namespace is not the initial user namespace), or
- The upper directory's owning user namespace differs from the initial user
  namespace (detected by comparing the user namespace of the mount namespace
  that created the upper directory's filesystem mount against init_user_ns).
Enforcement: metacopy disabled for user-namespace-influenced mounts
UmkaOS enforces the following rule at mount time and at metacopy lookup time:
Mount-time enforcement: When OverlayFs::mount() is called from a process
not in the initial user namespace, the metacopy and redirect_dir options are
forced to off regardless of what the caller requested. The mount proceeds with
these features disabled. The kernel logs:
overlayfs: metacopy and redirect_dir disabled for user-namespace mount (CVE mitigation, Section 13.4.6.1)
This matches Linux's behaviour (since kernel 5.11, user-namespace overlayfs
mounts are restricted to userxattr mode and metacopy is not permitted unless
the caller has CAP_SYS_ADMIN in the initial user namespace).
The OverlaySuperBlock records whether the mount is user-namespace-influenced:
pub struct OverlaySuperBlock {
// ... existing fields ...
/// True if this overlay was mounted from within a user namespace (the
/// mounting process's user namespace is not the initial user namespace)
/// or if the upper layer's filesystem mount is owned by a non-initial
/// user namespace. When true, metacopy and redirect_dir are disabled
/// regardless of mount options, and userxattr mode is mandatory.
///
/// Set once at mount time; immutable thereafter.
pub userns_influenced: bool,
}
Lookup-time enforcement: Even if metacopy is enabled in the mount options,
the metacopy lookup path checks userns_influenced before reading or acting on
any metacopy xattr:
/// Attempt to read a metacopy stub from the given upper-layer dentry.
/// Returns `None` (treat as a regular upper file) if:
/// - The mount is user-namespace-influenced, or
/// - No metacopy xattr is present, or
/// - The xattr value fails validation.
fn ovl_lookup_metacopy(dentry: &Dentry, sb: &OverlaySuperBlock) -> Option<OverlayMetacopy> {
// Never trust metacopy xattrs from user-namespace-influenced mounts.
// The xattr namespace used by such mounts (user.overlay.*) is writable
// by the file owner without privilege, so any metacopy xattr present
// must be treated as potentially forged.
if sb.userns_influenced {
return None;
}
// Read the metacopy xattr from the upper-layer file.
let xattr_name = concat_static(sb.xattr_prefix, "metacopy");
let xattr = dentry.get_xattr(xattr_name)?;
// Validate xattr value. The Linux-compatible format is either empty
// (legacy, no digest) or a 4+N byte structure: 4-byte header followed
// by an optional fs-verity SHA-256 digest (32 bytes). Reject anything
// that does not match either form.
validate_metacopy_xattr(xattr)
}
The lookup-time check is defence-in-depth: the mount-time enforcement already
prevents metacopy=on from reaching OverlaySuperBlock::config on
user-namespace mounts, so ovl_lookup_metacopy would not be called. The
redundant check in ovl_lookup_metacopy protects against future code paths that
might bypass the mount-time gate.
Userxattr mode and data-only layers
When userxattr=on is set (required for user-namespace mounts), user.overlay.*
xattrs are used throughout. The user.overlay.redirect xattr controls directory
rename semantics and, in data-only layer configurations, points metacopy stubs to
their data sources. Because user.* xattrs are writable by the file owner, and
because data-only layer configurations allow a metacopy file in one lower layer to
redirect to a file in a data-only lower layer via user.overlay.redirect:
- redirect_dir=on is disallowed for user-namespace-influenced mounts (forced
  to off at mount time).
- Data-only lower layers are disallowed for user-namespace-influenced mounts:
  OverlayFs::mount() returns EPERM if any lower layer path is specified with
  the '::' data-only separator syntax when userns_influenced is true.
These restrictions prevent the user.overlay.redirect xattr from being used to
point a metacopy stub in one layer at a file in another layer that the container
would not otherwise be able to access.
Summary of security invariants
| Condition | trusted.overlay.* metacopy | user.overlay.* metacopy |
|---|---|---|
| Initial user namespace mount, metacopy=on | Trusted (forging requires host CAP_SYS_ADMIN) | N/A (userxattr not used in privileged mounts by default) |
| User-namespace mount | N/A (trusted.* inaccessible from user NS) | Disabled (forced off at mount time; ovl_lookup_metacopy returns None) |
| User-namespace mount, userxattr=on, data-only layers | N/A | Rejected at mount time (EPERM) |
13.4.7 Directory Operations
Readdir merge: Reading a merged directory (one that exists in both upper and lower layers) requires combining entries from all layers, excluding whiteouts and applying opaque directory semantics.
OverlayFileOps::readdir(inode, private, offset, emit) -> Result<()>:
let oi = get_overlay_inode(inode)
// Phase 1: Collect entries from upper layer.
let mut seen: HashSet<OsStr> = HashSet::new()
if let Some(upper) = oi.upper.load(Acquire) {
underlying_readdir(upper, |entry_inode, entry_off, ftype, name| {
// Skip whiteout entries — they indicate deleted lower entries.
if is_whiteout_entry(entry_inode) {
seen.insert(name.to_owned()) // Track for lower suppression.
return true // Continue iteration.
}
seen.insert(name.to_owned())
emit(overlay_inode_for(entry_inode), entry_off, ftype, name)
})
}
// Phase 2: If directory is opaque, stop here. Lower entries are hidden.
if oi.opaque {
return Ok(())
}
// Phase 3: Collect entries from lower layers, skipping duplicates.
for lower_ref in lower_dirs_for(oi) {
underlying_readdir(lower_ref.inode, |entry_inode, entry_off, ftype, name| {
// Skip entries already seen in upper or higher lower layers.
if seen.contains(name) {
return true
}
// Skip whiteout entries from lower layers too.
if is_whiteout_entry(entry_inode) {
seen.insert(name.to_owned())
return true
}
seen.insert(name.to_owned())
emit(overlay_inode_for(entry_inode), entry_off, ftype, name)
})
}
Ok(())
Readdir caching: The merged directory listing is cached in the overlay file's
private state (returned by open()) for the lifetime of the open directory file
descriptor. This matches Linux's behavior: the merge is computed once per
opendir() and subsequent readdir() calls return entries from the cache. The
cache is invalidated on rewinddir() (seek to offset 0).
Performance note on seen HashSet: The HashSet<OsStr> in the pseudocode above
is allocated once per opendir() call (during the initial merge), not once per
readdir() call. The cache stores the deduplicated entry list; subsequent readdir()
calls walk the already-merged cache without re-allocating or re-hashing. For large
directories (>10,000 entries), the initial opendir() merge is O(N) with one
allocation per distinct entry name (stored in the HashSet during merge, then released
when the merge completes and entries are stored in a flat Vec in the file private
state). The hot path — repeated readdir() calls iterating through the cached Vec —
is O(entries) with zero heap allocations.
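The per-open cache behavior can be sketched as follows. `DirCache` is a toy stand-in for the file's private state, and the `merge` closure stands in for the Phase 1-3 walk above; seeking to offset 0 drops the cache, forcing a re-merge:

```rust
/// Per-open-directory readdir cache (illustrative).
pub struct DirCache {
    entries: Option<Vec<String>>, // merged, deduplicated listing
    pos: usize,
}

impl DirCache {
    pub fn new() -> Self { Self { entries: None, pos: 0 } }

    /// Return the next entry, computing the merge once on first call.
    pub fn read_next<F: FnOnce() -> Vec<String>>(&mut self, merge: F) -> Option<String> {
        if self.entries.is_none() {
            self.entries = Some(merge()); // one merge pass per opendir
        }
        let e = self.entries.as_ref().unwrap();
        let item = e.get(self.pos).cloned();
        if item.is_some() {
            self.pos += 1;
        }
        item
    }

    /// rewinddir(): seeking to offset 0 invalidates the cached merge.
    pub fn rewind(&mut self) {
        self.entries = None;
        self.pos = 0;
    }
}
```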
Directory rename (redirect_dir=on): When a merged directory is renamed,
overlayfs cannot rename the lower-layer directory (it is read-only). Instead:
- Create the new directory name in the upper layer.
- Set the trusted.overlay.redirect xattr on the new upper directory,
  containing the absolute path (from the overlay root) of the original lower
  directory. Maximum redirect path: 256 bytes.
- Lookups for the renamed directory follow the redirect: when searching lower
  layers, use the redirect path instead of the current name.
- Create a whiteout at the old name to hide the lower-layer original.
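The redirect-following step can be sketched as a small path helper: when the upper directory carries a redirect xattr, lower-layer searches use the redirect target instead of the directory's current overlay-root-relative path (the function and its string-based model are illustrative assumptions):

```rust
/// Compute the lower-layer path at which to look up `name`, honoring an
/// optional redirect stored on the upper directory.
pub fn lower_lookup_path(current_path: &str, redirect: Option<&str>, name: &str) -> String {
    let base = redirect.unwrap_or(current_path);
    format!("{}/{}", base.trim_end_matches('/'), name)
}
```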
Opaque directory creation (rmdir + mkdir of same name):
- Create whiteout or opaque directory in upper layer.
- Set the trusted.overlay.opaque xattr to "y" on the new upper directory.
- All lower-layer entries under this path are hidden.
13.4.8 Whiteout and Deletion
When a file or directory is deleted from a merged view, overlayfs must hide the lower-layer entry without modifying the lower layer:
File deletion (unlink on a merged file):
1. If the file exists in upper: remove the upper entry via underlying_unlink().
2. If the file exists in any lower layer: create a whiteout in the upper layer
at the same path.
3. Invalidate the dentry cache entry.
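The unlink decision above can be sketched as a pure function over presence flags (the real code consults the overlay inode's upper and lower references; the enum names are illustrative):

```rust
#[derive(PartialEq, Eq, Debug)]
pub enum UnlinkPlan {
    /// Upper-only file: removing the upper entry fully deletes it.
    RemoveUpper,
    /// Exists in both: remove upper, then whiteout the lower copy.
    RemoveUpperAndWhiteout,
    /// Lower-only file: a whiteout alone hides it.
    WhiteoutOnly,
    /// Nothing to delete.
    NoEntry,
}

pub fn plan_unlink(in_upper: bool, in_lower: bool) -> UnlinkPlan {
    match (in_upper, in_lower) {
        (true, false) => UnlinkPlan::RemoveUpper,
        (true, true) => UnlinkPlan::RemoveUpperAndWhiteout,
        (false, true) => UnlinkPlan::WhiteoutOnly,
        (false, false) => UnlinkPlan::NoEntry,
    }
}
```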
Directory deletion (rmdir on a merged directory):
1. Verify the merged view of the directory is empty (no entries from any layer
that are not whiteouts). Return ENOTEMPTY if non-empty.
2. If the directory exists in upper: remove it.
3. If the directory exists in lower: create an opaque whiteout in upper.
Whiteout creation:
/// Create a whiteout entry in the upper layer.
///
/// UmkaOS uses the xattr-based whiteout format by default: a zero-size
/// regular file with the overlay whiteout xattr set. This avoids
/// requiring mknod(2) capability (character device 0:0 creation
/// requires CAP_MKNOD in the filesystem's user namespace).
///
/// For compatibility, the character-device whiteout format is also
/// recognized on read (lookup).
fn create_whiteout(upper_parent: InodeId, name: &OsStr) -> Result<()> {
let sb = overlay_super_block();
// Create zero-size regular file.
let whiteout = underlying_create(upper_parent, name,
FileMode::regular(0o000))?;
// Set the whiteout xattr.
underlying_setxattr(whiteout,
concat(sb.xattr_prefix, "whiteout"), b"y", XattrFlags::CREATE)?;
Ok(())
}
RENAME_WHITEOUT integration: The VFS rename() with RENAME_WHITEOUT flag
(already supported in InodeOps::rename(), Section 13.1.1) atomically renames a
file and creates a whiteout at the old name. overlayfs uses this during copy-up
of directory entries: when a file is copied from lower to upper, the old lower
path is hidden by a whiteout created atomically with the rename.
13.4.9 Volatile Mode
Volatile mode disables all durability guarantees for the upper layer. This is a deliberate trade-off for ephemeral container workloads.
Behavior:
- fsync(), fdatasync(), and sync_fs() on overlay files are no-ops (return
success without calling the underlying filesystem's sync).
- On mount with volatile=true, create the sentinel directory
$workdir/work/incompat/volatile/.
- On unmount, remove the sentinel directory (clean shutdown).
- On next mount, if the sentinel exists, return EINVAL with a diagnostic
message: the previous volatile session was not cleanly unmounted, and the
upper/work directories may be inconsistent. The operator must delete upper
and work directories and recreate them.
- After any writeback error on the upper filesystem, subsequent fsync() calls
on overlay files return EIO persistently (matching Linux's error stickiness
behavior from Section 14.1).
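The sentinel protocol can be sketched over a toy filesystem view (a set of existing paths; the names and error type are illustrative). A sentinel found at mount time means the previous volatile session was not cleanly unmounted:

```rust
use std::collections::HashSet;

/// Path of the volatile sentinel, relative to workdir.
pub const SENTINEL: &str = "work/incompat/volatile";

/// Mount-time check: fail if a stale sentinel exists, else create it.
pub fn volatile_mount_check(workdir: &mut HashSet<String>) -> Result<(), &'static str> {
    if workdir.contains(SENTINEL) {
        // Previous volatile session was not cleanly unmounted.
        return Err("EINVAL: previous volatile session unclean");
    }
    workdir.insert(SENTINEL.to_string()); // created on successful mount
    Ok(())
}

/// Clean unmount removes the sentinel, permitting the next mount.
pub fn volatile_unmount(workdir: &mut HashSet<String>) {
    workdir.remove(SENTINEL);
}
```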
Container runtime usage: Docker enables volatile mode for containers started
with --storage-opt overlay2.volatile=true. This is common for CI/CD runners,
build containers, and test environments where container state is discarded after
each run.
13.4.10 Extended Attribute Handling
overlayfs must handle xattrs carefully because it uses private xattrs for internal bookkeeping (whiteouts, metacopy, redirects, opaque markers) and must pass through user-visible xattrs correctly.
Xattr namespace partitioning:
| Namespace | Behavior |
|---|---|
| trusted.overlay.* (or user.overlay.* in userxattr mode) | Internal: overlay-private. Not visible to userspace via listxattr()/getxattr(). Used for whiteout, opaque, metacopy, redirect, origin markers. |
| security.* | Pass-through with copy-up: Copied from lower to upper during copy-up. setxattr() triggers copy-up. Includes security.selinux, security.capability (file caps), security.ima. |
| system.posix_acl_access, system.posix_acl_default | Pass-through with copy-up: POSIX ACLs are copied during copy-up. setfacl triggers copy-up. |
| user.* (excluding user.overlay.* in userxattr mode) | Pass-through with copy-up: User-defined xattrs. Copied during copy-up. |
| trusted.* (excluding trusted.overlay.*) | Pass-through with copy-up: Only accessible to CAP_SYS_ADMIN processes. Copied during copy-up. |
getxattr/setxattr dispatch:
OverlayInodeOps::getxattr(inode, name, buf) -> Result<usize>:
// Block access to overlay-private xattrs.
if name.starts_with(overlay_xattr_prefix()) {
return Err(ENODATA)
}
// Serve from upper if available, otherwise from lower.
let target = upper_or_lower(inode)
underlying_getxattr(target, name, buf)
OverlayInodeOps::setxattr(inode, name, value, flags) -> Result<()>:
// Block writes to overlay-private xattrs.
if name.starts_with(overlay_xattr_prefix()) {
return Err(EPERM)
}
// setxattr triggers copy-up (xattr must be set on upper).
let upper = copy_up(inode)?
underlying_setxattr(upper, name, value, flags)
OverlayInodeOps::listxattr(inode, buf) -> Result<usize>:
// List xattrs from upper (if exists) or lower.
// Filter out overlay-private xattrs from the result.
let target = upper_or_lower(inode)
let raw = underlying_listxattr(target, buf)?
filter_out_overlay_xattrs(buf, raw)
Nested overlayfs: When overlayfs is mounted on top of another overlayfs
(nested container images, uncommon but valid), the inner overlay's xattrs must
not collide with the outer overlay's. Linux handles this via "xattr escaping":
the inner overlay stores its xattrs under trusted.overlay.overlay.* instead
of trusted.overlay.*. UmkaOS implements the same escaping mechanism. This is
transparent to the filesystem — the inner overlay simply uses a longer prefix.
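The escaping itself reduces to extending the prefix by one `overlay.` component, as this minimal sketch shows:

```rust
/// Compute the xattr prefix used by an overlay nested on top of another
/// overlay: one extra "overlay." component per nesting level.
pub fn escaped_prefix(outer: &str) -> String {
    format!("{}overlay.", outer)
}
```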
13.4.11 statfs Behavior
OverlayFs::statfs() returns statistics from the upper layer's filesystem (if
present). For read-only overlays (no upper), statistics from the topmost lower
layer are returned. This matches Linux behavior and ensures that df on a
container's root filesystem shows the available space on the writable layer.
13.4.12 Inode Number Composition (xino)
To guarantee unique inode numbers across the merged view, overlayfs composes inode numbers from the underlying filesystem's inode number and the layer index:
composed_ino = (layer_index << xino_bits) | underlying_ino
Where xino_bits is the number of bits available for the underlying inode
(typically 32 for ext4 with default inode sizes). This ensures that
stat() returns unique inode numbers for files from different layers that
happen to share the same underlying inode number (common when layers are on
the same filesystem).
When xino=off or when underlying inode numbers exceed the available bit width,
overlayfs falls back to using the underlying inode numbers directly. In this mode,
st_dev differs between upper and lower files (the VFS assigns a unique device
number per overlay mount), but st_ino may collide across layers. Applications
that rely on (st_dev, st_ino) pairs for file identity (e.g., tar, rsync,
find -inum) may exhibit incorrect behavior. xino=auto avoids this by
enabling composition only when it is safe.
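The composition formula and its overflow fallback can be written as one function. Returning `None` models the fallback path in which underlying inode numbers are used directly (the function name is illustrative):

```rust
/// Compose a unique overlay inode number, or signal fallback when the
/// underlying inode number does not fit in `xino_bits`.
pub fn compose_ino(layer_index: u64, underlying_ino: u64, xino_bits: u32) -> Option<u64> {
    if underlying_ino >> xino_bits != 0 {
        return None; // would overflow: fall back to raw underlying inos
    }
    Some((layer_index << xino_bits) | underlying_ino)
}
```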
13.4.13 Mount and Unmount Flow
Mount:
OverlayFs::mount(source, flags, data) -> Result<SuperBlock>:
1. Parse mount options from `data` into `OverlayMountOptions`.
2. Determine user-namespace influence (security policy, Section 13.4.6.1):
userns_influenced = (current_user_ns() != &init_user_ns)
If userns_influenced:
a. Force options.metacopy = false.
Force options.redirect_dir = RedirectDirMode::Off.
Log: "overlayfs: metacopy and redirect_dir disabled for
user-namespace mount (Section 13.4.6.1)"
b. Require options.userxattr == true. If not set, return EPERM.
(User-namespace mounts cannot use trusted.overlay.* xattrs.)
c. If any lower_dir entry uses the data-only '::' separator syntax:
return EPERM. (Data-only layers with userxattr are disallowed
because user.overlay.redirect is owner-writable.)
3. Resolve each lower_dir path to an InodeId via VFS path lookup.
Verify each is a directory. Hold references for mount lifetime.
4. If upper_dir is set:
a. Resolve upper_dir to InodeId. Verify it is a writable directory.
b. Resolve work_dir to InodeId. Verify same superblock as upper_dir.
c. Check work_dir is empty.
d. Create `$workdir/work/` subdirectory if it does not exist.
e. If volatile mode:
- Check for `$workdir/work/incompat/volatile/` sentinel.
If exists: return EINVAL ("previous volatile session unclean").
- Create the sentinel directory.
f. If nfs_export: create `$workdir/index/` subdirectory.
g. Clean stale temporary files from workdir (names starting with
`#overlay.`). These are remnants of interrupted copy-ups.
5. Verify upper filesystem supports required operations:
- xattr support (getxattr/setxattr succeed with overlay prefix).
- rename with RENAME_WHITEOUT (test with a dummy file in workdir).
6. Construct `OverlaySuperBlock` with userns_influenced as determined
in step 2, and `SuperBlock`.
7. Register overlay dentry ops with the VFS.
8. Emit mount options for /proc/mounts via show_options().
Unmount:
OverlayFs::unmount(sb) -> Result<()>:
1. If volatile mode: remove sentinel directory
`$workdir/work/incompat/volatile/`.
2. Release all layer references (InodeId references to underlying
filesystem directories).
3. Drop the OverlaySuperBlock.
13.4.14 Performance Characteristics
| Operation | Overhead vs. direct filesystem access | Notes |
|---|---|---|
| Path lookup (cached) | +1 hash lookup per component | Overlay dentry points to underlying dentry |
| Read (lower-only file) | ~0% | Direct delegation to lower filesystem |
| Read (upper file) | ~0% | Direct delegation to upper filesystem |
| Read (metacopy file) | ~0% | Reads from lower, same as lower-only |
| Write (upper file) | ~0% | Direct delegation to upper filesystem |
| Write (first write, copy-up) | O(file_size) one-time | Sequential read+write of file data |
| Write (metacopy first write) | O(file_size) one-time | Deferred from container startup |
| chmod/chown (metacopy) | O(1) ~10μs | Metadata-only copy-up (no data copy) |
| chmod/chown (no metacopy) | O(file_size) | Full copy-up triggered |
| readdir (merged) | O(entries × layers) | Hash-based dedup over all layers |
| stat (cached) | ~0% | Overlay inode cached in VFS |
Container startup optimization: With metacopy enabled, pulling and starting a container image avoids copying any file data during the initial setup phase (only metadata operations occur: chmod, chown, symlink creation for the container's init process). Data is copied lazily on first write. For typical container images (200-500 MB of layers), this reduces container start time from seconds to tens of milliseconds for the filesystem setup phase.
13.4.15 dm-verity Integration for Container Image Layers
Read-only lower layers in a container overlay can be protected by dm-verity (Section 8.2.6). The container runtime mounts each image layer's block device with dm-verity verification, then stacks them as overlayfs lower layers:
Container image mount sequence:
1. Pull image layers: layer1.img, layer2.img, ..., layerN.img
2. For each layer:
a. Set up dm-verity on the layer's block device (Merkle tree
verification, Section 8.2.6)
b. Mount the verified block device read-only (ext4/XFS)
3. Mount overlayfs:
mount -t overlay overlay \
-o lowerdir=/mnt/layerN:...:/mnt/layer1,upperdir=...,workdir=...
/container/rootfs
This provides block-level integrity verification for all read-only container layers. The writable upper layer is covered by IMA (Section 8.4) for runtime integrity measurement of modified files. Together, dm-verity (lower layers) + IMA (upper layer) provide complete integrity coverage for container filesystems.
The optional verity=require mount option provides an additional layer of
verification at the overlayfs level using fs-verity digests, independent of
dm-verity block device verification.
13.4.16 Linux Compatibility
overlayfs is compatible with Linux's overlayfs at the mount interface and xattr format level:
- Upper and lower directories created by Linux overlayfs are mountable by
  UmkaOS and vice versa. The xattr format (trusted.overlay.* names and
  values) is identical.
- Mount option syntax matches Linux exactly
  (-o lowerdir=...,upperdir=...,workdir=...).
- Whiteout formats (both character device 0:0 and xattr-based) are recognized.
- Metacopy xattr format is compatible: layers created with metacopy=on on
  Linux work on UmkaOS.
- redirect_dir xattr format and path encoding match Linux.
- /proc/mounts output format matches Linux for container introspection tools.
- /sys/module/overlay/parameters/* is not emulated (UmkaOS does not use
  kernel modules); per-mount options in the mount command are the sole
  configuration mechanism.
Docker/containerd/Podman compatibility: These runtimes interact with
overlayfs exclusively through the mount(2) syscall and standard file operations.
They do not use any overlayfs-specific ioctls or sysfs interfaces. UmkaOS's
implementation of mount("overlay", ...) with the standard option string is
sufficient for full compatibility. The overlay2 storage driver in Docker and
the overlayfs snapshotter in containerd are fully supported.
---
## 13.5 binfmt_misc — Arbitrary Binary Format Registration
`binfmt_misc` is a VFS-level mechanism that allows userspace to register handlers
for arbitrary binary formats, identified by magic bytes or file extension. When the
kernel's exec path attempts to start a file and neither the native ELF handler nor
the `#!` script handler matches, the kernel delegates to a registered `binfmt_misc`
interpreter. The registered interpreter binary is invoked with the original file
path as an additional argument.
Critical use cases:
- **Multi-architecture containers**: `qemu-aarch64-static` is registered as the
interpreter for AArch64 ELF binaries, identified by the AArch64 ELF magic header.
This allows running unmodified ARM64 Docker images on an x86-64 host without
hardware virtualisation.
- **Java**: `.jar` files executed as if they were executables via a registration
that maps the `.jar` extension to `/usr/bin/java -jar`.
- **.NET**: PE32+ executables identified by the `MZ` magic bytes are mapped to
`dotnet exec`.
- **Wine**: 16-bit and 32-bit Windows PE files mapped to `wine`.
### 13.5.1 Data Structures
```rust
/// A single registered binfmt_misc entry.
pub struct BinfmtMiscEntry {
    /// Registration name. Shown as the filename under the binfmt_misc mount.
    /// Alphanumeric, hyphen, and underscore only. NUL-terminated.
    pub name: [u8; 64],
    /// Matching strategy: magic bytes or file extension.
    pub match_type: BinfmtMatch,
    /// Magic bytes to compare against file content (BinfmtMatch::Magic only).
    /// Maximum 128 bytes. Length of `magic` and `mask` must be equal.
    pub magic: Option<[u8; 128]>,
    /// Length of the valid portion of `magic` and `mask` arrays.
    pub magic_len: u8,
    /// Bitmask applied to each file byte before comparison with `magic`.
    /// A mask byte of `0xff` means "match exactly"; `0x00` means "ignore".
    pub mask: Option<[u8; 128]>,
    /// Byte offset within the file at which `magic` is compared.
    pub magic_offset: u16,
    /// File extension string (BinfmtMatch::Extension only).
    /// Case-sensitive. Does not include the leading `.`. NUL-terminated.
    pub extension: Option<[u8; 8]>,
    /// Absolute path to the interpreter binary.
    pub interpreter: [u8; PATH_MAX],
    /// Behavioural flags.
    pub flags: BinfmtFlags,
    /// Whether this entry participates in exec matching.
    pub enabled: AtomicBool,
}

/// How the entry identifies matching binaries.
pub enum BinfmtMatch {
    /// Match by magic bytes at a fixed offset within the file.
    Magic,
    /// Match by the file extension of the executed path.
    Extension,
}

bitflags! {
    /// Behavioural flags for a binfmt_misc entry.
    pub struct BinfmtFlags: u32 {
        /// Pass the original filename as argv[0] to the interpreter instead
        /// of substituting the interpreter path.
        const PRESERVE_ARGV0 = 0x01;
        /// Open the binary file and pass it to the interpreter as an open fd
        /// (via `/proc/self/fd/N`). Required when the binary is not
        /// world-readable and the interpreter runs without elevated privilege.
        const OPEN_BINARY = 0x02;
        /// Use the credentials (uid, gid, capabilities) of the interpreter
        /// binary rather than those of the executed file. Equivalent to
        /// setuid execution for the interpreter.
        const CREDENTIALS = 0x04;
        /// Fix binary: the interpreter is not itself subject to further
        /// binfmt_misc or personality transformation. Prevents recursion.
        const FIX_BINARY = 0x08;
        /// Secure: do not grant elevated credentials even when the interpreter
        /// binary is setuid. Overrides CREDENTIALS for privilege de-escalation.
        const SECURE = 0x10;
    }
}
```

The global entry table is a `RwLock<Vec<Arc<BinfmtMiscEntry>>>`. Reads (the exec
path) take the read lock for a bounded scan; writes (registration, enable/disable,
removal) take the write lock. The list is short in practice (fewer than 64 entries
on any real system), so the O(N) scan cost is negligible relative to exec overhead.
### 13.5.2 Registration Interface

The binfmt_misc filesystem is mounted at `/proc/sys/fs/binfmt_misc` (also accessible
at `/sys/kernel/umka/binfmt_misc/` via the umkafs namespace — see
Section 19.4). It exposes:

| Path | Type | Description |
|---|---|---|
| `register` | write-only file | Register a new entry |
| `status` | read/write file | `1` = all entries active; `0` = all disabled globally |
| `<name>/enabled` | read/write file | `1` enable, `0` disable, `-1` remove this entry |
| `<name>` | read-only file | Shows entry details (flags, interpreter, magic/extension) |

Writing to `register` or any `<name>/enabled` file requires `Capability::SysAdmin`
in the caller's capability set.
Registration format (written as a single line to `register`):

```
:name:type:offset:magic:mask:interpreter:flags
```

Fields are separated by the same delimiter character as the leading `:`. Any
printable non-alphanumeric character may be used as the delimiter (allowing paths
that contain colons).

| Field | Description |
|---|---|
| `name` | Identifier: alphanumeric, `-`, `_`. Maximum 63 characters. |
| `type` | `M` for magic-byte match; `E` for extension match. |
| `offset` | Decimal byte offset for magic comparison (type `M`). `0` for most formats. |
| `magic` | Hex-escaped bytes for type `M` (e.g., `\x7fELF`). Extension string for type `E`. |
| `mask` | Hex-escaped bitmask for type `M`; same length as `magic`. Empty for type `E`. |
| `interpreter` | Absolute path to the interpreter binary. Must exist at registration time. |
| `flags` | Subset of `POCFS`: `P` = PRESERVE_ARGV0, `O` = OPEN_BINARY, `C` = CREDENTIALS, `F` = FIX_BINARY, `S` = SECURE. |
Example — registering QEMU user-mode for AArch64 ELF binaries on an x86-64 host:

```
:qemu-aarch64:M:0:\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00::qemu-aarch64-static:OC
```

- Type `M`, offset `0`: compare 20 magic bytes starting at file byte 0.
- No mask: all bytes are compared exactly (an all-`\xff` mask is implied).
- `O` (OPEN_BINARY): the interpreter receives the file as an fd, not a path, for
  cross-uid access.
- `C` (CREDENTIALS): the interpreter's credentials govern setuid semantics.
Parsing algorithm:

```
parse_registration(line: &[u8]) -> Result<BinfmtMiscEntry>:
  1. delimiter = line[0]
  2. Split line on delimiter into fields: [name, type, offset, magic_or_ext,
     mask, interpreter, flags_str].
  3. Validate name: alphanumeric + '-' + '_', length 1–63.
  4. Parse type: 'M' → BinfmtMatch::Magic, 'E' → BinfmtMatch::Extension.
  5. For type M:
     a. Parse offset as decimal u16.
     b. Decode hex-escaped bytes into magic array (max 128 bytes).
     c. If mask non-empty: decode hex-escaped bytes; length must equal magic.len().
     d. If mask empty: fill mask with 0xff bytes (exact match).
  6. For type E:
     a. Validate extension: printable ASCII, no '/', no '.', max 7 chars.
     b. Store extension without leading '.'.
  7. Validate interpreter: starts with '/', exists in VFS (path lookup),
     is a regular file with execute permission for at least one uid.
  8. Parse flags_str: accept 'P', 'O', 'C', 'F', 'S' in any order.
  9. Construct BinfmtMiscEntry with enabled = AtomicBool::new(true).
  10. Acquire write lock on global table; reject if name already exists.
  11. Push Arc<BinfmtMiscEntry> to table.
```
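Step 5b's hex-escape decoding can be sketched in plain user-space Rust. This is a
minimal sketch, not kernel code: the helper name `decode_hex_escaped` and its
error strings are illustrative.

```rust
/// Decode a binfmt_misc magic/mask field: literal bytes plus `\xHH` escapes.
/// Illustrative helper; mirrors step 5b of the parsing algorithm above.
fn decode_hex_escaped(field: &[u8]) -> Result<Vec<u8>, &'static str> {
    let hex = |c: u8| -> Result<u8, &'static str> {
        match c {
            b'0'..=b'9' => Ok(c - b'0'),
            b'a'..=b'f' => Ok(c - b'a' + 10),
            b'A'..=b'F' => Ok(c - b'A' + 10),
            _ => Err("bad hex digit"),
        }
    };
    let mut out = Vec::new();
    let mut i = 0;
    while i < field.len() {
        if field[i] == b'\\' && field.get(i + 1) == Some(&b'x') {
            // `\xHH` escape: two hex digits follow
            let hi = *field.get(i + 2).ok_or("truncated escape")?;
            let lo = *field.get(i + 3).ok_or("truncated escape")?;
            out.push(hex(hi)? << 4 | hex(lo)?);
            i += 4;
        } else {
            // Literal byte (e.g., the `ELF` in `\x7fELF`)
            out.push(field[i]);
            i += 1;
        }
    }
    if out.len() > 128 {
        return Err("magic longer than 128 bytes");
    }
    Ok(out)
}

fn main() {
    let magic = decode_hex_escaped(br"\x7fELF\x02").unwrap();
    assert_eq!(magic, vec![0x7f, b'E', b'L', b'F', 0x02]);
    println!("decoded {} bytes", magic.len());
}
```

A mask field is decoded with the same routine; step 5c then only has to compare
the two lengths.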
### 13.5.3 Exec Path Integration

During `do_execve` (Section 7.3), after the ELF handler and the
`#!` script handler have both declined the binary (returned `ENOEXEC`), the kernel
calls `binfmt_misc_load_binary(file, argv, envp)`.
Matching algorithm:

```
binfmt_misc_load_binary(file, argv, envp) -> Result<()>:
  1. Acquire read lock on global entry table.
  2. If global status is disabled: return ENOEXEC.
  3. Read a probe buffer of min(128 + max_magic_offset, 256) bytes from
     offset 0 of `file`. This single read covers all registered magic ranges.
  4. For each entry in table order:
     a. If !entry.enabled.load(Relaxed): skip.
     b. If entry.match_type == Magic:
        i.   end = entry.magic_offset as usize + entry.magic_len as usize.
        ii.  If end > probe_buffer.len(): skip (file too short).
        iii. For each byte i in 0..magic_len:
             file_byte = probe[magic_offset + i] & mask[i]
             if file_byte != magic[i] & mask[i]: break → no match
        iv.  If all bytes matched: entry is selected.
     c. If entry.match_type == Extension:
        i.  Extract filename from argv[0] (last path component).
        ii. If filename ends with '.' + extension (case-sensitive): entry is selected.
  5. If no entry matched: release lock; return ENOEXEC.
  6. Clone the matched entry (Arc clone, no copy of byte arrays).
  7. Release read lock.
  8. Build new argv:
     a. If PRESERVE_ARGV0 set: new_argv = [interpreter, argv[0], argv[1..]]
     b. Else: new_argv = [interpreter, original_file_path, argv[1..]]
     c. If OPEN_BINARY set: pass file as open fd; prepend "/proc/self/fd/<N>"
        in place of original_file_path.
  9. If CREDENTIALS set: use interpreter binary's uid/gid/caps for the new exec.
  10. If SECURE set: clear any setuid bits that CREDENTIALS would have applied.
  11. Invoke do_execve recursively with interpreter path and new_argv.
      If FIX_BINARY set: skip binfmt_misc matching in the recursive exec
      (set a per-exec flag to prevent re-entry into this function).
```
Step 11's recursive `do_execve` processes the interpreter itself through the
normal ELF handler. QEMU user-mode binaries are statically linked ELF executables,
so the recursion terminates after one level.
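The masked byte comparison at the heart of step 4b can be sketched as a
standalone function. The function and variable names are illustrative; the kernel
operates on its probe buffer rather than arbitrary slices.

```rust
/// Sketch of the masked magic comparison from step 4b of the matching
/// algorithm: each probe byte is ANDed with its mask byte before comparison.
fn magic_matches(probe: &[u8], offset: usize, magic: &[u8], mask: &[u8]) -> bool {
    assert_eq!(magic.len(), mask.len());
    let end = match offset.checked_add(magic.len()) {
        Some(e) => e,
        None => return false,
    };
    if end > probe.len() {
        return false; // file too short to contain the magic range (step 4b.ii)
    }
    magic
        .iter()
        .zip(mask)
        .enumerate()
        .all(|(i, (&m, &k))| probe[offset + i] & k == m & k)
}

fn main() {
    let elf = [0x7f, b'E', b'L', b'F', 0x02];
    // Exact match on the 4-byte ELF magic.
    assert!(magic_matches(&elf, 0, &[0x7f, b'E', b'L', b'F'], &[0xff; 4]));
    // A mask byte of 0x00 ignores that position entirely.
    assert!(magic_matches(&elf, 4, &[0x55], &[0x00]));
    println!("magic matching ok");
}
```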
### 13.5.4 The binfmt_misc Filesystem

`binfmt_misc_fs` is a minimal VFS filesystem type (`FsType::BinfmtMisc`) with the
following `FsOps` implementation:
```rust
impl FsOps for BinfmtMiscFs {
    fn mount(&self, flags: MountFlags, _data: &[u8]) -> Result<Arc<SuperBlock>>;
    fn statfs(&self, sb: &SuperBlock) -> Result<StatFs>;
}

impl InodeOps for BinfmtMiscDir {
    fn lookup(&self, name: &OsStr) -> Result<Arc<Dentry>>;
    fn iterate_dir(&self, ctx: &mut DirContext) -> Result<()>;
}

impl FileOps for BinfmtMiscRegister {
    fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>; // parse_registration
}

impl FileOps for BinfmtMiscStatus {
    fn read(&self, buf: &mut [u8], _offset: u64) -> Result<usize>;  // "enabled\n" or "disabled\n"
    fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>;     // "1" / "0"
}

impl FileOps for BinfmtMiscEntryFile {
    fn read(&self, buf: &mut [u8], _offset: u64) -> Result<usize>;  // entry details
    fn write(&self, buf: &[u8], _offset: u64) -> Result<usize>;     // "1" / "0" / "-1"
}
```
The filesystem has no on-disk backing store. All state lives in the in-kernel
`Vec<Arc<BinfmtMiscEntry>>`. Directory inodes are synthesised dynamically: `lookup`
scans the entry table for a matching name and returns a synthetic inode; `iterate_dir`
emits `register`, `status`, and all current entry names.
Multiple mounts of the binfmt_misc filesystem share the same global entry table
(identical to Linux semantics). Unmounting does not clear registrations; entries
persist until explicitly removed via `echo -1 > /proc/sys/fs/binfmt_misc/<name>/enabled`
or until the kernel reboots.
Mount point: The standard location is `/proc/sys/fs/binfmt_misc`, mounted by
`systemd-binfmt.service` at early boot before entries are loaded from
`/etc/binfmt.d/*.conf` and `/usr/lib/binfmt.d/*.conf`.
### 13.5.5 Persistence and systemd Integration

The kernel holds registrations only in memory; they are lost on reboot.
The `systemd-binfmt.service` unit re-registers all entries at each boot by reading
configuration files of the form:
```
# /etc/binfmt.d/qemu-aarch64.conf
:qemu-aarch64:M:0:\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00::qemu-aarch64-static:OC
```
Each non-comment, non-empty line is written verbatim to
`/proc/sys/fs/binfmt_misc/register`. Drop-in files in `/usr/lib/binfmt.d/` are
processed first, then `/etc/binfmt.d/` (higher priority). Conflicting entries with
the same name are rejected by the kernel (the duplicate-name check in `parse_registration`).
### 13.5.6 Security Model

- **Privilege**: Writing to `register` or any `enabled` file requires
  `Capability::SysAdmin`. Unprivileged processes cannot add or modify entries.
- **Interpreter credentials**: By default (no `CREDENTIALS` flag), the interpreter
  runs with the calling process's credentials. The setuid bits of the interpreter
  binary are ignored. This prevents privilege escalation via a crafted binary whose
  magic bytes happen to match a setuid interpreter's registration.
- **`CREDENTIALS` flag**: Explicitly opts in to interpreter-binary credential
  inheritance. Should only be set for fully trusted interpreters.
- **`SECURE` flag**: When set alongside `CREDENTIALS`, strips any elevated privilege
  that would have been inherited. Useful for sandboxed interpreters.
- **`OPEN_BINARY` flag**: The kernel opens the binary file before constructing the
  new `argv`, so the interpreter receives an already-open fd. This allows the
  interpreter to read the file even when the binary is not world-readable (e.g.,
  `chmod 700` user-owned binaries run through QEMU on a shared host). The fd is
  passed as a `/proc/self/fd/N` path to remain compatible with interpreters that
  accept a file path argument.
- **Recursion guard**: The `FIX_BINARY` flag, combined with the per-exec recursion
  flag set in step 11 of Section 13.5.3, prevents pathological interpreter chains
  where an interpreter is itself a binfmt_misc-dispatched binary.
## 13.6 autofs — Kernel Automount Trigger
autofs is the kernel side of the automount subsystem. Its role is narrow: detect
access to a path that has not yet been mounted, suspend the filesystem lookup, notify
a userspace daemon, and resume the lookup after the daemon has performed the mount.
The kernel does not decide what to mount or where it comes from — that is
entirely the daemon's responsibility.
Used extensively by systemd through `.automount` units: lazy NFS home directories
(`/home/$user`), removable media (`/media/disk`), and network shares that should
only connect on demand.
### 13.6.1 Architecture

autofs registers a VFS filesystem type (`FsType::Autofs`). An autofs filesystem
instance covers a single mount point. Inside that mount point, the kernel may see
directory entries that are not yet backed by a real mount. When path resolution
(Section 13.1.3) traverses one of these directories and finds
`DCACHE_NEED_AUTOMOUNT` set on its dentry, it calls the dentry's `d_automount`
operation.
The two fundamental mount modes are:

| Mode | Description |
|---|---|
| indirect | The autofs mount covers a directory; lookups of subdirectories trigger mounts. `/nfs` is autofs; accessing `/nfs/fileserver` triggers a mount of `fileserver:/export` onto `/nfs/fileserver`. |
| direct | The autofs mount point IS the trigger. Accessing the exact path (e.g., `/mnt/backup`) triggers the mount. |
### 13.6.2 Data Structures

```rust
/// State for one autofs filesystem instance (one mount point).
pub struct AutofsMount {
    /// Pipe to the automount daemon. Kernel writes AutofsPacket messages here.
    pub pipe: Arc<Pipe>,
    /// Protocol version negotiated with the daemon (UmkaOS implements v5).
    pub proto_version: u32,
    /// Whether the daemon has declared itself gone (catatonic state).
    pub catatonic: AtomicBool,
    /// Idle timeout in seconds after which expire packets are sent.
    pub timeout_secs: AtomicU32,
    /// All outstanding lookup requests waiting for daemon response.
    pub pending: Mutex<HashMap<u32, Arc<AutofsPendingRequest>>>,
    /// Monotonically increasing token counter (wraps at u32::MAX).
    pub next_token: AtomicU32,
    /// Mount type: indirect or direct.
    pub mount_type: AutofsMountType,
}

pub enum AutofsMountType {
    Indirect,
    Direct,
    Offset, // Internal: used for sub-mounts within a multi-mount map.
}

/// One outstanding automount request.
pub struct AutofsPendingRequest {
    /// Token echoed back in the daemon's IOC_READY / IOC_FAIL ioctl.
    pub token: u32,
    /// Path component that triggered the lookup (indirect) or full path (direct).
    pub name: CString,
    /// Sleeping callers blocked on this mount.
    pub waitq: WaitQueue,
    /// Result set by the daemon: Ok(()) on success, Err(errno) on failure.
    pub result: Once<Result<()>>,
}

/// Packet written to the daemon pipe for a missing mount (protocol v5).
#[repr(C)]
pub struct AutofsPacketMissing {
    pub hdr: AutofsPacketHdr,
    /// Token for AUTOFS_IOC_READY / AUTOFS_IOC_FAIL.
    pub wait_queue_token: u32,
    /// Length of `name` (not including NUL).
    pub len: i32,
    /// Name of the missing directory component (NUL-terminated).
    pub name: [u8; NAME_MAX + 1],
}

/// Packet written to the daemon pipe requesting expiry of an idle mount.
#[repr(C)]
pub struct AutofsPacketExpire {
    pub hdr: AutofsPacketHdr,
    pub wait_queue_token: u32,
    pub len: i32,
    pub name: [u8; NAME_MAX + 1],
}

/// Common packet header.
#[repr(C)]
pub struct AutofsPacketHdr {
    pub proto_version: u32,
    pub packet_type: AutofsPacketType,
}

#[repr(u32)]
pub enum AutofsPacketType {
    Missing = 0,
    Expire = 1,
}
```
### 13.6.3 Automount Protocol

Trigger sequence (the fast path through VFS path resolution):
```
autofs_d_automount(dentry, path) -> Result<Option<Arc<VfsMount>>>:
  Precondition: called from REF-walk (never RCU-walk; see Section 13.6.6).
  1. Obtain the AutofsMount for this dentry's superblock.
  2. If catatonic: return Err(ENOENT) immediately.
  3. Check if `dentry` is already a mount point (DCACHE_MOUNTED set):
     return Ok(None) — another thread raced and completed the mount.
  4. Allocate token = next_token.fetch_add(1, Relaxed).
  5. Construct AutofsPacketMissing { token, name = dentry.name or full path }.
  6. Insert Arc<AutofsPendingRequest> into pending table under token.
  7. Write packet to pipe (non-blocking; if pipe is full, return ENOMEM —
     the daemon is overloaded).
  8. Sleep on pending.waitq with timeout = timeout_secs seconds.
  9. On wake:
     a. Remove request from pending table.
     b. If result is Ok(()):
        - Verify dentry is now a mount point (DCACHE_MOUNTED).
        - Return Ok(None) (VFS follow_mount() will handle the new mount).
     c. If result is Err(e): return Err(e).
  10. On timeout:
      a. Remove request from pending table.
      b. Return Err(ETIMEDOUT).
```
Daemon response (via ioctl on the autofs pipe fd or mount point fd):
```
AUTOFS_IOC_READY(token: u32):
  1. Acquire pending lock; look up token.
  2. If not found: return ENXIO (stale token; request already timed out).
  3. Set request.result = Ok(()).
  4. Wake all waiters on request.waitq.
  5. Remove from pending table.

AUTOFS_IOC_FAIL(token: u32):
  1. Acquire pending lock; look up token.
  2. If not found: return ENXIO.
  3. Set request.result = Err(ENOENT).
  4. Wake all waiters.
  5. Remove from pending table.
```
Multiple callers may race to access the same missing path simultaneously. All of
them find the same `AutofsPendingRequest` in the pending table (inserted by the
first caller) and sleep on the same `waitq`. When the daemon responds, all waiters
wake together.
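This shared-request pattern can be sketched in user space with `std::sync`
primitives standing in for the kernel's `WaitQueue` and `Once`. All names here
are illustrative, and the condvar loop replaces the kernel's timed sleep.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Illustrative stand-in for AutofsPendingRequest: one result slot, one condvar.
struct Pending {
    result: Mutex<Option<Result<(), i32>>>,
    cv: Condvar,
}

fn main() {
    let table: Arc<Mutex<HashMap<u32, Arc<Pending>>>> =
        Arc::new(Mutex::new(HashMap::new()));
    let token = 1u32;

    // First caller inserts the request; racing callers find and reuse it.
    let req = table
        .lock()
        .unwrap()
        .entry(token)
        .or_insert_with(|| Arc::new(Pending { result: Mutex::new(None), cv: Condvar::new() }))
        .clone();

    // Three racing lookups all block on the same request.
    let waiters: Vec<_> = (0..3)
        .map(|_| {
            let r = req.clone();
            thread::spawn(move || {
                let mut guard = r.result.lock().unwrap();
                while guard.is_none() {
                    guard = r.cv.wait(guard).unwrap();
                }
                guard.take().unwrap() // every waiter sees the same verdict
            })
        })
        .collect();

    // Daemon side: AUTOFS_IOC_READY(token) sets the result and wakes everyone.
    *req.result.lock().unwrap() = Some(Ok(()));
    req.cv.notify_all();
    table.lock().unwrap().remove(&token);

    for w in waiters {
        assert_eq!(w.join().unwrap(), Ok(()));
    }
    println!("all waiters woke with the same result");
}
```

Note that `guard.take()` empties the slot after the first reader in this sketch;
the kernel's `Once<Result<()>>` instead lets every waiter read the value.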
### 13.6.4 Control Interface

All autofs control operations are performed via `ioctl(2)` on the file descriptor
of the autofs pipe (passed to the kernel at mount time via the `fd=N` mount option)
or on a file descriptor opened on the autofs mount point itself.

| ioctl | Direction | Description |
|---|---|---|
| `AUTOFS_IOC_READY` | daemon→kernel | Mount succeeded for token. |
| `AUTOFS_IOC_FAIL` | daemon→kernel | Mount failed for token. |
| `AUTOFS_IOC_CATATONIC` | daemon→kernel | Daemon is exiting; all future lookups fail with ENOENT. |
| `AUTOFS_IOC_PROTOVER` | kernel→daemon | Returns protocol version (5 for UmkaOS). |
| `AUTOFS_IOC_SETTIMEOUT` | daemon→kernel | Sets idle expiry timeout in seconds. |
| `AUTOFS_IOC_EXPIRE` | kernel→daemon | Requests daemon to expire (unmount) one idle subtree. |
| `AUTOFS_IOC_EXPIRE_MULTI` | kernel→daemon | Requests daemon to expire up to N idle subtrees. |
| `AUTOFS_IOC_EXPIRE_INDIRECT` | kernel→daemon | Like EXPIRE but limited to indirect-mode subtrees. |
| `AUTOFS_IOC_EXPIRE_DIRECT` | kernel→daemon | Like EXPIRE but limited to direct-mode mount points. |
| `AUTOFS_IOC_PROTOSUBVER` | kernel→daemon | Returns protocol sub-version (UmkaOS: 2). |
| `AUTOFS_IOC_ASKUMOUNT` | daemon→kernel | Query whether the autofs mount point can be unmounted. |
### 13.6.5 Expiry

After an autofs-triggered mount has been idle for `timeout_secs` seconds, the
kernel initiates expiry. Expiry is cooperative: the kernel asks the daemon to
consider unmounting; the daemon decides whether conditions are met (no processes
have open files under the mount, no active chdir into it) and issues `umount(2)`
if appropriate.
```
autofs_expire_run(mount: &AutofsMount):
  Executed from a kernel timer callback at intervals of timeout_secs / 4.
  1. Walk all mounts that are children of this autofs mount point.
  2. For each child mount M:
     a. Compute idle_time = now - M.last_access_time.
     b. If idle_time < timeout_secs: skip.
     c. If any process has an open fd into M's subtree (check mount's
        open-file reference count): skip.
     d. Allocate token = next_token.fetch_add(1, Relaxed).
     e. Write AutofsPacketExpire { token, name = M.mountpoint_name } to pipe.
     f. Insert AutofsPendingRequest into pending table.
     g. Daemon calls AUTOFS_IOC_READY(token) after umount(2) succeeds, or
        AUTOFS_IOC_FAIL(token) if the mount is still busy.
  3. The timer reschedules itself unless the mount is in catatonic state.
```
The expiry path does not sleep in the kernel; it is fire-and-forget from the kernel's perspective. The daemon drives the actual unmount.
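The eligibility test in steps 2a through 2c reduces to a pure predicate. The
struct and field names below are illustrative stand-ins for the kernel's mount
bookkeeping, not its actual layout.

```rust
/// Illustrative child-mount snapshot for the expiry filter sketch.
struct ChildMount {
    last_access_secs: u64,
    open_refs: u32,
}

/// A child mount is an expiry candidate only when it has been idle at least
/// `timeout_secs` and no process holds an open fd into its subtree.
fn expire_candidate(m: &ChildMount, now_secs: u64, timeout_secs: u64) -> bool {
    let idle = now_secs.saturating_sub(m.last_access_secs);
    idle >= timeout_secs && m.open_refs == 0
}

fn main() {
    let busy = ChildMount { last_access_secs: 900, open_refs: 1 };
    let idle = ChildMount { last_access_secs: 100, open_refs: 0 };
    assert!(!expire_candidate(&busy, 1000, 300)); // open fd blocks expiry
    assert!(!expire_candidate(&idle, 350, 300));  // only 250 s idle, not enough
    assert!(expire_candidate(&idle, 1000, 300));  // 900 s idle, no refs
    println!("expiry filter ok");
}
```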
### 13.6.6 VFS Integration

autofs inserts itself into the VFS path walk at the `d_automount` dentry operation
hook, which is called by `follow_automount()` inside the path resolution loop
(Section 13.1.3):
```
follow_automount(path, nd) -> Result<()>:
  1. Verify nd.flags does not include LOOKUP_NO_AUTOMOUNT.
  2. Call dentry.ops.d_automount(dentry, path) → new_mnt (may be None).
  3. If new_mnt is Some(mnt): call do_add_mount(mnt, path).
  4. Continue path walk over the now-mounted subtree.
```
**RCU-walk downgrade**: `d_automount` cannot sleep, and sleeping is required to
wait for the daemon response. Therefore, if the path walk is in RCU mode (the
optimistic lockless fast path), it is downgraded to REF-walk before
`d_automount` is called. The downgrade is performed by `unlazy_walk()`, which
acquires reference counts on the path components traversed so far. Once in
REF-walk, the kernel can sleep safely in `autofs_d_automount`.

**`LOOKUP_NO_AUTOMOUNT`**: Certain operations (`stat`, `openat` with
`O_NOFOLLOW | O_PATH`, `utimensat` with `AT_SYMLINK_NOFOLLOW`) set this flag to
avoid triggering automounts on stat-only access. This matches Linux semantics.
### 13.6.7 Mount Options

autofs is mounted by the daemon at startup with options passed via the `data`
argument to `mount(2)`:

| Option | Description |
|---|---|
| `fd=N` | File descriptor of the daemon-side pipe end. Required. |
| `uid=N` | UID of the daemon process. Used for permission checks on expire. |
| `gid=N` | GID of the daemon process. |
| `minproto=N` | Minimum acceptable protocol version (daemon's minimum). |
| `maxproto=N` | Maximum acceptable protocol version (daemon's maximum). |
| `indirect` | Mount in indirect mode (default). |
| `direct` | Mount in direct mode. |
| `offset` | Mount in offset mode (internal; used by the daemon for sub-mounts). |
UmkaOS implements autofs protocol version 5, sub-version 2, matching the version
supported by systemd's automount daemon as of systemd v252+. The protocol version
is negotiated at mount time: the kernel picks `min(maxproto, UMKA_PROTO_VERSION)`
and returns it via `AUTOFS_IOC_PROTOVER`.
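The negotiation rule above can be sketched as a small function. The function name
and the errno chosen for the no-overlap case are assumptions of this sketch, not
documented kernel behaviour.

```rust
const UMKA_PROTO_VERSION: u32 = 5;

/// Sketch of mount-time version negotiation: the kernel picks
/// min(maxproto, UMKA_PROTO_VERSION) and rejects the mount when the
/// daemon's minimum is higher than anything the kernel can speak.
fn negotiate_proto(minproto: u32, maxproto: u32) -> Result<u32, i32> {
    let chosen = maxproto.min(UMKA_PROTO_VERSION);
    if chosen < minproto {
        return Err(-22); // illustrative -EINVAL: no common protocol version
    }
    Ok(chosen)
}

fn main() {
    assert_eq!(negotiate_proto(3, 5), Ok(5)); // typical systemd range → v5
    assert_eq!(negotiate_proto(3, 4), Ok(4)); // older daemon capped at v4
    assert_eq!(negotiate_proto(6, 7), Err(-22)); // daemon requires more than v5
    println!("negotiation ok");
}
```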
### 13.6.8 systemd Integration

A systemd `.automount` unit creates an autofs mount point at the path specified
by `Where=`, paired with a `.mount` unit of the same name. systemd acts as the
automount daemon:
- At unit activation, systemd calls `mount("autofs", Where, "autofs", 0, "fd=N,...")`.
- When an `AutofsPacketMissing` arrives on the pipe, systemd activates the
  corresponding `.mount` unit (which runs `mount(2)` for the real filesystem).
- On success, systemd calls `AUTOFS_IOC_READY(token)`; on failure,
  `AUTOFS_IOC_FAIL(token)`.
- `TimeoutIdleSec=` in the `.automount` unit maps directly to `AUTOFS_IOC_SETTIMEOUT`.
- After the idle timeout, systemd receives `AutofsPacketExpire` and issues
  `umount(2)` if the mount is not busy, then calls `AUTOFS_IOC_READY(token)`.
Example unit (`/etc/systemd/system/home.automount`):

```ini
[Unit]
Description=Automount /home via NFS

[Automount]
Where=/home
TimeoutIdleSec=300

[Install]
WantedBy=multi-user.target
```
Paired with `/etc/systemd/system/home.mount`, which specifies the NFS source and
options. systemd creates the autofs mount point when the `.automount` unit starts
and tears it down when the unit stops.
### 13.6.9 Linux Compatibility

UmkaOS's autofs implementation is wire-compatible with Linux autofs4:
- Protocol version 5, sub-version 2 — matches Linux kernel 5.0+.
- All ioctl numbers are identical to Linux (`AUTOFS_IOC_*` from `<linux/auto_fs.h>`).
- The `AutofsPacketMissing` and `AutofsPacketExpire` structs are `#[repr(C)]` and
  match the Linux kernel ABI exactly.
- Mount option string format (`fd=N,uid=N,...`) matches Linux.
- systemd's automount daemon, the `autofs(5)` userspace tools, and `mount.autofs`
  all operate without modification against UmkaOS's autofs implementation.
## 13.7 FUSE — Filesystem in Userspace

FUSE allows user-space processes to implement complete filesystems. A FUSE filesystem
daemon opens `/dev/fuse` (character device, major 10, minor 229), mounts a
filesystem whose superblock reports `FUSE_SUPER_MAGIC`, and serves kernel VFS calls
by reading and writing structured FUSE messages over the device fd. Any FUSE
protocol-compliant daemon runs without modification on UmkaOS.
### 13.7.1 Architecture

```
User Process (e.g., sshfs, rclone, glusterfs-fuse)
      │ write(fuse_fd, fuse_out_header + reply)
      │ read(fuse_fd, fuse_in_header + args)
      ▼
/dev/fuse (character device, major 10 minor 229)
      │
 ┌────┴────────────────────────────────────────┐
 │ FuseConn: pending request queue             │
 │ FuseInode: nodeid → dentry mapping          │
 └────┬────────────────────────────────────────┘
      │ VFS callbacks → fuse_request dispatch
      ▼
UmkaOS VFS layer (lookup, read, write, open, ...)
      │
POSIX application
```
The FUSE connection object (`FuseConn`) is the central coordination point. It
maintains two queues: `pending` (requests waiting for the daemon to pick up) and
`processing` (requests sent to the daemon, awaiting reply). Each VFS thread that
triggers a FUSE operation enqueues a request and blocks until the daemon writes
the corresponding reply.
### 13.7.2 Core Data Structures

```rust
/// One FUSE connection — shared between all fds opened on this mount.
pub struct FuseConn {
    /// Pending requests waiting for the daemon to read.
    pub pending: Mutex<VecDeque<Arc<FuseRequest>>>,
    /// Requests sent to the daemon, awaiting reply.
    pub processing: Mutex<BTreeMap<u64, Arc<FuseRequest>>>,
    /// Wait queue: daemon blocked in read() waiting for new requests.
    pub waitq: WaitQueue,
    /// Connection options negotiated via FUSE_INIT.
    pub opts: FuseConnOpts,
    /// Next unique request ID (monotonically increasing).
    pub next_unique: AtomicU64,
    /// True after the daemon has exchanged FUSE_INIT.
    pub initialized: AtomicBool,
    /// True when the connection is shutting down.
    pub destroyed: AtomicBool,
    /// Maximum write size negotiated (from FUSE_INIT reply).
    pub max_write: u32,
    /// Maximum read size.
    pub max_read: u32,
}

/// A single FUSE request/reply pair.
pub struct FuseRequest {
    /// Monotonic ID — matches `FuseInHeader.unique` and `FuseOutHeader.unique`.
    pub unique: u64,
    pub opcode: FuseOpcode,
    /// Serialized FUSE input args (everything after the `FuseInHeader`).
    pub in_args: Vec<u8>,
    pub reply: Mutex<FuseReply>,
    /// Woken when `reply` transitions to `Done`.
    pub waker: WaitEntry,
}

/// State of a request's reply.
pub enum FuseReply {
    /// Not yet answered by the daemon.
    Pending,
    /// Reply bytes, or a negative errno on error.
    Done(Result<Vec<u8>, i32>),
}

/// FUSE connection options negotiated during FUSE_INIT.
pub struct FuseConnOpts {
    pub max_write: u32,
    pub max_read: u32,
    pub max_pages: u16,
    /// Capabilities declared by the daemon (server side).
    pub capable: FuseInitFlags,
    /// Capabilities the kernel requests (client side).
    pub want: FuseInitFlags,
    /// Timestamp granularity in nanoseconds (0 = 1 ns, i.e., full precision).
    pub time_gran: u32,
    pub writeback_cache: bool,
    pub parallel_dirops: bool,
    pub async_dio: bool,
    pub posix_acl: bool,
    pub default_permissions: bool,
    pub allow_other: bool,
}
```
`FuseConn` is reference-counted via `Arc` and held by:
- The superblock of the mounted filesystem.
- Every open file descriptor on `/dev/fuse` belonging to that mount.

When the last daemon fd is closed, `FuseConn.destroyed` is set and all further
VFS operations return `EIO`. The mount point must then be explicitly unmounted
with `fusermount -u` or `umount`.
### 13.7.3 Wire Protocol

All FUSE communication is framed with fixed headers. The kernel writes a request
header followed by opcode-specific arguments; the daemon writes a reply header
followed by opcode-specific data.
```rust
/// Fixed header preceding every FUSE request (kernel → daemon).
#[repr(C)]
pub struct FuseInHeader {
    /// Total request length (this header + opcode args).
    pub len: u32,
    /// Opcode (FuseOpcode value).
    pub opcode: u32,
    /// Unique request ID; matched by the reply.
    pub unique: u64,
    /// Target inode number (0 for FUSE_INIT / FUSE_STATFS).
    pub nodeid: u64,
    /// Effective UID of the calling process.
    pub uid: u32,
    /// Effective GID of the calling process.
    pub gid: u32,
    /// PID of the calling process.
    pub pid: u32,
    pub padding: u32,
}

/// Fixed header preceding every FUSE reply (daemon → kernel).
#[repr(C)]
pub struct FuseOutHeader {
    /// Total reply length (this header + reply data).
    pub len: u32,
    /// 0 on success; negative errno on error (e.g., -ENOENT = -2).
    pub error: i32,
    /// Matches the `unique` field from the corresponding `FuseInHeader`.
    pub unique: u64,
}
```
Requests and replies are variable-length. The daemon must read exactly
`FuseInHeader.len` bytes per request and must write exactly `FuseOutHeader.len`
bytes per reply. A short read or write is a protocol error and terminates the
connection.

`FUSE_FORGET` and `FUSE_BATCH_FORGET` are the only opcodes that carry no reply;
the daemon must not write a reply for them.
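From the daemon's side, this framing reduces to decoding the fixed header off the
front of each `read()` buffer and checking its length field. The sketch below is
illustrative: `InHeader` keeps only the first four fields, assumes little-endian
wire order (as on x86-64 with the `#[repr(C)]` struct above), and uses a
header-only 40-byte frame purely for demonstration.

```rust
/// Illustrative decoded view of the first four FuseInHeader fields.
#[derive(Debug, PartialEq)]
struct InHeader {
    len: u32,
    opcode: u32,
    unique: u64,
    nodeid: u64,
}

fn parse_in_header(buf: &[u8]) -> Result<InHeader, &'static str> {
    // FuseInHeader is 40 bytes: 2x u32, 2x u64, then uid/gid/pid/padding.
    if buf.len() < 40 {
        return Err("short read: smaller than FuseInHeader");
    }
    let u32_at = |o: usize| u32::from_le_bytes(buf[o..o + 4].try_into().unwrap());
    let u64_at = |o: usize| u64::from_le_bytes(buf[o..o + 8].try_into().unwrap());
    let hdr = InHeader {
        len: u32_at(0),
        opcode: u32_at(4),
        unique: u64_at(8),
        nodeid: u64_at(16),
    };
    if hdr.len as usize != buf.len() {
        return Err("len field disagrees with bytes read"); // protocol error
    }
    Ok(hdr)
}

fn main() {
    // A header-only frame for illustration: len 40, opcode 15 (FUSE_READ),
    // unique 1, nodeid 0 (a real FUSE_READ would carry args after the header).
    let mut buf = vec![0u8; 40];
    buf[0..4].copy_from_slice(&40u32.to_le_bytes());
    buf[4..8].copy_from_slice(&15u32.to_le_bytes());
    buf[8..16].copy_from_slice(&1u64.to_le_bytes());
    let hdr = parse_in_header(&buf).unwrap();
    assert_eq!((hdr.opcode, hdr.unique, hdr.nodeid), (15, 1, 0));
    println!("parsed header ok");
}
```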
### 13.7.4 FUSE Opcodes

The direction column records who initiates the message: K→D = kernel to daemon (a
VFS call from a user process), D→K = daemon to kernel (a notify or retrieve reply
with no corresponding VFS initiator).
| Opcode | Value | Direction | Description |
|---|---|---|---|
| FUSE_LOOKUP | 1 | K→D | Lookup a name within a directory |
| FUSE_FORGET | 2 | K→D | Decrement inode reference count (no reply) |
| FUSE_GETATTR | 3 | K→D | Fetch inode attributes |
| FUSE_SETATTR | 4 | K→D | Modify inode attributes |
| FUSE_READLINK | 5 | K→D | Read the target of a symbolic link |
| FUSE_SYMLINK | 6 | K→D | Create a symbolic link |
| FUSE_MKNOD | 8 | K→D | Create a special or regular file |
| FUSE_MKDIR | 9 | K→D | Create a directory |
| FUSE_UNLINK | 10 | K→D | Remove a file |
| FUSE_RMDIR | 11 | K→D | Remove a directory |
| FUSE_RENAME | 12 | K→D | Rename a file (v1; same mount) |
| FUSE_LINK | 13 | K→D | Create a hard link |
| FUSE_OPEN | 14 | K→D | Open a file |
| FUSE_READ | 15 | K→D | Read file data |
| FUSE_WRITE | 16 | K→D | Write file data |
| FUSE_STATFS | 17 | K→D | Query filesystem statistics |
| FUSE_RELEASE | 18 | K→D | Close file (last close releases the handle) |
| FUSE_FSYNC | 20 | K→D | Sync file data to stable storage |
| FUSE_SETXATTR | 21 | K→D | Set an extended attribute |
| FUSE_GETXATTR | 22 | K→D | Get an extended attribute value |
| FUSE_LISTXATTR | 23 | K→D | List all extended attribute names |
| FUSE_REMOVEXATTR | 24 | K→D | Remove an extended attribute |
| FUSE_FLUSH | 25 | K→D | Flush on close (sent before FUSE_RELEASE) |
| FUSE_INIT | 26 | K→D | Initialize connection (first message exchanged) |
| FUSE_OPENDIR | 27 | K→D | Open a directory |
| FUSE_READDIR | 28 | K→D | Read directory entries |
| FUSE_RELEASEDIR | 29 | K→D | Close a directory |
| FUSE_FSYNCDIR | 30 | K→D | Sync directory metadata to stable storage |
| FUSE_GETLK | 31 | K→D | Test a POSIX byte-range lock |
| FUSE_SETLK | 32 | K→D | Acquire or release a POSIX lock (non-blocking) |
| FUSE_SETLKW | 33 | K→D | Acquire a POSIX lock (blocking) |
| FUSE_ACCESS | 34 | K→D | Check access (used only when default_permissions is false) |
| FUSE_CREATE | 35 | K→D | Atomically create and open a file |
| FUSE_INTERRUPT | 36 | K→D | Cancel a pending request |
| FUSE_BMAP | 37 | K→D | Map logical file block to device block |
| FUSE_DESTROY | 38 | K→D | Tear down the connection |
| FUSE_IOCTL | 39 | K→D | Forward an ioctl to the userspace filesystem |
| FUSE_POLL | 40 | K→D | Poll a file for readiness events |
| FUSE_NOTIFY_REPLY | 41 | D→K | Deliver data in response to FUSE_NOTIFY_RETRIEVE |
| FUSE_BATCH_FORGET | 42 | K→D | Drop references for multiple inodes at once |
| FUSE_FALLOCATE | 43 | K→D | Pre-allocate or de-allocate file space |
| FUSE_READDIRPLUS | 44 | K→D | Read directory entries together with their attributes |
| FUSE_RENAME2 | 45 | K→D | Rename with RENAME_EXCHANGE or RENAME_NOREPLACE |
| FUSE_LSEEK | 46 | K→D | Seek with SEEK_DATA or SEEK_HOLE |
| FUSE_COPY_FILE_RANGE | 47 | K→D | Server-side copy (copy_file_range) |
| FUSE_SETUPMAPPING | 48 | K→D | Set up a DAX direct memory mapping |
| FUSE_REMOVEMAPPING | 49 | K→D | Remove a DAX mapping |
| FUSE_SYNCFS | 50 | K→D | Sync the entire filesystem |
| FUSE_TMPFILE | 51 | K→D | Create an unnamed temporary file (O_TMPFILE) |
| FUSE_STATX | 52 | K→D | Extended stat (statx(2)) |
Notify messages (daemon → kernel, unsolicited; no reply is sent by the kernel
except for FUSE_NOTIFY_RETRIEVE which expects FUSE_NOTIFY_REPLY):
| Notify code | Description |
|---|---|
| FUSE_NOTIFY_POLL | Wake all pollers on the specified file handle |
| FUSE_NOTIFY_INVAL_INODE | Invalidate cached attributes and, optionally, a byte range of page cache |
| FUSE_NOTIFY_INVAL_ENTRY | Invalidate a specific dentry in a parent directory |
| FUSE_NOTIFY_STORE | Pre-populate a byte range of the page cache |
| FUSE_NOTIFY_RETRIEVE | Request the kernel to send page-cache contents back to the daemon |
| FUSE_NOTIFY_DELETE | Remove a dentry without a round-trip FUSE_LOOKUP failure |
13.7.5 FUSE_INIT Handshake
FUSE_INIT is always the first message exchanged. The kernel sends
FuseInitIn and the daemon replies with FuseInitOut. The two sides negotiate
protocol version and capability flags; the connection uses the minimum agreed
minor version.
/// FUSE_INIT request body (kernel → daemon).
#[repr(C)]
pub struct FuseInitIn {
/// FUSE major protocol version (kernel sends 7).
pub major: u32,
/// FUSE minor protocol version (kernel sends 40 for Linux 6.10 equivalent).
pub minor: u32,
pub max_readahead: u32,
/// Capability bitmask the kernel supports (low 32 bits of FuseInitFlags).
/// Wire format: flags = FuseInitFlags bits 0-31 (low 32 bits).
pub flags: u32,
/// Extended capability flags (protocol minor ≥ 36, FUSE_INIT_EXT must be set in flags).
/// Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
/// This matches the FUSE protocol extension for large flag sets (kernel 5.13+).
pub flags2: u32,
pub unused: [u32; 11],
}
/// FUSE_INIT reply body (daemon → kernel).
#[repr(C)]
pub struct FuseInitOut {
pub major: u32,
pub minor: u32,
pub max_readahead: u32,
/// Capabilities the daemon acknowledges and enables (low 32 bits of FuseInitFlags).
/// Wire format: flags = FuseInitFlags bits 0-31 (low 32 bits).
pub flags: u32,
/// Maximum number of outstanding background requests.
pub max_background: u16,
/// Congestion threshold: kernel slows down at this many background requests.
pub congestion_threshold: u16,
/// Maximum bytes per WRITE request.
pub max_write: u32,
/// Timestamp granularity in nanoseconds (0 = 1 ns, i.e., full precision).
pub time_gran: u32,
/// Maximum scatter-gather page count per request.
pub max_pages: u16,
/// Alignment required for DAX mappings.
pub map_alignment: u16,
/// Extended flags (protocol minor ≥ 36, requires FUSE_INIT_EXT set in flags).
/// Wire format: flags2 = FuseInitFlags bits 32-63 shifted down 32 bits.
/// This matches the FUSE protocol extension for large flag sets (kernel 5.13+).
pub flags2: u32,
pub max_stack_depth: u32,
pub unused: [u32; 6],
}
bitflags! {
/// Capability flags exchanged during FUSE_INIT.
pub struct FuseInitFlags: u64 {
/// Daemon supports asynchronous read requests.
const ASYNC_READ = 1 << 0;
/// Daemon handles POSIX advisory byte-range locks.
const POSIX_LOCKS = 1 << 1;
/// Daemon uses file handles returned in open replies.
const FILE_OPS = 1 << 2;
/// Daemon handles O_TRUNC atomically in open.
const ATOMIC_O_TRUNC = 1 << 3;
/// Filesystem supports NFS export (node IDs are stable across reboots).
const EXPORT_SUPPORT = 1 << 4;
/// Daemon supports writes larger than 4 KiB.
const BIG_WRITES = 1 << 5;
/// Kernel should not apply the process umask to create operations.
const DONT_MASK = 1 << 6;
/// Daemon supports splice(2)-based writes.
const SPLICE_WRITE = 1 << 7;
/// Daemon supports splice(2)-based moves.
const SPLICE_MOVE = 1 << 8;
/// Daemon supports splice(2)-based reads.
const SPLICE_READ = 1 << 9;
/// Daemon handles BSD flock() locking.
const FLOCK_LOCKS = 1 << 10;
/// Daemon supports ioctl on directories.
const HAS_IOCTL_DIR = 1 << 11;
/// Kernel auto-invalidates cached data on attribute changes.
const AUTO_INVAL_DATA = 1 << 12;
/// Kernel uses FUSE_READDIRPLUS instead of FUSE_READDIR.
const DO_READDIRPLUS = 1 << 13;
/// Kernel switches adaptively between READDIRPLUS and READDIR.
const READDIRPLUS_AUTO = 1 << 14;
/// Daemon supports asynchronous direct I/O.
const ASYNC_DIO = 1 << 15;
/// Daemon supports writeback caching (batched dirty page writeback).
const WRITEBACK_CACHE = 1 << 16;
/// Daemon does not need FUSE_OPEN (open is a no-op).
const NO_OPEN_SUPPORT = 1 << 17;
/// Parallel directory operations are safe (no serialization needed).
const PARALLEL_DIROPS = 1 << 18;
/// Kernel clears setuid/setgid bits on write (v1).
const HANDLE_KILLPRIV = 1 << 19;
/// Daemon supports POSIX ACLs.
const POSIX_ACL = 1 << 20;
/// Daemon sets error on abort rather than returning EIO.
const ABORT_ERROR = 1 << 21;
/// `max_pages` field in FuseInitOut is valid.
const MAX_PAGES = 1 << 22;
/// Daemon caches symlink targets.
const CACHE_SYMLINKS = 1 << 23;
/// Daemon does not need FUSE_OPENDIR.
const NO_OPENDIR_SUPPORT = 1 << 24;
/// Daemon explicitly invalidates data (FUSE_NOTIFY_INVAL_INODE).
const EXPLICIT_INVAL_DATA = 1 << 25;
/// `map_alignment` field in FuseInitOut is valid.
const MAP_ALIGNMENT = 1 << 26;
/// Daemon is aware of submount semantics.
const SUBMOUNTS = 1 << 27;
/// Kernel clears setuid/setgid bits on write (v2, extended semantics).
const HANDLE_KILLPRIV_V2 = 1 << 28;
/// Extended setxattr arguments (flags field present).
const SETXATTR_EXT = 1 << 29;
/// `flags2` fields in FuseInitIn/Out are valid.
const INIT_EXT = 1 << 30;
const INIT_RESERVED = 1 << 31;
}
}
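The flags/flags2 wire split described in the struct comments can be captured in two helpers. This is an illustrative sketch, not the actual UmkaOS code; `split_flags`, `join_flags`, and `HYPOTHETICAL_HI` are names invented here:

```rust
/// Split a 64-bit capability set into the two u32 wire fields.
fn split_flags(caps: u64) -> (u32, u32) {
    ((caps & 0xFFFF_FFFF) as u32, (caps >> 32) as u32)
}

/// Reassemble the 64-bit capability set from the wire fields.
fn join_flags(flags: u32, flags2: u32) -> u64 {
    (flags as u64) | ((flags2 as u64) << 32)
}

fn main() {
    const INIT_EXT: u64 = 1 << 30; // bit 30 travels in `flags`
    const HYPOTHETICAL_HI: u64 = 1 << 33; // a bit >= 32 would travel in `flags2`
    let caps = INIT_EXT | HYPOTHETICAL_HI;
    let (lo, hi) = split_flags(caps);
    assert_eq!(lo, 1 << 30);
    assert_eq!(hi, 1 << 1); // bit 33 shifted down 32 = bit 1
    assert_eq!(join_flags(lo, hi), caps);
}
```

Any flag at bit 32 or above is visible only when both sides negotiate INIT_EXT, since pre-extension daemons never read `flags2`.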
If the daemon returns a minor version lower than what the kernel sent, the kernel downconverts: fields that did not exist in the older protocol minor are ignored. If the daemon sends a major version other than 7, the kernel closes the connection.
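The negotiation rule above reduces to a small function. A minimal sketch, with illustrative names (the real handshake also records capability flags):

```rust
const FUSE_KERNEL_MAJOR: u32 = 7;
const FUSE_KERNEL_MINOR: u32 = 40;

/// Returns the effective protocol minor for the connection, or an error if
/// the major versions are incompatible (connection is closed).
fn negotiate(daemon_major: u32, daemon_minor: u32) -> Result<u32, &'static str> {
    if daemon_major != FUSE_KERNEL_MAJOR {
        return Err("major version mismatch: connection closed");
    }
    // The connection runs at the minimum of the two minors; fields unknown
    // to the older side are ignored.
    Ok(daemon_minor.min(FUSE_KERNEL_MINOR))
}

fn main() {
    assert_eq!(negotiate(7, 31), Ok(31)); // older daemon: kernel downconverts
    assert_eq!(negotiate(7, 43), Ok(40)); // newer daemon: capped at kernel minor
    assert!(negotiate(8, 1).is_err());    // wrong major: reject
}
```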
13.7.6 VFS Integration
FUSE registers filesystem type "fuse" with superblock magic FUSE_SUPER_MAGIC =
0x65735546. Mounting proceeds as follows:
mount(2) path
- User invokes `mount -t fuse -o fd=N,...` or uses the `fusermount3` helper.
- The kernel parses the `fd=N` mount option and resolves the fd to an open `/dev/fuse` file.
- A `FuseConn` is allocated and attached to the fd and the new superblock.
- The kernel sends `FUSE_INIT` and waits for the daemon's reply; on success, `FuseConn.initialized` is set and the mount completes.
VFS → FUSE dispatch
For every VFS operation on a FUSE mount (lookup, read, write, getattr, etc.) the kernel:
- Allocates a `FuseRequest` with a fresh `unique` ID.
- Serializes the opcode-specific arguments into `in_args`.
- Appends the request to `FuseConn.pending` and wakes the daemon's wait queue.
- Blocks on `FuseRequest.waker` until the daemon writes a reply.
- Deserializes the reply from `FuseRequest.reply` and returns to the VFS caller.
The daemon loop is simply:
loop {
bytes = read(fuse_fd, buf) // blocks until a request is pending
handle_opcode(parse(buf))
write(fuse_fd, reply_bytes) // unblocks the kernel thread
}
Interrupt handling
If the calling thread receives a fatal signal while waiting for a FUSE reply,
the kernel enqueues a FUSE_INTERRUPT message targeting the original request's
unique ID. It then waits a short grace period (default 20 milliseconds). If the
daemon does not abort the request and send a reply within that window, the kernel
forcibly removes the request from FuseConn.processing and returns EINTR to
the caller. The daemon is expected to ignore any subsequent reply it sends for
the interrupted unique.
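The grace-period wait can be sketched with std primitives standing in for the kernel wait-queue types. `PendingRequest` and `wait_after_interrupt` are illustrative names, not the actual UmkaOS symbols:

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

struct PendingRequest {
    /// Set by the daemon's write(2); None while the request is outstanding.
    reply: Mutex<Option<Vec<u8>>>,
    waker: Condvar,
}

/// After a fatal signal, wait up to `grace` for the daemon to abort and reply;
/// otherwise the request is forcibly completed with EINTR.
fn wait_after_interrupt(req: &PendingRequest, grace: Duration) -> Result<Vec<u8>, i32> {
    const EINTR: i32 = 4;
    let guard = req.reply.lock().unwrap();
    // Block until a reply is stored or the grace period elapses.
    let (guard, _timeout) = req
        .waker
        .wait_timeout_while(guard, grace, |r| r.is_none())
        .unwrap();
    // A reply that arrived in time is delivered; otherwise give up.
    guard.clone().ok_or(EINTR)
}

fn main() {
    let req = PendingRequest { reply: Mutex::new(None), waker: Condvar::new() };
    // No reply within the (shortened, for the example) grace window: EINTR.
    assert_eq!(wait_after_interrupt(&req, Duration::from_millis(20)), Err(4));
}
```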
Writeback cache (WRITEBACK_CACHE flag)
When this capability is negotiated, dirty pages accumulate in the kernel page
cache and are written to the daemon in larger batches via FUSE_WRITE. Without
it, every write(2) to a FUSE file generates an immediate, synchronous
FUSE_WRITE to the daemon, serializing all write traffic. Most performance-
sensitive FUSE filesystems negotiate WRITEBACK_CACHE.
Connection death
When the last daemon fd is closed (daemon exits, crashes, or explicitly calls
FUSE_DESTROY):
- `FuseConn.destroyed` is set atomically.
- All requests in `FuseConn.processing` are completed with error `ENOTCONN`.
- All requests in `FuseConn.pending` are discarded.
- Subsequent VFS operations on the mount return `EIO`.
- The mount point persists in the namespace; an explicit `umount` or `fusermount -u` is required to remove it.
13.7.7 Security Model
Mount-owner restriction (default)
Unless the allow_other mount option is passed, only the UID that opened
/dev/fuse and performed the mount may access the filesystem. All other UIDs
receive EACCES from the UmkaOS VFS layer before the request reaches the daemon,
regardless of the file mode bits the daemon returns.
allow_other option
Permits any UID to access the filesystem subject to normal Unix permission
checks. Because allow_other exposes the daemon process to arbitrary user
requests, it requires either:
- The SysAdmin capability in the mount namespace, or
- The /proc/sys/fs/fuse/user_allow_other sysctl set to 1 (off by default).
default_permissions option
When set, the kernel enforces standard Unix permission checks (owner, group,
other; st_mode, st_uid, st_gid) against the attributes the daemon returns
in FUSE_GETATTR. The kernel never sends FUSE_ACCESS in this mode. Without
default_permissions, the daemon is responsible for its own access control and
receives FUSE_ACCESS for every access check.
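The kernel-side check enabled by `default_permissions` is the classic owner/group/other evaluation against the daemon-supplied attributes. A simplified sketch (supplementary groups, ACLs, and capabilities omitted; names are illustrative):

```rust
/// Evaluate a 3-bit rwx request (`want`: 0o4 read, 0o2 write, 0o1 exec)
/// against the st_mode/st_uid/st_gid the daemon returned in FUSE_GETATTR.
fn unix_permission(st_mode: u32, st_uid: u32, st_gid: u32,
                   uid: u32, gid: u32, want: u32) -> bool {
    let bits = if uid == st_uid {
        (st_mode >> 6) & 0o7 // owner class
    } else if gid == st_gid {
        (st_mode >> 3) & 0o7 // group class
    } else {
        st_mode & 0o7 // other class
    };
    bits & want == want
}

fn main() {
    // 0o640: owner rw-, group r--, other ---
    assert!(unix_permission(0o640, 1000, 1000, 1000, 1000, 0o2)); // owner write ok
    assert!(unix_permission(0o640, 1000, 1000, 2000, 1000, 0o4)); // group read ok
    assert!(!unix_permission(0o640, 1000, 1000, 2000, 1000, 0o2)); // group write denied
    assert!(!unix_permission(0o640, 1000, 1000, 3000, 3000, 0o4)); // other read denied
}
```

Note the class selection is exclusive: a uid that matches the owner is judged by the owner bits only, even if the group bits would be more permissive.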
Privilege requirement for mounting
Unprivileged FUSE mounts (without SysAdmin) are permitted only through
fusermount3, which is installed setuid-root and validates that the user owns
the target mountpoint. Direct mount(2) requires SysAdmin in the current
user namespace.
13.7.8 io_uring FUSE
UmkaOS supports the io_uring-based FUSE I/O path (FUSE_URING feature, equivalent
to Linux 6.14+). The daemon opts in by negotiating the FUSE_URING capability
during FUSE_INIT and then submitting SQEs of type IORING_OP_URING_CMD to the
/dev/fuse fd rather than using blocking read/write.
Benefits over the classic blocking I/O path:
- Asynchronous request handling — the daemon can have many requests in flight simultaneously without blocking threads.
- Reduced syscall overhead — requests are batched via `io_uring_submit`; one syscall drains or fills multiple queue slots.
- CPU affinity — the daemon can pin io_uring workers to specific CPUs, reducing cross-socket latency for NUMA-aware FUSE filesystems.
The FUSE daemon registers a fixed buffer pool at startup. The kernel delivers
requests into pre-registered buffers, and the daemon submits replies via the same
ring. The wire format (FuseInHeader, FuseOutHeader, opcode bodies) is
unchanged; only the transport mechanism differs.
13.7.9 Linux Compatibility
- `/dev/fuse` device node (major 10, minor 229): identical to Linux.
- FUSE protocol version 7.40 (Linux 6.10 equivalent) is the maximum negotiated kernel version. Daemons advertising higher minors receive 7.40 in the reply.
- `libfuse3` (3.x series): works without modification.
- `fusermount3` and the `fuse.ko`-equivalent path: built into the UmkaOS VFS layer; no kernel module is required.
- All widely deployed FUSE filesystems run without modification: `sshfs`, `rclone mount`, `glusterfs-fuse`, `ceph-fuse`, `bindfs`, `s3fs-fuse`, `encfs`, `gvfs`, `ntfs-3g`.
- DAX (`FUSE_SETUPMAPPING`/`FUSE_REMOVEMAPPING`) is supported on persistent memory-backed FUSE mounts, providing zero-copy access to file data.
13.8 configfs — Kernel Object Configuration Filesystem
configfs is a RAM-resident pseudo-filesystem (similar to sysfs) that allows
user-space to create, configure, and destroy kernel objects by manipulating
directories and files under a single mount point. The key distinction from sysfs
is direction of control: sysfs exports kernel-managed objects to user-space,
while configfs gives user-space the power to instantiate new kernel objects via
mkdir.
configfs is used by:
- LIO iSCSI / NVMe-oF target (/sys/kernel/config/target/, /sys/kernel/config/nvmet/) — see Section 11 for the block-layer and NVMe-oF protocol details.
- USB gadget framework (/sys/kernel/config/usb_gadget/)
- 9pnet and netconsole subsystems
13.8.1 Architecture
User Space
mkdir / rmdir / cat / echo
│
/sys/kernel/config/
│ (VFS operations)
┌────────────┴────────────────────────┐
│ configfs VFS layer │
│ ConfigfsSubsystem → ConfigGroup │
│ ConfigItem → ConfigAttribute │
└────────────┬────────────────────────┘
│ callbacks
Kernel subsystem
(LIO, nvmet, USB gadget, ...)
User-space operates exclusively with POSIX filesystem primitives. No ioctl or dedicated syscall is needed. The kernel subsystem registers callback functions that the configfs VFS layer invokes in response to standard filesystem operations.
13.8.2 Data Structures
/// A configfs subsystem, registered by a kernel module at init time.
pub struct ConfigfsSubsystem {
/// Directory name created under /sys/kernel/config/.
pub name: &'static str,
/// Root group of this subsystem.
pub root: Arc<ConfigGroup>,
}
/// A configfs group — a directory that may contain items, subgroups, and
/// attributes. Groups may also carry a set of default child groups that are
/// created automatically when the group itself is created.
pub struct ConfigGroup {
pub item: ConfigItem,
/// Active children (items and subgroups) keyed by name.
pub children: RwLock<BTreeMap<String, ConfigChild>>,
/// Type descriptor controlling allowed operations on this group.
pub item_type: Arc<ConfigItemType>,
/// Subgroups automatically created alongside this group (not user-removable).
pub default_groups: Vec<Arc<ConfigGroup>>,
}
/// Discriminated union of group children.
pub enum ConfigChild {
Item(Arc<ConfigItem>),
Group(Arc<ConfigGroup>),
}
/// A configfs item — the leaf directory representing one kernel object.
pub struct ConfigItem {
pub name: Mutex<String>,
/// Reference count; item is dropped when it reaches zero.
pub kref: AtomicUsize,
pub parent: Weak<ConfigGroup>,
pub item_type: Arc<ConfigItemType>,
}
/// Type descriptor: defines the callbacks and attributes for an item or group.
pub struct ConfigItemType {
pub name: &'static str,
/// Called when the item's reference count drops to zero.
pub release: fn(&ConfigItem),
/// Attribute files exposed in every instance of this item type.
pub attrs: &'static [&'static dyn ConfigAttribute],
/// Returns additional child groups (used for complex multi-level objects).
pub groups: Option<fn(&ConfigItem) -> Vec<Arc<ConfigGroup>>>,
/// Create a new leaf item inside this group (triggered by mkdir).
pub make_item: Option<fn(group: &ConfigGroup, name: &str)
-> Result<Arc<ConfigItem>, KernelError>>,
/// Create a new subgroup inside this group (triggered by mkdir).
pub make_group: Option<fn(group: &ConfigGroup, name: &str)
-> Result<Arc<ConfigGroup>, KernelError>>,
/// Notify the subsystem before an item is removed (triggered by rmdir).
pub drop_item: Option<fn(group: &ConfigGroup, item: &ConfigItem)>,
}
/// A single configfs attribute — a regular file in the item directory.
pub trait ConfigAttribute: Send + Sync {
/// File name within the item directory.
fn name(&self) -> &str;
/// Unix permission bits (typically 0644 for read-write, 0444 for read-only).
fn mode(&self) -> u32;
/// Populate `buf` with a text representation of the attribute value.
/// Returns the number of bytes written.
fn show(&self, item: &ConfigItem, buf: &mut [u8]) -> Result<usize, KernelError>;
/// Parse `buf` and apply the new attribute value.
/// Returns the number of bytes consumed.
fn store(&self, item: &ConfigItem, buf: &[u8]) -> Result<usize, KernelError>;
}
Lifetimes and reference counting mirror those of the objects the subsystem
manages. A ConfigItem is kept alive as long as the directory exists in the
configfs namespace. Removal (rmdir) calls drop_item, decrements the kref,
and invokes release when the count reaches zero.
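An attribute in the style of `ConfigAttribute` might look like the following boolean "enable" file. This is a self-contained sketch with the kernel types replaced by std equivalents; `EnableAttr` is a hypothetical example, not a real subsystem attribute:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct EnableAttr {
    enabled: AtomicBool,
}

impl EnableAttr {
    /// show(): emit "0\n" or "1\n", newline-terminated for shell compatibility.
    fn show(&self, buf: &mut [u8]) -> Result<usize, i32> {
        let out: &[u8; 2] = if self.enabled.load(Ordering::Relaxed) { b"1\n" } else { b"0\n" };
        buf[..2].copy_from_slice(out);
        Ok(2)
    }

    /// store(): accept "0"/"1" with optional trailing whitespace, else EINVAL.
    fn store(&self, buf: &[u8]) -> Result<usize, i32> {
        const EINVAL: i32 = 22;
        match std::str::from_utf8(buf).map(str::trim) {
            Ok("1") => self.enabled.store(true, Ordering::Relaxed),
            Ok("0") => self.enabled.store(false, Ordering::Relaxed),
            _ => return Err(EINVAL),
        }
        Ok(buf.len()) // consume the whole write
    }
}

fn main() {
    let attr = EnableAttr { enabled: AtomicBool::new(false) };
    assert_eq!(attr.store(b"1\n"), Ok(2)); // echo 1 > enable
    let mut buf = [0u8; 8];
    assert_eq!(attr.show(&mut buf), Ok(2));
    assert_eq!(&buf[..2], b"1\n");
    assert_eq!(attr.store(b"yes"), Err(22)); // invalid value rejected
}
```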
13.8.3 Mount Point and Directory Layout
configfs is mounted at boot by configfs_init() and exposed at
/sys/kernel/config. User-space may also mount it manually:
mount -t configfs configfs /sys/kernel/config
Illustrative layout showing the NVMe-oF and iSCSI target subsystems (see Section 11 for full protocol details):
/sys/kernel/config/
├── target/ ← LIO iSCSI / generic target subsystem
│ ├── core/
│ │ └── iblock_0/ ← mkdir: create iblock backstore group
│ │ └── lio_disk0/ ← mkdir: create a new block device object
│ │ ├── dev ← echo /dev/sda > dev
│ │ ├── udev_path ← echo /dev/sda > udev_path
│ │ └── enable ← echo 1 > enable
│ └── iscsi/
│ └── iqn.2024-01.com.example:storage/ ← mkdir: create iSCSI target IQN
│ └── tpgt_1/ ← mkdir: create target portal group
│ ├── enable
│ ├── lun/
│ │ └── lun_0 → ../../core/iblock_0/lio_disk0 ← symlink
│ ├── acls/
│ │ └── iqn.2024-01.com.client:host1/
│ │ ├── auth/
│ │ └── mapped_lun0/
│ └── fabric_statistics/
├── nvmet/ ← NVMe-oF target subsystem
│ ├── subsystems/
│ │ └── nqn.2024-01.com.example:nvme-ssd/ ← mkdir: create NVMe subsystem NQN
│ │ ├── attr_allow_any_host
│ │ └── namespaces/
│ │ └── 1/ ← mkdir: create namespace ID 1
│ │ ├── device_path ← echo /dev/nvme0n1 > device_path
│ │ └── enable ← echo 1 > enable
│ └── ports/
│ └── 1/ ← mkdir: create NVMe-oF port
│ ├── addr_trtype ← echo tcp > addr_trtype
│ ├── addr_traddr ← echo 192.0.2.1 > addr_traddr
│ ├── addr_trsvcid ← echo 4420 > addr_trsvcid
│ └── subsystems/
│ └── nqn.2024-01.com.example:nvme-ssd ← symlink
└── usb_gadget/ ← USB gadget framework
└── g1/ ← mkdir: create a gadget instance
├── idVendor
├── idProduct
└── functions/
└── mass_storage.0/
└── lun.0/
└── file ← echo /dev/sdb > file
The directory hierarchy encodes object relationships. Symlinks express associations between independently-created objects (e.g., linking a LUN to its backing store, or attaching a subsystem to a port).
13.8.4 VFS Operations
configfs maps the five fundamental filesystem operations onto subsystem callbacks:
mkdir(path)
The parent directory's ConfigItemType is consulted. If make_group is
defined, a new ConfigGroup is allocated and returned as a subdirectory dentry.
If make_item is defined, a new ConfigItem is allocated and returned. Only one
of the two may be non-null for a given group type; attempting mkdir on a group
that defines neither returns EPERM. Default child groups are created
automatically alongside any new group.
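The mkdir dispatch rule reduces to a three-way branch. A sketch under the simplifying assumption that the callbacks are plain functions returning a name (all identifiers here are illustrative):

```rust
const EPERM: i32 = 1;

struct GroupType {
    make_item: Option<fn(&str) -> Result<String, i32>>,
    make_group: Option<fn(&str) -> Result<String, i32>>,
}

/// mkdir on a configfs directory: consult the parent group's type descriptor.
fn configfs_mkdir(ty: &GroupType, name: &str) -> Result<String, i32> {
    if let Some(mg) = ty.make_group {
        mg(name) // subdirectory that may itself contain children
    } else if let Some(mi) = ty.make_item {
        mi(name) // leaf item directory
    } else {
        Err(EPERM) // this group does not support user-created children
    }
}

fn main() {
    let leaf_only = GroupType {
        make_item: Some(|n| Ok(format!("item:{n}"))),
        make_group: None,
    };
    assert_eq!(configfs_mkdir(&leaf_only, "lio_disk0"), Ok("item:lio_disk0".to_string()));

    let sealed = GroupType { make_item: None, make_group: None };
    assert_eq!(configfs_mkdir(&sealed, "x"), Err(1)); // mkdir → EPERM
}
```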
rmdir(path)
The directory must be empty (no user-created children; default children are
exempt from this check and are removed automatically). drop_item is invoked
on the parent's ConfigItemType, then the item's kref is decremented. If the
kref reaches zero, release is called. Attempting to remove a non-empty
directory returns ENOTEMPTY.
open(attr_path) / read(attr_fd)
The fd is associated with the specific ConfigAttribute. read(2) invokes
ConfigAttribute::show(), which populates the kernel buffer with a text
representation. The output is always \n-terminated for shell compatibility.
open(attr_path) / write(attr_fd)
write(2) invokes ConfigAttribute::store() with the user-supplied buffer.
The subsystem parses and validates the value; on error it returns a negative
errno. Writes larger than PAGE_SIZE (4 KiB) are rejected with EINVAL to
prevent unbounded allocations.
symlink(src, dst)
Used to express dependencies between items: for example, associating a LUN
directory with a backstore object, or adding a subsystem to a port's subscriber
list. configfs validates that both the source and destination are within the
same configfs mount before creating the link. The subsystem's ConfigItemType
may reject symlinks by returning EPERM from an optional allow_link callback.
readdir
Returns all children of a group: items, subgroups, attribute files, and symlinks.
Attribute names are synthesized from ConfigItemType.attrs; no inode backing
store is needed.
13.8.5 Linux Compatibility
- `/sys/kernel/config/` mount point and directory layout: byte-for-byte identical to Linux configfs (kernel 5.0+).
- The `ConfigAttribute` read/write text format (newline-terminated strings, `echo value > file` idiom) matches Linux.
- LIO iSCSI target tools (`targetcli`, `targetcli-fb`, `rtslib-fb`) work without modification.
- NVMe-oF target tools (`nvmetcli`) work without modification; see Section 11 for NVMe-oF transport configuration details.
- USB gadget framework tooling (`configfs-gadget`, `libusbgx`) works without modification.
- Symlink semantics (cross-item dependencies) are identical to Linux: both source and destination must reside within the same configfs mount.
13.9 File Notification System
UmkaOS implements inotify and fanotify with full Linux syscall and wire-format compatibility. Internal delivery uses typed structured channels rather than raw fd-write protocols; the external syscall interfaces are byte-for-byte identical to Linux.
Two interfaces are provided:
- inotify: informational events (IN_CREATE, IN_MODIFY, etc.), delivered asynchronously via a file descriptor readable with `read(2)`.
- fanotify: a superset of inotify, plus permission events (`FAN_OPEN_PERM`, `FAN_ACCESS_PERM`, `FAN_OPEN_EXEC_PERM`) that block the originating syscall until userspace responds with allow or deny. Used by malware scanners, file integrity monitors, and backup software.
Both are implemented in umka-vfs. Event delivery hooks are called from within the VFS
operation dispatch paths — after permission checks pass, before returning to userspace.
13.9.1 inotify
13.9.1.1 In-Kernel Objects
/// Per-inotify-instance state. Created by inotify_init() / inotify_init1().
/// Exposed to userspace as a file descriptor (the fd is backed by a synthetic
/// inode in the anonymous inode filesystem; read(2) on it drains the event queue).
pub struct InotifyInstance {
/// Watch descriptors: maps wd → InotifyWatch.
/// Protected by an RwLock: concurrent watchers on disjoint inodes do not
/// contend, and watch addition/removal is infrequent.
pub watches: RwLock<BTreeMap<WatchDescriptor, Arc<InotifyWatch>>>,
/// Monotonically increasing allocator for watch descriptors.
/// WDs are 1-based positive integers per inotify_add_watch(2) contract.
pub next_wd: AtomicI32,
/// Per-instance event queue shared by all watches. Fixed capacity avoids
/// heap allocation under the spinlock; 256 events is sufficient for typical
/// usage. Overflow policy: when the queue is full the new event is discarded,
/// the `overflow` flag is set, and a synthetic `IN_Q_OVERFLOW` event is
/// prepended to the next successful read(2) (matches Linux inotify behavior).
pub event_queue: SpinLock<RingBuffer<InotifyEventBuf, 256>>,
/// Set when the event queue overflowed since the last read(2). A synthetic
/// `IN_Q_OVERFLOW` event is prepended to the next read response and this flag
/// is cleared. Separate from the queue to avoid occupying a queue slot.
pub overflow: AtomicBool,
/// Wait queue for poll()/select()/epoll() on this instance.
pub wait_queue: WaitQueueHead,
/// Flags from inotify_init1() (IN_CLOEXEC, IN_NONBLOCK).
pub flags: u32,
}
/// One inotify watch: a single inode being monitored for specific events.
pub struct InotifyWatch {
/// Watch descriptor (the value returned to userspace by inotify_add_watch).
pub wd: WatchDescriptor,
/// The inode being watched. Holds an Arc reference to prevent premature eviction
/// while the watch is active.
pub inode: Arc<Inode>,
/// Bitmask of watched events (IN_CREATE | IN_MODIFY | IN_CLOSE_WRITE | etc.).
pub mask: u32,
/// Back-reference to the owning InotifyInstance (weak to avoid cycles).
pub instance: Weak<InotifyInstance>,
}
/// Event delivered to userspace via read(2) on the inotify fd.
/// Matches the Linux inotify_event ABI exactly.
#[repr(C)]
pub struct InotifyEvent {
/// Watch descriptor that fired.
pub wd: i32,
/// Event type (IN_CREATE, IN_MODIFY, IN_DELETE, etc.).
pub mask: u32,
/// Links related IN_MOVED_FROM and IN_MOVED_TO events (same cookie = same rename).
pub cookie: u32,
/// Length of the name[] field in bytes, including the null terminator and any
/// trailing padding bytes. 0 if no filename is associated with this event
/// (e.g., IN_ATTRIB on a non-directory inode).
pub len: u32,
// Followed immediately by name[len]: null-terminated filename, valid only for
// events on directory inodes. Padded to a 4-byte boundary.
}
/// Internal buffer holding a complete inotify event + filename bytes.
pub struct InotifyEventBuf {
pub header: InotifyEvent,
/// The filename, null-padded to a multiple of 4 bytes.
pub name: Vec<u8>,
}
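The `len` rule from the `InotifyEvent` doc comment (null terminator plus padding to a 4-byte boundary, 0 when no name is attached) can be stated as a helper. `padded_name_len` is an illustrative name:

```rust
/// Compute the wire-format `len` field for an inotify event name.
fn padded_name_len(name: &[u8]) -> u32 {
    if name.is_empty() {
        return 0; // e.g. IN_ATTRIB on a non-directory inode
    }
    // +1 for the null terminator, then round up to a multiple of 4.
    (((name.len() + 1 + 3) / 4) * 4) as u32
}

fn main() {
    assert_eq!(padded_name_len(b""), 0);
    assert_eq!(padded_name_len(b"a"), 4);    // "a\0" padded to 4
    assert_eq!(padded_name_len(b"abc"), 4);  // "abc\0" is exactly 4
    assert_eq!(padded_name_len(b"abcd"), 8); // "abcd\0" padded to 8
}
```

The padding keeps each following event record 4-byte aligned in the read(2) stream, which is what the Linux ABI requires of readers iterating the buffer.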
13.9.1.2 VFS Integration Hooks
inotify events are generated from dentry/inode operation call sites within the VFS dispatch layer. The fast path check costs a single pointer load:
| VFS operation | Event(s) generated |
|---|---|
| create, mkdir, mknod, symlink | IN_CREATE on parent dir inode |
| unlink, rmdir | IN_DELETE on parent dir; IN_DELETE_SELF on the target inode |
| rename (source side) | IN_MOVED_FROM on old parent + cookie |
| rename (destination side) | IN_MOVED_TO on new parent + same cookie |
| open | IN_OPEN on the inode |
| read, readdir | IN_ACCESS on the inode |
| write, truncate, fallocate | IN_MODIFY on the inode |
| setattr (chmod/chown/utimes) | IN_ATTRIB on the inode |
| close (file was written) | IN_CLOSE_WRITE on the inode |
| close (read-only open) | IN_CLOSE_NOWRITE on the inode |
| inotify watch removed (inode evicted or inotify_rm_watch) | IN_IGNORED on the watch descriptor |
Each Inode carries an inotify_watches field:
/// Per-inode inotify watch list. None when no watches are active (the common case).
/// This field is checked on every relevant VFS operation; None costs a single
/// null pointer load with no branch misprediction.
pub inotify_watches: Option<SpinLock<Vec<Arc<InotifyWatch>>>>,
When the field is None (no watches active), the check is a single null pointer
comparison — zero overhead on the fast path for the vast majority of inodes.
13.9.1.3 Event Delivery Algorithm
fsnotify_inode_event(inode, event_mask, name, cookie):
watches_opt = inode.inotify_watches // single load
if watches_opt is None: return // fast path: no watches on this inode
watches = watches_opt.as_ref().lock()
for watch in watches.iter():
fired_mask = watch.mask & event_mask
if fired_mask == 0: continue
if let Some(instance) = watch.instance.upgrade():
buf = InotifyEventBuf {
header: InotifyEvent { wd: watch.wd, mask: fired_mask, cookie, len: name.len() + padding },
name: name_bytes_padded_to_4_bytes,
}
queue = instance.event_queue.lock()
if !queue.is_full():
queue.push(buf)
else:
// Queue overflow: set the overflow flag so that the next read(2) prepends
// a synthetic IN_Q_OVERFLOW event. The AtomicBool lives outside the spinlock;
// store is done while still holding the lock to ensure the writer side sees
// the flag before any reader drains the queue.
instance.overflow.store(true, Ordering::Release)
drop(queue)
instance.wait_queue.wake_up_one() // unblock read()/poll()
13.9.1.4 Syscall Implementations
inotify_add_watch(fd, path, mask) → wd:
1. Resolve path → inode using normal path resolution.
2. Look up fd → InotifyInstance.
3. Scan instance.watches for an existing watch on this inode:
- If found: update watch.mask = mask (OR behavior if IN_MASK_ADD flag is set;
replace otherwise). Return the existing wd.
4. Allocate a new WatchDescriptor from instance.next_wd.fetch_add(1).
5. Construct InotifyWatch { wd, inode: inode.clone(), mask, instance: Arc::downgrade(&instance) }.
6. Initialize inode.inotify_watches if it was None.
7. Insert the watch into both inode.inotify_watches and instance.watches.
8. Return wd.
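Step 3's mask-update semantics (OR with IN_MASK_ADD, replace otherwise) can be sketched directly; the flag constants match Linux, the function name is illustrative:

```rust
const IN_MASK_ADD: u32 = 0x2000_0000;

/// Compute the new watch mask for an existing watch on the same inode.
fn update_mask(existing: u32, requested: u32) -> u32 {
    let new_bits = requested & !IN_MASK_ADD; // the control flag itself is not stored
    if requested & IN_MASK_ADD != 0 {
        existing | new_bits // additive update
    } else {
        new_bits // replace the old mask entirely
    }
}

fn main() {
    const IN_CREATE: u32 = 0x100;
    const IN_MODIFY: u32 = 0x2;
    // Default: the old mask is discarded.
    assert_eq!(update_mask(IN_CREATE, IN_MODIFY), IN_MODIFY);
    // IN_MASK_ADD: bits accumulate across calls.
    assert_eq!(update_mask(IN_CREATE, IN_MODIFY | IN_MASK_ADD), IN_CREATE | IN_MODIFY);
}
```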
inotify_rm_watch(fd, wd) → 0:
1. Look up fd → InotifyInstance.
2. Remove the watch from instance.watches by wd. Return EINVAL if not found.
3. Remove the corresponding entry from inode.inotify_watches.
4. If inode.inotify_watches is now empty, set it to None.
5. Deliver an IN_IGNORED event to the instance.
6. Drop the Arc<InotifyWatch>.
13.9.1.5 Mandatory Event Coalescing
Coalescing rule (mandatory): Before enqueuing a new event, the delivery path
checks whether the tail of the instance's EventQueue is an identical event. If
so, the new event is discarded (coalesced) rather than enqueued. Two events are
identical if and only if:
fn events_are_identical(a: &InotifyEvent, b: &InotifyEvent) -> bool {
a.wd == b.wd &&
a.mask == b.mask &&
a.cookie == b.cookie &&
a.name == b.name // byte-for-byte name comparison
}
The check is against the tail only (O(1)), not the entire queue. Events are coalesced only when consecutive and identical — non-consecutive duplicates are not coalesced (ordering is preserved for different events between duplicates).
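The tail-only rule can be sketched as an enqueue wrapper, with the event reduced to the four compared fields and a `VecDeque` standing in for the ring buffer (names are illustrative; the cookie exemption from the next paragraph is included):

```rust
use std::collections::VecDeque;

#[derive(Clone, PartialEq)]
struct Ev { wd: i32, mask: u32, cookie: u32, name: Vec<u8> }

/// Returns true if the event was enqueued, false if it was coalesced away.
fn enqueue(queue: &mut VecDeque<Ev>, ev: Ev) -> bool {
    const IN_MOVED: u32 = (1 << 6) | (1 << 7); // IN_MOVED_FROM | IN_MOVED_TO
    // O(1) check against the tail only; cookie-bearing events never coalesce.
    if ev.mask & IN_MOVED == 0 && queue.back() == Some(&ev) {
        return false;
    }
    queue.push_back(ev);
    true
}

fn main() {
    let mut q = VecDeque::new();
    let modify = Ev { wd: 1, mask: 0x2, cookie: 0, name: b"a.txt".to_vec() };
    assert!(enqueue(&mut q, modify.clone()));  // first IN_MODIFY enqueued
    assert!(!enqueue(&mut q, modify.clone())); // identical tail: coalesced
    let other = Ev { wd: 2, ..modify.clone() };
    assert!(enqueue(&mut q, other));           // different wd: enqueued
    assert!(enqueue(&mut q, modify));          // non-consecutive duplicate: enqueued
    assert_eq!(q.len(), 3);
}
```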
IN_MOVED_FROM / IN_MOVED_TO cookie pairing: Cookie values are assigned by
a per-VFS-instance AtomicU32 cookie_counter. Consecutive rename operations
get consecutive cookie values. Coalescing does NOT apply to cookie-bearing
events (mask has IN_MOVED_FROM or IN_MOVED_TO set) — rename pairs must always
be delivered in full.
IN_Q_OVERFLOW: When the fixed-capacity RingBuffer is full and a new event
cannot be enqueued (even after attempting coalescing), the InotifyInstance.overflow
AtomicBool is set to true. On the next read(2), the read path checks this flag
first: if set, it clears the flag and prepends a synthetic IN_Q_OVERFLOW event
(wd=-1, mask=IN_Q_OVERFLOW, cookie=0, name="") before draining normal events.
This keeps the overflow sentinel out of the ring buffer itself, preserving all 256
slots for real events. The queue is never silently dropped without this sentinel.
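The read-path behavior, test-and-clear the flag, prepend the sentinel, then drain, can be sketched with a `VecDeque` in place of the ring buffer. `drain` and the simplified `Ev` are illustrative names:

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

struct Ev { wd: i32, mask: u32 }
const IN_Q_OVERFLOW: u32 = 0x4000;

/// Drain the queue for read(2), prepending IN_Q_OVERFLOW if the flag was set.
fn drain(queue: &mut VecDeque<Ev>, overflow: &AtomicBool) -> Vec<Ev> {
    let mut out = Vec::new();
    // swap(false) atomically tests and clears the flag.
    if overflow.swap(false, Ordering::Acquire) {
        out.push(Ev { wd: -1, mask: IN_Q_OVERFLOW }); // sentinel, outside the ring
    }
    out.extend(queue.drain(..));
    out
}

fn main() {
    let mut q = VecDeque::from(vec![Ev { wd: 1, mask: 0x2 }]);
    let overflow = AtomicBool::new(true);
    let events = drain(&mut q, &overflow);
    assert_eq!(events[0].mask, IN_Q_OVERFLOW); // sentinel delivered first
    assert_eq!(events.len(), 2);
    assert!(!overflow.load(Ordering::Relaxed)); // flag cleared for next read
    assert!(drain(&mut q, &overflow).is_empty());
}
```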
Performance: Under cargo build workloads (10k+ file writes), inotify
watchers on the build directory receive IN_MODIFY storms. Coalescing reduces
queue pressure by 10-100x for write-heavy workloads where the application
re-reads the file on any change (editor reload, build system).
Linux compatibility: Linux inotify performs the same tail-coalescing. UmkaOS mandates it (Linux specifies it informally). The IN_Q_OVERFLOW sentinel behaviour is identical to Linux.
13.9.2 fanotify
fanotify extends inotify with:
- Filesystem-wide and mount-wide marks (not just per-inode): a single mark can cover an entire mount point or filesystem, eliminating the need to add per-inode watches for directories being monitored for new file creation.
- Permission events (`FAN_OPEN_PERM`, `FAN_ACCESS_PERM`, `FAN_OPEN_EXEC_PERM`): the originating syscall blocks until the fanotify daemon responds with allow or deny, subject to a mandatory per-group timeout (default 5000 ms) to prevent system-wide I/O stalls.
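The blocking permission wait with its mandatory timeout can be sketched using std primitives in place of the kernel wait queue. `PermRequest`, `Decision`, and `await_decision` are illustrative; `PermTimeoutAction` mirrors the enum defined later in this section:

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

#[derive(PartialEq, Debug)]
enum Decision { Allow, Deny }
enum PermTimeoutAction { Deny, Allow }

struct PermRequest {
    /// Set by the daemon's write(2) response; None while pending.
    response: Mutex<Option<Decision>>,
    waker: Condvar,
}

/// Block the originating syscall until the daemon answers or the timeout fires.
fn await_decision(req: &PermRequest, timeout: Duration,
                  on_timeout: PermTimeoutAction) -> Decision {
    let guard = req.response.lock().unwrap();
    let (mut guard, _) = req
        .waker
        .wait_timeout_while(guard, timeout, |r| r.is_none())
        .unwrap();
    match guard.take() {
        Some(d) => d, // daemon answered in time
        None => match on_timeout {
            PermTimeoutAction::Deny => Decision::Deny,   // safe default (EPERM)
            PermTimeoutAction::Allow => Decision::Allow, // permissive monitoring mode
        },
    }
}

fn main() {
    let req = PermRequest { response: Mutex::new(None), waker: Condvar::new() };
    // No daemon response within the window: the safe default denies.
    let d = await_decision(&req, Duration::from_millis(10), PermTimeoutAction::Deny);
    assert_eq!(d, Decision::Deny);
}
```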
13.9.2.1 Data Structures
/// Per-fanotify-instance state. Created by fanotify_init().
pub struct FanotifyInstance {
/// Mark table: key is (mark_type, object_id) where mark_type is inode/mount/sb.
pub marks: RwLock<BTreeMap<FanotifyMarkKey, Arc<FanotifyMark>>>,
/// Informational event queue (non-permission events).
pub event_queue: SpinLock<VecDeque<FanotifyEvent>>,
/// Pending permission requests: keyed by a unique request ID assigned at creation.
/// Entries are removed when the daemon writes a response.
pub perm_queue: SpinLock<BTreeMap<u64, Arc<FanotifyPermRequest>>>,
/// Next permission request ID (monotonically increasing).
pub next_perm_id: AtomicU64,
/// Wait queue for poll()/select()/epoll() on this instance.
pub wait_queue: WaitQueueHead,
/// Notification class: determines permission event delivery order when multiple
/// fanotify instances watch the same inode.
/// FAN_CLASS_NOTIF=0: informational only.
/// FAN_CLASS_CONTENT=1: content scanners (see file after open).
/// FAN_CLASS_PRE_CONTENT=2: DLP / integrity monitors (see file before open).
/// Higher class is notified first. Within the same class, order is unspecified.
pub class: FanotifyClass,
/// Flags from fanotify_init() (FAN_CLOEXEC, FAN_NONBLOCK, FAN_REPORT_FID, etc.).
pub flags: u32,
/// Maximum time to wait for a permission event response.
/// Default: 5000ms. Configurable per group at fanotify_init() time via
/// FANOTIFY_INIT_PERM_TIMEOUT_MS (UmkaOS extension, not in Linux).
/// A value of 0 means: use the system default from
/// /proc/sys/fs/fanotify/perm_timeout_ms.
pub perm_timeout: Duration,
/// Action taken when a permission request times out:
/// - PermTimeoutAction::Deny: return EPERM to the originating syscall (safe default)
/// - PermTimeoutAction::Allow: allow the operation (permissive mode for monitoring-only daemons)
pub perm_timeout_action: PermTimeoutAction,
}
pub enum PermTimeoutAction {
Deny, // Return EPERM to originating syscall on timeout (default)
Allow, // Allow the operation on timeout (for monitoring daemons that tolerate loss)
}
/// A single fanotify mark: attaches event interest to an inode, mount, or superblock.
pub struct FanotifyMark {
pub mark_type: FanotifyMarkType, // FAN_MARK_INODE, FAN_MARK_MOUNT, FAN_MARK_FILESYSTEM
/// Object identifier: inode_id (for inode marks), mount_id (for mount marks),
/// or superblock pointer (for filesystem marks).
pub object_id: u64,
/// Event mask this mark is listening for.
pub mask: u64,
/// Ignore mask: events matching this mask are suppressed even if mask is set.
pub ignored_mask: u64,
pub instance: Weak<FanotifyInstance>,
}
/// A pending permission request: holds the event plus the response channel.
pub struct FanotifyPermRequest {
/// The event as delivered to userspace via read(2) on the fanotify fd.
pub event: FanotifyEvent,
/// Unique request ID (matches the fd-based identification in the response).
pub request_id: u64,
/// Set to FAN_ALLOW or FAN_DENY by the daemon's write(2) response.
/// Protected by the Mutex; None while pending.
pub response: Mutex<Option<u32>>,
/// Wakes the blocked originating syscall when response becomes Some.
pub waker: WaitQueueHead,
}
/// Event delivered to userspace via read(2) on the fanotify fd.
/// Matches Linux's fanotify_event_metadata ABI.
#[repr(C)]
pub struct FanotifyEvent {
pub event_len: u32, // Total length of this event record (including variable info records)
pub vers: u8, // FANOTIFY_METADATA_VERSION (always 3)
pub reserved: u8,
pub metadata_len: u16, // sizeof(FanotifyEvent)
pub mask: u64, // Event type bitmask
pub fd: i32, // Opened fd for the file (or -1 with FAN_REPORT_FID)
pub pid: i32, // PID of the process that triggered the event
}
pub enum FanotifyMarkType { Inode, Mount, Filesystem }
pub enum FanotifyClass {
Notif = 0, // FAN_CLASS_NOTIF
Content = 1, // FAN_CLASS_CONTENT
PreContent = 2, // FAN_CLASS_PRE_CONTENT
}
13.9.2.2 Permission Event Flow
When a VFS operation triggers a permission-event mask bit (e.g., FAN_OPEN_PERM on
open(2)):
fanotify_perm_event(inode, event_type, opener_pid):
// Collect all matching fanotify instances in class order (PreContent first).
matching = collect_matching_marks(inode, event_type)
if matching is empty: return Ok(()) // fast path
for instance in matching sorted by class descending:
id = instance.next_perm_id.fetch_add(1)
event_fd = open_file_for_fanotify(inode) // opens fd for daemon to inspect
event = FanotifyEvent { mask: event_type, fd: event_fd, pid: opener_pid, ... }
req = Arc::new(FanotifyPermRequest { event, request_id: id, response: None, waker })
instance.perm_queue.lock().insert(id, req.clone())
instance.event_queue.lock().push_back(event)
instance.wait_queue.wake_up_one()
// Block on req.waker until response is set, with a mandatory timeout;
// never block indefinitely
match req.waker.wait_timeout(instance.perm_timeout, || req.response.lock().is_some()):
Ok(_):
response = req.response.lock().take().unwrap()
close(event_fd) // the daemon holds its own copy of the fd
if response == FAN_ALLOW: continue // allow and check next instance
else: return Err(EPERM)
Err(Timeout):
// Log timeout: fanotify daemon too slow
log_warn!("fanotify: perm request timed out after {:?}, action={:?}",
instance.perm_timeout, instance.perm_timeout_action)
// Increment per-group timeout counter (visible in /proc/PID/fdinfo/<fafd>)
instance.timeout_count.fetch_add(1, Ordering::Relaxed)
instance.perm_queue.lock().remove(&id) // drop the stale request
close(event_fd)
match instance.perm_timeout_action:
PermTimeoutAction::Deny → return Err(EPERM)
PermTimeoutAction::Allow → continue // allow on timeout
return Ok(()) // all instances allowed
Mandatory permission event timeout: Permission events (FAN_OPEN_PERM, FAN_ACCESS_PERM, FAN_OPEN_EXEC_PERM) have a mandatory response timeout to prevent system-wide I/O stalls.
System-wide timeout knob: /proc/sys/fs/fanotify/perm_timeout_ms (default: 5000). Can be set to 0 to disable timeout (not recommended; requires CAP_SYS_ADMIN).
Monitoring: /proc/sys/fs/fanotify/perm_timeout_count — system-wide count of permission request timeouts (monotonic counter, reset on boot). Per-group count in /proc/PID/fdinfo/<fafd> as perm_timeout_count: N.
Linux compatibility note: Linux fanotify has no timeout on permission events (daemon death causes permanent block — requires daemon restart or fanotify fd close). UmkaOS's timeout is an improvement over Linux; existing fanotify daemons work unchanged (they don't set FANOTIFY_INIT_PERM_TIMEOUT_MS, so they get the 5s default with Deny on timeout). Tools like systemd-oomd, CrowdStrike Falcon, and audit daemons that use fanotify will benefit automatically from the safety timeout.
Userspace daemon writes FAN_ALLOW / FAN_DENY:
write(fanotify_fd, &fanotify_response { fd: event_fd, response: FAN_ALLOW_or_DENY }):
// Match the response to a pending request by event_fd.
req = find_perm_request_by_fd(instance.perm_queue, event_fd)
if req is None: return Err(EINVAL) // stale or already answered
req.response.lock() = Some(FAN_ALLOW or FAN_DENY)
req.waker.wake_up_one() // unblock the blocked syscall
UmkaOS improvement over Linux fanotify: Linux matches responses to pending permission
requests by the fd number inside the fanotify_response struct, which becomes ambiguous
if the daemon closes and reopens fds in the event window. UmkaOS uses a typed
FanotifyPermRequest with a structured response channel keyed by a monotonically
increasing request_id. The Arc<FanotifyPermRequest> lifetime guarantees the blocked
syscall's stack is valid until the response arrives, eliminating the lifetime
ambiguity in the fd-matching approach.
13.9.3 UmkaOS-Native File Watch Capabilities
UmkaOS provides a capability-based file watching API as a modern alternative to
inotify. Unlike inotify (global watch descriptor namespace, process-scoped),
FileWatchCap watches are:
- Capability-scoped: unforgeable, revocable, auditable
- Memory-bounded: each watch is a capability slot (no global state)
- Automatically revoked: when the capability is dropped or the process exits
- Ring-delivered: events go to a typed UmkaOS ring buffer, not a read() queue
- Composable: multiple watches can share one ring
inotify remains fully supported for Linux compatibility. FileWatchCap is the
recommended API for new UmkaOS code.
/// A capability granting the holder the right to watch a specific inode for
/// specific events. Cannot be forged; issued by the kernel only.
/// Revocable via the standard capability revocation path (Section 8.1).
pub struct FileWatchCap {
/// The inode to watch. Kernel-internal reference — not a path (immune to rename).
inode: Arc<Inode>,
/// Events to deliver (subset of InotifyMask).
mask: InotifyMask,
/// Watch children of this directory (if inode is a directory).
watch_children: bool,
/// Watch children recursively (deep watch — UmkaOS extension, not in inotify).
watch_recursive: bool,
}
/// Subscribe to inode events via a capability.
/// Events are delivered to `ring` as typed `FileWatchEvent` structs.
///
/// Returns a `WatchHandle` — dropping the handle unregisters the watch.
pub fn inode_watch(
cap: FileWatchCap,
ring: Arc<EventRing<FileWatchEvent>>,
) -> Result<WatchHandle, WatchError>;
/// A single file watch event, delivered to the ring.
#[repr(C)]
pub struct FileWatchEvent {
pub event_type: FileWatchEventType, // enum (see below)
pub cookie: u32, // for rename pairs (FROM/TO share cookie)
pub inode_id: u64, // stable inode number
pub name: Option<ArrayString<255>>, // filename (for directory events)
pub timestamp: MonotonicInstant, // UmkaOS extension: not in inotify
}
pub enum FileWatchEventType {
Access, // File was read
Modify, // File was written
Attrib, // Metadata changed (chmod, chown, timestamps)
CloseWrite, // File opened for writing was closed
CloseNoWrite, // File opened read-only was closed
Open, // File was opened
MovedFrom, // File moved out (cookie matches MovedTo)
MovedTo, // File moved in (cookie matches MovedFrom)
Create, // File created in watched directory
Delete, // File deleted from watched directory
DeleteSelf, // Watched file itself was deleted
MoveSelf, // Watched file itself was moved
Unmount, // Filesystem containing watched file was unmounted
}
Deep watch (watch_recursive: true): watches a directory tree recursively.
UmkaOS maintains a kernel-side tree of watch registrations, automatically adding
watches for new subdirectories as they are created (IN_CREATE on a directory).
inotify has no recursive watch; tools like inotifywait -r simulate it by
enumerating subdirectories and adding a watch per directory, which has TOCTOU
races (files created in a new subdirectory before its watch is added are missed).
UmkaOS's deep watch is race-free.
Obtaining a FileWatchCap: the capability is issued via:
/// Open a FileWatchCap for a path (requires read permission on the path).
pub fn open_watch_cap(
dirfd: DirFd,
path: &Path,
mask: InotifyMask,
watch_children: bool,
watch_recursive: bool,
) -> Result<FileWatchCap, WatchError>;
Revocation: WatchHandle::drop() unregisters the watch. When the process
exits, all WatchHandles are dropped automatically — no cleanup required.
Capability revocation (Section 8.1) also revokes all file watches derived from
the revoked capability.
Linux compatibility: FileWatchCap is an UmkaOS-only API. inotify_init(),
inotify_add_watch(), inotify_rm_watch() work identically to Linux.
FileWatchCap is intended for new UmkaOS applications; existing Linux software
uses inotify unchanged.
13.9.4 Cross-References
- Section 13.1.1 (VFS Traits): inotify/fanotify hooks are inserted at the VFS
  operation dispatch layer, after InodeOps/FileOps call sites complete successfully.
- Section 16.1.2 (Namespace Implementation): fanotify marks survive CLONE_NEWNS
  and remain attached to the underlying inode/mount, not to a specific mount
  namespace. Marks set in a parent namespace remain visible in child namespaces
  for the same underlying mount.
- Section 8.1 (Security): fanotify_init(FAN_CLASS_CONTENT) and
  fanotify_init(FAN_CLASS_PRE_CONTENT) require CAP_SYS_ADMIN. Informational
  fanotify (FAN_CLASS_NOTIF) requires only CAP_FOWNER on Linux; UmkaOS follows
  the same capability requirement for compatibility.
13.10 Local File Locking (flock / fcntl POSIX Locks / OFD Locks)
UmkaOS provides three advisory file locking interfaces, each with distinct semantics:
| Interface | Granularity | Lock scope | Inherited on fork | Released on |
|---|---|---|---|---|
| flock(2) | Whole file | Per open-file-description | Yes (child shares the description) | Last close of the description |
| fcntl F_SETLK | Byte-range (POSIX) | Per process (PID) | No | Process exit OR any close of the file |
| fcntl F_OFD_SETLK | Byte-range (OFD) | Per open-file-description | Yes (child shares the description) | Last close of the description |
All three are advisory: a process can read and write a file regardless of locks held
by other processes. Locks only prevent other processes from acquiring conflicting locks.
Mandatory locking (Linux MS_MANDLOCK) is deliberately not implemented — it was
deprecated in Linux 5.15 and is incompatible with modern VFS semantics.
13.10.1 Data Structures
/// A single file lock entry. Stored in the per-inode `FileLockTree`.
pub struct FileLock {
/// Lock type: read (shared) or write (exclusive).
pub lock_type: FileLockType,
/// Byte range: [start, end] inclusive. 0..=u64::MAX represents the whole file.
/// For flock locks, start=0 and end=u64::MAX always.
pub start: u64,
pub end: u64,
/// For POSIX locks: the PID of the owning process.
/// All POSIX locks held by a process are released when it exits OR when
/// any file descriptor for the file is closed (POSIX semantics).
/// For OFD locks: None. The lock is owned by the open-file-description.
/// For flock locks: None. The lock is owned by the open-file-description.
pub owner_pid: Option<Pid>,
/// The open-file-description that created this lock.
/// Weak reference: if the description is dropped (last fd closed), the lock
/// is released. For POSIX locks, `owner_pid` is the primary ownership token
/// and `owner_fd` is advisory for conflict matching.
pub owner_fd: Weak<FileDescription>,
/// Wait queue: tasks blocked waiting for this lock to be released sleep here.
pub wait_queue: WaitQueueHead,
}
pub enum FileLockType {
/// Shared (read) lock. Multiple readers can hold simultaneously.
Read,
/// Exclusive (write) lock. No other lock may be held concurrently.
Write,
}
/// Per-inode lock state. Present only on inodes that have had locks acquired;
/// None on inodes that have never been locked (zero overhead on the fast path).
pub struct InodeLocks {
/// Augmented interval tree of active locks (POSIX, flock, and OFD locks).
/// Sorted by `l_start`; each node carries `subtree_max: u64` = maximum
/// `l_end` in its subtree. This enables O(log n) range overlap queries.
/// See Section 13.10.3 for the full algorithm specification.
pub locks: FileLockTree,
/// Protects the lock tree. Operations must be atomic with respect to each other.
pub lock: SpinLock<()>,
}
/// Augmented interval tree for file lock conflict detection.
/// Red-black tree sorted by `l_start`, augmented with `subtree_max` for
/// O(log n) range overlap queries.
pub struct FileLockTree {
/// Root of the red-black tree. None when no locks are held.
root: Option<Box<FileLockNode>>,
/// Number of locks currently in the tree.
count: usize,
}
pub struct FileLockNode {
pub lock: FileLock,
/// Maximum `l_end` value in this node's subtree (including this node).
/// Updated on every insert/delete along the path to the root.
pub subtree_max: u64,
pub left: Option<Box<FileLockNode>>,
pub right: Option<Box<FileLockNode>>,
pub color: RbColor,
}
pub enum RbColor { Red, Black }
13.10.2 Conflict Detection
Two locks conflict if:
1. At least one is a write lock (FileLockType::Write).
2. Their byte ranges overlap: !(lock_a.end < lock_b.start || lock_b.end < lock_a.start).
3. They have different owners:
- For POSIX locks: different PIDs.
- For OFD/flock locks: different Weak<FileDescription> pointers.
- A POSIX lock can upgrade/replace an existing POSIX lock from the same PID without conflict.
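The three rules above reduce to a small predicate. The sketch below uses simplified, hypothetical types: a plain integer `owner` stands in for the PID (POSIX locks) or the `FileDescription` identity (OFD/flock locks) of the real `FileLock`:

```rust
#[derive(Clone, Copy, PartialEq)]
enum LockType { Read, Write }

/// Simplified lock: `owner` is an opaque integer standing in for the PID
/// (POSIX locks) or the FileDescription pointer (OFD/flock locks).
struct Lock { lock_type: LockType, start: u64, end: u64, owner: u64 }

/// Two locks conflict iff at least one is a write lock, their inclusive
/// byte ranges overlap, and they have different owners.
fn conflicts(a: &Lock, b: &Lock) -> bool {
    let write = a.lock_type == LockType::Write || b.lock_type == LockType::Write;
    let overlap = !(a.end < b.start || b.end < a.start);
    write && overlap && a.owner != b.owner
}

fn main() {
    let r1 = Lock { lock_type: LockType::Read, start: 0, end: 99, owner: 1 };
    let r2 = Lock { lock_type: LockType::Read, start: 50, end: 150, owner: 2 };
    let w = Lock { lock_type: LockType::Write, start: 90, end: 120, owner: 3 };
    assert!(!conflicts(&r1, &r2)); // two readers never conflict
    assert!(conflicts(&r2, &w));   // writer overlapping a reader: conflict
    assert!(!conflicts(&r1, &Lock { owner: 1, ..w })); // same owner: upgrade, no conflict
}
```

The same-owner exception is what allows a process to split, upgrade, or downgrade its own byte ranges without first releasing them.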
13.10.3 Locking Algorithm
UmkaOS uses an augmented interval tree (red-black tree with subtree_max
augmentation) for O(log n) file lock conflict detection. This is the correct
data structure; there is no O(n) fallback. Linux used an O(n) linked-list scan
for decades before adding interval trees in Linux 3.13; UmkaOS starts with the
correct design.
FileLockTree structure:
- Sorted by l_start (range start)
- Each node carries subtree_max: u64 = maximum l_end in its subtree
- This augmentation enables O(log n) range overlap queries
Conflict query for range [req_start, req_end):
Walk the tree: at each node, if node.subtree_max < req_start, the entire
subtree has no overlapping locks — prune. Otherwise check the node itself
and recurse into both children. O(log n + k) where k = number of conflicts found.
Insert/delete: O(log n) standard red-black tree operations, plus O(log n)
subtree_max recomputation on the path to root.
fcntl_setlk(fd, lock_type, start, end, wait: bool) → Result:
inode = fd.inode()
ensure inode.locks is initialized
inode.locks.lock.lock()
loop:
// O(log n + k) interval tree query for conflicting locks in [start, end).
for existing in inode.locks.locks.query_conflicts(start, end, lock_type, &fd):
if !wait:
inode.locks.lock.unlock()
return Err(EAGAIN) // F_SETLK: fail immediately
// F_SETLKW: deadlock detection before sleeping
if would_deadlock(current_pid, existing.owner_pid):
inode.locks.lock.unlock()
return Err(EDEADLK)
inode.locks.lock.unlock()
existing.wait_queue.wait_event(|| !lock_conflicts_anymore(...))
inode.locks.lock.lock()
continue loop // re-check after wakeup (spurious wakeup safe)
// No conflict: coalesce adjacent/overlapping locks of the same type and owner,
// then insert the new lock. O((k+1) log n).
coalesce_and_insert(inode, fd, lock_type, start, end)
inode.locks.lock.unlock()
return Ok(())
Lock Coalescing Algorithm (Greedy Interval Merge)
Input: a set of pending lock requests sorted by (offset, len).
Output: a minimal set of merged lock requests covering the same byte ranges.
Data structure:
struct PendingLockRequest {
offset: u64,
len: u64,
op: LockOp, // Shared or Exclusive
}
Algorithm (O(n log n) for n requests):
1. Collect all pending requests into Vec<PendingLockRequest>.
2. Sort by offset (ascending), then by len (descending) as tiebreaker.
3. Sweep left to right:
- Start with current = requests[0].
- For each subsequent request r:
- If r.offset <= current.offset + current.len (overlapping or adjacent)
AND r.op == current.op (same lock type):
- current.len = max(current.offset + current.len, r.offset + r.len) - current.offset
- Otherwise: emit current, set current = r.
4. Emit final current.
Rationale: coalescing reduces the number of kernel lock table entries for byte-range
locking (POSIX fcntl(F_SETLK)), avoiding fragmentation in the per-file lock list.
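Steps 1 through 4 of the merge sweep translate directly into code. This is a self-contained sketch in which `PendingLockRequest` is abbreviated to `Req`:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum LockOp { Shared, Exclusive }

#[derive(Clone, Copy, PartialEq, Debug)]
struct Req { offset: u64, len: u64, op: LockOp }

/// Greedy interval merge: sort by (offset asc, len desc), then sweep left to
/// right, merging overlapping or adjacent requests of the same lock type.
fn coalesce(mut reqs: Vec<Req>) -> Vec<Req> {
    if reqs.is_empty() { return reqs; }
    reqs.sort_by(|a, b| a.offset.cmp(&b.offset).then(b.len.cmp(&a.len)));
    let mut out = Vec::new();
    let mut cur = reqs[0];
    for r in reqs.into_iter().skip(1) {
        if r.offset <= cur.offset + cur.len && r.op == cur.op {
            // Overlapping or adjacent, same type: extend the current run.
            cur.len = (cur.offset + cur.len).max(r.offset + r.len) - cur.offset;
        } else {
            out.push(cur);
            cur = r;
        }
    }
    out.push(cur);
    out
}

fn main() {
    let merged = coalesce(vec![
        Req { offset: 0,  len: 10, op: LockOp::Shared },    // [0, 10)
        Req { offset: 10, len: 5,  op: LockOp::Shared },    // adjacent: merges
        Req { offset: 12, len: 2,  op: LockOp::Shared },    // contained: absorbed
        Req { offset: 20, len: 5,  op: LockOp::Exclusive }, // different type: kept
    ]);
    assert_eq!(merged.len(), 2);
    assert_eq!((merged[0].offset, merged[0].len), (0, 15));
}
```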
coalesce_and_insert(new_lock) — called after conflict check passes:
- Query the interval tree for all locks owned by new_lock.pid that are adjacent
  to or overlapping new_lock's range [l_start, l_end)
  (adjacent = existing.l_end == new_lock.l_start or vice versa)
- Compute the union range: min(all.l_start) to max(all.l_end)
- Remove all found locks from the interval tree (O(k log n))
- Insert a single merged lock covering the union range (O(log n))
Complexity: O((k+1) log n) where k = number of locks merged. Coalescing reduces tree size over time for processes that acquire many adjacent byte-range locks (common in database file locking patterns).
13.10.4 Deadlock Detection
Deadlock Detection: Wait-For Graph DFS (3-Color)
Each lock holder is a node; each blocked waiter is a directed edge (waiter → holder).
Node state per thread:
- WHITE: not yet visited in current DFS
- GRAY: currently in the DFS recursion stack (potential cycle node)
- BLACK: fully explored, no cycle reachable from here
Constants:
const VFS_LOCK_MAX_DEPTH: usize = 64; // Max wait-chain depth before abort
Algorithm (invoked before blocking on a contested lock):
fn detect_deadlock(start: ThreadId, graph: &WaitForGraph) -> bool:
color = HashMap<ThreadId, Color>::new()
return dfs(start, &mut color, graph, depth=0)
fn dfs(node: ThreadId, color: &mut HashMap, graph: &WaitForGraph, depth: usize) -> bool:
if depth > VFS_LOCK_MAX_DEPTH:
return true // treat as deadlock (conservative)
color[node] = GRAY
for each holder in graph.holders_of(node):
match color.get(holder):
GRAY => return true // back-edge: cycle detected
BLACK => continue // already explored, safe
WHITE | None:
color[holder] = WHITE
if dfs(holder, color, graph, depth+1): return true
color[node] = BLACK
return false
On true return: the blocking call returns Err(LockError::Deadlock) / EDEADLK.
The caller must release all currently held locks and retry with a backoff.
The graph is constructed on-demand per lock request and is not persisted. Returning
true on depth overflow is safe: it causes the lock request to fail with EDEADLK,
which is better than silently allowing a potential deadlock. The depth limit prevents
deadlock detection from becoming a denial-of-service vector in pathological chains.
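A minimal sketch of the 3-color DFS, assuming a simple `HashMap`-based wait-for graph keyed by thread ID (the node type and graph representation here are illustrative, not the kernel's actual structures). Unvisited nodes are simply absent from the color map, which plays the role of WHITE:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Color { Gray, Black } // absence from the map = WHITE

const VFS_LOCK_MAX_DEPTH: usize = 64;

/// Wait-for graph: edges[waiter] = threads the waiter is blocked on.
type WaitForGraph = HashMap<u64, Vec<u64>>;

/// True if blocking `start` would create a cycle, or if the wait chain is
/// deeper than VFS_LOCK_MAX_DEPTH (treated conservatively as a deadlock).
fn detect_deadlock(start: u64, graph: &WaitForGraph) -> bool {
    let mut color = HashMap::new();
    dfs(start, &mut color, graph, 0)
}

fn dfs(node: u64, color: &mut HashMap<u64, Color>, graph: &WaitForGraph, depth: usize) -> bool {
    if depth > VFS_LOCK_MAX_DEPTH { return true; }
    color.insert(node, Color::Gray);
    for &holder in graph.get(&node).map(|v| v.as_slice()).unwrap_or(&[]) {
        match color.get(&holder) {
            Some(Color::Gray) => return true,  // back-edge: cycle detected
            Some(Color::Black) => continue,    // already fully explored
            None => if dfs(holder, color, graph, depth + 1) { return true; },
        }
    }
    color.insert(node, Color::Black);
    false
}

fn main() {
    // T1 waits on T2, T2 waits on T3: a chain, no deadlock.
    let chain: WaitForGraph = [(1, vec![2]), (2, vec![3])].into();
    assert!(!detect_deadlock(1, &chain));
    // T1 → T2 → T3 → T1: cycle, deadlock.
    let cycle: WaitForGraph = [(1, vec![2]), (2, vec![3]), (3, vec![1])].into();
    assert!(detect_deadlock(1, &cycle));
}
```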
13.10.5 Lock Release on File Description Close
When a FileDescription's reference count drops to zero (the last file descriptor
pointing to it is closed):
- OFD locks: all locks where owner_fd matches this description are removed.
- flock locks: the flock lock associated with this description (if any) is removed.
- POSIX locks: all locks where owner_pid == current_process.pid are removed.
  This is the POSIX-mandated behavior: closing any file descriptor for a file
  releases all POSIX locks the process holds on that file, regardless of which
  fd was used to acquire them.
After removing locks, wake all tasks in the wait_queue of each removed lock so they
can retry acquisition.
13.10.6 memfd Sealing (F_ADD_SEALS / F_GET_SEALS)
memfd_create(2) returns an anonymous file (backed by tmpfs, with no pathname).
Seals are write-once restrictions placed on the file's mutation capabilities:
/// Seal flags for memfd files. Once set, seals cannot be removed.
/// SEAL_SEAL prevents any further seals from being added.
pub struct SealFlags: u32 {
/// Prevent any further seals from being added.
const SEAL_SEAL = 0x0001;
/// Prevent the file from shrinking (ftruncate to a smaller size returns EPERM).
const SEAL_SHRINK = 0x0002;
/// Prevent the file from growing (writes past EOF, ftruncate to larger size return EPERM).
const SEAL_GROW = 0x0004;
/// Prevent all writes: write(2) returns EPERM, mmap(PROT_WRITE) returns EPERM.
const SEAL_WRITE = 0x0008;
/// Prevent future mmap(PROT_WRITE) but allow existing writable mappings to remain.
const SEAL_FUTURE_WRITE = 0x0010;
}
fcntl(fd, F_ADD_SEALS, seals): add the specified seals atomically via a
compare_exchange on the inode's AtomicU32 seal field. Fails with EPERM if
SEAL_SEAL is already set. Fails with EBUSY if SEAL_WRITE is being added while
a writable mmap exists on the file.
fcntl(fd, F_GET_SEALS): return the current seal set (atomic load, lock-free).
Seal enforcement in VFS paths:
- write(2) and pwrite64(2): check SEAL_WRITE.
- ftruncate(2) to smaller size: check SEAL_SHRINK.
- ftruncate(2) to larger size: check SEAL_GROW.
- mmap(PROT_WRITE): check SEAL_WRITE | SEAL_FUTURE_WRITE.
UmkaOS improvement: seals are stored as an AtomicU32 in the memfd's inode — seal reads
are lock-free (a single atomic load), which is important because the seal check appears
on every write(2) and mmap(2) call for sealed fds.
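The seal accumulation and per-path checks reduce to plain bitmask arithmetic, sketched below (the EBUSY check against live writable mmaps is omitted, and errors are represented as strings for brevity):

```rust
/// Seal bits, matching the SealFlags values defined above.
const SEAL_SEAL:         u32 = 0x0001;
const SEAL_SHRINK:       u32 = 0x0002;
const SEAL_GROW:         u32 = 0x0004;
const SEAL_WRITE:        u32 = 0x0008;
const SEAL_FUTURE_WRITE: u32 = 0x0010;

/// F_ADD_SEALS: seals only accumulate; adding fails once SEAL_SEAL is set.
fn add_seals(current: &mut u32, new: u32) -> Result<(), &'static str> {
    if *current & SEAL_SEAL != 0 { return Err("EPERM"); }
    *current |= new;
    Ok(())
}

/// write(2) / pwrite64(2) path: rejected iff SEAL_WRITE is set.
fn check_write(seals: u32) -> Result<(), &'static str> {
    if seals & SEAL_WRITE != 0 { Err("EPERM") } else { Ok(()) }
}

/// ftruncate(2) path: the direction of the size change selects the seal bit.
fn check_truncate(seals: u32, old_size: u64, new_size: u64) -> Result<(), &'static str> {
    if new_size < old_size && seals & SEAL_SHRINK != 0 { return Err("EPERM"); }
    if new_size > old_size && seals & SEAL_GROW != 0 { return Err("EPERM"); }
    Ok(())
}

/// mmap(PROT_WRITE) path: blocked by SEAL_WRITE or SEAL_FUTURE_WRITE.
fn check_mmap_write(seals: u32) -> Result<(), &'static str> {
    if seals & (SEAL_WRITE | SEAL_FUTURE_WRITE) != 0 { Err("EPERM") } else { Ok(()) }
}

fn main() {
    let mut seals = 0u32;
    add_seals(&mut seals, SEAL_SHRINK | SEAL_SEAL).unwrap();
    assert!(check_truncate(seals, 4096, 1024).is_err()); // shrink sealed
    assert!(check_truncate(seals, 4096, 8192).is_ok());  // grow still allowed
    assert!(check_write(seals).is_ok());                 // writes still allowed
    assert!(add_seals(&mut seals, SEAL_WRITE).is_err()); // SEAL_SEAL blocks new seals
    assert!(check_mmap_write(SEAL_FUTURE_WRITE).is_err());
}
```

In the kernel the `current` field is the inode's AtomicU32, so `add_seals` becomes a compare_exchange loop and the check functions become single atomic loads.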
13.10.7 Cross-References
- Section 14.6 (Distributed Lock Manager): the DLM provides cluster-wide advisory locks that extend the local flock/POSIX lock semantics across nodes. Local file locks (this section) are node-local only.
- Section 13.1.1 (VFS Architecture): FileOps::release() is the call site where
  OFD and flock locks are released when the last fd to a file description is closed.
- Section 13 (Containers): POSIX lock ownership is per-PID-namespace-PID. Within
  a container's PID namespace, lock ownership semantics are unchanged.
13.10.8 Lock Semantics Mode (POSIX Default / OFD Opt-in)
UmkaOS keeps POSIX semantics as the default for F_SETLK to preserve full Linux
binary compatibility. Applications and deployments that want the correct OFD
semantics as default can opt in at three levels, with the highest-priority source
winning:
Priority order (highest first):
1. Per-call explicit constant
2. Per-process prctl
3. Per-user-namespace sysctl
4. System global default: POSIX
Per-call explicit (always available, no mode setting needed)
F_OFD_SETLK // Always OFD semantics (Linux 3.15+, UmkaOS supported)
F_OFD_SETLKW // Always OFD semantics, blocking
F_SETLK_POSIX // UmkaOS extension: always POSIX semantics, explicit
F_SETLKW_POSIX // UmkaOS extension: always POSIX semantics, blocking
F_SETLK_POSIX exists so code inside an OFD-default process can still request
POSIX semantics for specific locks (e.g., a bundled library that requires
process-death lock release for crash detection).
Per-process opt-in
prctl(PR_SET_LOCK_SEMANTICS, LOCK_SEM_OFD) // F_SETLK means OFD for this process
prctl(PR_SET_LOCK_SEMANTICS, LOCK_SEM_POSIX) // Explicit POSIX (escape hatch)
prctl(PR_GET_LOCK_SEMANTICS, 0, 0, 0, 0) // Query current mode
pub const LOCK_SEM_POSIX: u64 = 0; // default
pub const LOCK_SEM_OFD: u64 = 1;
Stored in Process.lock_semantics: Option<LockSemanticsMode> (None = never set via
prctl, in which case resolution falls through to the user-namespace sysctl).
All threads in a process share the same mode via Process.lock_semantics.
Inheritance rules:
- fork(): child inherits parent's lock_semantics
- exec(): inherited (sticky) — a container runtime sets it once; all descendant
processes inherit
- exec() of setuid/setgid binary: reset to the user-namespace sysctl default
(security: a privilege-elevating binary must not blindly inherit)
Per-user-namespace sysctl
/proc/sys/fs/file_lock_default
Values: posix (default) | ofd
This sysctl is per-user-namespace, not global. Each container has its own
user namespace and therefore its own file_lock_default. The container runtime
sets it at container creation:
# Inside an UmkaOS-native container's user namespace:
echo ofd > /proc/sys/fs/file_lock_default
/// Per-user-namespace lock semantics default.
/// Stored in UserNamespace.file_lock_default.
pub enum LockSemanticsMode {
Posix = 0, // F_SETLK uses POSIX semantics (default)
Ofd = 1, // F_SETLK uses OFD semantics
}
Requires CAP_SYS_ADMIN in the target user namespace to change.
Affects new processes only — running processes keep their current mode.
Deployment model
| Scenario | Recommended config |
|---|---|
| Host with legacy software | sysctl = posix (default), no change needed |
| UmkaOS-native container | runtime sets sysctl = ofd in container's user namespace |
| Mixed container (some legacy binaries) | sysctl = posix, UmkaOS-native apps use prctl |
| Wine / NFS lockd / old SQLite | prctl(LOCK_SEM_POSIX) in launch wrapper |
Internal resolution
fn effective_lock_semantics(
task: &Task,
cmd: FcntlCmd,
) -> LockSemanticsMode {
match cmd {
FcntlCmd::OfdSetLk | FcntlCmd::OfdSetLkW => LockSemanticsMode::Ofd,
FcntlCmd::SetLkPosix | FcntlCmd::SetLkWPosix => LockSemanticsMode::Posix,
FcntlCmd::SetLk | FcntlCmd::SetLkW => {
            // Resolve: per-process > per-namespace sysctl > global POSIX.
            // Process.lock_semantics is Option<LockSemanticsMode>; None means
            // "never set via prctl" and falls through to the namespace default.
            match task.process.lock_semantics {
                Some(mode) => mode,
                None => task.user_namespace.file_lock_default,
            }
}
_ => LockSemanticsMode::Posix,
}
}
Linux compatibility: existing binaries calling F_SETLK on a system where
no mode is set get identical POSIX behaviour to Linux. F_OFD_SETLK was added
in Linux 3.15 and is already supported. F_SETLK_POSIX and
PR_SET_LOCK_SEMANTICS are UmkaOS extensions with no Linux equivalent.
13.11 Disk Quota Subsystem (quotactl)
Disk quotas enforce per-user, per-group, and per-project limits on filesystem space and inode usage. Required for multi-tenant storage environments and Linux compatibility.
13.11.1 Data Structures
/// Per-subject (user, group, or project) quota accounting and limits.
/// Matches the Linux `struct dqblk` layout for quotactl(2) ABI compatibility.
#[repr(C)]
pub struct DiskQuota {
/// Hard block limit (bytes). 0 = no limit. Writes that would exceed this
/// are rejected with EDQUOT immediately, regardless of grace period.
pub bhardlimit: u64,
/// Soft block limit (bytes). Exceeding this triggers a grace period timer.
/// Once the grace period expires, further writes are rejected with EDQUOT.
pub bsoftlimit: u64,
/// Current block usage (bytes). Updated on every successful write and truncate.
pub bcurrent: u64,
/// Hard inode limit. 0 = no limit. File creation that would exceed this
/// is rejected with EDQUOT.
pub ihardlimit: u64,
/// Soft inode limit. Exceeding this triggers an inode grace period.
pub isoftlimit: u64,
/// Current inode count (files + directories + symlinks owned by this subject).
pub icurrent: u64,
/// Timestamp when the soft block limit was first exceeded (0 if not exceeded).
/// Grace period expiry = btime + bgrace.
pub btime: i64,
/// Timestamp when the soft inode limit was first exceeded (0 if not exceeded).
pub itime: i64,
/// Grace period for the block soft limit, in seconds. Default: 7 days (604800).
pub bgrace: u32,
/// Grace period for the inode soft limit, in seconds. Default: 7 days (604800).
pub igrace: u32,
}
/// Quota subject type.
pub enum QuotaType {
User = 0, // USRQUOTA
Group = 1, // GRPQUOTA
Project = 2, // PRJQUOTA
}
/// Quota operations implemented by filesystems that support quotas.
/// Optional — filesystems without quota support omit this and quotactl(2) returns ENOSYS.
pub trait QuotaOps: Send + Sync {
/// Enable quota enforcement for the given type, reading limits from `quota_file`.
fn quota_on(&self, quota_type: QuotaType, quota_file: &str) -> Result<(), VfsError>;
/// Disable quota enforcement for the given type.
fn quota_off(&self, quota_type: QuotaType) -> Result<(), VfsError>;
/// Read the quota entry for subject `id` (UID, GID, or project ID).
fn get_quota(&self, quota_type: QuotaType, id: u32) -> Result<DiskQuota, VfsError>;
/// Set limits and accounting for subject `id`. Requires CAP_SYS_ADMIN.
fn set_quota(&self, quota_type: QuotaType, id: u32, quota: &DiskQuota) -> Result<(), VfsError>;
/// Read global quota state (grace periods, flags) for the given type.
fn get_info(&self, quota_type: QuotaType) -> Result<QuotaInfo, VfsError>;
/// Set global quota state (grace periods). Requires CAP_SYS_ADMIN.
fn set_info(&self, quota_type: QuotaType, info: &QuotaInfo) -> Result<(), VfsError>;
/// Flush in-memory quota accounting to the quota database file.
fn sync_quota(&self, quota_type: QuotaType) -> Result<(), VfsError>;
}
/// Global quota state (grace periods and enabled flags) for a single quota type.
pub struct QuotaInfo {
/// Block grace period in seconds.
pub bgrace: u32,
/// Inode grace period in seconds.
pub igrace: u32,
/// Quota flags (QIF_FLAGS: quota enabled, quota accounting-only, etc.).
pub flags: u32,
}
13.11.2 quotactl(2) Dispatch
The quotactl(2) syscall encodes both the quota command and the quota type in a
single 32-bit cmd argument: the high 16 bits are the command
(Q_QUOTAON, Q_QUOTAOFF, Q_GETQUOTA, Q_SETQUOTA, Q_GETINFO, Q_SETINFO,
Q_SYNC) and the low 16 bits are the quota type (USRQUOTA=0, GRPQUOTA=1,
PRJQUOTA=2).
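The encoding described above can be sketched as a pair of helpers. Note that the subcommand value below is a placeholder for illustration, not the real UAPI constant, and the 16-bit split follows this document's encoding:

```rust
/// quotactl(2) command encoding per this spec: high 16 bits = subcommand,
/// low 16 bits = quota type.
fn qcmd(cmd: u32, qtype: u32) -> u32 { (cmd << 16) | (qtype & 0xffff) }
fn qcmd_split(encoded: u32) -> (u32, u32) { (encoded >> 16, encoded & 0xffff) }

// Hypothetical subcommand value for this sketch; real values live in the
// quotactl UAPI header. Quota types match the QuotaType enum above.
const Q_GETQUOTA: u32 = 0x0007;
const USRQUOTA: u32 = 0;
const PRJQUOTA: u32 = 2;

fn main() {
    let encoded = qcmd(Q_GETQUOTA, PRJQUOTA);
    assert_eq!(qcmd_split(encoded), (Q_GETQUOTA, PRJQUOTA));
    assert_eq!(qcmd(Q_GETQUOTA, USRQUOTA) & 0xffff, USRQUOTA);
}
```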
quotactl(cmd, dev, id, addr):
qt_cmd = cmd >> 16
qt_type = QuotaType::from(cmd & 0xffff) // USRQUOTA/GRPQUOTA/PRJQUOTA
sb = resolve_superblock_from_device_path(dev)
if sb.quota_ops is None: return Err(ENOSYS)
// Capability check for mutating operations
if qt_cmd in [Q_QUOTAON, Q_QUOTAOFF, Q_SETQUOTA, Q_SETINFO]:
check_capability(CAP_SYS_ADMIN)?
match qt_cmd:
Q_QUOTAON → sb.quota_ops.quota_on(qt_type, addr_as_path)
Q_QUOTAOFF → sb.quota_ops.quota_off(qt_type)
Q_GETQUOTA → quota = sb.quota_ops.get_quota(qt_type, id)?; copy_to_user(addr, quota)
Q_SETQUOTA → quota = copy_from_user(addr)?; sb.quota_ops.set_quota(qt_type, id, &quota)
Q_GETINFO → info = sb.quota_ops.get_info(qt_type)?; copy_to_user(addr, info)
Q_SETINFO → info = copy_from_user(addr)?; sb.quota_ops.set_info(qt_type, &info)
Q_SYNC → sb.quota_ops.sync_quota(qt_type)
_ → return Err(EINVAL)
13.11.3 VFS Enforcement Hooks
On every write(2), fallocate, create, mkdir, mknod, and symlink call, the
VFS checks quotas for all three subject types:
vfs_quota_check_blocks(inode, bytes_requested) → Result:
creds = current_task().creds
for qt in [QuotaType::User, QuotaType::Group, QuotaType::Project]:
id = match qt:
User → creds.fsuid
Group → creds.fsgid
Project → inode.project_id // stored in the inode's i_projid field (set via FS_IOC_FSSETXATTR)
quota = inode.sb.quota_ops.get_quota(qt, id)? // from in-memory quota cache
new_usage = quota.bcurrent + bytes_requested
if new_usage > quota.bhardlimit && quota.bhardlimit != 0:
return Err(EDQUOT) // hard limit exceeded: reject immediately
if new_usage > quota.bsoftlimit && quota.bsoftlimit != 0:
now = current_time_secs()
if quota.btime == 0:
quota.btime = now // record when the soft limit was first exceeded
update_quota_cache(qt, id, &quota) // grace expiry = btime + bgrace
elif now > quota.btime + quota.bgrace as i64:
return Err(EDQUOT) // grace period expired: reject
// else: within grace period, allow the write
return Ok(())
vfs_quota_check_inodes(inode, count) → Result:
// Identical structure to vfs_quota_check_blocks but uses icurrent/isoftlimit/ihardlimit.
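Under the DiskQuota semantics above (btime records when the soft limit was first exceeded; the grace deadline is btime + bgrace), the block check reduces to a small decision function. The types here are simplified and illustrative:

```rust
/// Simplified per-subject quota state (block fields of DiskQuota only).
struct Quota { bhard: u64, bsoft: u64, bcurrent: u64, btime: i64, bgrace: u32 }

enum QuotaDecision { Allow, StartGrace, Edquot }

/// Decide whether a write of `requested` bytes is admitted at time `now`.
fn check_blocks(q: &Quota, requested: u64, now: i64) -> QuotaDecision {
    let new_usage = q.bcurrent + requested;
    if q.bhard != 0 && new_usage > q.bhard { return QuotaDecision::Edquot; }
    if q.bsoft != 0 && new_usage > q.bsoft {
        if q.btime == 0 { return QuotaDecision::StartGrace; } // caller sets btime = now
        if now > q.btime + q.bgrace as i64 { return QuotaDecision::Edquot; }
        // else: within the grace period, allow
    }
    QuotaDecision::Allow
}

fn main() {
    let mut q = Quota { bhard: 1000, bsoft: 500, bcurrent: 400, btime: 0, bgrace: 604_800 };
    // Crossing the soft limit starts the grace period.
    assert!(matches!(check_blocks(&q, 200, 1_000_000), QuotaDecision::StartGrace));
    q.btime = 1_000_000;
    // Within grace: allowed.
    assert!(matches!(check_blocks(&q, 200, 1_000_100), QuotaDecision::Allow));
    // Grace expired: EDQUOT.
    assert!(matches!(check_blocks(&q, 200, 2_000_000), QuotaDecision::Edquot));
    // Hard limit always wins, regardless of grace state.
    assert!(matches!(check_blocks(&q, 700, 1_000_100), QuotaDecision::Edquot));
}
```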
13.11.4 In-Memory Quota Cache
Quota accounting state is kept in a per-filesystem in-memory cache to avoid hitting
the quota database file on every write. The cache structure mirrors DiskQuota with an
additional dirty: bool field. Cache entries are written back to the quota file
asynchronously via sync_quota(), which is called:
- Periodically by the writeback daemon (default interval: 30 seconds).
- On quotactl(Q_SYNC).
- On filesystem unmount.
- On sync(2)/syncfs(2) when the filesystem's quota is dirty.
Cache lookups use a per-filesystem RwLock<HashMap<(QuotaType, u32), DiskQuota>>.
The read lock is taken for quota checks (common case); write lock only for updates.
This allows concurrent quota checks across different subjects with no contention.
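A minimal sketch of this read-mostly cache, using a simplified, hypothetical entry type (the real entry mirrors DiskQuota plus the dirty flag, and the key's quota type is the QuotaType enum rather than a raw u8):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Simplified cache entry: current block usage plus the writeback dirty flag.
#[derive(Clone, Default)]
struct CachedQuota { bcurrent: u64, dirty: bool }

/// Per-filesystem quota cache: read lock for checks (the common case),
/// write lock only for accounting updates.
struct QuotaCache { map: RwLock<HashMap<(u8, u32), CachedQuota>> }

impl QuotaCache {
    /// Quota check path: concurrent readers, no contention between subjects.
    fn usage(&self, qtype: u8, id: u32) -> u64 {
        self.map.read().unwrap().get(&(qtype, id)).map(|q| q.bcurrent).unwrap_or(0)
    }
    /// Accounting update path: takes the write lock and marks the entry dirty
    /// so the next sync_quota() writeback persists it.
    fn charge(&self, qtype: u8, id: u32, bytes: u64) {
        let mut map = self.map.write().unwrap();
        let entry = map.entry((qtype, id)).or_default();
        entry.bcurrent += bytes;
        entry.dirty = true;
    }
}

fn main() {
    let cache = QuotaCache { map: RwLock::new(HashMap::new()) };
    cache.charge(0, 1000, 4096); // user quota (type 0), uid 1000
    cache.charge(0, 1000, 4096);
    assert_eq!(cache.usage(0, 1000), 8192);
    assert_eq!(cache.usage(1, 1000), 0); // group quota untouched
}
```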
13.11.5 Linux Compatibility
- quotactl(2) with all seven commands (Q_QUOTAON, Q_QUOTAOFF, Q_GETQUOTA,
  Q_SETQUOTA, Q_GETINFO, Q_SETINFO, Q_SYNC) is fully implemented.
- The dqblk structure layout matches the Linux UAPI definition exactly.
- quota tools (quota, quotacheck, repquota, edquota) work without modification.
- ext4, XFS, and tmpfs quota implementations are in scope for the initial release.
- Project quotas (PRJQUOTA) are supported; project IDs are stored in the inode's
  i_projid field (set via FS_IOC_FSSETXATTR).
13.11.6 Cross-References
- Section 13.1.1 (VFS Architecture): quota checks are inserted into the VFS
  dispatch layer at write, create, mkdir, mknod, and fallocate call sites.
- Section 13 (Containers): cgroup v2 io.max and memory.max provide resource
  controls complementary to quota; quota enforces per-UID/GID storage limits
  while cgroups enforce per-container I/O and memory limits.
- Section 11 (Storage): ext4, XFS, and btrfs filesystem drivers implement
  QuotaOps as part of their SuperBlock initialization.
13.12 Pipes and FIFOs
Pipes (pipe(2), pipe2(2)) and named FIFOs (mkfifo(2)) are anonymous
unidirectional byte streams. They are the oldest and most widely used IPC
primitive in UNIX.
13.12.1 Design: Fixed SPSC Ring Buffer
UmkaOS implements pipe data buffering as a fixed-size lock-free SPSC ring rather than Linux's dynamically-allocated page list. Linux pipes allocate and free individual 4KB pages as data fills and drains, requiring a pipe spinlock on every read and write. UmkaOS's ring buffer is:
- Allocated once at pipe creation (default 65536 bytes = 16 pages, matching the Linux default)
- Lock-free for the common SPSC case (one writer, one reader — the overwhelmingly dominant use: cmd | cmd)
- Cache-friendly: contiguous memory, no pointer chasing between pages
- Zero dynamic allocation in the data path
/// Pipe data buffer — a lock-free single-producer single-consumer ring.
pub struct PipeRing {
    /// Contiguous backing buffer. Size is always a power of 2.
    buf: Box<[u8]>,
    /// Free-running writer position; the buffer index is write_pos & (buf.len() - 1).
    /// Written by the writer, read by the reader.
    write_pos: AtomicUsize,
    /// Free-running reader position; the buffer index is read_pos & (buf.len() - 1).
    /// Written by the reader, read by the writer.
    read_pos: AtomicUsize,
}
impl PipeRing {
    /// Available bytes for reading. Because the positions are free-running
    /// counters, wrapping subtraction is correct even across usize overflow.
    pub fn available(&self) -> usize {
        // Acquire on both loads: the caller may be either endpoint, and each
        // must observe the other side's most recently published position.
        let w = self.write_pos.load(Ordering::Acquire);
        let r = self.read_pos.load(Ordering::Acquire);
        w.wrapping_sub(r)
    }
    /// Free space for writing.
    pub fn free(&self) -> usize {
        self.buf.len() - self.available()
    }
}
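A user-space sketch of the read and write paths over such a ring. This is illustrative, not the kernel code: it takes &mut self for brevity (so it demonstrates the free-running-counter index math and the Acquire/Release publication, not true concurrent use, which needs interior mutability), and the byte-at-a-time copy stands in for the real two-part memcpy.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct PipeRing {
    buf: Box<[u8]>,          // capacity must be a power of two
    write_pos: AtomicUsize,  // free-running; index = pos & (len - 1)
    read_pos: AtomicUsize,
}

impl PipeRing {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Self {
            buf: vec![0u8; capacity].into_boxed_slice(),
            write_pos: AtomicUsize::new(0),
            read_pos: AtomicUsize::new(0),
        }
    }

    fn available(&self) -> usize {
        self.write_pos.load(Ordering::Acquire)
            .wrapping_sub(self.read_pos.load(Ordering::Acquire))
    }

    /// Writer side: copy as much of `src` as fits, then publish with Release.
    fn write(&mut self, src: &[u8]) -> usize {
        let w = self.write_pos.load(Ordering::Relaxed); // own position
        let n = src.len().min(self.buf.len() - self.available());
        let mask = self.buf.len() - 1;
        for i in 0..n {
            self.buf[w.wrapping_add(i) & mask] = src[i]; // wraps automatically
        }
        self.write_pos.store(w.wrapping_add(n), Ordering::Release);
        n
    }

    /// Reader side: mirror image of write().
    fn read(&mut self, dst: &mut [u8]) -> usize {
        let r = self.read_pos.load(Ordering::Relaxed); // own position
        let n = dst.len().min(self.available());
        let mask = self.buf.len() - 1;
        for i in 0..n {
            dst[i] = self.buf[r.wrapping_add(i) & mask];
        }
        self.read_pos.store(r.wrapping_add(n), Ordering::Release);
        n
    }
}
```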
/// Per-pipe state shared between reader and writer endpoints.
pub struct Pipe {
    ring: PipeRing,
    /// Writer end open (false = EOF on read when ring drained).
    write_open: AtomicBool,
    /// Reader end open (false = SIGPIPE/EPIPE on write).
    read_open: AtomicBool,
    /// Number of open writer descriptors (>1 disables the SPSC fast path; see 13.12.3).
    writer_count: AtomicU32,
    /// Serializes writers when writer_count > 1 (see 13.12.3).
    write_lock: Mutex<()>,
    /// Tasks sleeping waiting for data (reader blocks).
    read_waiters: WaitQueue,
    /// Tasks sleeping waiting for space (writer blocks).
    write_waiters: WaitQueue,
}
13.12.2 Capacity and fcntl(F_SETPIPE_SZ)
Default pipe capacity: 65536 bytes (matches Linux default).
fcntl(F_SETPIPE_SZ, size) resizes the ring:
- Rounds up to the next power of 2 (minimum 4096 bytes)
- Maximum: /proc/sys/fs/pipe-max-size (default 1MB, same as Linux)
- Requires CAP_SYS_RESOURCE to exceed /proc/sys/fs/pipe-max-size
- Data currently in the pipe is preserved (ring resized via realloc + data copy)
- If the new size is smaller than current content: EBUSY
fcntl(F_GETPIPE_SZ) returns the current capacity.
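The resize rules above can be checked with a small sketch. The function name, the tuple return, and the numeric errno constants are illustrative; the rounding, permission, and EBUSY rules follow the text.

```rust
const EPERM: i32 = 1;
const EBUSY: i32 = 16;
const MIN_PIPE_SIZE: usize = 4096;

/// Sketch of F_SETPIPE_SZ: round up to a power of two (min 4096), enforce
/// pipe-max-size unless CAP_SYS_RESOURCE, refuse to shrink below current
/// content, and preserve buffered data in FIFO order.
/// Returns (new capacity, new backing buffer with the old data at the front).
fn resize_ring(
    data: &[u8],            // bytes currently buffered in the pipe
    requested: usize,       // size argument to fcntl(F_SETPIPE_SZ)
    max: usize,             // /proc/sys/fs/pipe-max-size
    cap_sys_resource: bool, // caller holds CAP_SYS_RESOURCE
) -> Result<(usize, Vec<u8>), i32> {
    let new_cap = requested.max(MIN_PIPE_SIZE).next_power_of_two();
    if new_cap > max && !cap_sys_resource {
        return Err(EPERM); // exceeding pipe-max-size needs CAP_SYS_RESOURCE
    }
    if data.len() > new_cap {
        return Err(EBUSY); // buffered content would not fit the smaller ring
    }
    let mut new_buf = vec![0u8; new_cap];
    new_buf[..data.len()].copy_from_slice(data); // realloc + data copy
    Ok((new_cap, new_buf))
}
```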
13.12.3 MPSC Pipes (Multiple Writers)
When more than one process/thread writes to the same pipe (e.g., the shell construct { cmd1; cmd2; } | cmd3), the SPSC assumption breaks. UmkaOS detects multiple writers via Pipe.writer_count: AtomicU32:
- writer_count == 1: lock-free SPSC path
- writer_count > 1: each writer acquires Pipe.write_lock: Mutex<()> before writing
Writes <= PIPE_BUF (4096 bytes) are always atomic (no interleaving with other writers) — same guarantee as POSIX.
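The fast-path selection can be sketched as follows. This is a simplified illustration with hypothetical names: a real implementation must also handle writer_count changing concurrently with the check (e.g., by re-validating the count under the lock or only transitioning between modes at open/close time).

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

/// Writer-side state from the Pipe struct, reduced to the mode-selection fields.
struct PipeWriteSide {
    writer_count: AtomicU32, // open writer descriptors
    write_lock: Mutex<()>,   // taken only in the MPSC case
}

impl PipeWriteSide {
    /// Run the actual ring copy either lock-free (single writer) or
    /// serialized under write_lock (multiple writers, POSIX atomicity
    /// for writes <= PIPE_BUF).
    fn write_serialized<F: FnOnce()>(&self, do_write: F) {
        if self.writer_count.load(Ordering::Acquire) > 1 {
            let _guard = self.write_lock.lock().unwrap(); // MPSC: serialize
            do_write();
        } else {
            do_write(); // SPSC fast path: no lock
        }
    }
}
```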
13.12.4 O_DIRECT Pipe Mode
pipe2(O_DIRECT): each write() is a discrete message; read() returns exactly one message. Implemented by prepending a 4-byte length header in the ring:
/// O_DIRECT pipe message header (4 bytes, little-endian).
/// Followed immediately by `len` bytes of payload.
/// Alignment: none required (ring is byte-addressable).
#[repr(C, packed)]
pub struct PipeMessageHdr {
pub len: u32,
}
Maximum message size: PIPE_BUF (4096 bytes) for atomic writes.
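The framing can be demonstrated with a flat Vec<u8> standing in for the ring (a sketch; the function names are illustrative, and the real ring handles wraparound and partial writer progress):

```rust
const PIPE_BUF: usize = 4096; // maximum O_DIRECT message size

/// Append one message: 4-byte little-endian length header, then the payload.
fn push_message(ring: &mut Vec<u8>, payload: &[u8]) -> Result<(), &'static str> {
    if payload.len() > PIPE_BUF {
        return Err("message exceeds PIPE_BUF");
    }
    ring.extend_from_slice(&(payload.len() as u32).to_le_bytes()); // header
    ring.extend_from_slice(payload);
    Ok(())
}

/// Consume exactly one message, as a read() on an O_DIRECT pipe does.
fn pop_message(ring: &mut Vec<u8>) -> Option<Vec<u8>> {
    if ring.len() < 4 {
        return None; // no complete header yet
    }
    let len = u32::from_le_bytes([ring[0], ring[1], ring[2], ring[3]]) as usize;
    if ring.len() < 4 + len {
        return None; // header present but payload not fully written
    }
    let payload = ring[4..4 + len].to_vec();
    ring.drain(..4 + len); // advance past exactly one message
    Some(payload)
}
```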
13.12.5 Named FIFOs (mkfifo)
Named FIFOs use the same Pipe struct, but with a VFS inode for pathname lookup:
- mkfifo(path, mode): creates a VFS inode of type InodeKind::Fifo
- open(path, O_RDONLY): blocks until a writer opens (unless O_NONBLOCK)
- open(path, O_WRONLY): blocks until a reader opens (unless O_NONBLOCK)
- Once both ends are open: identical semantics to an anonymous pipe
13.12.6 Splice and Zero-Copy
splice(2) between two pipes (or pipe + socket) operates by transferring ring
buffer segments rather than copying data. UmkaOS implements splice as:
- Identify contiguous segment in source ring
- Map that segment into destination ring (pointer-level transfer for pipe-to-pipe)
- Advance the source read_pos and the destination write_pos
For pipe-to-socket: uses sendmsg with the ring segment as an iov, letting the
network stack DMA directly from the pipe buffer (zero kernel-side copy).
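Step 1 above hinges on finding the largest contiguous readable slice in the source ring. With free-running positions and a power-of-two capacity, the readable region may wrap, so the contiguous piece ends at the physical end of the buffer. A sketch (hypothetical helper, operating on raw positions):

```rust
/// Return (buffer offset, length) of the largest contiguous readable slice.
/// read_pos/write_pos are free-running counters; cap is a power of two.
fn contiguous_readable(read_pos: usize, write_pos: usize, cap: usize) -> (usize, usize) {
    debug_assert!(cap.is_power_of_two());
    let avail = write_pos.wrapping_sub(read_pos); // total readable bytes
    let start = read_pos & (cap - 1);             // physical index in buf
    let to_end = cap - start;                     // bytes until buffer end
    (start, avail.min(to_end))                    // wrapped tail needs a 2nd segment
}
```

splice would transfer this segment, advance read_pos by its length, and repeat once if the data wraps.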
13.12.7 Linux Compatibility
- Default capacity 65536 bytes: identical to Linux
- F_SETPIPE_SZ / F_GETPIPE_SZ: identical semantics
- PIPE_BUF = 4096 bytes: matches Linux (POSIX requires at least 512)
- O_DIRECT pipe mode: identical to Linux 3.4+
- pipe2(O_CLOEXEC | O_NONBLOCK | O_DIRECT): all flags supported
- Splice semantics: identical to Linux
- /proc/sys/fs/pipe-max-size: identical default (1MB), same permission model
- Signal on broken pipe: SIGPIPE + EPIPE on write to a pipe with no readers
- select()/poll()/epoll(): EPOLLIN when data is available, EPOLLOUT when space is available, EPOLLHUP on last writer close